Test BM25 page ranking

Overview

• PyPi module: N/A
• git repository: https://bitbucket.org/arrizza-public/test-bm25
• git command: git clone git@bitbucket.org:arrizza-public/test-bm25.git
• Verification Report: https://arrizza.com/web-ver/test-bm25-report.html
• Version Info: Ubuntu 22.04 jammy, Python 3.10

Summary

This project tests the use of the BM25 ranking function. The intent is to eventually use it to build a search function: the user enters a query and BM25 ranks the pages of a website (or another set of documents) in order of relevance.

It currently uses the Python rank_bm25 module. That module offers several variants; this test currently uses BM25L. I have briefly tried other variants, for example BM25Plus, but that one returns a (low) rank value even if the word is not in the page. I believe that would be confusing to a user, since listing any page implies that it contains information relevant to their search query.
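
As a rough illustration (not the project's actual code), scoring pages with rank_bm25's BM25L looks roughly like this; the page texts and query below are made up:

from rank_bm25 import BM25L

# hypothetical page texts; the real project builds these from website pages
pages = [
    "arduino lcd shield with touchscreen panel",
    "arduino tachometer reads rpm from a sensor",
    "basic adafruit sensor interface",
]
tokenized_pages = [page.lower().split(" ") for page in pages]

bm25 = BM25L(tokenized_pages)

query = "rpm".split(" ")
scores = bm25.get_scores(query)   # one score per page; 0.0 if the word does not appear
print(scores)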

Typical output:

./doit
==== doit: starting...
<snip>
 15 'shield',
 15 'lcd',
  9 'tachometer',
  6 'adafruit',
  6 'sensor',
  6 'rpm',
  6 'touchscreen',
  6 'panel',
  5 'basic',
  4 'interface',
enter query: 

The list of words shows the most common words remaining after even more common words are excluded. You can change this in the _save_corpus() function:

min_count = 0 
max_count = 10   # <- change this to 0 to not see any words.
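
For context, here is a minimal sketch of how a reporting loop in _save_corpus() might use those two values; the corpus_counts dict and the exact behavior are assumptions, not the actual implementation:

# corpus_counts: word -> number of occurrences (assumed shape)
shown = 0
for word, count in sorted(corpus_counts.items(), key=lambda kv: kv[1], reverse=True):
    if count <= min_count:
        continue                  # skip words that are too rare to report
    if shown >= max_count:
        break                     # max_count = 0 therefore shows no words at all
    print(f'{count:3} {word!r},')
    shown += 1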

The full list of "trimmed" words is in out/corpus_trimmed.json. These are the words remaining after the more common words are excluded; see _save_corpus():

if word in ['a', 'i', 'n', 'x', 'c', 'to', 'is', 'of', 'in', 'it', 'on', 'or', 'be', 'if', 'an', 'at',
            'as', 'up', 'by', 'so', 'we', 'do', 'no', 'ok', 'ms', 'my', 'pc', 'go', 'vm', 'ai', 'us',
            ]:
    continue
<snip>

Note: I use JetBrains PyCharm, which has a spell checker that works even in JSON files. I used it to check for spelling errors across all of my website pages. There were over 77,000 entries in out/corpus.json, but by trimming the more common ones I got that down to 25,000. Going through these, I found and fixed spelling errors and tweaked the code that creates the corpus in _load_corpus() to exclude and adjust text found in the raw pages:

for word in line.lower().split(" "):
    word = word.replace('\\n', '')
    word = word.replace('</strong>', '')
    word = word.replace('</td>', '')
    <snip>
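
A self-contained sketch of that kind of cleanup, assuming the raw page text is processed line by line (the function name and the exact replacements are illustrative only):

import collections

def build_corpus(lines):
    # count cleaned-up words across the lines of a page
    counts = collections.Counter()
    for line in lines:
        for word in line.lower().split(" "):
            # strip markup and escape sequences left over from the raw pages
            for junk in ('\\n', '</strong>', '</td>'):
                word = word.replace(junk, '')
            word = word.strip()
            if word:
                counts[word] += 1
    return counts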

You can also use out/corpus_counts.json to help with that. It lists all the words in order of how frequently they appear across all pages. For example, it contains around 7,800 items for my website.
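
A minimal sketch of writing such a frequency-ordered file, assuming the counts are held in a dict (the sample values are taken from the output shown earlier):

import json

# assumed shape: word -> frequency across all pages
counts = {'shield': 15, 'lcd': 15, 'tachometer': 9}
sorted_counts = dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))
with open('out/corpus_counts.json', 'w', encoding='utf-8') as fp:
    json.dump(sorted_counts, fp, indent=2)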

Note: since I tend to show code and output in my README files, those also show up in the corpus. For example, "1000.013" is one of the items. I have left these in, since a user may well want to search for a specific value, and with roughly 77,000 entries in the corpus the impact on search time is negligible.

How to use

Start the test app with ./doit and then enter whichever query you want to try. I use corpus_counts.json as a hint for which words are rare and which are common.

Enter "quit" or ctrl-C to end the app.

enter query: rpm
[0.         2.25970813 0.         0.         2.01871901 1.83296645]
Found 3 pages for: rpm
  1]      2.260  https://your_url/page5     <== shows at most 10 pages, in this case only 3 pages were found
  2]      2.019  https://your_url/page4
  3]      1.833  https://your_url/page3
---
enter query: panel
[0.         0.         2.4512066  0.         2.01871901 1.83296645]
Found 3 pages for: panel
  1]      2.451  https://your_url/page6    <== the top most line has the "best" page for the given query 
  2]      2.019  https://your_url/page4
  3]      1.833  https://your_url/page3
---
enter query: arrizza 
[0. 0. 0. 0. 0. 0.]
Found 0 pages for: arrizza                 <== no hits for this query
---
enter query: arduino
[0.59817425 0.57318761 0.4311185  0.40424854 1.04005337 1.67881556]
Found 6 pages for: arduino
  1]      1.679  https://your_url/page3    <== this phrase is used in all pages
  2]      1.040  https://your_url/page4
  3]      0.598  https://your_url/page2
  4]      0.573  https://your_url/page5
  5]      0.431  https://your_url/page6
  6]      0.404  https://your_url/page1
---
enter query: quit                         <== use "quit" to exit the app

The raw BM25 scores for the query are shown for all pages. In this case there are only 6 pages:

[0.59817425 0.57318761 0.4311185  0.40424854 1.04005337 1.67881556]

The summary shows up to top_n values; see the run() function to change that:

top_n = 10
top_scores, num_scores = self._get_top(page_scores, top_n)
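
A rough idea of what a _get_top() helper like that might do; this is a guess based on the behavior seen in the output above (report how many pages matched, but list at most top_n of them), not the actual implementation:

def _get_top(self, page_scores, top_n):
    # keep only pages that actually matched (non-zero score), highest score first
    scored = sorted(((score, idx) for idx, score in enumerate(page_scores) if score > 0),
                    reverse=True)
    num_scores = len(scored)          # total pages found
    top_scores = scored[:top_n]       # but only list at most top_n of them
    return top_scores, num_scores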

Note: multiple words can be used for a query:

enter query: rpm panel
[0.         2.25970813 2.4512066  0.         4.03743801 3.6659329 ]
Found 4 pages for: rpm panel
  1]      4.037  https://your_url/page4    
  2]      3.666  https://your_url/page3
  3]      2.451  https://your_url/page6
  4]      2.260  https://your_url/page5

For "rpm" page5 is the best hit (see above), for "panel" page6 is the best hit. For the combination page4 is the best hit. You can check this out by opening pages/page4.md and search for these words. That page has both and multiple instances of them.

- John Arrizza