Overview
| PyPi module | N/A |
| git repository | https://bitbucket.org/arrizza-public/test-bm25 |
| git command | git clone git@bitbucket.org:arrizza-public/test-bm25.git |
| Verification Report | https://arrizza.com/web-ver/test-bm25-report.html |
| Version Info | |
- installation: see https://arrizza.com/setup-common
Summary
This project tests the use of the BM25 ranking function. The intent is to eventually build a search function on top of it: the user enters a query, and BM25 ranks the pages of a website (or any other set of documents) from most to least relevant.
It currently uses the Python rank_bm25 module. That module provides several variants; this test currently uses BM25L. I have briefly tried other variants, for example BM25Plus, but that one returns a (low) rank value even when the word is not in the page at all. I believe that would be confusing to a user, since listing a page implies it contains information relevant to their search query.
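For reference, here is a minimal sketch of how rank_bm25 is typically driven. The page texts and URLs below are placeholders for illustration, not the project's actual data:

from rank_bm25 import BM25L

# placeholder pages; the real project builds these from website content
pages = {
    'https://your_url/page1': 'arduino basic interface',
    'https://your_url/page2': 'arduino shield lcd',
    'https://your_url/page3': 'arduino rpm panel tachometer',
}
urls = list(pages.keys())
tokenized_corpus = [text.lower().split() for text in pages.values()]

bm25 = BM25L(tokenized_corpus)           # swap in BM25Okapi or BM25Plus to compare
scores = bm25.get_scores('rpm'.split())  # numpy array, one score per page
for url, score in sorted(zip(urls, scores), key=lambda t: -t[1]):
    print(f'{score:6.3f} {url}')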
Typical output:
./doit
==== doit: starting...
<snip>
15 'shield',
15 'lcd',
9 'tachometer',
6 'adafruit',
6 'sensor',
6 'rpm',
6 'touchscreen',
6 'panel',
5 'basic',
4 'interface',
enter query:
The list of words shows the most common words that remain after the even more common ones are excluded. You can change this in the _save_corpus() function:
min_count = 0
max_count = 10 # <- set this to 0 to suppress the word list.
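As a rough sketch of what that listing amounts to (the function name save_corpus_counts and its internals are assumptions for illustration, not the project's actual _save_corpus() code):

from collections import Counter

def save_corpus_counts(tokenized_corpus, min_count=0, max_count=10):
    # count every word across all pages, then print only the top
    # max_count of them; max_count=0 suppresses the listing entirely
    counts = Counter(word for page in tokenized_corpus for word in page)
    for word, count in counts.most_common(max_count):
        if count >= min_count:
            print(f"{count:3} {word!r}")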
The full list of "trimmed" words is in out/corpus_trimmed.json. These are the words that remain after the more common ones are excluded; see _save_corpus():
if word in ['a', 'i', 'n', 'x', 'c', 'to', 'is', 'of', 'in', 'it', 'on', 'or', 'be', 'if', 'an', 'at',
'as', 'up', 'by', 'so', 'we', 'do', 'no', 'ok', 'ms', 'my', 'pc', 'go', 'vm', 'ai', 'us',
]:
continue
<snip>
Note: I use JetBrains PyCharm, which has a spell checker that works even in JSON files. I used that facility to check for spelling errors in all of my website pages. There were over 77,000 entries in out/corpus.json, but by trimming the more common ones I got that down to 25,000. Going through those, I found and fixed spelling errors and tweaked the code that creates the corpus in _load_corpus() to exclude and adjust the text found in the raw pages:
for word in line.lower().split(" "):
word = word.replace('\\n', '')
word = word.replace('</strong>', '')
word = word.replace('</td>', '')
<snip>
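A regex-based variant of that cleanup could collapse the chain of replace() calls into a couple of substitutions. This is an alternative sketch, not the project's code, and the patterns only cover the cases shown above:

import re

TAG_RE = re.compile(r'</?[a-z][^>]*>')  # html tags such as </strong>, </td>
ESC_RE = re.compile(r'\\n|\\t')         # literal \n / \t escape sequences

def clean_words(line):
    # lower-case, strip tags and escapes, then split on whitespace
    line = ESC_RE.sub(' ', TAG_RE.sub(' ', line.lower()))
    return line.split()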
You can also use out/corpus_counts.json to help with that cleanup. It lists all the words in order of their frequency across all pages. For example, there are around 7800 items found on my website.
Note: since I tend to show code and output in my README files, those also show up in the corpus. For example, "1000.013" is one of the items. I have left these in, since a user may well want to search for a specific value. With only 77,000 entries in the corpus, the impact on search time is negligible.
How to use
Start the test app with ./doit
and then enter whichever query you want to try.
I use corpus_counts.json as a hint for which words are rare and which are common.
Enter "quit" or press Ctrl-C to end the app.
enter query: rpm
[0. 2.25970813 0. 0. 2.01871901 1.83296645]
Found 3 pages for: rpm
1] 2.260 https://your_url/page5 <== shows at most 10 pages, in this case only 3 pages were found
2] 2.019 https://your_url/page4
3] 1.833 https://your_url/page3
---
enter query: panel
[0. 0. 2.4512066 0. 2.01871901 1.83296645]
Found 3 pages for: panel
1] 2.451 https://your_url/page6 <== the top most line has the "best" page for the given query
2] 2.019 https://your_url/page4
3] 1.833 https://your_url/page3
---
enter query: arrizza
[0. 0. 0. 0. 0. 0.]
Found 0 pages for: arrizza <== no hits for this query
---
enter query: arduino
[0.59817425 0.57318761 0.4311185 0.40424854 1.04005337 1.67881556]
Found 6 pages for: arduino
1] 1.679 https://your_url/page3 <== this phrase is used in all pages
2] 1.040 https://your_url/page4
3] 0.598 https://your_url/page2
4] 0.573 https://your_url/page5
5] 0.431 https://your_url/page6
6] 0.404 https://your_url/page1
---
enter query: quit <== use "quit" to exit the app
The BM25 rankings for the query are shown for all pages. In this case there are only 6 pages:
[0.59817425 0.57318761 0.4311185 0.40424854 1.04005337 1.67881556]
The summary shows up to top_n values; see the run() function to change that:
top_n = 10
top_scores, num_scores = self._get_top(page_scores, top_n)
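One plausible shape for a helper like _get_top() is to sort the score array and keep the nonzero entries, capped at top_n. This is only a sketch; the project's actual implementation may differ:

import numpy as np

def get_top(page_scores, top_n):
    # indices of the highest-scoring pages, best first
    order = np.argsort(page_scores)[::-1][:top_n]
    top_scores = [(int(i), float(page_scores[i])) for i in order if page_scores[i] > 0]
    return top_scores, len(top_scores)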
Note: multiple words can be used for a query:
enter query: rpm panel
[0. 2.25970813 2.4512066 0. 4.03743801 3.6659329 ]
Found 4 pages for: rpm panel
1] 4.037 https://your_url/page4
2] 3.666 https://your_url/page3
3] 2.451 https://your_url/page6
4] 2.260 https://your_url/page5
For "rpm" page5 is the best hit (see above), for "panel" page6 is the best hit. For the combination page4 is the best hit. You can check this out by opening pages/page4.md and search for these words. That page has both and multiple instances of them.