a more scalable full text search via SQLite #63
I implemented a POC for SQLite-based full text search.
You can try a demo here:
https://phiresky.github.io/replayweb.page/?source=https://phiresky.github.io/replayweb.page/examples/netpreserve-twitter-4k-minimal.wacz#view=pages&query=фізичної+юридичної
It's implemented in the service worker. The API is a GET request to
/w/api/c/<collectionid>/sqliteFtsSearch?matchString=foo&limit=10
which responds with an ndjson stream of rows and progress events. This means you can start showing results as they are found, before the search is complete.
If there's an error, a {"type": "error", "message": "..."} object is returned in the stream.
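A client could consume that stream roughly as follows. This is only a sketch: the endpoint path and the error object shape come from the description above, while `NdjsonLineParser`, `searchUrl`, `search` and the loosely typed `StreamEvent` are hypothetical names, and the exact fields of row and progress events are not assumed.

```typescript
// Only the error shape ({"type": "error", "message": "..."}) is known from the
// description; all other events are treated as untyped records (assumption).
type StreamEvent = { type?: string; message?: string; [key: string]: unknown };

// Incrementally splits decoded text chunks into complete NDJSON lines.
class NdjsonLineParser {
  private buffer = "";

  // Feed one text chunk; returns every event completed by this chunk.
  push(chunk: string): StreamEvent[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or ""); keep it for the next chunk.
    this.buffer = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as StreamEvent);
  }

  // Call once at end of stream to parse a final line that has no trailing newline.
  flush(): StreamEvent[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest === "" ? [] : [JSON.parse(rest) as StreamEvent];
  }
}

// Hypothetical helper that builds the search URL from the path given above.
function searchUrl(collectionId: string, matchString: string, limit = 10): string {
  const params = new URLSearchParams({ matchString, limit: String(limit) });
  return `/w/api/c/${encodeURIComponent(collectionId)}/sqliteFtsSearch?${params.toString()}`;
}

// Yields events as they arrive; throws as soon as an error event is seen.
async function* search(collectionId: string, matchString: string, limit = 10): AsyncGenerator<StreamEvent> {
  const response = await fetch(searchUrl(collectionId, matchString, limit));
  if (!response.body) throw new Error("response has no body");
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const parser = new NdjsonLineParser();
  const check = (event: StreamEvent): StreamEvent => {
    if (event.type === "error") throw new Error(String(event.message));
    return event;
  };
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    for (const event of parser.push(decoder.decode(value, { stream: true }))) yield check(event);
  }
  for (const event of parser.flush()) yield check(event);
}
```

With this, a UI can render each event inside `for await (const event of search(id, "foo")) { ... }` instead of waiting for the whole response.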
Here are the changes made to the frontend (outside of this PR): phiresky/replayweb.page@202673f
Notes:
The script to convert a pages.jsonl to a pages-fts.sqlite3 file is in src/sqlite-fts/create-sqlite-fts.ts.
I added the full text index for "dako-gov-ua.wacz" to the netpreserve-twitter wacz instead, so I could push it to GitHub as an example. The search works, but the warc files are obviously missing.
I did not implement detection of whether the SQLite index is present. Right now it just assumes
/pages/extraPages-fts.sqlite3
exists.
Right now everything stays cached in RAM. There needs to be some eviction (for now you have to evict by unloading the service worker manually).
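One possible shape for the missing eviction is a least-recently-used cache with a byte budget. This is a sketch and not part of the PR: `LruPageCache`, its numeric page-index key and the byte budget are all hypothetical, and the real cache in the service worker may be keyed differently.

```typescript
// Least-recently-used cache with a byte budget (hypothetical, not in the PR).
// A Map iterates in insertion order, so its first key is the least recently used.
class LruPageCache {
  private pages = new Map<number, Uint8Array>();
  private usedBytes = 0;
  private maxBytes: number;

  constructor(maxBytes: number) {
    this.maxBytes = maxBytes;
  }

  get(pageIndex: number): Uint8Array | undefined {
    const page = this.pages.get(pageIndex);
    if (page === undefined) return undefined;
    // Re-insert so this entry becomes the most recently used one.
    this.pages.delete(pageIndex);
    this.pages.set(pageIndex, page);
    return page;
  }

  set(pageIndex: number, page: Uint8Array): void {
    const previous = this.pages.get(pageIndex);
    if (previous !== undefined) {
      this.usedBytes -= previous.byteLength;
      this.pages.delete(pageIndex);
    }
    this.pages.set(pageIndex, page);
    this.usedBytes += page.byteLength;
    // Evict oldest entries until within budget, always keeping the newest entry.
    while (this.usedBytes > this.maxBytes && this.pages.size > 1) {
      const oldestKey = this.pages.keys().next().value;
      if (oldestKey === undefined) break;
      const oldest = this.pages.get(oldestKey);
      this.pages.delete(oldestKey);
      if (oldest !== undefined) this.usedBytes -= oldest.byteLength;
    }
  }

  get size(): number {
    return this.pages.size;
  }
}
```

A byte budget rather than an entry count keeps memory bounded regardless of the page size used for the index.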
Right now I pass the query directly to SQLite's MATCH syntax. That means the following mean different things:
"foo bar"
vs
foo bar
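In FTS5 query syntax the quoted form is a phrase (the words must be adjacent and in order), while the unquoted form matches rows containing both words anywhere. If the raw pass-through turns out to be too surprising for users, the frontend could build the MATCH string explicitly. The helpers below are a sketch of that idea, not something this PR does; their names are hypothetical. They rely only on the FTS5 rule that a string is written in double quotes with embedded double quotes doubled.

```typescript
// Quote text as an FTS5 string: wrap in double quotes, double embedded quotes.
function quoteFts5(text: string): string {
  return '"' + text.replace(/"/g, '""') + '"';
}

// Every word must occur somewhere in the row (FTS5 implicit AND between strings).
function allWordsQuery(input: string): string {
  return input
    .split(/\s+/)
    .filter((word) => word !== "")
    .map(quoteFts5)
    .join(" ");
}

// The words must occur as one consecutive phrase.
function phraseQuery(input: string): string {
  return quoteFts5(input.trim());
}
```

Quoting each word also keeps characters such as `-` or `:` in user input from being interpreted as query operators.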
I created two variants: a "minimal" FTS index and a "full" index. The minimal index only allows searching for any number of words in the page content. It can't filter by closeness of words or by occurrence count, sort by BM25 relevance, or even display the page contents.
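For readers unfamiliar with FTS5, trade-offs like these can be expressed through table options. The sketch below is an assumption about how such variants might be declared, not the schema used by create-sqlite-fts.ts: the table name `pages_fts`, the column names and the exact option set are hypothetical. `content=''` makes the table contentless (the indexed text itself is not stored), `detail=none` drops per-token position data so phrase and NEAR queries are unavailable, and `columnsize=0` omits the per-row column sizes.

```typescript
type IndexVariant = "minimal" | "full";

// Builds a CREATE VIRTUAL TABLE statement for a hypothetical `pages_fts` table.
// The option choices are an assumption, not taken from the PR's script.
function createFtsTableSql(variant: IndexVariant): string {
  const columns = ["title", "url", "text"];
  // With content='' no column values can be read back from the FTS table, so
  // id, title and url would have to live in a separate ordinary table.
  const options =
    variant === "minimal" ? ["content=''", "detail=none", "columnsize=0"] : [];
  const parts = columns.concat(options).join(", ");
  return `CREATE VIRTUAL TABLE pages_fts USING fts5(${parts});`;
}
```

For the full variant this yields a plain `fts5(title, url, text)` table, which stores content and positions and therefore supports phrase queries, NEAR and `bm25()` ranking.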
We'll need a larger pages.jsonl to really evaluate this. The ones you sent are too small. With the pages.jsonl from dako-gov-ua.wacz these are the stats:
original extraPages.jsonl: 49MB
extraPages.jsonl (zipped): 3.5MB

4kB pages, minimal index (search in page content but can only show id, title and url, not page content itself):
це запити фізичної юридичної: 16 requests, 127kB
ВИДИ ЗАПИТІВ: 13 requests, 111kB
Соціально правові, after searching for ВИДИ ЗАПИТІВ: 3 requests, 20kB total
This shows that after the initial query a lot of the needed data is already fetched, even when the text is completely different. For more similar queries it has to fetch even less (or nothing).

4kB pages, full index:
це запити фізичної юридичної: 36 requests, 332kB
ВИДИ ЗАПИТІВ: 34 requests, 319kB
"Соціально-правові", after searching for ВИДИ ЗАПИТІВ: 4 requests, 29kB total

32kB pages, full index:
це запити фізичної юридичної: 19 requests, 635kB
ВИДИ ЗАПИТІВ: 16 requests, 537kB
"Соціально-правові", after searching for ВИДИ ЗАПИТІВ: 4 requests, 130kB

I had to modify the wa-sqlite library to add -DSQLITE_ENABLE_FTS5 to WASQLITE_DEFINES, that's why I included a wasm binary of that library.