High level: Rethink the embedding database structure #5

0hq · 2023-07-03T18:09:24Z

tinyvector should look to be as simple as possible while being as powerful as possible. What is the best abstraction for an embedding database that minimizes the complexity and size of the codebase?

We've made a few assumptions so far:

Only one index can be tied to each table, since we're assuming most users won't need multiple indexes on the same data.
Each index is tied to a single table and cannot span multiple tables or use special filter clauses. This might need to change in the future. Do we want to allow indexes to be built across multiple tables or with complex filtering?
Indexes should be immutable where possible: to change one, you manually delete and recreate it. We may want to keep a few mutable index types for compatibility, but it seems more straightforward (from both a performance and a user-experience perspective) for most indexes to be immutable.
Holding all indexes in memory and targeting vertical scaling seems like the simplest way to build tinyvector. In most common use cases, it seems that the vectors can easily be held in memory on reasonable hardware. If needed, you can apply dimensionality reduction to your vectors to decrease the memory footprint and increase performance. Is this the right direction?
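
To make the assumptions above concrete, here is a minimal sketch of what they could look like in code. It is not tinyvector's actual API: `Database`, `ImmutableIndex`, and `estimate_memory_bytes` are hypothetical names, brute-force cosine search is just one possible index type, and it uses only the standard library so it runs anywhere (a real implementation would more likely use an array library).

```python
import math
from typing import Dict, List, Sequence, Tuple


def estimate_memory_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for dense vectors (float32 by default), ignoring any overhead."""
    return num_vectors * dim * bytes_per_value


class ImmutableIndex:
    """Hypothetical brute-force cosine index built once over a snapshot of one table."""

    def __init__(self, table_name: str, rows: Dict[str, Sequence[float]]) -> None:
        self.table_name = table_name
        # Copy ids and vectors into tuples so later table writes cannot change the index.
        self._ids: Tuple[str, ...] = tuple(rows)
        # Normalize at build time so a query reduces to a dot product.
        self._vectors = tuple(self._normalize(rows[i]) for i in self._ids)

    @staticmethod
    def _normalize(vec: Sequence[float]) -> Tuple[float, ...]:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            raise ValueError("cannot normalize a zero vector")
        return tuple(x / norm for x in vec)

    def query(self, vec: Sequence[float], k: int = 3) -> List[Tuple[str, float]]:
        q = self._normalize(vec)
        scored = [
            (sum(a * b for a, b in zip(q, v)), row_id)
            for row_id, v in zip(self._ids, self._vectors)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(row_id, score) for score, row_id in scored[:k]]


class Database:
    """Hypothetical in-memory store: tables of vectors, at most one index per table."""

    def __init__(self) -> None:
        self._tables: Dict[str, Dict[str, Tuple[float, ...]]] = {}
        self._indexes: Dict[str, ImmutableIndex] = {}

    def insert(self, table: str, row_id: str, vec: Sequence[float]) -> None:
        self._tables.setdefault(table, {})[row_id] = tuple(vec)

    def create_index(self, table: str) -> None:
        if table in self._indexes:
            raise ValueError(f"table {table!r} already has an index; delete it first")
        self._indexes[table] = ImmutableIndex(table, self._tables.get(table, {}))

    def delete_index(self, table: str) -> None:
        del self._indexes[table]

    def query(self, table: str, vec: Sequence[float], k: int = 3) -> List[Tuple[str, float]]:
        return self._indexes[table].query(vec, k)


if __name__ == "__main__":
    db = Database()
    db.insert("docs", "a", [1.0, 0.0])
    db.insert("docs", "b", [0.0, 1.0])
    db.create_index("docs")
    print(db.query("docs", [1.0, 0.1], k=1))

    # Rows inserted after the build stay invisible until the index is recreated.
    db.insert("docs", "c", [1.0, 0.1])
    db.delete_index("docs")
    db.create_index("docs")
    print(db.query("docs", [1.0, 0.1], k=1))
    print(estimate_memory_bytes(1_000_000, 1536))
```

On the memory question: under this estimate, one million 1536-dimensional float32 vectors take about 6.1 GB of raw storage, and halving the dimensionality halves that. The delete-and-recreate flow keeps the index code trivial, at the cost of a full rebuild whenever the table changes.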
