If I wanted to do that I'd use PgVector, since I use Postgres for just about everything. There'd need to be a really good reason to go with a specialized DB.
The IVF ANN index implemented in pgvector has very poor performance, with only around 50% recall. Is that a good enough reason, or do you not care about result accuracy in favor of the comfort of using a multitool?
You can do everything with Postgres:
Full-text search, but there are better engines for it (Elasticsearch, Meilisearch, etc.), right?
You can also store JSON in Postgres, but you'd be better off using MongoDB for NoSQL purposes, right?
The reason for this is that dedicated tools are always better, faster, and more feature-rich.
> You can also store JSON in Postgres, but you'd be better off using MongoDB for NoSQL purposes, right?
It's common folklore at this point that Postgres is sometimes a better document store than most NoSQL databases (including Mongo); see, for example, this post, which is also on the front page of HN today: https://news.ycombinator.com/item?id=35544499
> The reason for this is that dedicated tools are always better, faster, and more feature-rich.
Depends. Polyglot persistence has the benefit of letting you use the "best tool for the job" for each job, but that benefit can fall apart if you have cross-cutting concerns. If you need to query across different stores, you often end up compromising on several of the initial benefits (e.g. performance, from passing data between stores, or resource use and consistency, from duplicating critical data).
For example, you could store your graph data in Neo4j and your document data in MongoDB, but good luck doing graph queries that need to access the document data. Or you could use something like Tiger or Arango, a graph database that can also store data other than pure edges.
> Dedicated tools are always better, faster, and more feature-rich.
Yes, and most people are perfectly OK with just one car. Specialists need special cars, but the general public is fine with a family car most of the time.
pgvector might not be production-ready, but that doesn't mean Postgres's full-text search, JSON, GIS, graph traversal, queues, etc. aren't good enough for most situations, with the advantage of running just one database. A whole new category of problems appears when you let your data live in multiple places. On top of that, when you go all-in on Postgres, with things like PostgREST, you can sometimes end up with a very minimalistic backend.
It is very common for production data to live in multiple places: one store for transactions, one for analytics, one for serving, one for ETL. The company I worked at rolled its own similarity-search engine before these vector databases were a thing.
> The ANN index IVF implemented in pgvector has very poor performance, with only around 50% recall.
My understanding is that this is mostly due to the default settings pgvector uses (nprobes = 3), not the use of IVF itself. Recall would improve significantly with better defaults. That would, of course, also increase the latency of vector searches, but this is the trade-off of using IVF instead of HNSW: worse latency at high recall, but much lower storage/memory costs.
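The probes/recall trade-off is easy to see in a toy sketch (pure Python, with a made-up 2-D dataset — this is an illustration of the IVF idea, not pgvector itself): an IVF index scans only the inverted lists whose centroids are closest to the query, so probing more lists recovers more of the true neighbors at the cost of scanning more candidates.

```python
import math
import random

random.seed(42)

# Toy 2-D dataset: four Gaussian clusters, one inverted list per centroid.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
points = [(cx + random.gauss(0, 2.0), cy + random.gauss(0, 2.0))
          for cx, cy in centroids for _ in range(250)]

# IVF build step: assign every vector to its nearest centroid's list.
lists = {c: [] for c in centroids}
for p in points:
    nearest = min(centroids, key=lambda c: math.dist(c, p))
    lists[nearest].append(p)

def exact_topk(query, k=10):
    # Ground truth: exhaustive scan over all points.
    return sorted(points, key=lambda p: math.dist(p, query))[:k]

def ivf_topk(query, nprobes, k=10):
    # IVF search step: scan only the `nprobes` lists closest to the query.
    probed = sorted(centroids, key=lambda c: math.dist(c, query))[:nprobes]
    candidates = [p for c in probed for p in lists[c]]
    return sorted(candidates, key=lambda p: math.dist(p, query))[:k]

def recall(nprobes, queries, k=10):
    hits = total = 0
    for q in queries:
        truth = set(exact_topk(q, k))
        hits += len(truth & set(ivf_topk(q, nprobes, k)))
        total += k
    return hits / total

# Queries placed between clusters, where true neighbors straddle lists.
queries = [(5.0, 5.0), (5.0, 0.0), (0.0, 5.0), (10.0, 5.0)]
print(recall(1, queries))  # low: a single list misses neighbors in other lists
print(recall(4, queries))  # 1.0: probing every list is an exhaustive scan
```

With queries that land between clusters, one probe misses most of the true neighbors, while probing all four lists degenerates into an exact scan — the same lever `ivfflat.probes` pulls in pgvector, just at toy scale.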
It could be that pgvector isn't good enough for serious use. It would be great to see some benchmarks. The one you cite sounds like reason enough to try something more specialized.
Having many point solutions is problematic from a cost and complexity standpoint. General-purpose solutions that work for 80% of your use cases are better long term. Having said that, I don't think Postgres can be used as a general-purpose solution covering vector search, full-text search, and NoSQL use cases. The best general-purpose solution would expose a unified API but, under the hood, use different storage engines to support these diverse vector-search and full-text-search use cases.
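A rough sketch of that last idea (all class names here are invented, and the "engines" are trivial in-memory stand-ins, not real storage engines): a single facade exposes one API while routing each query type to a backend built for it.

```python
import math

class TextEngine:
    """Stand-in for a full-text engine: a naive inverted index."""
    def __init__(self):
        self.index = {}  # token -> set of doc ids

    def add(self, doc_id, text):
        for token in text.lower().split():
            self.index.setdefault(token, set()).add(doc_id)

    def search(self, query):
        # Docs that contain every query token.
        tokens = query.lower().split()
        result = self.index.get(tokens[0], set()).copy()
        for t in tokens[1:]:
            result &= self.index.get(t, set())
        return result

class VectorEngine:
    """Stand-in for a vector engine: brute-force cosine similarity."""
    def __init__(self):
        self.vectors = {}  # doc id -> 2-D embedding

    def add(self, doc_id, vec):
        self.vectors[doc_id] = vec

    def search(self, query, k=2):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))
        ranked = sorted(self.vectors,
                        key=lambda d: cosine(self.vectors[d], query),
                        reverse=True)
        return ranked[:k]

class UnifiedStore:
    """One API; each query type is delegated to its dedicated engine."""
    def __init__(self):
        self.text = TextEngine()
        self.vec = VectorEngine()

    def add(self, doc_id, text, embedding):
        self.text.add(doc_id, text)
        self.vec.add(doc_id, embedding)

    def search_text(self, query):
        return self.text.search(query)

    def search_vector(self, embedding, k=2):
        return self.vec.search(embedding, k)

store = UnifiedStore()
store.add("a", "postgres full text search", (1.0, 0.0))
store.add("b", "mongodb json documents", (0.0, 1.0))
store.add("c", "postgres json support", (0.7, 0.7))
print(store.search_text("postgres json"))    # docs containing both tokens
print(store.search_vector((1.0, 0.1), k=1))  # doc with the nearest embedding
```

The caller never sees which engine answered; that is the "unified API, specialized storage underneath" shape, shrunk to a few dozen lines.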