If canola oil tastes like something, it's really disgusting IMO. I kinda hate the stuff even though my dad made good money growing it. OTOH, the very sweet smell of the plant's flowers is pleasant enough if pretty basic and the honey is similar.
The segments themselves being immutable doesn't mean that Tantivy is incompatible with Postgres - it just means that Tantivy needs to be made compatible with Postgres' concurrency control mechanisms (MVCC) and storage format (block storage). This blog post explains the latter: https://www.paradedb.com/blog/block_storage_part_one
The fundamental mismatch I saw is "creating a new segment for each individual DML". It's possible to alleviate, but I don't think there's a good general solution.
Second - as the article correctly demonstrates, the problem with Postgres FTS isn't "how can I pick and optimize a single pre-defined query", it's "how do I bring Postgres to Elastic-level performance across a wide range of real-world boolean, fuzzy, faceted, relevance-ranked, etc. queries?"
`pg_search` is designed to solve the latter problem, and the benchmarks were made to reflect that. You can always cherry-pick a query and optimize it at the expense of data duplication and complexity. The Neon/ParadeDB benchmarks contained 12 queries in total, and the benchmarks could have:
- Created composite b-tree indexes for each of the queries with boolean predicates
- Extracted all the text fields from JSONBs, then stored and indexed them as separate columns for queries against JSONB
But that's not realistic for many real-world use cases. `pg_search` doesn't require that - it's a simple index definition that works for a variety of "Elastic style" queries and Postgres types and doesn't ask the user to duplicate every text column.
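For illustration, a single pg_search index covering those fields might look roughly like the sketch below. The table and column names are made up, and the exact syntax should be checked against the pg_search docs for the version you run:

```sql
-- Hypothetical table; one BM25 index covering the searchable/filterable fields.
-- Follows pg_search's documented CREATE INDEX ... USING bm25 form, but verify
-- against the release you're on.
CREATE INDEX logs_search_idx ON logs
USING bm25 (id, message, country, severity, metadata)
WITH (key_field = 'id');
```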
Thanks for reporting this! I'm having trouble finding the link you are referring to though. Would you mind sharing a link to the file/page containing the dead link?
This is what we did:
- DB with pg_search: We created a single BM25 index
- DB without pg_search: We created all these indexes (sketched in SQL below):
  - GIN index on message (for full-text search)
  - GIN index on country (for text-based filtering)
  - B-tree indexes on severity, timestamp, and metadata->>'value' (to speed up filtering, ordering, and aggregations)
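In DDL terms, the without-pg_search setup looks roughly like this (a sketch only: the table name is illustrative and the operator class used for the country GIN index is an assumption, since the benchmark may have done it differently):

```sql
-- Full-text search over message (expression-based GIN index)
CREATE INDEX logs_message_fts_idx ON logs
USING gin (to_tsvector('english', message));

-- Text filtering on country; assuming pg_trgm here, which may differ from
-- what the benchmark actually used
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX logs_country_trgm_idx ON logs USING gin (country gin_trgm_ops);

-- B-tree indexes for filtering, ordering, and aggregations
CREATE INDEX logs_severity_idx       ON logs (severity);
CREATE INDEX logs_timestamp_idx      ON logs ("timestamp");
CREATE INDEX logs_metadata_value_idx ON logs ((metadata->>'value'));
```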
See the problem? You didn't create an index on the vector in the without-pg_search case. You didn't compare apples to apples. TFA is all about that.
Perhaps you can argue that creating a fastupdate=on index would have been the right comparison, but you didn't do that in that blog post.
> You can always cherry-pick a query and optimize it at the expense of data duplication and complexity. The Neon/ParadeDB benchmarks contained 12 queries in total, and the benchmarks could have:
TFA isn't cherry-picking to show you that one query could have gone faster. TFA is showing that you didn't compare apples to apples. Looking at those 12 queries, nothing screams at me that TFA's approach of storing the computed tsvector wouldn't work for those too.
Perhaps pg_search scales better and doesn't require trading update performance for search performance, and that would be a great selling point - but why not just make that point?
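For context, the stored-tsvector approach TFA is talking about looks roughly like this (a sketch with illustrative table/column names; fastupdate is the GIN storage parameter in question):

```sql
-- Store the computed tsvector once per row instead of recomputing it per query
ALTER TABLE logs
  ADD COLUMN message_tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', message)) STORED;

-- Index the stored vector; fastupdate = off trades slower writes for more
-- predictable search latency (fastupdate = on is the default)
CREATE INDEX logs_message_tsv_idx ON logs
USING gin (message_tsv) WITH (fastupdate = off);
```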
No, they didn't. They aren't Neon and didn't do the benchmarks in the linked article. They are a Postgres maintainer.
If you actually read their comment instead of raging, you will see that they are saying that pg_search is a simple generic index definition that makes a _variety_ of queries work with little effort, and you can still add the additional optimisations (which are already documented, and which they linked to) where needed.
It adds a small overhead to transactions. The exact number depends on how much you're inserting and indexing, but it is on the order of milliseconds.
We have adopted strong consistency because we've observed that most of our customers run ParadeDB pg_search in a separate instance from their primary Postgres, so they can tune that instance differently and pick more optimized hardware. The data is sent from the primary Postgres to the ParadeDB instance via logical replication, which is Zero-ETL. In this model, no application transactions are executed against the ParadeDB instance, only search queries, so this is a non-issue.
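As a sketch of that Zero-ETL setup using standard Postgres logical replication (table names and the connection string are illustrative):

```sql
-- On the primary Postgres instance: publish the tables to be searched
CREATE PUBLICATION search_pub FOR TABLE logs, orders;

-- On the ParadeDB instance: subscribe to the primary, then build the
-- BM25 index locally on the replicated tables
CREATE SUBSCRIPTION search_sub
  CONNECTION 'host=primary.example.com dbname=app user=replicator'
  PUBLICATION search_pub;
```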
We had previously adopted weak consistency because we expected our customers to run pg_search in their primary Postgres database, but we observed that this is not what they wanted, specifically among our mid-market/enterprise customers. We may bring back weak consistency as an optional feature eventually, as it would enable faster ingestion.
pg_search (a full-text search Postgres extension) can be used with pgvector for hybrid search over Postgres tables. It comes with a helpful hybrid search function that uses relative score fusion. Whereas rank fusion considers only the order of the results, relative score fusion uses the actual scores output by text/vector search.
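As an illustrative sketch of relative score fusion in plain SQL (this is not the extension's built-in function: it uses ts_rank as a stand-in for BM25 scoring and pgvector's <=> distance, and the table, columns, and query values are made up):

```sql
WITH text_scores AS (
    SELECT id,
           ts_rank(to_tsvector('english', message),
                   plainto_tsquery('english', 'disk failure')) AS score
    FROM logs
),
vector_scores AS (
    SELECT id,
           1 - (embedding <=> '[0.1, 0.2, 0.3]') AS score  -- cosine similarity via pgvector
    FROM logs
)
SELECT t.id,
       -- min-max normalize each score set, then blend the normalized scores 50/50
       0.5 * (t.score - min(t.score) OVER ()) /
             NULLIF(max(t.score) OVER () - min(t.score) OVER (), 0)
     + 0.5 * (v.score - min(v.score) OVER ()) /
             NULLIF(max(v.score) OVER () - min(v.score) OVER (), 0) AS fused_score
FROM text_scores t
JOIN vector_scores v USING (id)
ORDER BY fused_score DESC
LIMIT 10;
```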
Could you clarify which result you're referring to as the baseline and "number 1 place?"
I should clarify that our published ClickBench results are from our pg_analytics extension. New results with pg_lakehouse will be released. They're going to beat the old benchmarks because (1) there's no overhead from Postgres transactions/MVCC, since pg_analytics used a table access method whereas pg_lakehouse is just a foreign data wrapper, and (2) pg_lakehouse uses the latest release of DataFusion.
The performance differences between DataFusion and other OLAP engines are shrinking fast as query engine performance becomes commoditized. DataFusion is already a world-class query engine and will only improve. pg_lakehouse absorbs all of those improvements into Postgres.
In all seriousness though, I see your point. While it's true that we don't provide the storage or table format, our belief is that companies actually want to own the data in their S3. We called it pg_lakehouse because it's the missing glue for companies already using Postgres + S3 + Delta Lake/Iceberg to have a lakehouse without new infrastructure.
1. It's single node, but DataFusion parallelizes query execution across multiple cores. We do have plans for a distributed architecture, but we've found that you can get *very* far just by scaling up a single Postgres node.
2. The only information stored in Postgres is the set of options passed to the foreign data wrapper and the schema of the foreign table (this is standard for all Postgres foreign data wrappers).
Whereas pg_analytics stores the data in Postgres block storage, pg_lakehouse does not use Postgres storage at all.
This makes it a much simpler (and in our opinion, more elegant) extension. We learned that many of our users already stored their Parquet files in S3, so it made sense to connect directly to S3 rather than asking them to ingest those Parquet files into Postgres.
It also accelerates the path to production readiness, since we're not touching Postgres internals (no need to mess with Postgres MVCC, write-ahead logs, transactions, etc.).
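As a rough sketch of what querying external Parquet through a foreign data wrapper looks like - the wrapper name, server name, and OPTIONS keys below are assumptions for illustration, not pg_lakehouse's documented API, so check its docs for the real DDL:

```sql
-- Illustrative only: wrapper/server/option names are hypothetical
CREATE SERVER lake_server
  FOREIGN DATA WRAPPER parquet_wrapper      -- hypothetical wrapper name
  OPTIONS (region 'us-east-1');             -- hypothetical option

CREATE FOREIGN TABLE trips (
    trip_id   bigint,
    fare      numeric,
    pickup_ts timestamptz
)
SERVER lake_server
OPTIONS (path 's3://my-bucket/trips/*.parquet');  -- hypothetical option
```

Only this metadata lives in Postgres; the Parquet files themselves stay in object storage.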
If users already have a data lake style system that generates Parquet files, the case for using Postgres to query that data is itself questionable. I think the Postgres way of doing things should be prioritised if you want to keep your product in a unique position.
Can you elaborate on what you mean by the "Postgres way of doing things"? Also, what is wrong with using Postgres to query data in external object stores? It is a common occurrence for businesses to store parquet artefacts in object storage, and querying them is often desirable.
It depends. If you're happy with Databricks, etc., you might be good. But we've seen many users who want the simplicity of querying data from Postgres for analytics, especially when JOINing analytical and transactional data.