retakeming's comments

What would you say is the most neutral oil then?


Sunflower oil? It seems to very reliably taste like nothing.


Personally I have Canola and Sunflower oil tied. Vegetable Oil I guess deserves a mention here too.


If canola oil tastes like something, it's really disgusting IMO. I kinda hate the stuff even though my dad made good money growing it. OTOH, the very sweet smell of the plant's flowers is pleasant enough if pretty basic and the honey is similar.


Shout out to grapeseed oil


The segments themselves being immutable doesn't mean that Tantivy is incompatible with Postgres - it just means that Tantivy needs to be made compatible with Postgres' concurrency control mechanisms (MVCC) and storage format (block storage). This blog post explains the latter: https://www.paradedb.com/blog/block_storage_part_one


The fundamental mismatch I saw is "creating a new segment for each individual DML". It's possible to alleviate, but I don't think there's a good general solution.


I'm one of the pg_search maintainers. Hello! A few thoughts.

First, both strategies - the one outlined by the Neon/ParadeDB article, and the one used here - are presented as viable alternatives by the Postgres docs: https://www.postgresql.org/docs/current/textsearch-tables.ht....

Second - as the article correctly demonstrates, the problem with Postgres FTS isn't "how can I pick and optimize a single pre-defined query", it's "how do I bring Postgres to Elastic-level performance across a wide range of real-world boolean, fuzzy, faceted, relevance-ranked, etc. queries?"

`pg_search` is designed to solve the latter problem, and the benchmarks were made to reflect that. You can always cherry-pick a query and optimize it at the expense of data duplication and complexity. The Neon/ParadeDB benchmarks contained 12 queries in total, and the benchmarks could have:

- Created composite b-tree indexes for each of the queries with boolean predicates

- Extracted all the text fields from the JSONB columns and stored and indexed them as separate columns, for the queries against JSONB

But that's not realistic for many real-world use cases. `pg_search` doesn't require that - it's a simple index definition that works for a variety of "Elastic style" queries and Postgres types and doesn't ask the user to duplicate every text column.
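For illustration, a single index definition of the kind described might look like this (table and column names are hypothetical, and the exact pg_search syntax varies by version; see the ParadeDB docs for the current form):

```sql
-- One BM25 index covering text, filter, and JSONB fields at once,
-- instead of a separate GIN/B-tree index per query shape.
CREATE INDEX logs_search_idx ON logs
USING bm25 (id, message, country, severity, metadata)
WITH (key_field = 'id');
```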


Just an FYI: the blog link in your README does not work.


Thanks for reporting this! I'm having trouble finding the link you are referring to though. Would you mind sharing a link to the file/page containing the dead link?


From the blog about pg_search linked by TFA:

  This is what we did: 

    DB with pg_search: We created a single BM25 index 
    DB without pg_search: We created all these indexes
        GIN index on message (for full-text search)
        GIN index on country (for text-based filtering)
        B-tree indexes on severity, timestamp, and metadata->>'value' (to speed up filtering, ordering, and aggregations)

See the problem? You didn't create an index on the tsvector in the without-pg_search case. You didn't compare apples to apples. TFA is all about that.

Perhaps you can argue that creating a fastupdate=on index would have been the right comparison, but you didn't do that in that blog.

> You can always cherry-pick a query and optimize it at the expense of data duplication and complexity. The Neon/ParadeDB benchmarks contained 12 queries in total, and the benchmarks could have:

TFA isn't cherry-picking to show that one query could have gone faster; it's showing that you didn't compare apples to apples. Looking at those 12 queries, nothing screams at me that TFA's approach of storing the computed tsvector wouldn't work for them too.

Perhaps pg_search scales better and doesn't require trading update performance for search performance, and that would be a great selling point, but why not just make that point?
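The stored-tsvector approach being discussed can be sketched in plain Postgres (hypothetical table and column names):

```sql
-- Precompute the tsvector once per row (Postgres 12+ generated column),
-- then index it, instead of re-deriving it on every query.
ALTER TABLE logs
  ADD COLUMN message_tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', message)) STORED;

CREATE INDEX logs_message_tsv_idx ON logs USING gin (message_tsv);
```

Queries then match against the stored column, e.g. `WHERE message_tsv @@ websearch_to_tsquery('english', 'timeout error')`.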


Why so angry?

> "You didn't ..."

No, they didn't. They aren't Neon and didn't do the benchmarks in the linked article. They're a pg_search maintainer.

If you actually read their comment instead of raging, you'll see they're saying that pg_search is a simple generic index definition that makes a _variety_ of queries work with little effort, and that you can still add the additional optimisations (already documented - they linked to them) where needed.


Are you sure parent is the author of that blog post?

Maybe I’m reading the whole thread wrong, but it looks like you are screaming at a maintainer of pg_search that someone else did a poor benchmark


There is an art to communicating that I think people learn around their college years...


Good question. That was from a very old version of pg_bm25 (since renamed to pg_search). BM25 indexes are now strongly consistent.


Nice! I've seen other extensions that don't have transactional semantics, which runs counter to the norm for PG.

So, since it was previously weakly consistent for performance reasons, how does strong consistency affect transactional insert/update latency?


It adds a small overhead to transactions. The exact number depends on how much you're inserting and indexing, but is in the milliseconds.

We have adopted strong consistency because we've observed most of our customers run ParadeDB pg_search in a separate instance from their primary Postgres, to tune the instance differently and pick more optimized hardware. The data is sent from the primary Postgres to the ParadeDB instance via logical replication, which is zero-ETL. In this model, no application transactions are executed against the ParadeDB instance, only search queries, so this is a non-issue.

We had previously adopted weak consistency because we expected our customers to run pg_search in their primary Postgres database, but observed this is not what they wanted. This is specifically for our mid-market/enterprise customers. We may bring back weak consistency as an optional feature eventually, as it would enable faster ingestion.


If you're a Postgres pg_search user, partial indexes (https://docs.paradedb.com/search/full-text/index#partial-bm2...) can solve this.
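In plain Postgres terms, a partial index scopes an index to one tenant's rows. A generic full-text sketch (hypothetical table/column names; the BM25 variant is in the linked docs):

```sql
-- Index only tenant 42's rows, so per-tenant searches stay small
-- without one full index per tenant's entire dataset.
CREATE INDEX docs_tenant_42_search_idx
ON documents USING gin (to_tsvector('english', body))
WHERE tenant_id = 42;
```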


If you're creating one Postgres index per user, you're going to have a bad time really fast.


Would it eat up inodes? I'm not sure how these work; does Postgres create separate files for each one?


Index per tenant, but yes: if you have hundreds or thousands of tenants, it becomes a more difficult problem to manage, though not at all unmanageable.


pg_search (a full-text search Postgres extension) can be used with pgvector for hybrid search over Postgres tables. It comes with a helpful hybrid search function that uses relative score fusion. Whereas rank fusion considers just the order of the results, relative score fusion uses the actual scores output by text/vector search.
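To make the distinction concrete, here's a minimal sketch (hypothetical scores, not pg_search's actual implementation) of reciprocal rank fusion versus relative score fusion:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: only the ORDER of each result list matters."""
    fused = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

def relative_score_fusion(score_maps, weights=None):
    """Min-max normalize each source's raw scores, then sum: MAGNITUDE matters."""
    weights = weights or [1.0] * len(score_maps)
    fused = {}
    for w, scores in zip(weights, score_maps):
        lo, hi = min(scores.values()), max(scores.values())
        for doc, s in scores.items():
            norm = (s - lo) / (hi - lo) if hi > lo else 1.0
            fused[doc] = fused.get(doc, 0.0) + w * norm
    return sorted(fused, key=fused.get, reverse=True)

# 'b' nearly ties the BM25 winner; 'c' wins the vector search by a wide margin.
bm25   = {"a": 10.0, "b": 9.9, "c": 2.0, "d": 1.0}
vector = {"a": 0.1, "b": 0.2, "c": 0.9, "d": 0.3}

rank_lists = [sorted(m, key=m.get, reverse=True) for m in (bm25, vector)]
print(rrf(rank_lists))                        # ['c', 'a', 'b', 'd']
print(relative_score_fusion([bm25, vector]))  # ['b', 'c', 'a', 'd']
```

Rank fusion only sees that 'b' came second; relative score fusion sees that 'b' scored within a hair of the BM25 winner, and ranks it first.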


Could you clarify which result you're referring to as the baseline and "number 1 place?"

I should clarify that our published ClickBench results are from our pg_analytics extension. New results with pg_lakehouse will be released. They're going to beat the old benchmarks because (1) there's no overhead from Postgres transactions/MVCC, since pg_analytics used a table access method whereas pg_lakehouse is just a foreign data wrapper, and (2) pg_lakehouse uses the latest release of DataFusion.

The performance differences that exist between DataFusion and other OLAP engines are rapidly becoming commoditized. DataFusion is already a world-class query engine and will only improve. pg_lakehouse absorbs all those improvements into Postgres.


Would be great to also see new pg_lakehouse and datafusion benchmark results here: https://duckdblabs.github.io/db-benchmark/

Currently DataFusion is much slower than DuckDB, or OOMs.


pg_house just wasn't as catchy!

In all seriousness though, I see your point. While it's true that we don't provide the storage or table format, our belief is that companies actually want to own the data in their S3. We called it pg_lakehouse because it's the missing glue for companies already using Postgres + S3 + Delta Lake/Iceberg to have a lakehouse without new infrastructure.


Thanks!

1. It's single node, but DataFusion parallelizes query execution across multiple cores. We do have plans for a distributed architecture, but we've found that you can get *very* far just by scaling up a single Postgres node.

2. The only information stored in Postgres is the set of options passed to the foreign data wrapper and the schema of the foreign table (this is standard for all Postgres foreign data wrappers).


Whereas pg_analytics stores the data in Postgres block storage, pg_lakehouse does not use Postgres storage at all.

This makes it a much simpler (and in our opinion, more elegant) extension. We learned that many of our users already stored their Parquet files in S3, so it made sense to connect directly to S3 rather than asking them to ingest those Parquet files into Postgres.

It also accelerates the path to production readiness, since we're not touching Postgres internals (no need to mess with Postgres MVCC, write ahead logs, transactions, etc.)


If users already have a data-lake-style system that generates Parquet files, the case for using Postgres to query that data is questionable. I think the Postgres way of doing things should be prioritised if you want to keep your product in a unique position.


Can you elaborate on what you mean by the "Postgres way of doing things"? Also, what is wrong with using Postgres to query data in external object stores? It is a common occurrence for businesses to store parquet artefacts in object storage, and querying them is often desirable.


It depends. If you're happy with Databricks, etc., you might be good. But we've seen many users want the simplicity of querying data from Postgres for analytics, especially when JOINing analytics and transactional data.

