For those of you from the AI world, this is the equivalent of the bitter lesson, and of DeWitt's argument about database machines from the early 80s. That is, if you wait a bit, the exponential pace of Moore's law (or its modern equivalents) means improvements in "general purpose" hardware will obviate DB-specific improvements. The problem is that back in 2012, we had customers who wanted to query terabytes of logs for observability, analyze adtech streams, etc. So I feel like this is a pointless argument. If your data fit on an old MacBook Pro, sure, you should've built for that.
AWS started offering local SSD storage up to 2 TB in 2012 (HI1 instance type) and in late 2013 this went up to 6.4 TB (I2 instance type). While these amounts don't cover all customers, plenty of data fits on these machines. But the software stack to analyze it efficiently was lacking, especially in the open-source space.
AWS also had customers with petabytes of data in Redshift for analysis. The conversation is missing a key point: DuckDB is optimizing for a different class of use cases. It's optimizing for data science, not traditional data warehousing, and that difference merely masquerades as a question of data size. Even at small sizes, there are other considerations: access control, concurrency control, reliability, availability, and so on. The requirements differ across those use cases. Data science tends to be single-user and local, with lower availability requirements than warehouses that serve production pipelines, data sharing, and so on. DuckDB can be used for those workloads too, but it isn't optimized for them.
> ...there is a small number of tables in Redshift with trillions of rows, while the majority is much more reasonably sized with only millions of rows. In fact, most tables have less than a million rows and the vast majority (98%) has less than a billion rows.
The argument can be made that 98% of people using Redshift could potentially get by with DuckDB.