For those of you from the AI world, this is the equivalent of the bitter lesson, and of DeWitt's argument about database machines from the early 80s. That is, if you wait a bit, the exponential pace of Moore's law (or its modern equivalents) means improvements in "general purpose" hardware will obviate DB-specific improvements. The problem is that back in 2012, we had customers who wanted to query terabytes of logs for observability, analyze adtech streams, etc. So I feel like this is a pointless argument. If your data fit on an old MacBook Pro, sure, you should've built for that.
AWS started offering local SSD storage up to 2 TB in 2012 (HI1 instance type) and in late 2013 this went up to 6.4 TB (I2 instance type). While these amounts don't cover all customers, plenty of data fits on these machines. But the software stack to analyze it efficiently was lacking, especially in the open-source space.
AWS also had customers with petabytes of data in Redshift for analysis. The conversation is missing a key point: DuckDB is optimizing for a different class of use cases. It's optimizing for data science, not traditional data warehousing, and that difference merely masquerades as a question of data size. Even at small sizes, there are other considerations: access control, concurrency control, reliability, availability, and so on. The requirements differ across those use cases. Data science tends to be single-user and local, with lower availability requirements than warehouses that serve production pipelines, data sharing, and so on. DuckDB can be used for those workloads too, but it isn't optimized for them.
> ...there is a small number of tables in Redshift with trillions of rows, while the majority is much more reasonably sized with only millions of rows. In fact, most tables have less than a million rows and the vast majority (98%) has less than a billion rows.
The argument can be made that 98% of people using Redshift could potentially get by with DuckDB.