I ask because, if I didn't know either word, the one would mean, to me, "tiny storage next to a big body of data" and the other would mean "a big body of data".
You can't resell someone a file system if you don't rename it to a "data lake". You can't resell someone an indexed file system from the '80s that can be queried with SQL unless you rename it a "data lakehouse".
Wait till you dive into ETL vs. ELT. Even better if you've been doing ETL since long before "ELT" was "a thing", and everyone was actually doing ETL in an ELT fashion anyway...
It's not only the names that suck; something like 98% of the tools do too.
Disclaimer: I work at Snowflake. Two quick points to mention.
1. Snowflake has always used blob stores + file data + metadata. Architecturally it’s actually always been very Lakehouse-y.
2. Parquet and Iceberg should be equivalent in performance and features. It’s more than playing nicely; it’s more of a choose-your-own-adventure where all things are equal.
"(Data) lakehouse" is an amalgamation of data warehouses and data lakes. It's meant to enable querying and all the support (transactions, etc.) of traditional data warehouses on a data lake (unstructured data lying on cheap storage).
> "(Data) lakehouse" is an amalgamation of data warehouses and data lakes. It's meant to enable querying and all the support (transactions, etc.) of traditional data warehouses on a data lake (unstructured data lying on cheap storage).
Thank you for that. Do you have any suggestions on where one would start if they wanted to get a better idea and/or some experience using lakehouses?
That's what it originally meant, at least in my experience. It was when warehouses got access to commodity storage through virtualization options (Hey! I can read S3 from Redshift and it looks like a Redshift table). Similar to Postgres foreign data wrappers or PolyBase in SQL Server.
Databricks (with Delta as the underpinning) seems to have led the charge on what "lakehouse" means: your data lake + file formats/helpers + compute == data lake + data warehouse == lakehouse.
The latter seems to be the prevailing definition today with the former aging in place.
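To make the "data lake + file formats/helpers + compute" equation a bit more concrete, here's a toy sketch in plain Python. It is not how Delta or Iceberg actually work; the `_log.jsonl` file, the `part-*.json` naming, and the `commit`/`load`/`demo` helpers are all made up for illustration. The "lake" is a directory of files, the "helper" is an append-only log saying which files are committed, and the "compute" is an in-memory SQLite engine standing in for a real query engine.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def commit(table_dir, rows):
    """Write rows to a new data file, then record that file in the log."""
    log = table_dir / "_log.jsonl"  # hypothetical log name, not a real format
    version = len(log.read_text().splitlines()) if log.exists() else 0
    data_file = table_dir / f"part-{version:05d}.json"
    data_file.write_text(json.dumps(rows))
    # A file only becomes part of the table once it is listed in the log.
    with log.open("a") as f:
        f.write(json.dumps({"version": version, "add": data_file.name}) + "\n")


def load(table_dir):
    """Load only the files listed in the log into an in-memory SQL engine."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (customer TEXT, amount REAL)")
    log = table_dir / "_log.jsonl"
    entries = log.read_text().splitlines() if log.exists() else []
    for line in entries:
        name = json.loads(line)["add"]
        rows = json.loads((table_dir / name).read_text())
        conn.executemany("INSERT INTO events VALUES (:customer, :amount)", rows)
    return conn


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        table = Path(tmp)
        commit(table, [{"customer": "a", "amount": 1.5},
                       {"customer": "b", "amount": 2.0}])
        commit(table, [{"customer": "a", "amount": 3.0}])
        # A stray file that was never committed: it sits in the "lake"
        # but is not part of the table, so queries should ignore it.
        stray = [{"customer": "z", "amount": 100.0}]
        (table / "part-99999.json").write_text(json.dumps(stray))
        conn = load(table)
        result = conn.execute(
            "SELECT customer, SUM(amount) FROM events "
            "GROUP BY customer ORDER BY customer"
        ).fetchall()
        conn.close()
        return result


print(demo())
```

The point of the log is that readers see only committed files, which is roughly the "transaction support on cheap storage" part; real table formats add schema tracking, concurrent-writer handling, and file pruning on top of the same basic idea.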