
Does lakehouse have the same meaning as datalake?

I ask because, if I didn't know either word, the one would mean, to me, "tiny storage next to a big body of data" and the other would mean "a big body of data".



A data lake can be thought of as a file system. Imagine 100 CSVs in folders, usually stored on S3.

A data lakehouse adds things on top of those files: the ability to query via SQL, the ability to update/insert/delete, and transactions.
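
To make the "files plus transactions" idea concrete, here is a toy sketch in Python. It assumes nothing more than a local folder standing in for S3, and the names (ToyTable, _log, part-*.csv) are made up for illustration. It is not how Delta Lake or Iceberg are actually implemented; it only shows the general trick of keeping a commit log next to the data files, so that readers see one consistent set of files per version.

```python
import csv
import json
import tempfile
from pathlib import Path


class ToyTable:
    """Toy 'table format': plain CSV files plus a JSON commit log.

    Hypothetical illustration only, not a real lakehouse implementation.
    """

    def __init__(self, root):
        self.root = Path(root)
        self.log_dir = self.root / "_log"
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def _versions(self):
        # Each commit is a file named like 00000003.json.
        return sorted(int(p.stem) for p in self.log_dir.glob("*.json"))

    def _live_files(self):
        # The newest commit lists the data files that make up the table.
        versions = self._versions()
        if not versions:
            return []
        latest = self.log_dir / f"{versions[-1]:08d}.json"
        return json.loads(latest.read_text())["files"]

    def _commit(self, files):
        versions = self._versions()
        next_version = versions[-1] + 1 if versions else 0
        path = self.log_dir / f"{next_version:08d}.json"
        # Mode "x" fails if the file already exists, so two local writers
        # cannot both claim the same version number.
        with open(path, "x") as f:
            json.dump({"files": files}, f)
        return next_version

    def _write_data_file(self, rows):
        # Data files are never modified in place, only added.
        count = len(list(self.root.glob("part-*.csv")))
        name = f"part-{count:05d}.csv"
        with open(self.root / name, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["id", "name"])
            writer.writeheader()
            writer.writerows(rows)
        return name

    def insert(self, rows):
        name = self._write_data_file(rows)
        return self._commit(self._live_files() + [name])

    def delete_where(self, predicate):
        # Rewrite the surviving rows into a new file, then switch over
        # to it with a single commit.
        kept = [r for r in self.scan() if not predicate(r)]
        name = self._write_data_file(kept)
        return self._commit([name])

    def scan(self):
        rows = []
        for name in self._live_files():
            with open(self.root / name, newline="") as f:
                rows.extend(csv.DictReader(f))
        return rows


def demo():
    with tempfile.TemporaryDirectory() as d:
        table = ToyTable(d)
        table.insert([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
        table.insert([{"id": 3, "name": "c"}])
        # Values read back from CSV are strings.
        table.delete_where(lambda r: r["id"] == "2")
        return [r["id"] for r in table.scan()]


if __name__ == "__main__":
    print(demo())
```

The point of the log is that a delete never edits a file readers might be using: old files stay where they are, and the table only "changes" when a new commit appears. A SQL engine on top would read the latest commit to decide which files to scan.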

Where before people needed warehouses for BI and lakes for data science, they can now use a single approach for both.

It’s likely to be a big trend as data moves to this format and arrangement and the DBMS vendors like Snowflake have to play nicely with it.


Data warehouse, data lake, data lakehouse. The data world has some terrible terminology.


You can't resell someone a file system if you don't rename it to a "data lake". You can't resell someone an indexed file system from the 80's that can be queried with SQL unless you rename it a "data lakehouse".


Wait till you dive into ETL vs ELT or, even better, find out that if you've been doing ETL since long before "ELT" was "a thing", everyone was actually doing ETL in an ELT fashion...

It's not only the names, but something like 98% of the tools too, that suck.


Disclaimer - work at Snowflake. Two quick points to mention.

1. Snowflake has always used blob stores + file data + metadata. Architecturally it’s actually always been very Lakehouse-y

2. Parquet and Iceberg should be equivalent in performance and features. It’s more than playing nicely - it’s more of a choose-your-own-adventure where all things are equal.


> Where before people needed warehouses for BI and lakes for data science, they can now have only one approach.

This is all very interesting, and thank you for taking the time to explain. Any good starting points for someone who would like to know more?


Databricks popularised the concept and explain it very well - https://youtu.be/g11y-kJHr3I?si=j8FAkFsIjScHv24f

It’s a technology independent pattern though.



"(Data) lakehouse" is an amalgamation of data warehouses and data lakes. It's meant to enable querying and all the support (transaction, etc) of traditional data warehouses on a data lake (unstructured data lying on cheap storage).


> "(Data) lakehouse" is an amalgamation of data warehouses and data lakes. It's meant to enable querying and all the support (transaction, etc) of traditional data warehouses on a data lake (unstructured data lying on cheap storage).

Thank you for that. Do you have any suggestions on where one would start if they wanted to get a better idea and/or some experience using lakehouses?


That's what it originally meant, at least in my experience. It was when warehouses got access to commodity storage through virtualization options (Hey! I can read S3 from Redshift and it looks like a Redshift table). Similar to Postgres foreign data wrappers or PolyBase in SQL Server.

Databricks (with Delta as the underpinning) seems to have led the charge on what lakehouse means now: your data lake + file formats/helpers + compute == data lake + data warehouse == lakehouse.

The latter seems to be the prevailing definition today with the former aging in place.



