What it is
A Lakehouse is a Fabric item that combines the flexibility of a data lake (any file format, schema-on-read) with the discipline of a relational warehouse (managed Delta-Parquet tables, ACID writes, optimized queries). Every Lakehouse sits on OneLake, so the data is automatically open, governed, and reachable by every other Fabric workload — without copies.
Two surface areas come with each Lakehouse:
- The Lakehouse explorer for managing files (
Files/) and tables (Tables/) interactively or from Spark notebooks. - A free, auto-generated SQL endpoint for T-SQL read access (queries, views, RLS) on top of the managed Delta tables — exposed automatically.
When to use it
Default to a Lakehouse when you're starting fresh on Fabric and any of the following are true:
- You need to land semi-structured or unstructured files (JSON, Parquet, CSV, PDF, images) before transforming.
- Your transformation tooling is Spark (PySpark, Scala, R) or you want to keep that option open.
- You want a clean medallion architecture (bronze / silver / gold) with strong separation of raw, conformed, and serving layers.
- You're building Direct Lake semantic models — the path with the lowest latency from data landing to Power BI.
Prefer a Warehouse instead if your team thinks in T-SQL, your data arrives already-structured, you need stored procedures, or you want multi-table ACID across a star schema. The two can — and often do — live side-by-side; cross-database joins are a first-class feature.
How it fits in Fabric
The Lakehouse is the most-connected item in Fabric. Data Factory pipelines, Dataflows Gen2, Mirroring, and Eventstream all write into it. Spark notebooks and Spark Job Definitions transform inside it. Warehouses query across into it. Power BI consumes it via the SQL endpoint or — better — via Direct Lake on the semantic model layer.
Medallion accelerator: 6 steps to production
This is the pattern we use on most engagements. The whole sequence is templated in our Medallion Lakehouse Starter accelerator.
-
Lay out the Lakehouses. Three Lakehouses per domain:
lh_bronze,lh_silver,lh_gold. Keep them in separate workspaces with separate capacity allocations so a runaway Bronze ingest doesn't starve Gold reporting. -
Define the contract for each tier. Bronze = raw + append-only + immutable. Silver = cleansed, deduplicated, conformed types. Gold = analytics-ready, often star-schema shaped. Document the contract in a
READMEin each Lakehouse and enforce with Best Practice Analyzer rules. -
Ingest into Bronze. Pipelines or Copy Job for batch; Eventstream for streaming; Mirroring for transactional sources. Always land in
Files/when the schema is uncertain, intoTables/when it's known. -
Transform Bronze → Silver in notebooks. Parameterized PySpark notebooks driven by a metadata-driven framework. One notebook per table, called from a master orchestrator. Use Delta MERGE for CDC; use
OPTIMIZEandVACUUMon a schedule. -
Transform Silver → Gold for the semantic layer. Star schema tables sized for Direct Lake. Use Type-2 dimensions only where business needs them. Add a
last_updated_utccolumn to every Gold table — your refresh metrics depend on it. - Wire Power BI via Direct Lake. Build the semantic model directly on Gold tables. No import refreshes, no scheduled refresh windows. Power BI sees the changes seconds after Spark writes them.
Best practices
- Partition by what you query, not what you ingest. A
year/monthpartition on a sales fact is almost always the right answer;file_arrival_dateusually is not. - Run
OPTIMIZEnightly on hot tables. Direct Lake's performance depends on Delta file sizing — aim for 100–500 MB Parquet files. - Use schemas (preview) to namespace tables.
sales.ordersreads much better thanorders_saleswhen your Gold layer grows. - Keep notebooks idempotent. A re-run of a Silver notebook should never duplicate rows. Always assume your orchestrator will retry.
- Source-control everything. Lakehouses themselves aren't versioned, but the notebooks and pipeline definitions are. Use Fabric Git integration from day one.
Common pitfalls we see
Tables/ from notebooks without using Delta. A saveAsTable in Parquet (not Delta) creates an "unidentified" table that the SQL endpoint and Direct Lake won't recognize. Always write Delta.VACUUM. Old file versions pile up and inflate storage. Set a retention policy and run VACUUM on a schedule that fits your time-travel requirements.