Microsoft Fabric Lakehouse — Medallion, Direct Lake, OneLake

What it is

A Lakehouse is a Fabric item that combines the flexibility of a data lake (any file format, schema-on-read) with the discipline of a relational warehouse (managed Delta-Parquet tables, ACID writes, optimized queries). Every Lakehouse sits on OneLake, so the data is automatically open, governed, and reachable by every other Fabric workload — without copies.

Two surface areas come with each Lakehouse:

The Lakehouse explorer for managing files (Files/) and tables (Tables/) interactively or from Spark notebooks.
A free, auto-generated SQL endpoint for T-SQL read access (queries, views, RLS) on top of the managed Delta tables — exposed automatically.

When to use it

Default to a Lakehouse when you're starting fresh on Fabric and any of the following are true:

You need to land semi-structured or unstructured files (JSON, Parquet, CSV, PDF, images) before transforming.
Your transformation tooling is Spark (PySpark, Scala, R) or you want to keep that option open.
You want a clean medallion architecture (bronze / silver / gold) with strong separation of raw, conformed, and serving layers.
You're building Direct Lake semantic models — the path with the lowest latency from data landing to Power BI.

Prefer a Warehouse instead if your team thinks in T-SQL, your data arrives already-structured, you need stored procedures, or you want multi-table ACID across a star schema. The two can — and often do — live side-by-side; cross-database joins are a first-class feature.

How it fits in Fabric

The Lakehouse is the most-connected item in Fabric. Data Factory pipelines, Dataflows Gen2, Mirroring, and Eventstream all write into it. Spark notebooks and Spark Job Definitions transform inside it. Warehouses query across into it. Power BI consumes it via the SQL endpoint or — better — via Direct Lake on the semantic model layer.

OneLake is the substrate. A Lakehouse's tables and files are stored in OneLake as open Delta-Parquet. That means a Lakehouse in workspace A can be read by a Warehouse in workspace B, by Power BI in workspace C, and by external tools (Spark, Databricks, Trino) — all without copies. The governance perimeter is Purview, not the Lakehouse itself.

Medallion accelerator: 6 steps to production

This is the pattern we use on most engagements. The whole sequence is templated in our Medallion Lakehouse Starter accelerator.

Lay out the Lakehouses. Three Lakehouses per domain: lh_bronze, lh_silver, lh_gold. Keep them in separate workspaces with separate capacity allocations so a runaway Bronze ingest doesn't starve Gold reporting.
Define the contract for each tier. Bronze = raw + append-only + immutable. Silver = cleansed, deduplicated, conformed types. Gold = analytics-ready, often star-schema shaped. Document the contract in a README in each Lakehouse and enforce with Best Practice Analyzer rules.
Ingest into Bronze. Pipelines or Copy Job for batch; Eventstream for streaming; Mirroring for transactional sources. Always land in Files/ when the schema is uncertain, into Tables/ when it's known.
Transform Bronze → Silver in notebooks. Parameterized PySpark notebooks driven by a metadata-driven framework. One notebook per table, called from a master orchestrator. Use Delta MERGE for CDC; use OPTIMIZE and VACUUM on a schedule.
Transform Silver → Gold for the semantic layer. Star schema tables sized for Direct Lake. Use Type-2 dimensions only where business needs them. Add a last_updated_utc column to every Gold table — your refresh metrics depend on it.
Wire Power BI via Direct Lake. Build the semantic model directly on Gold tables. No import refreshes, no scheduled refresh windows. Power BI sees the changes seconds after Spark writes them.

Best practices

Partition by what you query, not what you ingest. A year/month partition on a sales fact is almost always the right answer; file_arrival_date usually is not.
Run OPTIMIZE nightly on hot tables. Direct Lake's performance depends on Delta file sizing — aim for 100–500 MB Parquet files.
Use schemas (preview) to namespace tables. sales.orders reads much better than orders_sales when your Gold layer grows.
Keep notebooks idempotent. A re-run of a Silver notebook should never duplicate rows. Always assume your orchestrator will retry.
Source-control everything. Lakehouses themselves aren't versioned, but the notebooks and pipeline definitions are. Use Fabric Git integration from day one.

Common pitfalls we see

Treating the Lakehouse as a single tier. Stuffing raw, cleansed, and serving data into one Lakehouse seems convenient. It isn't — it ties every downstream consumer to a moving target and blows up your capacity planning.

Writing into Tables/ from notebooks without using Delta. A saveAsTable in Parquet (not Delta) creates an "unidentified" table that the SQL endpoint and Direct Lake won't recognize. Always write Delta.

Forgetting VACUUM. Old file versions pile up and inflate storage. Set a retention policy and run VACUUM on a schedule that fits your time-travel requirements.

Lakehouse

What it is

When to use it

How it fits in Fabric

Medallion accelerator: 6 steps to production

Best practices

Common pitfalls we see

Related items

Building a Lakehouse for production?