What it is
Data Engineering is the Fabric workload that gives you Apache Spark on OneLake — managed, scaled, billed in Capacity Units. It groups Lakehouses, notebooks, Spark Job Definitions, and environments under one umbrella. If you've used Synapse Spark or Databricks, this is the equivalent surface — minus the cluster management.
Spark pools & autoscale
Fabric provides starter pools and custom pools. Starter pools are pre-warmed, so sessions typically start in under 10 seconds; custom pools let you pick the node size and the autoscale range. Autoscale handles bursty workloads: set the maximum node count to match the CU ceiling you are willing to spend, and Spark adds or removes nodes within that range as load changes.
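To make "set the maximum nodes to your CU ceiling" concrete, here is a minimal arithmetic sketch. It assumes the node sizes published for Fabric Spark (Small 4, Medium 8, Large 16, X-Large 32, XX-Large 64 vCores) and a ratio of 2 Spark vCores per Capacity Unit; it does not model bursting, so verify both figures against the current documentation for your SKU. The function names are illustrative, not part of any Fabric API.

```python
# Spark vCores per node size (assumed from the Fabric Spark compute docs).
NODE_VCORES = {"small": 4, "medium": 8, "large": 16, "xlarge": 32, "xxlarge": 64}

# Assumption: 1 Capacity Unit corresponds to 2 Spark vCores (bursting not modeled).
VCORES_PER_CU = 2


def max_nodes_for_budget(cu_budget: int, node_size: str) -> int:
    """Return the largest autoscale maximum that stays within cu_budget CUs."""
    if cu_budget <= 0:
        raise ValueError("cu_budget must be positive")
    vcores_available = cu_budget * VCORES_PER_CU
    return vcores_available // NODE_VCORES[node_size.lower()]


def cu_at_full_scale(max_nodes: int, node_size: str) -> float:
    """Return the CUs consumed while the pool sits at its autoscale maximum."""
    return max_nodes * NODE_VCORES[node_size.lower()] / VCORES_PER_CU


print(max_nodes_for_budget(64, "medium"))  # 128 vCores / 8 per node -> 16
print(cu_at_full_scale(16, "medium"))      # 16 * 8 / 2 -> 64.0
```

Running the numbers before you save a custom pool shows how much of the capacity a single fully scaled session could claim.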
Spark Job Definitions
A Spark Job Definition (SJD) is the production execution unit: a packaged JAR (Scala or Java), a PySpark file, or an R script that runs on a schedule or from a pipeline. Use SJDs when:
- The logic is stable and rarely changes
- You need stronger packaging discipline than notebooks provide
- Multiple workloads invoke the same code
Use notebooks for interactive development and orchestrated transformations; promote stable, library-heavy code to SJDs.
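As an illustration of the packaging discipline an SJD encourages, the sketch below shows one possible layout for a PySpark main file: argument parsing, the transformation, and session setup live in separate functions so that several workloads can reuse them. The job, table names, the `amount` column, and the `--min-amount` flag are all hypothetical; only the PySpark and argparse calls are real.

```python
import argparse
import sys


def parse_args(argv):
    """Parse the command-line arguments passed to the job."""
    parser = argparse.ArgumentParser(description="Hypothetical orders cleanup job")
    parser.add_argument("--source-table", required=True)
    parser.add_argument("--target-table", required=True)
    parser.add_argument("--min-amount", type=float, default=0.0)
    return parser.parse_args(argv)


def run(spark, args):
    """Read the source table, drop small orders, and overwrite the target table."""
    orders = spark.read.table(args.source_table)
    cleaned = orders.where(orders["amount"] >= args.min_amount)
    cleaned.write.format("delta").mode("overwrite").saveAsTable(args.target_table)


def main(argv):
    """Create (or reuse) the Spark session and run the job."""
    # Imported lazily so parse_args can be unit-tested without Spark installed.
    from pyspark.sql import SparkSession

    args = parse_args(argv)
    spark = SparkSession.builder.appName("orders-cleanup").getOrCreate()
    run(spark, args)


# When this file is used as the job's main definition file, finish with:
#     main(sys.argv[1:])
```

Keeping `run` free of session setup means the same function can be called from a notebook during development, with the notebook's existing `spark` session passed in.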
Environments
Environments package Python, R, and Spark library versions together with Spark configuration. They are versioned and reusable across notebooks and SJDs. Treat them as production infrastructure: keep one environment per project and promote it through dev, test, and prod.
Best practices
- Use starter pools for development. Sub-10-second session starts beat anything with cold provisioning.
- Tune autoscale ceilings. The most common cause of unexpected CU spikes is a notebook running on a pool whose autoscale maximum was never capped.
- Optimize Delta writes. Set spark.databricks.delta.autoOptimize on, run OPTIMIZE on a schedule, and VACUUM per your retention policy.
- Right-size your data. If your transformations run in seconds on Pandas, you don't need Spark; use a single-machine Python notebook.
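The Delta maintenance advice above can be sketched as a small helper that builds the SQL for a scheduled notebook or job. This is a minimal sketch: the table name `orders` is hypothetical, and the 168-hour default mirrors Delta Lake's standard 7-day retention, which you should replace with your own policy.

```python
def maintenance_statements(table: str, retention_hours: int = 168) -> list[str]:
    """Build the Delta SQL that compacts a table and then cleans up old files."""
    if retention_hours < 0:
        raise ValueError("retention_hours must not be negative")
    return [
        f"OPTIMIZE {table}",  # compact small files into larger ones
        f"VACUUM {table} RETAIN {retention_hours} HOURS",  # drop unreferenced files
    ]


for statement in maintenance_statements("orders"):
    # In a Fabric notebook or job you would run: spark.sql(statement)
    print(statement)
```

Generating the statements separately from executing them keeps the retention policy in one place and makes the schedule easy to review before it touches any table.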