Stark Informatics

Data Engineering

Fabric's Spark workload: Lakehouses, notebooks, Spark Job Definitions, runtime environments, autoscale. Everything you need to land, transform, and serve large-scale data on OneLake.

GA · Workload · Data Engineering · 8 min read

What it is

Data Engineering is the Fabric workload that gives you Apache Spark on OneLake — managed, scaled, billed in Capacity Units. It groups Lakehouses, notebooks, Spark Job Definitions, and environments under one umbrella. If you've used Synapse Spark or Databricks, this is the equivalent surface — minus the cluster management.

Spark pools & autoscale

Fabric provides starter pools and custom pools. Starter pools are pre-warmed, so sessions typically start in under 10 seconds; custom pools let you pick the node size and the autoscale range (minimum and maximum nodes). Autoscale handles bursty workloads: cap the maximum node count at what your CU budget allows, and Spark adds or removes nodes within that range as the load changes.
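
As a rough illustration of choosing an autoscale ceiling, the sketch below converts a CU budget into a maximum node count. It assumes that 1 CU corresponds to 2 Spark vCores and that the node sizes carry the vCore counts shown; both should be checked against current Fabric documentation. The function name and the 32 CU budget are hypothetical.

```python
# Assumptions (verify against current Fabric documentation):
# - 1 Capacity Unit (CU) corresponds to 2 Spark vCores.
# - Node sizes map to the vCore counts below.
SPARK_VCORES_PER_CU = 2
NODE_VCORES = {"small": 4, "medium": 8, "large": 16, "xlarge": 32, "xxlarge": 64}


def max_nodes_for_budget(cu_budget: int, node_size: str) -> int:
    """Return the largest autoscale ceiling (in nodes) that fits a CU budget."""
    if cu_budget <= 0:
        raise ValueError("cu_budget must be positive")
    vcores_available = cu_budget * SPARK_VCORES_PER_CU
    # Floor division: a partially funded node does not count.
    return vcores_available // NODE_VCORES[node_size]


if __name__ == "__main__":
    # Hypothetical budget: 32 CUs set aside for one pool.
    for size in NODE_VCORES:
        print(size, max_nodes_for_budget(32, size))
```

The calculation ignores bursting and any other workloads sharing the capacity, so treat the result as an upper bound rather than a recommendation.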

Spark Job Definitions

A Spark Job Definition (SJD) is the production execution unit: a packaged JAR (Scala/Java), a PySpark file, or an R script that runs on a schedule or from a pipeline. Use SJDs when:

  • The logic is stable and rarely changes
  • You need stronger packaging discipline than notebooks provide
  • Multiple workloads invoke the same code

Use notebooks for interactive development and orchestrated transformations; promote stable, library-heavy code to SJDs.
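
To make the "packaged PySpark file" idea concrete, here is a minimal sketch of what the main file of an SJD might look like. The job name, argument names, and the order_date column are hypothetical, and the transformation is only a placeholder. Argument parsing is kept separate from the Spark code so it can be tested without a cluster.

```python
import argparse
import sys


def parse_args(argv):
    """Parse the command-line arguments passed to the job."""
    parser = argparse.ArgumentParser(description="Hypothetical orders-cleanup job")
    parser.add_argument("--source", required=True, help="Path of the input Delta table")
    parser.add_argument("--target", required=True, help="Path of the output Delta table")
    parser.add_argument("--run-date", default=None, help="Optional YYYY-MM-DD filter")
    return parser.parse_args(argv)


def main(argv):
    args = parse_args(argv)

    # Imported here so parse_args can be unit-tested without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("orders-cleanup").getOrCreate()
    df = spark.read.format("delta").load(args.source)
    if args.run_date:
        # order_date is an assumed column name.
        df = df.where(F.col("order_date") == args.run_date)
    df.dropDuplicates().write.format("delta").mode("overwrite").save(args.target)
    spark.stop()
```

In the deployed file, the last lines would call main(sys.argv[1:]) under the usual `if __name__ == "__main__":` guard, with the arguments supplied through the job's command-line settings.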

Environments

Environments package Python, R, and Spark library versions together with Spark configuration. They are versioned and reusable across notebooks and SJDs. Treat them as production infrastructure: keep one environment per project and promote it through dev, test, and prod.
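
One way to keep promotion predictable is to pin exact library versions and compare the lists between stages before promoting. The sketch below assumes each stage's libraries are available as requirement-style strings such as "pandas==2.1.4"; the helper names and package versions are hypothetical, and this is not a Fabric API.

```python
import re

# Matches "name==1.2.3" style pins; anything else counts as unpinned.
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+==[A-Za-z0-9_.\-+!]+$")


def unpinned(requirements):
    """Return the requirement strings that do not pin an exact version."""
    return [r for r in requirements if not PINNED.match(r.strip())]


def drift(source, target):
    """Return entries present in one environment's list but not the other."""
    return sorted(set(source) ^ set(target))
```

For example, running drift on the dev and test lists before a promotion shows which pins would change, and unpinned flags entries that could resolve differently from one stage to the next.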

Best practices

  • Use starter pools for development. Sub-10-second session starts beat anything with cold provisioning.
  • Tune autoscale ceilings. The most common cause of unexpected CU spikes is a notebook running on a pool with no effective autoscale cap.
  • Optimize Delta writes. Enable optimized writes (spark.microsoft.delta.optimizeWrite.enabled in Fabric), run OPTIMIZE on a schedule, and VACUUM per your retention policy.
  • Right-size your compute. If your transformations run in seconds in pandas, you don't need Spark; use a single-machine Python notebook.
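
The scheduled Delta maintenance mentioned above can be sketched as a small helper that issues OPTIMIZE and VACUUM for a list of tables. The function names and table names are hypothetical; the default of 168 hours mirrors Delta Lake's default 7-day retention and should be replaced with your own policy.

```python
def maintenance_statements(table: str, retain_hours: int = 168):
    """Build the SQL for compaction and cleanup of one Delta table."""
    if retain_hours < 0:
        raise ValueError("retain_hours must not be negative")
    return [
        f"OPTIMIZE {table}",
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]


def run_maintenance(spark, tables, retain_hours: int = 168):
    """Run the maintenance statements against an existing SparkSession."""
    for table in tables:
        for statement in maintenance_statements(table, retain_hours):
            spark.sql(statement)
```

In a notebook with a predefined spark session, a scheduled run could call run_maintenance(spark, ["sales.orders"]). Note that Delta rejects retention periods shorter than its safety threshold unless that check is explicitly disabled.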

Production-grade data engineering

Our Medallion Lakehouse Starter ships with the patterns that make Spark workloads predictable and observable.
