Stark Informatics

Data Factory

The ingestion and orchestration layer of Microsoft Fabric. 200+ connectors, three execution surfaces — Pipelines, Dataflows Gen2, and Copy Job — and tight integration with notebooks, warehouses, and lakehouses.

GA · Workload · Data Factory · 9 min read

What it is

Data Factory in Fabric is the modern data-integration workload. It inherits 200+ connectors from Azure Data Factory and Power Query, and combines three execution surfaces under one workload:

  • Pipelines — activity-based orchestration. Best for control flow, parameters, dependencies, and triggering other Fabric items.
  • Dataflows Gen2 — Power Query M for low-code transformations. Best for semi-structured cleansing and Excel-shaped data.
  • Copy Job — simplified, high-throughput data movement with built-in CDC. Best when you just want data to arrive on a schedule.

Pipelines

Pipelines are the orchestration backbone. An activity-based DAG runs notebooks, Spark jobs, copies, lookups, web calls, and child pipelines. Common patterns:

  • Metadata-driven ingest. A control table lists sources; a ForEach activity iterates and calls Copy. Add new sources by inserting rows, not by editing pipelines.
  • Notebook orchestration. Pipelines schedule and parameterize notebooks. Pipeline = the operator's surface; notebook = the engineer's surface.
  • Cross-workload triggers. Run Spark, refresh a semantic model, and post to Teams from one pipeline.
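The metadata-driven ingest pattern can be sketched in plain Python. This is an illustration of the idea only, not Fabric's pipeline API: the `SourceRow` fields, the `CONTROL_TABLE` contents, and the `plan_copies` helper are all hypothetical names standing in for a control table and a ForEach activity that calls Copy once per row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRow:
    """One row of a hypothetical control table."""
    source_name: str
    source_table: str
    destination_table: str
    enabled: bool = True


# Hypothetical control table. In a real project this would be a table
# that the pipeline reads at run time, not a Python list.
CONTROL_TABLE = [
    SourceRow("crm", "dbo.Customers", "bronze_customers"),
    SourceRow("erp", "dbo.Orders", "bronze_orders"),
    SourceRow("legacy", "dbo.Archive", "bronze_archive", enabled=False),
]


def plan_copies(rows):
    """Stand-in for ForEach: produce one copy task per enabled row."""
    return [
        {"source": f"{r.source_name}.{r.source_table}", "sink": r.destination_table}
        for r in rows
        if r.enabled
    ]


if __name__ == "__main__":
    for task in plan_copies(CONTROL_TABLE):
        print(task)
```

Adding a source means appending a row to the table; `plan_copies` itself never changes, which is the point of the pattern.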

Dataflows Gen2

Dataflows Gen2 is Power Query in the browser, writing directly to Lakehouse, Warehouse, KQL database, or Azure SQL destinations. It is best for citizen developers and Excel-shaped data. Heavy-volume transformations belong in Spark or T-SQL: Dataflows are for cleansing and reshaping, not for terabyte joins.

Copy Job

The simplest path from source to OneLake. You point Copy Job at a source and a Lakehouse/Warehouse destination; it handles scheduling, incremental loads, change-data-capture, and monitoring. No pipeline boilerplate. Reach for it when you need to land data and the transformation logic is "none."
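To make "incremental loads" concrete, here is a conceptual sketch of watermark-based incremental loading. It does not describe how Copy Job is implemented internally; the `modified_at` column, the `SOURCE_ROWS` sample data, and the `incremental_batch` function are assumptions chosen for illustration.

```python
from datetime import datetime

# Hypothetical source rows with a last-modified column.
SOURCE_ROWS = [
    {"id": 1, "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "modified_at": datetime(2024, 1, 5)},
    {"id": 3, "modified_at": datetime(2024, 1, 9)},
]


def incremental_batch(rows, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark.

    If nothing changed, the watermark is returned unchanged.
    """
    changed = [r for r in rows if r["modified_at"] > last_watermark]
    new_watermark = max(
        (r["modified_at"] for r in changed), default=last_watermark
    )
    return changed, new_watermark


if __name__ == "__main__":
    batch, watermark = incremental_batch(SOURCE_ROWS, datetime(2024, 1, 3))
    print([r["id"] for r in batch], watermark)
```

Each run moves only the rows newer than the stored watermark, so a second run with no source changes moves nothing. This is the bookkeeping you avoid writing by hand when a managed tool handles incremental loads for you.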

Choosing the right tool

Pipeline when you need control flow, parameters, or to orchestrate other items.
Dataflow Gen2 when business analysts will own the logic and the transformations are Power Query-shaped.
Copy Job when you just need data to arrive on a schedule, with incremental loads or CDC where the source supports them.
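The three rules above can be written as a small decision helper. This is a sketch under one assumption the text does not state: the order of the checks (control flow first, then analyst ownership) is a choice made here for illustration, and the function and parameter names are hypothetical.

```python
def choose_tool(needs_control_flow: bool, analyst_owned_power_query: bool) -> str:
    """Map the guidance above to a tool name.

    Assumption: orchestration needs take priority over who owns the logic.
    """
    if needs_control_flow:
        return "Pipeline"
    if analyst_owned_power_query:
        return "Dataflow Gen2"
    # Nothing to orchestrate and nothing to transform: just land the data.
    return "Copy Job"


if __name__ == "__main__":
    print(choose_tool(needs_control_flow=True, analyst_owned_power_query=False))
    print(choose_tool(needs_control_flow=False, analyst_owned_power_query=True))
    print(choose_tool(needs_control_flow=False, analyst_owned_power_query=False))
```

Real projects often combine tools, for example a Pipeline that triggers a Dataflow Gen2, so treat the result as a starting point rather than a verdict.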

Best practices

  • Metadata-driven everything. Source lists in a control table beat 200 individual pipelines every time.
  • Parameterize for environments. Source connection strings and Lakehouse references should be deployment-pipeline variables, not hard-coded.
  • Use Copy Job before you build a custom pipeline. If Copy Job can do it, it's cheaper and more maintainable.
  • Schedule with intent. A pipeline scheduled every 5 minutes runs 288 times a day and consumes Capacity Units on every run. Aim for the longest interval the business will tolerate.
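The cost of a short schedule is easy to see with arithmetic. The sketch below counts runs per day for a given interval. Run count is only a rough proxy: actual Capacity Unit consumption depends on which activities run and for how long, which this example does not model. The `runs_per_day` helper is a hypothetical name.

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_day(interval_minutes: int) -> int:
    """Number of scheduled runs in 24 hours for a fixed interval."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return MINUTES_PER_DAY // interval_minutes


if __name__ == "__main__":
    for interval in (5, 15, 60):
        print(f"every {interval} min -> {runs_per_day(interval)} runs/day")
    # every 5 min -> 288 runs/day
    # every 15 min -> 96 runs/day
    # every 60 min -> 24 runs/day
```

Moving from a 5-minute to an hourly schedule cuts the number of runs by a factor of 12, which is why the interval is worth negotiating with the business before the pipeline ships.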

Common pitfalls

Doing heavy transformation in Dataflows. Dataflows are great for cleansing, brittle for big joins. Move heavy lifting to notebooks or T-SQL.
Re-implementing CDC in pipelines. Fabric Mirroring already exists for this. Use it for transactional sources, and build CDC by hand only when the source isn't supported.

Need help shaping your ingestion?

The choice between Pipelines, Dataflows, Mirroring, and Eventstream is the single biggest design decision in many Fabric projects.
