DE-PRO Cheatsheet — DLT, Streaming, Performance & Reliability on Databricks

Last-mile DE-PRO review: incremental pipelines, Structured Streaming (watermarks/checkpoints), Delta Live Tables concepts, performance tuning pickers (shuffle/skew/file layout), and production troubleshooting heuristics.

Use this for last‑mile review. Pair it with the Resources for coverage and IT Mastery to harden production instincts.


1) Production pipeline lens (what DE-PRO is really testing)

If two answers “work,” choose the one that is:

  • Recoverable: checkpoints/state, idempotent writes, safe retries
  • Observable: clear metrics/logs, quality gates, lineage
  • Low blast radius: staged changes, small reversible steps

2) Incremental batch + CDC (the safest default patterns)

Upsert with MERGE (CDC)

MERGE INTO silver t
USING cdc s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Rules of thumb

  • Ensure the merge keys in the ON condition uniquely identify rows on the source side — each target row must match at most one source row, or MERGE fails or multiplies rows.
  • Make pipelines idempotent: re-running the same input should not double-apply changes.
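The MERGE semantics above can be modeled in plain Python to see why re-running the same CDC batch is safe. This is a minimal sketch (hypothetical `apply_cdc` helper, no Spark), assuming the batch carries at most one row per id:

```python
def apply_cdc(table: dict, batch: list[dict]) -> dict:
    """Apply a CDC batch to a keyed table, mirroring the MERGE semantics:
    op 'D' deletes, anything else upserts. Re-applying the same batch
    yields the same table, i.e. the operation is idempotent."""
    out = dict(table)
    for row in batch:
        if row["op"] == "D":
            out.pop(row["id"], None)  # delete if present; no-op on re-run
        else:
            out[row["id"]] = {k: v for k, v in row.items() if k != "op"}
    return out

silver = {1: {"id": 1, "v": "a"}, 2: {"id": 2, "v": "b"}}
cdc = [{"id": 2, "op": "U", "v": "b2"},
       {"id": 3, "op": "I", "v": "c"},
       {"id": 1, "op": "D"}]

once = apply_cdc(silver, cdc)
twice = apply_cdc(once, cdc)  # re-running the same input changes nothing
```

If the batch had two rows for the same id, the result would depend on row order — the pure-Python analogue of the source-uniqueness rule above.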

3) Structured Streaming essentials (checkpointing, watermarks, late data)

The two most-tested concepts

| Concept | Why it matters | Failure mode |
| --- | --- | --- |
| Checkpoint | enables exactly-once-style recovery for sinks | deleting/moving the checkpoint breaks correctness |
| Watermark | bounds state and handles late data | missing watermark → unbounded state |

Streaming write (conceptual template)

(df
  .withWatermark("event_time", "10 minutes")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/chk/orders")
  .outputMode("append")
  .start("/delta/silver/orders"))

Exam cues

  • Late data policy changes outcomes (drop vs update aggregates).
  • Triggers control latency/cost; stateful ops need careful tuning.
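A watermark is essentially "max event time seen so far, minus the allowed delay"; events that arrive below it are dropped, which is what bounds state. A toy Python model of the drop-late policy (hypothetical `WatermarkedAgg` class, timestamps in minutes, 10-minute delay as in the template above):

```python
from dataclasses import dataclass, field

@dataclass
class WatermarkedAgg:
    """Toy model of a streaming count aggregation with a watermark:
    the watermark trails the max event time by `delay`; events older
    than the watermark are dropped instead of growing state forever."""
    delay: int                       # allowed lateness, e.g. 10 minutes
    max_event_time: int = 0
    counts: dict = field(default_factory=dict)
    dropped: int = 0

    @property
    def watermark(self) -> int:
        return self.max_event_time - self.delay

    def ingest(self, event_time: int, key: str) -> None:
        if event_time < self.watermark:
            self.dropped += 1        # too late: discarded, state untouched
            return
        self.max_event_time = max(self.max_event_time, event_time)
        self.counts[key] = self.counts.get(key, 0) + 1

agg = WatermarkedAgg(delay=10)
agg.ingest(100, "a")   # advances watermark to 90
agg.ingest(95, "a")    # within the watermark: counted
agg.ingest(80, "a")    # older than watermark (90): dropped
```

Removing the `event_time < self.watermark` check is the "missing watermark → unbounded state" failure mode from the table.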

4) Delta Live Tables (DLT) — what to remember

DLT is about declarative pipelines with built-in operational structure:

  • pipeline graph (table dependencies)
  • quality expectations
  • managed execution/monitoring
A typical medallion graph:

flowchart LR
  BR["Bronze (ingest)"] --> SI["Silver (clean + dedupe)"]
  SI --> GO["Gold (metrics)"]

Operator mindset: treat expectations as guardrails; don’t silently pass bad data downstream.
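Conceptually, an expectation is a named predicate that routes rows and records violation counts. A minimal sketch of the drop-on-violation behavior (hypothetical `expect_or_drop` helper — not the DLT API, just the concept):

```python
def expect_or_drop(rows, name, predicate):
    """Split rows on a quality predicate: passing rows flow downstream,
    violations are counted (and could be logged or quarantined).
    Models a drop-on-violation expectation as a guardrail."""
    kept, metrics = [], {name: 0}
    for row in rows:
        if predicate(row):
            kept.append(row)
        else:
            metrics[name] += 1       # observable, not silently passed on
    return kept, metrics

bronze = [{"id": 1, "amt": 10}, {"id": None, "amt": 5}, {"id": 3, "amt": -1}]
silver, metrics = expect_or_drop(bronze, "valid_id",
                                 lambda r: r["id"] is not None)
```

The point of the metrics dict is the "observable" lens from section 1: a quality gate that drops rows without counting them hides problems instead of surfacing them.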


5) Performance pickers (shuffle, skew, file layout)

Shuffle vs skew diagnosis

| Symptom | Likely cause | Safe next step |
| --- | --- | --- |
| Slow joins/aggregations | heavy shuffle | reduce data early; pick join strategy; tune partitions |
| One task runs forever | data skew | handle hot keys; split/skew hints (concept-level) |
| Lots of tiny files | write pattern | compaction/OPTIMIZE (concept-level) |
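One concept-level way to "handle hot keys" is salting: append a random suffix to the hot key so its rows hash to several shuffle partitions, then re-aggregate partials per base key. A sketch (hypothetical `salt_key` helper):

```python
import random

def salt_key(key: str, is_hot, n_salts: int = 8, rng=random) -> str:
    """Append a random salt to hot keys so their rows spread across
    n_salts shuffle partitions; cold keys get a fixed salt of 0.
    Downstream, partial aggregates per salted key are re-summed per base key."""
    salt = rng.randrange(n_salts) if is_hot(key) else 0
    return f"{key}#{salt}"

rng = random.Random(42)
hot_keys = {"user_1"}                      # the key behind the straggler task
rows = ["user_1"] * 1000 + ["user_2"] * 10
salted = [salt_key(k, hot_keys.__contains__, rng=rng) for k in rows]

hot_buckets = {s for s in salted if s.startswith("user_1#")}
```

The hot key's 1000 rows now land in multiple buckets instead of one straggler task, at the cost of a second, much smaller aggregation step.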

File layout rules of thumb

  • Don’t over-partition (small file explosion).
  • Compact when needed; keep partition columns low/medium cardinality.
  • Use Z-order/data skipping where supported (concept-level).
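Over-partitioning multiplies file counts: in the worst case, files ≈ (product of partition-column cardinalities) × (writing tasks). Quick arithmetic sketch (hypothetical `est_files` helper; numbers are illustrative):

```python
def est_files(partition_cardinalities: list[int], write_tasks: int) -> int:
    """Worst-case file count for a partitioned write: every writing task
    can emit one file per distinct partition-column combination."""
    combos = 1
    for cardinality in partition_cardinalities:
        combos *= cardinality
    return combos * write_tasks

# Partitioning by date alone stays manageable...
by_date = est_files([365], write_tasks=8)
# ...adding a high-cardinality column (e.g. user_id) explodes it.
by_date_user = est_files([365, 10_000], write_tasks=8)
```

This is why the rule of thumb says low/medium-cardinality partition columns only — the second layout would need constant compaction just to stay readable.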

6) Reliability + troubleshooting quick pickers

  • Streaming duplicates: checkpoint misuse, non-idempotent sink, or incorrect output mode for stateful ops.
  • State grows forever: missing/incorrect watermark; unbounded aggregation.
  • MERGE “explodes” rows: source not unique on merge keys.
  • Reprocessing/backfill: prefer explicit versioning and safe re-runs over ad-hoc manual deletes.
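The MERGE row-explosion case above can be caught before the write by checking source uniqueness on the merge keys. A sketch (hypothetical `duplicate_keys` helper):

```python
from collections import Counter

def duplicate_keys(rows: list[dict], keys: tuple[str, ...]) -> list[tuple]:
    """Return merge-key tuples that appear more than once in the source.
    A non-empty result means MERGE would match some target rows multiple
    times (row explosion or ambiguous-match errors)."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return [key for key, n in counts.items() if n > 1]

cdc = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "a2"}]
dupes = duplicate_keys(cdc, ("id",))   # id 1 appears twice
```

In practice you would deduplicate the source (e.g. keep the latest change per key) before the MERGE rather than just detecting the problem.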