Core Data Engineering Interview Questions

(6 questions)

Q1. What is Medallion Architecture and what are the 3 layers?

Medallion Architecture is a data design pattern for organizing a data lake into three progressively refined layers: Bronze, Silver, and Gold. Data flows through these layers gaining quality, structure, and business value at each stage.

Bronze (Raw Layer): An exact copy of source data, unchanged. Whatever arrives from the source — including bad records, duplicates, wrong types — is stored as-is. No transformations. Purpose: audit trail, recovery point, source of truth. If any downstream logic turns out to be wrong, you re-derive from Bronze. Stored as Parquet or original file format, partitioned by ingestion date.

Silver (Curated / DWH Layer): The data warehouse equivalent. Bronze data is cleaned, standardized, and conformed here. Transformations include: deduplication, null handling, type casting, schema standardization, cross-source joins, SCD Type 2 for dimension history. This is the single source of truth for business data — reliable, validated, structured. Stored as Delta tables.

Gold (Serving / Presentation Layer): Aggregated, domain-specific tables optimized for consumption. Pre-computed KPIs, denormalized fact tables for fast BI queries, ML feature tables. Power BI dashboards, analytical reports, and data science workloads read from Gold. Designed for fast reads, not for raw querying.

Why three layers?

  • Bronze absorbs all source chaos
  • Silver standardizes and governs it
  • Gold serves it to different consumers
  • Each layer can evolve independently

Real example: An e-commerce company ingests raw order CSV files into Bronze. Silver cleanses and joins them with customer and product dimensions. Gold has a daily_revenue_by_region table that Power BI reads for dashboards.
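A minimal PySpark sketch of that flow, assuming an existing Spark session with Delta Lake configured (e.g., Databricks); the paths, table names, and columns (raw order CSVs, dim_customer, daily_revenue_by_region) are illustrative:

from pyspark.sql import functions as F

# Bronze: land the raw CSVs as-is, partitioned by ingestion date
raw = spark.read.option("header", True).csv("/landing/orders/")
(raw.withColumn("ingest_date", F.current_date())
    .write.format("delta").mode("append")
    .partitionBy("ingest_date").save("/bronze/orders/"))

# Silver: deduplicate, cast types, join the customer dimension
bronze = spark.read.format("delta").load("/bronze/orders/")
customers = spark.read.format("delta").load("/silver/dim_customer/")
silver = (bronze.dropDuplicates(["order_id"])
                .filter(F.col("order_id").isNotNull())
                .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
                .join(customers, "customer_id", "left"))
silver.write.format("delta").mode("overwrite").save("/silver/orders/")

# Gold: pre-aggregated table that Power BI reads
(silver.groupBy("order_date", "region")
       .agg(F.sum("amount").alias("daily_revenue"))
       .write.format("delta").mode("overwrite")
       .save("/gold/daily_revenue_by_region/"))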

Q2. What is Delta Lake? Explain ACID transactions in simple terms.

Delta Lake is an open-source storage layer built on top of Parquet files that adds database-like reliability to a data lake. The core feature it adds is ACID transactions — which plain Parquet/CSV files on a data lake don't have.

What Delta adds over plain Parquet:

  • ACID transactions
  • Time travel (query table at any past version)
  • Schema enforcement (reject incompatible writes)
  • MERGE/UPSERT support
  • OPTIMIZE + ZORDER for performance
  • Small-file compaction via OPTIMIZE (plus auto-compaction where enabled)

ACID in simple terms:

  • A — Atomicity ('All or nothing'): A write either fully succeeds or fully fails. If your pipeline writes 1 million rows and crashes at row 500,000 — readers see ZERO new rows, not 500,000 partial rows. Delta achieves this through a transaction log (_delta_log/) that records a commit only when ALL data is written.

  • C — Consistency ('Data is always valid'): The table always moves from one valid state to another. Schema rules are enforced — you can't write a string into an integer column. Delta rejects incompatible writes before they corrupt data.

  • I — Isolation ('Concurrent reads/writes don't interfere'): While a writer is adding new data, readers still see the old complete version. Multiple writers use optimistic concurrency control — they detect conflicts and retry. No reader ever sees a half-written state.

  • D — Durability ('Committed data is permanent'): Once Delta says a write succeeded, it's permanently recorded in the transaction log in cloud storage. Even if the cluster crashes immediately after, the data is safe.

Real scenario:

  • An ADF pipeline writes 5M rows to a Delta table and crashes halfway through. Without Delta (plain Parquet): 2.5M partial rows visible to readers → corrupt data. With Delta: readers see the last complete version — the half-written rows are invisible.
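A small sketch of the same idea, assuming a Spark session with Delta Lake configured; the paths and version number are illustrative:

# New batch of rows to append. The append commits atomically:
# readers see either the previous version or the full new one, never a partial write.
new_rows = spark.read.option("header", True).csv("/landing/orders_increment/")
new_rows.write.format("delta").mode("append").save("/silver/orders/")

# Time travel: query the table as it was at an earlier version
old = spark.read.format("delta").option("versionAsOf", 3).load("/silver/orders/")

# Inspect the commit history recorded in _delta_log/
spark.sql("DESCRIBE HISTORY delta.`/silver/orders/`").show(truncate=False)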

Q3. Explain the difference between Data Lake, Data Warehouse, and Lakehouse.

These are three generations of analytical data storage, each addressing the limitations of the previous.

Data Warehouse (1980s–2000s): A relational database optimized for analytics — structured, schema-on-write, SQL-queryable. Examples: SQL Server DWH, Teradata, Snowflake, Azure Synapse dedicated SQL pool. Great for SQL queries and BI. Limitations: expensive to store huge volumes, can't handle unstructured data (images, logs, JSON), rigid schema makes schema changes costly, poor fit for ML workloads.

Data Lake (2010s): Massively scalable cheap storage (HDFS, ADLS Gen2) for ANY data format — CSV, JSON, Parquet, images, video. Schema-on-read — schema applied when reading, not when writing. Great for ML, big data, raw data retention. Limitations: no ACID transactions (partial writes corrupt data), no schema enforcement (any garbage can be written), poor SQL performance, hard to handle updates/deletes.

Lakehouse (2020s — the current paradigm): Best of both worlds. Cheap data lake storage (ADLS, S3) with database-like reliability (Delta Lake / Iceberg). Schema enforcement + ACID transactions + SQL support + ML-friendly. Examples: Databricks Lakehouse, Microsoft Fabric Lakehouse, Snowflake on Iceberg.

Feature             | Data Lake | Data Warehouse | Lakehouse
Cost                | Very low  | High           | Low
ACID                | No        | Yes            | Yes
Schema enforcement  | No        | Yes            | Yes
Unstructured data   | Yes       | No             | Yes
SQL analytics       | Poor      | Excellent      | Good
ML workloads        | Yes       | Poor           | Yes
Updates/Deletes     | Very hard | Easy           | Easy

Q4. What is Parquet format and why do we prefer it over CSV?

Parquet is a columnar binary file format designed specifically for analytical workloads. It's the de facto standard for data lakes and is what Delta Lake uses under the hood.

Columnar vs Row-based storage:

  • CSV/JSON: row-based — each row is stored together on disk
  • Parquet: columnar — all values of each column are stored together

Why columnar is better for analytics: Most analytical queries read only a few columns from a wide table. A query SELECT revenue FROM sales on a 100-column CSV must read all 100 columns to get revenue. Parquet reads only the revenue column from disk and ignores the other 99 columns.

Concrete advantages over CSV:

1. File size (5-10× smaller): Parquet compresses data per column. Within a column, values are often similar (all 'USD' currency, all integers in a range) — column-level encoding + Snappy/Gzip compression reduces size dramatically. A 10GB CSV often becomes 800MB Parquet.

2. Query performance (10-100× faster):

  • Column pruning: only read columns referenced in SELECT
  • Predicate pushdown: Parquet stores min/max statistics per column per row group → Spark skips row groups where the filter can't match
  • Less I/O = dramatically faster

3. Schema embedded: Parquet files carry their schema internally. No need for a schema file or schema inference (which reads the whole CSV). Types are preserved — integer stays integer, not '42' string.

4. Splittable: Parquet files can be split and read in parallel by multiple Spark tasks.

When CSV is still used: Source system output (most systems export CSV natively), human-readable files, small files where performance doesn't matter, interoperability with non-Spark tools.

Rule: Land as CSV in Bronze if source sends CSV (preserve source format). Convert to Parquet/Delta as the first Bronze→Silver transformation step.
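A minimal sketch of that conversion and of a column-pruned read, assuming a Spark session with Delta Lake configured; paths and column names are illustrative:

# Bronze -> Silver: convert the landed CSV to Delta (Parquet under the hood)
csv_df = spark.read.option("header", True).option("inferSchema", True).csv("/bronze/sales/")
csv_df.write.format("delta").mode("overwrite").save("/silver/sales/")

# Analytical read: only the 'revenue' column is scanned (column pruning), and
# per-row-group min/max stats let Spark skip data that can't match the filter (predicate pushdown)
(spark.read.format("delta").load("/silver/sales/")
      .filter("order_date >= '2024-01-01'")
      .select("revenue")
      .agg({"revenue": "sum"})
      .show())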

Q5. What is data partitioning in Delta Lake? What is Z-Ordering?

Delta Lake supports two complementary data organization strategies for query performance: partitioning and Z-Ordering.

Partitioning: Physically organizes files into folders based on a column value.

df.write.partitionBy('year', 'month').format('delta').save('/silver/orders/')
# Creates: /silver/orders/year=2024/month=01/*.parquet
#          /silver/orders/year=2024/month=02/*.parquet

When you query WHERE year=2024 AND month=01, Spark only reads the matching folder — completely skips all other year/month folders. For a 2TB table: a monthly query reads only 15GB instead of 2TB.

Rule: Partition by low-cardinality columns (year, month, region). Never partition by high-cardinality columns (customer_id, order_id) — creates millions of tiny folders.

Z-Ordering (within partitions): Z-Order sorts and colocates data within Parquet files so related rows are physically stored near each other. Uses a space-filling curve to interleave multiple column orderings.

OPTIMIZE silver.fact_orders ZORDER BY (customer_id, product_id)

After ZORDER: all orders for customer_id=12345 are in the same 1-2 files. Delta stores min/max stats per file — a query WHERE customer_id=12345 skips 95% of files immediately.

Why Z-Order for high-cardinality (what partitioning can't do):

  • Partitioning by customer_id (millions of customers) = millions of tiny files = disaster
  • Z-Ordering by customer_id = data clustered within existing (year/month) partitions + aggressive file skipping

Combine both for best results:

  • partitionBy('year','month') = skip whole folders
  • ZORDER BY (customer_id) = skip files within the folder
  • Result: a 2TB table query for one customer in one month reads < 10MB

Real impact: Gold table with 5 years of orders. Before ZORDER: customer query reads 200GB. After: reads 200MB. 1000× less I/O.
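A short sketch combining the two, assuming a Databricks/Delta environment where df is the orders DataFrame and the silver database already exists; names are illustrative:

# Partition by low-cardinality date columns at write time
(df.write.format("delta")
   .partitionBy("year", "month")
   .mode("overwrite")
   .saveAsTable("silver.fact_orders"))

# Then cluster high-cardinality columns within each partition
spark.sql("OPTIMIZE silver.fact_orders ZORDER BY (customer_id, product_id)")

# A selective query now skips whole partition folders AND most files inside the matching one
spark.sql("""
    SELECT * FROM silver.fact_orders
    WHERE year = 2024 AND month = 1 AND customer_id = 12345
""").show()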

Q6. What is CDC (Change Data Capture) and how does it work?

CDC (Change Data Capture) is a technique for tracking and capturing only the rows that have changed (inserted, updated, or deleted) in a source system since the last extraction — rather than re-reading the entire table every run.

Why CDC matters:

  • For example, suppose an orders table has 500 million records, but daily only around 50,000 rows change due to new orders, cancellations, or status updates.
  • If we use full load, the ETL reads all 500 million rows every day, which is slow and puts heavy load on the source database.
  • So instead, we use CDC to fetch only the 50,000 changed records since the last run. This makes the pipeline much faster and processes roughly 10,000× less data than a full load.

How it works — three main approaches:

1. Timestamp/Watermark-based (simplest, most common): Requires a modified_date or updated_at column reliably updated on every change. Store the last successful load timestamp (watermark).

  • Suppose yesterday's ETL run completed at 2026-05-10 23:59:59; we store this as the watermark.
  • The next ETL run filters the source with WHERE modified_date > watermark.
  • This fetches only newly inserted and updated orders (see the sketch after the pros/cons below).

  • Pros: Simple, works with any database

  • Cons: Misses hard deletes (deleted rows simply disappear from the extract), requires a reliable timestamp column
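A minimal watermark-based sketch, assuming the source is reachable over JDBC and the watermark is kept in a small Delta control table; the connection string, control-table layout, and column names are illustrative:

from pyspark.sql import functions as F

jdbc_url = "jdbc:sqlserver://<server>;databaseName=<db>"  # placeholder connection string

# 1. Read the last successful watermark for the orders table
last_wm = (spark.read.format("delta").load("/control/etl_watermarks/")
                .filter("table_name = 'orders'")
                .agg(F.max("watermark_value")).first()[0])

# 2. Pull only the rows changed since that watermark
incremental = (spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("query", f"SELECT * FROM orders WHERE modified_date > '{last_wm}'")
    .option("user", "<user>").option("password", "<password>")
    .load())

# 3. Land the changes, then advance the watermark to the max modified_date just loaded
incremental.write.format("delta").mode("append").save("/bronze/orders_cdc/")
new_wm = incremental.agg(F.max("modified_date")).first()[0]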

2. Log-based CDC (most powerful): Reads the database's transaction log (SQL Server: built-in CDC, Oracle: LogMiner, MySQL: binlog). Captures every INSERT, UPDATE, DELETE in exact order. Tools: Debezium, AWS DMS, Azure DMS.

  • Pros: Captures deletes, captures all changes in order, minimal source load
  • Cons: Requires DBA to enable, more infrastructure complexity

3. Trigger-based: Database triggers fire on change and write to a shadow change log table. ETL reads the change log.

  • Pros: Explicit control
  • Cons: Performance overhead on every source write, complex maintenance.
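Whichever approach captures the changes, they are typically applied to the Silver Delta table with a MERGE. A minimal sketch, assuming the change feed carries an op column with values 'insert' / 'update' / 'delete'; paths and column names are illustrative:

from delta.tables import DeltaTable

# Captured change rows from the CDC feed (assumed to include an 'op' column)
changes = spark.read.format("delta").load("/bronze/orders_cdc/")

target = DeltaTable.forPath(spark, "/silver/orders/")

(target.alias("t")
    .merge(changes.alias("s"), "t.order_id = s.order_id")
    .whenMatchedDelete(condition="s.op = 'delete'")
    .whenMatchedUpdateAll(condition="s.op = 'update'")
    .whenNotMatchedInsertAll(condition="s.op = 'insert'")
    .execute())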