BigData / Apache Parquet Interview Questions
Apache Parquet is an open-source, columnar storage file format designed for the Hadoop ecosystem and modern big-data platforms. Unlike row-based formats (CSV, JSON), Parquet stores each column's data contiguously on disk. This layout allows query engines to read only the columns they need, dramatically reducing I/O.
Key reasons Parquet is widely used:
- Columnar layout — queries touch only the relevant columns, not full rows.
- Efficient compression — columns hold homogeneous types, so codecs like Snappy and GZIP achieve higher ratios.
- Predicate pushdown — embedded min/max statistics let engines skip entire row groups that cannot satisfy a WHERE clause.
- Schema embedding — the schema is stored inside the file, eliminating external schema management.
- Broad ecosystem support — Spark, Hive, Presto, Flink, DuckDB, Pandas, AWS Athena, and Snowflake all read/write Parquet natively.
CSV is simple but untyped and inefficient at scale. Parquet improves on it in every dimension that matters for analytics:
| Aspect | CSV | Parquet |
|---|---|---|
| Storage layout | Row-based | Columnar |
| Compression | Low (mixed types per row) | High (homogeneous per column) |
| Schema | None (inferred) | Embedded in file |
| Partial reads | Must scan full row | Read only needed columns |
| Predicate pushdown | No | Yes (min/max statistics) |
| Schema evolution | Manual | Built-in (add columns) |
| Ecosystem | Universal | Hadoop, Spark, cloud lakes |
For OLAP-style queries that aggregate a few columns across millions of rows, Parquet is typically 10–100× faster to query and 3–10× smaller than an equivalent CSV.
Parquet organises data in a three-level hierarchy:
- Row Group — a horizontal slice of the dataset, typically 128 MB–1 GB of data. Each row group contains one column chunk per column.
- Column Chunk — all values for a single column within a row group. This is the unit of compression and encoding.
- Page — the smallest addressable unit inside a column chunk (default 1 MB). Pages can be data pages, dictionary pages, or index pages.
At the end of the file, a footer stores the schema and per-column statistics (min, max, null count, distinct count). Readers fetch the footer first to plan which row groups and pages to skip.
Parquet File
├── Row Group 1 (128 MB)
│   ├── Column Chunk: id
│   ├── Column Chunk: name
│   └── Column Chunk: amount
├── Row Group 2
│   └── ...
└── Footer (schema + statistics)
Schema evolution is the ability to change a Parquet dataset's schema over time without rewriting existing files. Parquet natively supports:
- Adding columns — new columns read as null from older files when older and newer files are read together.
- Renaming columns — supported via field IDs (used by table formats such as Iceberg layered on top of Parquet).
- Widening types — e.g., INT32 → INT64 is safe; narrowing is not.
In Apache Spark, set mergeSchema = true to merge schemas across multiple Parquet files automatically:
df = spark.read.option("mergeSchema", "true").parquet("s3://bucket/data/")
Schema evolution is critical for long-lived data lakes where upstream producers add new fields without co-ordinating with all downstream consumers.
Column pruning (also called projection pushdown) is an optimisation where the query engine reads only the columns referenced in the query, ignoring all other column chunks on disk. Because Parquet stores each column separately, skipping unrequested columns costs nothing beyond reading the footer.
Example — a table has 50 columns but the query only needs 3:
SELECT user_id, revenue, country FROM events WHERE date = '2026-01-01';
The engine reads only the user_id, revenue, country, and date column chunks — the other 46 are never loaded from disk. Combined with predicate pushdown (row group skipping via statistics), this makes Parquet queries orders of magnitude faster than scanning full rows.
Parquet and Avro serve different access patterns. The table below summarises when to prefer each:
| Criterion | Choose Parquet | Choose Avro |
|---|---|---|
| Query pattern | OLAP — aggregate few columns over many rows | Row-based access — read/write entire records |
| Streaming | Batch / micro-batch | Streaming (Kafka, Flink event records) |
| Write frequency | Write-once, read-many | Frequent writes / appends |
| Schema evolution | Column adds; limited | Full backward/forward compatibility |
| Compression | Higher for analytics | Moderate; better for write throughput |
A common pattern: ingest events as Avro in Kafka, then compact and convert to Parquet in a data lake for batch analytics.
Parquet uses two complementary techniques to minimise storage:
Encoding — applied first, at the column level, to exploit data patterns:
- Dictionary Encoding — replaces repeated values with integer codes. Ideal for low-cardinality columns (e.g., status, country).
- Run-Length Encoding (RLE) — collapses consecutive identical values into (value, count) pairs.
- Delta Encoding — stores differences between successive integers; great for timestamps and monotonic IDs.
- Bit-packing — packs small integers into fewer bits.
Compression codec — applied after encoding to the byte stream:
- Snappy (default in Spark) — fast, moderate ratio (~2–3×).
- GZIP/Zlib — higher ratio (~4–6×) but slower CPU.
- ZSTD — best balance; recommended for most new deployments.
- LZO, LZ4, Brotli — specialised use cases.
Encoding and codec are configured independently per column, so hot columns can use fast codecs while archive columns use ZSTD for maximum compression.
The Vectorized Parquet Reader (introduced in Spark 2.0) reads a batch of rows at once directly into an in-memory columnar format (ColumnarBatch) rather than converting each row individually to a JVM object. This avoids object creation overhead and allows the JVM's JIT compiler to apply SIMD-style optimisations.
Key benefits:
- Reduces per-row CPU overhead significantly.
- Enables whole-stage code generation to operate on batches.
- Speeds up filters and projections applied directly to column vectors.
Enable or verify via Spark config:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true") # default true
For complex nested schemas (e.g., deeply nested structs/maps), the vectorized reader may fall back to row-by-row mode automatically.
When a dataset is composed of Parquet files written at different times (possibly with different schemas), you have several options:
1. Spark mergeSchema — the simplest approach; Spark unions all schemas and fills missing columns with null:
df = spark.read.option("mergeSchema", "true").parquet("s3://datalake/events/")
2. AWS Glue Schema Registry — version schemas centrally; consumers register against a schema version and handle evolution rules explicitly.
3. Table formats (Iceberg / Delta / Hudi) — track schema history at the table level; ALTER TABLE ADD COLUMN is applied transactionally without rewriting files.
4. Manual reconciliation — use df.schema to inspect each file's schema, build a unified schema, then apply df.select(unified_cols) with lit(None).cast() for missing columns before union.
Always validate after merging: check for unexpected nulls or type coercions that widen types silently.
Diagnosing and tuning slow Parquet queries in Spark follows a layered approach:
- Check partitioning — ensure the table is partitioned on commonly filtered, moderate-cardinality columns (e.g., date, region). Without partitions, Spark scans all files: df.write.partitionBy("date", "region").parquet("s3://path/")
- Verify column pruning — use SELECT col1, col2 rather than SELECT * so only the required column chunks are read.
- Align filters with statistics — filter on columns with good min/max spread so row-group skipping is effective. Sort data by the filter column before writing (Z-ordering or file-level sorting).
- Right-size row groups — too-small row groups → many footer reads; too-large → coarse skipping. Target 128 MB–512 MB per row group.
- Check compression codec — Snappy (fast decompression) suits CPU-bound queries; ZSTD suits I/O-bound scenarios.
- Enable Adaptive Query Execution (AQE) — spark.sql.adaptive.enabled=true re-optimises joins and coalesces small partitions at runtime.
- Inspect the Spark UI — look at the scan stage: bytes read vs. bytes skipped tells you how effective pushdown is.
Snowflake can query and load Parquet files staged in cloud storage using a two-step approach:
Step 1 — Stage the files (S3, Azure Blob, or GCS external stage):
CREATE OR REPLACE STAGE my_stage
URL = 's3://my-bucket/parquet-data/'
CREDENTIALS = (AWS_KEY_ID='...' AWS_SECRET_KEY='...');
Step 2a — Infer schema and query directly (no table creation needed):
SELECT $1:user_id::INT, $1:revenue::FLOAT
FROM @my_stage (FILE_FORMAT => 'parquet_fmt');
Step 2b — Use INFER_SCHEMA to auto-create table:
CREATE TABLE events USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
FROM TABLE(INFER_SCHEMA(
LOCATION => '@my_stage',
FILE_FORMAT => 'parquet_fmt'
))
);
COPY INTO events FROM @my_stage FILE_FORMAT = (TYPE = 'PARQUET');
INFER_SCHEMA reads the Parquet footer metadata to derive column names and types automatically, eliminating manual DDL.
Parquet defines primitive types (physical storage) and logical types (semantic meaning layered on top).
Primitive types: BOOLEAN, INT32, INT64, INT96 (legacy timestamps), FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY.
Common logical types (annotations on primitives):
| Logical Type | Physical Mapping | Example |
|---|---|---|
| STRING | BYTE_ARRAY (UTF-8) | "London" |
| DATE | INT32 (days since epoch) | 2026-01-01 |
| TIMESTAMP_MILLIS | INT64 | 1704067200000 |
| DECIMAL | INT32/INT64/BYTE_ARRAY | 99.99 |
| LIST | Repeated group | [1, 2, 3] |
| MAP | Repeated key-value group | {k: v} |
| ENUM | BYTE_ARRAY | "ACTIVE" |
| UUID | FIXED_LEN_BYTE_ARRAY(16) | RFC-4122 |
Understanding the primitive/logical split matters when debugging type mismatch errors between Spark, Hive, and other readers.
Spark provides first-class Parquet support via the DataFrameReader and DataFrameWriter APIs.
Read:
# Read a single file or directory of Parquet files
df = spark.read.parquet("s3://my-bucket/events/")
# With options
df = (spark.read
.option("mergeSchema", "true")
.parquet("hdfs:///datalake/transactions/"))
Write:
# Overwrite with Snappy compression (default)
df.write.mode("overwrite").parquet("s3://my-bucket/output/")
# Partition by date and region, use ZSTD
(df.write
.partitionBy("date", "region")
.option("compression", "zstd")
.mode("append")
.parquet("s3://my-bucket/partitioned/"))
Register as temp view for SQL:
df.createOrReplaceTempView("events")
spark.sql("SELECT date, SUM(revenue) FROM events GROUP BY date").show()
PyArrow provides a low-level Parquet library that is fast, Python-friendly, and interoperates with Pandas:
Write:
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.Table.from_pandas(df)
pq.write_table(table, "output.parquet", compression="zstd")
Read:
table = pq.read_table("output.parquet")
df = table.to_pandas()
Read specific columns only (column pruning):
table = pq.read_table("output.parquet", columns=["user_id", "revenue"])
Read with filter pushdown:
import pyarrow.dataset as ds
dataset = ds.dataset("s3://bucket/data/", format="parquet")
table = dataset.to_table(
columns=["user_id", "revenue"],
filter=ds.field("date") == "2026-01-01"
)
PyArrow's dataset API supports partitioned directories and applies filter pushdown automatically using Parquet statistics.
Partitioning organises files into a directory hierarchy based on column values, following the Hive partition layout:
s3://bucket/events/
  date=2026-01-01/
    region=US/
      part-0000.parquet
  date=2026-01-02/
    region=US/
      part-0001.parquet
When a query filters on a partition column, the engine lists only matching subdirectories and skips all others — this is called partition pruning. No Parquet files in the skipped partitions are opened at all.
Example savings: a table with 3 years of daily data (1,095 partitions) filtering on a single date skips 99.9% of files before any Parquet-level statistics are consulted.
Trade-offs:
- Partition on columns with moderate cardinality (dates, regions, categories) — not user IDs (too many small files).
- Avoid over-partitioning: thousands of tiny files hurt throughput more than they help with pruning.
A Bloom Filter is a probabilistic data structure that answers "is this value possibly in this row group?" with zero false negatives. Parquet 1.12+ supports per-column Bloom filters stored in the file footer.
They complement min/max statistics for high-cardinality columns where the min–max range spans the full value space (e.g., UUIDs, email addresses). Without a Bloom filter, the engine must open every row group; with one, it can skip groups that definitely do not contain the queried value.
Enable in PySpark/Spark:
df.write \
  .option("parquet.bloom.filter.enabled#user_id", "true") \
  .option("parquet.bloom.filter.expected.ndv#user_id", "1000000") \
  .parquet("s3://path/")
When to use:
- Equality filters (WHERE uuid = '...') on high-cardinality string/UUID columns.
- Not useful for range queries (use sorted data + min/max statistics for those).
Trade-off: adds size to the footer; tune expected.ndv (number of distinct values) to control false-positive rate vs. filter size.
All three are Apache-ecosystem formats but optimised for different workloads:
| Feature | Parquet | ORC | Avro |
|---|---|---|---|
| Layout | Columnar | Columnar | Row-based |
| Ecosystem fit | Spark, Presto, Hive, cloud lakes | Hive-native; Spark | Kafka, Flink, Hadoop MR |
| Compression | Excellent (ZSTD, Snappy) | Excellent (ZLIB, Snappy, ZSTD) | Good (Deflate, Snappy) |
| Schema evolution | Column adds; mergeSchema | Column adds; type promotions | Full backward/forward compatibility |
| Nested data | Strong (Dremel-style) | Good | Native (JSON-like nested) |
| Best for | Analytics, data lake | Hive OLAP, HBase integration | Streaming, messaging, OLTP-like writes |
In practice: most modern data lake stacks (Databricks, AWS Lake Formation, Google BigQuery external tables) default to Parquet. ORC remains dominant in Hive-heavy shops.
Z-ordering is a multi-dimensional data-skipping technique that physically co-locates related rows across multiple filter columns within Parquet files. It maps multiple column values to a single Z-order curve value and sorts data along that curve.
Problem it solves: Standard partitioning and sorting works well for one column. When queries filter on two or more independent columns (e.g., WHERE region='EU' AND product_category='Electronics'), traditional single-column sorting cannot co-locate all relevant rows.
Z-ordering interleaves the bits of the filter column values so that rows with similar values in multiple columns are physically adjacent.
In Delta Lake (Databricks):
OPTIMIZE events ZORDER BY (region, product_category);
After Z-ordering, the same query skips far more row groups because the min/max ranges for both columns are tight within each file.
Trade-off: Z-order OPTIMIZE is a full rewrite of affected files and should be scheduled periodically, not on every write.
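The bit-interleaving idea can be sketched in a few lines of plain Python (a toy Morton-code implementation, not Delta Lake's actual algorithm; the points are invented):

```python
def z_value(x, y, bits=16):
    """Interleave the bits of x and y into a single Morton (Z-order) code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions come from x
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions come from y
    return z

# points close in BOTH dimensions get nearby Z-values, so sorting by z_value
# co-locates them -- the outlier (200, 5) sorts far away from the cluster
points = [(3, 7), (3, 6), (200, 5), (2, 7)]
print(sorted(points, key=lambda p: z_value(*p)))
```

Sorting rows by such a code before writing keeps min/max ranges for both columns tight within each file, which is what enables multi-column skipping.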
Apache Iceberg is a high-performance open table format for huge analytic datasets, designed to replace traditional Hive table management. Iceberg stores actual data in Parquet (or ORC/Avro) files and adds a metadata layer on top.
Iceberg adds to raw Parquet:
- ACID transactions — snapshot isolation for concurrent reads and writes.
- Hidden partitioning — partition transforms (bucket, truncate, year/month/day) applied transparently without user-specified partition paths.
- Full schema evolution — rename, reorder, drop columns tracked via field IDs.
- Time travel — query any historical snapshot: SELECT * FROM events VERSION AS OF 12345;
- Metadata statistics — manifest files track per-file min/max/null counts for fast planning.
Parquet is the default data format in Iceberg deployments (Spark, Flink, Trino, Hive). Iceberg handles all the metadata overhead so Parquet files can remain immutable and easily compactable.
DuckDB can query Parquet files directly without loading them into a database, using SQL:
-- Query a local Parquet file
SELECT region, SUM(revenue)
FROM read_parquet('events.parquet')
GROUP BY region;
-- Glob pattern for multiple files / partitioned directory
SELECT * FROM read_parquet('s3://bucket/data/**/*.parquet')
WHERE date = '2026-01-01';
DuckDB is fast on Parquet for several reasons:
- Columnar execution engine — DuckDB processes data in column vectors natively, aligning perfectly with Parquet's columnar layout.
- Push-down filters — WHERE conditions are pushed into the Parquet reader, exploiting row-group statistics and Bloom filters.
- Projection pushdown — only referenced columns are read from disk.
- Parallel I/O — multiple row groups are read in parallel across CPU cores.
- Zero-copy reads — avoids deserialising data into intermediate objects.
This makes DuckDB an excellent tool for ad-hoc analytics on Parquet data lakes without a cluster.
The Parquet file footer is a serialised Thrift structure at the end of every Parquet file. It contains:
- The full file schema (field names, types, nesting).
- Per-row-group metadata: byte offsets, compressed/uncompressed sizes, row counts.
- Per-column-chunk statistics: min value, max value, null count, distinct count.
- Encoding and compression codec per column chunk.
- Bloom filter offsets (if present).
Readers always fetch the footer first because it is small (typically kilobytes) and provides the complete map needed to plan which row groups and column chunks to read. Without the footer, the reader would have to scan the entire file sequentially.
The last 4 bytes of a Parquet file are the magic bytes PAR1; the 4 bytes before that are a 32-bit integer giving the footer length, so readers seek to the end of the file first.
Parquet uses the Dremel encoding (Google's paper, 2010) to represent arbitrarily nested data in a flat columnar layout. Two extra per-value integers are stored alongside each column's data:
- Definition level — how many optional fields in the path are actually defined (non-null). Encodes null positions without storing nulls explicitly.
- Repetition level — which repeated field in the path started a new list element. Encodes list boundaries.
Example — a column orders.items.price where items is a repeated field:
Row 1: orders.items = [{price: 10}, {price: 20}]
Row 2: orders.items = [] (empty list)
Stored as:
price values: [10, 20]   (no value is stored for the empty list)
repetition:   [0, 1, 0]  (0 = new top-level record, 1 = another value in the same list)
definition:   [2, 2, 1]  (2 = value present; 1 = items defined but empty)
This lets engines reconstruct nested structures perfectly while retaining columnar access for analytics on leaf fields.
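A much-simplified reconstruction of this single repeated column can be written in plain Python (a toy decoder for one level of nesting, not the general Dremel algorithm):

```python
def reconstruct(column, max_def=2):
    """Rebuild one list per row from (value, repetition, definition) triples.

    max_def is the definition level at which a leaf value is actually present.
    """
    rows = []
    for value, rep, d in column:
        if rep == 0:               # repetition 0 -> a new top-level record starts
            rows.append([])
        if d == max_def:           # fully defined -> a real leaf value
            rows[-1].append(value)
        # d < max_def: the list exists but is empty (or a parent is null)
    return rows

# Row 1: items = [{price: 10}, {price: 20}]; Row 2: items = []
column = [(10, 0, 2), (20, 1, 2), (None, 0, 1)]
print(reconstruct(column))  # [[10, 20], []]
```

The general algorithm handles arbitrary nesting depth, but the principle is the same: repetition levels mark record and list boundaries, definition levels mark how far down the path values exist.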
The small file problem occurs when a Parquet dataset accumulates thousands of tiny files (often from streaming writes or over-partitioning). Each file requires a separate HDFS/S3 metadata operation and a separate footer read, causing significant overhead.
Effects:
- Slow query planning — the driver/namenode must enumerate and open many files.
- Inefficient Parquet statistics — small row groups give poor skipping benefits.
- High object-store API costs (S3 LIST + GET per file).
Solutions:
- Compaction job — periodically coalesce small files into larger ones. In Spark: spark.read.parquet(path).coalesce(N).write.mode("overwrite").parquet(tmp_path) — write to a temporary path and swap it in, rather than overwriting the path you are still reading from.
- Delta Lake / Iceberg OPTIMIZE — OPTIMIZE my_table;
- Streaming micro-batch tuning — increase the trigger interval or use maxFilesPerTrigger to reduce write frequency.
- Hudi MOR → COW compaction — compact merge-on-read log files into base Parquet files.
Both control the number of output Parquet files, but they work differently:
| Aspect | repartition(N) | coalesce(N) |
|---|---|---|
| Shuffle | Full shuffle — all data redistributed across N partitions | No full shuffle — merges existing partitions locally |
| Output | Exactly N evenly-sized partitions | Up to N partitions; may be uneven |
| Direction | Can increase or decrease partitions | Can only decrease partitions |
| Cost | Higher — network + disk I/O | Lower — local merge |
| When to use | Need even file sizes; before a wide join | Reducing many small files cheaply after a filter |
For writing right-sized Parquet files, a common pattern is:
# After a heavy filter that leaves few rows, coalesce is cheaper
df.filter(df.date == "2026-01-01").coalesce(4).write.parquet(output_path)
AWS Athena is a serverless interactive query service that uses Presto/Trino under the hood. It reads Parquet files directly from S3 using the Parquet columnar reader.
Steps to query Parquet in Athena:
- Define a Glue Data Catalog table pointing to the S3 prefix:
CREATE EXTERNAL TABLE events (
user_id BIGINT,
event_type STRING,
revenue DOUBLE,
event_date DATE
)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/'
TBLPROPERTIES ("parquet.compress"="SNAPPY");
- Query as normal SQL — Athena applies column pruning and predicate pushdown automatically.
Cost optimisation: Athena charges per byte scanned. Parquet compression + columnar reads can reduce costs by 90%+ compared to querying CSV for the same logical data.
Partition projection or partition-filtered queries (WHERE event_date BETWEEN ...) reduce both scan size and latency.
Predicate pushdown is the process of evaluating filter conditions as early as possible — before data reaches the compute layer — using metadata stored inside Parquet files.
End-to-end flow for WHERE amount > 1000:
- The reader fetches the Parquet footer and reads per-row-group statistics: min_amount, max_amount.
- For each row group, if max_amount ≤ 1000, the entire row group is skipped — no decompression, no I/O.
- For surviving row groups, the amount column chunk is decompressed and values are evaluated against > 1000, down to the page level where page-level column indexes are present.
- Only matching rows are returned to the engine.
Later revisions of the Parquet format added column indexes (min/max per page, not just per row group), enabling finer-grained skipping within a row group.
Combined with partition pruning (directory skipping) and Bloom filters (equality skipping), predicate pushdown is the single most important performance feature of the Parquet format.
Producing high-quality Parquet files that perform well at query time requires attention at write time:
- Target 128 MB–512 MB row groups — too small wastes footer reads; too large makes predicate skipping coarse.
- Sort data before writing on filter columns — tight min/max ranges per row group dramatically improve skipping.
- Choose the right compression codec — ZSTD for I/O-bound workloads; Snappy for CPU-bound (fast decompression).
- Enable dictionary encoding on low-cardinality columns — automatic in most frameworks but verify it is not being disabled.
- Partition on moderate-cardinality columns (e.g., date, country) — never on user IDs or UUIDs.
- Avoid tiny files — compact regularly if streaming or incremental writes produce many small files.
- Enable Bloom filters on high-cardinality equality columns (UUIDs, hashed IDs).
- Embed correct schema types — use TIMESTAMP_MICROS not INT96 (deprecated); use DECIMAL not DOUBLE for monetary values.
- Test with query benchmarks after schema changes to confirm no regression in pushdown effectiveness.
BigQuery uses Capacitor, its proprietary columnar format, which shares the same fundamental principles as Parquet: columnar layout, aggressive encoding, and embedded statistics. When you export from BigQuery or import into it, Parquet is the preferred external format.
Exporting BigQuery tables to Parquet:
bq extract \
--destination_format PARQUET \
--compression SNAPPY \
myproject:mydataset.mytable \
gs://my-bucket/export/*.parquet
Loading Parquet into BigQuery:
bq load \
--source_format=PARQUET \
myproject:mydataset.newtable \
gs://my-bucket/data/*.parquet
BigQuery can also query external Parquet tables via BigLake without loading — applying its own predicate pushdown against Parquet statistics in GCS.
Delta Lake is an open-source storage layer built by Databricks that adds transactional guarantees on top of Parquet files stored in object storage. The core idea: all changes (inserts, updates, deletes) are written as immutable Parquet files and tracked via a JSON transaction log (the _delta_log directory).
ACID properties in Delta Lake:
- Atomicity — a transaction either fully commits (log entry added) or is rolled back (log entry absent).
- Consistency — schema enforcement prevents corrupt writes.
- Isolation — Optimistic Concurrency Control (OCC) detects conflicts between concurrent writers.
- Durability — log + data files in object storage are highly durable.
Data files remain Parquet; Delta adds a _delta_log/ with JSON commit files that record which Parquet files are added or removed in each transaction. Periodic checkpoint files (Parquet snapshots of the log) speed up log replay.
# Spark Delta write with schema enforcement
df.write.format("delta").mode("append").save("/delta/events")
# Time travel
spark.read.format("delta").option("versionAsOf", 5).load("/delta/events")
Raw Parquet files are immutable — you cannot update individual rows. Delta Lake adds MERGE INTO support, which implements upserts by reading affected Parquet files, rewriting them with changes, and recording the transaction:
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/delta/customers")
delta_table.alias("target").merge(
updates_df.alias("source"),
"target.customer_id = source.customer_id"
).whenMatchedUpdateAll(
).whenNotMatchedInsertAll(
).execute()
Or in SQL:
MERGE INTO customers AS target
USING updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
Under the hood: Delta identifies which Parquet files contain matching rows, rewrites only those files with updated values, and records the old files as removed and new files as added in the transaction log. Non-matching files are untouched.
