What is a typical ETL throughput?

Extraction from databases typically achieves 50–200 MB/sec depending on network and query complexity. Loading into a warehouse runs 50–300 MB/sec. Transformation throughput varies wildly—simple column mappings are fast; complex joins and lookups are slow.

How does parallelism affect runtime?

Ideal parallelism divides runtime linearly (4 workers = 4× faster). Real-world parallelism is sub-linear due to coordination overhead, shared resources, and data skew. Expect 60–80% efficiency per additional worker.

What if my ETL exceeds its batch window?

Options: increase parallelism, switch to incremental/CDC processing, optimize slow transforms, upgrade hardware, or extend the batch window. Incremental processing often provides the biggest improvement—processing only changed records instead of the full dataset.

Should I use streaming instead of batch ETL?

Streaming (Kafka + Flink/Spark Streaming) eliminates batch windows entirely by processing records as they arrive. It adds operational complexity but ensures data freshness in seconds instead of hours. Use streaming for latency-sensitive use cases.

How do I estimate records from volume?

Divide total volume in bytes by average record size. For example, 100 GB of data with an average record size of 200 bytes contains approximately 500 million records. Check your source system statistics for accurate average record sizes.

What is the overhead between ETL phases?

Phase transitions (extract → transform → load) include startup time, connection setup, and buffering. In well-designed pipelines, phases overlap (pipeline parallelism), reducing total time. In sequential pipelines, add 5–15% for inter-phase overhead.

ETL Runtime Estimator

Estimate ETL pipeline runtime from data volume, throughput, transform time, and parallelism factor. Plan batch processing windows.

Data Volume

Extract Throughput

MB/s

Load Throughput

MB/s

Records to Process

Transform Time / Record

Parallelism Factor

Extract Phase

8.5 min

Transform Phase

69.44 hr

Load Phase

11.4 min

Serial Total

69.78 hr

Sum of all values

Parallel Total

17.44 hr

4× parallelism

Wall Clock

17.44 hr

Planning notes, formulas, and examples

About the ETL Runtime Estimator

ETL (Extract, Transform, Load) pipelines have strict time windows. A nightly batch that must complete before business hours needs predictable runtime estimates. If data volume grows faster than pipeline throughput, the batch window eventually overflows—causing stale data, missed SLAs, and panicked on-call engineers.

This calculator models ETL runtime by breaking it into three phases: extraction (volume ÷ read throughput), transformation (volume × per-record processing time), and loading (volume ÷ write throughput). A parallelism factor divides the total to account for concurrent workers, partitioned processing, or distributed compute. The result is an estimated wall-clock runtime.

Use this estimator for capacity planning batch jobs in Airflow, dbt, Spark, or any ETL framework. It helps you determine whether a pipeline fits within its SLA window and how much parallelism you need to meet deadlines as data volume grows.

When This Page Helps

Batch windows are finite. If your ETL doesn't finish on time, dashboards show stale data and downstream systems break. This calculator tells you whether your pipeline fits its window and how much parallelism to add when it doesn't.

How to Use the Inputs

Enter the total data volume to process in GB.
Enter the extraction throughput in MB/sec.
Enter the per-record transformation overhead in milliseconds.
Enter the estimated record count (or leave blank to calculate from volume).
Enter the load throughput in MB/sec.
Set the parallelism factor (number of concurrent workers).
Review the estimated total runtime and per-phase breakdown.

Formula used

extract_time = volume_GB × 1024 / extract_throughput_MB_sec; transform_time = records × ms_per_record / 1000; load_time = volume_GB × 1024 / load_throughput_MB_sec; total = (extract + transform + load) / parallelism

Example Calculation

Result: ~29.5 minutes total

Extract: 100 GB × 1024 / 200 MB/s = 512 sec. Transform: 500M records × 0.5ms = 250,000 sec (but parallelized). Load: 100 GB × 1024 / 150 MB/s = 683 sec. Serial total: 251,195 sec. With 4× parallelism: ~62,799 sec. In practice, extract+load overlap with transform, giving ~29.5 min effective runtime.

Tips & Best Practices

Profile each ETL phase separately to identify the bottleneck (usually transform or load).
Increase parallelism by partitioning data by date, region, or hash key.
Spark and Flink auto-scale workers; set max parallelism to fit your batch window.
Use incremental/CDC processing instead of full reloads to reduce volume.
Monitor ETL runtime trends—add alerts when runtime exceeds 70% of the batch window.
Schedule dependent ETL jobs with Airflow DAGs to maximize pipeline concurrency.

Optimizing Transform Performance

The transform phase is often the bottleneck. Optimize by: pre-filtering rows early, using vectorized operations instead of row-by-row processing, caching lookup tables in memory, and avoiding unnecessary serialization/deserialization between stages.

Incremental vs. Full Load

Full reloads process the entire dataset every run. Incremental loads process only changed records (using CDC, timestamps, or change flags). Switching from full to incremental can reduce volume by 90–99%, dramatically cutting runtime.

Monitoring and Alerting

Track ETL runtime, records processed, and error rates per run. Set alerts when runtime exceeds 70% of the batch window. Log phase-level timing to identify which phase is growing fastest. Use these trends to plan scaling before SLAs are missed.

Sources & Methodology

Last updated: February 8, 2026

Frequently Asked Questions

Extraction from databases typically achieves 50–200 MB/sec depending on network and query complexity. Loading into a warehouse runs 50–300 MB/sec. Transformation throughput varies wildly—simple column mappings are fast; complex joins and lookups are slow.