AI / ML Tribune

Flight Delays: The Long Wait

A binary classifier that flags 15-minute departure delays two hours before takeoff — built across 41.5 million flights and seven years of data, engineered and modeled in distributed PySpark on Databricks.

By Alejandra Rosas, Ambro Quach, Andrei Lupan, Annelise Meyer & Margaret Lubega

Machine Learning at Scale · PySpark · Feature Engineering · Classification · Neural Networks · 6 min read

What makes a flight late: weather stacks up

Delay rate by weather severity (overall average: 30.2%)
Normal (clear): 29.6%
High winds (> 25 mph): 42.5%
Moderate (precipitation or low visibility): 44.3%
Severe (precipitation plus low visibility): 54.3%

54.3% of flights are delayed when precipitation and low visibility stack up, nearly double a clear day's 29.6%. The model we built flags 83.6% of delays, two hours before departure.

The goal of this project was to predict whether a flight will be delayed 15 minutes or more, using only information available two hours before scheduled departure. That two-hour window is what makes the prediction useful: it gives airlines time to proactively adjust gate assignments, reallocate crews, and notify passengers before delays cascade through the network.
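To make the target and the two-hour constraint concrete, here is a minimal Python sketch. The function names and the `(timestamp, payload)` shape of an observation are hypothetical, chosen for illustration; only the 15-minute threshold and the two-hour lead come from the article.

```python
from datetime import datetime, timedelta

PREDICTION_LEAD = timedelta(hours=2)  # from the article: predict two hours out
DELAY_THRESHOLD_MIN = 15              # from the article: 15 minutes or more


def is_delayed(dep_delay_minutes):
    """Binary label: 1 if the flight departed 15+ minutes late, else 0."""
    return 1 if dep_delay_minutes >= DELAY_THRESHOLD_MIN else 0


def latest_usable_observation(observations, scheduled_departure):
    """Return the most recent observation at or before the cutoff, or None.

    `observations` is a list of (timestamp, payload) tuples; this shape is
    an assumption made for the sketch, not the project's schema.
    """
    cutoff = scheduled_departure - PREDICTION_LEAD
    usable = [obs for obs in observations if obs[0] <= cutoff]
    return max(usable, key=lambda obs: obs[0]) if usable else None
```

For a noon departure, an 11:00 weather reading is ignored and the 10:00 reading is the newest one a feature may use.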

We built our own dataset to do it — a custom join of U.S. Department of Transportation on-time flight records with matched NOAA Global Hourly weather observations, spanning 2015–2021 and roughly 41.5 million flights. Every feature was constrained to information realistically known at T-minus-two-hours, so the model never sees the future. And because a missed delay costs an airline 10–30× more than a false alarm, we optimized for recall and F2 — catching real delays matters more than avoiding the occasional unnecessary alert.
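For readers unfamiliar with F2: it is the F-beta score with beta = 2, which weights recall more heavily than precision. The sketch below uses the standard formula; the helper name `f_beta` is ours. Plugging in the precision and recall the article reports later for logistic regression and gradient-boosted trees reproduces their published F2 values.

```python
def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score; beta > 1 favors recall over precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the article's results.
print(round(f_beta(0.224, 0.836), 3))  # logistic regression -> 0.541
print(round(f_beta(0.284, 0.546), 3))  # gradient-boosted trees -> 0.461
```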

What makes a flight late

Before modeling anything, we spent a long time just looking at what actually drives a delay. A few patterns showed up clearly.

Weather stacks up. Weather effects are not linear. Normal conditions run at 29.6% delays — slightly below the overall average. High winds (over 25 mph) raise that to 42.5%, and moderate severity (heavy precipitation or low visibility) is similar at 44.3%. The biggest jump happens when both occur together: severe conditions — precipitation plus low visibility — reach 54.3% delays, a 24.7-point gap versus normal. Single conditions raise delays a lot; combined constraints are even worse. Severe cases are rare (about 0.3% of flights), but they have an outsized impact when they hit. Precipitation's effect also varies a lot by region — from an 18-point bump in the Pacific to a 28-point bump in the East South Central — so the same weather signal can mean different things depending on where it happens.
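One way to encode these tiers as a feature is sketched below. The 25 mph wind threshold comes from the article; the boolean inputs, the tier names, and the order in which the rules are checked are assumptions, since the article does not give the cutoffs used for precipitation or visibility.

```python
HIGH_WIND_MPH = 25.0  # from the article: "over 25 mph"


def weather_severity(wind_mph, has_precip, low_visibility):
    """Map raw conditions to a severity tier.

    `has_precip` and `low_visibility` are assumed to be precomputed
    booleans; how they are derived from raw readings is not specified
    in the article. Rule precedence is also an assumption.
    """
    if has_precip and low_visibility:
        return "severe"      # both constraints at once
    if has_precip or low_visibility:
        return "moderate"    # exactly one of the two
    if wind_mph > HIGH_WIND_MPH:
        return "high_winds"
    return "normal"
```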

Delays build through the day. Hour of day shows a clear cascade: delays are lowest in the morning and build as the day goes on. Early flights (7–9 AM) have the lowest rates, 13.6% to 15.9%. Delays rise through the afternoon, cross the 30.2% average around 3 PM, and peak at 38.6% around 7 PM — roughly a 25-point gap from morning to evening. Day of week interacts with this: Saturday is the best day overall (26.8%), while Thursday and Friday evenings are the worst.

Delay rate by hour of day, building through the afternoon

A late first flight haunts the rest of the day. Sequence position in an aircraft's daily rotation shows a strong cascade. The first flight has 37.3% delays; the second and third stay high (31–32%); then rates fall — 28% by the fourth, down to 17.1% by the eighth. About a 20-point improvement from first to last, as the schedule gets a chance to recover.
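Sequence position can be derived from the schedule alone, so it is safe to use two hours before departure. A minimal sketch, assuming each flight is a dict with hypothetical keys `tail_number`, `flight_date`, and `sched_dep`; the real pipeline ran in PySpark, where a window function would play the same role.

```python
from collections import defaultdict


def add_sequence_position(flights):
    """Number each aircraft's flights 1, 2, 3, ... within a day.

    Ordering uses scheduled (not actual) departure, so the feature is
    known in advance. Key names are hypothetical.
    """
    by_aircraft_day = defaultdict(list)
    for flight in flights:
        key = (flight["tail_number"], flight["flight_date"])
        by_aircraft_day[key].append(flight)

    numbered = []
    for group in by_aircraft_day.values():
        ordered = sorted(group, key=lambda f: f["sched_dep"])
        for position, flight in enumerate(ordered, start=1):
            numbered.append({**flight, "sequence_position": position})
    return numbered
```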

Execution beats size. There's no simple relationship between carrier size and delay performance. Southwest runs the largest schedule (2.38M flights) and still has 34.3% delays, about 4 points above average. At the other extreme, Alaska is much smaller (355K flights) but best-in-class at 21.5%, and Delta stays strong at 24.3% across 1.6M flights. Across the 16 carriers with 50,000+ flights, delay rates span 21.5% (Alaska) to 40.1% (JetBlue) — an 18.6-point range. Carriers with similar scale can have very different outcomes, so execution matters more than volume alone.

Delay rate by carrier — no clean relationship with size

Mid-range flights sit in an awkward zone. Delay rates rise from 28.1% on the shortest routes to a peak of 33.4% at 1,000–1,250 miles, then settle around 30–31% for longer hauls. This inverted-U suggests mid-range flights aren't short enough for quick turnarounds, but not long enough to benefit from larger schedule buffers.

Delay rate by distance group — an inverted-U peaking mid-range
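A breakdown like this one can be computed by bucketing distance and averaging the delay label per bucket. The sketch below assumes 250-mile bins, inferred from the "1,000 to 1,250 miles" group named above; the input shape of `(distance_miles, delayed)` pairs is hypothetical.

```python
from collections import defaultdict

BIN_MILES = 250  # assumption, inferred from the 1,000-1,250 mile group


def delay_rate_by_distance_group(flights, bin_miles=BIN_MILES):
    """Return {bin lower bound in miles: share of flights delayed}.

    `flights` is an iterable of (distance_miles, delayed) pairs, where
    `delayed` is 0 or 1.
    """
    totals = defaultdict(int)
    delayed_counts = defaultdict(int)
    for distance_miles, delayed in flights:
        lower = int(distance_miles // bin_miles) * bin_miles
        totals[lower] += 1
        delayed_counts[lower] += int(delayed)
    return {lower: delayed_counts[lower] / totals[lower] for lower in sorted(totals)}
```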

Geography matters — but volume doesn't. Delay rates by region range from 26.7% (West North Central) to 35.2% (Mid-Atlantic). The Northeast is consistently the worst, which fits congested airspace and hubs like EWR, JFK, LGA, and BOS; the West performs best. And volume alone doesn't explain delays — the Pacific division handles 1.82M flights and still holds 28.6%, while the Mid-Atlantic has fewer flights but sits at 35.2%.

Delay rate by U.S. census division

Being a big hub doesn't mean being late. Airport network centrality (PageRank) shows that hub importance doesn't automatically translate into higher delays. Atlanta dominates the network yet sits at 23–27% delays, well below the 30.2% average. There's no clean linear relationship, and the smaller airports show the widest spread of all — anywhere from 15% to 63%.

Airport centrality (PageRank) versus delay rate
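For intuition about the centrality score, here is a small pure-Python PageRank by power iteration over a route graph weighted by flight counts. The damping factor of 0.85, the edge weighting, and the input shape are assumptions made for this sketch; the article does not say how its PageRank was configured or which library computed it.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Weighted PageRank via power iteration.

    `edges` maps (origin, destination) to a flight count. Airports with
    no outgoing flights spread their rank evenly over all airports.
    """
    nodes = sorted({airport for edge in edges for airport in edge})
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (origin, _dest), weight in edges.items():
        if weight > 0:
            out_weight[origin] += weight

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for (origin, dest), weight in edges.items():
            if weight <= 0:
                continue  # ignore empty routes
            new_rank[dest] += damping * rank[origin] * weight / out_weight[origin]
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank
```

On a toy hub-and-spoke network the hub collects the highest score, which is the sense in which a score like this measures hub importance rather than delay performance.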

Building it at scale

With the patterns mapped, the modeling problem was mostly one of scale and discipline. The full dataset is 41.5 million flights across seven years, so everything ran in distributed PySpark on Databricks.

The discipline part was avoiding leakage. We used a 13-fold blocked time-series cross-validation — six-month training windows, three-month validation windows, with a one-day embargo between them so no fold could peek at the future. To handle class imbalance (only about 20% of flights are delayed), we undersampled the majority class in the training folds only, leaving validation and test sets at their natural rates. Then we evaluated four model families: logistic regression, random forest, gradient-boosted trees, and a multilayer-perceptron neural network.
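The fold layout and the training-only undersampling can be sketched in plain Python. Several details here are assumptions, not taken from the article: the six-month step between folds, the modeling range of January 2015 through September 2021 (with Q4 2021 held out), the 1:1 class ratio after undersampling, and the `delayed` key. Under those assumptions the schedule happens to produce 13 folds.

```python
import random
from datetime import date, timedelta


def add_months(first_of_month, months):
    """Shift a first-of-month date by a whole number of months."""
    total = first_of_month.year * 12 + (first_of_month.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def blocked_folds(data_start, data_end, train_months=6, val_months=3,
                  step_months=6, embargo_days=1):
    """Return (train_start, train_end, val_start, val_end) tuples.

    Windows are half-open [start, end). Rows dated in the embargo gap
    [train_end, val_start) are used by neither side.
    """
    folds = []
    train_start = data_start
    while True:
        train_end = add_months(train_start, train_months)
        val_start = train_end + timedelta(days=embargo_days)
        val_end = add_months(train_end, val_months)
        if val_end > data_end:
            break
        folds.append((train_start, train_end, val_start, val_end))
        train_start = add_months(train_start, step_months)
    return folds


def undersample_majority(rows, label_key="delayed", seed=0):
    """Keep every minority-class row and an equal-sized random sample of
    the majority class. Apply to training folds only."""
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    kept = random.Random(seed).sample(majority, len(minority))
    return minority + kept


folds = blocked_folds(date(2015, 1, 1), date(2021, 10, 1))
print(len(folds))  # 13 under the assumptions above
```

Validation and test rows are never passed through `undersample_majority`, so their class balance stays at the natural rate.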

Did it work?

On a held-out test set — all of Q4 2021, 1.64 million flights the models had never seen — here's how they landed:

| Model | F2 | Precision | Recall | AUC-PR | Strength |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression (threshold-tuned) | 0.541 | 0.224 | 0.836 | 0.265 | Highest recall: catches 83.6% of delays |
| Gradient Boosted Trees | 0.461 | 0.284 | 0.546 | 0.303 | Best precision and ranking ability |
| Random Forest | 0.438 | 0.219 | 0.610 | 0.246 | Fast baseline (~88 min to train) |

Logistic regression, with a tuned decision threshold, caught the most delays: 83.6% of them, at an F2 of 0.541. Simply moving the threshold from 0.5 to 0.4 lifted recall from 54.6% to 83.6% at no extra training cost, catching 5 of every 6 delayed flights. The price is precision: at 22.4%, roughly three of every four alerts are false alarms. That's the right trade for operations, where a missed delay is far costlier than an unnecessary alert.
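Threshold tuning needs no retraining: score the validation set once, then pick the cutoff that maximizes F2. The sketch below is illustrative; the function names and the candidate grid are assumptions, and in practice the threshold should be chosen on validation folds rather than on the held-out test set.

```python
def confusion_counts(y_true, scores, threshold):
    """Count true positives, false positives, and false negatives."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_delay = score >= threshold
        if predicted_delay and label == 1:
            tp += 1
        elif predicted_delay and label == 0:
            fp += 1
        elif not predicted_delay and label == 1:
            fn += 1
    return tp, fp, fn


def f2_at(y_true, scores, threshold):
    """F2 expressed in counts: 5*TP / (5*TP + 4*FN + FP)."""
    tp, fp, fn = confusion_counts(y_true, scores, threshold)
    denominator = 5 * tp + 4 * fn + fp
    return 5 * tp / denominator if denominator else 0.0


def best_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the highest F2."""
    return max(candidates, key=lambda t: f2_at(y_true, scores, t))
```

Lowering the threshold can only add predicted delays, so recall never drops; F2 decides whether the extra false alarms are worth it.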

Gradient-boosted trees had the best precision (28.4%) and ranking ability (AUC-PR 0.303) — better for deciding which flights are truly highest-risk when resources are limited. Random forest was the fast baseline.
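"Ranking ability" here means how well the model orders flights from most to least likely to be delayed, independent of any threshold. Average precision is one common estimate of the area under the precision-recall curve; the article does not say which estimator produced its AUC-PR figures, so treat this as an illustration of the idea rather than the project's metric code.

```python
def average_precision(y_true, scores):
    """Mean of the precision values taken at each true delay, walking
    down the list from highest score to lowest.

    Tied scores keep their input order, so results can depend on how
    ties are arranged.
    """
    total_positives = sum(y_true)
    if total_positives == 0:
        return 0.0
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for position, (_score, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / position
    return precision_sum / total_positives
```

A model that puts every delayed flight ahead of every on-time flight scores 1.0; the further delayed flights sink down the ranking, the lower the score.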

One honest caveat: every model degraded from training to test (GBT went from 0.627 to 0.461 F2). That gap is a distribution shift — the training data spans 2015–2021, but the test quarter is late 2021, deep in post-COVID operations that never fully returned to the old patterns.

What we'd do next

The results show this is operationally feasible with existing infrastructure: you can forecast a 15-minute delay two hours out, accurately enough to act on. The clearest next steps are drift-aware retraining to handle that post-COVID shift, explicit COVID-era indicators for 2020–2021, and richer features like interaction terms and airline-specific behavior.

A machine-learning-at-scale project from UC Berkeley's School of Information, by Alejandra Rosas, Ambro Quach, Andrei Lupan, Annelise Meyer, and Margaret Lubega.
