knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(dplyr)
library(ggplot2)
library(nycflights13)
Introduction
Exploratory data analysis (EDA) is not about finding one final answer.
It is about asking questions, creating summaries and visualizations,
and then using what we discover to ask new questions.
Let’s pretend we’re analysts for a large airline, or that we work in
operations at a large airport. Each group involved has its own reason
to care about delays:
- Airlines want to reduce delays (lower costs, happier customers).
- Travelers want to know which flights are more reliable.
- Airports want to manage congestion.
We will use the flights dataset from the nycflights13 package, which
contains all flights that departed from the three New York City
airports (JFK, LGA, and EWR) in 2013.
Step 1: Take a Look at the Data
flights
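Printing the tibble shows only the first rows and as many columns as fit on screen. As a complementary first look, the sketch below lists every column with its type and counts the missing values in the two delay columns. The name `na_counts` is introduced here for illustration only.

```r
library(dplyr)
library(nycflights13)

# Compact column-by-column overview: names, types, and the first few values
glimpse(flights)

# Count missing values in the delay columns used in the later steps
na_counts <- flights %>%
  summarize(
    n_flights         = n(),
    missing_dep_delay = sum(is.na(dep_delay)),
    missing_arr_delay = sum(is.na(arr_delay))
  )
na_counts
```

Knowing how many delays are missing explains why the later summaries pass `na.rm = TRUE`.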
Step 2: Overall Flight Delays
flights %>%
  summarize(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    median_dep_delay = median(dep_delay, na.rm = TRUE),
    n_flights = n()
  )
Discussion:
- The average delay is much higher than the median delay.
- Most flights leave close to their scheduled time (many even a few
minutes early), but a few extremely delayed flights pull the average up.
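To see why a handful of extreme values moves the mean but not the median, here is a minimal sketch on a made-up vector. The numbers in `toy_delays` are hypothetical and are not taken from the flights data.

```r
# Hypothetical delays in minutes (invented for illustration only)
toy_delays <- c(-5, -3, -2, 0, 1, 4, 300)

mean(toy_delays)    # pulled upward by the single 300-minute delay
median(toy_delays)  # the middle value, unaffected by how extreme the maximum is

# Capping the extreme value at 60 minutes changes the mean but not the median
toy_capped <- replace(toy_delays, toy_delays > 60, 60)
mean(toy_capped)
median(toy_capped)
```

The same mechanism is at work in the real data: the median describes a typical flight, while the mean also reflects the rare very long delays.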
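flights %>%
  filter(!is.na(dep_delay)) %>%   # drop flights with no recorded departure delay
  ggplot(aes(x = dep_delay)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  # Zoom with coord_cartesian() rather than xlim(): xlim() discards
  # out-of-range rows before binning and can drop the bars at the edges.
  coord_cartesian(xlim = c(-50, 200)) +
  labs(title = "Distribution of Departure Delays",
       x = "Departure Delay (minutes)", y = "Number of Flights")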

Business Interpretation:
- For airlines: extreme delays represent operational
failures that are costly and need special focus.
- For passengers: averages can be misleading — most flights are fine,
but some ruin the experience.
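One way to put numbers on the long tail is to look at upper quantiles and at the share of flights above a cutoff. The sketch below is only an illustration: the 60-minute cutoff and the name `tail_summary` are assumptions, not part of the original analysis.

```r
library(dplyr)
library(nycflights13)

tail_summary <- flights %>%
  filter(!is.na(dep_delay)) %>%            # keep flights with a recorded delay
  summarize(
    p50 = quantile(dep_delay, 0.50),       # median
    p90 = quantile(dep_delay, 0.90),       # 9 out of 10 flights are at or below this
    p99 = quantile(dep_delay, 0.99),       # the extreme tail
    share_over_60 = mean(dep_delay > 60)   # assumed cutoff for an "extreme" delay
  )
tail_summary
```

Reporting a few quantiles next to the mean gives airlines and passengers a clearer picture than the average alone.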
Step 3: Do Delays Differ by Airline?
flights %>%
  group_by(carrier) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(carrier, avg_dep_delay), y = avg_dep_delay)) +
  geom_col(fill = "darkred") +
  coord_flip() +
  labs(title = "Average Departure Delay by Airline",
       x = "Airline", y = "Average Delay (minutes)")

Discussion:
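- Average departure delays differ noticeably between airlines. A single
yearly average does not show how consistent each carrier is, and
carriers also differ in routes and number of flights.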
Business Interpretation:
- Airlines: compare performance against competitors.
- Travelers: choose carriers with stronger on-time records.
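The bar chart labels carriers by two-letter code and hides how many flights stand behind each average. A possible companion table, sketched below, adds flight counts, the median delay, and the full airline name from the `airlines` table that ships with nycflights13. The name `carrier_summary` is introduced here for illustration.

```r
library(dplyr)
library(nycflights13)

carrier_summary <- flights %>%
  group_by(carrier) %>%
  summarize(
    n_flights        = n(),
    avg_dep_delay    = mean(dep_delay, na.rm = TRUE),
    median_dep_delay = median(dep_delay, na.rm = TRUE)
  ) %>%
  left_join(airlines, by = "carrier") %>%  # adds the `name` column
  arrange(desc(avg_dep_delay))             # worst average delay first
carrier_summary
```

An average based on a few hundred flights deserves less weight in a comparison than one based on tens of thousands.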
Step 4: Do Delays Change by Time of Day?
flights %>%
  filter(!is.na(dep_delay)) %>%              # avoid NaN averages for hours with no recorded delays
  mutate(hour = sched_dep_time %/% 100) %>%  # e.g. 1530 -> 15
  group_by(hour) %>%
  summarize(avg_dep_delay = mean(dep_delay)) %>%
  ggplot(aes(x = hour, y = avg_dep_delay)) +
  geom_line(color = "blue", linewidth = 1.2) +
  geom_point() +
  labs(title = "Average Departure Delay by Hour of Day",
       x = "Scheduled Departure Hour", y = "Average Delay (minutes)")

Discussion:
- Average delays are lowest in the morning and build up over the
course of the day.
Business Interpretation:
- Airlines: add buffer in evening schedules to avoid knock-on
effects.
- Travelers: fly early to lower the risk of delays.
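The hourly averages can be complemented with the number of flights per hour and the share of noticeably late departures. This is a sketch under assumptions: the 15-minute cutoff mirrors the `delayed` flag defined in Step 5, and the name `hourly` is illustrative.

```r
library(dplyr)
library(nycflights13)

hourly <- flights %>%
  filter(!is.na(dep_delay)) %>%              # keep flights with a recorded delay
  mutate(hour = sched_dep_time %/% 100) %>%  # scheduled departure hour
  group_by(hour) %>%
  summarize(
    n_flights     = n(),
    avg_dep_delay = mean(dep_delay),
    share_late    = mean(dep_delay > 15)     # assumed cutoff, same as in Step 5
  )
hourly
```

Hours with very few flights produce noisy averages, so the `n_flights` column helps decide how much to trust each point on the line chart.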
Step 5: From EDA to Prediction
So far, our EDA has revealed patterns:
- Delays are not evenly distributed (most flights okay, a few very
late).
- Some airlines perform better than others.
- Time of day is strongly associated with delays.
This naturally suggests we could predict departure
delay using some of these variables. For example:
- Predictor: carrier → Some airlines are more punctual.
- Predictor: hour → Morning vs evening matters.
- Predictor: month → Seasonal weather patterns could matter.
Let’s create a simplified dataset ready for modeling:
flights_model <- flights %>%
  mutate(
    hour = sched_dep_time %/% 100,
    delayed = ifelse(dep_delay > 15, 1, 0)   # classify as delayed (>15 min)
  ) %>%
  select(carrier, month, hour, delayed) %>%
  na.omit()                                  # drops flights with no recorded delay
head(flights_model)
Discussion: Here we transformed the problem into a classification task:
- Outcome variable: delayed (1 if departure delay > 15 minutes, else 0).
- Predictors: carrier, month, hour.
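Before any classifier is fitted, it is usually worth checking how the two classes are balanced. The sketch below rebuilds the modeling table so that it runs on its own and then tabulates the outcome. The name `class_balance` is introduced for illustration.

```r
library(dplyr)
library(nycflights13)

# Same construction as in the text above
flights_model <- flights %>%
  mutate(
    hour    = sched_dep_time %/% 100,
    delayed = ifelse(dep_delay > 15, 1, 0)
  ) %>%
  select(carrier, month, hour, delayed) %>%
  na.omit()

# Number and share of on-time (0) versus delayed (1) flights
class_balance <- flights_model %>%
  count(delayed) %>%
  mutate(share = n / sum(n))
class_balance
```

If one class dominates, a model that always predicts the majority class already reaches high accuracy, so that share is a useful baseline for judging any later model.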
We are not building a model here, but the EDA gives us reason to
expect that these variables carry predictive power:
- carrier: differences are visible in the average delays.
- hour: shows a clear upward trend in delays over the day.
- month: likely captures seasonal and weather effects (e.g., winter
storms). This is an assumption, since the steps above did not look at month.
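Carrier and hour were explored in the earlier steps, but month was not. A quick check of the month assumption might look like the sketch below, which computes the share of flights delayed by more than 15 minutes in each month. The name `monthly` is illustrative.

```r
library(dplyr)
library(nycflights13)

monthly <- flights %>%
  filter(!is.na(dep_delay)) %>%          # keep flights with a recorded delay
  group_by(month) %>%
  summarize(
    n_flights     = n(),
    share_delayed = mean(dep_delay > 15) # same cutoff as the `delayed` flag
  )
monthly
```

If the share varies clearly across months, that supports keeping `month` as a predictor. With a single year of data, a seasonal pattern cannot be separated from events specific to 2013.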
Business Interpretation:
- Airlines could use such a model to anticipate delays and reallocate
resources.
- Airports could predict peak risk periods and staff accordingly.
- Passengers (via apps) could be warned of flights with higher delay
probabilities.
Step 6: Conclusion
In this notebook, we saw how EDA is an iterative
process:
- Start with overall patterns (delays in general).
- Drill down by categories (airline).
- Explore additional dimensions (time of day).
- Connect findings to predictive modeling by identifying useful predictors.
EDA is not the end — it’s the foundation for building
models that can help answer “what will happen?” instead of just
“what happened?”.