knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(dplyr)
library(ggplot2)
library(nycflights13)

Introduction

EDA is not about finding one final answer, but about asking questions, creating summaries and visualizations, and then using what we discover to ask new questions.

Let’s pretend we’re analysts for a large airline, or that we work in operations at a large airport. Delays matter to several groups:

  • Airlines want to reduce delays (lower costs, happier customers).
  • Travelers want to know which flights are more reliable.
  • Airports want to manage congestion.

We will use the flights dataset from the nycflights13 package, which records all flights that departed from New York City's three major airports (JFK, LGA, and EWR) in 2013.


Step 1: Take a Look at the Data

flights
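Printing the tibble shows only the first rows and as many columns as fit on screen. As a complementary first look, the sketch below lists every column with `glimpse()` and counts missing values in the two delay columns. The name `missing_summary` is introduced here for illustration only; it is not part of the original notebook.

```r
library(dplyr)
library(nycflights13)

# Compact overview: one line per column with its type and first values
glimpse(flights)

# Count missing values in the delay columns (illustrative helper table)
missing_summary <- flights %>%
  summarize(
    n_flights         = n(),
    missing_dep_delay = sum(is.na(dep_delay)),
    missing_arr_delay = sum(is.na(arr_delay))
  )

missing_summary
```

Knowing how many delays are missing matters later, because every summary in this notebook has to decide what to do with those rows.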

Step 2: Overall Flight Delays

flights %>%
  summarize(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    median_dep_delay = median(dep_delay, na.rm = TRUE),
    n_flights = n()
  )

Discussion:

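  • The average delay (about 12.6 minutes) is much higher than the median delay (-2 minutes).
  • The negative median means that at least half of all flights leave on time or slightly early, while a small number of extremely delayed flights pull the average up.

flights %>%
  filter(!is.na(dep_delay)) %>%   # drop flights with no recorded departure delay
  ggplot(aes(x = dep_delay)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  # zoom with coord_cartesian() rather than xlim(), which would discard
  # out-of-range flights and edge bins before the bars are drawn
  coord_cartesian(xlim = c(-50, 200)) +
  labs(title = "Distribution of Departure Delays",
       x = "Departure Delay (minutes)", y = "Number of Flights")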

Business Interpretation:

  • For airlines: extreme delays are rare but costly, so they deserve separate investigation rather than being hidden inside an average.
  • For passengers: averages can be misleading. Most flights are fine, but a few very long delays ruin the experience.
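To put numbers on the long right tail, one option is to look at upper quantiles and at the share of flights delayed by more than an hour. This is a minimal sketch; the 60-minute cutoff and the name `delay_quantiles` are assumptions chosen for illustration, not values from the original analysis.

```r
library(dplyr)
library(nycflights13)

delay_quantiles <- flights %>%
  filter(!is.na(dep_delay)) %>%   # ignore flights with no recorded delay
  summarize(
    p50 = quantile(dep_delay, 0.50, names = FALSE),  # median
    p90 = quantile(dep_delay, 0.90, names = FALSE),  # 9 in 10 flights are at or below this
    p99 = quantile(dep_delay, 0.99, names = FALSE),  # the extreme tail
    share_over_60 = mean(dep_delay > 60)             # proportion delayed more than an hour
  )

delay_quantiles
```

A large gap between `p50` and `p99` is the numeric counterpart of the skew visible in the histogram.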

Step 3: Do Delays Differ by Airline?

flights %>%
  group_by(carrier) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(carrier, avg_dep_delay), y = avg_dep_delay)) +
  geom_col(fill = "darkred") +
  coord_flip() +
  labs(title = "Average Departure Delay by Airline",
       x = "Airline", y = "Average Delay (minutes)")

Discussion:

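  • Average departure delay varies noticeably across airlines. An average alone does not show how consistent a carrier is, and carriers with few flights in the data produce noisier averages.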

Business Interpretation:

  • Airlines: compare performance against competitors.
  • Travelers: choose carriers with stronger on-time records.
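Before ranking carriers on an average, it can help to see how many flights stand behind each bar and whether the median tells the same story. The sketch below is one way to do that; `carrier_summary` is a name introduced here, and the join uses the `airlines` lookup table that ships with nycflights13 to show full airline names.

```r
library(dplyr)
library(nycflights13)

carrier_summary <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(carrier) %>%
  summarize(
    n_flights        = n(),               # sample size behind each average
    avg_dep_delay    = mean(dep_delay),
    median_dep_delay = median(dep_delay), # less sensitive to extreme delays
    .groups = "drop"
  ) %>%
  left_join(airlines, by = "carrier") %>% # adds the full airline name
  arrange(desc(avg_dep_delay))

carrier_summary
```

Carriers with a small `n_flights` deserve extra caution, since a handful of bad days can dominate their average.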

Step 4: Do Delays Change by Time of Day?

flights %>%
  filter(!is.na(dep_delay)) %>%              # drop flights with no recorded delay so no hour averages to NaN
  mutate(hour = sched_dep_time %/% 100) %>%  # integer division: 1430 %/% 100 gives hour 14
  group_by(hour) %>%
  summarize(avg_dep_delay = mean(dep_delay)) %>%
  ggplot(aes(x = hour, y = avg_dep_delay)) +
  geom_line(color = "blue", linewidth = 1.2) +
  geom_point() +
  labs(title = "Average Departure Delay by Hour of Day",
       x = "Scheduled Departure Hour", y = "Average Delay (minutes)")

Discussion:

  • Average delays are lowest in the early morning and rise through most of the day, peaking in the evening.

Business Interpretation:

  • Airlines: add buffer in evening schedules to avoid knock-on effects.
  • Travelers: fly early to avoid delays.
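The hourly averages can be checked against flight counts and against the share of flights that are meaningfully late. The sketch below assumes a 15-minute threshold for "late" (the same cutoff used later in this notebook); `hourly_summary` is an illustrative name.

```r
library(dplyr)
library(nycflights13)

hourly_summary <- flights %>%
  filter(!is.na(dep_delay)) %>%
  mutate(hour = sched_dep_time %/% 100) %>%   # scheduled hour, 0 to 23
  group_by(hour) %>%
  summarize(
    n_flights     = n(),                 # how busy each hour is
    avg_dep_delay = mean(dep_delay),
    share_delayed = mean(dep_delay > 15), # proportion more than 15 minutes late
    .groups = "drop"
  )

hourly_summary
```

If `share_delayed` climbs along with `avg_dep_delay`, the evening increase reflects more late flights overall rather than a few extreme outliers.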

Step 5: From EDA to Prediction

So far, our EDA has revealed patterns:

  • Delays are not evenly distributed (most flights okay, a few very late).
  • Some airlines perform better than others.
  • Time of day is strongly associated with delays.

This naturally suggests we could predict departure delay using some of these variables. For example:

  • Predictor: carrier → Some airlines are more punctual.
  • Predictor: hour → Morning vs evening matters.
  • Predictor: month → Seasonal weather patterns could matter.

Let’s create a simplified dataset ready for modeling:

flights_model <- flights %>%
  mutate(
    hour = sched_dep_time %/% 100,
    delayed = ifelse(dep_delay > 15, 1, 0)   # classify as delayed (>15 min)
  ) %>%
  select(carrier, month, hour, delayed) %>%
  na.omit()

head(flights_model)

Discussion: Here we transformed the problem into a classification task:

  • Outcome variable: delayed (1 if departure delay > 15 minutes, else 0).
  • Predictors: carrier, month, hour.
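For any classification task it is worth checking how balanced the outcome is before modeling. The sketch below rebuilds the modeling table under a separate, hypothetical name (`flights_model_check`) so that it runs on its own, then tabulates the two classes.

```r
library(dplyr)
library(nycflights13)

# Same construction as the modeling dataset, under an illustrative name
flights_model_check <- flights %>%
  mutate(
    hour    = sched_dep_time %/% 100,
    delayed = ifelse(dep_delay > 15, 1, 0)
  ) %>%
  select(carrier, month, hour, delayed) %>%
  na.omit()

# Count and share of on-time (0) versus delayed (1) flights
class_balance <- flights_model_check %>%
  count(delayed) %>%
  mutate(share = n / sum(n))

class_balance
```

If delayed flights turn out to be a clear minority, plain accuracy would be a weak yardstick for a future model, because always predicting "not delayed" would already score well.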

We are not building a model here, but the EDA gives us reason to expect that these variables carry predictive signal, which a fitted model would still need to confirm:

  • carrier differences are visible in averages.
  • hour shows a clear upward trend in delays.
  • month likely captures seasonal/weather effects (e.g., winter storms).
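Unlike carrier and hour, month has not been examined in this notebook, so the seasonal claim is still a hypothesis. A quick check could look like the sketch below, which again assumes the 15-minute threshold; `monthly_summary` and `monthly_plot` are illustrative names.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

monthly_summary <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(month) %>%
  summarize(
    n_flights     = n(),
    avg_dep_delay = mean(dep_delay),
    share_delayed = mean(dep_delay > 15),  # proportion more than 15 minutes late
    .groups = "drop"
  )

# Bar chart of the delayed share, one bar per calendar month
monthly_plot <- ggplot(monthly_summary, aes(x = factor(month), y = share_delayed)) +
  geom_col(fill = "steelblue") +
  labs(title = "Share of Flights Delayed More Than 15 Minutes, by Month",
       x = "Month", y = "Share of Flights Delayed")

monthly_plot
```

Whether the highest bars fall in winter, summer, or both is exactly what this check is meant to reveal before month is treated as a weather proxy.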

Business Interpretation:

  • Airlines could use such a model to anticipate delays and reallocate resources.
  • Airports could predict peak risk periods and staff accordingly.
  • Passengers (via apps) could be warned of flights with higher delay probabilities.

Step 6: Conclusion

In this notebook, we saw how EDA is an iterative process:

  1. Start with overall patterns (delays in general).
  2. Drill down by categories (airline).
  3. Explore additional dimensions (time of day).
  4. Connect findings to predictive modeling: identifying useful predictors.

EDA is not the end — it’s the foundation for building models that can help answer “what will happen?” instead of just “what happened?”.

