This data dive examines how reading the documentation affects interpretation of the flights table in the nycflights13 package. By identifying unclear variables and investigating how missing values and encoding choices influence analysis, it highlights the importance of consulting documentation before building models or drawing conclusions.
At first glance, dep_time appears numeric, but documentation reveals that it is recorded in HHMM format (e.g., 517 = 5:17 AM), not as a continuous numeric measure.
If this documentation were ignored:
We might treat it as a linear numeric variable
Differences would be interpreted incorrectly (e.g., 900 − 830 = 70, although 9:00 AM is only 30 minutes after 8:30 AM)
Why encode it this way?
Likely reflects how airline systems store timestamps.
Maintains compact numeric storage
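One way to make the HHMM encoding safe for arithmetic is to convert it to minutes since midnight before taking differences. The sketch below is a minimal base R illustration; the helper name hhmm_to_minutes is hypothetical and is not part of nycflights13.

```r
# Hypothetical helper (not part of nycflights13): convert an HHMM-coded
# clock time such as 517 or 1430 into minutes since midnight.
hhmm_to_minutes <- function(hhmm) {
  hours   <- hhmm %/% 100  # integer division keeps the hour digits
  minutes <- hhmm %% 100   # remainder keeps the minute digits
  hours * 60 + minutes
}

# Treating HHMM as a plain number versus converting first
naive_gap <- 900 - 830
true_gap  <- hhmm_to_minutes(900) - hhmm_to_minutes(830)

naive_gap
true_gap
```

The naive subtraction counts a full 100 units per hour, which is why it overstates any gap that crosses an hour boundary. Missing departure times stay NA under this conversion.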
carrier contains short two-character codes (e.g., UA, DL, B6). Without documentation, these codes are not meaningful; the documentation points to the companion airlines table for the full carrier names.
Why encode this way?
Industry-standard airline codes
Efficient storage
Matches FAA/IATA conventions
Without documentation:
Airline-level results could be attributed to the wrong carrier
Codes could be mislabeled in tables and plots
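A simple safeguard is to attach the full airline names before reporting anything by carrier. The sketch below assumes the airlines lookup table that ships with nycflights13 (columns carrier and name); the object name carrier_names and the column name n_flights are illustrative choices.

```r
library(dplyr)
library(nycflights13)

# Count flights per carrier code, then attach the full airline name from the
# companion `airlines` table so results are not reported by bare code.
carrier_names <- flights |>
  count(carrier, name = "n_flights") |>
  left_join(airlines, by = "carrier") |>
  arrange(desc(n_flights))

carrier_names
```

A left join keeps every carrier found in flights, so any code without a match in the lookup would show up as a missing name rather than silently disappearing.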
The variable dep_delay contains both negative and positive values.
Documentation explains:
Negative values = early departures
However, it does not fully clarify:
Whether negative delays represent true early pushback or artifacts of rounding in the scheduled times
Whether early departures are operationally meaningful or just artifacts of schedule padding
This ambiguity could affect interpretation of “on-time performance.”
library(tidyverse)
library(nycflights13)

# Histogram of departure delays, highlighting early (negative) departures
flights |>
  filter(!is.na(dep_delay)) |>
  ggplot(aes(x = dep_delay, fill = dep_delay < 0)) +
  geom_histogram(bins = 40, color = "white") +
  # Name the colors so each one is tied to the intended logical value
  scale_fill_manual(values = c("FALSE" = "grey70", "TRUE" = "steelblue")) +
  labs(
    title = "Distribution of Departure Delays",
    x = "Departure Delay (minutes)",
    y = "Number of Flights",
    fill = "Early Departure"
  ) +
  theme_classic()
The histogram shows a large concentration of flights around zero minutes, with a substantial number of negative delay values representing early departures. While the documentation states that negative values indicate early departure, it does not fully clarify whether these represent true operational efficiency or schedule padding. This distinction is important because early departures reduce the overall mean delay. If early departures reflect scheduling practices rather than performance, including them in summary statistics could underestimate actual delay severity.
# Compare the mean delay with and without early (negative) departures
delay_summary <- flights |>
  filter(!is.na(dep_delay)) |>
  summarise(
    mean_with_neg    = mean(dep_delay),
    mean_without_neg = mean(dep_delay[dep_delay >= 0])
  )

delay_summary
## # A tibble: 1 × 2
##   mean_with_neg mean_without_neg
##           <dbl>            <dbl>
## 1          12.6             34.9
The mean departure delay changes substantially depending on whether early departures are included. When negative delays are included, the average delay is approximately 12.6 minutes. However, when early departures are excluded, the average delay increases to approximately 34.9 minutes. This large difference demonstrates how encoding early departures as negative values significantly affects summary statistics. Without carefully referencing documentation and clearly defining how delay is measured, conclusions about airline performance could be misleading.
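To see why the definition matters, it can help to compute several candidate definitions side by side on a tiny example. The sketch below uses a hypothetical toy vector, not values from the flights table. The third definition, which counts early departures as zero delay, is an additional assumption introduced here for illustration and is not used elsewhere in this analysis.

```r
# Hypothetical toy delays in minutes (not taken from the flights table).
toy_delay <- c(-10, -5, 0, 20, 45)

delay_definitions <- c(
  signed_mean   = mean(toy_delay),                  # early departures offset late ones
  late_only     = mean(toy_delay[toy_delay >= 0]),  # early departures dropped
  early_as_zero = mean(pmax(toy_delay, 0))          # early departures counted as on time
)

round(delay_definitions, 1)
```

The three numbers answer different questions, so a report should state which one it uses rather than simply calling the figure "average delay."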
Misinterpreting negative delays: The large difference between the mean delay including and excluding early departures shows how encoding choices can substantially alter conclusions about airline performance.
Mechanical or encoding-driven bias: Derived or encoded variables (such as delay values and time formats) may introduce structural effects that influence summary statistics.
Ignoring implicit structural gaps: Missing combinations of categorical variables (e.g., certain carriers not operating at certain airports) may be mistakenly treated as data errors rather than structural realities.
Clearly define how delay is measured and report whether early departures are included.
Present summary statistics both with and without negative values when appropriate.
Convert time-coded variables into proper time objects before analysis.
Check for explicit and implicit missing data before modeling.
Document all variable transformations and assumptions.
# Explicit missing values: count NAs in arrival delay and air time
flights |>
  summarise(
    missing_arr_delay = sum(is.na(arr_delay)),
    missing_air_time  = sum(is.na(air_time))
  )
## # A tibble: 1 × 2
##   missing_arr_delay missing_air_time
##               <int>            <int>
## 1              9430             9430
The dataset contains 9,430 missing values in each of arr_delay and air_time. The identical counts suggest that these are not random data entry errors but structurally missing information, although matching counts alone do not prove that the same flights are affected; confirming that requires a row-level check. Most likely, these rows correspond to flights that were cancelled or diverted and therefore have no recorded arrival delay or air time. If so, missingness in this dataset is meaningful and tied to operational outcomes rather than measurement noise.
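Whether the two sets of missing values fall on the same flights can be checked by cross-tabulating the missingness indicators. The sketch below is one possible check; the column names no_dep_time, no_arr_delay, and no_air_time are illustrative. Reading a missing dep_time as a cancellation is an assumption, since the dataset has no explicit cancellation flag.

```r
library(dplyr)
library(nycflights13)

# Cross-tabulate missingness to see whether the NAs fall on the same rows
# and how they relate to flights with no recorded departure time.
missing_pattern <- flights |>
  count(
    no_dep_time  = is.na(dep_time),
    no_arr_delay = is.na(arr_delay),
    no_air_time  = is.na(air_time)
  )

missing_pattern
```

If arr_delay and air_time are always missing together, every row of the result will have the same value in the last two indicator columns. Rows that departed but have no arrival information would be the candidates for diverted flights.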
# Implicit missing rows: expand the counts to every origin-carrier pair
flights |>
  count(origin, carrier) |>
  complete(origin, carrier)
## # A tibble: 48 × 3
##    origin carrier     n
##    <chr>  <chr>   <int>
##  1 EWR    9E       1268
##  2 EWR    AA       3487
##  3 EWR    AS        714
##  4 EWR    B6       6557
##  5 EWR    DL       4342
##  6 EWR    EV      43939
##  7 EWR    F9         NA
##  8 EWR    FL         NA
##  9 EWR    HA         NA
## 10 EWR    MQ       2276
## # ℹ 38 more rows
Implicit missing rows appear when completing all possible combinations of origin and carrier. Several origin–carrier pairs have n = NA, indicating that those airline–airport combinations do not exist in the dataset. These absences are structural rather than accidental; certain airlines simply do not operate from certain airports. This highlights the importance of distinguishing between structural absence and true missing data when preparing data for modeling.
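When an absent pair is known to be structural, one option is to record it as a count of zero rather than NA so that downstream code does not treat it as unknown. The sketch below uses the fill argument of tidyr's complete(). The assumption that "absent" means "zero flights" is mine, and the served column is an illustrative addition.

```r
library(dplyr)
library(tidyr)
library(nycflights13)

origin_carrier <- flights |>
  count(origin, carrier) |>
  # Assumption: an absent origin-carrier pair means zero flights, not unknown.
  complete(origin, carrier, fill = list(n = 0L)) |>
  mutate(served = n > 0)

# Pairs with no flights at all in the data
origin_carrier |>
  filter(!served)
```

Keeping an explicit flag such as served makes the structural absence visible in later summaries instead of hiding it inside a zero.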
# Quartiles and interquartile range of departure delay
delay_stats <- flights |>
  filter(!is.na(dep_delay)) |>
  summarise(
    Q1  = quantile(dep_delay, 0.25),
    Q3  = quantile(dep_delay, 0.75),
    IQR = IQR(dep_delay)
  )

delay_stats
## # A tibble: 1 × 3
##      Q1    Q3   IQR
##   <dbl> <dbl> <dbl>
## 1    -5    11    16
The first quartile of departure delay is -5 minutes and the third quartile is 11 minutes, yielding an interquartile range of 16 minutes. Using the 1.5 × IQR rule, the lower fence is -5 − 1.5 × 16 = -29 minutes and the upper fence is 11 + 1.5 × 16 = 35 minutes, so values below -29 or above 35 minutes are considered outliers. This indicates that flights departing more than 35 minutes late represent unusually large delays relative to the typical distribution. Because the distribution is strongly right-skewed, extreme positive delays are far more common than extreme early departures. These outliers may represent significant operational disruptions rather than data entry errors, and they can substantially influence summary statistics such as the mean.
Because delay data are heavily right-skewed, the IQR method may flag a large number of flights as outliers. Therefore, it is important to consider whether extreme delays reflect true events or data quality problems before removing them from analysis.
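The fence calculation, and the share of flights it flags, can be sketched as follows. The helper iqr_fences is hypothetical, and the toy vector is constructed only so that its quartiles match the reported values of -5 and 11. The final step runs only if nycflights13 is installed.

```r
# Hypothetical helper: Tukey-style fences at k * IQR beyond the quartiles.
iqr_fences <- function(x, k = 1.5) {
  x <- x[!is.na(x)]
  q <- unname(quantile(x, c(0.25, 0.75)))
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Toy vector chosen so that Q1 = -5 and Q3 = 11, as in the summary above.
toy_quartiles <- c(-5, -5, 11, 11)
fences <- iqr_fences(toy_quartiles)
fences

# With nycflights13 installed, the same helper applies to the real delays.
if (requireNamespace("nycflights13", quietly = TRUE)) {
  dep_delay   <- nycflights13::flights$dep_delay
  real_fences <- iqr_fences(dep_delay)
  # Share of flights beyond each fence
  c(
    share_below = mean(dep_delay < real_fences[["lower"]], na.rm = TRUE),
    share_above = mean(dep_delay > real_fences[["upper"]], na.rm = TRUE)
  )
}
```

Comparing the two shares gives a direct sense of how lopsided the flagged set is before deciding whether to keep, cap, or remove those flights.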
This data dive highlights the importance of reviewing documentation before interpreting statistical results. Several variables, such as dep_time, carrier, and delay measures, are encoded in ways that are not immediately clear without documentation. For instance, negative departure delays represent early departures, which significantly affect summary statistics. The large difference in mean delay when excluding early departures demonstrates how encoding choices can alter conclusions.
The missing data analysis shows that the identical number of missing values in arr_delay and air_time likely reflects cancelled or diverted flights rather than random data gaps. Ignoring these rows changes the population being analyzed and may underestimate operational disruptions. Similarly, implicit missing rows in categorical combinations reflect structural absences rather than data errors.
Outlier analysis indicates that delays greater than 35 minutes are statistically unusual, but given the strong right skew, these values may represent real operational events rather than mistakes. Overall, this investigation reinforces that statistical measures must be interpreted in the context of variable definitions, missingness structure, and documentation to avoid misleading conclusions.