Data Dive: Model Documentation and Data Understanding


Introduction

This data dive examines how reading documentation affects interpretation of the flights data from the nycflights13 package. It identifies variables that are unclear without documentation and investigates how missing values and encoding choices influence analysis, showing why documentation should be consulted before building models or drawing conclusions.


Columns That Were Unclear Without Documentation

(1) dep_time

At first glance, dep_time appears numeric, but documentation reveals that it is recorded in HHMM format (e.g., 517 = 5:17 AM), not as a continuous numeric measure.

If this documentation were ignored:

  • We might treat it as a linear numeric variable

  • Differences would be interpreted incorrectly (e.g., 900 − 830 = 70 on the raw codes, but 8:30 AM to 9:00 AM is only 30 minutes)

Why encode it this way?

  • Likely reflects how airline systems store timestamps.

  • Maintains compact numeric storage
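
The subtraction pitfall described above can be made concrete with a small sketch. The helper name hhmm_to_minutes is hypothetical (it is not part of nycflights13). It assumes times arrive as integers in HHMM or HMM form and, as a further assumption, treats a value of 2400 as midnight.

```r
# Hypothetical helper: convert an HHMM-coded clock time to minutes since midnight.
# Assumes integer input such as 517 (5:17 AM) or 1545 (3:45 PM).
hhmm_to_minutes <- function(hhmm) {
  hours   <- (hhmm %/% 100) %% 24  # assumption: 2400 wraps around to hour 0
  minutes <- hhmm %% 100
  hours * 60 + minutes
}

naive_gap <- 900 - 830                                    # arithmetic on the raw codes
true_gap  <- hhmm_to_minutes(900) - hhmm_to_minutes(830)  # arithmetic on real minutes

print(c(naive = naive_gap, actual = true_gap))
```

The raw subtraction gives 70, while the converted values give the correct 30 minutes. Missing times stay NA, so the helper can be applied to a whole column before modeling.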

(2) carrier

carrier contains short two-letter codes (e.g., UA, DL, B6). Without documentation, these codes are not meaningful.

Why encode this way?

  • Industry-standard airline codes

  • Efficient storage

  • Matches FAA/IATA conventions

Without documentation:

  • Airline-level conclusions could be misinterpreted

  • Mislabeling could occur
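
One way to reduce the mislabeling risk is to attach airline names before reporting anything at the carrier level. The sketch below uses the airlines lookup table that ships with nycflights13 (columns carrier and name). The object name flights_by_airline and the column name n_flights are illustrative choices, not part of the original analysis.

```r
library(dplyr)
library(nycflights13)

# Count flights per carrier code, then attach the full airline name
# from the airlines lookup table so results are readable without a codebook.
flights_by_airline <- flights |>
  count(carrier, name = "n_flights") |>
  left_join(airlines, by = "carrier") |>
  arrange(desc(n_flights))

flights_by_airline
```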


Something Still Unclear After Documentation

The variable dep_delay contains both negative and positive values.

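Documentation explains that dep_delay is measured in minutes and that negative values represent early departures. It does not say whether early departures reflect true operational efficiency or schedule padding.

This ambiguity could affect interpretation of “on-time performance.”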


Visualizing the Documentation Issue (dep_delay)

Visualization 1: Distribution Highlighting Negative Delays

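library(tidyverse)
library(nycflights13)

# Histogram of departure delays; bars are colored by whether the flight left early
flights |>
  filter(!is.na(dep_delay)) |>   # drop flights with no recorded departure delay
  ggplot(aes(x = dep_delay, fill = dep_delay < 0)) +
  geom_histogram(bins = 40, color = "white") +
  # Name the colors so FALSE (on time or late) and TRUE (early) map explicitly
  scale_fill_manual(values = c("FALSE" = "grey70", "TRUE" = "steelblue")) +
  labs(
    title = "Distribution of Departure Delays",
    x = "Departure Delay (minutes)",
    y = "Number of Flights",
    fill = "Early Departure"
  ) +
  theme_classic()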

Interpretation

The histogram shows a large concentration of flights around zero minutes, with a substantial number of negative delay values representing early departures. While the documentation states that negative values indicate early departure, it does not fully clarify whether these represent true operational efficiency or schedule padding. This distinction is important because early departures reduce the overall mean delay. If early departures reflect scheduling practices rather than performance, including them in summary statistics could underestimate actual delay severity.

Visualization 2: Mean Delay Including vs Excluding Negative Values

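# Compare the mean delay with early departures kept versus dropped
delay_summary <- flights |>
  filter(!is.na(dep_delay)) |>
  summarise(
    mean_with_neg = mean(dep_delay),                    # all recorded delays
    mean_without_neg = mean(dep_delay[dep_delay >= 0])  # on-time and late flights only
  )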

delay_summary
## # A tibble: 1 × 2
##   mean_with_neg mean_without_neg
##           <dbl>            <dbl>
## 1          12.6             34.9

Interpretation

The mean departure delay changes substantially depending on whether early departures are included. When negative delays are included, the average delay is approximately 12.6 minutes. However, when early departures are excluded, the average delay increases to approximately 34.9 minutes. This large difference demonstrates how encoding early departures as negative values significantly affects summary statistics. Without carefully referencing documentation and clearly defining how delay is measured, conclusions about airline performance could be misleading.
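
Dropping early departures is only one way to define delay. The sketch below compares it with two other conventions, offered as illustrations rather than as the method this analysis adopts: counting early departures as zero delay, and reporting the share of flights that leave late. The helper name delay_definitions is hypothetical.

```r
# Hypothetical helper comparing several definitions of "average delay".
# x is a numeric vector of delays in minutes; negative values mean early.
delay_definitions <- function(x) {
  x <- x[!is.na(x)]
  data.frame(
    mean_signed      = mean(x),           # early departures offset late ones
    mean_nonnegative = mean(x[x >= 0]),   # early departures dropped
    mean_clamped     = mean(pmax(x, 0)),  # early departures counted as 0 minutes
    share_late       = mean(x > 0)        # proportion departing after schedule
  )
}

library(nycflights13)
delay_definitions(flights$dep_delay)
```

Reporting these side by side makes the chosen definition explicit, so a reader can see how much of a headline number depends on the treatment of early departures.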

Risks

  • Misinterpreting negative delays: The large difference between the mean delay including and excluding early departures shows how encoding choices can substantially alter conclusions about airline performance.

  • Mechanical or encoding-driven bias: Derived or encoded variables (such as delay values and time formats) may introduce structural effects that influence summary statistics.

  • Ignoring implicit structural gaps: Missing combinations of categorical variables (e.g., certain carriers not operating at certain airports) may be mistakenly treated as data errors rather than structural realities.

Mitigation Strategies

  • Clearly define how delay is measured and report whether early departures are included.

  • Present summary statistics both with and without negative values when appropriate.

  • Convert time-coded variables into proper time objects before analysis.

  • Check for explicit and implicit missing data before modeling.

  • Document all variable transformations and assumptions.

Missing Data Analysis

Explicit Missing Rows

# Count explicit NA values in the arrival delay and air time columns
flights |>
  summarise(
    missing_arr_delay = sum(is.na(arr_delay)),
    missing_air_time = sum(is.na(air_time))
  )
## # A tibble: 1 × 2
##   missing_arr_delay missing_air_time
##               <int>            <int>
## 1              9430             9430

Interpretation

The dataset contains 9,430 missing values for both arr_delay and air_time. Identical counts do not by themselves prove that the same flights are missing both values, but they strongly suggest that these are not random data entry errors and instead reflect structurally missing information. Most likely, these rows correspond to flights that were cancelled or diverted and therefore have no recorded arrival delay or air time. This indicates that missingness in this dataset is meaningful and tied to operational outcomes rather than measurement noise.
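
Matching counts alone cannot confirm that the same rows are affected, so a cross-tabulation of the missingness flags is a useful check. The sketch below also includes dep_time, on the assumption (not stated in the analysis above) that a missing departure time is a reasonable proxy for a flight that never departed.

```r
library(dplyr)
library(nycflights13)

# Cross-tabulate missingness: if arr_delay and air_time are missing on exactly
# the same flights, no row of the result will show the two flags disagreeing.
missing_pattern <- flights |>
  count(
    dep_time_missing  = is.na(dep_time),
    arr_delay_missing = is.na(arr_delay),
    air_time_missing  = is.na(air_time)
  )

missing_pattern
```

Rows where dep_time is present but the arrival fields are missing would be candidates for diverted flights, while rows missing all three would be candidates for cancellations.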

Implicit Missing Rows

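# Count flights per origin-carrier pair, then expand to every possible pair;
# pairs that never occur in the data appear with n = NA
flights |>
  count(origin, carrier) |>
  complete(origin, carrier)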
## # A tibble: 48 × 3
##    origin carrier     n
##    <chr>  <chr>   <int>
##  1 EWR    9E       1268
##  2 EWR    AA       3487
##  3 EWR    AS        714
##  4 EWR    B6       6557
##  5 EWR    DL       4342
##  6 EWR    EV      43939
##  7 EWR    F9         NA
##  8 EWR    FL         NA
##  9 EWR    HA         NA
## 10 EWR    MQ       2276
## # ℹ 38 more rows

Interpretation

Implicit missing rows appear when completing all possible combinations of origin and carrier. Several origin–carrier pairs have n = NA, indicating that those airline–airport combinations do not exist in the dataset. These absences are structural rather than accidental; certain airlines simply do not operate from certain airports. This highlights the importance of distinguishing between structural absence and true missing data when preparing data for modeling.
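
To act on that distinction, it helps to list the absent pairs directly and to decide how they should be encoded. The sketch below shows two options using the fill argument of tidyr::complete(). The object names are illustrative, and treating an absent pair as a count of zero is an assumption that fits structural absence, not true missing data.

```r
library(dplyr)
library(tidyr)
library(nycflights13)

origin_carrier <- flights |>
  count(origin, carrier) |>
  complete(origin, carrier)

# Pairs that never occur in the data (implicit missing rows made explicit).
absent_pairs <- origin_carrier |>
  filter(is.na(n))

# Alternative: record structural absences as a count of zero instead of NA.
origin_carrier_zero <- flights |>
  count(origin, carrier) |>
  complete(origin, carrier, fill = list(n = 0L))

absent_pairs
```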

Continuous Variable: Outliers

# Quartiles and interquartile range of departure delay (NAs removed first)
delay_stats <- flights |>
  filter(!is.na(dep_delay)) |>
  summarise(
    Q1 = quantile(dep_delay, 0.25),
    Q3 = quantile(dep_delay, 0.75),
    IQR = IQR(dep_delay)
  )

delay_stats
## # A tibble: 1 × 3
##      Q1    Q3   IQR
##   <dbl> <dbl> <dbl>
## 1    -5    11    16

Interpretation

The first quartile of departure delay is -5 minutes and the third quartile is 11 minutes, yielding an interquartile range of 16 minutes. Under the 1.5 × IQR rule, the fences sit 1.5 × 16 = 24 minutes beyond the quartiles, so values below -5 - 24 = -29 minutes or above 11 + 24 = 35 minutes are considered outliers. Flights departing more than 35 minutes late therefore represent unusually large delays relative to the typical distribution. Because the distribution is strongly right-skewed, extreme positive delays are far more common than extreme early departures. These outliers may represent significant operational disruptions rather than data entry errors, and they can substantially influence summary statistics such as the mean.

Because delay data are heavily right-skewed, the IQR method may flag a large number of flights as outliers. Therefore, it is important to consider whether extreme delays reflect true events or data quality problems before removing them from analysis.
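
How many flights the rule actually flags can be checked directly. The sketch below computes the fences and the share of flights outside each one. The helper name iqr_fences is hypothetical, and it relies on the default quantile method in R.

```r
# Hypothetical helper: fences placed k * IQR beyond the first and third quartiles.
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

library(nycflights13)
delays <- flights$dep_delay[!is.na(flights$dep_delay)]
fences <- iqr_fences(delays)

# Share of flights flagged on each side of the fences.
c(
  share_below = mean(delays < fences[["lower"]]),
  share_above = mean(delays > fences[["upper"]])
)
```

If the share above the upper fence turns out to be large, that supports treating these flights as a real part of the delay distribution rather than as anomalies to be removed.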


Conclusion

This data dive highlights the importance of reviewing documentation before interpreting statistical results. Several variables, such as dep_time, carrier, and delay measures, are encoded in ways that are not immediately clear without documentation. For instance, negative departure delays represent early departures, which significantly affect summary statistics. The large difference in mean delay when excluding early departures demonstrates how encoding choices can alter conclusions.

The missing data analysis shows that the identical number of missing values in arr_delay and air_time likely reflects cancelled or diverted flights rather than random data gaps. Ignoring these rows changes the population being analyzed and may underestimate operational disruptions. Similarly, implicit missing rows in categorical combinations reflect structural absences rather than data errors.

Outlier analysis indicates that delays greater than 35 minutes are statistically unusual, but given the strong right skew, these values may represent real operational events rather than mistakes. Overall, this investigation reinforces that statistical measures must be interpreted in the context of variable definitions, missingness structure, and documentation to avoid misleading conclusions.