Default information

Metadata for Created Fields

Field Name Description Data Type Notes / Transformation
FL_DATE Flight date standardized to ISO (YYYY-MM-DD) Date Parsed first with ymd(), then mdy() for remaining formats
ROUTE_KEY Canonical “origin–destination” route identifier character ifelse(ORIGIN<DEST, paste(ORIGIN,DEST), paste(DEST,ORIGIN))
DEP_DELAY, ARR_DELAY Departure/arrival delay in minutes numeric Negative values floored at zero for cost modeling; extremes capped/filtered in EDA
AIR_TIME, DISTANCE Flight time (min) and distance (mi) numeric Stripped non-digits via gsub("[^0-9\\.]","",…), then as.numeric()
OCCUPANCY_RATE Fraction of seats filled per flight numeric Missing values imputed to the mean occupancy of that origin–destination pair
PASSENGERS Estimated passengers per flight (capacity × occupancy) numeric Computed as OCCUPANCY_RATE * 200
Ticket_Revenue Revenue from ticket sales per flight numeric PASSENGERS * AVG_FARE (uses half the round-trip fare per leg)
Baggage_Revenue Ancillary baggage fee revenue per flight numeric PASSENGERS * 35 * 0.5 (one bag on 50% of passengers per leg)
Total_Revenue Sum of Ticket_Revenue and Baggage_Revenue numeric
Fuel_Cost, Depreciation_Cost Variable operating costs per flight numeric Fuel_Cost = DISTANCE * 8; Depreciation_Cost = DISTANCE * 1.18
Airport_Fee Fixed airport usage fee at destination numeric $10 000 if large airport; otherwise $5 000
Delay_Cost Penalty cost for delays numeric 75 * (max(DEP_DELAY–15,0) + max(ARR_DELAY–15,0))
Total_Cost Sum of Fuel_Cost, Depreciation_Cost, Airport_Fee, and Delay_Cost numeric
Profit Operating profit per flight numeric Total_Revenue − Total_Cost
Flights Total one-way flight legs in the quarter integer n() rows per ROUTE_KEY
Round_Trips Number of complete round trips integer floor(Flights / 2)
Total_Pax Total passengers carried on that route (round trips basis) numeric sum(PASSENGERS) / 2
Total_Revenue (route) Sum of Total_Revenue across all flights in the quarter numeric
Total_Cost (route) Sum of Total_Cost across all flights in the quarter numeric
Total_Profit (route) Sum of Profit across all flights in the quarter numeric
Profit_per_Round_Trip Average profit generated by one round trip numeric Total_Profit / Round_Trips
Breakeven_Round_Trips Number of round trips needed to recoup the $90 M plane cost integer ceiling(90e6 / Profit_per_Round_Trip)

Data-Quality Insights & Remediation

  1. Mixed & inconsistent date formats
    • Issue: FL_DATE appeared as both "YYYY-MM-DD" and "M/D/YY".
    • Impact: Would break time-series analyses and joins.
    • Remediation: Wrote clean_dates() to try ymd() then mdy() so all dates become Date type.
  2. Numeric fields stored as characters
    • Issue: AIR_TIME, DISTANCE, and ITIN_FARE contained non-digit characters (e.g. “.0”).
    • Impact: Prevented correct summaries and arithmetic.
    • Remediation: Stripped non-digits with gsub() then as.numeric(). Checked how many NAs were introduced.
  3. Missing occupancy rates
    • Issue: Some flights had NA in OCCUPANCY_RATE.
    • Impact: Understated revenue and passenger counts.
    • Remediation: Imputed each missing value to the mean occupancy for its origin–destination pair.
  4. Extreme outliers in delays & distances
    • Issue: Few flights showed delays > 500 min or distances > 3 000 mi.
    • Impact: Skewed distributions and summary statistics.
    • Remediation: For EDA, capped at the 1st/99th percentiles or filtered to ±500 min; left full range when computing economic impact.
  5. Duplicate records
    • Issue: Exact duplicates in all datasets.
    • Impact: Artificially inflated counts and revenue.
    • Remediation: Applied distinct() at each cleaning step to remove duplicates.
  6. Sparse ticket data on many routes
    • Issue: Numerous routes had < 100 round-trip tickets.
    • Impact: Unreliable average fare estimates.
    • Remediation: Kept only routes with ≥ 100 tickets for per‐route fare calculations; for others, substituted global median fare.

End of documentation.