Executive Summary:
The goal of this analysis was to evaluate US domestic airline data (1st Quarter 2019) to identify the busiest flight routes, the most profitable routes, and recommend five round-trip routes for a new airline to operate. I combined the Flights, Tickets, and Airport Codes datasets, cleaned them for accuracy, and then calculated revenues, costs, and profits for each potential route using the given assumptions (fuel costs, fees, etc.).

Key Findings
- I found that the busiest routes are largely short-haul hops between major hubs (for example, Los Angeles–San Francisco) and a high-frequency inter-island route in Hawaii (Honolulu–Kahului).
- The most profitable route was JFK–LAX (New York–Los Angeles), generating approximately $250 million in operating profit during Q1 2019. Other top-earning routes include DCA–ORD (Washington–Chicago) and DCA–LGA (Washington–New York), which, while not the very busiest by flight count, yield strong profits due to premium fares and efficient operations.

After examining profitability, demand, and on-time performance, I recommend five routes for the new airline’s initial investment:

  1. JFK–LAX
  2. DCA–ORD
  3. DCA–LGA
  4. LAX–SFO
  5. ATL–LGA

These routes each demonstrate high financial returns and robust passenger demand, while maintaining solid on-time performance—critical for the airline’s “On time, for you” brand promise. They cover a mix of long-haul transcontinental and high-frequency regional markets, positioning the new airline in key business corridors.

Breakeven Analysis
I calculated how many round trips each route must operate to recover the $90 million airplane cost at current profit-per-flight levels. For example, the JFK–LAX route requires roughly 1,143 round trips to break even. The other recommended routes need between 1,200 and 1,500 round trips, indicating that, with expected utilization, each aircraft investment can be recouped in a reasonable timeframe.

Conclusion & Next Steps
Launching service on these five routes gives the new airline a strong financial and operational foundation. I advise:

By focusing on these routes and continuously monitoring the identified KPIs, the airline can establish a profitable and reliable network from day one.

1. Load Libraries

library(readr)
library(dplyr)
library(tidyr)
library(lubridate)
library(ggplot2)
library(corrplot)
library(DT)

2. Helper Functions

clean_dates <- function(date_column) {
  # parse YMD first, then MDY for the remainder
  date_column <- as.character(date_column)
  d1 <- ymd(date_column, quiet = TRUE)
  idx <- is.na(d1)
  d1[idx] <- mdy(date_column[idx], quiet = TRUE)
  as.Date(d1)
}

clean_numeric_column <- function(col) {
  # strip out any non-digit / period, convert
  as.numeric(gsub("[^0-9\\.]", "", col))
}

3. Input Variables

cost_of_each_plane <- 90e6    # $ per new aircraft
number_of_planes   <- 5       # how many routes to recommend

4. Load Raw Data

flights_raw  <- read_csv("data/Flights.csv",  show_col_types = FALSE)
tickets_raw  <- read_csv("data/Tickets.csv",  show_col_types = FALSE)
airports_raw <- read_csv("data/Airport_Codes.csv", show_col_types = FALSE)

5. Preprocess & Clean

5.1 Flights

flights <- flights_raw %>%
  filter(CANCELLED == 0) %>%
  mutate(
    FL_DATE        = clean_dates(FL_DATE),
    DEP_DELAY      = as.numeric(DEP_DELAY),
    ARR_DELAY      = as.numeric(ARR_DELAY),
    AIR_TIME       = clean_numeric_column(AIR_TIME),
    DISTANCE       = clean_numeric_column(DISTANCE),
    OCCUPANCY_RATE = as.numeric(OCCUPANCY_RATE),
    ROUTE_KEY      = ifelse(ORIGIN < DESTINATION,
                            paste(ORIGIN, DESTINATION, sep="-"),
                            paste(DESTINATION, ORIGIN, sep="-"))
  ) %>%
  filter(!is.na(FL_DATE), AIR_TIME > 0) %>%
  distinct()

5.2 Tickets

tickets <- tickets_raw %>%
  filter(ROUNDTRIP == TRUE) %>%
  mutate(
    ITIN_FARE  = clean_numeric_column(ITIN_FARE),
    PASSENGERS = as.numeric(PASSENGERS),
    ROUTE_KEY  = ifelse(ORIGIN < DESTINATION,
                        paste(ORIGIN, DESTINATION, sep="-"),
                        paste(DESTINATION, ORIGIN, sep="-"))
  ) %>%
  filter(!is.na(ITIN_FARE), ITIN_FARE > 0, PASSENGERS >= 1) %>%
  distinct()

5.3 Airports

airports <- airports_raw %>%
  filter(
    ISO_COUNTRY == "US",
    TYPE %in% c("medium_airport","large_airport"),
    IATA_CODE != ""
  ) %>%
  select(IATA_CODE, TYPE) %>%
  rename(Airport_Size = TYPE)

6. Exploratory Data Analysis (EDA)

6.1 Flight Delays

ggplot(flights, aes(x = DEP_DELAY)) +
  geom_histogram(bins = 50, fill = "#4682B4", color = "white") +
  coord_cartesian(xlim = c(-50, 300)) +
  labs(
    title = "Departure Delay Distribution",
    x = "Departure Delay (minutes)",
    y = "Flight Count"
  )

ggplot(flights, aes(x = ARR_DELAY)) +
  geom_histogram(bins = 50, fill = "#B22222", color = "white") +
  coord_cartesian(xlim = c(-50, 300)) +
  labs(
    title = "Arrival Delay Distribution",
    x = "Arrival Delay (minutes)",
    y = "Flight Count"
  )

6.2 Occupancy & Distance

ggplot(flights, aes(x = OCCUPANCY_RATE)) +
  geom_histogram(bins = 50, fill = "#FFA500", color = "white") +
  labs(
    title = "Occupancy Rate Distribution",
    x = "Occupancy Rate",
    y = "Flight Count"
  )

ggplot(flights, aes(x = DISTANCE)) +
  geom_histogram(bins = 40, fill = "#228B22", color = "white") +
  labs(
    title = "Flight Distance Distribution",
    x = "Distance (miles)",
    y = "Count"
  )

6.3 Ticket Fares & Passengers

p99 <- quantile(tickets$ITIN_FARE, 0.99, na.rm = TRUE)
ggplot(tickets %>% filter(ITIN_FARE <= p99), aes(x = ITIN_FARE)) +
  geom_histogram(bins = 50, fill = "#87CEEB", color = "white") +
  labs(
    title = "Ticket Fare Distribution (≤ 99th percentile)",
    x = "Round‐Trip Fare (USD)",
    y = "Count"
  )

tickets %>%
  mutate(
    BUCKET = ifelse(PASSENGERS > 10, "10+", as.character(PASSENGERS)),
    BUCKET = factor(BUCKET, levels = c(as.character(1:10), "10+"))
  ) %>%
  ggplot(aes(x = BUCKET)) +
    geom_bar(fill = "#9370DB") +
    labs(
      title = "Distribution of Passengers per Itinerary",
      x = "Passengers (bucketed)",
      y = "Number of Itineraries"
    ) +
    theme_minimal()

6.4 Airport Types

ggplot(airports_raw, aes(x = TYPE)) +
  geom_bar(fill = "#4682B4") +
  labs(
    title = "Global Airport Type Distribution",
    x = "Airport Type",
    y = "Count"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7. Compute Route‐Level Metrics

# Merge with airport sizes
fl2 <- flights %>%
  inner_join(airports, by = c("ORIGIN"      = "IATA_CODE")) %>% rename(Origin_Size = Airport_Size) %>%
  inner_join(airports, by = c("DESTINATION" = "IATA_CODE")) %>% rename(Dest_Size   = Airport_Size)

# Fare summary for routes with ≥100 tickets
ticket_summary <- tickets %>%
  group_by(ROUTE_KEY) %>%
  summarise(AVG_FARE = mean(ITIN_FARE, na.rm = TRUE), N = n(), .groups="drop") %>%
  filter(N >= 100)

# Combine and compute revenues/costs
fl2 <- fl2 %>%
  left_join(ticket_summary, by = "ROUTE_KEY") %>%
  mutate(
    AVG_FARE         = if_else(is.na(AVG_FARE), median(ticket_summary$AVG_FARE), AVG_FARE),
    PASSENGERS       = OCCUPANCY_RATE * 200,
    Ticket_Revenue   = PASSENGERS * AVG_FARE,
    Baggage_Revenue  = PASSENGERS * 35 * 0.5,
    Total_Revenue    = Ticket_Revenue + Baggage_Revenue,
    Fuel_Cost        = DISTANCE * 8,
    Depreciation_Cost= DISTANCE * 1.18,
    Airport_Fee      = if_else(Dest_Size=="large_airport", 10000, 5000),
    Delay_Cost       = 75 * (pmax(DEP_DELAY - 15,0) + pmax(ARR_DELAY - 15,0)),
    Total_Cost       = Fuel_Cost + Depreciation_Cost + Airport_Fee + Delay_Cost,
    Profit           = Total_Revenue - Total_Cost
  )

# Summarize per route
route_summary <- fl2 %>%
  group_by(ROUTE_KEY) %>%
  summarise(
    Flights               = n(),
    Round_Trips           = floor(n()/2),
    Total_Revenue         = sum(Total_Revenue,    na.rm=TRUE),
    Total_Cost            = sum(Total_Cost,       na.rm=TRUE),
    Total_Profit          = sum(Profit,           na.rm=TRUE),
    Profit_per_Round_Trip = Total_Profit / Round_Trips,
    Breakeven_Round_Trips = ceiling(cost_of_each_plane / (Total_Profit / Round_Trips)),
    .groups = "drop"
  )

route_top_profit   <- route_summary %>% slice_max(Total_Profit,    n = 10)
busiest_routes     <- route_summary %>% slice_max(Round_Trips,      n = 10)
recommended_routes <- route_summary %>% slice_max(Total_Profit,    n = number_of_planes)

8. Results

8.1 Top 10 Most Profitable Routes

DT::datatable(
  route_top_profit,
  caption = "Top 10 Most Profitable Round‐Trip Routes",
  options = list(pageLength = 10),
  rownames = FALSE
)

8.2 Top 10 Busiest Routes

DT::datatable(
  busiest_routes,
  caption = "Top 10 Busiest Round‐Trip Routes",
  options = list(pageLength = 10),
  rownames = FALSE
)

8.4 Full Route Summary

DT::datatable(
  route_summary,
  caption = "Complete Route Summary",
  options = list(pageLength = 15),
  rownames = FALSE
)

9. What’s Next

In the time allotted I delivered a static Q1 2019 snapshot. Next steps I would pursue (without doing the work here) include:

  1. Seasonal & Forecasting Analysis

    • Ingest additional quarters, model seasonality, forecast demand.
  2. Dynamic Pricing & Yield Management

    • Leverage time-series ticket data to optimize price curves.
  3. Operational Feasibility

    • Integrate crew and gate scheduling, maintenance windows, slot constraints.
  4. Risk & Sensitivity

    • Stress-test fuel costs, delay distributions, demand shocks.
  5. Real-Time Monitoring Dashboard

    • Build a live Shiny app with database back-end for daily KPIs and alerts.
  6. Customer Experience Integration

    • Tie in NPS, complaint data to refine route selection over time.
  7. Predictive Delay Modeling

    • Use weather, traffic, and historical operational data to predict delays by route.

These enhancements would turn our static analysis into a dynamic, prescriptive data product driving fleet and route decisions.

```