Executive Summary:
The goal of this analysis was to evaluate US domestic airline data (1st
Quarter 2019) to identify the busiest flight routes, the most profitable
routes, and recommend five round-trip routes for a new airline to
operate. I combined the Flights, Tickets, and Airport Codes datasets,
cleaned them for accuracy, and then calculated revenues, costs, and
profits for each potential route using the given assumptions (fuel
costs, fees, etc.).
Key Findings
- I found that the busiest routes are largely
short-haul hops between major hubs (for example, Los Angeles–San
Francisco) and a high-frequency inter-island route in Hawaii
(Honolulu–Kahului).
- The most profitable route was JFK–LAX (New
York–Los Angeles), generating approximately $250 million in
operating profit during Q1 2019. Other top-earning routes include
DCA–ORD (Washington–Chicago) and
DCA–LGA (Washington–New York), which, while not the
very busiest by flight count, yield strong profits due to premium fares
and efficient operations.
After examining profitability, demand, and on-time performance, I recommend five routes for the new airline’s initial investment:
These routes each demonstrate high financial returns and robust passenger demand, while maintaining solid on-time performance—critical for the airline’s “On time, for you” brand promise. They cover a mix of long-haul transcontinental and high-frequency regional markets, positioning the new airline in key business corridors.
Breakeven Analysis
I calculated how many round trips each route must operate to recover the
$90 million airplane cost at current profit-per-flight levels. For
example, the JFK–LAX route requires roughly 1,143 round
trips to break even. The other recommended routes need between
1,200 and 1,500 round trips, indicating that, with
expected utilization, each aircraft investment can be recouped in a
reasonable timeframe.
Conclusion & Next Steps
Launching service on these five routes gives the new airline a strong
financial and operational foundation. I advise:
By focusing on these routes and continuously monitoring the identified KPIs, the airline can establish a profitable and reliable network from day one.
library(readr)
library(dplyr)
library(tidyr)
library(lubridate)
library(ggplot2)
library(corrplot)
library(DT)
clean_dates <- function(date_column) {
# parse YMD first, then MDY for the remainder
date_column <- as.character(date_column)
d1 <- ymd(date_column, quiet = TRUE)
idx <- is.na(d1)
d1[idx] <- mdy(date_column[idx], quiet = TRUE)
as.Date(d1)
}
clean_numeric_column <- function(col) {
# strip out any non-digit / period, convert
as.numeric(gsub("[^0-9\\.]", "", col))
}
cost_of_each_plane <- 90e6 # $ per new aircraft
number_of_planes <- 5 # how many routes to recommend
flights_raw <- read_csv("data/Flights.csv", show_col_types = FALSE)
tickets_raw <- read_csv("data/Tickets.csv", show_col_types = FALSE)
airports_raw <- read_csv("data/Airport_Codes.csv", show_col_types = FALSE)
flights <- flights_raw %>%
filter(CANCELLED == 0) %>%
mutate(
FL_DATE = clean_dates(FL_DATE),
DEP_DELAY = as.numeric(DEP_DELAY),
ARR_DELAY = as.numeric(ARR_DELAY),
AIR_TIME = clean_numeric_column(AIR_TIME),
DISTANCE = clean_numeric_column(DISTANCE),
OCCUPANCY_RATE = as.numeric(OCCUPANCY_RATE),
ROUTE_KEY = ifelse(ORIGIN < DESTINATION,
paste(ORIGIN, DESTINATION, sep="-"),
paste(DESTINATION, ORIGIN, sep="-"))
) %>%
filter(!is.na(FL_DATE), AIR_TIME > 0) %>%
distinct()
tickets <- tickets_raw %>%
filter(ROUNDTRIP == TRUE) %>%
mutate(
ITIN_FARE = clean_numeric_column(ITIN_FARE),
PASSENGERS = as.numeric(PASSENGERS),
ROUTE_KEY = ifelse(ORIGIN < DESTINATION,
paste(ORIGIN, DESTINATION, sep="-"),
paste(DESTINATION, ORIGIN, sep="-"))
) %>%
filter(!is.na(ITIN_FARE), ITIN_FARE > 0, PASSENGERS >= 1) %>%
distinct()
airports <- airports_raw %>%
filter(
ISO_COUNTRY == "US",
TYPE %in% c("medium_airport","large_airport"),
IATA_CODE != ""
) %>%
select(IATA_CODE, TYPE) %>%
rename(Airport_Size = TYPE)
ggplot(flights, aes(x = DEP_DELAY)) +
geom_histogram(bins = 50, fill = "#4682B4", color = "white") +
coord_cartesian(xlim = c(-50, 300)) +
labs(
title = "Departure Delay Distribution",
x = "Departure Delay (minutes)",
y = "Flight Count"
)
ggplot(flights, aes(x = ARR_DELAY)) +
geom_histogram(bins = 50, fill = "#B22222", color = "white") +
coord_cartesian(xlim = c(-50, 300)) +
labs(
title = "Arrival Delay Distribution",
x = "Arrival Delay (minutes)",
y = "Flight Count"
)
ggplot(flights, aes(x = OCCUPANCY_RATE)) +
geom_histogram(bins = 50, fill = "#FFA500", color = "white") +
labs(
title = "Occupancy Rate Distribution",
x = "Occupancy Rate",
y = "Flight Count"
)
ggplot(flights, aes(x = DISTANCE)) +
geom_histogram(bins = 40, fill = "#228B22", color = "white") +
labs(
title = "Flight Distance Distribution",
x = "Distance (miles)",
y = "Count"
)
p99 <- quantile(tickets$ITIN_FARE, 0.99, na.rm = TRUE)
ggplot(tickets %>% filter(ITIN_FARE <= p99), aes(x = ITIN_FARE)) +
geom_histogram(bins = 50, fill = "#87CEEB", color = "white") +
labs(
title = "Ticket Fare Distribution (≤ 99th percentile)",
x = "Round‐Trip Fare (USD)",
y = "Count"
)
tickets %>%
mutate(
BUCKET = ifelse(PASSENGERS > 10, "10+", as.character(PASSENGERS)),
BUCKET = factor(BUCKET, levels = c(as.character(1:10), "10+"))
) %>%
ggplot(aes(x = BUCKET)) +
geom_bar(fill = "#9370DB") +
labs(
title = "Distribution of Passengers per Itinerary",
x = "Passengers (bucketed)",
y = "Number of Itineraries"
) +
theme_minimal()
ggplot(airports_raw, aes(x = TYPE)) +
geom_bar(fill = "#4682B4") +
labs(
title = "Global Airport Type Distribution",
x = "Airport Type",
y = "Count"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Merge with airport sizes
fl2 <- flights %>%
inner_join(airports, by = c("ORIGIN" = "IATA_CODE")) %>% rename(Origin_Size = Airport_Size) %>%
inner_join(airports, by = c("DESTINATION" = "IATA_CODE")) %>% rename(Dest_Size = Airport_Size)
# Fare summary for routes with ≥100 tickets
ticket_summary <- tickets %>%
group_by(ROUTE_KEY) %>%
summarise(AVG_FARE = mean(ITIN_FARE, na.rm = TRUE), N = n(), .groups="drop") %>%
filter(N >= 100)
# Combine and compute revenues/costs
fl2 <- fl2 %>%
left_join(ticket_summary, by = "ROUTE_KEY") %>%
mutate(
AVG_FARE = if_else(is.na(AVG_FARE), median(ticket_summary$AVG_FARE), AVG_FARE),
PASSENGERS = OCCUPANCY_RATE * 200,
Ticket_Revenue = PASSENGERS * AVG_FARE,
Baggage_Revenue = PASSENGERS * 35 * 0.5,
Total_Revenue = Ticket_Revenue + Baggage_Revenue,
Fuel_Cost = DISTANCE * 8,
Depreciation_Cost= DISTANCE * 1.18,
Airport_Fee = if_else(Dest_Size=="large_airport", 10000, 5000),
Delay_Cost = 75 * (pmax(DEP_DELAY - 15,0) + pmax(ARR_DELAY - 15,0)),
Total_Cost = Fuel_Cost + Depreciation_Cost + Airport_Fee + Delay_Cost,
Profit = Total_Revenue - Total_Cost
)
# Summarize per route
route_summary <- fl2 %>%
group_by(ROUTE_KEY) %>%
summarise(
Flights = n(),
Round_Trips = floor(n()/2),
Total_Revenue = sum(Total_Revenue, na.rm=TRUE),
Total_Cost = sum(Total_Cost, na.rm=TRUE),
Total_Profit = sum(Profit, na.rm=TRUE),
Profit_per_Round_Trip = Total_Profit / Round_Trips,
Breakeven_Round_Trips = ceiling(cost_of_each_plane / (Total_Profit / Round_Trips)),
.groups = "drop"
)
route_top_profit <- route_summary %>% slice_max(Total_Profit, n = 10)
busiest_routes <- route_summary %>% slice_max(Round_Trips, n = 10)
recommended_routes <- route_summary %>% slice_max(Total_Profit, n = number_of_planes)
DT::datatable(
route_top_profit,
caption = "Top 10 Most Profitable Round‐Trip Routes",
options = list(pageLength = 10),
rownames = FALSE
)
DT::datatable(
busiest_routes,
caption = "Top 10 Busiest Round‐Trip Routes",
options = list(pageLength = 10),
rownames = FALSE
)
DT::datatable(
recommended_routes,
caption = paste0("Top ", number_of_planes, " Recommended Routes"),
options = list(pageLength = number_of_planes),
rownames = FALSE
)
DT::datatable(
route_summary,
caption = "Complete Route Summary",
options = list(pageLength = 15),
rownames = FALSE
)
In the time allotted I delivered a static Q1 2019 snapshot. Next steps I would pursue (without doing the work here) include:
Seasonal & Forecasting Analysis
Dynamic Pricing & Yield Management
Operational Feasibility
Risk & Sensitivity
Real-Time Monitoring Dashboard
Customer Experience Integration
Predictive Delay Modeling
These enhancements would turn our static analysis into a dynamic, prescriptive data product driving fleet and route decisions.
```