Project 3

Author

Daniel Johnson

My data is sourced from the Bureau of Transportation Statistics. They have a data portal where you can download all the flight data from all the airports in a state by month or year. Each entry is by route by airline by month, so a route run all 12 months of the year by the same airline will have 12 entries. Over the course of the project I used a lot fo the variables. DEPARTURES_PERFORMED is the number of flights that actually occurred, SEATS and PASSENGERS are the number of available seats on the route offered by the airline and the passengers is how many people actually took the flights. We can use these to see how efficient airlines are on their routes. Distance is fixed and not distance traveled and will be extremely useful to compare metrics over distance. CARRIER_NAME is the name of the airline, and it along with DEST_CITY and DEST are useful for displaying the destination in a couple of different ways. Same goes with DEST_COUNTRY and DEST_COUNTRY_NAME, which can help us signify not just countries but domestic vs international travel. We can also use MONTH (which is numerical) to explore ridership over time throughout the whole year. I was not able to determine how the BTS collects this data, but I assume the FAA gives it to them and they aggregate it.

I chose this dataset because I have a love for transportation. My loves has ebbed and flowed over the years between different modes (cars, planes, trains etc) but I recently watched a video from a youtuber who used to talk about planes and the airline industry a lot and was inspired. I wanted to look at Dulles specifically since it’s my main home international airport and the one i’ve flown out of the most in the past. Transportation also relates to my love for geography, as those are forever intertwined. In the project I wish to explore the flights coming out of Dulles in 2024 and find interesting tidbits, trends and maybe even some fun facts.

Firstly, we must load our libraries, set the working directory and read in the data.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)

Warning: package 'plotly' was built under R version 4.4.3


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

setwd("~/Data 110 Class Folder")
data <- read_csv("T_T100_SEGMENT_ALL_CARRIER.csv")

Rows: 28894 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (22): UNIQUE_CARRIER, UNIQUE_CARRIER_NAME, UNIQUE_CARRIER_ENTITY, REGION...
dbl (28): DEPARTURES_SCHEDULED, DEPARTURES_PERFORMED, PAYLOAD, SEATS, PASSEN...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Our original data includes all airports in Virginia, so we must first filter for just Dulles

iad <- data %>%
  filter(ORIGIN == "IAD")   #dulles

This data is of all flights, including cargo flights, so we must remove all cargo related flights as they can skew the passenger data. We filter for “Class F” flights, which are commercial. and aircraft of type “1”, which are passenger only planes, because some flights are combination cargo and passenger, where some of the passenger cabin is configured to hold cargo beyond what goes in the bottom of the plane.

I also filtered flights whos destinations were one of the 3 dc area airports. There are no scheduled services on theses routes, so I can only assume these are things like positioning flights, diversions, training flights and emergency landing

iad <- iad %>%
  filter(CLASS == "F") |>      #commercial flights only
  filter(AIRCRAFT_CONFIG == 1) |> #type 3 is combo passenger/cargo
  filter(DEST != "IAD") |>
  filter(DEST != "BWI") |>
  filter(DEST != "DCA")
  
  
 iad <- iad %>%        
   select(-c(11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 26, 27, 30, 31, 32, 33, 37, 41, 42, 45, 49, 50))

I also removed all the redundant columns that I wasn’t going to use to make reading the tables easier.

Main Visualization 1

I started doing exploratory plots comparing percentage of seats filled vs distance. This would basically be a measure of efficiency, as it would show what percentage of seats the airlines made avaliable were being filled by passengers.Right away I noticed there 4 distinct distance ranges, as seen below.

p2 <- iad |>
ggplot(aes(x = (PASSENGERS / SEATS), y = DISTANCE)) +
  geom_point() +
  #scale_y_reverse() +
  labs(
    title = "Percentage of seats filled vs Distance",
    caption = "Source: FRC 449",
    x = "Percentage of seats filled",
    y = "Distance"
  ) +
  theme_bw() +
  geom_smooth(method = 'lm', formula = y ~ x, color = "#e10000")

We can see that of the 4 distance “buckets” almost all of the lower percentage instances were on lower distance routes.

I first explored with a histogram to see what the distance distribution looked like when it was the central part of the graph and not the y-axis of another graph. You can see the 4 buckets much more clearly here. 0-2500 and over 6000 showed up pretty clearly, but I wanted to dive a little bit deeper.

p11 <- iad %>%
  ggplot(aes(x = DISTANCE)) +
  geom_histogram(bins = 50) +
  labs(title = "Histogram of Distances", x = "Distance", y = "Frequency") +
  theme_minimal()

I decided to filter first for domestic flights to see if that corresponded with the different distance “buckets” with a simple histogram.

domestic <- iad %>%
  filter(DEST_COUNTRY == "US")

p12 <- domestic %>%
  ggplot(aes(x = DISTANCE)) +
  geom_histogram(bins = 30) +
  labs(title = "Domestic", x = "Distance", y = "Frequency") +
  theme_minimal()

We can see above that every domestic flight from Dulles is under 2700 miles except for 2 destinations, which I later discovered to be Anchorage Alaska and Honolulu Hawaii.

I then did the same with international flights. Here you get much more variance and many more distinct buckets.

international <- iad %>%
  filter(DEST_COUNTRY != "US")

p13 <- international %>%  
  ggplot(aes(x = DISTANCE)) +
  geom_histogram(bins = 30) +
  labs(title = "International", x = "Distance", y = "Frequency") +
  theme_minimal()

Upon a closer look at the data itself as well as online mapping tools which visually show where you can fly to from Dulles, I discovered a couple of things. First, The left 2 bins of the histogram are all flights to eastern Canada (Toronto, Montreal, etc). I also learned that the first bin after the 2nd large gap was Reykjavik Iceland, and that every international flight above that mile mark (~2400 miles) were intercontinental flights ie those to South America, Europe, Africa and Asia. All the flights in that in between bucket were to the Caribbean, Mexico, western Canada or Central America. Seeing how there was significant overlap in longer haul domestic flights, such as those to the west cause and mountain west, with flights within North America, I decided that dividing up the flights by distance regardless of destination was appropriate as long as my lowest “bucket” captured all intra-continental flights. THis is also because there is very little overlap with domestic flights and inter-continental flights in terms of distance, the only cases being the aforementioned Anchorage and Honolulu, as well as San Francisco.

Based on visual inspection of the overall histogram as well as the prior knowledge of little overlap on inter-continental routes, I decided on the following “buckets:” 0-2500 miles, 2501-5000 miles, 5001-6000 miles, and 6000+. I mutated these new categories into my dataset below.

iad <- iad %>%
  mutate(DISTANCE_BUCKET = case_when(
    DISTANCE <= 2500 ~ "0-2500",
    DISTANCE <= 5000 ~ "2501-5000",
    DISTANCE <= 6000 ~ "5001-6000",
    TRUE ~ "6001+"
  )) %>%
  relocate(DISTANCE_BUCKET, .before = 9)

This leads to our final graph, which is a boxplot of percentage of seats filled by route by airline, broken down by our distance buckets.

ggplot(iad, aes(x = DISTANCE_BUCKET, y = (PASSENGERS / SEATS))) +
  geom_boxplot(fill = "#00FFFF") +
  labs(
    title = "Percentage of Seats Filled by Distance",
    caption = "Source: Bureau of Transportation Statistics",
    x = "Flight Distance",
    y = ""
  ) +
  theme_bw()

We can see that the 3 plots don’t look too dissimilar. The first one has a lot of lower outliers, which I believe is why its median and interquartile range are so similar to the others. The most similar are 2500-5000 and 6000+, which is interesting because the one in between them has a much larger interquartile range and no lower outliers.

I believe, based on my research and prior knowledge, that the majority of the flights with low occupancy are flights under the Department of Transportation’s Essential Air Service program. This is a program that guarantees that smaller towns and cities can be connected to the national airline network. These routes would otherwise be unprofitable, but the DOT subsidizes theses routes and makes sure they are run so that these communities have access. Otherwise for example, if you were in Bradford Pennsylvania (one of the destinations of one of these EAS flights), if you wanted to fly somewhere, you would have to drive over an hour each way to either Buffalo or Rochester to get to a small airport, or drive all the way to either Pittsburgh, Philadelphia or DC (several hours away), just to get on another flight, which adds hours to journies and is even worse if you have to be dropped off (imagine driving 4+ hours roundtrip just to drop someone off at the airport).

#Main visualization 2

For my next visualization, I wanted to continue to explore percentage of seats filled, but this time narrowed down and with interactivity. I chose to analyze United Airlines, because Dulles is a United hub, therefore they have a majority of flights, destinations and they have the most data. This means first we have to filter for just United.

united <- iad %>%
  filter(CARRIER_NAME == "United Air Lines Inc.")

The resulting dataset has an entry by destination by month, so we have to aggregate seats, passengers and departures for all flights across the entire year. We also add a percent filled column, calculated from our seats and passengers. This will make out plotly plot easier later.

united <- united %>%
  group_by(DEST_CITY_NAME) %>%
  summarize(
    total_departures = sum(DEPARTURES_PERFORMED),
    total_seats = sum(SEATS),
    total_passengers = sum(PASSENGERS),
    distance = first(DISTANCE)
  ) %>%
  mutate(
    percent_filled = ((total_passengers / total_seats) * 100)
  )

Here I mutated another column with our tooltip information and rounding for the percentage, before creating the plot.

united <- united %>%
  mutate(
    tooltip = paste(
      "Destination: ", DEST_CITY_NAME, "<br>",
      "Distance: ", distance, " miles<br>",
      "Percent Filled: ", round(percent_filled, 1), "%<br>",
      "Total Flights:", total_departures, "flights"
    )
  )

p <- ggplot(united, aes(x = distance, y = percent_filled, text = tooltip)) +
  geom_point(color = "#5B92E5") +
  #geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title = "Distance vs Percentage of Seats Filled",
    caption = "Source: Bureau of Transportation Statistics",
    x = "Distance (miles)",
    y = "Percent Filled"
  )

ggplotly(p, tooltip = "text")

Here we can see there are 2 main clusters, with a few outliers. The group on the top left is all of the previously mentioned domestic and intra-continental flights, of which most are in a similar range in terms of percentage of seats filled. There are a few outliers with must lower percents, however. Curiously, most of them only have 1 flight, with a couple having a handful of flights. Most of these flights are to semi-major destinations, so I’m not sure why they are only running a handful of flights for each. I have 3 about these flights. One theory is that these are replacement flights for the regional airlines that run most of their short haul domestic service (see explanation with visualization 4). My second theory is that theses are flights that had to be diverted or make emergency landing. My third theory is that these are “positioning flights.” These are when the airline needs more planes at another airport, so they either schedule one random flight to get the plane to where they need it or they just fly it there without any passengers or employees only.

The second cluster is all of United’s long haul inter continental flights to South America, Europe and the Middle East. The 2 exceptions are the 2 really long domestic routes to Honolulu and Anchorage. There are also 2 outliers that are much further then any other destination, those being Tokyo Japan and Cape Town South Africa.

Main Visualization 3

Something else I wanted to explore is how international travel changed over the year, since our data is broken down by month. I wanted to see if all the countries generally followed the same pattern or if certain countries generated lots of ridership at different times.

First I created month labels for the graph since the months in the dataset are just numbers.

month_labels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

Then, using my already created international flights dataset from before, filtered for flights over 3000, which should get us all the intercontinental flights, that way we don’t have too many lines because of all the Central American and Caribbean destinations.

international2 <- international %>%
  filter(DISTANCE > 3000)

Calculate the total ridership by country by month to account for countries served by multiple airlines.

total_by_month <- international2 %>%
  group_by(MONTH, DEST_COUNTRY_NAME) %>%
  summarize(total_passengers = sum(PASSENGERS), .groups = "drop")

Now we can plot each countries ridership by month. We have a lot of countries, too many to manually assign their colors, so I used scale_color_hue. I still wasn’t super happy woth the results, so i fed the graph into ggplotly so you can easily figure out which country is which.

p <- ggplot(total_by_month, aes(x = MONTH, y = total_passengers, group = DEST_COUNTRY_NAME, color = DEST_COUNTRY_NAME)) +
  geom_line(size = 0.8) +  
  geom_point(size = 1.5) +
  labs(
    title = "International Passenger Travel by Month (by Country)",
    caption = "Source: Bureau of Transportation Statistics",
    x = "Month",
    y = "Number of Passengers",
    color = "Country" # Title for the color legend
  ) +
  scale_x_continuous(breaks = 1:12, labels = month_labels) +
  theme_bw() +
  #theme(legend.text = element_text(size = 6))
  scale_color_hue(l = 60, c = 80) # Use non-default, distinct colors

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

ggplotly(p)

It is interesting that we can see an expected bump in ridership during the summer months for all of the top destinations. But once you go down far enough (lower then Japan), the countries stay pretty flat all year, likely due to a small sample aize and a base latent demand. We can also See how the United Kingdom and Germany are much higher then the rest, likely due to not just that htey are popular destinations, but they have major connecting hubs to connect to final destinations in Europe and beyond.A couple other interesting countries include Turkey, whose ridership falls off a cliff after October, possibly due to schedule changes or cancelled services, and the UAE, which sees a major uptick in ridership in October and November while basically every other country sees falling ridership.

Main Visualization 4 (Statistical Analysis)

I first created a dataframe just to explore how many unique airlines were there for each route, which I graphed as a histogram.

routeCount <- iad %>%
  group_by(DEST_CITY_NAME) %>%
  summarise(numAirlines = n_distinct(CARRIER_NAME), 
    #total_flights = sum(DEPARTURES_PERFORMED),
    #total_seats = sum(SEATS),
    #total_passengers = sum(PASSENGERS),
    #seat_percentage = (total_seats / sum(total_seats) * 100)
    )

Exploratory plot:

p14 <- routeCount %>%
  ggplot(aes(x = numAirlines)) +
  geom_histogram(binwidth = 1) + 
  labs() +
  theme_minimal()

I then created a new dataframe where we aggregated the total departures, seats and passengers per airline per route. This is because the data is given to us per airline per route PER MONTH, so i had to add all the months together. I then decided to continue my analysis using only passengers, and not seats, since paying passengers are what make the airlines money, so you will not see any more usage of seats.

yearly_summary <- iad %>%
  group_by(DEST_CITY_NAME, CARRIER_NAME) %>%
  summarize(
    total_departures = sum(DEPARTURES_PERFORMED),
    total_seats = sum(SEATS),
    total_passengers = sum(PASSENGERS),
    .groups = 'drop' # THis removes the grouping structure from the result
  )

I then created another dataframe to calculate the total passengers my route, regardless of airline.

route_totals <- yearly_summary %>%
  group_by(DEST_CITY_NAME) %>%
  summarize(
    route_total_passengers = sum(total_passengers),
    .groups = 'drop'
  )

Here we calculate each airlines “market share” on each route by calculating what percentage of the passengers carried on that route was by each airline.

market_share <- yearly_summary %>%
  left_join(route_totals, by = "DEST_CITY_NAME") %>%
  mutate(
    market_share = (total_passengers / route_total_passengers) * 100
  ) #FILTER OUT LOWER DEPARTURE NUMBERS

Here I do the statistical analysis. Our hypothesis is that for routes with multiple airlines serving them, the share of seats should be evenly distributed among the airlines. Basically, I am assuming there is proper “competition” on every route. We first group by the destination, then calculate how many airlines so we know what our “ideal” market share is. If we find a route has only 1 airlines, we cannot do a chi-squared test and we ignore, setting the chi-squared test and p-value to NA as to not confuse or skew our data. I have also put in an edge case for 0 airlines on a route for robustness but that would not be reported in the data.

chi_squared_results <- market_share %>%
  group_by(DEST_CITY_NAME) %>%
  nest() %>%
  mutate(
    chi_squared_test = map(data, function(df) {
      observed <- df$total_passengers
      num_airlines <- length(observed)
      if (num_airlines > 1) { 
        test_result <- chisq.test(observed, p = rep(1/num_airlines, num_airlines))
        tibble(statistic = test_result$statistic,
               p.value = test_result$p.value,
               Notes = test_result$method)
      } else if (num_airlines == 1) {
        tibble(statistic = NA, p.value = NA, method = "Only one airline on this route")
      } else {
        tibble(statistic = NA, p.value = NA, method = "No airlines on this route")
      }
    })
  ) %>%
  unnest_wider(chi_squared_test) %>%
  select(DEST_CITY_NAME, statistic, p.value, Notes)

I then compared each p-value versus an alpha of 0.05 to determine if routes had higher p values, meaning they supported our hypothesis that the market share would be even. If the p-value was lower, that means that on that route we can reject the null hypothesis that the market share is evenly distributed.

alpha <- 0.05

hypothesis_fit <- chi_squared_results %>%
  mutate(
    fit_null = case_when(
      is.na(p.value) ~ "One Airline on Route",
      p.value > alpha ~ "Fits Null Hypothesis",
      TRUE ~ "Doesn't Fit Null Hypothesis"
    )
  )

ggplot(hypothesis_fit, aes(x = fit_null, fill = fit_null)) +
  geom_bar() +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  labs(
    title = "Number of Routes That Fit Our Hypothesis",
    caption = "Source: Bureau of Transportation Statistics",
    x = "Hypothesis Fit",
    y = "Number of Routes",
    fill = "Hypothesis Fit"
  ) +
  theme_bw() +
  theme(
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14)
  ) +
  scale_fill_manual(values = c("Fits Null Hypothesis" = "steelblue",
                               "Doesn't Fit Null Hypothesis" = "indianred", 
                               "One Airline on Route" = "seagreen"))

This shows that by our stated criteria, only 3 routes from Dulles are “competitive”, those 3 being Manchester New Hampshire, Sarasota Florida and South bend Indiana. Every other route either had vastly skewed market share or one 1 airline.

One of the reasons this might be the case is that the Baltimore/Washington area has 3 airports, IAD (Dulles), DCA (Reagan) and BWI. Often, airlines will focus on certain airports as “hubs”, where a significant amount of their operations in the area are based. Since the area has multiple airports, instead of competing at the same airport airlines will often mostly ignore other airports and focus on their hub in the region, only flying to the other airports from a few destinations (primarily their other hubs in the country) so that their national networks are complete enough for customers.

Another factor that can affect the data is that way the airlines are reported to the BTS and how airlines operate.The larger airlines (United, American and Delta) will contract out their smaller “regional” routes with other airlines, who will operate them under the larger airlines regional brand. For United this is “United Express”, Delta is “Delta Connection”, and American is “American Eagle”. This is not reported to the FAA and BTS, but to passengers these contracted out flights still fall under one of the major airlines when booking, navigating the airport or even looking at the livery of the plane.

One final small factor is that 2 of the New York City area airports (LGA and JFK) are labelled as the same destination city (New York City). I chose to use destination city instead of destination airport because it made reading the routes more digestible, and there was no other duplication, so it is possible (although unlikely given the data) that one of both of the New York airports are competitive if you separate them.

Usually when you do a chi-squared test, you want at least 5 datapoints. By the way I am calculating the data, this requirement is met by having at least 5 seats, but ideally you would want to filter for at least 5 departures per airline per route. I didn’t do this but it is an improvement i would pursue given more time.

Other studies could include looser criteria, such as multiple airlines having similar market share (instead of all on the route) or number of airlines reaching some market share threshold that is considered competitive instead of equal market share.

Citations:

U.S. Department of Transportation. “Essential Air Service.” US Department of Transportation, 22 June 2012, www.transportation.gov/policy/aviation-policy/small-community-rural-air-service/essential-air-service.

Wendover Productions. “The Design of Airline Route Networks.” YouTube, 1 May 2025, www.youtube.com/watch?v=sY7cQNx4Hg4. Accessed 2 May 2025.

admin. “Regional Airlines: What Are They & How Do They Operate?” AEROCADET | Articles & Reviews, 20 Aug. 2024, aerocadet.com/blog/regional-airlines-explained/.