Exploratory Analysis of Flight Data

Exploration of Flight Data of the Department of Transportation

This is an exploration of data provided by the United States Department of Transportation (DOT) detailing the United States’ extensive airport system and flight patterns.

Setup

# Directory
setwd("C:/Users/bmolin/Documents/GitProject/Flights/")

# Libraries
library(ggplot2)
library(lattice)
library(ggmap)
library(OpenStreetMap)
library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

library(geosphere)

## Loading required package: sp

Import Data from the Department of Transportation Website

Flight data needs to be manually downloaded from here: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=258

Run the table with the following field selected. This analysis was conducted on 2016 data as of November 2016.

Variables:

Passengers
Unique Carrier
Unique Carrier Name
OriginAirportID
Origin
DestAirportID
Dest
Month

### Import Data

# Flights data
### This has to be manually downloaded from here: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=258
### This analysis was based on 2016 as of November 2016 with the following fields marked:
# Passengers
# Unique Carrier
# Unique Carrier Name
# OriginAirportID
# Origin
# DestAirportID
# Dest
# Month
flights_raw <- read.csv("140897895_T_T100D_MARKET_US_CARRIER_ONLY.csv")
flights <- flights_raw

To supplement our data, we need information on the aiports. Data on this is obtained in a seperate table from DOT.

This has to be manually downloaded from here: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=288&DB_Short_Name=Aviation%20Support%20Tables

This analysis was based on 2016 as of November 2016 with the following fields marked:

AirportSeqID
AirportID
Airport
AirportName
AirportCityName
AirportCountryName
AirportStateName
AirportStateFIPS
CityMarketID
CityMarketName
Latitude
Longitude
AirportIsLatest

airports_raw <- read.csv("140897895_T_MASTER_CORD.csv")
airports_raw$X <- NULL # Scrap column
airports <- airports_raw

Pre-Processing

We need to clean and do some partial merging so we can move nimbly between our primary and supplementary data sources.

Steps: 1. We remove any airports that aren’t in our flight data. The airport data covers a much broader set than the flight data does. 2. We limit the airport data to only include the most recent location. The airport data currently contains legacy information on airport sites, and we don’t care about that. 3. We’re going to be examining the origin point of trips quite frequently and we’ll want their 3 character ID in both datasets to make it easier to navigate between derived flights datasets and the airport dataset.

airports <- airports[airports$AIRPORT_ID %in% append(flights$ORIGIN_AIRPORT_ID, flights$DEST_AIRPORT_ID),]
airports <- airports[airports$AIRPORT_IS_LATEST == 1,]

# Restrict airports to United States less territories
airports <- airports[airports$LATITUDE > 12 & airports$LONGITUDE < -20,]
flights <- flights[flights$ORIGIN_AIRPORT_ID %in% airports$AIRPORT_ID &
                     flights$DEST_AIRPORT_ID %in% airports$AIRPORT_ID, ]

airports$Name <- flights[match(airports$AIRPORT_ID, flights$ORIGIN_AIRPORT_ID),"ORIGIN"]

Exploratory Analysis

Now that our datasets have been cleaned and indexed, we are going to analyze the data to understand flight patterns in the United States. Our ultimate goal is to understand what kinds of flights exist and which airports regularly shuttle traffic amongst each other.

The first thing we’ll do is examine the ranges of passengers being flown across different patterns. We get the distribution of passengers traveling on each route.

# Frequency Distribution of # Of Passengers
layout(matrix(1:2, ncol=2))
hist(flights$PASSENGERS, main = "Histogram of Flight Size", sub = "All Flights", xlab = "PASSENGERS")
hist(flights[flights$PASSENGERS <= 100, "PASSENGERS"], main = "Histogram of Flight Size", sub = "Flights Under 100 Passengers", xlab = "PASSENGERS")

There’s a large number of flights on the lower range of the spectrum. When we spot-check flights with 0 Passengers, we get some surprising results.

zeroflights <- sum(flights$PASSENGERS == 0)
print(paste("There are", zeroflights, "flight connections that have no passengers"))

## [1] "There are 36402 flight connections that have no passengers"

flights <- flights[flights$PASSENGERS != 0,]

There is evidently a lot of flights that don’t carry passengers at all! These must be freight-only routes. This analysis will cover passenger travels only, so we remove these from the dataset.

Now we’ll check for seasonality in the data. I expect there to be a noticeable uptick in flight passenger volume in the summer months for vacation and December for the winter holidays.

# Over Time
passengers_monthly <- aggregate(PASSENGERS ~ MONTH, data = flights, sum)
p <- ggplot(aes(x = MONTH, y = PASSENGERS/1000000), data = passengers_monthly) + geom_line()
p + ggtitle("Number of Passengers Flying by Month of Year") + labs(y="Passengers (in millions)")

Our dataset doesn’t cover December, but it largely confirms our hypothesis regarding summer months being high volume periods. Additionally, the first few months are incredibly low volume. This is likely a post-holiday lull as people enjoy being home alone - and I’m sure inclement weather takes some credit as well as it discourages travel due to fear of delays and cancellations.

Now let’s take note of the actors shuttling passengers around. I anticipate some smaller carriers are much more affected by seasonal trends; for example, nobody is traveling to or from the Cape in Massachussets during the winter, so Cape Air should have next to no traffic in the winter or early spring months.

To make monthly trends easier to read across differently sized airlines, we broke up our plot into four seperate plots with their own scales.

# Top Carriers
carriers_rank <- aggregate(PASSENGERS ~ UNIQUE_CARRIER_NAME, data = flights, sum)
carriers_rank$rank <- rank(-carriers_rank$PASSENGERS)
carriers_rank <- carriers_rank[order(carriers_rank$rank),]
row.names(carriers_rank) <- NULL
head(carriers_rank)

##      UNIQUE_CARRIER_NAME PASSENGERS rank
## 1 Southwest Airlines Co.  136325736    1
## 2   Delta Air Lines Inc.  109770596    2
## 3 American Airlines Inc.  107360666    3
## 4  United Air Lines Inc.   67490484    4
## 5        JetBlue Airways   28524751    5
## 6  SkyWest Airlines Inc.   26722770    6

p <- ggplot(aes(x = reorder(UNIQUE_CARRIER_NAME, -PASSENGERS), y = PASSENGERS),
            data = carriers_rank[carriers_rank$rank <= 20,]) + geom_bar(stat = 'identity')
p <- p + theme(axis.text.x = element_text(angle = 90)) + ggtitle("Top Carriers by Passengers") + labs(x="Carrier")

# Top Carriers by Month
carriers_rank_month <- aggregate(PASSENGERS ~ UNIQUE_CARRIER_NAME + MONTH, data = flights, sum)
monthlymax <- aggregate(PASSENGERS ~ UNIQUE_CARRIER_NAME, data = carriers_rank_month, max)
monthlymax <- within(monthlymax, quartile <- as.integer(cut(PASSENGERS, quantile(PASSENGERS, probs=0:4/4), include.lowest=TRUE)))
carriers_rank_month$quartile <- monthlymax[match(carriers_rank_month$UNIQUE_CARRIER_NAME, monthlymax$UNIQUE_CARRIER_NAME),"quartile"]
layout(matrix(1:4, ncol=2, nrow = 2))
for (i in 1:4) {
  chartname <- paste0('Carriers by Month Quartile ', i)
  print(lattice::xyplot(PASSENGERS ~ MONTH | UNIQUE_CARRIER_NAME,
                        data = carriers_rank_month[carriers_rank_month$quartile == i,],
                        type = 'l',
                        par.strip.text=list(cex=.5),
                        layout=c(4,7),
                        main = chartname))
}

Indeed, a number of small carriers seem to almost exclusively operate in the summer months.

Now let’s look at how far people typically travel

# Distance between flights
origin_coordinates <- airports[match(flights$ORIGIN_AIRPORT_ID, airports$AIRPORT_ID), c("LONGITUDE", "LATITUDE")]
colnames(origin_coordinates) <- c("OLongitude", "OLatitude")
dest_coordinates <- airports[match(flights$DEST_AIRPORT_ID, airports$AIRPORT_ID), c("LONGITUDE", "LATITUDE")]
colnames(dest_coordinates) <- c("DLongitude", "DLatitude")
flights <- cbind(flights, origin_coordinates)
flights <- cbind(flights, dest_coordinates)

flights$distance <- with(flights, distHaversine(cbind(OLongitude, OLatitude), cbind(DLongitude, DLatitude), r=3959)) # In miles
hist(flights$distance/100, main = "Histogram of Flight Connections by Distance", breaks = 100, xlab = "Distance in Miles * 100")

dist_traveled <- aggregate(PASSENGERS ~ floor(distance/100), data = flights, sum)
colnames(dist_traveled) <- c("distance", "PASSENGERS")
p <- ggplot(aes(x = distance, y = PASSENGERS), data = dist_traveled) + geom_bar(stat = 'identity')
p + ggtitle("Histogram of Passenger Miles Traveled") + labs(x = "Distance in Miles * 100")

Most passengers seem to travel about 300 to 1000 miles, which is as expected. There is a decent amount of traveling going on below 100 miles though, which surprises me given a typical commute via car would cover that in 1-3 hours.

shortflights_state <- as.data.frame(table(airports[airports$Name %in% flights[flights$distance < 100, "ORIGIN"],"AIRPORT_STATE_NAME"]))
colnames(shortflights_state) <- c("State", "Connections")
p <- ggplot(aes(x = State, y = Connections), data = shortflights_state[shortflights_state$Connections != 0,]) + geom_bar(stat = 'identity')
p + theme(axis.text.x = element_text(angle = 90)) + ggtitle("Number of Flight Connections Under 100 Miles by Origin State")

The culprit here is flights in Alaska. Evidently there are enough places that you can’t travel between via land that easily.

Given that information, we wonder if there are higher concentrations of airports where it’s harder to travel via car. Harsher terrains like Alaska and traffic-heavy areas like Southern California and the Northeast are probably more saturated with airports, given the cost of commuting by land.

## Mapping

# Map all the airports
# Define Map parameters
bbox2 <- make_bbox(airports$LONGITUDE, airports$LATITUDE, f = .00001)

# Create Map
map <- openmap(c(bbox2[4], bbox2[1]), c(bbox2[2], bbox2[3]), type = "esri")
map <- openproj(map)

# All Airports
# Plot
p0 <- autoplot(map)
# Plot Points
p <- p0 + geom_point(
  aes(x = LONGITUDE, y = LATITUDE),
  data = airports,
  alpha = 0.8,
  size = 0.5
) + ggtitle("Airports in the USA")
p

p0 + stat_density2d(
  aes(x = LONGITUDE, y = LATITUDE, fill = ..level.., alpha = ..level..),
  bins = 6, data = airports,
  geom = "polygon"
) + ggtitle("Density of Airports in the USA")

# Adjust by # Of Passengers
passengers_out <- aggregate(PASSENGERS ~ ORIGIN_AIRPORT_ID, data = flights, sum)
passengers_out_airports <- merge(passengers_out, airports, by.x = 'ORIGIN_AIRPORT_ID', by.y = 'AIRPORT_ID')
passengers_out_airports <- passengers_out_airports[passengers_out_airports$PASSENGERS != 0,]
passengers_out_airports$PASSENGERS <- log(passengers_out_airports$PASSENGERS)
passengers_out_airports_long <- passengers_out_airports[rep(row.names(passengers_out_airports), passengers_out_airports$PASSENGERS), c(1,3:14)]

p0 + stat_density2d(
  aes(x = LONGITUDE, y = LATITUDE, fill = ..level.., alpha = ..level..),
  bins = 6, data = passengers_out_airports_long,
  geom = "polygon"
) + ggtitle("Density of Travelers by Origin Airport")

Indeed, our hypothesis here is confirmed, although I did not give enough credit to the Mid-Atlantic and Rust Belt regions.

Now we want to understand how these airports interact with each other. When we study flight patterns, can we identify communities of airports? These communities would allow us to make important judgements on such things as:

Where a contagion may spread, given a starting location
How do widespread delays in one place impact the schedules of others?
Where would you find a more diverse population, and where would those people be from?

This is a large network, so we’re going to take a broad look at the whole network, form our commmunities using a walktrap analysis, and then focus our network chart on more trafficked connections.

The color of the nodes correspond to the community each is grouped into.

## PLOT NETWORKS

# Create Network data
networks <- aggregate(PASSENGERS ~ ORIGIN + DEST, data = flights, sum)
networks <- networks[as.character(networks$ORIGIN) != as.character(networks$DEST),]
colnames(networks) <- c("Source", "Target", "weight")

edges <- as.matrix(networks)
g0 <- graph.data.frame(cbind(edges, attr), directed=TRUE)

## Layout
# Community
comm <- walktrap.community(g0)
membership <- membership(comm)
V(g0)$member <- membership
E(g0)$index <- E(g0)$weight

palette(rainbow(44))
for (i in 0:6) {
  
  wmin <- 10^i
  
  # Remove edges not meeting the condition and then remove isolated vertices
  g <- g0
  g <- delete.edges(g, which(E(g)$index < wmin))
  g <- delete.vertices(g,which(degree(g) < 1))  
  
  # Vertex
  V(g)$btw <- betweenness(g)
  V(g)$size = 3 + 6*as.numeric(V(g)$btw)/max(as.numeric(V(g)$btw))
  V(g)$color = palette()[V(g)$member + 1]
  V(g)$frame.color = NA
  V(g)$label.cex = .35
  if(i >= 5) {
  V(g)$label = V(g)$name
  } else {
    V(g)$label = NA
  }
  # Edge
  E(g)$weight <- log(as.numeric(E(g)$index))
  E(g)$width <- as.numeric(E(g)$weight)/max(as.numeric(E(g)$weight))
  E(g)$arrow.size <- .01
  
  rbPal <- colorRampPalette(c('gainsboro','black'))
  E(g)$color <- rbPal(10)[as.numeric(cut(as.numeric(E(g)$weight),breaks = 10))]
  
  #layout1 <- layout.fruchterman.reingold(g)
  
  # Plot
  plot.igraph(g, main = paste("Network Of Airports Sending Trips of", wmin, "Passengers or More"))
}

We see an abundance of members of communities 6 and 10 at the center of giant clusters of network nodes, with a few vectors sharing only one edge with that central cluster or completely detached from the network altogether. Once we zoom into the network of largest sites, we can see that the members of communities 6 and 10 are some of our largest national airports (LAX, JFK, ATL, etc). It’s interesting to note that LAX shares more strong connection to JFK than Atlanta does, despite Atlanta’s size and closer geographic proximity. This says a lot about the strong ties between Southern California and the Northeast, that their connections are stronger than those even in their own communities.

We’ll show the network maps of all the communities to see how each of the members relate to the other. We expect there to be a lot of connections between each node, which guided the algorithm that formed the communities.

# Community Analysis
members <- as.data.frame(table(membership))
loners <- sum(members$Freq == 1)
print(paste("There are", loners, "Airports That Don't Belong to a Community"))

## [1] "There are 475 Airports That Don't Belong to a Community"

# 475 Airports not placed into a community
members2 <- members[members$Freq != 1,]
ggplot(aes(x = membership, y = Freq), data = members2) + geom_bar(stat = 'identity') + ggtitle("Community Size")

for (i in members[members$Freq != 1,"membership"]) {
  
  mem <- i
  
  # Remove edges not meeting the condition and then remove isolated vertices
  g <- g0
  g <- delete.vertices(g,which(as.integer(V(g)$member) != mem))
  #  g <- delete.edges(g, which(as.integer(V(g)$member) != mem))
  
  
  # Vertex
  V(g)$size = 8
  V(g)$color = 'white'
  V(g)$frame.color = NA
  V(g)$label.cex = .75
  
  # Edge
  E(g)$weight <- log(as.numeric(E(g)$index))
  E(g)$width <- as.numeric(E(g)$weight)/max(as.numeric(E(g)$weight))
  E(g)$arrow.size <- .01
  
  rbPal <- colorRampPalette(c('gainsboro','black'))
  E(g)$color <- ifelse(E(g)$weight <= 1, "black", rbPal(10)[as.numeric(cut(as.numeric(E(g)$weight),breaks = 10))])
  
  #layout1 <- layout.fruchterman.reingold(g)
  
  # Plot
  plot.igraph(g, main = paste("Network of Community", mem))
  }

We’ll compare community sizes and see if the # of passengers traveling out of each community roughly corresponds to the community size.

flights$origin_community <- factor(membership[match(flights$ORIGIN, names(membership))])

community_passengers <- aggregate(PASSENGERS ~ origin_community, data = flights, sum)
p <- ggplot(aes(y = PASSENGERS, x = origin_community), data = community_passengers[community_passengers$origin_community %in% members2$membership,]) + geom_bar(stat = 'identity')
p + ggtitle("Number of Passengers sent in a Community")

# Communities 6 and 10 have the most members and drive the most traffic
airports$member <- flights[match(airports$Name, flights$ORIGIN), "origin_community"]

Finally, we’ll plot the airports again, but coloring the airports by their community so we can compare their geographic spread.

### Map All Airports in Top Communities
members3 <- members[members$Freq >= 5,]
# Plot
p0 <- autoplot(map)
# Plot Points
p <- p0 + geom_point(
  aes(x = LONGITUDE, y = LATITUDE, color = member),
  data = airports[airports$member %in% members3$membership,],
  alpha = 0.8,
  size = 1
) + ggtitle("Map of Communities")
p

### Map Non Community Airports
# Plot
p0 <- autoplot(map)
# Plot Points
p <- p0 + geom_point(
  aes(x = LONGITUDE, y = LATITUDE),
  data = airports[is.na(airports$member),],
  alpha = 0.8,
  size = 1
) + ggtitle("Map of Non Communities")
p

The mainland appears to be divided into East and West communities, with the West pulling in a number of airports around the Rust Belt. Hawaii appears to have its own community of airports seperate from anyone else, while Alaska has numerous tiny communities of airports, which speaks to a highly fractured airport system.

Those airports that don’t fall into any communities tend to fall mostly in the South and, surprisingly, on the West coast.