This doc can be found at rpubs and the input data files and .rmd file can be found on my github here: Flights, Weather. .rmd
First I’ll clean up my work space:
rm(list = ls())Then we’ll load all the packages that we’re going to need for this exercise:
library(tidyr)
library(dplyr)
library(ggplot2)Next we’ll load the data from a CSV file. On an initial attempt I had manually cleaned up the data directly in the CSV file by filling in column headers and adding the airline value in the “delayed” rows, however, based on some of the comments in the meetup this past week, I’ve decided to try to keep scalabilty in mind for this entire exercise. As such, I decided to keep the data messy and try to come up with a scalable programatic solution to these issues. I also tried to create code that will work “as is” if the number of cities or airlines contained in the data is increased. Let’s load the data, clean up the colNames and take a look:
flight.info <- read.csv("https://raw.githubusercontent.com/plb2018/DATA607/master/flight_info.csv",
header=TRUE,
stringsAsFactors = FALSE)
colnames(flight.info)[1:2] = c("Airline","Status")
flight.info## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 delayed 117 415 65 129 61
The data looks good, but we need to come up with a way to forward-fill the airline names into the “delayed” column - preferrable a way that would work as well on 40K rows as it does on these 4 rows. It looks like the tidyr library provides an simple way to do this. First, I’m going to convert the blanks to NAs, then I’ll use the fill() function to forward fill the “Airline” column:
is.na(flight.info) <- flight.info==''
flight.info <- flight.info %>% fill(Airline)Now we’ll tidy the data using the gather() function from tidyr. Rather than specify a specific column range, when calling gather(), I’ve decided to exclude “Airlines” and “Status”. This way, the code should scale as-is if more cities are added to the CSV file (i.e. if the data gets “wider”).
flight.tidy <- gather(flight.info,"City","FlightCount",-Airline,-Status)
head(flight.tidy,5)## Airline Status City FlightCount
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
Our data is now tidy - time for some analysis!
First we’ll take a 30K ft. view (lame pun intended) of the data. What do the raw numbers look like? What’s the breakdown in terms of number of flights per airline? Who’s delayed more frequently?
flights.total <- flight.tidy %>%
group_by(Airline) %>%
summarise(flights = sum(FlightCount))
flights.total$proportion <- flights.total$flights / sum(flights.total$flights)
flights.total## # A tibble: 2 x 3
## Airline flights proportion
## <chr> <int> <dbl>
## 1 ALASKA 3775 0.343
## 2 "AM WEST " 7225 0.657
ggplot(flights.total,aes(x=Airline,y=flights)) +
geom_bar(stat="identity",color = "Red", fill = "White") +
ggtitle("Raw Flight Count by Airline") +
theme(plot.title = element_text(hjust = 0.5))ggplot(flights.total,aes(x=Airline,y=proportion)) +
geom_bar(stat="identity",color = "Red", fill = "White") +
ggtitle("Proportion of Total Flights by Airline") +
ylab("Proportion") +
theme(plot.title = element_text(hjust = 0.5)) We can see that there are about 11K flights in total, with about 65% being AM WEST and the remaining 35% being ALASKA.
Now that we have the total flights for each airline, we’ll use it to compute the proportion of delayed flights per airline.
flights.delayed <- flight.tidy %>%
group_by(Airline,Status) %>%
summarise(flights = sum(FlightCount)) %>%
filter(Status == "delayed")
flights.total$delayed <- flights.delayed$flightsA quick numerical look at the proportion of delayed flights for ALASKA and AM West, respectively:
flights.total$delayed / flights.total$flights## [1] 0.1327152 0.1089273
And a visual Look
ggplot(flights.total,aes(x=Airline,y=delayed/flights)) +
geom_bar(stat="identity",fill="Blue") +
ggtitle("Proportion of Delayed Flights by Airline ") +
ylab("proportion") +
theme(plot.title = element_text(hjust = 0.5))We see here that ALASKA has a higher proportion of delayed flights than AM WEST. We probably need data for several more airlines before we can make a determination as to whether this difference is significant.
Next let’s look at the market-share for each airline in each market. Here we consider the proportion of flights for each airline relative to the total number of flights in that market.
flights.city <- flight.tidy %>%
group_by(City) %>%
summarise(flights = sum(FlightCount)) First a quick look at the total number of flights by city:
flights.city## # A tibble: 5 x 2
## City flights
## <chr> <int>
## 1 Los.Angeles 1370
## 2 Phoenix 5488
## 3 San.Diego 680
## 4 San.Francisco 1054
## 5 Seattle 2408
ggplot(flights.city, aes(x = City, y=flights / sum(flights))) +
geom_bar(stat="identity",position="dodge",fill="Red") +
ylab("proportion") +
ggtitle("Proportion of Flights by City") +
theme(plot.title = element_text(hjust = 0.5))Then we look at the proportional market share of each airline by City.
flights.market <- flight.tidy %>%
group_by(City,Airline) %>%
summarise(flights = sum(FlightCount)) %>%
ungroup() %>%
spread(Airline, flights, fill=0)
flights.marketProp <- flights.market
flights.marketProp[2:ncol(flights.market)] <- flights.market[2:ncol(flights.market)] / rowSums(flights.market[2:ncol(flights.market)])
flights.marketProp <- gather(flights.marketProp,"Airline","Proportion",-City)
head(flights.marketProp,5)## # A tibble: 5 x 3
## City Airline Proportion
## <chr> <chr> <dbl>
## 1 Los.Angeles ALASKA 0.408
## 2 Phoenix ALASKA 0.0425
## 3 San.Diego ALASKA 0.341
## 4 San.Francisco ALASKA 0.574
## 5 Seattle ALASKA 0.891
ggplot(flights.marketProp, aes(x = Airline, y=Proportion, fill = City)) +
geom_bar(stat="identity",position="dodge") +
xlab("Airlines") +
ylab("proportion") +
ggtitle("Proportional Market Share by Airline and City") +
theme(plot.title = element_text(hjust = 0.5))Here we see that ALASKA is dominant in the more northerly cities, particularly Seattle, whereas AM WEST is dominant in the more southerly cities, particularly Phoenix.
Next we’ll look at the proportion of delayed flights by airline and city.
flights.marketDelay <- flight.tidy %>%
group_by(City,Airline) %>%
filter(Status == "delayed") %>%
summarise(flights = sum(FlightCount)) %>%
ungroup() %>%
spread(Airline, flights, fill=0)
flights.marketDelayProp <- flights.marketDelay
flights.marketDelayProp[2:ncol(flights.market)] <- flights.marketDelay[2:ncol(flights.market)] / flights.market[2:ncol(flights.market)]
flights.marketDelayProp <- gather(flights.marketDelayProp,"Airline","Proportion",-City)
head(flights.marketDelayProp,5)## # A tibble: 5 x 3
## City Airline Proportion
## <chr> <chr> <dbl>
## 1 Los.Angeles ALASKA 0.111
## 2 Phoenix ALASKA 0.0515
## 3 San.Diego ALASKA 0.0862
## 4 San.Francisco ALASKA 0.169
## 5 Seattle ALASKA 0.142
flights.marketDelayProp %>%
group_by(Airline) %>%
summarise(proportion = mean(Proportion)) ## # A tibble: 2 x 2
## Airline proportion
## <chr> <dbl>
## 1 ALASKA 0.112
## 2 "AM WEST " 0.178
ggplot(flights.marketDelayProp, aes(x = Airline, y=Proportion, fill = City)) +
geom_bar(stat="identity",position="dodge") +
xlab("Airlines") +
ylab("proportion") +
ggtitle("Proportional of Delayed Flights by Airline and Market") +
theme(plot.title = element_text(hjust = 0.5))ggplot(flights.marketDelayProp, aes(x = Airline, y=1-Proportion, fill = City)) +
geom_bar(stat="identity",position="dodge") +
xlab("Airlines") +
ylab("proportion") +
ggtitle("Proportional of On-Time Flights by Airline and Market") +
theme(plot.title = element_text(hjust = 0.5))If we look at the breakdown by City we see that San Francisco and Seattle represent the highest proportion of delays and my suspicion is that this is probably weather-related. I found some NOAA Rainfall stats here which I copied an pasted into a CSV file to explore this idea a little bit.
rain.info <- read.csv("https://raw.githubusercontent.com/plb2018/DATA607/master/NOAA_Rainfall.csv",
header=TRUE,
stringsAsFactors = FALSE)
#replace spaces in city names w/ dots to match our pre-existing data.
rain.info$City <- chartr(" ", ".",rain.info$City)
head(rain.info,10)## Days City State Inches Millimeters
## 1 164 Portland Oregon 43.5 1104
## 2 125 Providence Rhode Island 47.2 1198
## 3 100 Raleigh North Carolina 46.0 1169
## 4 114 Richmond Virginia 43.6 1107
## 5 30 Riverside California 10.3 262
## 6 167 Rochester New York 34.3 870
## 7 60 Sacramento California 18.5 470
## 8 96 Salt.Lake.City Utah 16.1 409
## 9 83 San.Antonio Texas 32.3 820
## 10 42 San.Diego California 10.3 263
The rainfall data looks good, so now we’ll merge it with our flight-delay data:
rain.info <- left_join(flights.marketDelayProp,rain.info) ## Joining, by = "City"
head(rain.info,5)## # A tibble: 5 x 7
## City Airline Proportion Days State Inches Millimeters
## <chr> <chr> <dbl> <int> <chr> <dbl> <int>
## 1 Los.Angeles ALASKA 0.111 36 California 12.8 326
## 2 Los.Angeles ALASKA 0.111 36 California 12.8 326
## 3 Phoenix ALASKA 0.0515 30 Arizona 8.20 208
## 4 Phoenix ALASKA 0.0515 30 Arizona 8.20 208
## 5 San.Diego ALASKA 0.0862 42 California 10.3 263
The data is merged so now we can take a quick look and see if there is any relationship between weather and flight delays. First we’ll check and see if the number of rain-days shows any relationship to delays, then we’ll look at whethere there is any relationship between the quantity of precipitation and flight delays.
ggplot(rain.info, aes(x = Proportion, y=Days, color = City)) +
geom_point( size = 5) +
xlab("Proportion of Flight Delays") +
ylab("# Of Annual Precipitation Days") +
ggtitle("Flight Delays vs. Precipitation Days by City") +
theme(plot.title = element_text(hjust = 0.5))ggplot(rain.info, aes(x = Proportion, y=Inches, color = City)) +
geom_point( size = 5) +
xlab("Proportion of Flight Delays") +
ylab("Inches Of Annual Precipitation Days") +
ggtitle("Flight Delays vs. Inches of Precipitation by City") +
theme(plot.title = element_text(hjust = 0.5))In both cases above, we see evidence of a relationship between precipitation and and flight delays, which makes intuitive sense. Phoenix has the fewest delays and the least bad weather whereas Seattle has the most delays and the most bad weather. Looking at the flight delay data alone may have lead one to conclude that ALASKA was less efficient and more delay prone (percentage-wise) than AM WEST. The addition of the weather data seems to suggest that the City matters and that because ALASKA flys more flights to cities with poor weather, particularly Seattle, they experience proportionally more delays. In fact, the frequency of delays in the poor-weather markets (Seattle & SF) is significantly lower for ALASKA than for AM WEST, however AM WEST’s overall delay numbers are only better because they fly more flights to Phoenix. Maybe ALASKA’s pilots are better in the SF & seattle fog :)