Introduction:

I know that this assignment requires we use dplyr and tidyr, and I know that melt (from the reshape2 package) will be helpful in the process of combining all of the city columns in a column for the city name and a column for the frequency. Additionally, I will have to think of a creative solution to problem created with not all airlines names being missing from the airline name column. The point of this assignment is to work with the simple dataset provided, and I want to make sure that my work would still work with a larger dataset. I essentially do not want to be hardcoding anything, despite the fact that could be possible given the smaller size of this data set.

Loading necessary packages

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths

Name the two columns that were missing names

messy_flights <- read.csv('flight_delays.csv')

messy_flights
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61
messy_flights <- messy_flights %>%
  rename(Airline = X, Status = X.1, 
         'Los Angeles' = Los.Angeles, 
         'San Diego' = San.Diego,
         'San Francisco' = San.Francisco)

messy_flights
##   Airline  Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Filling in missing values for airlines

for (x in 1:nrow(messy_flights)) {
  if (messy_flights$Status[x] == 'delayed') {
    messy_flights$Airline[x] = messy_flights$Airline[x-1]
  }
}

Getting rid of the blank row

messy_flights <- messy_flights %>% drop_na()

Melting the city columns together

flights <- melt(messy_flights, flights = c(Airline, Status)) %>%
  rename(City = variable, Frequency = value)
## Using Airline, Status as id variables
knitr::kable(head(flights))
Airline Status City Frequency
ALASKA on time Los Angeles 497
ALASKA delayed Los Angeles 62
AM WEST on time Los Angeles 694
AM WEST delayed Los Angeles 117
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12

Delays by City

by_city <- flights %>%
  group_by(City) %>%
  summarise(percent_delayed = round(100 * sum(Frequency[Status == 'delayed'])*1.0/sum(Frequency),2), total_flights = sum(Frequency), delayed_flights = sum(Frequency[Status == 'delayed']))

knitr::kable(by_city)
City percent_delayed total_flights delayed_flights
Los Angeles 13.07 1370 179
Phoenix 7.78 5488 427
San Diego 12.50 680 85
San Francisco 21.92 1054 231
Seattle 15.20 2408 366

Delays by Airline

by_airline <- flights %>%
  group_by(Airline) %>%
  summarise(percent_delayed = round(100 * sum(Frequency[Status == 'delayed'])*1.0/sum(Frequency),2), total_flights = sum(Frequency), delayed_flights = sum(Frequency[Status == 'delayed']))

knitr::kable(by_airline)
Airline percent_delayed total_flights delayed_flights
ALASKA 13.27 3775 501
AM WEST 10.89 7225 787

Delays by Airline and City

by_airline_city <- flights %>%
  group_by(Airline, City) %>%
  summarise(percent_delayed = round(100 * sum(Frequency[Status == 'delayed'])*1.0/sum(Frequency),2), total_flights = sum(Frequency), delayed_flights = sum(Frequency[Status == 'delayed'])) %>%
  arrange(City)
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
knitr::kable(by_airline_city)
Airline City percent_delayed total_flights delayed_flights
ALASKA Los Angeles 11.09 559 62
AM WEST Los Angeles 14.43 811 117
ALASKA Phoenix 5.15 233 12
AM WEST Phoenix 7.90 5255 415
ALASKA San Diego 8.62 232 20
AM WEST San Diego 14.51 448 65
ALASKA San Francisco 16.86 605 102
AM WEST San Francisco 28.73 449 129
ALASKA Seattle 14.21 2146 305
AM WEST Seattle 23.28 262 61

Conclusion:

From my analysis, based solely on this dataset, a few things are clear. The first discovery I found is that San Francisco seems to experience the highest percentages of delays on all flights. Other cities had more delayed flights, but in terms of percent of total San Francisco was by clearly the most delayed airport, ALASKA experiences more delays, by percentage, than does AM WEST and lastly, think twice about combining AM WEST and San Francisco for your travel plans because there is a decent chance you will be delayed.