Introduction:
I know that this assignment requires we use dplyr and tidyr, and I
know that melt (from the reshape2 package) will be helpful in the
process of combining all of the city columns in a column for the city
name and a column for the frequency. Additionally, I will have to think
of a creative solution to problem created with not all airlines names
being missing from the airline name column. The point of this assignment
is to work with the simple dataset provided, and I want to make sure
that my work would still work with a larger dataset. I essentially do
not want to be hardcoding anything, despite the fact that could be
possible given the smaller size of this data set.
Loading necessary packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
Name the two columns that were missing names
messy_flights <- read.csv('flight_delays.csv')
messy_flights
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
messy_flights <- messy_flights %>%
rename(Airline = X, Status = X.1,
'Los Angeles' = Los.Angeles,
'San Diego' = San.Diego,
'San Francisco' = San.Francisco)
messy_flights
## Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Filling in missing values for airlines
for (x in 1:nrow(messy_flights)) {
if (messy_flights$Status[x] == 'delayed') {
messy_flights$Airline[x] = messy_flights$Airline[x-1]
}
}
Getting rid of the blank row
messy_flights <- messy_flights %>% drop_na()
Melting the city columns together
flights <- melt(messy_flights, flights = c(Airline, Status)) %>%
rename(City = variable, Frequency = value)
## Using Airline, Status as id variables
knitr::kable(head(flights))
| ALASKA |
on time |
Los Angeles |
497 |
| ALASKA |
delayed |
Los Angeles |
62 |
| AM WEST |
on time |
Los Angeles |
694 |
| AM WEST |
delayed |
Los Angeles |
117 |
| ALASKA |
on time |
Phoenix |
221 |
| ALASKA |
delayed |
Phoenix |
12 |
Delays by City
by_city <- flights %>%
group_by(City) %>%
summarise(percent_delayed = round(100 * sum(Frequency[Status == 'delayed'])*1.0/sum(Frequency),2), total_flights = sum(Frequency), delayed_flights = sum(Frequency[Status == 'delayed']))
knitr::kable(by_city)
| Los Angeles |
13.07 |
1370 |
179 |
| Phoenix |
7.78 |
5488 |
427 |
| San Diego |
12.50 |
680 |
85 |
| San Francisco |
21.92 |
1054 |
231 |
| Seattle |
15.20 |
2408 |
366 |
Delays by Airline
by_airline <- flights %>%
group_by(Airline) %>%
summarise(percent_delayed = round(100 * sum(Frequency[Status == 'delayed'])*1.0/sum(Frequency),2), total_flights = sum(Frequency), delayed_flights = sum(Frequency[Status == 'delayed']))
knitr::kable(by_airline)
| ALASKA |
13.27 |
3775 |
501 |
| AM WEST |
10.89 |
7225 |
787 |
Delays by Airline and City
by_airline_city <- flights %>%
group_by(Airline, City) %>%
summarise(percent_delayed = round(100 * sum(Frequency[Status == 'delayed'])*1.0/sum(Frequency),2), total_flights = sum(Frequency), delayed_flights = sum(Frequency[Status == 'delayed'])) %>%
arrange(City)
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
knitr::kable(by_airline_city)
| ALASKA |
Los Angeles |
11.09 |
559 |
62 |
| AM WEST |
Los Angeles |
14.43 |
811 |
117 |
| ALASKA |
Phoenix |
5.15 |
233 |
12 |
| AM WEST |
Phoenix |
7.90 |
5255 |
415 |
| ALASKA |
San Diego |
8.62 |
232 |
20 |
| AM WEST |
San Diego |
14.51 |
448 |
65 |
| ALASKA |
San Francisco |
16.86 |
605 |
102 |
| AM WEST |
San Francisco |
28.73 |
449 |
129 |
| ALASKA |
Seattle |
14.21 |
2146 |
305 |
| AM WEST |
Seattle |
23.28 |
262 |
61 |
Conclusion:
From my analysis, based solely on this dataset, a few things are
clear. The first discovery I found is that San Francisco seems to
experience the highest percentages of delays on all flights. Other
cities had more delayed flights, but in terms of percent of total San
Francisco was by clearly the most delayed airport, ALASKA experiences
more delays, by percentage, than does AM WEST and lastly, think twice
about combining AM WEST and San Francisco for your travel plans because
there is a decent chance you will be delayed.