#Data 607: Tidying and transforming Data
initial chart
The chart above describes arrival delays for two airlines across five destinations. Your task is to:
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
I copied the provided data to a csv and stored it on my github account.
flight_data <- read.csv("https://raw.githubusercontent.com/georg4re/DS607/master/data/flight-hw5-data.csv", stringsAsFactors = FALSE)
head(flight_data)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AMWEST on time 694 4840 383 320 201
## 5 AMWEST delayed 117 415 65 129 61
flight_data <- flight_data %>%
rename(airline = X, status = X.1) %>%
fill(airline) %>%
na.omit()
flight_data
## airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AMWEST on time 694 4840 383 320 201
## 5 AMWEST delayed 117 415 65 129 61
Flights by Airline and city
flightsbyairline <- flight_data %>%
gather("city", "Number", 3:7) %>%
spread("status", "Number", 3:7) %>%
rename(on_time='on time')
## Warning in if (!is.na(fill)) {: the condition has length > 1 and only the first
## element will be used
## airline city delayed on_time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AMWEST Los.Angeles 117 694
## 7 AMWEST Phoenix 415 4840
## 8 AMWEST San.Diego 65 383
## 9 AMWEST San.Francisco 129 320
## 10 AMWEST Seattle 61 201
Now, with the data “normalized” we want to perform some calculations.
flightsbyairline <- flightsbyairline %>%
mutate(flights = delayed + on_time, onTimeAvg = (on_time/flights)*100) %>%
arrange(desc(onTimeAvg))
flightsbyairline
## airline city delayed on_time flights onTimeAvg
## 1 ALASKA Phoenix 12 221 233 94.84979
## 2 AMWEST Phoenix 415 4840 5255 92.10276
## 3 ALASKA San.Diego 20 212 232 91.37931
## 4 ALASKA Los.Angeles 62 497 559 88.90877
## 5 ALASKA Seattle 305 1841 2146 85.78751
## 6 AMWEST Los.Angeles 117 694 811 85.57337
## 7 AMWEST San.Diego 65 383 448 85.49107
## 8 ALASKA San.Francisco 102 503 605 83.14050
## 9 AMWEST Seattle 61 201 262 76.71756
## 10 AMWEST San.Francisco 129 320 449 71.26949
By calculating the Avg of on time flights and sorting the table from high to low average we can start to see that Alaska seems to have a better average per city than AMWest. Let’s see it in a histogram:
ggplot(flightsbyairline, aes(x = city, y = onTimeAvg, fill=airline)) +
geom_bar(stat="identity", position = position_dodge2())
flightsbycity2 <- flight_data %>%
gather("city", "Number", 3:7) %>%
spread("airline", "Number", 3:7) %>%
mutate(flights = ALASKA + AMWEST) %>%
select(-c(ALASKA, AMWEST)) %>%
spread(status, flights) %>%
rename(on_time="on time") %>%
mutate(flights = delayed + on_time, onTimeAvg = (on_time/flights)*100) %>%
arrange(desc(onTimeAvg))
## Warning in if (!is.na(fill)) {: the condition has length > 1 and only the first
## element will be used
## city delayed on_time flights onTimeAvg
## 1 Phoenix 427 5061 5488 92.21939
## 2 San.Diego 85 595 680 87.50000
## 3 Los.Angeles 179 1191 1370 86.93431
## 4 Seattle 366 2042 2408 84.80066
## 5 San.Francisco 231 823 1054 78.08349
When we group the data by city, we can see that the city with the best on time average is Phoenix. Let’s compare all cities in a graph.
ggplot(flightsbycity2, aes(x = reorder(city, -onTimeAvg), y = onTimeAvg, fill=city)) +
geom_bar(stat="identity", position = position_dodge2()) +
theme(axis.title.x = element_blank())
Irrespective to Airline, Phoenix is the airport with the best average of on time flights and San Francisco has the worst average of the cities compared. Alaska does better than AMWEST in each of the cities compared.
…