The task for this week’s assignment is to create and read in untidy data displaying arrival times for two airlines for five different cities. After reading in the data, it has to be transformed into a tidy format where analysis can be performed to compare the arrival times for each airline.
Messy data columns are variables not values. Data is in a wide format when you have more columns than rows.
flights<-read.csv('FlightInformation.csv')
head(flights,25)
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
The dataset clearly violates the principles of Tidy data. The columns for the cities are values rather than variables. Data for Status is duplicated and needs to be spread.
The code below creates a flights2 table from the original flights table and uses the pipe operator to string together each line of code. First, the gather function converts the aforementioned cities columns into one column containing a city, Additionally, the number of flights per city is captured in a new Flight_Count column. This is in keeping with principle II of tidy data where variables, not values are represented as columns.
Next, using the dplyr select function to select the Airline, Status, City, and Flight_Count, data from the Status and Flight_Count columns is spread to two new columns, delayed and OnTime. The end result is to get to principle V of tidy data: “V. An observation contains all values measured across all units.”
flights2 <- flights %>%
gather(City,Flight_Count,3:7) %>%
select(Airline, Status,City,Flight_Count) %>%
spread(Status,Flight_Count)
names(flights2)[4]<- "OnTime"
flights2
## Airline City delayed OnTime
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
Now that we have a dataset where a single obersvation contains all values measured across all units, Airline, City, Delayed, and OnTime. The name for OnTime had to be changed for the next section.
Using dplyr functions, the number of total flights, the percentage delayed and OnTime, and the difference between the OnTime and delayed percentages are added to the flights2 dataset.
flights2<-flights2 %>%
mutate(Total_Flights = delayed+OnTime) %>%
mutate(Delayed_PCT = delayed/Total_Flights)%>%
mutate(OnTime_PCT = OnTime/Total_Flights)%>%
mutate(Diff = OnTime_PCT - Delayed_PCT)
## Warning: package 'bindrcpp' was built under R version 3.5.1
flights2
## Airline City delayed OnTime Total_Flights Delayed_PCT
## 1 ALASKA Los.Angeles 62 497 559 0.11091234
## 2 ALASKA Phoenix 12 221 233 0.05150215
## 3 ALASKA San.Diego 20 212 232 0.08620690
## 4 ALASKA San.Francisco 102 503 605 0.16859504
## 5 ALASKA Seattle 305 1841 2146 0.14212488
## 6 AM WEST Los.Angeles 117 694 811 0.14426634
## 7 AM WEST Phoenix 415 4840 5255 0.07897241
## 8 AM WEST San.Diego 65 383 448 0.14508929
## 9 AM WEST San.Francisco 129 320 449 0.28730512
## 10 AM WEST Seattle 61 201 262 0.23282443
## OnTime_PCT Diff
## 1 0.8890877 0.7781753
## 2 0.9484979 0.8969957
## 3 0.9137931 0.8275862
## 4 0.8314050 0.6628099
## 5 0.8578751 0.7157502
## 6 0.8557337 0.7114673
## 7 0.9210276 0.8420552
## 8 0.8549107 0.7098214
## 9 0.7126949 0.4253898
## 10 0.7671756 0.5343511
Looking at delayed and OnTime raw numbers alone may not give the best insight into the data whereas the percentages give better insight and comparison between the two airlines.
filter(flights2, Diff== min(Diff) | Diff==max(Diff))
## Airline City delayed OnTime Total_Flights Delayed_PCT
## 1 ALASKA Phoenix 12 221 233 0.05150215
## 2 AM WEST San.Francisco 129 320 449 0.28730512
## OnTime_PCT Diff
## 1 0.9484979 0.8969957
## 2 0.7126949 0.4253898
For example, when looking at the minimum difference between OnTime/delayed percentages, we see that AM West to San Francisco has the smallest difference between being OnTime and Delayed. IOW, this AM West route to San Francisco has the highest delayed flights for any other airline on any other routes to the four citites. While the Alaska airlines route to Phoenix has the best OnTime performance for any other airline on any other routes.
summarise(flights2,Avg_Delay = mean(Delayed_PCT),
Variance = var(Delayed_PCT),
Standard_Deviation = sd(Delayed_PCT))
## Avg_Delay Variance Standard_Deviation
## 1 0.1447799 0.00513878 0.07168529
For all Airlines, the average percentage of delayed flights is 15%.
flights2 %>%
group_by(Airline) %>%
summarise(Avg_Delay = mean(Delayed_PCT),
Variance = var(Delayed_PCT),
Standard_Deviation = sd(Delayed_PCT))
## # A tibble: 2 x 4
## Airline Avg_Delay Variance Standard_Deviation
## <fct> <dbl> <dbl> <dbl>
## 1 ALASKA 0.112 0.00211 0.0459
## 2 AM WEST 0.178 0.00675 0.0821
When the average delayed is broken down by airline, we see that not only does AM West have higher average delays than Alaska, it also is higher than the average for both airlines.
flights2 %>%
group_by(City) %>%
summarise(Avg_Delay_Flights = mean(delayed),
Avg_Delay_Flights_by_PCT = mean(Delayed_PCT))
## # A tibble: 5 x 3
## City Avg_Delay_Flights Avg_Delay_Flights_by_PCT
## <chr> <dbl> <dbl>
## 1 Los.Angeles 89.5 0.128
## 2 Phoenix 214. 0.0652
## 3 San.Diego 42.5 0.116
## 4 San.Francisco 116. 0.228
## 5 Seattle 183 0.187
Finally, to confirm that percentages present a better insight than the raw numbers, we see that Seattle has a higher number of delayed flights, but in the context of the total number of flights per city, we see that San Francisco has the higher percentage of delayed flights. IOW, there’s a 23% chance that a flight into San Francisco will be delayed while there’s an 19% chance of a flight into Seattle will be delayed.
After tidying the data and doing an analysis we can clearly see that AM West airline has significantly more delays than does Alaska ariline. Converting the raw numbers to percentages allowed us to make better comparisons. The graphs below also support the conclusion. For each city, we see that AM West has a higher percentage of delayed flights over Alaska airlines.
ggplot(flights2, aes(fill=City, y=Delayed_PCT, x=Airline)) +
geom_bar(position="dodge", stat="identity") +
ggtitle("Comparison of Delayed Percentages between AM West and Alaska")
ggplot(flights2, aes(fill=City, y=OnTime_PCT, x=Airline)) +
geom_bar(position="dodge", stat="identity")+
ggtitle("Comparison of On Time Percentages between AM West and Alaska")