DATA607 Fall 2018 Week Five - Tidying and Transforming Data

Introduction

The task for this week’s assignment is to create and read in untidy data displaying arrival times for two airlines for five different cities. After reading in the data, it has to be transformed into a tidy format where analysis can be performed to compare the arrival times for each airline.

Principles of Tidy Data

  1. Observations are represented as rows.
  2. Variables are represented as columns.
  3. One type of observation per unit table. A dataset is a collection of values
  4. Each value is an intersection between an observation and a variable. Variables measures the same value across units.
  5. An observation contains all values measured across all units.

Messy data columns are variables not values. Data is in a wide format when you have more columns than rows.

Read in the Flights Information dataset.

flights<-read.csv('FlightInformation.csv')
head(flights,25)
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

Analysis Part One:

The dataset clearly violates the principles of Tidy data. The columns for the cities are values rather than variables. Data for Status is duplicated and needs to be spread.

Tidying:

The code below creates a flights2 table from the original flights table and uses the pipe operator to string together each line of code. First, the gather function converts the aforementioned cities columns into one column containing a city, Additionally, the number of flights per city is captured in a new Flight_Count column. This is in keeping with principle II of tidy data where variables, not values are represented as columns.

Next, using the dplyr select function to select the Airline, Status, City, and Flight_Count, data from the Status and Flight_Count columns is spread to two new columns, delayed and OnTime. The end result is to get to principle V of tidy data: “V. An observation contains all values measured across all units.”

flights2 <- flights %>% 
gather(City,Flight_Count,3:7) %>% 
select(Airline, Status,City,Flight_Count) %>% 
spread(Status,Flight_Count)

names(flights2)[4]<- "OnTime" 


flights2
##    Airline          City delayed OnTime
## 1   ALASKA   Los.Angeles      62    497
## 2   ALASKA       Phoenix      12    221
## 3   ALASKA     San.Diego      20    212
## 4   ALASKA San.Francisco     102    503
## 5   ALASKA       Seattle     305   1841
## 6  AM WEST   Los.Angeles     117    694
## 7  AM WEST       Phoenix     415   4840
## 8  AM WEST     San.Diego      65    383
## 9  AM WEST San.Francisco     129    320
## 10 AM WEST       Seattle      61    201

Now that we have a dataset where a single obersvation contains all values measured across all units, Airline, City, Delayed, and OnTime. The name for OnTime had to be changed for the next section.

Analysis Part Two:

Using dplyr functions, the number of total flights, the percentage delayed and OnTime, and the difference between the OnTime and delayed percentages are added to the flights2 dataset.

flights2<-flights2 %>% 
mutate(Total_Flights = delayed+OnTime) %>%      
mutate(Delayed_PCT = delayed/Total_Flights)%>% 
mutate(OnTime_PCT = OnTime/Total_Flights)%>% 
mutate(Diff = OnTime_PCT - Delayed_PCT)
## Warning: package 'bindrcpp' was built under R version 3.5.1
flights2
##    Airline          City delayed OnTime Total_Flights Delayed_PCT
## 1   ALASKA   Los.Angeles      62    497           559  0.11091234
## 2   ALASKA       Phoenix      12    221           233  0.05150215
## 3   ALASKA     San.Diego      20    212           232  0.08620690
## 4   ALASKA San.Francisco     102    503           605  0.16859504
## 5   ALASKA       Seattle     305   1841          2146  0.14212488
## 6  AM WEST   Los.Angeles     117    694           811  0.14426634
## 7  AM WEST       Phoenix     415   4840          5255  0.07897241
## 8  AM WEST     San.Diego      65    383           448  0.14508929
## 9  AM WEST San.Francisco     129    320           449  0.28730512
## 10 AM WEST       Seattle      61    201           262  0.23282443
##    OnTime_PCT      Diff
## 1   0.8890877 0.7781753
## 2   0.9484979 0.8969957
## 3   0.9137931 0.8275862
## 4   0.8314050 0.6628099
## 5   0.8578751 0.7157502
## 6   0.8557337 0.7114673
## 7   0.9210276 0.8420552
## 8   0.8549107 0.7098214
## 9   0.7126949 0.4253898
## 10  0.7671756 0.5343511

Looking at delayed and OnTime raw numbers alone may not give the best insight into the data whereas the percentages give better insight and comparison between the two airlines.

filter(flights2, Diff== min(Diff) | Diff==max(Diff))
##   Airline          City delayed OnTime Total_Flights Delayed_PCT
## 1  ALASKA       Phoenix      12    221           233  0.05150215
## 2 AM WEST San.Francisco     129    320           449  0.28730512
##   OnTime_PCT      Diff
## 1  0.9484979 0.8969957
## 2  0.7126949 0.4253898

For example, when looking at the minimum difference between OnTime/delayed percentages, we see that AM West to San Francisco has the smallest difference between being OnTime and Delayed. IOW, this AM West route to San Francisco has the highest delayed flights for any other airline on any other routes to the four citites. While the Alaska airlines route to Phoenix has the best OnTime performance for any other airline on any other routes.

summarise(flights2,Avg_Delay = mean(Delayed_PCT), 
              Variance = var(Delayed_PCT),
              Standard_Deviation = sd(Delayed_PCT))
##   Avg_Delay   Variance Standard_Deviation
## 1 0.1447799 0.00513878         0.07168529

For all Airlines, the average percentage of delayed flights is 15%.

flights2 %>% 
    group_by(Airline) %>% 
    summarise(Avg_Delay = mean(Delayed_PCT), 
              Variance = var(Delayed_PCT),
              Standard_Deviation = sd(Delayed_PCT))
## # A tibble: 2 x 4
##   Airline Avg_Delay Variance Standard_Deviation
##   <fct>       <dbl>    <dbl>              <dbl>
## 1 ALASKA      0.112  0.00211             0.0459
## 2 AM WEST     0.178  0.00675             0.0821

When the average delayed is broken down by airline, we see that not only does AM West have higher average delays than Alaska, it also is higher than the average for both airlines.

flights2 %>% 
    group_by(City) %>% 
    summarise(Avg_Delay_Flights = mean(delayed), 
              Avg_Delay_Flights_by_PCT = mean(Delayed_PCT))
## # A tibble: 5 x 3
##   City          Avg_Delay_Flights Avg_Delay_Flights_by_PCT
##   <chr>                     <dbl>                    <dbl>
## 1 Los.Angeles                89.5                   0.128 
## 2 Phoenix                   214.                    0.0652
## 3 San.Diego                  42.5                   0.116 
## 4 San.Francisco             116.                    0.228 
## 5 Seattle                   183                     0.187

Finally, to confirm that percentages present a better insight than the raw numbers, we see that Seattle has a higher number of delayed flights, but in the context of the total number of flights per city, we see that San Francisco has the higher percentage of delayed flights. IOW, there’s a 23% chance that a flight into San Francisco will be delayed while there’s an 19% chance of a flight into Seattle will be delayed.

Conclusion:

After tidying the data and doing an analysis we can clearly see that AM West airline has significantly more delays than does Alaska ariline. Converting the raw numbers to percentages allowed us to make better comparisons. The graphs below also support the conclusion. For each city, we see that AM West has a higher percentage of delayed flights over Alaska airlines.

ggplot(flights2, aes(fill=City, y=Delayed_PCT, x=Airline)) + 
    geom_bar(position="dodge", stat="identity") +
    ggtitle("Comparison of Delayed Percentages between AM West and Alaska")

ggplot(flights2, aes(fill=City, y=OnTime_PCT, x=Airline)) + 
    geom_bar(position="dodge", stat="identity")+
    ggtitle("Comparison of On Time Percentages between AM West and Alaska")