Week 5 Assignment

Tidying and Transforming Data

Getting Started

I decided to create a MySQL db for this assignment to help myself get more familiar with connecting R to MySQL and creating MySQL databases. First thing I did was create a table in MySQL and insert the following data so that it resembled the table below.

Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
- delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
- delayed 117 415 65 129 61

The table describes the arrival delays for two airlines at five airports located in the West United States, however, the data isn’t organized perfectly to be analyzed. I will use dyplr and tidyr to transform the data to make analyzing it easier. Imagine a scenario where my supervisor tells me there will be lots of employees traveling over the course of the next year in the Western US and they are want to know which airline will be most reliable when it comes to arriving on time. Using the data provided we can create some barplots to help visualize the number of delays for each airline.

library(RMySQL)
## Warning: package 'RMySQL' was built under R version 3.4.4
## Loading required package: DBI
## Warning: package 'DBI' was built under R version 3.4.4
library(plyr)
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.4
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.4
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Once connection is established with the MySQL DB I can then query the table and save it as a data frame.

delays <- dbGetQuery(mydb, "SELECT *  FROM delays")
delays <- as.data.frame(delays)
delays
##   airline  status Los_Angeles Phoenix San_Diego San_Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2    <NA> delayed          62      12        20           102     305
## 3    <NA>    <NA>          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5    <NA> delayed         117     415        65           129      61

Summarizing the data confirms that there are NA’s within the columns from the one blank row between the two airlines.

summary(delays)
##    airline             status           Los_Angeles       Phoenix      
##  Length:5           Length:5           Min.   : 62.0   Min.   :  12.0  
##  Class :character   Class :character   1st Qu.:103.2   1st Qu.: 168.8  
##  Mode  :character   Mode  :character   Median :307.0   Median : 318.0  
##                                        Mean   :342.5   Mean   :1372.0  
##                                        3rd Qu.:546.2   3rd Qu.:1521.2  
##                                        Max.   :694.0   Max.   :4840.0  
##                                        NA's   :1       NA's   :1       
##    San_Diego      San_Francisco      Seattle    
##  Min.   : 20.00   Min.   :102.0   Min.   :  61  
##  1st Qu.: 53.75   1st Qu.:122.2   1st Qu.: 166  
##  Median :138.50   Median :224.5   Median : 253  
##  Mean   :170.00   Mean   :263.5   Mean   : 602  
##  3rd Qu.:254.75   3rd Qu.:365.8   3rd Qu.: 689  
##  Max.   :383.00   Max.   :503.0   Max.   :1841  
##  NA's   :1        NA's   :1       NA's   :1

Lets remove the ugly row of NA’s out of the middle of our dataframe.

delays <- delays %>% drop_na(Los_Angeles)
delays
##   airline  status Los_Angeles Phoenix San_Diego San_Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2    <NA> delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5    <NA> delayed         117     415        65           129      61

And fill the NA’s below the airlines using fill from tidyr.

delays <- delays %>% fill(airline)

Now lets change to layout from wide to long using gather from tidyr.

long <- delays %>%  gather(key=airport,value=count,3:7)
long
##    airline  status       airport count
## 1   ALASKA on time   Los_Angeles   497
## 2   ALASKA delayed   Los_Angeles    62
## 3  AM WEST on time   Los_Angeles   694
## 4  AM WEST delayed   Los_Angeles   117
## 5   ALASKA on time       Phoenix   221
## 6   ALASKA delayed       Phoenix    12
## 7  AM WEST on time       Phoenix  4840
## 8  AM WEST delayed       Phoenix   415
## 9   ALASKA on time     San_Diego   212
## 10  ALASKA delayed     San_Diego    20
## 11 AM WEST on time     San_Diego   383
## 12 AM WEST delayed     San_Diego    65
## 13  ALASKA on time San_Francisco   503
## 14  ALASKA delayed San_Francisco   102
## 15 AM WEST on time San_Francisco   320
## 16 AM WEST delayed San_Francisco   129
## 17  ALASKA on time       Seattle  1841
## 18  ALASKA delayed       Seattle   305
## 19 AM WEST on time       Seattle   201
## 20 AM WEST delayed       Seattle    61

Looking at out long dataframe, I would like to change the way it is assorted with arrange.

long <- long %>% arrange(airline, status)
## Warning: package 'bindrcpp' was built under R version 3.4.4
long
##    airline  status       airport count
## 1   ALASKA delayed   Los_Angeles    62
## 2   ALASKA delayed       Phoenix    12
## 3   ALASKA delayed     San_Diego    20
## 4   ALASKA delayed San_Francisco   102
## 5   ALASKA delayed       Seattle   305
## 6   ALASKA on time   Los_Angeles   497
## 7   ALASKA on time       Phoenix   221
## 8   ALASKA on time     San_Diego   212
## 9   ALASKA on time San_Francisco   503
## 10  ALASKA on time       Seattle  1841
## 11 AM WEST delayed   Los_Angeles   117
## 12 AM WEST delayed       Phoenix   415
## 13 AM WEST delayed     San_Diego    65
## 14 AM WEST delayed San_Francisco   129
## 15 AM WEST delayed       Seattle    61
## 16 AM WEST on time   Los_Angeles   694
## 17 AM WEST on time       Phoenix  4840
## 18 AM WEST on time     San_Diego   383
## 19 AM WEST on time San_Francisco   320
## 20 AM WEST on time       Seattle   201

Now that we have tidied up the data a bit we can group the data in different ways to summarize it. I can group by airline and status to get the total number of delayed and on time flights for each airline. Then I can group by just the airline to get the total number of flights for each airline. I would like to calculate the proportion of on time and delayed flights for each airline so I then combine the two data frames.

grpstatus <- long %>% group_by(airline, status) %>% summarise(total_count_by_airline_status = sum(count))
grpstatus
## # A tibble: 4 x 3
## # Groups:   airline [?]
##   airline status  total_count_by_airline_status
##   <chr>   <chr>                           <int>
## 1 ALASKA  delayed                           501
## 2 ALASKA  on time                          3274
## 3 AM WEST delayed                           787
## 4 AM WEST on time                          6438
grpairlines <- long %>% group_by(airline) %>% summarise(total_count = sum(count))
grpairlines
## # A tibble: 2 x 2
##   airline total_count
##   <chr>         <int>
## 1 ALASKA         3775
## 2 AM WEST        7225
combined <- rbind.fill(grpstatus,grpairlines)
combined
##   airline  status total_count_by_airline_status total_count
## 1  ALASKA delayed                           501          NA
## 2  ALASKA on time                          3274          NA
## 3 AM WEST delayed                           787          NA
## 4 AM WEST on time                          6438          NA
## 5  ALASKA    <NA>                            NA        3775
## 6 AM WEST    <NA>                            NA        7225

This creates a dataframe with a bunch of NA’s. To remove the NA’s in the total_count column I assign the total count of flights for each airline to all rows where the airline name matches. I then remove the bottom two rows from the data frame.

combined$total_count[combined$airline == 'ALASKA'] <- combined$total_count[combined$airline == 'ALASKA'] %>% tail(1)
combined$total_count[combined$airline ==  'AM WEST'] <- combined$total_count[combined$airline == 'AM WEST'] %>% tail(1)
combined <- combined[-c(5,6),]

I then can easily calculate the proportion of delayed and on time flights for both airlines. I then rearrange the data frame to sort by the status so that it will be easier to compare the airlines in the barplots.

combined$prop <- combined$total_count_by_airline_status/combined$total_count
paste(combined$airline, combined$status, sep = ' ')
## [1] "ALASKA delayed"  "ALASKA on time"  "AM WEST delayed" "AM WEST on time"
par(mar = c(10,4,4,2))
combined <- combined %>% arrange(status)
barplot(combined$total_count_by_airline_status,
        names.arg = (paste(combined$airline, combined$status, sep = ' ')),
        las = 2, main = 'Totals for each airline')

barplot(combined$prop,
        names.arg = (paste(combined$airline, combined$status, sep = ' ')),
        las = 2, yaxp = c(0.0, 0.9,9),
        main = 'Proportion of flights for each airline')

In conclusion, the barplot of the totals for each ariline makes it seem as if AM WEST would be the clear choice to with because it has almost double the number of on time flights as ALASKA. However, looking at the porportion of flights that were delayed or on time for each airline shows that there was not much of a difference between the airlines. In this case I would select AM WEST airline for our office travel needs because it had nearly double the amount of flights as ALASKA and the greater number of flights might make it easier to select flight times when booking travel for meetings.