Assignment 5: Tidy Data

Load Libraries

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v tibble  3.0.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## v purrr   0.3.4

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Get Source Data

I took the data from the assignment sheet and transcribed it to a csv and loaded onto GitHub for reproducibility. From the tidyr package, I used read_csv() to get the data. This function has a number of helpful arguments like skip_empty_rows and trim_ws. I left col_names as FALSE initially, but quickly changed it to TRUE when I noticed the function could ready the column headers correctly.

dataURL <- "https://raw.githubusercontent.com/iscostello/Data607/master/tidyset.csv"
flightData <- read_csv(dataURL, col_names = TRUE, skip_empty_rows = TRUE, trim_ws = TRUE)

## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]

## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   `Los Angeles` = col_double(),
##   Phoenix = col_double(),
##   `San Diego` = col_double(),
##   `San Francisco` = col_double(),
##   Seattle = col_double()
## )

print(flightData)

## # A tibble: 5 x 7
##   X1      X2      `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 <NA>    <NA>               NA      NA          NA              NA      NA
## 4 AM WEST on time           694    4840         383             320     201
## 5 <NA>    delayed           117     415          65             129      61

Get the Broom: Data Cleaning

Just looking at how the data is structured, I know I wanted to create a variable for “City.” Using gather() this was pretty straightforward and including the na.rm argument eliminated the remaining pesky NULL rows. I also took the time to rename the first and second columns.

flightPivot <- gather(flightData, "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle", key = "City", value = "numFlights", na.rm = TRUE)

flightPivot <- fill(flightPivot,X1,)
names(flightPivot)[1] <- "Airline"
names(flightPivot)[2] <- "ArrivalStatus"
print(flightPivot)

## # A tibble: 20 x 4
##    Airline ArrivalStatus City          numFlights
##    <chr>   <chr>         <chr>              <dbl>
##  1 ALASKA  on time       Los Angeles          497
##  2 ALASKA  delayed       Los Angeles           62
##  3 AM WEST on time       Los Angeles          694
##  4 AM WEST delayed       Los Angeles          117
##  5 ALASKA  on time       Phoenix              221
##  6 ALASKA  delayed       Phoenix               12
##  7 AM WEST on time       Phoenix             4840
##  8 AM WEST delayed       Phoenix              415
##  9 ALASKA  on time       San Diego            212
## 10 ALASKA  delayed       San Diego             20
## 11 AM WEST on time       San Diego            383
## 12 AM WEST delayed       San Diego             65
## 13 ALASKA  on time       San Francisco        503
## 14 ALASKA  delayed       San Francisco        102
## 15 AM WEST on time       San Francisco        320
## 16 AM WEST delayed       San Francisco        129
## 17 ALASKA  on time       Seattle             1841
## 18 ALASKA  delayed       Seattle              305
## 19 AM WEST on time       Seattle              201
## 20 AM WEST delayed       Seattle               61

Data Transformation

I didn’t quite understand why the package was called dplyr/plyr until working in this section. With filter() and select() I pulled out just the “delayed” rows and removed arrival status. I fun trick I saw in one of the posted videos that was making the case for using select in dplyr was that you could use normal strings instead of integers to select columns. So I removed ArrivalStatus by simply adding a “-” before it. Neat trick I thought!

flightsDelay <- filter(flightPivot, ArrivalStatus == "delayed")

flightsDelay <- select(flightsDelay, -ArrivalStatus)
print(flightsDelay)

## # A tibble: 10 x 3
##    Airline City          numFlights
##    <chr>   <chr>              <dbl>
##  1 ALASKA  Los Angeles           62
##  2 AM WEST Los Angeles          117
##  3 ALASKA  Phoenix               12
##  4 AM WEST Phoenix              415
##  5 ALASKA  San Diego             20
##  6 AM WEST San Diego             65
##  7 ALASKA  San Francisco        102
##  8 AM WEST San Francisco        129
##  9 ALASKA  Seattle              305
## 10 AM WEST Seattle               61

Overall vs. By City

I’ll be interested to see others’ solutions to this section. I could not get group_by() or summarise() to work properly. After a bit of digging around I finally settled on using the base R aggregate(), which did what I needed it to do.

byAirline <- aggregate(flightsDelay[,3], list(flightsDelay$Airline), mean)
print(byAirline)

##   Group.1 numFlights
## 1  ALASKA      100.2
## 2 AM WEST      157.4

Analysis and Visualizations

Overall Performance

With the data cleaned and wrangled, I used bar charts to draw out the differences. Comparing airlines overall, it’s clear that America West has more on time flights on average than Alaska.

ggplot(byAirline, aes(x = Group.1, y = numFlights)) +
       geom_bar(stat = "identity")

Performance By City

Viewed by city, it’s interesting to see where the hubs for these airlines are, Phoenix for American West and Seattle for Alaska.

ggplot(flightsDelay, aes(x = City, y = numFlights, color = Airline)) +
  geom_bar(fill="white", alpha=0.1,stat = "identity", position="stack")

I think the main reason for the apparent discrepancy is because these two cities with the most delays are where these airlines fly to most frequently, namely their respective hubs.