library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
I took the data from the assignment sheet and transcribed it to a csv and loaded onto GitHub for reproducibility. From the tidyr package, I used read_csv() to get the data. This function has a number of helpful arguments like skip_empty_rows and trim_ws. I left col_names as FALSE initially, but quickly changed it to TRUE when I noticed the function could ready the column headers correctly.
dataURL <- "https://raw.githubusercontent.com/iscostello/Data607/master/tidyset.csv"
flightData <- read_csv(dataURL, col_names = TRUE, skip_empty_rows = TRUE, trim_ws = TRUE)
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]
## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character(),
## `Los Angeles` = col_double(),
## Phoenix = col_double(),
## `San Diego` = col_double(),
## `San Francisco` = col_double(),
## Seattle = col_double()
## )
print(flightData)
## # A tibble: 5 x 7
## X1 X2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Just looking at how the data is structured, I know I wanted to create a variable for “City.” Using gather() this was pretty straightforward and including the na.rm argument eliminated the remaining pesky NULL rows. I also took the time to rename the first and second columns.
flightPivot <- gather(flightData, "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle", key = "City", value = "numFlights", na.rm = TRUE)
flightPivot <- fill(flightPivot,X1,)
names(flightPivot)[1] <- "Airline"
names(flightPivot)[2] <- "ArrivalStatus"
print(flightPivot)
## # A tibble: 20 x 4
## Airline ArrivalStatus City numFlights
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA delayed Los Angeles 62
## 3 AM WEST on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San Diego 212
## 10 ALASKA delayed San Diego 20
## 11 AM WEST on time San Diego 383
## 12 AM WEST delayed San Diego 65
## 13 ALASKA on time San Francisco 503
## 14 ALASKA delayed San Francisco 102
## 15 AM WEST on time San Francisco 320
## 16 AM WEST delayed San Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
I didn’t quite understand why the package was called dplyr/plyr until working in this section. With filter() and select() I pulled out just the “delayed” rows and removed arrival status. I fun trick I saw in one of the posted videos that was making the case for using select in dplyr was that you could use normal strings instead of integers to select columns. So I removed ArrivalStatus by simply adding a “-” before it. Neat trick I thought!
flightsDelay <- filter(flightPivot, ArrivalStatus == "delayed")
flightsDelay <- select(flightsDelay, -ArrivalStatus)
print(flightsDelay)
## # A tibble: 10 x 3
## Airline City numFlights
## <chr> <chr> <dbl>
## 1 ALASKA Los Angeles 62
## 2 AM WEST Los Angeles 117
## 3 ALASKA Phoenix 12
## 4 AM WEST Phoenix 415
## 5 ALASKA San Diego 20
## 6 AM WEST San Diego 65
## 7 ALASKA San Francisco 102
## 8 AM WEST San Francisco 129
## 9 ALASKA Seattle 305
## 10 AM WEST Seattle 61
I’ll be interested to see others’ solutions to this section. I could not get group_by() or summarise() to work properly. After a bit of digging around I finally settled on using the base R aggregate(), which did what I needed it to do.
byAirline <- aggregate(flightsDelay[,3], list(flightsDelay$Airline), mean)
print(byAirline)
## Group.1 numFlights
## 1 ALASKA 100.2
## 2 AM WEST 157.4
With the data cleaned and wrangled, I used bar charts to draw out the differences. Comparing airlines overall, it’s clear that America West has more on time flights on average than Alaska.
ggplot(byAirline, aes(x = Group.1, y = numFlights)) +
geom_bar(stat = "identity")
Viewed by city, it’s interesting to see where the hubs for these airlines are, Phoenix for American West and Seattle for Alaska.
ggplot(flightsDelay, aes(x = City, y = numFlights, color = Airline)) +
geom_bar(fill="white", alpha=0.1,stat = "identity", position="stack")
I think the main reason for the apparent discrepancy is because these two cities with the most delays are where these airlines fly to most frequently, namely their respective hubs.