This script uses dplyr, tidyr, and ggplot from the tidyverse family of packages.
library(tidyverse)
library(magrittr)
Data ‘’Assignment4.csv’’can be obtained from GitHub repository https://github.com/robertwelk/DATA607_Assignment4.git
# The file is read in as a csv, with strings preserved as characters rather than factors. Two of the columns are renamed for clarity and ease in downstream processes.
data <- read.csv('Assignment4.csv', stringsAsFactors = F) %>%
rename(carrier=X, flight.status=X.1) %>%
as_tibble()
data
## # A tibble: 5 x 7
## carrier flight.status Los.Angeles Pheonix San.Diego San.Fransisco Seattle
## <chr> <chr> <int> <int> <int> <int> <int>
## 1 Alaska on time 497 221 212 503 1841
## 2 "" delayed 62 12 20 102 305
## 3 "" "" NA NA NA NA NA
## 4 AM West on time 694 4840 383 230 201
## 5 "" delayed 117 415 65 129 61
Each column should be a variable and each row an observation. The data in its original format is considered to be wide as can be seen by the cross tabulation between carrier and individual cities. The city names are themselves part of one variable yet are given unique columns. Using tidyr, the function ‘gather()’ can be used to create a new column called ‘city’ with each displaying a unique value while preserving their associated counts.
For this data it makes sense to keep serparate the counts of flights for on-time and delayed, as they are each a distinct variable and should be given their own column. The ‘spread()’ is then used to create a sperate variable for on-time counts and delayed flight counts for each city for each carrier. The use of dplyr piping allows for readable code even while grouping processing steps together.
#add airline name to column 1
data[c(2,5),1] <- c('Alaska','AM West')
# make tidy by putting into long format, and remove blank rows
data.tidy <- data %>%
gather(city, count, 3:7) %>%
filter(!is.na(count)) %>%
spread(flight.status,count)
data.tidy
## # A tibble: 10 x 4
## carrier city delayed `on time`
## <chr> <chr> <int> <int>
## 1 Alaska Los.Angeles 62 497
## 2 Alaska Pheonix 12 221
## 3 Alaska San.Diego 20 212
## 4 Alaska San.Fransisco 102 503
## 5 Alaska Seattle 305 1841
## 6 AM West Los.Angeles 117 694
## 7 AM West Pheonix 415 4840
## 8 AM West San.Diego 65 383
## 9 AM West San.Fransisco 129 230
## 10 AM West Seattle 61 201
To compare arrival delays between the two carriers, it is convenient to use a ratio describing the average number of on-time flight per a delayed flight. This calcualtion can be made with ease due to the format of the now tidy data set. Comparison is easy to make with a dodged barplot that displays much of the information (carrier, city, on-time/delayed flight counts) in one graph. As can be seen, Alaska has a higher percentage of flight that are on time in each of the 5 cities compared to AM West.
data.tidy %<>%
mutate(ratio = `on time`/`delayed`) %>%
mutate(total = `on time` + `delayed`)
data.tidy
## # A tibble: 10 x 6
## carrier city delayed `on time` ratio total
## <chr> <chr> <int> <int> <dbl> <int>
## 1 Alaska Los.Angeles 62 497 8.02 559
## 2 Alaska Pheonix 12 221 18.4 233
## 3 Alaska San.Diego 20 212 10.6 232
## 4 Alaska San.Fransisco 102 503 4.93 605
## 5 Alaska Seattle 305 1841 6.04 2146
## 6 AM West Los.Angeles 117 694 5.93 811
## 7 AM West Pheonix 415 4840 11.7 5255
## 8 AM West San.Diego 65 383 5.89 448
## 9 AM West San.Fransisco 129 230 1.78 359
## 10 AM West Seattle 61 201 3.30 262
ggplot(data.tidy, aes(x=city,y=ratio,fill=carrier)) +
geom_bar(stat='identity',position = 'dodge')