607W5

Introduction

The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines.

Load Libraries and Data

After replicating the file as a CSV and uploading to Github, our first step is to load into R and assess the data’s structure.

library(RCurl)
library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:RCurl':
## 
##     complete

x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/607/master/607W5.csv")
RawTable <- read.csv(text = x)
RawTable

##         X     X.1 Los.Angeles Phoenix San.Diego San.Fransisco Seattle
## 1  Alaska On Time         497     221       212           503    1841
## 2         Delayed          62      12        20           102     305
## 3 AM West On Time         694    4840       383           320     201
## 4         Delayed         117     415        65           129      61

Strategy

As we can see, this data violated several of our TidyData rules. Most notably, Columns 3 through 7 contain variable names and Column 2 contains multiple variables (On Time and Delayed).

We first need to add titles to Columns 1 and 2, and insert missing values into select cells in Column 1. From here, we can restructure our data using pivot_longer and pivot_wider as needed.

colnames(RawTable)[1] <- "Airline"
colnames(RawTable)[2] <- "Status"
RawTable[2,1] <- "Alaska"
RawTable[4,1] <- "AM West"

RawTable2 <- RawTable %>%   
pivot_longer(c(Los.Angeles, Phoenix, San.Diego, San.Fransisco, Seattle), names_to = "City", values_to = "Flight Count")

RawTable2

## # A tibble: 20 x 4
##    Airline Status  City          `Flight Count`
##    <fct>   <fct>   <chr>                  <int>
##  1 Alaska  On Time Los.Angeles              497
##  2 Alaska  On Time Phoenix                  221
##  3 Alaska  On Time San.Diego                212
##  4 Alaska  On Time San.Fransisco            503
##  5 Alaska  On Time Seattle                 1841
##  6 Alaska  Delayed Los.Angeles               62
##  7 Alaska  Delayed Phoenix                   12
##  8 Alaska  Delayed San.Diego                 20
##  9 Alaska  Delayed San.Fransisco            102
## 10 Alaska  Delayed Seattle                  305
## 11 AM West On Time Los.Angeles              694
## 12 AM West On Time Phoenix                 4840
## 13 AM West On Time San.Diego                383
## 14 AM West On Time San.Fransisco            320
## 15 AM West On Time Seattle                  201
## 16 AM West Delayed Los.Angeles              117
## 17 AM West Delayed Phoenix                  415
## 18 AM West Delayed San.Diego                 65
## 19 AM West Delayed San.Fransisco            129
## 20 AM West Delayed Seattle                   61

TidyTable <- RawTable2 %>%
  pivot_wider(names_from = Status, values_from = `Flight Count`)

TidyTable

## # A tibble: 10 x 4
##    Airline City          `On Time` Delayed
##    <fct>   <chr>             <int>   <int>
##  1 Alaska  Los.Angeles         497      62
##  2 Alaska  Phoenix             221      12
##  3 Alaska  San.Diego           212      20
##  4 Alaska  San.Fransisco       503     102
##  5 Alaska  Seattle            1841     305
##  6 AM West Los.Angeles         694     117
##  7 AM West Phoenix            4840     415
##  8 AM West San.Diego           383      65
##  9 AM West San.Fransisco       320     129
## 10 AM West Seattle             201      61

Analysis

As this table measures delayed and on-time flights across cities and airlines, we can infer that there are two different uses one might have for this table: “which airline runs on time most frequently on average across all cities?” and “which airline is more likely to arrive to a given city on time?”

We can add in a new column to answer the second question, and provide averages that answer the first question:

TidyTable$"Delayed Rate" <- (TidyTable$Delayed / (TidyTable$Delayed + TidyTable$`On Time`))

TidyTable

## # A tibble: 10 x 5
##    Airline City          `On Time` Delayed `Delayed Rate`
##    <fct>   <chr>             <int>   <int>          <dbl>
##  1 Alaska  Los.Angeles         497      62         0.111 
##  2 Alaska  Phoenix             221      12         0.0515
##  3 Alaska  San.Diego           212      20         0.0862
##  4 Alaska  San.Fransisco       503     102         0.169 
##  5 Alaska  Seattle            1841     305         0.142 
##  6 AM West Los.Angeles         694     117         0.144 
##  7 AM West Phoenix            4840     415         0.0790
##  8 AM West San.Diego           383      65         0.145 
##  9 AM West San.Fransisco       320     129         0.287 
## 10 AM West Seattle             201      61         0.233

sum(subset(TidyTable$Delayed, TidyTable$Airline == 'Alaska')) / sum(subset(c(TidyTable$Delayed, TidyTable$`On Time`), TidyTable$Airline == 'Alaska'))

## [1] 0.1327152

sum(subset(TidyTable$Delayed, TidyTable$Airline == 'AM West')) / sum(subset(c(TidyTable$Delayed, TidyTable$`On Time`), TidyTable$Airline == 'AM West'))

## [1] 0.1089273

Conclusions

AM West is only delayed on average, 11% of the time, while Alaska is delayed 13% on average. That being said, if one was traveling to Phoenix, they would likely want to fly Alaska as they are only delayed 5% of the time for that city specifically, and AM West has a delayed rate of nearly 8%.