Task Assignment

The chart above (not shown) describes arrival delays for two airlines across five destinations. Your task is to:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
  2. Read the information from your CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
  3. Perform analysis to compare the arrival delays for the two airlines.
  4. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:
    • The URL to the .Rmd file in your GitHub repository. and 
    • The URL for your rpubs.com web page.

Introduction

Data for arrival delays for two airlines is presented in an untidy format requiring data cleanup in order to perform some basic analysis.

To complete all the tasks below 3 packages are required:

library(tidyr)
library(dplyr)
library(stringr)

If a package is not installed it must first be installed before it can be loaded.

Here are the steps we will follow:

  1. Import CSV file into R
  2. Prepare imported CSV file for manipulation
  3. Use gather() function from tidyr package to create one row per Airline, Status and City
  4. Use spread() function from tidyr pacakge to get data into final format with a variable in each column
  5. Perform analysis

Import raw data and manipulate for analysis

Read in raw CSV data from local directory.

rawcsv <- read.csv("Assignment 5.csv", header = TRUE, sep = ",", row.names = NULL, stringsAsFactors = FALSE)

rawcsv
##         X     X.1 Los.Angeles Phoneix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

In original file the ‘Airline’ and ‘Status’ columns were missing headers so we need to assign column names for airline and status

colnames(rawcsv)[1] <- "Airline"
colnames(rawcsv)[2] <- "Status"

Original file had one blank row separating airlines this imported as all NAs so we will remove row with all NA (i.e. the blank row)

rawcsv_narm <- na.omit(rawcsv)

We need to add in the airline names for the two missing values (the value did not repeat in the original file).

rawcsv_narm[2, 1] <- "ALASKA"
rawcsv_narm[4, 1] <- "AM WEST"

Now the data can be further manipulated for analysis. The gather() function was used to move data from wide to tall format.

prefinalcsv <- gather(rawcsv_narm, "City", "Count", 3:7)
prefinalcsv
##    Airline  Status          City Count
## 1   ALASKA on time   Los.Angeles   497
## 2   ALASKA delayed   Los.Angeles    62
## 3  AM WEST on time   Los.Angeles   694
## 4  AM WEST delayed   Los.Angeles   117
## 5   ALASKA on time       Phoneix   221
## 6   ALASKA delayed       Phoneix    12
## 7  AM WEST on time       Phoneix  4840
## 8  AM WEST delayed       Phoneix   415
## 9   ALASKA on time     San.Diego   212
## 10  ALASKA delayed     San.Diego    20
## 11 AM WEST on time     San.Diego   383
## 12 AM WEST delayed     San.Diego    65
## 13  ALASKA on time San.Francisco   503
## 14  ALASKA delayed San.Francisco   102
## 15 AM WEST on time San.Francisco   320
## 16 AM WEST delayed San.Francisco   129
## 17  ALASKA on time       Seattle  1841
## 18  ALASKA delayed       Seattle   305
## 19 AM WEST on time       Seattle   201
## 20 AM WEST delayed       Seattle    61

Now we will remove punctuation from city names that was introduced due to spaces in some (i.e. Los Angeles became Los.Angeles).

prefinalcsv$City <- str_replace(prefinalcsv$City, "[.]", " ")

Finally, we will move Status into its on variable.

finalcsv <- spread(prefinalcsv, Status, Count)

Analysis

We will examine delays for these two airlines overall and by city using the summarise() function from the dplyr package.

DelayOverall <- finalcsv %>% 
  group_by(Airline) %>%
  summarise(TotalDelayed=sum(delayed),TotalOnTime=sum(`on time`), PercentDelayed=round((TotalDelayed/(TotalDelayed+TotalOnTime))*100, 2))

DelayCity <- finalcsv %>% 
  group_by(Airline, City) %>%
  summarise(TotalDelayed=sum(delayed),TotalOnTime=sum(`on time`), PercentDelayed=round((TotalDelayed/(TotalDelayed+TotalOnTime))*100, 2))

DelayOverall
## # A tibble: 2 × 4
##   Airline TotalDelayed TotalOnTime PercentDelayed
##     <chr>        <int>       <int>          <dbl>
## 1  ALASKA          501        3274          13.27
## 2 AM WEST          787        6438          10.89

Overall, it would appear that Alaska has a higher % of delayed arrivals than Am West at 13.27% delayed. It is worth noting that Alaska has a much smaller (nearly 50% less) total number of observations in the set.

DelayCity
## Source: local data frame [10 x 5]
## Groups: Airline [?]
## 
##    Airline          City TotalDelayed TotalOnTime PercentDelayed
##      <chr>         <chr>        <int>       <int>          <dbl>
## 1   ALASKA   Los Angeles           62         497          11.09
## 2   ALASKA       Phoneix           12         221           5.15
## 3   ALASKA     San Diego           20         212           8.62
## 4   ALASKA San Francisco          102         503          16.86
## 5   ALASKA       Seattle          305        1841          14.21
## 6  AM WEST   Los Angeles          117         694          14.43
## 7  AM WEST       Phoneix          415        4840           7.90
## 8  AM WEST     San Diego           65         383          14.51
## 9  AM WEST San Francisco          129         320          28.73
## 10 AM WEST       Seattle           61         201          23.28

When we examine this data by city, again using the summarise() function from dplyr, the trend in arrival delays seems to reverse.

For each city Am West now has the higher % of delayed arrivals with some cities being as high as 28.73% (San Francisco).

What we have stumbled upon here is a rather classic illustration of Simpson’s paradox: A trend in data that reverses when the results are conditioned by an additional grouping variable.

In this case it appears that city has a dramatic impact on the % of flights that are delayed. If we step away from the data and considered the differences in weather between the cities below this paradox makes sense (consider weather in San Diego compared to Seattle on average). Given that the distribution of total flights to each city differs by airline we can see how this difference might emerge.

If we were considering which airline to take based on historical delay information it would be important to consider city in all analyses.