Assignment 5

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

##Read in the untidy file - The input file is created as instructed in the assignment and uploaded to github. The file is read in using the read_csv function from the url.

library(readr)
library(RCurl)
url <- getURLContent("https://raw.githubusercontent.com/tponnada/DATA607/master/Ifile.csv")
airline <- read_csv(url)

## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]

airline

## # A tibble: 5 x 7
##   X1      X2      `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 <NA>    <NA>               NA      NA          NA              NA      NA
## 4 AM WEST on time           694    4840         383             320     201
## 5 <NA>    delayed           117     415          65             129      61

##Data cleansing

##Step 1: There is missing data in the first column where airline name is not repeated for each row; also an empty row between the two airlines is filled with NA’s when read in from the csv file, both of these issues need to be cleaned up. Note: The na’s can also be removed using the na.rm = TRUE argument of the gather function below.

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:RCurl':
## 
##     complete

## # A tibble: 4 x 7
##   X1      X2      `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 ALASKA  delayed            62      12          20             102     305
## 3 AM WEST on time           694    4840         383             320     201
## 4 AM WEST delayed           117     415          65             129      61

##Step 2: The dataset has four variables - airline name, arrival delay status (on time or delayed), destination airport (Los Angeles, Phoenix, San Diego, San Francisco and Seattle) as well as counts/frequency. To tidy it, we need to melt, or stack it. In other words, we need to turn columns into rows. Melting is parameterised by a list of columns that are already variables, or colvars for short. In this example, the colvars are airline name. The individual destination columns are melted into two variables: a new variable called “Destination” and a new variable called “Value” that contains the concatenated arrival delay counts by destination that were previously separated into individual columns. However, this form is not yet tidy because we have two variables stored in rows: “on time” and “delayed”. This needs to be cast.

airline_clean2 <- gather(airline_clean1, "Destination", "Value", 3:7, na.rm = TRUE); airline_clean2

## # A tibble: 20 x 4
##    X1      X2      Destination   Value
##    <chr>   <chr>   <chr>         <dbl>
##  1 ALASKA  on time Los Angeles     497
##  2 ALASKA  delayed Los Angeles      62
##  3 AM WEST on time Los Angeles     694
##  4 AM WEST delayed Los Angeles     117
##  5 ALASKA  on time Phoenix         221
##  6 ALASKA  delayed Phoenix          12
##  7 AM WEST on time Phoenix        4840
##  8 AM WEST delayed Phoenix         415
##  9 ALASKA  on time San Diego       212
## 10 ALASKA  delayed San Diego        20
## 11 AM WEST on time San Diego       383
## 12 AM WEST delayed San Diego        65
## 13 ALASKA  on time San Francisco   503
## 14 ALASKA  delayed San Francisco   102
## 15 AM WEST on time San Francisco   320
## 16 AM WEST delayed San Francisco   129
## 17 ALASKA  on time Seattle        1841
## 18 ALASKA  delayed Seattle         305
## 19 AM WEST on time Seattle         201
## 20 AM WEST delayed Seattle          61

##Step 3: Casting the X2 column. Steps 2 and 3 make the data tidy. There is one variable in each column and each row represents an observation. The final step is to clean the column names using the rename() function. We use this to give an informative name to column X1, and change to a consistent naming convention for the other columns.

library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ dplyr   1.0.2
## ✓ tibble  3.0.3     ✓ stringr 1.4.0
## ✓ purrr   0.3.4     ✓ forcats 0.5.0

## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()

airline_clean3 <- spread(airline_clean2, "X2", "Value"); airline_clean3

## # A tibble: 10 x 4
##    X1      Destination   delayed `on time`
##    <chr>   <chr>           <dbl>     <dbl>
##  1 ALASKA  Los Angeles        62       497
##  2 ALASKA  Phoenix            12       221
##  3 ALASKA  San Diego          20       212
##  4 ALASKA  San Francisco     102       503
##  5 ALASKA  Seattle           305      1841
##  6 AM WEST Los Angeles       117       694
##  7 AM WEST Phoenix           415      4840
##  8 AM WEST San Diego          65       383
##  9 AM WEST San Francisco     129       320
## 10 AM WEST Seattle            61       201

airline_clean4 <- rename(airline_clean3, Airline_carrier = "X1", Arrival_delay = "delayed", On_Time = "on time"); airline_clean4

## # A tibble: 10 x 4
##    Airline_carrier Destination   Arrival_delay On_Time
##    <chr>           <chr>                 <dbl>   <dbl>
##  1 ALASKA          Los Angeles              62     497
##  2 ALASKA          Phoenix                  12     221
##  3 ALASKA          San Diego                20     212
##  4 ALASKA          San Francisco           102     503
##  5 ALASKA          Seattle                 305    1841
##  6 AM WEST         Los Angeles             117     694
##  7 AM WEST         Phoenix                 415    4840
##  8 AM WEST         San Diego                65     383
##  9 AM WEST         San Francisco           129     320
## 10 AM WEST         Seattle                  61     201

##Step 4: We used the tidy dataset above to perform the analysis asked in the assignment which is comparing the arrival delay for the two airlines. We have three variables and three comparisons available - Total number of flights that arrived late by carrier, number of flights that arrived late by destination and then combining carrier and destination, number of flights by carrier that arrived late at each destination.

##Looking at the carrier level first, Alaska airlines had fewer arrival delays as measured by the mean and the median compared to AM West, it also had a lower standard deviation or uncertainty. Second, the arrival delay at destination alone tells us that the mean delay was lowest at San Diego while the highest mean arrival delay was experienced at Phoenix which also had the highest standard deviation but this doesn’t tell us much about which of the carrier’s flights were responsible for skewing the delays at each airport. For this, we look at the last grouping - delays grouped by destination and carrier, which shows that at 4 of the 5 airports, Alaska airlines had fewer average arrival delays than AM West. Seattle was the only airport where the average number of flights for Alaska reported more delays than for AM West.

visualize1 <- airline_clean4 %>%
  group_by(Airline_carrier) %>%
  summarise(mean_delay1 = mean(Arrival_delay), median_delay1 = median(Arrival_delay), sd_delay1 = sd(Arrival_delay), sum = sum(Arrival_delay))

## `summarise()` ungrouping output (override with `.groups` argument)

visualize1

## # A tibble: 2 x 5
##   Airline_carrier mean_delay1 median_delay1 sd_delay1   sum
##   <chr>                 <dbl>         <dbl>     <dbl> <dbl>
## 1 ALASKA                 100.            62      120.   501
## 2 AM WEST                157.           117      147.   787

airline_clean4 %>%
  group_by(Destination) %>%
  summarise(mean_delay2 = mean(Arrival_delay), sd_delay2 = sd(Arrival_delay), sum2 = sum(Arrival_delay))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 5 x 4
##   Destination   mean_delay2 sd_delay2  sum2
##   <chr>               <dbl>     <dbl> <dbl>
## 1 Los Angeles          89.5      38.9   179
## 2 Phoenix             214.      285.    427
## 3 San Diego            42.5      31.8    85
## 4 San Francisco       116.       19.1   231
## 5 Seattle             183       173.    366

visualize2 <- airline_clean4 %>%
  group_by(Destination, Airline_carrier) %>%
  summarise(mean_delay3 = mean(Arrival_delay))

## `summarise()` regrouping output by 'Destination' (override with `.groups` argument)

visualize2

## # A tibble: 10 x 3
## # Groups:   Destination [5]
##    Destination   Airline_carrier mean_delay3
##    <chr>         <chr>                 <dbl>
##  1 Los Angeles   ALASKA                   62
##  2 Los Angeles   AM WEST                 117
##  3 Phoenix       ALASKA                   12
##  4 Phoenix       AM WEST                 415
##  5 San Diego     ALASKA                   20
##  6 San Diego     AM WEST                  65
##  7 San Francisco ALASKA                  102
##  8 San Francisco AM WEST                 129
##  9 Seattle       ALASKA                  305
## 10 Seattle       AM WEST                  61

##Step 5: We visualize by plotting the graphs for first mean arrival delay by carrier which as two points is not very informative. Secondly, looking at delays by destination and carrier as calculated above, we see that visually AM West flights on average experienced a high incidence of arrival delays at Phoenix which is it’s hub airport and similarly, Alaska Airlines experienced a high incidence of arrival delays on average at its hub Seattle airport.

##Conclusion: Ultimately, the analysis from the data and the graphs above makes business sense, as a carrier is likley to experience more number of flight delays at its busiest hub airport which this analysis confirms for the case of Alaska and AM West airlines. This is the explanation for the overall discrepancy between the per-city and the overall performance.

library(ggplot2)
ggplot(data = visualize1) + geom_point(aes(x = Airline_carrier, y = mean_delay1))

ggplot(data = visualize2, aes(x = Destination, y = mean_delay3)) + geom_line(aes(color = factor(Airline_carrier), group = Airline_carrier))

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.