This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
##Read in the untidy file - The input file is created as instructed in the assignment and uploaded to github. The file is read in using the read_csv function from the url.
library(readr)
library(RCurl)
url <- getURLContent("https://raw.githubusercontent.com/tponnada/DATA607/master/Ifile.csv")
airline <- read_csv(url)
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]
airline
## # A tibble: 5 x 7
## X1 X2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
##Data cleansing
##Step 1: There is missing data in the first column where airline name is not repeated for each row; also an empty row between the two airlines is filled with NA’s when read in from the csv file, both of these issues need to be cleaned up. Note: The na’s can also be removed using the na.rm = TRUE argument of the gather function below.
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
## # A tibble: 4 x 7
## X1 X2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
##Step 2: The dataset has four variables - airline name, arrival delay status (on time or delayed), destination airport (Los Angeles, Phoenix, San Diego, San Francisco and Seattle) as well as counts/frequency. To tidy it, we need to melt, or stack it. In other words, we need to turn columns into rows. Melting is parameterised by a list of columns that are already variables, or colvars for short. In this example, the colvars are airline name. The individual destination columns are melted into two variables: a new variable called “Destination” and a new variable called “Value” that contains the concatenated arrival delay counts by destination that were previously separated into individual columns. However, this form is not yet tidy because we have two variables stored in rows: “on time” and “delayed”. This needs to be cast.
airline_clean2 <- gather(airline_clean1, "Destination", "Value", 3:7, na.rm = TRUE); airline_clean2
## # A tibble: 20 x 4
## X1 X2 Destination Value
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA delayed Los Angeles 62
## 3 AM WEST on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San Diego 212
## 10 ALASKA delayed San Diego 20
## 11 AM WEST on time San Diego 383
## 12 AM WEST delayed San Diego 65
## 13 ALASKA on time San Francisco 503
## 14 ALASKA delayed San Francisco 102
## 15 AM WEST on time San Francisco 320
## 16 AM WEST delayed San Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
##Step 3: Casting the X2 column. Steps 2 and 3 make the data tidy. There is one variable in each column and each row represents an observation. The final step is to clean the column names using the rename() function. We use this to give an informative name to column X1, and change to a consistent naming convention for the other columns.
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ dplyr 1.0.2
## ✓ tibble 3.0.3 ✓ stringr 1.4.0
## ✓ purrr 0.3.4 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
airline_clean3 <- spread(airline_clean2, "X2", "Value"); airline_clean3
## # A tibble: 10 x 4
## X1 Destination delayed `on time`
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
airline_clean4 <- rename(airline_clean3, Airline_carrier = "X1", Arrival_delay = "delayed", On_Time = "on time"); airline_clean4
## # A tibble: 10 x 4
## Airline_carrier Destination Arrival_delay On_Time
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
##Step 4: We used the tidy dataset above to perform the analysis asked in the assignment which is comparing the arrival delay for the two airlines. We have three variables and three comparisons available - Total number of flights that arrived late by carrier, number of flights that arrived late by destination and then combining carrier and destination, number of flights by carrier that arrived late at each destination.
##Looking at the carrier level first, Alaska airlines had fewer arrival delays as measured by the mean and the median compared to AM West, it also had a lower standard deviation or uncertainty. Second, the arrival delay at destination alone tells us that the mean delay was lowest at San Diego while the highest mean arrival delay was experienced at Phoenix which also had the highest standard deviation but this doesn’t tell us much about which of the carrier’s flights were responsible for skewing the delays at each airport. For this, we look at the last grouping - delays grouped by destination and carrier, which shows that at 4 of the 5 airports, Alaska airlines had fewer average arrival delays than AM West. Seattle was the only airport where the average number of flights for Alaska reported more delays than for AM West.
visualize1 <- airline_clean4 %>%
group_by(Airline_carrier) %>%
summarise(mean_delay1 = mean(Arrival_delay), median_delay1 = median(Arrival_delay), sd_delay1 = sd(Arrival_delay), sum = sum(Arrival_delay))
## `summarise()` ungrouping output (override with `.groups` argument)
visualize1
## # A tibble: 2 x 5
## Airline_carrier mean_delay1 median_delay1 sd_delay1 sum
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA 100. 62 120. 501
## 2 AM WEST 157. 117 147. 787
airline_clean4 %>%
group_by(Destination) %>%
summarise(mean_delay2 = mean(Arrival_delay), sd_delay2 = sd(Arrival_delay), sum2 = sum(Arrival_delay))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
## Destination mean_delay2 sd_delay2 sum2
## <chr> <dbl> <dbl> <dbl>
## 1 Los Angeles 89.5 38.9 179
## 2 Phoenix 214. 285. 427
## 3 San Diego 42.5 31.8 85
## 4 San Francisco 116. 19.1 231
## 5 Seattle 183 173. 366
visualize2 <- airline_clean4 %>%
group_by(Destination, Airline_carrier) %>%
summarise(mean_delay3 = mean(Arrival_delay))
## `summarise()` regrouping output by 'Destination' (override with `.groups` argument)
visualize2
## # A tibble: 10 x 3
## # Groups: Destination [5]
## Destination Airline_carrier mean_delay3
## <chr> <chr> <dbl>
## 1 Los Angeles ALASKA 62
## 2 Los Angeles AM WEST 117
## 3 Phoenix ALASKA 12
## 4 Phoenix AM WEST 415
## 5 San Diego ALASKA 20
## 6 San Diego AM WEST 65
## 7 San Francisco ALASKA 102
## 8 San Francisco AM WEST 129
## 9 Seattle ALASKA 305
## 10 Seattle AM WEST 61
##Step 5: We visualize by plotting the graphs for first mean arrival delay by carrier which as two points is not very informative. Secondly, looking at delays by destination and carrier as calculated above, we see that visually AM West flights on average experienced a high incidence of arrival delays at Phoenix which is it’s hub airport and similarly, Alaska Airlines experienced a high incidence of arrival delays on average at its hub Seattle airport.
##Conclusion: Ultimately, the analysis from the data and the graphs above makes business sense, as a carrier is likley to experience more number of flight delays at its busiest hub airport which this analysis confirms for the case of Alaska and AM West airlines. This is the explanation for the overall discrepancy between the per-city and the overall performance.
library(ggplot2)
ggplot(data = visualize1) + geom_point(aes(x = Airline_carrier, y = mean_delay1))
ggplot(data = visualize2, aes(x = Destination, y = mean_delay3)) + geom_line(aes(color = factor(Airline_carrier), group = Airline_carrier))
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.