Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: The URL to the .Rmd file in your GitHub repository. and The URL for your rpubs.com web page.
table <- rbind(c(NA, NA, "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
c("ALASKA", "on time", 497,221,212,503,1841),
c(NA, "delayed", 62,12,20,102,305),
c("AM WEST", "on time", 694,4840,383,320,201),
c(NA, "delayed", 117,415,65,129,61))
table_df <- as.data.frame((table))
write.csv(table_df, file = "Data607_assignment5.csv", row.names = FALSE)
csv_table <- read.csv("Data607_assignment5.csv", stringsAsFactors = FALSE)
csv_table
## V1 V2 V3 V4 V5 V6 V7
## 1 <NA> <NA> Los Angeles Phoenix San Diego San Francisco Seattle
## 2 ALASKA on time 497 221 212 503 1841
## 3 <NA> delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- csv_table[2:5,]
colnames(df) <- c("Airline", "Status", "LA", "PHX", "SD", "SF", "SEA")
df <- as.data.frame(df)
df
## Airline Status LA PHX SD SF SEA
## 2 ALASKA on time 497 221 212 503 1841
## 3 <NA> delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
working_data <- gather(df, Destination, Flights, LA:SEA)
# We use the coalesce function to combine 2 columns. The lag() function is used to shift the air lines column by 1. Combining this with the coalensce function, we combine the regular column with the shifted column to remove the NAs
working_data$Airline = coalesce(working_data$Airline,lag(working_data$Airline))
# Rerrange the data by airline and status
?arrange()
working_data <- arrange(working_data, desc(Airline), Status)
working_data$Flights <- as.integer(working_data$Flights)
working_data
## Airline Status Destination Flights
## 1 AM WEST delayed LA 117
## 2 AM WEST delayed PHX 415
## 3 AM WEST delayed SD 65
## 4 AM WEST delayed SF 129
## 5 AM WEST delayed SEA 61
## 6 AM WEST on time LA 694
## 7 AM WEST on time PHX 4840
## 8 AM WEST on time SD 383
## 9 AM WEST on time SF 320
## 10 AM WEST on time SEA 201
## 11 ALASKA delayed LA 62
## 12 ALASKA delayed PHX 12
## 13 ALASKA delayed SD 20
## 14 ALASKA delayed SF 102
## 15 ALASKA delayed SEA 305
## 16 ALASKA on time LA 497
## 17 ALASKA on time PHX 221
## 18 ALASKA on time SD 212
## 19 ALASKA on time SF 503
## 20 ALASKA on time SEA 1841
We’re going to investigate what airline had most delays (proportions)
# use the filter functions to show only delayed flights
data <- filter(working_data, Status == 'delayed')
data
## Airline Status Destination Flights
## 1 AM WEST delayed LA 117
## 2 AM WEST delayed PHX 415
## 3 AM WEST delayed SD 65
## 4 AM WEST delayed SF 129
## 5 AM WEST delayed SEA 61
## 6 ALASKA delayed LA 62
## 7 ALASKA delayed PHX 12
## 8 ALASKA delayed SD 20
## 9 ALASKA delayed SF 102
## 10 ALASKA delayed SEA 305
# Looking at Alaska Flights
data.alaska <- as.data.frame(filter(data, Airline == 'ALASKA') %>%
mutate(Proportion_Total_Flights = Flights/sum(Flights)))
data.alaska
## Airline Status Destination Flights Proportion_Total_Flights
## 1 ALASKA delayed LA 62 0.12375250
## 2 ALASKA delayed PHX 12 0.02395210
## 3 ALASKA delayed SD 20 0.03992016
## 4 ALASKA delayed SF 102 0.20359281
## 5 ALASKA delayed SEA 305 0.60878244
# Looking at AM West flights
data.amwest <- as.data.frame(filter(data, Airline == 'AM WEST') %>%
mutate(Proportion_Total_Flights = Flights/sum(Flights)))
data.amwest
## Airline Status Destination Flights Proportion_Total_Flights
## 1 AM WEST delayed LA 117 0.14866582
## 2 AM WEST delayed PHX 415 0.52731893
## 3 AM WEST delayed SD 65 0.08259212
## 4 AM WEST delayed SF 129 0.16391360
## 5 AM WEST delayed SEA 61 0.07750953
library(ggplot2)
ggplot(data.alaska, aes(Destination, Proportion_Total_Flights)) + geom_point(color = 'red')
For Alaska airlines the issues seems to lie in thir Seattle destinations, which most delays occur
ggplot(data.amwest, aes(Destination, Proportion_Total_Flights)) + geom_point(color = 'blue')
For AM West airlines, the issues with delays occurs when Phoenix is the destination