1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

  3. Perform analysis to compare the arrival delays for the two airlines.

  4. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: The URL to the .Rmd file in your GitHub repository. and The URL for your rpubs.com web page.

Binding each row to create a table to write as a csv file

table <- rbind(c(NA, NA, "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"), 
               c("ALASKA", "on time", 497,221,212,503,1841), 
               c(NA, "delayed", 62,12,20,102,305),
               c("AM WEST", "on time", 694,4840,383,320,201),
               c(NA, "delayed", 117,415,65,129,61))
table_df <- as.data.frame((table))

Write the table as a csv file

write.csv(table_df, file = "Data607_assignment5.csv", row.names = FALSE)

Load the csv file into R

csv_table <- read.csv("Data607_assignment5.csv", stringsAsFactors = FALSE)
csv_table
##        V1      V2          V3      V4        V5            V6      V7
## 1    <NA>    <NA> Los Angeles Phoenix San Diego San Francisco Seattle
## 2  ALASKA on time         497     221       212           503    1841
## 3    <NA> delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5    <NA> delayed         117     415        65           129      61

Load tidyr and dplyr to transform data

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Set up table as df, eliminate first row, and rename columns

df <- csv_table[2:5,]
colnames(df) <- c("Airline", "Status", "LA", "PHX", "SD", "SF", "SEA")
df <- as.data.frame(df)
df
##   Airline  Status  LA  PHX  SD  SF  SEA
## 2  ALASKA on time 497  221 212 503 1841
## 3    <NA> delayed  62   12  20 102  305
## 4 AM WEST on time 694 4840 383 320  201
## 5    <NA> delayed 117  415  65 129   61

Use the gather function: takes multiple columns and collapses in to key-value pairs

working_data <- gather(df, Destination, Flights, LA:SEA)

# We use the coalesce function to combine 2 columns. The lag() function is used to shift the air lines column by 1. Combining this with the coalensce function, we combine the regular column with the shifted column to remove the NAs

working_data$Airline = coalesce(working_data$Airline,lag(working_data$Airline))

# Rerrange the data by airline and status
?arrange()
working_data <- arrange(working_data, desc(Airline), Status)
working_data$Flights <- as.integer(working_data$Flights)
working_data
##    Airline  Status Destination Flights
## 1  AM WEST delayed          LA     117
## 2  AM WEST delayed         PHX     415
## 3  AM WEST delayed          SD      65
## 4  AM WEST delayed          SF     129
## 5  AM WEST delayed         SEA      61
## 6  AM WEST on time          LA     694
## 7  AM WEST on time         PHX    4840
## 8  AM WEST on time          SD     383
## 9  AM WEST on time          SF     320
## 10 AM WEST on time         SEA     201
## 11  ALASKA delayed          LA      62
## 12  ALASKA delayed         PHX      12
## 13  ALASKA delayed          SD      20
## 14  ALASKA delayed          SF     102
## 15  ALASKA delayed         SEA     305
## 16  ALASKA on time          LA     497
## 17  ALASKA on time         PHX     221
## 18  ALASKA on time          SD     212
## 19  ALASKA on time          SF     503
## 20  ALASKA on time         SEA    1841

Analysis

We’re going to investigate what airline had most delays (proportions)

# use the filter functions to show only delayed flights
data <- filter(working_data, Status == 'delayed')
data
##    Airline  Status Destination Flights
## 1  AM WEST delayed          LA     117
## 2  AM WEST delayed         PHX     415
## 3  AM WEST delayed          SD      65
## 4  AM WEST delayed          SF     129
## 5  AM WEST delayed         SEA      61
## 6   ALASKA delayed          LA      62
## 7   ALASKA delayed         PHX      12
## 8   ALASKA delayed          SD      20
## 9   ALASKA delayed          SF     102
## 10  ALASKA delayed         SEA     305
# Looking at Alaska Flights
data.alaska <- as.data.frame(filter(data, Airline == 'ALASKA') %>% 
               mutate(Proportion_Total_Flights = Flights/sum(Flights)))
data.alaska
##   Airline  Status Destination Flights Proportion_Total_Flights
## 1  ALASKA delayed          LA      62               0.12375250
## 2  ALASKA delayed         PHX      12               0.02395210
## 3  ALASKA delayed          SD      20               0.03992016
## 4  ALASKA delayed          SF     102               0.20359281
## 5  ALASKA delayed         SEA     305               0.60878244
# Looking at AM West flights
data.amwest <- as.data.frame(filter(data, Airline == 'AM WEST') %>% 
               mutate(Proportion_Total_Flights = Flights/sum(Flights)))
data.amwest
##   Airline  Status Destination Flights Proportion_Total_Flights
## 1 AM WEST delayed          LA     117               0.14866582
## 2 AM WEST delayed         PHX     415               0.52731893
## 3 AM WEST delayed          SD      65               0.08259212
## 4 AM WEST delayed          SF     129               0.16391360
## 5 AM WEST delayed         SEA      61               0.07750953
library(ggplot2)

Plot the delayes in flights of both Airlines

ggplot(data.alaska, aes(Destination, Proportion_Total_Flights)) + geom_point(color = 'red')

For Alaska airlines the issues seems to lie in thir Seattle destinations, which most delays occur

ggplot(data.amwest, aes(Destination, Proportion_Total_Flights)) + geom_point(color = 'blue')

For AM West airlines, the issues with delays occurs when Phoenix is the destination