Description of the Assignment

Create a .CSV file that includes information in a “wide” structure. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. Perform analysis to compare the arrival delays for the two airlines

Steps

Attach libraries, import and show the imported data

library(tidyr)
library(dplyr)
library(ggplot2)
rawdata = read.csv("https://raw.githubusercontent.com/L-Velasco/Fall16_IS607/master/IS607%20Assignment%20-%20Airlines.csv"); rawdata
##   Airline Arrival Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on_time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3  AMWEST on_time         694    4840       383           320     201
## 4  AMWEST delayed         117     415        65           129      61

Tidy the data

Used gather() to make the city destination variables be part of observations and spread() to make the arrival type (on time, delayed) be variables. This step reshapes the data from “wide” to “tall” presentation.

tidydata <- rawdata %>%
  gather("destination","freq",3:7) %>%
  spread(Arrival,freq); tidydata
##    Airline   destination delayed on_time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6   AMWEST   Los.Angeles     117     694
## 7   AMWEST       Phoenix     415    4840
## 8   AMWEST     San.Diego      65     383
## 9   AMWEST San.Francisco     129     320
## 10  AMWEST       Seattle      61     201

Transform the data

Used mutate() to add two new variables that sums the total flights, and calculate the percent of delayed flights, per Airline and destination. These new variables explore the airline’s arrival performance across the city destinations.

transformed_data <- mutate(tidydata,
  total_flights = delayed + on_time,
  pct_delayed = round(delayed / total_flights * 100)); transformed_data
##    Airline   destination delayed on_time total_flights pct_delayed
## 1   ALASKA   Los.Angeles      62     497           559          11
## 2   ALASKA       Phoenix      12     221           233           5
## 3   ALASKA     San.Diego      20     212           232           9
## 4   ALASKA San.Francisco     102     503           605          17
## 5   ALASKA       Seattle     305    1841          2146          14
## 6   AMWEST   Los.Angeles     117     694           811          14
## 7   AMWEST       Phoenix     415    4840          5255           8
## 8   AMWEST     San.Diego      65     383           448          15
## 9   AMWEST San.Francisco     129     320           449          29
## 10  AMWEST       Seattle      61     201           262          23

Analyze and Plot the data

Visually show the percentage of delayed flights.

Delays Per Destination

Given the dataset, the Alaska airline seems to perform better arriving on time compared to Amwest in each destination.

plot_data <- select(transformed_data, Airline, destination, pct_delayed)

ggplot(data = plot_data,aes(x=destination, y=pct_delayed)) +
  geom_bar(stat="identity") +
  facet_wrap(~ Airline) +
  coord_flip() +
  labs(title="Comparison of Arrival Delays", x="Destination", y="Percentage of Arrival Delays")

Delays Per Airline

However, when we look at overall performance across all city destinations, Amwest seems to perform slightly better since this airline has the greater number of flights.

Airline_delayed_data <- transformed_data %>% 
  group_by(Airline) %>% 
  summarise(total_flights_overall = sum(total_flights), total_delayed = sum(delayed)) %>% 
  mutate(pct_delayed_overall = round((total_delayed / total_flights_overall) * 100))
Airline_delayed_data
## # A tibble: 2 × 4
##   Airline total_flights_overall total_delayed pct_delayed_overall
##    <fctr>                 <int>         <int>               <dbl>
## 1  ALASKA                  3775           501                  13
## 2  AMWEST                  7225           787                  11
ggplot(data = Airline_delayed_data,aes(x=Airline, y=pct_delayed_overall)) +
  geom_bar(stat="identity") +
  labs(title="Comparison of Arrival Delays", x="Airline", y="Percentage of Arrival Delays")