Introduction

In this assignment, we are asked to create a csv file that contains the information below and use tidyr and dplyr to transform the data into the wide structure, We are then required to perform comparison analysis for the arrival delays on the two airlines.

knitr::include_graphics('607data.png')

Load libraries

The tidyverse contains the following packages in R. We have called three of the packages namely dplyr, tidyr and ggplot2 for this assignment.

knitr::include_graphics('tidy.png')

library(dplyr)
library(tidyr)
library(ggplot2)

Creating and uploading csv

We created a csv file containing the stated data and uploaded it at github. Here is what the raw csv file looks like

knitr::include_graphics('rawcsv.png')

We then read the csv file into R using the following commands

raw <- read.csv("https://raw.githubusercontent.com/zahirf/Data607/master/Assignment04.csv", sep = ",", header = TRUE, stringsAsFactors=FALSE)
raw<-na.omit(raw)
print(raw)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Tidying and Reshaping the data

First we need to fill in the blank fields in the table

  1. Add column names Airline and ArrivalStatus for the missing headers. and remove the blank row
  2. Fill in the missing values for Alaska and Amwest on [2,1] and [4,1]
  3. Check the structure to identify if we need conversion between data types
colnames(raw)[1:2] <- c("Airline", "ArrivalStatus")
raw[2, 1] <- "ALASKA"
raw[4, 1] <- "AM WEST"
str(raw)
## 'data.frame':    4 obs. of  7 variables:
##  $ Airline      : chr  "ALASKA" "ALASKA" "AM WEST" "AM WEST"
##  $ ArrivalStatus: chr  "on time" "delayed" "on time" "delayed"
##  $ Los.Angeles  : int  497 62 694 117
##  $ Phoenix      : int  221 12 4840 415
##  $ San.Diego    : int  212 20 383 65
##  $ San.Francisco: int  503 102 320 129
##  $ Seattle      : int  1841 305 201 61
##  - attr(*, "na.action")= 'omit' Named int 3
##   ..- attr(*, "names")= chr "3"
  1. We see that the number of flights for Phoenix and Seattle are of character type. We will need to convert these to numeric so we may do analysis later.
raw$Phoenix<-as.numeric(raw$Phoenix)
raw$Seattle<-as.numeric(raw$Seattle)
  1. The next step is to convert the data from the wide format that is provided to the long format. The long format takes up more space but is more flexible. An example of the difference between the two is given below:
knitr::include_graphics('long.png')

We have used the gather command to take columns 3 to 7 from raw and put them in one single column called NumFlights

long<- gather(raw, City, NumFlights, 3:7)
long
##    Airline ArrivalStatus          City NumFlights
## 1   ALASKA       on time   Los.Angeles        497
## 2   ALASKA       delayed   Los.Angeles         62
## 3  AM WEST       on time   Los.Angeles        694
## 4  AM WEST       delayed   Los.Angeles        117
## 5   ALASKA       on time       Phoenix        221
## 6   ALASKA       delayed       Phoenix         12
## 7  AM WEST       on time       Phoenix       4840
## 8  AM WEST       delayed       Phoenix        415
## 9   ALASKA       on time     San.Diego        212
## 10  ALASKA       delayed     San.Diego         20
## 11 AM WEST       on time     San.Diego        383
## 12 AM WEST       delayed     San.Diego         65
## 13  ALASKA       on time San.Francisco        503
## 14  ALASKA       delayed San.Francisco        102
## 15 AM WEST       on time San.Francisco        320
## 16 AM WEST       delayed San.Francisco        129
## 17  ALASKA       on time       Seattle       1841
## 18  ALASKA       delayed       Seattle        305
## 19 AM WEST       on time       Seattle        201
## 20 AM WEST       delayed       Seattle         61

Data Analysis and Visualization

  1. Suppose we want to see which city had the maximum number of flights delayed. We can filter for delayed flights and sort the table in descending to look at which cities have the most delayed flights. I am using a new variable long1 as I want to keep the original data intact.

Here we can see that Phoenix tops the list with Seattle coming in as second.

long1<-long
long1 %>%
  group_by(City) %>%
  filter(ArrivalStatus=="delayed") %>%
  summarize(TotalDelay=sum(NumFlights))%>%
  arrange(desc(TotalDelay))
## # A tibble: 5 x 2
##   City          TotalDelay
##   <chr>              <dbl>
## 1 Phoenix              427
## 2 Seattle              366
## 3 San.Francisco        231
## 4 Los.Angeles          179
## 5 San.Diego             85
  1. We can also look at delays by airline. We see that AM WEST has more delays than ALASKA.
long1 %>%
  group_by(Airline) %>%
  filter(ArrivalStatus=="delayed") %>%
  summarize(TotalDelay=sum(NumFlights))%>%
  arrange(desc(TotalDelay))
## # A tibble: 2 x 2
##   Airline TotalDelay
##   <chr>        <dbl>
## 1 AM WEST        787
## 2 ALASKA         501

However, in order to compare on an apple to apple basis, we need to compare the percentage of delays on total flights. We see that even though AM WEST had more delays in absolute terms, ALASKA has a higher percentage of flights delayed.

split = spread(long, "ArrivalStatus", "NumFlights")
colnames(split)[3:4] <- c("Delayed", "OnTime")
split%>%
  group_by(Airline)%>%
  summarize(totalDelay=sum(Delayed),
            totalonTime=sum(OnTime),
            PercentDelayed=totalDelay/(totalDelay+totalonTime)*100)%>%
  arrange(desc(PercentDelayed))
## # A tibble: 2 x 4
##   Airline totalDelay totalonTime PercentDelayed
##   <chr>        <dbl>       <dbl>          <dbl>
## 1 ALASKA         501        3274           13.3
## 2 AM WEST        787        6438           10.9

Let us look at the percentage of flight delayed by city

split1<-split%>%
  group_by(Airline, City)%>%
  mutate(DelayedPercent=sum(Delayed)/sum(Delayed+OnTime))%>%
  arrange(City)
ggplot(data=split1, aes(City, DelayedPercent, color=Airline, shape=Airline))+
  geom_point(stat="identity")+
  ggtitle("Delay Rate by Cities")

The findings are contradictory as ALASKA has an overall higher delay rate, but AM WEST has a higher delay rate in every city. This is probably because ALASKA operates comparatively more flights in Seattle and the high delay rate in this city has drawn up the average for ALASKA.

ggplot(long, aes(City, NumFlights, fill=Airline))+
  geom_bar(stat="identity")+
  ggtitle("Number of Flights operated by cities")

Conclusion

Overall, Alaska seems to be the better airline despite operating a much lower number of flights overall when we look at the data from these five cities.