In this assignment, we are asked to create a csv file that contains the information below and use tidyr and dplyr to transform the data into the wide structure, We are then required to perform comparison analysis for the arrival delays on the two airlines.
knitr::include_graphics('607data.png')
The tidyverse contains the following packages in R. We have called three of the packages namely dplyr, tidyr and ggplot2 for this assignment.
knitr::include_graphics('tidy.png')
library(dplyr)
library(tidyr)
library(ggplot2)
We created a csv file containing the stated data and uploaded it at github. Here is what the raw csv file looks like
knitr::include_graphics('rawcsv.png')
We then read the csv file into R using the following commands
raw <- read.csv("https://raw.githubusercontent.com/zahirf/Data607/master/Assignment04.csv", sep = ",", header = TRUE, stringsAsFactors=FALSE)
raw<-na.omit(raw)
print(raw)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
First we need to fill in the blank fields in the table
colnames(raw)[1:2] <- c("Airline", "ArrivalStatus")
raw[2, 1] <- "ALASKA"
raw[4, 1] <- "AM WEST"
str(raw)
## 'data.frame': 4 obs. of 7 variables:
## $ Airline : chr "ALASKA" "ALASKA" "AM WEST" "AM WEST"
## $ ArrivalStatus: chr "on time" "delayed" "on time" "delayed"
## $ Los.Angeles : int 497 62 694 117
## $ Phoenix : int 221 12 4840 415
## $ San.Diego : int 212 20 383 65
## $ San.Francisco: int 503 102 320 129
## $ Seattle : int 1841 305 201 61
## - attr(*, "na.action")= 'omit' Named int 3
## ..- attr(*, "names")= chr "3"
raw$Phoenix<-as.numeric(raw$Phoenix)
raw$Seattle<-as.numeric(raw$Seattle)
knitr::include_graphics('long.png')
We have used the gather command to take columns 3 to 7 from raw and put them in one single column called NumFlights
long<- gather(raw, City, NumFlights, 3:7)
long
## Airline ArrivalStatus City NumFlights
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 ALASKA delayed San.Diego 20
## 11 AM WEST on time San.Diego 383
## 12 AM WEST delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 ALASKA delayed San.Francisco 102
## 15 AM WEST on time San.Francisco 320
## 16 AM WEST delayed San.Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
Here we can see that Phoenix tops the list with Seattle coming in as second.
long1<-long
long1 %>%
group_by(City) %>%
filter(ArrivalStatus=="delayed") %>%
summarize(TotalDelay=sum(NumFlights))%>%
arrange(desc(TotalDelay))
## # A tibble: 5 x 2
## City TotalDelay
## <chr> <dbl>
## 1 Phoenix 427
## 2 Seattle 366
## 3 San.Francisco 231
## 4 Los.Angeles 179
## 5 San.Diego 85
long1 %>%
group_by(Airline) %>%
filter(ArrivalStatus=="delayed") %>%
summarize(TotalDelay=sum(NumFlights))%>%
arrange(desc(TotalDelay))
## # A tibble: 2 x 2
## Airline TotalDelay
## <chr> <dbl>
## 1 AM WEST 787
## 2 ALASKA 501
However, in order to compare on an apple to apple basis, we need to compare the percentage of delays on total flights. We see that even though AM WEST had more delays in absolute terms, ALASKA has a higher percentage of flights delayed.
split = spread(long, "ArrivalStatus", "NumFlights")
colnames(split)[3:4] <- c("Delayed", "OnTime")
split%>%
group_by(Airline)%>%
summarize(totalDelay=sum(Delayed),
totalonTime=sum(OnTime),
PercentDelayed=totalDelay/(totalDelay+totalonTime)*100)%>%
arrange(desc(PercentDelayed))
## # A tibble: 2 x 4
## Airline totalDelay totalonTime PercentDelayed
## <chr> <dbl> <dbl> <dbl>
## 1 ALASKA 501 3274 13.3
## 2 AM WEST 787 6438 10.9
Let us look at the percentage of flight delayed by city
split1<-split%>%
group_by(Airline, City)%>%
mutate(DelayedPercent=sum(Delayed)/sum(Delayed+OnTime))%>%
arrange(City)
ggplot(data=split1, aes(City, DelayedPercent, color=Airline, shape=Airline))+
geom_point(stat="identity")+
ggtitle("Delay Rate by Cities")
The findings are contradictory as ALASKA has an overall higher delay rate, but AM WEST has a higher delay rate in every city. This is probably because ALASKA operates comparatively more flights in Seattle and the high delay rate in this city has drawn up the average for ALASKA.
ggplot(long, aes(City, NumFlights, fill=Airline))+
geom_bar(stat="identity")+
ggtitle("Number of Flights operated by cities")
Overall, Alaska seems to be the better airline despite operating a much lower number of flights overall when we look at the data from these five cities.