![]()
Image of Chart
The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.4
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.4
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyselect)
library(ggplot2)
flights_data<-read.csv("https://raw.githubusercontent.com/juanellemarks/JuanelleMarks_DATA607_2018/master/flights_data.csv", sep = ",", header = TRUE)
flights_data
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA <NA>
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
#class(flights_data)
There is a row in the data frame which separated information about the two airlines. The cells in this row contained NA. In tidying up the data frame, this row was removed.
flights_data<-drop_na(flights_data,)
flights_data
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
By default, columns one and two were labelled X and X.1. It is appropriate to rename these labels so as to better reflect the data they ‘label’.
names(flights_data)[1]<-"Airline"
names(flights_data)[2]<-"Flight_Status"
flights_data
## Airline Flight_Status Los.Angeles Phoenix San.Diego San.Francisco
## 1 ALASKA on time 497 221 212 503
## 2 delayed 62 12 20 102
## 4 AM WEST on time 694 4840 383 320
## 5 delayed 117 415 65 129
## Seattle
## 1 1,841
## 2 305
## 4 201
## 5 61
flights_data$Airline[flights_data$Airline==""]<-c("ALASKA","AM WEST")
flights_data
## Airline Flight_Status Los.Angeles Phoenix San.Diego San.Francisco
## 1 ALASKA on time 497 221 212 503
## 2 ALASKA delayed 62 12 20 102
## 4 AM WEST on time 694 4840 383 320
## 5 AM WEST delayed 117 415 65 129
## Seattle
## 1 1,841
## 2 305
## 4 201
## 5 61
Asignment question requires that we compare the delay times of the two airlines in the data frame. The current format of the data frame is ideal for getting a sense of the flight status of each of the two airlines across the five cities. However, it will be difficult to conduct a comparison analysis of the two airline flight delays while the data is in this format. Hence, the data will be transformed to the long format.
flights_data1<- flights_data %>% gather(key = Destination, value= Count, c(3:7))
## Warning: attributes are not identical across measure variables;
## they will be dropped
flights_data1
## Airline Flight_Status Destination Count
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 ALASKA delayed San.Diego 20
## 11 AM WEST on time San.Diego 383
## 12 AM WEST delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 ALASKA delayed San.Francisco 102
## 15 AM WEST on time San.Francisco 320
## 16 AM WEST delayed San.Francisco 129
## 17 ALASKA on time Seattle 1,841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
Since we are only interested in flight delay status of the two airlines across the five cities, it is appropriate then to filter out this information from the data frame.
flights_data2<-filter(flights_data1, Flight_Status=="delayed")
## Warning: package 'bindrcpp' was built under R version 3.4.4
flights_data2
## Airline Flight_Status Destination Count
## 1 ALASKA delayed Los.Angeles 62
## 2 AM WEST delayed Los.Angeles 117
## 3 ALASKA delayed Phoenix 12
## 4 AM WEST delayed Phoenix 415
## 5 ALASKA delayed San.Diego 20
## 6 AM WEST delayed San.Diego 65
## 7 ALASKA delayed San.Francisco 102
## 8 AM WEST delayed San.Francisco 129
## 9 ALASKA delayed Seattle 305
## 10 AM WEST delayed Seattle 61
In order to perform computational analysis on the Count variable, we need to convert it to numeric.
flights_data2$Count = as.numeric(as.character(flights_data2$Count))
flights_data2
## Airline Flight_Status Destination Count
## 1 ALASKA delayed Los.Angeles 62
## 2 AM WEST delayed Los.Angeles 117
## 3 ALASKA delayed Phoenix 12
## 4 AM WEST delayed Phoenix 415
## 5 ALASKA delayed San.Diego 20
## 6 AM WEST delayed San.Diego 65
## 7 ALASKA delayed San.Francisco 102
## 8 AM WEST delayed San.Francisco 129
## 9 ALASKA delayed Seattle 305
## 10 AM WEST delayed Seattle 61
With count converted to numeric, we can do a summary in order to get a sense of the delay times irregardless of airline or state.
summary(flights_data2$Count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 61.25 83.50 128.80 126.00 415.00
#sd(flights_data2$Count)
summarise(flights_data2, mean=mean(Count), sd=sd(Count))
## mean sd
## 1 128.8 130.1382
From this given data set, it can be seen that that there is an average of 128 flight delays with a standard deviation of approximately 130.
#ggplot(flights_data2, aes(Count)) + geom_bar(stat="bin", binwidth = 20, fill="steelblue")
barplot(flights_data2$Count)
The distribution of the delay counts seem bimodal and somewhat skewed.
library(ggplot2)
Alaska_delays<-filter(flights_data2, Airline=="ALASKA")
Alaska_delays
## Airline Flight_Status Destination Count
## 1 ALASKA delayed Los.Angeles 62
## 2 ALASKA delayed Phoenix 12
## 3 ALASKA delayed San.Diego 20
## 4 ALASKA delayed San.Francisco 102
## 5 ALASKA delayed Seattle 305
ggplot(Alaska_delays,aes(Destination,Count)) + geom_bar(aes(fill = Airline), position = "dodge", stat = "identity") +
ylab("Count")
From the graph above it can be observed that the number of delayed flights to Seattle on Alaska airlies was not just ‘slightly’ larger, but instead ‘significantly’ larger compared to those to the other four cities. What could be the main contributing factor/s to this observation? To be noted also, is the pattern of flight delay counts to the other cities.There does not appear to be any significant consistencies among these counts.
AM_West_delays<-filter(flights_data2, Airline=="AM WEST")
AM_West_delays
## Airline Flight_Status Destination Count
## 1 AM WEST delayed Los.Angeles 117
## 2 AM WEST delayed Phoenix 415
## 3 AM WEST delayed San.Diego 65
## 4 AM WEST delayed San.Francisco 129
## 5 AM WEST delayed Seattle 61
ggplot(AM_West_delays,aes(Destination,Count)) + geom_bar(aes(fill = Airline), position = "dodge", stat = "identity") +
ylab("Count")
>In comparision, Am West airlines had significantly larger flight delays to Phoenix than to any other state. To be observed also, there is no large difference among the delay times to the other cities on this airline suggesting a consistent pattern of flight delays.
Airlines_Sum <- flights_data2 %>%
group_by(Airline) %>%
summarise(TotalDelay = sum(Count))
Airlines_Sum
## # A tibble: 2 x 2
## Airline TotalDelay
## <fctr> <dbl>
## 1 ALASKA 501
## 2 AM WEST 787
ggplot(Airlines_Sum) + aes(x= Airline, y= TotalDelay, fill = Airline) + geom_bar(stat="identity", position=position_dodge())
By observing the diagram above, it can be seen that AM West had the largest overall delay counts. This could mean that Alaska airlines is the more reliable of the two airlines.
ggplot(flights_data2, aes(x = Destination, y = Count, group = Airline, color = Airline, shape = Airline)) +
geom_point(aes(shape = Airline)) +
geom_line() +
labs(x = "Cities", y = "Delay_Counts") +
theme(legend.title=element_blank())
From observation of the line graph above, it can be see that Alaska airlines has a somewhat lower flight delay count than AM West across four of the five states.
Based on the analysis of the data in the dataset used in this assignment, Alaska airlines seems to be a more reliable airline with minimal flight delays, compared to AM West airlines. Across both airlines there is an average of 128 flight delays to five different cities.This information is important to note if airlines wish to optimise their service and maintain their clientelle (passengers). This information may also be of interest to non-frequent flyers as it should enable them to be more cognizant of this reality.
If i am to offer advice on choice of airline (Alaska or AM West) when travelling to the five cities addressed in this dataset, i would highly recommend Alaska airlines except when there is a need to travel to Phoenix. In this case AM West is the better choice/recommendation.