A .csv file of the data was created and it is being loaded below. While loading the null fields are given “NA”. The first column is named Airline and second column is named Status.
flight <- read.table("https://raw.githubusercontent.com/irene908/DATA-607/master/Assignment%205_Delays.csv", header=TRUE, sep=",", na.strings = c("", "NA"))
names(flight)[1] <- "Airline"
names(flight)[2] <- "Status"
flight
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
The empty row is deleted.
flight <- flight[!apply(is.na(flight[1:7]),1,all), ]
flight
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Update the missing fields using the following for loop.
for(i in 2:nrow(flight)) {
if(is.na(flight$Airline[i])){
flight$Airline[i] <- flight$Airline[i-1]
}
}
flight
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
COnverting the “wide” data into a “long” data. To do this the gather() function of tidyr is used.
long_flight <- gather(flight, "City", "Count", 3:7)
datatable(long_flight, extensions = 'Scroller', options = list(scrollY = 200, scroller = TRUE ))
The spread() function in tidyr is applied to the Status field.
status_spread <- spread(long_flight,Status,Count)
datatable(status_spread, extensions = 'Scroller', options = list(scrollY = 200, scroller = TRUE ))
group_by() and summarize() functions of dplyr are used to find the total number of flights that were on time by each airline.
overall_ontime <- status_spread %>% group_by(Airline) %>% summarize(avg_ontime = mean(`on time`))
overall_ontime
## # A tibble: 2 x 2
## Airline avg_ontime
## <chr> <dbl>
## 1 ALASKA 655.
## 2 AM WEST 1288.
Compare the per-city on-time performance for both airlines
per_city_ontime <- status_spread %>% group_by(Airline,City) %>% summarize(avg_ontime = mean(`on time`))
datatable(per_city_ontime, extensions = 'Scroller', options = list(scrollY = 200, scroller = TRUE ))
ggplot(per_city_ontime ,aes(x= City, y=avg_ontime, fill=Airline)) +
geom_bar(stat="identity", position=position_dodge())
Summarize the delay data
overall_delay <- status_spread %>% group_by(Airline) %>% summarize(avg_delay = mean(`delayed`))
overall_delay
## # A tibble: 2 x 2
## Airline avg_delay
## <chr> <dbl>
## 1 ALASKA 100.
## 2 AM WEST 157.
per_city_delay <- status_spread %>% group_by(Airline,City) %>% summarize(avg_delay = mean(`delayed`))
datatable(per_city_delay, extensions = 'Scroller', options = list(scrollY = 200, scroller = TRUE ))
ggplot(per_city_delay ,aes(x= City, y=avg_delay, fill=Airline)) +
geom_bar(stat="identity", position=position_dodge())
d <- status_spread %>% dplyr::group_by(Airline) %>%
dplyr::summarise(d_max=max(delayed),
d_min=min(delayed),
d_avg=mean(delayed),
d_sd=sd(delayed),
d_median=median(delayed),
IQR=IQR(delayed))
datatable(d, extensions = 'Scroller', options = list(scrollY = 100))
From the above analysis it is clear that AM WEST had more number of on time flights when compared to ALASKA. When we consider the on time flights in Seattle we can see that ALASKA has more on time flights compared to AM WEST.
So this proves that judging an airline based on all the flights is not the best option. We should go further on to each destination to see which airline is better for that destination.