FlightData <- read.csv("C:\\temp\\FlightInfo.csv", sep = ",", stringsAsFactors = FALSE)
FlightData
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA On Time 497 221 212 503 1841
## 2 <NA> Delayed 62 12 20 102 305
## 3 AM WEST On Time 694 4840 383 320 201
## 4 <NA> Delayed 117 415 65 129 61
In the original data, values (city names) are being used as variables, the data also have some empty row values and meaningless column names. So in order to make the data tidy the wide format needs to be converted to long format so that all the city names can be arranged under one variable. Empty values in the rows and meaningless column names also need to be replaced with appropriate values and names respectively:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(tidyr)
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.4.2
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
FlightInfo <- FlightData %>% mutate(X = na.locf(X, na.rm = F)) %>%
setnames(old = c("X", "X.1"), new = c("Airline", "Arrival")) %>%
gather("City", "Flight_Counts", 3:7)
FlightInfo
## Airline Arrival City Flight_Counts
## 1 ALASKA On Time Los.Angeles 497
## 2 ALASKA Delayed Los.Angeles 62
## 3 AM WEST On Time Los.Angeles 694
## 4 AM WEST Delayed Los.Angeles 117
## 5 ALASKA On Time Phoenix 221
## 6 ALASKA Delayed Phoenix 12
## 7 AM WEST On Time Phoenix 4840
## 8 AM WEST Delayed Phoenix 415
## 9 ALASKA On Time San.Diego 212
## 10 ALASKA Delayed San.Diego 20
## 11 AM WEST On Time San.Diego 383
## 12 AM WEST Delayed San.Diego 65
## 13 ALASKA On Time San.Francisco 503
## 14 ALASKA Delayed San.Francisco 102
## 15 AM WEST On Time San.Francisco 320
## 16 AM WEST Delayed San.Francisco 129
## 17 ALASKA On Time Seattle 1841
## 18 ALASKA Delayed Seattle 305
## 19 AM WEST On Time Seattle 201
## 20 AM WEST Delayed Seattle 61
The data looks much better now but still there are two rows for each ovseravtion of a city/Airline pair, so more transformation is needed to make it tidy so that each ovservation can be arranged in a single row:
FlightData <- spread(FlightInfo, 2, 4)
FlightData
## Airline City Delayed On Time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
The below visualization of the shows comparison of the airlines based on their delayed flights. The plot depicts that in most of the cities AM West Airlines has the larger number of delayed flights except Seattle where Alaska Airline has more delayed flights.
Figure 1:
ggplot(FlightData, aes(x = City, y = Delayed)) + geom_point(aes(size = 1,
color = Airline))
Some statistics:
Flight_Statistics <- FlightData %>% group_by(Airline) %>% summarise(Avg.Delayed = mean(Delayed),
`Avg.On Time` = mean(`On Time`), `Total Delayed` = sum(Delayed),
`Total On Time` = sum(`On Time`), `Total Flights` = sum(Delayed +
`On Time`), `Percent Delayed` = round((`Total Delayed`/`Total Flights`),
2), `Percent On Time` = round((`Total On Time`/`Total Flights`),
2), `Maximum Delay` = max(Delayed), `Minimum Delay` = min(Delayed))
Flight_Statistics
## # A tibble: 2 x 10
## Airline Avg.Delayed `Avg.On Time` `Total Delayed` `Total On Time`
## <chr> <dbl> <dbl> <int> <int>
## 1 ALASKA 100.2 654.8 501 3274
## 2 AM WEST 157.4 1287.6 787 6438
## # ... with 5 more variables: `Total Flights` <int>, `Percent
## # Delayed` <dbl>, `Percent On Time` <dbl>, `Maximum Delay` <dbl>,
## # `Minimum Delay` <dbl>
Above data statistics shows that the percentage of delayed flights is higher for Alaska Airlines if all the flights are considered. Therefore if no further analysis is done it is possible to come up with a conclusion that AM West Airline is better since it has lower percentage of delayed flights.
Ratio of delayed and on time flights by City:
Delyed_Ratio_Cities <- mutate(FlightData, Percent_Delay_City = round(Delayed/(Delayed +
`On Time`), 2), Percent_ontime_City = round(`On Time`/(Delayed +
`On Time`), 2))
Delyed_Ratio_Cities
## Airline City Delayed On Time Percent_Delay_City
## 1 ALASKA Los.Angeles 62 497 0.11
## 2 ALASKA Phoenix 12 221 0.05
## 3 ALASKA San.Diego 20 212 0.09
## 4 ALASKA San.Francisco 102 503 0.17
## 5 ALASKA Seattle 305 1841 0.14
## 6 AM WEST Los.Angeles 117 694 0.14
## 7 AM WEST Phoenix 415 4840 0.08
## 8 AM WEST San.Diego 65 383 0.15
## 9 AM WEST San.Francisco 129 320 0.29
## 10 AM WEST Seattle 61 201 0.23
## Percent_ontime_City
## 1 0.89
## 2 0.95
## 3 0.91
## 4 0.83
## 5 0.86
## 6 0.86
## 7 0.92
## 8 0.85
## 9 0.71
## 10 0.77
The worst city in terms of delayed flights is San Francisco for both flights, both Airlines have largest delayed flights in San Francisco.Figure 2 and Figure 3 reveal that Alaska Airlines is better in every city compared to AM West Airlines. In every city Alaska Airline has smaller proportion of delayed flights and larger proportion of on time flights.
Figure 2:
p1 <- ggplot(Delyed_Ratio_Cities, aes(City, Percent_Delay_City)) +
geom_bar(aes(fill = Airline), stat = "identity", position = "dodge") +
labs(title = "Percentage of Delayed Flights by City ", y = "Percentge")
p2 <- ggplot(Delyed_Ratio_Cities, aes(City, Percent_ontime_City)) +
geom_bar(aes(fill = Airline), stat = "identity", position = "dodge") +
labs(title = "Percentage of on time Flights by City ", y = "Percentge")
grid.arrange(p1, p2, nrow = 2)
Figure 3:
ggplot(Delyed_Ratio_Cities, aes(x = City, y = Percent_Delay_City,
group = Airline, color = Airline)) + geom_line() + labs(title = "Percentage of Delayed Flights by City ",
y = "Percentge")
So Alaska Airline is better when the percentage of flights (both delayed and on time) are considered in every city. But AM West appears to be better when all the flights are considered at a time, which suggests that there must be some large values in one or two cities that would explain this discrepancy.
Figure 4 shows that in Phoenix AM West Airline has a huge number of flights compared to what Alaska has in there. Figure 5 shows that Phoenix also has a very large number of on time flights. Since the presence of Alaska Airline in Phoenix is very samll it is obvious that most of those on time flights belong to AM West Airline. Therefore this large number of on time flights in Phoenix affect the overall data in favor of AM West Airline and explains why AM West Airline looks better when the data is seen as a whole.
Figure 4:
ggplot(FlightInfo, aes(City, Flight_Counts)) + geom_bar(aes(fill = Airline),
stat = "identity", position = "dodge") + labs(title = "Flight counts by City ",
y = "Count")
Figure 5:
ggplot(FlightInfo, aes(x = Arrival, y = Flight_Counts, fill = City)) +
geom_bar(stat = "identity", position = "dodge") + xlab("Airlines") +
ylab("Amount of flights")
The density plot below also shows that Alaska Airline is doing better since it has higher density of lower percentage of delayed flights:
qplot(Percent_Delay_City, data = Delyed_Ratio_Cities, geom = "density",
color = Airline, linetype = Airline)
AM West Airline has lower percentage of delayed flights when all the data is considered. But when each city is seperately considered it becomes clear that Alaska Airline performs better and has lower percentage of delayed flights in each city. The huge number of flights of AM West Airline in Phoenix is actually responsible for this false impression that AM West Airline is better (when all the data is considered at a time).