Airlines and Status
library(knitr)
library(rmdformats)
## Global options
options(max.print="75")
opts_chunk$set(echo=FALSE,
cache=TRUE,
prompt=FALSE,
tidy=TRUE,
comment=NA,
message=FALSE,
warning=FALSE)
opts_knit$set(width=75)Introduction
In this assignment, we have been provided with a small messy data that describes delays for two airloned across five destinations. Our goal is to
Create a.csv file that includes all the information provided
Read the information from the .csv file into R, and use tidyr and dplyr as needed to tidy transform the data.
Perform analysis to compare the arroval delays for two airlines.
Create .csv file, load it in R and use tidyr and dplyr for tidy transform of the data.
# load the raw data
df <- read.csv("https://raw.githubusercontent.com/anilak1978/airlines/master/airlines.csv",
stringsAsFactors = FALSE, sep = ",", header = TRUE)
head(df) X X.1 Los.Angeles Phoenix San.diego San.Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 delayed 62 12 20 102 305
3 NA NA NA NA NA
4 AM WEST on time 694 4840 383 320 201
5 delayed 117 415 65 129 61
As we can see the data has missing information, empty row values and other structural issues. Our goal is to bring this dataframe into tidy data standards, which is making each column a variable.
We need to fill the rows that has no values with the appropriate values
X X.1 Los.Angeles Phoenix San.diego San.Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
3 NA NA NA NA NA
4 AM WEST on time 694 4840 383 320 201
5 AM WEST delayed 117 415 65 129 61
Remove row 3 as it has NA values that we do not need
X X.1 Los.Angeles Phoenix San.diego San.Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
4 AM WEST on time 694 4840 383 320 201
5 AM WEST delayed 117 415 65 129 61
Assign Row numbers properly to the data frame
X X.1 Los.Angeles Phoenix San.diego San.Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
3 AM WEST on time 694 4840 383 320 201
4 AM WEST delayed 117 415 65 129 61
Assign Column Names properly to the data frame
colnames(df) <- c("Airlines", "Status", "Los Angeles", "Phoenix", "San Diego",
"San Francisco", "Seattle")
df Airlines Status Los Angeles Phoenix San Diego San Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
3 AM WEST on time 694 4840 383 320 201
4 AM WEST delayed 117 415 65 129 61
Transform the data
# use gather to move from rectangular to move square (goal is to get each
# variablle its own column)
df <- df %>% gather(Destination, Flights, 3:7)
df Airlines Status Destination Flights
1 ALASKA on time Los Angeles 497
2 ALASKA delayed Los Angeles 62
3 AM WEST on time Los Angeles 694
4 AM WEST delayed Los Angeles 117
5 ALASKA on time Phoenix 221
6 ALASKA delayed Phoenix 12
7 AM WEST on time Phoenix 4840
8 AM WEST delayed Phoenix 415
9 ALASKA on time San Diego 212
10 ALASKA delayed San Diego 20
11 AM WEST on time San Diego 383
12 AM WEST delayed San Diego 65
13 ALASKA on time San Francisco 503
14 ALASKA delayed San Francisco 102
15 AM WEST on time San Francisco 320
16 AM WEST delayed San Francisco 129
17 ALASKA on time Seattle 1841
18 ALASKA delayed Seattle 305
[ reached 'max' / getOption("max.print") -- omitted 2 rows ]
# use spread to add ontime and delayed as column variables for better
# analysis
df_2 <- spread(df, Status, Flights)
df_2 Airlines Destination delayed on time
1 ALASKA Los Angeles 62 497
2 ALASKA Phoenix 12 221
3 ALASKA San Diego 20 212
4 ALASKA San Francisco 102 503
5 ALASKA Seattle 305 1841
6 AM WEST Los Angeles 117 694
7 AM WEST Phoenix 415 4840
8 AM WEST San Diego 65 383
9 AM WEST San Francisco 129 320
10 AM WEST Seattle 61 201
Perform Analysis to compare arrival delays for the two airlines.
# Analysis with dplyr package - group the airlines
by_airlines <- df %>% group_by(Airlines, Status) %>% summarise(Flights = sum(Flights))
by_airlines# A tibble: 4 x 3
# Groups: Airlines [2]
Airlines Status Flights
<chr> <chr> <int>
1 ALASKA delayed 501
2 ALASKA on time 3274
3 AM WEST delayed 787
4 AM WEST on time 6438
library("ggplot2")
theme_set(theme_classic())
ggplot(by_airlines, aes(x = Airlines, y = Flights, fill = Status)) + geom_bar(stat = "identity",
width = 0.5) + labs(title = "Airlines and Status")When we compare the on time and delayed distribution between the Airlines and Flights, we see that their each has similar distribituon Alaska having 86% of their flights on time and AM West 89% of their flights on time.
We can also look at the distribution of ontime and delayed Status for Destination.
ggplot(df_2, aes(x = Destination, y = delayed, fill = Airlines)) + geom_bar(stat = "identity",
width = 0.5) + labs(title = "Destination vs Delayed")There are more signifacntly more delayed flights in Alaska Airlines for Seattle destination, and there are significantly more delayed flights for Phoenix destination for AM West Airlines. This might be due to the amount of flights that both airlines offer to those particular destinations however based on this analysis it might make sense to take Alaska Airlines for Phoenix flights and AM West for Seattle.
Conclusion
Based on our analysis, we can see that the Status of Flights being two categories; On time or Delayed, are similar between the two Airlines, AM West flights being slighlty higher chance to be ontime. However, when we look at the destinations, we can see that Airlines Status of Flights are different between the two airlines for each destination.