The chart below describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: The URL to the .Rmd file in your GitHub repository. and The URL for your rpubs.com web page.
csv <- read.csv("https://raw.githubusercontent.com/petferns/607-week5/master/untidy.csv", na.strings = c("", "NA"))
head(csv)
## ï.. X Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
csv <- csv[!apply(is.na(csv[1:7]),1,all), ]
# Renaming the first column to Airline and second to Status
names(csv)[1] <- "Airline"
names(csv)[2] <- "Status"
head(csv)
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
for(i in 2:nrow(csv)) {
if(is.na(csv$Airline[i])){
csv$Airline[i] <- csv$Airline[i-1]
}
}
wide_to_long <- gather(csv, "City", "Count", 3:7)
head(wide_to_long)
## Airline Status City Count
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
transformed <- spread(wide_to_long,Status,Count)
datatable(transformed, extensions = 'Scroller', options = list(scrollY = 200, scroller = TRUE ))
overall_ontime <- transformed %>% group_by(Airline) %>% summarize(avg_ontime = mean(`on time`))
## `summarise()` ungrouping output (override with `.groups` argument)
head(overall_ontime)
## # A tibble: 2 x 2
## Airline avg_ontime
## <chr> <dbl>
## 1 ALASKA 655.
## 2 AM WEST 1288.
ggplot(overall_ontime ,aes(x= Airline, y=avg_ontime, fill=Airline)) +
geom_bar(stat="identity", position=position_dodge())
percity_ontime <- transformed %>% group_by(Airline,City) %>% summarize(avg_ontime = mean(`on time`))
## `summarise()` regrouping output by 'Airline' (override with `.groups` argument)
ggplot(percity_ontime ,aes(x= City, y=avg_ontime, fill=Airline)) +
geom_bar(stat="identity", position=position_dodge())
overall_delay <- transformed %>% group_by(Airline) %>% summarize(avg_delay = mean(`delayed`))
## `summarise()` ungrouping output (override with `.groups` argument)
head(overall_delay)
## # A tibble: 2 x 2
## Airline avg_delay
## <chr> <dbl>
## 1 ALASKA 100.
## 2 AM WEST 157.
ggplot(overall_delay ,aes(x= Airline, y=avg_delay, fill=Airline)) +
geom_bar(stat="identity", position=position_dodge())
percity_delay <- transformed %>% group_by(Airline,City) %>% summarize(avg_delay = mean(`delayed`))
## `summarise()` regrouping output by 'Airline' (override with `.groups` argument)
head(percity_delay)
## # A tibble: 6 x 3
## # Groups: Airline [2]
## Airline City avg_delay
## <chr> <chr> <dbl>
## 1 ALASKA Los.Angeles 62
## 2 ALASKA Phoenix 12
## 3 ALASKA San.Diego 20
## 4 ALASKA San.Francisco 102
## 5 ALASKA Seattle 305
## 6 AM WEST Los.Angeles 117
ggplot(percity_delay ,aes(x= City, y=avg_delay, fill=Airline)) +
geom_bar(stat="identity", position=position_dodge())
As per the overall performance graph, we see AM WEST has higher ontime flights compared to ALSAKA but when we see the same over per city performance, except two cities remaining cities have almost similar performance for both airlines. In city Seattle ALSAKA has higher ontime flights then AM WEST
When we compare the delay overview, AM WEST has higher delayed flights and comparing the ontime graph AM WEST has higher ontime flights - which contradicts to itself. This also means that higher ontime flights doesn’t guarantee lesser delay.