##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Data will be loaded from the CSV file I created for the project. I will parse the data as a csv file and execute summary function to show some important information of the raw data.
urlfile <- "https://raw.githubusercontent.com/RommyGraphs/MSDA/master/DATA607/HW5Assignment.csv"
airlineData <- read.table(file = urlfile, header = TRUE, sep = ",")
airlineData## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
Since I loaded the data successfully, I decided to use a combination of tidyr, dplyr, and ggplot in order to show graphical comparisons of specific questions I would like to ask: a comparison of delays vs. on time statuses by Airline and by City, a comparison of percent delays by airlines by city, and a comparison of total percent delays by city regardless of airlines.
Solution: Using a combination of tidyr and dplyr functions, I was able to arrange my data to show flights as a column in the new data frame. To help ease my graphing of the fields, I combined Airline and Status fields into one field. Using the new data frame I created, I created a bar chart that shows a visual comparison of delay and on-time statuses for each airline by city. Delayed statuses are colored a darker color while on-time statuses are colored a lighter color. Alaska Airlines uses color variations of blue while American West uses color variations of red.
Analysis: By analyzing the chart, you can see that Phoenix and Seattle encompass the most flights for the two airlines. Both airlines appear to have the least amount of delays vs. total on-time flights in Phoenix. American West appears to have a commanding presence in Phoenix as well. The city with the second most flights appears to be Seattle. A more important gauge is to find the percentage of delays vs. on flight statuses as they will give a more accurate picture of air flight delays between the two airlines for the listed cities.
# Ref: [DAT] [GGP] [HTM]
airlineDataLF <- airlineData %>%
gather(City, Flights, Los.Angeles:Seattle) %>%
arrange(Airline, Status)
airlineDataLF <- unite(airlineDataLF,"AirlineStatus" , c("Airline","Status"), sep = "-")
airlineDataLF## AirlineStatus City Flights
## 1 ALASKA-delayed Los.Angeles 62
## 2 ALASKA-delayed Phoenix 12
## 3 ALASKA-delayed San.Diego 20
## 4 ALASKA-delayed San.Francisco 102
## 5 ALASKA-delayed Seattle 305
## 6 ALASKA-on time Los.Angeles 497
## 7 ALASKA-on time Phoenix 221
## 8 ALASKA-on time San.Diego 212
## 9 ALASKA-on time San.Francisco 503
## 10 ALASKA-on time Seattle 1841
## 11 AM WEST-delayed Los.Angeles 117
## 12 AM WEST-delayed Phoenix 415
## 13 AM WEST-delayed San.Diego 65
## 14 AM WEST-delayed San.Francisco 129
## 15 AM WEST-delayed Seattle 61
## 16 AM WEST-on time Los.Angeles 694
## 17 AM WEST-on time Phoenix 4840
## 18 AM WEST-on time San.Diego 383
## 19 AM WEST-on time San.Francisco 320
## 20 AM WEST-on time Seattle 201
plot1 <- ggplot(data=airlineDataLF, aes(x=City, y=Flights, fill=AirlineStatus)) +
geom_bar(stat="identity", position=position_dodge()) +
geom_text(aes(label=Flights), vjust=1.6, color="black", position = position_dodge(0.9), size=3.5) +
scale_fill_brewer(palette="Paired")
plot1 + scale_fill_manual(values=c("#1F618D","#7FB3D5", "#922B21","#F1948A"))## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.
Solution: Using a combination of tidyr and dplyr functions, I was able to arrange my data to show flights as a column in the new data frame. I used mutate to add two new calculated fields: AlaskaPercentDelay and AMWestPercentDelay. To help ease my graphing of the fields, I combined Airline and Status fields into one field. Using the new data frame I created, I created a bar chart that shows a visual comparison of the percentage delays for each airline by city.
Analysis: As I indicated in the last chart and now in this chart, both airlines have the lowest percentage of delays in Phoenix at 5.43% for Alaska and 8.57% for American West. Conversely, both airlines have the highest percentage of delays in San Francisco at 20.28% for Alaska and 40.31% for American West. While American West enjoys the best flight delay percentage at Phoenix, it is certainly not enjoying the worst flight delay percentage at San Francisco.
#Ref: [DAT]
trdf <- rownames_to_column(airlineData) %>%
gather(var, value, -rowname) %>%
spread(rowname, value)## Warning: attributes are not identical across measure variables;
## they will be dropped
## City ALT ALD AMT AMD
## 2 Los.Angeles 497 62 694 117
## 3 Phoenix 221 12 4840 415
## 4 San.Diego 212 20 383 65
## 5 San.Francisco 503 102 320 129
## 6 Seattle 1841 305 201 61
airlineDataT <- mutate(trdf, AlaskaPercentDelay = as.numeric(ALD) / as.numeric(ALT) * 100, AMWestPercentDelay = as.numeric(AMD) / as.numeric(AMT) * 100 )
airlineDataT## City ALT ALD AMT AMD AlaskaPercentDelay AMWestPercentDelay
## 1 Los.Angeles 497 62 694 117 12.474849 16.85879
## 2 Phoenix 221 12 4840 415 5.429864 8.57438
## 3 San.Diego 212 20 383 65 9.433962 16.97128
## 4 San.Francisco 503 102 320 129 20.278330 40.31250
## 5 Seattle 1841 305 201 61 16.567083 30.34826
# Ref: [GRO]
barDataFrame <- data.frame(City=airlineDataT$City, Alaska = airlineDataT$AlaskaPercentDelay, AMWest = airlineDataT$AMWestPercentDelay)
ggplot(gather(barDataFrame , Airline, PercentDelay, -City), aes(x = City, y = PercentDelay, fill = Airline)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label=sprintf("%0.2f", round(PercentDelay, digits = 2))), vjust=1.6, color="black", position = position_dodge(0.9), size=3.5) +
scale_fill_brewer(palette="Paired")Solution: Using the calculations above, I created a mutate function that calculates total percent delay for each city regarldess of airline. I created a chart that shows the percent delays for each city regardless of airline.
Analysis: As in the charts above, the least percent delay of 8.44% was found to be in Phoenix and the worst percent of 28.07% delay was found to be in San Francisco. That’s pretty chilling. More than a quarter of all flights at San Francisco is going to be delayed. Los Angeles has a much larger population than San Francisco and yet it enjoys a lesser percent delay at 15.03%.
airlineDataB <- mutate(trdf, PercentDelay = (as.numeric(ALD) + as.numeric(AMD)) / (as.numeric(ALT) + as.numeric(AMT)) * 100 )
airlineDataB## City ALT ALD AMT AMD PercentDelay
## 1 Los.Angeles 497 62 694 117 15.029387
## 2 Phoenix 221 12 4840 415 8.437068
## 3 San.Diego 212 20 383 65 14.285714
## 4 San.Francisco 503 102 320 129 28.068044
## 5 Seattle 1841 305 201 61 17.923604
# Ref: [GRO]
barDataFrame2 <- data.frame(City=airlineDataB$City, PercentDelay = airlineDataB$PercentDelay)
ggplot(barDataFrame2, aes(x = City, y = PercentDelay)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label=sprintf("%0.2f", round(PercentDelay, digits = 2))), vjust=1.6, color="black", position = position_dodge(0.9), size=3.5) +
scale_fill_brewer(palette="Paired")[DAT] Data Wrangling with dplyr and tidyr Cheat Sheet Retrieved from website: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
[GGP] ggplot2 barplots : Quick start guide - R software and data visualization Retrieved from website: http://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization
[GRO] **** Retrieved from website: https://stackoverflow.com/questions/37667630/grouped-bar-chart-r-using-counts-table
[HTM] HTML Color Codes Retrieved from website: https://htmlcolorcodes.com/