| X | X.1 | LosAngeles | Phoenix | SanDiego | SanFrancisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| delayed | 62 | 12 | 20 | 102 | 305 | |
| NA | ||||||
| AMWEST | on time | 694 | 4840 | 383 | 320 | 201 |
| delayed | 117 | 415 | 65 | 129 | 61 |
The table above describes arrival delays for two airlines across five destinations.
The ask is to:
Create a .CSV file that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
We load all required libraries:
library(httr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
We download the CSV from my GitHub account replacing all empty spaces with NAs to make the clean up easier later on:
urlRemote <- "https://raw.githubusercontent.com/"
pathGithub <- "chilleundso/DATA607/master/Assignment5/"
fileName <- "Airline.csv"
air <- read.csv(paste0(urlRemote, pathGithub, fileName),header = TRUE, na.strings=c(""," ","NA"))
kable(air)
| X | X.1 | LosAngeles | Phoenix | SanDiego | SanFrancisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| NA | delayed | 62 | 12 | 20 | 102 | 305 |
| NA | NA | NA | NA | NA | NA | NA |
| AMWEST | on time | 694 | 4840 | 383 | 320 | 201 |
| NA | delayed | 117 | 415 | 65 | 129 | 61 |
As we can see the header for the first two columns are not very helpful so we change them to something more explanatory, secondly we remove all rows that are entirely N/As and lastly we fill all missing airline names with the name that is directly above:
#rename the first 2 column headers to "Airline" and "Status"
air <- dplyr::rename(air, Airline = X, Status = X.1)
#removing all rows that are entirely N/As (row 3):
air <- dplyr::filter_all(air,any_vars(!is.na(.)))
#we use the fill function to fill any N/A with the value directly above it in the data frame to add airline names for the second and 4th row
air <- tidyr::fill(air,Airline)
kable(air)
| Airline | Status | LosAngeles | Phoenix | SanDiego | SanFrancisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| AMWEST | on time | 694 | 4840 | 383 | 320 | 201 |
| AMWEST | delayed | 117 | 415 | 65 | 129 | 61 |
The above table now is entirely filled, however it is not in a format that is easy to be used for further investigations:
Using gather we will generate a row specific for every tripple of airline, status and departure city (in the table below we are only showing the first 6 entries to make the report more readable)
air <- tidyr::gather(air, "Destination", "Amount", -Airline, -Status)
kable(head(air))
| Airline | Status | Destination | Amount |
|---|---|---|---|
| ALASKA | on time | LosAngeles | 497 |
| ALASKA | delayed | LosAngeles | 62 |
| AMWEST | on time | LosAngeles | 694 |
| AMWEST | delayed | LosAngeles | 117 |
| ALASKA | on time | Phoenix | 221 |
| ALASKA | delayed | Phoenix | 12 |
For our final reformatting we will use spread to move the amount of delays and on-time flights per destination and airline into the same row
air <- tidyr::spread(air,Status, Amount)
kable(air)
| Airline | Destination | delayed | on time |
|---|---|---|---|
| ALASKA | LosAngeles | 62 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | SanDiego | 20 | 212 |
| ALASKA | SanFrancisco | 102 | 503 |
| ALASKA | Seattle | 305 | 1841 |
| AMWEST | LosAngeles | 117 | 694 |
| AMWEST | Phoenix | 415 | 4840 |
| AMWEST | SanDiego | 65 | 383 |
| AMWEST | SanFrancisco | 129 | 320 |
| AMWEST | Seattle | 61 | 201 |
To make the data more easily comparable we “normalize” it by adding the delay_rate which gives the percentage of delays per total flight for each pair of destination and airline:
air$delay_rate <- air$delayed / (air$delayed+air$'on time')
kable(air)
| Airline | Destination | delayed | on time | delay_rate |
|---|---|---|---|---|
| ALASKA | LosAngeles | 62 | 497 | 0.1109123 |
| ALASKA | Phoenix | 12 | 221 | 0.0515021 |
| ALASKA | SanDiego | 20 | 212 | 0.0862069 |
| ALASKA | SanFrancisco | 102 | 503 | 0.1685950 |
| ALASKA | Seattle | 305 | 1841 | 0.1421249 |
| AMWEST | LosAngeles | 117 | 694 | 0.1442663 |
| AMWEST | Phoenix | 415 | 4840 | 0.0789724 |
| AMWEST | SanDiego | 65 | 383 | 0.1450893 |
| AMWEST | SanFrancisco | 129 | 320 | 0.2873051 |
| AMWEST | Seattle | 61 | 201 | 0.2328244 |
To investigate the behavior of delay rates per destination and airline we plot the delay rate for every city with both airlines next to each other in a grouped barchart (source for the code: https://www.r-graph-gallery.com/48-grouped-barplot-with-ggplot2.html)
ggplot(air, aes(fill=Airline, y=delay_rate, x=reorder(Destination, delay_rate))) +
geom_bar(position="dodge", stat="identity") +
xlab('Destination') +
ylab('Delay Rate')
What we can see is that for every destination AMWEST has a higher delay rate than Alaska and the order of delay (from least to most) per destination is: Phoenix, San Diego, Los Angeles, Seattle and lastly San Francisco.
GitHub: https://github.com/chilleundso/DATA607/blob/master/Assignment5/Data607_Assignment5.Rmd