The purpose of the project is the following:
1. Read in flight data
2. Tidy the data
3. Compare flight delays across the airlines: AM WEST and ALASKA
As stated by Hadley Wickham on https://r4ds.had.co.nz/tidy-data.html, we’re going to make this data tidy by enforcing the following rules:
There are three interrelated rules which make a dataset tidy:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
We’re retrieivng the data in the form of a csv from GitHub. The data is in an untidy format with several empty lines, columns missing names, rows with multiple observations, etc.
flights <- read_csv("https://raw.githubusercontent.com/devinteran/Data607-Assignment2/master/Flights.csv")
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]
## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character(),
## `Los Angeles` = col_double(),
## Phoenix = col_double(),
## `San Diego` = col_double(),
## `San Francisco` = col_double(),
## Seattle = col_double()
## )
kable(flights) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| X1 | X2 | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| NA | NA | NA | NA | NA | NA | NA |
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| NA | delayed | 62 | 12 | 20 | 102 | 305 |
| NA | NA | NA | NA | NA | NA | NA |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| NA | delayed | 117 | 415 | 65 | 129 | 61 |
First I will remove empty rows of NA and add airline names where they are implied.
#Remove empty row 2 and 5
flights <- flights[-c(1,4),]
#Add data values which are currently missing but are implied
flights[2,1] <- "ALASKA"
flights[4,1] <- "AM WEST"
#Rename columns
colnames(flights) <- c("Airline","Status","Los Angeles","Phoenix","San Diego","San Francisco","Seattle")
| Airline | Status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
flights <- flights %>% pivot_longer(c("Los Angeles","Phoenix","San Diego","San Francisco","Seattle"),names_to = "City",values_to="Count")
| Airline | Status | City | Count |
|---|---|---|---|
| ALASKA | on time | Los Angeles | 497 |
| ALASKA | delayed | Los Angeles | 62 |
| ALASKA | on time | Phoenix | 221 |
| ALASKA | delayed | Phoenix | 12 |
| ALASKA | on time | San Diego | 212 |
| ALASKA | delayed | San Diego | 20 |
To do this will be adding two columns using the pivot_wider() function.
flights <- flights %>% pivot_wider(names_from="Status",values_from="Count")
Here is our data in a clean format. Now we’re ready for analysis!
flights_per_city_per_airline <- flights %>% mutate(`Total Flights` = `on time` + delayed)
kable(flights_per_city_per_airline) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Airline | City | on time | delayed | Total Flights |
|---|---|---|---|---|
| ALASKA | Los Angeles | 497 | 62 | 559 |
| ALASKA | Phoenix | 221 | 12 | 233 |
| ALASKA | San Diego | 212 | 20 | 232 |
| ALASKA | San Francisco | 503 | 102 | 605 |
| ALASKA | Seattle | 1841 | 305 | 2146 |
| AM WEST | Los Angeles | 694 | 117 | 811 |
| AM WEST | Phoenix | 4840 | 415 | 5255 |
| AM WEST | San Diego | 383 | 65 | 448 |
| AM WEST | San Francisco | 320 | 129 | 449 |
| AM WEST | Seattle | 201 | 61 | 262 |
The most popular route for AM West is Phoenix and the most popular route for ALASKA is Seattle. In general, AM WEST looks to have more flights in most cities compared to ALASKA.
ggplot(flights_per_city_per_airline,aes(fill=Airline,x=reorder(City,`Total Flights`),y=`Total Flights`)) +
geom_bar(position="dodge", stat="identity") +
coord_flip()
Intersting findings include:
* AM WEST has a lot of routes and it also has more delays compared to ALASKA.
* San Francisco has the most delays for both airlines.
* Phoenix has the most routes for AM WEST but the lowest % delays
flights_perct <- flights_per_city_per_airline %>%
mutate(`Percent Flights Delayed` = delayed/`Total Flights`,
`Percent Flights On Time` = `on time` / `Total Flights`)
ggplot(flights_perct,aes(fill=Airline,x=reorder(City,-`Percent Flights Delayed`),y=`Percent Flights Delayed`)) +
geom_bar(position="dodge", stat="identity") +
coord_flip()
ggplot(flights_perct,aes(fill=City,x=`Total Flights`,label=City,y=`Percent Flights Delayed`,shape=City,color=Airline,size=4)) +
geom_point() +
guides(size=FALSE)
It is clear that although the airline, AM WEST, has more flights, they are delayed more often than the airline ALASKA. The most common route for AM WEST to Phoenix has the lowest % delays. With more data from more airlines it would be interesting to see if the correlation of increased number of flights and fewer delays holds. I would recommend flying ALASKA if possible to avoid delays. It would be interesting to be able to bring cost into this analysis.