flights <- read.csv("https://raw.githubusercontent.com/swigodsky/DATA-607/master/alaska_amwest.csv", stringsAsFactors = FALSE)
colnames(flights) <- c("airline", "status", "LA", "Phoenix", "San Diego", "San Francisco", "Seattle")
flights
## airline status LA Phoenix San Diego San Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA
## 4 AM WEST on time 694 4,840 383 320 201
## 5 delayed 117 415 65 129 61
To tidy the data, I took out the empty row of data. I added the names of the airlines to the airline column for the delayed flights so that when the data is rearranged, the name of the airline accompanies the delayed data. I removed the commas from the numbers and converted the numbers, which were being stored as strings, to being stored as numbers. I then created a new data frame called flights_db. I used the gather function to make separate columns for city and number of flights. I then used the spread function to create separate columns for on time and delayed flights.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
flights <- flights[-3,]
flights[2,1] <- "ALASKA"
flights[4,1] <- "AM WEST"
flights_db <- flights %>%
gather("city", "number", 3:7) %>%
spread(status, number)
flights_db$`on time` <- gsub (",","", flights_db$`on time`)
flights_db$`on time` <- as.numeric(flights_db$`on time`)
flights_db$delayed <- as.numeric(flights_db$delayed)
flights_db
## airline city delayed on time
## 1 ALASKA LA 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST LA 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
I used the mutate function to perform calculations to each row and add those values as columns in the dataframe.
flights_db_delayed <- flights_db %>%
mutate(proportion_delay= delayed/(`on time`+delayed)) %>%
mutate(total_flights = `on time` + delayed) %>%
arrange(city) %>%
select(-delayed, -`on time`)
flights_db_delayed
## airline city proportion_delay total_flights
## 1 ALASKA LA 0.11091234 559
## 2 AM WEST LA 0.14426634 811
## 3 ALASKA Phoenix 0.05150215 233
## 4 AM WEST Phoenix 0.07897241 5255
## 5 ALASKA San Diego 0.08620690 232
## 6 AM WEST San Diego 0.14508929 448
## 7 ALASKA San Francisco 0.16859504 605
## 8 AM WEST San Francisco 0.28730512 449
## 9 ALASKA Seattle 0.14212488 2146
## 10 AM WEST Seattle 0.23282443 262
In each city, a greater percentage of AM WEST’s flights were delayed than ALASKA’s flights that were delayed. The discrepancy between the proprortion of delayed flights for ALASKA and AM WEST’s is most signficant in San Diego, San Francisco and Seattle.
I used ggplot to make a scatter plot and color coded the points according to airline.
library(ggplot2)
flights_db_delayed %>%
ggplot(aes(x=total_flights, y = proportion_delay))+
geom_point(aes(color=airline))
For AM WEST there appears to be a relationship bewteen the number of flights into a city and the proportional of flights that were delayed. The greater the number of flights into a city, the lower the proportion of flights that were delayed into the city. For ALASKA, the trend is not as dramatic, but there may be the reverse relationship. For ALASKA, the larger the number of flights, the greater percentage of delayed flights.
I used the group_by function to have the calculations be performed according to the different airlines. I then used the summarise function to perform calculations on the entire column that resulted in single values for each calculation.
flights_db %>%
group_by(airline) %>%
summarise(sum = sum(delayed), mean = mean(delayed), median=median(delayed), sd=sd(delayed))
## # A tibble: 2 x 5
## airline sum mean median sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA 501 100.2 62 120.0175
## 2 AM WEST 787 157.4 117 147.1625
The mean and median number of delays to an individual city for ALASKA are signficantly less than for AM WEST. The median is much less than the mean for both airlines. This indicates a distribution that is skewed to the right.
flights_db_delayed %>%
group_by(airline) %>%
summarise(mean = mean(proportion_delay), median=median(proportion_delay), sd=sd(proportion_delay))
## # A tibble: 2 x 4
## airline mean median sd
## <chr> <dbl> <dbl> <dbl>
## 1 ALASKA 0.1118683 0.1109123 0.04592624
## 2 AM WEST 0.1776915 0.1450893 0.08212854
The mean and median of the proportion of flights delayed into an individual city is less for ALASKA than for AM WEST. The mean and median proportion of flights delayed into a particular city for ALASKA are similar. This indicates that the distribution for the proportion of flights delayed into individual cities is symmetric.
flights_db %>%
group_by(airline) %>%
summarise(proportion_of_total_flights_delayed = sum(delayed)/sum(delayed+`on time`))
## # A tibble: 2 x 2
## airline proportion_of_total_flights_delayed
## <chr> <dbl>
## 1 ALASKA 0.1327152
## 2 AM WEST 0.1089273
This was a surprising result to me! Even though in each city, ALASKA has a lower percentage of flights delayed, because AM WEST had its lowest percentage of delays into Phoenix, where it had the most flights, the overall likelihood of an AM WEST flight being delayed is lower than the likelihood of an ALASKA flight being delayed.