Tidying Data with Flights Data Table

Import Flights Data from Github

flights <- read.csv("https://raw.githubusercontent.com/swigodsky/DATA-607/master/alaska_amwest.csv", stringsAsFactors = FALSE)

Rename Columns

colnames(flights) <- c("airline", "status", "LA", "Phoenix", "San Diego", "San Francisco", "Seattle")
flights

##   airline  status  LA Phoenix San Diego San Francisco Seattle
## 1  ALASKA on time 497     221       212           503   1,841
## 2         delayed  62      12        20           102     305
## 3                  NA                NA            NA        
## 4 AM WEST on time 694   4,840       383           320     201
## 5         delayed 117     415        65           129      61

Tidy Data

To tidy the data, I took out the empty row of data. I added the names of the airlines to the airline column for the delayed flights so that when the data is rearranged, the name of the airline accompanies the delayed data. I removed the commas from the numbers and converted the numbers, which were being stored as strings, to being stored as numbers. I then created a new data frame called flights_db. I used the gather function to make separate columns for city and number of flights. I then used the spread function to create separate columns for on time and delayed flights.

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

flights <- flights[-3,]
flights[2,1] <- "ALASKA"
flights[4,1] <- "AM WEST"

flights_db <- flights %>% 
  gather("city", "number", 3:7) %>%
  spread(status, number)
flights_db$`on time` <- gsub (",","", flights_db$`on time`)
flights_db$`on time` <- as.numeric(flights_db$`on time`)
flights_db$delayed <- as.numeric(flights_db$delayed)
flights_db

##    airline          city delayed on time
## 1   ALASKA            LA      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San Diego      20     212
## 4   ALASKA San Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM WEST            LA     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San Diego      65     383
## 9  AM WEST San Francisco     129     320
## 10 AM WEST       Seattle      61     201

Calculate the Proportion of delayed flights arranged by city

I used the mutate function to perform calculations to each row and add those values as columns in the dataframe.

flights_db_delayed <- flights_db %>%
  mutate(proportion_delay= delayed/(`on time`+delayed)) %>%
  mutate(total_flights = `on time` + delayed) %>%
  arrange(city) %>%
  select(-delayed, -`on time`)
flights_db_delayed

##    airline          city proportion_delay total_flights
## 1   ALASKA            LA       0.11091234           559
## 2  AM WEST            LA       0.14426634           811
## 3   ALASKA       Phoenix       0.05150215           233
## 4  AM WEST       Phoenix       0.07897241          5255
## 5   ALASKA     San Diego       0.08620690           232
## 6  AM WEST     San Diego       0.14508929           448
## 7   ALASKA San Francisco       0.16859504           605
## 8  AM WEST San Francisco       0.28730512           449
## 9   ALASKA       Seattle       0.14212488          2146
## 10 AM WEST       Seattle       0.23282443           262

In each city, a greater percentage of AM WEST’s flights were delayed than ALASKA’s flights that were delayed. The discrepancy between the proprortion of delayed flights for ALASKA and AM WEST’s is most signficant in San Diego, San Francisco and Seattle.

Relationship between Number of Delays and Number of Flights

I used ggplot to make a scatter plot and color coded the points according to airline.

library(ggplot2)
flights_db_delayed %>%
  ggplot(aes(x=total_flights, y = proportion_delay))+
  geom_point(aes(color=airline))

For AM WEST there appears to be a relationship bewteen the number of flights into a city and the proportional of flights that were delayed. The greater the number of flights into a city, the lower the proportion of flights that were delayed into the city. For ALASKA, the trend is not as dramatic, but there may be the reverse relationship. For ALASKA, the larger the number of flights, the greater percentage of delayed flights.

Comparing the Number of Delays for each Airline

I used the group_by function to have the calculations be performed according to the different airlines. I then used the summarise function to perform calculations on the entire column that resulted in single values for each calculation.

flights_db %>% 
  group_by(airline) %>%
  summarise(sum = sum(delayed), mean = mean(delayed), median=median(delayed), sd=sd(delayed))

## # A tibble: 2 x 5
##   airline   sum  mean median       sd
##     <chr> <dbl> <dbl>  <dbl>    <dbl>
## 1  ALASKA   501 100.2     62 120.0175
## 2 AM WEST   787 157.4    117 147.1625

The mean and median number of delays to an individual city for ALASKA are signficantly less than for AM WEST. The median is much less than the mean for both airlines. This indicates a distribution that is skewed to the right.

Comparing the Proportion of Delays to individual cities for each Airline

flights_db_delayed %>% 
  group_by(airline) %>%
  summarise(mean = mean(proportion_delay), median=median(proportion_delay), sd=sd(proportion_delay))

## # A tibble: 2 x 4
##   airline      mean    median         sd
##     <chr>     <dbl>     <dbl>      <dbl>
## 1  ALASKA 0.1118683 0.1109123 0.04592624
## 2 AM WEST 0.1776915 0.1450893 0.08212854

The mean and median of the proportion of flights delayed into an individual city is less for ALASKA than for AM WEST. The mean and median proportion of flights delayed into a particular city for ALASKA are similar. This indicates that the distribution for the proportion of flights delayed into individual cities is symmetric.

Comparing the Proportion of Total Delays irrespective of city for each Airline

flights_db %>% 
  group_by(airline) %>%
  summarise(proportion_of_total_flights_delayed = sum(delayed)/sum(delayed+`on time`))

## # A tibble: 2 x 2
##   airline proportion_of_total_flights_delayed
##     <chr>                               <dbl>
## 1  ALASKA                           0.1327152
## 2 AM WEST                           0.1089273

This was a surprising result to me! Even though in each city, ALASKA has a lower percentage of flights delayed, because AM WEST had its lowest percentage of delays into Phoenix, where it had the most flights, the overall likelihood of an AM WEST flight being delayed is lower than the likelihood of an ALASKA flight being delayed.