Overview

This assignment explores the tasks of data transformation and aggregations within the R environment.

This is done using the provided “Flights Arrival Delay” dataset and the tidyr and dplyr packages.

Load and review the data

Load the .csv file

fileLocation <- 'https://raw.githubusercontent.com/ChadRyanBailey/607-Week5-Assignment/master/02%20Airlines%20Arrival%20and%20Delays.csv'

flightsData <- read.table(file = fileLocation, header = TRUE, sep = ",")

head(flightsData)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Form possible questions

This will help guide how the data will need to be restrutured.

The dataset appears to be about two airlines’ flight status counts for five cities. A few possible question would be:

  1. Which airline is larger (i.e., has more total flights)?
  2. Which airline has a better overall rate of “on time” flights?
  3. Does either airline have a better rate of “on time” flights for all cities?

Perform necessary cleanup and transformations

Adjust column names

names(flightsData) <- c("Airline"
                        ,"Flight_Status"
                        ,"Los_Angeles"
                        ,"Phoenix"
                        ,"San_Diego"
                        ,"San_Francisco"
                        ,"Seattle")

Subset to rows with data and reset row numbers

keepRows <-flightsData$Flight_Status != ''

flightsData2 <- flightsData[keepRows,]
rownames(flightsData2) <- NULL

Transform the Airline column so that each row lists its airline name

airline <- flightsData2$Airline
for(i in 1:length(airline))
{
  if (airline[i] != '') {tmp <- airline[i]}
  if (airline[i] == '') {airline[i] <- tmp}
  next
}
flightsData2$Airline <- airline

Get check the headers and first few rows of the altered table

head(flightsData2)
##   Airline Flight_Status Los_Angeles Phoenix San_Diego San_Francisco
## 1  Alaska       on time         497     221       212           503
## 2  Alaska       delayed          62      12        20           102
## 3 AM WEST       on time         694    4840       383           320
## 4 AM WEST       delayed         117     415        65           129
##   Seattle
## 1    1841
## 2     305
## 3     201
## 4      61

Transforrm the table from wide to long to with two new columsn [City] and [n_Flights]

library(tidyr)
flights_long <- gather(flightsData2, "City", "n_Flights", 3:7)
head(flights_long)
##   Airline Flight_Status        City n_Flights
## 1  Alaska       on time Los_Angeles       497
## 2  Alaska       delayed Los_Angeles        62
## 3 AM WEST       on time Los_Angeles       694
## 4 AM WEST       delayed Los_Angeles       117
## 5  Alaska       on time     Phoenix       221
## 6  Alaska       delayed     Phoenix        12

Transform the table so that each value of Flight_Status becomes its own column. This will allow for easier comparison of flight statuses.

flights_long$Flight_Status <- gsub(" ", "_", flights_long$Flight_Status)

flights_tidy <- spread(flights_long, Flight_Status, n_Flights)

head(flights_tidy)
##   Airline          City delayed on_time
## 1  Alaska   Los_Angeles      62     497
## 2  Alaska       Phoenix      12     221
## 3  Alaska     San_Diego      20     212
## 4  Alaska San_Francisco     102     503
## 5  Alaska       Seattle     305    1841
## 6 AM WEST   Los_Angeles     117     694

Add new columns to help with analysis.

library(dplyr)
flights_tidy <- mutate(flights_tidy
                       , total = delayed + on_time
                       , pct_on_time = on_time / total * 100)

head(flights_tidy)
##   Airline          City delayed on_time total pct_on_time
## 1  Alaska   Los_Angeles      62     497   559    88.90877
## 2  Alaska       Phoenix      12     221   233    94.84979
## 3  Alaska     San_Diego      20     212   232    91.37931
## 4  Alaska San_Francisco     102     503   605    83.14050
## 5  Alaska       Seattle     305    1841  2146    85.78751
## 6 AM WEST   Los_Angeles     117     694   811    85.57337

Analyze and compare the data

Calculate the overall on-time rate for each airline. Do this both as grand total rollup (weighted average) and a straight (unweighted) average of the variable pct_on_time.

airlines_overall <- flights_tidy %>% 
  group_by(Airline) %>% 
  summarise(total = sum(total)
            , pct_on_time_rollup = sum(on_time) / sum(total) * 100 
            , pct_on_time_avg = mean(pct_on_time))

airlines_overall
## # A tibble: 2 x 4
##   Airline total pct_on_time_rollup pct_on_time_avg
##   <fct>   <int>              <dbl>           <dbl>
## 1 Alaska   3775               86.7            88.8
## 2 AM WEST  7225               89.1            82.2

The different methods of aggregation do not provide aligned results.

Visually explore the transformed datasets. Given the large imbalance in the two airlines’ total flights, be sure to consider this variable in the exploration.

library(ggplot2)

overall <- ggplot(airlines_overall, aes(x=pct_on_time_avg, y = pct_on_time_rollup))
overall + geom_point(aes(color = Airline, size = total))

Explore if the imbalance of the number of flights is consistent across cities

by_city <- ggplot(flights_tidy, aes(x=total, y = pct_on_time)) 
by_city + geom_point(aes(color = Airline, size = total)) + facet_wrap(~City)

It appears Alaska has a higher on_time rate in all five cities included in the dataset. However, AM WEST has significantly more flights and has its largest number of flights at Phoenix, where its on_time rate is highest. This hides AM WEST’s lower on_time rates in other cites.

Show this finding through additional table transformations.

pct_on_time <- flights_tidy %>% 
                select (Airline, City, pct_on_time) %>%
                spread(Airline , pct_on_time)

total <- flights_tidy %>% 
           select (Airline, City, total) %>%
           spread(Airline , total)

airline_rates <- inner_join(total, pct_on_time, by = "City")

names(airline_rates) <- c("City"
                          ,"Alaska_n"
                          ,"AM_WEST_n"
                          ,"Alaska_pct_ontime"
                          ,"AM_WEST_pct_ontime")

airline_rates %>%
  select (City, Alaska_n, Alaska_pct_ontime, AM_WEST_n, AM_WEST_pct_ontime)
##            City Alaska_n Alaska_pct_ontime AM_WEST_n AM_WEST_pct_ontime
## 1   Los_Angeles      559          88.90877       811           85.57337
## 2       Phoenix      233          94.84979      5255           92.10276
## 3     San_Diego      232          91.37931       448           85.49107
## 4 San_Francisco      605          83.14050       449           71.26949
## 5       Seattle     2146          85.78751       262           76.71756

Conclusion

While AM WEST is by far the larger airline, having nearly double the total number of flights, passengers are more likely to have an on time flight with Alaska regardless of which city is being considered.