This assignment explores the tasks of data transformation and aggregations within the R environment.
This is done using the provided “Flights Arrival Delay” dataset and the tidyr and dplyr packages.
Load the .csv file
fileLocation <- 'https://raw.githubusercontent.com/ChadRyanBailey/607-Week5-Assignment/master/02%20Airlines%20Arrival%20and%20Delays.csv'
flightsData <- read.table(file = fileLocation, header = TRUE, sep = ",")
head(flightsData)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
This will help guide how the data will need to be restrutured.
The dataset appears to be about two airlines’ flight status counts for five cities. A few possible question would be:
Adjust column names
names(flightsData) <- c("Airline"
,"Flight_Status"
,"Los_Angeles"
,"Phoenix"
,"San_Diego"
,"San_Francisco"
,"Seattle")
Subset to rows with data and reset row numbers
keepRows <-flightsData$Flight_Status != ''
flightsData2 <- flightsData[keepRows,]
rownames(flightsData2) <- NULL
Transform the Airline column so that each row lists its airline name
airline <- flightsData2$Airline
for(i in 1:length(airline))
{
if (airline[i] != '') {tmp <- airline[i]}
if (airline[i] == '') {airline[i] <- tmp}
next
}
flightsData2$Airline <- airline
Get check the headers and first few rows of the altered table
head(flightsData2)
## Airline Flight_Status Los_Angeles Phoenix San_Diego San_Francisco
## 1 Alaska on time 497 221 212 503
## 2 Alaska delayed 62 12 20 102
## 3 AM WEST on time 694 4840 383 320
## 4 AM WEST delayed 117 415 65 129
## Seattle
## 1 1841
## 2 305
## 3 201
## 4 61
Transforrm the table from wide to long to with two new columsn [City] and [n_Flights]
library(tidyr)
flights_long <- gather(flightsData2, "City", "n_Flights", 3:7)
head(flights_long)
## Airline Flight_Status City n_Flights
## 1 Alaska on time Los_Angeles 497
## 2 Alaska delayed Los_Angeles 62
## 3 AM WEST on time Los_Angeles 694
## 4 AM WEST delayed Los_Angeles 117
## 5 Alaska on time Phoenix 221
## 6 Alaska delayed Phoenix 12
Transform the table so that each value of Flight_Status becomes its own column. This will allow for easier comparison of flight statuses.
flights_long$Flight_Status <- gsub(" ", "_", flights_long$Flight_Status)
flights_tidy <- spread(flights_long, Flight_Status, n_Flights)
head(flights_tidy)
## Airline City delayed on_time
## 1 Alaska Los_Angeles 62 497
## 2 Alaska Phoenix 12 221
## 3 Alaska San_Diego 20 212
## 4 Alaska San_Francisco 102 503
## 5 Alaska Seattle 305 1841
## 6 AM WEST Los_Angeles 117 694
Add new columns to help with analysis.
library(dplyr)
flights_tidy <- mutate(flights_tidy
, total = delayed + on_time
, pct_on_time = on_time / total * 100)
head(flights_tidy)
## Airline City delayed on_time total pct_on_time
## 1 Alaska Los_Angeles 62 497 559 88.90877
## 2 Alaska Phoenix 12 221 233 94.84979
## 3 Alaska San_Diego 20 212 232 91.37931
## 4 Alaska San_Francisco 102 503 605 83.14050
## 5 Alaska Seattle 305 1841 2146 85.78751
## 6 AM WEST Los_Angeles 117 694 811 85.57337
Calculate the overall on-time rate for each airline. Do this both as grand total rollup (weighted average) and a straight (unweighted) average of the variable pct_on_time.
airlines_overall <- flights_tidy %>%
group_by(Airline) %>%
summarise(total = sum(total)
, pct_on_time_rollup = sum(on_time) / sum(total) * 100
, pct_on_time_avg = mean(pct_on_time))
airlines_overall
## # A tibble: 2 x 4
## Airline total pct_on_time_rollup pct_on_time_avg
## <fct> <int> <dbl> <dbl>
## 1 Alaska 3775 86.7 88.8
## 2 AM WEST 7225 89.1 82.2
The different methods of aggregation do not provide aligned results.
Visually explore the transformed datasets. Given the large imbalance in the two airlines’ total flights, be sure to consider this variable in the exploration.
library(ggplot2)
overall <- ggplot(airlines_overall, aes(x=pct_on_time_avg, y = pct_on_time_rollup))
overall + geom_point(aes(color = Airline, size = total))
Explore if the imbalance of the number of flights is consistent across cities
by_city <- ggplot(flights_tidy, aes(x=total, y = pct_on_time))
by_city + geom_point(aes(color = Airline, size = total)) + facet_wrap(~City)
It appears Alaska has a higher on_time rate in all five cities included in the dataset. However, AM WEST has significantly more flights and has its largest number of flights at Phoenix, where its on_time rate is highest. This hides AM WEST’s lower on_time rates in other cites.
Show this finding through additional table transformations.
pct_on_time <- flights_tidy %>%
select (Airline, City, pct_on_time) %>%
spread(Airline , pct_on_time)
total <- flights_tidy %>%
select (Airline, City, total) %>%
spread(Airline , total)
airline_rates <- inner_join(total, pct_on_time, by = "City")
names(airline_rates) <- c("City"
,"Alaska_n"
,"AM_WEST_n"
,"Alaska_pct_ontime"
,"AM_WEST_pct_ontime")
airline_rates %>%
select (City, Alaska_n, Alaska_pct_ontime, AM_WEST_n, AM_WEST_pct_ontime)
## City Alaska_n Alaska_pct_ontime AM_WEST_n AM_WEST_pct_ontime
## 1 Los_Angeles 559 88.90877 811 85.57337
## 2 Phoenix 233 94.84979 5255 92.10276
## 3 San_Diego 232 91.37931 448 85.49107
## 4 San_Francisco 605 83.14050 449 71.26949
## 5 Seattle 2146 85.78751 262 76.71756
While AM WEST is by far the larger airline, having nearly double the total number of flights, passengers are more likely to have an on time flight with Alaska regardless of which city is being considered.