We are presented with a small flights dataset in a wide format and tasked with wrangling it into a tidy format.
The first step is clearly to load the data into R and get a feel for the issues the data has.
url <- 'https://raw.githubusercontent.com/sbellows1/607/master/Week5/flights%20sample.csv?token=ALKCMBPVKUXMZX4S3PIRRUS6LB5Y4'
filename <- 'flights sample.csv'
if (!file.exists(filename)){
download.file(url, filename)
}
df <- read.csv('flights sample.csv', na.strings = '')
head(df, 10)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA <NA> NA NA <NA>
## 4 AM WEST on time 694 4,840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
We see that one variable is being spread across multiple columns (Airport Location) and as such we must gather that into a single column to abide by tidy data principles. Before doing this however, we should remove the 3rd row that is simply whitespace and provides no information
df <- df[-3,]
head(df)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 <NA> delayed 62 12 20 102 305
## 4 AM WEST on time 694 4,840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Now with the useless row removed we can begin to tidy the data
The clear first step necessary is to gather the location columns into a single variable. We will accomplish this with the pivot_longer function of the tidyr library. My first implementation of this failed because the columns are of different types as we can see from the tibble. Phoenix and Seattle are factor variables whereas the other Airports are integer variables. The reason for this is the commas in the large numbers for these airports, so we must remove these and change their types before proceeding.
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##This is a scalable implementation that will iterate over all columns and change to integers if we were
##to receive more data
for (i in 3:ncol(df)){
if (class(df[,i]) == 'factor'){
df[,i] <- as.integer(gsub(',','',df[,i]))
}
}
head(df)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Again at this point, I went to gather the location columns but realized that another issue is that the first column has blank cells that need to be filled in.
df <- df %>% fill(X, .direction = 'down')
head(df)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
Now that we have fixed all our other issues, we can gather the airport variable using pivot_longer. This will combine the 5 airports into one variable.
df <- df %>% pivot_longer(c('Los.Angeles', 'Phoenix', 'San.Diego', 'San.Francisco', 'Seattle'), names_to = 'airport', values_to = 'count')
head(df, 20)
## # A tibble: 20 x 4
## X X.1 airport count
## <fct> <fct> <chr> <int>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San.Diego 20
## 9 ALASKA delayed San.Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los.Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San.Diego 383
## 14 AM WEST on time San.Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los.Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San.Diego 65
## 19 AM WEST delayed San.Francisco 129
## 20 AM WEST delayed Seattle 61
Last but not least, we must provide column names to the two columns missing names and fix up the airport variable slightly.
df <- df %>% rename('airline' = 'X', 'delayed' = 'X.1')
df$airport <- gsub('\\.', ' ', df$airport)
head(df)
## # A tibble: 6 x 4
## airline delayed airport count
## <fct> <fct> <chr> <int>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
Now with a tidy dataset, we can begin analysis. First I would like to create a summary of the proportion of delayed flights for each airline and airport.
airline_delayed <- df %>% filter(delayed == 'delayed') %>% group_by(airline) %>% summarize(n = sum(count))
airline_total <- df %>% group_by(airline) %>% summarize(n = sum(count))
airport_delayed <- df %>% filter(delayed == 'delayed') %>% group_by(airport) %>% summarize(n = sum(count))
airport_total <- df %>% group_by(airport) %>% summarize(n = sum(count))
airline <- airline_delayed %>% inner_join(airline_total, by = 'airline')
airport <- airport_delayed %>% inner_join(airport_total, by = 'airport')
airline <- airline %>% mutate(delayed_prop = n.x/n.y)
airport <- airport %>% mutate(delayed_prop = n.x/n.y)
print(airline)
## # A tibble: 2 x 4
## airline n.x n.y delayed_prop
## <fct> <int> <int> <dbl>
## 1 ALASKA 501 3775 0.133
## 2 AM WEST 787 7225 0.109
print(airport)
## # A tibble: 5 x 4
## airport n.x n.y delayed_prop
## <chr> <int> <int> <dbl>
## 1 Los Angeles 179 1370 0.131
## 2 Phoenix 427 5488 0.0778
## 3 San Diego 85 680 0.125
## 4 San Francisco 231 1054 0.219
## 5 Seattle 366 2408 0.152
Here we can see that on the whole, More flights from Alaska airlines are delayed than flights from American Airlines. We also see that San Francisco has the highest proportion of delayed flights and Phoenix the lowest proportion.
airline %>% ggplot(aes(airline, delayed_prop)) + geom_bar(stat = 'identity', fill = 'palegreen1') + xlab('Airline') + ylab('Proportion of Flights Delayed')
airport %>% ggplot(aes(reorder(airport, delayed_prop), delayed_prop)) +
geom_bar(stat = 'identity', fill = 'palegreen1') + xlab('Airport') + ylab('Proportion of Flights Delayed')
An interesting question came to mind after seeing these visualizations. Are the different airlines represented a higher or lower amount at each airport? That could potentially account for some of the differences between airline and airports. To answer this I found the number of flights from each airline at each airport.
counts <- df %>% group_by(airline, airport) %>% summarize(n = sum(count))
counts %>% ggplot(aes(reorder(airport, n), y = n, fill = airline)) + geom_bar(stat = 'identity', position = 'dodge') + xlab('Airport') + ylab('Number of Flights')
Interesting! We see that American Airlines flies the vast majority of the flights out of Phoenix, the airport with by far the least delays. Alaska Airlines flies more flights out of San Francisco and Seattle, airports that tend to have high rates of delay.
However from our analysis we cannot tell whether it is the airlines causing airports to have higher average delay proportions or the airports causing airlines to have higher delay proportions. In order to discover this, we would need to to acquire data that has causal standing.