This assignment is focused on demonstrating a knowledge and ability to work with untidy data by doing the necessary transformations to the data to convert it into Tidy data that is then more functionally capable of being used for data analysis purposes.
The main steps in this assignment include: (1) Create a .CSV file (or optionally, a MySQL database!) that includes the untidy data provided, using a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from the .CSV file into R,and use tidy rand dplyr as needed to tidy and transform the data. (3) Perform analysis to compare the arrival delays for the two airlines.
This step imports the libraries that will be used to clean and analyze the data
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(scales) #Used for showing percents for y-axis labels below
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
The data from the homework file was manually duplicated in a .CSV file and then we import that sample data in the code below
untidy_data = read.csv('../input/assignment5_sample_data.csv')
untidy_data
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA
## 4 AM WEST on time 694 4,840 383 320 201
## 5 delayed 117 415 65 129 61
Next we will perform the following steps on the untidy dataset to prepare it for conversion to a tidy dataset.
colnames(untidy_data)[1] = "Airline"
colnames(untidy_data)[2] = "Arrival Status"
destination_names <- colnames(untidy_data[3:length(untidy_data)])
destination_names
## [1] "Los.Angeles" "Phoenix" "San.Diego" "San.Francisco"
## [5] "Seattle"
#Remove row of blanks and NA values
untidy_data <- untidy_data %>% na.omit()
print(untidy_data)
## Airline Arrival Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4,840 383 320 201
## 5 delayed 117 415 65 129 61
#Remove the comma from the large number strings, and convert all numeric strings into numeric values
untidy_data[destination_names] <- untidy_data[destination_names] %>% apply(MARGIN = 2, FUN = function(x) as.numeric(str_remove(x,",")))
print(untidy_data)
## Airline Arrival Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
#Change data from wide to long, by taking destination columns names and putting them into a "Destination" field, and the values for these columns into a "Num Flights" field
long_data <- untidy_data %>% pivot_longer(cols=destination_names,names_to = "Destination", values_to = "Num Flights")
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(destination_names)
##
## # Now:
## data %>% select(all_of(destination_names))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
print(long_data)
## # A tibble: 20 × 4
## Airline `Arrival Status` Destination `Num Flights`
## <chr> <chr> <chr> <dbl>
## 1 "ALASKA" on time Los.Angeles 497
## 2 "ALASKA" on time Phoenix 221
## 3 "ALASKA" on time San.Diego 212
## 4 "ALASKA" on time San.Francisco 503
## 5 "ALASKA" on time Seattle 1841
## 6 "" delayed Los.Angeles 62
## 7 "" delayed Phoenix 12
## 8 "" delayed San.Diego 20
## 9 "" delayed San.Francisco 102
## 10 "" delayed Seattle 305
## 11 "AM WEST" on time Los.Angeles 694
## 12 "AM WEST" on time Phoenix 4840
## 13 "AM WEST" on time San.Diego 383
## 14 "AM WEST" on time San.Francisco 320
## 15 "AM WEST" on time Seattle 201
## 16 "" delayed Los.Angeles 117
## 17 "" delayed Phoenix 415
## 18 "" delayed San.Diego 65
## 19 "" delayed San.Francisco 129
## 20 "" delayed Seattle 61
#Change blanks in Airlines field to an NA value
long_data <- long_data %>% mutate(Airline = ifelse(Airline == "", NA, Airline))
print(long_data)
## # A tibble: 20 × 4
## Airline `Arrival Status` Destination `Num Flights`
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 <NA> delayed Los.Angeles 62
## 7 <NA> delayed Phoenix 12
## 8 <NA> delayed San.Diego 20
## 9 <NA> delayed San.Francisco 102
## 10 <NA> delayed Seattle 305
## 11 AM WEST on time Los.Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San.Diego 383
## 14 AM WEST on time San.Francisco 320
## 15 AM WEST on time Seattle 201
## 16 <NA> delayed Los.Angeles 117
## 17 <NA> delayed Phoenix 415
## 18 <NA> delayed San.Diego 65
## 19 <NA> delayed San.Francisco 129
## 20 <NA> delayed Seattle 61
#Fill the missing Airlines values with the previous Airline name
long_data <- long_data %>% fill(Airline)
print(long_data)
## # A tibble: 20 × 4
## Airline `Arrival Status` Destination `Num Flights`
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San.Diego 20
## 9 ALASKA delayed San.Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los.Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San.Diego 383
## 14 AM WEST on time San.Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los.Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San.Diego 65
## 19 AM WEST delayed San.Francisco 129
## 20 AM WEST delayed Seattle 61
#Use janitor::clean_names function to standardize column names
long_data <- long_data %>% janitor::clean_names()
print(long_data)
## # A tibble: 20 × 4
## airline arrival_status destination num_flights
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San.Diego 20
## 9 ALASKA delayed San.Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los.Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San.Diego 383
## 14 AM WEST on time San.Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los.Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San.Diego 65
## 19 AM WEST delayed San.Francisco 129
## 20 AM WEST delayed Seattle 61
#Spread the arrival status field
tidy_data <- long_data %>% spread(key=arrival_status, value=num_flights)
print(tidy_data)
## # A tibble: 10 × 4
## airline destination delayed `on time`
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
#Clean column names
tidy_data <- tidy_data %>% janitor::clean_names()
print(tidy_data)
## # A tibble: 10 × 4
## airline destination delayed on_time
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
#Remove the "." from the destination city names
tidy_data <- tidy_data %>% mutate(destination = str_replace(destination, "\\.", " "))
print(tidy_data)
## # A tibble: 10 × 4
## airline destination delayed on_time
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
Now that we have the data structured in a tidy data format, we are just about ready to conduct various analyses on the data. To assist in the analysis process, we first adjust our tidy data to include the following fields:
Additionally, to account for differences in the number of flights that a particular airline may have to a specific destination, we also created a separate data table that is summarized by each airline, and has the following additional fields:
flight_df <- tidy_data
flight_df <- flight_df %>% mutate(num_flights = delayed+on_time,
pct_delayed = delayed/num_flights,
pct_ontime = on_time/num_flights)
print(flight_df)
## # A tibble: 10 × 7
## airline destination delayed on_time num_flights pct_delayed pct_ontime
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA Los Angeles 62 497 559 0.111 0.889
## 2 ALASKA Phoenix 12 221 233 0.0515 0.948
## 3 ALASKA San Diego 20 212 232 0.0862 0.914
## 4 ALASKA San Francisco 102 503 605 0.169 0.831
## 5 ALASKA Seattle 305 1841 2146 0.142 0.858
## 6 AM WEST Los Angeles 117 694 811 0.144 0.856
## 7 AM WEST Phoenix 415 4840 5255 0.0790 0.921
## 8 AM WEST San Diego 65 383 448 0.145 0.855
## 9 AM WEST San Francisco 129 320 449 0.287 0.713
## 10 AM WEST Seattle 61 201 262 0.233 0.767
airline_summary <- flight_df %>% group_by(airline) %>% summarize(total_delayed = sum(delayed),
total_ontime = sum(on_time),
total_flights = sum(num_flights),
pct_delayed = (total_delayed/total_flights),
pct_ontime = (total_ontime/total_flights))
print(airline_summary)
## # A tibble: 2 × 6
## airline total_delayed total_ontime total_flights pct_delayed pct_ontime
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA 501 3274 3775 0.133 0.867
## 2 AM WEST 787 6438 7225 0.109 0.891
#Average difference in pct of delayed flights by destination between the two airlines
mean(flight_df[flight_df$airline == 'AM WEST',]$pct_delayed - flight_df[flight_df$airline == 'ALASKA',]$pct_delayed)
## [1] 0.06582325
As a final step in our analysis process, we will create visualizations to help us compare the arrival delays for the two airlines
ggplot(data=airline_summary, aes(x=airline, y=total_delayed, fill=airline)) +
geom_bar(stat='identity') +
labs(title = "\n Number of Delayed Flights by Airline", y="Total Number of Delayed Flights", fill="Airline") +
theme(axis.title.x = element_blank()) +
geom_text(aes(label=total_delayed), vjust = 1.5,position = position_dodge(.9), size = 3)
ggplot(data = airline_summary, aes(x=airline, y=pct_delayed, fill=airline)) +
geom_bar(stat='identity') +
labs(title = "\n Pct of Flights Delayed by Airline", y="\n Pct of Flights Delayed", fill="Airline") +
scale_y_continuous(labels=percent) +
theme(axis.title.x = element_blank()) +
geom_text(aes(label=paste0(format(pct_delayed*100, digits=2),"%")), vjust = 1.5,position = position_dodge(.9), size = 3)
ggplot(data = flight_df, aes(x=destination,y=pct_delayed, fill=airline)) +
geom_bar(stat='identity', position = position_dodge()) +
geom_text(aes(label=paste0(format(pct_delayed*100, digits=2),"%")), vjust = 1.5,position = position_dodge(.9), size = 3) +
labs(x="\nDestination City", y="\nPct of Flights Delayed", title="\n Percent of Flights Delayed by Airline and Destination", fill="Airlines") +
theme(axis.title.x = element_blank()) +
scale_y_continuous(labels=percent)
Based on the analysis, we find the Am West Airlines has 286 - or 57% - more delayed flights than Alaska Airlines. However, AmWest had 97% more total flights than Alaska. To better understand the propensity for each airline to have delayed flights, we instead look at the percent of their total flights that are delayed. When we view the data in this way, we find that approximately 13.3% of flights are delayed at Alaska Airlines, vs 10.9% of flights at AmWest. Finally, when we further distill the data to view the percent of flights delayed for each airline based on their destination, we find that Alaska Airlines outperforms AmWest across each of the destination cities. Across each of the destination cities, AmWest had an average of 6.6% more delayed flights per destination compared to Alaska Airlines