Assignment - Tidying and Transforming Data

This assignment is focused on demonstrating a knowledge and ability to work with untidy data by doing the necessary transformations to the data to convert it into Tidy data that is then more functionally capable of being used for data analysis purposes.

The main steps in this assignment include: (1) Create a .CSV file (or optionally, a MySQL database!) that includes the untidy data provided, using a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from the .CSV file into R,and use tidy rand dplyr as needed to tidy and transform the data. (3) Perform analysis to compare the arrival delays for the two airlines.

Setup

This step imports the libraries that will be used to clean and analyze the data

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.4     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)
library(scales) #Used for showing percents for y-axis labels below
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

Import Data

The data from the homework file was manually duplicated in a .CSV file and then we import that sample data in the code below

untidy_data = read.csv('../input/assignment5_sample_data.csv')

untidy_data
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2         delayed          62      12        20           102     305
## 3                          NA                NA            NA        
## 4 AM WEST on time         694   4,840       383           320     201
## 5         delayed         117     415        65           129      61

Clean and Tidy Data

Next we will perform the following steps on the untidy dataset to prepare it for conversion to a tidy dataset.

  1. Name the first two columns
  2. Collect the names of the destination cities for future use when shifting our data from wide to long
  3. Remove any rows with NAs (in this case this applies to Row 3)
  4. Convert all numerical strings to be of numeric type by removing large numbers that have a comma
  5. Use the destination names to convert the data from wide to long
  6. Fill in the airline names for newly created rows that have a blank airline name
  7. Spread out the on time and delayed values so that each record is for one airline and one destination city
  8. Standardize the column names using the janitor library’s clean_names function
  9. Revmoe the “.” in the names for the destination cities
colnames(untidy_data)[1] = "Airline"
colnames(untidy_data)[2] = "Arrival Status"

destination_names <- colnames(untidy_data[3:length(untidy_data)])
destination_names
## [1] "Los.Angeles"   "Phoenix"       "San.Diego"     "San.Francisco"
## [5] "Seattle"
#Remove row of blanks and NA values
untidy_data <- untidy_data %>% na.omit()

print(untidy_data)
##   Airline Arrival Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA        on time         497     221       212           503   1,841
## 2                delayed          62      12        20           102     305
## 4 AM WEST        on time         694   4,840       383           320     201
## 5                delayed         117     415        65           129      61
#Remove the comma from the large number strings, and convert all numeric strings into numeric values
untidy_data[destination_names] <- untidy_data[destination_names] %>% apply(MARGIN = 2, FUN = function(x) as.numeric(str_remove(x,",")))

print(untidy_data)
##   Airline Arrival Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA        on time         497     221       212           503    1841
## 2                delayed          62      12        20           102     305
## 4 AM WEST        on time         694    4840       383           320     201
## 5                delayed         117     415        65           129      61
#Change data from wide to long, by taking destination columns names and putting them into a "Destination" field, and the values for these columns into a "Num Flights" field
long_data <- untidy_data %>% pivot_longer(cols=destination_names,names_to = "Destination", values_to = "Num Flights")
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(destination_names)
## 
##   # Now:
##   data %>% select(all_of(destination_names))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
print(long_data)
## # A tibble: 20 × 4
##    Airline   `Arrival Status` Destination   `Num Flights`
##    <chr>     <chr>            <chr>                 <dbl>
##  1 "ALASKA"  on time          Los.Angeles             497
##  2 "ALASKA"  on time          Phoenix                 221
##  3 "ALASKA"  on time          San.Diego               212
##  4 "ALASKA"  on time          San.Francisco           503
##  5 "ALASKA"  on time          Seattle                1841
##  6 ""        delayed          Los.Angeles              62
##  7 ""        delayed          Phoenix                  12
##  8 ""        delayed          San.Diego                20
##  9 ""        delayed          San.Francisco           102
## 10 ""        delayed          Seattle                 305
## 11 "AM WEST" on time          Los.Angeles             694
## 12 "AM WEST" on time          Phoenix                4840
## 13 "AM WEST" on time          San.Diego               383
## 14 "AM WEST" on time          San.Francisco           320
## 15 "AM WEST" on time          Seattle                 201
## 16 ""        delayed          Los.Angeles             117
## 17 ""        delayed          Phoenix                 415
## 18 ""        delayed          San.Diego                65
## 19 ""        delayed          San.Francisco           129
## 20 ""        delayed          Seattle                  61
#Change blanks in Airlines field to an NA value
long_data <- long_data %>% mutate(Airline = ifelse(Airline == "", NA, Airline))

print(long_data)
## # A tibble: 20 × 4
##    Airline `Arrival Status` Destination   `Num Flights`
##    <chr>   <chr>            <chr>                 <dbl>
##  1 ALASKA  on time          Los.Angeles             497
##  2 ALASKA  on time          Phoenix                 221
##  3 ALASKA  on time          San.Diego               212
##  4 ALASKA  on time          San.Francisco           503
##  5 ALASKA  on time          Seattle                1841
##  6 <NA>    delayed          Los.Angeles              62
##  7 <NA>    delayed          Phoenix                  12
##  8 <NA>    delayed          San.Diego                20
##  9 <NA>    delayed          San.Francisco           102
## 10 <NA>    delayed          Seattle                 305
## 11 AM WEST on time          Los.Angeles             694
## 12 AM WEST on time          Phoenix                4840
## 13 AM WEST on time          San.Diego               383
## 14 AM WEST on time          San.Francisco           320
## 15 AM WEST on time          Seattle                 201
## 16 <NA>    delayed          Los.Angeles             117
## 17 <NA>    delayed          Phoenix                 415
## 18 <NA>    delayed          San.Diego                65
## 19 <NA>    delayed          San.Francisco           129
## 20 <NA>    delayed          Seattle                  61
#Fill the missing Airlines values with the previous Airline name
long_data <- long_data %>% fill(Airline)

print(long_data)
## # A tibble: 20 × 4
##    Airline `Arrival Status` Destination   `Num Flights`
##    <chr>   <chr>            <chr>                 <dbl>
##  1 ALASKA  on time          Los.Angeles             497
##  2 ALASKA  on time          Phoenix                 221
##  3 ALASKA  on time          San.Diego               212
##  4 ALASKA  on time          San.Francisco           503
##  5 ALASKA  on time          Seattle                1841
##  6 ALASKA  delayed          Los.Angeles              62
##  7 ALASKA  delayed          Phoenix                  12
##  8 ALASKA  delayed          San.Diego                20
##  9 ALASKA  delayed          San.Francisco           102
## 10 ALASKA  delayed          Seattle                 305
## 11 AM WEST on time          Los.Angeles             694
## 12 AM WEST on time          Phoenix                4840
## 13 AM WEST on time          San.Diego               383
## 14 AM WEST on time          San.Francisco           320
## 15 AM WEST on time          Seattle                 201
## 16 AM WEST delayed          Los.Angeles             117
## 17 AM WEST delayed          Phoenix                 415
## 18 AM WEST delayed          San.Diego                65
## 19 AM WEST delayed          San.Francisco           129
## 20 AM WEST delayed          Seattle                  61
#Use janitor::clean_names function to standardize column names
long_data <- long_data %>% janitor::clean_names()

print(long_data)
## # A tibble: 20 × 4
##    airline arrival_status destination   num_flights
##    <chr>   <chr>          <chr>               <dbl>
##  1 ALASKA  on time        Los.Angeles           497
##  2 ALASKA  on time        Phoenix               221
##  3 ALASKA  on time        San.Diego             212
##  4 ALASKA  on time        San.Francisco         503
##  5 ALASKA  on time        Seattle              1841
##  6 ALASKA  delayed        Los.Angeles            62
##  7 ALASKA  delayed        Phoenix                12
##  8 ALASKA  delayed        San.Diego              20
##  9 ALASKA  delayed        San.Francisco         102
## 10 ALASKA  delayed        Seattle               305
## 11 AM WEST on time        Los.Angeles           694
## 12 AM WEST on time        Phoenix              4840
## 13 AM WEST on time        San.Diego             383
## 14 AM WEST on time        San.Francisco         320
## 15 AM WEST on time        Seattle               201
## 16 AM WEST delayed        Los.Angeles           117
## 17 AM WEST delayed        Phoenix               415
## 18 AM WEST delayed        San.Diego              65
## 19 AM WEST delayed        San.Francisco         129
## 20 AM WEST delayed        Seattle                61
#Spread the arrival status field
tidy_data <- long_data %>% spread(key=arrival_status, value=num_flights)

print(tidy_data)
## # A tibble: 10 × 4
##    airline destination   delayed `on time`
##    <chr>   <chr>           <dbl>     <dbl>
##  1 ALASKA  Los.Angeles        62       497
##  2 ALASKA  Phoenix            12       221
##  3 ALASKA  San.Diego          20       212
##  4 ALASKA  San.Francisco     102       503
##  5 ALASKA  Seattle           305      1841
##  6 AM WEST Los.Angeles       117       694
##  7 AM WEST Phoenix           415      4840
##  8 AM WEST San.Diego          65       383
##  9 AM WEST San.Francisco     129       320
## 10 AM WEST Seattle            61       201
#Clean column names
tidy_data <- tidy_data %>% janitor::clean_names()

print(tidy_data)
## # A tibble: 10 × 4
##    airline destination   delayed on_time
##    <chr>   <chr>           <dbl>   <dbl>
##  1 ALASKA  Los.Angeles        62     497
##  2 ALASKA  Phoenix            12     221
##  3 ALASKA  San.Diego          20     212
##  4 ALASKA  San.Francisco     102     503
##  5 ALASKA  Seattle           305    1841
##  6 AM WEST Los.Angeles       117     694
##  7 AM WEST Phoenix           415    4840
##  8 AM WEST San.Diego          65     383
##  9 AM WEST San.Francisco     129     320
## 10 AM WEST Seattle            61     201
#Remove the "." from the destination city names
tidy_data <- tidy_data %>% mutate(destination = str_replace(destination, "\\.", " "))

print(tidy_data)
## # A tibble: 10 × 4
##    airline destination   delayed on_time
##    <chr>   <chr>           <dbl>   <dbl>
##  1 ALASKA  Los Angeles        62     497
##  2 ALASKA  Phoenix            12     221
##  3 ALASKA  San Diego          20     212
##  4 ALASKA  San Francisco     102     503
##  5 ALASKA  Seattle           305    1841
##  6 AM WEST Los Angeles       117     694
##  7 AM WEST Phoenix           415    4840
##  8 AM WEST San Diego          65     383
##  9 AM WEST San Francisco     129     320
## 10 AM WEST Seattle            61     201

Prepare Data for Analysis

Now that we have the data structured in a tidy data format, we are just about ready to conduct various analyses on the data. To assist in the analysis process, we first adjust our tidy data to include the following fields:

  1. num_flights = the total number of flights to the destination city, per airline
  2. pct_delayed = percent of times a flight is delayed for a specific destination, per airline
  3. pct_ontime = percent of times a flight is on time for a specific destination, per airline

Additionally, to account for differences in the number of flights that a particular airline may have to a specific destination, we also created a separate data table that is summarized by each airline, and has the following additional fields:

  1. total flights = total number of flights across all destinations for particular airline
  2. total_delayed = total number of delayed flights for particular airline across all destinations
  3. total ontime = total number of on time flights for particular airline across all destinations
  4. pct_delayed = percent of total flights that were delayed for a particular airline
  5. pct_ontime = percent of total flights on time for a particular airline
flight_df <- tidy_data


flight_df <- flight_df %>% mutate(num_flights = delayed+on_time,
                                  pct_delayed = delayed/num_flights,
                                  pct_ontime = on_time/num_flights)

print(flight_df)
## # A tibble: 10 × 7
##    airline destination   delayed on_time num_flights pct_delayed pct_ontime
##    <chr>   <chr>           <dbl>   <dbl>       <dbl>       <dbl>      <dbl>
##  1 ALASKA  Los Angeles        62     497         559      0.111       0.889
##  2 ALASKA  Phoenix            12     221         233      0.0515      0.948
##  3 ALASKA  San Diego          20     212         232      0.0862      0.914
##  4 ALASKA  San Francisco     102     503         605      0.169       0.831
##  5 ALASKA  Seattle           305    1841        2146      0.142       0.858
##  6 AM WEST Los Angeles       117     694         811      0.144       0.856
##  7 AM WEST Phoenix           415    4840        5255      0.0790      0.921
##  8 AM WEST San Diego          65     383         448      0.145       0.855
##  9 AM WEST San Francisco     129     320         449      0.287       0.713
## 10 AM WEST Seattle            61     201         262      0.233       0.767
airline_summary <- flight_df %>% group_by(airline) %>% summarize(total_delayed = sum(delayed),
                                              total_ontime = sum(on_time),
                                              total_flights = sum(num_flights),
                                              pct_delayed = (total_delayed/total_flights),
                                              pct_ontime = (total_ontime/total_flights))

print(airline_summary)
## # A tibble: 2 × 6
##   airline total_delayed total_ontime total_flights pct_delayed pct_ontime
##   <chr>           <dbl>        <dbl>         <dbl>       <dbl>      <dbl>
## 1 ALASKA            501         3274          3775       0.133      0.867
## 2 AM WEST           787         6438          7225       0.109      0.891
#Average difference in pct of delayed flights by destination between the two airlines
mean(flight_df[flight_df$airline == 'AM WEST',]$pct_delayed - flight_df[flight_df$airline == 'ALASKA',]$pct_delayed)
## [1] 0.06582325

Visualizing Data

As a final step in our analysis process, we will create visualizations to help us compare the arrival delays for the two airlines

ggplot(data=airline_summary, aes(x=airline, y=total_delayed, fill=airline)) +
  geom_bar(stat='identity') +
  labs(title = "\n Number of Delayed Flights by Airline", y="Total Number of Delayed Flights", fill="Airline") +
  theme(axis.title.x = element_blank()) + 
  geom_text(aes(label=total_delayed), vjust = 1.5,position = position_dodge(.9), size = 3)

ggplot(data = airline_summary, aes(x=airline, y=pct_delayed, fill=airline)) + 
  geom_bar(stat='identity') +
  labs(title = "\n Pct of Flights Delayed by Airline", y="\n Pct of Flights Delayed", fill="Airline") +
  scale_y_continuous(labels=percent) +
  theme(axis.title.x = element_blank()) + 
  geom_text(aes(label=paste0(format(pct_delayed*100, digits=2),"%")), vjust = 1.5,position = position_dodge(.9), size = 3)

ggplot(data = flight_df, aes(x=destination,y=pct_delayed, fill=airline)) +
  geom_bar(stat='identity', position = position_dodge()) +
  geom_text(aes(label=paste0(format(pct_delayed*100, digits=2),"%")), vjust = 1.5,position = position_dodge(.9), size = 3) +
  labs(x="\nDestination City", y="\nPct of Flights Delayed", title="\n Percent of Flights Delayed by Airline and Destination", fill="Airlines") +
  theme(axis.title.x = element_blank()) +
  scale_y_continuous(labels=percent)

Conclusion

Based on the analysis, we find the Am West Airlines has 286 - or 57% - more delayed flights than Alaska Airlines. However, AmWest had 97% more total flights than Alaska. To better understand the propensity for each airline to have delayed flights, we instead look at the percent of their total flights that are delayed. When we view the data in this way, we find that approximately 13.3% of flights are delayed at Alaska Airlines, vs 10.9% of flights at AmWest. Finally, when we further distill the data to view the percent of flights delayed for each airline based on their destination, we find that Alaska Airlines outperforms AmWest across each of the destination cities. Across each of the destination cities, AmWest had an average of 6.6% more delayed flights per destination compared to Alaska Airlines