Loading packages for data manipulation and structuring

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
library(tidyr)
library(stringr)

Reading in CSV data, calling it “flights”

flights <- read.csv('https://raw.githubusercontent.com/rossboehme/DATA607/main/assignment4/Assignment4-TidyingTransformingData-AirlinesCSV.csv')

Filling in blank cells, removing empty third row, and labeling columns. This will make the data easier to manipulate by having more named cells.

flights[2,1] <- flights[1,1]
flights[5,1] <- flights[4,1]

flights <- flights %>% na.omit()

names(flights) <- c("airline","arr_status",colnames(flights)[3:7]) 

While this dataframe as-is contains useful flight information, it doesn’t adhere to the principles of a “tidy” data set per our textbook R for Data Science:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

The advantages of adhering to these principles are multifold. Storing data in the recommended way for R ensures that packages are as useful as possible. For example, the mutate function in dplyr assumes data adhere to these principles, making it easy to calculate fields using multiple columns.

Therefore I will continue my data preparation below to ensure my data is “tidy.”

The gather and spread functions from tidyr can pivot my data to make this occur.

flights <- flights %>% 
   gather(3:7, key ="destination", value = "num_flights" ) %>%
   spread(key = arr_status, value=num_flights)

Now each of my variables (airline, destination, delayed flights, and on time flights) has its own column. The numeric quality I’m “observing” with this data set is the number of flights per airline, arrival status, and destination. Pivoting the data ensures there’s one arr_delay and one arr_ontime observation per row. By satisfying these two aforementioned column and row principles, I also satisfy the third principle of “tidy” data: Each value having its own cell.

Next I need to change my column names and the data type of the num_arrivals columns so they’re numeric and without spaces in their titles. To not introduce NAs by coercion when I change the type, I’ll first get rid of comma characters.

flights <- flights %>%
  mutate(across(delayed:"on time", ~as.numeric(str_remove(.x, ","))))

names(flights) <- c(colnames(flights)[1:2],"arr_delayed","arr_ontime") 

Finishing up, I’ll replace periods in the destination names with spaces to make them more accurate/graphs cleaner. Finally, I’ll add a column with “total arrivals” to make calculations easier.

flights$destination <- flights$destination %>% str_replace_all('\\.', ' ')

flights <- flights %>% 
  mutate(total_arrivals = arr_delayed + arr_ontime)

Now my data are ready to compare arrival delays for the two airlines. I’ll analyze the data in three steps: 1. Provide an overview of the data. 2. Compare arrival delays for the two airlines. 3. Make conclusions/recommendations about which airline to use.

Data Overview

ontime_v_delayed <- flights %>%
  summarize(arr_ontime = sum(arr_ontime)
  ,arr_delayed = sum(arr_delayed)
  ,total_arrivals = sum(total_arrivals)
  ,pct_on_time = round(arr_ontime/sum(total_arrivals),2)*100
  ,pct_delayed = round(arr_delayed/sum(total_arrivals),2)*100)
ontime_v_delayed
##   arr_ontime arr_delayed total_arrivals pct_on_time pct_delayed
## 1       9712        1288          11000          88          12
destination_compare <- flights %>%
  group_by(destination) %>%
  summarize(total_arrivals = sum(total_arrivals)
  ,pct_of_total_arrivals = round(total_arrivals/sum(flights$total_arrivals),2)*100
  ,arr_ontime = sum(arr_ontime)
  ,arr_delayed = sum(arr_delayed)
  ,pct_on_time = round(arr_ontime/sum(total_arrivals),2)*100,
  ,pct_delayed = round(arr_delayed/sum(total_arrivals),2)*100)
destination_compare
## # A tibble: 5 × 7
##   destination   total_arrivals pct_of_total_ar…¹ arr_o…² arr_d…³ pct_o…⁴ pct_d…⁵
##   <chr>                  <dbl>             <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 Los Angeles             1370                12    1191     179      87      13
## 2 Phoenix                 5488                50    5061     427      92       8
## 3 San Diego                680                 6     595      85      88      12
## 4 San Francisco           1054                10     823     231      78      22
## 5 Seattle                 2408                22    2042     366      85      15
## # … with abbreviated variable names ¹​pct_of_total_arrivals, ²​arr_ontime,
## #   ³​arr_delayed, ⁴​pct_on_time, ⁵​pct_delayed

Comparing Arrival Delays

airline_compare <- flights %>%
  group_by(airline) %>%
  summarize(arr_ontime = sum(arr_ontime)
  ,arr_delayed = sum(arr_delayed)
  ,total_arrivals = sum(total_arrivals)
  ,pct_of_total_arrivals = round(total_arrivals/sum(flights$total_arrivals),2)*100
  ,pct_on_time = round(arr_ontime/sum(total_arrivals),2)*100
  ,pct_delayed = round(arr_delayed/sum(total_arrivals),2)*100)
airline_compare
## # A tibble: 2 × 7
##   airline arr_ontime arr_delayed total_arrivals pct_of_total_a…¹ pct_o…² pct_d…³
##   <chr>        <dbl>       <dbl>          <dbl>            <dbl>   <dbl>   <dbl>
## 1 ALASKA        3274         501           3775               34      87      13
## 2 AM WEST       6438         787           7225               66      89      11
## # … with abbreviated variable names ¹​pct_of_total_arrivals, ²​pct_on_time,
## #   ³​pct_delayed
airline_destination_compare <- flights %>%
  group_by(airline,destination) %>%
  summarize(arr_ontime = sum(arr_ontime)
  ,arr_delayed = sum(arr_delayed)
  ,total_arrivals = sum(total_arrivals)
  ,pct_of_total_arrivals = round(total_arrivals/sum(flights$total_arrivals),2)*100
  ,pct_on_time = round(arr_ontime/sum(total_arrivals),2)*100
  ,pct_delayed = round(arr_delayed/sum(total_arrivals),2)*100)
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
airline_destination_compare %>% arrange(desc(pct_on_time))
## # A tibble: 10 × 8
## # Groups:   airline [2]
##    airline destination   arr_ontime arr_delayed total_…¹ pct_o…² pct_o…³ pct_d…⁴
##    <chr>   <chr>              <dbl>       <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
##  1 ALASKA  Phoenix              221          12      233       2      95       5
##  2 AM WEST Phoenix             4840         415     5255      48      92       8
##  3 ALASKA  San Diego            212          20      232       2      91       9
##  4 ALASKA  Los Angeles          497          62      559       5      89      11
##  5 ALASKA  Seattle             1841         305     2146      20      86      14
##  6 AM WEST Los Angeles          694         117      811       7      86      14
##  7 AM WEST San Diego            383          65      448       4      85      15
##  8 ALASKA  San Francisco        503         102      605       6      83      17
##  9 AM WEST Seattle              201          61      262       2      77      23
## 10 AM WEST San Francisco        320         129      449       4      71      29
## # … with abbreviated variable names ¹​total_arrivals, ²​pct_of_total_arrivals,
## #   ³​pct_on_time, ⁴​pct_delayed

Recommending Airlines/Destinations (Conclusion)

Please note that all recommendations below are based on a limited sample set. These data are for only two airlines over an unknown time span, and from an unknown source. I could make stronger recommendations if I had more background on the data and a larger sample size.