Loading packages for data manipulation and structuring
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
library(tidyr)
library(stringr)
Reading in CSV data, calling it “flights”
flights <- read.csv('https://raw.githubusercontent.com/rossboehme/DATA607/main/assignment4/Assignment4-TidyingTransformingData-AirlinesCSV.csv')
Filling in blank cells, removing empty third row, and labeling columns. This will make the data easier to manipulate by having more named cells.
flights[2,1] <- flights[1,1]
flights[5,1] <- flights[4,1]
flights <- flights %>% na.omit()
names(flights) <- c("airline","arr_status",colnames(flights)[3:7])
While this dataframe as-is contains useful flight information, it doesn’t adhere to the principles of a “tidy” data set per our textbook R for Data Science:
The advantages of adhering to these principles are multifold. Storing data in the recommended way for R ensures that packages are as useful as possible. For example, the mutate function in dplyr assumes data adhere to these principles, making it easy to calculate fields using multiple columns.
Therefore I will continue my data preparation below to ensure my data is “tidy.”
The gather and spread functions from tidyr can pivot my data to make this occur.
flights <- flights %>%
gather(3:7, key ="destination", value = "num_flights" ) %>%
spread(key = arr_status, value=num_flights)
Now each of my variables (airline, destination, delayed flights, and on time flights) has its own column. The numeric quality I’m “observing” with this data set is the number of flights per airline, arrival status, and destination. Pivoting the data ensures there’s one arr_delay and one arr_ontime observation per row. By satisfying these two aforementioned column and row principles, I also satisfy the third principle of “tidy” data: Each value having its own cell.
Next I need to change my column names and the data type of the num_arrivals columns so they’re numeric and without spaces in their titles. To not introduce NAs by coercion when I change the type, I’ll first get rid of comma characters.
flights <- flights %>%
mutate(across(delayed:"on time", ~as.numeric(str_remove(.x, ","))))
names(flights) <- c(colnames(flights)[1:2],"arr_delayed","arr_ontime")
Finishing up, I’ll replace periods in the destination names with spaces to make them more accurate/graphs cleaner. Finally, I’ll add a column with “total arrivals” to make calculations easier.
flights$destination <- flights$destination %>% str_replace_all('\\.', ' ')
flights <- flights %>%
mutate(total_arrivals = arr_delayed + arr_ontime)
Now my data are ready to compare arrival delays for the two airlines. I’ll analyze the data in three steps: 1. Provide an overview of the data. 2. Compare arrival delays for the two airlines. 3. Make conclusions/recommendations about which airline to use.
ontime_v_delayed <- flights %>%
summarize(arr_ontime = sum(arr_ontime)
,arr_delayed = sum(arr_delayed)
,total_arrivals = sum(total_arrivals)
,pct_on_time = round(arr_ontime/sum(total_arrivals),2)*100
,pct_delayed = round(arr_delayed/sum(total_arrivals),2)*100)
ontime_v_delayed
## arr_ontime arr_delayed total_arrivals pct_on_time pct_delayed
## 1 9712 1288 11000 88 12
destination_compare <- flights %>%
group_by(destination) %>%
summarize(total_arrivals = sum(total_arrivals)
,pct_of_total_arrivals = round(total_arrivals/sum(flights$total_arrivals),2)*100
,arr_ontime = sum(arr_ontime)
,arr_delayed = sum(arr_delayed)
,pct_on_time = round(arr_ontime/sum(total_arrivals),2)*100,
,pct_delayed = round(arr_delayed/sum(total_arrivals),2)*100)
destination_compare
## # A tibble: 5 × 7
## destination total_arrivals pct_of_total_ar…¹ arr_o…² arr_d…³ pct_o…⁴ pct_d…⁵
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Los Angeles 1370 12 1191 179 87 13
## 2 Phoenix 5488 50 5061 427 92 8
## 3 San Diego 680 6 595 85 88 12
## 4 San Francisco 1054 10 823 231 78 22
## 5 Seattle 2408 22 2042 366 85 15
## # … with abbreviated variable names ¹pct_of_total_arrivals, ²arr_ontime,
## # ³arr_delayed, ⁴pct_on_time, ⁵pct_delayed
airline_compare <- flights %>%
group_by(airline) %>%
summarize(arr_ontime = sum(arr_ontime)
,arr_delayed = sum(arr_delayed)
,total_arrivals = sum(total_arrivals)
,pct_of_total_arrivals = round(total_arrivals/sum(flights$total_arrivals),2)*100
,pct_on_time = round(arr_ontime/sum(total_arrivals),2)*100
,pct_delayed = round(arr_delayed/sum(total_arrivals),2)*100)
airline_compare
## # A tibble: 2 × 7
## airline arr_ontime arr_delayed total_arrivals pct_of_total_a…¹ pct_o…² pct_d…³
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA 3274 501 3775 34 87 13
## 2 AM WEST 6438 787 7225 66 89 11
## # … with abbreviated variable names ¹pct_of_total_arrivals, ²pct_on_time,
## # ³pct_delayed
airline_destination_compare <- flights %>%
group_by(airline,destination) %>%
summarize(arr_ontime = sum(arr_ontime)
,arr_delayed = sum(arr_delayed)
,total_arrivals = sum(total_arrivals)
,pct_of_total_arrivals = round(total_arrivals/sum(flights$total_arrivals),2)*100
,pct_on_time = round(arr_ontime/sum(total_arrivals),2)*100
,pct_delayed = round(arr_delayed/sum(total_arrivals),2)*100)
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
airline_destination_compare %>% arrange(desc(pct_on_time))
## # A tibble: 10 × 8
## # Groups: airline [2]
## airline destination arr_ontime arr_delayed total_…¹ pct_o…² pct_o…³ pct_d…⁴
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA Phoenix 221 12 233 2 95 5
## 2 AM WEST Phoenix 4840 415 5255 48 92 8
## 3 ALASKA San Diego 212 20 232 2 91 9
## 4 ALASKA Los Angeles 497 62 559 5 89 11
## 5 ALASKA Seattle 1841 305 2146 20 86 14
## 6 AM WEST Los Angeles 694 117 811 7 86 14
## 7 AM WEST San Diego 383 65 448 4 85 15
## 8 ALASKA San Francisco 503 102 605 6 83 17
## 9 AM WEST Seattle 201 61 262 2 77 23
## 10 AM WEST San Francisco 320 129 449 4 71 29
## # … with abbreviated variable names ¹total_arrivals, ²pct_of_total_arrivals,
## # ³pct_on_time, ⁴pct_delayed
Please note that all recommendations below are based on a limited sample set. These data are for only two airlines over an unknown time span, and from an unknown source. I could make stronger recommendations if I had more background on the data and a larger sample size.