The chart in this exercise describes arrival delays for two airlines
across five destinations.
Your task is to:
1. Create a .CSV file (or optionally, a MySQL database!) that includes
all of the information above. You’re encouraged to use a “wide”
structure similar to how the information appears above, so that you can
practice tidying and transformations as described below.
2. Read the information from your .CSV file into R, and use tidyr and
dplyr as needed to tidy and transform your data.
3. Perform analysis to compare the arrival delays for the two
airlines.
4. Your code should be in an R Markdown file, posted to rpubs.com, and
should include narrative descriptions of your data cleanup work,
analysis, and conclusions.
Loading R packages:
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Retrieving flight data from the GitHub raw URL using read_csv().
url <- "https://raw.githubusercontent.com/hbedros/data607_hw5/main/Week%205%20-%20Sheet1.csv"
flights <- read_csv(url)
## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (5): Los Angeles, Phoenix, San Diego, San Francisco,
## Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
head(flights)
## # A tibble: 5 × 7
## ...1 ...2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Initiating data wrangling tasks with
dplyr:
1. Dynamically renaming columns by referencing their current
names.
2. Imputing missing values in specific rows and columns.
3. Filtering out rows where the ‘Airline’ column has NA values.
# Tidying with dplyr (i.e. "%>%", "rename()", "filter()")
flights_renamed <- flights %>%
rename(
Airline = !!names(flights)[1], # Rename the first column
Status = !!names(flights)[2] # Rename the second column
)
flights_renamed[2, 1] <- "ALASKA" # Filling NA value
flights_renamed[5, 1] <- "AM WEST" # Filling NA value
flights_cln <- flights_renamed %>%
filter(!is.na(Airline)) # 'Airline' being the first column
Transforming the dataframe’s structure with
tidyr:
1. Using pivot_longer() to transition from a wide format to a long
format by melting cities into a single column.
2. Using pivot_wider() to spread ‘Status’ into separate columns for
counts.
3. Standardizing column names for clarity.
# Tidying with tidyr (i.e. "pivot_longer()", "pivot_wider()")
tidy_flights <- flights_cln %>%
pivot_longer(cols = `Los Angeles`:`Seattle`, names_to = "City", values_to = "Count") %>%
pivot_wider(names_from = Status, values_from = Count) %>%
rename(`On Time` = `on time`, Delayed = delayed)
Performing data analysis on tidy_flights:
1. Aggregating total delays per airline.
2. Aggregating delays per airline and city.
3. Computing delay rate for each airline as the proportion of delayed
flights to total flights.
4. Generating descriptive statistics for delay durations grouped by
airline.
# 1. Overall Delay Analysis
overall_delays <- tidy_flights %>%
group_by(Airline) %>%
summarise(Total_Delays = sum(Delayed))
print(overall_delays)
## # A tibble: 2 × 2
## Airline Total_Delays
## <chr> <dbl>
## 1 ALASKA 501
## 2 AM WEST 787
# 2. City-wise Delay Analysis
city_delays <- tidy_flights %>%
group_by(Airline, City) %>%
summarise(City_Delays = sum(Delayed))
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
print(city_delays)
## # A tibble: 10 × 3
## # Groups: Airline [2]
## Airline City City_Delays
## <chr> <chr> <dbl>
## 1 ALASKA Los Angeles 62
## 2 ALASKA Phoenix 12
## 3 ALASKA San Diego 20
## 4 ALASKA San Francisco 102
## 5 ALASKA Seattle 305
## 6 AM WEST Los Angeles 117
## 7 AM WEST Phoenix 415
## 8 AM WEST San Diego 65
## 9 AM WEST San Francisco 129
## 10 AM WEST Seattle 61
# 3. Delay Rate
delay_rate <- tidy_flights %>%
group_by(Airline) %>%
summarise(
Total_Flights = sum(Delayed) + sum(`On Time`),
Delay_Rate = sum(Delayed) / (sum(Delayed) + sum(`On Time`))
)
print(delay_rate)
## # A tibble: 2 × 3
## Airline Total_Flights Delay_Rate
## <chr> <dbl> <dbl>
## 1 ALASKA 3775 0.133
## 2 AM WEST 7225 0.109
# 4. Delay durations grouped by airline
delay_desc_stats <- tidy_flights %>%
group_by(Airline) %>%
summarise(
Mean_Delay = mean(Delayed),
Median_Delay = median(Delayed),
Min_Delay = min(Delayed),
Max_Delay = max(Delayed)
)
print(delay_desc_stats)
## # A tibble: 2 × 5
## Airline Mean_Delay Median_Delay Min_Delay Max_Delay
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA 100. 62 12 305
## 2 AM WEST 157. 117 61 415
Conclusion:
For ALASKA:
The average delay experienced is 100.2 minutes.
The median delay (which can be indicative of a more typical delay
scenario as it’s less affected by extreme values) is 62 minutes.
The shortest delay recorded was 12 minutes, whereas the longest delay
experienced reached up to 305 minutes.
For AM WEST:
The average delay is significantly higher, standing at 157.4 minutes.
The median delay is 117 minutes.
The range of delays varies from as short as 61 minutes to as long as 415
minutes.
In comparison, while both airlines have experienced significant delays, AM WEST has a higher average and median delay than ALASKA. The data also shows that AM WEST has experienced both longer minimum and maximum delays compared to ALASKA. Travelers concerned about potential delays might consider these statistics when choosing between these two airlines.