Week 5 Assignment:

The chart in this exercise describes arrival delays for two airlines across five destinations.
Your task is to:
1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
3. Perform analysis to compare the arrival delays for the two airlines.
4. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Answers below:

Loading R packages:

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Retrieving flight data from the GitHub raw URL using read_csv().

url <- "https://raw.githubusercontent.com/hbedros/data607_hw5/main/Week%205%20-%20Sheet1.csv"

flights <- read_csv(url)
## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (5): Los Angeles, Phoenix, San Diego, San Francisco,
## Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
head(flights)
## # A tibble: 5 × 7
##   ...1    ...2    `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 <NA>    <NA>               NA      NA          NA              NA      NA
## 4 AM WEST on time           694    4840         383             320     201
## 5 <NA>    delayed           117     415          65             129      61

Initiating data wrangling tasks with dplyr:
1. Dynamically renaming columns by referencing their current names.
2. Imputing missing values in specific rows and columns.
3. Filtering out rows where the ‘Airline’ column has NA values.

# Tidying with dplyr (i.e. "%>%", "rename()", "filter()")

flights_renamed <- flights %>%
  rename(
    Airline = !!names(flights)[1],   # Rename the first column
    Status = !!names(flights)[2]    # Rename the second column
  )

flights_renamed[2, 1] <- "ALASKA" # Filling NA value
flights_renamed[5, 1] <- "AM WEST" # Filling NA value

flights_cln <- flights_renamed %>%
  filter(!is.na(Airline)) # 'Airline' being the first column

Transforming the dataframe’s structure with tidyr:
1. Using pivot_longer() to transition from a wide format to a long format by melting cities into a single column.
2. Using pivot_wider() to spread ‘Status’ into separate columns for counts.
3. Standardizing column names for clarity.

# Tidying with tidyr (i.e. "pivot_longer()", "pivot_wider()")

tidy_flights <- flights_cln %>%
  pivot_longer(cols = `Los Angeles`:`Seattle`, names_to = "City", values_to = "Count") %>%
  pivot_wider(names_from = Status, values_from = Count) %>%
  rename(`On Time` = `on time`, Delayed = delayed)

Performing data analysis on tidy_flights:
1. Aggregating total delays per airline.
2. Aggregating delays per airline and city.
3. Computing delay rate for each airline as the proportion of delayed flights to total flights.
4. Generating descriptive statistics for delay durations grouped by airline.

# 1. Overall Delay Analysis
overall_delays <- tidy_flights %>%
  group_by(Airline) %>%
  summarise(Total_Delays = sum(Delayed))

print(overall_delays)
## # A tibble: 2 × 2
##   Airline Total_Delays
##   <chr>          <dbl>
## 1 ALASKA           501
## 2 AM WEST          787
# 2. City-wise Delay Analysis
city_delays <- tidy_flights %>%
  group_by(Airline, City) %>%
  summarise(City_Delays = sum(Delayed))
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
print(city_delays)
## # A tibble: 10 × 3
## # Groups:   Airline [2]
##    Airline City          City_Delays
##    <chr>   <chr>               <dbl>
##  1 ALASKA  Los Angeles            62
##  2 ALASKA  Phoenix                12
##  3 ALASKA  San Diego              20
##  4 ALASKA  San Francisco         102
##  5 ALASKA  Seattle               305
##  6 AM WEST Los Angeles           117
##  7 AM WEST Phoenix               415
##  8 AM WEST San Diego              65
##  9 AM WEST San Francisco         129
## 10 AM WEST Seattle                61
# 3. Delay Rate
delay_rate <- tidy_flights %>%
  group_by(Airline) %>%
  summarise(
    Total_Flights = sum(Delayed) + sum(`On Time`),
    Delay_Rate = sum(Delayed) / (sum(Delayed) + sum(`On Time`))
  )

print(delay_rate)
## # A tibble: 2 × 3
##   Airline Total_Flights Delay_Rate
##   <chr>           <dbl>      <dbl>
## 1 ALASKA           3775      0.133
## 2 AM WEST          7225      0.109
# 4. Delay durations grouped by airline
delay_desc_stats <- tidy_flights %>%
  group_by(Airline) %>%
  summarise(
    Mean_Delay = mean(Delayed),
    Median_Delay = median(Delayed),
    Min_Delay = min(Delayed),
    Max_Delay = max(Delayed)
  )

print(delay_desc_stats)
## # A tibble: 2 × 5
##   Airline Mean_Delay Median_Delay Min_Delay Max_Delay
##   <chr>        <dbl>        <dbl>     <dbl>     <dbl>
## 1 ALASKA        100.           62        12       305
## 2 AM WEST       157.          117        61       415

Conclusion:

In comparison, while both airlines have experienced significant delays, AM WEST has a higher average and median delay than ALASKA. The data also shows that AM WEST has experienced both longer minimum and maximum delays compared to ALASKA. Travelers concerned about potential delays might consider these statistics when choosing between these two airlines.