DATA607

Answers below:

Loading R packages:

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Retrieving flight data from the GitHub raw URL using read_csv().

url <- "https://raw.githubusercontent.com/hbedros/data607_hw5/main/Week%205%20-%20Sheet1.csv"

flights <- read_csv(url)

## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (5): Los Angeles, Phoenix, San Diego, San Francisco,
## Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`

head(flights)

## # A tibble: 5 × 7
##   ...1    ...2    `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 <NA>    <NA>               NA      NA          NA              NA      NA
## 4 AM WEST on time           694    4840         383             320     201
## 5 <NA>    delayed           117     415          65             129      61

Initiating data wrangling tasks with dplyr:
1. Dynamically renaming columns by referencing their current names.
2. Imputing missing values in specific rows and columns.
3. Filtering out rows where the ‘Airline’ column has NA values.

# Tidying with dplyr (i.e. "%>%", "rename()", "filter()")

flights_renamed <- flights %>%
  rename(
    Airline = !!names(flights)[1],   # Rename the first column
    Status = !!names(flights)[2]    # Rename the second column
  )

flights_renamed[2, 1] <- "ALASKA" # Filling NA value
flights_renamed[5, 1] <- "AM WEST" # Filling NA value

flights_cln <- flights_renamed %>%
  filter(!is.na(Airline)) # 'Airline' being the first column

Transforming the dataframe’s structure with tidyr:
1. Using pivot_longer() to transition from a wide format to a long format by melting cities into a single column.
2. Using pivot_wider() to spread ‘Status’ into separate columns for counts.
3. Standardizing column names for clarity.

# Tidying with tidyr (i.e. "pivot_longer()", "pivot_wider()")

tidy_flights <- flights_cln %>%
  pivot_longer(cols = `Los Angeles`:`Seattle`, names_to = "City", values_to = "Count") %>%
  pivot_wider(names_from = Status, values_from = Count) %>%
  rename(`On Time` = `on time`, Delayed = delayed)

Performing data analysis on tidy_flights:
1. Aggregating total delays per airline.
2. Aggregating delays per airline and city.
3. Computing delay rate for each airline as the proportion of delayed flights to total flights.
4. Generating descriptive statistics for delay durations grouped by airline.

# 1. Overall Delay Analysis
overall_delays <- tidy_flights %>%
  group_by(Airline) %>%
  summarise(Total_Delays = sum(Delayed))

print(overall_delays)

## # A tibble: 2 × 2
##   Airline Total_Delays
##   <chr>          <dbl>
## 1 ALASKA           501
## 2 AM WEST          787

# 2. City-wise Delay Analysis
city_delays <- tidy_flights %>%
  group_by(Airline, City) %>%
  summarise(City_Delays = sum(Delayed))

## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.

print(city_delays)

## # A tibble: 10 × 3
## # Groups:   Airline [2]
##    Airline City          City_Delays
##    <chr>   <chr>               <dbl>
##  1 ALASKA  Los Angeles            62
##  2 ALASKA  Phoenix                12
##  3 ALASKA  San Diego              20
##  4 ALASKA  San Francisco         102
##  5 ALASKA  Seattle               305
##  6 AM WEST Los Angeles           117
##  7 AM WEST Phoenix               415
##  8 AM WEST San Diego              65
##  9 AM WEST San Francisco         129
## 10 AM WEST Seattle                61

# 3. Delay Rate
delay_rate <- tidy_flights %>%
  group_by(Airline) %>%
  summarise(
    Total_Flights = sum(Delayed) + sum(`On Time`),
    Delay_Rate = sum(Delayed) / (sum(Delayed) + sum(`On Time`))
  )

print(delay_rate)

## # A tibble: 2 × 3
##   Airline Total_Flights Delay_Rate
##   <chr>           <dbl>      <dbl>
## 1 ALASKA           3775      0.133
## 2 AM WEST          7225      0.109

# 4. Delay durations grouped by airline
delay_desc_stats <- tidy_flights %>%
  group_by(Airline) %>%
  summarise(
    Mean_Delay = mean(Delayed),
    Median_Delay = median(Delayed),
    Min_Delay = min(Delayed),
    Max_Delay = max(Delayed)
  )

print(delay_desc_stats)

## # A tibble: 2 × 5
##   Airline Mean_Delay Median_Delay Min_Delay Max_Delay
##   <chr>        <dbl>        <dbl>     <dbl>     <dbl>
## 1 ALASKA        100.           62        12       305
## 2 AM WEST       157.          117        61       415

Conclusion:

For ALASKA:
The average delay experienced is 100.2 minutes.
The median delay (which can be indicative of a more typical delay scenario as it’s less affected by extreme values) is 62 minutes.
The shortest delay recorded was 12 minutes, whereas the longest delay experienced reached up to 305 minutes.
For AM WEST:
The average delay is significantly higher, standing at 157.4 minutes. The median delay is 117 minutes.
The range of delays varies from as short as 61 minutes to as long as 415 minutes.

In comparison, while both airlines have experienced significant delays, AM WEST has a higher average and median delay than ALASKA. The data also shows that AM WEST has experienced both longer minimum and maximum delays compared to ALASKA. Travelers concerned about potential delays might consider these statistics when choosing between these two airlines.

DATA607_HW5

Haig Bedros

2023-09-27

Week 5 Assignment:

Answers below: