#Resubmission attempt for week 4’s Data Structures in R assignment

Introduction:

This week’s assignment is about tidying and transforming data in R using the tidyr and dplyr packages. I will attempt to import the given data in its original form into MySQL Workbench and tidy the data in R.

#reading the csv file that I created into R

library("RMySQL")

## Loading required package: DBI

mydb <- dbConnect(MySQL(),user= 'bchung',password= 'august29!',dbname= '607',host= 'bchungcuny607.mysql.database.azure.com')

dbListTables(mydb)

## [1] "customers"     "invoices"      "movie ratings" "products"     
## [5] "tidyingdata"

#reading the data table into R
Data <- dbReadTable(mydb,"tidyingdata")

#the data was imported with the unnamed variables being assigned MyUnknownColumn and MyUnknownColumn_.0. by R studio.

Data

##   MyUnknownColumn MyUnknownColumn_.0. Los.Angeles Phoenix San.Diego
## 1          ALASKA             on time         497     221       212
## 2                             delayed          62      12        20
## 3                                                                  
## 4         AM WEST             on time         694   4,840       383
## 5                             delayed         117     415        65
##   San.Francisco Seattle
## 1           503   1,841
## 2           102     305
## 3                      
## 4           320     201
## 5           129      61

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

# Removing the third (blank row) observation in the data frame
Data <- Data[-c(3),]

# Renaming the originally blank column names
Data <- Data %>%
  rename(airline = MyUnknownColumn,
         status = MyUnknownColumn_.0.)

# Mutating variable Airline to fill in the blank values
Data <- Data %>%
  mutate(airline = c('ALASKA','ALASKA','AM_WEST','AM_WEST'))

# Showing data frame after basic cleaning
Data

##   airline  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM_WEST on time         694   4,840       383           320     201
## 5 AM_WEST delayed         117     415        65           129      61

# Pivoting long for analysis
Data <- Data %>% 
  pivot_longer(
    cols =  3:7,
    names_to = 'destination',
    values_to = 'frequency')

# Converting character data types into number for analysis
Data$frequency <- as.numeric(gsub(",", "", Data$frequency))

The tidy dataframe shows the observation of frequency (count of flights) per row.

Data

## # A tibble: 20 × 4
##    airline status  destination   frequency
##    <chr>   <chr>   <chr>             <dbl>
##  1 ALASKA  on time Los.Angeles         497
##  2 ALASKA  on time Phoenix             221
##  3 ALASKA  on time San.Diego           212
##  4 ALASKA  on time San.Francisco       503
##  5 ALASKA  on time Seattle            1841
##  6 ALASKA  delayed Los.Angeles          62
##  7 ALASKA  delayed Phoenix              12
##  8 ALASKA  delayed San.Diego            20
##  9 ALASKA  delayed San.Francisco       102
## 10 ALASKA  delayed Seattle             305
## 11 AM_WEST on time Los.Angeles         694
## 12 AM_WEST on time Phoenix            4840
## 13 AM_WEST on time San.Diego           383
## 14 AM_WEST on time San.Francisco       320
## 15 AM_WEST on time Seattle             201
## 16 AM_WEST delayed Los.Angeles         117
## 17 AM_WEST delayed Phoenix             415
## 18 AM_WEST delayed San.Diego            65
## 19 AM_WEST delayed San.Francisco       129
## 20 AM_WEST delayed Seattle              61

Analysis

# Calculating the delay ratios for each airline across all destinations
delayratio.airline <- Data %>%
  group_by(airline,status) %>%
  summarise(frequency = sum(frequency)) %>%
  mutate(delay.ratio = frequency / sum(frequency)) %>%
  filter(status == "delayed")

## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.

delayratio.airline

## # A tibble: 2 × 4
## # Groups:   airline [2]
##   airline status  frequency delay.ratio
##   <chr>   <chr>       <dbl>       <dbl>
## 1 ALASKA  delayed       501       0.133
## 2 AM_WEST delayed       787       0.109

As we can see, Alaska Airlines has a higher ratio of delayed flights when compared to AM_West even with a lower frequency of flights.

#Comparing the delay ratios of each airline by destination can give us deeper insight to potential reasons for delays - if delays are typically caused by destination related factors or airline related factors.

delayratio.compare <- Data %>%
  group_by(airline, destination, status) %>%
  summarise(frequency = sum(frequency)) %>%
  mutate(delay.ratio = frequency / sum(frequency)) %>%
  filter(status =="delayed")

## `summarise()` has grouped output by 'airline', 'destination'. You can override
## using the `.groups` argument.

delayratio.compare

## # A tibble: 10 × 5
## # Groups:   airline, destination [10]
##    airline destination   status  frequency delay.ratio
##    <chr>   <chr>         <chr>       <dbl>       <dbl>
##  1 ALASKA  Los.Angeles   delayed        62      0.111 
##  2 ALASKA  Phoenix       delayed        12      0.0515
##  3 ALASKA  San.Diego     delayed        20      0.0862
##  4 ALASKA  San.Francisco delayed       102      0.169 
##  5 ALASKA  Seattle       delayed       305      0.142 
##  6 AM_WEST Los.Angeles   delayed       117      0.144 
##  7 AM_WEST Phoenix       delayed       415      0.0790
##  8 AM_WEST San.Diego     delayed        65      0.145 
##  9 AM_WEST San.Francisco delayed       129      0.287 
## 10 AM_WEST Seattle       delayed        61      0.233

#PLotting the delay ratios by destination
library(ggplot2)
ggplot(delayratio.compare, aes(x = airline, y = delay.ratio)) +
  geom_point() +
  facet_wrap(vars(destination))

Doing a quick eye comparison of the graphs of delay ratios by destination above we can see that across both airlines the arrival delay ratios are aligned in terms of which desinations have more delays, however AM West shows higher delay ratios for each of the destinations. AM West possibly has airline related factors increasing delay ratios for San Francisco, Seattle, and San Diego.

Conclusions:

Delays to these five destinations are likely impacted by non-airline related factors, as some destinations are delayed more than others regardless of the airline. Although, in comparison we can see that independent of the destination factor AM West shows a higher ratio of delayed to on time arrivals. If one wanted to avoid arrival delays I would recommend flying Alaska Airlines instead of AM West.

607 Week 4 Data Structures in R

Brandon Chung

2024-10-12

The tidy dataframe shows the observation of frequency (count of flights) per row.

Analysis

Conclusions: