#Resubmission attempt for week 4’s Data Structures in R assignment
Introduction:
This week’s assignment is about tidying and transforming data in R using the tidyr and dplyr packages. I will attempt to import the given data in its original form into MySQL Workbench and tidy the data in R.
#reading the csv file that I created into R
library("RMySQL")
## Loading required package: DBI
mydb <- dbConnect(MySQL(),user= 'bchung',password= 'august29!',dbname= '607',host= 'bchungcuny607.mysql.database.azure.com')
dbListTables(mydb)
## [1] "customers" "invoices" "movie ratings" "products"
## [5] "tidyingdata"
#reading the data table into R
Data <- dbReadTable(mydb,"tidyingdata")
#the data was imported with the unnamed variables being assigned MyUnknownColumn and MyUnknownColumn_.0. by R studio.
Data
## MyUnknownColumn MyUnknownColumn_.0. Los.Angeles Phoenix San.Diego
## 1 ALASKA on time 497 221 212
## 2 delayed 62 12 20
## 3
## 4 AM WEST on time 694 4,840 383
## 5 delayed 117 415 65
## San.Francisco Seattle
## 1 503 1,841
## 2 102 305
## 3
## 4 320 201
## 5 129 61
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
# Removing the third (blank row) observation in the data frame
Data <- Data[-c(3),]
# Renaming the originally blank column names
Data <- Data %>%
rename(airline = MyUnknownColumn,
status = MyUnknownColumn_.0.)
# Mutating variable Airline to fill in the blank values
Data <- Data %>%
mutate(airline = c('ALASKA','ALASKA','AM_WEST','AM_WEST'))
# Showing data frame after basic cleaning
Data
## airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM_WEST on time 694 4,840 383 320 201
## 5 AM_WEST delayed 117 415 65 129 61
# Pivoting long for analysis
Data <- Data %>%
pivot_longer(
cols = 3:7,
names_to = 'destination',
values_to = 'frequency')
# Converting character data types into number for analysis
Data$frequency <- as.numeric(gsub(",", "", Data$frequency))
Data
## # A tibble: 20 × 4
## airline status destination frequency
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San.Diego 20
## 9 ALASKA delayed San.Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM_WEST on time Los.Angeles 694
## 12 AM_WEST on time Phoenix 4840
## 13 AM_WEST on time San.Diego 383
## 14 AM_WEST on time San.Francisco 320
## 15 AM_WEST on time Seattle 201
## 16 AM_WEST delayed Los.Angeles 117
## 17 AM_WEST delayed Phoenix 415
## 18 AM_WEST delayed San.Diego 65
## 19 AM_WEST delayed San.Francisco 129
## 20 AM_WEST delayed Seattle 61
# Calculating the delay ratios for each airline across all destinations
delayratio.airline <- Data %>%
group_by(airline,status) %>%
summarise(frequency = sum(frequency)) %>%
mutate(delay.ratio = frequency / sum(frequency)) %>%
filter(status == "delayed")
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
delayratio.airline
## # A tibble: 2 × 4
## # Groups: airline [2]
## airline status frequency delay.ratio
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA delayed 501 0.133
## 2 AM_WEST delayed 787 0.109
As we can see, Alaska Airlines has a higher ratio of delayed flights when compared to AM_West even with a lower frequency of flights.
#Comparing the delay ratios of each airline by destination can give us deeper insight to potential reasons for delays - if delays are typically caused by destination related factors or airline related factors.
delayratio.compare <- Data %>%
group_by(airline, destination, status) %>%
summarise(frequency = sum(frequency)) %>%
mutate(delay.ratio = frequency / sum(frequency)) %>%
filter(status =="delayed")
## `summarise()` has grouped output by 'airline', 'destination'. You can override
## using the `.groups` argument.
delayratio.compare
## # A tibble: 10 × 5
## # Groups: airline, destination [10]
## airline destination status frequency delay.ratio
## <chr> <chr> <chr> <dbl> <dbl>
## 1 ALASKA Los.Angeles delayed 62 0.111
## 2 ALASKA Phoenix delayed 12 0.0515
## 3 ALASKA San.Diego delayed 20 0.0862
## 4 ALASKA San.Francisco delayed 102 0.169
## 5 ALASKA Seattle delayed 305 0.142
## 6 AM_WEST Los.Angeles delayed 117 0.144
## 7 AM_WEST Phoenix delayed 415 0.0790
## 8 AM_WEST San.Diego delayed 65 0.145
## 9 AM_WEST San.Francisco delayed 129 0.287
## 10 AM_WEST Seattle delayed 61 0.233
#PLotting the delay ratios by destination
library(ggplot2)
ggplot(delayratio.compare, aes(x = airline, y = delay.ratio)) +
geom_point() +
facet_wrap(vars(destination))
Doing a quick eye comparison of the graphs of delay ratios by destination above we can see that across both airlines the arrival delay ratios are aligned in terms of which desinations have more delays, however AM West shows higher delay ratios for each of the destinations. AM West possibly has airline related factors increasing delay ratios for San Francisco, Seattle, and San Diego.
Delays to these five destinations are likely impacted by non-airline related factors, as some destinations are delayed more than others regardless of the airline. Although, in comparison we can see that independent of the destination factor AM West shows a higher ratio of delayed to on time arrivals. If one wanted to avoid arrival delays I would recommend flying Alaska Airlines instead of AM West.