For this assignment, we had to import some flight data and change the dataframe into a tidy one (using the tidyr library). So first, I started by getting my libraries.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(readr)
##Importing the data For my R file and work, I used mySqlConnection as shown below.
install.packages("RMySQL")
library(RMySQL)
mysqlconnection = dbConnect(RMySQL::MySQL(),
dbname='607_AssignmentW5',
host='localhost',
port=3306,
user='root',
password='')
However, since I had to delete the password for this command and it will not work, I saved it as a csv to be called for this RMarkdown file
result = dbSendQuery(mysqlconnection, "select * from arrivalDelays")
arrivalDelays = fetch(result)
write.csv(arrivalDelays,file='/Users/Ari/Data607/assignmentW5/arrivalDelays.csv')
So now we can grab it as a csv. We also need to get rid of the extra id column that was generated from the sql table.
arrivalDelays <- read_csv("arrivalDelays.csv")
## New names:
## Rows: 4 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): airline, delayStatus dbl (7): ...1, id, LosAngeles, Phoenix, SanDiego,
## SanFrancisco, Seatte
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
#clean up the data and set it as a simple dataframe
arrivalDelays <- subset(arrivalDelays, select = -1)
as_tibble(arrivalDelays)
## # A tibble: 4 × 8
## id airline delayStatus LosAngeles Phoenix SanDiego SanFrancisco Seatte
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 ALASKA on time 497 221 212 503 1841
## 2 2 ALASKA delayed 62 12 20 102 305
## 3 3 AM WEST on time 694 4840 383 320 201
## 4 4 AM WEST delayed 117 415 65 129 61
Now we must change this dataframe into a tidy one. From the lecture and readings, I learned a tidy table means “1 row for each observations”.
Right now, the above table contains 2 airlines that have 2 rows each for 2 different delay status (on time, delayed) but, each row contains 5 different observations in the format of the 5 columns based on locations (LosAngeles, Phoenix, SanDiego, SanFrancisco, Seatte). We need to separate each row into 5 rows that reflection each of these locations. In the end, we will have 4 (2 airlines * 2 delay status) * 5 locations = 20 rows.
In order to do this, I used the pivot_longer command and had it separate the rows based on the 5 location columns. They are then combined into 2 new columns: 1 for the location name and 1 for the value (# of flights) that used to sit in the individual location column.
arrivalDelay_tidy <-
pivot_longer(arrivalDelays,
cols=c('LosAngeles', 'Phoenix', 'SanDiego', 'SanFrancisco', 'Seatte'),
names_to = 'Locations',
values_to = 'Flights')
arrivalDelay_tidy <- arrivalDelay_tidy[order(arrivalDelay_tidy$Locations),]
arrivalDelay_tidy
## # A tibble: 20 × 5
## id airline delayStatus Locations Flights
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1 ALASKA on time LosAngeles 497
## 2 2 ALASKA delayed LosAngeles 62
## 3 3 AM WEST on time LosAngeles 694
## 4 4 AM WEST delayed LosAngeles 117
## 5 1 ALASKA on time Phoenix 221
## 6 2 ALASKA delayed Phoenix 12
## 7 3 AM WEST on time Phoenix 4840
## 8 4 AM WEST delayed Phoenix 415
## 9 1 ALASKA on time SanDiego 212
## 10 2 ALASKA delayed SanDiego 20
## 11 3 AM WEST on time SanDiego 383
## 12 4 AM WEST delayed SanDiego 65
## 13 1 ALASKA on time SanFrancisco 503
## 14 2 ALASKA delayed SanFrancisco 102
## 15 3 AM WEST on time SanFrancisco 320
## 16 4 AM WEST delayed SanFrancisco 129
## 17 1 ALASKA on time Seatte 1841
## 18 2 ALASKA delayed Seatte 305
## 19 3 AM WEST on time Seatte 201
## 20 4 AM WEST delayed Seatte 61
arrivalDelay_tidy %>% group_by(airline) %>%
summarise(count=sum(Flights),
.groups = 'drop')
## # A tibble: 2 × 2
## airline count
## <chr> <dbl>
## 1 ALASKA 3775
## 2 AM WEST 7225
arrivalDelay_tidy %>%
group_by(airline, delayStatus) %>%
summarise(count=sum(Flights),
.groups = 'drop')
## # A tibble: 4 × 3
## airline delayStatus count
## <chr> <chr> <dbl>
## 1 ALASKA delayed 501
## 2 ALASKA on time 3274
## 3 AM WEST delayed 787
## 4 AM WEST on time 6438
The two tables above show us the number of flights from each airline and the breakdown of that airlines’ on-time & delayed flights. AM WEST has more flights than ALASKA and less delayed flights as when you work out the ratios: 1 in 6.5 flights from ALASKA are delayed and 1 in 8.2 flights from AM WEST are delayed.
arrivalDelay_tidy %>%
group_by(Locations) %>%
summarise(count=sum(Flights),
.groups = 'drop')
## # A tibble: 5 × 2
## Locations count
## <chr> <dbl>
## 1 LosAngeles 1370
## 2 Phoenix 5488
## 3 SanDiego 680
## 4 SanFrancisco 1054
## 5 Seatte 2408
arrivalDelay_tidy %>%
group_by(Locations, delayStatus) %>%
summarise(count=sum(Flights),
.groups = 'drop')
## # A tibble: 10 × 3
## Locations delayStatus count
## <chr> <chr> <dbl>
## 1 LosAngeles delayed 179
## 2 LosAngeles on time 1191
## 3 Phoenix delayed 427
## 4 Phoenix on time 5061
## 5 SanDiego delayed 85
## 6 SanDiego on time 595
## 7 SanFrancisco delayed 231
## 8 SanFrancisco on time 823
## 9 Seatte delayed 366
## 10 Seatte on time 2042
The two tables above show us the number of flights arriving to each
location and the breakdown of that location’s on-time & delayed
arrived flights. Phoenix has the most arrival flights and least amount
of delayed flights is San Francisco. The ratio breakdown of delayed
flights to on-time flights is below:
- LosAngeles: 1 in 6.6
- Phoenix: 1 in 11.8
- SanDiego: 1 in 7
- SanFrancisco: 1 in 3.6
- Seattle: 1 in 5.6