607_AssignmentW5

Data 607 - Assignment W5

For this assignment, we had to import some flight data and change the dataframe into a tidy one (using the tidyr library). So first, I started by getting my libraries.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(readr)

##Importing the data For my R file and work, I used mySqlConnection as shown below.

install.packages("RMySQL")
library(RMySQL)

mysqlconnection = dbConnect(RMySQL::MySQL(),
                            dbname='607_AssignmentW5',
                            host='localhost',
                            port=3306,
                            user='root',
                            password='')

However, since I had to delete the password for this command and it will not work, I saved it as a csv to be called for this RMarkdown file

result = dbSendQuery(mysqlconnection, "select * from arrivalDelays") 
arrivalDelays = fetch(result)
write.csv(arrivalDelays,file='/Users/Ari/Data607/assignmentW5/arrivalDelays.csv')

So now we can grab it as a csv. We also need to get rid of the extra id column that was generated from the sql table.

arrivalDelays <- read_csv("arrivalDelays.csv")

## New names:
## Rows: 4 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): airline, delayStatus dbl (7): ...1, id, LosAngeles, Phoenix, SanDiego,
## SanFrancisco, Seatte
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

#clean up the data and set it as a simple dataframe
arrivalDelays <- subset(arrivalDelays, select = -1)
as_tibble(arrivalDelays)

## # A tibble: 4 × 8
##      id airline delayStatus LosAngeles Phoenix SanDiego SanFrancisco Seatte
##   <dbl> <chr>   <chr>            <dbl>   <dbl>    <dbl>        <dbl>  <dbl>
## 1     1 ALASKA  on time            497     221      212          503   1841
## 2     2 ALASKA  delayed             62      12       20          102    305
## 3     3 AM WEST on time            694    4840      383          320    201
## 4     4 AM WEST delayed            117     415       65          129     61

Cleaning the data

Now we must change this dataframe into a tidy one. From the lecture and readings, I learned a tidy table means “1 row for each observations”.

Right now, the above table contains 2 airlines that have 2 rows each for 2 different delay status (on time, delayed) but, each row contains 5 different observations in the format of the 5 columns based on locations (LosAngeles, Phoenix, SanDiego, SanFrancisco, Seatte). We need to separate each row into 5 rows that reflection each of these locations. In the end, we will have 4 (2 airlines * 2 delay status) * 5 locations = 20 rows.

In order to do this, I used the pivot_longer command and had it separate the rows based on the 5 location columns. They are then combined into 2 new columns: 1 for the location name and 1 for the value (# of flights) that used to sit in the individual location column.

arrivalDelay_tidy <- 
  pivot_longer(arrivalDelays, 
             cols=c('LosAngeles', 'Phoenix', 'SanDiego', 'SanFrancisco', 'Seatte'),
             names_to = 'Locations',
             values_to = 'Flights')
arrivalDelay_tidy <- arrivalDelay_tidy[order(arrivalDelay_tidy$Locations),]
arrivalDelay_tidy

## # A tibble: 20 × 5
##       id airline delayStatus Locations    Flights
##    <dbl> <chr>   <chr>       <chr>          <dbl>
##  1     1 ALASKA  on time     LosAngeles       497
##  2     2 ALASKA  delayed     LosAngeles        62
##  3     3 AM WEST on time     LosAngeles       694
##  4     4 AM WEST delayed     LosAngeles       117
##  5     1 ALASKA  on time     Phoenix          221
##  6     2 ALASKA  delayed     Phoenix           12
##  7     3 AM WEST on time     Phoenix         4840
##  8     4 AM WEST delayed     Phoenix          415
##  9     1 ALASKA  on time     SanDiego         212
## 10     2 ALASKA  delayed     SanDiego          20
## 11     3 AM WEST on time     SanDiego         383
## 12     4 AM WEST delayed     SanDiego          65
## 13     1 ALASKA  on time     SanFrancisco     503
## 14     2 ALASKA  delayed     SanFrancisco     102
## 15     3 AM WEST on time     SanFrancisco     320
## 16     4 AM WEST delayed     SanFrancisco     129
## 17     1 ALASKA  on time     Seatte          1841
## 18     2 ALASKA  delayed     Seatte           305
## 19     3 AM WEST on time     Seatte           201
## 20     4 AM WEST delayed     Seatte            61

Anaylsis

arrivalDelay_tidy %>% group_by(airline) %>% 
  summarise(count=sum(Flights),
            .groups = 'drop')

## # A tibble: 2 × 2
##   airline count
##   <chr>   <dbl>
## 1 ALASKA   3775
## 2 AM WEST  7225

arrivalDelay_tidy %>% 
  group_by(airline, delayStatus) %>% 
  summarise(count=sum(Flights),
            .groups = 'drop')

## # A tibble: 4 × 3
##   airline delayStatus count
##   <chr>   <chr>       <dbl>
## 1 ALASKA  delayed       501
## 2 ALASKA  on time      3274
## 3 AM WEST delayed       787
## 4 AM WEST on time      6438

The two tables above show us the number of flights from each airline and the breakdown of that airlines’ on-time & delayed flights. AM WEST has more flights than ALASKA and less delayed flights as when you work out the ratios: 1 in 6.5 flights from ALASKA are delayed and 1 in 8.2 flights from AM WEST are delayed.

arrivalDelay_tidy %>% 
  group_by(Locations) %>% 
  summarise(count=sum(Flights),
            .groups = 'drop')

## # A tibble: 5 × 2
##   Locations    count
##   <chr>        <dbl>
## 1 LosAngeles    1370
## 2 Phoenix       5488
## 3 SanDiego       680
## 4 SanFrancisco  1054
## 5 Seatte        2408

arrivalDelay_tidy %>% 
  group_by(Locations, delayStatus) %>% 
  summarise(count=sum(Flights),
            .groups = 'drop')

## # A tibble: 10 × 3
##    Locations    delayStatus count
##    <chr>        <chr>       <dbl>
##  1 LosAngeles   delayed       179
##  2 LosAngeles   on time      1191
##  3 Phoenix      delayed       427
##  4 Phoenix      on time      5061
##  5 SanDiego     delayed        85
##  6 SanDiego     on time       595
##  7 SanFrancisco delayed       231
##  8 SanFrancisco on time       823
##  9 Seatte       delayed       366
## 10 Seatte       on time      2042

The two tables above show us the number of flights arriving to each location and the breakdown of that location’s on-time & delayed arrived flights. Phoenix has the most arrival flights and least amount of delayed flights is San Francisco. The ratio breakdown of delayed flights to on-time flights is below:
- LosAngeles: 1 in 6.6
- Phoenix: 1 in 11.8
- SanDiego: 1 in 7
- SanFrancisco: 1 in 3.6
- Seattle: 1 in 5.6

607_AssignmentW5

Ariann Chai

2023-09-25

Data 607 - Assignment W5

Cleaning the data

Anaylsis