The following is my homework assignment for Week 5 (TIDY/DPLYR).

I am loading the data from MySQL and all the libraries I will use (tidyr, dplyr, knitr, stringr, RMySQL).

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(knitr)
library(RMySQL)
## Loading required package: DBI
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "CUNY607",
                 host = "localhost",
                 port = 3306,
                 user = "root",
                 password = "CUNYDATA")

table_names <- dbListTables(con)
HW5 <- dbReadTable(con, "HW5")

# Fill in missing arilines before doing cleanup
HW5[HW5$Performance == "delayed" & HW5$LosAngeles == "62" &  HW5$Airline == "", "Airline"] <- "ALASKA"
HW5[HW5$Performance == "delayed" & HW5$LosAngeles == "117" &  HW5$Airline == "", "Airline"] <- "AM WEST"

kable(HW5, caption="Table 1. Data from MySQL with empty cells filled in")
Table 1. Data from MySQL with empty cells filled in
ID Airline Performance LosAngeles Phoenix SanDiego SanFrancisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
3 NA NA NA NA NA
4 AM WEST on time 694 4840 383 320 201
5 AM WEST delayed 117 415 65 129 61

Cleaning the data with dplyr and tidyr.

HW5<-tbl_df(HW5)
HW5a <- HW5 %>% filter(Performance != "")  %>% gather("Destination","Count",4:8)
HW5a <- HW5a[2:5]
HW5a <- HW5a %>% group_by(Airline, Performance, Destination) %>%spread(Performance, Count)
colnames(HW5a) <- str_to_title(colnames(HW5a))
kable(HW5a, caption="Table 2. Airline Data Tidy and Spread for Analysis")
Table 2. Airline Data Tidy and Spread for Analysis
Airline Destination Delayed On Time
ALASKA LosAngeles 62 497
ALASKA Phoenix 12 221
ALASKA SanDiego 20 212
ALASKA SanFrancisco 102 503
ALASKA Seattle 305 1841
AM WEST LosAngeles 117 694
AM WEST Phoenix 415 4840
AM WEST SanDiego 65 383
AM WEST SanFrancisco 129 320
AM WEST Seattle 61 201

Analyze Data

#prep data for analysis
HW5b <- HW5a %>% mutate(Total = Delayed+`On Time`, PercentDelayed = round(Delayed/Total,2)) 

#Method 1
HW5c <- HW5b %>% group_by(Airline) %>% summarise(AirlineDelay = mean(PercentDelayed,2))
kable(HW5c, caption="Table 3. Airline Total Delay by Taking Mean of Percent Delay of each Destination")
Table 3. Airline Total Delay by Taking Mean of Percent Delay of each Destination
Airline AirlineDelay
ALASKA 0.11
AM WEST 0.15
#Method 2
HW5d <- HW5b %>% group_by(Airline) %>% summarise(TotDelay = sum(Delayed),TotOnTime = sum(`On Time`), AirTotal = sum(Total)) %>% mutate(AirDelay = round(TotDelay/AirTotal,2))
kable(HW5d, caption="Table 4. Airline Total Delay by summing all counts then doing a percentage")
Table 4. Airline Total Delay by summing all counts then doing a percentage
Airline TotDelay TotOnTime AirTotal AirDelay
ALASKA 501 3274 3775 0.13
AM WEST 787 6438 7225 0.11

Note that table 3 and 4 show different values based upon two approaches. Table 4 would be the correct one for this analysis and show that Alaska Airlines has less delayes total than America West by 4% points.

Notes: I had to use stackoverflow to determine why when I used the spread function, I would get NAs with data staggered. I realized in reviewing comments that I needed to group the data first.