HW 5

The following is my homework assignment for Week 5 (TIDY/DPLYR).

I am loading the data from MySQL and all the libraries I will use (tidyr, dplyr, knitr, stringr, RMySQL).

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)
library(knitr)
library(RMySQL)

## Loading required package: DBI

con <- dbConnect(RMySQL::MySQL(),
                 dbname = "CUNY607",
                 host = "localhost",
                 port = 3306,
                 user = "root",
                 password = "CUNYDATA")

table_names <- dbListTables(con)
HW5 <- dbReadTable(con, "HW5")

# Fill in missing arilines before doing cleanup
HW5[HW5$Performance == "delayed" & HW5$LosAngeles == "62" &  HW5$Airline == "", "Airline"] <- "ALASKA"
HW5[HW5$Performance == "delayed" & HW5$LosAngeles == "117" &  HW5$Airline == "", "Airline"] <- "AM WEST"

kable(HW5, caption="Table 1. Data from MySQL with empty cells filled in")

Table 1. Data from MySQL with empty cells filled in
ID	Airline	Performance	LosAngeles	Phoenix	SanDiego	SanFrancisco	Seattle
1	ALASKA	on time	497	221	212	503	1841
2	ALASKA	delayed	62	12	20	102	305
3			NA	NA	NA	NA	NA
4	AM WEST	on time	694	4840	383	320	201
5	AM WEST	delayed	117	415	65	129	61

Cleaning the data with dplyr and tidyr.

HW5<-tbl_df(HW5)
HW5a <- HW5 %>% filter(Performance != "")  %>% gather("Destination","Count",4:8)
HW5a <- HW5a[2:5]
HW5a <- HW5a %>% group_by(Airline, Performance, Destination) %>%spread(Performance, Count)
colnames(HW5a) <- str_to_title(colnames(HW5a))
kable(HW5a, caption="Table 2. Airline Data Tidy and Spread for Analysis")

Table 2. Airline Data Tidy and Spread for Analysis
Airline	Destination	Delayed	On Time
ALASKA	LosAngeles	62	497
ALASKA	Phoenix	12	221
ALASKA	SanDiego	20	212
ALASKA	SanFrancisco	102	503
ALASKA	Seattle	305	1841
AM WEST	LosAngeles	117	694
AM WEST	Phoenix	415	4840
AM WEST	SanDiego	65	383
AM WEST	SanFrancisco	129	320
AM WEST	Seattle	61	201

Analyze Data

#prep data for analysis
HW5b <- HW5a %>% mutate(Total = Delayed+`On Time`, PercentDelayed = round(Delayed/Total,2)) 

#Method 1
HW5c <- HW5b %>% group_by(Airline) %>% summarise(AirlineDelay = mean(PercentDelayed,2))
kable(HW5c, caption="Table 3. Airline Total Delay by Taking Mean of Percent Delay of each Destination")

Table 3. Airline Total Delay by Taking Mean of Percent Delay of each Destination
Airline	AirlineDelay
ALASKA	0.11
AM WEST	0.15

#Method 2
HW5d <- HW5b %>% group_by(Airline) %>% summarise(TotDelay = sum(Delayed),TotOnTime = sum(`On Time`), AirTotal = sum(Total)) %>% mutate(AirDelay = round(TotDelay/AirTotal,2))
kable(HW5d, caption="Table 4. Airline Total Delay by summing all counts then doing a percentage")

Table 4. Airline Total Delay by summing all counts then doing a percentage
Airline	TotDelay	TotOnTime	AirTotal	AirDelay
ALASKA	501	3274	3775	0.13
AM WEST	787	6438	7225	0.11

Note that table 3 and 4 show different values based upon two approaches. Table 4 would be the correct one for this analysis and show that Alaska Airlines has less delayes total than America West by 4% points.

Notes: I had to use stackoverflow to determine why when I used the spread function, I would get NAs with data staggered. I realized in reviewing comments that I needed to group the data first.

HW 5

Cesar L. Espitia

March 1, 2017