The following is my homework assignment for Week 5 (TIDY/DPLYR).
I am loading the data from MySQL and all the libraries I will use (tidyr, dplyr, knitr, stringr, RMySQL).
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(knitr)
library(RMySQL)
## Loading required package: DBI
con <- dbConnect(RMySQL::MySQL(),
dbname = "CUNY607",
host = "localhost",
port = 3306,
user = "root",
password = "CUNYDATA")
table_names <- dbListTables(con)
HW5 <- dbReadTable(con, "HW5")
# Fill in missing arilines before doing cleanup
HW5[HW5$Performance == "delayed" & HW5$LosAngeles == "62" & HW5$Airline == "", "Airline"] <- "ALASKA"
HW5[HW5$Performance == "delayed" & HW5$LosAngeles == "117" & HW5$Airline == "", "Airline"] <- "AM WEST"
kable(HW5, caption="Table 1. Data from MySQL with empty cells filled in")
ID | Airline | Performance | LosAngeles | Phoenix | SanDiego | SanFrancisco | Seattle |
---|---|---|---|---|---|---|---|
1 | ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
2 | ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
3 | NA | NA | NA | NA | NA | ||
4 | AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
5 | AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
Cleaning the data with dplyr and tidyr.
HW5<-tbl_df(HW5)
HW5a <- HW5 %>% filter(Performance != "") %>% gather("Destination","Count",4:8)
HW5a <- HW5a[2:5]
HW5a <- HW5a %>% group_by(Airline, Performance, Destination) %>%spread(Performance, Count)
colnames(HW5a) <- str_to_title(colnames(HW5a))
kable(HW5a, caption="Table 2. Airline Data Tidy and Spread for Analysis")
Airline | Destination | Delayed | On Time |
---|---|---|---|
ALASKA | LosAngeles | 62 | 497 |
ALASKA | Phoenix | 12 | 221 |
ALASKA | SanDiego | 20 | 212 |
ALASKA | SanFrancisco | 102 | 503 |
ALASKA | Seattle | 305 | 1841 |
AM WEST | LosAngeles | 117 | 694 |
AM WEST | Phoenix | 415 | 4840 |
AM WEST | SanDiego | 65 | 383 |
AM WEST | SanFrancisco | 129 | 320 |
AM WEST | Seattle | 61 | 201 |
Analyze Data
#prep data for analysis
HW5b <- HW5a %>% mutate(Total = Delayed+`On Time`, PercentDelayed = round(Delayed/Total,2))
#Method 1
HW5c <- HW5b %>% group_by(Airline) %>% summarise(AirlineDelay = mean(PercentDelayed,2))
kable(HW5c, caption="Table 3. Airline Total Delay by Taking Mean of Percent Delay of each Destination")
Airline | AirlineDelay |
---|---|
ALASKA | 0.11 |
AM WEST | 0.15 |
#Method 2
HW5d <- HW5b %>% group_by(Airline) %>% summarise(TotDelay = sum(Delayed),TotOnTime = sum(`On Time`), AirTotal = sum(Total)) %>% mutate(AirDelay = round(TotDelay/AirTotal,2))
kable(HW5d, caption="Table 4. Airline Total Delay by summing all counts then doing a percentage")
Airline | TotDelay | TotOnTime | AirTotal | AirDelay |
---|---|---|---|---|
ALASKA | 501 | 3274 | 3775 | 0.13 |
AM WEST | 787 | 6438 | 7225 | 0.11 |
Note that table 3 and 4 show different values based upon two approaches. Table 4 would be the correct one for this analysis and show that Alaska Airlines has less delayes total than America West by 4% points.
Notes: I had to use stackoverflow to determine why when I used the spread function, I would get NAs with data staggered. I realized in reviewing comments that I needed to group the data first.