DATA607 WK5 Homework: Tidy Data

Untidy Data
Tidy Data
Analysis
Conclusion

This purpose of this assignment was to mimic the “wide” dataset provided by creating a .CSV file, reading it into R, clean up the data then perform analysis.

library(tidyr)
library(dplyr)
library(reshape2)
library(ggplot2)
library(knitr)
library(kableExtra)

Untidy Data

#read file into R
fileloc = "https://raw.githubusercontent.com/javernw/JWCUNYAssignments/master/flights.csv"
flights_df <- read.csv(fileloc, sep = ",", stringsAsFactors = F)
flights_df %>%
  kable() %>%
  kable_styling()

ï..	X	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
ALASKA	delayed	62	12	20	102	305
		NA	NA	NA	NA	NA
AMWEST	on time	694	4840	383	320	201
AMWEST	delayed	117	415	65	129	61

#rename first 2 columns
flights_df <- rename(flights_df, "Airline" = `ï..`, "Status" = `X`)
flights_df%>%
  kable() %>%
  kable_styling()

Airline	Status	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
ALASKA	delayed	62	12	20	102	305
		NA	NA	NA	NA	NA
AMWEST	on time	694	4840	383	320	201
AMWEST	delayed	117	415	65	129	61

Tidy Data

tidyflights_df <- flights_df %>% 
  gather("City", "Count", 3:7, na.rm = T) %>% #pushes data in column into rows
  spread(Status, Count) %>%  
  rename("Delayed" = `delayed`, "OnTime" = `on time`)  

tidyflights_df %>%
  kable() %>%
  kable_styling()

Airline	City	Delayed	OnTime
ALASKA	Los.Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San.Diego	20	212
ALASKA	San.Francisco	102	503
ALASKA	Seattle	305	1841
AMWEST	Los.Angeles	117	694
AMWEST	Phoenix	415	4840
AMWEST	San.Diego	65	383
AMWEST	San.Francisco	129	320
AMWEST	Seattle	61	201

Analysis

tidyflights_df <- tidyflights_df %>% mutate(Ratio = round((`OnTime` / `Delayed`),2), "%Success" = round(`OnTime` / (`Delayed` + `OnTime`), 2))

## Warning: package 'bindrcpp' was built under R version 3.5.2

tidyflights_df %>%
  kable() %>%
  kable_styling()

Airline	City	Delayed	OnTime	Ratio	%Success
ALASKA	Los.Angeles	62	497	8.02	0.89
ALASKA	Phoenix	12	221	18.42	0.95
ALASKA	San.Diego	20	212	10.60	0.91
ALASKA	San.Francisco	102	503	4.93	0.83
ALASKA	Seattle	305	1841	6.04	0.86
AMWEST	Los.Angeles	117	694	5.93	0.86
AMWEST	Phoenix	415	4840	11.66	0.92
AMWEST	San.Diego	65	383	5.89	0.85
AMWEST	San.Francisco	129	320	2.48	0.71
AMWEST	Seattle	61	201	3.30	0.77

Ratio -> Ontime:Delayed. Let’s look at ALASKA Airline in Phoenix. For every 18 flights that is on time, there is 1 that’s delayed which is very good. Based on the success rate, Alaska airline has a 95% record of being on time in Phoenix that is.

# Delays in each city by the two airlines
tidyflights_df %>% 
  arrange(City) %>% 
  select(Airline, City, Delayed) %>% 
  group_by(Airline) %>% 
  spread(Airline, Delayed) %>%
  kable() %>%
  kable_styling() %>% add_header_above(c(" ", "Delays"=2))

	Delays
City	ALASKA	AMWEST
Los.Angeles	62	117
Phoenix	12	415
San.Diego	20	65
San.Francisco	102	129
Seattle	305	61

# % Delay for Alaska and Amwest 
tidyflights_df %>% 
  group_by(Airline) %>% 
  summarise(DelayRate = sum(`Delayed`) / (sum(`Delayed`) + sum(`OnTime`))) %>%
  spread(Airline, DelayRate) %>%
  kable() %>%
  kable_styling()

ALASKA	AMWEST
0.1327152	0.1089273

AMWEST has higher count in delays on a whole but lower delay rate.

alaska <- filter(tidyflights_df, Airline == "ALASKA") #alaska df
summary(alaska$`%Success`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.830   0.860   0.890   0.888   0.910   0.950

amwest <- filter(tidyflights_df, Airline == "AMWEST") #amwest df
summary(amwest$`%Success`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.710   0.770   0.850   0.822   0.860   0.920

Summary of success (on time) for each flight. Alaska shows better progress.

# line graph showing the delay ratios between alaska and amwest
ggplot(tidyflights_df, aes(x=City, y=Ratio, group = Airline)) + geom_line(aes(color=Airline)) +geom_point(aes(color=Airline))

Most delays occured in San Francisco with both airlines. A few factors can contribute to this such as weather or aviation system delay at that city’s airport.

Conclusion

Which Airline is better?

In the analysis, ALASKA has better on time arrival rates with an average of 88%, while AMWEST shows a lower delay rate than ALASKA; 11% and 13% respectively. AMWEST did more flights than ALASKA which means more chances of delays but as we know, when the population size increase, there is little difference in the outcome. However, more flights does not mean it’s better but can be that it has cheaper flights or more accomodating than the competition. It’s hard to tell which flight is better but with other factors remaining constant, AMWEST is better just because they have a lower delay rate and this assignment is focused on the analysis of the arrival delays between the two flights.

DATA607 WK5 Homework: Tidy Data

Javern Wilson

February 25, 2019

Untidy Data

Tidy Data

Analysis

Conclusion