Problem Statement: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
I decided to create the schema and corresponding tables in a MYSQL Database. In order to read the schema into r, need to open a connection and turn the data into an r dataframe for the other portion of this assignment.
The self sustaining script to create and populate the schema can be found here: https://gist.github.com/vindication09/fd719468dfc629335cb6ac4952b16127
Making a connection to MySQL You will need to change the host name, user name, and password associated to your own personal MYSQl instance. For security reason, I will only show the base code with empty parameters except for the database name.
library(RMySQL) mydb = dbConnect(MySQL(), user=‘’, password=’‘, dbname=’week5’, host=’’)
Check the tables in the schema
#check the tables that exist in the schema
dbListTables(mydb)
## [1] "carrier" "delay" "downstream" "ontime"
The table we want to work with is downstream. When making the schema, I used the following design: -Table with unique carrier ID and carrier name -Wide table with only On-Time counts for each city and carrier ID -Wide table with only Delay counts for each city and carrier ID -append On time and Delay into a bridge table
We need to make a table that has all the information in one table We can call upon the DBgetquery function to not only to run the query, but to turn the data into a dataframe, and clear the query cache
ws<-dbGetQuery(mydb, 'select a.carrier, b.status, b.LA, b.Phoenix, b.SanDiego, b.SanFrancisco, b.Seattle
from carrier a
join downstream b
on (a.carrier_id=b.carrier_id)
order by carrier desc;')
Lets check our new data frame
head(ws)
## carrier status LA Phoenix SanDiego SanFrancisco Seattle
## 1 AM WEST ON_TIME 694 4840 383 320 201
## 2 AM WEST DELAY 117 415 65 129 61
## 3 ALASKA ON_TIME 497 221 212 503 1841
## 4 ALASKA DELAY 62 12 20 102 305
With the data now in our working system, we can proceed to using the tools in tidyr and dplyr to transform the data and compare arrival delays for the two airlines.
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
I used the following resources to help complete this assigment #https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf #http://data.library.virginia.edu/a-tidyr-tutorial/ #https://rpubs.com/bradleyboehmke/data_wrangling
The data is in a form where we can gather by key pair values. This will take each column that contains a city, and combine them into one column
flight_longdf <- df %>% gather(data=ws, count, LA:Seattle)
head(flight_longdf, 24)
## carrier status . count
## 1 AM WEST ON_TIME LA 694
## 2 AM WEST DELAY LA 117
## 3 ALASKA ON_TIME LA 497
## 4 ALASKA DELAY LA 62
## 5 AM WEST ON_TIME Phoenix 4840
## 6 AM WEST DELAY Phoenix 415
## 7 ALASKA ON_TIME Phoenix 221
## 8 ALASKA DELAY Phoenix 12
## 9 AM WEST ON_TIME SanDiego 383
## 10 AM WEST DELAY SanDiego 65
## 11 ALASKA ON_TIME SanDiego 212
## 12 ALASKA DELAY SanDiego 20
## 13 AM WEST ON_TIME SanFrancisco 320
## 14 AM WEST DELAY SanFrancisco 129
## 15 ALASKA ON_TIME SanFrancisco 503
## 16 ALASKA DELAY SanFrancisco 102
## 17 AM WEST ON_TIME Seattle 201
## 18 AM WEST DELAY Seattle 61
## 19 ALASKA ON_TIME Seattle 1841
## 20 ALASKA DELAY Seattle 305
This still is not the data set we want to work with. Some more additional work can be done in order to make status its own column This will give us a data frame that shows carrier, City, ontime, delay we will use the spread function to complete this task
flight_middf <- flight_longdf %>% spread(status, count)
head(flight_middf, 24)
## carrier . DELAY ON_TIME
## 1 ALASKA LA 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA SanDiego 20 212
## 4 ALASKA SanFrancisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST LA 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST SanDiego 65 383
## 9 AM WEST SanFrancisco 129 320
## 10 AM WEST Seattle 61 201
status2.df<-data.frame(flight_middf)
status2.df
## carrier . DELAY ON_TIME
## 1 ALASKA LA 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA SanDiego 20 212
## 4 ALASKA SanFrancisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST LA 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST SanDiego 65 383
## 9 AM WEST SanFrancisco 129 320
## 10 AM WEST Seattle 61 201
We need to make a small chage and rename one of the columns. Aside from the column name, I now have a data frame that I can use to make comparisons regarding number of delayed flights.
#Lets rename the city column with a proper name
colnames(status2.df)[colnames(status2.df)=="."] <- "City"
head(status2.df)
## carrier City DELAY ON_TIME
## 1 ALASKA LA 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA SanDiego 20 212
## 4 ALASKA SanFrancisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST LA 117 694
Basic base function summary
#The data is now in a format that is optimized for downstream analysis
#Lets see a basic summary
summary(status2.df)
## carrier City DELAY ON_TIME
## Length:10 Length:10 Min. : 12.00 Min. : 201.0
## Class :character Class :character 1st Qu.: 61.25 1st Qu.: 245.8
## Mode :character Mode :character Median : 83.50 Median : 440.0
## Mean :128.80 Mean : 971.2
## 3rd Qu.:126.00 3rd Qu.: 646.2
## Max. :415.00 Max. :4840.0
We can use dplyr to refine a summary of the data. Before looking at a comparison by just Carrier, I want to compare at city level granularity
status2.df %>%
group_by(City)%>%
summarise(Mean_Ontime = mean(ON_TIME, na.rm=TRUE),
Mean_Delay = mean(DELAY, na.rm=TRUE))
## # A tibble: 5 x 3
## City Mean_Ontime Mean_Delay
## <chr> <dbl> <dbl>
## 1 LA 596 89.5
## 2 Phoenix 2530 214
## 3 SanDiego 298 42.5
## 4 SanFrancisco 412 116
## 5 Seattle 1021 183
We can add other dimensions of Descriptive Statistics such a Standard Deviation
#We can look at other dimensio
status2.df %>%
group_by(City)%>%
summarise(SD_Ontime = sd(ON_TIME),
SD_DELAY=sd(DELAY))
## # A tibble: 5 x 3
## City SD_Ontime SD_DELAY
## <chr> <dbl> <dbl>
## 1 LA 139 38.9
## 2 Phoenix 3266 285
## 3 SanDiego 121 31.8
## 4 SanFrancisco 129 19.1
## 5 Seattle 1160 173
Lets put all this information together
#Now lets compare means and standard deviations grouped by city
status2.df %>%
group_by(City)%>%
summarise(Mean_Ontime = mean(ON_TIME, na.rm=TRUE),
Mean_Delay = mean(DELAY, na.rm=TRUE),
SD_Ontime = sd(ON_TIME),
SD_DELAY=sd(DELAY))
## # A tibble: 5 x 5
## City Mean_Ontime Mean_Delay SD_Ontime SD_DELAY
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 LA 596 89.5 139 38.9
## 2 Phoenix 2530 214 3266 285
## 3 SanDiego 298 42.5 121 31.8
## 4 SanFrancisco 412 116 129 19.1
## 5 Seattle 1021 183 1160 173
We can get a better perspective by comparing at carrier granularity
#lets repeat the analysis but grouped by carrier
status2.df %>%
group_by(carrier)%>%
summarise(Mean_Ontime = mean(ON_TIME, na.rm=TRUE),
Mean_Delay = mean(DELAY, na.rm=TRUE),
SD_Ontime = sd(ON_TIME),
SD_DELAY=sd(DELAY))
## # A tibble: 2 x 5
## carrier Mean_Ontime Mean_Delay SD_Ontime SD_DELAY
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA 655 100 678 120
## 2 AM WEST 1288 157 1994 147
From this chart, ALASKA airline had a lower average of Delays with less standard deviation from the mean. It should be noted that AM WEST has a greater number of flights than ALASKA.
Lets conclude with some top level visualzations I want to re-examine city level granularity. I am making two charts,one for Delays and the other for On Time flights. The range difference between On Time and Delay is far too great to reasonably scale a combined chart.
#Delays
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.3
ggplot(status2.df, aes(factor(City), DELAY, fill = carrier)) +
geom_bar(stat="identity", position = "dodge") +
theme_minimal()+
labs( x="City", y="Number of Delayed Arrivals")
#on time
library(ggplot2)
ggplot(status2.df, aes(factor(City), ON_TIME, fill = carrier)) +
geom_bar(stat="identity", position = "dodge") +
theme_minimal()+
labs( x="City", y="Number of On Time Arrivals")
We can produce an additional chart that only compares the delays at the carrier level. We are only interested in arrival delays anyways. I will regroup my data and create a smalldata frame with the mean delays and arrivals
flightmean<-status2.df %>%
group_by(carrier)%>%
summarise(Avg_Ontime = mean(ON_TIME, na.rm=TRUE),
Avg_Delay = mean(DELAY, na.rm=TRUE))
flightmean.df<-data.frame(flightmean)
flightmean.df
## carrier Avg_Ontime Avg_Delay
## 1 ALASKA 654.8 100.2
## 2 AM WEST 1287.6 157.4
Plot
library(ggplot2)
ggplot(flightmean.df, aes(factor(carrier), Avg_Delay, fill = carrier)) +
geom_bar(stat="identity", position = "dodge") +
theme_minimal()+
labs( x="City", y="Average Number of Delayed Arrivals")
Overall, we can conclude that AM WEST has more delayed arrivals on average than ALASKA airlines.