DATA 607 Week 5

Problem Statement: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

I decided to create the schema and corresponding tables in a MYSQL Database. In order to read the schema into r, need to open a connection and turn the data into an r dataframe for the other portion of this assignment.

The self sustaining script to create and populate the schema can be found here: https://gist.github.com/vindication09/fd719468dfc629335cb6ac4952b16127

Making a connection to MySQL You will need to change the host name, user name, and password associated to your own personal MYSQl instance. For security reason, I will only show the base code with empty parameters except for the database name.

library(RMySQL) mydb = dbConnect(MySQL(), user=‘’, password=’‘, dbname=’week5’, host=’’)

Check the tables in the schema

#check the tables that exist in the schema 
dbListTables(mydb)

## [1] "carrier"    "delay"      "downstream" "ontime"

The table we want to work with is downstream. When making the schema, I used the following design: -Table with unique carrier ID and carrier name -Wide table with only On-Time counts for each city and carrier ID -Wide table with only Delay counts for each city and carrier ID -append On time and Delay into a bridge table

We need to make a table that has all the information in one table We can call upon the DBgetquery function to not only to run the query, but to turn the data into a dataframe, and clear the query cache

ws<-dbGetQuery(mydb, 'select a.carrier, b.status,  b.LA, b.Phoenix, b.SanDiego, b.SanFrancisco, b.Seattle
           from carrier a 
           join downstream b 
           on (a.carrier_id=b.carrier_id)
               order by carrier desc;')

Lets check our new data frame

head(ws)

##   carrier  status  LA Phoenix SanDiego SanFrancisco Seattle
## 1 AM WEST ON_TIME 694    4840      383          320     201
## 2 AM WEST   DELAY 117     415       65          129      61
## 3  ALASKA ON_TIME 497     221      212          503    1841
## 4  ALASKA   DELAY  62      12       20          102     305

With the data now in our working system, we can proceed to using the tools in tidyr and dplyr to transform the data and compare arrival delays for the two airlines.

library(tidyr)

## Warning: package 'tidyr' was built under R version 3.4.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.4.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

I used the following resources to help complete this assigment #https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf #http://data.library.virginia.edu/a-tidyr-tutorial/ #https://rpubs.com/bradleyboehmke/data_wrangling

The data is in a form where we can gather by key pair values. This will take each column that contains a city, and combine them into one column

flight_longdf <- df %>% gather(data=ws, count, LA:Seattle)
head(flight_longdf, 24)

##    carrier  status            . count
## 1  AM WEST ON_TIME           LA   694
## 2  AM WEST   DELAY           LA   117
## 3   ALASKA ON_TIME           LA   497
## 4   ALASKA   DELAY           LA    62
## 5  AM WEST ON_TIME      Phoenix  4840
## 6  AM WEST   DELAY      Phoenix   415
## 7   ALASKA ON_TIME      Phoenix   221
## 8   ALASKA   DELAY      Phoenix    12
## 9  AM WEST ON_TIME     SanDiego   383
## 10 AM WEST   DELAY     SanDiego    65
## 11  ALASKA ON_TIME     SanDiego   212
## 12  ALASKA   DELAY     SanDiego    20
## 13 AM WEST ON_TIME SanFrancisco   320
## 14 AM WEST   DELAY SanFrancisco   129
## 15  ALASKA ON_TIME SanFrancisco   503
## 16  ALASKA   DELAY SanFrancisco   102
## 17 AM WEST ON_TIME      Seattle   201
## 18 AM WEST   DELAY      Seattle    61
## 19  ALASKA ON_TIME      Seattle  1841
## 20  ALASKA   DELAY      Seattle   305

This still is not the data set we want to work with. Some more additional work can be done in order to make status its own column This will give us a data frame that shows carrier, City, ontime, delay we will use the spread function to complete this task

flight_middf <- flight_longdf %>% spread(status, count)
head(flight_middf, 24)

##    carrier            . DELAY ON_TIME
## 1   ALASKA           LA    62     497
## 2   ALASKA      Phoenix    12     221
## 3   ALASKA     SanDiego    20     212
## 4   ALASKA SanFrancisco   102     503
## 5   ALASKA      Seattle   305    1841
## 6  AM WEST           LA   117     694
## 7  AM WEST      Phoenix   415    4840
## 8  AM WEST     SanDiego    65     383
## 9  AM WEST SanFrancisco   129     320
## 10 AM WEST      Seattle    61     201

status2.df<-data.frame(flight_middf)
status2.df

##    carrier            . DELAY ON_TIME
## 1   ALASKA           LA    62     497
## 2   ALASKA      Phoenix    12     221
## 3   ALASKA     SanDiego    20     212
## 4   ALASKA SanFrancisco   102     503
## 5   ALASKA      Seattle   305    1841
## 6  AM WEST           LA   117     694
## 7  AM WEST      Phoenix   415    4840
## 8  AM WEST     SanDiego    65     383
## 9  AM WEST SanFrancisco   129     320
## 10 AM WEST      Seattle    61     201

We need to make a small chage and rename one of the columns. Aside from the column name, I now have a data frame that I can use to make comparisons regarding number of delayed flights.

#Lets rename the city column with a proper name 
colnames(status2.df)[colnames(status2.df)=="."] <- "City"
head(status2.df)

##   carrier         City DELAY ON_TIME
## 1  ALASKA           LA    62     497
## 2  ALASKA      Phoenix    12     221
## 3  ALASKA     SanDiego    20     212
## 4  ALASKA SanFrancisco   102     503
## 5  ALASKA      Seattle   305    1841
## 6 AM WEST           LA   117     694

Basic base function summary

#The data is now in a format that is optimized for downstream analysis 
#Lets see a basic summary 
summary(status2.df)

##    carrier              City               DELAY           ON_TIME      
##  Length:10          Length:10          Min.   : 12.00   Min.   : 201.0  
##  Class :character   Class :character   1st Qu.: 61.25   1st Qu.: 245.8  
##  Mode  :character   Mode  :character   Median : 83.50   Median : 440.0  
##                                        Mean   :128.80   Mean   : 971.2  
##                                        3rd Qu.:126.00   3rd Qu.: 646.2  
##                                        Max.   :415.00   Max.   :4840.0

We can use dplyr to refine a summary of the data. Before looking at a comparison by just Carrier, I want to compare at city level granularity

status2.df %>%
  group_by(City)%>% 
  summarise(Mean_Ontime = mean(ON_TIME, na.rm=TRUE),
            Mean_Delay = mean(DELAY, na.rm=TRUE))

## # A tibble: 5 x 3
##   City         Mean_Ontime Mean_Delay
##   <chr>              <dbl>      <dbl>
## 1 LA                   596       89.5
## 2 Phoenix             2530      214  
## 3 SanDiego             298       42.5
## 4 SanFrancisco         412      116  
## 5 Seattle             1021      183

We can add other dimensions of Descriptive Statistics such a Standard Deviation

#We can look at other dimensio
status2.df %>%
  group_by(City)%>% 
  summarise(SD_Ontime = sd(ON_TIME),
            SD_DELAY=sd(DELAY))

## # A tibble: 5 x 3
##   City         SD_Ontime SD_DELAY
##   <chr>            <dbl>    <dbl>
## 1 LA                 139     38.9
## 2 Phoenix           3266    285  
## 3 SanDiego           121     31.8
## 4 SanFrancisco       129     19.1
## 5 Seattle           1160    173

Lets put all this information together

#Now lets compare means and standard deviations grouped by city 
status2.df %>%
  group_by(City)%>% 
  summarise(Mean_Ontime = mean(ON_TIME, na.rm=TRUE),
            Mean_Delay = mean(DELAY, na.rm=TRUE), 
            SD_Ontime = sd(ON_TIME),
            SD_DELAY=sd(DELAY))

## # A tibble: 5 x 5
##   City         Mean_Ontime Mean_Delay SD_Ontime SD_DELAY
##   <chr>              <dbl>      <dbl>     <dbl>    <dbl>
## 1 LA                   596       89.5       139     38.9
## 2 Phoenix             2530      214        3266    285  
## 3 SanDiego             298       42.5       121     31.8
## 4 SanFrancisco         412      116         129     19.1
## 5 Seattle             1021      183        1160    173

We can get a better perspective by comparing at carrier granularity

#lets repeat the analysis but grouped by carrier 
status2.df %>%
  group_by(carrier)%>% 
  summarise(Mean_Ontime = mean(ON_TIME, na.rm=TRUE),
            Mean_Delay = mean(DELAY, na.rm=TRUE), 
            SD_Ontime = sd(ON_TIME),
            SD_DELAY=sd(DELAY))

## # A tibble: 2 x 5
##   carrier Mean_Ontime Mean_Delay SD_Ontime SD_DELAY
##   <chr>         <dbl>      <dbl>     <dbl>    <dbl>
## 1 ALASKA          655        100       678      120
## 2 AM WEST        1288        157      1994      147

From this chart, ALASKA airline had a lower average of Delays with less standard deviation from the mean. It should be noted that AM WEST has a greater number of flights than ALASKA.

Lets conclude with some top level visualzations I want to re-examine city level granularity. I am making two charts,one for Delays and the other for On Time flights. The range difference between On Time and Delay is far too great to reasonably scale a combined chart.

#Delays
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.4.3

ggplot(status2.df, aes(factor(City), DELAY, fill = carrier)) + 
  geom_bar(stat="identity", position = "dodge") + 
  theme_minimal()+
  labs( x="City", y="Number of Delayed Arrivals")

#on time 
  library(ggplot2)
  ggplot(status2.df, aes(factor(City), ON_TIME, fill = carrier)) + 
    geom_bar(stat="identity", position = "dodge") + 
    theme_minimal()+
    labs( x="City", y="Number of On Time Arrivals")

We can produce an additional chart that only compares the delays at the carrier level. We are only interested in arrival delays anyways. I will regroup my data and create a smalldata frame with the mean delays and arrivals

flightmean<-status2.df %>%
  group_by(carrier)%>% 
  summarise(Avg_Ontime = mean(ON_TIME, na.rm=TRUE),
            Avg_Delay = mean(DELAY, na.rm=TRUE))
flightmean.df<-data.frame(flightmean)
flightmean.df

##   carrier Avg_Ontime Avg_Delay
## 1  ALASKA      654.8     100.2
## 2 AM WEST     1287.6     157.4

Plot

library(ggplot2)
  ggplot(flightmean.df, aes(factor(carrier), Avg_Delay, fill = carrier)) + 
    geom_bar(stat="identity", position = "dodge") + 
    theme_minimal()+
    labs( x="City", y="Average Number of Delayed Arrivals")

Overall, we can conclude that AM WEST has more delayed arrivals on average than ALASKA airlines.

DATA 607 Week 5

Vinicio Haro

February 26, 2018