title: "Project proposal 606"
author: "Lisa Szydziak"
date: "10/25/2021"
output:
  html_document: default
  word_document: default
  pdf_document: default

Data Preparation

# load packages
library(nycflights13)
library(tidyr)
library(dplyr)
library(gmodels)
library(ggplot2)
library(ggpubr)
library(psych)

The nycflights13 R package contains information about all flights that departed from the NYC airports (EWR, JFK and LGA) to destinations in the United States, Puerto Rico, and the American Virgin Islands in 2013: 336,776 flights in total. To help understand what causes delays, it also includes a number of other useful datasets.

This package provides the following data tables.

?flights: all flights that departed from NYC in 2013
?weather: hourly meteorological data for each airport
?planes: construction information about each plane
?airports: airport names and locations
?airlines: translation between two letter carrier codes and names

I begin by merging four of these datasets: fields from the weather, planes, and airlines tables supplement the flights dataset.

# Start from the flights dataset

flights1<-nycflights13::flights

flights2<-merge(flights1,airlines,all.x=TRUE, by.x="carrier", by.y="carrier")

flights3<-merge(flights2,planes,all.x=TRUE, by.x="tailnum", by.y="tailnum")

# Create a key, origin.time, by concatenating origin and time_hour
# in both flights3 and weather, to use in the later merge


flights4 <-flights3 %>%
  unite('origin.time', origin,time_hour,remove=FALSE)

weather2<-weather %>%
  unite('origin.time', origin,time_hour,remove=FALSE)

# Now merge flights4 with weather2 on the newly created key
flights5<-merge(flights4,weather2,all.x=TRUE, by.x="origin.time", by.y="origin.time")
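
The concatenated key can be sanity-checked on a toy example. The sketch below uses made-up miniature stand-ins for the flights and weather tables (all names prefixed `toy_` and all values are hypothetical). It suggests that a left join with `merge(..., all.x = TRUE)` keeps every flight, fills `NA` where no weather record matches, and gives the same result whether the join uses the pasted key or the two columns directly.

```r
# Toy stand-ins for the flights and weather tables (values are made up)
toy_flights <- data.frame(
  flight    = c(1, 2, 3),
  origin    = c("EWR", "JFK", "LGA"),
  time_hour = c("2013-01-01 05:00:00", "2013-01-01 05:00:00", "2013-01-01 06:00:00"),
  stringsAsFactors = FALSE
)
toy_weather <- data.frame(
  origin    = c("EWR", "JFK"),
  time_hour = c("2013-01-01 05:00:00", "2013-01-01 05:00:00"),
  temp      = c(39.0, 37.9),
  stringsAsFactors = FALSE
)

# Join on both columns directly (all.x = TRUE keeps every flight)
joined_cols <- merge(toy_flights, toy_weather,
                     by = c("origin", "time_hour"), all.x = TRUE)

# Join on a single concatenated key, mirroring the origin.time approach
toy_flights_key <- toy_flights
toy_flights_key$origin.time <- paste(toy_flights$origin, toy_flights$time_hour, sep = "_")
toy_weather_key <- data.frame(
  origin.time = paste(toy_weather$origin, toy_weather$time_hour, sep = "_"),
  temp        = toy_weather$temp,
  stringsAsFactors = FALSE
)
joined_key <- merge(toy_flights_key, toy_weather_key,
                    by = "origin.time", all.x = TRUE)

# Both joins keep all three flights; the LGA flight has no weather match
nrow(joined_cols)
nrow(joined_key)
joined_key[order(joined_key$flight), c("flight", "temp")]
```

On the real data, comparing `nrow(flights5)` with `nrow(flights1)` is a quick way to confirm that the merge neither dropped nor duplicated flights.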

#Create a date field
flights5$date <- as.Date(with(flights5, paste(year.x, month.x, day.x,sep="-")), "%Y-%m-%d")
   
flights6<-flights5 %>%
  select(origin.x,date,month.x, dep_delay, dest, distance, name, manufacturer, seats, temp, wind_dir,wind_speed, precip, visib)

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

What are the most significant variables driving NYC airline departure delays?

Cases

What are the cases, and how many are there?

dim(flights6)
## [1] 336776     14

There are 336,776 cases, each representing one flight that departed from a NYC airport in 2013.

Data collection

Describe the method of data collection.

As stated before, this R package contains information about all flights that departed from the NYC airports (EWR, JFK and LGA) to destinations in the United States, Puerto Rico, and the American Virgin Islands in 2013: 336,776 flights in total. It also includes a number of other useful datasets.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The data is included in the R package nycflights13 and can be accessed by installing and loading that package.

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is dep_delay, the departure delay in minutes; negative values represent early departures. It is a quantitative variable.

Independent Variable

I will be looking at the following independent variables:

origin.x - NYC airport
date
month.x
dest - destination airport
distance - length of the trip, in miles
name - airline carrier
manufacturer - airplane maker
seats - number of seats (proxy for size of plane)
temp - air temperature, in degrees F
wind_dir - wind direction, in degrees
wind_speed - wind speed, in mph
precip - precipitation, in inches

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

str(flights6)
## 'data.frame':    336776 obs. of  14 variables:
##  $ origin.x    : chr  "EWR" "EWR" "EWR" "EWR" ...
##  $ date        : Date, format: "2013-01-01" "2013-01-01" ...
##  $ month.x     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_delay   : num  2 -4 24 8 -6 1 -1 -4 -2 47 ...
##  $ dest        : chr  "IAH" "ORD" "IAD" "ORD" ...
##  $ distance    : num  1400 719 212 719 1008 ...
##  $ name        : chr  "United Air Lines Inc." "United Air Lines Inc." "ExpressJet Airlines Inc." "Envoy Air" ...
##  $ manufacturer: chr  "BOEING" "BOEING" "EMBRAER" NA ...
##  $ seats       : int  149 191 55 NA 55 200 149 178 191 191 ...
##  $ temp        : num  39 39 37.9 37.9 37.9 ...
##  $ wind_dir    : num  260 260 240 240 240 240 240 240 240 240 ...
##  $ wind_speed  : num  12.7 12.7 11.5 11.5 11.5 ...
##  $ precip      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ visib       : num  10 10 10 10 10 10 10 10 10 10 ...
summary(flights6)
##    origin.x              date               month.x         dep_delay      
##  Length:336776      Min.   :2013-01-01   Min.   : 1.000   Min.   : -43.00  
##  Class :character   1st Qu.:2013-04-04   1st Qu.: 4.000   1st Qu.:  -5.00  
##  Mode  :character   Median :2013-07-03   Median : 7.000   Median :  -2.00  
##                     Mean   :2013-07-02   Mean   : 6.549   Mean   :  12.64  
##                     3rd Qu.:2013-10-01   3rd Qu.:10.000   3rd Qu.:  11.00  
##                     Max.   :2013-12-31   Max.   :12.000   Max.   :1301.00  
##                                                           NA's   :8255     
##      dest              distance        name           manufacturer      
##  Length:336776      Min.   :  17   Length:336776      Length:336776     
##  Class :character   1st Qu.: 502   Class :character   Class :character  
##  Mode  :character   Median : 872   Mode  :character   Mode  :character  
##                     Mean   :1040                                        
##                     3rd Qu.:1389                                        
##                     Max.   :4983                                        
##                                                                         
##      seats            temp           wind_dir       wind_speed    
##  Min.   :  2.0   Min.   : 10.94   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.: 55.0   1st Qu.: 42.08   1st Qu.:130.0   1st Qu.: 6.905  
##  Median :149.0   Median : 57.20   Median :220.0   Median :10.357  
##  Mean   :136.7   Mean   : 57.00   Mean   :201.5   Mean   :11.114  
##  3rd Qu.:189.0   3rd Qu.: 71.96   3rd Qu.:290.0   3rd Qu.:14.960  
##  Max.   :450.0   Max.   :100.04   Max.   :360.0   Max.   :42.579  
##  NA's   :52606   NA's   :1573     NA's   :9796    NA's   :1634    
##      precip           visib       
##  Min.   :0.0000   Min.   : 0.000  
##  1st Qu.:0.0000   1st Qu.:10.000  
##  Median :0.0000   Median :10.000  
##  Mean   :0.0046   Mean   : 9.256  
##  3rd Qu.:0.0000   3rd Qu.:10.000  
##  Max.   :1.2100   Max.   :10.000  
##  NA's   :1556     NA's   :1556
hist(flights6$dep_delay)
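
The NA's rows in the summary can also be tabulated directly. Here is a minimal sketch on a made-up data frame (`toy` is hypothetical, not the real flights6), using base R's `is.na()` with `colSums()` and `colMeans()` to count missing values per column and express them as a share of rows.

```r
# Made-up data frame with a few missing values
toy <- data.frame(
  dep_delay = c(2, -4, NA, 47),
  seats     = c(149, NA, NA, 191),
  temp      = c(39.0, 39.0, 37.9, 37.9)
)

na_count <- colSums(is.na(toy))   # number of missing values per column
na_share <- colMeans(is.na(toy))  # fraction of rows missing per column

na_count
round(na_share, 2)
```

Applied to the real data, `colSums(is.na(flights6))` should give the same NA counts that `summary()` prints for the numeric columns, and it also covers the character columns, for which `summary()` reports no NA count.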

This variable has negative values (early departures) and is right skewed. Let's transform it to log(dep_delay+1), treating early and on-time departures as a delay of 0 so that the logarithm is defined.

flights6<-flights6 %>%
  mutate(logdepdelay = log(pmax(dep_delay, 0) + 1))
hist(flights6$logdepdelay)
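
A small sketch, on made-up delay values, of why early departures need handling before a log(dep_delay + 1) transform. Clamping negative delays to zero is one possible choice and is an assumption of this sketch; the vector `delays` is hypothetical.

```r
# Hypothetical delays in minutes (negative = early departure)
delays <- c(-6, -1, 0, 1, 30, 120)

# Naive transform: log() of a negative number is NaN and log(0) is -Inf
naive <- suppressWarnings(log(delays + 1))

# Clamp early departures to 0 first, so every value is defined
clamped <- log(pmax(delays, 0) + 1)

naive
clamped
```

With clamping, every early or on-time flight maps to 0 and the transform stays monotone in the delay, so a 30-minute delay becomes log(31), about 3.43.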

We are interested in flights with delays, so let’s subset the dataset. Let’s consider significant delays of 30 minutes or more.

flights7 <- filter(flights6, dep_delay >= 30)

The dataset now contains only significantly delayed flights.

dim(flights7)
## [1] 49413    15

The dataset now contains 49,413 observations.
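
To put that count in context, the share of flights with a delay of 30 minutes or more can be worked out from the numbers already reported. The sketch below only re-uses figures printed earlier; the variable names are illustrative. Note that dplyr's `filter()` drops rows where the condition is NA, so flights with a missing dep_delay are not part of the delayed subset.

```r
n_all     <- 336776  # rows in flights6, from dim(flights6)
n_missing <- 8255    # NA's in dep_delay, from summary(flights6)
n_delayed <- 49413   # rows in flights7, from dim(flights7)

# Share of all flights, and share of flights with a recorded departure delay
share_of_all      <- n_delayed / n_all
share_of_recorded <- n_delayed / (n_all - n_missing)

round(100 * c(all = share_of_all, recorded = share_of_recorded), 1)
```

This works out to roughly 14.7% of all flights, or about 15.0% of flights with a recorded departure delay.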

The question is what is driving delays.

Here is a summary of the new dataset of delayed flights.

summary(flights7)
##    origin.x              date               month.x         dep_delay      
##  Length:49413       Min.   :2013-01-01   Min.   : 1.000   Min.   :  30.00  
##  Class :character   1st Qu.:2013-04-11   1st Qu.: 4.000   1st Qu.:  43.00  
##  Mode  :character   Median :2013-06-26   Median : 6.000   Median :  64.00  
##                     Mean   :2013-06-27   Mean   : 6.378   Mean   :  85.53  
##                     3rd Qu.:2013-09-02   3rd Qu.: 9.000   3rd Qu.: 105.00  
##                     Max.   :2013-12-31   Max.   :12.000   Max.   :1301.00  
##                                                                            
##      dest              distance        name           manufacturer      
##  Length:49413       Min.   :  80   Length:49413       Length:49413      
##  Class :character   1st Qu.: 483   Class :character   Class :character  
##  Mode  :character   Median : 762   Mode  :character   Mode  :character  
##                     Mean   : 973                                        
##                     3rd Qu.:1147                                        
##                     Max.   :4983                                        
##                                                                         
##      seats            temp           wind_dir       wind_speed    
##  Min.   :  2.0   Min.   : 10.94   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.: 55.0   1st Qu.: 42.98   1st Qu.:130.0   1st Qu.: 8.055  
##  Median :140.0   Median : 60.98   Median :210.0   Median :11.508  
##  Mean   :124.6   Mean   : 59.23   Mean   :198.9   Mean   :11.734  
##  3rd Qu.:182.0   3rd Qu.: 75.20   3rd Qu.:280.0   3rd Qu.:14.960  
##  Max.   :450.0   Max.   :100.04   Max.   :360.0   Max.   :42.579  
##  NA's   :6367    NA's   :257      NA's   :1474    NA's   :260     
##      precip            visib         logdepdelay   
##  Min.   :0.00000   Min.   : 0.000   Min.   :3.434  
##  1st Qu.:0.00000   1st Qu.:10.000   1st Qu.:3.784  
##  Median :0.00000   Median :10.000   Median :4.174  
##  Mean   :0.00989   Mean   : 8.898   Mean   :4.270  
##  3rd Qu.:0.00000   3rd Qu.:10.000   3rd Qu.:4.663  
##  Max.   :1.21000   Max.   :10.000   Max.   :7.172  
##  NA's   :254       NA's   :254
hist(flights7$dep_delay)

# Use the transformed response variable
hist(flights7$logdepdelay)

Let’s look at the categorical variables.

attach(flights7)
## The following object is masked from package:datasets:
## 
##     precip
table(flights7$origin.x)
## 
##   EWR   JFK   LGA 
## 20349 15612 13452
boxplot(logdepdelay~origin.x, ylab="Delay", xlab="origin")

table(flights7$dest)
## 
##  ABQ  ACK  ALB  ANC  ATL  AUS  AVL  BDL  BGR  BHM  BNA  BOS  BQN  BTV  BUF  BUR 
##   47   29  116    1 2325  378   37   95   95   88 1125 1885  150  439  755   51 
##  BWI  BZN  CAE  CAK  CHO  CHS  CLE  CLT  CMH  CRW  CVG  DAY  DCA  DEN  DFW  DSM 
##  308    4   38  172   12  501  728 1730  562   39  821  293 1357 1157  984  152 
##  DTW  EGE  EYW  FLL  GRR  GSO  GSP  HDN  HNL  HOU  IAD  IAH  ILM  IND  JAC  JAX 
## 1334   36    1 1750  160  349  209    2   63  337 1026  842   30  339    6  530 
##  LAS  LAX  LGB  MCI  MCO  MDW  MEM  MHT  MIA  MKE  MSN  MSP  MSY  MTJ  MVY  MYR 
##  673 1734   98  442 1864  718  318  231 1268  521  144 1017  635    2   27   13 
##  OAK  OKC  OMA  ORD  ORF  PBI  PDX  PHL  PHX  PIT  PSE  PVD  PWM  RDU  RIC  ROC 
##   51  115  197 2707  303  980  238  245  587  465   50   97  457 1277  617  473 
##  RSW  SAN  SAT  SAV  SBN  SDF  SEA  SFO  SJC  SJU  SLC  SMF  SNA  SRQ  STL  STT 
##  382  336  144  149    3  232  464 1783   46  711  272   63   86  134  768   29 
##  SYR  TPA  TUL  TVC  TYS  XNA 
##  304 1029  116   20  175  115
boxplot(logdepdelay~dest, ylab="Delay", xlab="destination")

table(flights7$month.x)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12 
## 3428 3265 4402 4624 4534 5811 6294 4336 2471 2768 2463 5017
boxplot(logdepdelay~month.x, ylab="Delay", xlab="Month")

table(flights7$name)
## 
## AirTran Airways Corporation        Alaska Airlines Inc. 
##                         565                          63 
##      American Airlines Inc.        Delta Air Lines Inc. 
##                        3613                        5126 
##           Endeavor Air Inc.                   Envoy Air 
##                        3309                        3814 
##    ExpressJet Airlines Inc.      Frontier Airlines Inc. 
##                       11863                         133 
##      Hawaiian Airlines Inc.             JetBlue Airways 
##                          16                        8618 
##          Mesa Airlines Inc.       SkyWest Airlines Inc. 
##                         117                           5 
##      Southwest Airlines Co.       United Air Lines Inc. 
##                        2088                        7835 
##             US Airways Inc.              Virgin America 
##                        1652                         596
boxplot(logdepdelay~name, ylab="Delay", xlab="carrier")

table(flights7$manufacturer)
## 
##                    AGUSTA SPA                        AIRBUS 
##                             5                          6382 
##              AIRBUS INDUSTRIE         AMERICAN AIRCRAFT INC 
##                          5185                             3 
##            AVIAT AIRCRAFT INC                 BARKER JACK L 
##                             4                            48 
##                         BEECH                          BELL 
##                             6                             7 
##                        BOEING                BOMBARDIER INC 
##                         10497                          5507 
##                      CANADAIR                  CANADAIR LTD 
##                           277                            14 
##                        CESSNA            CIRRUS DESIGN CORP 
##                            87                            40 
##                   DEHAVILLAND                       DOUGLAS 
##                             6                             2 
##                       EMBRAER                FRIEDEMANN JON 
##                         12996                            12 
##          GULFSTREAM AEROSPACE            HURLEY JAMES LARRY 
##                            60                             2 
##                  KILDALL GARY               LAMBERT RICHARD 
##                             3                            12 
##               LEBLANC GLENN T                    MARZ BARRY 
##                             4                             4 
##             MCDONNELL DOUGLAS MCDONNELL DOUGLAS AIRCRAFT CO 
##                           466                          1193 
## MCDONNELL DOUGLAS CORPORATION                   PAIR MIKE E 
##                           157                             2 
##                         PIPER        ROBINSON HELICOPTER CO 
##                            15                            42 
##                      SIKORSKY                  STEWART MACO 
##                             3                             5
boxplot(logdepdelay~manufacturer, ylab="Delay", xlab="manufacturer")

Let’s look at the quantitative variables.

hist(distance)

hist(seats)

hist(temp)

hist(wind_dir)

hist(wind_speed)

hist(precip)

pairs.panels(flights7[,c("distance",  "seats", "temp", "wind_dir","wind_speed", "precip", "visib","dep_delay")])

I am considering further reducing the dataset to one airport: JFK.
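
Restricting to a single airport could look like the following sketch, shown on a made-up data frame (`toy_delays` and its values are hypothetical) so that it runs without the package installed.

```r
# Made-up delayed flights from the three NYC airports
toy_delays <- data.frame(
  origin.x  = c("EWR", "JFK", "LGA", "JFK"),
  dep_delay = c(45, 30, 120, 64),
  stringsAsFactors = FALSE
)

# Keep only flights departing from JFK
jfk_only <- toy_delays[toy_delays$origin.x == "JFK", ]

nrow(jfk_only)
```

With dplyr loaded, `filter(flights7, origin.x == "JFK")` would be the equivalent call on the real data; per the origin table above, that would leave 15,612 of the 49,413 delayed flights.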