---
title: "Project proposal 606"
author: "Lisa Szydziak"
date: "10/25/2021"
output:
  html_document: default
  word_document: default
  pdf_document: default
---
# load data
library(nycflights13)
library(tidyr)
library(dplyr)
library(gmodels)
library(ggplot2)
library(ggpubr)
library(psych)
The nycflights13 R package contains information about all flights that departed from NYC (i.e. EWR, JFK and LGA) to destinations in the United States, Puerto Rico, and the American Virgin Islands in 2013: 336,776 flights in total. To help understand what causes delays, it also includes a number of other useful datasets.
This package provides the following data tables.
?flights: all flights that departed from NYC in 2013
?weather: hourly meteorological data for each airport
?planes: construction information about each plane
?airports: airport names and locations
?airlines: translation between two letter carrier codes and names
I begin by merging four of these datasets, adding fields from the weather, planes and airlines tables to supplement the flights dataset.
#Use flight dataset
flights1<-nycflights13::flights
flights2<-merge(flights1,airlines,all.x=TRUE, by.x="carrier", by.y="carrier")
flights3<-merge(flights2,planes,all.x=TRUE, by.x="tailnum", by.y="tailnum")
#Create a key variable origin.time by concatenating origin and time_hour
#in both flights3 and weather, to use for the merge below
flights4 <-flights3 %>%
unite('origin.time', origin,time_hour,remove=FALSE)
weather2<-weather %>%
unite('origin.time', origin,time_hour,remove=FALSE)
#now merge flights 4 with weather 2 based on the newly created variable
flights5<-merge(flights4,weather2,all.x=TRUE, by.x="origin.time", by.y="origin.time")
#Create a date field
flights5$date <- as.Date(with(flights5, paste(year.x, month.x, day.x,sep="-")), "%Y-%m-%d")
flights6<-flights5 %>%
select(origin.x,date,month.x, dep_delay, dest, distance, name, manufacturer, seats, temp, wind_dir,wind_speed, precip, visib)
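The origin.time key above is built by pasting two columns together. As an alternative sketch, base merge() also accepts a vector of key columns, which avoids the helper column. The toy data frames below are invented for illustration and only mimic the origin and time_hour columns. Comparing row counts before and after is a cheap check that the left join did not duplicate flights.

```r
# Toy stand-ins for the flights and weather tables (values are made up).
flights_toy <- data.frame(
  origin    = c("EWR", "JFK", "LGA", "JFK"),
  time_hour = c("2013-01-01 05:00", "2013-01-01 05:00",
                "2013-01-01 06:00", "2013-01-01 07:00"),
  dep_delay = c(2, -4, 24, 47),
  stringsAsFactors = FALSE
)
weather_toy <- data.frame(
  origin    = c("EWR", "JFK", "LGA"),
  time_hour = c("2013-01-01 05:00", "2013-01-01 05:00", "2013-01-01 06:00"),
  temp      = c(39.0, 39.0, 37.9),
  stringsAsFactors = FALSE
)

# Left join on both key columns at once; all.x = TRUE keeps unmatched flights.
joined_toy <- merge(flights_toy, weather_toy,
                    by = c("origin", "time_hour"), all.x = TRUE)

# A left join against a table with unique keys should not change the row count.
stopifnot(nrow(joined_toy) == nrow(flights_toy))

# Flights without a weather record get NA in the weather columns.
sum(is.na(joined_toy$temp))
```

The same row-count check could be applied after each merge() call in the pipeline, since a duplicated key in a lookup table would silently inflate the number of flights.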
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
What are the most significant variables driving NYC airline departure delays?
What are the cases, and how many are there?
dim(flights6)
## [1] 336776 14
There are 336,776 observations, each representing one flight out of an NYC airport.
Describe the method of data collection.
As stated before, this R package contains information about all flights that departed from NYC (i.e. EWR, JFK and LGA) to destinations in the United States, Puerto Rico, and the American Virgin Islands in 2013: 336,776 flights in total. It also includes a number of other useful datasets.
What type of study is this (observational/experiment)?
This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link.
This data is included in the R package nycflights13 and can be accessed by installing and loading the nycflights13 library.
What is the response variable? Is it quantitative or qualitative?
The response variable is dep_delay, the departure delay in minutes. Negative values represent early departures. It is a quantitative variable.
I will be looking at the following independent variables:
origin.x - NYC airport
date
month.x
dest - destination airport
distance - distance between airports, in miles
name - airline carrier
manufacturer - airplane maker
seats - number of seats (proxy for size of plane)
temp - air temperature
wind_dir - wind direction
wind_speed - wind speed
precip - precipitation, in inches
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
str(flights6)
## 'data.frame': 336776 obs. of 14 variables:
## $ origin.x : chr "EWR" "EWR" "EWR" "EWR" ...
## $ date : Date, format: "2013-01-01" "2013-01-01" ...
## $ month.x : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_delay : num 2 -4 24 8 -6 1 -1 -4 -2 47 ...
## $ dest : chr "IAH" "ORD" "IAD" "ORD" ...
## $ distance : num 1400 719 212 719 1008 ...
## $ name : chr "United Air Lines Inc." "United Air Lines Inc." "ExpressJet Airlines Inc." "Envoy Air" ...
## $ manufacturer: chr "BOEING" "BOEING" "EMBRAER" NA ...
## $ seats : int 149 191 55 NA 55 200 149 178 191 191 ...
## $ temp : num 39 39 37.9 37.9 37.9 ...
## $ wind_dir : num 260 260 240 240 240 240 240 240 240 240 ...
## $ wind_speed : num 12.7 12.7 11.5 11.5 11.5 ...
## $ precip : num 0 0 0 0 0 0 0 0 0 0 ...
## $ visib : num 10 10 10 10 10 10 10 10 10 10 ...
summary(flights6)
## origin.x date month.x dep_delay
## Length:336776 Min. :2013-01-01 Min. : 1.000 Min. : -43.00
## Class :character 1st Qu.:2013-04-04 1st Qu.: 4.000 1st Qu.: -5.00
## Mode :character Median :2013-07-03 Median : 7.000 Median : -2.00
## Mean :2013-07-02 Mean : 6.549 Mean : 12.64
## 3rd Qu.:2013-10-01 3rd Qu.:10.000 3rd Qu.: 11.00
## Max. :2013-12-31 Max. :12.000 Max. :1301.00
## NA's :8255
## dest distance name manufacturer
## Length:336776 Min. : 17 Length:336776 Length:336776
## Class :character 1st Qu.: 502 Class :character Class :character
## Mode :character Median : 872 Mode :character Mode :character
## Mean :1040
## 3rd Qu.:1389
## Max. :4983
##
## seats temp wind_dir wind_speed
## Min. : 2.0 Min. : 10.94 Min. : 0.0 Min. : 0.000
## 1st Qu.: 55.0 1st Qu.: 42.08 1st Qu.:130.0 1st Qu.: 6.905
## Median :149.0 Median : 57.20 Median :220.0 Median :10.357
## Mean :136.7 Mean : 57.00 Mean :201.5 Mean :11.114
## 3rd Qu.:189.0 3rd Qu.: 71.96 3rd Qu.:290.0 3rd Qu.:14.960
## Max. :450.0 Max. :100.04 Max. :360.0 Max. :42.579
## NA's :52606 NA's :1573 NA's :9796 NA's :1634
## precip visib
## Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:10.000
## Median :0.0000 Median :10.000
## Mean :0.0046 Mean : 9.256
## 3rd Qu.:0.0000 3rd Qu.:10.000
## Max. :1.2100 Max. :10.000
## NA's :1556 NA's :1556
hist(flights6$dep_delay)
This variable has negative values and is right skewed. Let’s transform it to log(dep_delay + 1), treating early and on-time departures as a delay of 0 so that the logarithm is always defined.
flights6<-flights6 %>%
mutate(logdepdelay = log1p(pmax(dep_delay, 0)))
hist(flights6$logdepdelay)
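A small illustration of why negative delays need handling before a log(x + 1) transform. The delay values below are made up. The sketch contrasts the plain transform with one that clamps early departures to zero first, which is one possible choice rather than the only one.

```r
# Toy delays in minutes (made up): early, on time, late, and missing.
delay <- c(-43, -5, 0, 1, 30, NA)

# log() of a negative number is NaN, so log(delay + 1) is undefined below -1.
# suppressWarnings() hides the "NaNs produced" warning for this demonstration.
naive <- suppressWarnings(log(delay + 1))

# Clamp early departures to 0 first, then apply log1p(x) = log(1 + x).
# pmax() keeps NA as NA, so missing delays stay missing.
clamped <- log1p(pmax(delay, 0))

data.frame(delay, naive, clamped)
```

For any delay of 30 minutes or more the two versions agree, so the choice only matters for flights that left early or on time.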
We are interested in flights with delays, so let’s subset the dataset. Let’s consider significant delays of 30 minutes or more.
flights7 <- filter(flights6, dep_delay >= 30)
The dataset now contains only significantly delayed flights.
dim(flights7)
## [1] 49413 15
The dataset now contains 49,413 observations.
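To put the size of this subset in context, the counts reported above can be turned into shares. This is a back-of-the-envelope sketch using only numbers from the dim() and summary() output. Note that dplyr's filter() drops rows where the condition evaluates to NA, so flights with a missing dep_delay are not in the subset.

```r
# Counts taken from the dim() and summary() output above.
n_all     <- 336776  # all flights
n_missing <- 8255    # flights with NA dep_delay
n_delayed <- 49413   # flights delayed 30 minutes or more

# Share of all flights, and share of flights with a recorded delay.
share_of_all      <- n_delayed / n_all
share_of_recorded <- n_delayed / (n_all - n_missing)

round(100 * c(all = share_of_all, recorded = share_of_recorded), 1)
```

Either way, roughly 15% of flights fall into the significantly delayed group.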
The question is what is driving delays.
Here is a summary of the new dataset of delayed flights.
summary(flights7)
## origin.x date month.x dep_delay
## Length:49413 Min. :2013-01-01 Min. : 1.000 Min. : 30.00
## Class :character 1st Qu.:2013-04-11 1st Qu.: 4.000 1st Qu.: 43.00
## Mode :character Median :2013-06-26 Median : 6.000 Median : 64.00
## Mean :2013-06-27 Mean : 6.378 Mean : 85.53
## 3rd Qu.:2013-09-02 3rd Qu.: 9.000 3rd Qu.: 105.00
## Max. :2013-12-31 Max. :12.000 Max. :1301.00
##
## dest distance name manufacturer
## Length:49413 Min. : 80 Length:49413 Length:49413
## Class :character 1st Qu.: 483 Class :character Class :character
## Mode :character Median : 762 Mode :character Mode :character
## Mean : 973
## 3rd Qu.:1147
## Max. :4983
##
## seats temp wind_dir wind_speed
## Min. : 2.0 Min. : 10.94 Min. : 0.0 Min. : 0.000
## 1st Qu.: 55.0 1st Qu.: 42.98 1st Qu.:130.0 1st Qu.: 8.055
## Median :140.0 Median : 60.98 Median :210.0 Median :11.508
## Mean :124.6 Mean : 59.23 Mean :198.9 Mean :11.734
## 3rd Qu.:182.0 3rd Qu.: 75.20 3rd Qu.:280.0 3rd Qu.:14.960
## Max. :450.0 Max. :100.04 Max. :360.0 Max. :42.579
## NA's :6367 NA's :257 NA's :1474 NA's :260
## precip visib logdepdelay
## Min. :0.00000 Min. : 0.000 Min. :3.434
## 1st Qu.:0.00000 1st Qu.:10.000 1st Qu.:3.784
## Median :0.00000 Median :10.000 Median :4.174
## Mean :0.00989 Mean : 8.898 Mean :4.270
## 3rd Qu.:0.00000 3rd Qu.:10.000 3rd Qu.:4.663
## Max. :1.21000 Max. :10.000 Max. :7.172
## NA's :254 NA's :254
hist(flights7$dep_delay)
#USE TRANSFORMED Y variable
hist(flights7$logdepdelay)
Let’s look at the variables.
attach(flights7)
## The following object is masked from package:datasets:
##
## precip
table(flights7$origin.x)
##
## EWR JFK LGA
## 20349 15612 13452
boxplot(logdepdelay~origin.x, ylab="Delay", xlab="origin")
table(flights7$dest)
##
## ABQ ACK ALB ANC ATL AUS AVL BDL BGR BHM BNA BOS BQN BTV BUF BUR
## 47 29 116 1 2325 378 37 95 95 88 1125 1885 150 439 755 51
## BWI BZN CAE CAK CHO CHS CLE CLT CMH CRW CVG DAY DCA DEN DFW DSM
## 308 4 38 172 12 501 728 1730 562 39 821 293 1357 1157 984 152
## DTW EGE EYW FLL GRR GSO GSP HDN HNL HOU IAD IAH ILM IND JAC JAX
## 1334 36 1 1750 160 349 209 2 63 337 1026 842 30 339 6 530
## LAS LAX LGB MCI MCO MDW MEM MHT MIA MKE MSN MSP MSY MTJ MVY MYR
## 673 1734 98 442 1864 718 318 231 1268 521 144 1017 635 2 27 13
## OAK OKC OMA ORD ORF PBI PDX PHL PHX PIT PSE PVD PWM RDU RIC ROC
## 51 115 197 2707 303 980 238 245 587 465 50 97 457 1277 617 473
## RSW SAN SAT SAV SBN SDF SEA SFO SJC SJU SLC SMF SNA SRQ STL STT
## 382 336 144 149 3 232 464 1783 46 711 272 63 86 134 768 29
## SYR TPA TUL TVC TYS XNA
## 304 1029 116 20 175 115
boxplot(logdepdelay~dest, ylab="Delay", xlab="destination")
table(flights7$month.x)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 3428 3265 4402 4624 4534 5811 6294 4336 2471 2768 2463 5017
boxplot(logdepdelay~month.x, ylab="Delay", xlab="Month")
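The monthly counts above can be summarised with a little arithmetic. The sketch below re-enters the table by hand and reports the busiest month for long delays and the combined share of June and July. These are counts of delayed flights, not delay rates, because the total number of flights per month is not shown here.

```r
# Delayed-flight counts per month, copied from the table above.
delayed_by_month <- c(3428, 3265, 4402, 4624, 4534, 5811,
                      6294, 4336, 2471, 2768, 2463, 5017)
names(delayed_by_month) <- month.abb  # built-in month abbreviations

# Sanity check: the monthly counts add up to the size of flights7.
stopifnot(sum(delayed_by_month) == 49413)

# Share of delayed flights falling in each month.
share <- prop.table(delayed_by_month)

# Month with the most delayed flights.
names(which.max(delayed_by_month))

# Combined share of June and July, in percent.
round(100 * sum(share[c("Jun", "Jul")]), 1)
```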
table(flights7$name)
##
## AirTran Airways Corporation Alaska Airlines Inc.
## 565 63
## American Airlines Inc. Delta Air Lines Inc.
## 3613 5126
## Endeavor Air Inc. Envoy Air
## 3309 3814
## ExpressJet Airlines Inc. Frontier Airlines Inc.
## 11863 133
## Hawaiian Airlines Inc. JetBlue Airways
## 16 8618
## Mesa Airlines Inc. SkyWest Airlines Inc.
## 117 5
## Southwest Airlines Co. United Air Lines Inc.
## 2088 7835
## US Airways Inc. Virgin America
## 1652 596
boxplot(logdepdelay~name, ylab="Delay", xlab="carrier")
table(flights7$manufacturer)
##
## AGUSTA SPA AIRBUS
## 5 6382
## AIRBUS INDUSTRIE AMERICAN AIRCRAFT INC
## 5185 3
## AVIAT AIRCRAFT INC BARKER JACK L
## 4 48
## BEECH BELL
## 6 7
## BOEING BOMBARDIER INC
## 10497 5507
## CANADAIR CANADAIR LTD
## 277 14
## CESSNA CIRRUS DESIGN CORP
## 87 40
## DEHAVILLAND DOUGLAS
## 6 2
## EMBRAER FRIEDEMANN JON
## 12996 12
## GULFSTREAM AEROSPACE HURLEY JAMES LARRY
## 60 2
## KILDALL GARY LAMBERT RICHARD
## 3 12
## LEBLANC GLENN T MARZ BARRY
## 4 4
## MCDONNELL DOUGLAS MCDONNELL DOUGLAS AIRCRAFT CO
## 466 1193
## MCDONNELL DOUGLAS CORPORATION PAIR MIKE E
## 157 2
## PIPER ROBINSON HELICOPTER CO
## 15 42
## SIKORSKY STEWART MACO
## 3 5
boxplot(logdepdelay~manufacturer, ylab="Delay", xlab="manufacturer")
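Several manufacturer labels in the table above look like spelling variants of one company (for example AIRBUS and AIRBUS INDUSTRIE, or the three MCDONNELL DOUGLAS entries). A minimal sketch of collapsing them before plotting follows. The helper collapse_prefix and the list of prefixes are hypothetical and would need checking against the planes table before being used in the analysis.

```r
# A few manufacturer labels copied from the table above, plus a missing value.
manufacturer <- c("AIRBUS", "AIRBUS INDUSTRIE", "BOEING", "CANADAIR",
                  "CANADAIR LTD", "MCDONNELL DOUGLAS",
                  "MCDONNELL DOUGLAS AIRCRAFT CO",
                  "MCDONNELL DOUGLAS CORPORATION", NA)

# Hypothetical helper: relabel every value that starts with a known prefix
# as that prefix. The prefix list is an assumption, not part of nycflights13.
collapse_prefix <- function(x, prefixes) {
  for (p in prefixes) {
    x[!is.na(x) & startsWith(x, p)] <- p
  }
  x
}

manufacturer_clean <- collapse_prefix(
  manufacturer, c("AIRBUS", "CANADAIR", "MCDONNELL DOUGLAS")
)

# Count the collapsed labels, keeping missing values visible.
table(manufacturer_clean, useNA = "ifany")
```

Fewer, larger groups should make a boxplot by manufacturer easier to read, at the cost of assuming the variants really refer to the same maker.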
Let’s look at the quantitative variables.
hist(distance)
hist(seats)
hist(temp)
hist(wind_dir)
hist(wind_speed)
hist(precip)
pairs.panels(flights7[,c("distance", "seats", "temp", "wind_dir","wind_speed", "precip", "visib","dep_delay")])
I am considering further reducing the dataset to a single airport: JFK.