Data Preparation

# load data
library(dplyr)
library(nycflights13)
library(ggplot2)

head(flights,10)
names(flights)

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Is there any relationship between origin airport and and probability of delay?

Cases

What are the cases, and how many are there?

nrow(flights)
## [1] 336776
summary(flights$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2013    2013    2013    2013    2013    2013

Cases are each flights information departure from EWR,JFK,LGA in 2013.

The are 336776 cases.

Data collection

Describe the method of data collection.

Data is clollected by the ‘dplyr’ library. It contains all airline on-time data for all flights departing NYC in 2013.

Type of study

What type of study is this (observational/experiment)?

Observational study

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

https://github.com/hadley/nycflights13

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is airline delay or not

flights=flights%>%
  mutate(dep_delay_bool=ifelse(dep_delay>=0,'ontime','delay'))

class(flights$dep_delay_bool)
## [1] "character"

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

class(flights$origin)
## [1] "character"

Relevant summary statistics

Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

Delay proportion
table(flights$dep_delay_bool)
## 
##  delay ontime 
## 183575 144946
prop.table(table(flights$dep_delay_bool))
## 
##     delay    ontime 
## 0.5587923 0.4412077
number of flights by each departure airport
table(flights$origin)
## 
##    EWR    JFK    LGA 
## 120835 111279 104662
prop.table(table(flights$origin))
## 
##       EWR       JFK       LGA 
## 0.3587993 0.3304244 0.3107763
Contingency table : Airport v.s. Delay
table(flights$origin, flights$dep_delay_bool)
##      
##       delay ontime
##   EWR 59300  58296
##   JFK 61146  48270
##   LGA 63129  38380
prop.table(table(flights$origin, flights$dep_delay_bool),1)
##      
##           delay    ontime
##   EWR 0.5042689 0.4957311
##   JFK 0.5588397 0.4411603
##   LGA 0.6219054 0.3780946