The Data

Importing data

For this R Guide, all examples will use the flights dataset from the package nycflights13, which contains on-time data for all flights that departed NYC in 2013. First we will require the package containing the data, call up the data, and attach the data to make our lives easier, since we will only be using this one dataset throughout the entire R guide (consider not attaching your data if you know you’ll be switching between two or more datasets, it will not go great).

I also used dplyr to clear out rows with NA’s in the columns I’m interested in, I was having unequal length problems when I wanted to compare my residuals to my independent variable.

require(nycflights13)

## Loading required package: nycflights13

## Warning: package 'nycflights13' was built under R version 3.3.3

require(dplyr)

## Loading required package: dplyr

## Warning: package 'dplyr' was built under R version 3.3.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

flights <- flights %>% filter(!is.na(dep_delay) & !is.na(arr_delay))

## Warning: package 'bindrcpp' was built under R version 3.3.3

attach(flights)

Exploratory Plotting

In this simple linear regression example, we’re going to look at the relationship between departure delay time and arrival delay time. Let’s start by graphing this data using the plot function.

plot(dep_delay, arr_delay, xlab="Departure Delay (mins)", ylab="Arrival Delay (mins)")

Based on this plot, the relationship between these departure delay and arrival delay seems to be linear, and the relationship is positive, meaning that when departure delay increases we’d expect to see arrival delay increase as well.

However, just because it looks like a linear relationship might be present from an exploritory plot, doesn’t mean it is. To prove there is evidence to conclude a linear relationship between two variables, we need to test a simple linear regression model and ensure that the assumptions of a linear model are met by our data.

Making the model

To make the model, use lm(dependent~independent) function, save it under a meaningful name using <- command thing. Use the summary function to see its insides.

delayModel <- lm(arr_delay~dep_delay)
summary(delayModel)

## 
## Call:
## lm(formula = arr_delay ~ dep_delay)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.587  -11.005   -1.883    8.938  201.938 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.8994935  0.0330195  -178.7   <2e-16 ***
## dep_delay    1.0190929  0.0007864  1295.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.03 on 327344 degrees of freedom
## Multiple R-squared:  0.8369, Adjusted R-squared:  0.8369 
## F-statistic: 1.679e+06 on 1 and 327344 DF,  p-value: < 2.2e-16

Let’s not get ahead of ourselves, though, before we commit to this model we need to actually make sure our data meet the assuptions of the model.

Checking Model Assumptions

There are four assumptions of linear models: linearity, normal distribution of residuals, homoscedasticity (equal variance of residuals), and independence of observations.

We confirmed linearity when we looked at the scatterplot of departure delays and arrival delays. Let’s check the other three.

Normal distribution of residuals

First let’s use a histogram of the residuals to check for a normal distribution of residuals:

hist(delayModel$residuals)

Looks pretty normal, centered roughly around 0. We can also use a qq plot to check the distribution of the residuals. If they are normally distributed, we see a straight line.

qqnorm(delayModel$residuals)
qqline(delayModel$residuals)

So this the residuals of this model are not perfectly normally distributed, since you can see above that the residuals deviate from the straight line we’d expect towards the end. However, for purposes of this R guide, we’re going to call it good enough and say we’ve met this condition. Do not necessarily do this in real life.

Homoscedasticity

Next, let’s check if our residuals have equal variance and are centered around 0:

plot(delayModel$residuals~dep_delay)
abline(0,0)

The residuals are roughly cented around zero, but the variance is heteroscedastic, meaning the variance is not spreak equally for all values. There is much wider variance for smaller departure delay times than there is for larger departure delay times.

In real life this compromises our ability to represent this relationship with a linear model, but for purposes of this R guide we’re going to forge ahead.

Independence of observations

That our observations are independent isn’t something we can test in R – it’s something we need to learn about our data from a source familiar with its collection. In our case, this often means from its R documentation. There’s nothing I see in the R documentation of the flights data that suggests the observations are not independent. Thinking about the nature of flights, I suppose it is possible that some huge sports team or group of rich people chartered more than one plane for one trip which departed from an NYC airport, but since that can’t be very common I’m assuming that the vast majority of my observations are independent.

Understanding our model

Now that we’ve made sure we’re operating within the assumptions of a linear model, let’s check and see if our simple linear regression model shows a statistically significant linear relationship between departure delay and arrival delay. Just because we successfully made a model, doesn’t mean it’s a meaningful one.

Regression equation

Let’s call up the summary of our model again:

summary(delayModel)

## 
## Call:
## lm(formula = arr_delay ~ dep_delay)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.587  -11.005   -1.883    8.938  201.938 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.8994935  0.0330195  -178.7   <2e-16 ***
## dep_delay    1.0190929  0.0007864  1295.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.03 on 327344 degrees of freedom
## Multiple R-squared:  0.8369, Adjusted R-squared:  0.8369 
## F-statistic: 1.679e+06 on 1 and 327344 DF,  p-value: < 2.2e-16

From the estimate column of the coefficients section of this output, we see the coefficients of our model’s regression equation. The itercept is -5.90 and the slope is 1.02, making our model’s regression equation \[\hat{Arrival Delay}= -5.90+1.02(Departure Delay)\]

Plotting regression equation

Now that we know our regression line, let’s plot it over our data using the plot function and abline function. When using abline to plot an equation, put the intercept first, then \[\beta_1\]

plot(arr_delay~dep_delay, xlab = "Departure Delay (mins)", ylab = "Arrival Delay (mins)")
abline(-5.8994935, 1.0190929)

Checking significance

Before we can use our model to make predictions, we need to see if the model is statistically valuable. For simple linear regression, we check this by conducting a t-test

Our hypotheses

\[{H_0}: {\beta_1} = 0\] \[{H_A}: {\beta_1} \neq 0\]

What this means in words is that our null hypothesis is that departure delay has no linear relationship with arrival delay, and our alternative hypothesis is that departure delay has some linear relationship with arrival delay.

Testing our hypotheses

To test our model, we’ll use a t-test. This can be calculated by hand, technically, but we live in an age of beautiful things like R studio, so let’s use that to our advantage. All the components of a t-test of our model live in our model summary, so let’s pull that up again:

summary(delayModel)

## 
## Call:
## lm(formula = arr_delay ~ dep_delay)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.587  -11.005   -1.883    8.938  201.938 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.8994935  0.0330195  -178.7   <2e-16 ***
## dep_delay    1.0190929  0.0007864  1295.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.03 on 327344 degrees of freedom
## Multiple R-squared:  0.8369, Adjusted R-squared:  0.8369 
## F-statistic: 1.679e+06 on 1 and 327344 DF,  p-value: < 2.2e-16

To look at t-test results for our \[\beta_1\], we will navigate to the coefficients section of the summary and look at the row for dep_delay. The column t value gives us our t-value (go figure), which is 1295.8, and R then compares that to a t-distribution with 327344 degrees of freedom and let’s us know in the Pr(>|t|) column what our p-value is, which in this case is so small that R returns <2e-16. This means our p-value is certainly less than 0.05, the standard alpha level, and we can reject our null hypothesis and conclude that there is a linear relationship between departure delay and arrival delay.

Interpreting the model

Now that we’ve checked that our model is statistically significant and meets the conditions of a linear model, let’s talk about what our model means.

Since departure and arrival delays can be negative (meaning the flight left or arrived early), we can interpret our y-intercept, -5.90. What this means in the context of our model is that, if a flight left exactly on time, we would predict it would arrive 5.9 minutes early.

Our slope, 1.02, means that for every minute late a plane leaves, we would expect it to arrive 1.02 minutes late.

Confidence interval for our slope

confint(delayModel)

##                 2.5 %    97.5 %
## (Intercept) -5.964211 -5.834776
## dep_delay    1.017552  1.020634

We’re 95% confident that for every minute a flight is delayed, the arrival time delay will increase by between 1.017552 and 1.020634 minutes.

Chapter 3 R Guide: Simple Linear Regression

Jen Kochaver