Simple Linear Regression

Simple linear regression models the relationship between two variables, a predictor variable and a response variable. The model assumes the relationship between the two can be approximated by a straight line. The simple regression equation can be written as:

\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\]

For this guide I will be using the “manatees” data from the “resampledata” package. It gives the number of manatees killed by boats in Florida counties and the number of registered boats in the counties. Let’s call up the data in R:

library(resampledata)
## 
## Attaching package: 'resampledata'
## The following object is masked from 'package:datasets':
## 
##     Titanic
data(manatees)
attach(manatees)

The first thing we want to do is make a scatterplot of the two variables to see what we’re working with. I chose number of boats as the predictor and manatees killed as the response. Sometimes an argument can be made either way, but I doubt manatee deaths cause a rise in boat owners. Let’s see what the plot looks like:

plot(boats,killed,ylab="Manatees Killed by Boats", xlab="Number of Boats")

There is a positive association between the two variables. The points are neither tightly clustered nor widely scattered; they are fairly evenly spread. It is easy to visualize where the regression line will fall on the plot.

Let’s create the regression model to actually fit this line in the plot.

mod <- lm(killed~boats)
mod
## 
## Call:
## lm(formula = killed ~ boats)
## 
## Coefficients:
## (Intercept)        boats  
##    -35.1276       0.1126

Now we have the intercept and slope of our line and can graph it over the scatterplot.

plot(boats,killed,ylab="Manatees Killed by Boats", xlab="Number of Boats")
abline(-35.1276, 0.1126)
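As a side note, abline() also accepts a fitted lm object directly, which avoids retyping the rounded coefficients:

abline(mod)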

Interpreting the Linear Regression

Once we have the linear regression it’s important to interpret the slope and intercept, if possible.

In order to interpret the intercept, we must be able to reasonably set the explanatory variable to zero. In this case, zero registered boats lies far outside the range of the data and does not make logical sense in context. The data were gathered between 1982 and 2000, and it would be very surprising if a coastal state like Florida had no registered boats.

We can interpret the slope as the change in the response associated with a one-unit increase in the predictor: for every additional boat registered, the predicted number of manatees killed increases by 0.1126. For example, 1000 additional registered boats correspond to roughly 113 more predicted manatee deaths (0.1126 × 1000 ≈ 112.6).

Prediction and Residuals

Prediction simply involves plugging a value of the predictor variable into the regression equation. Let’s choose an entry in the data and predict the number of manatees killed at that number of boats. I’m choosing the 12th entry in the set:

manatees[12,"boats"]
## [1] 675

In the 12th entry there were 675 boats. To use our model to predict the number of manatees killed we can simply plug the value into our regression equation:

 -35.1276 + 0.1126*675
## [1] 40.8774
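Rather than retyping the rounded coefficients, we can pull them from the model with coef(); this small sketch should match the hand calculation up to rounding:

b <- coef(mod)            # intercept and slope from the fitted model
unname(b[1] + b[2]*675)   # predicted manatee deaths at 675 boats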

Residuals are important when it comes to making predictions. A residual is the observed value minus the fitted value: \[e_i = y_i - \hat{y}_i\] In general, if a point estimate is good, the predicted response will be close to the observed response and the residual will be fairly small.

The number of manatees killed when there were 675 boats in the 12th entry is:

manatees[12,"killed"]
## [1] 43

Thus the residual for our calculation of manatee deaths at 675 boats would be:

43 - 40.8774
## [1] 2.1226

This is the vertical distance between the actual data point in the 12th entry and the regression line.
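We can confirm this against the residual stored in the model object; it may differ slightly from the hand calculation because we rounded the coefficients:

mod$residuals[12]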

Checking Model Assumptions

In order to perform hypothesis tests and construct various types of intervals with the simple regression model, we need to make certain assumptions about the errors.

Normality Assumption

One of these assumptions is that the errors have a normal distribution. We can get all of the residuals using some R code:

myresids <- mod$residuals

A quick way of looking for normality is to make a histogram and check for a normal curve:

hist(myresids)

It doesn’t appear to be an exact normal distribution, but it’s not too far off. However, it’s difficult for humans to compare data against a curve. It’s much easier to check against a straight line, so let’s plot the residuals against the quantiles of a normal distribution in a normal quantile-quantile (Q-Q) plot:

qqnorm(myresids)
qqline(myresids)

This doesn’t look too far off; none of the points stray far from the line. There is no obviously severe violation of this assumption.
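For a formal complement to these visual checks, one option is the Shapiro-Wilk normality test from base R; a large p-value indicates no evidence against normality:

shapiro.test(myresids)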

Constant Variance Assumption

Another assumption under the model is that the variance of the errors is the same for all predictor values. This is also called homoscedasticity.

A quick check of this assumption is to look at the original scatterplot of the data:

plot(boats,killed,ylab="Manatees Killed by Boats", xlab="Number of Boats") 

Here we are looking at the vertical spread at different numbers of boats. If the data were much more tightly packed in one spot than another, that would be an obvious violation of the assumption. Our data look fine, though; the vertical spread is roughly the same across the plot.

There is another plot that can check the constant variance assumption. Let’s plot the residuals against the number of boats, which makes it easier to compare variances:

plot(mod$residuals ~ boats)
abline(0,0)

There are no obvious signs of heteroscedasticity.
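R’s built-in regression diagnostics give a similar view; calling plot() on the model with which = 1 draws the residuals against the fitted values, which serves the same purpose:

plot(mod, which = 1)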

Mean Square Error

Later on we will need point estimates of the constant error variance and of the standard deviation of the error terms. The point estimate of the variance is called the mean square error (MSE), where SSE is the sum of squared residuals and n is the number of observations:

\[MSE = \frac{SSE}{n-2}\]

Taking the square root of the MSE gives s, the residual standard error. We can calculate it by hand; here n = 18, so n-2 = 16:

sqrt(sum((mod$residuals)^2)/16)
## [1] 5.40952

But R will do the work for us, and the value can be found in the summary of the linear model:

summary(mod)
## 
## Call:
## lm(formula = killed ~ boats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.500  -2.473   1.383   3.834   7.500 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -35.12756    7.70720  -4.558 0.000323 ***
## boats         0.11261    0.01264   8.912 1.33e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.41 on 16 degrees of freedom
## Multiple R-squared:  0.8323, Adjusted R-squared:  0.8218 
## F-statistic: 79.42 on 1 and 16 DF,  p-value: 1.331e-07
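The residual standard error of 5.41 matches our hand calculation. It can also be pulled out of the summary object programmatically, and squared to recover the MSE:

s <- summary(mod)$sigma   # residual standard error
s^2                       # mean square error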

Testing the Significance of the Slope and Intercept

A regression model is not going to be very useful unless there is a significant relationship between the predictor and the response. To judge the significance of this relationship, we test a null hypothesis against an alternative hypothesis:

\[H_0: \beta_i = 0\] \[H_A: \beta_i \neq 0\]

The null hypothesis says there is no change in the mean value of the response associated with an increase in the predictor. The alternative hypothesis says there is a positive or negative change in the mean value of the response associated with an increase in the predictor. It can be reasonably concluded that the predictor is significantly related to the response if we have evidence to reject the null hypothesis in favor of the alternative.

The test statistic we will use is: \[t = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)}\]

The test statistic follows a t-distribution with n-2 degrees of freedom.

If the p-value corresponding to the test statistic is smaller than the significance level, it is reasonable to reject the null hypothesis and conclude there is a significant relationship. The p-values can be found with the summary function. Let’s check them for the manatees linear model we made earlier:

summary(mod)
## 
## Call:
## lm(formula = killed ~ boats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.500  -2.473   1.383   3.834   7.500 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -35.12756    7.70720  -4.558 0.000323 ***
## boats         0.11261    0.01264   8.912 1.33e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.41 on 16 degrees of freedom
## Multiple R-squared:  0.8323, Adjusted R-squared:  0.8218 
## F-statistic: 79.42 on 1 and 16 DF,  p-value: 1.331e-07
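As a sanity check, we can reproduce the slope’s t statistic and two-sided p-value from the printed estimate and standard error; this sketch uses the rounded values from the table, so small discrepancies from the printed output are expected:

t_stat <- 0.11261 / 0.01264                       # estimate / standard error
2 * pt(abs(t_stat), df = 16, lower.tail = FALSE)  # two-sided p-value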

At a significance level of .05, the p-values corresponding to both the intercept and the slope are smaller than the significance level. It is therefore reasonable to conclude that both coefficients are significantly different from zero.

Note that it is common practice to include the intercept in regression models whether or not its null hypothesis is rejected. Most often, the mean value of the response is not 0 when the predictor is equal to 0.

Confidence Intervals for Slope

In addition to testing the significance of the slope, it can be useful to calculate a confidence interval. The equation for a 100(1-alpha)% confidence interval is:

\[\hat{\beta}_1 \pm t_{\alpha/2} \, SE(\hat{\beta}_1)\]

A handy R function can easily find this interval:

confint(mod)
##                    2.5 %      97.5 %
## (Intercept) -51.46609534 -18.7890342
## boats         0.08582122   0.1393946

The default output is a 95% CI; we can change this with the level argument. This interval says we are 95% confident that, for each additional registered boat, the mean number of manatees killed increases by at least 0.08582 and at most 0.1394.
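For instance, requesting a 99% interval just requires specifying level:

confint(mod, level = 0.99)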

Confidence and Prediction Intervals for the Response

Recall the point on the regression line corresponding to a particular value of the predictor variable is:

\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\]

Unless we are very lucky, y-hat will not exactly equal either the mean value of y at a given x, or an individual observed value of y at that x. Therefore, we need to place bounds on how far y-hat might be from these values.

A 100(1-alpha)% confidence interval for the mean value of y is:

\[\hat{y} \pm t_{\alpha/2} \, s \sqrt{\text{Distance Value}}\]

A 100(1-alpha)% prediction interval for an individual value of y is:

\[\hat{y} \pm t_{\alpha/2} \, s \sqrt{1 + \text{Distance Value}}\]

Let’s apply these concepts to the manatees data. First we need to choose a value for number of boats to create these intervals around. Let’s use 675 again and create a new data frame:

newdata <- data.frame(boats = 675)

A confidence interval for the mean number of manatees killed when the number of boats is 675:

confy <- predict(mod, newdata, interval="confidence")
confy
##        fit      lwr      upr
## 1 40.88279 37.53951 44.22607

The default level for the predict function is a 95% interval. We can interpret this result as follows: we are 95% confident that the mean number of manatees killed when there are 675 registered boats is between 37.54 and 44.23.

Similarly, we can calculate a prediction interval for an individual value of manatees killed when the number of boats is 675:

predy <- predict(mod, newdata, interval="prediction")
predy
##        fit      lwr      upr
## 1 40.88279 28.93771 52.82787

We can interpret this as: we are 95% confident that the number of manatees killed in a future individual observation with 675 registered boats is between 28.94 and 52.83.

Notice that the prediction interval is wider than the confidence interval. We can check that they are both centered at the same value:

confy[1] == predy[1]
## [1] TRUE

They are both centered at y-hat = 40.88. Why is the prediction interval wider? Looking at the formulas, the prediction interval has an extra 1 under the radical, which accounts for the additional uncertainty from the error term of a single new observation.

Correlation

One measure of the usefulness of a simple linear regression model is the simple coefficient of determination, denoted r squared. It is defined as:

\[r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}}\]

It is the proportion of the variation in the response that is explained by its linear relationship with the predictor. This value can be found with the summary function, or the simple correlation coefficient r can be found with another function:

cor(boats,killed)
## [1] 0.9123163
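Squaring this correlation coefficient should recover the Multiple R-squared reported by summary():

cor(boats, killed)^2   # ≈ 0.8323, matching the summary output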

The F-Test

The F-test assesses the significance of the regression relationship between the response and the predictor. For simple linear regression, this is another way to test the null hypothesis that there is no significant relationship against the alternative hypothesis that the relationship is significant.

The F-statistic is defined as:

\[F = \frac{\text{Explained Variation}}{\text{Unexplained Variation}/(n-2)}\]

As with the t-test, if the p-value corresponding to the F-statistic is less than the significance level, the null is rejected in favor of the alternative. The F-statistic and the corresponding p-value appear in the summary output. Let’s check them for the manatees linear model:

summary(mod)
## 
## Call:
## lm(formula = killed ~ boats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.500  -2.473   1.383   3.834   7.500 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -35.12756    7.70720  -4.558 0.000323 ***
## boats         0.11261    0.01264   8.912 1.33e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.41 on 16 degrees of freedom
## Multiple R-squared:  0.8323, Adjusted R-squared:  0.8218 
## F-statistic: 79.42 on 1 and 16 DF,  p-value: 1.331e-07

The p-value is very small. At a significance level of .05, we reject the null in favor of the alternative. It is reasonable to conclude that there is a significant relationship between the number of boats and the number of manatees killed by boats.
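As a final cross-check, in simple linear regression the F-statistic is the square of the slope’s t-statistic, so squaring the t value from the coefficient table should land on the reported F of 79.42 (up to rounding):

8.912^2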