This R guide will cover material from chapter 3, called Simple Linear Regression.
In chapter 3, we began by talking about simple linear regression models. Simple linear regression describes the relationship between an independent variable, x, and a dependent variable, y, by looking at a scatterplot of the data and approximating a straight line through the points. The equation for the model is: \[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
In this equation, beta(zero) + beta(one)x is the mean value of the dependent variable y at a given value of the independent variable x. Beta(zero) is the y-intercept, or the mean value of y when x equals 0. Beta(one) is the slope: the change in the mean value of y with every one-unit increase in x. Epsilon is the error term, which accounts for things other than x that affect y.
To create this model in R, we must first select a data set. We are going to use the built-in cars data set, which lists the speed of cars (mph) and the distances (ft) taken to stop:
data(cars)
attach(cars)
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
In this example, both of our variables are quantitative variables since they are described using numbers rather than words.
We are now going to create a scatterplot of the speed of cars vs. their stopping distances to view the relationship between these two variables. In this example, we will say that the stopping distance depends on the speed, so the stopping distance is the dependent variable (y) and the speed is the independent variable (x). To create a scatterplot, we use the plot function with the predictor variable as the first argument and the response variable as the second. We can also label our axes using the ylab and xlab arguments and put a title on the graph using the main argument:
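plot(speed, dist, ylab = "Car Stopping Distance", xlab = "Speed", main = "Car Speed vs. Stopping Distance")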
We now want to find the line that describes the relationship between our two variables, the line given by the simple linear regression model. We fit this regression line using the lm function with a formula of the form response ~ predictor, as follows:
mymod <- lm(dist ~ speed)
mymod
##
## Call:
## lm(formula = dist ~ speed)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
After we run our regression model, we can interpret our results. Our regression equation from these results is y = -17.579 + 3.932x. With a slope (beta(one)) of 3.932, we say that with every one-unit (mph) increase in speed, our stopping distance increases by 3.932 feet.
In this case, we will not interpret our y-intercept (beta(zero)) because it would not make sense that there would be a stopping distance with a speed of 0. You would not have a stopping distance if you were not moving to begin with.
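As a side note, if we want to extract the estimated intercept and slope from the model object directly, rather than reading them off the printed output, we can use the coef() function:
# Named vector of the estimated coefficients (intercept and slope)
coef(mymod)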
We will now add our regression line to the plot by first plotting the data and then using the abline function. The abline function takes two arguments, the values of the intercept and the slope that we got from our regression model.
plot(speed, dist, ylab = "Car Stopping Distance", xlab = "Speed", main = "Car Speed vs. Stopping Distance")
abline(-17.579,3.932)
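Note that abline() can also accept the fitted model object itself, so abline(mymod) would draw the same line without us typing the coefficients by hand.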
So, the relationship between speed and stopping distance is best described by the line seen here. The equation of this line is the same as our simple linear regression model: y = -17.579 + 3.932x.
When we create regression models, we make multiple assumptions about the error term. For example, we assume that the error term is normally distributed. We can check this assumption by creating a histogram of the residuals and checking its shape:
myresids <- mymod$residuals
hist(myresids)
Because the histogram looks reasonably close to a bell-shaped, normal distribution, we can fairly say that our error terms follow a normal distribution. If that is hard to judge from the histogram alone, we can also compare the residuals to a straight line using a normal Q-Q plot:
qqnorm(myresids)
qqline(myresids)
Because the data points follow the straight line pretty closely, the assumption checks out in this case.
Another assumption we make when we create regression models is homoscedasticity, meaning that the variance of the residuals will be equal for all predictor values. We can check this assumption by plotting our residuals against our independent variable (speed, in this case):
plot(mymod$residuals ~ speed)
abline(0,0)
This looks pretty good because the vertical spread of the residuals is about the same across all speeds; there is no obvious funneling or other pattern.
Another assumption we have about the error term is that all error terms are independent of each other. Intuitively, this makes sense here because the observations were randomly selected, which suggests independence.
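The chapter does not give a formal check for this assumption, but one informal diagnostic (our own addition here, not from the text) is to plot the residuals in the order the observations appear in the data and look for patterns, such as long runs above or below zero, that would suggest dependence:
# Residuals in observation order; a random scatter around zero supports independence
plot(mymod$residuals, type = "b", xlab = "Observation order", ylab = "Residuals")
abline(0, 0)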
The mean square error (MSE) is our point estimate of the variance of the error term population. Although we can calculate it by hand, we will get it from the summary function:
summary(mymod)
##
## Call:
## lm(formula = dist ~ speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The residual standard error, 15.38, is the square root of the MSE. Squaring it gives our MSE, so our estimate of the error variance is 15.38 squared, or about 236.5.
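We can confirm this in R: summary() stores the residual standard error as sigma, and squaring it (or, equivalently, dividing the residual sum of squares by the 48 degrees of freedom) gives the MSE:
# Residual standard error squared = MSE, our estimate of the error variance
summary(mymod)$sigma^2
# Equivalent calculation: SSE / (n - 2)
sum(mymod$residuals^2) / (length(speed) - 2)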
Another way we can examine the relationship between x and y is to look at the correlation between them. We use the variable r to represent correlation. In our case, this will measure the strength and direction of the linear relationship between the speed and stopping distance. We use the cor() function followed by our two variables:
cor(speed, dist)
## [1] 0.8068949
Because this is positive and pretty close to 1, there is a strong positive correlation between the two variables. The closer to 1, the more strongly correlated two variables are. And because they are positively correlated, this means that they move in the same direction: as the speed increases, the stopping distance increases.
We can find our correlation in another way, though. Let’s look at our summary again:
summary(mymod)
##
## Call:
## lm(formula = dist ~ speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
At the bottom of our summary, we find our r2 value (labeled Multiple R-squared) of 0.6511. If we take the square root of this, we get our correlation, r. The quantity r2 is the simple coefficient of determination, which is the proportion of the total variation in our response, y, that is explained by its linear relationship with x. This value can only be between 0 and 1, and the closer it is to 1, the more of the variation in y the model explains and the more accurate its predictions tend to be. Because our r2 value is 0.6511, speed explains about 65% of the variation in stopping distance, so we consider speed a reasonably good predictor of stopping distance.
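We can check this in R by taking the square root of the stored R-squared value, which matches the correlation we computed earlier:
sqrt(summary(mymod)$r.squared)
## [1] 0.8068949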
We can also perform hypothesis tests based on the summary of our model. Let’s look at it one more time:
summary(mymod)
##
## Call:
## lm(formula = dist ~ speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
In an F-test, we test the significance of the relationship between our x and y variables, so between speed and stopping distance in this case. In this test, our null hypothesis is that our slope is 0, meaning that variables x and y (speed and stopping distance) have no relationship. Our alternative hypothesis is that our slope is something other than 0, meaning they do have a relationship. Our summary function has already run the test for us, and we can find our test statistic in the bottom row: 89.57. We can also see that with a p-value of 1.49e-12, we can safely reject the null hypothesis and conclude that there is a relationship between speed and stopping distances. Note that our degrees of freedom in this scenario is 48 because we had 50 observations and estimated 2 parameters (the intercept and the slope), and 50 - 2 = 48.
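If we want to see this F-test laid out as an ANOVA table, with the sums of squares and degrees of freedom shown explicitly, we can also run:
# ANOVA table for the regression; reports the same F statistic and p-value as summary()
anova(mymod)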
We can also perform a t-test with this information, but for simple linear regression, our t-test and F-test are equivalent. We have the same null hypothesis that our slope is zero and the two variables have no relationship, and our alternative hypothesis is still that the slope is something other than zero and the two variables do have a relationship. We find our test statistic at the intersection of the row labeled “speed” and the column labeled “t value” and get a value of 9.464. If we pay close attention, we will notice that this is the square root of the test statistic in our F-test; this is because the square of a t statistic with 48 degrees of freedom follows an F distribution with 1 and 48 degrees of freedom, so in simple linear regression the two tests give the same answer. So we get the same p-value of 1.49e-12 and can reject our null hypothesis in favor of the alternative hypothesis, concluding that there is a relationship between speed and stopping distances.
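We can verify this relationship directly by squaring the t statistic for speed and comparing it to the F statistic stored by summary():
# Square of the slope's t statistic...
summary(mymod)$coefficients["speed", "t value"]^2
# ...should match the F statistic reported by summary() (89.57)
summary(mymod)$fstatistic["value"]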
We can also create a confidence interval for our slope using the confint() function, specifying the confidence level we want with the level argument, as seen below:
confint(mymod, level = .95)
## 2.5 % 97.5 %
## (Intercept) -31.167850 -3.990340
## speed 3.096964 4.767853
The confidence intervals for both the y-intercept and the slope are given above. We see the confidence interval for our y-intercept is (-31.16785, -3.990340), but because it does not make sense to interpret the y-intercept in this example, we will not interpret the confidence interval. The confidence interval for the slope is (3.096964, 4.767853). This means that we are 95% confident that as the speed increases by one unit, the stopping distance will increase by an amount between 3.096964 and 4.767853. Or, in other words, we are 95% confident that the true slope is between 3.096964 and 4.767853.
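To see where these numbers come from, we can rebuild the interval for the slope by hand from the estimate and standard error in the summary output, using the t critical value with 48 degrees of freedom (this is just a sketch of the underlying formula, not part of the chapter's code):
# Slope estimate +/- t critical value * standard error of the slope
3.9324 + c(-1, 1) * qt(0.975, df = 48) * 0.4155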
We can create prediction and confidence intervals for the stopping distance when the speed is 30 mph:
newdata <- data.frame(speed = 30)
predy <- predict(mymod, newdata, interval = "prediction")
confy <- predict(mymod, newdata, interval = "confidence")
confy
## fit lwr upr
## 1 100.3932 87.43543 113.3509
predy
## fit lwr upr
## 1 100.3932 66.86529 133.921
Our confidence interval says that when the speed is fixed at 30 mph, we can be 95% confident that the mean stopping distance lies between 87.43543 and 113.3509 ft. The prediction interval says that when the speed is fixed at 30 mph, we can be 95% confident that the stopping distance of an individual car lies between 66.86529 and 133.921 ft.
Intuitively, our prediction interval should be wider because there is more variability in predicting an individual observation than in estimating a mean. We can check this with the following:
confy %*% c(0, -1, 1)
## [,1]
## 1 25.91548
predy %*% c(0, -1, 1)
## [,1]
## 1 67.05575
Our intuition is correct; the length of our confidence interval is 25.91548 and the length of our prediction interval is 67.05575.
That’s all; you’re done!