We will be using the “trees” dataset that is already built into R to demonstrate the basic statistic concept of the simple linear regression model. This data set provides measurements of the girth, height, and volume of timber in 31 felled black cherry trees. The girth is measured in inches, the height is measured in feet, and the volume of timber is measured in cubic feet. Girth is the diameter of the tree measured at 4 ft 6 in above the ground.
To call up the data set, we use the data() function, followed by attach() so that we don’t have to keep referencing the set we want to use in further calculations.
data(trees)
attach(trees)
I am interested in the relationship between Girth and Volume. I will use Girth as the explanatory variable and Volume as the response.
We can plot the data using the plot function, where we include the independent variable name, the dependent variable name, and then labels for the y- and x-axis (optional) in that order.
plot(Girth,Volume,ylab="Volume (cubic ft)",xlab="Girth (in)")
We can see from this scatterplot that there is a positive association between the two variables. The points seem to follow the trend pretty closely, and there seems to be an obvious linear relationship.
The simple linear regression model assumes that the relationship between the dependent variable, Volume, and the independent variable, Girth, can be approximated by a straight line. Since there seems to be a linear relationship when looking at the scatterplot, we will create a simple linear regression model.
We can create the linear regression model using lm(response variable~explanatory variable). I’m going to name mine “mymod”.
(mymod<-lm(Volume~Girth))
##
## Call:
## lm(formula = Volume ~ Girth)
##
## Coefficients:
## (Intercept) Girth
## -36.943 5.066
To represent the fact that points are scattered around a straight line instead of lying directly on it, we use the simple linear regression model \(\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i\).
Plugging in the numbers that R calculated above for the intercept and the slope (found under Girth), our regression line for Girth and Volume is \(\hat{y_i}= -36.943 + 5.066x_i\).
Thus, for every inch increase in the girth, there is on average a 5.066 cubic ft increase in the volume of timber.
We cannot interpret the intercept because we cannot reasonably set the explanatory variable to zero; we have no data for a girth equal to zero. This makes sense, as there’s no possible way that a felled tree can have no girth whatsoever.
We will now plot the regression line on our scatterplot, using abline(intercept,slope).
plot(Girth,Volume,ylab="Volume (cubic ft)",xlab="Girth (in)")
abline(-36.943,5.066)
When we create a simple linear regression model, we make four different assumptions. We first assume that at any given value of x, the population of potential error term values has a mean equal to 0. Let’s quickly check this for our model.
To do this, we’ll sum up the residuals from our model, then calculate the mean.
myresids<-mymod$residuals
mean(myresids)
## [1] 3.044719e-16
This mean of the errors is close enough to 0 that I’d say that this assumption is fulfilled.
Let’s check the normality assumption. We assume that at any given value of x, the population of potential error term values has a normal distribution. To do this, we’ll look at a histogram of the residuals. It’s easy to do this in R using the hist() command.
hist(myresids)
Although we don’t have much data to work with, this doesn’t seem to show that the error term isn’t normally distributed. It does have the appearance of a simple bell-like curve, which is what the normal distribution looks like.
It’s hard for us to see for sure on this graph, so let’s compare to a straight line instead. We will plot the quantiles of the residuals against the quantiles of a normal distribution, using the qqnorm() and qqline() functions respectively.
qqnorm(myresids)
qqline(myresids)
If the residuals are normally distributed, they will follow a straight line. The graph above doesn’t look to bad, but there seems to be a cyclic pattern of the points being over and then being under, with the points at the ends being further off. The points in general, however, seem to be close enough to the line that we can assume that our normality assumption holds.
Another assumption we make in the simple linear regression model is homoscedasticity: that is, the different populations of potential error term values corresponding to different values of x have equal variances.
Let’s go back to our original plot to look for the constant variance assumption.
plot(Girth,Volume,ylab="Volume (cubic ft)",xlab="Girth (in)")
We want to pay attention to the vertical spread. The vertical spread should be about the same at Girth. If the data are more tightly packed at one place than another, then you have trouble. Our tree data seems to be more tightly packed at some areas than, say, at Girth values of 13 and 18.
Let’s look at one more plot to check this assumption. We’ll plot the girth vs the sum of the residuals, with a horizontal line at 0. There shouldn’t be an obvious curve, or points more tightly packed at one point, or else we have an issue.
plot(mymod$residuals ~ Girth)
abline(0,0)
There is no obvious curve here, so I would say we do not have enough evidence to conclude that there is heteroscedasticity, so our equal variance assumption holds.
The last assumption we make is that any one value of the error term is statistically independent of any other value of the error term. In order to check that assumption, we would take a look at the documentation of the data to see how the points were collected. In our case, the data should be independent if data was collected on trees throughout different areas, but possibly dependent if data was collected from trees around the same area. Since we were not given information about where the trees were located, we will not be checking that assumption here.
\(\hat{y}= b_0 +b_1 x_0\) is the point estimate of the mean value of the dependent value (in our case, Volume) when the value of the independent variable Girth is \(x_0\). In this equation, \(b_0\) and \(b_1\) are the least squares point estimates of the y-intercept \(\beta_0\) and the slope \(\beta_1\) in the simple linear regression model. Furthermore, \(x_0\) is a specified value of the independent variable \(x_1\) that is inside our cloud of data.
We are going to choose a value for the Girth to plug in for \(x_0\). This will give us \(\hat{y}\), a point prediction for an individual \(x_0\). We will then calculate the residual in order to see how close our prediction was.
Looking at the head of the data set, let’s use the fifth record for our prediction. We will use indexing to reference the values. tree[5,1] references the fifth record, first column, which happens to be Girth. The third column is Volume. We can predict our point by using matrix multiplication with the coefficient of our linear model times a vector with 1 as the first number, and the girth we want to predict for as the second number: coef(mymod)%*%c(1,myGirth).
head(trees)
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
(myGirth<-trees[5,1])
## [1] 10.7
myactualVolume<-trees[5,3]
(mypred<-coef(mymod)%*%c(1,myGirth))
## [,1]
## [1,] 17.2612
(myresid<-myactualVolume-mypred)
## [,1]
## [1,] 1.538795
We predict that at a Girth of 10.7 inches, the Volume of timber will be 17.2612 cubic ft. Looking at the residual (the last calculation), we can see that we underestimated that actual Volume by 1.538795 cubic ft.
Instead of estimating a single point for the mean Volume when Girth = 10.7, let’s instead calculate a 95% confidence interval.
To do this, we’ll create a new data frame using data.frame(Girth=whatever we want to predict for). We can then calculate a confidence interval using predict(), with the linear model, our new data frame, and a specification for which interval we want to create.
newtrees<-data.frame(Girth=10.7)
(confVolume <- predict(mymod, newtrees, interval="confidence"))
## fit lwr upr
## 1 17.2612 15.23588 19.28653
I am 95% confident that the mean value of the Volume for Girths of 10.7 inches is between 15.2359 and 19.2865 cubic ft, given by the lwr and upr bounds in our calcuation above.
We can do a similar thing for an individual value of Volume. Instead of predicting a single point when Girth = 10.7, let’s instead calculate a 95% prediction interval.
(predVolume <- predict(mymod, newtrees, interval="prediction"))
## fit lwr upr
## 1 17.2612 8.332185 26.19022
I am 95% confident that an individual value of the Volume for a Girth of 10.7 inches is between 8.332185 and 26.19022 cubic ft.
We should check the sizes of the intervals.
confVolume %*% c(0, -1, 1) #conf interval width
## [,1]
## 1 4.050641
predVolume %*% c(0, -1, 1) #pred interval width
## [,1]
## 1 17.85804
The prediction interval is much bigger than the confidence interval, which should be the case. There’s going to be a lot more variability in predicting a single point estimate, rather than estimating a mean value, so the prediction interval needs to cover a larger range.
They should also be centered at the same place - the point estimate for \(x_0\), which in our case is 17.26 cubic ft.
confVolume[1] == predVolume [1]
## [1] TRUE
They are!
We can conduct a t-test to test the significance of both the slope and the y-intercept. I will conduct a two-sided hypothesis test here to test the significance of the slope, \(\beta_1\), but the same steps can be taken in order to test the significance of the y-intercept, \(\beta_0\).
\(H_0 : \beta_1 = 0\) \(H_a : \beta_1 \neq 0\)
By using the summary function, we can a bunch of information about our linear model.
summary(mymod)
##
## Call:
## lm(formula = Volume ~ Girth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.065 -3.107 0.152 3.495 9.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
## Girth 5.0659 0.2474 20.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
## F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
I can get the test stat from our t-test and the p-value from this summary. The test stat can be found under t value in the Girth row, while the p-value is found in the same row under Pr(>|t|).
test stat: 20.48 p-value: < 2e-16 (this p-value is based on 29 degrees of freedom)
Thus, I have sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis at pretty much any significance level, since the p-value is essentially 0. In other words, this p-value shows us that Girth and Volume of a felled black cherry tree have a linear relationship.
R can also calculate a confidence interval for us.
confint(mymod)
## 2.5 % 97.5 %
## (Intercept) -43.825953 -30.060965
## Girth 4.559914 5.571799
I am 95% confident that the true slope \(Beta_1\) lies between [4.559914, 5.571799]. Thus, I am 95% confident that an inch increase in the {Girth} will increase the Volume between 4.5599 and 5.5718 cubic ft.
We can also conduct an F-test to test the significance of the slope. In SLR, it gives us the same results that our test above gave us.
For the F-test, we calc an overall F-statistic using explained variation / [unexplained variation / (n - 2)], where explained variation is the variation that is explained by our linear model. However, the F-test p-value can also be pulled from the model summary, which is much easier to do. From above, we see that the F-test p-value is < 2.2e-16, which is the same as the p-value we found earlier. Thus, we come to the same conclusion of rejecting the null hypothesis, showing us that Girth and Volume of felled black cherry trees have a linear relationship.
You can calculate the mean square error by hand, but it’s much easier to pull off of the summary of our model. Let’s do that now.
Looking previously at the summary of our linear model, we can see that our MSE is 4.252 on 29 degrees of freedom; this number is found as the Residual standard error. The degrees of freedom makes sense, as the degrees of freedom should equal the number of observations minus 2 (for the terms with Beta) - and there are 31 observations! This MSE is the point estimate of \(\sigma^2\), our variance.
The simple coefficient of determination, \(r^2\), gives us the proportion of explained variation to total variation. This number can also be pulled from the model.
\(r^2\), or Multiple R-squared, is 0.9353. Since this number is close to 1, this shows us that the linear model explains a lot of the variation in the 31 observed values of Volume.
We can also calculate the simple correlation coefficient using the cor() function.
cor(Volume, Girth)
## [1] 0.9671194
This number shows us that the linear relationship between Girth and Volume is strong, as the number is close to 1. It’s also a positive number, which reinforces what we already knew about Girth and Volume having a positive relationship.