Introduction

We will be using the “trees” dataset that is already built into R to demonstrate the basic statistic concept of the simple linear regression model. This data set provides measurements of the girth, height, and volume of timber in 31 felled black cherry trees. The girth is measured in inches, the height is measured in feet, and the volume of timber is measured in cubic feet. Girth is the diameter of the tree measured at 4 ft 6 in above the ground.

To call up the data set, we use the data() function, followed by attach() so that we don’t have to keep referencing the set we want to use in further calculations.

data(trees)
attach(trees)

I am interested in the relationship between the girth and the volume of the felled black cherry trees. I will use Girth as the explanatory variable and Volume as the response.

Exploratory Data Analysis

summary(trees)
##      Girth           Height       Volume     
##  Min.   : 8.30   Min.   :63   Min.   :10.20  
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
##  Median :12.90   Median :76   Median :24.20  
##  Mean   :13.25   Mean   :76   Mean   :30.17  
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
##  Max.   :20.60   Max.   :87   Max.   :77.00

We can see that Girth ranges from 8.3 to 20.6 inches, with the average being 13.25 inches, while Volume ranges from 10.2 to 77 cubic ft, with an average of 30.17 cubic ft.

Girth & Volume Scatterplot

We can plot the data using the plot function, where we include the independent variable name, the dependent variable name, and then labels for the y- and x-axis (optional) in that order.

plot(Girth,Volume,ylab="Volume (cubic ft)",xlab="Girth (in)")

We can see from this scatterplot that there is a positive association between the two variables. The points seem to follow the trend pretty closely, and there seems to be an obvious linear relationship.

The simple linear regression model assumes that the relationship between the dependent variable, Volume, and the independent variable, Girth, can be approximated by a straight line. Since there seems to be a linear relationship when looking at the scatterplot, we will create a simple linear regression model, which will depict this relationship.

The Simple Linear Regression Model

We can create the linear regression model using lm(response variable~explanatory variable). I’m going to name mine “mymod.”"

(mymod<-lm(Volume~Girth))
## 
## Call:
## lm(formula = Volume ~ Girth)
## 
## Coefficients:
## (Intercept)        Girth  
##     -36.943        5.066

To represent the fact that points are scattered around a straight line instead of lying directly on it, we use the simple linear regression model \(\hat{Volume_i}= \hat{\beta_0}+\hat{\beta_1}*Girth_i\).

Plugging in the numbers that R calculated above for the intercept and the slope (found under Girth), our regression line for Girth and Volume is \(\hat{Volume_i}= -36.943 + 5.066*Girth_i\). This equation is the point estimate of the mean value of the volume of a felled black cherry tree when the girth is equal to \(Girth_i\).

Thus, for every inch increase in the girth, there is on average a 5.066 cubic ft increase in the volume of timber.

We cannot interpret the intercept because we cannot reasonably set the explanatory variable to zero; we have no data for a girth equal to zero. This makes sense, as there’s no possible way that a felled tree can have no girth whatsoever.

We will now plot the regression line on our scatterplot, using abline(intercept,slope), and the same plot that we saw earlier.

plot(Girth,Volume,ylab="Volume (cubic ft)",xlab="Girth (in)")
  abline(-36.943,5.066)

Checking Model Assumptions

Mean of the Errors Equals 0

When we create a simple linear regression model, we make four different assumptions. We first assume that at any given value of x, the population of potential error term values has a mean equal to 0. When we create a simple linear regression model in R, by design, R makes the model so that the mean of the errors is equal to 0, so we do not need to check that assumption here.

Normality

Let’s check the normality assumption. We assume that at any given value of x, the population of potential error term values has a normal distribution. To do this, we’ll look at a histogram of the residuals. It’s easy to do this in R using the hist() command.

hist(mymod$residuals, xlab="Residuals", main = NULL)

Although we don’t have much data to work with, this doesn’t seem to show that the error term isn’t normally distributed. It does have the appearance of a simple bell-like curve, which is what the normal distribution looks like.

It’s hard for us to see for sure on this graph, so let’s compare to a straight line instead. We will plot the quantiles of the residuals against the quantiles of a normal distribution, using the qqnorm() and qqline() functions respectively.

qqnorm(mymod$residuals)
qqline(mymod$residuals)

If the residuals are normally distributed, they will follow a straight line. The graph above doesn’t look too bad, but there seems to be a cyclic pattern of the points being over and then being under, with the points at the ends being further off. If a point is over the line, then we have underpredictd; if a point is under the line, then we have overpredicted. Our points in general, however, seem to be close enough to the line that we can assume that our normality assumption holds.

Homoscedasticity

Another assumption we make in the simple linear regression model is homoscedasticity, or equal variance. In other words, at any given girth, we’re going to have the same variability in the volume of the tree.

Let’s go back to our original plot to quickly check for the constant variance assumption.

plot(Girth,Volume,ylab="Volume (cubic ft)",xlab="Girth (in)")

We want to pay attention to the vertical spread. The vertical spread should be about the same at any given girth. If the data are more tightly packed at one place than another, then you have trouble. Although this data is pretty sparse, from what we can see on this plot, there isn’t any obvious evidence of heterscedasticity.

Let’s look at one more plot to check this assumption. We’ll plot the girth vs the sum of the residuals, with a horizontal line at 0. There shouldn’t be an obvious curve, or points more tightly packed at one point, or else we have an issue.

plot(mymod$residuals ~ Girth, ylab="Residuals")
  abline(0,0)

It seems to me that we are underpredicting for the smaller girths, overpredicting for the middle range of girths, and again underpredicting near the higher girths. If we have a curve around our horizontal line at 0 like this, it might mean that a regression model other than a line might be more suited for our data, like a quadratic regression. We will not be covering those regression models in this R guide. The curve in our plot above is visible, but not terribly obvious, so we will continue with our analysis, but we will also take our results with a grain of salt.

Independent Errors

The last assumption we make is that any one value of the error term is statistically independent of any other value of the error term. In order to check that assumption, we would take a look at the documentation of the data to see how the points were collected. In our case, the data should be independent if data was collected on trees throughout different areas, but possibly dependent if data was collected from trees around the same area. Since we were not given information about where the trees were located, we will not be checking that assumption here.

Point Predictions

We will now predict the volume for an individual value of the girth using our model, and see how that compares to the actual volume for that girth.

To refresh, the point estimate of the mean volume, calculated above, is \(\hat{Volume_i}= -36.943 + 5.066*Girth_i\). We are going to choose a value for the girth to plug in for \(Girth_i\). This will give us \(\hat{Volume_i}\), a point prediction for \(Girth_i\). We will then calculate the residual in order to see how close our prediction was.

Let’s look at the head of our data set to choose an observation to predict.

head(trees)
##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7

Let’s use the fifth record. We will use indexing to reference the values.

tree[5,1] references the fifth record, first column, which happens to be Girth. The third column is Volume. We put parentheses around our code to immediately show the calculation.

(myGirth<-trees[5,1])
## [1] 10.7
(myactualVolume<-trees[5,3])
## [1] 18.8

We can predict our point by using matrix multiplication with the coefficient of our linear model times a vector with 1 as the first number, and the girth we want to predict for as the second number: coef(mymod)%*%c(1,myGirth).

(mypred<-coef(mymod)%*%c(1,myGirth))
##         [,1]
## [1,] 17.2612
(myresid<-myactualVolume-mypred)
##          [,1]
## [1,] 1.538795

We predict that at a girth of 10.7 inches, the volume of timber will be 17.2612 cubic ft. Looking at the residual (the last calculation), we can see that we underestimated that actual Volume by 1.538795 cubic ft.

Confidence and Prediction Intervals for Volume

Confidence Interval

Instead of estimating a single point, let’s instead calculate a 95% confidence interval for the mean volume of felled black cherry trees when girth is equal to 10.7 .

To do this, we’ll create a new data frame using data.frame(). In the argument, we specify what girth we want to predict for by using Girth=10.7. We can then calculate a confidence interval using predict(), with the linear model, our new data frame, and a specification for which interval we want to create.

newtrees<-data.frame(Girth=10.7)
(confVolume <- predict(mymod, newtrees, interval="confidence"))
##       fit      lwr      upr
## 1 17.2612 15.23588 19.28653

For a tree with a girth of 10.7 inches, I am 95% confident that the mean volume of the tree is between 15.2359 and 19.2865 cubic ft, given by the lwr and upr bound results in our calcuation above.

Prediction Interval

We can do a similar thing for an individual value of Volume. Instead of predicting a single point when Girth = 10.7, let’s instead calculate a 95% prediction interval.

(predVolume <- predict(mymod, newtrees, interval="prediction"))
##       fit      lwr      upr
## 1 17.2612 8.332185 26.19022

For an individual tree with a girth of 10.7 inches, I am 95% confident that its volume is between 8.332185 and 26.19022 cubic ft.

Hypothesis Tests for the Slope

T-Test

We can conduct a t-test to test the significance of the linear relationship between the girth and volume of felled black cherry trees; in other words, we are testing the significance of the slope. If we find that it is not significant, we do not have evidence that there is a linear relationship between the two.

I will conduct a two-sided hypothesis test here to test the significance of the slope, \(\beta_1\).

\(H_0 : \beta_1 = 0\)

\(H_a : \beta_1 \neq 0\)

By using the summary function, we can see a bunch of information about our linear model.

summary(mymod)
## 
## Call:
## lm(formula = Volume ~ Girth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.065 -3.107  0.152  3.495  9.587 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
## Girth         5.0659     0.2474   20.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331 
## F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16

I can get the test stat from our t-test and the p-value from this summary. The test stat can be found under t value in the Girth row, while the p-value is found in the same row under Pr(>|t|). The test stat is 20.48, and the p-value is smaller than \(2*10^{-16}\). This p-value is based on 29 degrees of freedom, which is number of observations - number of predictors - 1. In our case, we have 31 observations, and only one predictor since we are doing simple linear regression, which gives us 29 degrees of freedom.

Thus, I have sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis at pretty much any significance level, since the p-value is essentially 0. In other words, this p-value shows us that the girth and the volume of a felled black cherry tree have a linear relationship.

F-Test

We can also conduct an F-test to test the significance of the slope. In SLR, it gives us the same results that our test above gave us.

For the F-test, we calculate an overall F-statistic using explained variation and unexplained variation. Explained variation is the variation that can be explained by our linear model, and the relationship between the girth and volume of felled black cherry trees. The more explained variation there is, the higher the test statistic is, causing the p-value to be smaller. This makes us want to reject the null in favor of the alternative, and shows us that the two variables have a linear relationship - which we would think if the explained variation is high!

The F-test p-value can be pulled from the model summary, which is much easier to do. From the summary above, we see that the F-test p-value is less than \(2*10^{-16}\), which is the same as the p-value we found earlier. Thus, we come to the same conclusion of rejecting the null hypothesis, showing us that the girth and volume of felled black cherry trees have a linear relationship.

Confidence Interval for the Slope

R can also calculate a confidence interval for the slope for us. The default is 95%.

confint(mymod)
##                  2.5 %     97.5 %
## (Intercept) -43.825953 -30.060965
## Girth         4.559914   5.571799

I am 95% confident that the true slope \(\beta_1\) lies between 4.5599 and 5.5718. Thus, I am 95% confident that on average, an inch increase in the girth of a tree will increase the tree’s volume between 4.5599 and 5.5718 cubic ft.

Mean Square Error

Mean Squared Error, or MSE, is the point estimate of \(\sigma^2\), our variance. You can calculate the mean square error by hand, but it’s much easier to pull off of the summary of our model. Let’s do that now.

Looking previously at the summary of our linear model, we can see that our MSE is 4.252 on 29 degrees of freedom; this number is found as the Residual standard error.

Coefficient of Determination

The simple coefficient of determination, \(r^2\), gives us the proportion of explained variation in our model to the total variation of our model (explained variation + unexplained variation). This number can also be pulled from the model.

\(r^2\), or Multiple R-squared, is 0.9353. Since this number is close to 1, this shows us that the linear model explains a lot of the variation in the 31 observed values of volume; in other words, 93.53% of the variation is explained by the linear relationship between volume and girth.

We can also calculate the simple correlation coefficient using the cor() function.

cor(Volume, Girth)
## [1] 0.9671194

This number shows us that the linear relationship between Girth and Volume is strong, as the number is close to 1. It’s also a positive number, which reinforces what we already knew about the girth and the volume having a positive relationship.