In this part we will work the Hardwood dataset. We will fit a regression model using volume (\(m^{3}\)) as the response variable, diameter(\(cm\)) and height are the explanatory variables. We will look into the diagnostic plots of the model discuss the results. We will also fit regression models with transformations of diameter and height. At the end of this part, we will do model selection.
library(sjPlot)
library(sjmisc)
hardwood <- read.csv("Hardwood.csv")
attach(hardwood)
reg_model1 <- lm(volume~., data=hardwood)
tab_model(reg_model1, show.se = TRUE, show.ci = FALSE, show.stat = TRUE)
| volume | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -1.49 | 0.23 | -6.50 | <0.001 |
| diameter | 0.06 | 0.00 | 17.78 | <0.001 |
| height | 0.03 | 0.01 | 2.54 | 0.017 |
| Observations | 30 | |||
| R2 / R2 adjusted | 0.944 / 0.940 | |||
After fitting a regression model, p-value(s), R-square and the slope coefficients can tell us how well the model represents the data. On the other hand the residuals can inform about how poorly the model represents the data. Residuals are the leftovers of the response variable after fitting a model to data, and they might reveal unexplained patterns in the data. We can use this information to not only check if the linear regression assumptions are met, but also to improve the model.
par(mfrow=c(2,2))
plot(reg_model1)
The first diagnostic plot Residuals vs Fitted: Generally it is required that before fitting a linear regression model there should be a linear relationship between the response and the explanatory variable(s), although this relationship could be non-linear. This pattern could be carried into to the residuals as well. A good way to check for this is to plot the residuals against the fitted values (predictors or value of the explanatory variables). If the points in the plot are equally spread around a horizontal line, it is an indication that the model is a good one. Thus, the pattern between the response and the explanatory variable(s) does not appear in the residuals.
For our model, we can observe a U-shape relationship (the red line). This shows that there is a non-linear relationship between the residuals and the predictors. This violets the linear regression assumption that the predictors and the residual are uncorrelated.
The Normal Q-Q plot: Another assumption of a linear regression model is that the residual are normally distributed. Using the Q-Q plot, one will expect that the residuals will follow a straight line. When this is the case, we can conclude that the residuals are normally distributed. In our model we observe that the residual follow a straight line.
Scale-Location: It is also called Spread-Location plot. This plot shows if residuals are spread equally along the ranges of predictors. This is how we can check the assumption of common variance (homoscedasticity). It is good if we see a horizontal line with equally (randomly) spread points. In our model, we do not see a perfectly horizontal line. However, the residual are somewhat equally spread.
Residuals vs Leverage: Not all extreme values are influential in a linear regression analysis. This plot helps us find such influential cases, if any. Even though the data have extreme values, they might not be influential to determine a regression line. This means, the results would not be much different if we either include or exclude the extreme values from analysis. They follow the pattern in the majority of cases and they do not really matter, since they are not influential. On the other hand, some cases could be very influential even if they look to be within a reasonable range of the values. They could be extreme cases against a regression line and can alter the results if we exclude them from analysis. That is, they don’t get along with the trend in the majority of the cases.
Unlike the other plots, patterns are not relevant. We will look out for extreme values at the upper right corner or at the lower right corner. Those spots are the places where cases can be influential against a regression line. Look for cases outside of a dashed line, Cook’s distance. When cases are outside of the Cook’s distance (meaning they have high Cook’s distance scores), the observation are influential to the regression results. The regression results will be altered if we exclude those cases. In our model, we observe that all the cases are within the Cook’s distance.
Next, let us transform height and diameter and fit a linear regression with the transformation and observe the change, if any, in the model.
log_height <- log(height) # take the log transform of height
log_diameter <- log(diameter) # take the log transform of diameter
reg_model2 <- lm(volume ~ log_diameter + log_height) # fit a new model
tab_model(reg_model1,reg_model2, show.se = TRUE, show.ci = FALSE, show.stat = TRUE)
| volume | volume | |||||||
|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p | Estimates | std. Error | Statistic | p |
| (Intercept) | -1.49 | 0.23 | -6.50 | <0.001 | -6.33 | 0.94 | -6.74 | <0.001 |
| diameter | 0.06 | 0.00 | 17.78 | <0.001 | ||||
| height | 0.03 | 0.01 | 2.54 | 0.017 | ||||
| log diameter | 1.63 | 0.13 | 12.92 | <0.001 | ||||
| log height | 0.56 | 0.34 | 1.68 | 0.105 | ||||
| Observations | 30 | 30 | ||||||
| R2 / R2 adjusted | 0.944 / 0.940 | 0.901 / 0.893 | ||||||
Let’s take a look at the two models that have. In the first model, all the predictors are showed to be significant. However, the second model (the transformed model) we see that only diameter is significant.Also, the R-square value has gone down a bit. So, so the question is what exactly does this mean for us? Which of the models is a better model?
Before we get into that, let us look at the diagnostic plot of the new model.
par(mfrow=c(2,2))
plot(reg_model2)
Interesting. Observe that all the plot are similar to those of the previous model. So we can say that the residuals have not been affected by the transformation. In this case, we need to rely of the Adjusted R-square to select the best model. We will select the model with the largest Adjusted R-square. Which, in this case, is the first model. We can try another transformation. Let us introduce an interaction between height and diameter.
reg_model3 <- lm(volume~diameter + height + diameter*height, data=hardwood)
tab_model(reg_model1, reg_model3, reg_model2, show.se = TRUE, show.ci = FALSE, show.stat = TRUE)
| volume | volume | volume | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p | Estimates | std. Error | Statistic | p | Estimates | std. Error | Statistic | p |
| (Intercept) | -1.49 | 0.23 | -6.50 | <0.001 | 2.05 | 0.88 | 2.34 | 0.027 | -6.33 | 0.94 | -6.74 | <0.001 |
| diameter | 0.06 | 0.00 | 17.78 | <0.001 | -0.08 | 0.03 | -2.37 | 0.025 | ||||
| height | 0.03 | 0.01 | 2.54 | 0.017 | -0.12 | 0.04 | -3.29 | 0.003 | ||||
| diameter × height | 0.01 | 0.00 | 4.14 | <0.001 | ||||||||
| log diameter | 1.63 | 0.13 | 12.92 | <0.001 | ||||||||
| log height | 0.56 | 0.34 | 1.68 | 0.105 | ||||||||
| Observations | 30 | 30 | 30 | |||||||||
| R2 / R2 adjusted | 0.944 / 0.940 | 0.966 / 0.962 | 0.901 / 0.893 | |||||||||
You could observe that the model has improved drastically. What do I mean by this? The interaction term is shown to be significant and the R-square also has increased. Next, let us look at the diagnostic plot of the third model.
par(mfrow=c(2,2))
plot(reg_model3)
Very interesting right? All the plots exhibit the desirable attribute of a good linear regression model. We can see clearly from the Residuals vs Fitted plot that there is no correlation between the residuals and the predictors. Also, the residual follow, perfectly, a sraight line in the Q-Q plot. The scale-location plot is just as we would it to be and all cases in the **Residual vs Leverage* plot are within the Cook’s distance.
So, it would seem that the best model to choose for this problem is the third model. Because it satisfies all the assumption of a linear regression model and gives us the highest R-square. The R-square tells us the proportion of variability in the response that is accounted for by the predictors.