Consider the following data set:
file_address <- 'https://raw.githubusercontent.com/arsilva87/statsbook/main/datasets/camadassolo.csv'
soil <- read.csv(file = file_address)
str(soil) # data set structure
'data.frame': 84 obs. of 7 variables:
$ US : num 6.85 10.61 6.63 6.63 10.72 ...
$ DS : num 1.77 1.58 1.79 1.78 1.57 1.56 1.61 1.68 1.65 1.68 ...
$ RP : num 4.37 1.64 5.15 4.83 0.4 0.74 1.64 2.8 2.14 3.99 ...
$ CO : num 9.94 26.89 9.94 9.94 27.8 ...
$ Argila: int 16 13 18 18 12 12 13 14 14 14 ...
$ Tensao: num 83 64.7 93.5 87.5 61 61.9 67.9 69.4 68 72.7 ...
$ Camada: chr "0-20" "0-20" "0-20" "0-20" ...
Say you want to select predictor variables (\(X\)) to model the response variable RP (\(y\)).
Let’s install the package plot3D to draw a 3D perspective plot of the data.
# install.packages('plot3D') # remove '#'
library(plot3D)
with(soil,
points3D(x = DS, y = US, z = RP,
phi = 5, bty = "g", pch = 20, ticktype = "detailed",
xlab = "Bulk density", ylab = "Soil moisture",
zlab = "RP", col = 'black'))Through multiple linear regression, the model is of the form:
\[y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon\]
where \(\beta_0\) is the intercept, \(\beta_j\) (\(j = 1, 2, ..., p\)) are the parameters associated with the predictors, and \(\epsilon\) is the random error.
Testing hypotheses such as \(H_0: \beta_j = 0\) helps to identify which predictors \(X_j\) affect \(y\). This can be done using, for instance, a \(t\)-test.
model_full <- lm(RP ~ US + DS + CO + Argila + Tensao, data = soil)
summary(model_full)
Call:
lm(formula = RP ~ US + DS + CO + Argila + Tensao, data = soil)
Residuals:
Min 1Q Median 3Q Max
-1.15536 -0.29826 0.03775 0.30054 0.85998
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.530977 2.465087 -2.649 0.00976 **
US -0.213449 0.043977 -4.854 6.10e-06 ***
DS 3.790545 1.784057 2.125 0.03678 *
CO 0.009941 0.013878 0.716 0.47594
Argila -0.026239 0.045377 -0.578 0.56476
Tensao 0.073145 0.014520 5.038 2.97e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4651 on 78 degrees of freedom
Multiple R-squared: 0.8771, Adjusted R-squared: 0.8692
F-statistic: 111.4 on 5 and 78 DF, p-value: < 2.2e-16
This indicates that US, DS and Tensao are significant (p-values < 0.05). The coefficient of determination (R-squared = 0.877) indicates that most of the variance in RP is explained by the fitted model.
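These quantities can also be extracted from the fitted object rather than read off the printed summary. The snippet below is just a small illustrative check, reusing the model_full object created above.
p_values <- summary(model_full)$coefficients[, "Pr(>|t|)"]
p_values[p_values < 0.05] # terms significant at the 5% level
summary(model_full)$r.squared # coefficient of determination (R-squared)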
Another way to select predictors is an iterative technique called stepwise selection, which consists of evaluating the impact of including and/or excluding predictors on the model's goodness-of-fit. A well-known goodness-of-fit measure is \(R^2\), but there are plenty of others, such as the AIC (Akaike Information Criterion), which is based on the maximum likelihood.
step(model_full)
Start: AIC=-122.83
RP ~ US + DS + CO + Argila + Tensao
Df Sum of Sq RSS AIC
- Argila 1 0.0723 16.944 -124.47
- CO 1 0.1110 16.983 -124.28
<none> 16.872 -122.83
- DS 1 0.9765 17.848 -120.11
- US 1 5.0958 21.968 -102.66
- Tensao 1 5.4893 22.361 -101.17
Step: AIC=-124.47
RP ~ US + DS + CO + Tensao
Df Sum of Sq RSS AIC
- CO 1 0.2360 17.180 -125.313
<none> 16.944 -124.474
- DS 1 0.9057 17.850 -122.100
- Tensao 1 5.4875 22.432 -102.908
- US 1 7.1388 24.083 -96.942
Step: AIC=-125.31
RP ~ US + DS + Tensao
Df Sum of Sq RSS AIC
<none> 17.180 -125.313
- DS 1 0.7420 17.922 -123.761
- Tensao 1 5.2611 22.441 -104.872
- US 1 6.9080 24.088 -98.924
Call:
lm(formula = RP ~ US + DS + Tensao, data = soil)
Coefficients:
(Intercept) US DS Tensao
-5.1872 -0.2187 3.0521 0.0688
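As a cross-check on the criterion, one may refit the selected model and compare it to the full model with AIC(); the name model_reduced below is just illustrative. Keep in mind that AIC() includes constant terms that step() (via extractAIC()) drops, so the absolute values differ from the trace above, although the ranking of the models is the same.
model_reduced <- lm(RP ~ US + DS + Tensao, data = soil)
AIC(model_full, model_reduced) # the lower AIC indicates the preferable model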
Sometimes a predictor is highly correlated with the others or has a high variance inflation factor (VIF), suggesting that it should be removed. However, it might be an important predictor in practice. In this case, a way to deal with correlated predictors without removing any of them is a method called ridge regression. It consists of choosing a constant \(k\), as small as possible, to be added to the diagonal of the matrix \(X'X\) during the estimation process.
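As a rough sketch (one possible implementation, not the only one), the function lm.ridge() from the MASS package fits ridge regression over a grid of values of \(k\) (called lambda in the function) and select() suggests a value; the grid below is an illustrative assumption. The VIFs themselves can be checked, for example, with car::vif() if the car package is installed.
library(MASS)
# car::vif(model_full) # variance inflation factors of the full model ('car' package)
ridge_fit <- lm.ridge(RP ~ US + DS + CO + Argila + Tensao, data = soil, lambda = seq(0, 1, by = 0.01))
select(ridge_fit) # HKB, L-W and GCV suggestions for the constant k
plot(ridge_fit) # coefficient paths as k increases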