Introduction

Consider the following data set:

file_address <- 'https://raw.githubusercontent.com/arsilva87/statsbook/main/datasets/camadassolo.csv'
soil <- read.csv(file = file_address)
str(soil)   # data set structure
'data.frame':   84 obs. of  7 variables:
 $ US    : num  6.85 10.61 6.63 6.63 10.72 ...
 $ DS    : num  1.77 1.58 1.79 1.78 1.57 1.56 1.61 1.68 1.65 1.68 ...
 $ RP    : num  4.37 1.64 5.15 4.83 0.4 0.74 1.64 2.8 2.14 3.99 ...
 $ CO    : num  9.94 26.89 9.94 9.94 27.8 ...
 $ Argila: int  16 13 18 18 12 12 13 14 14 14 ...
 $ Tensao: num  83 64.7 93.5 87.5 61 61.9 67.9 69.4 68 72.7 ...
 $ Camada: chr  "0-20" "0-20" "0-20" "0-20" ...

Say you want to select predictor variables (\(X\)) to model the response variable RP (\(y\)).

Let’s install and load the package plot3D to view the data from a 3D perspective.

# install.packages('plot3D')  # uncomment to install, if needed
library(plot3D)
with(soil, 
     points3D(x = DS, y = US, z = RP,
              phi = 5, bty = "g", pch = 20, ticktype = "detailed",
              xlab = "Bulk density", ylab = "Soil moisture", 
              zlab = "RP", col = 'black'))

In multiple linear regression, the model has the form:

\[y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon\]

where \(\beta_0\) is the intercept, \(\beta_j\) (\(j = 1, 2, ..., p\)) are the parameters associated with the predictors \(X_j\), and \(\epsilon\) is the random error term.

Testing hypotheses such as \(H_0: \beta_j = 0\) helps to identify which predictors \(X_j\) affect \(y\). This can be done using, for instance, a \(t\)-test.

Fitting the model

model_full <- lm(RP ~ US + DS + CO + Argila + Tensao, data = soil)
summary(model_full)

Call:
lm(formula = RP ~ US + DS + CO + Argila + Tensao, data = soil)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15536 -0.29826  0.03775  0.30054  0.85998 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.530977   2.465087  -2.649  0.00976 ** 
US          -0.213449   0.043977  -4.854 6.10e-06 ***
DS           3.790545   1.784057   2.125  0.03678 *  
CO           0.009941   0.013878   0.716  0.47594    
Argila      -0.026239   0.045377  -0.578  0.56476    
Tensao       0.073145   0.014520   5.038 2.97e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4651 on 78 degrees of freedom
Multiple R-squared:  0.8771,    Adjusted R-squared:  0.8692 
F-statistic: 111.4 on 5 and 78 DF,  p-value: < 2.2e-16

The output indicates that US, DS and Tensao are significant (p-values < 0.05). The coefficient of determination (R-squared) indicates that most of the variance in RP is explained by the fitted model.

Correlated predictors

Both the tests and the parameter estimation can be misled by correlation among the predictors \(X_j\).

Correlation matrix

In R, Pearson’s correlation coefficient can be computed with:

cor(soil$US, soil$DS)
[1] -0.7315648

Or, for all pairs, as a matrix:

R <- cor(soil[, -7])
round(R, 2)
          US    DS    RP    CO Argila Tensao
US      1.00 -0.73 -0.80  0.62  -0.39  -0.66
DS     -0.73  1.00  0.89 -0.76   0.76   0.92
RP     -0.80  0.89  1.00 -0.69   0.63   0.89
CO      0.62 -0.76 -0.69  1.00  -0.73  -0.74
Argila -0.39  0.76  0.63 -0.73   1.00   0.77
Tensao -0.66  0.92  0.89 -0.74   0.77   1.00

Notice, for instance, how strong the correlation between DS and Tensao is (0.92). A helpful way to identify groups of correlated variables is through correlation networks.

# install.packages('qgraph')
library(qgraph)
co <- c('white', 'white', 'yellow', 'white', 'white', 'white')
qgraph(R, layout = 'spring', color = co)

Highly correlated predictors should be removed from the model. If the correlation matrix or network is not enough to decide which predictors should be removed, other strategies can be used.

VIF

Basically, correlated predictors can inflate the variance of the parameter estimates \(\hat{\beta}_j\). Thus, a way to verify which predictor \(X_j\) is causing inflation is through the VIF (Variance Inflation Factor).
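For each predictor \(X_j\), the VIF is obtained from the coefficient of determination \(R_j^2\) of the regression of \(X_j\) on the remaining predictors:

\[VIF_j = \frac{1}{1 - R_j^2}\]

In R, it can be computed with the function vif() from the package car: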

library(car)
vif(model_full)
      US       DS       CO   Argila   Tensao 
2.780627 8.755151 3.038698 3.590033 6.713512 

A predictor with a VIF above 5 (a commonly adopted rule of thumb) should be removed. As we’ve seen that Tensao and DS are highly correlated, let’s remove one of them, say Tensao, and, since CO and Argila were not significant, fit a model with only US and DS.

model <- lm(RP ~ DS + US, data = soil)
summary(model)

Call:
lm(formula = RP ~ DS + US, data = soil)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.21818 -0.42785 -0.03737  0.32429  1.79483 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -11.64640    1.99052  -5.851 9.95e-08 ***
DS            9.90968    1.00089   9.901 1.33e-15 ***
US           -0.21396    0.04378  -4.887 5.08e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5264 on 81 degrees of freedom
Multiple R-squared:  0.8366,    Adjusted R-squared:  0.8325 
F-statistic: 207.3 on 2 and 81 DF,  p-value: < 2.2e-16

Notice that even after removing the predictors CO, Argila and Tensao, the R-squared didn’t decrease much (from 0.877 to 0.837), which is expected.
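One way to check this drop directly is to compare the two coefficients of determination extracted from the fitted objects:

summary(model_full)$r.squared - summary(model)$r.squared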

Lastly, let’s check the VIFs of the new model:

vif(model)
      DS       US 
2.151403 2.151403 

Response surface

With the selected model, let’s build a response surface, in this case a 3D plane. First, we’ll redraw the 3D point cloud and then add the fitted plane to it:

with(soil, 
     points3D(x = DS, y = US, z = RP,
              phi = 5, bty = "g", pch = 20, ticktype = "detailed",
              xlab = "Bulk density", ylab = "Soil moisture", 
              zlab = "RP", col = 'black')
     )
xy <- with(soil,
           mesh(x = seq(min(DS), max(DS), length = 50), 
                y = seq(min(US), max(US), length =50))
           )
with(xy, 
     surf3D(x, y, z = -11.6 + 9.91*x - 0.21*y, 
            add = TRUE , facets = NA)
     )
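Rather than typing the rounded estimates by hand, one could also build the plane from the fitted coefficients directly, for example:

b <- coef(model)   # (Intercept), DS, US
with(xy, 
     surf3D(x, y, z = b[1] + b[2]*x + b[3]*y, 
            add = TRUE, facets = NA)
     )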

Miscellanea

Stepwise

Another way to select predictors is through an iterative technique called stepwise selection, which evaluates the impact of including and/or excluding predictors on the model’s goodness-of-fit. A well-known goodness-of-fit measure is \(R^2\), but there are plenty of others, such as the AIC (Akaike Information Criterion), which is based on the maximum likelihood.
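For a model with \(p\) estimated parameters and maximized likelihood \(\hat{L}\), \(AIC = 2p - 2\log\hat{L}\); among candidate models, the one with the smallest AIC is preferred. This is the default criterion used by the function step():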

step(model_full)
Start:  AIC=-122.83
RP ~ US + DS + CO + Argila + Tensao

         Df Sum of Sq    RSS     AIC
- Argila  1    0.0723 16.944 -124.47
- CO      1    0.1110 16.983 -124.28
<none>                16.872 -122.83
- DS      1    0.9765 17.848 -120.11
- US      1    5.0958 21.968 -102.66
- Tensao  1    5.4893 22.361 -101.17

Step:  AIC=-124.47
RP ~ US + DS + CO + Tensao

         Df Sum of Sq    RSS      AIC
- CO      1    0.2360 17.180 -125.313
<none>                16.944 -124.474
- DS      1    0.9057 17.850 -122.100
- Tensao  1    5.4875 22.432 -102.908
- US      1    7.1388 24.083  -96.942

Step:  AIC=-125.31
RP ~ US + DS + Tensao

         Df Sum of Sq    RSS      AIC
<none>                17.180 -125.313
- DS      1    0.7420 17.922 -123.761
- Tensao  1    5.2611 22.441 -104.872
- US      1    6.9080 24.088  -98.924

Call:
lm(formula = RP ~ US + DS + Tensao, data = soil)

Coefficients:
(Intercept)           US           DS       Tensao  
    -5.1872      -0.2187       3.0521       0.0688  

Ridge regression

Sometimes a predictor is highly correlated with others or has a high VIF, suggesting that it should be removed. However, it might be an important predictor in practice. In this case, a way to deal with correlated predictors without removing any of them is a method called ridge regression. It consists of choosing a constant \(k\), as small as possible, to be added to the diagonal of the matrix \(X'X\) during the estimation process.
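As an illustration, the ridge estimator can be written as

\[\hat{\beta}_{ridge} = (X'X + kI)^{-1}X'y\]

and it can be computed in R with the function lm.ridge() from the package MASS, where the argument lambda plays the role of \(k\) (the grid of values below is merely illustrative):

# install.packages('MASS')  # usually already shipped with R
library(MASS)
model_ridge <- lm.ridge(RP ~ US + DS + CO + Argila + Tensao, data = soil,
                        lambda = seq(0, 1, by = 0.01))   # grid of k values
select(model_ridge)   # suggested values of k (HKB, L-W and GCV criteria)
plot(model_ridge)     # coefficient paths as k increases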

Exercises


License: CC BY 4.0