Linear regression is adequate for describing the relationships between the variables in a data set. When the goal is prediction, however, there is a limitation: by design, a regression fits the data being analyzed, and the more closely it fits those data, the less likely its predictions are to transfer well to a different data set.
The fit is usually measured by the residual sum of squares: the differences between the observed values of the response variable and the values estimated by the regression, squared and added together: \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\).
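As a minimal illustration, this quantity can be computed directly in R from a vector of observed values and a vector of fitted values (the names `y` and `y_hat` below are just placeholders with toy numbers):

```r
# Residual sum of squares for observed values y and fitted values y_hat
y     <- c(2.1, 3.4, 1.8, 4.0)   # toy observed responses
y_hat <- c(2.0, 3.6, 1.5, 4.2)   # toy fitted values
rss <- sum((y - y_hat)^2)
rss
```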
To address this, variations of the regression methods have been devised, such as penalizing the fit to the current data set in order to generalize better to other data sets.
To illustrate the subject, we use an academic data set1: Tetko et al. (2001) and Huuskonen (2000) used several models to estimate the relationship between the chemical structure of 1,267 compounds and their solubility index. This is an essential issue for pharmaceutical laboratories because the higher the solubility of a drug, the faster it is absorbed by the human body. The data set consists of descriptors of the chemical structure of the compounds, divided into three groups: 208 binary “fingerprints” that indicate the presence or absence of a particular chemical substructure, 16 count descriptors (such as the number of bonds or the number of atoms of a specific type), and 4 continuous descriptors (such as molecular weight or molecular surface area).
The response variable is the solubility value on a \(\log_{10}\) scale.
The data set is divided in two, one part for training and one for testing: 951 compounds for training and the remaining 316 for testing (roughly 75%-25%).
Linear least-squares regression is applied to the training set.
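As a sketch of this step, the solubility data is available in the AppliedPredictiveModeling package, already split into training and test sets (the combined data frames `trainData` and `testData` are our own naming):

```r
library(AppliedPredictiveModeling)
library(caret)

# Load the solubility descriptors and response; the package provides
# solTrainXtrans / solTrainY and solTestXtrans / solTestY
data(solubility)

# Put predictors and response together for the formula interface
trainData <- data.frame(solTrainXtrans, Solubility = solTrainY)
testData  <- data.frame(solTestXtrans,  Solubility = solTestY)

# Ordinary least-squares fit on the training set
olsFit <- lm(Solubility ~ ., data = trainData)
summary(olsFit)   # reports R-squared and the residual standard error
```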
The scale-location plot shows whether the residuals are spread equally along the range of the fitted values. Here the red line is roughly horizontal, so the residuals can be considered homoscedastic.
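This plot can be produced from the fitted `lm` object (assuming the hypothetical `olsFit` object from the sketch above):

```r
# Scale-location diagnostic: sqrt(|standardized residuals|) vs fitted values
plot(olsFit, which = 3)
```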
The percentage of variability explained by the model, \(R^2\), is 0.9446, which indicates a close fit to the training data. The residual standard error is 0.5524. The software computes these metrics automatically; R is used throughout this walkthrough.
If the model is applied to the covariates of the test set, it yields an \(R^2\) of 0.8722 and an RMSE of 0.7456.
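A sketch of this evaluation, using the `olsFit` and `testData` objects defined above and caret's summary of prediction accuracy:

```r
# Predict on the held-out test set and compute RMSE, R-squared and MAE
testPred <- predict(olsFit, newdata = testData)
postResample(pred = testPred, obs = testData$Solubility)
```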
The \(R^2\) decreases and the error (RMSE) increases when the regression is applied to a different set of values. This type of validation is called cross-validation, and it is useful for determining how well the model fits a different data set, that is, how well it predicts. Instead of cross-validating between one training set and one test set, cross-validation can be done with ten subsets: the data set is randomly divided into ten groups of equal size, and validation is carried out leaving one group out as the test set and using the others as the training set, each group in turn. This is called k-fold cross-validation. In this way, ten values of \(R^2\) and ten values of RMSE are obtained; averaging them gives a more reliable measure of how good the model is as a predictive model.
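With caret, the resampling scheme can be declared once and reused for every model that follows (a sketch; the object name `ctrl` is our own):

```r
# 10-fold cross-validation, reused for all the models fitted below
set.seed(100)
ctrl <- trainControl(method = "cv", number = 10)
```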
An improvement can be sought by applying a robust linear regression. After the k-fold cross-validation described above, it gives an \(R^2\) of 0.8781 and an RMSE of 0.7213. These values are used as the baseline metrics.
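A sketch of the robust fit with caret's `rlm` method (which wraps `MASS::rlm`), using the `trainData` and `ctrl` objects defined above:

```r
# Robust linear regression with 10-fold cross-validation.
# Note: with many highly correlated predictors this call can be numerically
# fragile; the PCA pre-processing used later in the text avoids that.
set.seed(100)
rlmFit <- train(Solubility ~ ., data = trainData,
                method = "rlm",
                preProcess = c("center", "scale"),
                trControl = ctrl)
rlmFit
```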
If the results are plotted:
The plot on the left shows whether the residuals are random around zero and homoscedastic; the one on the right, whether they are Gaussian. These are two of the assumptions of least-squares linear regression with a continuous response variable. Both are met acceptably.
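A sketch of how such diagnostic plots can be drawn from the cross-validated fit (using the hypothetical `rlmFit` object above):

```r
# Residuals on the training set: observed minus predicted
trainPred <- predict(rlmFit, newdata = trainData)
resid     <- trainData$Solubility - trainPred

par(mfrow = c(1, 2))
plot(trainPred, resid,
     xlab = "Predicted", ylab = "Residual",
     main = "Residuals vs predicted")   # randomness around zero, equal spread
abline(h = 0, col = "red")
qqnorm(resid); qqline(resid)            # approximate normality
par(mfrow = c(1, 1))
```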
Having so many variables makes us suspect that some are highly correlated, and that some of them could be eliminated.
Eliminating variables, or in other words reducing the dimensionality, can be done with the multivariate technique of Principal Component Analysis (PCA). Applying this technique prior to the regression is known as principal component regression (PCR).
The PCA filter reduces the 228 original predictors to 190 components, eliminating 38 dimensions.
After k-fold cross-validation, the PCR gives an \(R^2\) of 0.8837 and an RMSE of 0.7001. There is an improvement in both metrics.
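One way to fit a PCR with caret is to combine an ordinary linear model with PCA pre-processing (a sketch using the objects defined above; by default `preProcess` keeps the components that explain 95% of the variance):

```r
# Principal component regression: PCA pre-processing followed by lm
set.seed(100)
pcrFit <- train(Solubility ~ ., data = trainData,
                method = "lm",
                preProcess = c("center", "scale", "pca"),
                trControl = ctrl)
pcrFit
```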
The same dimensionality reduction is also applied before a robust linear regression.
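A sketch of a call of the kind that produces output like the one below (the same as the earlier robust fit, with PCA added to the pre-processing; not necessarily the exact run shown):

```r
# Robust linear regression on principal components
set.seed(100)
rlmPCAFit <- train(Solubility ~ ., data = trainData,
                   method = "rlm",
                   preProcess = c("pca", "center", "scale"),
                   trControl = ctrl)
rlmPCAFit
```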
## Robust Linear Model
##
## 951 samples
## 228 predictors
##
## Pre-processing: principal component signal extraction (228), centered
## (228), scaled (228)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 856, 856, 856, 856, 855, 856, ...
## Resampling results across tuning parameters:
##
## intercept psi RMSE Rsquared MAE
## FALSE psi.huber 2.8294713 0.8536421 2.7210207
## FALSE psi.hampel 2.8294526 0.8536650 2.7209978
## FALSE psi.bisquare 2.8290038 0.8538628 2.7206404
## TRUE psi.huber 0.7791505 0.8538315 0.5953206
## TRUE psi.hampel 0.7787367 0.8538127 0.5969627
## TRUE psi.bisquare 0.7820899 0.8528868 0.5970086
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.hampel.
After k-fold cross-validation, an \(R^2\) of 0.8538 and an RMSE of 0.7787 are found. In this case it is not an improvement over the baseline metrics.
The PCA step in this solution used 190 components instead of the 228 original regressors.
It may be preferable to perform a partial least squares (PLS) regression, since PCA does not guarantee that the new dimensions are correlated with the response variable, whereas PLS does.
The PLS is computed with 190 components, instead of 228. In that case, the figures are an \(R^2\) of 0.8397 and an RMSE of 0.8162. It was not a good option for this data set either.
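A sketch of a PLS fit with caret (`method = "pls"` uses the pls package underneath; the `ncomp` grid mirrors the 190 components mentioned above):

```r
# Partial least squares with the number of components fixed at 190
set.seed(100)
plsFit <- train(Solubility ~ ., data = trainData,
                method = "pls",
                preProcess = c("center", "scale"),
                tuneGrid = data.frame(ncomp = 190),
                trControl = ctrl)
plsFit
```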
The next option is ridge regression, which adds a penalty to the residual sum of squares being minimized. It only allows large regression coefficients when they produce a significant reduction in that sum. It is a second-order (L2) penalty because it is applied to the squared coefficients: \(\lambda\sum_{j=1}^p{\beta_j^2}\), over the \(p\) coefficients.
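A sketch of a call of the kind that produces output like the one below (caret's `ridge` method uses the elasticnet package; the lambda grid mirrors the values in the output):

```r
# Ridge regression over a grid of penalty values
set.seed(100)
ridgeGrid <- data.frame(lambda = seq(0, 0.1, length = 16))
ridgeFit <- train(Solubility ~ ., data = trainData,
                  method = "ridge",
                  preProcess = c("center", "scale"),
                  tuneGrid = ridgeGrid,
                  trControl = ctrl)
ridgeFit
```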
## Ridge Regression
##
## 951 samples
## 228 predictors
##
## Pre-processing: centered (228), scaled (228)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 856, 856, 856, 856, 855, 857, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 0.7037870 0.8848267 0.5205388
## 0.006666667 0.6924396 0.8884148 0.5231951
## 0.013333333 0.6858582 0.8904713 0.5184671
## 0.020000000 0.6835989 0.8912422 0.5171888
## 0.026666667 0.6832657 0.8914583 0.5174413
## 0.033333333 0.6840262 0.8913726 0.5184864
## 0.040000000 0.6854505 0.8911196 0.5200406
## 0.046666667 0.6873603 0.8907532 0.5218736
## 0.053333333 0.6896214 0.8903158 0.5239307
## 0.060000000 0.6921560 0.8898318 0.5262999
## 0.066666667 0.6949139 0.8893171 0.5288518
## 0.073333333 0.6978568 0.8887830 0.5314705
## 0.080000000 0.7009640 0.8882365 0.5343521
## 0.086666667 0.7042134 0.8876840 0.5373565
## 0.093333333 0.7075948 0.8871285 0.5403616
## 0.100000000 0.7110969 0.8865732 0.5434573
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.02666667.
After the k-fold cross-validation, an \(R^2\) of 0.8915 is found, together with an RMSE of 0.6833, at a lambda of about 0.027. An important improvement.
Finally, the lasso regression alternative is used. It has the advantage of zeroing out the coefficients that do not contribute, which amounts to a dimensionality reduction within the model. It is a first-order (\(L_1\)) penalty because it is computed over the absolute values of the coefficients, without squaring them: \(\lambda\sum_{j=1}^p|{\beta_j}|\).
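A sketch of a lasso fit with caret's `enet` method (also from the elasticnet package): fixing the ridge penalty `lambda` at zero gives a pure lasso, and `fraction` controls how much of the full least-squares solution is kept. The grid values below are our own illustration.

```r
# Lasso via the elastic net with the ridge component set to zero
set.seed(100)
lassoGrid <- expand.grid(lambda = 0,
                         fraction = seq(0.05, 1, length = 20))
lassoFit <- train(Solubility ~ ., data = trainData,
                  method = "enet",
                  preProcess = c("center", "scale"),
                  tuneGrid = lassoGrid,
                  trControl = ctrl)

# Coefficients at the selected fraction; the zeroed ones have been dropped
enetCoef <- predict(lassoFit$finalModel, type = "coefficients",
                    s = lassoFit$bestTune$fraction, mode = "fraction")
enetCoef$coefficients[enetCoef$coefficients != 0]
```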
The variables that remained after the process, with their respective coefficients, are:
## FP002 FP026 FP031 FP037 FP040 FP049
## 0.239573523 0.231880862 0.009699797 0.109971530 0.303606900 0.195964123
## FP053 FP069 FP074 FP075 FP079 FP084
## 0.193303854 0.030044758 0.159559330 0.145711421 0.018583356 0.101694029
## FP088 FP099 FP101 FP102 FP116 FP122
## 0.094281834 0.117916921 0.067810564 0.002031774 0.216195476 0.149489388
## FP124 FP135 FP137 FP138 FP142 FP147
## 0.234569455 0.270754050 0.403009855 0.091038060 0.435467535 0.135216424
## FP164 FP166 FP170 FP171 FP173 FP176
## 0.198309607 0.026351863 0.019897858 0.041618171 0.202689588 0.061066726
## FP187 FP188 FP190 FP198 FP202 FP203
## 0.021228523 0.194475893 0.067301399 0.052950069 0.267084442 0.051227148
## NumOxygen SurfaceArea1
## 0.071906679 0.257557431
Lasso regression is popular because it performs variable selection (feature selection).
After k-fold cross-validation, the figures are an \(R^2\) of 0.8833 and an RMSE of 0.6949, with the elastic net weight \(\alpha\) of the ridge component equal to zero, that is, a pure lasso. This is also an improvement over the baseline metrics, but Ridge obtained better ones. The elastic net is simply a combination of the second-order and first-order penalties: \[\alpha\,\lambda_{ridge}\sum_{j=1}^p{\beta_j^2} + (1-\alpha)\,\lambda_{lasso}\sum_{j=1}^p|{\beta_j}|,\quad \alpha \in [0,1]\]
| Regression | \(R^2\) | RMSE |
|---|---|---|
| Robust least squares | 0.878 | 0.721 |
| Principal components (PCR) | 0.884 | 0.700 |
| Robust PCR | 0.854 | 0.779 |
| Partial least squares (PLS) | 0.840 | 0.816 |
| Ridge | 0.892 | 0.683 |
| Lasso | 0.883 | 0.695 |
PCR, Lasso, and Ridge improved on the baseline metrics; however, for the example data, Ridge was the best option.
An elaboration of Chapter 6 of *Applied Predictive Modeling* by Kuhn and Johnson↩︎