Linear regression is adequate for describing the relationships between the variables in a data set. When the goal is prediction, however, there is a limitation: by design, a regression is fitted to the data being analyzed, and the better the fit to those data, the less likely the prediction is to be useful on another data set.

The fit is usually measured by summing the squares of the residuals: the sum of the squared differences between the observed values of the response variable and the values estimated by the regression, \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\). The software used below reports this error in the form of the root mean squared error (RMSE), the square root of the mean squared residual.
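As a small illustration of the two quantities, here is a minimal sketch in R with invented values (the numbers are made up for the example):

```r
# Toy illustration with invented values: residual sum of squares and RMSE.
y    <- c(3.1, 2.4, 5.0, 4.2)    # observed responses (made-up)
yhat <- c(2.9, 2.7, 4.6, 4.4)    # values estimated by a regression (made-up)

rss  <- sum((y - yhat)^2)        # sum of squared residuals
rmse <- sqrt(mean((y - yhat)^2)) # root mean squared error
c(RSS = rss, RMSE = rmse)
```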

Variations on regression methods have been devised, such as penalizing the fit to the current data set in order to achieve a better fit to other data sets.

To explain the subject, an academic data set is used¹: Tetko et al. (2001) and Huuskonen (2000) used several models to estimate the relationship between the chemical structure of 1267 compounds and their solubility index. This is an essential issue for pharmaceutical laboratories because the higher the solubility of a drug, the faster it is absorbed by the human body. The data set consists of descriptors of the chemical structure of the compounds, divided into three groups: 208 binary “fingerprints” that indicate the presence or absence of a particular chemical substructure, 16 count descriptors (such as the number of bonds or the number of atoms of a specific type), and 4 continuous descriptors (such as molecular weight or molecular surface area).

The response variable is the solubility value, transformed with \(\log_{10}\).

The data set is divided in two, one part for training and one for testing, in an approximately 75%–25% ratio (951 and 316 compounds, respectively).
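A minimal sketch of loading the data, assuming the solubility objects that ship with the AppliedPredictiveModeling package (solTrainXtrans, solTrainY, solTestXtrans, solTestY), which already contain the training/test split described above:

```r
# Load the solubility data; the package already provides the train/test split.
library(AppliedPredictiveModeling)
library(caret)

data(solubility)          # solTrainXtrans, solTrainY, solTestXtrans, solTestY, ...

dim(solTrainXtrans)       # 951 x 228: training predictors
dim(solTestXtrans)        # 316 x 228: test predictors
length(solTrainY)         # 951 log10 solubility values (training response)
```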

Linear least-squares regression is applied to the training set.

The scale-location plot shows whether the residuals are spread equally along the range of fitted values. The red line is approximately horizontal, so the residuals are homoscedastic.
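As a sketch, reusing the solubility objects loaded above (the object names are assumptions), the least-squares fit and its scale-location diagnostic can be obtained as follows:

```r
# Ordinary least squares on the training set; the scale-location plot is
# diagnostic number 3 of plot.lm().
trainData <- cbind(solTrainXtrans, Solubility = solTrainY)
lmFit <- lm(Solubility ~ ., data = trainData)

plot(lmFit, which = 3)     # sqrt(|standardized residuals|) vs fitted values
summary(lmFit)$r.squared   # R^2 on the training data
summary(lmFit)$sigma       # residual standard error on the training data
```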

The percentage of variability explained by the model, \(R^2\), is 0.9446, which implies a very close fit. The root mean squared error (RMSE) of the residuals is 0.5524. The software used for the calculations, R, reports these metrics automatically.

If the model is applied to the covariates of the test set, it gives an \(R^2\) of 0.7456 and an RMSE of 0.8722.
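A sketch of the test-set evaluation, using lmFit from the previous chunk and caret's postResample(), which returns RMSE, \(R^2\) and MAE:

```r
# Predict on the held-out test set and compare against the observed values.
testPred <- predict(lmFit, newdata = solTestXtrans)
postResample(pred = testPred, obs = solTestY)   # RMSE, Rsquared, MAE
```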

As a result of applying the regression to a different set of values, \(R^2\) decreases and the RMSE increases. Validating the metrics in this way is called cross-validation, and it is a convenient way to determine how well the model fits a different data set, that is, how well it predicts. Instead of cross-validating between one training set and one test set, the cross-validation can be done with ten subsets: the data set is randomly divided into ten groups of equal size, and the validation is performed leaving one group out as the test set and using the others as the training set each time. This is called k-fold cross-validation. In this way, ten values of \(R^2\) and ten values of the RMSE are obtained; if they are averaged, the result is a more reliable measure of how good the model is as a predictor.
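A minimal sketch of the 10-fold cross-validation with caret; the seed is an arbitrary choice for reproducible folds:

```r
# Ten-fold cross-validation: each fold is held out once as a test set.
ctrl <- trainControl(method = "cv", number = 10)

set.seed(100)                              # arbitrary seed, for reproducible folds
lmCV <- train(x = solTrainXtrans, y = solTrainY,
              method = "lm", trControl = ctrl)

lmCV$resample                              # RMSE, Rsquared, MAE for each fold
colMeans(lmCV$resample[, c("RMSE", "Rsquared", "MAE")])   # averaged over folds
```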

An improvement can be sought by applying a robust linear regression. After the k-fold cross-validation just described, it yields an \(R^2\) of 0.8781 and an RMSE of 0.7213. These values are used as the baseline.
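A sketch of the robust fit via caret's "rlm" method (MASS::rlm), reusing the ctrl object defined above; the centering and scaling are assumptions:

```r
# Robust linear regression, resampled with the same 10-fold scheme.
set.seed(100)
rlmFit <- train(x = solTrainXtrans, y = solTrainY,
                method = "rlm",
                preProcess = c("center", "scale"),   # assumed pre-processing
                trControl = ctrl)
rlmFit$results   # cross-validated RMSE, Rsquared, MAE per candidate model
```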

If the results are plotted:

The plot on the left shows whether the residuals are random around zero and homoscedastic; the one on the right, whether they are Gaussian. These are two of the assumptions of least-squares linear regression for a continuous response variable, and both are met acceptably.
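A sketch of the two diagnostics described above, computed from the training-set residuals of the robust fit rlmFit obtained in the previous chunk:

```r
# Left: residuals vs fitted values; right: normal Q-Q plot of the residuals.
rlmPred  <- predict(rlmFit, solTrainXtrans)
rlmResid <- solTrainY - rlmPred

par(mfrow = c(1, 2))
plot(rlmPred, rlmResid, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)             # residuals should scatter randomly around zero
qqnorm(rlmResid); qqline(rlmResid) # points near the line suggest Gaussian residuals
par(mfrow = c(1, 1))
```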

Having so many variables raises the suspicion that some of them are highly correlated. Some can be eliminated.

Eliminating variables, that is, reducing dimensionality, can be done with the multivariate technique of Principal Component Analysis (PCA). Applying this technique prior to the regression is known as principal component regression (PCR).

The PCA filter eliminates 38 variables, leaving 190.

After k-fold cross-validation, the PCR yields an \(R^2\) of 0.8837 and an RMSE of 0.7001. There is an improvement in both metrics.
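One way to obtain a PCR in caret is to let train() apply PCA as a pre-processing step before an ordinary linear model; the variance threshold that determines how many components are kept is an assumption here (the text above reports 190 retained dimensions):

```r
# PCR as a linear model on principal components; thresh controls how much
# variance the retained components must explain (assumed value).
set.seed(100)
pcrFit <- train(x = solTrainXtrans, y = solTrainY,
                method = "lm",
                preProcess = c("center", "scale", "pca"),
                trControl = trainControl(method = "cv", number = 10,
                                         preProcOptions = list(thresh = 0.95)))
pcrFit
```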

A dimensionality reduction is also performed before a robust linear regression.
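A sketch of the setup that produces the summary below: the same robust method, now with PCA added to the pre-processing (seed assumed):

```r
# Robust regression on principal components, 10-fold cross-validated.
set.seed(100)
rlmPCAFit <- train(x = solTrainXtrans, y = solTrainY,
                   method = "rlm",
                   preProcess = c("pca", "center", "scale"),
                   trControl = trainControl(method = "cv", number = 10))
rlmPCAFit
```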

## Robust Linear Model 
## 
## 951 samples
## 228 predictors
## 
## Pre-processing: principal component signal extraction (228), centered
##  (228), scaled (228) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 856, 856, 856, 856, 855, 856, ... 
## Resampling results across tuning parameters:
## 
##   intercept  psi           RMSE       Rsquared   MAE      
##   FALSE      psi.huber     2.8294713  0.8536421  2.7210207
##   FALSE      psi.hampel    2.8294526  0.8536650  2.7209978
##   FALSE      psi.bisquare  2.8290038  0.8538628  2.7206404
##    TRUE      psi.huber     0.7791505  0.8538315  0.5953206
##    TRUE      psi.hampel    0.7787367  0.8538127  0.5969627
##    TRUE      psi.bisquare  0.7820899  0.8528868  0.5970086
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.hampel.

After k-fold cross-validation, an \(R^2\) of 0.8538 and an RMSE of 0.7787 are found. Compared with the baseline metrics, this is not the right solution in the present case.

The PCA solution indicated that 190 regressors were used instead of 228.

It may be preferable to perform a partial least squares (PLS) regression, since PCA does not guarantee that the new dimensions are correlated with the response variable, whereas PLS does.

The PLS is calculated with 190 components instead of 228. In that case the figures are an \(R^2\) of 0.8397 and an RMSE of 0.8162. It was not a good option for this data set either.
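A sketch of the PLS fit with the number of components fixed at 190, as in the text (whether the original analysis fixed or tuned ncomp is an assumption):

```r
# Partial least squares with 190 latent components, 10-fold cross-validated.
set.seed(100)
plsFit <- train(x = solTrainXtrans, y = solTrainY,
                method = "pls",
                tuneGrid = data.frame(ncomp = 190),  # fixed, per the text above
                preProcess = c("center", "scale"),
                trControl = trainControl(method = "cv", number = 10))
plsFit
```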

The next option is ridge regression, which minimizes the error by penalizing the sum of the squares of the residuals: it only allows large regression coefficients when they produce a significant reduction in that sum. It is a second-order penalty (\(L_2\)) because the penalty is applied to the squared coefficients, \(\lambda\sum_{j=1}^p \beta_j^2\), for \(p\) coefficients.
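A sketch of the ridge fit; the grid of the \(L_2\) penalty lambda, from 0 to 0.1 in 16 steps, mirrors the resampling output shown below, and the seed is assumed:

```r
# Ridge regression (caret method "ridge", from the elasticnet package),
# tuning the L2 penalty lambda over a small grid.
ridgeGrid <- data.frame(lambda = seq(0, 0.1, length.out = 16))

set.seed(100)
ridgeFit <- train(x = solTrainXtrans, y = solTrainY,
                  method = "ridge",
                  tuneGrid = ridgeGrid,
                  preProcess = c("center", "scale"),
                  trControl = trainControl(method = "cv", number = 10))
ridgeFit
```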

## Ridge Regression 
## 
## 951 samples
## 228 predictors
## 
## Pre-processing: centered (228), scaled (228) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 856, 856, 856, 856, 855, 857, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE      
##   0.000000000  0.7037870  0.8848267  0.5205388
##   0.006666667  0.6924396  0.8884148  0.5231951
##   0.013333333  0.6858582  0.8904713  0.5184671
##   0.020000000  0.6835989  0.8912422  0.5171888
##   0.026666667  0.6832657  0.8914583  0.5174413
##   0.033333333  0.6840262  0.8913726  0.5184864
##   0.040000000  0.6854505  0.8911196  0.5200406
##   0.046666667  0.6873603  0.8907532  0.5218736
##   0.053333333  0.6896214  0.8903158  0.5239307
##   0.060000000  0.6921560  0.8898318  0.5262999
##   0.066666667  0.6949139  0.8893171  0.5288518
##   0.073333333  0.6978568  0.8887830  0.5314705
##   0.080000000  0.7009640  0.8882365  0.5343521
##   0.086666667  0.7042134  0.8876840  0.5373565
##   0.093333333  0.7075948  0.8871285  0.5403616
##   0.100000000  0.7110969  0.8865732  0.5434573
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.02666667.

After k-fold cross-validation, an \(R^2\) of 0.8915 and an RMSE of 0.6833 are found, with a lambda of approximately 0.027. An important improvement.

Finally, the lasso regression alternative is used, which has the advantage of shrinking to zero the coefficients that do not contribute, thereby performing a dimensionality reduction within the model. It is a first-order penalty (\(L_1\)) because the penalty is calculated over the absolute values of the coefficients, without squaring them: \(\lambda\sum_{j=1}^p|\beta_j|\).
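A sketch of the lasso via caret's elasticnet method, with the ridge penalty fixed at zero (the "ridge component equal to zero" mentioned below); the fraction grid and the seed are assumptions. The coefficients that survive can be extracted from the final elasticnet model:

```r
# Pure lasso: elastic net with the ridge penalty lambda fixed at 0;
# "fraction" is the fraction of the full L1 norm of the coefficients.
enetGrid <- expand.grid(lambda = 0,
                        fraction = seq(0.05, 1, length.out = 20))

set.seed(100)
lassoFit <- train(x = solTrainXtrans, y = solTrainY,
                  method = "enet",
                  tuneGrid = enetGrid,
                  preProcess = c("center", "scale"),
                  trControl = trainControl(method = "cv", number = 10))

# Coefficients at the chosen fraction; predictors shrunk exactly to zero drop out.
lassoCoef <- predict(lassoFit$finalModel,
                     newx = as.matrix(solTestXtrans),
                     s = lassoFit$bestTune$fraction,
                     mode = "fraction", type = "coefficients")$coefficients
lassoCoef[lassoCoef != 0]
```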

The variables, and their respective coefficients, that remain after the process are:

##        FP002        FP026        FP031        FP037        FP040        FP049 
##  0.239573523  0.231880862  0.009699797  0.109971530  0.303606900  0.195964123 
##        FP053        FP069        FP074        FP075        FP079        FP084 
##  0.193303854  0.030044758  0.159559330  0.145711421  0.018583356  0.101694029 
##        FP088        FP099        FP101        FP102        FP116        FP122 
##  0.094281834  0.117916921  0.067810564  0.002031774  0.216195476  0.149489388 
##        FP124        FP135        FP137        FP138        FP142        FP147 
##  0.234569455  0.270754050  0.403009855  0.091038060  0.435467535  0.135216424 
##        FP164        FP166        FP170        FP171        FP173        FP176 
##  0.198309607  0.026351863  0.019897858  0.041618171  0.202689588  0.061066726 
##        FP187        FP188        FP190        FP198        FP202        FP203 
##  0.021228523  0.194475893  0.067301399  0.052950069  0.267084442  0.051227148 
##    NumOxygen SurfaceArea1 
##  0.071906679  0.257557431

Lasso regression is popular because it performs variable selection (feature selection).

After k-fold cross-validation, the figures are an \(R^2\) of 0.8833 and an RMSE of 0.6949, with the \(\alpha\) of the elasticnet ridge component equal to zero, that is, a pure lasso. This is also an improvement over the baseline metrics, but ridge obtained better ones. Elasticnet is simply a combination of the second-order and first-order penalties: \[\alpha\,\lambda_{ridge}\sum_{j=1}^p \beta_j^2 + (1-\alpha)\,\lambda_{lasso}\sum_{j=1}^p|\beta_j|,\quad \alpha \in [0,1]\]

Summary of results by regression method:

Regression                      \(R^2\)   RMSE
Robust least squares            0.878     0.721
Principal components (PCR)      0.884     0.700
Robust PCR                      0.854     0.779
Partial least squares (PLS)     0.840     0.816
Ridge                           0.892     0.683
Lasso                           0.883     0.695

PCR, lasso, and ridge all improved the metrics; however, for the example data, ridge was the best option.


  1. A development of Chapter 6 of Applied Predictive Modeling, by Kuhn and Johnson.↩︎