QSAR model development

Cronin dataset is used to develop a QSAR model. First of all, different plots are generated in order to get an idea of possible correlations between the variables.

# Cronin dataset is loaded. pIG-50 is the activity, logKow predicts the distribution of a substance in various environmental compartments and LUMO is the lowest unoccupied molecular orbital.

library(readxl)
Cronin1996 <- read_excel("~/Desktop/Molecular Biotechnology/Cronin1996.xls")
Cronin1996
# plots are created
plot(Cronin1996$logKow, Cronin1996$`pIG-50`) 

# this plot shows a possible linear correlation between the variables
# correlation coefficient is calculated
cor(Cronin1996$logKow, Cronin1996$`pIG-50`) # 0.8791357
## [1] 0.8791357
plot(Cronin1996$LUMO, Cronin1996$`pIG-50`) 

cor(Cronin1996$LUMO, Cronin1996$`pIG-50`) # -0.2895907
## [1] -0.2895907
# no correlation seems to be present between the two variables

plot(Cronin1996$logKow, Cronin1996$LUMO)

cor(Cronin1996$logKow, Cronin1996$LUMO) # 0.07354909
## [1] 0.07354909
# no correlation seems to be clearly present 

Building test and training set

100 compounds are chosen randomly in order to build a training set and build a linear regression model. The 20 remaining compounds are used as testing set.

train <- sample(rownames(Cronin1996), size = 100, replace = FALSE) 
train_set <- Cronin1996[rownames(Cronin1996) %in% train, ] # randomly chosen 100 samples 
test <- setdiff(rownames(Cronin1996), train)
test_data <- Cronin1996[rownames(Cronin1996) %in% test, ] # randomly chosen 20 samples, excluding the other 100 

Linear regression models

Linear regression models are created among the variables of the training set.

x <- train_set$logKow
lm_m1 <- lm(train_set$`pIG-50` ~ x)
summary(lm_m1)
## 
## Call:
## lm(formula = train_set$`pIG-50` ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69590 -0.27968 -0.03319  0.26119  1.23033 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.9437     0.1000  -9.434 2.05e-15 ***
## x             0.6107     0.0346  17.652  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3843 on 98 degrees of freedom
## Multiple R-squared:  0.7607, Adjusted R-squared:  0.7583 
## F-statistic: 311.6 on 1 and 98 DF,  p-value: < 2.2e-16
r1 <- summary(lm_m1)$r.squared # r squared is put in a variable
r1 
## [1] 0.7607383
X <- train_set$LUMO
lm_m2 <- lm(train_set$`pIG-50` ~ X)
summary(lm_m2)
## 
## Call:
## lm(formula = train_set$`pIG-50` ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6800 -0.4782 -0.1360  0.5122  1.9478 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.68246    0.07691   8.873 3.37e-14 ***
## X           -0.38181    0.18377  -2.078   0.0404 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7689 on 98 degrees of freedom
## Multiple R-squared:  0.04219,    Adjusted R-squared:  0.03242 
## F-statistic: 4.317 on 1 and 98 DF,  p-value: 0.04035
r2 <- summary(lm_m2)$r.squared
r2 # in this case r squared value is very low, suggesting a bad correlation between 
## [1] 0.04219131
# the two variables 
x1 <- train_set$logKow
x2 <- train_set$LUMO
lm_m3 <- lm(train_set$`pIG-50` ~ x1 + x2) # multiregression model
summary(lm_m3)
## 
## Call:
## lm(formula = train_set$`pIG-50` ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47972 -0.17418 -0.02795  0.17825  0.71078 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.07667    0.06732  -15.99   <2e-16 ***
## x1           0.65765    0.02330   28.23   <2e-16 ***
## x2          -0.69510    0.06185  -11.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2546 on 97 degrees of freedom
## Multiple R-squared:  0.8961, Adjusted R-squared:  0.8939 
## F-statistic: 418.2 on 2 and 97 DF,  p-value: < 2.2e-16
r3 <- summary(lm_m3)$r.squared
r3 
## [1] 0.8960704

The R-squared value is good for the first model, for which the dependent variable is the pIG-50 and the independent one is the logKow. On the contrary, the second model has an R-squared which is very low, suggesting that the model is not good. Notice that the slope in this case is negative, suggesting that pIG-50 could be inversely correlated to LUMO. The third model is the one with two variables. The R-squared is higher than the first model, but this could be due to the fact that there is one more variable and not to a real contribution by LUMO. In order to understand this point, validation methods have to be performed.

Validation methods

Test set method

Predicted values calculated on the testing set are used to calculate R-squared values.

# The predictivity of the model on the test set is evaluated 
newdata <- data.frame(x =  test_data$logKow) 
# logKow column is selected to give it to the model in order to predict the value 
pred_1 <- predict(lm_m1, newdata)

newdata2 <- data.frame(X =  test_data$LUMO) 
pred_2 <- predict(lm_m2, newdata2)

newdata3 <- data.frame(x1 = test_data$logKow, x2 = test_data$LUMO)
pred_3 <- predict(lm_m3, newdata3)

# A function to calculate R-squared is created 
r2 <- function(y, y_pred) {
  1 - sum((y - y_pred)^2)/sum((y - mean(y))^2)
}

#R-squared calculated using the function r2 
r2(test_data$`pIG-50`, pred_1) 
## [1] 0.795911
r2(test_data$`pIG-50`, pred_2) # R-squared again very low, from this moment on this 
## [1] 0.1758223
# model will not be taken anymore into account
r2(test_data$`pIG-50`, pred_3) 
## [1] 0.903919

The R-squared are similar to the ones obtained for the training set.

LOOCV

A second validation approach exploits the Leave-one-out Cross Validation.

#LOOCV

library(caret) 
## Loading required package: lattice
## Loading required package: ggplot2
# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model 1
model_1 <- train(`pIG-50`~ logKow, data = train_set, method = "lm",
               trControl = train.control)
# Summarize the results
print(model_1) 
## Linear Regression 
## 
## 100 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 99, 99, 99, 99, 99, 99, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.3885099  0.7505453  0.3050574
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Train the model 3
model_3 <- train(`pIG-50`~ logKow + LUMO, data = train_set, method = "lm",
               trControl = train.control)
# Summarize the results
print(model_3) 
## Linear Regression 
## 
## 100 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 99, 99, 99, 99, 99, 99, ... 
## Resampling results:
## 
##   RMSE       Rsquared  MAE     
##   0.2594741  0.888741  0.211934
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

The third model seems to be the best one since it has the lowest RMSE, highest R-squared and lowest MAE.

K-fold cross validation

K-fold cross validation is performed as further validation method.

# Define training control
train.control <- trainControl(method = "cv", number = 5)
# Train the model 1
model_11 <- train(`pIG-50`~ logKow, data = train_set, method = "lm",
               trControl = train.control)
# Summarize the results
print(model_11)
## Linear Regression 
## 
## 100 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 80, 80, 80, 80, 80 
## Resampling results:
## 
##   RMSE       Rsquared  MAE      
##   0.3825486  0.766484  0.3041659
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Train the model 3
model_33 <- train(`pIG-50`~ logKow + LUMO, data = train_set, method = "lm",
               trControl = train.control)
# Summarize the results
print(model_33)  
## Linear Regression 
## 
## 100 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 80, 80, 80, 80, 80 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.2648751  0.9094962  0.2163733
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Once again the two-variable model seems to be the best one. The difference with the R-squared of the one-variable model is quite high, suggesting that the better performance is not due to the addition of one more variable.

Final plots

x <- test_data$`pIG-50`
lm_final <- lm(pred_1 ~ x, data = test_data) 
summary(lm_final)
## 
## Call:
## lm(formula = pred_1 ~ x, data = test_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39988 -0.11663 -0.04406  0.17292  0.33188 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.23593    0.05453   4.327 0.000406 ***
## x            0.59619    0.04772  12.492 2.63e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2079 on 18 degrees of freedom
## Multiple R-squared:  0.8966, Adjusted R-squared:  0.8908 
## F-statistic: 156.1 on 1 and 18 DF,  p-value: 2.635e-10
plot(test_data$`pIG-50`, pred_1, ylab = "pIG-50 predicted values", xlab = "pIG-50 experimental values")
abline(a = lm_final$coefficients[1], b = lm_final$coefficients[2], col = "red") 

# abline adds the line to the plot 


lm_final3 <- lm(pred_3 ~ x, data = test_data)
summary(lm_final3)
## 
## Call:
## lm(formula = pred_3 ~ x, data = test_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38899 -0.17107 -0.08462  0.23960  0.50434 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1634     0.0706   2.315   0.0327 *  
## x             0.8545     0.0618  13.828 4.99e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2692 on 18 degrees of freedom
## Multiple R-squared:  0.914,  Adjusted R-squared:  0.9092 
## F-statistic: 191.2 on 1 and 18 DF,  p-value: 4.988e-11
plot(test_data$`pIG-50`, pred_3, ylab = "pIG-50 predicted values", xlab = "pIG-50 experimental values")
abline(a = lm_final3$coefficients[1], b = lm_final3$coefficients[2], col = "red")   

The last graph is confirming the hypothesis that the two-variable model is the best one, indeed the line is closer to the points with respect to the graph built for the one-variable model.

Second trial with different training and test set

The same passages are followed, but this time using the first 100 compounds as training set and the last 20 as test set.

train_set_2 <- Cronin1996[1:100,] # first 100 compounds are selected 
test_data_2 <- Cronin1996[101:120,] # last 20 
a <- train_set_2$logKow
lm_m1_2 <- lm(train_set_2$`pIG-50` ~ a)
summary(lm_m1_2)
## 
## Call:
## lm(formula = train_set_2$`pIG-50` ~ a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6977 -0.2705 -0.0320  0.2686  1.2278 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.94082    0.09501  -9.902   <2e-16 ***
## a            0.61048    0.03264  18.705   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3657 on 98 degrees of freedom
## Multiple R-squared:  0.7812, Adjusted R-squared:  0.779 
## F-statistic: 349.9 on 1 and 98 DF,  p-value: < 2.2e-16
# Residual standard error: 0.3657 ; Multiple R-squared:  0.7812 ; 
# Intercept   -0.94082 ;   Slope  0.61048
r1_2 <- summary(lm_m1_2)$r.squared # r squared is put in a variable
r1_2
## [1] 0.7811913
a1 <- train_set_2$LUMO
lm_m2_2 <- lm(train_set_2$`pIG-50` ~ a1)
summary(lm_m2_2)
## 
## Call:
## lm(formula = train_set_2$`pIG-50` ~ a1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68803 -0.56152 -0.08088  0.50447  1.93005 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.68667    0.07699   8.920 2.67e-14 ***
## a1          -0.34994    0.17892  -1.956   0.0533 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7671 on 98 degrees of freedom
## Multiple R-squared:  0.03757,    Adjusted R-squared:  0.02775 
## F-statistic: 3.826 on 1 and 98 DF,  p-value: 0.05333
# Residual standard error: 0.7671 ; Multiple R-squared:  0.03757 ; 
# Intercept   0.68667 ;   Slope  -0.34994
r2_2 <- summary(lm_m2_2)$r.squared # in this case r squared value is very low, suggesting a bad correlation between 
# the two variables 
r2_2
## [1] 0.03756929
A1 <- train_set_2$logKow
A2 <- train_set_2$LUMO
lm_m3_2 <- lm(train_set_2$`pIG-50` ~ A1 + A2) # multiregression model
summary(lm_m3_2)
## 
## Call:
## lm(formula = train_set_2$`pIG-50` ~ A1 + A2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56476 -0.18662 -0.02485  0.18618  0.70049 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.06513    0.06688  -15.93   <2e-16 ***
## A1           0.64830    0.02289   28.32   <2e-16 ***
## A2          -0.62010    0.05984  -10.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2533 on 97 degrees of freedom
## Multiple R-squared:  0.8962, Adjusted R-squared:  0.894 
## F-statistic: 418.6 on 2 and 97 DF,  p-value: < 2.2e-16
# Residual standard error: 0.2533 ; Multiple R-squared:  0.8962 ; 
# Intercept   -1.06513 ;   Slope  0.64830 for logKow and -0.62010 for LUMO 
r3_2 <- summary(lm_m3_2)$r.squared
r3_2
## [1] 0.8961583

The obtained values with the new training and test sets are very similar to the ones obtained with random chosen samples.

# The predictivity of the model on the test set is evaluated 
newdata_2 <- data.frame(a =  test_data_2$logKow) 
pred_1_2 <- predict(lm_m1_2, newdata_2)

newdata2_2 <- data.frame(a1 =  test_data_2$LUMO) 
pred_2_2 <- predict(lm_m2_2, newdata2_2)

newdata3_2 <- data.frame(A1 = test_data_2$logKow, A2 = test_data_2$LUMO)
pred_3_2 <- predict(lm_m3_2, newdata3_2)

# A function to calculate R-squared is created 
r2 <- function(y, y_pred) {
  1 - sum((y - y_pred)^2)/sum((y - mean(y))^2)
}

#R-squared calculated using the function r2 
r2(test_data_2$`pIG-50`, pred_1_2) # 0.7274747
## [1] 0.7274747
r2(test_data_2$`pIG-50`, pred_2_2) # 0.1577498
## [1] 0.1577498
r2(test_data_2$`pIG-50`, pred_3_2) # 0.8987859
## [1] 0.8987859
#LOOCV 

library(caret) 

# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model 1
model_1_2 <- train(`pIG-50`~ logKow, data = train_set_2, method = "lm",
               trControl = train.control)
# Summarize the results
print(model_1_2) 
## Linear Regression 
## 
## 100 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 99, 99, 99, 99, 99, 99, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.3705084  0.7709434  0.2903446
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# RMSE 0.3705084  ;   R-squared 0.7709434   ;  MAE 0.2903446 
  

# Train the model 3
model_3_2 <- train(`pIG-50`~ logKow + LUMO, data = train_set_2, method = "lm",
               trControl = train.control)
# Summarize the results
print(model_3_2) 
## Linear Regression 
## 
## 100 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 99, 99, 99, 99, 99, 99, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.2584149  0.8885845  0.2128046
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# RMSE 0.2584149  ;   R-squared 0.8885845   ;  MAE 0.2128046
#K-fold Cross Validation

# Define training control
train.control <- trainControl(method = "cv", number = 5)
# Train the model 1
model_11_2 <- train(`pIG-50`~ logKow, data = train_set_2, method = "lm",
               trControl = train.control)
# Summarize the results
print(model_11_2)
## Linear Regression 
## 
## 100 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 80, 80, 80, 80, 80 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   0.369231  0.7803604  0.288805
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# RMSE  0.3736822  ;   R-squared  0.8104504  ;  MAE 0.2930736


# Train the model 3
model_33_2 <- train(`pIG-50`~ logKow + LUMO, data = train_set_2, method = "lm",
               trControl = train.control)
# Summarize the results
print(model_33_2)  
## Linear Regression 
## 
## 100 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 80, 80, 80, 80, 80 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.2614915  0.8986267  0.2177282
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# RMSE  0.2557998  ;   R-squared  0.9065938  ;  MAE 0.2102606

These validation methods give similar results with respect the ones obtained using random chosen samples. Lastly, final plots are built.

x_2 <- test_data_2$`pIG-50`
lm_final_2 <- lm(pred_1_2 ~ x_2, data = test_data_2)
summary(lm_final_2)
## 
## Call:
## lm(formula = pred_1_2 ~ x_2, data = test_data_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4262 -0.1616 -0.0062  0.1752  0.3747 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.25893    0.06036   4.290 0.000441 ***
## x_2          0.53123    0.05410   9.819 1.18e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2372 on 18 degrees of freedom
## Multiple R-squared:  0.8427, Adjusted R-squared:  0.8339 
## F-statistic: 96.41 on 1 and 18 DF,  p-value: 1.182e-08
plot(test_data_2$`pIG-50`, pred_1_2, ylab = "pIG-50 predicted values", xlab = "pIG-50 experimental values")
abline(a = lm_final_2$coefficients[1], b = lm_final_2$coefficients[2], col = "red") 

# abline adds the line to the plot 


lm_final3_2 <- lm(pred_3_2 ~ x_2, data = test_data_2)
summary(lm_final3_2)
## 
## Call:
## lm(formula = pred_3_2 ~ x_2, data = test_data_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31437 -0.13751  0.03063  0.11151  0.38336 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.11564    0.05181   2.232   0.0386 *  
## x_2          0.75081    0.04644  16.168 3.65e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2036 on 18 degrees of freedom
## Multiple R-squared:  0.9356, Adjusted R-squared:  0.932 
## F-statistic: 261.4 on 1 and 18 DF,  p-value: 3.652e-12
plot(test_data_2$`pIG-50`, pred_3_2, ylab = "pIG-50 predicted values", xlab = "pIG-50 experimental values")
abline(a = lm_final3_2$coefficients[1], b = lm_final3_2$coefficients[2], col = "red")   

Another time, the line from the two-variable model is closer to the points.

Conclusion

pIG-50 is linearly dependent on logKow indeed pIG-50 increases with logKow. On the other hand, pIG-50 is inversely correlated with LUMO, as can be inferred by the negative values of the slope of the second model. The p-value for this model is higher with respect to the others, suggesting that LUMO alone is not a good predictor of pIG. The third model is the one which performs better, since there are both logKow, responsible for the growth of pIG-50, and LUMO, which instead makes pIG-50 growing more slowly. The highest R2 values and the lowest RMSE values are confirming the hypothesis that the two-variable model is the best one. Indeed, it is effectively able to better fit the data, explaining the pIG-50 behavior in the best way among the three models proposed.