The Cronin dataset is used to develop a QSAR model. First, different plots are generated to get an idea of possible correlations between the variables.
# The Cronin dataset is loaded. pIG-50 is the activity, logKow (the octanol-water partition coefficient) describes how a substance distributes across environmental compartments, and LUMO is the energy of the lowest unoccupied molecular orbital.
library(readxl)
Cronin1996 <- read_excel("~/Desktop/Molecular Biotechnology/Cronin1996.xls")
Cronin1996
# plots are created
plot(Cronin1996$logKow, Cronin1996$`pIG-50`)
# this plot shows a possible linear correlation between the variables
# correlation coefficient is calculated
cor(Cronin1996$logKow, Cronin1996$`pIG-50`) # 0.8791357
## [1] 0.8791357
plot(Cronin1996$LUMO, Cronin1996$`pIG-50`)
cor(Cronin1996$LUMO, Cronin1996$`pIG-50`) # -0.2895907
## [1] -0.2895907
# no correlation seems to be present between the two variables
plot(Cronin1996$logKow, Cronin1996$LUMO)
cor(Cronin1996$logKow, Cronin1996$LUMO) # 0.07354909
## [1] 0.07354909
# no correlation seems to be clearly present
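All three pairwise relationships can also be inspected at once; a small base-R sketch, assuming the column names used above:
# Scatter-plot matrix and correlation matrix for the three variables at once
pairs(Cronin1996[, c("pIG-50", "logKow", "LUMO")])
cor(Cronin1996[, c("pIG-50", "logKow", "LUMO")])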
100 compounds are chosen randomly to build a training set and fit a linear regression model. The 20 remaining compounds are used as the test set.
train <- sample(rownames(Cronin1996), size = 100, replace = FALSE)
train_set <- Cronin1996[rownames(Cronin1996) %in% train, ] # randomly chosen 100 samples
test <- setdiff(rownames(Cronin1996), train)
test_data <- Cronin1996[rownames(Cronin1996) %in% test, ] # the remaining 20 samples, i.e. those not in the training set
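Note that sample() draws a different split at every run; for a reproducible split the seed could be fixed before sampling, as in this minimal sketch (the seed value is an arbitrary choice):
# Fixing the seed makes the random train/test split reproducible
set.seed(1) # arbitrary seed, for illustration only
train <- sample(rownames(Cronin1996), size = 100, replace = FALSE)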
Linear regression models are then fitted on the variables of the training set.
x <- train_set$logKow
lm_m1 <- lm(train_set$`pIG-50` ~ x)
summary(lm_m1)
##
## Call:
## lm(formula = train_set$`pIG-50` ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69590 -0.27968 -0.03319 0.26119 1.23033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9437 0.1000 -9.434 2.05e-15 ***
## x 0.6107 0.0346 17.652 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3843 on 98 degrees of freedom
## Multiple R-squared: 0.7607, Adjusted R-squared: 0.7583
## F-statistic: 311.6 on 1 and 98 DF, p-value: < 2.2e-16
r1 <- summary(lm_m1)$r.squared # r squared is put in a variable
r1
## [1] 0.7607383
X <- train_set$LUMO
lm_m2 <- lm(train_set$`pIG-50` ~ X)
summary(lm_m2)
##
## Call:
## lm(formula = train_set$`pIG-50` ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6800 -0.4782 -0.1360 0.5122 1.9478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.68246 0.07691 8.873 3.37e-14 ***
## X -0.38181 0.18377 -2.078 0.0404 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7689 on 98 degrees of freedom
## Multiple R-squared: 0.04219, Adjusted R-squared: 0.03242
## F-statistic: 4.317 on 1 and 98 DF, p-value: 0.04035
r2 <- summary(lm_m2)$r.squared
r2 # the R-squared value is very low here, suggesting a weak correlation between the two variables
## [1] 0.04219131
x1 <- train_set$logKow
x2 <- train_set$LUMO
lm_m3 <- lm(train_set$`pIG-50` ~ x1 + x2) # multiregression model
summary(lm_m3)
##
## Call:
## lm(formula = train_set$`pIG-50` ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47972 -0.17418 -0.02795 0.17825 0.71078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.07667 0.06732 -15.99 <2e-16 ***
## x1 0.65765 0.02330 28.23 <2e-16 ***
## x2 -0.69510 0.06185 -11.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2546 on 97 degrees of freedom
## Multiple R-squared: 0.8961, Adjusted R-squared: 0.8939
## F-statistic: 418.2 on 2 and 97 DF, p-value: < 2.2e-16
r3 <- summary(lm_m3)$r.squared
r3
## [1] 0.8960704
The R-squared value is good for the first model, in which the dependent variable is pIG-50 and the independent one is logKow. By contrast, the second model has a very low R-squared, suggesting that the model is poor. Notice that the slope is negative in this case, suggesting that pIG-50 could be inversely correlated with LUMO. The third model is the one with two variables. Its R-squared is higher than that of the first model, but this could be due to the extra variable rather than to a real contribution from LUMO. To settle this point, validation methods have to be performed.
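Before running the validation, a quick check is to compare the adjusted R-squared values, which penalize additional predictors: if LUMO contributed nothing, the adjusted value of the third model would not increase. A short sketch using the models already fitted:
# Adjusted R-squared penalizes extra predictors, so an increase here points to
# a real contribution of LUMO rather than to mere model size
summary(lm_m1)$adj.r.squared # one variable: 0.7583
summary(lm_m3)$adj.r.squared # two variables: 0.8939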
Predicted values computed on the test set are used to calculate R-squared values.
# The predictivity of the models on the test set is evaluated
newdata <- data.frame(x = test_data$logKow)
# the logKow column of the test set is passed to the model to predict the activity
pred_1 <- predict(lm_m1, newdata)
newdata2 <- data.frame(X = test_data$LUMO)
pred_2 <- predict(lm_m2, newdata2)
newdata3 <- data.frame(x1 = test_data$logKow, x2 = test_data$LUMO)
pred_3 <- predict(lm_m3, newdata3)
# A function to calculate R-squared is created
r2 <- function(y, y_pred) { # note: this overwrites the numeric variable r2 defined above
  # 1 - RSS/TSS: residual sum of squares over total sum of squares
  1 - sum((y - y_pred)^2) / sum((y - mean(y))^2)
}
# R-squared calculated using the r2 function
r2(test_data$`pIG-50`, pred_1)
## [1] 0.795911
r2(test_data$`pIG-50`, pred_2) # R-squared again very low; this model is not considered further
## [1] 0.1758223
r2(test_data$`pIG-50`, pred_3)
## [1] 0.903919
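An analogous RMSE helper could complement r2 and be compared later with the RMSE column that caret reports; a minimal sketch:
# Companion helper: root mean squared error of the test-set predictions
rmse <- function(y, y_pred) sqrt(mean((y - y_pred)^2))
rmse(test_data$`pIG-50`, pred_3) # external RMSE of the two-variable model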
The R-squared values are similar to the ones obtained on the training set.
A second validation approach exploits leave-one-out cross-validation (LOOCV).
# LOOCV
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model 1
model_1 <- train(`pIG-50`~ logKow, data = train_set, method = "lm",
trControl = train.control)
# Summarize the results
print(model_1)
## Linear Regression
##
## 100 samples
## 1 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 99, 99, 99, 99, 99, 99, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.3885099 0.7505453 0.3050574
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Train the model 3
model_3 <- train(`pIG-50`~ logKow + LUMO, data = train_set, method = "lm",
trControl = train.control)
# Summarize the results
print(model_3)
## Linear Regression
##
## 100 samples
## 2 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 99, 99, 99, 99, 99, 99, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.2594741 0.888741 0.211934
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The third model seems to be the best one since it has the lowest RMSE, highest R-squared and lowest MAE.
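The same LOOCV figures can also be reproduced by hand, which makes explicit what caret does internally; a minimal base-R sketch for the two-variable model:
# Hand-rolled LOOCV: each compound is predicted by a model fitted on the other 99
loo_pred <- sapply(seq_len(nrow(train_set)), function(i) {
  fit <- lm(`pIG-50` ~ logKow + LUMO, data = train_set[-i, ])
  predict(fit, newdata = train_set[i, ])
})
sqrt(mean((train_set$`pIG-50` - loo_pred)^2)) # should match caret's RMSE above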
K-fold cross-validation is performed as a further validation method.
# Define training control
train.control <- trainControl(method = "cv", number = 5)
# Train the model 1
model_11 <- train(`pIG-50`~ logKow, data = train_set, method = "lm",
trControl = train.control)
# Summarize the results
print(model_11)
## Linear Regression
##
## 100 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 80, 80, 80, 80, 80
## Resampling results:
##
## RMSE Rsquared MAE
## 0.3825486 0.766484 0.3041659
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Train the model 3
model_33 <- train(`pIG-50`~ logKow + LUMO, data = train_set, method = "lm",
trControl = train.control)
# Summarize the results
print(model_33)
## Linear Regression
##
## 100 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 80, 80, 80, 80, 80
## Resampling results:
##
## RMSE Rsquared MAE
## 0.2648751 0.9094962 0.2163733
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
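Note that the five folds are reassigned at random every time train() is called, so these figures shift slightly from run to run; setting a seed just before training makes them repeatable (the seed value is arbitrary):
# Fixing the seed before train() makes the fold assignment reproducible
set.seed(1)
model_33 <- train(`pIG-50` ~ logKow + LUMO, data = train_set, method = "lm",
trControl = train.control)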
Once again the two-variable model seems to be the best one. The gap in cross-validated R-squared with respect to the one-variable model is large, suggesting that the better performance reflects a real contribution of LUMO rather than the mere addition of a variable. Finally, the predicted values are plotted against the experimental values of the test set, and a regression line is fitted to them.
x <- test_data$`pIG-50` # the experimental test-set activities (x is reused here)
lm_final <- lm(pred_1 ~ x, data = test_data)
summary(lm_final)
##
## Call:
## lm(formula = pred_1 ~ x, data = test_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39988 -0.11663 -0.04406 0.17292 0.33188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.23593 0.05453 4.327 0.000406 ***
## x 0.59619 0.04772 12.492 2.63e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2079 on 18 degrees of freedom
## Multiple R-squared: 0.8966, Adjusted R-squared: 0.8908
## F-statistic: 156.1 on 1 and 18 DF, p-value: 2.635e-10
plot(test_data$`pIG-50`, pred_1, ylab = "pIG-50 predicted values", xlab = "pIG-50 experimental values")
abline(a = lm_final$coefficients[1], b = lm_final$coefficients[2], col = "red")
# abline adds the line to the plot
lm_final3 <- lm(pred_3 ~ x, data = test_data)
summary(lm_final3)
##
## Call:
## lm(formula = pred_3 ~ x, data = test_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38899 -0.17107 -0.08462 0.23960 0.50434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1634 0.0706 2.315 0.0327 *
## x 0.8545 0.0618 13.828 4.99e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2692 on 18 degrees of freedom
## Multiple R-squared: 0.914, Adjusted R-squared: 0.9092
## F-statistic: 191.2 on 1 and 18 DF, p-value: 4.988e-11
plot(test_data$`pIG-50`, pred_3, ylab = "pIG-50 predicted values", xlab = "pIG-50 experimental values")
abline(a = lm_final3$coefficients[1], b = lm_final3$coefficients[2], col = "red")
This last graph confirms the hypothesis that the two-variable model is the best one: the points lie closer to the fitted line than in the graph built for the one-variable model.
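As a further visual check, the identity line y = x could be added to these predicted-versus-experimental plots: a perfect model would have intercept 0 and slope 1, placing every point on that line.
# Dashed identity line for reference: a perfectly predictive model would lie on it
abline(a = 0, b = 1, lty = 2)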
The same steps are repeated, but this time using the first 100 compounds as the training set and the last 20 as the test set.
train_set_2 <- Cronin1996[1:100,] # first 100 compounds are selected
test_data_2 <- Cronin1996[101:120,] # last 20
a <- train_set_2$logKow
lm_m1_2 <- lm(train_set_2$`pIG-50` ~ a)
summary(lm_m1_2)
##
## Call:
## lm(formula = train_set_2$`pIG-50` ~ a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6977 -0.2705 -0.0320 0.2686 1.2278
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.94082 0.09501 -9.902 <2e-16 ***
## a 0.61048 0.03264 18.705 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3657 on 98 degrees of freedom
## Multiple R-squared: 0.7812, Adjusted R-squared: 0.779
## F-statistic: 349.9 on 1 and 98 DF, p-value: < 2.2e-16
# Residual standard error: 0.3657 ; Multiple R-squared: 0.7812 ;
# Intercept -0.94082 ; Slope 0.61048
r1_2 <- summary(lm_m1_2)$r.squared # r squared is put in a variable
r1_2
## [1] 0.7811913
a1 <- train_set_2$LUMO
lm_m2_2 <- lm(train_set_2$`pIG-50` ~ a1)
summary(lm_m2_2)
##
## Call:
## lm(formula = train_set_2$`pIG-50` ~ a1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.68803 -0.56152 -0.08088 0.50447 1.93005
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.68667 0.07699 8.920 2.67e-14 ***
## a1 -0.34994 0.17892 -1.956 0.0533 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7671 on 98 degrees of freedom
## Multiple R-squared: 0.03757, Adjusted R-squared: 0.02775
## F-statistic: 3.826 on 1 and 98 DF, p-value: 0.05333
# Residual standard error: 0.7671 ; Multiple R-squared: 0.03757 ;
# Intercept 0.68667 ; Slope -0.34994
r2_2 <- summary(lm_m2_2)$r.squared # again a very low R-squared, suggesting a weak correlation between the two variables
r2_2
## [1] 0.03756929
A1 <- train_set_2$logKow
A2 <- train_set_2$LUMO
lm_m3_2 <- lm(train_set_2$`pIG-50` ~ A1 + A2) # multiregression model
summary(lm_m3_2)
##
## Call:
## lm(formula = train_set_2$`pIG-50` ~ A1 + A2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56476 -0.18662 -0.02485 0.18618 0.70049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.06513 0.06688 -15.93 <2e-16 ***
## A1 0.64830 0.02289 28.32 <2e-16 ***
## A2 -0.62010 0.05984 -10.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2533 on 97 degrees of freedom
## Multiple R-squared: 0.8962, Adjusted R-squared: 0.894
## F-statistic: 418.6 on 2 and 97 DF, p-value: < 2.2e-16
# Residual standard error: 0.2533 ; Multiple R-squared: 0.8962 ;
# Intercept -1.06513 ; Slope 0.64830 for logKow and -0.62010 for LUMO
r3_2 <- summary(lm_m3_2)$r.squared
r3_2
## [1] 0.8961583
The values obtained with the new training and test sets are very similar to the ones obtained with randomly chosen samples.
# The predictivity of the models on the new test set is evaluated
newdata_2 <- data.frame(a = test_data_2$logKow)
pred_1_2 <- predict(lm_m1_2, newdata_2)
newdata2_2 <- data.frame(a1 = test_data_2$LUMO)
pred_2_2 <- predict(lm_m2_2, newdata2_2)
newdata3_2 <- data.frame(A1 = test_data_2$logKow, A2 = test_data_2$LUMO)
pred_3_2 <- predict(lm_m3_2, newdata3_2)
# The r2 function defined above is reused to calculate R-squared on the new test set
r2(test_data_2$`pIG-50`, pred_1_2) # 0.7274747
## [1] 0.7274747
r2(test_data_2$`pIG-50`, pred_2_2) # 0.1577498
## [1] 0.1577498
r2(test_data_2$`pIG-50`, pred_3_2) # 0.8987859
## [1] 0.8987859
# LOOCV
library(caret)
# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model 1
model_1_2 <- train(`pIG-50`~ logKow, data = train_set_2, method = "lm",
trControl = train.control)
# Summarize the results
print(model_1_2)
## Linear Regression
##
## 100 samples
## 1 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 99, 99, 99, 99, 99, 99, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.3705084 0.7709434 0.2903446
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# RMSE 0.3705084 ; R-squared 0.7709434 ; MAE 0.2903446
# Train the model 3
model_3_2 <- train(`pIG-50`~ logKow + LUMO, data = train_set_2, method = "lm",
trControl = train.control)
# Summarize the results
print(model_3_2)
## Linear Regression
##
## 100 samples
## 2 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 99, 99, 99, 99, 99, 99, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.2584149 0.8885845 0.2128046
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# RMSE 0.2584149 ; R-squared 0.8885845 ; MAE 0.2128046
# K-fold cross-validation
# Define training control
train.control <- trainControl(method = "cv", number = 5)
# Train the model 1
model_11_2 <- train(`pIG-50`~ logKow, data = train_set_2, method = "lm",
trControl = train.control)
# Summarize the results
print(model_11_2)
## Linear Regression
##
## 100 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 80, 80, 80, 80, 80
## Resampling results:
##
## RMSE Rsquared MAE
## 0.369231 0.7803604 0.288805
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# RMSE 0.369231 ; R-squared 0.7803604 ; MAE 0.288805
# Train the model 3
model_33_2 <- train(`pIG-50`~ logKow + LUMO, data = train_set_2, method = "lm",
trControl = train.control)
# Summarize the results
print(model_33_2)
## Linear Regression
##
## 100 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 80, 80, 80, 80, 80
## Resampling results:
##
## RMSE Rsquared MAE
## 0.2614915 0.8986267 0.2177282
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# RMSE 0.2614915 ; R-squared 0.8986267 ; MAE 0.2177282
These validation methods give results similar to the ones obtained using randomly chosen samples. Lastly, the final plots are built.
x_2 <- test_data_2$`pIG-50`
lm_final_2 <- lm(pred_1_2 ~ x_2, data = test_data_2)
summary(lm_final_2)
##
## Call:
## lm(formula = pred_1_2 ~ x_2, data = test_data_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4262 -0.1616 -0.0062 0.1752 0.3747
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.25893 0.06036 4.290 0.000441 ***
## x_2 0.53123 0.05410 9.819 1.18e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2372 on 18 degrees of freedom
## Multiple R-squared: 0.8427, Adjusted R-squared: 0.8339
## F-statistic: 96.41 on 1 and 18 DF, p-value: 1.182e-08
plot(test_data_2$`pIG-50`, pred_1_2, ylab = "pIG-50 predicted values", xlab = "pIG-50 experimental values")
abline(a = lm_final_2$coefficients[1], b = lm_final_2$coefficients[2], col = "red")
# abline adds the line to the plot
lm_final3_2 <- lm(pred_3_2 ~ x_2, data = test_data_2)
summary(lm_final3_2)
##
## Call:
## lm(formula = pred_3_2 ~ x_2, data = test_data_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31437 -0.13751 0.03063 0.11151 0.38336
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.11564 0.05181 2.232 0.0386 *
## x_2 0.75081 0.04644 16.168 3.65e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2036 on 18 degrees of freedom
## Multiple R-squared: 0.9356, Adjusted R-squared: 0.932
## F-statistic: 261.4 on 1 and 18 DF, p-value: 3.652e-12
plot(test_data_2$`pIG-50`, pred_3_2, ylab = "pIG-50 predicted values", xlab = "pIG-50 experimental values")
abline(a = lm_final3_2$coefficients[1], b = lm_final3_2$coefficients[2], col = "red")
Once again, the points lie closer to the regression line for the two-variable model.
pIG-50 is linearly dependent on logKow: indeed, pIG-50 increases with logKow. On the other hand, pIG-50 is inversely correlated with LUMO, as can be inferred from the negative slope of the second model. The p-value for this model is higher than for the others, suggesting that LUMO alone is not a good predictor of pIG-50. The third model performs best, since it includes both logKow, which drives the increase of pIG-50, and LUMO, whose negative coefficient moderates that increase. The highest R-squared values and the lowest RMSE values confirm the hypothesis that the two-variable model is the best one. Indeed, it fits the data better, explaining the behavior of pIG-50 in the best way among the three models proposed.
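Writing out the coefficients of the two-variable model fitted on the first 100 compounds (lm_m3_2) gives the final QSAR equation in explicit form:
# Final QSAR equation from lm_m3_2 (coefficients rounded to three decimals):
# pIG-50 = -1.065 + 0.648 * logKow - 0.620 * LUMO
# e.g. logKow = 2 and LUMO = -0.5 give -1.065 + 0.648*2 - 0.620*(-0.5) = 0.541
coef(lm_m3_2)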