The dataset utilized in this exercise is composed of 165 instances and 1,107 binary attributes that describe molecular fingerprints. Permeability, the attribute to be predicted, measures how readily a compound crosses a membrane.
The function nearZeroVar() was employed to eliminate variables that had insignificant variance. Such variables usually do not add any value to a model and may harm its performance.
In total, 719 of the initial 1,107 predictors were dropped, leaving 388 predictors for the 165 observations.
Filtering reduces the dimensionality of the problem, although the number of predictors still exceeds the number of observations.
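The filtering criterion can be sketched in base R. The report itself used caret's nearZeroVar(); the function below is a hypothetical stand-in that mimics its documented default rule (flag a predictor when the ratio of the most common to the second most common value exceeds 95/5 and fewer than 10% of its values are distinct). The function name and the toy matrix are assumptions made for illustration only.

```r
# Hypothetical base R sketch of a near-zero-variance filter.
# Thresholds mirror caret's documented defaults (freqCut = 95/5, uniqueCut = 10).
near_zero_var_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- vapply(seq_len(ncol(x)), function(j) {
    counts <- sort(table(x[, j]), decreasing = TRUE)
    if (length(counts) < 2) return(TRUE)          # constant column
    freq_ratio <- counts[[1]] / counts[[2]]       # most common vs. runner-up
    pct_unique <- 100 * length(counts) / nrow(x)  # share of distinct values
    freq_ratio > freq_cut && pct_unique < unique_cut
  }, logical(1))
  which(flagged)
}

# Toy binary "fingerprints": 100 rows, 3 columns (made-up data)
toy <- cbind(
  constant = rep(0, 100),               # never varies
  rare     = c(rep(0, 98), rep(1, 2)),  # 98:2 split
  balanced = rep(c(0, 1), 50)           # 50:50 split
)

drop_idx <- near_zero_var_sketch(toy)
toy_filtered <- toy[, -drop_idx, drop = FALSE]
colnames(toy_filtered)
```

In this toy case only the balanced column survives, which is the same kind of reduction that took the fingerprint matrix from 1,107 to 388 columns.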
Exercise 6.2 C
set.seed(123)
trainIndex <- createDataPartition(permeability, p = 0.8, list = FALSE)
trainX <- fingerprints_filtered[trainIndex, ]
testX <- fingerprints_filtered[-trainIndex, ]
trainY <- permeability[trainIndex]
testY <- permeability[-trainIndex]

# PLS model with tuning
ctrl <- trainControl(method = "cv", number = 10)
pls_model <- train(
  x = trainX,
  y = trainY,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = c("center", "scale")
)
pls_model
Partial Least Squares
133 samples
388 predictors
Pre-processing: centered (388), scaled (388)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 121, 121, 118, 119, 119, 119, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 13.31894 0.3442124 10.254018
2 11.78898 0.4830504 8.534741
3 11.98818 0.4792649 9.219285
4 12.04349 0.4923322 9.448926
5 11.79823 0.5193195 9.049121
6 11.53275 0.5335956 8.658301
7 11.64053 0.5229621 8.878265
8 11.86459 0.5144801 9.265252
9 11.98385 0.5188205 9.218594
10 12.55634 0.4808614 9.610747
11 12.69674 0.4758068 9.702325
12 13.01534 0.4538906 9.956623
13 13.12637 0.4367362 9.878017
14 13.44865 0.4140715 10.065088
15 13.60135 0.4034269 10.188150
16 13.79361 0.3943904 10.247160
17 14.00756 0.3845119 10.412776
18 14.18113 0.3711378 10.587027
19 14.25674 0.3703610 10.575726
20 14.33121 0.3723176 10.679764
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 6.
After the data were split into training (80%) and test (20%) sets, a PLS model was fit to the 133 training samples using 10-fold cross-validation. The predictors were centered and scaled before model fitting.
The number of components was tuned over 1 to 20, with cross-validated RMSE as the selection criterion. The optimal number of components was six, giving the minimum RMSE of about 11.53.
The cross-validated R² at six components was about 0.534. Beyond six components the cross-validated RMSE rose steadily, reaching about 14.33 at 20 components.
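The selection rule (smallest cross-validated RMSE) can be checked directly against the resampling table. The sketch below copies the RMSE column from the output above into a vector; the variable names are illustrative and not part of the original analysis.

```r
# Cross-validated RMSE for ncomp = 1..20, copied from the resampling table
cv_rmse <- c(13.31894, 11.78898, 11.98818, 12.04349, 11.79823,
             11.53275, 11.64053, 11.86459, 11.98385, 12.55634,
             12.69674, 13.01534, 13.12637, 13.44865, 13.60135,
             13.79361, 14.00756, 14.18113, 14.25674, 14.33121)

# The vector index equals the number of components
best_ncomp <- which.min(cv_rmse)
best_rmse <- cv_rmse[best_ncomp]

# How far the simpler two-component model is from the minimum, in percent
gap_pct <- 100 * (cv_rmse[2] - best_rmse) / best_rmse

best_ncomp
round(best_rmse, 2)
round(gap_pct, 1)
```

The two-component model sits only about 2% above the minimum, so a more parsimonious choice than six components might also be defensible; the report keeps caret's default of selecting the minimum.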
The trained PLS model was then evaluated on the test set, giving an RMSE of 12.35, an R² of 0.324, and an MAE of 8.29.
Performance drops clearly on the test data relative to the cross-validated estimates: R² falls from about 0.534 to 0.324, and RMSE rises from about 11.53 to 12.35. Patterns identified during training therefore carry over only partly to new data.
This is likely due to the high dimensionality of the data (388 predictors for 133 training samples) combined with the sparsity of the binary fingerprint descriptors.
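The three test-set measures reported above can be computed from observed and predicted values as follows. This is a minimal base R sketch: the helper name test_metrics is hypothetical, the toy vectors are made up, and R² is taken to be the squared correlation between observed and predicted values, which is how caret's postResample() defines it. The report does not state which function produced its figures, so that part is an assumption.

```r
# Hypothetical helper: RMSE, squared-correlation R^2, and MAE
test_metrics <- function(obs, pred) {
  c(
    RMSE     = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(obs, pred)^2,
    MAE      = mean(abs(obs - pred))
  )
}

# Toy observed and predicted values (made up for illustration)
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 1.5, 3.5, 3.5)

m <- test_metrics(obs, pred)
m
```

With the fitted model, the same helper would be applied to testY and the predictions for testX.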
For comparison with the PLS model, ridge and lasso regression models were also built. Both apply regularization, which helps with the high dimensionality of this problem.
The ridge regression model achieved a test R² of 0.327 and a test RMSE of 11.02, slightly better than the PLS model on both measures.
The lasso regression model explained more variance, with a test R² of 0.421, but its test RMSE of 11.61 was slightly higher than that of ridge.
On the test set, both the ridge and lasso models outperformed the PLS model.
Exercise 6.2 F
Among the models analyzed above, the lasso regression model showed the best predictive performance by test R² (0.421). Although the ridge model had the lower RMSE (11.02 versus 11.61), the higher R² of the lasso model indicates that its predictions track the variation in permeability more closely. The two measures can rank models differently when R² is computed as the squared correlation between observed and predicted values, as caret's postResample() does, because that quantity ignores bias in the predictions.
Because the dataset has many features relative to the number of observations, and the fingerprint features are sparse (mostly zeros), the lasso is well suited here: it performs feature selection by shrinking some coefficients exactly to zero.
Hence, I think the lasso model is an appropriate choice for permeability prediction.
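A small made-up example shows how one set of predictions can have both the higher R² and the higher RMSE, as lasso does relative to ridge here. It assumes R² is the squared correlation between observed and predicted values; the vectors and the two helper functions are illustrative only and are not taken from the fitted models.

```r
rmse    <- function(obs, pred) sqrt(mean((obs - pred)^2))
r2_corr <- function(obs, pred) cor(obs, pred)^2

obs <- c(2, 4, 6, 8, 10)

# Predictions A: perfectly correlated with obs but shifted by a constant bias
pred_a <- obs + 3
# Predictions B: smaller errors overall, but they alternate in sign
pred_b <- obs + c(1, -1, 1, -1, 1)

results <- rbind(
  A = c(RMSE = rmse(obs, pred_a), Rsquared = r2_corr(obs, pred_a)),
  B = c(RMSE = rmse(obs, pred_b), Rsquared = r2_corr(obs, pred_b))
)
results
```

Predictions A win on R² while B win on RMSE, so the choice between two models can depend on which measure matters more for the application.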
'data.frame': 176 obs. of 58 variables:
$ Yield : num 38 42.4 42 41.4 42.5 ...
$ BiologicalMaterial01 : num 6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
$ BiologicalMaterial02 : num 49.6 61 61 61 63.3 ...
$ BiologicalMaterial03 : num 57 67.5 67.5 67.5 72.2 ...
$ BiologicalMaterial04 : num 12.7 14.7 14.7 14.7 14 ...
$ BiologicalMaterial05 : num 19.5 19.4 19.4 19.4 17.9 ...
$ BiologicalMaterial06 : num 43.7 53.1 53.1 53.1 54.7 ...
$ BiologicalMaterial07 : num 100 100 100 100 100 100 100 100 100 100 ...
$ BiologicalMaterial08 : num 16.7 19 19 19 18.2 ...
$ BiologicalMaterial09 : num 11.4 12.6 12.6 12.6 12.8 ...
$ BiologicalMaterial10 : num 3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
$ BiologicalMaterial11 : num 138 154 154 154 148 ...
$ BiologicalMaterial12 : num 18.8 21.1 21.1 21.1 21.1 ...
$ ManufacturingProcess01: num NA 0 0 0 10.7 12 11.5 12 12 12 ...
$ ManufacturingProcess02: num NA 0 0 0 0 0 0 0 0 0 ...
$ ManufacturingProcess03: num NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
$ ManufacturingProcess04: num NA 917 912 911 918 924 933 929 928 938 ...
$ ManufacturingProcess05: num NA 1032 1004 1015 1028 ...
$ ManufacturingProcess06: num NA 210 207 213 206 ...
$ ManufacturingProcess07: num NA 177 178 177 178 178 177 178 177 177 ...
$ ManufacturingProcess08: num NA 178 178 177 178 178 178 178 177 177 ...
$ ManufacturingProcess09: num 43 46.6 45.1 44.9 45 ...
$ ManufacturingProcess10: num NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
$ ManufacturingProcess11: num NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
$ ManufacturingProcess12: num NA 0 0 0 0 0 0 0 0 0 ...
$ ManufacturingProcess13: num 35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
$ ManufacturingProcess14: num 4898 4869 4878 4897 4992 ...
$ ManufacturingProcess15: num 6108 6095 6087 6102 6233 ...
$ ManufacturingProcess16: num 4682 4617 4617 4635 4733 ...
$ ManufacturingProcess17: num 35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
$ ManufacturingProcess18: num 4865 4867 4877 4872 4886 ...
$ ManufacturingProcess19: num 6049 6097 6078 6073 6102 ...
$ ManufacturingProcess20: num 4665 4621 4621 4611 4659 ...
$ ManufacturingProcess21: num 0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
$ ManufacturingProcess22: num NA 3 4 5 8 9 1 2 3 4 ...
$ ManufacturingProcess23: num NA 0 1 2 4 1 1 2 3 1 ...
$ ManufacturingProcess24: num NA 3 4 5 18 1 1 2 3 4 ...
$ ManufacturingProcess25: num 4873 4869 4897 4892 4930 ...
$ ManufacturingProcess26: num 6074 6107 6116 6111 6151 ...
$ ManufacturingProcess27: num 4685 4630 4637 4630 4684 ...
$ ManufacturingProcess28: num 10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
$ ManufacturingProcess29: num 21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
$ ManufacturingProcess30: num 9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
$ ManufacturingProcess31: num 69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
$ ManufacturingProcess32: num 156 169 173 171 171 173 159 161 160 164 ...
$ ManufacturingProcess33: num 66 66 66 68 70 70 65 65 65 66 ...
$ ManufacturingProcess34: num 2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
$ ManufacturingProcess35: num 486 508 509 496 468 490 475 478 491 488 ...
$ ManufacturingProcess36: num 0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
$ ManufacturingProcess37: num 0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
$ ManufacturingProcess38: num 3 2 2 2 2 2 2 2 3 3 ...
$ ManufacturingProcess39: num 7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
$ ManufacturingProcess40: num NA 0.1 0 0 0 0 0 0 0 0 ...
$ ManufacturingProcess41: num NA 0.15 0 0 0 0 0 0 0 0 ...
$ ManufacturingProcess42: num 11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
$ ManufacturingProcess43: num 3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
$ ManufacturingProcess44: num 1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
$ ManufacturingProcess45: num 2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
summary(ChemicalManufacturingProcess)
Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
Min. :35.25 Min. :4.580 Min. :46.87 Min. :56.97
1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68 1st Qu.:64.98
Median :39.97 Median :6.305 Median :55.09 Median :67.22
Mean :40.18 Mean :6.411 Mean :55.69 Mean :67.70
3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74 3rd Qu.:70.43
Max. :46.34 Max. :8.810 Max. :64.75 Max. :78.25
BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
Min. : 9.38 Min. :13.24 Min. :40.60
1st Qu.:11.24 1st Qu.:17.23 1st Qu.:46.05
Median :12.10 Median :18.49 Median :48.46
Mean :12.35 Mean :18.60 Mean :48.91
3rd Qu.:13.22 3rd Qu.:19.90 3rd Qu.:51.34
Max. :23.09 Max. :24.85 Max. :59.38
BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
Min. :100.0 Min. :15.88 Min. :11.44
1st Qu.:100.0 1st Qu.:17.06 1st Qu.:12.60
Median :100.0 Median :17.51 Median :12.84
Mean :100.0 Mean :17.49 Mean :12.85
3rd Qu.:100.0 3rd Qu.:17.88 3rd Qu.:13.13
Max. :100.8 Max. :19.14 Max. :14.08
BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
Min. :1.770 Min. :135.8 Min. :18.35
1st Qu.:2.460 1st Qu.:143.8 1st Qu.:19.73
Median :2.710 Median :146.1 Median :20.12
Mean :2.801 Mean :147.0 Mean :20.20
3rd Qu.:2.990 3rd Qu.:149.6 3rd Qu.:20.75
Max. :6.870 Max. :158.7 Max. :22.21
ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
Min. : 0.00 Min. : 0.00 Min. :1.47
1st Qu.:10.80 1st Qu.:19.30 1st Qu.:1.53
Median :11.40 Median :21.00 Median :1.54
Mean :11.21 Mean :16.68 Mean :1.54
3rd Qu.:12.15 3rd Qu.:21.50 3rd Qu.:1.55
Max. :14.10 Max. :22.50 Max. :1.60
NA's :1 NA's :3 NA's :15
ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
Min. :911.0 Min. : 923.0 Min. :203.0
1st Qu.:928.0 1st Qu.: 986.8 1st Qu.:205.7
Median :934.0 Median : 999.2 Median :206.8
Mean :931.9 Mean :1001.7 Mean :207.4
3rd Qu.:936.0 3rd Qu.:1008.9 3rd Qu.:208.7
Max. :946.0 Max. :1175.3 Max. :227.4
NA's :1 NA's :1 NA's :2
ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
Min. :177.0 Min. :177.0 Min. :38.89
1st Qu.:177.0 1st Qu.:177.0 1st Qu.:44.89
Median :177.0 Median :178.0 Median :45.73
Mean :177.5 Mean :177.6 Mean :45.66
3rd Qu.:178.0 3rd Qu.:178.0 3rd Qu.:46.52
Max. :178.0 Max. :178.0 Max. :49.36
NA's :1 NA's :1
ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
Min. : 7.500 Min. : 7.500 Min. : 0.0
1st Qu.: 8.700 1st Qu.: 9.000 1st Qu.: 0.0
Median : 9.100 Median : 9.400 Median : 0.0
Mean : 9.179 Mean : 9.386 Mean : 857.8
3rd Qu.: 9.550 3rd Qu.: 9.900 3rd Qu.: 0.0
Max. :11.600 Max. :11.500 Max. :4549.0
NA's :9 NA's :10 NA's :1
ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
Min. :32.10 Min. :4701 Min. :5904
1st Qu.:33.90 1st Qu.:4828 1st Qu.:6010
Median :34.60 Median :4856 Median :6032
Mean :34.51 Mean :4854 Mean :6039
3rd Qu.:35.20 3rd Qu.:4882 3rd Qu.:6061
Max. :38.60 Max. :5055 Max. :6233
NA's :1
ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
Min. : 0 Min. :31.30 Min. : 0
1st Qu.:4561 1st Qu.:33.50 1st Qu.:4813
Median :4588 Median :34.40 Median :4835
Mean :4566 Mean :34.34 Mean :4810
3rd Qu.:4619 3rd Qu.:35.10 3rd Qu.:4862
Max. :4852 Max. :40.00 Max. :4971
ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
Min. :5890 Min. : 0 Min. :-1.8000
1st Qu.:6001 1st Qu.:4553 1st Qu.:-0.6000
Median :6022 Median :4582 Median :-0.3000
Mean :6028 Mean :4556 Mean :-0.1642
3rd Qu.:6050 3rd Qu.:4610 3rd Qu.: 0.0000
Max. :6146 Max. :4759 Max. : 3.6000
ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
Min. : 0.000 Min. :0.000 Min. : 0.000
1st Qu.: 3.000 1st Qu.:2.000 1st Qu.: 4.000
Median : 5.000 Median :3.000 Median : 8.000
Mean : 5.406 Mean :3.017 Mean : 8.834
3rd Qu.: 8.000 3rd Qu.:4.000 3rd Qu.:14.000
Max. :12.000 Max. :6.000 Max. :23.000
NA's :1 NA's :1 NA's :1
ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
Min. : 0 Min. : 0 Min. : 0
1st Qu.:4832 1st Qu.:6020 1st Qu.:4560
Median :4855 Median :6047 Median :4587
Mean :4828 Mean :6016 Mean :4563
3rd Qu.:4877 3rd Qu.:6070 3rd Qu.:4609
Max. :4990 Max. :6161 Max. :4710
NA's :5 NA's :5 NA's :5
ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
Min. : 0.000 Min. : 0.00 Min. : 0.000
1st Qu.: 0.000 1st Qu.:19.70 1st Qu.: 8.800
Median :10.400 Median :19.90 Median : 9.100
Mean : 6.592 Mean :20.01 Mean : 9.161
3rd Qu.:10.750 3rd Qu.:20.40 3rd Qu.: 9.700
Max. :11.500 Max. :22.00 Max. :11.200
NA's :5 NA's :5 NA's :5
ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
Min. : 0.00 Min. :143.0 Min. :56.00
1st Qu.:70.10 1st Qu.:155.0 1st Qu.:62.00
Median :70.80 Median :158.0 Median :64.00
Mean :70.18 Mean :158.5 Mean :63.54
3rd Qu.:71.40 3rd Qu.:162.0 3rd Qu.:65.00
Max. :72.50 Max. :173.0 Max. :70.00
NA's :5 NA's :5
ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
Min. :2.300 Min. :463.0 Min. :0.01700
1st Qu.:2.500 1st Qu.:490.0 1st Qu.:0.01900
Median :2.500 Median :495.0 Median :0.02000
Mean :2.494 Mean :495.6 Mean :0.01957
3rd Qu.:2.500 3rd Qu.:501.5 3rd Qu.:0.02000
Max. :2.600 Max. :522.0 Max. :0.02200
NA's :5 NA's :5 NA's :5
ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.:0.700 1st Qu.:2.000 1st Qu.:7.100
Median :1.000 Median :3.000 Median :7.200
Mean :1.014 Mean :2.534 Mean :6.851
3rd Qu.:1.300 3rd Qu.:3.000 3rd Qu.:7.300
Max. :2.300 Max. :3.000 Max. :7.500
ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
Min. :0.00000 Min. :0.00000 Min. : 0.00
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:11.40
Median :0.00000 Median :0.00000 Median :11.60
Mean :0.01771 Mean :0.02371 Mean :11.21
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:11.70
Max. :0.10000 Max. :0.20000 Max. :12.10
NA's :1 NA's :1
ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
Min. : 0.0000 Min. :0.000 Min. :0.000
1st Qu.: 0.6000 1st Qu.:1.800 1st Qu.:2.100
Median : 0.8000 Median :1.900 Median :2.200
Mean : 0.9119 Mean :1.805 Mean :2.138
3rd Qu.: 1.0250 3rd Qu.:1.900 3rd Qu.:2.300
Max. :11.0000 Max. :2.100 Max. :2.600
The Chemical Manufacturing Process dataset contains 176 observations of 58 variables: one response, Yield, and 57 numeric predictors (12 biological material measurements and 45 manufacturing process measurements). Several of the manufacturing process predictors contain missing values that must be handled before modeling.
To handle the missing values, median imputation was carried out with the preProcess() function, which replaces each missing value with the median of its predictor.
After preprocessing, no missing values remained in the dataset.
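Median imputation itself is simple enough to sketch in base R. The report used caret's preProcess(); the helper below, median_impute, is a hypothetical stand-in written for illustration, and the toy data frame is made up.

```r
# Hypothetical base R stand-in for median imputation:
# replace each NA in a numeric column with that column's median
median_impute <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.numeric(col)) {
      col[is.na(col)] <- median(col, na.rm = TRUE)
    }
    col
  })
  df
}

# Toy data with missing values (made up for illustration)
toy <- data.frame(
  a = c(1, NA, 3, 10),
  b = c(NA, 2, 2, 4)
)

imputed <- median_impute(toy)
imputed
anyNA(imputed)
```

The median is a reasonable default for predictors such as ManufacturingProcess12, whose summary shows a mean (857.8) far from its median (0), since the median is not pulled around by a few extreme values.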
Exercise 6.3 C
set.seed(123)
trainIndex <- createDataPartition(chem_data$Yield, p = 0.8, list = FALSE)
train_data <- chem_data[trainIndex, ]
test_data <- chem_data[-trainIndex, ]

ctrl <- trainControl(method = "cv", number = 10)
rf_model <- train(
  Yield ~ .,
  data = train_data,
  method = "rf",
  trControl = ctrl
)
rf_model
Random Forest
144 samples
57 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 131, 130, 130, 129, 131, 129, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 1.223603 0.6665869 0.9798701
29 1.133356 0.6614352 0.8937680
57 1.149647 0.6348743 0.8805167
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 29.
The data were randomly split into a training set (80%) and a test set (20%). A random forest model was then fit to the training set, with 10-fold cross-validation used to tune mtry.
RMSE was the selection criterion. The lowest cross-validated RMSE, 1.1334, was obtained at mtry = 29.
The random forest model was then used to predict the test set, giving an RMSE of 1.2694, an R² of 0.5382, and an MAE of 0.9677.
Compared with the cross-validated estimates at mtry = 29 (RMSE of about 1.1334 and R² of about 0.6614), the test set shows somewhat lower accuracy: RMSE is higher and R² is lower on the fresh test data.
This suggests some overfitting, in which the model captures features of the training data that do not generalize, although with a test set of only 32 samples part of the gap may be sampling variability. Nonetheless, the test performance remains reasonably close to the resampled estimate, which indicates acceptable generalization.
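The size of the gap between resampled and test performance can be quantified with a few lines of arithmetic. The values below are copied from the output and results reported above (cross-validated figures at mtry = 29); the variable names are illustrative.

```r
# Cross-validated estimates at mtry = 29 and the reported test-set results
cv   <- c(RMSE = 1.133356, Rsquared = 0.6614352)
test <- c(RMSE = 1.2694,   Rsquared = 0.5382)

# Relative increase in RMSE and absolute drop in R^2 on the test set
rmse_increase_pct <- 100 * (test[["RMSE"]] - cv[["RMSE"]]) / cv[["RMSE"]]
r2_drop <- cv[["Rsquared"]] - test[["Rsquared"]]

round(rmse_increase_pct, 1)
round(r2_drop, 3)
```

The test RMSE is about 12% higher than the cross-validated estimate and R² drops by about 0.12, a noticeable but moderate gap.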
According to the random forest variable importance values, the most important predictor is ManufacturingProcess32; the other top-ranked predictors are a mix of biological and process variables, including BiologicalMaterial06, ManufacturingProcess17, and BiologicalMaterial03.
Although both biological and manufacturing process predictors appear among the top-ranked variables, the process predictors dominate the list.
This suggests that adjusting process parameters is more likely to increase yield than adjusting the biological material, although variable importance alone does not establish a causal effect.
Examining the relationships between the top predictors and yield shows that several manufacturing process predictors, including ManufacturingProcess32 and ManufacturingProcess17, are related to yield to some degree. This indicates that variation in manufacturing conditions can affect the final yield.
Biological predictors such as BiologicalMaterial06 are also related to yield, but they appear to have less impact than the manufacturing process predictors.
This information can guide future production runs: by identifying the predictors with the greatest impact on yield, effort can be concentrated on those variables. In particular, tuning the manufacturing process predictors toward favorable values could increase yield.