DATA624 - Homework 7

Author

Anthony Josue Roman

Exercise 6.2

Exercise 6.2 A

library(caret)
library(AppliedPredictiveModeling)

data(permeability)

str(fingerprints)
 num [1:165, 1:1107] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:165] "1" "2" "3" "4" ...
  ..$ : chr [1:1107] "X1" "X2" "X3" "X4" ...
str(permeability)
 num [1:165, 1] 12.52 1.12 19.41 1.73 1.68 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:165] "1" "2" "3" "4" ...
  ..$ : chr "permeability"

The dataset utilized in this exercise is composed of 165 instances and 1,107 binary attributes that describe molecular fingerprints. Permeability, the attribute to be predicted, measures how readily a compound crosses a membrane.

Exercise 6.2 B

nzv <- nearZeroVar(fingerprints)

length(nzv)
[1] 719
fingerprints_filtered <- fingerprints[, -nzv]
dim(fingerprints_filtered)
[1] 165 388

The function nearZeroVar() was used to remove predictors with near-zero variance, that is, fingerprint columns in which almost every compound has the same value. Such predictors carry little information and can destabilize a model during resampling.

In total, 719 of the original 1,107 predictors were dropped, leaving 388 predictors for the 165 observations.

Even after filtering, the predictors still outnumber the observations by more than two to one, so the models below need dimension reduction or regularization.
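As a rough illustration of what the filter checks, the sketch below recomputes two per-column metrics in base R: the ratio of the most common value's frequency to that of the second most common, and the percentage of distinct values. The helper name flag_near_zero_var and the toy matrix are hypothetical, and the cutoffs (95/5 and 10) are assumed to match caret's defaults; in practice nearZeroVar() itself should be used.

```r
# Hypothetical helper: flags columns that are constant or dominated by one value.
flag_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  apply(x, 2, function(col) {
    counts <- sort(table(col), decreasing = TRUE)
    if (length(counts) == 1) return(TRUE)           # constant column
    freq_ratio <- counts[[1]] / counts[[2]]          # most vs. second most common
    pct_unique <- 100 * length(counts) / length(col) # share of distinct values
    freq_ratio > freq_cut && pct_unique <= unique_cut
  })
}

# Toy binary "fingerprints" (made-up data, 100 compounds).
toy <- cbind(
  rare     = c(rep(0, 98), rep(1, 2)),   # 98:2 split, ratio 49
  balanced = rep(c(0, 1), each = 50),    # 50:50 split, ratio 1
  constant = rep(0, 100)                 # zero variance
)

flags <- flag_near_zero_var(toy)
flags
```

In this toy case the rare and constant columns are flagged and the balanced column is kept, which mirrors why so many sparse fingerprint columns were dropped.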

Exercise 6.2 C

set.seed(123)

trainIndex <- createDataPartition(permeability, p = 0.8, list = FALSE)

trainX <- fingerprints_filtered[trainIndex, ]
testX  <- fingerprints_filtered[-trainIndex, ]

trainY <- permeability[trainIndex]
testY  <- permeability[-trainIndex]

# PLS model with tuning
ctrl <- trainControl(method = "cv", number = 10)

pls_model <- train(
  x = trainX,
  y = trainY,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = c("center", "scale")
)

pls_model
Partial Least Squares 

133 samples
388 predictors

Pre-processing: centered (388), scaled (388) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 121, 121, 118, 119, 119, 119, ... 
Resampling results across tuning parameters:

  ncomp  RMSE      Rsquared   MAE      
   1     13.31894  0.3442124  10.254018
   2     11.78898  0.4830504   8.534741
   3     11.98818  0.4792649   9.219285
   4     12.04349  0.4923322   9.448926
   5     11.79823  0.5193195   9.049121
   6     11.53275  0.5335956   8.658301
   7     11.64053  0.5229621   8.878265
   8     11.86459  0.5144801   9.265252
   9     11.98385  0.5188205   9.218594
  10     12.55634  0.4808614   9.610747
  11     12.69674  0.4758068   9.702325
  12     13.01534  0.4538906   9.956623
  13     13.12637  0.4367362   9.878017
  14     13.44865  0.4140715  10.065088
  15     13.60135  0.4034269  10.188150
  16     13.79361  0.3943904  10.247160
  17     14.00756  0.3845119  10.412776
  18     14.18113  0.3711378  10.587027
  19     14.25674  0.3703610  10.575726
  20     14.33121  0.3723176  10.679764

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 6.

After splitting the data into training (80%) and test (20%) sets, a PLS model was tuned with 10-fold cross-validation. The predictors were centered and scaled before fitting.

The number of components was tuned from 1 to 20, with RMSE as the selection metric. The optimal model used six components, with a cross-validated RMSE of about 11.53.

The corresponding cross-validated R² was about 0.534. Beyond six components the resampled RMSE rose steadily, reaching about 14.33 at 20 components, which suggests that the extra components fit noise.
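The selection step can be reproduced from the resampling table alone. The sketch below copies the cross-validated RMSE values from the output above and picks the minimum; it also shows, as an assumption that was not part of the original analysis, how a 3% tolerance rule would favor a simpler model. The names cv_rmse, tol, and simplest_ncomp are hypothetical.

```r
# Cross-validated RMSE for ncomp = 1..20, copied from the train() output.
cv_rmse <- c(13.31894, 11.78898, 11.98818, 12.04349, 11.79823, 11.53275,
             11.64053, 11.86459, 11.98385, 12.55634, 12.69674, 13.01534,
             13.12637, 13.44865, 13.60135, 13.79361, 14.00756, 14.18113,
             14.25674, 14.33121)

# caret's default rule: the component count with the smallest RMSE.
best_ncomp <- which.min(cv_rmse)

# Alternative (assumed) rule: the fewest components within 3% of the best RMSE.
tol <- 0.03
simplest_ncomp <- min(which(cv_rmse <= (1 + tol) * min(cv_rmse)))

c(best = best_ncomp, simplest = simplest_ncomp)
```

Under these numbers the minimum is at six components, while the tolerance rule selects two. caret offers comparable rules through the selectionFunction argument of trainControl() ("oneSE" and "tolerance"), which may be worth trying given how flat the RMSE curve is between two and nine components.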

Exercise 6.2 D

preds <- predict(pls_model, newdata = testX)

postResample(preds, testY)
      RMSE   Rsquared        MAE 
12.3486900  0.3244542  8.2881075 

The tuned PLS model was evaluated on the held-out test set, giving an RMSE of 12.35, an R² of 0.324, and an MAE of 8.29.

Performance drops on the test set relative to the cross-validated estimates: R² falls from about 0.534 to 0.324, and RMSE rises from about 11.53 to 12.35. The model captures real structure in the training data, but that structure generalizes only partly to new compounds.

A likely contributor is the high dimensionality and sparsity of the fingerprint predictors relative to the small sample. With only 32 test compounds, the test estimate itself is also noisy.
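For reference, the three metrics reported by postResample() for a numeric outcome can be written out in a few lines of base R. This is a minimal sketch on made-up values; the function name regression_metrics is hypothetical, and R² is taken here as the squared correlation between predictions and observations.

```r
# Hypothetical re-implementation of the three regression metrics.
regression_metrics <- function(pred, obs) {
  c(
    RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
    Rsquared = cor(pred, obs)^2,            # squared correlation
    MAE      = mean(abs(pred - obs))        # mean absolute error
  )
}

# Made-up observations and predictions.
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 4.5)

metrics <- regression_metrics(pred, obs)
metrics
```

Because R² is a squared correlation, it can stay moderately high even when predictions are systematically biased, which is one reason to read it alongside RMSE and MAE.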

Exercise 6.2 E

Ridge Regression

ridge_grid <- expand.grid(alpha = 0, lambda = seq(0.001, 0.1, length = 20))

ridge_model <- train(
  x = trainX,
  y = trainY,
  method = "glmnet",
  tuneGrid = ridge_grid,
  trControl = ctrl,
  preProcess = c("center", "scale")
)

ridge_model
glmnet 

133 samples
388 predictors

Pre-processing: centered (388), scaled (388) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 118, 118, 121, 120, 120, 120, ... 
Resampling results across tuning parameters:

  lambda       RMSE      Rsquared   MAE    
  0.001000000  11.33669  0.5731857  8.54889
  0.006210526  11.33669  0.5731857  8.54889
  0.011421053  11.33669  0.5731857  8.54889
  0.016631579  11.33669  0.5731857  8.54889
  0.021842105  11.33669  0.5731857  8.54889
  0.027052632  11.33669  0.5731857  8.54889
  0.032263158  11.33669  0.5731857  8.54889
  0.037473684  11.33669  0.5731857  8.54889
  0.042684211  11.33669  0.5731857  8.54889
  0.047894737  11.33669  0.5731857  8.54889
  0.053105263  11.33669  0.5731857  8.54889
  0.058315789  11.33669  0.5731857  8.54889
  0.063526316  11.33669  0.5731857  8.54889
  0.068736842  11.33669  0.5731857  8.54889
  0.073947368  11.33669  0.5731857  8.54889
  0.079157895  11.33669  0.5731857  8.54889
  0.084368421  11.33669  0.5731857  8.54889
  0.089578947  11.33669  0.5731857  8.54889
  0.094789474  11.33669  0.5731857  8.54889
  0.100000000  11.33669  0.5731857  8.54889

Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 0.1.
ridge_preds <- predict(ridge_model, testX)
postResample(ridge_preds, testY)
      RMSE   Rsquared        MAE 
11.0203929  0.3266984  7.6511575 

Lasso Regression

lasso_grid <- expand.grid(alpha = 1, lambda = seq(0.001, 0.1, length = 20))

lasso_model <- train(
  x = trainX,
  y = trainY,
  method = "glmnet",
  tuneGrid = lasso_grid,
  trControl = ctrl,
  preProcess = c("center", "scale")
)

lasso_model
glmnet 

133 samples
388 predictors

Pre-processing: centered (388), scaled (388) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 119, 120, 120, 119, 118, 120, ... 
Resampling results across tuning parameters:

  lambda       RMSE      Rsquared   MAE     
  0.001000000  12.74117  0.4723781  9.176338
  0.006210526  12.74117  0.4723781  9.176338
  0.011421053  12.74117  0.4723781  9.176338
  0.016631579  12.74117  0.4723781  9.176338
  0.021842105  12.74117  0.4723781  9.176338
  0.027052632  12.74117  0.4723781  9.176338
  0.032263158  12.74117  0.4723781  9.176338
  0.037473684  12.74117  0.4723781  9.176338
  0.042684211  12.74117  0.4723781  9.176338
  0.047894737  12.74117  0.4723781  9.176338
  0.053105263  12.74117  0.4723781  9.176338
  0.058315789  12.74117  0.4723781  9.176338
  0.063526316  12.74117  0.4723781  9.176338
  0.068736842  12.74117  0.4723781  9.176338
  0.073947368  12.74117  0.4723781  9.176338
  0.079157895  12.74117  0.4723781  9.176338
  0.084368421  12.74117  0.4723781  9.176338
  0.089578947  12.74117  0.4723781  9.176338
  0.094789474  12.74117  0.4723781  9.176338
  0.100000000  12.73892  0.4724976  9.174490

Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.1.
lasso_preds <- predict(lasso_model, testX)
postResample(lasso_preds, testY)
      RMSE   Rsquared        MAE 
11.6080753  0.4210703  8.2062442 

Ridge and lasso regression were fit for comparison with the PLS model. Both use regularization, which helps with the high dimensionality of this problem. The cross-validated results are identical across the lambda grid for ridge and nearly identical for the lasso, so the grid from 0.001 to 0.1 did not meaningfully tune the penalty, and the selected lambda = 0.1 should not be read as a true optimum.

On the test set, ridge regression achieved an R² of 0.327 and an RMSE of 11.02, a lower error than the PLS model (12.35) at a nearly identical R² (0.324).

The lasso achieved a higher test R² of 0.421, with an RMSE of 11.61 that is higher than ridge but still below PLS.

On the test set, then, both penalized models outperformed the PLS model. In cross-validation the picture is mixed: ridge had the lowest RMSE (11.34), followed by PLS (11.53) and the lasso (12.74).
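The flat tuning tables above deserve a second look. One plausible explanation, which is an assumption and was not verified here, is that glmnet is fit along its own default lambda path and predictions are then taken at the requested values; if every requested value lies below the smallest lambda on that path, the predictions coincide. A wider, log-spaced grid would test this. The names lambda_values, wide_ridge_grid, and wide_lasso_grid, and the range 0.01 to 100, are hypothetical choices.

```r
# Log-spaced penalty values from 0.01 to 100 (the range is an assumption).
lambda_values <- 10^seq(-2, 2, length.out = 30)

# Same structure as ridge_grid and lasso_grid, but over the wider range.
wide_ridge_grid <- expand.grid(alpha = 0, lambda = lambda_values)
wide_lasso_grid <- expand.grid(alpha = 1, lambda = lambda_values)

range(wide_ridge_grid$lambda)
nrow(wide_lasso_grid)
```

Either grid could be passed as tuneGrid in the train() calls above. If the cross-validated RMSE then varies across lambda, the original grid was simply too narrow.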

Exercise 6.2 F

Among the models compared, the lasso had the highest test R² (0.421), while ridge had the lowest test RMSE (11.02 versus 11.61). The two metrics therefore favor different models, and with only 32 test compounds the difference between them is small.

Because the data combine many predictors, few observations, and sparse binary fingerprints, the lasso is a natural fit: its penalty can shrink coefficients exactly to zero and so performs feature selection.

On balance, I would choose the lasso model for permeability prediction, with ridge as a close alternative.
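The feature-selection behavior of the lasso can be illustrated in the simplified case of an orthonormal design, where the lasso solution is a soft-thresholded version of the least-squares coefficients and the ridge solution is a proportional shrinkage. This is a textbook special case and not the algorithm glmnet runs; the coefficient values and the function names are made up for illustration.

```r
# Lasso under an orthonormal design: shrink toward zero, then clip at zero.
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

# Ridge under an orthonormal design: shrink every coefficient proportionally.
ridge_shrink <- function(beta, lambda) {
  beta / (1 + lambda)
}

# Hypothetical least-squares coefficients.
beta_ols <- c(3, -0.5, 0.2, -2)

lasso_coef <- soft_threshold(beta_ols, lambda = 1)
ridge_coef <- ridge_shrink(beta_ols, lambda = 1)

rbind(ols = beta_ols, lasso = lasso_coef, ridge = ridge_coef)
```

The lasso sets the two small coefficients exactly to zero, while ridge keeps all four nonzero. With 388 sparse predictors, that difference determines how many fingerprints the final model actually uses.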

Exercise 6.3

Exercise 6.3 A

library(AppliedPredictiveModeling)

data(ChemicalManufacturingProcess)

str(ChemicalManufacturingProcess)
'data.frame':   176 obs. of  58 variables:
 $ Yield                 : num  38 42.4 42 41.4 42.5 ...
 $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
 $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
 $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
 $ BiologicalMaterial04  : num  12.7 14.7 14.7 14.7 14 ...
 $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
 $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
 $ BiologicalMaterial07  : num  100 100 100 100 100 100 100 100 100 100 ...
 $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
 $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
 $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
 $ BiologicalMaterial11  : num  138 154 154 154 148 ...
 $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
 $ ManufacturingProcess01: num  NA 0 0 0 10.7 12 11.5 12 12 12 ...
 $ ManufacturingProcess02: num  NA 0 0 0 0 0 0 0 0 0 ...
 $ ManufacturingProcess03: num  NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
 $ ManufacturingProcess04: num  NA 917 912 911 918 924 933 929 928 938 ...
 $ ManufacturingProcess05: num  NA 1032 1004 1015 1028 ...
 $ ManufacturingProcess06: num  NA 210 207 213 206 ...
 $ ManufacturingProcess07: num  NA 177 178 177 178 178 177 178 177 177 ...
 $ ManufacturingProcess08: num  NA 178 178 177 178 178 178 178 177 177 ...
 $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
 $ ManufacturingProcess10: num  NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
 $ ManufacturingProcess11: num  NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
 $ ManufacturingProcess12: num  NA 0 0 0 0 0 0 0 0 0 ...
 $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
 $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
 $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
 $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
 $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
 $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
 $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
 $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
 $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
 $ ManufacturingProcess22: num  NA 3 4 5 8 9 1 2 3 4 ...
 $ ManufacturingProcess23: num  NA 0 1 2 4 1 1 2 3 1 ...
 $ ManufacturingProcess24: num  NA 3 4 5 18 1 1 2 3 4 ...
 $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
 $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
 $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
 $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
 $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
 $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
 $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
 $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
 $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
 $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
 $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
 $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
 $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
 $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
 $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
 $ ManufacturingProcess40: num  NA 0.1 0 0 0 0 0 0 0 0 ...
 $ ManufacturingProcess41: num  NA 0.15 0 0 0 0 0 0 0 0 ...
 $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
 $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
 $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
 $ ManufacturingProcess45: num  2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
summary(ChemicalManufacturingProcess)
     Yield       BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
 Min.   :35.25   Min.   :4.580        Min.   :46.87        Min.   :56.97       
 1st Qu.:38.75   1st Qu.:5.978        1st Qu.:52.68        1st Qu.:64.98       
 Median :39.97   Median :6.305        Median :55.09        Median :67.22       
 Mean   :40.18   Mean   :6.411        Mean   :55.69        Mean   :67.70       
 3rd Qu.:41.48   3rd Qu.:6.870        3rd Qu.:58.74        3rd Qu.:70.43       
 Max.   :46.34   Max.   :8.810        Max.   :64.75        Max.   :78.25       
                                                                               
 BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
 Min.   : 9.38        Min.   :13.24        Min.   :40.60       
 1st Qu.:11.24        1st Qu.:17.23        1st Qu.:46.05       
 Median :12.10        Median :18.49        Median :48.46       
 Mean   :12.35        Mean   :18.60        Mean   :48.91       
 3rd Qu.:13.22        3rd Qu.:19.90        3rd Qu.:51.34       
 Max.   :23.09        Max.   :24.85        Max.   :59.38       
                                                               
 BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
 Min.   :100.0        Min.   :15.88        Min.   :11.44       
 1st Qu.:100.0        1st Qu.:17.06        1st Qu.:12.60       
 Median :100.0        Median :17.51        Median :12.84       
 Mean   :100.0        Mean   :17.49        Mean   :12.85       
 3rd Qu.:100.0        3rd Qu.:17.88        3rd Qu.:13.13       
 Max.   :100.8        Max.   :19.14        Max.   :14.08       
                                                               
 BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
 Min.   :1.770        Min.   :135.8        Min.   :18.35       
 1st Qu.:2.460        1st Qu.:143.8        1st Qu.:19.73       
 Median :2.710        Median :146.1        Median :20.12       
 Mean   :2.801        Mean   :147.0        Mean   :20.20       
 3rd Qu.:2.990        3rd Qu.:149.6        3rd Qu.:20.75       
 Max.   :6.870        Max.   :158.7        Max.   :22.21       
                                                               
 ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
 Min.   : 0.00          Min.   : 0.00          Min.   :1.47          
 1st Qu.:10.80          1st Qu.:19.30          1st Qu.:1.53          
 Median :11.40          Median :21.00          Median :1.54          
 Mean   :11.21          Mean   :16.68          Mean   :1.54          
 3rd Qu.:12.15          3rd Qu.:21.50          3rd Qu.:1.55          
 Max.   :14.10          Max.   :22.50          Max.   :1.60          
 NA's   :1              NA's   :3              NA's   :15            
 ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
 Min.   :911.0          Min.   : 923.0         Min.   :203.0         
 1st Qu.:928.0          1st Qu.: 986.8         1st Qu.:205.7         
 Median :934.0          Median : 999.2         Median :206.8         
 Mean   :931.9          Mean   :1001.7         Mean   :207.4         
 3rd Qu.:936.0          3rd Qu.:1008.9         3rd Qu.:208.7         
 Max.   :946.0          Max.   :1175.3         Max.   :227.4         
 NA's   :1              NA's   :1              NA's   :2             
 ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
 Min.   :177.0          Min.   :177.0          Min.   :38.89         
 1st Qu.:177.0          1st Qu.:177.0          1st Qu.:44.89         
 Median :177.0          Median :178.0          Median :45.73         
 Mean   :177.5          Mean   :177.6          Mean   :45.66         
 3rd Qu.:178.0          3rd Qu.:178.0          3rd Qu.:46.52         
 Max.   :178.0          Max.   :178.0          Max.   :49.36         
 NA's   :1              NA's   :1                                    
 ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
 Min.   : 7.500         Min.   : 7.500         Min.   :   0.0        
 1st Qu.: 8.700         1st Qu.: 9.000         1st Qu.:   0.0        
 Median : 9.100         Median : 9.400         Median :   0.0        
 Mean   : 9.179         Mean   : 9.386         Mean   : 857.8        
 3rd Qu.: 9.550         3rd Qu.: 9.900         3rd Qu.:   0.0        
 Max.   :11.600         Max.   :11.500         Max.   :4549.0        
 NA's   :9              NA's   :10             NA's   :1             
 ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
 Min.   :32.10          Min.   :4701           Min.   :5904          
 1st Qu.:33.90          1st Qu.:4828           1st Qu.:6010          
 Median :34.60          Median :4856           Median :6032          
 Mean   :34.51          Mean   :4854           Mean   :6039          
 3rd Qu.:35.20          3rd Qu.:4882           3rd Qu.:6061          
 Max.   :38.60          Max.   :5055           Max.   :6233          
                        NA's   :1                                    
 ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
 Min.   :   0           Min.   :31.30          Min.   :   0          
 1st Qu.:4561           1st Qu.:33.50          1st Qu.:4813          
 Median :4588           Median :34.40          Median :4835          
 Mean   :4566           Mean   :34.34          Mean   :4810          
 3rd Qu.:4619           3rd Qu.:35.10          3rd Qu.:4862          
 Max.   :4852           Max.   :40.00          Max.   :4971          
                                                                     
 ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
 Min.   :5890           Min.   :   0           Min.   :-1.8000       
 1st Qu.:6001           1st Qu.:4553           1st Qu.:-0.6000       
 Median :6022           Median :4582           Median :-0.3000       
 Mean   :6028           Mean   :4556           Mean   :-0.1642       
 3rd Qu.:6050           3rd Qu.:4610           3rd Qu.: 0.0000       
 Max.   :6146           Max.   :4759           Max.   : 3.6000       
                                                                     
 ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
 Min.   : 0.000         Min.   :0.000          Min.   : 0.000        
 1st Qu.: 3.000         1st Qu.:2.000          1st Qu.: 4.000        
 Median : 5.000         Median :3.000          Median : 8.000        
 Mean   : 5.406         Mean   :3.017          Mean   : 8.834        
 3rd Qu.: 8.000         3rd Qu.:4.000          3rd Qu.:14.000        
 Max.   :12.000         Max.   :6.000          Max.   :23.000        
 NA's   :1              NA's   :1              NA's   :1             
 ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
 Min.   :   0           Min.   :   0           Min.   :   0          
 1st Qu.:4832           1st Qu.:6020           1st Qu.:4560          
 Median :4855           Median :6047           Median :4587          
 Mean   :4828           Mean   :6016           Mean   :4563          
 3rd Qu.:4877           3rd Qu.:6070           3rd Qu.:4609          
 Max.   :4990           Max.   :6161           Max.   :4710          
 NA's   :5              NA's   :5              NA's   :5             
 ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
 Min.   : 0.000         Min.   : 0.00          Min.   : 0.000        
 1st Qu.: 0.000         1st Qu.:19.70          1st Qu.: 8.800        
 Median :10.400         Median :19.90          Median : 9.100        
 Mean   : 6.592         Mean   :20.01          Mean   : 9.161        
 3rd Qu.:10.750         3rd Qu.:20.40          3rd Qu.: 9.700        
 Max.   :11.500         Max.   :22.00          Max.   :11.200        
 NA's   :5              NA's   :5              NA's   :5             
 ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
 Min.   : 0.00          Min.   :143.0          Min.   :56.00         
 1st Qu.:70.10          1st Qu.:155.0          1st Qu.:62.00         
 Median :70.80          Median :158.0          Median :64.00         
 Mean   :70.18          Mean   :158.5          Mean   :63.54         
 3rd Qu.:71.40          3rd Qu.:162.0          3rd Qu.:65.00         
 Max.   :72.50          Max.   :173.0          Max.   :70.00         
 NA's   :5                                     NA's   :5             
 ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
 Min.   :2.300          Min.   :463.0          Min.   :0.01700       
 1st Qu.:2.500          1st Qu.:490.0          1st Qu.:0.01900       
 Median :2.500          Median :495.0          Median :0.02000       
 Mean   :2.494          Mean   :495.6          Mean   :0.01957       
 3rd Qu.:2.500          3rd Qu.:501.5          3rd Qu.:0.02000       
 Max.   :2.600          Max.   :522.0          Max.   :0.02200       
 NA's   :5              NA's   :5              NA's   :5             
 ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
 Min.   :0.000          Min.   :0.000          Min.   :0.000         
 1st Qu.:0.700          1st Qu.:2.000          1st Qu.:7.100         
 Median :1.000          Median :3.000          Median :7.200         
 Mean   :1.014          Mean   :2.534          Mean   :6.851         
 3rd Qu.:1.300          3rd Qu.:3.000          3rd Qu.:7.300         
 Max.   :2.300          Max.   :3.000          Max.   :7.500         
                                                                     
 ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
 Min.   :0.00000        Min.   :0.00000        Min.   : 0.00         
 1st Qu.:0.00000        1st Qu.:0.00000        1st Qu.:11.40         
 Median :0.00000        Median :0.00000        Median :11.60         
 Mean   :0.01771        Mean   :0.02371        Mean   :11.21         
 3rd Qu.:0.00000        3rd Qu.:0.00000        3rd Qu.:11.70         
 Max.   :0.10000        Max.   :0.20000        Max.   :12.10         
 NA's   :1              NA's   :1                                    
 ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
 Min.   : 0.0000        Min.   :0.000          Min.   :0.000         
 1st Qu.: 0.6000        1st Qu.:1.800          1st Qu.:2.100         
 Median : 0.8000        Median :1.900          Median :2.200         
 Mean   : 0.9119        Mean   :1.805          Mean   :2.138         
 3rd Qu.: 1.0250        3rd Qu.:1.900          3rd Qu.:2.300         
 Max.   :11.0000        Max.   :2.100          Max.   :2.600         
                                                                     

The ChemicalManufacturingProcess dataset contains 176 observations of 58 variables: the response, Yield, plus 12 biological material predictors and 45 manufacturing process predictors. All variables are numeric, and many of the manufacturing process predictors contain missing values that must be handled before modeling.

Exercise 6.3 B

library(caret)

set.seed(123)

preproc <- preProcess(ChemicalManufacturingProcess, method = "medianImpute")

chem_data <- predict(preproc, ChemicalManufacturingProcess)

colSums(is.na(chem_data))
                 Yield   BiologicalMaterial01   BiologicalMaterial02 
                     0                      0                      0 
  BiologicalMaterial03   BiologicalMaterial04   BiologicalMaterial05 
                     0                      0                      0 
  BiologicalMaterial06   BiologicalMaterial07   BiologicalMaterial08 
                     0                      0                      0 
  BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
                     0                      0                      0 
  BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
                     0                      0                      0 
ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 
                     0                      0                      0 
ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 
                     0                      0                      0 
ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11 
                     0                      0                      0 
ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14 
                     0                      0                      0 
ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17 
                     0                      0                      0 
ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20 
                     0                      0                      0 
ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23 
                     0                      0                      0 
ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 
                     0                      0                      0 
ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 
                     0                      0                      0 
ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32 
                     0                      0                      0 
ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
                     0                      0                      0 
ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38 
                     0                      0                      0 
ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41 
                     0                      0                      0 
ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44 
                     0                      0                      0 
ManufacturingProcess45 
                     0 

The dataset had missing values in 28 of the 45 manufacturing process predictors and none in the biological predictors. These were handled with median imputation via preProcess(), which replaces each missing value with the median of that predictor.

After imputation, no missing values remain in the dataset. Note that the medians here were computed on the full dataset before the train/test split; computing them on the training set alone would avoid leaking test-set information.
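One way to keep the imputation free of test-set information is to learn the medians from the training rows and apply them to both sets. The base R sketch below shows the idea on a toy data frame; the data, the split, and the helper name impute_median are all hypothetical.

```r
# Toy data with missing values (illustrative, not the real dataset).
toy <- data.frame(
  x1 = c(1, NA, 3, 4, 100, NA),
  x2 = c(10, 20, NA, 40, 50, 60)
)

train_rows <- 1:4                 # hypothetical split
train <- toy[train_rows, ]
test  <- toy[-train_rows, ]

# Medians are learned from the training rows only.
train_medians <- vapply(train, median, numeric(1), na.rm = TRUE)

# Replace missing entries in each column with the supplied median.
impute_median <- function(df, medians) {
  for (col in names(medians)) {
    missing <- is.na(df[[col]])
    df[[col]][missing] <- medians[[col]]
  }
  df
}

train_imputed <- impute_median(train, train_medians)
test_imputed  <- impute_median(test, train_medians)
test_imputed
```

With caret, the same effect can be had by calling preProcess() on the training data only and then using predict() on both sets, or by passing preProcess = "medianImpute" to train() so that imputation is repeated inside each resample.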

Exercise 6.3 C

set.seed(123)

trainIndex <- createDataPartition(chem_data$Yield, p = 0.8, list = FALSE)

train_data <- chem_data[trainIndex, ]
test_data  <- chem_data[-trainIndex, ]

ctrl <- trainControl(method = "cv", number = 10)

rf_model <- train(
  Yield ~ .,
  data = train_data,
  method = "rf",
  trControl = ctrl
)

rf_model
Random Forest 

144 samples
 57 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 131, 130, 130, 129, 131, 129, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE      
   2    1.223603  0.6665869  0.9798701
  29    1.133356  0.6614352  0.8937680
  57    1.149647  0.6348743  0.8805167

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 29.

The data were randomly split into a training set (80%, 144 observations) and a test set (20%, 32 observations). A random forest was then fit to the training set, with 10-fold cross-validation used to tune mtry.

RMSE was the selection metric. The lowest cross-validated RMSE, 1.1334, was obtained at mtry = 29, with a corresponding R² of about 0.661.
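Only three mtry values were tried because no tuning grid was supplied. The candidates 2, 29, and 57 are consistent with three evenly spaced values between 2 and the number of predictors, rounded down; this is an inference from the output and not a statement about caret's internals. The sketch below reproduces those values and builds a denser grid; rf_grid and its spacing are hypothetical.

```r
p <- 57          # number of predictors reported in the output above
n_values <- 3    # three candidates appear in the tuning table

# Evenly spaced candidates between 2 and p, rounded down.
mtry_candidates <- floor(seq(2, p, length.out = n_values))
mtry_candidates

# A denser hypothetical grid that could be passed as tuneGrid = rf_grid.
rf_grid <- expand.grid(mtry = seq(5, 55, by = 5))
nrow(rf_grid)
```

Since the RMSE values at mtry = 29 and mtry = 57 are close, a denser grid would show whether the optimum is really near 29 or merely the best of three coarse choices.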

Exercise 6.3 D

Linear Regression

lm_model <- train(
  Yield ~ .,
  data = train_data,
  method = "lm"
)

lm_preds <- predict(lm_model, test_data)

postResample(lm_preds, test_data$Yield)
     RMSE  Rsquared       MAE 
1.6056189 0.3730988 1.3456567 

Random Forest

ctrl <- trainControl(method = "cv", number = 10)

rf_model <- train(
  Yield ~ .,
  data = train_data,
  method = "rf",
  trControl = ctrl,
  importance = TRUE
)

rf_preds <- predict(rf_model, test_data)

postResample(rf_preds, test_data$Yield)
     RMSE  Rsquared       MAE 
1.2693532 0.5382311 0.9677364 

On the test set, the random forest achieved an RMSE of 1.2694, an R² of 0.5382, and an MAE of 0.9677. The linear regression baseline did noticeably worse on the same data, with an RMSE of 1.6056 and an R² of 0.3731.

Compared with the cross-validated estimates at mtry = 29 from part C (RMSE of about 1.1334 and R² of about 0.6614), test performance is somewhat lower: RMSE rises by roughly 12% and R² falls by about 0.12.

This gap suggests mild overfitting, in which the model fits some features of the training data that do not generalize. Even so, the test performance remains reasonable, so the model generalizes acceptably.
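The size of the gap between the resampled and test estimates can be computed directly from the numbers reported above. The sketch below uses the mtry = 29 row of the tuning table and the postResample() outputs; the variable names are hypothetical.

```r
# Cross-validated estimates (mtry = 29 row) and test-set results from above.
cv_rf   <- c(RMSE = 1.133356,  Rsquared = 0.6614352)
test_rf <- c(RMSE = 1.2693532, Rsquared = 0.5382311)
test_lm <- c(RMSE = 1.6056189, Rsquared = 0.3730988)

# Relative increase in RMSE and absolute drop in R-squared, CV to test.
rmse_increase <- (test_rf[["RMSE"]] - cv_rf[["RMSE"]]) / cv_rf[["RMSE"]]
r2_drop       <- cv_rf[["Rsquared"]] - test_rf[["Rsquared"]]

# How much larger the linear model's test RMSE is than the random forest's.
lm_vs_rf <- test_lm[["RMSE"]] / test_rf[["RMSE"]]

round(c(rmse_increase_pct = 100 * rmse_increase,
        r2_drop = r2_drop,
        lm_vs_rf_ratio = lm_vs_rf), 3)
```

With a test set of 32 observations, a difference of this size is within what sampling variation alone could produce, so it is best read as a mild warning and not as proof of overfitting.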

Exercise 6.3 E

varImp(rf_model)
rf variable importance

  only 20 most important variables shown (out of 57)

                       Overall
ManufacturingProcess32  100.00
BiologicalMaterial06     42.78
ManufacturingProcess17   40.30
BiologicalMaterial03     39.99
BiologicalMaterial12     39.28
ManufacturingProcess31   38.70
BiologicalMaterial02     34.40
ManufacturingProcess36   33.65
BiologicalMaterial11     32.01
ManufacturingProcess13   30.17
ManufacturingProcess27   29.89
ManufacturingProcess39   29.01
ManufacturingProcess11   28.05
ManufacturingProcess43   27.35
ManufacturingProcess09   26.87
BiologicalMaterial04     26.43
ManufacturingProcess30   25.23
ManufacturingProcess20   23.03
ManufacturingProcess28   22.80
ManufacturingProcess01   22.64

According to the random forest variable importance, the most important predictor by a wide margin is ManufacturingProcess32 (scaled importance 100). The next-ranked predictors are a mix of biological and process variables, including BiologicalMaterial06 (42.78), ManufacturingProcess17 (40.30), and BiologicalMaterial03 (39.99).

Process variables dominate the list by count: 14 of the top 20 are manufacturing process predictors and 6 are biological material predictors. This partly reflects the fact that the dataset has 45 process predictors and only 12 biological ones; three of the top five predictors are biological.

These results suggest that adjusting process parameters, starting with ManufacturingProcess32, is the most promising route to increasing yield. Variable importance measures association and not causation, so any such change would need to be confirmed experimentally.
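The balance between the two predictor groups in the top 20 can be tallied from the variable names in the importance table. The sketch below copies those names from the output above; the variable names top20 and group are hypothetical.

```r
# Top 20 predictors, in rank order, copied from the varImp() output.
top20 <- c(
  "ManufacturingProcess32", "BiologicalMaterial06", "ManufacturingProcess17",
  "BiologicalMaterial03", "BiologicalMaterial12", "ManufacturingProcess31",
  "BiologicalMaterial02", "ManufacturingProcess36", "BiologicalMaterial11",
  "ManufacturingProcess13", "ManufacturingProcess27", "ManufacturingProcess39",
  "ManufacturingProcess11", "ManufacturingProcess43", "ManufacturingProcess09",
  "BiologicalMaterial04", "ManufacturingProcess30", "ManufacturingProcess20",
  "ManufacturingProcess28", "ManufacturingProcess01"
)

# Classify each predictor by its name prefix.
group <- ifelse(grepl("^ManufacturingProcess", top20), "Process", "Biological")
counts <- table(group)
counts

# Share of each group's predictors that reach the top 20
# (12 biological and 45 process predictors in the dataset).
share <- c(Biological = counts[["Biological"]] / 12,
           Process    = counts[["Process"]] / 45)
round(share, 2)
```

By count the process predictors lead, but as a share of each group the biological predictors are better represented, so both views are worth keeping in mind when interpreting the ranking.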

Exercise 6.3 F

featurePlot(train_data[, c("ManufacturingProcess32",
                          "BiologicalMaterial06",
                          "ManufacturingProcess17")],
            train_data$Yield,
            plot = "scatter")

The scatter plots relate the top predictors to yield. Manufacturing process predictors such as ManufacturingProcess32 and ManufacturingProcess17 show some relationship with yield, which indicates that changes in manufacturing conditions are associated with changes in the final yield.

Biological predictors such as BiologicalMaterial06 are also related to yield, but their effect appears weaker than that of the process predictors.

This information can guide future production runs: once the predictors with the greatest influence on yield are identified, effort can be concentrated on those variables. In particular, tuning the most influential manufacturing process settings could lead to higher yields.