#Data 624: HW 7 Kuhn and Johnson - problems 6.2 and 6.3.

#6.2: Developing a model to predict permeability

##6.2(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

###Exploring the data

Dimensions - Fingerprints

dim(fingerprints)
## [1]  165 1107

First few values - Permeability

head(permeability)
##   permeability
## 1       12.520
## 2        1.120
## 3       19.405
## 4        1.730
## 5        1.680
## 6        0.510

##6.2(b) How many predictors are left for modeling?

###Loading library

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice

###Identifying predictors with near-zero variance

near_zero <- nearZeroVar(fingerprints)

###Checking how many predictors there are

ncol(fingerprints)
## [1] 1107

###Checking how many predictors there are with near zero variance

length(near_zero)
## [1] 719

###Filtering to create new filtered fingerprints data set

fingerprints2 <- fingerprints[, -near_zero]

###Checking how many predictors left after filtering

ncol(fingerprints2)
## [1] 388

After filtering out the near-zero-variance predictors, 388 of the original 1,107 predictors are left for modeling (719 were removed).
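
To make the filter less of a black box, here is a minimal base-R sketch of the rule that nearZeroVar applies with its default cutoffs, as I understand them (freqCut = 95/5, uniqueCut = 10): a predictor is flagged when it is constant, or when its most common value is far more frequent than its second most common value and it has few distinct values. The function flag_near_zero and the toy matrix are made up for illustration and are not part of caret.

```r
# Simplified near-zero-variance check (illustrative; not caret's actual code).
flag_near_zero <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- apply(x, 2, function(col) {
    counts <- sort(table(col), decreasing = TRUE)
    if (length(counts) == 1) return(TRUE)              # constant column
    freq_ratio <- counts[1] / counts[2]                # most common vs. runner-up
    pct_unique <- 100 * length(counts) / length(col)   # % of distinct values
    freq_ratio > freq_cut && pct_unique <= unique_cut
  })
  which(unname(flagged))
}

# Toy binary "fingerprints": 100 compounds, 3 predictors.
toy <- cbind(
  rare     = c(rep(0, 98), 1, 1),   # 98 zeros vs. 2 ones
  balanced = rep(c(0, 1), 50),      # 50 zeros vs. 50 ones
  constant = rep(0, 100)            # never varies
)

flag_near_zero(toy)   # columns 1 (rare) and 3 (constant) are flagged
```

Dropping the flagged columns with toy[, -flag_near_zero(toy)] mirrors the fingerprints[, -near_zero] step above.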

##6.2(c) How many latent variables are optimal and what is the corresponding resampled estimate of R2?

###Splitting data into training and test set

set.seed(100)
train_data <- createDataPartition(permeability, p = 0.80, list = FALSE)
head(train_data)
##      Resample1
## [1,]         2
## [2,]         3
## [3,]         4
## [4,]         6
## [5,]         7
## [6,]         8
X_train <- fingerprints2[train_data, ]
X_test <- fingerprints2[-train_data, ]
Y_train <- permeability[train_data]
Y_test <- permeability[-train_data]

###Setting up 10-fold cross-validation

ctrl <- trainControl(method = "cv", number = 10)
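
The trainControl call above asks for 10-fold cross-validation. As a rough sketch of what that means, the base-R code below assigns each training compound to one of 10 folds and computes how many compounds are left to fit on when each fold is held out. It uses n = 133, the training-set size that train() reports for this split. This is plain random assignment; as far as I know caret also balances the folds on the outcome, so its fold sizes can differ slightly from these.

```r
set.seed(100)
n <- 133    # training-set size reported by train()
k <- 10     # number of folds

# Give every row a fold label 1..k, then shuffle the labels.
fold_id <- sample(rep(seq_len(k), length.out = n))
fold_sizes <- table(fold_id)

# Holding out one fold at a time leaves this many rows for fitting.
train_sizes <- n - as.integer(fold_sizes)
train_sizes   # each value is 119 or 120
```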

###Tuning a PLS (Partial Least Squares) model

set.seed(100)
pls_tune <- train(X_train, Y_train,
                  method = "pls", 
                  tuneLength = 20,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
pls_tune
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     12.85492  0.3362303   9.721289
##    2     11.82217  0.4436244   8.092500
##    3     11.71991  0.4718842   8.673662
##    4     11.56038  0.4864682   8.683859
##    5     11.40392  0.4845769   8.561044
##    6     11.08213  0.4956908   8.266156
##    7     11.21951  0.4861156   8.526250
##    8     11.25938  0.4897819   8.413176
##    9     11.37410  0.4889105   8.555701
##   10     11.55558  0.4772198   8.541903
##   11     11.77598  0.4596393   8.781978
##   12     11.90984  0.4588892   8.970828
##   13     12.38027  0.4339820   9.260477
##   14     12.59802  0.4194958   9.689526
##   15     12.94009  0.4009358   9.912681
##   16     13.27633  0.3805674  10.162833
##   17     13.48894  0.3741538  10.413769
##   18     13.60309  0.3690599  10.585876
##   19     13.60980  0.3653960  10.578503
##   20     13.66044  0.3613265  10.627887
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
plot(pls_tune)

The optimal number of latent variables is 6, and the corresponding resampled (10-fold cross-validation) estimate of R2 is 0.4956908, or about 0.50.
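
The selection rule stated in the printout (smallest RMSE wins) can be reproduced by hand. The sketch below copies the rows for ncomp = 4 to 8 from the output above into a data frame; the name cv_results is my own. With the fitted object available, pls_tune$results and pls_tune$bestTune should hold the same information.

```r
# Resampled results for ncomp = 4..8, copied from the train() printout.
cv_results <- data.frame(
  ncomp    = 4:8,
  RMSE     = c(11.56038, 11.40392, 11.08213, 11.21951, 11.25938),
  Rsquared = c(0.4864682, 0.4845769, 0.4956908, 0.4861156, 0.4897819)
)

# Pick the row with the smallest cross-validated RMSE.
best <- cv_results[which.min(cv_results$RMSE), ]
best   # the ncomp = 6 row
```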

##6.2(d) “Predict the response for the test set. What is the test set estimate of R2?”

pls_prediction <- predict(pls_tune, X_test)
postResample(pred = pls_prediction, obs = Y_test)
##      RMSE  Rsquared       MAE 
## 11.773628  0.471816  8.672468

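For comparison, the resampled results at ncomp = 6 were RMSE 11.08213, R2 0.4956908, and MAE 8.266156.

The resampled and test-set estimates are close to each other. The test-set estimate of R2 is 0.47 (versus about 0.50 from resampling) and the test-set RMSE is 11.77 (versus 11.08), so I see no sign of serious overfitting.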
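
For reference, the three numbers returned by postResample can be computed by hand. The sketch below assumes, as I understand caret's default, that Rsquared is the squared correlation between observed and predicted values. The helper names and the toy vectors are mine.

```r
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
mae  <- function(pred, obs) mean(abs(pred - obs))
r2   <- function(pred, obs) cor(pred, obs)^2   # squared correlation

obs  <- c(2, 4, 6, 8)
pred <- obs + 1   # every prediction is off by exactly one unit

c(RMSE = rmse(pred, obs), Rsquared = r2(pred, obs), MAE = mae(pred, obs))
# RMSE = 1, Rsquared = 1, MAE = 1
```

The toy case shows why it is worth reading RMSE and Rsquared together: a squared correlation can be perfect even when every prediction is off by a constant.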

##6.2(e) Try building other models discussed in this chapter. Do any have better predictive performance?

###PCR Model

set.seed(100)
pcr_tune <- train(X_train, Y_train,
                  method = "pcr", 
                  tuneLength = 20,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
pcr_tune
## Principal Component Analysis 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     14.69636  0.1110173  11.495094
##    2     14.69643  0.1107412  11.545796
##    3     13.01748  0.3397017   9.946278
##    4     12.72074  0.3437014   9.441999
##    5     12.39655  0.3706625   8.972824
##    6     12.35651  0.3718052   8.944906
##    7     12.34481  0.3715405   8.857335
##    8     12.06702  0.4175700   8.418313
##    9     12.07646  0.4285779   8.537146
##   10     12.05970  0.4396163   8.716194
##   11     12.09726  0.4384438   8.748660
##   12     12.24633  0.4225441   8.758359
##   13     12.28194  0.4149305   8.778917
##   14     12.27708  0.4148031   8.796068
##   15     12.27271  0.4150780   8.817452
##   16     12.24219  0.4252113   8.806884
##   17     12.12211  0.4293419   8.706982
##   18     12.15276  0.4423403   8.738848
##   19     12.07120  0.4392150   8.726134
##   20     11.66452  0.4631430   8.472242
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 20.
plot(pcr_tune)

pcr_prediction <- predict(pcr_tune, X_test)
postResample(pred = pcr_prediction, obs = Y_test)
##       RMSE   Rsquared        MAE 
## 12.4227765  0.4114892  9.4308226

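For comparison, the resampled results at ncomp = 20 were RMSE 11.66452, R2 0.4631430, and MAE 8.472242.

This model did not do as well as PLS: its test-set R2 is 0.41 versus 0.47. It also needed 20 components, the largest value in the tuning grid, when PLS only needed 6. PCR builds its components from the predictors alone, without keeping y in mind like PLS does, so the leading components capture predictor variance that is not necessarily relevant to permeability.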
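
The difference between the two methods can be seen on a tiny made-up example. In the sketch below, x1 has a lot of variance but is unrelated to y, while x2 has little variance and fully determines y. The predictors are deliberately left unscaled (unlike the models above) so the variance difference stays visible. For PLS, I only compute the first weight vector, which is proportional to t(X) %*% y on centered data; this is not a full PLS fit.

```r
x1 <- c(-10, -10, 10, 10, -10, -10, 10, 10)   # high variance, unrelated to y
x2 <- c(-1, 1, -1, 1, -1, 1, -1, 1)           # low variance, drives y
X  <- cbind(x1, x2)
y  <- 3 * x2 + 1

# PCR step: principal components are computed from X alone.
pc_scores <- prcomp(X, center = TRUE, scale. = FALSE)$x
pc_cor <- abs(cor(pc_scores, y))
pc_cor   # PC1 is uncorrelated with y; PC2 carries all of the signal

# PLS step: the first weight vector uses y directly.
Xc <- scale(X, center = TRUE, scale = FALSE)
w1 <- crossprod(Xc, y - mean(y))
w1 <- w1 / sqrt(sum(w1^2))
w1       # all of the weight falls on x2
```

Here PCR would need both components before it sees any signal, while the first PLS direction already points at the predictor that matters.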

Checking whether a longer tuning grid (tuneLength = 50) pushes the optimal number of components even higher:

set.seed(100)
pcr_tune2 <- train(X_train, Y_train,
                  method = "pcr", 
                  tuneLength = 50,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
pcr_tune2
## Principal Component Analysis 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     14.69636  0.1110173  11.495094
##    2     14.69643  0.1107412  11.545796
##    3     13.01748  0.3397017   9.946278
##    4     12.72074  0.3437014   9.441999
##    5     12.39655  0.3706625   8.972824
##    6     12.35651  0.3718052   8.944906
##    7     12.34481  0.3715405   8.857335
##    8     12.06702  0.4175700   8.418313
##    9     12.07646  0.4285779   8.537146
##   10     12.05970  0.4396163   8.716194
##   11     12.09726  0.4384438   8.748660
##   12     12.24633  0.4225441   8.758359
##   13     12.28194  0.4149305   8.778917
##   14     12.27708  0.4148031   8.796068
##   15     12.27271  0.4150780   8.817452
##   16     12.24219  0.4252113   8.806884
##   17     12.12211  0.4293419   8.706982
##   18     12.15276  0.4423403   8.738848
##   19     12.07120  0.4392150   8.726134
##   20     11.66452  0.4631430   8.472242
##   21     11.83560  0.4600169   8.846798
##   22     11.89061  0.4590446   8.858161
##   23     11.90705  0.4502853   8.795143
##   24     11.89356  0.4520845   8.782471
##   25     11.76709  0.4646409   8.647535
##   26     11.95438  0.4516535   8.833433
##   27     11.50474  0.4790646   8.472424
##   28     11.36204  0.4846198   8.494791
##   29     11.38455  0.4852023   8.508798
##   30     11.54396  0.4763228   8.613945
##   31     11.55421  0.4684521   8.699147
##   32     11.15536  0.4940905   8.420403
##   33     11.04658  0.5043588   8.414902
##   34     10.96711  0.5085231   8.370172
##   35     10.97881  0.5096122   8.386033
##   36     11.22650  0.4943005   8.552872
##   37     11.19429  0.5030008   8.460189
##   38     11.20168  0.5044827   8.407628
##   39     10.89835  0.5213800   8.110102
##   40     10.84547  0.5248868   8.069804
##   41     11.00378  0.5071243   8.275885
##   42     11.24342  0.4924778   8.552700
##   43     11.27461  0.4900943   8.616688
##   44     11.34000  0.4904779   8.671189
##   45     11.49380  0.4855984   8.663002
##   46     11.58815  0.4843147   8.679616
##   47     11.75285  0.4740361   8.758917
##   48     11.70395  0.4753790   8.706313
##   49     11.74894  0.4720810   8.701926
##   50     11.76481  0.4687003   8.851439
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 40.

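It does go up: with tuneLength = 50 the optimal number of components is 40 (resampled RMSE 10.85, R2 0.52). PCR needs 40 components to reach a resampled RMSE slightly lower than what PLS reaches with 6, so using the PLS model is still the better choice in this case because it keeps y in mind when building its components, unlike PCR.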

###Ridge Model

ridgeGrid <- expand.grid(lambda = seq(0.001, 1, length = 20))
set.seed(100)
ridge_tune <- train(X_train, Y_train,
                  method = "ridge", 
                  tuneGrid = ridgeGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
ridge_tune
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda      RMSE       Rsquared   MAE       
##   0.00100000  421.85554  0.1085090  272.873407
##   0.05357895   12.62825  0.4189466    9.454504
##   0.10615789   12.05244  0.4600003    8.985801
##   0.15873684   11.88822  0.4790265    8.890134
##   0.21131579   11.87008  0.4903643    8.904842
##   0.26389474   11.93611  0.4979491    8.972070
##   0.31647368   12.05556  0.5031442    9.058626
##   0.36905263   12.21199  0.5070957    9.169629
##   0.42163158   12.39878  0.5101101    9.351295
##   0.47421053   12.61096  0.5125018    9.562080
##   0.52678947   12.84595  0.5144065    9.787019
##   0.57936842   13.09832  0.5159385   10.014813
##   0.63194737   13.36932  0.5171686   10.247180
##   0.68452632   13.65409  0.5181938   10.488730
##   0.73710526   13.95471  0.5190714   10.738073
##   0.78968421   14.26324  0.5196712   10.982823
##   0.84226316   14.58465  0.5202015   11.232557
##   0.89484211   14.91564  0.5206326   11.505148
##   0.94742105   15.25572  0.5209462   11.783635
##   1.00000000   15.60323  0.5211916   12.068218
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.2113158.
plot(ridge_tune)

ridge_prediction <- predict(ridge_tune, X_test)
postResample(pred = ridge_prediction, obs = Y_test)
##       RMSE   Rsquared        MAE 
## 12.0135032  0.5235396  8.5961594

###Lasso Model

glmnetGrid <- expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 20))
set.seed(100)
lasso_tune <- train(X_train, Y_train,
                  method = "glmnet", 
                  tuneGrid = glmnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
lasso_tune
## glmnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda      RMSE      Rsquared   MAE     
##   0.00100000  12.21709  0.4299361  8.942562
##   0.05357895  12.21709  0.4299361  8.942562
##   0.10615789  12.06935  0.4378085  8.815047
##   0.15873684  11.61041  0.4656562  8.390984
##   0.21131579  11.39613  0.4800470  8.142893
##   0.26389474  11.27804  0.4875442  8.064225
##   0.31647368  11.24578  0.4876755  8.014870
##   0.36905263  11.19600  0.4908586  7.989136
##   0.42163158  11.17822  0.4919982  7.965935
##   0.47421053  11.14488  0.4950772  7.888418
##   0.52678947  11.12161  0.4973859  7.830690
##   0.57936842  11.11436  0.4981588  7.795229
##   0.63194737  11.14309  0.4968463  7.788788
##   0.68452632  11.18116  0.4952827  7.790848
##   0.73710526  11.20972  0.4952494  7.796949
##   0.78968421  11.24824  0.4946068  7.804676
##   0.84226316  11.29505  0.4930287  7.822094
##   0.89484211  11.34232  0.4913345  7.841994
##   0.94742105  11.38565  0.4903604  7.861293
##   1.00000000  11.42119  0.4902964  7.863809
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.5793684.
plot(lasso_tune)

lasso_prediction <- predict(lasso_tune, X_test)
postResample(pred = lasso_prediction, obs = Y_test)
##       RMSE   Rsquared        MAE 
## 11.8572660  0.4637412  8.4616371

###Elastic Net Model

enetGrid <- expand.grid(alpha = seq(0.1, 0.80, length = 5), lambda = seq(0.001, 1, length = 20))
set.seed(100)
enet_tune <- train(X_train, Y_train,
                  method = "glmnet", 
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
enet_tune
## glmnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda      RMSE      Rsquared   MAE     
##   0.100  0.00100000  11.43210  0.4722095  8.199503
##   0.100  0.05357895  11.43210  0.4722095  8.199503
##   0.100  0.10615789  11.43210  0.4722095  8.199503
##   0.100  0.15873684  11.43210  0.4722095  8.199503
##   0.100  0.21131579  11.43210  0.4722095  8.199503
##   0.100  0.26389474  11.43210  0.4722095  8.199503
##   0.100  0.31647368  11.43210  0.4722095  8.199503
##   0.100  0.36905263  11.43210  0.4722095  8.199503
##   0.100  0.42163158  11.43210  0.4722095  8.199503
##   0.100  0.47421053  11.43210  0.4722095  8.199503
##   0.100  0.52678947  11.43210  0.4722095  8.199503
##   0.100  0.57936842  11.43210  0.4722095  8.199503
##   0.100  0.63194737  11.43210  0.4722095  8.199503
##   0.100  0.68452632  11.43210  0.4722095  8.199503
##   0.100  0.73710526  11.43210  0.4722095  8.199503
##   0.100  0.78968421  11.43210  0.4722095  8.199503
##   0.100  0.84226316  11.43210  0.4722095  8.199503
##   0.100  0.89484211  11.42824  0.4724875  8.199392
##   0.100  0.94742105  11.41643  0.4733132  8.193682
##   0.100  1.00000000  11.39075  0.4746886  8.164284
##   0.275  0.00100000  11.74481  0.4533784  8.514164
##   0.275  0.05357895  11.74481  0.4533784  8.514164
##   0.275  0.10615789  11.74481  0.4533784  8.514164
##   0.275  0.15873684  11.74481  0.4533784  8.514164
##   0.275  0.21131579  11.74481  0.4533784  8.514164
##   0.275  0.26389474  11.74481  0.4533784  8.514164
##   0.275  0.31647368  11.74267  0.4534946  8.512868
##   0.275  0.36905263  11.67235  0.4573581  8.454410
##   0.275  0.42163158  11.51670  0.4675921  8.289002
##   0.275  0.47421053  11.38890  0.4763297  8.145190
##   0.275  0.52678947  11.32318  0.4812475  8.053562
##   0.275  0.57936842  11.29304  0.4837829  7.990588
##   0.275  0.63194737  11.26683  0.4859662  7.946107
##   0.275  0.68452632  11.24269  0.4875761  7.918856
##   0.275  0.73710526  11.22379  0.4886842  7.905719
##   0.275  0.78968421  11.19919  0.4903021  7.886597
##   0.275  0.84226316  11.17045  0.4923623  7.866079
##   0.275  0.89484211  11.14064  0.4946254  7.844978
##   0.275  0.94742105  11.11162  0.4967938  7.829251
##   0.275  1.00000000  11.08585  0.4986491  7.820341
##   0.450  0.00100000  11.84661  0.4474280  8.639628
##   0.450  0.05357895  11.84661  0.4474280  8.639628
##   0.450  0.10615789  11.84661  0.4474280  8.639628
##   0.450  0.15873684  11.84661  0.4474280  8.639628
##   0.450  0.21131579  11.82330  0.4487202  8.618697
##   0.450  0.26389474  11.57111  0.4637140  8.363749
##   0.450  0.31647368  11.42322  0.4742652  8.197355
##   0.450  0.36905263  11.34498  0.4809515  8.066326
##   0.450  0.42163158  11.29052  0.4845973  8.002124
##   0.450  0.47421053  11.24403  0.4877677  7.973511
##   0.450  0.52678947  11.20008  0.4907607  7.940339
##   0.450  0.57936842  11.16419  0.4935389  7.910367
##   0.450  0.63194737  11.13170  0.4953902  7.880784
##   0.450  0.68452632  11.10414  0.4972038  7.864232
##   0.450  0.73710526  11.07184  0.4995633  7.841399
##   0.450  0.78968421  11.04176  0.5017072  7.811803
##   0.450  0.84226316  11.02093  0.5032174  7.794738
##   0.450  0.89484211  11.00914  0.5045763  7.783117
##   0.450  0.94742105  11.00375  0.5054265  7.764687
##   0.450  1.00000000  11.00150  0.5059166  7.738962
##   0.625  0.00100000  11.92042  0.4434555  8.725955
##   0.625  0.05357895  11.92042  0.4434555  8.725955
##   0.625  0.10615789  11.92042  0.4434555  8.725955
##   0.625  0.15873684  11.87656  0.4456902  8.691825
##   0.625  0.21131579  11.52602  0.4673424  8.325183
##   0.625  0.26389474  11.38942  0.4779188  8.170552
##   0.625  0.31647368  11.29412  0.4846900  8.050341
##   0.625  0.36905263  11.22700  0.4892367  7.990826
##   0.625  0.42163158  11.18251  0.4925249  7.955462
##   0.625  0.47421053  11.16303  0.4929859  7.928071
##   0.625  0.52678947  11.12698  0.4953648  7.904856
##   0.625  0.57936842  11.08325  0.4986268  7.871817
##   0.625  0.63194737  11.05813  0.5007434  7.847826
##   0.625  0.68452632  11.04616  0.5021441  7.816898
##   0.625  0.73710526  11.03357  0.5037805  7.769634
##   0.625  0.78968421  11.02607  0.5051979  7.733287
##   0.625  0.84226316  11.02108  0.5060274  7.709230
##   0.625  0.89484211  11.01927  0.5064586  7.693913
##   0.625  0.94742105  11.02224  0.5069367  7.676267
##   0.625  1.00000000  11.03757  0.5066237  7.667829
##   0.800  0.00100000  11.98373  0.4402705  8.781762
##   0.800  0.05357895  11.98373  0.4402705  8.781762
##   0.800  0.10615789  11.98373  0.4402705  8.781762
##   0.800  0.15873684  11.60605  0.4627230  8.430325
##   0.800  0.21131579  11.40537  0.4769057  8.205472
##   0.800  0.26389474  11.28280  0.4855796  8.059906
##   0.800  0.31647368  11.20558  0.4915130  7.997411
##   0.800  0.36905263  11.19535  0.4913605  7.983860
##   0.800  0.42163158  11.15766  0.4934563  7.946361
##   0.800  0.47421053  11.11693  0.4966345  7.924234
##   0.800  0.52678947  11.09716  0.4986381  7.888670
##   0.800  0.57936842  11.06768  0.5014476  7.806330
##   0.800  0.63194737  11.04342  0.5040800  7.741690
##   0.800  0.68452632  11.03275  0.5052358  7.712522
##   0.800  0.73710526  11.03241  0.5057306  7.690563
##   0.800  0.78968421  11.05545  0.5049038  7.683817
##   0.800  0.84226316  11.08710  0.5037192  7.689270
##   0.800  0.89484211  11.11201  0.5034363  7.697575
##   0.800  0.94742105  11.13789  0.5033287  7.702531
##   0.800  1.00000000  11.17337  0.5026044  7.707972
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.45 and lambda = 1.
plot(enet_tune)

enet_prediction <- predict(enet_tune, X_test)
postResample(pred = enet_prediction, obs = Y_test)
##       RMSE   Rsquared        MAE 
## 11.7079433  0.4763978  8.2892487
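
Collecting the test-set results reported above for the five models into one table makes them easier to compare. The values are copied from the postResample outputs; the data frame name test_results is my own. With the prediction objects still in memory, the same table could presumably be built by calling postResample on each one and binding the rows.

```r
# Test-set metrics copied from the postResample() outputs above.
test_results <- data.frame(
  model    = c("PLS", "PCR", "Ridge", "Lasso", "Elastic net"),
  RMSE     = c(11.773628, 12.4227765, 12.0135032, 11.8572660, 11.7079433),
  Rsquared = c(0.471816, 0.4114892, 0.5235396, 0.4637412, 0.4763978),
  MAE      = c(8.672468, 9.4308226, 8.5961594, 8.4616371, 8.2892487)
)

# Sort from lowest to highest test RMSE.
test_results[order(test_results$RMSE), ]

best_rmse <- test_results$model[which.min(test_results$RMSE)]      # "Elastic net"
best_r2   <- test_results$model[which.max(test_results$Rsquared)]  # "Ridge"
```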

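Looking at the test-set Rsquared values, it looks like the Ridge model did the best as it has the highest value, 0.52, meaning its predictions account for about 52% of the variance in test-set permeability. The Ridge model keeps all of its predictors, so it doing the best suggests that for this dataset it is more effective to keep the predictors rather than eliminate them. By test-set RMSE, however, the elastic net is lowest (11.71) and Ridge (12.01) is higher than PLS (11.77), so the ranking depends on the metric. Overall, the models performed very similarly.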
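
The point about Ridge keeping its predictors while the lasso can eliminate them is easiest to see in a simplified setting. The sketch below assumes orthonormal predictors, with ridge minimizing ||y - Xb||^2 + lambda * ||b||^2 and the lasso minimizing (1/2) * ||y - Xb||^2 + lambda * ||b||_1. Under those assumptions each penalized coefficient has a closed form in terms of the least-squares coefficient. The coefficient values are made up, and lambda here is not on the same scale as the tuning grids used above.

```r
# Closed-form shrinkage under the orthonormal-predictor assumption.
ridge_shrink <- function(b, lambda) b / (1 + lambda)
lasso_shrink <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

b_ols  <- c(3, -0.4, 0.2, -2)   # hypothetical least-squares coefficients
lambda <- 0.5

b_ridge <- ridge_shrink(b_ols, lambda)
b_lasso <- lasso_shrink(b_ols, lambda)

b_ridge   # every coefficient shrinks but stays nonzero
b_lasso   # the two small coefficients are set exactly to zero
```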

##6.2(f) Would you recommend any of your models to replace the permeability laboratory experiment?

I would not recommend any of these models. They explain only about half of the variance in permeability, and in this context that is not good enough to replace lab experiments. A model like this might work as a screening tool, but I would not trust its predictions in place of actual measurements.

#6.3: The relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield

##6.3(a): Start R and use these commands to load the data

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176  58
str(ChemicalManufacturingProcess)
## 'data.frame':    176 obs. of  58 variables:
##  $ Yield                 : num  38 42.4 42 41.4 42.5 ...
##  $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
##  $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
##  $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
##  $ BiologicalMaterial04  : num  12.7 14.7 14.7 14.7 14 ...
##  $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
##  $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
##  $ BiologicalMaterial07  : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
##  $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
##  $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
##  $ BiologicalMaterial11  : num  138 154 154 154 148 ...
##  $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
##  $ ManufacturingProcess01: num  NA 0 0 0 10.7 12 11.5 12 12 12 ...
##  $ ManufacturingProcess02: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess03: num  NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
##  $ ManufacturingProcess04: num  NA 917 912 911 918 924 933 929 928 938 ...
##  $ ManufacturingProcess05: num  NA 1032 1004 1015 1028 ...
##  $ ManufacturingProcess06: num  NA 210 207 213 206 ...
##  $ ManufacturingProcess07: num  NA 177 178 177 178 178 177 178 177 177 ...
##  $ ManufacturingProcess08: num  NA 178 178 177 178 178 178 178 177 177 ...
##  $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
##  $ ManufacturingProcess10: num  NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
##  $ ManufacturingProcess11: num  NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
##  $ ManufacturingProcess12: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
##  $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
##  $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
##  $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
##  $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
##  $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
##  $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
##  $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
##  $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
##  $ ManufacturingProcess22: num  NA 3 4 5 8 9 1 2 3 4 ...
##  $ ManufacturingProcess23: num  NA 0 1 2 4 1 1 2 3 1 ...
##  $ ManufacturingProcess24: num  NA 3 4 5 18 1 1 2 3 4 ...
##  $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
##  $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
##  $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
##  $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
##  $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
##  $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
##  $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
##  $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
##  $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
##  $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
##  $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
##  $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
##  $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
##  $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
##  $ ManufacturingProcess40: num  NA 0.1 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess41: num  NA 0.15 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
##  $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
##  $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
##  $ ManufacturingProcess45: num  2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
colSums(is.na(ChemicalManufacturingProcess))
##                  Yield   BiologicalMaterial01   BiologicalMaterial02 
##                      0                      0                      0 
##   BiologicalMaterial03   BiologicalMaterial04   BiologicalMaterial05 
##                      0                      0                      0 
##   BiologicalMaterial06   BiologicalMaterial07   BiologicalMaterial08 
##                      0                      0                      0 
##   BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
##                      0                      0                      0 
##   BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
##                      0                      1                      3 
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 
##                     15                      1                      1 
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 
##                      2                      1                      1 
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11 
##                      0                      9                     10 
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14 
##                      1                      0                      1 
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17 
##                      0                      0                      0 
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20 
##                      0                      0                      0 
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23 
##                      0                      1                      1 
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 
##                      1                      5                      5 
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 
##                      5                      5                      5 
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32 
##                      5                      5                      0 
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
##                      5                      5                      5 
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38 
##                      5                      0                      0 
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41 
##                      0                      1                      1 
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44 
##                      0                      0                      0 
## ManufacturingProcess45 
##                      0

##6.3(b): A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values

predictors_only <- ChemicalManufacturingProcess[, -1]
imputer <- preProcess(predictors_only, method = "knnImpute")
predictors_impt <- predict(imputer, predictors_only)
sum(is.na(predictors_impt))
## [1] 0
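
As a rough picture of what k-nearest-neighbor imputation does, here is a small base-R sketch on made-up data. It centers and scales the columns, finds the k complete rows closest to each incomplete row (using only the columns that row has), and fills each gap with the neighbors' average. The function knn_impute_toy is my own simplification, not caret's implementation. As far as I know, caret's knnImpute uses k = 5 by default and also centers and scales, so the values in predictors_impt are on the standardized scale rather than in the original units.

```r
knn_impute_toy <- function(x, k = 2) {
  z <- scale(x)   # center and scale; NAs are skipped when computing mean/sd
  complete <- which(complete.cases(z))
  for (i in which(!complete.cases(z))) {
    miss <- is.na(z[i, ])
    # Distance from row i to every complete row, over the columns row i has.
    diffs <- sweep(z[complete, !miss, drop = FALSE], 2, z[i, !miss])
    d <- sqrt(rowSums(diffs^2))
    nn <- complete[order(d)[seq_len(k)]]   # the k nearest complete rows
    z[i, miss] <- colMeans(z[nn, miss, drop = FALSE])
  }
  z
}

# Row 5 is missing b; its nearest neighbors on a are rows 2 and 1.
toy <- cbind(a = c(1, 2, 3.5, 10, 2.2),
             b = c(1.1, 2.1, 2.9, 10.2, NA))
imputed <- knn_impute_toy(toy, k = 2)
sum(is.na(imputed))   # no missing values remain
```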