data(permeability)
The data set had 719 near zero variables. The applicable data set is now down to 388.
bad_fingers <- nearZeroVar(fingerprints)
perm_finger <- fingerprints[,-bad_fingers]
length(bad_fingers)
## [1] 719
dim(perm_finger)[2]
## [1] 388
Pre-processing will consist of centering and scaling the data.The elasticnet model won’t work with scaling, so that model will only be centered.
cl<-makeCluster(detectCores())
registerDoParallel(cl)
set.seed(658)
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]]<- sample.int(n=1000, 54)
#for the last model
seeds[[11]]<-sample.int(1000, 1)
myControl <- trainControl(method='cv', seeds=seeds)
new_perm <- as.data.frame(perm_finger)
perm_train <- createDataPartition(permeability, p = 0.7, list = FALSE)
X_training <- new_perm[perm_train, ]
y_training <- permeability[perm_train, ]
X_testing <- new_perm[-perm_train, ]
y_testing <- permeability[-perm_train, ]
pls_model <- train(x=X_training,
y=y_training,
method='pls',
metric='Rsquared',
trControl=myControl,
preProcess=c('center', 'scale')
)
pls_model$results %>%
filter(ncomp == pls_model$bestTune$ncomp) %>%
select(ncomp, RMSE, Rsquared)
## ncomp RMSE Rsquared
## 1 3 12.44402 0.4636497
pls_pred <- predict(pls_model, newdata=X_testing)
postResample(pred=pls_pred, obs=y_testing)
## RMSE Rsquared MAE
## 10.6990767 0.5100389 8.4541193
The ridge regression’s highest R-squared score was .4818 with a lambda of 1. This is an increase from .4636, or .0182, which is a 3.9% increase in predictive power on the training set.
Also of note, one was the highest lambda value tested, so it is possible that if the range for lambda is increased that the R-squared value might also increase. Further testing should be conducted to see if that is true.
The elasticnet’s highest R-squared value was .4722. Which is a slight increase from the partial least squares model’s training R-square, yet it is also slightly lower than the ridge regression’s R-squared on the training data.
The lasso model was the worst performer of all the models. It’s highest R-squared value was .4509
Initially on the training data the ridge regression had the best R-squared score being ~4% better than the partial least squares model. This score could possibly be improved further by increasing the lambda value in the ridge regression.
ruby <- train( x=X_training,
y=y_training,
method='ridge',
metric='Rsquared',
tuneGrid=expand.grid(lambda = seq(0, 1,by=0.1)),
trControl=myControl,
preProcess=c('center','scale'),
na.action = na.omit
)
ruby
## Ridge Regression
##
## 117 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 105, 105, 106, 106, 105, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 15.26950 0.3654767 11.297176
## 0.1 12.96579 0.4526074 9.793751
## 0.2 12.94546 0.4602483 9.810543
## 0.3 13.11757 0.4669079 10.008800
## 0.4 13.39234 0.4724103 10.278381
## 0.5 13.78452 0.4764512 10.635906
## 0.6 14.24628 0.4784520 11.065704
## 0.7 14.77672 0.4800700 11.560552
## 0.8 15.36053 0.4810695 12.086370
## 0.9 15.98674 0.4816370 12.624526
## 1.0 16.64840 0.4818382 13.160313
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.
elasty <- train(x=X_training,
y=y_training,
method='enet',
metric='Rsquared',
tune=expand.grid(fraction=seq(0, 1, by=0.1),
lambda =seq(0, 1, by=0.1)),
trControl=myControl,
preProcess=c('center'),
na.action = na.omit
)
elasty
## Elasticnet
##
## 117 samples
## 388 predictors
##
## Pre-processing: centered (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 105, 105, 106, 106, 105, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0e+00 0.050 12.73731 0.4635880 9.683101
## 0e+00 0.525 14.03838 0.3840277 10.450708
## 0e+00 1.000 17.43465 0.3596392 12.674035
## 1e-04 0.050 3256.49276 0.3875031 2166.900249
## 1e-04 0.525 31388.76539 0.2802239 20853.673198
## 1e-04 1.000 59361.10575 0.2331990 39433.057183
## 1e-01 0.050 12.37650 0.4722650 9.515095
## 1e-01 0.525 12.34334 0.4617231 9.171237
## 1e-01 1.000 12.96978 0.4525790 9.798515
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.05 and lambda = 0.1.
lassy<- train(x=X_training,
y=y_training,
method='lasso',
metric='Rsquared',
tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.05)),
trControl=myControl,
preProcess=c('center','scale')
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Warning in train.default(x = X_training, y = y_training, method = "lasso", :
## missing values found in aggregated results
lassy
## The lasso
##
## 117 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 105, 105, 105, 106, 106, 105, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00 15.75590 NaN 12.641787
## 0.05 12.72988 0.4509719 9.813431
## 0.10 12.10918 0.4398276 8.913847
## 0.15 12.08650 0.4261224 8.661598
## 0.20 12.33885 0.4071344 8.797435
## 0.25 12.40495 0.4097625 8.919987
## 0.30 12.56445 0.4085810 9.121651
## 0.35 12.80054 0.4026614 9.302328
## 0.40 13.04872 0.3985966 9.593485
## 0.45 13.18989 0.4007178 9.755427
## 0.50 13.29458 0.4041847 9.850746
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.
stopCluster(cl)
No I would not recommend any of these models as are, to replace the permeability laboratory experiment. There doesn’t seem to be a strong enough relationship between the predictors and the predicted. These models might be able to be tuned much better than what I have done, where they become a viable option as replacement. As is, these are not viable options.
data(ChemicalManufacturingProcess)
Of the columns that are missing data, most are only missing one or two values. ManufacturingProcess03 is missing the most values with 15 missing, which is just over 8% of the observations. Manufacturing Processes 10 and 11 are also missing a decent amount, missing 9 and 10 values.
For this dataset I am going to use a knn imputation. Which will remove all na values, while also centering and scaling all the data.
colSums(is.na(ChemicalManufacturingProcess))
## Yield BiologicalMaterial01 BiologicalMaterial02
## 0 0 0
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 0 0 0
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## 0 0 0
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## 0 0 0
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## 0 1 3
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## 15 1 1
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## 2 1 1
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## 0 9 10
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## 1 0 1
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## 0 0 0
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## 0 0 0
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## 0 1 1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## 1 5 5
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## 5 5 5
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## 5 5 0
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## 5 5 5
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## 5 0 0
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## 0 1 1
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## 0 0 0
## ManufacturingProcess45
## 0
only_yield <- subset(ChemicalManufacturingProcess, select = c(Yield))
no_yield <- subset(ChemicalManufacturingProcess, select = -c(Yield))
imputed_data <- preProcess(no_yield, "knnImpute")
fixed_cmp <- predict(imputed_data,
no_yield)
Using a Ridge Regression model, the optimal performance metric is a Ridge Regression with a lambda of 0.4. This model has an R-squared of .5276 and an RMSE of 1.897.
cl2<-makeCluster(detectCores())
registerDoParallel(cl2)
set.seed(791)
training2 <- createDataPartition(only_yield$Yield, p=0.7, list=FALSE)
X_training2 <- fixed_cmp[training2, ]
y_training2 <- only_yield$Yield[training2]
X_testing2 <- fixed_cmp[-training2, ]
y_testing2 <- only_yield$Yield[-training2]
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]]<- sample.int(n=1000, 54)
#for the last model
seeds[[11]]<-sample.int(1000, 1)
myControl <- trainControl(method='cv', seeds=seeds)
ruby2 <- train( x=X_training2,
y=y_training2,
method='ridge',
metric='RMSE',
tuneGrid=expand.grid(.lambda = seq(0, 2, by=0.2)),
trControl=myControl
)
ruby2
## Ridge Regression
##
## 124 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 112, 112, 111, 112, 111, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 3.794717 0.3112102 2.073043
## 0.2 2.037873 0.5001433 1.306535
## 0.4 1.897186 0.5276175 1.265029
## 0.6 1.909298 0.5233858 1.309832
## 0.8 1.978693 0.5147947 1.374211
## 1.0 2.072019 0.5080121 1.446059
## 1.2 2.176858 0.5031478 1.529156
## 1.4 2.287612 0.4996483 1.619042
## 1.6 2.401310 0.4970583 1.713256
## 1.8 2.516197 0.4950584 1.808623
## 2.0 2.631163 0.4934274 1.904842
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.4.
plot(ruby2)
The RMSE and the R-Squared values are both better on the testing data than the training data. The R-Squared improved to .604 compared to .5276, which is an improvement of 14.6% for the model fit. While the RMSE improved from 1.897 to 1.287, which is an improvement of 32%.
ruby2_pred <- predict(ruby2, newdata=X_testing2)
postResample(pred=ruby2_pred, obs=y_testing2)
## RMSE Rsquared MAE
## 1.2879312 0.6047784 1.0554196
The top 10 predictors of importance are all manufacturing processes. This is not surprising since there are roughly 3.75 manufacturing processes to every 1 biological material.
importance <- predict(ruby2$finalModel, s=ruby2$bestTune[1, "lambda"], type="coef", mode="penalty")$coefficients
importance_sort <- abs(importance)
importance_sort<- importance_sort[importance_sort > 0]
importance_sort <- sort(importance_sort, decreasing = T)
importance_sort[0:10]
## ManufacturingProcess32 ManufacturingProcess36 ManufacturingProcess09
## 0.5196149 0.3689769 0.3607273
## ManufacturingProcess13 ManufacturingProcess17 ManufacturingProcess37
## 0.3473881 0.3369944 0.2281104
## ManufacturingProcess15 ManufacturingProcess11 ManufacturingProcess34
## 0.2159516 0.2152899 0.1962589
## ManufacturingProcess12
## 0.1750054
I believe these predictors have the largest relationship with yield. So if you want to improve yield you would start with the processes that play the biggest role in yield. Also, these processes could be models for all the other processes. Where you look for differences between processes that might affect the yield.
Such as, maybe these processes happen in different facilities that are in different geographic regions. Maybe these processes are affected by unique geographic characteristics, such as humidity. The processes that perform better are in low humidity geographic areas. This could mean that some processes need to be relocated to regions of low humidity. Thus you could improve yield while also lowering costs of having to keep a facility’s humidity down.