HW 8 A

Question 7.2

  • the Mars model does the best job below and it identifies the x1-x5 predictors as most important
  • all the other models do a good job of mostly identifying 1-5 as the variables of importance, however they, still give small values of importance to some predictors.

plot data

library(mlbench)



set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
 ## We convert the 'x' data from a matrix to a data frame
 ## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
 ## Look at the data using
featurePlot(trainingData$x, trainingData$y)

 ## or other methods.

 ## This creates a list with a vector 'y' and a matrix
 ## of predictors 'x'. Also simulate a large test set to
 ## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1) 
testData$x <- data.frame(testData$x)
b <- mlbench.friedman1(5000, sd = 1)

KNN

knnModel <- train(x = trainingData$x,
 y = trainingData$y,
 method = "knn",preProc = c("center", "scale"),
 tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.575784  0.4937689  2.910509
##    7  3.502105  0.5164526  2.839260
##    9  3.392737  0.5567970  2.754948
##   11  3.360067  0.5761135  2.731018
##   13  3.347367  0.5903124  2.722033
##   15  3.322199  0.6074368  2.694127
##   17  3.303677  0.6235049  2.690998
##   19  3.311931  0.6324167  2.696910
##   21  3.323571  0.6414055  2.705803
##   23  3.343621  0.6437899  2.730110
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
varImp(knnModel)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

SVM

  • below I tune 3 different svm models
    • Radial kernel- produces a RMSE on test set of 2.05
      • epsilon = 0.1 cost C = 16 , sigma = 0.061, Number of Support Vectors : 164 Training error : 0.008
    • Poly- produces a RMSE on test set of 2.12
      • epsilon = 0.1 cost C = .25 ,Number of Support Vectors : 163 ,Training error : .04
    • linear - produces an RMSE of 2.68
      • parameter : epsilon = 0.1 cost C = 1 ,Training error : 0.265593
set.seed(100)
svmRTuned <- train(trainingData$x , trainingData$y,
 method = "svmRadial",
 preProc = c("center", "scale"),
 tuneLength = 14,
 trControl = trainControl(method = "cv"))
set.seed(100)
svmRTuned2 <- train(trainingData$x , trainingData$y,
 method = "svmLinear",
 preProc = c("center", "scale"),
 tuneLength = 14,
 trControl = trainControl(method = "cv"))

svmRTuned3 <- train(trainingData$x , trainingData$y,
 method = "svmPoly",
 preProc = c("center", "scale"),
 tuneLength = 3,
 trControl = trainControl(method = "cv"))

svmRTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 16 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0552698014532696 
## 
## Number of Support Vectors : 156 
## 
## Objective Function Value : -89.772 
## Training error : 0.008525
svmRTuned2$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 172 
## 
## Objective Function Value : -54.9759 
## Training error : 0.216921
svmRTuned3$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 0.5 
## 
## Polynomial kernel function. 
##  Hyperparameters : degree =  3  scale =  0.1  offset =  1 
## 
## Number of Support Vectors : 146 
## 
## Objective Function Value : -8.0447 
## Training error : 0.020293
radial_pred <- predict(svmRTuned,testData$x)
linear_pred <- predict(svmRTuned2,testData$x)
poly_pred <- predict(svmRTuned3,testData$x)

RMSE(testData$y,radial_pred)
## [1] 2.049005
RMSE(testData$y,linear_pred)
## [1] 2.763386
RMSE(testData$y,poly_pred)
## [1] 2.056465
varImp(svmRTuned)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000
varImp(svmRTuned2)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000
varImp(svmRTuned3)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

MARS

  • RMSE on test data is 1.14
  • x1-x5 were all chosen as predictors of importance
set.seed(100)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:32)
marsTuned <- train(trainingData$x, trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))

marsTuned$finalModel
## Selected 16 of 18 terms, and 5 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, ...
## Number of terms at each degree of interaction: 1 11 4
## GCV 1.61518    RSS 210.6377    GRSq 0.934423    RSq 0.9568093
mars_pred <- predict(marsTuned,testData$x)
RMSE(testData$y,mars_pred)
## [1] 1.14925
varImp(marsTuned)
## earth variable importance
## 
##     Overall
## X1   100.00
## X4    85.14
## X2    69.24
## X5    49.32
## X3    40.02
## X8     0.00
## X6     0.00
## X10    0.00
## X9     0.00
## X7     0.00

Question 7.5

  • For questions B and C we were asked to compare our results with last week
  • As a small caveat, last week I eliminated many predictors that showed potential collinearity issues and ran models based on full dataset and a trimmed dataset. For part A in this hw, i just use that trimmed dataset
  • However, in the end of part A, I run a caret ensemble model to tune everything at once with auto feature selection on the full dataset
    • This model is used for parts b and c to figure out feature importance

A

  • the data was already imputed in last hw, I import that data file again
    • Last time i ran tests over complete dataset and trimmed dataset, this time I will run test over just the trimmed dataset
data("ChemicalManufacturingProcess")
completedData <- read.csv("imputed.csv")
complete_df_imputed <- completedData
my_df <- ChemicalManufacturingProcess[,-1]
processPredictors <- as.matrix(my_df)
yield <- ChemicalManufacturingProcess[,1]




## eliminate high correlation and near zero var  
corrplot::corrplot(cor(completedData))

nearZeroVar(completedData)
## [1] 8
correlations <- cor(completedData)
highCorr <- findCorrelation(correlations, cutoff = .75)
completedData <- completedData[, -highCorr]
completedData <- completedData[,-3]

Run SVM models

  • Radial test set 1.17 RMSE
    • the residual plot for the model shows non constant variance in the residuals. It almost look like there is a a positive correlation between residuals and predictions
  • Linear SVM test set RMSE 1.16, the residuals do look a little better than above model
  • Poly has an ok residual plot but RMSE of 1.23 on test set
## Train test split
set.seed(123)
smp_size <- floor(0.75 * nrow(completedData))
train_ind <- sample(seq_len(nrow(completedData)), size = smp_size)
train <- completedData[train_ind, ]
test <- completedData[-train_ind, ]
train_y <- yield[train_ind]
test_y <- yield[-train_ind]


## tune models 

## Radial model
svmRTuned <- train(train , train_y,
 method = "svmRadial",
 preProc = c("center", "scale"),
 tuneLength = 14,
 trControl = trainControl(method = "cv"))

## plot Radial model
predicted <- predict(svmRTuned,test)
actual <-test_y

axisRange <- extendrange(c(actual, predicted))
plot(actual, predicted, ylim = axisRange, xlim = axisRange)
abline(0, 1, col = "darkgrey", lty = 2)

plot(predicted, (predicted-actual), ylab = "residual")
abline(h = 0, col = "darkgrey", lty = 2)

## Linear model
svmRTuned2 <- train(train , train_y,
 method = "svmLinear",
 preProc = c("center", "scale"),
 tuneLength = 14,
 trControl = trainControl(method = "cv"))

## plot lin model

predicted <- predict(svmRTuned2,test)
actual <-test_y

axisRange <- extendrange(c(actual, predicted))
plot(actual, predicted, ylim = axisRange, xlim = axisRange)
abline(0, 1, col = "darkgrey", lty = 2)

plot(predicted, (predicted-actual), ylab = "residual")
abline(h = 0, col = "darkgrey", lty = 2)

## poly svm model
svmRTuned3 <- train(train, train_y,
 method = "svmPoly",
 preProc = c("center", "scale"),
 tuneLength = 3,
 trControl = trainControl(method = "cv"))

## plot model
## 
predicted <- predict(svmRTuned3,test)
actual <-test_y

axisRange <- extendrange(c(actual, predicted))
plot(actual, predicted, ylim = axisRange, xlim = axisRange)
abline(0, 1, col = "darkgrey", lty = 2)

plot(predicted, (predicted-actual), ylab = "residual")
abline(h = 0, col = "darkgrey", lty = 2)

## get predictions RMSE for models 
predictions1 <- predict(svmRTuned,test)
radial_rmse <- RMSE(predictions1,test_y)
predictions2 <- predict(svmRTuned2,test)
linear_rmse <- RMSE(predictions2,test_y)
predictions3 <- predict(svmRTuned3,test)
poly_rmse <- RMSE(predictions3,test_y)
cbind(radial_rmse,linear_rmse,poly_rmse)
##      radial_rmse linear_rmse poly_rmse
## [1,]     1.17302    1.165471  1.233217
## final models 
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 2 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0221305795559405 
## 
## Number of Support Vectors : 122 
## 
## Objective Function Value : -66.0452 
## Training error : 0.102862
svmRTuned2$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 125 
## 
## Objective Function Value : -44.7484 
## Training error : 0.336602
svmRTuned3$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 0.25 
## 
## Polynomial kernel function. 
##  Hyperparameters : degree =  3  scale =  0.01  offset =  1 
## 
## Number of Support Vectors : 117 
## 
## Objective Function Value : -14.6279 
## Training error : 0.357631
varImp(svmRTuned)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 36)
## 
##                        Overall
## BiologicalMaterial03    100.00
## ManufacturingProcess17   93.51
## ManufacturingProcess36   86.62
## ManufacturingProcess33   66.87
## BiologicalMaterial11     66.25
## ManufacturingProcess06   63.65
## BiologicalMaterial09     50.78
## ManufacturingProcess11   45.97
## ManufacturingProcess30   39.92
## ManufacturingProcess12   36.89
## ManufacturingProcess28   31.69
## ManufacturingProcess01   26.22
## ManufacturingProcess27   26.04
## BiologicalMaterial10     23.40
## BiologicalMaterial05     22.94
## ManufacturingProcess16   21.21
## ManufacturingProcess35   19.28
## ManufacturingProcess04   18.70
## ManufacturingProcess20   18.02
## ManufacturingProcess02   16.06
varImp(svmRTuned2)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 36)
## 
##                        Overall
## BiologicalMaterial03    100.00
## ManufacturingProcess17   93.51
## ManufacturingProcess36   86.62
## ManufacturingProcess33   66.87
## BiologicalMaterial11     66.25
## ManufacturingProcess06   63.65
## BiologicalMaterial09     50.78
## ManufacturingProcess11   45.97
## ManufacturingProcess30   39.92
## ManufacturingProcess12   36.89
## ManufacturingProcess28   31.69
## ManufacturingProcess01   26.22
## ManufacturingProcess27   26.04
## BiologicalMaterial10     23.40
## BiologicalMaterial05     22.94
## ManufacturingProcess16   21.21
## ManufacturingProcess35   19.28
## ManufacturingProcess04   18.70
## ManufacturingProcess20   18.02
## ManufacturingProcess02   16.06
varImp(svmRTuned3)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 36)
## 
##                        Overall
## BiologicalMaterial03    100.00
## ManufacturingProcess17   93.51
## ManufacturingProcess36   86.62
## ManufacturingProcess33   66.87
## BiologicalMaterial11     66.25
## ManufacturingProcess06   63.65
## BiologicalMaterial09     50.78
## ManufacturingProcess11   45.97
## ManufacturingProcess30   39.92
## ManufacturingProcess12   36.89
## ManufacturingProcess28   31.69
## ManufacturingProcess01   26.22
## ManufacturingProcess27   26.04
## BiologicalMaterial10     23.40
## BiologicalMaterial05     22.94
## ManufacturingProcess16   21.21
## ManufacturingProcess35   19.28
## ManufacturingProcess04   18.70
## ManufacturingProcess20   18.02
## ManufacturingProcess02   16.06

MARS

  • RMSE on test set is 1.71
  • MARS model looks terrible
set.seed(123)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:32)
marsTuned <- train(train, train_y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))

marsTuned$finalModel
## Selected 2 of 45 terms, and 2 of 36 predictors
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: ManufacturingProcess17, ManufacturingProcess28, ...
## Number of terms at each degree of interaction: 1 0 1
## GCV 2.548561    RSS 318.8066    GRSq 0.2586737    RSq 0.2866986
mars_pred <- predict(marsTuned,test)
RMSE(test_y,mars_pred)
## [1] 1.714526
predicted <- predict(marsTuned,test)
actual <-test_y

axisRange <- extendrange(c(actual, predicted))
plot(actual, predicted, ylim = axisRange, xlim = axisRange)
abline(0, 1, col = "darkgrey", lty = 2)

plot(predicted, (predicted-actual), ylab = "residual")
abline(h = 0, col = "darkgrey", lty = 2)

varImp(marsTuned)
## earth variable importance
## 
##   only 20 most important variables shown (out of 36)
## 
##                        Overall
## ManufacturingProcess28     100
## ManufacturingProcess17     100
## ManufacturingProcess21       0
## ManufacturingProcess24       0
## ManufacturingProcess05       0
## BiologicalMaterial03         0
## ManufacturingProcess37       0
## BiologicalMaterial09         0
## ManufacturingProcess12       0
## ManufacturingProcess35       0
## ManufacturingProcess19       0
## ManufacturingProcess34       0
## ManufacturingProcess22       0
## ManufacturingProcess08       0
## ManufacturingProcess01       0
## BiologicalMaterial11         0
## ManufacturingProcess20       0
## ManufacturingProcess11       0
## ManufacturingProcess03       0
## ManufacturingProcess38       0

KNN

  • RMSE on test test is 1.51
knnModel <- train(x = train,
 y = train_y,
 method = "knn",preProc = c("center", "scale"),
 tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 132 samples
##  36 predictor
## 
## Pre-processing: centered (36), scaled (36) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.472477  0.3282891  1.152186
##    7  1.463268  0.3209371  1.160692
##    9  1.459681  0.3179446  1.157638
##   11  1.461165  0.3151395  1.158253
##   13  1.463401  0.3112594  1.159268
##   15  1.463532  0.3076825  1.158154
##   17  1.460605  0.3115393  1.155990
##   19  1.457138  0.3182375  1.152340
##   21  1.457895  0.3214083  1.147691
##   23  1.465486  0.3178753  1.153840
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 19.
varImp(knnModel)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 36)
## 
##                        Overall
## BiologicalMaterial03    100.00
## ManufacturingProcess17   93.51
## ManufacturingProcess36   86.62
## ManufacturingProcess33   66.87
## BiologicalMaterial11     66.25
## ManufacturingProcess06   63.65
## BiologicalMaterial09     50.78
## ManufacturingProcess11   45.97
## ManufacturingProcess30   39.92
## ManufacturingProcess12   36.89
## ManufacturingProcess28   31.69
## ManufacturingProcess01   26.22
## ManufacturingProcess27   26.04
## BiologicalMaterial10     23.40
## BiologicalMaterial05     22.94
## ManufacturingProcess16   21.21
## ManufacturingProcess35   19.28
## ManufacturingProcess04   18.70
## ManufacturingProcess20   18.02
## ManufacturingProcess02   16.06
knn_pred <- predict(knnModel,test)
RMSE(test_y,knn_pred)
## [1] 1.517846
predicted <- predict(knnModel,test)
actual <-test_y

axisRange <- extendrange(c(actual, predicted))
plot(actual, predicted, ylim = axisRange, xlim = axisRange)
abline(0, 1, col = "darkgrey", lty = 2)

plot(predicted, (predicted-actual), ylab = "residual")
abline(h = 0, col = "darkgrey", lty = 2)

caret ensemble on entire dataframe

  • XGBOOST gives best results, at rmse under 1
completedData <- read.csv("imputed.csv")
complete_df_imputed <- completedData

complete_df_imputed <- complete_df_imputed[,-1]
set.seed(123)
smp_size <- floor(0.75 * nrow(complete_df_imputed))
train_ind <- sample(seq_len(nrow(complete_df_imputed)), size = smp_size)
train <- complete_df_imputed[train_ind, ]
test <- complete_df_imputed[-train_ind, ]
train_y <- yield[train_ind]
test_y <- yield[-train_ind]




registerDoParallel(4)
getDoParWorkers()
## [1] 4
set.seed(123)
my_control <- trainControl(method = 'cv', # for “cross-validation”
                           number = 5, # number of k-folds
                           savePredictions = 'final',
                           allowParallel = TRUE)

model_list <- caretList(train,
                        train_y,
                        trControl = my_control,
                        methodList = c('lm', 'svmRadial', 'rf',"pls", 
                                       'xgbTree', 'xgbLinear',"bagEarth",'glmnet',"knn" ),
                        tuneList = NULL,
                        continue_on_fail = FALSE, 
                        preProcess = c('center','scale'))
## Warning in trControlCheck(x = trControl, y = target): indexes not defined
## in trControl. Attempting to set them ourselves, so each model in the
## ensemble will have the same resampling indexes.
options(digits = 3)
model_results <- data.frame(LM = min(model_list$lm$results$RMSE),
 SVM = min(model_list$svmRadial$results$RMSE),
 RF = min(model_list$rf$results$RMSE),
 XGBT = min(model_list$xgbTree$results$RMSE),
 XGBL = min(model_list$xgbLinear$results$RMSE),
 Mars = min(model_list$bagEarth$results$RMSE),
 GLMNET = min(model_list$glmnet$results$RMSE),
 PLS =min(model_list$pls$results$RMSE),
 KNN =min(model_list$knn$results$RMSE))
 
print(model_results)
##     LM  SVM   RF XGBT XGBL Mars GLMNET  PLS  KNN
## 1 12.9 1.32 1.17 1.07 1.15 1.39   1.31 1.58 1.37
options(digits = 3)

LM = predict(model_list$lm,test)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
RMSE(test_y,knn_pred)
## [1] 1.52
svmRadial = predict(model_list$svmRadial,test)
RMSE(test_y,svmRadial)
## [1] 1.14
rf = predict(model_list$rf,test)
RMSE(test_y,rf)
## [1] 1.23
resamples <- resamples(model_list)
dotplot(resamples, metric = 'RMSE')

ensemble_1 <- caretEnsemble(model_list, 
                            metric = 'RMSE', 
                            trControl = my_control)
summary(ensemble_1)
## The following models were ensembled: lm, svmRadial, rf, pls, xgbTree, xgbLinear, bagEarth, glmnet, knn 
## They were weighted: 
## -8.925 0.0044 0.0684 0.2258 -0.2042 0.5689 0.0924 -0.0561 0.3611 0.162
## The resulting RMSE is: 1.2986
## The fit for each individual model on the RMSE is: 
##     method  RMSE RMSESD
##         lm 12.91 22.828
##  svmRadial  1.32  0.128
##         rf  1.17  0.232
##        pls  1.58  0.293
##    xgbTree  1.07  0.175
##  xgbLinear  1.15  0.255
##   bagEarth  1.39  0.160
##     glmnet  1.31  0.224
##        knn  1.37  0.187

B

Variable importance

  • Below i commented out some of the most important variables in out linear versus non linear models ran on entire dataset
  • In my best non linear model which was the radial model, we can see that 5 of the biological material predictors actually make it into top ten most important predictors.
  • I’d say biological material seems slightly more important in non linear models.
    • Our MARS models, while performing pretty poorly, seem to indicate MP 32 is the only important process
plot(varImp(model_list$xgbTree),top=10)##   32 , 13 06, 09, 31 01,37          03  12, 10, 05, 09

plot(varImp(model_list$xgbLinear),top=10)

plot(varImp(model_list$glmnet),top=10)## MP 32,12,09,17,34,06,37,03  BM 06,03

plot(varImp(model_list$bagEarth),top=10)

plot(varImp(model_list$pls),top=10)## MP 32,36,13,09,33   BM:  06,02,03 

plot(varImp(model_list$svmRadial),top=10)## MP 32,13,17,36,09,33    06,03,02,12,11

C

  • Just like our previous hw, we can see the biological material all acts as some sort of accelerant. All have positive correlation with yield
  • In terms of manufacturing process, I added in MP 31 01 and 07 because our best model(XGboost) identified them
    • strangely enough these variables actually show the least correlation with our yield
  • Otherwise just like in our linear model, we can see that certain manufacturing process can hinder or help our yield.
  • Manufacturing process 32 seems very helpful, which looking back at my gls model did happen in the dataset with all predictors as well
  • manufacturing process 13 seems very detrimental to our yield which was also identified in the same model for linear hw last week
corr_df <- as.data.frame(cbind(complete_df_imputed,yield))
corrplot::corrplot(cor(corr_df[,1:58])[c(3,12,10,5,9),58, drop=FALSE], cl.pos='n')

corrplot::corrplot(cor(corr_df[,1:58])[c(25,18,29,45,21,43,13,49,44),58, drop=FALSE], cl.pos='n')