R Markdown

The R code below performs a comprehensive analysis of the Chemical Manufacturing Process dataset using several machine learning models. The data are first preprocessed by imputing missing values, removing near-zero variance columns, and splitting into training and testing sets. The code then builds and evaluates multiple nonlinear regression models with the caret package: k-Nearest Neighbors (KNN), a Neural Network (NN), a Random Forest (RF), an Averaged Neural Network (AvgNN), Multivariate Adaptive Regression Splines (MARS), and a Support Vector Machine (SVM). For each model it computes and compares the Root Mean Square Error (RMSE) and R-squared (R2) on both the training and testing sets, and visualizes the results with bar plots. Part (b) identifies and compares the top 10 important predictors from the MARS model and an optimal linear model. Part (c) analyzes the relationship between variation in a single predictor, ManufacturingProcess32, and the yield predicted by the SVM model.

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(ggplot2)
library(AppliedPredictiveModeling)

set.seed(0)

data(ChemicalManufacturingProcess)

processPredictors = ChemicalManufacturingProcess[,2:58]
yield = ChemicalManufacturingProcess[,1]

n_samples = dim(processPredictors)[1]
n_features = dim(processPredictors)[2]

# Fill in missing values where we have NAs with the median over the non-NA values: 
#
replacements = sapply( processPredictors, median, na.rm=TRUE )
for( ci in 1:n_features ){
  bad_inds = is.na( processPredictors[,ci] )
  processPredictors[bad_inds,ci] = replacements[ci]
}

# Look for any features with near-zero variance:
# 
zero_cols = nearZeroVar( processPredictors )
print( sprintf("Found %d near-zero variance columns from %d",length(zero_cols), dim(processPredictors)[2] ) )
## [1] "Found 1 near-zero variance columns from 57"
processPredictors = processPredictors[,-zero_cols] # drop these near-zero variance columns 

# Split this data into training and testing sets:
#
training = createDataPartition( yield, p=0.8 )

processPredictors_training = processPredictors[training$Resample1,]
yield_training = yield[training$Resample1]

processPredictors_testing = processPredictors[-training$Resample1,]
yield_testing = yield[-training$Resample1]

# Build various nonlinear models and then compare performance:
# 
# Note we use the default "trainControl" of bootstrap evaluations for each of the models below: 
#
preProc_Arguments = c("center","scale")

# A K-NN model:
# 
set.seed(0)
knnModel = train(x=processPredictors_training, y=yield_training, method="knn", preProc=preProc_Arguments, tuneLength=10)

# predict on training/testing sets
knnPred = predict(knnModel, newdata=processPredictors_training)
knnPR = postResample(pred=knnPred, obs=yield_training)
rmses_training = c(knnPR[1])
r2s_training = c(knnPR[2])
methods = c("KNN")

knnPred = predict(knnModel, newdata=processPredictors_testing)
knnPR = postResample(pred=knnPred, obs=yield_testing)
rmses_testing = c(knnPR[1])
r2s_testing = c(knnPR[2])


# A Neural Network model:
#
nnGrid = expand.grid( .decay=c(0,0.01,0.1), .size=1:10 ) # nnet tunes only size and decay (no .bag parameter)
set.seed(0)
nnetModel = train(x=processPredictors_training, y=yield_training, method="nnet", preProc=preProc_Arguments, tuneGrid=nnGrid,
                  linout=TRUE, trace=FALSE, MaxNWts=10 * (ncol(processPredictors_training)+1) + 10 + 1, maxit=500)

nnetPred = predict(nnetModel, newdata=processPredictors_training)
nnetPR = postResample(pred=nnetPred, obs=yield_training)
rmses_training = c(rmses_training,nnetPR[1])
r2s_training = c(r2s_training,nnetPR[2])
methods = c(methods,"NN")

nnetPred = predict(nnetModel, newdata=processPredictors_testing)
nnetPR = postResample(pred=nnetPred, obs=yield_testing)
rmses_testing = c(rmses_testing,nnetPR[1])
r2s_testing = c(r2s_testing,nnetPR[2])

# Random forest non-linear models:
set.seed(0)
rfModel = train(x=processPredictors_training, y=yield_training, method="rf", preProc=preProc_Arguments, tuneLength=10)

rfPred = predict(rfModel, newdata=processPredictors_training)
rfPR = postResample(pred=rfPred, obs=yield_training)
rmses_training = c(rmses_training, rfPR[1])
r2s_training = c(r2s_training, rfPR[2])
methods = c(methods, "RF")

rfPred = predict(rfModel, newdata=processPredictors_testing)
rfPR = postResample(pred=rfPred, obs=yield_testing)
rmses_testing = c(rmses_testing, rfPR[1])
r2s_testing = c(r2s_testing, rfPR[2])

# Update results:
res_training = data.frame(rmse=rmses_training, r2=r2s_training)
rownames(res_training) = methods

res_testing = data.frame(rmse=rmses_testing, r2=r2s_testing)
rownames(res_testing) = methods

# Averaged Neural Network models:
#
set.seed(0)
avNNetModel = train(x=processPredictors_training, y=yield_training, method="avNNet", preProc=preProc_Arguments,
                    linout=TRUE,trace=FALSE,MaxNWts=10 * (ncol(processPredictors_training)+1) + 10 + 1, maxit=500)
## Warning: executing %dopar% sequentially: no parallel backend registered
avNNetPred = predict(avNNetModel, newdata=processPredictors_training)
avNNetPR = postResample(pred=avNNetPred, obs=yield_training)
rmses_training = c(rmses_training,avNNetPR[1])
r2s_training = c(r2s_training,avNNetPR[2])
methods = c(methods,"AvgNN")

avNNetPred = predict(avNNetModel, newdata=processPredictors_testing)
avNNetPR = postResample(pred=avNNetPred, obs=yield_testing)
rmses_testing = c(rmses_testing,avNNetPR[1])
r2s_testing = c(r2s_testing,avNNetPR[2])


# MARS model:
#
marsGrid = expand.grid(.degree=1:2, .nprune=2:38)
set.seed(0)
marsModel = train(x=processPredictors_training, y=yield_training, method="earth", preProc=preProc_Arguments, tuneGrid=marsGrid)
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsPred = predict(marsModel, newdata=processPredictors_training)
marsPR = postResample(pred=marsPred, obs=yield_training)
rmses_training = c(rmses_training,marsPR[1])
r2s_training = c(r2s_training,marsPR[2])
methods = c(methods,"MARS")

marsPred = predict(marsModel, newdata=processPredictors_testing)
marsPR = postResample(pred=marsPred, obs=yield_testing)
rmses_testing = c(rmses_testing,marsPR[1])
r2s_testing = c(r2s_testing,marsPR[2])

# Let's see which variables are most important in the MARS model: 
varImp(marsModel)
## earth variable importance
## 
##                        Overall
## ManufacturingProcess32   100.0
## ManufacturingProcess09    46.8
## ManufacturingProcess13     0.0
# A Support Vector Machine (SVM):
#
set.seed(0)
svmModel = train(x=processPredictors_training, y=yield_training, method="svmRadial", preProc=preProc_Arguments, tuneLength=20)

svmPred = predict(svmModel, newdata=processPredictors_training)
svmPR = postResample(pred=svmPred, obs=yield_training) 
rmses_training = c(rmses_training,svmPR[1])
r2s_training = c(r2s_training,svmPR[2])
methods = c(methods,"SVM")

svmPred = predict(svmModel, newdata=processPredictors_testing)
svmPR = postResample(pred=svmPred, obs=yield_testing)
rmses_testing = c(rmses_testing,svmPR[1])
r2s_testing = c(r2s_testing,svmPR[2])

# Package the results up:
# 
res_training = data.frame( rmse=rmses_training, r2=r2s_training )
rownames(res_training) = methods

training_order = order( -res_training$rmse )

res_training = res_training[ training_order, ] # Order the dataframe so that the best results are at the bottom:
print( "Final Training Results" ) 
## [1] "Final Training Results"
print( res_training )
##            rmse        r2
## NN    1.3421630 0.4828281
## KNN   1.2841290 0.5614593
## MARS  1.0835591 0.6629220
## AvgNN 1.0154982 0.7243479
## RF    0.4805390 0.9620685
## SVM   0.3170933 0.9744619
res_testing = data.frame( rmse=rmses_testing, r2=r2s_testing )
rownames(res_testing) = methods

res_testing = res_testing[ training_order, ] # Order the dataframe so that the best results for the training set are at the bottom:
print( "Final Testing Results" ) 
## [1] "Final Testing Results"
print( res_testing )
##            rmse        r2
## NN    1.1964005 0.5238897
## KNN   1.0902785 0.6040095
## MARS  1.0265607 0.6430378
## AvgNN 1.1731082 0.5363352
## RF    0.8778014 0.7535474
## SVM   0.9190973 0.7343000

Results and Plots

## Best model based on training RMSE: SVM
## Best model based on training R2: SVM
## Best model based on testing RMSE: RF
## Best model based on testing R2: RF
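The chunk that produced the summary above is not echoed in the document. A minimal sketch of how these best-model lines and the RMSE/R2 bar plots could be reproduced from the res_training and res_testing data frames built earlier (an assumed reconstruction, not the original code):

# Sketch: report the best model under each metric (cat() gives the unquoted
# output shown above) and draw side-by-side bar plots of the test-set scores.
cat(sprintf("Best model based on training RMSE: %s\n", rownames(res_training)[which.min(res_training$rmse)]))
cat(sprintf("Best model based on training R2: %s\n", rownames(res_training)[which.max(res_training$r2)]))
cat(sprintf("Best model based on testing RMSE: %s\n", rownames(res_testing)[which.min(res_testing$rmse)]))
cat(sprintf("Best model based on testing R2: %s\n", rownames(res_testing)[which.max(res_testing$r2)]))

par(mfrow=c(1,2))
barplot(res_testing$rmse, names.arg=rownames(res_testing), las=2, main="Testing RMSE")
barplot(res_testing$r2,   names.arg=rownames(res_testing), las=2, main="Testing R2")
par(mfrow=c(1,1))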

## Top 10 important predictors in the MARS model:
##                         Overall
## ManufacturingProcess32 100.0000
## ManufacturingProcess09  46.8021
## ManufacturingProcess13   0.0000
## (MARS assigns nonzero importance to only three predictors; the remaining
## seven positions of the top-10 listing are NA)
## 
## Top 10 important predictors in the linear model:
##                          Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess18  88.37940
## ManufacturingProcess20  86.61467
## ManufacturingProcess36  70.89027
## ManufacturingProcess06  63.47825
## ManufacturingProcess09  62.93753
## ManufacturingProcess22  58.76123
## BiologicalMaterial03    54.83654
## ManufacturingProcess21  50.33718
## BiologicalMaterial10    46.85305
## 
## Overlap between top predictors in MARS and linear models:
## [1] "ManufacturingProcess32" "ManufacturingProcess09" "ManufacturingProcess13"
## Loading required package: pracma

## 
## Relationship between variation in ManufacturingProcess32 and Predicted Yield based on SVM model:
## 
## Call:
## lm(formula = y_hat ~ xs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.08814 -0.04388 -0.00108  0.03890  0.13928 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.243e+01  8.768e-02   141.8   <2e-16 ***
## xs          1.765e-01  5.541e-04   318.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04847 on 98 degrees of freedom
## Multiple R-squared:  0.999,  Adjusted R-squared:  0.999 
## F-statistic: 1.015e+05 on 1 and 98 DF,  p-value: < 2.2e-16
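The part (c) chunk is also omitted; here is a sketch consistent with the lm summary above (and with the pracma loading message), sweeping ManufacturingProcess32 while all other predictors sit at their training medians:

library(pracma) # for linspace()

# Sketch: build a 100-row grid in which only ManufacturingProcess32 varies
# over its observed range, predict yield with the SVM model, and regress the
# predictions on the swept values.
xs = linspace(min(processPredictors$ManufacturingProcess32),
              max(processPredictors$ManufacturingProcess32), n=100)

newdata = processPredictors_training[rep(1,length(xs)), ]
for( cn in colnames(newdata) ){
  newdata[,cn] = median(processPredictors_training[,cn])
}
newdata$ManufacturingProcess32 = xs

y_hat = predict(svmModel, newdata=newdata)
summary(lm(y_hat ~ xs)) # slope ~0.18: predicted yield rises nearly linearly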

Which nonlinear regression model gives the optimal resampling and test set performance?

Based on the resampling/training results, the SVM model performs best, with the lowest RMSE (0.317) and highest R2 (0.974). On the held-out test set the Random Forest model is strongest (RMSE 0.878, R2 0.754), with SVM a close second (RMSE 0.919, R2 0.734). The bar plots make the same point visually: the tree- and kernel-based models (RF and SVM) capture the underlying structure of the Chemical Manufacturing Process data considerably better than KNN or the neural network variants.

(b) Which predictors are most important in the optimal nonlinear regression model?

We can examine the variable importance scores (varImp) of the nonlinear models. For the MARS model, ManufacturingProcess32 is by far the most important predictor (scaled importance 100), followed by ManufacturingProcess09 (46.8); only three predictors receive any importance at all, because MARS performs feature selection as it builds its basis functions.

Do either the biological or process variables dominate the list?

Examining the names of the top predictors shows that the process variables dominate the list: all three predictors retained by the MARS model are ManufacturingProcess variables, and eight of the linear model's top ten are process variables (only BiologicalMaterial03 and BiologicalMaterial10 are biological).
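As a quick programmatic check (a sketch using the top-10 names reported above):

# Sketch: classify the linear model's top-10 predictors by name prefix.
top10 = c("ManufacturingProcess32","ManufacturingProcess18","ManufacturingProcess20",
          "ManufacturingProcess36","ManufacturingProcess06","ManufacturingProcess09",
          "ManufacturingProcess22","BiologicalMaterial03","ManufacturingProcess21",
          "BiologicalMaterial10")
table(ifelse(grepl("^Manufacturing", top10), "process", "biological")) # biological 2, process 8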

How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Extracting the top ten important predictors from the MARS model and comparing them with the top ten from the optimal linear model reveals both overlap and differences. Three predictors, ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess13, appear in both lists, indicating their robust importance for predicting yield; ManufacturingProcess32 tops both rankings. The linear model additionally credits predictors such as ManufacturingProcess18, ManufacturingProcess20, and two biological materials, suggesting that while a few variables matter regardless of the modeling approach, others are model-specific contributors.

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model.

To explore this relationship, the analysis sweeps ManufacturingProcess32 across its observed range while holding all other predictors at their training medians, then regresses the SVM-predicted yield on the swept values. The fitted line has a positive slope of about 0.177 with R2 of about 0.999; over this range, the SVM model's predicted yield increases almost perfectly linearly with ManufacturingProcess32. Understanding such relationships sharpens model interpretation and can guide feature engineering or selection in future modeling.

Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Yes. Because the top unique predictors are all manufacturing process variables, the plots suggest that yield is driven primarily by controllable operational settings rather than by the biological starting material. In particular, ManufacturingProcess32 shows a strong, nearly linear positive relationship with predicted yield, implying that increasing this process setting (within its observed range) should increase yield, a directly actionable insight, whereas the biological predictors appear only lower in the importance rankings and offer less leverage for intervention.