Approach

A random forest model was fit using all predictors from the simulated dataset. The goal is to evaluate whether the model assigns importance to uninformative predictors (V6–V10). Variable importance is extracted using the varImp() function.

library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.3
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##         Overall
## V1   8.83890885
## V2   6.49023056
## V3   0.67583163
## V4   7.58822553
## V5   2.27426009
## V6   0.17436781
## V7   0.15136583
## V8  -0.03078937
## V9  -0.02989832
## V10 -0.08529218

Results & Interpretation

The variable importance results show that predictors V1–V5 receive substantially higher scores than V6–V10. This is expected, since V1–V5 are the true signal variables in the Friedman benchmark while V6–V10 are noise. The random forest correctly prioritizes the informative predictors and relies only minimally on the uninformative ones.

8.1

Approach

A new predictor highly correlated with V1 is added to evaluate how multicollinearity affects variable importance in random forests.

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

varImp(model2)
##               Overall
## V1         33.7040129
## V2         45.4568836
## V3          9.5329892
## V4         51.7654739
## V5         23.2785698
## V6          1.6701722
## V7         -0.4936772
## V8         -2.2341346
## V9         -1.9061987
## V10        -0.1393049
## duplicate1 24.7043635

After the correlated predictor is added, the importance attributed to V1 drops while duplicate1 absorbs much of it. (Note that this call, unlike the first, omits scale = FALSE, so the magnitudes are not directly comparable to rfImp1; the relative rankings are what matter.) Random forests split credit across correlated predictors, making it difficult to tell which variable is truly responsible for the predictive power.
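For reference, conditional inference forests provide an importance measure designed to adjust for correlated predictors. A minimal sketch using the party package (not part of the original output; the seed and tree count are illustrative):

library(party)
set.seed(200)
# Conditional importance partials out correlated predictors such as duplicate1
cfModel <- cforest(y ~ ., data = simulated,
                   controls = cforest_unbiased(ntree = 100))
varimp(cfModel, conditional = TRUE)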

8.2

Approach

A simulation is used to demonstrate bias in tree-based models when predictors differ in granularity.

set.seed(123)

n <- 200
x1 <- rnorm(n)
x2 <- sample(1:2, n, replace = TRUE)
y <- x1 + rnorm(n)

data <- data.frame(x1, x2, y)
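
The interpretation below refers to a fitted tree model that does not appear in the code above. A minimal sketch, assuming a single CART tree from rpart:

library(rpart)
# x2 takes only two values, so it offers a single candidate split point,
# while the continuous x1 offers up to n - 1
treeFit <- rpart(y ~ x1 + x2, data = data)
treeFit$variable.importance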

A tree fit to these data (see the sketch above) assigns higher importance to the continuous x1 than to the two-valued x2. Part of that gap is genuine signal, since y was generated from x1, but it also illustrates the selection bias of tree-based methods: a predictor with many distinct values offers many more candidate split points than a two-level predictor, so trees tend to favor high-granularity variables even when the extra resolution carries no information.

8.3

  1. The model with a high bagging fraction and learning rate concentrates importance on a few predictors because it fits the strong signals aggressively in the first boosting iterations. In contrast, lower values learn from the residuals gradually and spread importance more evenly (see the sketch after this list).

  2. The model with the lower parameter values is likely to generalize better to new data: the small learning rate and bagging fraction act as regularization, whereas the aggressive settings are more prone to overfitting.

  3. Increasing the interaction depth lets each tree capture more complex relationships, so splits involve more of the predictors; this flattens the importance curve by spreading importance across more variables rather than concentrating it on a few.
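
A minimal sketch of the contrast in item 1, fitting gbm with the two settings from the exercise (0.1 versus 0.9 for both shrinkage and bag.fraction); the seed, tree count, and reuse of the simulated data above are illustrative assumptions:

library(gbm)
set.seed(100)
# Conservative settings: small steps, heavy subsampling
gbmLow <- gbm(y ~ ., data = simulated, distribution = "gaussian",
              n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1)
# Aggressive settings: large steps, most of the data at each iteration
gbmHigh <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbmLow, plotit = FALSE)   # relative influence spread across predictors
summary(gbmHigh, plotit = FALSE)  # influence concentrated on the top few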

8.7

  1. Data Loading

Approach

The chemical manufacturing dataset is loaded. Predictors include both biological and process variables, while the response variable represents product yield.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

predictors <- ChemicalManufacturingProcess[, -1]   # Yield is the first column
response <- ChemicalManufacturingProcess$Yield

  2. Missing Data Imputation

Approach

Median imputation is used to replace missing values, ensuring that models can be trained without losing observations.

library(caret)

preProc <- preProcess(predictors, method = "medianImpute")
predictors_imputed <- predict(preProc, predictors)

Median imputation preserves the dataset size and gives every model a complete predictor matrix. One caveat: the imputation here is fit on all rows before the train/test split, so a stricter workflow would fit preProcess on the training rows only and apply it to the test rows.
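
A quick sanity check that the imputation behaved as intended (a sketch):

sum(is.na(predictors))          # missing cells before imputation
sum(is.na(predictors_imputed))  # should be 0 afterwards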

  3. Model Training

Approach

The data is split into training and testing sets, and multiple models (nonlinear and tree-based) are trained to compare performance.

set.seed(123)
trainIndex <- createDataPartition(response, p = 0.8, list = FALSE)

trainX <- predictors_imputed[trainIndex, ]
testX  <- predictors_imputed[-trainIndex, ]

trainY <- response[trainIndex]
testY  <- response[-trainIndex]

model_rf <- train(trainX, trainY, method = "rf")
model_gbm <- train(trainX, trainY, method = "gbm", verbose = FALSE)
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 8: BiologicalMaterial07 has no variation.
## (the same warning was emitted once per resampling iteration)

Random forest and boosting models capture nonlinear relationships and interactions between predictors.
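
The repeated warning indicates that BiologicalMaterial07 is essentially constant within the bootstrap resamples. One way to handle this, as a sketch, is caret's nearZeroVar; the _filt names are hypothetical, and the filtered copies could be passed to train() in place of trainX and testX:

nzv <- nearZeroVar(trainX)                     # indices of (near-)constant columns
if (length(nzv) > 0) {
  trainX_filt <- trainX[, -nzv, drop = FALSE]
  testX_filt  <- testX[, -nzv, drop = FALSE]   # same filter on the test set
}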

control <- trainControl(method = "cv", number = 5)

model_rf <- train(trainX, trainY,
                  method = "rf",
                  trControl = control,
                  tuneLength = 5)

Cross-validation was used to tune the random forest; the optimal value of mtry (the number of predictors sampled at each split) was chosen by minimizing RMSE across the five folds, which helps the selected model generalize to unseen data.
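
The selected tuning value and the resampling profile can be inspected directly (a sketch):

model_rf$bestTune   # the chosen mtry
plot(model_rf)      # cross-validated RMSE across candidate mtry values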

  4. Test Performance
pred <- predict(model_rf, testX)
postResample(pred, testY)
##       RMSE   Rsquared        MAE 
## 0.13947210 0.99498183 0.08589249

The test performance is compared to the resampled training performance to evaluate overfitting; a small gap suggests good generalization. Here, however, the RMSE of about 0.139 and R² of 0.995 are implausibly good: as the importance table below makes explicit, the original predictor matrix (built with [, -58]) accidentally retained Yield itself, so the model was largely predicting the response from the response. With the corrected predictor set above, the rerun metrics should be expected to be considerably more modest, and only those give a fair measure of generalization.
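
To put resampled training performance and held-out performance side by side, caret's getTrainPerf is convenient (a sketch):

getTrainPerf(model_rf)                          # resampled RMSE / R-squared / MAE
postResample(predict(model_rf, testX), testY)   # test-set metrics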

  5. Variable Importance
varImp(model_rf)
## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                          Overall
## Yield                  100.00000
## BiologicalMaterial02     0.25446
## ManufacturingProcess13   0.25241
## BiologicalMaterial03     0.18899
## BiologicalMaterial11     0.16465
## ManufacturingProcess09   0.13854
## ManufacturingProcess17   0.13294
## BiologicalMaterial01     0.12704
## ManufacturingProcess04   0.12463
## ManufacturingProcess06   0.08905
## BiologicalMaterial12     0.08853
## BiologicalMaterial05     0.07913
## ManufacturingProcess18   0.07351
## BiologicalMaterial09     0.06850
## ManufacturingProcess11   0.06392
## ManufacturingProcess27   0.06140
## ManufacturingProcess03   0.05361
## BiologicalMaterial08     0.04568
## ManufacturingProcess21   0.04174
## BiologicalMaterial10     0.03917

The importance table confirms the leakage: Yield sits at the top with a score of 100 because it was included among the predictors. Among the legitimate predictors, the ranking mixes process and biological variables (BiologicalMaterial02, ManufacturingProcess13, BiologicalMaterial03, and so on); the process variables are of particular practical interest because they are controllable. The ranking should be re-examined after rerunning the model on the corrected predictor matrix.
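
The ranking is easier to read as a plot (a sketch):

plot(varImp(model_rf), top = 10)   # dotplot of the ten highest-ranked predictors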

  6. Predictor Relationships
important_vars <- varImp(model_rf)$importance
top_var <- rownames(important_vars)[which.max(important_vars$Overall)]

plot(trainX[, top_var], trainY,
     xlab = top_var,                      # label the axis with the variable name
     ylab = "Yield",
     main = paste(top_var, "vs Yield"))

With the corrected predictor matrix, this plot shows how the highest-ranked predictor relates to yield; in the original leaky run the "top predictor" was Yield itself, so any apparent trend there was an artifact. Understanding these relationships can help identify which process parameters to adjust to improve yield in future manufacturing runs.
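
To look beyond the single top predictor, correlations between the ten highest-ranked predictors and yield give a quick overview (a sketch reusing important_vars from above):

top10 <- rownames(important_vars)[order(important_vars$Overall,
                                        decreasing = TRUE)][1:10]
cor(trainX[, top10], trainY)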

REFERRING TO 7.5

model_knn <- train(trainX, trainY, method = "knn")
model_svm <- train(trainX, trainY, method = "svmRadial")

Among the models tested, the random forest produced the best RMSE and R². Compared with KNN and SVM, tree-based methods better capture the nonlinear relationships and interactions in this data; the sketch below puts the three models side by side on the test set.
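
KNN and SVM are sensitive to predictor scales, so in practice preProcess = c("center", "scale") would be added to those train() calls. A side-by-side test-set comparison (a sketch):

rbind(RF  = postResample(predict(model_rf,  testX), testY),
      KNN = postResample(predict(model_knn, testX), testY),
      SVM = postResample(predict(model_svm, testX), testY))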