data("permeability")
HW7DATA624
Assignment
We are required to complete questions 6.2 and 6.3 from chapter 6 of “Applied Predictive Modeling” by Max Kuhn and Kjell Johnson.
6.2
A
Below I load the permeability data set which contains the matrix “fingerprints” and a matrix for permeability.
Fingerprints contains binary molecular predictors for 165 compounds and permeability contains the permeability response.
nrow(fingerprints)
[1] 165
ncol(fingerprints)
[1] 1107
nrow(permeability)
[1] 165
B
We are tasked with filtering out predictors that have low frequencies since many of the substructures for each molecule are sparse.
length(nearZeroVar(fingerprints))
[1] 719
We can see from the above that 719 of the 1107 columns in fingerprints have near zero variance based on the default conditions of the function, which is a frequency ratio of the most common value to second most common value of at least 95/5, and a percentage of unique values out of the total number of data points of at least ten. For example, here is what these statistics look like for the first few variables:
head(nearZeroVar(fingerprints, saveMetrics = TRUE), n=10)
freqRatio percentUnique zeroVar nzv
X1 3.342105 1.2121212 FALSE FALSE
X2 3.459459 1.2121212 FALSE FALSE
X3 3.714286 1.2121212 FALSE FALSE
X4 3.714286 1.2121212 FALSE FALSE
X5 3.714286 1.2121212 FALSE FALSE
X6 1.229730 1.2121212 FALSE FALSE
X7 0.000000 0.6060606 TRUE TRUE
X8 0.000000 0.6060606 TRUE TRUE
X9 54.000000 1.2121212 FALSE TRUE
X10 40.250000 1.2121212 FALSE TRUE
Below, I filter out the predictors with near zero variance. This leaves 388 predictors for modeling.
<- nearZeroVar(fingerprints)
nzv_fingerprint <- fingerprints[,-nzv_fingerprint]
fingerprints_filtered ncol(fingerprints_filtered)
[1] 388
C
In this section we are tasked with splitting the data into training and test sets, pre-processing the data, and tuning a PLS model.
First, I combine the 388 predictors from above and the permeability matrix into one matrix.
<- as.data.frame(cbind(fingerprints_filtered, permeability))
modeling_data <- modeling_data |> select(permeability, everything()) modeling_data
Next, I split the data into a train and test set via a 80%/20% split.
set.seed(10)
<- createDataPartition(modeling_data$permeability, p = 0.8, list = FALSE) trainIndex
<- modeling_data[trainIndex,]
trainData <- modeling_data[-trainIndex,] testData
Next, I tune a PLS model on the training set. I center and scale the values using the preProcess function.
set.seed(100)
<- trainControl(method = 'CV', number = 10)
fitControl
<- train(permeability ~ ., data = trainData,
plsFit method = 'pls',
tuneLength = 20,
preProcess = c('center','scale'),
trControl = fitControl)
plsFit
Partial Least Squares
133 samples
388 predictors
Pre-processing: centered (388), scaled (388)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 13.09303 0.3751150 9.955400
2 11.93669 0.4749213 8.426133
3 11.42380 0.5213586 8.536021
4 11.15017 0.5479451 8.634503
5 11.08069 0.5518608 8.428915
6 11.00353 0.5662611 8.441985
7 10.90816 0.5692303 8.401248
8 10.97174 0.5612605 8.490543
9 11.19584 0.5590258 8.533019
10 11.32924 0.5622410 8.720479
11 11.81544 0.5372975 9.012913
12 12.18829 0.5127903 9.470607
13 12.55119 0.4971668 9.533206
14 12.83874 0.4776451 9.752862
15 12.90284 0.4737939 9.623939
16 13.25394 0.4558385 9.863059
17 13.29940 0.4590803 9.754005
18 13.47374 0.4547626 9.851230
19 13.37292 0.4628417 9.691664
20 13.63080 0.4468999 9.829598
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 7.
Based on the above, the optimal number of latent variables is 7.
plot(plsFit)
summary(plsFit)
Data: X dimension: 133 388
Y dimension: 133 1
Fit method: oscorespls
Number of components considered: 7
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 20.32 35.18 40.50 45.08 49.40 57.76 62.36
.outcome 36.47 52.82 62.12 69.35 75.31 77.53 79.50
The average R squared from the re-sampling process is:
mean(plsFit$resample$Rsquared)
[1] 0.5692303
D
Now I use the model to predict the responses for the test set:
<-predict(plsFit, newdata = testData) predictions
The test set estimate of R squared is:
postResample(pred = predictions, obs = testData$permeability)
RMSE Rsquared MAE
13.7668596 0.1969428 10.3091999
This Rsquared is low, suggesting the model does not do a good job of predicting permeability.
<- data.frame(actual = testData$permeability, predicted = predictions)
results
ggplot(results, aes(x = actual, y = predicted)) + geom_point()+
labs(title = "Predicted vs. Actual Values",
x = "Actual Values",
y = "Predicted Values") +
theme_minimal()
E
Next, I try training and predicting with a ridge regression model.
<-data.frame(.lambda = seq(0.01, .2, length=20))
ridgeGrid
set.seed(100)
<- train(permeability ~., data=trainData,
ridgeFit method = 'ridge',
tuneGrid = ridgeGrid,
trControl = fitControl,
preProcess = c('center','scale'))
ridgeFit
Ridge Regression
133 samples
388 predictors
Pre-processing: centered (388), scaled (388)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.01 13.37438 0.4507224 9.446142
0.02 12.93139 0.4788009 9.229718
0.03 12.63829 0.4931569 9.149707
0.04 12.50602 0.5018970 9.101594
0.05 12.39371 0.5091283 9.060586
0.06 12.31765 0.5148716 9.030316
0.07 12.23757 0.5204122 8.991373
0.08 12.19520 0.5245518 8.972502
0.09 12.15088 0.5285588 8.956560
0.10 12.12592 0.5319177 8.951619
0.11 12.09949 0.5351551 8.949193
0.12 12.08405 0.5379754 8.956057
0.13 12.07515 0.5406405 8.965868
0.14 12.07091 0.5429374 8.977723
0.15 12.08750 0.5442211 9.001394
0.16 12.07566 0.5472016 9.004834
0.17 12.07000 0.5494009 9.023342
0.18 12.09274 0.5502725 9.046279
0.19 12.08507 0.5530325 9.056521
0.20 12.10620 0.5545031 9.088196
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was lambda = 0.17.
The optimal lambda value is 0.17
plot(ridgeFit)
Next, I try predicting values via the Ridge regression model
<-predict(ridgeFit, newdata = testData)
predictionsRidge
postResample(pred = predictionsRidge, obs = testData$permeability)
RMSE Rsquared MAE
14.2419221 0.2340469 10.1707615
The R squared is slightly higher, which may suggest a better model, though it is still low and thus not a very accurate model.
<- data.frame(actual = testData$permeability, predicted = predictionsRidge)
resultsRidge
ggplot(resultsRidge, aes(x = actual, y = predicted)) + geom_point()+
labs(title = "Predicted vs. Actual Values",
x = "Actual Values",
y = "Predicted Values") +
theme_minimal()
F
Both of the models above seem to fit poorly and thus I would not recommend them for the permeability experiment.
6.3
A
Below, I load the data
data("ChemicalManufacturingProcess")
B
There are missing values in the data set:
sum(is.na(ChemicalManufacturingProcess))
[1] 106
Below I impute missing values using the k-nearest neighbors approach. I will use the KNN function from the VIM package. The function adds columns which state whether a value was imputed to the right of the original data. I will drop these columns from the data frame.
<- kNN(ChemicalManufacturingProcess)
chem_data <- chem_data[,1:ncol(ChemicalManufacturingProcess)] chem_data
C
Next, I split the data into a training and test set via a 80%/20% split.
set.seed(10)
<- createDataPartition(chem_data$Yield, p = 0.8, list = FALSE)
chem_trainIndex
<- chem_data[chem_trainIndex,]
chem_trainData <- chem_data[-chem_trainIndex,] chem_testData
Now I will tune a Lasso regression model to the data for two reasons. One, I have not yet used this model type and two, Lasso allows weights of variables that are not relevant to the independent variable to go to zero whereas in Ridge regression, these values can go near zero but not all the way to it. I will center and scale the data in the model tuning.
set.seed(150)
<- trainControl(method = 'CV', number = 10)
chem_fitControl
<- train(Yield ~ ., data = chem_trainData,
chem_fit method = 'glmnet',
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, lengt = 10)),
preProcess = c('center','scale'),
trControl = chem_fitControl)
chem_fit
glmnet
144 samples
57 predictor
Pre-processing: centered (57), scaled (57)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 130, 131, 131, 129, 130, 130, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.001 8.277536 0.4467059 3.4016057
0.112 1.247332 0.5877934 0.9888373
0.223 1.383448 0.5398664 1.0446488
0.334 1.296413 0.5726856 1.0369149
0.445 1.341709 0.5893853 1.0745626
0.556 1.414385 0.5801993 1.1338832
0.667 1.496858 0.5630656 1.2002924
0.778 1.586499 0.5342690 1.2712958
0.889 1.666759 0.5028704 1.3429808
1.000 1.740322 0.4633349 1.4112292
Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.112.
plot(chem_fit)
The optimal lambda value was found to be 0.112.
D
Next, I try predicting the test set:
<-predict(chem_fit, newdata = chem_testData)
chem_predictions
postResample(pred = chem_predictions, obs = chem_testData$Yield)
RMSE Rsquared MAE
1.1114012 0.5332645 0.9400514
The R sqaured is 0.53 and the RMSE is 1.11, which is slightly lower than the RMSE for the training set at a lambda of 0.112.
E
Below are the predictors ranked in order of importance. This function scales the importance scores to be between 0 and 100.
varImp(chem_fit)
glmnet variable importance
only 20 most important variables shown (out of 57)
Overall
ManufacturingProcess32 100.0000
ManufacturingProcess09 59.3054
ManufacturingProcess13 26.4793
ManufacturingProcess17 25.1969
ManufacturingProcess45 19.0773
ManufacturingProcess15 14.4978
ManufacturingProcess06 13.6682
ManufacturingProcess37 12.5137
ManufacturingProcess36 10.8643
BiologicalMaterial03 8.8566
BiologicalMaterial05 8.4810
ManufacturingProcess34 5.8699
ManufacturingProcess07 0.9961
BiologicalMaterial07 0.7014
ManufacturingProcess39 0.3192
ManufacturingProcess10 0.0000
BiologicalMaterial01 0.0000
ManufacturingProcess11 0.0000
ManufacturingProcess40 0.0000
BiologicalMaterial06 0.0000
plot(varImp(chem_fit), top = 20)
The manufacturing process predictors dominate the list.
F
Below I explore the relationship between the top predictors and the response variable (yield):
<- varImp(chem_fit)
top_predictors <- top_predictors$importance
top_predictors <- top_predictors |> filter(Overall > 0)
top_predictors
<- rownames(top_predictors)
filter
<- chem_data |> select(Yield, all_of(filter))
relationship_df
ggpairs(relationship_df)
plot(varImp(chem_fit), top = 20)
From the above plot, we can see that ManufacturingProcess32 has the most impact on the Yield. In the ggpairs plot, we can see it has a statistically significant positive correlation of 0.608 to yield. The importance of ManufacturingProcess32 can provide insight to factory management and help them optimize the process and allocate resources to that process at the expense of less important processes, such as Process 24, 19, etc.