library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
model1 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.83890885
## V2 6.49023056
## V3 0.67583163
## V4 7.58822553
## V5 2.27426009
## V6 0.17436781
## V7 0.15136583
## V8 -0.03078937
## V9 -0.02989832
## V10 -0.08529218
The random forest model did not significantly use the uninformative predictors. The variable importance scores of the informative predictors (V1-V5) range from about 0.68 (V3) to 8.84 (V1). The uninformative predictors (V6-V10) have scores ranging from about -0.09 to 0.17.
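The importance output further down includes a predictor named duplicate1, but the step that creates it is not shown in this write-up. A minimal sketch of how such a highly correlated copy of V1 could be built, assuming Gaussian noise with a scale of 0.1 (an assumption, not a value taken from this document) and using stand-in vectors rather than the `simulated` data frame:

```r
# Stand-in for simulated$V1 (the Friedman predictors are uniform on [0, 1])
set.seed(200)
v1_demo <- runif(200)

# Hypothetical construction: V1 plus a small amount of Gaussian noise.
# The 0.1 noise scale is an assumption for illustration only.
duplicate1_demo <- v1_demo + rnorm(200) * 0.1

# The two columns should be strongly, but not perfectly, correlated
cor(v1_demo, duplicate1_demo)
```

In the real analysis the new column would be attached to `simulated` before refitting the models.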
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
##
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
##
## where
model3 <- cforest(y ~ .,
                  data = simulated,
                  controls = cforest_unbiased(ntree = 500))
rfImp3 <- varimp(model3)
Using the ‘cforest’ function helps mitigate the collinearity of V1 and duplicate1 by reducing the importance score of duplicate1: in the output below it scores 2.72 against 6.57 for V1.
rfImp3
## V1 V2 V3 V4 V5 V6
## 6.571632689 6.110131511 0.010450332 7.485796796 1.889117623 -0.006751589
## V7 V8 V9 V10 duplicate1
## 0.007081381 -0.036022685 0.008976088 0.005356254 2.722380892
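party's `varimp()` also accepts `conditional = TRUE`, which permutes each predictor within groups defined by its associated covariates and is intended to give a fairer score when predictors are collinear. A small sketch on hypothetical toy data (the variables `x1`, `x2`, `x3`, the `mtry` value, and the small `ntree` are illustrative choices, not part of the analysis above):

```r
library(party)

set.seed(1)
n <- 100
x1 <- runif(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),  # near-duplicate of x1
  x3 = runif(n)                  # unrelated to the response
)
toy$y <- 10 * toy$x1 + rnorm(n)

# Small forest so the conditional permutation step stays quick
toy_forest <- cforest(y ~ ., data = toy,
                      controls = cforest_unbiased(ntree = 50, mtry = 2))

imp_marginal    <- varimp(toy_forest)
imp_conditional <- varimp(toy_forest, conditional = TRUE)
rbind(marginal = imp_marginal, conditional = imp_conditional)
```

Conditional importance is noticeably slower than the default, so it may be worth trying on a reduced `ntree` first.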
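We tried gradient boosted machines and see a similar pattern to the previous methods: variables V1-V5 are important while V6-V10 are not. The scales are different, so only ratios can be compared across models. In the output below, duplicate1 has a relative influence of 0.41 against 0.45 for V1 (about 90% of V1), whereas under cforest it scored 2.72 against 6.57 (about 41% of V1). This gbm fit uses random subsampling and no seed was set before it, so the split of influence between V1 and duplicate1 can change from run to run.
Mental note: check across several seeds whether gbm deals with the effects of collinearity better than other tree methods; the single run shown here does not demonstrate it.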
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
relative.influence(gbmModel, scale. = TRUE)
## n.trees not given. Using 100 trees.
## V1 V2 V3 V4 V5 V6 V7
## 0.45460082 0.71969471 0.26299852 1.00000000 0.33812464 0.01690042 0.00426528
## V8 V9 V10 duplicate1
## 0.00000000 0.00000000 0.00000000 0.40781275
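Because the cforest and gbm scores are on different scales, one way to compare them is the ratio of duplicate1's score to V1's score within each model. The arithmetic below uses only the values printed above:

```r
# Values copied from the printed cforest and gbm outputs
cforest_v1   <- 6.571632689
cforest_dup1 <- 2.722380892
gbm_v1       <- 0.45460082
gbm_dup1     <- 0.40781275

# Share of V1's importance that duplicate1 reaches in each model
ratios <- c(cforest = cforest_dup1 / cforest_v1,
            gbm     = gbm_dup1 / gbm_v1)
round(ratios, 2)
```

A ratio near 1 means the model treats the two collinear predictors as roughly interchangeable; a ratio near 0 means it concentrates the importance on V1.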
“trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors”
high <- sample(1:10, 100, replace = TRUE)
low <- sample(1:5, 100, replace = TRUE)
y <- jitter(low + high)
sim_data <- dplyr::bind_cols(y, high, low)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
names(sim_data) <- c('y', 'high', 'low')
head(sim_data)
## # A tibble: 6 × 3
## y high low
## <dbl> <int> <int>
## 1 9.06 4 5
## 2 5.91 4 2
## 3 4.12 1 3
## 4 12.0 8 4
## 5 5.95 4 2
## 6 7.10 2 5
We can see that the predictor with more distinct values has a much higher variable importance score in our random forest model.
I want to note that, for me, this is a counterintuitive use of the word ‘granular’. I think of granular as having more detail or more potential states, but here it appears to mean the opposite: the variable with the fewest possible distinct values.
model1 <- randomForest(y ~ .,
                       data = sim_data,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## high 15.036417
## low 3.447798
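In the simulation above, `high` also contributes more variance to `y` than `low` does, so its larger score is not purely a selection-bias effect. One way to isolate the bias is to make the response independent of every predictor and count which predictor a single tree splits on first. A sketch with `rpart` (the variable names `many` and `few`, the number of repetitions, and the control settings are illustrative assumptions):

```r
library(rpart)

set.seed(42)
n_sims <- 200
first_split <- character(n_sims)

for (i in seq_len(n_sims)) {
  # The response is pure noise, so neither predictor carries any signal
  noise <- data.frame(
    y    = rnorm(100),
    many = sample(1:50, 100, replace = TRUE),  # many distinct values
    few  = sample(1:2, 100, replace = TRUE)    # only two distinct values
  )
  # maxdepth = 1 and a negative cp force exactly one split per tree
  stump <- rpart(y ~ many + few, data = noise,
                 control = rpart.control(maxdepth = 1, cp = -1, xval = 0))
  first_split[i] <- as.character(stump$frame$var[1])
}

# How often each predictor was chosen for the root split
table(first_split)
```

With no real signal, an unbiased procedure would pick the two predictors about equally often, so a strong preference for `many` reflects its larger number of candidate split points.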
I think that with a bagging fraction of 0.9, each tree is fit to a random subsample containing most of the training data, so we are building many trees on essentially the same data set. As a result, whichever predictors are most important in that specific data set are going to be the most important in the majority of the trees.
The model with a bagging fraction of 0.1 and a learning rate of 0.1 is the one I would expect to generalize better to unseen samples. With a smaller bagging fraction, a smaller sample of the original data is used to build each tree, which reduces the variance and potentially also the bias on other samples.
This is an interesting question. I think that increasing the interaction depth will decrease the slope on the right (0.9 bagging fraction). Building deeper trees might increase the importance of other predictors in the model, resulting in a more gradual slope.
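These expectations can be explored empirically. The sketch below fits two gbm models to hypothetical simulated data (not the data behind the figure under discussion), setting the bagging fraction and learning rate to 0.1 in one and 0.9 in the other, and compares how each distributes relative influence. The data-generating formula and the helper names `fit_gbm` and `influence_share` are assumptions for illustration:

```r
library(gbm)

set.seed(100)
n <- 300
toy <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(toy) <- paste0("x", 1:6)
toy$y <- 3 * toy$x1 + 2 * toy$x2 + toy$x3 + rnorm(n)

# Fit with the bagging fraction and learning rate set to the same value
fit_gbm <- function(value, depth = 1) {
  gbm(y ~ ., data = toy, distribution = "gaussian",
      n.trees = 100, interaction.depth = depth,
      bag.fraction = value, shrinkage = value)
}

# Relative influence rescaled so that each model's scores sum to 1
influence_share <- function(fit) {
  infl <- relative.influence(fit, n.trees = 100)
  infl / sum(infl)
}

shares <- rbind(
  low  = influence_share(fit_gbm(0.1)),
  high = influence_share(fit_gbm(0.9))
)
round(shares, 3)
```

Passing `depth = 3` to `fit_gbm()` is one way to look at the interaction-depth question on the same toy data.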
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
cmp_df <- ChemicalManufacturingProcess
sum(is.na(cmp_df))
## [1] 106
We can use the ‘colSums()’ function to understand missingness by predictor. The largest number of missing values in any one column is 15 (ManufacturingProcess03). I’m going to replace the missing values with the column mean.
colSums(is.na(cmp_df))
## Yield BiologicalMaterial01 BiologicalMaterial02
## 0 0 0
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 0 0 0
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## 0 0 0
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## 0 0 0
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## 0 1 3
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## 15 1 1
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## 2 1 1
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## 0 9 10
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## 1 0 1
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## 0 0 0
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## 0 0 0
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## 0 1 1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## 1 5 5
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## 5 5 5
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## 5 5 0
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## 5 5 5
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## 5 0 0
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## 0 1 1
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## 0 0 0
## ManufacturingProcess45
## 0
library(zoo)
cmp_df <- na.aggregate(cmp_df)
sum(is.na(cmp_df))
## [1] 0
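For reference, column-mean imputation of this kind can also be written in base R. A minimal sketch on a hypothetical toy data frame (the `toy` data and its values are made up for illustration):

```r
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 4, 8))

# Replace each NA with the mean of the observed values in its column
toy[] <- lapply(toy, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

toy
sum(is.na(toy))
```

Here the missing value in `a` becomes 2 (the mean of 1 and 3) and the missing value in `b` becomes 6 (the mean of 4 and 8).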
columns_to_remove <- nearZeroVar(cmp_df)
cmp_df <- cmp_df[,-columns_to_remove]
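One caveat with negative indexing: if no near-zero-variance columns are found, the index vector is empty, and `df[, -integer(0)]` selects zero columns instead of keeping them all. A small guard avoids this. The helper name `drop_columns` and the toy data are hypothetical:

```r
# Drop the columns in idx, or return df unchanged when idx is empty
drop_columns <- function(df, idx) {
  if (length(idx) > 0) df[, -idx, drop = FALSE] else df
}

toy <- data.frame(a = 1:3, b = c(5, 5, 5), z = c(2, 4, 6))

ncol(toy[, -integer(0)])             # 0: every column is lost
ncol(drop_columns(toy, integer(0)))  # 3: nothing to drop, data kept
ncol(drop_columns(toy, 2L))          # 2: column b removed
```

In this data set the index vector is not empty, so the original line behaves as intended.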
# Set the random number seed so we can reproduce the results
set.seed(123456789)
# By default, the numbers are returned as a list. Using
# list = FALSE, a matrix of row numbers is generated.
# These samples are allocated to the training set.
trainingRows <- createDataPartition(cmp_df$Yield, p = .70, list = FALSE)
head(trainingRows)
## Resample1
## [1,] 2
## [2,] 3
## [3,] 4
## [4,] 5
## [5,] 7
## [6,] 8
# Subset the data into objects for training using
# integer sub-setting.
train <- cmp_df[trainingRows,]
train <- train |> select(-Yield)
trans <- preProcess(train, method = c("center", "scale"))
train <- as.data.frame(predict(trans, train))
yield_train <- cmp_df$Yield[trainingRows]
head(train)
## BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 2 2.0957767 1.252483 -0.1268052
## 3 2.0957767 1.252483 -0.1268052
## 4 2.0957767 1.252483 -0.1268052
## 5 1.3778129 1.837914 1.0949364
## 7 1.3911085 2.120707 1.1359172
## 8 0.6731447 1.904892 1.0462716
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 2 1.1800471 0.3587809 1.089012
## 3 1.1800471 0.3587809 1.089012
## 4 1.1800471 0.3587809 1.089012
## 5 0.8504608 -0.3993435 1.496600
## 7 0.7458303 -0.5039124 1.440288
## 8 1.7293575 0.3901516 1.512689
## BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 2 2.311820 -0.7707901 1.0177401
## 3 2.311820 -0.7707901 1.0177401
## 4 2.311820 -0.7707901 1.0177401
## 5 1.088856 -0.1581108 0.3809166
## 7 1.834566 0.2094968 0.3653844
## 8 2.028451 0.6506259 1.6234990
## BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 2 1.38641812 1.1472202 -5.4714537
## 3 1.38641812 1.1472202 -5.4714537
## 4 1.38641812 1.1472202 -5.4714537
## 5 0.12116426 1.1472202 -0.2242367
## 7 0.01677037 0.7370656 0.1680786
## 8 1.49707564 1.6940931 0.4132757
## ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 2 -1.969428 -0.008089843 -2.3181036
## 3 -1.969428 -0.008089843 -3.1108362
## 4 -1.969428 -0.008089843 -3.2693827
## 5 -1.969428 -0.008089843 -2.1595570
## 7 -1.969428 1.016858040 0.2186408
## 8 -1.969428 0.515287799 -0.4155453
## ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 2 0.9688887 0.8891570 -1.0286226
## 3 0.0417505 -0.1547302 0.9643337
## 4 0.3983421 2.0770287 -1.0286226
## 5 0.8165268 -0.6586758 0.9643337
## 7 -0.4347856 0.8891570 -1.0286226
## 8 0.2783977 1.5010909 0.9643337
## ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 2 0.8604992 0.5591263 -0.02499045
## 3 0.8604992 -0.4168261 -0.02499045
## 4 -1.1527442 -0.5144214 -0.02499045
## 5 0.8604992 -0.4883960 -0.02499045
## 7 0.8604992 2.3743977 3.10870450
## 8 0.8604992 1.9319660 1.29654056
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 2 -0.006996376 -0.4879186 -0.4456372
## 3 -0.006996376 -0.4879186 0.3100416
## 4 -0.006996376 -0.4879186 0.3100416
## 5 -0.006996376 -0.4879186 0.1211219
## 7 3.019944677 -0.4879186 -1.9569949
## 8 2.733635723 -0.4879186 -0.8234766
## ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 2 0.278423622 0.9457981 0.3928445
## 3 0.440595210 0.8079242 0.3928445
## 4 0.782957451 1.0664377 0.6854134
## 5 2.494768657 3.3241223 2.2782886
## 7 -1.955940478 -0.7086884 -1.7364069
## 8 0.008137642 1.1181404 0.5391290
## ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 2 -0.2144985 0.6185123 1.4529711
## 3 0.4094971 0.8268674 1.0426138
## 4 0.4094971 0.7226898 0.9346251
## 5 -0.2924979 1.0143870 1.5609598
## 7 -0.3704974 -1.6525586 -0.3612397
## 8 -0.5264963 -1.4858745 -0.1668600
## ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 2 0.7597951 0.2744285 -0.68368241
## 3 0.7597951 0.2744285 -0.38090877
## 4 0.5570959 0.2744285 -0.07813513
## 5 1.5300521 -0.7018171 0.83018578
## 7 -1.2469271 2.2269197 -1.28922968
## 8 -0.6388295 0.2744285 -0.98645605
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 2 -1.7829988 -0.9201897 0.3224479
## 3 -1.1918441 -0.7459859 1.0168982
## 4 -0.6006894 -0.5717820 0.8928892
## 5 0.5816199 1.6928682 1.8353575
## 7 -1.1918441 -1.2685975 -1.5128850
## 8 -0.6006894 -1.0943936 -1.2400652
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 2 1.2489958 0.91027573 0.8234824
## 3 1.4505103 1.06428140 0.8034346
## 4 1.3385578 0.91027573 0.8034346
## 5 2.2341779 2.09831946 0.8435302
## 7 0.1294708 -0.36577125 0.8234824
## 8 0.1742518 -0.05775991 0.8034346
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 2 2.035497 1.0357345 -1.3228383
## 3 1.874264 0.2846139 -0.8912206
## 4 1.874264 0.2846139 -0.8912206
## 5 2.357963 -0.3162826 -0.8192844
## 7 1.713030 2.9886481 -2.1141372
## 8 1.713030 2.5379758 -1.8983284
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 2 1.9201778 1.0509993 1.9873701
## 3 2.6579069 1.0509993 1.9873701
## 4 2.2890423 1.9190132 0.1249846
## 5 2.2890423 2.7870272 0.1249846
## 7 0.0758552 0.6169924 0.1249846
## 8 0.4447197 0.6169924 0.1249846
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 2 1.05962903 -0.735134 2.14246805
## 3 1.15314672 -1.905827 -0.68145550
## 4 -0.06258329 -1.905827 0.40466895
## 5 -2.68107870 -3.076521 -1.76757994
## 7 -2.02645485 -0.735134 -0.46423061
## 8 -1.74590177 -0.735134 -0.02978083
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 2 -0.6760464 0.2356983 1.9819345
## 3 -0.6760464 0.2356983 -0.5004885
## 4 -0.6760464 0.2356983 -0.5004885
## 5 -0.6760464 0.3000740 -0.5004885
## 7 -0.6760464 0.3000740 -0.5004885
## 8 -0.6760464 0.3000740 -0.5004885
## ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 2 2.0995668 0.02815815 -0.03692521
## 3 -0.4781192 0.42096439 0.06482425
## 4 -0.4781192 -0.19006753 0.16657371
## 5 -0.4781192 -0.01548698 0.16657371
## 7 -0.4781192 0.29002897 -0.24042412
## 8 -0.4781192 0.15909356 -0.13867466
## ManufacturingProcess44 ManufacturingProcess45
## 2 0.30054609 0.15921800
## 3 0.03623605 0.37150866
## 4 0.03623605 -0.05307267
## 5 -0.22807398 -0.05307267
## 7 0.56485612 0.15921800
## 8 0.56485612 0.15921800
# Do the same for the test set using negative integers.
test <- cmp_df[-trainingRows,]
test <- test |> select(-Yield)
trans <- preProcess(test, method = c("center", "scale"))
test <- as.data.frame(predict(trans, test))
yield_test <- cmp_df$Yield[-trainingRows]
head(test)
## BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 -0.1755184 -1.3793558 -2.40754671
## 6 -0.3862653 0.8007961 -0.41775223
## 9 0.9430609 2.1019346 1.19269294
## 15 -0.2079410 1.9355677 0.63917697
## 18 2.0778515 1.0689697 0.03794411
## 21 1.8671047 1.1434624 0.05225918
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 0.3542804 0.651668 -1.2603016
## 6 2.0820641 2.014069 0.7176365
## 9 2.4589059 0.597889 1.6380370
## 15 -0.3140803 1.267139 1.5984783
## 18 2.4375752 1.840781 0.9259793
## 21 2.4731263 2.020044 0.9998224
## BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 1 -1.211512 -3.17093088 1.389593
## 6 1.136085 -1.58036748 2.054042
## 9 1.923388 0.72479687 2.199390
## 15 1.980647 0.01019592 -1.019035
## 18 1.737298 0.10240249 2.843075
## 21 1.866130 0.37902222 3.009187
## BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 1 -1.758777 -1.5392922 -0.10566420
## 6 1.035161 0.6717192 0.59433809
## 9 1.505207 1.4621844 0.59433809
## 15 1.227637 2.2984736 -0.02390474
## 18 1.973228 1.4850964 0.94761971
## 21 2.348049 1.8860570 1.21258092
## ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 1 -0.002005454 0.01550889 -0.08922738
## 6 -2.008492079 0.01550889 -1.36399595
## 9 -2.008492079 0.83950282 -0.71455053
## 15 -2.008492079 0.01550889 -0.38982783
## 18 -2.008492079 0.01550889 -0.38982783
## 21 -2.008492079 0.83950282 -1.20163459
## ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 1 0.04971722 0.1250541 0.1764062
## 6 0.55858429 0.7382875 1.2411437
## 9 0.11058032 0.6564290 -0.8064284
## 15 4.05840476 0.2880655 -0.8064284
## 18 0.71353303 0.5336412 -0.8064284
## 21 3.45545206 -0.2030857 1.2411437
## ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 1 0.08726288 -1.6101083 0.06710068
## 6 0.97879500 -0.1392355 0.06710068
## 9 -1.02143731 1.0526787 0.82638896
## 15 -1.02143731 0.5708411 0.38914364
## 18 -1.02143731 0.1207033 0.53489208
## 21 0.97879500 -0.2026352 -0.19385013
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 1 0.01681627 0.03113167 0.9982485
## 6 0.01681627 -0.46351599 -0.6549019
## 9 2.49064599 -0.46351599 -0.7651119
## 15 1.19201577 -0.46351599 0.7778284
## 18 1.76918475 -0.46351599 0.9982485
## 21 0.90343127 -0.46351599 0.6676184
## ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 1 0.8324958 1.209825 0.2823790
## 6 2.5025585 3.126916 0.4449408
## 9 0.7365152 1.815222 0.2448647
## 15 1.2164182 2.017021 0.2948837
## 18 0.8516919 1.462074 0.2980099
## 21 1.8114981 2.101104 0.4136789
## ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 1 0.8562408 0.18038449 0.5551000
## 6 -0.9558037 0.17591996 2.0540011
## 9 -0.5243645 0.03751929 0.1917301
## 15 0.6836651 0.15657363 1.5543674
## 18 0.8562408 0.10002281 0.3734150
## 21 0.5973773 0.18931357 1.8950268
## ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 1 0.2715747 0.09499888 -0.1044680
## 6 0.3202221 -0.56366000 0.9619927
## 9 0.1005240 0.09499888 -0.8182676
## 15 0.2417585 0.09499888 1.5554127
## 18 0.2260658 0.09499888 -1.1149777
## 21 0.3343456 0.09499888 -0.2248475
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 1 -0.001520213 -0.2299603 0.16487392
## 6 -1.269930888 -1.5985493 0.16190785
## 9 -0.012299907 -1.2491648 0.11889974
## 15 -0.012299907 0.8471423 0.16635696
## 18 -1.269930888 -1.4238571 0.07885771
## 21 0.616515583 -0.8997803 0.17377215
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 1 0.1707123 0.2895910 0.9717201
## 6 0.2350214 0.2927257 1.1000139
## 9 0.2052487 0.1704753 1.0816862
## 15 0.2314487 0.2488409 1.0816862
## 18 0.1480850 0.2049562 0.9900478
## 21 0.2219214 0.3068315 1.0450308
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 1 0.4514727 0.5923419 -0.02438267
## 6 0.6963849 0.7307769 -0.11567137
## 9 0.6264100 1.0076468 -0.13595775
## 15 0.6613974 0.5923419 -0.08524180
## 18 0.4164853 1.0768643 -0.03452586
## 21 0.6264100 0.5923419 -0.06495542
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 1 -0.4040165 0.9122196 -1.7476413
## 6 2.7566080 2.3496567 0.1069984
## 9 0.3396599 0.5528604 0.1069984
## 15 0.3396599 0.1935011 0.1069984
## 18 -0.2180974 0.1935011 0.1069984
## 21 -0.2180974 -0.1658581 0.1069984
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 1 -0.6869526 -0.5069243 -1.2503431
## 6 -0.2965533 -1.6530140 -1.4938557
## 9 -0.1989535 -0.5069243 0.4542445
## 15 -0.2965533 -0.5069243 1.4282946
## 18 0.3866454 0.6391654 0.2107320
## 21 0.1914458 0.6391654 -0.2762931
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 1 0.6901983 0.2200479 0.1810502
## 6 -1.4209964 0.2200479 -0.3685787
## 9 0.6901983 0.3630791 -0.3685787
## 15 0.6901983 0.2200479 -0.3685787
## 18 0.6901983 0.3630791 -0.3685787
## 21 0.6901983 0.2915635 2.7341653
## ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 1 0.2466006 -0.07204426 4.28571934
## 6 -0.3503305 -0.60723023 2.68817521
## 9 -0.3503305 -1.14241620 0.09216601
## 15 -0.3503305 -0.07204426 -0.10752701
## 18 -0.3503305 0.46314170 1.68971013
## 21 4.6840274 -1.14241620 2.28878918
## ManufacturingProcess44 ManufacturingProcess45
## 1 -0.5717719 1.2985131
## 6 -0.5717719 -0.9522429
## 9 0.5717719 -0.3895539
## 15 0.5717719 -0.9522429
## 18 -0.5717719 0.1731351
## 21 -2.8588594 -0.9522429
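Note that the code above fits a separate `preProcess()` object on the test set, so the test predictors are centered and scaled with their own means and standard deviations rather than the training ones. A common alternative is to estimate the transformation on the training predictors only and apply it to both sets. A minimal sketch on hypothetical toy data (the `toy_*` names and values are illustrative):

```r
library(caret)

set.seed(1)
toy_train <- data.frame(x1 = rnorm(50, mean = 10, sd = 2),
                        x2 = rnorm(50, mean = -3, sd = 5))
toy_test  <- data.frame(x1 = rnorm(20, mean = 10, sd = 2),
                        x2 = rnorm(20, mean = -3, sd = 5))

# Estimate centering and scaling from the training predictors only
toy_trans <- preProcess(toy_train, method = c("center", "scale"))

# Apply the same training-based transformation to both sets
toy_train_scaled <- predict(toy_trans, toy_train)
toy_test_scaled  <- predict(toy_trans, toy_test)

# Training columns have mean 0; test columns generally do not
round(colMeans(toy_train_scaled), 3)
round(colMeans(toy_test_scaled), 3)
```

Applied to this analysis, that would mean reusing the `trans` object fitted on `train` when transforming `test`, which would change the test-set values and the RMSE figures reported below.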
Random Forest gives me the best test-set performance of the three models, with an RMSE of about 1.29, compared with 1.33 for Gradient Boosting and 1.93 for Conditional Random Forest.
model1 <- randomForest(yield_train ~ .,
                       data = train,
                       importance = TRUE,
                       ntree = 1000,
                       keep.forest = TRUE)
rf_pred <- predict(model1, test)
RMSE = mean((yield_test - rf_pred)^2) %>% sqrt()
RMSE
## [1] 1.285018
gbmModel <- gbm(yield_train ~ ., data = train, distribution = "gaussian")
gbm_pred <- predict(gbmModel, test)
## Using 100 trees...
RMSE = mean((yield_test - gbm_pred)^2) %>% sqrt()
RMSE
## [1] 1.327406
model3 <- cforest(yield_train ~ .,
                  data = train,
                  controls = cforest_unbiased(ntree = 500))
crf_pred <- predict(model3, test, type = "response", OOB = TRUE)
RMSE = mean((yield_test - crf_pred[1:52])^2) %>% sqrt()
RMSE
## [1] 1.932903
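The same RMSE formula is repeated for each model above. A small helper keeps the comparison consistent. This is a sketch, and the function name `rmse` is an illustrative choice:

```r
# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) {
  # as.numeric() drops matrix dimensions, e.g. from a one-column prediction matrix
  sqrt(mean((as.numeric(observed) - as.numeric(predicted))^2))
}

# Squared errors are 0, 0 and 4, so the result is sqrt(4 / 3)
rmse(c(1, 2, 3), c(1, 2, 5))
```

With this helper, each model's score could be computed as `rmse(yield_test, rf_pred)`, `rmse(yield_test, gbm_pred)`, and so on.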
The results are interesting. Compared to the top 10 predictors for our linear and nonlinear models, there are more Biological Materials in our top 10. We still see Manufacturing Process 32 at the top of the list, but Biological Material 12 in second place is a new addition.
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1 |> arrange(desc(Overall)) |> slice_head(n = 10)
## Overall
## ManufacturingProcess32 0.8726530
## BiologicalMaterial12 0.3566173
## ManufacturingProcess13 0.3117625
## ManufacturingProcess36 0.2413422
## ManufacturingProcess31 0.2227998
## ManufacturingProcess09 0.2187102
## BiologicalMaterial11 0.1996627
## BiologicalMaterial06 0.1651863
## ManufacturingProcess17 0.1643580
## BiologicalMaterial03 0.1636269
The column names in our data set are too verbose to visualize in a tree. I’m going to rename the columns, retrain our model, and then visualize a tree from the retrained forest.
This tree is so big that it’s a challenge to interpret. I think it’s interesting that, with all the previous focus we have had on the manufacturing process, one of the first splits is based on Biological Material 12. This would drastically reshape my understanding of product yield: the most essential split I am making is based on a biological predictor and not on my process.
names(train) <- gsub("ManufacturingProcess", "MP", names(train), perl = TRUE)
names(train) <- gsub("BiologicalMaterial", "BM", names(train), perl = TRUE)
train <- round(train, 2)
model1 <- randomForest(yield_train ~ .,
                       data = train,
                       importance = TRUE,
                       ntree = 1000,
                       keep.forest = TRUE)
library(reprtree)
## Loading required package: tree
## Loading required package: plotrix
## Registered S3 method overwritten by 'reprtree':
## method from
## text.tree tree
reprtree:::plot.getTree(model1)