Data 624 - Predictive Analytics

Chapter 8

8.1 Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

a. Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6–V10)?

rfImp1
##         Overall
## V1   8.83890885
## V2   6.49023056
## V3   0.67583163
## V4   7.58822553
## V5   2.27426009
## V6   0.17436781
## V7   0.15136583
## V8  -0.03078937
## V9  -0.02989832
## V10 -0.08529218

No, it did not. The predictors V6–V10 have very small importance scores compared with the informative predictors such as V1, V2, V4, and V5.

b. Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          6.29780744
## V2          6.08038134
## V3          0.58410718
## V4          6.93924427
## V5          2.03104094
## V6          0.07947642
## V7         -0.02566414
## V8         -0.11007435
## V9         -0.08839463
## V10        -0.00715093
## duplicate1  3.56411581

Yes, it did. The importance of V1 is reduced, split between V1 and duplicate1. This is because the trees in the random forest are about as likely to pick duplicate1 for a split as V1, since the two are so highly correlated. As a result, the splits (and importance) that originally belonged to V1 alone are now “shared” with duplicate1.
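
The exercise also asks what happens when yet another predictor highly correlated with V1 is added. That fit is not shown above, but a sketch of the check is below (duplicate2 and sim2 are illustrative names; a copy of the data is used so the later fits are unaffected). The expectation is that V1’s importance is diluted even further, now shared among three correlated predictors.

# Sketch (not run above): add a second predictor correlated with V1
sim2 <- simulated
sim2$duplicate2 <- sim2$V1 + rnorm(200) * .1

model3 <- randomForest(y ~ ., data = sim2,
                       importance = TRUE, ntree = 1000)
varImp(model3, scale = FALSE)   # expect V1's importance to drop further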

c. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
library(dplyr)
cforestModel <- cforest(y ~ ., data=simulated)

varimp(cforestModel) %>% sort(decreasing = T)
##           V4           V1           V2   duplicate1           V5           V3 
##  7.485796796  6.571632689  6.110131511  2.722380892  1.889117623  0.010450332 
##           V9           V7          V10           V6           V8 
##  0.008976088  0.007081381  0.005356254 -0.006751589 -0.036022685
varimp(cforestModel, conditional=T) %>% sort(decreasing = T)
##          V4          V2          V1          V5  duplicate1          V6 
## 6.109465803 4.807391436 3.077532075 1.253301514 0.910467826 0.012782225 
##          V9          V3         V10          V7          V8 
## 0.009321716 0.009202518 0.003628121 0.002904234 0.002410231

The importance scores calculated with conditional set to TRUE are somewhat different from the traditional measure. The uninformative predictors (V6–V10) are still rated as having low importance. The importances of the other predictors are reduced, with V1 and its highly correlated copy duplicate1 reduced the most. Both measures rate V4 as the most important predictor, followed by V2.

d. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

For the gradient boosted trees model, I used the gbm function. The model is built using its default input parameters. The summary function can be used to get the relative influence of the predictors on the model.

library(gbm)
gbmModel <- gbm(y ~ ., data=simulated, distribution='gaussian')
summary(gbmModel)

##                   var    rel.inf
## V4                 V4 30.2787835
## V2                 V2 22.4240399
## V1                 V1 19.5630117
## V5                 V5 11.1666898
## V3                 V3  8.3490611
## duplicate1 duplicate1  7.7807889
## V7                 V7  0.2851571
## V6                 V6  0.1524680
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000

The same pattern occurs: the uninformative predictors V6–V10 are again rated very low, V4 is still the top-rated predictor, and V2 follows. The correlated predictors V1 and duplicate1, however, show a larger gap in their rated importance; the GBM model treats V1 as much more important than duplicate1. This is because the trees in boosting are dependent on each other and have correlated structures, so many of the same predictors are selected across the trees. As a result, the contributions of those predictors to the importance metric look more “stretched” than in other tree models.

For the Cubist model, I used the cubist function, setting committees to 100 to match the default number of trees in the gbm model. The varImp function calculates variable importance for a Cubist model.

library(Cubist)
# the negative column index drops only the response column "y"
cubistModel <- cubist(x=simulated[,-(ncol(simulated)-1)], y=simulated$y, committees=100)
varImp(cubistModel)
##            Overall
## V1            64.5
## V3            41.0
## V2            60.0
## V4            48.0
## V5            31.0
## V6             9.0
## duplicate1     6.0
## V8             2.0
## V10            0.5
## V7             0.0
## V9             0.0

Again, the uninformative predictors V6–V10 are rated low in importance. This time, however, the top-rated predictor is V1 rather than V4 (with V2 a close second), which differs from the random forest and GBM models. For V1 and its correlated copy duplicate1, the Cubist model, like the GBM model, rates V1 far higher than duplicate1.

8.2 Use a simulation to show tree bias with different granularities.

Tree models suffer from a selection bias: predictors with a larger number of distinct values are favored over predictors with coarser granularity (fewer distinct values).

Below, I simulate four variables using the sample function with replacement. All four variables take values between 0 and 1, but with decreasing granularity: x1 is drawn from 10,001 distinct values, x2 from 1,001, x3 from 101, and x4 from only 11.

The target variable y is created by adding x1 and x4 plus a random error term; x2 and x3 are not used to generate the target.

set.seed(1)
x1 <- sample(0:10000 / 10000, 200, replace = T)
x2 <- sample(0:1000 / 1000, 200, replace = T)
x3 <- sample(0:100 / 100, 200, replace = T)
x4 <- sample(0:10 / 10, 200, replace = T)

y <- x1 + x4 + rnorm(200)

df <- data.frame(x1, x2, x3, x4, y)
str(df)
## 'data.frame':    200 obs. of  5 variables:
##  $ x1: num  0.102 0.8 0.477 0.972 0.846 ...
##  $ x2: num  0.126 0.809 0.571 0.999 0.96 0.283 0.333 0.03 0.267 0.092 ...
##  $ x3: num  0.44 0.89 0.39 0.65 0.4 0.07 0.24 0.94 0.19 0.83 ...
##  $ x4: num  0.8 0 0 0.9 0.7 0.2 0 0.3 0.4 0.6 ...
##  $ y : num  0.92 2.119 0.412 1.172 2.083 ...

Below, a regression tree is fitted using rpart, and the variable importance is calculated using varImp.

library(rpart)

rpartTree <- rpart(y ~ ., data=df)
varImp(rpartTree)
##      Overall
## x1 0.7443663
## x2 0.7563594
## x3 0.5903005
## x4 0.4806487

As the importance scores show, the tree relies most heavily on x1 and x2 for its splits and least on x4. Interestingly, x2 and x3 are used even though they played no role in generating the target, while x4, which did, receives the lowest importance. x2 and x3 both have far more distinct values than x4. This demonstrates the selection bias of tree models: predictors with more distinct values are favored.
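
For comparison, conditional inference trees (the party package used in 8.1c) select split variables with statistical tests and are designed to reduce this granularity bias. A quick check on the same simulated data could look like this (a sketch, not run above):

# Sketch (not run above): conditional inference forest on the same data
library(party)
cfGranularity <- cforest(y ~ ., data = df)
sort(varimp(cfGranularity), decreasing = TRUE)   # x1 and x4 would be expected to rank highest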

8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

a. Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Gradient boosting makes predictions by adding up the predictions of each successive tree. The trees are dependent on one another: each tree is built on the errors of the previous trees, so they have correlated structures. Suppose one tree splits on a particular feature frequently; the next tree, built on its errors, will likely split on that feature frequently as well. As a result, many of the same predictors are selected across the trees, increasing their contribution to the importance metric.

The learning rate controls the fraction of each tree’s prediction that is added to the ensemble. A higher learning rate means a larger fraction of each tree’s predictions is added to the final prediction, which effectively strengthens the dependence/correlation structure: more of the same predictors are selected across the trees. This is why the right-hand plot concentrates its importance on just the first few predictors and looks so steep.

The bagging fraction is the fraction of the training data used in each iteration. With a small bagging fraction, say 0.1, only 10% of the data is randomly sampled at each iteration, so each tree may be built on a very different dataset and the trees will split quite differently from one another. In contrast, with a large bagging fraction, say 0.9, the trees see essentially the same dataset at each iteration and will likely split similarly. A larger bagging fraction therefore increases the dependent/correlated structure among the boosted trees, which is another reason the right-hand plot concentrates its importance on just the first few predictors.
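
To reproduce the two extremes of Figure 8.24, the learning rate (shrinkage) and bagging fraction can be set directly in gbm. A minimal sketch on the solubility data (not run above; n.trees = 100 is an illustrative choice):

# Sketch (not run above): the two extreme settings from Fig. 8.24
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
solTrain <- data.frame(solTrainXtrans, Solubility = solTrainY)

gbmLow  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

summary(gbmLow)    # importance expected to spread across many predictors
summary(gbmHigh)   # importance expected to concentrate in the first few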

b. Which model do you think would be more predictive of other samples?

The learning rate and bagging fraction are important parameters for controlling overfitting in gradient boosting and require tuning. A smaller learning rate and bagging fraction generally give better generalization to unseen samples, so my guess is that the model with both parameters set to 0.1 would be more predictive of other samples. However, since this involves a bias-variance trade-off, I could only confirm it with cross-validation or a held-out test set.
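
A sketch of such a cross-validated comparison is below (not run above). The remaining gbm parameters are illustrative choices, and bag.fraction is passed through train’s “...” to gbm.

# Sketch (not run above): 10-fold CV comparison of the two extreme settings
library(AppliedPredictiveModeling)
data(solubility)

ctrl <- trainControl(method = "cv", number = 10)
grid <- expand.grid(n.trees = 100, interaction.depth = 1,
                    shrinkage = 0.1, n.minobsinnode = 10)

set.seed(100)
gbmLowCV <- train(x = solTrainXtrans, y = solTrainY, method = "gbm",
                  tuneGrid = grid, bag.fraction = 0.1,
                  trControl = ctrl, verbose = FALSE)

grid$shrinkage <- 0.9
set.seed(100)
gbmHighCV <- train(x = solTrainXtrans, y = solTrainY, method = "gbm",
                   tuneGrid = grid, bag.fraction = 0.9,
                   trControl = ctrl, verbose = FALSE)

summary(resamples(list(low = gbmLowCV, high = gbmHighCV)))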

c. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Below, I have trained two GBM models with identical parameters except for the interaction depth: Model 1 uses an interaction depth of 1, while Model 2 uses an interaction depth of 10. The predictor importance of each model is plotted:

library(AppliedPredictiveModeling)

data(solubility)

grid1 <- expand.grid(n.trees=100, interaction.depth=1, shrinkage=0.1, n.minobsinnode=10)
gbm1 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = grid1, verbose = FALSE)

grid2 <- expand.grid(n.trees=100, interaction.depth=10, shrinkage=0.1, n.minobsinnode=10)
gbm2 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = grid2, verbose = FALSE)

plot(varImp(gbm1))

plot(varImp(gbm2))

As the plots show, increasing the interaction depth spreads the importance across more predictors: each tree can grow deeper, giving more features a chance to be involved in the splits. Increasing the depth therefore reduces (flattens) the slope of the importance plot for either model.

8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

The missing values in the ChemicalManufacturingProcess data are imputed with the bagImpute method. The data are then split into training and test sets, with 20% of the observations assigned to the test set.

data(ChemicalManufacturingProcess)

(cmpImpute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute')))
## Created from 152 samples and 57 variables
## 
## Pre-processing:
##   - bagged tree imputation (57)
##   - ignored (0)
cmp <- predict(cmpImpute, ChemicalManufacturingProcess[,-c(1)])

# Train/test splitting data, 20% testing
set.seed(1)
trainRow <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X.train <- cmp[trainRow, ]
y.train <- ChemicalManufacturingProcess$Yield[trainRow]
X.test <- cmp[-trainRow, ]
y.test <- ChemicalManufacturingProcess$Yield[-trainRow]

Below, four regression tree models are trained: a single tree, a random forest, gradient boosting, and Cubist. Bootstrapped resampling with 25 repetitions is used to select the final models, and RMSE is the selection metric. Tree-based models do not require centering and scaling.
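
Bootstrapping with 25 repetitions is caret’s default resampling scheme, so no trControl argument is passed to the train() calls below. To make that choice explicit, one could write (a minimal sketch):

# Sketch: spell out caret's default resampling explicitly
ctrlBoot <- trainControl(method = "boot", number = 25)
# ...and pass trControl = ctrlBoot to each train() call below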

Single tree, tuned over the complexity parameter:

set.seed(1)
rpartModel <- train(x = X.train,
                    y = y.train,
                    method = "rpart",
                    tuneLength = 10,
                    control = rpart.control(maxdepth=2))
rpartModel
## CART 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   cp           RMSE      Rsquared   MAE     
##   0.008397041  1.549770  0.3812237  1.216080
##   0.008859356  1.549770  0.3812237  1.216080
##   0.018524970  1.549770  0.3812237  1.216080
##   0.020481868  1.549770  0.3812237  1.216080
##   0.035948795  1.548684  0.3821330  1.214237
##   0.037010750  1.548684  0.3821330  1.214237
##   0.040274452  1.545683  0.3841690  1.212441
##   0.070117057  1.559539  0.3700052  1.218758
##   0.075640038  1.567100  0.3650647  1.222131
##   0.422604552  1.710135  0.3513290  1.347456
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.04027445.

Random forest, tuned over the number of predictors randomly sampled at each split (mtry):

set.seed(1)
rfModel <- train(x = X.train,
                 y = y.train,
                 method = 'rf',
                 tuneLength = 10)
rfModel
## Random Forest 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.378648  0.5802953  1.0843105
##    8    1.272058  0.6180565  0.9910083
##   14    1.256792  0.6154582  0.9757093
##   20    1.253456  0.6096284  0.9695865
##   26    1.258766  0.5986575  0.9709039
##   32    1.257386  0.5953348  0.9698568
##   38    1.262774  0.5866870  0.9693972
##   44    1.275029  0.5766315  0.9773552
##   50    1.277282  0.5725979  0.9777703
##   57    1.288198  0.5637389  0.9856393
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.

Gradient boosting, tuned over the number of trees, interaction depth, shrinkage rate, and minimum terminal node size:

set.seed(1)
grid <- expand.grid(n.trees=c(50, 100, 150, 200), 
                    interaction.depth=c(1, 5, 10, 15), 
                    shrinkage=c(0.01, 0.1, 0.5), 
                    n.minobsinnode=c(5, 10, 15))
gbmModel <- train(x = X.train,
                  y = y.train, 
                  method = 'gbm', 
                  tuneGrid = grid, 
                  verbose = FALSE)

plot(gbmModel)

gbmModel$bestTune
##    n.trees interaction.depth shrinkage n.minobsinnode
## 76     200                10       0.1              5
gbmModel$finalModel
## A gradient boosted model with gaussian loss function.
## 200 iterations were performed.
## There were 57 predictors of which 55 had non-zero influence.

Cubist, tuned over the number of committees and neighbors:

set.seed(1)
cubistModel <- train(x = X.train,
                     y = y.train,
                     method = 'cubist')
cubistModel                  
## Cubist 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   MAE      
##    1          0          1.974730  0.3276865  1.3415098
##    1          5          1.953650  0.3397400  1.3197269
##    1          9          1.961401  0.3350359  1.3269185
##   10          0          1.319770  0.5487338  1.0117335
##   10          5          1.294999  0.5647492  0.9812996
##   10          9          1.307784  0.5568598  0.9955234
##   20          0          1.248259  0.5934930  0.9678102
##   20          5          1.222524  0.6085878  0.9385537
##   20          9          1.234662  0.6013848  0.9509907
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.

a. Which tree-based regression model gives the optimal resampling and test set performance?

The resampling performance of all the models is compared below:

resamp <- resamples(list(SingleTree=rpartModel, RandomForest=rfModel, GradientBoosting=gbmModel, Cubist=cubistModel))

summary(resamp)
## 
## Call:
## summary.resamples(object = resamp)
## 
## Models: SingleTree, RandomForest, GradientBoosting, Cubist 
## Number of resamples: 25 
## 
## MAE 
##                       Min.   1st Qu.    Median      Mean   3rd Qu.     Max.
## SingleTree       1.0158765 1.1529965 1.2181336 1.2124410 1.2834938 1.511266
## RandomForest     0.7789125 0.9015179 0.9654665 0.9695865 1.0133921 1.193207
## GradientBoosting 0.7473860 0.8650054 0.9133498 0.9263050 0.9786561 1.158549
## Cubist           0.7494319 0.8574981 0.9325365 0.9385537 0.9713097 1.133009
##                  NA's
## SingleTree          0
## RandomForest        0
## GradientBoosting    0
## Cubist              0
## 
## RMSE 
##                      Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## SingleTree       1.259445 1.490117 1.568376 1.545683 1.626500 1.815942    0
## RandomForest     1.021405 1.170212 1.254259 1.253456 1.332717 1.502504    0
## GradientBoosting 1.010651 1.098105 1.209567 1.205794 1.291562 1.485723    0
## Cubist           1.004272 1.163104 1.211461 1.222524 1.294660 1.446199    0
## 
## Rsquared 
##                       Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## SingleTree       0.2335658 0.3185924 0.3740543 0.3841690 0.4188642 0.6073257
## RandomForest     0.4717475 0.5586765 0.5947119 0.6096284 0.6734519 0.7394027
## GradientBoosting 0.5069428 0.5560189 0.6209324 0.6157570 0.6835510 0.7247879
## Cubist           0.4539216 0.5552589 0.6115826 0.6085878 0.6516084 0.7465633
##                  NA's
## SingleTree          0
## RandomForest        0
## GradientBoosting    0
## Cubist              0

Looking at the mean RMSE, the gradient boosting model appears to give the best resampling performance.

The test set performance is calculated below:

testPerf <- function(models, testData, testTarget) {
  # Compute test-set RMSE, Rsquared, and MAE for a list of caret models
  method <- c()
  res <- data.frame()
  for(model in models){
    method <- c(method, model$method)
    pred <- predict(model, newdata=testData)                       # predict on the test set
    res <- rbind(res, t(postResample(pred=pred, obs=testTarget)))  # RMSE, Rsquared, MAE
  }
  row.names(res) <- method
  return(res)
}

models <- list(rpartModel, rfModel, gbmModel, cubistModel)                  

performance <- testPerf(models, X.test, y.test)
performance
##             RMSE  Rsquared       MAE
## rpart  1.4606072 0.2824136 1.1857749
## rf     0.9243633 0.7077904 0.7535632
## gbm    0.9428919 0.6969765 0.7243978
## cubist 0.7248858 0.8187655 0.6265254

The test-set performance suggests that the Cubist model is best. However, I would trust the resampled metrics more than the test-set metrics here, because the dataset is small and the test-set performance may not closely approximate the true model performance. Therefore, I would choose the gradient boosting model.
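
The spread of the resampled RMSE across models can also be inspected visually to support this choice; a minimal sketch using caret’s lattice method for resamples objects:

# Sketch (not run above): resampled RMSE distributions for the four models
bwplot(resamp, metric = "RMSE")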

b. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The predictor importance can be found using the varImp function:

varImp(gbmModel)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17  15.501
## ManufacturingProcess09  15.159
## ManufacturingProcess13  13.372
## ManufacturingProcess15  12.283
## BiologicalMaterial08    10.304
## BiologicalMaterial12     9.952
## BiologicalMaterial03     9.437
## ManufacturingProcess31   8.831
## ManufacturingProcess24   8.797
## ManufacturingProcess06   8.394
## ManufacturingProcess21   7.675
## BiologicalMaterial05     7.542
## BiologicalMaterial06     6.152
## BiologicalMaterial09     5.763
## ManufacturingProcess39   5.695
## ManufacturingProcess30   5.512
## ManufacturingProcess14   5.297
## BiologicalMaterial10     4.912
## ManufacturingProcess43   4.435
varImp(cubistModel)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17  88.235
## BiologicalMaterial06    58.824
## ManufacturingProcess09  49.412
## ManufacturingProcess13  49.412
## BiologicalMaterial12    48.235
## BiologicalMaterial03    43.529
## ManufacturingProcess39  38.824
## BiologicalMaterial02    32.941
## ManufacturingProcess33  25.882
## ManufacturingProcess04  18.824
## BiologicalMaterial11    14.118
## ManufacturingProcess29  12.941
## ManufacturingProcess14   9.412
## ManufacturingProcess25   7.059
## ManufacturingProcess37   7.059
## ManufacturingProcess28   5.882
## ManufacturingProcess35   5.882
## BiologicalMaterial09     5.882
## ManufacturingProcess27   5.882
varImp(rfModel)
## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## BiologicalMaterial12    32.125
## BiologicalMaterial03    25.852
## ManufacturingProcess13  22.917
## BiologicalMaterial06    21.530
## ManufacturingProcess17  15.573
## ManufacturingProcess36  14.563
## ManufacturingProcess09  12.247
## BiologicalMaterial02    10.393
## ManufacturingProcess31   9.802
## ManufacturingProcess06   8.120
## BiologicalMaterial11     8.019
## BiologicalMaterial05     7.511
## BiologicalMaterial09     7.208
## ManufacturingProcess33   7.175
## BiologicalMaterial08     5.870
## BiologicalMaterial04     5.649
## ManufacturingProcess39   5.255
## ManufacturingProcess28   4.964
## ManufacturingProcess25   4.587
plot(varImp(gbmModel),top = 20)

Of the top 20 important variables, 13 are ManufacturingProcess predictors and 7 are BiologicalMaterial predictors, so the ManufacturingProcess predictors dominate the list.

ManufacturingProcess32 is by far the most important predictor. In the optimal linear model, almost all of the top 10 predictors were ManufacturingProcess variables, whereas for the optimal tree-based and nonlinear models the top 10 lists are not as heavily dominated by the ManufacturingProcess predictors.
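
If the optimal linear and nonlinear model objects from Exercises 6.3 and 7.5 were still in the workspace, the overlap in the top 10 predictors could be checked programmatically. A sketch is below; plsModel and svmModel are hypothetical placeholder names for those earlier fits.

# Sketch: extract the top-10 predictors of a caret model via varImp
top10 <- function(model) {
  imp <- varImp(model)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
}
top10(gbmModel)                                  # optimal tree-based model
# intersect(top10(gbmModel), top10(plsModel))    # vs. linear model from 6.3 (hypothetical name)
# intersect(top10(gbmModel), top10(svmModel))    # vs. nonlinear model from 7.5 (hypothetical name)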

c. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Below, the optimal single tree model is printed and plotted. The first split uses ManufacturingProcess32 at a value of 159.5: samples with ManufacturingProcess32 below 159.5 have a mean yield of about 39.12 and are split further on BiologicalMaterial12, while samples at or above 159.5 have a mean yield of about 41.60 and are split further on ManufacturingProcess17.

rpartModel$finalModel
## n= 144 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 144 506.83480 40.15444  
##   2) ManufacturingProcess32< 159.5 84 145.58300 39.12369  
##     4) BiologicalMaterial12< 19.735 36  47.40003 38.34361 *
##     5) BiologicalMaterial12>=19.735 48  59.84592 39.70875 *
##   3) ManufacturingProcess32>=159.5 60 147.06110 41.59750  
##     6) ManufacturingProcess17>=33.45 44  76.39399 41.13341 *
##     7) ManufacturingProcess17< 33.45 16  35.12938 42.87375 *
plot(rpartModel$finalModel)
text(rpartModel$finalModel)

This view provides two additional pieces of knowledge about the relationships: (1) samples with ManufacturingProcess32 at or above 159.5 have a noticeably higher mean yield than those below it; and (2) within the low-ManufacturingProcess32 branch, higher BiologicalMaterial12 is associated with higher yield, while within the high branch, lower ManufacturingProcess17 is associated with higher yield. The difference across the first split is verified below:

trainData <- data.frame(X.train, y.train)
less <- subset(trainData, ManufacturingProcess32 < 159.5)
more <- subset(trainData, ManufacturingProcess32 >= 159.5)

mean(less$y.train)
## [1] 39.12369
mean(more$y.train)
## [1] 41.5975
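
These group means match the node means printed in the tree above. The exercise also asks for the distribution of yield in the terminal nodes; a sketch of that plot (not shown here) uses the terminal-node assignments stored in the fitted rpart object:

# Sketch (not run above): yield distribution within each terminal node
node <- factor(rpartModel$finalModel$where)   # terminal node for each training sample
boxplot(y.train ~ node, xlab = "Terminal node", ylab = "Yield")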