HW 8 A

I have added layered tabs. Main tabs include question2 and question 6
- please make sure to scroll through secondary tabs in each category

Question 7.2

the Mars model does the best job below and it identifies the x1-x5 predictors as most important
all the other models do a good job of mostly identifying 1-5 as the variables of importance, however they, still give small values of importance to some predictors.

plot data

library(mlbench)



set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
 ## We convert the 'x' data from a matrix to a data frame
 ## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
 ## Look at the data using
featurePlot(trainingData$x, trainingData$y)

 ## or other methods.

 ## This creates a list with a vector 'y' and a matrix
 ## of predictors 'x'. Also simulate a large test set to
 ## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1) 
testData$x <- data.frame(testData$x)
b <- mlbench.friedman1(5000, sd = 1)

KNN

knnModel <- train(x = trainingData$x,
 y = trainingData$y,
 method = "knn",preProc = c("center", "scale"),
 tuneLength = 10)
knnModel

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.575784  0.4937689  2.910509
##    7  3.502105  0.5164526  2.839260
##    9  3.392737  0.5567970  2.754948
##   11  3.360067  0.5761135  2.731018
##   13  3.347367  0.5903124  2.722033
##   15  3.322199  0.6074368  2.694127
##   17  3.303677  0.6235049  2.690998
##   19  3.311931  0.6324167  2.696910
##   21  3.323571  0.6414055  2.705803
##   23  3.343621  0.6437899  2.730110
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.

varImp(knnModel)

## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

SVM

below I tune 3 different svm models
- Radial kernel- produces a RMSE on test set of 2.05
  - epsilon = 0.1 cost C = 16 , sigma = 0.061, Number of Support Vectors : 164 Training error : 0.008
- Poly- produces a RMSE on test set of 2.12
  - epsilon = 0.1 cost C = .25 ,Number of Support Vectors : 163 ,Training error : .04
- linear - produces an RMSE of 2.68
  - parameter : epsilon = 0.1 cost C = 1 ,Training error : 0.265593

set.seed(100)
svmRTuned <- train(trainingData$x , trainingData$y,
 method = "svmRadial",
 preProc = c("center", "scale"),
 tuneLength = 14,
 trControl = trainControl(method = "cv"))
set.seed(100)
svmRTuned2 <- train(trainingData$x , trainingData$y,
 method = "svmLinear",
 preProc = c("center", "scale"),
 tuneLength = 14,
 trControl = trainControl(method = "cv"))

svmRTuned3 <- train(trainingData$x , trainingData$y,
 method = "svmPoly",
 preProc = c("center", "scale"),
 tuneLength = 3,
 trControl = trainControl(method = "cv"))

svmRTuned$finalModel

## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 16 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0552698014532696 
## 
## Number of Support Vectors : 156 
## 
## Objective Function Value : -89.772 
## Training error : 0.008525

svmRTuned2$finalModel

## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 172 
## 
## Objective Function Value : -54.9759 
## Training error : 0.216921

svmRTuned3$finalModel

## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 0.5 
## 
## Polynomial kernel function. 
##  Hyperparameters : degree =  3  scale =  0.1  offset =  1 
## 
## Number of Support Vectors : 146 
## 
## Objective Function Value : -8.0447 
## Training error : 0.020293

radial_pred <- predict(svmRTuned,testData$x)
linear_pred <- predict(svmRTuned2,testData$x)
poly_pred <- predict(svmRTuned3,testData$x)

RMSE(testData$y,radial_pred)

## [1] 2.049005

RMSE(testData$y,linear_pred)

## [1] 2.763386

RMSE(testData$y,poly_pred)

## [1] 2.056465

varImp(svmRTuned)

## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

varImp(svmRTuned2)

## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

varImp(svmRTuned3)

## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

MARS

RMSE on test data is 1.14
x1-x5 were all chosen as predictors of importance

set.seed(100)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:32)
marsTuned <- train(trainingData$x, trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))

marsTuned$finalModel

## Selected 16 of 18 terms, and 5 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, ...
## Number of terms at each degree of interaction: 1 11 4
## GCV 1.61518    RSS 210.6377    GRSq 0.934423    RSq 0.9568093

mars_pred <- predict(marsTuned,testData$x)
RMSE(testData$y,mars_pred)

## [1] 1.14925

varImp(marsTuned)

## earth variable importance
## 
##     Overall
## X1   100.00
## X4    85.14
## X2    69.24
## X5    49.32
## X3    40.02
## X8     0.00
## X6     0.00
## X10    0.00
## X9     0.00
## X7     0.00

Question 7.5

For questions B and C we were asked to compare our results with last week
As a small caveat, last week I eliminated many predictors that showed potential collinearity issues and ran models based on full dataset and a trimmed dataset. For part A in this hw, i just use that trimmed dataset
However, in the end of part A, I run a caret ensemble model to tune everything at once with auto feature selection on the full dataset
- This model is used for parts b and c to figure out feature importance

A

the data was already imputed in last hw, I import that data file again
- Last time i ran tests over complete dataset and trimmed dataset, this time I will run test over just the trimmed dataset

data("ChemicalManufacturingProcess")
completedData <- read.csv("imputed.csv")
complete_df_imputed <- completedData
my_df <- ChemicalManufacturingProcess[,-1]
processPredictors <- as.matrix(my_df)
yield <- ChemicalManufacturingProcess[,1]




## eliminate high correlation and near zero var  
corrplot::corrplot(cor(completedData))

nearZeroVar(completedData)

## [1] 8

correlations <- cor(completedData)
highCorr <- findCorrelation(correlations, cutoff = .75)
completedData <- completedData[, -highCorr]
completedData <- completedData[,-3]

Run SVM models

Radial test set 1.17 RMSE
- the residual plot for the model shows non constant variance in the residuals. It almost look like there is a a positive correlation between residuals and predictions
Linear SVM test set RMSE 1.16, the residuals do look a little better than above model
Poly has an ok residual plot but RMSE of 1.23 on test set

## Train test split
set.seed(123)
smp_size <- floor(0.75 * nrow(completedData))
train_ind <- sample(seq_len(nrow(completedData)), size = smp_size)
train <- completedData[train_ind, ]
test <- completedData[-train_ind, ]
train_y <- yield[train_ind]
test_y <- yield[-train_ind]


## tune models 

## Radial model
svmRTuned <- train(train , train_y,
 method = "svmRadial",
 preProc = c("center", "scale"),
 tuneLength = 14,
 trControl = trainControl(method = "cv"))

## plot Radial model
predicted <- predict(svmRTuned,test)
actual <-test_y

axisRange <- extendrange(c(actual, predicted))
plot(actual, predicted, ylim = axisRange, xlim = axisRange)
abline(0, 1, col = "darkgrey", lty = 2)

plot(predicted, (predicted-actual), ylab = "residual")
abline(h = 0, col = "darkgrey", lty = 2)

## Linear model
svmRTuned2 <- train(train , train_y,
 method = "svmLinear",
 preProc = c("center", "scale"),
 tuneLength = 14,
 trControl = trainControl(method = "cv"))

## plot lin model

predicted <- predict(svmRTuned2,test)
actual <-test_y

axisRange <- extendrange(c(actual, predicted))
plot(actual, predicted, ylim = axisRange, xlim = axisRange)
abline(0, 1, col = "darkgrey", lty = 2)

plot(predicted, (predicted-actual), ylab = "residual")
abline(h = 0, col = "darkgrey", lty = 2)

## poly svm model
svmRTuned3 <- train(train, train_y,
 method = "svmPoly",
 preProc = c("center", "scale"),
 tuneLength = 3,
 trControl = trainControl(method = "cv"))

## plot model
## 
predicted <- predict(svmRTuned3,test)
actual <-test_y

axisRange <- extendrange(c(actual, predicted))
plot(actual, predicted, ylim = axisRange, xlim = axisRange)
abline(0, 1, col = "darkgrey", lty = 2)

plot(predicted, (predicted-actual), ylab = "residual")
abline(h = 0, col = "darkgrey", lty = 2)

## get predictions RMSE for models 
predictions1 <- predict(svmRTuned,test)
radial_rmse <- RMSE(predictions1,test_y)
predictions2 <- predict(svmRTuned2,test)
linear_rmse <- RMSE(predictions2,test_y)
predictions3 <- predict(svmRTuned3,test)
poly_rmse <- RMSE(predictions3,test_y)
cbind(radial_rmse,linear_rmse,poly_rmse)

##      radial_rmse linear_rmse poly_rmse
## [1,]     1.17302    1.165471  1.233217

## final models 
svmRTuned$finalModel

## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 2 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0221305795559405 
## 
## Number of Support Vectors : 122 
## 
## Objective Function Value : -66.0452 
## Training error : 0.102862

svmRTuned2$finalModel

## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 125 
## 
## Objective Function Value : -44.7484 
## Training error : 0.336602

svmRTuned3$finalModel

## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 0.25 
## 
## Polynomial kernel function. 
##  Hyperparameters : degree =  3  scale =  0.01  offset =  1 
## 
## Number of Support Vectors : 117 
## 
## Objective Function Value : -14.6279 
## Training error : 0.357631

varImp(svmRTuned)

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 36)
## 
##                        Overall
## BiologicalMaterial03    100.00
## ManufacturingProcess17   93.51
## ManufacturingProcess36   86.62
## ManufacturingProcess33   66.87
## BiologicalMaterial11     66.25
## ManufacturingProcess06   63.65
## BiologicalMaterial09     50.78
## ManufacturingProcess11   45.97
## ManufacturingProcess30   39.92
## ManufacturingProcess12   36.89
## ManufacturingProcess28   31.69
## ManufacturingProcess01   26.22
## ManufacturingProcess27   26.04
## BiologicalMaterial10     23.40
## BiologicalMaterial05     22.94
## ManufacturingProcess16   21.21
## ManufacturingProcess35   19.28
## ManufacturingProcess04   18.70
## ManufacturingProcess20   18.02
## ManufacturingProcess02   16.06

varImp(svmRTuned2)

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 36)
## 
##                        Overall
## BiologicalMaterial03    100.00
## ManufacturingProcess17   93.51
## ManufacturingProcess36   86.62
## ManufacturingProcess33   66.87
## BiologicalMaterial11     66.25
## ManufacturingProcess06   63.65
## BiologicalMaterial09     50.78
## ManufacturingProcess11   45.97
## ManufacturingProcess30   39.92
## ManufacturingProcess12   36.89
## ManufacturingProcess28   31.69
## ManufacturingProcess01   26.22
## ManufacturingProcess27   26.04
## BiologicalMaterial10     23.40
## BiologicalMaterial05     22.94
## ManufacturingProcess16   21.21
## ManufacturingProcess35   19.28
## ManufacturingProcess04   18.70
## ManufacturingProcess20   18.02
## ManufacturingProcess02   16.06

varImp(svmRTuned3)

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 36)
## 
##                        Overall
## BiologicalMaterial03    100.00
## ManufacturingProcess17   93.51
## ManufacturingProcess36   86.62
## ManufacturingProcess33   66.87
## BiologicalMaterial11     66.25
## ManufacturingProcess06   63.65
## BiologicalMaterial09     50.78
## ManufacturingProcess11   45.97
## ManufacturingProcess30   39.92
## ManufacturingProcess12   36.89
## ManufacturingProcess28   31.69
## ManufacturingProcess01   26.22
## ManufacturingProcess27   26.04
## BiologicalMaterial10     23.40
## BiologicalMaterial05     22.94
## ManufacturingProcess16   21.21
## ManufacturingProcess35   19.28
## ManufacturingProcess04   18.70
## ManufacturingProcess20   18.02
## ManufacturingProcess02   16.06

MARS

RMSE on test set is 1.71
MARS model looks terrible

set.seed(123)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:32)
marsTuned <- train(train, train_y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))

marsTuned$finalModel

## Selected 2 of 45 terms, and 2 of 36 predictors
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: ManufacturingProcess17, ManufacturingProcess28, ...
## Number of terms at each degree of interaction: 1 0 1
## GCV 2.548561    RSS 318.8066    GRSq 0.2586737    RSq 0.2866986

mars_pred <- predict(marsTuned,test)
RMSE(test_y,mars_pred)

## [1] 1.714526

predicted <- predict(marsTuned,test)
actual <-test_y

axisRange <- extendrange(c(actual, predicted))
plot(actual, predicted, ylim = axisRange, xlim = axisRange)
abline(0, 1, col = "darkgrey", lty = 2)

plot(predicted, (predicted-actual), ylab = "residual")
abline(h = 0, col = "darkgrey", lty = 2)

varImp(marsTuned)

## earth variable importance
## 
##   only 20 most important variables shown (out of 36)
## 
##                        Overall
## ManufacturingProcess28     100
## ManufacturingProcess17     100
## ManufacturingProcess21       0
## ManufacturingProcess24       0
## ManufacturingProcess05       0
## BiologicalMaterial03         0
## ManufacturingProcess37       0
## BiologicalMaterial09         0
## ManufacturingProcess12       0
## ManufacturingProcess35       0
## ManufacturingProcess19       0
## ManufacturingProcess34       0
## ManufacturingProcess22       0
## ManufacturingProcess08       0
## ManufacturingProcess01       0
## BiologicalMaterial11         0
## ManufacturingProcess20       0
## ManufacturingProcess11       0
## ManufacturingProcess03       0
## ManufacturingProcess38       0

KNN

RMSE on test test is 1.51

knnModel <- train(x = train,
 y = train_y,
 method = "knn",preProc = c("center", "scale"),
 tuneLength = 10)
knnModel

## k-Nearest Neighbors 
## 
## 132 samples
##  36 predictor
## 
## Pre-processing: centered (36), scaled (36) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.472477  0.3282891  1.152186
##    7  1.463268  0.3209371  1.160692
##    9  1.459681  0.3179446  1.157638
##   11  1.461165  0.3151395  1.158253
##   13  1.463401  0.3112594  1.159268
##   15  1.463532  0.3076825  1.158154
##   17  1.460605  0.3115393  1.155990
##   19  1.457138  0.3182375  1.152340
##   21  1.457895  0.3214083  1.147691
##   23  1.465486  0.3178753  1.153840
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 19.

varImp(knnModel)

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 36)
## 
##                        Overall
## BiologicalMaterial03    100.00
## ManufacturingProcess17   93.51
## ManufacturingProcess36   86.62
## ManufacturingProcess33   66.87
## BiologicalMaterial11     66.25
## ManufacturingProcess06   63.65
## BiologicalMaterial09     50.78
## ManufacturingProcess11   45.97
## ManufacturingProcess30   39.92
## ManufacturingProcess12   36.89
## ManufacturingProcess28   31.69
## ManufacturingProcess01   26.22
## ManufacturingProcess27   26.04
## BiologicalMaterial10     23.40
## BiologicalMaterial05     22.94
## ManufacturingProcess16   21.21
## ManufacturingProcess35   19.28
## ManufacturingProcess04   18.70
## ManufacturingProcess20   18.02
## ManufacturingProcess02   16.06

knn_pred <- predict(knnModel,test)
RMSE(test_y,knn_pred)

## [1] 1.517846

predicted <- predict(knnModel,test)
actual <-test_y

axisRange <- extendrange(c(actual, predicted))
plot(actual, predicted, ylim = axisRange, xlim = axisRange)
abline(0, 1, col = "darkgrey", lty = 2)

plot(predicted, (predicted-actual), ylab = "residual")
abline(h = 0, col = "darkgrey", lty = 2)

caret ensemble on entire dataframe

XGBOOST gives best results, at rmse under 1

completedData <- read.csv("imputed.csv")
complete_df_imputed <- completedData

complete_df_imputed <- complete_df_imputed[,-1]
set.seed(123)
smp_size <- floor(0.75 * nrow(complete_df_imputed))
train_ind <- sample(seq_len(nrow(complete_df_imputed)), size = smp_size)
train <- complete_df_imputed[train_ind, ]
test <- complete_df_imputed[-train_ind, ]
train_y <- yield[train_ind]
test_y <- yield[-train_ind]




registerDoParallel(4)
getDoParWorkers()

## [1] 4

set.seed(123)
my_control <- trainControl(method = 'cv', # for “cross-validation”
                           number = 5, # number of k-folds
                           savePredictions = 'final',
                           allowParallel = TRUE)

model_list <- caretList(train,
                        train_y,
                        trControl = my_control,
                        methodList = c('lm', 'svmRadial', 'rf',"pls", 
                                       'xgbTree', 'xgbLinear',"bagEarth",'glmnet',"knn" ),
                        tuneList = NULL,
                        continue_on_fail = FALSE, 
                        preProcess = c('center','scale'))

## Warning in trControlCheck(x = trControl, y = target): indexes not defined
## in trControl. Attempting to set them ourselves, so each model in the
## ensemble will have the same resampling indexes.

options(digits = 3)
model_results <- data.frame(LM = min(model_list$lm$results$RMSE),
 SVM = min(model_list$svmRadial$results$RMSE),
 RF = min(model_list$rf$results$RMSE),
 XGBT = min(model_list$xgbTree$results$RMSE),
 XGBL = min(model_list$xgbLinear$results$RMSE),
 Mars = min(model_list$bagEarth$results$RMSE),
 GLMNET = min(model_list$glmnet$results$RMSE),
 PLS =min(model_list$pls$results$RMSE),
 KNN =min(model_list$knn$results$RMSE))
 
print(model_results)

##     LM  SVM   RF XGBT XGBL Mars GLMNET  PLS  KNN
## 1 12.9 1.32 1.17 1.07 1.15 1.39   1.31 1.58 1.37

options(digits = 3)

LM = predict(model_list$lm,test)

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading

RMSE(test_y,knn_pred)

## [1] 1.52

svmRadial = predict(model_list$svmRadial,test)
RMSE(test_y,svmRadial)

## [1] 1.14

rf = predict(model_list$rf,test)
RMSE(test_y,rf)

## [1] 1.23

resamples <- resamples(model_list)
dotplot(resamples, metric = 'RMSE')

ensemble_1 <- caretEnsemble(model_list, 
                            metric = 'RMSE', 
                            trControl = my_control)
summary(ensemble_1)

## The following models were ensembled: lm, svmRadial, rf, pls, xgbTree, xgbLinear, bagEarth, glmnet, knn 
## They were weighted: 
## -8.925 0.0044 0.0684 0.2258 -0.2042 0.5689 0.0924 -0.0561 0.3611 0.162
## The resulting RMSE is: 1.2986
## The fit for each individual model on the RMSE is: 
##     method  RMSE RMSESD
##         lm 12.91 22.828
##  svmRadial  1.32  0.128
##         rf  1.17  0.232
##        pls  1.58  0.293
##    xgbTree  1.07  0.175
##  xgbLinear  1.15  0.255
##   bagEarth  1.39  0.160
##     glmnet  1.31  0.224
##        knn  1.37  0.187

B

Variable importance

Below i commented out some of the most important variables in out linear versus non linear models ran on entire dataset
In my best non linear model which was the radial model, we can see that 5 of the biological material predictors actually make it into top ten most important predictors.
I’d say biological material seems slightly more important in non linear models.
- Our MARS models, while performing pretty poorly, seem to indicate MP 32 is the only important process

plot(varImp(model_list$xgbTree),top=10)##   32 , 13 06, 09, 31 01,37          03  12, 10, 05, 09

plot(varImp(model_list$xgbLinear),top=10)

plot(varImp(model_list$glmnet),top=10)## MP 32,12,09,17,34,06,37,03  BM 06,03

plot(varImp(model_list$bagEarth),top=10)

plot(varImp(model_list$pls),top=10)## MP 32,36,13,09,33   BM:  06,02,03

plot(varImp(model_list$svmRadial),top=10)## MP 32,13,17,36,09,33    06,03,02,12,11

C

Just like our previous hw, we can see the biological material all acts as some sort of accelerant. All have positive correlation with yield
In terms of manufacturing process, I added in MP 31 01 and 07 because our best model(XGboost) identified them
- strangely enough these variables actually show the least correlation with our yield
Otherwise just like in our linear model, we can see that certain manufacturing process can hinder or help our yield.
Manufacturing process 32 seems very helpful, which looking back at my gls model did happen in the dataset with all predictors as well
manufacturing process 13 seems very detrimental to our yield which was also identified in the same model for linear hw last week

corr_df <- as.data.frame(cbind(complete_df_imputed,yield))
corrplot::corrplot(cor(corr_df[,1:58])[c(3,12,10,5,9),58, drop=FALSE], cl.pos='n')

corrplot::corrplot(cor(corr_df[,1:58])[c(25,18,29,45,21,43,13,49,44),58, drop=FALSE], cl.pos='n')

Non Linear Regression

HW 8 A

Question 7.2

plot data

KNN

SVM

MARS

Question 7.5

A

Run SVM models

MARS

KNN

caret ensemble on entire dataframe

B

Variable importance

C