library(tidyverse)
library(caret)
library(AppliedPredictiveModeling)
data(permeability)
fingerprints <- as.data.frame(fingerprints)
caret's nearZeroVar function uses the standard rules of thumb for flagging variables to remove, and we have no reason to adjust them at this point.
zero_vars <- nearZeroVar(fingerprints)
length(zero_vars)
## [1] 719
fingerprints_var <- fingerprints[, -zero_vars]
Nearly two-thirds of the variables in the dataset are removed.
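For reference, the rule of thumb is controlled by nearZeroVar's freqCut and uniqueCut arguments; a looser screen, if we ever wanted one, might look like this sketch (not run here):
# Sketch: a looser screen that keeps more sparse fingerprints.
# freqCut caps the ratio of the most common value to the second most common;
# uniqueCut is the minimum percentage of distinct values.
loose_zero_vars <- nearZeroVar(fingerprints, freqCut = 99/1, uniqueCut = 5)
length(loose_zero_vars)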
We split the data into training and test sets, holding out 20% of the observations. With so few observations, our priority is keeping as much data as possible in the training set, which will be used for repeated cross-validation.
In practice, I would probably not use a holdout set with this few observations, but that is what the book asked for.
set.seed(7)
samples <- createDataPartition(permeability, p = .8, list = F)
fp_train <- fingerprints_var[samples, ]
fp_test <- fingerprints_var[-samples, ]
perm_train <- permeability[samples, ]
perm_test <- permeability[-samples, ]
PLS should overcome collinearity on its own, but we will try giving it some help by removing highly correlated variables. The tradeoff is that we shrink our feature space.
too_high <- findCorrelation(cor(fp_train), .9)
fp_train_cor <- fp_train[, -too_high]
fp_test_cor <- fp_test[, -too_high]
We train a PLS model and seek the optimal number of components.
pls_test <- train(x = fp_train,
y = perm_train,
method = 'pls',
tuneGrid = data.frame(ncomp = seq(1, 80, 1)),
trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 5)
)
rslts <- pls_test$results
ggplot(rslts) + geom_line(aes(x = ncomp, y = RMSE))
library(kableExtra)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| ncomp | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| 1 | 13.12196 | 0.3153016 | 10.074064 | 2.635170 | 0.2117838 | 1.923137 |
| 2 | 11.47877 | 0.4827818 | 8.182957 | 2.893810 | 0.2190565 | 1.982311 |
| 3 | 11.20152 | 0.5168873 | 8.280295 | 2.539717 | 0.1871408 | 1.662693 |
| 4 | 11.36578 | 0.5089077 | 8.653261 | 2.521500 | 0.1779809 | 1.773574 |
| 5 | 11.27327 | 0.5245455 | 8.455785 | 2.681686 | 0.1879479 | 1.868318 |
| 6 | 11.24334 | 0.5277537 | 8.098340 | 2.571194 | 0.1844890 | 1.727003 |
| 7 | 11.42653 | 0.5187461 | 8.281599 | 2.340755 | 0.1788926 | 1.642486 |
| 8 | 11.67791 | 0.4981780 | 8.738919 | 2.123011 | 0.1689353 | 1.585042 |
| 9 | 11.98842 | 0.4794563 | 8.851377 | 2.229950 | 0.1783094 | 1.639184 |
| 10 | 12.16037 | 0.4749514 | 8.936781 | 2.414712 | 0.1820555 | 1.664207 |
The best cross-validated RMSE occurs at 3 components (R squared of about .52), though RMSE and R squared are fairly similar anywhere from 2 to 9 components.
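caret selects the final ncomp by lowest cross-validated RMSE, which we can confirm directly; if we preferred a more parsimonious fit, the one-standard-error rule is available through trainControl (a sketch, not used in the models below):
pls_test$bestTune # the ncomp caret chose (lowest cross-validated RMSE)
# Sketch: a control object that would pick the simplest model within one
# standard error of the best RMSE if passed to train().
ctrl_1se <- trainControl(method = 'repeatedcv', number = 10, repeats = 5,
selectionFunction = 'oneSE')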
We’ll try a model with the highly correlated predictors removed just for due diligence.
pls_test <- train(x = fp_train_cor,
y = perm_train,
method = 'pls',
tuneGrid = data.frame(ncomp = seq(1, 80, 1)),
trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 5)
)
rslts <- pls_test$results
ggplot(rslts) + geom_line(aes(x = ncomp, y = RMSE))
library(kableExtra)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| ncomp | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| 1 | 11.93364 | 0.4301069 | 8.822951 | 3.049226 | 0.2280876 | 2.202543 |
| 2 | 11.19312 | 0.4921207 | 8.264361 | 2.639869 | 0.2166940 | 1.740735 |
| 3 | 11.19351 | 0.4960587 | 8.419027 | 2.069418 | 0.1936079 | 1.493946 |
| 4 | 11.03624 | 0.5152311 | 8.241503 | 2.138118 | 0.1926534 | 1.610490 |
| 5 | 10.82605 | 0.5350954 | 8.137202 | 2.247647 | 0.2051616 | 1.598093 |
| 6 | 10.92083 | 0.5373558 | 8.130074 | 2.248815 | 0.2003508 | 1.706117 |
| 7 | 11.33000 | 0.5189290 | 8.300017 | 2.403725 | 0.2075739 | 1.789440 |
| 8 | 11.66446 | 0.4948652 | 8.662352 | 2.416649 | 0.2121677 | 1.913779 |
| 9 | 11.78705 | 0.4912127 | 8.840528 | 2.507220 | 0.2110571 | 2.046892 |
| 10 | 12.01933 | 0.4815073 | 8.912092 | 2.743909 | 0.2172953 | 2.233449 |
Since performance is about the same and it trims our feature set, we'll use the reduced data for the final PLS fit.
We choose the number of components based on RMSE.
library(pls)
mod1 <- plsr(perm_train ~ ., data = fp_train_cor, ncomp = 5)
preds <- predict(mod1, fp_test_cor)
df <- data.frame(obs = perm_test, pred = preds[,1,5])
defaultSummary(df)
## RMSE Rsquared MAE
## 12.2281388 0.4419219 9.0983921
The model performs somewhat worse on the test set, although given that the holdout set contains only about 30 observations, the uncertainty around these RMSE and R squared estimates is likely large.
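To put a rough number on that uncertainty, we could bootstrap the holdout predictions; a minimal sketch, reusing the df data frame built above:
# Sketch: bootstrap the holdout rows to gauge the spread of the test RMSE.
set.seed(7)
boot_rmse <- replicate(1000, {
idx <- sample(nrow(df), replace = TRUE)
caret::RMSE(df$pred[idx], df$obs[idx])
})
quantile(boot_rmse, c(.025, .975))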
ENET
We continue with the reduced dataset. The lasso penalty can eliminate variables entirely, but starting with roughly 130 predictors instead of over 300 makes its job a little easier without losing much information.
fp_tm <- as.matrix(fp_train_cor)
fp_tt <- as.matrix(fp_test_cor)
grid <- expand.grid(.lambda = c(0, .01, .1), .fraction = seq(.1,.9,.1))
enet_test <- train(x = fp_tm,
y = perm_train,
method = 'enet',
trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 5),
tuneGrid = grid
)
rslts <- enet_test$results %>%
arrange(RMSE)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| lambda | fraction | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|
| 0.10 | 0.4 | 10.97565 | 0.5454836 | 8.103199 | 2.083525 | 0.1710363 | 1.700495 |
| 0.10 | 0.5 | 11.03603 | 0.5467752 | 8.133713 | 2.169483 | 0.1732243 | 1.751047 |
| 0.01 | 0.2 | 11.11475 | 0.5280797 | 8.195236 | 2.101289 | 0.1773798 | 1.722511 |
| 0.10 | 0.3 | 11.15948 | 0.5292068 | 8.198842 | 2.030621 | 0.1700960 | 1.639116 |
| 0.10 | 0.6 | 11.22027 | 0.5397668 | 8.266250 | 2.233676 | 0.1744998 | 1.741469 |
| 0.01 | 0.3 | 11.31193 | 0.5213694 | 8.330346 | 2.341844 | 0.1812038 | 1.850853 |
| 0.10 | 0.7 | 11.35702 | 0.5338386 | 8.381972 | 2.282186 | 0.1747442 | 1.733163 |
| 0.01 | 0.1 | 11.36865 | 0.5132990 | 8.161978 | 2.061692 | 0.1770246 | 1.636241 |
| 0.10 | 0.2 | 11.39648 | 0.5082092 | 8.224155 | 2.064574 | 0.1762908 | 1.605596 |
| 0.10 | 0.8 | 11.44244 | 0.5311521 | 8.455546 | 2.335632 | 0.1755044 | 1.722711 |
The best RMSEs nearly all occur with a lambda of .1, indicating a mixture of lasso and ridge penalties. The best fraction (the ratio of the L1 norm of the elastic net coefficients to that of the full least-squares fit) is between .1 and .7, depending on lambda. This suggests a large number of low-information, but not zero-information, variables are present.
library(elasticnet)
fp_tt <- as.matrix(fp_test_cor)
mod2 <- enet(x = fp_tm, y = perm_train, lambda = .1)
preds <- predict(mod2, s= .4, newx = fp_tt, mode = 'fraction', type = 'fit')
df1 <- data.frame(obs = perm_test, pred = preds$fit)
defaultSummary(df1)
## RMSE Rsquared MAE
## 11.6333462 0.4916524 8.2352733
RMSE on the test set is in line with, though slightly higher than, the cross-validated training estimate.
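As a quick check on the "many low-information variables" reading above, we can count how many coefficients remain nonzero at the chosen fraction; a sketch using elasticnet's predict method (type = 'coefficients' is assumed to return the coefficient vector, as in lars):
# Sketch: number of predictors with nonzero coefficients at s = .4.
coefs <- predict(mod2, s = .4, mode = 'fraction', type = 'coefficients')$coefficients
sum(coefs != 0)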
We'll try principal components regression, which should be a slightly worse version of PLS since its components are chosen without reference to the response.
PCR
pca_fit <- preProcess(fp_train, method = 'pca') # not strictly needed: method = 'pcr' below handles the PCA step internally
pca_train <- predict(pca_fit, fp_train)
lm_test <- train(x = fp_train,
y = perm_train,
method = 'pcr',
tuneGrid = data.frame(ncomp = seq(1, 80, 1)),
trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 5)
)
rslts <- lm_test$results
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| ncomp | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| 1 | 14.75050 | 0.1121710 | 11.481926 | 2.703301 | 0.1068101 | 2.003189 |
| 2 | 14.78931 | 0.1050584 | 11.652487 | 2.652901 | 0.0963392 | 1.962427 |
| 3 | 13.38908 | 0.3066499 | 9.971547 | 2.693651 | 0.2273224 | 2.076134 |
| 4 | 11.66162 | 0.4642896 | 8.443879 | 2.588202 | 0.2401876 | 1.744123 |
| 5 | 11.66126 | 0.4627821 | 8.356554 | 2.609220 | 0.2392260 | 1.763022 |
| 6 | 11.71055 | 0.4558462 | 8.419351 | 2.524749 | 0.2409457 | 1.807729 |
| 7 | 11.74415 | 0.4499319 | 8.517642 | 2.510217 | 0.2370372 | 1.855262 |
| 8 | 11.80431 | 0.4448628 | 8.571406 | 2.485347 | 0.2343003 | 1.880409 |
| 9 | 11.88208 | 0.4413982 | 8.615894 | 2.531507 | 0.2357723 | 1.901882 |
| 10 | 11.79592 | 0.4511278 | 8.611095 | 2.623503 | 0.2399310 | 2.013587 |
With this simpler model, we still get within roughly 5-8% of the cross-validated RMSE of PLS and ENET.
Gradient Boosting
We'll try a more modern approach, gradient boosting, using the xgboost package. Since the fingerprint predictors are all binary, interactions are likely, and for the linear models they would have to be uncovered manually and added as new features; gradient boosting finds these interactions automatically.
grid<- expand.grid(eta = c(.005, .01, 0.025, 0.05), nrounds = c(200,500), max_depth = c(1,2,3), gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1)
gb_test <- train(x = fp_tm,
y = perm_train,
method = 'xgbTree',
tuneGrid = grid,
trControl = trainControl(method = 'cv', number = 10),
verbose = F
)
rslts <- gb_test$results %>%
arrange(RMSE)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| eta | max_depth | gamma | colsample_bytree | min_child_weight | subsample | nrounds | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.050 | 2 | 0 | 1 | 1 | 1 | 500 | 10.39426 | 0.5777854 | 7.308519 | 2.801858 | 0.2340349 | 2.040100 |
| 0.010 | 2 | 0 | 1 | 1 | 1 | 500 | 10.46240 | 0.5585730 | 7.265872 | 2.639403 | 0.2371244 | 1.852221 |
| 0.025 | 2 | 0 | 1 | 1 | 1 | 200 | 10.47386 | 0.5580647 | 7.284401 | 2.622707 | 0.2364705 | 1.829595 |
| 0.025 | 2 | 0 | 1 | 1 | 1 | 500 | 10.56748 | 0.5574036 | 7.356155 | 2.707002 | 0.2305747 | 1.945410 |
| 0.050 | 2 | 0 | 1 | 1 | 1 | 200 | 10.58906 | 0.5530681 | 7.295085 | 2.623186 | 0.2285004 | 1.869272 |
| 0.005 | 2 | 0 | 1 | 1 | 1 | 500 | 10.76921 | 0.5402104 | 7.306012 | 2.330568 | 0.2481567 | 1.592276 |
| 0.010 | 3 | 0 | 1 | 1 | 1 | 500 | 10.98923 | 0.5233653 | 7.388219 | 2.884528 | 0.2610558 | 2.021502 |
| 0.050 | 1 | 0 | 1 | 1 | 1 | 500 | 10.99310 | 0.5511968 | 7.970474 | 1.757212 | 0.1687718 | 1.520256 |
| 0.010 | 2 | 0 | 1 | 1 | 1 | 200 | 11.00459 | 0.5333358 | 7.344646 | 2.160730 | 0.2510869 | 1.486226 |
| 0.005 | 3 | 0 | 1 | 1 | 1 | 500 | 11.01229 | 0.5164521 | 7.397330 | 2.759216 | 0.2711515 | 1.768182 |
We get about a 5% improvement in cross-validated RMSE versus the linear models.
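For completeness, the tuned boosting model could be scored on the same holdout set as the linear models; a sketch (results not shown here):
# Sketch: holdout performance of the tuned xgboost model.
gb_preds <- predict(gb_test, newdata = fp_tt)
defaultSummary(data.frame(obs = perm_test, pred = gb_preds))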
Out of these models, I would choose ENET because it preserves the original variables, making the model easier to present. ENET slightly outperformed the PLS model on the holdout set, but even if it hadn't predicted quite as well, it would remain my choice.
We'll examine observed versus predicted values for the ENET fit, along with its most important variables.
preds <- predict(mod2, s= .4, newx = fp_tm, mode = 'fraction', type = 'fit')
train_df <- data.frame(obs = perm_train, pred = preds$fit)
ggplot(train_df) + geom_point(aes(x = pred, y = obs)) + geom_smooth(aes(x = pred, y = obs))
imp <- varImp(enet_test)
plot(imp, top = 20)
Recommendation
I would recommend the ENET model for prescreening compounds before they are given the laboratory permeability assay. This would cut costs, provided the screening cutoff is chosen carefully so that we neither miss many permeable compounds nor send too many impermeable ones on to the assay.
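As a concrete illustration, the prescreen would flag compounds whose predicted permeability clears some cutoff; a sketch of that tradeoff on the holdout set, using a purely illustrative cutoff of 10:
# Sketch: confusion counts for a hypothetical permeability cutoff of 10
# (the cutoff would need to be set with the lab, not from this data alone).
cutoff <- 10
table(predicted_permeable = df1$pred > cutoff,
actually_permeable = df1$obs > cutoff)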
data(ChemicalManufacturingProcess)
First, we remove any near-zero-variance variables; in a small dataset this can occur even in continuous predictors.
zero_vars <- nearZeroVar(ChemicalManufacturingProcess)
length(zero_vars)
## [1] 1
ChemicalManufacturingProcess_new <- ChemicalManufacturingProcess[, -zero_vars]
We then use the mice package to impute missing values with random forest models. We choose random forests because they don't require much tuning and can handle data with more variables than observations.
library(naniar)
library(mice)
gg_miss_upset(ChemicalManufacturingProcess_new, nsets = 5, nintersects = 5)
gg_miss_case(ChemicalManufacturingProcess_new)
library(mice)
# Note: ManufacturingProcess36 is dropped here and is not used in the rest of the analysis.
pred <- dplyr::select(ChemicalManufacturingProcess_new, -ManufacturingProcess36) %>%
mice(m = 3, method = 'rf', maxit = 2)
imputed <- complete(pred)
# A second pass over the completed data has nothing left to impute
# (note the empty variable column in the output below).
pred <- mice(data = imputed, m = 3, method = 'rf', maxit = 2)
##
## iter imp variable
## 1 1
## 1 2
## 1 3
## 2 1
## 2 2
## 2 3
chem_imputed <- complete(pred)
Next, we get a sense of the correlation structure to see how much of a problem multicollinearity is. This will inform whether we lean on elastic net or PLS.
library(corrplot)
cor_mat <- cor(chem_imputed) %>%
corrplot(order = 'hclust',tl.pos='n')
Collinearity appears to be a moderate problem, so we start with PLS.
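To put a rough number on "moderate", we can count how many predictors caret's findCorrelation would flag at a .9 cutoff; a quick diagnostic sketch only, since we don't drop anything here:
# Sketch: how many predictors would a .9 pairwise-correlation filter remove?
length(findCorrelation(cor(dplyr::select(chem_imputed, -Yield)), cutoff = .9))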
We'll also look for skewed variables to see whether a Box-Cox transformation is necessary.
library(psych)
ins_desc <- chem_imputed %>%
keep(is.numeric) %>%
describe() %>%
dplyr::select(-c(vars,n,range, mad, trimmed)) %>%
arrange(desc(abs(skew))) %>%
head(10)
rns <- rownames(ins_desc)
round_2 <- function(col) round(col,2)
ins_desc2 <- ins_desc %>%
mutate_all(round_2)
rownames(ins_desc2) <- rns
ins_desc2 %>%
kable %>%
kable_styling
| mean | sd | median | min | max | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|
| 6016.36 | 458.21 | 6047.00 | 0 | 6161.0 | -12.86 | 165.88 | 34.54 |
| 4828.28 | 368.12 | 4854.00 | 0 | 4990.0 | -12.82 | 165.15 | 27.75 |
| 4809.68 | 367.48 | 4835.00 | 0 | 4971.0 | -12.74 | 163.74 | 27.70 |
| 4562.88 | 348.95 | 4586.00 | 0 | 4710.0 | -12.70 | 163.12 | 26.30 |
| 4556.46 | 349.01 | 4582.00 | 0 | 4759.0 | -12.64 | 162.07 | 26.31 |
| 4565.80 | 351.70 | 4588.00 | 0 | 4852.0 | -12.42 | 158.40 | 26.51 |
| 70.20 | 5.48 | 70.80 | 0 | 72.5 | -12.00 | 150.39 | 0.41 |
| 20.02 | 1.64 | 19.95 | 0 | 22.0 | -10.20 | 122.44 | 0.12 |
| 0.91 | 0.87 | 0.80 | 0 | 11.0 | 9.05 | 101.03 | 0.07 |
| 11.21 | 1.94 | 11.60 | 0 | 12.1 | -5.45 | 28.53 | 0.15 |
With 10 variables showing an absolute skew of at least 5, a Box-Cox transformation appears advisable.
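One caveat: the Box-Cox transformation requires strictly positive values, and caret's preProcess skips predictors it cannot estimate a lambda for, so it is worth checking how many columns contain zeros; a quick sketch:
# Sketch: count predictors with non-positive values, which BoxCox will leave untransformed.
sum(sapply(dplyr::select(chem_imputed, -Yield), function(x) any(x <= 0)))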
yield <- chem_imputed$Yield
all_x <- select(chem_imputed, - Yield)
samples <- createDataPartition(yield, p = .8, list = F)
cm_train <- all_x[samples, ]
cm_test <- all_x[-samples, ]
yield_train <- yield[samples]
yield_test<- yield[-samples]
PLS
bc_trans <- preProcess(cm_train, method = 'BoxCox')
cm_train_bc <- predict(bc_trans, cm_train)
pls_test <- train(x = cm_train_bc,
y = yield_train,
method = 'pls',
tuneGrid = data.frame(ncomp = seq(1, 12, 1)),
trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 5),
preProcess = c('center', 'scale')
)
rslts <- pls_test$results
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| ncomp | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| 1 | 1.479064 | 0.4581292 | 1.163280 | 0.4062373 | 0.1938449 | 0.2566277 |
| 2 | 1.911005 | 0.4586269 | 1.231900 | 1.3224469 | 0.2699577 | 0.4995333 |
| 3 | 1.591002 | 0.5186443 | 1.137000 | 0.9940874 | 0.2159254 | 0.3566610 |
| 4 | 1.959623 | 0.5244660 | 1.225299 | 2.0875400 | 0.2185787 | 0.6260072 |
| 5 | 2.281869 | 0.5120782 | 1.314499 | 2.8707004 | 0.2322308 | 0.8474590 |
| 6 | 2.475436 | 0.5008570 | 1.379368 | 3.2514094 | 0.2363079 | 0.9367181 |
| 7 | 2.870926 | 0.4976666 | 1.489454 | 4.3670259 | 0.2393440 | 1.2370356 |
| 8 | 3.200531 | 0.4854751 | 1.599387 | 5.1668915 | 0.2431496 | 1.4526766 |
| 9 | 3.555623 | 0.4848262 | 1.704430 | 6.1139923 | 0.2416043 | 1.6943596 |
| 10 | 3.788721 | 0.4821295 | 1.766936 | 6.6883097 | 0.2396317 | 1.8436993 |
ggplot(rslts) + geom_point(aes(x = ncomp, y = RMSE))
ElasticNet
cm_train_m <- as.matrix(cm_train_bc)
grid <- expand.grid(.lambda = c(0,.01, .1), .fraction = seq(.1,.9,.1))
enet_test <- train(x = cm_train_m,
y = yield_train,
method = 'enet',
trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 5),
tuneGrid = grid,
preProcess = c('center', 'scale')
)
rslts <- enet_test$results %>%
arrange(RMSE) %>%
head(10)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| lambda | fraction | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|
| 0.10 | 0.2 | 1.245793 | 0.6103088 | 1.0307728 | 0.2213538 | 0.1535328 | 0.1849439 |
| 0.00 | 0.1 | 1.246954 | 0.5983720 | 0.9697345 | 0.3709042 | 0.1854045 | 0.2044358 |
| 0.10 | 0.3 | 1.254930 | 0.5851604 | 0.9907774 | 0.3126607 | 0.1830733 | 0.1989275 |
| 0.01 | 0.2 | 1.279426 | 0.5885692 | 0.9875599 | 0.4166481 | 0.1894413 | 0.2111994 |
| 0.01 | 0.1 | 1.282634 | 0.6053662 | 1.0584336 | 0.2258827 | 0.1502298 | 0.1848459 |
| 0.01 | 0.3 | 1.325078 | 0.5936338 | 0.9909098 | 0.5913607 | 0.1976089 | 0.2508474 |
| 0.10 | 0.4 | 1.329781 | 0.5871102 | 0.9920491 | 0.6230484 | 0.1996715 | 0.2541581 |
| 0.00 | 0.2 | 1.409977 | 0.5813539 | 1.0316194 | 0.7260290 | 0.2029704 | 0.2893751 |
| 0.01 | 0.4 | 1.427944 | 0.5955169 | 1.0201733 | 0.9905075 | 0.2047506 | 0.3793446 |
| 0.10 | 0.5 | 1.464479 | 0.5841312 | 1.0278469 | 1.0353664 | 0.2147077 | 0.3513381 |
Elastic net appears to perform about 15% better than the PLS regression by cross-validated RMSE. It is also notable that the PLS regression keeps only 1 component, though the best ENET fractions are similarly low (.1 to .2). We'll continue with the ENET model.
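If the two train() calls shared the same resampling folds (for example via a common set.seed before each call, or an explicit index in trainControl), caret's resamples() would give a cleaner paired comparison than eyeballing the tables; a sketch under that assumption:
# Sketch: paired comparison of cross-validated performance
# (only meaningful if both models were fit on identical resampling folds).
comp <- resamples(list(PLS = pls_test, ENET = enet_test))
summary(comp)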
preproc <- preProcess(cm_train, method = c('BoxCox', 'center', 'scale'))
cm_train_proc <- predict(preproc, cm_train) %>% as.matrix()
cm_test_proc <- predict(preproc, cm_test) %>% as.matrix()
enet_model_cm <- enet(x = cm_train_proc , y = yield_train, lambda = 0)
predicted <-predict(enet_model_cm, s= .1, newx = cm_train_proc, mode = 'fraction', type = 'fit') %>% as.data.frame()
preds_df <- data.frame(predicted = predicted$fit, actual = yield_train)
ggplot(preds_df) + geom_point(aes(x = predicted, y = actual)) + geom_smooth(aes(x = predicted, y = actual))
Our fit is fairly linear. If a considerable amount of nonlinearity remained after the Box-Cox transformation, we would be hard-pressed to find a fix through predictor transformations with this many variables to worry about, and the classic diagnostics for nonlinear relationships, like marginal model plots, are less effective here. We would have two main options, one traditional and one I've had some success with on smaller problems; both are sketched below.
# Brute-force option: append a squared version of every predictor.
square_it <- function(x) x^2
squared <- mutate_all(cm_train, square_it) %>%
rename_with(~ paste0(.x, '_sq')) # avoid duplicate column names in the cbind below
all_train <- cbind(cm_train_bc, squared) %>% as.matrix()
This brute-force approach to handling nonlinearity is pretty suboptimal; this is where nonlinear methods have their advantage.
grid<- expand.grid(eta = c(.01, 0.025, 0.1), nrounds = c(200, 500), max_depth = c(1,2,3), gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1)
gb_test <- train(x = cm_train_m,
y = yield_train,
method = 'xgbTree',
tuneGrid = grid,
trControl = trainControl(method = 'cv', number = 10),
verbose = F
)
rslts <- gb_test$results %>%
arrange(RMSE)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| eta | max_depth | gamma | colsample_bytree | min_child_weight | subsample | nrounds | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.025 | 3 | 0 | 1 | 1 | 1 | 500 | 1.029006 | 0.7030420 | 0.7869483 | 0.1539676 | 0.1313352 | 0.1051852 |
| 0.100 | 3 | 0 | 1 | 1 | 1 | 200 | 1.035521 | 0.6931972 | 0.7895210 | 0.1891608 | 0.1530216 | 0.1568017 |
| 0.100 | 3 | 0 | 1 | 1 | 1 | 500 | 1.037953 | 0.6914289 | 0.7890969 | 0.1876309 | 0.1518748 | 0.1574077 |
| 0.025 | 2 | 0 | 1 | 1 | 1 | 500 | 1.053282 | 0.6871506 | 0.8054012 | 0.1629783 | 0.1504761 | 0.1346482 |
| 0.100 | 2 | 0 | 1 | 1 | 1 | 200 | 1.055557 | 0.6803353 | 0.8061325 | 0.1737833 | 0.1440501 | 0.1474035 |
| 0.100 | 2 | 0 | 1 | 1 | 1 | 500 | 1.061814 | 0.6769401 | 0.8048745 | 0.1841962 | 0.1403903 | 0.1509016 |
| 0.025 | 3 | 0 | 1 | 1 | 1 | 200 | 1.102290 | 0.6929702 | 0.8370630 | 0.1201180 | 0.1241807 | 0.0776431 |
| 0.010 | 3 | 0 | 1 | 1 | 1 | 500 | 1.106824 | 0.6928414 | 0.8409648 | 0.1141102 | 0.1201056 | 0.0823961 |
| 0.100 | 1 | 0 | 1 | 1 | 1 | 200 | 1.126361 | 0.6429998 | 0.8733945 | 0.1743432 | 0.1631040 | 0.1311516 |
| 0.025 | 1 | 0 | 1 | 1 | 1 | 500 | 1.136768 | 0.6362508 | 0.8708829 | 0.1589034 | 0.1533987 | 0.1054712 |
The GB model improves on elastic net by roughly 17% in cross-validated RMSE.
# predict() returned infinite values with the original lambda = 0 fit, so we
# use lambda = .1, which was also the best cross-validated value.
enet_model_cm <- enet(x = cm_train_proc, y = yield_train, lambda = .1)
test_preds <- predict(enet_model_cm, s = .2, newx = cm_test_proc, mode = 'fraction', type = 'fit') %>% as.data.frame()
df1 <- data.frame(obs = yield_test, pred = test_preds$fit)
defaultSummary(df1)
## RMSE Rsquared MAE
## 1.2675071 0.5206818 1.0310448
The test-set metrics are quite similar to the cross-validated training estimates. We expect a small amount of degradation, and in this case it is under 10% by RMSE.
ggplot(df1) + geom_point(aes(x = pred, y = obs)) + geom_smooth(aes(x = pred, y = obs))
The nonlinearity in this plot is somewhat concerning, but with only about 32 test observations, we shouldn't put too much stock in it. In practice, though, I would probably consider a different model.
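Before switching, the boosted model's holdout performance could be checked the same way; it was trained on the Box-Cox-transformed matrix, so the test set needs the same transformation. A sketch (results not shown):
# Sketch: holdout performance of the xgboost model.
cm_test_bc <- predict(bc_trans, cm_test) # same Box-Cox step used for the training matrix
gb_preds <- predict(gb_test, newdata = as.matrix(cm_test_bc))
defaultSummary(data.frame(obs = yield_test, pred = gb_preds))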
imp <- varImp(enet_test)
plot(imp, top = 55)
imp <- varImp(enet_test)
plot(imp, top = 20)
Observations
We will single out the top variables and examine their relationships with the target. This is usually one of the first phases of modeling, but with so many variables, we postponed it until the list could be narrowed down.
importants <- imp$importance %>%
rownames_to_column() %>%
arrange(desc(Overall)) %>%
head(12)
only_importants <- chem_imputed[,importants$rowname]
only_importants$yield <- yield
library(gridExtra)
plot_it <- function(col){
ggplot(only_importants) + geom_point(aes_string(x = col, y = 'yield'), color = 'black') +
geom_smooth(aes_string(x = col, y = 'yield'))
}
# Plot every important predictor against yield (drop the yield column itself).
plots <- lapply(head(colnames(only_importants), -1), plot_it)
grid.arrange(grobs = plots, ncol = 4, nrow = 3)
We add a smooth to each plot to summarize the relationship.
Modeling should be an iterative process, and at this point we would use this information to improve our model building, noting the nonlinearity in several of the most important variables.
Recommendation
The manufacturing process settings can be optimized using this model if it is applied carefully. The relationship plots shown above are helpful background, but they do not account for the presence of the other variables; to help decide which levers to turn, added-variable plots would be more appropriate. The final decision should be made by plugging candidate predictor settings into the model, rather than by reading the plots alone.
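As a concrete illustration of plugging candidate settings into the model, a single hypothetical change to one process variable could be scored like this (a sketch; ManufacturingProcess32 and the 5% bump are purely illustrative, not a recommendation):
# Sketch: predicted yield for a hypothetical tweak to one process setting.
new_run <- cm_train[1, ] # start from an observed production run
new_run$ManufacturingProcess32 <- new_run$ManufacturingProcess32 * 1.05 # hypothetical +5%
new_run_proc <- predict(preproc, new_run) %>% as.matrix() # same preprocessing as training
predict(enet_model_cm, s = .2, newx = new_run_proc, mode = 'fraction', type = 'fit')$fit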