Regression
- Univariate regression
- Multivariate Regression
Logistic Regression and Discriminant Analysis
Advanced Feature Selection in Linear Models - Regularization
Chapter 4. More Classification Techniques - K-Nearest Neighbors and Support Vector Machines
- K-Nearest Neighbors
  - Business understanding
  - Modeling and evaluation
Classification and Regressions Trees
Neural Networks and Deep Learning
Cluster Analysis

Regression

Univariate regression

require(alr3)
require(ggplot2)
require(dplyr)
data("snake")
dim(snake)
head(snake)
names(snake) <- c("content", "yield")

Draw scatterplot

snake %>% 
  ggplot(aes(x = content, y = yield)) + 
  geom_point()

Fit a model and explore summary

yield.fit <- lm(yield ~ content, data = snake)
summary(yield.fit)

Fit a regression line

snake %>% 
  ggplot(aes(x = content, y = yield)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Check for linear model assumptions

#check for linear model assumptions
par(mfrow = c(2, 2))
plot(yield.fit)

library(car)
qqPlot(yield.fit)

Multivariate Regression

data(water)
glimpse(water)

Don’t need a Year column, so let’s get rid of it!

social.water <- water[ ,-1]
head(social.water)

Let’s evaluate the correlation statistics and product a matrix of scatterplots

library(corrplot)
water.cor <- cor(social.water)
corrplot(water.cor, method = "ellipse")

The response variable is highly and positively correlated with the OP features with OPBPC as 0.8857, OPRC as 0.9196, and OPSLAKE as 0.9384. Also note that the AP features are highly correlated with each other and the OP features as well. The implication is that we may run into the issue of multi-collinearity.

Modeling and evaluation

We will use the leaps package - best subsets regression methods.

Forward stepwise selection Starts with a model that has zero features; it then adds the features one at a time until all the features are added. A selected feature is added in the process that creates a model with the lowest RSS. So in theory, the first feature selected should be the one that explains the response variable better than any of the others, and so on.
Backward stepwise selection Begins with all the features in the model and removes the least useful, one at a time
Hybrid approach The features are added through forward stepwise regression, but the algorithm then examines if any features that no longer improve the model fit can be removed. Once the model is built, the analyst can examine the output and use various statistics to select the features they believe provide the best fit.
Best subsets models The algorithm fits a model for all the possible feature combinations; so if you have 3 features, 7 models will be created

Build a model

library(leaps)
fit  <- lm(BSAAM ~ ., data = social.water)
summary(fit)

Create subsets with regsubsets() function from leaps package:

sub.fit <- regsubsets(BSAAM ~., data = social.water)

Examine model further - list outputs available with names():

best.summary <- summary(sub.fit)
names(best.summary)

Explore model with lowest RSS and highest R-squared:

which.min(best.summary$rss)
which.max(best.summary$adjr2)

Adding new features always decrease RSS and increase R-squared.

There’re 4 main statistical methods for choosing the number of features: * Aikake’s Information Criterion (AIC) * Mallow’s Cp (Cp) * Bayesian Information Criterion (BIC) * Adjusted R-squared

Plot the CP

par(mfrow = c(1,2))
plot(best.summary$cp, xlab = "number of features", ylab = "cp")
plot(sub.fit, scale = "Cp")

In the plot on the left-hand side, the model with three features has the lowest cp. The plot on the right-hand side displays those features that provide the lowest Cp. The way to read this plot is to select the lowest Cp value at the top of the y axis, which is 1.2. Then, move to the right and look at the colored blocks corresponding to the x axis. Doing this, we see that APSLAKE, OPRC, and OPSLAKE are the features included in this specific model.

Check max value of adjusted R-squared and min value of BIC.

which.max(best.summary$adjr2)
which.min(best.summary$bic)

Create a model with three selected features:

best.fit <- lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = social.water)
summary(best.fit)

Produce diagnostic plots:

par(mfrow = c(2, 2))
plot(best.fit)

Investigate the issue of collinearity with Variance Inflation Factor (VIF) statistics - vif() from car package:

vif(best.fit)

There’s a certain degree of collinearity between OPRC and OPSLAKE. Let’s explore it by plotting this two variables:

plot(social.water$OPRC, social.water$OPSLAKE, xlab = "OPRC", ylab = "OPSLAKE")

There’s a clear linear relationship between two variables.

The simple solution to address collinearity is to drop the variables to remove the problem without compromising the predictive ability. If we look at the adjusted R-squared from the best subsets, we can see that the two-variable model of APSLAKE and OPSLAKE has produced a value of 0.90, while adding OPRC has only marginally increased it to 0.92

best.summary$adjr2 #adjusted r-squared values

Let’s create two-variable model and test it’s assumptions:

fit.2 <- lm(BSAAM ~ APSLAKE + OPSLAKE, data = social.water)
summary(fit.2)
par(mfrow = c(2, 2))
plot(fit.2)

Let’s run vif() again.

vif(fit.2)

Run Breusch-Pagan (BP) test to formally check the assumption of the constant variance of errors - lmtest::bptest():

library(lmtest)
bptest(fit.2)

There’s no evidence to reject a null hypothesis that error variances are zero

Plot Predicted vs Actual values:

social.water["Actual"] <- water$BSAAM #create the vector Actual
social.water$Forecast = predict(fit.2) #populate Forecast with the predicted values

library(ggplot2)
ggplot(social.water, aes(x = Forecast, y = Actual)) + 
  geom_point() + geom_smooth(method = lm) + 
  labs(title = "Forecast versus Actuals")

Qualitative features

For this example, we will predict the sales of Carseats using just Advertising, a quantitative feature and the qualitative feature ShelveLoc, which is a factor of three levels: Bad, Good, and Medium. With factors, R will automatically code the indicators for the analysis. We build and analyze the model as follows:

library(ISLR)
data(Carseats)
sales.fit <- lm(Sales ~ Advertising + ShelveLoc, data = Carseats)
summary(sales.fit)

If the shelving location is good, the estimate of sales is almost double of that when the location is bad, given an intercept of 4.89662. To see how R codes the indicator features, you can use the contrasts() function:

contrasts(Carseats$ShelveLoc)

Interaction Terms

An example is available in the MASS package with the Boston dataset. The response is the median home value, which is medv in the output. We will use two features: the percentage of homes with a low socioeconomic status, which is termed lstat, and the age of the home in years, which is termed age in the following output:

library(MASS)
data(Boston)
str(Boston)

Using **feature1*feature2** with the lm() function in the code puts both the features as well as their interaction term in the model:

value.fit <- lm(medv ~ lstat * age, data = Boston)
summary(value.fit)

Examining the output, we can see that, while the socioeconomic status is a highly predictive feature, the age of the home is not. However, the two features have a significant interaction to positively explain the home value.

Logistic Regression and Discriminant Analysis

Logistic function used in logistic regression:

\[Probability \ of\ Y = \frac{e^{B_0 + B_{1}x}} {1 - e^{B_0 + B_{1}x}}\]

The logistic function can be turn to odds via:

\[Odds = \frac{Probability(Y)} {1 - Probability(Y)}\] One way to look at the relationship of logistic regression with linear regression is to show logistic regression as the log odds or \[log (P(Y) / (1 - P(Y)))\] is equal to \[B_0 + B_{1}x\]. The coefficients are estimated using a maximum likelihood instead of the OLS. The intuition behind the maximum likelihood is that we are calculating the estimates for Bo and B1, which will create a predicted probability for an observation that is as close as possible to the actual observed outcome of Y, a so-called likelihood.

Data understanding and preparation

Dataset (MASS::biopsy) consists of tissue samples from 699 patients. It is in a data frame with 11 variables, as follows: * ID: Sample code number * V1: Thickness * V2: Uniformity of the cell size * V3: Uniformity of the cell shape * V4: Marginal adhesion * V5: Single epithelial cell size * V6: Bare nucleus (16 observations are missing) * V7: Bland chromatin * V8: Normal nucleolus * V9: Mitosis * class: Whether the tumor diagnosis is benign or malignant; this will be the outcome that we are trying to predict

The medical team has scored and coded each of the nine features on a scale of 1 to 10.

#examine the structure of data
library(MASS)
data(biopsy)
str(biopsy)

#prepare data
## Get rid of ID column
biopsy$ID <- NULL

#rename columns
names(biopsy) <- c("thick", "u.size", "u.shape",
"adhsn", "s.size", "nucl", "chrom", "n.nuc",
"mit", "class")

#get rid of missing values
biopsy.v2 <- na.omit(biopsy)

#recode class to numeric variable
y <- ifelse(biopsy.v2$class == "malignant", 1, 0)

#Plot data
library(reshape2)
library(ggplot2)
#The following code melts the data by their values into one overall feature and groups them by class
biop.m <- reshape2::melt(biopsy.v2, id.var = "class")


ggplot(data = biop.m, aes(x = class, y = value)) + 
  geom_boxplot() + 
  facet_wrap(~ variable, ncol = 3)

#Explore correlation
library(corrplot)
bc <- cor(biopsy.v2[ ,1:9]) #create an object of the features
corrplot.mixed(bc)

The correlation coefficients are indicating that we may have a problem with collinearity, in particular, the features of uniform shape and uniform size that are present. As part of the logistic regression modeling process, it will be necessary to incorporate the VIF analysis as we did with linear regression.

#create train and test set
set.seed(123) #random number generator
ind <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))
train <- biopsy.v2[ind == 1, ] #the training data set
test <- biopsy.v2[ind == 2, ] #the test data set
str(test) #confirm it worked
#To ensure that we have a well-balanced outcome variable between the two datasets, we will perform the following check:
table(train$class)
table(test$class)

Modelling and Evaluation

For this part of the process, we will start with a logistic regression model of all the input variables and then narrow down the features with the best subsets. After this, we will try our hand at discriminant analysis and Multivariate Adaptive Regression Splines (MARS).

We’ve already discussed the theory behind logistic regression, so we can begin fitting our models. An R installation comes with the glm() function fitting the generalized linear models, which are a class of models that includes logistic regression. The code syntax is similar to the lm() function that we used in the previous chapter. One big difference is that we must use the family = binomial argument in the function, which tells R to run a logistic regression method instead of the other versions of the generalized linear models. We will start by creating a model that includes all of the features on the train set and see how it performs on the test set, as follows:

full.fit <- glm(class ~ ., family = binomial, data = train)
summary(full.fit)

We can see that only two features have p-values less than 0.05 (thickness and nuclei). An examination of the 95 percent confidence intervals can be called on with the confint() function, as follows:

confint(full.fit)

Note that the two significant features have confidence intervals that do not cross zero. You cannot translate the coefficients in logistic regression as the change in Y is based on a one-unit change in X. This is where the odds ratio can be quite helpful. The beta coefficients from the log function can be converted to odds ratios with an exponent (beta).

In order to produce the odds ratios in R, we will use the following exp(coef()) syntax:

exp(coef(full.fit))

The interpretation of an odds ratio is the change in the outcome odds resulting from a unit change in the feature. If the value is greater than 1, it indicates that, as the feature increases, the odds of the outcome increase. Conversely, a value less than 1 would mean that, as the feature increases, the odds of the outcome decrease. In this example, all the features except u.size will increase the log odds.

One of the issues pointed out during data exploration was the potential issue of multi-collinearity. It is possible to produce the VIF statistics that we did in linear regression with a logistic model in the following way:

library(car)
vif(full.fit)

None of the values are greater than the VIF rule of thumb statistic of five, so collinearity does not seem to be a problem.

The next task is feature selection; but, for now, let’s produce some code to look at how well this model does on both the train and test sets.

You will first have to create a vector of the predicted probabilities, as follows:

train.probs <- predict(full.fit, type = "response")
train.probs[1:5] #inspect the first 5 predicted probabilities

Next, we need to evaluate how well the model performed in training and then evaluate how it fits on the test set. A quick way to do this is to produce a confusion matrix.

library(InformationValue)
trainY <- y[ind == 1]
testY <- y[ind == 2]
confusionMatrix(trainY, train.probs)

#We can also take a look at the error rate, as follows
misClassError(trainY, train.probs)

Let’s explore prediction on a test set:

test.probs <- predict(full.fit, newdata = test, type = "response")
misClassError(testY, test.probs)
confusionMatrix(testY, test.probs)

Logistic Regression with Cross-Validation

The purpose of cross-validation is to improve our prediction of the test set and minimize the chance of overfitting. With the K-fold cross-validation, the dataset is split into K equal-sized parts. The algorithm learns by alternatively holding out one of the K-sets; it fits a model to the other K-1 parts, and obtains predictions for the left-out K-set. The results are then averaged so as to minimize the errors, and appropriate features are selected. You can also perform the Leave-One-Out-Cross-Validation (LOOCV) method, where K is equal to 1. Simulations have shown that the LOOCV method can have averaged estimates that have a high variance. As a result, most machine learning experts will recommend that the number of K-folds should be 5 or 10.

An R package that will automatically do CV for logistic regression is the bestglm package. This package is dependent on the leaps package that we used for linear regression.

After loading the package, we will need our outcome coded to 0 or 1. If left as a factor, it will not work. The other requirement to utilize the package is that your outcome, or y, is the last column and all the extraneous columns have been removed. A new data frame will do the trick for us, via the following code:

library(bestglm)
X <- train[, 1:9]
Xy <- data.frame(cbind(X, trainY))

Here is the code to run in order to use the CV technique with our data:

bestglm(Xy = Xy, IC = "CV", CVArgs = list(Method = "HTF", K = 10, REP = 1), 
        family=binomial)

The syntax, Xy = Xy, points to our properly formatted data frame. IC = "CV" tells the package that the information criterion to use is cross-validation. CVArgs are the CV arguments that we want to use. The HTF method is K-fold, which is followed by the number of folds, K = 10, and we are asking it to do only one iteration of the random folds with REP = 1.

So, after running the analysis, we will end up with the following output, giving us three features for Best Model, such as thick, u.size, and nucl. The statement on Morgan-Tatar search simply means that a simple exhaustive search was done for all the possible subsets

We can put these features in glm() and then see how well the model did on the test set. A predict() function will not work with bestglm, so this is a required step:

reduce.fit <- glm(class ~ thick + u.size + nucl, family = binomial, data = train)

Compare predicted values with actual ones:

test.cv.probs <- predict(reduce.fit, newdata = test, type = "response")
misClassError(testY, test.cv.probs)
confusionMatrix(testY, test.cv.probs)

The reduced feature model is slightly less accurate than the full feature model, but all is not lost. We can utilize the bestglm package again, this time using the best subsets with the information criterion set to BIC:

bestglm(Xy = Xy, IC = "BIC", family = binomial)

These four features provide the minimum BIC score for all possible subsets. Let’s try this and see how it predicts the test set, as follows:

bic.fit <- glm(class ~ thick + adhsn + nucl + n.nuc, 
               family = binomial, data = train)
test.bic.probs = predict(bic.fit, newdata = test, type = "response")
misClassError(testY, test.bic.probs)
confusionMatrix(testY, test.bic.probs)

Discriminant Analysis Overview

Discriminant Analysis (DA), also known as Fisher Discriminant Analysis (FDA), is another popular classification technique. It can be an effective alternative to logistic regression when the classes are wellseparated.

We’ll explore two type of DA - Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA).

DA utilizes Baye’s theorem in order to determine the probability of the class membership for each observation. If you have two classes, for example, benign and malignant, then DA will calculate an observation’s probability for both the classes and select the highest probability as the proper class.

Bayes’ theorem states that the probability of Y occurring–given that X has occurred–is equal to the probability of both Y and X occurring, divided by the probability of X occurring, which can be written as follows:

\[Probability \ of\ Y | X = \frac{P(X \ and\ Y)} {P(X)}\] The numerator in this expression is the likelihood that an observation is from that class level and has these feature values. The denominator is the likelihood of an observation that has these feature values across all the levels. Again, the classification rule says that if you have the joint distribution of X and Y and if X is given, the optimal decision about which class to assign an observation to is by choosing the class with the larger probability (the posterior probability).

The process of attaining posterior probabilities goes through the following steps: 1. Collect data with a known class membership. 2. Calculate the prior probabilities; this represents the proportion of the sample that belongs to each class. 3. Calculate the mean for each feature by their class. 4. Calculate the variance–covariance matrix for each feature; if it is an LDA, then this would be a pooled matrix of all the classes, giving us a linear classifier, and if it is a QDA, then a variance– covariance created for each class. 5. Estimate the normal distribution (Gaussian densities) for each class. 6. Compute the discriminant function that is the rule for the classification of a new object. 7. Assign an observation to a class based on the discriminant function.

Discriminant Analysis Application

LDA is performed in the MASS package, which we have already loaded so that we can access the biopsy data. The syntax is very similar to the lm() and glm() functions.

LDA

We can now begin fitting our LDA model, which is as follows:

lda.fit <- lda(class ~ ., data = train)
lda.fit

This output shows us that Prior probabilities of groups are approximately 64 percent for benign and 36 percent for malignancy. Next is Group means. This is the average of each feature by their class. Coefficients of linear discriminants are the standardized linear combination of the features that are used to determine an observation’s discriminant score. The higher the score, the more likely that the classification is malignant. The plot() function in LDA will provide us with a histogram and/or the densities of the discriminant scores, as follows:

plot(lda.fit, type = "both")

We can see that there is some overlap in the groups, indicating that there will be some incorrectly classified observations.

The predict() function available with LDA provides a list of three elements: class, posterior, and x. The class element is the prediction of benign or malignant, the posterior is the probability score of x being in each class, and x is the linear discriminant score.

Let’s just extract the probability of an observation being malignant:

train.lda.probs <- predict(lda.fit)$posterior[, "malignant"]
misClassError(trainY, train.lda.probs)
confusionMatrix(trainY, train.lda.probs)

Well, unfortunately, it appears that our LDA model has performed much worse than the logistic regression models. The primary question is to see how this will perform on the test data

test.lda.probs <- predict(lda.fit, newdata = test)$posterior[, "malignant"]
misClassError(testY, test.lda.probs)
confusionMatrix(testY, test.lda.probs)

QDA

In R, QDA is also part of the MASS package and the function is qda(). Building the model is rather straightforward again, and we will store it in an object called qda.fit, as follows:

qda.fit <- qda(class ~ ., data = train)
qda.fit

As with LDA, the output has Group means but does not have the coefficients because it is a quadratic function.

The predictions for the train and test data follow the same flow of code as with LDA:

train.qda.probs <- predict(qda.fit)$posterior[, "malignant"]
misClassError(trainY, train.qda.probs)
confusionMatrix(trainY, train.qda.probs)

test.qda.probs <- predict(qda.fit, newdata = test)$posterior[, 2]
misClassError(testY, test.qda.probs)
confusionMatrix(testY, test.qda.probs)

Multivariate Adaptive Regression Splines (MARS)

What is MARS? 1. Start with linear or generalized linear model 2. To capture any nonlinear relationship the hinge function is added.

These hinges are simply points in the input feature that equate to a coefficient change. For example, say we have Y = 12.5 (our intercept) + 1.5(variable 1) + 3.3(variable 2); where variables 1 and 2 are on a scale of 1 to 10. Now, let’s see how a hinge function for variable 2 could come into play: Y = 11 (new intercept) + 1.5(variable 1) + 4.26734(max(0, variable 2 - 5.5) Thus, we read the hinge function as: we take the maximum of either 0 or variable 2 minus 5.50. So, whenever variable 2 has a value greater than 5.5, that value will be multiplied times the coefficient; otherwise, it will be zero.

Automatic variable selection

This can be done with crossvalidation, but the default is to build through a forward pass, then a backward pass to prune the model, which after the forward pass is likely to overfit the data. This backward pass prunes input features and removes hinges based on Generalized Cross Validation (GCV):

\[GCV = RSS / (N * (1 - Effective Number \ of \ Parameters / N)^2)\] \[Effective \ Number\ of\ Parameters = Number\ of\ Input\ Features\ + Penalty * (Number\ of\ Input\ Features - 1) / 2\]

In the earth package, Penalty = 2 for additive model and 3 for multiplicative model, that is one with interaction terms.

Parameters tuning

Important to specify: * how you want the model pruned * that it is a binomial response variable

For example specify a model selection of a five-fold cross validation (pmethod = "cv" and nfold = 5), repeated 3 times (ncross = 3), as an additive model only with no interactions (degree = 1) and only one hinge per input feature (minspan = -1).

library(earth)
set.seed(1)
earth.fit <- earth(class ~ ., data = train,
                   pmethod = "cv",
                   nfold = 5,
                   ncross = 3,
                   degree = 1,
                   minspan = -1,
                   glm = list(family = binomial)
                   )

The model gives us eight terms, including the Intercept and seven predictors. Two of the predictors have hinge functions–thickness and chromatin. If the thickness is greater than 3, the coefficient of 0.7019 is multiplied by that value; otherwise, it is 0. For chromatin, if less than 3 then the coefficient is multiplied by the values; otherwise, it is 0.

Plot data with plotmo() and plotd()

plotmo() produces plots showing the model’s response when varying that predictor and holding the others constant.

plotmo(earth.fit)

plotd() produces density plots of the predicted probabilities by class label

plotd(earth.fit)

Let’s explore relative variable importance:

evimp(earth.fit)

nsubsets - number of model subsets that include the variable after the pruning pass gcv and rss show the decrease in the respective value that the variable contributes (scaled 0 to 100)

Let’s see how model performs on the test set:

test.earth.probs <- predict(earth.fit, newdata = test, type = "response")
misClassError(testY, test.earth.probs)
confusionMatrix(testY, test.earth.probs)

Model selection with ROC and AUC

An effective tool for a classification model comparison is the Receiver Operating Characteristic (ROC) chart. On the ROC chart, the y-axis is the True Positive Rate (TPR) and the x-axis is the False Positive Rate (FPR).

\[TPR = Positives\ correctly\ classified / total\ positives\] \[FPR = Negatives\ correctly\ classified / total\ negatives\]

Plotting the ROC results will generate a curve, and thus you are able to produce the Area Under the Curve (AUC).

Use ROCR package to create ROC curve. 1. Create an object that saves the predicted probabilities with the actual classification 2. Use this object to create another with calculated TPR and FPR. 3. Build a chart with plot().

library(ROCR)
pred.full <- prediction(test.probs, test$class)
perf.full <- performance(pred.full, "tpr", "fpr")
plot(perf.full, main = "ROC", col = 1)

Explore other models:

pred.bic <- prediction(test.bic.probs, test$class)
perf.bic <- performance(pred.bic, "tpr", "fpr")
plot(perf.bic, col = 2, add = TRUE)

The add = TRUE parameter in the plot command adds the line to the existing chart.

The final thing is computing AUC. This is done with creation of a performance object.

performance(pred.full, "auc")@y.values
performance(pred.bic, "auc")@y.values

Advanced Feature Selection in Linear Models - Regularization

Ringe regression
Least Absolute Shrinkage and Selection Operator (LASSO)
Elastic net

Regularization in a nutshell

Linear model follows the form \[Y = B_0 + B_{1}x_{1}+...+B_{n}x_{n} + e\], and also at the best fit tries to minimize the RSS, which is the sum of the squared errors of the actual minus the estimate, or \[e_{1}^2+e_{2}^2+...+e_{n}^2\].

With regularization we apply the shrinkage penalty in conjunction with minimization of RSS. This penalty consists of a lambda (symbol λ), along with the normalization of the beta coefficients and weights.

Why does regularization works: * It’s computationaly efficient * Regularization through the proper selection of lambda and normalization may help you improve the model fit by optimizing the bias-variance trade-off * regularization of coefficients works to solve multi-collinearity problems.

Ridge regression

The normalization term is sum of squared weights (L2-norm). Our model tries to minimize \[RSS + 𝜆(sumB_{j}^2)\]

As lambda increases, the coefficients shrink toward zero but never become zero. The benefit may be an improved predictive accuracy, but as it does not zero out the weights for any of your features, it could lead to issues in the model’s interpretation and communication. To help with this problem, we will turn to LASSO.

LASSO

LASSO applies L1-norm instead of L2-norm as in ridge regression, which is the sum of the absolute value of the feature weights and thus minimizes \[RSS + 𝜆(sum|B_{j|})\] This shrinkage penalty will indeed force a feature weight to zero. This is a clear advantage over ridge regression, as it may greatly improve the model interpretability.

LASSO and Ridge Regression

In a situation of high collinearity or high pairwise correlations, LASSO may force a predictive feature to zero and thus you can lose the predictive ability; that is, say if both feature A and B should be in your model, LASSO may shrink one of their coefficients to zero.

Elastic Net

Combines L1- and L2-norm introducing a mixing parameter aplha in conjunction with lambda.

The goal is to minimize \[RSS + λ[(1 - alpha)(sum|B_{j}|^2) + alpha(sum|B_{j}|^2)]/N\]

Data preparation

#Load packages
library(ElemStatLearn) #contains the data
library(car) #package to calculate Variance Inflation Factor
library(corrplot) #correlation plots
library(leaps) #best subsets regression
library(glmnet) #allows ridge regression, LASSO and elastic net
library(caret) #this will help identify the appropriate parameters

#Explore data
data(prostate)
str(prostate)
plot(prostate)

#let's explore gleason
plot(prostate$gleason)
table(prostate$gleason)

8 and 9 scores are under-represented. The options to deal with them are:

Exclude the feature altogether
Remove the scores of 8 and 9
Recode this feature, creating an indicator variable

#Let's choose the third option
prostate$gleason <- ifelse(prostate$gleason == 6, 0, 1)
table(prostate$gleason)

#Explore data with corrplot
p.cor <- cor(prostate)
corrplot.mixed(p.cor)

Let’s start the learning:

#create train and test set
train <- subset(prostate, train == TRUE)[, 1:9]
str(train)
test <- subset(prostate, train == FALSE)[,1:9]
str(test)

Modelling and evaluation

#Create best subset object
subfit <- regsubsets(lpsa ~., data = train)
b.sum <- summary(subfit)

#find the best subset with the minimun BIC
which.min(b.sum$bic)

#plot BIC performance across subsets
plot(b.sum$bic, type = "l", xlab = "# of Features", ylab = "BIC", 
     main = "BIC score by Feature Inclusion")

#more detailed examination
plot(subfit, scale = "bic", main = "Best Subset Features")

So, the previous plot shows us that the three features included in the lowest BIC are lcavol, lweight, and gleason. It is noteworthy that lcavol is included in every combination of the models.

Before trying this model on test data, let’s explore the linearity in the solution and check the constancy of the variance (by plotting the fitted values vs actual).

ols <- lm(lpsa ~ lcavol + lweight + gleason, data = train)
plot(ols$fitted.values, train$lpsa, xlab = "Predicted", ylab = "Actual", 
     main = "Predicted vs Actual")

An inspection of the plot shows that a linear fit should perform well on this data and that the non-constant variance is not a problem.

Let’s explore predicted values:

pred.subfit <- predict(ols, newdata = test)
plot(pred.subfit, test$lpsa , xlab = "Predicted", 
     ylab = "Actual", main = "Predicted vs Actual")

Before concluding this section, we will need to calculate Mean Squared Error (MSE) to facilitate comparison across the various modeling techniques.

resid.subfit <- test$lpsa - pred.subfit
mean(resid.subfit^2)

Ridge regression

We’ll have all eight features in the model
We’ll be using glmnet package
Input features should be as matrix and not data.frame
Syntax is glmnet(x = our input matrix, y = our response, family = the distribution, alpha = 0) (aplha = 0 for ridge, 1 for LASSO)

#prepare data
x <- as.matrix(train[, 1:8])
y <- train[, 9]

#run ridge regression
library(glmnet)
ridge <- glmnet(x, y, family = "gaussian", alpha = 0)

#Explore results
print(ridge)

plot(ridge, label = TRUE)

The 100th row shows us that 8 features is included, with .6971 explained and lambda equals to .08789.

The y axis is the value of Coefficients and the x axis is L1 Norm

plot(ridge, xvar = "lambda", label = TRUE)

This is a worthwhile plot as it shows that as lambda decreases, the shrinkage parameter decreases and the absolute values of the coefficients increase. To see the coefficients at a specific lambda value, use the coef() command. Here, we will specify the lambda value that we want to use by specifying s=0.1. Set exact=TRUE, which tells glmnet to fit a model with that specific lambda value versus interpolating from the values on either side of our lambda.

ridge.coef <- coef(ridge, s = 0.1, exact = TRUE)
ridge.coef

LASSO

lasso <- glmnet(x, y, family = "gaussian", alpha = 1)
print(lasso)

the model building process stopped at step 69 as the deviance explained no longer improved as lambda decreased.

#plot the results
plot(lasso, xvar = "lambda", label = TRUE)

lasso.coef <- coef(lasso, s = 0.045, exact = TRUE)
lasso.coef
newx <- as.matrix(test[, 1:8])
lasso.y <- predict(lasso, newx =  newx, 
                   type = "response", s = 0.045)
plot(lasso.y, test$lpsa, xlab = "Predicted", ylab = "Actual", 
     main = "LASSO")

Elastic Net

With caret package we are able to find an optimal set of lamda and elastic net parameter, alpha. Follow three steps: * use expand.grid() function in base R to create a vector of all possible combinations of alpha and lambda * use trainControl() function from caret package to determine the resampling method (we’ll use LOOCV) * train a model to select our alpha and lambda parameters using glmnet() in caret’s train function.

When we selected alpha and lambda we’ll apply them to our test set.

alpha and lambda values to try: * Alpha from 0 to 1 by 0.2 increments * Lambda from 0.00 to 0.2 by 0.02 increments ()

Create a vector with expand.grid()

grid <- expand.grid(.alpha = seq(0, 1, by = 0.2), .lambda = seq(0.00, 0.2, by = 0.02))
table(grid)

Then tell the model selection criteria with selectionFunction() in trainControl(). For quantitative responses, the algorithm will select based on its default of Root Mean Square Error (RMSE), which is perfect for our purposes.

control <- trainControl(method = "LOOCV")

Use train() to determine the optimal elastic net parameters. Function is similar to lm(), we only need to add the syntax: method = "glmnet", trControl = control, tuneGrid = grid.

enet.train <- train(lpsa ~., data = train, method = "glmnet", trControl = control, tuneGrid = grid)

RMSE was used to select the optimal model - alpha = 0 and lambda = 0.08. Let’s check ourselves.

enet.train$bestTune

The process for the test set validation is the same as before:

enet <- glmnet(x, y, family = "gaussian", alpha = 0, lambda = 0.08)
(enet.coef <- coef(enet, s = 0.08, exact = TRUE))

#predict y
enet.y <- predict(enet, newx = newx, type = "response", s = .08)
plot(enet.y, test$lpsa, xlab = "Predicted", ylab = "Actual", main = "Elastic Net")

#calculate the RMSE
enet.resid <- enet.y - test$lpsa
mean(enet.resid ^ 2)

Cross validation with glmnet

We’ll try k-fold validation with cv.glmnet(). In k-fold CV, the data is partitioned into equal number of folds and a separate model is build on each k-1 set and then tested on a corresponsing holdout set with the results averaged in order to determine the final parameters. In this method, each fold is used as a test set only one. glmnet provides you with an output values of lambda values and corresponding MSE. It defaults alpha = 1, so if you want to try ridge regression or an elastic net mix, you will need to specify it.

set.seed(317)
lasso.cv <- cv.glmnet(x, y, nfolds = 3)
plot(lasso.cv)

The plot for CV is quite different than the other glmnet plots, showing log(Lambda) versus Mean-Squared Error along with the number of features. The two dotted vertical lines signify the minimum of MSE (left line) and one standard error from the minimum (right line). One standard error away from the minimum is a good place to start if you have an over-fitting problem. You can also call the exact values of these two lambdas, as follows:

lasso.cv$lambda.min
lasso.cv$lambda.1se

Using lambda.1se we can go through the following process of viewing the coefficients and validating the model on the test data:

coef(lasso.cv, s = "lambda.1se")
lasso.y.cv <- predict(lasso.cv, newx = newx, type = "response", s = )

Regularization and classification

Let’s work with breast cancer example

library(MASS)
ind <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))
train <- biopsy.v2[ind == 1, ]
test <- biopsy.v2[ind == 2, ]
x <- as.matrix(train[, 1:9])
y <- train[, 10]
set.seed(3)
fitCV <- cv.glmnet(x, y, family = "binomial", type.measure = "auc", nfolds = 5)
plot(fitCV)

We can notice an immediate improvement by to AUC just by adding one feature. Let’s have a look at the coefficients for one standart error.

fitCV$lambda.1se
coef(fitCV, s = "lambda.1se")
library(InformationValue)
predCV <- predict(fitCV, newx = as.matrix(test[, 1:9]), 
                  s = "lambda.1se",
                  type = "response")
actuals <- ifelse(test$class == "malignant", 1, 0)
misClassError(actuals, predCV)
plotROC(actuals, predCV)

Using lambda.1se was suboptimal, let’s use lambda.min to improve the output of prediction:

predCV.min <- predict(fitCV, newx = as.matrix(test[, 1:9]),
                      s = "lambda.min",
                      type = "response")
misClassError(actuals, predCV.min)

Chapter 4. More Classification Techniques - K-Nearest Neighbors and Support Vector Machines

These methods are more sophisticated since we’re allowed to relax the assumptions of linearity, which means a linear combination of the features in order to define the decision boundary is not needed.

K-Nearest Neighbors

Use class package. kknn - weighted k-nearest neighbors ## Support Vector Machines

SVMs are linear separators with the largest possible margin and the support vectors the ones touching the safety margin region on both sides. In data that is not linearly separable, many observations will fall on the wrong side of the margin (the so-called slack variables), which is a misclassification. The key to building an SVM algorithm is to solve for the optimal number of support vectors via cross-validation. Any observation that lies directly on the wrong side of the margin for its class is known as a support vector.

An important aspect of SVM is the ability to model nonlinearity with quadratic or higher order polynomials of the input features. In SVMs, this is known as the kernel trick. It could be: * linear, no transformation * polynomial * radial basis function * sigmoid function

Packages e1071

Business understanding

Pima - diabetis - obesity Datasets are in packages MASS (Pima.tr, Pima.te) #### Data understanding and preparation

#load all libraries
library(class) #k-nearest neighbors
library(kknn) #weighted k-nearest neighbors
library(e1071) #SVM
library(caret) #select tuning parameters
library(MASS) #contains the data
library(reshape2) #assist in creating boxplots
library(ggplot2)
library(kernlab) #assist with SVM feature selection

#load data
data("Pima.tr")
data("Pima.te")

#combine into one df
pima <- rbind(Pima.tr, Pima.te)

#basic EDA
pima.melt <- tidyr::gather(pima, variable, value, -type)
pima.melt %>% 
  ggplot(aes(x = type, y = value)) + 
  geom_boxplot() + 
  facet_wrap(~variable, ncol = 2)

#scale the data
pima.scale <- data.frame(pima[, -8])
pima.scale$type <- pima$type

pima.scale.melt <- gather(pima.scale, variable, value, -type)

#check correlation
cor(pima.scale[-8])

table(pima.scale$type)

#create train/test split
set.seed(502)
ind <- sample(2, nrow(pima.scale), replace = TRUE, prob = c(0.7, 0.3))

train <- pima.scale[ind == 1, ]
test <- pima.scale[ind == 2, ]

Modeling and evaluation

KNN

Use expand.grid to identify the most appropriate number of K and cross-validation (from caret package)

grid1 <- expand.grid(.k = seq(2, 20, by = 1))
control <- trainControl(method = "cv")

Let’s compute the optimal k-value using train function from caret:

set.seed(502)
knn.train <- train(type ~ ., data = train, 
                   method = "knn",
                   trControl = control,
                   tuneGrid = grid1)
knn.train

The Kappa statistic is commonly used to provide a measure of how well two evaluators can classify an observation correctly. It provides an insight into this problem by adjusting the accuracy scores, which is done by accounting for the evaluators being totally correct by mere chance. The formula for the statistic is Kappa = (per cent of agreement - per cent of chance agreement) / (1 - per cent of chance agreement). The per cent of agreement is the rate that the evaluators agreed on for the class (accuracy), and percent of chance agreement is the rate that the evaluators randomly agreed on. The higher the Kappa, the better they performed with the maximum agreement being one.

Let’s calculate kappa for a given best choice scenario with k by using knn function from class package while providing to it four inputs: * train inputs * test inputs * train labels * k

knn.test <- knn(train[, -8], test[, -8], train[, 8], k = 17)
table(knn.test, test$type)

#calculate kappa
prob.agree <- (77 + 28)
prob.chance <- ((77 + 26) / 147) * ((77 + 16) / 147)
kappa <- (prob.agree - prob.chance) / (1 - prob.chance)
kappa

The following heuristic should help us with assessing strength of agreement

K < 0.20 - Poor
K 0.21 - 0.4 - Fair
K 0.41 - 0.6 - Moderate
K 0.61 - 0.8 - Good
K 0.81 - 1.0 - Very good

Let’s utilize weighted neighbors to see whether we can beat the accuracy and kappa results. A weighting schema increases the influence of neighbors that are closest to an observation versus those that are farther away. The farther away the observation is from a point in space, the more penalized its influence is.

For this purpose we’ll use the kknn package and its train.knn() function to select optimal weighting scheme.

The train.kknn() function uses LOOCV in order to select the best parameters for the optimal k neighbors, one of the two distance measures, and a kernel function.

There’s 10 weighting schemas for this package: * rectangular (unweighted), * triangular, * epanechnikov, * biweight, * triweight, * cosine, * inversion, * gaussian, * rank, * optimal.

set.seed(123)

#create train object
kknn.train <- train.kknn(type ~ ., data = train, 
                         kmax = 25, distance = 2,
                         kernel = c('rectangular', 'triangular', 'epanechnikov'))

plot(kknn.train)

#call kknn.train to see the summary
kknn.train

SVM Modeling

We will use the e1071 package to build our SVM models. The package has tune.svm() funtion, which assists in the selection of the tuning parameters/kernel functions. It uses cross-validation to optimize the tuning parameters.

linear.tune <- tune.svm(type ~., data = train, 
                        kernel = "linear",
                        cost = c(0.001, 0.01, 0.1, 1, 5, 10))

summary(linear.tune)
best.linear <- linear.tune$best.model
tune.test <- predict(best.linear, newdata = test)
table(tune.test, test$type)

With tune.svm() we can try different kernels

set.seed(123)
#polynomial
poly.tune <- tune.svm(type ~., data = train,
                      kernel = "polynomial",
                      degree = c(3, 4, 5), 
                      coef0 = c(0.1, 0.5, 1, 2, 3, 4))

summary(poly.tune)
best.poly <- poly.tune$best.model
poly.test <- predict(best.poly, newdata = test)
table(poly.test, test$type)

#radial
rbf.tune <- tune.svm(type ~ ., data = train,
                     kernel = "radial",
                     gamma = c(0.1, 0.5, 1, 2, 3, 4))

summary(rbf.tune)
best.rbf <- rbf.tune$best.model
rbf.test <- predict(best.rbf, newdata = test)
table(rbf.test, test$type)


#sigmoid
sigmoid.tune <- tune.svm(type ~ ., data = train,
                         kernel = "sigmoid",
                         gamma = c(0.1, 0.5, 1, 2, 3, 4),
                         coef0 = c(0.1, 0.5, 1, 2, 3, 4))
summary(sigmoid.tune)
best.sigmoid <- sigmoid.tune$best.model
sigmoid.test <- predict(best.sigmoid, newdata = test)
table(sigmoid.test, test$type)

Model selection

To compare models we can use confusionMatrix() from caret package.

confusionMatrix(sigmoid.test, test$type, positive = "Yes")
confusionMatrix(tune.test, test$type, positive = "Yes")

Explore the outputs: * Sensitivity (the true positive rate; in this case, the rate of those not having diabetes has been correctly identified as such.) * Specificity (the true negative rate or, for our purposes, the rate of a diabetic that has been correctly identified.) * Positive Prediction Value - is the probability of someone in the population classified as being diabetic and truly has the disease. * Negative Prediction Value - is the probability of someone in the population classified as not being diabetic and truly does not have the disease

Feature selection for SVM

Steps: 1. set random seed 2. specify cv method in caret’s rfeControl() function 3. perform a recursive feature selection with the rfe() function 4. test how model performs on a test set

There are several different functions that you can use. Here we will need lrFuncs.

set.seed(123)
rfeCNTL <- rfeControl(functions = lrFuncs, 
                      method = "cv",
                      number = 10)

svm.features <- rfe(train[, 1:7], train [, 8],
                    sizes = c(7, 6, 5, 4),
                    rfeControl = rfeCNTL,
                    method = "svmLinear")

svm.features

Build a model via svm() function from e1071 package:

#check on train data with 5 features

#build a model svm.5
svm.5 <- svm(type ~ glu + ped + npreg + bmi + age,
             data = train,
             kernel = "linear")

#predict based on a svm.5 model
svm.5.predict <- predict(svm.5, newdata = test[c(1, 2, 5, 6, 7)])
table(svm.5.predict, test$type)

Classification and Regressions Trees

Regression Tree

Based on prostate data

#load packages
library(rpart) #classification and regression trees
library(partykit) #treeplots
library(randomForest) #random forests
library(xgboost) #gradient boosting
library(caret) #tune hyper-parameters

#data
library(MASS) #breast and pima indian data
library(ElemStatLearn) #prostate data

#prepare prostate data
data("prostate")
prostate$gleason <- ifelse(prostate$gleason == 6, 0, 1)
pros.train <- subset(prostate, train == TRUE)[, 1:9]
pros.test <- subset(prostate, train == FALSE)[, 1:9]

Build a regression tree with rpart() function from rpart package and examine the results stored in $cptable:

tree.pros <- rpart(lpsa ~., data = pros.train)

#examine results
print(tree.pros$cptable)

CP (complexity parameter)
nsplit - number of splits
rel error - RSS for the number of splits divided by the RSS for no splits
xerror (based on ten-fold CV) - average error

Examine results visually with plotcp() function:

#plot tree
plotcp(tree.pros)

The plot shows us the relative error by the tree size with the corresponding error bars. The horizontal line on the plot is the upper limit of the lowest standard error. Selecting a tree size, 5, which is four splits, we can build a new tree object where xerror is minimized by pruning our tree by first creating an object for cp associated with the pruned tree from the table. Then the prune() function handles the rest:

cp <- min(tree.pros$cptable[5, ])
prune.tree.pros <- prune(tree.pros, cp = cp)

Let’s compare and plot full and pruned trees. Simply use as.party() from partykit package as a wrapper in plot():

plot(as.party(tree.pros))
plot(as.party(prune.tree.pros))

Let’s see how model performs on test data

party.pros.test <- predict(prune.tree.pros, newdata = pros.test)
rpart.resid <- party.pros.test - pros.test$lpsa
mean(rpart.resid^2) #calculate MSE

Classification tree

#prepare data
data(biopsy)
biopsy <- biopsy[, -1] #delete ID
names(biopsy) <- c("thick", "u.size", "u.shape", "adhsn",
"s.size", "nucl",
"chrom", "n.nuc", "mit", "class") #change the feature names
biopsy.v2 <- na.omit(biopsy) #delete the observations with missing values
set.seed(123) #random number generator
ind <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))
biop.train <- biopsy.v2[ind == 1, ] #the training data set
biop.test <- biopsy.v2[ind == 2, ] #the test data set

Create a tree and examine the table for the optimal number of splits:

set.seed(123)
tree.biop <- rpart(class ~., data = biop.train)

#explore cptable - complexity parameters table
tree.biop$cptable

The cross-validation error is at a minimum with only two splits (row 3). We can now prune the tree, plot the pruned tree, and see how it performs on the test set:

cp <- min(tree.biop$cptable[3, ])
prune.tree.biop <- prune(tree.biop, cp = cp)
plot(as.party(prune.tree.biop))

When predicting on test data use type = "class" in the predict() function:

#predict on test data
rparty.test <- predict(prune.tree.biop, newdata = biop.test, 
                       type = "class")
table(rparty.test, biop.test$class)

Random forest regression

Use randomForest::randomForest() with formula as a first argument and data as a second:

set.seed(123)
rf.pros <- randomForest(lpsa ~., data = pros.train)
rf.pros
plot(rf.pros)

Let’s identify the specific and optimal tree (with regards to mean squared error) with which.min() function:

which.min(rf.pros$mse)

set.seed(123)
rf.pros.2 <- randomForest(lpsa ~., data = pros.train, 
                          ntree = which.min(rf.pros$mse))
rf.pros.2

Let’s product a variable importance plot with randomForest::varImpPlot():

varImpPlot(rf.pros.2, scale = TRUE,
           main = "Variable Importance Plot - PSA Score")

To examine raw numbers, use randomForest::importance() function:

importance(rf.pros.2)

Check on a test data

rf.pros.test <- predict(rf.pros.2, newdata = pros.test)
rf.resid <- rf.pros.test - pros.test$lpsa #calculate residual
mean(rf.resid^2)

Random forest classification

set.seed(123)
rf.biop <- randomForest(class ~., data = biop.train)
rf.biop
plot(rf.biop)

Let’s find the minimum error rate

which.min(rf.biop$err.rate[ , 1]) #why the first column?

Only 19 trees are needed to optimize the model accuracy. Let’s try this and see how it performs:

set.seed(123)
rf.biop.2 <- randomForest(class ~., data = biop.train, 
                          ntree = 19)
rf.biop.2

varImpPlot(rf.biop.2)

Let’s perform the same procedure but with Pima dataset.

#prepare data
data(Pima.tr)
data(Pima.te)
pima <- rbind(Pima.tr, Pima.te)
set.seed(502)
ind <- sample(2, nrow(pima), replace = TRUE, prob = c(0.7, 0.3))
pima.train <- pima[ind == 1, ]
pima.test <- pima[ind == 2, ]

#fit random forest model
set.seed(321)
rf.pima <- randomForest(type ~., data = pima.train)
rf.pima

#predict
rf.pima.test <- predict(rf.pima, newdata = pima.test)
table(rf.pima.test, pima.test$type)

Extreme Gradient Boosting

We’ll be using xboost package and tuning the following parameters:

nrounds. The maximum number of iterations (number of trees in the final model),
colsample_bytree. The number of features, expressed as a ratio, to sample when building a tree. Default is 1 (100% of the features).
min_child_weight. The minimum weight in the trees being boosted. Default is 1.
eta. Learning rate, which is the contribution of each tree to the solution. Default is 0.3.
gamma. Minimum loss reduction required to make another leaf partition in a tree.
subsample. Ratio of data observations. Default is 1 (100%).
max_depth. Maximum depth of the individual trees.

Using the expand.grid() function, we’ll build a experimental grid. Beware: you need to specify even default parameters or you recieve an error.

grid <- expand.grid(nround = c(75,100),
                    colsample_bytree = 1,
                    min_child_weight = 1, 
                    eta = c(0.01, 0.1, 0.3),
                    gamma = c(0.5, 0.25),
                    subsample = 0.5,
                    max_depth = c(2, 3))

Before using caret::train() let’s specify trainControl argument

cntrl <- trainControl(method = "cv",
                      number = 5,
                      verboseIter = TRUE,
                      returnData = TRUE,
                      returnResamp = "final")

set.seed(1)
train.xgb <- train(x = pima.train[, 1:7],
                   y = pima.train[, 8],
                   trControl = cntrl,
                   method = "xgbTree")
train.xgb

What to do next: 1. create a list of parameters to use with xgboost::xgb.train() function 2. turn dataframe into a matrix of input features and a list of labeled outcomes 3. turn the features and labels into the input required, as xgb.Dmatrix.

param <- list(objective = "binary:logistic",
              booster = "gbtree", 
              eval_metric = "error",
              eta = 0.1,
              max_depth = 2,
              subsample = 0.5,
              colsample_bytree = 1,
              gamma = 0.5)

x <- as.matrix(pima.train[, 1:7])
y <- ifelse(pima.train$type  == "Yes", 1, 0)
train.mat <- xgb.DMatrix(data = x, label = y)

Create a model

set.seed(1)
xgb.fit <- xgb.train(params = param, 
                     data = train.mat,
                     nrounds = 75)

Then let’s the variable importance and also plot it.

impMatrix <- xgb.importance(feature_names = dimnames(x)[[2]],
                            model = xgb.fit)
impMatrix
xgb.plot.importance(impMatrix, main = "Gain by Feature")

The following metrics are result of xgb.importance() call.

gain. Improvement in the accuracy that feature brings to the branches it is on.
cover. Relative number of total observations related to this feature.
frequency. Per cent of times that feature occurs in all of the trees

Let’s see how we can perform on a test set - using the InformationValue package.

library(InformationValue)
pred <- predict(xgb.fit, x)
optimalCutoff(y, pred)

pima.testMat <- as.matrix(pima.test[, 1:7])
xgb.pima.test <- predict(xgb.fit, pima.testMat)
y.test <- ifelse(pima.test$type == "Yes", 1, 0)
1 - misClassError(y.test, xgb.pima.test, threshold = 0.39)

#xgb.pima.test <- ifelse(xgb.pima.test > 0.39, 1, 0) 

#pay attention - confusionMatrix exists in both in caret and InformationValue packages. the latter allows us to specify the threshold for numberic values in float32.
InformationValue::confusionMatrix(y.test, xgb.pima.test, threshold = 0.39)

InformationValue::optimalCutoff() provides the optimal probability threshold to minimize error.

Let’s plot ROC curve (with InformationValue::plotROC()):

plotROC(y.test, xgb.pima.test)

###Feature selection with random forests

We’ll be using Sonar data from mlbench package with 2 classes - Mine or Rock.

data(Sonar, package = "mlbench")
table(Sonar$Class)

Run an algorithm by using Boruta() function with formula, data, and doTrace = 1 to trace the progress of feature selection.

library(Boruta)
set.seed(1)
feature.selection <- Boruta(Class ~., data = Sonar, doTrace = 1)

After creating model let’s explore the table with the final importance decision.

table(feature.selection$finalDecision)

It’s simple to create a new dataframe with new features by using getSelectedAttributes() function. Note that function takes two arguments - model and withTentative = FALSE. And then subset the original dataframe.

fNames <- getSelectedAttributes(feature.selection)
Sonar.features <- Sonar[, fNames]

Neural Networks and Deep Learning

Skipped, let’s focus on Tensorflow and Keras.

R for Machine Learning

Regression

Univariate regression

Multivariate Regression

Modeling and evaluation

Qualitative features

Interaction Terms

Logistic Regression and Discriminant Analysis

Modelling and Evaluation

Logistic Regression with Cross-Validation

Discriminant Analysis Overview

LDA

QDA

Multivariate Adaptive Regression Splines (MARS)

Model selection with ROC and AUC

Advanced Feature Selection in Linear Models - Regularization

Ridge regression

LASSO

Elastic Net

Cross validation with glmnet

Regularization and classification

Chapter 4. More Classification Techniques - K-Nearest Neighbors and Support Vector Machines

K-Nearest Neighbors

Business understanding

Modeling and evaluation

KNN

SVM Modeling

Model selection

Feature selection for SVM

Classification and Regressions Trees

Regression Tree

Classification tree

Random forest regression

Random forest classification

Extreme Gradient Boosting

Neural Networks and Deep Learning

Cluster Analysis