require(alr3)
require(ggplot2)
require(dplyr)
data("snake")
dim(snake)
head(snake)
names(snake) <- c("content", "yield")
Draw scatterplot
snake %>%
ggplot(aes(x = content, y = yield)) +
geom_point()
Fit a model and explore summary
yield.fit <- lm(yield ~ content, data = snake)
summary(yield.fit)
Fit a regression line
snake %>%
ggplot(aes(x = content, y = yield)) +
geom_point() +
geom_smooth(method = "lm")
Check for linear model assumptions
#check for linear model assumptions
par(mfrow = c(2, 2))
plot(yield.fit)
library(car)
qqPlot(yield.fit)
data(water)
glimpse(water)
Don’t need a Year column, so let’s get rid of it!
social.water <- water[ ,-1]
head(social.water)
Let’s evaluate the correlation statistics and product a matrix of scatterplots
library(corrplot)
water.cor <- cor(social.water)
corrplot(water.cor, method = "ellipse")
The response variable is highly and positively correlated with the OP features with OPBPC as 0.8857, OPRC as 0.9196, and OPSLAKE as 0.9384. Also note that the AP features are highly correlated with each other and the OP features as well. The implication is that we may run into the issue of multi-collinearity.
We will use the leaps
package - best subsets regression methods.
Forward stepwise selection Starts with a model that has zero features; it then adds the features one at a time until all the features are added. A selected feature is added in the process that creates a model with the lowest RSS. So in theory, the first feature selected should be the one that explains the response variable better than any of the others, and so on.
Backward stepwise selection Begins with all the features in the model and removes the least useful, one at a time
Hybrid approach The features are added through forward stepwise regression, but the algorithm then examines if any features that no longer improve the model fit can be removed. Once the model is built, the analyst can examine the output and use various statistics to select the features they believe provide the best fit.
Best subsets models The algorithm fits a model for all the possible feature combinations; so if you have 3 features, 7 models will be created
Build a model
library(leaps)
fit <- lm(BSAAM ~ ., data = social.water)
summary(fit)
Create subsets with regsubsets()
function from leaps
package:
sub.fit <- regsubsets(BSAAM ~., data = social.water)
Examine model further - list outputs available with names()
:
best.summary <- summary(sub.fit)
names(best.summary)
Explore model with lowest RSS and highest R-squared:
which.min(best.summary$rss)
which.max(best.summary$adjr2)
Adding new features always decrease RSS and increase R-squared.
There’re 4 main statistical methods for choosing the number of features: * Aikake’s Information Criterion (AIC) * Mallow’s Cp (Cp) * Bayesian Information Criterion (BIC) * Adjusted R-squared
Plot the CP
par(mfrow = c(1,2))
plot(best.summary$cp, xlab = "number of features", ylab = "cp")
plot(sub.fit, scale = "Cp")
In the plot on the left-hand side, the model with three features has the lowest cp. The plot on the right-hand side displays those features that provide the lowest Cp. The way to read this plot is to select the lowest Cp value at the top of the y axis, which is 1.2. Then, move to the right and look at the colored blocks corresponding to the x axis. Doing this, we see that APSLAKE, OPRC, and OPSLAKE are the features included in this specific model.
Check max value of adjusted R-squared and min value of BIC.
which.max(best.summary$adjr2)
which.min(best.summary$bic)
Create a model with three selected features:
best.fit <- lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = social.water)
summary(best.fit)
Produce diagnostic plots:
par(mfrow = c(2, 2))
plot(best.fit)
Investigate the issue of collinearity with Variance Inflation Factor (VIF) statistics - vif()
from car
package:
vif(best.fit)
There’s a certain degree of collinearity between OPRC and OPSLAKE. Let’s explore it by plotting this two variables:
plot(social.water$OPRC, social.water$OPSLAKE, xlab = "OPRC", ylab = "OPSLAKE")
There’s a clear linear relationship between two variables.
The simple solution to address collinearity is to drop the variables to remove the problem without compromising the predictive ability. If we look at the adjusted R-squared from the best subsets, we can see that the two-variable model of APSLAKE and OPSLAKE has produced a value of 0.90, while adding OPRC has only marginally increased it to 0.92
best.summary$adjr2 #adjusted r-squared values
Let’s create two-variable model and test it’s assumptions:
fit.2 <- lm(BSAAM ~ APSLAKE + OPSLAKE, data = social.water)
summary(fit.2)
par(mfrow = c(2, 2))
plot(fit.2)
Let’s run vif()
again.
vif(fit.2)
Run Breusch-Pagan (BP) test to formally check the assumption of the constant variance of errors - lmtest::bptest()
:
library(lmtest)
bptest(fit.2)
There’s no evidence to reject a null hypothesis that error variances are zero
Plot Predicted vs Actual values:
social.water["Actual"] <- water$BSAAM #create the vector Actual
social.water$Forecast = predict(fit.2) #populate Forecast with the predicted values
library(ggplot2)
ggplot(social.water, aes(x = Forecast, y = Actual)) +
geom_point() + geom_smooth(method = lm) +
labs(title = "Forecast versus Actuals")
For this example, we will predict the sales of Carseats
using just Advertising
, a quantitative feature and the qualitative feature ShelveLoc
, which is a factor of three levels: Bad, Good, and Medium. With factors, R will automatically code the indicators for the analysis. We build and analyze the model as follows:
library(ISLR)
data(Carseats)
sales.fit <- lm(Sales ~ Advertising + ShelveLoc, data = Carseats)
summary(sales.fit)
If the shelving location is good, the estimate of sales is almost double of that when the location is bad, given an intercept of 4.89662. To see how R codes the indicator features, you can use the contrasts()
function:
contrasts(Carseats$ShelveLoc)
An example is available in the MASS package with the Boston dataset. The response is the median home value, which is medv
in the output. We will use two features: the percentage of homes with a low socioeconomic status, which is termed lstat
, and the age of the home in years, which is termed age
in the following output:
library(MASS)
data(Boston)
str(Boston)
Using **feature1*feature2** with the lm()
function in the code puts both the features as well as their interaction term in the model:
value.fit <- lm(medv ~ lstat * age, data = Boston)
summary(value.fit)
Examining the output, we can see that, while the socioeconomic status is a highly predictive feature, the age of the home is not. However, the two features have a significant interaction to positively explain the home value.
Logistic function used in logistic regression:
\[Probability \ of\ Y = \frac{e^{B_0 + B_{1}x}} {1 - e^{B_0 + B_{1}x}}\]
The logistic function can be turn to odds via:
\[Odds = \frac{Probability(Y)} {1 - Probability(Y)}\] One way to look at the relationship of logistic regression with linear regression is to show logistic regression as the log odds or \[log (P(Y) / (1 - P(Y)))\] is equal to \[B_0 + B_{1}x\]. The coefficients are estimated using a maximum likelihood instead of the OLS. The intuition behind the maximum likelihood is that we are calculating the estimates for Bo and B1, which will create a predicted probability for an observation that is as close as possible to the actual observed outcome of Y, a so-called likelihood.
Data understanding and preparation
Dataset (MASS::biopsy) consists of tissue samples from 699 patients. It is in a data frame with 11 variables, as follows: * ID: Sample code number * V1: Thickness * V2: Uniformity of the cell size * V3: Uniformity of the cell shape * V4: Marginal adhesion * V5: Single epithelial cell size * V6: Bare nucleus (16 observations are missing) * V7: Bland chromatin * V8: Normal nucleolus * V9: Mitosis * class: Whether the tumor diagnosis is benign or malignant; this will be the outcome that we are trying to predict
The medical team has scored and coded each of the nine features on a scale of 1 to 10.
#examine the structure of data
library(MASS)
data(biopsy)
str(biopsy)
#prepare data
## Get rid of ID column
biopsy$ID <- NULL
#rename columns
names(biopsy) <- c("thick", "u.size", "u.shape",
"adhsn", "s.size", "nucl", "chrom", "n.nuc",
"mit", "class")
#get rid of missing values
biopsy.v2 <- na.omit(biopsy)
#recode class to numeric variable
y <- ifelse(biopsy.v2$class == "malignant", 1, 0)
#Plot data
library(reshape2)
library(ggplot2)
#The following code melts the data by their values into one overall feature and groups them by class
biop.m <- reshape2::melt(biopsy.v2, id.var = "class")
ggplot(data = biop.m, aes(x = class, y = value)) +
geom_boxplot() +
facet_wrap(~ variable, ncol = 3)
#Explore correlation
library(corrplot)
bc <- cor(biopsy.v2[ ,1:9]) #create an object of the features
corrplot.mixed(bc)
The correlation coefficients are indicating that we may have a problem with collinearity, in particular, the features of uniform shape and uniform size that are present. As part of the logistic regression modeling process, it will be necessary to incorporate the VIF analysis as we did with linear regression.
#create train and test set
set.seed(123) #random number generator
ind <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))
train <- biopsy.v2[ind == 1, ] #the training data set
test <- biopsy.v2[ind == 2, ] #the test data set
str(test) #confirm it worked
#To ensure that we have a well-balanced outcome variable between the two datasets, we will perform the following check:
table(train$class)
table(test$class)
For this part of the process, we will start with a logistic regression model of all the input variables and then narrow down the features with the best subsets. After this, we will try our hand at discriminant analysis and Multivariate Adaptive Regression Splines (MARS).
We’ve already discussed the theory behind logistic regression, so we can begin fitting our models. An R installation comes with the glm()
function fitting the generalized linear models, which are a class of models that includes logistic regression. The code syntax is similar to the lm()
function that we used in the previous chapter. One big difference is that we must use the family = binomial
argument in the function, which tells R to run a logistic regression method instead of the other versions of the generalized linear models. We will start by creating a model that includes all of the features on the train set and see how it performs on the test set, as follows:
full.fit <- glm(class ~ ., family = binomial, data = train)
summary(full.fit)
We can see that only two features have p-values less than 0.05 (thickness and nuclei). An examination of the 95 percent confidence intervals can be called on with the confint()
function, as follows:
confint(full.fit)
Note that the two significant features have confidence intervals that do not cross zero. You cannot translate the coefficients in logistic regression as the change in Y is based on a one-unit change in X. This is where the odds ratio can be quite helpful. The beta coefficients from the log function can be converted to odds ratios with an exponent (beta).
In order to produce the odds ratios in R, we will use the following exp(coef())
syntax:
exp(coef(full.fit))
The interpretation of an odds ratio is the change in the outcome odds resulting from a unit change in the feature. If the value is greater than 1, it indicates that, as the feature increases, the odds of the outcome increase. Conversely, a value less than 1 would mean that, as the feature increases, the odds of the outcome decrease. In this example, all the features except u.size will increase the log odds.
One of the issues pointed out during data exploration was the potential issue of multi-collinearity. It is possible to produce the VIF statistics that we did in linear regression with a logistic model in the following way:
library(car)
vif(full.fit)
None of the values are greater than the VIF rule of thumb statistic of five, so collinearity does not seem to be a problem.
The next task is feature selection; but, for now, let’s produce some code to look at how well this model does on both the train and test sets.
You will first have to create a vector of the predicted probabilities, as follows:
train.probs <- predict(full.fit, type = "response")
train.probs[1:5] #inspect the first 5 predicted probabilities
Next, we need to evaluate how well the model performed in training and then evaluate how it fits on the test set. A quick way to do this is to produce a confusion matrix.
library(InformationValue)
trainY <- y[ind == 1]
testY <- y[ind == 2]
confusionMatrix(trainY, train.probs)
#We can also take a look at the error rate, as follows
misClassError(trainY, train.probs)
Let’s explore prediction on a test set:
test.probs <- predict(full.fit, newdata = test, type = "response")
misClassError(testY, test.probs)
confusionMatrix(testY, test.probs)
The purpose of cross-validation is to improve our prediction of the test set and minimize the chance of overfitting. With the K-fold cross-validation, the dataset is split into K equal-sized parts. The algorithm learns by alternatively holding out one of the K-sets; it fits a model to the other K-1 parts, and obtains predictions for the left-out K-set. The results are then averaged so as to minimize the errors, and appropriate features are selected. You can also perform the Leave-One-Out-Cross-Validation (LOOCV) method, where K is equal to 1. Simulations have shown that the LOOCV method can have averaged estimates that have a high variance. As a result, most machine learning experts will recommend that the number of K-folds should be 5 or 10.
An R package that will automatically do CV for logistic regression is the bestglm
package. This package is dependent on the leaps
package that we used for linear regression.
After loading the package, we will need our outcome coded to 0 or 1. If left as a factor, it will not work. The other requirement to utilize the package is that your outcome, or y, is the last column and all the extraneous columns have been removed. A new data frame will do the trick for us, via the following code:
library(bestglm)
X <- train[, 1:9]
Xy <- data.frame(cbind(X, trainY))
Here is the code to run in order to use the CV technique with our data:
bestglm(Xy = Xy, IC = "CV", CVArgs = list(Method = "HTF", K = 10, REP = 1),
family=binomial)
The syntax, Xy = Xy
, points to our properly formatted data frame. IC = "CV"
tells the package that the information criterion to use is cross-validation. CVArgs
are the CV arguments that we want to use. The HTF method is K-fold, which is followed by the number of folds, K = 10
, and we are asking it to do only one iteration of the random folds with REP = 1
.
So, after running the analysis, we will end up with the following output, giving us three features for Best Model, such as thick
, u.size
, and nucl
. The statement on Morgan-Tatar search simply means that a simple exhaustive search was done for all the possible subsets
We can put these features in glm()
and then see how well the model did on the test set. A predict()
function will not work with bestglm, so this is a required step:
reduce.fit <- glm(class ~ thick + u.size + nucl, family = binomial, data = train)
Compare predicted values with actual ones:
test.cv.probs <- predict(reduce.fit, newdata = test, type = "response")
misClassError(testY, test.cv.probs)
confusionMatrix(testY, test.cv.probs)
The reduced feature model is slightly less accurate than the full feature model, but all is not lost. We can utilize the bestglm
package again, this time using the best subsets with the information criterion set to BIC
:
bestglm(Xy = Xy, IC = "BIC", family = binomial)
These four features provide the minimum BIC
score for all possible subsets. Let’s try this and see how it predicts the test
set, as follows:
bic.fit <- glm(class ~ thick + adhsn + nucl + n.nuc,
family = binomial, data = train)
test.bic.probs = predict(bic.fit, newdata = test, type = "response")
misClassError(testY, test.bic.probs)
confusionMatrix(testY, test.bic.probs)
Discriminant Analysis (DA), also known as Fisher Discriminant Analysis (FDA), is another popular classification technique. It can be an effective alternative to logistic regression when the classes are wellseparated.
We’ll explore two type of DA - Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA).
DA utilizes Baye’s theorem in order to determine the probability of the class membership for each observation. If you have two classes, for example, benign and malignant, then DA will calculate an observation’s probability for both the classes and select the highest probability as the proper class.
Bayes’ theorem states that the probability of Y occurring–given that X has occurred–is equal to the probability of both Y and X occurring, divided by the probability of X occurring, which can be written as follows:
\[Probability \ of\ Y | X = \frac{P(X \ and\ Y)} {P(X)}\] The numerator in this expression is the likelihood that an observation is from that class level and has these feature values. The denominator is the likelihood of an observation that has these feature values across all the levels. Again, the classification rule says that if you have the joint distribution of X and Y and if X is given, the optimal decision about which class to assign an observation to is by choosing the class with the larger probability (the posterior probability).
The process of attaining posterior probabilities goes through the following steps: 1. Collect data with a known class membership. 2. Calculate the prior probabilities; this represents the proportion of the sample that belongs to each class. 3. Calculate the mean for each feature by their class. 4. Calculate the variance–covariance matrix for each feature; if it is an LDA, then this would be a pooled matrix of all the classes, giving us a linear classifier, and if it is a QDA, then a variance– covariance created for each class. 5. Estimate the normal distribution (Gaussian densities) for each class. 6. Compute the discriminant function that is the rule for the classification of a new object. 7. Assign an observation to a class based on the discriminant function.
Discriminant Analysis Application
LDA is performed in the MASS
package, which we have already loaded so that we can access the biopsy data. The syntax is very similar to the lm()
and glm()
functions.
We can now begin fitting our LDA model, which is as follows:
lda.fit <- lda(class ~ ., data = train)
lda.fit
This output shows us that Prior probabilities of groups
are approximately 64 percent for benign and 36 percent for malignancy. Next is Group means
. This is the average of each feature by their class. Coefficients of linear discriminants
are the standardized linear combination of the features that are used to determine an observation’s discriminant score. The higher the score, the more likely that the classification is malignant. The plot()
function in LDA will provide us with a histogram and/or the densities of the discriminant scores, as follows:
plot(lda.fit, type = "both")
We can see that there is some overlap in the groups, indicating that there will be some incorrectly classified observations.
The predict()
function available with LDA provides a list of three elements: class, posterior, and x. The class
element is the prediction of benign or malignant, the posterior
is the probability score of x being in each class, and x
is the linear discriminant score.
Let’s just extract the probability of an observation being malignant:
train.lda.probs <- predict(lda.fit)$posterior[, "malignant"]
misClassError(trainY, train.lda.probs)
confusionMatrix(trainY, train.lda.probs)
Well, unfortunately, it appears that our LDA model has performed much worse than the logistic regression models. The primary question is to see how this will perform on the test
data
test.lda.probs <- predict(lda.fit, newdata = test)$posterior[, "malignant"]
misClassError(testY, test.lda.probs)
confusionMatrix(testY, test.lda.probs)
In R, QDA is also part of the MASS
package and the function is qda()
. Building the model is rather straightforward again, and we will store it in an object called qda.fit
, as follows:
qda.fit <- qda(class ~ ., data = train)
qda.fit
As with LDA, the output has Group means
but does not have the coefficients because it is a quadratic function.
The predictions for the train
and test
data follow the same flow of code as with LDA:
train.qda.probs <- predict(qda.fit)$posterior[, "malignant"]
misClassError(trainY, train.qda.probs)
confusionMatrix(trainY, train.qda.probs)
test.qda.probs <- predict(qda.fit, newdata = test)$posterior[, 2]
misClassError(testY, test.qda.probs)
confusionMatrix(testY, test.qda.probs)
What is MARS? 1. Start with linear or generalized linear model 2. To capture any nonlinear relationship the hinge
function is added.
These hinges
are simply points in the input feature that equate to a coefficient change. For example, say we have Y = 12.5 (our intercept) + 1.5(variable 1) + 3.3(variable 2)
; where variables 1 and 2 are on a scale of 1 to 10. Now, let’s see how a hinge function for variable 2 could come into play: Y = 11 (new intercept) + 1.5(variable 1) + 4.26734(max(0, variable 2 - 5.5)
Thus, we read the hinge function as: we take the maximum of either 0 or variable 2 minus 5.50. So, whenever variable 2 has a value greater than 5.5, that value will be multiplied times the coefficient; otherwise, it will be zero.
Automatic variable selection
This can be done with crossvalidation, but the default is to build through a forward pass, then a backward pass to prune the model, which after the forward pass is likely to overfit the data. This backward pass prunes input features and removes hinges based on Generalized Cross Validation (GCV):
\[GCV = RSS / (N * (1 - Effective Number \ of \ Parameters / N)^2)\] \[Effective \ Number\ of\ Parameters = Number\ of\ Input\ Features\ + Penalty * (Number\ of\ Input\ Features - 1) / 2\]
In the earth
package, Penalty = 2
for additive model and 3
for multiplicative model, that is one with interaction terms.
Parameters tuning
Important to specify: * how you want the model pruned * that it is a binomial response variable
For example specify a model selection of a five-fold cross validation (pmethod = "cv"
and nfold = 5
), repeated 3 times (ncross = 3
), as an additive model only with no interactions (degree = 1
) and only one hinge per input feature (minspan = -1
).
library(earth)
set.seed(1)
earth.fit <- earth(class ~ ., data = train,
pmethod = "cv",
nfold = 5,
ncross = 3,
degree = 1,
minspan = -1,
glm = list(family = binomial)
)
The model gives us eight terms, including the Intercept and seven predictors. Two of the predictors have hinge
functions–thickness and chromatin. If the thickness is greater than 3, the coefficient of 0.7019 is multiplied by that value; otherwise, it is 0. For chromatin, if less than 3 then the coefficient is multiplied by the values; otherwise, it is 0.
Plot data with plotmo()
and plotd()
plotmo()
produces plots showing the model’s response when varying that predictor and holding the others constant.
plotmo(earth.fit)
plotd()
produces density plots of the predicted probabilities by class label
plotd(earth.fit)
Let’s explore relative variable importance:
evimp(earth.fit)
nsubsets
- number of model subsets that include the variable after the pruning pass gcv
and rss
show the decrease in the respective value that the variable contributes (scaled 0 to 100)
Let’s see how model performs on the test set:
test.earth.probs <- predict(earth.fit, newdata = test, type = "response")
misClassError(testY, test.earth.probs)
confusionMatrix(testY, test.earth.probs)
An effective tool for a classification model comparison is the Receiver Operating Characteristic (ROC) chart. On the ROC chart, the y-axis is the True Positive Rate (TPR) and the x-axis is the False Positive Rate (FPR).
\[TPR = Positives\ correctly\ classified / total\ positives\] \[FPR = Negatives\ correctly\ classified / total\ negatives\]
Plotting the ROC results will generate a curve, and thus you are able to produce the Area Under the Curve (AUC).
Use ROCR
package to create ROC curve. 1. Create an object that saves the predicted probabilities with the actual classification 2. Use this object to create another with calculated TPR and FPR. 3. Build a chart with plot()
.
library(ROCR)
pred.full <- prediction(test.probs, test$class)
perf.full <- performance(pred.full, "tpr", "fpr")
plot(perf.full, main = "ROC", col = 1)
Explore other models:
pred.bic <- prediction(test.bic.probs, test$class)
perf.bic <- performance(pred.bic, "tpr", "fpr")
plot(perf.bic, col = 2, add = TRUE)
The add = TRUE
parameter in the plot
command adds the line to the existing chart.
The final thing is computing AUC. This is done with creation of a performance
object.
performance(pred.full, "auc")@y.values
performance(pred.bic, "auc")@y.values
Regularization in a nutshell
Linear model follows the form \[Y = B_0 + B_{1}x_{1}+...+B_{n}x_{n} + e\], and also at the best fit tries to minimize the RSS, which is the sum of the squared errors of the actual minus the estimate, or \[e_{1}^2+e_{2}^2+...+e_{n}^2\].
With regularization we apply the shrinkage penalty in conjunction with minimization of RSS. This penalty consists of a lambda (symbol λ), along with the normalization of the beta coefficients and weights.
Why does regularization works: * It’s computationaly efficient * Regularization through the proper selection of lambda and normalization may help you improve the model fit by optimizing the bias-variance trade-off * regularization of coefficients works to solve multi-collinearity problems.
The normalization term is sum of squared weights (L2-norm). Our model tries to minimize \[RSS + 𝜆(sumB_{j}^2)\]
As lambda increases, the coefficients shrink toward zero but never become zero. The benefit may be an improved predictive accuracy, but as it does not zero out the weights for any of your features, it could lead to issues in the model’s interpretation and communication. To help with this problem, we will turn to LASSO.
LASSO applies L1-norm instead of L2-norm as in ridge regression, which is the sum of the absolute value of the feature weights and thus minimizes \[RSS + 𝜆(sum|B_{j|})\] This shrinkage penalty will indeed force a feature weight to zero. This is a clear advantage over ridge regression, as it may greatly improve the model interpretability.
LASSO and Ridge Regression
In a situation of high collinearity or high pairwise correlations, LASSO may force a predictive feature to zero and thus you can lose the predictive ability; that is, say if both feature A and B should be in your model, LASSO may shrink one of their coefficients to zero.
Combines L1- and L2-norm introducing a mixing parameter aplha in conjunction with lambda.
The goal is to minimize \[RSS + λ[(1 - alpha)(sum|B_{j}|^2) + alpha(sum|B_{j}|^2)]/N\]
Data preparation
#Load packages
library(ElemStatLearn) #contains the data
library(car) #package to calculate Variance Inflation Factor
library(corrplot) #correlation plots
library(leaps) #best subsets regression
library(glmnet) #allows ridge regression, LASSO and elastic net
library(caret) #this will help identify the appropriate parameters
#Explore data
data(prostate)
str(prostate)
plot(prostate)
#let's explore gleason
plot(prostate$gleason)
table(prostate$gleason)
8 and 9 scores are under-represented. The options to deal with them are:
#Let's choose the third option
prostate$gleason <- ifelse(prostate$gleason == 6, 0, 1)
table(prostate$gleason)
#Explore data with corrplot
p.cor <- cor(prostate)
corrplot.mixed(p.cor)
Let’s start the learning:
#create train and test set
train <- subset(prostate, train == TRUE)[, 1:9]
str(train)
test <- subset(prostate, train == FALSE)[,1:9]
str(test)
Modelling and evaluation
#Create best subset object
subfit <- regsubsets(lpsa ~., data = train)
b.sum <- summary(subfit)
#find the best subset with the minimun BIC
which.min(b.sum$bic)
#plot BIC performance across subsets
plot(b.sum$bic, type = "l", xlab = "# of Features", ylab = "BIC",
main = "BIC score by Feature Inclusion")
#more detailed examination
plot(subfit, scale = "bic", main = "Best Subset Features")
So, the previous plot shows us that the three features included in the lowest BIC are lcavol
, lweight
, and gleason
. It is noteworthy that lcavol
is included in every combination of the models.
Before trying this model on test data, let’s explore the linearity in the solution and check the constancy of the variance (by plotting the fitted values vs actual).
ols <- lm(lpsa ~ lcavol + lweight + gleason, data = train)
plot(ols$fitted.values, train$lpsa, xlab = "Predicted", ylab = "Actual",
main = "Predicted vs Actual")
An inspection of the plot shows that a linear fit should perform well on this data and that the non-constant variance is not a problem.
Let’s explore predicted values:
pred.subfit <- predict(ols, newdata = test)
plot(pred.subfit, test$lpsa , xlab = "Predicted",
ylab = "Actual", main = "Predicted vs Actual")
Before concluding this section, we will need to calculate Mean Squared Error (MSE) to facilitate comparison across the various modeling techniques.
resid.subfit <- test$lpsa - pred.subfit
mean(resid.subfit^2)
Ridge regression
glmnet
packageglmnet(x = our input matrix, y = our response, family = the distribution, alpha = 0)
(aplha = 0 for ridge, 1 for LASSO)#prepare data
x <- as.matrix(train[, 1:8])
y <- train[, 9]
#run ridge regression
library(glmnet)
ridge <- glmnet(x, y, family = "gaussian", alpha = 0)
#Explore results
print(ridge)
plot(ridge, label = TRUE)
The 100th row shows us that 8 features is included, with .6971
explained and lambda
equals to .08789
.
The y axis is the value of Coefficients and the x axis is L1 Norm
plot(ridge, xvar = "lambda", label = TRUE)
This is a worthwhile plot as it shows that as lambda
decreases, the shrinkage parameter decreases and the absolute values of the coefficients increase. To see the coefficients at a specific lambda value, use the coef()
command. Here, we will specify the lambda value that we want to use by specifying s=0.1
. Set exact=TRUE
, which tells glmnet
to fit a model with that specific lambda
value versus interpolating from the values on either side of our lambda
.
ridge.coef <- coef(ridge, s = 0.1, exact = TRUE)
ridge.coef
LASSO
lasso <- glmnet(x, y, family = "gaussian", alpha = 1)
print(lasso)
the model building process stopped at step 69 as the deviance explained no longer improved as lambda
decreased.
#plot the results
plot(lasso, xvar = "lambda", label = TRUE)
lasso.coef <- coef(lasso, s = 0.045, exact = TRUE)
lasso.coef
newx <- as.matrix(test[, 1:8])
lasso.y <- predict(lasso, newx = newx,
type = "response", s = 0.045)
plot(lasso.y, test$lpsa, xlab = "Predicted", ylab = "Actual",
main = "LASSO")
Elastic Net
With caret
package we are able to find an optimal set of lamda and elastic net parameter, alpha. Follow three steps: * use expand.grid()
function in base R to create a vector of all possible combinations of alpha and lambda * use trainControl()
function from caret
package to determine the resampling method (we’ll use LOOCV) * train a model to select our alpha and lambda parameters using glmnet()
in caret’s train
function.
When we selected alpha and lambda we’ll apply them to our test set.
alpha and lambda values to try: * Alpha from 0 to 1 by 0.2 increments * Lambda from 0.00 to 0.2 by 0.02 increments ()
Create a vector with expand.grid()
grid <- expand.grid(.alpha = seq(0, 1, by = 0.2), .lambda = seq(0.00, 0.2, by = 0.02))
table(grid)
Then tell the model selection criteria with selectionFunction()
in trainControl()
. For quantitative responses, the algorithm will select based on its default of Root Mean Square Error (RMSE), which is perfect for our purposes.
control <- trainControl(method = "LOOCV")
Use train()
to determine the optimal elastic net parameters. Function is similar to lm()
, we only need to add the syntax: method = "glmnet", trControl = control, tuneGrid = grid
.
enet.train <- train(lpsa ~., data = train, method = "glmnet", trControl = control, tuneGrid = grid)
RMSE was used to select the optimal model - alpha = 0 and lambda = 0.08. Let’s check ourselves.
enet.train$bestTune
The process for the test set
validation is the same as before:
enet <- glmnet(x, y, family = "gaussian", alpha = 0, lambda = 0.08)
(enet.coef <- coef(enet, s = 0.08, exact = TRUE))
#predict y
enet.y <- predict(enet, newx = newx, type = "response", s = .08)
plot(enet.y, test$lpsa, xlab = "Predicted", ylab = "Actual", main = "Elastic Net")
#calculate the RMSE
enet.resid <- enet.y - test$lpsa
mean(enet.resid ^ 2)
We’ll try k-fold validation with cv.glmnet()
. In k-fold CV, the data is partitioned into equal number of folds and a separate model is build on each k-1 set and then tested on a corresponsing holdout set with the results averaged in order to determine the final parameters. In this method, each fold is used as a test set only one. glmnet
provides you with an output values of lambda values and corresponding MSE. It defaults alpha
= 1, so if you want to try ridge regression or an elastic net mix, you will need to specify it.
set.seed(317)
lasso.cv <- cv.glmnet(x, y, nfolds = 3)
plot(lasso.cv)
The plot for CV is quite different than the other glmnet plots, showing log(Lambda) versus Mean-Squared Error along with the number of features. The two dotted vertical lines signify the minimum of MSE (left line) and one standard error from the minimum (right line). One standard error away from the minimum is a good place to start if you have an over-fitting problem. You can also call the exact values of these two lambdas, as follows:
lasso.cv$lambda.min
lasso.cv$lambda.1se
Using lambda.1se
we can go through the following process of viewing the coefficients and validating the model on the test data:
coef(lasso.cv, s = "lambda.1se")
lasso.y.cv <- predict(lasso.cv, newx = newx, type = "response", s = )
Let’s work with breast cancer example
library(MASS)
ind <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))
train <- biopsy.v2[ind == 1, ]
test <- biopsy.v2[ind == 2, ]
x <- as.matrix(train[, 1:9])
y <- train[, 10]
set.seed(3)
fitCV <- cv.glmnet(x, y, family = "binomial", type.measure = "auc", nfolds = 5)
plot(fitCV)
We can notice an immediate improvement by to AUC just by adding one feature. Let’s have a look at the coefficients for one standart error.
fitCV$lambda.1se
coef(fitCV, s = "lambda.1se")
library(InformationValue)
predCV <- predict(fitCV, newx = as.matrix(test[, 1:9]),
s = "lambda.1se",
type = "response")
actuals <- ifelse(test$class == "malignant", 1, 0)
misClassError(actuals, predCV)
plotROC(actuals, predCV)
Using lambda.1se
was suboptimal, let’s use lambda.min
to improve the output of prediction:
predCV.min <- predict(fitCV, newx = as.matrix(test[, 1:9]),
s = "lambda.min",
type = "response")
misClassError(actuals, predCV.min)
These methods are more sophisticated since we’re allowed to relax the assumptions of linearity, which means a linear combination of the features in order to define the decision boundary is not needed.
Use class
package. kknn
- weighted k-nearest neighbors ## Support Vector Machines
SVMs are linear separators with the largest possible margin and the support vectors the ones touching the safety margin region on both sides. In data that is not linearly separable, many observations will fall on the wrong side of the margin (the so-called slack variables), which is a misclassification. The key to building an SVM algorithm is to solve for the optimal number of support vectors via cross-validation. Any observation that lies directly on the wrong side of the margin for its class is known as a support vector.
An important aspect of SVM is the ability to model nonlinearity with quadratic or higher order polynomials of the input features. In SVMs, this is known as the kernel trick. It could be: * linear, no transformation * polynomial * radial basis function * sigmoid function
Packages e1071
Pima - diabetis - obesity Datasets are in packages MASS (Pima.tr, Pima.te) #### Data understanding and preparation
#load all libraries
library(class) #k-nearest neighbors
library(kknn) #weighted k-nearest neighbors
library(e1071) #SVM
library(caret) #select tuning parameters
library(MASS) #contains the data
library(reshape2) #assist in creating boxplots
library(ggplot2)
library(kernlab) #assist with SVM feature selection
#load data
data("Pima.tr")
data("Pima.te")
#combine into one df
pima <- rbind(Pima.tr, Pima.te)
#basic EDA
pima.melt <- tidyr::gather(pima, variable, value, -type)
pima.melt %>%
ggplot(aes(x = type, y = value)) +
geom_boxplot() +
facet_wrap(~variable, ncol = 2)
#scale the data
pima.scale <- data.frame(pima[, -8])
pima.scale$type <- pima$type
pima.scale.melt <- gather(pima.scale, variable, value, -type)
#check correlation
cor(pima.scale[-8])
table(pima.scale$type)
#create train/test split
set.seed(502)
ind <- sample(2, nrow(pima.scale), replace = TRUE, prob = c(0.7, 0.3))
train <- pima.scale[ind == 1, ]
test <- pima.scale[ind == 2, ]
Use expand.grid
to identify the most appropriate number of K and cross-validation (from caret
package)
grid1 <- expand.grid(.k = seq(2, 20, by = 1))
control <- trainControl(method = "cv")
Let’s compute the optimal k-value using train
function from caret
:
set.seed(502)
knn.train <- train(type ~ ., data = train,
method = "knn",
trControl = control,
tuneGrid = grid1)
knn.train
The Kappa statistic is commonly used to provide a measure of how well two evaluators can classify an observation correctly. It provides an insight into this problem by adjusting the accuracy scores, which is done by accounting for the evaluators being totally correct by mere chance. The formula for the statistic is Kappa = (per cent of agreement - per cent of chance agreement) / (1 - per cent of chance agreement). The per cent of agreement is the rate that the evaluators agreed on for the class (accuracy), and percent of chance agreement is the rate that the evaluators randomly agreed on. The higher the Kappa, the better they performed with the maximum agreement being one.
Let’s calculate kappa for a given best choice scenario with k by using knn
function from class
package while providing to it four inputs: * train inputs * test inputs * train labels * k
knn.test <- knn(train[, -8], test[, -8], train[, 8], k = 17)
table(knn.test, test$type)
#calculate kappa
prob.agree <- (77 + 28)
prob.chance <- ((77 + 26) / 147) * ((77 + 16) / 147)
kappa <- (prob.agree - prob.chance) / (1 - prob.chance)
kappa
The following heuristic should help us with assessing strength of agreement
Let’s utilize weighted neighbors to see whether we can beat the accuracy and kappa results. A weighting schema increases the influence of neighbors that are closest to an observation versus those that are farther away. The farther away the observation is from a point in space, the more penalized its influence is.
For this purpose we’ll use the kknn
package and its train.knn()
function to select optimal weighting scheme.
The train.kknn()
function uses LOOCV in order to select the best parameters for the optimal k neighbors, one of the two distance measures, and a kernel
function.
There’s 10 weighting schemas for this package: * rectangular (unweighted), * triangular, * epanechnikov, * biweight, * triweight, * cosine, * inversion, * gaussian, * rank, * optimal.
set.seed(123)
#create train object
kknn.train <- train.kknn(type ~ ., data = train,
kmax = 25, distance = 2,
kernel = c('rectangular', 'triangular', 'epanechnikov'))
plot(kknn.train)
#call kknn.train to see the summary
kknn.train
We will use the e1071
package to build our SVM models. The package has tune.svm()
funtion, which assists in the selection of the tuning parameters/kernel functions. It uses cross-validation to optimize the tuning parameters.
linear.tune <- tune.svm(type ~., data = train,
kernel = "linear",
cost = c(0.001, 0.01, 0.1, 1, 5, 10))
summary(linear.tune)
best.linear <- linear.tune$best.model
tune.test <- predict(best.linear, newdata = test)
table(tune.test, test$type)
With tune.svm()
we can try different kernels
set.seed(123)
#polynomial
poly.tune <- tune.svm(type ~., data = train,
kernel = "polynomial",
degree = c(3, 4, 5),
coef0 = c(0.1, 0.5, 1, 2, 3, 4))
summary(poly.tune)
best.poly <- poly.tune$best.model
poly.test <- predict(best.poly, newdata = test)
table(poly.test, test$type)
#radial
rbf.tune <- tune.svm(type ~ ., data = train,
kernel = "radial",
gamma = c(0.1, 0.5, 1, 2, 3, 4))
summary(rbf.tune)
best.rbf <- rbf.tune$best.model
rbf.test <- predict(best.rbf, newdata = test)
table(rbf.test, test$type)
#sigmoid
sigmoid.tune <- tune.svm(type ~ ., data = train,
kernel = "sigmoid",
gamma = c(0.1, 0.5, 1, 2, 3, 4),
coef0 = c(0.1, 0.5, 1, 2, 3, 4))
summary(sigmoid.tune)
best.sigmoid <- sigmoid.tune$best.model
sigmoid.test <- predict(best.sigmoid, newdata = test)
table(sigmoid.test, test$type)
To compare models we can use confusionMatrix()
from caret
package.
confusionMatrix(sigmoid.test, test$type, positive = "Yes")
confusionMatrix(tune.test, test$type, positive = "Yes")
Explore the outputs: * Sensitivity (the true positive rate; in this case, the rate of those not having diabetes has been correctly identified as such.) * Specificity (the true negative rate or, for our purposes, the rate of a diabetic that has been correctly identified.) * Positive Prediction Value - is the probability of someone in the population classified as being diabetic and truly has the disease. * Negative Prediction Value - is the probability of someone in the population classified as not being diabetic and truly does not have the disease
Steps: 1. set random seed 2. specify cv method in caret
’s rfeControl()
function 3. perform a recursive feature selection with the rfe()
function 4. test how model performs on a test set
There are several different functions that you can use. Here we will need lrFuncs
.
set.seed(123)
rfeCNTL <- rfeControl(functions = lrFuncs,
method = "cv",
number = 10)
svm.features <- rfe(train[, 1:7], train [, 8],
sizes = c(7, 6, 5, 4),
rfeControl = rfeCNTL,
method = "svmLinear")
svm.features
Build a model via svm()
function from e1071
package:
#check on train data with 5 features
#build a model svm.5
svm.5 <- svm(type ~ glu + ped + npreg + bmi + age,
data = train,
kernel = "linear")
#predict based on a svm.5 model
svm.5.predict <- predict(svm.5, newdata = test[c(1, 2, 5, 6, 7)])
table(svm.5.predict, test$type)
Based on prostate data
#load packages
library(rpart) #classification and regression trees
library(partykit) #treeplots
library(randomForest) #random forests
library(xgboost) #gradient boosting
library(caret) #tune hyper-parameters
#data
library(MASS) #breast and pima indian data
library(ElemStatLearn) #prostate data
#prepare prostate data
data("prostate")
prostate$gleason <- ifelse(prostate$gleason == 6, 0, 1)
pros.train <- subset(prostate, train == TRUE)[, 1:9]
pros.test <- subset(prostate, train == FALSE)[, 1:9]
Build a regression tree with rpart()
function from rpart
package and examine the results stored in $cptable:
tree.pros <- rpart(lpsa ~., data = pros.train)
#examine results
print(tree.pros$cptable)
Examine results visually with plotcp()
function:
#plot tree
plotcp(tree.pros)
The plot shows us the relative error by the tree size with the corresponding error bars. The horizontal line on the plot is the upper limit of the lowest standard error. Selecting a tree size, 5, which is four splits, we can build a new tree object where xerror is minimized by pruning our tree by first creating an object for cp associated with the pruned tree from the table. Then the prune()
function handles the rest:
cp <- min(tree.pros$cptable[5, ])
prune.tree.pros <- prune(tree.pros, cp = cp)
Let’s compare and plot full and pruned trees. Simply use as.party()
from partykit
package as a wrapper in plot()
:
plot(as.party(tree.pros))
plot(as.party(prune.tree.pros))
Let’s see how model performs on test data
party.pros.test <- predict(prune.tree.pros, newdata = pros.test)
rpart.resid <- party.pros.test - pros.test$lpsa
mean(rpart.resid^2) #calculate MSE
#prepare data
data(biopsy)
biopsy <- biopsy[, -1] #delete ID
names(biopsy) <- c("thick", "u.size", "u.shape", "adhsn",
"s.size", "nucl",
"chrom", "n.nuc", "mit", "class") #change the feature names
biopsy.v2 <- na.omit(biopsy) #delete the observations with missing values
set.seed(123) #random number generator
ind <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))
biop.train <- biopsy.v2[ind == 1, ] #the training data set
biop.test <- biopsy.v2[ind == 2, ] #the test data set
Create a tree and examine the table for the optimal number of splits:
set.seed(123)
tree.biop <- rpart(class ~., data = biop.train)
#explore cptable - complexity parameters table
tree.biop$cptable
The cross-validation error is at a minimum with only two splits (row 3). We can now prune the tree, plot the pruned tree, and see how it performs on the test set:
cp <- min(tree.biop$cptable[3, ])
prune.tree.biop <- prune(tree.biop, cp = cp)
plot(as.party(prune.tree.biop))
When predicting on test data use type = "class"
in the predict()
function:
#predict on test data
rparty.test <- predict(prune.tree.biop, newdata = biop.test,
type = "class")
table(rparty.test, biop.test$class)
Use randomForest::randomForest()
with formula as a first argument and data as a second:
set.seed(123)
rf.pros <- randomForest(lpsa ~., data = pros.train)
rf.pros
plot(rf.pros)
Let’s identify the specific and optimal tree (with regards to mean squared error) with which.min()
function:
which.min(rf.pros$mse)
set.seed(123)
rf.pros.2 <- randomForest(lpsa ~., data = pros.train,
ntree = which.min(rf.pros$mse))
rf.pros.2
Let’s product a variable importance plot with randomForest::varImpPlot()
:
varImpPlot(rf.pros.2, scale = TRUE,
main = "Variable Importance Plot - PSA Score")
To examine raw numbers, use randomForest::importance()
function:
importance(rf.pros.2)
Check on a test data
rf.pros.test <- predict(rf.pros.2, newdata = pros.test)
rf.resid <- rf.pros.test - pros.test$lpsa #calculate residual
mean(rf.resid^2)
set.seed(123)
rf.biop <- randomForest(class ~., data = biop.train)
rf.biop
plot(rf.biop)
Let’s find the minimum error rate
which.min(rf.biop$err.rate[ , 1]) #why the first column?
Only 19 trees are needed to optimize the model accuracy. Let’s try this and see how it performs:
set.seed(123)
rf.biop.2 <- randomForest(class ~., data = biop.train,
ntree = 19)
rf.biop.2
varImpPlot(rf.biop.2)
Let’s perform the same procedure but with Pima dataset.
#prepare data
data(Pima.tr)
data(Pima.te)
pima <- rbind(Pima.tr, Pima.te)
set.seed(502)
ind <- sample(2, nrow(pima), replace = TRUE, prob = c(0.7, 0.3))
pima.train <- pima[ind == 1, ]
pima.test <- pima[ind == 2, ]
#fit random forest model
set.seed(321)
rf.pima <- randomForest(type ~., data = pima.train)
rf.pima
#predict
rf.pima.test <- predict(rf.pima, newdata = pima.test)
table(rf.pima.test, pima.test$type)
We’ll be using xboost
package and tuning the following parameters:
Using the expand.grid()
function, we’ll build a experimental grid. Beware: you need to specify even default parameters or you recieve an error.
grid <- expand.grid(nround = c(75,100),
colsample_bytree = 1,
min_child_weight = 1,
eta = c(0.01, 0.1, 0.3),
gamma = c(0.5, 0.25),
subsample = 0.5,
max_depth = c(2, 3))
Before using caret::train()
let’s specify trainControl
argument
cntrl <- trainControl(method = "cv",
number = 5,
verboseIter = TRUE,
returnData = TRUE,
returnResamp = "final")
set.seed(1)
train.xgb <- train(x = pima.train[, 1:7],
y = pima.train[, 8],
trControl = cntrl,
method = "xgbTree")
train.xgb
What to do next: 1. create a list of parameters to use with xgboost::xgb.train()
function 2. turn dataframe into a matrix of input features and a list of labeled outcomes 3. turn the features and labels into the input required, as xgb.Dmatrix
.
param <- list(objective = "binary:logistic",
booster = "gbtree",
eval_metric = "error",
eta = 0.1,
max_depth = 2,
subsample = 0.5,
colsample_bytree = 1,
gamma = 0.5)
x <- as.matrix(pima.train[, 1:7])
y <- ifelse(pima.train$type == "Yes", 1, 0)
train.mat <- xgb.DMatrix(data = x, label = y)
Create a model
set.seed(1)
xgb.fit <- xgb.train(params = param,
data = train.mat,
nrounds = 75)
Then let’s the variable importance and also plot it.
impMatrix <- xgb.importance(feature_names = dimnames(x)[[2]],
model = xgb.fit)
impMatrix
xgb.plot.importance(impMatrix, main = "Gain by Feature")
The following metrics are result of xgb.importance()
call.
Let’s see how we can perform on a test set - using the InformationValue
package.
library(InformationValue)
pred <- predict(xgb.fit, x)
optimalCutoff(y, pred)
pima.testMat <- as.matrix(pima.test[, 1:7])
xgb.pima.test <- predict(xgb.fit, pima.testMat)
y.test <- ifelse(pima.test$type == "Yes", 1, 0)
1 - misClassError(y.test, xgb.pima.test, threshold = 0.39)
#xgb.pima.test <- ifelse(xgb.pima.test > 0.39, 1, 0)
#pay attention - confusionMatrix exists in both in caret and InformationValue packages. the latter allows us to specify the threshold for numberic values in float32.
InformationValue::confusionMatrix(y.test, xgb.pima.test, threshold = 0.39)
InformationValue::optimalCutoff()
provides the optimal probability threshold to minimize error.
Let’s plot ROC curve (with InformationValue::plotROC()
):
plotROC(y.test, xgb.pima.test)
###Feature selection with random forests
We’ll be using Sonar
data from mlbench
package with 2 classes - Mine or Rock.
data(Sonar, package = "mlbench")
table(Sonar$Class)
Run an algorithm by using Boruta()
function with formula, data, and doTrace = 1
to trace the progress of feature selection.
library(Boruta)
set.seed(1)
feature.selection <- Boruta(Class ~., data = Sonar, doTrace = 1)
After creating model let’s explore the table with the final importance decision.
table(feature.selection$finalDecision)
It’s simple to create a new dataframe with new features by using getSelectedAttributes()
function. Note that function takes two arguments - model and withTentative = FALSE
. And then subset the original dataframe.
fNames <- getSelectedAttributes(feature.selection)
Sonar.features <- Sonar[, fNames]
Skipped, let’s focus on Tensorflow and Keras.