Here are all the packages or libraries necessary to reproduce this analysis.
suppressMessages(library(ggplot2))
suppressMessages(library(Hmisc))
suppressMessages(library(caret))
suppressMessages(library(glmnet))
suppressMessages(library(leaps))
suppressMessages(library(gridExtra))
options("scipen"=100, "digits"=4) # avoid scientific notation and limit printed digits
options(warn=-1) # suppress warnings in the rendered report
In this section we load the data into R and take a quick look at the variables to check for missing values, inspect variable types, and so on.
Dataset1 <- read.csv("~/Downloads/Dataset1.csv")
str(Dataset1)
## 'data.frame': 400 obs. of 6 variables:
## $ input.var.1: num 1.14 1.06 1.07 1.14 1.01 1.14 1.09 1.03 1.1 1.13 ...
## $ Input.var.2: int 90 98 54 69 63 61 55 70 80 37 ...
## $ input.var.3: int 91 99 100 67 92 31 44 100 56 42 ...
## $ Input.var.4: num 364.5 470.6 78.7 164.2 125 ...
## $ Input.var.5: int 26 25 38 41 29 6 29 35 49 29 ...
## $ Output.var : num 103.2 104.4 82.4 77.5 78.3 ...
#describe(Dataset1)
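The str() output above does not directly show missing values. A quick check, added here only as a suggested sketch (it was not part of the original output), is to count the NAs in each column.
colSums(is.na(Dataset1)) # a vector of zeros means there are no missing values to handle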
names(Dataset1) <- tolower(names(Dataset1))
names(Dataset1) <- gsub(pattern = "input.", replacement = "", x = names(Dataset1))
names(Dataset1) #sanity check
## [1] "var.1" "var.2" "var.3" "var.4" "var.5"
## [6] "output.var"
In this section, we look at the distribution of each variable to see whether any transformation or normalisation may be needed. Input variable 4 looks skewed and may need to be transformed; the other variables look fairly normal and evenly distributed, and no significant outliers stand out, at least in the box plots below. We also look for correlated and linearly dependent predictors. This matters most if we plan to use linear models; if we used only tree-based models such as random forests, CART or boosting, we could largely ignore it.
p1 <- ggplot(aes(x = var.1), data = Dataset1) + geom_histogram()
p2 <- ggplot(aes(x = var.2), data = Dataset1) + geom_histogram()
p3 <- ggplot(aes(x = var.3), data = Dataset1) + geom_histogram()
p4 <- ggplot(aes(x = var.4), data = Dataset1) + geom_histogram()
p5 <- ggplot(aes(x = var.5), data = Dataset1) + geom_histogram()
grid.arrange(p1, p2, p3, p4, p5, ncol=3, main="Histogram - Dataset1")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## (message repeated for each of the five histograms)
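If we did decide to transform input variable 4, a log transform would be a natural first candidate for right-skewed data. The line below is only a sketch of that idea (it assumes var.4 is strictly positive and was not part of the original analysis):
ggplot(Dataset1, aes(x = log(var.4))) + geom_histogram() # histogram of log-transformed var.4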
b1 <- ggplot(aes(x = "", y = var.1), data = Dataset1) + geom_boxplot()
b2 <- ggplot(aes(x = "", y = var.2), data = Dataset1) + geom_boxplot()
b3 <- ggplot(aes(x = "", y = var.3), data = Dataset1) + geom_boxplot()
b4 <- ggplot(aes(x = "", y = var.4), data = Dataset1) + geom_boxplot()
b5 <- ggplot(aes(x = "", y = var.5), data = Dataset1) + geom_boxplot()
grid.arrange(b1, b2, b3, b4,b5, ncol=3, main="Boxplot - Dataset1")
combo.info <- findLinearCombos(Dataset1[-6]) # all variables are linearly independent, which is good news if we consider linear models
combo.info
## $linearCombos
## list()
##
## $remove
## NULL
cor.mat <- cor(Dataset1[-6]) # variables 2 and 4 are more than 95% correlated. We'll let caret remove one of them, although one could also pick which to drop at random.
highCorr <- findCorrelation(cor.mat, .90)
highCorr
## [1] 4
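To see the actual correlation that triggered the removal, one small check (added here; its output was not shown in the original report) is:
cor.mat["var.2", "var.4"] # pairwise correlation between the two flagged variables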
clean.data <- Dataset1[,-highCorr] #var.4 was dropped. Again, one could have chosen to drop var.2
rm(cor.mat, Dataset1, combo.info, highCorr)
set.seed(44) #from random.org
#split the data 70% training set and 30% test set
index <- createDataPartition(clean.data$output.var, p = 0.7, list = FALSE)
train <- clean.data[index,]
test <- clean.data[-index,]
preProcValues <- preProcess(train, method = c("center", "scale")) # estimate centering and scaling on the training set only
train <- predict(preProcValues, train)
test <- predict(preProcValues, test) # apply the same training-set transformation to the test set
rm(index, preProcValues)
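As a quick sanity check on the split and the preprocessing, one could inspect the dimensions and column means; this is a sketch added here, not part of the original run:
dim(train) # roughly 70% of the 400 rows
dim(test)  # the remaining rows, roughly 30%
round(colMeans(train), 3) # centred columns should average approximately zero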
The first model that we will use is gradient boosted trees (caret's "gbm" method), which is probably the best off-the-shelf model for generating high predictive accuracy. It does not require much transformation of the data; most of the preprocessing and cleaning we did is not strictly necessary for this model. Without much tuning, it gives us an RMSE of about 0.089 on the (standardised) test set, with a cross-validated R squared above 0.99.
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10
                           #repeats = 10
                           )
set.seed(7)
gbmFit <- train(output.var ~ ., data = train,
                method = "gbm",
                trControl = fitControl,
                verbose = FALSE)
## Loading required package: gbm
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
gbmFit
## Stochastic Gradient Boosting
##
## 280 samples
## 4 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
##
## Summary of sample sizes: 252, 252, 252, 252, 252, 252, ...
##
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared RMSE SD Rsquared SD
## 1 50 0.28302 0.9524 0.04758 0.014459
## 1 100 0.14940 0.9842 0.02642 0.005120
## 1 150 0.10414 0.9908 0.01668 0.002775
## 2 50 0.19170 0.9735 0.03764 0.009325
## 2 100 0.11648 0.9875 0.01930 0.003970
## 2 150 0.10282 0.9900 0.01778 0.003322
## 3 50 0.14884 0.9821 0.03466 0.007926
## 3 100 0.10294 0.9902 0.01964 0.002974
## 3 150 0.09285 0.9919 0.01782 0.002630
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3 and shrinkage = 0.1.
plot(gbmFit)
varImp(gbmFit)
## gbm variable importance
##
## Overall
## var.3 100.00
## var.2 91.32
## var.1 3.53
## var.5 0.00
preds <- predict(gbmFit, test)
rmse <- sqrt(mean((test$output.var-preds)^2)) # gradient boosting gives an RMSE of about 0.089
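To relate this RMSE to the accuracy-style figures quoted in this report, one option is to compute the R squared on the test set; this is an added sketch, not part of the original analysis:
ss.res <- sum((test$output.var - preds)^2)
ss.tot <- sum((test$output.var - mean(test$output.var))^2)
1 - ss.res/ss.tot # test-set R squared; close to 1 given the small RMSE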
The second modelling approach we will use is a shrinkage method: a regularised linear model, here the lasso fitted with glmnet. Like the previous method, we use cross-validation to build the model and perform the final evaluation on the held-out test set. The RMSE is about 0.033, slightly better than the 0.089 from the boosted trees. A further advantage is interpretability: unlike tree-based ensembles, the fitted coefficients can be read directly.
x <- as.matrix(train[-5]) # predictor matrix (output.var is column 5 of the cleaned data)
y <- train$output.var     # response
cvfit = cv.glmnet(x, y)
plot(cvfit)
cvfit$lambda.min
## [1] 0.003445
coef(cvfit,s=0.1)
## 5 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.0000000000000002611
## var.1 0.0365390788652994458
## var.2 0.5992948843992818020
## var.3 0.6454768291884692699
## var.5 .
coef(cvfit, s = "lambda.min")
## 5 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.0000000000000005487
## var.1 0.1540221000300564247
## var.2 0.7174260257999696133
## var.3 0.7608904925553053866
## var.5 .
preds <- predict(cvfit, newx = as.matrix(test[-5]), s = "lambda.min")
rmse <- sqrt(mean((test$output.var-preds)^2)) # rmse is 0.033
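By default cv.glmnet fits the lasso (alpha = 1). If we wanted to compare against ridge regression or an elastic net, the alpha argument controls the penalty mix; the lines below sketch that comparison and were not part of the original analysis:
cvfit.ridge <- cv.glmnet(x, y, alpha = 0)   # ridge: pure L2 penalty
cvfit.enet  <- cv.glmnet(x, y, alpha = 0.5) # elastic net: equal mix of L1 and L2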
The third modelling approach we will explore is best subset selection. Given the small number of variables, we can fit a linear regression for every combination of the predictors. This approach becomes computationally expensive and impractical as the number of predictors grows, say beyond roughly 15 predictors. We will choose among the models of different sizes using adjusted R squared and the BIC statistic.
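As a quick illustration of that computational blow-up, the search space for best subset selection contains 2^p candidate models:
2^4  # 16 candidate models with our 4 remaining predictors
2^15 # 32768 with 15 predictors
2^30 # over a billion with 30 predictors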
regfit.best = regsubsets(output.var ~., data = train)
regfit.summary <- summary(regfit.best)
#names(regfit.summary)
which.max(regfit.summary$adjr2) # the best model has 3 variables according to adjusted R squared
## [1] 3
which.min(regfit.summary$bic) # the best model has 3 variables according to the BIC statistic
## [1] 3
plot(regfit.best, scale = "adjr2")
plot(regfit.best, scale = "bic")
plot(regfit.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = 'l',
     main = "Best Subset Using BIC Statistics")
points(3, regfit.summary$bic[3], col = "red", cex = 2, pch = 20)
regfit.full = regsubsets(output.var ~ ., data = clean.data) # refit on the full, unscaled data to report coefficients on the original scale
coef(regfit.full, 3)
## (Intercept) var.1 var.2 var.3
## -65.9964 60.9762 0.5410 0.5408
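These coefficients are on the original (unscaled) data, so the fitted model is just a linear combination of the three selected inputs. As a sketch of how to use them directly (not part of the original report):
b <- coef(regfit.full, 3)
yhat <- b["(Intercept)"] + b["var.1"]*clean.data$var.1 + b["var.2"]*clean.data$var.2 + b["var.3"]*clean.data$var.3
sqrt(mean((clean.data$output.var - yhat)^2)) # in-sample RMSE on the original scale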
In this section, we do exactly what we did in the previous section, but using cross-validation instead of adjusted R squared or BIC to pick the best model. This approach is somewhat involved, as we must perform best subset selection within each of the k training folds. Despite this, R's subsetting syntax makes the job quite easy. First, we create a vector that allocates each observation to one of k = 10 folds, and a matrix in which to store the results. Because the leaps package does not provide a predict method for regsubsets objects, we also implement our own.
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]]) # recover the formula used to fit the regsubsets object
  mat = model.matrix(form, newdata)   # design matrix for the new data
  coefi = coef(object, id = id)       # coefficients of the id-variable model
  xvars = names(coefi)
  mat[, xvars] %*% coefi              # linear predictions
}
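Because the helper follows the S3 naming convention predict.regsubsets, a plain predict() call on a regsubsets object would also dispatch to it. As a quick usage sketch (added here, not in the original analysis), we can apply it to the test set with the model fitted earlier on the training data:
head(predict.regsubsets(regfit.best, test, id = 3)) # predictions from the 3-variable model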
k = 10
set.seed(1)
folds = sample(1:k, nrow(train), replace = TRUE) # assign each observation to one of k folds
cv.errors = matrix(NA, k, 4, dimnames = list(NULL, paste(1:4)))
for (j in 1:k) {
  best.fit = regsubsets(output.var ~ ., data = train[folds != j, ])
  for (i in 1:4) {
    preds = predict.regsubsets(best.fit, train[folds == j, ], id = i)
    cv.errors[j, i] = sqrt(mean((train$output.var[folds == j] - preds)^2))
  }
}
mean.cv.errors = apply(cv.errors ,2 , mean)
mean.cv.errors
## 1 2 3 4
## 0.72790 0.16168 0.03917 0.03930
plot(mean.cv.errors, type = 'b', main = "Best Subset Using Cross-Validation",
     xlab = "Number of Variables")
points(3, min(mean.cv.errors), col = "red", cex = 2, pch = 20)
#We see that cross-validation also selects a 3-variable model.
Now that we know what the final model should look like based on adjusted R squared, BIC, and cross-validation, we evaluate the best subset models on the held-out test set. Not too surprisingly, the results are comparable to what we obtained earlier: an RMSE of about 0.04, which, together with the adjusted R squared values we looked at earlier, again corresponds to a model explaining well over 99% of the variance.
#regfit.best <- regsubsets(output.var ~., data=train)
test.mat <- model.matrix(output.var ~., data=test)
rmse <- rep(NA,4)
for(i in 1:4){
  coefi = coef(regfit.best, id = i)
  pred = test.mat[, names(coefi)] %*% coefi
  rmse[i] = sqrt(mean((test$output.var - pred)^2))
}
min(rmse) # again quite close to the performance of the previous models
## [1] 0.04161
Some final thoughts: