Here are all the packages needed to reproduce this analysis.
suppressMessages(library(ggplot2))
suppressMessages(library(Hmisc))
suppressMessages(library(caret))
suppressMessages(library(glmnet))
suppressMessages(library(leaps))
suppressMessages(library(gridExtra))
options("scipen"=100, "digits"=4)
options(warn=-1)
In this section we load the data into R and take a quick look at the variables to check for missing values, inspect variable types, and so on.
Dataset2 <- read.csv("~/Downloads/Dataset2.csv")
str(Dataset2)
## 'data.frame': 400 obs. of 6 variables:
## $ input.var.1: int 153 41 114 76 126 133 117 202 62 103 ...
## $ Input.var.2: int 47 43 51 45 39 44 49 43 47 55 ...
## $ input.var.3: int 40 41 69 44 71 49 24 70 37 47 ...
## $ Input.var.4: int 330772 31392 388677 77612 491991 10442 50850 1381605 71801 190576 ...
## $ Input.var.5: int 18 11 20 30 41 32 17 5 35 9 ...
## $ Output.var : num 475371 36224 454176 130492 568512 ...
#describe(Dataset2)
names(Dataset2) <- tolower(names(Dataset2))
names(Dataset2) <- gsub(pattern = "input.", replacement = "", x = names(Dataset2))
names(Dataset2) #sanity check
## [1] "var.1" "var.2" "var.3" "var.4" "var.5"
## [6] "output.var"
In this section, we look at the distribution of each variable to see whether any transformation or normalisation may be needed. Input variable 4 looks skewed with many outliers and may need to be transformed; input variable 3 looks skewed as well. The other variables look fairly normal and evenly distributed, with no significant outliers visible in the box plots below. We also check for correlated and linearly dependent predictors. This matters mainly if we plan to use linear models; purely tree-based models such as random forests, CART, or boosting are far less sensitive to correlated or linearly dependent predictors.
p1 <- ggplot(aes(x = var.1), data = Dataset2) + geom_histogram()
p2 <- ggplot(aes(x = var.2), data = Dataset2) + geom_histogram()
p3 <- ggplot(aes(x = var.3), data = Dataset2) + geom_histogram()
p4 <- ggplot(aes(x = var.4), data = Dataset2) + geom_histogram()
p5 <- ggplot(aes(x = var.5), data = Dataset2) + geom_histogram()
grid.arrange(p1, p2, p3, p4, p5, ncol=3, main="Histogram - Dataset2")
b1 <- ggplot(aes(x = "", y = var.1), data = Dataset2) + geom_boxplot()
b2 <- ggplot(aes(x = "", y = var.2), data = Dataset2) + geom_boxplot()
b3 <- ggplot(aes(x = "", y = var.3), data = Dataset2) + geom_boxplot()
b4 <- ggplot(aes(x = "", y = var.4), data = Dataset2) + geom_boxplot()
b5 <- ggplot(aes(x = "", y = var.5), data = Dataset2) + geom_boxplot()
grid.arrange(b1, b2, b3, b4, b5, ncol=3, main="Boxplots - Dataset2")
combo.info <- findLinearCombos(Dataset2[-6]) # all predictors are linearly independent, which is good news if we consider linear models
combo.info
## $linearCombos
## list()
##
## $remove
## NULL
cor.mat <- cor(Dataset2[-6]) # check pairwise correlations among the predictors; findCorrelation flags any above the 0.90 cutoff
highCorr <- findCorrelation(cor.mat, .90)
highCorr
## integer(0)
clean.data <- Dataset2 # findCorrelation returned integer(0): no predictor exceeds the 0.90 cutoff, so nothing is dropped
rm(cor.mat, Dataset2, combo.info, highCorr)
We transform input variable 4 to reduce its skew, hoping to improve accuracy; in practice this didn't make much of a difference. We considered doing the same for input variable 3, which is also slightly skewed, but left it as is. A quick skewness check is sketched after the transform below.
clean.data$var.4 <- (clean.data$var.4)^(1/5) # fifth-root transform to reduce right skew
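To confirm the transform actually reduces the skew, one could compare skewness before and after; a quick sketch, assuming the e1071 package (a suggested dependency of caret) is available:
library(e1071)
skewness(clean.data$var.4^5) # raw scale, recovered by undoing the fifth root (all values are positive)
skewness(clean.data$var.4)   # after the fifth-root transform; values closer to 0 indicate less skew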
We partition the data with caret into a 70/30 training/test split.
set.seed(44) #from random.org
#split the data 70% training set and 30% test set
index <- createDataPartition(clean.data$output.var, p = 0.7, list = FALSE)
train <- clean.data[index,]
test <- clean.data[-index,]
rm(index)
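A quick sanity check on the split sizes (not in the original code); with 400 observations a 70/30 split should give roughly 280 and 120 rows:
nrow(train) # about 280 rows
nrow(test)  # about 120 rows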
The first model we fit is gradient boosted trees (caret's "gbm" method), which is probably the best off-the-shelf model in terms of accuracy. We defined a tuning grid to try to lower the RMSE, although the fit below ends up using caret's default grid because the tuneGrid argument is commented out; a sketch of the explicit grid search appears after the model summary below. The final model gives an RMSE of about 60000.
gbmGrid <- expand.grid(interaction.depth = c(3, 6, 9, 15),
n.trees = seq(300,500,50),
shrinkage = 0.01)
fitControl <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
#repeats = 10
)
set.seed(7)
gbmFit <- train(output.var ~ ., data = train,
method = "gbm",
trControl = fitControl,
#tuneGrid = gbmGrid,
verbose = FALSE)
## Loading required package: gbm
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
gbmFit
## Stochastic Gradient Boosting
##
## 280 samples
## 5 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
##
## Summary of sample sizes: 252, 252, 252, 252, 252, 252, ...
##
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared RMSE SD Rsquared SD
## 1 50 185894 0.8138 36806 0.047450
## 1 100 177291 0.8285 31830 0.036854
## 1 150 175084 0.8320 33210 0.037437
## 2 50 103552 0.9422 26682 0.030748
## 2 100 80687 0.9658 22033 0.019166
## 2 150 70307 0.9738 19899 0.015822
## 3 50 80406 0.9665 16681 0.012487
## 3 100 63720 0.9794 10459 0.005388
## 3 150 57063 0.9833 8407 0.004559
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3 and shrinkage = 0.1.
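Note that the gbmGrid defined earlier is never used, because tuneGrid is commented out and caret falls back to its default grid. A minimal sketch of the explicit grid search (gbmFit.grid is an illustrative name, and depending on the caret version the grid may also need an n.minobsinnode column):
gbmGrid$n.minobsinnode <- 10 # gbm's default; newer caret versions expect this column in the grid
set.seed(7)
gbmFit.grid <- train(output.var ~ ., data = train,
                     method = "gbm",
                     trControl = fitControl,
                     tuneGrid = gbmGrid,
                     verbose = FALSE)
gbmFit.grid$bestTune # tuning combination with the lowest cross-validated RMSE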
plot(gbmFit)
varImp(gbmFit)
## gbm variable importance
##
## Overall
## var.4 100.00
## var.1 59.08
## var.3 53.94
## var.5 1.15
## var.2 0.00
preds <- predict(gbmFit, test)
rmse <- sqrt(mean((test$output.var-preds)^2)) # gradient boosting gives a test-set rmse of about 59590
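Beyond the RMSE, caret's postResample() reports RMSE and R-squared (and, in recent versions, MAE) for held-out predictions; a small sketch of that check:
postResample(pred = preds, obs = test$output.var) # test-set RMSE and R-squared for the gbm fit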
The second modelling approach is a shrinkage method: a regularised linear model fit with glmnet (cv.glmnet defaults to the lasso, alpha = 1). As with the previous method, we use cross-validation to choose the penalty and evaluate the final model on the test set. The RMSE is 202224, which is not great.
x <- as.matrix(train[-6])
y <- train$output.var
cvfit = cv.glmnet(x, y)
plot(cvfit)
cvfit$lambda.min
## [1] 1757
coef(cvfit,s=0.1)
## 6 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -392462.8
## var.1 4373.4
## var.2 -5197.8
## var.3 7821.2
## var.4 13708.7
## var.5 529.3
coef(cvfit, s = "lambda.min")
## 6 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -395707.4
## var.1 4364.5
## var.2 -5085.8
## var.3 7807.6
## var.4 13745.8
## var.5 512.9
preds <- predict(cvfit, newx = as.matrix(test[-6]), s = "lambda.min")
rmse <- sqrt(mean((test$output.var-preds)^2)) # rmse is 202224
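Since cv.glmnet() defaults to the lasso (alpha = 1), one could also try ridge regression by setting alpha = 0 and compare its test RMSE the same way; a sketch (cvfit.ridge and preds.ridge are illustrative names, not part of the original analysis):
cvfit.ridge <- cv.glmnet(x, y, alpha = 0) # alpha = 0 gives the ridge penalty
preds.ridge <- predict(cvfit.ridge, newx = as.matrix(test[-6]), s = "lambda.min")
sqrt(mean((test$output.var - preds.ridge)^2)) # test RMSE for the ridge fit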
Best subset selection approach. The BIC statistic chooses a two-variable model as the best model, while the adjusted R-squared chooses all 5 variables. Looking at the graphs below, one can clearly see why: there is little gain in accuracy once input variables 1 and 3 are in the model.
regfit.best = regsubsets(output.var ~., data = train)
regfit.summary <- summary(regfit.best)
#names(regfit.summary)
which.max(regfit.summary$adjr2) # the best model has all 5 variables according to adjusted R-squared
## [1] 5
which.min(regfit.summary$bic) # the best model has 2 variables according to the BIC statistic
## [1] 2
plot(regfit.best , scale ="adjr2")
plot(regfit.best , scale ="bic")
plot(regfit.summary$bic , xlab ="Number of Variables" , ylab ="BIC" , type = 'l',
main = "Best Subset Using BIC Statistics")
points(2 , regfit.summary$bic[2] , col =" red " , cex =2 , pch =20)
plot(regfit.summary$adjr2 , xlab ="Number of Variables" , ylab ="Adjusted R Square" , type = 'l', main = "Best Subset Usinh Adjusted R Square")
points(5 , regfit.summary$adjr2[5] , col =" red " , cex =2 , pch =20)
regfit.full = regsubsets(output.var ~., data = clean.data)
coef(regfit.full, 5)
## (Intercept) var.1 var.2 var.3 var.4 var.5
## -444352.4 4271.8 -4843.0 7227.1 19505.6 609.9
coef(regfit.full, 2)
## (Intercept) var.1 var.3
## -588976 4950 8430
Now that we have an idea of what the final model should look like based on the adjusted R-squared and BIC, we evaluate each candidate subset on the test set. The minimum RMSE is about 202437, which corresponds to roughly 80% accuracy judging by the adjusted R-squared values we looked at earlier.
#regfit.best <- regsubsets(output.var ~., data=train)
test.mat <- model.matrix(output.var ~., data=test)
rmse <- rep(NA,5)
for(i in 1:5){
  coefi <- coef(regfit.best, id = i)
  pred <- test.mat[, names(coefi)] %*% coefi
  rmse[i] <- sqrt(mean((test$output.var - pred)^2))
}
min(rmse) # minimum test RMSE of about 202437, or roughly 80% accuracy
## [1] 202437
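regsubsets objects have no predict() method, which is why the loop above builds predictions by hand. A common convenience, shown here as a sketch rather than part of the original analysis, is to wrap that logic in a small helper:
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])   # recover the formula passed to regsubsets()
  mat <- model.matrix(form, newdata)     # build the design matrix for the new data
  coefi <- coef(object, id = id)         # coefficients of the model with 'id' variables
  mat[, names(coefi)] %*% coefi          # matrix of predictions
}
# e.g. predict(regfit.best, test, id = 2) reproduces the two-variable predictions above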
Some final thoughts: