Here are all the packages or libraries necessary to reproduce this analysis.

suppressMessages(library(ggplot2))
suppressMessages(library(Hmisc))
suppressMessages(library(caret))
suppressMessages(library(glmnet))
suppressMessages(library(leaps))
suppressMessages(library(gridExtra))
options("scipen"=100, "digits"=4)
options(warn=-1)

In this section we load the data into R and take a quick look at the variables to check for missing values, inspect variable types, and so on.

Dataset2 <- read.csv("~/Downloads/Dataset2.csv")
str(Dataset2)
## 'data.frame':    400 obs. of  6 variables:
##  $ input.var.1: int  153 41 114 76 126 133 117 202 62 103 ...
##  $ Input.var.2: int  47 43 51 45 39 44 49 43 47 55 ...
##  $ input.var.3: int  40 41 69 44 71 49 24 70 37 47 ...
##  $ Input.var.4: int  330772 31392 388677 77612 491991 10442 50850 1381605 71801 190576 ...
##  $ Input.var.5: int  18 11 20 30 41 32 17 5 35 9 ...
##  $ Output.var : num  475371 36224 454176 130492 568512 ...
#describe(Dataset2)
names(Dataset2) <- tolower(names(Dataset2))
names(Dataset2) <- gsub(pattern = "input.", replacement = "", x = names(Dataset2))
names(Dataset2) #sanity check
## [1] "var.1"      "var.2"      "var.3"      "var.4"      "var.5"     
## [6] "output.var"

In this section, we look at the distribution of each variable to see whether any transformation or normalisation may be needed. Input variable 4 looks skewed, with many outliers, and may need to be transformed; input variable 3 looks skewed as well. The other variables look fairly normal and evenly distributed, with no significant outliers beyond those already noted, at least judging from the box plots below. We also look for correlated and linearly dependent predictors. Linear dependence matters most if we plan to use linear-based models; if we only use tree-based models such as random forest, CART or boosting, we can afford to be less strict about it.

p1 <- ggplot(aes(x = var.1), data = Dataset2) + geom_histogram()
p2 <- ggplot(aes(x = var.2), data = Dataset2) + geom_histogram() 
p3 <- ggplot(aes(x = var.3), data = Dataset2) + geom_histogram() 
p4 <- ggplot(aes(x = var.4), data = Dataset2) + geom_histogram()
p5 <- ggplot(aes(x = var.5), data = Dataset2) + geom_histogram() 
grid.arrange(p1, p2, p3, p4, p5, ncol=3, main="Histogram - Dataset2")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

b1 <- ggplot(aes(x = "", y = var.1), data = Dataset2) + geom_boxplot()
b2 <- ggplot(aes(x = "", y = var.2), data = Dataset2) + geom_boxplot()
b3 <- ggplot(aes(x = "", y = var.3), data = Dataset2) + geom_boxplot()
b4 <- ggplot(aes(x = "", y = var.4), data = Dataset2) + geom_boxplot()
b5 <- ggplot(aes(x = "", y = var.5), data = Dataset2) + geom_boxplot()
grid.arrange(b1, b2, b3, b4, b5, ncol=3, main="Boxplots - Dataset2")
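
To put a number on the visual impression of skew, the short check below is a minimal sketch; the skew() helper is our own moment-based estimate, used here to avoid pulling in an extra package.

skew <- function(x) mean((x - mean(x))^3) / sd(x)^3 # simple moment-based skewness
round(sapply(Dataset2[-6], skew), 2) # clearly positive values would confirm the right skew in var.3 and var.4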

combo.info <- findLinearCombos(Dataset2[-6]) # all variables are linearly independent, a good thing if linear models are to be considered
combo.info
## $linearCombos
## list()
## 
## $remove
## NULL
cor.mat <- cor(Dataset2[-6]) # check for highly correlated predictors; caret's findCorrelation flags any above the cutoff
highCorr <- findCorrelation(cor.mat, .90)
highCorr
## integer(0)
clean.data <- Dataset2 # nothing exceeded the 0.90 correlation cutoff, so no predictor is dropped
rm(cor.mat, Dataset2, combo.info, highCorr)
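
For reference, the pairwise correlations themselves can be printed directly (a quick sketch on the untransformed predictors); given the findCorrelation result above, none of them should come close to the 0.90 cutoff.

round(cor(clean.data[-6]), 2) # pairwise correlations among the five inputs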

We transform input variable 4 to reduce its skew in the hope of improving accuracy; in practice this did not make much of a difference. We considered transforming input variable 3 as well, since it is slightly skewed, but left it untouched.

clean.data$var.4 <- (clean.data$var.4)^(1/5) # fifth-root transform to pull in the long right tail
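
A quick before-and-after comparison (a minimal sketch; the original scale is recovered, up to rounding, by raising the transformed column back to the fifth power) shows how the fifth root pulls in the long right tail.

summary(clean.data$var.4^5) # roughly the original scale: a very long right tail
summary(clean.data$var.4)   # fifth-root scale: much more compact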

We partition the data with caret into a 70% training set and a 30% test set.

    set.seed(44) #from random.org
    #split the data 70% training set and 30% test set
    index <- createDataPartition(clean.data$output.var, p = 0.7, list = FALSE)
    train <- clean.data[index,]
    test  <- clean.data[-index,]
    rm(index)
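
A quick sanity check on the split (a minimal sketch): the row counts should be roughly 280 and 120, and the response should look similar in both partitions since createDataPartition stratifies on output.var.

    dim(train); dim(test)
    summary(train$output.var)
    summary(test$output.var)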

The first model we will use is gradient boosted trees (gbm), which is probably the best off-the-shelf model when accuracy is the priority. We set up a tuning grid to try to improve the RMSE, although the fit below ends up using caret's default grid because the tuneGrid argument is commented out. The final model gives a test RMSE of about 60000.

gbmGrid <-  expand.grid(interaction.depth = c(3, 6, 9, 15),
                        n.trees = seq(300,500,50),
                        shrinkage = 0.01)
    
    fitControl <- trainControl(## 10-fold CV
      method = "repeatedcv",
      number = 10
      #repeats = 10
    )
    
    set.seed(7)
    gbmFit <- train(output.var ~ ., data = train,
                    method = "gbm",
                    trControl = fitControl,
                    #tuneGrid = gbmGrid,
                    verbose = FALSE)
## Loading required package: gbm
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize
    gbmFit
## Stochastic Gradient Boosting 
## 
## 280 samples
##   5 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## 
## Summary of sample sizes: 252, 252, 252, 252, 252, 252, ... 
## 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE    Rsquared  RMSE SD  Rsquared SD
##   1                   50      185894  0.8138    36806    0.047450   
##   1                  100      177291  0.8285    31830    0.036854   
##   1                  150      175084  0.8320    33210    0.037437   
##   2                   50      103552  0.9422    26682    0.030748   
##   2                  100       80687  0.9658    22033    0.019166   
##   2                  150       70307  0.9738    19899    0.015822   
##   3                   50       80406  0.9665    16681    0.012487   
##   3                  100       63720  0.9794    10459    0.005388   
##   3                  150       57063  0.9833     8407    0.004559   
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3 and shrinkage = 0.1.
    plot(gbmFit)

    varImp(gbmFit)
## gbm variable importance
## 
##       Overall
## var.4  100.00
## var.1   59.08
## var.3   53.94
## var.5    1.15
## var.2    0.00
    preds <- predict(gbmFit, test)
    rmse <- sqrt(mean((test$output.var-preds)^2)) # the boosted-tree model gives a test RMSE of about 59590
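
Note that gbmGrid defined earlier was never passed to train, so caret fell back on its default grid. A hedged sketch of how the custom grid could actually be used is below (extra compute, and no guarantee it beats the default; compare the resampled RMSE values to find out).

    set.seed(7)
    gbmFit2 <- train(output.var ~ ., data = train,
                     method = "gbm",
                     trControl = fitControl,
                     tuneGrid = gbmGrid, # deeper trees, more iterations, smaller shrinkage than the default grid
                     verbose = FALSE)
    # gbmFit2$bestTune and min(gbmFit2$results$RMSE) show whether the custom grid actually helps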

The second modelling approach we will use is a shrinkage method: a linear model with regularization (here the lasso, glmnet's default penalty). Like the previous method, we use cross-validation to choose the model and then evaluate the final model on the test set. The RMSE is 202224, which is not great.

    x <- as.matrix(train[-6])
    y <- train$output.var
    cvfit = cv.glmnet(x, y)
    plot(cvfit)

    cvfit$lambda.min
## [1] 1757
    coef(cvfit,s=0.1)
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept) -392462.8
## var.1          4373.4
## var.2         -5197.8
## var.3          7821.2
## var.4         13708.7
## var.5           529.3
    coef(cvfit, s = "lambda.min")
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept) -395707.4
## var.1          4364.5
## var.2         -5085.8
## var.3          7807.6
## var.4         13745.8
## var.5           512.9
    preds <- predict(cvfit, newx = as.matrix(test[-6]), s = "lambda.min")
    rmse <- sqrt(mean((test$output.var-preds)^2)) # rmse is 202224
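
cv.glmnet above uses glmnet's default alpha = 1, i.e. the lasso. A small sketch of also trying ridge and an elastic-net mix, in case a different penalty suits this data better (results are not guaranteed to improve):

    for (a in c(0, 0.5, 1)) { # ridge, elastic net, lasso
      set.seed(44)
      fit.a  <- cv.glmnet(x, y, alpha = a)
      pred.a <- predict(fit.a, newx = as.matrix(test[-6]), s = "lambda.min")
      cat("alpha =", a, " test RMSE =", round(sqrt(mean((test$output.var - pred.a)^2))), "\n")
    }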

The third approach is best subset selection. The BIC statistic chooses a model with only two variables as the best model, while the adjusted R-squared chooses all 5 variables. Looking at the graphs below, one can clearly see why: there is little gain in accuracy beyond input variables 1 and 3.

regfit.best = regsubsets(output.var ~., data = train)
regfit.summary <- summary(regfit.best)
#names(regfit.summary)
which.max(regfit.summary$adjr2) # the best model has all 5 variables according to adjusted R-squared
## [1] 5
which.min(regfit.summary$bic)  # the best model has 2 variables according to the BIC statistic
## [1] 2
plot(regfit.best , scale ="adjr2")

plot(regfit.best , scale ="bic")

plot(regfit.summary$bic , xlab ="Number of Variables" , ylab ="BIC" , type = 'l', 
     main = "Best Subset Using BIC Statistics")
points(2 , regfit.summary$bic[2] , col =" red " , cex =2 , pch =20)

plot(regfit.summary$adjr2 , xlab ="Number of Variables" , ylab ="Adjusted R Square" , type = 'l', main = "Best Subset Usinh Adjusted R Square")
points(5 , regfit.summary$adjr2[5] , col =" red " , cex =2 , pch =20)

regfit.full = regsubsets(output.var ~., data = clean.data)
coef(regfit.full, 5)
## (Intercept)       var.1       var.2       var.3       var.4       var.5 
##   -444352.4      4271.8     -4843.0      7227.1     19505.6       609.9
coef(regfit.full, 2)
## (Intercept)       var.1       var.3 
##     -588976        4950        8430

Now that we know what the final model should look like based on adjusted R-squared, BIC, and cross-validation, we evaluate the best subset model of each size on the test set. The best test RMSE is about 202437, which translates to an accuracy of roughly 80% (based on the adjusted R-squared values we looked at earlier).

#regfit.best <- regsubsets(output.var ~., data=train)
test.mat <- model.matrix(output.var ~ ., data = test)
rmse <- rep(NA, 5)
for (i in 1:5) {
  coefi <- coef(regfit.best, id = i)              # coefficients of the best model of size i
  pred  <- test.mat[, names(coefi)] %*% coefi     # predictions on the test set
  rmse[i] <- sqrt(mean((test$output.var - pred)^2))
}
min(rmse) # about 202437, i.e. roughly 80% accuracy
## [1] 202437
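
The 80% figure is read off the adjusted R-squared results; it can also be checked directly on the test set with a minimal sketch that converts the best test RMSE into a test R-squared.

sst <- sum((test$output.var - mean(test$output.var))^2) # total sum of squares on the test set
sse <- min(rmse)^2 * nrow(test)                         # error sum of squares implied by the best test RMSE
1 - sse / sst                                           # test-set R-squared; should land broadly around the 80% quoted above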

Some final thoughts:

  1. The analysis of data set 1 was slightly more elaborate than this one.
  2. None of the variables in data set 2 were highly correlated. Input variable 4 is skewed in both data sets, and in data set 2 it also contains outliers; input variable 3 is slightly skewed as well. We did not notice any outliers in data set 1.
  3. Linear-based models did not do well on this data set: an RMSE of about 202400, or roughly 80% accuracy, compared to the boosted trees, which gave an RMSE of about 60000.
  4. If time weren't an issue, we might have tried model aggregation techniques (this is NOT an ensemble approach like random forest) to reduce the RMSE and hence increase accuracy; a simple illustration is sketched below.
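
As a flavour of point 4, the sketch below (purely illustrative) averages the boosted-tree and lasso predictions already computed in this analysis. With two models this far apart in accuracy, a plain average is unlikely to beat the boosted trees on their own, so weighted or stacked combinations would be the more realistic next step.

preds.gbm   <- predict(gbmFit, test)
preds.lasso <- as.numeric(predict(cvfit, newx = as.matrix(test[-6]), s = "lambda.min"))
preds.avg   <- (preds.gbm + preds.lasso) / 2  # naive equal-weight aggregate of the two models
sqrt(mean((test$output.var - preds.avg)^2))   # compare against the individual test RMSEs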