Here are all the packages or libraries necessary to reproduce this analysis.

suppressMessages(library(ggplot2))
suppressMessages(library(Hmisc))
suppressMessages(library(caret))
suppressMessages(library(glmnet))
suppressMessages(library(leaps))
suppressMessages(library(gridExtra))
options("scipen"=100, "digits"=4)
options(warn=-1)

In this section we load the data into R and take a quick look at the variables to check for missing values, inspect variable types, and so on.

Dataset1 <- read.csv("~/Downloads/Dataset1.csv")
str(Dataset1)
## 'data.frame':    400 obs. of  6 variables:
##  $ input.var.1: num  1.14 1.06 1.07 1.14 1.01 1.14 1.09 1.03 1.1 1.13 ...
##  $ Input.var.2: int  90 98 54 69 63 61 55 70 80 37 ...
##  $ input.var.3: int  91 99 100 67 92 31 44 100 56 42 ...
##  $ Input.var.4: num  364.5 470.6 78.7 164.2 125 ...
##  $ Input.var.5: int  26 25 38 41 29 6 29 35 49 29 ...
##  $ Output.var : num  103.2 104.4 82.4 77.5 78.3 ...
#describe(Dataset1)
names(Dataset1) <- tolower(names(Dataset1))
names(Dataset1) <- gsub(pattern = "input.", replacement = "", x = names(Dataset1))
names(Dataset1) #sanity check
## [1] "var.1"      "var.2"      "var.3"      "var.4"      "var.5"     
## [6] "output.var"

In this section, we look at the distribution of each variable to see whether any transformation or normalisation may be needed. Input variable 4 looks skewed and may need to be transformed; the other variables look fairly normal and evenly distributed, and the box plots below show no notable outliers. We also look for correlated and linearly dependent predictors. This matters mainly if we plan to use linear models; if we use only tree-based models such as random forests, CART or boosting, correlated predictors are less of a concern.

p1 <- ggplot(aes(x = var.1), data = Dataset1) + geom_histogram()
p2 <- ggplot(aes(x = var.2), data = Dataset1) + geom_histogram() 
p3 <- ggplot(aes(x = var.3), data = Dataset1) + geom_histogram() 
p4 <- ggplot(aes(x = var.4), data = Dataset1) + geom_histogram()
p5 <- ggplot(aes(x = var.5), data = Dataset1) + geom_histogram() 
grid.arrange(p1, p2, p3, p4, p5, ncol=3, main="Histogram - Dataset1")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this. (message repeated for each of the five panels)

b1 <- ggplot(aes(x = var.1, y = var.1), data = Dataset1) + geom_boxplot()
b2 <- ggplot(aes(x = var.2, y = var.2), data = Dataset1) + geom_boxplot()
b3 <- ggplot(aes(x = var.3, y = var.3), data = Dataset1) + geom_boxplot()
b4 <- ggplot(aes(x = var.4, y = var.4), data = Dataset1) + geom_boxplot()
b5 <- ggplot(aes(x = var.5, y = var.5), data = Dataset1) + geom_boxplot()
grid.arrange(b1, b2, b3, b4,b5, ncol=3, main="Boxplot - Dataset1")
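
If var.4's right skew needed to be addressed for a linear model, a log transform would be a natural first try. This is a sketch for illustration only (assuming var.4 is strictly positive, as the values shown by str() suggest); we do not apply it here since var.4 is dropped below anyway.

p4.log <- ggplot(aes(x = log(var.4)), data = Dataset1) + geom_histogram() # histogram of log(var.4)
p4.log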

combo.info <- findLinearCombos(Dataset1[-6]) # all variables are linearly independent, a good sign if we consider linear models
combo.info
## $linearCombos
## list()
## 
## $remove
## NULL
cor.mat <- cor(Dataset1[-6]) # variables 2 and 4 are more than 95% correlated. We'll use caret to remove one, although one could pick either.
highCorr <- findCorrelation(cor.mat, .90)
highCorr
## [1] 4
clean.data <- Dataset1[,-highCorr] #var.4 was dropped. Again, one could have chosen to drop var.2 
rm(cor.mat, Dataset1, combo.info, highCorr)
    set.seed(44) #from random.org
    #split the data 70% training set and 30% test set
    index <- createDataPartition(clean.data$output.var, p = 0.7, list = FALSE)
    train <- clean.data[index,]
    test  <- clean.data[-index,]
    preProcValues <- preProcess(train, method = c("center", "scale"))
    train <- predict(preProcValues, train)
    test <- predict(preProcValues, test)
    rm(index, preProcValues)

The first model that we will use is gradient boosted trees (GBM), which is probably the best off-the-shelf model for generating high predictive accuracy. It does not require much transformation of the data, so most of the preprocessing and cleaning we did is not strictly necessary for this model. Without much tuning, it gives us an RMSE of about 0.089 on the test set, or roughly 98% accuracy.

    fitControl <- trainControl(## 10-fold CV
      method = "repeatedcv",
      number = 10
      #repeats = 10
    )
    
    set.seed(7)
    gbmFit <- train(output.var ~ ., data = train,
                    method = "gbm",
                    trControl = fitControl,
                    verbose = FALSE)
    gbmFit
## Stochastic Gradient Boosting 
## 
## 280 samples
##   4 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## 
## Summary of sample sizes: 252, 252, 252, 252, 252, 252, ... 
## 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE     Rsquared  RMSE SD  Rsquared SD
##   1                   50      0.28302  0.9524    0.04758  0.014459   
##   1                  100      0.14940  0.9842    0.02642  0.005120   
##   1                  150      0.10414  0.9908    0.01668  0.002775   
##   2                   50      0.19170  0.9735    0.03764  0.009325   
##   2                  100      0.11648  0.9875    0.01930  0.003970   
##   2                  150      0.10282  0.9900    0.01778  0.003322   
##   3                   50      0.14884  0.9821    0.03466  0.007926   
##   3                  100      0.10294  0.9902    0.01964  0.002974   
##   3                  150      0.09285  0.9919    0.01782  0.002630   
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3 and shrinkage = 0.1.
    plot(gbmFit)

    varImp(gbmFit)
## gbm variable importance
## 
##       Overall
## var.3  100.00
## var.2   91.32
## var.1    3.53
## var.5    0.00
    preds <- predict(gbmFit, test)
    rmse <- sqrt(mean((test$output.var-preds)^2)) # gradient boosting gives an RMSE of about 0.089
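
The accuracy figures quoted in this report are based on R-squared, i.e. variance explained (see the final thoughts); a minimal sketch of computing R-squared on the held-out test set:

    ss.res <- sum((test$output.var - preds)^2)                 # residual sum of squares
    ss.tot <- sum((test$output.var - mean(test$output.var))^2) # total sum of squares
    1 - ss.res/ss.tot                                          # test-set R-squared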

The second modelling approach we will use is a shrinkage method: a linear model with regularization. Like the previous method, we will use cross-validation to build the model and perform the final evaluation on the validation set. The RMSE is 0.033, slightly closer to 100% accuracy than boosted trees, at about 99.85%. Another drawback of tree-based models is that their results are harder to interpret.

    x <- as.matrix(train[-5]) # predictors only; column 5 is the response
    y <- train$output.var
    cvfit = cv.glmnet(x, y)
    plot(cvfit)

    cvfit$lambda.min
## [1] 0.003445
    coef(cvfit,s=0.1)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                                  1
## (Intercept) -0.0000000000000002611
## var.1        0.0365390788652994458
## var.2        0.5992948843992818020
## var.3        0.6454768291884692699
## var.5        .
    coef(cvfit, s = "lambda.min")
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                                  1
## (Intercept) -0.0000000000000005487
## var.1        0.1540221000300564247
## var.2        0.7174260257999696133
## var.3        0.7608904925553053866
## var.5        .
    preds <- predict(cvfit, newx = as.matrix(test[-5]), s = "lambda.min")
    rmse <- sqrt(mean((test$output.var-preds)^2)) # rmse is 0.033
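
One could also tune the elastic-net mixing parameter alpha (cv.glmnet defaults to alpha = 1, the lasso); as noted in the final thoughts, the defaults already perform very well here, so the following is only a sketch, with alpha.grid and foldid as illustrative names:

    set.seed(7)
    foldid <- sample(rep(1:10, length.out = nrow(x)))  # reuse the same folds for every alpha
    alpha.grid <- seq(0, 1, by = 0.25)                 # 0 = ridge, 1 = lasso
    cv.fits <- lapply(alpha.grid, function(a) cv.glmnet(x, y, alpha = a, foldid = foldid))
    cv.min <- sapply(cv.fits, function(f) min(f$cvm))  # best cross-validated error for each alpha
    alpha.grid[which.min(cv.min)]                      # alpha with the lowest CV error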

The third modelling approach we will explore is best subset selection. Given the small number of variables, we can fit a linear regression model for every combination of the predictor variables, which is exactly what best subset selection does. The approach becomes computationally impractical when there are more than about 15 predictors. We will choose among the models of different sizes using adjusted R-squared and the BIC statistic.

regfit.best = regsubsets(output.var ~., data = train)
regfit.summary <- summary(regfit.best)
#names(regfit.summary)
which.max(regfit.summary$adjr2) # the best model has 3 variables according to adjusted R-squared
## [1] 3
which.min(regfit.summary$bic)  # the best model has 3 variables according to BIC
## [1] 3
plot(regfit.best, scale = "adjr2")

plot(regfit.best, scale = "bic")

plot(regfit.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = 'l',
     main = "Best Subset Using BIC Statistics")
points(3, regfit.summary$bic[3], col = "red", cex = 2, pch = 20)

regfit.full = regsubsets(output.var ~., data = clean.data)
coef(regfit.full, 3)
## (Intercept)       var.1       var.2       var.3 
##    -65.9964     60.9762      0.5410      0.5408

In this section, we do exactly what we did in the previous section, but using cross-validation instead of adjusted R-squared or BIC to determine the best model. This approach is somewhat involved, as we must perform best subset selection within each of the k training folds. Despite this, R's subsetting syntax makes the job quite easy. First, we create a vector that allocates each observation to one of k = 10 folds, and a matrix in which we will store the results. Because the leaps package does not provide a predict method for regsubsets objects, we also implement our own.

# predict() method for regsubsets objects, which leaps does not provide
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])  # recover the formula used in the regsubsets call
  mat = model.matrix(form, newdata)    # build the design matrix for the new data
  coefi = coef(object, id = id)        # coefficients of the best id-variable model
  xvars = names(coefi)
  mat[, xvars] %*% coefi               # predictions
}

k = 10
set.seed(1)
folds = sample(1:k, nrow(train), replace = TRUE)
cv.errors = matrix(NA, k, 4, dimnames = list(NULL, paste(1:4)))
for (j in 1:k) {
  best.fit = regsubsets(output.var ~ ., data = train[folds != j, ])
  for (i in 1:4) {
    preds = predict.regsubsets(best.fit, train[folds == j, ], id = i)
    cv.errors[j, i] = sqrt(mean((train$output.var[folds == j] - preds)^2))
  }
}

mean.cv.errors = apply(cv.errors, 2, mean)
mean.cv.errors
##       1       2       3       4 
## 0.72790 0.16168 0.03917 0.03930
plot(mean.cv.errors, type = 'b', main = "Best Subset Using Cross-Validation",
     xlab = "Number of Variables")
points(3, min(mean.cv.errors), col = "red", cex = 2, pch = 20)

#We see that cross-validation also selects a 3-variable model. 

Now that we know what the final model should look like based on adjusted R-squared, BIC, and cross-validation, we evaluate the best subset models on the validation set. Not too surprisingly, the results are comparable to what we obtained earlier: an RMSE of about 0.04, which translates to an accuracy rate of about 99.85% (based on the adjusted R-squared values we looked at earlier).

#regfit.best <- regsubsets(output.var ~., data=train)
test.mat <- model.matrix(output.var ~., data=test)
rmse <- rep(NA,4)
for(i in 1:4){
  coefi=coef(regfit.best,id=i)
  pred=test.mat[,names(coefi)]%*%coefi
  rmse[i]=sqrt(mean((test$output.var-pred)^2))
}
min(rmse) # again quite close to 100%, about 99.9%
## [1] 0.04161

Some final thoughts:

  1. Linear models gave us an accuracy of about 99.9%, while tree-based models got about 98%. Our error criterion was RMSE. If model interpretation is an important factor, linear models should be the preferred approach.
  2. Linear regression via best subset selection gave very similar results to the shrinkage-based approach, about 99.9%. Given that these are simulated data, this is not surprising.
  3. With shrinkage models in the glmnet library, one could also have varied the value of alpha. But with the default settings we already got close to 100% accuracy, so we didn't bother.
  4. Although I did not do it for every method here, the final model should be built on the entire data set after all validation has been conducted; using the full data set gives more accurate coefficient estimates (see the sketch after this list).
  5. A one-unit increase in variable 1 corresponds to an increase of about 0.62 units in the response variable. Variables 4 and 5 are almost insignificant and should be discarded in the final model, based on BIC, adjusted R-squared and other statistics.
  6. Validation and cross-validation sets were used with all models, given the small size of the data set. Other approaches such as SVMs or model aggregation techniques could have been used as well.
  7. Best subset selection is our preferred method given the small dimensional space (2^5 = 32 candidate models). Shrinkage models are particularly useful for high-dimensional data.
  8. In both data sets, input variable 4 is skewed to the right.
  9. This analysis took about 3 hours or so to complete, although I spread it over a couple of days due to family and other work obligations. I hope this doesn't disqualify me :-)
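
For point 4, here is a minimal sketch of what refitting the selected 3-variable model on the full (cleaned, unscaled) data would look like; final.fit is an illustrative name, and its coefficients should match those reported by coef(regfit.full, 3) above.

final.fit <- lm(output.var ~ var.1 + var.2 + var.3, data = clean.data) # refit on all 400 observations
summary(final.fit)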