Loading and preprocessing the data

The summary below shows that the data has no missing values, and the min/max ranges suggest no obvious outliers
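
Throughout this report we use caret for modeling, ggplot2 for plots and knitr for tables; a minimal setup sketch, assuming the packages are already installed:

#load the packages used in this analysis
library(caret)    #createDataPartition, train, varImp, postResample
library(ggplot2)  #charts
library(knitr)    #kable tables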

completeResp <- read.csv("CompleteResponses.csv")
summary(completeResp)
##      salary            age            elevel           car       
##  Min.   : 20000   Min.   :20.00   Min.   :0.000   Min.   : 1.00  
##  1st Qu.: 52082   1st Qu.:35.00   1st Qu.:1.000   1st Qu.: 6.00  
##  Median : 84950   Median :50.00   Median :2.000   Median :11.00  
##  Mean   : 84871   Mean   :49.78   Mean   :1.983   Mean   :10.52  
##  3rd Qu.:117162   3rd Qu.:65.00   3rd Qu.:3.000   3rd Qu.:15.75  
##  Max.   :150000   Max.   :80.00   Max.   :4.000   Max.   :20.00  
##     zipcode          credit           brand       
##  Min.   :0.000   Min.   :     0   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:120807   1st Qu.:0.0000  
##  Median :4.000   Median :250607   Median :1.0000  
##  Mean   :4.041   Mean   :249176   Mean   :0.6217  
##  3rd Qu.:6.000   3rd Qu.:374640   3rd Qu.:1.0000  
##  Max.   :8.000   Max.   :500000   Max.   :1.0000

Importance of each feature used

To choose the features for the model, a correlation matrix was created to examine the relationships between all the features. This was done before discretising 4 of the features, since the correlation matrix only works on numerical features. The output is shown below. The only correlation in the data is a weak one between salary and brand (0.2), which is why our focus was on salary to predict the brand.

cor(completeResp)
##               salary          age        elevel           car      zipcode
## salary   1.000000000  0.007978566 -6.620234e-03 -6.090575e-03 -0.005471132
## age      0.007978566  1.000000000 -5.830340e-03  1.024607e-02  0.003681375
## elevel  -0.006620234 -0.005830340  1.000000e+00 -4.676852e-05  0.018095400
## car     -0.006090575  0.010246067 -4.676852e-05  1.000000e+00  0.001526528
## zipcode -0.005471132  0.003681375  1.809540e-02  1.526528e-03  1.000000000
## credit  -0.025126808 -0.004400692  2.720642e-03 -1.032914e-02  0.004962011
## brand    0.206489883  0.013713286 -4.828912e-03  5.923147e-03  0.004665088
##               credit        brand
## salary  -0.025126808  0.206489883
## age     -0.004400692  0.013713286
## elevel   0.002720642 -0.004828912
## car     -0.010329137  0.005923147
## zipcode  0.004962011  0.004665088
## credit   1.000000000  0.005688438
## brand    0.005688438  1.000000000

We can see that only salary has some positive correlation with the brand, so our focus will be on this feature.
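
If a graphical view is preferred, the same matrix can be rendered as a plot; a quick sketch, assuming the corrplot package is installed (this must run before the factor conversion below, since cor() needs numeric columns):

#visualize the correlation matrix (before converting features to factors)
corrplot::corrplot(cor(completeResp), method = "circle")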

After reading the surveykey.xlsx file, we know that some of our features should be converted to factors rather than kept numerical.

str(completeResp)
## 'data.frame':    9898 obs. of  7 variables:
##  $ salary : num  119807 106880 78021 63690 50874 ...
##  $ age    : int  45 63 23 51 20 56 24 62 29 41 ...
##  $ elevel : int  0 1 0 3 3 3 4 3 4 1 ...
##  $ car    : int  14 11 15 6 14 14 8 3 17 5 ...
##  $ zipcode: int  4 6 2 5 4 3 5 0 0 4 ...
##  $ credit : num  442038 45007 48795 40889 352951 ...
##  $ brand  : int  0 1 0 1 0 1 1 1 0 1 ...
#Preprocessing: convert elevel, car, zipcode and brand to factors
completeResp[,c(3:5,7)] <- lapply(completeResp[,c(3:5,7)] , factor)

Now we need to split the data into training and testing sets to build and assess our models.

#Set seed for reproducibility
set.seed(998)

#define a 75%/25% train/test split of the dataset
inTraining <- createDataPartition(completeResp$brand, p = .75, list = FALSE)
training <- completeResp[inTraining,]
testing <- completeResp[-inTraining,]

Classifiers Tried:

C5.0

C5.0 can produce a classifier expressed either as a decision tree or as a ruleset. In many applications, rulesets are preferred because they are simpler and easier to understand than decision trees.

#10 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)

#Set seed for reproducibility
set.seed(998)

#train C5.0 classification model
# The tuneGrid parameter lets us specify exactly which values each tuning parameter takes,
# while tuneLength only sets how many default candidate values caret tries per parameter.
C5Fit <- train(brand~., data = training, method = "C5.0", trControl=fitControl, tuneLength = 2)

#check the results
C5Fit
## C5.0 
## 
## 7424 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 6681, 6681, 6682, 6681, 6683, 6682, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE    1      0.8254401  0.6510231
##   rules  FALSE   10      0.9210663  0.8314165
##   rules   TRUE    1      0.8249015  0.6499032
##   rules   TRUE   10      0.9206607  0.8306052
##   tree   FALSE    1      0.8254401  0.6510231
##   tree   FALSE   10      0.9225519  0.8357656
##   tree    TRUE    1      0.8247668  0.6496340
##   tree    TRUE   10      0.9225484  0.8356501
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 10, model = tree
##  and winnow = FALSE.
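
For reference, the explicit tuneGrid equivalent of this search would enumerate the three C5.0 tuning parameters caret exposes (trials, model, winnow); a sketch covering the same candidates the tuneLength search produced:

#explicit grid matching the candidates evaluated above
c5Grid <- expand.grid(trials = c(1, 10),
                      model = c("tree", "rules"),
                      winnow = c(TRUE, FALSE))
#C5Fit <- train(brand~., data = training, method = "C5.0",
#               trControl = fitControl, tuneGrid = c5Grid)
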
#How the model prioritized each feature in the training
varImp(C5Fit)
## C5.0 variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##          Overall
## salary     100.0
## age         85.1
## car6         0.0
## car18        0.0
## zipcode3     0.0
## car20        0.0
## car14        0.0
## elevel2      0.0
## car19        0.0
## car15        0.0
## zipcode4     0.0
## car16        0.0
## car2         0.0
## zipcode7     0.0
## credit       0.0
## car13        0.0
## zipcode2     0.0
## zipcode1     0.0
## car11        0.0
## car10        0.0
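
To inspect the tree (or ruleset) the selected model actually learned, the underlying C5.0 object can be summarized; a sketch using the C50 package's summary method:

#print the learned tree/rules of the final C5.0 model
summary(C5Fit$finalModel)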

Random Forests

Random Forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. The randomness built into the algorithm makes it resistant to overfitting.

#10 fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)

#Dataframe for manual tuning of mtry
rfGrid <- expand.grid(mtry=c(3,6,10,12,18))

#Set seed for reproducibility
set.seed(998)

#Train Random Forest classification model
rfFit <- train(brand~., data = training, method = "rf", trControl=fitControl, tuneGrid=rfGrid)

#training results
rfFit
## Random Forest 
## 
## 7424 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 6681, 6681, 6682, 6681, 6683, 6682, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    3    0.7343772  0.3613482
##    6    0.9101531  0.8098533
##   10    0.9214697  0.8336403
##   12    0.9220090  0.8347370
##   18    0.9233549  0.8373730
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 18.
#How the model prioritized each feature in the training
varImp(rfFit)
## rf variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##           Overall
## salary   100.0000
## age       67.8408
## credit     9.8106
## elevel3    0.7260
## elevel4    0.6664
## elevel1    0.6498
## elevel2    0.5902
## zipcode7   0.5675
## zipcode6   0.4471
## zipcode1   0.4399
## zipcode2   0.4340
## zipcode4   0.3589
## zipcode3   0.3501
## zipcode5   0.3496
## car15      0.3373
## zipcode8   0.3005
## car12      0.2988
## car7       0.2619
## car17      0.2453
## car5       0.2299
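
Random forests also provide a built-in out-of-bag (OOB) error estimate; printing the underlying model object shows it (a sketch, assuming the default randomForest backend caret uses for method = "rf"):

#inspect the final randomForest object, including its OOB error estimate
rfFit$finalModel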

Gradient Boosting

Gradient Boosting is a machine learning technique which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
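
The stage-wise idea can be illustrated in a few lines: each stage fits a small tree to the current residuals (the negative gradient of the squared-error loss) and adds a shrunken copy of it to the ensemble. A toy sketch on simulated data using rpart stumps (illustration only, not the actual gbm implementation):

#toy stage-wise boosting on simulated regression data
library(rpart)
set.seed(998)
toy <- data.frame(x = runif(200))
toy$y <- sin(2 * pi * toy$x) + rnorm(200, sd = 0.2)

boostPred <- rep(mean(toy$y), nrow(toy))   #stage 0: constant model
shrinkage <- 0.1
for (m in 1:50) {
  toy$resid <- toy$y - boostPred           #residuals = negative gradient
  stump <- rpart(resid ~ x, data = toy,
                 control = rpart.control(maxdepth = 2))
  boostPred <- boostPred + shrinkage * predict(stump)  #shrunken update
}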

#10 fold cross validation
fitControlGBM <- trainControl(method = "repeatedcv", number = 10, repeats = 1)

#Grid to define our own parameters for this classifier
gbmGrid <- expand.grid(interaction.depth = c(3,5),
                       n.trees = 250,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)

#Set seed for reproducibility
set.seed(998)

#Train GBM classification model
gbmFit <- train(brand~., data = training, 
                 method = "gbm", 
                 trControl = fitControlGBM, 
                 verbose = FALSE, 
                 ## specifying the exact models 
                 ## to evaluate:
                 tuneGrid = gbmGrid)

#training results
gbmFit
## Stochastic Gradient Boosting 
## 
## 7424 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 6681, 6681, 6682, 6681, 6683, 6682, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  Accuracy   Kappa    
##   3                  0.9245689  0.8404536
##   5                  0.9232230  0.8373474
## 
## Tuning parameter 'n.trees' was held constant at a value of 250
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 20
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 250,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 20.
#How the model prioritized each feature in the training
plot(varImp(gbmFit))

Classifier selected to make the predictions

Below are the best metrics for each classifier I used for this task:

Classifier <- c('Random Forest','C5.0','Gradient Boosting Trees')
Accuracy <- c(0.923, 0.923, 0.925)
Kappa <- c(0.837, 0.836, 0.840)
metrics <- data.frame(Classifier, Accuracy, Kappa)
kable(metrics)
Classifier                Accuracy  Kappa
Random Forest                0.923  0.837
C5.0                         0.923  0.836
Gradient Boosting Trees      0.925  0.840

I picked the most accurate model after optimization, which was the gradient boosting classifier. I have a high level of confidence in it: three different algorithms reached similar cross-validated accuracy, which suggests the performance is genuine rather than an artifact of over-fitting (this will be verified with the postResample() method on the held-out test set).

Model Accuracy

Now that we have our model, we will apply it to the test set to see how accurately it predicts.

pred <- predict(gbmFit, newdata = testing)
summary(pred)
##    0    1 
##  954 1520
postResample(pred, testing$brand)
##  Accuracy     Kappa 
## 0.9264349 0.8441964

After using postResample() with the predictions and the test set, we got an accuracy and kappa that are almost identical to the results achieved during the model optimization process.

This is good news, as it means our model generalizes well and isn't over-fitted. And since our data has no outliers, it's safe to use this model to predict the missing values in our survey.
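
For a more detailed view than overall accuracy and kappa (the confusion table itself plus per-class sensitivity and specificity), caret's confusionMatrix() can be applied to the same predictions; a sketch:

#full confusion matrix and per-class statistics on the test set
confusionMatrix(pred, testing$brand)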

Visualizing where the errors occurred

Our 92.6% accuracy still leaves quite a few errors. The plot below shows the distribution of errors. We can easily spot that most errors occurred around the edges of the prediction regions, i.e. near the model's decision boundaries, where it struggled to decide which brand to predict.

#adding predictions to the test set
testing$pred <- pred

#plotting the distribution of errors compared to the rest of the data
ggplot(data=testing) +
  geom_point(mapping =aes(x=salary, y=age, color=(pred==1))) +
  facet_grid(brand ~ . ) +
  guides(color = "none")

Loading and preprocessing incomplete data

incompleteData <- read.csv("SurveyIncomplete.csv")

#preprocess
incompleteData[,c(3:5,7)] <- lapply(incompleteData[,c(3:5,7)] , factor)

Predicting incomplete data

Finally, we can now predict the missing brand answers in the survey using the most optimized model.

#predict incomplete data
incompleteData$brand <- predict(gbmFit, newdata = incompleteData)
summary(incompleteData$brand)
##    0    1 
## 1925 3075

As seen in the summary, the model predicts that customers would pick Sony over Acer 61.5% of the time.
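
That share can also be computed directly from the prediction counts, 3075 / (1925 + 3075) = 0.615; a one-line sketch:

#proportion of predicted answers per brand
prop.table(table(incompleteData$brand))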

Charts showing predicted answers of the incomplete data and the overall data

#combining both data frames
completeResp <- do.call(rbind, list(incompleteData, completeResp))
summary(completeResp)
##      salary            age        elevel        car           zipcode    
##  Min.   : 20000   Min.   :20.00   0:3033   18     :  786   8      :1698  
##  1st Qu.: 52139   1st Qu.:35.00   1:2941   15     :  781   6      :1697  
##  Median : 85220   Median :50.00   2:3003   8      :  774   5      :1676  
##  Mean   : 85102   Mean   :49.81   3:2948   19     :  765   4      :1672  
##  3rd Qu.:117641   3rd Qu.:65.00   4:2973   5      :  760   2      :1657  
##  Max.   :150000   Max.   :80.00            17     :  758   7      :1654  
##                                            (Other):10274   (Other):4844  
##      credit       brand   
##  Min.   :     0   0:5669  
##  1st Qu.:121173   1:9229  
##  Median :250689           
##  Mean   :249288           
##  3rd Qu.:374897           
##  Max.   :500000           
## 
#change 0 to Acer and 1 to Sony (more meaningful labels)
completeResp$brand <- ifelse(completeResp$brand == 1, "Sony", "Acer")

#using ggplot to draw barchart for incomplete data
ggplot(data = incompleteData, mapping = aes(x = brand, fill = brand)) +
  geom_bar(position = "dodge") +
  labs(x="Brand", y="Count", title="Brand Predictions for the incomplete data") +
  geom_text(mapping = aes(label = sprintf("%0.2f%%",..count../sum(..count..)*100)), stat = "count", vjust = -0.5)+
  expand_limits(y=3400)

#using ggplot to draw barchart for complete data
ggplot(data = completeResp, mapping = aes(x = brand, fill = brand)) +
  geom_bar(position = "dodge") +
  labs(x="Brand", y="Count", title="Brand Predictions for the complete data") +
geom_text(mapping = aes(label = sprintf("%0.2f%%",..count../sum(..count..)*100)), stat = "count", vjust = -0.5)+
  expand_limits(y=10000)
