The summary statistics below show that the data contains no missing values and no obvious outliers.
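This report relies on caret for modelling, ggplot2 for plotting and knitr for tables; a minimal setup chunk loading them is sketched below (the original presumably did this in an earlier chunk).
#load the packages used throughout this report
library(caret)    #createDataPartition(), train(), postResample(), varImp()
library(ggplot2)  #plots
library(knitr)    #kable() tables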
completeResp <- read.csv("CompleteResponses.csv")
summary(completeResp)
## salary age elevel car
## Min. : 20000 Min. :20.00 Min. :0.000 Min. : 1.00
## 1st Qu.: 52082 1st Qu.:35.00 1st Qu.:1.000 1st Qu.: 6.00
## Median : 84950 Median :50.00 Median :2.000 Median :11.00
## Mean : 84871 Mean :49.78 Mean :1.983 Mean :10.52
## 3rd Qu.:117162 3rd Qu.:65.00 3rd Qu.:3.000 3rd Qu.:15.75
## Max. :150000 Max. :80.00 Max. :4.000 Max. :20.00
## zipcode credit brand
## Min. :0.000 Min. : 0 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:120807 1st Qu.:0.0000
## Median :4.000 Median :250607 Median :1.0000
## Mean :4.041 Mean :249176 Mean :0.6217
## 3rd Qu.:6.000 3rd Qu.:374640 3rd Qu.:1.0000
## Max. :8.000 Max. :500000 Max. :1.0000
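summary() would also reveal NAs if any existed; as an extra sanity check we can confirm this programmatically with base R (a minimal sketch, not part of the original output):
#TRUE if any value in the data frame is missing
anyNA(completeResp)
#count of missing values per column
colSums(is.na(completeResp))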
To choose the features for the model, a correlation matrix was created to examine the relationships between all of the features. This was done before converting four of the features to factors, since cor() only works on numeric data. The output is shown below. The only correlation in the data is a weak one between salary and brand (0.21), which is why our focus was on salary to predict the brand.
cor(completeResp)
## salary age elevel car zipcode
## salary 1.000000000 0.007978566 -6.620234e-03 -6.090575e-03 -0.005471132
## age 0.007978566 1.000000000 -5.830340e-03 1.024607e-02 0.003681375
## elevel -0.006620234 -0.005830340 1.000000e+00 -4.676852e-05 0.018095400
## car -0.006090575 0.010246067 -4.676852e-05 1.000000e+00 0.001526528
## zipcode -0.005471132 0.003681375 1.809540e-02 1.526528e-03 1.000000000
## credit -0.025126808 -0.004400692 2.720642e-03 -1.032914e-02 0.004962011
## brand 0.206489883 0.013713286 -4.828912e-03 5.923147e-03 0.004665088
## credit brand
## salary -0.025126808 0.206489883
## age -0.004400692 0.013713286
## elevel 0.002720642 -0.004828912
## car -0.010329137 0.005923147
## zipcode 0.004962011 0.004665088
## credit 1.000000000 0.005688438
## brand 0.005688438 1.000000000
We can see that only salary has some positive correlation with brand, so our focus will be on this feature.
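For a graphical view of the same matrix, a heatmap-style plot can help; the sketch below assumes the corrplot package, which is not used elsewhere in this report.
#visualize the correlation matrix (install.packages("corrplot") if needed)
library(corrplot)
corrplot(cor(completeResp), method = "circle")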
From the surveykey.xlsx file, we know that some of our features are categorical and should be converted to factors rather than treated as numbers.
str(completeResp)
## 'data.frame': 9898 obs. of 7 variables:
## $ salary : num 119807 106880 78021 63690 50874 ...
## $ age : int 45 63 23 51 20 56 24 62 29 41 ...
## $ elevel : int 0 1 0 3 3 3 4 3 4 1 ...
## $ car : int 14 11 15 6 14 14 8 3 17 5 ...
## $ zipcode: int 4 6 2 5 4 3 5 0 0 4 ...
## $ credit : num 442038 45007 48795 40889 352951 ...
## $ brand : int 0 1 0 1 0 1 1 1 0 1 ...
#Preprocessing: convert elevel, car, zipcode and brand (columns 3-5 and 7) to factors
completeResp[,c(3:5,7)] <- lapply(completeResp[,c(3:5,7)] , factor)
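A quick re-check confirms the conversion (sketch):
#the four columns should now be factors
str(completeResp[, c(3:5, 7)])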
Now we need to split the data into training and testing sets to build and assess our models.
#set the seed so the random split is reproducible
set.seed(998)
#define a 75%/25% train/test split of the dataset, stratified by brand
inTraining <- createDataPartition(completeResp$brand, p = .75, list = FALSE)
training <- completeResp[inTraining,]
testing <- completeResp[-inTraining,]
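Because createDataPartition samples within each class, the brand distribution should be preserved on both sides of the split; a quick check (sketch):
#class proportions should be nearly identical in both sets
prop.table(table(training$brand))
prop.table(table(testing$brand))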
C5.0 can produce a classifier expressed either as a decision tree or as a ruleset. In many applications rulesets are preferred because they are simpler and easier to understand than decision trees.
#10-fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
#set the seed so the resampling is reproducible
set.seed(998)
#train the C5.0 classification model
# tuneGrid lets us specify exactly which values each tuning parameter takes,
# while tuneLength only limits how many default values caret tries per parameter.
C5Fit <- train(brand~., data = training, method = "C5.0", trControl=fitControl, tuneLength = 2)
#check the results
C5Fit
## C5.0
##
## 7424 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 6681, 6681, 6682, 6681, 6683, 6682, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.8254401 0.6510231
## rules FALSE 10 0.9210663 0.8314165
## rules TRUE 1 0.8249015 0.6499032
## rules TRUE 10 0.9206607 0.8306052
## tree FALSE 1 0.8254401 0.6510231
## tree FALSE 10 0.9225519 0.8357656
## tree TRUE 1 0.8247668 0.6496340
## tree TRUE 10 0.9225484 0.8356501
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 10, model = tree
## and winnow = FALSE.
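Since the winning configuration was a tree, the fitted model itself can be inspected; summary() on the underlying C50 object prints the tree (or the ruleset, had model = "rules" won). A sketch:
#print the final tree learned by C5.0, plus attribute usage statistics
summary(C5Fit$finalModel)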
#How the model prioritized each feature in the training
varImp(C5Fit)
## C5.0 variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## salary 100.0
## age 85.1
## car6 0.0
## car18 0.0
## zipcode3 0.0
## car20 0.0
## car14 0.0
## elevel2 0.0
## car19 0.0
## car15 0.0
## zipcode4 0.0
## car16 0.0
## car2 0.0
## zipcode7 0.0
## credit 0.0
## car13 0.0
## zipcode2 0.0
## zipcode1 0.0
## car11 0.0
## car10 0.0
Random forests (or random decision forests) are an ensemble learning method for classification, regression and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the individual trees' classes (classification) or their mean prediction (regression). The randomness built into the algorithm makes it resistant to overfitting.
#10-fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
#data frame for manual tuning of mtry (the number of features considered at each split)
rfGrid <- expand.grid(mtry=c(3,6,10,12,18))
#set the seed so the resampling is reproducible
set.seed(998)
#train the random forest classification model
rfFit <- train(brand~., data = training, method = "rf", trControl=fitControl, tuneGrid=rfGrid)
#training results
rfFit
## Random Forest
##
## 7424 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 6681, 6681, 6682, 6681, 6683, 6682, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 3 0.7343772 0.3613482
## 6 0.9101531 0.8098533
## 10 0.9214697 0.8336403
## 12 0.9220090 0.8347370
## 18 0.9233549 0.8373730
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 18.
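caret can also plot the tuning curve, i.e. cross-validated accuracy against mtry (sketch):
#accuracy as a function of mtry
plot(rfFit)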
#How the model prioritized each feature in the training
varImp(rfFit)
## rf variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## salary 100.0000
## age 67.8408
## credit 9.8106
## elevel3 0.7260
## elevel4 0.6664
## elevel1 0.6498
## elevel2 0.5902
## zipcode7 0.5675
## zipcode6 0.4471
## zipcode1 0.4399
## zipcode2 0.4340
## zipcode4 0.3589
## zipcode3 0.3501
## zipcode5 0.3496
## car15 0.3373
## zipcode8 0.3005
## car12 0.2988
## car7 0.2619
## car17 0.2453
## car5 0.2299
Gradient boosting is a machine learning technique that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. Like other boosting methods it builds the model in a stage-wise fashion, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
#10-fold cross validation
fitControlGBM <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
#grid to define our own parameters for this classifier
gbmGrid <- expand.grid(interaction.depth = c(3,5),
                       n.trees = 250,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)
#set the seed so the resampling is reproducible
set.seed(998)
#train the GBM classification model
gbmFit <- train(brand~., data = training,
                method = "gbm",
                trControl = fitControlGBM,
                verbose = FALSE,
                #specifying the exact models to evaluate:
                tuneGrid = gbmGrid)
#training results
gbmFit
## Stochastic Gradient Boosting
##
## 7424 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 6681, 6681, 6682, 6681, 6683, 6682, ...
## Resampling results across tuning parameters:
##
## interaction.depth Accuracy Kappa
## 3 0.9245689 0.8404536
## 5 0.9232230 0.8373474
##
## Tuning parameter 'n.trees' was held constant at a value of 250
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 20
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 250,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 20.
#How the model prioritized each feature in the training
plot(varImp(gbmFit))
Below are the best cross-validation metrics for each classifier used in this task:
Classifier <- c('Random Forest','C5.0','Gradient Boosting Trees')
Accuracy <- c(0.923, 0.923, 0.926)
Kappa <- c(0.837, 0.836, 0.844)
metrics <- data.frame(Classifier, Accuracy, Kappa)
kable(metrics)
| Classifier | Accuracy | Kappa |
|---|---|---|
| Random Forest | 0.923 | 0.837 |
| C5.0 | 0.923 | 0.836 |
| Gradient Boosting Trees | 0.926 | 0.844 |
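Because all three models were trained with the same seed and the same trainControl, they share the same cross-validation folds, so caret's resamples() can compare them fold by fold rather than just on the headline numbers (a sketch, not part of the original analysis):
#fold-by-fold comparison of the three models
results <- resamples(list(C5.0 = C5Fit, RF = rfFit, GBM = gbmFit))
summary(results)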
I picked the most accurate model after optimization, which was the gradient boosting classifier. My confidence in it is high: three different algorithms reached very similar accuracy, which suggests we are not over-fitting (this will be verified with the postResample() method below).
Now that we have our model, we will apply it to the test set to see how accurately it predicts.
pred <- predict(gbmFit, newdata = testing)
summary(pred)
## 0 1
## 954 1520
postResample(pred, testing$brand)
## Accuracy Kappa
## 0.9264349 0.8441964
After running postResample() on the predictions and the test set, we get accuracy and kappa values almost identical to those achieved during the model optimization process.
This is good news: it means our model generalizes well and isn't over-fitted. And since our data has no outliers, it's safe to use this model to predict the missing values in our survey.
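A confusion matrix breaks those test-set errors down by class before we look at where they fall (sketch, using caret's confusionMatrix()):
#per-class breakdown of the test-set predictions
confusionMatrix(pred, testing$brand)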
Even at 92.6% accuracy the model still makes quite a few errors. The plot below shows their distribution. Most errors occurred near the decision boundaries, i.e. in the regions where the model couldn't clearly decide which brand to predict.
#adding predictions to the test set
testing$pred <- pred
#plotting the distribution of errors compared to the rest of the data
ggplot(data=testing) +
geom_point(mapping =aes(x=salary, y=age, color=(pred==1))) +
facet_grid(brand ~ . ) +
guides(color=FALSE)
incompleteData <- read.csv("SurveyIncomplete.csv")
#preprocess
incompleteData[,c(3:5,7)] <- lapply(incompleteData[,c(3:5,7)] , factor)
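A common pitfall when predicting on new data is factor levels that were never seen in training; a quick defensive check (a sketch, not part of the original analysis):
#every factor level in the new data should already exist in the training data
all(levels(incompleteData$elevel) %in% levels(training$elevel))
all(levels(incompleteData$car) %in% levels(training$car))
all(levels(incompleteData$zipcode) %in% levels(training$zipcode))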
Finally, we can predict the missing answers in the survey using the most optimized model.
#predict incomplete data
incompleteData$brand <- predict(gbmFit, newdata = incompleteData)
summary(incompleteData$brand)
## 0 1
## 1925 3075
As seen in the summary, the model predicts that customers would pick Sony over Acer 61.5% of the time (3075 of the 5000 predictions).
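That share can be computed directly from the prediction counts (sketch):
#proportion of each predicted brand: 3075 / 5000 = 0.615 for Sony
prop.table(table(incompleteData$brand))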
#combining both data frames
completeResp <-do.call(rbind, list(incompleteData, completeResp))
summary(completeResp)
## salary age elevel car zipcode
## Min. : 20000 Min. :20.00 0:3033 18 : 786 8 :1698
## 1st Qu.: 52139 1st Qu.:35.00 1:2941 15 : 781 6 :1697
## Median : 85220 Median :50.00 2:3003 8 : 774 5 :1676
## Mean : 85102 Mean :49.81 3:2948 19 : 765 4 :1672
## 3rd Qu.:117641 3rd Qu.:65.00 4:2973 5 : 760 2 :1657
## Max. :150000 Max. :80.00 17 : 758 7 :1654
## (Other):10274 (Other):4844
## credit brand
## Min. : 0 0:5669
## 1st Qu.:121173 1:9229
## Median :250689
## Mean :249288
## 3rd Qu.:374897
## Max. :500000
##
#relabel brand: 0 becomes Acer and 1 becomes Sony (more meaningful)
completeResp$brand <- ifelse(completeResp$brand == 1, "Sony", "Acer")
#apply the same relabelling to incompleteData so the first chart uses the same labels
incompleteData$brand <- ifelse(incompleteData$brand == 1, "Sony", "Acer")
#using ggplot to draw barchart for incomplete data
ggplot(data = incompleteData, mapping = aes(x = brand, fill = brand)) +
geom_bar(position = "dodge") +
labs(x="Brand", y="Count", title="Brand Predictions for the incomplete data") +
geom_text(mapping = aes(label = sprintf("%0.2f%%",..count../sum(..count..)*100)), stat = "count", vjust = -0.5)+
expand_limits(y=3400)
#using ggplot to draw barchart for complete data
ggplot(data = completeResp, mapping = aes(x = brand, fill = brand)) +
geom_bar(position = "dodge") +
labs(x="Brand", y="Count", title="Brand Predictions for the complete data") +
geom_text(mapping = aes(label = sprintf("%0.2f",..count../sum(..count..)*100)), stat = "count", vjust = -0.5)+
expand_limits(y=10000)