Machine Learning Classification - Predicting Brand Preference

Introduction

In this task, a fictional sales team engaged a market research firm to conduct a survey of existing customers. One of the goals of the survey was to find out which of two brands of computers the customers prefer. Unfortunately, the answer to the brand preference question was not properly captured for all respondents.

Our objective is to investigate whether customer responses to other survey questions (e.g. income and age) enable us to predict the answer to the brand preference question.

There are three data files that we will be working with:

The file labeled CompleteResponses.csv is the data set we will use to train and build our predictive models. It includes close to 10,000 fully answered surveys.
The file labeled SurveyIncomplete.csv will be the main test set. Our optimized model will be used with this data to predict brand preference.
The file Survey Key explains the survey questions and the numeric code for each of the possible answers in the survey.

The training set

The first step is to import the CompleteResponses file and to familiarize ourselves with the data.

# import the dataset with the complete responses
library(readr)
CompleteResponses <- read_csv("~/Documents/Data Science/Data Analytics and Big Data/Predicting Customer Preferences/CompleteResponses.csv", col_types = cols(brand = col_double()))

# view the first few rows to gain an impression of the data values
print.data.frame(head(CompleteResponses))

##      salary age elevel car zipcode    credit brand
## 1 119806.54  45      0  14       4 442037.71     0
## 2 106880.48  63      1  11       6  45007.18     1
## 3  78020.75  23      0  15       2  48795.32     0
## 4  63689.94  51      3   6       5  40888.88     1
## 5  50873.62  20      3  14       4 352951.50     0
## 6 130812.74  56      3  14       3 135943.02     1

The file Survey Key can be used to understand the seven variables in the training set:

salary - respondents enter numeric value
age - respondents enter numeric value
elevel (education level) - respondents choose between the following values: 0 (less than high school degree); 1 (high school degree); 2 (some college); 3 (college degree); or 4 (Master’s, Doctoral or professional degree)
car (primary car’s make) - respondents choose between the following values: 1 (BMW); 2 (Buick); … ; or 20 (none of the above)
zipcode - respondents enter zip code, which is captured as one of 9 regions in the U.S.
credit (credit limit) - respondents enter numeric value
brand (computer brand preference) - respondents choose between the following values: 0 (Acer) or 1 (Sony)

Preprocessing

Now that we better understand the training set, let’s explore the data type of each variable and make the necessary changes (if any).

# check data structure of variables
str(CompleteResponses)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 9898 obs. of  7 variables:
##  $ salary : num  119807 106880 78021 63690 50874 ...
##  $ age    : num  45 63 23 51 20 56 24 62 29 41 ...
##  $ elevel : num  0 1 0 3 3 3 4 3 4 1 ...
##  $ car    : num  14 11 15 6 14 14 8 3 17 5 ...
##  $ zipcode: num  4 6 2 5 4 3 5 0 0 4 ...
##  $ credit : num  442038 45007 48795 40889 352951 ...
##  $ brand  : num  0 1 0 1 0 1 1 1 0 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   salary = col_double(),
##   ..   age = col_double(),
##   ..   elevel = col_double(),
##   ..   car = col_double(),
##   ..   zipcode = col_double(),
##   ..   credit = col_double(),
##   ..   brand = col_double()
##   .. )

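R has read every variable as numeric, but brand, car and zipcode are category codes, and elevel is an ordered scale. Because this is a classification problem and these codes carry no arithmetic meaning, the four attributes need to be converted to factors (an ordered factor in the case of elevel).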

# convert the following attributes to factors
CompleteResponses$brand <- as.factor(CompleteResponses$brand)
CompleteResponses$car <- as.factor(CompleteResponses$car)
CompleteResponses$elevel <- as.ordered(CompleteResponses$elevel)
CompleteResponses$zipcode <- as.factor(CompleteResponses$zipcode)

We should verify that the conversion took place and check for missing values in our dataset.

# check data structure of variables
str(CompleteResponses)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 9898 obs. of  7 variables:
##  $ salary : num  119807 106880 78021 63690 50874 ...
##  $ age    : num  45 63 23 51 20 56 24 62 29 41 ...
##  $ elevel : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 1 2 1 4 4 4 5 4 5 2 ...
##  $ car    : Factor w/ 20 levels "1","2","3","4",..: 14 11 15 6 14 14 8 3 17 5 ...
##  $ zipcode: Factor w/ 9 levels "0","1","2","3",..: 5 7 3 6 5 4 6 1 1 5 ...
##  $ credit : num  442038 45007 48795 40889 352951 ...
##  $ brand  : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 2 2 1 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   salary = col_double(),
##   ..   age = col_double(),
##   ..   elevel = col_double(),
##   ..   car = col_double(),
##   ..   zipcode = col_double(),
##   ..   credit = col_double(),
##   ..   brand = col_double()
##   .. )

# check for missing data
sum(is.na(CompleteResponses))

## [1] 0

There are no missing values and the selected attributes have been converted to factors. We can proceed to the next step.
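
One detail in the str() output deserves a second look: the integers printed after the factor levels are internal level positions, not the original survey codes. The sketch below rebuilds the first six zipcode and elevel values shown earlier and recovers the original codes. The vector names are hypothetical, and the level ranges 0 to 8 and 0 to 4 are an assumption based on the level lists in the output.

```r
# First six zipcode and elevel values, copied from the head() output above
zipcode_raw <- c(4, 6, 2, 5, 4, 3)
elevel_raw  <- c(0, 1, 0, 3, 3, 3)

# Levels are given explicitly because six values do not cover every code
# (assumption: region codes run 0 to 8, education codes 0 to 4)
zipcode <- factor(zipcode_raw, levels = 0:8)
elevel  <- factor(elevel_raw, levels = 0:4, ordered = TRUE)

# Internal codes are positions in the level vector, so code "4" is stored as 5
zipcode_codes <- as.integer(zipcode)

# Convert through character to recover the original survey codes
zipcode_back <- as.numeric(as.character(zipcode))

# Ordered factors support comparisons between levels (3 is above 1 here)
college_above_hs <- elevel[4] > elevel[2]

print(zipcode_codes)
print(zipcode_back)
print(college_above_hs)
```

This is why zipcode appears as 5 7 3 6 5 4 after the conversion even though the raw values were 4 6 2 5 4 3.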

Model development and evaluation

Training and testing sets

Before we can build our models, we need to create training and testing sets. The data will be split 75/25 for training and testing, respectively.

# create training and testing sets
library(caret)
set.seed(998)
inTraining <- createDataPartition(CompleteResponses$brand, p = .75, list = FALSE)
training <- CompleteResponses[inTraining,]
testing <- CompleteResponses[-inTraining,]
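
When the outcome is a factor, createDataPartition samples within each class, so both parts keep roughly the same brand mix. The base R sketch below illustrates that idea on made-up labels. It is not caret's implementation, and the 38/62 class mix and all object names are hypothetical.

```r
set.seed(998)

# Hypothetical toy labels: 38 respondents of class "0" and 62 of class "1"
brand <- factor(sample(rep(c("0", "1"), times = c(38, 62))))

# Draw 75% of the row indices separately within each class
idx_by_class <- split(seq_along(brand), brand)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE)

train_brand <- brand[train_idx]
test_brand  <- brand[-train_idx]

# Class shares stay close to the overall 38/62 mix in both parts
print(round(prop.table(table(train_brand)), 3))
print(round(prop.table(table(test_brand)), 3))
```

Running prop.table(table(training$brand)) and prop.table(table(testing$brand)) on the real split is a quick way to confirm the same property.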

Decision tree model

Now it’s time to model. We will build a model using a decision tree, C5.0, on the training set with 10-fold cross validation and an automatic tuning grid.

# add 10-fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)

# C5.0 model
c50Fit1 <- train(brand~., data = training, method = "C5.0", trControl = fitControl)
c50Fit1

## C5.0 
## 
## 7424 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 6681, 6682, 6681, 6681, 6682, 6681, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE    1      0.8216527  0.6375557
##   rules  FALSE   10      0.9228138  0.8355564
##   rules  FALSE   20      0.9213317  0.8330515
##   rules   TRUE    1      0.8219219  0.6384468
##   rules   TRUE   10      0.9202550  0.8300311
##   rules   TRUE   20      0.9216045  0.8335374
##   tree   FALSE    1      0.8216527  0.6375557
##   tree   FALSE   10      0.9216025  0.8333024
##   tree   FALSE   20      0.9230862  0.8367476
##   tree    TRUE    1      0.8221911  0.6390365
##   tree    TRUE   10      0.9218717  0.8342135
##   tree    TRUE   20      0.9222773  0.8351449
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree and winnow
##  = FALSE.
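
The "Summary of sample sizes" line in the output follows directly from 10-fold cross validation: each fold is held out once and the model is fitted on the other nine. A minimal base R sketch of that bookkeeping is shown below. It uses plain random fold assignment rather than caret's own fold creation, and the variable names are hypothetical.

```r
set.seed(998)

n_train <- 7424   # training rows reported in the output above
k <- 10           # number of folds

# Assign every row to one of k folds of near-equal size, in random order
fold_id <- sample(rep(seq_len(k), length.out = n_train))

# Each fold is held out once; the model is fitted on the remaining rows
holdout_sizes <- as.integer(table(fold_id))
fit_sizes <- n_train - holdout_sizes

print(holdout_sizes)
print(fit_sizes)
```

Because 7424 is not divisible by 10, the held-out folds contain 742 or 743 rows, which leaves 6682 or 6681 rows for fitting, matching the sizes caret reports.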

The model above uses boosted decision trees to predict which class (Acer or Sony) each observation belongs to, which is why brand was chosen as the dependent variable. The best configuration (model = tree, winnow = FALSE, trials = 20) produced the following cross-validated performance metrics:

Accuracy: 0.9231
Kappa: 0.8367
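
For readers less familiar with Kappa, the sketch below computes accuracy and Cohen's Kappa by hand from a small confusion matrix. The counts are invented for illustration and are not results from this analysis.

```r
# Hypothetical confusion matrix: rows are predictions, columns are true classes
conf_mat <- matrix(c(40, 10,
                      5, 45),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(predicted = c("Acer", "Sony"),
                                   actual    = c("Acer", "Sony")))

n <- sum(conf_mat)

# Observed agreement: share of cases on the diagonal (this is the accuracy)
accuracy <- sum(diag(conf_mat)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(conf_mat) * colSums(conf_mat)) / n^2

# Kappa rescales accuracy so that 0 means chance level and 1 means perfect
kappa_value <- (accuracy - expected) / (1 - expected)

print(c(accuracy = accuracy, kappa = kappa_value))
```

With these counts the accuracy is 0.85 and Kappa is 0.70, which shows how Kappa discounts the agreement a model would reach by chance alone.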

It would be interesting to know how the model prioritized each feature during the training process. We can use the varImp function from the caret package to obtain this information.

# use varImp() to assess how the model prioritized each feature during training
varImp(object = c50Fit1)

## C5.0 variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##          Overall
## salary    100.00
## age       100.00
## credit     78.34
## elevel^4   41.50
## zipcode2   40.26
## car5       36.56
## zipcode4   35.72
## car4       27.96
## zipcode1   27.86
## zipcode7   24.93
## zipcode6   24.88
## zipcode5   24.18
## car10      22.70
## car8       18.53
## car16      13.83
## elevel.C   13.16
## car9       12.77
## car3       11.21
## car13      10.59
## car6        8.58

plot(varImp(object = c50Fit1), main = "C5.0 - Variable Importance")

The three most important variables for our C5.0 model are salary and age (tied at 100), followed by credit limit.
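
Names such as zipcode2, car5 and elevel^4 appear because the formula interface expands each factor into several model columns before training. The sketch below reproduces that expansion with base R's model.matrix on made-up rows, which should account for the "out of 34" in the output. The toy data frame is hypothetical and only needs to cover every factor level.

```r
# Hypothetical toy rows that cover every level of each factor
toy <- data.frame(
  salary  = seq(20000, 150000, length.out = 20),
  age     = seq(20, 80, length.out = 20),
  elevel  = factor(rep(0:4, length.out = 20), levels = 0:4, ordered = TRUE),
  car     = factor(1:20),
  zipcode = factor(rep(0:8, length.out = 20), levels = 0:8),
  credit  = seq(0, 500000, length.out = 20)
)

# Expand the predictors the way a formula such as brand ~ . does
design <- model.matrix(~ ., data = toy)
predictor_names <- setdiff(colnames(design), "(Intercept)")

print(length(predictor_names))
print(grep("^elevel", predictor_names, value = TRUE))
```

The count is 3 numeric columns, 19 car indicators, 8 zipcode indicators and 4 polynomial contrasts for the ordered elevel factor (.L, .Q, .C and ^4), so importance is reported per column rather than per survey question.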

Random forest model

Now let’s build a model using random forest, on the same training set with 10-fold cross validation and manually tuning 5 different mtry values. As before, the goal is to classify brand preference.

# random forest model with 5 different mtry values manually tuned
rfGrid <- expand.grid(mtry = c(1, 2, 3, 4, 5))
rfFit1 <- train(brand~., data = training, method = "rf", trControl = fitControl, tuneGrid = rfGrid)
rfFit1

## Random Forest 
## 
## 7424 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 6681, 6682, 6681, 6682, 6681, 6681, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa       
##   1     0.6217672  0.0000000000
##   2     0.6220364  0.0008835975
##   3     0.7280443  0.3498910387
##   4     0.8384951  0.6471758621
##   5     0.8841614  0.7535417371
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.

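Of the five values tried, mtry = 5 yielded the best performance metrics. Accuracy was still rising at the upper edge of the grid, so larger mtry values might improve the model further. The cross-validated metrics for mtry = 5 are:

Accuracy: 0.8842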
Kappa: 0.7535
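
The rows for mtry = 1 and mtry = 2 combine an accuracy near 0.62 with a Kappa near zero. A likely reading is that those forests predict the same class for almost every respondent. The sketch below shows the pattern on made-up labels with an assumed 38/62 class mix, using a degenerate "model" that always predicts the majority class.

```r
# Hypothetical labels: 38 respondents of class "0" and 62 of class "1"
actual <- factor(rep(c("0", "1"), times = c(38, 62)), levels = c("0", "1"))

# A degenerate model that always predicts the majority class
predicted <- factor(rep("1", length(actual)), levels = c("0", "1"))

conf_mat <- table(predicted, actual)
n <- sum(conf_mat)

# Accuracy equals the majority share, but so does the chance agreement
accuracy <- sum(diag(conf_mat)) / n
expected <- sum(rowSums(conf_mat) * colSums(conf_mat)) / n^2
kappa_value <- (accuracy - expected) / (1 - expected)

print(c(accuracy = accuracy, kappa = kappa_value))
```

Accuracy alone makes such a model look passable, while a Kappa of zero shows that it adds nothing beyond guessing the more common brand.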

Once again, we can use the varImp function to understand how the model prioritized each feature during training.

# use varImp() to assess variable importance during training
varImp(object = rfFit1)

## rf variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##           Overall
## salary   100.0000
## age       41.7817
## credit    21.4088
## elevel.C   1.4845
## elevel^4   1.4721
## elevel.L   1.4484
## elevel.Q   0.9675
## zipcode1   0.5160
## zipcode5   0.4671
## zipcode4   0.4604
## zipcode7   0.4303
## zipcode3   0.4284
## zipcode2   0.4039
## zipcode6   0.3734
## zipcode8   0.2591
## car12      0.2151
## car15      0.1888
## car10      0.1775
## car7       0.1639
## car17      0.1510

plot(varImp(object = rfFit1), main = "Random Forest - Variable Importance")

The most important variable for our Random Forest model is salary, followed by age and credit limit.

Predict brand preference

We built and trained two models using the survey with complete responses. Judged by cross-validated accuracy (about 0.92 versus 0.88), the model that performed best was our C5.0 decision tree classifier.
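
The 25% testing split created earlier offers an independent check on this comparison. The sketch below defines a small helper, holdout_accuracy (a hypothetical name), demonstrates it on made-up labels, and then applies it to both fitted models only if the objects from this analysis are present in the workspace. No results are shown because that step was not part of the original run.

```r
# Share of predicted labels that match the true labels
holdout_accuracy <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  mean(as.character(predicted) == as.character(actual))
}

# Toy check with made-up labels: 4 of 5 predictions are correct
toy_actual    <- c("0", "1", "1", "0", "1")
toy_predicted <- c("0", "1", "0", "0", "1")
toy_accuracy  <- holdout_accuracy(toy_predicted, toy_actual)
print(toy_accuracy)

# Guarded so the sketch also runs outside this analysis
if (exists("c50Fit1") && exists("rfFit1") && exists("testing")) {
  c50_holdout <- holdout_accuracy(predict(c50Fit1, newdata = testing), testing$brand)
  rf_holdout  <- holdout_accuracy(predict(rfFit1, newdata = testing), testing$brand)
  print(c(C5.0 = c50_holdout, rf = rf_holdout))
}
```

caret's postResample(pred, obs) reports accuracy and Kappa together and could replace the helper; the hand-written version is used here only to keep the sketch free of package dependencies.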

The next step is to apply our model to the test set, the survey with incomplete responses.

# import the dataset with incomplete responses
SurveyIncomplete <- read_csv("~/Documents/Data Science/Data Analytics and Big Data/Predicting Customer Preferences/SurveyIncomplete.csv")

## Parsed with column specification:
## cols(
##   salary = col_double(),
##   age = col_double(),
##   elevel = col_double(),
##   car = col_double(),
##   zipcode = col_double(),
##   credit = col_double(),
##   brand = col_double()
## )

# inspect the first few rows
print.data.frame(head(SurveyIncomplete))

##      salary age elevel car zipcode   credit brand
## 1 150000.00  76      1   3       3 377980.1     1
## 2  82523.84  51      1   8       3 141657.6     0
## 3 115646.64  34      0  10       2 360980.4     1
## 4 141443.39  22      3  18       2 282736.3     1
## 5 149211.27  56      0   5       3 215667.3     1
## 6  46202.25  26      4  12       1 150419.4     1

Notice the similarity between the training and test sets? Both the complete and incomplete surveys were conducted using the same questions and possible answers. However, we modified some attributes back in the preprocessing step. These changes must also be applied to our test set so that its columns have the same types and factor levels the model was trained on.

# convert the following attributes to factors
SurveyIncomplete$brand <- as.factor(SurveyIncomplete$brand)
SurveyIncomplete$car <- as.factor(SurveyIncomplete$car)
SurveyIncomplete$elevel <- as.ordered(SurveyIncomplete$elevel)
SurveyIncomplete$zipcode <- as.factor(SurveyIncomplete$zipcode)

# verify data structure of variables
str(SurveyIncomplete)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 5000 obs. of  7 variables:
##  $ salary : num  150000 82524 115647 141443 149211 ...
##  $ age    : num  76 51 34 22 56 26 64 50 26 46 ...
##  $ elevel : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 2 2 1 4 1 5 4 4 3 4 ...
##  $ car    : Factor w/ 20 levels "1","2","3","4",..: 3 8 10 18 5 12 1 9 3 18 ...
##  $ zipcode: Factor w/ 9 levels "0","1","2","3",..: 4 4 3 3 4 2 3 1 5 7 ...
##  $ credit : num  377980 141658 360980 282736 215667 ...
##  $ brand  : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   salary = col_double(),
##   ..   age = col_double(),
##   ..   elevel = col_double(),
##   ..   car = col_double(),
##   ..   zipcode = col_double(),
##   ..   credit = col_double(),
##   ..   brand = col_double()
##   .. )

At last, we can use the predict function and our best model to make predictions about brand preference!

# predict brand preference
PredictionsBrand <- predict(c50Fit1, SurveyIncomplete)

With the summary function and our prediction object, we can finally learn how many respondents are predicted to prefer each brand (0 is Acer, 1 is Sony).

# computer brand preference
summary(PredictionsBrand)

##    0    1 
## 1938 3062
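
To make the counts easier to read, the sketch below attaches the brand names from the survey key and converts the counts to shares. The counts are copied from the output above; the object names are hypothetical.

```r
# Predicted counts copied from the summary() output above
predicted_counts <- c("0" = 1938, "1" = 3062)

# Brand codes from the survey key: 0 is Acer, 1 is Sony
brand_labels <- c("0" = "Acer", "1" = "Sony")
names(predicted_counts) <- unname(brand_labels[names(predicted_counts)])

# Share of the 5,000 incomplete surveys predicted for each brand
predicted_share <- predicted_counts / sum(predicted_counts)
print(round(predicted_share, 4))
```

Roughly 61% of the incomplete surveys are predicted as Sony and 39% as Acer.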

Conclusion

Our model predicts that most of the 5,000 respondents whose brand preference was not properly captured prefer a Sony computer (3,062 versus 1,938 for Acer). If we add the fully answered surveys and the predictions together, Sony comes out on top as well. With this information, the fictional sales team can begin to work on their product selection strategy.