In this task, a fictional sales team engaged a market research firm to conduct a survey of existing customers. One of the goals of the survey was to find out which of two brands of computers the customers prefer. Unfortunately, the answer to the brand preference question was not properly captured for all respondents.
Our objective is to investigate whether customer responses to other survey questions (e.g. income, age, etc.) enable us to predict the answer to the brand preference question.
There are three data files that we will be working with:
The file labeled CompleteResponses.csv is the data set we will use to train and build our predictive models. It includes 9,898 fully answered surveys.
The file labeled SurveyIncomplete.csv will be the main test set. Our optimized model will be used with this data to predict brand preference.
The file Survey Key explains the survey questions and the numeric code for each of the possible answers in the survey.
The first step is to import the CompleteResponses file and to familiarize ourselves with the data.
# import the dataset with the complete responses
library(readr)
CompleteResponses <- read_csv("~/Documents/Data Science/Data Analytics and Big Data/Predicting Customer Preferences/CompleteResponses.csv", col_types = cols(brand = col_double()))
# view the first few rows to gain an impression of the data values
print.data.frame(head(CompleteResponses))
## salary age elevel car zipcode credit brand
## 1 119806.54 45 0 14 4 442037.71 0
## 2 106880.48 63 1 11 6 45007.18 1
## 3 78020.75 23 0 15 2 48795.32 0
## 4 63689.94 51 3 6 5 40888.88 1
## 5 50873.62 20 3 14 4 352951.50 0
## 6 130812.74 56 3 14 3 135943.02 1
The file Survey Key can be used to understand the seven variables in the training set.
Now that we better understand the training set, let’s explore the data type of each variable and make the necessary changes (if any).
# check data structure of variables
str(CompleteResponses)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 9898 obs. of 7 variables:
## $ salary : num 119807 106880 78021 63690 50874 ...
## $ age : num 45 63 23 51 20 56 24 62 29 41 ...
## $ elevel : num 0 1 0 3 3 3 4 3 4 1 ...
## $ car : num 14 11 15 6 14 14 8 3 17 5 ...
## $ zipcode: num 4 6 2 5 4 3 5 0 0 4 ...
## $ credit : num 442038 45007 48795 40889 352951 ...
## $ brand : num 0 1 0 1 0 1 1 1 0 1 ...
## - attr(*, "spec")=
## .. cols(
## .. salary = col_double(),
## .. age = col_double(),
## .. elevel = col_double(),
## .. car = col_double(),
## .. zipcode = col_double(),
## .. credit = col_double(),
## .. brand = col_double()
## .. )
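R is treating all the variables as numerical, but is that really the case? Brand is the class we want to predict, and elevel, car and zipcode hold numeric codes for categories rather than quantities, so these attributes need to be converted to factors. Education level has a natural order, so it becomes an ordered factor.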
# convert the following attributes to factors
CompleteResponses$brand <- as.factor(CompleteResponses$brand)
CompleteResponses$car <- as.factor(CompleteResponses$car)
CompleteResponses$elevel <- as.ordered(CompleteResponses$elevel)
CompleteResponses$zipcode <- as.factor(CompleteResponses$zipcode)
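To see what the conversion changes, here is a minimal sketch on a toy vector of codes (the values are hypothetical, not taken from the survey). It contrasts as.factor, which creates unordered categories, with as.ordered, which also records a ranking between levels.

```r
# Toy codes standing in for survey answers (hypothetical values)
codes <- c(0, 3, 1, 4, 2, 3)

nominal <- as.factor(codes)   # unordered categories
ordinal <- as.ordered(codes)  # ordered categories: 0 < 1 < 2 < 3 < 4

levels(nominal)               # the distinct codes, stored as text labels
is.ordered(nominal)           # FALSE
is.ordered(ordinal)           # TRUE

# Comparisons between values are only meaningful for the ordered factor
ordinal[2] > ordinal[3]       # TRUE, because level "3" ranks above level "1"
```

This is why elevel is converted with as.ordered while car and zipcode, whose codes carry no ranking, use as.factor.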
We should verify that the conversion took place and check for any missing values in our dataset.
# re-check the data structure after the conversion
str(CompleteResponses)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 9898 obs. of 7 variables:
## $ salary : num 119807 106880 78021 63690 50874 ...
## $ age : num 45 63 23 51 20 56 24 62 29 41 ...
## $ elevel : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 1 2 1 4 4 4 5 4 5 2 ...
## $ car : Factor w/ 20 levels "1","2","3","4",..: 14 11 15 6 14 14 8 3 17 5 ...
## $ zipcode: Factor w/ 9 levels "0","1","2","3",..: 5 7 3 6 5 4 6 1 1 5 ...
## $ credit : num 442038 45007 48795 40889 352951 ...
## $ brand : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 2 2 1 2 ...
## - attr(*, "spec")=
## .. cols(
## .. salary = col_double(),
## .. age = col_double(),
## .. elevel = col_double(),
## .. car = col_double(),
## .. zipcode = col_double(),
## .. credit = col_double(),
## .. brand = col_double()
## .. )
# check for missing data
sum(is.na(CompleteResponses))
## [1] 0
There are no missing values and the selected attributes have been converted to factors. We can proceed to the next step.
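A single total is enough when the answer is zero, but when gaps do exist it helps to know where they are. The sketch below uses a small made-up data frame (the toy object and its values are assumptions for illustration) to show per-column and per-row checks in base R.

```r
# Hypothetical toy survey with two gaps, for illustration only
toy <- data.frame(
  salary = c(50000, NA, 72000),
  age    = c(34, 51, NA),
  brand  = factor(c(0, 1, 1))
)

total_missing   <- sum(is.na(toy))            # missing cells overall
missing_per_col <- colSums(is.na(toy))        # missing cells in each column
incomplete_rows <- sum(!complete.cases(toy))  # rows with at least one gap

total_missing     # 2
missing_per_col   # salary 1, age 1, brand 0
incomplete_rows   # 2
```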
Before we can build our models, we need to create training and testing sets. The data will be split 75/25 for training and testing, respectively.
# create training and testing sets
library(caret)
set.seed(998)
inTraining <- createDataPartition(CompleteResponses$brand, p = .75, list = FALSE)
training <- CompleteResponses[inTraining,]
testing <- CompleteResponses[-inTraining,]
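When the outcome is a factor, createDataPartition samples within each class, so the training and testing sets keep roughly the same brand balance. The following is a base R stand-in for that idea, not caret's own implementation; the simulated labels and the 38/62 class balance are assumptions chosen for illustration.

```r
set.seed(998)

# Hypothetical class labels with roughly a 38/62 split
brand <- factor(sample(c(0, 1), size = 1000, replace = TRUE, prob = c(0.38, 0.62)))

# Sample 75% of the row indices within each class, then combine them
idx_by_class <- split(seq_along(brand), brand)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}), use.names = FALSE))

train_brand <- brand[in_train]
test_brand  <- brand[-in_train]

# Class shares should be nearly identical in both subsets
train_share <- prop.table(table(train_brand))
test_share  <- prop.table(table(test_brand))
round(train_share, 3)
round(test_share, 3)
```

Keeping the class shares aligned matters here because accuracy on the held-out rows is only comparable to the cross-validated accuracy if both are measured on similarly balanced data.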
Now it’s time to model. We will build a model using a decision tree, C5.0, on the training set with 10-fold cross validation and an automatic tuning grid.
# add 10-fold cross validation
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
# C5.0 model
c50Fit1 <- train(brand~., data = training, method = "C5.0", trControl = fitControl)
c50Fit1
## C5.0
##
## 7424 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 6681, 6682, 6681, 6681, 6682, 6681, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.8216527 0.6375557
## rules FALSE 10 0.9228138 0.8355564
## rules FALSE 20 0.9213317 0.8330515
## rules TRUE 1 0.8219219 0.6384468
## rules TRUE 10 0.9202550 0.8300311
## rules TRUE 20 0.9216045 0.8335374
## tree FALSE 1 0.8216527 0.6375557
## tree FALSE 10 0.9216025 0.8333024
## tree FALSE 20 0.9230862 0.8367476
## tree TRUE 1 0.8221911 0.6390365
## tree TRUE 10 0.9218717 0.8342135
## tree TRUE 20 0.9222773 0.8351449
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree and winnow
## = FALSE.
For the above model, we used decision trees to predict which class (Acer or Sony) a given observation belongs to, which is why brand was chosen as the dependent variable. Per the model’s output, the selected configuration (model = tree, winnow = FALSE, trials = 20) reached a cross-validated accuracy of 0.923 and a kappa of 0.837.
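The resampling table in the output lists twelve candidate settings, one for every combination of model, winnow and trials, and caret keeps the one with the highest accuracy. As a sketch of that selection step, we can rebuild the grid with expand.grid and pick the best row ourselves. The accuracy values are copied from the output; the object names grid and best are ours.

```r
# Every combination of the three C5.0 tuning parameters shown in the output
grid <- expand.grid(
  trials = c(1, 10, 20),
  winnow = c(FALSE, TRUE),
  model  = c("rules", "tree"),
  stringsAsFactors = FALSE
)

# Cross-validated accuracy copied from the resampling table, in the same row order
grid$Accuracy <- c(
  0.8216527, 0.9228138, 0.9213317,  # rules, winnow = FALSE
  0.8219219, 0.9202550, 0.9216045,  # rules, winnow = TRUE
  0.8216527, 0.9216025, 0.9230862,  # tree,  winnow = FALSE
  0.8221911, 0.9218717, 0.9222773   # tree,  winnow = TRUE
)

nrow(grid)                                # 12 candidate settings
best <- grid[which.max(grid$Accuracy), ]  # row with the largest accuracy
best                                      # tree, winnow = FALSE, trials = 20
```

Note how small the margin is: several settings with 10 or 20 trials land within a fraction of a percentage point of each other, while a single trial is about ten points behind.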
It would be interesting to know how the model prioritized each feature during the training process. We can use the varImp function from the caret package to obtain this information.
# use varImp() to assess how the model prioritized each feature during training
varImp(object = c50Fit1)
## C5.0 variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## salary 100.00
## age 100.00
## credit 78.34
## elevel^4 41.50
## zipcode2 40.26
## car5 36.56
## zipcode4 35.72
## car4 27.96
## zipcode1 27.86
## zipcode7 24.93
## zipcode6 24.88
## zipcode5 24.18
## car10 22.70
## car8 18.53
## car16 13.83
## elevel.C 13.16
## car9 12.77
## car3 11.21
## car13 10.59
## car6 8.58
plot(varImp(object = c50Fit1), main = "C5.0 - Variable Importance")
The three most important variables for our C5.0 model are age, salary and credit limit.
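The importance table reports 34 variables even though the training set has only six predictors. That appears to be because the formula interface expands each factor into indicator columns (car5, zipcode2, ...) and each ordered factor into polynomial contrasts (elevel.L, elevel.Q, elevel.C, elevel^4). The sketch below reproduces the count with base R's model.matrix on a made-up data frame that has the same column types and level counts as the survey; the values themselves are hypothetical.

```r
# Hypothetical toy data with the same column types and level counts as the survey
n <- 180
toy <- data.frame(
  salary  = seq(20000, 150000, length.out = n),
  age     = rep(20:79, length.out = n),
  elevel  = factor(rep(0:4, length.out = n), ordered = TRUE),
  car     = factor(rep(1:20, length.out = n)),
  zipcode = factor(rep(0:8, length.out = n)),
  credit  = seq(0, 500000, length.out = n)
)

# Expand the predictors the way a formula does: dummies and polynomial contrasts
X <- model.matrix(~ ., data = toy)
predictors <- setdiff(colnames(X), "(Intercept)")

length(predictors)   # 34 = 3 numeric + 4 elevel + 19 car + 8 zipcode
elevel_cols <- grep("^elevel", predictors, value = TRUE)
elevel_cols          # "elevel.L" "elevel.Q" "elevel.C" "elevel^4"
```

The first level of each unordered factor (car 1, zipcode 0) serves as the reference and gets no column of its own, which is why those names never show up in the importance ranking.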
Now let’s build a random forest model on the same training set, with 10-fold cross-validation and a manually specified grid of five mtry values. As before, the goal is to classify brand preference.
# random forest model with 5 different mtry values manually tuned
rfGrid <- expand.grid(mtry = c(1, 2, 3, 4, 5))
rfFit1 <- train(brand~., data = training, method = "rf", trControl = fitControl, tuneGrid = rfGrid)
rfFit1
## Random Forest
##
## 7424 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 6681, 6682, 6681, 6682, 6681, 6681, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.6217672 0.0000000000
## 2 0.6220364 0.0008835975
## 3 0.7280443 0.3498910387
## 4 0.8384951 0.6471758621
## 5 0.8841614 0.7535417371
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
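A value of mtry = 5 yielded the best performance metrics: a cross-validated accuracy of 0.884 and a kappa of 0.754. Accuracy was still rising at the largest value in the grid, so trying larger mtry values might improve the model further.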
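The mtry = 1 row is a good reminder of why kappa is reported next to accuracy: an accuracy of about 0.62 with a kappa of 0 is what we would expect from a model that assigns every respondent to the more common class. Here is a small worked example in base R. The cohen_kappa helper and the toy labels are ours, written for illustration, and are not part of caret.

```r
# Cohen's kappa from observed and predicted class labels
cohen_kappa <- function(observed, predicted) {
  tab <- table(observed, predicted)
  n <- sum(tab)
  p_observed <- sum(diag(tab)) / n                      # plain accuracy
  p_expected <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (p_observed - p_expected) / (1 - p_expected)
}

# Hypothetical labels: 62 of 100 respondents belong to class 1
observed  <- factor(rep(c(0, 1), times = c(38, 62)), levels = c(0, 1))
# A model that always predicts the majority class
predicted <- factor(rep(1, 100), levels = c(0, 1))

accuracy <- mean(observed == predicted)
kappa    <- cohen_kappa(observed, predicted)

accuracy   # 0.62
kappa      # 0
```

Accuracy alone makes this model look mediocre rather than useless; kappa subtracts the agreement that the class balance hands out for free.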
Once again, we can use the varImp function to understand how the model prioritized each feature during training.
# use varImp() to assess variable importance during training
varImp(object = rfFit1)
## rf variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## salary 100.0000
## age 41.7817
## credit 21.4088
## elevel.C 1.4845
## elevel^4 1.4721
## elevel.L 1.4484
## elevel.Q 0.9675
## zipcode1 0.5160
## zipcode5 0.4671
## zipcode4 0.4604
## zipcode7 0.4303
## zipcode3 0.4284
## zipcode2 0.4039
## zipcode6 0.3734
## zipcode8 0.2591
## car12 0.2151
## car15 0.1888
## car10 0.1775
## car7 0.1639
## car17 0.1510
plot(varImp(object = rfFit1), main = "Random Forest - Variable Importance")
The most important variable for our Random Forest model is salary, followed by age and credit limit.
We built and trained two models using the survey with complete responses. Judged by cross-validated accuracy, the predictive model that performed best was our decision tree classifier, C5.0 (0.923, compared with 0.884 for the random forest).
The next step is to apply our model to the test set, the survey with incomplete responses.
# import the dataset with incomplete responses
SurveyIncomplete <- read_csv("~/Documents/Data Science/Data Analytics and Big Data/Predicting Customer Preferences/SurveyIncomplete.csv")
## Parsed with column specification:
## cols(
## salary = col_double(),
## age = col_double(),
## elevel = col_double(),
## car = col_double(),
## zipcode = col_double(),
## credit = col_double(),
## brand = col_double()
## )
# inspect the first few rows
print.data.frame(head(SurveyIncomplete))
## salary age elevel car zipcode credit brand
## 1 150000.00 76 1 3 3 377980.1 1
## 2 82523.84 51 1 8 3 141657.6 0
## 3 115646.64 34 0 10 2 360980.4 1
## 4 141443.39 22 3 18 2 282736.3 1
## 5 149211.27 56 0 5 3 215667.3 1
## 6 46202.25 26 4 12 1 150419.4 1
Notice the similarity between the training and test sets? Both the complete and incomplete surveys were conducted using the same questions and possible answers. However, we modified some attributes back in the preprocessing step. The same changes must be applied to our test set for the model to function properly.
# convert the following attributes to factors
SurveyIncomplete$brand <- as.factor(SurveyIncomplete$brand)
SurveyIncomplete$car <- as.factor(SurveyIncomplete$car)
SurveyIncomplete$elevel <- as.ordered(SurveyIncomplete$elevel)
SurveyIncomplete$zipcode <- as.factor(SurveyIncomplete$zipcode)
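Calling as.factor on the new data works here because, as the structure check shows, the incomplete survey contains every car, elevel and zipcode code. That is not guaranteed in general: if a code were absent from the new file, its factor would end up with fewer levels than the model saw during training. A more defensive pattern, sketched below on made-up vectors (train_car and new_codes are hypothetical), takes the levels from the training data.

```r
# Hypothetical training factor and new-data codes; the new data lacks some codes
train_car <- factor(c(1, 2, 3, 4, 5))
new_codes <- c(2, 5, 5)

naive   <- as.factor(new_codes)                           # levels inferred from new data only
aligned <- factor(new_codes, levels = levels(train_car))  # levels taken from training data

levels(naive)                                   # "2" "5"
levels(aligned)                                 # "1" "2" "3" "4" "5"
identical(levels(aligned), levels(train_car))   # TRUE
```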
# verify data structure of variables
str(SurveyIncomplete)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 5000 obs. of 7 variables:
## $ salary : num 150000 82524 115647 141443 149211 ...
## $ age : num 76 51 34 22 56 26 64 50 26 46 ...
## $ elevel : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 2 2 1 4 1 5 4 4 3 4 ...
## $ car : Factor w/ 20 levels "1","2","3","4",..: 3 8 10 18 5 12 1 9 3 18 ...
## $ zipcode: Factor w/ 9 levels "0","1","2","3",..: 4 4 3 3 4 2 3 1 5 7 ...
## $ credit : num 377980 141658 360980 282736 215667 ...
## $ brand : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 1 ...
## - attr(*, "spec")=
## .. cols(
## .. salary = col_double(),
## .. age = col_double(),
## .. elevel = col_double(),
## .. car = col_double(),
## .. zipcode = col_double(),
## .. credit = col_double(),
## .. brand = col_double()
## .. )
At last, we can use the predict function and our best model to make predictions about brand preference!
# predict brand preference
PredictionsBrand <- predict(c50Fit1, SurveyIncomplete)
With the summary function and our prediction object, we can finally learn how many individuals are predicted to prefer Sony and Acer.
# computer brand preference
summary(PredictionsBrand)
## 0 1
## 1938 3062
Based on our predictions, most customers with incomplete surveys (3,062 of 5,000) will prefer a Sony computer. If we add the fully answered surveys and the predictions together, Sony comes out on top as well. With this information, the fictional sales team can begin to work on their product selection strategy.
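Raw counts are easier to communicate as shares. The short sketch below converts the predicted counts from the summary output into percentages. It assumes that code 0 stands for Acer and code 1 for Sony, which is what the conclusion above implies; the Survey Key should be checked to confirm the mapping.

```r
# Predicted counts copied from the summary output
predicted_counts <- c("0" = 1938, "1" = 3062)

# Attach brand names, assuming 0 = Acer and 1 = Sony
names(predicted_counts) <- c("Acer", "Sony")

# Percentage of respondents predicted to prefer each brand
shares <- round(100 * predicted_counts / sum(predicted_counts), 1)
shares   # Acer 38.8, Sony 61.2
```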