Predict which Brand Customers Prefer, Acer or Sony

Overview

Background

We are asked to build predictive models that infer customers’ brand preference, Acer or Sony, for the incomplete surveys in which that answer is missing.

Dataset Information

  • 9,898 completed surveys
  • 5,000 incomplete surveys
  • Attributes consist of: Salary, Age, Education Level, Car, Zipcode, Credit, Brand Preference

Data Science Outline

Pre-processing

  • When we examined the structure of the dataset, we found that education level, car, zipcode, and brand are categorical variables stored as integers, so we converted them to factors so that the models treat them as categories rather than numeric values.
  • The sum of missing values is 0, so there is no missing data to handle.
# load the libraries
library(readr)
library(caret)
library(ggplot2)

# set seed
set.seed(123)
# load the dataset and check the structure 
complete <- read.csv("CompleteResponses.csv")
str(complete)
## 'data.frame':    9898 obs. of  7 variables:
##  $ salary : num  119807 106880 78021 63690 50874 ...
##  $ age    : int  45 63 23 51 20 56 24 62 29 41 ...
##  $ elevel : int  0 1 0 3 3 3 4 3 4 1 ...
##  $ car    : int  14 11 15 6 14 14 8 3 17 5 ...
##  $ zipcode: int  4 6 2 5 4 3 5 0 0 4 ...
##  $ credit : num  442038 45007 48795 40889 352951 ...
##  $ brand  : int  0 1 0 1 0 1 1 1 0 1 ...
# check the sum of missing values
sum(is.na(complete))
## [1] 0
# change data type to factor
complete$elevel <- as.factor(complete$elevel)
complete$car <- as.factor(complete$car)
complete$zipcode <- as.factor(complete$zipcode)
complete$brand <- as.factor(complete$brand)

Build the Models

  • Since we are predicting a preference for either Acer or Sony, this is a classification problem.
  • We used the caret package to try two different classifiers, both tree-based models, and evaluated their performance after training (a sketch of the training calls is shown after the resampling setup below):
    • C5.0
    • Random Forest
  • We also used 10-fold cross-validation to obtain more reliable performance estimates and guard against overfitting.
# define a 75%/25% train/test split of the dataset
inTraining <- createDataPartition(complete$brand, p = .75, list = FALSE)
training <- complete[inTraining,]
testing <- complete[-inTraining,]

# 10 fold cross validation
fitControl <- trainControl(method = "cv", number = 10)
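
The train() calls themselves are not shown above; the following is a minimal sketch of how the two classifiers could be fit with caret, where the object names (c50Fit, rfFit) and the tuneLength value are illustrative assumptions rather than taken from the original report.

# train a C5.0 decision tree with the 10-fold cross-validation setup
# (object names and tuneLength are illustrative assumptions)
c50Fit <- train(brand ~ ., data = training,
                method = "C5.0",
                trControl = fitControl,
                tuneLength = 2)

# train a Random Forest with the same resampling setup
rfFit <- train(brand ~ ., data = training,
               method = "rf",
               trControl = fitControl,
               tuneLength = 2)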

Evaluate the Models

  • C5.0 and Random Forest achieved very close accuracy and kappa scores.
  • Both models produced consistent scores across their 10-fold cross-validation resamples.
  • C5.0 has a large speed advantage over Random Forest, training roughly 7 times faster (a comparison sketch follows this list).
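
The comparison itself is not shown in the report; a minimal sketch follows, assuming the hypothetical c50Fit and rfFit objects from the training sketch above.

# collect the cross-validation results from both models and compare
# their accuracy and kappa distributions
results <- resamples(list(C5.0 = c50Fit, RandomForest = rfFit))
summary(results)

# check the chosen model's performance on the held-out test set
c50Pred <- predict(c50Fit, testing)
confusionMatrix(c50Pred, testing$brand)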

Make Predictions

  • Since C5.0 trains much faster while reaching comparable accuracy, we picked C5.0 as our final model for prediction.
  • Before making predictions, we applied the same pre-processing to the incomplete dataset so that its format matches the complete dataset used to train our model.
  • Our predictions show that customers prefer Sony over Acer.
  • Finally, we combined the complete and incomplete surveys to tally the total preference counts for Acer and Sony (see the sketch after this list).
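
The prediction step is not shown above; the sketch below assumes the incomplete surveys are stored in a file named SurveyIncomplete.csv with the same columns as the complete file, that brand is coded 0 = Acer and 1 = Sony, and that c50Fit is the C5.0 model from the earlier training sketch.

# load the incomplete surveys and apply the same factor conversions
# (the file name SurveyIncomplete.csv is an assumption)
incomplete <- read.csv("SurveyIncomplete.csv")
incomplete$elevel <- as.factor(incomplete$elevel)
incomplete$car <- as.factor(incomplete$car)
incomplete$zipcode <- as.factor(incomplete$zipcode)

# predict the missing brand preference with the chosen C5.0 model
incomplete$brand <- predict(c50Fit, incomplete)

# combine the complete and predicted surveys and count total preferences
# (assuming brand is coded 0 = Acer, 1 = Sony)
allSurveys <- rbind(complete, incomplete)
summary(allSurveys$brand)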