This report describes how to classify customers into four segments (A, B, C, and D) using machine learning algorithms. The dataset used in this report is the Customer Segmentation data hosted on Kaggle, which was originally acquired from an Analytics Vidhya hackathon.

The dataset used for modeling in this report contains customer records with demographic and spending attributes. It can be downloaded here:
https://www.kaggle.com/kaushiksuresh147/customer-segmentation

The report is structured as follows:
1. Data Extraction
2. Exploratory Data Analysis (EDA)
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation

1. Data Extraction

Import necessary libraries.

rm(list = ls())
library(ggplot2)
library(gridExtra)
library(corrgram)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble  3.0.5     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::combine() masks gridExtra::combine()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
library(dplyr) 
library(scales)  
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(treemapify)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
## 
##     boundary
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(e1071)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
## 
##     panel.fill
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

Read the customer training and testing datasets from .csv files into R data frames.

## read data
customers_df <- read.csv("data/Train.csv")
test_df <- read.csv("data/Test.csv")

The training dataset has 8068 observations and 11 variables. The target variable is Segmentation, and the remaining variables are candidate features.

Compute statistical summary of each variable.

## statistical summary
summary(customers_df)
##        ID            Gender          Ever_Married            Age       
##  Min.   :458982   Length:8068        Length:8068        Min.   :18.00  
##  1st Qu.:461241   Class :character   Class :character   1st Qu.:30.00  
##  Median :463473   Mode  :character   Mode  :character   Median :40.00  
##  Mean   :463479                                         Mean   :43.47  
##  3rd Qu.:465744                                         3rd Qu.:53.00  
##  Max.   :467974                                         Max.   :89.00  
##                                                                        
##   Graduated          Profession        Work_Experience  Spending_Score    
##  Length:8068        Length:8068        Min.   : 0.000   Length:8068       
##  Class :character   Class :character   1st Qu.: 0.000   Class :character  
##  Mode  :character   Mode  :character   Median : 1.000   Mode  :character  
##                                        Mean   : 2.642                     
##                                        3rd Qu.: 4.000                     
##                                        Max.   :14.000                     
##                                        NA's   :829                        
##   Family_Size      Var_1           Segmentation      
##  Min.   :1.00   Length:8068        Length:8068       
##  1st Qu.:2.00   Class :character   Class :character  
##  Median :3.00   Mode  :character   Mode  :character  
##  Mean   :2.85                                        
##  3rd Qu.:4.00                                        
##  Max.   :9.00                                        
##  NA's   :335

The summary shows the minimum, quartiles, median, mean, and maximum of each numeric variable. It also shows that Work_Experience has 829 missing values and Family_Size has 335.
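
The NA counts in summary() only cover values that R already treats as missing; blank strings in the character columns are not reported. The following is a minimal sketch of counting both kinds of missing value per column. It uses a small made-up data frame, and the names `toy_df` and `count_missing` are hypothetical, not part of the original analysis.

```r
# Toy data frame standing in for customers_df (values are made up)
toy_df <- data.frame(
  Ever_Married    = c("Yes", "", "No", "Yes"),
  Work_Experience = c(1, NA, 0, 4),
  stringsAsFactors = FALSE
)

# Count values that are either NA or an empty string, column by column
count_missing <- function(df) {
  sapply(df, function(col) sum(is.na(col) | col == ""))
}

missing_counts <- count_missing(toy_df)
print(missing_counts)
```

Applied to the real data, the same helper would be called as count_missing(customers_df).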

2. Exploratory Data Analysis

To find out the column names and types, we use the str() function.

str(customers_df)
## 'data.frame':    8068 obs. of  11 variables:
##  $ ID             : int  462809 462643 466315 461735 462669 461319 460156 464347 465015 465176 ...
##  $ Gender         : chr  "Male" "Female" "Female" "Male" ...
##  $ Ever_Married   : chr  "No" "Yes" "Yes" "Yes" ...
##  $ Age            : int  22 38 67 67 40 56 32 33 61 55 ...
##  $ Graduated      : chr  "No" "Yes" "Yes" "Yes" ...
##  $ Profession     : chr  "Healthcare" "Engineer" "Engineer" "Lawyer" ...
##  $ Work_Experience: num  1 NA 1 0 NA 0 1 1 0 1 ...
##  $ Spending_Score : chr  "Low" "Average" "Low" "High" ...
##  $ Family_Size    : num  4 3 1 2 6 2 3 3 3 4 ...
##  $ Var_1          : chr  "Cat_4" "Cat_4" "Cat_6" "Cat_6" ...
##  $ Segmentation   : chr  "D" "A" "B" "B" ...

From the result above, we know the following:
1. The first column is ID. It is unique to each customer and carries no predictive information, so it should be removed.
2. Var_1 is also removed because it is not used in the prediction.

# Drop the ID column (column 1)
customers_df <- customers_df[ , -1]
# After dropping ID, Var_1 is column 9
customers_df <- customers_df[ , -9]
test_df <- test_df[ , -1]
test_df <- test_df[ , -9]

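Some character columns store missing values as empty strings. To remove the rows with missing values, we first convert the empty strings to NA and then drop every incomplete row.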

is.na(customers_df$Ever_Married) # turn the empty values into NA
customers_df$Ever_Married[customers_df$Ever_Married == ""]
customers_df <- customers_df %>%
  mutate(Ever_Married = 
           replace(Ever_Married,
                   Ever_Married == "",
                   NA))
test_df$Ever_Married[test_df$Ever_Married == ""]
test_df <- test_df %>%
  mutate(Ever_Married = 
           replace(Ever_Married,
                   Ever_Married == "",
                   NA))

customers_df <- customers_df %>%
  mutate(Graduated = 
           replace(Graduated,
                   Graduated == "",
                   NA))
customers_df <- customers_df[complete.cases(customers_df),]

test_df <- test_df %>%
  mutate(Graduated = 
           replace(Graduated,
                   Graduated == "",
                   NA))

customers_df <- customers_df %>%
  mutate(Profession = 
           replace(Profession,
                   Profession == "",
                   NA))

test_df <- test_df %>%
  mutate(Profession = 
           replace(Profession,
                   Profession == "",
                   NA))


customers_df$Work_Experience[customers_df$Work_Experience == ""]
customers_df <- customers_df %>%
  mutate(Work_Experience = 
           replace(Work_Experience,
                   Work_Experience == "",
                   NA))

test_df$Work_Experience[test_df$Work_Experience == ""]
test_df <- test_df %>%
  mutate(Work_Experience = 
           replace(Work_Experience,
                   Work_Experience == "",
                   NA))

customers_df$Age[customers_df$Age == ""]
customers_df <- customers_df %>%
  mutate(Age = 
           replace(Age,
                   Age == "",
                   NA))

test_df$Age[test_df$Age == ""]
test_df <- test_df %>%
  mutate(Age = 
           replace(Age,
                   Age == "",
                   NA))

customers_df$Family_Size[customers_df$Family_Size == ""]
customers_df <- customers_df %>%
  mutate(Family_Size = 
           replace(Family_Size,
                   Family_Size == "",
                   NA))

test_df$Family_Size[test_df$Family_Size == ""]
test_df <- test_df %>%
  mutate(Family_Size = 
           replace(Family_Size,
                   Family_Size == "",
                   NA))


customers_df <- customers_df[complete.cases(customers_df),]
test_df <- test_df[complete.cases(test_df),]
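
The column-by-column replacements above all follow the same pattern, so they can be sketched as one helper that is applied to every character column at once. This is only an illustration on made-up data; `blank_to_na` and `toy_df` are hypothetical names and the report itself does not use them.

```r
# Hypothetical helper: turn "" into NA in every character column
blank_to_na <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    col[!is.na(col) & col == ""] <- NA
    col
  })
  df
}

# Toy data frame standing in for customers_df (values are made up)
toy_df <- data.frame(
  Ever_Married = c("Yes", "", "No", "Yes"),
  Graduated    = c("No", "Yes", "", "Yes"),
  Family_Size  = c(4, 3, 1, NA),
  stringsAsFactors = FALSE
)

# Convert blanks to NA, then keep only the rows without any NA
clean_df <- blank_to_na(toy_df)
clean_df <- clean_df[complete.cases(clean_df), ]
print(clean_df)
```

Numeric columns such as Age or Family_Size never contain empty strings, so the helper skips them and only complete.cases() acts on their NA values.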

After removing the missing values:
1. The type of Gender, Ever_Married, Graduated, Profession, Spending_Score, and Segmentation is chr, so they should be converted to factor.
2. The type of Age is int, so it should be converted to num. Family_Size and Work_Experience are already num, so converting them has no effect.

test_df$Family_Size <- as.numeric(test_df$Family_Size)
test_df$Work_Experience <- as.numeric(test_df$Work_Experience)
test_df$Gender <- as.factor(test_df$Gender)
test_df$Ever_Married <- as.factor(test_df$Ever_Married)
test_df$Age <- as.numeric(test_df$Age)
test_df$Graduated <- as.factor(test_df$Graduated)
test_df$Spending_Score <- as.factor(test_df$Spending_Score)
test_df$Segmentation <- as.factor(test_df$Segmentation)
test_df$Profession <- as.factor(test_df$Profession)


customers_df$Family_Size <- as.numeric(customers_df$Family_Size)
customers_df$Work_Experience <- as.numeric(customers_df$Work_Experience)
customers_df$Gender <- as.factor(customers_df$Gender)
customers_df$Ever_Married <- as.factor(customers_df$Ever_Married)
customers_df$Age <- as.numeric(customers_df$Age)
customers_df$Graduated <- as.factor(customers_df$Graduated)
customers_df$Spending_Score <- as.factor(customers_df$Spending_Score)
customers_df$Segmentation <- as.factor(customers_df$Segmentation)
customers_df$Profession <- as.factor(customers_df$Profession)
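
The conversions above can be sketched more compactly, and the sketch also shows one way to make sure the testing data uses the same factor levels as the training data. The data and the names `train_toy`, `test_toy`, and `convert_types` are made up for illustration; reusing the training levels is an assumption about good practice, not something the report does explicitly.

```r
# Toy training and testing data (values are made up)
train_toy <- data.frame(
  Profession = c("Artist", "Doctor", "Lawyer", "Artist"),
  Age        = c(22L, 38L, 67L, 40L),
  stringsAsFactors = FALSE
)
test_toy <- data.frame(
  Profession = c("Lawyer", "Artist"),
  Age        = c(56L, 32L),
  stringsAsFactors = FALSE
)

# Convert every character column to factor and every integer column to numeric
convert_types <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.character(col)) {
      as.factor(col)
    } else if (is.integer(col)) {
      as.numeric(col)
    } else {
      col
    }
  })
  df
}
train_toy <- convert_types(train_toy)

# Reuse the training levels so both data frames share the same factor coding
test_toy$Profession <- factor(test_toy$Profession,
                              levels = levels(train_toy$Profession))
test_toy$Age <- as.numeric(test_toy$Age)

str(train_toy)
print(levels(test_toy$Profession))
```

Sharing levels matters because predict() for several model types expects the factor levels in the new data to match those seen during training.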

2.1. Univariate Analysis

Univariate analysis examines a single variable. Here we look at the target variable, which has four segments: A, B, C, and D.

ggplot(customers_df, aes(x = Segmentation, fill = Segmentation)) +
  geom_bar() +
  stat_count(geom = "text", color = "white", size = 3,
             aes(label = ..count..), position=position_stack(vjust = 0.5)) +
  labs(title = "Customers Based on Segmentation",
       x = "Segmentation", y = "Customers") +
  theme(plot.title = element_text(hjust = 0.5))

Based on the bar chart above, we can see how the customers are distributed across the four segments.
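
The numbers behind such a bar chart can also be inspected directly. A minimal sketch with a made-up vector of segment labels (`segments` is a hypothetical stand-in for customers_df$Segmentation):

```r
# Made-up segment labels standing in for customers_df$Segmentation
segments <- factor(c("A", "B", "D", "D", "C", "A", "D", "B"))

# Absolute counts per segment (the bar heights)
segment_counts <- table(segments)

# Share of each segment in percent
segment_share <- round(100 * prop.table(segment_counts), 1)

print(segment_counts)
print(segment_share)
```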

2.2. Bivariate Analysis

p1 <- ggplot(customers_df, aes(x=Segmentation, fill = Ever_Married)) +
  geom_bar(position = "stack") +
  stat_count(geom = "text", color = "white", size = 3.5,
             aes(label = ..count..), position=position_stack(vjust = 0.5)) +
  labs(title = "Customer Segmentation by Ever Married") +
  theme(plot.title = element_text(hjust = 0.5))

p2 <- ggplot(customers_df, aes(x=Segmentation, fill = Gender)) +
  geom_bar(position = "stack") +
  stat_count(geom = "text", color = "white", size = 3.5,
             aes(label = ..count..), position=position_stack(vjust = 0.5)) +
  labs(title = "Customer Segmentation by Gender") +
  theme(plot.title = element_text(hjust = 0.5))


p3 <- ggplot(customers_df, aes(x=Spending_Score, fill = Segmentation)) +
  geom_bar(position = "stack") +
  stat_count(geom = "text", color = "white", size = 3.5,
             aes(label = ..count..), position=position_stack(vjust = 0.5)) +
  labs(title = "Customer Segmentation by Spending Score") +
  theme(plot.title = element_text(hjust = 0.5))

grid.arrange(p1,p2,p3)

Based on the three plots above, we can see how the customer segments break down by Ever_Married, Gender, and Spending_Score.
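
A stacked bar chart of two categorical variables corresponds to a cross-tabulation. The sketch below builds one on made-up data and adds row proportions, which make segments of different sizes easier to compare. `toy_df` is a hypothetical stand-in for customers_df.

```r
# Made-up data standing in for customers_df
toy_df <- data.frame(
  Segmentation   = c("A", "A", "B", "B", "D", "D", "D", "C"),
  Spending_Score = c("Low", "High", "Average", "Average",
                     "Low", "Low", "Low", "High")
)

# Counts for every combination of segment and spending score
cross_tab <- table(toy_df$Segmentation, toy_df$Spending_Score,
                   dnn = c("Segmentation", "Spending_Score"))

# Share of each spending score within a segment (each row sums to 1)
row_share <- prop.table(cross_tab, margin = 1)

print(cross_tab)
print(round(row_share, 2))
```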

2.3 Multivariate Data Analysis

ggplot(customers_df, aes(x=Work_Experience, y=Age, color=Segmentation,
                         shape=Graduated)) +
  # geom_jitter() already draws the points; adding geom_point() would plot them twice
  geom_jitter() +
  # only the last facet specification takes effect, so facet_grid(~Profession) is dropped
  facet_wrap(~ Segmentation) +
  labs(title = "Customer Segmentation by Age, Graduated and Work Experience") +
  theme(plot.title = element_text(hjust = 0.5))

From this plot, we can see how the segments differ by Age, Graduated, and Work_Experience.

3. Data Preparation

3.1 Data Cleaning

Data cleaning was already done before the Exploratory Data Analysis, so we do not need to repeat it.

dim(customers_df)
## [1] 6718    9

The number of observations is now 6718, which means the data cleaning process removed 1350 of the original 8068 rows.
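
The row counts can be checked with a few lines of arithmetic. The sketch only uses the two figures reported above (8068 rows before cleaning and 6718 after); the variable names are illustrative.

```r
# Row counts reported in this report
n_before <- 8068
n_after  <- 6718

# Rows dropped by the cleaning step, in absolute and relative terms
rows_removed <- n_before - n_after
pct_removed  <- round(100 * rows_removed / n_before, 1)

print(rows_removed)
print(pct_removed)
```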

3.2 Training and Testing Division

A random 70:30 split into training and testing data is a common approach, but it is not needed here because the company already provided separate training and testing datasets.

4. Modeling

Create classification models using Decision Tree, Random Forest, and Support Vector Machine (SVM). We will create four variants: without PCA and One Hot Encoding (OHE), with PCA, with OHE, and with both PCA and OHE.

4.1 Decision Tree

### Decision Tree Model

library(party)
model.dt <- ctree(formula = Segmentation ~ ., 
                  data = customers_df)
model.dt
### Predict Decision Tree
pred.dt <- predict(model.dt, test_df)
pred.dt
cm.dt <- table(test_df$Segmentation, pred.dt,
               dnn = c("Actual", "Predicted"))
cm.dt
##       Predicted
## Actual   A   B   C   D
##      A 165 157 151 226
##      B 115 112 105 123
##      C  86 109 124  67
##      D 135  94 110 299
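
Overall accuracy can be read off such a confusion matrix as the share of observations on the diagonal. The sketch below re-enters the Decision Tree counts shown above by hand (the object name `cm` is illustrative) and computes that share.

```r
# Decision Tree confusion matrix, copied from the output above
cm <- matrix(
  c(165, 157, 151, 226,
    115, 112, 105, 123,
     86, 109, 124,  67,
    135,  94, 110, 299),
  nrow = 4, byrow = TRUE,
  dimnames = list(Actual = c("A", "B", "C", "D"),
                  Predicted = c("A", "B", "C", "D"))
)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
print(round(accuracy, 4))
```

With the fitted objects available, the same value would come from sum(diag(cm.dt)) / sum(cm.dt).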

4.2 Random Forest

model.forest <- randomForest(formula = Segmentation ~ ., 
                             data = customers_df)
pred.forest <- predict(model.forest, test_df)
cm.forest <- table(test_df$Segmentation, pred.forest,
                   dnn = c("Actual", "Predicted"))
cm.forest
##       Predicted
## Actual   A   B   C   D
##      A 188 133 164 214
##      B 124  99 128 104
##      C 104  87 140  55
##      D 145  93 116 284

4.3 Support Vector Machine

model.svm <- svm(formula = Segmentation ~ ., 
                 data = customers_df)

pred.svm <- predict(model.svm, test_df)
cm.svm <- table(test_df$Segmentation, pred.svm,
                dnn = c("Actual", "Predicted"))
cm.svm
##       Predicted
## Actual   A   B   C   D
##      A 233 121 156 189
##      B 142 106 112  95
##      C 116  88 135  47
##      D 181  80 118 259

Result

confusionMatrix(pred.dt, test_df$Segmentation )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D
##          A 165 115  86 135
##          B 157 112 109  94
##          C 151 105 124 110
##          D 226 123  67 299
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3214          
##                  95% CI : (0.3018, 0.3415)
##     No Information Rate : 0.3209          
##     P-Value [Acc > NIR] : 0.4898          
##                                           
##                   Kappa : 0.089           
##                                           
##  Mcnemar's Test P-Value : 2.081e-11       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D
## Sensitivity           0.23605  0.24615  0.32124   0.4687
## Specificity           0.77282  0.79106  0.79576   0.7299
## Pos Pred Value        0.32934  0.23729  0.25306   0.4182
## Neg Pred Value        0.68157  0.79894  0.84479   0.7683
## Prevalence            0.32094  0.20891  0.17723   0.2929
## Detection Rate        0.07576  0.05142  0.05693   0.1373
## Detection Prevalence  0.23003  0.21671  0.22498   0.3283
## Balanced Accuracy     0.50444  0.51861  0.55850   0.5993
confusionMatrix(pred.svm, test_df$Segmentation )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D
##          A 233 142 116 181
##          B 121 106  88  80
##          C 156 112 135 118
##          D 189  95  47 259
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3365          
##                  95% CI : (0.3167, 0.3568)
##     No Information Rate : 0.3209          
##     P-Value [Acc > NIR] : 0.06251         
##                                           
##                   Kappa : 0.1051          
##                                           
##  Mcnemar's Test P-Value : 1.499e-07       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D
## Sensitivity            0.3333  0.23297  0.34974   0.4060
## Specificity            0.7032  0.83227  0.78460   0.7851
## Pos Pred Value         0.3467  0.26835  0.25912   0.4390
## Neg Pred Value         0.6906  0.80426  0.84852   0.7613
## Prevalence             0.3209  0.20891  0.17723   0.2929
## Detection Rate         0.1070  0.04867  0.06198   0.1189
## Detection Prevalence   0.3085  0.18136  0.23921   0.2709
## Balanced Accuracy      0.5183  0.53262  0.56717   0.5955
confusionMatrix(pred.forest, test_df$Segmentation )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D
##          A 188 124 104 145
##          B 133  99  87  93
##          C 164 128 140 116
##          D 214 104  55 284
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3264          
##                  95% CI : (0.3068, 0.3466)
##     No Information Rate : 0.3209          
##     P-Value [Acc > NIR] : 0.2981          
##                                           
##                   Kappa : 0.0957          
##                                           
##  Mcnemar's Test P-Value : 1.662e-10       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D
## Sensitivity           0.26896  0.21758  0.36269   0.4451
## Specificity           0.74780  0.81834  0.77232   0.7578
## Pos Pred Value        0.33512  0.24029  0.25547   0.4323
## Neg Pred Value        0.68398  0.79841  0.84908   0.7673
## Prevalence            0.32094  0.20891  0.17723   0.2929
## Detection Rate        0.08632  0.04545  0.06428   0.1304
## Detection Prevalence  0.25758  0.18916  0.25161   0.3017
## Balanced Accuracy     0.50838  0.51796  0.56751   0.6015

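From these results, the test accuracy of all three models is only about 0.32 to 0.34 (Decision Tree 0.3214, Random Forest 0.3264, SVM 0.3365), barely above the No Information Rate of 0.3209, so we decided to try One Hot Encoding.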
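
To make the No Information Rate and Kappa values in the output easier to interpret, the sketch below recomputes them by hand from the Decision Tree matrix printed by confusionMatrix(). The counts are copied from that output; the formulas are the standard definitions (share of the largest reference class, and Cohen's kappa), and the variable names are illustrative.

```r
# Decision Tree matrix as printed by confusionMatrix(): rows are predictions
cm <- matrix(
  c(165, 115,  86, 135,
    157, 112, 109,  94,
    151, 105, 124, 110,
    226, 123,  67, 299),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = c("A", "B", "C", "D"),
                  Reference = c("A", "B", "C", "D"))
)
n <- sum(cm)

# Observed agreement (accuracy)
observed <- sum(diag(cm)) / n

# No Information Rate: accuracy of always predicting the largest reference class
nir <- max(colSums(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: agreement beyond chance, relative to the maximum possible
kappa <- (observed - expected) / (1 - expected)

print(round(c(accuracy = observed, nir = nir, kappa = kappa), 4))
```

A kappa close to 0 means the model agrees with the true labels only slightly more often than chance would.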

4.4 One Hot Encoding

Create copies of the training and testing data frames to be encoded. The encoded variables are Gender, Ever_Married, Graduated, Profession, and Spending_Score.

customers_df2 <- customers_df
test_df2 <- test_df

Create the encoded data frame from the training data.

#### 1. create dataframe to be encoded  (gender, train_df)
Gender_df <- data.frame(customers_df2$Gender)
colnames(Gender_df) <- "Gender"


### create OHE dataframe  (gender,customer_df)
df1 <- dummyVars("~.", data = Gender_df)
df2 <- data.frame(predict(df1, newdata = Gender_df))
df2

##  combine to original dataframe (gender,customer_df)
customers_df2 <- cbind(customers_df2, df2)
customers_df2$Gender<- NULL

#### 2. create dataframe to be encoded  (ever_maried,customer_df)
Ever_Married_df <- data.frame(customers_df2$Ever_Married)
colnames(Ever_Married_df) <- "Ever_Married"


### create OHE dataframe  (ever_married,customer_df)
df3 <- dummyVars("~.", data = Ever_Married_df)
df4 <- data.frame(predict(df3, newdata = Ever_Married_df))
df4

## combine to original dataframe (ever_married,customer_df)
customers_df2 <- cbind(customers_df2, df4)
customers_df2$Ever_Married<- NULL
View(customers_df)

#### 3. create dataframe to be encoded  (graduated,customer_df)
Graduated_df <- data.frame(customers_df2$Graduated)
colnames(Graduated_df) <- "Graduated"


### create OHE dataframe  (Graduated,customer_df)
df5 <- dummyVars("~.", data = Graduated_df)
df6 <- data.frame(predict(df5, newdata = Graduated_df))
df6

## combine to original dataframe (Graduated,customer_df)
customers_df2 <- cbind(customers_df2, df6)
customers_df2$Graduated<- NULL

#### 4. create dataframe to be encoded  (Profession,customer_df)
Profession_df <- data.frame(customers_df2$Profession)
colnames(Profession_df) <- "Profession"


### create OHE dataframe  (Profession,customer_df)
df7 <- dummyVars("~.", data = Profession_df)
df8 <- data.frame(predict(df7, newdata = Profession_df))
df8

## combine to original dataframe (Profession,customer_df)
customers_df2 <- cbind(customers_df2, df8)
customers_df2$Profession<- NULL
View(customers_df)


#### 5. create dataframe to be encoded  (SpendingScore,customer_df)
Spending_Score_df <- data.frame(customers_df2$Spending_Score)
colnames(Spending_Score_df) <- "Spending_Score"


### create OHE dataframe  (SpendingScore,customer_df)
df9 <- dummyVars("~.", data = Spending_Score_df)
df10 <- data.frame(predict(df9, newdata = Spending_Score_df))
df10

## combine to original dataframe (SpendingScore,customer_df)
customers_df2 <- cbind(customers_df2, df10)
customers_df2$Spending_Score<- NULL
View(customers_df2)

Create the encoded data frame from the testing data.

#### 1. create dataframe to be encoded  (gender, test_df)
Gender_df2 <- data.frame(test_df2$Gender)
colnames(Gender_df2) <- "Gender"


### create OHE dataframe  (gender,test_df)
df11 <- dummyVars("~.", data = Gender_df2)
df12 <- data.frame(predict(df1, newdata = Gender_df2))
df12

##  combine to original dataframe (gender,test_df)
test_df2 <- cbind(test_df2, df12)
test_df2$Gender<- NULL

#### 2. create dataframe to be encoded  (ever_married,test_df)
Ever_Married_df2 <- data.frame(test_df2$Ever_Married)
colnames(Ever_Married_df2) <- "Ever_Married"


### create OHE dataframe  (ever_married,test_df)
df13 <- dummyVars("~.", data = Ever_Married_df2)
df14 <- data.frame(predict(df13, newdata = Ever_Married_df2))
df14

## combine to original dataframe (ever_married,test_df)
test_df2 <- cbind(test_df2, df14)
test_df2$Ever_Married<- NULL

#### 3. create dataframe to be encoded  (graduated,test_df)
Graduated_df2 <- data.frame(test_df2$Graduated)
colnames(Graduated_df2) <- "Graduated"


### create OHE dataframe  (Graduated,test_df)
df15 <- dummyVars("~.", data = Graduated_df2)
df16 <- data.frame(predict(df15, newdata = Graduated_df2))
df16

## combine to original dataframe (Graduated,test_df)
test_df2 <- cbind(test_df2, df16)
test_df2$Graduated<- NULL


#### 4. create dataframe to be encoded  (Profession,test_df)
Profession_df2 <- data.frame(test_df2$Profession)
colnames(Profession_df2) <- "Profession"


### create OHE dataframe  (Profession,test_df)
df17 <- dummyVars("~.", data = Profession_df2)
df18 <- data.frame(predict(df17, newdata = Profession_df2))
df18

## combine to original dataframe (Profession,test_df)
test_df2 <- cbind(test_df2, df18)
test_df2$Profession<- NULL
View(test_df2)

#### 5. create dataframe to be encoded  (SpendingScore,test_df)
Spending_Score_df2 <- data.frame(test_df2$Spending_Score)
colnames(Spending_Score_df2) <- "Spending_Score"


### create OHE dataframe  (SpendingScore,test_df)
df19 <- dummyVars("~.", data = Spending_Score_df2)
df20 <- data.frame(predict(df19, newdata = Spending_Score_df2))
df20

## combine to original dataframe (SpendingScore,test_df)
test_df2 <- cbind(test_df2, df20)
test_df2$Spending_Score<- NULL
View(test_df2)
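
The idea behind the encoding can be sketched in base R with model.matrix(), which is a swapped-in technique and not the dummyVars() approach used above. The data and the names `train_toy`, `test_toy`, and `one_hot` are made up. Note that model.matrix() names its columns without a separator (Spending_ScoreLow), whereas dummyVars() produced Spending_Score.Low.

```r
# Toy training data with one categorical column (values are made up)
train_toy <- data.frame(
  Spending_Score = factor(c("Low", "Average", "High", "Low"))
)

# Toy testing data that reuses the training levels, even though
# the level "Average" does not occur in it
test_toy <- data.frame(
  Spending_Score = factor(c("High", "Low"),
                          levels = levels(train_toy$Spending_Score))
)

# One indicator column per level; "- 1" removes the intercept so that
# no level is dropped
one_hot <- function(df) {
  as.data.frame(model.matrix(~ Spending_Score - 1, data = df))
}

train_ohe <- one_hot(train_toy)
test_ohe  <- one_hot(test_toy)

print(train_ohe)
print(test_ohe)
```

Because the test factor carries the training levels, both encoded data frames end up with the same columns, which is what the models need at prediction time.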

After creating the encoded data frames, we build the models on the encoded data.

4.4.1 Decision Tree using OHE

model.dt_OHE <- ctree(formula = Segmentation ~ Gender.Male+Gender.Female+Ever_Married.No+Ever_Married.Yes+
                        Profession.Artist+Profession.Engineer+Profession.Entertainment+Profession.Executive+
                        Profession.Healthcare+Profession.Homemaker+Profession.Lawyer+Profession.Marketing+
                        Spending_Score.Average+Spending_Score.High+Spending_Score.Low , 
                  data = customers_df2)

4.4.2 Random Forest using OHE

set.seed(2021)
model.forest_OHE <- randomForest(formula = Segmentation ~ Gender.Male+Gender.Female+Ever_Married.No+Ever_Married.Yes+
                                   Profession.Artist+Profession.Engineer+Profession.Entertainment+Profession.Executive+
                                   Profession.Healthcare+Profession.Homemaker+Profession.Lawyer+Profession.Marketing+
                                   Spending_Score.Average+Spending_Score.High+Spending_Score.Low, 
                                 data = customers_df2)
model.forest_OHE
## 
## Call:
##  randomForest(formula = Segmentation ~ Gender.Male + Gender.Female +      Ever_Married.No + Ever_Married.Yes + Profession.Artist +      Profession.Engineer + Profession.Entertainment + Profession.Executive +      Profession.Healthcare + Profession.Homemaker + Profession.Lawyer +      Profession.Marketing + Spending_Score.Average + Spending_Score.High +      Spending_Score.Low, data = customers_df2) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 50.95%
## Confusion matrix:
##     A   B   C    D class.error
## A 845 275 234  274   0.4809582
## B 550 366 473  194   0.7687934
## C 313 294 913  215   0.4737752
## D 428  93  80 1171   0.3391648
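
The OOB error rate and the class.error column can be reproduced from the confusion matrix itself. The sketch below re-enters the counts from the output above by hand; `oob_cm` is an illustrative name.

```r
# OOB confusion matrix of the Random Forest, copied from the output above
oob_cm <- matrix(
  c(845, 275, 234,  274,
    550, 366, 473,  194,
    313, 294, 913,  215,
    428,  93,  80, 1171),
  nrow = 4, byrow = TRUE,
  dimnames = list(Actual = c("A", "B", "C", "D"),
                  Predicted = c("A", "B", "C", "D"))
)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

# Overall OOB error: share of all observations that were misclassified
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

print(round(class_error, 4))
print(round(100 * oob_error, 2))
```

Segment B has by far the highest class error, so most of its customers are assigned to other segments.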

4.4.3 Support Vector Machines (SVM) using OHE

model.svm_OHE <- svm(formula = Segmentation ~ Gender.Male+Gender.Female+Ever_Married.No+Ever_Married.Yes+
                       Profession.Artist+Profession.Engineer+Profession.Entertainment+Profession.Executive+
                       Profession.Healthcare+Profession.Homemaker+Profession.Lawyer+Profession.Marketing+
                       Spending_Score.Average+Spending_Score.High+Spending_Score.Low, 
                     data = customers_df2)
model.svm_OHE
## 
## Call:
## svm(formula = Segmentation ~ Gender.Male + Gender.Female + Ever_Married.No + 
##     Ever_Married.Yes + Profession.Artist + Profession.Engineer + 
##     Profession.Entertainment + Profession.Executive + Profession.Healthcare + 
##     Profession.Homemaker + Profession.Lawyer + Profession.Marketing + 
##     Spending_Score.Average + Spending_Score.High + Spending_Score.Low, 
##     data = customers_df2)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  5260

1. Prediction with the Decision Tree using OHE

pred.dt_OHE <- predict(model.dt_OHE, test_df2)
cm.dt_OHE <- table(test_df2$Segmentation, pred.dt_OHE,
                   dnn = c("Actual", "Predicted"))
cm.dt_OHE
##       Predicted
## Actual   A   B   C   D
##      A 215 177 124 183
##      B 137 129  87 102
##      C 108 133  97  48
##      D 190 113  94 241

2. Prediction with the Random Forest using OHE

pred.forest_OHE <- predict(model.forest_OHE, test_df2)
cm.forest_OHE <- table(test_df2$Segmentation, pred.forest_OHE,
                       dnn = c("Actual", "Predicted"))
cm.forest_OHE
##       Predicted
## Actual   A   B   C   D
##      A 258 103 151 187
##      B 160  70 117 108
##      C 132  78 124  52
##      D 200  67 119 252

3. Prediction with the SVM using OHE

pred.svm_OHE <- predict(model.svm_OHE, test_df2)
cm.svm_OHE <- table(test_df2$Segmentation, pred.svm_OHE,
                    dnn = c("Actual", "Predicted"))
cm.svm_OHE
##       Predicted
## Actual   A   B   C   D
##      A 219 147 148 185
##      B 138  97 112 108
##      C 108 104 122  52
##      D 186  86 116 250
library(caret)
confusionMatrix(pred.dt_OHE, test_df2$Segmentation )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D
##          A 215 137 108 190
##          B 177 129 133 113
##          C 124  87  97  94
##          D 183 102  48 241
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3131          
##                  95% CI : (0.2937, 0.3331)
##     No Information Rate : 0.3209          
##     P-Value [Acc > NIR] : 0.7888          
##                                           
##                   Kappa : 0.0735          
##                                           
##  Mcnemar's Test P-Value : 2.114e-05       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D
## Sensitivity           0.30758  0.28352  0.25130   0.3777
## Specificity           0.70588  0.75450  0.82980   0.7838
## Pos Pred Value        0.33077  0.23370  0.24129   0.4199
## Neg Pred Value        0.68325  0.79951  0.83727   0.7525
## Prevalence            0.32094  0.20891  0.17723   0.2929
## Detection Rate        0.09871  0.05923  0.04454   0.1107
## Detection Prevalence  0.29844  0.25344  0.18457   0.2635
## Balanced Accuracy     0.50673  0.51901  0.54055   0.5808
confusionMatrix(pred.svm_OHE, test_df2$Segmentation )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D
##          A 219 138 108 186
##          B 147  97 104  86
##          C 148 112 122 116
##          D 185 108  52 250
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3159          
##                  95% CI : (0.2964, 0.3359)
##     No Information Rate : 0.3209          
##     P-Value [Acc > NIR] : 0.7005          
##                                           
##                   Kappa : 0.0779          
##                                           
##  Mcnemar's Test P-Value : 7.656e-06       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D
## Sensitivity            0.3133  0.21319  0.31606   0.3918
## Specificity            0.7079  0.80441  0.79018   0.7760
## Pos Pred Value         0.3364  0.22350  0.24498   0.4202
## Neg Pred Value         0.6857  0.79472  0.84286   0.7549
## Prevalence             0.3209  0.20891  0.17723   0.2929
## Detection Rate         0.1006  0.04454  0.05601   0.1148
## Detection Prevalence   0.2989  0.19927  0.22865   0.2732
## Balanced Accuracy      0.5106  0.50880  0.55312   0.5839
confusionMatrix(pred.forest_OHE, test_df2$Segmentation )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D
##          A 258 160 132 200
##          B 103  70  78  67
##          C 151 117 124 119
##          D 187 108  52 252
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3232          
##                  95% CI : (0.3036, 0.3433)
##     No Information Rate : 0.3209          
##     P-Value [Acc > NIR] : 0.4172          
##                                           
##                   Kappa : 0.0815          
##                                           
##  Mcnemar's Test P-Value : 1.304e-10       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D
## Sensitivity            0.3691  0.15385  0.32124   0.3950
## Specificity            0.6673  0.85607  0.78404   0.7747
## Pos Pred Value         0.3440  0.22013  0.24266   0.4207
## Neg Pred Value         0.6912  0.79301  0.84283   0.7555
## Prevalence             0.3209  0.20891  0.17723   0.2929
## Detection Rate         0.1185  0.03214  0.05693   0.1157
## Detection Prevalence   0.3444  0.14601  0.23462   0.2750
## Balanced Accuracy      0.5182  0.50496  0.55264   0.5848

4.5 Models Using OHE and PCA

1. Cleaning data

customers_df3 <- customers_df2

customers_df3$Segmentation <- NULL
Segmentation3 <- customers_df$Segmentation

customers_df4 <- data.frame(customers_df3,Segmentation3)

customers_df4$Segmentation <- NULL

2. Perform PCA using OHE dataset

pr.out2 <- prcomp(customers_df4[1:20])

#### select k principal component (PC) as features
k <- 16
features_df2 <- pr.out2$x[ , 1:k]
features_df2 <- data.frame( features_df2)
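
The report fixes k at 16 without showing how the value was chosen. One common way to pick k is to keep the smallest number of components that explains a target share of the variance. The sketch below illustrates this on R's built-in iris data as a stand-in; the 95% threshold is an assumption for illustration, not the criterion used in this report. Note that prcomp() centers but does not scale the variables unless scale. = TRUE is set.

```r
# PCA on the four numeric columns of the built-in iris data (stand-in data)
pca <- prcomp(iris[, 1:4])

# Share of variance explained by each component, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest k whose cumulative share reaches the (assumed) 95% threshold
k <- which(cum_var >= 0.95)[1]

# Keep the first k principal components as features
features <- data.frame(pca$x[, 1:k, drop = FALSE])

print(round(cum_var, 4))
print(k)
```

On the report's data, the same steps would use pr.out2 in place of pca.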

3. Combine dataset OHE with PCA

#### combine the target with the PCA features
customer_pca_df2 <- cbind(customers_df4$Segmentation3, features_df2)
colnames(customer_pca_df2)[1] <- "Segmentation"

#### split into train (70%) and test (30%) sets
set.seed(2021)
m <- nrow(customer_pca_df2)
m_train_pca2 <- floor(m * 0.7)
train_pca_idx2 <- sample(m, m_train_pca2)

train_pca_df2 <- customer_pca_df2[train_pca_idx2, ]
test_pca_df2 <- customer_pca_df2[-train_pca_idx2, ]

4. Decision Tree with PCA using OHE dataset

model.dt_pca2 <- ctree(formula = Segmentation ~ ., 
                       data = train_pca_df2)

plot(model.dt_pca2)

### predict Decision Tree with pca
pred.dt_pca2 <- predict(model.dt_pca2, test_pca_df2)
cm.dt_pca2 <- table(test_pca_df2$Segmentation, pred.dt_pca2,
                    dnn = c("Actual", "Predicted"))
cm.dt_pca2
##       Predicted
## Actual   A   B   C   D
##      A  74 167  65 194
##      B  45 156 168  80
##      C  16 104 329  77
##      D  34  64  13 430
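
The table above puts correct predictions on its diagonal, so overall accuracy is the diagonal sum divided by the total count, and per-class recall is each diagonal entry divided by its row total. A minimal sketch that re-enters the printed counts by hand (the names cm_dt, accuracy_dt, and recall_dt are introduced here for illustration only):

```r
# Decision tree confusion matrix, typed in from the printed table above.
cm_dt <- matrix(
  c(74, 167,  65, 194,
    45, 156, 168,  80,
    16, 104, 329,  77,
    34,  64,  13, 430),
  nrow = 4, byrow = TRUE,
  dimnames = list(Actual = c("A", "B", "C", "D"),
                  Predicted = c("A", "B", "C", "D"))
)

# Overall accuracy: correctly classified cases over all test cases.
accuracy_dt <- sum(diag(cm_dt)) / sum(cm_dt)

# Per-class recall (sensitivity): correct predictions over actual cases per class.
recall_dt <- diag(cm_dt) / rowSums(cm_dt)

print(round(accuracy_dt, 4))
print(round(recall_dt, 4))
```

The accuracy should agree with the 0.4906 that confusionMatrix() reports for this model in the Result step.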

5. Random Forest with PCA using OHE dataset

set.seed(2021)
model.forest_pca2 <- randomForest(formula = Segmentation ~ ., 
                                  data = train_pca_df2)
model.forest_pca2
## 
## Call:
##  randomForest(formula = Segmentation ~ ., data = train_pca_df2) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 51.64%
## Confusion matrix:
##     A   B   C   D class.error
## A 470 264 163 231   0.5833333
## B 261 403 336 134   0.6446208
## C 159 318 591 141   0.5111663
## D 231 110  80 810   0.3419984
#### predict Random Forest with pca
pred.forest_pca2 <- predict(model.forest_pca2, test_pca_df2)
cm.forest_pca2 <- table(test_pca_df2$Segmentation, pred.forest_pca2,
                        dnn = c("Actual", "Predicted"))
cm.forest_pca2
##       Predicted
## Actual   A   B   C   D
##      A 211 111  56 122
##      B 116 155 117  61
##      C  80 124 269  53
##      D  98  46  39 358
### SVM with PCA using OHE dataset
library(e1071)
model.svm_pca2 <- svm(formula = Segmentation ~ ., 
                      data = train_pca_df2)
model.svm_pca2
## 
## Call:
## svm(formula = Segmentation ~ ., data = train_pca_df2)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  3885
####  predict SVM with PCA
pred.svm_pca2 <- predict(model.svm_pca2, test_pca_df2)
cm.svm_pca2 <- table(test_pca_df2$Segmentation, pred.svm_pca2,
                     dnn = c("Actual", "Predicted"))
cm.svm_pca2
##       Predicted
## Actual   A   B   C   D
##      A 224 123  58  95
##      B  83 172 142  52
##      C  44 104 318  60
##      D 103  34   8 396

5. Result

confusionMatrix(pred.dt_pca2, test_pca_df2$Segmentation )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D
##          A  74  45  16  34
##          B 167 156 104  64
##          C  65 168 329  13
##          D 194  80  77 430
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4906          
##                  95% CI : (0.4685, 0.5126)
##     No Information Rate : 0.2684          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3177          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D
## Sensitivity           0.14800  0.34744   0.6255   0.7948
## Specificity           0.93734  0.78622   0.8349   0.7620
## Pos Pred Value        0.43787  0.31772   0.5722   0.5506
## Neg Pred Value        0.76936  0.80787   0.8633   0.9101
## Prevalence            0.24802  0.22272   0.2609   0.2684
## Detection Rate        0.03671  0.07738   0.1632   0.2133
## Detection Prevalence  0.08383  0.24355   0.2852   0.3874
## Balanced Accuracy     0.54267  0.56683   0.7302   0.7784
confusionMatrix(pred.forest_pca2, test_pca_df2$Segmentation )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D
##          A 211 116  80  98
##          B 111 155 124  46
##          C  56 117 269  39
##          D 122  61  53 358
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4926          
##                  95% CI : (0.4705, 0.5146)
##     No Information Rate : 0.2684          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.322           
##                                           
##  Mcnemar's Test P-Value : 0.07677         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D
## Sensitivity            0.4220  0.34521   0.5114   0.6617
## Specificity            0.8061  0.82068   0.8577   0.8400
## Pos Pred Value         0.4178  0.35550   0.5593   0.6027
## Neg Pred Value         0.8087  0.81392   0.8326   0.8713
## Prevalence             0.2480  0.22272   0.2609   0.2684
## Detection Rate         0.1047  0.07688   0.1334   0.1776
## Detection Prevalence   0.2505  0.21627   0.2386   0.2946
## Balanced Accuracy      0.6140  0.58294   0.6846   0.7509
confusionMatrix(pred.svm_pca2, test_pca_df2$Segmentation )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D
##          A 224  83  44 103
##          B 123 172 104  34
##          C  58 142 318   8
##          D  95  52  60 396
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5506          
##                  95% CI : (0.5286, 0.5725)
##     No Information Rate : 0.2684          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3992          
##                                           
##  Mcnemar's Test P-Value : 5.92e-11        
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D
## Sensitivity            0.4480  0.38307   0.6046   0.7320
## Specificity            0.8483  0.83344   0.8604   0.8597
## Pos Pred Value         0.4934  0.39723   0.6046   0.6567
## Neg Pred Value         0.8233  0.82502   0.8604   0.8974
## Prevalence             0.2480  0.22272   0.2609   0.2684
## Detection Rate         0.1111  0.08532   0.1577   0.1964
## Detection Prevalence   0.2252  0.21478   0.2609   0.2991
## Balanced Accuracy      0.6481  0.60826   0.7325   0.7958
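
Kappa in the outputs above compares the observed accuracy with the agreement expected by chance from the row and column totals. The sketch below recomputes it for the SVM table from the printed counts; the names cm_svm, p_observed, p_expected, and kappa_svm are introduced here for illustration only.

```r
# SVM confusion matrix, typed in from the confusionMatrix() output above
# (rows are predictions, columns are reference classes).
cm_svm <- matrix(
  c(224,  83,  44, 103,
    123, 172, 104,  34,
     58, 142, 318,   8,
     95,  52,  60, 396),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = c("A", "B", "C", "D"),
                  Reference = c("A", "B", "C", "D"))
)

n <- sum(cm_svm)

# Observed agreement: share of cases on the diagonal (the accuracy).
p_observed <- sum(diag(cm_svm)) / n

# Chance agreement: product of matching row and column totals, summed over classes.
p_expected <- sum(rowSums(cm_svm) * colSums(cm_svm)) / n^2

# Cohen's kappa: improvement over chance, scaled by the maximum possible improvement.
kappa_svm <- (p_observed - p_expected) / (1 - p_expected)

print(round(p_observed, 4))
print(round(kappa_svm, 4))
```

The result should be close to the 0.3992 reported above, which shows that the SVM is better than chance but still far from perfect agreement.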

6. Recommendation

  1. Focus on customers with a low spending_score, because a large share of customers fall into the low category. This matters for increasing sales in the new market.
  2. Of the three models tried, the Support Vector Machine using PCA and OHE gives the best accuracy (0.5506, against 0.4906 for the decision tree and 0.4926 for the random forest), but it still needs improvement before its predictions are reliable enough to use.
  3. The company should look into variables that have not been used yet to get better results.
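
One possible way to follow up on the second recommendation is a grid search over the SVM hyperparameters. The sketch below uses tune.svm() from e1071 on the built-in iris data, not the customer dataset. The gamma and cost grids are assumed values chosen only for illustration, and the names tuned and pred_demo are introduced here.

```r
library(e1071)

set.seed(2021)

# Illustration on the built-in iris data, not the customer dataset.
# The gamma and cost grids are assumed values, not taken from the report.
tuned <- tune.svm(Species ~ ., data = iris,
                  gamma = c(0.01, 0.1, 1),
                  cost = c(0.1, 1, 10))

# Best hyperparameter combination found by cross-validation
# (tune.control() defaults to 10-fold cross-validation).
print(tuned$best.parameters)

# Cross-validated classification error of the best combination.
print(tuned$best.performance)

# The refitted best model can be used like any other svm object.
pred_demo <- predict(tuned$best.model, iris)
```

For the report's data, the same call could be tried with Segmentation ~ . and train_pca_df2, keeping test_pca_df2 aside for the final evaluation.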