This report describes how to predict customer segments, classifying customers into four segments using machine learning algorithms. The dataset used in this report is the Customer Segmentation data hosted on Kaggle, which was acquired from the Analytics Vidhya hackathon.
The dataset used for modeling contains customer records (demographics, profession, spending score, and the assigned segment). It can be downloaded here:
https://www.kaggle.com/kaushiksuresh147/customer-segmentation
The report is structured as follows:
1. Data Extraction
2. Exploratory Data Analysis (EDA)
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
Import necessary libraries.
rm(list = ls())
library(ggplot2)
library(gridExtra)
library(corrgram)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.5 v dplyr 1.0.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::combine() masks gridExtra::combine()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(treemapify)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
##
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
##
## boundary
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(e1071)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
##
## panel.fill
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
Read the customer dataset from the .csv files into R data frames. Then inspect the data frame's structure.
## read data
customers_df <- read.csv("data/Train.csv")
test_df <- read.csv("data/Test.csv")
The dataset has 8068 observations and 11 variables. The target variable is Segmentation and the remaining variables are candidate features.
Compute statistical summary of each variable.
## statistical summary
summary(customers_df)
## ID Gender Ever_Married Age
## Min. :458982 Length:8068 Length:8068 Min. :18.00
## 1st Qu.:461241 Class :character Class :character 1st Qu.:30.00
## Median :463473 Mode :character Mode :character Median :40.00
## Mean :463479 Mean :43.47
## 3rd Qu.:465744 3rd Qu.:53.00
## Max. :467974 Max. :89.00
##
## Graduated Profession Work_Experience Spending_Score
## Length:8068 Length:8068 Min. : 0.000 Length:8068
## Class :character Class :character 1st Qu.: 0.000 Class :character
## Mode :character Mode :character Median : 1.000 Mode :character
## Mean : 2.642
## 3rd Qu.: 4.000
## Max. :14.000
## NA's :829
## Family_Size Var_1 Segmentation
## Min. :1.00 Length:8068 Length:8068
## 1st Qu.:2.00 Class :character Class :character
## Median :3.00 Mode :character Mode :character
## Mean :2.85
## 3rd Qu.:4.00
## Max. :9.00
## NA's :335
The summary shows the minimum, median, mean, and maximum of each numeric variable, along with the number of NA values in Work_Experience (829) and Family_Size (335).
To find out the column names and types, we use the str() function.
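The character columns can also hide missing values as empty strings, which summary() does not report. A minimal sketch of counting both kinds of missing value per column, using a made-up toy data frame (`toy` and its values are assumptions, not the report's data):

```r
# Toy data frame standing in for customers_df (values are made up).
toy <- data.frame(
  Ever_Married    = c("Yes", "", "No", "Yes"),
  Work_Experience = c(1, NA, 0, NA),
  Family_Size     = c(4, 3, NA, 2),
  stringsAsFactors = FALSE
)

# Count NA values and, for character columns, empty strings as well.
missing_counts <- vapply(
  toy,
  function(col) sum(is.na(col) | (is.character(col) & col == "")),
  integer(1)
)
missing_counts
```

Applied to the real data frame, the same call would show which columns need the empty-string handling described later.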
str(customers_df)
## 'data.frame': 8068 obs. of 11 variables:
## $ ID : int 462809 462643 466315 461735 462669 461319 460156 464347 465015 465176 ...
## $ Gender : chr "Male" "Female" "Female" "Male" ...
## $ Ever_Married : chr "No" "Yes" "Yes" "Yes" ...
## $ Age : int 22 38 67 67 40 56 32 33 61 55 ...
## $ Graduated : chr "No" "Yes" "Yes" "Yes" ...
## $ Profession : chr "Healthcare" "Engineer" "Engineer" "Lawyer" ...
## $ Work_Experience: num 1 NA 1 0 NA 0 1 1 0 1 ...
## $ Spending_Score : chr "Low" "Average" "Low" "High" ...
## $ Family_Size : num 4 3 1 2 6 2 3 3 3 4 ...
## $ Var_1 : chr "Cat_4" "Cat_4" "Cat_6" "Cat_6" ...
## $ Segmentation : chr "D" "A" "B" "B" ...
From the result above, we know the following:
1. The first column is ID. It is unique per customer and carries no predictive information, so it should be removed.
2. Var_1 is also removed because it is not used in the prediction.
customers_df = customers_df[ , -1]
customers_df = customers_df[ , -9]
test_df = test_df[ , -1]
test_df = test_df[ , -9]
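Dropping columns by position depends on the column order staying the same. As an alternative sketch, the columns can be dropped by name; the toy data frame and the `drop_cols` vector below are illustrative assumptions:

```r
# Toy data frame with the same kind of columns as customers_df (values made up).
toy <- data.frame(
  ID           = 1:3,
  Age          = c(22, 38, 67),
  Var_1        = c("Cat_4", "Cat_4", "Cat_6"),
  Segmentation = c("D", "A", "B"),
  stringsAsFactors = FALSE
)

# Drop columns by name instead of by index.
drop_cols <- c("ID", "Var_1")
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_clean)
```

The same `drop_cols` vector could be applied to both the training and the test data frame, so the two stay aligned even if the file layout changes.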
Rows with missing values must be removed, so empty strings are first converted to NA.
is.na(customers_df$Ever_Married) # turn the empty values into NA
customers_df$Ever_Married[customers_df$Ever_Married == ""]
customers_df <- customers_df %>%
mutate(Ever_Married =
replace(Ever_Married,
Ever_Married == "",
NA))
test_df$Ever_Married[test_df$Ever_Married == ""]
test_df <- test_df %>%
mutate(Ever_Married =
replace(Ever_Married,
Ever_Married == "",
NA))
customers_df <- customers_df %>%
mutate(Graduated =
replace(Graduated,
Graduated == "",
NA))
customers_df <- customers_df[complete.cases(customers_df),]
test_df <- test_df %>%
mutate(Graduated =
replace(Graduated,
Graduated == "",
NA))
customers_df <- customers_df[complete.cases(customers_df),]
customers_df <- customers_df %>%
mutate(Profession =
replace(Profession,
Profession == "",
NA))
test_df <- test_df %>%
mutate(Profession =
replace(Profession,
Profession == "",
NA))
customers_df$Work_Experience[customers_df$Work_Experience == ""]
customers_df <- customers_df %>%
mutate(Work_Experience =
replace(Work_Experience,
Work_Experience == "",
NA))
test_df$Work_Experience[test_df$Work_Experience == ""]
test_df <- test_df %>%
mutate(Work_Experience =
replace(Work_Experience,
Work_Experience == "",
NA))
customers_df$Age[customers_df$Age == ""]
customers_df <- customers_df %>%
mutate(Age =
replace(Age,
Age == "",
NA))
test_df$Age[test_df$Age == ""]
test_df <- test_df %>%
mutate(Age =
replace(Age,
Age == "",
NA))
customers_df$Family_Size[customers_df$Family_Size == ""]
customers_df <- customers_df %>%
mutate(Family_Size =
replace(Family_Size,
Family_Size == "",
NA))
test_df$Family_Size[test_df$Family_Size == ""]
test_df <- test_df %>%
mutate(Family_Size =
replace(Family_Size,
Family_Size == "",
NA))
customers_df <- customers_df[complete.cases(customers_df),]
test_df <- test_df[complete.cases(test_df),]
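The column-by-column replacement above can be written more compactly. The following is a minimal sketch in base R, not the report's own code: `blank_to_na` is a hypothetical helper, and the toy data frame is made up for illustration.

```r
# Hypothetical helper: turn empty strings into NA in every character column.
blank_to_na <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    col[!is.na(col) & col == ""] <- NA
    col
  })
  df
}

# Toy data (values made up).
toy <- data.frame(
  Ever_Married = c("No", "", "Yes"),
  Graduated    = c("Yes", "No", ""),
  Age          = c(22, 38, 67),
  stringsAsFactors = FALSE
)

toy_na       <- blank_to_na(toy)
toy_complete <- toy_na[complete.cases(toy_na), ]
toy_complete
```

Numeric columns are left untouched because read.csv() already stores their missing entries as NA, so comparing them with "" has no effect.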
After removing the missing values:
1. The type of *Gender, Ever_Married, Graduated, Profession, Spending_Score, Segmentation* is chr, so these columns are converted to factor.
2. In the str() output above, Age is int while Family_Size and Work_Experience are already num; all three are passed through as.numeric() so that the numeric features share one type.
test_df$Family_Size <- as.numeric(test_df$Family_Size)
test_df$Work_Experience <- as.numeric(test_df$Work_Experience)
test_df$Gender <- as.factor(test_df$Gender)
test_df$Ever_Married <- as.factor(test_df$Ever_Married)
test_df$Age <- as.numeric(test_df$Age)
test_df$Graduated <- as.factor(test_df$Graduated)
test_df$Spending_Score <- as.factor(test_df$Spending_Score)
test_df$Segmentation <- as.factor(test_df$Segmentation)
test_df$Profession <- as.factor(test_df$Profession)
customers_df$Family_Size <- as.numeric(customers_df$Family_Size)
customers_df$Work_Experience <- as.numeric(customers_df$Work_Experience)
customers_df$Gender <- as.factor(customers_df$Gender)
customers_df$Ever_Married <- as.factor(customers_df$Ever_Married)
customers_df$Age <- as.numeric(customers_df$Age)
customers_df$Graduated <- as.factor(customers_df$Graduated)
customers_df$Spending_Score <- as.factor(customers_df$Spending_Score)
customers_df$Segmentation <- as.factor(customers_df$Segmentation)
customers_df$Profession <- as.factor(customers_df$Profession)
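Because the same conversions are applied to both data frames, they can be wrapped in one function. This is a sketch under the assumption that both data frames contain the listed columns; `convert_types` is a hypothetical helper and the toy rows are made up.

```r
# Hypothetical helper: convert the listed columns to factor / numeric.
convert_types <- function(df, factor_cols, numeric_cols) {
  df[factor_cols]  <- lapply(df[factor_cols], as.factor)
  df[numeric_cols] <- lapply(df[numeric_cols], as.numeric)
  df
}

factor_cols  <- c("Gender", "Ever_Married", "Graduated",
                  "Profession", "Spending_Score", "Segmentation")
numeric_cols <- c("Age", "Work_Experience", "Family_Size")

# Toy rows with the same column names as the report's data (values made up).
toy <- data.frame(
  Gender = c("Male", "Female"), Ever_Married = c("No", "Yes"),
  Age = c(22L, 38L), Graduated = c("No", "Yes"),
  Profession = c("Healthcare", "Engineer"), Work_Experience = c(1, 0),
  Spending_Score = c("Low", "Average"), Family_Size = c(4, 3),
  Segmentation = c("D", "A"),
  stringsAsFactors = FALSE
)

toy_typed <- convert_types(toy, factor_cols, numeric_cols)
str(toy_typed)
```

Calling the helper once for the training data and once for the test data keeps the two conversions from drifting apart.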
Univariate analysis. There are four segments: A, B, C, and D.
ggplot(customers_df, aes(x = Segmentation, fill = Segmentation)) +
geom_bar() +
stat_count(geom = "text", color = "white", size = 3,
aes(label = ..count..), position=position_stack(vjust = 0.5)) +
labs(title = "Customers Based on Segmentation",
x = "Segmentation", y = "Customers") +
theme(plot.title = element_text(hjust = 0.5))
The bar chart above shows how the customers are distributed across the four segments.
p1 <- ggplot(customers_df, aes(x=Segmentation, fill = Ever_Married)) +
geom_bar(position = "stack") +
stat_count(geom = "text", color = "white", size = 3.5,
aes(label = ..count..), position=position_stack(vjust = 0.5)) +
labs(title = "Customer Segmentation by Ever Married") +
theme(plot.title = element_text(hjust = 0.5))
p2 <- ggplot(customers_df, aes(x=Segmentation, fill = Gender)) +
geom_bar(position = "stack") +
stat_count(geom = "text", color = "white", size = 3.5,
aes(label = ..count..), position=position_stack(vjust = 0.5)) +
labs(title = "Customer Segmentation by Gender") +
theme(plot.title = element_text(hjust = 0.5))
p3 <- ggplot(customers_df, aes(x=Spending_Score, fill = Segmentation)) +
geom_bar(position = "stack") +
stat_count(geom = "text", color = "white", size = 3.5,
aes(label = ..count..), position=position_stack(vjust = 0.5)) +
labs(title = "Customer Segmentation by Spending Score") +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p1,p2,p3)
The three plots above show the customer segmentation broken down by Ever_Married, Gender, and Spending_Score.
ggplot(customers_df, aes(x=Work_Experience, y=Age, color=Segmentation,
shape=Graduated)) +
geom_jitter() + # jittered points only; adding geom_point() as well would draw every point twice
facet_wrap(~ Segmentation) + # one panel per segment; a preceding facet_grid() would be overridden by this call
labs(title = "Customer Segmentation by Age, Graduated and Work Experience") +
theme(plot.title = element_text(hjust = 0.5))
This plot shows how each segment is distributed over Age, Graduated, and Work_Experience.
Data cleaning was already done before the exploratory data analysis, so it does not need to be repeated.
dim(customers_df)
## [1] 6718 9
The number of observations is now 6718, which means the data cleaning process removed 1350 of the original 8068 rows.
A random 70:30 split into training and testing data is not needed here, because the provider already supplies separate training and testing files.
We build classification models using Decision Tree, Random Forest, and Support Vector Machine (SVM) in four variants: without PCA and One Hot Encoding (OHE), with PCA, with OHE, and with both PCA and OHE.
### Decision Tree Model
library(party)
model.dt <- ctree(formula = Segmentation ~ .,
data = customers_df)
model.dt
### Predict Decision Tree
pred.dt <- predict(model.dt, test_df)
pred.dt
cm.dt <- table(test_df$Segmentation, pred.dt,
dnn = c("Actual", "Predicted"))
cm.dt
## Predicted
## Actual A B C D
## A 165 157 151 226
## B 115 112 105 123
## C 86 109 124 67
## D 135 94 110 299
model.forest <- randomForest(formula = Segmentation ~ .,
data = customers_df)
pred.forest <- predict(model.forest, test_df)
cm.forest <- table(test_df$Segmentation, pred.forest,
dnn = c("Actual", "Predicted"))
cm.forest
## Predicted
## Actual A B C D
## A 188 133 164 214
## B 124 99 128 104
## C 104 87 140 55
## D 145 93 116 284
model.svm <- svm(formula = Segmentation ~ .,
data = customers_df)
pred.svm <- predict(model.svm, test_df)
cm.svm <- table(test_df$Segmentation, pred.svm,
dnn = c("Actual", "Predicted"))
cm.svm
## Predicted
## Actual A B C D
## A 233 121 156 189
## B 142 106 112 95
## C 116 88 135 47
## D 181 80 118 259
confusionMatrix(pred.dt, test_df$Segmentation )
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D
## A 165 115 86 135
## B 157 112 109 94
## C 151 105 124 110
## D 226 123 67 299
##
## Overall Statistics
##
## Accuracy : 0.3214
## 95% CI : (0.3018, 0.3415)
## No Information Rate : 0.3209
## P-Value [Acc > NIR] : 0.4898
##
## Kappa : 0.089
##
## Mcnemar's Test P-Value : 2.081e-11
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D
## Sensitivity 0.23605 0.24615 0.32124 0.4687
## Specificity 0.77282 0.79106 0.79576 0.7299
## Pos Pred Value 0.32934 0.23729 0.25306 0.4182
## Neg Pred Value 0.68157 0.79894 0.84479 0.7683
## Prevalence 0.32094 0.20891 0.17723 0.2929
## Detection Rate 0.07576 0.05142 0.05693 0.1373
## Detection Prevalence 0.23003 0.21671 0.22498 0.3283
## Balanced Accuracy 0.50444 0.51861 0.55850 0.5993
confusionMatrix(pred.svm, test_df$Segmentation )
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D
## A 233 142 116 181
## B 121 106 88 80
## C 156 112 135 118
## D 189 95 47 259
##
## Overall Statistics
##
## Accuracy : 0.3365
## 95% CI : (0.3167, 0.3568)
## No Information Rate : 0.3209
## P-Value [Acc > NIR] : 0.06251
##
## Kappa : 0.1051
##
## Mcnemar's Test P-Value : 1.499e-07
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D
## Sensitivity 0.3333 0.23297 0.34974 0.4060
## Specificity 0.7032 0.83227 0.78460 0.7851
## Pos Pred Value 0.3467 0.26835 0.25912 0.4390
## Neg Pred Value 0.6906 0.80426 0.84852 0.7613
## Prevalence 0.3209 0.20891 0.17723 0.2929
## Detection Rate 0.1070 0.04867 0.06198 0.1189
## Detection Prevalence 0.3085 0.18136 0.23921 0.2709
## Balanced Accuracy 0.5183 0.53262 0.56717 0.5955
confusionMatrix(pred.forest, test_df$Segmentation )
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D
## A 188 124 104 145
## B 133 99 87 93
## C 164 128 140 116
## D 214 104 55 284
##
## Overall Statistics
##
## Accuracy : 0.3264
## 95% CI : (0.3068, 0.3466)
## No Information Rate : 0.3209
## P-Value [Acc > NIR] : 0.2981
##
## Kappa : 0.0957
##
## Mcnemar's Test P-Value : 1.662e-10
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D
## Sensitivity 0.26896 0.21758 0.36269 0.4451
## Specificity 0.74780 0.81834 0.77232 0.7578
## Pos Pred Value 0.33512 0.24029 0.25547 0.4323
## Neg Pred Value 0.68398 0.79841 0.84908 0.7673
## Prevalence 0.32094 0.20891 0.17723 0.2929
## Detection Rate 0.08632 0.04545 0.06428 0.1304
## Detection Prevalence 0.25758 0.18916 0.25161 0.3017
## Balanced Accuracy 0.50838 0.51796 0.56751 0.6015
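The headline numbers in the confusionMatrix() output can be recomputed by hand from the confusion table. The sketch below uses the decision tree table (cm.dt) printed earlier and should reproduce the accuracy, no-information rate, and kappa reported for that model; the variable names are illustrative.

```r
# Decision tree confusion table from the report (rows = actual, columns = predicted).
cm <- matrix(
  c(165, 157, 151, 226,
    115, 112, 105, 123,
     86, 109, 124,  67,
    135,  94, 110, 299),
  nrow = 4, byrow = TRUE,
  dimnames = list(Actual = c("A", "B", "C", "D"),
                  Predicted = c("A", "B", "C", "D"))
)

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n           # share of correct predictions
nir      <- max(rowSums(cm)) / n        # always predicting the largest actual class

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, nir = nir, kappa = cohen_kappa), 4)
```

An accuracy that sits almost exactly on the no-information rate, together with a kappa below 0.1, indicates that the model does little better than always predicting the most frequent segment.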
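All three models reach an accuracy of only about 0.32 to 0.34 (Decision Tree 0.3214, Random Forest 0.3264, SVM 0.3365), barely above the no-information rate of 0.3209, so we next try One Hot Encoding.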
customers_df2 <- customers_df
test_df2 <- test_df
#### 1. create dataframe to be encoded (gender, train_df)
Gender_df <- data.frame(customers_df2$Gender)
colnames(Gender_df) <- "Gender"
### create OHE dataframe (gender,customer_df)
df1 <- dummyVars("~.", data = Gender_df)
df2 <- data.frame(predict(df1, newdata = Gender_df))
df2
## combine to original dataframe (gender,customer_df)
customers_df2 <- cbind(customers_df2, df2)
customers_df2$Gender<- NULL
#### 2. create dataframe to be encoded (ever_maried,customer_df)
Ever_Married_df <- data.frame(customers_df2$Ever_Married)
colnames(Ever_Married_df) <- "Ever_Married"
### create OHE dataframe (ever_married,customer_df)
df3 <- dummyVars("~.", data = Ever_Married_df)
df4 <- data.frame(predict(df3, newdata = Ever_Married_df))
df4
## combine to original dataframe (ever_married,customer_df)
customers_df2 <- cbind(customers_df2, df4)
customers_df2$Ever_Married<- NULL
View(customers_df2)
#### 3. create dataframe to be encoded (graduated,customer_df)
Graduated_df <- data.frame(customers_df2$Graduated)
colnames(Graduated_df) <- "Graduated"
### create OHE dataframe (Graduated,customer_df)
df5 <- dummyVars("~.", data = Graduated_df)
df6 <- data.frame(predict(df5, newdata = Graduated_df))
df6
## combine to original dataframe (Graduated,customer_df)
customers_df2 <- cbind(customers_df2, df6)
customers_df2$Graduated<- NULL
#### 4. create dataframe to be encoded (Profession,customer_df)
Profession_df <- data.frame(customers_df2$Profession)
colnames(Profession_df) <- "Profession"
### create OHE dataframe (Profession,customer_df)
df7 <- dummyVars("~.", data = Profession_df)
df8 <- data.frame(predict(df7, newdata = Profession_df))
df8
## combine to original dataframe (Profession,customer_df)
customers_df2 <- cbind(customers_df2, df8)
customers_df2$Profession<- NULL
View(customers_df2)
#### 5. create dataframe to be encoded (SpendingScore,customer_df)
Spending_Score_df <- data.frame(customers_df2$Spending_Score)
colnames(Spending_Score_df) <- "Spending_Score"
### create OHE dataframe (SpendingScore,customer_df)
df9 <- dummyVars("~.", data = Spending_Score_df)
df10 <- data.frame(predict(df9, newdata = Spending_Score_df))
df10
## combine to original dataframe (SpendingScore,customer_df)
customers_df2 <- cbind(customers_df2, df10)
customers_df2$Spending_Score<- NULL
View(customers_df2)
#### 1. create dataframe to be encoded (gender, test_df)
Gender_df2 <- data.frame(test_df2$Gender)
colnames(Gender_df2) <- "Gender"
### create OHE dataframe (gender,test_df)
df11 <- dummyVars("~.", data = Gender_df2)
df12 <- data.frame(predict(df1, newdata = Gender_df2))
df12
## combine to original dataframe (gender,test_df)
test_df2 <- cbind(test_df2, df12)
test_df2$Gender<- NULL
#### 2. create dataframe to be encoded (ever_married,test_df)
Ever_Married_df2 <- data.frame(test_df2$Ever_Married)
colnames(Ever_Married_df2) <- "Ever_Married"
### create OHE dataframe (ever_married,test_df)
df13 <- dummyVars("~.", data = Ever_Married_df2)
df14 <- data.frame(predict(df13, newdata = Ever_Married_df2))
df14
## combine to original dataframe (ever_married,test_df)
test_df2 <- cbind(test_df2, df14)
test_df2$Ever_Married<- NULL
#### 3. create dataframe to be encoded (graduated,test_df)
Graduated_df2 <- data.frame(test_df2$Graduated)
colnames(Graduated_df2) <- "Graduated"
### create OHE dataframe (Graduated,test_df)
df15 <- dummyVars("~.", data = Graduated_df2)
df16 <- data.frame(predict(df15, newdata = Graduated_df2))
df16
## combine to original dataframe (Graduated,test_df)
test_df2 <- cbind(test_df2, df16)
test_df2$Graduated<- NULL
#### 4. create dataframe to be encoded (Profession,test_df)
Profession_df2 <- data.frame(test_df2$Profession)
colnames(Profession_df2) <- "Profession"
### create OHE dataframe (Profession,test_df)
df17 <- dummyVars("~.", data = Profession_df2)
df18 <- data.frame(predict(df17, newdata = Profession_df2))
df18
## combine to original dataframe (Profession,test_df)
test_df2 <- cbind(test_df2, df18)
test_df2$Profession<- NULL
View(test_df2)
#### 5. create dataframe to be encoded (SpendingScore,test_df)
Spending_Score_df2 <- data.frame(test_df2$Spending_Score)
colnames(Spending_Score_df2) <- "Spending_Score"
### create OHE dataframe (SpendingScore,test_df)
df19 <- dummyVars("~.", data = Spending_Score_df2)
df20 <- data.frame(predict(df19, newdata = Spending_Score_df2))
df20
## combine to original dataframe (SpendingScore,test_df)
test_df2 <- cbind(test_df2, df20)
test_df2$Spending_Score<- NULL
View(test_df2)
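The encoding steps above repeat the same pattern for each column and each data frame. As a sketch of the same idea in base R (not the caret-based code used in this report), the hypothetical helper `one_hot` below creates one indicator column per level, named `<column>.<level>`, and takes the level list from the training data so that both data frames end up with the same columns. The toy data is made up.

```r
# Hypothetical helper: replace each listed column with 0/1 indicator columns.
one_hot <- function(df, cols, level_map) {
  for (col in cols) {
    values <- as.character(df[[col]])
    for (lev in level_map[[col]]) {
      df[[paste(col, lev, sep = ".")]] <- as.numeric(values == lev)
    }
    df[[col]] <- NULL  # drop the original categorical column
  }
  df
}

# Toy training and test data (values made up).
train_toy <- data.frame(
  Gender = c("Male", "Female", "Male"),
  Spending_Score = c("Low", "Average", "High"),
  Age = c(22, 38, 67),
  stringsAsFactors = TRUE
)
test_toy <- data.frame(
  Gender = c("Female", "Female"),
  Spending_Score = c("Low", "Low"),
  Age = c(40, 56),
  stringsAsFactors = TRUE
)

cols      <- c("Gender", "Spending_Score")
level_map <- lapply(train_toy[cols], levels)  # levels come from the training data

train_enc <- one_hot(train_toy, cols, level_map)
test_enc  <- one_hot(test_toy, cols, level_map)
names(train_enc)
```

Taking the levels from the training data means a level that is absent from the test data still gets an all-zero column, so the model formula can refer to it safely.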
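After building the encoded data frames, we fit the models on the encoded data. Note that the formulas below use only the Gender, Ever_Married, Profession, and Spending_Score indicator columns that are listed; Age, Work_Experience, Family_Size, and the Graduated indicators do not appear in them.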
model.dt_OHE <- ctree(formula = Segmentation ~ Gender.Male+Gender.Female+Ever_Married.No+Ever_Married.Yes+
Profession.Artist+Profession.Engineer+Profession.Entertainment+Profession.Executive+
Profession.Healthcare+Profession.Homemaker+Profession.Lawyer+Profession.Marketing+
Spending_Score.Average+Spending_Score.High+Spending_Score.Low ,
data = customers_df2)
set.seed(2021)
model.forest_OHE <- randomForest(formula = Segmentation ~ Gender.Male+Gender.Female+Ever_Married.No+Ever_Married.Yes+
Profession.Artist+Profession.Engineer+Profession.Entertainment+Profession.Executive+
Profession.Healthcare+Profession.Homemaker+Profession.Lawyer+Profession.Marketing+
Spending_Score.Average+Spending_Score.High+Spending_Score.Low,
data = customers_df2)
model.forest_OHE
##
## Call:
## randomForest(formula = Segmentation ~ Gender.Male + Gender.Female + Ever_Married.No + Ever_Married.Yes + Profession.Artist + Profession.Engineer + Profession.Entertainment + Profession.Executive + Profession.Healthcare + Profession.Homemaker + Profession.Lawyer + Profession.Marketing + Spending_Score.Average + Spending_Score.High + Spending_Score.Low, data = customers_df2)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 50.95%
## Confusion matrix:
## A B C D class.error
## A 845 275 234 274 0.4809582
## B 550 366 473 194 0.7687934
## C 313 294 913 215 0.4737752
## D 428 93 80 1171 0.3391648
model.svm_OHE <- svm(formula = Segmentation ~ Gender.Male+Gender.Female+Ever_Married.No+Ever_Married.Yes+
Profession.Artist+Profession.Engineer+Profession.Entertainment+Profession.Executive+
Profession.Healthcare+Profession.Homemaker+Profession.Lawyer+Profession.Marketing+
Spending_Score.Average+Spending_Score.High+Spending_Score.Low,
data = customers_df2)
model.svm_OHE
##
## Call:
## svm(formula = Segmentation ~ Gender.Male + Gender.Female + Ever_Married.No +
## Ever_Married.Yes + Profession.Artist + Profession.Engineer +
## Profession.Entertainment + Profession.Executive + Profession.Healthcare +
## Profession.Homemaker + Profession.Lawyer + Profession.Marketing +
## Spending_Score.Average + Spending_Score.High + Spending_Score.Low,
## data = customers_df2)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 5260
pred.dt_OHE <- predict(model.dt_OHE, test_df2)
cm.dt_OHE <- table(test_df2$Segmentation, pred.dt_OHE,
dnn = c("Actual", "Predicted"))
cm.dt_OHE
## Predicted
## Actual A B C D
## A 215 177 124 183
## B 137 129 87 102
## C 108 133 97 48
## D 190 113 94 241
pred.forest_OHE <- predict(model.forest_OHE, test_df2)
cm.forest_OHE <- table(test_df2$Segmentation, pred.forest_OHE,
dnn = c("Actual", "Predicted"))
cm.forest_OHE
## Predicted
## Actual A B C D
## A 258 103 151 187
## B 160 70 117 108
## C 132 78 124 52
## D 200 67 119 252
pred.svm_OHE <- predict(model.svm_OHE, test_df2)
cm.svm_OHE <- table(test_df2$Segmentation, pred.svm_OHE,
dnn = c("Actual", "Predicted"))
cm.svm_OHE
## Predicted
## Actual A B C D
## A 219 147 148 185
## B 138 97 112 108
## C 108 104 122 52
## D 186 86 116 250
library(caret)
confusionMatrix(pred.dt_OHE, test_df2$Segmentation )
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D
## A 215 137 108 190
## B 177 129 133 113
## C 124 87 97 94
## D 183 102 48 241
##
## Overall Statistics
##
## Accuracy : 0.3131
## 95% CI : (0.2937, 0.3331)
## No Information Rate : 0.3209
## P-Value [Acc > NIR] : 0.7888
##
## Kappa : 0.0735
##
## Mcnemar's Test P-Value : 2.114e-05
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D
## Sensitivity 0.30758 0.28352 0.25130 0.3777
## Specificity 0.70588 0.75450 0.82980 0.7838
## Pos Pred Value 0.33077 0.23370 0.24129 0.4199
## Neg Pred Value 0.68325 0.79951 0.83727 0.7525
## Prevalence 0.32094 0.20891 0.17723 0.2929
## Detection Rate 0.09871 0.05923 0.04454 0.1107
## Detection Prevalence 0.29844 0.25344 0.18457 0.2635
## Balanced Accuracy 0.50673 0.51901 0.54055 0.5808
confusionMatrix(pred.svm_OHE, test_df2$Segmentation )
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D
## A 219 138 108 186
## B 147 97 104 86
## C 148 112 122 116
## D 185 108 52 250
##
## Overall Statistics
##
## Accuracy : 0.3159
## 95% CI : (0.2964, 0.3359)
## No Information Rate : 0.3209
## P-Value [Acc > NIR] : 0.7005
##
## Kappa : 0.0779
##
## Mcnemar's Test P-Value : 7.656e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D
## Sensitivity 0.3133 0.21319 0.31606 0.3918
## Specificity 0.7079 0.80441 0.79018 0.7760
## Pos Pred Value 0.3364 0.22350 0.24498 0.4202
## Neg Pred Value 0.6857 0.79472 0.84286 0.7549
## Prevalence 0.3209 0.20891 0.17723 0.2929
## Detection Rate 0.1006 0.04454 0.05601 0.1148
## Detection Prevalence 0.2989 0.19927 0.22865 0.2732
## Balanced Accuracy 0.5106 0.50880 0.55312 0.5839
confusionMatrix(pred.forest_OHE, test_df2$Segmentation )
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D
## A 258 160 132 200
## B 103 70 78 67
## C 151 117 124 119
## D 187 108 52 252
##
## Overall Statistics
##
## Accuracy : 0.3232
## 95% CI : (0.3036, 0.3433)
## No Information Rate : 0.3209
## P-Value [Acc > NIR] : 0.4172
##
## Kappa : 0.0815
##
## Mcnemar's Test P-Value : 1.304e-10
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D
## Sensitivity 0.3691 0.15385 0.32124 0.3950
## Specificity 0.6673 0.85607 0.78404 0.7747
## Pos Pred Value 0.3440 0.22013 0.24266 0.4207
## Neg Pred Value 0.6912 0.79301 0.84283 0.7555
## Prevalence 0.3209 0.20891 0.17723 0.2929
## Detection Rate 0.1185 0.03214 0.05693 0.1157
## Detection Prevalence 0.3444 0.14601 0.23462 0.2750
## Balanced Accuracy 0.5182 0.50496 0.55264 0.5848
customers_df3 <- customers_df2
customers_df3$Segmentation <- NULL
Segmentation3 <- customers_df$Segmentation
customers_df4 <- data.frame(customers_df3,Segmentation3)
customers_df4$Segmentation <- NULL
pr.out2 <- prcomp(customers_df4[1:20])
#### select k principal component (PC) as features
k <- 16
features_df2 <- pr.out2$x[ , 1:k]
features_df2 <- data.frame( features_df2)
#### combine dataset
customer_pca_df2 <- cbind(customers_df4$Segmentation3, features_df2)
colnames(customer_pca_df2) [1] <- "Segmentation"
#### split the PCA data frame 70:30 into train and test (this holdout comes from Train.csv, not from test_df)
set.seed(2021)
m <- nrow(customer_pca_df2)
m_train_pca2 <- m * 0.7
train_pca_idx2 <- sample(m, m_train_pca2)
train_pca_df2 <- customer_pca_df2[ train_pca_idx2, ]
test_pca_df2 <- customer_pca_df2[ -train_pca_idx2, ]
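A possible variation on the PCA step, shown here only as a sketch on synthetic data: fit prcomp() on the training rows, standardize the columns, pick k from the cumulative explained variance, and project the held-out rows with predict(). The 90% threshold, the scaling, and the toy columns `x1` to `x4` are assumptions for illustration; the report itself uses unscaled prcomp() on all rows with k fixed at 16.

```r
set.seed(2021)

# Synthetic numeric features on very different scales (values made up).
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100, sd = 10), x3 = runif(100))
toy$x4 <- toy$x1 + rnorm(100, sd = 0.1)

# 70:30 split before fitting PCA, so the held-out rows do not influence the rotation.
train_idx <- sample(nrow(toy), floor(0.7 * nrow(toy)))
train_x   <- toy[train_idx, ]
test_x    <- toy[-train_idx, ]

# Fit PCA on the training rows only, with centering and scaling.
pca <- prcomp(train_x, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained; choose the smallest k reaching 90%.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.90)[1]

# Scores for the training rows and projected scores for the held-out rows.
train_pcs <- as.data.frame(pca$x[, 1:k, drop = FALSE])
test_pcs  <- as.data.frame(predict(pca, newdata = test_x)[, 1:k, drop = FALSE])
```

Scaling matters when one feature, such as Age, has a much larger range than the 0/1 indicator columns, because unscaled PCA would otherwise be dominated by that feature.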
model.dt_pca2 <- ctree(formula = Segmentation ~ .,
data = train_pca_df2)
plot(model.dt_pca2)
### predict Decision Tree with pca
pred.dt_pca2 <- predict(model.dt_pca2, test_pca_df2)
cm.dt_pca2 <- table(test_pca_df2$Segmentation, pred.dt_pca2,
dnn = c("Actual", "Predicted"))
cm.dt_pca2
## Predicted
## Actual A B C D
## A 74 167 65 194
## B 45 156 168 80
## C 16 104 329 77
## D 34 64 13 430
set.seed(2021)
model.forest_pca2 <- randomForest(formula = Segmentation ~ .,
data = train_pca_df2)
model.forest_pca2
##
## Call:
## randomForest(formula = Segmentation ~ ., data = train_pca_df2)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 51.64%
## Confusion matrix:
## A B C D class.error
## A 470 264 163 231 0.5833333
## B 261 403 336 134 0.6446208
## C 159 318 591 141 0.5111663
## D 231 110 80 810 0.3419984
#### predict Random Forest with pca
pred.forest_pca2 <- predict(model.forest_pca2, test_pca_df2)
cm.forest_pca2 <- table(test_pca_df2$Segmentation, pred.forest_pca2,
dnn = c("Actual", "Predicted"))
cm.forest_pca2
## Predicted
## Actual A B C D
## A 211 111 56 122
## B 116 155 117 61
## C 80 124 269 53
## D 98 46 39 358
### SVM PCA using OHE dataset
library(e1071)
model.svm_pca2 <- svm(formula = Segmentation ~ .,
data = train_pca_df2)
model.svm_pca2
##
## Call:
## svm(formula = Segmentation ~ ., data = train_pca_df2)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 3885
#### predict SVM with PCA
pred.svm_pca2 <- predict(model.svm_pca2, test_pca_df2)
cm.svm_pca2 <- table(test_pca_df2$Segmentation, pred.svm_pca2,
dnn = c("Actual", "Predicted"))
cm.svm_pca2
## Predicted
## Actual A B C D
## A 224 123 58 95
## B 83 172 142 52
## C 44 104 318 60
## D 103 34 8 396
confusionMatrix(pred.dt_pca2, test_pca_df2$Segmentation )
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D
## A 74 45 16 34
## B 167 156 104 64
## C 65 168 329 13
## D 194 80 77 430
##
## Overall Statistics
##
## Accuracy : 0.4906
## 95% CI : (0.4685, 0.5126)
## No Information Rate : 0.2684
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3177
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D
## Sensitivity 0.14800 0.34744 0.6255 0.7948
## Specificity 0.93734 0.78622 0.8349 0.7620
## Pos Pred Value 0.43787 0.31772 0.5722 0.5506
## Neg Pred Value 0.76936 0.80787 0.8633 0.9101
## Prevalence 0.24802 0.22272 0.2609 0.2684
## Detection Rate 0.03671 0.07738 0.1632 0.2133
## Detection Prevalence 0.08383 0.24355 0.2852 0.3874
## Balanced Accuracy 0.54267 0.56683 0.7302 0.7784
confusionMatrix(pred.forest_pca2, test_pca_df2$Segmentation )
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D
## A 211 116 80 98
## B 111 155 124 46
## C 56 117 269 39
## D 122 61 53 358
##
## Overall Statistics
##
## Accuracy : 0.4926
## 95% CI : (0.4705, 0.5146)
## No Information Rate : 0.2684
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.322
##
## Mcnemar's Test P-Value : 0.07677
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D
## Sensitivity 0.4220 0.34521 0.5114 0.6617
## Specificity 0.8061 0.82068 0.8577 0.8400
## Pos Pred Value 0.4178 0.35550 0.5593 0.6027
## Neg Pred Value 0.8087 0.81392 0.8326 0.8713
## Prevalence 0.2480 0.22272 0.2609 0.2684
## Detection Rate 0.1047 0.07688 0.1334 0.1776
## Detection Prevalence 0.2505 0.21627 0.2386 0.2946
## Balanced Accuracy 0.6140 0.58294 0.6846 0.7509
confusionMatrix(pred.svm_pca2, test_pca_df2$Segmentation )
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D
## A 224 83 44 103
## B 123 172 104 34
## C 58 142 318 8
## D 95 52 60 396
##
## Overall Statistics
##
## Accuracy : 0.5506
## 95% CI : (0.5286, 0.5725)
## No Information Rate : 0.2684
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3992
##
## Mcnemar's Test P-Value : 5.92e-11
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D
## Sensitivity 0.4480 0.38307 0.6046 0.7320
## Specificity 0.8483 0.83344 0.8604 0.8597
## Pos Pred Value 0.4934 0.39723 0.6046 0.6567
## Neg Pred Value 0.8233 0.82502 0.8604 0.8974
## Prevalence 0.2480 0.22272 0.2609 0.2684
## Detection Rate 0.1111 0.08532 0.1577 0.1964
## Detection Prevalence 0.2252 0.21478 0.2609 0.2991
## Balanced Accuracy 0.6481 0.60826 0.7325 0.7958
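To compare the runs side by side, the accuracies reported in the confusionMatrix() outputs above can be collected into one table. The sketch below only copies those reported values; the data frame layout and column names are illustrative assumptions.

```r
# Accuracies copied from the confusionMatrix() outputs in this report.
results <- data.frame(
  features     = rep(c("Original", "OHE", "OHE + PCA"), each = 3),
  model        = rep(c("Decision Tree", "Random Forest", "SVM"), times = 3),
  accuracy     = c(0.3214, 0.3264, 0.3365,
                   0.3131, 0.3232, 0.3159,
                   0.4906, 0.4926, 0.5506),
  evaluated_on = rep(c("Test.csv", "Test.csv", "30% holdout of Train.csv"),
                     each = 3),
  stringsAsFactors = FALSE
)

# Sort from best to worst and pick the top row.
results[order(-results$accuracy), ]
best <- results[which.max(results$accuracy), ]
best
```

The PCA models were scored on a 30% holdout drawn from Train.csv, while the other two groups were scored on Test.csv, so the PCA accuracies are not directly comparable with the rest.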