To conduct the personality and segmentation analysis and build a machine learning model to classify the data, this research follows several steps:
After conducting the analysis, two customer categories emerge: shopaholic and general shopper. The general shopper category is further divided into Mid to Low Class and High Class. Shopaholics are characterized as discount hunters who love shopping, with mid income and a graduation-level education, among other traits. If the company wants to acquire more customers, it can also target the General Shopper - High Class category. The decision tree model classifies the positive and negative classes with an accuracy of 0.6897321, a sensitivity of 0.7272727, and an AUC of 73.1215%. A more complex machine learning model could be used to classify both classes with better accuracy.
Customer Personality Analysis is an analysis to understand the characteristics of a customer's personality. It helps the company understand its customers more deeply, making it easier to tailor products and recommendations to the specific characters, profiles, behaviors, needs, and concerns of different types of customers.
The objectives are:
People
Products
Promotion
Place
Several packages are used in this research, as follows:
# data cleansing
library(readr)
library(dplyr)
# dates
library(lubridate)
# data analysis
library(GGally)
# Silhouette, Elbow
library(factoextra)
# K-Medoids
library(cluster)
# PCA
library(FactoMineR)
# K-Modes
library(MASS)
library(klaR)
# data visualization
library(ggplot2)
library(scales)
library(echarts4r)
# Sampling adequacy (MSA)
library(EFAtools)
# Decision Tree
library(partykit)
# Confusion Matrix
library(caret)
# ROC
library(ROCR)
# Decision Tree visualization
library(rpart)
library(rattle)
# ROC
library(pROC)
The csv file is loaded into the persona variable using the read.csv function. The data contains 27 columns and 2240 rows.
# Extracting Data
persona <- read.csv("data_input/marketing_campaign.csv",sep = "\t")
persona <- persona %>% dplyr::select(-c(Z_CostContact,Z_Revenue)) # Remove uninformative columns
persona
This step consists of three parts: inspecting the top 6 observations and the bottom 6 observations of the data set using the head and tail functions, then inspecting the number of unique values of each variable so that the data set's background can be understood at a glance.
1. Top 6 observations of the data set.
# Top 6 data
head(persona)
2. Bottom 6 observations of the data set.
# Bottom 6 data
tail(persona)
3. Unique Inspection.
lengths(lapply(persona, unique))
#> ID Year_Birth Education Marital_Status
#> 2240 59 5 8
#> Income Kidhome Teenhome Dt_Customer
#> 1975 3 3 663
#> Recency MntWines MntFruits MntMeatProducts
#> 100 776 158 558
#> MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
#> 182 177 213 15
#> NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
#> 15 14 14 16
#> AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
#> 2 2 2 2
#> AcceptedCmp2 Complain Response
#> 2 2 2
Perform a data type inspection to ensure that the data type of each column is appropriate, using the glimpse() function from the dplyr package. Several variables have their data types changed, as follows:
persona <- persona %>%
mutate_at(vars(Education,Marital_Status,Complain,AcceptedCmp1,AcceptedCmp2,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,Response,Year_Birth),as.factor) %>%
mutate(Dt_Customer=dmy(Dt_Customer))
glimpse(persona)
#> Rows: 2,240
#> Columns: 27
#> $ ID <int> 5524, 2174, 4141, 6182, 5324, 7446, 965, 6177, 485…
#> $ Year_Birth <fct> 1957, 1954, 1965, 1984, 1981, 1967, 1971, 1985, 19…
#> $ Education <fct> Graduation, Graduation, Graduation, Graduation, Ph…
#> $ Marital_Status <fct> Single, Single, Together, Together, Married, Toget…
#> $ Income <int> 58138, 46344, 71613, 26646, 58293, 62513, 55635, 3…
#> $ Kidhome <int> 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,…
#> $ Teenhome <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,…
#> $ Dt_Customer <date> 2012-09-04, 2014-03-08, 2013-08-21, 2014-02-10, 2…
#> $ Recency <int> 58, 38, 26, 26, 94, 16, 34, 32, 19, 68, 11, 59, 82…
#> $ MntWines <int> 635, 11, 426, 11, 173, 520, 235, 76, 14, 28, 5, 6,…
#> $ MntFruits <int> 88, 1, 49, 4, 43, 42, 65, 10, 0, 0, 5, 16, 61, 2, …
#> $ MntMeatProducts <int> 546, 6, 127, 20, 118, 98, 164, 56, 24, 6, 6, 11, 4…
#> $ MntFishProducts <int> 172, 2, 111, 10, 46, 0, 50, 3, 3, 1, 0, 11, 225, 3…
#> $ MntSweetProducts <int> 88, 1, 21, 3, 27, 42, 49, 1, 3, 1, 2, 1, 112, 5, 1…
#> $ MntGoldProds <int> 88, 6, 42, 5, 15, 14, 27, 23, 2, 13, 1, 16, 30, 14…
#> $ NumDealsPurchases <int> 3, 2, 1, 2, 5, 2, 4, 2, 1, 1, 1, 1, 1, 3, 1, 1, 3,…
#> $ NumWebPurchases <int> 8, 1, 8, 2, 5, 6, 7, 4, 3, 1, 1, 2, 3, 6, 1, 7, 3,…
#> $ NumCatalogPurchases <int> 10, 1, 2, 0, 3, 4, 3, 0, 0, 0, 0, 0, 4, 1, 0, 6, 0…
#> $ NumStorePurchases <int> 4, 2, 10, 4, 6, 10, 7, 4, 2, 0, 2, 3, 8, 5, 3, 12,…
#> $ NumWebVisitsMonth <int> 7, 5, 4, 6, 5, 6, 6, 8, 9, 20, 7, 8, 2, 6, 8, 3, 8…
#> $ AcceptedCmp3 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
#> $ AcceptedCmp4 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ AcceptedCmp5 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
#> $ AcceptedCmp1 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
#> $ AcceptedCmp2 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ Complain <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ Response <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
In this step, missing values are checked so that they can be treated and the data is ready for analysis.
# Check Missing Value
colSums(is.na(persona))
#> ID Year_Birth Education Marital_Status
#> 0 0 0 0
#> Income Kidhome Teenhome Dt_Customer
#> 24 0 0 0
#> Recency MntWines MntFruits MntMeatProducts
#> 0 0 0 0
#> MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
#> 0 0 0 0
#> NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
#> 0 0 0 0
#> AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
#> 0 0 0 0
#> AcceptedCmp2 Complain Response
#> 0 0 0
# Inspect rows with missing Income
persona[is.na(persona$Income),]
Insight: There are 24 rows with missing values in the Income column; the other columns have no missing values.
persona <- persona %>%
  mutate(Income = ifelse(is.na(Income), 0, Income)) # impute missing Income with 0
head(persona)
Checking whether the data contains duplicate rows.
persona %>%
duplicated() %>%
sum()
#> [1] 0
Insight: There are no duplicated rows, so no treatment is needed.
The data is split into two subsets: numeric columns (persona1) for dimensionality reduction and factor columns (persona2) for unsupervised learning.
# Selection Data 1
persona1 <- persona %>%
select_if(is.numeric)
# rownames(persona1) <- persona1$ID
persona1 <- persona1 %>%
dplyr::select(-ID)
persona1
# Selection Data 2
persona2 <- persona %>%
select_if(purrr::negate(is.numeric)) %>% # Select all columns except numeric ones
dplyr::select(-Dt_Customer)
persona2
summary(persona1)
#> Income Kidhome Teenhome Recency
#> Min. : 0 Min. :0.0000 Min. :0.0000 Min. : 0.00
#> 1st Qu.: 34722 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:24.00
#> Median : 51075 Median :0.0000 Median :0.0000 Median :49.00
#> Mean : 51688 Mean :0.4442 Mean :0.5062 Mean :49.11
#> 3rd Qu.: 68290 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:74.00
#> Max. :666666 Max. :2.0000 Max. :2.0000 Max. :99.00
#> MntWines MntFruits MntMeatProducts MntFishProducts
#> Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.00
#> 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16.0 1st Qu.: 3.00
#> Median : 173.50 Median : 8.0 Median : 67.0 Median : 12.00
#> Mean : 303.94 Mean : 26.3 Mean : 166.9 Mean : 37.53
#> 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232.0 3rd Qu.: 50.00
#> Max. :1493.00 Max. :199.0 Max. :1725.0 Max. :259.00
#> MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
#> Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000 1st Qu.: 2.000
#> Median : 8.00 Median : 24.00 Median : 2.000 Median : 4.000
#> Mean : 27.06 Mean : 44.02 Mean : 2.325 Mean : 4.085
#> 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000 3rd Qu.: 6.000
#> Max. :263.00 Max. :362.00 Max. :15.000 Max. :27.000
#> NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
#> Min. : 0.000 Min. : 0.00 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000
#> Median : 2.000 Median : 5.00 Median : 6.000
#> Mean : 2.662 Mean : 5.79 Mean : 5.317
#> 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000
#> Max. :28.000 Max. :13.00 Max. :20.000
Insight:
ggcorr(persona1, palette = "RdGy", nbreaks = 4, label = T, label_round = 2, label_size = 2.8, hjust = 0.95, layout.exp = 0.2, label_color = "white")
ggpairs(persona1[1:15], progress = F, upper = list(continuous = wrap("cor", size = 2))) +
  theme(axis.text = element_text(colour = "black", size = 4),
        strip.text = element_text(size = 5))
Insight:
K-means is a centroid-based clustering algorithm, meaning that each cluster is represented by one centroid. In this part, K-means clustering is applied using only the numeric variables in the persona1 data frame.
The Elbow method determines the optimal number of clusters based on the SSE (Sum of Squared Errors) value. Visually, the optimal number of clusters corresponds to the point that forms the first elbow, i.e. the point after which the decrease from the previous point is no longer significant.
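Before running fviz_nbclust() below, the following minimal sketch shows what the Elbow method computes under the hood: the total within-cluster sum of squares (WSS) for a range of k values on the scaled numeric data (the nstart value here is only an illustrative choice).
# Illustrative only: compute the WSS curve manually with kmeans()
set.seed(113)
wss <- sapply(1:10, function(k) {
  kmeans(scale(persona1), centers = k, nstart = 5)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares (WSS)")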
elbow_scale_num <- scale(persona1)
as.data.frame(elbow_scale_num)
fviz_nbclust(x = elbow_scale_num, FUNcluster = kmeans, method = "wss") + geom_hline(yintercept = 19750) + geom_hline(yintercept = 16200) + geom_hline(yintercept = 17800) + geom_vline(xintercept = 3, linetype = 2) + geom_vline(xintercept = 5, linetype = 2) + geom_vline(xintercept = 7, linetype = 2)
Insight:
Based on the plot above, the candidate values of k whose elbows are closest to 90 degrees are 3 and 7, and k = 3 is closer to 90 degrees than k = 7.
RNGkind(sample.kind = "Rounding")
set.seed(113)
# based on the Elbow method, k = 3
elb <- kmeans(elbow_scale_num, centers = 3)
# Attach the cluster labels to the data frame
df_elb <- persona1 %>%
  mutate(cluster = elb$cluster)
df_elb
# Mean
df_elb %>%
  dplyr::select(1, 4, 16) %>%
  dplyr::group_by(cluster) %>%
  summarise_all(mean)
# Sum
df_elb %>%
  dplyr::select(-c(1, 4)) %>%
  dplyr::group_by(cluster) %>%
  summarise_all(sum)
🔻Cluster Profiling:
Cluster 1:
Cluster 2:
Cluster 3:
fviz_cluster(object = elb,
             data = persona1, labelsize = 5) # Does not involve the group/target column, since that is what the clustering produces.
The clustering goal is to minimize the average distance within clusters and to maximize the average distance between clusters. Therefore, there are some measurements to evaluate the clusters, as follows:
The first way we can look at this is by running a Silhouette Analysis. This analysis measures how well an observation is clustered and it estimates the average distance between clusters.
sil0 <- silhouette(elb$cluster, dist(elbow_scale_num))
head(sil0[, 1:3], 10) # looking at the first 10 observations
#> cluster neighbor sil_width
#> [1,] 3 2 0.17234788
#> [2,] 1 2 0.46941136
#> [3,] 3 2 -0.05048963
#> [4,] 1 2 0.51994893
#> [5,] 2 1 -0.10534239
#> [6,] 2 1 0.22107363
#> [7,] 2 1 0.20944478
#> [8,] 1 2 0.46563436
#> [9,] 1 2 0.48843113
#> [10,] 1 2 0.22527343
The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.
fviz_silhouette(sil0)
#> cluster size ave.sil.width
#> 1 1 1046 0.40
#> 2 2 613 0.11
#> 3 3 581 0.14
Insight:
Using k = 3, based on the company's needs and business perspective.
RNGkind(sample.kind = "Rounding")
set.seed(193)
cl_p1 <- kmodes(persona2, 3, iter.max = 14, weighted = FALSE )
cluster <- cl_p1$cluster
output_p1 <- cbind(persona2,cluster)
output_p1
output_p1 %>%
dplyr::group_by(cluster,AcceptedCmp1,AcceptedCmp2,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,Response,Marital_Status,Education) %>%
summarise(n=n()) %>%
arrange(cluster,desc(n)) %>%
filter(cluster == 1)
🔻Cluster Profiling:
Cluster 1:
c2 <- output_p1 %>%
dplyr::group_by(cluster,AcceptedCmp1,AcceptedCmp2,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,Response,Marital_Status,Education) %>%
summarise(n=n()) %>%
arrange(cluster,desc(n)) %>%
filter(cluster==2)
c2
🔻Cluster Profiling:
Cluster 2:
c3 <- output_p1 %>%
dplyr::group_by(cluster,AcceptedCmp1,AcceptedCmp2,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,Response,Marital_Status,Education) %>%
summarise(n=n()) %>%
arrange(cluster,desc(n)) %>%
filter(cluster==3)
c3
🔻Cluster Profiling:
Cluster 3:
K-Medoids is a partitioning clustering algorithm in which the center of each cluster is called the medoid. In general, the optimum k for K-Medoids is determined using the silhouette technique, which seeks the lowest dissimilarity within each cluster.
In this part, unsupervised learning (K-Medoids) is used to provide the target labels that will later be used to build a supervised model. This K-Medoids clustering is conducted on both the numeric variables in the persona1 data frame and the factor variables in the persona2 data frame.
The Silhouette method measures the Silhouette coefficient by calculating the average distance of each data to all data in the same cluster, then calculating the average distance of each data to all data in other clusters. Then, the data is regrouped based on the minimum distance.
# Scale
# Calling df persona 1 due to numeric data type
scale_num <- as.data.frame(scale(persona1))
persona_scale <- cbind(scale_num,persona2)
persona_scale
fviz_nbclust(x = scale_num, FUNcluster = pam, method = "silhouette", k.max = 15) + geom_hline(yintercept = 0.3225) +
labs(subtitle = "Silhouette method")
***
The Elbow method measures the optimal number of clusters based on the SSE (Sum of Square Error) value. Visually, the optimal number of clusters can be seen from the points that tend to form the first elbow or the points that have a high/significant number of changes from the previous point.
elbow_scale_medoid <- scale(persona1)
as.data.frame(elbow_scale_medoid)
fviz_nbclust(x = elbow_scale_medoid, FUNcluster = pam, method = "wss") + geom_hline(yintercept = 23150) + geom_vline(xintercept = 2, linetype = 2)
Based on the Silhouette and Elbow methods, the optimum k is two. The goal is to choose a small value of k that still has a low WSS, with an elbow that forms roughly a 90-degree angle.
K-Medoids clustering (PAM) with k = 2 is used to cluster the company's customers. The clustering result then provides the label/target column.
RNGkind(sample.kind = "Rounding")
set.seed(1004)
#perform k-medoids clustering with k = 2 clusters
kmed <- pam(persona_scale, k = 2)
#Input Labelling to df
df <- cbind(persona1,persona2)
df$label <- kmed$cluster
df
# Numeric - mean
df %>%
dplyr::select(1:15,26) %>%
dplyr::group_by(label) %>%
dplyr::select(1,4) %>%
summarise_all(mean)
# Numeric - sum
df %>%
dplyr::select(1:15,26) %>%
dplyr::group_by(label) %>%
dplyr::select(2,3,5:15) %>%
summarise_all(sum)
# Categorical - Label 1
f1 <- df %>%
dplyr::select(-(1:15)) %>%
dplyr::group_by(label,AcceptedCmp1,AcceptedCmp2,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,Response,Marital_Status,Education) %>% summarise(n=n()) %>%
filter(label==1) %>%
arrange(label,desc(n))
f1
# Categorical - Label 2
f2 <- df %>%
dplyr::select(-(1:15)) %>%
dplyr::group_by(label,AcceptedCmp1,AcceptedCmp2,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,Response,Marital_Status,Education) %>% summarise(n=n()) %>%
filter(label==2) %>%
arrange(label,desc(n))
f2
🔻Cluster Profiling:
Cluster 1:
Cluster 2:
fviz_cluster(object = kmed,
             data = persona1, labelsize = 5) # Does not involve the group/target column, since that is what the clustering produces.
The clustering goal is to minimize the average distance within clusters and to maximize the average distance between clusters. Therefore, there are some measurements to evaluate the clusters, as follows:
The first way we can look at this is by running a Silhouette Analysis. This analysis measures how well an observation is clustered and it estimates the average distance between clusters.
sil <- silhouette(kmed$cluster, dist(persona_scale))
head(sil[, 1:3], 10) # looking at the first 10 observations
#> cluster neighbor sil_width
#> [1,] 1 2 0.596189205
#> [2,] 1 2 0.664027875
#> [3,] 1 2 0.233695083
#> [4,] 2 1 0.646646375
#> [5,] 2 1 0.655211120
#> [6,] 2 1 -0.006236619
#> [7,] 2 1 0.409488525
#> [8,] 2 1 0.638280574
#> [9,] 2 1 0.566364248
#> [10,] 1 2 0.569280417
The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.
fviz_silhouette(sil)
#> cluster size ave.sil.width
#> 1 1 910 0.52
#> 2 2 1330 0.50
Insight:
Clusters with a large silhouette width are very well clustered; a width near 0 means the observation lies between two clusters, and a negative silhouette width means the observation is likely in the wrong cluster.
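As a small follow-up, the sketch below (using the sil object computed above) quantifies this interpretation: the distribution of silhouette widths, the number of observations with a negative width (the ones most likely placed in the wrong cluster), and the overall average width.
# Summarise the silhouette widths from the `sil` object above
sil_width <- sil[, "sil_width"]
summary(sil_width)   # distribution of silhouette widths
sum(sil_width < 0)   # observations likely assigned to the wrong cluster
mean(sil_width)      # overall average silhouette width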
A supervised machine learning model is then used to classify the customers as either cluster 1 (general shopper) or cluster 2 (shopaholic).
The purpose of dimensionality reduction with PCA is to reduce the number of variables (dimensions/features) in the data while retaining as much information as possible, thereby overcoming the difficulties encountered when working with high-dimensional data.
Checking sampling adequacy (MSA) is important before conducting the PCA analysis; this is done via the Kaiser-Meyer-Olkin (KMO) test. The result below shows that the overall MSA is 0.888, which is more than sufficient to proceed (generally anything above 0.6 is considered adequate).
KMO(cor(persona1))
#>
#> ── Kaiser-Meyer-Olkin criterion (KMO) ──────────────────────────────────────────
#>
#> ✔ The overall KMO value for your data is meritorious.
#> These data are probably suitable for factor analysis.
#>
#> Overall: 0.888
#>
#> For each variable:
#> Income Kidhome Teenhome Recency
#> 0.931 0.900 0.574 0.302
#> MntWines MntFruits MntMeatProducts MntFishProducts
#> 0.882 0.938 0.911 0.937
#> MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
#> 0.939 0.935 0.502 0.858
#> NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
#> 0.903 0.899 0.842
The FactoMineR package is used to conduct the PCA analysis, with the PCA() function for the analysis and plot.PCA() for visualization.
persona1_pca <- PCA(X = persona1, # numeric data without the target/label
                    scale.unit = T, # scale the variables
                    graph = F, # do not display the visualization
                    ncp = 15) # the number of PCs retained (all 15 numeric columns)
# Show the PCA results
results1 <- persona1_pca$ind$coord %>% as.data.frame()
results1
Insight:
persona1_pca$eig
#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 6.0226357 40.150905 40.15090
#> comp 2 1.8344275 12.229517 52.38042
#> comp 3 1.0252150 6.834767 59.21519
#> comp 4 1.0049524 6.699683 65.91487
#> comp 5 0.7842345 5.228230 71.14310
#> comp 6 0.7548658 5.032438 76.17554
#> comp 7 0.6272726 4.181817 80.35736
#> comp 8 0.5031005 3.354004 83.71136
#> comp 9 0.4666598 3.111065 86.82243
#> comp 10 0.4323828 2.882552 89.70498
#> comp 11 0.3943847 2.629232 92.33421
#> comp 12 0.3568601 2.379067 94.71328
#> comp 13 0.3210059 2.140039 96.85332
#> comp 14 0.2533239 1.688826 98.54214
#> comp 15 0.2186788 1.457859 100.00000
Insight:
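As a hedged illustration of how the number of retained components could be chosen from the eigenvalue table above, the sketch below keeps the smallest number of PCs whose cumulative variance reaches roughly 80%, which corresponds to the 7 dimensions used in the next step (the column name follows the printed output of persona1_pca$eig).
# Smallest number of PCs covering at least ~80% of the variance
cum_var <- persona1_pca$eig[, "cumulative percentage of variance"]
n_pc <- which(cum_var >= 80)[1]
n_pc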
The data frame of PCA results is combined with the non-numeric data frame and with the labels produced by the unsupervised clustering model.
df_clean <- cbind(results1[,1:7],persona2)
df_clean$label <- kmed$cluster
df_clean <- df_clean %>% mutate(label=as.factor(label))
df_clean <- df_clean %>% dplyr::select(-Year_Birth) # dropped because it is only used for visualization purposes
rownames(df_clean) <- NULL
df_clean
In the cross-validation step, the data is divided into train data (80% of the original data) and test data (20% of the original data, treated as unseen data).
# Set seed to lock the random
# RNGkind(sample.kind = "Rounding")
set.seed(1239)
# index sampling
index <- sample(x = nrow(df_clean), size = nrow(df_clean)*0.8)
# splitting
df_clean_train <- df_clean[index,]
df_clean_test <- df_clean[-index,]
prop.table(table(df_clean_train$label))
#>
#> 1 2
#> 0.4051339 0.5948661
After checking the target variable for class imbalance, none is found (as a common rule of thumb, a 60%/40% split is still considered balanced). It can therefore be concluded that treatments such as up-sampling, down-sampling, SMOTE, or ROSE are not needed. If such a treatment were needed, it should be applied after cross-validation and only to the train data, because the test data is unseen data.
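Although no treatment is required here, a minimal sketch of how up-sampling could be applied to the train data only is shown below, using caret's upSample(); the result is illustrative and is not used further in this analysis.
# Illustrative only: balance the train data with caret::upSample()
up_train <- upSample(x = df_clean_train %>% dplyr::select(-label),
                     y = df_clean_train$label,
                     yname = "label")
prop.table(table(up_train$label)) # classes are now balanced 50/50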
The Decision Tree is a tree-based model that is quite simple yet delivers robust, powerful predictive performance. It produces a visualization in the form of a decision tree that can be interpreted easily. The ctree function from the partykit package is used to implement the decision tree model.
library(party)
persona_tree <-
ctree(
formula = label ~ Dim.1 + Dim.2 + Dim.3 + Dim.4 + Dim.5 + Dim.6 + Dim.7 +
Education + Marital_Status + AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 +
AcceptedCmp4 + AcceptedCmp5 + Complain + Response ,
data = df_clean_train)
plot(persona_tree, type = "simple")
The tree diagram can be simplified by using the rpart function from the rpart library.
tree_model <- rpart(label ~ Dim.1 + Dim.2 + Dim.3 + Dim.4 + Dim.5 + Dim.6 + Dim.7 +
Education + Marital_Status + AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 +
AcceptedCmp4 + AcceptedCmp5 + Complain + Response,
data = df_clean_train
)
fancyRpartPlot(tree_model, caption = NULL)
# prediction on data test
persona_pred_test <- predict(object = persona_tree, # model decision tree
newdata = df_clean_test, # predict data test
type = "response") # confusion matrix
untuned_test <- confusionMatrix(data = persona_pred_test, # prediction result
reference = df_clean_test$label, # original label
positive = "2") # number 2 -> Shopaholic is this research observation
untuned_test
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 1 2
#> 1 118 74
#> 2 66 190
#>
#> Accuracy : 0.6875
#> 95% CI : (0.6423, 0.7302)
#> No Information Rate : 0.5893
#> P-Value [Acc > NIR] : 1.117e-05
#>
#> Kappa : 0.3586
#>
#> Mcnemar's Test P-Value : 0.5541
#>
#> Sensitivity : 0.7197
#> Specificity : 0.6413
#> Pos Pred Value : 0.7422
#> Neg Pred Value : 0.6146
#> Prevalence : 0.5893
#> Detection Rate : 0.4241
#> Detection Prevalence : 0.5714
#> Balanced Accuracy : 0.6805
#>
#> 'Positive' Class : 2
#>
# prediction on train data
persona_pred_train <- predict(object = persona_tree, # decision tree model
newdata = df_clean_train, # predict on train data
type = "response") # confusion matrix
untuned_train <- confusionMatrix(data = persona_pred_train, # prediction result
reference = df_clean_train$label, # original label
positive = "2") # number 2 -> Shopaholic is this research observation
untuned_train
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 1 2
#> 1 447 307
#> 2 279 759
#>
#> Accuracy : 0.673
#> 95% CI : (0.6507, 0.6947)
#> No Information Rate : 0.5949
#> P-Value [Acc > NIR] : 5.365e-12
#>
#> Kappa : 0.3257
#>
#> Mcnemar's Test P-Value : 0.2647
#>
#> Sensitivity : 0.7120
#> Specificity : 0.6157
#> Pos Pred Value : 0.7312
#> Neg Pred Value : 0.5928
#> Prevalence : 0.5949
#> Detection Rate : 0.4235
#> Detection Prevalence : 0.5792
#> Balanced Accuracy : 0.6639
#>
#> 'Positive' Class : 2
#>
In the post-pruning step, the ctree_control parameter is added to the ctree function to improve model accuracy by reducing overfitting. This is because a decision tree can split the data down to a very detailed level, making the model merely memorize the train data patterns through complex rules, when it should also be able to generalize to the test data patterns.
tuning_tree <- ctree(formula = label ~. , data = df_clean_train,
controls = ctree_control(mincriterion = 0.95,
minsplit = 100,
minbucket = 5)
)
plot(tuning_tree, type = "simple")
# prediction on data test
persona_pred_test1 <- predict(object = tuning_tree, # model decision tree
newdata = df_clean_test) # confusion matrix
tuned_test <- confusionMatrix(data = persona_pred_test1, # prediction result
reference = df_clean_test$label, # original label
positive = "2") # number 2 -> Shopaholic is this research observation
Insight:
In this step, the train data is predicted using the tuned decision tree model, with the result type set to response (class). The model is then evaluated with a confusion matrix, focusing on the accuracy score.
# prediction on train data
persona_pred_train1 <- predict(object = tuning_tree, # decision tree model
newdata = df_clean_train, # predict on train data
type = "response") # confusion matrix
tuned_train <- confusionMatrix(data = persona_pred_train1, # prediction result
reference = df_clean_train$label, # original label
positive = "2") # number 2 -> Shopaholic is this research observation
tuned_train
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 1 2
#> 1 417 295
#> 2 309 771
#>
#> Accuracy : 0.6629
#> 95% CI : (0.6405, 0.6848)
#> No Information Rate : 0.5949
#> P-Value [Acc > NIR] : 1.733e-09
#>
#> Kappa : 0.2986
#>
#> Mcnemar's Test P-Value : 0.5968
#>
#> Sensitivity : 0.7233
#> Specificity : 0.5744
#> Pos Pred Value : 0.7139
#> Neg Pred Value : 0.5857
#> Prevalence : 0.5949
#> Detection Rate : 0.4302
#> Detection Prevalence : 0.6027
#> Balanced Accuracy : 0.6488
#>
#> 'Positive' Class : 2
#>
Insight:
# Comparison After Tuning & Before Tuning
performance_comparison <- cbind.data.frame(Accuracy=c(untuned_test$overall[1],tuned_test$overall[1]),
Sensitivity = c(untuned_test$byClass[[1]], tuned_test$byClass[[1]])
)
rownames(performance_comparison) <- c("Untuned DT Model", "Tuned DT Model")
performance_comparison
Insight:
# Get Prediction Result with type = probability
persona_pred_test1_prob <- predict(object = tuning_tree, # model decision tree
newdata = df_clean_test, type="prob")
# Extract / Convert the list into matrix with 2 columns
pred_test_improvement <- matrix(unlist(persona_pred_test1_prob), ncol = 2, byrow = TRUE)
## AUC
pred_prob_improvement <- pred_test_improvement[,2]
model_auc_eva <- prediction(pred_prob_improvement,
df_clean_test$label)
model_auc1 <- performance(model_auc_eva,
measure = "auc")
model_auc1@y.values
#> [[1]]
#> [1] 0.731215
Insight:
With AUC = 0.731215, it can be concluded that the model is quite good at separating the positive (shopaholic) and negative (general shopper) classes.
The AUC value of the Decision Tree model is greater than 0.5 and reasonably close to 1, meaning the model is able to distinguish the positive and negative classes fairly well.
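To visualize this, the ROC curve behind the AUC can be plotted from the same ROCR prediction object (model_auc_eva); this is a small illustrative addition.
# Plot the ROC curve from the existing ROCR prediction object
roc_perf <- performance(model_auc_eva, measure = "tpr", x.measure = "fpr")
plot(roc_perf, main = "ROC Curve - Tuned Decision Tree")
abline(a = 0, b = 1, lty = 2) # reference line: random classifier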
pred_result <- df_clean_test
pred_result$pred <- persona_pred_test1
pred_result <- pred_result %>% dplyr::select(label,pred) %>% rename(actual=label)
row.names(pred_result) <- NULL
pred_result
# confusion matrix
tuned_test <- confusionMatrix(data = persona_pred_test1, # prediction result
reference = df_clean_test$label, # original label
positive = "2") # number 2 -> Shopaholic is this research observation
# Plot
Actual <- factor(c("No", "No", "Yes", "Yes"))
Predicted <- factor(c("No", "Yes", "No", "Yes"))
y <- c(tuned_test$table[1], tuned_test$table[2], tuned_test$table[3], tuned_test$table[4])
df1_log <- data.frame(Actual, Predicted, y)
# Plot 1
plot_log1 <- ggplot(data = df1_log, mapping = aes(x = Actual, y = Predicted, col = y)) +
geom_tile(aes(fill = y)) +
geom_text(aes(label = sprintf("%1.0f", y)), color = "Black", size = 5, family = "Times New Roman") +
scale_fill_gradient(low = "#FFA07C", high = "#FC945A") +
scale_color_gradient(low = "#FFA07C", high = "#FC945A") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#B78980", color = "#B78980"),
panel.background = element_rect(fill = "#B78980"),
panel.grid = element_line(colour = "#B78980"),
panel.grid.major.x = element_line(colour = "#B78980"),
panel.grid.minor.x = element_line(colour = "#B78980"),
axis.text.x = element_text(color = "#54433A", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "#54433A", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#402E32", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#402E32", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
plot_log1
There are two customer categories, namely shopaholic and general shopper. The general shopper category is further divided into Mid to Low Class and High Class.
The personality traits of shopaholics are: discount hunters who love shopping both in stores and on the website, like visiting the company's website, and, besides discounts, mostly buy products during the 4th campaign. The shopaholic backgrounds are mid income, graduation-level education, and married or single marital status.
If the company wants to acquire more customers, it can target the General Shopper - High Class category. The personality traits of this category are: not easily tempted by marketing campaigns, loves wines, meat products, and fish products, and likes to shop in stores, through catalogs, and on the website. The backgrounds of this category are high income, a master's degree, and "together" marital status.
Using the tuned Decision Tree model, the positive and negative classes can be classified with an accuracy of 0.6897321, a sensitivity of 0.7272727, and an AUC of 73.1215%.
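As suggested in the summary, a more complex model could be explored to improve accuracy. Below is a hedged sketch of one such option: a random forest trained with caret's train() and 5-fold cross-validation (it assumes the randomForest package is installed; the settings are illustrative, not tuned).
# Illustrative sketch: random forest as a more complex alternative model
set.seed(1239)
rf_model <- train(label ~ .,
                  data = df_clean_train,
                  method = "rf", # requires the randomForest package
                  trControl = trainControl(method = "cv", number = 5))
rf_pred <- predict(rf_model, newdata = df_clean_test)
confusionMatrix(data = rf_pred, reference = df_clean_test$label, positive = "2")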