Customer segmentation is the process of separating customers into groups based on traits they share. Segmentation offers a simple way of organizing and managing a company's relationships with its customers, and it makes it easier to tailor marketing, service, and sales efforts to the needs of specific groups, which in turn helps boost customer loyalty and conversions.
We use a dataset from Kaggle.com on customer personality analysis. Customer personality analysis is a detailed analysis of a company's ideal customers. It helps a business understand its customers better and makes it easier to adapt products to the specific needs, behaviors, and concerns of different types of customers. For example, instead of spending money marketing a new product to every customer in the database, a company can identify the segment most likely to buy the product and market it only to that segment.
We will analyze the data with unsupervised learning, using clustering and PCA. Unsupervised learning is a family of machine learning methods in which there is no target variable; the focus is on exploring the data, for example by looking for patterns and groupings.
library(dplyr) # for general data wrangling
library(lubridate) # for converting date-time data
library(GGally) # for the correlation plot (ggcorr)
library(FactoMineR) # for exploratory multivariate data analysis (PCA)
library(factoextra) # for intuitive PCA and clustering plots
library(ggplot2) # for polishing the plots
library(gridExtra) # for arranging multiple plots
library(ggiraphExtra) # for the radar chart used in profiling
library(plotly) # for interactive plotly visualizations
customer <- read.csv("marketing_campaign.csv", sep = "\t")
rmarkdown::paged_table(customer)
#Rename column names to make it more intuitive
names(customer) <- c("ID","Year_Birth","Education","Marital_Status","Income","Kidhome","Teenhome","Dt_Customer","Recency","Wines", "Fruits", "Meat", "Fish", "Sweet", "Gold", "Deals", "Web", "Catalog", "Store", "WebVisits", "Cmp3", "Cmp4", "Cmp5", "Cmp1", "Cmp2", "Complain", "Z_CostContact", "Z_Revenue", "Response")
Data Columns:
• ID: Customer’s unique identifier
• Year_Birth: Customer’s birth year
• Education: Customer’s education level
• Marital_Status: Customer’s marital status
• Income: Customer’s yearly household income
• Kidhome: Number of children in customer’s household
• Teenhome: Number of teenagers in customer’s household
• Dt_Customer: Date of customer’s enrollment with the company
• Recency: Number of days since customer’s last purchase
• Wines (MntWines): Amount spent on wine in last 2 years
• Fruits (MntFruits): Amount spent on fruits in last 2 years
• Meat (MntMeatProducts): Amount spent on meat in last 2 years
• Fish (MntFishProducts): Amount spent on fish in last 2 years
• Sweet (MntSweetProducts): Amount spent on sweets in last 2 years
• Gold (MntGoldProds): Amount spent on gold in last 2 years
• Web (NumWebPurchases): Number of purchases made through the company’s
website
• Catalog (NumCatalogPurchases): Number of purchases made using a
catalog
• Store (NumStorePurchases): Number of purchases made directly in
stores
• WebVisits (NumWebVisitsMonth): Number of visits to company’s website
in the last month
• Deals (NumDealsPurchases): Number of purchases made with a
discount
• Cmp1 (AcceptedCmp1): 1 if customer accepted the offer in the 1st
campaign, 0 otherwise
• Cmp2 (AcceptedCmp2): 1 if customer accepted the offer in the 2nd
campaign, 0 otherwise
• Cmp3 (AcceptedCmp3): 1 if customer accepted the offer in the 3rd
campaign, 0 otherwise
• Cmp4 (AcceptedCmp4): 1 if customer accepted the offer in the 4th
campaign, 0 otherwise
• Cmp5 (AcceptedCmp5): 1 if customer accepted the offer in the 5th
campaign, 0 otherwise
• Complain: 1 if the customer complained in the last 2 years, 0
otherwise
• Response: 1 if customer accepted the offer in the last campaign, 0
otherwise
glimpse(customer)
## Rows: 2,240
## Columns: 29
## $ ID <int> 5524, 2174, 4141, 6182, 5324, 7446, 965, 6177, 4855, 58…
## $ Year_Birth <int> 1957, 1954, 1965, 1984, 1981, 1967, 1971, 1985, 1974, 1…
## $ Education <chr> "Graduation", "Graduation", "Graduation", "Graduation",…
## $ Marital_Status <chr> "Single", "Single", "Together", "Together", "Married", …
## $ Income <int> 58138, 46344, 71613, 26646, 58293, 62513, 55635, 33454,…
## $ Kidhome <int> 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0…
## $ Teenhome <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1…
## $ Dt_Customer <chr> "04-09-2012", "08-03-2014", "21-08-2013", "10-02-2014",…
## $ Recency <int> 58, 38, 26, 26, 94, 16, 34, 32, 19, 68, 11, 59, 82, 53,…
## $ Wines <int> 635, 11, 426, 11, 173, 520, 235, 76, 14, 28, 5, 6, 194,…
## $ Fruits <int> 88, 1, 49, 4, 43, 42, 65, 10, 0, 0, 5, 16, 61, 2, 14, 2…
## $ Meat <int> 546, 6, 127, 20, 118, 98, 164, 56, 24, 6, 6, 11, 480, 5…
## $ Fish <int> 172, 2, 111, 10, 46, 0, 50, 3, 3, 1, 0, 11, 225, 3, 6, …
## $ Sweet <int> 88, 1, 21, 3, 27, 42, 49, 1, 3, 1, 2, 1, 112, 5, 1, 68,…
## $ Gold <int> 88, 6, 42, 5, 15, 14, 27, 23, 2, 13, 1, 16, 30, 14, 5, …
## $ Deals <int> 3, 2, 1, 2, 5, 2, 4, 2, 1, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2…
## $ Web <int> 8, 1, 8, 2, 5, 6, 7, 4, 3, 1, 1, 2, 3, 6, 1, 7, 3, 4, 1…
## $ Catalog <int> 10, 1, 2, 0, 3, 4, 3, 0, 0, 0, 0, 0, 4, 1, 0, 6, 0, 1, …
## $ Store <int> 4, 2, 10, 4, 6, 10, 7, 4, 2, 0, 2, 3, 8, 5, 3, 12, 3, 6…
## $ WebVisits <int> 7, 5, 4, 6, 5, 6, 6, 8, 9, 20, 7, 8, 2, 6, 8, 3, 8, 7, …
## $ Cmp3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Cmp4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Cmp5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ Cmp1 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1…
## $ Cmp2 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Complain <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Z_CostContact <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ Z_Revenue <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,…
## $ Response <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
We will drop ID, Year_Birth, Dt_Customer, Z_CostContact, and Z_Revenue because we won't use them. K-means clustering requires numeric data, so the categorical columns Education and Marital_Status will be converted into numeric codes.
#Convert Education and Marital_Status columns into numeric codes
#Note: marital statuses not listed in the switch() call (e.g. "Absurd", "YOLO") are unmatched and end up as the string "NULL", which appears as its own level in the summary later
cust_clean <- customer
cust_clean$Marital_Status <- sapply(X = cust_clean$Marital_Status,
FUN = switch,
"Single" = "0",
"Married" = "1",
"Together" = "2",
"Divorced" = "3",
"Widow" = "4",
"Alone" = "5",
"Other" = "6")
cust_clean$Marital_Status <- as.character(cust_clean$Marital_Status)
cust_clean$Education <- sapply(X = cust_clean$Education,
FUN = switch,
"2n Cycle" = "0",
"Basic" = "1",
"Graduation" = "2",
"Master" = "3",
"PhD" = "4")
#Convert some columns into factors to make the EDA process easier
cust_clean <- cust_clean %>%
mutate_at(vars(Education, Marital_Status, Kidhome, Teenhome, Cmp3, Cmp4, Cmp5, Cmp1, Cmp2, Complain, Response), as.factor) %>%
select(-c(Dt_Customer, ID, Year_Birth, Z_CostContact, Z_Revenue))
glimpse(cust_clean)
## Rows: 2,240
## Columns: 24
## $ Education <fct> 2, 2, 2, 2, 4, 3, 2, 4, 4, 4, 2, 1, 2, 3, 2, 4, 2, 2, 3…
## $ Marital_Status <fct> 0, 0, 2, 2, 1, 2, 3, 1, 2, 2, 1, 1, 3, 3, 1, 0, 1, 2, 1…
## $ Income <int> 58138, 46344, 71613, 26646, 58293, 62513, 55635, 33454,…
## $ Kidhome <fct> 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0…
## $ Teenhome <fct> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1…
## $ Recency <int> 58, 38, 26, 26, 94, 16, 34, 32, 19, 68, 11, 59, 82, 53,…
## $ Wines <int> 635, 11, 426, 11, 173, 520, 235, 76, 14, 28, 5, 6, 194,…
## $ Fruits <int> 88, 1, 49, 4, 43, 42, 65, 10, 0, 0, 5, 16, 61, 2, 14, 2…
## $ Meat <int> 546, 6, 127, 20, 118, 98, 164, 56, 24, 6, 6, 11, 480, 5…
## $ Fish <int> 172, 2, 111, 10, 46, 0, 50, 3, 3, 1, 0, 11, 225, 3, 6, …
## $ Sweet <int> 88, 1, 21, 3, 27, 42, 49, 1, 3, 1, 2, 1, 112, 5, 1, 68,…
## $ Gold <int> 88, 6, 42, 5, 15, 14, 27, 23, 2, 13, 1, 16, 30, 14, 5, …
## $ Deals <int> 3, 2, 1, 2, 5, 2, 4, 2, 1, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2…
## $ Web <int> 8, 1, 8, 2, 5, 6, 7, 4, 3, 1, 1, 2, 3, 6, 1, 7, 3, 4, 1…
## $ Catalog <int> 10, 1, 2, 0, 3, 4, 3, 0, 0, 0, 0, 0, 4, 1, 0, 6, 0, 1, …
## $ Store <int> 4, 2, 10, 4, 6, 10, 7, 4, 2, 0, 2, 3, 8, 5, 3, 12, 3, 6…
## $ WebVisits <int> 7, 5, 4, 6, 5, 6, 6, 8, 9, 20, 7, 8, 2, 6, 8, 3, 8, 7, …
## $ Cmp3 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Cmp4 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Cmp5 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ Cmp1 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1…
## $ Cmp2 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Complain <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Response <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
cust_clean %>% is.na() %>% colSums()
## Education Marital_Status Income Kidhome Teenhome
## 0 0 24 0 0
## Recency Wines Fruits Meat Fish
## 0 0 0 0 0
## Sweet Gold Deals Web Catalog
## 0 0 0 0 0
## Store WebVisits Cmp3 Cmp4 Cmp5
## 0 0 0 0 0
## Cmp1 Cmp2 Complain Response
## 0 0 0 0
There are missing values in the Income column. We will simply remove those rows rather than impute them, since only 24 of 2,240 rows are affected.
cust_clean <- na.omit(cust_clean)
cust_clean %>% is.na() %>% colSums()
## Education Marital_Status Income Kidhome Teenhome
## 0 0 0 0 0
## Recency Wines Fruits Meat Fish
## 0 0 0 0 0
## Sweet Gold Deals Web Catalog
## 0 0 0 0 0
## Store WebVisits Cmp3 Cmp4 Cmp5
## 0 0 0 0 0
## Cmp1 Cmp2 Complain Response
## 0 0 0 0
There are no missing values anymore, so we can continue to the next step.
Are there any outliers in this data set?
summary(cust_clean)
## Education Marital_Status Income Kidhome Teenhome Recency
## 0: 200 0 :471 Min. : 1730 0:1283 0:1147 Min. : 0.00
## 1: 54 1 :857 1st Qu.: 35303 1: 887 1:1018 1st Qu.:24.00
## 2:1116 2 :573 Median : 51382 2: 46 2: 51 Median :49.00
## 3: 365 3 :232 Mean : 52247 Mean :49.01
## 4: 481 4 : 76 3rd Qu.: 68522 3rd Qu.:74.00
## 5 : 3 Max. :666666 Max. :99.00
## NULL: 4
## Wines Fruits Meat Fish
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 24.0 1st Qu.: 2.00 1st Qu.: 16.0 1st Qu.: 3.00
## Median : 174.5 Median : 8.00 Median : 68.0 Median : 12.00
## Mean : 305.1 Mean : 26.36 Mean : 167.0 Mean : 37.64
## 3rd Qu.: 505.0 3rd Qu.: 33.00 3rd Qu.: 232.2 3rd Qu.: 50.00
## Max. :1493.0 Max. :199.00 Max. :1725.0 Max. :259.00
##
## Sweet Gold Deals Web
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 8.00 Median : 24.50 Median : 2.000 Median : 4.000
## Mean : 27.03 Mean : 43.97 Mean : 2.324 Mean : 4.085
## 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000 3rd Qu.: 6.000
## Max. :262.00 Max. :321.00 Max. :15.000 Max. :27.000
##
## Catalog Store WebVisits Cmp3 Cmp4 Cmp5
## Min. : 0.000 Min. : 0.000 Min. : 0.000 0:2053 0:2052 0:2054
## 1st Qu.: 0.000 1st Qu.: 3.000 1st Qu.: 3.000 1: 163 1: 164 1: 162
## Median : 2.000 Median : 5.000 Median : 6.000
## Mean : 2.671 Mean : 5.801 Mean : 5.319
## 3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :28.000 Max. :13.000 Max. :20.000
##
## Cmp1 Cmp2 Complain Response
## 0:2074 0:2186 0:2195 0:1883
## 1: 142 1: 30 1: 21 1: 333
##
##
##
##
##
It seems there is an outlier in the Income column. Let's check it first using a boxplot.
boxplot(cust_clean$Income)
Yes, we were right: there is an outlier in the Income column. We will remove it.
cust_clean <- cust_clean %>% filter(Income<=300000)
- Do the variables have the same scale?
Let’s check the range for each column:
summary(cust_clean)
## Education Marital_Status Income Kidhome Teenhome Recency
## 0: 200 0 :471 Min. : 1730 0:1283 0:1146 Min. : 0.00
## 1: 54 1 :857 1st Qu.: 35284 1: 886 1:1018 1st Qu.:24.00
## 2:1115 2 :572 Median : 51373 2: 46 2: 51 Median :49.00
## 3: 365 3 :232 Mean : 51970 Mean :49.02
## 4: 481 4 : 76 3rd Qu.: 68487 3rd Qu.:74.00
## 5 : 3 Max. :162397 Max. :99.00
## NULL: 4
## Wines Fruits Meat Fish
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 24.0 1st Qu.: 2.00 1st Qu.: 16.0 1st Qu.: 3.00
## Median : 175.0 Median : 8.00 Median : 68.0 Median : 12.00
## Mean : 305.2 Mean : 26.36 Mean : 167.1 Mean : 37.65
## 3rd Qu.: 505.0 3rd Qu.: 33.00 3rd Qu.: 232.5 3rd Qu.: 50.00
## Max. :1493.0 Max. :199.00 Max. :1725.0 Max. :259.00
##
## Sweet Gold Deals Web
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 8.00 Median : 25.00 Median : 2.000 Median : 4.000
## Mean : 27.04 Mean : 43.98 Mean : 2.323 Mean : 4.086
## 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000 3rd Qu.: 6.000
## Max. :262.00 Max. :321.00 Max. :15.000 Max. :27.000
##
## Catalog Store WebVisits Cmp3 Cmp4 Cmp5
## Min. : 0.000 Min. : 0.000 Min. : 0.000 0:2052 0:2051 0:2053
## 1st Qu.: 0.000 1st Qu.: 3.000 1st Qu.: 3.000 1: 163 1: 164 1: 162
## Median : 2.000 Median : 5.000 Median : 6.000
## Mean : 2.672 Mean : 5.802 Mean : 5.319
## 3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :28.000 Max. :13.000 Max. :20.000
##
## Cmp1 Cmp2 Complain Response
## 0:2073 0:2185 0:2194 0:1882
## 1: 142 1: 30 1: 21 1: 333
##
##
##
##
##
The numeric variables have very different ranges. Leaving them unscaled risks biasing the model, especially during PCA, so the data will be scaled before PCA is performed.
- Do our numeric predictors correlate with each other?
ggcorr(cust_clean, low = "navy", high = "turquoise", label = T)
The value of the Pearson correlation coefficient (r) lies between -1 and +1:
- r = 0: no relationship between the variables.
- r = +1: perfectly positively correlated.
- r = -1: perfectly negatively correlated.
- r = 0 to 0.30: negligible correlation.
- r = 0.30 to 0.50: moderate correlation.
- r = 0.50 to 1: highly correlated.
source
Based on the plot above, most of the predictors have at least moderate correlations with one another. Principal Component Analysis can be used to reduce the dimensionality of the data while preserving as much information as possible, producing components that are free of multicollinearity.
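If we want the exact coefficients behind the heatmap rather than reading them off the colors, we can compute the correlation matrix directly. This is a minimal sketch using base R's cor() on the numeric columns; the object names (num_cols, cor_mat) and the example pairs printed are illustrative only.
# Correlation matrix of the numeric predictors (the same quantities ggcorr() plots)
num_cols <- cust_clean %>% select_if(is.numeric)
cor_mat <- cor(num_cols, method = "pearson")
# Inspect a couple of example pairs
cor_mat["Income", "Wines"]
cor_mat["Meat", "Catalog"]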
Our target column is Response. Let's see whether customers accepted the last campaign based on Deals and Income. We filter on Income first to get a clearer visualization, because the Income column contains an outlier.
#Convert Response and Marital_Status into factors for easier visualization in this section
#The original customer data is used here so Marital_Status keeps its text labels (before the conversion to numeric codes)
customer_eda <- customer %>% mutate_at(vars(Response, Marital_Status), as.factor)
customer_eda %>% filter(Income<=200000) %>%
ggplot( aes(Deals, Income, color = Response, size = Marital_Status)) +
geom_point(alpha = 0.5) + theme_minimal()
Based on the graph above, most customers who accepted the campaign have an income of 25,000-125,000, and their marital status is Married, Single, or Together.
Next, we compare spending across product categories by the customer's Response using boxplots.
p1 <- ggplot(customer_eda, aes(Response, Wines, fill = Response)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "MntWines")
p2 <- ggplot(customer_eda, aes(Response, Fruits, fill = Response)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "MntFruits")
p3 <- ggplot(customer_eda, aes(Response, Meat, fill = Response)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "MntMeatProducts")
p4 <- ggplot(customer_eda, aes(Response, Fish, fill = Response)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "MntFishProducts")
p5 <- ggplot(customer_eda, aes(Response, Sweet, fill = Response)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "MntSweetProducts")
p6 <- ggplot(customer_eda, aes(Response, Gold, fill = Response)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "MntGoldProds")
grid.arrange(p1, p2, p3, p4, p5, p6)
Based on the graphs above, customers who said yes to the campaign tend to spend more, even though the difference is not dramatic. We can regard them as our more loyal customers.
Our dataset does not have a customer-type label. Still, looking at the purchase patterns across product categories and the response to the last marketing campaign, there appear to be groups of broadly similar customers, so this data may be suitable for clustering with K-means.
Next we separate the quantitative and qualitative predictors. This split is only needed for the PCA step, where the categorical columns are passed as supplementary qualitative variables; for the clustering analysis itself, only the numeric predictors will be used.
#Check the numeric predictors
cust_clean %>%
select_if(is.numeric) %>%
colnames()
## [1] "Income" "Recency" "Wines" "Fruits" "Meat" "Fish"
## [7] "Sweet" "Gold" "Deals" "Web" "Catalog" "Store"
## [13] "WebVisits"
There are 13 numeric columns.
# quantitative columns
quanti <- cust_clean %>%
select_if(is.numeric) %>%
colnames()
# indexing numerical columns
quantivar <- which(colnames(cust_clean) %in% quanti)
# qualitative columns
quali <- cust_clean %>%
select_if(is.factor) %>%
colnames()
# indexing categorical columns
qualivar <- which(colnames(cust_clean) %in% quali)
Principal component analysis (PCA) is a multivariate statistical analysis technique, arguably one of the most popular in use today. It is often applied in fields such as pattern recognition and signal processing.
PCA is a projection-based method of multivariate data analysis. It is typically used to summarize large multivariate data tables into a smaller set of summary indices, which can then be examined for trends, clusters of variables, and outliers. source
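As an illustration of this projection idea, here is a minimal sketch using base R's prcomp() on the numeric columns. It is not the method used for the rest of the analysis (the FactoMineR::PCA() call below is); pca_base is an illustrative name and the block only mirrors the scaling-and-projection step.
# Base-R sketch of PCA on the scaled numeric columns (illustration only)
pca_base <- prcomp(cust_clean %>% select_if(is.numeric), center = TRUE, scale. = TRUE)
# Proportion of variance explained by the first few components
summary(pca_base)$importance["Proportion of Variance", 1:5]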
# PCA by using FactoMineR
cust_pca <- PCA(
X = cust_clean,
scale.unit = T,
quali.sup = qualivar,
graph = F,
ncp = 13 # 13 numerical columns
)
cust_pca$eig # analyze cumulative variance of each PC
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 5.8313793 44.856764 44.85676
## comp 2 1.5545979 11.958445 56.81521
## comp 3 1.0026549 7.712730 64.52794
## comp 4 0.8987336 6.913335 71.44127
## comp 5 0.6654878 5.119137 76.56041
## comp 6 0.6362576 4.894289 81.45470
## comp 7 0.5290949 4.069961 85.52466
## comp 8 0.4330794 3.331380 88.85604
## comp 9 0.3981573 3.062749 91.91879
## comp 10 0.3629042 2.791571 94.71036
## comp 11 0.2661855 2.047580 96.75794
## comp 12 0.2461341 1.893339 98.65128
## comp 13 0.1753336 1.348720 100.00000
Eigenvalues represent the total amount of variance that can be explained by a given principal component. They can be positive or negative in theory, but in practice they explain variance, which is always positive.
- If the eigenvalues are greater than zero, it is a good sign.
- Since variance cannot be negative, negative eigenvalues imply the model is ill-conditioned.
- Eigenvalues close to zero imply there is item multicollinearity, since all the variance can be taken up by the first component.
source
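A scree plot is a quick way to see the same eigenvalue table graphically. The sketch below simply visualizes cust_pca$eig with factoextra (already loaded above); the axis limit is an arbitrary choice.
# Scree plot of the percentage of variance explained by each component
fviz_eig(cust_pca, addlabels = TRUE, ylim = c(0, 50))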
Through PCA, we can keep the most informative principal components (those with high cumulative variance) from the customer dataset in order to perform dimensionality reduction: we shrink the number of dimensions while retaining as much information as possible.
In this study, we want to retain at least 85% of the information in the data. Based on the PCA summary (cust_pca$eig), we pick PC1-PC8 out of the 13 PCs. By doing this, we reduce the dimensionality of the original data by 38.46% while retaining 88.86% of its information.
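A quick arithmetic check of the figures quoted above, read directly from the eigenvalue table:
# Cumulative variance retained by the first 8 components
cust_pca$eig[8, "cumulative percentage of variance"] # ~88.86%
# Share of dimensions dropped by keeping 8 of the 13 PCs
(13 - 8) / 13 * 100 # ~38.46%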
We can extract the values of PC1-PC8 for all observations and put them into a new data frame. This data frame can later be used for supervised classification or other purposes.
# making a new data frame from PCA result
cust_x <- data.frame(cust_pca$ind$coord[,1:8])
cust_1 <- cbind(cust_x, Response = cust_clean$Response)
head(cust_1)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## 1 3.63029082 0.6284463 0.2578246 1.75833936 0.96723077 -0.2504482
## 2 -2.13279216 -0.9567982 -0.3326070 -0.34326515 0.31540242 -0.3632943
## 3 1.71918345 0.1411721 -0.9475246 -0.11005151 -0.95656049 1.0691186
## 4 -2.31635974 -0.5090272 -0.7967320 0.07766089 0.09785698 0.1260063
## 5 0.06529784 0.7402671 1.5385334 0.22315971 1.12298509 0.3412906
## 6 0.78300157 0.6853681 -1.1632779 -0.85228829 -0.25014634 1.1248055
## Dim.7 Dim.8 Response
## 1 2.3737353 -0.04560618 1
## 2 -0.1418925 -0.17755398 0
## 3 -0.6182426 0.45559518 0
## 4 -0.1837603 0.06647132 0
## 5 -0.5817403 0.19720561 0
## 6 -0.1887550 0.15115626 0
Outlier identification
plot.PCA(
x=cust_pca,
choix="ind",
select="contrib 10",
invisible = "quali",
habillage = "Response")
In total, 8 customers are flagged as outliers, 2 of whom gave Response = 1, i.e. they accepted the offer in the last campaign.
Variables Factor Map
We want to know how much each variable contributes to each PC, how much of each PC's information each variable explains, and how the initial variables correlate with one another.
# PC1 bar plot variable contribution
fviz_contrib(X = cust_pca,
axes = 1, # = PC1
choice = "var")
# PC2 bar plot variable contribution
fviz_contrib(X = cust_pca,
axes = 2, # = PC2
choice = "var")
Based on the graphs above:
- Catalog, Meat, Income, Wines, Store, Fish, Fruits, and Sweet are the main contributors to PC1.
- Deals, Web, and WebVisits are the main contributors to PC2.
Clustering is a machine learning method that belongs to unsupervised learning. It aims to find similar patterns in the data so that similar observations can be grouped together; the resulting groups are called clusters. A good clustering is one in which members of the same cluster are as similar as possible, while members of different clusters differ substantially. Clustering is widely used in fields such as customer segmentation, product recommendation, data profiling, and many more. source
K-means clustering is a popular unsupervised machine learning algorithm used to partition a dataset into a pre-defined number of clusters. The goal is to group similar data points together and discover underlying patterns or structures within the data. K-means is a centroid-based (distance-based) algorithm: each cluster is associated with a centroid, and points are assigned to clusters based on their distance to the centroids. The main objective is to partition the data into a specific number (k) of groups such that points within a group are similar to each other and dissimilar to points in other groups, which is achieved by minimizing the distance between each data point and the centroid of its assigned cluster. source
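As a small sanity check of that objective, the sketch below fits kmeans() on toy data and recomputes the total within-cluster sum of squares by hand; the toy matrix and the names toy, toy_km, and manual_wss are illustrative only.
# Minimal sketch: the quantity kmeans() minimizes is the total within-cluster
# sum of squared distances from each point to its assigned centroid
set.seed(1)
toy <- matrix(rnorm(40), ncol = 2)
toy_km <- kmeans(toy, centers = 2)
manual_wss <- sum(sapply(seq_len(nrow(toy)), function(i) {
  sum((toy[i, ] - toy_km$centers[toy_km$cluster[i], ])^2)
}))
all.equal(manual_wss, toy_km$tot.withinss) # should be TRUE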
In K-means clustering, the data must be numeric, so we will use only the numeric predictors.
cust_clust <- cust_clean %>% select(-c(Education, Marital_Status, Kidhome, Teenhome, Cmp3, Cmp4, Cmp5, Cmp1, Cmp2, Complain, Response))
glimpse(cust_clust)
## Rows: 2,215
## Columns: 13
## $ Income <int> 58138, 46344, 71613, 26646, 58293, 62513, 55635, 33454, 3035…
## $ Recency <int> 58, 38, 26, 26, 94, 16, 34, 32, 19, 68, 59, 82, 53, 38, 23, …
## $ Wines <int> 635, 11, 426, 11, 173, 520, 235, 76, 14, 28, 6, 194, 233, 3,…
## $ Fruits <int> 88, 1, 49, 4, 43, 42, 65, 10, 0, 0, 16, 61, 2, 14, 22, 5, 5,…
## $ Meat <int> 546, 6, 127, 20, 118, 98, 164, 56, 24, 6, 11, 480, 53, 17, 1…
## $ Fish <int> 172, 2, 111, 10, 46, 0, 50, 3, 3, 1, 11, 225, 3, 6, 59, 2, 1…
## $ Sweet <int> 88, 1, 21, 3, 27, 42, 49, 1, 3, 1, 1, 112, 5, 1, 68, 13, 12,…
## $ Gold <int> 88, 6, 42, 5, 15, 14, 27, 23, 2, 13, 16, 30, 14, 5, 45, 4, 2…
## $ Deals <int> 3, 2, 1, 2, 5, 2, 4, 2, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2, 2, 1, …
## $ Web <int> 8, 1, 8, 2, 5, 6, 7, 4, 3, 1, 2, 3, 6, 1, 7, 3, 4, 11, 2, 4,…
## $ Catalog <int> 10, 1, 2, 0, 3, 4, 3, 0, 0, 0, 0, 4, 1, 0, 6, 0, 1, 4, 1, 2,…
## $ Store <int> 4, 2, 10, 4, 6, 10, 7, 4, 2, 0, 3, 8, 5, 3, 12, 3, 6, 9, 3, …
## $ WebVisits <int> 7, 5, 4, 6, 5, 6, 6, 8, 9, 20, 8, 2, 6, 8, 3, 8, 7, 5, 6, 8,…
#scaling data
cust_z <- scale(cust_clust)
summary(cust_z)
## Income Recency Wines Fruits
## Min. :-2.33388 Min. :-1.6934384 Min. :-0.9048 Min. :-0.6623
## 1st Qu.:-0.77514 1st Qu.:-0.8644117 1st Qu.:-0.8336 1st Qu.:-0.6121
## Median :-0.02773 Median :-0.0008421 Median :-0.3860 Median :-0.4613
## Mean : 0.00000 Mean : 0.0000000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.76730 3rd Qu.: 0.8627274 3rd Qu.: 0.5922 3rd Qu.: 0.1668
## Max. : 5.12987 Max. : 1.7262970 Max. : 3.5209 Max. : 4.3374
## Meat Fish Sweet Gold
## Min. :-0.7448 Min. :-0.6876 Min. :-0.6583 Min. :-0.8487
## 1st Qu.:-0.6735 1st Qu.:-0.6328 1st Qu.:-0.6339 1st Qu.:-0.6750
## Median :-0.4416 Median :-0.4684 Median :-0.4635 Median :-0.3662
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2917 3rd Qu.: 0.2255 3rd Qu.: 0.1451 3rd Qu.: 0.2320
## Max. : 6.9454 Max. : 4.0421 Max. : 5.7199 Max. : 5.3455
## Deals Web Catalog Store
## Min. :-1.2074 Min. :-1.49036 Min. :-0.9127 Min. :-1.7848
## 1st Qu.:-0.6876 1st Qu.:-0.76082 1st Qu.:-0.9127 1st Qu.:-0.8620
## Median :-0.1678 Median :-0.03129 Median :-0.2295 Median :-0.2468
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3520 3rd Qu.: 0.69825 3rd Qu.: 0.4538 3rd Qu.: 0.6760
## Max. : 6.5896 Max. : 8.35836 Max. : 8.6528 Max. : 2.2140
## WebVisits
## Min. :-2.1925
## 1st Qu.:-0.9558
## Median : 0.2808
## Mean : 0.0000
## 3rd Qu.: 0.6931
## Max. : 6.0520
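A quick sanity check on the scaling step above: scale() should reproduce the usual z-score (x - mean(x)) / sd(x) column by column. Income is used here only as an example, and manual_income_z is an illustrative name.
# Verify the z-score transformation for one column
manual_income_z <- (cust_clust$Income - mean(cust_clust$Income)) / sd(cust_clust$Income)
all.equal(as.numeric(cust_z[, "Income"]), manual_income_z) # should be TRUE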
1. Elbow method
The elbow method, based on the total within-cluster sum of squares (WSS), is a technique used to determine the optimal number of clusters for a k-means clustering analysis. source
# obtain k optimum
fviz_nbclust(cust_z, FUNcluster = kmeans, method = "wss") + labs(subtitle = "Elbow method")
Based on the graph above, the curve starts to flatten at k = 2. The elbow method is one of the most widely used methods, but it has a drawback: it relies on visual interpretation, which can lead to ambiguous or incorrect conclusions.
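For reference, the same elbow curve can be built by hand instead of with fviz_nbclust. This is a minimal sketch; the range of k, the nstart value, and the seed are arbitrary choices.
# Manual elbow curve: total within-cluster sum of squares for k = 1..10
set.seed(100)
wss <- sapply(1:10, function(k) kmeans(cust_z, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")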
2. Silhouette Method
The average silhouette method measures how well-defined a particular cluster is and how well-separated it is from other clusters. A silhouette value is calculated for each observation in the data set; the average of these values gives the average silhouette width, which summarizes the quality of the clustering. source
fviz_nbclust(cust_z, kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")
As can be seen from the plot, the clustering with the highest silhouette value is the 2-cluster solution, so the optimal number of clusters appears to be two. However, looking at the silhouette values on the y-axis, the value for 3 clusters is quite close to that for 2 clusters, even though 2 clusters is the highest. For this reason, it is useful to run the clustering algorithm for both 2 and 3 clusters and interpret the results. source
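The same statistic can also be computed explicitly for a single solution with the cluster package (assumed available; it ships with standard R installations). This sketch uses k = 2 as an example, and km2 and sil2 are illustrative names.
library(cluster)
set.seed(100)
km2 <- kmeans(cust_z, centers = 2, nstart = 10)
sil2 <- silhouette(km2$cluster, dist(cust_z))
mean(sil2[, "sil_width"]) # average silhouette width for k = 2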
3. Gap Statistic
Gap Statistic Method compares the observed within-cluster variation for
different values of k with the variation expected under a null reference
distribution of the data.
fviz_nbclust(cust_z, kmeans, "gap_stat", k.max = 15) + labs(subtitle = "Gap Statistic method")
The gap statistic method suggests 5 as the optimal number of clusters.
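For reference, fviz_nbclust's gap-statistic option is typically computed via cluster::clusGap, and a direct call looks like the sketch below. The number of bootstrap references B is kept deliberately small here only to limit runtime, and gap is an illustrative name.
library(cluster)
set.seed(100)
gap <- clusGap(cust_z, FUNcluster = kmeans, K.max = 10, B = 25, nstart = 10)
print(gap, method = "firstSEmax") # reports a suggested number of clusters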
Two of the three methods suggest k = 2, while the gap statistic suggests 5, so we will try the rounded average of the three, k = 3.
# k-means clustering
set.seed(133)
cust_km <- kmeans(cust_z, 3)
# result analysis
cust_km
## K-means clustering with 3 clusters of sizes 597, 593, 1025
##
## Cluster means:
## Income Recency Wines Fruits Meat Fish
## 1 1.1118923 0.010729995 0.8678931 1.0678523 1.2548446 1.1541386
## 2 0.2566170 -0.018550399 0.4767978 -0.1340294 -0.1449327 -0.1959788
## 3 -0.7960718 0.004482517 -0.7813398 -0.5444179 -0.6470216 -0.5588345
## Sweet Gold Deals Web Catalog Store WebVisits
## 1 1.0746399 0.6740915 -0.4803667 0.4379592 1.1455884 0.8156569 -1.0235079
## 2 -0.1521858 0.2927043 0.7823986 0.8292665 0.1011815 0.5577581 0.2467713
## 3 -0.5378672 -0.5619573 -0.1728619 -0.7348456 -0.7257726 -0.7977539 0.4533647
##
## Clustering vector:
## [1] 1 3 2 3 2 2 2 3 3 3 3 1 2 3 1 3 3 1 3 3 1 2 2 2 3 3 3 1 3 3 3 2 1 3 2 3 3
## [38] 1 1 3 3 3 1 3 3 2 2 1 3 1 2 1 1 3 2 1 2 2 2 1 3 3 1 2 2 1 2 2 3 3 1 1 3 2
## [75] 3 3 3 3 1 3 3 2 1 3 3 3 3 2 3 1 2 3 3 1 1 1 3 3 1 3 1 1 2 2 1 2 3 1 2 3 3
## [112] 2 3 3 3 1 1 1 3 2 2 2 1 3 1 3 3 3 3 1 2 1 2 3 2 3 3 3 3 2 2 2 3 2 2 3 3 3
## [149] 1 3 2 3 2 1 3 1 3 1 3 3 3 3 3 3 1 1 3 3 1 3 3 1 3 3 3 3 2 1 3 3 1 3 3 2 3
## [186] 2 1 1 2 2 1 2 1 3 3 3 2 3 2 3 1 2 2 1 2 3 1 3 2 3 1 2 3 2 3 2 2 1 3 2 1 3
## [223] 3 2 3 3 2 3 3 1 1 3 1 2 3 2 1 1 1 3 3 1 3 2 3 2 2 3 3 3 2 3 3 2 3 1 3 1 3
## [260] 1 3 3 3 3 2 1 1 1 2 3 2 3 2 3 3 1 2 1 2 3 3 1 3 3 2 3 3 1 2 3 2 3 3 3 1 3
## [297] 1 2 3 3 3 1 2 3 3 2 3 2 3 3 2 2 1 3 3 3 3 3 3 2 3 3 1 1 3 1 1 1 3 2 2 3 1
## [334] 3 1 3 3 2 1 2 1 2 3 3 1 2 2 1 2 3 3 2 2 1 3 1 2 3 2 3 2 3 3 3 3 2 3 3 3 3
## [371] 3 3 3 2 1 3 2 1 3 1 3 2 1 3 3 3 3 3 1 3 3 2 3 3 2 3 2 3 1 2 2 2 1 3 1 1 2
## [408] 3 3 3 1 1 3 1 2 3 1 1 2 2 1 3 3 2 2 3 3 3 3 3 3 3 3 3 1 3 2 2 2 3 2 2 2 1
## [445] 3 3 1 2 1 3 1 3 1 1 3 2 2 1 3 2 3 3 2 3 2 2 2 3 3 3 3 1 1 2 2 3 3 1 3 1 2
## [482] 2 2 3 1 2 1 3 3 3 2 3 2 1 1 3 2 3 1 2 1 2 1 3 3 1 1 3 1 3 2 3 3 1 2 1 2 2
## [519] 1 2 3 3 2 3 1 3 3 3 3 3 1 1 3 1 3 3 3 3 3 2 3 1 3 1 1 3 1 3 1 2 1 2 3 2 3
## [556] 3 3 2 3 3 3 3 2 3 3 3 2 3 2 3 3 3 3 1 1 2 3 3 1 1 3 2 3 3 3 3 3 3 2 1 2 3
## [593] 3 3 3 3 1 3 3 3 3 2 2 3 3 3 3 2 3 1 3 1 3 1 1 3 3 2 1 1 3 1 3 1 2 2 1 1 1
## [630] 2 2 1 2 1 3 1 2 1 3 2 3 2 3 3 3 1 3 2 3 1 3 3 3 3 3 3 3 1 2 1 1 2 3 2 1 3
## [667] 1 2 1 2 3 1 2 1 1 1 1 2 1 3 3 3 3 3 3 2 1 2 2 2 1 3 1 3 2 2 3 3 2 3 2 3 1
## [704] 1 3 2 3 2 2 3 1 3 3 1 1 2 1 3 2 2 2 2 1 1 2 3 1 2 3 3 3 1 1 3 1 3 1 1 2 1
## [741] 1 1 1 2 2 3 3 3 2 1 3 1 3 1 1 3 2 1 1 2 3 3 3 3 1 3 1 1 3 3 3 3 3 3 2 2 2
## [778] 1 1 3 3 3 3 2 2 1 3 2 3 3 1 1 2 3 2 2 1 3 3 1 2 1 2 3 2 2 3 1 3 2 3 1 1 2
## [815] 3 1 3 3 2 2 3 3 1 2 1 3 2 3 3 3 3 1 1 1 2 3 3 2 2 1 3 2 1 3 2 3 1 3 2 3 2
## [852] 2 2 2 3 2 3 2 1 3 3 1 1 2 3 1 3 3 3 3 3 1 1 3 3 2 1 3 2 1 3 1 2 1 1 2 3 1
## [889] 3 1 2 2 1 1 3 3 3 1 1 2 3 1 1 2 1 3 1 3 1 3 3 1 2 1 1 1 1 1 3 2 3 1 3 1 2
## [926] 2 2 2 2 1 1 3 2 2 2 3 2 3 3 3 3 3 3 2 2 3 2 1 2 3 3 2 2 1 3 3 2 1 3 3 2 1
## [963] 1 1 2 3 2 3 3 3 2 1 2 1 1 1 3 1 3 2 1 3 3 1 3 2 2 2 1 2 3 3 2 2 1 3 3 1 3
## [1000] 2 3 3 2 1 3 3 3 3 3 2 3 3 1 3 3 3 2 1 1 1 3 1 3 3 3 3 2 2 3 3 1 3 3 3 1 2
## [1037] 2 1 3 1 3 3 1 3 3 1 1 2 2 2 3 2 3 1 2 3 1 3 1 2 3 2 1 1 3 2 3 1 2 1 3 1 2
## [1074] 3 1 3 1 1 3 1 3 3 2 2 1 3 2 1 2 2 3 3 1 3 3 2 2 1 1 3 1 3 2 3 3 3 2 2 3 2
## [1111] 3 3 2 2 3 3 1 2 3 3 1 1 3 3 1 3 3 2 3 3 3 1 3 3 2 2 3 2 1 3 1 3 3 2 1 1 1
## [1148] 3 2 2 1 3 2 3 3 1 1 3 3 1 2 3 3 3 2 3 1 2 3 2 3 3 3 3 2 3 3 1 2 3 3 3 2 3
## [1185] 2 1 1 3 2 3 3 1 2 1 3 3 3 3 1 1 1 2 3 2 3 1 3 3 3 1 3 3 2 2 3 2 3 3 3 3 3
## [1222] 3 2 3 1 3 3 3 3 1 1 3 3 3 3 3 2 1 2 1 1 2 2 2 1 3 1 3 1 1 3 3 1 2 3 3 1 2
## [1259] 1 3 2 3 2 3 3 1 2 1 1 3 2 2 3 2 3 1 2 3 3 3 3 3 3 3 2 1 3 3 1 3 2 1 2 2 1
## [1296] 2 2 2 1 3 1 2 3 2 3 3 3 1 2 1 2 3 2 3 3 3 1 3 2 1 1 2 3 1 3 3 3 3 2 2 3 3
## [1333] 3 3 2 2 1 1 1 3 3 1 1 3 2 1 3 2 3 2 1 2 1 2 3 3 3 3 3 3 2 2 2 2 3 2 3 3 3
## [1370] 1 3 3 1 2 3 3 3 3 2 3 3 2 2 2 2 3 3 2 3 2 1 2 3 2 2 3 1 3 2 3 3 3 3 3 1 2
## [1407] 3 3 3 3 3 3 2 3 3 1 3 2 3 2 3 3 3 3 3 3 1 1 3 1 2 1 2 3 1 1 3 2 1 3 3 1 2
## [1444] 2 2 3 3 3 1 2 1 3 1 3 3 3 1 1 3 1 3 3 2 1 2 3 3 1 2 1 2 1 2 2 3 1 2 3 1 3
## [1481] 2 1 1 2 3 2 2 2 2 2 1 1 2 1 3 1 1 3 3 2 3 3 3 1 1 3 3 3 2 1 3 1 3 2 2 2 3
## [1518] 3 3 3 1 2 2 3 2 1 3 2 3 3 3 2 3 3 1 1 1 2 3 2 3 3 1 3 2 3 1 3 2 2 1 2 2 1
## [1555] 3 1 3 2 3 3 1 3 2 3 1 1 3 1 3 3 2 3 1 3 1 3 3 3 2 3 3 1 3 2 1 2 3 3 3 1 2
## [1592] 2 2 1 3 1 3 3 1 3 3 3 2 3 3 1 2 3 3 2 2 3 3 1 3 3 3 2 2 2 1 3 3 3 2 2 3 1
## [1629] 3 2 2 1 3 3 1 3 1 3 3 3 1 2 1 2 3 2 3 3 3 2 3 2 1 3 1 1 1 1 3 3 3 2 3 1 3
## [1666] 3 3 3 3 1 2 2 1 2 1 3 3 3 1 3 2 3 1 2 3 3 2 3 3 2 3 1 3 1 1 3 1 3 2 2 3 3
## [1703] 2 3 1 1 1 2 3 3 3 2 1 3 3 3 1 1 1 2 1 2 3 3 2 3 1 2 1 3 1 1 1 2 2 3 2 3 3
## [1740] 3 3 3 1 1 3 2 1 3 2 3 1 3 3 3 3 1 1 3 3 3 3 3 1 3 3 1 2 2 3 2 3 2 1 3 3 1
## [1777] 3 3 3 2 3 2 1 1 1 3 2 3 2 2 3 1 1 3 2 1 1 1 3 2 1 2 3 2 3 3 1 1 3 2 1 1 3
## [1814] 3 3 2 3 3 3 1 3 2 2 3 1 3 1 2 2 3 3 2 3 1 3 1 1 1 2 3 3 1 2 2 3 2 1 1 2 3
## [1851] 3 2 1 3 1 3 2 3 3 2 1 2 2 1 3 3 1 3 2 3 1 1 3 1 1 3 2 1 3 3 1 1 2 3 3 2 3
## [1888] 3 1 2 3 3 3 1 1 1 1 2 2 3 3 3 3 3 1 1 1 1 3 2 1 1 2 3 2 3 3 1 3 2 2 3 3 1
## [1925] 3 3 1 3 1 1 1 3 3 3 2 1 1 2 1 3 2 1 3 1 2 3 3 3 1 2 1 1 1 2 3 2 2 3 2 1 3
## [1962] 3 3 3 3 1 3 2 3 2 3 3 2 1 3 1 1 3 1 1 2 2 2 3 3 3 3 3 3 1 2 3 2 1 3 1 2 1
## [1999] 3 3 3 3 3 3 3 2 2 1 3 3 3 2 3 2 2 3 1 2 2 2 2 3 1 2 2 3 3 2 3 3 1 1 1 3 2
## [2036] 1 3 3 3 1 1 2 3 1 3 2 1 3 2 3 3 2 1 3 2 1 2 3 3 3 2 1 2 1 1 3 3 2 3 2 1 2
## [2073] 3 1 2 1 2 3 2 3 2 3 3 3 2 3 1 2 2 3 3 2 2 3 1 1 3 3 3 3 2 3 1 2 1 2 3 3 1
## [2110] 3 3 1 2 3 3 3 3 3 3 3 3 2 1 3 3 2 3 3 1 3 3 3 3 3 3 2 3 1 3 2 1 3 3 1 1 1
## [2147] 3 2 2 1 2 2 1 1 2 2 3 3 2 3 3 3 1 1 1 1 3 1 3 3 1 1 3 3 2 2 3 3 2 2 1 3 3
## [2184] 1 3 3 3 2 1 3 1 3 3 3 1 3 2 2 1 3 3 2 2 2 2 3 3 2 3 3 1 2 2 1 3
##
## Within cluster sum of squares by cluster:
## [1] 7705.288 4901.901 3334.311
## (between_SS / total_SS = 44.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
From the clustering result, clusters 1 and 2 are of similar size while cluster 3 is noticeably larger (cluster 1: 597; cluster 2: 593; cluster 3: 1025). The points in cluster 1 have a within-cluster sum of squared distances from their centroid of 7705.288, cluster 2 of 4901.901, and cluster 3 of 3334.311.
The ratio of the between-cluster sum of squares to the total sum of squares is 44.6% (between_SS / total_SS = 44.6%), meaning that only 44.6% of the total variation is explained by the separation between clusters. Intuitively, each within_SS measures how spread out the points of one cluster are around its centroid, and tot.withinss is the sum of these values over all clusters. The more compact the clusters are (low within_SS per cluster and therefore low tot.withinss), the larger between_SS / total_SS becomes, approaching 1 (100%); in that case the separation of the centroids accounts for most of the variation and the clusters are compact and well separated.
Thus, with between_SS / total_SS at only 44.6%, our data is not clustered very tightly: observations within the same cluster still show considerable spread, and the cluster sizes are not evenly distributed.
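The quantities discussed above can be read directly off the fitted kmeans object, as a quick check:
cust_km$size # cluster sizes
cust_km$withinss # within-cluster sum of squares per cluster
cust_km$betweenss / cust_km$totss * 100 # between_SS / total_SS, ~44.6%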
# make new column, cluster column
cust_clust$cluster <- cust_km$cluster
# profiling with summarise data
cust_clust %>%
group_by(cluster) %>%
summarise_all(mean)
## # A tibble: 3 × 14
## cluster Income Recency Wines Fruits Meat Fish Sweet Gold Deals Web
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 75905. 49.3 598. 68.9 449. 101. 71.2 78.9 1.40 5.29
## 2 2 57494. 48.5 466. 21.0 135. 26.9 20.8 59.1 3.83 6.36
## 3 3 34833. 49.2 41.6 4.69 21.9 7.05 4.95 14.9 1.99 2.07
## # ℹ 3 more variables: Catalog <dbl>, Store <dbl>, WebVisits <dbl>
# Plotting to make easier process profiling
ggRadar(data=cust_clust, aes(colour=cluster), interactive=TRUE)
Profiling:
- Cluster 1, Beloved Customers:
  - The highest income level.
  - The largest number of purchases for every type of product.
  - They like to buy through catalogs and directly in the store.
- Cluster 2, Discount Chasers:
  - The second-highest income level.
  - They love discounts.
  - They like to buy online via our website.
- Cluster 3, The Potential Ones:
  - The lowest income level.
  - They buy only in small amounts.
  - They love visiting our website.
PCA can also be combined with the K-means clustering result to help visualize the data in fewer dimensions than the original features.
fviz_cluster(object = cust_km, data = cust_clust, labelsize = 0) + theme_minimal()
3D-plot Visualization for Multidimensional Data
cust_3D <- cbind(cust_x, cluster = cust_clust$cluster)
plotly::plot_ly(cust_3D, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~cluster, colors = c(
"red", "green", "blue")) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = "Dim.1"),
yaxis = list(title = "Dim.2"),
zaxis = list(title = "Dim.3")))
Based on the two graphs above, the unsupervised learning model separates the data into three distinct clusters without any overlap between them.
This customer data set was used for customer segmentation with unsupervised learning, combining PCA and clustering.
* We picked PC1-PC8 from a total of 13 PCs. By doing this, we reduced the dimensionality of the original data by 38.46% while retaining 88.86% of its information.
* We determined the optimal k using three methods: the elbow method, the gap statistic, and the silhouette method. We used k = 3.
* Our model creates 3 clusters with reasonably good results, i.e. the clusters do not overlap each other.
* Customer profiling result:
  - Cluster 1, Beloved Customers: the highest income level; the largest number of purchases for every type of product; they like to buy through catalogs and directly in the store.
  - Cluster 2, Discount Chasers: the second-highest income level; they love discounts; they like to buy online via our website.
  - Cluster 3, The Potential Ones: the lowest income level; they buy only in small amounts; they love visiting our website.
https://www.researchgate.net/publication/287543507_CLUSTERING_DATA_NON-NUMERIK_DENGAN_PENDEKATAN_ALGORITMA_K-MEANS_DAN_HAMMING_DISTANCE_STUDI_KASUS_BIRO_JODOH/fulltext/5677848208ae0ad265c5be74/CLUSTERING-DATA-NON-NUMERIK-DENGAN-PENDEKATAN-ALGORITMA-K-MEANS-DAN-HAMMING-DISTANCE-STUDI-KASUS-BIRO-JODOH.pdf
https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
https://algolearn.netlify.app/p/fuzzy-clustering/
https://medium.com/@ozturkfemre/unsupervised-learning-determination-of-cluster-number-be8842cdb11#:~:text=Generate%20B%20reference%20datasets%20by,the%20optimal%20number%20of%20clusters.
https://andrea-grianti.medium.com/kmeans-parameters-in-rstudio-explained-c493ec5a05df
https://algorit.ma/blog/principal-component-analysis-2022/