The advertisement division of a large club store needs to analyze the store's customers in order to create a segmentation for more targeted marketing campaigns.
Our goal is to identify groups of similar customers and characterize them (at least some of them); in other words, to perform clustering and derive a customer segmentation.
Column descriptions:
People
ID: Customer's unique identifier
Year_Birth: Customer's birth year
Education: Customer's education level
Marital_Status: Customer's marital status
Income: Customer's yearly household income
Kidhome: Number of children in customer's household
Teenhome: Number of teenagers in customer's household
Dt_Customer: Date of customer's enrollment with the company
Recency: Number of days since customer's last purchase
Complain: 1 if the customer complained in the last 2 years, 0 otherwise
Products
MntWines: Amount spent on wine in last 2 years
MntFruits: Amount spent on fruits in last 2 years
MntMeatProducts: Amount spent on meat in last 2 years
MntFishProducts: Amount spent on fish in last 2 years
MntSweetProducts: Amount spent on sweets in last 2 years
MntGoldProds: Amount spent on gold in last 2 years
Place
NumWebPurchases: Number of purchases made through the company’s website
NumStorePurchases: Number of purchases made directly in stores
Assume that the data was current on 2014-07-01.
Read “m_marketing_campaign.csv” using the data.table::fread command and examine the data.
The fread function of data.table reads CSV files very fast.
# fread m_marketing_campaign.csv and save it as df
df = fread("marketing_campaign.csv")
df_orig <- df
# Checking for irregular data:
sum(is.na(df))
## [1] 0
sum(is.null(df))
## [1] 0
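# Note: is.null() tests whether the whole object is NULL, not individual cells, so the
# line above always returns 0. A per-column check could look like this (a minimal sketch):
colSums(is.na(df))
# count empty strings in character columns (NA for non-character columns)
sapply(df, function(x) if (is.character(x)) sum(x == "") else NA)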
# Getting to know about the data
head(df,10)
## ID Year_Birth Education Marital_Status Income Kidhome Teenhome
## 1: 5524 1957 Bachelor Single 58138 0 0
## 2: 2174 1954 Bachelor Single 46344 1 1
## 3: 4141 1965 Bachelor Together 71613 0 0
## 4: 6182 1984 Bachelor Together 26646 1 0
## 5: 5324 1981 PhD Married 58293 1 0
## 6: 7446 1967 Master Together 62513 0 1
## 7: 965 1971 Bachelor Divorced 55635 0 1
## 8: 6177 1985 PhD Married 33454 1 0
## 9: 4855 1974 PhD Together 30351 1 0
## 10: 5899 1950 PhD Together 5648 1 1
## Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts
## 1: 04-09-2012 58 635 88 546 172
## 2: 08-03-2014 38 11 1 6 2
## 3: 21-08-2013 26 426 49 127 111
## 4: 10-02-2014 26 11 4 20 10
## 5: 19-01-2014 94 173 43 118 46
## 6: 09-09-2013 16 520 42 98 0
## 7: 13-11-2012 34 235 65 164 50
## 8: 08-05-2013 32 76 10 56 3
## 9: 06-06-2013 19 14 0 24 3
## 10: 13-03-2014 68 28 0 6 1
## MntSweetProducts MntGoldProds NumWebPurchases NumStorePurchases Complain
## 1: 88 88 8 4 0
## 2: 1 6 1 2 0
## 3: 21 42 8 10 0
## 4: 3 5 2 4 0
## 5: 27 15 5 6 0
## 6: 42 14 6 10 0
## 7: 49 27 7 7 0
## 8: 1 23 4 4 0
## 9: 3 2 3 2 0
## 10: 1 13 1 0 0
summary(df)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1893 Length:2209 Length:2209
## 1st Qu.: 2826 1st Qu.:1959 Class :character Class :character
## Median : 5462 Median :1970 Mode :character Mode :character
## Mean : 5592 Mean :1969
## 3rd Qu.: 8427 3rd Qu.:1977
## Max. :11191 Max. :1996
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Length:2209
## 1st Qu.: 35246 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median : 51390 Median :0.0000 Median :0.0000 Mode :character
## Mean : 52244 Mean :0.4418 Mean :0.5052
## 3rd Qu.: 68627 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :666666 Max. :2.0000 Max. :2.0000
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 24.0 1st Qu.: 2.00 1st Qu.: 16.0
## Median :49.00 Median : 174.0 Median : 8.00 Median : 68.0
## Mean :49.08 Mean : 305.2 Mean : 26.35 Mean : 167.2
## 3rd Qu.:74.00 3rd Qu.: 505.0 3rd Qu.: 33.00 3rd Qu.: 233.0
## Max. :99.00 Max. :1493.0 Max. :199.00 Max. :1725.0
## MntFishProducts MntSweetProducts MntGoldProds NumWebPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 2.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 4.000
## Mean : 37.56 Mean : 27.07 Mean : 43.85 Mean : 4.082
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 6.000
## Max. :259.00 Max. :262.00 Max. :321.00 Max. :27.000
## NumStorePurchases Complain
## Min. : 0.000 Min. :0.000000
## 1st Qu.: 3.000 1st Qu.:0.000000
## Median : 5.000 Median :0.000000
## Mean : 5.803 Mean :0.009507
## 3rd Qu.: 8.000 3rd Qu.:0.000000
## Max. :13.000 Max. :1.000000
# Convert Year_Birth to Age (assume that the current date is 2014-07-01)
current_date <- as.Date("2014-07-01")
df$Age = as.integer(2014 - df$Year_Birth)
# Dt_Customer is a date (still stored as character); convert it to membership days (name it MembershipDays)
# hint: note the European date format, use as.Date with the proper format argument
df$Dt_Customer = as.Date(df$Dt_Customer, format = "%d-%m-%Y")
df$MembershipDays = as.numeric(difftime(current_date, df$Dt_Customer, units = "days"))
head(df[,c("Year_Birth","Age")],5)
## Year_Birth Age
## 1: 1957 57
## 2: 1954 60
## 3: 1965 49
## 4: 1984 30
## 5: 1981 33
head(df[,c("Dt_Customer","MembershipDays")],5)
## Dt_Customer MembershipDays
## 1: 2012-09-04 665
## 2: 2014-03-08 115
## 3: 2013-08-21 314
## 4: 2014-02-10 141
## 5: 2014-01-19 163
# Summarize the Education column (use the table function)
# Let's treat the Education column as ordinal categories and use simple integer levels for distance calculations
# Assuming the following order of degrees:
# HighSchool, Associate, Bachelor, Master, PhD
# factorize the Education column (hint: use the factor function with the above levels)
edu_values_bef <- as.data.frame(unique(df$Education))
# Using ordinal encoding for the Education column
# Note: with the default order = unique(x), the integer codes follow the order of first
# appearance in the data (Bachelor = 1, PhD = 2, Master = 3, HighSchool = 4, Associate = 5),
# not the degree order assumed above; to enforce that order, pass
# order = c("HighSchool", "Associate", "Bachelor", "Master", "PhD") explicitly.
encode_ordinal <- function(x, order = unique(x)) {
  as.numeric(factor(x, levels = order, exclude = NULL))
}
df[["Education"]] <- encode_ordinal(df[["Education"]])
edu_values <- as.data.frame(unique(df$Education))
print(paste("Values in Education before encoding:", edu_values_bef))
## [1] "Values in Education before encoding: c(\"Bachelor\", \"PhD\", \"Master\", \"HighSchool\", \"Associate\")"
print(paste("Values in Education after encoding:", edu_values))
## [1] "Values in Education after encoding: c(1, 2, 3, 4, 5)"
The Education column is now label-encoded as ordinal integers.
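The table() summary requested in the comment above is not shown; a minimal sketch using the original, un-encoded data kept in df_orig:
# frequency of each education level before encoding
table(df_orig$Education)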
# Summarize the Marital_Status column (use the table function)
# Let's convert the single Marital_Status column into 5 separate binary categories:
# Divorced, Married, Single, Together and Widow; the value will be 1 if the customer
# is in that category and 0 if not
# One-hot encoding for the Marital_Status column
before_enc <- as.data.frame(colnames(df))
enc_marital = factor(df$Marital_Status)
# model.matrix(~enc_marital)[,-1] drops the intercept column, leaving the first factor level
# (Divorced) as the reference, so 4 dummy columns (Married, Single, Together, Widow) are added
enc_df = as.data.frame(model.matrix(~enc_marital)[,-1])
df = cbind(df, enc_df)
after_enc <- as.data.frame(colnames(df))
print(paste("Columns in df before encoding:", before_enc))
## [1] "Columns in df before encoding: c(\"ID\", \"Year_Birth\", \"Education\", \"Marital_Status\", \"Income\", \"Kidhome\", \"Teenhome\", \"Dt_Customer\", \"Recency\", \"MntWines\", \"MntFruits\", \"MntMeatProducts\", \"MntFishProducts\", \"MntSweetProducts\", \"MntGoldProds\", \"NumWebPurchases\", \"NumStorePurchases\", \"Complain\", \"Age\", \"MembershipDays\")"
print(paste("Columns in df after encoding:", after_enc))
## [1] "Columns in df after encoding: c(\"ID\", \"Year_Birth\", \"Education\", \"Marital_Status\", \"Income\", \"Kidhome\", \"Teenhome\", \"Dt_Customer\", \"Recency\", \"MntWines\", \"MntFruits\", \"MntMeatProducts\", \"MntFishProducts\", \"MntSweetProducts\", \"MntGoldProds\", \"NumWebPurchases\", \"NumStorePurchases\", \"Complain\", \"Age\", \"MembershipDays\", \"enc_maritalMarried\", \"enc_maritalSingle\", \"enc_maritalTogether\", \"enc_maritalWidow\")"
The Marital_Status categories were encoded and added as individual binary columns in the dataset (Divorced is the implicit reference level, encoded as all zeros).
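Similarly, the table() summary of Marital_Status requested above could be produced from the original data; a minimal sketch:
# frequency of each marital status before one-hot encoding
table(df_orig$Marital_Status)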
# let's remove the columns we will no longer use:
# remove ID, Year_Birth, Dt_Customer, Marital_Status
# and save the result as df_sel
df <- subset(df, select = -c(ID,Dt_Customer,Year_Birth,Marital_Status))
# Convert Education to integers
df$Education = as.integer(df$Education)
df_sel <- df
head(df_sel,5)
## Education Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts
## 1: 1 58138 0 0 58 635 88 546
## 2: 1 46344 1 1 38 11 1 6
## 3: 1 71613 0 0 26 426 49 127
## 4: 1 26646 1 0 26 11 4 20
## 5: 2 58293 1 0 94 173 43 118
## MntFishProducts MntSweetProducts MntGoldProds NumWebPurchases
## 1: 172 88 88 8
## 2: 2 1 6 1
## 3: 111 21 42 8
## 4: 10 3 5 2
## 5: 46 27 15 5
## NumStorePurchases Complain Age MembershipDays enc_maritalMarried
## 1: 4 0 57 665 0
## 2: 2 0 60 115 0
## 3: 10 0 49 314 0
## 4: 4 0 30 141 0
## 5: 6 0 33 163 1
## enc_maritalSingle enc_maritalTogether enc_maritalWidow
## 1: 1 0 0
## 2: 1 0 0
## 3: 0 1 0
## 4: 0 1 0
## 5: 0 0 0
Removed the columns ID, Year_Birth, Dt_Customer and Marital_Status; the remaining columns are saved as df_sel.
# let's scale the data
# run the scale function on df_sel and save the result as df_scale
# these scaled values are what we will use for the analysis
df_scale <- as.data.frame(scale(df_sel))
head(df_scale,5)
## Education Income Kidhome Teenhome Recency MntWines MntFruits
## 1 -0.77967753 0.2339039 -0.8227362 -0.9281454 0.3082732 0.9766566 1.5488659
## 2 -0.77967753 -0.2341403 1.0393789 0.9090170 -0.3826166 -0.8711997 -0.6370558
## 3 -0.77967753 0.7686585 -0.8227362 -0.9281454 -0.7971505 0.3577432 0.5689700
## 4 -0.77967753 -1.0158542 1.0393789 -0.9281454 -0.7971505 -0.8711997 -0.5616792
## 5 0.01547836 0.2400551 1.0393789 -0.9281454 1.5518749 -0.3914678 0.4182168
## MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumWebPurchases
## 1 1.6879549 2.4630607 1.481888649 0.85482704 1.4304941
## 2 -0.7180699 -0.6514171 -0.634215838 -0.73267383 -1.1252228
## 3 -0.1789421 1.3455128 -0.147755036 -0.03572223 1.4304941
## 4 -0.6556915 -0.5048534 -0.585569758 -0.75203360 -0.7601204
## 5 -0.2190425 0.1546831 -0.001816796 -0.55843593 0.3351868
## NumStorePurchases Complain Age MembershipDays enc_maritalMarried
## 1 -0.5538715 -0.09794622 0.9853629 1.5300508 -0.7959829
## 2 -1.1683880 -0.09794622 1.2357656 -1.1889072 -0.7959829
## 3 1.2896778 -0.09794622 0.3176225 -0.2051387 -0.7959829
## 4 -0.5538715 -0.09794622 -1.2682609 -1.0603746 -0.7959829
## 5 0.0606449 -0.09794622 -1.0178582 -0.9516163 1.2557397
## enc_maritalSingle enc_maritalTogether enc_maritalWidow
## 1 1.9205079 -0.5916806 -0.1887179
## 2 1.9205079 -0.5916806 -0.1887179
## 3 -0.5204599 1.6893359 -0.1887179
## 4 -0.5204599 1.6893359 -0.1887179
## 5 -0.5204599 -0.5916806 -0.1887179
Principal Component Analysis produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated.
# Run PCA on df_scale, make biplot and scree plot/percentage variance explained plot
# which(apply(df_scale, 2, var)==0)
df_scale = df_scale[, which(apply(df_scale, 2, var) != 0)]
pca = prcomp(df_scale[, c(1:20)], center = TRUE, scale = TRUE)
ggbiplot(pca, scale = 0, labels = rownames(pca$x), color = TRUE) +
  geom_point(size = 0.5) +
  ggtitle("PCA Plot") +
  theme(plot.title = element_text(hjust = 0.5))
# proportion of variance explained (PVE) by each principal component
pr.var <- pca$sdev^2
pve <- pr.var / sum(pr.var)
# helper that maps a vector of labels to rainbow colours (kept for reference, not used below)
Cols <- function(vec) {
  cols <- rainbow(length(unique(vec)))
  return(cols[as.numeric(as.factor(vec))])
}
par(mfrow = c(1, 2))
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained", ylim = c(0, 1),
type = "b")
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
pr_out <- pca
library(factoextra)
# use the first two principal components as a 2-D representation of the data
data_transform = as.data.frame(pca$x[,1:2])
# elbow (within-cluster sum of squares) plot to pick k on the PCA scores
fviz_nbclust(data_transform, kmeans, method = 'wss')
# k-means with k = 3 on the PCA scores, visualized in the PC1/PC2 plane
kmeans_data = kmeans(data_transform, centers = 3, nstart = 50)
fviz_cluster(kmeans_data, data = data_transform)
# k-means on the scaled Education column alone, to see how customers split by education level
k_edu <- kmeans(df_scale$Education, 3, nstart = 50)
hist(k_edu$cluster, xlab = "Education Cluster", col = "cyan", border = "blue")
The scree plot above shows the proportion of variance explained by each principal component, obtained by dividing that component's variance by the total variance of the dataset; the companion plot shows the cumulative proportion.
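The same quantities can also be read directly from the fitted PCA object; a minimal sketch:
# standard deviation, proportion of variance and cumulative proportion for the first 5 PCs
summary(pr_out)$importance[, 1:5]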
km.out <- kmeans(df_scale, 2, nstart = 20)
km.out$cluster
## [1] 1 2 1 2 2 1 1 2 2 2 2 1 2 2 1 2 2 1 2 2 2 1 1 2 2 2 2 1 2 2 2 2 1 2 1 2 2
## [38] 1 1 2 2 2 1 2 2 1 1 1 2 1 1 1 1 2 2 1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 2 1
## [75] 2 2 2 2 1 2 2 2 1 2 2 2 2 1 2 1 1 2 2 1 1 1 2 2 1 2 1 1 1 1 1 2 2 1 1 2 2
## [112] 1 2 2 2 1 1 1 2 1 1 1 2 1 2 2 2 1 1 1 1 2 1 2 2 2 2 2 1 2 1 1 2 2 2 1 2 1
## [149] 2 1 1 2 1 2 1 2 2 2 2 2 2 1 1 2 2 1 2 2 1 2 2 2 2 1 1 2 2 1 2 2 2 2 1 1 1
## [186] 2 1 1 1 1 2 2 2 2 2 1 2 1 2 2 1 2 2 1 2 1 2 1 1 2 1 2 1 1 1 2 2 1 2 2 1 2
## [223] 2 1 2 2 1 1 2 1 1 2 1 1 1 1 2 2 1 2 1 2 1 2 2 2 2 1 2 2 2 2 1 2 1 2 1 2 2
## [260] 2 2 1 1 1 1 1 2 1 2 1 2 2 1 1 1 2 2 2 1 2 2 1 2 2 1 1 2 1 2 2 2 1 2 1 2 2
## [297] 2 2 1 2 2 2 2 2 1 2 2 1 1 1 2 2 2 2 2 2 1 2 2 1 1 2 1 1 1 2 1 2 2 1 2 1 2
## [334] 2 1 1 1 1 1 2 2 1 1 2 1 1 2 2 2 1 1 2 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2
## [371] 2 1 2 1 1 2 1 2 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 1 2 1 1 2 1 1 2 2 2 2
## [408] 1 1 2 1 1 2 1 1 1 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 2 2 1 2 1 2 2 1
## [445] 1 1 2 1 2 1 1 2 1 1 1 2 1 2 2 1 2 2 1 2 2 2 2 2 1 1 1 1 2 2 1 2 1 2 1 1 2
## [482] 1 1 1 2 2 2 1 2 1 1 1 2 1 2 1 2 1 2 1 2 2 1 1 2 1 2 1 2 2 1 2 1 2 1 1 2 2
## [519] 2 2 2 1 2 2 2 2 2 1 1 2 1 2 2 2 2 2 2 2 1 2 1 1 2 1 2 1 1 1 2 2 1 2 2 2 2
## [556] 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 1 2 2 2 2 2 2 2 2 1 1 1 2 2 2 2
## [593] 2 1 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 1 1 2 2 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1
## [630] 2 1 2 1 1 1 2 1 2 2 2 2 2 1 2 1 2 1 2 2 2 2 2 2 2 1 1 1 1 1 2 1 1 2 1 1 1
## [667] 1 2 1 1 1 1 1 1 2 1 2 2 2 2 2 2 1 1 1 1 1 1 2 1 2 1 1 2 2 1 2 1 2 1 1 2 1
## [704] 2 1 1 2 1 2 2 1 1 2 1 2 2 1 2 1 1 1 2 2 1 1 2 2 2 1 1 2 1 2 1 1 2 1 1 1 1
## [741] 1 1 2 2 2 1 1 2 1 2 1 1 2 2 1 1 2 2 2 2 2 1 2 1 1 2 2 2 2 2 2 1 2 2 1 1 2
## [778] 2 2 2 1 1 1 2 1 2 2 1 1 2 2 2 1 1 2 2 1 1 1 2 2 1 1 2 1 2 2 2 1 1 1 2 1 2
## [815] 2 1 2 2 2 1 2 1 2 1 2 2 2 2 1 1 1 1 2 2 2 1 1 2 2 1 2 1 2 1 2 2 2 2 2 2 2
## [852] 2 1 2 1 1 2 2 1 1 2 2 1 2 2 2 2 2 1 1 2 2 1 1 2 2 1 2 1 1 1 1 2 2 1 2 1 2
## [889] 2 1 1 2 2 2 1 1 1 2 1 1 1 1 2 1 2 1 2 2 1 2 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1
## [926] 1 1 1 2 1 1 2 2 1 2 2 2 2 2 2 1 1 2 2 1 2 2 2 2 1 1 2 2 1 1 2 2 1 1 1 1 2
## [963] 2 1 2 2 2 1 1 2 1 1 1 2 1 2 2 1 2 2 1 2 1 2 1 1 2 2 2 2 1 1 2 2 1 2 2 2 2
## [1000] 1 1 2 2 2 2 2 1 2 2 1 2 2 2 1 1 1 1 2 1 2 2 2 2 1 1 2 2 1 2 2 2 1 2 1 1 2
## [1037] 1 2 2 1 2 2 1 1 1 1 1 2 2 2 1 1 2 1 2 1 1 2 1 1 1 2 2 2 1 2 1 2 1 1 2 1 2
## [1074] 1 1 2 1 2 2 1 1 1 2 1 1 2 2 2 2 1 2 2 2 2 1 1 2 1 2 1 2 2 2 1 1 2 2 2 2 2
## [1111] 1 2 2 1 1 2 2 1 1 2 2 1 2 2 1 2 2 2 1 2 2 1 2 2 1 1 2 1 2 2 2 1 1 1 2 1 1
## [1148] 1 2 1 2 2 1 1 2 2 1 2 2 2 2 1 2 1 2 2 1 2 2 2 2 1 2 2 1 1 2 2 2 1 2 1 1 1
## [1185] 2 1 2 2 1 1 1 2 2 2 2 1 1 1 2 2 1 2 1 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 2 1 2
## [1222] 1 2 2 2 2 1 1 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 1 2 2 1 1 2 2 1 1 1 2 2
## [1259] 2 1 2 2 1 2 1 1 2 1 1 2 2 2 1 1 2 2 2 2 2 2 2 1 1 2 2 1 2 2 1 1 2 1 1 1 1
## [1296] 1 2 1 2 2 1 2 2 2 1 2 1 2 2 1 2 2 2 1 2 2 1 1 1 2 1 2 2 2 2 2 1 2 2 2 2 1
## [1333] 1 1 1 1 2 2 1 1 2 1 1 2 1 2 2 1 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2
## [1370] 1 2 2 2 2 2 1 2 2 1 2 2 1 2 2 2 2 2 1 1 2 1 1 2 1 2 1 2 2 2 2 2 1 1 2 2 2
## [1407] 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1 1 2 2 1 1 2 1 1 2 2 1 1 1 1 2
## [1444] 2 2 1 1 1 2 1 2 2 2 1 1 2 1 2 2 1 1 1 2 2 1 1 1 2 1 2 1 2 1 1 2 1 2 1 1 1
## [1481] 2 2 2 1 1 2 1 1 1 1 1 2 1 1 2 2 1 2 2 2 1 1 2 2 2 1 1 2 1 2 1 2 1 2 2 2 2
## [1518] 1 1 1 2 1 1 2 1 2 2 2 1 2 2 1 1 1 1 2 2 2 2 1 2 2 2 1 2 2 1 1 2 1 1 2 1 2
## [1555] 2 2 2 1 2 1 2 1 1 2 1 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2 2 1 1 2 2 2 1 2 2 2 1
## [1592] 2 1 2 2 1 2 2 2 1 2 2 1 2 2 2 1 1 2 2 1 2 2 2 2 1 1 1 2 2 2 1 1 2 1 2 2 1
## [1629] 1 2 2 1 2 1 2 2 2 1 1 1 2 2 1 2 2 2 1 2 1 1 2 1 1 1 1 2 2 2 2 2 1 2 2 2 2
## [1666] 2 1 1 1 1 1 1 2 2 2 1 2 1 2 1 1 2 2 1 2 2 1 2 1 2 1 1 2 1 2 2 1 2 2 1 2 1
## [1703] 1 1 2 2 2 2 1 1 2 2 2 1 1 1 1 1 2 2 2 2 2 1 1 1 2 1 1 1 1 2 2 1 2 2 2 2 2
## [1740] 1 1 2 1 1 2 1 2 1 2 2 2 2 1 1 2 2 2 2 2 1 2 2 1 1 2 2 2 2 1 1 2 2 1 2 2 2
## [1777] 1 2 1 1 1 1 2 2 2 1 1 2 1 1 2 1 1 1 1 2 1 1 1 2 1 2 2 1 1 2 2 1 1 2 2 2 1
## [1814] 2 2 2 1 2 1 1 2 1 2 1 2 1 2 2 2 2 1 2 1 1 1 1 2 2 1 1 1 2 1 1 1 1 2 2 2 1
## [1851] 2 1 2 1 2 2 1 1 1 1 1 2 2 1 2 2 2 1 1 2 1 1 2 1 1 2 2 1 1 2 2 2 2 2 2 1 1
## [1888] 2 2 2 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 2 1 1 1 1 2 1 2 2 1 2 1 1 2 2 1 2 2 1
## [1925] 2 1 1 1 2 2 2 1 1 1 2 1 2 2 1 2 1 1 2 2 2 1 1 1 1 1 1 2 2 2 2 1 1 2 2 2 2
## [1962] 2 1 2 2 2 2 2 2 2 1 2 1 1 2 1 1 1 2 2 2 2 2 2 2 2 1 2 2 1 1 2 1 1 1 2 2 2
## [1999] 2 2 2 2 2 1 1 2 2 2 1 2 1 1 2 1 2 2 1 1 2 1 1 1 2 2 2 2 2 1 1 1 2 2 1 2 2
## [2036] 2 1 1 1 2 1 2 1 1 2 1 2 2 1 1 2 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 2 1 2 1 2 1
## [2073] 1 2 1 2 1 2 2 2 1 2 1 1 1 2 2 2 1 2 1 1 2 2 2 2 1 2 1 1 1 1 2 2 1 2 2 2 2
## [2110] 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2 1 2 2 1 2 2 1 1 1 2 1 1 1 1
## [2147] 1 1 1 1 2 2 2 2 2 2 1 1 1 1 2 1 2 2 1 1 2 2 2 2 2 2 1 1 2 2 1 2 2 2 1 1 2
## [2184] 1 2 2 2 1 2 2 1 1 2 2 2 1 1 1 2 2 1 2 1 2 1 2 1 1 2
sum(km.out$cluster == 1)
## [1] 980
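# Both cluster sizes can be checked at once with table() (a minimal sketch):
table(km.out$cluster)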
# Finding the optimal k for clustering:
km_out_list <- lapply(1:10, function(k) list(
k=k,
km_out=kmeans(df_scale, k, nstart = 20)))
km_results <- data.frame(
k=sapply(km_out_list, function(k) k$k),
totss=sapply(km_out_list, function(k) k$km_out$totss),
tot_withinss=sapply(km_out_list, function(k) k$km_out$tot.withinss)
)
km_results
## k totss tot_withinss
## 1 1 44160 44160.00
## 2 2 44160 35269.93
## 3 3 44160 32637.74
## 4 4 44160 30666.83
## 5 5 44160 29348.63
## 6 6 44160 28148.74
## 7 7 44160 26777.93
## 8 8 44160 24773.63
## 9 9 44160 25047.47
## 10 10 44160 23163.02
plot_ly(km_results,x=~k,y=~tot_withinss) %>% layout(title = "Optimal k using the elbow curve method") %>% add_markers() %>% add_paths()
# Choosing k: gap statistics:
suppressWarnings(gap_kmeans <- clusGap(df_scale, kmeans, nstart = 25, K.max = 10, B = 100))
plot(gap_kmeans, main = "Gap Statistic: kmeans")
# Choosing k: silhouette:
par(mar = c(5, 2, 4, 2), mfrow=c(2,2))
for(k in c(2,3,4,9)) {
kmeans_cluster <- kmeans(df_scale, k, nstart=20)
si <- silhouette(kmeans_cluster$cluster, dist = dist(df_scale))
plot(si,main="")
}
silhouette_score <- function(k){
km <- kmeans(df_scale, centers = k, nstart=25)
ss <- silhouette(km$cluster, dist(df_scale))
mean(ss[, 3])
}
k <- 2:10
avg_sil <- sapply(k, silhouette_score)
plot(k, avg_sil, type='b', xlab='Number of clusters', ylab='Average Silhouette Scores', frame=FALSE)
fviz_nbclust(df_scale, kmeans, method='silhouette') + ggtitle( "Finding Optimal 'k' using silhouette") + theme(plot.title = element_text(hjust = 0.5))
Approaches to finding the optimal k:

* Elbow curve method: vary the number of clusters k from 1 to n (usually 10) and, for each k, compute the WCSS (within-cluster sum of squares), i.e. the sum of squared distances between each point and the centroid of its cluster. Plotted against k, the WCSS curve looks like an elbow: it is largest at k = 1, drops rapidly as k increases, and then flattens out almost parallel to the x-axis. The k at that bend is the optimal number of clusters. The k identified by the elbow curve method is 2.
* Gap statistics method: this technique takes the output of any clustering algorithm and compares the change in within-cluster dispersion with the change expected under an appropriate reference null distribution. The k identified by the gap statistics method is 4.
* Average silhouette method: for each k, compute the average silhouette width, which measures how well each observation fits its own cluster relative to the nearest neighbouring cluster; the k that maximizes the average silhouette is chosen. The k identified by the silhouette method is 2.
Since two of the three approaches (the elbow curve and the silhouette method) identify the optimal k as 2, we stick with k = 2.
The final optimal k is 2.
km_out <- kmeans(df_scale, centers = 2, nstart = 25)  # final model with the chosen k = 2
km_2 <- kmeans(df_scale, centers = 4, nstart = 25)    # k = 4, for comparison
km_3 <- kmeans(df_scale, centers = 5, nstart = 25)    # k = 5, for comparison
km_4 <- kmeans(df_scale, centers = 3, nstart = 25)    # k = 3, for comparison
km <- cbind(pr_out$x[,1], pr_out$x[,2])               # first two PC scores (kept for reference)
fviz_cluster(km_out, data = df_scale)
fviz_cluster(km_2, data = df_scale)
fviz_cluster(km_3, data = df_scale)
fviz_cluster(km_4, data = df_scale)
# plots to compare
p1 <- fviz_cluster(km_out, geom = "point", data = df_scale) + ggtitle("Clustering with k = 2") + theme(plot.title = element_text(hjust = 0.5))
p2 <- fviz_cluster(km_2, geom = "point", data = df_scale) + ggtitle("Clustering with k = 4") + theme(plot.title = element_text(hjust = 0.5))
p3 <- fviz_cluster(km_3, geom = "point", data = df_scale) + ggtitle("Clustering with k = 5") + theme(plot.title = element_text(hjust = 0.5))
p4 <- fviz_cluster(km_4, geom = "point", data = df_scale) + ggtitle("Clustering with k = 3") + theme(plot.title = element_text(hjust = 0.5))
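# The four comparison plots p1-p4 are created above but never displayed; a minimal sketch of
# arranging them in a 2x2 grid (assumes the gridExtra package is installed):
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)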
df_scale$Cluster <- km_out$cluster
head(df_scale,5)
## Education Income Kidhome Teenhome Recency MntWines MntFruits
## 1 -0.77967753 0.2339039 -0.8227362 -0.9281454 0.3082732 0.9766566 1.5488659
## 2 -0.77967753 -0.2341403 1.0393789 0.9090170 -0.3826166 -0.8711997 -0.6370558
## 3 -0.77967753 0.7686585 -0.8227362 -0.9281454 -0.7971505 0.3577432 0.5689700
## 4 -0.77967753 -1.0158542 1.0393789 -0.9281454 -0.7971505 -0.8711997 -0.5616792
## 5 0.01547836 0.2400551 1.0393789 -0.9281454 1.5518749 -0.3914678 0.4182168
## MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumWebPurchases
## 1 1.6879549 2.4630607 1.481888649 0.85482704 1.4304941
## 2 -0.7180699 -0.6514171 -0.634215838 -0.73267383 -1.1252228
## 3 -0.1789421 1.3455128 -0.147755036 -0.03572223 1.4304941
## 4 -0.6556915 -0.5048534 -0.585569758 -0.75203360 -0.7601204
## 5 -0.2190425 0.1546831 -0.001816796 -0.55843593 0.3351868
## NumStorePurchases Complain Age MembershipDays enc_maritalMarried
## 1 -0.5538715 -0.09794622 0.9853629 1.5300508 -0.7959829
## 2 -1.1683880 -0.09794622 1.2357656 -1.1889072 -0.7959829
## 3 1.2896778 -0.09794622 0.3176225 -0.2051387 -0.7959829
## 4 -0.5538715 -0.09794622 -1.2682609 -1.0603746 -0.7959829
## 5 0.0606449 -0.09794622 -1.0178582 -0.9516163 1.2557397
## enc_maritalSingle enc_maritalTogether enc_maritalWidow Cluster
## 1 1.9205079 -0.5916806 -0.1887179 2
## 2 1.9205079 -0.5916806 -0.1887179 1
## 3 -0.5204599 1.6893359 -0.1887179 2
## 4 -0.5204599 1.6893359 -0.1887179 1
## 5 -0.5204599 -0.5916806 -0.1887179 1
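To characterize the segments (part of the stated goal), the per-cluster averages of the original, unscaled features can be compared; a minimal sketch using df_sel and the k = 2 k-means result km_out:
# average value of each (unscaled) feature within each k-means cluster
cluster_profile <- aggregate(df_sel, by = list(Cluster = km_out$cluster), FUN = mean)
round(t(cluster_profile), 1)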
Linkage defines how the distance between clusters containing multiple data points is calculated. The different linkage methods are as follows:

* Complete: the largest distance between elements of the two clusters
* Single: the smallest distance between elements of the two clusters
* Average: the average dissimilarity between all elements of the two clusters
* Centroid: the dissimilarity between the centroids of the two clusters
The number of clusters obtained using the hierarchical method is effectively only one.
* Single linkage is fast and can perform well on non-globular data, but it performs poorly in the presence of noise.
* Average and complete linkage perform well on cleanly separated globular clusters, but have mixed results otherwise.
* Ward linkage is the most effective method for noisy data (a Ward sketch is included after the clustering code below).
suppressPlotlyMessage <- function(p) {
suppressMessages(plotly_build(p))
}
# exclude the k-means Cluster label added to df_scale above, so it does not leak into the distance matrix
df_hc <- df_scale[, setdiff(names(df_scale), "Cluster")]
hc.complete <- hclust(dist(df_hc), method = "complete")
hc.single <- hclust(dist(df_hc), method = "single")
hc.average <- hclust(dist(df_hc), method = "average")
df_scale_ct <- cutree(hc.complete, k=2)
df_scale_sg <- cutree(hc.single, k=2)
df_scale_ag <- cutree(hc.average, k=2)
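# Cluster sizes per linkage (a sketch) show how unbalanced the hierarchical solutions are:
table(df_scale_ct); table(df_scale_sg); table(df_scale_ag)
# Ward linkage, mentioned above but not run here, could be added the same way
# (a sketch; method = "ward.D2" implements Ward's criterion on Euclidean distances):
hc.ward <- hclust(dist(df_hc), method = "ward.D2")
df_scale_wd <- cutree(hc.ward, k = 2)
table(df_scale_wd)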
par(mfrow = c(1, 3))
suppressMessages(plot_ly(x = ~pr_out$x[,1], y = ~pr_out$x[,2], color = as.factor(df_scale_ct), colors = c("red","blue","green"), type = "scatter", mode = "markers"))
suppressMessages(plot_ly(x = ~pr_out$x[,1], y = ~pr_out$x[,2], color = as.factor(df_scale_sg), colors = c("red","blue","green"), type = "scatter", mode = "markers"))
suppressMessages(plot_ly(x = ~pr_out$x[,1], y = ~pr_out$x[,2], color = as.factor(df_scale_ag), colors = c("red","blue","green"), type = "scatter", mode = "markers"))
k-means clustering partitions the marketing campaign dataset more cleanly than hierarchical clustering; with k = 2, the k-means segmentation is the more accurate one.
suppressMessages(plot_ly(x = pr_out$x[,1], y = pr_out$x[,2], color = as.factor(km_out$cluster), colors = c("red","blue"), type = "scatter", mode = "markers"))