Problem Statement

The advertising division of a large club store needs to analyze the store's customers in order to create a segmentation for more targeted marketing campaigns.

Our goal is to identify similar customers and characterize them (at least some of them); in other words, to perform clustering and identify customer segments.

Column descriptions:
People
  ID: Customer's unique identifier
  Year_Birth: Customer's birth year
  Education: Customer's education level
  Marital_Status: Customer's marital status
  Income: Customer's yearly household income
  Kidhome: Number of children in customer's household
  Teenhome: Number of teenagers in customer's household
  Dt_Customer: Date of customer's enrollment with the company
  Recency: Number of days since customer's last purchase
  Complain: 1 if the customer complained in the last 2 years, 0 otherwise

Products

  MntWines: Amount spent on wine in last 2 years
  MntFruits: Amount spent on fruits in last 2 years
  MntMeatProducts: Amount spent on meat in last 2 years
  MntFishProducts: Amount spent on fish in last 2 years
  MntSweetProducts: Amount spent on sweets in last 2 years
  MntGoldProds: Amount spent on gold in last 2 years

Place
  NumWebPurchases: Number of purchases made through the company’s website
  NumStorePurchases: Number of purchases made directly in stores

Assume that the data was current as of 2014-07-01.

1. Read Dataset and Convert Data to Proper Format

Read “marketing_campaign.csv” using the data.table::fread command and examine the data.

The fread function of data.table reads CSV files very quickly.

library(data.table)

# fread marketing_campaign.csv and save it as df
df <- fread("marketing_campaign.csv")
df_orig <- copy(df)  # copy() so later changes to df leave df_orig untouched

# Checking for irregular data (missing values):
sum(is.na(df))
## [1] 0
sum(is.null(df))
## [1] 0
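# Note: is.null() tests the whole object rather than individual elements, so it
# cannot detect missing values; a per-column check would look like this (sketch):
colSums(is.na(df))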
# Getting to know the data
head(df,10)
##       ID Year_Birth Education Marital_Status Income Kidhome Teenhome
##  1: 5524       1957  Bachelor         Single  58138       0        0
##  2: 2174       1954  Bachelor         Single  46344       1        1
##  3: 4141       1965  Bachelor       Together  71613       0        0
##  4: 6182       1984  Bachelor       Together  26646       1        0
##  5: 5324       1981       PhD        Married  58293       1        0
##  6: 7446       1967    Master       Together  62513       0        1
##  7:  965       1971  Bachelor       Divorced  55635       0        1
##  8: 6177       1985       PhD        Married  33454       1        0
##  9: 4855       1974       PhD       Together  30351       1        0
## 10: 5899       1950       PhD       Together   5648       1        1
##     Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts
##  1:  04-09-2012      58      635        88             546             172
##  2:  08-03-2014      38       11         1               6               2
##  3:  21-08-2013      26      426        49             127             111
##  4:  10-02-2014      26       11         4              20              10
##  5:  19-01-2014      94      173        43             118              46
##  6:  09-09-2013      16      520        42              98               0
##  7:  13-11-2012      34      235        65             164              50
##  8:  08-05-2013      32       76        10              56               3
##  9:  06-06-2013      19       14         0              24               3
## 10:  13-03-2014      68       28         0               6               1
##     MntSweetProducts MntGoldProds NumWebPurchases NumStorePurchases Complain
##  1:               88           88               8                 4        0
##  2:                1            6               1                 2        0
##  3:               21           42               8                10        0
##  4:                3            5               2                 4        0
##  5:               27           15               5                 6        0
##  6:               42           14               6                10        0
##  7:               49           27               7                 7        0
##  8:                1           23               4                 4        0
##  9:                3            2               3                 2        0
## 10:                1           13               1                 0        0
summary(df)
##        ID          Year_Birth    Education         Marital_Status    
##  Min.   :    0   Min.   :1893   Length:2209        Length:2209       
##  1st Qu.: 2826   1st Qu.:1959   Class :character   Class :character  
##  Median : 5462   Median :1970   Mode  :character   Mode  :character  
##  Mean   : 5592   Mean   :1969                                        
##  3rd Qu.: 8427   3rd Qu.:1977                                        
##  Max.   :11191   Max.   :1996                                        
##      Income          Kidhome          Teenhome      Dt_Customer       
##  Min.   :  1730   Min.   :0.0000   Min.   :0.0000   Length:2209       
##  1st Qu.: 35246   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median : 51390   Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   : 52244   Mean   :0.4418   Mean   :0.5052                     
##  3rd Qu.: 68627   3rd Qu.:1.0000   3rd Qu.:1.0000                     
##  Max.   :666666   Max.   :2.0000   Max.   :2.0000                     
##     Recency         MntWines        MntFruits      MntMeatProducts 
##  Min.   : 0.00   Min.   :   0.0   Min.   :  0.00   Min.   :   0.0  
##  1st Qu.:24.00   1st Qu.:  24.0   1st Qu.:  2.00   1st Qu.:  16.0  
##  Median :49.00   Median : 174.0   Median :  8.00   Median :  68.0  
##  Mean   :49.08   Mean   : 305.2   Mean   : 26.35   Mean   : 167.2  
##  3rd Qu.:74.00   3rd Qu.: 505.0   3rd Qu.: 33.00   3rd Qu.: 233.0  
##  Max.   :99.00   Max.   :1493.0   Max.   :199.00   Max.   :1725.0  
##  MntFishProducts  MntSweetProducts  MntGoldProds    NumWebPurchases 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.:  3.00   1st Qu.:  1.00   1st Qu.:  9.00   1st Qu.: 2.000  
##  Median : 12.00   Median :  8.00   Median : 24.00   Median : 4.000  
##  Mean   : 37.56   Mean   : 27.07   Mean   : 43.85   Mean   : 4.082  
##  3rd Qu.: 50.00   3rd Qu.: 33.00   3rd Qu.: 56.00   3rd Qu.: 6.000  
##  Max.   :259.00   Max.   :262.00   Max.   :321.00   Max.   :27.000  
##  NumStorePurchases    Complain       
##  Min.   : 0.000    Min.   :0.000000  
##  1st Qu.: 3.000    1st Qu.:0.000000  
##  Median : 5.000    Median :0.000000  
##  Mean   : 5.803    Mean   :0.009507  
##  3rd Qu.: 8.000    3rd Qu.:0.000000  
##  Max.   :13.000    Max.   :1.000000
# Convert Year_Birth to Age (assume that the current date is 2014-07-01)
current_date <- as.Date("2014-07-01")
df$Age <- as.integer(2014 - df$Year_Birth)

# Dt_Customer is a date (still stored as character); convert it to membership days (name it MembershipDays)
# hint: note the European date format and use as.Date with the proper format argument
df$Dt_Customer <- as.Date(df$Dt_Customer, format = "%d-%m-%Y")
df$MembershipDays <- as.numeric(difftime(current_date, df$Dt_Customer, units = "days"))
head(df[,c("Year_Birth","Age")],5)
##    Year_Birth Age
## 1:       1957  57
## 2:       1954  60
## 3:       1965  49
## 4:       1984  30
## 5:       1981  33
head(df[,c("Dt_Customer","MembershipDays")],5)
##    Dt_Customer MembershipDays
## 1:  2012-09-04            665
## 2:  2014-03-08            115
## 3:  2013-08-21            314
## 4:  2014-02-10            141
## 5:  2014-01-19            163
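As a quick sanity check on the first row, the enrollment date 2012-09-04 is indeed 665 days before the assumed current date:

# Worked check for row 1 (sketch)
as.numeric(difftime(as.Date("2014-07-01"), as.Date("2012-09-04"), units = "days"))
# expected: 665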
# Summarize the Education column (use the table function)
# Let's treat the Education column as ordinal categories and use simple integer levels for distance calculations
# Assuming the following order of degrees:
#    HighSchool, Associate, Bachelor, Master, PhD
# factorize the Education column (hint: use the factor function with the above levels)

edu_values_bef <- as.data.frame(unique(df$Education))
# Using ordinal encoding for the Education column
# (with the default order = unique(x), codes follow order of appearance; see the note below)
encode_ordinal <- function(x, order = unique(x)) {
  as.numeric(factor(x, levels = order, exclude = NULL))
}

df[["Education"]] <- encode_ordinal(df[["Education"]])
edu_values <- as.data.frame(unique(df$Education))

print(paste("Values in Education before encoding:", edu_values_bef))
## [1] "Values in Education before encoding: c(\"Bachelor\", \"PhD\", \"Master\", \"HighSchool\", \"Associate\")"
print(paste("Values in Education after encoding:", edu_values))
## [1] "Values in Education after encoding: c(1, 2, 3, 4, 5)"

The Education column of the dataset is now label (ordinal) encoded.

# Summarize the Marital_Status column (use the table function)
# Let's convert the single Marital_Status column into separate binary categories
# (Divorced, Married, Single, Together, Widow): the value is 1 if the customer
# is in that category and 0 otherwise
# One-hot encoding for the Marital_Status column

before_enc <- as.data.frame(colnames(df))
enc_marital <- factor(df$Marital_Status)
# model.matrix builds an intercept plus 4 dummies (the first level, Divorced,
# is the baseline); [,-1] drops the intercept column
enc_df <- as.data.frame(model.matrix(~enc_marital)[,-1])
df <- cbind(df, enc_df)
after_enc <- as.data.frame(colnames(df))

print(paste("Columns in df before encoding:", before_enc))
## [1] "Columns in df before encoding: c(\"ID\", \"Year_Birth\", \"Education\", \"Marital_Status\", \"Income\", \"Kidhome\", \"Teenhome\", \"Dt_Customer\", \"Recency\", \"MntWines\", \"MntFruits\", \"MntMeatProducts\", \"MntFishProducts\", \"MntSweetProducts\", \"MntGoldProds\", \"NumWebPurchases\", \"NumStorePurchases\", \"Complain\", \"Age\", \"MembershipDays\")"
print(paste("Columns in df after encoding:", after_enc))
## [1] "Columns in df after encoding: c(\"ID\", \"Year_Birth\", \"Education\", \"Marital_Status\", \"Income\", \"Kidhome\", \"Teenhome\", \"Dt_Customer\", \"Recency\", \"MntWines\", \"MntFruits\", \"MntMeatProducts\", \"MntFishProducts\", \"MntSweetProducts\", \"MntGoldProds\", \"NumWebPurchases\", \"NumStorePurchases\", \"Complain\", \"Age\", \"MembershipDays\", \"enc_maritalMarried\", \"enc_maritalSingle\", \"enc_maritalTogether\", \"enc_maritalWidow\")"

Encoded the Marital_Status column and added the resulting indicator columns to the dataset. Note that only four columns were added: Divorced is the reference level, represented by zeros in all four indicators.
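For reference, the column can be summarized with table() as the instructions suggest, and if all five indicator columns were wanted (including an explicit Divorced column) the intercept could be suppressed instead; a sketch, not run here (enc_df_all is a hypothetical name):

# Summarize the original column with table():
table(df_orig$Marital_Status)

# Sketch: one-hot encode all 5 levels by dropping the intercept
enc_df_all <- as.data.frame(model.matrix(~ Marital_Status - 1, data = df_orig))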

# Let's remove the columns which we will no longer use:
# ID, Year_Birth, Dt_Customer, Marital_Status
# and save the result as df_sel
df <- subset(df, select = -c(ID,Dt_Customer,Year_Birth,Marital_Status))

# Convert Education to integers 
df$Education = as.integer(df$Education)
df_sel <- df
head(df_sel,5)
##    Education Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts
## 1:         1  58138       0        0      58      635        88             546
## 2:         1  46344       1        1      38       11         1               6
## 3:         1  71613       0        0      26      426        49             127
## 4:         1  26646       1        0      26       11         4              20
## 5:         2  58293       1        0      94      173        43             118
##    MntFishProducts MntSweetProducts MntGoldProds NumWebPurchases
## 1:             172               88           88               8
## 2:               2                1            6               1
## 3:             111               21           42               8
## 4:              10                3            5               2
## 5:              46               27           15               5
##    NumStorePurchases Complain Age MembershipDays enc_maritalMarried
## 1:                 4        0  57            665                  0
## 2:                 2        0  60            115                  0
## 3:                10        0  49            314                  0
## 4:                 4        0  30            141                  0
## 5:                 6        0  33            163                  1
##    enc_maritalSingle enc_maritalTogether enc_maritalWidow
## 1:                 1                   0                0
## 2:                 1                   0                0
## 3:                 0                   1                0
## 4:                 0                   1                0
## 5:                 0                   0                0

Removed the columns ID, Year_Birth, Dt_Customer, and Marital_Status.

# Let's scale:
# run the scale function on df_sel and save the result as df_scale
# these scaled values are what we will use for the analysis
df_scale <- as.data.frame(scale(df_sel))
head(df_scale,5)
##     Education     Income    Kidhome   Teenhome    Recency   MntWines  MntFruits
## 1 -0.77967753  0.2339039 -0.8227362 -0.9281454  0.3082732  0.9766566  1.5488659
## 2 -0.77967753 -0.2341403  1.0393789  0.9090170 -0.3826166 -0.8711997 -0.6370558
## 3 -0.77967753  0.7686585 -0.8227362 -0.9281454 -0.7971505  0.3577432  0.5689700
## 4 -0.77967753 -1.0158542  1.0393789 -0.9281454 -0.7971505 -0.8711997 -0.5616792
## 5  0.01547836  0.2400551  1.0393789 -0.9281454  1.5518749 -0.3914678  0.4182168
##   MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumWebPurchases
## 1       1.6879549       2.4630607      1.481888649   0.85482704       1.4304941
## 2      -0.7180699      -0.6514171     -0.634215838  -0.73267383      -1.1252228
## 3      -0.1789421       1.3455128     -0.147755036  -0.03572223       1.4304941
## 4      -0.6556915      -0.5048534     -0.585569758  -0.75203360      -0.7601204
## 5      -0.2190425       0.1546831     -0.001816796  -0.55843593       0.3351868
##   NumStorePurchases    Complain        Age MembershipDays enc_maritalMarried
## 1        -0.5538715 -0.09794622  0.9853629      1.5300508         -0.7959829
## 2        -1.1683880 -0.09794622  1.2357656     -1.1889072         -0.7959829
## 3         1.2896778 -0.09794622  0.3176225     -0.2051387         -0.7959829
## 4        -0.5538715 -0.09794622 -1.2682609     -1.0603746         -0.7959829
## 5         0.0606449 -0.09794622 -1.0178582     -0.9516163          1.2557397
##   enc_maritalSingle enc_maritalTogether enc_maritalWidow
## 1         1.9205079          -0.5916806       -0.1887179
## 2         1.9205079          -0.5916806       -0.1887179
## 3        -0.5204599           1.6893359       -0.1887179
## 4        -0.5204599           1.6893359       -0.1887179
## 5        -0.5204599          -0.5916806       -0.1887179
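scale() standardizes each column to zero mean and unit standard deviation. A quick sanity check on one column (a sketch):

# Income in df_scale should equal its manual z-score
all.equal(df_scale$Income,
          (df_sel$Income - mean(df_sel$Income)) / sd(df_sel$Income))
# expected: TRUE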

2. Run PCA

Principal Component Analysis produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated.

# Run PCA on df_scale, make a biplot and a scree plot / percentage-of-variance-explained plot
library(ggplot2)
library(ggbiplot)

# Drop any zero-variance columns before PCA (none are dropped here)
df_scale <- df_scale[, which(apply(df_scale, 2, var) != 0)]

pca <- prcomp(df_scale, center = TRUE, scale. = TRUE)

ggbiplot(pca, scale = 0, labels = rownames(pca$x)) +
  geom_point(size = 0.5) +
  ggtitle("PCA Plot") +
  theme(plot.title = element_text(hjust = 0.5))

pr.var <- pca$sdev^2
pve <- pr.var / sum(pr.var)
Cols <- function(vec) {
   cols <- rainbow(length(unique(vec)))
   return(cols[as.numeric(as.factor(vec))])
 }
par(mfrow = c(1, 2))
plot(pve, xlab = "Principal Component",
    ylab = "Proportion of Variance Explained", ylim = c(0, 1),
    type = "b")
plot(cumsum(pve), xlab = "Principal Component",
    ylab = "Cumulative Proportion of Variance Explained",
    ylim = c(0, 1), type = "b")

pr_out <- pca
library(factoextra)

# Quick k-means on the first two principal components
data_transform <- as.data.frame(pca$x[,1:2])
fviz_nbclust(data_transform, kmeans, method = 'wss')

kmeans_data <- kmeans(data_transform, centers = 3, nstart = 50)
fviz_cluster(kmeans_data, data = data_transform)

# K-means on the Education column alone (3 clusters), with a histogram of the cluster assignments
k_edu <- kmeans(df_scale$Education, 3, nstart = 50)

hist(k_edu$cluster, xlab = "Education Cluster", col = "cyan", border = "blue")
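Note that kmeans uses random initialization, so cluster labels (and occasionally the solutions themselves) can vary between runs. For reproducibility one would typically fix the seed before each call; a sketch (the seed value is arbitrary):

set.seed(42)  # any fixed value makes the k-means runs reproducible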

The scree plot above shows the proportion of variance explained (PVE) by each principal component, obtained by dividing the variance of each principal component by the total variance of the dataset.

3. Cluster with K-Means

3.1 Selecting Number of Clusters

km.out <- kmeans(df_scale, 2, nstart = 20)
km.out$cluster
##    [1] 1 2 1 2 2 1 1 2 2 2 2 1 2 2 1 2 2 1 2 2 2 1 1 2 2 2 2 1 2 2 2 2 1 2 1 2 2
##   [38] 1 1 2 2 2 1 2 2 1 1 1 2 1 1 1 1 2 2 1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 2 1
##   [75] 2 2 2 2 1 2 2 2 1 2 2 2 2 1 2 1 1 2 2 1 1 1 2 2 1 2 1 1 1 1 1 2 2 1 1 2 2
##  [112] 1 2 2 2 1 1 1 2 1 1 1 2 1 2 2 2 1 1 1 1 2 1 2 2 2 2 2 1 2 1 1 2 2 2 1 2 1
##  [149] 2 1 1 2 1 2 1 2 2 2 2 2 2 1 1 2 2 1 2 2 1 2 2 2 2 1 1 2 2 1 2 2 2 2 1 1 1
##  [186] 2 1 1 1 1 2 2 2 2 2 1 2 1 2 2 1 2 2 1 2 1 2 1 1 2 1 2 1 1 1 2 2 1 2 2 1 2
##  [223] 2 1 2 2 1 1 2 1 1 2 1 1 1 1 2 2 1 2 1 2 1 2 2 2 2 1 2 2 2 2 1 2 1 2 1 2 2
##  [260] 2 2 1 1 1 1 1 2 1 2 1 2 2 1 1 1 2 2 2 1 2 2 1 2 2 1 1 2 1 2 2 2 1 2 1 2 2
##  [297] 2 2 1 2 2 2 2 2 1 2 2 1 1 1 2 2 2 2 2 2 1 2 2 1 1 2 1 1 1 2 1 2 2 1 2 1 2
##  [334] 2 1 1 1 1 1 2 2 1 1 2 1 1 2 2 2 1 1 2 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2
##  [371] 2 1 2 1 1 2 1 2 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 1 2 1 1 2 1 1 2 2 2 2
##  [408] 1 1 2 1 1 2 1 1 1 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 2 2 1 2 1 2 2 1
##  [445] 1 1 2 1 2 1 1 2 1 1 1 2 1 2 2 1 2 2 1 2 2 2 2 2 1 1 1 1 2 2 1 2 1 2 1 1 2
##  [482] 1 1 1 2 2 2 1 2 1 1 1 2 1 2 1 2 1 2 1 2 2 1 1 2 1 2 1 2 2 1 2 1 2 1 1 2 2
##  [519] 2 2 2 1 2 2 2 2 2 1 1 2 1 2 2 2 2 2 2 2 1 2 1 1 2 1 2 1 1 1 2 2 1 2 2 2 2
##  [556] 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 1 2 2 2 2 2 2 2 2 1 1 1 2 2 2 2
##  [593] 2 1 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 1 1 2 2 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1
##  [630] 2 1 2 1 1 1 2 1 2 2 2 2 2 1 2 1 2 1 2 2 2 2 2 2 2 1 1 1 1 1 2 1 1 2 1 1 1
##  [667] 1 2 1 1 1 1 1 1 2 1 2 2 2 2 2 2 1 1 1 1 1 1 2 1 2 1 1 2 2 1 2 1 2 1 1 2 1
##  [704] 2 1 1 2 1 2 2 1 1 2 1 2 2 1 2 1 1 1 2 2 1 1 2 2 2 1 1 2 1 2 1 1 2 1 1 1 1
##  [741] 1 1 2 2 2 1 1 2 1 2 1 1 2 2 1 1 2 2 2 2 2 1 2 1 1 2 2 2 2 2 2 1 2 2 1 1 2
##  [778] 2 2 2 1 1 1 2 1 2 2 1 1 2 2 2 1 1 2 2 1 1 1 2 2 1 1 2 1 2 2 2 1 1 1 2 1 2
##  [815] 2 1 2 2 2 1 2 1 2 1 2 2 2 2 1 1 1 1 2 2 2 1 1 2 2 1 2 1 2 1 2 2 2 2 2 2 2
##  [852] 2 1 2 1 1 2 2 1 1 2 2 1 2 2 2 2 2 1 1 2 2 1 1 2 2 1 2 1 1 1 1 2 2 1 2 1 2
##  [889] 2 1 1 2 2 2 1 1 1 2 1 1 1 1 2 1 2 1 2 2 1 2 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1
##  [926] 1 1 1 2 1 1 2 2 1 2 2 2 2 2 2 1 1 2 2 1 2 2 2 2 1 1 2 2 1 1 2 2 1 1 1 1 2
##  [963] 2 1 2 2 2 1 1 2 1 1 1 2 1 2 2 1 2 2 1 2 1 2 1 1 2 2 2 2 1 1 2 2 1 2 2 2 2
## [1000] 1 1 2 2 2 2 2 1 2 2 1 2 2 2 1 1 1 1 2 1 2 2 2 2 1 1 2 2 1 2 2 2 1 2 1 1 2
## [1037] 1 2 2 1 2 2 1 1 1 1 1 2 2 2 1 1 2 1 2 1 1 2 1 1 1 2 2 2 1 2 1 2 1 1 2 1 2
## [1074] 1 1 2 1 2 2 1 1 1 2 1 1 2 2 2 2 1 2 2 2 2 1 1 2 1 2 1 2 2 2 1 1 2 2 2 2 2
## [1111] 1 2 2 1 1 2 2 1 1 2 2 1 2 2 1 2 2 2 1 2 2 1 2 2 1 1 2 1 2 2 2 1 1 1 2 1 1
## [1148] 1 2 1 2 2 1 1 2 2 1 2 2 2 2 1 2 1 2 2 1 2 2 2 2 1 2 2 1 1 2 2 2 1 2 1 1 1
## [1185] 2 1 2 2 1 1 1 2 2 2 2 1 1 1 2 2 1 2 1 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 2 1 2
## [1222] 1 2 2 2 2 1 1 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 1 2 2 1 1 2 2 1 1 1 2 2
## [1259] 2 1 2 2 1 2 1 1 2 1 1 2 2 2 1 1 2 2 2 2 2 2 2 1 1 2 2 1 2 2 1 1 2 1 1 1 1
## [1296] 1 2 1 2 2 1 2 2 2 1 2 1 2 2 1 2 2 2 1 2 2 1 1 1 2 1 2 2 2 2 2 1 2 2 2 2 1
## [1333] 1 1 1 1 2 2 1 1 2 1 1 2 1 2 2 1 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2
## [1370] 1 2 2 2 2 2 1 2 2 1 2 2 1 2 2 2 2 2 1 1 2 1 1 2 1 2 1 2 2 2 2 2 1 1 2 2 2
## [1407] 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1 1 2 2 1 1 2 1 1 2 2 1 1 1 1 2
## [1444] 2 2 1 1 1 2 1 2 2 2 1 1 2 1 2 2 1 1 1 2 2 1 1 1 2 1 2 1 2 1 1 2 1 2 1 1 1
## [1481] 2 2 2 1 1 2 1 1 1 1 1 2 1 1 2 2 1 2 2 2 1 1 2 2 2 1 1 2 1 2 1 2 1 2 2 2 2
## [1518] 1 1 1 2 1 1 2 1 2 2 2 1 2 2 1 1 1 1 2 2 2 2 1 2 2 2 1 2 2 1 1 2 1 1 2 1 2
## [1555] 2 2 2 1 2 1 2 1 1 2 1 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2 2 1 1 2 2 2 1 2 2 2 1
## [1592] 2 1 2 2 1 2 2 2 1 2 2 1 2 2 2 1 1 2 2 1 2 2 2 2 1 1 1 2 2 2 1 1 2 1 2 2 1
## [1629] 1 2 2 1 2 1 2 2 2 1 1 1 2 2 1 2 2 2 1 2 1 1 2 1 1 1 1 2 2 2 2 2 1 2 2 2 2
## [1666] 2 1 1 1 1 1 1 2 2 2 1 2 1 2 1 1 2 2 1 2 2 1 2 1 2 1 1 2 1 2 2 1 2 2 1 2 1
## [1703] 1 1 2 2 2 2 1 1 2 2 2 1 1 1 1 1 2 2 2 2 2 1 1 1 2 1 1 1 1 2 2 1 2 2 2 2 2
## [1740] 1 1 2 1 1 2 1 2 1 2 2 2 2 1 1 2 2 2 2 2 1 2 2 1 1 2 2 2 2 1 1 2 2 1 2 2 2
## [1777] 1 2 1 1 1 1 2 2 2 1 1 2 1 1 2 1 1 1 1 2 1 1 1 2 1 2 2 1 1 2 2 1 1 2 2 2 1
## [1814] 2 2 2 1 2 1 1 2 1 2 1 2 1 2 2 2 2 1 2 1 1 1 1 2 2 1 1 1 2 1 1 1 1 2 2 2 1
## [1851] 2 1 2 1 2 2 1 1 1 1 1 2 2 1 2 2 2 1 1 2 1 1 2 1 1 2 2 1 1 2 2 2 2 2 2 1 1
## [1888] 2 2 2 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 2 1 1 1 1 2 1 2 2 1 2 1 1 2 2 1 2 2 1
## [1925] 2 1 1 1 2 2 2 1 1 1 2 1 2 2 1 2 1 1 2 2 2 1 1 1 1 1 1 2 2 2 2 1 1 2 2 2 2
## [1962] 2 1 2 2 2 2 2 2 2 1 2 1 1 2 1 1 1 2 2 2 2 2 2 2 2 1 2 2 1 1 2 1 1 1 2 2 2
## [1999] 2 2 2 2 2 1 1 2 2 2 1 2 1 1 2 1 2 2 1 1 2 1 1 1 2 2 2 2 2 1 1 1 2 2 1 2 2
## [2036] 2 1 1 1 2 1 2 1 1 2 1 2 2 1 1 2 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 2 1 2 1 2 1
## [2073] 1 2 1 2 1 2 2 2 1 2 1 1 1 2 2 2 1 2 1 1 2 2 2 2 1 2 1 1 1 1 2 2 1 2 2 2 2
## [2110] 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2 1 2 2 1 2 2 1 1 1 2 1 1 1 1
## [2147] 1 1 1 1 2 2 2 2 2 2 1 1 1 1 2 1 2 2 1 1 2 2 2 2 2 2 1 1 2 2 1 2 2 2 1 1 2
## [2184] 1 2 2 2 1 2 2 1 1 2 2 2 1 1 1 2 2 1 2 1 2 1 2 1 1 2
sum(km.out$cluster == 1)
## [1] 980
# Finding the optimal k for clustering:
km_out_list <- lapply(1:10, function(k) list(
  k=k,
  km_out=kmeans(df_scale, k, nstart = 20)))

km_results <- data.frame(
  k=sapply(km_out_list, function(k) k$k),
  totss=sapply(km_out_list, function(k) k$km_out$totss),
  tot_withinss=sapply(km_out_list, function(k) k$km_out$tot.withinss)
  )
km_results
##     k totss tot_withinss
## 1   1 44160     44160.00
## 2   2 44160     35269.93
## 3   3 44160     32637.74
## 4   4 44160     30666.83
## 5   5 44160     29348.63
## 6   6 44160     28148.74
## 7   7 44160     26777.93
## 8   8 44160     24773.63
## 9   9 44160     25047.47
## 10 10 44160     23163.02
library(plotly)
plot_ly(km_results, x = ~k, y = ~tot_withinss) %>%
  layout(title = "Optimal k using the elbow curve method") %>%
  add_markers() %>%
  add_paths()
# Choosing k: gap statistic (clusGap and silhouette are in the cluster package):
library(cluster)
suppressWarnings(gap_kmeans <- clusGap(df_scale, kmeans, nstart = 25, K.max = 10, B = 100))
plot(gap_kmeans, main = "Gap Statistic: kmeans")

# Choosing k: silhouette:
par(mar = c(5, 2, 4, 2), mfrow=c(2,2))
for(k in c(2,3,4,9)) {
  kmeans_cluster <- kmeans(df_scale, k, nstart=20)
  si <- silhouette(kmeans_cluster$cluster, dist = dist(df_scale))
  plot(si,main="")
}

silhouette_score <- function(k){
  km <- kmeans(df_scale, centers = k, nstart=25)
  ss <- silhouette(km$cluster, dist(df_scale))
  mean(ss[, 3])
}
k <- 2:10
avg_sil <- sapply(k, silhouette_score)
plot(k, avg_sil, type = 'b', xlab = 'Number of clusters', ylab = 'Average Silhouette Score', frame = FALSE)


fviz_nbclust(df_scale, kmeans, method='silhouette') + ggtitle( "Finding Optimal 'k' using silhouette") + theme(plot.title = element_text(hjust = 0.5))

Approaches to find the optimal k:

  • Elbow curve method: we vary the number of clusters K from 1 to n (usually 10) and, for each K, calculate the WCSS (Within-Cluster Sum of Squares), the sum of squared distances between each point and its cluster centroid. Plotting WCSS against K gives an elbow-shaped curve: WCSS is largest at K = 1 and decreases as the number of clusters grows, changing rapidly up to a point and then flattening out almost parallel to the X-axis. The K value at that bend is the optimal number of clusters. The k identified by the elbow curve method is 2.

  • Gap statistic method: this technique takes the output of any clustering algorithm and compares the change in within-cluster dispersion with that expected under an appropriate reference null distribution. The k identified by the gap statistic method is 4.

  • Silhouette method: each cluster is represented by a so-called silhouette, based on a comparison of its tightness and separation. The silhouette shows which objects lie well within their cluster and which are merely somewhere in between clusters. Combining the silhouettes into a single plot displays the entire clustering, giving an appreciation of the relative quality of the clusters and an overview of the data configuration. The average silhouette width provides an evaluation of clustering validity and can be used to select an appropriate number of clusters. The k identified by the silhouette method is 2.

Since two of the three approaches identify the optimal k as 2, we stick with 2 as the optimal k.

Final Optimal k is 2

3.2 Cluster Visualization

km_out <- kmeans(df_scale, centers = 2, nstart = 25)
km_2 <- kmeans(df_scale, centers = 4, nstart = 25)  # k = 4
km_3 <- kmeans(df_scale, centers = 5, nstart = 25)  # k = 5
km_4 <- kmeans(df_scale, centers = 3, nstart = 25)  # k = 3

# First two principal component scores (kept for reference; the plotly plots
# at the end use pr_out$x directly)
km <- cbind(pr_out$x[,1], pr_out$x[,2])

fviz_cluster(km_out, data = df_scale)

fviz_cluster(km_2, data = df_scale)

fviz_cluster(km_3, data = df_scale)

fviz_cluster(km_4, data = df_scale)

# plots to compare
p1 <- fviz_cluster(km_out, geom = "point", data = df_scale) +
  ggtitle("Clustering with k = 2") + theme(plot.title = element_text(hjust = 0.5))
p2 <- fviz_cluster(km_2, geom = "point", data = df_scale) +
  ggtitle("Clustering with k = 4") + theme(plot.title = element_text(hjust = 0.5))
p3 <- fviz_cluster(km_3, geom = "point", data = df_scale) +
  ggtitle("Clustering with k = 5") + theme(plot.title = element_text(hjust = 0.5))
p4 <- fviz_cluster(km_4, geom = "point", data = df_scale) +
  ggtitle("Clustering with k = 3") + theme(plot.title = element_text(hjust = 0.5))
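The four comparison plots are stored but not printed above; one way to display them in a 2x2 grid (a sketch, assuming the gridExtra package is installed):

library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)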

3.3 Characterizing Clusters

  • Cluster 1: lower Education, lower Income, lower Kidhome, lower Teenhome, higher MntWines, higher MntFruits, fewer complaints, mostly Single and Together
  • Cluster 2: higher Education, higher Income, higher Kidhome, higher Teenhome, lower MntWines, lower MntFruits, mostly Married and Divorced

(A per-cluster means check supporting these characterizations follows after the table below.)
df_scale$Cluster <- km_out$cluster
head(df_scale,5)
##     Education     Income    Kidhome   Teenhome    Recency   MntWines  MntFruits
## 1 -0.77967753  0.2339039 -0.8227362 -0.9281454  0.3082732  0.9766566  1.5488659
## 2 -0.77967753 -0.2341403  1.0393789  0.9090170 -0.3826166 -0.8711997 -0.6370558
## 3 -0.77967753  0.7686585 -0.8227362 -0.9281454 -0.7971505  0.3577432  0.5689700
## 4 -0.77967753 -1.0158542  1.0393789 -0.9281454 -0.7971505 -0.8711997 -0.5616792
## 5  0.01547836  0.2400551  1.0393789 -0.9281454  1.5518749 -0.3914678  0.4182168
##   MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumWebPurchases
## 1       1.6879549       2.4630607      1.481888649   0.85482704       1.4304941
## 2      -0.7180699      -0.6514171     -0.634215838  -0.73267383      -1.1252228
## 3      -0.1789421       1.3455128     -0.147755036  -0.03572223       1.4304941
## 4      -0.6556915      -0.5048534     -0.585569758  -0.75203360      -0.7601204
## 5      -0.2190425       0.1546831     -0.001816796  -0.55843593       0.3351868
##   NumStorePurchases    Complain        Age MembershipDays enc_maritalMarried
## 1        -0.5538715 -0.09794622  0.9853629      1.5300508         -0.7959829
## 2        -1.1683880 -0.09794622  1.2357656     -1.1889072         -0.7959829
## 3         1.2896778 -0.09794622  0.3176225     -0.2051387         -0.7959829
## 4        -0.5538715 -0.09794622 -1.2682609     -1.0603746         -0.7959829
## 5         0.0606449 -0.09794622 -1.0178582     -0.9516163          1.2557397
##   enc_maritalSingle enc_maritalTogether enc_maritalWidow Cluster
## 1         1.9205079          -0.5916806       -0.1887179       2
## 2         1.9205079          -0.5916806       -0.1887179       1
## 3        -0.5204599           1.6893359       -0.1887179       2
## 4        -0.5204599           1.6893359       -0.1887179       1
## 5        -0.5204599          -0.5916806       -0.1887179       1
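These characterizations can be checked against the per-cluster means of the original, unscaled features; a sketch (cluster_means is a hypothetical name):

# Per-cluster means on the unscaled selected features
cluster_means <- aggregate(df_sel, by = list(Cluster = km_out$cluster), FUN = mean)
cluster_means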

4. Cluster with Hierarchical Clustering

Linkage defines how to calculate the distance between clusters containing multiple data points. The common linkage methods are:

  • Complete: largest distance between elements of two clusters
  • Single: smallest distance between elements of two clusters
  • Average: average dissimilarity between all elements of two clusters
  • Centroid: dissimilarity between the centroids

As the plots below show, cutting the hierarchical trees at k = 2 places nearly all observations in a single cluster, so the hierarchical methods effectively yield only one meaningful cluster.

suppressPlotlyMessage <- function(p) {
  suppressMessages(plotly_build(p))
}

# Note: df_scale still carries the Cluster column added in section 3.3 above,
# so that column is included in the distance computation here
hc.complete <- hclust(dist(df_scale), method = "complete")
hc.single <- hclust(dist(df_scale), method = "single")
hc.average <- hclust(dist(df_scale), method = "average")

df_scale_ct <- cutree(hc.complete, k=2)
df_scale_sg <- cutree(hc.single, k=2)
df_scale_ag <- cutree(hc.average, k=2)
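Dendrograms make the differences between the linkages directly visible; a sketch:

# Dendrograms for the three linkages (point labels suppressed for readability)
par(mfrow = c(1, 3))
plot(hc.complete, labels = FALSE, main = "Complete Linkage", xlab = "", sub = "")
plot(hc.single, labels = FALSE, main = "Single Linkage", xlab = "", sub = "")
plot(hc.average, labels = FALSE, main = "Average Linkage", xlab = "", sub = "")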

par(mfrow = c(1, 3))  # note: par() affects base graphics only, not plotly output
plot_ly(x = ~pr_out$x[,1], y = ~pr_out$x[,2], color = as.factor(df_scale_ct),
        colors = c("red","blue","green"), type = "scatter", mode = "markers")
plot_ly(x = ~pr_out$x[,1], y = ~pr_out$x[,2], color = as.factor(df_scale_sg),
        colors = c("red","blue","green"), type = "scatter", mode = "markers")
plot_ly(x = ~pr_out$x[,1], y = ~pr_out$x[,2], color = as.factor(df_scale_ag),
        colors = c("red","blue","green"), type = "scatter", mode = "markers")

Conclusion

K-means clustering performs better on the marketing campaign dataset: with k = 2 it splits the customers into two well-separated, balanced groups, whereas the hierarchical cuts leave almost all observations in one cluster.
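One quick way to substantiate this is to compare the cluster sizes from the two approaches; a sketch:

# Cluster sizes: k-means (k = 2) vs. complete-linkage hierarchical cut at k = 2
table(km_out$cluster)
table(df_scale_ct)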

plot_ly(x = pr_out$x[,1], y = pr_out$x[,2], color = as.factor(km_out$cluster),
        colors = c("red","blue"), type = "scatter", mode = "markers")