Agenda

  1. Clustering Analysis
  2. K-Means Clustering
    1. Data Exploration and Preparation

    2. Standardization/Scaling

    3. Model Training

    4. Model Evaluation

  3. Hierarchical Clustering

Clustering Analysis

  • Cluster analysis is an unsupervised machine learning method in data mining

  • It is used for grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (a small toy sketch follows below)
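
To make the idea concrete, here is a small toy sketch (not from the original case study): two clearly separated groups of random 2-D points, which k-means recovers with k = 2.

    # toy illustration: two obvious groups of 2-D points
    set.seed(1)
    toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                 matrix(rnorm(40, mean = 5), ncol = 2))
    toy_km <- kmeans(toy, centers = 2)
    # points in the same cluster are more similar to each other than to the rest
    plot(toy, col = toy_km$cluster, pch = 19)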

Case Study - Identify Teen Market Segments

You are a senior marketing analyst at a global consulting firm. Your manager has given you a project to conduct an in-depth analysis of teenage market segments so that your client firm can extend its product lines and increase its retail sales.

You are brainstorming ideas for this project when it occurs to you that interacting with friends on social media is a common way for teenagers to spend their time. This may be a way to identify segments of teenagers who share similar tastes, so that clients can avoid targeting ads at teens with no interest in the product being sold.

Load data

We will first load the Social Media data snsdata.csv from Canvas - Dataset module.

The data was sampled across four high school graduation years. The full text of each teen’s SNS profile was scraped, and each teen’s gender, age, and number of SNS friends were also recorded.

A text mining tool was used to divide the remaining content into words. The final dataset indicates how many times each word appeared in the person’s SNS profile.

    # load data
    library(ggplot2)
    teens <- read.csv("snsdata.csv", stringsAsFactors = TRUE)

Step 1: Exploring and Preparing the Data

We can first take a quick look at the dataset.

    str(teens)
## 'data.frame':    30000 obs. of  40 variables:
##  $ gradyear    : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
##  $ gender      : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 1 1 ...
##  $ age         : num  19 18.8 18.3 18.9 19 ...
##  $ friends     : int  7 0 69 0 10 142 72 17 52 39 ...
##  $ basketball  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ football    : int  0 1 1 0 0 0 0 0 0 0 ...
##  $ soccer      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ softball    : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ volleyball  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ swimming    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cheerleading: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ baseball    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tennis      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sports      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cute        : int  0 1 0 1 0 0 0 0 0 1 ...
##  $ sex         : int  0 0 0 0 1 1 0 2 0 0 ...
##  $ sexy        : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ hot         : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ kissed      : int  0 0 0 0 5 0 0 0 0 0 ...
##  $ dance       : int  1 0 0 0 1 0 0 0 0 0 ...
##  $ band        : int  0 0 2 0 1 0 1 0 0 0 ...
##  $ marching    : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ music       : int  0 2 1 0 3 2 0 1 0 1 ...
##  $ rock        : int  0 2 0 1 0 0 0 1 0 1 ...
##  $ god         : int  0 1 0 0 1 0 0 0 0 6 ...
##  $ church      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ jesus       : int  0 0 0 0 0 0 0 0 0 2 ...
##  $ bible       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hair        : int  0 6 0 0 1 0 0 0 0 1 ...
##  $ dress       : int  0 4 0 0 0 1 0 0 0 0 ...
##  $ blonde      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mall        : int  0 1 0 0 0 0 2 0 0 0 ...
##  $ shopping    : int  0 0 0 0 2 1 0 0 0 1 ...
##  $ clothes     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hollister   : int  0 0 0 0 0 0 2 0 0 0 ...
##  $ abercrombie : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ die         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ death       : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ drunk       : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ drugs       : int  0 0 0 0 1 0 0 0 0 0 ...

Step 1: Exploring and Preparing the Data

The NA values in the gender variable indicate missing data. To examine how much data is missing:

  • For categorical data, we can use the table() function with the useNA option.

  • For numerical data, we can use the summary() function.

# look at missing data for the gender variable
table(teens$gender)
## 
##     F     M 
## 22054  5222
table(teens$gender, useNA = "ifany")
## 
##     F     M  <NA> 
## 22054  5222  2724

Eliminate Age Outliers

Since our data include ages ranging from 3 years old to 106 years old, we need to narrow down the data to a more reasonable range for teens.

We can use the ifelse() function to treat out-of-range ages as NA as well.

# look at missing data for age variable
summary(teens$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   3.086  16.312  17.287  17.994  18.259 106.927    5086
teens$age <- ifelse(teens$age >= 13 & teens$age < 20, teens$age, NA)

summary(teens$age) # Now the distribution is more like teenagers
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.03   16.30   17.27   17.25   18.22   20.00    5523

Handling missing categorical values - Dummy Coding

Besides excluding all missing values, an alternative solution for missing categorical data is to treat the missing values as a separate category. Dummy coding can be used to create a separate binary variable for each level except one, which is reserved as the reference group. is.na() is a function that tests whether a value is NA; !is.na() indicates that a value is not NA.

    # reassign missing gender values to "unknown"
    teens$female <- ifelse(teens$gender == "F" &
                             !is.na(teens$gender), 1, 0) #Not NA
    teens$no_gender <- ifelse(is.na(teens$gender), 1, 0)

# check our recoding work
table(teens$gender, useNA = "ifany")
## 
##     F     M  <NA> 
## 22054  5222  2724
table(teens$female, useNA = "ifany")
## 
##     0     1 
##  7946 22054
table(teens$no_gender, useNA = "ifany")
## 
##     0     1 
## 27276  2724

Handling missing numeric values - Imputing missing values

We still have 5,523 missing age values, so we will use imputation: filling in the missing data with a guess as to the true value.

How should we make an informed guess of a teen’s age? Hint: review the other available variables for a reference.

colnames(teens)[1:4]
## [1] "gradyear" "gender"   "age"      "friends"

Imputing based on Group Means

If you are thinking of using the graduation year, you’ve got the right idea!

We can calculate the average age for each graduation year using the ave() function, which returns the group mean repeated for every observation in the group.

# create a vector with the average age for each gradyear, repeated by person
ave_age <- ave(teens$age, teens$gradyear,
               FUN = function(x) mean(x, na.rm = TRUE))

Then we can use the ifelse() function to assign the group average to the missing values.

teens$age <- ifelse(is.na(teens$age), ave_age, teens$age)
# check the summary results to ensure missing values are eliminated
summary(teens$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.03   16.28   17.24   17.24   18.21   20.00

Step 2: Training a Model on the Data

To cluster the teenagers into marketing segments, we will use the kmeans() function from the stats package, which is included with base R.

Step 2: Training a Model on the Data

We will start the cluster analysis with the 36 features that record the number of times various interests appeared on the SNS profiles.

# Only select numeric data relevant to their interests
interests <- teens[5:40]

Then we need to standardize the features so that they have a mean of 0 and a standard deviation of 1, which makes the relative values easier to interpret. We will use the lapply() function (Apply a Function over a List or Vector) to apply scale() to each column.

  • scale() function: centers and/or scales the columns of a numeric matrix
# create a z-score standardized data frame for easier interpretation
interests_z <- lapply(interests, scale)
# Change it into a dataframe
interests_z <- as.data.frame(interests_z)
# compare the data before and after the transformation
summary(interests$basketball)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2673  0.0000 24.0000
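
To see the “after” side of the comparison, we can summarize the same feature in the standardized data frame; each standardized column should now have a mean of approximately 0 and a standard deviation of approximately 1.

# check the standardized version of the same feature
summary(interests_z$basketball)
sd(interests_z$basketball) # should be approximately 1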

Step 2: Training a Model on the Data

  • The last decision is the number of clusters to use for segmenting the data.

  • Too many clusters will make them too specific to be useful, while too few clusters may result in heterogeneous groupings.

  • It usually takes a few experiments with different values (a short sketch follows below).
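
One way to run such experiments, sketched below (this is not part of the original slides), is to compare the total within-cluster sum of squares for several candidate values of k, just as the elbow plot later in this deck does for the age/income data.

# sketch: try several values of k and compare total within-cluster sum of squares
set.seed(2345)
wss_by_k <- sapply(2:8, function(k)
  kmeans(interests_z, k, iter.max = 25)$tot.withinss) # extra iterations for the larger data
plot(2:8, wss_by_k, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")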

# create the clusters using k-means
RNGversion("3.5.2") # use an older random number generator to get the same result
## Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-uniform
## 'Rounding' sampler used
set.seed(2345)
teen_clusters <- kmeans(interests_z, 5)

Step 3: Evaluating Model Performance

Evaluating clustering results can be somewhat subjective, since the success or failure of the model depends on whether the clusters are useful for their intended purpose. One of the most basic checks is to examine the number of cases falling into each group.

# look at the size of the clusters
teen_clusters$size
## [1]   871   600  5981  1034 21514

For a more in-depth look at the clusters, we can examine the coordinates of the cluster centroids.

# look at the cluster centers
teen_clusters$centers
##    basketball   football      soccer    softball  volleyball    swimming
## 1  0.16001227  0.2364174  0.10385512  0.07232021  0.18897158  0.23970234
## 2 -0.09195886  0.0652625 -0.09932124 -0.01739428 -0.06219308  0.03339844
## 3  0.52755083  0.4873480  0.29778605  0.37178877  0.37986175  0.29628671
## 4  0.34081039  0.3593965  0.12722250  0.16384661  0.11032200  0.26943332
## 5 -0.16695523 -0.1641499 -0.09033520 -0.11367669 -0.11682181 -0.10595448
##   cheerleading    baseball      tennis      sports        cute          sex
## 1    0.3931445  0.02993479  0.13532387  0.10257837  0.37884271  0.020042068
## 2   -0.1101103 -0.11487510  0.04062204 -0.09899231 -0.03265037 -0.042486141
## 3    0.3303485  0.35231971  0.14057808  0.32967130  0.54442929  0.002913623
## 4    0.1856664  0.27527088  0.10980958  0.79711920  0.47866008  2.028471066
## 5   -0.1136077 -0.10918483 -0.05097057 -0.13135334 -0.18878627 -0.097928345
##          sexy         hot      kissed       dance        band    marching
## 1  0.11740551  0.41389104  0.06787768  0.22780899 -0.10257102 -0.10942590
## 2 -0.04329091 -0.03812345 -0.04554933  0.04573186  4.06726666  5.25757242
## 3  0.24040196  0.38551819 -0.03356121  0.45662534 -0.02120728 -0.10880541
## 4  0.51266080  0.31708549  2.97973077  0.45535061  0.38053621 -0.02014608
## 5 -0.09501817 -0.13810894 -0.13535855 -0.15932739 -0.12167214 -0.11098063
##        music        rock         god      church       jesus       bible
## 1  0.1378306  0.05905951  0.03651755 -0.00709374  0.01458533 -0.03692278
## 2  0.4981238  0.15963917  0.09283620  0.06414651  0.04801941  0.05863810
## 3  0.2844999  0.21436936  0.35014919  0.53739806  0.27843424  0.22990963
## 4  1.1367885  1.21013948  0.41679142  0.16627797  0.12988313  0.08478769
## 5 -0.1532006 -0.12460034 -0.12144246 -0.15889274 -0.08557822 -0.06813159
##          hair       dress      blonde        mall    shopping       clothes
## 1  0.43807926  0.14905267  0.06137340  0.60368108  0.79806891  0.5651537331
## 2 -0.04484083  0.07201611 -0.01146396 -0.08724304 -0.03865318 -0.0003526292
## 3  0.23612853  0.39407628  0.03471458  0.48318495  0.66327838  0.3759725120
## 4  2.55623737  0.53852195  0.36134138  0.62256686  0.27101815  1.2306917174
## 5 -0.20498730 -0.14348036 -0.02918252 -0.18625656 -0.22865236 -0.1865419798
##    hollister abercrombie          die       death        drunk       drugs
## 1  4.1521844  3.96493810  0.043475966  0.09857501  0.035614771  0.03443294
## 2 -0.1678300 -0.14129577  0.009447317  0.05135888 -0.086773220 -0.06878491
## 3 -0.0553846 -0.07417839  0.037989066  0.11972190 -0.009688746 -0.05973769
## 4  0.1610784  0.26324494  1.712181870  0.93631312  1.897388200  2.73326605
## 5 -0.1557662 -0.14861104 -0.094875180 -0.08370729 -0.087520105 -0.11423381

Profiling (Identifying Patterns)

  • By examining whether the clusters fall above or below the mean level for each interest category, we can begin to notice patterns that distinguish the clusters from each other.

  • In practice, this involves printing the cluster centers and searching through them for any patterns or extreme values, much like a word search puzzle but with numbers.

  • E.g., Cluster 3 is substantially above the mean interest level on all the sports.

Profiling and Segmenting

Given this subset of the interest data, we can already infer some characteristics of the clusters. In the following table, each cluster is shown with the features that most distinguish it from the other clusters.
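
Since that summary table is not reproduced in this excerpt, here is an illustrative sketch (not from the original deck; the object names are only for illustration) of how such a profile could be derived from the centroids: for each cluster, list the interests with the highest standardized center values.

# for each cluster, find the interests that sit furthest above the overall mean
centers <- teen_clusters$centers
top_interests <- apply(centers, 1, function(row) {
  names(sort(row, decreasing = TRUE))[1:5] # top 5 interests per cluster
})
top_interests # one column per cluster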

Association with other Features

The algorithm appears to be performing quite well, so we can now focus our effort on turning these insights into action. We’ll begin by applying the clusters back to the full dataset, adding the identified cluster ID as a new column to the original data.

 # apply the cluster IDs to the original data frame
  teens$cluster <- teen_clusters$cluster

Now we can examine how cluster membership relates to individual characteristics.

# look at the first five records
teens[1:5, c("cluster", "gender", "age", "friends")]
##   cluster gender    age friends
## 1       5      M 18.982       7
## 2       3      F 18.801       0
## 3       5      M 18.335      69
## 4       5      F 18.875       0
## 5       4   <NA> 18.995      10

Using the aggregate() function, we can also look at the demographic characteristics of the clusters.

# mean age by cluster
aggregate(data = teens, age ~ cluster, mean)
##   cluster      age
## 1       1 16.86497
## 2       2 17.39037
## 3       3 17.07656
## 4       4 17.11957
## 5       5 17.29849
# proportion of females by cluster
aggregate(data = teens, female ~ cluster, mean)
##   cluster    female
## 1       1 0.8381171
## 2       2 0.7250000
## 3       3 0.8378198
## 4       4 0.8027079
## 5       5 0.6994515

# mean number of friends by cluster
aggregate(data = teens, friends ~ cluster, mean)
##   cluster  friends
## 1       1 41.43054
## 2       2 32.57333
## 3       3 37.16185
## 4       4 30.50290
## 5       5 27.70052

Conclusion: The association among group membership, gender, and number of friends suggests that the clusters can be useful predictors of behavior.

Case Introduction - Customer Segmentation

In the business world, age and income are two crucial features for segmenting potential customers, as they strongly influence purchasing behavior and capacity.

In this session we will use both k-means and hierarchical clustering to analyze the age/income data and segment potential consumers; we will then compare their performance and interpret the results.

First, load the data age_income_data.csv from Canvas.

market <- read.csv("age_income_data.csv")

Explore Dataset

We will first explore the structure and statistics of the variables in this dataset.

str(market)
## 'data.frame':    8105 obs. of  3 variables:
##  $ bin   : chr  "60-69" "30-39" "20-29" "30-39" ...
##  $ age   : int  64 33 24 33 78 62 88 54 54 31 ...
##  $ income: num  87083 76808 12044 61972 60120 ...
summary(market)
##      bin                 age            income        
##  Length:8105        Min.   :18.00   Min.   :   233.6  
##  Class :character   1st Qu.:28.00   1st Qu.: 43792.7  
##  Mode  :character   Median :39.00   Median : 65060.0  
##                     Mean   :42.85   Mean   : 66223.6  
##                     3rd Qu.:55.00   3rd Qu.: 85944.7  
##                     Max.   :89.00   Max.   :178676.4

Explore Trend with Visualization

Since there are only two variables, we can use side-by-side box plots, grouped by the age bin, to explore the potential trend/relationship between them.

par(mfrow = c(1, 2)) #Create a partition of the plotting space to plot two graphs side by side
boxplot(market$age ~ market$bin, main = "Explore Age") 
boxplot(market$income ~ market$bin, main = "Explore Income") #Observe the trend of income for different age range

par(mfrow = c(1, 1)) #Reset the plotting partition to default (1 plot)

Correlation Test

We noticed a potential relationship between age and income; we can validate this with a correlation test. cor.test() is a function we can use to test the statistical significance of a correlation.

#Test the correlation between age and income
cor.test(market$age, market$income) 
## 
##  Pearson's product-moment correlation
## 
## data:  market$age and market$income
## t = -5.4055, df = 8103, p-value = 6.648e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08160633 -0.03822020
## sample estimates:
##         cor 
## -0.05994158

We will focus on the p-value output: since it is smaller than 0.05, it indicates a statistically significant (though very weak, r ≈ -0.06) correlation between age and income.

K-means Clustering without Scaling

To understand the importance of scaling, we will first create a k-means model without scaling to see what we will get:

#K-means Clustering without Scaling
set.seed(789) #Set a random seed to generate the same result

three <- kmeans(market[, 2:3], 3) 
#Specify variables age and income
#Specify k=3 clusters

Visualize K-Means Clustering Results

Then we can visualize the clustering results with the following code:

We first create a scatter plot, then overlay the cluster centers as points, using color to distinguish the clusters.

plot(market$age, market$income, col = three$cluster, xlab = 'age',
     ylab = 'income', main = 'K-means without Scaling')
#Add points of centers to the plot
points(three$centers[, 1], three$centers[, 2], 
       pch = 23, col = 'maroon', bg = 'lightblue', cex = 3)
rm(three) #remove the k-means cluster result

Standardization/Scaling

Since the scales of age and income are drastically different, the unscaled clustering is dominated by income. We need to use the scale() function to standardize these two variables.

market$age_scale <- as.numeric(scale(market$age))
market$inc_scale <- as.numeric(scale(market$income))

Redo K-Means Clustering

Then we will redo the k-means analysis.

#Redo clustering after scaling
set.seed(789)
three <- kmeans(market[, 4:5], 3)
#Visualize results after scaling
plot(market$age_scale, market$inc_scale,
     col=three$cluster, 
     xlab = 'age', ylab = 'income',
     main = 'K-means with Scaling')
points(three$centers[, 1], three$centers[, 2], 
       pch = 23, col = 'maroon', bg = 'lightblue', cex = 3)
rm(three)

Build Multiple K-means Models with Different K

Now, to find the best number of clusters, we can build multiple models and evaluate which one performs best.

#Change number of clusters from 2 to 10, then compare their results through elbow plot
set.seed(456)
two <- kmeans(market[, 4:5], 2)
three <- kmeans(market[, 4:5], 3)
four <- kmeans(market[, 4:5], 4)
five <- kmeans(market[, 4:5], 5)
six <- kmeans(market[, 4:5], 6)
seven <- kmeans(market[, 4:5], 7)
eight <- kmeans(market[, 4:5], 8)
nine <- kmeans(market[, 4:5], 9)
ten <- kmeans(market[, 4:5], 10)

Evaluate the Models

To evaluate the models, we first need to calculate the total within-cluster sum of squares (a measure of how tightly the data points are clustered). tot.withinss stands for the total within-cluster sum of squares, i.e. sum(withinss).

# Evaluting the Models
optimize <- data.frame(clusters = c(2:10), wss = rep(0, 9)) #Create a new dataframe to store the wss (tot.withinss) values from the models above
optimize[1, 2] <- as.numeric(two$tot.withinss) #store the total within-cluster sum of square to their respective row
optimize[2, 2] <- as.numeric(three$tot.withinss)
optimize[3, 2] <- as.numeric(four$tot.withinss)
optimize[4, 2] <- as.numeric(five$tot.withinss)
optimize[5, 2] <- as.numeric(six$tot.withinss)
optimize[6, 2] <- as.numeric(seven$tot.withinss)
optimize[7, 2] <- as.numeric(eight$tot.withinss)
optimize[8, 2] <- as.numeric(nine$tot.withinss)
optimize[9, 2] <- as.numeric(ten$tot.withinss)
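
As an aside, the model fitting and bookkeeping above can be written more compactly. The sketch below (not in the original slides; the name optimize_compact is only illustrative) collects the within-cluster sums of squares into a comparable data frame in one step.

#Compact alternative: fit k-means for k = 2..10 and collect tot.withinss directly
set.seed(456)
wss <- sapply(2:10, function(k) kmeans(market[, 4:5], k)$tot.withinss)
optimize_compact <- data.frame(clusters = 2:10, wss = wss)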

Elbow Plot

Now we can use an elbow plot to identify the “elbow point”, where adding more clusters stops substantially reducing the within-cluster sum of squares; this gives us a sense of how each model performs.

#Plot out the Elbow plot 
plot(optimize$wss ~ optimize$clusters, type = "b", 
     ylim = c(0, 12000), ylab = 'Within Sum of Square Error',
     main = 'Finding Optimal Number of Clusters Based on Error',
     xlab = 'Number of Clusters', pch = 17, col = 'black')

Hierarchical Clustering

Then we will use the hclust() function to conduct hierarchical clustering. We need the dist() function to produce a distance matrix as input, as well as a specified linkage method for measuring the distance between clusters.

set.seed(456)
hc_mod <- hclust(dist(market[, 4:5]), method = "ward.D2")
hc_mod
## 
## Call:
## hclust(d = dist(market[, 4:5]), method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 8105

Visualize as Dendrogram

Then we can visualize the result using the as.dendrogram() function.

 # Visualizing the Model Output
dend <- as.dendrogram(hc_mod)
#Install this package if you don't have it
if(!require("dendextend")) install.packages("dendextend") 
library(dendextend)
#Add colors to the different branches
dend_six_color <- color_branches(dend, k = 6)
#Plot the dendrogram
plot(dend_six_color, leaflab = "none", horiz = TRUE,
     main = "Age and Income Dendrogram", xlab = "Height")
#Add a line to indicate cluster decision
abline(v = 37.5, lty = 'dashed', col = 'blue')

Structure of Dendrogram

An alternative way to show the hierarchical structure:

str(cut(dend, h = 37.5)$upper)
## --[dendrogram w/ 2 branches and 6 members at h = 108]
##   |--[dendrogram w/ 2 branches and 2 members at h = 38]
##   |  |--leaf "Branch 1" (h= 15.7 midpoint = 274, x.member = 782 )
##   |  `--leaf "Branch 2" (h= 19.5 midpoint = 628, x.member = 1526 )
##   `--[dendrogram w/ 2 branches and 4 members at h = 93.8]
##      |--[dendrogram w/ 2 branches and 2 members at h = 41.3]
##      |  |--leaf "Branch 3" (h= 17 midpoint = 431, x.member = 905 )
##      |  `--leaf "Branch 4" (h= 18.5 midpoint = 463, x.member = 1473 )
##      `--[dendrogram w/ 2 branches and 2 members at h = 56.4]
##         |--leaf "Branch 5" (h= 13.6 midpoint = 530, x.member = 1323 )
##         `--leaf "Branch 6" (h= 30.8 midpoint = 753, x.member = 2096 )

Save the Hierarchical Results to the Original Dataset

#Save hierarchical clustering ID results with 5 clusters
dend_five <- cutree(dend, k = 5)
market$dend5 <- dend_five
#Save hierarchical clustering ID results with 6 clusters
dend_six <- cutree(dend, k = 6)
market$dend6 <- dend_six
#Show the cluster IDs
market[1:10,]
##      bin age   income  age_scale   inc_scale dend5 dend6
## 1  60-69  64 87083.24  1.2071346  0.75070999     1     1
## 2  30-39  33 76807.82 -0.5619790  0.38091240     2     2
## 3  20-29  24 12043.60 -1.0755926 -1.94986082     3     3
## 4  30-39  33 61972.00 -0.5619790 -0.15300793     2     2
## 5  70-79  78 60120.32  2.0060891 -0.21964755     4     4
## 6  60-69  62 40058.42  1.0929982 -0.94164698     4     5
## 7    80-  88 38850.72  2.5767709 -0.98511016     4     4
## 8  50-59  54 65239.05  0.6364528 -0.03543151     4     5
## 9  50-59  54 51362.92  0.6364528 -0.53481378     4     5
## 10 30-39  31 36418.25 -0.6761154 -1.07265161     2     2

Comparing K-means & Hierarchical results

We notice from the elbow plot that the within-cluster sum of squares stops decreasing substantially at around 5 or 6 clusters.

Then we will save both the k-means and the hierarchical cluster results (with 5 and 6 clusters for each) into the original dataset; a sketch of this step follows below.
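
The code for that step is not included in this excerpt; below is a minimal sketch of what it might look like, assuming the scaled columns in market[, 4:5] and using hypothetical column names km5 and km6.

#Save k-means clustering ID results with 5 and 6 clusters (km5/km6 are illustrative names)
set.seed(456)
market$km5 <- kmeans(market[, 4:5], 5)$cluster
set.seed(456)
market$km6 <- kmeans(market[, 4:5], 6)$cluster
#Cross-tabulate the two 5-cluster solutions to compare the methods
table(kmeans = market$km5, hierarchical = market$dend5)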