- Clustering Analysis
- K-Means Clustering
Data Exploration and Preparation
Standardization/Scaling
Model Training
Model Evaluation
- Hierarchical clustering
Data Exploration and Preparation
Standardization/Scaling
Model Training
Model Evaluation
Cluster analysis is a unsupervised machine learning method in data mining
It is used for grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups
You are a marketing senior analyst in a global consulting firm. Your manager gave you a project to conduct a more in-depth analysis of teenagers’ market segments so that your customer firm will be able to extend their market of products and increase their retail sales.
You are brain-storming ideas for this project, and it suddenly occurs to you that interacting with friends on social media is a common way for teenagers to spend their time. This may be a potential way to identify segments of teenagers who share similar tests, so that clients can avoid targeting commercials with teens with no interest in the product being sold.
We will first load the Social Media data snsdata.csv from Canvas - Dataset module.
The data was sampled across our high school graduation years. The full text of their SNS profiles were scraped and each teen’s gender, age, and number of SNS friends were also recorded.
A text mining tool was used to divide the remaining content into words. The final dataset indicates how many times each word appeared in the person’s SNS profile.
# load data
library(ggplot2)
teens <- read.csv("snsdata.csv", stringsAsFactors = TRUE)
We can first take a quick look at the dataset.
str(teens)
## 'data.frame': 30000 obs. of 40 variables: ## $ gradyear : int 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ... ## $ gender : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 1 1 ... ## $ age : num 19 18.8 18.3 18.9 19 ... ## $ friends : int 7 0 69 0 10 142 72 17 52 39 ... ## $ basketball : int 0 0 0 0 0 0 0 0 0 0 ... ## $ football : int 0 1 1 0 0 0 0 0 0 0 ... ## $ soccer : int 0 0 0 0 0 0 0 0 0 0 ... ## $ softball : int 0 0 0 0 0 0 0 1 0 0 ... ## $ volleyball : int 0 0 0 0 0 0 0 0 0 0 ... ## $ swimming : int 0 0 0 0 0 0 0 0 0 0 ... ## $ cheerleading: int 0 0 0 0 0 0 0 0 0 0 ... ## $ baseball : int 0 0 0 0 0 0 0 0 0 0 ... ## $ tennis : int 0 0 0 0 0 0 0 0 0 0 ... ## $ sports : int 0 0 0 0 0 0 0 0 0 0 ... ## $ cute : int 0 1 0 1 0 0 0 0 0 1 ... ## $ sex : int 0 0 0 0 1 1 0 2 0 0 ... ## $ sexy : int 0 0 0 0 0 0 0 1 0 0 ... ## $ hot : int 0 0 0 0 0 0 0 0 0 1 ... ## $ kissed : int 0 0 0 0 5 0 0 0 0 0 ... ## $ dance : int 1 0 0 0 1 0 0 0 0 0 ... ## $ band : int 0 0 2 0 1 0 1 0 0 0 ... ## $ marching : int 0 0 0 0 0 1 1 0 0 0 ... ## $ music : int 0 2 1 0 3 2 0 1 0 1 ... ## $ rock : int 0 2 0 1 0 0 0 1 0 1 ... ## $ god : int 0 1 0 0 1 0 0 0 0 6 ... ## $ church : int 0 0 0 0 0 0 0 0 0 0 ... ## $ jesus : int 0 0 0 0 0 0 0 0 0 2 ... ## $ bible : int 0 0 0 0 0 0 0 0 0 0 ... ## $ hair : int 0 6 0 0 1 0 0 0 0 1 ... ## $ dress : int 0 4 0 0 0 1 0 0 0 0 ... ## $ blonde : int 0 0 0 0 0 0 0 0 0 0 ... ## $ mall : int 0 1 0 0 0 0 2 0 0 0 ... ## $ shopping : int 0 0 0 0 2 1 0 0 0 1 ... ## $ clothes : int 0 0 0 0 0 0 0 0 0 0 ... ## $ hollister : int 0 0 0 0 0 0 2 0 0 0 ... ## $ abercrombie : int 0 0 0 0 0 0 0 0 0 0 ... ## $ die : int 0 0 0 0 0 0 0 0 0 0 ... ## $ death : int 0 0 1 0 0 0 0 0 0 0 ... ## $ drunk : int 0 0 0 0 1 1 0 0 0 0 ... ## $ drugs : int 0 0 0 0 1 0 0 0 0 0 ...
The existence of NA in gender variable indicate the missing values. To examine how many miss data exist, - For categorical data, we can use table() function with an additional option. - For numerical data, we can use summary() function.
# look at missing data for female variable table(teens$gender)
## ## F M ## 22054 5222
table(teens$gender, useNA = "ifany")
## ## F M <NA> ## 22054 5222 2724
Since our data include ages ranging from 3 years old to 106 years old, we need to narrow down the data to a more reasonable range for teens.
So we can use ifelse function to treat ages out of scope as NA as well.
# look at missing data for age variable summary(teens$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 3.086 16.312 17.287 17.994 18.259 106.927 5086
teens$age <- ifelse(teens$age >= 13 & teens$age < 20, teens$age, NA) summary(teens$age) # Now the distribution is more like teenagers
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 13.03 16.30 17.27 17.25 18.22 20.00 5523
Besides exclusing all missing values, an alternative solution for missing categorical data is to treat missing values as a separate category. Dummy coding can be sued to create a separate binary variable for each level except one, which is reserved to be the reference group. is.na()is a function to test whether the value is NA, !is.na()indicates a value is not NA.
# reassign missing gender values to "unknown"
teens$female <- ifelse(teens$gender == "F" &
!is.na(teens$gender), 1, 0) #Not NA
teens$no_gender <- ifelse(is.na(teens$gender), 1, 0)
# check our recoding work table(teens$gender, useNA = "ifany")
## ## F M <NA> ## 22054 5222 2724
table(teens$female, useNA = "ifany")
## ## 0 1 ## 7946 22054
table(teens$no_gender, useNA = "ifany")
## ## 0 1 ## 27276 2724
Now we still have 5,523 missing age values, we will use imputation - filling in the missing data with a guess to the true value.
How should we make an informed guess of a teen’s age? - Hint: Review the available variables as a reference
colnames(teens)[1:4]
## [1] "gradyear" "gender" "age" "friends"
If you are thinking of using the graduation year, you’ve got the right idea!
We can calculate the average age for each graduation year using ave() function. This function returns the average of the subset of data.
# create a vector with the average age for each gradyear, repeated by person
ave_age <- ave(teens$age, teens$gradyear,
FUN = function(x) mean(x, na.rm = TRUE))
Then we can use ifelse() function to assign this group average to missing values
teens$age <- ifelse(is.na(teens$age), ave_age, teens$age) # check the summary results to ensure missing values are eliminated summary(teens$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 13.03 16.28 17.24 17.24 18.21 20.00
To cluster the teenagers into marketing segments, we will use kmeans() function in
We will start cluster analysis by considering on the 36 features represent the number of times variable interests appeared on SNS profiles
# Only select numeric data relevant to their interests interests <- teens[5:40]
Then, we need to standardize the features so that they have a mean of 0, and standard deviation of 1. This helps our interpretation of the relative values. We will use lapply function (Apply a Function over a List or Vector)
# create a z-score standardized data frame for easier interpretation interests_z <- lapply(interests, scale) # Change it into a dataframe interests_z <-as.data.frame(interests_z) # compare the data before and after the transformation summary(interests$basketball)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0000 0.0000 0.0000 0.2673 0.0000 24.0000
The last decision is the number of cluster to use for segmenting the data.
Too many clusters will make them too specific to be useful, while too few clusters may result in heterogeneous groupings.
It usually take a few experiments after trying different values.
# create the clusters using k-means
RNGversion("3.5.2") # use an older random number generator to get the same result
## Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-uniform
## 'Rounding' sampler used
set.seed(2345) teen_clusters <- kmeans(interests_z, 5)
Evaluating clustering results can be somewhat subjective, since the sucesss or failure of the model depends on whether the clusters are useful their intended purpose.One of the most basic ways - to examine the number of cases falling to each group.
# look at the size of the clusters teen_clusters$size
## [1] 871 600 5981 1034 21514
For a more in-depth look at the clusters, we can examine the coordinated of the cluster centroids
# look at the cluster centers teen_clusters$centers
## basketball football soccer softball volleyball swimming ## 1 0.16001227 0.2364174 0.10385512 0.07232021 0.18897158 0.23970234 ## 2 -0.09195886 0.0652625 -0.09932124 -0.01739428 -0.06219308 0.03339844 ## 3 0.52755083 0.4873480 0.29778605 0.37178877 0.37986175 0.29628671 ## 4 0.34081039 0.3593965 0.12722250 0.16384661 0.11032200 0.26943332 ## 5 -0.16695523 -0.1641499 -0.09033520 -0.11367669 -0.11682181 -0.10595448 ## cheerleading baseball tennis sports cute sex ## 1 0.3931445 0.02993479 0.13532387 0.10257837 0.37884271 0.020042068 ## 2 -0.1101103 -0.11487510 0.04062204 -0.09899231 -0.03265037 -0.042486141 ## 3 0.3303485 0.35231971 0.14057808 0.32967130 0.54442929 0.002913623 ## 4 0.1856664 0.27527088 0.10980958 0.79711920 0.47866008 2.028471066 ## 5 -0.1136077 -0.10918483 -0.05097057 -0.13135334 -0.18878627 -0.097928345 ## sexy hot kissed dance band marching ## 1 0.11740551 0.41389104 0.06787768 0.22780899 -0.10257102 -0.10942590 ## 2 -0.04329091 -0.03812345 -0.04554933 0.04573186 4.06726666 5.25757242 ## 3 0.24040196 0.38551819 -0.03356121 0.45662534 -0.02120728 -0.10880541 ## 4 0.51266080 0.31708549 2.97973077 0.45535061 0.38053621 -0.02014608 ## 5 -0.09501817 -0.13810894 -0.13535855 -0.15932739 -0.12167214 -0.11098063 ## music rock god church jesus bible ## 1 0.1378306 0.05905951 0.03651755 -0.00709374 0.01458533 -0.03692278 ## 2 0.4981238 0.15963917 0.09283620 0.06414651 0.04801941 0.05863810 ## 3 0.2844999 0.21436936 0.35014919 0.53739806 0.27843424 0.22990963 ## 4 1.1367885 1.21013948 0.41679142 0.16627797 0.12988313 0.08478769 ## 5 -0.1532006 -0.12460034 -0.12144246 -0.15889274 -0.08557822 -0.06813159 ## hair dress blonde mall shopping clothes ## 1 0.43807926 0.14905267 0.06137340 0.60368108 0.79806891 0.5651537331 ## 2 -0.04484083 0.07201611 -0.01146396 -0.08724304 -0.03865318 -0.0003526292 ## 3 0.23612853 0.39407628 0.03471458 0.48318495 0.66327838 0.3759725120 ## 4 2.55623737 0.53852195 0.36134138 0.62256686 0.27101815 1.2306917174 ## 5 -0.20498730 -0.14348036 -0.02918252 -0.18625656 -0.22865236 -0.1865419798 ## hollister abercrombie die death drunk drugs ## 1 4.1521844 3.96493810 0.043475966 0.09857501 0.035614771 0.03443294 ## 2 -0.1678300 -0.14129577 0.009447317 0.05135888 -0.086773220 -0.06878491 ## 3 -0.0553846 -0.07417839 0.037989066 0.11972190 -0.009688746 -0.05973769 ## 4 0.1610784 0.26324494 1.712181870 0.93631312 1.897388200 2.73326605 ## 5 -0.1557662 -0.14861104 -0.094875180 -0.08370729 -0.087520105 -0.11423381
By examining whether the clusters fall above or below the mean level for each interest category, we can begin to notice patterns that distinguish the clusters from each other.
In practice, this involves printing the cluster centers and searching through them for any patterns or extreme values, much like a word search puzzle but with numbers.
E.g., Cluster 3 is substantially above the mean interest level on all the sports.
Given this subset of the interest data, we can already infer some characteristics of the clusters. In the following table, each cluster is shown with the features that most distinguish it from the other clusters.
The algorithm appears to be performing quite well. Now we can focus our effort on turning these insights into action. We’ll begin applying the clusters back to full dataset. We can add the identified cluster ID as a new column to orignal data.
# apply the cluster IDs to the original data frame teens$cluster <- teen_clusters$cluster
Now we can examine how the cluster relates to individual charateristics
# look at the first five records
teens[1:5, c("cluster", "gender", "age", "friends")]
## cluster gender age friends ## 1 5 M 18.982 7 ## 2 3 F 18.801 0 ## 3 5 M 18.335 69 ## 4 5 F 18.875 0 ## 5 4 <NA> 18.995 10
Using the aggregate() function, we can also look at the demographic characteristics of the clusters.
# mean age by cluster aggregate(data = teens, age ~ cluster, mean)
## cluster age ## 1 1 16.86497 ## 2 2 17.39037 ## 3 3 17.07656 ## 4 4 17.11957 ## 5 5 17.29849
# proportion of females by cluster aggregate(data = teens, female ~ cluster, mean)
## cluster female ## 1 1 0.8381171 ## 2 2 0.7250000 ## 3 3 0.8378198 ## 4 4 0.8027079 ## 5 5 0.6994515
# mean number of friends by cluster aggregate(data = teens, friends ~ cluster, mean)
## cluster friends ## 1 1 41.43054 ## 2 2 32.57333 ## 3 3 37.16185 ## 4 4 30.50290 ## 5 5 27.70052
Conclusion: The association among group membership, gender, and number of friends suggests that the clusters can be useful predictors of behavior.
In business world, age and income are two crucial features that could be utilized to segment potential customers as they are highly influential for purchasing behaviors and capacities
This session we will use both k-means and hierarchical clustering method to analyze teh age_income data to segment potential consumers, then we will compare their performance and interpret the results.
First load the data age_income_data.csv from canvas.
market <- read.csv("age_income_data.csv")
We will first explore the structure and statistics of the variables in this dataset.
str(market)
## 'data.frame': 8105 obs. of 3 variables: ## $ bin : chr "60-69" "30-39" "20-29" "30-39" ... ## $ age : int 64 33 24 33 78 62 88 54 54 31 ... ## $ income: num 87083 76808 12044 61972 60120 ...
summary(market)
## bin age income ## Length:8105 Min. :18.00 Min. : 233.6 ## Class :character 1st Qu.:28.00 1st Qu.: 43792.7 ## Mode :character Median :39.00 Median : 65060.0 ## Mean :42.85 Mean : 66223.6 ## 3rd Qu.:55.00 3rd Qu.: 85944.7 ## Max. :89.00 Max. :178676.4
Since there are only two variables, we can further utilize stacked box plot to explore potential trend/relationship between these two variables.
par(mfrow = c(1, 2)) #Create a partition of the plotting space to plot two graphs side by side boxplot(market$age ~ market$bin, main = "Explore Age") boxplot(market$income ~ market$bin, main = "Explore Income") #Observe the trend of income for different age range
par(mfrow = c(1, 1)) #Reset the ploting partition to default (1 plot)
We noticed that there are potential relationship between age and income, we can further validate this with correlation test. cor.test is a function we can use to test the statistical significance of a certain correlation relationship.
#Test the correlation between age and income cor.test(market$age, market$income)
## ## Pearson's product-moment correlation ## ## data: market$age and market$income ## t = -5.4055, df = 8103, p-value = 6.648e-08 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## -0.08160633 -0.03822020 ## sample estimates: ## cor ## -0.05994158
We will only focus on the p-value output, since it is smaller than 0.5, it indicates a significant correlation between age and income.
To understand the importance of scaling, we will first create a k-means model without scaling to see what we will get:
#K-means Clustering without Scaling set.seed(789) #Set a random seed to generate the same result three <- kmeans(market[, 2:3], 3) #Specify variables age and income #Specify k=3 clusters
Then we can visualize the cluster results with following code:
We first create a scatter plot, then add points of cluster centers. Use color to distinguish different clusters
plot(market$age, market$inc, col = three$cluster, xlab = 'age',
ylab = 'income', main = 'K-means without Scaling')
#Add points of centers to the plot
points(three$centers[, 1], three$centers[, 2],
pch = 23, col = 'maroon', bg = 'lightblue', cex = 3)
rm(three) #remove the k-means cluster result
Since the scales of age and income are drastically different, the clustering results are over-simplified. We need to use scale() function to standardize these two variables.
market$age_scale <- as.numeric(scale(market$age)) market$inc_scale <- as.numeric(scale(market$income))
Then we will redo the k-means anlaysis
#Redo clustering after scaling
set.seed(789)
three <- kmeans(market[, 4:5], 3)
#Visualize results after scaling
plot(market$age_scale, market$inc_scale,
col=three$cluster,
xlab = 'age', ylab = 'income',
main = 'K-means with Scaling')
points(three$centers[, 1], three$centers[, 2],
pch = 23, col = 'maroon', bg = 'lightblue', cex = 3)
rm(three)
Now if we want to see what would be the best number of clusters, we can build multiple models to evaluate which one performs the best.
#Change number of clusters from 2 to 10, then compare their results through elbow plot set.seed(456) two <- kmeans(market[, 4:5], 2) three <- kmeans(market[, 4:5], 3) four <- kmeans(market[, 4:5], 4) five <- kmeans(market[, 4:5], 5) six <- kmeans(market[, 4:5], 6) seven <- kmeans(market[, 4:5], 7) eight <- kmeans(market[, 4:5], 8) nine <- kmeans(market[, 4:5], 9) ten <- kmeans(market[, 4:5], 10)
To evaluate models, we need to first calculate the total within-cluster sum-of-square (measure how losely data are clustered). tot.withinss stands for Total within-cluster sum of squares, i.e. sum(withiness)
# Evaluting the Models optimize <- data.frame(clusters = c(2:10), wss = rep(0, 9)) #Create a new dataframe to store different wss(withinness) from the above models optimize[1, 2] <- as.numeric(two$tot.withinss) #store the total within-cluster sum of square to their respective row optimize[2, 2] <- as.numeric(three$tot.withinss) optimize[3, 2] <- as.numeric(four$tot.withinss) optimize[4, 2] <- as.numeric(five$tot.withinss) optimize[5, 2] <- as.numeric(six$tot.withinss) optimize[6, 2] <- as.numeric(seven$tot.withinss) optimize[7, 2] <- as.numeric(eight$tot.withinss) optimize[8, 2] <- as.numeric(nine$tot.withinss) optimize[9, 2] <- as.numeric(ten$tot.withinss)
So now we can use elbow plot to identify the “elbow point”, which gives us a sense of the performance of each model.
#Plot out the Elbow plot
plot(optimize$wss ~ optimize$clusters, type = "b",
ylim = c(0, 12000), ylab = 'Within Sum of Square Error',
main = 'Finding Optimal Number of Clusters Based on Error',
xlab = 'Number of Clusters', pch = 17, col = 'black')
Then we will use hclust() function to conduct hierarchical clustering. We will need a dist()function to produce a distance matrix as an input, as well as a specified method for cluster distance.
set.seed(456) hc_mod <- hclust(dist(market[, 4:5]), method = "ward.D2") hc_mod
## ## Call: ## hclust(d = dist(market[, 4:5]), method = "ward.D2") ## ## Cluster method : ward.D2 ## Distance : euclidean ## Number of objects: 8105
Then we can visualize the result use as.dendrogram() function
# Visualizing the Model Output
dend <- as.dendrogram(hc_mod)
#Install this package if you don't have it
if(!require("dendextend")) install.packages("dendextend")
library(dendextend)
#Add colors to the different branches
dend_six_color <- color_branches(dend, k = 6)
#Plot the dendrogram
plot(dend_six_color, leaflab = "none", horiz = TRUE,
main = "Age and Income Dendrogram", xlab = "Height")
#Add a line to indicate cluster decision
abline(v = 37.5, lty = 'dashed', col = 'blue')
An alternative way to show the hierarchical structure:
str(cut(dend, h = 37.5)$upper)
## --[dendrogram w/ 2 branches and 6 members at h = 108] ## |--[dendrogram w/ 2 branches and 2 members at h = 38] ## | |--leaf "Branch 1" (h= 15.7 midpoint = 274, x.member = 782 ) ## | `--leaf "Branch 2" (h= 19.5 midpoint = 628, x.member = 1526 ) ## `--[dendrogram w/ 2 branches and 4 members at h = 93.8] ## |--[dendrogram w/ 2 branches and 2 members at h = 41.3] ## | |--leaf "Branch 3" (h= 17 midpoint = 431, x.member = 905 ) ## | `--leaf "Branch 4" (h= 18.5 midpoint = 463, x.member = 1473 ) ## `--[dendrogram w/ 2 branches and 2 members at h = 56.4] ## |--leaf "Branch 5" (h= 13.6 midpoint = 530, x.member = 1323 ) ## `--leaf "Branch 6" (h= 30.8 midpoint = 753, x.member = 2096 )
#Save hierarchical clustering ID results with 5 clusters dend_five <- cutree(dend, k = 5) market$dend5 <- dend_five #Save hierarchical clustering ID results with 6 clusters dend_six <- cutree(dend, k = 6) market$dend6 <- dend_six #Show the cluster IDs market[1:10,]
## bin age income age_scale inc_scale dend5 dend6 ## 1 60-69 64 87083.24 1.2071346 0.75070999 1 1 ## 2 30-39 33 76807.82 -0.5619790 0.38091240 2 2 ## 3 20-29 24 12043.60 -1.0755926 -1.94986082 3 3 ## 4 30-39 33 61972.00 -0.5619790 -0.15300793 2 2 ## 5 70-79 78 60120.32 2.0060891 -0.21964755 4 4 ## 6 60-69 62 40058.42 1.0929982 -0.94164698 4 5 ## 7 80- 88 38850.72 2.5767709 -0.98511016 4 4 ## 8 50-59 54 65239.05 0.6364528 -0.03543151 4 5 ## 9 50-59 54 51362.92 0.6364528 -0.53481378 4 5 ## 10 30-39 31 36418.25 -0.6761154 -1.07265161 2 2
We notice that the within sum of square error reduced at the point of 5 or 6 clusters.
Then we will save both k-means and hierarchical cluster results (with 5 and 6 clusters for each) into original dataset.