Given the text of teenagers’ Social Networking Service (SNS) pages, it is possible to identify groups that share common interests such as sports, religion, or music. Clustering can automate the process of discovering the natural segments in this population. Finding the natural clusters of teenagers based on their messages in a social network is the purpose of this study.
library(ggplot2)
library(lattice)
library(stats)
library(cluster) # for plotting clusters with clusplot()
library(fpc) # for plotting clusters with plotcluster(), validating clusters
For this analysis, I will be using a dataset representing a random sample of 30,000 U.S. high school students who had profiles on a well-known SNS in 2006. The SNS is unnamed for the security purposes. However, the assumption is the profiles represent a fairly wide cross section of American adolescents in 2006.
A text mining tool was used to divide the SNS page content into words. From the top 500 words appearing across all pages, 36 words were chosen to represent five categories of interests, namely extracurricular activities, fashion, religion, romance, and antisocial behavior. The 36 words include terms such as football, sexy, kissed, bible, shopping, death, and drugs. The final dataset indicates, for each person, how many times each word appeared in the person’s SNS profile.
teens <- read.csv("snsdata.csv")
str(teens)
## 'data.frame': 30000 obs. of 40 variables:
## $ gradyear : int 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
## $ gender : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 1 1 ...
## $ age : num 19 18.8 18.3 18.9 19 ...
## $ friends : int 7 0 69 0 10 142 72 17 52 39 ...
## $ basketball : int 0 0 0 0 0 0 0 0 0 0 ...
## $ football : int 0 1 1 0 0 0 0 0 0 0 ...
## $ soccer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ softball : int 0 0 0 0 0 0 0 1 0 0 ...
## $ volleyball : int 0 0 0 0 0 0 0 0 0 0 ...
## $ swimming : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cheerleading: int 0 0 0 0 0 0 0 0 0 0 ...
## $ baseball : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tennis : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sports : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cute : int 0 1 0 1 0 0 0 0 0 1 ...
## $ sex : int 0 0 0 0 1 1 0 2 0 0 ...
## $ sexy : int 0 0 0 0 0 0 0 1 0 0 ...
## $ hot : int 0 0 0 0 0 0 0 0 0 1 ...
## $ kissed : int 0 0 0 0 5 0 0 0 0 0 ...
## $ dance : int 1 0 0 0 1 0 0 0 0 0 ...
## $ band : int 0 0 2 0 1 0 1 0 0 0 ...
## $ marching : int 0 0 0 0 0 1 1 0 0 0 ...
## $ music : int 0 2 1 0 3 2 0 1 0 1 ...
## $ rock : int 0 2 0 1 0 0 0 1 0 1 ...
## $ god : int 0 1 0 0 1 0 0 0 0 6 ...
## $ church : int 0 0 0 0 0 0 0 0 0 0 ...
## $ jesus : int 0 0 0 0 0 0 0 0 0 2 ...
## $ bible : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hair : int 0 6 0 0 1 0 0 0 0 1 ...
## $ dress : int 0 4 0 0 0 1 0 0 0 0 ...
## $ blonde : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mall : int 0 1 0 0 0 0 2 0 0 0 ...
## $ shopping : int 0 0 0 0 2 1 0 0 0 1 ...
## $ clothes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hollister : int 0 0 0 0 0 0 2 0 0 0 ...
## $ abercrombie : int 0 0 0 0 0 0 0 0 0 0 ...
## $ die : int 0 0 0 0 0 0 0 0 0 0 ...
## $ death : int 0 0 1 0 0 0 0 0 0 0 ...
## $ drunk : int 0 0 0 0 1 1 0 0 0 0 ...
## $ drugs : int 0 0 0 0 1 0 0 0 0 0 ...
The data include 30,000 teenagers with four variables indicating personal characteristics (gradyear, gender, age and friends) and 36 words indicating interests (basketball, football, soccer, etc).
I will check for the presence of missing values in any of four variables corresponding to users’ private information.
table(teens$gradyear, useNA = "ifany")
##
## 2006 2007 2008 2009
## 7500 7500 7500 7500
table(teens$gender, useNA = "ifany")
##
## F M <NA>
## 22054 5222 2724
summary(teens$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.086 16.310 17.290 17.990 18.260 106.900 5086
summary(teens$friends)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 3.00 20.00 30.18 44.00 830.00
The summaries show that there are no missing values for the gradyear and friends variables. However, we have 2724 users (~9%) with a missing value of their gender and 5086 users (~17%) with missing age.
Although the majority of users provides the age range of 16 to 18 years, there are also outliers ranging between 3 and 106 years:
boxplot(teens$age, ylab = "Age")
rug(jitter(teens$age), side = 2)
abline(h = mean(teens$age, na.rm = T), lty = 2)
A reasonable range of ages for high school students includes those who are at least 13 years old and not yet 20 years old. Any age value falling outside this range will be treated the same as missing data because it is not feasible to trust the age provided.
teens$age <- ifelse(teens$age >= 13 & teens$age < 20, teens$age, NA)
summary(teens$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 13.03 16.30 17.26 17.25 18.22 20.00 5523
Now, I need to deal with the missing values of age, whose number only increased after the previous action. For this, I create dummy variables for female and unknown gender.
teens$female <- ifelse(teens$gender == "F" & !is.na(teens$gender), 1, 0)
teens$no_gender <- ifelse(is.na(teens$gender), 1, 0)
In order to eliminate 5523 missing values on age, I will use an imputation of mean age values, calculated for each specific graduation year.
mean(teens$age, na.rm = TRUE)
## [1] 17.25243
aggregate(data = teens, age ~ gradyear, mean, na.rm = TRUE)
## gradyear age
## 1 2006 18.65586
## 2 2007 17.70617
## 3 2008 16.76770
## 4 2009 15.81957
ave_age <- ave(teens$age,
teens$gradyear,
FUN = function(x) mean(x, na.rm = TRUE))
teens$age <- ifelse(is.na(teens$age), ave_age, teens$age)
I start the cluster analysis by considering only the 36 features that represent the number of times various interests appeared on the SNS profiles of teens. For this, I create the dataset interests consisting of these features only. Next, I use a z-score standardization that rescales features such that they have a mean of zero and a standard deviation of one.
interests <- teens[5:40]
interests_z <- as.data.frame(lapply(interests, scale))
The last decision is how many clusters to use for segmenting the data. I am going to use 5 clusters, based on the five stereotypical teenagers groups in the movie The Breakfast Club. The five stereotypes in the movie aredefined as folows: a Brain, an Athlete, a Basket Case, a Princess, and a Criminal.
set.seed(12)
teen_clusters <- kmeans(interests_z, 5)
clusplot(interests_z,
teen_clusters$cluster,
color=TRUE,
col.clus=c(1:5)[unique(teen_clusters$cluster)],
shade=TRUE,
labels=4,
lines=0,
main = "Bivariate Cluster Plot")
plotcluster(interests_z, teen_clusters$cluster)
The succsess of the clustering in this case will be measured mostly in qualitative terms. As a starting point, I will check how many members are there in each cluster. Usually, if the groups are too large or too small, then they are not likely to be very useful.
teen_clusters$size
## [1] 807 871 21386 1034 5902
There are no clusters containing only a single person, which can happen occasionally with k-means, which is a good sign. Nowever, the third cluster contains 21386 users, which is quite a lot compared to other clusters. I will examine this in more details further.
For a more in-depth look at the clusters, I will examine the coordinates of the cluster centroids and print out the list of all the clusters with associated interests. Negative values here are below the overall mean for all students and positive values are above the mean.
teen_clusters$centers
## basketball football soccer softball volleyball swimming
## 1 -0.1197076 0.03407084 -0.07534803 -0.01857530 -0.08231183 0.04443602
## 2 0.1600123 0.23641736 0.10385512 0.07232021 0.18897158 0.23970234
## 3 -0.1659080 -0.16391963 -0.08937126 -0.11450666 -0.11647460 -0.10647982
## 4 0.3408104 0.35939651 0.12722250 0.16384661 0.11032200 0.26943332
## 5 0.5342163 0.49145283 0.29652556 0.37807886 0.38608679 0.29717768
## cheerleading baseball tennis sports cute sex
## 1 -0.1037258 -0.10137260 0.01803027 -0.1156168 -0.04100467 -0.048513710
## 2 0.3931445 0.02993479 0.13532387 0.1025784 0.37884271 0.020042068
## 3 -0.1133195 -0.10953078 -0.05071678 -0.1309568 -0.18710011 -0.098106465
## 4 0.1856664 0.27527088 0.10980958 0.7971192 0.47866008 2.028471066
## 5 0.3342512 0.35810397 0.14209901 0.3355434 0.54379994 0.003788495
## sexy hot kissed dance band marching
## 1 -0.009262782 -0.07025678 -0.06394190 0.03079363 3.36621675 4.63236247
## 2 0.117405508 0.41389104 0.06787768 0.22780899 -0.10257102 -0.10942590
## 3 -0.089383638 -0.13784208 -0.13504780 -0.16014339 -0.13116273 -0.13522953
## 4 0.512660797 0.31708549 2.97973077 0.45535061 0.38053621 -0.02014608
## 5 0.218007808 0.39244704 -0.03395961 0.46267738 -0.03653438 -0.12371345
## music rock god church jesus bible
## 1 0.3785335 0.14557689 0.06135048 0.02931310 0.04593181 0.02896785
## 2 0.1378306 0.05905951 0.03651755 -0.00709374 0.01458533 -0.03692278
## 3 -0.1536903 -0.12553131 -0.12169734 -0.15851136 -0.08590196 -0.06814393
## 4 1.1367885 1.21013948 0.41679142 0.16627797 0.12988313 0.08478769
## 5 0.2856413 0.21423365 0.35417491 0.54227642 0.28007953 0.23355445
## hair dress blonde mall shopping clothes
## 1 -0.04741405 0.05086888 -0.01329495 -0.1009747 -0.06820414 -0.007444541
## 2 0.43807926 0.14905267 0.06137340 0.6036811 0.79806891 0.565153733
## 3 -0.20463467 -0.14223712 -0.02910047 -0.1838132 -0.22755611 -0.186272626
## 4 2.55623737 0.53852195 0.36134138 0.6225669 0.27101815 1.230691717
## 5 0.23549031 0.39210018 0.03490147 0.4816970 0.66862165 0.376965432
## hollister abercrombie die death drunk drugs
## 1 -0.16931294 -0.1475679 -0.02295448 0.02216690 -0.086897407 -0.08524621
## 2 4.15218436 3.9649381 0.04347597 0.09857501 0.035614771 0.03443294
## 3 -0.15522297 -0.1485723 -0.09442739 -0.08339236 -0.087310572 -0.11373535
## 4 0.16107838 0.2632449 1.71218187 0.93631312 1.897388200 2.73326605
## 5 -0.05538354 -0.0727214 0.03891643 0.12055808 -0.009415413 -0.06015763
Given these interests, it is already possible to infer some characteristics of the clusters. Cluster 2, 4 and 5 are substantially above the mean on all the listed sports. Out of these three clusters the last one (number 5) is not interested in drugs and drunk, suggesting that this group may include athletes.
By continuing to examine the clusters in this way, it’s possible to construct a table listing the dominant interests of each of the groups.
Because clustering creates new information, the performance of a clustering algorithm depends at least somewhat on both the quality of the clusters themselves as well as what you do with that information.
In order to turn the insights gathered from the clustering process into action, I apply the clusters back onto the full dataset.
teens$cluster <- teen_clusters$cluster
teens[1:10, c("cluster", "gender", "age", "friends")]
## cluster gender age friends
## 1 3 M 18.98200 7
## 2 5 F 18.80100 0
## 3 3 M 18.33500 69
## 4 3 F 18.87500 0
## 5 4 <NA> 18.99500 10
## 6 3 F 18.65586 142
## 7 2 F 18.93000 72
## 8 3 M 18.32200 17
## 9 3 F 19.05500 52
## 10 5 F 18.70800 39
aggregate(data = teens, age ~ cluster, mean)
## cluster age
## 1 1 17.37390
## 2 2 16.86497
## 3 3 17.29687
## 4 4 17.11957
## 5 5 17.07847
aggregate(data = teens, female ~ cluster, mean)
## cluster female
## 1 1 0.7323420
## 2 2 0.8381171
## 3 3 0.6995698
## 4 4 0.8027079
## 5 5 0.8373433
aggregate(data = teens, friends ~ cluster, mean)
## cluster friends
## 1 1 32.73606
## 2 2 41.43054
## 3 3 27.66754
## 4 4 30.50290
## 5 5 37.21484
To recall, overall about 74 percent of the SNS users are female. Clusters 2 and 5 are around 83 percent female, while Cluster 3 is only 69 procent female. The same cluster 3 has the smallest amount of friends on average - about 27 percent, while clusters 2 and 5 have the highest average number of friends.
The association among group membership, gender, and number of friends suggests that the clusters can be useful predictors for preparing e.g. marketing activities based on these insights.