Given the text of teenagers’ Social Networking Service (SNS) pages, it is possible to identify groups that share common interests such as sports, religion, or music. Clustering can automate the process of discovering the natural segments in this population. Finding the natural clusters of teenagers based on their messages in a social network is the purpose of this study.

Used Libraries

library(ggplot2)
library(lattice)
library(stats)
library(cluster)  # for plotting clusters with clusplot()
library(fpc)      # for plotting clusters with plotcluster(), validating clusters

Load Data

For this analysis, I will be using a dataset representing a random sample of 30,000 U.S. high school students who had profiles on a well-known SNS in 2006. The SNS is unnamed for the security purposes. However, the assumption is the profiles represent a fairly wide cross section of American adolescents in 2006.

A text mining tool was used to divide the SNS page content into words. From the top 500 words appearing across all pages, 36 words were chosen to represent five categories of interests, namely extracurricular activities, fashion, religion, romance, and antisocial behavior. The 36 words include terms such as football, sexy, kissed, bible, shopping, death, and drugs. The final dataset indicates, for each person, how many times each word appeared in the person’s SNS profile.

teens <- read.csv("snsdata.csv")
str(teens)
## 'data.frame':    30000 obs. of  40 variables:
##  $ gradyear    : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
##  $ gender      : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 1 1 ...
##  $ age         : num  19 18.8 18.3 18.9 19 ...
##  $ friends     : int  7 0 69 0 10 142 72 17 52 39 ...
##  $ basketball  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ football    : int  0 1 1 0 0 0 0 0 0 0 ...
##  $ soccer      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ softball    : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ volleyball  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ swimming    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cheerleading: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ baseball    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tennis      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sports      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cute        : int  0 1 0 1 0 0 0 0 0 1 ...
##  $ sex         : int  0 0 0 0 1 1 0 2 0 0 ...
##  $ sexy        : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ hot         : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ kissed      : int  0 0 0 0 5 0 0 0 0 0 ...
##  $ dance       : int  1 0 0 0 1 0 0 0 0 0 ...
##  $ band        : int  0 0 2 0 1 0 1 0 0 0 ...
##  $ marching    : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ music       : int  0 2 1 0 3 2 0 1 0 1 ...
##  $ rock        : int  0 2 0 1 0 0 0 1 0 1 ...
##  $ god         : int  0 1 0 0 1 0 0 0 0 6 ...
##  $ church      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ jesus       : int  0 0 0 0 0 0 0 0 0 2 ...
##  $ bible       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hair        : int  0 6 0 0 1 0 0 0 0 1 ...
##  $ dress       : int  0 4 0 0 0 1 0 0 0 0 ...
##  $ blonde      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mall        : int  0 1 0 0 0 0 2 0 0 0 ...
##  $ shopping    : int  0 0 0 0 2 1 0 0 0 1 ...
##  $ clothes     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hollister   : int  0 0 0 0 0 0 2 0 0 0 ...
##  $ abercrombie : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ die         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ death       : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ drunk       : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ drugs       : int  0 0 0 0 1 0 0 0 0 0 ...

The data include 30,000 teenagers with four variables indicating personal characteristics (gradyear, gender, age and friends) and 36 words indicating interests (basketball, football, soccer, etc).

Data Analysis

I will check for the presence of missing values in any of four variables corresponding to users’ private information.

table(teens$gradyear, useNA = "ifany")
## 
## 2006 2007 2008 2009 
## 7500 7500 7500 7500
table(teens$gender, useNA = "ifany")
## 
##     F     M  <NA> 
## 22054  5222  2724
summary(teens$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   3.086  16.310  17.290  17.990  18.260 106.900    5086
summary(teens$friends)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    3.00   20.00   30.18   44.00  830.00

The summaries show that there are no missing values for the gradyear and friends variables. However, we have 2724 users (~9%) with a missing value of their gender and 5086 users (~17%) with missing age.

Although the majority of users provides the age range of 16 to 18 years, there are also outliers ranging between 3 and 106 years:

boxplot(teens$age, ylab = "Age")
rug(jitter(teens$age), side = 2)
abline(h = mean(teens$age, na.rm = T), lty = 2)

Cleaning and Preparing Data

A reasonable range of ages for high school students includes those who are at least 13 years old and not yet 20 years old. Any age value falling outside this range will be treated the same as missing data because it is not feasible to trust the age provided.

teens$age <- ifelse(teens$age >= 13 & teens$age < 20, teens$age, NA)
summary(teens$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.03   16.30   17.26   17.25   18.22   20.00    5523

Now, I need to deal with the missing values of age, whose number only increased after the previous action. For this, I create dummy variables for female and unknown gender.

teens$female <- ifelse(teens$gender == "F" & !is.na(teens$gender), 1, 0)
teens$no_gender <- ifelse(is.na(teens$gender), 1, 0)

In order to eliminate 5523 missing values on age, I will use an imputation of mean age values, calculated for each specific graduation year.

mean(teens$age, na.rm = TRUE)
## [1] 17.25243
aggregate(data = teens, age ~ gradyear, mean, na.rm = TRUE)
##   gradyear      age
## 1     2006 18.65586
## 2     2007 17.70617
## 3     2008 16.76770
## 4     2009 15.81957
ave_age <- ave(teens$age, 
               teens$gradyear, 
               FUN = function(x) mean(x, na.rm = TRUE))
teens$age <- ifelse(is.na(teens$age), ave_age, teens$age)

Training a Model on the Data

I start the cluster analysis by considering only the 36 features that represent the number of times various interests appeared on the SNS profiles of teens. For this, I create the dataset interests consisting of these features only. Next, I use a z-score standardization that rescales features such that they have a mean of zero and a standard deviation of one.

interests <- teens[5:40]
interests_z <- as.data.frame(lapply(interests, scale))

The last decision is how many clusters to use for segmenting the data. I am going to use 5 clusters, based on the five stereotypical teenagers groups in the movie The Breakfast Club. The five stereotypes in the movie aredefined as folows: a Brain, an Athlete, a Basket Case, a Princess, and a Criminal.

set.seed(12)
teen_clusters <- kmeans(interests_z, 5)

clusplot(interests_z,
         teen_clusters$cluster, 
         color=TRUE,
         col.clus=c(1:5)[unique(teen_clusters$cluster)],
         shade=TRUE,
         labels=4, 
         lines=0, 
         main = "Bivariate Cluster Plot")

plotcluster(interests_z, teen_clusters$cluster)

Evaluating Model Performance

The succsess of the clustering in this case will be measured mostly in qualitative terms. As a starting point, I will check how many members are there in each cluster. Usually, if the groups are too large or too small, then they are not likely to be very useful.

teen_clusters$size
## [1]   807   871 21386  1034  5902

There are no clusters containing only a single person, which can happen occasionally with k-means, which is a good sign. Nowever, the third cluster contains 21386 users, which is quite a lot compared to other clusters. I will examine this in more details further.

For a more in-depth look at the clusters, I will examine the coordinates of the cluster centroids and print out the list of all the clusters with associated interests. Negative values here are below the overall mean for all students and positive values are above the mean.

teen_clusters$centers
##   basketball    football      soccer    softball  volleyball    swimming
## 1 -0.1197076  0.03407084 -0.07534803 -0.01857530 -0.08231183  0.04443602
## 2  0.1600123  0.23641736  0.10385512  0.07232021  0.18897158  0.23970234
## 3 -0.1659080 -0.16391963 -0.08937126 -0.11450666 -0.11647460 -0.10647982
## 4  0.3408104  0.35939651  0.12722250  0.16384661  0.11032200  0.26943332
## 5  0.5342163  0.49145283  0.29652556  0.37807886  0.38608679  0.29717768
##   cheerleading    baseball      tennis     sports        cute          sex
## 1   -0.1037258 -0.10137260  0.01803027 -0.1156168 -0.04100467 -0.048513710
## 2    0.3931445  0.02993479  0.13532387  0.1025784  0.37884271  0.020042068
## 3   -0.1133195 -0.10953078 -0.05071678 -0.1309568 -0.18710011 -0.098106465
## 4    0.1856664  0.27527088  0.10980958  0.7971192  0.47866008  2.028471066
## 5    0.3342512  0.35810397  0.14209901  0.3355434  0.54379994  0.003788495
##           sexy         hot      kissed       dance        band    marching
## 1 -0.009262782 -0.07025678 -0.06394190  0.03079363  3.36621675  4.63236247
## 2  0.117405508  0.41389104  0.06787768  0.22780899 -0.10257102 -0.10942590
## 3 -0.089383638 -0.13784208 -0.13504780 -0.16014339 -0.13116273 -0.13522953
## 4  0.512660797  0.31708549  2.97973077  0.45535061  0.38053621 -0.02014608
## 5  0.218007808  0.39244704 -0.03395961  0.46267738 -0.03653438 -0.12371345
##        music        rock         god      church       jesus       bible
## 1  0.3785335  0.14557689  0.06135048  0.02931310  0.04593181  0.02896785
## 2  0.1378306  0.05905951  0.03651755 -0.00709374  0.01458533 -0.03692278
## 3 -0.1536903 -0.12553131 -0.12169734 -0.15851136 -0.08590196 -0.06814393
## 4  1.1367885  1.21013948  0.41679142  0.16627797  0.12988313  0.08478769
## 5  0.2856413  0.21423365  0.35417491  0.54227642  0.28007953  0.23355445
##          hair       dress      blonde       mall    shopping      clothes
## 1 -0.04741405  0.05086888 -0.01329495 -0.1009747 -0.06820414 -0.007444541
## 2  0.43807926  0.14905267  0.06137340  0.6036811  0.79806891  0.565153733
## 3 -0.20463467 -0.14223712 -0.02910047 -0.1838132 -0.22755611 -0.186272626
## 4  2.55623737  0.53852195  0.36134138  0.6225669  0.27101815  1.230691717
## 5  0.23549031  0.39210018  0.03490147  0.4816970  0.66862165  0.376965432
##     hollister abercrombie         die       death        drunk       drugs
## 1 -0.16931294  -0.1475679 -0.02295448  0.02216690 -0.086897407 -0.08524621
## 2  4.15218436   3.9649381  0.04347597  0.09857501  0.035614771  0.03443294
## 3 -0.15522297  -0.1485723 -0.09442739 -0.08339236 -0.087310572 -0.11373535
## 4  0.16107838   0.2632449  1.71218187  0.93631312  1.897388200  2.73326605
## 5 -0.05538354  -0.0727214  0.03891643  0.12055808 -0.009415413 -0.06015763

Given these interests, it is already possible to infer some characteristics of the clusters. Cluster 2, 4 and 5 are substantially above the mean on all the listed sports. Out of these three clusters the last one (number 5) is not interested in drugs and drunk, suggesting that this group may include athletes.

By continuing to examine the clusters in this way, it’s possible to construct a table listing the dominant interests of each of the groups.

Analysing Model Performance

Because clustering creates new information, the performance of a clustering algorithm depends at least somewhat on both the quality of the clusters themselves as well as what you do with that information.

In order to turn the insights gathered from the clustering process into action, I apply the clusters back onto the full dataset.

teens$cluster <- teen_clusters$cluster
teens[1:10, c("cluster", "gender", "age", "friends")]
##    cluster gender      age friends
## 1        3      M 18.98200       7
## 2        5      F 18.80100       0
## 3        3      M 18.33500      69
## 4        3      F 18.87500       0
## 5        4   <NA> 18.99500      10
## 6        3      F 18.65586     142
## 7        2      F 18.93000      72
## 8        3      M 18.32200      17
## 9        3      F 19.05500      52
## 10       5      F 18.70800      39
aggregate(data = teens, age ~ cluster, mean)
##   cluster      age
## 1       1 17.37390
## 2       2 16.86497
## 3       3 17.29687
## 4       4 17.11957
## 5       5 17.07847
aggregate(data = teens, female ~ cluster, mean)
##   cluster    female
## 1       1 0.7323420
## 2       2 0.8381171
## 3       3 0.6995698
## 4       4 0.8027079
## 5       5 0.8373433
aggregate(data = teens, friends ~ cluster, mean)
##   cluster  friends
## 1       1 32.73606
## 2       2 41.43054
## 3       3 27.66754
## 4       4 30.50290
## 5       5 37.21484

To recall, overall about 74 percent of the SNS users are female. Clusters 2 and 5 are around 83 percent female, while Cluster 3 is only 69 procent female. The same cluster 3 has the smallest amount of friends on average - about 27 percent, while clusters 2 and 5 have the highest average number of friends.

The association among group membership, gender, and number of friends suggests that the clusters can be useful predictors for preparing e.g. marketing activities based on these insights.