Step 1: Collecting Data

The dataset collected used for clustering analysis with k-menas contains social networking service (SNS) information of 30000 anonymous U.S. high school students who had SNS profile in 2006. This dataset was compiled by Brett Lantz while conducting his sociological research on the teenage identities at the University of Notre Dame, and was used again as example in his book: Machine Learning with R (second edition) published by PACKT. This exercise goes along with his book example of using SNS data to perform clustering analysis. The SNS dataset can be found at the Packt Publishing website with filename: snsdata.csv, or can be found from Prof. Eric Suess Machine Learning course webpage at http://www.sci.csueastbay.edu/~esuess/stat6620/.

Step 2: Exploring & Preparing the Data

The SNS dataset contains 30000 observations (rows) each preresents a high school student and 40 features (columns) that provides information for the student. The 40 features includes graduation year, the gender, age and number of friends one has connected throught the SNS for each student, and the remaining 36 columns are word/s terms such as football, shopping, hot, church etc. that describes the student interest and beliefs with value of 1 indicates a belonging to the group, and 0 otherwise. Information like this can help to group individuals into clusters with similar interest, and provide help for companies’ marketing teams to advertise appropriate products online targeting students with certain interest or belief.

library(stats)
teens <- read.csv('http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml12/snsdata.csv')
str(teens)
## 'data.frame':    30000 obs. of  40 variables:
##  $ gradyear    : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
##  $ gender      : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 1 1 ...
##  $ age         : num  19 18.8 18.3 18.9 19 ...
##  $ friends     : int  7 0 69 0 10 142 72 17 52 39 ...
##  $ basketball  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ football    : int  0 1 1 0 0 0 0 0 0 0 ...
##  $ soccer      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ softball    : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ volleyball  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ swimming    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cheerleading: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ baseball    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tennis      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sports      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cute        : int  0 1 0 1 0 0 0 0 0 1 ...
##  $ sex         : int  0 0 0 0 1 1 0 2 0 0 ...
##  $ sexy        : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ hot         : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ kissed      : int  0 0 0 0 5 0 0 0 0 0 ...
##  $ dance       : int  1 0 0 0 1 0 0 0 0 0 ...
##  $ band        : int  0 0 2 0 1 0 1 0 0 0 ...
##  $ marching    : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ music       : int  0 2 1 0 3 2 0 1 0 1 ...
##  $ rock        : int  0 2 0 1 0 0 0 1 0 1 ...
##  $ god         : int  0 1 0 0 1 0 0 0 0 6 ...
##  $ church      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ jesus       : int  0 0 0 0 0 0 0 0 0 2 ...
##  $ bible       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hair        : int  0 6 0 0 1 0 0 0 0 1 ...
##  $ dress       : int  0 4 0 0 0 1 0 0 0 0 ...
##  $ blonde      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mall        : int  0 1 0 0 0 0 2 0 0 0 ...
##  $ shopping    : int  0 0 0 0 2 1 0 0 0 1 ...
##  $ clothes     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hollister   : int  0 0 0 0 0 0 2 0 0 0 ...
##  $ abercrombie : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ die         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ death       : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ drunk       : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ drugs       : int  0 0 0 0 1 0 0 0 0 0 ...

The table() function is used to examine the data quality of gender feature. Since gender is a categorical feature, it will be splitted by categories each with number of observations. However, the number of F and M doesn’t match up indicates some potential missing value that are not include in the table() by R. Another way to force table() to present everything include the NA values is to use the useNA paramter, set to ‘ifany’ inside table(). This is about 9% of missing value of the gender feature in the whole dataset.

table(teens$gender)
## 
##     F     M 
## 22054  5222
table(teens$gender, useNA = "ifany")
## 
##     F     M  <NA> 
## 22054  5222  2724

In addition, there’s about 5000 more observation that have missing values in the age feature. This is about 17% of the dataset with missing value in the age feature. In a situation which a larger dataset contain small amount of missing values, its reasonable to remove those observations as long as that small amount of removal do not affect the original distribution of the data or representation of the whole population. But in situaion like this containg 9+17 = 26% missing data. The removal of these can severely affect the overall distribution compared to the original data prior removal. So alternative data cleaning method may be done to address that.

summary(teens$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   3.086  16.310  17.290  17.990  18.260 106.900    5086

The original age range contains value from 3 - 106, which is unrealistic because student at age of 3 or 106 would not attend high school. A reasonable age range for people attending high school will be the age range between 13 to 20. An ifelse() is used to keep any age values between 13 to 20 as it is, and change other value into NA.

teens$age <- ifelse(teens$age >= 13 & teens$age < 20,
                     teens$age, NA)

summary(teens$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.03   16.30   17.26   17.25   18.22   20.00    5523

A way to solve the missing value (NA) in a categorical feature to to create additional category as ‘NA’. This can be done by creating the categorical feature levels (n - 1) dummy variable. Below are codes that the first one assigns value of 1 if the fenger is equal to female and not euqal to NA, and 0 otherwise. Second one assigns value of 1 if the gender is missing value, and 0 othereise. This way, two binary categorical feature/ dummy variables are created to represent identity of each female or no_gender, both of these dummies are 0 implies that the observation contains a male student information.

teens$female <- ifelse(teens$gender == "F" &
                         !is.na(teens$gender), 1, 0)
teens$no_gender <- ifelse(is.na(teens$gender), 1, 0)

Using the table() to look at levels of features related to gender below. First one is the regular categorical feature, while the other two are newly created binary features indicate female and no_gender identities.

table(teens$gender, useNA = "ifany")
## 
##     F     M  <NA> 
## 22054  5222  2724
table(teens$female, useNA = "ifany")
## 
##     0     1 
##  7946 22054
table(teens$no_gender, useNA = "ifany")
## 
##     0     1 
## 27276  2724

The mean of the age feature is calculated but first results in a NA value since about 17% of the age feature contain missing value. One way to solve that and able to look at the avarage age excluding missing value is by setting the na.rm paramater to T.

mean(teens$age)
## [1] NA
mean(teens$age, na.rm = TRUE) 
## [1] 17.25243

The aggregate function can be used to calculate the mean age by gradyear groups of observations excluding NA values. Output shows pretty reasonable value for students with various expected graduation year. Each group is estimate to be one year apart.

aggregate(data = teens, age ~ gradyear, mean, na.rm = TRUE)
##   gradyear      age
## 1     2006 18.65586
## 2     2007 17.70617
## 3     2008 16.76770
## 4     2009 15.81957

Using the ave(), the average age value by gradyear can be calculated and assign in a vector which contains same length as the number of row of the dataset.

ave_age <- ave(teens$age, teens$gradyear,
                 FUN = function(x) mean(x, na.rm = TRUE))
head(ave_age)
## [1] 18.65586 18.65586 18.65586 18.65586 18.65586 18.65586

Using and ifelse(), the ave_age vector that is constructed previously can be flexibly added into the original age featue of cells with missing values. For example, the ifelse() claims that if there is a missing value in position of the age feature, assign the average age value of that gradyear group. Otherwise, use the original value stated in the age feature.

teens$age <- ifelse(is.na(teens$age), ave_age, teens$age)

To examine the age feature of the original dataset again, NA values are all disappear because they are replaced by average age value of the respective gradyear group.

summary(teens$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.03   16.28   17.24   17.24   18.21   20.00

Step 3 Model Training for Data

A z-scored normalization is applied to all of the numerical feature to elimiated bias generate for having higher weight calculation for features with larger range.

interests <- teens[5:40]
interests_z <- as.data.frame(lapply(interests, scale))
summary(interests_z[5:10])
##    volleyball         swimming      cheerleading        baseball      
##  Min.   :-0.2237   Min.   :-0.26   Min.   :-0.2073   Min.   :-0.2011  
##  1st Qu.:-0.2237   1st Qu.:-0.26   1st Qu.:-0.2073   1st Qu.:-0.2011  
##  Median :-0.2237   Median :-0.26   Median :-0.2073   Median :-0.2011  
##  Mean   : 0.0000   Mean   : 0.00   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2237   3rd Qu.:-0.26   3rd Qu.:-0.2073   3rd Qu.:-0.2011  
##  Max.   :21.6533   Max.   :59.70   Max.   :17.2911   Max.   :30.4663  
##      tennis            sports       
##  Min.   :-0.1689   Min.   :-0.2971  
##  1st Qu.:-0.1689   1st Qu.:-0.2971  
##  Median :-0.1689   Median :-0.2971  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.1689   3rd Qu.:-0.2971  
##  Max.   :28.8468   Max.   :25.1762

The set.seed() can be used to ensure the result matches each run since the k-means utilized random starting points. Using the kmeans() on the dataset, set the number of cluster to 5, a cluster object is then generated/modeled.

set.seed(2345)
teen_clusters <- kmeans(interests_z, 5)

Step 4: Model Performance Evaluation

There are three parts for the cluster object, the size, center and cluster. For example, the size tells the number of observations being grouped together within each cluster.

teen_clusters$size
## [1]   871   600  5981  1034 21514

The cluster center values shows each of the cluster centroids of the coordinates. The row referes to the five clusters, withi the numebrs across each row indicates the cluster’s average value for the interest listed at the top of the column. Positive values are above the overall mean level and the negative values are below the overall mean for all the teens.

teen_clusters$centers
##    basketball   football      soccer    softball  volleyball    swimming
## 1  0.16001227  0.2364174  0.10385512  0.07232021  0.18897158  0.23970234
## 2 -0.09195886  0.0652625 -0.09932124 -0.01739428 -0.06219308  0.03339844
## 3  0.52755083  0.4873480  0.29778605  0.37178877  0.37986175  0.29628671
## 4  0.34081039  0.3593965  0.12722250  0.16384661  0.11032200  0.26943332
## 5 -0.16695523 -0.1641499 -0.09033520 -0.11367669 -0.11682181 -0.10595448
##   cheerleading    baseball      tennis      sports        cute
## 1    0.3931445  0.02993479  0.13532387  0.10257837  0.37884271
## 2   -0.1101103 -0.11487510  0.04062204 -0.09899231 -0.03265037
## 3    0.3303485  0.35231971  0.14057808  0.32967130  0.54442929
## 4    0.1856664  0.27527088  0.10980958  0.79711920  0.47866008
## 5   -0.1136077 -0.10918483 -0.05097057 -0.13135334 -0.18878627
##            sex        sexy         hot      kissed       dance        band
## 1  0.020042068  0.11740551  0.41389104  0.06787768  0.22780899 -0.10257102
## 2 -0.042486141 -0.04329091 -0.03812345 -0.04554933  0.04573186  4.06726666
## 3  0.002913623  0.24040196  0.38551819 -0.03356121  0.45662534 -0.02120728
## 4  2.028471066  0.51266080  0.31708549  2.97973077  0.45535061  0.38053621
## 5 -0.097928345 -0.09501817 -0.13810894 -0.13535855 -0.15932739 -0.12167214
##      marching      music        rock         god      church       jesus
## 1 -0.10942590  0.1378306  0.05905951  0.03651755 -0.00709374  0.01458533
## 2  5.25757242  0.4981238  0.15963917  0.09283620  0.06414651  0.04801941
## 3 -0.10880541  0.2844999  0.21436936  0.35014919  0.53739806  0.27843424
## 4 -0.02014608  1.1367885  1.21013948  0.41679142  0.16627797  0.12988313
## 5 -0.11098063 -0.1532006 -0.12460034 -0.12144246 -0.15889274 -0.08557822
##         bible        hair       dress      blonde        mall    shopping
## 1 -0.03692278  0.43807926  0.14905267  0.06137340  0.60368108  0.79806891
## 2  0.05863810 -0.04484083  0.07201611 -0.01146396 -0.08724304 -0.03865318
## 3  0.22990963  0.23612853  0.39407628  0.03471458  0.48318495  0.66327838
## 4  0.08478769  2.55623737  0.53852195  0.36134138  0.62256686  0.27101815
## 5 -0.06813159 -0.20498730 -0.14348036 -0.02918252 -0.18625656 -0.22865236
##         clothes  hollister abercrombie          die       death
## 1  0.5651537331  4.1521844  3.96493810  0.043475966  0.09857501
## 2 -0.0003526292 -0.1678300 -0.14129577  0.009447317  0.05135888
## 3  0.3759725120 -0.0553846 -0.07417839  0.037989066  0.11972190
## 4  1.2306917174  0.1610784  0.26324494  1.712181870  0.93631312
## 5 -0.1865419798 -0.1557662 -0.14861104 -0.094875180 -0.08370729
##          drunk       drugs
## 1  0.035614771  0.03443294
## 2 -0.086773220 -0.06878491
## 3 -0.009688746 -0.05973769
## 4  1.897388200  2.73326605
## 5 -0.087520105 -0.11423381

Step 5: Model Performance Improvement

The cluster part of the cluster object is appended into the original dataset as a seperate column. It contains a vector of information for the grouping into particular cluster for each teen. For example, looking at the first 5 observations, its respective calculated cluster grouping is shown.

teens$cluster <- teen_clusters$cluster
teens[1:5, c("cluster", "gender", "age", "friends")]
##   cluster gender    age friends
## 1       5      M 18.982       7
## 2       3      F 18.801       0
## 3       5      M 18.335      69
## 4       5      F 18.875       0
## 5       4   <NA> 18.995      10

A more useful way to look at the analysis result is to look at various feature by clusters. Here, the average age by clusters is being examined. It shows that the avarage age is 16.8 for a cluster 1 and 16.3 for a cluster 5. There’s no meaning for the cluster label by default, but the meaning for the clusters can be assigned by the modeler.

aggregate(data = teens, age ~ cluster, mean)
##   cluster      age
## 1       1 16.86497
## 2       2 17.39037
## 3       3 17.07656
## 4       4 17.11957
## 5       5 17.29849

The proportion of female population by clusters can be seen below using the aggregate(). For exampple, there is about 84% of the population are female belong to cluster 1, while only 70% of population are female in cluster 5.

aggregate(data = teens, female ~ cluster, mean)
##   cluster    female
## 1       1 0.8381171
## 2       2 0.7250000
## 3       3 0.8378198
## 4       4 0.8027079
## 5       5 0.6994515

The average friends number can be examine by clusters using the aggregate(). For example, teens that are belonged to cluster one has an average of 41 firends, while teens belonged to cluster five has an average of 28 friends. The results matches the intuitive thinking that the number of friends seems to be positively correlated to the stereotypy of each cluster’s high school popularity. Since first cluster is pre-defined as ‘princesses’ and cluster 5 is pre-definded as ‘basket cases’.

aggregate(data = teens, friends ~ cluster, mean)
##   cluster  friends
## 1       1 41.43054
## 2       2 32.57333
## 3       3 37.16185
## 4       4 30.50290
## 5       5 27.70052

Conclustion: The social networking services (SNS) data contains 30000 observations of high school students are analyzed using Clustering with k-mean. The purpose of this unsupervised learning is to defined specific finite number of (k) clusters and allow the algorithm to run and assigns each observations to any of the k cluster by their similarity in features. A good cluster model is by minimizing difference within cluster and maximize difference between clusters. Picking a overly large k value may be not useful to see grouping similartiy, while a super small k-value will increase the heterogeniety of each cluster and hard to spot any useful similarity within.