Introduction

This project is from Book: Machine learning with R by Brett Lantz, chapter 9.

A link to the book https://bit.ly/3gsf2e0

This project is for educational purpose only.

The aim is by given the dataset of teenagers Social Network service information , we can identify groups that share common interests such as sports, religion, or music. Clustering can automate the process of discovering the natural segmentation in the population.

Required packages

we will use stats package

library(stats)

Setp 1 - collecting data

The dataset represents a random sample of 30,000 US high school students who had a profile on a well-know SNS in 2006. the data was sampled evenly across four high school graduation years(2016 through to 2009)

Step 2 - exploring and preparing the data

teens <- read.csv("snsdata.csv", stringsAsFactors = TRUE)

#Explore structure of the data frame 
str(teens)
## 'data.frame':    30000 obs. of  40 variables:
##  $ gradyear    : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
##  $ gender      : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 1 1 ...
##  $ age         : num  19 18.8 18.3 18.9 19 ...
##  $ friends     : int  7 0 69 0 10 142 72 17 52 39 ...
##  $ basketball  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ football    : int  0 1 1 0 0 0 0 0 0 0 ...
##  $ soccer      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ softball    : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ volleyball  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ swimming    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cheerleading: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ baseball    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tennis      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sports      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cute        : int  0 1 0 1 0 0 0 0 0 1 ...
##  $ sex         : int  0 0 0 0 1 1 0 2 0 0 ...
##  $ sexy        : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ hot         : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ kissed      : int  0 0 0 0 5 0 0 0 0 0 ...
##  $ dance       : int  1 0 0 0 1 0 0 0 0 0 ...
##  $ band        : int  0 0 2 0 1 0 1 0 0 0 ...
##  $ marching    : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ music       : int  0 2 1 0 3 2 0 1 0 1 ...
##  $ rock        : int  0 2 0 1 0 0 0 1 0 1 ...
##  $ god         : int  0 1 0 0 1 0 0 0 0 6 ...
##  $ church      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ jesus       : int  0 0 0 0 0 0 0 0 0 2 ...
##  $ bible       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hair        : int  0 6 0 0 1 0 0 0 0 1 ...
##  $ dress       : int  0 4 0 0 0 1 0 0 0 0 ...
##  $ blonde      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mall        : int  0 1 0 0 0 0 2 0 0 0 ...
##  $ shopping    : int  0 0 0 0 2 1 0 0 0 1 ...
##  $ clothes     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hollister   : int  0 0 0 0 0 0 2 0 0 0 ...
##  $ abercrombie : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ die         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ death       : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ drunk       : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ drugs       : int  0 0 0 0 1 0 0 0 0 0 ...

We can notice in gender there are some students didn’t reveal their gender.

table(teens$gender, useNA = "ifany")
## 
##     F     M  <NA> 
## 22054  5222  2724

We will use dummy variables to split Female, Male, and no_gender so we don’t bias the model

We can explore age also

summary(teens$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   3.086  16.312  17.287  17.994  18.259 106.927    5086

Of course the data need cleaning, we will consider ages below 12 and above 20 as NA

teens$age <- ifelse (teens$age >= 13 & teens$age < 20, teens$age, NA)

Check now

summary(teens$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.03   16.30   17.27   17.25   18.22   20.00    5523

Dummy preparation - dummy coding missing values

We will create a binary values dummy variables for each level of a nominal feature expect one, which is held out to serve as the reference group.

teens$female <- ifelse(teens$gender == "F" & !is.na(teens$gender), 1, 0)

teens$no_gender <- ifelse(is.na(teens$gender), 1, 0)

table(teens$gender, useNA = "ifany")
## 
##     F     M  <NA> 
## 22054  5222  2724
table(teens$female, useNA = "ifany")
## 
##     0     1 
##  7946 22054
table(teens$no_gender, useNA = "ifany")
## 
##     0     1 
## 27276  2724

We can see now the number of TRUEs in Female and no_gender dummy variables are matching female and NAs in the gender column.

Data preparation - imputing the missing values

mean(teens$age, na.rm = TRUE)
## [1] 17.25243

To explore students by graduation year so we can impute data accurately.

aggregate(data = teens, age ~ gradyear, mean, na.rm = TRUE)
##   gradyear      age
## 1     2006 18.65586
## 2     2007 17.70617
## 3     2008 16.76770
## 4     2009 15.81957

Now we can see the mean average foe each graduation year, so we can impute the data for NAs. we will use ave function.

avg_age <- ave(teens$age, teens$gradyear, FUN = function(x) mean(x, na.rm = TRUE))

#To impute the data in the age column

teens$age <- ifelse(is.na(teens$age), avg_age, teens$age)

summary(teens$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.03   16.28   17.24   17.24   18.21   20.00

Step 3 - training a model on the data

As the Kmeans() function requires a data frame containing only numeric data and a parameter specifying the desired number of clusters, we will create a data frame containing only these features

intersets <- teens[5:40]

We will normalize the data using z-score standardization

intersets_z <- as.data.frame(lapply(intersets, scale))

#To confirm transformation

summary(intersets$basketball)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2673  0.0000 24.0000

After transformation

summary(intersets_z$basketball)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.3322 -0.3322 -0.3322  0.0000 -0.3322 29.4923

We can the transformation lead to mean = zero and a range that spans above and below zero.

We have a clusters coming from the breakfast club movie from 1985, there are identified 5 terms , a brain, an athlete, a basket case, a prices, and a criminal.

We will use set seed to ensure that the results matches the output when we run the program again

RNGversion("3.5.2")
## Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-uniform
## 'Rounding' sampler used
set.seed(2345)
teen_clusters <- kmeans(intersets_z, 5)

Step 4 - evaluating model performance

#Examine the number of examples falling in each of each of the groups. If the groups are too large or too small, then they are not likely to be very useful

teen_clusters$size
## [1]   871   600  5981  1034 21514

For more in-depth look at the clusters, we can examine the coordinates of the cluster centroides.

teen_clusters$centers
##    basketball   football      soccer    softball  volleyball    swimming
## 1  0.16001227  0.2364174  0.10385512  0.07232021  0.18897158  0.23970234
## 2 -0.09195886  0.0652625 -0.09932124 -0.01739428 -0.06219308  0.03339844
## 3  0.52755083  0.4873480  0.29778605  0.37178877  0.37986175  0.29628671
## 4  0.34081039  0.3593965  0.12722250  0.16384661  0.11032200  0.26943332
## 5 -0.16695523 -0.1641499 -0.09033520 -0.11367669 -0.11682181 -0.10595448
##   cheerleading    baseball      tennis      sports        cute          sex
## 1    0.3931445  0.02993479  0.13532387  0.10257837  0.37884271  0.020042068
## 2   -0.1101103 -0.11487510  0.04062204 -0.09899231 -0.03265037 -0.042486141
## 3    0.3303485  0.35231971  0.14057808  0.32967130  0.54442929  0.002913623
## 4    0.1856664  0.27527088  0.10980958  0.79711920  0.47866008  2.028471066
## 5   -0.1136077 -0.10918483 -0.05097057 -0.13135334 -0.18878627 -0.097928345
##          sexy         hot      kissed       dance        band    marching
## 1  0.11740551  0.41389104  0.06787768  0.22780899 -0.10257102 -0.10942590
## 2 -0.04329091 -0.03812345 -0.04554933  0.04573186  4.06726666  5.25757242
## 3  0.24040196  0.38551819 -0.03356121  0.45662534 -0.02120728 -0.10880541
## 4  0.51266080  0.31708549  2.97973077  0.45535061  0.38053621 -0.02014608
## 5 -0.09501817 -0.13810894 -0.13535855 -0.15932739 -0.12167214 -0.11098063
##        music        rock         god      church       jesus       bible
## 1  0.1378306  0.05905951  0.03651755 -0.00709374  0.01458533 -0.03692278
## 2  0.4981238  0.15963917  0.09283620  0.06414651  0.04801941  0.05863810
## 3  0.2844999  0.21436936  0.35014919  0.53739806  0.27843424  0.22990963
## 4  1.1367885  1.21013948  0.41679142  0.16627797  0.12988313  0.08478769
## 5 -0.1532006 -0.12460034 -0.12144246 -0.15889274 -0.08557822 -0.06813159
##          hair       dress      blonde        mall    shopping       clothes
## 1  0.43807926  0.14905267  0.06137340  0.60368108  0.79806891  0.5651537331
## 2 -0.04484083  0.07201611 -0.01146396 -0.08724304 -0.03865318 -0.0003526292
## 3  0.23612853  0.39407628  0.03471458  0.48318495  0.66327838  0.3759725120
## 4  2.55623737  0.53852195  0.36134138  0.62256686  0.27101815  1.2306917174
## 5 -0.20498730 -0.14348036 -0.02918252 -0.18625656 -0.22865236 -0.1865419798
##    hollister abercrombie          die       death        drunk       drugs
## 1  4.1521844  3.96493810  0.043475966  0.09857501  0.035614771  0.03443294
## 2 -0.1678300 -0.14129577  0.009447317  0.05135888 -0.086773220 -0.06878491
## 3 -0.0553846 -0.07417839  0.037989066  0.11972190 -0.009688746 -0.05973769
## 4  0.1610784  0.26324494  1.712181870  0.93631312  1.897388200  2.73326605
## 5 -0.1557662 -0.14861104 -0.094875180 -0.08370729 -0.087520105 -0.11423381

Step 5 - improving model performance

We can add the clusters column to the original data frame

teens$cluster <- teen_clusters$cluster

#Now we can examine how the cluster assignment relates to individual characteristics.

teens[1:5, c("cluster", "gender", "age", "friends")]
##   cluster gender    age friends
## 1       5      M 18.982       7
## 2       3      F 18.801       0
## 3       5      M 18.335      69
## 4       5      F 18.875       0
## 5       4   <NA> 18.995      10

Using aggregate() function, we can also look at the demographic characteristics of the clusters

aggregate(data = teens, age ~ cluster, mean)
##   cluster      age
## 1       1 16.86497
## 2       2 17.39037
## 3       3 17.07656
## 4       4 17.11957
## 5       5 17.29849

We can explore also clusters by female, and average number of friends.

aggregate(data = teens, female ~ cluster, mean)
##   cluster    female
## 1       1 0.8381171
## 2       2 0.7250000
## 3       3 0.8378198
## 4       4 0.8027079
## 5       5 0.6994515

Cluster 1 : princesses Cluster 2 : Brains Cluster 3 : Criminals Cluster 4 : Athletes Cluster 5 : Basket Cases

aggregate(data = teens, friends ~ cluster, mean)
##   cluster  friends
## 1       1 41.43054
## 2       2 32.57333
## 3       3 37.16185
## 4       4 30.50290
## 5       5 27.70052

We can see now Princesses are have most friends. but the basket cases have the most number of friends

Summary

Clustering is a non-supervised learning approach, using k-means algorithm as a simple algorithm for clustering. we can see how we used kmeans() function from stats package to perform the clustering on the data set.