If you are viewing this on Github, please kindly visit http://www.rpubs.com/jasonchanhku/segmentation for the full interactive version.

Libraries Used

library(DT) #Data Preview
library(Amelia) #Missingness Map
library(plotly) #visualization
library(d3heatmap) #heatmap


Objective

The objective of this project is to use information from collected profiles of teens using Social Networking Service (SNS) and segment them according to appropriate clusters.


Step 1: Data Exploration

Dataset Preview (Pre-processed)

datatable() was used to preview the dataset in the following manner:

datatable(head(teens, 300), class = "cell-border stripe")

Structure and Missingness map

From the preview above, it is obvious that some values are missing. Hence, using str(), summary(), and missmap() will help gauge the levels and missingness of the dataset.

str(teens)
## 'data.frame':    30000 obs. of  40 variables:
##  $ gradyear    : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
##  $ gender      : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 1 1 ...
##  $ age         : num  19 18.8 18.3 18.9 19 ...
##  $ friends     : int  7 0 69 0 10 142 72 17 52 39 ...
##  $ basketball  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ football    : int  0 1 1 0 0 0 0 0 0 0 ...
##  $ soccer      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ softball    : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ volleyball  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ swimming    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cheerleading: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ baseball    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tennis      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sports      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cute        : int  0 1 0 1 0 0 0 0 0 1 ...
##  $ sex         : int  0 0 0 0 1 1 0 2 0 0 ...
##  $ sexy        : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ hot         : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ kissed      : int  0 0 0 0 5 0 0 0 0 0 ...
##  $ dance       : int  1 0 0 0 1 0 0 0 0 0 ...
##  $ band        : int  0 0 2 0 1 0 1 0 0 0 ...
##  $ marching    : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ music       : int  0 2 1 0 3 2 0 1 0 1 ...
##  $ rock        : int  0 2 0 1 0 0 0 1 0 1 ...
##  $ god         : int  0 1 0 0 1 0 0 0 0 6 ...
##  $ church      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ jesus       : int  0 0 0 0 0 0 0 0 0 2 ...
##  $ bible       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hair        : int  0 6 0 0 1 0 0 0 0 1 ...
##  $ dress       : int  0 4 0 0 0 1 0 0 0 0 ...
##  $ blonde      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mall        : int  0 1 0 0 0 0 2 0 0 0 ...
##  $ shopping    : int  0 0 0 0 2 1 0 0 0 1 ...
##  $ clothes     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hollister   : int  0 0 0 0 0 0 2 0 0 0 ...
##  $ abercrombie : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ die         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ death       : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ drunk       : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ drugs       : int  0 0 0 0 1 0 0 0 0 0 ...

Seems like our dataset has close to 40 variable. The first 4 indicates personal characteristics and the other 36 indicates interests.

missmap(teens)

From the missmap(), fortunately, only two columns, age and gender have missing values and need to be dealt with.

Visualization

Since the age columns has several NA values, it is good to visualize the proportion of Females to Males to deal with imputation later on.

Outlier

As the market segmentation is only meant for teens, it is important to check for any potential outliers in the age columns

summary(teens$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   3.086  16.310  17.290  17.990  18.260 106.900    5086

As the Max age is a 106.9, it is obvious that that is an outlier and a boxplot would be good to see if there are any more non-teens in this dataset. Besides, there are also 5086 missing values in the age column that must be dealt with.

plot_ly(teens, y=~age, type = "box")%>%layout(title = "Spotting Outliers")
## Warning: Ignoring 5086 observations

Now, it is obvious that:

There are much more females than males in the dataset. This imply the usage of SNS is more frequently by females than males. About 9% of the dataset consists of missing gender data. 17% of the dataset have missing age values. Outliers are present in the age column of the dataset.


Step 2: Data Preprocessing

Subset Reasonable “Teenage” Age Range

A reasonable age group for teenage would be 13 to 20 years old. Anything other than that should be as good as a missing data, as the age data cannot be trusted.

teens$age <- ifelse(teens$age >= 13 & teens$age <= 20, teens$age, NA)

summary of the dataset after subset:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.03   16.30   17.26   17.25   18.22   20.00    5523

Unfortunately, the dataset presents itself with 5523 missing values. However, these missing values can’t be removed as it represents about 26% of the dataset (with gender NAs added)

Dummy Coding

Before imputation, it is useful to dummy code those entries without gender. This is because the K-means clustering algorithm can only take in numeric data.

teens$female <- ifelse(teens$gender == "F" & !is.na(teens$gender), 1, 0)
teens$nogender <- ifelse(is.na(teens$gender), 1 , 0)

To double check:

## 
##     F     M  <NA> 
## 22054  5222  2724
## 
##     0     1 
##  7946 22054
## 
##     0     1 
## 27276  2724

Imputation of Age

To impute the age values, the best bet is to refer to the graduation year, as they would usually just differ by a calender year. By using aggregate(), the following mean ages are obtained:

avg_age <- ave(teens$age, teens$gradyear, FUN = function(x) mean(x, na.rm = TRUE))
aggregate(data = teens, age ~ gradyear, mean, na.rm = TRUE)
##   gradyear      age
## 1     2006 18.65586
## 2     2007 17.70617
## 3     2008 16.76770
## 4     2009 15.81957

ave helps return vector of the mean age for given each teen’s graduation year. This vector can be used to impute in those missing values for age.

#This implies, if age is missing, return the avg_age position based on that entry, else, return itself
teens$age <- ifelse(is.na(teens$age), avg_age, teens$age)

After the pre-processing, this is what the summary looks like:

summary(teens$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.03   16.28   17.24   17.24   18.21   20.00


Step 3: Model Training

Remember that the k-means algorithm can only take in normalized numeric values. Therefore the current dataset needs to be subsetted and normalized (using scale()).

#try with the first 36 features
interests <- teens[5:40]

#z-score normalization
interests_z <- as.data.frame(lapply(interests, scale))

Choosing the value of k

As the rule of thumb goes, k should be approximately \(k = sqrt(n/2)\), where n is the number of features. Hence as there are 36 features, a good start of k would be 5.

#to ensure results are reproduceable 
set.seed(2345)

teen_clusters <- kmeans(interests_z, 5)


Step4: Evaluating Model Performance

To evaluate model performance, the size of each cluster shall we examined and also which individuals belong to each cluster.

Clustering Results

Surprisingly, 21514 entries fall into the 5th cluster. This already represents 72% of the data. Hence, each cluster must be looked into carefully. This could be due to a cluster disparity or perhaps a large group of teens share those similar interests.

Cluster Examination

The centroids of the clusters are examined here

As it is difficult to tell groupings from the table of 36 features, a heatmap using d3heatmap would provide more insights visually and we can perform labelling from there.

Based on the heatmap, the following conspicious top 5 features are grouped into the following clusters:

Based on the information above, a suitable labelling would perhaps be the following:


Step 5: Improving Model Performance

With the new labels, it can be inserted back into the dataset to be useful. With the new labels, how the cluster assignment relates to individual characteristics can be examined.

teens$cluster <- teen_clusters$cluster

Cluster Relationship to Features

The features chosen are friends, age, and female

Friends

Age

Female

Findings

The following findings are based on post labelling of the data after performing k-means:

  • Princesses have the most friends. This tallies with the stereotype of clusters in highschool where the more popular groups have more friends.
  • Princesses have the most proportion of females at 84%. Recall that overall there are 74% females in the dataset. While musicans and basket class only contain around 70% females. This shows the disparity in discussed interests between males and females.