If you are viewing this on GitHub, please visit http://www.rpubs.com/jasonchanhku/segmentation for the full interactive version.
library(DT) #Data Preview
library(Amelia) #Missingness Map
library(plotly) #visualization
library(d3heatmap) #heatmap
The objective of this project is to use information collected from the profiles of teens using a Social Networking Service (SNS) and segment them into appropriate clusters.
datatable() was used to preview the dataset in the following manner:
datatable(head(teens, 300), class = "cell-border stripe")
From the preview above, it is obvious that some values are missing. Hence, using str(), summary(), and missmap() will help gauge the levels and missingness of the dataset.
str(teens)
## 'data.frame': 30000 obs. of 40 variables:
## $ gradyear : int 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
## $ gender : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 1 1 ...
## $ age : num 19 18.8 18.3 18.9 19 ...
## $ friends : int 7 0 69 0 10 142 72 17 52 39 ...
## $ basketball : int 0 0 0 0 0 0 0 0 0 0 ...
## $ football : int 0 1 1 0 0 0 0 0 0 0 ...
## $ soccer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ softball : int 0 0 0 0 0 0 0 1 0 0 ...
## $ volleyball : int 0 0 0 0 0 0 0 0 0 0 ...
## $ swimming : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cheerleading: int 0 0 0 0 0 0 0 0 0 0 ...
## $ baseball : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tennis : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sports : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cute : int 0 1 0 1 0 0 0 0 0 1 ...
## $ sex : int 0 0 0 0 1 1 0 2 0 0 ...
## $ sexy : int 0 0 0 0 0 0 0 1 0 0 ...
## $ hot : int 0 0 0 0 0 0 0 0 0 1 ...
## $ kissed : int 0 0 0 0 5 0 0 0 0 0 ...
## $ dance : int 1 0 0 0 1 0 0 0 0 0 ...
## $ band : int 0 0 2 0 1 0 1 0 0 0 ...
## $ marching : int 0 0 0 0 0 1 1 0 0 0 ...
## $ music : int 0 2 1 0 3 2 0 1 0 1 ...
## $ rock : int 0 2 0 1 0 0 0 1 0 1 ...
## $ god : int 0 1 0 0 1 0 0 0 0 6 ...
## $ church : int 0 0 0 0 0 0 0 0 0 0 ...
## $ jesus : int 0 0 0 0 0 0 0 0 0 2 ...
## $ bible : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hair : int 0 6 0 0 1 0 0 0 0 1 ...
## $ dress : int 0 4 0 0 0 1 0 0 0 0 ...
## $ blonde : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mall : int 0 1 0 0 0 0 2 0 0 0 ...
## $ shopping : int 0 0 0 0 2 1 0 0 0 1 ...
## $ clothes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hollister : int 0 0 0 0 0 0 2 0 0 0 ...
## $ abercrombie : int 0 0 0 0 0 0 0 0 0 0 ...
## $ die : int 0 0 0 0 0 0 0 0 0 0 ...
## $ death : int 0 0 1 0 0 0 0 0 0 0 ...
## $ drunk : int 0 0 0 0 1 1 0 0 0 0 ...
## $ drugs : int 0 0 0 0 1 0 0 0 0 0 ...
The dataset has 40 variables. The first 4 describe personal characteristics and the remaining 36 record interests.
missmap(teens)
From the missmap(), fortunately, only two columns, age and gender, have missing values that need to be dealt with.
Since gender has NA values, it is good to visualize the proportion of females to males before dealing with imputation later on.
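One possible sketch of such a visualization, using the already-loaded plotly package (the original chunk is not shown, so the exact call is an assumption):

# Bar chart of gender counts, showing missing values explicitly
gender_counts <- table(teens$gender, useNA = "ifany")
gender_labels <- ifelse(is.na(names(gender_counts)), "Missing", names(gender_counts))
plot_ly(x = gender_labels, y = as.numeric(gender_counts), type = "bar") %>%
  layout(title = "Proportion of Females to Males")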
As the market segmentation is only meant for teens, it is important to check for any potential outliers in the age column.
summary(teens$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.086 16.310 17.290 17.990 18.260 106.900 5086
As the maximum age is 106.9, that value is clearly an outlier, and a boxplot would help reveal whether there are any more non-teens in this dataset. Besides, there are also 5086 missing values in the age column that must be dealt with.
plot_ly(teens, y=~age, type = "box")%>%layout(title = "Spotting Outliers")
## Warning: Ignoring 5086 observations
Now, it is obvious that:

- There are many more females than males in the dataset, suggesting that SNS usage is more frequent among females than males.
- About 9% of the dataset has missing gender values.
- About 17% of the dataset has missing age values.
- Outliers are present in the age column.
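A quick sanity check of these proportions (a minimal sketch, not part of the original code):

# Share of rows with missing gender (about 9%) and missing age (about 17%)
mean(is.na(teens$gender))
mean(is.na(teens$age))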
A reasonable age range for teenagers would be 13 to 20 years old. Any age outside that range should be treated as missing data, as it cannot be trusted.
teens$age <- ifelse(teens$age >= 13 & teens$age <= 20, teens$age, NA)
A summary of the age column after recoding out-of-range values to NA:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 13.03 16.30 17.26 17.25 18.22 20.00 5523
Unfortunately, the age column now has 5523 missing values. These missing values can't simply be removed, as they represent about 26% of the dataset (with gender NAs added).
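As a rough check (again a sketch rather than the original code):

# Share of rows missing age or gender after the recode (close to the quoted 26%)
mean(is.na(teens$age) | is.na(teens$gender))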
Before imputation, it is useful to dummy code those entries without gender. This is because the K-means clustering algorithm can only take in numeric data.
teens$female <- ifelse(teens$gender == "F" & !is.na(teens$gender), 1, 0)
teens$nogender <- ifelse(is.na(teens$gender), 1 , 0)
To double check:
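The counts below can be reproduced with table() calls along these lines (the exact chunk is not shown in the original):

# Cross-check the original gender factor against the new dummy variables
table(teens$gender, useNA = "ifany")
table(teens$female)
table(teens$nogender)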
##
## F M <NA>
## 22054 5222 2724
##
## 0 1
## 7946 22054
##
## 0 1
## 27276 2724
To impute the age values, the best bet is to refer to the graduation year, as teens in consecutive graduation years usually differ in age by about one calendar year. Using aggregate(), the following mean ages are obtained:
avg_age <- ave(teens$age, teens$gradyear, FUN = function(x) mean(x, na.rm = TRUE))
aggregate(data = teens, age ~ gradyear, mean, na.rm = TRUE)
## gradyear age
## 1 2006 18.65586
## 2 2007 17.70617
## 3 2008 16.76770
## 4 2009 15.81957
ave() returns a vector containing the mean age for each teen's graduation year. This vector can be used to impute the missing age values.
# If age is missing, use the average age for that teen's graduation year; otherwise keep the original value
teens$age <- ifelse(is.na(teens$age), avg_age, teens$age)
After the pre-processing, this is what the summary looks like:
summary(teens$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.03 16.28 17.24 17.24 18.21 20.00
Remember that the k-means algorithm can only take in numeric values, and it is sensitive to features measured on different scales. Therefore the interest columns need to be subsetted and z-score normalized using scale().
#try with the first 36 features
interests <- teens[5:40]
#z-score normalization
interests_z <- as.data.frame(lapply(interests, scale))
As the rule of thumb goes, \(k \approx \sqrt{n/2}\), where n is here taken to be the number of features. Hence, with 36 features, a good starting value for k is 5.
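For reference, this can be checked directly:

sqrt(36 / 2)  # approximately 4.24, rounded up to k = 5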
#to ensure results are reproducible
set.seed(2345)
teen_clusters <- kmeans(interests_z, 5)
To evaluate model performance, the size of each cluster shall be examined, along with which individuals belong to each cluster.
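The cluster sizes can be inspected from the fitted kmeans object (a minimal sketch; the original chunk is hidden):

# Number of teens assigned to each of the 5 clusters
teen_clusters$size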
Surprisingly, 21514 entries fall into the 5th cluster, which already represents about 72% of the data. Hence, each cluster must be looked into carefully. This could be due to a disparity between clusters, or perhaps a large group of teens simply share similar interests.
The centroids of the clusters are examined next.
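They are stored in the fitted kmeans object (again, the original chunk is hidden):

# Cluster centroids: mean z-score of each interest within each cluster
teen_clusters$centers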
As it is difficult to tell groupings from a table of 36 features, a heatmap using d3heatmap would provide more visual insight, and labelling can be performed from there.
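A minimal sketch of such a heatmap, assuming the default dendrogram settings are acceptable:

# Interactive heatmap of the centroids; scaling by column makes interests comparable
d3heatmap(teen_clusters$centers, scale = "column", colors = "Blues")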
Based on the heatmap, the top 5 most conspicuous features of each cluster are grouped as follows:
Based on the information above, a suitable labelling would perhaps be the following:
The cluster assignments, together with the new labels, can be inserted back into the dataset to make them useful. With this in place, how the cluster assignment relates to individual characteristics can be examined.
teens$cluster <- teen_clusters$cluster
The features chosen are friends, age, and female.
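One possible sketch of this comparison, reusing aggregate() as earlier in the analysis (the exact chunk is not shown in the original):

# Mean demographic characteristics by cluster
aggregate(data = teens, age ~ cluster, mean)
aggregate(data = teens, female ~ cluster, mean)
aggregate(data = teens, friends ~ cluster, mean)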
The following findings are based on the labelled data after performing k-means: