For this analysis we will using SNS data collected from teenagers to apply a cluster algorithm to identify groups that share common interests. The data represents a random sample of 30,000 U.S. high school studdents who used a SNS website in 2006. The dataset indicates for each person, how many times a each word appeared in the student’s profile.
Below we loaded the data into R and use the str() function to take a qucik look at the data.
## Example: Finding Teen Market Segments ----
## Step 2: Exploring and preparing the data ----
teens <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml12/snsdata.csv")
str(teens)
'data.frame': 30000 obs. of 40 variables:
$ gradyear : int 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
$ gender : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 1 1 ...
$ age : num 19 18.8 18.3 18.9 19 ...
$ friends : int 7 0 69 0 10 142 72 17 52 39 ...
$ basketball : int 0 0 0 0 0 0 0 0 0 0 ...
$ football : int 0 1 1 0 0 0 0 0 0 0 ...
$ soccer : int 0 0 0 0 0 0 0 0 0 0 ...
$ softball : int 0 0 0 0 0 0 0 1 0 0 ...
$ volleyball : int 0 0 0 0 0 0 0 0 0 0 ...
$ swimming : int 0 0 0 0 0 0 0 0 0 0 ...
$ cheerleading: int 0 0 0 0 0 0 0 0 0 0 ...
$ baseball : int 0 0 0 0 0 0 0 0 0 0 ...
$ tennis : int 0 0 0 0 0 0 0 0 0 0 ...
$ sports : int 0 0 0 0 0 0 0 0 0 0 ...
$ cute : int 0 1 0 1 0 0 0 0 0 1 ...
$ sex : int 0 0 0 0 1 1 0 2 0 0 ...
$ sexy : int 0 0 0 0 0 0 0 1 0 0 ...
$ hot : int 0 0 0 0 0 0 0 0 0 1 ...
$ kissed : int 0 0 0 0 5 0 0 0 0 0 ...
$ dance : int 1 0 0 0 1 0 0 0 0 0 ...
$ band : int 0 0 2 0 1 0 1 0 0 0 ...
$ marching : int 0 0 0 0 0 1 1 0 0 0 ...
$ music : int 0 2 1 0 3 2 0 1 0 1 ...
$ rock : int 0 2 0 1 0 0 0 1 0 1 ...
$ god : int 0 1 0 0 1 0 0 0 0 6 ...
$ church : int 0 0 0 0 0 0 0 0 0 0 ...
$ jesus : int 0 0 0 0 0 0 0 0 0 2 ...
$ bible : int 0 0 0 0 0 0 0 0 0 0 ...
$ hair : int 0 6 0 0 1 0 0 0 0 1 ...
$ dress : int 0 4 0 0 0 1 0 0 0 0 ...
$ blonde : int 0 0 0 0 0 0 0 0 0 0 ...
$ mall : int 0 1 0 0 0 0 2 0 0 0 ...
$ shopping : int 0 0 0 0 2 1 0 0 0 1 ...
$ clothes : int 0 0 0 0 0 0 0 0 0 0 ...
$ hollister : int 0 0 0 0 0 0 2 0 0 0 ...
$ abercrombie : int 0 0 0 0 0 0 0 0 0 0 ...
$ die : int 0 0 0 0 0 0 0 0 0 0 ...
$ death : int 0 0 1 0 0 0 0 0 0 0 ...
$ drunk : int 0 0 0 0 1 1 0 0 0 0 ...
$ drugs : int 0 0 0 0 1 0 0 0 0 0 ...
# look at missing data for female variable
#table(teens$gender)
table(teens$gender, useNA = "ifany")
F M <NA>
22054 5222 2724
# look at missing data for age variable
summary(teens$age)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
3.086 16.312 17.287 17.994 18.259 106.927 5086
From the exploratory analysis from above we notice that variables gender and age both have missing values. Also, we notice that there are over four times as many females as males in the data, suggesting that males are not inclined as much as females to use the SNS websites. Another interesting note is that when looking at age, we see that minium age is 3 and the max age is 106 which does not seem to make sence since we are only looking a high school students. We can address this issue but removing all users that are below the age of 13 and greater than 20.
# eliminate age outliers
teens$age <- ifelse(teens$age >= 13 & teens$age < 20,
teens$age, NA)
summary(teens$age)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
13.03 16.30 17.27 17.25 18.22 20.00 5523
To address the issue of the NAs within the age variable, I will create a dummy variable as well as creating a third category unknown gender. This method is another way of dealing with NA values instead of removing the values and all the information they hold.
# reassign missing gender values to "unknown"
teens$female <- ifelse(teens$gender == "F" &
!is.na(teens$gender), 1, 0)
teens$no_gender <- ifelse(is.na(teens$gender), 1, 0)
# check our recoding work
table(teens$gender, useNA = "ifany")
F M <NA>
22054 5222 2724
table(teens$female, useNA = "ifany")
0 1
7946 22054
table(teens$no_gender, useNA = "ifany")
0 1
27276 2724
Now that we have taken care of the gender variable, we need to address the NAs regarding the age variable. We will use a method known as imputation, which involves filling in the missing data with a guess as to the true value. We see that from the data, the average age of a student is 17.2373261. But since this data was collected from different graduation classes, we need to get tha aggreated mean for each year. We next will input these aggergate means back into the orginial data where there are NAs for that cohort.
# age by cohort
aggregate(data = teens, age ~ gradyear, mean, na.rm = TRUE)
# create a vector with the average age for each gradyear, repeated by person
ave_age <- ave(teens$age, teens$gradyear,
FUN = function(x) mean(x, na.rm = TRUE))
teens$age <- ifelse(is.na(teens$age), ave_age, teens$age)
# check the summary results to ensure missing values are eliminated
summary(teens$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.03 16.28 17.24 17.24 18.21 20.00
From the summary above we see that all the NA values have been addressed. Now we can dive into the analysis. We will be using the k-means clustering algorithm. We will be training data with only the 36 feautes that represent the number of various interests appeated on their profiles. Also we will need to standardize the features so that they have a mean of zero and standard deviation of one. And finally I decided that the model would be using 5 clusters to segment the data.
## Step 3: Training a model on the data ----
interests <- teens[5:40]
interests_z <- as.data.frame(lapply(interests, scale))
set.seed(2345)
teen_clusters <- kmeans(interests_z, 5)
Let’s now evaluate the model’s performance.
## Step 4: Evaluating model performance ----
# look at the size of the clusters
#install.packages("cluster")
library(cluster)
clusplot(interests_z, teen_clusters$cluster)
#teen_clusters$size
# look at the cluster centers
#teen_clusters$centers
Let’s see if we can improve the model.
## Step 5: Improving model performance ----
# apply the cluster IDs to the original data frame
teens$cluster <- teen_clusters$cluster
# look at the first five records
teens[1:5, c("cluster", "gender", "age", "friends")]
# mean age by cluster
aggregate(data = teens, age ~ cluster, mean)
# proportion of females by cluster
aggregate(data = teens, female ~ cluster, mean)
# mean number of friends by cluster
aggregate(data = teens, friends ~ cluster, mean)