Cluster Analysis: Finding Trends and Market Segments

Author: Luca Perer
Date: May 4, 2014

Assignment Goal: This assignment looks at a dataset of teenagers' behaviour on social media websites and segments the users to identify market segments.

Q1) What are the sizes of the five clusters?

Size of Clusters
I ran the data through the kmeans function to see how our 1000 participants fit into 5 clusters. As can be seen below, the teenagers are distributed very unevenly across the clusters.

Cluster 5 is by far the largest, with 842 teenagers. The other clusters are much smaller; cluster 3 has only 27 teenagers.

## [1]  47  36  27  48 842
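The call that produced these sizes is not shown, so the following is only a minimal sketch of how such a result could be obtained. The `interests` data frame is simulated stand-in data with hypothetical column names, and `nstart` and `iter.max` are my own choices, so the sizes it prints will not match the output above.

```r
# Stand-in data: 1000 simulated teenagers and four interest counts.
# The real assignment data is not reproduced here; column names are hypothetical.
set.seed(42)
interests <- data.frame(
  soccer = rpois(1000, 0.3),
  music  = rpois(1000, 0.8),
  church = rpois(1000, 0.3),
  mall   = rpois(1000, 0.4)
)

# Standardise every column to z-scores so that no single interest
# dominates the Euclidean distances used by k-means.
interests_z <- as.data.frame(scale(interests))

# k-means with 5 centres; nstart reruns the algorithm from several
# random starts and keeps the solution with the lowest within-cluster SS.
teen_clusters <- kmeans(interests_z, centers = 5, nstart = 10, iter.max = 50)

# Number of teenagers assigned to each cluster.
teen_clusters$size
```

Because k-means starts from random centres, fixing the seed is what makes the cluster sizes reproducible between runs.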

Dendrogram
This Dendrogram shows us the distribution of teenagers. This allows us to:

distances = dist(interests_z, method = "euclidean")
ClusterTeenagers = hclust(distances, method = "ward.D")  # "ward" was renamed "ward.D" in R 3.1.0
plot(ClusterTeenagers)

[Figure: dendrogram of ClusterTeenagers]

The dendrogram is then cut into five clusters. The hierarchical output sorts the teenagers into differently sized clusters:

clusterGroups = cutree(ClusterTeenagers, k = 5)
table(clusterGroups)
## clusterGroups
##   1   2   3   4   5 
## 633 115  77 109  66
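Since the hierarchical cut and the k-means run give different cluster sizes, a cross-tabulation of the two assignments can show where they agree. The sketch below uses simulated stand-in data (the matrix and its column names are hypothetical), so its counts say nothing about the real teenagers.

```r
# Stand-in data: 1000 simulated teenagers, four standardised interests.
set.seed(42)
interests_z <- scale(matrix(rnorm(1000 * 4), ncol = 4,
                            dimnames = list(NULL, c("soccer", "music", "church", "mall"))))

# k-means assignment (one cluster label per teenager).
kmeans_groups <- kmeans(interests_z, centers = 5, nstart = 10)$cluster

# Hierarchical assignment: Ward linkage on Euclidean distances, cut at k = 5.
distances <- dist(interests_z, method = "euclidean")
hier_groups <- cutree(hclust(distances, method = "ward.D"), k = 5)

# Rows are k-means clusters, columns are hierarchical clusters.
comparison <- table(kmeans = kmeans_groups, hierarchical = hier_groups)
comparison
```

Cluster numbers are arbitrary labels in both methods, so agreement shows up as one large cell per row rather than as counts on the diagonal.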

Q2) Describe the segment by their interests. Give a name to each segment.

Using the kmeans function, we can see which interests characterise each cluster. This allows me to name each cluster and build a profile of who these teenagers are.

Segment 1 (Jersey Shore)

Cluster one has an almost equal representation of males and females. As a result, the top interests seem to represent two profiles. The most significant interests are:

Though they have smaller means, other interests include:

As a result, this segment seems to like to play soccer, dress well, listen to music and contemplate death.

Segment 2 (Rebels)

Segment 3 (Shoppers)

Segment 4 (Religious)

Segment 5 (Girls)

This group has a very large number of teenagers in it. As a result, the deviations from the mean are very small, and it becomes difficult to ascertain which interests differentiate the segment. Because there are so many girls in this segment, I was not surprised to see that the most relevant interests just happened to identify them as girls.

##   basketball football soccer softball volleyball swimming cheerleading
## 1       0.19     0.01   0.37    -0.10       0.01     0.14        -0.02
## 2       0.82     0.35   0.78    -0.01       0.09     0.81        -0.05
## 3       0.53     0.58  -0.02     0.28      -0.03     0.16         2.17
## 4       0.66     0.23  -0.08     1.35       1.09     0.08        -0.16
## 5      -0.16    -0.08  -0.09    -0.12      -0.10    -0.07        -0.15
##   baseball tennis sports  cute   sex  sexy   hot kissed dance  band
## 1    -0.10   0.36  -0.02  0.90  0.06  0.34  0.57   0.01  0.83 -0.01
## 2     0.37   0.17   0.43  0.95  3.73  0.67  0.98   2.44  1.16  0.45
## 3     0.15  -0.09   0.01  0.32 -0.01  0.11  0.01   0.20  0.10 -0.06
## 4     0.62   0.03   0.74 -0.11 -0.07 -0.19 -0.13  -0.09 -0.03  0.99
## 5    -0.06  -0.07  -0.07 -0.23 -0.12 -0.08 -0.14  -0.09 -0.21 -0.09
##   marching music  rock   god church jesus bible  hair dress blonde  mall
## 1     0.03  0.59  0.33  0.15   0.08 -0.08 -0.03  0.70  0.61   0.38  0.55
## 2    -0.14  1.52  3.04  0.19  -0.11  0.06 -0.10  1.97  0.88   1.40  0.18
## 3    -0.14  0.14 -0.04  0.05   0.37 -0.20 -0.10  0.44  0.14   0.04  0.95
## 4     1.28  0.18  0.16  1.32   1.58  1.79  1.00  0.02 -0.20   0.05  0.03
## 5    -0.10 -0.19 -0.18 -0.15  -0.18 -0.13 -0.07 -0.24 -0.14  -0.13 -0.19
##   shopping clothes hollister abercrombie   die death drunk drugs
## 1     0.39    0.37      0.00       -0.04  0.53  0.85  0.07  0.03
## 2     0.32    1.09      0.04       -0.17  1.45  0.57  2.66  3.15
## 3     1.32    1.10      2.52        2.41 -0.18 -0.13 -0.15 -0.06
## 4     0.02    0.13     -0.23       -0.17  0.00  0.14  0.04 -0.02
## 5    -0.19   -0.21     -0.18       -0.16 -0.14 -0.19 -0.09 -0.10
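A table of cluster centres like the one above can be read more quickly by ranking the interests within each cluster. The sketch below assumes a kmeans fit on standardised data; the data is simulated and the six interest names are only illustrative, so the ranking it prints is not the ranking for the real segments.

```r
# Stand-in data: 1000 simulated teenagers, six standardised interests.
set.seed(42)
interest_names <- c("soccer", "music", "church", "mall", "dance", "rock")
interests_z <- scale(matrix(rnorm(1000 * 6), ncol = 6,
                            dimnames = list(NULL, interest_names)))

teen_clusters <- kmeans(interests_z, centers = 5, nstart = 10)

# Cluster centres: one row per cluster, one column per interest.
# Values are z-scores, so 0 means "average across all teenagers".
centers <- round(teen_clusters$centers, 2)
centers

# For each cluster (row), keep the names of the three interests
# with the highest centre values.
top_interests <- apply(centers, 1, function(row) {
  names(sort(row, decreasing = TRUE))[1:3]
})

# Transpose so that each row is a cluster and each column a rank.
t(top_interests)
```

Because the centres are z-scores, a cluster whose values all sit near zero, as in row 5 above, has no strongly distinguishing interest even if one column ranks first.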

Q3) Do the segments have different gender composition?

Before creating the graph of gender proportions, I made sure that I had copied the original data set correctly. The cluster sizes match those reported above, which rules out a copying error.

The spinogram shows us that almost 70% of the respondents are female. Segment 2 consists almost entirely of females.

## 
##   1   2   3   4   5 
##  47  36  27  48 842
## Loading required package: grid

[Figure: spinogram of gender by segment]
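The gender split per segment can also be checked numerically with row proportions. This is a rough sketch on simulated data: the `teens` data frame, its column names and the 70/30 sampling weights are assumptions for illustration, and base R's `spineplot` stands in for whichever plotting function was actually used.

```r
# Stand-in data: a cluster label and a gender for 1000 simulated teenagers.
set.seed(42)
teens <- data.frame(
  cluster = factor(sample(1:5, 1000, replace = TRUE)),
  gender  = factor(sample(c("F", "M"), 1000, replace = TRUE, prob = c(0.7, 0.3)))
)

# Counts of each gender within each cluster.
gender_by_cluster <- table(teens$cluster, teens$gender)

# Row proportions: each row sums to 1, so rows are directly comparable
# even when the clusters differ greatly in size.
gender_share <- prop.table(gender_by_cluster, margin = 1)
round(gender_share, 2)

# Spine plot: bar width reflects cluster size, bar split reflects gender share.
spineplot(gender_by_cluster, xlab = "Segment", ylab = "Gender")
```

Using proportions rather than raw counts matters here because one segment holds most of the sample.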

Q4) Do the segments differ in the number of friends they have?

Segment 4 is more social than the other segments. They have a higher average number of friends as can be seen in the table below.

##     1     2     3     4     5 
## 28.74 29.75 28.07 31.98 28.55

One can see here that the mean number of friends is higher in cluster 4. The box plot shows us that this group of teenagers may be more social than the others, since more teenagers in this cluster have 40 or more friends. All segments seem to have a similar minimum number of friends. Although the means vary and the ranges are very different, this is affected by the number of people in each segment.

[Figure: box plot of number of friends by segment]
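The comparison of friend counts can be sketched as a per-segment mean plus a box plot. The data below is simulated and the `teens` data frame with its `cluster` and `friends` columns is hypothetical, so the means it prints will not match the table above.

```r
# Stand-in data: a cluster label and a friend count for 1000 simulated teenagers.
set.seed(42)
teens <- data.frame(
  cluster = factor(sample(1:5, 1000, replace = TRUE)),
  friends = rpois(1000, 29)
)

# Mean number of friends within each cluster.
mean_friends <- tapply(teens$friends, teens$cluster, mean)
round(mean_friends, 2)

# Cluster sizes, useful context when comparing ranges:
# larger groups tend to show more extreme minimum and maximum values.
table(teens$cluster)

# Box plot of the full distribution of friend counts per cluster.
boxplot(friends ~ cluster, data = teens,
        xlab = "Segment", ylab = "Number of friends")
```

Printing the sizes next to the means makes it easier to judge whether a wide range reflects real behaviour or simply a larger segment.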