Cluster Analysis: Finding Trends and Market Segments

Author: Luca Perer
Date: May 4, 2014

Assignment Goal: This assignment looks at a dataset of teenagers' behaviour on social media websites and segments the users to identify market segments.

Q1) What are the sizes of the five clusters?

Size of Clusters
I ran the data through the kmeans function to see how our 1000 participants fit into 5 clusters. As shown below, the teenagers are spread very unevenly across the clusters.

Cluster 5 has by far the largest number of teenagers (842). The other clusters are much smaller, with cluster 3 having only 27 teenagers.

## [1]  47  36  27  48 842
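
The call that produces these sizes is not shown in the report. A minimal sketch of how such a vector could be obtained follows. It assumes the interest data are standardized into a matrix named interests_z (the name used in the dendrogram code); the synthetic data, the column names and the nstart value are placeholders, not the author's settings.

```r
# Synthetic stand-in for the real survey data (hypothetical columns and values).
set.seed(42)
interests <- matrix(rexp(1000 * 4), ncol = 4,
                    dimnames = list(NULL, c("soccer", "music", "church", "mall")))

# Standardize each column to z-scores so no single interest dominates the distances.
interests_z <- scale(interests)

# k-means with 5 clusters; nstart = 25 reruns from 25 random starts and keeps the best fit.
km <- kmeans(interests_z, centers = 5, nstart = 25)

km$size        # number of observations in each cluster
sum(km$size)   # always equals the number of rows in interests_z
```

Because k-means starts from random centers, cluster numbering and sizes can change between runs unless a seed is set.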

Dendrogram
This dendrogram shows how the teenagers are grouped hierarchically:

Distances between data points are measured using the Euclidean method
The tree illustrates the arrangement of the clusters
The nodes are all quite similar, and small

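# Pairwise Euclidean distances between the rows of the interest data
distances = dist(interests_z, method = "euclidean")
# "ward" was renamed "ward.D" in R 3.1.0; the old name still runs but prints a message
ClusterTeenagers = hclust(distances, method = "ward.D")
plot(ClusterTeenagers)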

[Figure: dendrogram of ClusterTeenagers]

Additionally, the tree below has been cut into five clusters. The hierarchical solution sorts the teenagers into clusters of different sizes than k-means did: the largest cluster holds 633 teenagers rather than 842.

clusterGroups = cutree(ClusterTeenagers, k = 5)
table(clusterGroups)

## clusterGroups
##   1   2   3   4   5 
## 633 115  77 109  66
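
One way to see how the two solutions relate is to cross-tabulate the k-means labels against the hierarchical labels. The sketch below uses synthetic data in place of interests_z, and the names km, hcGroups and agreement are illustrative only. It is not part of the original analysis.

```r
# Synthetic stand-in for the standardized interest data (hypothetical values).
set.seed(42)
interests_z <- scale(matrix(rexp(1000 * 4), ncol = 4))

# k-means solution with 5 clusters.
km <- kmeans(interests_z, centers = 5, nstart = 25)

# Hierarchical solution cut into 5 groups ("ward.D" needs R >= 3.1.0).
hc <- hclust(dist(interests_z, method = "euclidean"), method = "ward.D")
hcGroups <- cutree(hc, k = 5)

# Rows: k-means cluster; columns: hierarchical cluster.
agreement <- table(kmeans = km$cluster, hclust = hcGroups)
agreement
```

Cluster numbers are arbitrary in both methods, so agreement shows up as one large cell per row rather than as large counts on the diagonal.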

Q2) Describe the segment by their interests. Give a name to each segment.

Using the cluster centers from the kmeans function, we are able to interpret which interests each cluster talks about most. This will allow me to name each cluster and build a profile of who these teenagers are.

Segment 1 (Jersey Shore)

Cluster one has an almost equal representation of males and females. As a result, the top interests seem to represent two profiles. The most significant interests are:

Cute
Death
Dance
Hair

Though they have smaller means, other interests include:

soccer
music
dress

As a result this segment seems to like to play soccer, dress well, listen to music and contemplate death.

Segment 2 (Rebels)

Sex
Drugs
Rock
Drunk
Kissed

Segment 3 (Shoppers)

Hollister
Abercrombie
Cheerleading
Shopping
Clothes

Segment 4 (Religious)

Jesus
Church
Softball
God
Marching

Segment 5 (Girls)

This group has a very large number of teenagers in it. As a result, the deviations from the mean are very small, and it becomes difficult to ascertain which interests differentiate the segment. In the table below, every center for this cluster is slightly negative, and the interests listed here are among those with the largest deviations in absolute value. Because there are so many girls in this segment, I was not surprised to see that the most relevant interests just happened to identify them as girls.

Hair
Clothes
Dance
Mall
Shopping
Music

##   basketball football soccer softball volleyball swimming cheerleading
## 1       0.19     0.01   0.37    -0.10       0.01     0.14        -0.02
## 2       0.82     0.35   0.78    -0.01       0.09     0.81        -0.05
## 3       0.53     0.58  -0.02     0.28      -0.03     0.16         2.17
## 4       0.66     0.23  -0.08     1.35       1.09     0.08        -0.16
## 5      -0.16    -0.08  -0.09    -0.12      -0.10    -0.07        -0.15
##   baseball tennis sports  cute   sex  sexy   hot kissed dance  band
## 1    -0.10   0.36  -0.02  0.90  0.06  0.34  0.57   0.01  0.83 -0.01
## 2     0.37   0.17   0.43  0.95  3.73  0.67  0.98   2.44  1.16  0.45
## 3     0.15  -0.09   0.01  0.32 -0.01  0.11  0.01   0.20  0.10 -0.06
## 4     0.62   0.03   0.74 -0.11 -0.07 -0.19 -0.13  -0.09 -0.03  0.99
## 5    -0.06  -0.07  -0.07 -0.23 -0.12 -0.08 -0.14  -0.09 -0.21 -0.09
##   marching music  rock   god church jesus bible  hair dress blonde  mall
## 1     0.03  0.59  0.33  0.15   0.08 -0.08 -0.03  0.70  0.61   0.38  0.55
## 2    -0.14  1.52  3.04  0.19  -0.11  0.06 -0.10  1.97  0.88   1.40  0.18
## 3    -0.14  0.14 -0.04  0.05   0.37 -0.20 -0.10  0.44  0.14   0.04  0.95
## 4     1.28  0.18  0.16  1.32   1.58  1.79  1.00  0.02 -0.20   0.05  0.03
## 5    -0.10 -0.19 -0.18 -0.15  -0.18 -0.13 -0.07 -0.24 -0.14  -0.13 -0.19
##   shopping clothes hollister abercrombie   die death drunk drugs
## 1     0.39    0.37      0.00       -0.04  0.53  0.85  0.07  0.03
## 2     0.32    1.09      0.04       -0.17  1.45  0.57  2.66  3.15
## 3     1.32    1.10      2.52        2.41 -0.18 -0.13 -0.15 -0.06
## 4     0.02    0.13     -0.23       -0.17  0.00  0.14  0.04 -0.02
## 5    -0.19   -0.21     -0.18       -0.16 -0.14 -0.19 -0.09 -0.10
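
A table of centers like the one above can also be read programmatically by ranking the interests within each cluster. This is a rough sketch on synthetic data; the column names and the choice of the top three interests are assumptions made for illustration.

```r
# Synthetic stand-in for the standardized interest data (hypothetical columns).
set.seed(42)
interests_z <- scale(matrix(rexp(1000 * 6), ncol = 6,
  dimnames = list(NULL, c("soccer", "music", "church", "mall", "hair", "rock"))))

km <- kmeans(interests_z, centers = 5, nstart = 25)

# Each row of km$centers is one cluster's mean z-score per interest.
round(km$centers, 2)

# For each cluster, keep the names of the 3 interests with the highest center.
top_interests <- apply(km$centers, 1, function(center) {
  names(sort(center, decreasing = TRUE))[1:3]
})

t(top_interests)   # one row per cluster, most prominent interest first
```

For a very large cluster whose centers all sit near zero, ranking by abs(center) instead may be more informative than ranking by the signed value.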

Q3) Do the segments have different gender composition?

Before creating the graph of gender proportions, I made sure that I had copied the original set of data correctly. The cluster sizes match the earlier output, which rules out a copying error.

The spinogram shows us that almost 70% of the respondents are female. Segment 2 consists almost entirely of females.

## 
##   1   2   3   4   5 
##  47  36  27  48 842

## Loading required package: grid

[Figure: spinogram of gender composition by segment]
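
The gender shares behind a plot like this can be tabulated directly. The sketch below uses made-up cluster and gender vectors, and it draws the plot with base R's spineplot as a stand-in, since the report does not show which plotting call was used.

```r
# Hypothetical stand-ins: one cluster label and one gender per teenager.
set.seed(42)
cluster <- sample(1:5, 1000, replace = TRUE, prob = c(.05, .04, .03, .05, .83))
gender  <- factor(sample(c("F", "M"), 1000, replace = TRUE, prob = c(.7, .3)))

# Counts of each gender within each cluster.
counts <- table(cluster, gender)

# Row-wise proportions: the share of each gender inside a cluster (rows sum to 1).
shares <- prop.table(counts, margin = 1)
round(shares, 2)

# Spine plot: bar widths reflect cluster size, shading reflects gender share.
spineplot(factor(cluster), gender, xlab = "Segment", ylab = "Gender")
```

Reading the proportions alongside the counts helps, because a share computed from a cluster of 27 is much noisier than one computed from a cluster of 842.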

Q4) Do the segments differ in the number of friends they have?

Segment 4 is more social than the other segments. They have a higher average number of friends as can be seen in the table below.

##     1     2     3     4     5 
## 28.74 29.75 28.07 31.98 28.55

One can see here that the mean number of friends is highest in cluster 4. The box plot suggests that this group of teenagers may be more social than the others, since more of the teenagers in this cluster have 40 or more friends. All segments seem to have similar minimum numbers of friends. Although the means vary and the limits are very different, this is affected by the number of people in each sample.

[Figure: box plot of number of friends by segment]
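
A per-segment mean and a box plot like the ones above could be computed along the following lines. The friends and cluster vectors here are simulated placeholders, and varwidth = TRUE is a suggested option rather than the author's setting.

```r
# Hypothetical stand-ins: a cluster label and a friend count per teenager.
set.seed(42)
cluster <- sample(1:5, 1000, replace = TRUE, prob = c(.05, .04, .03, .05, .83))
friends <- rpois(1000, lambda = 29)

# Mean number of friends within each cluster.
mean_friends <- tapply(friends, cluster, mean)
round(mean_friends, 2)

# Cluster sizes, to keep in mind when comparing whiskers and outliers.
table(cluster)

# varwidth = TRUE draws each box with width proportional to the square root of its group size.
boxplot(friends ~ cluster, varwidth = TRUE,
        xlab = "Segment", ylab = "Number of friends")
```

Larger groups tend to show wider ranges and more outliers simply because they contain more observations, which is why the sizes are printed next to the means.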