Author: Luca Perer
Date: May 4, 2014
Size of Clusters
I ran the data through the kmeans function to see how our 1000 participants fit into 5 clusters. As can be seen below, the teenagers in the sample are distributed very unevenly across the clusters.
Cluster 5 has the largest number of teenagers in it. The other clusters are much smaller, with cluster 3 having only 27 teenagers in it.
## [1] 47 36 27 48 842
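The sizes above are the kind of output found in the `size` component of a `kmeans` fit. The following is a minimal sketch, assuming the interests sit in a numeric matrix that is standardised with `scale()`. The simulated data, the seed and the object name `teen_clusters` are illustrative assumptions rather than the original analysis, so the sizes it prints will differ from those reported here.

```r
# Simulated stand-in for the survey: 1000 teenagers, 4 interest scores
# (hypothetical data; the real analysis used the actual interest columns)
set.seed(42)
interests <- matrix(rnorm(1000 * 4), nrow = 1000,
                    dimnames = list(NULL, c("soccer", "music", "church", "mall")))

# Standardise each column to mean 0 and standard deviation 1
interests_z <- scale(interests)

# Fit k-means with 5 clusters; nstart reruns from several random starts
teen_clusters <- kmeans(interests_z, centers = 5, nstart = 10)

# Number of teenagers assigned to each cluster
teen_clusters$size
```

Because k-means starts from random centres, fixing the seed is what makes the cluster numbering and sizes repeatable between runs.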
Dendrogram
This dendrogram shows how the teenagers are grouped by hierarchical clustering. It is produced as follows:
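# Pairwise Euclidean distances between the rows of interests_z
distances = dist(interests_z, method = "euclidean")
# "ward" was renamed "ward.D" in R 3.1.0 (same algorithm)
ClusterTeenagers = hclust(distances, method = "ward.D")
plot(ClusterTeenagers)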
This next output has been cut to five clusters, and the hierarchical output sorts the teenagers into differently sized clusters:
clusterGroups = cutree(ClusterTeenagers, k = 5)
table(clusterGroups)
## clusterGroups
## 1 2 3 4 5
## 633 115 77 109 66
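The two methods number their clusters independently, so cluster 1 from `cutree` need not correspond to cluster 1 from `kmeans`. A cross-tabulation is one way to see how the two assignments overlap. The sketch below runs on simulated data with hypothetical object names (`km`, `hc_groups`, `overlap`). It uses `"ward.D"`, the name that R 3.1.0 and later give to the method formerly called `"ward"`.

```r
# Simulated stand-in for the standardised interests (hypothetical data)
set.seed(1)
interests_z <- scale(matrix(rnorm(200 * 3), nrow = 200))

# k-means assignment with 5 clusters
km <- kmeans(interests_z, centers = 5, nstart = 10)

# Hierarchical assignment: Ward linkage on Euclidean distances, cut at k = 5
distances <- dist(interests_z, method = "euclidean")
hc_groups <- cutree(hclust(distances, method = "ward.D"), k = 5)

# Rows: k-means cluster; columns: hierarchical cluster
overlap <- table(kmeans = km$cluster, hclust = hc_groups)
overlap
```

A row whose counts concentrate in a single column indicates a k-means cluster that the hierarchical method recovers almost intact.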
Using the kmeans function, we are able to interpret what each cluster talks about. This will allow me to name each cluster and build a profile of who these teenagers are.
Segment 1 (Jersey Shore)
Cluster one has an almost equal representation of males and females. As a result, the top interests seem to represent two profiles. The most significant interests are:
Though they have smaller means, other interests include:
As a result this segment seems to like to play soccer, dress well, listen to music and contemplate death.
Segment 2 (Rebels)
Segment 3 (Shoppers)
Segment 4 (Religious)
Segment 5 (Girls)
This group has a very large number of teenagers in it. As a result, the deviations from the mean are very small, and it becomes difficult to ascertain which interests differentiate the segment. Because there are so many girls in this segment, I was not surprised to see that the most relevant interests just happened to identify them as girls.
##   basketball football soccer softball volleyball swimming cheerleading
## 1       0.19     0.01   0.37    -0.10       0.01     0.14        -0.02
## 2       0.82     0.35   0.78    -0.01       0.09     0.81        -0.05
## 3       0.53     0.58  -0.02     0.28      -0.03     0.16         2.17
## 4       0.66     0.23  -0.08     1.35       1.09     0.08        -0.16
## 5      -0.16    -0.08  -0.09    -0.12      -0.10    -0.07        -0.15
##   baseball tennis sports  cute   sex  sexy   hot kissed dance  band
## 1    -0.10   0.36  -0.02  0.90  0.06  0.34  0.57   0.01  0.83 -0.01
## 2     0.37   0.17   0.43  0.95  3.73  0.67  0.98   2.44  1.16  0.45
## 3     0.15  -0.09   0.01  0.32 -0.01  0.11  0.01   0.20  0.10 -0.06
## 4     0.62   0.03   0.74 -0.11 -0.07 -0.19 -0.13  -0.09 -0.03  0.99
## 5    -0.06  -0.07  -0.07 -0.23 -0.12 -0.08 -0.14  -0.09 -0.21 -0.09
##   marching music  rock   god church jesus bible  hair dress blonde  mall
## 1     0.03  0.59  0.33  0.15   0.08 -0.08 -0.03  0.70  0.61   0.38  0.55
## 2    -0.14  1.52  3.04  0.19  -0.11  0.06 -0.10  1.97  0.88   1.40  0.18
## 3    -0.14  0.14 -0.04  0.05   0.37 -0.20 -0.10  0.44  0.14   0.04  0.95
## 4     1.28  0.18  0.16  1.32   1.58  1.79  1.00  0.02 -0.20   0.05  0.03
## 5    -0.10 -0.19 -0.18 -0.15  -0.18 -0.13 -0.07 -0.24 -0.14  -0.13 -0.19
##   shopping clothes hollister abercrombie   die death drunk drugs
## 1     0.39    0.37      0.00       -0.04  0.53  0.85  0.07  0.03
## 2     0.32    1.09      0.04       -0.17  1.45  0.57  2.66  3.15
## 3     1.32    1.10      2.52        2.41 -0.18 -0.13 -0.15 -0.06
## 4     0.02    0.13     -0.23       -0.17  0.00  0.14  0.04 -0.02
## 5    -0.19   -0.21     -0.18       -0.16 -0.14 -0.19 -0.09 -0.10
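Ranking the centre values within each cluster is one way to pick out the interests that define a segment. The sketch below copies six columns of the table above into a small matrix and lists the two largest values per cluster. The choice of columns and the name `top_interests` are illustrative assumptions, and the ranking covers only these six columns, not the full table.

```r
# Six columns of the cluster centres printed above (rows = clusters 1 to 5)
centers <- matrix(
  c( 0.37,  0.90,  0.85, -0.08,  0.00,  0.03,
     0.78,  0.95,  0.57,  0.06,  0.04,  3.15,
    -0.02,  0.32, -0.13, -0.20,  2.52, -0.06,
    -0.08, -0.11,  0.14,  1.79, -0.23, -0.02,
    -0.09, -0.23, -0.19, -0.13, -0.18, -0.10),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5),
                  c("soccer", "cute", "death", "jesus", "hollister", "drugs"))
)

# For each cluster, the names of the two interests with the largest centres
top_interests <- apply(centers, 1,
                       function(z) names(sort(z, decreasing = TRUE))[1:2])
top_interests
```

For cluster 5 every value in this subset is slightly negative, so the ranking says little, which matches the difficulty described for that segment.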
Before creating the graph of the gender proportions, I made sure that I had copied the original data set correctly. The numbers match, and therefore I eliminated that possibility for error.
The spinogram shows us that almost 70% of the respondents are female. Segment 2 consists almost entirely of females.
##
## 1 2 3 4 5
## 47 36 27 48 842
## Loading required package: grid
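The gender split within each segment can also be read from a table of row proportions. The sketch below is illustrative only: the `cluster` and `gender` vectors are simulated, and the 70% female share from the text is used purely as a simulation parameter.

```r
# Simulated stand-ins (hypothetical): cluster labels and gender for 1000 teenagers
set.seed(7)
cluster <- factor(sample(1:5, 1000, replace = TRUE))
gender  <- factor(sample(c("F", "M"), 1000, replace = TRUE, prob = c(0.7, 0.3)))

# Counts of each gender within each cluster
counts <- table(cluster, gender)

# Row proportions: share of females and males inside each cluster
gender_share <- prop.table(counts, margin = 1)
round(gender_share, 2)

# Base R can draw the same information as a spine plot:
# spineplot(gender ~ cluster)
```

Using `margin = 1` makes each row sum to 1, so a small segment and a large one can be compared on the same scale.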
Segment 4 appears more social than the other segments. Its members have a higher average number of friends (31.98, against 28.07 to 29.75 elsewhere), as can be seen in the table below.
## 1 2 3 4 5
## 28.74 29.75 28.07 31.98 28.55
One can see here that the mean number of friends is higher in cluster 4. The box plot suggests that this group of teenagers may be more social than the others, since more teenagers in this cluster have 40 or more friends. All segments seem to have similar minimum numbers of friends. Although the means vary and the upper limits are very different, this is affected by the number of people in each segment.
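The per-segment means and the share of teenagers with many friends can be computed directly, which avoids reading them off the box plot. This is a sketch on simulated data: the `friends` and `cluster` vectors and the names `mean_friends` and `share_40_plus` are assumptions made for illustration.

```r
# Simulated stand-ins (hypothetical): friend counts and cluster labels
set.seed(3)
cluster <- factor(sample(1:5, 1000, replace = TRUE))
friends <- rpois(1000, lambda = 29)

# Mean number of friends per cluster
mean_friends <- tapply(friends, cluster, mean)
round(mean_friends, 2)

# Share of each cluster with 40 or more friends; unlike a raw count,
# a share does not grow just because the cluster has more members
share_40_plus <- tapply(friends >= 40, cluster, mean)
round(share_40_plus, 3)

# Side-by-side box plots of friends by cluster:
# boxplot(friends ~ cluster)
```

Comparing shares rather than counts addresses the concern that differences between segments partly reflect how many people each one contains.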