This analysis applies hierarchical clustering and K-means clustering to the SURVEY10 dataset in order to develop a targeted marketing strategy for a new dating app. The SURVEY10 dataset contains survey responses from 699 students, capturing demographics, behavioral attributes, and preferences. By segmenting students into groups, also called clusters, we aim to identify groupings that inform personalized marketing approaches. This report uses formatted tables, high-quality visuals, and strategic recommendations to support engagement on the dating app.
| | |
|---|---|
| Name | SURVEY10 |
| Number of rows | 699 |
| Number of columns | 20 |
| Column type frequency: | |
| factor | 3 |
| numeric | 17 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Gender | 0 | 1 | FALSE | 2 | Fem: 386, Mal: 313 |
| Handedness | 0 | 1 | FALSE | 3 | Rig: 615, Lef: 61, Amb: 23 |
| SigificantOther | 0 | 1 | FALSE | 2 | No: 389, Yes: 310 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Height | 0 | 1 | 67.74 | 4.44 | 52 | 65.00 | 67.0 | 72.00 | 78 | ▁▂▇▇▃ |
| Weight | 0 | 1 | 155.57 | 36.46 | 75 | 130.00 | 150.0 | 178.00 | 305 | ▂▇▅▁▁ |
| DesiredWeight | 0 | 1 | 149.47 | 33.83 | 95 | 122.50 | 140.0 | 175.00 | 285 | ▇▅▅▁▁ |
| GPA | 0 | 1 | 3.15 | 0.63 | 0 | 2.85 | 3.2 | 3.56 | 4 | ▁▁▁▇▇ |
| TxtPerDay | 0 | 1 | 71.45 | 98.38 | 0 | 20.00 | 45.0 | 100.00 | 1000 | ▇▁▁▁▁ |
| MinPerDayFaceBook | 0 | 1 | 62.94 | 67.11 | 0 | 20.00 | 45.0 | 90.00 | 600 | ▇▁▁▁▁ |
| NumTattoos | 0 | 1 | 0.23 | 0.69 | 0 | 0.00 | 0.0 | 0.00 | 6 | ▇▁▁▁▁ |
| NumBodyPiercings | 0 | 1 | 1.81 | 2.06 | 0 | 0.00 | 2.0 | 3.00 | 10 | ▇▂▁▁▁ |
| WeeklyHrsVideoGame | 0 | 1 | 2.46 | 4.87 | 0 | 0.00 | 0.5 | 3.00 | 45 | ▇▁▁▁▁ |
| DistanceMovedToSchool | 0 | 1 | 177.96 | 317.88 | 0 | 15.00 | 106.7 | 213.50 | 5000 | ▇▁▁▁▁ |
| PercentDateable | 0 | 1 | 26.28 | 22.21 | 0 | 10.00 | 20.0 | 40.00 | 100 | ▇▃▂▁▁ |
| NumPhoneContacts | 0 | 1 | 175.21 | 127.49 | 0 | 90.00 | 150.0 | 229.50 | 1000 | ▇▃▁▁▁ |
| PercMoreAttractiveThan | 0 | 1 | 53.09 | 25.79 | 0 | 39.00 | 50.0 | 71.50 | 100 | ▃▅▇▆▃ |
| PercMoreIntelligentThan | 0 | 1 | 57.21 | 23.73 | 0 | 45.00 | 60.0 | 75.00 | 100 | ▂▃▇▆▃ |
| PercMoreAthleticThan | 0 | 1 | 49.30 | 27.03 | 0 | 30.00 | 50.0 | 70.00 | 100 | ▇▇▇▇▅ |
| PercFunnierThan | 0 | 1 | 51.90 | 25.13 | 0 | 30.00 | 50.0 | 70.00 | 100 | ▃▆▇▆▃ |
| OwnAttractiveness | 0 | 1 | 72.55 | 17.99 | 1 | 65.00 | 75.0 | 85.00 | 100 | ▁▁▂▇▅ |
| Variable | Missing_Values |
|---|---|
| Gender | 0 |
| Height | 0 |
| Weight | 0 |
| DesiredWeight | 0 |
| GPA | 0 |
| TxtPerDay | 0 |
| MinPerDayFaceBook | 0 |
| NumTattoos | 0 |
| NumBodyPiercings | 0 |
| Handedness | 0 |
| WeeklyHrsVideoGame | 0 |
| DistanceMovedToSchool | 0 |
| PercentDateable | 0 |
| NumPhoneContacts | 0 |
| PercMoreAttractiveThan | 0 |
| PercMoreIntelligentThan | 0 |
| PercMoreAthleticThan | 0 |
| PercFunnierThan | 0 |
| SigificantOther | 0 |
| OwnAttractiveness | 0 |
The Gender Distribution bar chart shows that the dataset contains more female respondents (386) than male respondents (313). Both groups have sufficient representation. Understanding gender-based preferences can help tailor marketing strategies, since males and females may have different interests, dating preferences, or responses to app features. Considering gender alongside other variables (such as texting frequency, social media use, and dating preferences) can provide a clearer picture of distinct user groups.
The box plot above reveals key insights about social media engagement. Females tend to spend more time on Facebook than males, on average, as indicated by the higher median and interquartile range. There are multiple outliers in both groups, with some individuals spending as much as 600 minutes (10 hours) per day on Facebook. These extreme values represent a small subset of users with exceptionally high engagement. Since time spent on social media varies across individuals, it can be crucial for clustering users into different engagement levels. Users who spend more time on social media may respond better to digital marketing strategies than low-engagement users.
The Distribution of GPA histogram provides important insights into the academic performance of students in the dataset. The majority of students have GPAs between 2.5 and 4.0, with the highest concentration between roughly 3.0 and 3.5. The distribution is left-skewed: most GPAs sit near the top of the scale, with a thin tail of low values extending toward 0. Students in different GPA ranges may differ in lifestyle choices, study habits, and free time for socializing, all of which could affect their dating app usage. GPA could also be linked to time spent on social media, extracurricular involvement, or relationship preferences, making it a useful variable for clustering.
The scatterplot of Weight vs. Desired Weight provides several key insights. There is a clear linear relationship between current weight and desired weight, meaning individuals generally want to weigh a similar amount to their current weight. Many points fall below the dashed line, indicating that a large share of individuals desire to weigh less than they currently do (the mean desired weight, 149.47, is also below the mean current weight, 155.57). Desired weight could be an important factor in dating preferences and self-perception, making it valuable for clustering. Some users might prioritize fitness and self-improvement, while others may have other interests, and clustering can help separate these groups. People with larger discrepancies between their current and desired weights may engage differently with dating apps. Whether they seek partners with similar fitness goals or prioritize other factors, this data will help refine marketing strategies through clustering.
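The discrepancy discussed above can be turned into an explicit feature. The snippet below is a minimal sketch: the five (weight, desired weight) pairs are made up for illustration and are not rows from SURVEY10, and the `weight_gap` helper is a hypothetical name. For reference, the summary table's means imply an average gap of about 6.1 (155.57 minus 149.47).

```python
# Hypothetical (weight, desired_weight) pairs for illustration only;
# these are NOT rows from SURVEY10.
respondents = [(150, 140), (180, 165), (130, 130), (120, 125), (200, 170)]

def weight_gap(weight, desired_weight):
    """Positive gap: the respondent wants to weigh less than they do now."""
    return weight - desired_weight

gaps = [weight_gap(w, d) for w, d in respondents]

# Share of respondents who fall below the "desired = current" line.
share_wanting_less = sum(g > 0 for g in gaps) / len(gaps)
mean_gap = sum(gaps) / len(gaps)

print(gaps, share_wanting_less, mean_gap)
```

A gap column like this could be clustered on directly instead of, or alongside, the raw Weight and DesiredWeight columns, which are strongly correlated with each other.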
The above boxplot provides insights into communication habits among male and female respondents. The median number of texts per day is higher for females than for males, suggesting that women, on average, tend to send more text messages. Both groups have extreme outliers (cropped from the plot for ease of viewing) who send over 200 text messages per day, likely indicating high use of digital media. Texting behavior varies significantly, so clustering users based on their texting frequency can help separate highly active users from those who prefer less frequent engagement. This variable is very useful for clustering, as it helps to identify different communication styles.
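To make "outlier" concrete, the sketch below applies the common 1.5 × IQR box plot rule to the quartiles reported in the summary table. Two assumptions are worth flagging: we do not know that the plots above used this exact rule, and the table's quartiles describe the whole sample, whereas the box plots are split by gender, so the per-group fences will differ somewhat. The `tukey_fences` helper is a hypothetical name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences; points outside are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles (p25, p75) taken from the numeric summary table.
txt_low, txt_high = tukey_fences(20.0, 100.0)  # TxtPerDay
fb_low, fb_high = tukey_fences(20.0, 90.0)     # MinPerDayFaceBook

print(txt_high, fb_high)
```

Under these assumptions the upper fence for texts per day is 220, which is in line with the "over 200 text messages" outliers described above, and the upper fence for Facebook minutes is 195.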
Clustering is an unsupervised machine learning technique that helps identify distinct groups within a dataset based on shared characteristics. For our dating app marketing task, clustering will be exceptionally useful for several reasons:
1.) Identifying Distinct Student Segments: Students have different dating preferences, lifestyles, and social behaviors. Some might be highly social and extroverted, while others may be more reserved. Grouping similar individuals together allows the company to tailor its messages to each segment.
2.) Personalizing Marketing Strategies: A “one size fits all” approach to marketing is not effective for a diverse population of college students. By grouping students into meaningful clusters, a marketing team can create targeted marketing strategies.
3.) Enhanced User Engagement & Retention: If the dating app understands different user personalities, it has the potential to make more matches. If users feel the app is tailored to them, there is the potential for more app downloads, increased user retention, and more active engagement within the dating community.
Clustering improves marketing efficiency, personalization for each user, and user satisfaction. It allows our dating app to target the right audience with the right message, which will potentially lead to a better user experience.
To create meaningful clusters for the dating app marketing team, we must identify characteristics that capture demographic, lifestyle, and dating differences. These variables will eventually be used for hierarchical and k-means clustering.
Demographic Variables - Gender, GPA, and Distance Moved to School are essential for understanding dating preferences. They may indicate different social habits and lifestyle balance. They can help us cluster groups based on demographics.
Lifestyle & Social Behavior Variables - Texts per Day, Minutes per Day on Facebook, Weekly Hours of Video Games, and Number of Phone Contacts describe an individual’s pastimes, which are crucial for understanding the dating world. These variables give insights into communication habits, social engagement, and social connectivity.
Self Perception & Dating Preferences - Desired Weight, Percent Dateable (a self-reported metric on perceived attractiveness in dating), and self-rated intelligence, humor, and athleticism are valuable variables that give us even closer insights into how we may cluster students from our dataset.
All three categories of variables provide actionable insights into student clustering. Some expected clusters may include: “Social Butterflies” ~ highly social, frequent texters who are moderately to highly active on social media. “Confident Daters” ~ students with high self-esteem and perceived intelligence. “Casual Gamers” ~ those who spend more time gaming or online than socializing in person. Lastly, “Reserved Intellectuals” ~ high-GPA students who value intelligence but may have fewer social contacts. These clusters will be used to design two distinct marketing strategies that target different dating personas efficiently.
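The selected variables sit on very different scales (in the summary table, DistanceMovedToSchool has a standard deviation of 317.88, while GPA has one of 0.63), so distance-based clustering on raw values would be dominated by the widest columns. A common remedy is to convert each column to z-scores first. The sketch below assumes that step; the report does not state how the original analysis scaled its inputs, and the toy rows and the `zscores` and `standardize` helpers are hypothetical.

```python
from statistics import mean, stdev

def zscores(values):
    """Center a column on its mean and scale by its sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def standardize(rows):
    """Standardize every column of a list of equal-length numeric rows."""
    columns = [zscores(col) for col in zip(*rows)]
    return [tuple(row) for row in zip(*columns)]

# Toy rows (GPA-like column, distance-like column); not SURVEY10 data.
toy = [(2.0, 10.0), (3.0, 100.0), (4.0, 190.0)]
scaled = standardize(toy)

print(scaled)
```

After standardizing, both toy columns contribute equally to any Euclidean distance, even though the second column's raw values are far larger.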
The hierarchical clustering dendrogram (above) visually represents how the students in the dataset are grouped based on similarity. The x-axis represents individual data points (students), while the y-axis represents the distance, or dissimilarity, at which clusters are joined. Moving down the dendrogram, clusters split into smaller and more homogeneous groups until each student sits alone as a single leaf. Moving back up, similar students and groups are merged into progressively larger clusters. We can use this dendrogram to create our clusters, but first we must analyze how many clusters are required.
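The bottom-up merging that a dendrogram records can be sketched in a few lines. The example below is a simplified illustration, not the code behind the figure: it assumes Euclidean distance and complete linkage (the report does not say which linkage was used), runs on five made-up points, and uses hypothetical helper names. Stopping at `n_clusters` plays the role of cutting the tree.

```python
from itertools import combinations
from math import dist

def linkage_distance(a, b, points):
    """Complete linkage: distance between the two farthest members of a and b."""
    return max(dist(points[i], points[j]) for i in a for j in b)

def agglomerate(points, n_clusters):
    """Merge the closest pair of clusters until only n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merge_heights = []                            # dendrogram heights, in merge order
    while len(clusters) > n_clusters:
        ia, ib = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage_distance(clusters[pair[0]], clusters[pair[1]], points),
        )
        merge_heights.append(linkage_distance(clusters[ia], clusters[ib], points))
        merged = clusters[ia] + clusters[ib]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (ia, ib)]
        clusters.append(merged)
    return clusters, merge_heights

# Toy 2-D points for illustration only; not SURVEY10 data.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merge_heights = agglomerate(toy_points, n_clusters=3)

print(clusters, merge_heights)
```

This brute-force version recomputes every pairwise linkage at each step, so it is only suitable for small examples; a library routine would be the practical choice for all 699 students.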
Now that we have a clear and stylish dendrogram, we need to decide where to “cut” and form clusters. To identify the best cutting point, we can analyze an Elbow Chart:
The Elbow Chart helps determine the optimal number of clusters (k) by plotting the total within-cluster sum of squares (WCSS) against different values of k. WCSS measures cluster compactness by summing the squared distances between each data point and the centroid of its cluster. The “bend of the elbow” appears at k = 4, which suggests diminishing returns after this point: adding more clusters reduces WCSS only marginally. Choosing fewer clusters would oversimplify the segmentation, and we would miss meaningful patterns in our data.
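The WCSS definition above translates directly into code. The following sketch computes it for hand-assigned labels on four made-up points, so that the drop between successive values of k is easy to verify by hand; the `centroid` and `wcss` helpers are hypothetical names, and in a real elbow chart the labels would come from a clustering run at each k.

```python
def centroid(members):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(col) / len(members) for col in zip(*members))

def wcss(points, labels):
    """Total within-cluster sum of squared distances to each cluster centroid."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    total = 0.0
    for members in groups.values():
        c = centroid(members)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in members)
    return total

# Toy points forming two obvious pairs; not SURVEY10 data.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

wcss_k1 = wcss(points, [0, 0, 0, 0])  # everything in one cluster
wcss_k2 = wcss(points, [0, 0, 1, 1])  # the two natural pairs
wcss_k4 = wcss(points, [0, 1, 2, 3])  # every point on its own

print(wcss_k1, wcss_k2, wcss_k4)
```

Here WCSS falls from 104 to 4 when moving from one cluster to two, and only from 4 to 0 after that, so the elbow for this toy data is at k = 2.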
Using the cluster summary above, we can come to some generalizations about our clustered groups. Row 1 represents the first cluster. These individuals are outgoing, athletic and confident. They are the tallest, heaviest, and they have a higher than average attractiveness perception.
Cluster 2 groups together those considered to be High Achievers and Social Butterflies. They are the shortest in height and the lightest in weight, but are very active on social media. They send the most texts, are more confident in their intelligence than in their athleticism, and are the strongest academic achievers (highest GPA).
Cluster 3 seems to be more reserved. They spend a moderate amount of time texting and on Facebook, and they are shorter and weigh moderately less than clusters 1 and 4. They have the lowest GPA and play the fewest hours of video games. Those in cluster 3 also have the lowest confidence in their intelligence.
Those in cluster 4 come in a close second to cluster 1 for height and weight. They moved the furthest to attend school, but only slightly further than those in cluster 1. They are second to last in confidence in their intelligence, but rate themselves funnier than average.
Above is the K-Means clustering table, which segments students into distinct groups based on similarities, much like hierarchical clustering but with some methodological differences. Each row shows a cluster centroid—the average characteristics of students in that group. Positive values indicate characteristics above the dataset average, negative values indicate below-average traits, and values near zero reflect features close to the overall average.
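For readers who want to see how such centroids arise, the sketch below implements the basic K-means loop (assign each point to its nearest centroid, then recompute each centroid as the mean of its members). It is a simplified illustration rather than the procedure used to build the table: the data are four made-up, already standardized points, the `kmeans` helper and its parameters are hypothetical, and library implementations typically add refinements such as multiple random restarts.

```python
import random
from math import dist

def kmeans(points, k, n_iter=100, seed=0):
    """Basic Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Toy standardized data (two features as z-scores); not SURVEY10 data.
toy_points = [(-1.0, -1.0), (-1.2, -0.8), (1.0, 1.0), (1.2, 0.8)]
labels, centroids = kmeans(toy_points, k=2)

print(labels, centroids)
```

Because the inputs are z-scores, a centroid with positive coordinates describes a group that is above the overall average on those features, which is how the table above is read.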
The bar chart above summarizes our K-means clustering analysis. We can see here how each distinct group of students (clusters) differs across key traits:
Cluster 1 is characterized by above-average height, weight, and desired weight. Combine this with higher-than-average social interaction (a combination of Facebook use and texts per day), and you’ve got a population of students that likely represents confident, engaged users. Marketing to this group should feature direct interactions, appeals to technology and connection, and active events. We’ll call these folks the “Social Enthusiasts”.
Cluster 2 shows below-average perceptions of self-attractiveness and athleticism, but maintains relatively high GPAs. This suggests that the group could benefit from supportive, confidence-building app features like icebreakers or guided interactions to generate engagement. We’ll call these folks the “Reserved Achievers”.
Cluster 3 shows lower-than-average self-perception in categories like attractiveness, athleticism, intelligence, and humor. These folks also have low social engagement. This group would likely seek privacy-enhanced features and carefully structured interactions to create comfort and generate engagement. We’ll call these folks the “Hermits”.
Cluster 4 stands out with high levels of personal expression (more tattoos, more body piercings). Their willingness to visibly set themselves apart from the rest suggests they would value a more creative and unconventional dating experience. Providing customization, expressive communication features, and niche community options should strongly resonate with this group. We’ll call these folks the “Expressive Alternatives”.
This heat map provides another visual summary of how each cluster differs across the characteristics discussed above. Warmer colors (pink) indicate higher-than-average values within a cluster, while cooler colors (blue) highlight below-average values. This quick color coding makes key distinctions between clusters easy to spot. For example, we can see that our “Expressive Alternatives” lead the pack in number of tattoos and body piercings. This heat map is provided as a complement to the bar chart.
The Principal Component Analysis (PCA) plot above also allows us to visualize our clusters. Each color-coded cluster occupies its own region of the plot (with the exception of cluster 4, our Expressive Alternatives, which is very fitting for them). For example, the green cluster (cluster 3) occupies an area distinctly apart from the red cluster (cluster 1), indicating substantial behavioral differences. Overlapping regions (such as clusters 2 & 4) indicate some shared characteristics among these segments. Recognizing these groups enables targeted marketing strategies. In summary, this visualization supports the view that we have well-defined, actionable user segments, which allows us to focus investments in features tailored specifically to each group’s preferences and behaviors.
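A PCA plot like this one projects every student onto the two directions of greatest variance. The sketch below shows one way to compute such a projection from scratch, using power iteration with deflation on the covariance matrix. It is an illustrative assumption, not the method behind the figure: the four points are made up, all helper names are hypothetical, and power iteration converges slowly when eigenvalues are close, so a library SVD or PCA routine would be preferable in practice.

```python
from math import sqrt

def center(rows):
    """Subtract each column's mean so the data are centered at the origin."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def matvec(matrix, v):
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

def top_eigenpair(matrix, n_iter=200):
    """Power iteration: dominant eigenvalue and a unit eigenvector."""
    start = [1.0 / (i + 1) for i in range(len(matrix))]  # arbitrary non-zero start
    scale = sqrt(sum(x * x for x in start))
    v = [x / scale for x in start]
    for _ in range(n_iter):
        w = matvec(matrix, v)
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero matrix: no direction carries variance
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(matrix, v)))
    return eigenvalue, v

def pca_2d(rows):
    """Return 2-D scores and the share of variance explained by each component."""
    centered = center(rows)
    cov = covariance(centered)
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    components, explained = [], []
    for _ in range(2):
        eigenvalue, v = top_eigenpair(cov)
        components.append(v)
        explained.append(eigenvalue / total_variance)
        # Deflation: remove the component just found before searching again.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    scores = [tuple(sum(x * c for x, c in zip(row, comp)) for comp in components)
              for row in centered]
    return scores, explained

# Toy points with more spread along the first axis; not SURVEY10 data.
toy = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
scores, explained = pca_2d(toy)

print(scores, explained)
```

The explained-variance shares are worth checking alongside any PCA plot: if the first two components capture only a small share of the total variance, apparent overlap between clusters in the plot may not reflect overlap in the full feature space.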
The first proposed marketing strategy is to target clusters 1 & 4, our Social Enthusiasts and our Expressive Alternatives. We will leverage their high sociability, confidence, and personal expression to maximize active app engagement and social interactions. An emphasis on self-expression, social visibility, and dynamic interactions will shape our marketing structure. We will promote features like exclusive social events, such as virtual or local meetups, or even competitions encouraging creativity and social networking.
App features that we will promote to tailor the user experience to clusters 1 & 4 include: expressive profiles, social games & interactive challenges, and enhanced visibility options. By emphasizing these features, we expect to see higher retention, increased participation, and more matches, ultimately driving strong community growth and brand loyalty.
The second marketing strategy targets the remaining clusters, 2 & 3. The Reserved Achievers and the Hermits will be provided a safe, structured, and supportive experience, meant to foster meaningful relationships among reserved or quieter students. We can highlight a safe, welcoming community on our platform that values compatibility and genuine connection over superficial intentions. We will offer introductory incentives, such as premium access trials, to build that initial comfort and confidence.
App features that will be tailored to these groups may include:
Guided Interactions: icebreakers, initial topic prompts, and personality-based matchmaking to reduce anxiety and promote comfort and confidence.
Privacy Assurance: enhanced privacy settings, like an anonymous browsing feature or discreet interaction modes, to create an approachable atmosphere.
Personalized Recommendations: algorithmically driven matches to gently encourage engagement based on the user’s interests and compatibility.
This clustering analysis has provided valuable insights into distinct segments within our student user base, and these clusters allow us to implement precise and effective targeting strategies. Through hierarchical and K-means clustering methods, we’ve identified four well-defined student clusters: the Social Enthusiasts, the Reserved Achievers, the Hermits, and the Expressive Alternatives. Our two marketing strategies are tailored to the interests of all four groups. Some “Next Steps” to ensure the continued success of our analysis include implementation and testing, careful monitoring, user feedback, and periodic reassessment. By proactively pursuing these steps, we can ensure that the insights from this analysis translate into real-world, long-term benefits.