Week 3 Data Dive: Group By and Probabilities

Introduction

This analysis uses the Social Media and Entertainment Dataset to explore user behavior through grouping and probability analysis. Key objectives include:

- Grouping data in three different ways to analyze trends and anomalies.
- Identifying the least probable groups (rarities) and understanding their significance.
- Investigating combinations of categorical variables and checking for missing combinations.
- Presenting findings through meaningful visualizations and actionable insights.

Grouping Data

To understand patterns, we will group the data using three different categorical columns and summarize key metrics.

1. Grouping by Primary Platform

This analysis groups users by their primary social media platform to examine platform popularity and average daily usage.

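library(dplyr)  # group_by(), summarize(), arrange(), count()

# `data` is assumed to be the dataset, already loaded as a data frame
platform_summary <- data %>%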
  group_by(`Primary Platform`) %>%
  summarize(
    Count = n(),
    Avg_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE)
  ) %>%
  arrange(Count)

# Display the summary
platform_summary
## # A tibble: 5 × 3
##   `Primary Platform` Count Avg_TimeSpent
##   <chr>              <int>         <dbl>
## 1 Instagram          59721          4.25
## 2 YouTube            59757          4.27
## 3 Facebook           59936          4.26
## 4 Twitter            60285          4.25
## 5 TikTok             60301          4.25

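Insight:
- User counts are nearly identical across platforms (59,721 to 60,301, each about 20% of the 300,000 users), so no platform stands out as dominant or niche.
- Average daily time is also flat at roughly 4.25 to 4.27 hours, so platform choice alone does not separate heavy users from light ones.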
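
The counts in the table above are all close to 60,000, so it is worth checking whether the differences are larger than chance alone would produce. The following is a minimal sketch, assuming only the five counts printed above (retyped by hand, not recomputed from `data`) and using base R's chisq.test() as a goodness-of-fit test against equal shares. The variable names are illustrative.

```r
# Counts copied from the platform_summary output above
platform_counts <- c(
  Instagram = 59721, YouTube = 59757, Facebook = 59936,
  Twitter = 60285, TikTok = 60301
)

# Share of all users on each platform
platform_share <- platform_counts / sum(platform_counts)
print(round(platform_share, 4))

# Goodness-of-fit test: are the counts compatible with equal shares (20% each)?
fit <- chisq.test(platform_counts)
print(fit)
```

For these counts the statistic is about 5.2 on 4 degrees of freedom, below the 5% critical value of 9.49, so the test gives no evidence that any platform is more popular than another.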

2. Grouping by Gender

Grouping users by gender compares group size, average age, and average daily usage across genders.

gender_summary <- data %>%
  group_by(Gender) %>%
  summarize(
    Count = n(),
    Avg_Age = mean(Age, na.rm = TRUE),
    Avg_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE)
  ) %>%
  arrange(Count)

# Display the summary
gender_summary
## # A tibble: 3 × 4
##   Gender  Count Avg_Age Avg_TimeSpent
##   <chr>   <int>   <dbl>         <dbl>
## 1 Female  99873    38.5          4.26
## 2 Male    99902    38.6          4.25
## 3 Other  100225    38.5          4.26

Insight:
- All gender groups show similar average usage times (about 4.25 hours) and similar average ages (about 38.5), indicating uniform engagement.
- The groups are almost equal in size (99,873 to 100,225), so no gender is meaningfully underrepresented in this dataset.

3. Grouping by Age Group

Users are grouped into age categories for analysis. This helps identify trends in platform usage across different life stages.

# cut() uses right-closed intervals by default: (0,18], (18,30], (30,50], (50,100]
data <- data %>%
  mutate(Age_Group = cut(
    Age,
    breaks = c(0, 18, 30, 50, 100),
    labels = c("0-18", "19-30", "31-50", "51+")
  ))

age_group_summary <- data %>%
  group_by(Age_Group) %>%
  summarize(
    Count = n(),
    Avg_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE)
  ) %>%
  arrange(Count)

# Display the summary
age_group_summary
## # A tibble: 4 × 3
##   Age_Group  Count Avg_TimeSpent
##   <fct>      <int>         <dbl>
## 1 0-18       34591          4.29
## 2 19-30      69105          4.25
## 3 51+        80844          4.26
## 4 31-50     115460          4.25

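Insight:
- The 31-50 group has the most users (115,460), but the bins are of unequal width (19-30 spans 12 years, 31-50 spans 20), so raw counts partly reflect bin width. Its average time (4.25 hrs) matches the other groups, so the larger count does not indicate higher engagement.
- The 0-18 group is the smallest (34,591) and has the highest average time by a small margin (4.29 hrs). Its low count may reflect parental restrictions, different platform preferences, or simply how few ages in the data fall inside this bin.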
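
Because the age-group counts depend on exactly where each boundary age lands, it helps to see how cut() treats the edges of the bins defined above. The sketch below reuses the same breaks and labels on a handful of hypothetical ages (invented for illustration, not taken from the dataset).

```r
# Same breaks and labels as in the Age_Group definition above
breaks <- c(0, 18, 30, 50, 100)
labels <- c("0-18", "19-30", "31-50", "51+")

# Hypothetical ages chosen to sit on or next to each boundary
test_ages <- c(0, 18, 19, 30, 31, 50, 51, 100)

# Default behavior: right-closed intervals (0,18], (18,30], (30,50], (50,100]
binned <- cut(test_ages, breaks = breaks, labels = labels)
print(data.frame(Age = test_ages, Age_Group = binned))

# An age of exactly 0 falls outside (0,18] and becomes NA;
# include.lowest = TRUE closes the first interval on the left as well
binned_closed <- cut(test_ages, breaks = breaks, labels = labels,
                     include.lowest = TRUE)
print(data.frame(Age = test_ages, Age_Group = binned_closed))
```

Ages 18, 30, and 50 land in the lower bin, which matches the labels. Any age above 100 would also become NA, so it is worth checking sum(is.na(data$Age_Group)) after binning.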

Probability Analysis

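If a user is drawn at random, the probability of landing in a given group is that group's count divided by the total number of users, so the groups with the smallest counts are the least probable. We identify the rarest group in each grouping: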

# Find lowest probability groups using `pull()` to extract column values
rare_platform <- platform_summary %>% filter(Count == min(Count)) %>% pull(`Primary Platform`)
rare_gender <- gender_summary %>% filter(Count == min(Count)) %>% pull(Gender)
rare_age_group <- age_group_summary %>% filter(Count == min(Count)) %>% pull(Age_Group)

list(
  "Rarest Platform" = rare_platform,
  "Rarest Gender" = rare_gender,
  "Rarest Age Group" = rare_age_group
)
## $`Rarest Platform`
## [1] "Instagram"
## 
## $`Rarest Gender`
## [1] "Female"
## 
## $`Rarest Age Group`
## [1] 0-18
## Levels: 0-18 19-30 31-50 51+

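Conclusion:
- The rarest groups are Instagram, Female, and 0-18, but only the 0-18 age group is rare in a practical sense. Instagram and Female trail the largest groups in their categories by less than 1%.
- The 0-18 segment is therefore the main candidate for specialized campaigns or for checking whether it is underrepresented in the data.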
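
The rarest groups above are identified by count only. To express them as probabilities, each count can be divided by the total. This is a minimal sketch in base R, assuming the counts printed in the three summary tables (retyped by hand); to_prob() and rarest() are hypothetical helper names introduced here.

```r
# Counts copied from the three summary tables above
platform_counts <- c(Instagram = 59721, YouTube = 59757, Facebook = 59936,
                     Twitter = 60285, TikTok = 60301)
gender_counts <- c(Female = 99873, Male = 99902, Other = 100225)
age_counts <- c("0-18" = 34591, "19-30" = 69105, "51+" = 80844, "31-50" = 115460)

# Probability that a randomly drawn user falls in each group
to_prob <- function(counts) counts / sum(counts)

# Smallest probability in a grouping, returned with the group's name
rarest <- function(counts) {
  p <- to_prob(counts)
  p[which.min(p)]
}

rare_probs <- c(rarest(platform_counts), rarest(gender_counts), rarest(age_counts))
print(round(rare_probs, 4))
```

Instagram (about 0.199) and Female (about 0.333) sit just under the equal-share baselines of 0.20 and 0.333, while 0-18 (about 0.115) is well below the 0.25 that four equally sized age groups would give.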

Combining Two Categorical Variables

We will examine combinations of Primary Platform and Gender to check whether any combination is missing from the data.

library(tidyr)  # complete()

category_combinations <- data %>%
  count(`Primary Platform`, Gender) %>%
  complete(`Primary Platform`, Gender, fill = list(n = 0)) %>%
  arrange(n)

# Display unique category combinations
category_combinations
## # A tibble: 15 × 3
##    `Primary Platform` Gender     n
##    <chr>              <chr>  <int>
##  1 YouTube            Female 19705
##  2 Facebook           Male   19789
##  3 Instagram          Male   19829
##  4 Instagram          Female 19915
##  5 Instagram          Other  19977
##  6 YouTube            Other  20002
##  7 YouTube            Male   20050
##  8 TikTok             Other  20051
##  9 Twitter            Female 20064
## 10 Facebook           Other  20065
## 11 Facebook           Female 20082
## 12 Twitter            Male   20091
## 13 TikTok             Female 20107
## 14 Twitter            Other  20130
## 15 TikTok             Male   20143

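Insight:
- All 15 platform-gender combinations are present, with counts between 19,705 and 20,143, so there are no missing combinations. Had any been absent, complete() would have listed it with n = 0.
- The rarest combination (YouTube and Female) is only about 2% smaller than the most common one (TikTok and Male), so the data show no obvious gaps in collection or strong platform preferences by gender.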
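
To double-check the combination table above for gaps, the 15 counts can be placed in a matrix and compared with what independence between platform and gender would predict. This is a sketch in base R, assuming the counts as printed above (retyped by hand); the object names are illustrative.

```r
# Counts copied from the category_combinations output above
combo <- matrix(
  c(19915, 19829, 19977,   # Instagram: Female, Male, Other
    19705, 20050, 20002,   # YouTube
    20082, 19789, 20065,   # Facebook
    20064, 20091, 20130,   # Twitter
    20107, 20143, 20051),  # TikTok
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Platform = c("Instagram", "YouTube", "Facebook", "Twitter", "TikTok"),
    Gender = c("Female", "Male", "Other")
  )
)

# A missing combination would show up as a zero cell
any_missing <- any(combo == 0)

# Counts expected if platform and gender were independent
expected <- outer(rowSums(combo), colSums(combo)) / sum(combo)

# Largest relative gap between observed and expected counts
max_rel_gap <- max(abs(combo - expected) / expected)

print(any_missing)
print(round(max_rel_gap, 4))
```

The row and column totals reproduce the platform and gender summaries, no cell is zero, and every cell is within about 1% of its expected value under independence.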

Visualizations

Each of the three groupings above has an accompanying visualization.

1. Platform Popularity

# Add percentage and sort by average time spent
platform_summary <- platform_summary %>%
  mutate(
    Percentage = Count / sum(Count) * 100  # Normalize counts
  ) %>%
  arrange(desc(Avg_TimeSpent))  # Sort by average time spent

library(ggplot2)

# Bars show average time (left axis); blue points show user counts (right axis)
ggplot(platform_summary, aes(x = reorder(`Primary Platform`, -Avg_TimeSpent))) +
  geom_col(aes(y = Avg_TimeSpent, fill = `Primary Platform`), show.legend = FALSE) +
  # Two decimals: rounding to one would blur differences this small
  geom_text(aes(y = Avg_TimeSpent, label = paste0(round(Avg_TimeSpent, 2), " hrs")), vjust = -0.5, size = 3) +
  geom_point(aes(y = Count / 10000), color = "blue", size = 3) +  # Counts scaled down to fit the hours axis
  scale_y_continuous(
    name = "Average Time Spent per User (hrs)",
    sec.axis = sec_axis(~ . * 10000, name = "User Count")  # Undo the scaling for the right-hand labels
  ) +
  labs(
    title = "Platform Engagement: Average Time Spent vs User Count",
    x = "Primary Platform"
  ) +
  theme_minimal()
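
The plot above simulates a second axis by dividing Count by 10000 before plotting and multiplying by 10000 again inside sec_axis(). The two transforms must be exact inverses, or the right-hand labels will not match the points. The sketch below checks this with plain arithmetic, using values retyped from the platform summary; scale_factor and auto_factor are illustrative names, and the data-driven factor is a suggestion rather than part of the original plot.

```r
# Values copied from the platform_summary output above
avg_time <- c(4.25, 4.27, 4.26, 4.25, 4.25)
count <- c(59721, 59757, 59936, 60285, 60301)

# Hard-coded factor used in the plot above
scale_factor <- 10000

# Forward transform: where each count is drawn on the primary (hours) axis
count_on_primary <- count / scale_factor

# Inverse transform: what sec_axis(~ . * scale_factor) shows on the right axis
count_recovered <- count_on_primary * scale_factor

# Data-driven alternative: place the largest count at the height of the tallest bar
auto_factor <- max(count) / max(avg_time)

print(range(count_on_primary))
print(round(auto_factor))
```

With a factor of 10000 the points sit at about 6 on the hours axis, above every bar. Deriving the factor from the data keeps the overlay in range if the counts change.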

2. Gender and Social Media Time

# Normalize counts for gender
gender_summary <- gender_summary %>%
  mutate(Percentage = Count / sum(Count) * 100)

# Bars show average time, shaded by share of users; blue points show user counts (right axis)
ggplot(gender_summary, aes(x = reorder(Gender, -Avg_TimeSpent), y = Avg_TimeSpent, fill = Percentage)) +
  geom_col() +
  # Two decimals: rounding to one would blur differences this small
  geom_text(aes(label = paste0(round(Avg_TimeSpent, 2), " hrs")), vjust = -0.5, size = 3) +
  geom_point(aes(y = Count / 10000), color = "blue", size = 3) +  # Counts scaled down to fit the hours axis
  scale_y_continuous(
    name = "Average Time Spent per User (hrs)",
    sec.axis = sec_axis(~ . * 10000, name = "User Count")  # Undo the scaling for the right-hand labels
  ) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "User %") +
  labs(
    title = "Gender Engagement: Avg Time Spent vs User Count",
    x = "Gender"
  ) +
  theme_minimal()

3. Age Group Distribution

ggplot(age_group_summary, aes(x = Age_Group, y = Count, fill = Age_Group)) +
  geom_bar(stat = "identity") +
  labs(
    title = "User Count by Age Group",
    x = "Age Group",
    y = "User Count"
  ) +
  theme_minimal()

Final Insights and Next Steps

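Platform Preferences: No platform dominates. All five hold roughly 20% of users, and Instagram, the smallest, trails the largest by less than 1%.
Uniform Engagement: Gender and age groups show similar engagement patterns overall (about 4.25 to 4.29 hours per day).
Hypothesis: Rare groups, where they exist, may result from demographic shifts, platform preferences, or data limitations.
- Underrepresented Groups: The 0-18 age group is the only clearly small group. Its size may reflect unequal bin widths, demographic shifts, or data limitations.
- Platform Preferences by Age: Older groups may favor broader platforms like Facebook, but this analysis did not group platform by age, so the idea remains untested.
Next Steps:
- Cross-tabulate Primary Platform with Age_Group to test the age hypothesis, since platform and gender showed no missing combinations.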
- Explore other demographic factors affecting platform choice.
- Look at time trends in platform engagement.