This analysis uses the Social Media and Entertainment Dataset to explore user behavior through grouping and probability analysis.
To find patterns, we group the data by three categorical columns (primary platform, gender, and age group) and summarize key metrics for each.
This analysis groups users by their primary social media platform to examine platform popularity and average daily usage.
platform_summary <- data %>%
  group_by(`Primary Platform`) %>%
  summarize(
    Count = n(),
    Avg_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE)
  ) %>%
  arrange(Count)
# Display the summary
platform_summary
## # A tibble: 5 × 3
##   `Primary Platform` Count Avg_TimeSpent
##   <chr>              <int>         <dbl>
## 1 Instagram          59721          4.25
## 2 YouTube            59757          4.27
## 3 Facebook           59936          4.26
## 4 Twitter            60285          4.25
## 5 TikTok             60301          4.25
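Insight:
- User counts are nearly uniform across platforms: each holds about 20%
of the 300,000 users, from Instagram (59,721) to TikTok (60,301), a gap
of under 1%.
- Average daily time is also nearly identical (4.25 to 4.27 hrs), so this
summary gives no evidence that any platform is markedly more popular or
serves a niche audience.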
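To make the comparison concrete, the counts can be converted into shares of the total. This is a minimal sketch in base R; the counts are copied by hand from the printed summary above, and the names platform_counts, platform_share, and relative_gap are only illustrative.

```r
# Counts copied from the platform summary output (not recomputed from `data`)
platform_counts <- c(Instagram = 59721, YouTube = 59757, Facebook = 59936,
                     Twitter = 60285, TikTok = 60301)

# Share of all users on each platform, in percent
platform_share <- platform_counts / sum(platform_counts) * 100
round(platform_share, 2)

# Relative gap between the largest and smallest platform, in percent
relative_gap <- (max(platform_counts) - min(platform_counts)) /
  min(platform_counts) * 100
round(relative_gap, 2)
```

If all five platforms were equally popular, each share would be exactly 20%, so the distance of each value from 20 shows how far the data departs from a uniform split.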
Grouping users by gender lets us compare group sizes, average age, and average daily usage.
gender_summary <- data %>%
  group_by(Gender) %>%
  summarize(
    Count = n(),
    Avg_Age = mean(Age, na.rm = TRUE),
    Avg_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE)
  ) %>%
  arrange(Count)
# Display the summary
gender_summary
## # A tibble: 3 × 4
##   Gender  Count Avg_Age Avg_TimeSpent
##   <chr>   <int>   <dbl>         <dbl>
## 1 Female  99873    38.5          4.26
## 2 Male    99902    38.6          4.25
## 3 Other  100225    38.5          4.26
Insight:
- All gender groups show nearly identical average usage (about 4.25 hrs
per day) and average age (about 38.5 years).
- Group sizes are almost equal (99,873 to 100,225), so no gender is
meaningfully underrepresented in this dataset.
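One way to check whether the differences in group size exceed what random sampling alone would produce is a chi-squared goodness-of-fit test against equal proportions. The sketch below is an addition, not part of the original analysis. It uses base R's chisq.test() with counts copied from the summary above and assumes equal expected shares for the three groups.

```r
# Counts copied from the gender summary output
gender_counts <- c(Female = 99873, Male = 99902, Other = 100225)

# Goodness-of-fit test: for a plain count vector, chisq.test() compares
# the counts against equal probabilities for every category by default
gof <- chisq.test(gender_counts)

gof$statistic  # chi-squared statistic
gof$parameter  # degrees of freedom (number of groups minus one)
gof$p.value    # a large p-value is consistent with equal group shares
```

The same call works for the platform counts, where the expected share of each of the five groups would be one fifth.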
Users are grouped into age categories for analysis. This helps identify trends in platform usage across different life stages.
# cut() builds right-closed bins: (0,18], (18,30], (30,50], (50,100]; other ages become NA
data <- data %>%
  mutate(Age_Group = cut(Age, breaks = c(0, 18, 30, 50, 100),
                         labels = c("0-18", "19-30", "31-50", "51+")))
age_group_summary <- data %>%
  group_by(Age_Group) %>%
  summarize(
    Count = n(),
    Avg_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE)
  ) %>%
  arrange(Count)
# Display the summary
age_group_summary
## # A tibble: 4 × 3
##   Age_Group  Count Avg_TimeSpent
##   <fct>      <int>         <dbl>
## 1 0-18       34591          4.29
## 2 19-30      69105          4.25
## 3 51+        80844          4.26
## 4 31-50     115460          4.25
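Insight:
- The 31-50 group has the most users (115,460), but it is also the widest
bin (20 years versus 12 for 19-30), so the raw count alone does not show
higher engagement. Average daily time is about the same in every group
(4.25 to 4.29 hrs).
- The 0-18 group has the fewest users (34,591), possibly reflecting a
narrow range of ages present in the data, parental restrictions, or
different platform preferences.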
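Because the bins span different numbers of years, raw counts are easier to compare after dividing by bin width. The following is a minimal sketch for the two bins whose widths are fully determined by the breaks (19-30 covers 12 integer ages, 31-50 covers 20). The outer bins are left out because their effective widths depend on the minimum and maximum ages in the data, which the summary does not show. It assumes Age is recorded in whole years, and the variable names are illustrative.

```r
# Counts copied from the age group summary output
bin_counts <- c("19-30" = 69105, "31-50" = 115460)

# Number of integer ages in each bin: 19..30 and 31..50
bin_widths <- c("19-30" = 12, "31-50" = 20)

# Average number of users per year of age within each bin
users_per_year <- bin_counts / bin_widths
round(users_per_year)
```

Similar per-year values would indicate that the larger count of the 31-50 group comes from the wider bin rather than from age-specific behavior.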
The groups with the smallest user counts have the lowest probabilities of appearing in a random selection. We identify the rarest groups:
# Find lowest probability groups using `pull()` to extract column values
rare_platform <- platform_summary %>% filter(Count == min(Count)) %>% pull(`Primary Platform`)
rare_gender <- gender_summary %>% filter(Count == min(Count)) %>% pull(Gender)
rare_age_group <- age_group_summary %>% filter(Count == min(Count)) %>% pull(Age_Group)
list(
  "Rarest Platform" = rare_platform,
  "Rarest Gender" = rare_gender,
  "Rarest Age Group" = rare_age_group
)
## $`Rarest Platform`
## [1] "Instagram"
##
## $`Rarest Gender`
## [1] "Female"
##
## $`Rarest Age Group`
## [1] 0-18
## Levels: 0-18 19-30 31-50 51+
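Conclusion:
- The rarest groups are Instagram users, Female users, and the 0-18 age
group, but for platform and gender the margin over the next-smallest
group is tiny (36 and 29 users, respectively).
- Only the 0-18 age group is clearly underrepresented, which makes it the
segment most worth further investigation or a specialized campaign.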
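The probability of drawing a member of each rarest group is the group count divided by the total number of users. A small base R sketch, with counts copied from the three summaries above (the names rare_counts, total_users, and rare_prob are illustrative):

```r
# Count of the rarest group in each grouping, copied from the summaries
rare_counts <- c(Instagram = 59721, Female = 99873, "0-18" = 34591)

# Total users; the counts in each of the three summaries add up to this value
total_users <- 300000

# Probability that one randomly selected user belongs to each group
rare_prob <- rare_counts / total_users
round(rare_prob, 4)
```

For comparison, perfectly equal groups would give 0.2 for each of the five platforms and about 0.333 for each of the three genders.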
We examine combinations of Primary Platform and Gender to find any pairs that never occur in the data.
category_combinations <- data %>%
  count(`Primary Platform`, Gender) %>%
  complete(`Primary Platform`, Gender, fill = list(n = 0)) %>%
  arrange(n)
# Display unique category combinations
category_combinations
## # A tibble: 15 × 3
##    `Primary Platform` Gender     n
##    <chr>              <chr>  <int>
##  1 YouTube            Female 19705
##  2 Facebook           Male   19789
##  3 Instagram          Male   19829
##  4 Instagram          Female 19915
##  5 Instagram          Other  19977
##  6 YouTube            Other  20002
##  7 YouTube            Male   20050
##  8 TikTok             Other  20051
##  9 Twitter            Female 20064
## 10 Facebook           Other  20065
## 11 Facebook           Female 20082
## 12 Twitter            Male   20091
## 13 TikTok             Female 20107
## 14 Twitter            Other  20130
## 15 TikTok             Male   20143
Insight:
- No combination is missing: all 15 platform-gender pairs appear, with
counts between 19,705 (YouTube, Female) and 20,143 (TikTok, Male), so
complete() added no zero-count rows.
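To show what a missing combination would look like, the sketch below applies the same count() and complete() steps to a small made-up table in which one pair never occurs. The toy data and its column names are hypothetical, and the example assumes the dplyr and tidyr packages are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: no row pairs platform "Twitter" with gender "Male"
toy <- tibble(
  platform = c("TikTok", "TikTok", "Twitter"),
  gender   = c("Female", "Male", "Female")
)

toy_combinations <- toy %>%
  count(platform, gender) %>%                          # observed pairs only
  complete(platform, gender, fill = list(n = 0L)) %>%  # add unobserved pairs with n = 0
  arrange(n)

# Pairs that never occur in the toy data
missing_pairs <- toy_combinations %>% filter(n == 0)
missing_pairs
```

Filtering on n == 0 after complete() is a direct way to list absent pairs instead of scanning the sorted table by eye.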
The platform and age group summaries are visualized below.
# Add percentage and sort by average time spent
platform_summary <- platform_summary %>%
  mutate(
    Percentage = Count / sum(Count) * 100  # Share of all users, in percent
  ) %>%
  arrange(desc(Avg_TimeSpent))  # Sort by average time spent

# Visualize both metrics: bars show Avg_TimeSpent, blue points show Count
ggplot(platform_summary, aes(x = reorder(`Primary Platform`, -Avg_TimeSpent))) +
  geom_bar(aes(y = Avg_TimeSpent, fill = `Primary Platform`), stat = "identity", show.legend = FALSE) +
  # Two decimals, because the averages differ only in the second decimal place
  geom_text(aes(y = Avg_TimeSpent, label = paste0(round(Avg_TimeSpent, 2), " hrs")), vjust = -0.5, size = 3) +
  # Overlay user counts, divided by 10000 to fit the primary axis
  geom_point(aes(y = Count / 10000), color = "blue", size = 3) +
  scale_y_continuous(
    name = "Average Time Spent per User (hrs)",
    sec.axis = sec_axis(~ . * 10000, name = "User Count")  # Undo the scaling on the secondary axis
  ) +
  labs(
    title = "Platform Engagement: Average Time Spent vs User Count",
    x = "Primary Platform"
  ) +
  theme_minimal()
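The factor 10000 in the dual-axis plot is hard-coded to bring counts near 60,000 onto the same scale as times near 4 hours. One alternative, sketched here as an assumption rather than the author's method, is to derive the factor from the data so the overlay still fits if the counts change. The data frame plot_data is rebuilt by hand from the printed platform summary, its column names are illustrative, and the example assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Values copied from the platform summary output
plot_data <- data.frame(
  platform = c("Instagram", "YouTube", "Facebook", "Twitter", "TikTok"),
  count    = c(59721, 59757, 59936, 60285, 60301),
  avg_time = c(4.25, 4.27, 4.26, 4.25, 4.25)
)

# Ratio that maps the largest count onto the largest average time
scale_factor <- max(plot_data$count) / max(plot_data$avg_time)

p <- ggplot(plot_data, aes(x = platform)) +
  geom_col(aes(y = avg_time)) +
  # Counts are divided by the factor to share the primary axis
  geom_point(aes(y = count / scale_factor), color = "blue", size = 3) +
  scale_y_continuous(
    name = "Average Time Spent per User (hrs)",
    # The secondary axis multiplies by the same factor to show real counts
    sec.axis = sec_axis(~ . * scale_factor, name = "User Count")
  ) +
  theme_minimal()
p
```

With this choice the highest point sits level with the tallest bar. A different ratio, for example one based on the means, would work equally well as long as the same factor is used in both places.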
ggplot(age_group_summary, aes(x = Age_Group, y = Count, fill = Age_Group)) +
  geom_bar(stat = "identity") +
  labs(
    title = "User Count by Age Group",
    x = "Age Group",
    y = "User Count"
  ) +
  theme_minimal()