This analysis uses the Social Media and Entertainment
Dataset to explore how random sampling affects the
conclusions drawn from data. We aim to:
- Generate 5 random samples (with replacement) to
simulate real-world data collection.
- Analyze differences and consistencies between
samples.
- Identify anomalies and trends in categorical and
numeric data.
- Understand how sampling impacts conclusions for
future analyses.
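We create 5 random samples, each with half as many rows as the dataset. Because rows are drawn with replacement, a sample can contain the same row more than once, so it covers less than 50% of the distinct rows.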
set.seed(123) # Ensures reproducibility
# Create 5 samples (with replacement)
samples <- map(1:5, ~ data %>% sample_frac(size = 0.5, replace = TRUE))
# Store each sample in a separate dataframe
df_1 <- samples[[1]]
df_2 <- samples[[2]]
df_3 <- samples[[3]]
df_4 <- samples[[4]]
df_5 <- samples[[5]]
# Check sample sizes
sapply(samples, nrow)
## [1] 150000 150000 150000 150000 150000
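Because the samples are drawn with replacement, a sample of 150,000 rows contains fewer than 150,000 distinct original rows. A minimal sketch on a hypothetical toy index (the size `n` and all variable names here are illustrative assumptions, not part of the analysis) shows one way to check this:

```r
# Toy illustration (hypothetical sizes): how many distinct rows does a
# half-size sample drawn with replacement actually contain?
set.seed(123)
n <- 10000                                          # assumed toy dataset size
idx <- sample.int(n, size = n / 2, replace = TRUE)  # row indices, with replacement

distinct_share <- length(unique(idx)) / n   # share of original rows seen at least once
expected_share <- 1 - (1 - 1 / n)^(n / 2)   # theoretical expectation, close to 1 - exp(-0.5)

round(c(observed = distinct_share, expected = expected_share), 3)
```

Running the same check on the row indices of `samples[[1]]` would require keeping a row identifier in `data`, which the code above does not assume exists.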
For each sample, we calculate the mean age, the mean daily time spent on social media, and the most common primary platform.
sample_summaries <- map(samples, ~ .x %>%
  summarize(
    Avg_Age = mean(Age, na.rm = TRUE),
    Avg_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE),
    Platform_Mode = names(sort(table(`Primary Platform`), decreasing = TRUE))[1]
  ))
# Combine results
sample_summaries_df <- bind_rows(sample_summaries, .id = "Sample_ID")
# Display
sample_summaries_df
## # A tibble: 5 × 4
##   Sample_ID Avg_Age Avg_TimeSpent Platform_Mode
##   <chr>       <dbl>         <dbl> <chr>
## 1 1            38.5          4.26 TikTok
## 2 2            38.5          4.26 Facebook
## 3 3            38.5          4.26 Twitter
## 4 4            38.5          4.25 TikTok
## 5 5            38.6          4.26 Twitter
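The most common platform differs between samples even though the means barely move. One way to probe this is to compare each platform's share rather than only the top category. The following is a minimal sketch on hypothetical toy data: the `toy` data frame, its three equally likely platforms, and the use of `slice_sample()` (the current dplyr replacement for `sample_frac()`) are assumptions made for illustration.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data: three platforms with nearly equal shares
set.seed(1)
toy <- data.frame(
  `Primary Platform` = sample(c("TikTok", "Facebook", "Twitter"),
                              size = 30000, replace = TRUE),
  check.names = FALSE  # keep the space in the column name
)

# Draw 5 half-size samples with replacement, mirroring the analysis
toy_samples <- map(1:5, ~ slice_sample(toy, prop = 0.5, replace = TRUE))

# Share of each platform per sample instead of only the mode
platform_shares <- toy_samples %>%
  map(~ count(.x, `Primary Platform`) %>% mutate(Share = n / sum(n))) %>%
  bind_rows(.id = "Sample_ID")

platform_shares
```

When the shares sit within a fraction of a percentage point of each other, which category comes out on top is largely a matter of sampling noise.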
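Insight: Avg_Age (38.5 to 38.6) and Avg_TimeSpent (4.25 to 4.26 hrs) are nearly identical across the 5 samples, but Platform_Mode changes between TikTok, Facebook, and Twitter. This suggests the platforms have nearly equal shares, so the mode is sensitive to sampling noise while the means are not.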
We flag users with extreme social media usage in each sample: those whose daily time exceeds that sample's own 95th percentile (roughly the top 5%).
anomalies <- map(samples, ~ .x %>%
  filter(`Daily Social Media Time (hrs)` >
           quantile(`Daily Social Media Time (hrs)`, 0.95, na.rm = TRUE)))
# Display anomalies count
anomalies_count <- sapply(anomalies, nrow)
anomalies_count
## [1] 7487 7483 7369 7415 7307
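To put these counts in context, they can be expressed as a share of each sample. The sketch below copies the counts and the sample size (150,000) from the output shown earlier; the variable names `sample_size` and `anomaly_share` are introduced here for illustration.

```r
# Counts copied from the output above; each sample has 150,000 rows
anomalies_count <- c(7487, 7483, 7369, 7415, 7307)
sample_size <- 150000

# Percentage of each sample flagged as extreme. The filter uses a strict ">",
# so rows equal to the 95th-percentile cutoff are not counted.
anomaly_share <- anomalies_count / sample_size
round(100 * anomaly_share, 2)
## [1] 4.99 4.99 4.91 4.94 4.87
```

Every share falls slightly under 5%, which would be consistent with some values tying with the cutoff, although the data shown here does not confirm that.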
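Insight: Each sample flags between 7,307 and 7,487 users, slightly fewer than 7,500 (5% of 150,000). The counts are close but not identical because the 95th-percentile cutoff is recomputed within each sample.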
We check whether the same platform-gender combinations appear in every sample.
# Count occurrences of (Primary Platform, Gender) in each sample
platform_gender_counts <- map(samples, ~ .x %>%
  count(`Primary Platform`, Gender) %>%
  rename(Count = n))
# Identify common platform-gender pairs appearing in all 5 samples
common_platform_gender <- Reduce(
  function(x, y) inner_join(x, y, by = c("Primary Platform", "Gender")),
  platform_gender_counts
)
# Display how many common platform-gender pairs exist
nrow(common_platform_gender)
## [1] 15
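Counting the pairs that survive the inner join confirms presence but not stability. To compare how large each pair is across samples, the shares can be placed side by side. This is a minimal sketch on hypothetical toy data: the `toy` data frame, its category labels, and the `Max_Diff` column are assumptions for illustration, and `pivot_wider()` comes from tidyr.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical toy data with 3 platforms x 3 genders
set.seed(7)
toy <- data.frame(
  `Primary Platform` = sample(c("TikTok", "Facebook", "Twitter"), 6000, replace = TRUE),
  Gender = sample(c("Female", "Male", "Other"), 6000, replace = TRUE),
  check.names = FALSE  # keep the space in the column name
)
toy_samples <- map(1:5, ~ slice_sample(toy, prop = 0.5, replace = TRUE))

# One row per pair, one column of within-sample shares per sample
pair_shares <- toy_samples %>%
  map(~ count(.x, `Primary Platform`, Gender) %>% mutate(Share = n / sum(n))) %>%
  bind_rows(.id = "Sample_ID") %>%
  select(-n) %>%
  pivot_wider(names_from = Sample_ID, values_from = Share, names_prefix = "Sample_")

# Largest gap between any two samples for each pair
pair_shares %>%
  mutate(Max_Diff = pmax(Sample_1, Sample_2, Sample_3, Sample_4, Sample_5) -
                    pmin(Sample_1, Sample_2, Sample_3, Sample_4, Sample_5))
```

A small `Max_Diff` for every pair would indicate that the joint distribution is stable across samples, which is a stronger statement than the pair merely being present.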
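Insight: 15 platform-gender pairs appear in all 5 samples. The inner join only confirms that each pair is present in every sample; it does not show whether the counts are similar across samples.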
We draw 1,000 repeated half-size samples (with replacement) to approximate the sampling distribution of the mean daily social media time.
set.seed(42)
mc_samples <- replicate(1000, mean(
  sample(data$`Daily Social Media Time (hrs)`, size = 0.5 * nrow(data), replace = TRUE),
  na.rm = TRUE
))
# Plot the distribution
ggplot(data.frame(mc_samples), aes(x = mc_samples)) +
  geom_histogram(fill = "lightgreen", bins = 30, alpha = 0.7) +
  labs(title = "Monte Carlo Simulation: Average Time Spent",
       x = "Estimated Mean (hrs)",
       y = "Frequency") +
  theme_minimal()
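Beyond the histogram, the simulated means can be summarized numerically and compared with the textbook standard error of a mean, sd(x) / sqrt(m). The sketch below uses a hypothetical toy vector `hours` in place of the real column; its uniform distribution and size are assumptions made only so the example runs on its own.

```r
# Hypothetical toy vector standing in for the daily-hours column
set.seed(42)
hours <- runif(20000, min = 0, max = 8)
m <- length(hours) / 2   # half-size resamples, as in the analysis

mc_means <- replicate(1000, mean(sample(hours, size = m, replace = TRUE)))

# Spread of the resampled means vs. the theoretical standard error
c(mc_mean = mean(mc_means),
  mc_sd   = sd(mc_means),
  theory  = sd(hours) / sqrt(m))

# Central 95% of the resampled means (a percentile interval)
quantile(mc_means, c(0.025, 0.975))
```

If the simulated spread and the theoretical value agree closely, the simulation is behaving as the central limit theorem predicts, and the percentile interval gives a rough sense of how far a single half-size sample mean might fall from the full-data mean.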
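Insight: With 1,000 resamples of 150,000 rows each, the resampled means should cluster tightly around the full-dataset mean in a roughly bell-shaped histogram, as the central limit theorem predicts.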