Week 4 | Data Dive — Sampling and Drawing Conclusions

Introduction

This analysis explores how random sampling affects conclusions using the Social Media and Entertainment Dataset. We aim to:
- Generate 5 random samples (with replacement) to simulate real-world data collection.
- Analyze differences and consistencies between samples.
- Identify anomalies and trends in categorical and numeric data.
- Understand how sampling impacts conclusions for future analyses.

Sampling Data

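We draw 5 random samples with replacement. Each sample has as many rows as 50% of the dataset; because a row can be drawn more than once, a sample contains fewer distinct rows than that.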

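library(dplyr)    # sample_frac(), summarize(), count(), joins
library(purrr)    # map()
library(ggplot2)  # histogram in the Monte Carlo section

# Assumes the dataset is already loaded as `data`
set.seed(123)  # Ensures reproducibility

# Create 5 samples (with replacement)
samples <- map(1:5, ~ data %>% sample_frac(size = 0.5, replace = TRUE))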

# Store each sample in a separate dataframe
df_1 <- samples[[1]]
df_2 <- samples[[2]]
df_3 <- samples[[3]]
df_4 <- samples[[4]]
df_5 <- samples[[5]]

# Check sample sizes
sapply(samples, nrow)
## [1] 150000 150000 150000 150000 150000
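
Because the samples are drawn with replacement, the number of rows in a sample is not the same as the number of distinct rows it covers. The sketch below illustrates this on a hypothetical toy table (`toy` and its `id` column are assumptions, not part of the dataset); for a half-size sample, the expected share of distinct rows is roughly 1 - exp(-0.5).

```r
library(dplyr)

set.seed(1)

# Hypothetical toy data: 10,000 rows, each with a unique id
toy <- tibble(id = 1:10000)

# Same call pattern as above: half-size sample drawn with replacement
toy_sample <- toy %>% sample_frac(size = 0.5, replace = TRUE)

n_rows   <- nrow(toy_sample)            # rows in the sample
n_unique <- n_distinct(toy_sample$id)   # distinct original rows it contains
coverage <- n_unique / nrow(toy)        # share of the toy data actually seen

coverage
1 - exp(-0.5)  # large-n approximation of the expected coverage
```

If every row should appear at most once per sample, `replace = FALSE` would give exactly 50% coverage instead.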

Sample Statistics

We calculate the average age, the average daily time spent on social media, and the most common platform for each sample.

sample_summaries <- map(samples, ~ .x %>%
  summarize(
    Avg_Age = mean(Age, na.rm = TRUE),
    Avg_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE),
    Platform_Mode = names(sort(table(`Primary Platform`), decreasing = TRUE))[1]
  ))

# Combine results
sample_summaries_df <- bind_rows(sample_summaries, .id = "Sample_ID")

# Display
sample_summaries_df
## # A tibble: 5 × 4
##   Sample_ID Avg_Age Avg_TimeSpent Platform_Mode
##   <chr>       <dbl>         <dbl> <chr>        
## 1 1            38.5          4.26 TikTok       
## 2 2            38.5          4.26 Facebook     
## 3 3            38.5          4.26 Twitter      
## 4 4            38.5          4.25 TikTok       
## 5 5            38.6          4.26 Twitter

Insight:

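The most common platform changes from sample to sample (TikTok, Facebook, Twitter). With 150,000 rows per sample, this most likely means the platforms have nearly equal counts, so the mode flips on small sampling differences; it is not by itself evidence of changing engagement.
Avg. social media time is stable (4.25 to 4.26 hrs), so this estimate is insensitive to which sample is drawn.
Avg. age varies by at most 0.1 years (38.5 to 38.6), a difference too small to affect conclusions about media consumption.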
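
One way to see why a mode can flip between samples is to look at the full share table rather than only the top category. The sketch below uses a hypothetical population of four equally likely platforms (the platform list and sizes are assumptions for illustration); it is not a reproduction of the dataset.

```r
set.seed(1)

# Hypothetical population: four platforms with nearly equal shares
platforms  <- c("Facebook", "Instagram", "TikTok", "Twitter")
population <- sample(platforms, size = 100000, replace = TRUE)

# Draw 5 half-size samples with replacement and tabulate shares in each
share_table <- vapply(1:5, function(i) {
  s <- sample(population, size = length(population) / 2, replace = TRUE)
  as.vector(prop.table(table(factor(s, levels = platforms))))
}, numeric(length(platforms)))
rownames(share_table) <- platforms
colnames(share_table) <- paste0("Sample_", 1:5)

round(share_table, 3)                         # share of each platform per sample

# Modal platform per sample, and how far each platform's share moves
modes  <- platforms[apply(share_table, 2, which.max)]
spread <- apply(share_table, 1, function(p) max(p) - min(p))
modes
spread
```

When the spread of each share is small but the shares sit close together, the modal platform can differ between samples even though the underlying distribution has not changed.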

Detecting Outliers

We identify users with extreme social media usage (top 5%) in each sample.

anomalies <- map(samples, ~ .x %>%
  filter(`Daily Social Media Time (hrs)` > quantile(`Daily Social Media Time (hrs)`, 0.95, na.rm = TRUE)))

# Display anomalies count
anomalies_count <- sapply(anomalies, nrow)
anomalies_count
## [1] 7487 7483 7369 7415 7307

Insight:

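The number of flagged users varies only slightly across samples (~7,300–7,500). By construction it stays near 5% of 150,000 rows (7,500); the counts most likely fall just below that because the filter uses a strict `>` and values tied with the cutoff are excluded.
A user flagged in one sample may not be flagged in another, because each sample has its own cutoff and its own set of rows. "Top 5%" is therefore a relative label, not a fixed property of the user.
This matters when identifying high-usage users for marketing or intervention studies, where a fixed, pre-specified threshold gives more consistent classifications.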
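
As a rough illustration of the difference between a per-sample cutoff and a fixed one, the sketch below computes the 95th percentile once on a hypothetical usage vector and applies it to every sample. The simulated `usage` values and sample sizes are assumptions, chosen only to make the example self-contained.

```r
set.seed(1)

# Hypothetical usage data, rounded to 2 decimals so that ties can occur
usage <- round(runif(20000, min = 0.5, max = 8), 2)

# One cutoff computed from the full data
cutoff <- quantile(usage, 0.95, na.rm = TRUE, names = FALSE)

# Five half-size samples drawn with replacement
samples_toy <- lapply(1:5, function(i) sample(usage, size = 10000, replace = TRUE))

# Per-sample cutoffs shift a little from sample to sample
per_sample_cutoffs <- sapply(samples_toy, quantile, probs = 0.95, names = FALSE)

# Share of each sample flagged under the single fixed cutoff
flag_share <- sapply(samples_toy, function(s) mean(s > cutoff))

per_sample_cutoffs
flag_share
```

With a fixed cutoff, whether a given usage value is flagged no longer depends on which sample it landed in; only the flagged share varies.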

Consistency Across Samples

We check whether the same platform-gender combinations appear in every sample.

# Count occurrences of (Primary Platform, Gender) in each sample
platform_gender_counts <- map(samples, ~ .x %>%
  count(`Primary Platform`, Gender) %>%
  rename(Count = n))

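# Identify platform-gender pairs appearing in all 5 samples
# (keep only the key columns so repeated joins do not create clashing Count columns)
common_platform_gender <- platform_gender_counts %>%
  map(~ select(.x, `Primary Platform`, Gender)) %>%
  reduce(inner_join, by = c("Primary Platform", "Gender"))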

# Display how many common platform-gender pairs exist
nrow(common_platform_gender)
## [1] 15

Insight:

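All five samples share 15 platform-gender pairs. Whether any pair is missing from a sample can only be judged by comparing this number with the number of distinct pairs in the full dataset; rare pairs are the ones most likely to drop out.
This check tests only whether a pair is present, not whether its proportion is stable. Differences between random samples reflect sampling variability rather than systematic bias, but small groups can still be over- or underrepresented in any single sample.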
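
A possible extension, sketched here on hypothetical data, compares the pairs in each sample against the full table and tracks each pair's share across samples. The `toy` table, its category labels, and the helper names (`pair_shares`, `pair_stability`, `missing_somewhere`) are assumptions for illustration only.

```r
library(dplyr)
library(purrr)

set.seed(1)

# Hypothetical stand-in for the dataset, with one small gender category
toy <- tibble(
  `Primary Platform` = sample(c("Facebook", "TikTok", "Twitter"), 5000, replace = TRUE),
  Gender = sample(c("Female", "Male", "Other"), 5000, replace = TRUE,
                  prob = c(0.495, 0.495, 0.01))
)

toy_samples <- map(1:5, ~ sample_frac(toy, size = 0.5, replace = TRUE))

# Reference: every pair that exists in the full table
all_pairs <- distinct(toy, `Primary Platform`, Gender)

# Share of each pair within each sample, stacked into one table
pair_shares <- toy_samples %>%
  map(~ count(.x, `Primary Platform`, Gender) %>% mutate(Share = n / sum(n))) %>%
  bind_rows(.id = "Sample_ID")

# Per pair: number of samples containing it and the range of its share
pair_stability <- pair_shares %>%
  group_by(`Primary Platform`, Gender) %>%
  summarize(Samples_Present = n(),
            Min_Share = min(Share),
            Max_Share = max(Share),
            .groups = "drop")

# Pairs from the full table that are absent from at least one sample
missing_somewhere <- all_pairs %>%
  left_join(pair_stability, by = c("Primary Platform", "Gender")) %>%
  filter(is.na(Samples_Present) | Samples_Present < 5)

pair_stability
missing_somewhere
```

Comparing `nrow(all_pairs)` with the number of pairs present in all samples answers the presence question directly, and the share range shows how much each group's representation moves.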

Monte Carlo Simulation

We draw 1,000 repeated samples (with replacement) and compute the mean of each to see how much the estimated mean of daily social media time varies from sample to sample.

set.seed(42)
time_spent <- data$`Daily Social Media Time (hrs)`
n_draw <- 0.5 * nrow(data)  # same sample size as above

mc_samples <- replicate(
  1000,
  mean(sample(time_spent, size = n_draw, replace = TRUE), na.rm = TRUE)
)

# Plot the distribution
ggplot(data.frame(mc_samples), aes(x = mc_samples)) +
  geom_histogram(fill = "lightgreen", bins = 30, alpha = 0.7) +
  labs(title = "Monte Carlo Simulation: Average Time Spent",
       x = "Estimated Mean (hrs)",
       y = "Frequency") +
  theme_minimal()

Insight:

The estimated means are centered around ~4.25 hrs with a narrow spread, so the sample mean is a stable estimate at this sample size.
Any single random sample can still land away from the true mean. The simulation does not remove that uncertainty, but the width of the distribution shows how large it is.
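
The spread of the simulated means can be summarized as a standard error and compared with the textbook formula sd(x) / sqrt(n). The sketch below does this on a hypothetical usage vector `x` (simulated here, not the real column), so the numbers it prints are illustrative only.

```r
set.seed(42)

# Hypothetical stand-in for the daily usage column
x <- runif(20000, min = 0.5, max = 8)
n <- length(x) / 2   # half-size samples, as in the simulation above

# Means of 1,000 half-size samples drawn with replacement
mc_means <- replicate(1000, mean(sample(x, size = n, replace = TRUE)))

empirical_se <- sd(mc_means)       # spread of the simulated means
analytic_se  <- sd(x) / sqrt(n)    # standard error formula for a mean

# Middle 95% of the simulated means: a percentile-style interval
mc_interval <- quantile(mc_means, c(0.025, 0.975), names = FALSE)

empirical_se
analytic_se
mc_interval
```

The two standard errors should be close, which is a useful sanity check that the simulation is measuring sampling variability rather than reducing it.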

Final Insights and Next Steps

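Sample Variability: Means are nearly identical across samples, but the most common platform changes, so conclusions based on rankings or modes generalize less reliably than those based on averages.
Anomaly Instability: A per-sample percentile cutoff flags a different set of users each time, making "extreme" users difficult to define consistently.
Demographic Representation: Rare user groups may be missing or unevenly represented in a single random sample; this is sampling variability, and it should be checked against the full dataset.
Monte Carlo Stability: Repeated sampling shows how much an estimate varies, which quantifies the uncertainty in estimating true population values.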

Next Steps:

Use stratified sampling to ensure balanced representation of all groups.
Report confidence intervals alongside point estimates to quantify their uncertainty.
Explore trend stability over time by incorporating time-based sampling.
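
The first two next steps could look roughly like the sketch below: a stratified 50% sample taken within each platform, followed by a 95% confidence interval for the mean. The `toy` table and its platform shares are assumptions made for illustration; the column names mirror those used above.

```r
library(dplyr)

set.seed(1)

# Hypothetical stand-in for the dataset with unequal platform shares
toy <- tibble(
  `Primary Platform` = sample(c("Facebook", "TikTok", "Twitter"), 6000,
                              replace = TRUE, prob = c(0.6, 0.3, 0.1)),
  `Daily Social Media Time (hrs)` = runif(6000, min = 0.5, max = 8)
)

# Stratified sample: 50% of the rows within each platform,
# so platform shares in the sample match those in the full table
stratified <- toy %>%
  group_by(`Primary Platform`) %>%
  slice_sample(prop = 0.5) %>%
  ungroup()

full_shares   <- prop.table(table(toy$`Primary Platform`))
sample_shares <- prop.table(table(stratified$`Primary Platform`))

# 95% confidence interval for the mean daily usage in the sample
ci <- t.test(stratified$`Daily Social Media Time (hrs)`)$conf.int

round(rbind(full_shares, sample_shares), 3)
ci
```

Stratifying on more than one variable (for example platform and gender) only requires adding the extra column to `group_by()`.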