Week 7 | Data Dive — Hypothesis Testing

Introduction

This analysis explores hypothesis testing using the Social Media and Entertainment Dataset. Key objectives:

Define two null hypotheses using different dataset columns.
Use Neyman-Pearson and Fisher’s Significance Testing frameworks.
Perform a sample size calculation so that each test is adequately powered.
Conduct A/B testing to compare two groups.
Visualize results for each hypothesis.

Hypothesis 1: Social Media Time and Age

Defining the Hypothesis

Null Hypothesis (H₀): There is no difference in average social media time between users aged under 30 (Group A) and 30 & older (Group B).
Alternative Hypothesis (H₁): Younger users spend more time on social media.

Sample Size Calculation

Parameters were chosen to balance the risk of Type I and Type II errors with a reasonable effect size for social behavior patterns.

Parameters:

Significance Level (α): 0.05 (Standard for social data studies)
Power (1 - β): 0.80 (Accepts a 20% chance of missing a true effect of the assumed size, i.e. β = 0.20)
Effect Size (d): 0.2 (Small effect size considering social behavior variability)

# Sample size calculation (pwr.t.test() comes from the pwr package)
library(pwr)

sample_size <- pwr.t.test(
  d = 0.2,          # Small effect size
  sig.level = 0.05, # Alpha level
  power = 0.80,     # Power level
  type = "two.sample"
)$n

ceiling(sample_size)  # Round up for whole number
## [1] 394

Insight: At least 394 users per group are needed to detect an effect of d = 0.2 with 80% power at α = 0.05. Both age groups in the dataset (97,856 and 202,144 users) far exceed this, so the test is well powered even for small effects.
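
As a cross-check on the required sample size, the same quantity can be approximated in base R. The sketch below is an illustration, not part of the original analysis. It first applies the normal-approximation formula n ≈ 2(z₁₋α/₂ + z₁₋β)² / d², then calls `power.t.test()` from the built-in stats package, assuming a standardized effect (sd = 1) and a two-sided test as in the `pwr.t.test()` call.

```r
# Assumed inputs, matching the parameters listed above
alpha <- 0.05
power <- 0.80
d     <- 0.2

# Normal approximation: n per group = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / d^2
z_alpha  <- qnorm(1 - alpha / 2)
z_beta   <- qnorm(power)
n_normal <- 2 * (z_alpha + z_beta)^2 / d^2

# t-based calculation from base R (stats package)
n_t <- power.t.test(delta = d, sd = 1, sig.level = alpha, power = power,
                    type = "two.sample", alternative = "two.sided")$n

ceiling(n_normal)  # normal approximation, rounded up
ceiling(n_t)       # t-based result, rounded up
```

The normal approximation comes out slightly lower than the t-based answer because it ignores the extra uncertainty from estimating the variance.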

Choosing the Test

Two-sample t-test (independent groups, continuous data).
Assumptions: Normality (checked with visualization), equal variances (checked with Levene’s test).

# Create age-based groups (mutate() and %>% come from dplyr)
library(dplyr)

data <- data %>%
  mutate(Age_Group = ifelse(Age < 30, "Under 30", "30 & Older"))

# Summary statistics
group_summary <- data %>%
  group_by(Age_Group) %>%
  summarize(
    Mean_SM_Time = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE),
    SD_SM_Time = sd(`Daily Social Media Time (hrs)`, na.rm = TRUE),
    Count = n()
  )

group_summary
## # A tibble: 2 × 4
##   Age_Group  Mean_SM_Time SD_SM_Time  Count
##   <chr>             <dbl>      <dbl>  <int>
## 1 30 & Older         4.25       2.16 202144
## 2 Under 30           4.26       2.17  97856

Insight: The average social media time is nearly identical across both groups, suggesting that age alone may not strongly influence social media behavior.

Checking Assumptions

# Visualizing distribution (ggplot() comes from ggplot2)
library(ggplot2)

ggplot(data, aes(x = `Daily Social Media Time (hrs)`, fill = Age_Group)) +
  geom_density(alpha = 0.5) +
  labs(title = "Distribution of Social Media Time by Age Group",
       x = "Daily Social Media Time (hrs)", 
       y = "Density") +
  theme_minimal()

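Insight: The overlapping density curves indicate that social media time is distributed similarly in both age groups, which weakens the argument for a strong age-based effect. With groups this large, the t-test is also robust to moderate departures from normality, because the sampling distribution of each group mean is approximately normal.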

# Levene’s test for equal variances (leveneTest() comes from the car package)
library(car)

leveneTest(`Daily Social Media Time (hrs)` ~ Age_Group, data = data)
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value Pr(>F)
## group 1e+00   0.676  0.411
##       3e+05

Insight: Since the p-value (0.411) is greater than 0.05, there is no evidence that the group variances differ, so the equal-variance t-test is a reasonable choice.
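
To make clear what Levene's test with `center = median` measures, here is a minimal base R sketch of the same idea: a one-way ANOVA on each value's absolute deviation from its own group median. The data frame `sim` and its columns are hypothetical, simulated only for illustration. They are not the real dataset.

```r
set.seed(42)  # illustrative seed for the simulated data below

# Hypothetical data: two groups with equal spread (not the real dataset)
sim <- data.frame(
  group = factor(rep(c("A", "B"), times = c(200, 300))),
  hours = c(rnorm(200, mean = 4.2, sd = 2), rnorm(300, mean = 4.3, sd = 2))
)

# Step 1: absolute deviation of each value from its own group median
sim$abs_dev <- abs(sim$hours - ave(sim$hours, sim$group, FUN = median))

# Step 2: one-way ANOVA on those deviations; its F test is Levene's test
levene_by_hand <- anova(lm(abs_dev ~ group, data = sim))

f_value <- levene_by_hand[["F value"]][1]
p_value <- levene_by_hand[["Pr(>F)"]][1]
c(F = f_value, p = p_value)
```

A large F value (small p-value) would mean the typical deviation from the median differs between groups, that is, unequal spread.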

Performing the Test

# Independent t-test
t_test_result <- t.test(
  `Daily Social Media Time (hrs)` ~ Age_Group, 
  data = data, 
  var.equal = TRUE
)

t_test_result
## 
##  Two Sample t-test
## 
## data:  Daily Social Media Time (hrs) by Age_Group
## t = -1.4304, df = 3e+05, p-value = 0.1526
## alternative hypothesis: true difference in means between group 30 & Older and group Under 30 is not equal to 0
## 95 percent confidence interval:
##  -0.028592853  0.004466521
## sample estimates:
## mean in group 30 & Older   mean in group Under 30 
##                 4.250873                 4.262936

Interpretation:

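The p-value (0.1526) is greater than 0.05, so we fail to reject H₀. Age alone does not significantly affect social media usage.
The test as run is two-sided, while H₁ is directional. Because the observed difference lies in the hypothesized direction (4.26 vs. 4.25 hours), the one-sided p-value is half the two-sided value, about 0.076, which is still above 0.05.
This means there is no strong evidence that younger users spend more time on social media than older users.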
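
To put the result in terms of effect size, the sketch below recomputes Cohen's d and the one-sided p-value from the numbers printed in the output above. It is an illustration using the rounded summary statistics, so the values are approximate. The variable names are introduced here and are not part of the original analysis.

```r
# Values copied from the summary table and t-test output above
mean_older <- 4.250873; sd_older <- 2.16; n_older <- 202144
mean_under <- 4.262936; sd_under <- 2.17; n_under <- 97856
t_stat     <- -1.4304   # (30 & Older) minus (Under 30)

# Pooled standard deviation and Cohen's d (Under 30 minus 30 & Older)
df        <- n_older + n_under - 2
sd_pooled <- sqrt(((n_older - 1) * sd_older^2 + (n_under - 1) * sd_under^2) / df)
cohens_d  <- (mean_under - mean_older) / sd_pooled

# One-sided p-value for H1: Under 30 > 30 & Older
# (equivalently, the older-minus-younger t statistic is negative)
p_one_sided <- pt(t_stat, df = df, lower.tail = TRUE)
p_two_sided <- 2 * pt(-abs(t_stat), df = df)

round(c(cohens_d = cohens_d, p_one_sided = p_one_sided,
        p_two_sided = p_two_sided), 4)
```

The resulting d is on the order of 0.006, far below the d = 0.2 assumed in the sample size calculation, so the observed difference is negligible in practical terms as well as statistically non-significant.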

Hypothesis 2: Engagement Score and Gender

Defining the Hypothesis

Null Hypothesis (H₀): Engagement scores do not differ by gender.
Alternative Hypothesis (H₁): Engagement scores differ by gender.

Defining Engagement Score

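Since the dataset does not contain a direct Engagement Score, we create one as the unweighted mean of three numeric columns:
Engagement Score = (Daily Social Media Time + Time Spent in Online Communities + Ad Interaction Count) / 3
Note that this averages quantities on different scales (hours and counts), so it is a rough composite rather than a calibrated measure.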

# Compute Engagement Score
data <- data %>%
  mutate(
    Engagement_Score = (
      `Daily Social Media Time (hrs)` +
        `Time Spent in Online Communities (hrs)` +
        `Ad Interaction Count`
    ) / 3
  )

# Check if column is created
head(data$Engagement_Score, n = 10)
##  [1]  9.156667 10.773333 18.583333 10.176667 15.630000  8.866667 17.466667
##  [8] 10.906667  4.656667  8.780000
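
As a worked example of the formula, the sketch below computes the score for a single hypothetical user. All three input values are invented for illustration and do not come from the dataset.

```r
# Hypothetical user (values invented for illustration)
social_media_hrs <- 4.5   # Daily Social Media Time (hrs)
community_hrs    <- 2.0   # Time Spent in Online Communities (hrs)
ad_interactions  <- 21    # Ad Interaction Count

engagement_score <- (social_media_hrs + community_hrs + ad_interactions) / 3
engagement_score

# Share of the score contributed by each component
shares <- c(social_media_hrs, community_hrs, ad_interactions) /
  (3 * engagement_score)
round(shares, 2)
```

For this hypothetical user the ad interaction count contributes most of the score, which shows how a component measured on a larger scale can dominate an unweighted average.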

Choosing the Test

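Fisher’s Exact Test with a simulated (Monte Carlo) p-value, because exact enumeration is impractical for a 3 × 2 table with about 300,000 observations.
Assumptions: Independent observations classified into categories. Unlike the chi-square test, Fisher’s test does not require expected counts ≥ 5 in each cell, although that condition is easily met here.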

# Create engagement categories (Low: < Median, High: ≥ Median)
median_engagement <- median(data$Engagement_Score, na.rm = TRUE)
data <- data %>%
  mutate(Engagement_Level = ifelse(Engagement_Score >= median_engagement, "High", "Low"))

# Count table
gender_engagement_table <- table(data$Gender, data$Engagement_Level)

gender_engagement_table
##         
##           High   Low
##   Female 50096 49777
##   Male   49785 50117
##   Other  50135 50090
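
The expected cell counts under independence can be computed directly from the table margins. The sketch below rebuilds the table from the counts printed above (the object names `observed` and `expected` are introduced here for illustration) and applies expected = row total × column total / grand total.

```r
# Counts copied from the table above
observed <- matrix(
  c(50096, 49777,
    49785, 50117,
    50135, 50090),
  nrow = 3, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male", "Other"),
                  Engagement_Level = c("High", "Low"))
)

# Expected count for each cell = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

round(expected, 1)
min(expected)  # smallest expected cell count
```

Every expected count is in the tens of thousands, so a chi-square approximation would also be valid for this table.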

Performing Fisher’s Exact Test

# Fisher’s Exact Test with simulated p-value for large data
# (the simulated p-value varies between runs unless set.seed() is called first)
fisher_result <- fisher.test(
  gender_engagement_table,
  simulate.p.value = TRUE  # Monte Carlo simulation for large tables
)

fisher_result
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  gender_engagement_table
## p-value = 0.3388
## alternative hypothesis: two.sided

Interpretation:

The p-value (0.3388) is greater than 0.05, so we fail to reject H₀.
This means there is no significant difference in engagement levels between genders.
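
Because the p-value is simulated, it carries Monte Carlo error. The sketch below approximates that error with the binomial standard error for 2000 replicates, then cross-checks the conclusion with the asymptotic chi-square test on the same counts. Both steps are illustrative additions, not part of the original analysis.

```r
# Approximate Monte Carlo standard error of a simulated p-value
p_sim <- 0.3388   # value printed in the fisher.test() output above
B     <- 2000     # number of replicates reported in the output
mc_se <- sqrt(p_sim * (1 - p_sim) / B)
mc_se

# Asymptotic cross-check on the same table with Pearson's chi-square test
observed <- matrix(c(50096, 49777, 49785, 50117, 50135, 50090),
                   nrow = 3, byrow = TRUE)
chisq_result <- chisq.test(observed)
chisq_result$p.value
```

A standard error of roughly 0.01 means that simulated p-values differing by a few hundredths between runs are expected. The chi-square p-value falls in the same range, so the conclusion does not depend on the simulation.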

Visualizing Results

1. Social Media Time by Age Group

ggplot(data, aes(x = Age_Group, y = `Daily Social Media Time (hrs)`, fill = Age_Group)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Social Media Time by Age Group", 
       x = "Age Group", 
       y = "Daily Social Media Time (hrs)") +
  theme_minimal()

Analysis:

The boxplots show that social media time is very similar for both age groups.
This is consistent with the t-test, which found no significant difference.

2. Engagement Levels by Gender

ggplot(data, aes(x = Gender, fill = Engagement_Level)) +
  geom_bar(position = "fill") +
  labs(title = "Engagement Levels by Gender",
       x = "Gender", 
       y = "Proportion") +
  theme_minimal()

Analysis:

The engagement level distribution is almost equal across genders.
This aligns with Fisher’s Exact Test, which found no significant difference in engagement levels by gender.

Final Insights and Next Steps

Key Findings:

Social Media Usage and Age:
- No significant difference between younger and older users.
- Age is not a strong factor influencing social media usage.
Engagement and Gender:
- No significant difference in engagement levels by gender.
- Engagement is likely influenced by other factors.
Statistical Testing and Practical Meaning:
- Both tests indicate that these variables (age, gender) may not be strong predictors of behavior.
- Future analysis could explore other potential predictors such as occupation, income, or device type.

Next Steps:

Investigate other variables that might impact social media usage and engagement.
Explore possible interactions between multiple factors (e.g., occupation and engagement).
Conduct more granular studies within subgroups to uncover hidden patterns.