Fall 2025

Introduction

For this project, I chose the dataset “Social Media and Mental Health Balance” from Kaggle. The data consists of 500 rows and 10 columns, including demographic information, screen time, sleep quality, and other factors. By performing data manipulation and statistical analysis on this dataset, we can explore the relationship between social media usage and mental wellbeing.

The goal of this project is to identify patterns that highlight the influence that social media has on factors affecting mental health. This topic is particularly meaningful to me, as I use social media daily for both personal and professional purposes and have observed both its positive and negative effects on my own wellbeing.

Link to dataset: https://www.kaggle.com/datasets/ayeshaimran123/social-media-and-mental-health-balance/data

Dataset Overview

The Social Media and Mental Health Balance dataset has the following columns:

  • User_ID
  • Age
  • Gender
  • Daily_Screen_Time(hrs)
  • Sleep_Quality(1-10)
  • Stress_Level(1-10)
  • Days_Without_Social_Media
  • Exercise_Frequency(week)
  • Social_Media_Platform
  • Happiness_Index(1-10)

Data Wrangling

  • Load and clean the data
# load libraries
library(ggplot2)
library(plotly)
library(dplyr)
library(knitr)

# load dataset + clean by renaming messy column names (data wrangling)
data <- read.csv("Mental_Health_and_Social_Media_Balance_Dataset.csv") %>%
  rename(
    Screen_Time = Daily_Screen_Time.hrs.,
    Sleep_Quality = Sleep_Quality.1.10.,
    Stress_Level = Stress_Level.1.10.,
    Days_Off_Social_Media = Days_Without_Social_Media,
    Exercise_Freq = Exercise_Frequency.week.,
    Happiness = Happiness_Index.1.10.,
    Platform = Social_Media_Platform
  )
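The dotted names on the right-hand side of `rename()` differ from the headers listed in the Dataset Overview because `read.csv()` defaults to `check.names = TRUE`, which passes each header through `make.names()`. A minimal sketch of that conversion, assuming the CSV headers match the overview list exactly (`raw_names` and `clean_names` are illustrative variable names, not part of the project code):

```r
# Headers as listed in the Dataset Overview (assumed to match the CSV file)
raw_names <- c("Daily_Screen_Time(hrs)", "Sleep_Quality(1-10)",
               "Stress_Level(1-10)", "Exercise_Frequency(week)",
               "Happiness_Index(1-10)")

# make.names() replaces characters that are invalid in R names with "."
clean_names <- make.names(raw_names)
print(clean_names)
```

Each of `(`, `)` and `-` becomes a dot, which is why `Daily_Screen_Time(hrs)` has to be referenced as `Daily_Screen_Time.hrs.` after loading.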

Dataset Preview
Age  Gender  Screen_Time  Sleep_Quality  Stress_Level  Days_Off_Social_Media  Exercise_Freq  Platform     Happiness
44   Male    3.1          7              6             2                      5              Facebook     10
30   Other   5.1          7              8             5                      3              LinkedIn     10
23   Other   7.4          6              7             1                      3              YouTube      6
36   Female  5.7          7              8             1                      1              TikTok       8
34   Female  7.0          4              7             5                      1              X (Twitter)  8

Exploratory Data Analysis

  • Get statistics
# summary statistics
summary(data)
numeric_data <- data %>% 
  select(Age, Screen_Time, Sleep_Quality, 
         Stress_Level, Days_Off_Social_Media, 
         Exercise_Freq, Happiness)

# calculate mean, min, max
stats_summary <- data.frame(
  Variable = c("Age", "Daily Screen Time (hrs)", "Sleep Quality (1-10)",
               "Stress Level (1-10)", "Days Without Social Media",
               "Exercise Frequency (week)", "Happiness Index (1-10)"),
  Mean = sapply(numeric_data, mean, na.rm = TRUE),
  Min = sapply(numeric_data, min, na.rm = TRUE),
  Max = sapply(numeric_data, max, na.rm = TRUE)
)
stats_summary

Summary Statistics
Variable                    Mean   Min  Max
Age                         32.99  16   49.0
Daily Screen Time (hrs)     5.53   1    10.8
Sleep Quality (1-10)        6.30   2    10.0
Stress Level (1-10)         6.62   2    10.0
Days Without Social Media   3.13   0    9.0
Exercise Frequency (week)   2.45   0    7.0
Happiness Index (1-10)      8.38   4    10.0

gender_counts <- as.data.frame(table(data$Gender))
colnames(gender_counts) <- c("Gender", "Number of Users")

kable(gender_counts, col.names = c("Gender", "Number of Users"), align = c("c", "c"))

Gender   Number of Users
Female   229
Male     248
Other    23

platform_counts <- as.data.frame(table(data$Platform))
colnames(platform_counts) <- c("Social Media Platform", "Number of Users")

kable(platform_counts,
      col.names = c("Social Media Platform", "Number of Users"),
      align = c("c", "c"))

Social Media Platform   Number of Users
Facebook                81
Instagram               74
LinkedIn                87
TikTok                  95
X (Twitter)             88
YouTube                 75

Statistical Analysis - Hypothesis Testing

Since I based my HW3 on hypothesis testing (particularly the z-test) and found it interesting, I decided to explore this dataset further by applying different (and more complex) types of hypothesis tests to the questions I wanted to answer.

Common hypothesis tests (in one-sample and two-sample forms) include:

  • z-test
  • t-test
  • F-test
  • \(\chi^2\)-test
  • linear regression

Q1: Do older individuals experience more stress?

  • \(H_0\): There is no significant correlation between age and stress level.
  • \(H_1\): There is a significant positive correlation between age and stress level.
  • Statistical Test: Linear regression

# fit the model and extract key values
model_age <- lm(Stress_Level ~ Age, data = data)
slope <- round(coef(model_age)[2], 4)
p_val <- summary(model_age)$coefficients[2, 4]
r_squared <- round(summary(model_age)$r.squared, 3)

cat("\nRegression equation: Stress =",
    round(coef(model_age)[1], 2), "+",
    round(coef(model_age)[2], 4), "* Age\n")

# scatter plot with the fitted regression line and its confidence band
ggplot(data, aes(x = Age, y = Stress_Level)) +
  geom_point(alpha = 0.6, color = "#0B968E", size = 2.5) +
  geom_smooth(method = "lm", color = "black", se = TRUE, linewidth = 1) +
  labs(title = "Linear Regression: Age vs Stress Level",
       subtitle = paste0("Slope = ", slope,
                         ", p = ", format.pval(p_val, digits = 3),
                         ", R² = ", r_squared),
       x = "Age (years)", y = "Stress Level (1-10)") +
  theme_minimal(base_size = 16) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Scatter plot of Age vs Stress

Conclusion: Age does not significantly predict stress levels (p = 0.712). The relationship is essentially flat, meaning older and younger individuals experience similar stress levels in this sample.
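The hypotheses for Q1 are stated in terms of correlation, while the test used is a linear regression. For a single predictor, the two-sided p-value of the regression slope is the same as the p-value of a Pearson correlation test, so either framing leads to the same decision. A small sketch of that equivalence on simulated toy data (not the Kaggle dataset; `toy`, `age` and `stress` are hypothetical names used only for illustration):

```r
set.seed(1)  # simulated toy data, NOT the Kaggle dataset
n <- 200
toy <- data.frame(age = runif(n, 16, 49))
toy$stress <- 6 + rnorm(n)  # stress generated independently of age

# p-value for the slope in a simple linear regression
fit <- lm(stress ~ age, data = toy)
p_slope <- summary(fit)$coefficients["age", "Pr(>|t|)"]

# p-value from a two-sided Pearson correlation test
p_cor <- cor.test(toy$age, toy$stress)$p.value

cat("p-value from regression slope:", format.pval(p_slope, digits = 4), "\n")
cat("p-value from Pearson test:    ", format.pval(p_cor, digits = 4), "\n")
```

Both lines print the same value, since the two tests use the same t statistic with n - 2 degrees of freedom.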

Q2: Is there a relationship between gender and social media platform preference?

  • \(H_0\): Gender and social media platform preference are independent (no association).
  • \(H_1\): There is a significant association between gender and social media platform preference.
  • Statistical Test: \(\chi^2\)-test

# filter to include only Male and Female 
data_filtered <- data %>%
  filter(Gender %in% c("Male", "Female"))
# create contingency table
cont_table <- table(data_filtered$Gender, data_filtered$Platform)
# chi-square test
chi_test <- chisq.test(cont_table)
# important values
cat("\nChi-square statistic:", round(chi_test$statistic, 3), "Degrees of freedom:", chi_test$parameter, "P-value:", 
    format.pval(chi_test$p.value, digits = 4), "Sample size after filtering:", nrow(data_filtered), "users")
## 
## Chi-square statistic: 7.182 Degrees of freedom: 5 P-value: 0.2075 Sample size after filtering: 477 users

Conclusion: The bar chart shows slight variations in social media platform preference between males and females. However, the chi-square test finds no statistically significant association between gender and social media platform preference (χ² = 7.182, df = 5, p = 0.2075), i.e., the two genders have similar platform preferences in this specific sample.
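The p-value says whether an association is detectable, not how strong it is. One common effect-size measure for a contingency table is Cramér's V, which can be computed directly from the reported chi-square statistic and sample size. This is an added illustration rather than part of the original analysis; the variable names are hypothetical and the inputs are the values printed in the chi-square output above:

```r
# Values taken from the chi-square output reported above
chi_sq <- 7.182  # chi-square statistic
n      <- 477    # sample size after filtering to Male/Female
n_rows <- 2      # genders in the table (Male, Female)
n_cols <- 6      # platforms in the table

# Cramer's V = sqrt(chi^2 / (n * (min(rows, cols) - 1)))
cramers_v <- sqrt(chi_sq / (n * (min(n_rows, n_cols) - 1)))
cat("Cramer's V:", round(cramers_v, 3), "\n")
```

The result is roughly 0.12 on a 0 to 1 scale, which is in line with the non-significant test result.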

Q3: Are people with less screen time happier?

  • \(H_0\): There is no significant difference in happiness index between individuals with low and high screen time.
  • \(H_1\): Individuals with lower screen time have significantly higher happiness scores than those with higher screen time.
  • Statistical Test: Two-sample t-test

# create screen time categories
data <- data %>%
  mutate(Screen_Category = case_when(
    Screen_Time < 4 ~ "Low",
    Screen_Time >= 4 & Screen_Time <= 6 ~ "Medium",
    Screen_Time > 6 ~ "High"
  ))

# low and high only for this test
low_group <- data %>% filter(Screen_Category == "Low")
high_group <- data %>% filter(Screen_Category == "High")

# two-sample t-test
t_test <- t.test(low_group$Happiness, high_group$Happiness,
                 alternative = "two.sided")

cat("P-value:", format.pval(t_test$p.value, digits = 4))
## P-value: < 2.2e-16
# display results
cat("Mean Happiness (Low Screen Time):", 
    round(mean(low_group$Happiness), 2), "\n")
## Mean Happiness (Low Screen Time): 9.74
cat("Mean Happiness (High Screen Time):", 
    round(mean(high_group$Happiness), 2), "\n")
## Mean Happiness (High Screen Time): 7.25
cat("T-statistic:", round(t_test$statistic, 3), "\n")
## T-statistic: 21.303

Conclusion: The two-sample t-test reveals a statistically significant difference in happiness between the screen time groups (t = 21.303, p < 2.2e-16). Individuals with low screen time report higher mean happiness (9.74) than those with high screen time (7.25).
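The alternative hypothesis for Q3 is directional (low screen time gives higher happiness), while the test above was run with `alternative = "two.sided"`. In `t.test()`, a directional hypothesis can be expressed with `alternative = "greater"`; when the t statistic is positive, the one-sided p-value is half the two-sided one. A sketch on simulated toy scores (not the Kaggle dataset; `low_toy` and `high_toy` are hypothetical stand-ins for the two groups):

```r
set.seed(42)  # simulated toy scores, NOT the Kaggle dataset
low_toy  <- rnorm(30, mean = 8.5, sd = 1)  # hypothetical "low screen time" group
high_toy <- rnorm(30, mean = 7.5, sd = 1)  # hypothetical "high screen time" group

# t.test() runs Welch's unequal-variance test by default
two_sided <- t.test(low_toy, high_toy, alternative = "two.sided")
one_sided <- t.test(low_toy, high_toy, alternative = "greater")

cat("t statistic:", round(unname(two_sided$statistic), 3), "\n")
cat("two-sided p:", format.pval(two_sided$p.value, digits = 4), "\n")
cat("one-sided p:", format.pval(one_sided$p.value, digits = 4), "\n")
```

Since the two-sided p-value reported for Q3 is already below 2.2e-16, switching to the one-sided form would not change the conclusion.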

Q4: What is the relationship between screen time, sleep quality, and stress?

  • \(H_0\): There is no significant relationship between daily screen time and the combination of sleep quality and stress level.
  • \(H_1\): Higher screen time is associated with lower sleep quality and higher stress levels.
  • Statistical Test: Simple linear regression (one model per outcome)

# two simple linear regression models (one predictor each)
model_sleep <- lm(Sleep_Quality ~ Screen_Time, data = data) # Sleep Quality ~ Screen Time
model_stress <- lm(Stress_Level ~ Screen_Time, data = data) # Stress Level ~ Screen Time

# display results
cat(
"Sleep Quality - Coefficient:", round(coef(model_sleep)[2], 4),
", P-value:", format.pval(summary(model_sleep)$coefficients[2,4], digits = 4),
", R^2:", round(summary(model_sleep)$r.squared, 3), "\n",
"Stress Level - Coefficient:", round(coef(model_stress)[2], 4),
", P-value:", format.pval(summary(model_stress)$coefficients[2,4], digits = 4),
", R^2:", round(summary(model_stress)$r.squared, 3), "\n"
)
## Sleep Quality - Coefficient: -0.6692 , P-value: < 2.2e-16 , R^2: 0.576 
##  Stress Level - Coefficient: 0.6581 , P-value: < 2.2e-16 , R^2: 0.547

Q4: Screen Time vs. Sleep Quality

Conclusion: Screen time has a statistically significant negative relationship with sleep quality (slope = -0.6692, p < 2.2e-16, R² = 0.576). The scatter plot also shows that lower screen time goes with better sleep quality.

Q4: Screen Time vs. Stress Level

Conclusion: Screen time has a statistically significant positive relationship with stress level (slope = 0.6581, p < 2.2e-16, R² = 0.547). The scatter plot also shows that the higher the screen time, the higher the stress level.
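For a regression with a single predictor, R² is the square of the Pearson correlation, so the strength and direction of each relationship can be recovered from the regression output above. A short worked sketch using only the reported values (the variable names are illustrative):

```r
# R^2 values and slopes reported in the regression output above
r2_sleep  <- 0.576; slope_sleep  <- -0.6692
r2_stress <- 0.547; slope_stress <-  0.6581

# With one predictor, r = sign(slope) * sqrt(R^2)
r_sleep  <- sign(slope_sleep)  * sqrt(r2_sleep)
r_stress <- sign(slope_stress) * sqrt(r2_stress)

cat("Implied r (screen time vs sleep quality):", round(r_sleep, 2), "\n")
cat("Implied r (screen time vs stress level): ", round(r_stress, 2), "\n")
```

This gives correlations of about -0.76 and 0.74, meaning screen time alone accounts for a bit more than half of the variance in each outcome in this sample.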

Conclusion

  1. Age and Stress: There is no significant relationship between age and stress level in this sample.
  2. Gender and Platform Preference: There is no statistically significant relationship between gender and preferred social media platform. While small variations are visible in the bar chart, there is not enough evidence to conclude that the two are associated.
  3. Screen Time and Happiness: People with lower screen time report significantly higher happiness levels, suggesting that reduced screen exposure is associated with better mental health.
  4. Screen Time, Sleep and Stress: The two linear regression models indicate that screen time is strongly related to both sleep quality and stress level. The higher the screen time, the lower the sleep quality and the higher the stress level.

Overall, the results suggest that while age and gender show no significant effects in this sample, screen time is strongly associated with mental wellbeing. Reducing screen time and social media usage may help improve sleep, lower stress, and contribute to better overall mental health.