In recent years, student mental health has become a growing concern. With levels of depression rising year by year, especially among high-school and university students, understanding the underlying factors is essential for mitigating such issues. The objective of this EDA project is to explore patterns in mental health and relate them to two candidate factors, namely sleep and social media use. Identifying which factors are associated with low mental health scores is a step towards explaining them, which is the goal of this exploratory analysis.
# Load required libraries
library(tidyverse)
library(readxl)
library(dplyr)
# Set working directory and import dataset
setwd("/Users/emeliekaml/Documents/PKU/Classes/Semester 1/RStudio/Homework/Final Project")
student_data <- read_excel("Students Social Media Addiction.xls")
The data was taken from the platform “Kaggle” (Aminasalamt, 2025). It contains survey-based data on social media usage, mental health, and related lifestyle and academic factors among students. Each observation represents a single student, and the students were surveyed across different countries (e.g. Bangladesh, India, USA, Germany…) and across different levels of school (high school, undergraduate, graduate).
The survey was administered to 16-25 students in each school and took place in Q1 of 2025. The full data contains 705 observations (one per surveyed student), and the following variables were analyzed:
Mental_Health_Score: Mental health as a numerical variable rated from 1 (poor) to 10 (very good). Because only whole numbers could be selected, this variable will be evaluated as a categorical variable (see “Data Cleaning” step)
Avg_Daily_Usage_Hours: The average daily use of social media given in hours (numerical)
Sleep_Hours_Per_Night: The amount of sleep per night a student gets given in hours (numerical)
Other variables include: “Student_ID”, “Age”, “Gender”, “Academic_Level”, “Country”, “Most_Used_Platform”, “Affects_Academic_Performance”, “Relationship_Status”, “Conflicts_Over_Social_Media”, and “Addicted_Score”
# Inspect the dataset
summary(student_data)
## Student_ID Age Gender Academic_Level
## Min. : 1 Min. :18.00 Length:705 Length:705
## 1st Qu.:177 1st Qu.:19.00 Class :character Class :character
## Median :353 Median :21.00 Mode :character Mode :character
## Mean :353 Mean :20.66
## 3rd Qu.:529 3rd Qu.:22.00
## Max. :705 Max. :24.00
## Country Avg_Daily_Usage_Hours Most_Used_Platform
## Length:705 Min. :1.500 Length:705
## Class :character 1st Qu.:4.100 Class :character
## Mode :character Median :4.800 Mode :character
## Mean :4.919
## 3rd Qu.:5.800
## Max. :8.500
## Affects_Academic_Performance Sleep_Hours_Per_Night Mental_Health_Score
## Length:705 Min. :3.800 Min. :4.000
## Class :character 1st Qu.:6.000 1st Qu.:5.000
## Mode :character Median :6.900 Median :6.000
## Mean :6.869 Mean :6.227
## 3rd Qu.:7.700 3rd Qu.:7.000
## Max. :9.600 Max. :9.000
## Relationship_Status Conflicts_Over_Social_Media Addicted_Score
## Length:705 Min. :0.00 Min. :2.000
## Class :character 1st Qu.:2.00 1st Qu.:5.000
## Mode :character Median :3.00 Median :7.000
## Mean :2.85 Mean :6.437
## 3rd Qu.:4.00 3rd Qu.:8.000
## Max. :5.00 Max. :9.000
Looking at the data summary, it is evident that the numerical variables have no missing values (no NAs). In terms of extremes, all numeric variables fall within plausible ranges. For example, a “Mental Health Score” between 4 and 9 is realistic, and no extreme values will dominate the scale of any visualization.
However, the response variable, “Mental Health Score”, only takes whole-number values. For the further analyses, it is therefore transformed into a categorical variable.
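The two checks described above (no missing values, whole-number scores) can also be made explicit in code rather than read off the summary. The following is a minimal sketch on a small made-up data frame named `toy`, which is an assumption for illustration and stands in for `student_data`; only the column names are taken from the dataset.

```r
# Toy data standing in for student_data (values are made up for illustration)
toy <- data.frame(
  Mental_Health_Score   = c(4, 5, 6, 6, 7, 9),
  Avg_Daily_Usage_Hours = c(7.2, 6.1, 4.8, 5.0, 3.9, 2.5),
  Sleep_Hours_Per_Night = c(5.1, 6.0, 6.9, 7.0, 7.7, 8.8)
)

# Number of missing values per column; all zeros means there are no NAs
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only if every score equals its rounded value, i.e. is a whole number
scores_are_whole <- all(toy$Mental_Health_Score == round(toy$Mental_Health_Score))
print(scores_are_whole)
```

Running the same two lines on `student_data` would confirm the reading of the summary output directly.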
# Create categorical mental health variable
student_data$MHS_cat <- factor(
student_data$Mental_Health_Score,
levels = 4:9, # Summary shows Mental Health Score has min. value 4 and max. value 9
ordered = TRUE
)
This transformation is needed because, in scatter plots or other visualizations against numerical variables, the points would otherwise cluster along a few lines and hide the density. From here on, the variable “MHS_cat” is used to analyze the mental health score.
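To illustrate what the ordered factor does, the sketch below applies the same `factor()` call to a short vector of hypothetical scores (the values are assumptions, not taken from the dataset). It shows that unused levels are kept, that ordered comparisons work, and that a value outside `4:9` would silently become `NA`.

```r
# Hypothetical scores, for illustration only
scores <- c(6, 4, 9, 7, 5, 6)
mhs_cat <- factor(scores, levels = 4:9, ordered = TRUE)

# All six levels are kept, even though level "8" does not occur in this toy vector
print(levels(mhs_cat))
level_counts <- table(mhs_cat)
print(level_counts)

# Ordered factors support comparisons against a level
at_least_seven <- mhs_cat >= "7"
print(at_least_seven)

# A value outside the declared levels becomes NA instead of raising an error
out_of_range <- factor(c(3, 6), levels = 4:9, ordered = TRUE)
print(is.na(out_of_range))
```

The last point is why the levels were chosen from the observed minimum and maximum: a score outside that range would be dropped without a warning.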
What is the Overall Distribution of Mental Health Scores Across Students?
# Plot the distribution of mental health using the student_data dataset
ggplot(data = student_data) +
geom_bar( mapping = aes(x = MHS_cat),
fill = "darkgreen",
color = "white"
) +
geom_text( # Add labels to the bars to show how many students fall into each category
stat = "count",
aes(x = MHS_cat, label = after_stat(count)),
vjust = -0.3,
color = "black"
) +
labs(
x = "Mental Health Score",
y = "Number of Students",
title = "Distribution of Mental Health Across Students"
)
Interpretation: The distribution of mental health scores is concentrated around the mid-range values, with most students reporting scores between 5 and 7. The highest frequency occurs at a score of 6, indicating that moderate mental health levels are most common in the sample. Lower scores (such as 4) and higher scores (such as 8) are less frequent, while very high mental health scores (9) are extremely rare. Overall, the distribution is concentrated toward the center rather than strongly skewed, implying that most students report neither very poor nor very high mental health, but instead cluster around moderate levels.
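The counts shown on the bars can be complemented with percentage shares and the modal category. The sketch below uses a short vector of hypothetical scores (an assumption for illustration); on the real data, `student_data$MHS_cat` would take the place of `mhs_cat`.

```r
# Hypothetical scores, for illustration only
scores <- c(4, 5, 5, 6, 6, 6, 6, 7, 7, 8)
mhs_cat <- factor(scores, levels = 4:9, ordered = TRUE)

# Absolute counts per category (the numbers printed above the bars)
counts <- table(mhs_cat)
print(counts)

# Percentage share of each category
shares <- round(100 * prop.table(counts), 1)
print(shares)

# Most frequent category and the combined share of scores 5 to 7
modal_score <- names(counts)[which.max(counts)]
mid_range_share <- sum(shares[c("5", "6", "7")])
print(modal_score)
print(mid_range_share)
```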
What is the Overall Pattern of Mental Health Across Different Hours of Social-Media Usage?
# Compare the mental health score for different hours of media usage
ggplot(data = student_data) +
geom_boxplot(
mapping = aes(x = MHS_cat, y = Avg_Daily_Usage_Hours),
varwidth = TRUE) +
labs(
x = "Mental Health Score",
y = "Average Daily Social Media Usage (in hours)",
title = "Usage of Social Media by Mental Health Score"
)
Interpretation: The boxplots show a clear negative association between mental health scores and average daily social media usage. Students with higher levels of social media use tend to report lower mental health scores, while students who spend fewer hours per day on social media generally report higher mental health scores. As mental health scores increase from 4 to 8, the median level of social media usage declines steadily, and the overall distribution shifts downward.
Variability in social media use is larger among students with lower and mid-range mental health scores, whereas usage becomes more concentrated at lower levels for students with higher mental health scores.
Because the maximum (9) and minimum (4) categories contain fewer observations, results for them should be interpreted with caution, but the overall pattern suggests a consistent decrease in social media usage as mental health improves.
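The quantities read off the boxplots (median, spread, and group size) can be tabulated per score. The following is a minimal base R sketch on made-up values; the data frame `toy` and its numbers are assumptions for illustration only.

```r
# Made-up data: three students each at scores 4, 6 and 8
toy <- data.frame(
  MHS   = c(4, 4, 4, 6, 6, 6, 8, 8, 8),
  usage = c(6.5, 7.0, 7.5, 4.5, 5.0, 5.5, 2.8, 3.0, 3.2)
)

# Median, interquartile range and group size of usage for each score
usage_by_score <- data.frame(
  score  = sort(unique(toy$MHS)),
  median = as.numeric(tapply(toy$usage, toy$MHS, median)),
  iqr    = as.numeric(tapply(toy$usage, toy$MHS, IQR)),
  n      = as.numeric(tapply(toy$usage, toy$MHS, length))
)
print(usage_by_score)

# TRUE if the median falls with every step up in score
medians_decrease <- all(diff(usage_by_score$median) < 0)
print(medians_decrease)
```

The `n` column makes small categories visible, which is the information that `varwidth` encodes only visually in the plot.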
What is the Overall Pattern of Mental Health Across Different Hours of Sleep?
# Compare the mental health score for different hours of sleep
ggplot(data = student_data) +
geom_boxplot(
mapping = aes(x = MHS_cat, y = Sleep_Hours_Per_Night),
varwidth = TRUE) +
labs(
x = "Mental Health Score",
y = "Sleep per Night (in hours)",
title = "Hours of Sleep by Mental Health Score"
)
Interpretation: The boxplots indicate a positive association between sleep duration and mental health scores. Students with fewer hours of sleep per night tend to report lower levels of mental health, while longer sleep durations are associated with higher mental health scores. As mental health scores increase from 4 to 8, the median number of hours slept rises steadily, and the overall distribution shifts upward.
Variability in sleep duration is greater among students with lower and mid-range mental health scores, whereas sleep patterns appear more concentrated at higher mental health levels. As with previous figures, the highest score category (9) contains very few observations and should therefore be interpreted with caution.
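The direction of the two associations described so far could be summarized with a rank correlation, which suits an ordinal score. This is not part of the original analysis; it is a sketch using Spearman's rho on made-up values, and the vectors `score`, `usage` and `sleep` are assumptions for illustration.

```r
# Made-up values for eight hypothetical students
score <- c(4, 5, 5, 6, 6, 7, 7, 8)
usage <- c(7.1, 6.4, 6.0, 5.2, 4.9, 4.1, 3.8, 3.0)
sleep <- c(5.2, 5.9, 6.1, 6.8, 7.0, 7.4, 7.6, 8.1)

# Spearman's rho uses ranks, so it only assumes the score is ordered
rho_usage <- cor(score, usage, method = "spearman")
rho_sleep <- cor(score, sleep, method = "spearman")

print(rho_usage)  # negative: more usage goes with lower scores
print(rho_sleep)  # positive: more sleep goes with higher scores
```

On the real data, `as.numeric(as.character(student_data$MHS_cat))` could serve as the score vector.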
From the pattern analysis, we can see that both sleep duration and social media usage are associated with mental health scores.
Both Factors Affect the Response Variable, “Mental Health Score”, But Which Factor Is More Pronounced?
# Compute median outcomes by mental health category
median_summary <- student_data %>%
group_by(MHS_cat) %>% # group observations by mental health score category
summarise(
median_sleep = median(Sleep_Hours_Per_Night, na.rm = TRUE), # median sleep hours
median_usage = median(Avg_Daily_Usage_Hours, na.rm = TRUE) # median daily social media use
)
# Standardize median values
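median_summary <- median_summary %>%
mutate(
sleep_scaled = as.numeric(scale(median_sleep)), # standardized median sleep duration (plain vector, not a one-column matrix)
usage_scaled = as.numeric(scale(median_usage)) # standardized median social media usage (plain vector, not a one-column matrix)
)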
# Visualize comparative trends across mental health categories
ggplot(median_summary, aes(x = as.numeric(as.character(MHS_cat)))) +
geom_line(aes(y = sleep_scaled, color = "Sleep"), linewidth = 1) + # line for standardized sleep trend
geom_point(aes(y = sleep_scaled, color = "Sleep"), size = 2) + # points for median sleep values
geom_line(aes(y = usage_scaled, color = "Social Media"), linewidth = 1) + # line for standardized social media usage trend
geom_point(aes(y = usage_scaled, color = "Social Media"), size = 2) + # points for median social media usage values
labs(
title = "Comparative Trends Across Mental Health Scores",
x = "Mental Health Score",
y = "Standardized median value",
color = "Outcome"
) +
scale_color_manual(
values = c(
"Sleep" = "darkblue",
"Social Media" = "darkred"
)
)
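The standardization step relies on `scale()`, which subtracts the mean and divides by the standard deviation. The sketch below reproduces this by hand on hypothetical median values (an assumption for illustration) and shows that `scale()` returns a one-column matrix, which `as.numeric()` flattens to a plain vector.

```r
# Hypothetical group medians, for illustration only
median_sleep <- c(5.5, 6.0, 6.8, 7.4, 8.0, 8.3)

# Manual z-score: (x - mean) / standard deviation
z_manual <- (median_sleep - mean(median_sleep)) / sd(median_sleep)

# scale() performs the same computation but returns a one-column matrix
z_scale <- scale(median_sleep)
print(is.matrix(z_scale))

# Flatten to a plain numeric vector before storing it in a data frame column
z_vector <- as.numeric(z_scale)
print(all.equal(z_manual, z_vector))

# After standardization every series has mean 0 and standard deviation 1
print(mean(z_vector))
print(sd(z_vector))
```

Because every standardized series ends up with a standard deviation of 1, a plot of standardized values compares the shape of the trends, not their size in hours.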
Interpretation: The standardized trend plot shows that both sleep duration and social media usage change systematically across mental health score categories, but with clearly different magnitudes. As mental health scores increase, median sleep duration rises steadily, confirming the previously observed positive association between better mental health and longer sleep. At the same time, median social media usage declines sharply as mental health scores increase.
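This visual comparison hints that social media behavior varies more strongly with mental health scores than sleep does, although standardization puts both series on the same scale, so raw magnitudes cannot be read off the plot. To examine this further, we compare the unstandardized slopes: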
Which of the Two Factors (Sleep, Social Media) has the Stronger Absolute Effect on Mental Health?
# Estimate linear trends of median outcomes
sleep_lm <- lm(median_sleep ~ as.numeric(MHS_cat), data = median_summary) # regress median sleep on mental health score and use aggregated median-level data
usage_lm <- lm(median_usage ~ as.numeric(MHS_cat), data = median_summary) # regress median social media usage on mental health score and use aggregated median-level data
# Extract slope coefficients
coef(sleep_lm)[2] # slope of median sleep duration trend
## as.numeric(MHS_cat)
## 0.5114286
coef(usage_lm)[2] # slope of median social media usage trend
## as.numeric(MHS_cat)
## -1.037143
Interpretation: This compares the two effects numerically using the slopes of linear trends fitted to the group medians, each expressed in hours per one-point increase in mental health score. The slope for median sleep duration is positive (≈ 0.5) but relatively small. In contrast, the slope for median social media usage is negative (≈ -1.0) and substantially larger in absolute value.
The difference in magnitude between these coefficients suggests that variation in social media usage across mental health scores is stronger than variation in sleep duration, consistent with the graphical evidence.
The remaining analysis therefore examines the relationship between social media and mental health in more detail.
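One detail of the slope estimation is worth checking: `as.numeric()` on an ordered factor returns the level codes 1 to 6, not the scores 4 to 9. Since both are evenly spaced with a step of one, the slope is the same either way. The sketch below verifies this on hypothetical medians (the values are assumptions for illustration) and compares the result with the closed-form slope, cov(x, y) / var(x).

```r
# Ordered factor with the same levels as MHS_cat, plus hypothetical medians
mhs <- factor(4:9, levels = 4:9, ordered = TRUE)
median_usage <- c(7.0, 6.1, 5.0, 4.2, 3.1, 2.0)

codes  <- as.numeric(mhs)                # level codes 1, 2, ..., 6
scores <- as.numeric(as.character(mhs))  # actual scores 4, 5, ..., 9

# Slope from each regression
slope_codes  <- coef(lm(median_usage ~ codes))[["codes"]]
slope_scores <- coef(lm(median_usage ~ scores))[["scores"]]

# Closed-form least-squares slope
slope_formula <- cov(scores, median_usage) / var(scores)

print(c(slope_codes, slope_scores, slope_formula))
```

Only the intercept differs between the two regressions, so the comparison of slopes is unaffected by the choice of coding.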
Is Heavy Social Media Use Always Associated with Worse Mental Health, or Does It Vary for Different Lengths of Sleep?
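To answer this question, sleep is categorized into the following groups (Zomers et al., 2017): short sleep (up to 5 hours), moderate sleep (more than 5 and up to 8 hours), and long sleep (more than 8 hours).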
# Create sleep duration groups
student_data <- student_data %>%
mutate(
Sleep_Group = cut( # create a new variable with the sleep groups
Sleep_Hours_Per_Night,
breaks = c(0, 5, 8, Inf), # cut points defining sleep categories
labels = c("Short sleep", "Moderate sleep", "Long sleep")
)
)
# Visualize social media usage across mental health scores by sleep group
ggplot(student_data,
aes(x = MHS_cat,
y = Avg_Daily_Usage_Hours)) +
geom_boxplot(fill = "white", color = "black") +
stat_summary( # create lines connecting the median values
fun = median,
geom = "line",
aes(group = 1),
color = "darkred",
linewidth = 1
) +
stat_summary( # create points indicating medians
fun = median,
geom = "point",
color = "darkred",
size = 2
) +
facet_wrap(~ Sleep_Group) + # create separate panels by sleep category
labs(
title = "Social Media Use Across Mental Health Scores by Sleep Duration",
x = "Mental Health Score",
y = "Average daily social media use (hours)"
)
Interpretation: The faceted boxplots show how the relationship between mental health scores and social media usage differs across sleep duration groups. Among students with short sleep, social media usage is consistently high and declines only modestly as mental health scores increase. Moreover, all students in the short sleep group report lower mental health scores (the maximum is 6). This indicates that limited sleep is associated with both elevated social media use and lower levels of mental health. In the moderate sleep group, a clear and steep negative relationship emerges: social media usage decreases substantially as mental health scores rise, suggesting that differences in mental health are strongly reflected in social media behavior when sleep duration is moderate. The negative relationship between social media and mental health is most pronounced in this group. In contrast, among students with long sleep, social media usage is uniformly low and varies relatively little across mental health scores, indicating that longer sleep may dampen the association between mental health and social media use.
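How the sleep groups treat values that fall exactly on a cut point can be checked on a few hand-picked numbers. With its defaults, `cut()` builds intervals that are closed on the right, so exactly 5 hours counts as short sleep and exactly 8 hours as moderate sleep. The sketch below uses made-up sleep hours and scores (assumptions for illustration) and also cross-tabulates the groups against the score, which shows how many observations sit behind each box.

```r
# Made-up sleep durations chosen to sit on and around the cut points
sleep_hours <- c(4.5, 5.0, 5.1, 8.0, 8.1)

# Same breaks and labels as in the analysis; intervals are (0,5], (5,8], (8,Inf]
sleep_group <- cut(
  sleep_hours,
  breaks = c(0, 5, 8, Inf),
  labels = c("Short sleep", "Moderate sleep", "Long sleep")
)
print(data.frame(sleep_hours, sleep_group))

# Made-up scores for the same five hypothetical students
mhs_cat <- factor(c(4, 5, 6, 7, 8), levels = 4:9, ordered = TRUE)

# Cell counts per sleep group and score; zeros mark empty panels or boxes
cell_counts <- table(sleep_group, mhs_cat)
print(cell_counts)
```

On the real data, `table(student_data$Sleep_Group, student_data$MHS_cat)` would give the corresponding cell counts.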
This exploratory analysis examined the relationship between mental health, sleep duration, and social media usage among students. We first showed that mental health scores are concentrated around moderate values, with relatively few students reporting very low or very high scores on mental health. We then documented systematic associations between mental health and both behavioral factors: higher mental health scores are associated with longer sleep duration and lower social media usage. Building on these results, a comparative analysis revealed that social media usage varies more strongly across mental health categories than sleep duration, indicating a steeper association. Finally, by stratifying the analysis by sleep duration, we showed that the relationship between mental health and social media usage is not uniform but depends on sleep: the association is strongest among students with moderate sleep and weakest among those with long sleep. While these findings are descriptive and do not establish causality, they provide a coherent picture of how sleep and social media behavior jointly relate to mental health outcomes.
References
Aminasalamt. (2025, December 3). Social_Media_Students_Dataset (2025). Kaggle.com. https://www.kaggle.com/datasets/aminasalamt/social-media-dataset-2025
Zomers, M. L., Hulsegge, G., van Oostrom, S. H., Proper, K. I., Verschuren, W. M. M., & Picavet, H. S. J. (2017). Characterizing Adult Sleep Behavior Over 20 Years—The Population-Based Doetinchem Cohort Study. Sleep, 40(7). https://doi.org/10.1093/sleep/zsx085