Mental health is influenced by various lifestyle factors, including sleep, diet, exercise, and social interactions. This data set, sourced from “Global wellness surveys and mental health research studies (2019-2024)”, includes responses from individuals worldwide, covering 12 key variables such as social interaction, stress levels, anxiety, and happiness index.
Initially, I planned to explore the link between sleep and happiness, but after analyzing the data, I shifted my focus to how social interaction affects happiness and whether stress plays a role. This study aims to uncover patterns in social engagement and mental well-being using data visualization and statistical analysis.
Note:
Social Interaction Level: Frequency of socializing on a 1-10 scale.
library(GGally)
Rows: 3000 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Country, Gender, Exercise Level, Diet Type, Stress Level, Mental He...
dbl (6): Age, Sleep Hours, Work Hours per Week, Screen Time per Day (Hours),...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Mental_Health)
# A tibble: 6 × 12
  Country   Age Gender `Exercise Level` `Diet Type` `Sleep Hours` `Stress Level`
  <chr>   <dbl> <chr>  <chr>            <chr>               <dbl> <chr>
1 Brazil     48 Male   Low              Vegetarian            6.3 Low
2 Austra…    31 Male   Moderate         Vegan                 4.9 Low
3 Japan      37 Female Low              Vegetarian            7.2 High
4 Brazil     35 Male   Low              Vegan                 7.2 Low
5 Germany    46 Male   Low              Balanced              7.3 Low
6 Japan      23 Other  Moderate         Balanced              2.7 Moderate
# ℹ 5 more variables: `Mental Health Condition` <chr>,
#   `Work Hours per Week` <dbl>, `Screen Time per Day (Hours)` <dbl>,
#   `Social Interaction Score` <dbl>, `Happiness Score` <dbl>
# Renaming variables
Mental_Health <- Mental_Health |>
  rename(
    Happiness_Score = `Happiness Score`,
    Sleep_Hours = `Sleep Hours`,
    Diet_Type = `Diet Type`,
    Exercise_Level = `Exercise Level`,
    Mental_Health_Condition = `Mental Health Condition`,
    Stress_Level = `Stress Level`,
    Work_Hours = `Work Hours per Week`,
    Screen_Time = `Screen Time per Day (Hours)`,
    Social_Interaction = `Social Interaction Score`
  )

# Removing missing values
Mental_Health_nona <- Mental_Health |>
  filter(!is.na(Social_Interaction) & !is.na(Happiness_Score))
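From the output below, Social Interaction (p = 0.0243) is statistically significant at the 0.05 level. Its coefficient of -0.041 indicates a weak negative relationship with happiness: each one-point increase in social interaction is associated with a decrease of about 0.04 points in Happiness Score, holding the other predictors constant.
Sleep Hours, Work Hours, Screen Time, and Age all have p > 0.05, so they are not statistically significant predictors of happiness in this model. The overall F-test (p = 0.1634) is also not significant, so the model as a whole does not explain happiness significantly better than an intercept-only model.
Adjusted R^2 value:
Adjusted R^2 = 0.00096 (very low), indicating that the model explains less than 0.1% of the variation in Happiness Score after adjusting for the number of predictors.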
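As a quick check on the adjusted R^2 figure, the value can be recomputed by hand from the numbers in the model summary. This is a minimal sketch: the variable names are illustrative, and n = 3000 is inferred from the 2994 residual degrees of freedom plus the 6 estimated coefficients.

```r
# Values copied from the model summary in this report
r_squared <- 0.002625  # Multiple R-squared
n <- 3000              # observations: 2994 residual df + 6 estimated coefficients
p <- 5                 # number of predictors, excluding the intercept

# Adjusted R^2 penalizes R^2 for the number of predictors in the model
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Rounds to roughly 0.00096, in line with the summary output
round(adj_r_squared, 5)
```

Because the penalty factor (n - 1) / (n - p - 1) is close to 1 with 3000 rows, the adjusted value is small mainly because R^2 itself is small, not because the model has too many predictors.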
# Run linear regression model with five quantitative predictors
lm_model <- lm(Happiness_Score ~ Sleep_Hours + Social_Interaction + Work_Hours + Screen_Time + Age,
               data = Mental_Health_nona)

# Summary of model
summary(lm_model)
Call:
lm(formula = Happiness_Score ~ Sleep_Hours + Social_Interaction +
    Work_Hours + Screen_Time + Age, data = Mental_Health_nona)

Residuals:
   Min     1Q Median     3Q    Max
-4.567 -2.195  0.016  2.134  4.764

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)         5.347220   0.344484  15.522   <2e-16 ***
Sleep_Hours         0.027929   0.031138   0.897   0.3698
Social_Interaction -0.041095   0.018230  -2.254   0.0243 *
Work_Hours          0.002591   0.004078   0.635   0.5252
Screen_Time         0.024639   0.026730   0.922   0.3567
Age                -0.003296   0.003480  -0.947   0.3437
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.556 on 2994 degrees of freedom
Multiple R-squared: 0.002625, Adjusted R-squared: 0.0009593
F-statistic: 1.576 on 5 and 2994 DF, p-value: 0.1634
# Selecting only the relevant columns for the pair plot to analyze correlation
selected_columns <- Mental_Health_nona[, c("Happiness_Score", "Sleep_Hours", "Social_Interaction", "Work_Hours", "Screen_Time")]
plot_correlation(selected_columns)
# Diagnostic plots for the full model (lm_model, the five-predictor fit summarized above)
autoplot(lm_model, 1:4, nrow = 2, ncol = 2)
# Refitting with Social Interaction as the only predictor before plotting diagnostics
# (slightly higher adjusted R^2 than the full model, and p-value still < 0.05)
Diagnostic_Plot2 <- lm(Happiness_Score ~ Social_Interaction, data = Mental_Health_nona)
summary(Diagnostic_Plot2)
Call:
lm(formula = Happiness_Score ~ Social_Interaction, data = Mental_Health_nona)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5103 -2.1911  0.0019  2.1375  4.7787

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         5.61465    0.10998  51.050   <2e-16 ***
Social_Interaction -0.04014    0.01821  -2.205   0.0275 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.556 on 2998 degrees of freedom
Multiple R-squared: 0.001619, Adjusted R-squared: 0.001286
F-statistic: 4.861 on 1 and 2998 DF, p-value: 0.02754
autoplot(Diagnostic_Plot2, 1:4, nrow=2, ncol=2)
# Create the ggplot object with smoothing line
p <- ggplot(Mental_Health_nona, aes(x = Social_Interaction, y = Happiness_Score, color = Stress_Level)) +
  geom_point(alpha = 0.6) + # Scatter plot with semi-transparent points
  geom_smooth(method = "loess", se = FALSE, linetype = "solid", linewidth = 1) + # LOESS smoothing line; linewidth replaces size, which is deprecated for lines since ggplot2 3.4.0
  labs(
    title = "Interactive Plot of Happiness Score vs. Social Interaction by Stress Level",
    x = "Social Interaction",
    y = "Happiness Score",
    color = "Stress Level"
  ) + # No caption in labs(); it is added later as a Plotly annotation
  theme_minimal() +
  theme(legend.position = "none") + # Removing legend
  scale_color_brewer(palette = "Set1") # Using a color palette from RColorBrewer
# Converting the ggplot object to a plotly interactive plot
interactive_plot <- ggplotly(p)
`geom_smooth()` using formula = 'y ~ x'
# Adding the caption as an annotation in Plotly, since a caption set in labs() did not display correctly with Plotly
interactive_plot <- interactive_plot %>%
  layout(annotations = list(
    text = "Source: Global wellness surveys & mental health research studies",
    x = 1.00, # Position horizontally at the right edge of the plot area
    y = -0.1, # Small negative value to place it just below the plot
    xref = "paper", yref = "paper",
    showarrow = FALSE,
    font = list(size = 9, color = "black", family = "Arial"),
    align = "right"
  ))
# Used AI (ChatGPT) to generate the code for the annotation in Plotly. Prompt: "Generate a code in R for inserting caption with Plotly if caption going out of grid"

# Display the interactive plot
interactive_plot
Data Cleaning
To prepare the data set for analysis, I first renamed variables for clarity and consistency. I then handled missing values by removing rows with NA in Social Interaction or Happiness Score, so that both variables were complete for the analysis.
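It can be useful to confirm how many rows a missing-value filter actually removes. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the survey data) and base R's `complete.cases()` rather than the `filter()` call used in the report. In the report itself, both regression summaries imply 3000 observations (2994 residual degrees of freedom plus 6 coefficients, and 2998 plus 2), which matches the 3000 rows read in, so the filter appears to have removed no rows.

```r
# Hypothetical toy data standing in for the two survey columns
toy <- data.frame(
  Social_Interaction = c(5.2, NA, 7.1, 3.3),
  Happiness_Score    = c(6.0, 4.5, NA, 8.1)
)

# Keep only rows where both variables are present
keep <- complete.cases(toy[, c("Social_Interaction", "Happiness_Score")])
toy_nona <- toy[keep, ]

# Number of rows removed by the filter
rows_dropped <- nrow(toy) - nrow(toy_nona)
rows_dropped
```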
Visualization Insights
I chose to explore the relationship between social interaction and happiness because, among all quantitative variables, social interaction had the strongest (though still very weak) correlation with happiness, at -0.04. This negative correlation was surprising, as one would typically expect happiness to increase with more social interaction. This led me to investigate whether stress plays a role in this unexpected trend.
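The -0.04 correlation and the single-predictor regression describe the same relationship: with one predictor, R^2 equals the squared Pearson correlation. The sketch below illustrates this on simulated data (the `x` and `y` vectors and the seed are made up, not the survey data) and then applies it to the R^2 of 0.001619 reported for the Social Interaction model.

```r
set.seed(1)  # arbitrary seed so the simulated numbers are reproducible

# Simulated stand-in data with a weak negative relationship (not the survey data)
x <- runif(200, min = 1, max = 10)
y <- 5.6 - 0.04 * x + rnorm(200, sd = 2.5)

r <- cor(x, y)
fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared

# For a single predictor, R^2 is the squared Pearson correlation
all.equal(r^2, r_squared)

# Applied to the report's single-predictor summary (R^2 = 0.001619, negative slope)
r_report <- -sqrt(0.001619)
round(r_report, 2)
```

The recovered value rounds to -0.04, consistent with the correlation cited above.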
Challenges and Limitations
I initially hoped to create a more advanced visualization but faced difficulties due to weak correlations. I also attempted using ‘Highcharter’ for interactive plotting but struggled to incorporate a smoothing line like in ‘ggplotly’. Additionally, I explored 3D visualizations, but the data set was too large, causing clustering issues. To manage this, I tried random sampling (500 values), but each run produced different p-values and intercepts, making results inconsistent. Due to time constraints, I decided to work with the full data set instead.
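One way to make a random sample repeatable is to fix the random seed before drawing it. The sketch below is only an illustration under stated assumptions: `toy` is simulated data, `sample_rows()` is a hypothetical helper, and the seed values are arbitrary. Fixing the seed makes one particular sample reproducible. It does not remove the underlying sampling variability, so a different seed would still give different p-values.

```r
# Simulated stand-in for the 3000-row data set (not the survey data)
set.seed(42)
toy <- data.frame(
  Social_Interaction = runif(3000, min = 1, max = 10),
  Happiness_Score    = runif(3000, min = 1, max = 10)
)

# Hypothetical helper: draw n rows after fixing the seed so the draw is repeatable
sample_rows <- function(data, n, seed) {
  set.seed(seed)
  data[sample(nrow(data), n), ]
}

s1 <- sample_rows(toy, 500, seed = 123)
s2 <- sample_rows(toy, 500, seed = 123)

# Same seed gives the same rows, so the fitted coefficients match exactly
fit1 <- lm(Happiness_Score ~ Social_Interaction, data = s1)
fit2 <- lm(Happiness_Score ~ Social_Interaction, data = s2)
identical(coef(fit1), coef(fit2))
```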
Overall, while I faced challenges, this project helped me refine my data analysis skills, understand the complexities of visualizing weak correlations effectively, and identify what to do differently in future projects to avoid these problems.