Project 1: The Relationship Between Social Interaction, Stress, and Happiness

Author

Maisha Ann Subin

Mental health is influenced by various lifestyle factors, including sleep, diet, exercise, and social interactions. This data set, sourced from “Global wellness surveys and mental health research studies (2019-2024)”, includes responses from individuals worldwide, covering 12 key variables such as social interaction, stress levels, anxiety, and happiness index.

Initially, I planned to explore the link between sleep and happiness, but after analyzing the data, I shifted my focus to how social interaction affects happiness and whether stress plays a role. This study aims to uncover patterns in social engagement and mental well-being using data visualization and statistical analysis.

Note:

Social Interaction Score: Frequency of socializing on a 1-10 scale.
Happiness Score: Overall happiness rating on a 1-10 scale.
Stress Level: Self-reported stress (Low, Moderate, or High).

# Loading necessary libraries and data sets
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggfortify)
library(DataExplorer)
library(RColorBrewer)
library(ggplot2)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(GGally)

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Mental_Health<- read_csv("/Users/maishasubin/Desktop/DATA110/Mental_Health_Lifestyle_Dataset.csv")

Rows: 3000 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Country, Gender, Exercise Level, Diet Type, Stress Level, Mental He...
dbl (6): Age, Sleep Hours, Work Hours per Week, Screen Time per Day (Hours),...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Mental_Health)

# A tibble: 6 × 12
  Country   Age Gender `Exercise Level` `Diet Type` `Sleep Hours` `Stress Level`
  <chr>   <dbl> <chr>  <chr>            <chr>               <dbl> <chr>         
1 Brazil     48 Male   Low              Vegetarian            6.3 Low           
2 Austra…    31 Male   Moderate         Vegan                 4.9 Low           
3 Japan      37 Female Low              Vegetarian            7.2 High          
4 Brazil     35 Male   Low              Vegan                 7.2 Low           
5 Germany    46 Male   Low              Balanced              7.3 Low           
6 Japan      23 Other  Moderate         Balanced              2.7 Moderate      
# ℹ 5 more variables: `Mental Health Condition` <chr>,
#   `Work Hours per Week` <dbl>, `Screen Time per Day (Hours)` <dbl>,
#   `Social Interaction Score` <dbl>, `Happiness Score` <dbl>

# Renaming variables
Mental_Health <- Mental_Health |>
  rename(
    Happiness_Score = `Happiness Score`,
    Sleep_Hours = `Sleep Hours`,
    Diet_Type = `Diet Type`,
    Exercise_Level = `Exercise Level`,
    Mental_Health_Condition = `Mental Health Condition`,
    Stress_Level = `Stress Level`,
    Work_Hours = `Work Hours per Week`,
    Screen_Time = `Screen Time per Day (Hours)`,
    Social_Interaction = `Social Interaction Score`
  )

# Removing rows with missing Social_Interaction or Happiness_Score values
Mental_Health_nona <- Mental_Health |> 
  filter(!is.na(Social_Interaction) & !is.na(Happiness_Score))
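
Before filtering, it can help to see how many values are actually missing in each column. The following is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative assumptions, not the survey data); the same calls should work on `Mental_Health`.

```r
# Toy data standing in for Mental_Health (values are made up for illustration)
toy <- data.frame(
  Social_Interaction = c(5.2, NA, 7.1, 3.3),
  Happiness_Score    = c(6.0, 4.5, NA, 8.1),
  Sleep_Hours        = c(7.2, 6.3, 5.5, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows that are complete for every variable used in the model
model_vars <- c("Social_Interaction", "Happiness_Score", "Sleep_Hours")
toy_complete <- toy[complete.cases(toy[, model_vars]), ]
print(nrow(toy_complete))
```

Checking all model variables at once avoids relying on `lm()` to silently drop incomplete rows.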

Equation for my model:

Happiness_Score = 5.35 + 0.0279(Sleep_Hours) − 0.0411(Social_Interaction) + 0.0026(Work_Hours) + 0.0246(Screen_Time) − 0.0033(Age)

Analyzing model based on p-values:

From the output below, Social Interaction (p = 0.0243) is the only statistically significant predictor at the 0.05 level, suggesting a weak negative relationship with happiness. The overall F-test (p = 0.1634), however, is not significant, so this result should be interpreted with caution.

Sleep Hours, Work Hours, Screen Time, and Age all have p > 0.05, so they are not statistically significant predictors of happiness in this model.

Adjusted R^2 value:

Adjusted R^2 = 0.00096 (very low), indicating that, after adjusting for the number of predictors, the model explains less than 0.1% of the variation in Happiness Score.
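
As a quick arithmetic check, the adjusted R^2 and the overall F-test can be reproduced by hand from the numbers reported by summary(lm_model) in this report. This is only a sketch: the inputs are the rounded values copied from that output, so the results agree only up to rounding.

```r
# Values copied from the summary(lm_model) output (rounded as printed)
r2 <- 0.002625   # Multiple R-squared
n  <- 3000       # observations: 2994 residual df + 6 estimated coefficients
p  <- 5          # number of predictors

# Adjusted R^2 penalizes R^2 for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic and its p-value (tests all slopes = 0 at once)
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
f_pval <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(c(adj_r2 = adj_r2, f_stat = f_stat, f_pval = f_pval))
```

The computed values should land close to the reported 0.0009593, 1.576, and 0.1634.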

# Run linear regression model with five quantitative predictors
lm_model <- lm(Happiness_Score ~ Sleep_Hours + Social_Interaction + Work_Hours + Screen_Time + Age, data = Mental_Health_nona )
# Summary of model
summary(lm_model)


Call:
lm(formula = Happiness_Score ~ Sleep_Hours + Social_Interaction + 
    Work_Hours + Screen_Time + Age, data = Mental_Health_nona)

Residuals:
   Min     1Q Median     3Q    Max 
-4.567 -2.195  0.016  2.134  4.764 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.347220   0.344484  15.522   <2e-16 ***
Sleep_Hours         0.027929   0.031138   0.897   0.3698    
Social_Interaction -0.041095   0.018230  -2.254   0.0243 *  
Work_Hours          0.002591   0.004078   0.635   0.5252    
Screen_Time         0.024639   0.026730   0.922   0.3567    
Age                -0.003296   0.003480  -0.947   0.3437    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.556 on 2994 degrees of freedom
Multiple R-squared:  0.002625,  Adjusted R-squared:  0.0009593 
F-statistic: 1.576 on 5 and 2994 DF,  p-value: 0.1634

# Selecting only the relevant columns for the correlation heatmap
selected_columns <- Mental_Health_nona[, c("Happiness_Score", "Sleep_Hours", "Social_Interaction", "Work_Hours", "Screen_Time")]

plot_correlation(selected_columns)

# Plotting diagnostic plots for the full model (lm_model, fitted above)
autoplot(lm_model, which = 1:4, nrow = 2, ncol = 2)

# Fitting a model with Social_Interaction only (slightly higher adjusted R^2, p-value < 0.05), then plotting its diagnostics
Diagnostic_Plot2 <- lm(Happiness_Score ~ Social_Interaction, data = Mental_Health_nona)
summary(Diagnostic_Plot2)


Call:
lm(formula = Happiness_Score ~ Social_Interaction, data = Mental_Health_nona)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5103 -2.1911  0.0019  2.1375  4.7787 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.61465    0.10998  51.050   <2e-16 ***
Social_Interaction -0.04014    0.01821  -2.205   0.0275 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.556 on 2998 degrees of freedom
Multiple R-squared:  0.001619,  Adjusted R-squared:  0.001286 
F-statistic: 4.861 on 1 and 2998 DF,  p-value: 0.02754

autoplot(Diagnostic_Plot2, 1:4, nrow=2, ncol=2)
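
For a regression with a single predictor, R^2 equals the squared Pearson correlation, so the correlation between Social_Interaction and Happiness_Score can be recovered from the summary(Diagnostic_Plot2) output. A small sketch, using the rounded values printed in that output:

```r
# Values copied from summary(Diagnostic_Plot2) (rounded as printed)
r2_simple <- 0.001619   # Multiple R-squared
slope     <- -0.04014   # Social_Interaction estimate

# The correlation takes the sign of the slope and the magnitude sqrt(R^2)
r <- sign(slope) * sqrt(r2_simple)
print(round(r, 2))
```

The result matches the correlation of about -0.04 discussed in the Visualization Insights section.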

# Create the ggplot object with smoothing line
p <- ggplot(Mental_Health_nona, aes(x = Social_Interaction, y = Happiness_Score, color = Stress_Level)) +
  geom_point(alpha = 0.6) +  # Scatter plot with semi-transparent points
  geom_smooth(method = "loess", se = FALSE, linetype = "solid", linewidth = 1) +  # LOESS smoothing line per stress level
  labs(title = "Interactive Plot of Happiness Score vs. Social Interaction by Stress Level",
       x = "Social Interaction",
       y = "Happiness Score",
       color = "Stress Level") +  # Caption is added later as a Plotly annotation
  theme_minimal() +
  theme(legend.position = "none") +  # Removing legend
  scale_color_brewer(palette = "Set1")  # Using a color palette from RColorBrewer

# Converting the ggplot object to a plotly interactive plot
interactive_plot <- ggplotly(p)

`geom_smooth()` using formula = 'y ~ x'

# Adding the caption as a Plotly annotation, since ggplotly() does not carry over a caption set in labs()
interactive_plot <- interactive_plot %>%
  layout(annotations = list(
    text = "Source: Global wellness surveys & mental health research studies",
    x = 1.00,  # Right edge of the plotting area (paper coordinates)
    y = -0.1,  # Small negative value places it just below the plot
    xref = "paper", yref = "paper",
    showarrow = FALSE,
    font = list(size = 9, color = "black", family = "Arial"),
    align = "right"
  )) # Annotation code generated with ChatGPT. Prompt: "Generate a code in R for inserting caption with Plotly if caption going out of grid"

# Display the interactive plot
interactive_plot

Data Cleaning

To prepare the data set for analysis, I first renamed variables for clarity and consistency. I then removed rows with missing Social Interaction or Happiness Score values. The regression output (2,994 residual degrees of freedom with six estimated coefficients) shows that all 3,000 rows were retained for modeling.

Visualization Insights

I chose to explore the relationship between social interaction and happiness because, among all quantitative variables, social interaction had the largest (though still very weak) correlation with happiness, at -0.04. This negative correlation was surprising, as one would typically expect happiness to increase with more social interaction. This led me to investigate whether stress plays a role in this unexpected trend.
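
One way to test whether stress changes the relationship, rather than only coloring points by stress level, is to compare a model with a common slope against one where the slope varies by stress level. The sketch below uses simulated data (`sim` is a made-up stand-in, not the survey data, and the seed is arbitrary); with the real data, `Mental_Health_nona` would take its place.

```r
set.seed(110)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in for Mental_Health_nona (NOT the real survey data)
sim <- data.frame(
  Social_Interaction = runif(300, 1, 10),
  Stress_Level = factor(sample(c("Low", "Moderate", "High"), 300, replace = TRUE),
                        levels = c("Low", "Moderate", "High"))
)
sim$Happiness_Score <- 5.5 - 0.04 * sim$Social_Interaction + rnorm(300, sd = 2.5)

# Additive model: one common slope, separate intercepts per stress level
fit_additive <- lm(Happiness_Score ~ Social_Interaction + Stress_Level, data = sim)

# Interaction model: the slope is allowed to differ by stress level
fit_interaction <- lm(Happiness_Score ~ Social_Interaction * Stress_Level, data = sim)

# F-test comparing the two nested models
comparison <- anova(fit_additive, fit_interaction)
print(comparison)
```

A small p-value in the comparison would suggest that the effect of social interaction on happiness depends on stress level; a large one would suggest a common slope is adequate.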

Challenges and Limitations

I initially hoped to create a more advanced visualization but faced difficulties due to weak correlations. I also attempted to use 'Highcharter' for interactive plotting but struggled to incorporate a smoothing line like the one in 'ggplotly'. Additionally, I explored 3D visualizations, but with 3,000 observations the points clustered too densely to read. To manage this, I tried random sampling (500 rows), but each run drew a different sample and so produced different p-values and intercepts, making the results inconsistent. Due to time constraints, I decided to work with the full data set instead.
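
The run-to-run differences in a random sample can be avoided by fixing the random seed before sampling. The following is a minimal sketch on simulated data; `sim`, the helper `fit_on_sample()`, and the seed values are all hypothetical and not part of the original analysis.

```r
# Simulated stand-in for the full data set (NOT the real survey data)
set.seed(1)
sim <- data.frame(Social_Interaction = runif(3000, 1, 10))
sim$Happiness_Score <- 5.5 - 0.04 * sim$Social_Interaction + rnorm(3000, sd = 2.5)

# Hypothetical helper: draw n rows with a fixed seed, then refit the model
fit_on_sample <- function(data, n = 500, seed = 2025) {
  set.seed(seed)                    # same seed -> same rows every run
  rows <- sample(nrow(data), n)     # sample row indices without replacement
  lm(Happiness_Score ~ Social_Interaction, data = data[rows, ])
}

fit_a <- fit_on_sample(sim)
fit_b <- fit_on_sample(sim)

# Both fits use the same 500 rows, so their coefficients agree
print(all.equal(coef(fit_a), coef(fit_b)))
```

Fixing the seed makes a single sample reproducible; it does not remove sampling variability, so results from a 500-row sample can still differ from those of the full data set.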

Overall, despite these challenges, this project helped me refine my data analysis skills, understand the complexities of visualizing weak correlations, and identify what I should do differently in future projects.