Image from https://stock.adobe.com/search?k=happy+youth
Introduction
I am looking to explore which factors have the greatest impact on an individual’s stress level. Mental health has become an increasingly important issue, especially among younger generations, as rates of suicide, anxiety, and depression rise. Determining which factors are most strongly associated with stress is a first step towards creating a mentally healthy society.
To do this, I will be using a dataset that examines the relationship between stress and social media use, along with a few other factors. The dataset contains 500 observations of 10 variables (3 categorical and 7 numerical). The variables studied include self-reported stress level (on a scale from 1-10), sleep quality, happiness level, daily screen time (hours), exercise frequency (out of a 7-day week), and days without social media.
Exploratory Data Analysis
Load the libraries
# Loading the necessary libraries
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(shiny)
Warning: package 'shiny' was built under R version 4.5.2
library(dplyr)
library(bslib)
Attaching package: 'bslib'
The following object is masked from 'package:utils':
page
Thankfully there are no NA values in the dataset, so we can proceed with statistical analysis after some basic cleaning.
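A check along the following lines is one way to confirm that a dataset has no missing values. This is a minimal sketch on a small made-up data frame: `toy` and its values are hypothetical stand-ins, not the real `mental_health` data, and one NA is planted on purpose so the check has something to find.

```r
# Hypothetical stand-in for the real dataset (values invented for illustration)
toy <- data.frame(
  stress_level  = c(6, 7, 8, 5),
  sleep_quality = c(7, 6, NA, 8)  # one NA planted deliberately
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# TRUE only when the whole data frame is free of NA values
no_missing <- !anyNA(toy)
no_missing
```

On the real data, every count would be zero and the second check would return TRUE.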
summary(mental_health$`Stress_Level(1-10)`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 6.000 7.000 6.618 8.000 10.000
The lowest self-reported stress level was 2 out of 10, the highest was 10, and the median was 7. The mean was a bit lower (6.618), which suggests the distribution is skewed towards the lower end, with a few low values pulling the mean below the median. This will be useful knowledge when examining which variables are impacting one’s stress level.
Renaming Variables to Make Coding Easier
Currently, the variable names in the dataset are more complicated than necessary and difficult to code with. To make things easier I will rename these variables.
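The renaming step could look roughly like this. It is a sketch only: the column name `Stress_Level(1-10)` appears in the summary call above, but `Sleep_Quality(1-10)` and the toy values are assumptions made for illustration, and the real dataset may need different rules.

```r
# Toy data frame; check.names = FALSE keeps the awkward original-style names.
# Only `Stress_Level(1-10)` is known from the text; the second name is assumed.
toy <- data.frame(
  `Stress_Level(1-10)`  = c(6, 7, 8),
  `Sleep_Quality(1-10)` = c(7, 6, 5),
  check.names = FALSE
)

# Strip a trailing "(...)" suffix, then lower-case what is left
clean_names <- tolower(gsub("\\(.*\\)$", "", names(toy)))
names(toy) <- clean_names
names(toy)
```

With names like these, columns can be referenced without backticks, for example `toy$stress_level`.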
Many of the numerical variables I am studying take discrete values ranging from 1-10, so boxplots will likely be easier to read than scatterplots. To create these visualizations, I need to coerce the numerical variables into factors. I will also create a new variable that simply states whether or not an individual exercises.
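As a rough sketch of this step, the code below coerces discrete numeric columns to factors and derives a yes/no exercise indicator. The toy values are invented, and treating "more than 0 days per week" as "yes" is an assumption, since the exact rule is not stated in the text.

```r
# Toy data with invented values; column names follow the cleaned names used later
toy <- data.frame(
  stress_level  = c(6, 7, 8, 5),
  sleep_quality = c(7, 6, 7, 9),
  exercise_7d   = c(0, 3, 5, 0)
)

# Keep the numeric version for regression and build a factor version for boxplots
toy_factor <- toy
toy_factor$sleep_quality <- factor(toy_factor$sleep_quality)
toy_factor$exercise_7d   <- factor(toy_factor$exercise_7d)

# Assumed rule: any exercise during the week counts as "yes"
toy_factor$exercise <- factor(ifelse(toy$exercise_7d > 0, "yes", "no"))

str(toy_factor)
```

Keeping two data frames mirrors the way the app below uses `mh_factor` for plotting while the regression uses the numeric `mh1`.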
Exploring the Relationship Between Variables and Stress Level
I am going to create a Shiny app that allows examination of different variables and how each one relates to an individual’s stress level. These variables will include daily screen time, sleep quality, exercise, days without social media, and happiness level.
# Define UI for stress level app ----
ui <- page_sidebar(
  # App title ----
  title = "Stress Level (1-10)",
  # Sidebar panel for inputs ----
  sidebar = sidebar(
    # Dropdown to select color of the graph
    selectInput("color", "Graph Color",
                choices = c("lightpink", "lightgreen", "steelblue", "lightblue", "orange")),
    # Input: Selector for variable to plot against stress_level ----
    selectInput(
      "variable", "Variable:",
      c("Time Spent on Social Media (hrs)" = "daily_screen_time",
        "Sleep Quality (1-10)" = "sleep_quality",
        "Exercise (yes/no)" = "exercise",
        "Exercise (7 days)" = "exercise_7d",
        "Days Without Social Media" = "days_without_social_media",
        "Happiness Level (1-10)" = "happiness_level")
    )
  ),
  # Input: Checkbox for whether outliers should be included ----
  checkboxInput("outliers", "Show outliers", TRUE),
  # Output: Formatted text for caption ----
  h3(textOutput("caption")),
  # Output: Plot of the requested variable against stress_level ----
  plotOutput("stressPlot")
)
# Define server logic to plot various variables against stress_level ----
server <- function(input, output) {
  # Compute the formula text ----
  # This is in a reactive expression since it is shared by the
  # output$caption and output$stressPlot functions
  formulaText <- reactive({
    paste("stress_level ~", input$variable)
  })
  # Return the formula text for printing as a caption ----
  output$caption <- renderText({
    formulaText()
  })
  # Generate a plot of the requested variable against stress_level ----
  # and only exclude outliers if requested
  output$stressPlot <- renderPlot({
    # Set the font family before drawing, rather than passing par() as a boxplot argument
    par(family = "serif")
    boxplot(as.formula(formulaText()),
            data = mh_factor,
            outline = input$outliers,
            col = input$color,
            pch = 19)
  })
}
# Create Shiny app ----
shinyApp(ui, server)
Shiny applications not supported in static R Markdown documents
The following conclusions can be reached based on each of the graphs.
Stress Level ~ Daily Screen Time
There seems to be a positive correlation between stress level and daily screen time. As daily screen time increases, the median, maximum, and minimum stress level all increase.
For those with a daily screen time of about 1 hour, the median stress level is around 3.5, while those with a screen time of almost 10 hours per day have a median stress level of 9.5.
Thus, while we cannot say whether daily screen time causes stress, we can determine that the two variables are correlated.
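The pattern read off the boxplots can also be summarized numerically. The sketch below uses invented values (not the real data) to show group medians, which correspond to the center line of each box, and a Spearman rank correlation. Choosing Spearman here is my assumption, made because the ratings are discrete and contain ties.

```r
# Made-up values for illustration only; not taken from the real dataset
toy <- data.frame(
  daily_screen_time = c(1, 1, 3, 3, 5, 5, 8, 8),
  stress_level      = c(3, 4, 5, 6, 6, 7, 9, 9)
)

# Median stress level within each screen-time group
medians <- tapply(toy$stress_level, toy$daily_screen_time, median)
medians

# Rank-based correlation between screen time and stress
rho <- cor(toy$daily_screen_time, toy$stress_level, method = "spearman")
rho
```

A value of `rho` near 1 indicates that stress tends to rise with screen time, which describes an association and not a cause.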
Stress Level ~ Sleep Quality
There seems to be a negative correlation between stress level and sleep quality. As sleep quality increases, the median stress level decreases.
However, for a sleep quality rating of 7, stress levels range from 2 all the way to 10, meaning that people reporting the same sleep quality had stress levels spanning nearly the whole observed range.
Stress Level ~ Exercise
Interestingly, those who exercise seem to have a higher median stress level than those who do not.
There is a wide range for both of the boxplots: those who exercise have stress levels ranging from 3-10, and those who do not range from 2-9. On a scale of 1-10, a range of 7 is fairly large.
Stress Level ~ Exercise 7 Days
There does not seem to be any clear pattern. Those who exercise 6 out of 7 days have the lowest median stress level of any category (5). This group also had the lowest maximum stress level (the maximum stress for someone in this category was 7).
Stress level ranged from 2-10 for those who exercised 1-4 days out of the week, with the median stress level being around 7 for each of those categories.
Since there is no clear pattern, the linear regression will determine whether or not this variable is significant.
Stress Level ~ Days Without Social Media
This graph also seems to have no clear pattern. The majority of the groups had the same median stress level of 7, whether they had gone 0 days without social media or 7. This is especially interesting given the strong correlation between stress level and screen time.
Again, many of the boxplots had fairly wide ranges.
Stress Level ~ Happiness Level
There is a strong negative correlation between self-reported happiness level and self-reported stress level. Those with high happiness levels had the lowest median stress, and those with low happiness levels had higher stress levels overall.
Multiple Linear Regression
To determine which variables have the greatest impact on an individual’s stress level, I will conduct a multiple linear regression. I will begin with the full model that includes all the numerical variables and then perform backwards elimination, removing the variable with the highest p-value at each step and checking whether the adjusted R-squared improves.
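One backwards elimination step can be sketched as follows. The data are simulated, and the names `sim`, `x1`, `x2`, and `noise` are hypothetical. The sketch shows the general mechanics (find the least significant predictor, refit without it, compare adjusted R-squared) and does not reproduce the analysis in this report.

```r
set.seed(1)

# Simulated data: y depends on x1 and x2, while `noise` is unrelated to y
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 - 1 * sim$x2 + rnorm(n)

# Full model with every candidate predictor
full <- lm(y ~ x1 + x2 + noise, data = sim)

# p-values of the predictors (the intercept row is dropped)
p_vals <- summary(full)$coefficients[-1, "Pr(>|t|)"]

# The predictor with the highest p-value is the candidate for removal
worst <- names(which.max(p_vals))

# Refit the model without that predictor
reduced <- update(full, as.formula(paste(". ~ . -", worst)))

# Compare adjusted R-squared before and after the removal
c(full = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
```

Repeating this step until the adjusted R-squared stops improving gives the manual procedure followed below. Base R's `step()` automates a similar search using AIC instead of p-values.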
# Fit multiple linear regression
model1 <- lm(stress_level ~ happiness_level + sleep_quality + daily_screen_time + exercise_7d + age + days_without_social_media, data = mh1)
# View the model summary
summary(model1)
Call:
lm(formula = stress_level ~ happiness_level + sleep_quality +
daily_screen_time + exercise_7d + age + days_without_social_media,
data = mh1)
Residuals:
Min 1Q Median 3Q Max
-2.62880 -0.60644 0.02399 0.62144 2.60318
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.056504 0.582543 12.113 <2e-16 ***
happiness_level -0.465836 0.040318 -11.554 <2e-16 ***
sleep_quality 0.107689 0.043772 2.460 0.0142 *
daily_screen_time 0.446822 0.040073 11.150 <2e-16 ***
exercise_7d 0.049826 0.029174 1.708 0.0883 .
age 0.002643 0.004170 0.634 0.5265
days_without_social_media 0.033313 0.022247 1.497 0.1349
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9215 on 493 degrees of freedom
Multiple R-squared: 0.6476, Adjusted R-squared: 0.6433
F-statistic: 151 on 6 and 493 DF, p-value: < 2.2e-16
Currently, the full model accounts for 64.33% of the variability in stress level (adjusted R-squared), and the overall model is highly significant (F-test p-value < 2.2e-16).
To improve the model, I will remove age. It is not statistically significant and has the highest p-value (0.5265).
# Fit multiple linear regression
model2 <- lm(stress_level ~ happiness_level + sleep_quality + daily_screen_time + exercise_7d + days_without_social_media, data = mh1)
# View the model summary
summary(model2)
Call:
lm(formula = stress_level ~ happiness_level + sleep_quality +
daily_screen_time + exercise_7d + days_without_social_media,
data = mh1)
Residuals:
Min 1Q Median 3Q Max
-2.61177 -0.60257 0.02747 0.62663 2.59216
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.13718 0.56812 12.563 <2e-16 ***
happiness_level -0.46398 0.04019 -11.545 <2e-16 ***
sleep_quality 0.10573 0.04364 2.423 0.0158 *
daily_screen_time 0.44710 0.04005 11.164 <2e-16 ***
exercise_7d 0.05102 0.02910 1.753 0.0802 .
days_without_social_media 0.03293 0.02223 1.482 0.1391
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.921 on 494 degrees of freedom
Multiple R-squared: 0.6473, Adjusted R-squared: 0.6438
F-statistic: 181.3 on 5 and 494 DF, p-value: < 2.2e-16
Removing age did not improve the model significantly; the adjusted R-squared only increased by about 0.0005 (from 0.6433 to 0.6438, or 0.05 percentage points). Now I will remove days_without_social_media to see how that influences the model’s accuracy. That variable is the least significant (p-value of 0.1391).
model3 <- lm(stress_level ~ happiness_level + sleep_quality + daily_screen_time + exercise_7d, data = mh1)
# View the model summary
summary(model3)
Call:
lm(formula = stress_level ~ happiness_level + sleep_quality +
daily_screen_time + exercise_7d, data = mh1)
Residuals:
Min 1Q Median 3Q Max
-2.68688 -0.60160 0.03212 0.60585 2.55534
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.22193 0.56591 12.762 <2e-16 ***
happiness_level -0.46137 0.04020 -11.478 <2e-16 ***
sleep_quality 0.10543 0.04369 2.413 0.0162 *
daily_screen_time 0.44690 0.04009 11.146 <2e-16 ***
exercise_7d 0.05089 0.02913 1.747 0.0813 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9221 on 495 degrees of freedom
Multiple R-squared: 0.6458, Adjusted R-squared: 0.6429
F-statistic: 225.6 on 4 and 495 DF, p-value: < 2.2e-16
Now the model has decreased in accuracy slightly: the adjusted R-squared fell from 0.6438 to 0.6429, so model 3 accounts for less of the variation in the data.
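Differences in adjusted R-squared this small are easier to judge side by side, and a partial F-test on nested models is a common complement. The sketch below uses simulated data with hypothetical names (`sim`, `m_big`, `m_small`). It illustrates the comparison technique and is not a re-run of the models above.

```r
set.seed(42)

# Simulated data: y depends on x1 and x2 only
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + 0.5 * sim$x2 + rnorm(n)

# Two nested models: the smaller one drops x3
m_big   <- lm(y ~ x1 + x2 + x3, data = sim)
m_small <- lm(y ~ x1 + x2, data = sim)

# Adjusted R-squared for each model, collected into one named vector
adj_r2 <- sapply(list(big = m_big, small = m_small),
                 function(m) summary(m)$adj.r.squared)
adj_r2

# Partial F-test: does adding x3 explain significantly more variance?
comparison <- anova(m_small, m_big)
comparison
```

A large p-value in the `anova()` table would indicate that the extra variable adds little, even if the adjusted R-squared shifts slightly.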
Summary
That means model 2 is the best model, with a highly significant overall F-test (p-value < 2.2e-16) and an adjusted R-squared showing that it explains 64.38% of the variability in stress level. Model 2 included the variables happiness_level, sleep_quality, daily_screen_time, exercise_7d, and days_without_social_media. Of those variables, 3 were significant at the 0.05 level (happiness_level, sleep_quality, and daily_screen_time), and 2 of those (happiness_level and daily_screen_time) were extremely significant (p < 2e-16).
I found that the variables most strongly associated with an individual’s stress level were happiness level and daily screen time, followed by sleep quality. Exercise frequency and days without social media remained in the model but were not significant at the 0.05 level. The equation shows a negative coefficient for happiness: as happiness increases, predicted stress decreases. The remaining coefficients are positive, so predicted stress rises as those variables increase, holding the others constant. The positive sleep quality coefficient runs opposite to the negative pattern seen in the boxplots, so its sign should be interpreted with caution. Because the data are observational, these results describe associations rather than causes. Even so, they suggest that spending less time on screens and supporting overall happiness are the most promising places to start in reducing an individual’s stress.
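To make the equation concrete, the sketch below plugs values into the model 2 coefficients. The coefficients are copied from the model 2 summary output above, while the individual in `person` is hypothetical, with values invented purely for illustration.

```r
# Coefficients copied from the model 2 summary output
coefs <- c(
  intercept                 = 7.13718,
  happiness_level           = -0.46398,
  sleep_quality             = 0.10573,
  daily_screen_time         = 0.44710,
  exercise_7d               = 0.05102,
  days_without_social_media = 0.03293
)

# Hypothetical individual (values invented for illustration)
person <- c(happiness_level = 8, sleep_quality = 7, daily_screen_time = 5,
            exercise_7d = 3, days_without_social_media = 2)

# Predicted stress = intercept + sum of (coefficient * value)
predicted <- coefs[["intercept"]] + sum(coefs[names(person)] * person)
predicted

# Same individual with one more hour of daily screen time
person_more_screen <- person
person_more_screen["daily_screen_time"] <- person["daily_screen_time"] + 1
predicted_more <- coefs[["intercept"]] +
  sum(coefs[names(person_more_screen)] * person_more_screen)

# The difference equals the daily_screen_time coefficient
predicted_more - predicted
```

With the fitted object available, `predict(model2, newdata = ...)` would give the same kind of prediction directly.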
These findings are supported by the visualizations created through the shiny app. Variables that had a strong correlation with stress were indeed significant and influenced the multiple linear model. I would’ve loved to create a high chart to visualize trends in the data, but unfortunately that was not possible with the type of data in my dataset.
Citations
Image from https://stock.adobe.com/search?k=happy+youth