Project #2 - DATA 110

Author

Kalina Peterson

What Most Impacts an Individual’s Stress Level?

Image from https://stock.adobe.com/search?k=happy+youth

Introduction

I am looking to explore which factors have the greatest impact on an individual’s stress level. Mental health has become an increasingly important issue, especially among the younger generations, due to increasing suicide, anxiety, and depression rates. If I can determine which factors impact stress the most, we can take the first steps towards creating a mentally healthy society.

To do this, I will be using a dataset that examines the correlation between stress and social media, along with a few other factors. The dataset contains 500 observations of 10 variables (3 categorical and 7 numerical). The variables studied include self-reported stress level (on a scale from 1-10), sleep quality, happiness level, daily screen time (hours), exercise frequency (days out of a 7-day week), and days without social media.

Exploratory Data Analysis

Load the libraries

# Loading the necessary libraries
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(shiny)
Warning: package 'shiny' was built under R version 4.5.2
library(dplyr)
library(bslib)

Attaching package: 'bslib'

The following object is masked from 'package:utils':

    page

Load the data

setwd("C:/Users/kpeter81/OneDrive - montgomerycollege.edu/Datasets")
mental_health <- read_csv("mental_health.csv")

Examining structure and NA values

str(mental_health)
spc_tbl_ [500 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ User_ID                  : chr [1:500] "U001" "U002" "U003" "U004" ...
 $ Age                      : num [1:500] 44 30 23 36 34 38 26 26 39 39 ...
 $ Gender                   : chr [1:500] "Male" "Other" "Other" "Female" ...
 $ Daily_Screen_Time(hrs)   : num [1:500] 3.1 5.1 7.4 5.7 7 6.6 7.8 7.4 4.7 6.6 ...
 $ Sleep_Quality(1-10)      : num [1:500] 7 7 6 7 4 5 4 5 7 6 ...
 $ Stress_Level(1-10)       : num [1:500] 6 8 7 8 7 7 8 6 7 8 ...
 $ Days_Without_Social_Media: num [1:500] 2 5 1 1 5 4 2 1 6 0 ...
 $ Exercise_Frequency(week) : num [1:500] 5 3 3 1 1 3 0 4 1 2 ...
 $ Social_Media_Platform    : chr [1:500] "Facebook" "LinkedIn" "YouTube" "TikTok" ...
 $ Happiness_Index(1-10)    : num [1:500] 10 10 6 8 8 8 7 7 9 7 ...
 - attr(*, "spec")=
  .. cols(
  ..   User_ID = col_character(),
  ..   Age = col_double(),
  ..   Gender = col_character(),
  ..   `Daily_Screen_Time(hrs)` = col_double(),
  ..   `Sleep_Quality(1-10)` = col_double(),
  ..   `Stress_Level(1-10)` = col_double(),
  ..   Days_Without_Social_Media = col_double(),
  ..   `Exercise_Frequency(week)` = col_double(),
  ..   Social_Media_Platform = col_character(),
  ..   `Happiness_Index(1-10)` = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
colSums(is.na(mental_health))
                  User_ID                       Age                    Gender 
                        0                         0                         0 
   Daily_Screen_Time(hrs)       Sleep_Quality(1-10)        Stress_Level(1-10) 
                        0                         0                         0 
Days_Without_Social_Media  Exercise_Frequency(week)     Social_Media_Platform 
                        0                         0                         0 
    Happiness_Index(1-10) 
                        0 

Thankfully there are no NA values in the dataset, so we can proceed with statistical analysis after some basic cleaning.

summary(mental_health$`Stress_Level(1-10)`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   6.000   7.000   6.618   8.000  10.000 

The lowest self-reported stress level was 2 out of 10, the highest was 10, and the median was 7. The mean was slightly lower (6.618), which suggests that a few low values pull the distribution toward the lower end. This will be useful to keep in mind when examining which variables are associated with stress level.
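One way to check the skew suggested by the summary is to compare the mean and median directly and to count values outside the common 1.5 × IQR fences. The following is a minimal sketch on a small made-up vector (`stress` here is hypothetical, not the project data):

```r
# Hypothetical stress scores for illustration only (not the project data)
stress <- c(2, 3, 6, 6, 7, 7, 7, 8, 8, 8, 9, 10)

# A mean below the median hints at a longer tail on the low end
skew_hint <- mean(stress) - median(stress)

# Tukey fences: values beyond 1.5 * IQR from the quartiles are flagged
q <- quantile(stress, c(0.25, 0.75))
fence_low  <- q[[1]] - 1.5 * IQR(stress)
fence_high <- q[[2]] + 1.5 * IQR(stress)
low_outliers  <- stress[stress < fence_low]
high_outliers <- stress[stress > fence_high]

skew_hint
low_outliers
high_outliers
```

A negative `skew_hint` together with flagged values only on the low side would be consistent with the interpretation above.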

Renaming Variables to Make Coding Easier

Currently, the variable names in the dataset are more complicated than necessary and difficult to code with. To make things easier I will rename these variables.

colnames(mental_health) <- tolower(colnames(mental_health))

mh1 <- mental_health |>
  rename(happiness_level = `happiness_index(1-10)`, 
         stress_level = `stress_level(1-10)`,
         sleep_quality = `sleep_quality(1-10)`,
         exercise_7d =  `exercise_frequency(week)`,
         daily_screen_time = `daily_screen_time(hrs)`) 
head(mh1)
# A tibble: 6 × 10
  user_id   age gender daily_screen_time sleep_quality stress_level
  <chr>   <dbl> <chr>              <dbl>         <dbl>        <dbl>
1 U001       44 Male                 3.1             7            6
2 U002       30 Other                5.1             7            8
3 U003       23 Other                7.4             6            7
4 U004       36 Female               5.7             7            8
5 U005       34 Female               7               4            7
6 U006       38 Male                 6.6             5            7
# ℹ 4 more variables: days_without_social_media <dbl>, exercise_7d <dbl>,
#   social_media_platform <chr>, happiness_level <dbl>

Mutating factor variables for Visualization

Many of the numerical variables I am studying take discrete values (for example, ratings from 1 to 10), so boxplots will likely be easier to read than scatterplots. To create these visualizations, I need to coerce the discrete numerical predictors into factors; stress_level stays numeric because it is the response. I will also create a new variable that states whether or not an individual exercises at all.

# coercing discrete numerical variables into 
# factor variables for visualization purposes.
mh_factor <- mh1 |>
  mutate(across(c(sleep_quality, exercise_7d, days_without_social_media, happiness_level), as.factor)) |>
    mutate(exercise = ifelse(exercise_7d %in% c(1,2,3,4,5,6,7), "yes", "no"))
str(mh_factor)
tibble [500 × 11] (S3: tbl_df/tbl/data.frame)
 $ user_id                  : chr [1:500] "U001" "U002" "U003" "U004" ...
 $ age                      : num [1:500] 44 30 23 36 34 38 26 26 39 39 ...
 $ gender                   : chr [1:500] "Male" "Other" "Other" "Female" ...
 $ daily_screen_time        : num [1:500] 3.1 5.1 7.4 5.7 7 6.6 7.8 7.4 4.7 6.6 ...
 $ sleep_quality            : Factor w/ 9 levels "2","3","4","5",..: 6 6 5 6 3 4 3 4 6 5 ...
 $ stress_level             : num [1:500] 6 8 7 8 7 7 8 6 7 8 ...
 $ days_without_social_media: Factor w/ 9 levels "0","1","2","3",..: 3 6 2 2 6 5 3 2 7 1 ...
 $ exercise_7d              : Factor w/ 8 levels "0","1","2","3",..: 6 4 4 2 2 4 1 5 2 3 ...
 $ social_media_platform    : chr [1:500] "Facebook" "LinkedIn" "YouTube" "TikTok" ...
 $ happiness_level          : Factor w/ 7 levels "4","5","6","7",..: 7 7 3 5 5 5 4 4 6 4 ...
 $ exercise                 : chr [1:500] "yes" "yes" "yes" "yes" ...
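As an alternative to matching factor levels against the numbers 1 through 7, the yes/no flag could be derived while the column is still numeric and the factor conversion done afterwards. This is only a sketch in base R on a toy data frame (`toy` and its values are made up for illustration):

```r
# Hypothetical toy data (column name mirrors the report; values are made up)
toy <- data.frame(exercise_7d = c(0, 3, 7, 1, 0))

# Derive the flag while the column is still numeric ...
toy$exercise <- ifelse(toy$exercise_7d > 0, "yes", "no")

# ... then convert the day count to a factor for plotting
toy$exercise_7d <- factor(toy$exercise_7d)

toy
```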

Exploring the Relationship Between Variables and Stress Level

I am going to create a Shiny app that allows examination of different variables and how each one relates to an individual’s stress level. These variables will include daily screen time, sleep quality, exercise, days without social media, and happiness level.

# Define UI for stress level app ----
ui <- page_sidebar(

  # App title ----
  title = "Stress Level (1-10)",

  # Sidebar panel for inputs ----
  sidebar = sidebar(
    
     # Dropdown to select color of the graph
      selectInput("color", "Graph Color", choices = c("lightpink", "lightgreen", "steelblue", "lightblue", "orange")),
    

     # Input: Selector for variable to plot against stress_level ----
     selectInput(
        "variable",
        "Variable:",
        c(
          "Time Spent on Social Media (hrs)" = "daily_screen_time",
          "Sleep Quality (1-10)" = "sleep_quality",
          "Exercise (yes/no)" = "exercise",
          "Exercise (7 days)" = "exercise_7d",
          "Days Without Social Media" = "days_without_social_media",
          "Happiness Level (1-10)" = "happiness_level")
        
      )
    
),
    

    # Input: Checkbox for whether outliers should be included ----
    checkboxInput("outliers", "Show outliers", TRUE),

  # Output: Formatted text for caption ----
  h3(textOutput("caption")),

  # Output: Plot of the requested variable against stress_level ----
  plotOutput("stressPlot")
)

# Define server logic to plot various variables against stress_levels ----
server <- function(input, output) {

  # Compute the formula text ----
  # This is in a reactive expression since it is shared by the
  # output$caption and output$stressPlot functions
  formulaText <- reactive({
    paste("stress_level ~", input$variable)
  })

  # Return the formula text for printing as a caption ----
  output$caption <- renderText({
    formulaText()
  })

  # Generate a plot of the requested variable against stress_level ----
  # and only exclude outliers if requested
  output$stressPlot <- renderPlot({
    # Set the font family before plotting instead of passing par() to boxplot()
    par(family = "serif")
    boxplot(
      as.formula(formulaText()),
      data = mh_factor,
      outline = input$outliers,
      col = input$color,
      pch = 19
    )
  })
}
# Create Shiny app ----
shinyApp(ui, server)

Shiny applications not supported in static R Markdown documents
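Since the interactive app cannot run in a static document, the same boxplots could also be drawn statically, one panel per variable. The sketch below assumes a data frame shaped like `mh_factor`; `demo` is simulated stand-in data and `plot_stress_by` is a hypothetical helper, neither of which comes from the original project.

```r
# Simulated stand-in for mh_factor (made-up values, illustration only)
set.seed(110)
demo <- data.frame(
  stress_level    = sample(2:10, 60, replace = TRUE),
  sleep_quality   = factor(sample(2:10, 60, replace = TRUE)),
  happiness_level = factor(sample(4:10, 60, replace = TRUE)),
  exercise        = sample(c("yes", "no"), 60, replace = TRUE)
)

# Hypothetical helper: one boxplot of stress_level against a chosen column
plot_stress_by <- function(data, variable, color = "steelblue") {
  boxplot(
    reformulate(variable, response = "stress_level"),
    data = data,
    col = color,
    pch = 19,
    main = paste("stress_level ~", variable)
  )
}

# Draw every panel in a single static figure, then restore the old settings
vars <- c("sleep_quality", "happiness_level", "exercise")
old_par <- par(mfrow = c(1, length(vars)), family = "serif")
stats_list <- lapply(vars, function(v) plot_stress_by(demo, v))
par(old_par)
```

Each call returns the statistics that `boxplot()` computed, so `stats_list` can also be inspected to read off medians and whisker ends numerically.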

The following conclusions can be reached based on each of the graphs.

Stress Level ~ Daily Screen Time

  • There seems to be a positive correlation between stress level and daily screen time. As daily screen time increases, the median, maximum, and minimum stress level all increase.

  • For those with a daily screen time of about 1 hour, the median stress level is around 3.5, while those with a screen time of almost 10 hours per day have a median stress level of 9.5.

  • Thus, while we cannot say whether daily screen time causes stress, we can determine that the two variables are correlated.
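The strength of an association like this can be quantified rather than read off the plot. Below is a minimal sketch using `cor.test()` on simulated data; the vectors `screen` and `stress` are made up for illustration and are not the project data.

```r
# Simulated data for illustration only (not the project data)
set.seed(42)
screen <- round(runif(100, min = 1, max = 10), 1)  # hours per day
stress <- pmin(10, pmax(1, round(2 + 0.7 * screen + rnorm(100))))

# Spearman's rank correlation suits an ordinal 1-10 response;
# exact = FALSE avoids warnings about tied values
result <- cor.test(screen, stress, method = "spearman", exact = FALSE)

result$estimate  # rho: the sign gives the direction, the size the strength
result$p.value   # a small value is evidence against zero correlation
```

As with the boxplots, a significant correlation describes an association only and says nothing about which variable drives the other.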

Stress Level ~ Sleep Quality

  • There seems to be a negative correlation between stress level and sleep quality. As sleep quality increases, the median stress level decreases.

  • However, at a sleep quality rating of 7, stress levels range from 2 all the way to 10, meaning that people with this sleep quality rating reported stress levels spanning from the lowest observed value to the highest.

Stress Level ~ Exercise

  • Interestingly, those who exercise seem to have a higher median stress level than those who do not.

  • Both boxplots cover a wide range: stress levels run from 3 to 10 for those who exercise and from 2 to 9 for those who do not. On a scale of 1-10, a range of 7 is fairly large.
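Group medians and ranges like these can be tabulated directly instead of being read from the plot. A minimal sketch with `aggregate()` follows; the `demo` data frame is made up for illustration.

```r
# Made-up data for illustration only (not the project data)
demo <- data.frame(
  exercise     = c("yes", "yes", "yes", "yes", "no", "no", "no", "no"),
  stress_level = c(3, 7, 8, 10, 2, 6, 7, 9)
)

# Minimum, median, and maximum stress level within each group
by_group <- aggregate(
  stress_level ~ exercise,
  data = demo,
  FUN = function(x) c(min = min(x), median = median(x), max = max(x))
)

by_group
```

The same call should work for any of the grouping variables in the app by swapping the right-hand side of the formula.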

Stress Level ~ Exercise 7 Days

  • There does not seem to be any clear pattern. Those who exercise 6 out of 7 days have the lowest median stress level of any category (5), as well as the lowest maximum stress level (7).

  • Stress level ranged from 2-10 for those who exercised 1-4 days out of the week, with the median stress level being around 7 for each of those categories.

  • Because there is no clear pattern, the linear regression will determine whether or not this variable is significant.

Stress Level ~ Days Without Social Media

  • This graph also seems to have no clear pattern. Most of the groups had the same median stress level of 7, whether they had gone 0 days without social media or 7. This is especially interesting given the strong correlation between stress level and screen time.

  • Again, many of the boxplots had fairly wide ranges.

Stress Level ~ Happiness Level

  • There is a strong negative correlation between self-reported happiness level and self-reported stress level. Those with high happiness levels had the lowest median stress, and those with low happiness levels had higher stress levels overall.

Multiple Linear Regression

To determine which variables have the greatest impact on an individual’s stress level, I will conduct a multiple linear regression. I will begin with the full model that includes all the numerical variables and then perform backward elimination, removing the variable with the highest p-value (the least significant one) at each step and checking whether the adjusted R-squared improves.

# Fit multiple linear regression
model1 <- lm(stress_level ~ happiness_level + sleep_quality + daily_screen_time + exercise_7d + age + days_without_social_media, data = mh1)

# View the model summary
summary(model1)

Call:
lm(formula = stress_level ~ happiness_level + sleep_quality + 
    daily_screen_time + exercise_7d + age + days_without_social_media, 
    data = mh1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.62880 -0.60644  0.02399  0.62144  2.60318 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                7.056504   0.582543  12.113   <2e-16 ***
happiness_level           -0.465836   0.040318 -11.554   <2e-16 ***
sleep_quality              0.107689   0.043772   2.460   0.0142 *  
daily_screen_time          0.446822   0.040073  11.150   <2e-16 ***
exercise_7d                0.049826   0.029174   1.708   0.0883 .  
age                        0.002643   0.004170   0.634   0.5265    
days_without_social_media  0.033313   0.022247   1.497   0.1349    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9215 on 493 degrees of freedom
Multiple R-squared:  0.6476,    Adjusted R-squared:  0.6433 
F-statistic:   151 on 6 and 493 DF,  p-value: < 2.2e-16

Currently, the full model accounts for 64.33% of the variability in stress level (adjusted R-squared), and the overall model is highly significant (F-test p-value < 2.2e-16).
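The adjusted R-squared penalizes the plain R-squared for the number of predictors, which is why it is the better figure for comparing models of different sizes. As a quick check, the value reported in the summary can be reproduced from the multiple R-squared, the sample size, and the predictor count (`adjusted_r2` is a hypothetical helper name, not part of the original analysis):

```r
# Adjusted R-squared from R-squared, sample size n, and predictor count p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values reported for model1: R-squared 0.6476, 500 observations, 6 predictors
adj1 <- adjusted_r2(r2 = 0.6476, n = 500, p = 6)
round(adj1, 4)
```

Because the inputs are themselves rounded to four decimals, the result may differ from the printed summary in the last digit for other models.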

To improve the model, I will remove age. It is not significant and has the highest p-value of any variable (0.5265).

# Fit multiple linear regression
model2 <- lm(stress_level ~ happiness_level + sleep_quality + daily_screen_time + exercise_7d + days_without_social_media, data = mh1)

# View the model summary
summary(model2)

Call:
lm(formula = stress_level ~ happiness_level + sleep_quality + 
    daily_screen_time + exercise_7d + days_without_social_media, 
    data = mh1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.61177 -0.60257  0.02747  0.62663  2.59216 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                7.13718    0.56812  12.563   <2e-16 ***
happiness_level           -0.46398    0.04019 -11.545   <2e-16 ***
sleep_quality              0.10573    0.04364   2.423   0.0158 *  
daily_screen_time          0.44710    0.04005  11.164   <2e-16 ***
exercise_7d                0.05102    0.02910   1.753   0.0802 .  
days_without_social_media  0.03293    0.02223   1.482   0.1391    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.921 on 494 degrees of freedom
Multiple R-squared:  0.6473,    Adjusted R-squared:  0.6438 
F-statistic: 181.3 on 5 and 494 DF,  p-value: < 2.2e-16

Removing age did not improve the model significantly; the adjusted R-squared only increased by about 0.0005 (from 0.6433 to 0.6438, or 0.05 percentage points). Now I will remove days_without_social_media to see how that influences the model’s accuracy. That variable is the least significant (p-value of 0.1391).

model3 <- lm(stress_level ~ happiness_level + sleep_quality + daily_screen_time + exercise_7d, data = mh1)

# View the model summary
summary(model3)

Call:
lm(formula = stress_level ~ happiness_level + sleep_quality + 
    daily_screen_time + exercise_7d, data = mh1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.68688 -0.60160  0.03212  0.60585  2.55534 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        7.22193    0.56591  12.762   <2e-16 ***
happiness_level   -0.46137    0.04020 -11.478   <2e-16 ***
sleep_quality      0.10543    0.04369   2.413   0.0162 *  
daily_screen_time  0.44690    0.04009  11.146   <2e-16 ***
exercise_7d        0.05089    0.02913   1.747   0.0813 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9221 on 495 degrees of freedom
Multiple R-squared:  0.6458,    Adjusted R-squared:  0.6429 
F-statistic: 225.6 on 4 and 495 DF,  p-value: < 2.2e-16

The adjusted R-squared has now decreased slightly (from 0.6438 to 0.6429), so the model accounts for less of the variation in stress level.
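The manual drop-and-refit steps above could also be automated. The sketch below is one possible version on simulated data, and it uses a fixed p-value cutoff as the stopping rule, which is a different criterion from the adjusted R-squared comparison used in this report. The helper `backward_eliminate` and the data frame `sim` are hypothetical, and the helper assumes numeric predictors only.

```r
# Simulated data for illustration only (not the project data)
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n)

# Hypothetical helper: drop the predictor with the largest p-value,
# refit, and repeat until every remaining p-value is below alpha
backward_eliminate <- function(response, predictors, data, alpha = 0.05) {
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    p_values <- setNames(coefs[-1, 4], rownames(coefs)[-1])  # skip intercept
    worst <- names(which.max(p_values))
    if (p_values[[worst]] < alpha || length(predictors) == 1) {
      return(fit)
    }
    predictors <- setdiff(predictors, worst)
  }
}

final <- backward_eliminate("y", c("x1", "x2", "noise"), sim)
formula(final)
summary(final)$adj.r.squared
```

In this simulation `noise` is unrelated to `y`, so it will usually be the variable that gets dropped, while `x1` and `x2` are kept.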

Summary

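That means model 2 is the best model by adjusted R-squared, with a highly significant overall p-value and an adjusted R-squared showing that it explains 64.38% of the variability in stress level. Model 2 included the variables happiness_level, sleep_quality, daily_screen_time, exercise_7d, and days_without_social_media. Of those variables, 3 were significant (p-value below 0.05): happiness_level, daily_screen_time, and sleep_quality. The first 2 were extremely significant (p-value below 2e-16), while exercise_7d and days_without_social_media were not significant at the 0.05 level.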

Model Equation

stress_level = 7.13718 - 0.46398(happiness_level) + 0.10573(sleep_quality) + 0.44710(daily_screen_time) + 0.05102(exercise_7d) + 0.03293(days_without_social_media)
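To make the equation concrete, it can be evaluated for a single person. The sketch below uses the coefficients from the model 2 summary; the values in `person` are invented for illustration and do not describe anyone in the dataset.

```r
# Coefficients copied from the model2 summary
coefs <- c(
  intercept                 = 7.13718,
  happiness_level           = -0.46398,
  sleep_quality             = 0.10573,
  daily_screen_time         = 0.44710,
  exercise_7d               = 0.05102,
  days_without_social_media = 0.03293
)

# Hypothetical individual (values made up for illustration)
person <- c(
  happiness_level           = 8,
  sleep_quality             = 7,
  daily_screen_time         = 5,
  exercise_7d               = 3,
  days_without_social_media = 2
)

# Intercept plus the sum of coefficient * value for each predictor
predicted <- coefs[["intercept"]] + sum(coefs[names(person)] * person)
predicted
```

With the fitted object available, `predict(model2, newdata = ...)` should give the same kind of result without copying coefficients by hand.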

Conclusion

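I found that the variables most strongly associated with an individual’s stress level were happiness level and daily screen time, followed by sleep quality. Exercise frequency and days without social media stayed in the final model but were not significant at the 0.05 level. The equation shows a negative relationship between happiness and stress level: as happiness increases, predicted stress decreases. The remaining coefficients are positive, so predicted stress increases as those variables increase, with the other variables held constant. The clearest practical takeaway is that less daily screen time is associated with lower stress. The positive coefficient for sleep quality is small and runs opposite to the boxplot, where higher sleep quality went with lower median stress, so I would not conclude that better sleep raises stress. Likewise, the coefficients for exercise and days without social media are too uncertain to support a recommendation. Because this is observational data, none of these relationships can be taken as causal.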

These findings are supported by the visualizations created through the Shiny app. Variables that had a strong correlation with stress were indeed significant and influenced the multiple linear model. I would have loved to create a Highcharts chart to visualize trends in the data, but unfortunately that was not possible with the type of data in my dataset.

Citations