Mental Health Data

Author

Kristoff Oliphant

Introduction

My goal is to tidy a wide mental health dataset and see whether anxiety and depression scores correlate with the amount of sleep an individual gets. The raw data was provided by my classmate in Discussion 5A, where they originally raised the question. The dataset has 180 participant records with fields such as city and state, therapy type, notes, and reported sleep. It also includes each participant’s PHQ-9 and GAD-7 results, which measure depression and anxiety, respectively. The dataset is wide and untidy: monthly scores are spread across separate columns, and missing values are scattered throughout under several different placeholders. It will be important to tidy this data and transform it into long form for analysis.

Planned Workflow

I plan to load the data and use pivot_longer() to collapse the month columns (Jan, Feb, and so on) into a single ‘Month’ column, with a companion column holding each participant’s monthly score. I will also normalize variables like City_State using separate(), extract the PHQ-9 and GAD-7 scores from their text strings into numeric columns, and account for the inconsistent and missing values in the dataset. After tidying, I will use ggplot to plot sleep hours against the mental health scores, with a regression line to assess the correlation.
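The reshaping step can be sketched on a small toy table before running it on the real file. The values and the `toy_wide` and `toy_long` names below are made up for illustration; only the month column names and the `Month` and `Monthly_Score_Raw` targets match what I use later.

```r
library(tidyr)
library(tibble)

# Toy data with hypothetical values; the real file has Jan through May columns.
toy_wide <- tibble(
  Participant = c("P1", "P2"),
  Jan = c("140", NA),
  Feb = c("119", "missing")
)

# One row per participant per month, with the month name moved into a column.
toy_long <- pivot_longer(
  toy_wide,
  cols = c(Jan, Feb),
  names_to = "Month",
  values_to = "Monthly_Score_Raw"
)

print(toy_long)
```

Each participant now contributes one row per month, and the placeholder and missing entries are still present, so they have to be handled in a later step.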

Anticipated Challenges

One challenge is the string manipulation for columns like ScreeningScores and Sleep. Both use inconsistent formatting that must be parsed into numeric values before they can be used in calculations. I also need to make sure that blank entries and placeholder values are accounted for and do not distort the calculations.
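As a rough sketch of the parsing problem, the snippet below pulls numbers out of a few sample strings in the two screening formats that appear in the data, plus some sleep values. The sample vectors are hypothetical, and using `str_match()` with a capture group is just one way to do it, not necessarily the approach used later in this report.

```r
library(stringr)

# Sample strings in the two formats seen in ScreeningScores, plus a missing entry.
screening <- c("PHQ9:9,GAD7:0", "PHQ-9=2;GAD-7=9", NA)

# The capture group (\\d+) keeps only the digits after the label and separator.
phq9 <- as.numeric(str_match(screening, "PHQ-?9[:=](\\d+)")[, 2])
gad7 <- as.numeric(str_match(screening, "GAD-?7[:=](\\d+)")[, 2])

# Sleep values carry a unit suffix; entries without a number become NA.
sleep <- c("7.8h", "8.8h", "n/a", NA)
sleep_hours <- as.numeric(str_extract(sleep, "\\d+\\.?\\d*"))

print(phq9)
print(gad7)
print(sleep_hours)
```

Anything that does not match the pattern comes back as NA rather than raising an error, which keeps blank and placeholder entries out of later calculations.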

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
mh_raw <- read_csv("untidy_mental_health_data.csv")
Rows: 180 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (20): Participant, AgeGroup, Gender, City_State, ScreeningScores, Jan, F...
dbl  (1): RecordID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
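mh_cleaned <- mh_raw %>%
  # Split "City, State" into two columns
  separate(City_State, into = c("City", "State"), sep = ", ") %>%
  # Collapse the gender codes into consistent labels
  mutate(Gender = case_when(
    Gender %in% c("F", "Female") ~ "Female",
    Gender %in% c("M", "Male") ~ "Male",
    TRUE ~ "Other/Unknown"
  )) %>%
  # Pull the numeric part out of values like "7.8h"; entries without a number become NA
  mutate(sleep_hours = as.numeric(str_extract(Sleep, "\\d+\\.?\\d*")))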
head(mh_cleaned)
# A tibble: 6 × 23
  RecordID Participant   AgeGroup Gender City  State ScreeningScores Jan   Feb  
     <dbl> <chr>         <chr>    <chr>  <chr> <chr> <chr>           <chr> <chr>
1        1 Participant_1 unknown  Female Phoe… AZ    PHQ9:9,GAD7:0   <NA>  -    
2        2 Participant_2 unknown  Other… Hous… TX    PHQ-9=2;GAD-7=9 <NA>  miss…
3        3 Participant_3 26-35    Male   Chic… IL    PHQ-9=20;GAD-7… -     miss…
4        4 Participant_4 36-45    Female Hous… TX    PHQ-9=24;GAD-7… n/a   n/a  
5        5 Participant_5 36-45    Female New … NY    PHQ9:9,GAD7:8   <NA>  <NA> 
6        6 Participant_6 18-25    Female Hous… TX    PHQ9:12,GAD7:15 miss… <NA> 
# ℹ 14 more variables: Mar <chr>, Apr <chr>, May <chr>, TherapyType <chr>,
#   Sessions <chr>, Medication <chr>, DiagnosisStatus <chr>, WorkHours <chr>,
#   Sleep <chr>, StressScale <chr>, Insurance <chr>, SurveyDate <chr>,
#   Notes <chr>, sleep_hours <dbl>
mh_tidy <- mh_cleaned %>%
  mutate(
    # ScreeningScores comes in two formats: "PHQ9:9,GAD7:0" and "PHQ-9=2;GAD-7=9"
    phq9_raw = str_extract(ScreeningScores, "PHQ-?9[:=]\\d+"),
    phq9_score = as.numeric(str_remove(phq9_raw, "PHQ-?9[:=]")),
    gad7_raw = str_extract(ScreeningScores, "GAD-?7[:=]\\d+"),
    gad7_score = as.numeric(str_remove(gad7_raw, "GAD-?7[:=]"))
  ) %>%
  # Reshape the month columns into one row per participant per month
  pivot_longer(
    cols = c(Jan, Feb, Mar, Apr, May),
    names_to = "Month",
    values_to = "Monthly_Score_Raw"
  ) %>%
  # Drop months with no usable score (NA or a placeholder token)
  filter(!is.na(Monthly_Score_Raw),
         !Monthly_Score_Raw %in% c("-", "missing", "n/a")) %>%
  mutate(Monthly_Score = as.numeric(Monthly_Score_Raw))
glimpse(mh_tidy)
Rows: 148
Columns: 25
$ RecordID          <dbl> 8, 8, 8, 12, 12, 15, 15, 18, 20, 23, 24, 25, 27, 27,…
$ Participant       <chr> "Participant_8", "Participant_8", "Participant_8", "…
$ AgeGroup          <chr> "26-35", "26-35", "26-35", "60+", "60+", "46-60", "4…
$ Gender            <chr> "Female", "Female", "Female", "Female", "Female", "O…
$ City              <chr> "Chicago", "Chicago", "Chicago", "Chicago", "Chicago…
$ State             <chr> "IL", "IL", "IL", "IL", "IL", "IL", "IL", "TX", "IL"…
$ ScreeningScores   <chr> "PHQ9:10,GAD7:5", "PHQ9:10,GAD7:5", "PHQ9:10,GAD7:5"…
$ TherapyType       <chr> "None", "None", "None", "CBT", "CBT", "Group", "Grou…
$ Sessions          <chr> NA, NA, NA, "7", "7", NA, NA, "9", "18", NA, NA, "n/…
$ Medication        <chr> "SSRI", "SSRI", "SSRI", "Benzodiazepine", "Benzodiaz…
$ DiagnosisStatus   <chr> NA, NA, NA, "Diagnosed", "Diagnosed", NA, NA, "-", N…
$ WorkHours         <chr> "57h", "57h", "57h", "28 hours", "28 hours", "56h", …
$ Sleep             <chr> "7.8h", "7.8h", "7.8h", NA, NA, NA, NA, "8.8h", NA, …
$ StressScale       <chr> "7", "7", "7", "n/a", "n/a", "4", "4", "1", NA, "n/a…
$ Insurance         <chr> "n/a", "n/a", "n/a", "None", "None", "Private", "Pri…
$ SurveyDate        <chr> "2025-02-03", "2025-02-03", "2025-02-03", "2024-05-2…
$ Notes             <chr> NA, NA, NA, NA, NA, NA, NA, "missing", NA, "Increase…
$ sleep_hours       <dbl> 7.8, 7.8, 7.8, NA, NA, NA, NA, 8.8, NA, 5.7, 6.9, NA…
$ phq9_raw          <chr> "PHQ9:10", "PHQ9:10", "PHQ9:10", "PHQ9:8", "PHQ9:8",…
$ phq9_score        <dbl> 10, 10, 10, 8, 8, 22, 22, 2, 4, 4, 16, 7, 15, 15, 20…
$ gad7_raw          <chr> "GAD7:5", "GAD7:5", "GAD7:5", "GAD7:2", "GAD7:2", "G…
$ gad7_score        <dbl> 5, 5, 5, 2, 2, 3, 3, 19, 20, 0, 9, 7, 3, 3, 19, 4, 6…
$ Month             <chr> "Jan", "Feb", "Mar", "Jan", "Mar", "Feb", "Apr", "Ap…
$ Monthly_Score_Raw <chr> "140", "119", "111", "80", "99", "130", "105", "131"…
$ Monthly_Score     <dbl> 140, 119, 111, 80, 99, 130, 105, 131, 99, 107, 82, 1…
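The glimpse above shows the same placeholder tokens ("n/a", "-", "missing") still sitting in other character columns such as Sessions, StressScale, and Notes. A possible way to handle them in one pass is sketched below on toy data. The `placeholder_tokens` vector and the toy values are assumptions for illustration, and this step is not part of the pipeline above.

```r
library(dplyr)
library(tibble)

# Tokens that stand in for missing values in this dataset (assumed list).
placeholder_tokens <- c("-", "missing", "n/a")

# Toy data with hypothetical values.
toy <- tibble(
  Participant = c("P1", "P2", "P3"),
  Jan = c("140", "n/a", "-"),
  StressScale = c("7", "missing", "4")
)

# Replace every placeholder token in every character column with a real NA.
toy_clean <- toy %>%
  mutate(across(
    where(is.character),
    ~ if_else(.x %in% placeholder_tokens, NA_character_, .x)
  ))

print(toy_clean)
```

Converting placeholders to NA up front would let a single `is.na()` check cover all missing entries, instead of listing the tokens again in each filter.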
ggplot(mh_tidy, aes(x = sleep_hours, y = phq9_score)) +
  geom_jitter(alpha = 0.5, color = "darkorchid") +
  geom_smooth(method = "lm", color = "black") +
  labs(
    title = "The Relationship Between Sleep and Depression",
    subtitle = "Analysis of Reported Sleep Hours vs. PHQ-9 Screening Scores",
    x = "Average Sleep Hours",
    y = "Depression Score (PHQ-9)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 74 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 74 rows containing missing values or values outside the scale range
(`geom_point()`).

ggplot(mh_tidy, aes( x = sleep_hours, y = gad7_score)) +
  geom_jitter(alpha = 0.5, color = "blue") +
  geom_smooth(method = "lm", color = "black") +
  labs(
    title = "The Relationship Between Sleep and Anxiety",
    subtitle = "Analysis of Reported Sleep Hours vs. GAD-7 Screening Scores",
    x = "Average Sleep Hours",
    y = "Anxiety Score (GAD-7)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 74 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 74 rows containing missing values or values outside the scale range
(`geom_point()`).

Conclusion

After tidying and reviewing this dataset, it appears that sleep does not correlate strongly with either anxiety or depression. Both graphs have a nearly flat trend line that sits between 10 and 13 on the screening scale. In the anxiety graph, for example, some people who get over 8 hours of sleep score as high as 20, while others with similar sleep score around 5. The same is true of the depression graph. Note that 74 of the 148 tidy rows lack a usable sleep value and were dropped from both plots, so these trends rest on a limited part of the data.
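One way to put a number on this would be a correlation coefficient computed per participant. Because the long-form table repeats each participant’s sleep and screening scores once per valid month, the sketch below first reduces the data to one row per participant. The toy values are hypothetical, and I have not run this on the real data, so no result is reported here.

```r
library(dplyr)
library(tibble)

# Toy long-form data: sleep and PHQ-9 values repeat once per valid month.
toy_long <- tibble(
  RecordID = c(1, 1, 1, 2, 3, 3, 4),
  sleep_hours = c(7.8, 7.8, 7.8, 5.7, 6.9, 6.9, NA),
  phq9_score = c(10, 10, 10, 4, 16, 16, 7)
)

# Keep one row per participant so nobody is counted more than once.
per_participant <- toy_long %>%
  distinct(RecordID, sleep_hours, phq9_score)

# Pearson correlation, skipping participants with a missing value.
r_phq9 <- cor(
  per_participant$sleep_hours,
  per_participant$phq9_score,
  use = "complete.obs"
)

print(r_phq9)
```

A value near 0 would support the reading of the plots above, while a value closer to -1 or 1 would suggest a stronger linear relationship than the scatter makes visible.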