Project 2 - Mental Health Dataset

Author

Theresa Benny

Approach

The first dataset contains simulated mental health survey data collected from participants across several cities in the United States. Each row represents a participant and includes demographic information, therapy participation, medication usage, screening scores, and monthly assessment data. However, the dataset is currently stored in a wide and inconsistent format, making it difficult to analyze directly.

Several columns contain combined variables (such as screening scores), inconsistent missing value labels (e.g., “NA”, “missing”, “-”, and “n/a”), and repeated measures across months (January through May). The goal of this dataset transformation is to convert the data into a tidy structure where each variable forms its own column and each observation forms its own row. This tidy format will allow easier analysis of trends in mental health screening scores, therapy participation, and stress levels across participants.

To preserve the original dataset structure, I will first recreate the dataset in its original wide format and store it as a .csv file. This raw file will be committed to my GitHub repository before performing any transformations.

The raw dataset contains columns such as:

Participant demographics (AgeGroup, Gender, City_State)
Mental health screening scores (PHQ-9 and GAD-7)
Monthly measurement columns (Jan–May)
Therapy and medication information
Diagnostic status
Lifestyle factors (WorkHours, Sleep, StressScale)
Insurance type and survey metadata

This file will serve as the unmodified input dataset for the tidying pipeline.

The dataset will be imported into R using read_csv() from the tidyverse package. I will inspect the structure of the dataset using functions such as glimpse() and summary() to understand the data types and identify inconsistencies.

Column names will be standardized using rename_with() to ensure consistent formatting (e.g., lowercase variable names and underscores instead of spaces). This improves readability and ensures compatibility with downstream analysis workflows.

The ScreeningScores column contains two separate measurements in the format:

PHQ-9:score, GAD-7:score

These represent two different mental health screening scales for depression and anxiety. I plan to separate this column into two variables using separate() from tidyr.

The dataset currently contains multiple columns representing monthly measurements:

These columns represent repeated observations over time and should be converted into a tidy structure using pivot_longer().

The dataset contains several different representations of missing data, including:

“NA”
“missing”
“-”
“n/a”

These values will be standardized into proper NA values using na_if() or mutate() operations. This ensures that missing data is handled consistently during analysis.

Some numeric fields currently include text (for example, “33 hours” or “7.8 hours”). These variables will be cleaned using string manipulation functions such as str_remove() before converting them to numeric types.

Once the dataset has been converted into tidy format, I will perform exploratory analysis to understand patterns in mental health outcomes.

Potential analyses include:

Comparing PHQ-9 and GAD-7 scores across age groups

Examining whether therapy type is associated with lower stress scores

Analyzing monthly changes in mental health screening results

Exploring relationships between sleep hours, work hours, and stress levels

Summary tables will be generated using dplyr functions such as group_by() and summarise(). Visualizations will be created using ggplot2, including bar charts and line plots showing mental health trends over time.

Several challenges are expected when tidying this dataset:

Inconsistent missing value labels that must be standardized.
Combined variables in the screening scores column that require separation.
Monthly measurements stored across multiple columns, which require reshaping.
Text embedded within numeric variables (e.g., “hours”) that must be cleaned before analysis.

Addressing these issues will ensure the final dataset follows tidy data principles and is suitable for reproducible analysis.

Codebase

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#import mental health data
mental_raw <- read_csv("untidy_mental_health_data.csv")

Rows: 180 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (20): Participant, AgeGroup, Gender, City_State, ScreeningScores, Jan, F...
dbl  (1): RecordID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse (mental_raw)

Rows: 180
Columns: 21
$ RecordID        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ Participant     <chr> "Participant_1", "Participant_2", "Participant_3", "Pa…
$ AgeGroup        <chr> "unknown", "unknown", "26-35", "36-45", "36-45", "18-2…
$ Gender          <chr> "F", "n/a", "M", "Female", "Female", "Female", "Other"…
$ City_State      <chr> "Phoenix, AZ", "Houston, TX", "Chicago, IL", "Houston,…
$ ScreeningScores <chr> "PHQ9:9,GAD7:0", "PHQ-9=2;GAD-7=9", "PHQ-9=20;GAD-7=13…
$ Jan             <chr> NA, NA, "-", "n/a", NA, "missing", "n/a", "140", "miss…
$ Feb             <chr> "-", "missing", "missing", "n/a", NA, NA, NA, "119", "…
$ Mar             <chr> "-", "n/a", "n/a", NA, "n/a", NA, NA, "111", "n/a", NA…
$ Apr             <chr> "n/a", "n/a", "-", NA, "n/a", NA, "n/a", "-", NA, NA, …
$ May             <chr> "-", "n/a", NA, "missing", "missing", NA, NA, "missing…
$ TherapyType     <chr> "Group", "ACT", "n/a", "CBT", "Group", "CBT", "CBT", "…
$ Sessions        <chr> "18", NA, "n/a", "n/a", "9", "8", "n/a", NA, "n/a", NA…
$ Medication      <chr> NA, "SNRI", "None", "None", NA, "None", "SSRI", "SSRI"…
$ DiagnosisStatus <chr> "Undiagnosed", "-", "Pending", "Undiagnosed", "-", NA,…
$ WorkHours       <chr> NA, NA, NA, "n/a", "36h", "n/a", "33 hours", "57h", NA…
$ Sleep           <chr> "missing", NA, "missing", "missing", NA, NA, "8.7 hour…
$ StressScale     <chr> NA, NA, NA, NA, "10", NA, NA, "7", "2", "3", "n/a", "n…
$ Insurance       <chr> "Private", "-", "Public", NA, "Public", "Public", "-",…
$ SurveyDate      <chr> "2024-03-12", "2025-05-21", "03-30-2024", "2025-06-14"…
$ Notes           <chr> "missing", "missing", NA, "Follow-up needed", "Increas…

#Preview the raw data

head(mental_raw)

# A tibble: 6 × 21
  RecordID Participant   AgeGroup Gender City_State  ScreeningScores Jan   Feb  
     <dbl> <chr>         <chr>    <chr>  <chr>       <chr>           <chr> <chr>
1        1 Participant_1 unknown  F      Phoenix, AZ PHQ9:9,GAD7:0   <NA>  -    
2        2 Participant_2 unknown  n/a    Houston, TX PHQ-9=2;GAD-7=9 <NA>  miss…
3        3 Participant_3 26-35    M      Chicago, IL PHQ-9=20;GAD-7… -     miss…
4        4 Participant_4 36-45    Female Houston, TX PHQ-9=24;GAD-7… n/a   n/a  
5        5 Participant_5 36-45    Female New York, … PHQ9:9,GAD7:8   <NA>  <NA> 
6        6 Participant_6 18-25    Female Houston, TX PHQ9:12,GAD7:15 miss… <NA> 
# ℹ 13 more variables: Mar <chr>, Apr <chr>, May <chr>, TherapyType <chr>,
#   Sessions <chr>, Medication <chr>, DiagnosisStatus <chr>, WorkHours <chr>,
#   Sleep <chr>, StressScale <chr>, Insurance <chr>, SurveyDate <chr>,
#   Notes <chr>

From this preview, the dataset appears untidy for several reasons. Monthly observations are spread across multiple columns rather than stored in a single month variable. Some columns also contain inconsistent text formatting, and missing values may be represented in different ways instead of using standard NA values. In addition, certain variables may need to be separated or cleaned before they can be analyzed properly.

#Let's check the column names now
names(mental_raw)

 [1] "RecordID"        "Participant"     "AgeGroup"        "Gender"         
 [5] "City_State"      "ScreeningScores" "Jan"             "Feb"            
 [9] "Mar"             "Apr"             "May"             "TherapyType"    
[13] "Sessions"        "Medication"      "DiagnosisStatus" "WorkHours"      
[17] "Sleep"           "StressScale"     "Insurance"       "SurveyDate"     
[21] "Notes"

Reviewing the column names helps identify which variables are participant characteristics and which columns represent repeated monthly measurements. It also helps determine whether variable names should be standardized to improve readability and consistency in the tidied dataset.

# Check for Missing or Inconsistent Values

summary(mental_raw)

    RecordID      Participant          AgeGroup            Gender         
 Min.   :  1.00   Length:180         Length:180         Length:180        
 1st Qu.: 45.75   Class :character   Class :character   Class :character  
 Median : 90.50   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 90.50                                                           
 3rd Qu.:135.25                                                           
 Max.   :180.00                                                           
  City_State        ScreeningScores        Jan                Feb           
 Length:180         Length:180         Length:180         Length:180        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
     Mar                Apr                May            TherapyType       
 Length:180         Length:180         Length:180         Length:180        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   Sessions          Medication        DiagnosisStatus     WorkHours        
 Length:180         Length:180         Length:180         Length:180        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
    Sleep           StressScale         Insurance          SurveyDate       
 Length:180         Length:180         Length:180         Length:180        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
    Notes          
 Length:180        
 Class :character  
 Mode  :character

#Let's start cleaning the dataset by renaming all the column names consistently

mental_clean <- mental_raw %>%
  rename_with(tolower)

glimpse(mental_clean)

Rows: 180
Columns: 21
$ recordid        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ participant     <chr> "Participant_1", "Participant_2", "Participant_3", "Pa…
$ agegroup        <chr> "unknown", "unknown", "26-35", "36-45", "36-45", "18-2…
$ gender          <chr> "F", "n/a", "M", "Female", "Female", "Female", "Other"…
$ city_state      <chr> "Phoenix, AZ", "Houston, TX", "Chicago, IL", "Houston,…
$ screeningscores <chr> "PHQ9:9,GAD7:0", "PHQ-9=2;GAD-7=9", "PHQ-9=20;GAD-7=13…
$ jan             <chr> NA, NA, "-", "n/a", NA, "missing", "n/a", "140", "miss…
$ feb             <chr> "-", "missing", "missing", "n/a", NA, NA, NA, "119", "…
$ mar             <chr> "-", "n/a", "n/a", NA, "n/a", NA, NA, "111", "n/a", NA…
$ apr             <chr> "n/a", "n/a", "-", NA, "n/a", NA, "n/a", "-", NA, NA, …
$ may             <chr> "-", "n/a", NA, "missing", "missing", NA, NA, "missing…
$ therapytype     <chr> "Group", "ACT", "n/a", "CBT", "Group", "CBT", "CBT", "…
$ sessions        <chr> "18", NA, "n/a", "n/a", "9", "8", "n/a", NA, "n/a", NA…
$ medication      <chr> NA, "SNRI", "None", "None", NA, "None", "SSRI", "SSRI"…
$ diagnosisstatus <chr> "Undiagnosed", "-", "Pending", "Undiagnosed", "-", NA,…
$ workhours       <chr> NA, NA, NA, "n/a", "36h", "n/a", "33 hours", "57h", NA…
$ sleep           <chr> "missing", NA, "missing", "missing", NA, NA, "8.7 hour…
$ stressscale     <chr> NA, NA, NA, NA, "10", NA, NA, "7", "2", "3", "n/a", "n…
$ insurance       <chr> "Private", "-", "Public", NA, "Public", "Public", "-",…
$ surveydate      <chr> "2024-03-12", "2025-05-21", "03-30-2024", "2025-06-14"…
$ notes           <chr> "missing", "missing", NA, "Follow-up needed", "Increas…

#Reshape monthly data to long format

mental_tidy <- mental_clean %>%
  pivot_longer(
    cols = jan:may,
    names_to = "month",
    values_to = "monthly_value"
  )
glimpse(mental_tidy)

Rows: 900
Columns: 18
$ recordid        <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, …
$ participant     <chr> "Participant_1", "Participant_1", "Participant_1", "Pa…
$ agegroup        <chr> "unknown", "unknown", "unknown", "unknown", "unknown",…
$ gender          <chr> "F", "F", "F", "F", "F", "n/a", "n/a", "n/a", "n/a", "…
$ city_state      <chr> "Phoenix, AZ", "Phoenix, AZ", "Phoenix, AZ", "Phoenix,…
$ screeningscores <chr> "PHQ9:9,GAD7:0", "PHQ9:9,GAD7:0", "PHQ9:9,GAD7:0", "PH…
$ therapytype     <chr> "Group", "Group", "Group", "Group", "Group", "ACT", "A…
$ sessions        <chr> "18", "18", "18", "18", "18", NA, NA, NA, NA, NA, "n/a…
$ medication      <chr> NA, NA, NA, NA, NA, "SNRI", "SNRI", "SNRI", "SNRI", "S…
$ diagnosisstatus <chr> "Undiagnosed", "Undiagnosed", "Undiagnosed", "Undiagno…
$ workhours       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sleep           <chr> "missing", "missing", "missing", "missing", "missing",…
$ stressscale     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ insurance       <chr> "Private", "Private", "Private", "Private", "Private",…
$ surveydate      <chr> "2024-03-12", "2024-03-12", "2024-03-12", "2024-03-12"…
$ notes           <chr> "missing", "missing", "missing", "missing", "missing",…
$ month           <chr> "jan", "feb", "mar", "apr", "may", "jan", "feb", "mar"…
$ monthly_value   <chr> NA, "-", "-", "n/a", "-", NA, "missing", "n/a", "n/a",…

#Next let's handle missing and inconsistent values
mental_tidy <- mental_tidy %>%
  mutate(across(where(is.character), ~ na_if(.x, "missing"))) %>%
  mutate(across(where(is.character), ~ na_if(.x, "-"))) %>%
  mutate(across(where(is.character), ~ na_if(.x, "n/a")))

glimpse(mental_tidy)

Rows: 900
Columns: 18
$ recordid        <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, …
$ participant     <chr> "Participant_1", "Participant_1", "Participant_1", "Pa…
$ agegroup        <chr> "unknown", "unknown", "unknown", "unknown", "unknown",…
$ gender          <chr> "F", "F", "F", "F", "F", NA, NA, NA, NA, NA, "M", "M",…
$ city_state      <chr> "Phoenix, AZ", "Phoenix, AZ", "Phoenix, AZ", "Phoenix,…
$ screeningscores <chr> "PHQ9:9,GAD7:0", "PHQ9:9,GAD7:0", "PHQ9:9,GAD7:0", "PH…
$ therapytype     <chr> "Group", "Group", "Group", "Group", "Group", "ACT", "A…
$ sessions        <chr> "18", "18", "18", "18", "18", NA, NA, NA, NA, NA, NA, …
$ medication      <chr> NA, NA, NA, NA, NA, "SNRI", "SNRI", "SNRI", "SNRI", "S…
$ diagnosisstatus <chr> "Undiagnosed", "Undiagnosed", "Undiagnosed", "Undiagno…
$ workhours       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sleep           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ stressscale     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ insurance       <chr> "Private", "Private", "Private", "Private", "Private",…
$ surveydate      <chr> "2024-03-12", "2024-03-12", "2024-03-12", "2024-03-12"…
$ notes           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ month           <chr> "jan", "feb", "mar", "apr", "may", "jan", "feb", "mar"…
$ monthly_value   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

#Cleaning worked!

#Screening scores column has multiple value packed into it, so let's split those out. 
mental_tidy %>%
  select(screeningscores) %>%
  distinct() %>%
  head(10)

# A tibble: 10 × 1
   screeningscores  
   <chr>            
 1 PHQ9:9,GAD7:0    
 2 PHQ-9=2;GAD-7=9  
 3 PHQ-9=20;GAD-7=13
 4 PHQ-9=24;GAD-7=13
 5 PHQ9:9,GAD7:8    
 6 PHQ9:12,GAD7:15  
 7 PHQ-9=11;GAD-7=7 
 8 PHQ9:10,GAD7:5   
 9 PHQ-9=5;GAD-7=5  
10 PHQ-9=11;GAD-7=14

#PHQ and GAD scores are formatted differently

mental_tidy <- mental_tidy %>%
  mutate(
    phq9_score = str_extract(screeningscores, "(?<=PHQ-9=)\\d+|(?<=PHQ9:)\\d+"),
    gad7_score = str_extract(screeningscores, "(?<=GAD-7=)\\d+|(?<=GAD7:)\\d+")
  )

mental_tidy %>%
  select(screeningscores, phq9_score, gad7_score) %>%
  head(10)

# A tibble: 10 × 3
   screeningscores phq9_score gad7_score
   <chr>           <chr>      <chr>     
 1 PHQ9:9,GAD7:0   9          0         
 2 PHQ9:9,GAD7:0   9          0         
 3 PHQ9:9,GAD7:0   9          0         
 4 PHQ9:9,GAD7:0   9          0         
 5 PHQ9:9,GAD7:0   9          0         
 6 PHQ-9=2;GAD-7=9 2          9         
 7 PHQ-9=2;GAD-7=9 2          9         
 8 PHQ-9=2;GAD-7=9 2          9         
 9 PHQ-9=2;GAD-7=9 2          9         
10 PHQ-9=2;GAD-7=9 2          9

mental_tidy <- mental_tidy %>%
  mutate(
    phq9_score = as.numeric(phq9_score),
    gad7_score = as.numeric(gad7_score)
  )

glimpse(mental_tidy)

Rows: 900
Columns: 20
$ recordid        <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, …
$ participant     <chr> "Participant_1", "Participant_1", "Participant_1", "Pa…
$ agegroup        <chr> "unknown", "unknown", "unknown", "unknown", "unknown",…
$ gender          <chr> "F", "F", "F", "F", "F", NA, NA, NA, NA, NA, "M", "M",…
$ city_state      <chr> "Phoenix, AZ", "Phoenix, AZ", "Phoenix, AZ", "Phoenix,…
$ screeningscores <chr> "PHQ9:9,GAD7:0", "PHQ9:9,GAD7:0", "PHQ9:9,GAD7:0", "PH…
$ therapytype     <chr> "Group", "Group", "Group", "Group", "Group", "ACT", "A…
$ sessions        <chr> "18", "18", "18", "18", "18", NA, NA, NA, NA, NA, NA, …
$ medication      <chr> NA, NA, NA, NA, NA, "SNRI", "SNRI", "SNRI", "SNRI", "S…
$ diagnosisstatus <chr> "Undiagnosed", "Undiagnosed", "Undiagnosed", "Undiagno…
$ workhours       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sleep           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ stressscale     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ insurance       <chr> "Private", "Private", "Private", "Private", "Private",…
$ surveydate      <chr> "2024-03-12", "2024-03-12", "2024-03-12", "2024-03-12"…
$ notes           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ month           <chr> "jan", "feb", "mar", "apr", "may", "jan", "feb", "mar"…
$ monthly_value   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ phq9_score      <dbl> 9, 9, 9, 9, 9, 2, 2, 2, 2, 2, 20, 20, 20, 20, 20, 24, …
$ gad7_score      <dbl> 0, 0, 0, 0, 0, 9, 9, 9, 9, 9, 13, 13, 13, 13, 13, 13, …

#let's inspect if there's anywhere in the dataset that numeric values are stored as text
mental_tidy %>%
  select(workhours, sleep, stressscale) %>%
  distinct() %>%
  head(10)

# A tibble: 10 × 3
   workhours sleep     stressscale
   <chr>     <chr>     <chr>      
 1 <NA>      <NA>      <NA>       
 2 36h       <NA>      10         
 3 33 hours  8.7 hours <NA>       
 4 57h       7.8h      7          
 5 <NA>      <NA>      2          
 6 55 hours  5.6 hours 3          
 7 28 hours  <NA>      <NA>       
 8 <NA>      8.0 hours 3          
 9 56h       <NA>      4          
10 44 hours  <NA>      5

#We can see it's incosistently formatted (sometimes 'hours' sometimes 'h')

mental_tidy <- mental_tidy %>%
  mutate(
    workhours = as.numeric(str_extract(workhours, "\\d+\\.?\\d*")),
    sleep = as.numeric(str_extract(sleep, "\\d+\\.?\\d*"))
  )

mental_tidy %>%
  select(workhours, sleep) %>%
  head(10)

# A tibble: 10 × 2
   workhours sleep
       <dbl> <dbl>
 1        NA    NA
 2        NA    NA
 3        NA    NA
 4        NA    NA
 5        NA    NA
 6        NA    NA
 7        NA    NA
 8        NA    NA
 9        NA    NA
10        NA    NA

#Was getting all NA so ran everything again in one code chunk: 

mental_tidy <- mental_clean %>%
  pivot_longer(
    cols = jan:may,
    names_to = "month",
    values_to = "monthly_value"
  ) %>%
  mutate(across(where(is.character), ~ na_if(.x, "missing"))) %>%
  mutate(across(where(is.character), ~ na_if(.x, "-"))) %>%
  mutate(across(where(is.character), ~ na_if(.x, "n/a"))) %>%
  mutate(
    phq9_score = str_extract(screeningscores, "(?<=PHQ-9=)\\d+|(?<=PHQ9:)\\d+"),
    gad7_score = str_extract(screeningscores, "(?<=GAD-7=)\\d+|(?<=GAD7:)\\d+")
  ) %>%
  mutate(
    phq9_score = as.numeric(phq9_score),
    gad7_score = as.numeric(gad7_score)
  )

mental_tidy %>%
  filter(!is.na(workhours) | !is.na(sleep) | !is.na(stressscale)) %>%
  select(workhours, sleep, stressscale) %>%
  slice(1:10)

# A tibble: 10 × 3
   workhours sleep     stressscale
   <chr>     <chr>     <chr>      
 1 36h       <NA>      10         
 2 36h       <NA>      10         
 3 36h       <NA>      10         
 4 36h       <NA>      10         
 5 36h       <NA>      10         
 6 33 hours  8.7 hours <NA>       
 7 33 hours  8.7 hours <NA>       
 8 33 hours  8.7 hours <NA>       
 9 33 hours  8.7 hours <NA>       
10 33 hours  8.7 hours <NA>

mental_tidy <- mental_tidy %>%
  mutate(
    workhours = as.numeric(str_remove_all(workhours, "[^0-9.]")),
    sleep = as.numeric(str_remove_all(sleep, "[^0-9.]")),
    stressscale = as.numeric(stressscale)
  )

mental_tidy %>%
  filter(!is.na(workhours) | !is.na(sleep) | !is.na(stressscale)) %>%
  select(workhours, sleep, stressscale) %>%
  slice(1:10)

# A tibble: 10 × 3
   workhours sleep stressscale
       <dbl> <dbl>       <dbl>
 1        36  NA            10
 2        36  NA            10
 3        36  NA            10
 4        36  NA            10
 5        36  NA            10
 6        33   8.7          NA
 7        33   8.7          NA
 8        33   8.7          NA
 9        33   8.7          NA
10        33   8.7          NA

#Final tidy dataset check

glimpse(mental_tidy)

Rows: 900
Columns: 20
$ recordid        <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, …
$ participant     <chr> "Participant_1", "Participant_1", "Participant_1", "Pa…
$ agegroup        <chr> "unknown", "unknown", "unknown", "unknown", "unknown",…
$ gender          <chr> "F", "F", "F", "F", "F", NA, NA, NA, NA, NA, "M", "M",…
$ city_state      <chr> "Phoenix, AZ", "Phoenix, AZ", "Phoenix, AZ", "Phoenix,…
$ screeningscores <chr> "PHQ9:9,GAD7:0", "PHQ9:9,GAD7:0", "PHQ9:9,GAD7:0", "PH…
$ therapytype     <chr> "Group", "Group", "Group", "Group", "Group", "ACT", "A…
$ sessions        <chr> "18", "18", "18", "18", "18", NA, NA, NA, NA, NA, NA, …
$ medication      <chr> NA, NA, NA, NA, NA, "SNRI", "SNRI", "SNRI", "SNRI", "S…
$ diagnosisstatus <chr> "Undiagnosed", "Undiagnosed", "Undiagnosed", "Undiagno…
$ workhours       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sleep           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ stressscale     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ insurance       <chr> "Private", "Private", "Private", "Private", "Private",…
$ surveydate      <chr> "2024-03-12", "2024-03-12", "2024-03-12", "2024-03-12"…
$ notes           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ month           <chr> "jan", "feb", "mar", "apr", "may", "jan", "feb", "mar"…
$ monthly_value   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ phq9_score      <dbl> 9, 9, 9, 9, 9, 2, 2, 2, 2, 2, 20, 20, 20, 20, 20, 24, …
$ gad7_score      <dbl> 0, 0, 0, 0, 0, 9, 9, 9, 9, 9, 13, 13, 13, 13, 13, 13, …

At this point, the dataset has been transformed into a tidy structure. Monthly observations are stored in a single month column, screening scores have been separated into individual variables, and numeric values previously stored as text have been cleaned and converted for analysis.

Let’s begin our analysis.

#Average screening scores by age group

mental_summary <- mental_tidy %>%
  group_by(agegroup) %>%
  summarise(
    avg_phq9 = mean(phq9_score, na.rm = TRUE),
    avg_gad7 = mean(gad7_score, na.rm = TRUE)
  )

mental_summary

# A tibble: 6 × 3
  agegroup avg_phq9 avg_gad7
  <chr>       <dbl>    <dbl>
1 18-25        13.5     8.69
2 26-35        13.0     9.62
3 36-45        13.4    11.0 
4 46-60        13.3    10   
5 60+          12.8    10.3 
6 unknown      12.0    11.4

Let’s make the above into a chart for visualizing.

mental_summary_long <- mental_summary %>%
  pivot_longer(
    cols = c(avg_phq9, avg_gad7),
    names_to = "score_type",
    values_to = "average_score"
  )

ggplot(mental_summary_long, aes(x = agegroup, y = average_score, fill = score_type)) +
  geom_col(position = "dodge") +
  labs(
    title = "Average PHQ-9 and GAD-7 Scores by Age Group",
    x = "Age Group",
    y = "Average Score",
    fill = "Score Type"
  )

This summary compares average depression screening scores (PHQ-9) and anxiety screening scores (GAD-7) across age groups. Grouping by age category helps show whether mental health burden differs across demographic segments. It’s clear that 18-25, 36-45 and 46-60 generally have the highest PHQ-9 scores while 36-45 and unknown age groups have the highest GAD-7 scores.