Data 607 Final Project: Involuntary Psychiatric Hospitalization Rates in New York

Introduction

This project analyzes psychiatric hospitalization patterns in New York City as a way to understand where emergency or involuntary mental health intervention may be most needed. Psychiatric hospitalization is important because it can indicate severe mental health crises where a person may be at risk of harm to themselves or others. Although hospitalization can provide immediate safety and stabilization, high hospitalization rates may also show gaps in outpatient care, crisis response, housing support, and preventive mental health services.

The main research question is: Do psychiatric hospitalization rates differ by community and demographic factors, and how can this information benefit the community?

My hypothesis is that communities with higher poverty levels and groups with less access to preventive mental health care will experience higher psychiatric hospitalization rates. Identifying these patterns can help public health officials better target crisis teams, outpatient programs, and community support services.

# Install any missing packages once; readr and ggplot2 are already part of tidyverse.
required_packages <- c("tidyverse", "janitor", "scales")
missing_packages <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing_packages) > 0) install.packages(missing_packages)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(readr)
library(ggplot2)
library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

Data Sources

sparcs_url <- "https://health.data.ny.gov/resource/46xm-urtu.csv?$limit=50000"

sparcs <- read_csv(sparcs_url) %>%
  clean_names()

## Rows: 50000 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (29): hospital_service_area, hospital_county, operating_certificate_numb...
## dbl  (2): discharge_year, apr_severity_of_illness_code
## num  (2): total_charges, total_costs
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(sparcs)

## Rows: 50,000
## Columns: 33
## $ hospital_service_area          <chr> "New York City", "New York City", "New …
## $ hospital_county                <chr> "Bronx", "Bronx", "Bronx", "Bronx", "Br…
## $ operating_certificate_number   <chr> "7000006", "7000006", "7000006", "70000…
## $ permanent_facility_id          <chr> "003058", "001168", "003058", "001169",…
## $ facility_name                  <chr> "Montefiore Med Center - Jack D Weiler …
## $ age_group                      <chr> "50 to 69", "30 to 49", "50 to 69", "18…
## $ zip_code_3_digits              <chr> "104", "104", "104", "104", "104", "104…
## $ gender                         <chr> "F", "M", "M", "M", "F", "F", "F", "F",…
## $ race                           <chr> "Other Race", "Black/African American",…
## $ ethnicity                      <chr> "Spanish/Hispanic", "Not Span/Hispanic"…
## $ length_of_stay                 <chr> "1", "4", "4", "5", "3", "1", "1", "1",…
## $ type_of_admission              <chr> "Emergency", "Emergency", "Emergency", …
## $ patient_disposition            <chr> "Home or Self Care", "Left Against Medi…
## $ discharge_year                 <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 202…
## $ ccsr_diagnosis_code            <chr> "NEO074", "CIR013", "INF002", "BLD005",…
## $ ccsr_diagnosis_description     <chr> "Conditions due to neoplasm or the trea…
## $ ccsr_procedure_code            <chr> "ADM021", "ADM021", "ADM019", NA, "ADM0…
## $ ccsr_procedure_description     <chr> "ADMINISTRATION OF THERAPEUTIC SUBSTANC…
## $ apr_drg_code                   <chr> "861", "134", "720", "662", "720", "244…
## $ apr_drg_description            <chr> "SIGNS, SYMPTOMS AND OTHER FACTORS INFL…
## $ apr_mdc_code                   <chr> "23", "04", "18", "16", "18", "06", "05…
## $ apr_mdc_description            <chr> "FACTORS INFLUENCING HLTH STAT & OTHR C…
## $ apr_severity_of_illness_code   <dbl> 2, 2, 4, 2, 3, 1, 2, 1, 2, 2, 2, 2, 1, …
## $ apr_severity_of_illness        <chr> "Moderate", "Moderate", "Extreme", "Mod…
## $ apr_risk_of_mortality          <chr> "Moderate", "Minor", "Extreme", "Minor"…
## $ apr_medical_surgical           <chr> "Medical", "Medical", "Medical", "Medic…
## $ payment_typology_1             <chr> "Medicaid", "Medicaid", "Medicare", "Me…
## $ payment_typology_2             <chr> NA, NA, "Medicaid", NA, "Medicaid", "Me…
## $ payment_typology_3             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ birth_weight                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ emergency_department_indicator <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
## $ total_charges                  <dbl> 27380.60, 82629.98, 76210.47, 77532.60,…
## $ total_costs                    <dbl> 3533.21, 12715.35, 11201.40, 13872.77, …
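
One caveat about this download is worth checking: `$limit=50000` returns the first 50,000 rows the API serves, not a random sample, so the extract may be concentrated in a few counties. The sketch below uses base R and a small made-up data frame standing in for `sparcs` (the rows are hypothetical, not taken from the download) to show how a coverage check by service area and county might look.

```r
# Hypothetical stand-in for `sparcs`; only the two location columns are needed.
toy_sparcs <- data.frame(
  hospital_service_area = c("New York City", "New York City",
                            "New York City", "Hudson Valley"),
  hospital_county = c("Bronx", "Bronx", "Kings", "Westchester"),
  stringsAsFactors = FALSE
)

# Cross-tabulate service area by county and reshape to one row per pair.
coverage <- as.data.frame(
  table(toy_sparcs$hospital_service_area, toy_sparcs$hospital_county),
  stringsAsFactors = FALSE
)
names(coverage) <- c("hospital_service_area", "hospital_county", "n")

# Drop combinations that never occur and add each pair's share of all rows.
coverage <- coverage[coverage$n > 0, ]
coverage$share <- coverage$n / sum(coverage$n)

# Largest contributors first.
coverage <- coverage[order(-coverage$n), ]
print(coverage)
```

Running the same tabulation on `sparcs` would show whether the extract covers the state evenly or is dominated by one county, which matters for how far the later results can be generalized.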

nyc_mh_source <- "https://a816-dohbesp.nyc.gov/IndicatorPublic/data-explorer/mental-health/"

nyc_mh_context <- tibble(
  source = "NYC Department of Health Mental Health Data Explorer",
  topic = "Mental health and the environment",
  key_finding = "Living in high poverty neighborhoods is associated with depression, serious psychological distress, and psychiatric hospitalization.",
  project_connection = "This source supports the project hypothesis that psychiatric hospitalization patterns may be connected to community conditions such as poverty, access to care, and neighborhood stressors.",
  url = nyc_mh_source
)

nyc_mh_context

## # A tibble: 1 × 5
##   source                              topic key_finding project_connection url  
##   <chr>                               <chr> <chr>       <chr>              <chr>
## 1 NYC Department of Health Mental He… Ment… Living in … This source suppo… http…

Data Cleaning and Transformation

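sparcs_clean <- sparcs %>%
  mutate(
    # length_of_stay is read as text, so extract the numeric part before averaging
    length_of_stay = readr::parse_number(as.character(length_of_stay)),
    total_charges = as.numeric(total_charges),
    # store the demographic variables as factors for grouping and modeling
    age_group = as.factor(age_group),
    gender = as.factor(gender),
    race = as.factor(race),
    ethnicity = as.factor(ethnicity)
  )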
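
Because `length_of_stay` is read as text, at least some entries are probably not plain numbers. SPARCS public-use files are understood to top-code very long stays with a value such as "120 +"; that format is an assumption here and should be confirmed against the data dictionary. If it holds, converting such an entry to 120 makes the mean a lower bound. A minimal base R sketch with hypothetical values:

```r
# Hypothetical raw values; "120 +" mimics an assumed top-coded entry.
los_raw <- c("1", "4", "12", "120 +", NA)

# Strip everything except digits, then convert to numeric.
los_days <- as.numeric(gsub("[^0-9]", "", los_raw))

# Flag top-coded stays: the recorded value is only a lower bound.
top_coded <- !is.na(los_raw) & grepl("+", los_raw, fixed = TRUE)

# Compare the mean with and without the top-coded records.
mean_all <- mean(los_days, na.rm = TRUE)
mean_uncensored <- mean(los_days[!top_coded], na.rm = TRUE)

# Share of non-missing records that are top-coded.
share_top_coded <- mean(top_coded[!is.na(los_raw)])

print(c(mean_all = mean_all,
        mean_uncensored = mean_uncensored,
        share_top_coded = share_top_coded))
```

If the top-coded share in the real data is small, the effect on the mean is limited. If it is large, the median is the safer summary.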

Mental Health-Related Hospitalizations

# Keep discharges whose APR-DRG description contains a mental health keyword
mental_health <- sparcs_clean %>%
  filter(
    str_detect(
      tolower(apr_drg_description),
      "mental|psych|depress|bipolar|schizophrenia|substance|suicide"
    )
  )

nrow(mental_health)

## [1] 2059
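
A keyword filter can miss descriptions that do not contain any of the listed words. One way to audit it is to compare it with a code-based rule. The sketch below assumes that `apr_mdc_code` values "19" and "20" mark the mental health and alcohol/drug use diagnostic categories. That mapping, and the five illustrative rows, are assumptions to verify against the SPARCS data dictionary rather than results from this project.

```r
# Illustrative rows only; descriptions and codes are not drawn from the download.
drg <- data.frame(
  apr_drg_description = c(
    "SCHIZOPHRENIA",
    "BIPOLAR DISORDERS",
    "MAJOR DEPRESSIVE DISORDERS & OTHER/UNSPECIFIED PSYCHOSES",
    "ALCOHOL ABUSE & DEPENDENCE",
    "HEART FAILURE"
  ),
  apr_mdc_code = c("19", "19", "19", "20", "05"),
  stringsAsFactors = FALSE
)

# Same keyword pattern used in the project.
pattern <- "mental|psych|depress|bipolar|schizophrenia|substance|suicide"
drg$keyword_match <- grepl(pattern, tolower(drg$apr_drg_description))

# Assumed code-based rule: MDC 19 (mental health) or 20 (alcohol/drug use).
drg$mdc_match <- drg$apr_mdc_code %in% c("19", "20")

# Rows where the two rules disagree deserve a manual look.
disagreements <- drg[drg$keyword_match != drg$mdc_match, ]
print(disagreements$apr_drg_description)
```

In this toy example the alcohol-related row is caught by the code rule but not by the keywords, because its description does not contain the word "substance". Listing the disagreements in the real data would show whether the 2,059 count is affected.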

Summary Statistics

mh_summary <- mental_health %>%
  summarise(
    total_hospitalizations = n(),
    average_length_of_stay = mean(length_of_stay, na.rm = TRUE),
    median_length_of_stay = median(length_of_stay, na.rm = TRUE),
    average_charges = mean(total_charges, na.rm = TRUE)
  )

mh_summary

## # A tibble: 1 × 4
##   total_hospitalizations average_length_of_stay median_length_of_stay
##                    <int>                  <dbl>                 <dbl>
## 1                   2059                   16.9                    12
## # ℹ 1 more variable: average_charges <dbl>

Hospitalizations by Age Group

age_summary <- mental_health %>%
  count(age_group, sort = TRUE)

ggplot(age_summary, aes(x = reorder(age_group, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Mental Health-Related Hospitalizations by Age Group",
    x = "Age Group",
    y = "Number of Hospitalizations"
  )

Hospitalizations by Race

race_summary <- mental_health %>%
  count(race, sort = TRUE)

ggplot(race_summary, aes(x = reorder(race, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Mental Health-Related Hospitalizations by Race",
    x = "Race",
    y = "Number of Hospitalizations"
  )

Hospitalizations by Gender

gender_summary <- mental_health %>%
  count(gender, sort = TRUE)

ggplot(gender_summary, aes(x = gender, y = n)) +
  geom_col() +
  labs(
    title = "Mental Health-Related Hospitalizations by Gender",
    x = "Gender",
    y = "Number of Hospitalizations"
  )
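
The bar charts in these sections show counts, and a large group can have many hospitalizations simply because it has many residents. Turning counts into rates requires a population denominator, which the SPARCS file does not include. The sketch below uses invented group labels and invented population figures, purely to illustrate the arithmetic. Real denominators would have to come from a source such as census estimates.

```r
# Invented counts and populations; none of these numbers come from the project data.
counts <- data.frame(
  group = c("A", "B", "C"),
  hospitalizations = c(120, 80, 40)
)
population <- data.frame(
  group = c("A", "B", "C"),
  residents = c(600000, 200000, 400000)
)

# Join counts to denominators and compute hospitalizations per 100,000 residents.
rates <- merge(counts, population, by = "group")
rates$rate_per_100k <- rates$hospitalizations / rates$residents * 1e5

# Rank by rate rather than by raw count.
rates <- rates[order(-rates$rate_per_100k), ]
print(rates)
```

Here group A has the most hospitalizations but group B has the highest rate, which is why counts alone cannot show which group is at greatest risk.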

Statistical Analysis: Length of Stay by Gender

mental_health %>%
  group_by(gender) %>%
  summarise(
    mean_los = mean(length_of_stay, na.rm = TRUE),
    median_los = median(length_of_stay, na.rm = TRUE),
    n = n()
  )

## # A tibble: 3 × 4
##   gender mean_los median_los     n
##   <fct>     <dbl>      <dbl> <int>
## 1 F         15.2          11   971
## 2 M         18.4          13  1085
## 3 U          8.33          8     3

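# Welch two-sample t-test comparing mean length of stay for female and male patients
mh_binary_gender <- mental_health %>%
  filter(gender %in% c("M", "F"))

t.test(length_of_stay ~ gender, data = mh_binary_gender)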

## 
##  Welch Two Sample t-test
## 
## data:  length_of_stay by gender
## t = -3.7803, df = 2039.5, p-value = 0.0001611
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -4.794200 -1.519053
## sample estimates:
## mean in group F mean in group M 
##        15.24614        18.40276
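
The mean length of stay (16.9 days) is well above the median (12 days), which suggests a right-skewed distribution. As a possible robustness check, the sketch below applies a Wilcoxon rank-sum test and a t-test on log-transformed stays to simulated data. The simulated values and the seed are assumptions for illustration, so the printed p-values say nothing about the SPARCS records.

```r
set.seed(607)

# Simulated right-skewed stays (in whole days, minimum 1) for two groups.
toy <- data.frame(
  gender = rep(c("F", "M"), each = 200),
  length_of_stay = c(
    round(rexp(200, rate = 1 / 15)) + 1,
    round(rexp(200, rate = 1 / 18)) + 1
  )
)

# Rank-based test: does not assume normality and is less sensitive to long tails.
wilcox_result <- wilcox.test(length_of_stay ~ gender, data = toy)

# Welch t-test on the log scale, which reduces the influence of very long stays.
log_t_result <- t.test(log(length_of_stay) ~ gender, data = toy)

print(wilcox_result$p.value)
print(log_t_result$p.value)
```

To apply this to the project data, `toy` would be replaced by the subset of `mental_health` restricted to the "F" and "M" groups. If the three tests agree, the conclusion about gender does not depend on the normality assumption.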

Statistical Analysis: Charges by Age Group

anova_model <- aov(total_charges ~ age_group, data = mental_health)
summary(anova_model)

##               Df    Sum Sq   Mean Sq F value   Pr(>F)    
## age_group      4 7.786e+11 1.947e+11   9.615 1.06e-07 ***
## Residuals   2054 4.158e+13 2.024e+10                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
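
A significant ANOVA shows only that at least one age group differs, not which ones. A common follow-up is Tukey's HSD for pairwise comparisons, with a Kruskal-Wallis test as a rank-based check because charges are usually skewed. The sketch below runs both on simulated charges. The age labels mirror the five groups implied by the ANOVA table, but the values are made up.

```r
set.seed(607)

# Five illustrative age groups with simulated, right-skewed charges.
age_levels <- c("0 to 17", "18 to 29", "30 to 49", "50 to 69", "70 or Older")
toy <- data.frame(
  age_group = factor(rep(age_levels, each = 40), levels = age_levels),
  total_charges = rlnorm(200, meanlog = 10.5, sdlog = 0.8)
)

# One-way ANOVA, as in the project.
toy_model <- aov(total_charges ~ age_group, data = toy)

# Pairwise comparisons with family-wise error control.
tukey <- TukeyHSD(toy_model)

# Rank-based alternative that does not assume normal residuals.
kruskal <- kruskal.test(total_charges ~ age_group, data = toy)

print(tukey$age_group[, c("diff", "p adj")])
print(kruskal$p.value)
```

With five groups there are ten pairwise comparisons. On the real data, `TukeyHSD(anova_model)` would show which pairs of age groups drive the overall result.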

Dashboard

# Install shiny only if it is not already available
if (!requireNamespace("shiny", quietly = TRUE)) install.packages("shiny")

library(shiny)

ui <- fluidPage(
  titlePanel("Mental Health-Related Hospitalizations in New York"),

  sidebarLayout(
    sidebarPanel(
      selectInput(
        inputId = "group_var",
        label = "Choose a variable to view:",
        choices = c("Age Group" = "age_group",
                    "Gender" = "gender",
                    "Race" = "race",
                    "Hospital County" = "hospital_county")
      )
    ),

    mainPanel(
      plotOutput("bar_plot"),
      tableOutput("summary_table")
    )
  )
)

server <- function(input, output) {

  output$bar_plot <- renderPlot({
    mental_health %>%
      count(.data[[input$group_var]], sort = TRUE) %>%
      ggplot(aes(x = reorder(.data[[input$group_var]], n), y = n)) +
      geom_col() +
      coord_flip() +
      labs(
        title = "Mental Health-Related Hospitalizations",
        x = input$group_var,
        y = "Number of Hospitalizations"
      )
  })

  output$summary_table <- renderTable({
    mental_health %>%
      count(.data[[input$group_var]], sort = TRUE)
  })
}

shinyApp(ui = ui, server = server)

Shiny applications not supported in static R Markdown documents
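
Because the app cannot run in a static document, its counting logic is hard to check from the rendered report. One option is to move that logic into a plain function that can be tested without Shiny. The helper below, `count_by()`, is a hypothetical name introduced for this sketch and written in base R. It is not part of the dashboard code above.

```r
# Count rows per level of one column and sort from most to least frequent.
# `count_by` is a hypothetical helper, not a function from shiny or dplyr.
count_by <- function(data, group_var) {
  stopifnot(is.character(group_var), length(group_var) == 1)
  stopifnot(group_var %in% names(data))

  counts <- as.data.frame(
    table(data[[group_var]], useNA = "ifany"),
    stringsAsFactors = FALSE
  )
  names(counts) <- c(group_var, "n")
  counts[order(-counts$n), , drop = FALSE]
}

# Made-up rows standing in for `mental_health`.
toy <- data.frame(
  gender = c("M", "F", "M"),
  age_group = c("30 to 49", "30 to 49", "18 to 29"),
  stringsAsFactors = FALSE
)

gender_counts <- count_by(toy, "gender")
print(gender_counts)
```

Inside the server function, both `renderPlot()` and `renderTable()` could then call `count_by(mental_health, input$group_var)`. This keeps the two outputs consistent and rejects a column name that does not exist.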

Community Benefit Interpretation

The results of this project help show how mental health crises affect different groups of people in New York and why psychiatric hospitalization is an important public health issue. The sample of 50,000 discharge records included 2,059 mental health-related hospitalizations, showing that severe psychiatric emergencies affect many individuals and communities. The average hospital stay was about 17 days (median 12 days), which suggests that many patients needed extended treatment and stabilization before discharge. Adults ages 30 to 49 accounted for the highest number of psychiatric hospitalizations. This may reflect the pressures that people in this age group face, including work responsibilities, financial stress, housing insecurity, family obligations, and substance use.

The data also showed that Black/African American patients represented one of the largest groups in the hospitalization data, which may point to differences in access to preventive mental health care, economic inequality, or barriers to receiving treatment before a crisis becomes severe. Because these figures are counts rather than population-adjusted rates, they show who appears in the hospital records, not which group is at greatest risk. In addition, male patients stayed in the hospital longer on average than female patients (18.4 versus 15.2 days), and a Welch two-sample t-test showed that this difference was statistically significant (p < 0.001). Longer hospital stays may suggest more severe psychiatric conditions or more complex treatment needs.

The second data source from the NYC Department of Health supports this interpretation because it connects psychiatric hospitalization to broader neighborhood conditions, including poverty and serious psychological distress. This means the results should not only be viewed as individual hospital cases, but also as signs of larger community mental health needs.

These findings are important because they can help public health officials understand where mental health resources are most needed. Communities may benefit from more outpatient mental health clinics, crisis intervention teams, supportive housing programs, substance abuse treatment services, and better follow-up care after discharge. By identifying patterns in psychiatric hospitalizations, healthcare systems can focus more on prevention and early intervention, which may reduce psychiatric emergencies, repeated hospitalizations, and pressure on emergency departments.

Conclusion

This project analyzed psychiatric hospitalization patterns in New York using public hospital discharge data and statistical analysis to better understand mental health crises across different demographic groups. The findings showed that psychiatric hospitalizations varied by age group, race, and gender, and that length of stay also differed between groups. Adults ages 30 to 49 had the highest number of psychiatric hospitalizations, while male patients had significantly longer hospital stays than female patients. The analysis also found significant differences in hospital charges across age groups (one-way ANOVA, p < 0.001), suggesting that some groups may require more intensive or costly treatment. Overall, the results are consistent with the hypothesis that psychiatric hospitalization patterns differ across demographic groups, although the SPARCS data alone do not measure poverty or access to preventive care, so that part of the hypothesis was not tested directly.

The second data source used in this project was the NYC Department of Health Mental Health Data Explorer. This source provides information about mental health trends and community conditions in New York City. Unlike the SPARCS dataset, which focuses on hospital records, the NYC Department of Health source explains how factors such as poverty, stress, housing problems, limited healthcare access, and substance use can affect mental health in different communities.

This source helped strengthen the project because it offered possible explanations for some of the patterns found in the hospitalization data. The SPARCS dataset showed which groups had higher numbers of psychiatric hospitalizations, while the NYC mental health source helped explain why some communities may experience more mental health crises than others. For example, the hospitalization analysis showed higher numbers of psychiatric hospitalizations among adults ages 30 to 49 and among Black/African American patients. The NYC Department of Health source suggests that social and economic conditions may contribute to these differences.

Using both data sources together made the project stronger because one source provided the hospital statistics and the other provided public health context about the communities affected by mental health problems. Together, the two sources helped create a better understanding of how psychiatric hospitalization may be connected to larger community and public health issues in New York City.

The findings suggest that hospitalization data can help identify communities that may need stronger mental health support systems, preventive care, and crisis intervention services. This project also demonstrated how data science and statistical analysis can be used in healthcare research to better understand public health problems and support evidence-based decision making. By using visualizations, summary statistics, and hypothesis testing, the project showed how healthcare data can provide meaningful insights into community mental health needs.

Challenges Encountered

One of the biggest challenges in this project was working with a real-world healthcare dataset that required extensive cleaning and preparation before analysis could begin. Some variables that should have been numeric were stored as text values, which caused errors during calculations and statistical testing. For example, the length of stay variable had to be converted into a numeric format before averages and statistical analyses could be performed. Another challenge was that some variable names in the dataset were different from what was expected based on earlier examples or documentation. This required checking the structure of the dataset and adjusting the code to use the correct variable names.

In addition, the dataset did not directly identify whether psychiatric hospitalizations were voluntary or involuntary, which limited the ability to specifically study involuntary admissions. Because of this, the project focused more broadly on psychiatric hospitalizations as indicators of severe mental health crises. These challenges reflect common problems faced by data analysts and healthcare researchers when working with large public datasets. Addressing these issues improved the accuracy and reliability of the final analysis and helped create a more realistic understanding of the data preparation process in healthcare analytics.
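
The naming and type problems described in this section can be caught early with a few defensive checks run immediately after loading. The sketch below is one possible approach in base R. `check_columns()` and `stored_as_text()` are hypothetical helper names, and the three-row data frame is made up for illustration.

```r
# Stop with a clear message if any expected column is absent.
check_columns <- function(data, expected) {
  missing_cols <- setdiff(expected, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing expected columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Report which columns that should be numeric are stored as character.
stored_as_text <- function(data, numeric_cols) {
  is_text <- vapply(data[numeric_cols], is.character, logical(1))
  names(is_text)[is_text]
}

# Made-up rows: length_of_stay arrives as text, total_charges as numbers.
toy <- data.frame(
  length_of_stay = c("1", "4", "7"),
  total_charges = c(100, 200, 300),
  stringsAsFactors = FALSE
)

numeric_cols <- c("length_of_stay", "total_charges")
check_columns(toy, numeric_cols)
text_cols <- stored_as_text(toy, numeric_cols)
print(text_cols)
```

Running checks like these on `sparcs` before any analysis would surface renamed columns and text-typed numbers as explicit messages rather than as errors deep inside a later calculation.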

New Feature Not Covered in Class

For the new feature, I created an interactive Shiny dashboard that allows users to explore mental health-related hospitalizations by age group, gender, race, and hospital county. This goes beyond regular static graphs because the viewer can choose which variable they want to analyze. This makes the project more interactive and useful for public health planning because different users can focus on the groups or locations most relevant to them.

A second feature used in this project that was not directly covered in class was working with a large real-world healthcare administrative dataset focused on psychiatric hospitalization trends. Unlike smaller practice datasets often used in coursework, the SPARCS hospital discharge dataset contained thousands of records with inconsistent formatting, healthcare coding systems, and complex demographic variables. The project required identifying mental health-related hospitalizations by filtering APR-DRG descriptions with keyword patterns, which introduced a healthcare analytics component beyond the examples discussed in class.

Another feature that extended beyond the course material was applying multiple data science techniques together within a public health setting. The project combined data cleaning, transformation, visualization, statistical testing, and interpretation of healthcare outcomes to study mental health crises in New York communities. Instead of only performing exploratory analysis, the project connected the statistical results to real-world public health concerns such as crisis intervention, preventive mental health care, and healthcare inequality.

The project also required interpreting healthcare trends in a meaningful way for community benefit rather than focusing only on technical analysis. This included examining how demographic differences in psychiatric hospitalization counts may reflect barriers to healthcare access, economic stress, or gaps in preventive services. Using data science to support healthcare and public policy decisions added a practical application that went beyond many of the examples covered during the course.