Introduction to Medical Data Analysis

Analyzing Mental Health Survey Data in R

Tiffany M. Kollah, BSPH

Introduction

Medical data analysis is an important skill for healthcare researchers and data scientists.

In this presentation, we will use R to analyze mental health survey data.

Learning Objectives

By the end of this presentation, you will understand how to:

Load a medical dataset
Clean missing and inconsistent values
Perform exploratory data analysis
Create visualizations
Build a simple predictive model

Dataset Overview

We will use the Mental Health in Tech Survey dataset from Kaggle.

The dataset includes:

Age
Gender
Workplace factors
Mental health treatment history
Work interference
Company size

Analysis Workflow

The full workflow includes:

Load the dataset
Clean the data
Explore variables
Visualize patterns
Build a logistic regression model
Evaluate model accuracy

Setting Up the Environment

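Install the required R packages once, then load them at the start of each session.

# Run once; skip this line if the packages are already installed.
install.packages(c("tidyverse", "corrplot", "caret"))

library(tidyverse)  # attaches ggplot2, dplyr, and tidyr
library(corrplot)
library(caret)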

Load the Dataset

data <- read.csv("survey.csv", stringsAsFactors = FALSE)

head(data)
str(data)

Check Missing Values

colSums(is.na(data))
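One thing this count can miss: by default, read.csv only treats the literal string NA as missing in text columns, so blank answers come through as empty strings. The sketch below uses a made-up three-row extract (not the real survey.csv) to show how the na.strings argument changes the count.

```r
# Hypothetical three-row extract standing in for survey.csv.
csv_text <- "Age,Gender,treatment,work_interfere
37,Female,Yes,Often
44,M,No,
32,Male,No,NA"

# Default behavior: only the string NA is treated as missing in text columns.
raw_default <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Stricter reading: empty strings are treated as missing as well.
raw_strict <- read.csv(
  text = csv_text,
  stringsAsFactors = FALSE,
  na.strings = c("NA", "")
)

# Compare the per-column missing counts under each setting.
colSums(is.na(raw_default))
colSums(is.na(raw_strict))
```

If the two counts differ on the real file, the stricter setting is probably the safer starting point.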

Remove Missing Values

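# Drop rows only where an analysis variable is missing; na.omit(data) would also drop rows with gaps in unused columns.
data <- drop_na(data, Age, Gender, treatment, work_interfere, no_employees)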
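Missing values are not the only data-quality problem. Free-entry age fields in surveys often contain implausible values, which can distort summaries and histograms. A minimal sketch of a range check follows; the toy ages and the 18 to 75 bounds are assumptions for illustration, not rules taken from the dataset.

```r
# Hypothetical ages, including a few entry errors.
ages <- data.frame(Age = c(29, 41, -5, 329, 35, 11))

# Assumed plausible working-age range; adjust to the study population.
min_age <- 18
max_age <- 75

# Keep only the rows whose age falls inside the assumed range.
valid <- ages$Age >= min_age & ages$Age <= max_age
ages_clean <- ages[valid, , drop = FALSE]

summary(ages_clean$Age)
```

On the survey data, the same filter could be applied to data$Age before plotting or modeling.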

Clean Treatment Responses

# Standardize case and strip stray spaces before recoding.
data$treatment <- tolower(trimws(data$treatment))

data$treatment[data$treatment %in% c("y", "yes")] <- "yes"
data$treatment[data$treatment %in% c("n", "no")] <- "no"
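The same recoding idea applies to other free-text columns such as Gender, where one answer can be spelled many ways. The sketch below is one possible approach on made-up responses; the spelling lists and the three output groups are assumptions, and collapsing everything else into "Other" is a simplification that hides distinct identities.

```r
# Hypothetical raw responses with inconsistent spelling, case, and spacing.
gender_raw <- c("Male", "male", "M", " m", "Female", "F",
                "woman", "Non-binary", "Male ")

# Normalize case and whitespace before matching.
gender_key <- tolower(trimws(gender_raw))

# Assumed spelling lists; extend them after inspecting unique(gender_key).
male_terms <- c("m", "male", "man")
female_terms <- c("f", "female", "woman")

gender_clean <- ifelse(
  gender_key %in% male_terms, "Male",
  ifelse(gender_key %in% female_terms, "Female", "Other")
)

table(gender_clean)
```

Reviewing the unmatched values before grouping them helps catch spellings the lists missed.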

Exploratory Data Analysis

summary(data)

table(data$treatment)

Treatment by Gender

ggplot(data, aes(x = Gender, fill = treatment)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Mental Health Treatment by Gender",
    x = "Gender",
    y = "Count"
  ) +
  theme_minimal()
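Raw counts can mislead when the groups differ in size, so it may help to compare treatment rates within each gender as well. A minimal sketch with prop.table, using a small made-up data frame rather than the survey itself:

```r
# Hypothetical responses: four men and two women.
survey <- data.frame(
  Gender = c("Male", "Male", "Male", "Male", "Female", "Female"),
  treatment = c("yes", "no", "no", "no", "yes", "no")
)

# Cross-tabulate counts, then convert each row to proportions.
counts <- table(survey$Gender, survey$treatment)
rates <- prop.table(counts, margin = 1)  # margin = 1 means rows sum to 1

round(rates, 2)
```

In this toy example, the counts are equal (one "yes" in each group) but the rates are not.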

Correlation Analysis

num_data <- data %>%
  select(where(is.numeric)) %>%
  na.omit()

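# A correlation plot is only informative with at least two numeric columns.
if (ncol(num_data) >= 2) {
  corrplot(
    cor(num_data),
    method = "color",
    tl.cex = 0.8
  )
}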
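Most survey answers are text, so they drop out of a numeric correlation matrix. One option for ordered answers is to convert them to ranks and use Spearman correlation. The sketch below assumes four response labels for work interference; check unique(data$work_interfere) for the actual coding before reusing it. The ages and answers are made up.

```r
# Assumed ordering of the response labels, from least to most interference.
levels_assumed <- c("Never", "Rarely", "Sometimes", "Often")

# Hypothetical answers and ages for five respondents.
interfere <- c("Often", "Never", "Sometimes", "Rarely", "Often")
age <- c(40, 25, 33, 29, 45)

# Map each label to its position in the assumed ordering (1 to 4).
interfere_score <- as.integer(
  factor(interfere, levels = levels_assumed, ordered = TRUE)
)

# Rank-based correlation is a reasonable fit for ordinal scores.
cor(age, interfere_score, method = "spearman")
```

Labels that are not in the assumed list become NA, which is a useful signal that the list needs updating.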

Predictive Modeling

We will use logistic regression to predict whether a respondent has sought mental health treatment, based on age, work interference, and company size.

data$treatment <- factor(data$treatment, levels = c("no", "yes"))

Select Model Variables

model_data <- data %>%
  select(Age, work_interfere, no_employees, treatment) %>%
  na.omit()

Split Training and Test Data

set.seed(123)

train_index <- createDataPartition(
  model_data$treatment,
  p = 0.7,
  list = FALSE
)

train <- model_data[train_index, ]
test <- model_data[-train_index, ]
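Splitting on the outcome keeps the share of "yes" and "no" answers similar in both sets. The base R sketch below illustrates that idea on a made-up outcome vector; it is a simplified stand-in for stratified sampling, not caret's exact procedure.

```r
set.seed(123)

# Hypothetical outcome: 60 "no" and 40 "yes" responses.
outcome <- factor(rep(c("no", "yes"), times = c(60, 40)))

# Sample 70% of the row indices within each class separately.
idx_by_class <- split(seq_along(outcome), outcome)
train_idx <- unname(unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
})))

train_y <- outcome[train_idx]
test_y <- outcome[-train_idx]

# Class proportions should match across the two sets.
prop.table(table(train_y))
prop.table(table(test_y))
```

Running the same proportion check on train$treatment and test$treatment is a quick way to confirm the real split is balanced.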

Logistic Regression Model

model <- glm(
  treatment ~ Age + work_interfere + no_employees,
  data = train,
  family = binomial
)

summary(model)
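The coefficients in a logistic regression summary are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios. The sketch below fits a model to simulated data; the variable names and effect sizes are invented for illustration and say nothing about the survey results.

```r
set.seed(1)
n <- 200

# Simulated predictors: age in years and a 0/1 interference flag.
toy <- data.frame(
  age = round(runif(n, 20, 60)),
  interferes = rbinom(n, 1, 0.4)
)

# Simulated outcome: interference raises the odds of treatment.
p_yes <- plogis(-1 + 1.5 * toy$interferes)
toy$treatment <- factor(
  ifelse(rbinom(n, 1, p_yes) == 1, "yes", "no"),
  levels = c("no", "yes")
)

fit <- glm(treatment ~ age + interferes, data = toy, family = binomial)

# Odds ratios: values above 1 mean higher odds of "yes".
odds_ratios <- exp(coef(fit))
round(odds_ratios, 2)
```

The same exp(coef(model)) call works on the survey model, where each factor level is compared with its reference level.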

Make Predictions

# type = "response" returns the predicted probability of the second factor level ("yes").
pred <- predict(model, newdata = test, type = "response")

pred_class <- ifelse(pred > 0.5, "yes", "no")
pred_class <- factor(pred_class, levels = c("no", "yes"))

Evaluate Model Accuracy

accuracy <- mean(pred_class == test$treatment)

print(paste("Model Accuracy:", round(accuracy * 100, 2), "%"))
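Accuracy alone can hide which kind of error the model makes, and it only means something next to a baseline. A minimal sketch of a confusion matrix in base R, using made-up predictions rather than the model output:

```r
# Hypothetical true labels and predicted labels for eight respondents.
actual <- factor(
  c("yes", "yes", "no", "no", "yes", "no", "yes", "no"),
  levels = c("no", "yes")
)
predicted <- factor(
  c("yes", "no", "no", "yes", "yes", "no", "yes", "no"),
  levels = c("no", "yes")
)

# Rows are predictions, columns are true labels.
conf_mat <- table(Predicted = predicted, Actual = actual)

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
sensitivity <- conf_mat["yes", "yes"] / sum(conf_mat[, "yes"])  # true "yes" found
specificity <- conf_mat["no", "no"] / sum(conf_mat[, "no"])     # true "no" found

# Baseline: accuracy from always predicting the most common class.
baseline <- max(prop.table(table(actual)))

conf_mat
c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, baseline = baseline)
```

With the real model, pred_class and test$treatment would take the place of the toy vectors.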

Age Distribution by Treatment

ggplot(data, aes(x = Age, fill = treatment)) +
  geom_histogram(binwidth = 5, color = "black") +
  labs(
    title = "Age Distribution by Treatment Status",
    x = "Age",
    y = "Count"
  ) +
  theme_minimal()

Key Takeaways

R can clean and analyze medical data.
Visualization helps reveal treatment patterns.
Logistic regression can model treatment-seeking outcomes.
Quarto makes the workflow reproducible.

Conclusion

This presentation introduced a basic medical data analysis workflow using R.

The same approach can be applied to many healthcare datasets.

Thank You

Questions?