Introduction to Medical Data Analysis
Analyzing Mental Health Survey Data in R
Introduction
Medical data analysis is an important skill for healthcare researchers and data scientists.
In this presentation, we will use R to analyze mental health survey data.
Learning Objectives
By the end of this presentation, you will understand how to:
Load a medical dataset
Clean missing and inconsistent values
Perform exploratory data analysis
Create visualizations
Build a simple predictive model
Dataset Overview
We will use the Mental Health in Tech Survey dataset from Kaggle.
The dataset includes:
Age
Gender
Workplace factors
Mental health treatment history
Work interference
Company size
Analysis Workflow
The full workflow includes:
Load the dataset
Clean the data
Explore variables
Visualize patterns
Build a logistic regression model
Evaluate model accuracy
Setting Up the Environment
Install and load the required R packages.
# Run the install step once; it is not needed on every render
install.packages(c("tidyverse", "ggplot2", "corrplot", "caret"))
library(tidyverse)
library(ggplot2)
library(corrplot)
library(caret)
Load the Dataset
data <- read.csv("survey.csv", stringsAsFactors = FALSE)
head(data)
str(data)
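Before cleaning, it helps to see where values are missing. The sketch below uses a small made-up data frame (`survey_demo` is a stand-in, not the real survey) to count missing values per column and the share of complete rows. The same two calls should work on `data` once it is loaded.

```r
# Toy stand-in for the survey data; the real file has many more columns
survey_demo <- data.frame(
  Age = c(29, 41, NA, 35),
  work_interfere = c("Often", NA, "Rarely", NA),
  treatment = c("Yes", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(survey_demo))
print(na_counts)

# Share of rows with no missing value in any column
complete_share <- mean(complete.cases(survey_demo))
print(complete_share)
```

Columns with many missing values deserve a decision before modeling: drop the rows, drop the column, or treat "missing" as its own category.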
Clean Treatment Responses
data$treatment <- tolower(data$treatment)
data$treatment[data$treatment %in% c("y", "yes")] <- "yes"
data$treatment[data$treatment %in% c("n", "no")] <- "no"
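Free-text survey fields such as Gender often hold many spellings of the same answer, which fragments any plot grouped by that field. The sketch below shows one possible way to collapse them. The helper `clean_gender`, the example responses, and the three output groups are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical free-text responses of the kind a survey field can collect
gender_raw <- c("Male", "male ", "M", "Female", "f", "woman", "Non-binary", NA)

# Hypothetical helper: map common spellings onto a small set of groups
clean_gender <- function(x) {
  x <- tolower(trimws(x))                        # normalize case and stray spaces
  out <- rep("other", length(x))                 # default group for everything else
  out[x %in% c("m", "male", "man")] <- "male"
  out[x %in% c("f", "female", "woman")] <- "female"
  out[is.na(x)] <- NA                            # keep missing answers missing
  out
}

gender_clean <- clean_gender(gender_raw)
print(table(gender_clean, useNA = "ifany"))
```

If the real column needs this, applying it would look like `data$Gender <- clean_gender(data$Gender)`, after checking `unique(data$Gender)` to see which spellings actually occur.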
Exploratory Data Analysis
summary(data)
table(data$treatment)
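Raw counts are hard to compare when groups differ in size, so proportions are often more useful. The sketch below uses a made-up data frame (`demo`, not the real survey) to show the overall share and the share within each group.

```r
# Made-up example data; the values are for illustration only
demo <- data.frame(
  Gender = c("male", "male", "male", "male",
             "female", "female", "female", "female"),
  treatment = c("yes", "no", "no", "no",
                "yes", "yes", "yes", "no")
)

# Overall share of each treatment response
overall <- prop.table(table(demo$treatment))
print(overall)

# Share of each treatment response within each gender (rows sum to 1)
by_gender <- prop.table(table(demo$Gender, demo$treatment), margin = 1)
print(by_gender)
```

Setting `margin = 1` normalizes within rows, so each gender's shares can be compared directly even when the groups have different numbers of respondents.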
Treatment by Gender
ggplot(data, aes(x = Gender, fill = treatment)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Mental Health Treatment by Gender",
    x = "Gender",
    y = "Count"
  ) +
  theme_minimal()
Correlation Analysis
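# Keep numeric columns only and drop rows with missing values.
# cor() is only informative when at least two numeric columns remain.
num_data <- data %>%
  select(where(is.numeric)) %>%
  na.omit()
corrplot(
  cor(num_data),
  method = "color",
  tl.cex = 0.8
)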
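When most survey answers are stored as text, few columns qualify as numeric. One option is to encode ordered categories as integer scores and use a rank correlation. The sketch below assumes `work_interfere` takes the ordered values "Never", "Rarely", "Sometimes", "Often". That ordering and the toy data are assumptions to verify against the real file.

```r
# Toy data; values are made up for illustration
demo <- data.frame(
  Age = c(25, 32, 41, 29, 38, 45),
  work_interfere = c("Never", "Rarely", "Sometimes",
                     "Often", "Sometimes", "Often"),
  stringsAsFactors = FALSE
)

# Assumed ordering of the response categories, lowest to highest
interfere_levels <- c("Never", "Rarely", "Sometimes", "Often")

# Convert each response to its position in the ordering (1 to 4)
demo$interfere_score <- as.integer(
  factor(demo$work_interfere, levels = interfere_levels)
)

# Spearman correlation works on ranks, which suits ordinal scores
cor_matrix <- cor(demo[, c("Age", "interfere_score")], method = "spearman")
print(round(cor_matrix, 2))
```

The resulting matrix can be passed to `corrplot()` in the same way as the output of `cor(num_data)`.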
Predictive Modeling
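We will use logistic regression to predict whether a respondent has sought mental health treatment. The outcome is stored as a factor with "no" as the reference level, so the model estimates the probability of "yes".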
data$treatment <- factor(data$treatment, levels = c("no", "yes"))
Select Model Variables
model_data <- data %>%
  select(Age, work_interfere, no_employees, treatment) %>%
  na.omit()
Split Training and Test Data
set.seed(123)
# Stratified 70/30 split that preserves the treatment class balance
train_index <- createDataPartition(
  model_data$treatment,
  p = 0.7,
  list = FALSE
)
train <- model_data[train_index, ]
test <- model_data[-train_index, ]
Logistic Regression Model
model <- glm(
  treatment ~ Age + work_interfere + no_employees,
  data = train,
  family = binomial
)
summary(model)
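The coefficients reported by `summary()` are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios. The sketch below fits a model on simulated data (the data frame `sim` and its effect sizes are invented for illustration) and reports odds ratios with Wald confidence intervals.

```r
set.seed(42)
n <- 200
levels_wi <- c("Never", "Rarely", "Sometimes", "Often")

# Simulated predictors; not the real survey
sim <- data.frame(
  Age = round(runif(n, 20, 60)),
  work_interfere = factor(sample(levels_wi, n, replace = TRUE),
                          levels = levels_wi)
)

# Simulated outcome: treatment becomes more likely as interference increases
lin <- -1 + 0.8 * (as.integer(sim$work_interfere) - 1)
sim$treatment <- factor(ifelse(runif(n) < plogis(lin), "yes", "no"),
                        levels = c("no", "yes"))

fit <- glm(treatment ~ Age + work_interfere, data = sim, family = binomial)

# Odds ratios and Wald 95% intervals (exponentiated log-odds)
odds_ratios <- exp(coef(fit))
ci <- exp(confint.default(fit))
print(round(cbind(OR = odds_ratios, ci), 2))
```

An odds ratio above 1 means the odds of "yes" are higher than in the reference category (here "Never"). A ratio below 1 means they are lower. For the fitted `model`, the same two lines apply with `fit` replaced by `model`.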
Make Predictions
# Predicted probability of "yes", then a 0.5 cutoff to assign classes
pred <- predict(model, newdata = test, type = "response")
pred_class <- ifelse(pred > 0.5, "yes", "no")
pred_class <- factor(pred_class, levels = c("no", "yes"))
Evaluate Model Accuracy
accuracy <- mean(pred_class == test$treatment)
print(paste0("Model Accuracy: ", round(accuracy * 100, 2), "%"))
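Accuracy alone can hide how a classifier fails, especially when one class is more common. A confusion matrix separates the two kinds of error. The sketch below uses made-up labels (`actual` and `predicted` are illustrative, not model output) and base R only.

```r
# Made-up labels for illustration
actual <- factor(c("yes", "yes", "yes", "no", "no", "no", "yes", "no"),
                 levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "yes", "no", "yes", "no", "yes", "no"),
                    levels = c("no", "yes"))

# Rows are predictions, columns are true labels
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

tp <- cm["yes", "yes"]  # correctly predicted "yes"
tn <- cm["no", "no"]    # correctly predicted "no"
fp <- cm["yes", "no"]   # predicted "yes", actually "no"
fn <- cm["no", "yes"]   # predicted "no", actually "yes"

accuracy <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)  # share of true "yes" cases found
specificity <- tn / (tn + fp)  # share of true "no" cases found
print(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity))
```

With the objects from this presentation, `table(Predicted = pred_class, Actual = test$treatment)` builds the same matrix. The caret package also provides `confusionMatrix()`, which reports these metrics in one call.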
Age Distribution by Treatment
ggplot(data, aes(x = Age, fill = treatment)) +
  geom_histogram(binwidth = 5, color = "black") +
  labs(
    title = "Age Distribution by Mental Health Treatment",
    x = "Age",
    y = "Count"
  ) +
  theme_minimal()
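A histogram with a fixed bin width is sensitive to extreme values: a single implausible age stretches the axis and flattens the rest of the distribution. The sketch below filters ages to a plausible range first. The 18 to 100 bounds and the example values are assumptions, so check `summary(data$Age)` before choosing limits for the real data.

```r
# Made-up ages, including a few entry errors
ages <- c(25, 31, -29, 44, 329, 38, 5, 52)

# Assumed plausible range for working-age survey respondents
min_age <- 18
max_age <- 100

plausible <- ages >= min_age & ages <= max_age
ages_clean <- ages[plausible]

print(ages_clean)
print(sum(!plausible))  # number of values dropped
```

On the survey data frame, the equivalent step would be `data <- data %>% filter(Age >= 18, Age <= 100)` before plotting.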
Key Takeaways
R can clean and analyze medical data.
Visualization helps reveal treatment patterns.
Logistic regression can model treatment-seeking outcomes.
Quarto makes the workflow reproducible.
Conclusion
This presentation introduced a basic medical data analysis workflow using R.
The same approach can be applied to many healthcare datasets.