Introduction
Load Required Packages
Load and Inspect the Dataset
Data Cleaning
Exploratory Data Analysis (EDA)
Train-Test Split
Model 1: Logistic Regression
Model 2: Random Forest
- Feature Importance Plot
Comparison of Models
Conclusion
References

Introduction

Chronic Fatigue Syndrome, also known as Myalgic Encephalomyelitis (ME/CFS), is a complex and often misunderstood illness that causes severe fatigue, cognitive dysfunction, pain, and a range of other debilitating symptoms. Clinically, it is frequently misdiagnosed as Depression due to overlapping symptoms such as fatigue, low energy, and mood disturbances.

The dataset used in this project was obtained from Kaggle and contains clinical and self-reported symptom data for 1,000 individuals, each labeled as ME/CFS, Depression, or Both. Each observation includes numeric features, such as symptom severity scores, a PHQ-9 depression score, and hours of sleep, along with categorical features such as work status and exercise frequency.

The purpose of this analysis is to build a predictive model that can distinguish between ME/CFS and Depression based on these symptom features. This can aid both clinical decision-making and research into the physiological differences between the two conditions.

🔍 Goal: Build and compare two machine learning models — Logistic Regression and Random Forest — to predict diagnosis.

We will:
- Explore and clean the dataset
- Visualize relationships between symptoms and diagnosis
- Train and test classification models
- Evaluate performance using metrics like accuracy, precision, recall, and confusion matrices

Load Required Packages

We begin by loading the R packages used for data manipulation, visualization, and modeling. tidyverse is used for data wrangling, caret for data splitting and model evaluation, randomForest and e1071 for classification, GGally for correlation plots, ggfortify for PCA visualization, and knitr with kableExtra for formatted tables.

library(tidyverse)
library(caret)
library(randomForest)
library(e1071)
library(GGally)
library(ggfortify)
library(knitr)
library(kableExtra)

Load and Inspect the Dataset

We load the dataset into R and view its structure using glimpse() and summary(). This helps identify the variable types and highlights any missing data that must be addressed.

data <- read_csv("me_cfs_vs_depression_dataset.csv")
glimpse(data)

## Rows: 1,000
## Columns: 16
## $ age                          <dbl> 56, 69, 46, 32, 60, 25, 38, 56, 36, 40, 2…
## $ gender                       <chr> "Male", "Male", "Female", "Female", "Fema…
## $ sleep_quality_index          <dbl> 8.7, 1.3, 4.0, 9.4, 7.6, 3.5, 3.3, 1.0, 7…
## $ brain_fog_level              <dbl> 3.9, 9.9, 5.4, 2.1, 7.5, 3.9, 10.0, 9.8, …
## $ physical_pain_score          <dbl> 9.2, 4.2, 4.8, 2.9, 6.4, 6.4, 4.3, 4.0, 9…
## $ stress_level                 <dbl> 8.1, 9.9, NA, 3.8, 8.5, 6.5, 6.2, 3.3, 7.…
## $ depression_phq9_score        <dbl> 10, 20, 24, 10, 17, 9, 15, 10, 9, 9, 12, …
## $ fatigue_severity_scale_score <dbl> 6.5, 7.0, 1.6, 6.8, 7.0, 7.5, 7.0, 4.5, 7…
## $ pem_duration_hours           <dbl> 9, 41, 13, 11, 46, 41, 29, 31, 31, 41, 31…
## $ hours_of_sleep_per_night     <dbl> 7.7, 8.4, 6.9, 7.5, 3.1, 4.1, 9.9, 3.5, 7…
## $ pem_present                  <dbl> 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1,…
## $ work_status                  <chr> "Working", "Working", "Partially working"…
## $ social_activity_level        <chr> "Low", "Low", NA, "High", "Low", "Medium"…
## $ exercise_frequency           <chr> "Daily", "Often", "Rarely", "Never", "Rar…
## $ meditation_or_mindfulness    <chr> "Yes", "Yes", "Yes", "Yes", "No", "No", "…
## $ diagnosis                    <chr> "Depression", "Both", "Depression", "Depr…

summary(data)

##       age           gender          sleep_quality_index brain_fog_level 
##  Min.   :18.00   Length:1000        Min.   : 1.000      Min.   : 1.000  
##  1st Qu.:31.75   Class :character   1st Qu.: 3.100      1st Qu.: 3.300  
##  Median :45.00   Mode  :character   Median : 5.600      Median : 5.800  
##  Mean   :44.38                      Mean   : 5.469      Mean   : 5.612  
##  3rd Qu.:57.00                      3rd Qu.: 7.700      3rd Qu.: 7.900  
##  Max.   :70.00                      Max.   :10.000      Max.   :10.000  
##                                     NA's   :47          NA's   :48      
##  physical_pain_score  stress_level    depression_phq9_score
##  Min.   : 1.000      Min.   : 1.000   Min.   : 0.00        
##  1st Qu.: 3.325      1st Qu.: 3.300   1st Qu.: 9.00        
##  Median : 5.600      Median : 5.400   Median :10.00        
##  Mean   : 5.522      Mean   : 5.459   Mean   :12.27        
##  3rd Qu.: 7.800      3rd Qu.: 7.700   3rd Qu.:16.00        
##  Max.   :10.000      Max.   :10.000   Max.   :27.00        
##  NA's   :34          NA's   :48       NA's   :22           
##  fatigue_severity_scale_score pem_duration_hours hours_of_sleep_per_night
##  Min.   : 0.000               Min.   : 0.00      Min.   : 3.000          
##  1st Qu.: 6.300               1st Qu.:11.00      1st Qu.: 4.800          
##  Median : 7.000               Median :23.00      Median : 6.600          
##  Mean   : 6.407               Mean   :23.11      Mean   : 6.571          
##  3rd Qu.: 7.500               3rd Qu.:35.00      3rd Qu.: 8.350          
##  Max.   :10.000               Max.   :47.00      Max.   :10.000          
##  NA's   :21                   NA's   :24         NA's   :21              
##   pem_present    work_status        social_activity_level exercise_frequency
##  Min.   :0.000   Length:1000        Length:1000           Length:1000       
##  1st Qu.:0.000   Class :character   Class :character      Class :character  
##  Median :1.000   Mode  :character   Mode  :character      Mode  :character  
##  Mean   :0.599                                                              
##  3rd Qu.:1.000                                                              
##  Max.   :1.000                                                              
##                                                                             
##  meditation_or_mindfulness  diagnosis        
##  Length:1000               Length:1000       
##  Class :character          Class :character  
##  Mode  :character          Mode  :character  
##                                              
##                                              
##                                              
##

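The dataset includes:
- A diagnosis column with three values: Both, Depression, and ME/CFS
- Ten numeric features: age, symptom scores (sleep quality, brain fog, pain, stress, fatigue), the PHQ-9 depression score, PEM duration, hours of sleep, and a binary pem_present flag
- Five categorical features: gender, work_status, social_activity_level, exercise_frequency, and meditation_or_mindfulness
- Missing values in 12 of the 16 columns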

Data Cleaning

We begin by checking for missing values using colSums(), then remove all rows containing an NA using na.omit(), which leaves 670 of the 1,000 rows. We convert the diagnosis column into a factor so that classification models treat it correctly.

colSums(is.na(data))

##                          age                       gender 
##                            0                            0 
##          sleep_quality_index              brain_fog_level 
##                           47                           48 
##          physical_pain_score                 stress_level 
##                           34                           48 
##        depression_phq9_score fatigue_severity_scale_score 
##                           22                           21 
##           pem_duration_hours     hours_of_sleep_per_night 
##                           24                           21 
##                  pem_present                  work_status 
##                            0                           47 
##        social_activity_level           exercise_frequency 
##                           40                           39 
##    meditation_or_mindfulness                    diagnosis 
##                           11                            0

data_clean <- data %>% 
  na.omit()

data_clean$diagnosis <- as.factor(data_clean$diagnosis)

table(data_clean$diagnosis)

## 
##       Both Depression     ME/CFS 
##        140        270        260

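✔️ After removing rows with NAs, 670 complete observations remain. Depression (270) and ME/CFS (260) are nearly balanced, while the Both class (140) is about half their size, which is worth keeping in mind when reading per-class metrics.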
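Listwise deletion discards 330 of the 1,000 rows here. As a rough sketch of one alternative, the snippet below imputes numeric columns with the median and categorical columns with the most frequent value. It runs on a small made-up data frame; `toy` and `impute_simple()` are hypothetical names, and this is not the approach used in the report.

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
  stress_level = c(8.1, NA, 3.8, 6.5),
  work_status  = c("Working", "Working", NA, "Not working"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: median for numeric columns, most frequent value otherwise
impute_simple <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else {
      counts <- table(df[[col]])
      df[[col]][missing] <- names(counts)[which.max(counts)]
    }
  }
  df
}

toy_imputed <- impute_simple(toy)
nrow(na.omit(toy))   # rows kept by listwise deletion
nrow(toy_imputed)    # rows kept by imputation
```

Imputation keeps every row but adds its own assumptions, so it is worth comparing model results under both strategies.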

Exploratory Data Analysis (EDA)

We now explore the data visually and statistically to identify key patterns and relationships.

Diagnosis Distribution

We create a bar chart to visualize the distribution of the three diagnosis labels in the dataset. This allows us to quickly assess whether the classes are imbalanced, which could affect model performance.

ggplot(data_clean, aes(x = diagnosis, fill = diagnosis)) +
  geom_bar() +
  labs(title = "Diagnosis Distribution", x = "Diagnosis", y = "Count") +
  theme_minimal()

The bar chart visualizes the frequency of each diagnosis category in the dataset. It shows that out of the 670 complete observations:

270 are labeled as Depression, 260 as ME/CFS, and 140 as Both.

Depression and ME/CFS are close to balanced, which reduces the risk of bias toward one of them. The smaller “Both” class is underrepresented and also introduces complexity, as it overlaps both target conditions.

Summary Statistics by Diagnosis

We calculate the mean for each numeric feature grouped by diagnosis to observe how symptom scores differ across the three diagnosis groups.

data_clean %>%
  group_by(diagnosis) %>%
  summarise(across(where(is.numeric), mean, .names = "mean_{.col}")) %>%
  kable("html") %>% kable_styling()

diagnosis	mean_age	mean_sleep_quality_index	mean_brain_fog_level	mean_physical_pain_score	mean_stress_level	mean_depression_phq9_score	mean_fatigue_severity_scale_score	mean_pem_duration_hours	mean_hours_of_sleep_per_night	mean_pem_present
Both	44.62143	5.070714	5.589286	5.525000	5.659286	16.678571	7.377143	23.37857	6.502143	1
Depression	43.78519	5.505926	5.593333	5.387778	5.394074	14.974074	4.863704	22.78519	6.567778	0
ME/CFS	44.15385	5.668846	5.804615	5.688077	5.440000	7.334615	7.506538	23.36538	6.601539	1

This table displays the average values of several numeric symptom features grouped by diagnosis. A few key observations include:

Individuals with Depression have a much lower mean fatigue severity score (4.86) than the ME/CFS (7.51) and Both (7.38) groups, and a much higher mean depression_phq9_score (14.97) than the ME/CFS group (7.33).
pem_present is 1 for every ME/CFS and Both case and 0 for every Depression case, so this single feature separates Depression from the other two groups.
Mean pain, brain fog, stress, and sleep scores differ only slightly across groups.
The “Both” group resembles ME/CFS on fatigue and PEM and has the highest mean PHQ-9 score (16.68), which is consistent with an overlapping diagnosis.

This summary provides early insight into how symptom profiles differ between the conditions and anticipates which features the models are likely to rely on.
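A group mean of exactly 0 or 1 for a binary feature means every case in that group shares the same value, and a cross-tabulation makes this easy to check. The sketch below uses a small made-up data frame (`toy`), not the Kaggle data. The same `table()` call could be applied to `data_clean$pem_present` and `data_clean$diagnosis`.

```r
# Made-up rows for illustration only
toy <- data.frame(
  pem_present = c(1, 1, 0, 0, 1, 1),
  diagnosis   = c("ME/CFS", "Both", "Depression", "Depression", "ME/CFS", "Both")
)

# Rows: feature values; columns: diagnosis labels
xtab <- table(pem_present = toy$pem_present, diagnosis = toy$diagnosis)
print(xtab)

# TRUE when every case in that class shares a single feature value
single_valued <- apply(xtab, 2, function(col) sum(col > 0) == 1)
print(single_valued)
```

A feature that is single-valued within each class and differs between classes acts almost like a copy of the label, which is worth knowing before interpreting model accuracy.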

Feature Correlations

We use a heatmap to examine correlations among numeric features. Strong correlations may indicate redundancy, while weak correlations show independence across symptom domains.

numeric_features <- data_clean %>% select(where(is.numeric))

ggcorr(numeric_features, label = TRUE, label_round = 2) +
  ggtitle("Correlation Heatmap of Numeric Features")

The heatmap reveals correlations between numeric symptom variables. Key observations include:

depression_phq9_score is positively correlated (0.58) with fatigue_severity_scale_score, suggesting a relationship between depression and fatigue severity.
Most features show weak or near-zero correlations with each other, indicating independence across symptom domains, which is beneficial for model learning.

This helps validate that many of the symptom scores are measuring different aspects of patient experience, supporting their inclusion in a predictive model.
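Reading redundancy off a heatmap can be supplemented with a short list of strongly correlated pairs. The following is a minimal sketch on simulated columns; `toy_num`, the column names, and the 0.7 cutoff are assumptions chosen for illustration.

```r
set.seed(1)
base <- rnorm(100)
toy_num <- data.frame(
  score_a = base,
  score_b = base + rnorm(100, sd = 0.3),  # built to correlate strongly with score_a
  score_c = rnorm(100)                    # generated independently of the other two
)

cors <- cor(toy_num, use = "pairwise.complete.obs")

# Look only at the upper triangle so each pair is reported once
high <- which(abs(cors) > 0.7 & upper.tri(cors), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = round(cors[high], 2)
)
print(high_pairs)
```

Applied to `numeric_features`, the same steps would list any pairs above the chosen cutoff.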

Train-Test Split

We split the data into 70% training and 30% testing sets. This allows us to evaluate the model’s performance on unseen data, ensuring generalizability.

set.seed(123)
split <- createDataPartition(data_clean$diagnosis, p = 0.7, list = FALSE)
train <- data_clean[split, ]
test <- data_clean[-split, ]
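For a factor outcome, createDataPartition() samples within each class so that the split preserves class proportions. A base-R sketch of the same idea is shown below on made-up labels whose counts mirror the class table above; `labels` and `stratified_idx` are hypothetical names, and the exact rows chosen will differ from caret's.

```r
set.seed(123)
labels <- factor(rep(c("Both", "Depression", "ME/CFS"), times = c(140, 270, 260)))

# Sample 70% of the row indices within each class
stratified_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

train_labels <- labels[stratified_idx]
test_labels  <- labels[-stratified_idx]

# Class proportions should be nearly identical in both parts
round(prop.table(table(train_labels)), 3)
round(prop.table(table(test_labels)), 3)
```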

Model 1: Logistic Regression

We first apply logistic regression, a statistical model that estimates the probability of class membership using a linear combination of input features.

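The confusion matrix shows how many predictions were correct or incorrect for each class. Logistic regression is interpretable, but it may struggle with non-linear structure. Note that a binomial glm() handles only two outcomes: with a three-level factor such as diagnosis, it models the first level (Both) against the other two combined.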

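# Note: diagnosis has three levels, so family = "binomial" models the first
# level ("Both") against the other two combined.
log_model <- glm(diagnosis ~ ., data = train, family = "binomial")

pred_probs <- predict(log_model, test, type = "response")
# Note: "depression" and "mecfs" are not levels of test$diagnosis, so factor()
# turns every prediction into NA, which is why the matrix below is empty.
pred_log <- ifelse(pred_probs > 0.5, "depression", "mecfs") %>%
  factor(levels = levels(test$diagnosis))

confusionMatrix(pred_log, test$diagnosis)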

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Both Depression ME/CFS
##   Both          0          0      0
##   Depression    0          0      0
##   ME/CFS        0          0      0
## 
## Overall Statistics
##                                   
##                Accuracy : NaN     
##                  95% CI : (NA, NA)
##     No Information Rate : NA      
##     P-Value [Acc > NIR] : NA      
##                                   
##                   Kappa : NaN     
##                                   
##  Mcnemar's Test P-Value : NA      
## 
## Statistics by Class:
## 
##                      Class: Both Class: Depression Class: ME/CFS
## Sensitivity                   NA                NA            NA
## Specificity                   NA                NA            NA
## Pos Pred Value                NA                NA            NA
## Neg Pred Value                NA                NA            NA
## Prevalence                   NaN               NaN           NaN
## Detection Rate               NaN               NaN           NaN
## Detection Prevalence         NaN               NaN           NaN
## Balanced Accuracy             NA                NA            NA

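Logistic Regression produced no usable predictions: the confusion matrix contains only zeroes, and all performance metrics are NA or NaN.

This outcome stems from how the model was specified and scored rather than from the data:

The labels assigned by ifelse() ("depression" and "mecfs") do not match the factor levels of diagnosis ("Both", "Depression", "ME/CFS"), so factor() converts every prediction to NA.
family = "binomial" supports only two outcomes. With a three-level factor, glm() models the first level (Both) against the other two combined, so the fitted probabilities do not distinguish ME/CFS from Depression.

The result therefore says nothing about how well a linear model can separate these diagnoses. A multinomial logistic regression with matching labels would be needed for a fair comparison.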
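For a three-level outcome such as diagnosis, a multinomial logistic regression is one option. The sketch below uses `nnet::multinom()` on simulated data. The data frame `sim`, its columns, and the labelling rule are assumptions made for illustration, and the numbers it produces say nothing about the Kaggle dataset.

```r
library(nnet)  # multinom() fits multinomial log-linear models

set.seed(123)
n <- 300
sim <- data.frame(
  phq9    = round(runif(n, 0, 27)),
  fatigue = round(runif(n, 0, 10), 1)
)

# Made-up labelling rule with noise, purely for demonstration
score <- sim$phq9 / 27 - sim$fatigue / 10 + rnorm(n, sd = 0.15)
sim$diagnosis <- cut(score, breaks = c(-Inf, -0.2, 0.2, Inf),
                     labels = c("ME/CFS", "Both", "Depression"))

train_idx <- sample.int(n, size = 210)  # 70% of the rows for training
fit <- multinom(diagnosis ~ phq9 + fatigue, data = sim[train_idx, ], trace = FALSE)

# type = "class" returns a factor that uses the response's own levels,
# so no manual relabelling is needed before building the table
pred <- predict(fit, newdata = sim[-train_idx, ], type = "class")
conf_tab <- table(Prediction = pred, Reference = sim$diagnosis[-train_idx])
print(conf_tab)
```

Because the predicted factor shares its levels with the reference, the result could also be passed to caret's confusionMatrix() without producing NA predictions.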

Model 2: Random Forest

We now apply a Random Forest classifier, which uses an ensemble of decision trees to capture complex patterns in the data.

Random Forest often outperforms simpler models because of its flexibility, and it also provides a ranking of feature importance.

rf_model <- randomForest(diagnosis ~ ., data = train, importance = TRUE)

rf_pred <- predict(rf_model, test)
confusionMatrix(rf_pred, test$diagnosis)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Both Depression ME/CFS
##   Both         42          0      0
##   Depression    0         81      0
##   ME/CFS        0          0     78
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9818, 1)
##     No Information Rate : 0.403      
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Both Class: Depression Class: ME/CFS
## Sensitivity                1.000             1.000        1.0000
## Specificity                1.000             1.000        1.0000
## Pos Pred Value             1.000             1.000        1.0000
## Neg Pred Value             1.000             1.000        1.0000
## Prevalence                 0.209             0.403        0.3881
## Detection Rate             0.209             0.403        0.3881
## Detection Prevalence       0.209             0.403        0.3881
## Balanced Accuracy          1.000             1.000        1.0000

In contrast, the Random Forest model achieved perfect classification:

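Accuracy: 1.0
Sensitivity, Specificity, Precision (Pos Pred Value), and Balanced Accuracy: all 1.0

This model correctly identified all 201 test cases without a single misclassification. Perfect accuracy on held-out data is unusual for clinical data and warrants scrutiny. Because the test set was not used for training, overfitting alone is an unlikely explanation; it is more likely that a few features nearly determine the label. The summary table supports this: pem_present is 1 for every ME/CFS and Both case and 0 for every Depression case, and mean PHQ-9 scores differ sharply between the ME/CFS and Both groups. Cross-validation and testing on independently collected data would be needed before drawing conclusions about real-world performance.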
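A single train/test split gives one accuracy estimate, while k-fold cross-validation gives several. The sketch below runs 5-fold cross-validation in base R on the built-in `iris` data with an `rpart` decision tree, both swapped in as lightweight stand-ins for the report's dataset and Random Forest. `k`, `folds`, and `fold_acc` are illustrative names, and the folds here are not stratified by class.

```r
library(rpart)  # recommended package shipped with R; stand-in for randomForest

set.seed(123)
k <- 5
# Randomly assign each row to one of k folds of equal size
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

fold_acc <- sapply(seq_len(k), function(i) {
  fit  <- rpart(Species ~ ., data = iris[folds != i, ], method = "class")
  pred <- predict(fit, newdata = iris[folds == i, ], type = "class")
  mean(pred == iris$Species[folds == i])
})

fold_acc        # accuracy on each held-out fold
mean(fold_acc)  # cross-validated accuracy estimate
```

With caret, passing `trainControl(method = "cv", number = 5)` to `train()` offers a similar procedure without the manual loop.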

Feature Importance Plot

We now visualize which features contributed most to the Random Forest model. Features ranked highly here are the ones the model relied on most to separate the classes.

varImpPlot(rf_model, main = "Feature Importance - Random Forest")

The feature importance plot reveals which variables most strongly influenced the Random Forest model. The top contributors include:

depression_phq9_score
pem_present (Post-Exertional Malaise)
fatigue_severity_scale_score
work_status
exercise_frequency

These features are the most informative for this model and could be helpful for clinicians when differentiating ME/CFS from Depression, although importance in a model is not the same as clinical diagnostic value. Notably, several are physical and functional indicators rather than purely psychological measures.

Comparison of Models

We compare logistic regression and random forest based on:
- Accuracy
- Confusion matrices
- Feature interpretability

Random Forest is the only model that produced valid predictions on this dataset, so a direct accuracy comparison is not possible from these results. Logistic regression remains a useful baseline for quick, interpretable insights once it is specified for three classes.
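The comparison metrics can be computed directly from a confusion matrix. The sketch below rebuilds the Random Forest matrix reported earlier and derives accuracy, per-class recall (sensitivity), per-class precision (positive predictive value), and the no-information rate. The variable names are illustrative.

```r
classes <- c("Both", "Depression", "ME/CFS")

# Rows are predictions, columns are reference labels
# (counts copied from the Random Forest output above)
cm <- matrix(c(42,  0,  0,
                0, 81,  0,
                0,  0, 78),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy  <- sum(diag(cm)) / sum(cm)
recall    <- diag(cm) / colSums(cm)      # correct / all true cases of the class
precision <- diag(cm) / rowSums(cm)      # correct / all predictions of the class
nir       <- max(colSums(cm)) / sum(cm)  # share of the largest class

round(c(accuracy = accuracy, nir = nir), 3)
round(rbind(recall, precision), 3)
```

The no-information rate (81 / 201) is the accuracy of always predicting the largest class, which gives a floor against which any model's accuracy can be judged.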

Conclusion

From a clinical standpoint, this study offers insight into how self-reported symptom data can be used to differentiate between ME/CFS and Depression, two conditions that are often confused because of their overlapping symptoms. The feature importance results from the Random Forest model suggest that certain symptom domains (depression severity, post-exertional malaise, fatigue, and functional measures like work status and exercise frequency) play key roles in distinguishing between these diagnoses. This supports the idea that while ME/CFS and Depression share common symptoms like fatigue, they diverge in how these symptoms manifest and affect daily life. A limitation of the dataset, however, is the inclusion of a third “Both” category, which may represent comorbid cases or diagnostic uncertainty and could obscure the boundaries between classes. Despite that, the Random Forest model still performed well, suggesting strong underlying patterns in the symptom profiles.

From a data science standpoint, Logistic Regression offers interpretability in principle, but as specified here it produced no valid predictions, so its performance on this data remains untested. In contrast, Random Forest achieved perfect classification on the test set and provides a ranking of diagnostic features. While the results are promising, further validation, such as cross-validation and testing on independent data, would be needed to confirm that they generalize.

Key Insights:

Several physical and mental symptom scores contribute significantly to classification.
Machine learning can improve diagnostic support tools in complex medical conditions.

References

Kaggle Dataset: https://www.kaggle.com/datasets/storytellerman/mecfs-vs-depression-classification-dataset

ME/CFS vs Depression Classification

Lyly O’Rourke

2025-06-27