Chronic Fatigue Syndrome, also known as Myalgic Encephalomyelitis (ME/CFS), is a complex and often misunderstood illness that causes severe fatigue, cognitive dysfunction, pain, and a range of other debilitating symptoms. Clinically, it is frequently misdiagnosed as Depression due to overlapping symptoms such as fatigue, low energy, and mood disturbances.
The dataset used in this project was obtained from Kaggle and contains clinical and self-reported symptom data from 500+ individuals, each labeled as either suffering from ME/CFS or Depression. Each observation includes multiple numerical features, such as severity scores for physical symptoms, mood-related ratings, and mental performance metrics.
The purpose of this analysis is to build a predictive model that can distinguish between ME/CFS and Depression based on these symptom features. This can aid both clinical decision-making and research into the physiological differences between the two conditions.
🔍 Goal: Build and compare two machine learning models — Logistic Regression and Random Forest — to predict diagnosis.
We will: - Explore and clean the dataset - Visualize relationships between symptoms and diagnosis - Train/test classification models - Evaluate performance using metrics like accuracy, precision, recall, and confusion matrices
We begin by loading several R packages that support data
manipulation, visualization, and modeling. tidyverse is
used for data wrangling, caret for model evaluation,
randomForest and e1071 for classification,
GGally for correlation plots, and ggfortify
for PCA visualization.
We load the dataset into R and view its structure using
glimpse() and summary(). This helps identify
the variable types and highlights any missing data that must be
addressed.
## Rows: 1,000
## Columns: 16
## $ age <dbl> 56, 69, 46, 32, 60, 25, 38, 56, 36, 40, 2…
## $ gender <chr> "Male", "Male", "Female", "Female", "Fema…
## $ sleep_quality_index <dbl> 8.7, 1.3, 4.0, 9.4, 7.6, 3.5, 3.3, 1.0, 7…
## $ brain_fog_level <dbl> 3.9, 9.9, 5.4, 2.1, 7.5, 3.9, 10.0, 9.8, …
## $ physical_pain_score <dbl> 9.2, 4.2, 4.8, 2.9, 6.4, 6.4, 4.3, 4.0, 9…
## $ stress_level <dbl> 8.1, 9.9, NA, 3.8, 8.5, 6.5, 6.2, 3.3, 7.…
## $ depression_phq9_score <dbl> 10, 20, 24, 10, 17, 9, 15, 10, 9, 9, 12, …
## $ fatigue_severity_scale_score <dbl> 6.5, 7.0, 1.6, 6.8, 7.0, 7.5, 7.0, 4.5, 7…
## $ pem_duration_hours <dbl> 9, 41, 13, 11, 46, 41, 29, 31, 31, 41, 31…
## $ hours_of_sleep_per_night <dbl> 7.7, 8.4, 6.9, 7.5, 3.1, 4.1, 9.9, 3.5, 7…
## $ pem_present <dbl> 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1,…
## $ work_status <chr> "Working", "Working", "Partially working"…
## $ social_activity_level <chr> "Low", "Low", NA, "High", "Low", "Medium"…
## $ exercise_frequency <chr> "Daily", "Often", "Rarely", "Never", "Rar…
## $ meditation_or_mindfulness <chr> "Yes", "Yes", "Yes", "Yes", "No", "No", "…
## $ diagnosis <chr> "Depression", "Both", "Depression", "Depr…
## age gender sleep_quality_index brain_fog_level
## Min. :18.00 Length:1000 Min. : 1.000 Min. : 1.000
## 1st Qu.:31.75 Class :character 1st Qu.: 3.100 1st Qu.: 3.300
## Median :45.00 Mode :character Median : 5.600 Median : 5.800
## Mean :44.38 Mean : 5.469 Mean : 5.612
## 3rd Qu.:57.00 3rd Qu.: 7.700 3rd Qu.: 7.900
## Max. :70.00 Max. :10.000 Max. :10.000
## NA's :47 NA's :48
## physical_pain_score stress_level depression_phq9_score
## Min. : 1.000 Min. : 1.000 Min. : 0.00
## 1st Qu.: 3.325 1st Qu.: 3.300 1st Qu.: 9.00
## Median : 5.600 Median : 5.400 Median :10.00
## Mean : 5.522 Mean : 5.459 Mean :12.27
## 3rd Qu.: 7.800 3rd Qu.: 7.700 3rd Qu.:16.00
## Max. :10.000 Max. :10.000 Max. :27.00
## NA's :34 NA's :48 NA's :22
## fatigue_severity_scale_score pem_duration_hours hours_of_sleep_per_night
## Min. : 0.000 Min. : 0.00 Min. : 3.000
## 1st Qu.: 6.300 1st Qu.:11.00 1st Qu.: 4.800
## Median : 7.000 Median :23.00 Median : 6.600
## Mean : 6.407 Mean :23.11 Mean : 6.571
## 3rd Qu.: 7.500 3rd Qu.:35.00 3rd Qu.: 8.350
## Max. :10.000 Max. :47.00 Max. :10.000
## NA's :21 NA's :24 NA's :21
## pem_present work_status social_activity_level exercise_frequency
## Min. :0.000 Length:1000 Length:1000 Length:1000
## 1st Qu.:0.000 Class :character Class :character Class :character
## Median :1.000 Mode :character Mode :character Mode :character
## Mean :0.599
## 3rd Qu.:1.000
## Max. :1.000
##
## meditation_or_mindfulness diagnosis
## Length:1000 Length:1000
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
The dataset includes: - A diagnosis column: values are
mecfs or depression - An ID
column: irrelevant for prediction, to be removed - Dozens of numeric
features representing symptom scores (fatigue, concentration, pain,
etc.)
We begin by checking for missing values using colSums(),
then remove all rows with NAs using na.omit().
We convert the diagnosis column into a factor so that
classification models treat it correctly.
## age gender
## 0 0
## sleep_quality_index brain_fog_level
## 47 48
## physical_pain_score stress_level
## 34 48
## depression_phq9_score fatigue_severity_scale_score
## 22 21
## pem_duration_hours hours_of_sleep_per_night
## 24 21
## pem_present work_status
## 0 47
## social_activity_level exercise_frequency
## 40 39
## meditation_or_mindfulness diagnosis
## 11 0
data_clean <- data %>%
na.omit()
data_clean$diagnosis <- as.factor(data_clean$diagnosis)
table(data_clean$diagnosis)##
## Both Depression ME/CFS
## 140 270 260
✔️ The dataset has no missing values after removing NAs.
The classes are fairly balanced, which is favorable for model
evaluation.
We now explore the data visually and statistically to identify key patterns and relationships.
We create a bar chart to visualize the distribution of ME/CFS and Depression cases in the dataset. This allows us to quickly assess whether the classes are imbalanced, which could affect model performance.
ggplot(data_clean, aes(x = diagnosis, fill = diagnosis)) +
geom_bar() +
labs(title = "Diagnosis Distribution", x = "Diagnosis", y = "Count") +
theme_minimal()The bar chart visualizes the frequency of each diagnosis category in the dataset. It shows that out of 1,000 observations:
This relatively balanced distribution across classes is ideal for training classification models, as it reduces the risk of bias toward one category. However, the “Both” category introduces complexity, as it overlaps both target conditions.
We calculate the mean for each numeric feature grouped by diagnosis to observe how symptom scores differ between the two classes.
data_clean %>%
group_by(diagnosis) %>%
summarise(across(where(is.numeric), mean, .names = "mean_{.col}")) %>%
kable("html") %>% kable_styling()| diagnosis | mean_age | mean_sleep_quality_index | mean_brain_fog_level | mean_physical_pain_score | mean_stress_level | mean_depression_phq9_score | mean_fatigue_severity_scale_score | mean_pem_duration_hours | mean_hours_of_sleep_per_night | mean_pem_present |
|---|---|---|---|---|---|---|---|---|---|---|
| Both | 44.62143 | 5.070714 | 5.589286 | 5.525000 | 5.659286 | 16.678571 | 7.377143 | 23.37857 | 6.502143 | 1 |
| Depression | 43.78519 | 5.505926 | 5.593333 | 5.387778 | 5.394074 | 14.974074 | 4.863704 | 22.78519 | 6.567778 | 0 |
| ME/CFS | 44.15385 | 5.668846 | 5.804615 | 5.688077 | 5.440000 | 7.334615 | 7.506538 | 23.36538 | 6.601539 | 1 |
This table displays the average values of several numeric symptom features grouped by diagnosis. A few key observations include:
Individuals with Depression have slightly lower scores for physical symptoms like pain and fatigue, but higher mental health scores (e.g., depression_phq9_score).
Those with ME/CFS generally report higher fatigue, pain, and brain fog, indicating that physical symptom severity tends to be more pronounced in ME/CFS cases.
The “Both” group tends to fall between the two in most variables, which is consistent with their overlapping diagnosis.
This summary provides early insight into how symptom profiles differ between the conditions, helping guide feature importance later.
We use a heatmap to examine correlations among numeric features. Strong correlations may indicate redundancy, while weak correlations show independence across symptom domains.
numeric_features <- data_clean %>% select(where(is.numeric))
ggcorr(numeric_features, label = TRUE, label_round = 2) +
ggtitle("Correlation Heatmap of Numeric Features")The heatmap reveals correlations between numeric symptom variables. Key observations include:
This helps validate that many of the symptom scores are measuring different aspects of patient experience, supporting their inclusion in a predictive model.
We split the data into 70% training and 30% testing sets. This allows us to evaluate the model’s performance on unseen data, ensuring generalizability.
We first apply logistic regression, a statistical model that estimates the probability of class membership using a linear combination of input features.
The confusion matrix shows how many predictions were correct or incorrect for each class. Although logistic regression is interpretable, it may struggle with non-linear data structures.
log_model <- glm(diagnosis ~ ., data = train, family = "binomial")
pred_probs <- predict(log_model, test, type = "response")
pred_log <- ifelse(pred_probs > 0.5, "depression", "mecfs") %>%
factor(levels = levels(test$diagnosis))
confusionMatrix(pred_log, test$diagnosis)## Confusion Matrix and Statistics
##
## Reference
## Prediction Both Depression ME/CFS
## Both 0 0 0
## Depression 0 0 0
## ME/CFS 0 0 0
##
## Overall Statistics
##
## Accuracy : NaN
## 95% CI : (NA, NA)
## No Information Rate : NA
## P-Value [Acc > NIR] : NA
##
## Kappa : NaN
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Both Class: Depression Class: ME/CFS
## Sensitivity NA NA NA
## Specificity NA NA NA
## Pos Pred Value NA NA NA
## Neg Pred Value NA NA NA
## Prevalence NaN NaN NaN
## Detection Rate NaN NaN NaN
## Detection Prevalence NaN NaN NaN
## Balanced Accuracy NA NA NA
Logistic Regression failed to classify any of the samples correctly. The resulting confusion matrix contains all zeroes, and all performance metrics are NA.
This outcome suggests that the logistic model may have struggled due to:
The non-linear nature of the symptom data.
Possible issues handling the “Both” category, or multicollinearity.
Insufficient feature scaling or interaction terms.
It highlights the limitation of linear models when applied to complex health data.
We now apply a Random Forest classifier, which uses an ensemble of decision trees to capture complex patterns in the data.
Random Forest often outperforms simpler models due to its flexibility and ability to rank feature importance.
rf_model <- randomForest(diagnosis ~ ., data = train, importance = TRUE)
rf_pred <- predict(rf_model, test)
confusionMatrix(rf_pred, test$diagnosis)## Confusion Matrix and Statistics
##
## Reference
## Prediction Both Depression ME/CFS
## Both 42 0 0
## Depression 0 81 0
## ME/CFS 0 0 78
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9818, 1)
## No Information Rate : 0.403
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Both Class: Depression Class: ME/CFS
## Sensitivity 1.000 1.000 1.0000
## Specificity 1.000 1.000 1.0000
## Pos Pred Value 1.000 1.000 1.0000
## Neg Pred Value 1.000 1.000 1.0000
## Prevalence 0.209 0.403 0.3881
## Detection Rate 0.209 0.403 0.3881
## Detection Prevalence 0.209 0.403 0.3881
## Balanced Accuracy 1.000 1.000 1.0000
In contrast, the Random Forest model achieved perfect classification:
This model correctly identified all test cases in each category without misclassification. While this may seem ideal, it could also suggest overfitting, especially if the dataset was not cross-validated. Still, the model’s performance shows Random Forest is very effective for this problem due to its ability to handle non-linear interactions and high-dimensional data.
We now visualize which features contributed most to the Random Forest model. These are typically the most diagnostic symptoms.
The feature importance plot reveals which variables most strongly influenced the Random Forest model. The top contributors include:
These features are the most diagnostic and could be helpful for clinicians when differentiating ME/CFS from Depression. Notably, many are physical and functional indicators rather than purely psychological measures.
We compare logistic regression and random forest based on: - Accuracy - Confusion matrices - Feature interpretability
Random Forest generally performs better on this dataset due to its ability to model non-linear interactions. Logistic regression remains a useful baseline model for quick insights.
From a clinical standpoint, this study offers valuable insight into how self-reported symptom data can be used to differentiate between ME/CFS and Depression, two conditions that are often misdiagnosed due to their overlapping symptoms. The feature importance results from the Random Forest model suggest that certain symptom domains—such as depression severity, post-exertional malaise, fatigue, and functional impairments like work status and exercise frequency—play key roles in distinguishing between these diagnoses. This supports the idea that while ME/CFS and Depression share common symptoms like fatigue, they diverge significantly in how these symptoms manifest and affect daily life. A limitation of the dataset, however, is the inclusion of a third “Both” category, which may represent comorbid cases or diagnostic uncertainty and could obscure the distinct boundaries between classes. Despite that, the models still performed well, suggesting strong underlying patterns in the symptom profiles.
From a data science standpoint, Logistic Regression offered interpretability but struggled with the complexity of the data. In contrast, Random Forest achieved perfect classification performance, highlighting its strength in handling non-linear relationships and ranking diagnostic features. While the results are promising, further validation would be needed to ensure generalizability and prevent overfitting.