Parkinson’s disease (PD) is a progressive neurodegenerative disorder that affects more than ten million people worldwide and is the second most common neurodegenerative disease after Alzheimer’s Disease. It is characterized clinically by the cardinal motor features of resting tremor, rigidity, and bradykinesia, but it also carries a wide range of non-motor manifestations including cognitive decline, REM sleep behavior disorder, hyposmia, and autonomic dysfunction. Pathologically, PD is defined by the loss of dopaminergic neurons in the substantia nigra pars compacta and the accumulation of misfolded alpha-synuclein into Lewy bodies. Despite decades of research, current diagnosis still relies primarily on clinical observation of motor symptoms supported by neuroimaging such as DaTscan. By the time these motor signs are evident, a substantial proportion of nigrostriatal dopaminergic neurons have already been lost, which limits the window in which disease-modifying interventions could plausibly be effective.
For this reason, identifying objective, fluid-based biomarkers of PD has become a central goal of translational neuroscience. Cerebrospinal fluid (CSF) is particularly attractive in this context because it is in direct contact with the central nervous system and reflects ongoing pathological processes in the brain more faithfully than blood. Targeted CSF markers such as alpha-synuclein, total tau, and amyloid-beta 42 have been studied for years, but no single analyte has proven sufficient on its own. Recent advances in high-throughput proteomics — in particular SomaScan aptamer-based assays — make it possible to measure thousands of proteins simultaneously from a single CSF sample, raising the possibility of building multi-protein signatures that classify disease status with higher accuracy than any single marker.
This project investigates whether such a signature can be learned from CSF proteomics data alone. Specifically, the question of interest is: Can protein expression profiles from CSF samples reliably distinguish Parkinson’s disease patients from healthy controls, and which proteins are most predictive of disease status? Based on prior literature reporting altered abundance of proteins involved in synaptic function, neuroinflammation, and lysosomal degradation in PD CSF, I hypothesize that a multivariate model trained on hundreds of CSF proteins will achieve classification performance meaningfully above chance, and that the proteins selected as most predictive will be enriched for biologically plausible PD-related pathways.
The data used in this project comes from the Parkinson’s Progression Markers Initiative (PPMI), a landmark longitudinal observational study sponsored by the Michael J. Fox Foundation. PPMI was launched in 2010 with the explicit goal of identifying biomarkers of PD onset and progression. It enrolls newly diagnosed PD patients, healthy controls, and several at-risk cohorts (e.g. prodromal participants, genetic carriers, and patients with REM sleep behavior disorder), and follows them with serial clinical examinations, imaging, and biospecimen collection. Because participants were recruited at multiple international sites using standardized protocols, the cohort is one of the largest and most carefully phenotyped PD resources currently available.
The CSF proteomics data used here come from PPMI Project 151, which
applied the SomaScan v4.1 aptamer-based assay to baseline CSF samples.
Each measurement represents the relative abundance of a single protein,
indexed by a SomaScan sequence identifier (TESTNAME) and
reported on a log2 scale (TESTVALUE). Roughly 4,700 unique
proteins were quantified per sample. Plate identifiers
(PLATEID) were retained as a technical batch variable. The
proteomics data were merged with PPMI clinical tables containing
diagnosis (COHORT), biological sex, age at visit, and
genetic consensus data including carrier status for the major
PD-associated mutations (LRRK2, GBA, SNCA, PRKN) and the APOE-e4 allele.
The analysis is restricted to PD patients and healthy controls so that
the classification task is well-defined; prodromal and SWEDD (“scan
without evidence of dopaminergic deficit”) participants are
excluded.
Clinical Data:
clinical_dir <- "/Users/ramazanegesolak/PPMI_STRING"
age_data <- read_csv(file.path(clinical_dir, "Age_at_visit_12Apr2026.csv"), show_col_types = FALSE)
clinical_dx <- read_csv(file.path(clinical_dir, "Clinical_Diagnosis_12Apr2026.csv"), show_col_types = FALSE)
demographics <- read_csv(file.path(clinical_dir, "Demographics_12Apr2026.csv"), show_col_types = FALSE)
genetics <- read_csv(file.path(clinical_dir, "iu_genetic_consensus_20251025_12Apr2026.csv"), show_col_types = FALSE)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
participant_status <- read_csv(file.path(clinical_dir, "Participant_Status_12Apr2026.csv"), show_col_types = FALSE)
cat("age_data: ", nrow(age_data), "rows\n")
## age_data: 46227 rows
cat("clinical_dx: ", nrow(clinical_dx), "rows\n")
## clinical_dx: 15440 rows
cat("demographics: ", nrow(demographics), "rows\n")
## demographics: 8351 rows
cat("genetics: ", nrow(genetics), "rows\n")
## genetics: 6265 rows
cat("participant_status:", nrow(participant_status), "rows\n")
## participant_status: 8417 rows
COHORT), recoded as a binary indicator of PD patient
versus healthy control.TESTVALUE), indexed by SomaScan sequence identifier
(TESTNAME). After reshaping to a wide patient-by-protein
matrix this yields several thousand candidate predictors.PLATEID, a
technical batch variable since plate-to-plate variability across
SomaScan runs can introduce systematic differences in protein
measurements unrelated to biology.library(ggdag)
library(ggplot2)
dag <- dagify(
Outcome ~ Predictor + SEX + AGE + Genetics,
Predictor ~ SEX + AGE + Genetics + Batch,
coords = list(
x = c(SEX = 1, AGE = 1, Genetics = 1,
Batch = 1, Predictor = 2.8, Outcome = 4.5),
y = c(SEX = 1.5, AGE = 0.5, Genetics = -0.5,
Batch = -1.5, Predictor = 0, Outcome = 0)
)
)
ggdag(dag, text = FALSE) +
geom_dag_point(color = "#4472C4", size = 22) +
geom_dag_text(color = "white", size = 3.2, fontface = "bold") +
theme_dag() +
labs(
title = "DAG: CSF Proteomics and Parkinson's Disease Prediction",
caption = paste(
"Outcome = PD Diagnosis (COHORT)",
"Predictor = CSF Protein Expression (TESTVALUE)",
"SEX = Biological sex — affects PD prevalence & protein levels",
"AGE = Age at visit — risk factor for PD & protein expression",
"Genetics = Mutation carrier status (LRRK2, GBA, SNCA, PRKN, APOE-e4)",
"Batch = Assay plate (PLATEID) — technical measurement confounder",
sep = "\n"
)
) +
theme(
plot.title = element_text(face = "bold", size = 13, hjust = 0.5),
plot.caption = element_text(size = 8.5, color = "gray30", hjust = 0,
margin = margin(t = 12), family = "mono")
)
The DAG above summarizes the assumed causal structure. PD status is the outcome; CSF protein expression is the primary predictor. Sex, age, and genetic carrier status are upstream variables that may influence both protein expression and disease status, which makes them confounders that have to be acknowledged in the analysis. Plate identifier (Batch) is a technical confounder. Batch does not affect disease status but it can shift protein measurements in ways that mimic biological signal. The DAG does not have an arrow from PD status back to the predictor, since we assume/treat the proteomic measurement as a snapshot taken at enrollment rather than a consequence of long-standing disease.
Because the outcome is binary, three classification approaches are built and compared: ordinary logistic regression, LASSO-penalized logistic regression, and random forest. These were chosen to span the bias–variance spectrum and to combine interpretable, sparse linear modeling (LASSO) with a flexible non-linear, ensemble approach (random forest). Standard logistic regression serves as a baseline. Comparing all three lets us evaluate predictive accuracy and the interpretability of the selected biomarkers side by side.
To control for confounding, age at visit and the five genetic carrier indicators (LRRK2, GBA, SNCA, PRKN, APOE-e4) are included as covariates alongside the protein expression features. Sex is examined in exploratory analysis and was originally considered as a covariate; in the final preprocessing recipe it is removed before modeling so that the protein signal is evaluated on its own merits rather than being inflated by demographic separation. Plate identifier is examined as a technical confounder, and protein expression values are normalized prior to modeling to further reduce technical variation. Class imbalance between PD and Control is addressed by random downsampling of the majority class inside the training pipeline so that the model is not allowed to “win” by always predicting PD. The analysis is also designed to avoid information leakage: variance filtering, t-test feature selection, and class balancing all happen inside the training fold or on the training set only. Model performance is evaluated using ROC AUC, accuracy, sensitivity, specificity, confusion matrices, and ROC curves on a held-out test set, and the most predictive proteins are inspected through LASSO coefficient magnitude and random forest permutation importance.
The exploratory analysis below has three aims: to characterize the cohort that the model will be trained on, to inspect the distribution of the predictor (CSF protein expression) across groups, and to look for imbalance in the candidate confounders. The proteomics CSV files for Project 151 are loaded and stacked into a single long-format data frame, retaining only the columns relevant to this analysis (patient ID, sex, cohort, protein name, expression value, plate ID).
wanted_cols <- c(
"PATNO", "SEX", "COHORT", "TESTNAME",
"TESTVALUE", "PLATEID"
)
raw_data <- file_list %>%
map_dfr(~read_csv(.x, show_col_types = FALSE)) %>%
select(any_of(wanted_cols))
glimpse(raw_data)
## Rows: 5,541,030
## Columns: 6
## $ PATNO <dbl> 53595, 53595, 53595, 53595, 53595, 53595, 53595, 53595, 5359…
## $ SEX <chr> "Female", "Female", "Female", "Female", "Female", "Female", …
## $ COHORT <chr> "PD", "PD", "PD", "PD", "PD", "PD", "PD", "PD", "PD", "PD", …
## $ TESTNAME <chr> "5632-6_3", "5631-83_3", "5630-48_3", "5629-58_3", "5628-21_…
## $ TESTVALUE <dbl> 10.182013, 6.346285, 12.943830, 5.891402, 12.836350, 7.03480…
## $ PLATEID <chr> "P0022899", "P0022899", "P0022899", "P0022899", "P0022899", …
head(raw_data, 20)
## # A tibble: 20 × 6
## PATNO SEX COHORT TESTNAME TESTVALUE PLATEID
## <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 53595 Female PD 5632-6_3 10.2 P0022899
## 2 53595 Female PD 5631-83_3 6.35 P0022899
## 3 53595 Female PD 5630-48_3 12.9 P0022899
## 4 53595 Female PD 5629-58_3 5.89 P0022899
## 5 53595 Female PD 5628-21_3 12.8 P0022899
## 6 53595 Female PD 5627-53_3 7.03 P0022899
## 7 53595 Female PD 5626-20_3 5.89 P0022899
## 8 53595 Female PD 5624-66_3 7.66 P0022899
## 9 53595 Female PD 5623-11_3 6.13 P0022899
## 10 53595 Female PD 5621-64_3 5.95 P0022899
## 11 53595 Female PD 5620-13_3 10.1 P0022899
## 12 53595 Female PD 5618-50_3 13.9 P0022899
## 13 53595 Female PD 5617-41_3 8.84 P0022899
## 14 53595 Female PD 5615-62_3 6.25 P0022899
## 15 53595 Female PD 5614-44_3 6.73 P0022899
## 16 53595 Female PD 5613-75_3 5.87 P0022899
## 17 53595 Female PD 5612-16_3 6.20 P0022899
## 18 53595 Female PD 5611-56_3 6.02 P0022899
## 19 53595 Female PD 5610-32_3 6.38 P0022899
## 20 53595 Female PD 5609-92_3 14.1 P0022899
raw_data %>% count(COHORT)
## # A tibble: 3 × 2
## COHORT n
## <chr> <int>
## 1 Control 890010
## 2 PD 2952345
## 3 Prodromal 1698675
raw_data %>% summarise(
n_patients = n_distinct(PATNO),
n_proteins = n_distinct(TESTNAME),
n_rows = n()
)
## # A tibble: 1 × 3
## n_patients n_proteins n_rows
## <int> <int> <int>
## 1 1158 4785 5541030
raw_data %>%
count(PATNO, TESTNAME) %>%
filter(n > 1) %>%
nrow()
## [1] 0
SomaScan sequence identifiers (e.g. seq.12345.6) are
opaque on their own, so the official PPMI annotation file is loaded and
a lookup table is built that maps each SOMA sequence ID to its target
gene symbol. This lookup is used later to relabel the top proteins in
the figures, so that the reader sees recognizable biological names
(e.g. APOE, GFAP) instead of internal aptamer
codes.
annotations <- read_csv(
file.path(data_dir, "PPMI_Project_151_pqtl_Analysis_Annotations_20210210.csv"),
show_col_types = FALSE
)
## New names:
## • `` -> `...3`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`
## • `` -> `...20`
## • `` -> `...21`
## • `` -> `...22`
## • `` -> `...23`
## • `` -> `...24`
## • `` -> `...25`
## • `` -> `...26`
## • `` -> `...27`
## • `` -> `...28`
## • `` -> `...29`
## • `` -> `...30`
## • `` -> `...31`
## • `` -> `...32`
## • `` -> `...33`
## • `` -> `...34`
## • `` -> `...35`
## • `` -> `...36`
## • `` -> `...37`
## • `` -> `...38`
## • `` -> `...39`
## • `` -> `...40`
## • `` -> `...41`
## • `` -> `...42`
## • `` -> `...43`
## • `` -> `...44`
## • `` -> `...45`
## • `` -> `...46`
## • `` -> `...47`
## • `` -> `...48`
## • `` -> `...49`
## • `` -> `...50`
## • `` -> `...51`
## • `` -> `...52`
## • `` -> `...53`
## • `` -> `...54`
## • `` -> `...55`
## • `` -> `...56`
## • `` -> `...57`
## • `` -> `...58`
## • `` -> `...59`
## • `` -> `...60`
## • `` -> `...61`
protein_lookup <- annotations %>%
select(SOMA_SEQ_ID, TARGET_GENE_SYMBOL) %>%
distinct() %>%
filter(!is.na(TARGET_GENE_SYMBOL), TARGET_GENE_SYMBOL != "")
head(protein_lookup)
## # A tibble: 6 × 2
## SOMA_SEQ_ID TARGET_GENE_SYMBOL
## <chr> <chr>
## 1 10000-28_3 CRYBB2
## 2 10001-7_3 RAF1
## 3 10003-15_3 ZNF41
## 4 10006-25_3 ELK1
## 5 10008-43_3 GUCA1A
## 6 10009-2_3 IRF1
raw_data %>%
distinct(PATNO, COHORT) %>%
count(COHORT) %>%
ggplot(aes(x = reorder(COHORT, -n), y = n, fill = COHORT)) +
geom_col(width = 0.5, show.legend = FALSE) +
geom_text(aes(label = n), vjust = -0.5, fontface = "bold", size = 5) +
scale_fill_manual(values = c("Control" = "#4472C4", "PD" = "#C00000")) +
labs(title = "Number of Patients by Cohort", x = "", y = "Count") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))
The bar plot above shows the number of unique patients in each cohort that have CSF proteomic data available. The PD group is substantially larger than the Control group. Controls are recruited to roughly half the number of patients rather than to a 1:1 ratio. This imbalance can create several biases in our analysis. A naive classifier that always predicts “PD” would already achieve apparently high accuracy. For this reason the modeling pipeline downsamples the PD class to match the control class inside training and reports sensitivity and specificity in addition to accuracy.
raw_data %>%
filter(COHORT %in% c("PD", "Control")) %>%
group_by(TESTNAME) %>%
summarise(variance = var(TESTVALUE, na.rm = TRUE), .groups = "drop") %>%
arrange(desc(variance)) %>%
slice_head(n = 20) %>%
left_join(protein_lookup, by = c("TESTNAME" = "SOMA_SEQ_ID")) %>%
mutate(protein_label = ifelse(is.na(TARGET_GENE_SYMBOL), TESTNAME, TARGET_GENE_SYMBOL)) %>%
ggplot(aes(x = reorder(protein_label, variance), y = variance)) +
geom_col(fill = "#4472C4", width = 0.7) +
coord_flip() +
labs(title = "Top 20 Most Variable Proteins",
x = "", y = "Variance") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.text.y = element_text(size = 9))
According to this EDA, the single most variable protein is PILRA, a
myeloid immune receptor most familiar from Alzheimer’s GWAS, and a
cluster of microglial and immune-related proteins (SIGLEC9, SIRPA,
FCGR2A, C3, CHIT1) also appears in the top 20. These are all consistent
with the well-documented role of neuroinflammation in neurodegeneration.
Beyond that, however, the list is dominated by proteins whose presence
in CSF is almost certainly an artifact of low-level blood contamination
during lumbar puncture. These artifact proteinsa are the hemoglobin
chains HBB and HBA1, haptoglobin (HP), the plasma protein
alpha-2-HS-glycoprotein (AHSG), prothrombin (F2), and the ABO
blood-group glycosyltransferase. The variance in these proteins is high
because CSF samples differ in how much blood was admixed at collection,
not because of PD biology. A second cluster (HIST2H2BE, SETMAR,
C1orf226, LRPPRC, GAPDH) is more loosely interpretable and likely
reflects a mixture of cellular turnover, mitochondrial content, and
assay-level variability across samples.
According to this, a high variance is not the same as a disease-specific signal. A protein could vary widely between individuals because of age, comorbidities, or assay noise rather than PD biology. For that reason, variance ranking is used purely as a screening tool to retain the top 500 most variable proteins before disease-aware feature selection (the BH-adjusted t-test) is applied within the training set.
glimpse(age_data)
## Rows: 46,227
## Columns: 3
## $ PATNO <dbl> 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 300…
## $ EVENT_ID <chr> "BL", "R17", "R18", "R19", "R20", "SC", "V01", "V02", "V0…
## $ AGE_AT_VISIT <dbl> 69.1, 80.5, 81.4, 84.0, 84.0, 69.1, 69.4, 69.6, 69.9, 70.…
age_col <- intersect(c("AGE", "AGE_AT_VISIT", "PATVISITAGE"), names(age_data))[1]
cat("Using age column:", age_col, "\n")
## Using age column: AGE_AT_VISIT
age_cohort <- age_data %>%
inner_join(raw_data %>% distinct(PATNO, COHORT), by = "PATNO") %>%
rename(age = all_of(age_col)) %>%
filter(!is.na(age), COHORT %in% c("PD", "Control"))
ggplot(age_cohort, aes(x = age, fill = COHORT)) +
geom_histogram(bins = 30, alpha = 0.7, position = "identity", color = "white") +
scale_fill_manual(values = c("Control" = "#4472C4", "PD" = "#C00000")) +
labs(title = "Age Distribution by Cohort",
x = "Age", y = "Count", fill = "") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
legend.position = "bottom")
ggplot(age_cohort, aes(x = COHORT, y = age, fill = COHORT)) +
geom_boxplot(alpha = 0.7, show.legend = FALSE) +
geom_jitter(width = 0.15, alpha = 0.3, size = 1) +
scale_fill_manual(values = c("Control" = "#4472C4", "PD" = "#C00000")) +
labs(title = "Age by Disease Status", x = "", y = "Age") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))
age_cohort %>%
group_by(COHORT) %>%
summarise(
n = n(),
mean_age = round(mean(age, na.rm = TRUE), 1),
median_age = median(age, na.rm = TRUE),
sd_age = round(sd(age, na.rm = TRUE), 1),
min_age = min(age, na.rm = TRUE),
max_age = max(age, na.rm = TRUE)
) %>%
print()
## # A tibble: 2 × 7
## COHORT n mean_age median_age sd_age min_age max_age
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Control 3582 65.4 66.5 12 30.6 98.1
## 2 PD 10961 66 66.8 9.8 32.3 93.6
The age distribution is broadly similar between PD patients and controls, with both groups concentrated in late middle age and consistent with PPMI’s recruitment of de novo PD cases. The median age in the PD group is slightly higher than in controls, which is biologically plausible given that PD risk rises sharply after age 60. This near-overlap is reassuring for the modeling step where age is not strongly separating the two cohorts. So, any class differences detected by the model are unlikely to be a pure age effect. Nonetheless, age is retained as a covariate in the final feature set so that the protein signal is interpreted conditional on age rather than confounded by it.
glimpse(demographics)
## Rows: 8,351
## Columns: 29
## $ REC_ID <chr> "IA86904", "IA86905", "IA86906", "IA86907", "IA86908", "2…
## $ PATNO <dbl> 3000, 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 300…
## $ EVENT_ID <chr> "TRANS", "TRANS", "TRANS", "TRANS", "TRANS", "TRANS", "TR…
## $ PAG_NAME <chr> "SCREEN", "SCREEN", "SCREEN", "SCREEN", "SCREEN", "SCREEN…
## $ INFODT <chr> "01/2011", "02/2011", "03/2011", "03/2011", "03/2011", "0…
## $ AFICBERB <dbl> 0, 0, 0, 0, 0, NA, NA, NA, 0, 0, 0, NA, NA, 0, NA, NA, 0,…
## $ ASHKJEW <dbl> 0, 0, 0, 0, 0, NA, NA, NA, 0, 0, 0, NA, NA, 0, NA, NA, 0,…
## $ BASQUE <dbl> 0, 0, 0, 0, 0, NA, NA, NA, 0, 0, 0, NA, NA, 0, NA, NA, 0,…
## $ BIRTHDT <chr> "12/1941", "01/1946", "08/1943", "07/1954", "11/1951", "1…
## $ SEX <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ CHLDBEAR <dbl> 0, NA, 0, 0, NA, 0, 0, NA, 0, 0, NA, NA, NA, 0, NA, NA, N…
## $ HOWLIVE <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ GAYLES <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ HETERO <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ BISEXUAL <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ PANSEXUAL <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ASEXUAL <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ OTHSEXUALITY <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ HANDED <dbl> 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, 1, …
## $ HISPLAT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ RAASIAN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ RABLACK <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ RAHAWOPI <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ RAINDALS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ RANOS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ RAWHITE <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, …
## $ RAUNKNOWN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ORIG_ENTRY <chr> "01/2011", "02/2011", "03/2011", "03/2011", "03/2011", "0…
## $ LAST_UPDATE <dttm> 2022-11-07 00:00:00, 2022-11-07 00:00:00, 2022-11-07 00:…
demo_cohort <- demographics %>%
inner_join(raw_data %>% distinct(PATNO, COHORT), by = "PATNO") %>%
distinct(PATNO, .keep_all = TRUE) %>%
filter(COHORT %in% c("PD", "Control")) %>%
mutate(SEX = factor(SEX, levels = c(0, 1),
labels = c("Female", "Male")))
demo_cohort %>%
count(COHORT, SEX) %>%
ggplot(aes(x = COHORT, y = n, fill = SEX)) +
geom_col(position = "dodge", width = 0.6) +
geom_text(aes(label = n), position = position_dodge(0.6),
vjust = -0.5, size = 3.5, fontface = "bold") +
scale_fill_manual(values = c("Male" = "#4472C4",
"Female" = "#ED7D31")) +
labs(title = "Sex Distribution by Cohort",
x = "", y = "Count", fill = "") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
legend.position = "bottom")
demo_cohort %>%
count(COHORT, SEX) %>%
group_by(COHORT) %>%
mutate(pct = round(100 * n / sum(n), 1)) %>%
print()
## # A tibble: 4 × 4
## # Groups: COHORT [2]
## COHORT SEX n pct
## <chr> <fct> <int> <dbl>
## 1 Control Female 69 37.1
## 2 Control Male 117 62.9
## 3 PD Female 244 39.5
## 4 PD Male 373 60.5
According to the Sex distribution graph, males outnumber females in both cohorts. But the imbalance is more pronounced in the PD group than in controls. This pattern mirrors the well-documented epidemiology of PD, where males have roughly a 1.5–2× higher lifetime incidence than females. From a modeling perspective, this means sex is correlated with disease status, which is why it appears as a confounder in the DAG. In the final recipe, sex is removed before fitting so that the protein signature is evaluated independently of sex; a sensitivity analysis adjusting for sex could be added in future work.
mutation_cols <- c("LRRK2", "GBA", "SNCA", "PRKN", "PINK1") # VPS35/PARK7 all 0
gen_cohort <- genetics %>%
inner_join(raw_data %>% distinct(PATNO, COHORT), by = "PATNO") %>%
distinct(PATNO, .keep_all = TRUE) %>%
filter(COHORT %in% c("PD", "Control")) %>%
mutate(across(all_of(mutation_cols), as.character))
gen_cohort %>%
pivot_longer(cols = all_of(mutation_cols),
names_to = "gene",
values_to = "status") %>%
mutate(carrier = status != "0" & !is.na(status)) %>%
group_by(COHORT, gene) %>%
summarise(n_carriers = sum(carrier, na.rm = TRUE), .groups = "drop") %>%
ggplot(aes(x = reorder(gene, -n_carriers), y = n_carriers, fill = COHORT)) +
geom_col(position = "dodge", width = 0.6) +
geom_text(aes(label = n_carriers), position = position_dodge(0.6),
vjust = -0.5, size = 3.5, fontface = "bold") +
scale_fill_manual(values = c("Control" = "#4472C4", "PD" = "#C00000")) +
labs(title = "Number of Mutation Carriers by Gene and Cohort",
x = "Gene", y = "Number of Carriers", fill = "") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
legend.position = "bottom")
The number of confirmed genetic mutation carriers is small in
absolute terms but informative. LRRK2 and GBA carriers are the most
represented, consistent with their relatively high prevalence among PD
cases worldwide. The proportion of mutation carriers is higher in the PD
group than in controls for every gene examined, which is the expected
direction. Because these mutations affect both PD risk and proteomic
background, they are retained as binary covariates
(LRRK2_carrier, GBA_carrier,
SNCA_carrier, PRKN_carrier,
APOE_e4) in the modeling step rather than dropped or used
to stratify the analysis.
Overall, the EDA shows three things. First, the cohort is imbalanced toward PD and toward males, which must be addressed in modeling. Second, age, sex, and genetic mutation status differ between cohorts in expected ways and so are retained as covariates. Third, the disease signal, if it exists, is not detectable in the pooled marginal distribution of all proteins and must be sought in a much smaller, biologically focused subset. The next sections describe how that subset is selected without leaking information from the test set, and how three classification models are trained, tuned, and compared.
The modeling section is restricted to PD and Control patients, with PPMI’s prodromal and SWEDD cohorts excluded so that the classification task is well defined.
The first preprocessing step is to assemble per-patient clinical features. Age is taken at the baseline (BL) or screening (SC) visit so that every patient contributes a single, time-aligned age value. Genetic data are reduced to five binary indicator variables: LRRK2, GBA, SNCA, and PRKN are coded as carrier vs. non-carrier, and APOE is coded as the presence vs. absence of an e4 allele (APOE-e4 status is a well-established risk factor for cognitive decline in PD as well as in Alzheimer’s disease, and is therefore retained even though it is not strictly a PD-causing mutation).
age_baseline <- age_data %>%
filter(EVENT_ID %in% c("BL", "SC")) %>%
arrange(PATNO, EVENT_ID) %>%
distinct(PATNO, .keep_all = TRUE) %>%
select(PATNO, AGE_AT_VISIT)
genetics_features <- genetics %>%
mutate(
LRRK2_carrier = as.integer(LRRK2 != "0" & !is.na(LRRK2)),
GBA_carrier = as.integer(GBA != "0" & !is.na(GBA)),
SNCA_carrier = as.integer(as.character(SNCA) != "0" & !is.na(SNCA)),
PRKN_carrier = as.integer(as.character(PRKN) != "0" & !is.na(PRKN)),
APOE_e4 = as.integer(grepl("E4", APOE, ignore.case = TRUE))
) %>%
select(PATNO, LRRK2_carrier, GBA_carrier,
SNCA_carrier, PRKN_carrier, APOE_e4)
cat("Patients with baseline age:", nrow(age_baseline), "\n")
## Patients with baseline age: 8118
cat("Patients with genetics: ", nrow(genetics_features), "\n")
## Patients with genetics: 6265
Next, the outcome is recoded into a factor with two levels
(Control and PD).
data_binary <- raw_data %>%
filter(COHORT %in% c("PD", "Control")) %>%
mutate(outcome = factor(ifelse(COHORT == "PD", 1, 0),
levels = c(0, 1),
labels = c("Control", "PD")))
data_binary %>% count(outcome)
## # A tibble: 2 × 2
## outcome n
## <fct> <int>
## 1 Control 890010
## 2 PD 2952345
The long-format data are then pivoted to a wide patient-by-protein matrix and merged with the baseline age and genetic features. Each row now corresponds to a single patient, with the outcome, sex, baseline age, five genetic indicators, and one column per SomaScan protein.
wide_data <- data_binary %>%
select(PATNO, SEX, outcome, TESTNAME, TESTVALUE) %>%
pivot_wider(
names_from = TESTNAME,
values_from = TESTVALUE
) %>%
left_join(age_baseline, by = "PATNO") %>%
left_join(genetics_features, by = "PATNO")
dim(wide_data)
## [1] 803 4794
glimpse(wide_data %>% select(PATNO, SEX, outcome,
AGE_AT_VISIT, LRRK2_carrier,
GBA_carrier, SNCA_carrier,
PRKN_carrier, APOE_e4))
## Rows: 803
## Columns: 9
## $ PATNO <dbl> 53595, 3029, 3963, 3316, 3124, 3175, 4103, 3467, 3168, 5…
## $ SEX <chr> "Female", "Female", "Female", "Male", "Male", "Female", …
## $ outcome <fct> PD, Control, PD, Control, PD, PD, PD, PD, PD, PD, PD, PD…
## $ AGE_AT_VISIT <dbl> 65.2, 66.3, 58.4, 74.7, 57.2, 57.2, 59.0, 66.9, 63.1, 67…
## $ LRRK2_carrier <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GBA_carrier <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ SNCA_carrier <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ PRKN_carrier <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ APOE_e4 <int> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
cat("Missing AGE_AT_VISIT: ", sum(is.na(wide_data$AGE_AT_VISIT)), "\n")
## Missing AGE_AT_VISIT: 0
cat("Missing LRRK2_carrier: ", sum(is.na(wide_data$LRRK2_carrier)), "\n")
## Missing LRRK2_carrier: 0
cat("LRRK2 carriers: ", sum(wide_data$LRRK2_carrier == 1, na.rm = TRUE), "\n")
## LRRK2 carriers: 156
cat("GBA carriers: ", sum(wide_data$GBA_carrier == 1, na.rm = TRUE), "\n")
## GBA carriers: 77
cat("APOE E4 carriers: ", sum(wide_data$APOE_e4 == 1, na.rm = TRUE), "\n")
## APOE E4 carriers: 197
With several thousand candidate proteins and only a few hundred patients, the feature space is much wider than the sample size. To make modeling tractable and to focus on proteins with meaningful biological variation, an unsupervised variance filter is applied. The variance of each protein is computed across patients, and only the top 500 most variable proteins are kept. This step is unsupervised because it does not look at the outcome label, so it does not leak class information into the training set. The cutoff at 500 was chosen by inspecting the variance distribution. Beyond the top several hundred proteins, the variance flattens into a long tail dominated by near-constant features that carry little information.
clinical_vars <- c("AGE_AT_VISIT", "LRRK2_carrier", "GBA_carrier",
"SNCA_carrier", "PRKN_carrier", "APOE_e4")
protein_cols <- wide_data %>%
select(-PATNO, -SEX, -outcome, -any_of(clinical_vars))
protein_vars <- protein_cols %>%
summarise(across(everything(), var, na.rm = TRUE)) %>%
pivot_longer(everything(),
names_to = "protein",
values_to = "variance") %>%
arrange(desc(variance))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(everything(), var, na.rm = TRUE)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
summary(protein_vars$variance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003911 0.025173 0.052631 0.122038 0.120528 12.827399
ggplot(protein_vars, aes(x = variance)) +
geom_histogram(bins = 100, fill = "steelblue", color = "white") +
geom_vline(xintercept = protein_vars$variance[500],
color = "red", linetype = "dashed", linewidth = 1) +
annotate("text",
x = protein_vars$variance[500],
y = Inf,
label = "Top 500 cutoff",
vjust = 2, hjust = -0.1, color = "red") +
coord_cartesian(xlim = c(0, 1)) +
labs(title = "Protein Variance Distribution",
x = "Variance",
y = "Count") +
theme_minimal()
top500_proteins <- protein_vars %>%
slice_head(n = 500) %>%
pull(protein)
data_filtered <- wide_data %>%
select(PATNO, SEX, outcome,
all_of(clinical_vars),
all_of(top500_proteins))
dim(data_filtered)
## [1] 803 509
cat("Clinical vars present:",
sum(clinical_vars %in% names(data_filtered)),
"of", length(clinical_vars), "\n")
## Clinical vars present: 6 of 6
The data are split into an 80/20 stratified train/test split before any supervised feature selection. This ordering is essential because if a t-test were run on the full dataset to choose discriminating proteins and the split were performed afterwards, information from the test patients would already have contaminated the chosen feature set, and test-set performance would be optimistically biased. The stratification on the outcome variable preserves the PD-to-Control ratio in both sets.
set.seed(42)
data_split_pre <- initial_split(
data_filtered %>% select(-PATNO),
prop = 0.80,
strata = outcome
)
train_pre <- training(data_split_pre)
test_pre <- testing(data_split_pre)
cat("Pre-split training set:\n")
## Pre-split training set:
train_pre %>% count(outcome) %>% print()
## # A tibble: 2 × 2
## outcome n
## <fct> <int>
## 1 Control 148
## 2 PD 493
cat("\nPre-split test set:\n")
##
## Pre-split test set:
test_pre %>% count(outcome) %>% print()
## # A tibble: 2 × 2
## outcome n
## <fct> <int>
## 1 Control 38
## 2 PD 124
For each of the top 500 variable proteins, a Welch two-sample t-test is run on the training set comparing log2 expression in PD vs. Control. Raw p-values are then adjusted for multiple testing using the Benjamini–Hochberg (BH) procedure to control the false discovery rate at 5%. Proteins with BH-adjusted p < 0.05 are carried forward as candidate features for the classification models. Restricting the t-test to the training data ensures that the test set remains genuinely held out for performance evaluation.
ttest_results <- train_pre %>%
select(outcome, all_of(top500_proteins)) %>%
pivot_longer(
cols = all_of(top500_proteins),
names_to = "protein",
values_to = "value"
) %>%
group_by(protein) %>%
summarise(
p_value = t.test(value ~ outcome)$p.value,
mean_PD = mean(value[outcome == "PD"], na.rm = TRUE),
mean_ctrl = mean(value[outcome == "Control"], na.rm = TRUE),
log2FC = mean_PD - mean_ctrl
) %>%
mutate(
p_adj = p.adjust(p_value, method = "BH"),
sig = p_adj < 0.05
) %>%
arrange(p_adj)
ttest_results %>% count(sig)
## # A tibble: 2 × 2
## sig n
## <lgl> <int>
## 1 FALSE 454
## 2 TRUE 46
head(ttest_results, 20)
## # A tibble: 20 × 7
## protein p_value mean_PD mean_ctrl log2FC p_adj sig
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 3504-58_2 0.00000000789 9.33 8.95 0.375 0.00000395 TRUE
## 2 7757-5_3 0.0000000423 9.41 9.11 0.300 0.0000106 TRUE
## 3 5080-131_3 0.00000437 9.68 9.48 0.203 0.000728 TRUE
## 4 6521-35_3 0.00000716 11.7 11.9 -0.228 0.000895 TRUE
## 5 14125-5_3 0.0000162 8.99 8.77 0.219 0.00162 TRUE
## 6 13722-105_3 0.0000431 10.4 10.2 0.202 0.00214 TRUE
## 7 2292-17_4 0.0000401 9.66 9.45 0.205 0.00214 TRUE
## 8 2743-5_2 0.0000267 12.9 12.7 0.288 0.00214 TRUE
## 9 2900-53_3 0.0000418 9.47 9.24 0.234 0.00214 TRUE
## 10 4141-79_1 0.0000471 10.6 10.3 0.265 0.00214 TRUE
## 11 8240-207_3 0.0000430 11.6 11.4 0.221 0.00214 TRUE
## 12 6520-87_3 0.0000540 13.4 13.2 0.228 0.00225 TRUE
## 13 3535-84_1 0.0000771 8.55 8.34 0.204 0.00297 TRUE
## 14 13416-8_3 0.0000942 12.1 12.3 -0.233 0.00314 TRUE
## 15 3060-43_2 0.0000911 11.2 11.0 0.187 0.00314 TRUE
## 16 9900-36_3 0.000206 9.14 8.92 0.221 0.00642 TRUE
## 17 3415-61_2 0.000238 8.70 8.23 0.469 0.00662 TRUE
## 18 9796-4_3 0.000226 15.2 15.0 0.215 0.00662 TRUE
## 19 8890-9_3 0.000352 11.7 11.9 -0.197 0.00926 TRUE
## 20 9986-14_3 0.000414 9.36 9.16 0.206 0.0104 TRUE
The volcano plot visualizes the results of this screening step. The x-axis is the log2 fold change between PD and Control, and the y-axis is the negative log10 of the BH-adjusted p-value. Points in the upper region of the plot represent proteins that are differentially abundant between PD and Control after multiple-testing correction. We can see that a meaningful number of proteins clear the BH < 0.05 threshold, including several with substantial fold changes, which suggests that there is a real disease signal in the CSF proteome that the models can exploit.
ggplot(ttest_results, aes(x = log2FC, y = -log10(p_adj), color = sig)) +
geom_point(alpha = 0.6, size = 1.2) +
scale_color_manual(values = c("FALSE" = "grey60", "TRUE" = "#C00000"),
labels = c("Not significant", "Significant (BH < 0.05)")) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "black") +
labs(title = "Volcano Plot: PD vs Control (t-test, training set)",
x = "Log2 Fold Change (PD - Control)",
y = "-log10(Adjusted p-value)",
color = "") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
legend.position = "bottom")
The significant proteins from the volcano analysis, together with age and the five genetic indicator variables, form the final feature set. The training and test rows are recombined into a single data frame and the original split is reconstructed.
sig_proteins <- ttest_results %>%
filter(sig == TRUE) %>%
pull(protein)
cat("Significant proteins:", length(sig_proteins), "\n")
## Significant proteins: 46
train_data <- train_pre %>%
select(SEX, outcome, all_of(clinical_vars), all_of(sig_proteins))
test_data <- test_pre %>%
select(SEX, outcome, all_of(clinical_vars), all_of(sig_proteins))
data_model <- bind_rows(train_data, test_data)
dim(data_model)
## [1] 803 54
data_model %>% count(outcome)
## # A tibble: 2 × 2
## outcome n
## <fct> <int>
## 1 Control 186
## 2 PD 617
data_model %>%
select(all_of(clinical_vars)) %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(),
names_to = "variable",
values_to = "n_missing") %>%
mutate(pct_missing = round(100 * n_missing / nrow(data_model), 1)) %>%
print()
## # A tibble: 6 × 3
## variable n_missing pct_missing
## <chr> <int> <dbl>
## 1 AGE_AT_VISIT 0 0
## 2 LRRK2_carrier 0 0
## 3 GBA_carrier 0 0
## 4 SNCA_carrier 0 0
## 5 PRKN_carrier 0 0
## 6 APOE_e4 0 0
data_split <- make_splits(
list(analysis = seq_len(nrow(train_data)),
assessment = seq(nrow(train_data) + 1, nrow(data_model))),
data = data_model
)
saveRDS(data_model, file.path(data_dir, "data_model.rds"))
Ten-fold stratified cross-validation is set up on the training data. Stratification preserves the PD-to-Control ratio in each fold so that no fold has so few minority-class samples that it cannot estimate sensitivity reliably. The same fold structure is used for tuning LASSO and the random forest as well as for cross-validating the baseline logistic regression.
set.seed(42)
cv_folds <- vfold_cv(
train_data,
v = 10,
strata = outcome
)
A single shared preprocessing recipe is used for all three models so that performance differences are attributable to the modeling approach rather than to preprocessing choices. The recipe (i) removes sex from the feature set so that the protein signature is evaluated on its own; (ii) median-imputes any remaining missing values in the clinical variables and in the proteins; (iii) normalizes all predictors to unit variance, which is required for LASSO and helps interpretability for logistic regression and random forest; and (iv) downsamples the majority class so that PD and Control are balanced inside the training fold.
Three model specifications are defined:
model_recipe <- recipe(outcome ~ ., data = train_data) %>%
step_rm(SEX) %>%
step_impute_median(all_of(clinical_vars)) %>%
step_impute_median(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_downsample(outcome, under_ratio = 1) # fix class imbalance
summary(model_recipe)
## # A tibble: 54 × 4
## variable type role source
## <chr> <list> <chr> <chr>
## 1 SEX <chr [3]> predictor original
## 2 AGE_AT_VISIT <chr [2]> predictor original
## 3 LRRK2_carrier <chr [2]> predictor original
## 4 GBA_carrier <chr [2]> predictor original
## 5 SNCA_carrier <chr [2]> predictor original
## 6 PRKN_carrier <chr [2]> predictor original
## 7 APOE_e4 <chr [2]> predictor original
## 8 3504-58_2 <chr [2]> predictor original
## 9 7757-5_3 <chr [2]> predictor original
## 10 5080-131_3 <chr [2]> predictor original
## # ℹ 44 more rows
log_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
log_workflow <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(log_spec)
lasso_spec <- logistic_reg(
penalty = tune(),
mixture = 1
) %>%
set_engine("glmnet") %>%
set_mode("classification")
lasso_workflow <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(lasso_spec)
rf_spec <- rand_forest(
mtry = tune(),
trees = 500,
min_n = tune()
) %>%
set_engine("ranger", importance = "permutation") %>%
set_mode("classification")
rf_workflow <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(rf_spec)
log_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## • step_rm()
## • step_impute_median()
## • step_impute_median()
## • step_normalize()
## • step_downsample()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
lasso_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## • step_rm()
## • step_impute_median()
## • step_impute_median()
## • step_normalize()
## • step_downsample()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
rf_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## • step_rm()
## • step_impute_median()
## • step_impute_median()
## • step_normalize()
## • step_downsample()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 500
## min_n = tune()
##
## Engine-Specific Arguments:
## importance = permutation
##
## Computational engine: ranger
To tune for LASSO, 50 different penalty values were tried, from very small to very large. For the random forest, five different values are tried for the number of proteins it randomly considers at each split, and the smallest node size before it stops splitting. Logistic regression doesn’t have any parameters to tune.
For each model, I pick the settings that give the highest cross-validated ROC AUC, then compare the three models on ROC AUC, accuracy, sensitivity, and specificity.
#LASSO
lasso_grid <- grid_regular(
penalty(range = c(-4, 0)),
levels = 50
)
set.seed(42)
lasso_tuned <- tune_grid(
lasso_workflow,
resamples = cv_folds,
grid = lasso_grid,
metrics = metric_set(roc_auc, accuracy, sens, spec),
control = control_grid(save_pred = TRUE)
)
## Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
## 1.2.0.
## ℹ See details at
## <https://tidyselect.r-lib.org/reference/faq-selection-context.html>
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
lasso_best <- select_best(lasso_tuned, metric = "roc_auc")
cat("Best LASSO penalty:", lasso_best$penalty, "\n")
## Best LASSO penalty: 0.009102982
#Random Forest
rf_grid <- grid_regular(
mtry(range = c(2, 20)),
min_n(range = c(2, 20)),
levels = 5
)
set.seed(42)
rf_tuned <- tune_grid(
rf_workflow,
resamples = cv_folds,
grid = rf_grid,
metrics = metric_set(roc_auc, accuracy, sens, spec),
control = control_grid(save_pred = TRUE)
)
rf_best <- select_best(rf_tuned, metric = "roc_auc")
cat("Best RF mtry:", rf_best$mtry, "min_n:", rf_best$min_n, "\n")
## Best RF mtry: 11 min_n: 2
#Logistic Regression
set.seed(42)
log_cv_results <- fit_resamples(
log_workflow,
resamples = cv_folds,
metrics = metric_set(roc_auc, accuracy, sens, spec),
control = control_resamples(save_pred = TRUE)
)
## → A | warning: prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## There were issues with some computations A: x2There were issues with some computations A: x4
autoplot(lasso_tuned, metric = "roc_auc") +
labs(title = "LASSO Tuning: ROC AUC vs Penalty") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))
autoplot(rf_tuned, metric = "roc_auc") +
labs(title = "Random Forest Tuning: ROC AUC vs Hyperparameters") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))
log_metrics <- collect_metrics(log_cv_results) %>% mutate(model = "Logistic")
lasso_metrics <- collect_metrics(lasso_tuned) %>%
filter(penalty == lasso_best$penalty) %>%
mutate(model = "LASSO")
rf_metrics <- collect_metrics(rf_tuned) %>%
filter(mtry == rf_best$mtry, min_n == rf_best$min_n) %>%
mutate(model = "Random Forest")
cv_comparison <- bind_rows(log_metrics, lasso_metrics, rf_metrics) %>%
select(model, .metric, mean, std_err) %>%
filter(.metric %in% c("roc_auc", "accuracy", "sens", "spec")) %>%
arrange(.metric, desc(mean))
print(cv_comparison, n = 50)
## # A tibble: 12 × 4
## model .metric mean std_err
## <chr> <chr> <dbl> <dbl>
## 1 Random Forest accuracy 0.705 0.0165
## 2 Logistic accuracy 0.694 0.0191
## 3 LASSO accuracy 0.671 0.0219
## 4 LASSO roc_auc 0.810 0.0221
## 5 Random Forest roc_auc 0.795 0.0195
## 6 Logistic roc_auc 0.777 0.0162
## 7 LASSO sens 0.811 0.0386
## 8 Random Forest sens 0.736 0.0504
## 9 Logistic sens 0.729 0.0326
## 10 Random Forest spec 0.696 0.0190
## 11 Logistic spec 0.683 0.0252
## 12 LASSO spec 0.629 0.0245
cv_comparison %>%
mutate(.metric = recode(.metric,
"roc_auc" = "ROC AUC",
"accuracy" = "Accuracy",
"sens" = "Sensitivity",
"spec" = "Specificity"
)) %>%
ggplot(aes(x = model, y = mean, fill = model)) +
geom_col(width = 0.6, show.legend = FALSE) +
geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err),
width = 0.2) +
geom_text(aes(label = round(mean, 3)), vjust = -0.5, size = 3.5) +
facet_wrap(~ .metric, scales = "free_y") +
scale_fill_manual(values = c("Logistic" = "#4472C4",
"LASSO" = "#ED7D31",
"Random Forest" = "#70AD47")) +
labs(title = "Cross-Validation Performance by Model",
x = "", y = "Mean Score") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.text.x = element_text(angle = 20, hjust = 1))
The two tuning curves show how performance varies with the penalty
(LASSO) and with mtry/min_n (random forest).
The LASSO curve is informative: ROC AUC is roughly flat over a wide
range of small penalties and then drops sharply once the penalty becomes
large enough to zero out the informative coefficients. The random forest
is, as expected, less sensitive to its hyperparameters in this regime.
The bar plot at the end of the chunk compares the three models on their
best cross-validated settings and serves as the model-selection
figure.
The selected hyperparameters are now plugged back into the workflows and each model is refitted on the full training set and evaluated. Four metrics, ROC AUC, accuracy, sensitivity, and specificity, are computed on the test data, and ROC curves, confusion matrices, and variable importance plots are generated for interpretation.
final_log_workflow <- log_workflow
final_lasso_workflow <- lasso_workflow %>% finalize_workflow(lasso_best)
final_rf_workflow <- rf_workflow %>% finalize_workflow(rf_best)
set.seed(42)
log_final <- last_fit(final_log_workflow, data_split,
metrics = metric_set(roc_auc, accuracy, sens, spec))
## → A | warning: prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## There were issues with some computations A: x2There were issues with some computations A: x2
lasso_final <- last_fit(final_lasso_workflow, data_split,
metrics = metric_set(roc_auc, accuracy, sens, spec))
rf_final <- last_fit(final_rf_workflow, data_split,
metrics = metric_set(roc_auc, accuracy, sens, spec))
test_metrics <- bind_rows(
collect_metrics(log_final) %>% mutate(model = "Logistic"),
collect_metrics(lasso_final) %>% mutate(model = "LASSO"),
collect_metrics(rf_final) %>% mutate(model = "Random Forest")
) %>%
select(model, .metric, .estimate) %>%
arrange(.metric, desc(.estimate))
print(test_metrics, n = 50)
## # A tibble: 12 × 3
## model .metric .estimate
## <chr> <chr> <dbl>
## 1 Logistic accuracy 0.722
## 2 Random Forest accuracy 0.716
## 3 LASSO accuracy 0.704
## 4 Logistic roc_auc 0.812
## 5 LASSO roc_auc 0.808
## 6 Random Forest roc_auc 0.778
## 7 LASSO sens 0.842
## 8 Logistic sens 0.789
## 9 Random Forest sens 0.711
## 10 Random Forest spec 0.718
## 11 Logistic spec 0.702
## 12 LASSO spec 0.661
test_metrics %>%
mutate(.metric = recode(.metric,
"roc_auc" = "ROC AUC",
"accuracy" = "Accuracy",
"sens" = "Sensitivity",
"spec" = "Specificity"
)) %>%
ggplot(aes(x = model, y = .estimate, fill = model)) +
geom_col(width = 0.6, show.legend = FALSE) +
geom_text(aes(label = round(.estimate, 3)), vjust = -0.5, size = 3.5) +
facet_wrap(~ .metric, scales = "free_y") +
scale_fill_manual(values = c("Logistic" = "#4472C4",
"LASSO" = "#ED7D31",
"Random Forest" = "#70AD47")) +
labs(title = "Test Set Performance by Model", x = "", y = "Score") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.text.x = element_text(angle = 20, hjust = 1))
cat("=== Logistic Regression ===\n")
## === Logistic Regression ===
collect_predictions(log_final) %>%
conf_mat(truth = outcome, estimate = .pred_class) %>%
print()
## Truth
## Prediction Control PD
## Control 30 37
## PD 8 87
cat("\n=== LASSO ===\n")
##
## === LASSO ===
collect_predictions(lasso_final) %>%
conf_mat(truth = outcome, estimate = .pred_class) %>%
print()
## Truth
## Prediction Control PD
## Control 32 42
## PD 6 82
cat("\n=== Random Forest ===\n")
##
## === Random Forest ===
collect_predictions(rf_final) %>%
conf_mat(truth = outcome, estimate = .pred_class) %>%
print()
## Truth
## Prediction Control PD
## Control 27 35
## PD 11 89
log_roc <- collect_predictions(log_final) %>% mutate(model = "Logistic")
lasso_roc <- collect_predictions(lasso_final) %>% mutate(model = "LASSO")
rf_roc <- collect_predictions(rf_final) %>% mutate(model = "Random Forest")
bind_rows(log_roc, lasso_roc, rf_roc) %>%
group_by(model) %>%
roc_curve(truth = outcome, .pred_Control) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) +
geom_line(linewidth = 1.2) +
geom_abline(linetype = "dashed", color = "grey50") +
scale_color_manual(values = c("Logistic" = "#4472C4",
"LASSO" = "#ED7D31",
"Random Forest" = "#70AD47")) +
labs(title = "ROC Curves on Test Set",
x = "1 - Specificity (False Positive Rate)",
y = "Sensitivity (True Positive Rate)",
color = "Model") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
legend.position = "bottom")
lasso_final %>%
extract_fit_parsnip() %>%
tidy() %>%
filter(term != "(Intercept)", estimate != 0) %>%
arrange(desc(abs(estimate))) %>%
slice_head(n = 20) %>%
left_join(protein_lookup, by = c("term" = "SOMA_SEQ_ID")) %>%
mutate(protein_label = ifelse(is.na(TARGET_GENE_SYMBOL), term, TARGET_GENE_SYMBOL)) %>%
ggplot(aes(x = reorder(protein_label, abs(estimate)), y = estimate,
fill = estimate > 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
scale_fill_manual(values = c("TRUE" = "#C00000", "FALSE" = "#4472C4")) +
labs(title = "LASSO: Top 20 Coefficients",
x = "", y = "Coefficient") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.text.y = element_text(size = 8))
rf_imp <- rf_final %>%
extract_fit_parsnip() %>%
vi() %>%
slice_head(n = 20) %>%
left_join(protein_lookup, by = c("Variable" = "SOMA_SEQ_ID")) %>%
mutate(protein_label = ifelse(is.na(TARGET_GENE_SYMBOL), Variable, TARGET_GENE_SYMBOL))
ggplot(rf_imp, aes(x = reorder(protein_label, Importance), y = Importance)) +
geom_col(fill = "#70AD47", show.legend = FALSE) +
coord_flip() +
labs(title = "Random Forest: Top 20 Variable Importance",
x = "", y = "Permutation Importance") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.text.y = element_text(size = 8))
All three models achieve very similar cross-validated performance, with ROC AUCs clustered between 0.78 and 0.81. LASSO has the highest cross-validated ROC AUC at 0.810, followed by the random forest at 0.795 and logistic regression at 0.777. On accuracy the ordering reverses. The random forest is on top at 0.705, logistic regression at 0.694, and LASSO at 0.671.This is showing that LASSO is trading raw accuracy for a different operating point. That trade-off is explicit in the sensitivity and specificity panels: LASSO has by far the highest sensitivity but the lowest specificity. The random forest is the most balanced of the three, and logistic regression sits in between. The error bars overlap substantially across models on accuracy and ROC AUC.
The pattern carries over to the held-out test set, and importantly performance does not collapse from CV to test, which suggests none of the models is badly overfit. Logistic regression actually edges out on test ROC AUC at 0.812, with LASSO essentially tied at 0.808 and the random forest at 0.778. Test accuracy is again tight across models (Logistic 0.722, RF 0.716, LASSO 0.704). The most consistent story across both CV and test is the sensitivity–specificity trade-off: LASSO has the highest test sensitivity at 0.842 but the lowest specificity at 0.661, while the random forest sits at the opposite end (sensitivity 0.711, specificity 0.716). Logistic regression is balanced in the middle (sensitivity 0.789, specificity 0.702).
The question about which model is “best” therefore depends on the use case. If the goal is a screening tool that should rarely miss a true PD patient, LASSO is the more appropriate choice; if the goal is balanced classification with fewer false positives, the random forest or logistic regression is preferable.
The largest single predictor in both models is LRRK2 carrier status. GBA carrier status and PRKN carrier status also appear in the LASSO top 20 with positive coefficients. LRRK2, GBA, and PRKN are the canonical Mendelian PD risk genes, and the models recovered them without being told that they are biologically privileged.
Beyond the genetic variables, both models converge on the same short list of CSF proteins, which is a useful internal-replication signal: NEFH, HAMP, CECR1, CEL, EPHA10, GAPDH, NPW, and LRFN2 appear in both the top-20 LASSO coefficients and the top-20 random forest importances. NEFH (neurofilament heavy chain) which is a well-established CSF marker of axonal injury that is elevated across a range of neurodegenerative diseases. HAMP (hepcidin) ties into iron dysregulation, which has been increasingly implicated in dopaminergic cell loss due to oxidateive stress and lewy body formation. GAPDH, classically a housekeeping protein, has known interactions with alpha-synuclein aggregation. The random forest additionally surfaces GPNMB (a microglial protein at a confirmed PD GWAS locus), HLA-DQA2 (consistent with the neuroinflammatory axis seen earlier), PREP (prolyl endopeptidase, linked to alpha-synuclein clearance), and GFRA2 (a GDNF-family receptor involved in dopaminergic neuron support).
According to the LASSO coefficients, the most positive ones are LRRK2/GBA/PRKN carrier status, NEFH, GAPDH, DKK1, CEL, HAMP). Negative coefficients: SFRP1, EPHA10, MGP, HAPLN1, TREML2. DKK1 and SFRP1 are both Wnt-pathway regulators, which is interesting because Wnt signaling is required for dopaminergic neuron maintenance and is increasingly viewed as dysregulated in PD. AGE_AT_VISIT carries a negative coefficient, which is somewhat counterintuitive on its own. It does not mean older age protects against PD, but rather that given the protein signature, additional age does not further increase the predicted probability of PD, likely because age is partly already encoded in the protein levels themselves.
Overall, the agreement between LASSO and the random forest on both the genetic carriers and on the cluster of NEFH / HAMP / GPNMB / HLA / Wnt-related proteins suggests that the predictive signal in this analysis is not driven by a single protein or by SomaScan artifacts, but by a coherent biological signature spanning Mendelian PD genes, axonal injury, neuroinflammation, and dopaminergic-pathway biology.
This project asked whether CSF proteomics alone can distinguish Parkinson’s disease patients from healthy controls and, if so, which proteins drive the signal. Using baseline SomaScan measurements from PPMI together with age and genetic carrier status as covariates, three classification models, logistic regression, LASSO, and a random forest, were compared inside a leakage-aware pipeline.
Since there were more protein targets than numpber of patients, the most significantly differentiating proteins between Control and PD was statistically selected. After an unsupervised variance filter and a BH-controlled t-test on the training fold only, all three models reached ROC AUCs around 0.78–0.81 on both cross-validation and the held-out test set. This level of accuracy is somewhat promising about our models. The three models offered different sensitivity–specificity trade-offs rather than one clearly dominating. The top predictors agreed substantially across LASSO and the random forest and were biologically coherent. They showed that the canonical Mendelian PD genes (LRRK2, GBA, PRKN) and a cluster of proteins tied to axonal injury, neuroinflammation, and Wnt signaling are among the most important predictors.
In short, CSF proteomics data trained in a classifier model can create a real, biologically interpretable signal for PD, and the proteins driving it map onto known disease biology rather than onto platform artifacts.
Several limitations should temper the interpretation of these results. First, the PPMI cohort is consists of less than 1000 patients which is relatively large but not epidemiologically representative. Participants are overwhelmingly self-referred, of European ancestry, and recruited at academic movement-disorder centers, which limits how well the learned signature generalizes to broader populations. Second, the analysis uses CSF samples collected at a single baseline visit; whether this same signature tracks disease progression longitudinally, or whether it changes after the initiation of dopaminergic therapy, has not been examined here. Third, since there was more protein targets than the number of patients, the selected proteins for the machine learning models had to be reduced with statistical methods. This may have resulted in loss of some important predictors that can not survive the t-test. Fourth, the SomaScan platform itself has known biases. Protein measurements depend on aptamer affinity rather than on direct mass spectrometry and some of the top proteins identified may reflect platform-specific behavior that would not replicate on an orthogonal assay. Also, as seen on EDA, there was blood contamination in some samples, which challenges the data collection methods used in creating the PPMI dataset. Lastly, plate identifier was retained as a candidate technical confounder but was not explicitly modeled as a random effect.
Despite these limitations, the analysis suggests that CSF proteomics, paired with machine learning algorithms is promising way toward finding objective biomarkers for Parkinson’s disease. The proteins identified here are sensible candidates for follow-up in independent validation cohorts and in mechanistic studies of alpha-synuclein clearance, lysosomal function, and neuroinflammation.