Background and Question of Interest

Parkinson’s disease (PD) is a progressive neurodegenerative disorder that affects more than ten million people worldwide and is the second most common neurodegenerative disease after Alzheimer’s Disease. It is characterized clinically by the cardinal motor features of resting tremor, rigidity, and bradykinesia, but it also carries a wide range of non-motor manifestations including cognitive decline, REM sleep behavior disorder, hyposmia, and autonomic dysfunction. Pathologically, PD is defined by the loss of dopaminergic neurons in the substantia nigra pars compacta and the accumulation of misfolded alpha-synuclein into Lewy bodies. Despite decades of research, current diagnosis still relies primarily on clinical observation of motor symptoms supported by neuroimaging such as DaTscan. By the time these motor signs are evident, a substantial proportion of nigrostriatal dopaminergic neurons have already been lost, which limits the window in which disease-modifying interventions could plausibly be effective.

For this reason, identifying objective, fluid-based biomarkers of PD has become a central goal of translational neuroscience. Cerebrospinal fluid (CSF) is particularly attractive in this context because it is in direct contact with the central nervous system and reflects ongoing pathological processes in the brain more faithfully than blood. Targeted CSF markers such as alpha-synuclein, total tau, and amyloid-beta 42 have been studied for years, but no single analyte has proven sufficient on its own. Recent advances in high-throughput proteomics — in particular SomaScan aptamer-based assays — make it possible to measure thousands of proteins simultaneously from a single CSF sample, raising the possibility of building multi-protein signatures that classify disease status with higher accuracy than any single marker.

This project investigates whether such a signature can be learned from CSF proteomics data alone. Specifically, the question of interest is: Can protein expression profiles from CSF samples reliably distinguish Parkinson’s disease patients from healthy controls, and which proteins are most predictive of disease status? Based on prior literature reporting altered abundance of proteins involved in synaptic function, neuroinflammation, and lysosomal degradation in PD CSF, I hypothesize that a multivariate model trained on hundreds of CSF proteins will achieve classification performance meaningfully above chance, and that the proteins selected as most predictive will be enriched for biologically plausible PD-related pathways.

Data

The data used in this project comes from the Parkinson’s Progression Markers Initiative (PPMI), a landmark longitudinal observational study sponsored by the Michael J. Fox Foundation. PPMI was launched in 2010 with the explicit goal of identifying biomarkers of PD onset and progression. It enrolls newly diagnosed PD patients, healthy controls, and several at-risk cohorts (e.g. prodromal participants, genetic carriers, and patients with REM sleep behavior disorder), and follows them with serial clinical examinations, imaging, and biospecimen collection. Because participants were recruited at multiple international sites using standardized protocols, the cohort is one of the largest and most carefully phenotyped PD resources currently available.

The CSF proteomics data used here come from PPMI Project 151, which applied the SomaScan v4.1 aptamer-based assay to baseline CSF samples. Each measurement represents the relative abundance of a single protein, indexed by a SomaScan sequence identifier (TESTNAME) and reported on a log2 scale (TESTVALUE). Roughly 4,700 unique proteins were quantified per sample. Plate identifiers (PLATEID) were retained as a technical batch variable. The proteomics data were merged with PPMI clinical tables containing diagnosis (COHORT), biological sex, age at visit, and genetic consensus data including carrier status for the major PD-associated mutations (LRRK2, GBA, SNCA, PRKN) and the APOE-e4 allele. The analysis is restricted to PD patients and healthy controls so that the classification task is well-defined; prodromal and SWEDD (“scan without evidence of dopaminergic deficit”) participants are excluded.

Clinical Data:

clinical_dir <- "/Users/ramazanegesolak/PPMI_STRING"
age_data           <- read_csv(file.path(clinical_dir, "Age_at_visit_12Apr2026.csv"),                 show_col_types = FALSE)
clinical_dx        <- read_csv(file.path(clinical_dir, "Clinical_Diagnosis_12Apr2026.csv"),            show_col_types = FALSE)
demographics       <- read_csv(file.path(clinical_dir, "Demographics_12Apr2026.csv"),                  show_col_types = FALSE)
genetics           <- read_csv(file.path(clinical_dir, "iu_genetic_consensus_20251025_12Apr2026.csv"), show_col_types = FALSE)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
participant_status <- read_csv(file.path(clinical_dir, "Participant_Status_12Apr2026.csv"),            show_col_types = FALSE)

cat("age_data:          ", nrow(age_data),           "rows\n")
## age_data:           46227 rows
cat("clinical_dx:       ", nrow(clinical_dx),        "rows\n")
## clinical_dx:        15440 rows
cat("demographics:      ", nrow(demographics),       "rows\n")
## demographics:       8351 rows
cat("genetics:          ", nrow(genetics),           "rows\n")
## genetics:           6265 rows
cat("participant_status:", nrow(participant_status), "rows\n")
## participant_status: 8417 rows

Variables of interest

  • Outcome variable: Parkinson’s disease diagnosis (COHORT), recoded as a binary indicator of PD patient versus healthy control.
  • Primary predictor variable: CSF protein expression levels (TESTVALUE), indexed by SomaScan sequence identifier (TESTNAME). After reshaping to a wide patient-by-protein matrix this yields several thousand candidate predictors.
  • Possible confounders: Biological sex, because protein expression levels are known to vary by sex and PD prevalence differs between males and females; age at visit, because age is a strong risk factor for PD and many CSF proteins vary with age; carrier status for LRRK2, GBA, SNCA, PRKN, and APOE-e4, which can independently shape both disease risk and proteomic background; and PLATEID, a technical batch variable since plate-to-plate variability across SomaScan runs can introduce systematic differences in protein measurements unrelated to biology.
  • Potential effect modifiers: sex may modify the association between specific proteins and PD if the disease-related protein signature differs between males and females, which is a plausible hypothesis given the well-documented sex differences in PD prevalence and clinical phenotype.

Directed acyclic graph (DAG)

library(ggdag)
library(ggplot2)

dag <- dagify(
  Outcome   ~ Predictor + SEX + AGE + Genetics,
  Predictor ~ SEX + AGE + Genetics + Batch,
  coords = list(
    x = c(SEX      = 1,   AGE      = 1,    Genetics = 1,
          Batch    = 1,   Predictor = 2.8,  Outcome  = 4.5),
    y = c(SEX      = 1.5, AGE      = 0.5,  Genetics = -0.5,
          Batch    = -1.5, Predictor = 0,   Outcome  = 0)
  )
)

ggdag(dag, text = FALSE) +
  geom_dag_point(color = "#4472C4", size = 22) +
  geom_dag_text(color = "white", size = 3.2, fontface = "bold") +
  theme_dag() +
  labs(
    title   = "DAG: CSF Proteomics and Parkinson's Disease Prediction",
    caption = paste(
      "Outcome   = PD Diagnosis (COHORT)",
      "Predictor = CSF Protein Expression (TESTVALUE)",
      "SEX       = Biological sex — affects PD prevalence & protein levels",
      "AGE       = Age at visit — risk factor for PD & protein expression",
      "Genetics  = Mutation carrier status (LRRK2, GBA, SNCA, PRKN, APOE-e4)",
      "Batch     = Assay plate (PLATEID) — technical measurement confounder",
      sep = "\n"
    )
  ) +
  theme(
    plot.title   = element_text(face = "bold", size = 13, hjust = 0.5),
    plot.caption = element_text(size = 8.5, color = "gray30", hjust = 0,
                                margin = margin(t = 12), family = "mono")
  )

The DAG above summarizes the assumed causal structure. PD status is the outcome; CSF protein expression is the primary predictor. Sex, age, and genetic carrier status are upstream variables that may influence both protein expression and disease status, which makes them confounders that have to be acknowledged in the analysis. Plate identifier (Batch) is a technical confounder. Batch does not affect disease status but it can shift protein measurements in ways that mimic biological signal. The DAG does not have an arrow from PD status back to the predictor, since we assume/treat the proteomic measurement as a snapshot taken at enrollment rather than a consequence of long-standing disease.

Statistical analysis plan

Because the outcome is binary, three classification approaches are built and compared: ordinary logistic regression, LASSO-penalized logistic regression, and random forest. These were chosen to span the bias–variance spectrum and to combine interpretable, sparse linear modeling (LASSO) with a flexible non-linear, ensemble approach (random forest). Standard logistic regression serves as a baseline. Comparing all three lets us evaluate predictive accuracy and the interpretability of the selected biomarkers side by side.

To control for confounding, age at visit and the five genetic carrier indicators (LRRK2, GBA, SNCA, PRKN, APOE-e4) are included as covariates alongside the protein expression features. Sex is examined in exploratory analysis and was originally considered as a covariate; in the final preprocessing recipe it is removed before modeling so that the protein signal is evaluated on its own merits rather than being inflated by demographic separation. Plate identifier is examined as a technical confounder, and protein expression values are normalized prior to modeling to further reduce technical variation. Class imbalance between PD and Control is addressed by random downsampling of the majority class inside the training pipeline so that the model is not allowed to “win” by always predicting PD. The analysis is also designed to avoid information leakage: variance filtering, t-test feature selection, and class balancing all happen inside the training fold or on the training set only. Model performance is evaluated using ROC AUC, accuracy, sensitivity, specificity, confusion matrices, and ROC curves on a held-out test set, and the most predictive proteins are inspected through LASSO coefficient magnitude and random forest permutation importance.

Exploratory data analysis

The exploratory analysis below has three aims: to characterize the cohort that the model will be trained on, to inspect the distribution of the predictor (CSF protein expression) across groups, and to look for imbalance in the candidate confounders. The proteomics CSV files for Project 151 are loaded and stacked into a single long-format data frame, retaining only the columns relevant to this analysis (patient ID, sex, cohort, protein name, expression value, plate ID).

wanted_cols <- c(
  "PATNO", "SEX", "COHORT", "TESTNAME",
  "TESTVALUE", "PLATEID"
)
raw_data <- file_list %>%
  map_dfr(~read_csv(.x, show_col_types = FALSE)) %>%
  select(any_of(wanted_cols))
glimpse(raw_data)
## Rows: 5,541,030
## Columns: 6
## $ PATNO     <dbl> 53595, 53595, 53595, 53595, 53595, 53595, 53595, 53595, 5359…
## $ SEX       <chr> "Female", "Female", "Female", "Female", "Female", "Female", …
## $ COHORT    <chr> "PD", "PD", "PD", "PD", "PD", "PD", "PD", "PD", "PD", "PD", …
## $ TESTNAME  <chr> "5632-6_3", "5631-83_3", "5630-48_3", "5629-58_3", "5628-21_…
## $ TESTVALUE <dbl> 10.182013, 6.346285, 12.943830, 5.891402, 12.836350, 7.03480…
## $ PLATEID   <chr> "P0022899", "P0022899", "P0022899", "P0022899", "P0022899", …
head(raw_data, 20)
## # A tibble: 20 × 6
##    PATNO SEX    COHORT TESTNAME  TESTVALUE PLATEID 
##    <dbl> <chr>  <chr>  <chr>         <dbl> <chr>   
##  1 53595 Female PD     5632-6_3      10.2  P0022899
##  2 53595 Female PD     5631-83_3      6.35 P0022899
##  3 53595 Female PD     5630-48_3     12.9  P0022899
##  4 53595 Female PD     5629-58_3      5.89 P0022899
##  5 53595 Female PD     5628-21_3     12.8  P0022899
##  6 53595 Female PD     5627-53_3      7.03 P0022899
##  7 53595 Female PD     5626-20_3      5.89 P0022899
##  8 53595 Female PD     5624-66_3      7.66 P0022899
##  9 53595 Female PD     5623-11_3      6.13 P0022899
## 10 53595 Female PD     5621-64_3      5.95 P0022899
## 11 53595 Female PD     5620-13_3     10.1  P0022899
## 12 53595 Female PD     5618-50_3     13.9  P0022899
## 13 53595 Female PD     5617-41_3      8.84 P0022899
## 14 53595 Female PD     5615-62_3      6.25 P0022899
## 15 53595 Female PD     5614-44_3      6.73 P0022899
## 16 53595 Female PD     5613-75_3      5.87 P0022899
## 17 53595 Female PD     5612-16_3      6.20 P0022899
## 18 53595 Female PD     5611-56_3      6.02 P0022899
## 19 53595 Female PD     5610-32_3      6.38 P0022899
## 20 53595 Female PD     5609-92_3     14.1  P0022899
raw_data %>% count(COHORT)
## # A tibble: 3 × 2
##   COHORT          n
##   <chr>       <int>
## 1 Control    890010
## 2 PD        2952345
## 3 Prodromal 1698675
raw_data %>% summarise(
  n_patients = n_distinct(PATNO),
  n_proteins = n_distinct(TESTNAME),
  n_rows     = n()
)
## # A tibble: 1 × 3
##   n_patients n_proteins  n_rows
##        <int>      <int>   <int>
## 1       1158       4785 5541030
raw_data %>%
  count(PATNO, TESTNAME) %>%
  filter(n > 1) %>%
  nrow()
## [1] 0

SomaScan sequence identifiers (e.g. seq.12345.6) are opaque on their own, so the official PPMI annotation file is loaded and a lookup table is built that maps each SOMA sequence ID to its target gene symbol. This lookup is used later to relabel the top proteins in the figures, so that the reader sees recognizable biological names (e.g. APOE, GFAP) instead of internal aptamer codes.

annotations <- read_csv(
  file.path(data_dir, "PPMI_Project_151_pqtl_Analysis_Annotations_20210210.csv"),
  show_col_types = FALSE
)
## New names:
## • `` -> `...3`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`
## • `` -> `...20`
## • `` -> `...21`
## • `` -> `...22`
## • `` -> `...23`
## • `` -> `...24`
## • `` -> `...25`
## • `` -> `...26`
## • `` -> `...27`
## • `` -> `...28`
## • `` -> `...29`
## • `` -> `...30`
## • `` -> `...31`
## • `` -> `...32`
## • `` -> `...33`
## • `` -> `...34`
## • `` -> `...35`
## • `` -> `...36`
## • `` -> `...37`
## • `` -> `...38`
## • `` -> `...39`
## • `` -> `...40`
## • `` -> `...41`
## • `` -> `...42`
## • `` -> `...43`
## • `` -> `...44`
## • `` -> `...45`
## • `` -> `...46`
## • `` -> `...47`
## • `` -> `...48`
## • `` -> `...49`
## • `` -> `...50`
## • `` -> `...51`
## • `` -> `...52`
## • `` -> `...53`
## • `` -> `...54`
## • `` -> `...55`
## • `` -> `...56`
## • `` -> `...57`
## • `` -> `...58`
## • `` -> `...59`
## • `` -> `...60`
## • `` -> `...61`
protein_lookup <- annotations %>%
  select(SOMA_SEQ_ID, TARGET_GENE_SYMBOL) %>%
  distinct() %>%
  filter(!is.na(TARGET_GENE_SYMBOL), TARGET_GENE_SYMBOL != "")
head(protein_lookup)
## # A tibble: 6 × 2
##   SOMA_SEQ_ID TARGET_GENE_SYMBOL
##   <chr>       <chr>             
## 1 10000-28_3  CRYBB2            
## 2 10001-7_3   RAF1              
## 3 10003-15_3  ZNF41             
## 4 10006-25_3  ELK1              
## 5 10008-43_3  GUCA1A            
## 6 10009-2_3   IRF1

Cohort composition

raw_data %>%
  distinct(PATNO, COHORT) %>%
  count(COHORT) %>%
  ggplot(aes(x = reorder(COHORT, -n), y = n, fill = COHORT)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = n), vjust = -0.5, fontface = "bold", size = 5) +
  scale_fill_manual(values = c("Control" = "#4472C4", "PD" = "#C00000")) +
  labs(title = "Number of Patients by Cohort", x = "", y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

The bar plot above shows the number of unique patients in each cohort that have CSF proteomic data available. The PD group is substantially larger than the Control group. Controls are recruited to roughly half the number of patients rather than to a 1:1 ratio. This imbalance can create several biases in our analysis. A naive classifier that always predicts “PD” would already achieve apparently high accuracy. For this reason the modeling pipeline downsamples the PD class to match the control class inside training and reports sensitivity and specificity in addition to accuracy.

Most variable proteins

raw_data %>%
  filter(COHORT %in% c("PD", "Control")) %>%
  group_by(TESTNAME) %>%
  summarise(variance = var(TESTVALUE, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(variance)) %>%
  slice_head(n = 20) %>%
  left_join(protein_lookup, by = c("TESTNAME" = "SOMA_SEQ_ID")) %>%
  mutate(protein_label = ifelse(is.na(TARGET_GENE_SYMBOL), TESTNAME, TARGET_GENE_SYMBOL)) %>%
  ggplot(aes(x = reorder(protein_label, variance), y = variance)) +
  geom_col(fill = "#4472C4", width = 0.7) +
  coord_flip() +
  labs(title = "Top 20 Most Variable Proteins",
       x = "", y = "Variance") +
  theme_minimal() +
  theme(plot.title  = element_text(face = "bold", size = 13),
        axis.text.y = element_text(size = 9))

According to this EDA, the single most variable protein is PILRA, a myeloid immune receptor most familiar from Alzheimer’s GWAS, and a cluster of microglial and immune-related proteins (SIGLEC9, SIRPA, FCGR2A, C3, CHIT1) also appears in the top 20. These are all consistent with the well-documented role of neuroinflammation in neurodegeneration. Beyond that, however, the list is dominated by proteins whose presence in CSF is almost certainly an artifact of low-level blood contamination during lumbar puncture. These artifact proteinsa are the hemoglobin chains HBB and HBA1, haptoglobin (HP), the plasma protein alpha-2-HS-glycoprotein (AHSG), prothrombin (F2), and the ABO blood-group glycosyltransferase. The variance in these proteins is high because CSF samples differ in how much blood was admixed at collection, not because of PD biology. A second cluster (HIST2H2BE, SETMAR, C1orf226, LRPPRC, GAPDH) is more loosely interpretable and likely reflects a mixture of cellular turnover, mitochondrial content, and assay-level variability across samples.

According to this, a high variance is not the same as a disease-specific signal. A protein could vary widely between individuals because of age, comorbidities, or assay noise rather than PD biology. For that reason, variance ranking is used purely as a screening tool to retain the top 500 most variable proteins before disease-aware feature selection (the BH-adjusted t-test) is applied within the training set.

Age distribution

glimpse(age_data)
## Rows: 46,227
## Columns: 3
## $ PATNO        <dbl> 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 300…
## $ EVENT_ID     <chr> "BL", "R17", "R18", "R19", "R20", "SC", "V01", "V02", "V0…
## $ AGE_AT_VISIT <dbl> 69.1, 80.5, 81.4, 84.0, 84.0, 69.1, 69.4, 69.6, 69.9, 70.…
age_col <- intersect(c("AGE", "AGE_AT_VISIT", "PATVISITAGE"), names(age_data))[1]
cat("Using age column:", age_col, "\n")
## Using age column: AGE_AT_VISIT
age_cohort <- age_data %>%
  inner_join(raw_data %>% distinct(PATNO, COHORT), by = "PATNO") %>%
  rename(age = all_of(age_col)) %>%
  filter(!is.na(age), COHORT %in% c("PD", "Control"))
ggplot(age_cohort, aes(x = age, fill = COHORT)) +
  geom_histogram(bins = 30, alpha = 0.7, position = "identity", color = "white") +
  scale_fill_manual(values = c("Control" = "#4472C4", "PD" = "#C00000")) +
  labs(title = "Age Distribution by Cohort",
       x = "Age", y = "Count", fill = "") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13),
        legend.position = "bottom")

ggplot(age_cohort, aes(x = COHORT, y = age, fill = COHORT)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  geom_jitter(width = 0.15, alpha = 0.3, size = 1) +
  scale_fill_manual(values = c("Control" = "#4472C4", "PD" = "#C00000")) +
  labs(title = "Age by Disease Status", x = "", y = "Age") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

age_cohort %>%
  group_by(COHORT) %>%
  summarise(
    n          = n(),
    mean_age   = round(mean(age, na.rm = TRUE), 1),
    median_age = median(age, na.rm = TRUE),
    sd_age     = round(sd(age, na.rm = TRUE), 1),
    min_age    = min(age, na.rm = TRUE),
    max_age    = max(age, na.rm = TRUE)
  ) %>%
  print()
## # A tibble: 2 × 7
##   COHORT      n mean_age median_age sd_age min_age max_age
##   <chr>   <int>    <dbl>      <dbl>  <dbl>   <dbl>   <dbl>
## 1 Control  3582     65.4       66.5   12      30.6    98.1
## 2 PD      10961     66         66.8    9.8    32.3    93.6

The age distribution is broadly similar between PD patients and controls, with both groups concentrated in late middle age and consistent with PPMI’s recruitment of de novo PD cases. The median age in the PD group is slightly higher than in controls, which is biologically plausible given that PD risk rises sharply after age 60. This near-overlap is reassuring for the modeling step where age is not strongly separating the two cohorts. So, any class differences detected by the model are unlikely to be a pure age effect. Nonetheless, age is retained as a covariate in the final feature set so that the protein signal is interpreted conditional on age rather than confounded by it.

Sex distribution

glimpse(demographics)
## Rows: 8,351
## Columns: 29
## $ REC_ID       <chr> "IA86904", "IA86905", "IA86906", "IA86907", "IA86908", "2…
## $ PATNO        <dbl> 3000, 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 300…
## $ EVENT_ID     <chr> "TRANS", "TRANS", "TRANS", "TRANS", "TRANS", "TRANS", "TR…
## $ PAG_NAME     <chr> "SCREEN", "SCREEN", "SCREEN", "SCREEN", "SCREEN", "SCREEN…
## $ INFODT       <chr> "01/2011", "02/2011", "03/2011", "03/2011", "03/2011", "0…
## $ AFICBERB     <dbl> 0, 0, 0, 0, 0, NA, NA, NA, 0, 0, 0, NA, NA, 0, NA, NA, 0,…
## $ ASHKJEW      <dbl> 0, 0, 0, 0, 0, NA, NA, NA, 0, 0, 0, NA, NA, 0, NA, NA, 0,…
## $ BASQUE       <dbl> 0, 0, 0, 0, 0, NA, NA, NA, 0, 0, 0, NA, NA, 0, NA, NA, 0,…
## $ BIRTHDT      <chr> "12/1941", "01/1946", "08/1943", "07/1954", "11/1951", "1…
## $ SEX          <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ CHLDBEAR     <dbl> 0, NA, 0, 0, NA, 0, 0, NA, 0, 0, NA, NA, NA, 0, NA, NA, N…
## $ HOWLIVE      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ GAYLES       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ HETERO       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ BISEXUAL     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ PANSEXUAL    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ASEXUAL      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ OTHSEXUALITY <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ HANDED       <dbl> 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, 1, …
## $ HISPLAT      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ RAASIAN      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ RABLACK      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ RAHAWOPI     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ RAINDALS     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ RANOS        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ RAWHITE      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, …
## $ RAUNKNOWN    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ORIG_ENTRY   <chr> "01/2011", "02/2011", "03/2011", "03/2011", "03/2011", "0…
## $ LAST_UPDATE  <dttm> 2022-11-07 00:00:00, 2022-11-07 00:00:00, 2022-11-07 00:…
demo_cohort <- demographics %>%
  inner_join(raw_data %>% distinct(PATNO, COHORT), by = "PATNO") %>%
  distinct(PATNO, .keep_all = TRUE) %>%
  filter(COHORT %in% c("PD", "Control")) %>%
  mutate(SEX = factor(SEX, levels = c(0, 1),
                      labels = c("Female", "Male")))
demo_cohort %>%
  count(COHORT, SEX) %>%
  ggplot(aes(x = COHORT, y = n, fill = SEX)) +
  geom_col(position = "dodge", width = 0.6) +
  geom_text(aes(label = n), position = position_dodge(0.6),
            vjust = -0.5, size = 3.5, fontface = "bold") +
  scale_fill_manual(values = c("Male"   = "#4472C4",
                               "Female" = "#ED7D31")) +
  labs(title = "Sex Distribution by Cohort",
       x = "", y = "Count", fill = "") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13),
        legend.position = "bottom")

demo_cohort %>%
  count(COHORT, SEX) %>%
  group_by(COHORT) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  print()
## # A tibble: 4 × 4
## # Groups:   COHORT [2]
##   COHORT  SEX        n   pct
##   <chr>   <fct>  <int> <dbl>
## 1 Control Female    69  37.1
## 2 Control Male     117  62.9
## 3 PD      Female   244  39.5
## 4 PD      Male     373  60.5

According to the Sex distribution graph, males outnumber females in both cohorts. But the imbalance is more pronounced in the PD group than in controls. This pattern mirrors the well-documented epidemiology of PD, where males have roughly a 1.5–2× higher lifetime incidence than females. From a modeling perspective, this means sex is correlated with disease status, which is why it appears as a confounder in the DAG. In the final recipe, sex is removed before fitting so that the protein signature is evaluated independently of sex; a sensitivity analysis adjusting for sex could be added in future work.

Genetic mutation carriers

mutation_cols <- c("LRRK2", "GBA", "SNCA", "PRKN", "PINK1")  # VPS35/PARK7 all 0
gen_cohort <- genetics %>%
  inner_join(raw_data %>% distinct(PATNO, COHORT), by = "PATNO") %>%
  distinct(PATNO, .keep_all = TRUE) %>%
  filter(COHORT %in% c("PD", "Control")) %>%
  mutate(across(all_of(mutation_cols), as.character))
gen_cohort %>%
  pivot_longer(cols = all_of(mutation_cols),
               names_to  = "gene",
               values_to = "status") %>%
  mutate(carrier = status != "0" & !is.na(status)) %>%
  group_by(COHORT, gene) %>%
  summarise(n_carriers = sum(carrier, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = reorder(gene, -n_carriers), y = n_carriers, fill = COHORT)) +
  geom_col(position = "dodge", width = 0.6) +
  geom_text(aes(label = n_carriers), position = position_dodge(0.6),
            vjust = -0.5, size = 3.5, fontface = "bold") +
  scale_fill_manual(values = c("Control" = "#4472C4", "PD" = "#C00000")) +
  labs(title = "Number of Mutation Carriers by Gene and Cohort",
       x = "Gene", y = "Number of Carriers", fill = "") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13),
        legend.position = "bottom")

The number of confirmed genetic mutation carriers is small in absolute terms but informative. LRRK2 and GBA carriers are the most represented, consistent with their relatively high prevalence among PD cases worldwide. The proportion of mutation carriers is higher in the PD group than in controls for every gene examined, which is the expected direction. Because these mutations affect both PD risk and proteomic background, they are retained as binary covariates (LRRK2_carrier, GBA_carrier, SNCA_carrier, PRKN_carrier, APOE_e4) in the modeling step rather than dropped or used to stratify the analysis.

Summary of exploratory data analysis

Overall, the EDA shows three things. First, the cohort is imbalanced toward PD and toward males, which must be addressed in modeling. Second, age, sex, and genetic mutation status differ between cohorts in expected ways and so are retained as covariates. Third, the disease signal, if it exists, is not detectable in the pooled marginal distribution of all proteins and must be sought in a much smaller, biologically focused subset. The next sections describe how that subset is selected without leaking information from the test set, and how three classification models are trained, tuned, and compared.

Statistical analysis

The modeling section is restricted to PD and Control patients, with PPMI’s prodromal and SWEDD cohorts excluded so that the classification task is well defined.

Building the modelling dataset

The first preprocessing step is to assemble per-patient clinical features. Age is taken at the baseline (BL) or screening (SC) visit so that every patient contributes a single, time-aligned age value. Genetic data are reduced to five binary indicator variables: LRRK2, GBA, SNCA, and PRKN are coded as carrier vs. non-carrier, and APOE is coded as the presence vs. absence of an e4 allele (APOE-e4 status is a well-established risk factor for cognitive decline in PD as well as in Alzheimer’s disease, and is therefore retained even though it is not strictly a PD-causing mutation).

age_baseline <- age_data %>%
  filter(EVENT_ID %in% c("BL", "SC")) %>%
  arrange(PATNO, EVENT_ID) %>%
  distinct(PATNO, .keep_all = TRUE) %>%
  select(PATNO, AGE_AT_VISIT)
genetics_features <- genetics %>%
  mutate(
    LRRK2_carrier = as.integer(LRRK2 != "0" & !is.na(LRRK2)),
    GBA_carrier   = as.integer(GBA   != "0" & !is.na(GBA)),
    SNCA_carrier  = as.integer(as.character(SNCA) != "0" & !is.na(SNCA)),
    PRKN_carrier  = as.integer(as.character(PRKN) != "0" & !is.na(PRKN)),
    APOE_e4       = as.integer(grepl("E4", APOE, ignore.case = TRUE))
  ) %>%
  select(PATNO, LRRK2_carrier, GBA_carrier,
         SNCA_carrier, PRKN_carrier, APOE_e4)
cat("Patients with baseline age:", nrow(age_baseline), "\n")
## Patients with baseline age: 8118
cat("Patients with genetics:    ", nrow(genetics_features), "\n")
## Patients with genetics:     6265

Next, the outcome is recoded into a factor with two levels (Control and PD).

data_binary <- raw_data %>%
  filter(COHORT %in% c("PD", "Control")) %>%
  mutate(outcome = factor(ifelse(COHORT == "PD", 1, 0),
                          levels = c(0, 1),
                          labels = c("Control", "PD")))
data_binary %>% count(outcome)
## # A tibble: 2 × 2
##   outcome       n
##   <fct>     <int>
## 1 Control  890010
## 2 PD      2952345

The long-format data are then pivoted to a wide patient-by-protein matrix and merged with the baseline age and genetic features. Each row now corresponds to a single patient, with the outcome, sex, baseline age, five genetic indicators, and one column per SomaScan protein.

wide_data <- data_binary %>%
  select(PATNO, SEX, outcome, TESTNAME, TESTVALUE) %>%
  pivot_wider(
    names_from  = TESTNAME,
    values_from = TESTVALUE
  ) %>%
  left_join(age_baseline,      by = "PATNO") %>%
  left_join(genetics_features, by = "PATNO")
dim(wide_data)
## [1]  803 4794
glimpse(wide_data %>% select(PATNO, SEX, outcome,
                              AGE_AT_VISIT, LRRK2_carrier,
                              GBA_carrier, SNCA_carrier,
                              PRKN_carrier, APOE_e4))
## Rows: 803
## Columns: 9
## $ PATNO         <dbl> 53595, 3029, 3963, 3316, 3124, 3175, 4103, 3467, 3168, 5…
## $ SEX           <chr> "Female", "Female", "Female", "Male", "Male", "Female", …
## $ outcome       <fct> PD, Control, PD, Control, PD, PD, PD, PD, PD, PD, PD, PD…
## $ AGE_AT_VISIT  <dbl> 65.2, 66.3, 58.4, 74.7, 57.2, 57.2, 59.0, 66.9, 63.1, 67…
## $ LRRK2_carrier <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GBA_carrier   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ SNCA_carrier  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ PRKN_carrier  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ APOE_e4       <int> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
cat("Missing AGE_AT_VISIT:  ", sum(is.na(wide_data$AGE_AT_VISIT)),  "\n")
## Missing AGE_AT_VISIT:   0
cat("Missing LRRK2_carrier: ", sum(is.na(wide_data$LRRK2_carrier)), "\n")
## Missing LRRK2_carrier:  0
cat("LRRK2 carriers:        ", sum(wide_data$LRRK2_carrier == 1, na.rm = TRUE), "\n")
## LRRK2 carriers:         156
cat("GBA carriers:          ", sum(wide_data$GBA_carrier   == 1, na.rm = TRUE), "\n")
## GBA carriers:           77
cat("APOE E4 carriers:      ", sum(wide_data$APOE_e4       == 1, na.rm = TRUE), "\n")
## APOE E4 carriers:       197

Variance-based pre-filter

With several thousand candidate proteins and only a few hundred patients, the feature space is much wider than the sample size. To make modeling tractable and to focus on proteins with meaningful biological variation, an unsupervised variance filter is applied. The variance of each protein is computed across patients, and only the top 500 most variable proteins are kept. This step is unsupervised because it does not look at the outcome label, so it does not leak class information into the training set. The cutoff at 500 was chosen by inspecting the variance distribution. Beyond the top several hundred proteins, the variance flattens into a long tail dominated by near-constant features that carry little information.

clinical_vars <- c("AGE_AT_VISIT", "LRRK2_carrier", "GBA_carrier",
                   "SNCA_carrier", "PRKN_carrier", "APOE_e4")

protein_cols <- wide_data %>%
  select(-PATNO, -SEX, -outcome, -any_of(clinical_vars))

protein_vars <- protein_cols %>%
  summarise(across(everything(), var, na.rm = TRUE)) %>%
  pivot_longer(everything(),
               names_to  = "protein",
               values_to = "variance") %>%
  arrange(desc(variance))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(everything(), var, na.rm = TRUE)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))
summary(protein_vars$variance)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.003911  0.025173  0.052631  0.122038  0.120528 12.827399
ggplot(protein_vars, aes(x = variance)) +
  geom_histogram(bins = 100, fill = "steelblue", color = "white") +
  geom_vline(xintercept = protein_vars$variance[500],
             color = "red", linetype = "dashed", linewidth = 1) +
  annotate("text",
           x     = protein_vars$variance[500],
           y     = Inf,
           label = "Top 500 cutoff",
           vjust = 2, hjust = -0.1, color = "red") +
  coord_cartesian(xlim = c(0, 1)) +
  labs(title = "Protein Variance Distribution",
       x     = "Variance",
       y     = "Count") +
  theme_minimal()

top500_proteins <- protein_vars %>%
  slice_head(n = 500) %>%
  pull(protein)

data_filtered <- wide_data %>%
  select(PATNO, SEX, outcome,
         all_of(clinical_vars),
         all_of(top500_proteins))

dim(data_filtered)
## [1] 803 509
cat("Clinical vars present:",
    sum(clinical_vars %in% names(data_filtered)),
    "of", length(clinical_vars), "\n")
## Clinical vars present: 6 of 6

Train/test split

The data are split into an 80/20 stratified train/test split before any supervised feature selection. This ordering is essential because if a t-test were run on the full dataset to choose discriminating proteins and the split were performed afterwards, information from the test patients would already have contaminated the chosen feature set, and test-set performance would be optimistically biased. The stratification on the outcome variable preserves the PD-to-Control ratio in both sets.

set.seed(42)
data_split_pre <- initial_split(
  data_filtered %>% select(-PATNO),
  prop   = 0.80,
  strata = outcome
)
train_pre <- training(data_split_pre)
test_pre  <- testing(data_split_pre)

cat("Pre-split training set:\n")
## Pre-split training set:
train_pre %>% count(outcome) %>% print()
## # A tibble: 2 × 2
##   outcome     n
##   <fct>   <int>
## 1 Control   148
## 2 PD        493
cat("\nPre-split test set:\n")
## 
## Pre-split test set:
test_pre %>% count(outcome) %>% print()
## # A tibble: 2 × 2
##   outcome     n
##   <fct>   <int>
## 1 Control    38
## 2 PD        124

Supervised feature selection: t-test on training set only

For each of the top 500 variable proteins, a Welch two-sample t-test is run on the training set comparing log2 expression in PD vs. Control. Raw p-values are then adjusted for multiple testing using the Benjamini–Hochberg (BH) procedure to control the false discovery rate at 5%. Proteins with BH-adjusted p < 0.05 are carried forward as candidate features for the classification models. Restricting the t-test to the training data ensures that the test set remains genuinely held out for performance evaluation.

ttest_results <- train_pre %>%
  select(outcome, all_of(top500_proteins)) %>%
  pivot_longer(
    cols      = all_of(top500_proteins),
    names_to  = "protein",
    values_to = "value"
  ) %>%
  group_by(protein) %>%
  summarise(
    p_value   = t.test(value ~ outcome)$p.value,
    mean_PD   = mean(value[outcome == "PD"],      na.rm = TRUE),
    mean_ctrl = mean(value[outcome == "Control"],  na.rm = TRUE),
    log2FC    = mean_PD - mean_ctrl
  ) %>%
  mutate(
    p_adj = p.adjust(p_value, method = "BH"),
    sig   = p_adj < 0.05
  ) %>%
  arrange(p_adj)

ttest_results %>% count(sig)
## # A tibble: 2 × 2
##   sig       n
##   <lgl> <int>
## 1 FALSE   454
## 2 TRUE     46
head(ttest_results, 20)
## # A tibble: 20 × 7
##    protein           p_value mean_PD mean_ctrl log2FC      p_adj sig  
##    <chr>               <dbl>   <dbl>     <dbl>  <dbl>      <dbl> <lgl>
##  1 3504-58_2   0.00000000789    9.33      8.95  0.375 0.00000395 TRUE 
##  2 7757-5_3    0.0000000423     9.41      9.11  0.300 0.0000106  TRUE 
##  3 5080-131_3  0.00000437       9.68      9.48  0.203 0.000728   TRUE 
##  4 6521-35_3   0.00000716      11.7      11.9  -0.228 0.000895   TRUE 
##  5 14125-5_3   0.0000162        8.99      8.77  0.219 0.00162    TRUE 
##  6 13722-105_3 0.0000431       10.4      10.2   0.202 0.00214    TRUE 
##  7 2292-17_4   0.0000401        9.66      9.45  0.205 0.00214    TRUE 
##  8 2743-5_2    0.0000267       12.9      12.7   0.288 0.00214    TRUE 
##  9 2900-53_3   0.0000418        9.47      9.24  0.234 0.00214    TRUE 
## 10 4141-79_1   0.0000471       10.6      10.3   0.265 0.00214    TRUE 
## 11 8240-207_3  0.0000430       11.6      11.4   0.221 0.00214    TRUE 
## 12 6520-87_3   0.0000540       13.4      13.2   0.228 0.00225    TRUE 
## 13 3535-84_1   0.0000771        8.55      8.34  0.204 0.00297    TRUE 
## 14 13416-8_3   0.0000942       12.1      12.3  -0.233 0.00314    TRUE 
## 15 3060-43_2   0.0000911       11.2      11.0   0.187 0.00314    TRUE 
## 16 9900-36_3   0.000206         9.14      8.92  0.221 0.00642    TRUE 
## 17 3415-61_2   0.000238         8.70      8.23  0.469 0.00662    TRUE 
## 18 9796-4_3    0.000226        15.2      15.0   0.215 0.00662    TRUE 
## 19 8890-9_3    0.000352        11.7      11.9  -0.197 0.00926    TRUE 
## 20 9986-14_3   0.000414         9.36      9.16  0.206 0.0104     TRUE

The volcano plot visualizes the results of this screening step. The x-axis is the log2 fold change between PD and Control, and the y-axis is the negative log10 of the BH-adjusted p-value. Points in the upper region of the plot represent proteins that are differentially abundant between PD and Control after multiple-testing correction. We can see that a meaningful number of proteins clear the BH < 0.05 threshold, including several with substantial fold changes, which suggests that there is a real disease signal in the CSF proteome that the models can exploit.

ggplot(ttest_results, aes(x = log2FC, y = -log10(p_adj), color = sig)) +
  geom_point(alpha = 0.6, size = 1.2) +
  scale_color_manual(values = c("FALSE" = "grey60", "TRUE" = "#C00000"),
                     labels = c("Not significant", "Significant (BH < 0.05)")) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "black") +
  labs(title = "Volcano Plot: PD vs Control (t-test, training set)",
       x = "Log2 Fold Change (PD - Control)",
       y = "-log10(Adjusted p-value)",
       color = "") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13),
        legend.position = "bottom")

Final modeling dataset

The significant proteins from the volcano analysis, together with age and the five genetic indicator variables, form the final feature set. The training and test rows are recombined into a single data frame and the original split is reconstructed.

sig_proteins <- ttest_results %>%
  filter(sig == TRUE) %>%
  pull(protein)
cat("Significant proteins:", length(sig_proteins), "\n")
## Significant proteins: 46
train_data <- train_pre %>%
  select(SEX, outcome, all_of(clinical_vars), all_of(sig_proteins))
test_data  <- test_pre  %>%
  select(SEX, outcome, all_of(clinical_vars), all_of(sig_proteins))

data_model <- bind_rows(train_data, test_data)
dim(data_model)
## [1] 803  54
data_model %>% count(outcome)
## # A tibble: 2 × 2
##   outcome     n
##   <fct>   <int>
## 1 Control   186
## 2 PD        617
data_model %>%
  select(all_of(clinical_vars)) %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(),
               names_to  = "variable",
               values_to = "n_missing") %>%
  mutate(pct_missing = round(100 * n_missing / nrow(data_model), 1)) %>%
  print()
## # A tibble: 6 × 3
##   variable      n_missing pct_missing
##   <chr>             <int>       <dbl>
## 1 AGE_AT_VISIT          0           0
## 2 LRRK2_carrier         0           0
## 3 GBA_carrier           0           0
## 4 SNCA_carrier          0           0
## 5 PRKN_carrier          0           0
## 6 APOE_e4               0           0
data_split <- make_splits(
  list(analysis   = seq_len(nrow(train_data)),
       assessment = seq(nrow(train_data) + 1, nrow(data_model))),
  data = data_model
)

saveRDS(data_model, file.path(data_dir, "data_model.rds"))

Cross-validation folds

Ten-fold stratified cross-validation is set up on the training data. Stratification preserves the PD-to-Control ratio in each fold so that no fold has so few minority-class samples that it cannot estimate sensitivity reliably. The same fold structure is used for tuning LASSO and the random forest as well as for cross-validating the baseline logistic regression.

set.seed(42)
cv_folds <- vfold_cv(
  train_data,
  v      = 10,
  strata = outcome
)

Preprocessing recipe and model specifications

A single shared preprocessing recipe is used for all three models so that performance differences are attributable to the modeling approach rather than to preprocessing choices. The recipe (i) removes sex from the feature set so that the protein signature is evaluated on its own; (ii) median-imputes any remaining missing values in the clinical variables and in the proteins; (iii) normalizes all predictors to unit variance, which is required for LASSO and helps interpretability for logistic regression and random forest; and (iv) downsamples the majority class so that PD and Control are balanced inside the training fold.

Three model specifications are defined:

  • Logistic regression — the baseline. It fits a single coefficient per feature with no penalty, so it serves as a reference.
  • LASSO logistic regression — adds an L1 penalty whose magnitude is tuned. The L1 penalty shrinks irrelevant coefficients exactly to zero and is well suited to high-dimensional, correlated predictors like CSF proteomics.
  • Random forest — a non-linear ensemble that captures interactions between proteins. Permutation importance is used so that the importance ranking is robust to correlated features.
model_recipe <- recipe(outcome ~ ., data = train_data) %>%
  step_rm(SEX) %>%
  step_impute_median(all_of(clinical_vars)) %>%
  step_impute_median(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_downsample(outcome, under_ratio = 1)   # fix class imbalance

summary(model_recipe)
## # A tibble: 54 × 4
##    variable      type      role      source  
##    <chr>         <list>    <chr>     <chr>   
##  1 SEX           <chr [3]> predictor original
##  2 AGE_AT_VISIT  <chr [2]> predictor original
##  3 LRRK2_carrier <chr [2]> predictor original
##  4 GBA_carrier   <chr [2]> predictor original
##  5 SNCA_carrier  <chr [2]> predictor original
##  6 PRKN_carrier  <chr [2]> predictor original
##  7 APOE_e4       <chr [2]> predictor original
##  8 3504-58_2     <chr [2]> predictor original
##  9 7757-5_3      <chr [2]> predictor original
## 10 5080-131_3    <chr [2]> predictor original
## # ℹ 44 more rows
log_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
log_workflow <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(log_spec)

lasso_spec <- logistic_reg(
  penalty = tune(),
  mixture = 1
) %>%
  set_engine("glmnet") %>%
  set_mode("classification")
lasso_workflow <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(lasso_spec)

rf_spec <- rand_forest(
  mtry  = tune(),
  trees = 500,
  min_n = tune()
) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("classification")
rf_workflow <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(rf_spec)

log_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
## 
## • step_rm()
## • step_impute_median()
## • step_impute_median()
## • step_normalize()
## • step_downsample()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
lasso_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
## 
## • step_rm()
## • step_impute_median()
## • step_impute_median()
## • step_normalize()
## • step_downsample()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet
rf_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
## 
## • step_rm()
## • step_impute_median()
## • step_impute_median()
## • step_normalize()
## • step_downsample()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 500
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   importance = permutation
## 
## Computational engine: ranger

Hyperparameter tuning and cross-validated comparison

To tune for LASSO, 50 different penalty values were tried, from very small to very large. For the random forest, five different values are tried for the number of proteins it randomly considers at each split, and the smallest node size before it stops splitting. Logistic regression doesn’t have any parameters to tune.

For each model, I pick the settings that give the highest cross-validated ROC AUC, then compare the three models on ROC AUC, accuracy, sensitivity, and specificity.

#LASSO 
lasso_grid <- grid_regular(
  penalty(range = c(-4, 0)),
  levels = 50
)
set.seed(42)
lasso_tuned <- tune_grid(
  lasso_workflow,
  resamples = cv_folds,
  grid      = lasso_grid,
  metrics   = metric_set(roc_auc, accuracy, sens, spec),
  control   = control_grid(save_pred = TRUE)
)
## Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
## 1.2.0.
## ℹ See details at
##   <https://tidyselect.r-lib.org/reference/faq-selection-context.html>
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
lasso_best <- select_best(lasso_tuned, metric = "roc_auc")
cat("Best LASSO penalty:", lasso_best$penalty, "\n")
## Best LASSO penalty: 0.009102982
#Random Forest
rf_grid <- grid_regular(
  mtry(range  = c(2, 20)),
  min_n(range = c(2, 20)),
  levels = 5
)
set.seed(42)
rf_tuned <- tune_grid(
  rf_workflow,
  resamples = cv_folds,
  grid      = rf_grid,
  metrics   = metric_set(roc_auc, accuracy, sens, spec),
  control   = control_grid(save_pred = TRUE)
)
rf_best <- select_best(rf_tuned, metric = "roc_auc")
cat("Best RF mtry:", rf_best$mtry, "min_n:", rf_best$min_n, "\n")
## Best RF mtry: 11 min_n: 2
#Logistic Regression
set.seed(42)
log_cv_results <- fit_resamples(
  log_workflow,
  resamples = cv_folds,
  metrics   = metric_set(roc_auc, accuracy, sens, spec),
  control   = control_resamples(save_pred = TRUE)
)
## → A | warning: prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## There were issues with some computations   A: x2There were issues with some computations   A: x4
autoplot(lasso_tuned, metric = "roc_auc") +
  labs(title = "LASSO Tuning: ROC AUC vs Penalty") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

autoplot(rf_tuned, metric = "roc_auc") +
  labs(title = "Random Forest Tuning: ROC AUC vs Hyperparameters") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

log_metrics   <- collect_metrics(log_cv_results) %>% mutate(model = "Logistic")
lasso_metrics <- collect_metrics(lasso_tuned) %>%
  filter(penalty == lasso_best$penalty) %>%
  mutate(model = "LASSO")
rf_metrics    <- collect_metrics(rf_tuned) %>%
  filter(mtry == rf_best$mtry, min_n == rf_best$min_n) %>%
  mutate(model = "Random Forest")

cv_comparison <- bind_rows(log_metrics, lasso_metrics, rf_metrics) %>%
  select(model, .metric, mean, std_err) %>%
  filter(.metric %in% c("roc_auc", "accuracy", "sens", "spec")) %>%
  arrange(.metric, desc(mean))
print(cv_comparison, n = 50)
## # A tibble: 12 × 4
##    model         .metric   mean std_err
##    <chr>         <chr>    <dbl>   <dbl>
##  1 Random Forest accuracy 0.705  0.0165
##  2 Logistic      accuracy 0.694  0.0191
##  3 LASSO         accuracy 0.671  0.0219
##  4 LASSO         roc_auc  0.810  0.0221
##  5 Random Forest roc_auc  0.795  0.0195
##  6 Logistic      roc_auc  0.777  0.0162
##  7 LASSO         sens     0.811  0.0386
##  8 Random Forest sens     0.736  0.0504
##  9 Logistic      sens     0.729  0.0326
## 10 Random Forest spec     0.696  0.0190
## 11 Logistic      spec     0.683  0.0252
## 12 LASSO         spec     0.629  0.0245
cv_comparison %>%
  mutate(.metric = recode(.metric,
    "roc_auc"  = "ROC AUC",
    "accuracy" = "Accuracy",
    "sens"     = "Sensitivity",
    "spec"     = "Specificity"
  )) %>%
  ggplot(aes(x = model, y = mean, fill = model)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err),
                width = 0.2) +
  geom_text(aes(label = round(mean, 3)), vjust = -0.5, size = 3.5) +
  facet_wrap(~ .metric, scales = "free_y") +
  scale_fill_manual(values = c("Logistic"      = "#4472C4",
                               "LASSO"         = "#ED7D31",
                               "Random Forest" = "#70AD47")) +
  labs(title = "Cross-Validation Performance by Model",
       x = "", y = "Mean Score") +
  theme_minimal() +
  theme(plot.title  = element_text(face = "bold", size = 13),
        axis.text.x = element_text(angle = 20, hjust = 1))

The two tuning curves show how performance varies with the penalty (LASSO) and with mtry/min_n (random forest). The LASSO curve is informative: ROC AUC is roughly flat over a wide range of small penalties and then drops sharply once the penalty becomes large enough to zero out the informative coefficients. The random forest is, as expected, less sensitive to its hyperparameters in this regime. The bar plot at the end of the chunk compares the three models on their best cross-validated settings and serves as the model-selection figure.

Final evaluation on the held-out test set

The selected hyperparameters are now plugged back into the workflows and each model is refitted on the full training set and evaluated. Four metrics, ROC AUC, accuracy, sensitivity, and specificity, are computed on the test data, and ROC curves, confusion matrices, and variable importance plots are generated for interpretation.

final_log_workflow   <- log_workflow
final_lasso_workflow <- lasso_workflow %>% finalize_workflow(lasso_best)
final_rf_workflow    <- rf_workflow    %>% finalize_workflow(rf_best)

set.seed(42)
log_final   <- last_fit(final_log_workflow,   data_split,
                        metrics = metric_set(roc_auc, accuracy, sens, spec))
## → A | warning: prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## There were issues with some computations   A: x2There were issues with some computations   A: x2
lasso_final <- last_fit(final_lasso_workflow, data_split,
                        metrics = metric_set(roc_auc, accuracy, sens, spec))
rf_final    <- last_fit(final_rf_workflow,    data_split,
                        metrics = metric_set(roc_auc, accuracy, sens, spec))

test_metrics <- bind_rows(
  collect_metrics(log_final)   %>% mutate(model = "Logistic"),
  collect_metrics(lasso_final) %>% mutate(model = "LASSO"),
  collect_metrics(rf_final)    %>% mutate(model = "Random Forest")
) %>%
  select(model, .metric, .estimate) %>%
  arrange(.metric, desc(.estimate))
print(test_metrics, n = 50)
## # A tibble: 12 × 3
##    model         .metric  .estimate
##    <chr>         <chr>        <dbl>
##  1 Logistic      accuracy     0.722
##  2 Random Forest accuracy     0.716
##  3 LASSO         accuracy     0.704
##  4 Logistic      roc_auc      0.812
##  5 LASSO         roc_auc      0.808
##  6 Random Forest roc_auc      0.778
##  7 LASSO         sens         0.842
##  8 Logistic      sens         0.789
##  9 Random Forest sens         0.711
## 10 Random Forest spec         0.718
## 11 Logistic      spec         0.702
## 12 LASSO         spec         0.661
test_metrics %>%
  mutate(.metric = recode(.metric,
    "roc_auc"  = "ROC AUC",
    "accuracy" = "Accuracy",
    "sens"     = "Sensitivity",
    "spec"     = "Specificity"
  )) %>%
  ggplot(aes(x = model, y = .estimate, fill = model)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = round(.estimate, 3)), vjust = -0.5, size = 3.5) +
  facet_wrap(~ .metric, scales = "free_y") +
  scale_fill_manual(values = c("Logistic"      = "#4472C4",
                               "LASSO"         = "#ED7D31",
                               "Random Forest" = "#70AD47")) +
  labs(title = "Test Set Performance by Model", x = "", y = "Score") +
  theme_minimal() +
  theme(plot.title  = element_text(face = "bold", size = 13),
        axis.text.x = element_text(angle = 20, hjust = 1))

cat("=== Logistic Regression ===\n")
## === Logistic Regression ===
collect_predictions(log_final) %>%
  conf_mat(truth = outcome, estimate = .pred_class) %>%
  print()
##           Truth
## Prediction Control PD
##    Control      30 37
##    PD            8 87
cat("\n=== LASSO ===\n")
## 
## === LASSO ===
collect_predictions(lasso_final) %>%
  conf_mat(truth = outcome, estimate = .pred_class) %>%
  print()
##           Truth
## Prediction Control PD
##    Control      32 42
##    PD            6 82
cat("\n=== Random Forest ===\n")
## 
## === Random Forest ===
collect_predictions(rf_final) %>%
  conf_mat(truth = outcome, estimate = .pred_class) %>%
  print()
##           Truth
## Prediction Control PD
##    Control      27 35
##    PD           11 89
log_roc   <- collect_predictions(log_final)   %>% mutate(model = "Logistic")
lasso_roc <- collect_predictions(lasso_final) %>% mutate(model = "LASSO")
rf_roc    <- collect_predictions(rf_final)    %>% mutate(model = "Random Forest")

bind_rows(log_roc, lasso_roc, rf_roc) %>%
  group_by(model) %>%
  roc_curve(truth = outcome, .pred_Control) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) +
  geom_line(linewidth = 1.2) +
  geom_abline(linetype = "dashed", color = "grey50") +
  scale_color_manual(values = c("Logistic"      = "#4472C4",
                                "LASSO"         = "#ED7D31",
                                "Random Forest" = "#70AD47")) +
  labs(title    = "ROC Curves on Test Set",
       x        = "1 - Specificity (False Positive Rate)",
       y        = "Sensitivity (True Positive Rate)",
       color    = "Model") +
  theme_minimal() +
  theme(plot.title      = element_text(face = "bold", size = 13),
        legend.position = "bottom")

lasso_final %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(term != "(Intercept)", estimate != 0) %>%
  arrange(desc(abs(estimate))) %>%
  slice_head(n = 20) %>%
  left_join(protein_lookup, by = c("term" = "SOMA_SEQ_ID")) %>%
  mutate(protein_label = ifelse(is.na(TARGET_GENE_SYMBOL), term, TARGET_GENE_SYMBOL)) %>%
  ggplot(aes(x = reorder(protein_label, abs(estimate)), y = estimate,
             fill = estimate > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "#C00000", "FALSE" = "#4472C4")) +
  labs(title = "LASSO: Top 20 Coefficients",
       x = "", y = "Coefficient") +
  theme_minimal() +
  theme(plot.title  = element_text(face = "bold", size = 13),
        axis.text.y = element_text(size = 8))

rf_imp <- rf_final %>%
  extract_fit_parsnip() %>%
  vi() %>%
  slice_head(n = 20) %>%
  left_join(protein_lookup, by = c("Variable" = "SOMA_SEQ_ID")) %>%
  mutate(protein_label = ifelse(is.na(TARGET_GENE_SYMBOL), Variable, TARGET_GENE_SYMBOL))

ggplot(rf_imp, aes(x = reorder(protein_label, Importance), y = Importance)) +
  geom_col(fill = "#70AD47", show.legend = FALSE) +
  coord_flip() +
  labs(title = "Random Forest: Top 20 Variable Importance",
       x = "", y = "Permutation Importance") +
  theme_minimal() +
  theme(plot.title  = element_text(face = "bold", size = 13),
        axis.text.y = element_text(size = 8))

Results

Cross-validated model comparison

All three models achieve very similar cross-validated performance, with ROC AUCs clustered between 0.78 and 0.81. LASSO has the highest cross-validated ROC AUC at 0.810, followed by the random forest at 0.795 and logistic regression at 0.777. On accuracy the ordering reverses. The random forest is on top at 0.705, logistic regression at 0.694, and LASSO at 0.671.This is showing that LASSO is trading raw accuracy for a different operating point. That trade-off is explicit in the sensitivity and specificity panels: LASSO has by far the highest sensitivity but the lowest specificity. The random forest is the most balanced of the three, and logistic regression sits in between. The error bars overlap substantially across models on accuracy and ROC AUC.

Held-out Test set performance

The pattern carries over to the held-out test set, and importantly performance does not collapse from CV to test, which suggests none of the models is badly overfit. Logistic regression actually edges out on test ROC AUC at 0.812, with LASSO essentially tied at 0.808 and the random forest at 0.778. Test accuracy is again tight across models (Logistic 0.722, RF 0.716, LASSO 0.704). The most consistent story across both CV and test is the sensitivity–specificity trade-off: LASSO has the highest test sensitivity at 0.842 but the lowest specificity at 0.661, while the random forest sits at the opposite end (sensitivity 0.711, specificity 0.716). Logistic regression is balanced in the middle (sensitivity 0.789, specificity 0.702).

The question about which model is “best” therefore depends on the use case. If the goal is a screening tool that should rarely miss a true PD patient, LASSO is the more appropriate choice; if the goal is balanced classification with fewer false positives, the random forest or logistic regression is preferable.

Most predictive proteins

The largest single predictor in both models is LRRK2 carrier status. GBA carrier status and PRKN carrier status also appear in the LASSO top 20 with positive coefficients. LRRK2, GBA, and PRKN are the canonical Mendelian PD risk genes, and the models recovered them without being told that they are biologically privileged.

Beyond the genetic variables, both models converge on the same short list of CSF proteins, which is a useful internal-replication signal: NEFH, HAMP, CECR1, CEL, EPHA10, GAPDH, NPW, and LRFN2 appear in both the top-20 LASSO coefficients and the top-20 random forest importances. NEFH (neurofilament heavy chain) which is a well-established CSF marker of axonal injury that is elevated across a range of neurodegenerative diseases. HAMP (hepcidin) ties into iron dysregulation, which has been increasingly implicated in dopaminergic cell loss due to oxidateive stress and lewy body formation. GAPDH, classically a housekeeping protein, has known interactions with alpha-synuclein aggregation. The random forest additionally surfaces GPNMB (a microglial protein at a confirmed PD GWAS locus), HLA-DQA2 (consistent with the neuroinflammatory axis seen earlier), PREP (prolyl endopeptidase, linked to alpha-synuclein clearance), and GFRA2 (a GDNF-family receptor involved in dopaminergic neuron support).

According to the LASSO coefficients, the most positive ones are LRRK2/GBA/PRKN carrier status, NEFH, GAPDH, DKK1, CEL, HAMP). Negative coefficients: SFRP1, EPHA10, MGP, HAPLN1, TREML2. DKK1 and SFRP1 are both Wnt-pathway regulators, which is interesting because Wnt signaling is required for dopaminergic neuron maintenance and is increasingly viewed as dysregulated in PD. AGE_AT_VISIT carries a negative coefficient, which is somewhat counterintuitive on its own. It does not mean older age protects against PD, but rather that given the protein signature, additional age does not further increase the predicted probability of PD, likely because age is partly already encoded in the protein levels themselves.

Overall, the agreement between LASSO and the random forest on both the genetic carriers and on the cluster of NEFH / HAMP / GPNMB / HLA / Wnt-related proteins suggests that the predictive signal in this analysis is not driven by a single protein or by SomaScan artifacts, but by a coherent biological signature spanning Mendelian PD genes, axonal injury, neuroinflammation, and dopaminergic-pathway biology.

Summary of Findings

This project asked whether CSF proteomics alone can distinguish Parkinson’s disease patients from healthy controls and, if so, which proteins drive the signal. Using baseline SomaScan measurements from PPMI together with age and genetic carrier status as covariates, three classification models, logistic regression, LASSO, and a random forest, were compared inside a leakage-aware pipeline.

Since there were more protein targets than numpber of patients, the most significantly differentiating proteins between Control and PD was statistically selected. After an unsupervised variance filter and a BH-controlled t-test on the training fold only, all three models reached ROC AUCs around 0.78–0.81 on both cross-validation and the held-out test set. This level of accuracy is somewhat promising about our models. The three models offered different sensitivity–specificity trade-offs rather than one clearly dominating. The top predictors agreed substantially across LASSO and the random forest and were biologically coherent. They showed that the canonical Mendelian PD genes (LRRK2, GBA, PRKN) and a cluster of proteins tied to axonal injury, neuroinflammation, and Wnt signaling are among the most important predictors.

In short, CSF proteomics data trained in a classifier model can create a real, biologically interpretable signal for PD, and the proteins driving it map onto known disease biology rather than onto platform artifacts.

Limitations

Several limitations should temper the interpretation of these results. First, the PPMI cohort is consists of less than 1000 patients which is relatively large but not epidemiologically representative. Participants are overwhelmingly self-referred, of European ancestry, and recruited at academic movement-disorder centers, which limits how well the learned signature generalizes to broader populations. Second, the analysis uses CSF samples collected at a single baseline visit; whether this same signature tracks disease progression longitudinally, or whether it changes after the initiation of dopaminergic therapy, has not been examined here. Third, since there was more protein targets than the number of patients, the selected proteins for the machine learning models had to be reduced with statistical methods. This may have resulted in loss of some important predictors that can not survive the t-test. Fourth, the SomaScan platform itself has known biases. Protein measurements depend on aptamer affinity rather than on direct mass spectrometry and some of the top proteins identified may reflect platform-specific behavior that would not replicate on an orthogonal assay. Also, as seen on EDA, there was blood contamination in some samples, which challenges the data collection methods used in creating the PPMI dataset. Lastly, plate identifier was retained as a candidate technical confounder but was not explicitly modeled as a random effect.

Despite these limitations, the analysis suggests that CSF proteomics, paired with machine learning algorithms is promising way toward finding objective biomarkers for Parkinson’s disease. The proteins identified here are sensible candidates for follow-up in independent validation cohorts and in mechanistic studies of alpha-synuclein clearance, lysosomal function, and neuroinflammation.