Final Project: Draft 1

Question of interest

Parkinson’s disease (PD) is a progressive neurodegenerative disorder affecting millions of people worldwide, characterized by motor symptoms such as tremor, rigidity, and bradykinesia, as well as non-motor manifestations including cognitive decline and autonomic dysfunction. One of the major clinical challenges with Parkinson’s disease is the lack of reliable, objective set of biomarkers for early diagnosis. Ususally, patients are often diagnosed only after significant neurodegeneration has already occurred with physical tests or observing lesions in specific brain regions.

This project aims to investigate whether cerebrospinal fluid (CSF) proteomics data can be used to accurately predict Parkinson’s disease status. Specifically, Can protein expression profiles from CSF samples reliably distinguish Parkinson’s disease patients from healthy controls, and which proteins are most predictive of disease status?

Data

The data used in this project comes from the Parkinson’s Progression Markers Initiative (PPMI), a landmark longitudinal observational study sponsored by the Michael J. Fox Foundation. PPMI was designed to identify biomarkers of Parkinson’s disease onset and progression by recruiting both PD patients and healthy controls and collecting biological samples, imaging, and clinical assessments over time. https://www.ppmi-info.org/

Variables of interest

Outcome variable: Parkinson’s disease diagnosis (COHORT): binary, PD patient vs. control
Primary predictor variable: CSF protein expression levels (TESTVALUE), indexed by protein name (TESTNAME)
Possible confounders: SEX and PLATEID
Potential effect modifiers: SEX may act as an effect modifier if the relationship between specific protein expression levels and PD diagnosis differs between males and females

Directed acyclic graph (DAG)

library(ggdag)
library(ggplot2)

# Use short names inside nodes
dag <- dagify(
  Outcome ~ Predictor + SEX,
  Predictor ~ SEX + Batch,
  coords = list(
    x = c(SEX = 1, Batch = 1, Predictor = 2, Outcome = 3),
    y = c(SEX = 1, Batch = -1, Predictor = 0, Outcome = 0)
  )
)

ggdag(dag, text = FALSE) +
  geom_dag_point(color = "#4472C4", size = 22) +
  geom_dag_text(color = "white", size = 3.5, fontface = "bold") +
  theme_dag() +
  labs(
    title   = "DAG: CSF Proteomics and Parkinson's Disease Prediction",
    caption = paste(
      "Outcome   = PD Diagnosis (COHORT)",
      "Predictor = Protein Expression (TESTVALUE)",
      "SEX       = Sex — Biological Confounder",
      "Batch     = Assay Plate (PLATEID) — Technical Confounder",
      sep = "\n"
    )
  ) +
  theme(
    plot.title   = element_text(face = "bold", size = 13, hjust = 0.5),
    plot.caption = element_text(size = 9, color = "gray30", hjust = 0,
                                margin = margin(t = 12), family = "mono")
  )

### Statistical analysis plan

Because our outcome variable is binary, to answer the question of interest, three classification approaches will be built and compared: Logistic Regression, LASSO, and Random Forest. By comparing these models, this project will evaluate both predictive accuracy and the interpretability of selected biomarkers.

To control for potential confounding, sex will be included as a biological covariate across all three models, as protein expression levels are known to vary by sex and Parkinson’s disease prevalence differs between males and females. PLATEID will be examined and adjusted for as a technical confounder, since batch-to-batch variability across assay plates can introduce systematic differences in protein measurements unrelated to disease status.Protein expression values will also be normalized prior to modeling to further reduce technical variation. Model performance will be evaluated using AUC-ROC, accuracy and a confusion matrix.

Exploratory data analysis:

wanted_cols <- c(
  "PATNO", "SEX", "COHORT", "TESTNAME",
  "TESTVALUE", "PLATEID"
)
raw_data <- file_list %>%
  map_dfr(~read_csv(.x, show_col_types = FALSE)) %>%
  select(any_of(wanted_cols))

glimpse(raw_data)

## Rows: 5,541,030
## Columns: 6
## $ PATNO     <dbl> 53595, 53595, 53595, 53595, 53595, 53595, 53595, 53595, 5359…
## $ SEX       <chr> "Female", "Female", "Female", "Female", "Female", "Female", …
## $ COHORT    <chr> "PD", "PD", "PD", "PD", "PD", "PD", "PD", "PD", "PD", "PD", …
## $ TESTNAME  <chr> "5632-6_3", "5631-83_3", "5630-48_3", "5629-58_3", "5628-21_…
## $ TESTVALUE <dbl> 10.182013, 6.346285, 12.943830, 5.891402, 12.836350, 7.03480…
## $ PLATEID   <chr> "P0022899", "P0022899", "P0022899", "P0022899", "P0022899", …

head(raw_data, 20)

## # A tibble: 20 × 6
##    PATNO SEX    COHORT TESTNAME  TESTVALUE PLATEID 
##    <dbl> <chr>  <chr>  <chr>         <dbl> <chr>   
##  1 53595 Female PD     5632-6_3      10.2  P0022899
##  2 53595 Female PD     5631-83_3      6.35 P0022899
##  3 53595 Female PD     5630-48_3     12.9  P0022899
##  4 53595 Female PD     5629-58_3      5.89 P0022899
##  5 53595 Female PD     5628-21_3     12.8  P0022899
##  6 53595 Female PD     5627-53_3      7.03 P0022899
##  7 53595 Female PD     5626-20_3      5.89 P0022899
##  8 53595 Female PD     5624-66_3      7.66 P0022899
##  9 53595 Female PD     5623-11_3      6.13 P0022899
## 10 53595 Female PD     5621-64_3      5.95 P0022899
## 11 53595 Female PD     5620-13_3     10.1  P0022899
## 12 53595 Female PD     5618-50_3     13.9  P0022899
## 13 53595 Female PD     5617-41_3      8.84 P0022899
## 14 53595 Female PD     5615-62_3      6.25 P0022899
## 15 53595 Female PD     5614-44_3      6.73 P0022899
## 16 53595 Female PD     5613-75_3      5.87 P0022899
## 17 53595 Female PD     5612-16_3      6.20 P0022899
## 18 53595 Female PD     5611-56_3      6.02 P0022899
## 19 53595 Female PD     5610-32_3      6.38 P0022899
## 20 53595 Female PD     5609-92_3     14.1  P0022899

raw_data %>% count(COHORT)

## # A tibble: 3 × 2
##   COHORT          n
##   <chr>       <int>
## 1 Control    890010
## 2 PD        2952345
## 3 Prodromal 1698675

raw_data %>% summarise(
  n_patients  = n_distinct(PATNO),
  n_proteins  = n_distinct(TESTNAME),
  n_rows      = n()
)

## # A tibble: 1 × 3
##   n_patients n_proteins  n_rows
##        <int>      <int>   <int>
## 1       1158       4785 5541030

raw_data %>%
  count(PATNO, TESTNAME) %>%
  filter(n > 1) %>%
  nrow()

## [1] 0

For now I will only use PD and Control Patients

data_binary <- raw_data %>%
  filter(COHORT %in% c("PD", "Control")) %>%
  mutate(outcome = factor(ifelse(COHORT == "PD", 1, 0),
                          levels = c(0, 1),
                          labels = c("Control", "PD")))

data_binary %>% count(outcome)

## # A tibble: 2 × 2
##   outcome       n
##   <fct>     <int>
## 1 Control  890010
## 2 PD      2952345

wide_data <- data_binary %>%
  select(PATNO, SEX, outcome, TESTNAME, TESTVALUE) %>%
  pivot_wider(
    names_from  = TESTNAME,
    values_from = TESTVALUE
  )

dim(wide_data)

## [1]  803 4788

Caclulating the variance of each protein. Looking at the variabce distribution of proteins and plotting to pick a cutoff. Selecting the top 500 proteins.

protein_cols <- wide_data %>% 
  select(-PATNO, -SEX, -outcome)

protein_vars <- protein_cols %>%
  summarise(across(everything(), var, na.rm = TRUE)) %>%
  pivot_longer(everything(), 
               names_to  = "protein", 
               values_to = "variance") %>%
  arrange(desc(variance))

## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(everything(), var, na.rm = TRUE)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))

summary(protein_vars$variance)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.003911  0.025173  0.052631  0.122038  0.120528 12.827399

ggplot(protein_vars, aes(x = variance)) +
  geom_histogram(bins = 100, fill = "steelblue", color = "white") +
  geom_vline(xintercept = protein_vars$variance[500], 
             color = "red", linetype = "dashed", linewidth = 1) +
  annotate("text", 
           x     = protein_vars$variance[500], 
           y     = Inf, 
           label = "Top 500 cutoff", 
           vjust = 2, hjust = -0.1, color = "red") +
  labs(title = "Protein Variance Distribution",
       x     = "Variance (log2 scale)",
       y     = "Count") +
  theme_minimal()

top500_proteins <- protein_vars %>%
  slice_head(n = 500) %>%
  pull(protein)

data_filtered <- wide_data %>%
  select(PATNO, SEX, outcome, all_of(top500_proteins))

dim(data_filtered)

## [1] 803 503

I will run t test on the 500 selected proteins. Volcano plot of PD vs Control

ttest_results <- data_filtered %>%
  select(outcome, all_of(top500_proteins)) %>%
  pivot_longer(
    cols      = all_of(top500_proteins),
    names_to  = "protein",
    values_to = "value"
  ) %>%
  group_by(protein) %>%
  summarise(
    p_value   = t.test(value ~ outcome)$p.value,
    mean_PD   = mean(value[outcome == "PD"],      na.rm = TRUE),
    mean_ctrl = mean(value[outcome == "Control"],  na.rm = TRUE),
    log2FC    = mean_PD - mean_ctrl  
  ) %>%
  mutate(
    p_adj     = p.adjust(p_value, method = "BH"),  
    sig       = p_adj < 0.05
  ) %>%
  arrange(p_adj)

ttest_results %>% count(sig)

## # A tibble: 2 × 2
##   sig       n
##   <lgl> <int>
## 1 FALSE   432
## 2 TRUE     68

head(ttest_results, 20)

## # A tibble: 20 × 7
##    protein      p_value mean_PD mean_ctrl log2FC        p_adj sig  
##    <chr>          <dbl>   <dbl>     <dbl>  <dbl>        <dbl> <lgl>
##  1 3504-58_2   4.21e-11    9.33      8.96  0.376 0.0000000142 TRUE 
##  2 7757-5_3    5.70e-11    9.42      9.11  0.309 0.0000000142 TRUE 
##  3 5080-131_3  1.03e- 8    9.69      9.47  0.220 0.00000171   TRUE 
##  4 8240-207_3  1.34e- 7   11.6      11.4   0.247 0.0000167    TRUE 
##  5 2900-53_3   1.05e- 6    9.47      9.23  0.240 0.000105     TRUE 
##  6 6521-35_3   1.44e- 6   11.7      11.9  -0.219 0.000120     TRUE 
##  7 6520-87_3   1.34e- 5   13.4      13.2   0.218 0.000958     TRUE 
##  8 2292-17_4   2.45e- 5    9.64      9.46  0.184 0.00136      TRUE 
##  9 4141-79_1   2.25e- 5   10.6      10.3   0.244 0.00136      TRUE 
## 10 2743-5_2    2.80e- 5   12.9      12.7   0.261 0.00140      TRUE 
## 11 13722-105_3 3.46e- 5   10.4      10.2   0.177 0.00157      TRUE 
## 12 13416-8_3   5.09e- 5   12.1      12.3  -0.216 0.00204      TRUE 
## 13 14125-5_3   5.53e- 5    8.98      8.79  0.187 0.00204      TRUE 
## 14 9796-4_3    5.70e- 5   15.2      15.0   0.215 0.00204      TRUE 
## 15 4544-4_3    7.20e- 5    9.36      9.17  0.184 0.00240      TRUE 
## 16 3060-43_2   9.06e- 5   11.1      11.0   0.163 0.00283      TRUE 
## 17 5688-65_3   1.04e- 4   12.2      12.4  -0.186 0.00296      TRUE 
## 18 9900-36_3   1.07e- 4    9.12      8.92  0.203 0.00296      TRUE 
## 19 9986-14_3   1.29e- 4    9.35      9.15  0.204 0.00340      TRUE 
## 20 3196-6_2    1.51e- 4    8.28      8.48 -0.196 0.00360      TRUE

ggplot(ttest_results, aes(x = log2FC, y = -log10(p_adj), color = sig)) +
  geom_point(alpha = 0.6) +
  scale_color_manual(values = c("grey60", "steelblue"),
                     labels = c("Not significant", "FDR < 0.05")) +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed", color = "red") +
  geom_vline(xintercept = c(-0.5, 0.5), 
             linetype = "dashed", color = "grey40") +
  labs(title = "Volcano Plot: PD vs Control",
       x     = "Log2 Fold Change (PD - Control)",
       y     = "-log10(BH adjusted p-value)",
       color = "") +
  theme_minimal()

Get the significant proteins from the t-test and this should be the data ready for training the models.

sig_proteins <- ttest_results %>%
  filter(sig == TRUE) %>%
  pull(protein)

data_model <- data_filtered %>%
  select(PATNO, SEX, outcome, all_of(sig_proteins))

dim(data_model)

## [1] 803  71

data_model %>% count(outcome)

## # A tibble: 2 × 2
##   outcome     n
##   <fct>   <int>
## 1 Control   186
## 2 PD        617

saveRDS(data_model, file.path(data_dir, "data_model.rds"))

Next step is to try running LASSO, Logistic Regression and Random forest on the data crearted with the selected proteins on PD and control patients.

Final Project: Draft 1

Ramazan Ege Solak

April 5th, 2026

Question of interest

Data

Variables of interest

Directed acyclic graph (DAG)