Parkinson’s disease (PD) is a progressive neurodegenerative disorder affecting millions of people worldwide, characterized by motor symptoms such as tremor, rigidity, and bradykinesia, as well as non-motor manifestations including cognitive decline and autonomic dysfunction. One of the major clinical challenges with Parkinson’s disease is the lack of reliable, objective set of biomarkers for early diagnosis. Ususally, patients are often diagnosed only after significant neurodegeneration has already occurred with physical tests or observing lesions in specific brain regions.
This project aims to investigate whether cerebrospinal fluid (CSF) proteomics data can be used to accurately predict Parkinson’s disease status. Specifically, Can protein expression profiles from CSF samples reliably distinguish Parkinson’s disease patients from healthy controls, and which proteins are most predictive of disease status?
The data used in this project comes from the Parkinson’s Progression Markers Initiative (PPMI), a landmark longitudinal observational study sponsored by the Michael J. Fox Foundation. PPMI was designed to identify biomarkers of Parkinson’s disease onset and progression by recruiting both PD patients and healthy controls and collecting biological samples, imaging, and clinical assessments over time. https://www.ppmi-info.org/
library(ggdag)
library(ggplot2)
# Use short names inside nodes
dag <- dagify(
Outcome ~ Predictor + SEX,
Predictor ~ SEX + Batch,
coords = list(
x = c(SEX = 1, Batch = 1, Predictor = 2, Outcome = 3),
y = c(SEX = 1, Batch = -1, Predictor = 0, Outcome = 0)
)
)
ggdag(dag, text = FALSE) +
geom_dag_point(color = "#4472C4", size = 22) +
geom_dag_text(color = "white", size = 3.5, fontface = "bold") +
theme_dag() +
labs(
title = "DAG: CSF Proteomics and Parkinson's Disease Prediction",
caption = paste(
"Outcome = PD Diagnosis (COHORT)",
"Predictor = Protein Expression (TESTVALUE)",
"SEX = Sex — Biological Confounder",
"Batch = Assay Plate (PLATEID) — Technical Confounder",
sep = "\n"
)
) +
theme(
plot.title = element_text(face = "bold", size = 13, hjust = 0.5),
plot.caption = element_text(size = 9, color = "gray30", hjust = 0,
margin = margin(t = 12), family = "mono")
)
### Statistical analysis plan
Because our outcome variable is binary, to answer the question of interest, three classification approaches will be built and compared: Logistic Regression, LASSO, and Random Forest. By comparing these models, this project will evaluate both predictive accuracy and the interpretability of selected biomarkers.
To control for potential confounding, sex will be included as a biological covariate across all three models, as protein expression levels are known to vary by sex and Parkinson’s disease prevalence differs between males and females. PLATEID will be examined and adjusted for as a technical confounder, since batch-to-batch variability across assay plates can introduce systematic differences in protein measurements unrelated to disease status.Protein expression values will also be normalized prior to modeling to further reduce technical variation. Model performance will be evaluated using AUC-ROC, accuracy and a confusion matrix.
Exploratory data analysis:
wanted_cols <- c(
"PATNO", "SEX", "COHORT", "TESTNAME",
"TESTVALUE", "PLATEID"
)
raw_data <- file_list %>%
map_dfr(~read_csv(.x, show_col_types = FALSE)) %>%
select(any_of(wanted_cols))
glimpse(raw_data)
## Rows: 5,541,030
## Columns: 6
## $ PATNO <dbl> 53595, 53595, 53595, 53595, 53595, 53595, 53595, 53595, 5359…
## $ SEX <chr> "Female", "Female", "Female", "Female", "Female", "Female", …
## $ COHORT <chr> "PD", "PD", "PD", "PD", "PD", "PD", "PD", "PD", "PD", "PD", …
## $ TESTNAME <chr> "5632-6_3", "5631-83_3", "5630-48_3", "5629-58_3", "5628-21_…
## $ TESTVALUE <dbl> 10.182013, 6.346285, 12.943830, 5.891402, 12.836350, 7.03480…
## $ PLATEID <chr> "P0022899", "P0022899", "P0022899", "P0022899", "P0022899", …
head(raw_data, 20)
## # A tibble: 20 × 6
## PATNO SEX COHORT TESTNAME TESTVALUE PLATEID
## <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 53595 Female PD 5632-6_3 10.2 P0022899
## 2 53595 Female PD 5631-83_3 6.35 P0022899
## 3 53595 Female PD 5630-48_3 12.9 P0022899
## 4 53595 Female PD 5629-58_3 5.89 P0022899
## 5 53595 Female PD 5628-21_3 12.8 P0022899
## 6 53595 Female PD 5627-53_3 7.03 P0022899
## 7 53595 Female PD 5626-20_3 5.89 P0022899
## 8 53595 Female PD 5624-66_3 7.66 P0022899
## 9 53595 Female PD 5623-11_3 6.13 P0022899
## 10 53595 Female PD 5621-64_3 5.95 P0022899
## 11 53595 Female PD 5620-13_3 10.1 P0022899
## 12 53595 Female PD 5618-50_3 13.9 P0022899
## 13 53595 Female PD 5617-41_3 8.84 P0022899
## 14 53595 Female PD 5615-62_3 6.25 P0022899
## 15 53595 Female PD 5614-44_3 6.73 P0022899
## 16 53595 Female PD 5613-75_3 5.87 P0022899
## 17 53595 Female PD 5612-16_3 6.20 P0022899
## 18 53595 Female PD 5611-56_3 6.02 P0022899
## 19 53595 Female PD 5610-32_3 6.38 P0022899
## 20 53595 Female PD 5609-92_3 14.1 P0022899
raw_data %>% count(COHORT)
## # A tibble: 3 × 2
## COHORT n
## <chr> <int>
## 1 Control 890010
## 2 PD 2952345
## 3 Prodromal 1698675
raw_data %>% summarise(
n_patients = n_distinct(PATNO),
n_proteins = n_distinct(TESTNAME),
n_rows = n()
)
## # A tibble: 1 × 3
## n_patients n_proteins n_rows
## <int> <int> <int>
## 1 1158 4785 5541030
raw_data %>%
count(PATNO, TESTNAME) %>%
filter(n > 1) %>%
nrow()
## [1] 0
For now I will only use PD and Control Patients
data_binary <- raw_data %>%
filter(COHORT %in% c("PD", "Control")) %>%
mutate(outcome = factor(ifelse(COHORT == "PD", 1, 0),
levels = c(0, 1),
labels = c("Control", "PD")))
data_binary %>% count(outcome)
## # A tibble: 2 × 2
## outcome n
## <fct> <int>
## 1 Control 890010
## 2 PD 2952345
wide_data <- data_binary %>%
select(PATNO, SEX, outcome, TESTNAME, TESTVALUE) %>%
pivot_wider(
names_from = TESTNAME,
values_from = TESTVALUE
)
dim(wide_data)
## [1] 803 4788
Caclulating the variance of each protein. Looking at the variabce distribution of proteins and plotting to pick a cutoff. Selecting the top 500 proteins.
protein_cols <- wide_data %>%
select(-PATNO, -SEX, -outcome)
protein_vars <- protein_cols %>%
summarise(across(everything(), var, na.rm = TRUE)) %>%
pivot_longer(everything(),
names_to = "protein",
values_to = "variance") %>%
arrange(desc(variance))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(everything(), var, na.rm = TRUE)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
summary(protein_vars$variance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003911 0.025173 0.052631 0.122038 0.120528 12.827399
ggplot(protein_vars, aes(x = variance)) +
geom_histogram(bins = 100, fill = "steelblue", color = "white") +
geom_vline(xintercept = protein_vars$variance[500],
color = "red", linetype = "dashed", linewidth = 1) +
annotate("text",
x = protein_vars$variance[500],
y = Inf,
label = "Top 500 cutoff",
vjust = 2, hjust = -0.1, color = "red") +
labs(title = "Protein Variance Distribution",
x = "Variance (log2 scale)",
y = "Count") +
theme_minimal()
top500_proteins <- protein_vars %>%
slice_head(n = 500) %>%
pull(protein)
data_filtered <- wide_data %>%
select(PATNO, SEX, outcome, all_of(top500_proteins))
dim(data_filtered)
## [1] 803 503
I will run t test on the 500 selected proteins. Volcano plot of PD vs Control
ttest_results <- data_filtered %>%
select(outcome, all_of(top500_proteins)) %>%
pivot_longer(
cols = all_of(top500_proteins),
names_to = "protein",
values_to = "value"
) %>%
group_by(protein) %>%
summarise(
p_value = t.test(value ~ outcome)$p.value,
mean_PD = mean(value[outcome == "PD"], na.rm = TRUE),
mean_ctrl = mean(value[outcome == "Control"], na.rm = TRUE),
log2FC = mean_PD - mean_ctrl
) %>%
mutate(
p_adj = p.adjust(p_value, method = "BH"),
sig = p_adj < 0.05
) %>%
arrange(p_adj)
ttest_results %>% count(sig)
## # A tibble: 2 × 2
## sig n
## <lgl> <int>
## 1 FALSE 432
## 2 TRUE 68
head(ttest_results, 20)
## # A tibble: 20 × 7
## protein p_value mean_PD mean_ctrl log2FC p_adj sig
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 3504-58_2 4.21e-11 9.33 8.96 0.376 0.0000000142 TRUE
## 2 7757-5_3 5.70e-11 9.42 9.11 0.309 0.0000000142 TRUE
## 3 5080-131_3 1.03e- 8 9.69 9.47 0.220 0.00000171 TRUE
## 4 8240-207_3 1.34e- 7 11.6 11.4 0.247 0.0000167 TRUE
## 5 2900-53_3 1.05e- 6 9.47 9.23 0.240 0.000105 TRUE
## 6 6521-35_3 1.44e- 6 11.7 11.9 -0.219 0.000120 TRUE
## 7 6520-87_3 1.34e- 5 13.4 13.2 0.218 0.000958 TRUE
## 8 2292-17_4 2.45e- 5 9.64 9.46 0.184 0.00136 TRUE
## 9 4141-79_1 2.25e- 5 10.6 10.3 0.244 0.00136 TRUE
## 10 2743-5_2 2.80e- 5 12.9 12.7 0.261 0.00140 TRUE
## 11 13722-105_3 3.46e- 5 10.4 10.2 0.177 0.00157 TRUE
## 12 13416-8_3 5.09e- 5 12.1 12.3 -0.216 0.00204 TRUE
## 13 14125-5_3 5.53e- 5 8.98 8.79 0.187 0.00204 TRUE
## 14 9796-4_3 5.70e- 5 15.2 15.0 0.215 0.00204 TRUE
## 15 4544-4_3 7.20e- 5 9.36 9.17 0.184 0.00240 TRUE
## 16 3060-43_2 9.06e- 5 11.1 11.0 0.163 0.00283 TRUE
## 17 5688-65_3 1.04e- 4 12.2 12.4 -0.186 0.00296 TRUE
## 18 9900-36_3 1.07e- 4 9.12 8.92 0.203 0.00296 TRUE
## 19 9986-14_3 1.29e- 4 9.35 9.15 0.204 0.00340 TRUE
## 20 3196-6_2 1.51e- 4 8.28 8.48 -0.196 0.00360 TRUE
ggplot(ttest_results, aes(x = log2FC, y = -log10(p_adj), color = sig)) +
geom_point(alpha = 0.6) +
scale_color_manual(values = c("grey60", "steelblue"),
labels = c("Not significant", "FDR < 0.05")) +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed", color = "red") +
geom_vline(xintercept = c(-0.5, 0.5),
linetype = "dashed", color = "grey40") +
labs(title = "Volcano Plot: PD vs Control",
x = "Log2 Fold Change (PD - Control)",
y = "-log10(BH adjusted p-value)",
color = "") +
theme_minimal()
Get the significant proteins from the t-test and this should be the data ready for training the models.
sig_proteins <- ttest_results %>%
filter(sig == TRUE) %>%
pull(protein)
data_model <- data_filtered %>%
select(PATNO, SEX, outcome, all_of(sig_proteins))
dim(data_model)
## [1] 803 71
data_model %>% count(outcome)
## # A tibble: 2 × 2
## outcome n
## <fct> <int>
## 1 Control 186
## 2 PD 617
saveRDS(data_model, file.path(data_dir, "data_model.rds"))
Next step is to try running LASSO, Logistic Regression and Random forest on the data crearted with the selected proteins on PD and control patients.