Human studies of the metabolic basis of prostate cancer (PCa) have yet to determine which metabolic changes associated with PCa are a cause or a consequence of tumour development and progression.
We performed a nested case-control study of metabolites (log-transformed and scaled to z-scores) and PCa incidence, stage, and grade in a cross-sectional sample of 5,041 localized PCa cases (2335) and controls (2706) in the UK-based, National Institute for Health Research–supported Prostate Testing for Cancer and Treatment (ProtecT) trial (adjusting for age, study center, and family history). Then, identifying SNPs previously associated with metabolites (metab-QTLs) from Kettunen et al. (2016), we pulled the summary data for the association of metab-QTLS and PCa incidence in the Prostate Cancer Association Group to Investigate Cancer Associated Alterations in the Genome (PRACTICAL) consortium (44,825 PCa cases 27,904 controls). Next, extracting SNPs associated with PCa from the publicly available GWAS Catalog and using the summary data for the association of these SNPs with metabolites, we performed bi-directional, two-sample Mendelian randomization (MR), interrogating the causality of the metabolome on PCa incidence and the causality of PCa incidence on the metabolome.
Mendelian randomziation uses SNPs as instruments avoiding reverse causation and confounding.
Searchable lists of the ProtecT (metabolites on PCa incidence and PCa incidence on metabolites) findings (p-values less than the principal components (PCA)-derived multiple-testing correction threshold of 0.0014 deemed signficant) are below for both the observational associations and for the two MRs (metabolites on PCa incidence and PCa incidence on metabolites). In the MR of metabolites on PCa incidence, creatinine (Crea) showed an increased risk on PCa (OR=1.411; SE=0.0823; p-value=3.046832e-05), though creatinine was not significantly associated with PCa observationally (OR=1.048; SE=0.030; p-value=0.127). Interestingly, changes in glucose (glc) appear to be a consequence of PCa incidence and not its cause (MR beta=0.0730; SE=0.033; p-value=0.026), though this assertion depends on the threshold we use for multiple testing (the p-value for glucose was not <0.0014). The set of observational findings for stage (<3 and >=3) and grade (<7 and >=7; <8 and>=8) are also provided below, though we are awaiting the stage and grade equivalent outcome data from PRACTICAL.
Observational (ProtecT) results in red and MR (PRACTICAL) results in blue.
Will use a selected list, like for the set common to both ProtecT and PRACTICAL, when we decide if that is is the best way to display them.
Will use a selected list, like for the set common to both ProtecT and PRACTICAL, when we decide if that is is the best way to display them.
| Metabolite | Estimate | SE | Pval | Method | |
|---|---|---|---|---|---|
| 98 | Crea | 1.410868 | 0.0825445 | 0.0000305 | Maximum likelihood |
| 75 | Gp | 1.167327 | 0.0567604 | 0.0064149 | Maximum likelihood |
| 4 | Pyr | 1.292758 | 0.1165183 | 0.0275415 | Maximum likelihood |
| 43 | Ala | 1.198443 | 0.0837828 | 0.0307243 | Maximum likelihood |
| 85 | Lac | 1.379156 | 0.1520822 | 0.0345320 | Maximum likelihood |
| 109 | FAw79S | 1.119108 | 0.0539872 | 0.0371225 | Maximum likelihood |
Complete list of MR results for metabolites on prostate cancer incidence
Due to the non-normality in the residuals seen in the qq-plots (below under “Sensitivity Analyses”), I bootstrapped the effect estimates from the linear regression models and calculated the standard errors, lower and upper bounds, and p-values to compare the results with the non-bootstrapped linear regression findings.
Complete list of observational results for prostate cancer incidence on metabolites
| Metabolite | Estimate | SE | Pval | Method | |
|---|---|---|---|---|---|
| 22 | Glc | 0.0730025 | 0.0326753 | 0.0254713 | Maximum likelihood |
Complete list of MR results for prostate cancer incidence on metabolites
Metabolites log-transformed and scaled to z-scores. Labeled metabolites are Bonferroni significant.
Metabolites are labeled if the p-values for the MR are <0.05 or if the p-values for the observational association are <0.0014. Cyan coloring indicates a discordance between the observational and MR estimates, where discordance is defined as opposite direction of effect: i.e, for the comparison of odds ratios (above Left, Metabolites on Prostate Cancer), if the effect estimate for the observational association is <1 and the effect estimate for the causal association is >1, or if the effect estimate for the observation association is >1 and the effect estimate for the causal association is <1, then cyan (discordant); and, for the comparison of linear regressions (above Right, Prostate Cancer on Metabolites), if the effect estimate for the obervational association is <0 and the effect estimate for the causal association is >1 or if the effect estimate for the observation association is >0 and the effect estimate for the causal association is <0, then cyan (discordant).
# Complete lists of results provided above in red links are also available indirectly by the interactive shiny app files I placed on github. To view them, open R, and type the following code, which will access the project files (ui.R, server.R and .csv) stored in their respective repositories:
library(shiny)
runGitHub("protect", "CharleenAdams")
runGitHub("practical", "CharleenAdams")
runGitHub("protect2", "CharleenAdams")
runGitHub("practical2", "CharleenAdams")Selected baseline characteristics (medians and interquartile ranges, or percents) for cases and controls
| Characteristic | Case (n=2335) | Control (n=2706) | p value* |
|---|---|---|---|
| Age | 62.84 (58.69-66.59) | 62.78 (58.56-66.40) | 0.683 |
| Family history of prostate cancer (%)** | 7 | 5 | <0.001 |
| BMI# | 26.50 (24.45-28.85) | 26.58 (24.38-28.73) | 0.7463 |
* p-value based on Chi-squared tests (for categorical variables) and Wilcoxon rank sum tests (for continuous variables)
** Family history data available on only 88% of these subjects
# BMI data available on only 74% of these subjects
I wanted to see how adding BMI as a covariate to the observational models impacted results. Because we are missing BMI on 36% of participants, I used the MICE package to impute BMI. See results below. Adding in the imputed values for BMI (and family history, for which we were missing 537 observations) did not substantially change the fingdings.
This density plot compares the imputed (magenta) versus observed (blue) values for selected variables. The overlap of the densities for BMI indicates that the imputation performed well.
Indicates that some of the metabolites (S.HDL.C, S.HDL.CE, & SFA.FA) are not best fit with the current model.
Indicates that addings imputed family history and BMI did not improve the fit of the residuals for the metabolites not already well-modeled.
setwd('/panfs/panasas01/sscm/ca16591')
cleaned_met3 <- read.csv("/panfs/panasas01/sscm/ca16591/Boot_ProtecT_Mets/metabol.data.clean.demo3.20.02.17.csv")#BlueCrystal
data <- cleaned_met3
# Create metabolite name lists for use in loops
mname=colnames(data)[-c(1:93)]
sname=scan('/panfs/panasas01/sscm/ca16591/Boot_ProtecT_Mets/short_list_Pro_Prac1.txt',what='')#BlueCrystal
smname=mname[-c(229:231)]
data$casecontrol_n=ifelse(data$casecontrol=="Control", 0,1)
completedData <- read.csv("/panfs/panasas01/sscm/ca16591/Boot_ProtecT_Mets/completedData_imputation.csv")#BlueCrystal
data$Glol <- completedData$Glol
data$pounds = (data$p1dhl_wght_stones)*14
data$weight = data$pounds + data$p1dhl_wght_pounds
data$feet_to_inches = (data$p1dhl_tall_ft)*12
data$height_inches = data$p1dhl_tall_in + data$feet_to_inches
data$inches = (data$height_inches^2)
data$inches_squared = (data$inches)
data$bmi=(data$weight*703)/data$inches_squared
data$fac_centre <- as.factor(data$centre)
# Function to loop through the associations
logtransform<-function(dataset,smname)
{
tx=dataset[,smname]
tx=apply(tx,2,function(x){if (min(x,na.rm=T)==0) x=x+1 else x })
tx=log(tx)
dataset[,smname]=tx
dataset
}
linearRegress<-function(metabolite.name,exposure.name,dataset,covariate.name=TRUE,
metabolite.log='T',subset=NULL)
{
##subseting
tx=dataset
if(!is.null(subset)) {ind=with(tx,eval(parse(text=subset))); tx=tx[ind,]}
##log-transform and scaling the metabolites
logname=metabolite.name
scalename=metabolite.name
if (metabolite.log=='T') {tx=logtransform(tx,logname)}
tx[,scalename]=scale(tx[,scalename])
##linear regression of exposure with metabolites
add=numeric()
fom=formula(paste('met~',paste(c('casecontrol_n', covariate.name=c('age', 'fac_centre','fhhistpca')),collapse='+')))
for (j in 1:length(metabolite.name))
{
met=tx[,metabolite.name[j]];
fit=lm(fom,data=tx)
temp=c(summary(fit)$coef['casecontrol_n',],confint(fit)['casecontrol_n',]);
add=rbind(add,temp);
}
rownames(add)=metabolite.name
add=data.frame(add,check.names = F)
add
}
#
result_linear=linearRegress(smname,'casecontrol_n', data, covariate.name=c('age', 'fac_centre','fhhistpca'))
head(result_linear)
# This is the function which takes a random bootstrap sample and performs the analysis.
Funk<-function(Boot.Data){
###If your only input is the dataframe, then
A<-linearRegress(smname,'casecontrol_n', Boot.Data, covariate.name=c('age', 'fac_centre','fhhistpca'))
A <- as.data.frame(A)
A$Metabolites <- rownames(A)
return (A)
}
# This will return the output for linearRegress for the random bootstrap sample, running 500 times, one time for each dataset.
# First make 500 placeholders, which will be defined as random bootstrap dataframes.
beta.obs.boot <- numeric(500)
# Then the for loop of 500 dataframes...
for(i in 1:500) {
# Randomly sample a number of rows equal to the size of the original dataframe, using the original data frame
this.ind <- sample(nrow(data),nrow(data), replace=TRUE)
Boot.Data<-data[this.ind,]
beta.obs.boot[i] <- list(Boot.Data)
}
# Provies a big list of bootstrap dataframes which will be the input for the "Funk" function above
# "Sapply" to provide the set of random bootstrap dataframes, and after a lifetime of waiting, a list of each output of linearRegress for the corresponding random bootstrap dataframes is given
LSR.Est<-sapply(beta.obs.boot,Funk)
# Extract the coefficients, and calculate standard errors, 95% CIs, and p-values
##Effect
booted <- as.data.frame(LSR.Est[,1])
for(i in 2:500) {
booted <- rbind(booted, as.data.frame(LSR.Est[,i]))
}
head(booted)
summary(booted)
random_LSR.Est <- as.data.frame(LSR.Est[,1])
Mets <- random_LSR.Est$Metabolites
Mets_list <- levels(Mets)
funk.1 <- function(m){
mean <- mean(booted$Estimate[booted$Metabolites==Mets_list[m]])
se <- sqrt(var(booted$Estimate[booted$Metabolites==Mets_list[m]]))
return(list(mean,se))
}
funk.1.out <- mapply(funk.1, 1:228)
mean.EST <- unlist(funk.1.out[1,])
se.EST <- unlist(funk.1.out[2,])
big_df <- as.data.frame(mean.EST,Mets_list)
big_df$se <- se.EST
big_df$E_CI_min<- big_df$mean-(1.96*big_df$se)
big_df$E_CI_max<- big_df$mean+(1.96*big_df$se)
big_df = big_df[order(big_df$se),]
#head(big_df, n=50)
df.master <- big_df
write.csv(df.master, "/panfs/panasas01/sscm/ca16591/Boot_ProtecT_Mets/df.master.linear.csv")## Set up a folder in my home drive for BlueCrystal
/panfs/panasas01/sscm/ca16591/Boot_ProtecT_Mets/
## Write a short bash script that calls the code above and save as "Boot_script_linear.sh" #!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=48:00:00
#PBS -N Boot_Mets_linear
cd "/panfs/panasas01/sscm/ca16591/Boot_ProtecT_Mets"
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo PBS job ID is $PBS_JOBID
echo This jobs runs on the following machines:
echo `cat $PBS_NODEFILE | uniq`
module add languages/R-3.3.2-ATLAS
R CMD BATCH Boot_loop_linear.R## Within /panfs/panasas01/sscm/ca16591/Boot_ProtecT_Mets/, type: qsub Boot_script_linear.sh
## Then wait.
## Read the results back into R and look at them.df.master <- read.csv("M:/data/protect/_devs/PROTECT_Clinical/data/Boot_ProtecT_Mets/df.master.linear.csv")
df.master = df.master[order(df.master$se),]
df.master.sub <- subset(df.master, df.master$E_CI_min<0 & df.master$E_CI_max<0 |df.master$E_CI_min>0 & df.master$E_CI_max>0)
df.master.sub = df.master.sub[order(df.master.sub$mean.EST),]
summary(df.master$se)
# Calculate the p-value
df.master$t <- df.master$mean.EST/df.master$se
df.master$p<-2*pnorm(-abs(df.master$t))
df.master = df.master[order(df.master$p),]
write.csv(df.master, "C:/Users/ca16591/Dropbox/Bristol/df.master.linear.p.csv")