Background

Human studies of the metabolic basis of prostate cancer (PCa) have yet to determine which metabolic changes associated with PCa are a cause or a consequence of tumour development and progression.

Methods

We performed a nested case-control study of metabolites (log-transformed and scaled to z-scores) and PCa incidence, stage, and grade in a cross-sectional sample of 5,041 localized PCa cases (2335) and controls (2706) in the UK-based, National Institute for Health Research–supported Prostate Testing for Cancer and Treatment (ProtecT) trial (adjusting for age, study center, and family history). Then, identifying SNPs previously associated with metabolites (metab-QTLs) from Kettunen et al. (2016), we pulled the summary data for the association of metab-QTLS and PCa incidence in the Prostate Cancer Association Group to Investigate Cancer Associated Alterations in the Genome (PRACTICAL) consortium (44,825 PCa cases 27,904 controls). Next, extracting SNPs associated with PCa from the publicly available GWAS Catalog and using the summary data for the association of these SNPs with metabolites, we performed bi-directional, two-sample Mendelian randomization (MR), interrogating the causality of the metabolome on PCa incidence and the causality of PCa incidence on the metabolome.

Visual heuristic

Mendelian randomziation uses SNPs as instruments avoiding reverse causation and confounding.

Mendelian randomziation uses SNPs as instruments avoiding reverse causation and confounding.


Results

Searchable lists of the ProtecT (metabolites on PCa incidence and PCa incidence on metabolites) findings (p-values less than the principal components (PCA)-derived multiple-testing correction threshold of 0.0014 deemed signficant) are below for both the observational associations and for the two MRs (metabolites on PCa incidence and PCa incidence on metabolites). In the MR of metabolites on PCa incidence, creatinine (Crea) showed an increased risk on PCa (OR=1.411; SE=0.0823; p-value=3.046832e-05), though creatinine was not significantly associated with PCa observationally (OR=1.048; SE=0.030; p-value=0.127). Interestingly, changes in glucose (glc) appear to be a consequence of PCa incidence and not its cause (MR beta=0.0730; SE=0.033; p-value=0.026), though this assertion depends on the threshold we use for multiple testing (the p-value for glucose was not <0.0014). The set of observational findings for stage (<3 and >=3) and grade (<7 and >=7; <8 and>=8) are also provided below, though we are awaiting the stage and grade equivalent outcome data from PRACTICAL.


Results for (observational) logistic regression of metabolites on prostate cancer incidence



Forest plot of selected metabolites common to both ProtecT and PRACTICAL for odds of prostate cancer

Observational (ProtecT) results in red and MR (PRACTICAL) results in blue.

Observational (ProtecT) results in red and MR (PRACTICAL) results in blue.


Complete list of observational results for the association of metabolites on prostate cancer incidence


Results for (observational) logistic regression of metabolites on prostate cancer stage (>=3)

Forest plot for stage (odds of stage >=3) (all metabolites…hard to read)

Results for (observational) logistic regression of metabolites on prostate cancer grade (>=7)

Forest plot for grade (odds of grade >=7) (Terribly hard to read!)

Will use a selected list, like for the set common to both ProtecT and PRACTICAL, when we decide if that is is the best way to display them.

Will use a selected list, like for the set common to both ProtecT and PRACTICAL, when we decide if that is is the best way to display them.

Results for (observational) logistic regression of metabolites on prostate cancer grade (>=8)

Forest plot for grade (odds of grade >=8) (Terribly hard to read!)

Will use a selected list, like for the set common to both ProtecT and PRACTICAL, when we decide if that is is the best way to display them.

Will use a selected list, like for the set common to both ProtecT and PRACTICAL, when we decide if that is is the best way to display them.


Results for the Mendelian randomization of metabolites on prostate cancer incidence (p<0.05)

Metabolite Estimate SE Pval Method
98 Crea 1.410868 0.0825445 0.0000305 Maximum likelihood
75 Gp 1.167327 0.0567604 0.0064149 Maximum likelihood
4 Pyr 1.292758 0.1165183 0.0275415 Maximum likelihood
43 Ala 1.198443 0.0837828 0.0307243 Maximum likelihood
85 Lac 1.379156 0.1520822 0.0345320 Maximum likelihood
109 FAw79S 1.119108 0.0539872 0.0371225 Maximum likelihood

Complete list of MR results for metabolites on prostate cancer incidence


Results for the (observational) linear regression of prostate cancer incidence on metabolites


Bootstrapped linear regression results

Due to the non-normality in the residuals seen in the qq-plots (below under “Sensitivity Analyses”), I bootstrapped the effect estimates from the linear regression models and calculated the standard errors, lower and upper bounds, and p-values to compare the results with the non-bootstrapped linear regression findings.


Complete list of observational results for prostate cancer incidence on metabolites


Results for the Mendelian randomization of prostate cancer incidence on metabolites (p<0.05)

Metabolite Estimate SE Pval Method
22 Glc 0.0730025 0.0326753 0.0254713 Maximum likelihood

Complete list of MR results for prostate cancer incidence on metabolites



Volcano plot of the ProtecT findings

Metabolites log-transformed and scaled to z-scores. Labeled metabolites are Bonferroni significant.

Metabolites log-transformed and scaled to z-scores. Labeled metabolites are Bonferroni significant.


Bi-directional MR of metabolites and prostate cancer

Metabolites are labeled if the p-values for the MR are <0.05 or if the p-values for the observational association are <0.0014. Cyan coloring indicates a discordance between the observational and MR estimates, where discordance is defined as opposite direction of effect: i.e, for the comparison of odds ratios (above Left, Metabolites on Prostate Cancer), if the effect estimate for the observational association is <1 and the effect estimate for the causal association is >1, or if the effect estimate for the observation association is >1 and the effect estimate for the causal association is <1, then cyan (discordant); and, for the comparison of linear regressions (above Right, Prostate Cancer on Metabolites), if the effect estimate for the obervational association is <0 and the effect estimate for the causal association is >1 or if the effect estimate for the observation association is >0 and the effect estimate for the causal association is <0, then cyan (discordant).

Metabolites are labeled if the p-values for the MR are <0.05 or if the p-values for the observational association are <0.0014. Cyan coloring indicates a discordance between the observational and MR estimates, where discordance is defined as opposite direction of effect: i.e, for the comparison of odds ratios (above Left, Metabolites on Prostate Cancer), if the effect estimate for the observational association is <1 and the effect estimate for the causal association is >1, or if the effect estimate for the observation association is >1 and the effect estimate for the causal association is <1, then cyan (discordant); and, for the comparison of linear regressions (above Right, Prostate Cancer on Metabolites), if the effect estimate for the obervational association is <0 and the effect estimate for the causal association is >1 or if the effect estimate for the observation association is >0 and the effect estimate for the causal association is <0, then cyan (discordant).


Discussion

# Complete lists of results provided above in red links are also available indirectly by the interactive shiny app files I placed on github. To view them, open R, and type the following code, which will access the project files (ui.R, server.R and .csv) stored in their respective repositories:

library(shiny) 
runGitHub("protect", "CharleenAdams")
runGitHub("practical", "CharleenAdams")
runGitHub("protect2", "CharleenAdams")
runGitHub("practical2", "CharleenAdams")

Nascent ‘Table 1’ for ProtecT

Selected baseline characteristics (medians and interquartile ranges, or percents) for cases and controls

Characteristic Case (n=2335) Control (n=2706) p value*
Age 62.84 (58.69-66.59) 62.78 (58.56-66.40) 0.683
Family history of prostate cancer (%)** 7 5 <0.001
BMI# 26.50 (24.45-28.85) 26.58 (24.38-28.73) 0.7463

p-value based on Chi-squared tests (for categorical variables) and Wilcoxon rank sum tests (for continuous variables)
** Family history data available on only 88% of these subjects
# BMI data available on only 74% of these subjects


Supplemental material

Sensitivity analyses

I wanted to see how adding BMI as a covariate to the observational models impacted results. Because we are missing BMI on 36% of participants, I used the MICE package to impute BMI. See results below. Adding in the imputed values for BMI (and family history, for which we were missing 537 observations) did not substantially change the fingdings.

This density plot compares the imputed (magenta) versus observed (blue) values for selected variables. The overlap of the densities for BMI indicates that the imputation performed well.

This density plot compares the imputed (magenta) versus observed (blue) values for selected variables. The overlap of the densities for BMI indicates that the imputation performed well.


Results for the sensitivity analysis adding in the imputed variables for BMI and family history


(Diagnostic) residuals plots for top linear regression results (all log-transformed)

Indicates that some of the metabolites (S.HDL.C, S.HDL.CE, & SFA.FA) are not best fit with the current model.

Indicates that some of the metabolites (S.HDL.C, S.HDL.CE, & SFA.FA) are not best fit with the current model.


(Diagnostic) residuals plots with imputed family history and BMI added to the models (all log-transformed)

Indicates that addings imputed family history and BMI did not improve the fit of the residuals for the metabolites not already well-modeled.

Indicates that addings imputed family history and BMI did not improve the fit of the residuals for the metabolites not already well-modeled.


(Diagnostic) scatter of linear regression output



Heterogeneity plot for creatinine


Heterogeneity plot for glucose


Metabolite abbreviation dictionary


Template code with functions for looping through the metabolites and bootstrapping

setwd('/panfs/panasas01/sscm/ca16591')

cleaned_met3 <- read.csv("/panfs/panasas01/sscm/ca16591/Boot_ProtecT_Mets/metabol.data.clean.demo3.20.02.17.csv")#BlueCrystal
data <- cleaned_met3

# Create metabolite name lists for use in loops

mname=colnames(data)[-c(1:93)]  
sname=scan('/panfs/panasas01/sscm/ca16591/Boot_ProtecT_Mets/short_list_Pro_Prac1.txt',what='')#BlueCrystal
smname=mname[-c(229:231)]
data$casecontrol_n=ifelse(data$casecontrol=="Control", 0,1)
completedData <- read.csv("/panfs/panasas01/sscm/ca16591/Boot_ProtecT_Mets/completedData_imputation.csv")#BlueCrystal
data$Glol <- completedData$Glol
data$pounds = (data$p1dhl_wght_stones)*14
data$weight = data$pounds + data$p1dhl_wght_pounds
data$feet_to_inches = (data$p1dhl_tall_ft)*12
data$height_inches = data$p1dhl_tall_in + data$feet_to_inches
data$inches = (data$height_inches^2)
data$inches_squared = (data$inches)
data$bmi=(data$weight*703)/data$inches_squared 
data$fac_centre <- as.factor(data$centre)

# Function to loop through the associations 

logtransform<-function(dataset,smname)  
{   
  tx=dataset[,smname]   
  tx=apply(tx,2,function(x){if (min(x,na.rm=T)==0) x=x+1 else x })  
  tx=log(tx)    
  dataset[,smname]=tx   
  dataset   
}   

linearRegress<-function(metabolite.name,exposure.name,dataset,covariate.name=TRUE,  
                        metabolite.log='T',subset=NULL) 
{   
  ##subseting   
  tx=dataset    
  if(!is.null(subset)) {ind=with(tx,eval(parse(text=subset))); tx=tx[ind,]} 
  
  ##log-transform and scaling the metabolites   
  logname=metabolite.name
  scalename=metabolite.name
  if (metabolite.log=='T') {tx=logtransform(tx,logname)}
  tx[,scalename]=scale(tx[,scalename])
  
  ##linear regression of exposure with metabolites      
  add=numeric() 
  fom=formula(paste('met~',paste(c('casecontrol_n', covariate.name=c('age', 'fac_centre','fhhistpca')),collapse='+')))  
  for (j in 1:length(metabolite.name))  
  { 
    met=tx[,metabolite.name[j]];    
    fit=lm(fom,data=tx) 
    temp=c(summary(fit)$coef['casecontrol_n',],confint(fit)['casecontrol_n',]); 
    add=rbind(add,temp);    
  }     
  rownames(add)=metabolite.name 
  add=data.frame(add,check.names = F)   
  add   
}   
#
result_linear=linearRegress(smname,'casecontrol_n', data, covariate.name=c('age', 'fac_centre','fhhistpca'))    
head(result_linear)

# This is the function which takes a random bootstrap sample and performs the analysis.

Funk<-function(Boot.Data){
  
  ###If your only input is the dataframe, then
  A<-linearRegress(smname,'casecontrol_n', Boot.Data, covariate.name=c('age', 'fac_centre','fhhistpca'))    
  A <- as.data.frame(A)
  A$Metabolites <- rownames(A)
  
  return (A)
}

# This will return the output for linearRegress for the random bootstrap sample, running 500 times, one time for each dataset.

# First make 500 placeholders, which will be defined as random bootstrap dataframes.
  beta.obs.boot <- numeric(500)

# Then the for loop of 500 dataframes...
  for(i in 1:500) {
  
# Randomly sample a number of rows equal to the size of the original dataframe, using the original data frame
  
  this.ind <- sample(nrow(data),nrow(data), replace=TRUE)
  Boot.Data<-data[this.ind,]
  
  beta.obs.boot[i] <- list(Boot.Data)
}

# Provies a big list of bootstrap dataframes which will be the input for the "Funk" function above

# "Sapply" to provide the set of random bootstrap dataframes, and after a lifetime of waiting, a list of each output of linearRegress for the corresponding random bootstrap dataframes is given

  LSR.Est<-sapply(beta.obs.boot,Funk)

# Extract the coefficients, and calculate standard errors, 95% CIs, and p-values

##Effect

booted <- as.data.frame(LSR.Est[,1])
for(i in 2:500) {
  booted <-  rbind(booted, as.data.frame(LSR.Est[,i]))
}

head(booted)
summary(booted)

random_LSR.Est <- as.data.frame(LSR.Est[,1])
Mets <- random_LSR.Est$Metabolites 
Mets_list <- levels(Mets)

funk.1 <- function(m){
  mean <- mean(booted$Estimate[booted$Metabolites==Mets_list[m]])
  se <-  sqrt(var(booted$Estimate[booted$Metabolites==Mets_list[m]]))
  return(list(mean,se))  
}

funk.1.out <- mapply(funk.1, 1:228)
mean.EST <- unlist(funk.1.out[1,])
se.EST <- unlist(funk.1.out[2,])


big_df <- as.data.frame(mean.EST,Mets_list)
big_df$se <- se.EST
big_df$E_CI_min<- big_df$mean-(1.96*big_df$se)
big_df$E_CI_max<- big_df$mean+(1.96*big_df$se)

big_df = big_df[order(big_df$se),]
#head(big_df, n=50)

df.master <- big_df

write.csv(df.master, "/panfs/panasas01/sscm/ca16591/Boot_ProtecT_Mets/df.master.linear.csv")
## Set up a folder in my home drive for BlueCrystal 
/panfs/panasas01/sscm/ca16591/Boot_ProtecT_Mets/

## Write a short bash script that calls the code above and save as "Boot_script_linear.sh" 
#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=48:00:00
#PBS -N Boot_Mets_linear

cd "/panfs/panasas01/sscm/ca16591/Boot_ProtecT_Mets"

echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo PBS job ID is $PBS_JOBID
echo This jobs runs on the following machines:
echo `cat $PBS_NODEFILE | uniq`

module add languages/R-3.3.2-ATLAS
R CMD BATCH Boot_loop_linear.R
## Within /panfs/panasas01/sscm/ca16591/Boot_ProtecT_Mets/, type: qsub Boot_script_linear.sh
## Then wait.
## Read the results back into R and look at them.
df.master <- read.csv("M:/data/protect/_devs/PROTECT_Clinical/data/Boot_ProtecT_Mets/df.master.linear.csv")

df.master = df.master[order(df.master$se),]
df.master.sub <- subset(df.master, df.master$E_CI_min<0 & df.master$E_CI_max<0 |df.master$E_CI_min>0 & df.master$E_CI_max>0)
df.master.sub = df.master.sub[order(df.master.sub$mean.EST),]
summary(df.master$se)

# Calculate the p-value 

df.master$t <- df.master$mean.EST/df.master$se
df.master$p<-2*pnorm(-abs(df.master$t))
df.master = df.master[order(df.master$p),]

write.csv(df.master, "C:/Users/ca16591/Dropbox/Bristol/df.master.linear.p.csv")

References

  1. Kettunen, J. et al. Genome-wide association study identifies multiple loci influencing human serum metabolite levels. Nat Genet. 44, 269–76 (2012).
  2. Kettunen, J. et al. Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of LPA. Nat Commun 7, 1–9 (2016).
  3. Weinstein, S. J. et al. Serum creatinine and prostate cancer risk in a prospective study. Cancer Epidemiol Biomarkers Prev. 18, 2630–2649 (2009).
  4. Van Buuren, S. & Oudshoorn, C. G. M. Multivariate imputation by chained equations: MICE V1. 0 User’s Manual. The Netherlands: TNO Report PG/VGZ/00.038. Netherlands Organization for applied scientific research (2000).
  5. Davey Smith, G. & Hemani, G. Mendelian randomization: genetic anchors for causal inference in epidemiological studies. Hum. Mol. Genet. 23, R89-98 (2014).