The tutorial includes my personal recipe, so please cite me when you get to use this tutorial. (helloyjjoo@gmail.com)

Step 1. Install PheWAS R package.
install.packages("devtools")
install.packages(c("dplyr","tidyr","ggplot2","MASS","meta","ggrepel","DT"))
devtools::install_github("PheWAS/PheWAS")

library(PheWAS)
  • Make sure you have enough understanding on your dataset before running the analysis - e.g. histogram (to confirm their distribution), number of NA, types of value, etc.
Step 2. Read in the necessary phenotype/genotype data

In this case, we will use PRS (or GPS, polygenic risk score or genome-wide polygenic score based on the types of traits) to predict the outcomes (phenotype).

phenotype <- read.csv("ABCD_phenotype_total.csv", header=TRUE)
prs <- read.csv("ABCD_all_Pt1_score_new_version_norm.csv", header=TRUE)
cov <- phenotype[,c(2:12)]  # create covariate table 
phenotype <- phenotype[,-c(1,3:12)]  # remove covariate columns from phenotype data 

# Unify the first column name to merge the data table for 'phewas()' function.
names(phenotype)[1] <- "KEY"
names(prs)[1] <- "KEY"
names(cov)[1] <- "KEY"
  • If you need to use individual SNP to run PheWAS, that will be a different story which will need different transformation/commands. Please see the package vignette then.
Step 3. Decide the type of regression you will use for each phenotype column.

A. Linear regression - make sure your columns are in numerical format.

B. Logistic regression - convert your binary variable into 'TRUE/FALSE' (logical format)

phenotype[,c(63:104)] <- data.frame(apply(phenotype[c(63:104)], 2, as.logical))

C. Firth Logistic regression - this type of analysis will be used if you have significantly imbalanced case/control ratio in your data.

phenotype_log <- phenotype[,c(1,63:104)] # extract specific phenotype columns with extremely low number of cases 
data=inner_join(phenotype_log,prs)
data_cov=inner_join(data,cov) 

result_firth <- phewas_ext(names(phenotype_log)[-1], predictors=names(prs)[-1], data=data_cov, cov=names(cov)[-1], additive.genotypes = FALSE, MASS.confint.level = TRUE, method="logistf", min.records = 10)
write.csv(result_firth, "PRS_ABCD_PheWAS_KSAD_firthRegresion_cov_AgeSexEAMaritalSite_v2.csv", quote=FALSE, row.names=FALSE)
Step 4. Run PheWAS and save the result.
result <- phewas(outcomes=phenotype, predictors=prs, covariates=cov, additive.genotypes = FALSE, significance.threshold = c("p-value", "bonferroni", "fdr"), quick.confint.level = TRUE)

write.csv(result, "PRS_ABCD_PheWAS_cov_AgeSexEAMaritalSite_imputed.csv", quote=FALSE, row.names=FALSE)

Sort the result by p-value and see which phenotype-predictor associations come up with the higest significance. Check whether the associations survived FDR test (0.05), or Bonferroni correction for multiple testing, which can be found in the last three columns of the result.

Drawing PheWAS Plots

You can also draw Manhattan plots using ggplot2 R package, which is a go-to graphical tool for drawing plots in R environment. The arguments within the command below can be modified on your own. Look up 'R ggplot cheatsheet' if you need to modify certain features of the plot. Please do not ask me ggplot2 questions.

ggplot(result, aes(x=Variable, y=-log(p))) + geom_point(aes(col=predictor, size=OR)) + theme_classic() + theme(axis.text.x = element_blank(), panel.grid.minor=element_line(colour = "grey", linetype="dashed"), axis.ticks=element_blank()) + labs(color="Category", size="Effect size", x="GPS - Suicidal Phenotypes", y="log(p-value)") +geom_text_repel(data=. %>% mutate(label = ifelse(p < 0.01, as.character(Variable), "")), aes(label=label), size=4.1, box.padding = unit(0.7, "lines")) + geom_hline(yintercept=-log(0.01), color="red", size=1, alpha=0.5) 
Good luck on your analysis!