1 Integrative Pathway Analysis with pathwayPCA

1.1 Instructors

Dr. Gabriel J. Odom11 Department of Public Health Sciences, Miller School of Medicine, The University of Miami, Miami, FL (gabriel.odom@med.miami.edu)
Dr. Yuguang James Ban22 Sylvester Comprehensive Cancer Center, The University of Miami, Miami, FL (Yuguang.ban@med.miami.edu)
Lizhong Liu33 Department of Public Health Sciences, Miller School of Medicine, The University of Miami, Miami, FL (lxl816@miami.edu)
Prof. Lily Wang44 Department of Public Health Sciences, Miller School of Medicine; McDonald Foundation Department of Human Genetics; The University of Miami, Miami, FL (lily.wang@med.miami.edu)
Prof. Steven Chen55 Department of Public Health Sciences, Miller School of Medicine; Sylvester Comprehensive Cancer Center; The University of Miami, Miami, FL (steven.chen@med.miami.edu)

1.2 Workshop Description

With the advance in high-throughput technology for molecular assays, multi-omics datasets have become increasingly available. In this workshop, we will demonstrate using the pathwayPCA package to perform integrative pathway-based analyses of multi-omics datasets. In particular, we will demonstrate through three case studies the capabilities of pathwayPCA to

perform pathway analysis with gene selection,
estimate and visualize sample-specific pathway activities in ovarian cancer,
integrate multi-omics datasets to identify driver genes, and
identify pathways with sex-specific effects in kidney cancer.

There is a built version of this workshop available at RPubs.66 http://rpubs.com/gabrielodom/Bioc2019_pathwayPCA The source is available on GitHub.77 https://github.com/gabrielodom/Bioc2019pathwayPCA

1.2.1 Pre-requisites

Basic knowledge of R syntax
Familiarity with RStudio
Familiarity with pathway analysis
Knowledge of Principal Components Analysis (PCA)88 https://en.wikipedia.org/wiki/Principal_component_analysis is helpful but not required

1.2.2 Workshop participation

This will be a hands-on workshop, students will live-code with us. It would be helpful for participants to bring a laptop with RStudio and pathwayPCA package installed.99 https://gabrielodom.github.io/pathwayPCA/index.html#installing-the-package

1.2.3 R/Bioconductor packages used

To install the workshop dependencies, assuming you have already installed Bioconductor:1010 https://www.bioconductor.org/install

install.packages(c("survival", "survminer", "tidyverse"))
BiocManager::install("rWikiPathways")
BiocManager::install("pathwayPCA")

We also strongly recommend the RStudio1111 https://www.rstudio.com/products/rstudio/download/ integrated development environment as a user-friendly graphical wrapper for R. We require R version 3.6 or later.

1.2.3.1 Package descriptions

This workshop package, Bioc2019pathwayPCA, depends on two packages:

pathwayPCA,1212 https://gabrielodom.github.io/pathwayPCA/ and
the tidyverse.1313 https://www.tidyverse.org/

The tidyverse package suite is a collection of utility packages for data science: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats. We will make use of some data constructs and ideas from these packages, but we do not expect users to be intimately familiar with them.

Additionally, because we will use survival data in our examples, we need survival1414 https://CRAN.R-project.org/package=survival and survminer1515 https://cran.r-project.org/package=survminer (for drawing survival curves with ggplot2).

library(survival)
library(survminer)
library(Bioc2019pathwayPCA)

1.2.3.2 (OPTIONAL) Development version

Because we are currently in the development phase for version 2 of this package, you can install the package from GitHub. In order to install a package from GitHub, you will need the devtools::1616 https://github.com/r-lib/devtools package and either Rtools1717 https://cran.r-project.org/bin/windows/Rtools/ (for Windows) or Xcode1818 https://developer.apple.com/xcode/ (for Mac). Then you can install the development version of the pathwayPCA package from GitHub via:

devtools::install_github("gabrielodom/pathwayPCA")

1.2.4 Time outline

50m total

Activity	Time
Introduction to `pathwayPCA`	5m
Case Study 1: Estimating sample-specific pathway activities	15m
Case Study 2: Pathway analysis for multi-omics datasets	10m
Case Study 3: Analyzing experiments with covariates and interactions	10m
Summary and Conclusion	5m
Questions and Comments	5m

1.3 Workshop Goals and Objectives

1.3.1 Learning goals

Describe PCA-based pathway analysis
Apply the pathwayPCA workflow to typical gene expression and copy number variations data
Perform integrative pathway-based analysis for multiple types of -omics data jointly
Apply pathwayPCA to experiments with complex designs, such as those with covariate and/or interaction effects

1.3.2 Learning objectives

Identify pathways significantly associated with survival, binary, or continuous outcome using the pathwayPCA approach
Estimate and visualize individual gene effects within a significant pathway
Estimate and visualize sample-specific pathway activities
Prioritize genes most likely to be driver (instead of passenger) genes using integrative analysis
Identify pathways with significant sex-specific effects

1.4 Introduction to `pathwayPCA`

pathwayPCA is an integrative analysis tool that implements PCA-based pathway analysis approaches.1919 Chen et al. (2008) 2020 Chen et al. (2010) 2121 Chen (2011)

1.4.2 Main functions

pathwayPCA uses four main groups of functions: data import and wrangling, object creation, object analysis, and analysis inspection. The main function groups are

Data:
- read_gmt() imports a .gmt file as a pathway collection.
- SE2Tidy() extracts an assay from a SummarizedExperiment object,3232 https://doi.org/10.18129/B9.bioc.SummarizedExperiment and turns it into a “tidy” data frame.
- TransposeAssay() is a variant of the base t() function designed specifcially for data frames and tibbles. It preserves row and column names after transposition.
Omics* Objects:
- CreateOmics() takes in a collection of pathways, a single -omics assay, and a clinical response data frame and creates a data object of class Omics*.
- SubsetPathwayData() can extract the pathway-specific assay values and responses for a given pathway from an Omics* object.
Methods:
- AESPCA_pVals() takes in an Omics* object and calculates pathway \(p\)-values (parametrically or non-parametrically), principal components, and loadings via AESPCA. This returns an object of class aespcOut.
- SuperPCA_pVals() takes in an Omics* object with valid response information and calculates pathway parametric \(p\)-values, principal components, and loadings via SuperPCA. This returns an object of class superpcOut.
Results:
- getPathPCLs() takes in an object of class aespcOut or superpcOut and the TERMSname of a pathway. This function extracts 1) the data frame of principal components and subject IDs for the given pathway, and 2) a data frame of sparse loadings and feature names for the given pathway.
- getPathpVals() takes in an object of class aespcOut or superpcOut and returns a table of the \(p\)-values and false discovery rates for each pathway.

1.4.3 `pathwayPCA` inputs

1.4.3.1 Assay data

We define our assay data format in R as follows:

Let \(\textbf{X} \in \mathbb{R}_{n \times p}\) be the observed single-omics data (gene expression, protein expression, copy-number variation, etc.).
We follow Tidy Data design3333 Wickham, 2014 and record subjects in the rows and -ome features in the columns.

Use the two utility functions, TransposeAssay() or SE2Tidy(), to wrangle your assay data into appropriate form for further analysis. Here is an example of “tidying” the first assay from a SummarizedExperiment object:

# DONT RUN
library(SummarizedExperiment)
data(airway, package = "airway")

airway_df <- SE2Tidy(airway, whichAssay = 1)

1.4.3.2 Pathway collections

Based on expert knowledge of the problem at hand, curate a set of biological pathways \(1, \ldots, K\), where pathway \(k\) contains \(p_k\) -ome measurements. These pathway collections are often stored in a .gmt file, a text file with each row corresponding to one pathway. Each row contains an ID (column TERMS), an optional description (column description), and the genes in the pathway (all subsequent columns). This file format is independently defined by the Broad Institute.3434 http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats Pathway collections in .gmt form can be downloaded from the MSigDB database.3535 http://software.broadinstitute.org/gsea/msigdb/collections.jsp

For WikiPathways,3636 https://www.wikipathways.org/index.php/WikiPathways a collection of community-defined biological pathways, one can download monthly data releases in .gmt format using the dowloadPathwayArchive() function in the rWikiPathways package from Bioconductor.3737 https://bioconductor.org/packages/release/bioc/html/rWikiPathways.html For example, the following commands3838 https://bioconductor.org/packages/release/bioc/vignettes/rWikiPathways/inst/doc/Overview.html#4_give_me_more download the latest release of the human pathways from WikiPathways to your current directory (we do not execute this code):

# DONT RUN
library(rWikiPathways)
# library(XML) # necessary if you encounter an error with readHTMLTable
downloadPathwayArchive(
  organism = "Homo sapiens", format = "gmt"
)

trying URL 'http://data.wikipathways.org/current/gmt/wikipathways-20190110-gmt-Homo_sapiens.gmt'
Content type '' length 174868 bytes (170 KB)
downloaded 170 KB

#> [1] "wikipathways-20190110-gmt-Homo_sapiens.gmt"

pathwayPCA includes the June 2018 Wikipathways collection for homo sapiens, which can be loaded using the read_gmt function:

dataDir_path <- system.file(
  "extdata", package = "pathwayPCA", mustWork = TRUE
)
wikipathways_PC <- read_gmt(
  paste0(dataDir_path, "/wikipathways_human_symbol.gmt"),
  description = TRUE
)

This pathway collection contains:

pathways: a list of character vectors matching some of the names of the -omes recorded in the assay
TERMS: a character vector of the names of the pathways
description: (OPTIONAL) descriptions or URLs of the pathways

wikipathways_PC
#> Object with Class(es) 'pathwayCollection', 'list' [package 'pathwayPCA'] with 3 elements: 
#>  $ pathways   :List of 457
#>  $ TERMS      : chr [1:457] "WP23" ...
#>  $ description: chr [1:457] "B Cell Receptor Signaling Pathway" ...

1.4.4 Methods

Pathway-based Adaptive, Elastic-net, Sparse PCA (AESPCA)3939 Chen (2011) combines the following methods into a single objective function: Elastic-Net4040 Zou and Hastie, 2005 , Adaptive Lasso4141 Zou, 2012 , and Sparse PCA4242 Zou, Hastie, and Tibshirani, 2012 . AESPCA extracts principal components from pathway \(k\) which minimize this composite objective function. These extracted PCs represent activities within each pathway. The estimated latent variables are then tested against phenotypes using either a parametric or permutation test (permuting the sample responses). Note that the AESPCA approach does not use the response information to estimate pathway PCs, so it is an unsupervised approach.

Pathway-based Supervised PCA (SuperPCA):4343 Chen et al. (2008) 4444 Chen et al. (2010) ranks each feature in pathway \(k\) by its univariate relationship with the outcome of interest (survival time, tumor size, cancer subtype, etc.), then extracts principal components from the features most related to the outcome. Because of this gene selection step, this method is “supervised”. Therefore, test statistics from the SuperPCA model can no longer be approximated using the Student’s \(t\)-distribution. To account for the gene selection step, SuperPCA as implemented in pathwayPCA estimates \(p\)-values from a two-component mixture of Gumbel extreme value distributions instead.

1.5 Case Study 1: Identifying Significant Pathways in Protein Expressions Associated with Survival Outcome in Ovarian Cancer

1.5.1 Ovarian cancer dataset

For this example, we will use the mass spectrometry based global proteomics data for ovarian cancer recently generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC).4545 https://proteomics.cancer.gov/programs/cptac The gene-level, log-ratio normalized protein abundance expression dataset for tumor samples can be obtained from the LinkedOmics database.4646 http://linkedomics.org/data_download/TCGA-OV/ We used the dataset “Proteome (PNNL, Gene level)” which was generated by the Pacific Northwest National Laboratory (PNNL).4747 https://www.pnnl.gov/ One subject was removed due to missing survival outcome. Proteins with greater than 20% missingness were removed from the assay. The remaining missing values were imputed using the Bioconductor package impute under default settings. The final dataset consisted of 5162 protein expression values for 83 samples.

1.5.2 Creating an `Omics` data object for pathway analysis

First, we need to create an Omics-class data object that stores

the expression dataset
phenotype information for the samples
a collection of pathways

1.5.2.1 Expression and phenotype data

We can obtain protein expression and phenotype datasets for the TCGA ovarian cancer dataset by downloading the raw files from the LinkedOmics website. However, for ease and to save time, we include cleaned and imputed versions of these datasets within this package (for the cleaning process, please see the clean_multi_omics.R file in inst/scripts).

The ovProteome_df dataset is a data frame with protein expression levels and TCGA sample IDs. The variables (columns) include expression data for 4763 proteins for each of the 83 primary tumour samples.

data("ovProteome_df")
ovProteome_df[, 1:5]
#> # A tibble: 83 x 5
#>    Sample          A1BG    A2M    AAAS    AAK1
#>    <chr>          <dbl>  <dbl>   <dbl>   <dbl>
#>  1 TCGA-13-1489  0.0154 -0.419  0.0359  0.014 
#>  2 TCGA-42-2590 -0.508  -0.769 -0.0443  0.109 
#>  3 TCGA-36-2529 -0.148   0.209 -0.212   0.599 
#>  4 TCGA-24-1105  0.302  -0.343  0.116  -0.403 
#>  5 TCGA-29-1785 -0.584  -0.353 -0.192  -0.220 
#>  6 TCGA-24-2290 -0.954  -0.670 -0.0364  0.590 
#>  7 TCGA-24-1428 -0.352  -1.34   0.0283 -0.200 
#>  8 TCGA-24-1923  0.245  -0.471  0.0789 -0.0568
#>  9 TCGA-24-1563  0.0212  0.242  0.0636  0.785 
#> 10 TCGA-24-1430 -0.205  -0.154  0.0028  0.105 
#> # ... with 73 more rows

The ovPheno_df data frame contains TCGA sample IDs, overall survival time, and overall survival censoring status.

data("ovPheno_df")
ovPheno_df
#> # A tibble: 565 x 3
#>    Sample       OS_time OS_status
#>    <chr>          <dbl>     <int>
#>  1 TCGA-3P-A9WA     420         0
#>  2 TCGA-59-A5PD     624         1
#>  3 TCGA-5X-AA5U     361         0
#>  4 TCGA-04-1331    1336         1
#>  5 TCGA-04-1332    1247         1
#>  6 TCGA-04-1335      55         1
#>  7 TCGA-04-1336    1495         0
#>  8 TCGA-04-1337      61         1
#>  9 TCGA-04-1338    1418         0
#> 10 TCGA-04-1341      33         0
#> # ... with 555 more rows

1.5.2.2 Create an `OmicsSurv` data container

Now that we have these three data components (pathway collection, proteomics, and clinical responses), we create an OmicsSurv data container. Note that when assayData_df and response are supplied from two different files, the user must match and merge these data sets by sample IDs. The assay and response must have matching row names, otherwise the function will error.

ov_OmicsSurv <- CreateOmics(
  # protein expression data
  assayData_df = ovProteome_df, 
  # pathway collection
  pathwayCollection_ls = wikipathways_PC,
  # survival phenotypes
  response = ovPheno_df,
  # phenotype is survival data
  respType = "survival",
  # retain pathways with > 5 proteins
  minPathSize = 5  
)
There are 83 samples shared by the assay and phenotype data.

  ======  Creating object of class OmicsSurv  =======
The input pathway database included 5831 unique features.
The input assay dataset included 4763 features.
Only pathways with at least 5 or more features included in the assay dataset are
  tested (specified by minPathSize parameter). There are 312 pathways which meet
  this criterion.
Because pathwayPCA is a self-contained test (PMID: 17303618), only features in
  both assay data and pathway database are considered for analysis. There are 1936
  such features shared by the input assay and pathway database.

To see a summary of the Omics data object we just created, simply type the name of the object:

ov_OmicsSurv
#> Formal class 'OmicsSurv' [package "pathwayPCA"] with 6 slots
#>   ..@ eventTime            : num [1:83] 2279 3785 2154 2553 2856 ...
#>   ..@ eventObserved        : logi [1:83] TRUE TRUE TRUE TRUE TRUE TRUE ...
#>   ..@ assayData_df         :Classes 'tbl_df', 'tbl' and 'data.frame':    83 obs. of  4763 variables:
#>   ..@ sampleIDs_char       : chr [1:83] "TCGA-09-1664" "TCGA-13-1484" "TCGA-13-1488" "TCGA-13-1489" ...
#>   ..@ pathwayCollection    :List of 4
#>   .. ..- attr(*, "class")= chr [1:2] "pathwayCollection" "list"
#>   ..@ trimPathwayCollection:List of 5
#>   .. ..- attr(*, "class")= chr [1:2] "pathwayCollection" "list"

1.5.3 Testing pathway association with phenotypes

Once we have a valid Omics object, we can perform pathway analysis using the AESPCA or SuperPCA methodology as described above. Because the syntax for performing SuperPCA is nearly identical to the AESPCA syntax, we will illustrate only the AESPCA workflow below.

1.5.3.1 Implementation

We estimate pathway significance via the following model: phenotype ~ intercept + PC1. Pathway \(p\)-values are estimated based on a likelihood ratio test that compares this model to the null model phenotype ~ intercept. Note that when the value supplied to the numReps argument is greater than 0, the AESPCA_pvals() function employs a non-parametric, permutation-based test to assign the statistical significance of this model.

ovarian_aespcOut <- AESPCA_pVals(
  # The Omics data container
  object = ov_OmicsSurv,
  # One principal component per pathway
  numPCs = 1,
  # Use parallel computing with 2 cores
  parallel = TRUE,
  numCores = 2,
  # # Use serial computing
  # parallel = FALSE,
  # Estimate the p-values parametrically (AESPCA only)
  numReps = 0,
  # Control FDR via Benjamini-Hochberg
  adjustment = "BH"
)
Part 1: Calculate Pathway AES-PCs
Initializing Computing Cluster: DONE
Extracting Pathway PCs in Parallel: DONE

Part 2: Calculate Pathway p-Values
Initializing Computing Cluster: DONE
Extracting Pathway p-Values in Parallel: DONE

Part 3: Adjusting p-Values and Sorting Pathway p-Value Data Frame
DONE

This ovarian_aespcOut object contains 3 components: a table of pathway \(p\)-values, AESPCA-estimated PCs of each sample from each pathway, and the loadings of each protein onto the AESPCs.

names(ovarian_aespcOut)
#> [1] "pVals_df"    "PCs_ls"      "loadings_ls"

We haven’t yet added a print() method for these analysis outputs, so be careful: use the two accessor functions, getPathpVals() and getPathPCLs() instead. We will show examples of these functions next.

1.5.3.2 Table of pathway analysis results

For this ovarian cancer dataset, the top 20 most significant pathways identified by AESPCA are:

getPathpVals(ovarian_aespcOut)

terms	description	rawp	FDR_BH
WP2036	TNF related weak inducer of apoptosis (TWEAK) Signaling Pathway	0.0019066	0.3301101
WP363	Wnt Signaling Pathway	0.0046961	0.3301101
WP3850	Factors and pathways affecting insulin-like growth factor (IGF1)-Akt signaling	0.0079715	0.3301101
WP2447	Amyotrophic lateral sclerosis (ALS)	0.0082310	0.3301101
WP262	EBV LMP1 signaling	0.0116029	0.3301101
WP2795	Cardiac Hypertrophic Response	0.0116402	0.3301101
WP2840	Hair Follicle Development: Cytodifferentiation (Part 3 of 3)	0.0120476	0.3301101
WP3617	Photodynamic therapy-induced NF-kB survival signaling	0.0128125	0.3301101
WP195	IL-1 signaling pathway	0.0144856	0.3301101
WP3680	Association Between Physico-Chemical Features and Toxicity Associated Pathways	0.0157976	0.3301101
WP75	Toll-like Receptor Signaling Pathway	0.0163042	0.3301101
WP2018	RANKL/RANK (Receptor activator of NFKB (ligand)) Signaling Pathway	0.0175655	0.3301101
WP2203	Thymic Stromal LymphoPoietin (TSLP) Signaling Pathway	0.0181492	0.3301101
WP2064	Neural Crest Differentiation	0.0181831	0.3301101
WP2877	Vitamin D Receptor Pathway	0.0204298	0.3301101
WP1434	Osteopontin Signaling	0.0213633	0.3301101
WP408	Oxidative Stress	0.0221542	0.3301101
WP4136	Fibrin Complement Receptor 3 Signaling Pathway	0.0231842	0.3301101
WP4148	Splicing factor NOVA regulated synaptic proteins	0.0238131	0.3301101
WP2324	AGE/RAGE pathway	0.0266193	0.3301101

1.5.3.3 (OPTIONAL) Column chart of significant pathways

Before constructing a graph of the \(p\)-values, we extract the top 10 pathways (the default value for numPaths is 20). The score = TRUE returns the negative natural logarithm of the unadjusted \(p\)-values for each pathway (to enhance graphical display of the top pathways).

ovOutGather_df <- getPathpVals(ovarian_aespcOut, score = TRUE, numPaths = 10)

Now we plot the pathway significance level for the top 20 pathways.

ggplot(ovOutGather_df) +
  # set overall appearance of the plot
  theme_bw() +
  # Define the dependent and independent variables
  aes(x = reorder(terms, score), y = score) +
  # From the defined variables, create a vertical bar chart
  geom_col(position = "dodge", fill = "#005030", width = 0.7) +
  # Add a line showing the alpha = 0.0001 level
  geom_hline(yintercept = -log10(0.0001), size = 1, color = "#f47321") +
  # Add pathway labels
  geom_text(
    aes(x = reorder(terms, score), label = reorder(description, score), y = 0.1),
    color = "white",
    size = 4, 
    hjust = 0
  ) +
  # Set main and axis titles
  ggtitle("AESPCA Significant Pathways: Ovarian Cancer") +
  xlab("Pathways") +
  ylab("Negative Log10 (p-Value)") +
  # Flip the x and y axes
  coord_flip()

1.5.3.4 Extract relevant genes from significant pathways

Because pathways are defined a priori, typically only a subset of genes within each pathway are relevant to the phenotype and contribute to pathway significance. In AESPCA, these relevant genes are the genes with nonzero loadings in the first PC of AESPCs.

For example, we know that the IL-1 Signaling Pathway is one of cancer’s “usual suspects”.4848 https://doi.org/10.1371/journal.pcbi.1003470 For the “IL-1 signaling pathway” (Wikipathways WP1954949 https://www.wikipathways.org/index.php/Pathway:WP195), we can extract the PCs and their protein Loadings using the getPathPCLs() function:

wp195PCLs_ls <- getPathPCLs(ovarian_aespcOut, "WP195")
wp195PCLs_ls
#> $PCs
#> # A tibble: 83 x 2
#>    sampleID          V1
#>    <chr>          <dbl>
#>  1 TCGA-09-1664 -2.90  
#>  2 TCGA-13-1484  0.0985
#>  3 TCGA-13-1488 -0.0724
#>  4 TCGA-13-1489  0.201 
#>  5 TCGA-13-1494 -0.506 
#>  6 TCGA-13-1495 -0.360 
#>  7 TCGA-13-1499 -0.529 
#>  8 TCGA-13-2071  2.11  
#>  9 TCGA-23-1123 -0.874 
#> 10 TCGA-23-1124  1.52  
#> # ... with 73 more rows
#> 
#> $Loadings
#> # A tibble: 25 x 2
#>    featureID    PC1
#>    <chr>      <dbl>
#>  1 MAPK14    0     
#>  2 AKT1      0     
#>  3 IKBKB     0.531 
#>  4 NFKB1     0.528 
#>  5 PIK3R1    0.191 
#>  6 PLCG1     0     
#>  7 MAPK1     0.0882
#>  8 MAP2K1    0.300 
#>  9 MAP2K2    0     
#> 10 MAP2K6    0     
#> # ... with 15 more rows
#> 
#> $pathway
#> [1] "path22"
#> 
#> $term
#> [1] "WP195"
#> 
#> $description
#> [1] "IL-1 signaling pathway"

The proteins with non-zero loadings can be extracted as follows:

wp195Loadings_df <- 
  wp195PCLs_ls$Loadings %>% 
  filter(PC1 != 0)
wp195Loadings_df
#> # A tibble: 8 x 2
#>   featureID    PC1
#>   <chr>      <dbl>
#> 1 IKBKB     0.531 
#> 2 NFKB1     0.528 
#> 3 PIK3R1    0.191 
#> 4 MAPK1     0.0882
#> 5 MAP2K1    0.300 
#> 6 TAB1      0.0992
#> 7 MAPK3     0.324 
#> 8 SQSTM1    0.436

1.5.3.5 (OPTIONAL) Plot protein loadings for a single pathway

We can also prepare these loadings for graphics:

wp195Loadings_df <- 
  wp195Loadings_df %>% 
  # Sort Loading from Strongest to Weakest
  arrange(desc(abs(PC1))) %>% 
  # Order the Genes by Loading Strength
  mutate(featureID = factor(featureID, levels = featureID)) %>% 
  # Add Directional Indicator (for Colour)
  mutate(Direction = factor(ifelse(PC1 > 0, "Up", "Down")))

Now we will construct a column chart with ggplot2’s geom_col() function.

ggplot(data = wp195Loadings_df) +
  # Set overall appearance
  theme_bw() +
  # Define the dependent and independent variables
  aes(x = featureID, y = PC1, fill = Direction) +
  # From the defined variables, create a vertical bar chart
  geom_col(width = 0.5, fill = "#005030", color = "#f47321") +
  # Set main and axis titles
  labs(
    title = "Gene Loadings on IL-1 Signaling Pathway",
    x = "Protein",
    y = "Loadings of PC1"
  ) +
  # Remove the legend
  guides(fill = FALSE)

Alternatively, we can also plot the correlation of each gene with first PC for each gene. These correlations can be computed by using the TidyCorrelation() function in pathwayPCA’s vignette “Chapter 5 - Visualizing the Results”, Section 3.3.5050 https://gabrielodom.github.io/pathwayPCA/articles/Supplement5-Analyse_Results.html#gene-correlations-with-pcs

1.5.3.6 Subject-specific PCA estimates

In the study of complex diseases, there is often considerable heterogeneity among different subjects with regard to underlying causes of disease and benefit of particular treatment. Therefore, in addition to identifying disease-relevant pathways for the entire patient group, successful (personalized) treatment regimens will also depend upon knowing if a particular pathway is dysregulated for an individual patient.

To this end, we can also assess subject-specific pathway activity. As we saw earlier, the getPathPCLs() function also returns subject-specific estimates for the individual pathway PCs. We can plot these as follows.

ggplot(data = wp195PCLs_ls$PCs) +
  # Set overall appearance
  theme_classic() + 
  # Define the independent variable
  aes(x = V1) +
  # Add the histogram layer
  geom_histogram(bins = 10, color = "#005030", fill = "#f47321") +
  # Set main and axis titles
  labs(
    title = "Distribution of Sample-specific Estimate of Pathway Activities",
    subtitle = paste0(wp195PCLs_ls$pathway, ": ", wp195PCLs_ls$description),
    x = "PC1 Score for Each Sample",
    y = "Count"
  )

This graph shows there can be considerable heterogeneity in pathway activities between the patients.

1.5.3.7 Extract analysis dataset for entire pathway

Users are often also interested in examining the actual data used for analysis of the top pathways, especially for the relevant genes with the pathway. To extract this dataset, we can use the SubsetPathwayData() function. These commands extract data for the IL-1 signaling pathway:

wp195Data_df <- SubsetPathwayData(ov_OmicsSurv, "WP195")
wp195Data_df
#> # A tibble: 83 x 28
#>    sampleID EventTime EventObs  MAPK14    AKT1   IKBKB   NFKB1  PIK3R1
#>    <chr>        <dbl> <lgl>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 TCGA-09~      2279 TRUE     -0.995   0.958  -1.70   -0.730  -1.35  
#>  2 TCGA-13~      3785 TRUE      1.71    0.575  -0.744   0.783   1.18  
#>  3 TCGA-13~      2154 TRUE     -0.127  -0.542  -0.908   0.418   0.324 
#>  4 TCGA-13~      2553 TRUE     -0.0591 -0.291  -0.416  -0.488  -2.37  
#>  5 TCGA-13~      2856 TRUE      0.610   1.25   -0.943   0.394   0.309 
#>  6 TCGA-13~      2749 TRUE      0.265   1.79   -0.515  -0.0499 -0.0504
#>  7 TCGA-13~      3500 FALSE     0.516  -0.519  -0.487   0.104   1.54  
#>  8 TCGA-13~       773 TRUE     -0.107   0.323   1.05    1.27    1.35  
#>  9 TCGA-23~      1018 TRUE     -0.419  -3.63    0.0398 -0.114  -0.0345
#> 10 TCGA-23~      1768 TRUE     -1.81   -0.0215  0.817   1.17    0.567 
#> # ... with 73 more rows, and 20 more variables: PLCG1 <dbl>, MAPK1 <dbl>,
#> #   MAP2K1 <dbl>, MAP2K2 <dbl>, MAP2K6 <dbl>, PTPN11 <dbl>, RELA <dbl>,
#> #   MAP3K7 <dbl>, IKBKG <dbl>, TAB1 <dbl>, IRAK4 <dbl>, ECSIT <dbl>,
#> #   TOLLIP <dbl>, MAPK3 <dbl>, MAP2K3 <dbl>, MAP2K4 <dbl>, TRAF6 <dbl>,
#> #   UBE2N <dbl>, SQSTM1 <dbl>, MAPKAPK2 <dbl>

1.5.3.8 Gene-specific CoxPH model

We can also perform analysis for individual genes belonging to the pathway:

library(survival)
NFKB1_df <- 
  wp195Data_df %>% 
  select(EventTime, EventObs, NFKB1)

wp195_mod <- coxph(
  Surv(EventTime, EventObs) ~ NFKB1,
  data = NFKB1_df
)

Now we inspect the output from the Cox PH model (we’ve included the pretty version using the prettify() function from the papeR package5151 https://CRAN.R-project.org/package=papeR):

summary(wp195_mod)

	coef	Hazard Ratio	CI (lower)	CI (upper)	se(coef)	z	Pr(>\|z\|)
NFKB1	0.4606061	1.585034	1.176442	2.135536	0.1521005	3.028302	0.002	**

1.5.3.9 (OPTIONAL) Gene-specific survival curves

Additionally, we can estimate Kaplan-Meier survival curves for patients with high or low expression values for individual genes:

NFKB1_df <- 
  NFKB1_df %>% 
  # Group subjects by gene expression
  mutate(NFKB1_Expr = ifelse(NFKB1 > median(NFKB1), "High", "Low")) %>% 
  # Re-code time to years
  mutate(EventTime = EventTime / 365.25) %>% 
  # Ignore any events past 10 years
  filter(EventTime <= 10)

# Fit the survival model
NFKB1_fit <- survfit(
  Surv(EventTime, EventObs) ~ NFKB1_Expr,
  data = NFKB1_df
)

Finally, we can plot these K-M curves over NFKB1 protein expression.

ggsurvplot(
  NFKB1_fit,
  # No confidence intervals; add the p-value
  conf.int = FALSE, pval = TRUE,
  # Show times to median survival
  surv.median.line = "hv",
  xlab = "Time in Years",
  palette = c("#f47321", "#005030")
)

1.6 Case study 2: An Integrative Multi-Omics Pathway Analysis of Ovarian Cancer Survival

While copy number alterations are common genomic aberrations in ovarian carcer, recent studies have shown these changes do not necessarily lead to concordant changes in protein expression. In Section 1.5.3 above, we illustrated testing pathway activities in protein expression against survival outcome. In this section, we will additionally test pathway activities in copy number against survival outcome. Moreover, we will perform integrative analysis to identify those survival-associated protein pathways, genes, and samples driven by copy number alterations.

1.6.1 Creating copy number `Omics` data object

We can identify copy number (CNV) pathways significantly associated with survival in the same way as we did for protein expressions. This gene-level CNV data was downloaded from the same LinkedOmics database.

data("ovCNV_df")
ovCNV_df[, 1:5]
#> # A tibble: 579 x 5
#>    Sample        ACAP3 ACTRT2   AGRN ANKRD65
#>    <chr>         <dbl>  <dbl>  <dbl>   <dbl>
#>  1 TCGA-04-1331 -0.703 -0.703 -0.703  -0.703
#>  2 TCGA-04-1332  0.08   0.08   0.08    0.08 
#>  3 TCGA-04-1335 -0.807 -0.807 -0.807  -0.807
#>  4 TCGA-04-1336  0.101  0.101  0.101   0.101
#>  5 TCGA-04-1337  0.021  0.021  0.021   0.021
#>  6 TCGA-04-1338 -0.999 -0.999 -0.999  -0.999
#>  7 TCGA-04-1341 -0.421 -0.421 -0.421  -0.421
#>  8 TCGA-04-1342  0.089  0.089  0.089   0.089
#>  9 TCGA-04-1343  0.279  0.279  0.279   0.279
#> 10 TCGA-04-1346 -0.396 -0.396 -0.396  -0.396
#> # ... with 569 more rows

And now we create an Omics data container. (Note: because these analysis steps take a little longer than the steps shown previously–3 minutes over 20 cores, we include the AESPCA output directly. You don’t need to execute the code in the next two chunks.)

# DONT RUN
ovCNV_Surv <- CreateOmics(
  assayData_df = ovCNV_df,
  pathwayCollection_ls = wikipathways_PC,
  response = ovPheno_df,
  respType = "survival",
  minPathSize = 5
)

1230 gene name(s) are invalid. Invalid name(s) are: ...

There are 549 samples shared by the assay and phenotype data.

  ======  Creating object of class OmicsSurv  =======
The input pathway database included 5831 unique features.
The input assay dataset included 24776 features.
Only pathways with at least 5 or more features included in the assay dataset are
  tested (specified by minPathSize parameter). There are 424 pathways which meet
  this criterion.
Because pathwayPCA is a self-contained test (PMID: 17303618), only features in
  both assay data and pathway database are considered for analysis. There are 5637
  such features shared by the input assay and pathway database.

1.6.1.1 AESPCA pathway analysis for copy-number data

Finally, we can apply the AESPCA method to this copy-number data container. Due to the large sample size, this will take a few moments.

# DONT RUN
ovCNV_aespcOut <- AESPCA_pVals(
  object = ovCNV_Surv,
  numPCs = 1,
  parallel = TRUE,
  numCores = 20,
  numReps = 0,
  adjustment = "BH"
)

Part 1: Calculate Pathway AES-PCs
Initializing Computing Cluster: DONE
Extracting Pathway PCs in Parallel: DONE

Part 2: Calculate Pathway p-Values
Initializing Computing Cluster: DONE
Extracting Pathway p-Values in Parallel: DONE

Part 3: Adjusting p-Values and Sorting Pathway p-Value Data Frame
DONE

Rather than execute this code yourself, we have included the output object with this package:

data(ovCNV_aespcOut)

1.6.2 Combine significant pathways from CNV and protein analyses

Next, we identify the intersection of significant pathways based on both CNV and protein data. First, we will create a data frame of the pathway \(p\)-values from both CNV and proteomics.

# Copy Number
CNVpvals_df <- 
  getPathpVals(ovCNV_aespcOut, alpha = 0.05) %>% 
  mutate(rawp_CNV = rawp) %>% 
  select(description, rawp_CNV)

# Proteomics
PROTpvals_df <- 
  getPathpVals(ovarian_aespcOut, alpha = 0.05) %>% 
  mutate(rawp_PROT = rawp) %>% 
  select(description, rawp_PROT)


# Intersection
SigBoth_df <- inner_join(PROTpvals_df, CNVpvals_df, by = "description")
# WnT Signaling Pathway is listed as WP363 and WP428

The results showed there are 32 pathways significantly associated with survival in both CNV and protein data, which is significantly more than expected by chance (p-value = 0.00065; Fisher’s Exact Test; shown in multi_pathway_overlap_fishers.R). Here are the top-10 most significant pathways (sorted by protein data significance):

SigBoth_df[1:10, ]

description	rawp_PROT	rawp_CNV
TNF related weak inducer of apoptosis (TWEAK) Signaling Pathway	0.0019066	0.0128430
Wnt Signaling Pathway	0.0046961	0.0303257
Wnt Signaling Pathway	0.0046961	0.0371408
Factors and pathways affecting insulin-like growth factor (IGF1)-Akt signaling	0.0079715	0.0272851
EBV LMP1 signaling	0.0116029	0.0498388
Cardiac Hypertrophic Response	0.0116402	0.0274414
Hair Follicle Development: Cytodifferentiation (Part 3 of 3)	0.0120476	0.0006223
IL-1 signaling pathway	0.0144856	0.0141512
RANKL/RANK (Receptor activator of NFKB (ligand)) Signaling Pathway	0.0175655	0.0003343
Thymic Stromal LymphoPoietin (TSLP) Signaling Pathway	0.0181492	0.0153220

1.6.2.1 Integrative pathway-based gene detection

Similar to the protein pathway analysis shown in Section 1.5.3, we can also identify relevant genes with nonzero loadings that drives pathway significance in CNV. The “IL-1 signaling pathway” (WP195) is significant in both CNV and protein data, so we look for genes with non-zero loadings in both.

# Copy Number Loadings
CNVwp195_ls <- getPathPCLs(ovCNV_aespcOut, "WP195")
CNV195load_df <- 
  CNVwp195_ls$Loadings %>% 
  filter(PC1 != 0) %>% 
  rename(PC1_CNV = PC1)

# Protein Loadings
PROTwp195_ls <- getPathPCLs(ovarian_aespcOut, "WP195")
PROT195load_df <- 
  PROTwp195_ls$Loadings %>% 
  filter(PC1 != 0) %>% 
  rename(PC1_PROT = PC1)

# Intersection
inner_join(CNV195load_df, PROT195load_df)
#> Joining, by = "featureID"
#> # A tibble: 5 x 3
#>   featureID PC1_CNV PC1_PROT
#>   <chr>       <dbl>    <dbl>
#> 1 IKBKB      0.0615   0.531 
#> 2 NFKB1      0.133    0.528 
#> 3 PIK3R1     0.0704   0.191 
#> 4 TAB1       0.0384   0.0992
#> 5 MAPK3      0.0560   0.324

The result showed that NFKB1, IKBKB, and other genes are selected by AESPCA when testing IL-1 signaling pathway (WP195) against survival outcome in both CNV and protein pathway analysis.

1.6.3 (OPTIONAL) Integrating sample-specific pathway activities

Also in Section 1.5.3, we have seen that there can be considerable heterogeneity in pathway activities between patients. One possible reason could be that copy number changes might not directly result in changes in protein expression for some of the patients. pathwayPCA can be used to estimate pathway activities for each patient, for copy number, gene, and protein expressions separately. These estimates can then be viewed jointly using a Circos plot.

The accompanying Circos plot shown normalized copy number (inner circle) and protein expression (outer circle) pathway activities for the IL-1 signaling pathway in the ovarian cancer dataset samples. Each bar corresponds to a patient sample. Red bars indicate lower expression values and negative pathway activity for the sample, while blue color bars indicate higher expression values and stronger pathway activity for the sample. Note that only some patients have concordant changes in copy number and protein expression. The code to make this plot is included in inst/scripts/circle_plot.R.

1.7 (OPTIONAL) Case Study 3: Analysis of Studies with Complex Designs

pathwayPCA is capable of analyzing studies with multiple experimental factors. In this section, we illustrate using pathwayPCA to test differential association of pathway expression with survival outcome in male and female subjects.

1.7.1 Data setup and AESPCA analysis

For this example, we used IlluminaHiSeq gene expression RNAseq data from the TCGA KIRP cohort, which we downloaded from the Xena browser datahub.5252 https://xenabrowser.net/datapages/ (This process is described in inst/scripts/explore_KIRP.R.) First, we load the KIRP data, create an Omics data container, and extract first AESPC from each pathway.

data("kirpPheno_df")
data("kirpRNAseq_df")

Once again, this will take a few moments (on my machine, this takes 1.9 minutes over 20 cores). (You can skip the next two chunks and load the results directly.)

# DONT RUN
kidney_Omics <- CreateOmics(
  assayData_df = kirpRNAseq_df,
  pathwayCollection_ls = wikipathways_PC,
  # Exclude the gender indicator for now
  response = kirpPheno_df[, 1:3],
  respType = "surv",
  minPathSize = 5
)

317 genes have variance < epsilon and will be removed. These gene(s) are: ...

30 gene name(s) are invalid. Invalid name(s) are: ...

There are 320 samples shared by the assay and phenotype data.

  ======  Creating object of class OmicsSurv  =======
The input pathway database included 5831 unique features.
The input assay dataset included 20213 features.
Only pathways with at least 5 or more features included in the assay dataset are
  tested (specified by minPathSize parameter). There are 423 pathways which meet
  this criterion.
Because pathwayPCA is a self-contained test (PMID: 17303618), only features in
  both assay data and pathway database are considered for analysis. There are 5566
  such features shared by the input assay and pathway database.

# DONT RUN
kirpRNAseq_aespcOut <- AESPCA_pVals(
  object = kidney_Omics,
  numPCs = 1,
  parallel = TRUE,
  numCores = 20,
  numReps = 0,
  adjustment = "BH"
)

Part 1: Calculate Pathway AES-PCs
Initializing Computing Cluster: DONE
Extracting Pathway PCs in Parallel: DONE

Part 2: Calculate Pathway p-Values
Initializing Computing Cluster: DONE
Extracting Pathway p-Values in Parallel: DONE

Part 3: Adjusting p-Values and Sorting Pathway p-Value Data Frame
DONE

Rather than execute this code yourself, we have included the output object with this package:

data("kirpRNAseq_aespcOut")

1.7.2 Test for sex interaction with first PC

Next we can test whether there is differential pathway association with survival outcome for males and females by the following model, \[ h(t) = h_0(t) \exp\left[ \beta_1\text{PC}_1 + \beta_2\text{male} + \beta_3(\text{PC}_1 \times \text{male}) \right]. \]

In this model, \(h(t)\) is expected hazard at time \(t\), \(h_0 (t)\) is baseline hazard when all predictors are zero, variable \(male\) is an indicator variable for male samples, and \(PC_1\) is a pathway’s estimated first principal component based on AESPCA.

In order to test the sex interaction effect for all pathways, we will write a function which tests the interaction effect for one pathway.

TestIntxn <- function(pathway, pcaOut, resp_df){
  
  # For the given pathway, extract the PCs and loadings from the pcaOut list
  PCL_ls <- getPathPCLs(pcaOut, pathway)
  # Select and rename the PC
  PC_df <- PCL_ls$PCs %>% rename(PC1 = V1)
  # Join the phenotype data to this PC
  data_df <- inner_join(resp_df, PC_df, by = c("Sample" = "sampleID"))
  
  # Construct a survival model with sex interaction
  sex_mod <- coxph(
    Surv(time, status) ~ PC1 + male + PC1 * male, data = data_df
  )
  # Extract the model fit statistics for the interaction
  modStats_mat <- t(
    coef(summary(sex_mod))["PC1:maleTRUE", ]
  )
  colnames(modStats_mat) <- c("coef", "exp_coef", "se_coef", "z", "pVal")
 
  # Return a data frame of the pathway and model statistics
  list(
    statistics = data.frame(
      terms = pathway,
      description = PCL_ls$description,
      modStats_mat,
      stringsAsFactors = FALSE
    ),
    model = sex_mod
  )

}

As an example, we can test it on pathway WP195,

TestIntxn("WP195", kirpRNAseq_aespcOut, kirpPheno_df)$model
#> Call:
#> coxph(formula = Surv(time, status) ~ PC1 + male + PC1 * male, 
#>     data = data_df)
#> 
#>                  coef exp(coef) se(coef)      z     p
#> PC1           0.03278   1.03332  0.08206  0.399 0.690
#> maleTRUE     -0.38822   0.67826  0.32827 -1.183 0.237
#> PC1:maleTRUE  0.05649   1.05812  0.10438  0.541 0.588
#> 
#> Likelihood ratio test=3.97  on 3 df, p=0.2652
#> n= 320, number of events= 51

We can also apply this function to our list of pathways.

# List of pathways
paths_char <- kirpRNAseq_aespcOut$pVals_df$terms

# Apply over this list
interactions_ls <- sapply(
    paths_char,
    FUN = TestIntxn,
    pcaOut = kirpRNAseq_aespcOut,
    resp_df = kirpPheno_df,
    simplify = FALSE
)

We will tabulate this list and sort the pathways by significance:

interactions_df <- 
  # Take list of interactions
  interactions_ls %>% 
  # select the first element (the data frame of model stats)
  lapply(`[[`, 1) %>% 
  # stack these data frames
  bind_rows() %>% 
  as_tibble() %>% 
  # sort the rows by significance
  arrange(pVal)

Now, we can inspect the pathways with significant gender-PC1 interaction (note that we have not adjusted these pathway results for false discovery rate; these examples are only to show functionality of the package).

interactions_df %>% 
    filter(pVal < 0.05)

terms	description	coef	exp_coef	se_coef	z	pVal
WP1559	TFs Regulate miRNAs related to cardiac hypertrophy	-0.5974123	0.5502336	0.2162150	-2.763048	0.0057264
WP453	Inflammatory Response Pathway	-0.2721135	0.7617678	0.1090278	-2.495817	0.0125667
WP3893	Development and heterogeneity of the ILC family	-0.3293713	0.7193758	0.1374713	-2.395928	0.0165783
WP3929	Chemokine signaling pathway	-0.2517367	0.7774494	0.1086949	-2.315995	0.0205586
WP2849	Hematopoietic Stem Cell Differentiation	-0.2782097	0.7571380	0.1286124	-2.163165	0.0305285
WP1423	Ganglio Sphingolipid Metabolism	-0.4868302	0.6145714	0.2303461	-2.113473	0.0345603
WP3892	Development of pulmonary dendritic cells and macrophage subsets	-0.3162134	0.7289039	0.1508071	-2.096807	0.0360107
WP3678	Amplification and Expansion of Oncogenic Pathways as Metastatic Traits	-0.3569601	0.6998004	0.1705141	-2.093434	0.0363104
WP4141	PI3K/AKT/mTOR - VitD3 Signalling	0.3137007	1.3684801	0.1533151	2.046117	0.0407448
WP3941	Oxidative Damage	-0.3536377	0.7021293	0.1757314	-2.012376	0.0441803
WP3863	T-Cell antigen Receptor (TCR) pathway during Staphylococcus aureus infection	-0.2327905	0.7923196	0.1167679	-1.993616	0.0461940
WP3872	Regulation of Apoptosis by Parathyroid Hormone-related Protein	0.2670302	1.3060799	0.1345790	1.984188	0.0472348
WP3672	LncRNA-mediated mechanisms of therapeutic resistance	-0.3923992	0.6754344	0.1979623	-1.982191	0.0474578
WP3967	miR-509-3p alteration of YAP1/ECM axis	-0.2994116	0.7412542	0.1514701	-1.976705	0.0480750

The results showed the most significant pathway is WP1559 (“TFs Regulate miRNAs related to cardiac hypertrophy”). We can inspect the model results for this pathway directly.

summary(interactions_ls[["WP1559"]]$model)
#> Call:
#> coxph(formula = Surv(time, status) ~ PC1 + male + PC1 * male, 
#>     data = data_df)
#> 
#>   n= 320, number of events= 51 
#> 
#>                 coef exp(coef) se(coef)      z Pr(>|z|)    
#> PC1           0.7534    2.1242   0.1782  4.228 2.36e-05 ***
#> maleTRUE      0.1825    1.2002   0.4268  0.427  0.66903    
#> PC1:maleTRUE -0.5974    0.5502   0.2162 -2.763  0.00573 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>              exp(coef) exp(-coef) lower .95 upper .95
#> PC1             2.1242     0.4708    1.4980    3.0122
#> maleTRUE        1.2002     0.8332    0.5199    2.7705
#> PC1:maleTRUE    0.5502     1.8174    0.3602    0.8406
#> 
#> Concordance= 0.654  (se = 0.047 )
#> Likelihood ratio test= 22.51  on 3 df,   p=5e-05
#> Wald test            = 30.03  on 3 df,   p=1e-06
#> Score (logrank) test = 33.09  on 3 df,   p=3e-07

The results showed that although sex is not significantly associated with survival outcome, the association of pathway gene expression (PC1) with survival is highly dependent on sex of the samples.

1.7.3 Survival curves by sex interaction

For visualization, we first match the PC expression to survival outcome.

# Bind the Pheno Data to WP1559's First PC
kidneyWP1559_df <- inner_join(
  kirpPheno_df,
  getPathPCLs(kirpRNAseq_aespcOut, "WP1559")$PCs,
  by = c("Sample" = "sampleID")
)

Now we divide subjects according to sex and high or low PC-expression, and add a group indicator for the four groups.

# Add Grouping Feature
kidneySurvWP1559grouped_df <- 
  kidneyWP1559_df %>% 
  rename(PC1 = V1) %>% 
  # add strength indicator
  mutate(direction = ifelse(PC1 > median(PC1), "High", "Low")) %>% 
  # group by interaction of sex and strength on PC
  mutate(group = paste0(direction, ifelse(male, "_Male", "_Female"))) %>% 
  # recode time in years
  mutate(time = time / 365.25) %>% 
  # remove summarized columns
  select(-male, -PC1, -direction)

Now we can plot survival curves for the four groups.

# Fit the survival model
fit <- survfit(
  Surv(time, status) ~ group,
  data = kidneySurvWP1559grouped_df
)

ggsurvplot(
  fit, conf.int = FALSE, 
  xlab = "Time in Years",
  ylim = c(0.4, 1),
  xlim = c(0, 10)
)

These Kaplan-Meier curves show that while high or low pathway activities were not significantly associated with survival in male subjects (green and purple curves, respectively), female subjects with high pathway activities (red) had significantly worse survival outcomes than those with low pathway activities (blue).

1.8 Further Reading

For addtional information on pathwayPCA, each of our supplementary vignette chapters contain detailed tutorials on the topics discussed above. These vignettes are:

1.9 References

Chen, X., Wang, L., Smith, J.D. and Zhang, B. (2008) Supervised principal component analysis for gene set enrichment of microarray data with continuous or survival outcomes. Bioinformatics, 24, 2474-2481.

Chen, X., Wang, L., Hu, B., Guo, M., Barnard, J. and Zhu, X. (2010) Pathway-based analysis for genome-wide association studies using supervised principal components. Genetic epidemiology, 34, 716-724.

Chen, X. (2011) Adaptive elastic-net sparse principal component analysis for pathway association testing. Statistical applications in genetics and molecular biology, 10.

The R session information for this vignette is:

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] Bioc2019pathwayPCA_0.1.0 forcats_0.4.0           
#>  [3] stringr_1.4.0            dplyr_0.8.1             
#>  [5] purrr_0.3.2              readr_1.3.1             
#>  [7] tidyr_0.8.3              tibble_2.1.3            
#>  [9] tidyverse_1.2.1          pathwayPCA_1.0.0        
#> [11] survminer_0.4.4          ggpubr_0.2              
#> [13] magrittr_1.5             ggplot2_3.1.1           
#> [15] survival_2.44-1.1        papeR_1.0-4             
#> [17] xtable_1.8-4             car_3.0-3               
#> [19] carData_3.0-2            knitr_1.23              
#> [21] BiocStyle_2.12.0        
#> 
#> loaded via a namespace (and not attached):
#>  [1] nlme_3.1-140       cmprsk_2.2-8       lubridate_1.7.4   
#>  [4] gmodels_2.18.1     httr_1.4.0         tools_3.6.0       
#>  [7] backports_1.1.4    utf8_1.1.4         R6_2.4.0          
#> [10] lazyeval_0.2.2     colorspace_1.4-1   withr_2.1.2       
#> [13] tidyselect_0.2.5   gridExtra_2.3      curl_3.3          
#> [16] compiler_3.6.0     cli_1.1.0          rvest_0.3.4       
#> [19] xml2_1.2.0         labeling_0.3       bookdown_0.11     
#> [22] scales_1.0.0       survMisc_0.5.5     digest_0.6.19     
#> [25] foreign_0.8-71     rmarkdown_1.13     rio_0.5.16        
#> [28] pkgconfig_2.0.2    htmltools_0.3.6    highr_0.8         
#> [31] rlang_0.3.4        readxl_1.3.1       rstudioapi_0.10   
#> [34] generics_0.0.2     zoo_1.8-6          jsonlite_1.6      
#> [37] gtools_3.8.1       zip_2.0.2          lars_1.2          
#> [40] Matrix_1.2-17      Rcpp_1.0.1         munsell_0.5.0     
#> [43] fansi_0.4.0        abind_1.4-5        stringi_1.4.3     
#> [46] yaml_2.2.0         MASS_7.3-51.4      plyr_1.8.4        
#> [49] grid_3.6.0         parallel_3.6.0     gdata_2.18.0      
#> [52] crayon_1.3.4       lattice_0.20-38    haven_2.1.0       
#> [55] splines_3.6.0      hms_0.4.2          zeallot_0.1.0     
#> [58] pillar_1.4.1       glue_1.3.1         evaluate_0.14     
#> [61] data.table_1.12.2  BiocManager_1.30.4 modelr_0.1.4      
#> [64] vctrs_0.1.0        cellranger_1.1.0   gtable_0.3.0      
#> [67] km.ci_0.5-2        assertthat_0.2.1   xfun_0.7          
#> [70] openxlsx_4.1.0.1   broom_0.5.2        KMsurv_0.1-5

Integrative Pathway Analysis with pathwayPCA

2019-06-17

Contents