RNASeq from biopsied tumors from advanced GC patients treated with Ramucirumab

RNASeq report

This report is generated after the Lilly RNASeq rollup pipeline has completed. The pipeline is run to generate gene-level counts. In summary, Rollup is a tool that summarizes (rolls up) the exon and junction read count data to arrive at a single expression value for each gene in each sample. More details regarding rollup are available here (http://lillypedia.am.lilly.com/wiki/index.php/Ngs_gene_rollup).
Briefly, the recommended parameters have been used:
- linear model normalization is not performed (-fn)
- reads are aggregated (i.e. ‘-agg sum’)
- ‘-mq -1 -qt 80 -ct 10’ removes exons whose 80th percentile of summed raw read counts is < 10 from any further analysis

Numerous count data files are available for the user to use and investigate.
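
The exon filter set by ‘-qt 80 -ct 10’ can be illustrated with a minimal sketch. It assumes each exon has one summed raw read count per sample and uses a linear-interpolation percentile. The percentile definition actually used by rollup, and the function and exon names below, are assumptions for illustration and not the pipeline's code.

```python
import math

def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence")
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def filter_exons(exon_counts, quantile=80, count_threshold=10):
    """Keep exons whose percentile of summed raw counts across samples reaches the threshold."""
    return {
        exon: counts
        for exon, counts in exon_counts.items()
        if percentile(counts, quantile) >= count_threshold
    }

# Hypothetical exons with summed raw read counts in five samples.
exon_counts = {
    "exon_A": [0, 2, 3, 5, 8],     # 80th percentile is below 10, so it is removed
    "exon_B": [0, 0, 12, 40, 55],  # 80th percentile is above 10, so it is kept
}
kept = filter_exons(exon_counts)
```

With these hypothetical counts only `exon_B` survives, even though it has zero reads in two samples, because the filter looks at the upper part of the distribution rather than the minimum.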

The unnormalized rollup data is placed in the directory <results>/summarization/results_no_Norm, with one file per rollup method:

rlm (Robust linear model rollup)
meanlog (Mean of log2 exon count)
group (Group (cluster) exons then robust linear model rollup by cluster)
logsum (log2 of exon count sum rollup)
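
To make the difference between two of these summaries concrete, here is a minimal sketch of meanlog and logsum for the exons of a single gene in a single sample. The pseudocount of 1 is an assumption made here to avoid log2(0); how rollup handles zero counts is not stated in this report. The rlm and group methods involve model fitting and are not sketched.

```python
import math

PSEUDOCOUNT = 1.0  # assumption: not taken from the rollup documentation

def meanlog(exon_counts):
    """Mean of the log2 exon counts."""
    return sum(math.log2(c + PSEUDOCOUNT) for c in exon_counts) / len(exon_counts)

def logsum(exon_counts):
    """log2 of the summed exon counts."""
    return math.log2(sum(exon_counts) + PSEUDOCOUNT)

# Hypothetical exon counts for one gene in one sample.
exons = [3, 7, 15]
gene_meanlog = meanlog(exons)  # average of log2(4), log2(8), log2(16)
gene_logsum = logsum(exons)    # log2(26)
```

meanlog treats every exon equally on the log scale, whereas logsum is dominated by the most highly covered exons.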

The quantile normalized data is placed in the directory <results>/summarization/results_quantileNorm.

Each of the above-mentioned files is quantile normalized.
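
For readers unfamiliar with quantile normalization, the sketch below shows the textbook procedure: every sample is forced onto a common reference distribution, taken as the mean of the sorted values across samples. It is a generic illustration, not the pipeline's implementation. Ties are broken by gene order here, which is a simplification.

```python
def quantile_normalize(columns):
    """Quantile-normalize samples given as equal-length lists (one list per sample)."""
    n_genes = len(columns[0])
    if any(len(col) != n_genes for col in columns):
        raise ValueError("all samples must have the same number of genes")
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n_genes)
    ]
    normalized = []
    for col in columns:
        # Gene indices ordered from lowest to highest value in this sample.
        order = sorted(range(n_genes), key=lambda i: col[i])
        out = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            out[gene_idx] = reference[rank]
        normalized.append(out)
    return normalized

# Two hypothetical samples with three genes each.
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
```

After normalization both samples contain exactly the same set of values; only the assignment of values to genes differs between them.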

rpm (Reads Per Million Reads) values for genes are also available after rollup in the same directories. Simplistic rpm counts prior to rollup are available for the study in <results>/raw_counts/simplistic_sample_level_rpm
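
As a reference for the rpm values, a minimal sketch of the calculation for one sample follows. The denominator is assumed here to be the total number of reads assigned to genes in that sample; the pipeline may use a different total (for example all mapped reads), and the gene names are hypothetical.

```python
def reads_per_million(gene_counts):
    """Scale raw gene counts of one sample to reads per million reads."""
    total = sum(gene_counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: count * 1_000_000 / total for gene, count in gene_counts.items()}

# Hypothetical raw counts for one sample.
rpm = reads_per_million({"GENE1": 250, "GENE2": 750})
```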

Samples and conditions

There are a total of 53 samples in this study. This count is the number of “sample.ids” in the study.

Investigation of the assay level data

Before investigating the gene counts generated by the Lilly RNASeq rollup pipeline, it may be useful to review the counts for each assay. Some studies have multiple assays per sample, while others have just a single assay per sample.

Specifically, in this study all assays for each sample have correlations above 0.80. Thus the assay-level gene count data can be combined for each sample using rollup.
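
The check described above can be sketched as a pairwise correlation between the assays of one sample. Pearson correlation is assumed here; the report does not state which correlation coefficient or count scale the pipeline uses, and the assay names and values are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def assays_agree(assays, threshold=0.80):
    """True if every pair of assays for a sample correlates above the threshold."""
    return all(
        pearson(assays[a], assays[b]) > threshold for a, b in combinations(assays, 2)
    )

# Hypothetical gene counts for two assays of the same sample.
consistent = {"assay_1": [1, 2, 3, 4], "assay_2": [2, 4, 6, 8.5]}
# Adding a discordant third assay makes the check fail.
discordant = dict(consistent, assay_3=[4, 3, 2, 1])
```

A sample with a single assay has no pairs to compare, so `assays_agree` returns `True` for it.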

Principal Component Analysis of the gene counts from each assay

Correlation of RNASeq metrics (such as total read count and 3’ bias) with the PCs from the assay-level gene counts

Investigation of the variance in the RNASeq gene counts from each assay. These are the raw gene counts from each assay. To elaborate, this analysis is performed before the gene counts are put through the Lilly rollup pipeline, which aggregates the exon and junction counts from assays to genes per sample. This pre-rollup analysis helps determine whether any assays are problematic and need further attention on a case-by-case basis.

Each assay is assessed for outlier status by PC analysis.

An outlier is an observation that is numerically distant from the rest of the data (more than 1.5 times the interquartile range above the upper quartile or below the lower quartile). The first two PCs are investigated for association with the various pre-calculated RNASeq metrics. This may help address the question of “why” certain assays appear to be outliers.
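
The 1.5 x IQR rule can be sketched as follows for the scores of the first two PCs. The PC scores and assay labels are hypothetical, and the quartile method ("inclusive" in Python's `statistics.quantiles`) is an assumption; different boxplot implementations compute quartiles slightly differently.

```python
import statistics

def iqr_outliers(scores, k=1.5):
    """Return labels whose score lies more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(list(scores.values()), n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return sorted(label for label, s in scores.items() if s < lower or s > upper)

# Hypothetical PC scores for six assays.
pc1 = {"A1": -1.0, "A2": -0.5, "A3": 0.0, "A4": 0.5, "A5": 1.0, "A6": 9.0}
pc2 = {"A1": 0.2, "A2": -0.1, "A3": 0.0, "A4": 0.1, "A5": -0.2, "A6": 0.05}

# An assay is flagged if it is an outlier on either of the first two PCs.
flagged = sorted(set(iqr_outliers(pc1)) | set(iqr_outliers(pc2)))
```

Here only `A6` is flagged, because its PC1 score lies far above the upper fence while all PC2 scores fall within the fences.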

Investigation of the sample level data

From this point forward, the gene counts for the samples after the Lilly RNASeq rollup are investigated for outliers.

Expression distributions boxplot

Sample Outliers

The Lilly internal pipeline issues warnings on samples, which are reported in the design file. In addition, sample-level PC outliers are identified from the gene counts obtained after the Lilly rollup pipeline. The table below lists the samples with a pipeline warning, PC outlier status, or both.

	Sample.id	Sample warning reason	Additional Reason	sample.group_name
1	G039-5	greater than 30% mitochondrial content		Pretreatment
2	G064-3	greater than 30% mitochondrial content		During treatment of RAM monotherapy
3	G064-4	very biased base composition|greater than 80% mitochondrial content	PC outlier	During treatment of RAM monotherapy
4	G082-3		PC outlier	During treatment of PTX plus RAM
5	G082-4		PC outlier	During treatment of PTX plus RAM
6	G127-6	very biased base composition|greater than 80% mitochondrial content	PC outlier	During treatment of PTX plus RAM
7	G132-2	potential herpes virus infection|potential fungal contamination	PC outlier	PTX
8	G132-3		PC outlier	During treatment of RAM monotherapy
9	G132-6	greater than 30% mitochondrial content		Post PTX RAM
10	NCCHE-G057-2	Patient is reported by investigator to be male but appears to be female		Pretreatment
11	NCCHE-G057-3	Patient is reported by investigator to be male but appears to be female		Post PTX RAM
12	NCCHE-G057-5	Patient is reported by investigator to be male but appears to be female		During treatment of RAM monotherapy
13	NCCHE-G136		PC outlier	Pretreatment
14	NCCHE-G164-2		PC outlier	During treatment of PTX plus RAM

Principal component analysis of the quantile normalized rollup gene counts

Samples G082-3, G082-4, G132-2, G132-3, NCCHE-G136, G064-4, G127-6, NCCHE-G164-2 are PC outliers. Consider these carefully before proceeding with downstream analysis. In many studies, such samples are dropped from the analysis.

An outlier is an observation that is numerically distant from the rest of the data (more than 1.5 times the interquartile range above the upper quartile or below the lower quartile).

Outliers marked in RNASeq metrics boxplots

After the internal Lilly pipeline, no correlation with the RNASeq metrics is expected. Therefore, outliers for the various RNASeq metrics are calculated and presented where they overlap with the PC outliers. This may help explain why these samples are outliers.

Again, each plot allows the PC outlier samples to be compared against the various RNASeq metrics, which may explain why some of the samples are outliers.
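
The overlap between PC outliers and metric outliers amounts to a per-metric set intersection, sketched below. The sample IDs and metric names are hypothetical and chosen only for illustration.

```python
def overlap_with_pc_outliers(pc_outliers, metric_outliers):
    """Per metric, list the samples flagged both by that metric and by PCA."""
    pc_set = set(pc_outliers)
    return {
        metric: sorted(pc_set & set(samples))
        for metric, samples in metric_outliers.items()
    }

# Hypothetical PC outliers and per-metric outliers.
pc_outliers = ["S03", "S07", "S11"]
metric_outliers = {
    "total_read_count": ["S03", "S05"],
    "three_prime_bias": ["S07", "S11"],
    "mitochondrial_fraction": ["S02"],
}
overlap = overlap_with_pc_outliers(pc_outliers, metric_outliers)
```

A metric with an empty overlap, such as `mitochondrial_fraction` in this example, offers no explanation for the PC outliers.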

Ethnicity

Ancestry-informative markers are used to infer the ethnicity of the samples.

Quantile normalized rollup counts (rlm) and simplistic sample-level rpm distributions

As a reminder, with the default parameters for the Lilly RNASeq rollup used in this run, exons whose 80th percentile of summed raw read counts is < 10 are removed. The density plots show the gene count distributions of the Lilly RNASeq rollup counts and the simplistic sample-level rpm counts. They show clear evidence of the removal of lowly expressed genes from the Lilly RNASeq rollup.

Genes with low expression (default parameters) are removed in the Lilly RNASeq rollup pipeline. In this study, 4072 genes with low expression were removed from the Lilly RNASeq rollup counts. The total number of genes in the simplistic sample-level rpm data is 25885.

Does running the filter eliminate some interesting genes? This table includes the median expression of certain genes that show differences in expression in specific sample groups. Please evaluate whether these genes should be included in the main analysis, as they are not present in the current Lilly RNASeq rollup output (default parameters). One option is to rerun rollup with different settings.
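
One way to review the filtered genes is sketched below: find the genes present in the simplistic rpm data but absent from the rollup output, then compute their median rpm per sample group. The data layout, gene names, and group labels are assumptions for illustration; the report's own table may be built differently.

```python
import statistics
from collections import defaultdict

def removed_genes(rpm_genes, rollup_genes):
    """Genes present in the simplistic rpm data but absent after rollup."""
    return sorted(set(rpm_genes) - set(rollup_genes))

def group_medians(rpm, sample_groups, genes):
    """Median rpm per sample group for each requested gene."""
    result = {}
    for gene in genes:
        by_group = defaultdict(list)
        for sample, value in rpm[gene].items():
            by_group[sample_groups[sample]].append(value)
        result[gene] = {group: statistics.median(v) for group, v in by_group.items()}
    return result

# Hypothetical rpm values (gene -> sample -> rpm) and group assignments.
rpm = {
    "GENE_A": {"s1": 0.0, "s2": 0.2, "s3": 4.0, "s4": 6.0},
    "GENE_B": {"s1": 50.0, "s2": 60.0, "s3": 55.0, "s4": 65.0},
}
rollup_genes = ["GENE_B"]
sample_groups = {"s1": "group_1", "s2": "group_1", "s3": "group_2", "s4": "group_2"}

dropped = removed_genes(rpm, rollup_genes)
medians = group_medians(rpm, sample_groups, dropped)
```

A gene such as the hypothetical `GENE_A`, nearly silent in one group but expressed in another, is the kind of candidate that may justify rerunning rollup with a less strict filter.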

In this study, the expression of 93 genes can be re-evaluated.

tSNE sample clustering

## Read the 52 x 50 data matrix successfully!
## Using no_dims = 2, perplexity = 2.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
##  - point 0 of 52
## Done in 0.00 seconds (sparsity = 0.161982)!
## Learning embedding...
## Iteration 50: error is 79.812387 (50 iterations in 0.01 seconds)
## Iteration 100: error is 78.237445 (50 iterations in 0.01 seconds)
## Iteration 150: error is 68.266009 (50 iterations in 0.00 seconds)
## Iteration 200: error is 72.253209 (50 iterations in 0.01 seconds)
## Iteration 250: error is 71.060945 (50 iterations in 0.00 seconds)
## Iteration 300: error is 3.046715 (50 iterations in 0.01 seconds)
## Iteration 350: error is 2.382132 (50 iterations in 0.01 seconds)
## Iteration 400: error is 1.912596 (50 iterations in 0.00 seconds)
## Iteration 450: error is 2.211615 (50 iterations in 0.01 seconds)
## Iteration 500: error is 2.010348 (50 iterations in 0.01 seconds)
## Fitting performed in 0.07 seconds.
## [1] "perplexity = 2, max_iter = 500, learning rate = 200"

## Read the 52 x 50 data matrix successfully!
## Using no_dims = 2, perplexity = 5.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
##  - point 0 of 52
## Done in 0.00 seconds (sparsity = 0.389793)!
## Learning embedding...
## Iteration 50: error is 66.111092 (50 iterations in 0.01 seconds)
## Iteration 100: error is 69.386717 (50 iterations in 0.00 seconds)
## Iteration 150: error is 65.174801 (50 iterations in 0.01 seconds)
## Iteration 200: error is 63.650356 (50 iterations in 0.01 seconds)
## Iteration 250: error is 69.414769 (50 iterations in 0.00 seconds)
## Iteration 300: error is 2.505615 (50 iterations in 0.01 seconds)
## Iteration 350: error is 2.130702 (50 iterations in 0.00 seconds)
## Iteration 400: error is 1.794287 (50 iterations in 0.01 seconds)
## Iteration 450: error is 1.830252 (50 iterations in 0.01 seconds)
## Iteration 500: error is 1.522245 (50 iterations in 0.00 seconds)
## Fitting performed in 0.06 seconds.
## [1] "perplexity = 5, max_iter = 500, learning rate = 200"

## Read the 52 x 50 data matrix successfully!
## Using no_dims = 2, perplexity = 10.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
##  - point 0 of 52
## Done in 0.01 seconds (sparsity = 0.723373)!
## Learning embedding...
## Iteration 50: error is 59.716732 (50 iterations in 0.00 seconds)
## Iteration 100: error is 58.419821 (50 iterations in 0.01 seconds)
## Iteration 150: error is 57.084640 (50 iterations in 0.01 seconds)
## Iteration 200: error is 59.159695 (50 iterations in 0.00 seconds)
## Iteration 250: error is 62.227891 (50 iterations in 0.01 seconds)
## Iteration 300: error is 1.441920 (50 iterations in 0.00 seconds)
## Iteration 350: error is 0.942671 (50 iterations in 0.01 seconds)
## Iteration 400: error is 0.900135 (50 iterations in 0.01 seconds)
## Iteration 450: error is 0.643039 (50 iterations in 0.00 seconds)
## Iteration 500: error is 0.429907 (50 iterations in 0.01 seconds)
## Fitting performed in 0.06 seconds.
## [1] "perplexity = 10, max_iter = 500, learning rate = 200"

As a heuristic cutoff, tSNE was run on the 7000 most variable genes, with the median absolute deviation (MAD) used as the measure of variability. Each of the 7000 genes also had a mean expression greater than 3 counts (~10 reads). Please vary the default tSNE settings and run numerous iterations before interpreting the clusters.
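
The gene selection described above can be sketched as a two-step filter: keep genes whose mean expression exceeds the minimum, then rank them by MAD and take the top N. The function names and the toy data are hypothetical. The MAD here is unscaled (no consistency constant), which does not affect the ranking.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def top_variable_genes(expression, n_top=7000, min_mean=3.0):
    """Return up to n_top genes with mean > min_mean, ranked by decreasing MAD."""
    eligible = {
        gene: values
        for gene, values in expression.items()
        if statistics.fmean(values) > min_mean
    }
    ranked = sorted(eligible, key=lambda gene: mad(eligible[gene]), reverse=True)
    return ranked[:n_top]

# Hypothetical expression values (gene -> values across four samples).
expression = {
    "g1": [5, 5, 5, 5],   # passes the mean filter but has no variability
    "g2": [2, 9, 4, 11],  # highly variable
    "g3": [0, 1, 0, 7],   # mean too low, excluded
    "g4": [4, 6, 5, 7],   # moderately variable
}
selected = top_variable_genes(expression, n_top=2)
```

The selected matrix would then be passed to tSNE. Because tSNE embeddings depend strongly on perplexity and on the random initialization, comparing several runs, as the report recommends, is advisable before reading structure into the clusters.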