title: "singscore"
author: "Mohammad Hassan Tanveer, 18009978"
date: "09/01/2022"
output: html_document

Singscore R Package

1: Introduction
2: Materials
3: Methods and results
3.1.1: Installing relevant packages
3.1.2: Scoring samples against a gene set
3.1.3: Reduced number of measurements when sample scoring
3.2: Diagnostic and visualisation functions
3.2.1: Plot rank densities
3.2.2: Scores of plot dispersions
3.2.3: Plot score landscape
4: Discussion
5: Conclusion
6: References

1: Introduction:

In this portfolio I examine the singscore R package and describe its main uses with examples. RStudio was used to generate the results from the code provided with the singscore Bioconductor package. The core principle of what this package assesses is also discussed, and potential alternative methods are mentioned.

Transcriptomics examines the expression, regulation and structure of genes (Wang et al. 2019). Current transcriptomic approaches allow genetic variables to be studied and their functional implications to be assessed (Wang et al. 2019). Transcriptomics therefore helps to develop our understanding of genes.

Transcriptome studies are based on RNA sequencing. The RNA sequencing methods used in these studies do not require a reference genome and need only small amounts of input RNA (Geniza and Jaiswal 2017). This suggests that transcriptomics is not materially demanding, which makes it easier to study gene structure. The key strength of transcriptomics is the thorough examination of the temporal and spatial properties of expressed genes (Geniza and Jaiswal 2017).

Serial analysis of gene expression (SAGE) is one of the earliest sequencing-based transcriptomic methods, dating back to the mid-1990s (Lowe et al. 2017). SAGE allowed transcripts to be quantified by using known genes as references for the sequenced fragments (Lowe et al. 2017).

A study performed in 2004 used SAGE to examine the transcriptomic changes associated with the progression of breast cancer (Abba et al. 2004). This study created a gene expression profile showing the key alterations observed during malignant breast cancer development (Abba et al. 2004). Comparing normal tissue with ductal carcinoma in situ identified 52 deregulated transcripts, while comparing invasive ductal carcinoma with ductal carcinoma in situ identified a larger number, 149 deregulated transcripts. These were drawn from the 25,157 genes analyzed in the early cancer state (Abba et al. 2004).

This portfolio looks at an R package called singscore, which provides gene set scoring (Foroutan et al. 2018). The single-sample rank-based gene set scoring (singscore) method quantifies the concordance between the transcriptional profile of a specific sample and a chosen gene set (Bhuva et al. 2019). The singscore package implements functions that address issues found with other gene set scoring methods (Foroutan et al. 2018). Most of those methods give unstable scores on small datasets, and the composition of the samples introduces biases. These issues most likely arise because the methods use information from all the samples when scoring a single sample (Foroutan et al. 2018). Singscore, the rank-based single-sample scoring method, was therefore developed to address these issues (Foroutan et al. 2018).

2: Materials:

The materials include the singscore package, obtained from Bioconductor, and RStudio, where the coding was done. The code in the singscore package is based on the paper by Foroutan et al. (2017). Two additional papers, Barretina et al. (2012) and Tan et al. (2014), were used when creating the score landscape plots. All of the methods and results are taken from the singscore Bioconductor package HTML documentation.

3: Methods and results:

3.1.1: Installing relevant packages

The package first needs to be installed. The singscore package and another package, GSEABase, are then loaded with library(singscore) and library(GSEABase). The example datasets are included in the singscore package, so they become available once it is loaded.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("singscore")

library("singscore")
library(GSEABase)

3.1.2: Scoring samples against a gene set

The dataset used was tgfb_expr_10_se, which is taken from Foroutan et al. (2017). It contains 4 cases and 6 controls from the integrated TGFb-treated gene expression dataset, which were scored against a pair of up- and down-regulated TGFb gene sets (Foroutan et al. 2017).

tgfb_expr_10_se

## class: SummarizedExperiment 
## dim: 11900 10 
## metadata(0):
## assays(1): counts
## rownames(11900): 2 9 ... 729164 752014
## rowData names(0):
## colnames(10): D_Ctrl_R1 D_TGFb_R1 ... Hil_Ctrl_R1 Hil_Ctrl_R2
## colData names(1): Treatment

Once the dataset was loaded into R, the function rankGenes() was used to rank the genes in each sample, which gave a rank matrix. The rank matrix and the signatures were then passed to the function simpleScore(), which returned a data.frame containing the scores for each sample.

rankData <- rankGenes(tgfb_expr_10_se)
scoredf <- simpleScore(rankData, upSet = tgfb_gs_up, downSet = tgfb_gs_dn)
scoredf

##               TotalScore TotalDispersion    UpScore UpDispersion   DownScore
## D_Ctrl_R1   -0.088097993        2867.348 0.06096415     3119.390 -0.14906214
## D_TGFb_R1    0.286994210        2217.970 0.24931565     2352.886  0.03767856
## D_Ctrl_R2   -0.098964086        2861.418 0.06841242     3129.769 -0.16737650
## D_TGFb_R2    0.270721958        2378.832 0.25035661     2470.012  0.02036534
## Hes_Ctrl_R1 -0.002084788        2746.146 0.08046490     3134.216 -0.08254969
## Hes_TGFb_R1  0.176122839        2597.515 0.22894035     2416.638 -0.05281751
## Hes_Ctrl_R2  0.016883867        2700.556 0.08817828     3138.664 -0.07129441
## Hes_TGFb_R2  0.188466953        2455.186 0.23895473     2324.717 -0.05048778
## Hil_Ctrl_R1 -0.061991164        3039.330 0.08314254     3553.792 -0.14513371
## Hil_Ctrl_R2 -0.064937366        2959.270 0.07433863     3396.637 -0.13927600
##             DownDispersion
## D_Ctrl_R1         2615.306
## D_TGFb_R1         2083.053
## D_Ctrl_R2         2593.067
## D_TGFb_R2         2287.652
## Hes_Ctrl_R1       2358.075
## Hes_TGFb_R1       2778.392
## Hes_Ctrl_R2       2262.448
## Hes_TGFb_R2       2585.654
## Hil_Ctrl_R1       2524.868
## Hil_Ctrl_R2       2521.903
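The idea behind the scores above can be sketched in base R for a single sample. This is a minimal illustration of rank-based scoring as described by Foroutan et al. (2018), not the package's own implementation. The expression values, the gene names g1 to g10, the two gene sets and the helper score_set() are all made up for the example, and the normalisation by the smallest and largest attainable mean rank is an assumption based on that paper.

```r
# Toy expression profile for one sample: 10 hypothetical genes (made-up values).
expr <- c(g1 = 5.2, g2 = 1.1, g3 = 8.7, g4 = 3.3, g5 = 9.9,
          g6 = 0.4, g7 = 7.5, g8 = 2.8, g9 = 6.1, g10 = 4.0)

# Hypothetical expected-up and expected-down gene sets.
up_set   <- c("g3", "g5", "g7")
down_set <- c("g2", "g6")

# Rank genes by increasing expression (rank 1 = lowest expressed gene).
ranks   <- rank(expr)
n_total <- length(ranks)

# Hypothetical helper: mean rank of the set scaled by the number of genes,
# rescaled by the smallest and largest attainable values and centred on 0,
# so that a single gene set scores between -0.5 and 0.5.
score_set <- function(set_ranks, n_total) {
  n_set <- length(set_ranks)
  raw   <- mean(set_ranks) / n_total
  s_min <- (n_set + 1) / (2 * n_total)
  s_max <- (2 * n_total - n_set + 1) / (2 * n_total)
  (raw - s_min) / (s_max - s_min) - 0.5
}

up_score <- score_set(ranks[up_set], n_total)

# For the down set the ranks are reversed, so that low expression of the
# expected-down genes gives a high score.
down_score <- score_set((n_total + 1 - ranks)[down_set], n_total)

# The total score is the sum of the up and down scores.
total_score <- up_score + down_score
```

In this toy sample the up-set genes are the three most highly expressed and the down-set genes are the two least expressed, so each set reaches its maximum and the total score is at the top of its range. Only the ranks within the one sample are used, which is why the score of a sample does not depend on the other samples in the dataset.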

3.1.3: Reduced number of measurements when sample scoring

The function getStableGenes() was used to retrieve genes that are stably expressed in carcinoma and blood transcriptomes. The id argument can be used to return Ensembl identifiers instead of gene symbols.

getStableGenes(5, type = 'carcinoma')

## [1] "RBM45"  "BRAP"   "CIAO1"  "TARDBP" "HNRNPK"

getStableGenes(5, type = 'blood')

## [1] "RBM45"  "BRAP"   "GOSR1"  "IWS1"   "HNRNPK"

getStableGenes(5, type = 'carcinoma', id = 'ensembl')

## [1] "ENSG00000155636" "ENSG00000089234" "ENSG00000144021" "ENSG00000120948"
## [5] "ENSG00000165119"

head(rankData[,2,drop = FALSE])

##    D_TGFb_R1
## 2       1255
## 9       7611
## 10      1599
## 12      3682
## 13      3599
## 14     10013

3.2: Diagnostic and visualisation functions

3.2.1: Plot rank densities

Firstly, scoredf stores the scores for each sample. To plot the ranks of the gene set genes in a specific sample, the function plotRankDensity() was used. The rank distribution of the second sample in rankData was plotted as a combined density and barcode plot. The argument drop = FALSE was used when subsetting to keep the matrix/data.frame structure. The returned rank density plot is referred to as figure 1.

Figure 1 shows the rank distribution of the TGFb genes in a specific sample, presented as density and barcode plots. Figure 1 appears to show a score that is neither high nor low but in the middle. Foroutan et al. (2018) looked at a similar plot and described that a gene set shows no enrichment when the score is centralized (Foroutan et al. 2018). Because figure 1 shows a score near the center, this would mean that the TGFb gene set has no enrichment in this sample.

plotRankDensity(rankData[,2,drop = FALSE], upSet = tgfb_gs_up, 
                downSet = tgfb_gs_dn, isInteractive = FALSE)

## Warning: Ignoring unknown aesthetics: text

Figure 1: a combined density and barcode plot created with plotRankDensity() from rankData, showing the ranks of the TGFb gene set genes in the second sample.
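The effect of drop = FALSE in the subsetting used for figure 1 can be seen on a small matrix. The sketch below uses a made-up 3 x 2 rank matrix; the object toy_ranks and its gene and sample names are hypothetical and are not part of the singscore data.

```r
# Hypothetical rank matrix: 3 genes (rows) by 2 samples (columns).
toy_ranks <- matrix(c(3, 1, 2, 1, 3, 2), nrow = 3,
                    dimnames = list(c("geneA", "geneB", "geneC"),
                                    c("s1", "s2")))

# Default subsetting of one column drops the matrix structure
# and returns a plain named vector, losing the sample name.
as_vector <- toy_ranks[, 2]

# With drop = FALSE the result is still a one-column matrix
# that keeps its column (sample) name.
as_matrix <- toy_ranks[, 2, drop = FALSE]

is.matrix(as_vector)
is.matrix(as_matrix)
dim(as_matrix)
colnames(as_matrix)
```

Keeping the one-column matrix means the selected sample still carries its row and column names, which is the structure that the call to plotRankDensity() in this section is given.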

3.2.2: Scores of plot dispersions

To create scatter plots of score against dispersion, the function plotDispersion() was used for the total, up and down scores of the samples. This generates three panels, referred to as figure 2.

Figure 2 has three panels showing score against dispersion for the control and TGFb samples. The panels show that the lower the sample score, the higher the dispersion. In the left panel, labeled total, a control sample with a score of around -0.06 has a dispersion above 3000, while a TGFb sample with a score of around 0.29 has a dispersion slightly greater than 2200. The clearest difference is that the TGFb samples have high scores and low dispersion compared with the controls in the first two panels (total and up). The third panel is the exception: for the down-regulated gene set, both the control and TGFb samples show a wider spread.

tgfbAnnot <- data.frame(SampleID = colnames(tgfb_expr_10_se),
                        Type = NA)
tgfbAnnot$Type[grepl("Ctrl", tgfbAnnot$SampleID)] = "Control"
tgfbAnnot$Type[grepl("TGFb", tgfbAnnot$SampleID)] = "TGFb"
tgfbAnnot$Type

##  [1] "Control" "TGFb"    "Control" "TGFb"    "Control" "TGFb"    "Control"
##  [8] "TGFb"    "Control" "Control"

plotDispersion(scoredf,annot = tgfbAnnot$Type,isInteractive = FALSE)

Figure 2: scatter plots created with plotDispersion(), with three panels showing score against dispersion for the total score, the up-regulated gene set score and the down-regulated gene set score.
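The columns plotted in figure 2 can be related to each other with a short arithmetic check on the scoredf output printed in section 3.1.2. The sketch below copies the values for sample D_Ctrl_R1 from that output and shows that the total score matches the sum of the up and down scores and that the total dispersion matches the mean of the up and down dispersions. The last lines assume that dispersion is the median absolute deviation (MAD) of the gene set ranks, which is my understanding of the package default; toy_set_ranks is a made-up vector used only to show what mad() returns.

```r
# Values copied from the printed scoredf output for sample D_Ctrl_R1.
up_score   <- 0.06096415
down_score <- -0.14906214
up_disp    <- 3119.390
down_disp  <- 2615.306

# Total score as the sum of the up and down scores
# (printed TotalScore: -0.088097993).
total_score <- up_score + down_score

# Total dispersion as the mean of the up and down dispersions
# (printed TotalDispersion: 2867.348).
total_disp <- mean(c(up_disp, down_disp))

# Assumption: dispersion is the MAD of the ranks of the gene set genes.
# Hypothetical ranks of a tightly clustered three-gene set:
toy_set_ranks <- c(8, 9, 10)
toy_disp <- mad(toy_set_ranks)  # median |rank - median rank| * 1.4826
```

A gene set whose genes sit close together in the ranking gives a small MAD, which is why samples with coordinated expression of the gene set appear with low dispersion in figure 2.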

3.2.3: Plot score landscape

The score landscape plot allows the relationship between two different gene signatures to be examined. More data were therefore loaded to create a signature landscape with the function plotScoreLandscape(), which plots the scores for the two signatures against each other. The extra data were two sets of CCLE scores, scoredf_ccle_mes and scoredf_ccle_epi (Barretina et al. 2012), obtained by scoring against a mesenchymal and an epithelial gene signature (Tan et al. 2014). The generated signature landscape is referred to as figure 3.

Figure 3 shows the relationship between the epithelial and mesenchymal gene signatures. The figure shows a negative association between the two signatures: where there is a high count (around 4) at a ccle-MES score of 0.4, the ccle-EPI score is around -0.8, whereas where there is a high count (around 6) at a ccle-EPI score of 0.3, the ccle-MES score is around 0. This indicates a negative association between the two gene signatures.

plotScoreLandscape(scoredf_ccle_epi, scoredf_ccle_mes, 
                   scorenames = c('ccle-EPI','ccle-MES'),hexMin = 10)

Figure 3: a signature landscape created with plotScoreLandscape(), in which the scores in scoredf_ccle_epi and scoredf_ccle_mes, for the CCLE epithelial and mesenchymal gene signatures, are plotted against each other.
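A negative association of the kind read off figure 3 can also be summarised with a rank correlation. The sketch below is only an illustration: the two score vectors are made-up values for six hypothetical samples, not the CCLE scores, and the commented line at the end assumes that both score data frames have a TotalScore column like the scoredf output in section 3.1.2.

```r
# Made-up epithelial and mesenchymal scores for six hypothetical samples.
epi_score <- c(0.30, 0.25, 0.10, -0.20, -0.50, -0.80)
mes_score <- c(0.00, 0.05, 0.15, 0.25, 0.35, 0.40)

# Spearman correlation: close to -1 for a strong negative association,
# close to 0 when there is no monotonic relationship.
rho <- cor(epi_score, mes_score, method = "spearman")

# With the package data loaded, the same call could be applied to the real
# scores (assuming a TotalScore column in both data frames):
# cor(scoredf_ccle_epi$TotalScore, scoredf_ccle_mes$TotalScore,
#     method = "spearman")
```

In the toy data every increase in the mesenchymal score comes with a decrease in the epithelial score, so the coefficient is at its negative extreme. A single number like this complements the landscape plot but does not replace it, since the plot also shows where the samples are concentrated.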

Another signature landscape can be generated using the TCGA breast cancer RNA-seq dataset instead of the CCLE dataset (Barretina et al. 2012). The TCGA samples are scored against the mesenchymal and epithelial gene signatures (Tan et al. 2014), and the scores are available as scoredf_tcga_mes and scoredf_tcga_epi.

Figure 4 shows a largely neutral landscape, with no strong negative or positive association between tcga_MES and tcga_EPI. There is a slight negative association, with most of the high counts falling around 0.28 tcga_EPI and 0.21 tcga_MES, but otherwise there is little skew.

tcgaLandscape <- plotScoreLandscape(scoredf_tcga_epi, scoredf_tcga_mes, 
                   scorenames = c('tcga_EPI','tcga_MES'), isInteractive = FALSE)

tcgaLandscape

Figure 4: a signature landscape created with plotScoreLandscape(), in which the scores in scoredf_tcga_epi and scoredf_tcga_mes, for the TCGA epithelial and mesenchymal gene signatures, are plotted against each other.

Additionally, new data points can be projected onto figure 4 using the function projectScoreLandscape(), and the sampleLabels argument can be used to add customized labels.
The resulting signature landscape is referred to as figure 5.

Figure 5 shows the projected CCLE breast cancer samples selected in the code (rows 1, 4 and 5 of scoredf_ccle_epi). Hs.739.T shows a high tcga_MES score but a low tcga_EPI score, while the remaining samples lie closer together, with higher tcga_EPI scores and lower tcga_MES scores.

projectScoreLandscape(plotObj = tcgaLandscape, scoredf_ccle_epi, 
                      scoredf_ccle_mes,
                      subSamples = rownames(scoredf_ccle_epi)[c(1,4,5)],
                      annot = rownames(scoredf_ccle_epi)[c(1,4,5)], 
                      sampleLabels = NULL,
                      isInteractive = FALSE)

## Warning: Ignoring unknown aesthetics: text

Figure 5: the same landscape as figure 4, with new data points projected onto it using projectScoreLandscape() and its sampleLabels argument.

4: Discussion

As described earlier, the singscore package used in this portfolio carries out gene set scoring and was designed to address issues associated with most other gene set scoring techniques (Foroutan et al. 2018).

The singscore package has visualization and diagnostic functions, and these were used to produce multiple plots. The figures generated were a rank density plot (figure 1), a plot of scores against dispersion (figure 2), a score landscape (figure 3), a score landscape with different data (figure 4) and a further score landscape identical to figure 4 but with extra data points and custom labels (figure 5). All of these plots were generated successfully, which shows that the singscore package was working. The singscore package looks at individual samples and quantifies the concordance of their transcriptional profile with a chosen gene set (Bhuva et al. 2019). The package did this when generating figures 1-4, but the figures are looked at in more detail below before drawing a conclusion.

The rank density plot (figure 1) shows the rank distribution of the up and down TGFb gene sets. As stated in the results section, this plot shows no gene set enrichment because the score is close to the center (Foroutan et al. 2018). Foroutan et al. (2018) also describe that samples with high scores are concordant with the tested gene sets, which suggests that the sample in figure 1 may not be concordant because of its medium score. The singscore package is used to quantify the concordance of individual samples with a gene set (Bhuva et al. 2019). Figure 1 therefore suggests that singscore did not demonstrate concordance in this rank density plot, as the sample may not be concordant with the TGFb gene set.

Figure 2 presents the dispersion plots, showing score against dispersion for the TGFb samples and the controls. The left panel shows the total scores, the middle panel the scores for the up-regulated TGFb gene set and the right panel the scores for the down-regulated TGFb gene set. Similar figures are presented in another singscore paper, which explains that if the samples have a wide distribution range the gene set may be regulated differently (Foroutan et al. 2018). Understanding this is key to interpreting the results critically. In the middle panel (up score), the TGFb samples are tightly clustered on the opposite side from the control scores, which suggests that the up-regulated TGFb gene set follows the same, more coordinated regulation. In the right panel (down-regulated gene set), the TGFb and control samples are more scattered, which suggests that this gene set might be regulated differently. Similar figures are also observed in Foroutan et al. (2018).

The singscore package has also been used in previous literature. One example is the prediction of mutation status in acute myeloid leukemia from transcriptomic signatures (Bhuva et al. 2019). This paper identifies that transcriptional programs may be driven by alternative mutations, and that where those alternative mutations originate can be seen using the singscore package (Bhuva et al. 2019). The paper also suggests that cancer heterogeneity may be the best focus when using singscore (Bhuva et al. 2019).

Bhuva et al. (2019) also created signature landscapes to look at the relationship between two phenotypes. Their landscape revealed a positive association between the NPM1c and MLL fusion signatures in acute myeloid leukemia (Bhuva et al. 2019). Following this result, Bhuva et al. (2019) projected extra points onto the landscape to look at sample stratification. This is said to simplify the interpretation of the landscape, because the different mutation subtypes show their characteristics depending on where they are confined on the signature landscape (Bhuva et al. 2019). Signature landscapes therefore help us to understand the association between two signatures and to interpret the characteristics of some mutations.

An alternative package for RNA-seq and microarray data is gene set variation analysis (GSVA) (Hänzelmann et al. 2013). This package has been shown to be good at detecting small changes in pathway activity. GSVA helps to address issues associated with gene set enrichment (GSE) methods, so it is used to apply GSE to RNA-seq data (Hänzelmann et al. 2013). However, GSVA is shown to be inferior to the singscore package at producing stable scores for specific samples. This is evident from the results of Foroutan et al. (2018), where the TCGA breast cancer RNA-seq data show stable scores when singscore is used (Foroutan et al. 2018). This suggests that singscore is more desirable than GSVA when scoring individual samples.

A paper published in 2021 found a way to refine the quantification of gene expression signatures by using unique molecular identifiers in a targeted RNAseq assay (Fu et al. 2021). This study used concordance to assess each transcript and found a high concordance between formalin-fixed paraffin-embedded and fresh frozen breast cancer tumor samples (Fu et al. 2021). The study found that using unique molecular identifiers in a targeted RNAseq assay resulted in better measurements of each transcript (Fu et al. 2021).

5: Conclusion

In summary, the singscore package has been shown to have a strong use case in many applications that involve single-sample scoring. One major advantage of the package is its ability to produce stable scores, which most other methods, including the GSVA package, struggle to do (Foroutan et al. 2018). Studies have also shown specific uses of singscore; one study used it to predict mutation status in cancer (Bhuva et al. 2019). Overall, this package has been shown to be very useful, and its future use looks promising.

6: References

Abba, M. C., Drake, J. A., Hawkins, K. A., Hu, Y., Sun, H., Notcovich, C., Gaddis, S., Sahin, A., Baggerly, K. and Aldaz, C. M. (2004) Transcriptomic changes in human breast cancer progression as determined by serial analysis of gene expression. Breast cancer research : BCR 6 (5), R499-R513.

Barretina, J., Caponigro, G., Reddy, A., Liu, M., Murray, L., Berger, M. F., Monahan, J. E., Morais, P., Meltzer, J., Korejwa, A., Jane-Valbuena, J., Mapa, F. A., Stransky, N., Thibault, J., Bric-Furlong, E., Raman, P., Shipway, A., Engels, I. H., Cheng, J., Yu, G. K., Yu, J., Aspesi, P., De Silva, M., Venkatesan, K., Jagtap, K., Jones, M. D., Wang, L. I., Hatton, C., Palescandolo, E., Gupta, S., Mahan, S., Sougnez, C., Onofrio, R. C., Liefeld, T., Margolin, A. A., Macconaill, L., Winckler, W., Reich, M., Li, N., Mesirov, J. P., Gabriel, S. B., Getz, G., Ardlie, K., Chan, V., Myer, V. E., Kim, S., Weber, B. L., Porter, J., Warmuth, M., Finan, P., Harris, J. L., Meyerson, M., Golub, T. R., Morrissey, M. P., Sellers, W. R., Schlegel, R., Wilson, C. J., Lehar, J., Kryukov, G. V. and Sonkin, D. (2012) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature (London) 483 (7391), 603-607.

Bhuva, D. D., Foroutan, M., Xie, Y., Lyu, R., Cursons, J. and Davis, M. J. (2019) Using singscore to predict mutation status in acute myeloid leukemia from transcriptomic signatures [version 3; peer review: 2 approved]. F1000 research 8, 776-776.

Foroutan, M., Bhuva, D. D., Lyu, R., Horan, K., Cursons, J. and Davis, M. J. (2018) Single sample scoring of molecular phenotypes. BMC bioinformatics 19 (1), 404-404.

Foroutan, M., Cursons, J., Hediyeh-Zadeh, S., Thompson, E. W. and Davis, M. J. (2017) A Transcriptional Program for Detecting TGFβ-Induced EMT in Cancer. Molecular cancer research 15 (5), 619-631.

Fu, C., Marczyk, M., Samuels, M., Trevarton, A. J., Qu, J., Lau, R., Du, L., Pappas, T., Sinn, B. V., Gould, R. E., Pusztai, L., Hatzis, C. and Symmans, W. F. (2021) Targeted RNAseq assay incorporating unique molecular identifiers for improved quantification of gene expression signatures and transcribed mutation fraction in fixed tumor samples. BMC cancer 21 (1), 114-114.

Geniza, M. and Jaiswal, P. (2017) Tools for building de novo transcriptome assembly. Current plant biology 11-12 (C), 41-45.

Hänzelmann, S., Castelo, R. and Guinney, J. (2013) GSVA: gene set variation analysis for microarray and RNA-seq data. BMC bioinformatics 14 (1), 7-7.

Lowe, R., Shirley, N., Bleackley, M., Dolan, S. and Shafee, T. (2017) Transcriptomics technologies. PLoS computational biology 13 (5), e1005457-e1005457.

Tan, T. Z., Miow, Q. H., Miki, Y., Noda, T., Mori, S., Huang, R. Y. J. and Thiery, J. P. (2014) Epithelial‐mesenchymal transition spectrum quantification and its efficacy in deciphering survival and drug responses of cancer patients. EMBO molecular medicine 6 (10), 1279-1293.

Wang, B., Kumar, V., Olson, A. and Ware, D. (2019) Reviving the Transcriptome Studies: An Insight Into the Emergence of Single-Molecule Transcriptome Sequencing. Frontiers in genetics 10, 384-384.