Non-bayesian pharmaogenomic modelling

Introduction

In this example we do some very simple modelling using cancer cell line screening data and associated genetic data. The purpose of this example is to demonstrate how data can be extracted from PharmacoSet objects and then worked on.

Get the data

PharmacoSet objects can be downloaded as follows:

#ccle_pset <- downloadPSet('CCLE', saveDir='~/BigData/PSets/')
#gdsc_pset <- downloadPSet('GDSC', saveDir='~/BigData/PSets/')

And then loaded in the normal way:

load('~/BigData/PSets/GDSC.RData')

Explore the PharmacoSet objects

Objects have a show method so calling the name of the object reveals lots of useful information about it.

GDSC

## Name:  GDSC 
## Date Created:  Wed Dec 30 10:44:21 2015 
## Number of cell lines:  1124 
## Number of drug compounds:  139 
## RNA: 
##  Dim:  11833 789 
## CNV: 
##  Dim:  24960 936 
## Drug pertubation: 
##  Please look at pertNumber(pSet) to determine number of experiments for each drug-cell combination.
## Drug sensitivity: 
##  Number of Experiments:  79903 
##  Please look at sensNumber(pSet) to determine number of experiments for each drug-cell combination.

There are useful functions to show the elements of PharmacoSet object: drugs, cells and molecular data types:

drugNames(GDSC)[1:10]

##  [1] "Erlotinib"    "AICAR"        "Camptothecin" "Vinblastine" 
##  [5] "Cisplatin"    "Cytarabine"   "Docetaxel"    "Methotrexate"
##  [9] "ATRA"         "Gefitinib"

cellNames(GDSC)[1:10]

##  [1] "22RV1"    "23132-87" "380"      "5637"     "639-V"    "647-V"   
##  [7] "697"      "769-P"    "786-0"    "8305C"

mDataNames(GDSC)

## [1] "rna"      "rna2"     "mutation" "fusion"   "cnv"

Dose response data

We can plot the dose response data for a cell line:

drugDoseResponseCurve(drug='Nutlin-3', cellline='697', pSets=GDSC, plot.type='Both')

And get the sensitivity measure values for all cell lines for Nutlin-3 and Erlotinib:

sens_mat <- summarizeSensitivityProfiles(GDSC, sensitivity.measure = 'ic50_published', drugs = c('Nutlin-3', 'Erlotinib'), verbose = FALSE) %>% t()
sens_mat <- 9-log10(sens_mat)
dim(sens_mat)

## [1] 1124    2

Unfortunately there isn’t a convenience function to extract the dose response data itself, but this can be done as follows:

#get the information about the curve of interest
drug_profiles <- GDSC@sensitivity$info %>% 
    tibble::rownames_to_column('curve_id') %>%
    dplyr::filter(drugid=='Nutlin-3', cellid=='697')
drug_profiles

##          curve_id cellid   drugid drug.name nbr.conc.tested min.Dose.uM
## 1 drugid_1047_697    697 Nutlin-3  NUTLIN3A               9     0.03125
##   max.Dose.uM duration_h
## 1           8         72

#the raw data is stored in a 3D matrix, can extract as follows:
GDSC@sensitivity$raw["drugid_1047_697",,]

##        Dose      Viability         
## doses1 "0.03125" "99.9083625051504"
## doses2 "0.0625"  "110.360194300232"
## doses3 "0.125"   "97.9893987166772"
## doses4 "0.25"    "100.571670763227"
## doses5 "0.5"     "91.7495404302393"
## doses6 "1"       "87.6252737395088"
## doses7 "2"       "56.7022780978287"
## doses8 "4"       "14.4871092551858"
## doses9 "8"       "4.04710864108274"

Molecular profile data

Feature names are symbols for the mutation data and a derivation of ensembl id’s for expression data. Gene name information can be obtained using EnsDb.Hsapiens.v75 package

gene_info <- ensembldb::genes(EnsDb.Hsapiens.v75, filter=GenenameFilter(c('TP53', 'EGFR')), return.type='data.frame') %>%
    dplyr::select(gene_id, gene_name) %>%
    dplyr::mutate(probe_id=paste0(gene_id, '_at'))
gene_info

##           gene_id gene_name           probe_id
## 1 ENSG00000141510      TP53 ENSG00000141510_at
## 2 ENSG00000146648      EGFR ENSG00000146648_at

Get the mutatation and expression data for TP53 and EGFR:

genetic_mat <- summarizeMolecularProfiles(GDSC, mDataType = 'mutation', features = gene_info$gene_name, summary.stat='and', verbose=FALSE) %>% exprs() %>% t()
dim(genetic_mat)

## [1] 1124    2

affy_mat <- summarizeMolecularProfiles(GDSC, mDataType = 'rna', features = gene_info$probe_id, summary.stat='first', verbose=FALSE) %>% exprs() %>% t()
dim(affy_mat)

## [1] 1124    2

Integrated data analysis

Combine data together and turn into a data frame

#sensitivity data 
sens_df <- sens_mat %>% as.data.frame() %>% tibble::rownames_to_column('cell_line')

#mutation data
colnames(genetic_mat) <- paste0(colnames(genetic_mat), '_mut' )
genetic_df <- genetic_mat %>% as.data.frame() %>% tibble::rownames_to_column('cell_line')

#affymetrix data
colnames(affy_mat) <- paste0(colnames(affy_mat), '_affy' )
affy_df <- affy_mat %>% as.data.frame() %>% tibble::rownames_to_column('cell_line')

#combine into a single data frame
combined_df <- sens_df %>% 
    inner_join(genetic_df, by='cell_line') %>% 
    inner_join(affy_df, by='cell_line')

combined_df %>% tbl_df()

## # A tibble: 1,124 × 7
##    cell_line `Nutlin-3` Erlotinib TP53_mut EGFR_mut
##        <chr>      <dbl>     <dbl>   <fctr>   <fctr>
## 1      22RV1   7.892590       NaN        1        0
## 2   23132-87   7.680031       NaN        0        0
## 3        380        NaN       NaN       NA       NA
## 4       5637   6.214938       NaN        1        0
## 5      639-V   7.358565       NaN        1        0
## 6      647-V   6.072949       NaN        1        0
## 7        697   8.694393  8.576254        0        0
## 8      769-P   7.176188       NaN        0        0
## 9      786-0   6.995861       NaN        1        0
## 10     8305C   6.225848       NaN        1        0
## # ... with 1,114 more rows, and 2 more variables:
## #   ENSG00000141510_at_affy <dbl>, ENSG00000146648_at_affy <dbl>

Now do some basic modelling. Nutlin-3 is more effective in TP53 wild type than mutant cell lines:

lm(`Nutlin-3`~TP53_mut, data=combined_df) %>% summary()

## 
## Call:
## lm(formula = `Nutlin-3` ~ TP53_mut, data = combined_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.84317 -0.45086 -0.04256  0.46663  2.12744 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.69481    0.04444  173.16   <2e-16 ***
## TP53_mut1   -1.07415    0.05490  -19.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.671 on 659 degrees of freedom
##   (463 observations deleted due to missingness)
## Multiple R-squared:  0.3674, Adjusted R-squared:  0.3665 
## F-statistic: 382.8 on 1 and 659 DF,  p-value: < 2.2e-16

ggplot(combined_df, aes(y=`Nutlin-3`, x=TP53_mut)) + geom_violin() + geom_point()

## Warning: Removed 463 rows containing non-finite values (stat_ydensity).

## Warning: Removed 463 rows containing missing values (geom_point).

Whereas Erlotinib is more effective in cell lines with high levels of expression of its target, EGFR.

lm(Erlotinib~ENSG00000146648_at_affy, data=combined_df) %>% summary()

## 
## Call:
## lm(formula = Erlotinib ~ ENSG00000146648_at_affy, data = combined_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1594 -0.4591 -0.1275  0.3527  3.1857 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.99793    0.28864  17.315  < 2e-16 ***
## ENSG00000146648_at_affy  0.38049    0.05188   7.335 2.32e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6671 on 285 degrees of freedom
##   (837 observations deleted due to missingness)
## Multiple R-squared:  0.1588, Adjusted R-squared:  0.1558 
## F-statistic:  53.8 on 1 and 285 DF,  p-value: 2.319e-12

ggplot(combined_df, aes(y=Erlotinib, x=ENSG00000146648_at_affy)) + geom_point() + geom_smooth(method='lm')

## Warning: Removed 837 rows containing non-finite values (stat_smooth).

## Warning: Removed 837 rows containing missing values (geom_point).

The extension of this is to treat tissue as a covariate since some tissues have higher expression than others or have a higher level of mutation.

The Bayesian bit

There are several sources of uncertainty that are not accounted for by using a simple linear model:

How accurate is the sensitivity measure estimate and how much information are we losing by converting the dose response curve to a single value?
Are we able to measure resistant and sensitive cell lines equally accurately?
How accurate is our classifcation of mutation status - might we have missed mutations in some cell lines, and erroneously called variants in others? How sure can we be of the functional signifance of a mutation?

Could a Bayesian framework allow us to calculate a posterior distribution of effect size of a genetic feature on response to compound?

Session Info

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] EnsDb.Hsapiens.v75_1.1.0 ensembldb_1.6.0         
##  [3] GenomicFeatures_1.26.0   AnnotationDbi_1.36.0    
##  [5] GenomicRanges_1.26.1     GenomeInfoDb_1.10.0     
##  [7] IRanges_2.8.0            S4Vectors_0.12.0        
##  [9] ggplot2_2.2.0            Biobase_2.34.0          
## [11] BiocGenerics_0.20.0      dplyr_0.5.0             
## [13] PharmacoGx_1.4.0        
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.2.1                    maps_3.1.1                   
##  [3] AnnotationHub_2.6.0           gtools_3.5.0                 
##  [5] sm_2.2-5.4                    shiny_0.14.1                 
##  [7] assertthat_0.1                interactiveDisplayBase_1.12.0
##  [9] Rsamtools_1.26.1              yaml_2.1.13                  
## [11] slam_0.1-37                   RSQLite_1.0.0                
## [13] lattice_0.20-34               limma_3.30.0                 
## [15] downloader_0.4                chron_2.3-47                 
## [17] digest_0.6.10                 RColorBrewer_1.1-2           
## [19] XVector_0.14.0                colorspace_1.2-7             
## [21] httpuv_1.3.3                  htmltools_0.3.5              
## [23] Matrix_1.2-7.1                plyr_1.8.4                   
## [25] lsa_0.73.1                    XML_3.98-1.4                 
## [27] biomaRt_2.30.0                magicaxis_2.0.0              
## [29] zlibbioc_1.20.0               xtable_1.8-2                 
## [31] relations_0.6-6               scales_0.4.1                 
## [33] RANN_2.5                      gdata_2.17.0                 
## [35] BiocParallel_1.8.0            tibble_1.2                   
## [37] SummarizedExperiment_1.4.0    lazyeval_0.2.0               
## [39] mime_0.5                      magrittr_1.5                 
## [41] evaluate_0.10                 SnowballC_0.5.1              
## [43] MASS_7.3-45                   gplots_3.0.1                 
## [45] BiocInstaller_1.24.0          tools_3.3.1                  
## [47] data.table_1.9.6              formatR_1.4                  
## [49] stringr_1.1.0                 munsell_0.4.3                
## [51] cluster_2.0.5                 plotrix_3.6-3                
## [53] Biostrings_2.42.0             caTools_1.17.1               
## [55] grid_3.3.1                    RCurl_1.95-4.8               
## [57] marray_1.52.0                 igraph_1.0.1                 
## [59] celestial_1.3                 labeling_0.3                 
## [61] bitops_1.0-6                  rmarkdown_1.1                
## [63] gtable_0.2.0                  DBI_0.5-1                    
## [65] sets_1.0-16                   R6_2.2.0                     
## [67] GenomicAlignments_1.10.0      gridExtra_2.2.1              
## [69] knitr_1.14                    rtracklayer_1.34.0           
## [71] fastmatch_1.0-4               fgsea_1.0.0                  
## [73] KernSmooth_2.23-15            stringi_1.1.2                
## [75] Rcpp_0.12.8                   mapproj_1.2-4                
## [77] piano_1.14.0