In this example we do some very simple modelling using cancer cell line screening data and associated genetic data. The purpose of this example is to demonstrate how data can be extracted from PharmacoSet objects and then worked on.
PharmacoSet objects can be downloaded as follows:
#ccle_pset <- downloadPSet('CCLE', saveDir='~/BigData/PSets/')
#gdsc_pset <- downloadPSet('GDSC', saveDir='~/BigData/PSets/')
And then loaded in the normal way:
load('~/BigData/PSets/GDSC.RData')
Objects have a show method so calling the name of the object reveals lots of useful information about it.
GDSC
## Name: GDSC
## Date Created: Wed Dec 30 10:44:21 2015
## Number of cell lines: 1124
## Number of drug compounds: 139
## RNA:
## Dim: 11833 789
## CNV:
## Dim: 24960 936
## Drug pertubation:
## Please look at pertNumber(pSet) to determine number of experiments for each drug-cell combination.
## Drug sensitivity:
## Number of Experiments: 79903
## Please look at sensNumber(pSet) to determine number of experiments for each drug-cell combination.
There are useful functions to show the elements of PharmacoSet object: drugs, cells and molecular data types:
drugNames(GDSC)[1:10]
## [1] "Erlotinib" "AICAR" "Camptothecin" "Vinblastine"
## [5] "Cisplatin" "Cytarabine" "Docetaxel" "Methotrexate"
## [9] "ATRA" "Gefitinib"
cellNames(GDSC)[1:10]
## [1] "22RV1" "23132-87" "380" "5637" "639-V" "647-V"
## [7] "697" "769-P" "786-0" "8305C"
mDataNames(GDSC)
## [1] "rna" "rna2" "mutation" "fusion" "cnv"
We can plot the dose response data for a cell line:
drugDoseResponseCurve(drug='Nutlin-3', cellline='697', pSets=GDSC, plot.type='Both')
And get the sensitivity measure values for all cell lines for Nutlin-3 and Erlotinib:
sens_mat <- summarizeSensitivityProfiles(GDSC, sensitivity.measure = 'ic50_published', drugs = c('Nutlin-3', 'Erlotinib'), verbose = FALSE) %>% t()
sens_mat <- 9-log10(sens_mat)
dim(sens_mat)
## [1] 1124 2
Unfortunately there isn’t a convenience function to extract the dose response data itself, but this can be done as follows:
#get the information about the curve of interest
drug_profiles <- GDSC@sensitivity$info %>%
tibble::rownames_to_column('curve_id') %>%
dplyr::filter(drugid=='Nutlin-3', cellid=='697')
drug_profiles
## curve_id cellid drugid drug.name nbr.conc.tested min.Dose.uM
## 1 drugid_1047_697 697 Nutlin-3 NUTLIN3A 9 0.03125
## max.Dose.uM duration_h
## 1 8 72
#the raw data is stored in a 3D matrix, can extract as follows:
GDSC@sensitivity$raw["drugid_1047_697",,]
## Dose Viability
## doses1 "0.03125" "99.9083625051504"
## doses2 "0.0625" "110.360194300232"
## doses3 "0.125" "97.9893987166772"
## doses4 "0.25" "100.571670763227"
## doses5 "0.5" "91.7495404302393"
## doses6 "1" "87.6252737395088"
## doses7 "2" "56.7022780978287"
## doses8 "4" "14.4871092551858"
## doses9 "8" "4.04710864108274"
Feature names are symbols for the mutation data and a derivation of ensembl id’s for expression data. Gene name information can be obtained using EnsDb.Hsapiens.v75 package
gene_info <- ensembldb::genes(EnsDb.Hsapiens.v75, filter=GenenameFilter(c('TP53', 'EGFR')), return.type='data.frame') %>%
dplyr::select(gene_id, gene_name) %>%
dplyr::mutate(probe_id=paste0(gene_id, '_at'))
gene_info
## gene_id gene_name probe_id
## 1 ENSG00000141510 TP53 ENSG00000141510_at
## 2 ENSG00000146648 EGFR ENSG00000146648_at
Get the mutatation and expression data for TP53 and EGFR:
genetic_mat <- summarizeMolecularProfiles(GDSC, mDataType = 'mutation', features = gene_info$gene_name, summary.stat='and', verbose=FALSE) %>% exprs() %>% t()
dim(genetic_mat)
## [1] 1124 2
affy_mat <- summarizeMolecularProfiles(GDSC, mDataType = 'rna', features = gene_info$probe_id, summary.stat='first', verbose=FALSE) %>% exprs() %>% t()
dim(affy_mat)
## [1] 1124 2
Combine data together and turn into a data frame
#sensitivity data
sens_df <- sens_mat %>% as.data.frame() %>% tibble::rownames_to_column('cell_line')
#mutation data
colnames(genetic_mat) <- paste0(colnames(genetic_mat), '_mut' )
genetic_df <- genetic_mat %>% as.data.frame() %>% tibble::rownames_to_column('cell_line')
#affymetrix data
colnames(affy_mat) <- paste0(colnames(affy_mat), '_affy' )
affy_df <- affy_mat %>% as.data.frame() %>% tibble::rownames_to_column('cell_line')
#combine into a single data frame
combined_df <- sens_df %>%
inner_join(genetic_df, by='cell_line') %>%
inner_join(affy_df, by='cell_line')
combined_df %>% tbl_df()
## # A tibble: 1,124 × 7
## cell_line `Nutlin-3` Erlotinib TP53_mut EGFR_mut
## <chr> <dbl> <dbl> <fctr> <fctr>
## 1 22RV1 7.892590 NaN 1 0
## 2 23132-87 7.680031 NaN 0 0
## 3 380 NaN NaN NA NA
## 4 5637 6.214938 NaN 1 0
## 5 639-V 7.358565 NaN 1 0
## 6 647-V 6.072949 NaN 1 0
## 7 697 8.694393 8.576254 0 0
## 8 769-P 7.176188 NaN 0 0
## 9 786-0 6.995861 NaN 1 0
## 10 8305C 6.225848 NaN 1 0
## # ... with 1,114 more rows, and 2 more variables:
## # ENSG00000141510_at_affy <dbl>, ENSG00000146648_at_affy <dbl>
Now do some basic modelling. Nutlin-3 is more effective in TP53 wild type than mutant cell lines:
lm(`Nutlin-3`~TP53_mut, data=combined_df) %>% summary()
##
## Call:
## lm(formula = `Nutlin-3` ~ TP53_mut, data = combined_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.84317 -0.45086 -0.04256 0.46663 2.12744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.69481 0.04444 173.16 <2e-16 ***
## TP53_mut1 -1.07415 0.05490 -19.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.671 on 659 degrees of freedom
## (463 observations deleted due to missingness)
## Multiple R-squared: 0.3674, Adjusted R-squared: 0.3665
## F-statistic: 382.8 on 1 and 659 DF, p-value: < 2.2e-16
ggplot(combined_df, aes(y=`Nutlin-3`, x=TP53_mut)) + geom_violin() + geom_point()
## Warning: Removed 463 rows containing non-finite values (stat_ydensity).
## Warning: Removed 463 rows containing missing values (geom_point).
Whereas Erlotinib is more effective in cell lines with high levels of expression of its target, EGFR.
lm(Erlotinib~ENSG00000146648_at_affy, data=combined_df) %>% summary()
##
## Call:
## lm(formula = Erlotinib ~ ENSG00000146648_at_affy, data = combined_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1594 -0.4591 -0.1275 0.3527 3.1857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.99793 0.28864 17.315 < 2e-16 ***
## ENSG00000146648_at_affy 0.38049 0.05188 7.335 2.32e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6671 on 285 degrees of freedom
## (837 observations deleted due to missingness)
## Multiple R-squared: 0.1588, Adjusted R-squared: 0.1558
## F-statistic: 53.8 on 1 and 285 DF, p-value: 2.319e-12
ggplot(combined_df, aes(y=Erlotinib, x=ENSG00000146648_at_affy)) + geom_point() + geom_smooth(method='lm')
## Warning: Removed 837 rows containing non-finite values (stat_smooth).
## Warning: Removed 837 rows containing missing values (geom_point).
The extension of this is to treat tissue as a covariate since some tissues have higher expression than others or have a higher level of mutation.
There are several sources of uncertainty that are not accounted for by using a simple linear model:
Could a Bayesian framework allow us to calculate a posterior distribution of effect size of a genetic feature on response to compound?
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] EnsDb.Hsapiens.v75_1.1.0 ensembldb_1.6.0
## [3] GenomicFeatures_1.26.0 AnnotationDbi_1.36.0
## [5] GenomicRanges_1.26.1 GenomeInfoDb_1.10.0
## [7] IRanges_2.8.0 S4Vectors_0.12.0
## [9] ggplot2_2.2.0 Biobase_2.34.0
## [11] BiocGenerics_0.20.0 dplyr_0.5.0
## [13] PharmacoGx_1.4.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.2.1 maps_3.1.1
## [3] AnnotationHub_2.6.0 gtools_3.5.0
## [5] sm_2.2-5.4 shiny_0.14.1
## [7] assertthat_0.1 interactiveDisplayBase_1.12.0
## [9] Rsamtools_1.26.1 yaml_2.1.13
## [11] slam_0.1-37 RSQLite_1.0.0
## [13] lattice_0.20-34 limma_3.30.0
## [15] downloader_0.4 chron_2.3-47
## [17] digest_0.6.10 RColorBrewer_1.1-2
## [19] XVector_0.14.0 colorspace_1.2-7
## [21] httpuv_1.3.3 htmltools_0.3.5
## [23] Matrix_1.2-7.1 plyr_1.8.4
## [25] lsa_0.73.1 XML_3.98-1.4
## [27] biomaRt_2.30.0 magicaxis_2.0.0
## [29] zlibbioc_1.20.0 xtable_1.8-2
## [31] relations_0.6-6 scales_0.4.1
## [33] RANN_2.5 gdata_2.17.0
## [35] BiocParallel_1.8.0 tibble_1.2
## [37] SummarizedExperiment_1.4.0 lazyeval_0.2.0
## [39] mime_0.5 magrittr_1.5
## [41] evaluate_0.10 SnowballC_0.5.1
## [43] MASS_7.3-45 gplots_3.0.1
## [45] BiocInstaller_1.24.0 tools_3.3.1
## [47] data.table_1.9.6 formatR_1.4
## [49] stringr_1.1.0 munsell_0.4.3
## [51] cluster_2.0.5 plotrix_3.6-3
## [53] Biostrings_2.42.0 caTools_1.17.1
## [55] grid_3.3.1 RCurl_1.95-4.8
## [57] marray_1.52.0 igraph_1.0.1
## [59] celestial_1.3 labeling_0.3
## [61] bitops_1.0-6 rmarkdown_1.1
## [63] gtable_0.2.0 DBI_0.5-1
## [65] sets_1.0-16 R6_2.2.0
## [67] GenomicAlignments_1.10.0 gridExtra_2.2.1
## [69] knitr_1.14 rtracklayer_1.34.0
## [71] fastmatch_1.0-4 fgsea_1.0.0
## [73] KernSmooth_2.23-15 stringi_1.1.2
## [75] Rcpp_0.12.8 mapproj_1.2-4
## [77] piano_1.14.0