The aim of the CancerCellLines package is to provide standardised code to create and extract data from a SQLite database containing the published genomic data from the Cancer Cell Line Encyclopedia and similar projects. A SQLite database is used so that the data can be stored on disk rather than loaded into memory. This is useful when the user wishes to work with small subsets of the overall dataset, for example just 10-20 genes in 80 lung cell lines. This vignette covers the initial setup of the package along with some examples of its use.
Data files referred to here can be downloaded from the CCLE project website.
There are also toy examples included in the package:
list.files(system.file("extdata", package = "CancerCellLines"))
## [1] "CCLE_copynumber_byGene_2012-09-29_toy.txt"
## [2] "CCLE_Expression_Entrez_2012-09-29_toy.gct"
## [3] "CCLE_hybrid_capture1650_hg19_NoCommonSNPs_NoNeutralVariants_CDS_2012.05.07_toy.maf"
## [4] "CCLE_NP24.2009_Drug_data_2012.02.20_toy.txt"
## [5] "CCLE_sample_info_file_2012-10-18_toy.txt"
## [6] "CellLineIDNormalisationNov15_toy.txt"
## [7] "CellLineIDNormalisationNov15.txt"
## [8] "CosmicCLP_CompleteExport_v74_toy.tsv"
## [9] "Dietlein2014_supp_table_1.txt"
## [10] "scale_colours.txt"
## [11] "toy.db"
Either make a toy database from scratch using the convenience function makeToyDB:
test_db <- makeToyDB()
## Parse the gene expression data file
## Writing to database
## Indexing the table
## Finished importing affy data
## Parse the copy number data file
## Writing to database
## Indexing the table
## Finished importing cn data
## Parse the Cosmic CLP exome data file
## Write the data to the database
## Indexing the table
## Finished importing Cosmic CLP exome sequencing data
## [1] "A database of toy data has successfully been created and a connector returned"
test_db
## <SQLiteConnection>
test_db@dbname
## [1] "/var/folders/9d/hs3g06m50095p0_tphcpyzy80000gp/T//RtmphyciSX/file8b85e03d71c"
dbListTables(test_db)
## [1] "ccle_affy" "ccle_cn" "ccle_drug_data" "ccle_hybcap"
## [5] "ccle_sampleinfo" "cell_line_ids" "cosmicclp_exome"
Or connect to the one built into the package using the setupSQLite function:
test_db <- setupSQLite(system.file('extdata/toy.db', package="CancerCellLines"))
test_db
## <SQLiteConnection>
test_db@dbname
## [1] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/CancerCellLines/extdata/toy.db"
dbListTables(test_db)
## [1] "ccle_affy" "ccle_cn" "ccle_drug_data" "ccle_hybcap"
## [5] "ccle_sampleinfo" "cell_line_ids" "cosmicclp_exome"
The functions from RSQLite can be used to query data in the normal way:
dbGetQuery(test_db, "select * from ccle_affy limit 10")
## ProbeID Symbol CCLE_name Signal
## 1 1956_at EGFR NCIH524_LUNG 4.773512
## 2 3845_at KRAS NCIH524_LUNG 10.399190
## 3 4893_at NRAS NCIH524_LUNG 8.802869
## 4 5728_at PTEN NCIH524_LUNG 7.442462
## 5 6597_at SMARCA4 NCIH524_LUNG 10.429500
## 6 673_at BRAF NCIH524_LUNG 6.445456
## 7 7157_at TP53 NCIH524_LUNG 7.120139
## 8 1956_at EGFR NCIH209_LUNG 5.712198
## 9 3845_at KRAS NCIH209_LUNG 10.544760
## 10 4893_at NRAS NCIH209_LUNG 9.539366
dbGetQuery(test_db, "select * from ccle_sampleinfo limit 10")[,1:5]
## CCLE_name Primary_cell_name Cell_line_aliases Gender Site_primary
## 1 A549_LUNG A549 M lung
## 2 ABC1_LUNG ABC-1 M lung
## 3 AU565_BREAST AU565 F breast
## 4 BEN_LUNG BEN M lung
## 5 BT20_BREAST BT-20 F breast
## 6 BT474_BREAST BT-474 F breast
## 7 BT483_BREAST BT-483 F breast
## 8 BT549_BREAST BT-549 F breast
## 9 CAL120_BREAST CAL-120 F breast
## 10 CAL12T_LUNG CAL-12T M lung
dbGetQuery(test_db, "select Symbol, t1.CCLE_name, Signal, Site_primary, Hist_subtype1 from ccle_affy as t1
inner join ccle_sampleinfo t2 on t1.CCLE_name = t2.CCLE_name
where t2.Hist_subtype1 == 'ductal_carcinoma'
order by Symbol desc
limit 10")
## Symbol CCLE_name Signal Site_primary Hist_subtype1
## 1 TP53 KPL1_BREAST 8.952216 breast ductal_carcinoma
## 2 TP53 HDQP1_BREAST 5.294608 breast ductal_carcinoma
## 3 TP53 UACC893_BREAST 5.066705 breast ductal_carcinoma
## 4 TP53 HCC1599_BREAST 5.015617 breast ductal_carcinoma
## 5 TP53 HCC1500_BREAST 7.630229 breast ductal_carcinoma
## 6 TP53 MDAMB134VI_BREAST 5.707668 breast ductal_carcinoma
## 7 TP53 CAL148_BREAST 9.157586 breast ductal_carcinoma
## 8 TP53 UACC812_BREAST 6.639146 breast ductal_carcinoma
## 9 TP53 HCC1806_BREAST 6.110553 breast ductal_carcinoma
## 10 TP53 JIMT1_BREAST 9.691416 breast ductal_carcinoma
Indexing the database allows fast retrieval even when the dataset gets large - more later.
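To illustrate why indexing matters, the sketch below builds a small in-memory SQLite table shaped like ccle_affy, adds an index on Symbol, and asks SQLite for its query plan. The table name `affy_demo`, the index name `idx_affy_demo_symbol`, and the Signal values are made up for illustration; the indexes that the package itself creates are not shown in this vignette and may well differ.

```r
library(DBI)

# In-memory database, so nothing is written to disk
demo_con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table with the same columns as ccle_affy (values are invented)
affy_demo <- data.frame(
  ProbeID   = c("1956_at", "3845_at", "7157_at", "1956_at", "3845_at", "7157_at"),
  Symbol    = c("EGFR", "KRAS", "TP53", "EGFR", "KRAS", "TP53"),
  CCLE_name = rep(c("NCIH524_LUNG", "NCIH209_LUNG"), each = 3),
  Signal    = c(4.8, 10.4, 7.1, 5.7, 10.5, 7.3),
  stringsAsFactors = FALSE
)
dbWriteTable(demo_con, "affy_demo", affy_demo)

# Index the column that appears in WHERE clauses
dbExecute(demo_con, "CREATE INDEX idx_affy_demo_symbol ON affy_demo (Symbol)")

# Ask SQLite how it would execute a lookup by gene symbol
plan <- dbGetQuery(
  demo_con,
  "EXPLAIN QUERY PLAN SELECT * FROM affy_demo WHERE Symbol = 'TP53'"
)
uses_index <- any(grepl("idx_affy_demo_symbol", plan$detail))

# The lookup itself
tp53_rows <- dbGetQuery(demo_con, "SELECT * FROM affy_demo WHERE Symbol = 'TP53'")

dbDisconnect(demo_con)
```

If the plan mentions the index name, SQLite is searching the index rather than scanning every row, which is what keeps lookups fast as the table grows.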
However, writing the SQL yourself can get inconvenient if you want to retrieve several genes or cell lines:
dbGetQuery(test_db, "select * from ccle_affy
where symbol IN ('PTEN', 'TP53', 'BRAF' ) and
CCLE_name IN ('BT474_BREAST', 'MDAMB468_BREAST')
limit 10")
## ProbeID Symbol CCLE_name Signal
## 1 673_at BRAF BT474_BREAST 6.844845
## 2 673_at BRAF MDAMB468_BREAST 6.335567
## 3 5728_at PTEN BT474_BREAST 9.582697
## 4 5728_at PTEN MDAMB468_BREAST 8.200749
## 5 7157_at TP53 BT474_BREAST 8.565303
## 6 7157_at TP53 MDAMB468_BREAST 9.301550
symbols <- c('PTEN', 'TP53', 'BRAF')
cell_lines <- c('BT474_BREAST', 'MDAMB468_BREAST')
symbols.sql <- paste(symbols, collapse="','")
cell_lines.sql <- paste(cell_lines, collapse="','")
dbGetQuery(test_db, sprintf("select * from ccle_affy
where symbol IN ('%s' ) and
CCLE_name IN ('%s')
limit 10", symbols.sql, cell_lines.sql))
## ProbeID Symbol CCLE_name Signal
## 1 673_at BRAF BT474_BREAST 6.844845
## 2 673_at BRAF MDAMB468_BREAST 6.335567
## 3 5728_at PTEN BT474_BREAST 9.582697
## 4 5728_at PTEN MDAMB468_BREAST 8.200749
## 5 7157_at TP53 BT474_BREAST 8.565303
## 6 7157_at TP53 MDAMB468_BREAST 9.301550
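When building the IN clause by hand like this, it can help to wrap the quoting in a small function so that each value is quoted, and any embedded single quote escaped, in one place. The sketch below uses base R only; `sql_in_list` is a hypothetical helper and not part of the CancerCellLines package.

```r
# Hypothetical helper: quote each value for SQL and join with commas.
# Embedded single quotes are doubled, which is the standard SQL escape.
sql_in_list <- function(x) {
  escaped <- gsub("'", "''", as.character(x), fixed = TRUE)
  paste0("'", escaped, "'", collapse = ", ")
}

symbols <- c("PTEN", "TP53", "BRAF")
cell_lines <- c("BT474_BREAST", "MDAMB468_BREAST")

# Build the same query as above, with the quoting handled by the helper
query <- sprintf(
  "select * from ccle_affy where Symbol IN (%s) and CCLE_name IN (%s)",
  sql_in_list(symbols),
  sql_in_list(cell_lines)
)
print(query)
```

The resulting string can be passed to dbGetQuery as before. DBI also offers dbQuoteString for the same purpose when a connection is available.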
Things become much nicer if you query with dplyr, since this writes the underlying SQL for you:
con <- src_sqlite(test_db@dbname)
ccle_affy <- con %>% tbl('ccle_affy')
ccle_affy
## Source: sqlite 3.8.6 [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/CancerCellLines/extdata/toy.db]
## From: ccle_affy [1,722 x 4]
##
## ProbeID Symbol CCLE_name Signal
## (chr) (chr) (chr) (dbl)
## 1 1956_at EGFR NCIH524_LUNG 4.773512
## 2 3845_at KRAS NCIH524_LUNG 10.399190
## 3 4893_at NRAS NCIH524_LUNG 8.802869
## 4 5728_at PTEN NCIH524_LUNG 7.442462
## 5 6597_at SMARCA4 NCIH524_LUNG 10.429500
## 6 673_at BRAF NCIH524_LUNG 6.445456
## 7 7157_at TP53 NCIH524_LUNG 7.120139
## 8 1956_at EGFR NCIH209_LUNG 5.712198
## 9 3845_at KRAS NCIH209_LUNG 10.544760
## 10 4893_at NRAS NCIH209_LUNG 9.539366
## .. ... ... ... ...
ccle_sampleinfo <- con %>% tbl('ccle_sampleinfo')
ccle_sampleinfo
## Source: sqlite 3.8.6 [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/CancerCellLines/extdata/toy.db]
## From: ccle_sampleinfo [247 x 13]
##
## CCLE_name Primary_cell_name Cell_line_aliases Gender Site_primary
## (chr) (chr) (chr) (chr) (chr)
## 1 A549_LUNG A549 M lung
## 2 ABC1_LUNG ABC-1 M lung
## 3 AU565_BREAST AU565 F breast
## 4 BEN_LUNG BEN M lung
## 5 BT20_BREAST BT-20 F breast
## 6 BT474_BREAST BT-474 F breast
## 7 BT483_BREAST BT-483 F breast
## 8 BT549_BREAST BT-549 F breast
## 9 CAL120_BREAST CAL-120 F breast
## 10 CAL12T_LUNG CAL-12T M lung
## .. ... ... ... ... ...
## Variables not shown: Histology (chr), Hist_subtype1 (chr), Notes (chr),
## Source (chr), Expression_arrays (chr), SNP_arrays (chr), Oncomap (chr),
## Hybrid_capture_sequencing (chr)
ccle_sampleinfo %>% dplyr::select(CCLE_name, Site_primary, Hist_subtype1) %>%
dplyr::filter(Hist_subtype1 == 'ductal_carcinoma') %>%
dplyr::inner_join(ccle_affy, by='CCLE_name') %>%
dplyr::arrange(desc(Symbol))
## Source: sqlite 3.8.6 [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/CancerCellLines/extdata/toy.db]
## From: <derived table> [?? x 6]
## Arrange: desc(Symbol)
##
## CCLE_name Site_primary Hist_subtype1 ProbeID Symbol Signal
## (chr) (chr) (chr) (chr) (chr) (dbl)
## 1 KPL1_BREAST breast ductal_carcinoma 7157_at TP53 8.952216
## 2 HDQP1_BREAST breast ductal_carcinoma 7157_at TP53 5.294608
## 3 UACC893_BREAST breast ductal_carcinoma 7157_at TP53 5.066705
## 4 HCC1599_BREAST breast ductal_carcinoma 7157_at TP53 5.015617
## 5 HCC1500_BREAST breast ductal_carcinoma 7157_at TP53 7.630229
## 6 MDAMB134VI_BREAST breast ductal_carcinoma 7157_at TP53 5.707668
## 7 CAL148_BREAST breast ductal_carcinoma 7157_at TP53 9.157586
## 8 UACC812_BREAST breast ductal_carcinoma 7157_at TP53 6.639146
## 9 HCC1806_BREAST breast ductal_carcinoma 7157_at TP53 6.110553
## 10 JIMT1_BREAST breast ductal_carcinoma 7157_at TP53 9.691416
## .. ... ... ... ... ... ...
ccle_affy %>% filter(symbol %in% symbols & CCLE_name %in% cell_lines)
## Source: sqlite 3.8.6 [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/CancerCellLines/extdata/toy.db]
## From: ccle_affy [6 x 4]
## Filter: symbol %in% c("PTEN", "TP53", "BRAF") & CCLE_name %in%
## c("BT474_BREAST", "MDAMB468_BREAST")
##
## ProbeID Symbol CCLE_name Signal
## (chr) (chr) (chr) (dbl)
## 1 673_at BRAF BT474_BREAST 6.844845
## 2 673_at BRAF MDAMB468_BREAST 6.335567
## 3 5728_at PTEN BT474_BREAST 9.582697
## 4 5728_at PTEN MDAMB468_BREAST 8.200749
## 5 7157_at TP53 BT474_BREAST 8.565303
## 6 7157_at TP53 MDAMB468_BREAST 9.301550
There are a number of convenience functions that assist in executing typical queries. For example, the getAffyData and getCopyNumberData functions can be used to simplify the queries above still further:
getAffyData(test_db, symbols, cell_lines)
## CCLE_name ID Type original value
## 1 BT474_BREAST BRAF affy 6.844845 6.844845
## 2 MDAMB468_BREAST BRAF affy 6.335567 6.335567
## 3 BT474_BREAST PTEN affy 9.582697 9.582697
## 4 MDAMB468_BREAST PTEN affy 8.200749 8.200749
## 5 BT474_BREAST TP53 affy 8.565303 8.565303
## 6 MDAMB468_BREAST TP53 affy 9.30155 9.301550
getCopyNumberData(test_db, symbols, cell_lines)
## CCLE_name ID Type original value
## 1 BT474_BREAST BRAF cn 0.4872 0.4872
## 2 MDAMB468_BREAST BRAF cn 0.0975 0.0975
## 3 BT474_BREAST PTEN cn -0.2208 -0.2208
## 4 MDAMB468_BREAST PTEN cn 0.1157 0.1157
## 5 BT474_BREAST TP53 cn -0.2853 -0.2853
## 6 MDAMB468_BREAST TP53 cn -0.3893 -0.3893
The getHybcapData and getCosmicCLPData functions retrieve the CCLE hybrid capture and Cosmic Cell Line Project sequencing data respectively:
getHybcapData(test_db, symbols, cell_lines)
## Source: local data frame [6 x 5]
##
## CCLE_name ID Type original value
## (chr) (chr) (chr) (chr) (dbl)
## 1 BT474_BREAST TP53 hybcap p.E285K 1
## 2 MDAMB468_BREAST PTEN hybcap - 0
## 3 BT474_BREAST PTEN hybcap - 0
## 4 MDAMB468_BREAST TP53 hybcap - 0
## 5 MDAMB468_BREAST BRAF hybcap - 0
## 6 BT474_BREAST BRAF hybcap - 0
getCosmicCLPData(test_db, symbols, cell_lines)
## Source: local data frame [6 x 5]
##
## CCLE_name ID Type original value
## (chr) (chr) (chr) (chr) (dbl)
## 1 BT474_BREAST TP53 cosmicclp p.E285K 1
## 2 MDAMB468_BREAST TP53 cosmicclp p.R273H 1
## 3 BT474_BREAST PTEN cosmicclp - 0
## 4 MDAMB468_BREAST PTEN cosmicclp - 0
## 5 BT474_BREAST BRAF cosmicclp - 0
## 6 MDAMB468_BREAST BRAF cosmicclp - 0
Note that the CancerCellLines package includes functionality to convert cell line identifiers between different datasets using the cell_line_ids table. This happens transparently in the getCosmicCLPData function:
con %>% tbl('cell_line_ids') %>% filter(unified_id %in% cell_lines)
## Source: sqlite 3.8.6 [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/CancerCellLines/extdata/toy.db]
## From: cell_line_ids [8 x 8]
## Filter: unified_id %in% c("BT474_BREAST", "MDAMB468_BREAST")
##
## unified_id native_id alt_id id_type tissue
## (chr) (chr) (chr) (chr) (chr)
## 1 BT474_BREAST BT474_BREAST BT-474 CCLE breast
## 2 BT474_BREAST BT-474 NA cosmic_clp breast
## 3 BT474_BREAST BT-474 946359 gdsc breast
## 4 BT474_BREAST BT474 NA eurofins breast
## 5 MDAMB468_BREAST MDAMB468_BREAST MDA-MB-468 CCLE breast
## 6 MDAMB468_BREAST MDA-MB-468 NA cosmic_clp breast
## 7 MDAMB468_BREAST MDA-MB-468 908123 gdsc breast
## 8 MDAMB468_BREAST MDA MB 468 NA eurofins breast
## Variables not shown: hist_primary (chr), hist_secondary (chr), match_id
## (chr)
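The identifier conversion described above can be pictured as a simple lookup against this table. The sketch below uses a few rows copied from the output above and base R only; `to_native_id` is a hypothetical helper written for illustration and is not how the package necessarily implements the conversion.

```r
# A few rows copied from the cell_line_ids output above
cell_line_ids <- data.frame(
  unified_id = c("BT474_BREAST", "BT474_BREAST", "MDAMB468_BREAST", "MDAMB468_BREAST"),
  native_id  = c("BT474_BREAST", "BT-474", "MDAMB468_BREAST", "MDA-MB-468"),
  id_type    = c("CCLE", "cosmic_clp", "CCLE", "cosmic_clp"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: translate unified IDs into one dataset's native IDs.
# IDs with no entry for that dataset come back as NA.
to_native_id <- function(unified, id_table, type) {
  lookup <- id_table[id_table$id_type == type, ]
  lookup$native_id[match(unified, lookup$unified_id)]
}

cosmic_names <- to_native_id(
  c("MDAMB468_BREAST", "BT474_BREAST"),
  cell_line_ids,
  "cosmic_clp"
)
print(cosmic_names)
```

Using match keeps the result in the same order as the input, so the translated names can be placed alongside the original identifiers.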
Finally, the getDrugData_CCLE function retrieves the CCLE drug response data:
drugs <- c('Lapatinib', 'AZD6244', 'Nilotinib' )
getDrugData_CCLE(test_db, drugs, cell_lines)
## CCLE_name ID Type original value
## 1 BT474_BREAST AZD6244 resp <NA> NA
## 2 MDAMB468_BREAST AZD6244 resp <NA> NA
## 3 BT474_BREAST Lapatinib resp 8.020292071 5.095810
## 4 MDAMB468_BREAST Lapatinib resp 2.07880187 5.682187
## 5 BT474_BREAST Nilotinib resp 8.738396718 5.058568
## 6 MDAMB468_BREAST Nilotinib resp 3.195904255 5.495406
The getDrugData_custom function, in contrast, transforms an arbitrary data frame with the field names below into the standardised data frame:
data(dietlein_data)
head(dietlein_data)
## unified_id compound_id endpoint original value
## 1 A375_SKIN KU60648 pGI50 810.9611 -2.909
## 2 A549_LUNG KU60648 pGI50 609.5369 -2.785
## 3 COLO205_LARGE_INTESTINE KU60648 pGI50 1233.1048 -3.091
## 4 COLO320_LARGE_INTESTINE KU60648 pGI50 903.6495 -2.956
## 5 DMS114_LUNG KU60648 pGI50 313.3286 -2.496
## 6 DV90_LUNG KU60648 pGI50 229.6149 -2.361
getDrugData_custom(dietlein_data, drugs = 'KU60648_pGI50', cell_lines = c('DMS114_LUNG', 'A549_LUNG'))
## CCLE_name ID Type original value
## 1 A549_LUNG KU60648_pGI50 resp 609.5369 -2.785
## 2 DMS114_LUNG KU60648_pGI50 resp 313.3286 -2.496
These functions all have a standard output format which means that data from different assay types can be merged and plotted or analysed together.
The makeTallDataFrame function does this merging in a standard way and returns the data in a ‘tidy’ format that is useful for plotting in ggplot2 or further manipulation with tidyr.
makeTallDataFrame(test_db, symbols, cell_lines, drugs)
## Source: local data frame [30 x 5]
##
## CCLE_name ID Type original value
## (chr) (chr) (chr) (chr) (dbl)
## 1 BT474_BREAST BRAF affy 6.844845 6.844845
## 2 MDAMB468_BREAST BRAF affy 6.335567 6.335567
## 3 BT474_BREAST PTEN affy 9.582697 9.582697
## 4 MDAMB468_BREAST PTEN affy 8.200749 8.200749
## 5 BT474_BREAST TP53 affy 8.565303 8.565303
## 6 MDAMB468_BREAST TP53 affy 9.30155 9.301550
## 7 BT474_BREAST TP53 hybcap p.E285K 1.000000
## 8 MDAMB468_BREAST PTEN hybcap - 0.000000
## 9 BT474_BREAST PTEN hybcap - 0.000000
## 10 MDAMB468_BREAST TP53 hybcap - 0.000000
## .. ... ... ... ... ...
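Because every retrieval function returns the same five columns, stacking results from different assays is essentially a row bind. The sketch below shows the idea in base R with two hand-built data frames whose values are copied from the output above; it is an illustration of the shared format, not the package's own implementation.

```r
# Expression data in the standard format (values copied from the output above)
affy <- data.frame(
  CCLE_name = c("BT474_BREAST", "MDAMB468_BREAST"),
  ID        = "TP53",
  Type      = "affy",
  original  = c(8.565303, 9.301550),
  value     = c(8.565303, 9.301550),
  stringsAsFactors = FALSE
)

# Mutation data in the same format: 'original' holds the protein change,
# 'value' holds a 0/1 mutation flag
hybcap <- data.frame(
  CCLE_name = c("BT474_BREAST", "MDAMB468_BREAST"),
  ID        = "TP53",
  Type      = "hybcap",
  original  = c("p.E285K", "-"),
  value     = c(1, 0),
  stringsAsFactors = FALSE
)

# 'original' must be character in both frames so that numeric readings and
# text annotations can share one column; 'value' stays numeric throughout
affy$original <- as.character(affy$original)
tall <- rbind(affy, hybcap)
print(tall)
```

Keeping the numeric encoding in value and the raw annotation in original is what allows mutation calls and continuous measurements to sit in one tidy table.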
The makeWideFromTallDataFrame function can take the output from makeTallDataFrame and create a wide or matrix-like data frame, which is a convenient input for modelling packages such as caret.
my_df <- makeTallDataFrame(test_db, symbols, cell_lines, drugs)
makeWideFromTallDataFrame(my_df)
## Source: local data frame [2 x 16]
##
## CCLE_name AZD6244_resp Lapatinib_resp Nilotinib_resp BRAF_affy
## (chr) (dbl) (dbl) (dbl) (dbl)
## 1 BT474_BREAST NA 5.095810 5.058568 6.844845
## 2 MDAMB468_BREAST NA 5.682187 5.495406 6.335567
## Variables not shown: BRAF_cn (dbl), BRAF_cosmicclp (dbl), BRAF_hybcap
## (dbl), PTEN_affy (dbl), PTEN_cn (dbl), PTEN_cosmicclp (dbl), PTEN_hybcap
## (dbl), TP53_affy (dbl), TP53_cn (dbl), TP53_cosmicclp (dbl), TP53_hybcap
## (dbl)
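The tall-to-wide step can be approximated in base R, which may help clarify where column names such as BRAF_affy come from: the ID and Type columns are pasted together, and each combination becomes one column. This is a sketch under that assumption, using a small hand-built tall data frame, and is not the package's own code.

```r
# Small tall data frame in the standard format (values copied from above)
tall <- data.frame(
  CCLE_name = rep(c("BT474_BREAST", "MDAMB468_BREAST"), times = 2),
  ID        = "TP53",
  Type      = rep(c("affy", "hybcap"), each = 2),
  value     = c(8.565303, 9.301550, 1, 0),
  stringsAsFactors = FALSE
)

# One feature name per ID/Type combination, e.g. "TP53_affy"
tall$feature <- paste(tall$ID, tall$Type, sep = "_")

# Spread the features into columns, one row per cell line
wide <- reshape(
  tall[, c("CCLE_name", "feature", "value")],
  idvar     = "CCLE_name",
  timevar   = "feature",
  direction = "wide"
)

# reshape() prefixes the new columns with "value."; strip that prefix
names(wide) <- sub("^value\\.", "", names(wide))
rownames(wide) <- NULL
print(wide)
```

The result has one row per cell line and one numeric column per feature, which is the shape most modelling functions expect.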
Finally, there is the makeWideDataFrame function which generates a wide data frame directly.
makeWideDataFrame(test_db, symbols, cell_lines, drugs)
## Source: local data frame [2 x 16]
##
## CCLE_name AZD6244_resp Lapatinib_resp Nilotinib_resp BRAF_affy
## (chr) (dbl) (dbl) (dbl) (dbl)
## 1 BT474_BREAST NA 5.095810 5.058568 6.844845
## 2 MDAMB468_BREAST NA 5.682187 5.495406 6.335567
## Variables not shown: BRAF_cn (dbl), BRAF_cosmicclp (dbl), BRAF_hybcap
## (dbl), PTEN_affy (dbl), PTEN_cn (dbl), PTEN_cosmicclp (dbl), PTEN_hybcap
## (dbl), TP53_affy (dbl), TP53_cn (dbl), TP53_cosmicclp (dbl), TP53_hybcap
## (dbl)
The data_types parameter can be used to control which data types are returned, and the drug_df parameter is used to provide custom drug information as per the getDrugData_custom function description above.
makeWideDataFrame(test_db, symbols, cell_lines, drugs, data_types=c('hybcap', 'affy', 'resp'))
## Source: local data frame [2 x 10]
##
## CCLE_name AZD6244_resp Lapatinib_resp Nilotinib_resp BRAF_affy
## (chr) (dbl) (dbl) (dbl) (dbl)
## 1 BT474_BREAST NA 5.095810 5.058568 6.844845
## 2 MDAMB468_BREAST NA 5.682187 5.495406 6.335567
## Variables not shown: BRAF_hybcap (dbl), PTEN_affy (dbl), PTEN_hybcap
## (dbl), TP53_affy (dbl), TP53_hybcap (dbl)
The full CCLE dataset is not included in this package because of its size and because permission to redistribute the data has not yet been sought. However, the instructions below show how to build a database from the full dataset:
Define where the data is to be stored/found. Files are downloaded from the CCLE project website and COSMIC Cell Line Project website.
dbpath <- '~/BigData/CellLineData/CancerCellLines.db'
infopath <- '~/BigData/CellLineData/RawData/CCLE_sample_info_file_2012-10-18.txt'
affypath <- '~/BigData/CellLineData/RawData/CCLE_Expression_Entrez_2012-09-29.gct'
cnpath <- '~/BigData/CellLineData/RawData/CCLE_copynumber_byGene_2012-09-29.txt'
hybcappath <- '~/BigData/CellLineData/RawData/CCLE_hybrid_capture1650_hg19_NoCommonSNPs_NoNeutralVariants_CDS_2012.05.07.maf'
cosmicclppath <- '~/BigData/CellLineData/RawData/CosmicCLP_CompleteExport_v74.tsv'
drugpath <- '~/BigData/CellLineData/RawData/CCLE_NP24.2009_Drug_data_2012.02.20.csv'
idspath <- system.file("extdata", "CellLineIDNormalisationNov15.txt", package = "CancerCellLines")
Set up the SQLite database and run the import functions:
full_con <- setupSQLite(dbpath)
importCCLE_info(infopath , full_con)
importCCLE_hybcap(hybcappath , full_con)
importCosmicCLP_exome(cosmicclppath, full_con)
importCCLE_drugresponse(drugpath , full_con)
importCCLE_affy(affypath , full_con)
importCCLE_cn(cnpath, full_con)
importCellLineIDs(idspath, full_con)
This process should take 3-4 minutes, with most of the time spent importing the Affymetrix data.
Now use the database as per the toy example. Thanks to the speed of SQLite and the wonders of indexing, data retrieval should still be just as fast even though the ccle_affy table contains around 20 million data points.
To really put it through its paces try retrieving data from 2000 genes in 200 cell lines as below:
dplyr_con <- src_sqlite(full_con@dbname)
#get 2000 random genes
random_genes <- dplyr_con %>% tbl('ccle_affy') %>% group_by(Symbol) %>% summarise(N=n()) %>%
ungroup() %>% collect %>%
dplyr::filter(N < mean(N)) %>% sample_n(2000) %>% as.data.frame
random_genes <- random_genes$Symbol
#get 200 random cell lines
random_cell_lines <- dplyr_con %>% tbl('ccle_sampleinfo') %>% dplyr::select(CCLE_name) %>%
distinct %>% collect %>% sample_n(200) %>% as.data.frame
random_cell_lines <- random_cell_lines$CCLE_name
#get 10 random compounds
random_drugs <- dplyr_con %>% tbl('ccle_drug_data') %>% dplyr::select(Compound) %>%
distinct %>% collect %>% sample_n(10) %>% as.data.frame
random_drugs <- random_drugs$Compound
#retrieve the data
test_affy <- getAffyData(full_con, random_genes, random_cell_lines)
test_cn <- getCopyNumberData(full_con, random_genes, random_cell_lines)
test_hybcap <- getHybcapData(full_con, random_genes, random_cell_lines)
test_cosmicclp <- getCosmicCLPData(full_con, random_genes, random_cell_lines)
#make a big data frame
big_df <- makeWideDataFrame(full_con, random_genes, random_cell_lines, random_drugs)
#without resp data
big_df <- makeWideDataFrame(full_con, random_genes, random_cell_lines, drugs=NULL, data_types=c('affy', 'cn', 'hybcap', 'cosmicclp'))
#with custom resp data
big_df <- makeWideDataFrame(full_con, random_genes, cell_lines = c('DMS114_LUNG', 'A549_LUNG'), drugs = 'KU60648_pGI50', drug_df = dietlein_data)
This should take no more than 4-5 seconds for each constituent retrieval, and ~10 seconds to make the data frame, depending on your hardware (SSDs will be quicker than HDDs).
Future plans are to integrate the idea of using SQLite for fast on-disk subsetting and retrieval with the biocMultiAssay package. This will allow generic extension of the concept to other datasets without having to define import and retrieval functions and database schemas one dataset at a time.
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.9.5 (Mavericks)
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] CancerCellLines_0.6.6 reshape2_1.4.1 RSQLite_1.0.0
## [4] DBI_0.3.1 shiny_0.12.2 ggplot2_1.0.1
## [7] scales_0.3.0 tidyr_0.3.1 readr_0.2.2
## [10] readxl_0.1.0 dplyr_0.4.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.2 knitr_1.11 magrittr_1.5 MASS_7.3-45
## [5] munsell_0.4.2 xtable_1.8-0 colorspace_1.2-6 R6_2.1.1
## [9] stringr_1.0.0 plyr_1.8.3 tools_3.2.2 parallel_3.2.2
## [13] grid_3.2.2 gtable_0.1.2 htmltools_0.2.6 lazyeval_0.1.10
## [17] yaml_2.1.13 assertthat_0.1 digest_0.6.8 formatR_1.2.1
## [21] mime_0.4 evaluate_0.8 rmarkdown_0.8.1 stringi_1.0-1
## [25] httpuv_1.3.3 proto_0.3-10