Overview Vignette for CancerCellLines Package

Introduction

The aim of the CancerCellLines package is to provide standardised code to create and extract data from a SQLite database containing the published genomic data from the Cancer Cell Line Encyclopedia and similar projects. The reason for using a SQLite database is to allow data to be stored on disk, rather than be loaded into memory. This is useful when the user wishes to work with small subsets of the overall dataset, for example just 10-20 genes in 80 lung cell lines. This Vignette will cover the inital set up of the package along with some examples of its use.

Data Origin

Data files referred to here can be download from the CCLE project website.

There are also toy examples included in the package:

list.files(system.file("extdata", package = "CancerCellLines"))

##  [1] "CCLE_copynumber_byGene_2012-09-29_toy.txt"                                         
##  [2] "CCLE_Expression_Entrez_2012-09-29_toy.gct"                                         
##  [3] "CCLE_hybrid_capture1650_hg19_NoCommonSNPs_NoNeutralVariants_CDS_2012.05.07_toy.maf"
##  [4] "CCLE_NP24.2009_Drug_data_2012.02.20_toy.txt"                                       
##  [5] "CCLE_sample_info_file_2012-10-18_toy.txt"                                          
##  [6] "CellLineIDNormalisationNov15_toy.txt"                                              
##  [7] "CellLineIDNormalisationNov15.txt"                                                  
##  [8] "CosmicCLP_CompleteExport_v74_toy.tsv"                                              
##  [9] "Dietlein2014_supp_table_1.txt"                                                     
## [10] "scale_colours.txt"                                                                 
## [11] "toy.db"

Creating/Connecting to the toy dataset

Either make a toy database from scratch using the convenience function makeToyDB:

test_db <- makeToyDB()

## Parse the gene expression data file
## Writing to database
## Indexing the table
## Finished importing affy data
## Parse the copy number data file
## Writing to database
## Indexing the table
## Finished importing cn data
## Parse the Cosmic CLP exome data file
## Write the data to the database
## Indexing the table
## Finished importing Cosmic CLP exome sequencing data

## [1] "A database of toy data has successfully been created and a connector returned"

test_db

## <SQLiteConnection>

test_db@dbname

## [1] "/var/folders/9d/hs3g06m50095p0_tphcpyzy80000gp/T//RtmphyciSX/file8b85e03d71c"

dbListTables(test_db)

## [1] "ccle_affy"       "ccle_cn"         "ccle_drug_data"  "ccle_hybcap"    
## [5] "ccle_sampleinfo" "cell_line_ids"   "cosmicclp_exome"

Or connect to the one built into the package using the setupSQLite function:

test_db <- setupSQLite(system.file('extdata/toy.db', package="CancerCellLines"))
test_db

## <SQLiteConnection>

test_db@dbname

## [1] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/CancerCellLines/extdata/toy.db"

dbListTables(test_db)

## [1] "ccle_affy"       "ccle_cn"         "ccle_drug_data"  "ccle_hybcap"    
## [5] "ccle_sampleinfo" "cell_line_ids"   "cosmicclp_exome"

Querying the toy dataset with RSQLite

The functions from RSQLite can be used to query data in the normal way:

dbGetQuery(test_db, "select * from ccle_affy limit 10")

##    ProbeID  Symbol    CCLE_name    Signal
## 1  1956_at    EGFR NCIH524_LUNG  4.773512
## 2  3845_at    KRAS NCIH524_LUNG 10.399190
## 3  4893_at    NRAS NCIH524_LUNG  8.802869
## 4  5728_at    PTEN NCIH524_LUNG  7.442462
## 5  6597_at SMARCA4 NCIH524_LUNG 10.429500
## 6   673_at    BRAF NCIH524_LUNG  6.445456
## 7  7157_at    TP53 NCIH524_LUNG  7.120139
## 8  1956_at    EGFR NCIH209_LUNG  5.712198
## 9  3845_at    KRAS NCIH209_LUNG 10.544760
## 10 4893_at    NRAS NCIH209_LUNG  9.539366

dbGetQuery(test_db, "select * from ccle_sampleinfo limit 10")[,1:5]

##        CCLE_name Primary_cell_name Cell_line_aliases Gender Site_primary
## 1      A549_LUNG              A549                        M         lung
## 2      ABC1_LUNG             ABC-1                        M         lung
## 3   AU565_BREAST             AU565                        F       breast
## 4       BEN_LUNG               BEN                        M         lung
## 5    BT20_BREAST             BT-20                        F       breast
## 6   BT474_BREAST            BT-474                        F       breast
## 7   BT483_BREAST            BT-483                        F       breast
## 8   BT549_BREAST            BT-549                        F       breast
## 9  CAL120_BREAST           CAL-120                        F       breast
## 10   CAL12T_LUNG           CAL-12T                        M         lung

dbGetQuery(test_db, "select Symbol, t1.CCLE_name, Signal, Site_primary, Hist_subtype1 from ccle_affy as t1 
                      inner join ccle_sampleinfo t2 on t1.CCLE_name = t2.CCLE_name
                      where t2.Hist_subtype1 == 'ductal_carcinoma'
                      order by Symbol desc
                      limit 10")

##    Symbol         CCLE_name   Signal Site_primary    Hist_subtype1
## 1    TP53       KPL1_BREAST 8.952216       breast ductal_carcinoma
## 2    TP53      HDQP1_BREAST 5.294608       breast ductal_carcinoma
## 3    TP53    UACC893_BREAST 5.066705       breast ductal_carcinoma
## 4    TP53    HCC1599_BREAST 5.015617       breast ductal_carcinoma
## 5    TP53    HCC1500_BREAST 7.630229       breast ductal_carcinoma
## 6    TP53 MDAMB134VI_BREAST 5.707668       breast ductal_carcinoma
## 7    TP53     CAL148_BREAST 9.157586       breast ductal_carcinoma
## 8    TP53    UACC812_BREAST 6.639146       breast ductal_carcinoma
## 9    TP53    HCC1806_BREAST 6.110553       breast ductal_carcinoma
## 10   TP53      JIMT1_BREAST 9.691416       breast ductal_carcinoma

Indexing the database allows fast retrieval even when the dataset gets large - more later.

However, writing the SQL yourself can get inconvenient if you want to retrieve several genes or cell lines:

dbGetQuery(test_db, "select * from ccle_affy 
                      where symbol IN ('PTEN', 'TP53', 'BRAF' ) and 
                            CCLE_name IN ('BT474_BREAST', 'MDAMB468_BREAST') 
                      limit 10")

##   ProbeID Symbol       CCLE_name   Signal
## 1  673_at   BRAF    BT474_BREAST 6.844845
## 2  673_at   BRAF MDAMB468_BREAST 6.335567
## 3 5728_at   PTEN    BT474_BREAST 9.582697
## 4 5728_at   PTEN MDAMB468_BREAST 8.200749
## 5 7157_at   TP53    BT474_BREAST 8.565303
## 6 7157_at   TP53 MDAMB468_BREAST 9.301550

symbols <- c('PTEN', 'TP53', 'BRAF')
cell_lines <- c('BT474_BREAST', 'MDAMB468_BREAST') 
symbols.sql <- paste(symbols, collapse="','")
cell_lines.sql <- paste(cell_lines, collapse="','")

dbGetQuery(test_db, sprintf("select * from ccle_affy 
                      where symbol IN ('%s' ) and 
                            CCLE_name IN ('%s') 
                      limit 10", symbols.sql, cell_lines.sql))

##   ProbeID Symbol       CCLE_name   Signal
## 1  673_at   BRAF    BT474_BREAST 6.844845
## 2  673_at   BRAF MDAMB468_BREAST 6.335567
## 3 5728_at   PTEN    BT474_BREAST 9.582697
## 4 5728_at   PTEN MDAMB468_BREAST 8.200749
## 5 7157_at   TP53    BT474_BREAST 8.565303
## 6 7157_at   TP53 MDAMB468_BREAST 9.301550

Querying the toy dataset with dplyr

Things become much nicer if you query with dplyr, since this writes the underlying SQL for you:

con <- src_sqlite(test_db@dbname) 
ccle_affy <- con %>% tbl('ccle_affy')
ccle_affy

## Source: sqlite 3.8.6 [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/CancerCellLines/extdata/toy.db]
## From: ccle_affy [1,722 x 4]
## 
##    ProbeID  Symbol    CCLE_name    Signal
##      (chr)   (chr)        (chr)     (dbl)
## 1  1956_at    EGFR NCIH524_LUNG  4.773512
## 2  3845_at    KRAS NCIH524_LUNG 10.399190
## 3  4893_at    NRAS NCIH524_LUNG  8.802869
## 4  5728_at    PTEN NCIH524_LUNG  7.442462
## 5  6597_at SMARCA4 NCIH524_LUNG 10.429500
## 6   673_at    BRAF NCIH524_LUNG  6.445456
## 7  7157_at    TP53 NCIH524_LUNG  7.120139
## 8  1956_at    EGFR NCIH209_LUNG  5.712198
## 9  3845_at    KRAS NCIH209_LUNG 10.544760
## 10 4893_at    NRAS NCIH209_LUNG  9.539366
## ..     ...     ...          ...       ...

ccle_sampleinfo <- con %>% tbl('ccle_sampleinfo')
ccle_sampleinfo

## Source: sqlite 3.8.6 [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/CancerCellLines/extdata/toy.db]
## From: ccle_sampleinfo [247 x 13]
## 
##        CCLE_name Primary_cell_name Cell_line_aliases Gender Site_primary
##            (chr)             (chr)             (chr)  (chr)        (chr)
## 1      A549_LUNG              A549                        M         lung
## 2      ABC1_LUNG             ABC-1                        M         lung
## 3   AU565_BREAST             AU565                        F       breast
## 4       BEN_LUNG               BEN                        M         lung
## 5    BT20_BREAST             BT-20                        F       breast
## 6   BT474_BREAST            BT-474                        F       breast
## 7   BT483_BREAST            BT-483                        F       breast
## 8   BT549_BREAST            BT-549                        F       breast
## 9  CAL120_BREAST           CAL-120                        F       breast
## 10   CAL12T_LUNG           CAL-12T                        M         lung
## ..           ...               ...               ...    ...          ...
## Variables not shown: Histology (chr), Hist_subtype1 (chr), Notes (chr),
##   Source (chr), Expression_arrays (chr), SNP_arrays (chr), Oncomap (chr),
##   Hybrid_capture_sequencing (chr)

ccle_sampleinfo %>% dplyr::select(CCLE_name, Site_primary, Hist_subtype1) %>% 
  dplyr::filter(Hist_subtype1 == 'ductal_carcinoma') %>%
  dplyr::inner_join(ccle_affy, by='CCLE_name') %>%
  dplyr::arrange(desc(Symbol))

## Source: sqlite 3.8.6 [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/CancerCellLines/extdata/toy.db]
## From: <derived table> [?? x 6]
## Arrange: desc(Symbol) 
## 
##            CCLE_name Site_primary    Hist_subtype1 ProbeID Symbol   Signal
##                (chr)        (chr)            (chr)   (chr)  (chr)    (dbl)
## 1        KPL1_BREAST       breast ductal_carcinoma 7157_at   TP53 8.952216
## 2       HDQP1_BREAST       breast ductal_carcinoma 7157_at   TP53 5.294608
## 3     UACC893_BREAST       breast ductal_carcinoma 7157_at   TP53 5.066705
## 4     HCC1599_BREAST       breast ductal_carcinoma 7157_at   TP53 5.015617
## 5     HCC1500_BREAST       breast ductal_carcinoma 7157_at   TP53 7.630229
## 6  MDAMB134VI_BREAST       breast ductal_carcinoma 7157_at   TP53 5.707668
## 7      CAL148_BREAST       breast ductal_carcinoma 7157_at   TP53 9.157586
## 8     UACC812_BREAST       breast ductal_carcinoma 7157_at   TP53 6.639146
## 9     HCC1806_BREAST       breast ductal_carcinoma 7157_at   TP53 6.110553
## 10      JIMT1_BREAST       breast ductal_carcinoma 7157_at   TP53 9.691416
## ..               ...          ...              ...     ...    ...      ...

ccle_affy %>% filter(symbol %in% symbols & CCLE_name %in% cell_lines)

## Source: sqlite 3.8.6 [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/CancerCellLines/extdata/toy.db]
## From: ccle_affy [6 x 4]
## Filter: symbol %in% c("PTEN", "TP53", "BRAF") & CCLE_name %in%
##   c("BT474_BREAST", "MDAMB468_BREAST") 
## 
##   ProbeID Symbol       CCLE_name   Signal
##     (chr)  (chr)           (chr)    (dbl)
## 1  673_at   BRAF    BT474_BREAST 6.844845
## 2  673_at   BRAF MDAMB468_BREAST 6.335567
## 3 5728_at   PTEN    BT474_BREAST 9.582697
## 4 5728_at   PTEN MDAMB468_BREAST 8.200749
## 5 7157_at   TP53    BT474_BREAST 8.565303
## 6 7157_at   TP53 MDAMB468_BREAST 9.301550

Convenience functions to export data

There are a number of convenience functions that assist in executing typical queries. For example, the getAffyData and getCopyNumberData functions can be used to simplify the queries above still further:

getAffyData(test_db, symbols, cell_lines)

##         CCLE_name   ID Type original    value
## 1    BT474_BREAST BRAF affy 6.844845 6.844845
## 2 MDAMB468_BREAST BRAF affy 6.335567 6.335567
## 3    BT474_BREAST PTEN affy 9.582697 9.582697
## 4 MDAMB468_BREAST PTEN affy 8.200749 8.200749
## 5    BT474_BREAST TP53 affy 8.565303 8.565303
## 6 MDAMB468_BREAST TP53 affy  9.30155 9.301550

getCopyNumberData(test_db, symbols, cell_lines)

##         CCLE_name   ID Type original   value
## 1    BT474_BREAST BRAF   cn   0.4872  0.4872
## 2 MDAMB468_BREAST BRAF   cn   0.0975  0.0975
## 3    BT474_BREAST PTEN   cn  -0.2208 -0.2208
## 4 MDAMB468_BREAST PTEN   cn   0.1157  0.1157
## 5    BT474_BREAST TP53   cn  -0.2853 -0.2853
## 6 MDAMB468_BREAST TP53   cn  -0.3893 -0.3893

Whilst the getHybcapData and getCosmicCLPData functions retrieve the CCLE hybrid capture and Cosmic Cell Line Project sequencing data respectively:

getHybcapData(test_db, symbols, cell_lines)

## Source: local data frame [6 x 5]
## 
##         CCLE_name    ID   Type original value
##             (chr) (chr)  (chr)    (chr) (dbl)
## 1    BT474_BREAST  TP53 hybcap  p.E285K     1
## 2 MDAMB468_BREAST  PTEN hybcap        -     0
## 3    BT474_BREAST  PTEN hybcap        -     0
## 4 MDAMB468_BREAST  TP53 hybcap        -     0
## 5 MDAMB468_BREAST  BRAF hybcap        -     0
## 6    BT474_BREAST  BRAF hybcap        -     0

getCosmicCLPData(test_db, symbols, cell_lines)

## Source: local data frame [6 x 5]
## 
##         CCLE_name    ID      Type original value
##             (chr) (chr)     (chr)    (chr) (dbl)
## 1    BT474_BREAST  TP53 cosmicclp  p.E285K     1
## 2 MDAMB468_BREAST  TP53 cosmicclp  p.R273H     1
## 3    BT474_BREAST  PTEN cosmicclp        -     0
## 4 MDAMB468_BREAST  PTEN cosmicclp        -     0
## 5    BT474_BREAST  BRAF cosmicclp        -     0
## 6 MDAMB468_BREAST  BRAF cosmicclp        -     0

Note that the CancerCellLines package includes functionality to convert cell line identifiers between different datasets using the cell_line_ids table. This happens transparently in the getCosmicCLPData function:

con %>% tbl('cell_line_ids') %>% filter(unified_id %in% cell_lines)

## Source: sqlite 3.8.6 [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/CancerCellLines/extdata/toy.db]
## From: cell_line_ids [8 x 8]
## Filter: unified_id %in% c("BT474_BREAST", "MDAMB468_BREAST") 
## 
##        unified_id       native_id     alt_id    id_type tissue
##             (chr)           (chr)      (chr)      (chr)  (chr)
## 1    BT474_BREAST    BT474_BREAST     BT-474       CCLE breast
## 2    BT474_BREAST          BT-474         NA cosmic_clp breast
## 3    BT474_BREAST          BT-474     946359       gdsc breast
## 4    BT474_BREAST           BT474         NA   eurofins breast
## 5 MDAMB468_BREAST MDAMB468_BREAST MDA-MB-468       CCLE breast
## 6 MDAMB468_BREAST      MDA-MB-468         NA cosmic_clp breast
## 7 MDAMB468_BREAST      MDA-MB-468     908123       gdsc breast
## 8 MDAMB468_BREAST      MDA MB 468         NA   eurofins breast
## Variables not shown: hist_primary (chr), hist_secondary (chr), match_id
##   (chr)

Finally, the getDrugData_CCLE function retrieves the CCLE drug response data:

drugs <- c('Lapatinib', 'AZD6244', 'Nilotinib' )
getDrugData_CCLE(test_db, drugs, cell_lines)

##         CCLE_name        ID Type    original    value
## 1    BT474_BREAST   AZD6244 resp        <NA>       NA
## 2 MDAMB468_BREAST   AZD6244 resp        <NA>       NA
## 3    BT474_BREAST Lapatinib resp 8.020292071 5.095810
## 4 MDAMB468_BREAST Lapatinib resp  2.07880187 5.682187
## 5    BT474_BREAST Nilotinib resp 8.738396718 5.058568
## 6 MDAMB468_BREAST Nilotinib resp 3.195904255 5.495406

Whilst the getDrugData_custom function transforms an arbitrary data frame with the field names below into the standardised data frame:

data(dietlein_data)
head(dietlein_data)

##                unified_id compound_id endpoint  original  value
## 1               A375_SKIN     KU60648    pGI50  810.9611 -2.909
## 2               A549_LUNG     KU60648    pGI50  609.5369 -2.785
## 3 COLO205_LARGE_INTESTINE     KU60648    pGI50 1233.1048 -3.091
## 4 COLO320_LARGE_INTESTINE     KU60648    pGI50  903.6495 -2.956
## 5             DMS114_LUNG     KU60648    pGI50  313.3286 -2.496
## 6               DV90_LUNG     KU60648    pGI50  229.6149 -2.361

getDrugData_custom(dietlein_data, drugs = 'KU60648_pGI50', cell_lines = c('DMS114_LUNG', 'A549_LUNG'))

##     CCLE_name            ID Type original  value
## 1   A549_LUNG KU60648_pGI50 resp 609.5369 -2.785
## 2 DMS114_LUNG KU60648_pGI50 resp 313.3286 -2.496

Combining different data types

These functions all have a standard output format which means that data from different assay types can be merged and plotted or analysed together.

The makeTallDataFrame function does this merging in a standard way and returns the the data in a ‘tidy’ format that is useful for plotting in ggplot2 or further manipulation with tidyr.

makeTallDataFrame(test_db, symbols, cell_lines, drugs)

## Source: local data frame [30 x 5]
## 
##          CCLE_name    ID   Type original    value
##              (chr) (chr)  (chr)    (chr)    (dbl)
## 1     BT474_BREAST  BRAF   affy 6.844845 6.844845
## 2  MDAMB468_BREAST  BRAF   affy 6.335567 6.335567
## 3     BT474_BREAST  PTEN   affy 9.582697 9.582697
## 4  MDAMB468_BREAST  PTEN   affy 8.200749 8.200749
## 5     BT474_BREAST  TP53   affy 8.565303 8.565303
## 6  MDAMB468_BREAST  TP53   affy  9.30155 9.301550
## 7     BT474_BREAST  TP53 hybcap  p.E285K 1.000000
## 8  MDAMB468_BREAST  PTEN hybcap        - 0.000000
## 9     BT474_BREAST  PTEN hybcap        - 0.000000
## 10 MDAMB468_BREAST  TP53 hybcap        - 0.000000
## ..             ...   ...    ...      ...      ...

The makeWideFromTallDataFrame function can take the output from makeTallDataFrame and create a wide or matrix-like data frame which is a conveninent input for modelling packages such as caret.

my_df <- makeTallDataFrame(test_db, symbols, cell_lines, drugs)
makeWideFromTallDataFrame(my_df)

## Source: local data frame [2 x 16]
## 
##         CCLE_name AZD6244_resp Lapatinib_resp Nilotinib_resp BRAF_affy
##             (chr)        (dbl)          (dbl)          (dbl)     (dbl)
## 1    BT474_BREAST           NA       5.095810       5.058568  6.844845
## 2 MDAMB468_BREAST           NA       5.682187       5.495406  6.335567
## Variables not shown: BRAF_cn (dbl), BRAF_cosmicclp (dbl), BRAF_hybcap
##   (dbl), PTEN_affy (dbl), PTEN_cn (dbl), PTEN_cosmicclp (dbl), PTEN_hybcap
##   (dbl), TP53_affy (dbl), TP53_cn (dbl), TP53_cosmicclp (dbl), TP53_hybcap
##   (dbl)

Finally, there is the makeWideDataFrame function which generates a wide data frame directly.

makeWideDataFrame(test_db, symbols, cell_lines, drugs)

## Source: local data frame [2 x 16]
## 
##         CCLE_name AZD6244_resp Lapatinib_resp Nilotinib_resp BRAF_affy
##             (chr)        (dbl)          (dbl)          (dbl)     (dbl)
## 1    BT474_BREAST           NA       5.095810       5.058568  6.844845
## 2 MDAMB468_BREAST           NA       5.682187       5.495406  6.335567
## Variables not shown: BRAF_cn (dbl), BRAF_cosmicclp (dbl), BRAF_hybcap
##   (dbl), PTEN_affy (dbl), PTEN_cn (dbl), PTEN_cosmicclp (dbl), PTEN_hybcap
##   (dbl), TP53_affy (dbl), TP53_cn (dbl), TP53_cosmicclp (dbl), TP53_hybcap
##   (dbl)

The data_types parameter can be used to control which data types are returned, and the drug_df parameter is used to provide custom drug information as per the getDrugData_custom function description above

makeWideDataFrame(test_db, symbols, cell_lines, drugs, data_types=c('hybcap', 'affy', 'resp'))

## Source: local data frame [2 x 10]
## 
##         CCLE_name AZD6244_resp Lapatinib_resp Nilotinib_resp BRAF_affy
##             (chr)        (dbl)          (dbl)          (dbl)     (dbl)
## 1    BT474_BREAST           NA       5.095810       5.058568  6.844845
## 2 MDAMB468_BREAST           NA       5.682187       5.495406  6.335567
## Variables not shown: BRAF_hybcap (dbl), PTEN_affy (dbl), PTEN_hybcap
##   (dbl), TP53_affy (dbl), TP53_hybcap (dbl)

Working with the full CCLE dataset

The full CCLE dataset is not included in this package due to reasons of data size and because permission for data re-distribution has not yet been sought. However, the instructions below will demonstrate how this is done:

Define where the data is to be stored/found. Files are downloaded from the CCLE project website and COSMIC Cell Line Project website.

dbpath <- '~/BigData/CellLineData/CancerCellLines.db'
infopath <- '~/BigData/CellLineData/RawData/CCLE_sample_info_file_2012-10-18.txt'
affypath <- '~/BigData/CellLineData/RawData/CCLE_Expression_Entrez_2012-09-29.gct'
cnpath <- '~/BigData/CellLineData/RawData/CCLE_copynumber_byGene_2012-09-29.txt'
hybcappath <- '~/BigData/CellLineData/RawData/CCLE_hybrid_capture1650_hg19_NoCommonSNPs_NoNeutralVariants_CDS_2012.05.07.maf'
cosmicclppath <- '~/BigData/CellLineData/RawData/CosmicCLP_CompleteExport_v74.tsv'
drugpath <- '~/BigData/CellLineData/RawData/CCLE_NP24.2009_Drug_data_2012.02.20.csv'
idspath <- system.file("extdata", "CellLineIDNormalisationNov15.txt", package = "CancerCellLines")

Set up the SQLite database and run the import functions

  full_con <- setupSQLite(dbpath)
  importCCLE_info(infopath , full_con)
  importCCLE_hybcap(hybcappath , full_con)
  importCosmicCLP_exome(cosmicclppath, full_con)
  importCCLE_drugresponse(drugpath , full_con)
  importCCLE_affy(affypath , full_con)
  importCCLE_cn(cnpath, full_con)
  importCellLineIDs(idspath, full_con)

This process should take 3-4 minutes with most of the time spent importing the affymetrix data.

Now use the database as per the toy example. Thanks to the speed of SQLite and the wonders of indexing, data retrieval should still be just as fast even though the ccle_affy table contains around 20 million data points.

To really put it through its paces try retrieving data from 2000 genes in 200 cell lines as below:

    dplyr_con <- src_sqlite(full_con@dbname)
    
    #get 2000 random genes
    random_genes <- dplyr_con %>% tbl('ccle_affy') %>% group_by(Symbol) %>% summarise(N=n()) %>% 
      ungroup() %>% collect %>% 
      dplyr::filter(N < mean(N)) %>% sample_n(2000) %>% as.data.frame
    random_genes <- random_genes$Symbol
  
    #get 200 random cell lines
    random_cell_lines <- dplyr_con %>% tbl('ccle_sampleinfo') %>% dplyr::select(CCLE_name) %>%
      distinct %>% collect %>% sample_n(200) %>% as.data.frame
    random_cell_lines <- random_cell_lines$CCLE_name
    
    #get 10 random compounds
    random_drugs <- dplyr_con %>% tbl('ccle_drug_data') %>% dplyr::select(Compound) %>%
      distinct %>% collect %>% sample_n(10) %>% as.data.frame
    random_drugs <- random_drugs$Compound
    
    #retrieve the data
    test_affy <- getAffyData(full_con, random_genes, random_cell_lines)
    test_cn <- getCopyNumberData(full_con, random_genes, random_cell_lines)
    test_hybcap <- getHybcapData(full_con, random_genes, random_cell_lines)
    test_cosmicclp <- getCosmicCLPData(full_con, random_genes, random_cell_lines)
    
    #make a big data frame
    big_df <- makeWideDataFrame(full_con, random_genes, random_cell_lines, random_drugs)
    
    #without resp data
    big_df <- makeWideDataFrame(full_con, random_genes, random_cell_lines, drugs=NULL, data_types=c('affy', 'cn', 'hybcap', 'cosmicclp'))
    
    #with custom resp data
    big_df <- makeWideDataFrame(full_con, random_genes, cell_lines = c('DMS114_LUNG', 'A549_LUNG'), drugs = 'KU60648_pGI50', drug_df = dietlein_data)

This should take no more than 4-5 seconds for each constituent retrieval, and ~10 seconds to make the data frame depending on your hardware (SSD’s will be quicker than HDD’s).

Future directions

Future plans are to integrate the thinking of using SQLite for fast on disk subsetting and retrieval with the biocMultiAssay package. The will allow generic extension of the concept to other datasets without having to define import and retrieval functions and database schemas one dataset at a time.

Session Info

   sessionInfo()

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.9.5 (Mavericks)
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] CancerCellLines_0.6.6 reshape2_1.4.1        RSQLite_1.0.0        
##  [4] DBI_0.3.1             shiny_0.12.2          ggplot2_1.0.1        
##  [7] scales_0.3.0          tidyr_0.3.1           readr_0.2.2          
## [10] readxl_0.1.0          dplyr_0.4.3          
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.2      knitr_1.11       magrittr_1.5     MASS_7.3-45     
##  [5] munsell_0.4.2    xtable_1.8-0     colorspace_1.2-6 R6_2.1.1        
##  [9] stringr_1.0.0    plyr_1.8.3       tools_3.2.2      parallel_3.2.2  
## [13] grid_3.2.2       gtable_0.1.2     htmltools_0.2.6  lazyeval_0.1.10 
## [17] yaml_2.1.13      assertthat_0.1   digest_0.6.8     formatR_1.2.1   
## [21] mime_0.4         evaluate_0.8     rmarkdown_0.8.1  stringi_1.0-1   
## [25] httpuv_1.3.3     proto_0.3-10