BIOCONDUCTOR

In the video we talk about phenotypes. Here we show some examples of what we mean by phenotypes, how they can be coded in R objects, and how we compute with them. Later in the course we will perform analyses that statistically connect these phenotypes to measured molecular outcomes. Here we explore the use of data frames to store phenoypes (columns) from several individuals (rows).

Question 1.1.1

Install and attach the COPDSexualDimorphism.data package using biocLite.

source("http://www.bioconductor.org/biocLite.R")
biocLite("COPDSexualDimorphism.data")

Use the commands

library(COPDSexualDimorphism.data)
data(lgrc.expr.meta)

to add the object expr.meta to your workspace. The variable pkyrs in the expr.meta data.frame represents pack years smoked. Other variables include gender (self-explanatory) and diagmaj (disease status).

What is the number of female participants in this study?:

table(expr.meta$gender)

## 
##   1-Male 2-Female 
##      119      110

Question 1.1.2

What is the median of the distribution of pack years smoked in this cohort (women and men)?

median(expr.meta$pkyrs, na.rm=TRUE)

## [1] 40

summary(expr.meta$pkyrs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   15.00   40.00   44.17   60.00  212.00

Question 1.1.3

True or False: The distribution of pack-years smoked is well-approximated by a Gaussian (Normal) probability distribution.

library(rafalib)
mypar(1,2)
hist(expr.meta$pkyrs)
qqnorm(expr.meta$pkyrs)

FALSE. The substantial number of zero values renders a Gaussian model for pack years formally impossible.

As a result of the human genome project sequenced we have the consensus sequence of all human chromsomes, as well as several other species. We say consensus sequence because every individual has a different sequence. But well over 99% is the same.

Suppose you want to ask a questions such as: how many times does the sequence “ATG” appear on chromosome 11 ? Or what are the percentage of A,T,C and G on chromosome 7?

We can answer such question using Bioconductor tools. The human genome sequence is provided in the BSgenome.Hsapiens.UCSC.hg19 package. If you have not done so already please donwload and install this package. Note that it encodes 3 billion bases and is therefore a large package (over 800MB) so make time to download it especially if you have a slow internet connection.

library(BiocInstaller)
biocLite("checkmate")
biocLite("rtracklayer")
biocLite("BSgenome")
biocLite("BSgenome.Hsapiens.UCSC.hg19")

Then load the package and note that you now have access to sequence information.

# install.packages("~/R/BSgenome.Hsapiens.UCSC.hg19_1.4.0.zip", repos = NULL, type = "win.binary")
# install.packages("~/R/BSgenome_1.34.1.zip", repos = NULL, type = "win.binary")
library(BSgenome.Hsapiens.UCSC.hg19)

## Loading required package: BSgenome
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: S4Vectors
## Loading required package: stats4
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: Biostrings
## Loading required package: XVector
## Loading required package: rtracklayer

BSgenome.Hsapiens.UCSC.hg19

## Human genome:
## # organism: Homo sapiens (Human)
## # provider: UCSC
## # provider version: hg19
## # release date: Feb. 2009
## # release name: Genome Reference Consortium GRCh37
## # 93 sequences:
## #   chr1                  chr2                  chr3                 
## #   chr4                  chr5                  chr6                 
## #   chr7                  chr8                  chr9                 
## #   chr10                 chr11                 chr12                
## #   chr13                 chr14                 chr15                
## #   ...                   ...                   ...                  
## #   chrUn_gl000235        chrUn_gl000236        chrUn_gl000237       
## #   chrUn_gl000238        chrUn_gl000239        chrUn_gl000240       
## #   chrUn_gl000241        chrUn_gl000242        chrUn_gl000243       
## #   chrUn_gl000244        chrUn_gl000245        chrUn_gl000246       
## #   chrUn_gl000247        chrUn_gl000248        chrUn_gl000249       
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)

Note this divided into chromosomes and includes several unmapped. We will learn to use this type of object. We can access chromosome 11 like this.

chr11seq <- BSgenome.Hsapiens.UCSC.hg19[["chr11"]]

Here, for example, is a segment of 25 bases starting at base 1 million

subseq(chr11seq,start = 10^6, width = 25)

##   25-letter "DNAString" instance
## seq: GTTTCACCGTGTTAGCCAGGGTGGT

Question 1.2.1

Read the help file for the function countPattern and tell us which of the following sequences is most common on chromosome 11: “ATG”,“TGA”,“TAA”,“TAG”.

# Simple method 
seqs = c("ATG", "TGA", "TAA","TAG")
n =sapply(seqs, function(x) countPattern(x,chr11seq))
which.max(n)

## TAA 
##   3

# Complex method
ATG <- BString(x="ATG",start=1, nchar=NA)
TGA <- BString(x="TGA", start=1,nchar=NA)
TAA <- BString(x="TAA", start = 1, nchar = NA)
TAG <- BString(x="TAG",start = 1, nchar = NA)

countPattern(ATG,chr11seq,max.mismatch = 0,min.mismatch = 0,with.indels = FALSE,fixed = TRUE,algorithm = "auto")

## [1] 2389002

countPattern(TGA,chr11seq,max.mismatch = 0,min.mismatch = 0,with.indels = FALSE,fixed = TRUE,algorithm = "auto")

## [1] 2561021

countPattern(TAA,chr11seq,max.mismatch = 0,min.mismatch = 0,with.indels = FALSE,fixed = TRUE,algorithm = "auto")

## [1] 2624324

countPattern(TAG,chr11seq,max.mismatch = 0,min.mismatch = 0,with.indels = FALSE,fixed = TRUE,algorithm = "auto")

## [1] 1689356

Question 1.2.2

Now we move to a question about chromosome 7. Read the help page for the function alphabetFrequency and use it to determine what percent of chromosome 7 is T,C,G, and A. Note that we have other letters. For example \(N\), which represents positions that are not called-in, appears often.

chr7seq <- BSgenome.Hsapiens.UCSC.hg19[["chr7"]]
alphabetFrequency(chr7seq, as.prob=TRUE)

##          A          C          G          T          M          R 
## 0.28904200 0.19901933 0.19880134 0.28935305 0.00000000 0.00000000 
##          W          S          Y          K          V          H 
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
##          D          B          N          -          +          . 
## 0.00000000 0.00000000 0.02378429 0.00000000 0.00000000 0.00000000

Note that if you give as.prob=FALSE the total frequency will be printed.

Question 1.2.3

As explained in the video, many of the locations on the genome that are different across individual are single nucleotide polymorphisms (SNPs). This information is not on the human genome reference sequence. Instead, this information is stored in databases such as dbSNP. Bioconductor also gives you access to these database via the

package SNPlocs.Hsapiens.dbSNP.20120608. Download and install this package. This is also a large package.

library(BiocInstaller)
biocLite("SNPlocs.Hsapiens.dbSNP.20120608")
library(SNPlocs.Hsapiens.dbSNP.20120608)

# or locally downloaded and install using `install.packages` function

install.packages("C:/Users/Prasanth/Desktop/Harvard Biomedical Data Series -8 Courses/Softwares and packages/SNPlocs.Hsapiens.dbSNP.20120608_0.99.9.zip", repos = NULL, type = "win.binary")

To see all the SNPs on, for example, chromosome 17 we can use the following commands

library(SNPlocs.Hsapiens.dbSNP.20120608)

##   Please note that the SNPlocs.Hsapiens.dbSNP.20120608 package
##   contains outdated dbSNP data and will be deprecated in the near
##   future. We highly recommend that you use a SNPlocs package based
##   on a more recent dbSNP build for your analyses instead. See
##   available.SNPs() for the list of SNPlocs packages currently
##   available and make sure to pick up the most recent one.

s17 = getSNPlocs("ch17")
head(s17)

##   RefSNP_id alleles_as_ambig loc
## 1 145615430                Y  56
## 2 148170422                S  78
## 3 183779916                R  80
## 4 188505217                R 174
## 5   9747578                R 293
## 6 200398583                W 300

The first one listed is rs145615430 which is on location 56.

What is the location on chr17 of SNP rs73971683?

s17$loc[s17$RefSNP_id==73971683]

## [1] 135246

# or using dplyr

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:Biostrings':
## 
##     collapse, intersect, setdiff, setequal, union
## 
## The following object is masked from 'package:XVector':
## 
##     slice
## 
## The following objects are masked from 'package:GenomicRanges':
## 
##     intersect, setdiff, union
## 
## The following object is masked from 'package:GenomeInfoDb':
## 
##     intersect
## 
## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union
## 
## The following objects are masked from 'package:S4Vectors':
## 
##     intersect, rename, setdiff, union
## 
## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

s17 %>% filter(RefSNP_id=="73971683") %>% select(loc)

##      loc
## 1 135246

We have shown how Bioconductor provides resources for studying the human genome sequence as well as SNPs. Bioconductor also provides resources that permit us to obtain information about genes. We will see how these databases can be quite complex. But before learning about these here we present some experimental data.

The driving force behind the formation of the Bioconductor project was the emergence of high throughput measurement of gene expression data. Unlike genome sequence and SNP data, gene expression data varies from cell to cell, from tissue to tissue and from individual to individual. Statistical techniques such as those implemented in R were natural tools to parse out variability and perform statistical inference. Furthermore, the ambitious use of newly invented technologies added issues of measurement error and bias to already difficulty challenge.

Note that in the previous assessments we focused on static database information: genome sequence and SNPs. In previous courses we have seen the ‘tissuesGeneExpression’ data which is experimental data measured with microarrays. If you have not installed it you can do it like this:

library(devtools)
install_github("genomicsclass/tissuesGeneExpression")

You can load the data

library(tissuesGeneExpression)
data(tissuesGeneExpression)
head(e[,1:5])

##           GSM11805.CEL.gz GSM11814.CEL.gz GSM11823.CEL.gz GSM11830.CEL.gz
## 1007_s_at       10.191267       10.509167       10.272027       10.252952
## 1053_at          6.040463        6.696075        6.144663        6.575153
## 117_at           7.447409        7.775354        7.696235        8.478135
## 121_at          12.025042       12.007817       11.633279       11.075286
## 1255_g_at        5.269269        5.180389        5.301714        5.372235
## 1294_at          8.535176        8.587241        8.277414        8.603650
##           GSM12067.CEL.gz
## 1007_s_at       10.157605
## 1053_at          6.606701
## 117_at           8.116336
## 121_at          10.832528
## 1255_g_at        5.334905
## 1294_at          8.303227

table(tissue)

## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

The rows of the matrix e are the features, in this case representing genes, and the columns are the samples. The entries of the matrix are gene expression measurements (in log scale) obtained using a microarray technology.

Question 1.3.1

Look at the data for the feature with ID “209169_at”. You can index the rows of edirectly with this character string.

Which of the following best describes the data? (Hint: stratify data by tissue and create boxplots)

boxplot(e["209169_at",]~tissue,las=2)

# las is the argument for longitudinal alignment for x-axis labels

Question 1.3.2

Below is a vector of 6 IDs which index features of ‘e’:

IDs = c("201884_at", "209169_at", "206269_at", "207437_at", "219832_s_at", "212827_at")

Which of the following ID(s) appear to represent a gene specific to placenta? Be careful when you are picking, to pick the correct name or names. Names often look similar. Also, if you get your guess wrong, you need to uncheck the ones you think are wrong to guess again.

library(rafalib)
mypar(3,2)
sapply(IDs, function(x) boxplot(e[x,]~tissue, las=2))

##       201884_at   209169_at   206269_at   207437_at   219832_s_at
## stats Numeric,35  Numeric,35  Numeric,35  Numeric,35  Numeric,35 
## n     Numeric,7   Numeric,7   Numeric,7   Numeric,7   Numeric,7  
## conf  Numeric,14  Numeric,14  Numeric,14  Numeric,14  Numeric,14 
## out   Numeric,6   Numeric,7   Numeric,6   Numeric,10  Numeric,13 
## group Numeric,6   Numeric,7   Numeric,6   Numeric,10  Numeric,13 
## names Character,7 Character,7 Character,7 Character,7 Character,7
##       212827_at  
## stats Numeric,35 
## n     Numeric,7  
## conf  Numeric,14 
## out   Numeric,8  
## group Numeric,8  
## names Character,7

# you can use a for-loop also instead

for(i in IDs){
 boxplot(e[i,]~tissue,las=2)
}

Question 1.3.3

Note that there is much existing work on gene function and all we have here are identifiers provided by the manufacturer of the machine that makes the measurements. How would we go about finding more information about gene “206269_at” for example? Does it have a known function? Where is it on the genome? What is its sequence? One of the strengths of Bioconductor is that it connects R, an existing comprehensive toolbox for data analysis, with the existing comprehensive databases annotating the genome. We will learn about these powerful resources in this class.

The microarray product used to make the measurements described here is the Affymetirx Human GeneChip HG133A. Search the Bioconductor website and determine which of the following packages provides a connection to gene information:

hgu133a.db

Question 1.3.4

Another powerful aspect of Bioconductor is that it provides object classes specifically designed to keep high throughput data organized. Below we show an example of how the three tables that are needed to conduct data analysis are available from Bioconductor data objects. For example, for gene expression we can use the ExpressionSet object, which we will review in more detail in later weeks.

library(Biobase)

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## the GSE5859 package can be installed like this: 
## library(devtools)
## install_github("genomicsclass/GSE5859")
library(GSE5859)
data(GSE5859)
class(e)

## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"

These objects were originally designed for gene expression data so the methods to extract the high throughput measurements have related names:

dat = exprs(e)
dim(dat)

## [1] 8793  208

The information about samples is also stored in this object and the functions to create it try to guarantee that the columns of exprs(e) match the rows of the sample information table. pData is use as shorthand for phenotype data.

sampleInfo = pData(e)
dim(sampleInfo)

## [1] 208   3

head(sampleInfo)

##   ethnicity       date        filename
## 1       CEU 2003-02-04 GSM25349.CEL.gz
## 2       CEU 2003-02-04 GSM25350.CEL.gz
## 3       CEU 2002-12-17 GSM25356.CEL.gz
## 4       CEU 2003-01-30 GSM25357.CEL.gz
## 5       CEU 2003-01-03 GSM25358.CEL.gz
## 6       CEU 2003-01-16 GSM25359.CEL.gz

A final table, which we will cover in more detail later, is a table that describes the rows, in this case genes. Because each product will have a different table, these have already been created in Bioconductor.

Because there are certain products that are widely used, Bioconductor makes databases available from which you can extract this information. This way, every object which is linked to this information does not have to carry around the table. Later we will learn to write code like this to extract specific information from these tables:

# library(BiocInstaller)
# biocLite("hgfocus.db")
library(hgfocus.db)

## Loading required package: AnnotationDbi
## 
## Attaching package: 'AnnotationDbi'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Loading required package: org.Hs.eg.db
## Loading required package: DBI

annot = select(hgfocus.db, 
               keys=featureNames(e), 
               keytype="PROBEID", 
               columns=c("CHR", "CHRLOC", "SYMBOL"))
## here we pick one column from the annotation
annot = annot[ match(featureNames(e),annot$PROBEID), ]
head(annot)

##      PROBEID CHR     CHRLOC CHRLOCCHR SYMBOL
## 1  1007_s_at   6   30852327         6   DDR1
## 30   1053_at   7  -73645832         7   RFC2
## 31    117_at   1  161494036         1  HSPA6
## 32    121_at   2 -113973574         2   PAX8
## 33 1255_g_at   6   42123144         6 GUCA1A
## 34   1294_at   3  -49842638         3   UBA7

dim(annot)

## [1] 8793    5

Bioconductor_start

Prasanth

Wednesday, August 20, 2015