In the video we talk about phenotypes. Here we show some examples of what we mean by phenotypes, how they can be coded in R objects, and how we compute with them. Later in the course we will perform analyses that statistically connect these phenotypes to measured molecular outcomes. Here we explore the use of data frames to store phenoypes (columns) from several individuals (rows).
Install and attach the COPDSexualDimorphism.data package using biocLite.
source("http://www.bioconductor.org/biocLite.R")
biocLite("COPDSexualDimorphism.data")
Use the commands
library(COPDSexualDimorphism.data)
data(lgrc.expr.meta)
to add the object expr.meta to your workspace. The variable pkyrs in the expr.meta data.frame represents pack years smoked. Other variables include gender (self-explanatory) and diagmaj (disease status).
What is the number of female participants in this study?:
table(expr.meta$gender)
##
## 1-Male 2-Female
## 119 110
What is the median of the distribution of pack years smoked in this cohort (women and men)?
median(expr.meta$pkyrs, na.rm=TRUE)
## [1] 40
summary(expr.meta$pkyrs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 15.00 40.00 44.17 60.00 212.00
True or False: The distribution of pack-years smoked is well-approximated by a Gaussian (Normal) probability distribution.
library(rafalib)
mypar(1,2)
hist(expr.meta$pkyrs)
qqnorm(expr.meta$pkyrs)
FALSE. The substantial number of zero values renders a Gaussian model for pack years formally impossible.
As a result of the human genome project sequenced we have the consensus sequence of all human chromsomes, as well as several other species. We say consensus sequence because every individual has a different sequence. But well over 99% is the same.
Suppose you want to ask a questions such as: how many times does the sequence “ATG” appear on chromosome 11 ? Or what are the percentage of A,T,C and G on chromosome 7?
We can answer such question using Bioconductor tools. The human genome sequence is provided in the BSgenome.Hsapiens.UCSC.hg19 package. If you have not done so already please donwload and install this package. Note that it encodes 3 billion bases and is therefore a large package (over 800MB) so make time to download it especially if you have a slow internet connection.
library(BiocInstaller)
biocLite("checkmate")
biocLite("rtracklayer")
biocLite("BSgenome")
biocLite("BSgenome.Hsapiens.UCSC.hg19")
Then load the package and note that you now have access to sequence information.
# install.packages("~/R/BSgenome.Hsapiens.UCSC.hg19_1.4.0.zip", repos = NULL, type = "win.binary")
# install.packages("~/R/BSgenome_1.34.1.zip", repos = NULL, type = "win.binary")
library(BSgenome.Hsapiens.UCSC.hg19)
## Loading required package: BSgenome
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unlist, unsplit
##
## Loading required package: S4Vectors
## Loading required package: stats4
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: Biostrings
## Loading required package: XVector
## Loading required package: rtracklayer
BSgenome.Hsapiens.UCSC.hg19
## Human genome:
## # organism: Homo sapiens (Human)
## # provider: UCSC
## # provider version: hg19
## # release date: Feb. 2009
## # release name: Genome Reference Consortium GRCh37
## # 93 sequences:
## # chr1 chr2 chr3
## # chr4 chr5 chr6
## # chr7 chr8 chr9
## # chr10 chr11 chr12
## # chr13 chr14 chr15
## # ... ... ...
## # chrUn_gl000235 chrUn_gl000236 chrUn_gl000237
## # chrUn_gl000238 chrUn_gl000239 chrUn_gl000240
## # chrUn_gl000241 chrUn_gl000242 chrUn_gl000243
## # chrUn_gl000244 chrUn_gl000245 chrUn_gl000246
## # chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)
Note this divided into chromosomes and includes several unmapped. We will learn to use this type of object. We can access chromosome 11 like this.
chr11seq <- BSgenome.Hsapiens.UCSC.hg19[["chr11"]]
Here, for example, is a segment of 25 bases starting at base 1 million
subseq(chr11seq,start = 10^6, width = 25)
## 25-letter "DNAString" instance
## seq: GTTTCACCGTGTTAGCCAGGGTGGT
Read the help file for the function countPattern and tell us which of the following sequences is most common on chromosome 11: “ATG”,“TGA”,“TAA”,“TAG”.
# Simple method
seqs = c("ATG", "TGA", "TAA","TAG")
n =sapply(seqs, function(x) countPattern(x,chr11seq))
which.max(n)
## TAA
## 3
# Complex method
ATG <- BString(x="ATG",start=1, nchar=NA)
TGA <- BString(x="TGA", start=1,nchar=NA)
TAA <- BString(x="TAA", start = 1, nchar = NA)
TAG <- BString(x="TAG",start = 1, nchar = NA)
countPattern(ATG,chr11seq,max.mismatch = 0,min.mismatch = 0,with.indels = FALSE,fixed = TRUE,algorithm = "auto")
## [1] 2389002
countPattern(TGA,chr11seq,max.mismatch = 0,min.mismatch = 0,with.indels = FALSE,fixed = TRUE,algorithm = "auto")
## [1] 2561021
countPattern(TAA,chr11seq,max.mismatch = 0,min.mismatch = 0,with.indels = FALSE,fixed = TRUE,algorithm = "auto")
## [1] 2624324
countPattern(TAG,chr11seq,max.mismatch = 0,min.mismatch = 0,with.indels = FALSE,fixed = TRUE,algorithm = "auto")
## [1] 1689356
Now we move to a question about chromosome 7. Read the help page for the function alphabetFrequency and use it to determine what percent of chromosome 7 is T,C,G, and A. Note that we have other letters. For example \(N\), which represents positions that are not called-in, appears often.
chr7seq <- BSgenome.Hsapiens.UCSC.hg19[["chr7"]]
alphabetFrequency(chr7seq, as.prob=TRUE)
## A C G T M R
## 0.28904200 0.19901933 0.19880134 0.28935305 0.00000000 0.00000000
## W S Y K V H
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## D B N - + .
## 0.00000000 0.00000000 0.02378429 0.00000000 0.00000000 0.00000000
Note that if you give as.prob=FALSE the total frequency will be printed.
As explained in the video, many of the locations on the genome that are different across individual are single nucleotide polymorphisms (SNPs). This information is not on the human genome reference sequence. Instead, this information is stored in databases such as dbSNP. Bioconductor also gives you access to these database via the
package SNPlocs.Hsapiens.dbSNP.20120608. Download and install this package. This is also a large package.
library(BiocInstaller)
biocLite("SNPlocs.Hsapiens.dbSNP.20120608")
library(SNPlocs.Hsapiens.dbSNP.20120608)
# or locally downloaded and install using `install.packages` function
install.packages("C:/Users/Prasanth/Desktop/Harvard Biomedical Data Series -8 Courses/Softwares and packages/SNPlocs.Hsapiens.dbSNP.20120608_0.99.9.zip", repos = NULL, type = "win.binary")
To see all the SNPs on, for example, chromosome 17 we can use the following commands
library(SNPlocs.Hsapiens.dbSNP.20120608)
## Please note that the SNPlocs.Hsapiens.dbSNP.20120608 package
## contains outdated dbSNP data and will be deprecated in the near
## future. We highly recommend that you use a SNPlocs package based
## on a more recent dbSNP build for your analyses instead. See
## available.SNPs() for the list of SNPlocs packages currently
## available and make sure to pick up the most recent one.
s17 = getSNPlocs("ch17")
head(s17)
## RefSNP_id alleles_as_ambig loc
## 1 145615430 Y 56
## 2 148170422 S 78
## 3 183779916 R 80
## 4 188505217 R 174
## 5 9747578 R 293
## 6 200398583 W 300
The first one listed is rs145615430 which is on location 56.
What is the location on chr17 of SNP rs73971683?
s17$loc[s17$RefSNP_id==73971683]
## [1] 135246
# or using dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:Biostrings':
##
## collapse, intersect, setdiff, setequal, union
##
## The following object is masked from 'package:XVector':
##
## slice
##
## The following objects are masked from 'package:GenomicRanges':
##
## intersect, setdiff, union
##
## The following object is masked from 'package:GenomeInfoDb':
##
## intersect
##
## The following objects are masked from 'package:IRanges':
##
## collapse, desc, intersect, setdiff, slice, union
##
## The following objects are masked from 'package:S4Vectors':
##
## intersect, rename, setdiff, union
##
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, union
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
s17 %>% filter(RefSNP_id=="73971683") %>% select(loc)
## loc
## 1 135246
We have shown how Bioconductor provides resources for studying the human genome sequence as well as SNPs. Bioconductor also provides resources that permit us to obtain information about genes. We will see how these databases can be quite complex. But before learning about these here we present some experimental data.
The driving force behind the formation of the Bioconductor project was the emergence of high throughput measurement of gene expression data. Unlike genome sequence and SNP data, gene expression data varies from cell to cell, from tissue to tissue and from individual to individual. Statistical techniques such as those implemented in R were natural tools to parse out variability and perform statistical inference. Furthermore, the ambitious use of newly invented technologies added issues of measurement error and bias to already difficulty challenge.
Note that in the previous assessments we focused on static database information: genome sequence and SNPs. In previous courses we have seen the ‘tissuesGeneExpression’ data which is experimental data measured with microarrays. If you have not installed it you can do it like this:
library(devtools)
install_github("genomicsclass/tissuesGeneExpression")
You can load the data
library(tissuesGeneExpression)
data(tissuesGeneExpression)
head(e[,1:5])
## GSM11805.CEL.gz GSM11814.CEL.gz GSM11823.CEL.gz GSM11830.CEL.gz
## 1007_s_at 10.191267 10.509167 10.272027 10.252952
## 1053_at 6.040463 6.696075 6.144663 6.575153
## 117_at 7.447409 7.775354 7.696235 8.478135
## 121_at 12.025042 12.007817 11.633279 11.075286
## 1255_g_at 5.269269 5.180389 5.301714 5.372235
## 1294_at 8.535176 8.587241 8.277414 8.603650
## GSM12067.CEL.gz
## 1007_s_at 10.157605
## 1053_at 6.606701
## 117_at 8.116336
## 121_at 10.832528
## 1255_g_at 5.334905
## 1294_at 8.303227
table(tissue)
## tissue
## cerebellum colon endometrium hippocampus kidney liver
## 38 34 15 31 39 26
## placenta
## 6
The rows of the matrix e are the features, in this case representing genes, and the columns are the samples. The entries of the matrix are gene expression measurements (in log scale) obtained using a microarray technology.
Look at the data for the feature with ID “209169_at”. You can index the rows of edirectly with this character string.
Which of the following best describes the data? (Hint: stratify data by tissue and create boxplots)
boxplot(e["209169_at",]~tissue,las=2)
# las is the argument for longitudinal alignment for x-axis labels
Below is a vector of 6 IDs which index features of ‘e’:
IDs = c("201884_at", "209169_at", "206269_at", "207437_at", "219832_s_at", "212827_at")
Which of the following ID(s) appear to represent a gene specific to placenta? Be careful when you are picking, to pick the correct name or names. Names often look similar. Also, if you get your guess wrong, you need to uncheck the ones you think are wrong to guess again.
library(rafalib)
mypar(3,2)
sapply(IDs, function(x) boxplot(e[x,]~tissue, las=2))
## 201884_at 209169_at 206269_at 207437_at 219832_s_at
## stats Numeric,35 Numeric,35 Numeric,35 Numeric,35 Numeric,35
## n Numeric,7 Numeric,7 Numeric,7 Numeric,7 Numeric,7
## conf Numeric,14 Numeric,14 Numeric,14 Numeric,14 Numeric,14
## out Numeric,6 Numeric,7 Numeric,6 Numeric,10 Numeric,13
## group Numeric,6 Numeric,7 Numeric,6 Numeric,10 Numeric,13
## names Character,7 Character,7 Character,7 Character,7 Character,7
## 212827_at
## stats Numeric,35
## n Numeric,7
## conf Numeric,14
## out Numeric,8
## group Numeric,8
## names Character,7
# you can use a for-loop also instead
for(i in IDs){
boxplot(e[i,]~tissue,las=2)
}
Note that there is much existing work on gene function and all we have here are identifiers provided by the manufacturer of the machine that makes the measurements. How would we go about finding more information about gene “206269_at” for example? Does it have a known function? Where is it on the genome? What is its sequence? One of the strengths of Bioconductor is that it connects R, an existing comprehensive toolbox for data analysis, with the existing comprehensive databases annotating the genome. We will learn about these powerful resources in this class.
The microarray product used to make the measurements described here is the Affymetirx Human GeneChip HG133A. Search the Bioconductor website and determine which of the following packages provides a connection to gene information:
hgu133a.db
Another powerful aspect of Bioconductor is that it provides object classes specifically designed to keep high throughput data organized. Below we show an example of how the three tables that are needed to conduct data analysis are available from Bioconductor data objects. For example, for gene expression we can use the ExpressionSet object, which we will review in more detail in later weeks.
library(Biobase)
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## the GSE5859 package can be installed like this:
## library(devtools)
## install_github("genomicsclass/GSE5859")
library(GSE5859)
data(GSE5859)
class(e)
## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"
These objects were originally designed for gene expression data so the methods to extract the high throughput measurements have related names:
dat = exprs(e)
dim(dat)
## [1] 8793 208
The information about samples is also stored in this object and the functions to create it try to guarantee that the columns of exprs(e) match the rows of the sample information table. pData is use as shorthand for phenotype data.
sampleInfo = pData(e)
dim(sampleInfo)
## [1] 208 3
head(sampleInfo)
## ethnicity date filename
## 1 CEU 2003-02-04 GSM25349.CEL.gz
## 2 CEU 2003-02-04 GSM25350.CEL.gz
## 3 CEU 2002-12-17 GSM25356.CEL.gz
## 4 CEU 2003-01-30 GSM25357.CEL.gz
## 5 CEU 2003-01-03 GSM25358.CEL.gz
## 6 CEU 2003-01-16 GSM25359.CEL.gz
A final table, which we will cover in more detail later, is a table that describes the rows, in this case genes. Because each product will have a different table, these have already been created in Bioconductor.
Because there are certain products that are widely used, Bioconductor makes databases available from which you can extract this information. This way, every object which is linked to this information does not have to carry around the table. Later we will learn to write code like this to extract specific information from these tables:
# library(BiocInstaller)
# biocLite("hgfocus.db")
library(hgfocus.db)
## Loading required package: AnnotationDbi
##
## Attaching package: 'AnnotationDbi'
##
## The following object is masked from 'package:dplyr':
##
## select
##
## Loading required package: org.Hs.eg.db
## Loading required package: DBI
annot = select(hgfocus.db,
keys=featureNames(e),
keytype="PROBEID",
columns=c("CHR", "CHRLOC", "SYMBOL"))
## here we pick one column from the annotation
annot = annot[ match(featureNames(e),annot$PROBEID), ]
head(annot)
## PROBEID CHR CHRLOC CHRLOCCHR SYMBOL
## 1 1007_s_at 6 30852327 6 DDR1
## 30 1053_at 7 -73645832 7 RFC2
## 31 117_at 1 161494036 1 HSPA6
## 32 121_at 2 -113973574 2 PAX8
## 33 1255_g_at 6 42123144 6 GUCA1A
## 34 1294_at 3 -49842638 3 UBA7
dim(annot)
## [1] 8793 5