source("http://www.bioconductor.org/biocLite.R")
## Bioconductor version 3.0 (BiocInstaller 1.16.5), ?biocLite for help
## A new version of Bioconductor is available after installing the most
##   recent version of R; see http://bioconductor.org/install
biocLite()
## BioC_mirror: http://bioconductor.org
## Using Bioconductor version 3.0 (BiocInstaller 1.16.5), R version 3.1.2.
## Old packages: 'boot', 'class', 'cluster', 'codetools', 'foreign',
##   'KernSmooth', 'lattice', 'MASS', 'Matrix', 'mgcv', 'nlme', 'nnet',
##   'rpart', 'spatial', 'survival'

In human genome reference build hg19, what is the length of chromosome 16?

library(BSgenome.Hsapiens.UCSC.hg19)
## Loading required package: BSgenome
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following object is masked from 'package:stats':
## 
##     xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Warning: package 'GenomeInfoDb' was built under R version 3.1.3
## Loading required package: GenomicRanges
## Loading required package: Biostrings
## Loading required package: XVector
## Loading required package: rtracklayer
## Warning: package 'rtracklayer' was built under R version 3.1.3
nchar(Hsapiens$chr16)
## [1] 90354753

Genefu package is an “R package providing various functions relevant for gene expression analysis with emphasis on breast cancer”.

biocLite(c("genefu", "COPDSexualDimorphism.data"))
## BioC_mirror: http://bioconductor.org
## Using Bioconductor version 3.0 (BiocInstaller 1.16.5), R version 3.1.2.
## Installing package(s) 'genefu' 'COPDSexualDimorphism.data'
## package 'genefu' successfully unpacked and MD5 sums checked
## package 'COPDSexualDimorphism.data' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Library\AppData\Local\Temp\Rtmpk5rTvI\downloaded_packages
## Old packages: 'boot', 'class', 'cluster', 'codetools', 'foreign',
##   'KernSmooth', 'lattice', 'MASS', 'Matrix', 'mgcv', 'nlme', 'nnet',
##   'rpart', 'spatial', 'survival'

A data.frame with information on the 70 gene signature used in the mammaprint algorithm is in the sig.gene70 data.frame.

library(genefu)
## Loading required package: survcomp
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.1.3
## Loading required package: prodlim
## Warning: package 'prodlim' was built under R version 3.1.3
## Loading required package: mclust
## Warning: package 'mclust' was built under R version 3.1.3
## Package 'mclust' version 5.1
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## The following object is masked from 'package:GenomicRanges':
## 
##     map
## The following object is masked from 'package:IRanges':
## 
##     map
## Loading required package: biomaRt
data(sig.gene70)
dim(sig.gene70)
## [1] 70  9
head(sig.gene70)[,1:6]
##                         probe correlation average.good.prognosis.profile
## NM_003748           NM_003748   -0.420671                     0.12350000
## NM_003862           NM_003862   -0.410964                     0.05159091
## Contig32125_RC Contig32125_RC   -0.409054                     0.05409091
## U82987                 U82987   -0.407002                     0.06150000
## AB037863             AB037863   -0.402335                     0.06334091
## NM_020974           NM_020974   -0.399987                    -0.06231818
##                EntrezGene.ID NCBI.gene.symbol HUGO.gene.symbol
## NM_003748               8659          ALDH4A1          ALDH4A1
## NM_003862               8817            FGF18            FGF18
## Contig32125_RC            NA             <NA>             <NA>
## U82987                 27113             BBC3             BBC3
## AB037863                  NA             <NA>             <NA>
## NM_020974              57758           SCUBE2           SCUBE2

How many components of the signature have a missing value for the associated NCBI gene symbol?

sum(is.na(sig.gene70$NCBI.gene.symbol))
## [1] 14

Phenotypic Variation

Part of the explanation for phenotypic differences come from the genome. The code to create proteins, receptors or other molecules is included in genome. DNA is firstly transcribed into mRNA which leaves the nucleus and then translated into proteins.

library(COPDSexualDimorphism.data)
data(lgrc.expr.meta)

The variable pkyrs in the expr.meta data.frame represents pack years smoked. Other variables include gender (self-explanatory) and diagmaj (disease status).

What is the number of female participants in this study?

table(expr.meta$gender)
## 
##   1-Male 2-Female 
##      119      110

What is the median of the distribution of pack years smoked in this cohort (women and men)?

summary(expr.meta$pkyrs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   15.00   40.00   44.17   60.00  212.00

True or False: The distribution of pack-years smoked is well-approximated by a Gaussian (Normal) probability distribution.

qqnorm(expr.meta$pkyrs)

False: The substantial number of zero values renders a Gaussian model for pack years formally implausible.

boxplot(pkyrs~gender, data=expr.meta)

Distributions appear quite asymmetric, with long tails skewed towards high values which suggest caution in using t-test in comparing males and females with respect to pack years smoked.

expr.meta$pyp1 <- expr.meta$pkyrs+1 # to define a positive-valued variable for transformation analysis

#install.packages("MASS")
library(MASS)
## Warning: package 'MASS' was built under R version 3.1.3
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:biomaRt':
## 
##     select
lm1 <- lm(pyp1~gender, data=expr.meta)

A plot of the likelihood function for a transformation model:

boxcox(lm1)

For what value of lambda does the likelihood reach its highest value for the model lm1? Obtained from the plot: 0.5

lambda <- 0.5
par(mfrow=c(1,2))
boxplot(I(pyp1^lambda)~gender, data=expr.meta)
boxplot(pkyrs~gender, data=expr.meta)

Transformations with similar intent will be important in various aspects of statistical analysis of genome-scale data.Transformed data has more symmetric structure and less outliers.