BIOF439 Final Project

---
title: "BIOF439 Final Project"
author: Lesley M Chapman
output: 
  flexdashboard::flex_dashboard:
    storyboard: true
    social: menu
    source: embed
---

```{r, include=FALSE}
#if (!('airway' %in% installed.packages()[,1])) #install.packages('airway', repos = 'http://cran.rstudio.com')
#if (!('ggbeeswarm' %in% installed.packages()[,1])) #install.packages('ggbeeswarm', repos = #'http://cran.rstudio.com')
#if (!('pheatmap' %in% installed.packages()[,1])) #install.packages('pheatmap', repos = #'http://cran.rstudio.com')
#if (!('PoiClaClu' %in% installed.packages()[,1])) #install.packages('PoiClaClu', repos = #'http://cran.rstudio.com')


#if (!requireNamespace("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")
#BiocManager::install("apeglm")

#if (!requireNamespace("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")
#BiocManager::install("DESeq2")

#if (!requireNamespace("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")
#BiocManager::install("Gviz")

#if (!requireNamespace("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")
#BiocManager::install("IHW")

#if (!requireNamespace("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")
#BiocManager::install("org.Hs.eg.db")

#if (!requireNamespace("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")
#BiocManager::install("vsn")


library("AnnotationDbi")
library("apeglm")
library(airway)
library('DESeq2')
library("dplyr")
library(flexdashboard)
library("genefilter")
library("ggbeeswarm")
library("ggplot2")
library("Gviz")
library("IHW")
library("org.Hs.eg.db")
library("pheatmap")
library("PoiClaClu")
library("RColorBrewer")
library("vsn")

dir <- system.file("extdata", package="airway", mustWork=TRUE)

```


```{r, include=FALSE}
'
Part 1
----------
Load data and create dataframe for analysis
'
# Load sample descriptions
csvfile <- file.path(dir,"sample_table.csv")
(sampleTable <- read.csv(csvfile,row.names=1))

data("airway")
se <- airway
head(se)

dds <- DESeqDataSet(se, design = ~ cell + dex)
head(dds, 5)

countdata <- assay(se)
coldata <- colData(se)
head(countdata, 3)

ddsMat <- DESeqDataSetFromMatrix(countData = countdata,
                                 colData = coldata,
                                 design = ~ cell + dex)

nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)



lambda <- 10^seq(from = -1, to = 2, length = 1000)
cts <- matrix(rpois(1000*100, lambda), ncol = 100)
meanSdPlot(cts, ranks = FALSE)

#logarithm-transformed counts
log.cts.one <- log2(cts + 1)
meanSdPlot(log.cts.one, ranks = FALSE)

# FYI: fully unsupervised transformation blind = TRUE
vsd <- vst(dds, blind = FALSE)
head(assay(vsd), 3)
colData(vsd)
#rlog
rld <- rlog(dds, blind = FALSE)

```

### Project Overview:
 Use visual exploratory data analysis (EDA) techniques to understand the structure and potential bias within a High Throughput Sequencing dataset.

```{r asthma, echo=FALSE, fig.cap="Asthma Overview", out.width = '100%'}
knitr::include_graphics("/Users/chapmanlm/Desktop/FAES/Homework/BIOF439/Project/normal-vs-asthma-lung.png")
```

***

##### Study Overview
* Asthma affects 300+ million people worldwide and is characterized as a chronic inflammtory respiratory disease

* Glucocorticoids have an anti-inflammatory effect on airway smooth muscle cells, and used as a therapy for asthma 

* Glucocorticoids act by binding to glucocorticoid receptors which then translocate to cellular nuclei and modulate gene expression

* The following describes transcriptomic profiling of cells treated with dexamethasone - a potent synthetic glucocorticoid

* Project re-analyzes RNA-seq data from 4 airway smooth muscle (ASM) cell lines:
  * N61311
  * N052611
  * N080611
  * N061011


**Data Source**

Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS One. 2014 Jun 13;9(6):e99625. PMID: 24926665. GEO: GSE52778.

[Airways Dataset](https://bioconnector.github.io/workshops/data.html)

[Asthma Image Source](https://community.aafa.org/blog/what-happens-in-your-airways-when-you-have-asthma)


### **Problem Description:** 
A comparison of different read count normalization methods 
```{r, include=FALSE}
dds <- estimateSizeFactors(dds)
df <- bind_rows(
  as_data_frame(log2(counts(dds, normalized=TRUE)[, 1:2]+1)) %>%
    mutate(transformation = "log2(x + 1)"),
  as_data_frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"),
  as_data_frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"))

colnames(df)[1:2] <- c("x", "y")  
```


```{r}
ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
  coord_fixed() + facet_grid( . ~ transformation)  
```

***
- Technical biases may lead to erroneous conclusions in downstream analysis

- Certain samples may have been sequenced at different depths

- The estimate size factors method was used to make comparison of gene counts more comparable
 - The following normalization methods were then applied:
 - log2 transformation of normalized counts
 - variance stabilizing transformation(VST) transformation of normalized counts
- rlog transofrmation of normalized counts


### **Raw Data Description:**
Similarities between cell line samples pre- and post-treatment were found based on similarities in gene expression.


```{r, include=FALSE}
dds <- DESeq(dds)
# results : extracts the estimated log2 fold changes and p values for the last variable in the design formula.
res <- results(dds)
res <- results(dds, contrast=c("dex","trt","untrt"))
mcols(res, use.names = TRUE)
summary(res)
res
# Only take differentially expressed genes with FDR < 0.05
res.05 <- results(dds, alpha = 0.05)
table(res.05$padj < 0.05)

results(dds, contrast = c("cell", "N061011", "N61311"))
topGene <- rownames(res)[which.min(res$padj)]
geneCounts <- plotCounts(dds, gene = topGene, intgroup = c("dex","cell"),
                         returnData = TRUE)
```


```{r, include=FALSE}
sampleDists <- dist(t(assay(vsd)))
sampleDists

sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( vsd$dex, vsd$cell, sep = " - " )
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
```

```{r}
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         col = colors)
```

***
**Data Description**
- Roughly 20,000 genes were analyzed
- Illumina sequencing results
- Fragment counts were analyzed for reads mapped to Chromosome 1 were analyzed

**Genes Analyzed**
- Many genes were found to have differential expression pre- and post-dexamethasone treatment

- The potential false discovery rate was accounted for by applying the Benjamini-Hochberg correction to the values and plotting differentially expressed genes with p <0.05

- 316 genes were significantly differentially expressed after correcting for false discovery rate by the Benjamini-Hochberg

- The Euclidead distance between samples



### **Major Result**: Evaluation of Genes with highest variance in expression

```{r}
topVarGenes <- head(order(rowVars(assay(vsd)), decreasing = TRUE), 20)

mat  <- assay(vsd)[ topVarGenes, ]
mat  <- mat - rowMeans(mat)
anno <- as.data.frame(colData(vsd)[, c("cell","dex")])
pheatmap(mat, annotation_col = anno)
```

***

- Evaluation of the top 20 genes with the highest variance

- Heatmap reflects the amount by which each gene deviates 
in a specific sample from the gene’s average across all samples


### **Manual Inspection**: 
Select Sample comparison
```{r, include=FALSE}
# Only take differentially expressed genes with FDR < 0.05
res.05 <- results(dds, alpha = 0.05)
table(res.05$padj < 0.05)

results(dds, contrast = c("cell", "N061011", "N61311"))

topGene <- rownames(res)[which.min(res$padj)]
plotCounts(dds, gene = topGene, intgroup=c("dex"))
```


```{r}
geneCounts <- plotCounts(dds, gene = topGene, intgroup = c("dex","cell"),
                         returnData = TRUE)
ggplot(geneCounts, aes(x = dex, y = count, color = cell)) +
  scale_y_log10() +  geom_beeswarm(cex = 3)
```

***

Project Overview:
Use visual exploratory data analysis (EDA) techniques to understand the structure and potential bias within a High Throughput Sequencing dataset.

Study Overview

Problem Description:
A comparison of different read count normalization methods

Raw Data Description:
Similarities between cell line samples pre- and post-treatment were found based on similarities in gene expression.

Major Result: Evaluation of Genes with highest variance in expression

Manual Inspection:
Select Sample comparison

Project Overview: Use visual exploratory data analysis (EDA) techniques to understand the structure and potential bias within a High Throughput Sequencing dataset.

Study Overview

Problem Description: A comparison of different read count normalization methods

Raw Data Description:Similarities between cell line samples pre- and post-treatment were found based on similarities in gene expression.

Major Result: Evaluation of Genes with highest variance in expression

Manual Inspection: Select Sample comparison

Project Overview:
Use visual exploratory data analysis (EDA) techniques to understand the structure and potential bias within a High Throughput Sequencing dataset.

Problem Description:
A comparison of different read count normalization methods

Raw Data Description:
Similarities between cell line samples pre- and post-treatment were found based on similarities in gene expression.

Manual Inspection:
Select Sample comparison