BMEG310 2021W - Assignment 4

Instructions

Please fork this to your repo (or copy it there). Make sure your repo is private, but shared with the course admin. Complete the assignment within your forked version of this file. Please start early, because this assignment is challenging and if you wait until the last moment, you will not be able to complete it.

Group:

Group members:

Assignment due date: November 30, 2021, 11:59PM

Introduction

Epstein-Barr Virus (EBV) is a herpes virus and the cause of mononucleosis. Like other herpes viruses, infected cells can remain in a latent phase in the infected person their entire lives. B-cells are the usual target of EBV. When infected, B-cells can become immortalized, and can be cultured in vitro as Lymphoblastoid Cell Lines (LCLs). Like some other viruses, this immortalization can predispose the individual to certain types of cancer. For instance, Hodgkin’s Lymphoma is derived from EBV-infected B cells.

Today, we’re going to look at gene expression data to see how EBV infection changes the transcriptome. We will be looking at B-cell and LCL RNA-seq data from 5 donors, where LCLs were made from each of the donor’s B-cells. Please start from this file and add code/code blocks/and commentary as needed.

Load data

Load needed libraries (install any you don’t yet have - edgeR can be installed through BioConductor)

## Loading required package: limma

## Loading required package: carData

Load data:

rnaSeqData =  read.table(textConnection(readLines(gzcon(url(
  "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE126nnn/GSE126379/suppl/GSE126379_BLCL_processed.txt.gz")))), 
  sep="\t", stringsAsFactors = FALSE, header = TRUE)

Reshape and re-label data to make it more usable.

#melt the data frame so that there is one row for every gene-sample-quantification.type (see below) combination
rnaSeqDataMelted = melt(rnaSeqData, id.vars=c("Chromosome", "Gene.Name", "Gene.ID", "Sum.of.Exon.Length.of.Gene"))
#label every value as a read count, but...
rnaSeqDataMelted$quantification.type = "counts"; 
#relabel those that are really RPKM
rnaSeqDataMelted$quantification.type[grepl(rnaSeqDataMelted$variable,pattern = "RPKM")] = "RPKM"; 
#get the donor (human) ID
rnaSeqDataMelted$donor = gsub(".*BLCL_([ABCDE])([12]).*", "\\1", rnaSeqDataMelted$variable)
#Get the identifier they use to distinguish LCLs from B-cells
rnaSeqDataMelted$celltype = gsub(".*BLCL_([ABCDE])([12]).*", "\\2", rnaSeqDataMelted$variable)
#They didn't label the cell types clearly, so we're going to guess their assignment and verify below
rnaSeqDataMelted$celltype[rnaSeqDataMelted$celltype=="1"]="B"
rnaSeqDataMelted$celltype[rnaSeqDataMelted$celltype=="2"]="LCL"
#Remove this, since we have already parsed out the needed info 
rnaSeqDataMelted$variable=NULL;
rnaSeqDataMelted$Gene.Name=NULL; #we only use Gene.ID
rnaSeqDataMelted$sampleID = paste(rnaSeqDataMelted$celltype, rnaSeqDataMelted$donor, sep="_")
#divide data into read counts and RPKM
rnaSeqDataMeltedCounts = rnaSeqDataMelted[rnaSeqDataMelted$quantification.type=="counts",]
rnaSeqDataMeltedRPKM = rnaSeqDataMelted[rnaSeqDataMelted$quantification.type=="RPKM",]
#relabel "value" as appropriate, and remove the 'quantification.type' column
rnaSeqDataMeltedCounts$counts=rnaSeqDataMeltedCounts$value; 
rnaSeqDataMeltedCounts$value=NULL; 
rnaSeqDataMeltedCounts$quantification.type=NULL;
rnaSeqDataMeltedRPKM$RPKM=rnaSeqDataMeltedRPKM$value; 
rnaSeqDataMeltedRPKM$value=NULL; 
rnaSeqDataMeltedRPKM$quantification.type=NULL;
rm('rnaSeqDataMelted') # remove

Here, rnaSeqDataMeltedRPKM contains the gene expression “Reads per kilobase of transcript per million reads”, and rnaSeqDataMeltedCounts contains the raw count data.

Questions

Question 1

It is always important to perform Quality Control on your data.

Question 1a (2 pts)

For some reason, the authors labeled the different cell types “1” and “2” in this table. We need to figure out which is which. You know that EBV induces the NF-kappaB pathway, and that the TNF gene should be highly induced. Make a graph that shows the relative gene expression levels of TNF (y axis) for both cell types (x axis), for each donor (colour or “fill”).

Enter your answer below:

gene <- "TNF"

TNFgene <- rnaSeqDataMeltedRPKM[rnaSeqDataMeltedRPKM$Gene.ID == gene,]

ggplot(TNFgene, aes(x=celltype, y = RPKM, fill = donor)) + geom_bar(stat = "identity", position = position_dodge())

Explain what you see. Are the celltype labels correct?

#The bar graph shows the highest RPKM values for donor c, and similar RPKM values for the rest of the donors. The two different cell types are labeled on the x axis and the celltype B and LCL are flipped, the higher RPKM values correalte with the LCL cell type.

print("This code below is switching the LCL and the B in order to correct the labelling as it was originally flipped.")

## [1] "This code below is switching the LCL and the B in order to correct the labelling as it was originally flipped."

#melt the data frame so that there is one row for every gene-sample-quantification.type (see below) combination
rnaSeqDataMelted = melt(rnaSeqData, id.vars=c("Chromosome", "Gene.Name", "Gene.ID", "Sum.of.Exon.Length.of.Gene"))
#label every value as a read count, but...
rnaSeqDataMelted$quantification.type = "counts"; 
#relabel those that are really RPKM
rnaSeqDataMelted$quantification.type[grepl(rnaSeqDataMelted$variable,pattern = "RPKM")] = "RPKM"; 
#get the donor (human) ID
rnaSeqDataMelted$donor = gsub(".*BLCL_([ABCDE])([12]).*", "\\1", rnaSeqDataMelted$variable)
#Get the identifier they use to distinguish LCLs from B-cells
rnaSeqDataMelted$celltype = gsub(".*BLCL_([ABCDE])([12]).*", "\\2", rnaSeqDataMelted$variable)
#They didn't label the cell types clearly, so we're going to guess their assignment and verify below
rnaSeqDataMelted$celltype[rnaSeqDataMelted$celltype=="1"]="LCL"
rnaSeqDataMelted$celltype[rnaSeqDataMelted$celltype=="2"]="B"
#Remove this, since we have already parsed out the needed info 
rnaSeqDataMelted$variable=NULL;
rnaSeqDataMelted$Gene.Name=NULL; #we only use Gene.ID
rnaSeqDataMelted$sampleID = paste(rnaSeqDataMelted$celltype, rnaSeqDataMelted$donor, sep="_")
#divide data into read counts and RPKM
rnaSeqDataMeltedCounts = rnaSeqDataMelted[rnaSeqDataMelted$quantification.type=="counts",]
rnaSeqDataMeltedRPKM = rnaSeqDataMelted[rnaSeqDataMelted$quantification.type=="RPKM",]
#relabel "value" as appropriate, and remove the 'quantification.type' column
rnaSeqDataMeltedCounts$counts=rnaSeqDataMeltedCounts$value; 
rnaSeqDataMeltedCounts$value=NULL; 
rnaSeqDataMeltedCounts$quantification.type=NULL;
rnaSeqDataMeltedRPKM$RPKM=rnaSeqDataMeltedRPKM$value; 
rnaSeqDataMeltedRPKM$value=NULL; 
rnaSeqDataMeltedRPKM$quantification.type=NULL;
rm('rnaSeqDataMelted') # remove

Question 1b ( 2pts)

With RNA-seq data, it is important both that samples have sufficient coverage, and that the samples have similar coverage. Either case can lead to underpowered analysis, or misleading results. Calcualte the read coverage for each sample. Make a plot with read coverage on the y-axis (total number of reads) and the samples on the x-axis.

ggplot(rnaSeqDataMeltedCounts, aes(x=sampleID, y = counts, fill=sampleID)) +
geom_col(position = position_dodge())

Which sample has the least coverage?

print("LCL_E")

## [1] "LCL_E"

#B_E has the least coverage

Which sample has the most?

print("B_B")

## [1] "B_B"

#LCL_B has the most coverage

What is the % difference between the max and min (relative to the min)?

B_B = rnaSeqDataMeltedCounts$counts[which(rnaSeqDataMeltedCounts$sampleID == "B_B")]

LCL_E = rnaSeqDataMeltedCounts$counts[which(rnaSeqDataMeltedCounts$sampleID == "LCL_E")]

maxBB = max(B_B)
maxLCLE = max(LCL_E)

percent_difference = (maxBB - maxLCLE)/(maxLCLE) * 100

print(percent_difference)

## [1] 335.3347

In cases where samples have vastly different coverage, you can potentially down-sample the higher-coverage samples. Sometimes, throwing out the data in this way can also introduce new problems, so we’re going to stick with the data we have. What is more important in our analysis is that the LCLs and B-cells have similar coverage.

Question 1c ( 4.5 pts)

One easy way to compare the samples, is to calculate the pairwise correlation between them. In general, we do this in log gene expression space.

First we need to create a RPKM matrix. You can use the reshape package to change the shape of a data.frame. Below is an example for how to reshape our RPKM data.frame into a matrix of genes by samples.

# Here, we need fun.aggregate because there are duplicate gene names. We'll just add everything up for each gene name.
rpkmMatrix = cast(rnaSeqDataMeltedRPKM, Gene.ID ~ sampleID, value="RPKM", fun.aggregate="sum")
#now convert this to a matrix
row.names(rpkmMatrix) = rpkmMatrix$Gene.ID
rpkmMatrix$Gene.ID=NULL;
rpkmMatrix= as.matrix(rpkmMatrix)
head(rpkmMatrix)

##                      B_A         B_B        B_C         B_D         B_E
## 1/2-SBSRNA4 19.352915310 15.99571042 27.1820512 21.61277484 31.71742222
## A1BG         4.302521114  3.44955682  3.0061274  1.74767321  3.14328570
## A1BG-AS1     5.752198035  3.86229662  5.0845046  2.50873340  5.84991498
## A1CF         0.009967274  0.02185651  0.1225670  0.04834247  0.04338079
## A2LD1        1.537710375  1.66803075  1.1338164  1.20598716  1.11882550
## A2M          0.000000000  0.00000000  0.1140946  0.00000000  0.00000000
##                 LCL_A      LCL_B     LCL_C      LCL_D      LCL_E
## 1/2-SBSRNA4 4.2774210 2.67034568 3.5642703 4.27670517 6.07668480
## A1BG        2.7449356 7.44206031 8.7141524 6.54710043 9.28019191
## A1BG-AS1    3.1861872 3.65645724 6.5173728 2.78618307 7.57017987
## A1CF        0.0113048 0.01277064 0.0000000 0.02147553 0.01387007
## A2LD1       2.1522430 2.55707183 2.7325435 4.33530178 2.82273836
## A2M         8.3692316 0.20922668 0.3068694 0.30786211 0.02840488

Calculate the pairwise Pearson correlations between samples. Use log(RPKM+1) to calculate the correlations. Plot the correlations as a heatmap.

pear_corr<-cor(log(rpkmMatrix+1), method = "pearson")
heatmap(pear_corr, symm = TRUE)

How similar are the B-cell samples to each other?

#The B-cell samples shown in the heatmap have very similar characterstics because the gradient color for each is a similar shade of red and also a dark color, the darker the color the higher the correlation.

The LCLs to each other? #The LCL samples are also very similar to one another because of the dark shades of red.

The B-cells to the LCLs? #Comparing the B-cells and LCL samples with one another shows very little correalation. This is indicated by the lighter yellow shades on the heat map.

The donors are labeled [A-E]. What would we expect to see if there were donor-specific effects here (e.g. some donor-specific signal affecting both cell types)? Do we see them? #If there were donor-specific signals affecting both cell types than the subsequent heatmap would show high correlation (red colors) between the two cell types. The current heatmap does not indicate donor-specific effects.

Question 1d (4 pts)

Plot the kernel density estimate for RPKM (x axis). Describe the distribution. #At around zero RPKM the function spikes at a high density value of 0.49 and then flatlines to zero.

plot(density(rnaSeqDataMeltedRPKM$RPKM) , main = "Kernel Desnity Estimate for RPKM")

Plot the kernel density estimate for log(RPKM+1) (x axis). Describe the distribution. #The kernel density estimate plot, shows a spike in density around a value of zero for the log(RPMK+1). The graph is steady with for lower density values for higher units of log(RPMK+1)

plot(density(log(rnaSeqDataMeltedRPKM$RPKM+1)) , main = "Kernel Density Estimate for log(RPKM+1)")

Why do we log-transform gene expression data? #log transformation is a form of scaling and will allow us to see a better representation for gene expression. Why do we use RPKM+1? #we use +1 because log of zero is undefined

Question 1e ( 2pts)

It is common to exclude some genes from analysis. For instance, if the gene is not expressed, there is no way it can be differentially expressed. Even in cases where there are a few reads, it is common to exclude these genes too since there we cannot be confident in any differential expression based on very few reads. Calculate the mean number of reads per gene using the count data. We’re going to use a cutoff of a mean of at least 5 reads per gene, on average, to be included in the subsequent analysis. Plot the mean reads per gene as a Cumulative Distribution (see https://en.wikipedia.org/wiki/Cumulative_distribution_function), and indicate with a vertical line where the cutoff lies.

gene = aggregate(rnaSeqDataMeltedCounts$counts, by = list(rnaSeqDataMeltedCounts$Gene.ID), FUN = mean)

plot_genemean <- ggplot(gene, aes(x)) + stat_ecdf(geom = "step") + geom_vline(xintercept=5,)

print (plot_genemean + ggtitle("Mean Number of Reads per Gene") + labs(y="mean number", x="number of genes"))

By looking at the graph, what fraction of genes will be excluded if we require an average of at least 5 reads per gene? #A very small portion of the genes, approximately close to zero will require an average of at least 5 reads.

Question 1f

Make a vector of Gene.IDs for all the genes we will include in our analysis (e.g. those with 5 or more reads per sample on average). Name it keepGenes.

reads_5 <- gene[gene$x >=5,]
keepGenes <- reads_5$Group.1

How many genes have we retained?

keepGenes_length <- length(keepGenes)

print(keepGenes_length)

## [1] 15533

Question 1f ( 2 bonus pts)

rnaSeqDataMeltedCounts contains a commmon error introduced by pasting the data having passed through excel. Can you identify this error and show some examples?

Question 2

We want to know what genes are differentially expressed between LCLs and B cells.

Today, we’re going to use edgeR. The user guide is here: https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

Question 2a (2pts)

For calculating differential expression, we need a count matrix. There is a lot more information in counts than in normalized counts. For example, you can believe 1/1E6 reads a lot less than 100/1E8 reads, but they both have the same RPKM values. Accordingly, Please create a matrix of counts (genes in the rows and samples in the columns) and filter it for genes with an average number of counts across samples >=5 (as in Q1f).

count_matrix = cast(rnaSeqDataMeltedCounts, Gene.ID ~ sampleID, value="counts", fun.aggregate="sum")
row.names(count_matrix) <- count_matrix$Gene.ID
count_matrix$Gene.ID=NULL;
count_matrix <- as.matrix(count_matrix)
count_means <- rowMeans(count_matrix)
rowstodelete <- which(count_means < 5)
count_matrix <- count_matrix[-c(which(count_means < 5)), ]
count_matrix <- as.matrix(count_matrix)
head(count_matrix)

##             B_A B_B B_C B_D B_E LCL_A LCL_B LCL_C LCL_D LCL_E
## 1/2-SBSRNA4 195 294 245 449 514    76    21    43    40    44
## A1BG         80 117  50  67  94    90   108   194   113   124
## A1BG-AS1    129 158 102 116 211   126    64   175    58   122
## A2LD1        47  93  31  76  55   116    61   100   123    62
## A2M           0   0   5   0   0   723     8    18    14     1
## A2ML1        64  91 104 197 148    23     5     7    14    15

Question 2b (4pts)

EdgeR is exceptionally versatile, with many different options for analysis. Today, you’re going to use the GLM-quasi-likelihood approach to calculate differential expression. Below is some incomplete code that you can use to start with. Please complete the code so that edgeR uses your countMatrix to calculate differential expression comparing B cells to LCLs. Please also comment the code so that it is clear what each step does (in your own words).

ct = as.factor(gsub("_.*","\\1",colnames(count_matrix)))

sample <- data.frame("sampleID" = unique(rnaSeqDataMeltedCounts$sampleID), "celltype" = unique(rnaSeqDataMeltedCounts$celltype))
#create data.frame specified by cell type by finding the sample ID, unique extracts each cell type
a = DGEList(counts = count_matrix, group = ct)
#stores data in DGEList while grouping by cell type.
a = calcNormFactors(a)
#calculates normalization factors, aligning columns  within the matrix
sample_matrix = model.matrix(~ct)
#create design matrix to filter by cell type
a = estimateDisp(a,sample_matrix)
#estimates the common dispersion for each cell type
fit = glmQLFit(a,sample_matrix)
#fits the negative binomial in our sample matrix
qlf = glmQLFTest(fit, coef=2)
#test the negative binomial, or the glmQlFit hypothesis


allDEStats = as.data.frame(topTags(qlf,n=nrow(count_matrix)))
allDEStats$Gene.ID=row.names(allDEStats)
head(allDEStats)

##               logFC   logCPM         F       PValue          FDR  Gene.ID
## CABLES1    7.285417 4.871496 1098.5494 1.295555e-13 2.016921e-09  CABLES1
## COBLL1   -11.562733 6.124496  923.7584 3.802225e-13 2.959652e-09   COBLL1
## GYLTL1B   -7.991501 5.252550  717.3561 1.821711e-12 7.786392e-09  GYLTL1B
## ARHGAP25  -4.344512 6.446297  683.4701 2.456919e-12 7.786392e-09 ARHGAP25
## Mar-01    -5.328281 7.403310  660.5827 3.032289e-12 7.786392e-09   Mar-01
## RASGRP2   -7.512479 7.986295  646.9773 3.448078e-12 7.786392e-09  RASGRP2

allDEstat_unpaired <- allDEStats

Question 2c (2pts)

Now, let’s confirm that we did the comparison in the correct direction. We want to ensure that we did LCLs/B-cells, and not the other way around. Make a volcano plot using ggplot2 for this analysis. A volcano plot is a scatter plot with log(fold-change) on the x-axis, and -log(P-value) on the y-axis. Label TNF in this plot. Alter the above code if logFC is not expressed as log(LCLs/B-cells).

ggplot(allDEStats, aes(x = logFC, y = -log(PValue,10))) + 
  geom_point() + geom_point(data = subset(allDEStats, Gene.ID == "TNF"), color = "red", aes(x=logFC, y=-log(PValue,10), label = Gene.ID, col = "TNF")) + geom_text(data = subset(allDEStats, Gene.ID == "TNF"), color = "blue", aes(x=logFC, y=-log(PValue,10), label = Gene.ID, col = "TNF"))

## Warning: Ignoring unknown aesthetics: label

Question 2d (2pts)

In the case of this experiment, each donor had corresponding B cell and LCL samples - these are indicated by the ‘donor’ variable, which takes the values A-E. For instance, the LCLs and B-cells labeled with Donor ‘E’ are both from the same individual. What happens if there are person-specific differences in expression? For instance, what if person E had autoimmune disease? This could perturb both their B-cells and their resulting LCLs. These are called ‘paired samples’ since there is a pair of samples for each donor (B cells and LCLs). Repeat the analysis above, this time incorporating the knowledge of paired samples in the analysis, using the donor as a blocking variable.

ct = as.factor(gsub("_.*","\\1",colnames(count_matrix)))

sample <- data.frame("sampleID" = unique(rnaSeqDataMeltedCounts$sampleID),"celltype" = unique(rnaSeqDataMeltedCounts$celltype), "donor" = substr(sample$sampleID, nchar(sample$sampleID), nchar(sample$sampleID)))


a = DGEList(counts = count_matrix, group = ct)
a = calcNormFactors(a)
sample_matrix = model.matrix(~ct + sample$donor)
a = estimateDisp(a,sample_matrix)
fit = glmQLFit(a,sample_matrix)
qlf = glmQLFTest(fit, coef=2)
allDEStats = as.data.frame(topTags(qlf,n=nrow(count_matrix)))
allDEStats$Gene.ID=row.names(allDEStats)
head(allDEStats)

##             logFC   logCPM        F       PValue          FDR Gene.ID
## JHDM1D  -7.383211 6.795459 665.5417 1.635597e-08 0.0002171182  JHDM1D
## HDGFRP3  7.326053 4.881600 496.7297 4.763405e-08 0.0002171182 HDGFRP3
## GYLTL1B -9.178574 5.258319 453.6521 6.630259e-08 0.0002171182 GYLTL1B
## CABLES1  7.240114 4.864193 424.8064 8.422119e-08 0.0002171182 CABLES1
## CDCA5    7.295491 4.611423 416.1041 9.080944e-08 0.0002171182   CDCA5
## BUB1     7.030729 5.605853 409.6640 9.611124e-08 0.0002171182    BUB1

ggplot(allDEStats, aes(x = logFC, y = -log(PValue,10))) + 
  geom_point() + geom_point(data = subset(allDEStats, Gene.ID == "TNF"), color = "red", aes(x=logFC, y=-log(PValue,10), label = Gene.ID, col = "TNF")) + geom_text(data = subset(allDEStats, Gene.ID == "TNF"), color = "blue", aes(x=logFC, y=-log(PValue,10), label = Gene.ID, col = "TNF"))

## Warning: Ignoring unknown aesthetics: label

allDEstat_paired <- allDEStats

Question 3

When performing any bioinformatic analysis, it is critically important to verify your answers. Here, we’re going to compare the results from our paired analysis vs. the unpaired analysis to ensure that our analysis is correct (go back and redo the previous parts if you find that your analysis appears to be incorrect).

Question 3a ( 1pts)

Compare the results from the paired analysis vs. the original analysis (where donor was not considered). How many genes have a logFC that differs by 1 or more?

diff_pair <- matrix(data = allDEstat_paired$logFC)
diff_unpair <- matrix(data = allDEstat_unpaired$logFC)
matrix_diff <- matrix(nrow = 15568, ncol = 1)

matrix_diff <- abs(diff_pair - diff_unpair)

count <- length(which(matrix_diff >= 1))

# count
sprintf("The there are %d genes that have a logFC that differ by 1 or more", count)

## [1] "The there are 8618 genes that have a logFC that differ by 1 or more"

Question 3b ( 3pts)

Plot the logFC values for paired (x-axis) vs unpaired (y-axis) analyses. What is the Pearson r between the two?

plot( x = allDEstat_paired$logFC, y = allDEstat_unpaired$logFC, col = "blue", xlab = "Paired logFC", ylab = "Unpaired logFC")

#Pearson r

pear_alph <- cor.test(allDEstat_paired$logFC, allDEstat_unpaired$logFC, method = "pearson")
pear_alph

## 
##  Pearson's product-moment correlation
## 
## data:  allDEstat_paired$logFC and allDEstat_unpaired$logFC
## t = -0.62208, df = 15566, p-value = 0.5339
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02069296  0.01072348
## sample estimates:
##          cor 
## -0.004985969

sprintf("The Pearson r value was %g, calculated using the pearson estimation method", pear_alph$estimate)

## [1] "The Pearson r value was -0.00498597, calculated using the pearson estimation method"

Question 3c

Why are there some genes whose logFC estimates differ so much between the paired and unpaired analyses? Let’s inspect the genes with a logFC difference between paired and unpaired of at least 1. A heatmap can be used to display a matrix of values. In this case, we want the values to represent log(RPKM+1) values (use rnaSeqDataMeltedRPKM), with genes (only those whose logFCs differ between the analysis using donor as a blocking variable vs. without by more than 1) on the y axis and samples on the x axis.

RPKM_matrix <- rpkmMatrix


for (x in 1:nrow(rpkmMatrix)) {
  for (y in 1:10) {
    RPKM_matrix[x,y] <- -log10(rpkmMatrix[x,y]+1)
  }
}

heat_matrix <- RPKM_matrix[c(matrix_diff)]

heatmap_gene <- heatmap(RPKM_matrix, Rowv=NA, Colv=NA, scale="column", margins=c(5,10) )

What are some reasons why paired and unpaired analyses appear to be giving different answers in this case? Are there donor-specific things going on?

#One of the reason for this difference in answer is potentially due to the fact that not all mutations are from B cells and LCLs. The clinical patient may also be suffering from different mutations outside the scope. Furthermore, for the paired analysis, it does not include any reptition where as paired analysis does this results in lower values of correlation and is not donor specific.

Question 4

There are many different ways of normalizing RNA seq data and they all make different assumptions.

For more information on RNA-seq normalization methods, see this great post by Harold Pimentel: https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/

From now on, we will be using only the paired (donor as blocking variable) analysis

Question 4a ( 3.5 pts)

edgeR uses TMM normalization. TMM normalization is usually a solid approach, but it assumes that most of the genes do not differ in expression between your samples. One approach you could use to check this is by looking at the P-value distribution from our DE analysis. For a well-behaved statistical test, P-values are uniformly distributed between 0 and 1. Plot the Cumulative Distribution of P-values for the paired analysis. On the same plot, include a CDF for 100,000 uniformly sampled random values between 0 and 1.

ecdf1 <- ecdf(allDEstat_paired$PValue)
ecdf2 <- ecdf(runif(100000, 0, 1))
plot(ecdf2, col='red', main = "Distribution Plot from 0 - 1") 
par(new = TRUE)
plot(ecdf1, col='blue',main = "Distribution Plot from 0 - 1")

# plot(distrib_Pval, sample_random)

Do the P-values calculated by edgeR look like they are sampled uniformly between 0 and 1? #The blue line represent the P-values calculated by edgeR. Because of the curvature of the line, it does not look like the P-values are sampled uniformly.

Estimate the fraction of genes that appear to have differential expression (does not have to be exact): # about 10% Do you think we should use TMM normalization for these samples? Why or why not? # Based on the fraction of genes that appear to have differential expression if there is a high correlation then you don’t want to use TMM normalization and if a low fraction of genes appear to have differential expression then you should use TMM normalization. #### Question 4b (3.5 pts) Another way to ensure that P-values are well-behaved is to make a Q-Q plot. A Q-Q plot, in this case, will show the expected P-values (under the null hypothesis) on the x axis, and actual p-values on the y-axis. Recall that by the definition of a P-value, if you generate n p-values under the null hypothesis, you expect to see 1 p-value of 1/n or lower, two of 2/n or lower, etc. Here, n is the number of DE tests you performed. Make a Q-Q plot of p-values for your DE analysis. Use -log(P) for both axes (this puts the most significant in the upper right, and least significant in the lower left). Also include the line y=x in red for comparison. For a well-behaved statistical test, most of the points should lie along the y=x line.

# plot(x = which(alph_paired$PValue == NULL), y = alph_paired$PValue)

null_matrix <- matrix(nrow = 15568, ncol = 1)
for (i in 1:15568) {
  null_matrix[i] <- i/15568
}

qqplot(x = -log(null_matrix), y = -log(allDEstat_paired$PValue))
par(new = TRUE)
abline(0,1, col = "red")

Do most of the P-values lie along the y=x line? #No, the P-values goes over the y=x line What does this mean about how we did our analysis? What would you do differently if we were to redo the analysis? #This means A p-value is a measure of the probability that an observed difference could have occurred just by random chance. The lower the p-value, the greater the statistical significance of the observed difference. P-value can be used as an alternative to or in addition to pre-selected confidence levels for hypothesis testing. In this analysis, the P-value exceeded y=x line, thus we have P-value is statistically significant. We expect the P value to be on the y=x line. We ideally redo our analysis to acheive this.

Question 4c (3 pts)

FDRs are automatically calculated by edgeR. plot the -log10(FDR) (y axis) versus the -log10(P-value) (x axis) for the paired DE analysis. Include the line y=x in red again for reference

plot(x = -log10(allDEstat_paired$PValue), y = -log10(allDEstat_paired$FDR))
par(new = TRUE)
abline(0,1, col = "red")

Are the points above or below the line? Why? #The points are all below the line because smaller P values means we have less evidence to support the null hypothesis.

Question 4d (3 pts)

If we took our set of significant DE genes to be those that meet an FDR of 10%, how many genes would be considered significantly DE?

sig_DE <- which(allDEstat_paired$FDR <= 0.10)
num_sig <- length(sig_DE)

sprintf("There are %g DE genes that are less than or equal to an FDR of 10 percent", num_sig)

## [1] "There are 9577 DE genes that are less than or equal to an FDR of 10 percent"

What is the P-value of the gene with the highest FDR that is still less than or equal to FDR < 10%?

max_sigde <- max(sig_DE)
max_pval <- allDEstat_paired$PValue[max_sigde]
max_pval

## [1] 0.0615031

sprintf("The maxiumum Pvalue under 10 percent is %g", max_pval)

## [1] "The maxiumum Pvalue under 10 percent is 0.0615031"

Ignoring any potential flaws in the analysis up to this point and assuming our P-values and FDRs are well founded, is the probability of this gene NOT being differentially expressed (ie. a false discovery) greater than, less than, or equal to 10%?

#Based on the graphs of the P-values, the probability of this gene not being differentially expressed is less than or equeal to 10%.

Question 5

Now we’re going to spend some time actually looking at the results of our analysis, trying to determine what is going on when B-cells are converted to LCLs.

Question 5a (2 pts)

GOrilla, which we use below, can take a single ranked list as input. Here, you should make two ranked lists of gene IDs, one with the gene IDs sorted by log FC from high to low, and the other from low to high. Use the paired analysis for this question. You will only want gene IDs written to the file (e.g. no row or column names, no quotes) - GOrilla will give error messages if it is not properly formatted.

lowhigh <- row.names(allDEstat_paired[order(allDEstat_paired$logFC),])
highlow <- row.names(allDEstat_paired[order(-allDEstat_paired$logFC),])

write(lowhigh, file = "low_high.txt")
write(highlow, file = "high_low.txt")

file.show("low_high.txt")
file.show("high_low.txt")

Question 5b (6 pts)

Now use GOrilla to measure gene set enrichment analysis for both sets of genes. http://cbl-gorilla.cs.technion.ac.il/

Use Homo sapiens; a single ranked list of genes (those we just created - run it twice, once per file); and all ontologies.

When inspecting the results, consider both the significance of the enrichment, the degree of enrichment, and the number of genes involved. For instance, a high-fold enrichment for a very specific term can give more insight than a more significant but lower fold enrichment for a more generic term. Alternatively, enrichments involving fewer genes are more likely to occur by chance, and so may have a high fold enrichment, but nonetheless be spurious.

What processes are relatively high in B cells? The top three processes in B cells are: Multicellular Organismal Process Pval = 2.95 E-14 G protein-coupled receptor signaling pathway Pval = 4.16 E-12 Regulation of ion transport Pval = 7.12 E-9

These three were found to be the top processes as they all contain a relatively high measure of total genes associated with the GO term. P values found <= 0.05 are considered to be enriched thus these are all considered to very enriched.

In the top 10 the most repeating function of B cells appeared to be the adhesion funciton, indicating that adhesion is a very important function of B cells.

Which are relatively high in LCLs? The top three processes in LCLs are: Monocyte chemotaxis Pval = 2.44 E-20 Chemokine-mediated signaling pathway Pval = 1.64 E-19 Mononuclear cell migration Pval = 6.3 E-19

When looking at the top 20 functions, chemotaxis and cell migration came up 5 times each indicating that they are incredibly important functions of LCLs.

What do you think could be happening as B-cells are reprogrammed to LCLs? The B cells are transitioning from functions surrounding adhesion, to that about cell migration and chemotaxis. The definion of chemotaxis is movement of a motile cell or organism indicating that when reprogramming from B to LCL, the function is flipping from being more adhesive to surrounding cells, to transporting them.