Gene Expression Response to Thermal Stress in Acropora cervicornis

Introduction

What is the issue or disease?

Climate change is causing a major increase in sea surface temperature at unprecedented rates. Coral reefs, which live in shallow, coastal waters are very susceptible to thermal stress. The level of susceptibility is dependent on many things, including the species of the coral itself. When exposed to raised temperatures, some corals will bleach and even die if the temperature or time of exposure is drastic enough.

What samples are being used to study the topic?

Acropora cervicornis is a critically endangered coral species present in the Caribbean. This dataset includes 444 A. cervicornis fragments, which come from 37 nursery colonies. There were 3 temperature treatments and 4 seasons per colony (37 * 3 * 4). Samples were pooled for analysis, each pool included 7 host genets. Samples could be from Miami or Lower Keys nurseries, and were collected in either summer or winter.

Why do we care to know about the gene expression in these samples?

This gene expression data can give insights into stress responses to heat, which is an ongoing issue for all corals, but especially for A. cervicornis. Gene expression data can also lead to the development of biomarkers that can be used for coral restoration efforts by monitoring reef health based on gene expression.

Hypothesis: Acute heat exposure in Acropora cervicornis induces stress-response genes which will be significantly expressed in heat-treated samples when compared to ambient or cold conditions.

Data Import

Count Data

#importing the count data
coral_counts <- read_tsv("GSE108338_final_host_counts.txt", show_col_types = FALSE)

#importing the sample metadata
coral_info_raw <- read.delim("GSE108338_metadata.txt", 
                             header = FALSE, sep = "\t", quote = "\"", 
                             stringsAsFactors = FALSE)

Blast Results

#these are the hot vs cold blastx results
hot.vs.cold.blast <- read.csv("HotVsCold.csv", header = TRUE)

#these are the hot vs cold re-blast (blastn) results
hot.vs.cold.reblast <- read.csv("HotVsColdReblast.csv", header = TRUE)

#these are the hot vs ambient blastx results
hot.vs.ambient.blast <- read.csv("HotVsAmbient.csv", header = TRUE)

#these are the hot vs ambient re-blast(blastn) results
hot.vs.ambient.reblast <- read.csv("HotVsAmbientReblast.csv", header = TRUE)

#importing the csv files with the functions annotated for each transcript and matching protein ID from blast
HVC.funct <- read.csv("HotVsColdFull.csv", header = TRUE)
HVA.funct <- read.csv("HotVsAmbientFull.csv", header = TRUE)

Cleaning the Data

Cleaning Coral Counts

#creating a column "ID" with only the number associated with each gene
coral_counts$ID <- gsub("^[^|]*\\||\\|.*$", "", coral_counts$genes)

#creating a column "GenBank" with only the Accession ID for each gene
coral_counts$GenBank <- gsub("^([^|]*\\|){3}([^|]*)\\|.*", "\\2", coral_counts$genes)

#This is creating a new DF with ONLY the count data
coral_counts_raw <- coral_counts[, c(2,3,4,5,6,7,8,9,10,11,12,13)]

#This adds our genbank IDs to our coral counts raw matrix as the row names
rownames(coral_counts_raw) <- coral_counts$GenBank

## Warning: Setting row names on a tibble is deprecated.

head(coral_counts_raw, n=5)

Cleaning Coral Metadata

#This pulls out only the information we want from the metadata file
coral_info_clean <- coral_info_raw %>% filter(grepl("!Sample_", V1))

#This transposes the data so switching rows and columns
coral_info <- as.data.frame(t(coral_info_clean[,-1]))

head(coral_info, n=5)

Creating the DGE

#This is creating out DGEList starting with the coral counts
coral_dge <- DGEList(counts = coral_counts_raw)

#This adds an element for the gene ID
coral_dge$genes <- data.frame(GeneID = coral_counts$GenBank, stringsAsFactors = FALSE)

Cleaning Hot Vs Cold Blast Results

#this takes our hot vs cold data for the first blast and sorts it by e value and then takes only the lowest e value for each transcript ID (the first entry since we sorted by e value)
filterHVC <- hot.vs.cold.blast %>% group_by(Transcript.ID) %>% filter (E.value == min(E.value))
uniqueHVC <- filterHVC %>% distinct(Transcript.ID, .keep_all = TRUE)

#this takes our hot vs cold data for the second blast and sorts it by e value and then takes only the lowest e value for each transcript ID (the first entry since we sorted by e value)
filterHVCreblast <- hot.vs.cold.reblast %>% group_by(Transcript.ID) %>% filter (E.value == min(E.value))
uniqueHVCreblast <- filterHVCreblast %>% distinct(Transcript.ID, .keep_all = TRUE)

#this creates a DF of the full hot vs cold results
HVC <- rbind(uniqueHVCreblast, uniqueHVC)

Cleaning Hot vs Ambient Blast Results

#this takes our hot vs ambient data for the first blast and sorts it by e value and then takes only the lowest e value for each transcript ID (the first entry since we sorted by e value)
filterHVA <- hot.vs.ambient.blast %>% group_by(Transcript.ID) %>% filter (E.value == min(E.value))
uniqueHVA <- filterHVA %>% distinct(Transcript.ID, .keep_all = TRUE)

#this takes out hot vs ambient data for the second blast and sorts it by e value and then takes only the lowest e value for each transcript ID (the first entry since we sorted by e value)
filterHVAreblast <- hot.vs.ambient.reblast %>% group_by(Transcript.ID) %>% filter (E.value == min(E.value))
uniqueHVAreblast <- filterHVAreblast %>% distinct(Transcript.ID, .keep_all = TRUE)

HVA <- rbind(uniqueHVAreblast, uniqueHVA)

Creating a Full Gene List

#create a DF of all the DE genes that had blast results, with annotated functions
AllGenes <- rbind(HVC.funct, HVA.funct)

head(AllGenes, n=5)

Creating Sample Labels

#this is creating a vector of the treatments so we can add it to our DGEList
treatment <- as.factor(c("ambient", "cold", "hot","ambient", "cold", "hot", "ambient", "cold", "hot", "ambient", "cold", "hot"))

#This adds our treatment information to the DGElist
coral_dge$samples$treatment <- treatment

#Now we do the same for the pool, season, and nursery
pool <- as.factor(c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
coral_dge$samples$pool <- pool

season <- as.factor(c("summer", "summer", "summer","summer", "summer", "summer", "winter", "winter", "winter","winter", "winter", "winter"))
coral_dge$samples$season <- season

nursery <- as.factor(c("Miami", "Lower Keys", "Miami", "Lower Keys", "Miami", "Lower Keys", "Miami", "Lower Keys", "Miami", "Lower Keys", "Miami", "Lower Keys"))
coral_dge$samples$nursery <- nursery

Math and Examining the Data

#convert raw counts into CPM and log-CPM values using edgeR
cpm <- cpm(coral_dge)
lcpm <- cpm(coral_dge, log=TRUE)

How many genes have no counts in ALL samples?

tb <- table(rowSums(coral_dge$counts==0)==12)
tb

## 
## FALSE  TRUE 
## 22705    67

What is the percent of non-expressed genes?

tb["TRUE"]/(tb["FALSE"]+tb["TRUE"])

##       TRUE 
## 0.00294221

0.29%

#keeping only "well-expressed" genes instead of only non-zero gene counts
keep.exprs <- filterByExpr(coral_dge, group=treatment)
coral_dge2 <- coral_dge[keep.exprs,, keep.lib.sizes=FALSE]

#convert raw "well-expressed" genes into log-CPM values
lcpm2 <- cpm(coral_dge2, log=TRUE)

#assign different colors to different treatments, and then examine dimensions 1 and 2 using the color grouping by treatment
col.group <- treatment
levels(col.group) <- brewer.pal(nlevels(col.group), "Set1")
col.group <- as.character(col.group)
plotMDS(lcpm2, labels=treatment, col=col.group)
title(main="Treatment Groups")

#math model
design <- model.matrix(~0+treatment)
colnames(design) <- gsub("treatment", "", colnames(design))

#creating comparisons between groups
contr.matrix <- makeContrasts(
    HotvsCold = hot-cold,
    HotvsAmbient = hot-ambient,
    ColdvsAmbient = cold-ambient,
    levels = colnames(design))

#initial data characteristics
par(mfrow=c(1,2))
v <- voom(coral_dge2, design, plot=TRUE)

#adjust model (bring mean and variance into alignment with precision weighting)
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)

#plot the adjusted model
plotSA(efit, main="Final Model: Mean-Variance Trend")

Differentially Expressed Genes

Find the Differentially Expressed Genes

#DE genes with basic p-value cutoff
summary(decideTests(efit))

##        HotvsCold HotvsAmbient ColdvsAmbient
## Down        6449         3729            42
## NotSig     14303        17092         21150
## Up           461          392            21

#DE genes with stricter fold change cut-offs
tfit <- treat(vfit, lfc=1)
dt <- decideTests(tfit)
summary(dt)

##        HotvsCold HotvsAmbient ColdvsAmbient
## Down          10           27             0
## NotSig     21159        21141         21213
## Up            44           45             0

#pull out all DE genes that are non-zero in both hot vs cold and hot vs ambient
de.common <- which(dt[,1]!=0 & dt[,2]!=0)
length(de.common)

## [1] 37

#plot the intersect
vennDiagram(dt[,1:2], circle.col=c("turquoise", "salmon"))

#take a look at the first 20 DE genes
head(tfit$genes$GeneID[de.common], n=20)

##  [1] "GASU01095046.1" "GASU01094876.1" "GASU01094062.1" "GASU01092575.1"
##  [5] "GASU01092535.1" "GASU01092436.1" "GASU01091367.1" "GASU01086893.1"
##  [9] "GASU01084707.1" "GASU01084399.1" "GASU01082780.1" "GASU01081142.1"
## [13] "GASU01079585.1" "GASU01076987.1" "GASU01076635.1" "GASU01076096.1"
## [17] "GASU01075882.1" "GASU01071410.1" "GASU01070878.1" "GASU01070778.1"

Hot vs Cold Differentially Expressed Genes

#see which genes are the MOST DE, with names and fold changes altogether
hot.vs.cold <- topTreat(tfit, coef=1, n=Inf)

#filter the topTreat data to exclude non-significant expression values
hot.vs.cold.sig <- hot.vs.cold[hot.vs.cold$adj.P.Val <= 0.05,]

#create a DF that has all of the DE hot vs cold genes with their matching functions and protein IDS
hot.vs.cold.DEfunct <- merge(hot.vs.cold.sig, HVC.funct, by = "GeneID", all.x = TRUE) 

head(hot.vs.cold.DEfunct, n=5)

#create DFs for the up and downregulated genes for hot vs cold based on log fold change
hot.vs.cold.down <- hot.vs.cold.DEfunct[hot.vs.cold.DEfunct$logFC < 0,]
hot.vs.cold.up <- hot.vs.cold.DEfunct[hot.vs.cold.DEfunct$logFC > 0,]

Hot vs Ambient Differentially Expressed Genes

#see which genes are the MOST DE, with names and fold changes altogether
hot.vs.ambient <- topTreat(tfit, coef=2, n=Inf)

#filter the topTreat data to exclude non-significant expression values
hot.vs.ambient.sig <- hot.vs.ambient[hot.vs.ambient$adj.P.Val <= 0.05,]

#create a DF that has all of the DE hot vs ambient genes with their matching functions and protein IDS
hot.vs.ambient.DEfunct <- merge(hot.vs.ambient.sig, HVA.funct, by = "GeneID", all.x = TRUE)

head(hot.vs.ambient.DEfunct, n=5)

#create DFs for the up and downregulated genes for hot vs ambient based on log fold change
hot.vs.ambient.down <- hot.vs.ambient.DEfunct[hot.vs.ambient.DEfunct$logFC < 0, ]
hot.vs.ambient.up <- hot.vs.ambient.DEfunct[hot.vs.ambient.DEfunct$logFC > 0, ]

Data Export

Note: the code here was purposefully left out, as this chunk is data manipulation and export to re-blast sequences, and thus does not impact the analysis in any way.

Plots

Volcano Plots

#Creating a volcano plot of up vs downregulated genes for hot vs ambient
ggplot(hot.vs.ambient.DEfunct, aes(x = logFC, y = -log10(adj.P.Val))) +
  geom_point(alpha = 0.5) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  theme_bw() +
  labs(title = "Hot vs Ambient Differential Expression",
       x = "log2 Fold Change",
       y = "-log10(p-value)")

#Creating a volcano plot of up vs downregulated genes for hot vs cold
ggplot(hot.vs.cold.DEfunct, aes(x = logFC, y = -log10(adj.P.Val))) +
  geom_point(alpha = 0.5) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  theme_bw() +
  labs(title = "Hot vs Cold Differential Expression",
       x = "log2 Fold Change",
       y = "-log10(p-value)")

Heat Maps

#remove any rows where Description='NA'
hva.DEfunct.clean <- hot.vs.ambient.DEfunct[!is.na(hot.vs.ambient.DEfunct$Description), ]

#converting description column to be more computer-readable
hva.DEfunct.clean$Description <- iconv(
  hva.DEfunct.clean$Description,
  from = "",
  to = "UTF-8",
  sub = "?"
)

#remove genes that are uninformative, like "uncharacterized protein"
top_genes_hva <- hva.DEfunct.clean %>%
  filter(
    !is.na(Description),
    !(
      grepl("uncharacterized|hypothetical|unnamed",
            Description,
            ignore.case = TRUE)
    )
  )

hva_table <- top_genes_hva %>% select(GeneID, logFC, P.Value, Description)

hva_table <- hva_table %>% mutate(logFC = round(logFC, 3))

datatable(hva_table,
          caption = "Differentially Expressed Hot vs Ambient Genes",
          class = 'cell-border hover',
          options = list(scrollX = TRUE, autoWidth = TRUE))

#create a variable of only the gene IDs for the informative HVA genes
hva_genes_plot <- top_genes_hva$GeneID

#create a variable of the lpcm values only for the informative HVA genes
hva_map <- lcpm[hva_genes_plot, ]

#pull rownames from HVA DF which has gene names (as row names) and lcpm numbers 
hva_genes <- rownames(hva_map)

#match Gene ID to Description column
description_hva <- top_genes_hva$Description[match(hva_genes, top_genes_hva$GeneID)]

#change rownames from Gene ID to Gene Description
rownames(hva_map) <- description_hva

#set custom colors for heat map
map_colors <- colorRampPalette(c("blue", "white", "red"))(256)

#create the heatmap
pheatmap(
  hva_map,
  scale = "row",
  main = "Hot vs Ambient Differentially Expressed Genes",
  col = map_colors,
  fontsize_row = 10,
  fontsize_col = 10
)

#remove any rows where Description='NA'
hvc.DEfunct.clean <- hot.vs.cold.DEfunct[!is.na(hot.vs.cold.DEfunct$Description), ]

#converting description column to be more computer-readable
hvc.DEfunct.clean$Description <- iconv(
  hvc.DEfunct.clean$Description,
  from = "",
  to = "UTF-8",
  sub = "?"
)

#remove genes that are uninformative, like "uncharacterized protein"
top_genes_hvc <- hvc.DEfunct.clean %>%
  filter(
    !is.na(Description),
    !(
      grepl("uncharacterized|hypothetical|unnamed",
            Description,
            ignore.case = TRUE)
    )
  )

hvc_table <- top_genes_hvc %>% select(GeneID, logFC, P.Value, Description)

hvc_table <- hvc_table %>% mutate(logFC = round(logFC, 3))

datatable(hvc_table,
          caption = "Differentially Expressed Hot vs Cold Genes",
          class = 'cell-border hover',
          options = list(scrollX = TRUE, autoWidth = TRUE))

#create a variable of only the gene IDs for the informative HVC genes
hvc_genes_plot <- top_genes_hvc$GeneID

#create a variable of the lpcm values only for the informative HVC genes
hvc_map <- lcpm[hvc_genes_plot, ]

#pull rownames from HVC DF which has gene names (as row names) and lcpm numbers 
hvc_genes <- rownames(hvc_map)

#match Gene ID to Description column
description_hvc <- top_genes_hvc$Description[match(hvc_genes, top_genes_hvc$GeneID)]

#change rownames from Gene ID to Gene Description
rownames(hvc_map) <- description_hvc

#create the heatmap
pheatmap(
  hvc_map,
  scale = "row",
  main = "Hot vs Cold Differentially Expressed Genes",
  col = map_colors,
  fontsize_row = 11,
  fontsize_col = 11
)