This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.

ASSIGNMENT 2 GROUP 1 MEMBERS: 1)INGEET PAL 2)ANUSKA SARKAR 3)ANTIK ROY 4)NANDINI SAHA 5)AVIK BHUIN

Part A: Dataset Selection

  1. What is the NCBI GEO database?

The NCBI Gene Expression Omnibus (GEO) is a public functional genomics data repository that stores high-throughput gene expression data, including microarray and RNA-seq. Researchers deposit their data here so that others can access, re-analyze, and reproduce studies. It includes tools for visualization and downloading of data.

  1. What is the GEO Series ID (e.g., GSEXXXXX) of the dataset you chose?

GSE171110 – This is a publicly available RNA-seq dataset titled:

“RNA-seq of whole blood from COVID-19 patients with different severity and healthy controls”

  1. Explain the experiment and the dataset and why did you choose it?

This dataset comes from an RNA-seq experiment comparing blood transcriptomes of COVID-19 patients with mild, moderate, or severe disease, and healthy controls. It aims to understand how gene expression varies with disease severity, which could uncover immune-related gene changes and potential biomarkers.

Why this dataset?

Timely and relevant to global health. Clearly defined groups (mild, moderate, severe, control). Human blood RNA-seq, commonly studied tissue. Available count data and metadata.

  1. What are the number of different groups in the dataset?

There are 4 different groups:

Healthy controls Mild COVID-19 Moderate COVID-19 Severe COVID-19

#install.packages("BiocManager") 
#BiocManager::install("DESeq2")

library(DESeq2)
library(ggplot2)
library(dplyr)

Part B: Loading dataset and DESeq2 1) Download the raw count matrix from the GEO page. Downloadable in .tsv, .csv or .txt formats.

count_data <- read.delim("GSE171110_Data_RNAseq_raw_counts_geo.txt")
  1. Download the metadata for the dataset (has the information about the different columns). Read the associated paper if required.
metadata<-read.delim("GSE171110_series_matrix.txt")
  1. Display the structure and first 10 columns of the dataset.
str(count_data)
head(count_data, 10)
str(metadata)
head(metadata, 10)
  1. Create a Data Frame for Two Conditions For this analysis, we’ll focus on two conditions: “Healthy” and “Severe”.
# Filter metadata for selected conditions
metadata_filtered <- metadata[metadata$condition %in% c("Healthy", "Severe"), ]

# Subset count data based on filtered metadata
count_data_filtered <- count_data[, rownames(metadata_filtered)]
  1. Pre-process the Dataset for DESeq2
# Ensure count data is in integer format
count_data_filtered <- round(count_data_filtered)

# Create DESeqDataSet
dds <- DESeq2DataSetFromMatrix(countData = count_data_filtered,
                              colData = metadata_filtered,
                              design = ~ condition)
Error in DESeq2DataSetFromMatrix(countData = count_data_filtered, colData = metadata_filtered,  : 
  could not find function "DESeq2DataSetFromMatrix"
  1. Run DESeq2 and Display Results
# Run DESeq2 analysis
dds <- DESeq2(dds)

# Get results
res <- results(dds)

# Display summary
summary(res)

Part C: Analysis and Results 1) How many genes were found to be significantly differentially expressed (adjusted p-value < 0.05)?

# Filter for significant genes
sig_genes <- res[which(res$padj < 0.05), ]

# Number of significant genes
nrow(sig_genes)
  1. List 5 upregulated and 5 downregulated genes
# Sort by log2FoldChange
res_ordered <- res[order(res$log2FoldChange, decreasing = TRUE), ]

# Upregulated genes (log2FC > 1)
head(res_ordered[res_ordered$padj < 0.05 & res_ordered$log2FoldChange > 1, c("log2FoldChange", "padj")], 5)

# Downregulated genes (log2FC < -1)
tail(res_ordered[res_ordered$padj < 0.05 & res_ordered$log2FoldChange < -1, c("log2FoldChange", "padj")], 5)
  1. Choose 3 interesting DE genes and describe their functions

ISG15 (Upregulated) Function: Interferon-stimulated gene involved in antiviral defense. Relevance: Overexpressed in response to viral infections; helps modulate immune responses in COVID-19.

HBB (Downregulated) Function: Encodes the beta subunit of hemoglobin. Relevance: Its downregulation may reflect systemic inflammation or blood cell abnormalities seen in severe COVID-19.

OAS1 (Upregulated) Function: Activates RNase L to degrade viral RNA. Relevance: Known to be protective in COVID-19; some polymorphisms linked to better outcomes.

  1. Include and describe a PCA plot
vsd <- vst(dds, blind=FALSE)
plotPCA(vsd, intgroup="condition")
  1. Display a volcano plot
# Load libraries
library(ggplot2)
# Volcano plot with ggplot2
ggplot(result.DEG, aes(x = log2FoldChange, y = (-log10(padj)))) +
geom_point(alpha = 0.6, size = 1.8) +
scale_color_manual(values = c("grey", "red")) +
theme_minimal() +
labs(title = "Volcano Plot", x = "log2 Fold Change", y = "-log10(p-value)") +
theme(legend.title = element_blank())

Part D: Interpretation 1) What challenges or issues did you encounter during the analysis?

Data formatting: GEO data often comes in different formats; aligning sample names between metadata and count matrix can be tricky.

Metadata clarity: Sometimes the condition labels aren’t consistent or well-documented, requiring you to dig into the associated publication.

Batch effects or noise: Especially in clinical datasets (like COVID-19), patient variability can introduce noise.

Filtering decisions: Choosing thresholds for fold change or p-values can be subjective and influence results.

Initially, packages took time to download and load into R.

  1. What assumptions does DESeq2 make about RNA-seq count data? DESeq2 is built on several key assumptions:

Negative binomial distribution: It assumes that count data for each gene follows a negative binomial distribution, which accounts for overdispersion.

No need for prior normalization: DESeq2 normalizes internally using size factors based on the median-of-ratios method.

Majority of genes are not differentially expressed: This assumption allows it to estimate dispersion and size factors more robustly.

  1. What are some limitations of using only DESeq2 for analysis, and what other tools might complement it? Limitations of DESeq2: Focused solely on differential expression — it doesn’t capture splicing, transcript-level, or gene set/pathway analysis.

May not perform well with very small sample sizes.

Doesn’t model batch effects automatically (you need to include them manually in the design formula).

Complementary tools: edgeR or limma – Alternative DE tools with different normalization/statistical approaches.

Sleuth or Salmon + DESeq2 – For transcript-level quantification.

Gene Set Enrichment Analysis (GSEA) – For understanding biological pathways.

clusterProfiler, GOseq – For functional enrichment (GO, KEGG).

Combat or sva – To remove batch effects.

MAST or Seurat – For single-cell RNA-seq data.

Contributions: 1)INGEET PAL: Part B Questions 3-6 2)ANUSKA SARKAR: Part A and Part B Question 1-2 3)ANTIK ROY: Part C Questions 1-3 4)NANDINI SAHA: Part C Questions 4-5 and Part D 5)AVIK BHUIN: Part D

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

---
title: "R Notebook"
output: html_notebook
---

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 

Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Ctrl+Shift+Enter*. 


ASSIGNMENT 2
GROUP 1
MEMBERS:
1)INGEET PAL
2)ANUSKA SARKAR
3)ANTIK ROY
4)NANDINI SAHA
5)AVIK BHUIN

Part A: Dataset Selection

1. What is the NCBI GEO database?

The NCBI Gene Expression Omnibus (GEO) is a public functional genomics data repository that stores high-throughput gene expression data, including microarray and RNA-seq. Researchers deposit their data here so that others can access, re-analyze, and reproduce studies. It includes tools for visualization and downloading of data.

2. What is the GEO Series ID (e.g., GSEXXXXX) of the dataset you chose?

GSE171110 – This is a publicly available RNA-seq dataset titled:

“RNA-seq of whole blood from COVID-19 patients with different severity and healthy controls”

3. Explain the experiment and the dataset and why did you choose it?

This dataset comes from an RNA-seq experiment comparing blood transcriptomes of COVID-19 patients with mild, moderate, or severe disease, and healthy controls. It aims to understand how gene expression varies with disease severity, which could uncover immune-related gene changes and potential biomarkers.

Why this dataset?

Timely and relevant to global health.
Clearly defined groups (mild, moderate, severe, control).
Human blood RNA-seq, commonly studied tissue.
Available count data and metadata.

4. What are the number of different groups in the dataset?

There are 4 different groups:

Healthy controls
Mild COVID-19
Moderate COVID-19
Severe COVID-19

```{r}
#install.packages("BiocManager") 
#BiocManager::install("DESeq2")

library(DESeq2)
library(ggplot2)
library(dplyr)
```

Part B: Loading dataset and DESeq2
1) Download the raw count matrix from the GEO page. Downloadable in .tsv, .csv
or .txt formats.
```{r}
count_data <- read.delim("GSE171110_Data_RNAseq_raw_counts_geo.txt")
```

2) Download the metadata for the dataset (has the information about the different
columns). Read the associated paper if required.
```{r}
metadata<-read.delim("GSE171110_series_matrix.txt")
```

3) Display the structure and first 10 columns of the dataset.
```{r}
str(count_data)
head(count_data, 10)
str(metadata)
head(metadata, 10)
```

4. Create a Data Frame for Two Conditions
For this analysis, we'll focus on two conditions: "Healthy" and "Severe". 
```{r}
# Filter metadata for selected conditions
metadata_filtered <- metadata[metadata$condition %in% c("Healthy", "Severe"), ]

# Subset count data based on filtered metadata
count_data_filtered <- count_data[, rownames(metadata_filtered)]
```

5. Pre-process the Dataset for DESeq2
```{r}
# Ensure count data is in integer format
count_data_filtered <- round(count_data_filtered)

# Create DESeqDataSet
dds <- DESeq2DataSetFromMatrix(countData = count_data_filtered,
                              colData = metadata_filtered,
                              design = ~ condition)
```

6. Run DESeq2 and Display Results
```{r}
# Run DESeq2 analysis
dds <- DESeq2(dds)

# Get results
res <- results(dds)

# Display summary
summary(res)
```

Part C: Analysis and Results
1) How many genes were found to be significantly differentially expressed (adjusted p-value < 0.05)?
```{r}
# Filter for significant genes
sig_genes <- res[which(res$padj < 0.05), ]

# Number of significant genes
nrow(sig_genes)
```

2) List 5 upregulated and 5 downregulated genes
```{r}
# Sort by log2FoldChange
res_ordered <- res[order(res$log2FoldChange, decreasing = TRUE), ]

# Upregulated genes (log2FC > 1)
head(res_ordered[res_ordered$padj < 0.05 & res_ordered$log2FoldChange > 1, c("log2FoldChange", "padj")], 5)

# Downregulated genes (log2FC < -1)
tail(res_ordered[res_ordered$padj < 0.05 & res_ordered$log2FoldChange < -1, c("log2FoldChange", "padj")], 5)
```

3) Choose 3 interesting DE genes and describe their functions

ISG15 (Upregulated)
Function: Interferon-stimulated gene involved in antiviral defense.
Relevance: Overexpressed in response to viral infections; helps modulate immune responses in COVID-19.

HBB (Downregulated)
Function: Encodes the beta subunit of hemoglobin.
Relevance: Its downregulation may reflect systemic inflammation or blood cell abnormalities seen in severe COVID-19.

OAS1 (Upregulated)
Function: Activates RNase L to degrade viral RNA.
Relevance: Known to be protective in COVID-19; some polymorphisms linked to better outcomes.

4) Include and describe a PCA plot
```{r}
vsd <- vst(dds, blind=FALSE)
plotPCA(vsd, intgroup="condition")
```

5) Display a volcano plot
```{r}
# Load libraries
library(ggplot2)
# Volcano plot with ggplot2
ggplot(result.DEG, aes(x = log2FoldChange, y = (-log10(padj)))) +
geom_point(alpha = 0.6, size = 1.8) +
scale_color_manual(values = c("grey", "red")) +
theme_minimal() +
labs(title = "Volcano Plot", x = "log2 Fold Change", y = "-log10(p-value)") +
theme(legend.title = element_blank())
```

Part D: Interpretation
1) What challenges or issues did you encounter during the analysis?

Data formatting: GEO data often comes in different formats; aligning sample names between metadata and count matrix can be tricky.

Metadata clarity: Sometimes the condition labels aren't consistent or well-documented, requiring you to dig into the associated publication.

Batch effects or noise: Especially in clinical datasets (like COVID-19), patient variability can introduce noise.

Filtering decisions: Choosing thresholds for fold change or p-values can be subjective and influence results.

Initially, packages took time to download and load into R.

2) What assumptions does DESeq2 make about RNA-seq count data?
DESeq2 is built on several key assumptions:

Negative binomial distribution: It assumes that count data for each gene follows a negative binomial distribution, which accounts for overdispersion.

No need for prior normalization: DESeq2 normalizes internally using size factors based on the median-of-ratios method.

Majority of genes are not differentially expressed: This assumption allows it to estimate dispersion and size factors more robustly.

3) What are some limitations of using only DESeq2 for analysis, and what other tools might complement it?
Limitations of DESeq2:
Focused solely on differential expression — it doesn’t capture splicing, transcript-level, or gene set/pathway analysis.

May not perform well with very small sample sizes.

Doesn't model batch effects automatically (you need to include them manually in the design formula).

Complementary tools:
edgeR or limma – Alternative DE tools with different normalization/statistical approaches.

Sleuth or Salmon + DESeq2 – For transcript-level quantification.

Gene Set Enrichment Analysis (GSEA) – For understanding biological pathways.

clusterProfiler, GOseq – For functional enrichment (GO, KEGG).

Combat or sva – To remove batch effects.

MAST or Seurat – For single-cell RNA-seq data.


Contributions:
1)INGEET PAL: Part B Questions 3-6
2)ANUSKA SARKAR: Part A and Part B Question 1-2
3)ANTIK ROY: Part C Questions 1-3
4)NANDINI SAHA: Part C Questions 4-5 and Part D
5)AVIK BHUIN: Part D

Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Ctrl+Alt+I*.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Ctrl+Shift+K* to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike *Knit*, *Preview* does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.
