This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the
code chunk to prevent printing of the R code that generated the
plot.
Autosomal Dominant Polycystic Kidney Disease (ADPKD) is a disorder where multiple cysts grow in the kidneys. These cysts get bigger and reduce kidney function. This disease is often caused by mutations in PKD1 or PKD2 genes. We used single-cell RNA sequencing to study TAL2 cells and compare PKD patients with healthy controls. Our goal is to find biomarker genes that behave differently in TAL2 cells from people with PKD.
Biological Hypothesis:
TAL2 cells from PKD patients have different gene expression compared to
TAL2 cells from healthy people.
Statistical Hypothesis:
- H₀: There is no difference in gene expression between PKD and control
TAL2 cells.
- H₁: At least some genes are expressed differently between PKD and
control TAL2 cells.
We used a dataset specifically assigned to our group (Group 4), which consisted of single-cell RNA sequencing (scRNA-seq) data collected from human kidney tissue. This dataset, titled muto2.tsv, includes gene expression measurements for 2000 genes across 2400 individual cells. These cells were collected from 12 biological samples: 8 from patients diagnosed with autosomal dominant polycystic kidney disease (ADPKD) and 4 from healthy controls. Each sample contains 200 single cells.
The goal of this analysis was to identify PKD-associated biomarker genes specific to the TAL2 (Thick Ascending Limb 2) kidney cell type. We used RStudio and the R Markdown format provided in the “Final-report-template.Rmd” to run our analysis. Using tidyverse and other R packages, we generated summary tables, reshaped the data, performed t-tests for differential gene expression, and created visualizations including boxplots, UMAP, volcano plots, heatmaps, and top gene comparisons. We first selected only the TAL2 cells. Then we used t- tests to find genes that were significantly different between the PKD group and the control group. We made boxplots to show expression distributions, UMAPs to group similar cells, volcano plots to highlight significant genes, heatmaps for the top 50 genes, and boxplots for the top 10 genes.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pheatmap)
library(umap)
library(broom)
library(EnhancedVolcano)
## Loading required package: ggrepel
x <- read_tsv("https://borreliabase.org/~wgqiu/muto2.tsv")
## Rows: 2400 Columns: 2002
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): id, cell.type
## dbl (2000): PTPRQ, PTPRO, ST6GALNAC3, MAGI2, SERPINE1, SLC8A1, SLC26A7, PLA2...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(x)
## # A tibble: 6 × 2,002
## id cell.type PTPRQ PTPRO ST6GALNAC3 MAGI2 SERPINE1 SLC8A1 SLC26A7 PLA2R1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 PKD_… PT1 -0.270 -0.487 -0.420 0.575 -0.115 -1.18 -0.264 -0.629
## 2 PKD_… DCT -0.246 -0.695 -0.411 -1.70 -0.390 -0.655 -0.451 -0.298
## 3 PKD_… TAL2 -0.288 -0.397 -0.335 -0.492 -0.355 1.35 -0.326 -0.302
## 4 PKD_… TAL1 -0.212 -0.331 -0.329 0.741 -0.395 -0.983 -0.263 -0.102
## 5 PKD_… PEC 0.625 1.71 0.195 0.868 0.465 1.38 -0.449 -1.34
## 6 PKD_… z12 -1.07 1.74 4.35 -0.312 -0.110 -0.560 -0.208 -0.869
## # ℹ 1,992 more variables: SLC12A3 <dbl>, AC109466.1 <dbl>, EMCN <dbl>,
## # NTNG1 <dbl>, TIMP3 <dbl>, ZPLD1 <dbl>, LINC01811 <dbl>, CELF2 <dbl>,
## # PLAT <dbl>, ROBO2 <dbl>, FGF1 <dbl>, FMN2 <dbl>, SLC14A2 <dbl>,
## # PAPPA2 <dbl>, PLPP1 <dbl>, ADAMTS19 <dbl>, CLNK <dbl>, LDB2 <dbl>,
## # AC093912.1 <dbl>, ADGRL3 <dbl>, NKAIN2 <dbl>, CLIC5 <dbl>, NPHS2 <dbl>,
## # ADGRF5 <dbl>, FLT1 <dbl>, ADAMTS6 <dbl>, RGS6 <dbl>, DCC <dbl>,
## # SNTG1 <dbl>, CNTNAP5 <dbl>, ENPP2 <dbl>, NPHS1 <dbl>, CNTN5 <dbl>, …
x <- x %>%
mutate(cohort = str_remove(id, "_.+")) %>%
mutate(sample = str_remove(id, "_.+?_"))
table(x$cohort)
##
## Cont PKD
## 1000 1400
table(x$sample)
##
## Cont1 Cont2 Cont3 Cont4 Cont5 PKD1 PKD3 PKD4 PKD5 PKD6 PKD7 PKD8
## 200 200 200 200 200 200 200 200 200 200 200 200
table(x$cell.type)
##
## CNT_PC DCT ENDO FIB ICA-ICB LEUK PEC PODO PT1 PT2
## 366 226 103 188 119 124 47 35 367 191
## TAL1 TAL2 z12
## 166 440 28
#Questions on Biological Samples
How many samples are there? There are 12 total samples in the dataset (8 from PKD patients and 4 from healthy controls).
How many cells per sample? Each sample contains 200 cells.
How many cell types are present? There are 13 different kidney cell types in the dataset, including TAL2, PT1, PT2, DCT, and others.
x.long <- x %>%
pivot_longer(cols = 3:2002, names_to = "gene", values_to = "rna.ct")
head(x.long)
## # A tibble: 6 × 6
## id cell.type cohort sample gene rna.ct
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 PKD_ACATTTCAGGGAGGCA-1_1 PT1 PKD PKD1 PTPRQ -0.270
## 2 PKD_ACATTTCAGGGAGGCA-1_1 PT1 PKD PKD1 PTPRO -0.487
## 3 PKD_ACATTTCAGGGAGGCA-1_1 PT1 PKD PKD1 ST6GALNAC3 -0.420
## 4 PKD_ACATTTCAGGGAGGCA-1_1 PT1 PKD PKD1 MAGI2 0.575
## 5 PKD_ACATTTCAGGGAGGCA-1_1 PT1 PKD PKD1 SERPINE1 -0.115
## 6 PKD_ACATTTCAGGGAGGCA-1_1 PT1 PKD PKD1 SLC8A1 -1.18
x.long %>%
filter(cell.type == "TAL2") %>%
ggplot(aes(x = sample, y = rna.ct)) +
geom_boxplot() +
labs(title = "Boxplot of TAL2 Gene Expression", y = "RNA Count")
Figure 1. Boxplot showing gene expression levels across
TAL2 cells by sample.
Key finding: TAL2 gene expression is generally
consistent, with slight variation between samples and occasional
outliers.
X-axis: Sample ID.
Y-axis: RNA count values for all genes. Each box
represents the distribution of expression values in a sample.
x.heat <- x %>% select(-c(1, 2, 2003, 2004)) %>% as.data.frame()
rownames(x.heat) <- x$id
x.umap <- umap(x.heat)
df.umap <- data.frame(x.umap$layout) %>%
mutate(cell.type = x$cell.type, sample = x$sample, cohort = x$cohort)
df.umap %>%
filter(cell.type == "TAL2") %>%
ggplot(aes(x = X1, y = X2, color = cohort)) +
geom_point(alpha = 0.3) +
labs(title = "UMAP of TAL2 Cells") +
theme_bw() +
stat_ellipse()
Figure 2. UMAP clustering of TAL2 cells based on gene
expression.
Key finding: TAL2 cells from PKD patients show slightly
different gene activity than healthy ones, which is why they form
separate groups on the UMAP plot. This supports our idea that PKD
affects how these cells function. X-axis and Y-axis:
UMAP 1 and UMAP 2, the two main dimensions used to visualize
clusters.
Each point: One TAL2 cell.
Color: Indicates cohort (PKD vs Control).
x.cell <- x.long %>% filter(cell.type == "TAL2")
t.out2 <- x.cell %>%
group_by(gene) %>%
do(tidy(t.test(data = ., rna.ct ~ cohort)))
EnhancedVolcano(toptable = t.out2, lab = t.out2$gene,
x = "estimate", y = "p.value",
pCutoff = 0.01,
FCcutoff = 0.7,
title = "Volcano Plot of TAL2 Differential Expression")
Figure 3. Volcano plot of differentially expressed
genes in TAL2 cells.
Key finding: Some genes show significant expression
differences between PKD and control cohorts.
X-axis: Fold change in gene expression (difference in
mean RNA count).
Y-axis: Negative log10 of the p-value, representing
statistical significance.
Each point: One gene.
Color: Highlights genes with significant expression
changes.
top_10_genes <- t.out2 %>% arrange(p.value) %>% head(10)
y.wide <- x.cell %>% filter(gene %in% top_10_genes$gene) %>% pivot_wider(names_from = gene, values_from = rna.ct)
y.heatmap <- y.wide %>% select(-c(2,3,4)) %>% column_to_rownames("id")
ann_cell <- data.frame(disease.type = factor(y.wide$cohort))
rownames(ann_cell) <- y.wide$id
pheatmap(y.heatmap, show_rownames = F, annotation_row = ann_cell, main = "Heatmap of Top 10 Genes in TAL2")
Figure 4. Heatmap showing top 50 differentially
expressed genes in TAL2 cells.
Key finding: Clear separation between PKD and control
gene expression signatures within TAL2 cells.
Rows: Individual TAL2 cells.
Columns: Genes.
Color scale: RNA expression levels from low (blue) to
high (red).
top10 <- t.out2 %>% arrange(p.value) %>% head(10) %>% pull(gene)
x.long %>%
filter(gene %in% top10, cell.type == "TAL2") %>%
ggplot(aes(x = gene, y = rna.ct, fill = cohort)) +
geom_boxplot() +
coord_flip() +
labs(title = "Top 10 Genes in TAL2 Cells")
Figure 5. Boxplot of top 10 most differentially
expressed genes in TAL2 cells.
Key finding: Top genes demonstrate clear separation
between PKD and control TAL2 cells, making them strong candidate
biomarkers. Several genes seem to be downregulated in PKD samples
compared to the controls (UMOD, SLC12A1, IFITM10, GPC5)
X-axis: Gene names (rotated using coord_flip).
Y-axis: RNA expression counts.
Color: Indicates whether the values are from PKD or
control samples.
We used single-cell RNA sequencing data to analyze TAL2 cells from both PKD patients and healthy controls.Our analysis revealed several genes that were differentially expressed in PKD TAL2 cells, including some that showed clear separation in the volcano plot, heatmap, and top 10 gene boxplots.Certain genes such as UMOD, SLC12A1, GPC5, and GP2 are potential biomarkers due to significant upregulation seen in the volcano plot, as well as some downregulation seen in the boxplot with the PKD samples compared to the control samples.