R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Introduction

Autosomal Dominant Polycystic Kidney Disease (ADPKD) is a disorder where multiple cysts grow in the kidneys. These cysts get bigger and reduce kidney function. This disease is often caused by mutations in PKD1 or PKD2 genes. We used single-cell RNA sequencing to study TAL2 cells and compare PKD patients with healthy controls. Our goal is to find biomarker genes that behave differently in TAL2 cells from people with PKD.

Hypotheses

Biological Hypothesis:
TAL2 cells from PKD patients have different gene expression compared to TAL2 cells from healthy people.

Statistical Hypothesis:
- H₀: There is no difference in gene expression between PKD and control TAL2 cells.
- H₁: At least some genes are expressed differently between PKD and control TAL2 cells.

Methods

We used a dataset specifically assigned to our group (Group 4), which consisted of single-cell RNA sequencing (scRNA-seq) data collected from human kidney tissue. This dataset, titled muto2.tsv, includes gene expression measurements for 2000 genes across 2400 individual cells. These cells were collected from 12 biological samples: 8 from patients diagnosed with autosomal dominant polycystic kidney disease (ADPKD) and 4 from healthy controls. Each sample contains 200 single cells.

The goal of this analysis was to identify PKD-associated biomarker genes specific to the TAL2 (Thick Ascending Limb 2) kidney cell type. We used RStudio and the R Markdown format provided in the “Final-report-template.Rmd” to run our analysis. Using tidyverse and other R packages, we generated summary tables, reshaped the data, performed t-tests for differential gene expression, and created visualizations including boxplots, UMAP, volcano plots, heatmaps, and top gene comparisons. We first selected only the TAL2 cells. Then we used t- tests to find genes that were significantly different between the PKD group and the control group. We made boxplots to show expression distributions, UMAPs to group similar cells, volcano plots to highlight significant genes, heatmaps for the top 50 genes, and boxplots for the top 10 genes.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pheatmap)
library(umap)
library(broom)
library(EnhancedVolcano)
## Loading required package: ggrepel
x <- read_tsv("https://borreliabase.org/~wgqiu/muto2.tsv")
## Rows: 2400 Columns: 2002
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr    (2): id, cell.type
## dbl (2000): PTPRQ, PTPRO, ST6GALNAC3, MAGI2, SERPINE1, SLC8A1, SLC26A7, PLA2...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(x)
## # A tibble: 6 × 2,002
##   id    cell.type  PTPRQ  PTPRO ST6GALNAC3  MAGI2 SERPINE1 SLC8A1 SLC26A7 PLA2R1
##   <chr> <chr>      <dbl>  <dbl>      <dbl>  <dbl>    <dbl>  <dbl>   <dbl>  <dbl>
## 1 PKD_… PT1       -0.270 -0.487     -0.420  0.575   -0.115 -1.18   -0.264 -0.629
## 2 PKD_… DCT       -0.246 -0.695     -0.411 -1.70    -0.390 -0.655  -0.451 -0.298
## 3 PKD_… TAL2      -0.288 -0.397     -0.335 -0.492   -0.355  1.35   -0.326 -0.302
## 4 PKD_… TAL1      -0.212 -0.331     -0.329  0.741   -0.395 -0.983  -0.263 -0.102
## 5 PKD_… PEC        0.625  1.71       0.195  0.868    0.465  1.38   -0.449 -1.34 
## 6 PKD_… z12       -1.07   1.74       4.35  -0.312   -0.110 -0.560  -0.208 -0.869
## # ℹ 1,992 more variables: SLC12A3 <dbl>, AC109466.1 <dbl>, EMCN <dbl>,
## #   NTNG1 <dbl>, TIMP3 <dbl>, ZPLD1 <dbl>, LINC01811 <dbl>, CELF2 <dbl>,
## #   PLAT <dbl>, ROBO2 <dbl>, FGF1 <dbl>, FMN2 <dbl>, SLC14A2 <dbl>,
## #   PAPPA2 <dbl>, PLPP1 <dbl>, ADAMTS19 <dbl>, CLNK <dbl>, LDB2 <dbl>,
## #   AC093912.1 <dbl>, ADGRL3 <dbl>, NKAIN2 <dbl>, CLIC5 <dbl>, NPHS2 <dbl>,
## #   ADGRF5 <dbl>, FLT1 <dbl>, ADAMTS6 <dbl>, RGS6 <dbl>, DCC <dbl>,
## #   SNTG1 <dbl>, CNTNAP5 <dbl>, ENPP2 <dbl>, NPHS1 <dbl>, CNTN5 <dbl>, …
x <- x %>%
  mutate(cohort = str_remove(id, "_.+")) %>%
  mutate(sample = str_remove(id, "_.+?_"))

table(x$cohort)
## 
## Cont  PKD 
## 1000 1400
table(x$sample)
## 
## Cont1 Cont2 Cont3 Cont4 Cont5  PKD1  PKD3  PKD4  PKD5  PKD6  PKD7  PKD8 
##   200   200   200   200   200   200   200   200   200   200   200   200
table(x$cell.type)
## 
##  CNT_PC     DCT    ENDO     FIB ICA-ICB    LEUK     PEC    PODO     PT1     PT2 
##     366     226     103     188     119     124      47      35     367     191 
##    TAL1    TAL2     z12 
##     166     440      28

#Questions on Biological Samples

How many samples are there? There are 12 total samples in the dataset (8 from PKD patients and 4 from healthy controls).

How many cells per sample? Each sample contains 200 cells.

How many cell types are present? There are 13 different kidney cell types in the dataset, including TAL2, PT1, PT2, DCT, and others.

Data Reshaping

x.long <- x %>%
  pivot_longer(cols = 3:2002, names_to = "gene", values_to = "rna.ct")
head(x.long)
## # A tibble: 6 × 6
##   id                       cell.type cohort sample gene       rna.ct
##   <chr>                    <chr>     <chr>  <chr>  <chr>       <dbl>
## 1 PKD_ACATTTCAGGGAGGCA-1_1 PT1       PKD    PKD1   PTPRQ      -0.270
## 2 PKD_ACATTTCAGGGAGGCA-1_1 PT1       PKD    PKD1   PTPRO      -0.487
## 3 PKD_ACATTTCAGGGAGGCA-1_1 PT1       PKD    PKD1   ST6GALNAC3 -0.420
## 4 PKD_ACATTTCAGGGAGGCA-1_1 PT1       PKD    PKD1   MAGI2       0.575
## 5 PKD_ACATTTCAGGGAGGCA-1_1 PT1       PKD    PKD1   SERPINE1   -0.115
## 6 PKD_ACATTTCAGGGAGGCA-1_1 PT1       PKD    PKD1   SLC8A1     -1.18

Boxplot of TAL2 Distribution

x.long %>%
  filter(cell.type == "TAL2") %>%
  ggplot(aes(x = sample, y = rna.ct)) +
  geom_boxplot() +
  labs(title = "Boxplot of TAL2 Gene Expression", y = "RNA Count")

Figure 1. Boxplot showing gene expression levels across TAL2 cells by sample.
Key finding: TAL2 gene expression is generally consistent, with slight variation between samples and occasional outliers.
X-axis: Sample ID.
Y-axis: RNA count values for all genes. Each box represents the distribution of expression values in a sample.

UMAP Clustering

x.heat <- x %>% select(-c(1, 2, 2003, 2004)) %>% as.data.frame()
rownames(x.heat) <- x$id
x.umap <- umap(x.heat)
df.umap <- data.frame(x.umap$layout) %>%
  mutate(cell.type = x$cell.type, sample = x$sample, cohort = x$cohort)

df.umap %>%
  filter(cell.type == "TAL2") %>%
  ggplot(aes(x = X1, y = X2, color = cohort)) +
  geom_point(alpha = 0.3) +
  labs(title = "UMAP of TAL2 Cells") +
  theme_bw() +
  stat_ellipse()

Figure 2. UMAP clustering of TAL2 cells based on gene expression.
Key finding: TAL2 cells from PKD patients show slightly different gene activity than healthy ones, which is why they form separate groups on the UMAP plot. This supports our idea that PKD affects how these cells function. X-axis and Y-axis: UMAP 1 and UMAP 2, the two main dimensions used to visualize clusters.
Each point: One TAL2 cell.
Color: Indicates cohort (PKD vs Control).

T-test and Volcano Plot

x.cell <- x.long %>% filter(cell.type == "TAL2")
t.out2 <- x.cell %>%
  group_by(gene) %>%
  do(tidy(t.test(data = ., rna.ct ~ cohort)))

EnhancedVolcano(toptable = t.out2, lab = t.out2$gene,
                x = "estimate", y = "p.value",
                pCutoff = 0.01,
                FCcutoff = 0.7,
                title = "Volcano Plot of TAL2 Differential Expression")

Figure 3. Volcano plot of differentially expressed genes in TAL2 cells.
Key finding: Some genes show significant expression differences between PKD and control cohorts.
X-axis: Fold change in gene expression (difference in mean RNA count).
Y-axis: Negative log10 of the p-value, representing statistical significance.
Each point: One gene.
Color: Highlights genes with significant expression changes.

Heatmap of Top 10 Genes

top_10_genes <- t.out2 %>% arrange(p.value) %>% head(10)

y.wide <- x.cell %>% filter(gene %in% top_10_genes$gene) %>% pivot_wider(names_from = gene, values_from = rna.ct)

y.heatmap <- y.wide %>% select(-c(2,3,4)) %>% column_to_rownames("id")

ann_cell <- data.frame(disease.type = factor(y.wide$cohort))

rownames(ann_cell) <- y.wide$id

pheatmap(y.heatmap, show_rownames = F, annotation_row = ann_cell, main = "Heatmap of Top 10 Genes in TAL2")

Figure 4. Heatmap showing top 50 differentially expressed genes in TAL2 cells.
Key finding: Clear separation between PKD and control gene expression signatures within TAL2 cells.
Rows: Individual TAL2 cells.
Columns: Genes.
Color scale: RNA expression levels from low (blue) to high (red).

Boxplot of Top 10 Genes

top10 <- t.out2 %>% arrange(p.value) %>% head(10) %>% pull(gene)

x.long %>%
  filter(gene %in% top10, cell.type == "TAL2") %>%
  ggplot(aes(x = gene, y = rna.ct, fill = cohort)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Top 10 Genes in TAL2 Cells")

Figure 5. Boxplot of top 10 most differentially expressed genes in TAL2 cells.
Key finding: Top genes demonstrate clear separation between PKD and control TAL2 cells, making them strong candidate biomarkers. Several genes seem to be downregulated in PKD samples compared to the controls (UMOD, SLC12A1, IFITM10, GPC5) X-axis: Gene names (rotated using coord_flip).
Y-axis: RNA expression counts.
Color: Indicates whether the values are from PKD or control samples.

Conclusion

We used single-cell RNA sequencing data to analyze TAL2 cells from both PKD patients and healthy controls.Our analysis revealed several genes that were differentially expressed in PKD TAL2 cells, including some that showed clear separation in the volcano plot, heatmap, and top 10 gene boxplots.Certain genes such as UMOD, SLC12A1, GPC5, and GP2 are potential biomarkers due to significant upregulation seen in the volcano plot, as well as some downregulation seen in the boxplot with the PKD samples compared to the control samples.