#Title:BIOL47120 Final Report: Autosomal dominant polycystic kidney disease (ADPKD) ’ #author: “Afsara Bhuyia” #date: “4/22/2024” output: word_document: toc: true toc_depth: ‘2’ fig_caption: true html_document: toc: true toc_depth: ‘2’ df_print: paged latex_engine: xelatex # pdf_document: # toc: true # toc_depth: ‘2’ latex_engine: xelatex —

Introduction

A common hereditary condition known as autosomal dominant polycystic kidney disease (ADPKD) is defined by the kidney cysts’ progressive expansion, which eventually results in end-stage renal disease. There are currently no viable therapeutic options available for ADPKD, despite its clinical relevance. This study clarifies the cellular and molecular landscape of ADPKD advancement by utilizing state-of-the-art single-cell and single-nucleus sequencing tools. The study reveals the activation of profibrotic and proinflammatory pathways and identifies proximal tubular cells with a transcriptome signature of unsuccessful repair. Furthermore, unique proinflammatory fibroblast and collecting duct cell subpopulations are identified, offering important new understandings into the pathophysiology of ADPKD.

Background

Even though it is severe and intense, therapeutic approaches for ADPKD remain extremely limited New insights into the cellular and molecular mechanisms discussing ADPKD progression are necessary for the development of effective treatment. Single-cell and single-nucleus sequencing technologies have emerged as tols to look at cellular heterogeneity and molecular signatures associated with ADPKD. By discussing the epigenetic information of ADPKD kidneys at single-cell resolution, this study looks at the pathogenic processes driving cyst growth and kidney dysfunction in ADPKD.

Biological hypothesis

Increased expression of specific genes in certain fibroblast groups, particularly myofibroblasts and PKD-fibroblasts, suggests they may contribute to inflammation and fibrosis in ADPKD kidneys through TNFα-induced NF-κB signaling and TGFβ pathway activation.

Significance

Single-cell multiomic analysis of late-stage ADPKD holds significant promise in unraveling the cellular heterogeneity underlying disease progression. Identifying the specific cell types and states driving ADPKD pathology could lead to the development of targeted diagnostic and therapeutic strategies, ultimately improving patient outcomes.

Materials and Methods

Samples

The RAND1 dataset comprises snRNA-seq and snATAC-seq data from eight ADPKD kidneys and five controls. Using 10X Genomics technology, snRNA-seq libraries were sequenced with a mean of 408,304,417 reads for controls and 358,474,996 reads for ADPKD. For snATAC-seq, libraries yielded a mean of 334,652,440 reads. Bioinformatics tools like Seurat and Harmony were used for processing, resulting in 62,073 nuclei from ADPKD and 40,637 from controls for snRNA-seq, and 128,008 peak regions among 50,986 nuclei for snATAC-seq.

Experimental procedure

We employed single nucleus RNA sequencing (snRNA-seq) to analyze the samples. This technology allows for the examination of gene expression profiles at the single-cell level. This gives us information into the cellular composition and states within the kidney tissues. The snRNA-seq data were generated using high-throughput sequencing platforms. This allows us to simoltenously profile of thousands of individual nuclei in each sample. ## Computational procedure 1. Tissue Procurement: -Kidney samples obtained from ADPKD patients and controls. -Intact nephrectomized kidneys received and dissected immediately. -Samples collected from large cysts, fixed or frozen. 2. Library Preparation: -Nuclei isolated using Nuclei EZ Lysis buffer. -snRNA-seq: Libraries prepared using 10X Genomics Chromium Single Cell 3’ v3 chemistry. -snATAC-seq: Libraries prepared using 10X Genomics Chromium Single Cell ATAC v1 chemistry. 3. Bioinformatics Workflow: -Cellranger used for read counting. -Seurat v4.0.0 for data processing. -SoupX used for ambient RNA correction. -Integration of datasets with Harmony. Quality control, removal of doublets, and batch effect correction. 4. Analysis: -Clustering, dimensional reduction, and cell-type annotation. -Differential gene expression analysis. -Transcription factor activity estimation with chromVAR. -Prediction of cis-coaccessibility networks with Cicero. -Single-cell gene enrichment analysis with VISION. -Cell-cell communication analysis with CellChat. -Deconvolution of published datasets with CIBERSORTx. -Pathway analysis with PROGENy. -Correlation analysis between datasets. 5. Immunofluorescence Studies: Tissue preparation, antibody staining, and imaging. 5.Cell Culture: -Maintenance of HEK293T and WT9-12 cells. -Treatment with forskolin and retinoic acid. 6. CRISPR Interference: -Design and cloning of sgRNAs. -Lentiviral transduction and selection. 7. Quantitative PCR: -RNA extraction, cDNA synthesis, and qPCR. -Normalization and data analysis. 8.Statistical Analysis: - Student’s t-test or ANOVA for comparisons.

Results

library(readr)
x <- read_tsv("https://borreliabase.org/~wgqiu/muto1.tsv")

## Rows: 2400 Columns: 2002
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr    (2): id, cell.type
## dbl (2000): PTPRQ, PTPRO, ST6GALNAC3, MAGI2, SERPINE1, SLC8A1, SLC26A7, PLA2...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

getwd()

## [1] "/Users/afsara/Downloads"

chooseCRANmirror(ind=1) 
library(tidyverse) # downloaded all required packages and loaded data

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(pheatmap)
library(umap)
library(broom)
(require("BiocManager", quietly = TRUE))

## Bioconductor version '3.18' is out-of-date; the current release version '3.19'
##   is available with R version '4.4'; see https://bioconductor.org/install

## [1] TRUE

install.packages("BiocManager")

## 
## The downloaded binary packages are in
##  /var/folders/f6/2lf75m8n1z700np5sv7y9vmw0000gn/T//Rtmp8tCwG6/downloaded_packages

BiocManager::install("EnhancedVolcano")

## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
##     CRAN: https://cloud.r-project.org
## Bioconductor version 3.18 (BiocManager 1.30.23), R 4.3.2 (2023-10-31)

## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'EnhancedVolcano'

## Old packages: 'broom', 'cachem', 'farver', 'fastmap', 'KernSmooth', 'knitr',
##   'openssl', 'ragg', 'rmarkdown', 'stringi', 'systemfonts', 'tinytex', 'xfun'

library(EnhancedVolcano)  # Corrected package name

## Loading required package: ggrepel

library(ggrepel)
head(x)

## # A tibble: 6 × 2,002
##   id        cell.type   PTPRQ   PTPRO ST6GALNAC3  MAGI2 SERPINE1  SLC8A1 SLC26A7
##   <chr>     <chr>       <dbl>   <dbl>      <dbl>  <dbl>    <dbl>   <dbl>   <dbl>
## 1 PKD_ACAG… PT2       -0.258  -0.338      -0.359 -1.49     1.85  -0.968   -0.405
## 2 PKD_ACCA… DCT       -0.385  -0.523      -0.422  1.63    -0.421  0.500   -0.251
## 3 PKD_ACGC… DCT       -0.210  -0.355      -0.536  0.490   -0.338 -0.828   -0.297
## 4 PKD_ACGC… z12       -0.798  -0.966       2.10   0.827   -0.199  0.0483  -0.308
## 5 PKD_ACGG… PT1        1.56   -0.211      -0.400 -0.479    0.431  1.81    -0.372
## 6 PKD_ACGG… TAL2      -0.0403  0.0271     -0.365 -1.32    -0.330 -0.256   -0.203
## # ℹ 1,993 more variables: PLA2R1 <dbl>, SLC12A3 <dbl>, AC109466.1 <dbl>,
## #   EMCN <dbl>, NTNG1 <dbl>, TIMP3 <dbl>, ZPLD1 <dbl>, LINC01811 <dbl>,
## #   CELF2 <dbl>, PLAT <dbl>, ROBO2 <dbl>, FGF1 <dbl>, FMN2 <dbl>,
## #   SLC14A2 <dbl>, PAPPA2 <dbl>, PLPP1 <dbl>, ADAMTS19 <dbl>, CLNK <dbl>,
## #   LDB2 <dbl>, AC093912.1 <dbl>, ADGRL3 <dbl>, NKAIN2 <dbl>, CLIC5 <dbl>,
## #   NPHS2 <dbl>, ADGRF5 <dbl>, FLT1 <dbl>, ADAMTS6 <dbl>, RGS6 <dbl>,
## #   DCC <dbl>, SNTG1 <dbl>, CNTNAP5 <dbl>, ENPP2 <dbl>, NPHS1 <dbl>, …

The rows represent individual cells. The columns consist of various attributes, including cell ID, cell type, and gene expression levels for different genes. The numbers in the table represent gene expression levels, typically measured as log-transformed values. There are a total of 6 cells in the dataset. There are gene expression data for 2,002 genes in the dataset.

# Mutate the dataframe to create 'cohort' and 'sample' columns by removing specific patterns from the 'id' column
x <- x %>% 
  mutate(cohort = str_remove(id, "_.+")) %>% 
  mutate(sample = str_remove(id, "_.+_") )

table(x$cohort)

## 
## Cont  PKD 
## 1000 1400

table(x$sample)  # displaying the tables

## 
## Cont1 Cont2 Cont3 Cont4 Cont5  PKD1  PKD3  PKD4  PKD5  PKD6  PKD7  PKD8 
##   200   200   200   200   200   200   200   200   200   200   200   200

table(x$cell.type)

## 
##  CNT_PC     DCT    ENDO     FIB ICA-ICB    LEUK     PEC    PODO     PT1     PT2 
##     352     204     104     188     122     123      45      45     343     189 
##    TAL1    TAL2     z12 
##     172     483      30

There are a total of 10 samples: 5 control samples (Cont1, Cont2, Cont3, Cont4, Cont5) and 5 ADPKD samples (PKD1, PKD3, PKD4, PKD5, PKD6, PKD7, PKD8). Each sample contains 200 cells. There are 13 different cell types identified in the dataset: CNT_PC, DCT, ENDO, FIB, ICA-ICB, LEUK, PEC, PODO, PT1, PT2, TAL1, TAL2, and z12.

x.long <- x %>% 
  pivot_longer(3:2002, names_to = "gene", values_to = "rna.ct") # these are 2000 gene columns

head(x.long)

## # A tibble: 6 × 6
##   id                       cell.type cohort sample gene       rna.ct
##   <chr>                    <chr>     <chr>  <chr>  <chr>       <dbl>
## 1 PKD_ACAGCCGAGATAACGT-1_1 PT2       PKD    PKD1   PTPRQ      -0.258
## 2 PKD_ACAGCCGAGATAACGT-1_1 PT2       PKD    PKD1   PTPRO      -0.338
## 3 PKD_ACAGCCGAGATAACGT-1_1 PT2       PKD    PKD1   ST6GALNAC3 -0.359
## 4 PKD_ACAGCCGAGATAACGT-1_1 PT2       PKD    PKD1   MAGI2      -1.49 
## 5 PKD_ACAGCCGAGATAACGT-1_1 PT2       PKD    PKD1   SERPINE1    1.85 
## 6 PKD_ACAGCCGAGATAACGT-1_1 PT2       PKD    PKD1   SLC8A1     -0.968

x.long %>% 
  ggplot(aes(x = sample, y = rna.ct)) +
  geom_violin()

A violin plot to visualize the distribution of RNA counts across different samples. The X-axis represents the different samples in the dataset. Each sample is a identifier that is equivelent to a specific sample.The Y-axis represents the RNA counts. These counts show the level of gene expression in each sample. The data points are shown through the violin plots, which show the distribution of RNA counts for each sample. A wider section of the violin plot shows a higher density of data points at that RNA count.

x.long %>%
  sample_n(1000) %>% # take 1000 random rows; otherwise too much
  ggplot(aes(sample = rna.ct)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~sample)

gq Plot to assess the normalcy of RNA counts within each sample, comparing the observed quantiles against the expected quantiles from a normal distribution.he X-axis represents the expected quantiles from a standard normal distribution. These quantiles serve as a reference to compare the distribution of the sample data. The Y-axis represents the quantiles of the RNA counts from the sample data. These quantiles are calculated from the actual RNA count data. the data points compares the actual RNA count to the expected value if the data were normally distributed.

# Select gene columns and set row names
x.heat <- x %>% 
  select(-c(1, 2, 2003, 2004)) %>%  
  as.data.frame()
rownames(x.heat) <- x$id

# Create a data frame containing cell types
ann_cell <- data.frame(cell.type = factor(rep("FIB", nrow(x.heat))))
rownames(ann_cell) <- x$id

# Sample 1/20 of columns and 1/50 of rows (to reduce computational load)
pick.columns <- sample(colnames(x.heat), 100)
x.pick <- x.heat %>% 
  select(all_of(pick.columns)) %>% 
  sample_n(200)

# Generate heatmap
pheatmap(x.pick, show_rownames = FALSE, annotation_row = ann_cell)

This heatmap illustrates gene expression profiles in FIB cells, it lets us the variation of gene expression across different genes and samples. Columns: Each column is for a gene expression profile for a subset of genes that are randomly sampled from the original dataset. Rows: Each row represents a cell from the dataset. Color Gradient: The color gradient represents the expression levels of the genes. red indicates higher expression levels, while blue shows lower ones. Color Bar for the Rows: The color bar indicates the range of expression levels corresponding to the colors in the gradient. Cluster Diagrams: The cluster diagrams represent hierarchical clustering of the rows and/or columns based on their expression patterns.

x.pca <- prcomp(x.heat) 
# show variance explained by each principal component
plot(x.pca)

This plot visualizes the principal component analysis, showing the variance explained by each principal component.The X-axis represents the principal components, ordered from the first to the last. Each principal component is a linear combination of the original variables that captures the maximum variance in the data.he Y-axis represents the proportion of total variance explained by each principal component. It shows how much of the variability in the data is captured by each component.Each point on the plot indicates the proportion of variance explained by a corresponding principal component.

# Extract PCA coordinates and create a dataframe
pca.coord <- as.data.frame(x.pca[[5]]) 
# Extract PC1 and PC2, add a cell type column
df.pca <- tibble(pca.coord[1:2]) %>% 
  mutate(cell.type = "FIB") 

# Plot PCA with FIB cell type
df.pca %>% 
  ggplot(aes(x = PC1, y = PC2, color = cell.type)) + 
  geom_point(size = 0.1, alpha = 0.5) + 
  # increase point size & make transparent
  theme_bw() + 
  theme(legend.position = "bottom") + 
  stat_ellipse()

PC1 and PC2 are the first and second principal components. These components show the maximum amount of variance in the data. PC1 explains the most variance followed by PC2, and so on. Each principal component is a linear combination of the original variables, and they are orthogonal to each other. Each point represents an observation in the dataset. The position of each point on the plot corresponds to its scores on PC1 and PC2. Points that are close to each other have similar patterns of difference in the original variables. The colors represent different categories or groups of observations.The color represents FIB The distances between points on the plot represent the similarities between observations in terms of their patterns of variation. Points that are closer together have similar patterns of variation, while points that are further apart have dissimilar patterns.

# Generate UMAP embeddings for x.heat
x.umap <- umap(x.heat)

# Create dataframe with UMAP embeddings, cell type, sample, and cohort information
df.umap <- data.frame(x.umap$layout) %>% 
  mutate(cell.type = x$cell.type) %>% 
  mutate(sample = x$sample) %>% 
  mutate(cohort = x$cohort)

# Filter the dataframe to include only the rows with cell.type "FIB"
df_fib <- df.umap %>% filter(cell.type == "FIB")

# Plot UMAP for FIB Cell Type
ggplot(df_fib, aes(x = X1, y = X2)) +
  geom_point(alpha = 0.2, color = "blue") +
  theme_bw() +
  ggtitle("UMAP Plot for FIB Cell Type")

UMAP1 and UMAP2 represent the two axes that capture the dimensional representation of the original high-dimensional data. Each point on the plot represents an individual data point or sample from the original dataset. The colors represent different categories or groups in the data. In this case, since we are plotting the UMAP for the FIB cell type, the colors represent different subtypes or conditions within the FIB cell type. The distances between points in the plot represent thedistances or similarities between the corresponding data points. Points that are close together in the UMAP plot are more similar to each other in the original data space, while points that are far apart are more dissimilar.

 target.cells <- x %>% 
  filter(cell.type == 'FIB')  # Filter for FIB cell type
target.cells <- target.cells %>% 
  sample_n(100)

non.target.cells <- x %>% 
  filter(cell.type != 'FIB')  # Filter for non-FIB cell types
non.target.cells <- non.target.cells %>% 
  sample_n(100) %>% 
  mutate(cell.type = 'non-target')

y <- bind_rows(target.cells, non.target.cells) # Combine target and non-target cells into a single dataframe
y.long <- y %>% 
  pivot_longer(3:2002, names_to = "gene", values_to = "rna.ct")
# Reshape the dataframe from wide to long format
t.out <- y.long %>% 
  group_by(gene) %>% 
  do(tidy(t.test(data = ., rna.ct ~ cell.type))) # Perform t-test comparing gene expression between cell types

EnhancedVolcano(toptable = t.out, lab = t.out$gene, x = "estimate", y = "p.value",
#                ylim = c(0, 4),
#                pCutoff = 0.05,
                title = "Tissue-specific genes",
                subtitle = "cell type: FIB")  # Change subtitle to FIB

This Enhanced Volcano plot shows tissue-specific genes, with significance indicated by color. The x-axis represents the fold change in gene expression between FIB cells and non-target cells. The y-axis shows the significance level (p-value) of the observed fold changes. LHFPL3 seems to be significantly downregulated and CALD1 seems to be signifcantly upregulated.Blue represents significant p-value, the green is significant fold change value, the red is significant p value and fold change value. Horizontal Dashed Line:This line represents the p-value cutoff. Genes with p-values below this line (i.e., lower p-values) are considered statistically significant. Vertical Dashed Lines: These lines represent thresholds for the effect size, often the log2 fold change or another measure of the difference between groups.

 top5 <- t.out %>% 
  arrange(p.value) %>% # Identify top 5 genes
  head(5) %>% 
  pull(gene)

x.long %>% 
  filter(gene %in% top5) %>%   #created a violin plot
  ggplot(aes(x = cell.type, y = rna.ct)) +
  geom_violin() + 
  geom_jitter(shape = 1, alpha = 0.1) +
  coord_flip() +
  facet_wrap(~gene)

The graph shows the expression distribution of the top five genes with the lowest p-values among FIB cells. This shows violin plots with jittered points to visualize individual data points across different genes. The x-axis displays the top 5 genes with the lowest p-values from the t-test, while the y-axis represents their expression levels. Each point in the plot shows gene expression in FIB cells. Different panels represent different genes. The plot reflects t-test results and helps identify the most significant biomarker for FIB cells based on gene expression. The most significant biomarker is SLIT3.

x.cell <- x.long %>% 
  filter(cell.type == 'FIB') # Filter for FIB cell type

t.out2 <- x.cell %>% # Perform t-tests on RNA counts by cohort within each gene for FIB cell type
  group_by(gene) %>% 
  do(tidy(t.test(data = ., rna.ct ~ cohort)))
  
  EnhancedVolcano(toptable = t.out2, lab = t.out2$gene, x = "estimate", y = "p.value",
                pCutoff = 0.01,
                title = "PKD associated genes",
                subtitle = "cell type: FIB")

# Create an Enhanced Volcano plot

The volcano plot visualizes the significance (-log10 p-value) against the magnitude of change (log2 fold change) for genes associated with PKD in FIB cells, highlighting those with a p-value below 0.01. The X-axis shows the fold change, indicating gene expression differences. The Y-axis represents the statistical significance. Colors indicate significance levels. PIP5K1B is a gene that is significantly downregulated and FBLN is a gene that is significantly upregulated.Blue represents significant p-value, the green is significant fold change value, the red is significant p value and fold change value. Horizontal Dashed Line:This line represents the p-value cutoff. Genes with p-values below this line (i.e., lower p-values) are considered statistically significant. Vertical Dashed Lines: These lines represent thresholds for the effect size, often the log2 fold change or another measure of the difference between groups.

 top5 <- t.out2 %>% 
  arrange(p.value) %>% 
  head(5) %>%  # Identify top 5 genes with the smallest p-values
  pull(gene)

x.long %>% 
  filter(gene %in% top5, cell.type == "FIB") %>% 
  ggplot(aes(x = cell.type, y = rna.ct, color = cohort)) +
  geom_boxplot(outlier.shape = NA) + 
  geom_jitter(shape = 1, alpha = 0.1) + #create box plot filtered for FIB cells
  coord_flip() +
  facet_wrap(~gene)

The X-axis represents the cell type, which in this case is ‘FIB’ (fibroblast). Since we are filtering for ‘FIB’ only, it will be the sole category on the X-axis.The Y-axis represents the RNA counts, indicating the level of gene expression for each gene in the FIB cell type. Different colors represent different cohorts. This helps in distinguishing the distribution of RNA counts across various cohorts within each gene.

Conclusions

Statistical conclusions

Null Hypothesis: The null hypothesis states that there is no significant difference in gene expression between FIB cells and non-target cells. However, based on the t-test results and volcano plot we are able to reject the null hypothesis as there is a significant difference between FIB cells and non traget cells in terms of the PKD disease. Fold Change: The fold change represents the magnitude of change in gene expression between FIB cells and non-target cells. Genes with significant fold changes show differences in expression levels between the two cell types. For example, FBLN1 has significant upregulation and LHFPL3 shows downregulation in FIB cells. p-values: The p-values indicate the significance level of observed fold changes. Lower p-values suggest stronger evidence against the null hypothesis, indicating significant differences in gene expression between FIB cells and non-target cells. In this study, genes with lower p-values, such as CALD1, show upregulation in FIB cells and genes like LHFPL3 show downregulation. The signifcant biomarker is FBLN1.

Biological conclusions

The involvement of single-cell and single-nucleus sequencing data in autosomal dominant polycystic kidney disease (ADPKD) patients has shown detailed cellular and molecular mechanisms that are able to encourage disease progression. Key findings include the activation of TNFα-induced NF-κB signaling in ADPKD-specific fibroblast subtypes. This leads to a proinflammatory microenvironment, and signatures of repair failure in proximal tubular cells. Additionally, identification of unique cell subpopulations like proinflammatory fibroblasts is able to show new insights into disease diagnosis with potential implications for targeted therapies. These discoveries show us how important is to understand cellular heterogeneity and dysregulated pathways in ADPKD. This improves the way for personalized treatment strategies to improve patient outcomes.

Future work

This analysis explored a limited set of statistical method. However, considering additional machine learning algorithms. For example, support vector machines or gradient boosting can give an overall improvement to predicative perfomance. Additionally, refining the feature selection process may reduce overfitting and improve how the models react to unseen data.

References

Muto et al (2022). Full title. Journal. pages
Any other references you have read and may want to include

Data file

NCBI GEO database GSE185948