#Title:BIOL47120 Final Report: Autosomal dominant polycystic kidney disease (ADPKD) ’ #author: “Afsara Bhuyia” #date: “4/22/2024” output: word_document: toc: true toc_depth: ‘2’ fig_caption: true html_document: toc: true toc_depth: ‘2’ df_print: paged latex_engine: xelatex # pdf_document: # toc: true # toc_depth: ‘2’ latex_engine: xelatex —
A common hereditary condition known as autosomal dominant polycystic kidney disease (ADPKD) is defined by the kidney cysts’ progressive expansion, which eventually results in end-stage renal disease. There are currently no viable therapeutic options available for ADPKD, despite its clinical relevance. This study clarifies the cellular and molecular landscape of ADPKD advancement by utilizing state-of-the-art single-cell and single-nucleus sequencing tools. The study reveals the activation of profibrotic and proinflammatory pathways and identifies proximal tubular cells with a transcriptome signature of unsuccessful repair. Furthermore, unique proinflammatory fibroblast and collecting duct cell subpopulations are identified, offering important new understandings into the pathophysiology of ADPKD.
Even though it is severe and intense, therapeutic approaches for ADPKD remain extremely limited New insights into the cellular and molecular mechanisms discussing ADPKD progression are necessary for the development of effective treatment. Single-cell and single-nucleus sequencing technologies have emerged as tols to look at cellular heterogeneity and molecular signatures associated with ADPKD. By discussing the epigenetic information of ADPKD kidneys at single-cell resolution, this study looks at the pathogenic processes driving cyst growth and kidney dysfunction in ADPKD.
Increased expression of specific genes in certain fibroblast groups, particularly myofibroblasts and PKD-fibroblasts, suggests they may contribute to inflammation and fibrosis in ADPKD kidneys through TNFα-induced NF-κB signaling and TGFβ pathway activation.
Single-cell multiomic analysis of late-stage ADPKD holds significant promise in unraveling the cellular heterogeneity underlying disease progression. Identifying the specific cell types and states driving ADPKD pathology could lead to the development of targeted diagnostic and therapeutic strategies, ultimately improving patient outcomes.
The RAND1 dataset comprises snRNA-seq and snATAC-seq data from eight ADPKD kidneys and five controls. Using 10X Genomics technology, snRNA-seq libraries were sequenced with a mean of 408,304,417 reads for controls and 358,474,996 reads for ADPKD. For snATAC-seq, libraries yielded a mean of 334,652,440 reads. Bioinformatics tools like Seurat and Harmony were used for processing, resulting in 62,073 nuclei from ADPKD and 40,637 from controls for snRNA-seq, and 128,008 peak regions among 50,986 nuclei for snATAC-seq.
We employed single nucleus RNA sequencing (snRNA-seq) to analyze the samples. This technology allows for the examination of gene expression profiles at the single-cell level. This gives us information into the cellular composition and states within the kidney tissues. The snRNA-seq data were generated using high-throughput sequencing platforms. This allows us to simoltenously profile of thousands of individual nuclei in each sample. ## Computational procedure 1. Tissue Procurement: -Kidney samples obtained from ADPKD patients and controls. -Intact nephrectomized kidneys received and dissected immediately. -Samples collected from large cysts, fixed or frozen. 2. Library Preparation: -Nuclei isolated using Nuclei EZ Lysis buffer. -snRNA-seq: Libraries prepared using 10X Genomics Chromium Single Cell 3’ v3 chemistry. -snATAC-seq: Libraries prepared using 10X Genomics Chromium Single Cell ATAC v1 chemistry. 3. Bioinformatics Workflow: -Cellranger used for read counting. -Seurat v4.0.0 for data processing. -SoupX used for ambient RNA correction. -Integration of datasets with Harmony. Quality control, removal of doublets, and batch effect correction. 4. Analysis: -Clustering, dimensional reduction, and cell-type annotation. -Differential gene expression analysis. -Transcription factor activity estimation with chromVAR. -Prediction of cis-coaccessibility networks with Cicero. -Single-cell gene enrichment analysis with VISION. -Cell-cell communication analysis with CellChat. -Deconvolution of published datasets with CIBERSORTx. -Pathway analysis with PROGENy. -Correlation analysis between datasets. 5. Immunofluorescence Studies: Tissue preparation, antibody staining, and imaging. 5.Cell Culture: -Maintenance of HEK293T and WT9-12 cells. -Treatment with forskolin and retinoic acid. 6. CRISPR Interference: -Design and cloning of sgRNAs. -Lentiviral transduction and selection. 7. Quantitative PCR: -RNA extraction, cDNA synthesis, and qPCR. -Normalization and data analysis. 8.Statistical Analysis: - Student’s t-test or ANOVA for comparisons.
library(readr)
x <- read_tsv("https://borreliabase.org/~wgqiu/muto1.tsv")
## Rows: 2400 Columns: 2002
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): id, cell.type
## dbl (2000): PTPRQ, PTPRO, ST6GALNAC3, MAGI2, SERPINE1, SLC8A1, SLC26A7, PLA2...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
getwd()
## [1] "/Users/afsara/Downloads"
chooseCRANmirror(ind=1)
library(tidyverse) # downloaded all required packages and loaded data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pheatmap)
library(umap)
library(broom)
(require("BiocManager", quietly = TRUE))
## Bioconductor version '3.18' is out-of-date; the current release version '3.19'
## is available with R version '4.4'; see https://bioconductor.org/install
## [1] TRUE
install.packages("BiocManager")
##
## The downloaded binary packages are in
## /var/folders/f6/2lf75m8n1z700np5sv7y9vmw0000gn/T//Rtmp8tCwG6/downloaded_packages
BiocManager::install("EnhancedVolcano")
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
## CRAN: https://cloud.r-project.org
## Bioconductor version 3.18 (BiocManager 1.30.23), R 4.3.2 (2023-10-31)
## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'EnhancedVolcano'
## Old packages: 'broom', 'cachem', 'farver', 'fastmap', 'KernSmooth', 'knitr',
## 'openssl', 'ragg', 'rmarkdown', 'stringi', 'systemfonts', 'tinytex', 'xfun'
library(EnhancedVolcano) # Corrected package name
## Loading required package: ggrepel
library(ggrepel)
head(x)
## # A tibble: 6 × 2,002
## id cell.type PTPRQ PTPRO ST6GALNAC3 MAGI2 SERPINE1 SLC8A1 SLC26A7
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 PKD_ACAG… PT2 -0.258 -0.338 -0.359 -1.49 1.85 -0.968 -0.405
## 2 PKD_ACCA… DCT -0.385 -0.523 -0.422 1.63 -0.421 0.500 -0.251
## 3 PKD_ACGC… DCT -0.210 -0.355 -0.536 0.490 -0.338 -0.828 -0.297
## 4 PKD_ACGC… z12 -0.798 -0.966 2.10 0.827 -0.199 0.0483 -0.308
## 5 PKD_ACGG… PT1 1.56 -0.211 -0.400 -0.479 0.431 1.81 -0.372
## 6 PKD_ACGG… TAL2 -0.0403 0.0271 -0.365 -1.32 -0.330 -0.256 -0.203
## # ℹ 1,993 more variables: PLA2R1 <dbl>, SLC12A3 <dbl>, AC109466.1 <dbl>,
## # EMCN <dbl>, NTNG1 <dbl>, TIMP3 <dbl>, ZPLD1 <dbl>, LINC01811 <dbl>,
## # CELF2 <dbl>, PLAT <dbl>, ROBO2 <dbl>, FGF1 <dbl>, FMN2 <dbl>,
## # SLC14A2 <dbl>, PAPPA2 <dbl>, PLPP1 <dbl>, ADAMTS19 <dbl>, CLNK <dbl>,
## # LDB2 <dbl>, AC093912.1 <dbl>, ADGRL3 <dbl>, NKAIN2 <dbl>, CLIC5 <dbl>,
## # NPHS2 <dbl>, ADGRF5 <dbl>, FLT1 <dbl>, ADAMTS6 <dbl>, RGS6 <dbl>,
## # DCC <dbl>, SNTG1 <dbl>, CNTNAP5 <dbl>, ENPP2 <dbl>, NPHS1 <dbl>, …
The rows represent individual cells. The columns consist of various attributes, including cell ID, cell type, and gene expression levels for different genes. The numbers in the table represent gene expression levels, typically measured as log-transformed values. There are a total of 6 cells in the dataset. There are gene expression data for 2,002 genes in the dataset.
# Mutate the dataframe to create 'cohort' and 'sample' columns by removing specific patterns from the 'id' column
x <- x %>%
mutate(cohort = str_remove(id, "_.+")) %>%
mutate(sample = str_remove(id, "_.+_") )
table(x$cohort)
##
## Cont PKD
## 1000 1400
table(x$sample) # displaying the tables
##
## Cont1 Cont2 Cont3 Cont4 Cont5 PKD1 PKD3 PKD4 PKD5 PKD6 PKD7 PKD8
## 200 200 200 200 200 200 200 200 200 200 200 200
table(x$cell.type)
##
## CNT_PC DCT ENDO FIB ICA-ICB LEUK PEC PODO PT1 PT2
## 352 204 104 188 122 123 45 45 343 189
## TAL1 TAL2 z12
## 172 483 30
There are a total of 10 samples: 5 control samples (Cont1, Cont2, Cont3, Cont4, Cont5) and 5 ADPKD samples (PKD1, PKD3, PKD4, PKD5, PKD6, PKD7, PKD8). Each sample contains 200 cells. There are 13 different cell types identified in the dataset: CNT_PC, DCT, ENDO, FIB, ICA-ICB, LEUK, PEC, PODO, PT1, PT2, TAL1, TAL2, and z12.
x.long <- x %>%
pivot_longer(3:2002, names_to = "gene", values_to = "rna.ct") # these are 2000 gene columns
head(x.long)
## # A tibble: 6 × 6
## id cell.type cohort sample gene rna.ct
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 PKD_ACAGCCGAGATAACGT-1_1 PT2 PKD PKD1 PTPRQ -0.258
## 2 PKD_ACAGCCGAGATAACGT-1_1 PT2 PKD PKD1 PTPRO -0.338
## 3 PKD_ACAGCCGAGATAACGT-1_1 PT2 PKD PKD1 ST6GALNAC3 -0.359
## 4 PKD_ACAGCCGAGATAACGT-1_1 PT2 PKD PKD1 MAGI2 -1.49
## 5 PKD_ACAGCCGAGATAACGT-1_1 PT2 PKD PKD1 SERPINE1 1.85
## 6 PKD_ACAGCCGAGATAACGT-1_1 PT2 PKD PKD1 SLC8A1 -0.968
x.long %>%
ggplot(aes(x = sample, y = rna.ct)) +
geom_violin()
A violin plot to visualize the distribution of RNA counts across
different samples. The X-axis represents the different samples in the
dataset. Each sample is a identifier that is equivelent to a specific
sample.The Y-axis represents the RNA counts. These counts show the level
of gene expression in each sample. The data points are shown through the
violin plots, which show the distribution of RNA counts for each sample.
A wider section of the violin plot shows a higher density of data points
at that RNA count.
x.long %>%
sample_n(1000) %>% # take 1000 random rows; otherwise too much
ggplot(aes(sample = rna.ct)) +
stat_qq() +
stat_qq_line() +
facet_wrap(~sample)
gq Plot to assess the normalcy of RNA counts within each sample,
comparing the observed quantiles against the expected quantiles from a
normal distribution.he X-axis represents the expected quantiles from a
standard normal distribution. These quantiles serve as a reference to
compare the distribution of the sample data. The Y-axis represents the
quantiles of the RNA counts from the sample data. These quantiles are
calculated from the actual RNA count data. the data points compares the
actual RNA count to the expected value if the data were normally
distributed.
# Select gene columns and set row names
x.heat <- x %>%
select(-c(1, 2, 2003, 2004)) %>%
as.data.frame()
rownames(x.heat) <- x$id
# Create a data frame containing cell types
ann_cell <- data.frame(cell.type = factor(rep("FIB", nrow(x.heat))))
rownames(ann_cell) <- x$id
# Sample 1/20 of columns and 1/50 of rows (to reduce computational load)
pick.columns <- sample(colnames(x.heat), 100)
x.pick <- x.heat %>%
select(all_of(pick.columns)) %>%
sample_n(200)
# Generate heatmap
pheatmap(x.pick, show_rownames = FALSE, annotation_row = ann_cell)
This heatmap illustrates gene expression profiles in FIB cells, it lets
us the variation of gene expression across different genes and samples.
Columns: Each column is for a gene expression profile for a subset of
genes that are randomly sampled from the original dataset. Rows: Each
row represents a cell from the dataset. Color Gradient: The color
gradient represents the expression levels of the genes. red indicates
higher expression levels, while blue shows lower ones. Color Bar for the
Rows: The color bar indicates the range of expression levels
corresponding to the colors in the gradient. Cluster Diagrams: The
cluster diagrams represent hierarchical clustering of the rows and/or
columns based on their expression patterns.
x.pca <- prcomp(x.heat)
# show variance explained by each principal component
plot(x.pca)
This plot visualizes the principal component analysis, showing the
variance explained by each principal component.The X-axis represents the
principal components, ordered from the first to the last. Each principal
component is a linear combination of the original variables that
captures the maximum variance in the data.he Y-axis represents the
proportion of total variance explained by each principal component. It
shows how much of the variability in the data is captured by each
component.Each point on the plot indicates the proportion of variance
explained by a corresponding principal component.
# Extract PCA coordinates and create a dataframe
pca.coord <- as.data.frame(x.pca[[5]])
# Extract PC1 and PC2, add a cell type column
df.pca <- tibble(pca.coord[1:2]) %>%
mutate(cell.type = "FIB")
# Plot PCA with FIB cell type
df.pca %>%
ggplot(aes(x = PC1, y = PC2, color = cell.type)) +
geom_point(size = 0.1, alpha = 0.5) +
# increase point size & make transparent
theme_bw() +
theme(legend.position = "bottom") +
stat_ellipse()
PC1 and PC2 are the first and second principal components. These
components show the maximum amount of variance in the data. PC1 explains
the most variance followed by PC2, and so on. Each principal component
is a linear combination of the original variables, and they are
orthogonal to each other. Each point represents an observation in the
dataset. The position of each point on the plot corresponds to its
scores on PC1 and PC2. Points that are close to each other have similar
patterns of difference in the original variables. The colors represent
different categories or groups of observations.The color represents FIB
The distances between points on the plot represent the similarities
between observations in terms of their patterns of variation. Points
that are closer together have similar patterns of variation, while
points that are further apart have dissimilar patterns.
# Generate UMAP embeddings for x.heat
x.umap <- umap(x.heat)
# Create dataframe with UMAP embeddings, cell type, sample, and cohort information
df.umap <- data.frame(x.umap$layout) %>%
mutate(cell.type = x$cell.type) %>%
mutate(sample = x$sample) %>%
mutate(cohort = x$cohort)
# Filter the dataframe to include only the rows with cell.type "FIB"
df_fib <- df.umap %>% filter(cell.type == "FIB")
# Plot UMAP for FIB Cell Type
ggplot(df_fib, aes(x = X1, y = X2)) +
geom_point(alpha = 0.2, color = "blue") +
theme_bw() +
ggtitle("UMAP Plot for FIB Cell Type")
UMAP1 and UMAP2 represent the two axes that capture the dimensional
representation of the original high-dimensional data. Each point on the
plot represents an individual data point or sample from the original
dataset. The colors represent different categories or groups in the
data. In this case, since we are plotting the UMAP for the FIB cell
type, the colors represent different subtypes or conditions within the
FIB cell type. The distances between points in the plot represent
thedistances or similarities between the corresponding data points.
Points that are close together in the UMAP plot are more similar to each
other in the original data space, while points that are far apart are
more dissimilar.
target.cells <- x %>%
filter(cell.type == 'FIB') # Filter for FIB cell type
target.cells <- target.cells %>%
sample_n(100)
non.target.cells <- x %>%
filter(cell.type != 'FIB') # Filter for non-FIB cell types
non.target.cells <- non.target.cells %>%
sample_n(100) %>%
mutate(cell.type = 'non-target')
y <- bind_rows(target.cells, non.target.cells) # Combine target and non-target cells into a single dataframe
y.long <- y %>%
pivot_longer(3:2002, names_to = "gene", values_to = "rna.ct")
# Reshape the dataframe from wide to long format
t.out <- y.long %>%
group_by(gene) %>%
do(tidy(t.test(data = ., rna.ct ~ cell.type))) # Perform t-test comparing gene expression between cell types
EnhancedVolcano(toptable = t.out, lab = t.out$gene, x = "estimate", y = "p.value",
# ylim = c(0, 4),
# pCutoff = 0.05,
title = "Tissue-specific genes",
subtitle = "cell type: FIB") # Change subtitle to FIB
This Enhanced Volcano plot shows tissue-specific genes, with
significance indicated by color. The x-axis represents the fold change
in gene expression between FIB cells and non-target cells. The y-axis
shows the significance level (p-value) of the observed fold changes.
LHFPL3 seems to be significantly downregulated and CALD1 seems to be
signifcantly upregulated.Blue represents significant p-value, the green
is significant fold change value, the red is significant p value and
fold change value. Horizontal Dashed Line:This line represents the
p-value cutoff. Genes with p-values below this line (i.e., lower
p-values) are considered statistically significant. Vertical Dashed
Lines: These lines represent thresholds for the effect size, often the
log2 fold change or another measure of the difference between
groups.
top5 <- t.out %>%
arrange(p.value) %>% # Identify top 5 genes
head(5) %>%
pull(gene)
x.long %>%
filter(gene %in% top5) %>% #created a violin plot
ggplot(aes(x = cell.type, y = rna.ct)) +
geom_violin() +
geom_jitter(shape = 1, alpha = 0.1) +
coord_flip() +
facet_wrap(~gene)
The graph shows the expression distribution of the top five genes with
the lowest p-values among FIB cells. This shows violin plots with
jittered points to visualize individual data points across different
genes. The x-axis displays the top 5 genes with the lowest p-values from
the t-test, while the y-axis represents their expression levels. Each
point in the plot shows gene expression in FIB cells. Different panels
represent different genes. The plot reflects t-test results and helps
identify the most significant biomarker for FIB cells based on gene
expression. The most significant biomarker is SLIT3.
x.cell <- x.long %>%
filter(cell.type == 'FIB') # Filter for FIB cell type
t.out2 <- x.cell %>% # Perform t-tests on RNA counts by cohort within each gene for FIB cell type
group_by(gene) %>%
do(tidy(t.test(data = ., rna.ct ~ cohort)))
EnhancedVolcano(toptable = t.out2, lab = t.out2$gene, x = "estimate", y = "p.value",
pCutoff = 0.01,
title = "PKD associated genes",
subtitle = "cell type: FIB")
# Create an Enhanced Volcano plot
The volcano plot visualizes the significance (-log10 p-value) against the magnitude of change (log2 fold change) for genes associated with PKD in FIB cells, highlighting those with a p-value below 0.01. The X-axis shows the fold change, indicating gene expression differences. The Y-axis represents the statistical significance. Colors indicate significance levels. PIP5K1B is a gene that is significantly downregulated and FBLN is a gene that is significantly upregulated.Blue represents significant p-value, the green is significant fold change value, the red is significant p value and fold change value. Horizontal Dashed Line:This line represents the p-value cutoff. Genes with p-values below this line (i.e., lower p-values) are considered statistically significant. Vertical Dashed Lines: These lines represent thresholds for the effect size, often the log2 fold change or another measure of the difference between groups.
top5 <- t.out2 %>%
arrange(p.value) %>%
head(5) %>% # Identify top 5 genes with the smallest p-values
pull(gene)
x.long %>%
filter(gene %in% top5, cell.type == "FIB") %>%
ggplot(aes(x = cell.type, y = rna.ct, color = cohort)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(shape = 1, alpha = 0.1) + #create box plot filtered for FIB cells
coord_flip() +
facet_wrap(~gene)
The X-axis represents the cell type, which in this case is ‘FIB’
(fibroblast). Since we are filtering for ‘FIB’ only, it will be the sole
category on the X-axis.The Y-axis represents the RNA counts, indicating
the level of gene expression for each gene in the FIB cell type.
Different colors represent different cohorts. This helps in
distinguishing the distribution of RNA counts across various cohorts
within each gene.
Null Hypothesis: The null hypothesis states that there is no significant difference in gene expression between FIB cells and non-target cells. However, based on the t-test results and volcano plot we are able to reject the null hypothesis as there is a significant difference between FIB cells and non traget cells in terms of the PKD disease. Fold Change: The fold change represents the magnitude of change in gene expression between FIB cells and non-target cells. Genes with significant fold changes show differences in expression levels between the two cell types. For example, FBLN1 has significant upregulation and LHFPL3 shows downregulation in FIB cells. p-values: The p-values indicate the significance level of observed fold changes. Lower p-values suggest stronger evidence against the null hypothesis, indicating significant differences in gene expression between FIB cells and non-target cells. In this study, genes with lower p-values, such as CALD1, show upregulation in FIB cells and genes like LHFPL3 show downregulation. The signifcant biomarker is FBLN1.
The involvement of single-cell and single-nucleus sequencing data in autosomal dominant polycystic kidney disease (ADPKD) patients has shown detailed cellular and molecular mechanisms that are able to encourage disease progression. Key findings include the activation of TNFα-induced NF-κB signaling in ADPKD-specific fibroblast subtypes. This leads to a proinflammatory microenvironment, and signatures of repair failure in proximal tubular cells. Additionally, identification of unique cell subpopulations like proinflammatory fibroblasts is able to show new insights into disease diagnosis with potential implications for targeted therapies. These discoveries show us how important is to understand cellular heterogeneity and dysregulated pathways in ADPKD. This improves the way for personalized treatment strategies to improve patient outcomes.
This analysis explored a limited set of statistical method. However, considering additional machine learning algorithms. For example, support vector machines or gradient boosting can give an overall improvement to predicative perfomance. Additionally, refining the feature selection process may reduce overfitting and improve how the models react to unseen data.
Any other references you have read and may want to include