Deciphering Cellular Complexity in Autosomal Dominant Polycystic Kidney Disease Through Multimodal Single-Cell Analysis
Introduction
Background & Abstract
Autosomal dominant polycystic kidney disease (ADPKD) affects approximately 1 in 400 to 1,000 individuals globally and is the most prevalent inherited cystic renal disease. The disease is primarily caused by mutations in the PKD1 or PKD2 genes, which encode polycystin proteins that are thought to regulate various cellular functions, including intracellular calcium signaling. Treatments such as tolvaptan, which targets vasopressin signaling, can slow disease progression but do not halt it, and their use is limited by side effects. The advent of single-cell RNA sequencing (scRNA-seq) and single-nucleus RNA sequencing (snRNA-seq) technologies has revolutionized our ability to analyze cellular heterogeneity and gene expression at unprecedented resolution, facilitating a more nuanced understanding of the cellular dynamics in ADPKD. This article leverages these technologies to construct a comprehensive cellular atlas of ADPKD, highlighting the disease’s complex biology and pointing towards potential therapeutic targets.
Preface
Autosomal dominant polycystic kidney disease (ADPKD) is characterized by the progressive expansion of kidney cysts, leading to end-stage renal disease. It represents a significant genetic cause of renal failure. The complexity of cellular responses in ADPKD complicates the development of effective treatments, prompting the need for a deeper understanding of the cellular mechanisms driving the disease. This study employs multimodal single-cell analysis to dissect the cellular landscape of ADPKD, utilizing advanced sequencing technologies to examine both the transcriptomes and epigenomes of cells from ADPKD-affected kidneys. By analyzing cells at a single-cell level, the study identifies critical pathways and cellular states associated with disease progression, offering new insights into the roles of specific cell types and their genetic regulators.
Significance
The significance of this research lies in its potential to profoundly alter our understanding and treatment of autosomal dominant polycystic kidney disease (ADPKD), a leading genetic cause of end-stage renal disease. The study utilizes cutting-edge multimodal single-cell analysis to delve deeply into the cellular and molecular complexities of ADPKD, offering insights that are crucial for developing more effective therapeutic strategies. Firstly, this research identifies specific cellular populations and states that contribute to the progression of ADPKD, including the activation of pro-inflammatory and profibrotic pathways. Such detailed characterization at a single-cell resolution provides a granular view of the disease mechanisms that were previously obscured in bulk tissue analyses. By pinpointing which cells and pathways are pivotal in disease progression, this study opens new avenues for targeted therapies that could modulate these critical pathways and halt or even reverse disease progression. Moreover, the identification of novel cellular markers and regulatory elements specific to ADPKD-related cellular states presents potential biomarkers for early detection and progression monitoring of the disease. The ability to detect ADPKD early and to accurately monitor its progression is crucial for improving patient outcomes, as it allows for timely intervention and the potential to delay or prevent the onset of end-stage renal disease. Additionally, this research enhances our understanding of the genetic and epigenetic regulation underlying ADPKD.
Hypothesis
Biological Hypothesis
In autosomal dominant polycystic kidney disease, PT1 cells undergo transcriptional and epigenetic alterations that significantly upregulate pathways related to cell proliferation, adhesion, and fibrotic responses. This upregulation may be driven by specific genetic and epigenetic modifications that can be identified as biomarkers for disease progression. The comparison of gene expression profiles between PT1 cells in ADPKD and healthy controls will reveal a subset of genes whose expression levels are significantly different, providing insights into potential therapeutic targets.
Null Hypothesis (H0):
There is no significant difference in the expression levels of genes in PT1 cells between ADPKD patients and healthy controls. The genetic and epigenetic landscape of PT1 cells does not differ between the two groups, indicating that these cells do not contribute uniquely to the pathophysiology of ADPKD.
Alternative Hypothesis (H1):
PT1 cells in ADPKD patients exhibit significant differences in gene expression levels compared to healthy controls, with a distinct genetic and epigenetic profile that contributes to the development and progression of ADPKD. The differential expression of genes involved in pathways such as cell proliferation, adhesion, and fibrosis in PT1 cells is associated with disease-specific characteristics and may serve as biomarkers or therapeutic targets for ADPKD.
Materials and Methods
Samples
In this study on autosomal dominant polycystic kidney disease (ADPKD), the researchers analyzed kidney samples from thirteen individuals, including eight patients diagnosed with ADPKD and five healthy controls, to construct a detailed cellular atlas. The ADPKD samples were obtained from patients with advanced disease stages, all of whom required kidney transplantation, reflecting the severe impairment of renal function typical at the time of sample collection. These samples encompassed a diverse array of cell types and states within the diseased kidneys, providing a basis for comparison against the healthy control samples. The control samples, crucial for establishing baseline cellular metrics and identifying disease-specific alterations, were sourced from individuals with preserved renal function, ensuring the reliability of the control data. This careful selection of both diseased and healthy samples allows for a comprehensive analysis of the cellular and molecular alterations specific to ADPKD.
This analysis focuses on a detailed breakdown of the PT1 cell type (proximal tubule type 1).
Experimental Procedure
In this study, single-nucleus RNA sequencing (snRNA-seq) technology was employed to analyze the cellular composition and gene expression patterns within human kidney samples affected by autosomal dominant polycystic kidney disease (ADPKD) and healthy controls. The snRNA-seq technology is particularly suited for profiling gene expression in frozen or challenging tissues like the kidney, where cell viability post-thaw might be compromised. This method involves isolating nuclei from cells and then sequencing the RNA content to capture a snapshot of the gene activity within individual nuclei. The researchers utilized the 10X Genomics Chromium Single Cell 3’ platform, which allows for high-throughput sequencing with precise resolution. This platform’s capability to tag RNA molecules with unique molecular identifiers ensures accurate quantification of gene expression levels, enabling the identification of distinct cellular subtypes and states based on their transcriptional profiles, even within the heterogeneous environments characteristic of diseased and healthy kidney tissues.
Computational Procedure
Step 1: Data Quality Control and Preprocessing
Step 1a: Cell Ranger Pipeline - Initially, the raw sequencing reads were processed using the Cell Ranger software (10X Genomics), which handles barcode processing, alignment of reads to the reference genome, and quantification of gene expression levels.
Step 1b: Filtering - The initial output includes a matrix of gene counts across individual nuclei. This matrix was then filtered to remove low-quality nuclei and potential doublets, ensuring that the data used for further analysis represented high-quality, single-nucleus profiles.
Step 2: Data Integration and Normalization
Step 2a: Seurat Package - Using the R package Seurat, data from ADPKD and control samples were integrated to correct for batch effects, a common issue when sequencing data are collected across multiple experiments.
Step 2b: Normalization - Within Seurat, data normalization procedures were applied to mitigate the impact of varying sequencing depths across samples and to standardize the expression measurements for comparative analysis.
Step 3: Dimensionality Reduction and Clustering
Step 3a: Principal Component Analysis (PCA) - This step reduces the complexity of the data by transforming the high-dimensional dataset (genes x cells) into a lower-dimensional space while preserving as much variance as possible.
Step 3b: Clustering and UMAP Visualization - Further dimensional reduction was performed using Uniform Manifold Approximation and Projection (UMAP) for visualization. This method helps in visualizing clusters of nuclei that share similar expression profiles, indicative of specific cell types or states.
Step 4: Differential Expression Analysis
Step 4a: Identification of Differentially Expressed Genes - For each cluster identified in the previous step, differential expression analysis was conducted to identify genes that were significantly up- or down-regulated in ADPKD samples compared to controls.
Step 4b: Marker Gene Identification - This analysis also facilitated the identification of marker genes for each cell cluster, aiding in the characterization and annotation of cell types present in the kidney samples.
Step 5: Pathway and Network Analysis
Step 5a: Enrichment Analysis - Using gene set enrichment analysis (GSEA) with curated gene sets such as those in MSigDB, the researchers identified key signaling pathways and biological processes that were enriched in specific clusters, particularly those altered in ADPKD.
Step 5b: Network Construction - Interaction networks were constructed to identify potential signaling pathways and interactions between identified markers, providing insights into the molecular mechanisms driving ADPKD.
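The depth normalization described in Step 2b can be illustrated with a small base-R sketch. It assumes the log-normalization scheme that Seurat applies by default (each nucleus's counts divided by its total, multiplied by a scale factor of 10,000, then log1p-transformed); the source does not state which method the researchers used, and the toy matrix and the helper name `log_normalize` are purely illustrative.

```r
# Toy count matrix: rows are genes, columns are nuclei (values are made up).
counts <- matrix(
  c(10, 0, 5,
    20, 4, 0,
     0, 6, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("nuc1", "nuc2", "nuc3"))
)

# Hypothetical helper: scale each nucleus to a common depth, then log1p.
log_normalize <- function(m, scale_factor = 1e4) {
  depth <- colSums(m)                               # total counts per nucleus
  scaled <- sweep(m, 2, depth, "/") * scale_factor  # counts per 10,000
  log1p(scaled)
}

norm <- log_normalize(counts)
print(round(norm, 2))

# After normalization every nucleus has the same back-transformed total.
print(colSums(expm1(norm)))
```

Dividing by the per-nucleus total is what removes the effect of sequencing depth; the log transform then compresses the range so that highly expressed genes do not dominate later steps such as PCA.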
Part 1: Kidney scRNA Data
Setting Up Data & Library
tidyverse
pheatmap
umap
broom
BiocManager & EnhancedVolcano
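The packages listed above could be checked and attached along the following lines. This is a minimal sketch: the package names come from the list, the variables `pkgs`, `is_installed`, and `missing_pkgs` are illustrative, and the installation commands are left as comments because EnhancedVolcano is distributed through Bioconductor rather than CRAN.

```r
# Packages named in this section (EnhancedVolcano comes from Bioconductor).
pkgs <- c("tidyverse", "pheatmap", "umap", "broom", "EnhancedVolcano")

# Report which packages are not yet installed.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]
print(missing_pkgs)

# One possible way to install what is missing (run manually if needed):
# install.packages(c("tidyverse", "pheatmap", "umap", "broom", "BiocManager"))
# BiocManager::install("EnhancedVolcano")

# Attach only the packages that are available.
for (p in setdiff(pkgs, missing_pkgs)) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```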
Understanding Data
Each row of the dataset corresponds to a single cell (nucleus), identified by a unique id. The first two columns hold the id and the annotated cell type, and the remaining columns hold one expression value per gene. This layout allows each cell to be examined across all measured genes.
The dataset consists of 2,400 cells, each characterized by measurements across 2,000 genes, for a total of 4,800,000 expression values. The values represent gene expression levels; they include negative numbers (see the preview below), which suggests that they have been centered or scaled rather than stored as raw counts. These measurements are the basis for comparing cell types and cohorts in the rest of the analysis.
R Code
## # A tibble: 6 × 2,002
## id cell.type PTPRQ PTPRO ST6GALNAC3 MAGI2 SERPINE1 SLC8A1 SLC26A7
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 PKD_ACAG… PT2 -0.258 -0.338 -0.359 -1.49 1.85 -0.968 -0.405
## 2 PKD_ACCA… DCT -0.385 -0.523 -0.422 1.63 -0.421 0.500 -0.251
## 3 PKD_ACGC… DCT -0.210 -0.355 -0.536 0.490 -0.338 -0.828 -0.297
## 4 PKD_ACGC… z12 -0.798 -0.966 2.10 0.827 -0.199 0.0483 -0.308
## 5 PKD_ACGG… PT1 1.56 -0.211 -0.400 -0.479 0.431 1.81 -0.372
## 6 PKD_ACGG… TAL2 -0.0403 0.0271 -0.365 -1.32 -0.330 -0.256 -0.203
## # ℹ 1,993 more variables: PLA2R1 <dbl>, SLC12A3 <dbl>, AC109466.1 <dbl>,
## # EMCN <dbl>, NTNG1 <dbl>, TIMP3 <dbl>, ZPLD1 <dbl>, LINC01811 <dbl>,
## # CELF2 <dbl>, PLAT <dbl>, ROBO2 <dbl>, FGF1 <dbl>, FMN2 <dbl>,
## # SLC14A2 <dbl>, PAPPA2 <dbl>, PLPP1 <dbl>, ADAMTS19 <dbl>, CLNK <dbl>,
## # LDB2 <dbl>, AC093912.1 <dbl>, ADGRL3 <dbl>, NKAIN2 <dbl>, CLIC5 <dbl>,
## # NPHS2 <dbl>, ADGRF5 <dbl>, FLT1 <dbl>, ADAMTS6 <dbl>, RGS6 <dbl>,
## # DCC <dbl>, SNTG1 <dbl>, CNTNAP5 <dbl>, ENPP2 <dbl>, NPHS1 <dbl>, …
Data Transformation and Clean-Up
This operation transforms the dataset by appending a new column named cohort. It extracts this identifier by eliminating all text following and including the underscore (_) in each id entry. This step effectively isolates the portion of the id that precedes the underscore, which typically represents the cohort to which the sample belongs. Following the cohort extraction, this command further refines the dataset by adding a sample column. It achieves this by removing the segments of the id that are enclosed by underscores, thereby preserving only the essential identifiers at the beginning and end of the id string. This process distinctly categorizes each entry for enhanced data manipulation and analysis.
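The two extraction steps can be sketched in base R as follows. The ids in the preview are truncated, so the id format used here (cohort, barcode, and sample number separated by underscores) is an assumption, and the barcodes and the name `MD_toy` are made up for illustration.

```r
# Hypothetical ids in the assumed form <cohort>_<barcode>_<sample number>.
MD_toy <- data.frame(
  id = c("PKD_ACAGCCGA_1", "PKD_TTGACCAT_3", "Cont_GGATCCAA_2"),
  stringsAsFactors = FALSE
)

# cohort: drop the first underscore and everything after it.
MD_toy$cohort <- sub("_.*", "", MD_toy$id)

# sample: drop everything from the first underscore through the last one,
# which leaves the cohort prefix joined to the trailing sample number.
MD_toy$sample <- sub("_.*_", "", MD_toy$id)

print(MD_toy)
```

The same regular expressions should work inside `mutate()` with `stringr::str_remove()` if a tidyverse pipeline is preferred.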
Frequency Distribution Analysis
Total Samples
The dataset comprises 12 distinct samples in two cohorts: 5 control samples (labelled Cont1 through Cont5) and 7 samples from patients with polycystic kidney disease (labelled PKD1 and PKD3 through PKD8; no PKD2 sample appears in the table below).
Cells Per Sample
Each sample consistently contains 200 cells, facilitating uniform data analysis across different conditions.
Variety of Cell Types
A total of 13 unique cell types have been identified, enhancing the dataset’s utility for studying cellular diversity. The cell types include:
- CNT_PC: Connecting Tubule and Principal Cells
- DCT: Distal Convoluted Tubule
- ENDO: Endothelial Cells
- FIB: Fibroblasts
- ICA-ICB: Intercalated Cells Types A and B
- LEUK: Leukocytes
- PEC: Parietal Epithelial Cells
- PODO: Podocytes
- PT1: Proximal Tubule Type 1
- PT2: Proximal Tubule Type 2
- TAL1: Thick Ascending Limb of Henle’s Loop Type 1
- TAL2: Thick Ascending Limb of Henle’s Loop Type 2
- z12: A unique cell type
R Code
##
## Cont PKD
## 1000 1400
##
## Cont1 Cont2 Cont3 Cont4 Cont5 PKD1 PKD3 PKD4 PKD5 PKD6 PKD7 PKD8
## 200 200 200 200 200 200 200 200 200 200 200 200
##
## CNT_PC DCT ENDO FIB ICA-ICB LEUK PEC PODO PT1 PT2
## 352 204 104 188 122 123 45 45 343 189
## TAL1 TAL2 z12
## 172 483 30
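The code that produced these counts is not shown. Output of this shape is what base R's `table()` returns, so a minimal sketch, assuming that function and using a made-up six-row data frame in place of the real data, might look like this:

```r
# Toy metadata standing in for the cohort, sample, and cell.type columns.
toy <- data.frame(
  cohort    = c("Cont", "Cont", "Cont", "PKD", "PKD", "PKD"),
  sample    = c("Cont1", "Cont1", "Cont2", "PKD1", "PKD3", "PKD3"),
  cell.type = c("PT1", "DCT", "PT1", "PT1", "PT2", "PT1")
)

# One-way frequency tables, one per grouping variable.
cohort_counts <- table(toy$cohort)
sample_counts <- table(toy$sample)
type_counts   <- table(toy$cell.type)

print(cohort_counts)
print(sample_counts)
print(type_counts)

# A two-way table shows how each cell type splits between cohorts.
print(table(toy$cell.type, toy$cohort))
```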
Reshaping Gene Expression Data
The pivot_longer function transforms the data from a wide format to a long format, which is often more convenient for statistical modeling and visualization. The cols = 3:2002 argument specifies that columns 3 through 2002 of MD_Data (the 2,000 gene columns) are reshaped. The names_to = "gene" and values_to = "rna.ct" arguments create two new columns in the resulting long-format data frame: gene holds the gene names, and rna.ct holds the corresponding expression value for each gene in each cell.
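A small example may make the reshaping easier to picture. The sketch below applies `pivot_longer` to a made-up table with two cells and three genes; the gene names and values are illustrative only.

```r
library(tidyr)

# Toy wide table: two id columns followed by three made-up gene columns.
wide <- data.frame(
  id        = c("cell1", "cell2"),
  cell.type = c("PT1", "DCT"),
  GENE_A    = c(0.5, -1.2),
  GENE_B    = c(1.1, 0.3),
  GENE_C    = c(-0.4, 2.0)
)

# Reshape columns 3 to 5 so that each row is one (cell, gene) pair.
long <- pivot_longer(wide, cols = 3:5, names_to = "gene", values_to = "rna.ct")

print(long)
```

Two cells and three genes give six rows in the long table; by the same logic the full dataset of 2,400 cells and 2,000 genes yields 4,800,000 rows.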
Examine Data Distribution
# Violin plot of expression values, one violin per sample
MD_Data.long %>%
  ggplot(aes(x = sample, y = rna.ct, fill = sample)) +
  geom_violin() +
  labs(title = "RNA Count Distribution by Sample",
       subtitle = "Violin Plot Overlay",
       x = "Sample",
       y = "RNA Count",
       fill = "Sample Type") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
# Q-Q plots for a random subset of 1,000 expression values, one panel per sample
MD_Data.long %>%
  sample_n(1000) %>%
  ggplot(aes(sample = rna.ct)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  facet_wrap(~sample, scales = "free") +
  labs(title = "Quantile-Quantile Plots of RNA Counts",
       subtitle = "Comparing RNA Counts Distribution Against Theoretical Quantiles",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles",
       caption = "Data randomly sampled to include 1000 observations.") +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
Part 2. Identify Cell Types By Cluster Analysis
Cluster Analysis with Heatmap
# Preparing the heatmap data by selecting only gene expression columns.
x.heat <- x %>%
  select(-c(1, 2, 2003, 2004)) %>%
  as.data.frame()
# Set row names to unique identifiers from the data
rownames(x.heat) <- x$id
# Create a data frame for cell type annotations linked to each row/sample.
ann_cell <- data.frame(cell.type = factor(x$cell.type))
rownames(ann_cell) <- x$id
# Subsample the data to reduce computational load, selecting random subsets of columns and rows.
pick.columns <- sample(colnames(x.heat), 100)
x.pick <- x.heat %>%
  select(all_of(pick.columns)) %>%
  sample_n(200)
# Generate a heatmap with pheatmap, applying annotations and suppressing row names for clarity.
pheatmap(x.pick,
         show_rownames = FALSE,
         annotation_row = ann_cell,
         main = "Heatmap of Gene Expression")
Column & Row
In this heatmap, the columns represent individual genes: the code draws a random subset of 100 of the 2,000 genes. Each row corresponds to one of 200 randomly sampled cells, so the heatmap shows how the expression of each selected gene varies from cell to cell. In other analyses the genes might instead be chosen by variability or relevance to the study, and the rows might represent patient samples or time points in a time-course experiment.
What is the color gradient?
The color gradient in a heatmap represents the scale of the measured values, here gene expression levels. One end of the color spectrum represents lower expression, the middle represents intermediate values, and the other end represents higher expression. With the default pheatmap palette, which the call above does not override, low values are drawn in blue, intermediate values in pale yellow, and high values in red.
What is the color bar for the rows?
A color bar for the rows, known as row annotation, can be used to provide additional information about each row. This could represent categorical data such as the cell type, treatment group, disease status, or any other pertinent classification of the samples. Each category is typically represented by a different color, and this bar runs parallel to the rows, helping to visually associate samples with their metadata.
What do the cluster diagrams represent?
The cluster diagrams, or dendrograms, displayed on the sides of the heatmap, represent hierarchical clustering results. They are visual representations of the distance or similarity between the elements (samples or genes) based on their expression patterns:
Dendrogram Meanings
The dendrogram along the top of the heatmap shows the clustering of columns, i.e., how genes group together based on their expression profiles across the plotted cells. Genes that show similar expression patterns are clustered together.
The dendrogram along the left side shows the clustering of rows, i.e., how cells group together based on their expression of the plotted genes. Cells with similar expression profiles are grouped, which is expected to place cells of the same type next to each other.
These dendrograms help in identifying groups of genes or samples that behave similarly under the study conditions, potentially revealing biological processes or subgroup characteristics within the data.
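The clustering behind a dendrogram can be reproduced directly in base R. The sketch below uses simulated data and assumes Euclidean distance with complete linkage, which are the pheatmap defaults; the matrix `expr` and its two simulated groups are illustrative only.

```r
set.seed(1)

# Toy expression matrix: 6 cells (rows) x 4 genes (columns).
# Cells 1-3 are simulated with high values, cells 4-6 with low values.
expr <- rbind(
  matrix(rnorm(12, mean = 3, sd = 0.2), nrow = 3),
  matrix(rnorm(12, mean = -3, sd = 0.2), nrow = 3)
)
rownames(expr) <- paste0("cell", 1:6)
colnames(expr) <- paste0("gene", 1:4)

# Pairwise Euclidean distances between cells, then complete-linkage
# hierarchical clustering.
row_dist <- dist(expr, method = "euclidean")
row_tree <- hclust(row_dist, method = "complete")

# Cut the tree into two groups and inspect the assignment.
groups <- cutree(row_tree, k = 2)
print(groups)

# plot(row_tree) would draw the dendrogram itself.
```

Clustering genes instead of cells only requires transposing the matrix, for example `hclust(dist(t(expr)))`.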
Dimension Reduction with PCA
# Running the PCA without scaling
x.pca <- prcomp(x.heat)
# Showing the variance explained by each principal component
plot(x.pca)
# Extracting PCA coordinates (the scores stored in x.pca$x) and preparing the data frame
pca.coord <- as.data.frame(x.pca$x)
df.pca <- as_tibble(pca.coord[, 1:2]) %>%
  mutate(cell.type = x$cell.type)
# Visualizing PCA results with ggplot2
df.pca %>%
  ggplot(aes(x = PC1, y = PC2, color = cell.type)) +
  geom_point(size = 0.1, alpha = 0.5) + # small, semi-transparent points to limit overplotting
  theme_bw() +
  theme(legend.position = "bottom") +
  stat_ellipse() +
  labs(
    title = "PCA Plot of Cell Types",
    x = "Principal Component 1",
    y = "Principal Component 2",
    color = "Cell Type") +
  theme(plot.title = element_text(hjust = 0.5), plot.caption = element_text(size = 10))
PCA Plot of Kidney Cell Types
Each point represents a single cell’s gene expression profile projected onto the first two principal components (PC1 and PC2), which capture the largest variances within the dataset. PC1 explains the most variance among the cells, followed by PC2, which is orthogonal to PC1. The colors distinguish different cell types, highlighting the gene expression variability and clustering patterns among them. The proximity of points indicates similarity in gene expression profiles; closer points suggest similar profiles, while more distant points indicate dissimilar profiles. The ellipses are drawn by stat_ellipse(), which by default outlines a 95% region for each cell type under an assumed multivariate t-distribution, and they help visualize the grouping and spread of points within each category.
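The share of variance captured by each component can be computed from the `sdev` element that `prcomp` returns. The following sketch uses simulated data (three made-up genes, two of them correlated) rather than the kidney dataset, so the numbers it prints are illustrative only.

```r
set.seed(42)

# Toy data: 50 cells x 3 genes, where gene2 largely tracks gene1.
gene1 <- rnorm(50, sd = 3)
gene2 <- gene1 + rnorm(50, sd = 0.5)
gene3 <- rnorm(50, sd = 1)
toy <- cbind(gene1, gene2, gene3)

# PCA with centering but without scaling, as in the prcomp call above.
toy.pca <- prcomp(toy)

# Proportion of total variance explained by each component.
var_explained <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)
print(round(var_explained, 3))

# Scores for the first two components (one row per cell).
print(head(toy.pca$x[, 1:2]))
```

Because the two correlated genes vary together, most of the variance falls on PC1; in real expression data a similar effect arises when groups of genes are co-regulated within a cell type.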
Dimension Reduction With UMAP
# Calculate UMAP dimensions based on the preprocessed data in x.heat
x.umap <- umap(x.heat)
# Convert UMAP output to a data frame and enrich it with metadata from the original dataset
df.umap <- data.frame(x.umap$layout) %>%
  mutate(cell.type = x$cell.type) %>%
  mutate(sample = x$sample) %>%
  mutate(cohort = x$cohort)
# Create a ggplot to visualize the UMAP projection
df.umap %>%
  ggplot(aes(x = X1, y = X2, color = cell.type)) +
  geom_point(alpha = 0.6, size = 1.5) +
  labs(
    title = "UMAP Projection of Cell Types",
    x = "UMAP Dimension 1",
    y = "UMAP Dimension 2",
    color = "Cell Type") +
  theme_bw() +
  theme(legend.position = "right",
        plot.title = element_text(hjust = 0.5)) +
  facet_wrap(~cohort, ncol = 1)
UMAP Plot of Kidney Cell Types
UMAP Dimension 1 and UMAP Dimension 2 are the two coordinates of the low-dimensional embedding produced by the UMAP algorithm. Unlike principal components, they are not ordered by the amount of variance they capture and have no direct interpretation on their own; what matters is the position of the points relative to one another.
Each point in the UMAP plot represents a single cell, positioned according to its gene expression profile. The location of each point reflects the cell’s relationship to other cells based on these features. Colors indicate categories within the data: here, each color represents a different cell type, which allows quick visual differentiation between groups of cells and helps identify patterns or clusters within specific cell types.
Distances in a UMAP plot are a non-linear approximation of similarity between data points. Cells that are close together in the UMAP space have similar gene expression profiles. Conversely, cells that are far apart generally have dissimilar profiles, although UMAP preserves local neighborhoods more faithfully than large distances, so the spacing between well-separated clusters should be interpreted with caution.
Part 3. Identify Marker Genes for Cell Types
Volcano Plot
# Target group: a random sample of 100 PT1 cells
target.cells <- x %>%
  filter(cell.type == 'PT1')
target.cells <- target.cells %>%
  sample_n(100)
# Comparison group: 100 random cells of any other type, relabelled 'non-target'
non.target.cells <- x %>%
  filter(cell.type != 'PT1')
non.target.cells <- non.target.cells %>%
  sample_n(100) %>%
  mutate(cell.type = 'non-target')
# Combine both groups and reshape to one row per (cell, gene) pair
y <- bind_rows(target.cells, non.target.cells)
y.long <- y %>%
  pivot_longer(3:2002, names_to = "gene", values_to = "rna.ct")
# Welch t-test per gene; estimate is the first group's mean minus the second's
t.out <- y.long %>%
  group_by(gene) %>%
  do(tidy(t.test(data = ., rna.ct ~ cell.type)))
# Volcano plot of the mean difference against the p-value
EnhancedVolcano(toptable = t.out, lab = t.out$gene, x = "estimate", y = "p.value",
                title = "Tissue-Specific Genes", subtitle = "Cell Type: PT1") +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
Graph Interpretation
The x-axis is labelled as log2 fold change (Log2 FC), which is the EnhancedVolcano default, but the values plotted here are the t-test estimates, i.e., the difference in mean expression between the two groups of cells. In a conventional volcano plot the x-axis shows the log2 fold change: a fold change of 2 means the gene expression is doubled, and a fold change of 0.5 means it is halved. Taking log2 places upregulation and downregulation symmetrically on the same scale (+1 and -1 in this example) and makes changes across a broad range of magnitudes easier to compare. The y-axis displays the negative base-10 logarithm of the p-value obtained from the statistical test comparing gene expression between the two groups. This transformation spreads out small p-values, so the most significant genes appear highest on the plot. A p-value in this context is the probability of observing a difference at least as large as the one measured if the null hypothesis were true. Smaller p-values indicate stronger evidence against the null hypothesis.
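The two axis transformations can be checked with a few lines of base R. The fold changes and p-values below are arbitrary illustrative numbers, not results from this dataset.

```r
# Fold changes: doubled, unchanged, halved.
fold_change <- c(2, 1, 0.5)
log2_fc <- log2(fold_change)
print(log2_fc)          # 1, 0, -1: symmetric around zero

# p-values on the volcano plot's y-axis scale.
p_value <- c(0.05, 0.001, 1e-6)
neg_log10_p <- -log10(p_value)
print(neg_log10_p)      # about 1.30, 3, 6
```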
Color Interpretation
Gray (NS): Represents genes that are not statistically significant. These genes do not meet the criteria set for significance in terms of p-value and fold change.
Green: Represents genes with significant log2 fold changes but not significant p-values. These genes show considerable changes in expression levels but are not statistically significant.
Blue: Represents genes that are statistically significant in terms of p-value only. These genes have low p-values but their fold changes are not beyond the set threshold for log2 fold change.
Red: Represents genes that are significant both in terms of p-value and log2 fold change. These are the genes of highest interest because they show both a statistically significant difference in expression and a considerable magnitude of change.
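The four color categories follow from two thresholds applied to each gene. The sketch below reproduces that logic on a made-up results table; it assumes the EnhancedVolcano default cutoffs (a p-value cutoff of 1e-05 and an effect-size cutoff of 1), which the call above does not override, and the gene names and values are hypothetical.

```r
# Hypothetical results table with an effect size and p-value per gene.
res <- data.frame(
  gene     = c("g1", "g2", "g3", "g4"),
  estimate = c(0.2, 1.5, -0.3, -2.1),
  p.value  = c(0.40, 0.20, 1e-8, 1e-9)
)

# Cutoffs assumed to match the EnhancedVolcano defaults.
p_cutoff  <- 1e-5
fc_cutoff <- 1

big_effect <- abs(res$estimate) > fc_cutoff
small_p    <- res$p.value < p_cutoff

# Assign each gene to one of the four legend categories.
res$category <- ifelse(big_effect & small_p, "red: p-value and effect size",
                ifelse(small_p,              "blue: p-value only",
                ifelse(big_effect,           "green: effect size only",
                                             "gray: not significant")))
print(res[, c("gene", "category")])
```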
Gene Regulation Interpretation
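From the plot, you can identify genes that differ significantly between the two groups based on their position. For example, genes such as “ERBB4” and “LINC01811” appear in red, indicating that they pass both the p-value and the effect-size cutoffs. “ERBB4” is shown on the right side of the plot with a positive estimate, and “LINC01811” lies on the same side. Whether the right side corresponds to higher or lower expression in PT1 depends on the order in which t.test treats the two groups (the estimate is the mean of the first group minus the mean of the second), so the direction should be confirmed from the estimate1 and estimate2 columns of t.out before these genes are called upregulated.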
Top 5 Genes
top5 <- t.out %>%
  arrange(p.value) %>%
  head(5) %>%
  pull(gene)
MD_Data.long %>%
  filter(gene %in% top5) %>%
  ggplot(aes(x = cell.type, y = rna.ct)) +
  geom_violin() +
  geom_jitter(shape = 1, alpha = 0.1) +
  coord_flip() +
  facet_wrap(~gene) +
  labs(
    title = "Top 5 Marker Gene Expression Across Cell Types",
    x = "Cell Type",
    y = "RNA Count"
  )
Graph Interpretation
The x-axis represents the RNA count (expression level) for each of the top five selected genes. Because facet_wrap is called with its default fixed scales, all panels share the same axis range. The y-axis shows the different cell types. Each gene’s expression is plotted against all cell types, allowing us to see patterns across groups. The points are individual observations: the expression value of a particular gene in one cell of the given cell type. Jittering spreads the points out to reduce overlap, providing a clearer view of the distribution. Each panel corresponds to a specific gene from the selected list. For example, the “AC019068.1” panel shows how this gene is expressed across cell types like z12, TAL2, PT2, etc. The panels are arranged in a grid using facet_wrap, with each plot containing the violin shape and data points for a different gene.
T-Test Results Consistency
To confirm that these visual patterns agree with statistical analyses such as t-tests, we would need the p-values comparing each gene’s expression across cell types. If a gene is visibly concentrated in one cell type relative to the others, it is likely to have a low p-value in a t-test or other differential expression analysis. Based on visual inspection, “AC019068.1” appears to have the most distinct pattern, with high expression in the z12 and PEC cell types and low expression in most others. This gene might therefore be the best biomarker for distinguishing those cell types.
Valid marker genes appear to be present. For example, “AC019068.1” shows specific high expression in a limited number of cell types. This distinction is crucial in identifying target cell types from a larger mixture, and such differentiation makes these genes potential biomarkers.
Part 4. Identify Disease-Associated Genes
Volcano Plot
x.cell <- MD_Data.long %>%
  filter(cell.type == 'PT1')
t.out2 <- x.cell %>%
  group_by(gene) %>%
  do(tidy(t.test(data = ., rna.ct ~ cohort)))
t.out2
## # A tibble: 2,000 × 11
## # Groups: gene [2,000]
## gene estimate estimate1 estimate2 statistic p.value parameter conf.low
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A2M 0.0329 -0.302 -0.335 0.935 0.351 252. -0.0365
## 2 ABCA10 -0.0136 -0.151 -0.137 -0.158 0.875 309. -0.183
## 3 ABCA4 0.0613 -0.0701 -0.131 1.43 0.153 203. -0.0230
## 4 ABCA5 -0.0564 -0.199 -0.142 -0.597 0.551 331. -0.242
## 5 ABCA9 0.130 -0.113 -0.243 2.69 0.00755 340. 0.0349
## 6 ABCB1 -0.164 0.530 0.694 -1.27 0.205 230. -0.418
## 7 ABCB5 -0.0534 -0.0316 0.0218 -0.728 0.468 154. -0.198
## 8 ABCC3 -0.0688 0.562 0.631 -0.613 0.541 217. -0.290
## 9 ABCC4 -0.00301 0.386 0.389 -0.0268 0.979 297. -0.224
## 10 ABCC8 -0.0614 -0.0961 -0.0347 -3.54 0.000506 197. -0.0957
## # ℹ 1,990 more rows
## # ℹ 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
EnhancedVolcano(
  toptable = t.out2,
  lab = t.out2$gene,
  x = "estimate",
  y = "p.value",
  ylim = c(0, 4),
  pCutoff = 0.01,
  title = "PKD-Associated Genes",
  subtitle = "Cell Type: PT1",
  xlab = "Effect Size (Mean Difference)",
  ylab = "-Log10(P-Value)",
  legendPosition = "right",
  legendLabSize = 10,
  legendIconSize = 3.0,
  drawConnectors = TRUE, # Draw lines to labels for better clarity
  widthConnectors = 0.5,
  pointSize = 3.0
) +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
Graph Interpretation
The x-axis represents the effect size: the difference in mean expression (RNA count) between the two cohorts in PT1 cells. In the t-test output above, estimate equals estimate1 minus estimate2, and t.test orders the groups by factor level (Cont before PKD), so the estimate is the control mean minus the PKD mean. Positive values on the right therefore indicate genes with lower expression in PKD than in controls, while negative values on the left indicate genes with higher expression in PKD. The y-axis represents the statistical significance of differential expression, expressed as the -log10 of the p-value from the t-test. A higher y-axis value corresponds to a lower p-value, so genes higher on the plot are more significantly differentially expressed. A fold change, by contrast, is the ratio of a gene’s expression in one condition to its expression in another (here, PKD relative to controls). The x-axis in this plot is a mean difference rather than a fold change: it conveys the direction and size of the change but is not on a log2 fold-change scale.
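The sign convention of the estimate can be verified on simulated data. In the sketch below, the cohort labels match the dataset, but the expression values are randomly generated with the PKD group deliberately shifted upward, so the result illustrates only the direction of the subtraction.

```r
set.seed(7)

# Simulated expression of one gene in PT1 cells; PKD is shifted upward.
toy <- data.frame(
  cohort = rep(c("Cont", "PKD"), each = 30),
  rna.ct = c(rnorm(30, mean = 0), rnorm(30, mean = 1))
)

# Welch two-sample t-test, the default for t.test with a formula.
tt <- t.test(rna.ct ~ cohort, data = toy)

# Group means are reported in factor-level order: Cont first, then PKD.
print(tt$estimate)

# Mean difference as broom::tidy() reports it: first group minus second.
mean_diff <- unname(tt$estimate[1] - tt$estimate[2])
print(mean_diff)   # negative here, because the PKD mean is higher
```

If a plot with positive values meaning "higher in PKD" is preferred, one option is to negate the estimate column before plotting, or to set the factor levels of cohort so that PKD comes first.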
Color Interpretation
Blue: Genes considered significant according to the p-value cutoff (pCutoff = 0.01) and shown in the legend as “p-value”.
Gray: Genes that are not statistically significant.
Gene Regulation Interpretation
Because the estimate is the control mean minus the PKD mean, a gene plotted on the positive side, such as ATP1B3 in the provided volcano plot, has a higher mean in controls than in PKD, while a gene plotted on the negative side (e.g., DLGAP1, if its point lies in the negative region) has a higher mean in PKD. In both cases the gene is colored blue when it passes the p-value cutoff.
Top 5 Genes
top5 <- t.out2 %>%
  arrange(p.value) %>%
  head(5) %>%
  pull(gene)
MD_Data.long %>%
  filter(gene %in% top5) %>%
  ggplot(aes(x = cell.type, y = rna.ct, color = cohort)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(shape = 1, alpha = 0.1) +
  coord_flip() +
  facet_wrap(~gene)
Graph Interpretation
The x-axis indicates RNA count (rna.ct), the expression level of each gene. For each cell type, the spread along the x-axis shows the distribution of expression values for a specific gene. The y-axis lists the cell types (e.g., z12, TAL2, PT1), so each panel shows how a particular gene’s expression varies across all of them. The individual points represent single cells within a given cell type. The points are jittered to reduce overlap and improve visibility, while the box plots indicate the median and interquartile range. The red and teal colors differentiate between cohorts. Each panel represents a specific gene, with its name displayed at the top of the panel. This layout helps visualize the expression patterns of multiple genes side by side across the same cell types, highlighting differential expression.
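The summary statistics that a box plot draws can be computed directly. The sketch below uses ten made-up expression values for one gene in two cohorts; the numbers are illustrative and chosen so that the quartiles fall on observed values.

```r
# Toy long-format data: one gene measured in two cohorts of one cell type.
toy <- data.frame(
  cohort = rep(c("Cont", "PKD"), each = 5),
  rna.ct = c(0.1, 0.3, 0.2, 0.5, 0.4,
             1.0, 1.4, 1.2, 0.9, 1.1)
)

# Median and quartiles per cohort: the center line and box edges of a boxplot.
box_stats <- aggregate(
  rna.ct ~ cohort, data = toy,
  FUN = function(v) quantile(v, probs = c(0.25, 0.5, 0.75))
)
print(box_stats)

# The interquartile range is the distance between the box edges.
iqr_by_cohort <- tapply(toy$rna.ct, toy$cohort, IQR)
print(iqr_by_cohort)
```

When the boxes of the two cohorts do not overlap, as in this toy example, a t-test on the same values would usually also report a small p-value.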
T-Test Results Consistency
The box plots in these panels are consistent with the t-test results. A gene showing clear separation in expression levels between cohorts (red vs. teal points) would likely yield a significant p-value. “LINC01811” shows a distinctive pattern with clear separation between the two cohorts, suggesting significant differential expression. “LINC01811” appears to be a strong candidate biomarker for PKD patients: it shows upregulation in the PKD cohort (teal) compared to the control cohort (red), especially within certain cell types (PT1 and PT2).
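One way to check a visual impression like this is to run the test separately within each cell type. The sketch below does so in base R on simulated data for a single gene. The shift in PT1 and PT2 is built into the simulation, and the Benjamini-Hochberg adjustment is an added suggestion rather than a step described in the analysis above.

```r
set.seed(3)

# Simulated expression for one gene: 25 observations per cell type and cohort
toy <- expand.grid(
  rep       = 1:25,
  cell.type = c("PT1", "PT2", "TAL2"),
  cohort    = c("control", "PKD"),
  stringsAsFactors = FALSE
)
# Hypothetical effect: PKD raises expression in PT1 and PT2 only
shift <- ifelse(toy$cohort == "PKD" & toy$cell.type %in% c("PT1", "PT2"), 3, 0)
toy$rna.ct <- rnorm(nrow(toy), mean = 5 + shift)

# Welch t-test of PKD vs. control within each cell type
per_type_p <- sapply(split(toy, toy$cell.type), function(d) {
  t.test(rna.ct ~ cohort, data = d)$p.value
})

# Adjust for testing several cell types at once (Benjamini-Hochberg)
per_type_padj <- p.adjust(per_type_p, method = "BH")

print(round(per_type_padj, 4))
```

A gene whose separation is confined to certain cell types should show small adjusted p-values only for those types.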
Conclusion and Analysis
Overview of Study Context and Hypothesis
This segment of the study uses single-cell RNA sequencing (scRNA-seq) and single-nucleus RNA sequencing (snRNA-seq) to dissect the complex cellular landscape of ADPKD. The biological hypothesis is that PT1 cells in ADPKD undergo significant transcriptional and epigenetic changes, leading to the upregulation of pathways related to cell proliferation, adhesion, and fibrotic responses. These changes are theorized to be driven by specific genetic and epigenetic modifications that can serve as biomarkers for disease progression.
Analysis of Data and Methodology
Experimental Procedure
The use of snRNA-seq particularly caters to the challenges of kidney tissue analysis, allowing for robust gene expression profiling even in less viable post-thaw cells. This technology is instrumental in capturing a snapshot of gene activity within individual nuclei, providing a detailed molecular understanding that surpasses traditional bulk tissue analyses.
Computational Analysis
The computational steps outlined, including data quality control, normalization, and clustering, ensure that the data integrity is maintained and that the biological variability is accurately captured. This meticulous processing is crucial for the reliable identification of cell-specific gene expression patterns and the subsequent analysis of these patterns across different cell types and disease states.
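To make the quality control and normalization steps concrete, here is a minimal base R sketch on a simulated count matrix. The detection threshold of 50 genes and the scale factor of 10,000 are illustrative assumptions, not the study's parameters. The normalization shown (scale each nucleus to a common total, then log1p) is a widely used scheme, but the pipeline actually applied may differ.

```r
set.seed(4)

# Simulated genes x nuclei count matrix (not study data)
counts <- matrix(
  rpois(200 * 50, lambda = 1),
  nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("nuc", 1:50))
)
# Make three nuclei low quality: only five genes detected in each
counts[, 1:3] <- 0
counts[1:5, 1:3] <- 1

# Quality control: drop nuclei with too few detected genes
genes_detected <- colSums(counts > 0)
min_genes <- 50                          # hypothetical threshold
filtered <- counts[, genes_detected >= min_genes]

# Normalization: scale each nucleus to 10,000 total counts, then log1p
lib_size <- colSums(filtered)
norm <- log1p(sweep(filtered, 2, lib_size, "/") * 1e4)

dim(filtered)
```

After this step every retained nucleus has the same total on the un-logged scale, so differences in sequencing depth no longer dominate comparisons between nuclei.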
Results from PT1 Data
Data Visualization and Clustering
The data integration and normalization procedures set the stage for effective comparison across ADPKD and healthy control samples. Dimensionality reduction (PCA and UMAP) and clustering reveal distinct groups of cells, highlighting the cellular heterogeneity within and between ADPKD and healthy samples. The differential expression analysis identifies genes that are significantly up- or down-regulated in ADPKD samples compared to controls, pinpointing potential biomarkers and therapeutic targets.
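The dimensionality reduction and clustering idea can be sketched with base R alone. The example below runs PCA with prcomp and then k-means on simulated data containing two hidden groups. The choice of k-means, the number of components, and the data are assumptions made for illustration. UMAP is left out because it requires an add-on package, and single-cell pipelines typically use graph-based clustering instead of k-means.

```r
set.seed(5)

# Simulated cells x genes matrix with two hidden groups (not study data)
n_cells <- 60
group <- rep(c("A", "B"), each = n_cells / 2)
expr <- matrix(rnorm(n_cells * 20), nrow = n_cells)
# Group B has higher expression of the first five genes
expr[group == "B", 1:5] <- expr[group == "B", 1:5] + 4

# PCA on centered, scaled expression values
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Cluster cells in the space of the first five principal components
km <- kmeans(pca$x[, 1:5], centers = 2, nstart = 10)

# Compare recovered clusters with the simulated groups
print(table(group, cluster = km$cluster))
```

When the groups differ strongly, as simulated here, the contingency table should show each cluster dominated by one group. With real data the correspondence between clusters and cell types has to be established from marker genes.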
Pathway and Network Analysis
Enrichment analyses further elucidate the biological pathways that are disproportionately affected in ADPKD, particularly those related to cell proliferation and fibrosis, supporting the biological hypothesis. Network analyses potentially reveal interactions between key proteins and regulatory elements involved in ADPKD, offering insights into the disease’s molecular drivers.
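Over-representation is one common form of enrichment analysis, and it reduces to a hypergeometric test. The sketch below shows the calculation in base R. All four counts are invented for illustration, and the enrichment method used in the analysis above is not specified, so this shows the general idea rather than the study's procedure.

```r
# Hypothetical counts (not study results)
universe <- 20000  # genes tested
pathway  <- 200    # genes annotated to the pathway
de_genes <- 500    # differentially expressed genes
overlap  <- 20     # DE genes that are also in the pathway

# Overlap expected by chance alone
expected <- de_genes * pathway / universe

# P(overlap >= 20) under the hypergeometric distribution
p_enrich <- phyper(overlap - 1, pathway, universe - pathway, de_genes,
                   lower.tail = FALSE)

# Cross-check: one-sided Fisher's exact test on the 2x2 table
tab <- matrix(
  c(overlap,            pathway - overlap,
    de_genes - overlap, universe - pathway - (de_genes - overlap)),
  nrow = 2, byrow = TRUE
)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

c(expected = expected, p_enrich = p_enrich, p_fisher = p_fisher)
```

Here the observed overlap of 20 is four times the 5 genes expected by chance, so the pathway would be reported as enriched. In practice the p-values across all tested pathways also need multiple-testing correction.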
Implications of Study Findings
Scientific and Clinical Implications
The results provide a granular view of the disease mechanisms at the cellular level, which can significantly enhance the understanding of ADPKD progression and its underlying molecular drivers.
The identification of specific gene expression profiles associated with disease states presents new avenues for the development of diagnostic and prognostic biomarkers. The study highlights potential targets for therapeutic intervention, which could lead to the development of more effective treatments tailored to disrupt key pathological pathways in ADPKD.
Next Steps for Scientific Research
Further studies should aim to validate these findings in larger cohorts and in experimental models to confirm the roles of the identified genes and pathways in ADPKD. For therapeutic development, new drugs or repurposed existing ones that act on these targets should be explored. Following patients over time, and conducting functional studies on how modifying the identified targets affects disease progression, would be crucial next steps. Integrating the findings into clinical trials could help test the efficacy of new diagnostic markers and therapeutic strategies in a clinical setting.
Conclusion
This detailed analysis of PT1 provides a framework for understanding the complex cellular dynamics in ADPKD. The findings not only bolster the initial hypothesis but also pave the way for significant advancements in the diagnosis and treatment of this challenging genetic disease. The implications for future research and clinical application are vast, holding promise for more precise medical interventions that could substantially improve patient outcomes in ADPKD.
Citation
Muto, Y., Dixon, E. E., Yoshimura, Y., Wu, H., Omachi, K., Ledru, N., Wilson, P. C., King, A. J., Olson, N. E., Gunawan, M. G., Kuo, J. J., Cox, J. H., Miner, J. H., Seliger, S. L., Woodward, O. M., Welling, P. A., Watnick, T. J., & Humphreys, B. D. (2022, October 30). Defining cellular complexity in human autosomal dominant polycystic kidney disease by multimodal single cell analysis. Nature Communications. https://www.nature.com/articles/s41467-022-34255-z