Objectives:
Identify distinct cell populations within the IDC sample using clustering and cell-type annotation based on single-cell RNA sequencing (scRNA-seq) data.
Characterize differentially expressed genes (DEGs) between the identified cell clusters and highlight those linked to malignancy, immune response, and stromal components.
Compare transcriptomic profiles of tumor cells with known invasive ductal carcinoma markers and investigate novel potential biomarkers for IDC.
Visualize the expression of key genes associated with cancer pathways using UMAP, heatmaps, or violin plots.
Loading Required Libraries
importing our single cell .H5 data
cancer_data <- Read10X_h5("C:/Users/iyand/Downloads/SCT2/Breast_Cancer_3p_LT_filtered_feature_bc_matrix.h5")
####Creating a Seurate object
cancer_obj <- CreateSeuratObject(counts= cancer_data, min.cells = 3, min.features = 200)
print(cancer_obj)
## An object of class Seurat
## 17921 features across 661 samples within 1 assay
## Active assay: RNA (17921 features, 0 variable features)
## 1 layer present: counts
Quality Control
#### percentage of mitochondrial gene
cancer_obj [["percent.mt"]] <- PercentageFeatureSet(cancer_obj, pattern = "^MT-")
VlnPlot(cancer_obj, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol= 3)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
nFeature_RNA: Shows the distribution of genes detected per cell, with most cells having 2,000-5,000 genes. nCount_RNA: Indicates total RNA counts per cell, with most under 30,000 but some exceeding 100,000. percent.mt: Represents the percentage of mitochondrial gene expression, with most cells under 25%, though some have higher levels. - Elevated mitochondrial gene expression may indicate stressed or dying cells, while extreme values in gene or RNA counts suggest the presence of poor-quality cells or doublets.
FeatureScatter(cancer_obj, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")+
geom_smooth(method ="lm")
## `geom_smooth()` using formula = 'y ~ x'
- a strong positive correlation (0.92) between nCount_RNA (total RNA
count per cell) and nFeature_RNA (number of genes detected per cell),
which is expected in single-cell RNA sequencing data. However, there are
some outliers with very high RNA and gene counts which probably
indicating DOUBLETS and some low gene and RNA count indicating dying or
extrememly stressed cells, further supporting why we see some cell
having higher mitochondrial RNA.
cancer_obj <- subset(cancer_obj, subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 20)
#Normalization of Data
cancer_obj <- NormalizeData(cancer_obj)
## Normalizing layer: counts
##### Identification of Variable Features
cancer_obj <- FindVariableFeatures(cancer_obj, selection.method = "vst", nfeatures = 2000)
## Finding variable features for layer counts
####
top10 <- head(VariableFeatures(cancer_obj), 10)
variable_plot <- VariableFeaturePlot(cancer_obj)
####gene labeling
LabelPoints(plot = variable_plot, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
The expression of these genes fluctuates significantly between individual cells, compared to other genes which show more consistent expression.
#dimentionality Reduction
cancer_obj <- ScaleData(cancer_obj)
## Centering and scaling data matrix
### PCA
cancer_obj <- RunPCA(cancer_obj, features = VariableFeatures(cancer_obj))
## PC_ 1
## Positive: GSTP1, VIM, COL6A2, FSTL1, CALD1, MARCKS, IGFBP7, IFITM3, CAVIN1, SPARC
## YBX3, COL6A1, BGN, GSN, COL1A2, ARID5B, EFEMP2, NNMT, PRRX1, FBN1
## COL3A1, VCAN, COL1A1, EMP1, CCDC80, CCN2, COL6A3, THBS2, TCF7L2, GGT5
## Negative: CARTPT, KRT18, XBP1, WFDC2, NPY1R, AGR3, PCSK1, CLEC3A, SERPINA3, CLDN4
## PIP, ELF3, TRPA1, AGR2, CHGB, SERPINA6, CFB, APOD, PDZK1, SLC7A2
## AQP3, VGF, CA12, KCNE4, CP, IRX2, DSP, NTS, SLC7A11, TM4SF1
## PC_ 2
## Positive: TYROBP, LAPTM5, FCER1G, AIF1, SRGN, HLA-DPA1, OLR1, HLA-DPB1, HLA-DRA, RGS1
## HLA-DRB5, HLA-DRB1, FCGR2A, LYZ, IFI30, CXCR4, FCGR3A, CLEC7A, HLA-DQA1, CD74
## HLA-DMA, ITGB2, HLA-DQB1, HLA-DMB, CTSS, C1QC, MS4A7, SAMSN1, GPR183, IL18
## Negative: TIMP1, APOD, IGFBP4, TPM1, COL1A1, BGN, FSTL1, EFEMP2, PRRX1, PDLIM2
## COL6A2, COL3A1, COL1A2, COL5A2, RARRES2, CALD1, PCOLCE, MXRA8, THY1, TIMP3
## C1R, LUM, GGT5, COL6A1, COL6A3, SPON2, CTHRC1, DCN, SFRP2, TAGLN
## PC_ 3
## Positive: PTPRB, ADGRL4, CALCRL, ITGB4, AQP1, PLVAP, EMCN, PCDH17, PCAT19, TIE1
## ECSCR, PECAM1, S1PR1, PALMD, CLDN5, NOVA2, MMRN2, SOX18, CLEC14A, MYCT1
## CDH5, EFNB2, SOX7, ADAMTS9, CD34, CYYR1, JAM2, APOLD1, FLT1, ADCY4
## Negative: SFRP2, LUM, DCN, THBS2, RARRES2, COL6A3, GLT8D2, SPON2, GXYLT2, SNAI2
## CTSK, ISLR, RCN3, MXRA8, PRRX2, PODN, BICC1, NTM, NDN, C1S
## MRC2, LRRC17, SFRP4, ANTXR1, COL3A1, ITGB5, LRP1, INHBA, COL1A1, COL10A1
## PC_ 4
## Positive: GPC3, BMPER, ZNF462, GDF10, GFPT2, ADH1B, LARP6, ANK2, SRPX, CHRDL1
## NGFR, SVEP1, CFD, HELLPAR, BOC, SLIT2, MFAP4, MEG3, GAL, PTGIS
## ABCA8, DPT, MFAP5, F10, IGFBP6, ABI3BP, KLHL29, EFEMP1, TNXB, FBLN1
## Negative: DKK3, HIGD1B, CDH6, NDUFA4L2, SEMA5B, PARD6G, TBX2, SEPTIN4, NOTCH3, CSRP2
## LMOD1, PGF, GJC1, COX4I2, TBX2-AS1, ACTA2, MUSTN1, PLXDC1, MCAM, PPP1R14A
## OLFML2A, FOXS1, MYLK, PTP4A3, TINAGL1, CNN1, TGFB1I1, KANK2, TPM2, AOC3
## PC_ 5
## Positive: COL10A1, GRP, COL8A1, COMP, NTM, RGS4, LRRC17, LAMP5, NKD2, CMTM3
## TMEM200A, SFRP4, COL8A2, PODNL1, SULF1, UBE2E3, SMCO4, LOXL1, APCDD1, MMP11
## ART5, ROR2, IGFBP3, COL11A1, ECRG4, GXYLT2, JAM2, ADAM12, VSNL1, LOXL2
## Negative: KDM6B, FOSB, GADD45B, MYC, MAFF, IER2, VEGFA, FOS, RNF152, ZFP36
## LMNA, JUN, BTG2, EGR1, DNAJB1, IRF1, JUNB, AC020916.1, CEBPB, HSPH1
## NR4A1, COX4I2, ARC, TIPARP, CCNL1, SOD3, ANGPTL4, KLF6, WEE1, NFKBIA
###Elbowplot
ElbowPlot(cancer_obj)
The elbow plot shows the variance explained by each principal component (PC), with a steep drop after the first few PCs. The elbow at PC 4 or 5 indicates that these components capture most of our cell variability. This supports our PCA results, where only the first 5 PCs have significant positive and negative.
Objective 1 clustring Analysis and identififcation of the types of cells present in the tumor microenvironment.
cancer_obj <- FindNeighbors(cancer_obj, dims =1:10)
## Computing nearest neighbor graph
## Computing SNN
cancer_obj <- FindClusters(cancer_obj, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 170
## Number of edges: 3639
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7949
## Number of communities: 5
## Elapsed time: 0 seconds
####Vizulizing using UMAP
cancer_obj <- RunUMAP(cancer_obj,dims= 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 16:42:36 UMAP embedding parameters a = 0.9922 b = 1.112
## 16:42:36 Read 170 rows and found 10 numeric columns
## 16:42:36 Using Annoy for neighbor search, n_neighbors = 30
## 16:42:36 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 16:42:36 Writing NN index file to temp file C:\Users\iyand\AppData\Local\Temp\RtmpwPBGoy\file8a9010607bdb
## 16:42:36 Searching Annoy index using 1 thread, search_k = 3000
## 16:42:36 Annoy recall = 100%
## 16:42:37 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 16:42:37 Initializing from normalized Laplacian + noise (using RSpectra)
## 16:42:37 Commencing optimization for 500 epochs, with 4890 positive edges
## 16:42:38 Optimization finished
DimPlot(cancer_obj, reduction = "umap", label = TRUE)
-Small cluster was observed from cluster 0-4. Possible reason include strict quality control
-Marker Analysis
marker <- FindAllMarkers(cancer_obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
###display top 10 markers
top_markers <- marker %>% group_by(cluster) %>% top_n(n= 10, wt = avg_log2FC)
FeaturePlot(cancer_obj, features = c("EPCAM", "CD3D", "PTPRC", "CD68", "CD163", "MS4A1", "KRT19"))
## Warning: The following requested variables were not found: CD3D, MS4A1
EPCAM: Epithelial Cells CD3D: T Cells PTPRC (CD45): Leukocytes/Immune Cells CD68: Macrophages/Monocytes CD163: Tissue-Resident Macrophages/M2 Macrophages MS4A1 (CD20): B Cells KRT19: Epithelial Cells/Cholangiocytes
-A know marker (“EPCAM”, “CD3D”, “PTPRC”, “CD68”, “CD163”, “MS4A1”, “KRT19”) was used to have idea of what the cluster entails. – Annotation of the cluster
cancer_obj <- RenameIdents(cancer_obj,
`0` = "Tumor cells",
`1` = "Tumor cells",
`2` = "Tumor cells",
`3` = "Macrophages",
`4` = "Tumor cells")
# Re-plot UMAP with cell type labels
DimPlot(cancer_obj, reduction = "umap", label = TRUE, group.by = "ident", pt.size = 0.5)
FeaturePlot(cancer_obj, features = "EPCAM")
FeaturePlot(cancer_obj, features = c("FSTL1", "SPARC", "CXCR4", "GPC3", "IGFBP4", "VIM"), cols = c("lightgrey", "red"), pt.size = 1)
he high expression of the genes you mentioned—FSTL1, SPARC, CXCR4, GPC3, IGFBP4, and VIM—is often associated with poor prognosis and increased metastasis in various cancers.They are involved in processes like inflammation, extracellular matrix remodeling, cell migration, and the epithelial-to-mesenchymal transition (EMT), which are all critical for cancer progression and spread to distant organs. These markers are commonly used to assess aggressive cancer phenotypes.
FSTL1: Promotes inflammation and angiogenesis, contributing to tumor growth and invasiveness.
SPARC: Facilitates cancer cell invasion through modulation of the extracellular matrix.
CXCR4: Drives tumor cell migration toward metastasis-prone sites, enhancing metastatic potential.
GPC3: Regulates cell growth and signaling, associated with increased tumor aggressiveness.
IGFBP4: Modulates insulin-like growth factors, with alterations linked to poor cancer prognosis.
VIM: Indicates epithelial-to-mesenchymal transition (EMT), associated with increased invasiveness and metastasis.
immune_markers <- c("TYROBP", "FCER1G", "AIF1", "HLA-DPA1", "HLA-DPB1",
"HLA-DRA", "FCGR3A", "CXCR4", "GPC3",
"HLA-DRB1", "HLA-DRB5", "MS4A7", "C1QC")
for (marker in immune_markers) {
p <- FeaturePlot(cancer_obj, features = marker, cols = c("lightgrey", "blue"),
pt.size = 0.5) +
ggtitle(marker) # Add a title with the marker name
print(p) # Print the plot
}
- All the Immune mArkers are clustered in cluster 3
# Identify differentially expressed genes
deg_results <- FindAllMarkers(cancer_obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster Tumor cells
## Calculating cluster Macrophages
top_deg <- deg_results %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
malignancy_genes <- c("TP53", "KRAS", "MYC", "VEGFA")
malignancy_degs <- deg_results[deg_results$gene %in% malignancy_genes, ]
immune_genes <- c("CD3D", "CD68", "IL6", "TNF")
immune_degs <- deg_results[deg_results$gene %in% immune_genes, ]
stromal_genes <- c("COL1A1", "FAP", "ACTA2")
stromal_degs <- deg_results[deg_results$gene %in% stromal_genes, ]
combined_degs <- rbind(malignancy_degs, immune_degs, stromal_degs)
combined_degs <- combined_degs[!duplicated(combined_degs$gene), ]
expression_data <- GetAssayData(cancer_obj, slot = "data")
## Warning: The `slot` argument of `GetAssayData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
expression_matrix <- as.matrix(expression_data[combined_degs$gene, ])
library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.20.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
##
## If you use it in published research, please cite either one:
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
## genomic data. Bioinformatics 2016.
##
##
## The new InteractiveComplexHeatmap package can directly export static
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
##
## This message can be suppressed by:
## suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
library(circlize)
## ========================================
## circlize version 0.4.16
## CRAN page: https://cran.r-project.org/package=circlize
## Github page: https://github.com/jokergoo/circlize
## Documentation: https://jokergoo.github.io/circlize_book/book/
##
## If you use it in published research, please cite:
## Gu, Z. circlize implements and enhances circular visualization
## in R. Bioinformatics 2014.
##
## This message can be suppressed by:
## suppressPackageStartupMessages(library(circlize))
## ========================================
color_palette <- colorRamp2(c(-2, 0, 2), c("blue", "yellow", "red"))
Heatmap(expression_matrix,
name = "Expression",
row_title = "Genes",
column_title = "Clusters",
show_row_names = TRUE,
show_column_names = TRUE,
cluster_rows = TRUE,
cluster_columns = TRUE,
row_names_gp = gpar(fontsize = 7),
column_names_gp = gpar(fontsize = 4),
heatmap_legend_param = list(title_gp = gpar(fontsize = 4),
labels_gp = gpar(fontsize = 4)),
col = color_palette)
important_marker <- c( "TP53", "KRAS", "MYC", "VEGFA", "CD3D", "CD68", "IL6", "TNF", "COL1A1", "FAP", "ACTA2")
avg_expression <- AverageExpression(cancer_obj, features = important_marker)
## As of Seurat v5, we recommend using AggregateExpression to perform pseudo-bulk analysis.
## This message is displayed once per session.
## Warning: The following 1 features were not found in the RNA assay: CD3D
avg_expression_df <- as.data.frame(avg_expression$RNA)
print(avg_expression_df)
## Tumor cells Macrophages
## FAP 0.2229089 0.02416930
## TNF 0.0000000 0.60091113
## VEGFA 1.1480521 2.52611974
## IL6 0.1399495 0.05308476
## MYC 2.2739208 0.62734151
## ACTA2 1.8467000 0.02883157
## KRAS 0.4292466 0.57376758
## CD68 0.2089820 5.49513098
## TP53 0.1157107 0.15721448
## COL1A1 5.4727426 0.00000000
gene_data <- data.frame(
Gene = c("FAP", "TNF", "VEGFA", "IL6", "MYC", "ACTA2", "KRAS", "CD68", "TP53", "COL1A1"),
Tumor_cells1 = c(0, 0, 0.724330047, 0, 1.101840338, 0.009675718, 0.221122856, 0.276820269, 0.145754756, 0),
tumor_cells2 = c(0.2564637, 0, 1.6631963, 0.2406319, 4.1103628, 3.9591513, 0.5056775, 0.1589378, 0.1911437, 8.0474658),
tumor_cells3 = c(0.56887958, 0, 1.40652164, 0.27335534, 1.72099753, 2.77043800, 0.46335945, 0.18155172, 0.06162919, 12.12377773),
Macrophages = c(0.02416930, 0.60091113, 2.52611974, 0.05308476, 0.62734151, 0.02883157, 0.57376758, 5.49513098, 0.15721448, 0),
Tumorcell4 = c(0, 0, 0.5800464, 0, 2.2887723, 0, 0.6798487, 0.2085516, 0, 0)
)
library(reshape2)
gene_data_melt <- melt(gene_data, id.vars = "Gene", variable.name = "Cell_Type", value.name = "Expression")
# View the reshaped data
head(gene_data_melt)
## Gene Cell_Type Expression
## 1 FAP Tumor_cells1 0.000000000
## 2 TNF Tumor_cells1 0.000000000
## 3 VEGFA Tumor_cells1 0.724330047
## 4 IL6 Tumor_cells1 0.000000000
## 5 MYC Tumor_cells1 1.101840338
## 6 ACTA2 Tumor_cells1 0.009675718
The expression analysis of key genes in tumor cells reveals a distinctive profile characterized by high levels of MYC (1.10) and VEGFA (0.72), indicating aggressive tumor proliferation and significant angiogenic activity, respectively. In contrast, genes associated with inflammatory responses, such as TNF and IL6, show no expression (0.00), suggesting a lack of inflammatory signaling within the tumor microenvironment. Additionally, FAP (0.00) indicates minimal fibroblast activity, while ACTA2 (0.01) is expressed at low levels, reflecting a limited presence of smooth muscle or myofibroblasts. This expression profile suggests a tumor environment that favors growth and vascularization through MYC and VEGFA, while exhibiting an absence of typical inflammatory and remodeling markers, potentially indicating a unique or altered tumor microenvironment.
library(ggplot2)
ggplot(gene_data_melt, aes(x = Gene, y = Expression, fill = Cell_Type)) +
geom_boxplot() +
theme_bw() +
labs(title = "Gene Expression Across Different Cell Types",
x = "Gene",
y = "Expression Value") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(gene_data_melt, aes(x = Gene, y = Expression)) +
geom_boxplot(fill = "lightblue", color = "darkblue", outlier.color = "red") + # Boxplot with whiskers
stat_summary(fun = mean, geom = "point", shape = 20, size = 4, color = "darkred", fill = "red") + # Show the average
theme_minimal() + # Clean theme
labs(title = "Gene Expression Boxplot Across Different Cell Types",
x = "Gene Symbol",
y = "Expression Value") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
- COL1A1, FAP, and ACTA2 are involved in ECM remodeling and tumor
invasion, with higher expression often linked to poor prognosis and
increased metastatic potential. KRAS and MYC are key oncogenes driving
cancer cell proliferation and metastasis, while mutations in TP53 lead
to genomic instability and immune evasion. VEGFA promotes angiogenesis,
supporting tumor growth and metastasis, and CD68 (macrophage marker) and
IL6 contribute to immune suppression within the tumor microenvironment.
Overall, these genes either directly promote cancer progression or shape
an immunosuppressive microenvironment, making them critical targets for
cancer therapy.
My comprehensive analysis of the single-cell RNA sequencing data from Invasive Ductal Carcinoma (IDC) has illuminated critical aspects of the tumor microenvironment that directly impact health status, prognosis, metastasis, and immune dynamics. The findings indicate a complex interplay between tumor cells and the surrounding stromal and immune cells, which together shape the tumor’s aggressiveness and overall behavior.
The elevated expression of key oncogenes such as MYC and KRAS, along with angiogenic factors like VEGFA, underscores the aggressive nature of the tumor and its potential for metastasis. These markers not only signify poor prognosis but also suggest that patients may experience rapid disease progression and increased metastatic spread. Furthermore, the expression of COL1A1, FAP, and ACTA2 highlights the tumor’s capability for extracellular matrix remodeling, which is often associated with enhanced invasive characteristics, further complicating treatment outcomes.
Immune dynamics within the tumor microenvironment were characterized by the presence of immune suppressive markers such as CD68 and IL6. This suggests an immunosuppressive state that may hinder effective anti-tumor immune responses, allowing the tumor to evade immune detection and destruction. Understanding these immune interactions is critical for developing therapeutic strategies that could potentially reinvigorate the immune response against tumor cells.
Given these findings, the identified markers offer promising targets for therapy. Strategies that aim to inhibit pathways involving MYC, VEGFA, and immune suppressive factors could improve patient outcomes. Additionally, therapies targeting the extracellular matrix components may help in reducing tumor invasion and metastasis.