Objectives:

Loading Required Libraries

importing our single cell .H5 data

cancer_data <- Read10X_h5("C:/Users/iyand/Downloads/SCT2/Breast_Cancer_3p_LT_filtered_feature_bc_matrix.h5")
####Creating a Seurate object
cancer_obj <- CreateSeuratObject(counts= cancer_data, min.cells = 3, min.features = 200)

print(cancer_obj)
## An object of class Seurat 
## 17921 features across 661 samples within 1 assay 
## Active assay: RNA (17921 features, 0 variable features)
##  1 layer present: counts

Quality Control

#### percentage of mitochondrial gene
cancer_obj [["percent.mt"]] <- PercentageFeatureSet(cancer_obj, pattern = "^MT-")
VlnPlot(cancer_obj,  features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol= 3)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.

nFeature_RNA: Shows the distribution of genes detected per cell, with most cells having 2,000-5,000 genes. nCount_RNA: Indicates total RNA counts per cell, with most under 30,000 but some exceeding 100,000. percent.mt: Represents the percentage of mitochondrial gene expression, with most cells under 25%, though some have higher levels. - Elevated mitochondrial gene expression may indicate stressed or dying cells, while extreme values in gene or RNA counts suggest the presence of poor-quality cells or doublets.

FeatureScatter(cancer_obj, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")+
  geom_smooth(method ="lm")
## `geom_smooth()` using formula = 'y ~ x'

- a strong positive correlation (0.92) between nCount_RNA (total RNA count per cell) and nFeature_RNA (number of genes detected per cell), which is expected in single-cell RNA sequencing data. However, there are some outliers with very high RNA and gene counts which probably indicating DOUBLETS and some low gene and RNA count indicating dying or extrememly stressed cells, further supporting why we see some cell having higher mitochondrial RNA.

cancer_obj <- subset(cancer_obj, subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 20)
#Normalization of Data
cancer_obj <- NormalizeData(cancer_obj)
## Normalizing layer: counts
##### Identification of Variable Features
cancer_obj <-  FindVariableFeatures(cancer_obj, selection.method = "vst", nfeatures = 2000)
## Finding variable features for layer counts
####
top10 <- head(VariableFeatures(cancer_obj), 10)
variable_plot <- VariableFeaturePlot(cancer_obj)
####gene labeling
LabelPoints(plot = variable_plot, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

The expression of these genes fluctuates significantly between individual cells, compared to other genes which show more consistent expression.

#dimentionality Reduction
 cancer_obj <- ScaleData(cancer_obj)
## Centering and scaling data matrix
### PCA 
cancer_obj <- RunPCA(cancer_obj, features = VariableFeatures(cancer_obj))
## PC_ 1 
## Positive:  GSTP1, VIM, COL6A2, FSTL1, CALD1, MARCKS, IGFBP7, IFITM3, CAVIN1, SPARC 
##     YBX3, COL6A1, BGN, GSN, COL1A2, ARID5B, EFEMP2, NNMT, PRRX1, FBN1 
##     COL3A1, VCAN, COL1A1, EMP1, CCDC80, CCN2, COL6A3, THBS2, TCF7L2, GGT5 
## Negative:  CARTPT, KRT18, XBP1, WFDC2, NPY1R, AGR3, PCSK1, CLEC3A, SERPINA3, CLDN4 
##     PIP, ELF3, TRPA1, AGR2, CHGB, SERPINA6, CFB, APOD, PDZK1, SLC7A2 
##     AQP3, VGF, CA12, KCNE4, CP, IRX2, DSP, NTS, SLC7A11, TM4SF1 
## PC_ 2 
## Positive:  TYROBP, LAPTM5, FCER1G, AIF1, SRGN, HLA-DPA1, OLR1, HLA-DPB1, HLA-DRA, RGS1 
##     HLA-DRB5, HLA-DRB1, FCGR2A, LYZ, IFI30, CXCR4, FCGR3A, CLEC7A, HLA-DQA1, CD74 
##     HLA-DMA, ITGB2, HLA-DQB1, HLA-DMB, CTSS, C1QC, MS4A7, SAMSN1, GPR183, IL18 
## Negative:  TIMP1, APOD, IGFBP4, TPM1, COL1A1, BGN, FSTL1, EFEMP2, PRRX1, PDLIM2 
##     COL6A2, COL3A1, COL1A2, COL5A2, RARRES2, CALD1, PCOLCE, MXRA8, THY1, TIMP3 
##     C1R, LUM, GGT5, COL6A1, COL6A3, SPON2, CTHRC1, DCN, SFRP2, TAGLN 
## PC_ 3 
## Positive:  PTPRB, ADGRL4, CALCRL, ITGB4, AQP1, PLVAP, EMCN, PCDH17, PCAT19, TIE1 
##     ECSCR, PECAM1, S1PR1, PALMD, CLDN5, NOVA2, MMRN2, SOX18, CLEC14A, MYCT1 
##     CDH5, EFNB2, SOX7, ADAMTS9, CD34, CYYR1, JAM2, APOLD1, FLT1, ADCY4 
## Negative:  SFRP2, LUM, DCN, THBS2, RARRES2, COL6A3, GLT8D2, SPON2, GXYLT2, SNAI2 
##     CTSK, ISLR, RCN3, MXRA8, PRRX2, PODN, BICC1, NTM, NDN, C1S 
##     MRC2, LRRC17, SFRP4, ANTXR1, COL3A1, ITGB5, LRP1, INHBA, COL1A1, COL10A1 
## PC_ 4 
## Positive:  GPC3, BMPER, ZNF462, GDF10, GFPT2, ADH1B, LARP6, ANK2, SRPX, CHRDL1 
##     NGFR, SVEP1, CFD, HELLPAR, BOC, SLIT2, MFAP4, MEG3, GAL, PTGIS 
##     ABCA8, DPT, MFAP5, F10, IGFBP6, ABI3BP, KLHL29, EFEMP1, TNXB, FBLN1 
## Negative:  DKK3, HIGD1B, CDH6, NDUFA4L2, SEMA5B, PARD6G, TBX2, SEPTIN4, NOTCH3, CSRP2 
##     LMOD1, PGF, GJC1, COX4I2, TBX2-AS1, ACTA2, MUSTN1, PLXDC1, MCAM, PPP1R14A 
##     OLFML2A, FOXS1, MYLK, PTP4A3, TINAGL1, CNN1, TGFB1I1, KANK2, TPM2, AOC3 
## PC_ 5 
## Positive:  COL10A1, GRP, COL8A1, COMP, NTM, RGS4, LRRC17, LAMP5, NKD2, CMTM3 
##     TMEM200A, SFRP4, COL8A2, PODNL1, SULF1, UBE2E3, SMCO4, LOXL1, APCDD1, MMP11 
##     ART5, ROR2, IGFBP3, COL11A1, ECRG4, GXYLT2, JAM2, ADAM12, VSNL1, LOXL2 
## Negative:  KDM6B, FOSB, GADD45B, MYC, MAFF, IER2, VEGFA, FOS, RNF152, ZFP36 
##     LMNA, JUN, BTG2, EGR1, DNAJB1, IRF1, JUNB, AC020916.1, CEBPB, HSPH1 
##     NR4A1, COX4I2, ARC, TIPARP, CCNL1, SOD3, ANGPTL4, KLF6, WEE1, NFKBIA
###Elbowplot
ElbowPlot(cancer_obj)

cancer_obj <- FindNeighbors(cancer_obj, dims =1:10)
## Computing nearest neighbor graph
## Computing SNN
cancer_obj <- FindClusters(cancer_obj, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 170
## Number of edges: 3639
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7949
## Number of communities: 5
## Elapsed time: 0 seconds
####Vizulizing using  UMAP
cancer_obj <- RunUMAP(cancer_obj,dims= 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 16:42:36 UMAP embedding parameters a = 0.9922 b = 1.112
## 16:42:36 Read 170 rows and found 10 numeric columns
## 16:42:36 Using Annoy for neighbor search, n_neighbors = 30
## 16:42:36 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 16:42:36 Writing NN index file to temp file C:\Users\iyand\AppData\Local\Temp\RtmpwPBGoy\file8a9010607bdb
## 16:42:36 Searching Annoy index using 1 thread, search_k = 3000
## 16:42:36 Annoy recall = 100%
## 16:42:37 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 16:42:37 Initializing from normalized Laplacian + noise (using RSpectra)
## 16:42:37 Commencing optimization for 500 epochs, with 4890 positive edges
## 16:42:38 Optimization finished
DimPlot(cancer_obj, reduction = "umap", label = TRUE)

-Small cluster was observed from cluster 0-4. Possible reason include strict quality control

-Marker Analysis

marker <- FindAllMarkers(cancer_obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more 
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
###display top 10 markers
top_markers <- marker %>% group_by(cluster) %>% top_n(n= 10, wt = avg_log2FC)
FeaturePlot(cancer_obj, features = c("EPCAM", "CD3D", "PTPRC", "CD68", "CD163", "MS4A1", "KRT19"))
## Warning: The following requested variables were not found: CD3D, MS4A1

EPCAM: Epithelial Cells CD3D: T Cells PTPRC (CD45): Leukocytes/Immune Cells CD68: Macrophages/Monocytes CD163: Tissue-Resident Macrophages/M2 Macrophages MS4A1 (CD20): B Cells KRT19: Epithelial Cells/Cholangiocytes

-A know marker (“EPCAM”, “CD3D”, “PTPRC”, “CD68”, “CD163”, “MS4A1”, “KRT19”) was used to have idea of what the cluster entails. – Annotation of the cluster

cancer_obj <- RenameIdents(cancer_obj, 
                           `0` = "Tumor cells", 
                           `1` = "Tumor cells", 
                           `2` = "Tumor cells", 
                           `3` = "Macrophages", 
                           `4` = "Tumor cells")

# Re-plot UMAP with cell type labels
DimPlot(cancer_obj, reduction = "umap", label = TRUE, group.by = "ident", pt.size = 0.5)

FeaturePlot(cancer_obj, features = "EPCAM")

FeaturePlot(cancer_obj, features = c("FSTL1", "SPARC", "CXCR4", "GPC3", "IGFBP4", "VIM"), cols = c("lightgrey", "red"), pt.size = 1)

immune_markers <- c("TYROBP", "FCER1G", "AIF1", "HLA-DPA1", "HLA-DPB1", 
                    "HLA-DRA", "FCGR3A", "CXCR4", "GPC3", 
                    "HLA-DRB1", "HLA-DRB5", "MS4A7", "C1QC")
for (marker in immune_markers) {
  p <- FeaturePlot(cancer_obj, features = marker, cols = c("lightgrey", "blue"), 
                   pt.size = 0.5) +
    ggtitle(marker) # Add a title with the marker name
  print(p) # Print the plot
}

- All the Immune mArkers are clustered in cluster 3

# Identify differentially expressed genes
deg_results <- FindAllMarkers(cancer_obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster Tumor cells
## Calculating cluster Macrophages
top_deg <- deg_results %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)

malignancy_genes <- c("TP53", "KRAS", "MYC", "VEGFA") 

malignancy_degs <- deg_results[deg_results$gene %in% malignancy_genes, ]

immune_genes <- c("CD3D", "CD68", "IL6", "TNF")

immune_degs <- deg_results[deg_results$gene %in% immune_genes, ]

stromal_genes <- c("COL1A1", "FAP", "ACTA2")

stromal_degs <- deg_results[deg_results$gene %in% stromal_genes, ]

combined_degs <- rbind(malignancy_degs, immune_degs, stromal_degs)

combined_degs <- combined_degs[!duplicated(combined_degs$gene), ]

expression_data <- GetAssayData(cancer_obj, slot = "data") 
## Warning: The `slot` argument of `GetAssayData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
expression_matrix <- as.matrix(expression_data[combined_degs$gene, ])

library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.20.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
## 
## If you use it in published research, please cite either one:
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##     genomic data. Bioinformatics 2016.
## 
## 
## The new InteractiveComplexHeatmap package can directly export static 
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
library(circlize)
## ========================================
## circlize version 0.4.16
## CRAN page: https://cran.r-project.org/package=circlize
## Github page: https://github.com/jokergoo/circlize
## Documentation: https://jokergoo.github.io/circlize_book/book/
## 
## If you use it in published research, please cite:
## Gu, Z. circlize implements and enhances circular visualization
##   in R. Bioinformatics 2014.
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(circlize))
## ========================================
color_palette <- colorRamp2(c(-2, 0, 2), c("blue", "yellow", "red"))

Heatmap(expression_matrix, 
        name = "Expression", 
        row_title = "Genes", 
        column_title = "Clusters",
        show_row_names = TRUE,               
        show_column_names = TRUE,            
        cluster_rows = TRUE,                 
        cluster_columns = TRUE,              
        row_names_gp = gpar(fontsize = 7),  
        column_names_gp = gpar(fontsize = 4),
             
        heatmap_legend_param = list(title_gp = gpar(fontsize = 4), 
                                    labels_gp = gpar(fontsize = 4)), 
        col = color_palette)   

important_marker <- c( "TP53", "KRAS", "MYC", "VEGFA", "CD3D", "CD68", "IL6", "TNF", "COL1A1", "FAP", "ACTA2")

avg_expression <- AverageExpression(cancer_obj, features = important_marker)
## As of Seurat v5, we recommend using AggregateExpression to perform pseudo-bulk analysis.
## This message is displayed once per session.
## Warning: The following 1 features were not found in the RNA assay: CD3D
avg_expression_df <- as.data.frame(avg_expression$RNA)
print(avg_expression_df)
##        Tumor cells Macrophages
## FAP      0.2229089  0.02416930
## TNF      0.0000000  0.60091113
## VEGFA    1.1480521  2.52611974
## IL6      0.1399495  0.05308476
## MYC      2.2739208  0.62734151
## ACTA2    1.8467000  0.02883157
## KRAS     0.4292466  0.57376758
## CD68     0.2089820  5.49513098
## TP53     0.1157107  0.15721448
## COL1A1   5.4727426  0.00000000
gene_data <- data.frame(
  Gene = c("FAP", "TNF", "VEGFA", "IL6", "MYC", "ACTA2", "KRAS", "CD68", "TP53", "COL1A1"),
  Tumor_cells1 = c(0, 0, 0.724330047, 0, 1.101840338, 0.009675718, 0.221122856, 0.276820269, 0.145754756, 0),
  tumor_cells2 = c(0.2564637, 0, 1.6631963, 0.2406319, 4.1103628, 3.9591513, 0.5056775, 0.1589378, 0.1911437, 8.0474658),
  tumor_cells3 = c(0.56887958, 0, 1.40652164, 0.27335534, 1.72099753, 2.77043800, 0.46335945, 0.18155172, 0.06162919, 12.12377773),
  Macrophages = c(0.02416930, 0.60091113, 2.52611974, 0.05308476, 0.62734151, 0.02883157, 0.57376758, 5.49513098, 0.15721448, 0),
  Tumorcell4 = c(0, 0, 0.5800464, 0, 2.2887723, 0, 0.6798487, 0.2085516, 0, 0)
)
library(reshape2)

gene_data_melt <- melt(gene_data, id.vars = "Gene", variable.name = "Cell_Type", value.name = "Expression")

# View the reshaped data
head(gene_data_melt)
##    Gene    Cell_Type  Expression
## 1   FAP Tumor_cells1 0.000000000
## 2   TNF Tumor_cells1 0.000000000
## 3 VEGFA Tumor_cells1 0.724330047
## 4   IL6 Tumor_cells1 0.000000000
## 5   MYC Tumor_cells1 1.101840338
## 6 ACTA2 Tumor_cells1 0.009675718

The expression analysis of key genes in tumor cells reveals a distinctive profile characterized by high levels of MYC (1.10) and VEGFA (0.72), indicating aggressive tumor proliferation and significant angiogenic activity, respectively. In contrast, genes associated with inflammatory responses, such as TNF and IL6, show no expression (0.00), suggesting a lack of inflammatory signaling within the tumor microenvironment. Additionally, FAP (0.00) indicates minimal fibroblast activity, while ACTA2 (0.01) is expressed at low levels, reflecting a limited presence of smooth muscle or myofibroblasts. This expression profile suggests a tumor environment that favors growth and vascularization through MYC and VEGFA, while exhibiting an absence of typical inflammatory and remodeling markers, potentially indicating a unique or altered tumor microenvironment.

library(ggplot2)

ggplot(gene_data_melt, aes(x = Gene, y = Expression, fill = Cell_Type)) +
  geom_boxplot() +
  theme_bw() +
  labs(title = "Gene Expression Across Different Cell Types",
       x = "Gene",
       y = "Expression Value") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(gene_data_melt, aes(x = Gene, y = Expression)) +
  geom_boxplot(fill = "lightblue", color = "darkblue", outlier.color = "red") +  # Boxplot with whiskers
  stat_summary(fun = mean, geom = "point", shape = 20, size = 4, color = "darkred", fill = "red") +  # Show the average
  theme_minimal() +  # Clean theme
  labs(title = "Gene Expression Boxplot Across Different Cell Types",
       x = "Gene Symbol",
       y = "Expression Value") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

- COL1A1, FAP, and ACTA2 are involved in ECM remodeling and tumor invasion, with higher expression often linked to poor prognosis and increased metastatic potential. KRAS and MYC are key oncogenes driving cancer cell proliferation and metastasis, while mutations in TP53 lead to genomic instability and immune evasion. VEGFA promotes angiogenesis, supporting tumor growth and metastasis, and CD68 (macrophage marker) and IL6 contribute to immune suppression within the tumor microenvironment. Overall, these genes either directly promote cancer progression or shape an immunosuppressive microenvironment, making them critical targets for cancer therapy.

My comprehensive analysis of the single-cell RNA sequencing data from Invasive Ductal Carcinoma (IDC) has illuminated critical aspects of the tumor microenvironment that directly impact health status, prognosis, metastasis, and immune dynamics. The findings indicate a complex interplay between tumor cells and the surrounding stromal and immune cells, which together shape the tumor’s aggressiveness and overall behavior.

The elevated expression of key oncogenes such as MYC and KRAS, along with angiogenic factors like VEGFA, underscores the aggressive nature of the tumor and its potential for metastasis. These markers not only signify poor prognosis but also suggest that patients may experience rapid disease progression and increased metastatic spread. Furthermore, the expression of COL1A1, FAP, and ACTA2 highlights the tumor’s capability for extracellular matrix remodeling, which is often associated with enhanced invasive characteristics, further complicating treatment outcomes.

Immune dynamics within the tumor microenvironment were characterized by the presence of immune suppressive markers such as CD68 and IL6. This suggests an immunosuppressive state that may hinder effective anti-tumor immune responses, allowing the tumor to evade immune detection and destruction. Understanding these immune interactions is critical for developing therapeutic strategies that could potentially reinvigorate the immune response against tumor cells.

Given these findings, the identified markers offer promising targets for therapy. Strategies that aim to inhibit pathways involving MYC, VEGFA, and immune suppressive factors could improve patient outcomes. Additionally, therapies targeting the extracellular matrix components may help in reducing tumor invasion and metastasis.