This document describes the database structure, features, and implementation details of the EvidenCellMarker Database, organized by page.

1. Data overview

All paths below are given relative to the root directory EvidenCellMarker/.

  1. Path: data/mydata.RDS

  2. Data structure

    > mydata <- readRDS("./EvidenCellMarker/data/mydata.RDS")
    > str(mydata)
    'data.frame':    890296 obs. of  29 variables:
    $ top_level             : chr  "brain" "lung" "blood" "breast" ...
    $ top_level_uberon_id   : chr  "UBERON:0000955" "UBERON:0002048" "UBERON:0000178" "UBERON:0000310" ...
    $ top_level_mixed_name  : chr  "NULL" "NULL" "NULL" "NULL" ...
    $ tissue_class          : chr  "Nervous System" "Lung" "Blood" "Breast" ...
    $ cell_name_standardized: chr  "neuronal cell" "Nucleocapsid-specific B cell" "N-specific B cell" "neoplastic cell" ...
    $ cell_name_cl_id       : chr  "NULL" "NULL" "NULL" "CL:0001063" ...
    $ cell_name_orig        : chr  "neuronal" "N-specific B cells" "N-specific B cells" "tumour cell" ...
    $ disease_type_do       : chr  "glioblastoma" "COVID-19" "COVID-19" "breast cancer" ...
    $ disease_type_doid     : chr  "DOID:3068" "DOID:0080600" "DOID:0080600" "DOID:1612" ...
    $ disease_type          : chr  "Glioblastoma" "COVID-19 (SARS-CoV-2 infection)" "COVID-19 (SARS-CoV-2 infection)" "Breast cancer" ...
    $ marker                : chr  "Tuj 1" "N protein" "N" "N-cadherin" ...
    $ marker_polarity       : chr  "positive" "positive" "positive" "positive" ...
    $ marker_corrected      : chr  "NULL" "NULL" "NULL" "NULL" ...
    $ gene_symbol           : chr  "A1BG" "NAT2" "NAT2" "CDH2" ...
    $ gene_type             : chr  "protein-coding" "protein-coding" "protein-coding" "protein-coding" ...
    $ gene_full_name        : chr  "alpha-1-B glycoprotein" "N-acetyltransferase 2" "N-acetyltransferase 2" "cadherin 2" ...
    $ gene_aliases          : chr  "A1B, ABG, GAB, HYST2477" "AAC2, NAT-2, PNAT" "AAC2, NAT-2, PNAT" "ACOGS, ADHD8, ARVD14, CD325, CDHN, CDw325, NCAD" ...
    $ protein_id            : chr  "unknown" "unknown" "unknown" "P19022" ...
    $ protein_name          : chr  "unknown" "unknown" "unknown" "Cadherin-2" ...
    $ entrez_id             : chr  "1" "10" "10" "1000" ...
    $ ensembl_id            : chr  "ENSG00000121410" "ENSG00000156006" "ENSG00000156006" "ENSG00000170558" ...
    $ species               : chr  "Human" "Human" "Human" "Human" ...
    $ pmid                  : chr  "25871395" "37884506" "36560669" "33801519" ...
    $ pmcid                 : chr  "PMC4496184" "PMC10603102" "PMC9785906" "PMC7958863" ...
    $ journal               : chr  "Oncotarget" "Nature communications" "Viruses" "International journal of molecular sciences" ...
    $ year                  : chr  "2015" "2023" "2022" "2021" ...
    $ title                 : chr  "Targeted therapy of glioblastoma stem-like cells and tumor non-stem cells using cetuximab-conjugated iron-oxide nanoparticles." "Respiratory mucosal immune memory to SARS-CoV-2 after infection and vaccination." "CXCL12 and CXCL13 Cytokine Serum Levels Are Associated with the Magnitude and the Quality of SARS-CoV-2 Humoral Responses." "Heterogeneous Manifestations of Epithelial-Mesenchymal Plasticity of Circulating Tumor Cells in Breast Cancer Patients." ...
    $ section               : chr  "RESULTS; Multilineage differentiation and tumorigenicity of human GBM neurospheres" "Methods; B cells immunophenotyping and detection of SARS-CoV-2 specific B cells" "3. Results; 3.3. Frequencies of Spike, RBD, and N-Specific B Cell Responses" "4. Materials and Methods; 4.3. Multiplex Immunofluorescence (mIF) Staining" ...
    $ source                : chr  "N08-74, N08-30, and N08-1002 neurospheres formed invasive tumors in athymic nude mice brains within 4-11 months"| __truncated__ "Cryopreserved BAL cells and PBMCs were used for detection of SARS-CoV-2 specific B cells in lower airways and b"| __truncated__ "Briefly, blood SARS-CoV-2-specific B cells were identified using the biotinylated Spike, RBD, or N proteins (Fi"| __truncated__ "Heterogeneous Manifestations of Epithelial-Mesenchymal Plasticity of Circulating Tumor Cells in Breast Cancer P"| __truncated__ ...

The data are explained as follows:


The full-text database of article texts (stored in the texts table):

(base) server@server-MS03-CE0:./EvidenCellMarker/data$ sqlite3 pmc_texts.sqlite
SQLite version 3.50.2 2025-06-28 14:00:48
Enter ".help" for usage hints.
sqlite> .tables
status  texts
sqlite> SELECT * FROM texts LIMIT 10;
PMC12085826|Title|1|1|Navigating single-cell RNA-sequencing: protocols, tools, databases, and applications
PMC12085826|Abstract|1|1|Single-cell RNA-sequencing (scRNA-seq) technology brought about a revolutionary change in the transcriptomic world, paving the way for comprehensive analysis of cellular heterogeneity in complex biological systems.
PMC12085826|Abstract|1|2|It enabled researchers to see how different cells behaved at single-cell levels, providing new insights into the process.
PMC12085826|Abstract|1|3|However, despite all these advancements, scRNA-seq also experiences challenges related to the complexity of data analysis, interpretation, and multi-omics data integration.
PMC12085826|Abstract|1|4|In this review, these complications were discussed in detail, directly pointing at the optimization of scRNA-seq approaches and understanding the world of single-cell and its dynamics.
PMC12085826|Abstract|1|5|Different protocols and currently functional single-cell databases were also covered.
PMC12085826|Abstract|1|6|This review highlights different tools for the analysis of scRNA-seq and their methodologies, emphasizing innovative techniques that enhance resolution and accuracy at a single-cell level.
PMC12085826|Abstract|1|7|Various applications were explored across domains including drug discovery, tumor microenvironment (TME), biomarker discovery, and microbial profiling, and case studies were discussed to explain the importance of scRNA-seq by uncovering novel and rare cell types and their identification.
PMC12085826|Abstract|1|8|This review underlines a crucial aspect of scRNA-seq in the advancement of personalized medicine and highlights its potential to understand the complexity of biological systems.
PMC12085826|Introduction|1|1|Two centuries after Robert Hooke and Antonie van Leeuwenhoek, cells were redefined as the fundamental functional unit of life [1].

Corresponding columns:
sqlite> PRAGMA table_info(texts);
0|pmcid|TEXT|1||0
1|section|TEXT|0||0
2|paragraph|INTEGER|0||0
3|sentence|INTEGER|0||0
4|text|TEXT|0||0
sqlite>

pmcid: PMC ID
section: section name
paragraph: paragraph index
sentence: sentence index (in order)
text: body text, split into sentences
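As an illustration of how the texts table could be read, the sketch below reassembles one section of an article by ordering on paragraph and then sentence. The function name `fetch_section` and the in-memory demo rows are hypothetical; only the table and column names come from the schema above.

```python
import sqlite3


def fetch_section(conn, pmcid, section):
    """Return the sentences of one section, ordered by paragraph then sentence."""
    rows = conn.execute(
        "SELECT text FROM texts "
        "WHERE pmcid = ? AND section = ? "
        "ORDER BY paragraph, sentence",
        (pmcid, section),
    ).fetchall()
    return [row[0] for row in rows]


# Demo on an in-memory table with the same columns as texts
# (the real file is data/pmc_texts.sqlite; these rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE texts (pmcid TEXT NOT NULL, section TEXT, "
    "paragraph INTEGER, sentence INTEGER, text TEXT)"
)
conn.executemany(
    "INSERT INTO texts VALUES (?, ?, ?, ?, ?)",
    [
        ("PMC1", "Abstract", 1, 2, "Second sentence."),
        ("PMC1", "Abstract", 1, 1, "First sentence."),
        ("PMC1", "Title", 1, 1, "A title"),
    ],
)
print(" ".join(fetch_section(conn, "PMC1", "Abstract")))
```

Parameter binding with `?` keeps user-supplied PMC IDs out of the SQL string.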

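These fields are then grouped into several blocks for displaying annotation information. In the descriptions below, the text in parentheses is the display name used in the database (the column names in the table above are not well standardized), and all items are listed in display order.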

  1. Annotation information: species (Species)
  2. Tissue information:
  3. Cell information:
  4. Disease information:
  5. Gene & Protein information:
  6. Literature source information:

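2. Home page

Mainly displays summary statistics:

  1. Number of cells: unique cell_name_standardized
  2. Number of tissues: unique top_level
  3. Number of markers: unique gene_symbol
  4. Number of diseases: unique disease_type_do
  5. Number of publications: unique pmcid

Species distribution: species (as a pie chart or bar chart)

An interactive organ-tissue-cell figure could also be added.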
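The Home page counters amount to counting distinct values per column. A minimal sketch, assuming the records are available as a list of dictionaries keyed by the mydata column names; the function name `home_statistics`, the output labels, and the choice to tally species per row are assumptions for illustration.

```python
from collections import Counter


def home_statistics(rows):
    """Count distinct values for the Home page counters and tally species."""
    columns = {
        "cells": "cell_name_standardized",
        "tissues": "top_level",
        "markers": "gene_symbol",
        "diseases": "disease_type_do",
        "publications": "pmcid",
    }
    stats = {label: len({row[col] for row in rows}) for label, col in columns.items()}
    # Assumption: the species chart counts records (rows) per species.
    stats["species_distribution"] = dict(Counter(row["species"] for row in rows))
    return stats


# Made-up demo records using the mydata column names.
demo_rows = [
    {"cell_name_standardized": "neuronal cell", "top_level": "brain",
     "gene_symbol": "A1BG", "disease_type_do": "glioblastoma",
     "pmcid": "PMC4496184", "species": "Human"},
    {"cell_name_standardized": "N-specific B cell", "top_level": "blood",
     "gene_symbol": "NAT2", "disease_type_do": "COVID-19",
     "pmcid": "PMC9785906", "species": "Human"},
    {"cell_name_standardized": "N-specific B cell", "top_level": "lung",
     "gene_symbol": "NAT2", "disease_type_do": "COVID-19",
     "pmcid": "PMC10603102", "species": "Human"},
]
print(home_statistics(demo_rows))
```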

3. Search page

Enter a search term to retrieve every row in the database that contains it (general search).
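One way to read "general search" is a substring match across every column of a row. The sketch below assumes that interpretation and additionally assumes case-insensitive matching; `general_search` is a hypothetical name, and a production version would more likely push the match into the database.

```python
def general_search(rows, term):
    """Return rows in which any column contains `term` (case-insensitive)."""
    needle = term.strip().lower()
    if not needle:
        # An empty query returns nothing rather than the whole table.
        return []
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]


# Made-up demo records.
demo_rows = [
    {"marker": "N-cadherin", "gene_symbol": "CDH2"},
    {"marker": "Tuj 1", "gene_symbol": "A1BG"},
]
print(general_search(demo_rows, "cdh"))
```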

Provide some examples (clicking one writes it into the search box):

3.2 Advanced

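Filter fields (if possible, the filters should interact in real time, so that each selection narrows the other filters to options that still exist and a "not found" result never occurs):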

  1. Species
  2. Tissue Type
  3. Cell Type
  4. Disease Type
  5. Marker (Gene Symbol)
  6. Marker Polarity
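The real-time interaction described above can be sketched as follows: for each filter, the selectable options are the values that still occur in rows matching all the other current selections. The function name `available_options` and the dictionary-based row format are assumptions for illustration.

```python
def available_options(rows, selections, facets):
    """For each facet, list the values still reachable given the other selections."""
    options = {}
    for facet in facets:
        # Apply every selection except the one on this facet itself,
        # so the user can still switch to a sibling value.
        others = {k: v for k, v in selections.items() if k != facet and v}
        matching = [
            row for row in rows
            if all(row[k] == v for k, v in others.items())
        ]
        options[facet] = sorted({row[facet] for row in matching})
    return options


# Made-up demo records.
demo_rows = [
    {"species": "Human", "top_level": "brain"},
    {"species": "Human", "top_level": "blood"},
    {"species": "Mouse", "top_level": "lung"},
]
print(available_options(demo_rows, {"species": "Human"}, ["species", "top_level"]))
```

Because every offered option is derived from rows that exist, any combination the user can pick returns at least one result.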

Because there are too many columns in total, the initial results list shows only Species, Tissue Type, Cell Type, Disease Type, Gene Symbol, Protein name, and Evidence, plus a details button.

Clicking details shows the complete record.

  1. Species information:
    • Species
  2. Tissue information:
    • top_level (Standardized tissue name)
    • top_level_uberon_id (Uberon ID): show Not mapped if NA; otherwise add a hyperlink with ":" replaced by "_". For example, UBERON:0000178 links to http://purl.obolibrary.org/obo/UBERON_0000178
    • top_level_mixed_name (Mixed tissue name): hidden if NA
    • tissue_class (Original tissue name)
  3. Cell information:
    • cell_name_standardized (Standardized cell name)
    • cell_name_cl_id (Cell Ontology ID): show Not mapped if NA; otherwise add a hyperlink with ":" replaced by "_". For example, CL:0002326 links to http://purl.obolibrary.org/obo/CL_0002326
    • cell_name_orig (Original cell name)
  4. Disease information:
    • disease_type_do (Standardized Disease type)
    • disease_type_doid (Disease Ontology ID): always link to https://disease-ontology.org/do/ (without an ID suffix, because that page does not support queries)
    • disease_type (Original Disease type)
  5. Gene & Protein information:
    • marker (Original marker name)
    • marker_polarity (Marker polarity)
    • gene_symbol (HGNC gene symbol): add a check indicator, a green check mark when gene_qc_pass is TRUE and a red cross when it is FALSE. On hover, show the contents of gene_qc_note. Always link to GeneCards; for FLT3, for example, the link is https://www.genecards.org/cgi-bin/carddisp.pl?gene=FLT3
    • gene_type (Gene type)
    • gene_full_name (Gene full name)
    • gene_aliases (Gene aliases)
    • protein_id (Protein id): add a check indicator, a green check mark when protein_qc_pass is TRUE and a red cross when it is FALSE. On hover, show the contents of protein_qc_note. For P12830, for example, the link is https://www.uniprot.org/uniprotkb/P12830/entry
    • protein_name (Protein name)
    • entrez_id (Entrez ID)
    • ensembl_id (Ensembl ID)
  6. Literature source information:
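The hyperlink and QC-badge rules above can be sketched as small helpers. All function names are hypothetical, and treating the literal string "NULL" (as seen in the str(mydata) output) the same as NA is an assumption; the URL patterns are the ones given in the list.

```python
from urllib.parse import quote


def is_missing(value):
    # Assumption: None, "", "NA", and the literal "NULL" all count as missing.
    return value is None or value in ("", "NA", "NULL")


def obo_link(ontology_id):
    """Return (label, url) for an Uberon or Cell Ontology ID."""
    if is_missing(ontology_id):
        return ("Not mapped", None)
    url = "http://purl.obolibrary.org/obo/" + ontology_id.replace(":", "_")
    return (ontology_id, url)


def disease_link(doid):
    """The Disease Ontology page takes no ID suffix, so the URL is fixed."""
    return (doid, "https://disease-ontology.org/do/")


def gene_link(symbol):
    return (symbol, "https://www.genecards.org/cgi-bin/carddisp.pl?gene=" + quote(symbol))


def protein_link(accession):
    return (accession, f"https://www.uniprot.org/uniprotkb/{quote(accession)}/entry")


def qc_badge(passed, note):
    """Green check when the QC flag is TRUE, red cross otherwise; note is the tooltip."""
    return {
        "icon": "check" if passed else "cross",
        "color": "green" if passed else "red",
        "tooltip": note,
    }


print(obo_link("UBERON:0000178"))
print(gene_link("FLT3"))
```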

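4. Browse page

Browse is organized as a tree: the user first selects a broad tissue category, then a subcategory. Broad categories come from classifications; subcategories are the top_level values.

mydata has been updated with this column; if a tissue belongs to more than one category, the categories are separated by semicolons.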

> str(mydata)
> ......
$ classifications       : chr  "body regions" "body regions" "body regions" "circulatory system; body fluids" ...

The layout is similar to http://www.bio-bigdata.center/CellMarkerBrowse.jsp
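Building the Browse tree comes down to splitting classifications on semicolons and collecting the top_level values under each category. A minimal sketch, with the function name `build_browse_tree` and the demo pairings of tissue and category invented for illustration:

```python
from collections import defaultdict


def build_browse_tree(rows):
    """Map each broad category to the sorted top_level tissues under it."""
    tree = defaultdict(set)
    for row in rows:
        # A tissue may list several categories, separated by semicolons.
        for category in row["classifications"].split(";"):
            category = category.strip()
            if category:
                tree[category].add(row["top_level"])
    return {category: sorted(tissues) for category, tissues in sorted(tree.items())}


# Made-up demo records; the tissue-to-category pairings are illustrative only.
demo_rows = [
    {"top_level": "brain", "classifications": "body regions"},
    {"top_level": "blood", "classifications": "circulatory system; body fluids"},
    {"top_level": "blood", "classifications": "circulatory system; body fluids"},
]
print(build_browse_tree(demo_rows))
```

A tissue with several categories appears under each of them, so the same leaf can be reached from more than one branch.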

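5. Cell Annotation page

5.1 Backend code

The user uploads a Seurat object (example_pbmc_input.rds in the code below), which is analyzed to produce the annotation files.

Wrapped code (bash commands)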

cd /mnt/workdir/cellmarker/EvidenCellMarker

# Disease + normal
Rscript ./cell_annotation/annotate_cells.R \
  --input ./data/example_pbmc_input.rds \
  --species Human \
  --tissue "blood" \
  --n_variable_features 2000 \
  --dims 30 \
  --cluster_resolutions "0.3,0.5,1.0" \
  --min_markers_per_cell 2 \
  --disease_type "Normal,acute myeloid leukemia,aplastic anemia" \
  --output_dir ./results/example_full \
  --n_threads 8 \
  --random_seed 1234

# Normal only
Rscript ./cell_annotation/annotate_cells.R \
  --input ./data/example_pbmc_input.rds \
  --species Human \
  --tissue "blood" \
  --n_variable_features 2000 \
  --dims 30 \
  --cluster_resolutions "0.3,0.5,1.0" \
  --min_markers_per_cell 2 \
  --disease_type "Normal" \
  --output_dir ./results/example_full \
  --n_threads 32 \
  --random_seed 1234

# Disease only
Rscript ./cell_annotation/annotate_cells.R \
  --input ./data/example_pbmc_input.rds \
  --species Human \
  --tissue "blood" \
  --n_variable_features 2000 \
  --dims 30 \
  --cluster_resolutions "0.3,0.5,0.8" \
  --min_markers_per_cell 2 \
  --disease_type "acute myeloid leukemia" \
  --output_dir ./results/example_full \
  --n_threads 8 \
  --random_seed 1234

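Parameter settings:

  1. input: in the web application, this comes from the file uploaded by the user.
  2. species: one of Human, Mouse, or Rat; selected by the user online.
  3. tissue: taken from mydata$top_level; selected by the user online.
  4. n_variable_features: default 2000, maximum 3000, minimum 1000, in steps of 500; selected by the user online.
  5. dims: default 30, maximum 40, minimum 10, in steps of 5; must be an integer; selected by the user online.
  6. cluster_resolutions: multiple values allowed; default 0.8, at most 3 values, each no larger than 1.5.
  7. min_markers_per_cell: the minimum number of markers per cell; default 2, minimum 1; must be an integer.
  8. disease_type: taken from mydata$disease_type_do; the default, Normal, selects normal cells. Multiple values, for example "Normal,acute myeloid leukemia,aplastic anemia", select cells from several disease states.
  9. output_dir: in the web application, a fixed output directory should be used, with a unique random link generated for each submission.
  10. n_threads: fixed at 8; mainly used for multithreaded annotation in sctype.
  11. random_seed: default 1234; the seed for dimensionality reduction and clustering, used to reproduce results.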
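The numeric constraints above could be checked on the server before a job is launched. The sketch below is one possible validator; the function name `validate_annotation_params` and the error messages are hypothetical, the lower bound of 0 on resolutions is an assumption (the list gives only the maximum), and checks of tissue and disease_type against mydata are omitted.

```python
def validate_annotation_params(species, n_variable_features=2000, dims=30,
                               cluster_resolutions=(0.8,), min_markers_per_cell=2):
    """Return a list of error messages; an empty list means the inputs are valid."""
    errors = []
    if species not in ("Human", "Mouse", "Rat"):
        errors.append("species must be Human, Mouse, or Rat")
    if n_variable_features not in range(1000, 3001, 500):
        errors.append("n_variable_features must be 1000-3000 in steps of 500")
    if dims not in range(10, 41, 5):
        errors.append("dims must be 10-40 in steps of 5")
    if not 1 <= len(cluster_resolutions) <= 3:
        errors.append("choose between 1 and 3 cluster resolutions")
    # Assumption: resolutions must be positive; only the 1.5 maximum is specified.
    if any(not 0 < r <= 1.5 for r in cluster_resolutions):
        errors.append("each cluster resolution must be in (0, 1.5]")
    if not isinstance(min_markers_per_cell, int) or min_markers_per_cell < 1:
        errors.append("min_markers_per_cell must be an integer >= 1")
    return errors


print(validate_annotation_params("Human"))
print(validate_annotation_params("Human", dims=32))
```

Returning every violation at once, rather than stopping at the first, lets the form highlight all invalid fields in one round trip.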

5.2 Results display:

Run the example code:

Rscript ./cell_annotation/annotate_cells.R \
  --input ./data/example_pbmc_input.rds \
  --species Human \
  --tissue "blood" \
  --n_variable_features 2000 \
  --dims 30 \
  --cluster_resolutions "0.3,0.5,1.0" \
  --min_markers_per_cell 2 \
  --disease_type "Normal" \
  --output_dir ./results/example_full \
  --n_threads 32 \
  --random_seed 1234

Structure of the results:

(base) server@server-MS03-CE0:/mnt/workdir/cellmarker/EvidenCellMarker/results/example_full$ tree
.
├── annotated_seurat.RDS
├── resolution_0.30
│   ├── cluster_markers_all.csv
│   ├── cluster_markers_significant.csv
│   ├── llm_decisions.csv
│   ├── llm_prompts_log.csv
│   ├── umap_cluster_annotation.pdf
│   ├── umap_cluster_annotation.png
│   ├── umap_marker_dotplot.pdf
│   └── umap_marker_dotplot.png
├── resolution_0.50
│   ├── cluster_markers_all.csv
│   ├── cluster_markers_significant.csv
│   ├── llm_decisions.csv
│   ├── llm_prompts_log.csv
│   ├── umap_cluster_annotation.pdf
│   ├── umap_cluster_annotation.png
│   ├── umap_marker_dotplot.pdf
│   └── umap_marker_dotplot.png
├── resolution_1.00
│   ├── cluster_markers_all.csv
│   ├── cluster_markers_significant.csv
│   ├── llm_decisions.csv
│   ├── llm_prompts_log.csv
│   ├── umap_cluster_annotation.pdf
│   ├── umap_cluster_annotation.png
│   ├── umap_marker_dotplot.pdf
│   └── umap_marker_dotplot.png
└── summary.txt

4 directories, 26 files

The results are split into 3 panels by resolution; every resolution contains the same set of files.

5.2.1 Figures

Each panel displays the following png images (relative paths) and provides a download button for the pdf file with the same prefix.

  1. Wrapped in a box titled UMAP plot; relative path: ./results/example_full/resolution_0.50/umap_cluster_annotation.png

  2. Wrapped in a box titled Dot plot; relative path: ./results/example_full/resolution_0.50/umap_marker_dotplot.png

5.2.2 Tables

  1. Wrapped in a box titled LLM decision process; relative path: ./results/example_full/resolution_0.50/llm_decisions.csv

This csv has the following columns:

  2. Provide two download buttons: (1) labeled Download all cluster markers, path ./results/example_full/resolution_0.50/cluster_markers_all.csv; (2) labeled Download significant cluster markers, path ./results/example_full/resolution_0.50/cluster_markers_significant.csv
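Since every resolution directory holds the same files, the paths for one panel can be derived from the output directory and the resolution value. The sketch below assumes the directory name always formats the resolution with two decimals, which is inferred from the tree listing (resolution_0.30, resolution_0.50, resolution_1.00); `resolution_panel` and the dictionary keys are hypothetical names.

```python
PANEL_FILES = {
    "umap_png": "umap_cluster_annotation.png",
    "umap_pdf": "umap_cluster_annotation.pdf",
    "dotplot_png": "umap_marker_dotplot.png",
    "dotplot_pdf": "umap_marker_dotplot.pdf",
    "llm_decisions": "llm_decisions.csv",
    "markers_all": "cluster_markers_all.csv",
    "markers_significant": "cluster_markers_significant.csv",
}


def resolution_panel(output_dir, resolution):
    """Return the file paths shown in the panel for one clustering resolution."""
    # Assumption: directory names use two decimals, e.g. 0.5 -> resolution_0.50.
    base = f"{output_dir.rstrip('/')}/resolution_{resolution:.2f}"
    return {key: f"{base}/{name}" for key, name in PANEL_FILES.items()}


panel = resolution_panel("./results/example_full", 0.5)
print(panel["umap_png"])
```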

6. Cell Score page

6.1 Backend code

The user uploads a Seurat object and obtains cell scores according to the selected tissue and cell.

Wrapped code (bash commands)

Rscript ./cell_score/ucell_demo.R \
      --species Human \
      --tissue brain \
      --disease_name glioblastoma \
      --cell_name "malignant cell" \
      --polarity positive \
      --n_threads 4 \
      --outdir ./results/cell_score_positive_example_6threads \
      --input_seurat ./data/example_glioblastoma_input.rds \
      --marker_rds ./data/mydata.RDS \
      --type_in_metadata cell_type \
      --target_cells "malignant cell"

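Parameter settings:

  1. species: one of Human, Mouse, or Rat; selected by the user online (required)

  2. tissue: taken from mydata$top_level; selected by the user online (required)

  3. disease_name: taken from mydata$disease_type_do; selected by the user online (required)

  4. cell_name: taken from mydata$cell_name_standardized; selected by the user online (required)

  5. polarity: taken from mydata$marker_polarity; selected by the user online (required)

  6. n_threads: fixed at 4 (to be decided by the configuration of the final deployment server; memory usage with 4 threads is approximately ...)

  7. outdir: in the web application, each submission should have its own fixed output directory, and a unique random link is generated for each submission

  8. input_seurat: in the web application, this comes from the file uploaded by the user (required)

  9. marker_rds: fixed as ./data/mydata.RDS

  10. type_in_metadata: a column in the metadata of input_seurat (the name of the metadata column that holds the cell type annotation) (optional)

  11. target_cells: the value in that column to display (the name of the target cell among the annotated cells; the value must occur in the type_in_metadata column) (optional)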
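A web backend would need to turn the form values into a command line for ucell_demo.R. The sketch below builds the argument list and leaves out the two optional flags when they are not set; it assumes the script accepts their omission, which the "optional" labels suggest but the document does not confirm. The function name `build_cell_score_command` is hypothetical.

```python
def build_cell_score_command(species, tissue, disease_name, cell_name, polarity,
                             input_seurat, outdir,
                             type_in_metadata=None, target_cells=None):
    """Assemble the argument list for ucell_demo.R, skipping unset optional flags."""
    if target_cells is not None and type_in_metadata is None:
        # target_cells names a value inside the type_in_metadata column.
        raise ValueError("target_cells requires type_in_metadata")
    argv = [
        "Rscript", "./cell_score/ucell_demo.R",
        "--species", species,
        "--tissue", tissue,
        "--disease_name", disease_name,
        "--cell_name", cell_name,
        "--polarity", polarity,
        "--n_threads", "4",
        "--outdir", outdir,
        "--input_seurat", input_seurat,
        "--marker_rds", "./data/mydata.RDS",
    ]
    if type_in_metadata is not None:
        argv += ["--type_in_metadata", type_in_metadata]
    if target_cells is not None:
        argv += ["--target_cells", target_cells]
    return argv


cmd = build_cell_score_command(
    "Human", "brain", "glioblastoma", "malignant cell", "positive",
    "./data/example_glioblastoma_input.rds", "./results/cell_score_positive_example",
    type_in_metadata="cell_type", target_cells="malignant cell",
)
print(cmd)
```

Passing the result as a list to `subprocess.run(cmd)` avoids shell quoting problems with values that contain spaces, such as "malignant cell".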

6.2 Results display

Run the example code

Rscript ./cell_score/ucell_demo.R \
      --species Human \
      --tissue brain \
      --disease_name glioblastoma \
      --cell_name "malignant cell" \
      --polarity positive \
      --n_threads 6 \
      --outdir ./results/cell_score_positive_example \
      --input_seurat ./data/example_glioblastoma_input.rds \
      --marker_rds ./data/mydata.RDS \
      --type_in_metadata cell_type \
      --target_cells "malignant cell"

Tree structure of the results directory:

(base) server@server-MS03-CE0:/mnt/workdir/cellmarker/EvidenCellMarker$ tree results/cell_score_positive_example
results/cell_score_positive_example
├── Boxplot_cell_signatures_by_target_cell_type_malignant_cell.pdf
├── Boxplot_cell_signatures_by_target_cell_type_malignant_cell.png
├── CustomPlot_cell_type_umap.pdf
├── CustomPlot_cell_type_umap.png
├── FeaturePlot_cell_signatures_umap.pdf
├── FeaturePlot_cell_signatures_umap.png
├── meta_with_ucell_scores.csv
├── TargetCellsPlot_cell_type_malignant_cell_umap.pdf
└── TargetCellsPlot_cell_type_malignant_cell_umap.png

1 directory, 9 files

6.2.1 Figures

Depending on the parameter settings, there are the following cases:

  1. Case 1: both type_in_metadata and target_cells are set. The set of files is complete and is displayed in this order:

(1) Web title: Cell type plot

./results/cell_score_positive_example/CustomPlot_cell_type_umap.png

(2) Web title: Cell score plot

./results/cell_score_positive_example/FeaturePlot_cell_signatures_umap.png

(3) Web title: Target cells distribution plot

./results/cell_score_positive_example/TargetCellsPlot_cell_type_malignant_cell_umap.png

(4) Web title: Boxplot of scores

./results/cell_score_positive_example/Boxplot_cell_signatures_by_target_cell_type_malignant_cell.png

  2. Case 2: type_in_metadata is set but target_cells is not. Figures (3) and (4) above are not produced.
  3. Case 3: neither type_in_metadata nor target_cells is set. Figures (1), (3), and (4) above are not produced.

Note: every figure can be downloaded in png format and as a pdf file with the same name.
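The three cases above reduce to a small decision rule for which figure boxes the page should render. A minimal sketch, with `cell_score_figures` as a hypothetical name; it returns only the web titles, because the file names embed the chosen column and cell names in a pattern the document shows for a single example only.

```python
def cell_score_figures(type_in_metadata=None, target_cells=None):
    """Return the titles of the figures to display, in display order."""
    titles = []
    if type_in_metadata:
        titles.append("Cell type plot")            # figure (1)
    titles.append("Cell score plot")               # figure (2), always present
    if type_in_metadata and target_cells:
        titles.append("Target cells distribution plot")  # figure (3)
        titles.append("Boxplot of scores")                # figure (4)
    return titles


print(cell_score_figures("cell_type", "malignant cell"))
print(cell_score_figures("cell_type"))
print(cell_score_figures())
```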

6.2.2 Tables

In all cases, display the table results/cell_score_positive_example/meta_with_ucell_scores.csv and provide a download option.

head results/cell_score_positive_example/meta_with_ucell_scores.csv
"","nCount_RNA","nFeature_RNA","cell_type","cell_signatures","umap_1","umap_2"
"PJ017_0",14882,4710,"malignant cell",0.382159413787947,-1.72728063129287,4.9318121515492
"PJ017_1",13889,4951,"malignant cell",0.300994466876028,-0.43518288396697,5.81114612448011
"PJ017_2",12877,4017,"macrophage",0.0835327251881761,-6.45049966834884,-6.57972015512194
"PJ017_3",12742,3993,"malignant cell",0.377698021035841,-1.36464430354934,4.81031928884779
"PJ017_4",12775,4280,"malignant cell",0.368226907930811,-1.43739284538131,4.84666381704603
"PJ017_5",12530,3919,"malignant cell",0.353559144608943,-1.76326598190169,4.60113368856703
"PJ017_6",11919,3874,"mural cell",0.182717710981506,-10.1987182047258,4.58634935247694
"PJ017_7",12183,4103,"malignant cell",0.380389811076217,-1.38707925342422,4.72182260381971
"PJ017_8",12079,3693,"malignant cell",0.334280444643836,-1.86855424903731,4.69542299139295

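Supplement. Other features

1. Email notification

Because analyses take a long time, a unique URL is generated as soon as a job is submitted. The user can access the results through this link and is notified by email when the analysis finishes.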
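A unique link per submission can be derived from a random token that also names the job's output directory. The sketch below uses Python's standard `secrets` module; the function name `new_job`, the `/jobs/` URL path, the "queued" status label, and the example base URL are all hypothetical.

```python
import secrets


def new_job(base_url, results_root):
    """Create an unguessable job ID with its results URL and output directory."""
    job_id = secrets.token_urlsafe(16)  # 16 random bytes, URL-safe text
    return {
        "job_id": job_id,
        "url": f"{base_url.rstrip('/')}/jobs/{job_id}",
        "output_dir": f"{results_root.rstrip('/')}/{job_id}",
        "status": "queued",
    }


job = new_job("https://example.org", "./results")
print(job["url"])
```

Using a random token rather than a sequential number means one user cannot guess the result URL of another user's job.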

2. Job Status

Query the status of current jobs (so that users can check the server load and plan their analyses).

3. Download page

To be written

4. Help page

To be written