1 Tóm tắt mục tiêu nghiên cứu
- 1.1 Câu hỏi nghiên cứu
- 1.2 Thiết kế nghiên cứu
- 1.3 Lưu ý quan trọng về phạm vi trình bày
2 Cơ sở lý thuyết chung
- 2.1 TCGA-THCA và dữ liệu RNA-seq
- 2.2 SummarizedExperiment
- 2.3 DESeq2
3 0. Khởi tạo thư mục dự án và cài package
- 3.1 Mục đích
- 3.2 Phương pháp nghiên cứu
- 3.3 Code
4 1. Tải dữ liệu TCGA-THCA RNA-seq STAR counts
- 4.1 Mục đích
- 4.2 Cơ sở lý thuyết
- 4.3 Phương pháp nghiên cứu
- 4.4 Code
5 2. Trích xuất count matrix, TPM và sample metadata
- 5.1 Mục đích
- 5.2 Cơ sở lý thuyết
- 5.3 Phương pháp nghiên cứu
- 5.4 Code
6 3. Lọc mẫu cho phân tích Tumor vs Normal
- 6.1 Mục đích
- 6.2 Cơ sở lý thuyết
- 6.3 Phương pháp nghiên cứu
- 6.4 Code
7 4. Phân tích biểu hiện khác biệt bằng DESeq2
- 7.1 Mục đích
- 7.2 Cơ sở lý thuyết
- 7.3 Phương pháp nghiên cứu
- 7.4 Code
8 5. Annotation kết quả DESeq2
- 8.1 Mục đích
- 8.2 Cơ sở lý thuyết
- 8.3 Phương pháp nghiên cứu
- 8.4 Code
9 6. Gộp kết quả DESeq2 với annotation
- 9.1 Mục đích
- 9.2 Cơ sở lý thuyết
- 9.3 Phương pháp nghiên cứu
- 9.4 Code
10 7. Kiểm tra nhóm gene đích liên quan PTC
- 10.1 Mục đích
- 10.2 Cơ sở lý thuyết
- 10.3 Phương pháp nghiên cứu
- 10.4 Code
11 8. Trích xuất normalized counts
- 11.1 Mục đích
- 11.2 Cơ sở lý thuyết
- 11.3 Phương pháp nghiên cứu
- 11.4 Code
12 9. Tạo bảng expression dạng long cho gene đích
- 12.1 Mục đích
- 12.2 Cơ sở lý thuyết
- 12.3 Phương pháp nghiên cứu
- 12.4 Code
13 10. Boxplot biểu hiện gene đích
- 13.1 Mục đích
- 13.2 Cơ sở lý thuyết
- 13.3 Phương pháp nghiên cứu
- 13.4 Code
14 11. Volcano plot
- 14.1 Mục đích
- 14.2 Cơ sở lý thuyết
- 14.3 Phương pháp nghiên cứu
- 14.4 Code
15 12. Heatmap top 50 DEG
- 15.1 Mục đích
- 15.2 Cơ sở lý thuyết
- 15.3 Phương pháp nghiên cứu
- 15.4 Code
16 13. Xây dựng panel gene PTC mở rộng
- 16.1 Mục đích
- 16.2 Cơ sở lý thuyết
- 16.3 Phương pháp nghiên cứu
- 16.4 Code
17 14. Trích xuất kết quả DESeq2 cho panel PTC mở rộng
- 17.1 Mục đích
- 17.2 Cơ sở lý thuyết
- 17.3 Phương pháp nghiên cứu
- 17.4 Code
18 15. Kiểm tra gene thiếu trong panel mở rộng
- 18.1 Mục đích
- 18.2 Cơ sở lý thuyết
- 18.3 Phương pháp nghiên cứu
- 18.4 Code
19 16. Chấm điểm ưu tiên sơ bộ cho quy trình thiết kế vaccine
- 19.1 Mục đích
- 19.2 Cơ sở lý thuyết
- 19.3 Phương pháp nghiên cứu
- 19.4 Code
20 17. Top gene protein-coding tăng biểu hiện trong Tumor
- 20.1 Mục đích
- 20.2 Cơ sở lý thuyết
- 20.3 Phương pháp nghiên cứu
- 20.4 Code
21 18. Top gene protein-coding giảm biểu hiện trong Tumor
- 21.1 Mục đích
- 21.2 Cơ sở lý thuyết
- 21.3 Phương pháp nghiên cứu
- 21.4 Code
22 19. Boxplot biểu hiện expanded PTC-relevant gene panel
- 22.1 Mục đích
- 22.2 Cơ sở lý thuyết
- 22.3 Phương pháp nghiên cứu
- 22.4 Code
23 20. Volcano plot highlight expanded PTC-relevant gene panel
- 23.1 Mục đích
- 23.2 Cơ sở lý thuyết
- 23.3 Phương pháp nghiên cứu
- 23.4 Code
24 21. Heatmap biểu hiện expanded PTC-relevant gene panel
- 24.1 Mục đích
- 24.2 Cơ sở lý thuyết
- 24.3 Phương pháp nghiên cứu
- 24.4 Code
25 22. Heatmap sạch hơn cho expanded PTC-relevant gene panel
- 25.1 Mục đích
- 25.2 Phương pháp nghiên cứu
- 25.3 Code
26 Tổng hợp output của quy trình hiện tại
27 Diễn giải phương pháp nghiên cứu có thể đưa vào bài viết
- 27.1 Thiết kế nghiên cứu
- 27.2 Tiền xử lý dữ liệu
- 27.3 Phân tích biểu hiện khác biệt
- 27.4 Annotation và ưu tiên gene ứng viên
- 27.5 Trực quan hóa dữ liệu
- 27.6 Chấm điểm ưu tiên vaccine sơ bộ
28 Hạn chế của quy trình hiện tại
29 Bước tiếp theo đề xuất
30 Thông tin phiên làm việc
31 Tài liệu tham khảo

knitr::opts_chunk$set(
  echo = TRUE,
  message = TRUE,
  warning = TRUE,
  fig.width = 8,
  fig.height = 6
)

1 Tóm tắt mục tiêu nghiên cứu

1.1 Câu hỏi nghiên cứu

Mục tiêu của quy trình này là khai thác dữ liệu RNA-seq từ TCGA-THCA để so sánh biểu hiện gene giữa mô u tuyến giáp và mô tuyến giáp bình thường, sau đó ưu tiên các gene có ý nghĩa sinh học đối với ung thư tuyến giáp thể nhú (papillary thyroid carcinoma, PTC) và có thể dùng như nguồn kháng nguyên ban đầu cho pipeline thiết kế vaccine đa epitope.

1.2 Thiết kế nghiên cứu

Đây là nghiên cứu tin sinh học hồi cứu, sử dụng dữ liệu công khai từ Genomic Data Commons / TCGA. Đơn vị phân tích là mẫu RNA-seq cấp gene. Biến phân nhóm chính là loại mẫu theo mã TCGA barcode:

01: Primary Solid Tumor.
11: Solid Tissue Normal.

Các mã này được quy ước trong bảng sample type code của GDC/TCGA.¹

1.3 Lưu ý quan trọng về phạm vi trình bày

Phần trình bày này thực hiện phần phân tích biểu hiện gene và ưu tiên sơ bộ gene ứng viên. Phần giới thiệu ban đầu có nhắc đến HPA, survival, UniProt và input NetMHC, nhưng các bước đó chưa được triển khai đầy đủ trong đoạn code hiện tại. Vì vậy, phần kết quả này chỉ chuẩn hóa và giải thích các bước đang có: tải TCGA-THCA, xử lý count matrix, DESeq2, annotation, kiểm tra gene PTC, biểu đồ hóa và chấm điểm ưu tiên.

2 Cơ sở lý thuyết chung

2.1 TCGA-THCA và dữ liệu RNA-seq

TCGA-THCA là bộ dữ liệu ung thư tuyến giáp của The Cancer Genome Atlas. Công bố nền tảng của TCGA về PTC mô tả đặc điểm hệ gene của 496 ung thư tuyến giáp thể nhú, bao gồm đột biến BRAF/RAS, tái sắp xếp RET/NTRK và các nhóm tín hiệu MAPK/thyroid differentiation.²

Trong quy trình này, dữ liệu được tải bằng TCGAbiolinks, một package Bioconductor cho phép truy vấn, tải và chuẩn bị dữ liệu từ GDC bằng các hàm GDCquery(), GDCdownload() và GDCprepare().³

2.2 SummarizedExperiment

Kết quả GDCprepare() thường được lưu dưới dạng đối tượng SummarizedExperiment. Cấu trúc này gom ba lớp thông tin chính:

assay(): ma trận biểu hiện gen, ví dụ raw count hoặc TPM.
colData(): metadata của mẫu, tức là thông tin mô tả từng cột trong assay().Nó giúp bạn biết mẫu nào là tumor, mẫu nào là normal, hoặc mẫu thuộc bệnh nhân nào.
rowData(): annotation của gene hoặc feature. Tức là thông tin mô tả từng hàng trong assay(). Nó giúp bạn biết mỗi hàng trong ma trận count tương ứng gene nào.

Cách tổ chức này giúp bảo đảm count matrix, sample metadata và gene annotation luôn được đồng bộ trong cùng một đối tượng phân tích.⁴

2.3 DESeq2

DESeq2 là phương pháp phân tích biểu hiện khác biệt cho RNA-seq count data. Mô hình thống kê cốt lõi là negative binomial generalized linear model. DESeq2 ước lượng size factor (chỉnh khác biệt về độ sâu giải trình tự giữa các mẫu), dispersion (ước lượng độ dao động sinh học/kỹ thuật của count), log2 fold change (mức tăng hoặc giảm biểu hiện gene giữa hai nhóm, ví dụ Tumor so với Normal) và kiểm định khác biệt biểu hiện giữa các điều kiện. Giá trị padj là p-value đã hiệu chỉnh đa kiểm định, thường theo Benjamini-Hochberg để kiểm soát false discovery rate.⁵ Dispersion là mức độ biến thiên của count quanh giá trị trung bình. Trong RNA-seq, count không chỉ biến thiên theo quy luật Poisson đơn giản. Dữ liệu sinh học thường có biến thiên lớn hơn Poisson, gọi là overdispersion. Vì vậy DESeq2 dùng phân phối negative binomial, có thêm tham số dispersion để mô tả biến thiên này. Log2 fold change, viết tắt là log2FC, là mức thay đổi biểu hiện gene giữa hai nhóm trên thang log2. Trả lời câu hỏi: Gene này biểu hiện ở Tumor cao hơn hay thấp hơn Normal bao nhiêu lần?

Trong script này, tiêu chí DEG có ý nghĩa được đặt là:

padj < 0.05.
abs(log2FoldChange) >= 1.

Về mặt diễn giải, log2FoldChange >= 1 tương ứng biểu hiện ở nhóm u cao ít nhất khoảng 2 lần so với nhóm bình thường, còn log2FoldChange <= -1 tương ứng giảm ít nhất khoảng 2 lần.

3 0. Khởi tạo thư mục dự án và cài package

3.1 Mục đích

Tạo cấu trúc thư mục chuẩn để lưu dữ liệu thô, dữ liệu xử lý, bảng kết quả, hình ảnh và file đầu vào cho các bước miễn dịch tin sinh học sau này.

3.2 Phương pháp nghiên cứu

Tất cả output nên được lưu vào thư mục con để dễ truy xuất và tái lập phân tích:

data_raw: dữ liệu tải trực tiếp từ GDC.
data_processed: dữ liệu đã xử lý.
results: bảng kết quả chính.
results/plots: hình ảnh.
results/fasta: trình tự FASTA, nếu bổ sung về sau.
results/netmhc_input: input cho NetMHCpan hoặc NetMHCIIpan, nếu bổ sung về sau.

3.3 Code

project_dir <- path.expand(params$project_dir)

if (!dir.exists(project_dir)) {
  dir.create(project_dir, recursive = TRUE)
}
knitr::opts_knit$set(root.dir = project_dir)
proj_path <- function(...) {
  file.path(project_dir, ...)
}
# Kiểm tra quyền ghi vào thư mục project.
if (file.access(project_dir, mode = 2) != 0) {
  stop("R không có quyền ghi vào project_dir: ", project_dir)
}
# Tạo cấu trúc thư mục chuẩn.
dirs <- c(
  "data_raw",
  "data_raw/GDCdata",
  "data_processed",
  "results",
  "results/plots",
  "results/fasta",
  "results/netmhc_input"
)
for (d in dirs) {
  dir.create(proj_path(d), recursive = TRUE, showWarnings = FALSE)
}

# Kiểm tra nhanh đường dẫn đang sử dụng.
message("Project directory: ", project_dir)

## Project directory: /Users/vuanhduy/Desktop/PTC_multi_epitope_project

message("Data raw directory: ", proj_path("data_raw"))

## Data raw directory: /Users/vuanhduy/Desktop/PTC_multi_epitope_project/data_raw

message("Results directory: ", proj_path("results"))

## Results directory: /Users/vuanhduy/Desktop/PTC_multi_epitope_project/results

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

bioc_pkgs <- c(
  "TCGAbiolinks",
  "DESeq2",
  "SummarizedExperiment",
  "Biostrings"
)

cran_pkgs <- c(
  "tidyverse",
  "data.table",
  "survival",
  "survminer",
  "pheatmap",
  "ggrepel",
  "readxl",
  "writexl",
  "jsonlite",
  "httr2"
)

for (p in bioc_pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) {
    BiocManager::install(p, ask = FALSE, update = FALSE)
  }
}

for (p in cran_pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p)
  }
}

suppressPackageStartupMessages({
  library(TCGAbiolinks)
  library(DESeq2)
  library(SummarizedExperiment)
  library(Biostrings)
  library(tidyverse)
  library(data.table)
  library(survival)
  library(survminer)
  library(pheatmap)
  library(writexl)
  library(jsonlite)
  library(httr2)
})

## Warning: package 'TCGAbiolinks' was built under R version 4.5.2

## Warning: package 'DESeq2' was built under R version 4.5.2

## Warning: package 'S4Vectors' was built under R version 4.5.3

## Warning: package 'BiocGenerics' was built under R version 4.5.2

## Warning: package 'IRanges' was built under R version 4.5.2

## Warning: package 'GenomicRanges' was built under R version 4.5.2

## Warning: package 'Seqinfo' was built under R version 4.5.2

## Warning: package 'SummarizedExperiment' was built under R version 4.5.2

## Warning: package 'MatrixGenerics' was built under R version 4.5.2

## Warning: package 'Biobase' was built under R version 4.5.3

## Warning: package 'Biostrings' was built under R version 4.5.2

## Warning: package 'XVector' was built under R version 4.5.2

## Warning: package 'survival' was built under R version 4.5.2

## Warning: package 'survminer' was built under R version 4.5.2

## Warning: package 'ggpubr' was built under R version 4.5.2

## Warning: package 'httr2' was built under R version 4.5.2

4 1. Tải dữ liệu TCGA-THCA RNA-seq STAR counts

4.1 Mục đích

Tải dữ liệu RNA-seq cấp gene từ TCGA-THCA, cụ thể là dữ liệu Gene Expression Quantification với workflow STAR - Counts.

4.2 Cơ sở lý thuyết

STAR counts là dữ liệu đếm số read gán vào gene sau bước alignment/quantification. Với DESeq2, đầu vào phù hợp là raw count hoặc count gần raw, không phải TPM/FPKM, vì mô hình negative binomial của DESeq2 được xây dựng cho count data.

4.3 Phương pháp nghiên cứu

Truy vấn dự án TCGA-THCA.
Chọn nhóm dữ liệu Transcriptome Profiling.
Chọn loại dữ liệu Gene Expression Quantification.
Chọn workflow STAR - Counts.
Tải bằng GDC API.
Chuẩn bị dữ liệu thành đối tượng SummarizedExperiment.

4.4 Code

# Tải hoặc nạp lại TCGA-THCA RNA-seq STAR counts
suppressPackageStartupMessages({
  library(TCGAbiolinks)
  library(SummarizedExperiment)
})

rna_rda_file <- proj_path("data_raw", "TCGA_THCA_STAR_counts.rda")
gdc_download_dir <- proj_path("data_raw", "GDCdata")

dir.create(dirname(rna_rda_file), recursive = TRUE, showWarnings = FALSE)
dir.create(gdc_download_dir, recursive = TRUE, showWarnings = FALSE)

# Nếu params$force_download chưa được khai báo trong YAML thì mặc định là FALSE
force_download <- FALSE

if (!is.null(params$force_download)) {
  force_download <- isTRUE(params$force_download)
}

# Nếu params$gdc_files_per_chunk chưa được khai báo trong YAML thì mặc định là 10
gdc_files_per_chunk <- 10

if (!is.null(params$gdc_files_per_chunk)) {
  gdc_files_per_chunk <- params$gdc_files_per_chunk
}

# ------------------------------------------------------------
# 1. Nạp lại file RDA nếu đã tồn tại
# ------------------------------------------------------------

if (file.exists(rna_rda_file) && !force_download) {
  
  message("Đã có file RDA. Đang nạp lại: ", rna_rda_file)
  
  loaded_objects <- load(rna_rda_file)
  
  message("Object được nạp từ file RDA: ", paste(loaded_objects, collapse = ", "))
  
  # Trường hợp tốt nhất: file có sẵn object tên rna_se
  if ("rna_se" %in% loaded_objects) {
    
    message("Đã tìm thấy object rna_se trong file RDA.")
    
  # Trường hợp thường gặp khi dùng GDCprepare(save = TRUE):
  # file chứa object tên data thay vì rna_se
  } else if ("data" %in% loaded_objects) {
    
    message("File RDA chứa object tên 'data'. Đổi tên data thành rna_se.")
    rna_se <- data
    
    # Lưu lại file RDA với object tên rna_se để lần sau không lỗi
    save(rna_se, file = rna_rda_file)
    message("Đã ghi đè file RDA với object tên rna_se.")
    
  # Trường hợp khác: tự tìm object thuộc class SummarizedExperiment
  } else {
    
    se_candidates <- loaded_objects[
      sapply(
        loaded_objects,
        function(x) inherits(get(x), "SummarizedExperiment")
      )
    ]
    
    if (length(se_candidates) >= 1) {
      message("Tìm thấy SummarizedExperiment: ", se_candidates[1])
      rna_se <- get(se_candidates[1])
      
      # Lưu lại dưới tên chuẩn rna_se
      save(rna_se, file = rna_rda_file)
      message("Đã chuẩn hóa file RDA với object tên rna_se.")
      
    } else {
      stop(
        "File RDA tồn tại nhưng không chứa object SummarizedExperiment. ",
        "Hãy xóa file này hoặc đặt force_download = TRUE để tải lại."
      )
    }
  }
  
# ------------------------------------------------------------
# 2. Tải mới từ GDC nếu chưa có file hoặc force_download = TRUE
# ------------------------------------------------------------

} else {
  
  message("Chưa có file RDA hoặc force_download = TRUE. Bắt đầu tải dữ liệu từ GDC.")
  
  query_rna <- GDCquery(
    project = "TCGA-THCA",
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    workflow.type = "STAR - Counts"
  )
  
  GDCdownload(
    query = query_rna,
    method = "api",
    files.per.chunk = gdc_files_per_chunk,
    directory = gdc_download_dir
  )
  
  rna_se <- GDCprepare(
    query = query_rna,
    directory = gdc_download_dir,
    save = FALSE
  )
  
  save(rna_se, file = rna_rda_file)
  
  message("Đã lưu rna_se tại: ", rna_rda_file)
}

## Đã có file RDA. Đang nạp lại: /Users/vuanhduy/Desktop/PTC_multi_epitope_project/data_raw/TCGA_THCA_STAR_counts.rda

## Object được nạp từ file RDA: rna_se

## Đã tìm thấy object rna_se trong file RDA.

# ------------------------------------------------------------
# 3. Kiểm tra object sau khi nạp hoặc tải
# ------------------------------------------------------------

if (!exists("rna_se")) {
  stop("Object rna_se vẫn chưa tồn tại sau bước nạp/tải dữ liệu.")
}

if (!inherits(rna_se, "SummarizedExperiment")) {
  stop("rna_se không phải object SummarizedExperiment.")
}

message("Nạp dữ liệu thành công.")

## Nạp dữ liệu thành công.

message("Kích thước object rna_se:")

## Kích thước object rna_se:

print(rna_se)

## class: RangedSummarizedExperiment 
## dim: 60660 572 
## metadata(1): data_release
## assays(6): unstranded stranded_first ... fpkm_unstrand fpkm_uq_unstrand
## rownames(60660): ENSG00000000003.15 ENSG00000000005.6 ...
##   ENSG00000288674.1 ENSG00000288675.1
## rowData names(10): source type ... hgnc_id havana_gene
## colnames(572): TCGA-DJ-A2Q6-01A-11R-A18C-07
##   TCGA-FK-A3SE-01A-11R-A22L-07 ... TCGA-DJ-A2PX-01A-11R-A18C-07
##   TCGA-EL-A3ZS-01A-12R-A23N-07
## colData names(242): barcode patient ... paper_TERT_AC paper_TERT_Q

message("Các assay có sẵn:")

## Các assay có sẵn:

print(assayNames(rna_se))

## [1] "unstranded"       "stranded_first"   "stranded_second"  "tpm_unstrand"    
## [5] "fpkm_unstrand"    "fpkm_uq_unstrand"

load(proj_path("data_raw", "TCGA_THCA_STAR_counts.rda"))

5 2. Trích xuất count matrix, TPM và sample metadata

5.1 Mục đích

Tách các thành phần cần cho phân tích:

counts: raw count matrix dùng cho DESeq2. Counts cho biết mỗi gene có bao nhiêu reads được đếm trong từng mẫu.
tpm: TPM, nếu có, dùng cho mô tả biểu hiện hoặc các phân tích phụ. TPM là dữ liệu TPM, viết tắt của Transcripts Per Million, là dạng biểu hiện gene đã được chuẩn hóa theo:độ dài gene, tổng lượng transcript trong mẫu. TPM thường dùng để mô tả mức biểu hiện gene, vẽ heatmap, vẽ boxplot biểu hiện một số gene, so sánh biểu hiện tương đối của gene quan tâm, minh họa kết quả sau khi đã chọn gene
sample_info: metadata của mẫu, bao gồm mã TCGA barcode và phân loại Tumor/Normal. Sample_info cho biết mỗi mẫu là mẫu ung thư hay mẫu bình thường, mã TCGA barcode là gì, thuộc bệnh nhân nào.

5.2 Cơ sở lý thuyết

Trong đối tượng SummarizedExperiment, nhiều assay có thể cùng tồn tại. Với dữ liệu GDC STAR counts, thường có các assay như:

unstranded
stranded_first
stranded_second
tpm_unstrand
fpkm_unstrand
fpkm_uq_unstrand

DESeq2 cần raw count và count phải là số nguyên. Do đó, chunk này ưu tiên dùng unstranded và ép kiểu integer sau khi làm tròn.

5.3 Phương pháp nghiên cứu

Kiểm tra assay hiện có.
Chọn unstranded làm count assay.
Kiểm tra TPM nếu có.
Dùng ký tự thứ 14-15 của TCGA barcode để xác định sample type.

5.4 Code

# Kiểm tra assay trong đối tượng.
assayNames(rna_se)

## [1] "unstranded"       "stranded_first"   "stranded_second"  "tpm_unstrand"    
## [5] "fpkm_unstrand"    "fpkm_uq_unstrand"

# Chọn raw count assay cho DESeq2.
count_assay <- params$count_assay

if (!count_assay %in% assayNames(rna_se)) {
  stop(
    paste0("Không tìm thấy assay '", count_assay, "'. Các assay hiện có là: "),
    paste(assayNames(rna_se), collapse = ", ")
  )
}

counts <- assay(rna_se, count_assay)

# DESeq2 yêu cầu count là số nguyên.
counts <- round(counts)
storage.mode(counts) <- "integer"

# Tìm TPM assay nếu có.
tpm_assay_candidates <- c("tpm_unstrand", "tpm_unstranded", "TPM", "tpm")
tpm_assay <- intersect(tpm_assay_candidates, assayNames(rna_se))

if (length(tpm_assay) > 0) {
  tpm_assay <- tpm_assay[1]
  tpm <- assay(rna_se, tpm_assay)
  message("Using TPM assay: ", tpm_assay)
} else {
  tpm <- NULL
  message("Không có TPM trong SummarizedExperiment. Sẽ dùng normalized counts cho expression plot.")
}

## Using TPM assay: tpm_unstrand

# Trích xuất metadata mẫu.
sample_info <- as.data.frame(colData(rna_se))
sample_info$sample_id <- colnames(counts)

# TCGA barcode: vị trí 14-15 là sample type code.
sample_info$tcga_sample_code <- substr(sample_info$sample_id, 14, 15)

sample_info$condition <- dplyr::case_when(
  sample_info$tcga_sample_code == "01" ~ "Tumor",
  sample_info$tcga_sample_code == "11" ~ "Normal",
  TRUE ~ NA_character_
)

# Kiểm tra số lượng Tumor/Normal.
table(sample_info$condition, useNA = "ifany")

## 
## Normal  Tumor   <NA> 
##     59    505      8

6 3. Lọc mẫu cho phân tích Tumor vs Normal

6.1 Mục đích

Giữ lại hai nhóm mẫu cần so sánh:

Primary solid tumor.
Solid tissue normal.

6.2 Cơ sở lý thuyết

So sánh Tumor vs Normal là chiến lược thường dùng để nhận diện gene tăng hoặc giảm biểu hiện trong mô u. Đối với quy trình thiết kế vaccine đa epitope, nhóm gene tăng biểu hiện ở u có thể là nguồn kháng nguyên ban đầu, nhưng cần nhấn mạnh rằng biểu hiện RNA không đủ để kết luận antigen phù hợp. Sau đó cần lọc thêm protein expression, tính đặc hiệu mô, đột biến/fusion, MHC binding, antigen processing, độc tính, dị ứng và khả năng kích thích miễn dịch.

6.3 Phương pháp nghiên cứu

Tạo patient_id từ 12 ký tự đầu của TCGA barcode.
Lấy mẫu Tumor và Normal theo biến condition.
Đồng bộ thứ tự cột của count matrix với hàng của sample_info.
Nếu có TPM, đồng bộ TPM theo count matrix.

6.4 Code

sample_info$patient_id <- substr(sample_info$sample_id, 1, 12)

tumor_ptc_samples <- sample_info %>%
  filter(condition == "Tumor") %>%
  pull(sample_id)

normal_samples <- sample_info %>%
  filter(condition == "Normal") %>%
  pull(sample_id)

keep_samples_ptc_analysis <- c(tumor_ptc_samples, normal_samples)
keep_samples_ptc_analysis <- intersect(keep_samples_ptc_analysis, colnames(counts))

counts <- counts[, keep_samples_ptc_analysis, drop = FALSE]

sample_info <- sample_info %>%
  filter(sample_id %in% keep_samples_ptc_analysis)

sample_info <- sample_info[match(colnames(counts), sample_info$sample_id), ]
rownames(sample_info) <- sample_info$sample_id

# Kiểm tra tính đồng bộ giữa count matrix và metadata.
stopifnot(all(colnames(counts) == rownames(sample_info)))

if (!is.null(tpm)) {
  tpm <- tpm[, keep_samples_ptc_analysis, drop = FALSE]
  tpm <- tpm[, colnames(counts), drop = FALSE]
}

table(sample_info$condition, useNA = "ifany")

## 
## Normal  Tumor 
##     59    505

7 4. Phân tích biểu hiện khác biệt bằng DESeq2

7.1 Mục đích

Xác định gene biểu hiện khác biệt giữa nhóm u và nhóm bình thường.

7.2 Cơ sở lý thuyết

DESeq2 mô hình hóa count của mỗi gene bằng phân phối negative binomial. Quy trình chính gồm:

Ước lượng size factor để hiệu chỉnh độ sâu giải trình tự.
Ước lượng dispersion cho từng gene.
Fit generalized linear model.
Kiểm định khác biệt biểu hiện giữa Tumor và Normal.
Hiệu chỉnh đa kiểm định để tạo padj.

7.3 Phương pháp nghiên cứu

Đặt Normal làm mức tham chiếu.
Lọc gene có count thấp: giữ gene có count >= 10 ở ít nhất 10 mẫu.
Tạo DESeqDataSet với thiết kế ~ condition.
So sánh Tumor so với Normal.
DEG có ý nghĩa: padj < 0.05 và abs(log2FoldChange) >= 1.

7.4 Code

# Đặt Normal là nhóm tham chiếu để log2FC = Tumor / Normal.
sample_info$condition <- factor(
  sample_info$condition,
  levels = c("Normal", "Tumor")
)

# Chỉ giữ sample có condition rõ ràng.
keep_condition <- !is.na(sample_info$condition)
counts <- counts[, keep_condition, drop = FALSE]
sample_info <- sample_info[keep_condition, , drop = FALSE]

stopifnot(all(colnames(counts) == rownames(sample_info)))

# Lọc gene có count quá thấp.
keep_genes <- rowSums(counts >= params$min_count) >= params$min_samples
counts_filtered <- counts[keep_genes, , drop = FALSE]

dim(counts)

## [1] 60660   564

dim(counts_filtered)

## [1] 28853   564

# Tạo DESeq2 object.
dds <- DESeqDataSetFromMatrix(
  countData = counts_filtered,
  colData = sample_info,
  design = ~ condition
)

# Lọc thêm gene tổng count quá thấp.
dds <- dds[rowSums(counts(dds)) > params$min_count, ]

# Chạy DESeq2.
dds <- DESeq(dds)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

## -- replacing outliers and refitting for 1568 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)

## estimating dispersions

## fitting model and testing

res <- results(
  dds,
  contrast = c("condition", "Tumor", "Normal"),
  alpha = 0.05
)

# Chuyển kết quả sang data frame.
res_df <- as.data.frame(res)
res_df$gene_id <- rownames(res_df)

res_df <- res_df %>%
  arrange(padj)

head(res_df)

##                     baseMean log2FoldChange     lfcSE     stat        pvalue
## ENSG00000145864.14  5747.334       7.140052 0.2338046 30.53854 8.027451e-205
## ENSG00000170439.7   3657.721       4.319133 0.1585528 27.24098 2.125361e-163
## ENSG00000163898.10  3909.655       5.672596 0.2099107 27.02386 7.750831e-161
## ENSG00000259803.7   3021.754       7.150000 0.2644916 27.03299 6.054084e-161
## ENSG00000134569.10 12692.303       4.505362 0.1753171 25.69836 1.219299e-145
## ENSG00000162873.15  2924.457       4.903984 0.1924817 25.47766 3.486703e-143
##                             padj            gene_id
## ENSG00000145864.14 2.316160e-200 ENSG00000145864.14
## ENSG00000170439.7  3.066153e-159  ENSG00000170439.7
## ENSG00000163898.10 5.590868e-157 ENSG00000163898.10
## ENSG00000259803.7  5.590868e-157  ENSG00000259803.7
## ENSG00000134569.10 7.036085e-142 ENSG00000134569.10
## ENSG00000162873.15 1.676698e-139 ENSG00000162873.15

summary(res)

## 
## out of 28853 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 9546, 33%
## LFC < 0 (down)     : 10345, 36%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

# Chọn DEG có ý nghĩa.
deg_sig <- res_df %>%
  filter(
    !is.na(padj),
    padj < params$deg_padj_cutoff,
    abs(log2FoldChange) >= params$deg_log2fc_cutoff
  )

dim(deg_sig)

## [1] 5456    7

head(deg_sig)

##                     baseMean log2FoldChange     lfcSE     stat        pvalue
## ENSG00000145864.14  5747.334       7.140052 0.2338046 30.53854 8.027451e-205
## ENSG00000170439.7   3657.721       4.319133 0.1585528 27.24098 2.125361e-163
## ENSG00000163898.10  3909.655       5.672596 0.2099107 27.02386 7.750831e-161
## ENSG00000259803.7   3021.754       7.150000 0.2644916 27.03299 6.054084e-161
## ENSG00000134569.10 12692.303       4.505362 0.1753171 25.69836 1.219299e-145
## ENSG00000162873.15  2924.457       4.903984 0.1924817 25.47766 3.486703e-143
##                             padj            gene_id
## ENSG00000145864.14 2.316160e-200 ENSG00000145864.14
## ENSG00000170439.7  3.066153e-159  ENSG00000170439.7
## ENSG00000163898.10 5.590868e-157 ENSG00000163898.10
## ENSG00000259803.7  5.590868e-157  ENSG00000259803.7
## ENSG00000134569.10 7.036085e-142 ENSG00000134569.10
## ENSG00000162873.15 1.676698e-139 ENSG00000162873.15

8 5. Annotation kết quả DESeq2

8.1 Mục đích

Gắn gene_symbol và gene_type cho kết quả DESeq2 để dễ diễn giải sinh học.

8.2 Cơ sở lý thuyết

DESeq2 trả kết quả theo rowname của count matrix, thường là Ensembl gene ID. Tuy nhiên, diễn giải y sinh thường cần gene symbol, ví dụ BRAF, RET, NTRK1. Annotation từ rowData(rna_se) giúp chuyển Ensembl ID sang gene symbol và gene biotype.

8.3 Phương pháp nghiên cứu

Trích xuất rowData(rna_se).
Xây dựng hàm chọn cột annotation linh hoạt vì tên cột có thể khác nhau giữa phiên bản dữ liệu.
Bỏ version Ensembl sau dấu chấm, ví dụ ENSG00000157764.14 thành ENSG00000157764.
Loại trùng theo gene ID đã làm sạch.

8.4 Code

gene_anno <- as.data.frame(rowData(rna_se))

# Xem các cột annotation hiện có.
head(colnames(gene_anno), 20)

##  [1] "source"      "type"        "score"       "phase"       "gene_id"    
##  [6] "gene_type"   "gene_name"   "level"       "hgnc_id"     "havana_gene"

# Tạo gene_id từ rownames nếu cần.
gene_anno$gene_id_original <- rownames(gene_anno)

# Hàm chọn cột annotation an toàn.
get_first_existing_col <- function(df, candidates, default = NA_character_) {
  hit <- intersect(candidates, colnames(df))
  if (length(hit) > 0) {
    return(as.character(df[[hit[1]]]))
  } else {
    return(rep(default, nrow(df)))
  }
}

# Lấy gene_id, gene_symbol, gene_type nếu có.
gene_anno_clean <- data.frame(
  gene_id_original = gene_anno$gene_id_original,
  gene_id = get_first_existing_col(
    gene_anno,
    candidates = c("gene_id", "ensembl_gene_id", "ID")
  ),
  gene_symbol = get_first_existing_col(
    gene_anno,
    candidates = c("gene_name", "external_gene_name", "gene_symbol", "symbol")
  ),
  gene_type = get_first_existing_col(
    gene_anno,
    candidates = c("gene_type", "gene_biotype", "type")
  ),
  stringsAsFactors = FALSE
)

# Nếu gene_id bị thiếu, dùng rownames.
gene_anno_clean$gene_id <- ifelse(
  is.na(gene_anno_clean$gene_id) | gene_anno_clean$gene_id == "",
  gene_anno_clean$gene_id_original,
  gene_anno_clean$gene_id
)

# Bỏ version Ensembl nếu có.
gene_anno_clean$gene_id_clean <- sub("\\..*$", "", gene_anno_clean$gene_id)

gene_anno_clean <- gene_anno_clean %>%
  distinct(gene_id_clean, .keep_all = TRUE)

head(gene_anno_clean)

##     gene_id_original            gene_id gene_symbol      gene_type
## 1 ENSG00000000003.15 ENSG00000000003.15      TSPAN6 protein_coding
## 2  ENSG00000000005.6  ENSG00000000005.6        TNMD protein_coding
## 3 ENSG00000000419.13 ENSG00000000419.13        DPM1 protein_coding
## 4 ENSG00000000457.14 ENSG00000000457.14       SCYL3 protein_coding
## 5 ENSG00000000460.17 ENSG00000000460.17    C1orf112 protein_coding
## 6 ENSG00000000938.13 ENSG00000000938.13         FGR protein_coding
##     gene_id_clean
## 1 ENSG00000000003
## 2 ENSG00000000005
## 3 ENSG00000000419
## 4 ENSG00000000457
## 5 ENSG00000000460
## 6 ENSG00000000938

9 6. Gộp kết quả DESeq2 với annotation

9.1 Mục đích

Tạo bảng kết quả cuối cùng có đầy đủ thống kê DESeq2 và thông tin gene symbol/gene type.

9.2 Cơ sở lý thuyết

Một bảng DEG chỉ có Ensembl ID sẽ khó dùng trong phân tích ung thư học. Việc thêm gene symbol cho phép lọc các gene driver, gene fusion, gene biệt hóa tuyến giáp và gene liên quan tiến triển.

9.3 Phương pháp nghiên cứu

Làm sạch gene ID trong kết quả DESeq2.
left_join() với bảng annotation.
Phân loại hướng thay đổi biểu hiện:
- Upregulated_in_Tumor.
- Downregulated_in_Tumor.
- Not_significant.

9.4 Code

res_annotated <- res_df %>%
  mutate(
    gene_id_clean = sub("\\..*$", "", gene_id)
  ) %>%
  left_join(
    gene_anno_clean %>%
      select(gene_id_clean, gene_symbol, gene_type),
    by = "gene_id_clean"
  ) %>%
  mutate(
    direction = case_when(
      padj < params$deg_padj_cutoff & log2FoldChange >= params$deg_log2fc_cutoff  ~ "Upregulated_in_Tumor",
      padj < params$deg_padj_cutoff & log2FoldChange <= -params$deg_log2fc_cutoff ~ "Downregulated_in_Tumor",
      TRUE ~ "Not_significant"
    )
  ) %>%
  arrange(padj)

head(res_annotated)

##    baseMean log2FoldChange     lfcSE     stat        pvalue          padj
## 1  5747.334       7.140052 0.2338046 30.53854 8.027451e-205 2.316160e-200
## 2  3657.721       4.319133 0.1585528 27.24098 2.125361e-163 3.066153e-159
## 3  3909.655       5.672596 0.2099107 27.02386 7.750831e-161 5.590868e-157
## 4  3021.754       7.150000 0.2644916 27.03299 6.054084e-161 5.590868e-157
## 5 12692.303       4.505362 0.1753171 25.69836 1.219299e-145 7.036085e-142
## 6  2924.457       4.903984 0.1924817 25.47766 3.486703e-143 1.676698e-139
##              gene_id   gene_id_clean gene_symbol      gene_type
## 1 ENSG00000145864.14 ENSG00000145864      GABRB2 protein_coding
## 2  ENSG00000170439.7 ENSG00000170439     METTL7B protein_coding
## 3 ENSG00000163898.10 ENSG00000163898        LIPH protein_coding
## 4  ENSG00000259803.7 ENSG00000259803    SLC22A31 protein_coding
## 5 ENSG00000134569.10 ENSG00000134569        LRP4 protein_coding
## 6 ENSG00000162873.15 ENSG00000162873     KLHDC8A protein_coding
##              direction
## 1 Upregulated_in_Tumor
## 2 Upregulated_in_Tumor
## 3 Upregulated_in_Tumor
## 4 Upregulated_in_Tumor
## 5 Upregulated_in_Tumor
## 6 Upregulated_in_Tumor

table(res_annotated$direction, useNA = "ifany")

## 
## Downregulated_in_Tumor        Not_significant   Upregulated_in_Tumor 
##                   2476                  23397                   2980

deg_sig_annotated <- res_annotated %>%
  filter(
    !is.na(padj),
    padj < params$deg_padj_cutoff,
    abs(log2FoldChange) >= params$deg_log2fc_cutoff
  )
dim(deg_sig_annotated)

## [1] 5456   11

head(deg_sig_annotated)

##    baseMean log2FoldChange     lfcSE     stat        pvalue          padj
## 1  5747.334       7.140052 0.2338046 30.53854 8.027451e-205 2.316160e-200
## 2  3657.721       4.319133 0.1585528 27.24098 2.125361e-163 3.066153e-159
## 3  3909.655       5.672596 0.2099107 27.02386 7.750831e-161 5.590868e-157
## 4  3021.754       7.150000 0.2644916 27.03299 6.054084e-161 5.590868e-157
## 5 12692.303       4.505362 0.1753171 25.69836 1.219299e-145 7.036085e-142
## 6  2924.457       4.903984 0.1924817 25.47766 3.486703e-143 1.676698e-139
##              gene_id   gene_id_clean gene_symbol      gene_type
## 1 ENSG00000145864.14 ENSG00000145864      GABRB2 protein_coding
## 2  ENSG00000170439.7 ENSG00000170439     METTL7B protein_coding
## 3 ENSG00000163898.10 ENSG00000163898        LIPH protein_coding
## 4  ENSG00000259803.7 ENSG00000259803    SLC22A31 protein_coding
## 5 ENSG00000134569.10 ENSG00000134569        LRP4 protein_coding
## 6 ENSG00000162873.15 ENSG00000162873     KLHDC8A protein_coding
##              direction
## 1 Upregulated_in_Tumor
## 2 Upregulated_in_Tumor
## 3 Upregulated_in_Tumor
## 4 Upregulated_in_Tumor
## 5 Upregulated_in_Tumor
## 6 Upregulated_in_Tumor

10 7. Kiểm tra nhóm gene đích liên quan PTC

10.1 Mục đích

Trích riêng kết quả DESeq2 của các gene đang được quan tâm trong bối cảnh PTC và vaccine đa epitope.

10.2 Cơ sở lý thuyết

Các gene như BRAF, RET, NTRK1, NTRK3, NRAS, HRAS, KRAS, TERT là những gene thường được thảo luận trong sinh học phân tử PTC. Tuy nhiên, cần phân biệt:

Gene driver/oncogenic alteration: có giá trị sinh học ung thư, nhưng không nhất thiết tăng biểu hiện RNA.
Kháng nguyên vaccine: cần có bằng chứng biểu hiện protein, khả năng tạo peptide trình diện qua HLA và tính an toàn miễn dịch.
Neoantigen: nên dựa vào đột biến/fusion đặc hiệu khối u, không chỉ gene wild-type tăng biểu hiện.

10.3 Phương pháp nghiên cứu

Lọc bảng res_annotated theo danh sách gene đích và lưu thành bảng riêng.

10.4 Code

ptc_target_genes <- c(
  "BRAF",
  "RET",
  "NTRK1",
  "NTRK3",
  "NRAS",
  "HRAS",
  "KRAS",
  "TERT",
  "TG",
  "TTN"
)

ptc_target_result <- res_annotated %>%
  filter(gene_symbol %in% ptc_target_genes) %>%
  select(
    gene_id,
    gene_symbol,
    gene_type,
    baseMean,
    log2FoldChange,
    lfcSE,
    stat,
    pvalue,
    padj,
    direction
  ) %>%
  arrange(padj)

ptc_target_result

##               gene_id gene_symbol      gene_type     baseMean log2FoldChange
## 1  ENSG00000140538.16       NTRK3 protein_coding 5.884629e+02     1.68478216
## 2  ENSG00000042832.12          TG protein_coding 2.342569e+06    -1.12906181
## 3  ENSG00000174775.17        HRAS protein_coding 1.726473e+03     0.59891786
## 4  ENSG00000165731.20         RET protein_coding 3.426115e+02     1.69653085
## 5  ENSG00000157764.14        BRAF protein_coding 2.691720e+03     0.08687558
## 6  ENSG00000155657.28         TTN protein_coding 6.828899e+02    -0.27883741
## 7  ENSG00000198400.12       NTRK1 protein_coding 9.670243e+00     0.25791740
## 8  ENSG00000164362.21        TERT protein_coding 4.406779e+00    -0.56834370
## 9  ENSG00000133703.13        KRAS protein_coding 1.860714e+03     0.04447180
## 10  ENSG00000213281.5        NRAS protein_coding 2.462547e+03    -0.03461003
##         lfcSE       stat       pvalue         padj              direction
## 1  0.19750444  8.5303510 1.459043e-17 1.207624e-16   Upregulated_in_Tumor
## 2  0.16356168 -6.9029728 5.092540e-12 2.550513e-11 Downregulated_in_Tumor
## 3  0.09187965  6.5185039 7.101208e-11 3.163853e-10        Not_significant
## 4  0.29327886  5.7847021 7.264094e-09 2.614657e-08   Upregulated_in_Tumor
## 5  0.04308163  2.0165339 4.374417e-02 6.228229e-02        Not_significant
## 6  0.15120715 -1.8440756 6.517212e-02 8.966722e-02        Not_significant
## 7  0.23050104  1.1189425 2.631647e-01 3.176361e-01        Not_significant
## 8  0.52547519 -1.0815805 2.794390e-01 3.352594e-01        Not_significant
## 9  0.04718236  0.9425512 3.459105e-01 4.049551e-01        Not_significant
## 10 0.05488486 -0.6305934 5.283064e-01 5.851301e-01        Not_significant

11 8. Trích xuất normalized counts

11.1 Mục đích

Tạo ma trận normalized counts để mô tả biểu hiện gene và vẽ hình.

11.2 Cơ sở lý thuyết

DESeq2 normalized counts là count đã được hiệu chỉnh theo size factor, giúp giảm ảnh hưởng của độ sâu giải trình tự giữa các mẫu. Normalized counts thích hợp để mô tả và vẽ biểu đồ, nhưng phân tích thống kê khác biệt vẫn nên dựa trên mô hình DESeq2 thay vì kiểm định trực tiếp trên normalized counts.

11.3 Phương pháp nghiên cứu

Dùng counts(dds, normalized = TRUE).
Chuyển sang data frame.
Gắn annotation gene symbol và gene type.

11.4 Code

norm_counts <- counts(dds, normalized = TRUE)

norm_counts_df <- as.data.frame(norm_counts) %>%
  rownames_to_column("gene_id")

norm_counts_df$gene_id_clean <- sub("\\..*$", "", norm_counts_df$gene_id)

norm_counts_annotated <- norm_counts_df %>%
  left_join(
    gene_anno_clean %>%
      select(gene_id_clean, gene_symbol, gene_type),
    by = "gene_id_clean"
  )

12 9. Tạo bảng expression dạng long cho gene đích

12.1 Mục đích

Chuyển ma trận biểu hiện gene đích sang dạng long để dễ vẽ boxplot theo từng gene.

12.2 Cơ sở lý thuyết

Dữ liệu dạng wide phù hợp lưu trữ ma trận, nhưng dữ liệu dạng long phù hợp với ggplot2, vì mỗi dòng tương ứng một quan sát: một gene, một mẫu, một giá trị biểu hiện.

12.3 Phương pháp nghiên cứu

Lọc normalized counts theo ptc_target_genes.
Dùng pivot_longer() để chuyển sample columns thành biến sample_id.
Gắn condition và patient_id từ sample_info.

12.4 Code

target_norm_counts <- norm_counts_annotated %>%
  filter(gene_symbol %in% ptc_target_genes)

target_norm_counts_long <- target_norm_counts %>%
  select(-gene_id_clean, -gene_type) %>%
  tidyr::pivot_longer(
    cols = -c(gene_id, gene_symbol),
    names_to = "sample_id",
    values_to = "normalized_count"
  ) %>%
  left_join(
    sample_info %>%
      select(sample_id, condition, patient_id),
    by = "sample_id"
  )

head(target_norm_counts_long)

## # A tibble: 6 × 6
##   gene_id            gene_symbol sample_id normalized_count condition patient_id
##   <chr>              <chr>       <chr>                <dbl> <fct>     <chr>     
## 1 ENSG00000042832.12 TG          TCGA-DJ-…         1330788. Tumor     TCGA-DJ-A…
## 2 ENSG00000042832.12 TG          TCGA-FK-…         1565019. Tumor     TCGA-FK-A…
## 3 ENSG00000042832.12 TG          TCGA-DJ-…         4564491. Tumor     TCGA-DJ-A…
## 4 ENSG00000042832.12 TG          TCGA-FY-…         4235972. Tumor     TCGA-FY-A…
## 5 ENSG00000042832.12 TG          TCGA-EL-…          957345. Tumor     TCGA-EL-A…
## 6 ENSG00000042832.12 TG          TCGA-EL-…          591310. Tumor     TCGA-EL-A…

13 10. Boxplot biểu hiện gene đích

13.1 Mục đích

Trực quan hóa mức biểu hiện của các gene đích giữa nhóm Tumor và Normal.

13.2 Cơ sở lý thuyết

Boxplot giúp đánh giá phân bố biểu hiện giữa hai nhóm. Trục y dùng log2(normalized_count + 1) để giảm lệch phải của RNA-seq count và tránh log của 0.

13.3 Phương pháp nghiên cứu

Mỗi facet là một gene.
Trục x là condition.
Trục y là log2 normalized count.
Dùng jitter để quan sát từng mẫu riêng lẻ.

13.4 Code

p_boxplot <- ggplot(
  target_norm_counts_long,
  aes(
    x = condition,
    y = log2(normalized_count + 1),
    fill = condition
  )
) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  geom_jitter(width = 0.2, size = 0.8, alpha = 0.5) +
  facet_wrap(~ gene_symbol, scales = "free_y") +
  theme_bw() +
  labs(
    title = "Expression of PTC-related target genes in TCGA-THCA",
    x = "",
    y = "log2(normalized count + 1)"
  ) +
  theme(
    legend.position = "none",
    strip.text = element_text(face = "bold")
  )

p_boxplot

14 11. Volcano plot

14.1 Mục đích

Tổng quan toàn bộ kết quả DESeq2 theo hai chiều:

Trục x: log2FoldChange.
Trục y: -log10(padj).

14.2 Cơ sở lý thuyết

Volcano plot giúp nhận diện gene vừa có độ lớn thay đổi biểu hiện cao vừa có ý nghĩa thống kê mạnh. Gene nằm xa về bên phải là tăng trong Tumor, xa về bên trái là giảm trong Tumor, và càng cao thì padj càng nhỏ.

14.3 Phương pháp nghiên cứu

Phân loại gene thành Upregulated, Downregulated và Not significant.
Vẽ đường ngưỡng log2FC = ±1 và padj = 0.05.
Xử lý padj = 0 bằng .Machine$double.xmin để tránh Inf.

14.4 Code

volcano_df <- res_annotated %>%
  mutate(
    padj_for_plot = ifelse(is.na(padj), NA_real_, pmax(padj, .Machine$double.xmin)),
    neg_log10_padj = -log10(padj_for_plot),
    significant = case_when(
      padj < params$deg_padj_cutoff & log2FoldChange >= params$deg_log2fc_cutoff  ~ "Upregulated",
      padj < params$deg_padj_cutoff & log2FoldChange <= -params$deg_log2fc_cutoff ~ "Downregulated",
      TRUE ~ "Not significant"
    )
  )

p_volcano <- ggplot(
  volcano_df,
  aes(
    x = log2FoldChange,
    y = neg_log10_padj,
    color = significant
  )
) +
  geom_point(alpha = 0.6, size = 1) +
  geom_vline(xintercept = c(-params$deg_log2fc_cutoff, params$deg_log2fc_cutoff), linetype = "dashed") +
  geom_hline(yintercept = -log10(params$deg_padj_cutoff), linetype = "dashed") +
  theme_bw() +
  labs(
    title = "Volcano plot: PTC Tumor vs Normal",
    x = "log2 fold change",
    y = "-log10 adjusted p-value"
  )

p_volcano

15 12. Heatmap top 50 DEG

15.1 Mục đích

Vẽ heatmap của 50 gene có padj nhỏ nhất trong nhóm DEG có ý nghĩa.

15.2 Cơ sở lý thuyết

Heatmap giúp đánh giá xem các gene khác biệt biểu hiện có phân tách mẫu Tumor và Normal theo pattern biểu hiện hay không. scale = "row" chuẩn hóa từng gene theo z-score để nhấn mạnh sự thay đổi tương đối giữa mẫu.

15.3 Phương pháp nghiên cứu

Chọn top 50 DEG theo padj.
Trích normalized counts.
Chuyển sang log2(count + 1).
Đổi rowname thành gene symbol.
Thêm annotation cột theo condition.

15.4 Code

top_genes <- deg_sig_annotated %>%
  filter(!is.na(gene_symbol)) %>%
  arrange(padj) %>%
  slice_head(n = 50)

top_gene_ids <- intersect(top_genes$gene_id, rownames(norm_counts))

if (length(top_gene_ids) >= 2) {
  heatmap_mat <- norm_counts[top_gene_ids, , drop = FALSE]
  heatmap_mat_log <- log2(heatmap_mat + 1)

  gene_symbol_map <- top_genes$gene_symbol
  names(gene_symbol_map) <- top_genes$gene_id
  rownames(heatmap_mat_log) <- gene_symbol_map[rownames(heatmap_mat_log)]

  annotation_col <- sample_info %>%
    select(condition) %>%
    as.data.frame()

  rownames(annotation_col) <- rownames(sample_info)

  pheatmap(
    heatmap_mat_log,
    scale = "row",
    annotation_col = annotation_col,
    show_colnames = FALSE,
    fontsize_row = 7,
    main = "Top 50 DEGs: PTC Tumor vs Normal"
  )

  png(
    filename = file.path("results", "plots", "TCGA_THCA_PTC_Top50_DEGs_heatmap.png"),
    width = 1200,
    height = 1000,
    res = 150
  )

  pheatmap(
    heatmap_mat_log,
    scale = "row",
    annotation_col = annotation_col,
    show_colnames = FALSE,
    fontsize_row = 7,
    main = "Top 50 DEGs: PTC Tumor vs Normal"
  )

  dev.off()
} else {
  message("Không đủ gene để vẽ heatmap.")
}

## quartz_off_screen 
##                 3

16 13. Xây dựng panel gene PTC mở rộng

16.1 Mục đích

Tạo danh sách gene liên quan PTC để trích xuất kết quả DESeq2 và chấm điểm ưu tiên sơ bộ.

16.2 Cơ sở lý thuyết

Trong PTC, các nhóm gene thường được xem xét gồm:

Driver MAPK: BRAF, RAS, EIF1AX.
Fusion hoặc kinase: RET, NTRK1, NTRK3, ALK, MET, ROS1, PPARG.
Tiến triển/nguy cơ cao: TERT, TP53, PIK3CA, PTEN, AKT1, MTOR, CDKN2A, CDKN2B.
Biệt hóa tuyến giáp: TG, TPO, TSHR, SLC5A5, PAX8, NKX2-1, FOXE1, DIO1, DIO2, DIO3.

Đối với vaccine đa epitope, panel này chỉ là bước định hướng.

16.3 Phương pháp nghiên cứu

Tạo bốn nhóm gene rồi gộp thành ptc_extended_genes bằng unique().

16.4 Code

ptc_driver_genes <- c(
  "BRAF", "NRAS", "HRAS", "KRAS", "EIF1AX"
)

ptc_fusion_genes <- c(
  "RET", "NTRK1", "NTRK3",
  "ALK", "MET", "ROS1", "PPARG"
)

ptc_progression_genes <- c(
  "TERT", "TP53", "PIK3CA", "PTEN",
  "AKT1", "MTOR", "CDKN2A", "CDKN2B",
  "DICER1", "CHEK2", "PPM1D"
)

thyroid_lineage_genes <- c(
  "TG", "TPO", "TSHR", "SLC5A5",
  "PAX8", "NKX2-1", "FOXE1",
  "DIO1", "DIO2", "DIO3"
)

ptc_extended_genes <- unique(c(
  ptc_driver_genes,
  ptc_fusion_genes,
  ptc_progression_genes,
  thyroid_lineage_genes
))

length(ptc_extended_genes)

## [1] 33

ptc_extended_genes

##  [1] "BRAF"   "NRAS"   "HRAS"   "KRAS"   "EIF1AX" "RET"    "NTRK1"  "NTRK3" 
##  [9] "ALK"    "MET"    "ROS1"   "PPARG"  "TERT"   "TP53"   "PIK3CA" "PTEN"  
## [17] "AKT1"   "MTOR"   "CDKN2A" "CDKN2B" "DICER1" "CHEK2"  "PPM1D"  "TG"    
## [25] "TPO"    "TSHR"   "SLC5A5" "PAX8"   "NKX2-1" "FOXE1"  "DIO1"   "DIO2"  
## [33] "DIO3"

17 14. Trích xuất kết quả DESeq2 cho panel PTC mở rộng

17.1 Mục đích

Tạo bảng riêng cho các gene liên quan PTC mở rộng, kèm nhóm gene và diễn giải biểu hiện.

17.2 Cơ sở lý thuyết

Một gene có thể quan trọng trong PTC dù không tăng biểu hiện RNA. Ví dụ, BRAF quan trọng chủ yếu vì đột biến p.V600E, còn RET/NTRK quan trọng vì fusion. Do đó, kết quả DESeq2 cần được đọc như bằng chứng biểu hiện, không thay thế cho bằng chứng biến thể gene.

17.3 Phương pháp nghiên cứu

Lọc theo ptc_extended_genes.
Gán nhóm gene.
Diễn giải biểu hiện theo padj và log2FoldChange.
Lưu bảng kết quả.

17.4 Code

ptc_extended_result <- res_annotated %>%
  filter(gene_symbol %in% ptc_extended_genes) %>%
  mutate(
    gene_group = case_when(
      gene_symbol %in% ptc_driver_genes ~ "Driver_MAPK",
      gene_symbol %in% ptc_fusion_genes ~ "Fusion_or_kinase",
      gene_symbol %in% ptc_progression_genes ~ "Progression_or_high_risk",
      gene_symbol %in% thyroid_lineage_genes ~ "Thyroid_lineage",
      TRUE ~ "Other"
    ),
    expression_interpretation = case_when(
      padj < params$deg_padj_cutoff & log2FoldChange >= params$deg_log2fc_cutoff ~ "Strongly_upregulated_in_Tumor",
      padj < params$deg_padj_cutoff & log2FoldChange <= -params$deg_log2fc_cutoff ~ "Strongly_downregulated_in_Tumor",
      padj < params$deg_padj_cutoff & abs(log2FoldChange) < params$deg_log2fc_cutoff ~ "Statistically_significant_but_low_FC",
      TRUE ~ "Not_statistically_significant"
    )
  ) %>%
  select(
    gene_group,
    gene_id,
    gene_symbol,
    gene_type,
    baseMean,
    log2FoldChange,
    lfcSE,
    stat,
    pvalue,
    padj,
    expression_interpretation
  ) %>%
  arrange(gene_group, padj)

ptc_extended_result

##                  gene_group            gene_id gene_symbol      gene_type
## 1               Driver_MAPK ENSG00000174775.17        HRAS protein_coding
## 2               Driver_MAPK ENSG00000173674.11      EIF1AX protein_coding
## 3               Driver_MAPK ENSG00000157764.14        BRAF protein_coding
## 4               Driver_MAPK ENSG00000133703.13        KRAS protein_coding
## 5               Driver_MAPK  ENSG00000213281.5        NRAS protein_coding
## 6          Fusion_or_kinase ENSG00000105976.16         MET protein_coding
## 7          Fusion_or_kinase ENSG00000171094.18         ALK protein_coding
## 8          Fusion_or_kinase ENSG00000140538.16       NTRK3 protein_coding
## 9          Fusion_or_kinase ENSG00000132170.23       PPARG protein_coding
## 10         Fusion_or_kinase ENSG00000047936.11        ROS1 protein_coding
## 11         Fusion_or_kinase ENSG00000165731.20         RET protein_coding
## 12         Fusion_or_kinase ENSG00000198400.12       NTRK1 protein_coding
## 13 Progression_or_high_risk ENSG00000147883.12      CDKN2B protein_coding
## 14 Progression_or_high_risk ENSG00000147889.18      CDKN2A protein_coding
## 15 Progression_or_high_risk ENSG00000141510.18        TP53 protein_coding
## 16 Progression_or_high_risk ENSG00000100697.15      DICER1 protein_coding
## 17 Progression_or_high_risk ENSG00000183765.23       CHEK2 protein_coding
## 18 Progression_or_high_risk ENSG00000142208.17        AKT1 protein_coding
## 19 Progression_or_high_risk ENSG00000171862.11        PTEN protein_coding
## 20 Progression_or_high_risk ENSG00000198793.13        MTOR protein_coding
## 21 Progression_or_high_risk ENSG00000170836.12       PPM1D protein_coding
## 22 Progression_or_high_risk ENSG00000164362.21        TERT protein_coding
## 23 Progression_or_high_risk  ENSG00000121879.6      PIK3CA protein_coding
## 24          Thyroid_lineage ENSG00000136352.19      NKX2-1 protein_coding
## 25          Thyroid_lineage ENSG00000125618.17        PAX8 protein_coding
## 26          Thyroid_lineage ENSG00000211448.13        DIO2 protein_coding
## 27          Thyroid_lineage ENSG00000042832.12          TG protein_coding
## 28          Thyroid_lineage  ENSG00000197406.7        DIO3 protein_coding
## 29          Thyroid_lineage  ENSG00000105641.4      SLC5A5 protein_coding
## 30          Thyroid_lineage ENSG00000165409.18        TSHR protein_coding
## 31          Thyroid_lineage ENSG00000115705.22         TPO protein_coding
## 32          Thyroid_lineage ENSG00000211452.11        DIO1 protein_coding
## 33          Thyroid_lineage  ENSG00000178919.9       FOXE1 protein_coding
##        baseMean log2FoldChange      lfcSE       stat       pvalue         padj
## 1  1.726473e+03     0.59891786 0.09187965  6.5185039 7.101208e-11 3.163853e-10
## 2  4.742313e+03    -0.20774299 0.04508766 -4.6075353 4.074700e-06 1.066467e-05
## 3  2.691720e+03     0.08687558 0.04308163  2.0165339 4.374417e-02 6.228229e-02
## 4  1.860714e+03     0.04447180 0.04718236  0.9425512 3.459105e-01 4.049551e-01
## 5  2.462547e+03    -0.03461003 0.05488486 -0.6305934 5.283064e-01 5.851301e-01
## 6  2.794368e+04     2.67762563 0.14948195 17.9127021 9.386618e-72 1.631519e-69
## 7  2.160376e+02     3.21414491 0.19049283 16.8727865 7.134968e-64 8.835417e-62
## 8  5.884629e+02     1.68478216 0.19750444  8.5303510 1.459043e-17 1.207624e-16
## 9  5.084104e+02    -0.87089889 0.13140071 -6.6278094 3.407049e-11 1.561614e-10
## 10 1.613612e+01     2.97071741 0.45977920  6.4611827 1.038878e-10 4.537504e-10
## 11 3.426115e+02     1.69653085 0.29327886  5.7847021 7.264094e-09 2.614657e-08
## 12 9.670243e+00     0.25791740 0.23050104  1.1189425 2.631647e-01 3.176361e-01
## 13 6.510688e+02     2.97114555 0.15200291 19.5466358 4.406375e-85 1.258784e-82
## 14 2.592684e+02     3.19851691 0.22022816 14.5236506 8.581562e-48 5.042848e-46
## 15 3.134114e+03     0.48302146 0.04868008  9.9223642 3.327788e-23 4.263618e-22
## 16 3.720577e+03    -0.63392825 0.06485250 -9.7749235 1.442662e-22 1.753375e-21
## 17 4.751797e+02     0.40784810 0.07974622  5.1143250 3.148648e-07 9.442672e-07
## 18 6.678602e+03     0.22934978 0.04757751  4.8205501 1.431629e-06 3.959623e-06
## 19 6.384400e+03    -0.12297798 0.04186449 -2.9375249 3.308436e-03 5.727041e-03
## 20 3.637936e+03     0.09121340 0.04768823  1.9127027 5.578612e-02 7.783351e-02
## 21 1.036964e+03    -0.08264470 0.04449921 -1.8572170 6.328029e-02 8.733086e-02
## 22 4.406779e+00    -0.56834370 0.52547519 -1.0815805 2.794390e-01 3.352594e-01
## 23 1.608636e+03     0.04411375 0.07527116  0.5860645 5.578322e-01 6.128677e-01
## 24 1.607623e+04     0.63920190 0.08042123  7.9481738 1.892810e-15 1.306850e-14
## 25 4.812762e+04    -0.60278570 0.08321456 -7.2437527 4.364370e-13 2.419776e-12
## 26 1.876872e+04    -1.32095629 0.18321300 -7.2099486 5.597300e-13 3.062751e-12
## 27 2.342569e+06    -1.12906181 0.16356168 -6.9029728 5.092540e-12 2.550513e-11
## 28 2.005290e+01    -1.75660612 0.31149529 -5.6392703 1.707723e-08 5.904484e-08
## 29 1.231463e+03    -2.34774688 0.42423284 -5.5340998 3.128308e-08 1.051226e-07
## 30 3.541045e+04    -0.43457743 0.07884485 -5.5118049 3.551726e-08 1.186225e-07
## 31 9.837561e+04    -1.72815412 0.31537893 -5.4796118 4.262600e-08 1.413177e-07
## 32 7.043099e+03    -1.28227402 0.33903895 -3.7820847 1.555204e-04 3.301863e-04
## 33 1.835582e+04    -0.30850514 0.10353100 -2.9798335 2.884051e-03 5.039579e-03
##               expression_interpretation
## 1  Statistically_significant_but_low_FC
## 2  Statistically_significant_but_low_FC
## 3         Not_statistically_significant
## 4         Not_statistically_significant
## 5         Not_statistically_significant
## 6         Strongly_upregulated_in_Tumor
## 7         Strongly_upregulated_in_Tumor
## 8         Strongly_upregulated_in_Tumor
## 9  Statistically_significant_but_low_FC
## 10        Strongly_upregulated_in_Tumor
## 11        Strongly_upregulated_in_Tumor
## 12        Not_statistically_significant
## 13        Strongly_upregulated_in_Tumor
## 14        Strongly_upregulated_in_Tumor
## 15 Statistically_significant_but_low_FC
## 16 Statistically_significant_but_low_FC
## 17 Statistically_significant_but_low_FC
## 18 Statistically_significant_but_low_FC
## 19 Statistically_significant_but_low_FC
## 20        Not_statistically_significant
## 21        Not_statistically_significant
## 22        Not_statistically_significant
## 23        Not_statistically_significant
## 24 Statistically_significant_but_low_FC
## 25 Statistically_significant_but_low_FC
## 26      Strongly_downregulated_in_Tumor
## 27      Strongly_downregulated_in_Tumor
## 28      Strongly_downregulated_in_Tumor
## 29      Strongly_downregulated_in_Tumor
## 30 Statistically_significant_but_low_FC
## 31      Strongly_downregulated_in_Tumor
## 32      Strongly_downregulated_in_Tumor
## 33 Statistically_significant_but_low_FC

18 15. Kiểm tra gene thiếu trong panel mở rộng

18.1 Mục đích

Xác định gene nào trong danh sách quan tâm không tìm thấy trong bảng kết quả.

18.2 Cơ sở lý thuyết

Gene có thể thiếu vì nhiều lý do:

Không có trong annotation hiện tại.
Tên gene symbol khác phiên bản.
Gene bị lọc do count quá thấp.
Fusion không thể được đánh giá đúng chỉ bằng biểu hiện gene-level count.

18.3 Phương pháp nghiên cứu

So sánh ptc_extended_genes với gene symbol tìm thấy trong ptc_extended_result.

18.4 Code

found_genes <- unique(ptc_extended_result$gene_symbol)
missing_genes <- setdiff(ptc_extended_genes, found_genes)

missing_genes

## character(0)

length(missing_genes)

## [1] 0

19 16. Chấm điểm ưu tiên sơ bộ cho quy trình thiết kế vaccine

19.1 Mục đích

Tạo điểm ưu tiên ban đầu để xếp hạng gene trước khi chuyển sang các bước miễn dịch tin sinh học.

19.2 Cơ sở lý thuyết

Một gene ứng viên vaccine nên được ưu tiên nếu có:

Vai trò driver hoặc liên quan sinh học mạnh trong PTC.
Biểu hiện đủ cao ở mô u.
Tăng biểu hiện ở mô ung thư (Tumor) so với mô bình thường (Normal).
Có khả năng tạo peptide trình diện qua HLA.
An toàn: không biểu hiện quá rộng ở mô thiết yếu bình thường và không gây dị ứng/độc tính theo dự đoán.

Điểm trong script này là preliminary score, không phải tiêu chí chọn cuối cùng.

19.3 Phương pháp nghiên cứu

driver_score: dựa trên mức độ liên quan ung thư học của gene.
expression_score: dựa trên baseMean, log2FoldChange và padj.
vaccine_priority_score = driver_score + expression_score.

19.4 Code

ptc_extended_scored <- ptc_extended_result %>%
  mutate(
    driver_score = case_when(
      gene_symbol %in% c("BRAF", "NRAS", "HRAS", "KRAS", "RET", "NTRK1", "NTRK3", "EIF1AX") ~ 3,
      gene_symbol %in% c("ALK", "MET", "ROS1", "PPARG", "TERT", "TP53", "PIK3CA", "PTEN") ~ 2,
      TRUE ~ 1
    ),
    expression_score = case_when(
      baseMean >= 100 & log2FoldChange >= params$deg_log2fc_cutoff & padj < params$deg_padj_cutoff ~ 3,
      baseMean >= 100 & padj < params$deg_padj_cutoff ~ 2,
      baseMean >= 50 ~ 1,
      TRUE ~ 0
    ),
    vaccine_priority_score = driver_score + expression_score
  ) %>%
  arrange(desc(vaccine_priority_score), padj)

ptc_extended_scored

##                  gene_group            gene_id gene_symbol      gene_type
## 1          Fusion_or_kinase ENSG00000140538.16       NTRK3 protein_coding
## 2          Fusion_or_kinase ENSG00000165731.20         RET protein_coding
## 3          Fusion_or_kinase ENSG00000105976.16         MET protein_coding
## 4          Fusion_or_kinase ENSG00000171094.18         ALK protein_coding
## 5               Driver_MAPK ENSG00000174775.17        HRAS protein_coding
## 6               Driver_MAPK ENSG00000173674.11      EIF1AX protein_coding
## 7  Progression_or_high_risk ENSG00000147883.12      CDKN2B protein_coding
## 8  Progression_or_high_risk ENSG00000147889.18      CDKN2A protein_coding
## 9  Progression_or_high_risk ENSG00000141510.18        TP53 protein_coding
## 10         Fusion_or_kinase ENSG00000132170.23       PPARG protein_coding
## 11 Progression_or_high_risk ENSG00000171862.11        PTEN protein_coding
## 12              Driver_MAPK ENSG00000157764.14        BRAF protein_coding
## 13              Driver_MAPK ENSG00000133703.13        KRAS protein_coding
## 14              Driver_MAPK  ENSG00000213281.5        NRAS protein_coding
## 15 Progression_or_high_risk ENSG00000100697.15      DICER1 protein_coding
## 16          Thyroid_lineage ENSG00000136352.19      NKX2-1 protein_coding
## 17          Thyroid_lineage ENSG00000125618.17        PAX8 protein_coding
## 18          Thyroid_lineage ENSG00000211448.13        DIO2 protein_coding
## 19          Thyroid_lineage ENSG00000042832.12          TG protein_coding
## 20          Thyroid_lineage  ENSG00000105641.4      SLC5A5 protein_coding
## 21          Thyroid_lineage ENSG00000165409.18        TSHR protein_coding
## 22          Thyroid_lineage ENSG00000115705.22         TPO protein_coding
## 23 Progression_or_high_risk ENSG00000183765.23       CHEK2 protein_coding
## 24 Progression_or_high_risk ENSG00000142208.17        AKT1 protein_coding
## 25          Thyroid_lineage ENSG00000211452.11        DIO1 protein_coding
## 26          Thyroid_lineage  ENSG00000178919.9       FOXE1 protein_coding
## 27         Fusion_or_kinase ENSG00000198400.12       NTRK1 protein_coding
## 28 Progression_or_high_risk  ENSG00000121879.6      PIK3CA protein_coding
## 29         Fusion_or_kinase ENSG00000047936.11        ROS1 protein_coding
## 30 Progression_or_high_risk ENSG00000198793.13        MTOR protein_coding
## 31 Progression_or_high_risk ENSG00000170836.12       PPM1D protein_coding
## 32 Progression_or_high_risk ENSG00000164362.21        TERT protein_coding
## 33          Thyroid_lineage  ENSG00000197406.7        DIO3 protein_coding
##        baseMean log2FoldChange      lfcSE       stat       pvalue         padj
## 1  5.884629e+02     1.68478216 0.19750444  8.5303510 1.459043e-17 1.207624e-16
## 2  3.426115e+02     1.69653085 0.29327886  5.7847021 7.264094e-09 2.614657e-08
## 3  2.794368e+04     2.67762563 0.14948195 17.9127021 9.386618e-72 1.631519e-69
## 4  2.160376e+02     3.21414491 0.19049283 16.8727865 7.134968e-64 8.835417e-62
## 5  1.726473e+03     0.59891786 0.09187965  6.5185039 7.101208e-11 3.163853e-10
## 6  4.742313e+03    -0.20774299 0.04508766 -4.6075353 4.074700e-06 1.066467e-05
## 7  6.510688e+02     2.97114555 0.15200291 19.5466358 4.406375e-85 1.258784e-82
## 8  2.592684e+02     3.19851691 0.22022816 14.5236506 8.581562e-48 5.042848e-46
## 9  3.134114e+03     0.48302146 0.04868008  9.9223642 3.327788e-23 4.263618e-22
## 10 5.084104e+02    -0.87089889 0.13140071 -6.6278094 3.407049e-11 1.561614e-10
## 11 6.384400e+03    -0.12297798 0.04186449 -2.9375249 3.308436e-03 5.727041e-03
## 12 2.691720e+03     0.08687558 0.04308163  2.0165339 4.374417e-02 6.228229e-02
## 13 1.860714e+03     0.04447180 0.04718236  0.9425512 3.459105e-01 4.049551e-01
## 14 2.462547e+03    -0.03461003 0.05488486 -0.6305934 5.283064e-01 5.851301e-01
## 15 3.720577e+03    -0.63392825 0.06485250 -9.7749235 1.442662e-22 1.753375e-21
## 16 1.607623e+04     0.63920190 0.08042123  7.9481738 1.892810e-15 1.306850e-14
## 17 4.812762e+04    -0.60278570 0.08321456 -7.2437527 4.364370e-13 2.419776e-12
## 18 1.876872e+04    -1.32095629 0.18321300 -7.2099486 5.597300e-13 3.062751e-12
## 19 2.342569e+06    -1.12906181 0.16356168 -6.9029728 5.092540e-12 2.550513e-11
## 20 1.231463e+03    -2.34774688 0.42423284 -5.5340998 3.128308e-08 1.051226e-07
## 21 3.541045e+04    -0.43457743 0.07884485 -5.5118049 3.551726e-08 1.186225e-07
## 22 9.837561e+04    -1.72815412 0.31537893 -5.4796118 4.262600e-08 1.413177e-07
## 23 4.751797e+02     0.40784810 0.07974622  5.1143250 3.148648e-07 9.442672e-07
## 24 6.678602e+03     0.22934978 0.04757751  4.8205501 1.431629e-06 3.959623e-06
## 25 7.043099e+03    -1.28227402 0.33903895 -3.7820847 1.555204e-04 3.301863e-04
## 26 1.835582e+04    -0.30850514 0.10353100 -2.9798335 2.884051e-03 5.039579e-03
## 27 9.670243e+00     0.25791740 0.23050104  1.1189425 2.631647e-01 3.176361e-01
## 28 1.608636e+03     0.04411375 0.07527116  0.5860645 5.578322e-01 6.128677e-01
## 29 1.613612e+01     2.97071741 0.45977920  6.4611827 1.038878e-10 4.537504e-10
## 30 3.637936e+03     0.09121340 0.04768823  1.9127027 5.578612e-02 7.783351e-02
## 31 1.036964e+03    -0.08264470 0.04449921 -1.8572170 6.328029e-02 8.733086e-02
## 32 4.406779e+00    -0.56834370 0.52547519 -1.0815805 2.794390e-01 3.352594e-01
## 33 2.005290e+01    -1.75660612 0.31149529 -5.6392703 1.707723e-08 5.904484e-08
##               expression_interpretation driver_score expression_score
## 1         Strongly_upregulated_in_Tumor            3                3
## 2         Strongly_upregulated_in_Tumor            3                3
## 3         Strongly_upregulated_in_Tumor            2                3
## 4         Strongly_upregulated_in_Tumor            2                3
## 5  Statistically_significant_but_low_FC            3                2
## 6  Statistically_significant_but_low_FC            3                2
## 7         Strongly_upregulated_in_Tumor            1                3
## 8         Strongly_upregulated_in_Tumor            1                3
## 9  Statistically_significant_but_low_FC            2                2
## 10 Statistically_significant_but_low_FC            2                2
## 11 Statistically_significant_but_low_FC            2                2
## 12        Not_statistically_significant            3                1
## 13        Not_statistically_significant            3                1
## 14        Not_statistically_significant            3                1
## 15 Statistically_significant_but_low_FC            1                2
## 16 Statistically_significant_but_low_FC            1                2
## 17 Statistically_significant_but_low_FC            1                2
## 18      Strongly_downregulated_in_Tumor            1                2
## 19      Strongly_downregulated_in_Tumor            1                2
## 20      Strongly_downregulated_in_Tumor            1                2
## 21 Statistically_significant_but_low_FC            1                2
## 22      Strongly_downregulated_in_Tumor            1                2
## 23 Statistically_significant_but_low_FC            1                2
## 24 Statistically_significant_but_low_FC            1                2
## 25      Strongly_downregulated_in_Tumor            1                2
## 26 Statistically_significant_but_low_FC            1                2
## 27        Not_statistically_significant            3                0
## 28        Not_statistically_significant            2                1
## 29        Strongly_upregulated_in_Tumor            2                0
## 30        Not_statistically_significant            1                1
## 31        Not_statistically_significant            1                1
## 32        Not_statistically_significant            2                0
## 33      Strongly_downregulated_in_Tumor            1                0
##    vaccine_priority_score
## 1                       6
## 2                       6
## 3                       5
## 4                       5
## 5                       5
## 6                       5
## 7                       4
## 8                       4
## 9                       4
## 10                      4
## 11                      4
## 12                      4
## 13                      4
## 14                      4
## 15                      3
## 16                      3
## 17                      3
## 18                      3
## 19                      3
## 20                      3
## 21                      3
## 22                      3
## 23                      3
## 24                      3
## 25                      3
## 26                      3
## 27                      3
## 28                      3
## 29                      2
## 30                      2
## 31                      2
## 32                      2
## 33                      1

20 17. Top gene protein-coding tăng biểu hiện trong Tumor

20.1 Mục đích

Trích xuất các gene protein-coding tăng biểu hiện mạnh trong Tumor để làm nguồn ứng viên mở rộng ngoài panel PTC định trước.

20.2 Cơ sở lý thuyết

Trong thiết kế vaccine dựa trên transcriptome, gene protein-coding tăng biểu hiện có thể cung cấp peptide kháng nguyên. Tuy nhiên, cần thận trọng vì gene tăng biểu hiện không đồng nghĩa với tumor-specific antigen. Các gene này cần được lọc thêm theo HPA/GTEx, UniProt protein sequence và dự đoán MHC.

20.3 Phương pháp nghiên cứu

Lọc gene thỏa:

gene_type == "protein_coding".
padj < 0.05.
log2FoldChange >= 1.
baseMean >= 50.

20.4 Code

top_upregulated_genes <- res_annotated %>%
  filter(
    gene_type == "protein_coding",
    !is.na(padj),
    padj < params$deg_padj_cutoff,
    log2FoldChange >= params$deg_log2fc_cutoff,
    baseMean >= 50
  ) %>%
  arrange(desc(log2FoldChange))

head(top_upregulated_genes, 50)

##        baseMean log2FoldChange     lfcSE     stat        pvalue          padj
## 1     139.19914       7.611620 0.5684245 13.39073  6.850402e-41  2.811588e-39
## 2   10654.86873       7.528596 0.3878325 19.41198  6.113065e-84  1.695964e-81
## 3      86.24530       7.480511 0.5483813 13.64107  2.281743e-42  1.028674e-40
## 4     231.61762       7.448300 0.3743877 19.89462  4.530557e-88  1.502530e-85
## 5    3021.75435       7.150000 0.2644916 27.03299 6.054084e-161 5.590868e-157
## 6    5747.33432       7.140052 0.2338046 30.53854 8.027451e-205 2.316160e-200
## 7    1290.65258       7.098649 0.4453479 15.93956  3.367108e-57  3.093986e-55
## 8    3351.88885       7.068247 0.3098013 22.81542 3.223279e-115 3.720050e-112
## 9    1460.32112       6.902744 0.3890016 17.74477  1.891795e-70  3.049384e-68
## 10  30888.04667       6.811595 0.2773459 24.55993 3.387735e-133 9.774631e-130
## 11    851.74582       6.799043 0.3567015 19.06087  5.337479e-81  1.283352e-78
## 12    142.86850       6.693051 0.4233742 15.80883  2.704455e-56  2.386289e-54
## 13   4976.78953       6.653533 0.2977219 22.34814 1.258673e-110 1.100500e-107
## 14    243.21483       6.551546 0.3691888 17.74579  1.857727e-70  3.011292e-68
## 15    390.04069       6.546445 0.3880981 16.86802  7.734867e-64  9.537355e-62
## 16    531.99707       6.468710 0.3654269 17.70179  4.061938e-70  6.434761e-68
## 17    170.88659       6.374495 0.3644404 17.49118  1.672448e-68  2.474622e-66
## 18    137.22944       6.373977 0.3779327 16.86538  8.088518e-64  9.930979e-62
## 19    985.21699       6.355541 0.3408408 18.64665  1.344323e-77  2.810706e-75
## 20    192.66464       6.345654 0.3842796 16.51312  2.952218e-61  3.166556e-59
## 21    209.29242       6.296718 0.3809182 16.53037  2.217849e-61  2.414778e-59
## 22   4140.60816       6.292665 0.2882844 21.82798 1.258734e-105 8.858110e-103
## 23   1161.99626       6.290860 0.3445783 18.25669  1.830563e-74  3.429691e-72
## 24     66.15691       6.264599 0.3812760 16.43061  1.154862e-60  1.211682e-58
## 25    301.94513       6.184367 0.3181151 19.44066  3.496338e-84  9.794160e-82
## 26   2386.03189       6.171108 0.3323255 18.56947  5.676314e-77  1.161551e-74
## 27    656.98265       6.136619 0.3415887 17.96494  3.667047e-72  6.412442e-70
## 28   2649.58725       6.094615 0.2905359 20.97715  1.060641e-97  5.368891e-95
## 29    403.68670       6.077394 0.3731307 16.28758  1.209288e-59  1.224266e-57
## 30    598.25526       6.005846 0.3329362 18.03903  9.621922e-73  1.713712e-70
## 31   2585.87063       5.929989 0.2931497 20.22854  5.490991e-91  2.005463e-88
## 32   1565.68351       5.907715 0.3109027 19.00181  1.647539e-80  3.864751e-78
## 33   7249.60161       5.898601 0.3414102 17.27717  6.989765e-67  9.558090e-65
## 34     54.43943       5.856942 0.4954532 11.82138  3.026617e-32  7.432082e-31
## 35   1453.29869       5.832175 0.2538034 22.97911 7.541492e-117 9.066444e-114
## 36   6717.02856       5.819223 0.2403594 24.21051 1.724303e-129 4.145944e-126
## 37    300.32426       5.781535 0.3771580 15.32921  4.878785e-53  3.763839e-51
## 38    161.10307       5.772022 0.2780773 20.75690  1.062040e-95  4.642887e-93
## 39     50.56827       5.726005 0.4584062 12.49111  8.347566e-36  2.501063e-34
## 40   4925.33263       5.724080 0.2724347 21.01083  5.221419e-98  2.690243e-95
## 41 838667.28568       5.704339 0.2587789 22.04329 1.107761e-107 9.132067e-105
## 42    274.51671       5.695708 0.3397416 16.76482  4.413320e-63  5.327930e-61
## 43   3909.65470       5.672596 0.2099107 27.02386 7.750831e-161 5.590868e-157
## 44     74.61713       5.669866 0.2988115 18.97472  2.759752e-80  6.421541e-78
## 45    182.42916       5.666730 0.4726649 11.98890  4.062850e-33  1.057037e-31
## 46   2870.76749       5.612894 0.2267285 24.75602 2.670494e-135 8.561306e-132
## 47    556.70963       5.591001 0.2762490 20.23899  4.442063e-91  1.643165e-88
## 48   1110.82434       5.504737 0.3288548 16.73911  6.799614e-63  8.073633e-61
## 49 159102.76778       5.494196 0.2706550 20.29963  1.295588e-91  4.854753e-89
## 50    908.69584       5.490704 0.3275721 16.76182  4.642042e-63  5.580701e-61
##               gene_id   gene_id_clean gene_symbol      gene_type
## 1  ENSG00000137745.12 ENSG00000137745       MMP13 protein_coding
## 2  ENSG00000147256.12 ENSG00000147256    ARHGAP36 protein_coding
## 3   ENSG00000187714.7 ENSG00000187714     SLC18A3 protein_coding
## 4  ENSG00000197587.11 ENSG00000197587       DMBX1 protein_coding
## 5   ENSG00000259803.7 ENSG00000259803    SLC22A31 protein_coding
## 6  ENSG00000145864.14 ENSG00000145864      GABRB2 protein_coding
## 7  ENSG00000122852.15 ENSG00000122852      SFTPA1 protein_coding
## 8  ENSG00000187045.19 ENSG00000187045     TMPRSS6 protein_coding
## 9  ENSG00000167757.14 ENSG00000167757       KLK11 protein_coding
## 10  ENSG00000174460.4 ENSG00000174460     ZCCHC12 protein_coding
## 11  ENSG00000275896.6 ENSG00000275896       PRSS2 protein_coding
## 12 ENSG00000167755.15 ENSG00000167755        KLK6 protein_coding
## 13 ENSG00000173227.14 ENSG00000173227       SYT12 protein_coding
## 14 ENSG00000204983.14 ENSG00000204983       PRSS1 protein_coding
## 15  ENSG00000163207.7 ENSG00000163207         IVL protein_coding
## 16 ENSG00000155897.10 ENSG00000155897       ADCY8 protein_coding
## 17 ENSG00000231924.10 ENSG00000231924        PSG1 protein_coding
## 18 ENSG00000100341.12 ENSG00000100341      PNPLA5 protein_coding
## 19 ENSG00000134873.10 ENSG00000134873      CLDN10 protein_coding
## 20 ENSG00000087128.10 ENSG00000087128   TMPRSS11E protein_coding
## 21  ENSG00000170369.4 ENSG00000170369        CST2 protein_coding
## 22 ENSG00000137648.19 ENSG00000137648     TMPRSS4 protein_coding
## 23  ENSG00000204544.5 ENSG00000204544       MUC21 protein_coding
## 24  ENSG00000170367.5 ENSG00000170367        CST5 protein_coding
## 25  ENSG00000167105.8 ENSG00000167105      TMEM92 protein_coding
## 26 ENSG00000169035.12 ENSG00000169035        KLK7 protein_coding
## 27 ENSG00000172020.13 ENSG00000172020       GAP43 protein_coding
## 28 ENSG00000129451.12 ENSG00000129451       KLK10 protein_coding
## 29 ENSG00000134258.17 ENSG00000134258       VTCN1 protein_coding
## 30 ENSG00000163817.16 ENSG00000163817     SLC6A20 protein_coding
## 31 ENSG00000179913.11 ENSG00000179913      B3GNT3 protein_coding
## 32  ENSG00000188133.6 ENSG00000188133     TMEM215 protein_coding
## 33  ENSG00000164935.6 ENSG00000164935     DCSTAMP protein_coding
## 34  ENSG00000170373.8 ENSG00000170373        CST1 protein_coding
## 35 ENSG00000147041.12 ENSG00000147041       SYTL5 protein_coding
## 36  ENSG00000176532.4 ENSG00000176532       PRR15 protein_coding
## 37 ENSG00000185303.17 ENSG00000185303      SFTPA2 protein_coding
## 38  ENSG00000138308.6 ENSG00000138308    PLA2G12B protein_coding
## 39 ENSG00000172461.11 ENSG00000172461        FUT9 protein_coding
## 40 ENSG00000187122.17 ENSG00000187122       SLIT1 protein_coding
## 41 ENSG00000115414.21 ENSG00000115414         FN1 protein_coding
## 42 ENSG00000166897.16 ENSG00000166897       ELFN2 protein_coding
## 43 ENSG00000163898.10 ENSG00000163898        LIPH protein_coding
## 44  ENSG00000164400.6 ENSG00000164400        CSF2 protein_coding
## 45 ENSG00000124467.19 ENSG00000124467        PSG8 protein_coding
## 46 ENSG00000149948.14 ENSG00000149948       HMGA2 protein_coding
## 47  ENSG00000187823.3 ENSG00000187823        RTL4 protein_coding
## 48 ENSG00000109255.11 ENSG00000109255         NMU protein_coding
## 49 ENSG00000157765.13 ENSG00000157765     SLC34A2 protein_coding
## 50  ENSG00000072315.3 ENSG00000072315       TRPC5 protein_coding
##               direction
## 1  Upregulated_in_Tumor
## 2  Upregulated_in_Tumor
## 3  Upregulated_in_Tumor
## 4  Upregulated_in_Tumor
## 5  Upregulated_in_Tumor
## 6  Upregulated_in_Tumor
## 7  Upregulated_in_Tumor
## 8  Upregulated_in_Tumor
## 9  Upregulated_in_Tumor
## 10 Upregulated_in_Tumor
## 11 Upregulated_in_Tumor
## 12 Upregulated_in_Tumor
## 13 Upregulated_in_Tumor
## 14 Upregulated_in_Tumor
## 15 Upregulated_in_Tumor
## 16 Upregulated_in_Tumor
## 17 Upregulated_in_Tumor
## 18 Upregulated_in_Tumor
## 19 Upregulated_in_Tumor
## 20 Upregulated_in_Tumor
## 21 Upregulated_in_Tumor
## 22 Upregulated_in_Tumor
## 23 Upregulated_in_Tumor
## 24 Upregulated_in_Tumor
## 25 Upregulated_in_Tumor
## 26 Upregulated_in_Tumor
## 27 Upregulated_in_Tumor
## 28 Upregulated_in_Tumor
## 29 Upregulated_in_Tumor
## 30 Upregulated_in_Tumor
## 31 Upregulated_in_Tumor
## 32 Upregulated_in_Tumor
## 33 Upregulated_in_Tumor
## 34 Upregulated_in_Tumor
## 35 Upregulated_in_Tumor
## 36 Upregulated_in_Tumor
## 37 Upregulated_in_Tumor
## 38 Upregulated_in_Tumor
## 39 Upregulated_in_Tumor
## 40 Upregulated_in_Tumor
## 41 Upregulated_in_Tumor
## 42 Upregulated_in_Tumor
## 43 Upregulated_in_Tumor
## 44 Upregulated_in_Tumor
## 45 Upregulated_in_Tumor
## 46 Upregulated_in_Tumor
## 47 Upregulated_in_Tumor
## 48 Upregulated_in_Tumor
## 49 Upregulated_in_Tumor
## 50 Upregulated_in_Tumor

21 18. Top gene protein-coding giảm biểu hiện trong Tumor

21.1 Mục đích

Trích xuất gene protein-coding giảm biểu hiện trong Tumor để mô tả sinh học bệnh và kiểm tra các gene biệt hóa tuyến giáp bị mất biểu hiện.

21.2 Cơ sở lý thuyết

Gene giảm biểu hiện trong Tumor thường không phải nguồn kháng nguyên ưu tiên cho vaccine, nhưng có giá trị trong diễn giải sinh học PTC. Ví dụ, mất biểu hiện các gene biệt hóa tuyến giáp có thể liên quan đến giảm chức năng biệt hóa và tiến triển bệnh.

21.3 Phương pháp nghiên cứu

Lọc gene thỏa:

gene_type == "protein_coding".
padj < 0.05.
log2FoldChange <= -1.
baseMean >= 50.

21.4 Code

top_downregulated_genes <- res_annotated %>%
  filter(
    gene_type == "protein_coding",
    !is.na(padj),
    padj < params$deg_padj_cutoff,
    log2FoldChange <= -params$deg_log2fc_cutoff,
    baseMean >= 50
  ) %>%
  arrange(log2FoldChange)

head(top_downregulated_genes, 50)

##       baseMean log2FoldChange     lfcSE       stat       pvalue         padj
## 1    640.08393      -3.669924 0.4781618  -7.675066 1.653333e-14 1.047050e-13
## 2    180.21576      -3.439544 0.3838133  -8.961502 3.202847e-19 3.026916e-18
## 3    174.91070      -3.414607 0.2587846 -13.194786 9.402183e-40 3.626754e-38
## 4    316.81415      -3.277959 0.3137872 -10.446437 1.521339e-25 2.343577e-24
## 5    350.49927      -3.255289 0.2931823 -11.103293 1.209060e-28 2.311797e-27
## 6     65.61661      -3.241349 0.2886033 -11.231158 2.866973e-29 5.756491e-28
## 7   3062.49891      -3.226218 0.3728960  -8.651791 5.069730e-18 4.375619e-17
## 8    311.58458      -3.187993 0.3522922  -9.049285 1.439079e-19 1.393815e-18
## 9    240.20947      -3.179130 0.4259791  -7.463111 8.450306e-14 5.013709e-13
## 10   606.33236      -3.146754 0.1965460 -16.010263 1.083503e-57 1.011725e-55
## 11 11295.75574      -3.132265 0.3355518  -9.334669 1.013077e-20 1.073460e-19
## 12    53.91776      -3.119818 0.3164322  -9.859357 6.244823e-23 7.830590e-22
## 13   234.26973      -3.105607 0.2281302 -13.613310 3.337778e-42 1.486187e-40
## 14    95.20278      -3.057267 0.2241273 -13.640763 2.291496e-42 1.031459e-40
## 15    56.68241      -3.054084 0.3335049  -9.157537 5.309545e-20 5.339711e-19
## 16   559.06760      -3.036121 0.3143629  -9.658015 4.545882e-22 5.329636e-21
## 17   385.90387      -2.988129 0.1765688 -16.923312 3.029081e-64 3.783467e-62
## 18    54.44398      -2.957510 0.2711694 -10.906503 1.073049e-27 1.937465e-26
## 19   809.02491      -2.951601 0.2903045 -10.167260 2.776069e-24 3.858282e-23
## 20   142.84240      -2.910644 0.4622354  -6.296886 3.036835e-10 1.264931e-09
## 21   693.34957      -2.910438 0.2364955 -12.306525 8.354371e-35 2.386620e-33
## 22   116.27607      -2.888023 0.2495786 -11.571596 5.740452e-31 1.296004e-29
## 23   341.76782      -2.881251 0.6686900  -4.308799 1.641432e-05 3.979851e-05
## 24  1901.47025      -2.864094 0.2313649 -12.379120 3.390383e-35 9.871112e-34
## 25    62.01151      -2.825277 0.4725690  -5.978549 2.251336e-09 8.550456e-09
## 26  3356.00892      -2.802483 0.1808969 -15.492151 3.919565e-54 3.132721e-52
## 27   102.28354      -2.769642 0.3341112  -8.289580 1.136495e-16 8.751348e-16
## 28  1261.27529      -2.763253 0.3179783  -8.690070 3.622176e-18 3.159330e-17
## 29   128.35056      -2.752176 0.1826636 -15.066911 2.673236e-51 1.895108e-49
## 30  3347.01420      -2.749180 0.2342583 -11.735677 8.365426e-32 1.993127e-30
## 31    53.64868      -2.715544 0.2768703  -9.808001 1.040089e-22 1.279185e-21
## 32   461.53865      -2.712453 0.2522950 -10.751115 5.854954e-27 1.001975e-25
## 33    50.66118      -2.692956 0.2556302 -10.534575 5.985275e-26 9.525270e-25
## 34   166.36653      -2.682117 0.2415350 -11.104465 1.193302e-28 2.286211e-27
## 35   433.23819      -2.604618 0.2018573 -12.903260 4.314508e-38 1.501646e-36
## 36  1873.46633      -2.598860 0.3156751  -8.232706 1.830314e-16 1.381016e-15
## 37    82.12775      -2.587127 0.2896947  -8.930529 4.239761e-19 3.973038e-18
## 38    94.22522      -2.572962 0.3471523  -7.411625 1.247614e-13 7.278083e-13
## 39   519.00352      -2.561459 0.2743236  -9.337362 9.876429e-21 1.047664e-19
## 40    94.45686      -2.561336 0.3831729  -6.684543 2.316463e-11 1.081328e-10
## 41  1587.91595      -2.547745 0.3228642  -7.891073 2.995994e-15 2.030141e-14
## 42   364.85829      -2.523266 0.1935605 -13.036056 7.630328e-39 2.741692e-37
## 43   510.81467      -2.519122 0.2710503  -9.293928 1.486966e-20 1.560125e-19
## 44    56.50537      -2.495550 0.3955345  -6.309311 2.802805e-10 1.171171e-09
## 45   258.39886      -2.482653 0.2839335  -8.743784 2.254297e-18 1.999484e-17
## 46    61.24206      -2.462583 0.4231863  -5.819145 5.914937e-09 2.148605e-08
## 47   231.02438      -2.453409 0.1555536 -15.772110 4.840562e-56 4.219478e-54
## 48   158.22983      -2.438495 0.1741499 -14.002275 1.509598e-44 7.522702e-43
## 49 17842.89274      -2.419733 0.2188671 -11.055720 2.056794e-28 3.881274e-27
## 50  6399.53993      -2.416210 0.3360806  -7.189377 6.508759e-13 3.540004e-12
##               gene_id   gene_id_clean gene_symbol      gene_type
## 1  ENSG00000110680.13 ENSG00000110680       CALCA protein_coding
## 2  ENSG00000034971.17 ENSG00000034971        MYOC protein_coding
## 3  ENSG00000189056.14 ENSG00000189056        RELN protein_coding
## 4   ENSG00000143196.5 ENSG00000143196         DPT protein_coding
## 5  ENSG00000196616.14 ENSG00000196616       ADH1B protein_coding
## 6   ENSG00000173467.9 ENSG00000173467        AGR3 protein_coding
## 7   ENSG00000137077.8 ENSG00000137077       CCL21 protein_coding
## 8  ENSG00000168079.17 ENSG00000168079      SCARA5 protein_coding
## 9   ENSG00000182348.7 ENSG00000182348     ZNF804B protein_coding
## 10 ENSG00000138722.10 ENSG00000138722       MMRN1 protein_coding
## 11 ENSG00000160180.15 ENSG00000160180        TFF3 protein_coding
## 12  ENSG00000196666.6 ENSG00000196666     FAM180B protein_coding
## 13 ENSG00000147257.15 ENSG00000147257        GPC3 protein_coding
## 14  ENSG00000152580.8 ENSG00000152580      IGSF10 protein_coding
## 15 ENSG00000205221.12 ENSG00000205221         VIT protein_coding
## 16 ENSG00000168702.18 ENSG00000168702       LRP1B protein_coding
## 17  ENSG00000133800.9 ENSG00000133800       LYVE1 protein_coding
## 18  ENSG00000204065.3 ENSG00000204065      TCEAL5 protein_coding
## 19 ENSG00000101938.15 ENSG00000101938      CHRDL1 protein_coding
## 20 ENSG00000100721.11 ENSG00000100721       TCL1A protein_coding
## 21 ENSG00000198626.17 ENSG00000198626        RYR2 protein_coding
## 22 ENSG00000183775.11 ENSG00000183775      KCTD16 protein_coding
## 23  ENSG00000181617.6 ENSG00000181617       FDCSP protein_coding
## 24 ENSG00000074706.13 ENSG00000074706      IPCEF1 protein_coding
## 25 ENSG00000162897.16 ENSG00000162897       FCAMR protein_coding
## 26 ENSG00000153246.13 ENSG00000153246      PLA2R1 protein_coding
## 27 ENSG00000122145.15 ENSG00000122145       TBX22 protein_coding
## 28 ENSG00000130226.17 ENSG00000130226        DPP6 protein_coding
## 29 ENSG00000154258.17 ENSG00000154258       ABCA9 protein_coding
## 30  ENSG00000115112.8 ENSG00000115112     TFCP2L1 protein_coding
## 31  ENSG00000179674.4 ENSG00000179674       ARL14 protein_coding
## 32 ENSG00000138356.14 ENSG00000138356        AOX1 protein_coding
## 33  ENSG00000177363.5 ENSG00000177363     LRRN4CL protein_coding
## 34 ENSG00000149972.11 ENSG00000149972       CNTN5 protein_coding
## 35 ENSG00000150625.16 ENSG00000150625       GPM6A protein_coding
## 36 ENSG00000166589.13 ENSG00000166589       CDH16 protein_coding
## 37  ENSG00000128218.8 ENSG00000128218      VPREB3 protein_coding
## 38 ENSG00000145242.13 ENSG00000145242       EPHA5 protein_coding
## 39 ENSG00000164530.15 ENSG00000164530        PI16 protein_coding
## 40 ENSG00000104921.15 ENSG00000104921       FCER2 protein_coding
## 41 ENSG00000075035.10 ENSG00000075035       WSCD2 protein_coding
## 42 ENSG00000064270.13 ENSG00000064270      ATP2C2 protein_coding
## 43 ENSG00000141639.12 ENSG00000141639       MAPK4 protein_coding
## 44 ENSG00000163534.15 ENSG00000163534       FCRL1 protein_coding
## 45  ENSG00000165300.7 ENSG00000165300     SLITRK5 protein_coding
## 46 ENSG00000109205.16 ENSG00000109205        ODAM protein_coding
## 47  ENSG00000198774.5 ENSG00000198774      RASSF9 protein_coding
## 48 ENSG00000153823.19 ENSG00000153823        PID1 protein_coding
## 49  ENSG00000147606.9 ENSG00000147606     SLC26A7 protein_coding
## 50 ENSG00000205038.12 ENSG00000205038     PKHD1L1 protein_coding
##                 direction
## 1  Downregulated_in_Tumor
## 2  Downregulated_in_Tumor
## 3  Downregulated_in_Tumor
## 4  Downregulated_in_Tumor
## 5  Downregulated_in_Tumor
## 6  Downregulated_in_Tumor
## 7  Downregulated_in_Tumor
## 8  Downregulated_in_Tumor
## 9  Downregulated_in_Tumor
## 10 Downregulated_in_Tumor
## 11 Downregulated_in_Tumor
## 12 Downregulated_in_Tumor
## 13 Downregulated_in_Tumor
## 14 Downregulated_in_Tumor
## 15 Downregulated_in_Tumor
## 16 Downregulated_in_Tumor
## 17 Downregulated_in_Tumor
## 18 Downregulated_in_Tumor
## 19 Downregulated_in_Tumor
## 20 Downregulated_in_Tumor
## 21 Downregulated_in_Tumor
## 22 Downregulated_in_Tumor
## 23 Downregulated_in_Tumor
## 24 Downregulated_in_Tumor
## 25 Downregulated_in_Tumor
## 26 Downregulated_in_Tumor
## 27 Downregulated_in_Tumor
## 28 Downregulated_in_Tumor
## 29 Downregulated_in_Tumor
## 30 Downregulated_in_Tumor
## 31 Downregulated_in_Tumor
## 32 Downregulated_in_Tumor
## 33 Downregulated_in_Tumor
## 34 Downregulated_in_Tumor
## 35 Downregulated_in_Tumor
## 36 Downregulated_in_Tumor
## 37 Downregulated_in_Tumor
## 38 Downregulated_in_Tumor
## 39 Downregulated_in_Tumor
## 40 Downregulated_in_Tumor
## 41 Downregulated_in_Tumor
## 42 Downregulated_in_Tumor
## 43 Downregulated_in_Tumor
## 44 Downregulated_in_Tumor
## 45 Downregulated_in_Tumor
## 46 Downregulated_in_Tumor
## 47 Downregulated_in_Tumor
## 48 Downregulated_in_Tumor
## 49 Downregulated_in_Tumor
## 50 Downregulated_in_Tumor

22 19. Boxplot biểu hiện expanded PTC-relevant gene panel

22.1 Mục đích

So sánh mức biểu hiện của toàn bộ gene trong expanded PTC-relevant gene panel giữa nhóm Tumor và Normal. Biểu đồ này giúp nhận diện gene nào có biểu hiện cao hơn rõ rệt trong mô u và có thể được ưu tiên ở bước sàng lọc kháng nguyên.

22.2 Cơ sở lý thuyết

DESeq2 cung cấp normalized counts, tức count đã hiệu chỉnh theo khác biệt về độ sâu giải trình tự giữa các mẫu. Khi trực quan hóa biểu hiện gene, thường dùng biến đổi log2(normalized count + 1) để giảm lệch phải của dữ liệu RNA-seq và tránh lỗi log của giá trị bằng 0.

Trong bối cảnh thiết kế vaccine đa epitope, boxplot không chứng minh gene là neoantigen, nhưng giúp đánh giá tiêu chí tumor-enriched expression. Gene có biểu hiện cao trong Tumor, đặc biệt nếu đồng thời là driver/fusion/protein-coding, sẽ hợp lý hơn để đưa vào các bước kiểm tra protein sequence và dự đoán epitope.

22.3 Phương pháp nghiên cứu

Đầu vào: norm_counts, sample_info, gene_anno_clean, ptc_extended_genes.
Gắn gene symbol và gene group cho từng gene trong panel.
Chuyển ma trận biểu hiện sang long format.
Vẽ boxplot theo condition, chia facet theo từng gene_symbol.
Lưu hình ở định dạng .png và .pdf trong results/plots.

22.4 Code

# ============================================================
# 19. Boxplot expression of expanded PTC-relevant target genes
# ============================================================
# Mục đích:
#   - So sánh mức biểu hiện các gene trong expanded PTC panel
#     giữa Tumor và Normal.
#   - Dữ liệu sử dụng: normalized counts từ DESeq2.
#   - Trục Y: log2(normalized count + 1).

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(tibble)
  library(ggplot2)
})

# Tạo thư mục lưu hình nếu chưa có.
if (!dir.exists("results")) dir.create("results")
if (!dir.exists(file.path("results", "plots"))) {
  dir.create(file.path("results", "plots"), recursive = TRUE)
}

# ------------------------------------------------------------
# 19.1. Chuẩn bị bảng annotation gene group cho expanded panel
# ------------------------------------------------------------

ptc_extended_gene_group <- tibble(
  gene_symbol = ptc_extended_genes,
  gene_group = case_when(
    gene_symbol %in% ptc_driver_genes ~ "Driver_MAPK",
    gene_symbol %in% ptc_fusion_genes ~ "Fusion_or_kinase",
    gene_symbol %in% ptc_progression_genes ~ "Progression_or_high_risk",
    gene_symbol %in% thyroid_lineage_genes ~ "Thyroid_lineage",
    TRUE ~ "Other"
  )
)

# ------------------------------------------------------------
# 19.2. Lấy normalized counts cho các gene trong expanded panel
# ------------------------------------------------------------
# norm_counts là ma trận normalized counts từ:
# norm_counts <- counts(dds, normalized = TRUE)

norm_counts_df <- as.data.frame(norm_counts) %>%
  rownames_to_column("gene_id")

norm_counts_df$gene_id_clean <- sub("\\..*$", "", norm_counts_df$gene_id)

# Gắn gene symbol và gene type.
norm_counts_annotated <- norm_counts_df %>%
  left_join(
    gene_anno_clean %>%
      select(gene_id_clean, gene_symbol, gene_type),
    by = "gene_id_clean"
  )

# Lọc các gene thuộc expanded PTC panel.
ptc_extended_expression <- norm_counts_annotated %>%
  filter(gene_symbol %in% ptc_extended_genes) %>%
  left_join(
    ptc_extended_gene_group,
    by = "gene_symbol"
  )

# Kiểm tra các gene có trong dữ liệu biểu hiện.
unique(ptc_extended_expression$gene_symbol)

##  [1] "TG"     "ROS1"   "DICER1" "SLC5A5" "MET"    "TPO"    "PIK3CA" "PAX8"  
##  [9] "PPARG"  "KRAS"   "NKX2-1" "NTRK3"  "TP53"   "AKT1"   "CDKN2B" "CDKN2A"
## [17] "BRAF"   "TERT"   "TSHR"   "RET"    "PPM1D"  "ALK"    "PTEN"   "EIF1AX"
## [25] "HRAS"   "FOXE1"  "CHEK2"  "DIO3"   "NTRK1"  "MTOR"   "DIO2"   "DIO1"  
## [33] "NRAS"

# ------------------------------------------------------------
# 19.3. Chuyển dữ liệu sang long format để vẽ boxplot
# ------------------------------------------------------------

ptc_extended_expression_long <- ptc_extended_expression %>%
  select(
    gene_id,
    gene_symbol,
    gene_type,
    gene_group,
    all_of(sample_info$sample_id)
  ) %>%
  pivot_longer(
    cols = all_of(sample_info$sample_id),
    names_to = "sample_id",
    values_to = "normalized_count"
  ) %>%
  left_join(
    sample_info %>%
      select(sample_id, condition, patient_id),
    by = "sample_id"
  ) %>%
  filter(!is.na(condition))

# Sắp xếp gene theo nhóm sinh học để facet dễ đọc hơn.
gene_order_boxplot <- ptc_extended_expression_long %>%
  distinct(gene_group, gene_symbol) %>%
  arrange(gene_group, gene_symbol) %>%
  pull(gene_symbol)

ptc_extended_expression_long$gene_symbol <- factor(
  ptc_extended_expression_long$gene_symbol,
  levels = unique(gene_order_boxplot)
)

# ------------------------------------------------------------
# 19.4. Vẽ boxplot
# ------------------------------------------------------------

p_boxplot_extended <- ggplot(
  ptc_extended_expression_long,
  aes(
    x = condition,
    y = log2(normalized_count + 1),
    fill = condition
  )
) +
  geom_boxplot(
    outlier.shape = NA,
    alpha = 0.7
  ) +
  geom_jitter(
    width = 0.2,
    size = 0.6,
    alpha = 0.45
  ) +
  facet_wrap(
    ~ gene_symbol,
    scales = "free_y",
    ncol = 5
  ) +
  theme_bw() +
  labs(
    title = "Expression of expanded PTC-relevant genes in TCGA-THCA",
    subtitle = "Tumor vs Normal, DESeq2 normalized counts",
    x = "",
    y = "log2(normalized count + 1)"
  ) +
  theme(
    legend.position = "none",
    strip.text = element_text(face = "bold", size = 8),
    plot.title = element_text(face = "bold")
  )

p_boxplot_extended

ggsave(
  filename = file.path("results", "plots", "TCGA_THCA_PTC_extended_gene_panel_boxplot.png"),
  plot = p_boxplot_extended,
  width = 16,
  height = 12,
  dpi = 300
)

ggsave(
  filename = file.path("results", "plots", "TCGA_THCA_PTC_extended_gene_panel_boxplot.pdf"),
  plot = p_boxplot_extended,
  width = 16,
  height = 12
)

23 20. Volcano plot highlight expanded PTC-relevant gene panel

23.1 Mục đích

Vẽ volcano plot cho toàn bộ kết quả DESeq2, đồng thời highlight các gene thuộc expanded PTC panel. Biểu đồ này giúp xác định vị trí của các gene đích so với toàn bộ transcriptome về cả mức độ thay đổi biểu hiện và ý nghĩa thống kê.

23.2 Cơ sở lý thuyết

Volcano plot biểu diễn:

Trục X: log2FoldChange, thể hiện hướng và độ lớn thay đổi biểu hiện.
Trục Y: -log10(padj), thể hiện mức ý nghĩa thống kê sau hiệu chỉnh đa kiểm định.

Gene nằm phía bên phải và cao thường là gene tăng biểu hiện ở Tumor với ý nghĩa thống kê mạnh. Gene nằm phía bên trái và cao là gene giảm biểu hiện ở Tumor. Trong pipeline vaccine, các gene tăng biểu hiện có ý nghĩa thường được ưu tiên hơn, nhưng vẫn cần xác nhận thêm bằng mutation/fusion và dữ liệu protein.

23.3 Phương pháp nghiên cứu

Đầu vào: res_annotated, ptc_extended_genes.
Phân loại gene thành Upregulated, Downregulated, hoặc Not_significant.
Highlight gene trong expanded PTC panel bằng vòng tròn đen.
Gắn nhãn gene đích có padj < 0.05.
Lưu hình ở định dạng .png và .pdf.

23.4 Code

# ============================================================
# 20. Volcano plot highlighting expanded PTC-relevant genes
# ============================================================
# Mục đích:
#   - Vẽ volcano plot toàn bộ gene trong phân tích DESeq2.
#   - Highlight các gene thuộc expanded PTC panel.
#   - Đánh giá đồng thời log2FoldChange và adjusted p-value.

suppressPackageStartupMessages({
  library(dplyr)
  library(ggplot2)
})

# ------------------------------------------------------------
# 20.1. Chuẩn bị dữ liệu volcano
# ------------------------------------------------------------

volcano_extended_df <- res_annotated %>%
  mutate(
    padj_for_plot = ifelse(is.na(padj), NA_real_, pmax(padj, .Machine$double.xmin)),
    neg_log10_padj = -log10(padj_for_plot),
    DEG_status = case_when(
      !is.na(padj) & padj < params$deg_padj_cutoff & log2FoldChange >= params$deg_log2fc_cutoff  ~ "Upregulated",
      !is.na(padj) & padj < params$deg_padj_cutoff & log2FoldChange <= -params$deg_log2fc_cutoff ~ "Downregulated",
      TRUE ~ "Not_significant"
    ),
    is_extended_panel_gene = ifelse(
      gene_symbol %in% ptc_extended_genes,
      "Expanded_PTC_panel",
      "Other_gene"
    ),
    label_gene = ifelse(
      gene_symbol %in% ptc_extended_genes & !is.na(padj) & padj < params$deg_padj_cutoff,
      gene_symbol,
      NA_character_
    )
  )

# ------------------------------------------------------------
# 20.2. Volcano plot, highlight expanded panel
# ------------------------------------------------------------

p_volcano_extended <- ggplot(
  volcano_extended_df,
  aes(
    x = log2FoldChange,
    y = neg_log10_padj
  )
) +
  geom_point(
    aes(color = DEG_status),
    alpha = 0.45,
    size = 1
  ) +
  geom_point(
    data = volcano_extended_df %>%
      filter(is_extended_panel_gene == "Expanded_PTC_panel"),
    aes(x = log2FoldChange, y = neg_log10_padj),
    shape = 21,
    size = 3,
    stroke = 0.8,
    fill = NA,
    color = "black"
  ) +
  geom_vline(
    xintercept = c(-params$deg_log2fc_cutoff, params$deg_log2fc_cutoff),
    linetype = "dashed"
  ) +
  geom_hline(
    yintercept = -log10(params$deg_padj_cutoff),
    linetype = "dashed"
  ) +
  theme_bw() +
  labs(
    title = "Volcano plot: PTC Tumor vs Normal",
    subtitle = "Expanded PTC-relevant genes are circled",
    x = "log2 fold change",
    y = "-log10 adjusted p-value",
    color = "DEG status"
  ) +
  theme(
    plot.title = element_text(face = "bold")
  )

# ------------------------------------------------------------
# 20.3. Thêm nhãn gene nếu package ggrepel có sẵn
# ------------------------------------------------------------

if (requireNamespace("ggrepel", quietly = TRUE)) {
  p_volcano_extended <- p_volcano_extended +
    ggrepel::geom_text_repel(
      data = volcano_extended_df %>%
        filter(!is.na(label_gene)),
      aes(label = label_gene),
      size = 3,
      max.overlaps = 50,
      box.padding = 0.4,
      point.padding = 0.3
    )
} else {
  p_volcano_extended <- p_volcano_extended +
    geom_text(
      data = volcano_extended_df %>%
        filter(!is.na(label_gene)),
      aes(label = label_gene),
      size = 3,
      vjust = -0.6,
      check_overlap = TRUE
    )
}

p_volcano_extended

ggsave(
  filename = file.path("results", "plots", "TCGA_THCA_PTC_volcano_highlight_extended_panel.png"),
  plot = p_volcano_extended,
  width = 9,
  height = 7,
  dpi = 300
)

ggsave(
  filename = file.path("results", "plots", "TCGA_THCA_PTC_volcano_highlight_extended_panel.pdf"),
  plot = p_volcano_extended,
  width = 9,
  height = 7
)

24 21. Heatmap biểu hiện expanded PTC-relevant gene panel

24.1 Mục đích

Tạo heatmap biểu hiện cho các gene trong expanded PTC panel nhằm quan sát mô hình biểu hiện đa gene giữa Tumor và Normal. Heatmap đặc biệt hữu ích để đánh giá liệu panel gene có tạo được pattern phân biệt nhóm mô u và mô bình thường hay không.

24.2 Cơ sở lý thuyết

Heatmap sử dụng ma trận biểu hiện gene theo mẫu. Trong RNA-seq, dữ liệu thường được biến đổi log2(normalized count + 1). Khi dùng scale = "row", mỗi gene được chuẩn hóa theo hàng, tức biểu hiện của gene đó được so sánh tương đối giữa các mẫu. Cách này phù hợp để quan sát pattern tăng/giảm tương đối, nhưng không dùng để so sánh giá trị tuyệt đối giữa các gene khác nhau.

Trong nghiên cứu PTC, heatmap của gene driver, fusion/kinase, progression và thyroid lineage có thể giúp thấy các cụm sinh học như mất biệt hóa tuyến giáp hoặc hoạt hóa các trục tín hiệu ung thư.

24.3 Phương pháp nghiên cứu

Đầu vào: norm_counts, ptc_extended_result, sample_info.
Chọn các gene thuộc expanded panel có trong ma trận normalized counts.
Nếu một gene symbol có nhiều gene ID, chọn gene ID có padj nhỏ nhất.
Dùng log2(normalized count + 1).
Gắn annotation cột theo condition và annotation hàng theo gene_group.
Vẽ heatmap bằng pheatmap và lưu .png, .pdf.

24.4 Code

# ============================================================
# 21. Heatmap of expanded PTC-relevant gene panel
# ============================================================
# Mục đích:
#   - Vẽ heatmap biểu hiện các gene trong expanded PTC panel.
#   - Dữ liệu dùng log2(normalized count + 1).
#   - scale = "row" để so sánh tương đối biểu hiện của từng gene
#     giữa các mẫu Tumor và Normal.

suppressPackageStartupMessages({
  library(dplyr)
  library(tibble)
  library(pheatmap)
})

# ------------------------------------------------------------
# 21.1. Chọn gene trong expanded panel có trong norm_counts
# ------------------------------------------------------------

heatmap_gene_table <- ptc_extended_result %>%
  filter(gene_symbol %in% ptc_extended_genes) %>%
  filter(gene_id %in% rownames(norm_counts)) %>%
  arrange(gene_group, padj)

# Nếu một gene symbol có nhiều gene_id, ưu tiên dòng có padj nhỏ nhất.
heatmap_gene_table <- heatmap_gene_table %>%
  group_by(gene_symbol) %>%
  arrange(padj, .by_group = TRUE) %>%
  slice_head(n = 1) %>%
  ungroup()

heatmap_gene_ids <- heatmap_gene_table$gene_id

# Kiểm tra số gene dùng cho heatmap.
length(heatmap_gene_ids)

## [1] 33

heatmap_gene_table %>%
  select(gene_group, gene_symbol, gene_id, log2FoldChange, padj)

## # A tibble: 33 × 5
##    gene_group               gene_symbol gene_id          log2FoldChange     padj
##    <chr>                    <chr>       <chr>                     <dbl>    <dbl>
##  1 Progression_or_high_risk AKT1        ENSG00000142208…         0.229  3.96e- 6
##  2 Fusion_or_kinase         ALK         ENSG00000171094…         3.21   8.84e-62
##  3 Driver_MAPK              BRAF        ENSG00000157764…         0.0869 6.23e- 2
##  4 Progression_or_high_risk CDKN2A      ENSG00000147889…         3.20   5.04e-46
##  5 Progression_or_high_risk CDKN2B      ENSG00000147883…         2.97   1.26e-82
##  6 Progression_or_high_risk CHEK2       ENSG00000183765…         0.408  9.44e- 7
##  7 Progression_or_high_risk DICER1      ENSG00000100697…        -0.634  1.75e-21
##  8 Thyroid_lineage          DIO1        ENSG00000211452…        -1.28   3.30e- 4
##  9 Thyroid_lineage          DIO2        ENSG00000211448…        -1.32   3.06e-12
## 10 Thyroid_lineage          DIO3        ENSG00000197406…        -1.76   5.90e- 8
## # ℹ 23 more rows

# ------------------------------------------------------------
# 21.2. Tạo ma trận biểu hiện cho heatmap
# ------------------------------------------------------------

heatmap_mat <- norm_counts[heatmap_gene_ids, , drop = FALSE]

# Log2 transform.
heatmap_mat_log <- log2(heatmap_mat + 1)

# Đổi rownames thành gene symbol.
gene_symbol_map <- heatmap_gene_table$gene_symbol
names(gene_symbol_map) <- heatmap_gene_table$gene_id

rownames(heatmap_mat_log) <- gene_symbol_map[rownames(heatmap_mat_log)]

# Tránh lỗi nếu có gene symbol trùng.
rownames(heatmap_mat_log) <- make.unique(rownames(heatmap_mat_log))

# ------------------------------------------------------------
# 21.3. Annotation cho sample
# ------------------------------------------------------------

annotation_col <- sample_info %>%
  select(condition) %>%
  as.data.frame()

rownames(annotation_col) <- rownames(sample_info)

# Đảm bảo thứ tự annotation khớp với cột ma trận.
annotation_col <- annotation_col[colnames(heatmap_mat_log), , drop = FALSE]

# ------------------------------------------------------------
# 21.4. Annotation cho gene row
# ------------------------------------------------------------

annotation_row <- heatmap_gene_table %>%
  select(gene_symbol, gene_group, expression_interpretation) %>%
  as.data.frame()

rownames(annotation_row) <- make.unique(annotation_row$gene_symbol)
annotation_row$gene_symbol <- NULL

# Đảm bảo annotation row khớp với rownames ma trận.
annotation_row <- annotation_row[rownames(heatmap_mat_log), , drop = FALSE]

# ------------------------------------------------------------
# 21.5. Vẽ heatmap và lưu file
# ------------------------------------------------------------

pheatmap(
  heatmap_mat_log,
  scale = "row",
  annotation_col = annotation_col,
  annotation_row = annotation_row,
  show_colnames = FALSE,
  fontsize_row = 8,
  main = "Expanded PTC-relevant gene panel: Tumor vs Normal"
)

png(
  filename = file.path("results", "plots", "TCGA_THCA_PTC_extended_gene_panel_heatmap.png"),
  width = 1400,
  height = 1100,
  res = 150
)

pheatmap(
  heatmap_mat_log,
  scale = "row",
  annotation_col = annotation_col,
  annotation_row = annotation_row,
  show_colnames = FALSE,
  fontsize_row = 8,
  main = "Expanded PTC-relevant gene panel: Tumor vs Normal"
)

dev.off()

## quartz_off_screen 
##                 3

pdf(
  file = file.path("results", "plots", "TCGA_THCA_PTC_extended_gene_panel_heatmap.pdf"),
  width = 11,
  height = 9
)

pheatmap(
  heatmap_mat_log,
  scale = "row",
  annotation_col = annotation_col,
  annotation_row = annotation_row,
  show_colnames = FALSE,
  fontsize_row = 8,
  main = "Expanded PTC-relevant gene panel: Tumor vs Normal"
)

dev.off()

## quartz_off_screen 
##                 3

25 22. Heatmap sạch hơn cho expanded PTC-relevant gene panel

25.1 Mục đích

Tạo phiên bản heatmap dễ trình bày hơn so với heatmap đầy đủ. Phiên bản này dùng z-score theo từng gene và giới hạn giá trị ngoại lai trong khoảng -2.5 đến 2.5, giúp tránh hiện tượng một số mẫu cực trị chi phối thang màu.

25.2 Phương pháp nghiên cứu

Dùng cùng danh sách gene của expanded panel.
Biến đổi log2(normalized count + 1).
Chuẩn hóa z-score theo từng gene.
Giới hạn z-score trong khoảng [-2.5, 2.5].
Chỉ giữ annotation hàng là gene_group để hình dễ đọc hơn.

25.3 Code

# ============================================================
# 22. Cleaner heatmap for expanded PTC-relevant gene panel
# ============================================================

heatmap_gene_table_clean <- ptc_extended_result %>%
  filter(gene_symbol %in% ptc_extended_genes) %>%
  filter(gene_id %in% rownames(norm_counts)) %>%
  group_by(gene_symbol) %>%
  arrange(padj, .by_group = TRUE) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  arrange(gene_group, padj)

heatmap_gene_ids_clean <- heatmap_gene_table_clean$gene_id

if (length(heatmap_gene_ids_clean) >= 2) {
  heatmap_mat_clean <- norm_counts[heatmap_gene_ids_clean, , drop = FALSE]
  heatmap_mat_log_clean <- log2(heatmap_mat_clean + 1)
  
  gene_symbol_map_clean <- heatmap_gene_table_clean$gene_symbol
  names(gene_symbol_map_clean) <- heatmap_gene_table_clean$gene_id
  rownames(heatmap_mat_log_clean) <- make.unique(
    gene_symbol_map_clean[rownames(heatmap_mat_log_clean)]
  )
  
  annotation_col_clean <- sample_info %>%
    select(condition) %>%
    as.data.frame()
  rownames(annotation_col_clean) <- rownames(sample_info)
  annotation_col_clean <- annotation_col_clean[colnames(heatmap_mat_log_clean), , drop = FALSE]
  
  annotation_row_clean <- heatmap_gene_table_clean %>%
    select(gene_symbol, gene_group) %>%
    as.data.frame()
  rownames(annotation_row_clean) <- make.unique(annotation_row_clean$gene_symbol)
  annotation_row_clean$gene_symbol <- NULL
  annotation_row_clean <- annotation_row_clean[rownames(heatmap_mat_log_clean), , drop = FALSE]
  
  # Z-score theo từng gene.
  heatmap_mat_z <- t(scale(t(heatmap_mat_log_clean)))
  heatmap_mat_z[is.na(heatmap_mat_z)] <- 0
  heatmap_mat_z[heatmap_mat_z > 2.5] <- 2.5
  heatmap_mat_z[heatmap_mat_z < -2.5] <- -2.5
  
  pheatmap(
    heatmap_mat_z,
    annotation_col = annotation_col_clean,
    annotation_row = annotation_row_clean,
    show_colnames = FALSE,
    fontsize_row = 9,
    cluster_cols = TRUE,
    cluster_rows = TRUE,
    border_color = NA,
    main = "Expanded PTC-relevant gene panel"
  )
  
  png(
    filename = file.path("results", "plots", "TCGA_THCA_PTC_extended_gene_panel_heatmap_clean.png"),
    width = 1200,
    height = 1000,
    res = 150
  )
  
  pheatmap(
    heatmap_mat_z,
    annotation_col = annotation_col_clean,
    annotation_row = annotation_row_clean,
    show_colnames = FALSE,
    fontsize_row = 9,
    cluster_cols = TRUE,
    cluster_rows = TRUE,
    border_color = NA,
    main = "Expanded PTC-relevant gene panel"
  )
  
  dev.off()
  
  pdf(
    file = file.path("results", "plots", "TCGA_THCA_PTC_extended_gene_panel_heatmap_clean.pdf"),
    width = 11,
    height = 9
  )
  
  pheatmap(
    heatmap_mat_z,
    annotation_col = annotation_col_clean,
    annotation_row = annotation_row_clean,
    show_colnames = FALSE,
    fontsize_row = 9,
    cluster_cols = TRUE,
    cluster_rows = TRUE,
    border_color = NA,
    main = "Expanded PTC-relevant gene panel"
  )
  
  dev.off()
} else {
  message("Không đủ gene để vẽ clean heatmap.")
}

## quartz_off_screen 
##                 3

#23: Lựa chọn dựa vào thang điểm các yếu tố lựa chọn GENE phù hợp Dựa vào các tiêu chí sau CGC_score COSMIC Cancer Gene Census là cơ sở dữ liệu chuyên phân loại các gene liên quan ung thư. CGC chia gene thành Tier 1 và Tier 2; Tier 1 yêu cầu bằng chứng về đột biến soma trong ung thư và bằng chứng chức năng liên quan quá trình sinh ung thư. OncoKB_score OncoKB đánh giá mức độ bằng chứng lâm sàng/actionability của biến đổi ung thư. OncoKB có các level như Level 1, 2, 3A, 3B, 4, R1, R2 để phản ánh mức độ bằng chứng điều trị hoặc kháng trị. CIViC là cơ sở dữ liệu bằng chứng lâm sàng và y văn cho biến thể ung thư. Nó phân loại mức bằng chứng từ A đến E, trong đó mức cao hơn thường dựa trên guideline, approval hoặc dữ liệu lâm sàng mạnh hơn. PTC_frequency_score Đây là điểm dựa trên tần suất biến đổi trong PTC. Một gene có thể rất quan trọng trong ung thư nói chung, nhưng nếu rất hiếm hoặc không liên quan nhiều đến PTC thì mức ưu tiên cho đề tài PTC giảm. variant_relevance_score Đây là điểm đánh giá biến thể đó có phù hợp với vaccine hay không.

gene_evidence_table <- tibble::tribble(
  ~gene_symbol, ~CGC_tier, ~OncoKB_level, ~CIViC_level, ~ptc_mutation_frequency, ~variant_relevance,
  "BRAF",  "Tier1", "Level1", "A", 0.60, "canonical_driver_hotspot",
  "NRAS",  "Tier1", "Level3", "B", 0.08, "canonical_driver_hotspot",
  "HRAS",  "Tier1", "Level3", "B", 0.03, "canonical_driver_hotspot",
  "KRAS",  "Tier1", "Level3", "B", 0.01, "canonical_driver_hotspot",
  "RET",   "Tier1", "Level1", "A", 0.07, "fusion_driver",
  "NTRK1", "Tier1", "Level1", "A", 0.02, "fusion_driver",
  "NTRK3", "Tier1", "Level1", "A", 0.01, "fusion_driver",
  "TERT",  "Tier1", "Level3", "B", 0.10, "prognostic_promoter",
  "TP53",  "Tier1", "Level3", "B", 0.02, "late_progression",
  "TG",    NA,      NA,       NA,  0.00, "thyroid_expression_marker"
)
library(dplyr)
library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

gene_evidence_scored <- gene_evidence_table %>%
  mutate(
    cgc_score = case_when(
      CGC_tier == "Tier1" ~ 3,
      CGC_tier == "Tier2" ~ 2,
      TRUE ~ 0
    ),
    
    oncokb_score = case_when(
      OncoKB_level %in% c("Level1", "Level2") ~ 3,
      OncoKB_level %in% c("Level3A", "Level3B", "Level3") ~ 2,
      OncoKB_level == "Level4" ~ 1,
      TRUE ~ 0
    ),
    
    civic_score = case_when(
      CIViC_level %in% c("A", "B") ~ 3,
      CIViC_level == "C" ~ 2,
      CIViC_level %in% c("D", "E") ~ 1,
      TRUE ~ 0
    ),
    
    ptc_frequency_score = rescale(
      log10(ptc_mutation_frequency + 0.001),
      to = c(0, 3)
    ),
    
    variant_relevance_score = case_when(
      variant_relevance %in% c(
        "canonical_driver_hotspot",
        "fusion_driver"
      ) ~ 3,
      variant_relevance == "prognostic_promoter" ~ 2,
      variant_relevance == "late_progression" ~ 1,
      TRUE ~ 0
    ),
    
    driver_evidence_score =
      0.25 * cgc_score +
      0.25 * oncokb_score +
      0.25 * civic_score +
      0.15 * ptc_frequency_score +
      0.1 * variant_relevance_score
  ) %>%
  arrange(desc(driver_evidence_score))

ptc_extended_scored_objective <- ptc_extended_result %>%
  left_join(gene_evidence_scored, by = "gene_symbol") %>%
  mutate(
    expression_score = case_when(
      baseMean >= 100 &
        log2FoldChange >= params$deg_log2fc_cutoff &
        padj < params$deg_padj_cutoff ~ 3,
      
      baseMean >= 100 &
        padj < params$deg_padj_cutoff ~ 2,
      
      baseMean >= 50 ~ 1,
      
      TRUE ~ 0
    ),
    
    vaccine_priority_score =
      driver_evidence_score + expression_score
  ) %>%
  arrange(desc(vaccine_priority_score), padj)

ptc_extended_scored_objective

##                  gene_group            gene_id gene_symbol      gene_type
## 1          Fusion_or_kinase ENSG00000165731.20         RET protein_coding
## 2          Fusion_or_kinase ENSG00000140538.16       NTRK3 protein_coding
## 3               Driver_MAPK ENSG00000174775.17        HRAS protein_coding
## 4  Progression_or_high_risk ENSG00000141510.18        TP53 protein_coding
## 5               Driver_MAPK ENSG00000157764.14        BRAF protein_coding
## 6               Driver_MAPK  ENSG00000213281.5        NRAS protein_coding
## 7               Driver_MAPK ENSG00000133703.13        KRAS protein_coding
## 8          Fusion_or_kinase ENSG00000198400.12       NTRK1 protein_coding
## 9  Progression_or_high_risk ENSG00000164362.21        TERT protein_coding
## 10          Thyroid_lineage ENSG00000042832.12          TG protein_coding
## 11 Progression_or_high_risk ENSG00000147883.12      CDKN2B protein_coding
## 12         Fusion_or_kinase ENSG00000105976.16         MET protein_coding
## 13         Fusion_or_kinase ENSG00000171094.18         ALK protein_coding
## 14 Progression_or_high_risk ENSG00000147889.18      CDKN2A protein_coding
## 15 Progression_or_high_risk ENSG00000100697.15      DICER1 protein_coding
## 16          Thyroid_lineage ENSG00000136352.19      NKX2-1 protein_coding
## 17          Thyroid_lineage ENSG00000125618.17        PAX8 protein_coding
## 18          Thyroid_lineage ENSG00000211448.13        DIO2 protein_coding
## 19         Fusion_or_kinase ENSG00000132170.23       PPARG protein_coding
## 20         Fusion_or_kinase ENSG00000047936.11        ROS1 protein_coding
## 21          Thyroid_lineage  ENSG00000197406.7        DIO3 protein_coding
## 22          Thyroid_lineage  ENSG00000105641.4      SLC5A5 protein_coding
## 23          Thyroid_lineage ENSG00000165409.18        TSHR protein_coding
## 24          Thyroid_lineage ENSG00000115705.22         TPO protein_coding
## 25 Progression_or_high_risk ENSG00000183765.23       CHEK2 protein_coding
## 26 Progression_or_high_risk ENSG00000142208.17        AKT1 protein_coding
## 27              Driver_MAPK ENSG00000173674.11      EIF1AX protein_coding
## 28          Thyroid_lineage ENSG00000211452.11        DIO1 protein_coding
## 29          Thyroid_lineage  ENSG00000178919.9       FOXE1 protein_coding
## 30 Progression_or_high_risk ENSG00000171862.11        PTEN protein_coding
## 31 Progression_or_high_risk ENSG00000198793.13        MTOR protein_coding
## 32 Progression_or_high_risk ENSG00000170836.12       PPM1D protein_coding
## 33 Progression_or_high_risk  ENSG00000121879.6      PIK3CA protein_coding
##        baseMean log2FoldChange      lfcSE       stat       pvalue         padj
## 1  3.426115e+02     1.69653085 0.29327886  5.7847021 7.264094e-09 2.614657e-08
## 2  5.884629e+02     1.68478216 0.19750444  8.5303510 1.459043e-17 1.207624e-16
## 3  1.726473e+03     0.59891786 0.09187965  6.5185039 7.101208e-11 3.163853e-10
## 4  3.134114e+03     0.48302146 0.04868008  9.9223642 3.327788e-23 4.263618e-22
## 5  2.691720e+03     0.08687558 0.04308163  2.0165339 4.374417e-02 6.228229e-02
## 6  2.462547e+03    -0.03461003 0.05488486 -0.6305934 5.283064e-01 5.851301e-01
## 7  1.860714e+03     0.04447180 0.04718236  0.9425512 3.459105e-01 4.049551e-01
## 8  9.670243e+00     0.25791740 0.23050104  1.1189425 2.631647e-01 3.176361e-01
## 9  4.406779e+00    -0.56834370 0.52547519 -1.0815805 2.794390e-01 3.352594e-01
## 10 2.342569e+06    -1.12906181 0.16356168 -6.9029728 5.092540e-12 2.550513e-11
## 11 6.510688e+02     2.97114555 0.15200291 19.5466358 4.406375e-85 1.258784e-82
## 12 2.794368e+04     2.67762563 0.14948195 17.9127021 9.386618e-72 1.631519e-69
## 13 2.160376e+02     3.21414491 0.19049283 16.8727865 7.134968e-64 8.835417e-62
## 14 2.592684e+02     3.19851691 0.22022816 14.5236506 8.581562e-48 5.042848e-46
## 15 3.720577e+03    -0.63392825 0.06485250 -9.7749235 1.442662e-22 1.753375e-21
## 16 1.607623e+04     0.63920190 0.08042123  7.9481738 1.892810e-15 1.306850e-14
## 17 4.812762e+04    -0.60278570 0.08321456 -7.2437527 4.364370e-13 2.419776e-12
## 18 1.876872e+04    -1.32095629 0.18321300 -7.2099486 5.597300e-13 3.062751e-12
## 19 5.084104e+02    -0.87089889 0.13140071 -6.6278094 3.407049e-11 1.561614e-10
## 20 1.613612e+01     2.97071741 0.45977920  6.4611827 1.038878e-10 4.537504e-10
## 21 2.005290e+01    -1.75660612 0.31149529 -5.6392703 1.707723e-08 5.904484e-08
## 22 1.231463e+03    -2.34774688 0.42423284 -5.5340998 3.128308e-08 1.051226e-07
## 23 3.541045e+04    -0.43457743 0.07884485 -5.5118049 3.551726e-08 1.186225e-07
## 24 9.837561e+04    -1.72815412 0.31537893 -5.4796118 4.262600e-08 1.413177e-07
## 25 4.751797e+02     0.40784810 0.07974622  5.1143250 3.148648e-07 9.442672e-07
## 26 6.678602e+03     0.22934978 0.04757751  4.8205501 1.431629e-06 3.959623e-06
## 27 4.742313e+03    -0.20774299 0.04508766 -4.6075353 4.074700e-06 1.066467e-05
## 28 7.043099e+03    -1.28227402 0.33903895 -3.7820847 1.555204e-04 3.301863e-04
## 29 1.835582e+04    -0.30850514 0.10353100 -2.9798335 2.884051e-03 5.039579e-03
## 30 6.384400e+03    -0.12297798 0.04186449 -2.9375249 3.308436e-03 5.727041e-03
## 31 3.637936e+03     0.09121340 0.04768823  1.9127027 5.578612e-02 7.783351e-02
## 32 1.036964e+03    -0.08264470 0.04449921 -1.8572170 6.328029e-02 8.733086e-02
## 33 1.608636e+03     0.04411375 0.07527116  0.5860645 5.578322e-01 6.128677e-01
##               expression_interpretation CGC_tier OncoKB_level CIViC_level
## 1         Strongly_upregulated_in_Tumor    Tier1       Level1           A
## 2         Strongly_upregulated_in_Tumor    Tier1       Level1           A
## 3  Statistically_significant_but_low_FC    Tier1       Level3           B
## 4  Statistically_significant_but_low_FC    Tier1       Level3           B
## 5         Not_statistically_significant    Tier1       Level1           A
## 6         Not_statistically_significant    Tier1       Level3           B
## 7         Not_statistically_significant    Tier1       Level3           B
## 8         Not_statistically_significant    Tier1       Level1           A
## 9         Not_statistically_significant    Tier1       Level3           B
## 10      Strongly_downregulated_in_Tumor     <NA>         <NA>        <NA>
## 11        Strongly_upregulated_in_Tumor     <NA>         <NA>        <NA>
## 12        Strongly_upregulated_in_Tumor     <NA>         <NA>        <NA>
## 13        Strongly_upregulated_in_Tumor     <NA>         <NA>        <NA>
## 14        Strongly_upregulated_in_Tumor     <NA>         <NA>        <NA>
## 15 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 16 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 17 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 18      Strongly_downregulated_in_Tumor     <NA>         <NA>        <NA>
## 19 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 20        Strongly_upregulated_in_Tumor     <NA>         <NA>        <NA>
## 21      Strongly_downregulated_in_Tumor     <NA>         <NA>        <NA>
## 22      Strongly_downregulated_in_Tumor     <NA>         <NA>        <NA>
## 23 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 24      Strongly_downregulated_in_Tumor     <NA>         <NA>        <NA>
## 25 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 26 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 27 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 28      Strongly_downregulated_in_Tumor     <NA>         <NA>        <NA>
## 29 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 30 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 31        Not_statistically_significant     <NA>         <NA>        <NA>
## 32        Not_statistically_significant     <NA>         <NA>        <NA>
## 33        Not_statistically_significant     <NA>         <NA>        <NA>
##    ptc_mutation_frequency         variant_relevance cgc_score oncokb_score
## 1                    0.07             fusion_driver         3            3
## 2                    0.01             fusion_driver         3            3
## 3                    0.03  canonical_driver_hotspot         3            2
## 4                    0.02          late_progression         3            2
## 5                    0.60  canonical_driver_hotspot         3            3
## 6                    0.08  canonical_driver_hotspot         3            2
## 7                    0.01  canonical_driver_hotspot         3            2
## 8                    0.02             fusion_driver         3            3
## 9                    0.10       prognostic_promoter         3            2
## 10                   0.00 thyroid_expression_marker         0            0
## 11                     NA                      <NA>        NA           NA
## 12                     NA                      <NA>        NA           NA
## 13                     NA                      <NA>        NA           NA
## 14                     NA                      <NA>        NA           NA
## 15                     NA                      <NA>        NA           NA
## 16                     NA                      <NA>        NA           NA
## 17                     NA                      <NA>        NA           NA
## 18                     NA                      <NA>        NA           NA
## 19                     NA                      <NA>        NA           NA
## 20                     NA                      <NA>        NA           NA
## 21                     NA                      <NA>        NA           NA
## 22                     NA                      <NA>        NA           NA
## 23                     NA                      <NA>        NA           NA
## 24                     NA                      <NA>        NA           NA
## 25                     NA                      <NA>        NA           NA
## 26                     NA                      <NA>        NA           NA
## 27                     NA                      <NA>        NA           NA
## 28                     NA                      <NA>        NA           NA
## 29                     NA                      <NA>        NA           NA
## 30                     NA                      <NA>        NA           NA
## 31                     NA                      <NA>        NA           NA
## 32                     NA                      <NA>        NA           NA
## 33                     NA                      <NA>        NA           NA
##    civic_score ptc_frequency_score variant_relevance_score
## 1            3            1.998570                       3
## 2            3            1.124260                       3
## 3            3            1.610035                       3
## 4            3            1.427433                       1
## 5            3            3.000000                       3
## 6            3            2.060350                       3
## 7            3            1.124260                       3
## 8            3            1.427433                       3
## 9            3            2.163813                       2
## 10           0            0.000000                       0
## 11          NA                  NA                      NA
## 12          NA                  NA                      NA
## 13          NA                  NA                      NA
## 14          NA                  NA                      NA
## 15          NA                  NA                      NA
## 16          NA                  NA                      NA
## 17          NA                  NA                      NA
## 18          NA                  NA                      NA
## 19          NA                  NA                      NA
## 20          NA                  NA                      NA
## 21          NA                  NA                      NA
## 22          NA                  NA                      NA
## 23          NA                  NA                      NA
## 24          NA                  NA                      NA
## 25          NA                  NA                      NA
## 26          NA                  NA                      NA
## 27          NA                  NA                      NA
## 28          NA                  NA                      NA
## 29          NA                  NA                      NA
## 30          NA                  NA                      NA
## 31          NA                  NA                      NA
## 32          NA                  NA                      NA
## 33          NA                  NA                      NA
##    driver_evidence_score expression_score vaccine_priority_score
## 1               2.849785                3               5.849785
## 2               2.718639                3               5.718639
## 3               2.541505                2               4.541505
## 4               2.314115                2               4.314115
## 5               3.000000                1               4.000000
## 6               2.609053                1               3.609053
## 7               2.468639                1               3.468639
## 8               2.764115                0               2.764115
## 9               2.524572                0               2.524572
## 10              0.000000                2               2.000000
## 11                    NA                3                     NA
## 12                    NA                3                     NA
## 13                    NA                3                     NA
## 14                    NA                3                     NA
## 15                    NA                2                     NA
## 16                    NA                2                     NA
## 17                    NA                2                     NA
## 18                    NA                2                     NA
## 19                    NA                2                     NA
## 20                    NA                0                     NA
## 21                    NA                0                     NA
## 22                    NA                2                     NA
## 23                    NA                2                     NA
## 24                    NA                2                     NA
## 25                    NA                2                     NA
## 26                    NA                2                     NA
## 27                    NA                2                     NA
## 28                    NA                2                     NA
## 29                    NA                2                     NA
## 30                    NA                2                     NA
## 31                    NA                1                     NA
## 32                    NA                1                     NA
## 33                    NA                1                     NA

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(tibble)
  library(ggplot2)
  library(scales)
  library(pheatmap)
})

if (!dir.exists("results")) dir.create("results")
if (!dir.exists("results/plots")) dir.create("results/plots", recursive = TRUE)
if (!dir.exists("results/tables")) dir.create("results/tables", recursive = TRUE)

deg_log2fc_cutoff <- if (exists("params") && "deg_log2fc_cutoff" %in% names(params)) params$deg_log2fc_cutoff else 1
deg_padj_cutoff <- if (exists("params") && "deg_padj_cutoff" %in% names(params)) params$deg_padj_cutoff else 0.05

gene_evidence_scored <- gene_evidence_table %>%
  mutate(
    cgc_score = case_when(
      CGC_tier == "Tier1" ~ 3,
      CGC_tier == "Tier2" ~ 2,
      TRUE ~ 0
    ),
    oncokb_score = case_when(
      OncoKB_level %in% c("Level1", "Level2") ~ 3,
      OncoKB_level %in% c("Level3A", "Level3B", "Level3") ~ 2,
      OncoKB_level == "Level4" ~ 1,
      TRUE ~ 0
    ),
    civic_score = case_when(
      CIViC_level %in% c("A", "B") ~ 3,
      CIViC_level == "C" ~ 2,
      CIViC_level %in% c("D", "E") ~ 1,
      TRUE ~ 0
    ),
    ptc_frequency_score = scales::rescale(
      log10(ptc_mutation_frequency + 0.001),
      to = c(0, 3)
    ),
    variant_relevance_score = case_when(
      variant_relevance %in% c("canonical_driver_hotspot", "fusion_driver") ~ 3,
      variant_relevance == "prognostic_promoter" ~ 2,
      variant_relevance == "late_progression" ~ 1,
      TRUE ~ 0
    ),
    driver_evidence_score =
      0.25 * cgc_score +
      0.25 * oncokb_score +
      0.25 * civic_score +
      0.15 * ptc_frequency_score +
      0.10 * variant_relevance_score
  ) %>%
  arrange(desc(driver_evidence_score))

ptc_extended_scored_objective <- ptc_extended_result %>%
  left_join(gene_evidence_scored, by = "gene_symbol") %>%
  mutate(
    driver_evidence_score = ifelse(is.na(driver_evidence_score), 0, driver_evidence_score),
    expression_score = case_when(
      baseMean >= 100 & log2FoldChange >= deg_log2fc_cutoff & padj < deg_padj_cutoff ~ 3,
      baseMean >= 100 & padj < deg_padj_cutoff ~ 2,
      baseMean >= 50 ~ 1,
      TRUE ~ 0
    ),
    vaccine_priority_score = driver_evidence_score + expression_score
  ) %>%
  arrange(desc(vaccine_priority_score), padj)

write.csv(
  ptc_extended_scored_objective,
  file = "results/tables/TCGA_THCA_PTC_extended_gene_panel_vaccine_priority_objective.csv",
  row.names = FALSE
)

top_upregulated_genes <- res_annotated %>%
  filter(
    gene_type == "protein_coding",
    !is.na(padj),
    padj < deg_padj_cutoff,
    log2FoldChange >= deg_log2fc_cutoff,
    baseMean >= 50
  ) %>%
  arrange(desc(log2FoldChange))

write.csv(
  top_upregulated_genes,
  file = "results/tables/TCGA_THCA_PTC_top_upregulated_protein_coding_genes.csv",
  row.names = FALSE
)

top_downregulated_genes <- res_annotated %>%
  filter(
    gene_type == "protein_coding",
    !is.na(padj),
    padj < deg_padj_cutoff,
    log2FoldChange <= -deg_log2fc_cutoff,
    baseMean >= 50
  ) %>%
  arrange(log2FoldChange)

write.csv(
  top_downregulated_genes,
  file = "results/tables/TCGA_THCA_PTC_top_downregulated_protein_coding_genes.csv",
  row.names = FALSE
)

p_top_upregulated <- top_upregulated_genes %>%
  slice_head(n = 30) %>%
  mutate(gene_symbol = factor(gene_symbol, levels = rev(unique(gene_symbol)))) %>%
  ggplot(aes(x = gene_symbol, y = log2FoldChange)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  labs(
    title = "Top upregulated protein-coding genes in PTC Tumor",
    x = "",
    y = "log2 fold change"
  ) +
  theme(
    plot.title = element_text(face = "bold")
  )

ggsave(
  filename = "results/plots/TCGA_THCA_PTC_top_upregulated_protein_coding_genes.png",
  plot = p_top_upregulated,
  width = 8,
  height = 7,
  dpi = 300
)

ggsave(
  filename = "results/plots/TCGA_THCA_PTC_top_upregulated_protein_coding_genes.pdf",
  plot = p_top_upregulated,
  width = 8,
  height = 7
)

p_top_downregulated <- top_downregulated_genes %>%
  slice_head(n = 30) %>%
  mutate(gene_symbol = factor(gene_symbol, levels = unique(gene_symbol))) %>%
  ggplot(aes(x = gene_symbol, y = log2FoldChange)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  labs(
    title = "Top downregulated protein-coding genes in PTC Tumor",
    x = "",
    y = "log2 fold change"
  ) +
  theme(
    plot.title = element_text(face = "bold")
  )

ggsave(
  filename = "results/plots/TCGA_THCA_PTC_top_downregulated_protein_coding_genes.png",
  plot = p_top_downregulated,
  width = 8,
  height = 7,
  dpi = 300
)

ggsave(
  filename = "results/plots/TCGA_THCA_PTC_top_downregulated_protein_coding_genes.pdf",
  plot = p_top_downregulated,
  width = 8,
  height = 7
)

ptc_extended_gene_group <- tibble(
  gene_symbol = ptc_extended_genes
) %>%
  mutate(
    gene_group = case_when(
      gene_symbol %in% ptc_driver_genes ~ "Driver_MAPK",
      gene_symbol %in% ptc_fusion_genes ~ "Fusion_or_kinase",
      gene_symbol %in% ptc_progression_genes ~ "Progression_or_high_risk",
      gene_symbol %in% thyroid_lineage_genes ~ "Thyroid_lineage",
      TRUE ~ "Other"
    )
  )

norm_counts_df <- as.data.frame(norm_counts) %>%
  rownames_to_column("gene_id") %>%
  mutate(gene_id_clean = sub("\\..*$", "", gene_id))

norm_counts_annotated <- norm_counts_df %>%
  left_join(
    gene_anno_clean %>%
      select(gene_id_clean, gene_symbol, gene_type) %>%
      distinct(),
    by = "gene_id_clean"
  )

ptc_extended_expression <- norm_counts_annotated %>%
  filter(gene_symbol %in% ptc_extended_genes) %>%
  left_join(ptc_extended_gene_group, by = "gene_symbol") %>%
  group_by(gene_symbol) %>%
  mutate(mean_expression = rowMeans(across(any_of(sample_info$sample_id)), na.rm = TRUE)) %>%
  arrange(desc(mean_expression), .by_group = TRUE) %>%
  slice_head(n = 1) %>%
  ungroup()

ptc_extended_expression_long <- ptc_extended_expression %>%
  select(
    gene_id,
    gene_symbol,
    gene_type,
    gene_group,
    any_of(sample_info$sample_id)
  ) %>%
  pivot_longer(
    cols = any_of(sample_info$sample_id),
    names_to = "sample_id",
    values_to = "normalized_count"
  ) %>%
  left_join(
    sample_info %>%
      select(any_of(c("sample_id", "condition", "patient_id"))),
    by = "sample_id"
  ) %>%
  filter(!is.na(condition))

gene_order_boxplot <- ptc_extended_expression_long %>%
  distinct(gene_group, gene_symbol) %>%
  arrange(gene_group, gene_symbol) %>%
  pull(gene_symbol)

ptc_extended_expression_long <- ptc_extended_expression_long %>%
  mutate(
    gene_symbol = factor(gene_symbol, levels = unique(gene_order_boxplot)),
    condition = factor(condition, levels = c("Normal", "Tumor"))
  )

p_boxplot_extended <- ggplot(
  ptc_extended_expression_long,
  aes(
    x = condition,
    y = log2(normalized_count + 1),
    fill = condition
  )
) +
  geom_boxplot(
    outlier.shape = NA,
    alpha = 0.7
  ) +
  geom_jitter(
    width = 0.2,
    size = 0.6,
    alpha = 0.45
  ) +
  facet_wrap(
    ~ gene_symbol,
    scales = "free_y",
    ncol = 5
  ) +
  theme_bw() +
  labs(
    title = "Expression of expanded PTC-relevant genes in TCGA-THCA",
    subtitle = "Tumor vs Normal, DESeq2 normalized counts",
    x = "",
    y = "log2(normalized count + 1)"
  ) +
  theme(
    legend.position = "none",
    strip.text = element_text(face = "bold", size = 8),
    plot.title = element_text(face = "bold")
  )

ggsave(
  filename = "results/plots/TCGA_THCA_PTC_extended_gene_panel_boxplot.png",
  plot = p_boxplot_extended,
  width = 16,
  height = 12,
  dpi = 300
)

ggsave(
  filename = "results/plots/TCGA_THCA_PTC_extended_gene_panel_boxplot.pdf",
  plot = p_boxplot_extended,
  width = 16,
  height = 12
)

target_gene <- if (exists("target_gene")) target_gene else "BRAF"

target_gene_expression_long <- norm_counts_annotated %>%
  filter(gene_symbol == target_gene) %>%
  group_by(gene_symbol) %>%
  mutate(mean_expression = rowMeans(across(any_of(sample_info$sample_id)), na.rm = TRUE)) %>%
  arrange(desc(mean_expression), .by_group = TRUE) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  select(
    gene_id,
    gene_symbol,
    gene_type,
    any_of(sample_info$sample_id)
  ) %>%
  pivot_longer(
    cols = any_of(sample_info$sample_id),
    names_to = "sample_id",
    values_to = "normalized_count"
  ) %>%
  left_join(
    sample_info %>%
      select(any_of(c("sample_id", "condition", "patient_id"))),
    by = "sample_id"
  ) %>%
  filter(!is.na(condition)) %>%
  mutate(condition = factor(condition, levels = c("Normal", "Tumor")))

p_boxplot_target_gene <- ggplot(
  target_gene_expression_long,
  aes(
    x = condition,
    y = log2(normalized_count + 1),
    fill = condition
  )
) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  geom_jitter(width = 0.2, size = 0.8, alpha = 0.5) +
  theme_bw() +
  labs(
    title = paste0("Expression of ", target_gene, " in TCGA-THCA"),
    subtitle = "Tumor vs Normal, DESeq2 normalized counts",
    x = "",
    y = "log2(normalized count + 1)"
  ) +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold")
  )

ggsave(
  filename = paste0("results/plots/TCGA_THCA_PTC_", target_gene, "_boxplot.png"),
  plot = p_boxplot_target_gene,
  width = 5.5,
  height = 4.5,
  dpi = 300
)

ggsave(
  filename = paste0("results/plots/TCGA_THCA_PTC_", target_gene, "_boxplot.pdf"),
  plot = p_boxplot_target_gene,
  width = 5.5,
  height = 4.5
)

volcano_extended_df <- res_annotated %>%
  mutate(
    padj_for_plot = ifelse(is.na(padj), 1, pmax(padj, .Machine$double.xmin)),
    neg_log10_padj = -log10(padj_for_plot),
    DEG_status = case_when(
      !is.na(padj) & padj < deg_padj_cutoff & log2FoldChange >= deg_log2fc_cutoff ~ "Upregulated",
      !is.na(padj) & padj < deg_padj_cutoff & log2FoldChange <= -deg_log2fc_cutoff ~ "Downregulated",
      TRUE ~ "Not_significant"
    ),
    is_extended_panel_gene = ifelse(
      gene_symbol %in% ptc_extended_genes,
      "Expanded_PTC_panel",
      "Other_gene"
    ),
    label_gene = ifelse(
      gene_symbol %in% ptc_extended_genes & !is.na(padj) & padj < deg_padj_cutoff,
      gene_symbol,
      NA_character_
    )
  )

p_volcano_extended <- ggplot(
  volcano_extended_df,
  aes(
    x = log2FoldChange,
    y = neg_log10_padj
  )
) +
  geom_point(
    aes(color = DEG_status),
    alpha = 0.45,
    size = 1
  ) +
  geom_point(
    data = volcano_extended_df %>%
      filter(is_extended_panel_gene == "Expanded_PTC_panel"),
    aes(x = log2FoldChange, y = neg_log10_padj),
    shape = 21,
    size = 3,
    stroke = 0.8,
    fill = NA,
    color = "black"
  ) +
  geom_vline(
    xintercept = c(-deg_log2fc_cutoff, deg_log2fc_cutoff),
    linetype = "dashed"
  ) +
  geom_hline(
    yintercept = -log10(deg_padj_cutoff),
    linetype = "dashed"
  ) +
  theme_bw() +
  labs(
    title = "Volcano plot: PTC Tumor vs Normal",
    subtitle = "Expanded PTC-relevant genes are circled",
    x = "log2 fold change",
    y = "-log10 adjusted p-value",
    color = "DEG status"
  ) +
  theme(
    plot.title = element_text(face = "bold")
  )

if (requireNamespace("ggrepel", quietly = TRUE)) {
  p_volcano_extended <- p_volcano_extended +
    ggrepel::geom_text_repel(
      data = volcano_extended_df %>%
        filter(!is.na(label_gene)),
      aes(label = label_gene),
      size = 3,
      max.overlaps = 50,
      box.padding = 0.4,
      point.padding = 0.3
    )
} else {
  p_volcano_extended <- p_volcano_extended +
    geom_text(
      data = volcano_extended_df %>%
        filter(!is.na(label_gene)),
      aes(label = label_gene),
      size = 3,
      vjust = -0.6,
      check_overlap = TRUE
    )
}

ggsave(
  filename = "results/plots/TCGA_THCA_PTC_volcano_highlight_extended_panel.png",
  plot = p_volcano_extended,
  width = 9,
  height = 7,
  dpi = 300
)

ggsave(
  filename = "results/plots/TCGA_THCA_PTC_volcano_highlight_extended_panel.pdf",
  plot = p_volcano_extended,
  width = 9,
  height = 7
)

ptc_extended_scored_objective

##                  gene_group            gene_id gene_symbol      gene_type
## 1          Fusion_or_kinase ENSG00000165731.20         RET protein_coding
## 2          Fusion_or_kinase ENSG00000140538.16       NTRK3 protein_coding
## 3               Driver_MAPK ENSG00000174775.17        HRAS protein_coding
## 4  Progression_or_high_risk ENSG00000141510.18        TP53 protein_coding
## 5               Driver_MAPK ENSG00000157764.14        BRAF protein_coding
## 6               Driver_MAPK  ENSG00000213281.5        NRAS protein_coding
## 7               Driver_MAPK ENSG00000133703.13        KRAS protein_coding
## 8  Progression_or_high_risk ENSG00000147883.12      CDKN2B protein_coding
## 9          Fusion_or_kinase ENSG00000105976.16         MET protein_coding
## 10         Fusion_or_kinase ENSG00000171094.18         ALK protein_coding
## 11 Progression_or_high_risk ENSG00000147889.18      CDKN2A protein_coding
## 12         Fusion_or_kinase ENSG00000198400.12       NTRK1 protein_coding
## 13 Progression_or_high_risk ENSG00000164362.21        TERT protein_coding
## 14 Progression_or_high_risk ENSG00000100697.15      DICER1 protein_coding
## 15          Thyroid_lineage ENSG00000136352.19      NKX2-1 protein_coding
## 16          Thyroid_lineage ENSG00000125618.17        PAX8 protein_coding
## 17          Thyroid_lineage ENSG00000211448.13        DIO2 protein_coding
## 18          Thyroid_lineage ENSG00000042832.12          TG protein_coding
## 19         Fusion_or_kinase ENSG00000132170.23       PPARG protein_coding
## 20          Thyroid_lineage  ENSG00000105641.4      SLC5A5 protein_coding
## 21          Thyroid_lineage ENSG00000165409.18        TSHR protein_coding
## 22          Thyroid_lineage ENSG00000115705.22         TPO protein_coding
## 23 Progression_or_high_risk ENSG00000183765.23       CHEK2 protein_coding
## 24 Progression_or_high_risk ENSG00000142208.17        AKT1 protein_coding
## 25              Driver_MAPK ENSG00000173674.11      EIF1AX protein_coding
## 26          Thyroid_lineage ENSG00000211452.11        DIO1 protein_coding
## 27          Thyroid_lineage  ENSG00000178919.9       FOXE1 protein_coding
## 28 Progression_or_high_risk ENSG00000171862.11        PTEN protein_coding
## 29 Progression_or_high_risk ENSG00000198793.13        MTOR protein_coding
## 30 Progression_or_high_risk ENSG00000170836.12       PPM1D protein_coding
## 31 Progression_or_high_risk  ENSG00000121879.6      PIK3CA protein_coding
## 32         Fusion_or_kinase ENSG00000047936.11        ROS1 protein_coding
## 33          Thyroid_lineage  ENSG00000197406.7        DIO3 protein_coding
##        baseMean log2FoldChange      lfcSE       stat       pvalue         padj
## 1  3.426115e+02     1.69653085 0.29327886  5.7847021 7.264094e-09 2.614657e-08
## 2  5.884629e+02     1.68478216 0.19750444  8.5303510 1.459043e-17 1.207624e-16
## 3  1.726473e+03     0.59891786 0.09187965  6.5185039 7.101208e-11 3.163853e-10
## 4  3.134114e+03     0.48302146 0.04868008  9.9223642 3.327788e-23 4.263618e-22
## 5  2.691720e+03     0.08687558 0.04308163  2.0165339 4.374417e-02 6.228229e-02
## 6  2.462547e+03    -0.03461003 0.05488486 -0.6305934 5.283064e-01 5.851301e-01
## 7  1.860714e+03     0.04447180 0.04718236  0.9425512 3.459105e-01 4.049551e-01
## 8  6.510688e+02     2.97114555 0.15200291 19.5466358 4.406375e-85 1.258784e-82
## 9  2.794368e+04     2.67762563 0.14948195 17.9127021 9.386618e-72 1.631519e-69
## 10 2.160376e+02     3.21414491 0.19049283 16.8727865 7.134968e-64 8.835417e-62
## 11 2.592684e+02     3.19851691 0.22022816 14.5236506 8.581562e-48 5.042848e-46
## 12 9.670243e+00     0.25791740 0.23050104  1.1189425 2.631647e-01 3.176361e-01
## 13 4.406779e+00    -0.56834370 0.52547519 -1.0815805 2.794390e-01 3.352594e-01
## 14 3.720577e+03    -0.63392825 0.06485250 -9.7749235 1.442662e-22 1.753375e-21
## 15 1.607623e+04     0.63920190 0.08042123  7.9481738 1.892810e-15 1.306850e-14
## 16 4.812762e+04    -0.60278570 0.08321456 -7.2437527 4.364370e-13 2.419776e-12
## 17 1.876872e+04    -1.32095629 0.18321300 -7.2099486 5.597300e-13 3.062751e-12
## 18 2.342569e+06    -1.12906181 0.16356168 -6.9029728 5.092540e-12 2.550513e-11
## 19 5.084104e+02    -0.87089889 0.13140071 -6.6278094 3.407049e-11 1.561614e-10
## 20 1.231463e+03    -2.34774688 0.42423284 -5.5340998 3.128308e-08 1.051226e-07
## 21 3.541045e+04    -0.43457743 0.07884485 -5.5118049 3.551726e-08 1.186225e-07
## 22 9.837561e+04    -1.72815412 0.31537893 -5.4796118 4.262600e-08 1.413177e-07
## 23 4.751797e+02     0.40784810 0.07974622  5.1143250 3.148648e-07 9.442672e-07
## 24 6.678602e+03     0.22934978 0.04757751  4.8205501 1.431629e-06 3.959623e-06
## 25 4.742313e+03    -0.20774299 0.04508766 -4.6075353 4.074700e-06 1.066467e-05
## 26 7.043099e+03    -1.28227402 0.33903895 -3.7820847 1.555204e-04 3.301863e-04
## 27 1.835582e+04    -0.30850514 0.10353100 -2.9798335 2.884051e-03 5.039579e-03
## 28 6.384400e+03    -0.12297798 0.04186449 -2.9375249 3.308436e-03 5.727041e-03
## 29 3.637936e+03     0.09121340 0.04768823  1.9127027 5.578612e-02 7.783351e-02
## 30 1.036964e+03    -0.08264470 0.04449921 -1.8572170 6.328029e-02 8.733086e-02
## 31 1.608636e+03     0.04411375 0.07527116  0.5860645 5.578322e-01 6.128677e-01
## 32 1.613612e+01     2.97071741 0.45977920  6.4611827 1.038878e-10 4.537504e-10
## 33 2.005290e+01    -1.75660612 0.31149529 -5.6392703 1.707723e-08 5.904484e-08
##               expression_interpretation CGC_tier OncoKB_level CIViC_level
## 1         Strongly_upregulated_in_Tumor    Tier1       Level1           A
## 2         Strongly_upregulated_in_Tumor    Tier1       Level1           A
## 3  Statistically_significant_but_low_FC    Tier1       Level3           B
## 4  Statistically_significant_but_low_FC    Tier1       Level3           B
## 5         Not_statistically_significant    Tier1       Level1           A
## 6         Not_statistically_significant    Tier1       Level3           B
## 7         Not_statistically_significant    Tier1       Level3           B
## 8         Strongly_upregulated_in_Tumor     <NA>         <NA>        <NA>
## 9         Strongly_upregulated_in_Tumor     <NA>         <NA>        <NA>
## 10        Strongly_upregulated_in_Tumor     <NA>         <NA>        <NA>
## 11        Strongly_upregulated_in_Tumor     <NA>         <NA>        <NA>
## 12        Not_statistically_significant    Tier1       Level1           A
## 13        Not_statistically_significant    Tier1       Level3           B
## 14 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 15 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 16 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 17      Strongly_downregulated_in_Tumor     <NA>         <NA>        <NA>
## 18      Strongly_downregulated_in_Tumor     <NA>         <NA>        <NA>
## 19 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 20      Strongly_downregulated_in_Tumor     <NA>         <NA>        <NA>
## 21 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 22      Strongly_downregulated_in_Tumor     <NA>         <NA>        <NA>
## 23 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 24 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 25 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 26      Strongly_downregulated_in_Tumor     <NA>         <NA>        <NA>
## 27 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 28 Statistically_significant_but_low_FC     <NA>         <NA>        <NA>
## 29        Not_statistically_significant     <NA>         <NA>        <NA>
## 30        Not_statistically_significant     <NA>         <NA>        <NA>
## 31        Not_statistically_significant     <NA>         <NA>        <NA>
## 32        Strongly_upregulated_in_Tumor     <NA>         <NA>        <NA>
## 33      Strongly_downregulated_in_Tumor     <NA>         <NA>        <NA>
##    ptc_mutation_frequency         variant_relevance cgc_score oncokb_score
## 1                    0.07             fusion_driver         3            3
## 2                    0.01             fusion_driver         3            3
## 3                    0.03  canonical_driver_hotspot         3            2
## 4                    0.02          late_progression         3            2
## 5                    0.60  canonical_driver_hotspot         3            3
## 6                    0.08  canonical_driver_hotspot         3            2
## 7                    0.01  canonical_driver_hotspot         3            2
## 8                      NA                      <NA>        NA           NA
## 9                      NA                      <NA>        NA           NA
## 10                     NA                      <NA>        NA           NA
## 11                     NA                      <NA>        NA           NA
## 12                   0.02             fusion_driver         3            3
## 13                   0.10       prognostic_promoter         3            2
## 14                     NA                      <NA>        NA           NA
## 15                     NA                      <NA>        NA           NA
## 16                     NA                      <NA>        NA           NA
## 17                     NA                      <NA>        NA           NA
## 18                   0.00 thyroid_expression_marker         0            0
## 19                     NA                      <NA>        NA           NA
## 20                     NA                      <NA>        NA           NA
## 21                     NA                      <NA>        NA           NA
## 22                     NA                      <NA>        NA           NA
## 23                     NA                      <NA>        NA           NA
## 24                     NA                      <NA>        NA           NA
## 25                     NA                      <NA>        NA           NA
## 26                     NA                      <NA>        NA           NA
## 27                     NA                      <NA>        NA           NA
## 28                     NA                      <NA>        NA           NA
## 29                     NA                      <NA>        NA           NA
## 30                     NA                      <NA>        NA           NA
## 31                     NA                      <NA>        NA           NA
## 32                     NA                      <NA>        NA           NA
## 33                     NA                      <NA>        NA           NA
##    civic_score ptc_frequency_score variant_relevance_score
## 1            3            1.998570                       3
## 2            3            1.124260                       3
## 3            3            1.610035                       3
## 4            3            1.427433                       1
## 5            3            3.000000                       3
## 6            3            2.060350                       3
## 7            3            1.124260                       3
## 8           NA                  NA                      NA
## 9           NA                  NA                      NA
## 10          NA                  NA                      NA
## 11          NA                  NA                      NA
## 12           3            1.427433                       3
## 13           3            2.163813                       2
## 14          NA                  NA                      NA
## 15          NA                  NA                      NA
## 16          NA                  NA                      NA
## 17          NA                  NA                      NA
## 18           0            0.000000                       0
## 19          NA                  NA                      NA
## 20          NA                  NA                      NA
## 21          NA                  NA                      NA
## 22          NA                  NA                      NA
## 23          NA                  NA                      NA
## 24          NA                  NA                      NA
## 25          NA                  NA                      NA
## 26          NA                  NA                      NA
## 27          NA                  NA                      NA
## 28          NA                  NA                      NA
## 29          NA                  NA                      NA
## 30          NA                  NA                      NA
## 31          NA                  NA                      NA
## 32          NA                  NA                      NA
## 33          NA                  NA                      NA
##    driver_evidence_score expression_score vaccine_priority_score
## 1               2.849785                3               5.849785
## 2               2.718639                3               5.718639
## 3               2.541505                2               4.541505
## 4               2.314115                2               4.314115
## 5               3.000000                1               4.000000
## 6               2.609053                1               3.609053
## 7               2.468639                1               3.468639
## 8               0.000000                3               3.000000
## 9               0.000000                3               3.000000
## 10              0.000000                3               3.000000
## 11              0.000000                3               3.000000
## 12              2.764115                0               2.764115
## 13              2.524572                0               2.524572
## 14              0.000000                2               2.000000
## 15              0.000000                2               2.000000
## 16              0.000000                2               2.000000
## 17              0.000000                2               2.000000
## 18              0.000000                2               2.000000
## 19              0.000000                2               2.000000
## 20              0.000000                2               2.000000
## 21              0.000000                2               2.000000
## 22              0.000000                2               2.000000
## 23              0.000000                2               2.000000
## 24              0.000000                2               2.000000
## 25              0.000000                2               2.000000
## 26              0.000000                2               2.000000
## 27              0.000000                2               2.000000
## 28              0.000000                2               2.000000
## 29              0.000000                1               1.000000
## 30              0.000000                1               1.000000
## 31              0.000000                1               1.000000
## 32              0.000000                0               0.000000
## 33              0.000000                0               0.000000

top_upregulated_genes %>% head(50)

##        baseMean log2FoldChange     lfcSE     stat        pvalue          padj
## 1     139.19914       7.611620 0.5684245 13.39073  6.850402e-41  2.811588e-39
## 2   10654.86873       7.528596 0.3878325 19.41198  6.113065e-84  1.695964e-81
## 3      86.24530       7.480511 0.5483813 13.64107  2.281743e-42  1.028674e-40
## 4     231.61762       7.448300 0.3743877 19.89462  4.530557e-88  1.502530e-85
## 5    3021.75435       7.150000 0.2644916 27.03299 6.054084e-161 5.590868e-157
## 6    5747.33432       7.140052 0.2338046 30.53854 8.027451e-205 2.316160e-200
## 7    1290.65258       7.098649 0.4453479 15.93956  3.367108e-57  3.093986e-55
## 8    3351.88885       7.068247 0.3098013 22.81542 3.223279e-115 3.720050e-112
## 9    1460.32112       6.902744 0.3890016 17.74477  1.891795e-70  3.049384e-68
## 10  30888.04667       6.811595 0.2773459 24.55993 3.387735e-133 9.774631e-130
## 11    851.74582       6.799043 0.3567015 19.06087  5.337479e-81  1.283352e-78
## 12    142.86850       6.693051 0.4233742 15.80883  2.704455e-56  2.386289e-54
## 13   4976.78953       6.653533 0.2977219 22.34814 1.258673e-110 1.100500e-107
## 14    243.21483       6.551546 0.3691888 17.74579  1.857727e-70  3.011292e-68
## 15    390.04069       6.546445 0.3880981 16.86802  7.734867e-64  9.537355e-62
## 16    531.99707       6.468710 0.3654269 17.70179  4.061938e-70  6.434761e-68
## 17    170.88659       6.374495 0.3644404 17.49118  1.672448e-68  2.474622e-66
## 18    137.22944       6.373977 0.3779327 16.86538  8.088518e-64  9.930979e-62
## 19    985.21699       6.355541 0.3408408 18.64665  1.344323e-77  2.810706e-75
## 20    192.66464       6.345654 0.3842796 16.51312  2.952218e-61  3.166556e-59
## 21    209.29242       6.296718 0.3809182 16.53037  2.217849e-61  2.414778e-59
## 22   4140.60816       6.292665 0.2882844 21.82798 1.258734e-105 8.858110e-103
## 23   1161.99626       6.290860 0.3445783 18.25669  1.830563e-74  3.429691e-72
## 24     66.15691       6.264599 0.3812760 16.43061  1.154862e-60  1.211682e-58
## 25    301.94513       6.184367 0.3181151 19.44066  3.496338e-84  9.794160e-82
## 26   2386.03189       6.171108 0.3323255 18.56947  5.676314e-77  1.161551e-74
## 27    656.98265       6.136619 0.3415887 17.96494  3.667047e-72  6.412442e-70
## 28   2649.58725       6.094615 0.2905359 20.97715  1.060641e-97  5.368891e-95
## 29    403.68670       6.077394 0.3731307 16.28758  1.209288e-59  1.224266e-57
## 30    598.25526       6.005846 0.3329362 18.03903  9.621922e-73  1.713712e-70
## 31   2585.87063       5.929989 0.2931497 20.22854  5.490991e-91  2.005463e-88
## 32   1565.68351       5.907715 0.3109027 19.00181  1.647539e-80  3.864751e-78
## 33   7249.60161       5.898601 0.3414102 17.27717  6.989765e-67  9.558090e-65
## 34     54.43943       5.856942 0.4954532 11.82138  3.026617e-32  7.432082e-31
## 35   1453.29869       5.832175 0.2538034 22.97911 7.541492e-117 9.066444e-114
## 36   6717.02856       5.819223 0.2403594 24.21051 1.724303e-129 4.145944e-126
## 37    300.32426       5.781535 0.3771580 15.32921  4.878785e-53  3.763839e-51
## 38    161.10307       5.772022 0.2780773 20.75690  1.062040e-95  4.642887e-93
## 39     50.56827       5.726005 0.4584062 12.49111  8.347566e-36  2.501063e-34
## 40   4925.33263       5.724080 0.2724347 21.01083  5.221419e-98  2.690243e-95
## 41 838667.28568       5.704339 0.2587789 22.04329 1.107761e-107 9.132067e-105
## 42    274.51671       5.695708 0.3397416 16.76482  4.413320e-63  5.327930e-61
## 43   3909.65470       5.672596 0.2099107 27.02386 7.750831e-161 5.590868e-157
## 44     74.61713       5.669866 0.2988115 18.97472  2.759752e-80  6.421541e-78
## 45    182.42916       5.666730 0.4726649 11.98890  4.062850e-33  1.057037e-31
## 46   2870.76749       5.612894 0.2267285 24.75602 2.670494e-135 8.561306e-132
## 47    556.70963       5.591001 0.2762490 20.23899  4.442063e-91  1.643165e-88
## 48   1110.82434       5.504737 0.3288548 16.73911  6.799614e-63  8.073633e-61
## 49 159102.76778       5.494196 0.2706550 20.29963  1.295588e-91  4.854753e-89
## 50    908.69584       5.490704 0.3275721 16.76182  4.642042e-63  5.580701e-61
##               gene_id   gene_id_clean gene_symbol      gene_type
## 1  ENSG00000137745.12 ENSG00000137745       MMP13 protein_coding
## 2  ENSG00000147256.12 ENSG00000147256    ARHGAP36 protein_coding
## 3   ENSG00000187714.7 ENSG00000187714     SLC18A3 protein_coding
## 4  ENSG00000197587.11 ENSG00000197587       DMBX1 protein_coding
## 5   ENSG00000259803.7 ENSG00000259803    SLC22A31 protein_coding
## 6  ENSG00000145864.14 ENSG00000145864      GABRB2 protein_coding
## 7  ENSG00000122852.15 ENSG00000122852      SFTPA1 protein_coding
## 8  ENSG00000187045.19 ENSG00000187045     TMPRSS6 protein_coding
## 9  ENSG00000167757.14 ENSG00000167757       KLK11 protein_coding
## 10  ENSG00000174460.4 ENSG00000174460     ZCCHC12 protein_coding
## 11  ENSG00000275896.6 ENSG00000275896       PRSS2 protein_coding
## 12 ENSG00000167755.15 ENSG00000167755        KLK6 protein_coding
## 13 ENSG00000173227.14 ENSG00000173227       SYT12 protein_coding
## 14 ENSG00000204983.14 ENSG00000204983       PRSS1 protein_coding
## 15  ENSG00000163207.7 ENSG00000163207         IVL protein_coding
## 16 ENSG00000155897.10 ENSG00000155897       ADCY8 protein_coding
## 17 ENSG00000231924.10 ENSG00000231924        PSG1 protein_coding
## 18 ENSG00000100341.12 ENSG00000100341      PNPLA5 protein_coding
## 19 ENSG00000134873.10 ENSG00000134873      CLDN10 protein_coding
## 20 ENSG00000087128.10 ENSG00000087128   TMPRSS11E protein_coding
## 21  ENSG00000170369.4 ENSG00000170369        CST2 protein_coding
## 22 ENSG00000137648.19 ENSG00000137648     TMPRSS4 protein_coding
## 23  ENSG00000204544.5 ENSG00000204544       MUC21 protein_coding
## 24  ENSG00000170367.5 ENSG00000170367        CST5 protein_coding
## 25  ENSG00000167105.8 ENSG00000167105      TMEM92 protein_coding
## 26 ENSG00000169035.12 ENSG00000169035        KLK7 protein_coding
## 27 ENSG00000172020.13 ENSG00000172020       GAP43 protein_coding
## 28 ENSG00000129451.12 ENSG00000129451       KLK10 protein_coding
## 29 ENSG00000134258.17 ENSG00000134258       VTCN1 protein_coding
## 30 ENSG00000163817.16 ENSG00000163817     SLC6A20 protein_coding
## 31 ENSG00000179913.11 ENSG00000179913      B3GNT3 protein_coding
## 32  ENSG00000188133.6 ENSG00000188133     TMEM215 protein_coding
## 33  ENSG00000164935.6 ENSG00000164935     DCSTAMP protein_coding
## 34  ENSG00000170373.8 ENSG00000170373        CST1 protein_coding
## 35 ENSG00000147041.12 ENSG00000147041       SYTL5 protein_coding
## 36  ENSG00000176532.4 ENSG00000176532       PRR15 protein_coding
## 37 ENSG00000185303.17 ENSG00000185303      SFTPA2 protein_coding
## 38  ENSG00000138308.6 ENSG00000138308    PLA2G12B protein_coding
## 39 ENSG00000172461.11 ENSG00000172461        FUT9 protein_coding
## 40 ENSG00000187122.17 ENSG00000187122       SLIT1 protein_coding
## 41 ENSG00000115414.21 ENSG00000115414         FN1 protein_coding
## 42 ENSG00000166897.16 ENSG00000166897       ELFN2 protein_coding
## 43 ENSG00000163898.10 ENSG00000163898        LIPH protein_coding
## 44  ENSG00000164400.6 ENSG00000164400        CSF2 protein_coding
## 45 ENSG00000124467.19 ENSG00000124467        PSG8 protein_coding
## 46 ENSG00000149948.14 ENSG00000149948       HMGA2 protein_coding
## 47  ENSG00000187823.3 ENSG00000187823        RTL4 protein_coding
## 48 ENSG00000109255.11 ENSG00000109255         NMU protein_coding
## 49 ENSG00000157765.13 ENSG00000157765     SLC34A2 protein_coding
## 50  ENSG00000072315.3 ENSG00000072315       TRPC5 protein_coding
##               direction
## 1  Upregulated_in_Tumor
## 2  Upregulated_in_Tumor
## 3  Upregulated_in_Tumor
## 4  Upregulated_in_Tumor
## 5  Upregulated_in_Tumor
## 6  Upregulated_in_Tumor
## 7  Upregulated_in_Tumor
## 8  Upregulated_in_Tumor
## 9  Upregulated_in_Tumor
## 10 Upregulated_in_Tumor
## 11 Upregulated_in_Tumor
## 12 Upregulated_in_Tumor
## 13 Upregulated_in_Tumor
## 14 Upregulated_in_Tumor
## 15 Upregulated_in_Tumor
## 16 Upregulated_in_Tumor
## 17 Upregulated_in_Tumor
## 18 Upregulated_in_Tumor
## 19 Upregulated_in_Tumor
## 20 Upregulated_in_Tumor
## 21 Upregulated_in_Tumor
## 22 Upregulated_in_Tumor
## 23 Upregulated_in_Tumor
## 24 Upregulated_in_Tumor
## 25 Upregulated_in_Tumor
## 26 Upregulated_in_Tumor
## 27 Upregulated_in_Tumor
## 28 Upregulated_in_Tumor
## 29 Upregulated_in_Tumor
## 30 Upregulated_in_Tumor
## 31 Upregulated_in_Tumor
## 32 Upregulated_in_Tumor
## 33 Upregulated_in_Tumor
## 34 Upregulated_in_Tumor
## 35 Upregulated_in_Tumor
## 36 Upregulated_in_Tumor
## 37 Upregulated_in_Tumor
## 38 Upregulated_in_Tumor
## 39 Upregulated_in_Tumor
## 40 Upregulated_in_Tumor
## 41 Upregulated_in_Tumor
## 42 Upregulated_in_Tumor
## 43 Upregulated_in_Tumor
## 44 Upregulated_in_Tumor
## 45 Upregulated_in_Tumor
## 46 Upregulated_in_Tumor
## 47 Upregulated_in_Tumor
## 48 Upregulated_in_Tumor
## 49 Upregulated_in_Tumor
## 50 Upregulated_in_Tumor

top_downregulated_genes %>% head(50)

##       baseMean log2FoldChange     lfcSE       stat       pvalue         padj
## 1    640.08393      -3.669924 0.4781618  -7.675066 1.653333e-14 1.047050e-13
## 2    180.21576      -3.439544 0.3838133  -8.961502 3.202847e-19 3.026916e-18
## 3    174.91070      -3.414607 0.2587846 -13.194786 9.402183e-40 3.626754e-38
## 4    316.81415      -3.277959 0.3137872 -10.446437 1.521339e-25 2.343577e-24
## 5    350.49927      -3.255289 0.2931823 -11.103293 1.209060e-28 2.311797e-27
## 6     65.61661      -3.241349 0.2886033 -11.231158 2.866973e-29 5.756491e-28
## 7   3062.49891      -3.226218 0.3728960  -8.651791 5.069730e-18 4.375619e-17
## 8    311.58458      -3.187993 0.3522922  -9.049285 1.439079e-19 1.393815e-18
## 9    240.20947      -3.179130 0.4259791  -7.463111 8.450306e-14 5.013709e-13
## 10   606.33236      -3.146754 0.1965460 -16.010263 1.083503e-57 1.011725e-55
## 11 11295.75574      -3.132265 0.3355518  -9.334669 1.013077e-20 1.073460e-19
## 12    53.91776      -3.119818 0.3164322  -9.859357 6.244823e-23 7.830590e-22
## 13   234.26973      -3.105607 0.2281302 -13.613310 3.337778e-42 1.486187e-40
## 14    95.20278      -3.057267 0.2241273 -13.640763 2.291496e-42 1.031459e-40
## 15    56.68241      -3.054084 0.3335049  -9.157537 5.309545e-20 5.339711e-19
## 16   559.06760      -3.036121 0.3143629  -9.658015 4.545882e-22 5.329636e-21
## 17   385.90387      -2.988129 0.1765688 -16.923312 3.029081e-64 3.783467e-62
## 18    54.44398      -2.957510 0.2711694 -10.906503 1.073049e-27 1.937465e-26
## 19   809.02491      -2.951601 0.2903045 -10.167260 2.776069e-24 3.858282e-23
## 20   142.84240      -2.910644 0.4622354  -6.296886 3.036835e-10 1.264931e-09
## 21   693.34957      -2.910438 0.2364955 -12.306525 8.354371e-35 2.386620e-33
## 22   116.27607      -2.888023 0.2495786 -11.571596 5.740452e-31 1.296004e-29
## 23   341.76782      -2.881251 0.6686900  -4.308799 1.641432e-05 3.979851e-05
## 24  1901.47025      -2.864094 0.2313649 -12.379120 3.390383e-35 9.871112e-34
## 25    62.01151      -2.825277 0.4725690  -5.978549 2.251336e-09 8.550456e-09
## 26  3356.00892      -2.802483 0.1808969 -15.492151 3.919565e-54 3.132721e-52
## 27   102.28354      -2.769642 0.3341112  -8.289580 1.136495e-16 8.751348e-16
## 28  1261.27529      -2.763253 0.3179783  -8.690070 3.622176e-18 3.159330e-17
## 29   128.35056      -2.752176 0.1826636 -15.066911 2.673236e-51 1.895108e-49
## 30  3347.01420      -2.749180 0.2342583 -11.735677 8.365426e-32 1.993127e-30
## 31    53.64868      -2.715544 0.2768703  -9.808001 1.040089e-22 1.279185e-21
## 32   461.53865      -2.712453 0.2522950 -10.751115 5.854954e-27 1.001975e-25
## 33    50.66118      -2.692956 0.2556302 -10.534575 5.985275e-26 9.525270e-25
## 34   166.36653      -2.682117 0.2415350 -11.104465 1.193302e-28 2.286211e-27
## 35   433.23819      -2.604618 0.2018573 -12.903260 4.314508e-38 1.501646e-36
## 36  1873.46633      -2.598860 0.3156751  -8.232706 1.830314e-16 1.381016e-15
## 37    82.12775      -2.587127 0.2896947  -8.930529 4.239761e-19 3.973038e-18
## 38    94.22522      -2.572962 0.3471523  -7.411625 1.247614e-13 7.278083e-13
## 39   519.00352      -2.561459 0.2743236  -9.337362 9.876429e-21 1.047664e-19
## 40    94.45686      -2.561336 0.3831729  -6.684543 2.316463e-11 1.081328e-10
## 41  1587.91595      -2.547745 0.3228642  -7.891073 2.995994e-15 2.030141e-14
## 42   364.85829      -2.523266 0.1935605 -13.036056 7.630328e-39 2.741692e-37
## 43   510.81467      -2.519122 0.2710503  -9.293928 1.486966e-20 1.560125e-19
## 44    56.50537      -2.495550 0.3955345  -6.309311 2.802805e-10 1.171171e-09
## 45   258.39886      -2.482653 0.2839335  -8.743784 2.254297e-18 1.999484e-17
## 46    61.24206      -2.462583 0.4231863  -5.819145 5.914937e-09 2.148605e-08
## 47   231.02438      -2.453409 0.1555536 -15.772110 4.840562e-56 4.219478e-54
## 48   158.22983      -2.438495 0.1741499 -14.002275 1.509598e-44 7.522702e-43
## 49 17842.89274      -2.419733 0.2188671 -11.055720 2.056794e-28 3.881274e-27
## 50  6399.53993      -2.416210 0.3360806  -7.189377 6.508759e-13 3.540004e-12
##               gene_id   gene_id_clean gene_symbol      gene_type
## 1  ENSG00000110680.13 ENSG00000110680       CALCA protein_coding
## 2  ENSG00000034971.17 ENSG00000034971        MYOC protein_coding
## 3  ENSG00000189056.14 ENSG00000189056        RELN protein_coding
## 4   ENSG00000143196.5 ENSG00000143196         DPT protein_coding
## 5  ENSG00000196616.14 ENSG00000196616       ADH1B protein_coding
## 6   ENSG00000173467.9 ENSG00000173467        AGR3 protein_coding
## 7   ENSG00000137077.8 ENSG00000137077       CCL21 protein_coding
## 8  ENSG00000168079.17 ENSG00000168079      SCARA5 protein_coding
## 9   ENSG00000182348.7 ENSG00000182348     ZNF804B protein_coding
## 10 ENSG00000138722.10 ENSG00000138722       MMRN1 protein_coding
## 11 ENSG00000160180.15 ENSG00000160180        TFF3 protein_coding
## 12  ENSG00000196666.6 ENSG00000196666     FAM180B protein_coding
## 13 ENSG00000147257.15 ENSG00000147257        GPC3 protein_coding
## 14  ENSG00000152580.8 ENSG00000152580      IGSF10 protein_coding
## 15 ENSG00000205221.12 ENSG00000205221         VIT protein_coding
## 16 ENSG00000168702.18 ENSG00000168702       LRP1B protein_coding
## 17  ENSG00000133800.9 ENSG00000133800       LYVE1 protein_coding
## 18  ENSG00000204065.3 ENSG00000204065      TCEAL5 protein_coding
## 19 ENSG00000101938.15 ENSG00000101938      CHRDL1 protein_coding
## 20 ENSG00000100721.11 ENSG00000100721       TCL1A protein_coding
## 21 ENSG00000198626.17 ENSG00000198626        RYR2 protein_coding
## 22 ENSG00000183775.11 ENSG00000183775      KCTD16 protein_coding
## 23  ENSG00000181617.6 ENSG00000181617       FDCSP protein_coding
## 24 ENSG00000074706.13 ENSG00000074706      IPCEF1 protein_coding
## 25 ENSG00000162897.16 ENSG00000162897       FCAMR protein_coding
## 26 ENSG00000153246.13 ENSG00000153246      PLA2R1 protein_coding
## 27 ENSG00000122145.15 ENSG00000122145       TBX22 protein_coding
## 28 ENSG00000130226.17 ENSG00000130226        DPP6 protein_coding
## 29 ENSG00000154258.17 ENSG00000154258       ABCA9 protein_coding
## 30  ENSG00000115112.8 ENSG00000115112     TFCP2L1 protein_coding
## 31  ENSG00000179674.4 ENSG00000179674       ARL14 protein_coding
## 32 ENSG00000138356.14 ENSG00000138356        AOX1 protein_coding
## 33  ENSG00000177363.5 ENSG00000177363     LRRN4CL protein_coding
## 34 ENSG00000149972.11 ENSG00000149972       CNTN5 protein_coding
## 35 ENSG00000150625.16 ENSG00000150625       GPM6A protein_coding
## 36 ENSG00000166589.13 ENSG00000166589       CDH16 protein_coding
## 37  ENSG00000128218.8 ENSG00000128218      VPREB3 protein_coding
## 38 ENSG00000145242.13 ENSG00000145242       EPHA5 protein_coding
## 39 ENSG00000164530.15 ENSG00000164530        PI16 protein_coding
## 40 ENSG00000104921.15 ENSG00000104921       FCER2 protein_coding
## 41 ENSG00000075035.10 ENSG00000075035       WSCD2 protein_coding
## 42 ENSG00000064270.13 ENSG00000064270      ATP2C2 protein_coding
## 43 ENSG00000141639.12 ENSG00000141639       MAPK4 protein_coding
## 44 ENSG00000163534.15 ENSG00000163534       FCRL1 protein_coding
## 45  ENSG00000165300.7 ENSG00000165300     SLITRK5 protein_coding
## 46 ENSG00000109205.16 ENSG00000109205        ODAM protein_coding
## 47  ENSG00000198774.5 ENSG00000198774      RASSF9 protein_coding
## 48 ENSG00000153823.19 ENSG00000153823        PID1 protein_coding
## 49  ENSG00000147606.9 ENSG00000147606     SLC26A7 protein_coding
## 50 ENSG00000205038.12 ENSG00000205038     PKHD1L1 protein_coding
##                 direction
## 1  Downregulated_in_Tumor
## 2  Downregulated_in_Tumor
## 3  Downregulated_in_Tumor
## 4  Downregulated_in_Tumor
## 5  Downregulated_in_Tumor
## 6  Downregulated_in_Tumor
## 7  Downregulated_in_Tumor
## 8  Downregulated_in_Tumor
## 9  Downregulated_in_Tumor
## 10 Downregulated_in_Tumor
## 11 Downregulated_in_Tumor
## 12 Downregulated_in_Tumor
## 13 Downregulated_in_Tumor
## 14 Downregulated_in_Tumor
## 15 Downregulated_in_Tumor
## 16 Downregulated_in_Tumor
## 17 Downregulated_in_Tumor
## 18 Downregulated_in_Tumor
## 19 Downregulated_in_Tumor
## 20 Downregulated_in_Tumor
## 21 Downregulated_in_Tumor
## 22 Downregulated_in_Tumor
## 23 Downregulated_in_Tumor
## 24 Downregulated_in_Tumor
## 25 Downregulated_in_Tumor
## 26 Downregulated_in_Tumor
## 27 Downregulated_in_Tumor
## 28 Downregulated_in_Tumor
## 29 Downregulated_in_Tumor
## 30 Downregulated_in_Tumor
## 31 Downregulated_in_Tumor
## 32 Downregulated_in_Tumor
## 33 Downregulated_in_Tumor
## 34 Downregulated_in_Tumor
## 35 Downregulated_in_Tumor
## 36 Downregulated_in_Tumor
## 37 Downregulated_in_Tumor
## 38 Downregulated_in_Tumor
## 39 Downregulated_in_Tumor
## 40 Downregulated_in_Tumor
## 41 Downregulated_in_Tumor
## 42 Downregulated_in_Tumor
## 43 Downregulated_in_Tumor
## 44 Downregulated_in_Tumor
## 45 Downregulated_in_Tumor
## 46 Downregulated_in_Tumor
## 47 Downregulated_in_Tumor
## 48 Downregulated_in_Tumor
## 49 Downregulated_in_Tumor
## 50 Downregulated_in_Tumor

p_top_upregulated

p_top_downregulated

p_boxplot_target_gene

p_boxplot_extended

p_volcano_extended

26 Tổng hợp output của quy trình hiện tại

Sau khi chạy thành công, các file chính được lưu trong thư mục results và results/plots:

Nhóm output	File	Ý nghĩa
DESeq2 toàn bộ	`TCGA_THCA_PTC_Tumor_vs_Normal_DESeq2_all_results.csv`	Toàn bộ kết quả DESeq2 chưa annotation
DEG có ý nghĩa	`TCGA_THCA_PTC_Tumor_vs_Normal_DESeq2_significant_DEGs.csv`	DEG theo ngưỡng `padj < 0.05`, `abs(log2FC) >= 1`
DESeq2 annotated	`TCGA_THCA_PTC_Tumor_vs_Normal_DESeq2_annotated_all_results.csv`	Kết quả DESeq2 có gene symbol/gene type
DEG annotated	`TCGA_THCA_PTC_Tumor_vs_Normal_DESeq2_annotated_significant_DEGs.csv`	DEG có ý nghĩa kèm annotation
Gene PTC đích	`TCGA_THCA_PTC_target_genes_DESeq2_result.csv`	Kết quả riêng cho BRAF/RET/NTRK/RAS/TERT/TG/TTN
Normalized counts	`TCGA_THCA_PTC_normalized_counts_annotated.csv`	Count đã chuẩn hóa để mô tả biểu hiện
Expression long	`TCGA_THCA_PTC_target_genes_normalized_counts_long.csv`	Dữ liệu long format để vẽ boxplot
Panel PTC mở rộng	`TCGA_THCA_PTC_extended_gene_panel_DESeq2_result.csv`	Kết quả riêng cho panel gene PTC mở rộng
Chấm điểm sơ bộ	`TCGA_THCA_PTC_extended_gene_panel_vaccine_priority.csv`	Điểm ưu tiên sơ bộ cho pipeline vaccine
Gene tăng biểu hiện	`TCGA_THCA_PTC_top_upregulated_protein_coding_genes.csv`	Gene protein-coding tăng trong Tumor
Gene giảm biểu hiện	`TCGA_THCA_PTC_top_downregulated_protein_coding_genes.csv`	Gene protein-coding giảm trong Tumor
Boxplot	`plots/TCGA_THCA_PTC_target_genes_boxplot.png`	Biểu hiện gene đích theo Tumor/Normal
Volcano plot	`plots/TCGA_THCA_PTC_Tumor_vs_Normal_volcano_plot.png`	Tổng quan DEG
Heatmap	`plots/TCGA_THCA_PTC_Top50_DEGs_heatmap.png`	Top 50 DEG
Boxplot expanded panel	`plots/TCGA_THCA_PTC_extended_gene_panel_boxplot.png`	Biểu hiện toàn bộ expanded PTC panel theo Tumor/Normal
Volcano expanded panel	`plots/TCGA_THCA_PTC_volcano_highlight_extended_panel.png`	Volcano plot toàn bộ DEG, highlight expanded PTC panel
Heatmap expanded panel	`plots/TCGA_THCA_PTC_extended_gene_panel_heatmap.png`	Heatmap biểu hiện expanded PTC panel
Heatmap expanded panel sạch	`plots/TCGA_THCA_PTC_extended_gene_panel_heatmap_clean.png`	Heatmap z-score dễ trình bày hơn cho expanded PTC panel

27 Diễn giải phương pháp nghiên cứu có thể đưa vào bài viết

27.1 Thiết kế nghiên cứu

Nghiên cứu này là phân tích tin sinh học hồi cứu sử dụng dữ liệu RNA-seq công khai từ TCGA-THCA. Dữ liệu biểu hiện gene cấp raw count được tải từ Genomic Data Commons bằng package TCGAbiolinks. Các mẫu được phân nhóm dựa trên mã sample type của TCGA barcode, trong đó mã 01 được xác định là Primary Solid Tumor và mã 11 là Solid Tissue Normal.

27.2 Tiền xử lý dữ liệu

Dữ liệu sau khi tải được lưu dưới dạng đối tượng SummarizedExperiment. Raw count được trích từ assay unstranded và ép kiểu số nguyên để phù hợp với yêu cầu đầu vào của DESeq2. Metadata mẫu được trích từ colData, còn annotation gene được trích từ rowData. Các mẫu không thuộc nhóm Tumor hoặc Normal được loại khỏi phân tích. Thứ tự sample trong count matrix và metadata được kiểm tra bằng stopifnot() để tránh sai lệch thiết kế phân tích.

27.3 Phân tích biểu hiện khác biệt

Biểu hiện khác biệt giữa Tumor và Normal được phân tích bằng DESeq2 với thiết kế ~ condition, trong đó Normal là nhóm tham chiếu. Các gene có biểu hiện thấp được lọc trước phân tích bằng tiêu chí count lớn hơn hoặc bằng 10 ở ít nhất 10 mẫu. DESeq2 được sử dụng để ước lượng size factor, dispersion và kiểm định khác biệt biểu hiện bằng mô hình negative binomial. Gene có ý nghĩa thống kê được xác định khi padj < 0.05 và abs(log2FoldChange) >= 1.

27.4 Annotation và ưu tiên gene ứng viên

Kết quả DESeq2 được gắn thêm gene symbol và gene type từ annotation của đối tượng SummarizedExperiment. Các gene liên quan PTC như BRAF, RET, NTRK1, NTRK3, NRAS, HRAS, KRAS, TERT, TG và TTN được trích xuất riêng để đánh giá biểu hiện. Ngoài ra, một panel PTC mở rộng được xây dựng gồm gene driver MAPK, gene fusion/kinase, gene liên quan tiến triển/nguy cơ cao và gene biệt hóa tuyến giáp.

27.5 Trực quan hóa dữ liệu

Biểu hiện của các gene đích được biểu diễn bằng boxplot sử dụng log2(normalized count + 1). Kết quả DESeq2 toàn cục được trực quan hóa bằng volcano plot, với ngưỡng log2FoldChange = ±1 và padj = 0.05. Top 50 DEG theo padj được biểu diễn bằng heatmap sử dụng normalized counts sau biến đổi log2 và chuẩn hóa theo hàng.

27.6 Chấm điểm ưu tiên vaccine sơ bộ

Một hệ điểm ưu tiên sơ bộ được xây dựng dựa trên hai thành phần: điểm driver và điểm biểu hiện. Gene có vai trò driver hoặc fusion quan trọng trong PTC được chấm điểm driver cao hơn. Gene có biểu hiện đủ cao, tăng biểu hiện ở Tumor và có ý nghĩa thống kê được chấm điểm biểu hiện cao hơn. Tổng điểm này chỉ dùng để định hướng sàng lọc ban đầu, không thay thế các bước bắt buộc sau đó như xác nhận biến thể/fusion, đánh giá biểu hiện protein ở mô bình thường, dự đoán epitope gắn HLA, xử lý kháng nguyên, độc tính, dị ứng và bao phủ quần thể HLA.

28 Hạn chế của quy trình hiện tại

Không thay thế phân tích mutation/fusion: DESeq2 chỉ phân tích biểu hiện RNA cấp gene, không xác định trực tiếp BRAF V600E, RAS Q61 hoặc fusion RET/NTRK.
Chưa đánh giá protein expression: RNA tăng không đồng nghĩa protein tăng hoặc peptide được trình diện qua HLA.
Chưa đánh giá tính đặc hiệu mô bình thường: Cần tích hợp HPA/GTEx để tránh chọn gene biểu hiện cao ở mô thiết yếu.
Chưa có survival analysis: Các biến OS/PFS/DFS chưa được đưa vào script hiện tại.
Chưa tạo FASTA và NetMHC input: Các bước thiết kế epitope MHC-I/MHC-II cần bổ sung sau khi chọn gene/biến thể cuối cùng.
TCGA-THCA chủ yếu là PTC nhưng vẫn nên kiểm tra metadata lâm sàng: Nếu mục tiêu là PTC thuần nhất theo mô bệnh học, nên bổ sung bước tải hoặc nhập clinical subtype để loại mẫu không phù hợp.

29 Bước tiếp theo đề xuất

Để hoàn chỉnh pipeline vaccine đa epitope từ script này, nên bổ sung các module sau:

Tải hoặc nhập mutation/fusion data để xác định biến thể thật sự: BRAF p.V600E, RET fusion, NTRK fusion, RAS Q61.
Tải UniProt canonical protein sequence cho từng gene/biến thể.
Tạo peptide mutant/fusion-specific thay vì chỉ peptide wild-type.
Dự đoán MHC-I bằng NetMHCpan và MHC-II bằng NetMHCIIpan.
Lọc epitope bằng antigenicity, allergenicity, toxicity và cytokine prediction.
Tính population coverage bằng IEDB cho bộ HLA Việt Nam và HLA supertype.
Xây dựng construct vaccine đa epitope kèm adjuvant/linker và đánh giá cấu trúc 3D/in silico immune simulation.

30 Thông tin phiên làm việc

sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.7.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Ho_Chi_Minh
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scales_1.4.0                httr2_1.2.2                
##  [3] jsonlite_2.0.0              writexl_1.5.4              
##  [5] pheatmap_1.0.13             survminer_0.5.2            
##  [7] ggpubr_0.6.3                survival_3.8-6             
##  [9] data.table_1.17.8           lubridate_1.9.4            
## [11] forcats_1.0.1               stringr_1.5.2              
## [13] dplyr_1.1.4                 purrr_1.1.0                
## [15] readr_2.1.5                 tidyr_1.3.1                
## [17] tibble_3.3.0                ggplot2_4.0.0              
## [19] tidyverse_2.0.0             Biostrings_2.78.0          
## [21] XVector_0.50.0              DESeq2_1.50.2              
## [23] SummarizedExperiment_1.40.0 Biobase_2.70.0             
## [25] MatrixGenerics_1.22.0       matrixStats_1.5.0          
## [27] GenomicRanges_1.62.1        Seqinfo_1.0.0              
## [29] IRanges_2.44.0              S4Vectors_0.48.1           
## [31] BiocGenerics_0.56.0         generics_0.1.4             
## [33] TCGAbiolinks_2.38.0        
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.2.3                   gridExtra_2.3              
##  [3] readxl_1.4.5                biomaRt_2.66.2             
##  [5] rlang_1.1.6                 magrittr_2.0.4             
##  [7] compiler_4.5.1              RSQLite_2.4.3              
##  [9] systemfonts_1.3.1           png_0.1-8                  
## [11] vctrs_0.6.5                 rvest_1.0.5                
## [13] pkgconfig_2.0.3             crayon_1.5.3               
## [15] fastmap_1.2.0               backports_1.5.0            
## [17] dbplyr_2.5.1                labeling_0.4.3             
## [19] utf8_1.2.6                  rmarkdown_2.30             
## [21] tzdb_0.5.0                  ragg_1.5.0                 
## [23] bit_4.6.0                   xfun_0.53                  
## [25] cachem_1.1.0                progress_1.2.3             
## [27] blob_1.2.4                  DelayedArray_0.36.1        
## [29] BiocParallel_1.44.0         broom_1.0.10               
## [31] parallel_4.5.1              prettyunits_1.2.0          
## [33] R6_2.6.1                    bslib_0.9.0                
## [35] stringi_1.8.7               RColorBrewer_1.1-3         
## [37] car_3.1-3                   cellranger_1.1.0           
## [39] jquerylib_0.1.4             Rcpp_1.1.0                 
## [41] knitr_1.50                  downloader_0.4.1           
## [43] timechange_0.3.0            Matrix_1.7-3               
## [45] splines_4.5.1               tidyselect_1.2.1           
## [47] rstudioapi_0.17.1           abind_1.4-8                
## [49] yaml_2.3.10                 codetools_0.2-20           
## [51] curl_7.0.0                  lattice_0.22-7             
## [53] plyr_1.8.9                  withr_3.0.2                
## [55] KEGGREST_1.50.0             S7_0.2.0                   
## [57] evaluate_1.0.5              BiocFileCache_3.0.0        
## [59] xml2_1.4.0                  pillar_1.11.1              
## [61] BiocManager_1.30.27         filelock_1.0.3             
## [63] carData_3.0-5               hms_1.1.3                  
## [65] glue_1.8.0                  tools_4.5.1                
## [67] locfit_1.5-9.12             ggsignif_0.6.4             
## [69] XML_3.99-0.23               grid_4.5.1                 
## [71] AnnotationDbi_1.72.0        Formula_1.2-5              
## [73] cli_3.6.5                   rappdirs_0.3.3             
## [75] textshaping_1.0.4           S4Arrays_1.10.1            
## [77] gtable_0.3.6                rstatix_0.7.2              
## [79] TCGAbiolinksGUI.data_1.30.0 sass_0.4.10                
## [81] digest_0.6.37               ggrepel_0.9.6              
## [83] SparseArray_1.10.10         farver_2.1.2               
## [85] memoise_2.0.1               htmltools_0.5.8.1          
## [87] lifecycle_1.0.4             httr_1.4.7                 
## [89] bit64_4.6.0-1

31 Tài liệu tham khảo

National Cancer Institute Genomic Data Commons. TCGA sample type codes. https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes ↩︎
Cancer Genome Atlas Research Network. Integrated genomic characterization of papillary thyroid carcinoma. Cell. 2014;159(3):676-690. doi:10.1016/j.cell.2014.09.050 ↩︎
Silva TC, Colaprico A, Olsen C, et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis with GDC data. https://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html ↩︎
Bioconductor. SummarizedExperiment vignette. https://www.bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html ↩︎
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014. doi:10.1186/s13059-014-0550-8 ↩︎

Phân tích TCGA-THCA RNA-seq để ưu tiên lựa chọn gene cho thiết kế vaccine đa epitope trong ung thư tuyến giáp thể nhú

Vũ Anh Duy

04-05-2026