setwd("C:/Users/石源方/Desktop/数据搬家/华理工/班级-华/各科课程作业/高等生物信息学-注意PDF格式/24-12-12-practice5_Proteomic")
# Load the required libraries
library(pheatmap)  # draws the heatmap
library(dplyr)     # data manipulation (mutate, filter, %>%)
# Read the CSV files (expression matrix: proteins in rows, samples in columns)
proteomics_data <- read.csv("proteomics_data.csv", row.names = 1)
sample_group <- read.csv("sample_group.csv")


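# Make sure the sample order is consistent between the two tables
sample_group <- sample_group %>%
  mutate(
    Sample_ID = as.character(Sample_ID),  # avoid indexing columns by factor codes (R < 4.0 reads strings as factors)
    Group = factor(Group, levels = c("BPH", "TA1", "TA2"))  # fix the group order
  ) %>%
  filter(Sample_ID %in% colnames(proteomics_data))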

# Reorder the columns of the protein expression data to match the sample table
proteomics_data <- proteomics_data[, sample_group$Sample_ID]

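# Normalization: Z-score each protein (row) across samples.
# scale() works column-wise, so the matrix is transposed before and after.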
normalized_data <- t(scale(t(proteomics_data)))
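
# Optional sanity check (a sketch on toy data, not part of the original analysis).
# After row-wise Z-scoring, each protein should have mean 0 and SD 1.
# A protein with zero variance gives NaN (0 / 0), and such rows would make the
# row clustering in pheatmap() fail, so it can be worth counting them first.
# `toy_matrix`, `toy_z` and `toy_row_ok` are hypothetical names used only here.
toy_matrix <- matrix(
  c(1, 2, 3,
    10, 20, 30,
    5, 5, 5),  # constant row: SD is 0, so its Z-scores are NaN
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"), c("S1", "S2", "S3"))
)
toy_z <- t(scale(t(toy_matrix)))
toy_row_ok <- apply(toy_z, 1, function(z) all(is.finite(z)))
print(round(rowMeans(toy_z[toy_row_ok, , drop = FALSE]), 10))  # row means of the usable proteins
print(names(toy_row_ok)[!toy_row_ok])                          # proteins that cannot be clustered
# The same check could be applied to normalized_data before plotting, e.g.
# sum(!apply(normalized_data, 1, function(z) all(is.finite(z))))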

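# Start with an empty vector that collects each group's column names after clustering
ordered_columns <- character(0)

# Hierarchically cluster the samples within each group
for (grp in levels(sample_group$Group)) {
  # Select the samples of the current group; which() skips samples whose Group is NA
  current_samples <- sample_group$Sample_ID[which(sample_group$Group == grp)]

  # hclust() needs at least two samples; keep smaller groups in their original order
  if (length(current_samples) < 2) {
    ordered_columns <- c(ordered_columns, current_samples)
    next
  }
  current_matrix <- normalized_data[, current_samples, drop = FALSE]

  # Compute Euclidean distances between the samples of this group, then cluster them
  dist_matrix <- dist(t(current_matrix), method = "euclidean")
  cluster_result <- hclust(dist_matrix, method = "complete")

  # Append the samples in dendrogram order
  ordered_columns <- c(ordered_columns, current_samples[cluster_result$order])
}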
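
# Illustration only (hypothetical toy data, not part of the original analysis).
# hclust()$order holds integer positions into the clustered objects, so indexing
# the sample-name vector with it returns the names in dendrogram order. This is
# the step the loop above relies on to order the samples inside each group.
# `toy_samples`, `toy_values`, `toy_hc` and `toy_order` are made-up names.
toy_samples <- c("A", "B", "C")
toy_values <- matrix(c(0, 10, 1), nrow = 1, dimnames = list("P1", toy_samples))
toy_hc <- hclust(dist(t(toy_values), method = "euclidean"), method = "complete")
toy_order <- toy_samples[toy_hc$order]
print(toy_order)  # "A" and "C" differ by only 1, so they end up next to each other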

# Reorder the whole data matrix by the clustered sample order
normalized_data <- normalized_data[, ordered_columns]

# Rebuild the column annotation (group information)
annotation_col <- data.frame(Group = sample_group$Group)
rownames(annotation_col) <- sample_group$Sample_ID
annotation_col <- annotation_col[ordered_columns, , drop = FALSE]  # arrange in the new order

# Define the colors used to distinguish the groups
group_colors <- list(Group = c("BPH" = "#1F78B4", "TA1" = "#E31A1C", "TA2" = "#33A02C"))

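# Draw the heatmap
pheatmap(
  normalized_data,                   # Z-scored protein expression matrix
  annotation_col = annotation_col,   # group label of each sample
  annotation_colors = group_colors,  # colors of the group labels
  color = colorRampPalette(c("blue", "white", "red"))(100),  # heatmap color gradient
  scale = "none",                    # already Z-scored, so no further scaling
  cluster_cols = FALSE,              # keep the column order computed above
  cluster_rows = TRUE,               # cluster the proteins (rows)
  show_rownames = FALSE,             # hide protein names
  show_colnames = FALSE,             # hide sample names
  fontsize_row = 8,                  # row-name font size (only used if show_rownames = TRUE)
  main = "Hierarchical Clustering Heatmap by Group"  # plot title
)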