日期:2021-09-23
任务:
nomogram的绘制
C-index的计算和calibrate的绘制
ROC的绘制
DCA的绘制
设置工作路径
rm(list = ls())
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 416142 22.3 869494 46.5 638965 34.2
## Vcells 759728 5.8 8388608 64.0 1632729 12.5
setwd("D:/R/R-4.0.5/bin/project_writing/singleGene/Nomogram")
nomogram的绘制步骤:
suppressMessages(library(Hmisc))
suppressMessages(library(grid))
suppressMessages(library(lattice))
suppressMessages(library(Formula))
suppressMessages(library(ggplot2))
suppressMessages(library(rms))
suppressMessages(library(survival))
suppressMessages(library(tidyverse))
说明:
Nomogram数据格式要求:与 cox
一样可以是 factor
、numeric
factor
可以是分类变量、有序等级变量(有序等级变量数量不能小于3),但是必须是数值型因子
numeric
可以是连续/离散型数值型变量
数据内容要求:先构思一个草图,想想自己要什么
## 读取数据集
load("D:/R/R-4.0.5/bin/project_writing/singleGene/phenotype_gene_HNSC_TCGA.Rdata")
str(pdata)
## 'data.frame': 501 obs. of 28 variables:
## $ sampleID : chr "TCGA-4P-AA8J-01A" "TCGA-BA-4074-01A" "TCGA-BA-4075-01A" "TCGA-BA-4076-01A" ...
## $ AGE : Factor w/ 2 levels "<=65",">65": 2 2 1 1 1 2 2 1 1 1 ...
## $ age : num 66 69 49 39 45 83 72 56 51 54 ...
## $ gender : Factor w/ 2 levels "M","F": 1 1 1 1 2 1 1 1 1 1 ...
## $ race : Factor w/ 4 levels "american indian",..: 3 4 3 4 4 4 4 4 4 3 ...
## $ site : Factor w/ 3 levels "tongue","Oropharynx",..: 1 1 1 2 1 2 3 3 2 2 ...
## $ Stage : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 3 4 ...
## $ AJCC_T : Factor w/ 4 levels "1","2","3","4": 4 3 4 3 4 2 4 4 2 2 ...
## $ AJCC_N : Factor w/ 4 levels "0","1","2","3": 3 3 2 3 4 3 1 1 2 3 ...
## $ AJCC_M : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Grade : Factor w/ 4 levels "1","2","3","4": 2 3 2 2 2 2 1 2 2 2 ...
## $ Alcohol : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 1 2 2 2 2 ...
## $ Alcohol_amount : num NA NA 5 NA 0 NA 0 0 0 0 ...
## $ Smoke : Factor w/ 3 levels "Lifelong Non-smoker",..: NA 2 2 2 3 3 3 1 1 3 ...
## $ HPV : Factor w/ 2 levels "Negative","Positive": NA NA NA NA NA NA NA NA NA NA ...
## $ radiation_therapy : Factor w/ 2 levels "NO","YES": 1 2 1 2 1 2 2 2 NA 1 ...
## $ therapy_outcome : Factor w/ 5 levels "Complete Remission/Response",..: 1 1 5 1 4 NA 1 1 NA 1 ...
## $ lymphovascular_invasion_present: Factor w/ 2 levels "NO","YES": 2 NA NA NA NA 1 1 1 NA 1 ...
## $ OS : int 0 1 1 1 1 1 0 0 1 0 ...
## $ OS.time : int 102 462 283 415 1134 276 722 1288 1762 520 ...
## $ DSS : int 0 1 1 1 1 1 0 0 1 0 ...
## $ DSS.time : int 102 462 283 415 1134 276 722 1288 1762 520 ...
## $ DFI : int NA NA NA NA NA NA NA NA 1 NA ...
## $ DFI.time : int NA NA NA NA NA NA NA NA 1522 NA ...
## $ PFI : int 0 1 1 1 1 1 1 0 1 1 ...
## $ PFI.time : int 102 396 236 286 1134 276 517 1288 1522 248 ...
## $ expr : num 1 3.17 3.17 1 2 ...
## $ group : Factor w/ 2 levels "high","low": 2 1 1 2 2 1 1 2 2 2 ...
colnames(pdata)
## [1] "sampleID" "AGE"
## [3] "age" "gender"
## [5] "race" "site"
## [7] "Stage" "AJCC_T"
## [9] "AJCC_N" "AJCC_M"
## [11] "Grade" "Alcohol"
## [13] "Alcohol_amount" "Smoke"
## [15] "HPV" "radiation_therapy"
## [17] "therapy_outcome" "lymphovascular_invasion_present"
## [19] "OS" "OS.time"
## [21] "DSS" "DSS.time"
## [23] "DFI" "DFI.time"
## [25] "PFI" "PFI.time"
## [27] "expr" "group"
pdNMG <- pdata %>% select(1,2,4,9,10,19,20,28)
label(pdNMG$group) <- "HOXC5"
head(pdNMG) ## 显示数据集的前6行结果`
## sampleID AGE gender AJCC_N AJCC_M OS OS.time group
## TCGA-4P-AA8J-01A TCGA-4P-AA8J-01A >65 M 2 0 0 102 low
## TCGA-BA-4074-01A TCGA-BA-4074-01A >65 M 2 0 1 462 high
## TCGA-BA-4075-01A TCGA-BA-4075-01A <=65 M 1 0 1 283 high
## TCGA-BA-4076-01A TCGA-BA-4076-01A <=65 M 2 0 1 415 low
## TCGA-BA-4077-01B TCGA-BA-4077-01B <=65 F 3 0 1 1134 low
## TCGA-BA-4078-01A TCGA-BA-4078-01A >65 M 2 0 1 276 high
按照nomogram要求“打包”数据,绘制nomogram的关键步骤,??datadist查看详细说明
dd <- datadist(pdNMG)
options(datadist="dd")
可选择的模型有 lrm
, cph
, psm
。
lrm
logistic regression model:因变量为二分类变量时使用;所以 lrm
公式左边是OS,生存事件结局;
cph
Cox proportional hazards model:生存分析最常用;
psm
Parametric Survival Model: 生存分析升级版,基于加速时间失效模型(AFT),有文献报告认为它更精确。
建模代码:输入公式 因变量~自变量
,数据 data=
即可
用法: nomogram(fit, fun, funlabel)
输入:
回归模型
fun =
输入一个函数:
fun= function(x)1/(1+exp(-x))
med(lp=x)
surv(time,x)
,输入对应的时间即可。list()
funlabel=
输入名称输出:
结果是对象,用 plot(nom,xfrac = )
画出来,xfrac=num
是名称与图之间的距离
## 构建logisitc回归模型
fl <- lrm(OS~ AGE + gender + group, data = pdNMG)
## 绘制logisitc回归的风险预测值的nomogram图
noml <- nomogram(fl, fun= function(x)1/(1+exp(-x)), # or fun=plogis
lp=F, funlabel="Risk")
plot(noml)
## 提取生存数据
NMGsurv <- Surv(pdNMG$OS.time,pdNMG$OS)#即【时间+事件】
head(NMGsurv)
## [1] 102+ 462 283 415 1134 276
## 构建模型
NMGvar <- colnames(pdNMG)[c(1:4,8)]
NMGvar <- paste(NMGvar,collapse = "+")
NMGvar
## [1] "sampleID+AGE+gender+AJCC_N+group"
fp <- psm(NMGsurv ~ AGE+gender+AJCC_N+AJCC_M+ group, data = pdNMG, dist='lognormal')
## 绘制中位生存时间的Nomogram图
med <- Quantile(fp) # 计算中位生存时间
surv <- Survival(fp) # 构建生存概率函数
nomM <- nomogram(fp, fun=function(x) med(lp=x),
funlabel="Median Survival Time")
plot(nomM)
## 绘制Nomogram图
## 注意数据的time是以”天“为单位
nomP <- nomogram(fp, fun=list(function(x) surv(1095, x),
function(x) surv(1825, x)),
funlabel=c("3-year Survival Probability",
"5-year Survival Probability"))
plot(nomP, xfrac=.6)#标签和图的间距
psm()
是参数生存模型,基于AFT(加速失效时间模型)
surv()
注意时间要与生存时间的单位一致
rcorrcens(NMGsurv ~ predict(fp), data = pdNMG)
##
## Somers' Rank Correlation for Censored Data Response variable:NMGsurv
##
## C Dxy aDxy SD Z P n
## predict(fp) 0.61 0.22 0.22 0.045 4.88 0 467
结果如下:
c-index=0.609
C-index大于0.9为高可信,0.7-0.9为中可信,0.5-0.7为低可信
1、绘制校正曲线前需要在模型函数中添加参数x=T, y=T,在nomogram输出的对象里面增加一些信息,x=T
增加设计矩阵;y=T
增加残差
2、u需要与之前模型中定义好的time.inc一致,即365或730;
3、m要根据样本量来确定,由于标准曲线一般将所有样本分为3组(在图中显示3个点),而m代表每组的样本量数,因此m*3应该等于或近似等于样本量;
4、B代表最大再抽样的样本量,常规为1000,如果电脑运算比较慢,可以500,750等
## 重新调整模型函数f1,即添加x=T, y=T
f2 <- psm(NMGsurv ~ AGE+gender+AJCC_N+AJCC_M+group,
data = pdNMG,
x=T, y=T,
dist='lognormal')
## 构建校正曲线
cal1 <- calibrate(f2, cmethod='KM', method="boot", u=1095, m=155, B=1000)
## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations
## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations
cal2 <- calibrate(f2, cmethod='KM', method="boot", u=1825, m=155, B=1000)
## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations
## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations
## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations
## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations
## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations
??rms::calibrate查看详细参数说明
par(mar=c(7,5,1,1),cex = 0.75)
plot(cal1,lwd=2,lty=1,
errbar.col="#FC4E07", #线上面的竖线
xlim=c(0.25,0.70),ylim=c(0.2,0.80),
xlab="Nomogram-Predicted Probability of 3,5 Year OS",
ylab="Actual 3,5 Year OS (proportion)",
col="#FC4E07")
plot(cal2, add=T, conf.int=T,
subtitles = F, cex.subtitles=0.8,
lwd=2, lty=1, errbar.col="#00AFBB", col="#00AFBB")
#加上图例
legend("bottomright", legend=c("3 years", "5 years"),
col=c("#FC4E07","#00AFBB"), lwd=2)
#调整对角线
abline(0,1,lty=3,lwd=1,col="grey")
plot
里面加上第二条线只需要add=T
;
图例legend(site,legend =,col=,lwd=)
:四个参数:位置,内容,颜色,宽度
位置:可以用坐标c(,),可以选择 “bottomright”, “bottom”, “bottomleft”, “left”, “topleft”, “top”, “topright”, “right” and “center”;
legend = “expression”;
errbar
绘制垂直误差条,详见??errbar
par()
绘图参数:mar()
图形边界,默认为mar(5,4,4,2)