Nomogram 的绘制及模型评价

日期:2021-09-23

任务:

  1. nomogram的绘制

  2. C-index的计算和calibrate的绘制

  3. ROC的绘制

DCA的绘制


设置工作路径

rm(list = ls()) 
gc()
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 416142 22.3     869494 46.5   638965 34.2
## Vcells 759728  5.8    8388608 64.0  1632729 12.5
setwd("D:/R/R-4.0.5/bin/project_writing/singleGene/Nomogram")

1 nomogram图

nomogram的绘制步骤:

  1. 读取rms程序包及辅助程序包
  2. 数据准备
  3. 拟合模型
  4. 绘图

第一步 加载R包

suppressMessages(library(Hmisc))
suppressMessages(library(grid))
suppressMessages(library(lattice))
suppressMessages(library(Formula))
suppressMessages(library(ggplot2))
suppressMessages(library(rms))
suppressMessages(library(survival))
suppressMessages(library(tidyverse))

第二步 读取数据

说明:

  1. Nomogram数据格式要求:与 cox 一样可以是 factornumeric

    • factor 可以是分类变量、有序等级变量(有序等级变量数量不能小于3),但是必须是数值型因子

    • numeric 可以是连续/离散型数值型变量

  2. 数据内容要求:先构思一个草图,想想自己要什么

## 读取数据集 
load("D:/R/R-4.0.5/bin/project_writing/singleGene/phenotype_gene_HNSC_TCGA.Rdata") 
str(pdata) 
## 'data.frame':    501 obs. of  28 variables:
##  $ sampleID                       : chr  "TCGA-4P-AA8J-01A" "TCGA-BA-4074-01A" "TCGA-BA-4075-01A" "TCGA-BA-4076-01A" ...
##  $ AGE                            : Factor w/ 2 levels "<=65",">65": 2 2 1 1 1 2 2 1 1 1 ...
##  $ age                            : num  66 69 49 39 45 83 72 56 51 54 ...
##  $ gender                         : Factor w/ 2 levels "M","F": 1 1 1 1 2 1 1 1 1 1 ...
##  $ race                           : Factor w/ 4 levels "american indian",..: 3 4 3 4 4 4 4 4 4 3 ...
##  $ site                           : Factor w/ 3 levels "tongue","Oropharynx",..: 1 1 1 2 1 2 3 3 2 2 ...
##  $ Stage                          : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 3 4 ...
##  $ AJCC_T                         : Factor w/ 4 levels "1","2","3","4": 4 3 4 3 4 2 4 4 2 2 ...
##  $ AJCC_N                         : Factor w/ 4 levels "0","1","2","3": 3 3 2 3 4 3 1 1 2 3 ...
##  $ AJCC_M                         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Grade                          : Factor w/ 4 levels "1","2","3","4": 2 3 2 2 2 2 1 2 2 2 ...
##  $ Alcohol                        : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ Alcohol_amount                 : num  NA NA 5 NA 0 NA 0 0 0 0 ...
##  $ Smoke                          : Factor w/ 3 levels "Lifelong Non-smoker",..: NA 2 2 2 3 3 3 1 1 3 ...
##  $ HPV                            : Factor w/ 2 levels "Negative","Positive": NA NA NA NA NA NA NA NA NA NA ...
##  $ radiation_therapy              : Factor w/ 2 levels "NO","YES": 1 2 1 2 1 2 2 2 NA 1 ...
##  $ therapy_outcome                : Factor w/ 5 levels "Complete Remission/Response",..: 1 1 5 1 4 NA 1 1 NA 1 ...
##  $ lymphovascular_invasion_present: Factor w/ 2 levels "NO","YES": 2 NA NA NA NA 1 1 1 NA 1 ...
##  $ OS                             : int  0 1 1 1 1 1 0 0 1 0 ...
##  $ OS.time                        : int  102 462 283 415 1134 276 722 1288 1762 520 ...
##  $ DSS                            : int  0 1 1 1 1 1 0 0 1 0 ...
##  $ DSS.time                       : int  102 462 283 415 1134 276 722 1288 1762 520 ...
##  $ DFI                            : int  NA NA NA NA NA NA NA NA 1 NA ...
##  $ DFI.time                       : int  NA NA NA NA NA NA NA NA 1522 NA ...
##  $ PFI                            : int  0 1 1 1 1 1 1 0 1 1 ...
##  $ PFI.time                       : int  102 396 236 286 1134 276 517 1288 1522 248 ...
##  $ expr                           : num  1 3.17 3.17 1 2 ...
##  $ group                          : Factor w/ 2 levels "high","low": 2 1 1 2 2 1 1 2 2 2 ...
colnames(pdata) 
##  [1] "sampleID"                        "AGE"                            
##  [3] "age"                             "gender"                         
##  [5] "race"                            "site"                           
##  [7] "Stage"                           "AJCC_T"                         
##  [9] "AJCC_N"                          "AJCC_M"                         
## [11] "Grade"                           "Alcohol"                        
## [13] "Alcohol_amount"                  "Smoke"                          
## [15] "HPV"                             "radiation_therapy"              
## [17] "therapy_outcome"                 "lymphovascular_invasion_present"
## [19] "OS"                              "OS.time"                        
## [21] "DSS"                             "DSS.time"                       
## [23] "DFI"                             "DFI.time"                       
## [25] "PFI"                             "PFI.time"                       
## [27] "expr"                            "group"
pdNMG <- pdata %>% select(1,2,4,9,10,19,20,28) 
label(pdNMG$group) <- "HOXC5" 
head(pdNMG) ## 显示数据集的前6行结果`
##                          sampleID  AGE gender AJCC_N AJCC_M OS OS.time group
## TCGA-4P-AA8J-01A TCGA-4P-AA8J-01A  >65      M      2      0  0     102   low
## TCGA-BA-4074-01A TCGA-BA-4074-01A  >65      M      2      0  1     462  high
## TCGA-BA-4075-01A TCGA-BA-4075-01A <=65      M      1      0  1     283  high
## TCGA-BA-4076-01A TCGA-BA-4076-01A <=65      M      2      0  1     415   low
## TCGA-BA-4077-01B TCGA-BA-4077-01B <=65      F      3      0  1    1134   low
## TCGA-BA-4078-01A TCGA-BA-4078-01A  >65      M      2      0  1     276  high

第三步 打包数据

按照nomogram要求“打包”数据,绘制nomogram的关键步骤,??datadist查看详细说明

dd  <- datadist(pdNMG)
options(datadist="dd") 

第四步 构建模型及绘图

建模

  1. 可选择的模型有 lrmcphpsm

    • lrm logistic regression model:因变量为二分类变量时使用;所以 lrm 公式左边是OS,生存事件结局;

    • cph Cox proportional hazards model:生存分析最常用;

    • psm Parametric Survival Model: 生存分析升级版,基于加速时间失效模型(AFT),有文献报告认为它更精确。

  2. 建模代码:输入公式 因变量~自变量 ,数据 data= 即可

nomogram函数

用法: nomogram(fit, fun, funlabel)

输入:

  1. 回归模型

  2. fun = 输入一个函数:

  • logistic是 fun= function(x)1/(1+exp(-x))
  • 中位生存时间是固有函数 med(lp=x)
  • psm的离线表是固有函数surv(time,x) ,输入对应的时间即可。
  • 多个则用list()
  1. funlabel= 输入名称

输出:

结果是对象,用 plot(nom,xfrac = ) 画出来,xfrac=num是名称与图之间的距离

1. 基于logistic回归的列线表

## 构建logisitc回归模型
fl <- lrm(OS~ AGE + gender + group, data = pdNMG) 

## 绘制logisitc回归的风险预测值的nomogram图
noml <- nomogram(fl, fun= function(x)1/(1+exp(-x)), # or fun=plogis
                lp=F, funlabel="Risk")
plot(noml)

2.基于参数生存分析的列线表

## 提取生存数据
NMGsurv <- Surv(pdNMG$OS.time,pdNMG$OS)#即【时间+事件】
head(NMGsurv)
## [1]  102+  462   283   415  1134   276
## 构建模型
NMGvar <- colnames(pdNMG)[c(1:4,8)]
NMGvar <- paste(NMGvar,collapse = "+")
NMGvar
## [1] "sampleID+AGE+gender+AJCC_N+group"
fp <- psm(NMGsurv ~ AGE+gender+AJCC_N+AJCC_M+ group, data =  pdNMG, dist='lognormal') 

## 绘制中位生存时间的Nomogram图
med <- Quantile(fp) # 计算中位生存时间
surv <- Survival(fp) # 构建生存概率函数
nomM <- nomogram(fp, fun=function(x) med(lp=x),
                funlabel="Median Survival Time")
plot(nomM)

## 绘制Nomogram图
## 注意数据的time是以”天“为单位
nomP <- nomogram(fp, fun=list(function(x) surv(1095, x),
                             function(x) surv(1825, x)),
                funlabel=c("3-year Survival Probability",
                           "5-year Survival Probability"))
plot(nomP, xfrac=.6)#标签和图的间距

说明:

  • psm() 是参数生存模型,基于AFT(加速失效时间模型)

  • surv()注意时间要与生存时间的单位一致

评价Nomogram的预测效果

第一步 计算c-index

rcorrcens(NMGsurv ~ predict(fp), data = pdNMG)
## 
## Somers' Rank Correlation for Censored Data    Response variable:NMGsurv
## 
##                C  Dxy aDxy    SD    Z P   n
## predict(fp) 0.61 0.22 0.22 0.045 4.88 0 467

结果如下:

  • c-index=0.609

  • C-index大于0.9为高可信,0.7-0.9为中可信,0.5-0.7为低可信

第二步 绘制校正曲线

参数说明:

1、绘制校正曲线前需要在模型函数中添加参数x=T, y=T,在nomogram输出的对象里面增加一些信息,x=T 增加设计矩阵;y=T 增加残差

2、u需要与之前模型中定义好的time.inc一致,即365或730;

3、m要根据样本量来确定,由于标准曲线一般将所有样本分为3组(在图中显示3个点),而m代表每组的样本量数,因此m*3应该等于或近似等于样本量;

4、B代表最大再抽样的样本量,常规为1000,如果电脑运算比较慢,可以500,750等

建模

## 重新调整模型函数f1,即添加x=T, y=T
f2 <- psm(NMGsurv ~ AGE+gender+AJCC_N+AJCC_M+group, 
          data = pdNMG, 
          x=T, y=T, 
          dist='lognormal') 
## 构建校正曲线 
cal1 <- calibrate(f2, cmethod='KM', method="boot", u=1095, m=155, B=1000) 
## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations

## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations
cal2 <- calibrate(f2, cmethod='KM', method="boot", u=1825, m=155, B=1000)
## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations

## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations

## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations

## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations

## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations

绘图

??rms::calibrate查看详细参数说明

par(mar=c(7,5,1,1),cex = 0.75) 
  plot(cal1,lwd=2,lty=1, 
       errbar.col="#FC4E07", #线上面的竖线
       xlim=c(0.25,0.70),ylim=c(0.2,0.80), 
       xlab="Nomogram-Predicted Probability of 3,5 Year OS", 
       ylab="Actual 3,5 Year OS (proportion)", 
       col="#FC4E07") 
  plot(cal2, add=T, conf.int=T, 
       subtitles = F, cex.subtitles=0.8, 
       lwd=2, lty=1, errbar.col="#00AFBB", col="#00AFBB") 
  #加上图例 
  legend("bottomright", legend=c("3 years", "5 years"), 
         col=c("#FC4E07","#00AFBB"), lwd=2) 
  #调整对角线 
  abline(0,1,lty=3,lwd=1,col="grey") 

说明:

  • plot里面加上第二条线只需要add=T;

  • 图例legend(site,legend =,col=,lwd=):四个参数:位置,内容,颜色,宽度

    • 位置:可以用坐标c(,),可以选择 “bottomright”, “bottom”, “bottomleft”, “left”, “topleft”, “top”, “topright”, “right” and “center”;

    • legend = “expression”;

  • errbar绘制垂直误差条,详见??errbar

  • par()绘图参数:mar()图形边界,默认为mar(5,4,4,2)

参考文献:

代表性参考文献1:http://jco.ascopubs.org/content/26/8/1364.long

代表性参考文献2:http://jco.ascopubs.org/content/31/9/1188.long