Nomogram 的绘制及模型评价

日期：2021-09-23

任务：

nomogram的绘制
C-index的计算和calibrate的绘制
ROC的绘制

DCA的绘制

设置工作路径

rm(list = ls()) 
gc()

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 416142 22.3     869494 46.5   638965 34.2
## Vcells 759728  5.8    8388608 64.0  1632729 12.5

setwd("D:/R/R-4.0.5/bin/project_writing/singleGene/Nomogram")

1 nomogram图

nomogram的绘制步骤：

读取rms程序包及辅助程序包
数据准备
拟合模型
绘图

第一步加载R包

suppressMessages(library(Hmisc))
suppressMessages(library(grid))
suppressMessages(library(lattice))
suppressMessages(library(Formula))
suppressMessages(library(ggplot2))
suppressMessages(library(rms))
suppressMessages(library(survival))
suppressMessages(library(tidyverse))

第二步读取数据

说明：

Nomogram数据格式要求：与 cox 一样可以是 factor 、numeric
- factor 可以是分类变量、有序等级变量（有序等级变量数量不能小于3），但是必须是数值型因子
- numeric 可以是连续/离散型数值型变量
数据内容要求：先构思一个草图，想想自己要什么

## 读取数据集 
load("D:/R/R-4.0.5/bin/project_writing/singleGene/phenotype_gene_HNSC_TCGA.Rdata") 
str(pdata)

## 'data.frame':    501 obs. of  28 variables:
##  $ sampleID                       : chr  "TCGA-4P-AA8J-01A" "TCGA-BA-4074-01A" "TCGA-BA-4075-01A" "TCGA-BA-4076-01A" ...
##  $ AGE                            : Factor w/ 2 levels "<=65",">65": 2 2 1 1 1 2 2 1 1 1 ...
##  $ age                            : num  66 69 49 39 45 83 72 56 51 54 ...
##  $ gender                         : Factor w/ 2 levels "M","F": 1 1 1 1 2 1 1 1 1 1 ...
##  $ race                           : Factor w/ 4 levels "american indian",..: 3 4 3 4 4 4 4 4 4 3 ...
##  $ site                           : Factor w/ 3 levels "tongue","Oropharynx",..: 1 1 1 2 1 2 3 3 2 2 ...
##  $ Stage                          : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 3 4 ...
##  $ AJCC_T                         : Factor w/ 4 levels "1","2","3","4": 4 3 4 3 4 2 4 4 2 2 ...
##  $ AJCC_N                         : Factor w/ 4 levels "0","1","2","3": 3 3 2 3 4 3 1 1 2 3 ...
##  $ AJCC_M                         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Grade                          : Factor w/ 4 levels "1","2","3","4": 2 3 2 2 2 2 1 2 2 2 ...
##  $ Alcohol                        : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ Alcohol_amount                 : num  NA NA 5 NA 0 NA 0 0 0 0 ...
##  $ Smoke                          : Factor w/ 3 levels "Lifelong Non-smoker",..: NA 2 2 2 3 3 3 1 1 3 ...
##  $ HPV                            : Factor w/ 2 levels "Negative","Positive": NA NA NA NA NA NA NA NA NA NA ...
##  $ radiation_therapy              : Factor w/ 2 levels "NO","YES": 1 2 1 2 1 2 2 2 NA 1 ...
##  $ therapy_outcome                : Factor w/ 5 levels "Complete Remission/Response",..: 1 1 5 1 4 NA 1 1 NA 1 ...
##  $ lymphovascular_invasion_present: Factor w/ 2 levels "NO","YES": 2 NA NA NA NA 1 1 1 NA 1 ...
##  $ OS                             : int  0 1 1 1 1 1 0 0 1 0 ...
##  $ OS.time                        : int  102 462 283 415 1134 276 722 1288 1762 520 ...
##  $ DSS                            : int  0 1 1 1 1 1 0 0 1 0 ...
##  $ DSS.time                       : int  102 462 283 415 1134 276 722 1288 1762 520 ...
##  $ DFI                            : int  NA NA NA NA NA NA NA NA 1 NA ...
##  $ DFI.time                       : int  NA NA NA NA NA NA NA NA 1522 NA ...
##  $ PFI                            : int  0 1 1 1 1 1 1 0 1 1 ...
##  $ PFI.time                       : int  102 396 236 286 1134 276 517 1288 1522 248 ...
##  $ expr                           : num  1 3.17 3.17 1 2 ...
##  $ group                          : Factor w/ 2 levels "high","low": 2 1 1 2 2 1 1 2 2 2 ...

colnames(pdata)

##  [1] "sampleID"                        "AGE"                            
##  [3] "age"                             "gender"                         
##  [5] "race"                            "site"                           
##  [7] "Stage"                           "AJCC_T"                         
##  [9] "AJCC_N"                          "AJCC_M"                         
## [11] "Grade"                           "Alcohol"                        
## [13] "Alcohol_amount"                  "Smoke"                          
## [15] "HPV"                             "radiation_therapy"              
## [17] "therapy_outcome"                 "lymphovascular_invasion_present"
## [19] "OS"                              "OS.time"                        
## [21] "DSS"                             "DSS.time"                       
## [23] "DFI"                             "DFI.time"                       
## [25] "PFI"                             "PFI.time"                       
## [27] "expr"                            "group"

pdNMG <- pdata %>% select(1,2,4,9,10,19,20,28) 
label(pdNMG$group) <- "HOXC5" 
head(pdNMG) ## 显示数据集的前6行结果`

##                          sampleID  AGE gender AJCC_N AJCC_M OS OS.time group
## TCGA-4P-AA8J-01A TCGA-4P-AA8J-01A  >65      M      2      0  0     102   low
## TCGA-BA-4074-01A TCGA-BA-4074-01A  >65      M      2      0  1     462  high
## TCGA-BA-4075-01A TCGA-BA-4075-01A <=65      M      1      0  1     283  high
## TCGA-BA-4076-01A TCGA-BA-4076-01A <=65      M      2      0  1     415   low
## TCGA-BA-4077-01B TCGA-BA-4077-01B <=65      F      3      0  1    1134   low
## TCGA-BA-4078-01A TCGA-BA-4078-01A  >65      M      2      0  1     276  high

第三步打包数据

按照nomogram要求“打包”数据，绘制nomogram的关键步骤,??datadist查看详细说明

dd  <- datadist(pdNMG)
options(datadist="dd")

第四步构建模型及绘图

建模

可选择的模型有 lrm ， cph ， psm 。
- lrm logistic regression model：因变量为二分类变量时使用；所以 lrm 公式左边是OS，生存事件结局；
- cph Cox proportional hazards model：生存分析最常用；
- psm Parametric Survival Model: 生存分析升级版，基于加速时间失效模型（AFT），有文献报告认为它更精确。
建模代码：输入公式 因变量~自变量 ，数据 data= 即可

nomogram函数

用法: nomogram(fit, fun, funlabel)

输入：

回归模型
fun = 输入一个函数:

logistic是 fun= function(x)1/(1+exp(-x))
中位生存时间是固有函数 med(lp=x)
psm的离线表是固有函数surv(time,x) ，输入对应的时间即可。
多个则用list()

funlabel= 输入名称

输出：

结果是对象，用 plot(nom,xfrac = ) 画出来，xfrac=num是名称与图之间的距离

1. 基于logistic回归的列线表

## 构建logisitc回归模型
fl <- lrm(OS~ AGE + gender + group, data = pdNMG) 

## 绘制logisitc回归的风险预测值的nomogram图
noml <- nomogram(fl, fun= function(x)1/(1+exp(-x)), # or fun=plogis
                lp=F, funlabel="Risk")
plot(noml)

2.基于参数生存分析的列线表

## 提取生存数据
NMGsurv <- Surv(pdNMG$OS.time,pdNMG$OS)#即【时间+事件】
head(NMGsurv)

## [1]  102+  462   283   415  1134   276

## 构建模型
NMGvar <- colnames(pdNMG)[c(1:4,8)]
NMGvar <- paste(NMGvar,collapse = "+")
NMGvar

## [1] "sampleID+AGE+gender+AJCC_N+group"

fp <- psm(NMGsurv ~ AGE+gender+AJCC_N+AJCC_M+ group, data =  pdNMG, dist='lognormal') 

## 绘制中位生存时间的Nomogram图
med <- Quantile(fp) # 计算中位生存时间
surv <- Survival(fp) # 构建生存概率函数
nomM <- nomogram(fp, fun=function(x) med(lp=x),
                funlabel="Median Survival Time")
plot(nomM)

## 绘制Nomogram图
## 注意数据的time是以”天“为单位
nomP <- nomogram(fp, fun=list(function(x) surv(1095, x),
                             function(x) surv(1825, x)),
                funlabel=c("3-year Survival Probability",
                           "5-year Survival Probability"))
plot(nomP, xfrac=.6)#标签和图的间距

说明：

psm() 是参数生存模型，基于AFT（加速失效时间模型）
surv()注意时间要与生存时间的单位一致

评价Nomogram的预测效果

第一步计算c-index

rcorrcens(NMGsurv ~ predict(fp), data = pdNMG)

## 
## Somers' Rank Correlation for Censored Data    Response variable:NMGsurv
## 
##                C  Dxy aDxy    SD    Z P   n
## predict(fp) 0.61 0.22 0.22 0.045 4.88 0 467

结果如下:

c-index=0.609
C-index大于0.9为高可信，0.7-0.9为中可信，0.5-0.7为低可信

第二步绘制校正曲线

参数说明：

1、绘制校正曲线前需要在模型函数中添加参数x=T, y=T，在nomogram输出的对象里面增加一些信息，x=T 增加设计矩阵；y=T 增加残差

2、u需要与之前模型中定义好的time.inc一致，即365或730；

3、m要根据样本量来确定，由于标准曲线一般将所有样本分为3组（在图中显示3个点），而m代表每组的样本量数，因此m*3应该等于或近似等于样本量；

4、B代表最大再抽样的样本量,常规为1000，如果电脑运算比较慢，可以500，750等

建模

## 重新调整模型函数f1，即添加x=T, y=T
f2 <- psm(NMGsurv ~ AGE+gender+AJCC_N+AJCC_M+group, 
          data = pdNMG, 
          x=T, y=T, 
          dist='lognormal') 
## 构建校正曲线 
cal1 <- calibrate(f2, cmethod='KM', method="boot", u=1095, m=155, B=1000)

## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations

## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations

cal2 <- calibrate(f2, cmethod='KM', method="boot", u=1825, m=155, B=1000)

## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations

## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations

## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations

## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations

## Warning in groupkm(psurv, y, u = u, cuts = orig.cuts): one interval had < 2
## observations

绘图

??rms::calibrate查看详细参数说明

par(mar=c(7,5,1,1),cex = 0.75) 
  plot(cal1,lwd=2,lty=1, 
       errbar.col="#FC4E07", #线上面的竖线
       xlim=c(0.25,0.70),ylim=c(0.2,0.80), 
       xlab="Nomogram-Predicted Probability of 3,5 Year OS", 
       ylab="Actual 3,5 Year OS (proportion)", 
       col="#FC4E07") 
  plot(cal2, add=T, conf.int=T, 
       subtitles = F, cex.subtitles=0.8, 
       lwd=2, lty=1, errbar.col="#00AFBB", col="#00AFBB") 
  #加上图例 
  legend("bottomright", legend=c("3 years", "5 years"), 
         col=c("#FC4E07","#00AFBB"), lwd=2) 
  #调整对角线 
  abline(0,1,lty=3,lwd=1,col="grey")

说明：

plot里面加上第二条线只需要add=T;
图例legend(site,legend =,col=,lwd=):四个参数：位置，内容，颜色，宽度
- 位置：可以用坐标c(,),可以选择 “bottomright”, “bottom”, “bottomleft”, “left”, “topleft”, “top”, “topright”, “right” and “center”；
- legend = “expression”;
errbar绘制垂直误差条，详见??errbar
par()绘图参数:mar()图形边界，默认为mar(5,4,4,2)

参考文献：

代表性参考文献1：http://jco.ascopubs.org/content/26/8/1364.long

代表性参考文献2：http://jco.ascopubs.org/content/31/9/1188.long