骨髓瘤数据分析

许锡云

2021/09/23

一、引言

1.1 研究背景

作为一种恶性浆细胞病，多发性骨髓瘤发生于骨髓的多灶性浆细胞恶性肿瘤，在中国是血液系统中排名靠前的常见恶性肿瘤，主要的发病群体为老年，到目前为止仍然无法治愈。对于此类疾病，在现有传统诊疗方法中容易出现漏诊、误诊，因此，临床研究迫切探索灵敏度高，特异性好的多发性骨髓瘤预测方法。同时，随着数据信息化的不断发展，人工智能和机器学习进展迅速，并广泛应用于各领域，医学领域也包含其中。面对大量的医疗数据信息，借助机器学习、人工智能相关分析手段成为具有临床指导意义的资源。本报告就利用机器学习的相关知识对多发性骨髓瘤的预测进行探索研究，致力于寻找建立一个准确率高，可靠性强的预测模型，以辅助医学诊断。）

1.2 研究意义

多发性骨髓瘤(Multiple myeloma,MM)是一种主要发生于中老年人群的恶性浆细胞肿瘤,在欧美国家血液肿瘤中的发病率仅次于恶性淋巴瘤.随着我国进入老龄化社会,近年来MM发病率已呈明显上升趋势.自体造血干细胞移植(ASCT)的应用以及新的治疗药物的出现,使得MM的治疗模式几经演进,患者疗效不断得到提高.MM异质性强,患者容易出现复耐药,治疗选择受制于老年人的身体状况,如何为患者选择合适的治疗方案仍然是一个难题.因此,无论是在基础研究还是在临床治疗上,MM都已经成为血液肿瘤研究的热点。

1.3 研究动机

由于目前对多发性骨髓瘤的临床诊断还停留在医学经验判断和生理检测上，对有效提前预测多发性骨髓瘤的方法探究也成为了一个热门课题。立足于这种现状，本报告通过结合基本的机器学习模型，探究一个预测该类疾病的预测模型，便于临床研究对该疾病的提前预测。

1.4 研究目标与设计

通过对病例数据分析，结合多个机器学习模型，建立和优化选择出一个良好的预测模型，以辅助该疾病的医学诊断。

二、数据分析

2.1 材料和方法

选取在浙江省人民医院治疗的多发性骨髓瘤患者179例（观察组），纳入标准：①初治患者；②诊断标准符合中国多发性骨髓诊治指南(2017年修订)[6]；③得到患者家属及患者许可。排除标准：①已接受免疫抑制、输血等治疗；②合并有其他疾病，如糖尿病、恶性肿瘤、心血管系统疾病、急慢性感染等。同时建立对照组，选取健康体检者352例，对照组与观察组的一般资料比较差异无统计学意义(P>0.05),见表1。

组别	例数	男/女	年龄
观察组	179	107/72	65.66±11.55
对照组	352	203/149	64.79±12.69
Χ2/t		0.217	0.769
p值		0.642	0.442

2.2 检测指标

根据临床诊断经验，统计相关检测指标包括：白蛋白、球蛋白、白蛋白与球蛋白比值、血清钙离子、血肌酐、乳酸脱氢酶、血红蛋白、血小板计数、血小板分布宽度等。

2.2 机器学习方法

为建立一个良好的多发性骨髓瘤预测模型，基于资料数据，利用python实现决策树（Tree）、随机森林(RFC)、逻辑回归(LR)、支持向量机(SVM)、k近邻(KNN)和XGBoost等多种机器学习方法。并进一步比较模型间的准确率、精确率、召回率、特异性、F1分数与AUC值等评价指标，选择最优预测模型

2.3 模型建立于比较

2.3.1数据读入

import pandas as pd
import numpy as np
data=pd.read_excel('C:/Users/xuxiyun/Desktop/zherenyi/省人医数据3-骨髓瘤/gusuiliuhailuo1.xlsx')
introduction=data.iloc[0,:]
content_data=data.iloc[1:,:]

#for i in data.columns:
    #print(type(data[i].values[0]))##查看各列数据类型
    
content_data['group']=[str(i) for i in content_data['group'].values]

## <string>:1: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## 
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

content_data['sex']=[str(i) for i in content_data['sex'].values]
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(content_data.iloc[:,2:-3],
                                                    content_data.iloc[:,0], 
                                                    test_size = 0.3, 
                                                    random_state = 42)
                                            
Xtest

##     age sex      alb  albrglo      glo       ca     crea      ldh
## 518  44   1  46.4006  1.67393  28.1485  2.37956     82.9      200
## 7    47   0     37.8     0.79       48     2.11     67.8      176
## 323  50   1     44.5     1.72     25.8  2.33029  82.8661  159.891
## 417  79   0     37.1     1.04     35.6     2.37     98.9      173
## 498  55   0     43.6     1.71     25.5     2.26     62.9      120
## ..   ..  ..      ...      ...      ...      ...      ...      ...
## 442  56   1  46.6933  1.66586  28.3706     2.46     69.6  165.588
## 58   58   0     43.3     1.59       26     2.06       43      271
## 487  47   0       42     1.94     21.6     2.23       66      153
## 25   67   0     35.2     1.28     27.6     2.15     65.7      331
## 18   46   1     39.3      0.8     26.2     3.06    98.89      165
## 
## [160 rows x 8 columns]

2.3.2 模型建立

决策树(Decision Tree)是在已知各种情况发生概率的基础上，通过构成决策树来求取净现值的期望值大于等于零的概率，评价项目风险，判断其可行性的决策分析方法，是直观运用概率分析的一种图解法。由于这种决策分支画成图形很像一棵树的枝干，故称决策树。在机器学习中，决策树是一个预测模型，他代表的是对象属性与对象值之间的一种映射关系。Entropy = 系统的凌乱程度，使用算法ID3, C4.5和C5.0生成树算法使用熵。这一度量是基于信息学理论中熵的概念。

决策树是一种树形结构，其中每个内部节点表示一个属性上的测试，每个分支代表一个测试输出，每个叶节点代表一种类别。

分类树(决策树)是一种十分常用的分类方法。它是一种监督学习，所谓监督学习就是给定一堆样本，每个样本都有一组属性和一个类别，这些类别是事先确定的，那么通过学习得到一个分类器，这个分类器能够对新出现的对象给出正确的分类。这样的机器学习就被称之为监督学习。

# 决策树
from sklearn import tree # 分类
clf = tree.DecisionTreeClassifier(criterion="gini",random_state=42) 
clf.fit(Xtrain,ytrain)
# test_data为预测数据,test_label为验证集的标签

## DecisionTreeClassifier(random_state=42)

predictions=clf.predict(Xtest)
from sklearn.metrics import confusion_matrix
cf_matrix=confusion_matrix(ytest, predictions)
cf_matrix

## array([[95,  6],
##        [15, 44]], dtype=int64)

accuracy=(cf_matrix[0,0]+cf_matrix[1,1])/np.sum(cf_matrix)#准确率
accuracy

## 0.86875

P1 = cf_matrix[0,0]  / (cf_matrix[0,0]+cf_matrix[0,1])#精确率
R1 = cf_matrix[0,0] / (cf_matrix[0,0]+cf_matrix[1,0])#召回率/灵敏度
S1= cf_matrix[0,1] / (cf_matrix[0,1]+cf_matrix[1,1])#特异性
ER1=(cf_matrix[0,1]+cf_matrix[1,0])/np.sum(cf_matrix)#容错率
F11= 2*P1*R1/(P1+R1)#F1分数
print(P1,R1,S1,ER1,F11)

## 0.9405940594059405 0.8636363636363636 0.12 0.13125 0.9004739336492891

随机森林是一种集成学习组合分类算法，属于bagging算法。集成学习的核心思想：将若干个分类器组合起来，得到一个分类性能显著优越的分类器。随机森林强调两个方面：随机+森林。随机是指抽样方法随机，森林是指由很多决策树构成。
随机森林的优点

a.可以处理高纬度数据，并且数据可以不做特征选择，数据集可以不做规范化处理。

b.行、列随机抽样的方法引入，降低了过拟合以及增强了抗噪声能力。

c.训练速度比较快，可以做并行化计算。

d.擅长处理特征缺失数据和不平衡数据。

e.在创建随机森林的时候，对泛化误差使用无偏差模型，泛化能力强。
随机森林的缺点

a.可能出现相似的决策树和分类能力不强的树。

b.对小数据集或者维度低的数据集，不能产生很好的分类。

c.模型可解释性不好，要是树深度很大的话。

d.解决回归问题没有分类问题好，它不能给出一个连续输出，同时也不能预测超过数据集范围的预测。

#随机森林
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier(oob_score=True, random_state=173)
rfc.fit(Xtrain,ytrain)
# test_data为预测数据,test_label为验证集的标签

## RandomForestClassifier(oob_score=True, random_state=173)

predictions1=rfc.predict(Xtest)
# 混淆矩阵
cf_matrix=confusion_matrix(ytest, predictions1)
cf_matrix

## array([[99,  2],
##        [11, 48]], dtype=int64)

accuracy=(cf_matrix[0,0]+cf_matrix[1,1])/np.sum(cf_matrix)
accuracy

## 0.91875

P2 = cf_matrix[0,0]  / (cf_matrix[0,0]+cf_matrix[0,1])#精确率
R2= cf_matrix[0,0] / (cf_matrix[0,0]+cf_matrix[1,0])#召回率/灵敏度
S2= cf_matrix[0,1] / (cf_matrix[0,1]+cf_matrix[1,1])#特异性
ER2=(cf_matrix[0,1]+cf_matrix[1,0])/np.sum(cf_matrix)#容错率
F12= 2*P2*R2/(P2+R2)#F1分数
print(P2,R2,S2,ER2,F12)

## 0.9801980198019802 0.9 0.04 0.08125 0.9383886255924171

逻辑回归用于分类问题。在分类问题中，我们尝试预测目前观测目标属于哪一类，它会产生一个离散的二元结果y∈{0,1}。而线性回归模型产生的预测值为是实数值，于是我们引入一个新的模型，使输出变量z的值到始终在0和1之间。

# 逻辑回归
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression() 
lr.fit(Xtrain,ytrain)
# test_data为预测数据,test_label为验证集的标签

## LogisticRegression()
## 
## C:\Users\xuxiyun\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
## STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
## 
## Increase the number of iterations (max_iter) or scale the data as shown in:
##     https://scikit-learn.org/stable/modules/preprocessing.html
## Please also refer to the documentation for alternative solver options:
##     https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
##   n_iter_i = _check_optimize_result(

predictions2=lr.predict(Xtest)
from sklearn.metrics import confusion_matrix
cf_matrix=confusion_matrix(ytest, predictions2)
cf_matrix

## array([[99,  2],
##        [19, 40]], dtype=int64)

KNN（k-NearestNeighbor ）算法是数据挖掘算法中最简单的算法之一，K代表最邻近的邻居，一个样本的列表可以根据最邻近的K个邻居的分类决定。KNN的核心思想就是一个样本在特征向量空间中k个最相邻的样本中大多数属于哪个类别，那我们就说这个样本属于哪个类别。

#knn
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier()
knn.fit(Xtrain,ytrain)

## KNeighborsClassifier()

predictions3=knn.predict(Xtest)
from sklearn.metrics import confusion_matrix
cf_matrix=confusion_matrix(ytest,  predictions3)
cf_matrix

## array([[99,  2],
##        [24, 35]], dtype=int64)

accuracy=(cf_matrix[0,0]+cf_matrix[1,1])/np.sum(cf_matrix)
accuracy

## 0.8375

P5= cf_matrix[0,0]  / (cf_matrix[0,0]+cf_matrix[0,1])#精确率
R5= cf_matrix[0,0] / (cf_matrix[0,0]+cf_matrix[1,0])#召回率/灵敏度
S5= cf_matrix[0,1] / (cf_matrix[0,1]+cf_matrix[1,1])#特异性
ER5=(cf_matrix[0,1]+cf_matrix[1,0])/np.sum(cf_matrix)#容错率
F15= 2*P5*R5/(P5+R5)#F1分数
print(P5,R5,S5,ER5,F15)

## 0.9801980198019802 0.8048780487804879 0.05405405405405406 0.1625 0.8839285714285714

支持向量机(Support Vector Machine，SVM)是CorinnaCortes和Vapnik等于1995年首先提出的，它在解决小样本、非线性及高维模式识别中表现出许多特有的优势，并能够推广应用到函数拟合等其他机器学习问题中。

在机器学习中，支持向量机(SVM，还支持矢量网络)是与相关的学习算法有关的监督学习模型，可以分析数据，识别模式，用于分类和回归分析

# svm
##只适用于i数值型数据，所以把最后年龄列删掉
from sklearn.linear_model import SGDClassifier 
svm = SGDClassifier() 
svm.fit(Xtrain.iloc[:,2:],ytrain)
# test_data为预测数据,test_label为验证集的标签

## SGDClassifier()

predictions4=svm.predict(Xtest.iloc[:,2:])
cf_matrix=confusion_matrix(ytest, predictions4)
print(cf_matrix)

## [[101   0]
##  [ 38  21]]

accuracy=(cf_matrix[0,0]+cf_matrix[1,1])/np.sum(cf_matrix)
print(accuracy)

## 0.7625

P6= cf_matrix[0,0]  / (cf_matrix[0,0]+cf_matrix[0,1])#精确率
R6= cf_matrix[0,0] / (cf_matrix[0,0]+cf_matrix[1,0])#召回率/灵敏度
S6= cf_matrix[0,1] / (cf_matrix[0,1]+cf_matrix[1,1])#特异性
ER6=(cf_matrix[0,1]+cf_matrix[1,0])/np.sum(cf_matrix)#容错率
F16= 2*P6*R6/(P6+R6)#F1分数
print(P6,R6,S6,ER6,F16)

## 1.0 0.7266187050359713 0.0 0.2375 0.8416666666666668

# XGBoost
import xgboost as xgb
xgbc = xgb.XGBClassifier()

#将提示的包含错误数据类型这一列进行转换

Xtrain['sex']=[int(i) for i in Xtrain['sex'].values]
Xtest['sex']=[int(i) for i in Xtest['sex'].values]
ytrain=[int(i) for i in ytrain]
ytest=[int(i) for i in ytest]
xgbc.fit(Xtrain.values,ytrain)
#for i in Xtrain.columns:
    #print(type(Xtrain[i].values[0]))##查看各列数据类型

#for i in Xtest.columns:
    #print(type(Xtest[i].values[0]))##查看各列数据类型

## [20:12:17] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
## XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
##               colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
##               importance_type='gain', interaction_constraints='',
##               learning_rate=0.300000012, max_delta_step=0, max_depth=6,
##               min_child_weight=1, missing=nan, monotone_constraints='()',
##               n_estimators=100, n_jobs=4, num_parallel_tree=1, random_state=0,
##               reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
##               tree_method='exact', validate_parameters=1, verbosity=None)
## 
## C:\Users\xuxiyun\AppData\Local\Programs\Python\Python38\lib\site-packages\xgboost\sklearn.py:1146: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
##   warnings.warn(label_encoder_deprecation_msg, UserWarning)

accuracy=(cf_matrix[0,0]+cf_matrix[1,1])/np.sum(cf_matrix)
print(accuracy)

## 0.7625

P = cf_matrix[0,0]  / (cf_matrix[0,0]+cf_matrix[0,1])#精确率
R = cf_matrix[0,0] / (cf_matrix[0,0]+cf_matrix[1,0])#召回率/灵敏度
S = cf_matrix[0,1] / (cf_matrix[0,1]+cf_matrix[1,1])#特异性
ER=(cf_matrix[0,1]+cf_matrix[1,0])/np.sum(cf_matrix)#容错率
F1 = 2*P*R/(P+R)#F1分数
print(P,R,S,ER,F1)

## 1.0 0.7266187050359713 0.0 0.2375 0.8416666666666668

accuracy=(cf_matrix[0,0]+cf_matrix[1,1])/np.sum(cf_matrix)
accuracy

## 0.7625

P4= cf_matrix[0,0]  / (cf_matrix[0,0]+cf_matrix[0,1])#精确率
R4= cf_matrix[0,0] / (cf_matrix[0,0]+cf_matrix[1,0])#召回率/灵敏度
S4= cf_matrix[0,1] / (cf_matrix[0,1]+cf_matrix[1,1])#特异性
ER4=(cf_matrix[0,1]+cf_matrix[1,0])/np.sum(cf_matrix)#容错率
F14= 2*P4*R4/(P4+R4)#F1分数
print(P4,R4,S4,ER4,F14)

## 1.0 0.7266187050359713 0.0 0.2375 0.8416666666666668

from xgboost import plot_importance
from matplotlib import pyplot
# plot feature importance
plot_importance(xgbc,importance_type='gain')

## <AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>

pyplot.show()

rfc.feature_importances_

## array([0.04879548, 0.00470237, 0.31229986, 0.19670194, 0.17539739,
##        0.09177802, 0.10283131, 0.06749362])

data.iloc[1:,:]

##     group     No age sex      alb  ...     crea     ldh   hb  plt      pdw
## 1       1    2.0  55   1     44.6  ...     76.3     180  118  150      9.5
## 2       1    5.0  72   0     36.8  ...    123.2     168   93   76     21.3
## 3       1    6.0  73   1     40.3  ...     71.8     139  121   95       75
## 4       1    8.0  71   1       32  ...       68     133   91  109     13.8
## 5       1    9.0  82   1     35.9  ...     75.4     221  109  310     11.3
## ..    ...    ...  ..  ..      ...  ...      ...     ...  ...  ...      ...
## 527     0  348.0  74   0       44  ...     71.2     162  145  210     43.8
## 528     0  349.0  74   0       40  ...     75.4     207  120  208       16
## 529     0  350.0  85   0     41.7  ...    117.1     247  102   43  44.3107
## 530     0  351.0  55   0     44.3  ...     66.3     185  143  175       15
## 531     0  352.0  59   0  43.5343  ...  79.1363  171.97  145  242     12.1
## 
## [531 rows x 13 columns]

from scipy.stats import ttest_ind,norm,f
import numpy as np
def ftest(s1,s2):
    '''F检验样本总体方差是否相等'''
    print("Null Hypothesis:var(s1)=var(s2)，α=0.05")
    F = np.var(s1)/np.var(s2)
    v1 = len(s1) - 1
    v2 = len(s2) - 1
    p_val = 1 - 2*abs(0.5-f.cdf(F,v1,v2))
    print(p_val)
    if p_val < 0.05:
        print("Reject the Null Hypothesis.")
        equal_var=False
    else:
        print("Accept the Null Hypothesis.")
        equal_var=True
    return equal_var

def ttest_ind_fun(s1,s2):
    '''t检验独立样本所代表的两个总体均值是否存在差异'''
    equal_var = ftest(s1,s2)
    print("Null Hypothesis:mean(s1)=mean(s2)，α=0.05")
    ttest,pval = ttest_ind(s1,s2,equal_var=equal_var)
    if pval < 0.05:
        print("Reject the Null Hypothesis.")
    else:
        print("Accept the Null Hypothesis.")
    return pval

np.random.seed(42)
s1 = data.iloc[1:180,2]
s2 = data.iloc[180:532,2]
s3 = data.iloc[1:180,4]
s4 = data.iloc[180:532,4]

ttest_ind_fun(s1,s2)
# ttest_ind_fun(s3,s4)

## Null Hypothesis:var(s1)=var(s2)，α=0.05
## 0.14995983040518335
## Accept the Null Hypothesis.
## Null Hypothesis:mean(s1)=mean(s2)，α=0.05
## Accept the Null Hypothesis.
## 0.4423248614137002

ttest_ind_fun(s3,s4)

## Null Hypothesis:var(s1)=var(s2)，α=0.05
## 2.220446049250313e-16
## Reject the Null Hypothesis.
## Null Hypothesis:mean(s1)=mean(s2)，α=0.05
## Reject the Null Hypothesis.
## 4.202701232334866e-44

2.4 结论

2.4.1观察组和对照组的指标对比分析

观察组血肌酐、球蛋白、乳酸脱氢酶明显高于对照组，其中血肌酐的含量差距最大；此外白蛋白、血红蛋白、血小板计数以及血小板分布宽度低于观察组；而白蛋白和球蛋白比值、血清钙离子无明显差异。

组别	观察组	对照组
白蛋白	33.92±6.83	43.45±3.43
白蛋白/球蛋白	1.03±0.59	1.52±0.24
球蛋白	44.26±22.95	29.08±3.49
血清钙离子	2.30±0.34	2.32±0.10
血肌酐	170.85±238.51	82.33±14.84
乳酸脱氢酶	192.77±73.60	192.77±73.60
血红蛋白	94.57±25.90	141.49±17.09
血小板计数	168.74±82.53	196.24±57.54
血小板分布密度	15.26±11.17	25.05±18.18

结合上述数据和指标对比分析结论，利用机器学习知识将数据以7：3划分为训练集和测试集，构建多个多发性骨髓瘤的预测模型，并得到模型评价指标如下表所示，根据模型中指标结果不难发现相对其他模型，随机森林的准确率、精准率、召回率、F1得分以及AUC值都是最优；且特异性值最低，表明模型的灵敏性高。综合5项模型评价指标，最终选择随机森林作为预测多发性骨髓瘤的最优模型。