Survival Model
UAS
| Kontak | : \(\downarrow\) |
| yosia.yosia@student.matanauniversity.ac.id | |
| yyosia | |
| RPubs | https://rpubs.com/yosia/ |
Contoh Kasus Penerapan Survival Model Menggunakan R
Introduction
Penyakit kardiovaskular adalah penyebab kematian nomor 1 di seluruh dunia, yang merenggut sekitar 17,9 juta jiwa setiap tahunnya, yang merupakan 31% dari seluruh kematian di seluruh dunia. Untungnya, sebagian besar penyakit kardiovaskular dapat dicegah dengan mengatasi faktor risiko perilaku dengan strategi di seluruh populasi. Proyek ini bertujuan untuk melakukan analisis data eksplorasi, memanfaatkan berbagai model pembelajaran mesin untuk mendeteksi fitur yang paling penting untuk memprediksi kejadian gagal jantung dan menerapkan model Cox, Analisis Survival, dan Hazard Ratio untuk memvalidasi hasilnya.
Summary:
Exploratory Data Analysis
Machine Learning Model (GBM, RF, Linear Regression) to select 3 most important features
Cox Model + Proportional Hazard Assumption
Survival Analysis
Key Concept:
GBM model
Random Forest
ROC
General Linear Model
Cox Proportional Model
Hazard Ratio
Kaplan Meier Plot
Survival Analysis
Data Summary
Pertama-tama, dataset ini tidak memiliki nilai yang hilang, jadi kita tidak perlu melakukan imputasi. Untuk membuat nilai lab lebih mudah diakses oleh dokter klinis, kami akan mengkategorikannya berdasarkan praktik standar.
Variable Categorization
Ejection Fraction: Normal 41% - 75%, Tidak Normal <41% atau >75% Fraksi ejeksi yang normal adalah sekitar 50% hingga 75%, menurut American Heart Association. Fraksi ejeksi yang tidak normal dapat berkisar antara 41% hingga 50%.
Serum Creatinine: Untuk pria dewasa, 0,74 hingga 1,35 mg/dL (65,4 hingga 119,3 mikromol/L) Untuk wanita dewasa, 0,59 hingga 1,04 mg/dL (52,2 hingga 91,9 mikromol/L)
Platelets: Jumlah trombosit yang normal berkisar antara 150.000 hingga 450.000 trombosit per mikroliter darah.
Serum Sodium: Kadar natrium darah normal adalah antara 135 dan 145 miliekivalen per liter (mEq/L)
Creatinine Phosphkinase:10 to 120 micrograms per liter (mcg/L)
Event: Pasien yang meninggal dunia. Censor: PPasien yang hidup.
Age Group: Age <65 years old, Age >=65 years old
# Assign ID
heart$id <- seq.int(nrow(heart))
# Assign Character value to Numeric variables
heart$sexc <-ifelse(heart$sex==1, "Male", "Female")
heart$smoke <-ifelse(heart$smoking==1, "Yes", "No")
heart$hbp <- ifelse(heart$high_blood_pressure==1, "Yes","No")
heart$dia <-ifelse(heart$diabetes==1, "Yes", "No")
heart$anaemiac <- ifelse(heart$anaemia==1 ,"Yes", "No")
# Platelets : Hopkins Medicine
heart$platc <- ifelse(heart$platelets>150000 & heart$platelets <450000, "Platelets Normal", "Platelets Abnormal")
heart$plat <- ifelse(heart$platelets>150000 & heart$platelets <450000, 0,1)
# Serum Sodium: Mayo Clinic
heart$sodiumc <- ifelse(heart$serum_sodium >135 & heart$serum_sodium<145, "Serum Sodium Normal", "Serum Sodium Abnormal")
heart$sodiumn <- ifelse(heart$serum_sodium >135 & heart$serum_sodium<145, 0, 1)
#Creatine Phosphkinase : Mountsinai
heart$cpk <- ifelse(heart$creatinine_phosphokinase >10 & heart$creatinine_phosphokinase<120, "CPK Normal", "CPK Abnormal")
heart$cpkn <- ifelse(heart$creatinine_phosphokinase >10 & heart$creatinine_phosphokinase<120, 0, 1)
#ejection_fraction: Mayo
heart$efraction <-ifelse(heart$ejection_fraction<=75 & heart$ejection_fraction>=41, "Ejection Normal", "Ejection Abnormal")
heart$efractionn <-ifelse(heart$ejection_fraction<=75 & heart$ejection_fraction>=41, 0, 1)
#serum_creatinine :mayo
heart$screat<- ifelse((heart$serum_creatinine<1.35 & heart$serum_creatinine>0.74 & heart$sex==1 ) | (heart$serum_creatinine<1.04 & heart$serum_creatinine>0.59 & heart$sex==0) , "Creatinine Normal", "Creatinine Abnormal" )
heart$screatn<- ifelse((heart$serum_creatinine<1.35 & heart$serum_creatinine>0.74 & heart$sex==1 ) | (heart$serum_creatinine<1.04 & heart$serum_creatinine>0.59 & heart$sex==0) , 0, 1 )
#age group: Pharma convention
heart$agegp <- ifelse( heart$age<65, "Age <65", "Age >=65")
heart$agegpn <- ifelse( heart$age<65, 0, 1)
#event vs censor
heart$cnsr <- ifelse(heart$DEATH_EVENT==0, "Censor", "Event")Original Data table
h1<- subset(heart, select=c(age,anaemia,creatinine_phosphokinase, serum_creatinine,diabetes, ejection_fraction ,high_blood_pressure, platelets , serum_sodium, sex, smoking, DEATH_EVENT))
head(h1, 5)%>% DT::datatable()h1c<- subset(heart, select=c(agegp,anaemiac,cpk, screat, dia, efraction ,hbp, platc, sodiumc, sexc, smoke, DEATH_EVENT, time))Modified Categorical Data table
head(h1c, 5)%>% DT::datatable()#Modified Categorical variable selection
m1<- subset(heart, select=c(agegpn,anaemia,cpkn, screatn, diabetes, efractionn ,high_blood_pressure, plat, sodiumn, sex, smoking, DEATH_EVENT))Training + Testing Data
set.seed=8
train.test.split<-sample(2, nrow(h1), replace=TRUE, prob=c(0.8,0.2))
train=h1[train.test.split==1,]
test=h1[train.test.split==2,]
set.seed=18
train.test.split1<-sample(2, nrow(m1), replace=TRUE, prob=c(0.7,0.3))
train1=m1[train.test.split==1,]
test1=m1[train.test.split==2,]
#head(train, 5)%>% DT::datatable()
#head(test, 5)%>% DT::datatable()Exploratory Data Analysis
Binary Variable Distribution
#1. age group
p1<-ggplot(heart, aes(x=agegp))+geom_bar(fill="lightblue")+ labs(x="Age Group")+ theme_minimal(base_size=10)
#2. Sex
p2<-ggplot(heart, aes(x=sexc))+geom_bar(fill="indianred3")+ labs(x="Sex")+ theme_minimal(base_size=10)
#3. Smoking
p3<-ggplot(heart, aes(x=smoke))+geom_bar(fill="seagreen2")+ labs(x="Smoking")+ theme_minimal(base_size=10)
#4. Diabetes
p4<-ggplot(heart, aes(x=dia))+geom_bar(fill="orange2")+
labs(x="Diabetes Status")+ theme_minimal(base_size=10)
#5. cpk
p5<-ggplot(heart, aes(x=cpk))+geom_bar(fill="lightblue")+
labs(x="Creatinine Phosphokinase")+ theme_minimal(base_size=10)
#6. Platelets
p6<-ggplot(heart, aes(x=platc))+geom_bar(fill="indianred2")+
labs(x="Platelets")+ theme_minimal(base_size=10)
#7. serum sodium
p7<-ggplot(heart, aes(x=sodiumc))+geom_bar(fill="seagreen2")+
labs(x="Serum Sodium") + theme_minimal(base_size=10)
#8. Serum creatinine
p8<-ggplot(heart, aes(x=screat))+geom_bar(fill="orange2")+
labs(x="Serum Creatinine") + theme_minimal(base_size=10)
#9. anaemia
p9<-ggplot(heart, aes(x=anaemiac, fill=DEATH_EVENT))+geom_bar(fill="lightblue")+ labs(x="Anaemia")+ theme_minimal(base_size=10)
#10. ejection_fraction
p10<-ggplot(heart, aes(x=efraction))+geom_bar(fill="indianred2")+
labs(x="Ejection Fraction")+ theme_minimal(base_size=10)
#11. High blood pressure
p11<-ggplot(heart, aes(x=hbp))+geom_bar(fill="seagreen2")+
labs(x="High Blood Pressure Status")+ theme_minimal(base_size=10)
#12. Event
p12<-ggplot(heart, aes(x=cnsr))+geom_bar(fill="orangered3")+ labs(x="Event Status")+ theme_minimal(base_size=10)Demographic and Baseline Characters Distribution
(p1+p2+p3 +p4)+
plot_annotation(title="Demographic and Histology Distribution")Lab Test Result Distribution
(p5+p6+p7+p8) + plot_annotation(title="Lab Test Distribution")Disease history Distribution
(p9+p10+p11+p12) + plot_annotation(title="Disease History Distribution")Continuous Variables Disbribution
Age
#1. Age
c1<- ggplot(heart, aes(x=age))+ geom_histogram(binwidth=5, colour="white", fill="darkseagreen2", alpha=0.8)+
geom_density(eval(bquote(aes(y=..count..*5))),colour="darkgreen", fill="darkgreen", alpha=0.3)+ scale_x_continuous(breaks=seq(40,100,10))+geom_vline(xintercept = 65, linetype="dashed")+ annotate("text", x=50, y=45, label="Age <65", size=2.5, color="dark green") + annotate("text", x=80, y=45, label="Age >= 65", size=2.5, color="dark red") +labs(title="Age Distribution") + theme_minimal(base_size = 8)
c1## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
CPK
#2. cpk
c2<- ggplot(heart, aes(x=creatinine_phosphokinase))+ geom_histogram(binwidth=100, colour="white", fill="mediumpurple2", alpha=0.8)+
geom_density(eval(bquote(aes(y=..count..*150))),colour="mediumorchid1", fill="mediumorchid1", alpha=0.3)+ scale_x_continuous(breaks=seq(0,10000,1000))+geom_vline(xintercept = 120, linetype="dashed")+ annotate("text", x=0, y=100, label="CPK Normal", size=2.5, color="dark green") + annotate("text", x=1000, y=80, label="CPK Abnormal", size=2.5, color="dark red")+labs(title="Creatinine Phosphokinase Distribution") + theme_minimal(base_size = 8)
c2Ejection Fraction
c3<- ggplot(heart, aes(x=ejection_fraction))+ geom_histogram(binwidth=5, colour="white", fill="lightpink1", alpha=0.8)+
geom_density(eval(bquote(aes(y=..count..*5))),colour="mistyrose2", fill="mistyrose2", alpha=0.3)+ scale_x_continuous(breaks=seq(0,80,10))+geom_vline(xintercept = 40, linetype="dashed")+geom_vline(xintercept = 75, linetype="dashed")+ annotate("text", x=20, y=30, label="Abnormal", size=2.5, color="dark red") + annotate("text", x=50, y=30, label="Normal", color="dark green")+ annotate("text", x=80, y=30, label="Abnormal", size=2.5, color="dark red")+labs(title="Ejection Fraction Distribution") + theme_minimal(base_size = 8)
c3Platelets Count
c4<- ggplot(heart, aes(x=platelets))+ geom_histogram(binwidth=20000, colour="white", fill="lightskyblue2", alpha=0.8)+
geom_density(eval(bquote(aes(y=..count..*25000))),colour="lightsteelblue", fill="lightsteelblue", alpha=0.3)+
geom_vline(xintercept = 150000, linetype="dashed")+geom_vline(xintercept = 450000, linetype="dashed")+ annotate("text", x=100000, y=30, label="Abnormal", size=2.5, color="dark red") + annotate("text", x=300000, y=30, label="Normal", color="dark green")+ annotate("text", x=500000, y=30, label="Abnormal", size=2.5, color="dark red")+labs(title="Platelets Count") + theme_minimal(base_size = 8)
c4Serum Sodium
c5<- ggplot(heart, aes(x=serum_sodium))+ geom_histogram(binwidth=1, colour="white", fill="lightsalmon", alpha=0.8)+
geom_density(eval(bquote(aes(y=..count..))),colour="lightcoral", fill="lightcoral", alpha=0.3)+
geom_vline(xintercept = 135, linetype="dashed")+geom_vline(xintercept = 145, linetype="dashed")+ annotate("text", x=130, y=20, label="Abnormal", size=2.5, color="dark red") + annotate("text", x=142, y=20, label="Normal", color="dark green")+ annotate("text", x=148, y=20, label="Abnormal", size=2.5, color="dark red")+labs(title="Serum Sodium") + theme_minimal(base_size = 8)
c5Serum Creatinine
c6<- ggplot(heart, aes(x=serum_creatinine))+ geom_histogram(binwidth=0.2, colour="white", fill="lightgoldenrod", alpha=0.8)+
geom_density(eval(bquote(aes(y=..count..*0.2))),colour="moccasin", fill="moccasin", alpha=0.3)+
geom_vline(xintercept = 0.74, linetype="dashed")+geom_vline(xintercept = 1.35, linetype="dashed")+ annotate("text", x=0.05, y=20, label="Abnormal", size=2.5, color="dark red") + annotate("text", x=1, y=20, label="Normal", color="dark green")+ annotate("text", x=2.5, y=20, label="Abnormal", size=2.5, color="dark red")+labs(title="Serum Creatinine") + theme_minimal(base_size = 8)
c6Death Event Count with Survival time
Bubble Chart
Seperti yang bisa kita lihat, bentuk segitiga adalah Kematian, lingkaran mewakili subjek yang disensor, ukuran bentuk mewakili jumlah pasien yang hidup/mati pada hari yang sama. Jelas pasien yang disensor (lingkaran) hidup lebih lama.
d1 <- group_by(heart,time,DEATH_EVENT)
d2<- summarise(d1,count=n())## `summarise()` has grouped output by 'time'. You can override using the
## `.groups` argument.
d22 <- arrange(d2, desc(time))
ggplot(d22, aes(x=reorder( time, count), y=time))+ geom_point(aes(size=count, colour=factor(count), shape=factor(DEATH_EVENT)), alpha=1/2)+
theme_ipsum() +
theme(axis.text.x = element_text(angle = 90, hjust = 1) , legend.position="none") +coord_flip() +ylab("Survival days") +xlab("Survival counts")+ ggtitle("Patient survival time with counts")Lollipop Chart for Survival Status with Censor
Kami menggunakan garis biru untuk menunjukkan pasien yang disensor (hidup), dan garis oranye untuk menunjukkan pasien yang mengalami kejadian (mati). Sangat jelas bahwa pasien yang disensor hidup lebih lama secara umum.
heart$idc <- paste("id",as.factor(heart$id))
lol1_100<-ggplot(heart[0:100,], aes(x=idc, y=time)) +
geom_segment( aes(x=idc, xend=idc, y=0, yend=time), color=ifelse(heart[0:100,]$DEATH_EVENT==1, "orange", "skyblue"))+
geom_point( color=ifelse(heart[0:100,]$DEATH_EVENT==1, "red", "darkgreen"), size=0.1, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
)+ ylab("Survival days") +xlab("Patient ID 1-100") + ggtitle("Patient 1-100 Survival Status with Censor")
lol101<-ggplot(heart[101:200,], aes(x=idc, y=time)) +
geom_segment( aes(x=idc, xend=idc, y=0, yend=time), color=ifelse(heart[0:100,]$DEATH_EVENT==1, "orange", "skyblue"))+
geom_point( color=ifelse(heart[0:100,]$DEATH_EVENT==1, "red", "darkgreen"), size=0.1, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
)+ ylab("Survival days") +xlab("Patient ID 101-200") + ggtitle("Patient 101-200 Survival Status with Censor")
lol201<-ggplot(heart[201:299,], aes(x=idc, y=time)) +
geom_segment( aes(x=idc, xend=idc, y=0, yend=time), color=ifelse(heart[201:299,]$DEATH_EVENT==1, "orange", "skyblue"))+
geom_point( color=ifelse(heart[201:299,]$DEATH_EVENT==1, "red", "darkgreen"), size=0.1, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
)+ ylab("Survival days") +xlab("Patient ID 201-299") + ggtitle("Patient 201-299 Survival Status with Censor")
lol1_100lol101lol201Correlations
Correlation Matrix
Dari matriks korelasi, kita dapat melihat Kejadian Kematian berkorelasi tinggi dengan kreatinin serum, usia, natrium serum, fraksi ejeksi.
r=cor(h1)
corrplot(r, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 90)Heatmap
Saya juga akan menunjukkan heatmap sebagai bukti tambahan.
coul <- colorRampPalette(brewer.pal(8, "PiYG"))(25)
heatmap(r, scale="column", col = coul)draw_confusion_matrix <- function(cm) {
total <- sum(cm$table)
res <- as.numeric(cm$table)
# Generate color gradients. Palettes come from RColorBrewer.
greenPalette <- c("#F7FCF5","#E5F5E0","#C7E9C0","#A1D99B","#74C476","#41AB5D","#238B45","#006D2C","#00441B")
redPalette <- c("#FFF5F0","#FEE0D2","#FCBBA1","#FC9272","#FB6A4A","#EF3B2C","#CB181D","#A50F15","#67000D")
getColor <- function (greenOrRed = "green", amount = 0) {
if (amount == 0)
return("#FFFFFF")
palette <- greenPalette
if (greenOrRed == "red")
palette <- redPalette
colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
}
# set the basic layout
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX', cex.main=2)
# create the matrix
classes = colnames(cm$table)
rect(150, 430, 240, 370, col=getColor("green", res[1]))
text(195, 435, classes[1], cex=1.2)
rect(250, 430, 340, 370, col=getColor("red", res[3]))
text(295, 435, classes[2], cex=1.2)
text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
text(245, 450, 'Actual', cex=1.3, font=2)
rect(150, 305, 240, 365, col=getColor("red", res[2]))
rect(250, 305, 340, 365, col=getColor("green", res[4]))
text(140, 400, classes[1], cex=1.2, srt=90)
text(140, 335, classes[2], cex=1.2, srt=90)
# add in the cm results
text(195, 400, res[1], cex=1.6, font=2, col='white')
text(195, 335, res[2], cex=1.6, font=2, col='white')
text(295, 400, res[3], cex=1.6, font=2, col='white')
text(295, 335, res[4], cex=1.6, font=2, col='white')
# add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
# add in the accuracy information
text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}Machine Learning Model for Variable Selections
Kami menerapkan 3 metode pembelajaran mesin (GLM, GBM, Random Forest) untuk memprediksi faktor yang paling penting untuk Kejadian_Kematian Gagal Jantung sebagai variabel dependen dalam struktur data asli serta kerangka data yang dikategorikan. Secara keseluruhan, kami menemukan serum_creatinine, usia, fraksi ejeksi, dan natrium serum adalah empat fitur yang paling penting, yang konsisten dengan matriks korelasi.
catatan: Kecualikan variabel “waktu” (paradoks)
GBM
Gradient Boosting adalah teknik pembelajaran mesin yang digunakan dalam tugas regresi dan klasifikasi. Teknik ini terkenal karena dapat mengubah pembelajar yang lemah menjadi pembelajar yang kuat. Dimulai dengan memberikan bobot yang sama pada setiap pohon. kemudian memperbaiki prediksi pohon pertama. Kami mengulangi proses Tree1 +Tree2 ini hingga jumlah tertentu.
GBM with Original Data
gbm.m<- gbm(train$DEATH_EVENT ~. , data=train, distribution = "bernoulli",
cv.folds=10, shrinkage=0.01, n.minobsinnode = 10, n.trees=1000)
#gbm.m
gbm.imp=summary(gbm.m)gbm.impgmb.t =predict(object=gbm.m, newdata=test, n.trees=1000, type="response")
presult<- as.factor(ifelse(gmb.t>0.5,1,0))
test$DEATH_EVENT1<-as.factor(test$DEATH_EVENT)
g<-confusionMatrix(presult,test$DEATH_EVENT1)
draw_confusion_matrix(g)GBM with Categorized Data
gbm.m1<- gbm(train1$DEATH_EVENT ~. , data=train1, distribution = "bernoulli",
cv.folds=10, shrinkage=0.01, n.minobsinnode = 10, n.trees=1000)
#gbm.m1
gbm.imp1=summary(gbm.m1)gbm.imp1gmb.t1 =predict(object=gbm.m1, newdata=test1, n.trees=1000, type="response")
presult<- as.factor(ifelse(gmb.t1>0.5,1,0))
test1$DEATH_EVENT1<-as.factor(test1$DEATH_EVENT)
g1<-confusionMatrix(presult,test1$DEATH_EVENT1)
draw_confusion_matrix(g1)Random Forest
Random Forest with Original Data
rforest<- randomForest(factor(DEATH_EVENT) ~. , data=train, ntree=500, importance=TRUE)
#summary(rforest)
imp<-varImp(rforest)
varImpPlot(rforest)rpredict<- predict(rforest, test, type="class")
cm2<-confusionMatrix(rpredict, test$DEATH_EVENT1)
draw_confusion_matrix(cm2)Random Forest with Categorized Data
rforest1<- randomForest(factor(DEATH_EVENT) ~. , data=train1, ntree=500, importance=TRUE)
#summary(rforest1)
imp1<-varImp(rforest1)
varImpPlot(rforest1)rpredict1<- predict(rforest1, test1, type="class")
cm21<-confusionMatrix(rpredict1, test1$DEATH_EVENT1)
draw_confusion_matrix(cm21)General Linear Model
General Linear Model with Original Data
lm1 <- glm(DEATH_EVENT ~., data=train, family=binomial(link="logit"))
#summary(lm1)
limp<-varImp(lm1)
backward<-step(lm1,direction="backward", trace=0)
#vi(backward)
p2<- vip(backward,num_features = length(coef(backward)),
geom="point", horizontal = TRUE, mapping = aes_string(color="Sign"))## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation ideoms with `aes()`
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
p2glm.t =predict(object=lm1, newdata=test, type="response")
presult<- as.factor(ifelse(glm.t>0.5,1,0))
test$DEATH_EVENT1<-as.factor(test$DEATH_EVENT)
cm3<- confusionMatrix(presult,test$DEATH_EVENT1)
draw_confusion_matrix(cm3)General Linear Model with Classication
lm11 <- glm(DEATH_EVENT ~., data=train1, family=binomial(link="logit"))
#summary(lm1)
limp1<-varImp(lm11)
backward<-step(lm11,direction="backward", trace=0)
#vi(backward)
p21<- vip(backward,num_features = length(coef(backward)),
geom="point", horizontal = TRUE, mapping = aes_string(color="Sign"))
p21glm.t1 =predict(object=lm11, newdata=test1, type="response")
presult1<- as.factor(ifelse(glm.t1>0.5,1,0))
test1$DEATH_EVENT1<-as.factor(test1$DEATH_EVENT)
cm31<- confusionMatrix(presult1,test1$DEATH_EVENT1)
draw_confusion_matrix(cm31)Survival Analysis
Survival Analysis Basics
Pertama, kita akan menguji asumsi bahaya proporsional, dan menilai fitur-fitur secara individual untuk melihat kecocokannya dengan kurva survival dan Rasio Bahaya yang dihormati. Kemudian kita akan menggabungkan tiga fitur yang paling penting untuk membuat Kelompok Risiko Rendah dan Risiko Tinggi.
Definiton and Concept
Analisis Kelangsungan Hidup menyelidiki waktu terhadap hasil kejadian yang melibatkan penyensoran adalah pendekatan statistik yang paling umum dalam literatur medis. Penting untuk mengetahui waktu kejadian untuk setiap pasien. Dalam kasus kami, waktu terjadinya gagal jantung.
- Fungsi Kelangsungan Hidup: Probabilitas bertahan hidup lebih lama dari titik waktu tertentu
\[ S(t) = P(T > t) \]
- Fungsi Hazard: Risiko mengalami kejadian pada interval berikutnya dengan syarat bertahan hingga awal interval
\[ h(t)=\lim_{\Delta t \to 0 } \frac{1}{\Delta t} P(T <= t+ \Delta t|T>t)\]
\[ H(t) = -log(S(t)) \] Censor Kita menggunakan aturan penyensoran yang tepat, peristiwa tersebut terjadi setelah tanggal tertentu
Cox Model adalah model yang terkenal untuk mengeksplorasi hubungan antara kelangsungan hidup pasien dan beberapa variabel penjelas. Model ini memperkirakan bahaya (risiko) dari kejadian yang menarik bagi individu, dengan mempertimbangkan variabel prognostik mereka.
\[h_i(t)=h_0(t) e^{\beta_1*x1+ \beta_2*x2} \]
Survival function and its Relations
If we know one of the function S(t), f(t), \(\lambda (t)\) , \(\Lambda(t)\), we can compute the rest three by FTC.
Survivor function S(t) \[S(t)=Pr(T \geq t)= \int_t^\infty f(u)du\]
The density function f(t) \[ f(t)=\lim_{\Delta t \to 0 } \frac{1}{\Delta t} Pr(t \leq T \leq t+ \Delta t)\] By FTC, fundamental calculus part2, we have \(F(t)=Pr(T<t)\), which leads to \(S(t)=1-F(t)\)
The cumulative hazard function \(\Lambda(t)\) or H(t) , with the help of FTC, we have \(S(t)=e^{-\Lambda(t)}=e^{-H(t)}\) \[\Lambda(t) = H(t)= \int_0^t \lambda(u)du = h(u)du\]
The hazard function \(\lambda(t)\) or h(t) \[\lambda(t) = h(t) =\frac{f(t)}{S(t)}= -\frac{d}{dt}[logS(t)]\]
Fungsi survivor S(t) menurun seiring waktu, sementara H(t) meningkat seiring waktu. (probabilitas bertahan hidup semakin kecil sepanjang waktu, sementara risiko kegagalan meningkat seiring waktu)
Assessing Proportional Hazard
Kita akan menggunakan fungsi Cox.zph untuk menguji asumsi Proportional Hazard, jika asumsi SDM dilanggar, kita akan menggunakan analisis survival rata-rata terbatas untuk penyelidikan lebih lanjut. Dari gambar-gambar tersebut, kita dapat melihat bahwa data kita memenuhi kondisi Proportional.
Nilai p-value yang signifikan mengindikasikan asumsi proporsional hazard dilanggar
Plot residual Schoefeld (garis kemiringan nol)
mv_fit <- coxph(Surv(time,DEATH_EVENT) ~ efraction+ agegp + screat +sodiumc, data=heart)
ccox<- cox.zph(mv_fit)
print(ccox)## chisq df p
## efraction 0.00357 1 0.952
## agegp 2.58958 1 0.108
## screat 3.63617 1 0.057
## sodiumc 1.53360 1 0.216
## GLOBAL 7.66582 4 0.105
options(repr.plot.width=10, repr.plot.height=40)
ggcoxzph(ccox)Hazard Ratio
Rasio Bahaya menunjukkan perbedaan kelangsungan hidup yang kuat antara kelompok.
Hazard Ratio : Perbandingan antara probabilitas kejadian pada kelompok pengobatan, dibandingkan dengan probabilitas kejadian pada kelompok kontrol. Rasio ini digunakan untuk melihat apakah pasien yang menerima pengobatan berkembang lebih cepat (lebih lambat) dibandingkan dengan pasien yang tidak menerima pengobatan.
\[ \lambda (t_iX)= \lambda_0(t)exp(\beta X) \]
\[ log\ of \ Hazard = log(a) +b1x1...bkxk\]
HR=1: Tidak ada perbedaan antara kelompok perlakuan dan kelompok kontrol
HR<1: Probabilitas kejadian pada kelompok perlakuan lebih kecil dari kelompok kontrol
HR>1: Probabilitas kejadian pada kelompok perlakuan lebih besar dari kelompok kontrol
Hazard Ratio for Important Variable
Kita dapat melihat Ejection fraction, serum creatinine, sodium creatinine kelompok normal memiliki probabilitas yang lebih rendah untuk mengalami kejadian tersebut, sedangkan kelompok usia yang lebih tua memiliki probabilitas yang lebih tinggi untuk mengalami kejadian gagal jantung.
ggforest(mv_fit)Hazard Ratio for other variables
Grafik yang disajikan Tekanan darah tinggi, penyakit anemia, jumlah trombosit dan CPK memiliki tingkat yang cukup signifikan dalam rasio bahaya.
Sedangkan yang mengejutkan saya, Merokok tidak membuat perbedaan.
mo <- coxph(Surv(time,DEATH_EVENT) ~sexc +dia +hbp +smoke +anaemiac+ platc+cpk, data=heart)
ggforest(mo)KM Curve
Kurva Kaplan Meier adalah sebuah estimator yang digunakan untuk mengestimasi fungsi survival non parametrik (Log rank test). Kurva Kaplan Meier adalah representasi visual dari fungsi ini yang menunjukkan probabilitas suatu peristiwa pada interval waktu tertentu. Tidak memerlukan asumsi mengenai distribusi data yang mendasarinya.
Ejection Fraction, Serum Creatinine , Age Group , Serum Sodium
Plot KM menunjukkan kurva kelangsungan hidup yang dipisahkan untuk kelompok yang dibandingkan dengan memperhatikan fitur-fiturnya: Ejection Fraction, Serum Creatinine, Age Group, and Serum Sodium
fit_ef<-survfit(Surv(time,DEATH_EVENT)~heart$efraction, data=heart)
fit_sc<-survfit(Surv(time,DEATH_EVENT)~screat, data=heart)
fit_age<-survfit(Surv(time,DEATH_EVENT)~agegp, data=heart)
fit_sd<-survfit(Surv(time,DEATH_EVENT)~sodiumc, data=heart)
splots<- list()
splots[[1]]<-ggsurvplot(fit_ef,data=heart ,xlab="Days", ggtheme=theme_minimal())
splots[[2]]<-ggsurvplot(fit_sc,data=heart, xlab="Days", ggtheme=theme_minimal())
splots[[3]]<-ggsurvplot(fit_age,data=heart,xlab="Days", ggtheme=theme_minimal())
splots[[4]]<-ggsurvplot(fit_sd,data=heart,xlab="Days", ggtheme=theme_minimal())
arrange_ggsurvplots(splots, print=TRUE, ncol=2, nrow=2)Final Survival Analysis Model
Tentukan Kelompok Risiko:
Risiko Rendah: kreatinin serum normal + usia <65 + fraksi ejeksi normal
Risiko Tinggi: kombinasi istirahat
heart$riskgp <- ifelse(heart$agegp=="Age <65" & heart$efraction=="Ejection Normal" & heart$sodiumc=="Serum Sodium Normal", "Risk Low", "Risk High")
fit<-survfit(Surv(time,DEATH_EVENT)~riskgp, data=heart)
km<-ggsurvplot(fit,data=heart, risk.table=TRUE, legend="none", break.time.by=30, size=0.1,tables.height=0.3, xlab="Days")
km#log rank p value for groups
survdiff(Surv(time, DEATH_EVENT) ~ riskgp, data=heart)## Call:
## survdiff(formula = Surv(time, DEATH_EVENT) ~ riskgp, data = heart)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## riskgp=Risk High 272 94 86.27 0.693 6.87
## riskgp=Risk Low 27 2 9.73 6.145 6.87
##
## Chisq= 6.9 on 1 degrees of freedom, p= 0.009
Cox Proportion Model and Hazard Ratio
Terakhir, kita dapat melihat p value dari model Cox konsisten dengan hasil tersebut.
coxph(Surv(time,DEATH_EVENT) ~ riskgp, data=heart) %>%
gtsummary::tbl_regression(exp=TRUE)| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| riskgp | |||
| Risk High | — | — | |
| Risk Low | 0.19 | 0.05, 0.76 | 0.019 |
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
Summary
Dari penerapan berbagai model Machine Learning, kami mencapai hasil yang sama, bahwa Serum Sodium, Serum Creatinine, Ejection Fraction, and Age Group adalah fitur yang paling penting dalam Penyakit Jantung. Analisis kelangsungan hidup dan kurva Kaplan Meier memvalidasi temuan kami. Oleh karena itu, kami dapat memberi tahu masyarakat jika pasien memiliki masalah jantung kronis, mereka dapat secara berkala memantau Kreatinin Serum, Natrium Serum, dan Fraksi Ejeksi mereka setelah berada di atas 65.
References :
[3] https://www.mayoclinic.org/tests-procedures/ekg/expert-answers/ejection-fraction/faq-20058286
[4] https://www.mayoclinic.org/tests-procedures/creatinine-test/about/pac-20384646
[6] https://www.mayoclinic.org/diseases-conditions/hyponatremia/symptoms-causes/syc-20373711
[7] https://www.mountsinai.org/health-library/tests/creatine-phosphokinase-test
[8] https://www.kaggle.com/snowpea8/heart-failure-eda-prediction-with-r-91-5-acc
[9] https://pubmed.ncbi.nlm.nih.gov/33919237/
[10] ScD. Paul Catalano Lecture notes
[11] 榛子会发光 lollipop plot inspiration