1. Introduction

Heart disease is one of the leading causes of death worldwide. The risk of heart disease is influenced by factors such as age, blood pressure, cholesterol level, and an individual's clinical condition. Statistical analysis is therefore needed to identify the main risk factors as a basis for prevention and early detection.

This study aims to analyze risk factors for heart disease using logistic regression.

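2. Data Description

The dataset is the Heart Disease UCI Dataset, obtained from Kaggle. It contains health indicators for 297 patients across 14 variables; the response variable condition equals 1 for 137 patients (about 46%) and 0 for the remaining 160.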

library(readxl)
data <- read_excel("Heart Disease UCI.xlsx")
summary(data)
##       age             sex               cp           trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:120.0  
##  Median :56.00   Median :1.0000   Median :2.000   Median :130.0  
##  Mean   :54.54   Mean   :0.6768   Mean   :2.158   Mean   :131.7  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0  
##       chol            fbs            restecg          thalach     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.0  
##  Median :243.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :247.4   Mean   :0.1448   Mean   :0.9966   Mean   :149.6  
##  3rd Qu.:276.0   3rd Qu.:0.0000   3rd Qu.:2.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang          oldpeak              slope              ca        
##  Min.   :0.0000   Length:297         Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   Class :character   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Mode  :character   Median :1.0000   Median :0.0000  
##  Mean   :0.3266                      Mean   :0.6027   Mean   :0.6768  
##  3rd Qu.:1.0000                      3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000                      Max.   :2.0000   Max.   :3.0000  
##       thal         condition     
##  Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.000   Median :0.0000  
##  Mean   :0.835   Mean   :0.4613  
##  3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :2.000   Max.   :1.0000
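
In the summary above, oldpeak is reported as Class :character rather than with numeric quartiles, which suggests the column was imported as text. The following is a minimal sketch of how the column types could be checked and the column converted before modeling. The data frame toy and its values are hypothetical and are not taken from the dataset.

```r
# Hypothetical three-row data frame that mimics the import problem:
# oldpeak arrives as character instead of numeric.
toy <- data.frame(
  age     = c(63, 45, 58),
  oldpeak = c("2.3", "0", "1.4"),
  stringsAsFactors = FALSE
)

# Inspect the class of every column to spot unexpected types
sapply(toy, class)

# Convert to numeric; as.numeric() returns NA (with a warning)
# for any string that cannot be parsed as a number
toy$oldpeak <- as.numeric(toy$oldpeak)

# Guard against silent conversion failures before modeling
stopifnot(!anyNA(toy$oldpeak))

summary(toy$oldpeak)
```

Applied to the real data, the same conversion would be data$oldpeak <- as.numeric(data$oldpeak), run before glm() is called.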

Descriptive Statistics

mean_age <- round(mean(data$age, na.rm=TRUE),2)
mean_bp <- round(mean(data$trestbps, na.rm=TRUE),2)
mean_chol <- round(mean(data$chol, na.rm=TRUE),2)

mean_age
## [1] 54.54
mean_bp
## [1] 131.69
mean_chol
## [1] 247.35
3. Logistic Regression Modeling
model <- glm(condition ~ age + sex + cp + trestbps + chol + thalach + oldpeak,
             data = data,
             family = binomial)
summary(model)
## 
## Call:
## glm(formula = condition ~ age + sex + cp + trestbps + chol + 
##     thalach + oldpeak, family = binomial, data = data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.244e+00  2.533e+00  -2.465 0.013711 *  
## age          2.599e-02  2.409e-02   1.079 0.280777    
## sex          2.480e+00  4.786e-01   5.181 2.21e-07 ***
## cp           1.128e+00  2.149e-01   5.251 1.52e-07 ***
## trestbps     3.069e-02  1.135e-02   2.705 0.006840 ** 
## chol         7.886e-03  3.694e-03   2.135 0.032750 *  
## thalach     -3.896e-02  1.062e-02  -3.668 0.000245 ***
## oldpeak0.1  -7.706e-01  1.028e+00  -0.750 0.453408    
## oldpeak0.2  -4.302e-01  9.004e-01  -0.478 0.632780    
## oldpeak0.3   9.884e-01  1.398e+00   0.707 0.479508    
## oldpeak0.4  -2.284e+00  1.199e+00  -1.904 0.056866 .  
## oldpeak0.5  -1.544e+00  1.203e+00  -1.283 0.199530    
## oldpeak0.6  -2.418e-01  7.848e-01  -0.308 0.758015    
## oldpeak0.7  -1.213e+01  6.523e+03  -0.002 0.998516    
## oldpeak0.8   5.330e-01  7.912e-01   0.674 0.500574    
## oldpeak0.9   1.802e+00  3.401e+00   0.530 0.596104    
## oldpeak1     8.647e-01  8.729e-01   0.991 0.321849    
## oldpeak1.1  -1.680e+01  3.760e+03  -0.004 0.996434    
## oldpeak1.2   3.746e-01  6.921e-01   0.541 0.588332    
## oldpeak1.3  -1.556e+01  6.523e+03  -0.002 0.998097    
## oldpeak1.4   1.822e+00  9.080e-01   2.007 0.044740 *  
## oldpeak1.5  -3.091e+00  1.713e+00  -1.804 0.071205 .  
## oldpeak1.6  -7.068e-01  8.468e-01  -0.835 0.403879    
## oldpeak1.8   1.658e+00  1.127e+00   1.472 0.141049    
## oldpeak1.9   4.864e-01  1.232e+00   0.395 0.692994    
## oldpeak2     9.620e-01  9.883e-01   0.973 0.330353    
## oldpeak2.1   1.608e+01  6.523e+03   0.002 0.998033    
## oldpeak2.2   1.698e+01  2.992e+03   0.006 0.995470    
## oldpeak2.3  -2.006e+01  3.612e+03  -0.006 0.995568    
## oldpeak2.4  -2.934e-01  1.675e+00  -0.175 0.860979    
## oldpeak2.5   1.897e+01  4.222e+03   0.004 0.996415    
## oldpeak2.6   2.576e+00  1.380e+00   1.867 0.061971 .  
## oldpeak2.8   1.756e+01  2.480e+03   0.007 0.994350    
## oldpeak2.9   1.559e+01  6.523e+03   0.002 0.998093    
## oldpeak3     1.088e+00  1.221e+00   0.891 0.373027    
## oldpeak3.1   1.771e+01  6.523e+03   0.003 0.997834    
## oldpeak3.2   1.850e+01  3.924e+03   0.005 0.996238    
## oldpeak3.4   1.623e+01  3.639e+03   0.004 0.996442    
## oldpeak3.5  -1.670e+01  6.523e+03  -0.003 0.997958    
## oldpeak3.6   1.836e+01  2.947e+03   0.006 0.995030    
## oldpeak3.8   2.293e+01  6.523e+03   0.004 0.997195    
## oldpeak4     1.738e+01  3.683e+03   0.005 0.996235    
## oldpeak4.2  -1.276e+00  1.956e+00  -0.652 0.514161    
## oldpeak4.4   1.689e+01  6.523e+03   0.003 0.997934    
## oldpeak5.6   1.583e+01  6.523e+03   0.002 0.998063    
## oldpeak6.2   1.926e+01  6.523e+03   0.003 0.997644    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 409.95  on 296  degrees of freedom
## Residual deviance: 211.28  on 251  degrees of freedom
## AIC: 303.28
## 
## Number of Fisher Scoring iterations: 17

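A positive coefficient means the log-odds of heart disease increase as the predictor increases; a negative coefficient means they decrease. At the 5% level, sex, cp, trestbps, chol, and thalach are significant, while age is not (p = 0.28). Because oldpeak was imported as a character column, glm() treated it as a categorical predictor and estimated one dummy coefficient for each distinct value. Many of these have estimates beyond ±15 with standard errors in the thousands, which is typical of separation in categories with very few observations, and the fit needed 17 Fisher scoring iterations. The oldpeak coefficients should therefore not be interpreted individually.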
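
The many oldpeak0.1, oldpeak0.2, ... rows in the output above arise when a predictor is stored as text. The sketch below illustrates the effect on simulated data: the same variable yields one coefficient per distinct value when stored as character, and a single slope when stored as numeric. The data frame toy, the seed, and the simulation parameters are all assumptions for illustration and do not reproduce the study's results.

```r
set.seed(1)
n <- 200

# Simulated predictor stored as text, as in the imported dataset
toy <- data.frame(
  oldpeak = as.character(round(runif(n, 0, 4), 1)),
  stringsAsFactors = FALSE
)

# Simulated binary outcome whose log-odds rise linearly with oldpeak
toy$condition <- rbinom(n, 1, plogis(-1 + 0.8 * as.numeric(toy$oldpeak)))

# Character predictor: glm() builds one dummy per distinct value.
# Sparse categories can trigger convergence warnings, suppressed here.
fit_chr <- suppressWarnings(
  glm(condition ~ oldpeak, data = toy, family = binomial)
)

# Numeric predictor: a single slope coefficient
toy$oldpeak_num <- as.numeric(toy$oldpeak)
fit_num <- glm(condition ~ oldpeak_num, data = toy, family = binomial)

length(coef(fit_chr))  # intercept plus one dummy per non-reference value
length(coef(fit_num))  # intercept plus one slope
```

With a numeric oldpeak the model spends one degree of freedom on the variable instead of dozens, and the coefficient can be read as the change in log-odds per unit of oldpeak.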

4. Model Equation
coef_model <- round(coef(model),4)
coef_model
## (Intercept)         age         sex          cp    trestbps        chol 
##     -6.2444      0.0260      2.4797      1.1281      0.0307      0.0079 
##     thalach  oldpeak0.1  oldpeak0.2  oldpeak0.3  oldpeak0.4  oldpeak0.5 
##     -0.0390     -0.7706     -0.4302      0.9884     -2.2836     -1.5436 
##  oldpeak0.6  oldpeak0.7  oldpeak0.8  oldpeak0.9    oldpeak1  oldpeak1.1 
##     -0.2418    -12.1315      0.5330      1.8024      0.8647    -16.8041 
##  oldpeak1.2  oldpeak1.3  oldpeak1.4  oldpeak1.5  oldpeak1.6  oldpeak1.8 
##      0.3746    -15.5603      1.8225     -3.0908     -0.7068      1.6584 
##  oldpeak1.9    oldpeak2  oldpeak2.1  oldpeak2.2  oldpeak2.3  oldpeak2.4 
##      0.4864      0.9620     16.0791     16.9850    -20.0645     -0.2934 
##  oldpeak2.5  oldpeak2.6  oldpeak2.8  oldpeak2.9    oldpeak3  oldpeak3.1 
##     18.9677      2.5763     17.5631     15.5863      1.0879     17.7095 
##  oldpeak3.2  oldpeak3.4  oldpeak3.5  oldpeak3.6  oldpeak3.8    oldpeak4 
##     18.5047     16.2303    -16.6958     18.3584     22.9285     17.3791 
##  oldpeak4.2  oldpeak4.4  oldpeak5.6  oldpeak6.2 
##     -1.2759     16.8863     15.8331     19.2595
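
Written out with the rounded coefficients printed above, the fitted model can be expressed as follows. The symbols p, gamma_k, and the indicator notation are introduced here for illustration: p denotes P(condition = 1), and gamma_k stands for the oldpeak dummy coefficient listed for value k.

```latex
\[
\log\frac{p}{1-p}
  = -6.2444
  + 0.0260\,\mathrm{age}
  + 2.4797\,\mathrm{sex}
  + 1.1281\,\mathrm{cp}
  + 0.0307\,\mathrm{trestbps}
  + 0.0079\,\mathrm{chol}
  - 0.0390\,\mathrm{thalach}
  + \sum_{k} \gamma_k \,\mathbf{1}[\mathrm{oldpeak} = k]
\]
```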
5. Odds Ratio
odds <- exp(coef(model))
odds
##  (Intercept)          age          sex           cp     trestbps         chol 
## 1.941285e-03 1.026329e+00 1.193759e+01 3.089742e+00 1.031169e+00 1.007917e+00 
##      thalach   oldpeak0.1   oldpeak0.2   oldpeak0.3   oldpeak0.4   oldpeak0.5 
## 9.617920e-01 4.627414e-01 6.503639e-01 2.686968e+00 1.019134e-01 2.136091e-01 
##   oldpeak0.6   oldpeak0.7   oldpeak0.8   oldpeak0.9     oldpeak1   oldpeak1.1 
## 7.852295e-01 5.387179e-06 1.703989e+00 6.063953e+00 2.374397e+00 5.035836e-08 
##   oldpeak1.2   oldpeak1.3   oldpeak1.4   oldpeak1.5   oldpeak1.6   oldpeak1.8 
## 1.454417e+00 1.746904e-07 6.187071e+00 4.546372e-02 4.932005e-01 5.250644e+00 
##   oldpeak1.9     oldpeak2   oldpeak2.1   oldpeak2.2   oldpeak2.3   oldpeak2.4 
## 1.626444e+00 2.616879e+00 9.617663e+06 2.379437e+07 1.932454e-09 7.457257e-01 
##   oldpeak2.5   oldpeak2.6   oldpeak2.8   oldpeak2.9     oldpeak3   oldpeak3.1 
## 1.728012e+08 1.314810e+01 4.241947e+07 5.875643e+06 2.967992e+00 4.910872e+07 
##   oldpeak3.2   oldpeak3.4   oldpeak3.5   oldpeak3.6   oldpeak3.8     oldpeak4 
## 1.087651e+08 1.118730e+07 5.611943e-08 9.395865e+07 9.072578e+09 3.528833e+07 
##   oldpeak4.2   oldpeak4.4   oldpeak5.6   oldpeak6.2 
## 2.791813e-01 2.155992e+07 7.519825e+06 2.313579e+08
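
An odds ratio is easier to judge alongside an interval and a meaningful step size. The sketch below recomputes the odds ratios for the five significant predictors from the estimates and standard errors printed in the coefficient table, adds approximate 95% Wald intervals, and rescales the blood pressure effect to a 10-unit increase. The vectors est and se are typed in by hand from that table; the Wald interval is one common approximation, not necessarily what the author would use.

```r
# Estimates and standard errors copied from the printed summary(model)
est <- c(sex = 2.480, cp = 1.128, trestbps = 0.03069,
         chol = 0.007886, thalach = -0.03896)
se  <- c(sex = 0.4786, cp = 0.2149, trestbps = 0.01135,
         chol = 0.003694, thalach = 0.01062)

# Approximate 95% Wald interval on the log-odds scale, then exponentiate
z <- qnorm(0.975)
or_table <- cbind(
  OR    = exp(est),
  lower = exp(est - z * se),
  upper = exp(est + z * se)
)
round(or_table, 3)

# Odds ratio for a 10-unit increase in resting blood pressure:
# multiply the coefficient by 10 before exponentiating
or_bp_10 <- exp(10 * est[["trestbps"]])
or_bp_10
```

An interval that excludes 1 corresponds to a coefficient that is significant at the 5% level under the same Wald approximation.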
6. Model Evaluation

6.1 Confusion Matrix

prob <- predict(model, type="response")

pred <- ifelse(prob > 0.5,1,0)

cm <- table(
  factor(pred, levels=c(0,1)),
  factor(data$condition, levels=c(0,1))
)

cm
##    
##       0   1
##   0 143  24
##   1  17 113

6.2 Accuracy

accuracy <- sum(diag(cm)) / sum(cm)
accuracy
## [1] 0.8619529

6.3 Sensitivity and Specificity

sensitivity <- cm[2,2] / (cm[2,2] + cm[1,2])
specificity <- cm[1,1] / (cm[1,1] + cm[2,1])

sensitivity
## [1] 0.8248175
specificity
## [1] 0.89375

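Sensitivity is the proportion of patients with heart disease that the model classifies correctly (113 of 137, about 82%). Specificity is the proportion of patients without heart disease that the model classifies correctly (143 of 160, about 89%). Both are computed on the same data used to fit the model, so they are likely to be optimistic.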
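
The metrics above can be reproduced directly from the four cell counts of the printed confusion matrix. The sketch below rebuilds the matrix by hand (rows are predictions, columns are actual values, as in the table() call) and also derives precision, which the report does not compute; the variable names are chosen for illustration.

```r
# Confusion matrix rebuilt from the printed output
# (matrix() fills column by column)
cm <- matrix(
  c(143, 17, 24, 113),
  nrow = 2,
  dimnames = list(predicted = c("0", "1"), actual = c("0", "1"))
)

tn <- cm["0", "0"]  # predicted 0, actual 0
fp <- cm["1", "0"]  # predicted 1, actual 0
fn <- cm["0", "1"]  # predicted 0, actual 1
tp <- cm["1", "1"]  # predicted 1, actual 1

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # true positive rate
specificity <- tn / (tn + fp)   # true negative rate
precision   <- tp / (tp + fp)   # share of predicted positives that are correct

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```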

7. Goodness of Fit (Pseudo R²)
logLik_model <- logLik(model)

logLik_null <- logLik(
  glm(condition ~ 1, data=data, family=binomial)
)

pseudo_r2 <- 1 - (logLik_model / logLik_null)

pseudo_r2
## 'log Lik.' 0.4846183 (df=46)

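This is McFadden's pseudo R², which compares the log-likelihood of the fitted model with that of an intercept-only model; values closer to 1 indicate a larger improvement over the null model. It is not the share of variance explained, as R² is in linear regression. The value of about 0.48 is also inflated by the 39 oldpeak dummy coefficients, which account for most of the 46 estimated parameters. The 'log Lik.' label and df=46 in the printed output are attributes carried over from the logLik object; wrapping the result in as.numeric() removes them.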
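
As a cross-check, the same pseudo R² can be recovered from the deviances printed in the model summary. For a 0/1 response the deviance equals minus twice the log-likelihood, so the ratio of deviances equals the ratio of log-likelihoods. The sketch below uses the printed values only; the variable names are chosen for illustration.

```r
# Values copied from the printed summary(model)
null_dev  <- 409.95   # null deviance (intercept-only model)
resid_dev <- 211.28   # residual deviance (fitted model)
n_params  <- 46       # number of estimated coefficients (df of logLik)

# McFadden's pseudo R-squared from the deviance ratio
mcfadden <- 1 - resid_dev / null_dev
mcfadden

# Consistency check: AIC = residual deviance + 2 * number of parameters
aic <- resid_dev + 2 * n_params
aic
```

The result agrees with the printed 0.4846183 up to the rounding of the deviances, and the AIC check reproduces the printed 303.28.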

8. Multicollinearity
numeric_vars <- c("age","trestbps","chol","thalach","oldpeak")

data[numeric_vars] <- lapply(data[numeric_vars], as.numeric)

cor_matrix <- cor(data[numeric_vars], use="complete.obs")

cor_matrix
##                 age    trestbps          chol       thalach     oldpeak
## age       1.0000000  0.29047626  2.026435e-01 -3.945629e-01  0.19712262
## trestbps  0.2904763  1.00000000  1.315357e-01 -4.910766e-02  0.19124314
## chol      0.2026435  0.13153571  1.000000e+00 -7.456799e-05  0.03859579
## thalach  -0.3945629 -0.04910766 -7.456799e-05  1.000000e+00 -0.34763997
## oldpeak   0.1971226  0.19124314  3.859579e-02 -3.476400e-01  1.00000000

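High correlation between predictors can cause multicollinearity. Here the largest absolute pairwise correlation is about 0.39 (age and thalach), so there is no sign of problematic collinearity among these five numeric variables. Note that oldpeak is converted to numeric only at this step, after the model has already been fitted with oldpeak as a character column.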
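
Pairwise correlations can miss collinearity that involves several variables at once. One common complement is the variance inflation factor (VIF), which for a set of predictors equals the diagonal of the inverse of their correlation matrix. The sketch below applies this to the printed matrix, with values rounded to four decimals and typed in by hand. It covers only the five numeric variables, not sex or cp, and the threshold mentioned afterwards is a rule of thumb rather than a result from the report.

```r
vars <- c("age", "trestbps", "chol", "thalach", "oldpeak")

# Correlation matrix copied (rounded) from the printed cor_matrix
R <- matrix(c(
   1.0000,  0.2905,  0.2026, -0.3946,  0.1971,
   0.2905,  1.0000,  0.1315, -0.0491,  0.1912,
   0.2026,  0.1315,  1.0000, -0.0001,  0.0386,
  -0.3946, -0.0491, -0.0001,  1.0000, -0.3476,
   0.1971,  0.1912,  0.0386, -0.3476,  1.0000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal
# element of the inverse correlation matrix
vif <- diag(solve(R))
round(vif, 2)
```

Values near 1 indicate little shared variance; values above roughly 5 to 10 are often treated as a warning sign.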

9. Manual ROC Curve
tpr <- c()
fpr <- c()

thresholds <- seq(0,1,by=0.05)

for(t in thresholds){

  pred_t <- ifelse(prob > t,1,0)

  cm_t <- table(
    factor(pred_t,levels=c(0,1)),
    factor(data$condition,levels=c(0,1))
  )

  tpr <- c(tpr, cm_t[2,2]/(cm_t[2,2]+cm_t[1,2]))
  fpr <- c(fpr, cm_t[2,1]/(cm_t[2,1]+cm_t[1,1]))
}

plot(fpr,tpr,
     type="l",
     lwd=3,
     col="blue",
     xlab="False Positive Rate",
     ylab="True Positive Rate",
     main="ROC Curve")

abline(0,1,col="red",lty=2)

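The ROC curve shows how well the model separates patients with and without heart disease across classification thresholds. The further the curve lies above the red diagonal reference line, the better the discrimination.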
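
The curve is often summarized by the area under it (AUC). The sketch below computes the AUC with the rank-based (Mann-Whitney) formula, which avoids choosing a threshold grid. The function name auc_rank and the four toy observations are hypothetical and are not part of the original analysis.

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case.
auc_rank <- function(score, label) {
  stopifnot(length(score) == length(label), all(label %in% c(0, 1)))
  r <- rank(score)                 # tied scores receive average ranks
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores and labels, not taken from the study data
toy_score <- c(0.10, 0.40, 0.35, 0.80)
toy_label <- c(0, 0, 1, 1)

auc_rank(toy_score, toy_label)
```

With the objects defined earlier in this report, the call would be auc_rank(prob, data$condition). An AUC of 0.5 corresponds to the diagonal line and 1 to perfect separation.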

10. Visualizing Variable Relationships
library(ggplot2)

ggplot(data, aes(age, trestbps, color = factor(condition))) +
  geom_point(size = 3) +
  labs(
    title = "Age vs. Resting Blood Pressure",
    x = "Age",
    y = "Resting Blood Pressure",
    color = "Condition"
  ) +
  theme_minimal()

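11. Interpretation

The model indicates that sex, chest pain type (cp), resting blood pressure (trestbps), cholesterol (chol), and maximum heart rate (thalach) are significantly associated with heart disease at the 5% level. Patients with sex = 1 have about 11.9 times the odds of heart disease compared with sex = 0, and each one-step increase in cp multiplies the odds by about 3.1. Each one-unit increase in trestbps multiplies the odds by about 1.03 and each one-unit increase in chol by about 1.008, while each one-unit increase in thalach multiplies them by about 0.96. Age has a positive coefficient, but it is not statistically significant (p = 0.28) once the other predictors are included. The oldpeak estimates cannot be interpreted, because the variable entered the model as a categorical predictor.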

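12. Conclusion

Logistic regression identified several factors associated with heart disease in this dataset and classified about 86% of patients correctly on the data used for fitting. The model can serve as a starting point for early detection and health-related decision-making, provided that oldpeak is treated as a numeric predictor and that performance is confirmed on data not used for fitting.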