Heart Attack Data Analysis

1 Data pre-processing

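library(tidyverse)  # dplyr (glimpse, group_by, count) and ggplot2
library(ggthemes)   # scale_color_colorblind()
df <- read.csv("../data/Heart Attack Dataset/Medicaldataset.csv")
glimpse(df)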

Rows: 1,319
Columns: 9
$ Age                      <int> 64, 21, 55, 64, 55, 58, 32, 63, 44, 67, 44, 6…
$ Gender                   <int> 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, …
$ Heart.rate               <int> 66, 94, 64, 70, 64, 61, 40, 60, 60, 61, 60, 6…
$ Systolic.blood.pressure  <int> 160, 98, 160, 120, 112, 112, 179, 214, 154, 1…
$ Diastolic.blood.pressure <int> 83, 46, 77, 55, 65, 58, 68, 82, 81, 95, 90, 8…
$ Blood.sugar              <dbl> 160, 296, 270, 270, 300, 87, 102, 87, 135, 10…
$ CK.MB                    <dbl> 1.800, 6.750, 1.990, 13.870, 1.080, 1.830, 0.…
$ Troponin                 <dbl> 0.012, 1.060, 0.003, 0.122, 0.003, 0.004, 0.0…
$ Result                   <chr> "negative", "positive", "negative", "positive…

The above data is from (“Heart Attack Dataset” n.d.).

The model is going to be a simple linear model at first. Then we’ll explore other methods.

df |> 
  group_by(Gender, Result) |> 
  count()

# A tibble: 4 × 3
# Groups:   Gender, Result [4]
  Gender Result       n
   <int> <chr>    <int>
1      0 negative   202
2      0 positive   247
3      1 negative   307
4      1 positive   563
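
The counts above are easier to compare as proportions within each gender. The following is a minimal sketch that re-enters the four counts by hand (the `counts` and `rates` names are illustrative, not part of the original analysis) and computes row-wise shares:

```r
# Counts copied from the group_by()/count() output above.
# Rows are the two Gender codes, columns are the two Result labels.
counts <- matrix(
  c(202, 247,
    307, 563),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("0", "1"), Result = c("negative", "positive"))
)

# margin = 1 divides each row by its row total,
# giving the share of negative/positive cases within each gender.
rates <- prop.table(counts, margin = 1)
print(round(rates, 3))

# Sanity check: the four cells should add up to the number of rows in df.
print(sum(counts))
```

With these counts, the positive share works out to roughly 0.55 for Gender 0 and 0.65 for Gender 1.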

df$Result <- ifelse(df$Result == "positive", 1, 0)

head(df)

  Age Gender Heart.rate Systolic.blood.pressure Diastolic.blood.pressure
1  64      1         66                     160                       83
2  21      1         94                      98                       46
3  55      1         64                     160                       77
4  64      1         70                     120                       55
5  55      1         64                     112                       65
6  58      0         61                     112                       58
  Blood.sugar CK.MB Troponin Result
1         160  1.80    0.012      0
2         296  6.75    1.060      1
3         270  1.99    0.003      0
4         270 13.87    0.122      1
5         300  1.08    0.003      0
6          87  1.83    0.004      0

2 Visualization

df |> 
  filter(Result == 1) |> 
  ggplot(aes(
    Age
  )) +
  geom_histogram() +
  labs(title = "Number of positive cases") +
  scale_color_colorblind()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

df |> 
  filter(Result == 0) |> 
  ggplot(aes(
    Age
  )) +
  geom_histogram() +
  labs(title = "Number of negative cases") +
  scale_color_colorblind()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The following three plots each show a positive correlation between a predictor (Age, CK.MB, and Troponin, respectively) and Result:

df |> 
  ggplot(aes(x = Age, y = Result)) +
  geom_point() +
  geom_smooth() +
  theme_minimal() +
  scale_color_colorblind()

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

df |> 
  ggplot(aes(x = CK.MB, y = Result)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  theme_minimal() +
  scale_color_colorblind()

`geom_smooth()` using formula = 'y ~ x'

df |> 
  ggplot(aes(x = Troponin, y = Result)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  theme_minimal() +
  scale_color_colorblind()

`geom_smooth()` using formula = 'y ~ x'

The following graph suggests that Age and CK.MB are essentially uncorrelated, so there is little linear association between them.

df |> 
  ggplot(aes(x = Age, y = CK.MB)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  theme_minimal() +
  scale_color_colorblind()

`geom_smooth()` using formula = 'y ~ x'

The following graph suggests that Troponin and CK.MB are essentially uncorrelated as well, so there is little linear association between them.

df |> 
  ggplot(aes(x = Troponin, y = CK.MB)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  theme_minimal() +
  scale_color_colorblind()

`geom_smooth()` using formula = 'y ~ x'

df |> 
  ggplot(aes(
    x = Systolic.blood.pressure,
    y = Diastolic.blood.pressure,
    color = factor(Result)
  )) + 
  geom_point() +
  geom_smooth()+
  theme_minimal() +
  scale_color_colorblind()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

This plot doesn’t really tell us much. The correlations between the fields:

cor(df)

                                  Age       Gender   Heart.rate
Age                       1.000000000 -0.092873556 -0.023439847
Gender                   -0.092873556  1.000000000 -0.026780580
Heart.rate               -0.023439847 -0.026780580  1.000000000
Systolic.blood.pressure   0.017440525  0.011065277  0.010882130
Diastolic.blood.pressure  0.002614212 -0.009369901  0.108353464
Blood.sugar              -0.004192985  0.006667896 -0.019583957
CK.MB                     0.018418737  0.017526624 -0.013001040
Troponin                  0.088800234  0.065793391  0.011179854
Result                    0.238096935  0.094432315  0.006920486
                         Systolic.blood.pressure Diastolic.blood.pressure
Age                                   0.01744053              0.002614212
Gender                                0.01106528             -0.009369901
Heart.rate                            0.01088213              0.108353464
Systolic.blood.pressure               1.00000000              0.586166317
Diastolic.blood.pressure              0.58616632              1.000000000
Blood.sugar                           0.02080740             -0.025614103
CK.MB                                -0.01639595             -0.023403458
Troponin                              0.04372877              0.043360050
Result                               -0.02082502             -0.009659034
                          Blood.sugar       CK.MB    Troponin       Result
Age                      -0.004192985  0.01841874  0.08880023  0.238096935
Gender                    0.006667896  0.01752662  0.06579339  0.094432315
Heart.rate               -0.019583957 -0.01300104  0.01117985  0.006920486
Systolic.blood.pressure   0.020807397 -0.01639595  0.04372877 -0.020825017
Diastolic.blood.pressure -0.025614103 -0.02340346  0.04336005 -0.009659034
Blood.sugar               1.000000000  0.04575658  0.02106887 -0.033059441
CK.MB                     0.045756581  1.00000000 -0.01600835  0.217719724
Troponin                  0.021068866 -0.01600835  1.00000000  0.229376305
Result                   -0.033059441  0.21771972  0.22937631  1.000000000
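
The matrix is wide, so the Result column is the part that matters most for modeling. As a small sketch, the values below are copied by hand from that column (with the data loaded, `cor(df)[, "Result"]` would give the same vector including Result itself), and the `result.cor` and `ranked` names are only illustrative:

```r
# Correlations with Result, copied from the last column of the matrix above.
result.cor <- c(
  Age                      =  0.238096935,
  Gender                   =  0.094432315,
  Heart.rate               =  0.006920486,
  Systolic.blood.pressure  = -0.020825017,
  Diastolic.blood.pressure = -0.009659034,
  Blood.sugar              = -0.033059441,
  CK.MB                    =  0.217719724,
  Troponin                 =  0.229376305
)

# Rank the predictors by the absolute size of their correlation with Result.
ranked <- result.cor[order(abs(result.cor), decreasing = TRUE)]
print(round(ranked, 3))
```

Age, Troponin, and CK.MB come out on top, with Gender a distant fourth. Because Result is binary, these are Pearson correlations against a 0/1 variable, so they only capture linear association.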

3 Generalized Linear Model

Fitting a Generalized Linear Model with a binomial family (a logistic regression), since Result is binary:

glm.fits.first <- glm(
  Result ~ Age + Gender + Heart.rate + Systolic.blood.pressure + Diastolic.blood.pressure + Blood.sugar + CK.MB + Troponin,
  data = df, family = binomial
)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

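The warning means that, for some observations, the fitted probabilities are numerically indistinguishable from 0 or 1. This typically happens when some combination of predictors separates the outcomes perfectly or almost perfectly (i.e., the y = 1 cases have certain predictor values and the y = 0 cases have different ones). The logistic function becomes very steep in such situations, driving predicted probabilities toward exactly 0 or 1, which can make the coefficient estimates and standard errors numerically unstable.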
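
To see the mechanism in isolation, here is a minimal sketch on made-up toy data (not the heart attack dataset) in which a single predictor separates the two classes perfectly. The `toy`, `toy.fit`, and `msgs` names are hypothetical and used only for this illustration:

```r
# Toy data (hypothetical): every x <= 4 has y = 0 and every x >= 5 has y = 1,
# so x separates the outcome perfectly.
toy <- data.frame(x = 1:8, y = c(0, 0, 0, 0, 1, 1, 1, 1))

# Collect the warning messages instead of printing them,
# and let the fit continue after each warning.
msgs <- character(0)
toy.fit <- withCallingHandlers(
  glm(y ~ x, data = toy, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(msgs)                        # warnings raised while fitting
print(round(fitted(toy.fit), 6))   # fitted probabilities pile up at 0 and 1
print(coef(toy.fit))               # the slope grows very large under separation
```

In a fit like this the slope estimate and its standard error are not trustworthy. Whether the same is happening in the model above is a separate question that the warning alone does not settle.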

summary(glm.fits.first)


Call:
glm(formula = Result ~ Age + Gender + Heart.rate + Systolic.blood.pressure + 
    Diastolic.blood.pressure + Blood.sugar + CK.MB + Troponin, 
    family = binomial, data = df)

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -4.2678852  0.6139782  -6.951 3.62e-12 ***
Age                       0.0507890  0.0059633   8.517  < 2e-16 ***
Gender                    0.4322776  0.1558576   2.774  0.00554 ** 
Heart.rate                0.0005790  0.0015397   0.376  0.70688    
Systolic.blood.pressure  -0.0036050  0.0035317  -1.021  0.30738    
Diastolic.blood.pressure  0.0033561  0.0064365   0.521  0.60208    
Blood.sugar              -0.0009792  0.0009623  -1.018  0.30889    
CK.MB                     0.3583029  0.0361816   9.903  < 2e-16 ***
Troponin                  5.4970535  0.7435150   7.393 1.43e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1759.2  on 1318  degrees of freedom
Residual deviance: 1074.2  on 1310  degrees of freedom
AIC: 1092.2

Number of Fisher Scoring iterations: 10

So, if the data is linearly modeled, Heart.rate, Systolic.blood.pressure, Diastolic.blood.pressure, and Blood.sugar are not statistically significant (each has a p-value above 0.3). Fitting a new model after removing these predictors:

glm.fits.second <- glm(
  Result ~ Age + Gender + CK.MB + Troponin,
  data = df, family = binomial
)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(glm.fits.second)


Call:
glm(formula = Result ~ Age + Gender + CK.MB + Troponin, family = binomial, 
    data = df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.57256    0.41294 -11.073  < 2e-16 ***
Age          0.05067    0.00596   8.501  < 2e-16 ***
Gender       0.42298    0.15518   2.726  0.00642 ** 
CK.MB        0.35909    0.03615   9.932  < 2e-16 ***
Troponin     5.51797    0.74691   7.388 1.49e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1759.2  on 1318  degrees of freedom
Residual deviance: 1075.9  on 1314  degrees of freedom
AIC: 1085.9

Number of Fisher Scoring iterations: 10

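So, a Generalized Linear Model with only Age, Gender, CK.MB, and Troponin as predictors is good enough: dropping the other four predictors raises the residual deviance only from 1074.2 to 1075.9, and the AIC falls from 1092.2 to 1085.9.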
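
One way to check this comparison more formally is a likelihood-ratio test between the two nested models. The sketch below redoes that arithmetic by hand from the deviances and degrees of freedom printed in the two summaries above. The variable names are illustrative, and because the printed deviances are rounded to one decimal the result is approximate. With the fitted objects at hand, `anova(glm.fits.second, glm.fits.first, test = "Chisq")` performs the same test.

```r
# Residual deviances and residual degrees of freedom,
# copied from the two summary() outputs above.
dev.full    <- 1074.2; df.full    <- 1310  # all eight predictors
dev.reduced <- 1075.9; df.reduced <- 1314  # Age, Gender, CK.MB, Troponin

# Likelihood-ratio statistic: the increase in deviance from dropping predictors.
lr.stat <- dev.reduced - dev.full
lr.df   <- df.reduced - df.full   # number of predictors dropped

# Under the null hypothesis that the dropped coefficients are all zero,
# the statistic is approximately chi-squared with lr.df degrees of freedom.
p.value <- pchisq(lr.stat, df = lr.df, lower.tail = FALSE)

print(c(statistic = lr.stat, df = lr.df, p.value = p.value))
```

The statistic is about 1.7 on 4 degrees of freedom, giving a p-value of roughly 0.79, so the four dropped predictors do not measurably improve the fit.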

4 References

“Heart Attack Dataset.” n.d. Accessed June 30, 2025. https://www.kaggle.com/datasets/fatemehmohammadinia/heart-attack-dataset-tarik-a-rashid.