1 Case Study

The heart disease data set in this case study contains 1025 observations on 13 variables. The data set is available from the Kaggle data repository and is loaded from my GitHub repository.

1.1 Data and Variable Descriptions

There are 13 variables in the data set.

  1. age: Age (years)

  2. sex: (Male=1, Female=0)

  3. chest pain type (cp): 4 values, increasing in pain

  4. resting blood pressure (trestbps): blood pressure on admission to the hospital (mm Hg)

  5. serum cholesterol (chol): cholesterol in mg/dl

  6. fasting blood sugar (fbs): indicator for fasting blood sugar > 120 mg/dl

  7. resting electrocardiographic results (restecg): values 0, 1, 2

  8. maximum heart rate achieved (thalach): bpm

  9. exercise induced angina (exang): 1 = yes, 0 = no

  10. oldpeak: ST depression induced by exercise relative to rest

  11. slope: the slope of the peak exercise ST segment

  12. number of major vessels (ca): 0-3, colored by fluoroscopy

  13. thal: 0 = normal; 1 = fixed defect; 2 = reversible defect


1.2 Research Question

The objective of this case study is to identify the risk factors for heart disease.

1.3 Exploratory Analysis

We first make the following pairwise scatter plots to inspect the potential issues with predictor variables.

From the correlation matrix plot, we can see several patterns in the predictor variables.

  • All predictor variables are unimodal, but oldpeak and age are noticeably skewed. We next take a closer look at the frequency distributions of these two variables.

Based on the histograms, we discretize oldpeak and age into groups.

  • A moderate correlation is observed in several pairs of variables: age vs. resting blood pressure, serum cholesterol vs. maximum heart rate achieved, and oldpeak vs. thal. We will not drop any of these variables for the moment. Instead, we will use an automatic variable selection process to remove potentially redundant variables, while forcing a few variables to stay in the final model.
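The discretization step mentioned above can be sketched in Python as interval-based binning. This is only an illustration: the cut points, the labels "1-2" and "30 or less", and the helper name make_binner are assumptions, since the report does not state the exact group boundaries. Binning on intervals (rather than matching individual values) guarantees that every observed value lands in exactly one group.

```python
from bisect import bisect_right


def make_binner(edges, labels):
    """Return a function that maps a number to a group label.

    edges  : sorted interior cut points
    labels : one label per interval, so len(labels) == len(edges) + 1
    A value equal to a cut point is placed in the upper interval.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than cut points")
    if list(edges) != sorted(edges):
        raise ValueError("cut points must be sorted")

    def binner(x):
        return labels[bisect_right(edges, x)]

    return binner


# Hypothetical cut points, chosen only to mirror the group names in the tables.
group_oldpeak = make_binner([1.0, 2.0, 4.0], ["0-1", "1-2", "2-4", "4+"])
group_age = make_binner([31, 41, 51], ["30 or less", "31-40", "41-50", "50+"])

# A few oldpeak values that appear as separate levels in the coefficient tables.
for value in (0.1, 1.5, 2.3, 3.8, 4.2, 6.2):
    print(value, "->", group_oldpeak(value))
```

With binning of this kind, grp.oldpeak would contribute three dummy variables (four groups minus the baseline) instead of one dummy per distinct oldpeak value.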

2 Building the Multiple Logistic Regression Model

Based on the above exploratory analysis, we first build the full model, which contains all predictors, and the smallest (reduced) model, which contains only the variables forced into the model.
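For reference, the multiple logistic regression model behind these fits can be written as follows. The notation is introduced here for illustration and does not come from the original output: p is the probability that the response equals 1, x_1, ..., x_k are the predictors (with dummy variables for grouped factors), and the beta_j are the values reported in the Estimate column.

```latex
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad
p = \frac{1}{1 + \exp\{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)\}}
```

Each coefficient is therefore a change in the log-odds for a one-unit increase in its predictor, holding the other predictors fixed.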

Summary of inferential statistics of the full model
Estimate Std. Error z value Pr(>|z|)
(Intercept) 17.7770105 3261.3196841 0.0054509 0.9956509
grp.age31-40 -15.6846515 3261.3193646 -0.0048093 0.9961628
grp.age41-50 -15.5361909 3261.3193401 -0.0047638 0.9961991
grp.age50 + -15.8401537 3261.3193423 -0.0048570 0.9961247
sex -2.3840503 0.3056434 -7.8001046 0.0000000
cp 1.1173690 0.1272111 8.7835834 0.0000000
trestbps -0.0269028 0.0068124 -3.9490955 0.0000784
chol -0.0082743 0.0023862 -3.4675065 0.0005253
fbs 0.1254052 0.3243435 0.3866432 0.6990204
restecg 0.6425300 0.2241524 2.8664869 0.0041506
thalach 0.0347898 0.0068422 5.0845515 0.0000004
exang -0.5676446 0.2733258 -2.0768056 0.0378195
grp.oldpeak0.1 0.9637366 0.6044114 1.5945042 0.1108231
grp.oldpeak0.2 1.1557692 0.5777735 2.0003845 0.0454588
grp.oldpeak0.3 -0.8487734 0.8218813 -1.0327201 0.3017349
grp.oldpeak0.4 3.1895362 0.7333222 4.3494338 0.0000136
grp.oldpeak0.5 2.3154285 0.7719284 2.9995380 0.0027039
grp.oldpeak0.6 0.2611654 0.4488352 0.5818737 0.5606517
grp.oldpeak0.7 12.7327811 3765.8472120 0.0033811 0.9973023
grp.oldpeak0.8 -1.0184317 0.4589374 -2.2191080 0.0264794
grp.oldpeak0.9 -1.2864320 3.9075861 -0.3292140 0.7419939
grp.oldpeak1.1 14.8044829 2569.5446520 0.0057615 0.9954030
grp.oldpeak1.2 0.1380403 0.4519899 0.3054058 0.7600571
grp.oldpeak1.3 17.2627370 3765.8471721 0.0045840 0.9963425
grp.oldpeak1.4 -1.0977389 0.5368421 -2.0448078 0.0408738
grp.oldpeak1.5 3.7891830 1.8564152 2.0411290 0.0412380
grp.oldpeak1.6 0.5578858 0.6056644 0.9211138 0.3569910
grp.oldpeak1.8 -0.6965062 0.6285086 -1.1081890 0.2677802
grp.oldpeak1.9 0.4049483 0.9869833 0.4102889 0.6815940
grp.oldpeak2-4 -0.1679775 0.5402558 -0.3109222 0.7558598
grp.oldpeak2.1 -15.5141954 3765.8471932 -0.0041197 0.9967130
grp.oldpeak2.2 -18.1136096 1285.3096000 -0.0140928 0.9887559
grp.oldpeak2.3 18.6216203 2325.7766320 0.0080066 0.9936117
grp.oldpeak2.4 1.4263679 1.8105423 0.7878125 0.4308064
grp.oldpeak2.5 -17.5483804 2164.6318233 -0.0081069 0.9935317
grp.oldpeak2.6 -1.6640836 0.8918688 -1.8658391 0.0620639
grp.oldpeak2.8 -16.4148858 1191.5722538 -0.0137758 0.9890088
grp.oldpeak2.9 -15.2154798 3765.8471911 -0.0040404 0.9967762
grp.oldpeak3.1 -14.6785989 3261.3193528 -0.0045008 0.9964089
grp.oldpeak3.2 -17.5856667 1813.5221133 -0.0096970 0.9922631
grp.oldpeak3.4 -15.7824630 1786.9002021 -0.0088323 0.9929529
grp.oldpeak3.5 17.0036604 3765.8472098 0.0045152 0.9963974
grp.oldpeak3.6 -17.7101823 1398.3105849 -0.0126654 0.9898947
grp.oldpeak3.8 -21.0372538 3261.3193706 -0.0064505 0.9948533
grp.oldpeak4.2 4.2016022 1.4470081 2.9036480 0.0036884
grp.oldpeak4.4 -15.4256414 3261.3194036 -0.0047299 0.9962261
grp.oldpeak4+ -14.2877180 1759.4724112 -0.0081205 0.9935209
grp.oldpeak5.6 -13.5491346 3261.3193553 -0.0041545 0.9966852
grp.oldpeak6.2 -14.7723736 3765.8472186 -0.0039227 0.9968701
slope 1.0882222 0.2275418 4.7825152 0.0000017
ca -0.7230864 0.1137678 -6.3558086 0.0000000
thal -0.9814599 0.1960230 -5.0068602 0.0000006
Summary of inferential statistics of the reduced model
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.1559428 1.0453082 1.105839 0.2687961
sex -1.8030934 0.2236844 -8.060882 0.0000000
cp 0.8811601 0.0896764 9.825997 0.0000000
trestbps -0.0242039 0.0049669 -4.873073 0.0000011
chol -0.0063886 0.0018220 -3.506455 0.0004541
restecg 0.3676481 0.1668241 2.203807 0.0275379
thalach 0.0428564 0.0048002 8.928120 0.0000000
ca -0.7223292 0.0921831 -7.835812 0.0000000
thal -0.8744961 0.1432756 -6.103596 0.0000000
Summary of inferential statistics of the final model
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.2255795 1.4166591 1.5710057 0.1161813
sex -2.3746600 0.3062891 -7.7530008 0.0000000
cp 1.1203404 0.1263510 8.8668875 0.0000000
trestbps -0.0288793 0.0065522 -4.4075841 0.0000105
chol -0.0086249 0.0023438 -3.6798513 0.0002334
restecg 0.6209798 0.2222343 2.7942574 0.0052019
thalach 0.0366217 0.0066801 5.4822257 0.0000000
ca -0.7331907 0.1099731 -6.6670036 0.0000000
thal -0.9960030 0.1937944 -5.1394821 0.0000003
grp.oldpeak0.1 0.9283190 0.5932033 1.5649256 0.1176004
grp.oldpeak0.2 1.1546258 0.5753361 2.0068717 0.0447633
grp.oldpeak0.3 -0.9137710 0.8117580 -1.1256692 0.2603056
grp.oldpeak0.4 3.0629522 0.7242206 4.2293082 0.0000234
grp.oldpeak0.5 2.2636371 0.7722752 2.9311275 0.0033773
grp.oldpeak0.6 0.2318603 0.4460112 0.5198532 0.6031659
grp.oldpeak0.7 12.6799213 3765.8471824 0.0033671 0.9973135
grp.oldpeak0.8 -0.9894935 0.4588671 -2.1563837 0.0310537
grp.oldpeak0.9 -1.2924146 4.0952858 -0.3155859 0.7523168
grp.oldpeak1.1 14.8226640 2596.8394240 0.0057080 0.9954457
grp.oldpeak1.2 0.1055085 0.4495358 0.2347053 0.8144374
grp.oldpeak1.3 17.1384262 3765.8471616 0.0045510 0.9963688
grp.oldpeak1.4 -1.1318034 0.5349980 -2.1155281 0.0343850
grp.oldpeak1.5 3.7925097 1.8875253 2.0092497 0.0445107
grp.oldpeak1.6 0.5147364 0.6113042 0.8420299 0.3997712
grp.oldpeak1.8 -0.8162161 0.6219081 -1.3124385 0.1893722
grp.oldpeak1.9 0.4333954 0.9890490 0.4381940 0.6612456
grp.oldpeak2-4 -0.1273885 0.5433961 -0.2344303 0.8146509
grp.oldpeak2.1 -15.5864170 3765.8471915 -0.0041389 0.9966977
grp.oldpeak2.2 -18.1285787 1286.0697801 -0.0140961 0.9887533
grp.oldpeak2.3 18.5559033 2323.2672835 0.0079870 0.9936274
grp.oldpeak2.4 1.3531487 1.8145593 0.7457175 0.4558381
grp.oldpeak2.5 -17.6847700 2104.3132865 -0.0084041 0.9932946
grp.oldpeak2.6 -1.7598806 0.8805238 -1.9986748 0.0456436
grp.oldpeak2.8 -16.3520444 1204.6498420 -0.0135741 0.9891698
grp.oldpeak2.9 -15.1706838 3765.8471890 -0.0040285 0.9967857
grp.oldpeak3.1 -14.7143378 3261.3193441 -0.0045118 0.9964001
grp.oldpeak3.2 -17.7192432 1808.0318388 -0.0098003 0.9921806
grp.oldpeak3.4 -15.6961237 1799.2248578 -0.0087238 0.9930395
grp.oldpeak3.5 16.9077788 3765.8471805 0.0044898 0.9964177
grp.oldpeak3.6 -17.5870658 1424.6953631 -0.0123444 0.9901508
grp.oldpeak3.8 -21.0706515 3261.3193419 -0.0064608 0.9948451
grp.oldpeak4.2 4.1719677 1.4677825 2.8423610 0.0044781
grp.oldpeak4.4 -15.5554409 3261.3193976 -0.0047697 0.9961944
grp.oldpeak4+ -14.2994496 1744.9482911 -0.0081948 0.9934616
grp.oldpeak5.6 -13.6032141 3261.3193538 -0.0041711 0.9966720
grp.oldpeak6.2 -14.8790386 3765.8472159 -0.0039510 0.9968475
slope 1.0494403 0.2207043 4.7549601 0.0000020
exang -0.5970425 0.2700241 -2.2110710 0.0270309
Comparison of global goodness-of-fit statistics
Residual.Deviance Null.Deviance AIC
full.model 594.4229 1420.24 698.4229
reduced.model 818.7202 1420.24 836.7202
final.model 596.0881 1420.24 692.0881

Comparing the AIC of the three models, the final model has the lowest value (692.09, versus 698.42 for the full model and 836.72 for the reduced model), so it offers the best trade-off between goodness of fit and model complexity among the three.
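The AIC values in the table can be checked by hand. For a logistic regression fitted to individual binary observations, AIC equals the residual deviance plus twice the number of estimated coefficients. The short Python sketch below applies that identity; the coefficient counts (52, 9, and 48) were counted from the rows of the three coefficient tables, and the function name is our own.

```python
def aic_from_deviance(residual_deviance, n_coefficients):
    """AIC for a binary-response GLM fitted to ungrouped data:
    residual deviance plus a penalty of 2 per estimated coefficient."""
    return residual_deviance + 2 * n_coefficients


# (residual deviance, number of coefficients including the intercept)
models = {
    "full.model": (594.4229, 52),
    "reduced.model": (818.7202, 9),
    "final.model": (596.0881, 48),
}

aic = {name: aic_from_deviance(dev, k) for name, (dev, k) in models.items()}
for name, value in aic.items():
    print(f"{name}: AIC = {value:.4f}")

# The preferred model under this criterion is the one with the smallest AIC.
best = min(aic, key=aic.get)
print("lowest AIC:", best)
```

The final model's deviance is only slightly higher than the full model's, while it uses four fewer coefficients, which is why its AIC is lower.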

3 Final Model

In the exploratory analysis, we observed that a few of the variables are linearly correlated. Relative to the full model, the final model drops age (grp.age) and fasting blood sugar (fbs); slope is retained. Of the three candidate models this one has the lowest AIC, and we call it the final model.

Summary Stats with Odds Ratios
Estimate Std. Error z value Pr(>|z|) odds.ratio
(Intercept) 2.2255795 1.4166591 1.5710057 0.1161813 9.258847e+00
sex -2.3746600 0.3062891 -7.7530008 0.0000000 9.304610e-02
cp 1.1203404 0.1263510 8.8668875 0.0000000 3.065898e+00
trestbps -0.0288793 0.0065522 -4.4075841 0.0000105 9.715338e-01
chol -0.0086249 0.0023438 -3.6798513 0.0002334 9.914121e-01
restecg 0.6209798 0.2222343 2.7942574 0.0052019 1.860750e+00
thalach 0.0366217 0.0066801 5.4822257 0.0000000 1.037300e+00
ca -0.7331907 0.1099731 -6.6670036 0.0000000 4.803738e-01
thal -0.9960030 0.1937944 -5.1394821 0.0000003 3.693528e-01
grp.oldpeak0.1 0.9283190 0.5932033 1.5649256 0.1176004 2.530252e+00
grp.oldpeak0.2 1.1546258 0.5753361 2.0068717 0.0447633 3.172836e+00
grp.oldpeak0.3 -0.9137710 0.8117580 -1.1256692 0.2603056 4.010092e-01
grp.oldpeak0.4 3.0629522 0.7242206 4.2293082 0.0000234 2.139061e+01
grp.oldpeak0.5 2.2636371 0.7722752 2.9311275 0.0033773 9.618007e+00
grp.oldpeak0.6 0.2318603 0.4460112 0.5198532 0.6031659 1.260944e+00
grp.oldpeak0.7 12.6799213 3765.8471824 0.0033671 0.9973135 3.212328e+05
grp.oldpeak0.8 -0.9894935 0.4588671 -2.1563837 0.0310537 3.717649e-01
grp.oldpeak0.9 -1.2924146 4.0952858 -0.3155859 0.7523168 2.746069e-01
grp.oldpeak1.1 14.8226640 2596.8394240 0.0057080 0.9954457 2.737796e+06
grp.oldpeak1.2 0.1055085 0.4495358 0.2347053 0.8144374 1.111276e+00
grp.oldpeak1.3 17.1384262 3765.8471616 0.0045510 0.9963688 2.774112e+07
grp.oldpeak1.4 -1.1318034 0.5349980 -2.1155281 0.0343850 3.224512e-01
grp.oldpeak1.5 3.7925097 1.8875253 2.0092497 0.0445107 4.436761e+01
grp.oldpeak1.6 0.5147364 0.6113042 0.8420299 0.3997712 1.673197e+00
grp.oldpeak1.8 -0.8162161 0.6219081 -1.3124385 0.1893722 4.421014e-01
grp.oldpeak1.9 0.4333954 0.9890490 0.4381940 0.6612456 1.542486e+00
grp.oldpeak2-4 -0.1273885 0.5433961 -0.2344303 0.8146509 8.803916e-01
grp.oldpeak2.1 -15.5864170 3765.8471915 -0.0041389 0.9966977 2.000000e-07
grp.oldpeak2.2 -18.1285787 1286.0697801 -0.0140961 0.9887533 0.000000e+00
grp.oldpeak2.3 18.5559033 2323.2672835 0.0079870 0.9936274 1.144792e+08
grp.oldpeak2.4 1.3531487 1.8145593 0.7457175 0.4558381 3.869591e+00
grp.oldpeak2.5 -17.6847700 2104.3132865 -0.0084041 0.9932946 0.000000e+00
grp.oldpeak2.6 -1.7598806 0.8805238 -1.9986748 0.0456436 1.720654e-01
grp.oldpeak2.8 -16.3520444 1204.6498420 -0.0135741 0.9891698 1.000000e-07
grp.oldpeak2.9 -15.1706838 3765.8471890 -0.0040285 0.9967857 3.000000e-07
grp.oldpeak3.1 -14.7143378 3261.3193441 -0.0045118 0.9964001 4.000000e-07
grp.oldpeak3.2 -17.7192432 1808.0318388 -0.0098003 0.9921806 0.000000e+00
grp.oldpeak3.4 -15.6961237 1799.2248578 -0.0087238 0.9930395 2.000000e-07
grp.oldpeak3.5 16.9077788 3765.8471805 0.0044898 0.9964177 2.202698e+07
grp.oldpeak3.6 -17.5870658 1424.6953631 -0.0123444 0.9901508 0.000000e+00
grp.oldpeak3.8 -21.0706515 3261.3193419 -0.0064608 0.9948451 0.000000e+00
grp.oldpeak4.2 4.1719677 1.4677825 2.8423610 0.0044781 6.484292e+01
grp.oldpeak4.4 -15.5554409 3261.3193976 -0.0047697 0.9961944 2.000000e-07
grp.oldpeak4+ -14.2994496 1744.9482911 -0.0081948 0.9934616 6.000000e-07
grp.oldpeak5.6 -13.6032141 3261.3193538 -0.0041711 0.9966720 1.200000e-06
grp.oldpeak6.2 -14.8790386 3765.8472159 -0.0039510 0.9968475 3.000000e-07
slope 1.0494403 0.2207043 4.7549601 0.0000020 2.856052e+00
exang -0.5970425 0.2700241 -2.2110710 0.0270309 5.504372e-01

The interpretation of the odds ratios is similar to the case of simple logistic regression. The grouped oldpeak variable grp.oldpeak was intended to have four categories with 0-1 as the baseline, but the table above shows that many individual oldpeak values (for example 0.1, 1.5, and 3.8) remain as separate levels, so the regrouping was only partly applied. We can see from the table that the odds of heart disease do not follow a clear pattern as oldpeak (ST depression induced by exercise relative to rest) increases. For example, the odds ratio associated with the oldpeak group 2-4 is 0.880, meaning that, holding everything else fixed, the odds of heart disease in the 2-4 group are about 12% lower than in the baseline group 0-1; this difference is not statistically significant (p = 0.81). For the 4+ group the estimated odds ratio is about 6e-07, not 6, and its standard error (1744.9) is enormous. Levels with standard errors in the thousands typically contain very few observations or only one outcome class, so their odds ratios should not be interpreted. This is why the estimates for neighboring oldpeak levels appear to jump around.
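The odds.ratio column is the exponential of the Estimate column, and a Wald 95% confidence interval follows from the standard error. The Python sketch below recomputes these for a few rows of the final model table and flags rows whose standard error is implausibly large. The function names and the threshold of 100 are illustrative assumptions, not part of the original analysis.

```python
import math


def odds_ratio_with_ci(estimate, std_error, z=1.96):
    """Return (odds ratio, lower bound, upper bound) of a Wald interval."""
    return (
        math.exp(estimate),
        math.exp(estimate - z * std_error),
        math.exp(estimate + z * std_error),
    )


def looks_separated(std_error, threshold=100.0):
    """Heuristic flag: a very large standard error suggests the level has
    too few observations or only one outcome class (assumed threshold)."""
    return std_error > threshold


# (estimate, standard error) copied from the final model table
rows = {
    "sex": (-2.3746600, 0.3062891),
    "cp": (1.1203404, 0.1263510),
    "grp.oldpeak2-4": (-0.1273885, 0.5433961),
    "grp.oldpeak4+": (-14.2994496, 1744.9482911),
}

for name, (est, se) in rows.items():
    if looks_separated(se):
        print(f"{name}: estimate unstable (SE = {se:.1f}), do not interpret")
    else:
        odds, lower, upper = odds_ratio_with_ci(est, se)
        print(f"{name}: OR = {odds:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```

For grp.oldpeak2-4 the interval includes 1, which matches its large p-value in the table.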

4 Summary and Conclusion

The case study focused on the association between a set of potential risk factors and heart disease. The initial data set has 13 numerical and categorical variables.

After exploratory analysis, we regrouped two variables, oldpeak and age, and then defined dummy variables for the resulting groups. These new group variables were used in the model search process.

Since sex, cp, trestbps, chol, restecg, thalach, ca, and thal are considered to be major contributors to the development of heart disease, we include these risk factors in the final model regardless of their statistical significance.

After automatic variable selection, we obtain the final model with 11 predictors: sex, cp, trestbps, chol, restecg, thalach, ca, thal, slope, exang, and the grouped oldpeak variable, which enters the model through its dummy variables.