The heart disease data set in this case study contains 1025 observations on 13 variables. The data set is available from the Kaggle data repository and is loaded from my GitHub repository.
The 13 variables are listed below, with the short name used in the model output given in parentheses.
age (age): age in years
sex (sex): 1 = male, 0 = female
chest pain type (cp): 4 values, increasing in pain
resting blood pressure (trestbps): mm Hg
serum cholesterol (chol): mg/dl
fasting blood sugar (fbs): > 120 mg/dl (1 = true, 0 = false)
resting electrocardiographic results (restecg): values 0, 1, 2
maximum heart rate achieved (thalach): bpm
exercise induced angina (exang): 1 = yes, 0 = no
oldpeak (oldpeak): ST depression induced by exercise relative to rest
slope (slope): the slope of the peak exercise ST segment
number of major vessels (ca): 0-3, colored by fluoroscopy
thal (thal): 0 = normal; 1 = fixed defect; 2 = reversible defect
The objective of this case study is to identify the risk factors for heart disease.
We first make the following pairwise scatter plots to inspect the potential issues with predictor variables.
From the correlation matrix plot, we can see several patterns in the predictor variables.
Based on the above histogram, we discretize oldpeak and age into the grouped variables grp.oldpeak and grp.age.
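The regrouping step can be sketched as range-based binning. The short Python example below is only an illustration: the cut points (1, 2, and 4), the label "1-2", and the helper name `group_oldpeak` are assumptions, since the exact boundaries used in the analysis are not shown in this report.

```python
from bisect import bisect_right

# Assumed cut points and labels (hypothetical; the report does not list them).
# Intervals are half-open: [0, 1), [1, 2), [2, 4), [4, inf).
OLDPEAK_EDGES = [1.0, 2.0, 4.0]
OLDPEAK_LABELS = ["0-1", "1-2", "2-4", "4+"]


def group_oldpeak(value):
    """Map a numeric oldpeak value to one of four range labels."""
    if value < 0:
        raise ValueError("oldpeak is expected to be non-negative")
    # bisect_right counts how many cut points are <= value,
    # which is exactly the index of the interval containing it.
    return OLDPEAK_LABELS[bisect_right(OLDPEAK_EDGES, value)]


sample = [0.0, 0.7, 1.4, 2.1, 3.8, 4.2, 6.2]
groups = [group_oldpeak(v) for v in sample]
print(groups)
```

Binning on ranges rather than on exact values means every observation falls into one of the four groups, so a model using this grouping would carry only three dummy variables for oldpeak.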
Based on the above exploratory analysis, we first build the full model and the smallest (reduced) model. The full model includes all predictors:

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 17.7770105 | 3261.3196841 | 0.0054509 | 0.9956509 |
| grp.age31-40 | -15.6846515 | 3261.3193646 | -0.0048093 | 0.9961628 |
| grp.age41-50 | -15.5361909 | 3261.3193401 | -0.0047638 | 0.9961991 |
| grp.age50 + | -15.8401537 | 3261.3193423 | -0.0048570 | 0.9961247 |
| sex | -2.3840503 | 0.3056434 | -7.8001046 | 0.0000000 |
| cp | 1.1173690 | 0.1272111 | 8.7835834 | 0.0000000 |
| trestbps | -0.0269028 | 0.0068124 | -3.9490955 | 0.0000784 |
| chol | -0.0082743 | 0.0023862 | -3.4675065 | 0.0005253 |
| fbs | 0.1254052 | 0.3243435 | 0.3866432 | 0.6990204 |
| restecg | 0.6425300 | 0.2241524 | 2.8664869 | 0.0041506 |
| thalach | 0.0347898 | 0.0068422 | 5.0845515 | 0.0000004 |
| exang | -0.5676446 | 0.2733258 | -2.0768056 | 0.0378195 |
| grp.oldpeak0.1 | 0.9637366 | 0.6044114 | 1.5945042 | 0.1108231 |
| grp.oldpeak0.2 | 1.1557692 | 0.5777735 | 2.0003845 | 0.0454588 |
| grp.oldpeak0.3 | -0.8487734 | 0.8218813 | -1.0327201 | 0.3017349 |
| grp.oldpeak0.4 | 3.1895362 | 0.7333222 | 4.3494338 | 0.0000136 |
| grp.oldpeak0.5 | 2.3154285 | 0.7719284 | 2.9995380 | 0.0027039 |
| grp.oldpeak0.6 | 0.2611654 | 0.4488352 | 0.5818737 | 0.5606517 |
| grp.oldpeak0.7 | 12.7327811 | 3765.8472120 | 0.0033811 | 0.9973023 |
| grp.oldpeak0.8 | -1.0184317 | 0.4589374 | -2.2191080 | 0.0264794 |
| grp.oldpeak0.9 | -1.2864320 | 3.9075861 | -0.3292140 | 0.7419939 |
| grp.oldpeak1.1 | 14.8044829 | 2569.5446520 | 0.0057615 | 0.9954030 |
| grp.oldpeak1.2 | 0.1380403 | 0.4519899 | 0.3054058 | 0.7600571 |
| grp.oldpeak1.3 | 17.2627370 | 3765.8471721 | 0.0045840 | 0.9963425 |
| grp.oldpeak1.4 | -1.0977389 | 0.5368421 | -2.0448078 | 0.0408738 |
| grp.oldpeak1.5 | 3.7891830 | 1.8564152 | 2.0411290 | 0.0412380 |
| grp.oldpeak1.6 | 0.5578858 | 0.6056644 | 0.9211138 | 0.3569910 |
| grp.oldpeak1.8 | -0.6965062 | 0.6285086 | -1.1081890 | 0.2677802 |
| grp.oldpeak1.9 | 0.4049483 | 0.9869833 | 0.4102889 | 0.6815940 |
| grp.oldpeak2-4 | -0.1679775 | 0.5402558 | -0.3109222 | 0.7558598 |
| grp.oldpeak2.1 | -15.5141954 | 3765.8471932 | -0.0041197 | 0.9967130 |
| grp.oldpeak2.2 | -18.1136096 | 1285.3096000 | -0.0140928 | 0.9887559 |
| grp.oldpeak2.3 | 18.6216203 | 2325.7766320 | 0.0080066 | 0.9936117 |
| grp.oldpeak2.4 | 1.4263679 | 1.8105423 | 0.7878125 | 0.4308064 |
| grp.oldpeak2.5 | -17.5483804 | 2164.6318233 | -0.0081069 | 0.9935317 |
| grp.oldpeak2.6 | -1.6640836 | 0.8918688 | -1.8658391 | 0.0620639 |
| grp.oldpeak2.8 | -16.4148858 | 1191.5722538 | -0.0137758 | 0.9890088 |
| grp.oldpeak2.9 | -15.2154798 | 3765.8471911 | -0.0040404 | 0.9967762 |
| grp.oldpeak3.1 | -14.6785989 | 3261.3193528 | -0.0045008 | 0.9964089 |
| grp.oldpeak3.2 | -17.5856667 | 1813.5221133 | -0.0096970 | 0.9922631 |
| grp.oldpeak3.4 | -15.7824630 | 1786.9002021 | -0.0088323 | 0.9929529 |
| grp.oldpeak3.5 | 17.0036604 | 3765.8472098 | 0.0045152 | 0.9963974 |
| grp.oldpeak3.6 | -17.7101823 | 1398.3105849 | -0.0126654 | 0.9898947 |
| grp.oldpeak3.8 | -21.0372538 | 3261.3193706 | -0.0064505 | 0.9948533 |
| grp.oldpeak4.2 | 4.2016022 | 1.4470081 | 2.9036480 | 0.0036884 |
| grp.oldpeak4.4 | -15.4256414 | 3261.3194036 | -0.0047299 | 0.9962261 |
| grp.oldpeak4+ | -14.2877180 | 1759.4724112 | -0.0081205 | 0.9935209 |
| grp.oldpeak5.6 | -13.5491346 | 3261.3193553 | -0.0041545 | 0.9966852 |
| grp.oldpeak6.2 | -14.7723736 | 3765.8472186 | -0.0039227 | 0.9968701 |
| slope | 1.0882222 | 0.2275418 | 4.7825152 | 0.0000017 |
| ca | -0.7230864 | 0.1137678 | -6.3558086 | 0.0000000 |
| thal | -0.9814599 | 0.1960230 | -5.0068602 | 0.0000006 |
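
The z value and Pr(>|z|) columns are Wald statistics: each z is the estimate divided by its standard error, and the p-value is the two-sided normal tail probability. The Python sketch below recomputes them for three rows copied from the table above; the helper name `wald_test` is only illustrative.

```python
import math


def wald_test(estimate, std_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * P(Z > |z|)
    return z, p


# (estimate, standard error) copied from the coefficient table above
rows = {
    "sex": (-2.3840503, 0.3056434),
    "fbs": (0.1254052, 0.3243435),
    "grp.oldpeak0.7": (12.7327811, 3765.8472120),
}

results = {term: wald_test(est, se) for term, (est, se) in rows.items()}
for term, (z, p) in results.items():
    print(f"{term:>15}  z = {z: .4f}  p = {p:.4f}")
```

The last row shows why a large estimate is not enough on its own: with a standard error in the thousands, the z value is essentially 0 and the p-value is close to 1.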

Reduced (smallest) model:

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 1.1559428 | 1.0453082 | 1.105839 | 0.2687961 |
| sex | -1.8030934 | 0.2236844 | -8.060882 | 0.0000000 |
| cp | 0.8811601 | 0.0896764 | 9.825997 | 0.0000000 |
| trestbps | -0.0242039 | 0.0049669 | -4.873073 | 0.0000011 |
| chol | -0.0063886 | 0.0018220 | -3.506455 | 0.0004541 |
| restecg | 0.3676481 | 0.1668241 | 2.203807 | 0.0275379 |
| thalach | 0.0428564 | 0.0048002 | 8.928120 | 0.0000000 |
| ca | -0.7223292 | 0.0921831 | -7.835812 | 0.0000000 |
| thal | -0.8744961 | 0.1432756 | -6.103596 | 0.0000000 |
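
To see how fitted coefficients turn into a predicted probability, the Python sketch below plugs one patient profile into the smallest model, whose coefficients appear in the table directly above. The patient values are hypothetical and chosen only for illustration; the coefficients are copied from the table.

```python
import math

# Coefficients copied from the table directly above
coef = {
    "(Intercept)": 1.1559428,
    "sex": -1.8030934,
    "cp": 0.8811601,
    "trestbps": -0.0242039,
    "chol": -0.0063886,
    "restecg": 0.3676481,
    "thalach": 0.0428564,
    "ca": -0.7223292,
    "thal": -0.8744961,
}

# Hypothetical patient (illustrative values, not taken from the data set)
patient = {"sex": 1, "cp": 0, "trestbps": 130, "chol": 250,
           "restecg": 1, "thalach": 150, "ca": 0, "thal": 2}


def linear_predictor(coef, x):
    """Intercept plus the sum of coefficient * value (the log-odds)."""
    return coef["(Intercept)"] + sum(coef[name] * value for name, value in x.items())


def inverse_logit(eta):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


eta = linear_predictor(coef, patient)
prob = inverse_logit(eta)
print(f"log-odds = {eta:.4f}, probability = {prob:.4f}")
```

Changing one predictor by one unit shifts the log-odds by that predictor's coefficient, which is what the odds ratios later in the report summarize.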

Final model, obtained by automatic variable selection:

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 2.2255795 | 1.4166591 | 1.5710057 | 0.1161813 |
| sex | -2.3746600 | 0.3062891 | -7.7530008 | 0.0000000 |
| cp | 1.1203404 | 0.1263510 | 8.8668875 | 0.0000000 |
| trestbps | -0.0288793 | 0.0065522 | -4.4075841 | 0.0000105 |
| chol | -0.0086249 | 0.0023438 | -3.6798513 | 0.0002334 |
| restecg | 0.6209798 | 0.2222343 | 2.7942574 | 0.0052019 |
| thalach | 0.0366217 | 0.0066801 | 5.4822257 | 0.0000000 |
| ca | -0.7331907 | 0.1099731 | -6.6670036 | 0.0000000 |
| thal | -0.9960030 | 0.1937944 | -5.1394821 | 0.0000003 |
| grp.oldpeak0.1 | 0.9283190 | 0.5932033 | 1.5649256 | 0.1176004 |
| grp.oldpeak0.2 | 1.1546258 | 0.5753361 | 2.0068717 | 0.0447633 |
| grp.oldpeak0.3 | -0.9137710 | 0.8117580 | -1.1256692 | 0.2603056 |
| grp.oldpeak0.4 | 3.0629522 | 0.7242206 | 4.2293082 | 0.0000234 |
| grp.oldpeak0.5 | 2.2636371 | 0.7722752 | 2.9311275 | 0.0033773 |
| grp.oldpeak0.6 | 0.2318603 | 0.4460112 | 0.5198532 | 0.6031659 |
| grp.oldpeak0.7 | 12.6799213 | 3765.8471824 | 0.0033671 | 0.9973135 |
| grp.oldpeak0.8 | -0.9894935 | 0.4588671 | -2.1563837 | 0.0310537 |
| grp.oldpeak0.9 | -1.2924146 | 4.0952858 | -0.3155859 | 0.7523168 |
| grp.oldpeak1.1 | 14.8226640 | 2596.8394240 | 0.0057080 | 0.9954457 |
| grp.oldpeak1.2 | 0.1055085 | 0.4495358 | 0.2347053 | 0.8144374 |
| grp.oldpeak1.3 | 17.1384262 | 3765.8471616 | 0.0045510 | 0.9963688 |
| grp.oldpeak1.4 | -1.1318034 | 0.5349980 | -2.1155281 | 0.0343850 |
| grp.oldpeak1.5 | 3.7925097 | 1.8875253 | 2.0092497 | 0.0445107 |
| grp.oldpeak1.6 | 0.5147364 | 0.6113042 | 0.8420299 | 0.3997712 |
| grp.oldpeak1.8 | -0.8162161 | 0.6219081 | -1.3124385 | 0.1893722 |
| grp.oldpeak1.9 | 0.4333954 | 0.9890490 | 0.4381940 | 0.6612456 |
| grp.oldpeak2-4 | -0.1273885 | 0.5433961 | -0.2344303 | 0.8146509 |
| grp.oldpeak2.1 | -15.5864170 | 3765.8471915 | -0.0041389 | 0.9966977 |
| grp.oldpeak2.2 | -18.1285787 | 1286.0697801 | -0.0140961 | 0.9887533 |
| grp.oldpeak2.3 | 18.5559033 | 2323.2672835 | 0.0079870 | 0.9936274 |
| grp.oldpeak2.4 | 1.3531487 | 1.8145593 | 0.7457175 | 0.4558381 |
| grp.oldpeak2.5 | -17.6847700 | 2104.3132865 | -0.0084041 | 0.9932946 |
| grp.oldpeak2.6 | -1.7598806 | 0.8805238 | -1.9986748 | 0.0456436 |
| grp.oldpeak2.8 | -16.3520444 | 1204.6498420 | -0.0135741 | 0.9891698 |
| grp.oldpeak2.9 | -15.1706838 | 3765.8471890 | -0.0040285 | 0.9967857 |
| grp.oldpeak3.1 | -14.7143378 | 3261.3193441 | -0.0045118 | 0.9964001 |
| grp.oldpeak3.2 | -17.7192432 | 1808.0318388 | -0.0098003 | 0.9921806 |
| grp.oldpeak3.4 | -15.6961237 | 1799.2248578 | -0.0087238 | 0.9930395 |
| grp.oldpeak3.5 | 16.9077788 | 3765.8471805 | 0.0044898 | 0.9964177 |
| grp.oldpeak3.6 | -17.5870658 | 1424.6953631 | -0.0123444 | 0.9901508 |
| grp.oldpeak3.8 | -21.0706515 | 3261.3193419 | -0.0064608 | 0.9948451 |
| grp.oldpeak4.2 | 4.1719677 | 1.4677825 | 2.8423610 | 0.0044781 |
| grp.oldpeak4.4 | -15.5554409 | 3261.3193976 | -0.0047697 | 0.9961944 |
| grp.oldpeak4+ | -14.2994496 | 1744.9482911 | -0.0081948 | 0.9934616 |
| grp.oldpeak5.6 | -13.6032141 | 3261.3193538 | -0.0041711 | 0.9966720 |
| grp.oldpeak6.2 | -14.8790386 | 3765.8472159 | -0.0039510 | 0.9968475 |
| slope | 1.0494403 | 0.2207043 | 4.7549601 | 0.0000020 |
| exang | -0.5970425 | 0.2700241 | -2.2110710 | 0.0270309 |

Goodness-of-fit measures for the three candidate models:

| | Residual deviance | Null deviance | AIC |
|---|---|---|---|
| full.model | 594.4229 | 1420.24 | 698.4229 |
| reduced.model | 818.7202 | 1420.24 | 836.7202 |
| final.model | 596.0881 | 1420.24 | 692.0881 |
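Comparing the AIC of the three models, the final model is preferred because it has the lowest AIC (692.09, versus 698.42 for the full model and 836.72 for the reduced model). Its residual deviance (596.09) is almost the same as that of the full model (594.42), but it uses fewer parameters, so it gives the best trade-off between fit and complexity.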
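For a binary logistic regression fitted to individual observations, AIC equals the residual deviance plus twice the number of estimated coefficients. The Python sketch below checks this against the table, taking the coefficient counts (52, 9, and 48) from the number of rows in the three coefficient tables; the variable names are illustrative.

```python
# (residual deviance, number of estimated coefficients) for each model.
# Deviances come from the goodness-of-fit table; counts are the number
# of rows (intercept included) in each coefficient table.
models = {
    "full.model": (594.4229, 52),
    "reduced.model": (818.7202, 9),
    "final.model": (596.0881, 48),
}

# AIC = residual deviance + 2 * (number of coefficients)
aic = {name: deviance + 2 * k for name, (deviance, k) in models.items()}
best = min(aic, key=aic.get)

for name, value in aic.items():
    print(f"{name:>14}: AIC = {value:.4f}")
print("lowest AIC:", best)
```

The full model fits marginally better in terms of deviance, but its four extra coefficients cost 8 AIC units, which is more than the 1.67 units of deviance they buy.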
In the exploratory analysis, we observed linear correlations among a few of the variables. Relative to the full model, the final model drops age (grp.age) and fasting blood sugar (fbs); slope and exang are retained. Among the three candidate models this is the one with the lowest AIC, and we call it the final model.

| | Estimate | Std. Error | z value | Pr(>\|z\|) | odds.ratio |
|---|---|---|---|---|---|
| (Intercept) | 2.2255795 | 1.4166591 | 1.5710057 | 0.1161813 | 9.258847e+00 |
| sex | -2.3746600 | 0.3062891 | -7.7530008 | 0.0000000 | 9.304610e-02 |
| cp | 1.1203404 | 0.1263510 | 8.8668875 | 0.0000000 | 3.065898e+00 |
| trestbps | -0.0288793 | 0.0065522 | -4.4075841 | 0.0000105 | 9.715338e-01 |
| chol | -0.0086249 | 0.0023438 | -3.6798513 | 0.0002334 | 9.914121e-01 |
| restecg | 0.6209798 | 0.2222343 | 2.7942574 | 0.0052019 | 1.860750e+00 |
| thalach | 0.0366217 | 0.0066801 | 5.4822257 | 0.0000000 | 1.037300e+00 |
| ca | -0.7331907 | 0.1099731 | -6.6670036 | 0.0000000 | 4.803738e-01 |
| thal | -0.9960030 | 0.1937944 | -5.1394821 | 0.0000003 | 3.693528e-01 |
| grp.oldpeak0.1 | 0.9283190 | 0.5932033 | 1.5649256 | 0.1176004 | 2.530252e+00 |
| grp.oldpeak0.2 | 1.1546258 | 0.5753361 | 2.0068717 | 0.0447633 | 3.172836e+00 |
| grp.oldpeak0.3 | -0.9137710 | 0.8117580 | -1.1256692 | 0.2603056 | 4.010092e-01 |
| grp.oldpeak0.4 | 3.0629522 | 0.7242206 | 4.2293082 | 0.0000234 | 2.139061e+01 |
| grp.oldpeak0.5 | 2.2636371 | 0.7722752 | 2.9311275 | 0.0033773 | 9.618007e+00 |
| grp.oldpeak0.6 | 0.2318603 | 0.4460112 | 0.5198532 | 0.6031659 | 1.260944e+00 |
| grp.oldpeak0.7 | 12.6799213 | 3765.8471824 | 0.0033671 | 0.9973135 | 3.212328e+05 |
| grp.oldpeak0.8 | -0.9894935 | 0.4588671 | -2.1563837 | 0.0310537 | 3.717649e-01 |
| grp.oldpeak0.9 | -1.2924146 | 4.0952858 | -0.3155859 | 0.7523168 | 2.746069e-01 |
| grp.oldpeak1.1 | 14.8226640 | 2596.8394240 | 0.0057080 | 0.9954457 | 2.737796e+06 |
| grp.oldpeak1.2 | 0.1055085 | 0.4495358 | 0.2347053 | 0.8144374 | 1.111276e+00 |
| grp.oldpeak1.3 | 17.1384262 | 3765.8471616 | 0.0045510 | 0.9963688 | 2.774112e+07 |
| grp.oldpeak1.4 | -1.1318034 | 0.5349980 | -2.1155281 | 0.0343850 | 3.224512e-01 |
| grp.oldpeak1.5 | 3.7925097 | 1.8875253 | 2.0092497 | 0.0445107 | 4.436761e+01 |
| grp.oldpeak1.6 | 0.5147364 | 0.6113042 | 0.8420299 | 0.3997712 | 1.673197e+00 |
| grp.oldpeak1.8 | -0.8162161 | 0.6219081 | -1.3124385 | 0.1893722 | 4.421014e-01 |
| grp.oldpeak1.9 | 0.4333954 | 0.9890490 | 0.4381940 | 0.6612456 | 1.542486e+00 |
| grp.oldpeak2-4 | -0.1273885 | 0.5433961 | -0.2344303 | 0.8146509 | 8.803916e-01 |
| grp.oldpeak2.1 | -15.5864170 | 3765.8471915 | -0.0041389 | 0.9966977 | 2.000000e-07 |
| grp.oldpeak2.2 | -18.1285787 | 1286.0697801 | -0.0140961 | 0.9887533 | 0.000000e+00 |
| grp.oldpeak2.3 | 18.5559033 | 2323.2672835 | 0.0079870 | 0.9936274 | 1.144792e+08 |
| grp.oldpeak2.4 | 1.3531487 | 1.8145593 | 0.7457175 | 0.4558381 | 3.869591e+00 |
| grp.oldpeak2.5 | -17.6847700 | 2104.3132865 | -0.0084041 | 0.9932946 | 0.000000e+00 |
| grp.oldpeak2.6 | -1.7598806 | 0.8805238 | -1.9986748 | 0.0456436 | 1.720654e-01 |
| grp.oldpeak2.8 | -16.3520444 | 1204.6498420 | -0.0135741 | 0.9891698 | 1.000000e-07 |
| grp.oldpeak2.9 | -15.1706838 | 3765.8471890 | -0.0040285 | 0.9967857 | 3.000000e-07 |
| grp.oldpeak3.1 | -14.7143378 | 3261.3193441 | -0.0045118 | 0.9964001 | 4.000000e-07 |
| grp.oldpeak3.2 | -17.7192432 | 1808.0318388 | -0.0098003 | 0.9921806 | 0.000000e+00 |
| grp.oldpeak3.4 | -15.6961237 | 1799.2248578 | -0.0087238 | 0.9930395 | 2.000000e-07 |
| grp.oldpeak3.5 | 16.9077788 | 3765.8471805 | 0.0044898 | 0.9964177 | 2.202698e+07 |
| grp.oldpeak3.6 | -17.5870658 | 1424.6953631 | -0.0123444 | 0.9901508 | 0.000000e+00 |
| grp.oldpeak3.8 | -21.0706515 | 3261.3193419 | -0.0064608 | 0.9948451 | 0.000000e+00 |
| grp.oldpeak4.2 | 4.1719677 | 1.4677825 | 2.8423610 | 0.0044781 | 6.484292e+01 |
| grp.oldpeak4.4 | -15.5554409 | 3261.3193976 | -0.0047697 | 0.9961944 | 2.000000e-07 |
| grp.oldpeak4+ | -14.2994496 | 1744.9482911 | -0.0081948 | 0.9934616 | 6.000000e-07 |
| grp.oldpeak5.6 | -13.6032141 | 3261.3193538 | -0.0041711 | 0.9966720 | 1.200000e-06 |
| grp.oldpeak6.2 | -14.8790386 | 3765.8472159 | -0.0039510 | 0.9968475 | 3.000000e-07 |
| slope | 1.0494403 | 0.2207043 | 4.7549601 | 0.0000020 | 2.856052e+00 |
| exang | -0.5970425 | 0.2700241 | -2.2110710 | 0.0270309 | 5.504372e-01 |
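The interpretation of the odds ratios is similar to the case of simple logistic regression: each odds ratio is the exponential of the corresponding coefficient. The grouped variable grp.oldpeak was intended to have four categories with 0-1 as the baseline, but the table shows that most individual oldpeak values were kept as separate levels, which suggests that many levels contain very few observations. As a result, the odds of heart disease do not follow a clear pattern as oldpeak (ST depression induced by exercise relative to rest) increases. For example, the odds ratio for the oldpeak group 2-4 is 0.8804 (reported as 8.803916e-01), meaning that, with all other variables held fixed, the odds of heart disease in the 2-4 group are about 12% lower than in the baseline group 0-1, and this difference is not statistically significant (p = 0.81). For the group 4+ the reported odds ratio is 6.0e-07, not 6.0, but its standard error (about 1745) and p-value (0.99) show that this effect cannot be estimated reliably from the data. The same holds for every level whose standard error is in the thousands, so the extreme odds ratios in the table (close to 0 or above 10^5) should not be interpreted.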
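Because each odds ratio is the exponential of a coefficient, the odds-ratio column can be recomputed directly. The Python sketch below does this for a few rows of the table and adds an approximate 95% Wald interval, exp(estimate ± 1.96 × standard error); the interval is an addition for illustration and the helper name `odds_ratio` is hypothetical.

```python
import math


def odds_ratio(estimate, std_error, z_crit=1.96):
    """Return the odds ratio and an approximate 95% Wald interval."""
    return (
        math.exp(estimate),
        math.exp(estimate - z_crit * std_error),
        math.exp(estimate + z_crit * std_error),
    )


# (estimate, standard error) copied from the final-model table
rows = {
    "sex": (-2.3746600, 0.3062891),
    "cp": (1.1203404, 0.1263510),
    "grp.oldpeak2-4": (-0.1273885, 0.5433961),
}

table = {term: odds_ratio(est, se) for term, (est, se) in rows.items()}
for term, (ratio, lower, upper) in table.items():
    print(f"{term:>15}  OR = {ratio:.4f}  95% CI = ({lower:.4f}, {upper:.4f})")
```

For grp.oldpeak2-4 the interval contains 1, in line with the non-significant p-value in the table.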
The case study focused on the association between a set of potential risk factors and heart disease. The initial data set has 13 numerical and categorical variables.
After exploratory analysis, we decided to regroup two sparse discrete variables, oldpeak and age, and then defined dummy variables for the resulting groups. These new group variables were used in the model search process.
Since sex, cp, trestbps, chol, restecg, thalach, ca, and thal are considered to be major contributors to the development of heart disease, we include these risk factors in the final model regardless of their statistical significance.
After automatic variable selection, we obtain the final model with 11 factors: sex, cp, trestbps, chol, restecg, thalach, ca, thal, slope, exang, and the grouped oldpeak variable (grp.oldpeak, entered as 37 dummy variables).