The heart disease data set in this case study contains 1025 observations on 13 predictor variables and one binary response variable. The data set is available from the Kaggle data repository and is loaded from my GitHub.

1 Data and Variable Descriptions

There are 13 predictor variables in the data set, plus the binary response variable target.

  1. age: age (years)

  2. sex: sex (Male=1, Female=0)

  3. cp: chest pain type (4 values increasing in pain)

  4. trestbps: resting blood pressure (mm Hg)

  5. chol: serum cholesterol (mg/dl)

  6. fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)

  7. restecg: resting electrocardiographic results (values 0, 1, 2)

  8. thalach: maximum heart rate achieved (bpm)

  9. exang: exercise-induced angina (1 = yes, 0 = no)

  10. oldpeak: ST depression induced by exercise relative to rest

  11. slope: the slope of the peak exercise ST segment

  12. ca: number of major vessels (0-3) colored by fluoroscopy

  13. thal: 0 = normal; 1 = fixed defect; 2 = reversible defect

##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  52   1  0      125  212   0       1     168     0     1.0     2  2    3
## 2  53   1  0      140  203   1       0     155     1     3.1     0  0    3
## 3  70   1  0      145  174   0       1     125     1     2.6     0  0    3
## 4  61   1  0      148  203   0       1     161     0     0.0     2  1    3
## 5  62   0  0      138  294   1       1     106     0     1.9     1  3    2
## 6  58   0  0      100  248   0       0     122     0     1.0     1  0    2
##   target heart.0.age heart.0.sex heart.0.cp heart.0.trestbps heart.0.chol
## 1      0          52           1          0              125          212
## 2      0          53           1          0              140          203
## 3      0          70           1          0              145          174
## 4      0          61           1          0              148          203
## 5      0          62           0          0              138          294
## 6      1          58           0          0              100          248
##   heart.0.fbs heart.0.restecg heart.0.thalach heart.0.exang heart.0.oldpeak
## 1           0               1             168             0             1.0
## 2           1               0             155             1             3.1
## 3           0               1             125             1             2.6
## 4           0               1             161             0             0.0
## 5           1               1             106             0             1.9
## 6           0               0             122             0             1.0
##   heart.0.slope heart.0.ca heart.0.thal heart.0.target
## 1             2          2            3              0
## 2             0          0            3              0
## 3             0          0            3              0
## 4             2          1            3              0
## 5             1          3            2              0
## 6             1          0            2              1
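The 0/1 codes in the preview are easier to read once they carry labels. The following is a minimal sketch, not the original loading code: it copies three coded columns from the six rows printed above and attaches labels. The object names `first_rows` and `labelled` are hypothetical, and the labels for `fbs` and `target` are assumptions made for illustration.

```r
# First six rows of three coded columns, copied from the printed preview.
first_rows <- data.frame(
  sex    = c(1, 1, 1, 1, 0, 0),
  fbs    = c(0, 1, 0, 0, 1, 0),
  target = c(0, 0, 0, 0, 0, 1)
)

# Attach labels to the 0/1 codes. The sex labels follow the variable list;
# the fbs and target labels are assumptions made for this illustration.
labelled <- data.frame(
  sex    = factor(first_rows$sex,    levels = c(0, 1), labels = c("Female", "Male")),
  fbs    = factor(first_rows$fbs,    levels = c(0, 1), labels = c("<=120 mg/dl", ">120 mg/dl")),
  target = factor(first_rows$target, levels = c(0, 1), labels = c("neg", "pos"))
)

# Cross-tabulate codes against labels to confirm they line up.
table(code = first_rows$sex, label = labelled$sex)
```

Keeping the numeric codes and the labelled factors side by side makes it easy to verify the mapping before any modelling.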

2 Clinical Question

Many studies have indicated that heart rate is a more powerful indicator of heart disease than high blood pressure. The objective of this case study is to explore the association between heart rate and heart disease.

The general interpretation of heart rate (bpm) for adults is given below:

  • very high: >170
  • target: (100, 170]
  • moderate: [60, 100]
  • very low: <60
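The bands listed above can be turned into a categorical variable with `cut()`. This is only a sketch under stated assumptions: the readings are hypothetical, they are assumed to be whole numbers, and the shared boundary value 100 is assigned to the "target" band.

```r
# Hypothetical readings, chosen only to exercise each band.
readings <- c(55, 80, 150, 180)

# Assumption: readings are whole numbers, so breaking at 59, 99 and 170
# gives the bands <60, 60-99, 100-170 and >170 (intervals closed on the right).
band <- cut(
  readings,
  breaks = c(-Inf, 59, 99, 170, Inf),
  labels = c("very low", "moderate", "target", "very high")
)

data.frame(readings, band)
```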

2.1 Building the Simple Logistic Regression

Since we only study the simple logistic regression model, only one predictor variable is included in the model. We first perform exploratory data analysis on the predictor variable to make sure the variable is not extremely skewed.

Here we check whether the predictor is approximately normally distributed or so skewed that bootstrapping would be needed. The graph shows that it is approximately normally distributed, so we do not need to bootstrap.
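A numeric companion to the visual check is the sample skewness, which is close to 0 for a symmetric distribution. The helper `sample_skewness` below is a hypothetical name introduced for this sketch, and the six values are only the `thalach` readings visible in the data preview, used as a stand-in for the full column.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5 (both dividing by n).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Stand-in values: the six thalach readings printed in the data preview.
# On the full data one would pass heart$thalach instead, and could pair
# the number with hist(heart$thalach) for the visual check.
thalach_preview <- c(168, 155, 125, 161, 106, 122)
sample_skewness(thalach_preview)
```

Values far from 0 in either direction would suggest looking at a transformation or a resampling-based approach before relying on the fit.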

## 
## Call:
## glm(formula = target ~ thalach, family = binomial(link = "logit"), 
##     data = heart)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0918  -1.0420   0.5412   0.9505   2.1987  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.583063   0.541701  -12.15   <2e-16 ***
## thalach      0.044369   0.003573   12.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1420.2  on 1024  degrees of freedom
## Residual deviance: 1219.6  on 1023  degrees of freedom
## AIC: 1223.6
## 
## Number of Fisher Scoring iterations: 3

Note that the response variable target is binary and coded 0/1, with 0 = “neg” and 1 = “pos”. (If it were stored as a factor, R would order the levels alphabetically and treat the first level as “failure”.) The “success” probability is defined to be P(heart disease = “pos”). The fitted simple logistic regression is shown in the output above.
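To make the "success" probability concrete, the fitted coefficients from the output above can be plugged into the inverse logit. This is a sketch rather than the original analysis code: the function name `p_success` is hypothetical, and the heart rates passed to it are illustrative values, not patients from the data.

```r
# Coefficients copied from the fitted model output.
b0 <- -6.583063   # intercept
b1 <-  0.044369   # slope for thalach (per bpm)

# Fitted success probability: inverse logit of the linear predictor.
p_success <- function(thalach) plogis(b0 + b1 * thalach)

# Illustrative heart rates (hypothetical values).
p_success(c(120, 150, 180))
```

With the model object available, `predict(model, newdata, type = "response")` would give the same quantities without copying coefficients by hand.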

The summary of major statistics is given below.

## Waiting for profiling to be done...
The summary stats of regression coefficients

              Estimate    Std. Error  z value    Pr(>|z|)  2.5 %       97.5 %
(Intercept)   -6.5830635  0.5417014   -12.15257  0         -7.6706347  -5.5453817
thalach        0.0443686  0.0035726    12.41901  0          0.0375196   0.0515359

The p-value for thalach is extremely close to zero, so the slope is highly significant. The 95% confidence interval for the slope is [0.0375, 0.0515], which does not include 0 and therefore leads to the same conclusion.
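As a cross-check, the z value and a Wald-type interval can be recomputed from the estimate and standard error alone. This is a sketch under an assumption: the interval in the table appears to be a profile-likelihood interval (the "Waiting for profiling" message suggests so), so the Wald interval below is expected to be close to it but not identical.

```r
# Estimate and standard error for thalach, copied from the table.
est <- 0.0443686
se  <- 0.0035726

# Wald z statistic: estimate divided by its standard error.
z_value <- est / se

# Wald 95% interval: estimate plus or minus the normal quantile times SE.
wald_ci <- est + c(-1, 1) * qnorm(0.975) * se

z_value
wald_ci
```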

Summary Stats with Odds Ratios

              Estimate    Std. Error  z value    Pr(>|z|)  odds.ratio
(Intercept)   -6.5830635  0.5417014   -12.15257  0         0.0013836
thalach        0.0443686  0.0035726    12.41901  0         1.0453676

The odds ratio associated with heart rate is 1.045, meaning that for each increase of one beat per minute in heart rate, the odds of testing positive for heart disease increase by about \(4.5\%\). This is a practically significant risk factor for heart disease.
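The odds ratio is the exponential of the slope, and the same transformation applies to the interval endpoints and to larger increments. The sketch below reuses the numbers reported in the tables; the 10 bpm increment is an illustrative choice, not something taken from the original analysis.

```r
# Slope and its 95% interval for thalach, copied from the tables.
b1 <- 0.0443686
ci <- c(0.0375196, 0.0515359)

# Odds ratio per 1 bpm and the implied percentage change in the odds.
or_1bpm  <- exp(b1)
pct_1bpm <- 100 * (or_1bpm - 1)

# Odds ratio for a 10 bpm increase (illustrative increment).
or_10bpm <- exp(10 * b1)

# Interval for the per-bpm odds ratio: exponentiate the endpoints.
or_ci <- exp(ci)

round(c(or_1bpm = or_1bpm, pct_1bpm = pct_1bpm, or_10bpm = or_10bpm), 4)
round(or_ci, 4)
```

Reporting the odds ratio for a clinically meaningful increment often makes the effect size easier to judge than the per-unit value.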

Some global goodness-of-fit measures are summarized in the following table.

Residual.Deviance  Null.Deviance  AIC
1220               1420           1224

This table is of limited use on its own, because we have no competing model against which to compare these goodness-of-fit measures.
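One comparison that the deviances do allow is against the intercept-only model. The sketch below, which is not part of the original analysis, uses the deviances and degrees of freedom from the model output to form a likelihood-ratio test and McFadden's pseudo R-squared, and confirms how the AIC follows from the residual deviance for 0/1 data.

```r
# Deviances and degrees of freedom copied from the model output.
null_dev  <- 1420.2; null_df  <- 1024
resid_dev <- 1219.6; resid_df <- 1023

# Likelihood-ratio test of the fitted model against the intercept-only model.
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R-squared (for 0/1 data, deviance = -2 * log-likelihood).
mcfadden_r2 <- 1 - resid_dev / null_dev

# AIC = residual deviance + 2 * number of estimated coefficients (2 here).
aic_check <- resid_dev + 2 * 2

c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value,
  mcfadden_r2 = mcfadden_r2, aic_check = aic_check)
```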

On the left, the S-shaped curve shows that the probability of a positive test for heart disease increases as the heart rate increases. In the rate-of-change graph, the rate at which that probability increases rises until about 148 bpm and falls beyond it, so about 148 bpm is the turning point. This is the heart rate at which the fitted probability equals 0.5, that is, \(-\hat\beta_0/\hat\beta_1 = 6.583/0.0444 \approx 148\).
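The location of the turning point can be computed from the fitted coefficients instead of being read off the graph. In a logistic curve the rate of change is \(\beta_1 p(1-p)\), which peaks where \(p = 0.5\). The sketch below uses the reported coefficients; the function names and the search grid are hypothetical choices made for illustration.

```r
# Coefficients copied from the summary table.
b0 <- -6.5830635
b1 <-  0.0443686

# Fitted probability and its derivative with respect to heart rate.
p_hat          <- function(x) plogis(b0 + b1 * x)
rate_of_change <- function(x) b1 * p_hat(x) * (1 - p_hat(x))

# Closed form: the rate of change peaks where the linear predictor is 0.
turning_point <- -b0 / b1

# Numerical confirmation on a grid of heart rates (hypothetical range).
grid     <- seq(70, 210, by = 0.01)
grid_max <- grid[which.max(rate_of_change(grid))]

c(turning_point = turning_point, grid_max = grid_max,
  max_rate = rate_of_change(turning_point))
```

The maximum rate of change equals `b1 / 4`, so near the turning point each additional bpm raises the fitted probability by roughly one percentage point.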