NOTE: This example is adapted from StatQuest, who adapted it from a demo that comes from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/index.php
This is a real dataset, the Heart Disease dataset: http://archive.ics.uci.edu/ml/datasets/Heart+Disease
Check out and Clean the dataset
library(ggplot2)
library(cowplot)
library(tidyverse)
url <- "https://raw.githubusercontent.com/StatQuest/logistic_regression_demo/master/processed.cleveland.data"
data <- read.csv(url, header=FALSE)
Let’s familiarize ourselves with the dataset
head(data)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
Notice that the columns are not labeled. So, the first step is to assign the names given in the dataset’s documentation:
colnames(data) <- c(
"age",
"sex",
"cp",
"trestbps",
"chol",
"fbs",
"restecg",
"thalach",
"exang",
"oldpeak",
"slope",
"ca",
"thal",
"hd")
head(data) # to check the column names
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal hd
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
Check the structure of your data:
str(data)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : num 1 1 1 1 0 1 0 0 1 1 ...
## $ cp : num 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : num 1 0 0 0 0 0 0 0 0 1 ...
## $ restecg : num 2 2 2 0 2 0 2 0 2 2 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : num 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : num 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : chr "0.0" "3.0" "2.0" "0.0" ...
## $ thal : chr "6.0" "3.0" "7.0" "3.0" ...
## $ hd : int 0 2 1 0 0 0 3 0 2 1 ...
What do these columns mean?
| Feature | Description |
|---|---|
| age | Age of the patient |
| sex | Sex (0 = female, 1 = male) |
| cp | Chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic) |
| trestbps | Resting blood pressure (in mm Hg) |
| chol | Serum cholesterol (in mg/dl) |
| fbs | Fasting blood sugar > 120 mg/dl (1 = true, 0 = false) |
| restecg | Resting electrocardiographic results (0 = normal, 1 = ST-T wave abnormality, 2 = probable/definite left ventricular hypertrophy) |
| thalach | Maximum heart rate achieved |
| exang | Exercise-induced angina (1 = yes, 0 = no) |
| oldpeak | ST depression induced by exercise relative to rest |
| slope | Slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping) |
| ca | Number of major vessels (0–3) colored by fluoroscopy |
| thal | Thallium heart scan results (3 = normal, 6 = fixed defect, 7 = reversible defect) |
| hd | Heart disease diagnosis (predicted attribute) (0 = less than 50% diameter narrowing, 1–4 = more than 50% diameter narrowing) |
Based on the description, we have some cleaning to do:
data[data$sex == 0,]$sex <- "F"
data[data$sex == 1,]$sex <- "M"
head(data)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal hd
## 1 63 M 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
## 2 67 M 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
## 3 67 M 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
## 4 37 M 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
## 5 41 F 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
## 6 56 M 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
We need to convert “sex”, “cp”, “fbs”, “restecg”, “exang”, “slope”, “ca”, and “thal” to factors. We start with the columns that convert cleanly (“ca” and “thal” need extra handling, as we’ll see below):
data <- data |>
mutate(across(c(sex, cp, fbs, restecg, exang, slope), as.factor))
str(data)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : chr "0.0" "3.0" "2.0" "0.0" ...
## $ thal : chr "6.0" "3.0" "7.0" "3.0" ...
## $ hd : int 0 2 1 0 0 0 3 0 2 1 ...
unique(data$ca)
## [1] "0.0" "3.0" "2.0" "1.0" "?"
unique(data$thal)
## [1] "6.0" "3.0" "7.0" "?"
Note that the “ca” and “thal” columns contain “?” placeholders, so R read them in as character. We know the underlying values are integers, so we first convert them to integers (which turns each “?” into an NA) and then convert the integers to factor levels:
data$ca <- as.integer(data$ca)
data$ca <- as.factor(data$ca)
data$thal <- as.integer(data$thal) # "thal" also had "?"s in it.
data$thal <- as.factor(data$thal)
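Two hedged alternatives that would avoid the “NAs introduced by coercion” warning from as.integer(): tell read.csv() about the “?” placeholder up front via its na.strings argument, or replace “?” explicitly with dplyr’s na_if() before converting. Both sketches below are illustrative, not what we ran:
# Alternative 1 (at load time): mark "?" as NA while reading.
# data <- read.csv(url, header=FALSE, na.strings="?")
# Alternative 2 (after loading): replace "?" with NA, then convert.
# data <- data |>
#   mutate(ca = as.factor(as.integer(na_if(ca, "?"))),
#          thal = as.factor(as.integer(na_if(thal, "?"))))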
str(data)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
## $ thal : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
## $ hd : int 0 2 1 0 0 0 3 0 2 1 ...
unique(data$ca)
## [1] 0 3 2 1 <NA>
## Levels: 0 1 2 3
The last step is cleaning “hd”. The raw values run from 0 to 4, so we replace 0 with “healthy” and everything else (1–4) with “unhealthy”:
data$hd <- ifelse(data$hd == 0, "healthy", "unhealthy")
data$hd <- as.factor(data$hd) # Now convert to a factor
Finally,
str(data)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
## $ thal : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
## $ hd : Factor w/ 2 levels "healthy","unhealthy": 1 2 2 1 1 1 2 1 2 2 ...
Deal with the NAs
colSums(is.na(data)) # NA counts per column
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal hd
## 0 0 0 4 2 0
We can see which rows contain these NAs:
data[is.na(data$ca) | is.na(data$thal),]
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 88 53 F 3 128 216 0 2 115 0 0.0 1 0 <NA>
## 167 52 M 3 138 223 0 0 169 0 0.0 1 <NA> 3
## 193 43 M 4 132 247 1 2 143 1 0.1 2 <NA> 7
## 267 52 M 4 128 204 1 0 156 1 1.0 2 0 <NA>
## 288 58 M 2 125 220 0 0 144 0 0.4 2 <NA> 7
## 303 38 M 3 138 175 0 0 173 0 0.0 1 <NA> 3
## hd
## 88 healthy
## 167 healthy
## 193 unhealthy
## 267 unhealthy
## 288 healthy
## 303 healthy
Only 6 of the 303 rows in the dataset contain missing values. That is about 2% of the data, so we can safely remove these rows without significantly affecting the analysis.
In machine learning, it’s common to impute missing values instead of deleting data. Techniques like using the mean, the median, or more advanced models (e.g., regression) are often used to fill in the gaps.
However, for this example, we’ll simply remove the rows with missing values (NAs) to keep things simple.
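For reference, here is a minimal sketch of what simple imputation could look like for these factor columns (mode imputation; impute_mode is our own illustrative helper, and we do not run it here):
# Illustrative only: fill missing factor values with the most
# frequent level instead of dropping the rows.
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x))) # most common non-NA level
  x[is.na(x)] <- mode_level
  x
}
# data$ca <- impute_mode(data$ca)
# data$thal <- impute_mode(data$thal)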
nrow(data) #number of rows
## [1] 303
data <- data[!(is.na(data$ca) | is.na(data$thal)),] # deleting the NAs
nrow(data) #number of rows now
## [1] 297
Before building a model, we want to check that every category (like types of chest pain or heart scan results) appears in both people with and without heart disease. This helps make sure the model can learn fair comparisons.
If a category only has 1 or 2 people, its estimate is not reliable: a small change could dramatically affect the results. If we find such rare categories, we may need to remove or merge them to keep the analysis stable.
To do so, we will create several tables to compare our data:
xtabs(~ hd + sex, data=data)
## sex
## hd F M
## healthy 71 89
## unhealthy 25 112
Both healthy and unhealthy patients are represented by plenty of female and male samples. Note that most females are healthy while most males are unhealthy, so sex could be a useful predictor.
Now let’s check whether all levels of chest pain are reported by enough patients:
xtabs(~ hd + cp, data=data)
## cp
## hd 1 2 3 4
## healthy 16 40 65 39
## unhealthy 7 9 18 103
Each category has more than 5 patients in both groups, so this variable is safe to use in modeling; no need to drop or merge any levels.
Continue…
xtabs(~ hd + fbs, data=data)
## fbs
## hd 0 1
## healthy 137 23
## unhealthy 117 20
xtabs(~ hd + restecg, data=data)
## restecg
## hd 0 1 2
## healthy 92 1 67
## unhealthy 55 3 79
xtabs(~ hd + exang, data=data)
## exang
## hd 0 1
## healthy 137 23
## unhealthy 63 74
xtabs(~ hd + slope, data=data)
## slope
## hd 1 2 3
## healthy 103 48 9
## unhealthy 36 89 12
xtabs(~ hd + ca, data=data)
## ca
## hd 0 1 2 3
## healthy 129 21 7 3
## unhealthy 45 44 31 17
xtabs(~ hd + thal, data=data)
## thal
## hd 3 6 7
## healthy 127 6 27
## unhealthy 37 12 88
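Rather than eyeballing each table, we can also scan every factor column for small cells programmatically (a convenience sketch of the same rule of thumb; factor_cols is our own helper variable):
# Flag any outcome-by-level cell with fewer than 5 patients,
# mirroring the manual xtabs checks above.
factor_cols <- c("sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal")
for (col in factor_cols) {
  counts <- table(data$hd, data[[col]])
  if (any(counts < 5)) cat("Small cells (< 5 patients) in:", col, "\n")
}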
We notice that for the resting electrocardiographic results (restecg), level 1 is reported by only 4 people. This small sample size could make its estimate unstable, which is worth keeping in mind; for this example we will leave it as-is.
For modeling, we use the glm() function (Generalized Linear Model). Setting family = binomial tells glm() to perform logistic regression, which is appropriate for binary outcomes like heart disease (yes/no).
logistic <- glm(hd ~ sex, data=data, family="binomial")
summary(logistic)
##
## Call:
## glm(formula = hd ~ sex, family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0438 0.2326 -4.488 7.18e-06 ***
## sexM 1.2737 0.2725 4.674 2.95e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 386.12 on 295 degrees of freedom
## AIC: 390.12
##
## Number of Fisher Scoring iterations: 4
We can write the model equation as:
Log odds of heart disease = -1.0438 + 1.2737 × (patient’s sex) [female = 0, male = 1]
Log odds of heart disease for women = -1.0438. Log odds of heart disease for men = -1.0438 + 1.2737 = 0.2299.
On the probability scale, using p = 1 / (1 + e^(-log odds)):
For women: p = 1 / (1 + e^(1.0438)) = 0.26, i.e., about a 26% chance of heart disease.
For men: p = 1 / (1 + e^(-0.2299)) = 0.557, i.e., about a 56% chance.
So the baseline risk for females is ~26%, while for males it jumps to ~56%.
We can also calculate the odds ratio: e^1.2737 = 3.57.
So, men have about 3.6 times higher odds of heart disease than women.
This effect is statistically significant (p < 0.001).
In plain words: men in this dataset are much more likely than women to have heart disease, and the model fit is significant.
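We can sanity-check these hand calculations against the fitted model itself (a quick sketch; new_patients is just an illustrative data frame):
# Recover the two predicted probabilities directly from the model.
new_patients <- data.frame(sex = factor(c("F", "M"), levels = levels(data$sex)))
predict(logistic, newdata = new_patients, type = "response")
# ~0.26 for F, ~0.56 for M
# The odds ratio for males comes straight from the coefficient:
exp(coef(logistic)["sexM"]) # ~3.57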
Deviance
Deviance in logistic regression is a measure of model fit (like “error” in linear regression). The smaller, the better!
Null deviance (409.95): the deviance of a model with no predictors.
Residual deviance (386.12): the deviance of our model (with sex as a predictor).
The decrease (409.95 → 386.12) shows the model explains some of the variation in the heart disease outcome.
These deviances are used to compute a pseudo R² and an overall p-value, and they are useful for comparing models. Logistic regression doesn’t have a “true” R² like linear regression, but we can calculate pseudo R² values. One common version is McFadden’s R², based on the model deviances:
R² = 1 - (Residual Deviance / Null Deviance)
r_square <- 1 - (logistic$deviance/logistic$null.deviance)
r_square
## [1] 0.05812569
This means your model explains about 5.8% of the variation in heart disease outcomes, based on sex alone.
# p-value = 1 - pchisq(chi-squared statistic, df = 1); df = 1 because only sex is used for prediction
1 - pchisq((logistic$null.deviance - logistic$deviance), df=1)
## [1] 1.053157e-06
Since the R² is low but the model is statistically significant, being male does increase the odds of having heart disease, but the predictive power of this model is weak. We’ll want to include other variables (like age, chest pain, cholesterol, etc.) to improve the model’s performance.
To see what the model predicts for each patient, we can collect the fitted probabilities alongside sex:
predicted.data <- data.frame(
probability.of.hd=logistic$fitted.values,
sex=data$sex)
predicted.data
## probability.of.hd sex
## 1 0.5572139 M
## 2 0.5572139 M
## 3 0.5572139 M
## 4 0.5572139 M
## 5 0.2604167 F
## 6 0.5572139 M
## 7 0.2604167 F
## 8 0.2604167 F
## 9 0.5572139 M
## 10 0.5572139 M
## 11 0.5572139 M
## 12 0.2604167 F
## 13 0.5572139 M
## 14 0.5572139 M
## 15 0.5572139 M
## 16 0.5572139 M
## 17 0.5572139 M
## 18 0.5572139 M
## 19 0.2604167 F
## 20 0.5572139 M
## 21 0.5572139 M
## 22 0.2604167 F
## 23 0.5572139 M
## 24 0.5572139 M
## 25 0.5572139 M
## 26 0.2604167 F
## 27 0.2604167 F
## 28 0.2604167 F
## 29 0.5572139 M
## 30 0.5572139 M
## 31 0.2604167 F
## 32 0.5572139 M
## 33 0.5572139 M
## 34 0.5572139 M
## 35 0.5572139 M
## 36 0.5572139 M
## 37 0.5572139 M
## 38 0.5572139 M
## 39 0.5572139 M
## 40 0.5572139 M
## 41 0.2604167 F
## 42 0.5572139 M
## 43 0.2604167 F
## 44 0.5572139 M
## 45 0.2604167 F
## 46 0.5572139 M
## 47 0.5572139 M
## 48 0.5572139 M
## 49 0.2604167 F
## 50 0.5572139 M
## 51 0.2604167 F
## 52 0.5572139 M
## 53 0.5572139 M
## 54 0.5572139 M
## 55 0.5572139 M
## 56 0.5572139 M
## 57 0.5572139 M
## 58 0.5572139 M
## 59 0.5572139 M
## 60 0.5572139 M
## 61 0.2604167 F
## 62 0.2604167 F
## 63 0.5572139 M
## 64 0.2604167 F
## 65 0.5572139 M
## 66 0.5572139 M
## 67 0.5572139 M
## 68 0.5572139 M
## 69 0.5572139 M
## 70 0.5572139 M
## 71 0.2604167 F
## 72 0.5572139 M
## 73 0.5572139 M
## 74 0.5572139 M
## 75 0.5572139 M
## 76 0.2604167 F
## 77 0.5572139 M
## 78 0.2604167 F
## 79 0.5572139 M
## 80 0.5572139 M
## 81 0.5572139 M
## 82 0.2604167 F
## 83 0.5572139 M
## 84 0.5572139 M
## 85 0.5572139 M
## 86 0.5572139 M
## 87 0.5572139 M
## 89 0.2604167 F
## 90 0.2604167 F
## 91 0.5572139 M
## 92 0.2604167 F
## 93 0.5572139 M
## 94 0.2604167 F
## 95 0.2604167 F
## 96 0.5572139 M
## 97 0.5572139 M
## 98 0.2604167 F
## 99 0.5572139 M
## 100 0.5572139 M
## 101 0.5572139 M
## 102 0.5572139 M
## 103 0.2604167 F
## 104 0.2604167 F
## 105 0.5572139 M
## 106 0.5572139 M
## 107 0.5572139 M
## 108 0.5572139 M
## 109 0.5572139 M
## 110 0.5572139 M
## 111 0.2604167 F
## 112 0.5572139 M
## 113 0.5572139 M
## 114 0.2604167 F
## 115 0.2604167 F
## 116 0.5572139 M
## 117 0.5572139 M
## 118 0.2604167 F
## 119 0.5572139 M
## 120 0.5572139 M
## 121 0.5572139 M
## 122 0.2604167 F
## 123 0.5572139 M
## 124 0.5572139 M
## 125 0.5572139 M
## 126 0.2604167 F
## 127 0.2604167 F
## 128 0.5572139 M
## 129 0.5572139 M
## 130 0.2604167 F
## 131 0.5572139 M
## 132 0.5572139 M
## 133 0.5572139 M
## 134 0.5572139 M
## 135 0.2604167 F
## 136 0.2604167 F
## 137 0.5572139 M
## 138 0.5572139 M
## 139 0.5572139 M
## 140 0.5572139 M
## 141 0.5572139 M
## 142 0.5572139 M
## 143 0.5572139 M
## 144 0.5572139 M
## 145 0.5572139 M
## 146 0.5572139 M
## 147 0.5572139 M
## 148 0.5572139 M
## 149 0.5572139 M
## 150 0.2604167 F
## 151 0.5572139 M
## 152 0.2604167 F
## 153 0.2604167 F
## 154 0.5572139 M
## 155 0.5572139 M
## 156 0.5572139 M
## 157 0.5572139 M
## 158 0.5572139 M
## 159 0.5572139 M
## 160 0.5572139 M
## 161 0.5572139 M
## 162 0.5572139 M
## 163 0.2604167 F
## 164 0.2604167 F
## 165 0.5572139 M
## 166 0.5572139 M
## 168 0.2604167 F
## 169 0.5572139 M
## 170 0.2604167 F
## 171 0.5572139 M
## 172 0.5572139 M
## 173 0.2604167 F
## 174 0.2604167 F
## 175 0.5572139 M
## 176 0.5572139 M
## 177 0.5572139 M
## 178 0.5572139 M
## 179 0.5572139 M
## 180 0.5572139 M
## 181 0.5572139 M
## 182 0.2604167 F
## 183 0.5572139 M
## 184 0.5572139 M
## 185 0.2604167 F
## 186 0.2604167 F
## 187 0.5572139 M
## 188 0.5572139 M
## 189 0.5572139 M
## 190 0.5572139 M
## 191 0.5572139 M
## 192 0.5572139 M
## 194 0.2604167 F
## 195 0.2604167 F
## 196 0.5572139 M
## 197 0.5572139 M
## 198 0.2604167 F
## 199 0.2604167 F
## 200 0.5572139 M
## 201 0.2604167 F
## 202 0.2604167 F
## 203 0.5572139 M
## 204 0.2604167 F
## 205 0.5572139 M
## 206 0.5572139 M
## 207 0.5572139 M
## 208 0.5572139 M
## 209 0.5572139 M
## 210 0.2604167 F
## 211 0.2604167 F
## 212 0.5572139 M
## 213 0.5572139 M
## 214 0.2604167 F
## 215 0.5572139 M
## 216 0.5572139 M
## 217 0.2604167 F
## 218 0.2604167 F
## 219 0.2604167 F
## 220 0.5572139 M
## 221 0.2604167 F
## 222 0.2604167 F
## 223 0.2604167 F
## 224 0.5572139 M
## 225 0.2604167 F
## 226 0.2604167 F
## 227 0.5572139 M
## 228 0.2604167 F
## 229 0.5572139 M
## 230 0.5572139 M
## 231 0.2604167 F
## 232 0.2604167 F
## 233 0.5572139 M
## 234 0.2604167 F
## 235 0.2604167 F
## 236 0.5572139 M
## 237 0.5572139 M
## 238 0.5572139 M
## 239 0.2604167 F
## 240 0.5572139 M
## 241 0.5572139 M
## 242 0.2604167 F
## 243 0.2604167 F
## 244 0.5572139 M
## 245 0.2604167 F
## 246 0.5572139 M
## 247 0.5572139 M
## 248 0.5572139 M
## 249 0.5572139 M
## 250 0.5572139 M
## 251 0.5572139 M
## 252 0.5572139 M
## 253 0.5572139 M
## 254 0.2604167 F
## 255 0.5572139 M
## 256 0.2604167 F
## 257 0.2604167 F
## 258 0.2604167 F
## 259 0.5572139 M
## 260 0.5572139 M
## 261 0.2604167 F
## 262 0.2604167 F
## 263 0.2604167 F
## 264 0.5572139 M
## 265 0.5572139 M
## 266 0.5572139 M
## 268 0.5572139 M
## 269 0.5572139 M
## 270 0.5572139 M
## 271 0.5572139 M
## 272 0.5572139 M
## 273 0.5572139 M
## 274 0.2604167 F
## 275 0.5572139 M
## 276 0.5572139 M
## 277 0.2604167 F
## 278 0.2604167 F
## 279 0.5572139 M
## 280 0.2604167 F
## 281 0.5572139 M
## 282 0.5572139 M
## 283 0.2604167 F
## 284 0.5572139 M
## 285 0.5572139 M
## 286 0.5572139 M
## 287 0.2604167 F
## 289 0.5572139 M
## 290 0.5572139 M
## 291 0.5572139 M
## 292 0.2604167 F
## 293 0.5572139 M
## 294 0.5572139 M
## 295 0.2604167 F
## 296 0.5572139 M
## 297 0.5572139 M
## 298 0.2604167 F
## 299 0.5572139 M
## 300 0.5572139 M
## 301 0.5572139 M
## 302 0.2604167 F
Since there are only two probabilities (one for females and one for males), we can use a table to summarize the predicted probabilities.
xtabs(~ probability.of.hd + sex, data=predicted.data)
## sex
## probability.of.hd F M
## 0.260416666667239 96 0
## 0.55721393034826 0 201
Now let’s build a model that uses all of the remaining variables to predict heart disease:
logistic2 <- glm(hd ~ ., data=data, family="binomial")
summary(logistic2)
##
## Call:
## glm(formula = hd ~ ., family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.253978 2.960399 -2.113 0.034640 *
## age -0.023508 0.025122 -0.936 0.349402
## sexM 1.670152 0.552486 3.023 0.002503 **
## cp2 1.448396 0.809136 1.790 0.073446 .
## cp3 0.393353 0.700338 0.562 0.574347
## cp4 2.373287 0.709094 3.347 0.000817 ***
## trestbps 0.027720 0.011748 2.359 0.018300 *
## chol 0.004445 0.004091 1.087 0.277253
## fbs1 -0.574079 0.592539 -0.969 0.332622
## restecg1 1.000887 2.638393 0.379 0.704424
## restecg2 0.486408 0.396327 1.227 0.219713
## thalach -0.019695 0.011717 -1.681 0.092781 .
## exang1 0.653306 0.447445 1.460 0.144267
## oldpeak 0.390679 0.239173 1.633 0.102373
## slope2 1.302289 0.486197 2.679 0.007395 **
## slope3 0.606760 0.939324 0.646 0.518309
## ca1 2.237444 0.514770 4.346 1.38e-05 ***
## ca2 3.271852 0.785123 4.167 3.08e-05 ***
## ca3 2.188715 0.928644 2.357 0.018428 *
## thal6 -0.168439 0.810310 -0.208 0.835331
## thal7 1.433319 0.440567 3.253 0.001141 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 183.10 on 276 degrees of freedom
## AIC: 225.1
##
## Number of Fisher Scoring iterations: 6
Let’s look at the large log odds ratios that are statistically significant:
sexM = 1.67 (p = 0.0025): being male increases the log-odds of heart disease. Statistically significant, so sex is an important predictor.
cp4 = 2.37 (p = 0.0008): having asymptomatic chest pain (cp4) increases the log-odds of heart disease compared to the reference chest pain type (cp1 = typical angina). Strong effect, very significant.
ca2 = 3.27 (p ≈ 3e-05): having 2 major vessels colored by fluoroscopy dramatically increases the odds of heart disease. This is one of the strongest predictors.
We see other variables like age:
age = -0.023 (p = 0.35): a slight negative effect, but not significant. Age, after adjusting for the other variables, doesn’t add much here.
The overall model fit is much improved compared to the null model, as shown by the big drop in deviance and low AIC.
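To make that comparison concrete, we can also put the sex-only and full models’ AICs side by side (a quick check using the two models fitted above):
# AIC balances fit against model complexity; lower is better.
AIC(logistic, logistic2)
# logistic (sex only): AIC ~390
# logistic2 (all variables): AIC ~225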
We can calculate McFadden’s R² for this model:
r_square_all <- 1 - (logistic2$deviance/logistic2$null.deviance)
r_square_all
## [1] 0.5533531
The model explains about 55.3% of the variation in the outcome (heart disease diagnosis).
p-value:
1 - pchisq((logistic2$null.deviance - logistic2$deviance), df=(length(logistic2$coefficients)-1))
## [1] 0
Very significant!
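Equivalently, we can compare the sex-only model and the full model directly with a likelihood-ratio test, a hedged alternative to the deviance arithmetic above (both models were fitted on the same 297 rows, which anova() requires):
# Likelihood-ratio test: does the full model fit significantly
# better than the sex-only model?
anova(logistic, logistic2, test="Chisq")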
predicted.data2 <- data.frame(
probability.of.hd=logistic2$fitted.values,
hd=data$hd)
# order and rank the cases so they are easy to graph
predicted.data2 <- predicted.data2[
order(predicted.data2$probability.of.hd, decreasing=FALSE),]
predicted.data2$rank <- 1:nrow(predicted.data2)
predicted.data2
## probability.of.hd hd rank
## 223 0.001098497 healthy 1
## 64 0.002575954 healthy 2
## 211 0.003028610 healthy 3
## 27 0.003097422 healthy 4
## 222 0.003165527 healthy 5
## 245 0.003617366 healthy 6
## 217 0.004247289 healthy 7
## 263 0.004252341 healthy 8
## 95 0.004584178 healthy 9
## 94 0.005571569 healthy 10
## 22 0.006124983 healthy 11
## 226 0.007329205 healthy 12
## 19 0.007981298 healthy 13
## 254 0.008217007 healthy 14
## 221 0.008354930 healthy 15
## 256 0.009034574 healthy 16
## 90 0.010242121 healthy 17
## 71 0.011000034 healthy 18
## 148 0.011385704 healthy 19
## 135 0.011982965 healthy 20
## 199 0.012983984 healthy 21
## 163 0.013201037 healthy 22
## 102 0.014150104 healthy 23
## 264 0.014451038 healthy 24
## 146 0.014554209 unhealthy 25
## 130 0.015033948 healthy 26
## 191 0.015841235 healthy 27
## 242 0.015875792 healthy 28
## 143 0.016358136 healthy 29
## 282 0.016419912 healthy 30
## 117 0.016746018 healthy 31
## 150 0.019001592 healthy 32
## 231 0.019026606 healthy 33
## 26 0.019441324 healthy 34
## 113 0.021239775 healthy 35
## 47 0.022966668 healthy 36
## 270 0.023465719 healthy 37
## 292 0.023660411 healthy 38
## 104 0.023816792 healthy 39
## 278 0.024948978 healthy 40
## 296 0.025097112 healthy 41
## 5 0.026253999 healthy 42
## 76 0.028514591 healthy 43
## 201 0.028662579 healthy 44
## 44 0.028795023 healthy 45
## 170 0.030374207 healthy 46
## 243 0.030957274 healthy 47
## 33 0.031156620 unhealthy 48
## 16 0.031443239 healthy 49
## 86 0.033625168 healthy 50
## 35 0.035519859 healthy 51
## 6 0.036623801 healthy 52
## 28 0.036764717 healthy 53
## 129 0.038651472 healthy 54
## 204 0.039121409 healthy 55
## 284 0.041092208 healthy 56
## 228 0.041439655 healthy 57
## 118 0.043010961 healthy 58
## 45 0.043670536 unhealthy 59
## 195 0.043722537 healthy 60
## 62 0.044071213 healthy 61
## 51 0.045212124 healthy 62
## 241 0.046577593 healthy 63
## 250 0.049019064 healthy 64
## 50 0.049126644 healthy 65
## 239 0.050842601 healthy 66
## 89 0.050881022 healthy 67
## 87 0.051476886 healthy 68
## 83 0.052258069 healthy 69
## 85 0.052332974 healthy 70
## 187 0.055881052 healthy 71
## 54 0.056834144 healthy 72
## 133 0.057383996 healthy 73
## 290 0.058387396 healthy 74
## 235 0.058724520 healthy 75
## 161 0.059822030 healthy 76
## 1 0.062099416 healthy 77
## 209 0.062208482 healthy 78
## 240 0.064414970 healthy 79
## 8 0.066394093 healthy 80
## 126 0.066399313 healthy 81
## 20 0.066924129 healthy 82
## 258 0.079983914 healthy 83
## 274 0.089692902 healthy 84
## 200 0.091311778 unhealthy 85
## 49 0.092434805 healthy 86
## 185 0.096186963 unhealthy 87
## 141 0.096559427 healthy 88
## 100 0.104585966 healthy 89
## 149 0.105461313 healthy 90
## 101 0.110839080 healthy 91
## 227 0.110915898 healthy 92
## 136 0.111011538 healthy 93
## 4 0.113609757 healthy 94
## 180 0.114492942 healthy 95
## 123 0.116842845 healthy 96
## 140 0.124269406 healthy 97
## 31 0.126544058 healthy 98
## 15 0.130489782 healthy 99
## 261 0.138707036 healthy 100
## 36 0.139632266 healthy 101
## 280 0.140838233 healthy 102
## 186 0.142744928 healthy 103
## 21 0.142803306 healthy 104
## 295 0.145321482 unhealthy 105
## 259 0.154759366 healthy 106
## 106 0.157514867 healthy 107
## 220 0.158988022 healthy 108
## 12 0.160933163 healthy 109
## 14 0.161325646 healthy 110
## 42 0.163686615 healthy 111
## 164 0.168917435 healthy 112
## 165 0.182860432 healthy 113
## 168 0.187798107 healthy 114
## 213 0.207789639 healthy 115
## 202 0.209015413 healthy 116
## 40 0.212072901 healthy 117
## 78 0.214730318 healthy 118
## 79 0.222771395 healthy 119
## 152 0.224006064 healthy 120
## 82 0.227745780 healthy 121
## 18 0.228054268 healthy 122
## 289 0.239284008 healthy 123
## 151 0.241638846 healthy 124
## 260 0.248886621 unhealthy 125
## 277 0.250309804 healthy 126
## 17 0.254486059 unhealthy 127
## 153 0.260147993 healthy 128
## 203 0.260211314 healthy 129
## 67 0.270926284 unhealthy 130
## 68 0.273754102 healthy 131
## 205 0.274486678 healthy 132
## 275 0.277710692 unhealthy 133
## 234 0.278686452 healthy 134
## 216 0.280420160 healthy 135
## 257 0.282132523 healthy 136
## 116 0.284808945 healthy 137
## 233 0.288415392 unhealthy 138
## 302 0.290319803 unhealthy 139
## 134 0.290656356 healthy 140
## 299 0.306943078 unhealthy 141
## 52 0.311482759 healthy 142
## 262 0.312001020 unhealthy 143
## 29 0.321379988 healthy 144
## 103 0.324438353 healthy 145
## 210 0.348367592 unhealthy 146
## 218 0.352439407 healthy 147
## 23 0.361345492 unhealthy 148
## 58 0.365212639 unhealthy 149
## 174 0.367992856 healthy 150
## 198 0.368698797 healthy 151
## 131 0.376671866 healthy 152
## 145 0.377552339 healthy 153
## 132 0.381366128 healthy 154
## 255 0.382113409 healthy 155
## 125 0.386736001 unhealthy 156
## 43 0.387062547 healthy 157
## 11 0.391477771 healthy 158
## 179 0.399258886 healthy 159
## 99 0.422661163 healthy 160
## 70 0.432020605 unhealthy 161
## 59 0.438010086 healthy 162
## 215 0.445261105 unhealthy 163
## 160 0.450309482 healthy 164
## 173 0.450534712 unhealthy 165
## 166 0.451753243 healthy 166
## 91 0.469440883 healthy 167
## 268 0.475457614 unhealthy 168
## 75 0.479436972 unhealthy 169
## 269 0.480560387 unhealthy 170
## 291 0.492615946 unhealthy 171
## 276 0.509159664 healthy 172
## 60 0.522777452 healthy 173
## 212 0.541141848 unhealthy 174
## 197 0.544980768 healthy 175
## 183 0.545109373 healthy 176
## 293 0.545390099 unhealthy 177
## 272 0.546140318 healthy 178
## 142 0.547420366 unhealthy 179
## 251 0.570442123 healthy 180
## 279 0.578952477 unhealthy 181
## 238 0.631198562 unhealthy 182
## 13 0.647739413 unhealthy 183
## 144 0.658242622 unhealthy 184
## 298 0.684782514 unhealthy 185
## 177 0.687452168 healthy 186
## 53 0.702488808 unhealthy 187
## 247 0.705874159 unhealthy 188
## 246 0.708644163 unhealthy 189
## 162 0.717750550 unhealthy 190
## 61 0.719224535 unhealthy 191
## 34 0.720057696 healthy 192
## 115 0.728266034 unhealthy 193
## 230 0.750503476 unhealthy 194
## 157 0.751145818 unhealthy 195
## 74 0.754971451 unhealthy 196
## 169 0.770892651 unhealthy 197
## 81 0.775235460 healthy 198
## 194 0.796637680 unhealthy 199
## 84 0.800662839 unhealthy 200
## 225 0.806879666 unhealthy 201
## 111 0.811489543 unhealthy 202
## 181 0.812164223 unhealthy 203
## 244 0.813179938 unhealthy 204
## 184 0.813804721 healthy 205
## 57 0.824574972 unhealthy 206
## 110 0.826936305 unhealthy 207
## 108 0.845142336 unhealthy 208
## 93 0.849972280 healthy 209
## 46 0.853137162 unhealthy 210
## 105 0.856640939 unhealthy 211
## 285 0.865310779 unhealthy 212
## 172 0.866587227 healthy 213
## 266 0.880856760 unhealthy 214
## 114 0.888731337 unhealthy 215
## 10 0.889990950 unhealthy 216
## 24 0.894275653 unhealthy 217
## 137 0.898795128 unhealthy 218
## 107 0.900556820 unhealthy 219
## 7 0.913251803 unhealthy 220
## 96 0.916958660 unhealthy 221
## 188 0.920857417 unhealthy 222
## 80 0.922972893 unhealthy 223
## 189 0.923564290 unhealthy 224
## 190 0.925103473 unhealthy 225
## 219 0.925380488 healthy 226
## 237 0.926670250 unhealthy 227
## 139 0.932562076 unhealthy 228
## 249 0.940551673 unhealthy 229
## 112 0.941739080 unhealthy 230
## 32 0.943991082 unhealthy 231
## 229 0.954226067 unhealthy 232
## 158 0.954699941 unhealthy 233
## 30 0.954869283 unhealthy 234
## 72 0.957285535 unhealthy 235
## 55 0.958591574 unhealthy 236
## 208 0.962311675 unhealthy 237
## 232 0.962665558 unhealthy 238
## 41 0.964616179 unhealthy 239
## 37 0.967083158 unhealthy 240
## 48 0.972585444 unhealthy 241
## 286 0.973118237 unhealthy 242
## 155 0.974849359 unhealthy 243
## 287 0.974864179 unhealthy 244
## 119 0.975154297 unhealthy 245
## 248 0.975923512 unhealthy 246
## 121 0.976316797 unhealthy 247
## 214 0.976863953 unhealthy 248
## 196 0.980524845 unhealthy 249
## 156 0.981052202 unhealthy 250
## 138 0.981297525 unhealthy 251
## 65 0.981593207 unhealthy 252
## 271 0.981856494 unhealthy 253
## 124 0.982652585 unhealthy 254
## 92 0.983817695 unhealthy 255
## 283 0.984232291 unhealthy 256
## 9 0.984513939 unhealthy 257
## 178 0.985418187 unhealthy 258
## 97 0.986597517 unhealthy 259
## 69 0.986804486 unhealthy 260
## 175 0.986878593 unhealthy 261
## 171 0.987235288 unhealthy 262
## 38 0.987553136 unhealthy 263
## 98 0.988237660 unhealthy 264
## 301 0.988632504 unhealthy 265
## 265 0.989466671 unhealthy 266
## 122 0.989498213 unhealthy 267
## 253 0.990420853 healthy 268
## 297 0.990686567 unhealthy 269
## 128 0.992386274 unhealthy 270
## 281 0.993102290 unhealthy 271
## 2 0.993376136 unhealthy 272
## 300 0.993471763 unhealthy 273
## 147 0.993897346 unhealthy 274
## 109 0.994017715 unhealthy 275
## 252 0.994109138 unhealthy 276
## 159 0.994384275 unhealthy 277
## 120 0.994429272 unhealthy 278
## 206 0.994568332 unhealthy 279
## 39 0.994621317 unhealthy 280
## 182 0.994884640 unhealthy 281
## 63 0.994921518 unhealthy 282
## 77 0.995593851 unhealthy 283
## 294 0.996285860 unhealthy 284
## 154 0.996938269 unhealthy 285
## 236 0.997135275 unhealthy 286
## 56 0.997439676 unhealthy 287
## 3 0.997935717 unhealthy 288
## 176 0.998059276 unhealthy 289
## 73 0.998090959 unhealthy 290
## 25 0.998313153 unhealthy 291
## 127 0.998418347 unhealthy 292
## 192 0.998657003 unhealthy 293
## 224 0.998862241 unhealthy 294
## 207 0.998977158 unhealthy 295
## 273 0.999062844 unhealthy 296
## 66 0.999172369 unhealthy 297
ggplot(data=predicted.data2, aes(x=rank, y=probability.of.hd)) +
geom_point(aes(color=hd), alpha=1, shape=4, stroke=2) +
xlab("Index") +
ylab("Predicted probability of getting heart disease")
The graph is a probability plot showing the predicted probability of heart disease for each individual, sorted from lowest to highest. It’s a great way to visually assess how well our logistic regression model separates the two classes, healthy and unhealthy.
Most red (healthy) cases are on the left (low predicted probabilities), and most blue (unhealthy) cases are on the right (high predicted probabilities).
This means the model is ranking patients well: high probabilities for those with disease, low for those without. Some overlap is normal in real-world data, especially with clinical conditions that are influenced by many subtle factors.
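Another common way to look at the same separation is to overlay the distributions of predicted probability for the two classes (a hedged sketch reusing the predicted.data2 frame built above):
# Overlapping histograms of predicted probability by actual class:
# good separation shows up as two mostly non-overlapping humps.
ggplot(predicted.data2, aes(x=probability.of.hd, fill=hd)) +
  geom_histogram(alpha=0.5, position="identity", bins=30) +
  xlab("Predicted probability of heart disease")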
A confusion matrix shows how well your logistic regression model is classifying patients as healthy (0) or unhealthy (1), based on a threshold (usually 0.5).
# hd is a factor with labels "healthy"/"unhealthy", so convert it to numeric 0/1
data$hd_num <- ifelse(data$hd == "unhealthy", 1, 0)
# Predicted probabilities
predicted.probs <- logistic2$fitted.values
# Predicted classes: 1 if prob > 0.5, else 0
predicted.classes <- ifelse(predicted.probs > 0.5, 1, 0)
# Confusion matrix
confusion <- table(
Predicted = factor(predicted.classes, levels = c(0, 1)),
Actual = factor(data$hd_num, levels = c(0, 1))
)
confusion
## Actual
## Predicted 0 1
## 0 146 25
## 1 14 112
146 people were actually healthy, and the model correctly predicted them as healthy. This is a True Negative.
25 people were actually unhealthy, but the model mistakenly predicted them as healthy. This is a False Negative.
14 people were actually healthy, but the model mistakenly predicted them as unhealthy. This is a False Positive.
112 people were actually unhealthy, and the model correctly predicted them as unhealthy. This is a True Positive.
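Rather than typing the four counts by hand (as the next block does for readability), they could equally be pulled from the confusion matrix itself (a small sketch; the indexes assume the Predicted-by-Actual layout printed above):
# Same four numbers, extracted programmatically from `confusion`.
TN <- confusion["0", "0"] # predicted healthy, actually healthy
FP <- confusion["1", "0"] # predicted unhealthy, actually healthy
FN <- confusion["0", "1"] # predicted healthy, actually unhealthy
TP <- confusion["1", "1"] # predicted unhealthy, actually unhealthy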
# Extract values
TN <- 146
FP <- 14
FN <- 25
TP <- 112
# Metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN) # also called recall or true positive rate
specificity <- TN / (TN + FP) # true negative rate
precision <- TP / (TP + FP) # positive predictive value
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
# Print results
cat("Accuracy: ", round(accuracy, 4), "\n")
## Accuracy: 0.8687
cat("Sensitivity: ", round(sensitivity, 4), "\n")
## Sensitivity: 0.8175
cat("Specificity: ", round(specificity, 4), "\n")
## Specificity: 0.9125
cat("Precision: ", round(precision, 4), "\n")
## Precision: 0.8889
cat("F1 Score: ", round(f1_score, 4), "\n")
## F1 Score: 0.8517
The model has high accuracy and strong balance between detecting positives and avoiding false alarms.
It’s especially good at identifying healthy individuals (high specificity), and still quite good at catching disease (81.7% sensitivity).
Overall, these are excellent results for a binary classification model on medical data.
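One caveat: the 0.5 cutoff is a choice, not a law. A quick sketch (reusing predicted.probs and data$hd_num from above) shows how sensitivity and specificity trade off as the threshold moves; this trade-off is exactly what the ROC curve in the next section plots:
# Sweep the classification threshold: lower thresholds catch more
# disease (higher sensitivity) at the cost of more false alarms
# (lower specificity), and vice versa.
for (threshold in c(0.3, 0.5, 0.7)) {
  pred <- ifelse(predicted.probs > threshold, 1, 0)
  sens <- sum(pred == 1 & data$hd_num == 1) / sum(data$hd_num == 1)
  spec <- sum(pred == 0 & data$hd_num == 0) / sum(data$hd_num == 0)
  cat("threshold:", threshold,
      "sensitivity:", round(sens, 3),
      "specificity:", round(spec, 3), "\n")
}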
Finally: ROC and AUC
# install.packages("pROC") # if needed
library(pROC)
# ROC curve & AUC on full data
roc_obj <- roc(response = data$hd,
predictor = logistic2$fitted.values,
levels = c("healthy", "unhealthy"),
direction = "<") # smaller prob = Healthy
# Print AUC value
auc_val <- auc(roc_obj); auc_val
## Area under the curve: 0.9395
# Plot ROC with AUC displayed
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)")
The AUC = 0.94 means your model is very good at distinguishing between healthy and unhealthy patients.
On the plot, your curve is far above the diagonal “random guess” line, which shows the model is much better than chance.
In plain words: if you randomly pick one unhealthy and one healthy patient, the model has about a 94% chance of ranking the unhealthy one higher (more likely to have disease).
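We can even check that reading of AUC empirically (an illustrative sketch; the resampled pair comparison approximates the AUC up to ties and sampling noise):
# AUC is (approximately) the probability that a randomly chosen
# unhealthy patient gets a higher predicted probability than a
# randomly chosen healthy one.
set.seed(42)
probs_healthy <- predicted.probs[data$hd == "healthy"]
probs_unhealthy <- predicted.probs[data$hd == "unhealthy"]
mean(sample(probs_unhealthy, 10000, replace=TRUE) >
     sample(probs_healthy, 10000, replace=TRUE))
# should land near the AUC of ~0.94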