NOTE: This example is adapted from StatQuest, who adapted it from a demo that comes from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/index.php
This is a real dataset, the Heart Disease dataset: http://archive.ics.uci.edu/ml/datasets/Heart+Disease
Check out and Clean the dataset
library(ggplot2)
library(cowplot)
library(tidyverse)
url <- "https://raw.githubusercontent.com/StatQuest/logistic_regression_demo/master/processed.cleveland.data"
data <- read.csv(url, header=FALSE)
Let’s familiarize ourselves with the dataset
head(data)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
Notice that the columns are not labeled. So, the first step is to assign the names given in the dataset’s documentation:
colnames(data) <- c(
"age",
"sex",
"cp",
"trestbps",
"chol",
"fbs",
"restecg",
"thalach",
"exang",
"oldpeak",
"slope",
"ca",
"thal",
"hd")
head(data) # to check the column names
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal hd
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
Check the structure of your data:
str(data)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : num 1 1 1 1 0 1 0 0 1 1 ...
## $ cp : num 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : num 1 0 0 0 0 0 0 0 0 1 ...
## $ restecg : num 2 2 2 0 2 0 2 0 2 2 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : num 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : num 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : chr "0.0" "3.0" "2.0" "0.0" ...
## $ thal : chr "6.0" "3.0" "7.0" "3.0" ...
## $ hd : int 0 2 1 0 0 0 3 0 2 1 ...
What do these columns mean?
| Feature | Description |
|---|---|
| age | Age of the patient |
| sex | Sex (0 = female, 1 = male) |
| cp | Chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic) |
| trestbps | Resting blood pressure (in mm Hg) |
| chol | Serum cholesterol (in mg/dl) |
| fbs | Fasting blood sugar > 120 mg/dl (1 = true, 0 = false) |
| restecg | Resting electrocardiographic results (0 = normal, 1 = ST-T wave abnormality, 2 = probable/definite left ventricular hypertrophy) |
| thalach | Maximum heart rate achieved |
| exang | Exercise-induced angina (1 = yes, 0 = no) |
| oldpeak | ST depression induced by exercise relative to rest |
| slope | Slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping) |
| ca | Number of major vessels (0–3) colored by fluoroscopy |
| thal | Thallium heart scan results (3 = normal, 6 = fixed defect, 7 = reversible defect) |
| hd | Heart disease diagnosis (predicted attribute) (0 = less than 50% diameter narrowing, 1–4 = more than 50% diameter narrowing) |
Based on the description, we have some cleaning to do:
data[data$sex == 0,]$sex <- "F"
data[data$sex == 1,]$sex <- "M"
head(data)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal hd
## 1 63 M 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
## 2 67 M 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
## 3 67 M 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
## 4 37 M 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
## 5 41 F 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
## 6 56 M 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
We need to convert “sex”, “cp”, “fbs”, “restecg”, “exang”, “slope”, “ca”, and “thal” to factors. We start with the columns that convert cleanly (“ca” and “thal” need extra handling, as we’ll see below):
data <- data |>
mutate(across(c(sex, cp, fbs, restecg, exang, slope), as.factor))
str(data)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : chr "0.0" "3.0" "2.0" "0.0" ...
## $ thal : chr "6.0" "3.0" "7.0" "3.0" ...
## $ hd : int 0 2 1 0 0 0 3 0 2 1 ...
unique(data$ca)
## [1] "0.0" "3.0" "2.0" "1.0" "?"
unique(data$thal)
## [1] "6.0" "3.0" "7.0" "?"
Note that the “ca” and “thal” columns contain “?” placeholders, so R read them in as character. We know the underlying values are integers, so we first convert them to integers (which turns each “?” into an NA) and then convert the integers to factor levels:
data$ca <- as.integer(data$ca)
data$ca <- as.factor(data$ca)
data$thal <- as.integer(data$thal) # "thal" also had "?"s in it.
data$thal <- as.factor(data$thal)
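Two hedged alternatives that would avoid the “NAs introduced by coercion” warning from as.integer(): tell read.csv() about the “?” placeholder up front via its na.strings argument, or replace “?” explicitly with dplyr’s na_if() before converting. Both sketches below are illustrative, not what we ran:
# Alternative 1 (at load time): mark "?" as NA while reading.
# data <- read.csv(url, header=FALSE, na.strings="?")
# Alternative 2 (after loading): replace "?" with NA, then convert.
# data <- data |>
#   mutate(ca = as.factor(as.integer(na_if(ca, "?"))),
#          thal = as.factor(as.integer(na_if(thal, "?"))))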
str(data)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
## $ thal : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
## $ hd : int 0 2 1 0 0 0 3 0 2 1 ...
unique(data$ca)
## [1] 0 3 2 1 <NA>
## Levels: 0 1 2 3
The last step is cleaning “hd”. The raw values run from 0 to 4, so we replace 0 with “healthy” and everything else (1–4) with “unhealthy”:
data$hd <- ifelse(data$hd == 0, "healthy", "unhealthy")
data$hd <- as.factor(data$hd) # Now convert to a factor
Finally,
str(data)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
## $ thal : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
## $ hd : Factor w/ 2 levels "healthy","unhealthy": 1 2 2 1 1 1 2 1 2 2 ...
Deal with the NAs
colSums(is.na(data)) # NA counts per column
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal hd
## 0 0 0 4 2 0
We can see which rows contain these NAs:
data[is.na(data$ca) | is.na(data$thal),]
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 88 53 F 3 128 216 0 2 115 0 0.0 1 0 <NA>
## 167 52 M 3 138 223 0 0 169 0 0.0 1 <NA> 3
## 193 43 M 4 132 247 1 2 143 1 0.1 2 <NA> 7
## 267 52 M 4 128 204 1 0 156 1 1.0 2 0 <NA>
## 288 58 M 2 125 220 0 0 144 0 0.4 2 <NA> 7
## 303 38 M 3 138 175 0 0 173 0 0.0 1 <NA> 3
## hd
## 88 healthy
## 167 healthy
## 193 unhealthy
## 267 unhealthy
## 288 healthy
## 303 healthy
Only 6 of the 303 rows in the dataset contain missing values. That is about 2% of the data, so we can safely remove these rows without significantly affecting the analysis.
In machine learning, it’s common to impute missing values instead of deleting data. Techniques like using the mean, the median, or more advanced models (e.g., regression) are often used to fill in the gaps.
However, for this example, we’ll simply remove the rows with missing values (NAs) to keep things simple.
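For reference, here is a minimal sketch of what simple imputation could look like for these factor columns (mode imputation; impute_mode is our own illustrative helper, and we do not run it here):
# Illustrative only: fill missing factor values with the most
# frequent level instead of dropping the rows.
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x))) # most common non-NA level
  x[is.na(x)] <- mode_level
  x
}
# data$ca <- impute_mode(data$ca)
# data$thal <- impute_mode(data$thal)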
nrow(data) #number of rows
## [1] 303
data <- data[!(is.na(data$ca) | is.na(data$thal)),] # deleting the NAs
nrow(data) #number of rows now
## [1] 297
Before building a model, we want to check that every category (like types of chest pain or heart scan results) appears in both people with and without heart disease. This helps make sure the model can learn fair comparisons.
If a category only has 1 or 2 people, its estimate is not reliable: a small change could dramatically affect the results. If we find such rare categories, we may need to remove or merge them to keep the analysis stable.
To do so, we will create several tables to compare our data:
xtabs(~ hd + sex, data=data)
## sex
## hd F M
## healthy 71 89
## unhealthy 25 112
Both healthy and unhealthy patients are represented by plenty of female and male samples. Note that most females are healthy while most males are unhealthy, so sex could be a useful predictor.
Now let’s check whether all levels of chest pain are reported by enough patients:
xtabs(~ hd + cp, data=data)
## cp
## hd 1 2 3 4
## healthy 16 40 65 39
## unhealthy 7 9 18 103
Each category has more than 5 patients in both groups, so this variable is safe to use in modeling; no need to drop or merge any levels.
Continue…
xtabs(~ hd + fbs, data=data)
## fbs
## hd 0 1
## healthy 137 23
## unhealthy 117 20
xtabs(~ hd + restecg, data=data)
## restecg
## hd 0 1 2
## healthy 92 1 67
## unhealthy 55 3 79
xtabs(~ hd + exang, data=data)
## exang
## hd 0 1
## healthy 137 23
## unhealthy 63 74
xtabs(~ hd + slope, data=data)
## slope
## hd 1 2 3
## healthy 103 48 9
## unhealthy 36 89 12
xtabs(~ hd + ca, data=data)
## ca
## hd 0 1 2 3
## healthy 129 21 7 3
## unhealthy 45 44 31 17
xtabs(~ hd + thal, data=data)
## thal
## hd 3 6 7
## healthy 127 6 27
## unhealthy 37 12 88
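Rather than eyeballing each table, we can also scan every factor column for small cells programmatically (a convenience sketch of the same rule of thumb; factor_cols is our own helper variable):
# Flag any outcome-by-level cell with fewer than 5 patients,
# mirroring the manual xtabs checks above.
factor_cols <- c("sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal")
for (col in factor_cols) {
  counts <- table(data$hd, data[[col]])
  if (any(counts < 5)) cat("Small cells (< 5 patients) in:", col, "\n")
}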
We notice that for the resting electrocardiographic results (restecg), level 1 is reported by only 4 people. This small sample size could make its estimate unstable, which is worth keeping in mind; for this example we will leave it as-is.
For modeling, we use the glm() function (Generalized Linear Model). Setting family = binomial tells glm() to perform logistic regression, which is appropriate for binary outcomes like heart disease (yes/no).
logistic <- glm(hd ~ sex, data=data, family="binomial")
summary(logistic)
##
## Call:
## glm(formula = hd ~ sex, family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0438 0.2326 -4.488 7.18e-06 ***
## sexM 1.2737 0.2725 4.674 2.95e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 386.12 on 295 degrees of freedom
## AIC: 390.12
##
## Number of Fisher Scoring iterations: 4
We can write the model equation as:
Log odds of heart disease = -1.0438 + 1.2737 × (patient’s sex) [female = 0, male = 1]
Log odds of heart disease for women = -1.0438. Log odds of heart disease for men = -1.0438 + 1.2737 = 0.2299.
On the probability scale, using p = 1 / (1 + e^(-log odds)):
For women: p = 1 / (1 + e^(1.0438)) = 0.26, i.e., about a 26% chance of heart disease.
For men: p = 1 / (1 + e^(-0.2299)) = 0.557, i.e., about a 56% chance.
So the baseline risk for females is ~26%, while for males it jumps to ~56%.
We can also calculate the odds ratio: e^1.2737 = 3.57.
So, men have about 3.6 times higher odds of heart disease than women.
This effect is statistically significant (p < 0.001).
In plain words: men in this dataset are much more likely than women to have heart disease, and the model fit is significant.
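We can sanity-check these hand calculations against the fitted model itself (a quick sketch; new_patients is just an illustrative data frame):
# Recover the two predicted probabilities directly from the model.
new_patients <- data.frame(sex = factor(c("F", "M"), levels = levels(data$sex)))
predict(logistic, newdata = new_patients, type = "response")
# ~0.26 for F, ~0.56 for M
# The odds ratio for males comes straight from the coefficient:
exp(coef(logistic)["sexM"]) # ~3.57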
Deviance
Deviance in logistic regression is a measure of model fit (like “error” in linear regression). The smaller, the better!
Null deviance (409.95): the deviance of a model with no predictors.
Residual deviance (386.12): the deviance of our model (with sex as a predictor).
The decrease (409.95 → 386.12) shows the model explains some of the variation in the heart disease outcome.
These deviances are used to compute a pseudo R² and an overall p-value, and they are useful for comparing models. Logistic regression doesn’t have a “true” R² like linear regression, but we can calculate pseudo R² values. One common version is McFadden’s R², based on the model deviances:
R² = 1 - (Residual Deviance / Null Deviance)
r_square <- 1 - (logistic$deviance/logistic$null.deviance)
r_square
## [1] 0.05812569
This means your model explains about 5.8% of the variation in heart disease outcomes, based on sex alone.
# p-value = 1 - pchisq(chi-squared statistic, df = 1); df = 1 because only sex is used for prediction
1 - pchisq((logistic$null.deviance - logistic$deviance), df=1)
## [1] 1.053157e-06
Since the R² is low but the model is statistically significant, being male does increase the odds of having heart disease, but the predictive power of this model is weak. We’ll want to include other variables (like age, chest pain, cholesterol, etc.) to improve the model’s performance.
To see what the model predicts for each patient, we can collect the fitted probabilities alongside sex:
predicted.data <- data.frame(
probability.of.hd=logistic$fitted.values,
sex=data$sex)
predicted.data
## probability.of.hd sex
## 1 0.5572139 M
## 2 0.5572139 M
## 3 0.5572139 M
## 4 0.5572139 M
## 5 0.2604167 F
## 6 0.5572139 M
## 7 0.2604167 F
## 8 0.2604167 F
## 9 0.5572139 M
## 10 0.5572139 M
## 11 0.5572139 M
## 12 0.2604167 F
## 13 0.5572139 M
## 14 0.5572139 M
## 15 0.5572139 M
## 16 0.5572139 M
## 17 0.5572139 M
## 18 0.5572139 M
## 19 0.2604167 F
## 20 0.5572139 M
## 21 0.5572139 M
## 22 0.2604167 F
## 23 0.5572139 M
## 24 0.5572139 M
## 25 0.5572139 M
## 26 0.2604167 F
## 27 0.2604167 F
## 28 0.2604167 F
## 29 0.5572139 M
## 30 0.5572139 M
## 31 0.2604167 F
## 32 0.5572139 M
## 33 0.5572139 M
## 34 0.5572139 M
## 35 0.5572139 M
## 36 0.5572139 M
## 37 0.5572139 M
## 38 0.5572139 M
## 39 0.5572139 M
## 40 0.5572139 M
## 41 0.2604167 F
## 42 0.5572139 M
## 43 0.2604167 F
## 44 0.5572139 M
## 45 0.2604167 F
## 46 0.5572139 M
## 47 0.5572139 M
## 48 0.5572139 M
## 49 0.2604167 F
## 50 0.5572139 M
## 51 0.2604167 F
## 52 0.5572139 M
## 53 0.5572139 M
## 54 0.5572139 M
## 55 0.5572139 M
## 56 0.5572139 M
## 57 0.5572139 M
## 58 0.5572139 M
## 59 0.5572139 M
## 60 0.5572139 M
## 61 0.2604167 F
## 62 0.2604167 F
## 63 0.5572139 M
## 64 0.2604167 F
## 65 0.5572139 M
## 66 0.5572139 M
## 67 0.5572139 M
## 68 0.5572139 M
## 69 0.5572139 M
## 70 0.5572139 M
## 71 0.2604167 F
## 72 0.5572139 M
## 73 0.5572139 M
## 74 0.5572139 M
## 75 0.5572139 M
## 76 0.2604167 F
## 77 0.5572139 M
## 78 0.2604167 F
## 79 0.5572139 M
## 80 0.5572139 M
## 81 0.5572139 M
## 82 0.2604167 F
## 83 0.5572139 M
## 84 0.5572139 M
## 85 0.5572139 M
## 86 0.5572139 M
## 87 0.5572139 M
## 89 0.2604167 F
## 90 0.2604167 F
## 91 0.5572139 M
## 92 0.2604167 F
## 93 0.5572139 M
## 94 0.2604167 F
## 95 0.2604167 F
## 96 0.5572139 M
## 97 0.5572139 M
## 98 0.2604167 F
## 99 0.5572139 M
## 100 0.5572139 M
## 101 0.5572139 M
## 102 0.5572139 M
## 103 0.2604167 F
## 104 0.2604167 F
## 105 0.5572139 M
## 106 0.5572139 M
## 107 0.5572139 M
## 108 0.5572139 M
## 109 0.5572139 M
## 110 0.5572139 M
## 111 0.2604167 F
## 112 0.5572139 M
## 113 0.5572139 M
## 114 0.2604167 F
## 115 0.2604167 F
## 116 0.5572139 M
## 117 0.5572139 M
## 118 0.2604167 F
## 119 0.5572139 M
## 120 0.5572139 M
## 121 0.5572139 M
## 122 0.2604167 F
## 123 0.5572139 M
## 124 0.5572139 M
## 125 0.5572139 M
## 126 0.2604167 F
## 127 0.2604167 F
## 128 0.5572139 M
## 129 0.5572139 M
## 130 0.2604167 F
## 131 0.5572139 M
## 132 0.5572139 M
## 133 0.5572139 M
## 134 0.5572139 M
## 135 0.2604167 F
## 136 0.2604167 F
## 137 0.5572139 M
## 138 0.5572139 M
## 139 0.5572139 M
## 140 0.5572139 M
## 141 0.5572139 M
## 142 0.5572139 M
## 143 0.5572139 M
## 144 0.5572139 M
## 145 0.5572139 M
## 146 0.5572139 M
## 147 0.5572139 M
## 148 0.5572139 M
## 149 0.5572139 M
## 150 0.2604167 F
## 151 0.5572139 M
## 152 0.2604167 F
## 153 0.2604167 F
## 154 0.5572139 M
## 155 0.5572139 M
## 156 0.5572139 M
## 157 0.5572139 M
## 158 0.5572139 M
## 159 0.5572139 M
## 160 0.5572139 M
## 161 0.5572139 M
## 162 0.5572139 M
## 163 0.2604167 F
## 164 0.2604167 F
## 165 0.5572139 M
## 166 0.5572139 M
## 168 0.2604167 F
## 169 0.5572139 M
## 170 0.2604167 F
## 171 0.5572139 M
## 172 0.5572139 M
## 173 0.2604167 F
## 174 0.2604167 F
## 175 0.5572139 M
## 176 0.5572139 M
## 177 0.5572139 M
## 178 0.5572139 M
## 179 0.5572139 M
## 180 0.5572139 M
## 181 0.5572139 M
## 182 0.2604167 F
## 183 0.5572139 M
## 184 0.5572139 M
## 185 0.2604167 F
## 186 0.2604167 F
## 187 0.5572139 M
## 188 0.5572139 M
## 189 0.5572139 M
## 190 0.5572139 M
## 191 0.5572139 M
## 192 0.5572139 M
## 194 0.2604167 F
## 195 0.2604167 F
## 196 0.5572139 M
## 197 0.5572139 M
## 198 0.2604167 F
## 199 0.2604167 F
## 200 0.5572139 M
## 201 0.2604167 F
## 202 0.2604167 F
## 203 0.5572139 M
## 204 0.2604167 F
## 205 0.5572139 M
## 206 0.5572139 M
## 207 0.5572139 M
## 208 0.5572139 M
## 209 0.5572139 M
## 210 0.2604167 F
## 211 0.2604167 F
## 212 0.5572139 M
## 213 0.5572139 M
## 214 0.2604167 F
## 215 0.5572139 M
## 216 0.5572139 M
## 217 0.2604167 F
## 218 0.2604167 F
## 219 0.2604167 F
## 220 0.5572139 M
## 221 0.2604167 F
## 222 0.2604167 F
## 223 0.2604167 F
## 224 0.5572139 M
## 225 0.2604167 F
## 226 0.2604167 F
## 227 0.5572139 M
## 228 0.2604167 F
## 229 0.5572139 M
## 230 0.5572139 M
## 231 0.2604167 F
## 232 0.2604167 F
## 233 0.5572139 M
## 234 0.2604167 F
## 235 0.2604167 F
## 236 0.5572139 M
## 237 0.5572139 M
## 238 0.5572139 M
## 239 0.2604167 F
## 240 0.5572139 M
## 241 0.5572139 M
## 242 0.2604167 F
## 243 0.2604167 F
## 244 0.5572139 M
## 245 0.2604167 F
## 246 0.5572139 M
## 247 0.5572139 M
## 248 0.5572139 M
## 249 0.5572139 M
## 250 0.5572139 M
## 251 0.5572139 M
## 252 0.5572139 M
## 253 0.5572139 M
## 254 0.2604167 F
## 255 0.5572139 M
## 256 0.2604167 F
## 257 0.2604167 F
## 258 0.2604167 F
## 259 0.5572139 M
## 260 0.5572139 M
## 261 0.2604167 F
## 262 0.2604167 F
## 263 0.2604167 F
## 264 0.5572139 M
## 265 0.5572139 M
## 266 0.5572139 M
## 268 0.5572139 M
## 269 0.5572139 M
## 270 0.5572139 M
## 271 0.5572139 M
## 272 0.5572139 M
## 273 0.5572139 M
## 274 0.2604167 F
## 275 0.5572139 M
## 276 0.5572139 M
## 277 0.2604167 F
## 278 0.2604167 F
## 279 0.5572139 M
## 280 0.2604167 F
## 281 0.5572139 M
## 282 0.5572139 M
## 283 0.2604167 F
## 284 0.5572139 M
## 285 0.5572139 M
## 286 0.5572139 M
## 287 0.2604167 F
## 289 0.5572139 M
## 290 0.5572139 M
## 291 0.5572139 M
## 292 0.2604167 F
## 293 0.5572139 M
## 294 0.5572139 M
## 295 0.2604167 F
## 296 0.5572139 M
## 297 0.5572139 M
## 298 0.2604167 F
## 299 0.5572139 M
## 300 0.5572139 M
## 301 0.5572139 M
## 302 0.2604167 F
Since there are only two probabilities (one for females and one for males), we can use a table to summarize the predicted probabilities.
xtabs(~ probability.of.hd + sex, data=predicted.data)
## sex
## probability.of.hd F M
## 0.260416666667239 96 0
## 0.55721393034826 0 201
Now let’s build a model that uses all of the remaining variables to predict heart disease:
logistic2 <- glm(hd ~ ., data=data, family="binomial")
summary(logistic2)
##
## Call:
## glm(formula = hd ~ ., family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.253978 2.960399 -2.113 0.034640 *
## age -0.023508 0.025122 -0.936 0.349402
## sexM 1.670152 0.552486 3.023 0.002503 **
## cp2 1.448396 0.809136 1.790 0.073446 .
## cp3 0.393353 0.700338 0.562 0.574347
## cp4 2.373287 0.709094 3.347 0.000817 ***
## trestbps 0.027720 0.011748 2.359 0.018300 *
## chol 0.004445 0.004091 1.087 0.277253
## fbs1 -0.574079 0.592539 -0.969 0.332622
## restecg1 1.000887 2.638393 0.379 0.704424
## restecg2 0.486408 0.396327 1.227 0.219713
## thalach -0.019695 0.011717 -1.681 0.092781 .
## exang1 0.653306 0.447445 1.460 0.144267
## oldpeak 0.390679 0.239173 1.633 0.102373
## slope2 1.302289 0.486197 2.679 0.007395 **
## slope3 0.606760 0.939324 0.646 0.518309
## ca1 2.237444 0.514770 4.346 1.38e-05 ***
## ca2 3.271852 0.785123 4.167 3.08e-05 ***
## ca3 2.188715 0.928644 2.357 0.018428 *
## thal6 -0.168439 0.810310 -0.208 0.835331
## thal7 1.433319 0.440567 3.253 0.001141 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 183.10 on 276 degrees of freedom
## AIC: 225.1
##
## Number of Fisher Scoring iterations: 6
Let’s look at the large log odds ratios that are statistically significant:
sexM = 1.67 (p = 0.0025): being male increases the log-odds of heart disease. Statistically significant, so sex is an important predictor.
cp4 = 2.37 (p = 0.0008): having asymptomatic chest pain (cp4) increases the log-odds of heart disease compared to the reference chest pain type (cp1 = typical angina). Strong effect, very significant.
ca2 = 3.27 (p ≈ 3e-05): having 2 major vessels colored by fluoroscopy dramatically increases the odds of heart disease. This is one of the strongest predictors.
We see other variables like age:
age = -0.023 (p = 0.35): a slight negative effect, but not significant. Age, after adjusting for the other variables, doesn’t add much here.
The overall model fit is much improved compared to the null model, as shown by the big drop in deviance and low AIC.
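To make that comparison concrete, we can also put the sex-only and full models’ AICs side by side (a quick check using the two models fitted above):
# AIC balances fit against model complexity; lower is better.
AIC(logistic, logistic2)
# logistic (sex only): AIC ~390
# logistic2 (all variables): AIC ~225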
We can calculate McFadden’s R² for this model:
r_square_all <- 1 - (logistic2$deviance/logistic2$null.deviance)
r_square_all
## [1] 0.5533531
The model explains about 55.3% of the variation in the outcome (heart disease diagnosis).
p-value:
1 - pchisq((logistic2$null.deviance - logistic2$deviance), df=(length(logistic2$coefficients)-1))
## [1] 0
Very significant!
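Equivalently, we can compare the sex-only model and the full model directly with a likelihood-ratio test, a hedged alternative to the deviance arithmetic above (both models were fitted on the same 297 rows, which anova() requires):
# Likelihood-ratio test: does the full model fit significantly
# better than the sex-only model?
anova(logistic, logistic2, test="Chisq")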
predicted.data2 <- data.frame(
probability.of.hd=logistic2$fitted.values,
hd=data$hd)
# order and rank the cases so they are easy to graph
predicted.data2 <- predicted.data2[
order(predicted.data2$probability.of.hd, decreasing=FALSE),]
predicted.data2$rank <- 1:nrow(predicted.data2)
predicted.data2
## probability.of.hd hd rank
## 223 0.001098497 healthy 1
## 64 0.002575954 healthy 2
## 211 0.003028610 healthy 3
## 27 0.003097422 healthy 4
## 222 0.003165527 healthy 5
## 245 0.003617366 healthy 6
## 217 0.004247289 healthy 7
## 263 0.004252341 healthy 8
## 95 0.004584178 healthy 9
## 94 0.005571569 healthy 10
## 22 0.006124983 healthy 11
## 226 0.007329205 healthy 12
## 19 0.007981298 healthy 13
## 254 0.008217007 healthy 14
## 221 0.008354930 healthy 15
## 256 0.009034574 healthy 16
## 90 0.010242121 healthy 17
## 71 0.011000034 healthy 18
## 148 0.011385704 healthy 19
## 135 0.011982965 healthy 20
## 199 0.012983984 healthy 21
## 163 0.013201037 healthy 22
## 102 0.014150104 healthy 23
## 264 0.014451038 healthy 24
## 146 0.014554209 unhealthy 25
## 130 0.015033948 healthy 26
## 191 0.015841235 healthy 27
## 242 0.015875792 healthy 28
## 143 0.016358136 healthy 29
## 282 0.016419912 healthy 30
## 117 0.016746018 healthy 31
## 150 0.019001592 healthy 32
## 231 0.019026606 healthy 33
## 26 0.019441324 healthy 34
## 113 0.021239775 healthy 35
## 47 0.022966668 healthy 36
## 270 0.023465719 healthy 37
## 292 0.023660411 healthy 38
## 104 0.023816792 healthy 39
## 278 0.024948978 healthy 40
## 296 0.025097112 healthy 41
## 5 0.026253999 healthy 42
## 76 0.028514591 healthy 43
## 201 0.028662579 healthy 44
## 44 0.028795023 healthy 45
## 170 0.030374207 healthy 46
## 243 0.030957274 healthy 47
## 33 0.031156620 unhealthy 48
## 16 0.031443239 healthy 49
## 86 0.033625168 healthy 50
## 35 0.035519859 healthy 51
## 6 0.036623801 healthy 52
## 28 0.036764717 healthy 53
## 129 0.038651472 healthy 54
## 204 0.039121409 healthy 55
## 284 0.041092208 healthy 56
## 228 0.041439655 healthy 57
## 118 0.043010961 healthy 58
## 45 0.043670536 unhealthy 59
## 195 0.043722537 healthy 60
## 62 0.044071213 healthy 61
## 51 0.045212124 healthy 62
## 241 0.046577593 healthy 63
## 250 0.049019064 healthy 64
## 50 0.049126644 healthy 65
## 239 0.050842601 healthy 66
## 89 0.050881022 healthy 67
## 87 0.051476886 healthy 68
## 83 0.052258069 healthy 69
## 85 0.052332974 healthy 70
## 187 0.055881052 healthy 71
## 54 0.056834144 healthy 72
## 133 0.057383996 healthy 73
## 290 0.058387396 healthy 74
## 235 0.058724520 healthy 75
## 161 0.059822030 healthy 76
## 1 0.062099416 healthy 77
## 209 0.062208482 healthy 78
## 240 0.064414970 healthy 79
## 8 0.066394093 healthy 80
## 126 0.066399313 healthy 81
## 20 0.066924129 healthy 82
## 258 0.079983914 healthy 83
## 274 0.089692902 healthy 84
## 200 0.091311778 unhealthy 85
## 49 0.092434805 healthy 86
## 185 0.096186963 unhealthy 87
## 141 0.096559427 healthy 88
## 100 0.104585966 healthy 89
## 149 0.105461313 healthy 90
## 101 0.110839080 healthy 91
## 227 0.110915898 healthy 92
## 136 0.111011538 healthy 93
## 4 0.113609757 healthy 94
## 180 0.114492942 healthy 95
## 123 0.116842845 healthy 96
## 140 0.124269406 healthy 97
## 31 0.126544058 healthy 98
## 15 0.130489782 healthy 99
## 261 0.138707036 healthy 100
## 36 0.139632266 healthy 101
## 280 0.140838233 healthy 102
## 186 0.142744928 healthy 103
## 21 0.142803306 healthy 104
## 295 0.145321482 unhealthy 105
## 259 0.154759366 healthy 106
## 106 0.157514867 healthy 107
## 220 0.158988022 healthy 108
## 12 0.160933163 healthy 109
## 14 0.161325646 healthy 110
## 42 0.163686615 healthy 111
## 164 0.168917435 healthy 112
## 165 0.182860432 healthy 113
## 168 0.187798107 healthy 114
## 213 0.207789639 healthy 115
## 202 0.209015413 healthy 116
## 40 0.212072901 healthy 117
## 78 0.214730318 healthy 118
## 79 0.222771395 healthy 119
## 152 0.224006064 healthy 120
## 82 0.227745780 healthy 121
## 18 0.228054268 healthy 122
## 289 0.239284008 healthy 123
## 151 0.241638846 healthy 124
## 260 0.248886621 unhealthy 125
## 277 0.250309804 healthy 126
## 17 0.254486059 unhealthy 127
## 153 0.260147993 healthy 128
## 203 0.260211314 healthy 129
## 67 0.270926284 unhealthy 130
## 68 0.273754102 healthy 131
## 205 0.274486678 healthy 132
## 275 0.277710692 unhealthy 133
## 234 0.278686452 healthy 134
## 216 0.280420160 healthy 135
## 257 0.282132523 healthy 136
## 116 0.284808945 healthy 137
## 233 0.288415392 unhealthy 138
## 302 0.290319803 unhealthy 139
## 134 0.290656356 healthy 140
## 299 0.306943078 unhealthy 141
## 52 0.311482759 healthy 142
## 262 0.312001020 unhealthy 143
## 29 0.321379988 healthy 144
## 103 0.324438353 healthy 145
## 210 0.348367592 unhealthy 146
## 218 0.352439407 healthy 147
## 23 0.361345492 unhealthy 148
## 58 0.365212639 unhealthy 149
## 174 0.367992856 healthy 150
## 198 0.368698797 healthy 151
## 131 0.376671866 healthy 152
## 145 0.377552339 healthy 153
## 132 0.381366128 healthy 154
## 255 0.382113409 healthy 155
## 125 0.386736001 unhealthy 156
## 43 0.387062547 healthy 157
## 11 0.391477771 healthy 158
## 179 0.399258886 healthy 159
## 99 0.422661163 healthy 160
## 70 0.432020605 unhealthy 161
## 59 0.438010086 healthy 162
## 215 0.445261105 unhealthy 163
## 160 0.450309482 healthy 164
## 173 0.450534712 unhealthy 165
## 166 0.451753243 healthy 166
## 91 0.469440883 healthy 167
## 268 0.475457614 unhealthy 168
## 75 0.479436972 unhealthy 169
## 269 0.480560387 unhealthy 170
## 291 0.492615946 unhealthy 171
## 276 0.509159664 healthy 172
## 60 0.522777452 healthy 173
## 212 0.541141848 unhealthy 174
## 197 0.544980768 healthy 175
## 183 0.545109373 healthy 176
## 293 0.545390099 unhealthy 177
## 272 0.546140318 healthy 178
## 142 0.547420366 unhealthy 179
## 251 0.570442123 healthy 180
## 279 0.578952477 unhealthy 181
## 238 0.631198562 unhealthy 182
## 13 0.647739413 unhealthy 183
## 144 0.658242622 unhealthy 184
## 298 0.684782514 unhealthy 185
## 177 0.687452168 healthy 186
## 53 0.702488808 unhealthy 187
## 247 0.705874159 unhealthy 188
## 246 0.708644163 unhealthy 189
## 162 0.717750550 unhealthy 190
## 61 0.719224535 unhealthy 191
## 34 0.720057696 healthy 192
## 115 0.728266034 unhealthy 193
## 230 0.750503476 unhealthy 194
## 157 0.751145818 unhealthy 195
## 74 0.754971451 unhealthy 196
## 169 0.770892651 unhealthy 197
## 81 0.775235460 healthy 198
## 194 0.796637680 unhealthy 199
## 84 0.800662839 unhealthy 200
## 225 0.806879666 unhealthy 201
## 111 0.811489543 unhealthy 202
## 181 0.812164223 unhealthy 203
## 244 0.813179938 unhealthy 204
## 184 0.813804721 healthy 205
## 57 0.824574972 unhealthy 206
## 110 0.826936305 unhealthy 207
## 108 0.845142336 unhealthy 208
## 93 0.849972280 healthy 209
## 46 0.853137162 unhealthy 210
## 105 0.856640939 unhealthy 211
## 285 0.865310779 unhealthy 212
## 172 0.866587227 healthy 213
## 266 0.880856760 unhealthy 214
## 114 0.888731337 unhealthy 215
## 10 0.889990950 unhealthy 216
## 24 0.894275653 unhealthy 217
## 137 0.898795128 unhealthy 218
## 107 0.900556820 unhealthy 219
## 7 0.913251803 unhealthy 220
## 96 0.916958660 unhealthy 221
## 188 0.920857417 unhealthy 222
## 80 0.922972893 unhealthy 223
## 189 0.923564290 unhealthy 224
## 190 0.925103473 unhealthy 225
## 219 0.925380488 healthy 226
## 237 0.926670250 unhealthy 227
## 139 0.932562076 unhealthy 228
## 249 0.940551673 unhealthy 229
## 112 0.941739080 unhealthy 230
## 32 0.943991082 unhealthy 231
## 229 0.954226067 unhealthy 232
## 158 0.954699941 unhealthy 233
## 30 0.954869283 unhealthy 234
## 72 0.957285535 unhealthy 235
## 55 0.958591574 unhealthy 236
## 208 0.962311675 unhealthy 237
## 232 0.962665558 unhealthy 238
## 41 0.964616179 unhealthy 239
## 37 0.967083158 unhealthy 240
## 48 0.972585444 unhealthy 241
## 286 0.973118237 unhealthy 242
## 155 0.974849359 unhealthy 243
## 287 0.974864179 unhealthy 244
## 119 0.975154297 unhealthy 245
## 248 0.975923512 unhealthy 246
## 121 0.976316797 unhealthy 247
## 214 0.976863953 unhealthy 248
## 196 0.980524845 unhealthy 249
## 156 0.981052202 unhealthy 250
## 138 0.981297525 unhealthy 251
## 65 0.981593207 unhealthy 252
## 271 0.981856494 unhealthy 253
## 124 0.982652585 unhealthy 254
## 92 0.983817695 unhealthy 255
## 283 0.984232291 unhealthy 256
## 9 0.984513939 unhealthy 257
## 178 0.985418187 unhealthy 258
## 97 0.986597517 unhealthy 259
## 69 0.986804486 unhealthy 260
## 175 0.986878593 unhealthy 261
## 171 0.987235288 unhealthy 262
## 38 0.987553136 unhealthy 263
## 98 0.988237660 unhealthy 264
## 301 0.988632504 unhealthy 265
## 265 0.989466671 unhealthy 266
## 122 0.989498213 unhealthy 267
## 253 0.990420853 healthy 268
## 297 0.990686567 unhealthy 269
## 128 0.992386274 unhealthy 270
## 281 0.993102290 unhealthy 271
## 2 0.993376136 unhealthy 272
## 300 0.993471763 unhealthy 273
## 147 0.993897346 unhealthy 274
## 109 0.994017715 unhealthy 275
## 252 0.994109138 unhealthy 276
## 159 0.994384275 unhealthy 277
## 120 0.994429272 unhealthy 278
## 206 0.994568332 unhealthy 279
## 39 0.994621317 unhealthy 280
## 182 0.994884640 unhealthy 281
## 63 0.994921518 unhealthy 282
## 77 0.995593851 unhealthy 283
## 294 0.996285860 unhealthy 284
## 154 0.996938269 unhealthy 285
## 236 0.997135275 unhealthy 286
## 56 0.997439676 unhealthy 287
## 3 0.997935717 unhealthy 288
## 176 0.998059276 unhealthy 289
## 73 0.998090959 unhealthy 290
## 25 0.998313153 unhealthy 291
## 127 0.998418347 unhealthy 292
## 192 0.998657003 unhealthy 293
## 224 0.998862241 unhealthy 294
## 207 0.998977158 unhealthy 295
## 273 0.999062844 unhealthy 296
## 66 0.999172369 unhealthy 297
ggplot(data=predicted.data2, aes(x=rank, y=probability.of.hd)) +
geom_point(aes(color=hd), alpha=1, shape=4, stroke=2) +
xlab("Index") +
ylab("Predicted probability of getting heart disease")
The graph is a probability plot showing the predicted probability of heart disease for each individual, sorted from lowest to highest. It’s a great way to visually assess how well our logistic regression model separates the two classes, healthy and unhealthy.
Most red (healthy) cases are on the left (low predicted probabilities), and most blue (unhealthy) cases are on the right (high predicted probabilities).
This means the model is ranking patients well: high probabilities for those with disease, low for those without. Some overlap is normal in real-world data, especially with clinical conditions that are influenced by many subtle factors.
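Another common way to look at the same separation is to overlay the distributions of predicted probability for the two classes (a hedged sketch reusing the predicted.data2 frame built above):
# Overlapping histograms of predicted probability by actual class:
# good separation shows up as two mostly non-overlapping humps.
ggplot(predicted.data2, aes(x=probability.of.hd, fill=hd)) +
  geom_histogram(alpha=0.5, position="identity", bins=30) +
  xlab("Predicted probability of heart disease")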
A confusion matrix shows how well your logistic regression model is classifying patients as healthy (0) or unhealthy (1), based on a threshold (usually 0.5).
# hd is a factor with labels "healthy"/"unhealthy", so convert it to numeric 0/1
data$hd_num <- ifelse(data$hd == "unhealthy", 1, 0)
# Predicted probabilities
predicted.probs <- logistic2$fitted.values
# Predicted classes: 1 if prob > 0.5, else 0
predicted.classes <- ifelse(predicted.probs > 0.5, 1, 0)
# Confusion matrix
confusion <- table(
Predicted = factor(predicted.classes, levels = c(0, 1)),
Actual = factor(data$hd_num, levels = c(0, 1))
)
confusion
## Actual
## Predicted 0 1
## 0 146 25
## 1 14 112
146 people were actually healthy, and the model correctly predicted them as healthy. This is a True Negative.
25 people were actually unhealthy, but the model mistakenly predicted them as healthy. This is a False Negative.
14 people were actually healthy, but the model mistakenly predicted them as unhealthy. This is a False Positive.
112 people were actually unhealthy, and the model correctly predicted them as unhealthy. This is a True Positive.
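Rather than typing the four counts by hand (as the next block does for readability), they could equally be pulled from the confusion matrix itself (a small sketch; the indexes assume the Predicted-by-Actual layout printed above):
# Same four numbers, extracted programmatically from `confusion`.
TN <- confusion["0", "0"] # predicted healthy, actually healthy
FP <- confusion["1", "0"] # predicted unhealthy, actually healthy
FN <- confusion["0", "1"] # predicted healthy, actually unhealthy
TP <- confusion["1", "1"] # predicted unhealthy, actually unhealthy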
# Extract values
TN <- 146
FP <- 14
FN <- 25
TP <- 112
# Metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN) # also called recall or true positive rate
specificity <- TN / (TN + FP) # true negative rate
precision <- TP / (TP + FP) # positive predictive value
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
# Print results
cat("Accuracy: ", round(accuracy, 4), "\n")
## Accuracy: 0.8687
cat("Sensitivity: ", round(sensitivity, 4), "\n")
## Sensitivity: 0.8175
cat("Specificity: ", round(specificity, 4), "\n")
## Specificity: 0.9125
cat("Precision: ", round(precision, 4), "\n")
## Precision: 0.8889
cat("F1 Score: ", round(f1_score, 4), "\n")
## F1 Score: 0.8517
The model has high accuracy and strong balance between detecting positives and avoiding false alarms.
It’s especially good at identifying healthy individuals (high specificity), and still quite good at catching disease (81.7% sensitivity).
Overall, these are excellent results for a binary classification model on medical data.
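One caveat: the 0.5 cutoff is a choice, not a law. A quick sketch (reusing predicted.probs and data$hd_num from above) shows how sensitivity and specificity trade off as the threshold moves; this trade-off is exactly what the ROC curve in the next section plots:
# Sweep the classification threshold: lower thresholds catch more
# disease (higher sensitivity) at the cost of more false alarms
# (lower specificity), and vice versa.
for (threshold in c(0.3, 0.5, 0.7)) {
  pred <- ifelse(predicted.probs > threshold, 1, 0)
  sens <- sum(pred == 1 & data$hd_num == 1) / sum(data$hd_num == 1)
  spec <- sum(pred == 0 & data$hd_num == 0) / sum(data$hd_num == 0)
  cat("threshold:", threshold,
      "sensitivity:", round(sens, 3),
      "specificity:", round(spec, 3), "\n")
}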
Finally: ROC and AUC
# install.packages("pROC") # if needed
library(pROC)
# ROC curve & AUC on full data
roc_obj <- roc(response = data$hd,
predictor = logistic2$fitted.values,
levels = c("healthy", "unhealthy"),
direction = "<") # smaller prob = Healthy
# Print AUC value
auc_val <- auc(roc_obj); auc_val
## Area under the curve: 0.9395
# Plot ROC with AUC displayed
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)")
The AUC = 0.94 means your model is very good at distinguishing between healthy and unhealthy patients.
On the plot, your curve is far above the diagonal “random guess” line, which shows the model is much better than chance.
In plain words: if you randomly pick one unhealthy and one healthy patient, the model has about a 94% chance of ranking the unhealthy one higher (more likely to have disease).
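We can even check that reading of AUC empirically (an illustrative sketch; the resampled pair comparison approximates the AUC up to ties and sampling noise):
# AUC is (approximately) the probability that a randomly chosen
# unhealthy patient gets a higher predicted probability than a
# randomly chosen healthy one.
set.seed(42)
probs_healthy <- predicted.probs[data$hd == "healthy"]
probs_unhealthy <- predicted.probs[data$hd == "unhealthy"]
mean(sample(probs_unhealthy, 10000, replace=TRUE) >
     sample(probs_healthy, 10000, replace=TRUE))
# should land near the AUC of ~0.94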