Farhad Abasahl
UMD: 113959479
Abstract
This analysis investigates factors predicting heart disease using a logistic regression model. We explore the relationship between age, cholesterol levels, resting blood pressure, maximum heart rate, and heart disease presence. Visualizations include predicted probability trends by age and heart rate, cholesterol distribution, and sex differences in heart disease prevalence. The analysis reveals key insights into how various health metrics are associated with heart disease risk, and discusses uncertainties in the predictive model.
We aim to predict the presence of heart disease based on several risk factors, including age, cholesterol levels, resting blood pressure, and maximum heart rate. A logistic regression model is used for prediction, and the relationships between these predictors and heart disease are explored through a series of multivariate visualizations.
First, we read the CSV-formatted data file and assign the resulting data frame to the variable “heart_data”.
setwd("/Users/farhadabasahl/Documents/R/heart+disease")
heart_data <- read.csv("processed.cleveland.data", header = FALSE)
head(heart_data)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
Before modeling, the data must be cleaned, transformed, and labeled. After assigning meaningful names to the columns, we located the cells containing “?”, which the dataset uses as a placeholder for missing data. We counted these occurrences (six in total, all in the ca and thal columns), replaced them with NA, and converted every column to numeric. We then removed the rows containing NA values, checked the cleaned data with summary(), and saved it to a new CSV file for further analysis:
colnames(heart_data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs",
"restecg", "thalach", "exang", "oldpeak", "slope",
"ca", "thal", "num")
which(heart_data == "?", arr.ind = TRUE)
## row col
## [1,] 167 12
## [2,] 193 12
## [3,] 288 12
## [4,] 303 12
## [5,] 88 13
## [6,] 267 13
sum(heart_data == "?")
## [1] 6
heart_data[heart_data == "?"] <- NA
heart_data <- as.data.frame(lapply(heart_data,
function(x) as.numeric(as.character(x))))
heart_data_clean <- na.omit(heart_data)
summary(heart_data_clean)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :1.000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:120.0
## Median :56.00 Median :1.0000 Median :3.000 Median :130.0
## Mean :54.54 Mean :0.6768 Mean :3.158 Mean :131.7
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :4.000 Max. :200.0
## chol fbs restecg thalach
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.0
## Median :243.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :247.4 Mean :0.1448 Mean :0.9966 Mean :149.6
## 3rd Qu.:276.0 3rd Qu.:0.0000 3rd Qu.:2.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.000 Min. :1.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.800 Median :2.000 Median :0.0000
## Mean :0.3266 Mean :1.056 Mean :1.603 Mean :0.6768
## 3rd Qu.:1.0000 3rd Qu.:1.600 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.200 Max. :3.000 Max. :3.0000
## thal num
## Min. :3.000 Min. :0.0000
## 1st Qu.:3.000 1st Qu.:0.0000
## Median :3.000 Median :0.0000
## Mean :4.731 Mean :0.9461
## 3rd Qu.:7.000 3rd Qu.:2.0000
## Max. :7.000 Max. :4.0000
write.csv(heart_data_clean, "heart_data_clean.csv", row.names = FALSE)
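As a possible shortcut, the “?” placeholders could be mapped to NA at read time, which also keeps the affected columns numeric from the start. The sketch below uses a small inline sample instead of the real file: the first two rows are copied from the head() output above, and the third row is made up to show a “?”. The names sample_text, sample_data, and sample_clean are illustrative only.

```r
# Inline sample standing in for processed.cleveland.data.
# Rows 1-2 come from the head() output above; row 3 is hypothetical
# and exists only to demonstrate a "?" placeholder.
sample_text <- "63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
52,1,3,138,223,0,0,169,0,0.0,1,?,3.0,0"

# na.strings = "?" converts the placeholder to NA while reading,
# so the column is parsed as numeric rather than character.
sample_data <- read.csv(text = sample_text, header = FALSE, na.strings = "?")

colnames(sample_data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs",
                           "restecg", "thalach", "exang", "oldpeak", "slope",
                           "ca", "thal", "num")

colSums(is.na(sample_data))            # missing values per column
sample_clean <- na.omit(sample_data)   # drop incomplete rows
nrow(sample_clean)
```

With the real file, the same na.strings argument could be added to the read.csv() call shown earlier, making the separate replacement and as.numeric() conversion steps unnecessary.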
Next, we predict the presence of heart disease (hdisease) from several features. The target is derived from the variable num, which ranges from 0 to 4 in this dataset.
To assess the presence of heart disease, we created a binary variable indicating heart disease status (1 for presence, when num is greater than 0, and 0 for absence). A model was then fitted to predict heart disease from age, sex, resting blood pressure, cholesterol, and maximum heart rate. Note that the code below calls lm(), which fits a linear probability model by ordinary least squares rather than a logistic regression, so its fitted values are not guaranteed to lie between 0 and 1. The model summary shows the significance of each predictor. Fitted values were then calculated and plotted to illustrate the relationship between the predictors and the likelihood of heart disease.
heart_data$hdisease <- ifelse(heart_data$num > 0, 1, 0)
model <- lm(hdisease ~ age + sex + trestbps + chol + thalach, data = heart_data)
summary(model)
##
## Call:
## lm(formula = hdisease ~ age + sex + trestbps + chol + thalach,
## data = heart_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95685 -0.35390 -0.06046 0.36846 0.96938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6136691 0.3284495 1.868 0.0627 .
## age 0.0023043 0.0031674 0.727 0.4675
## sex 0.3139838 0.0540150 5.813 1.58e-08 ***
## trestbps 0.0035544 0.0014665 2.424 0.0160 *
## chol 0.0011336 0.0004967 2.282 0.0232 *
## thalach -0.0082988 0.0011813 -7.025 1.46e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4276 on 297 degrees of freedom
## Multiple R-squared: 0.2783, Adjusted R-squared: 0.2661
## F-statistic: 22.9 on 5 and 297 DF, p-value: < 2.2e-16
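For comparison, a logistic regression with the same formula would typically be fitted with glm() and family = binomial instead of lm(). The sketch below shows the call on simulated data, since the real file is not bundled here. The data frame sim_data, the coefficients used to simulate it, and the name logit_model are all assumptions made for illustration, and nothing it prints describes the Cleveland data.

```r
set.seed(1)  # reproducible simulated data (not the Cleveland data)
n <- 300
sim_data <- data.frame(
  age      = round(runif(n, 29, 77)),
  sex      = rbinom(n, 1, 0.68),
  trestbps = round(rnorm(n, 132, 18)),
  chol     = round(rnorm(n, 247, 52)),
  thalach  = round(rnorm(n, 150, 23))
)

# Assumed data-generating log-odds, chosen only for illustration.
log_odds <- -1 + 0.02 * sim_data$age + 1.2 * sim_data$sex +
  0.01 * sim_data$trestbps + 0.004 * sim_data$chol - 0.03 * sim_data$thalach
sim_data$hdisease <- rbinom(n, 1, plogis(log_odds))

# Logistic regression: binomial family with the default logit link.
logit_model <- glm(hdisease ~ age + sex + trestbps + chol + thalach,
                   data = sim_data, family = binomial)
summary(logit_model)

# type = "response" returns probabilities, which always lie in [0, 1].
sim_data$predicted_prob <- predict(logit_model, type = "response")
range(sim_data$predicted_prob)
```

Unlike the least-squares fit, the coefficients of such a model are on the log-odds scale, so they are not directly comparable to the estimates in the summary above.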
library(ggplot2)  # required by the ggplot() calls below; not loaded earlier in this report
heart_data$predicted_prob <- predict(model, type = "response")
heart_data$hdisease <- as.factor(heart_data$hdisease)
ggplot(heart_data, aes(x = age, y = predicted_prob)) +
geom_point(aes(color = hdisease), alpha = 0.9) +
geom_smooth(method = "lm", color = "firebrick4", se = FALSE) +
labs(title = "Predicted Probability of Heart Disease by Age",
x = "Age",
y = "Predicted Probability") +
scale_color_manual(values = c("black", "grey70"), labels = c(
"No Heart Disease", "Heart Disease")) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The plot shows the predicted probability of heart disease as a function of age. Points represent individuals and are color-coded by heart disease status (black for no heart disease, grey for heart disease). The firebrick4 line is a straight-line fit from geom_smooth(method = "lm"), not a logistic curve; it highlights the overall increase in predicted probability with age.
ggplot(heart_data, aes(x = factor(hdisease), y = chol , fill = factor(hdisease))) +
geom_boxplot() +
labs(title = "Cholesterol Levels by Heart Disease Status",
x = "Heart Disease (0 = No, 1 = Yes)",
y = "Cholesterol Level") +
scale_fill_manual(values = c("grey30", "grey75"), labels = c(
"No Heart Disease", "Heart Disease")) +
theme_minimal()
The boxplot shows the distribution of cholesterol levels for individuals with and without heart disease, highlighting differences in cholesterol levels between the two groups, with a darker fill representing those without heart disease and a lighter fill for those with heart disease.
ggplot(heart_data, aes(x = chol, fill = factor(hdisease))) +
geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
facet_wrap(~sex) +
labs(title = "Cholesterol Levels Distribution by Heart Disease Status and Sex",
x = "Cholesterol Level",
y = "Count",
fill = "Heart Disease Status") +
theme_minimal() +
scale_fill_manual(values = c("skyblue", "tomato3"), labels = c("No Heart Disease", "Heart Disease"))
The faceted histogram illustrates the distribution of cholesterol levels for individuals with and without heart disease, separated by sex. This visualization allows us to examine how cholesterol levels vary not only by heart disease status but also by sex. The plot suggests that both males and females with higher cholesterol levels tend to have a higher prevalence of heart disease, though the distributions differ slightly between genders.
ggplot(heart_data, aes(x = thalach, y = predicted_prob)) +
geom_point(aes(color = hdisease), alpha = 0.6) +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Predicted Probability of Heart Disease by Maximum Heart Rate",
x = "Maximum Heart Rate",
y = "Predicted Probability") +
scale_color_manual(values = c("olivedrab2", "black"), labels = c("No Heart Disease",
"Heart Disease")) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The plot illustrates the predicted probability of heart disease based on maximum heart rate, with points color-coded by heart disease status (olive green for no heart disease, black for heart disease), and a loess trend line showing how the predicted probability changes as maximum heart rate increases.
ggplot(heart_data, aes(x = factor(sex), fill = factor(hdisease))) +
geom_bar(position = "dodge") +
labs(title = "Heart Disease by Gender",
x = "Sex (0 = Female, 1 = Male)",
y = "Count",
fill = "Heart Disease") +
scale_fill_manual(values = c("bisque1", "lightgoldenrod4"), labels = c(
"No Heart Disease - bisque1", "Heart Disease - lightgoldenrod4")) +
theme_minimal()
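The bar chart compares the counts of individuals with and without heart disease for males and females. It shows more heart disease cases among males than among females. Because the sample also contains more males (about 68% in the summary above), within-sex proportions rather than raw counts are needed to compare prevalence. The positive and highly significant coefficient for sex in the model summary is consistent with a higher risk among males.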
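Since raw counts depend on how many males and females are in the sample, prevalence is easier to compare as a within-sex proportion. A minimal sketch of that calculation follows. The data frame toy is made up purely to demonstrate the calls; the same table() and prop.table() calls could be applied to heart_data$sex and heart_data$hdisease.

```r
# Made-up example data (not the Cleveland data): 0 = female, 1 = male.
toy <- data.frame(
  sex      = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  hdisease = c(0, 0, 0, 1, 0, 0, 1, 1, 1, 1)
)

# Cross-tabulate sex (rows) against heart disease status (columns).
counts <- table(sex = toy$sex, hdisease = toy$hdisease)
counts

# margin = 1 normalises within each row, i.e. within each sex,
# so each row gives the share without and with heart disease.
prevalence <- prop.table(counts, margin = 1)
round(prevalence, 2)
```

Each row of prevalence sums to 1, so the second column can be read directly as the proportion of each sex with heart disease.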
When predicting heart disease with a regression model, uncertainty arises from several sources. One key source is variability in the data: individual differences and unmeasured factors lead to variability in the predicted probabilities. In the model summary above, for example, the coefficient for age is not statistically significant (p of about 0.47) once sex, resting blood pressure, cholesterol, and maximum heart rate are included. A logistic model also assumes that the log-odds of the outcome are a linear function of the predictors (such as age, cholesterol, and heart rate), and real data may not follow this pattern exactly.
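More precisely, the linearity assumption in logistic regression applies to the log-odds rather than to the probability itself. Written out for the predictors used in this report, with p the probability of heart disease and generic coefficients β (symbols only, not estimates from this analysis):

```latex
\log\frac{p}{1-p}
  = \beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{sex}
  + \beta_3\,\mathrm{trestbps} + \beta_4\,\mathrm{chol}
  + \beta_5\,\mathrm{thalach}
  \;=\; \eta,
\qquad
p = \frac{1}{1 + e^{-\eta}}
```

Because p is a nonlinear function of η, a one-unit change in a predictor shifts the log-odds by a constant amount but shifts the probability by an amount that depends on the values of the other predictors.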
Additionally, there is uncertainty in the estimated coefficients of the model. The model summary reports a standard error for each estimate, from which confidence intervals can be derived; these indicate the range within which the true effect is likely to fall. Intervals that are wide relative to the estimate, as for resting blood pressure and cholesterol (both significant only at the 5% level), suggest that the model is less certain about the exact effect of those predictors on heart disease risk.
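Confidence intervals are not printed by summary(), but they can be requested separately. The sketch below uses confint.default(), which returns Wald intervals (estimate plus or minus a normal quantile times the standard error), on a small simulated dataset. The names sim and fit and the simulation coefficients are illustrative assumptions, and the numbers produced say nothing about the real data.

```r
set.seed(2)  # simulated data for illustration only
n <- 200
sim <- data.frame(trestbps = rnorm(n, 132, 18),
                  thalach  = rnorm(n, 150, 23))
# Assumed data-generating model, chosen only for this example.
sim$hdisease <- rbinom(n, 1,
                       plogis(2 + 0.01 * sim$trestbps - 0.025 * sim$thalach))

fit <- glm(hdisease ~ trestbps + thalach, data = sim, family = binomial)

# 95% Wald intervals on the log-odds scale.
ci <- confint.default(fit, level = 0.95)
ci

# Exponentiate to read the estimates and intervals as odds ratios.
exp(cbind(odds_ratio = coef(fit), ci))
```

An interval that spans zero on the log-odds scale (or one on the odds-ratio scale) indicates that the direction of the effect is not clearly established by the data.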
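Lastly, missing data also contributes to uncertainty. We handled missing values by removing incomplete rows to create heart_data_clean, which could introduce bias if the values were not missing at random. Note, however, that the model above was fitted on heart_data rather than heart_data_clean. Because the missing values occur only in ca and thal, which are not predictors in the model, all 303 rows were used (297 residual degrees of freedom plus 6 estimated coefficients).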
Overall, while the model provides useful predictions, these sources of uncertainty should be considered when interpreting the results.
The analysis of heart disease risk highlights several important factors. In the fitted model, sex and maximum heart rate are the strongest predictors, resting blood pressure and cholesterol are significant at the 5% level, and age is not significant after adjusting for the other variables. The visualizations nevertheless show the predicted probability rising with age and changing with maximum heart rate. Cholesterol levels also differ between individuals with and without heart disease, especially when segmented by sex. These insights suggest that both demographic (e.g., sex) and physiological (e.g., cholesterol, heart rate) factors are key to understanding heart disease risk.