Week 12 - Discussion: Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
DATA
The dataset is the Cleveland Heart Disease dataset from the UCI repository. It contains data on 303 individuals in 14 columns (extracted from a larger set of 75) and has no missing values. The classification task is to predict whether an individual has heart disease (0: absence, 1: presence).
The dataset contains 13 attributes and a target variable: 8 nominal attributes and 5 numeric attributes. The features are described below:
age: Patient's age in years (numeric)
sex: Gender (1: male; 0: female) (nominal)
cp: Type of chest pain experienced by the patient, in 4 categories (0: typical angina; 1: atypical angina; 2: non-anginal pain; 3: asymptomatic) (nominal)
trestbps: Resting blood pressure in mm Hg (numeric)
chol: Serum cholesterol in mg/dl (numeric)
fbs: Fasting blood sugar > 120 mg/dl (1: true; 0: false) (nominal)
restecg: Resting electrocardiogram result, in 3 categories (0: normal; 1: ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); 2: probable or definite left ventricular hypertrophy by Estes' criteria) (nominal)
thalach: Maximum heart rate achieved (numeric)
exang: Exercise-induced angina (0: no; 1: yes) (nominal)
oldpeak: ST depression induced by exercise relative to rest (numeric)
slope: Slope of the ST segment during peak exercise (0: upsloping; 1: flat; 2: downsloping) (nominal)
ca: Number of major vessels (0–3) (nominal)
thal: Thalassemia, a blood disorder (0: NULL; 1: normal blood flow; 2: fixed defect (no blood flow in some part of the heart); 3: reversible defect (blood flow is observed but is not normal)) (nominal)
target: The variable to predict (1: the patient has heart disease; 0: the patient is normal)
https://www.kaggle.com/datasets/ritwikb3/heart-disease-cleveland
Original data: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read.csv("~/Downloads/Heart_disease_cleveland_new.csv")
summary(data)
##       age             sex               cp           trestbps
##  Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:120.0
##  Median :56.00   Median :1.0000   Median :2.000   Median :130.0
##  Mean   :54.44   Mean   :0.6799   Mean   :2.158   Mean   :131.7
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:140.0
##  Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0
##       chol            fbs            restecg          thalach
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5
##  Median :241.0   Median :0.0000   Median :1.0000   Median :153.0
##  Mean   :246.7   Mean   :0.1485   Mean   :0.9901   Mean   :149.6
##  3rd Qu.:275.0   3rd Qu.:0.0000   3rd Qu.:2.0000   3rd Qu.:166.0
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0
##      exang           oldpeak         slope              ca
##  Min.   :0.0000   Min.   :0.00   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:0.0000   1st Qu.:0.0000
##  Median :0.0000   Median :0.80   Median :1.0000   Median :0.0000
##  Mean   :0.3267   Mean   :1.04   Mean   :0.6007   Mean   :0.6634
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:1.0000   3rd Qu.:1.0000
##  Max.   :1.0000   Max.   :6.20   Max.   :2.0000   Max.   :3.0000
##       thal           target
##  Min.   :1.000   Min.   :0.0000
##  1st Qu.:1.000   1st Qu.:0.0000
##  Median :1.000   Median :0.0000
##  Mean   :1.832   Mean   :0.4587
##  3rd Qu.:3.000   3rd Qu.:1.0000
##  Max.   :3.000   Max.   :1.0000
head(data)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  0      145  233   1       2     150     0     2.3     2  0    2
## 2  67   1  3      160  286   0       2     108     1     1.5     1  3    1
## 3  67   1  3      120  229   0       2     129     1     2.6     1  2    3
## 4  37   1  2      130  250   0       0     187     0     3.5     2  0    1
## 5  41   0  1      130  204   0       2     172     0     1.4     0  0    1
## 6  56   1  1      120  236   0       0     178     0     0.8     0  0    1
##   target
## 1      0
## 2      1
## 3      1
## 4      0
## 5      0
## 6      0
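The attribute list marks several columns as nominal, yet summary() treats all of them as numbers, which means lm() would fit a single slope for a coded column such as cp. A minimal sketch of checking for missing values and recoding nominal columns as factors is shown here. It uses a small synthetic stand-in data frame (the name `heart` and its values are invented for illustration) with the same column names and codings as the description above, so the printed values are not from the Cleveland data.

```r
# Synthetic stand-in with the same column names and codings as the
# attribute description; replace `heart` with the data frame read from the CSV.
set.seed(1)
n <- 20
heart <- data.frame(
  age  = sample(29:77, n, replace = TRUE),
  sex  = sample(0:1, n, replace = TRUE),
  cp   = sample(0:3, n, replace = TRUE),
  chol = sample(126:564, n, replace = TRUE)
)

# Count missing values per column (all zeros means nothing is missing)
colSums(is.na(heart))

# Recode nominal columns as labelled factors so that lm() would create
# one dummy variable per non-reference category
heart$sex <- factor(heart$sex, levels = c(0, 1), labels = c("female", "male"))
heart$cp <- factor(heart$cp, levels = 0:3,
                   labels = c("typical angina", "atypical angina",
                              "non-anginal pain", "asymptomatic"))

# Inspect the resulting column types
str(heart)
```

The same factor() calls could be applied to the other nominal columns (fbs, restecg, exang, slope, ca, thal) before modelling.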
hist(data$age)
Linear Model
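# quadratic term: age squared (defined here but not used in the models below)
quadratic <- data$age^2
# interaction term: age x chol; both are quantitative, so this is not a
# dichotomous vs. quantitative interaction (also not used in the models below)
age_chol <- data$age * data$chol
# multiple regression model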
data_lm <- lm(age ~ chol + sex + cp + trestbps, data = data)
summary(data_lm)
##
## Call:
## lm(formula = age ~ chol + sex + cp + trestbps, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -22.6069  -6.4945   0.4826   6.1839  21.3962
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.324160   4.485359   6.315 9.81e-10 ***
## chol         0.027410   0.009762   2.808  0.00532 **
## sex         -0.969238   1.071493  -0.905  0.36643
## cp           0.968230   0.512457   1.889  0.05981 .
## trestbps     0.136093   0.028135   4.837 2.12e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.517 on 298 degrees of freedom
## Multiple R-squared: 0.1239, Adjusted R-squared: 0.1122
## F-statistic: 10.54 on 4 and 298 DF, p-value: 5.352e-08
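The formula above has no quadratic or interaction term, and the `quadratic` and `age_chol` vectors are never passed to lm(). A minimal sketch of how a quadratic term, a dichotomous term, and a dichotomous vs. quantitative interaction could be written directly in the formula follows. It runs on simulated stand-in data (the `sim` data frame and the numbers used to generate it are invented for illustration), so its output says nothing about the Cleveland data. Choosing chol for the quadratic term and sex:chol for the interaction is likewise an assumption, not a choice made in the analysis above.

```r
# Simulated stand-in data; column names mirror the heart disease data,
# but every value here is made up for illustration.
set.seed(42)
n <- 200
sim <- data.frame(
  sex      = rbinom(n, 1, 0.68),          # dichotomous: 1 = male, 0 = female
  chol     = round(rnorm(n, 247, 50)),    # quantitative
  trestbps = round(rnorm(n, 132, 18))     # quantitative
)
sim$age <- 30 + 0.03 * sim$chol + 0.13 * sim$trestbps + rnorm(n, sd = 8)

# I(chol^2) adds the quadratic term; sex is the dichotomous term;
# sex:chol is the dichotomous vs. quantitative interaction
fit <- lm(age ~ chol + I(chol^2) + sex + sex:chol + trestbps, data = sim)
summary(fit)

# Names of the estimated coefficients, including the quadratic
# and interaction terms
names(coef(fit))
```

With sex coded 0/1, the interaction coefficient is the difference in the chol slope between males and females, and the sex coefficient is the difference in intercepts.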
To try to improve the model, we add further attributes one at a time.
#added fbs
data_lm <- lm(age ~ chol + sex + cp + trestbps + fbs, data = data)
summary(data_lm)
##
## Call:
## lm(formula = age ~ chol + sex + cp + trestbps + fbs, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -22.2595  -6.4010   0.2956   5.9521  21.6572
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.994889   4.500961   6.442 4.75e-10 ***
## chol         0.027394   0.009744   2.811  0.00526 **
## sex         -1.061763   1.071425  -0.991  0.32250
## cp           0.993977   0.511822   1.942  0.05308 .
## trestbps     0.128801   0.028531   4.514 9.16e-06 ***
## fbs          2.026465   1.398329   1.449  0.14834
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.501 on 297 degrees of freedom
## Multiple R-squared: 0.1301, Adjusted R-squared: 0.1154
## F-statistic: 8.881 on 5 and 297 DF, p-value: 7.13e-08
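Whether an added predictor such as fbs improves the fit can be checked with a partial F-test on the two nested models, alongside the adjusted R-squared. The sketch below shows one way to do this with anova(). It uses simulated stand-in data (the `sim` data frame and the model names `small` and `large` are invented for illustration), so the numbers it prints are not results for the Cleveland data.

```r
# Simulated stand-in data; values are made up for illustration only.
set.seed(7)
n <- 150
sim <- data.frame(
  chol     = rnorm(n, 247, 50),
  trestbps = rnorm(n, 132, 18),
  fbs      = rbinom(n, 1, 0.15)
)
sim$age <- 30 + 0.03 * sim$chol + 0.13 * sim$trestbps + rnorm(n, sd = 8)

# Nested models: `large` is `small` plus one extra predictor
small <- lm(age ~ chol + trestbps, data = sim)
large <- update(small, . ~ . + fbs)

# Partial F-test: does the extra predictor reduce the residual
# sum of squares by more than chance would suggest?
cmp <- anova(small, large)
cmp

# Adjusted R-squared penalises extra predictors, unlike multiple R-squared
c(small = summary(small)$adj.r.squared,
  large = summary(large)$adj.r.squared)
```

For a single added predictor, the p-value of this F-test equals the p-value of that predictor's t-test in the larger model.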
#added restecg
data_lm <- lm(age ~ chol + sex + cp + trestbps + fbs + restecg, data = data)
summary(data_lm)
##
## Call:
## lm(formula = age ~ chol + sex + cp + trestbps + fbs + restecg,
##     data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -23.115  -6.008  -0.044   6.173  21.432
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.660322   4.518537   6.564 2.34e-10 ***
## chol         0.025188   0.009854   2.556   0.0111 *
## sex         -1.151032   1.071564  -1.074   0.2836
## cp           0.948812   0.511999   1.853   0.0649 .
## trestbps     0.123853   0.028701   4.315 2.18e-05 ***
## fbs          1.935493   1.397550   1.385   0.1671
## restecg      0.708776   0.504776   1.404   0.1613
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.487 on 296 degrees of freedom
## Multiple R-squared: 0.1358, Adjusted R-squared: 0.1183
## F-statistic: 7.754 on 6 and 296 DF, p-value: 9.298e-08
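Based on the multiple R-squared of 0.1358, the final model explains 13.58% of the variability in age (adjusted R-squared: 0.1183). Only chol and trestbps are significant at the 5% level. Holding the other predictors fixed, each additional 1 mm Hg of resting blood pressure is associated with an age about 0.12 years higher.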
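The assignment also asks for residual analysis, which is how we would judge whether a linear model is appropriate. A minimal sketch of the usual checks is given here: residuals against fitted values (looking for curvature or a funnel shape), a normal Q-Q plot, and a Shapiro-Wilk test. It runs on simulated stand-in data (the `sim` data frame is invented for illustration), so the plots and the test result do not describe the Cleveland model. The same calls could be applied to `data_lm`.

```r
# Simulated stand-in data; values are made up for illustration only.
set.seed(123)
n <- 200
sim <- data.frame(
  chol     = rnorm(n, 247, 50),
  trestbps = rnorm(n, 132, 18)
)
sim$age <- 30 + 0.03 * sim$chol + 0.13 * sim$trestbps + rnorm(n, sd = 8)

fit <- lm(age ~ chol + trestbps, data = sim)
res <- resid(fit)

# Two diagnostic plots side by side
op <- par(mfrow = c(1, 2))

# Residuals vs fitted: a patternless band around zero supports
# linearity and constant variance
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line support normal residuals
qqnorm(res)
qqline(res)

par(op)

# Shapiro-Wilk test of normality of the residuals
shapiro.test(res)
```

Calling plot(fit) would produce R's built-in set of diagnostic plots in one step.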