Week 12 - Discussion: Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

DATA

The data are the Cleveland Heart Disease dataset from the UCI repository. It contains records for 303 individuals in 14 columns (extracted from a larger set of 75) and has no missing values. The classification task is to predict whether an individual has heart disease (0: absence, 1: presence).

This database contains 13 attributes and a target variable. Of the attributes, 8 are nominal and 5 are numeric. The features are described in detail below:

Age: Patient's age in years (Numeric)
Sex: Gender (Male: 1; Female: 0) (Nominal)
cp: Type of chest pain experienced by the patient, in one of four categories. 0: typical angina; 1: atypical angina; 2: non-anginal pain; 3: asymptomatic (Nominal)
trestbps: Patient's resting blood pressure in mm Hg (Numeric)
chol: Serum cholesterol in mg/dl (Numeric)
fbs: Whether fasting blood sugar is > 120 mg/dl (1: true; 0: false) (Nominal)
restecg: Resting electrocardiogram result, in one of three categories. 0: normal; 1: ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); 2: probable or definite left ventricular hypertrophy by Estes' criteria (Nominal)
thalach: Maximum heart rate achieved (Numeric)
exang: Exercise-induced angina (0: no; 1: yes) (Nominal)
oldpeak: Exercise-induced ST depression relative to rest (Numeric)
slope: Slope of the ST segment during peak exercise. 0: upsloping; 1: flat; 2: downsloping (Nominal)
ca: The number of major vessels (0–3) (Nominal)
thal: A blood disorder called thalassemia. 0: NULL; 1: normal blood flow; 2: fixed defect (no blood flow in some part of the heart); 3: reversible defect (blood flow is observed but is not normal) (Nominal)
target: The variable to predict. 1 means the patient has heart disease and 0 means the patient is normal.
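The nominal attributes above are stored as integer codes, so R reads them as numbers unless told otherwise. The following is a minimal sketch of labelling a few of them as factors. It uses only the six rows printed by head(data) in this post, and the data frame name `heart` and the label strings are illustrative choices, not part of the original analysis.

```r
# First six rows as printed by head(data) in this post (subset of columns)
heart <- data.frame(
  age      = c(63, 67, 67, 37, 41, 56),
  sex      = c(1, 1, 1, 1, 0, 1),
  cp       = c(0, 3, 3, 2, 1, 1),
  trestbps = c(145, 160, 120, 130, 130, 120),
  chol     = c(233, 286, 229, 250, 204, 236),
  fbs      = c(1, 0, 0, 0, 0, 0),
  restecg  = c(2, 2, 2, 0, 2, 0),
  target   = c(0, 1, 1, 0, 0, 0)
)

# Label the nominal codes so R treats them as categories, not numbers
heart$sex <- factor(heart$sex, levels = c(0, 1),
                    labels = c("female", "male"))
heart$cp <- factor(heart$cp, levels = 0:3,
                   labels = c("typical angina", "atypical angina",
                              "non-anginal pain", "asymptomatic"))
heart$restecg <- factor(heart$restecg, levels = 0:2,
                        labels = c("normal", "ST-T abnormality",
                                   "LV hypertrophy"))

# Inspect the resulting column types
str(heart)
```

When a multi-level attribute such as cp enters lm() as a number, the model estimates a single slope across its codes. As a factor, lm() would instead estimate one coefficient per non-reference level.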

https://www.kaggle.com/datasets/ritwikb3/heart-disease-cleveland

Original data: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read.csv("~/Downloads/Heart_disease_cleveland_new.csv")
summary(data)
##       age             sex               cp           trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:120.0  
##  Median :56.00   Median :1.0000   Median :2.000   Median :130.0  
##  Mean   :54.44   Mean   :0.6799   Mean   :2.158   Mean   :131.7  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0  
##       chol            fbs            restecg          thalach     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
##  Median :241.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :246.7   Mean   :0.1485   Mean   :0.9901   Mean   :149.6  
##  3rd Qu.:275.0   3rd Qu.:0.0000   3rd Qu.:2.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak         slope              ca        
##  Min.   :0.0000   Min.   :0.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.80   Median :1.0000   Median :0.0000  
##  Mean   :0.3267   Mean   :1.04   Mean   :0.6007   Mean   :0.6634  
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.20   Max.   :2.0000   Max.   :3.0000  
##       thal           target      
##  Min.   :1.000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.0000  
##  Median :1.000   Median :0.0000  
##  Mean   :1.832   Mean   :0.4587  
##  3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000
head(data)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  0      145  233   1       2     150     0     2.3     2  0    2
## 2  67   1  3      160  286   0       2     108     1     1.5     1  3    1
## 3  67   1  3      120  229   0       2     129     1     2.6     1  2    3
## 4  37   1  2      130  250   0       0     187     0     3.5     2  0    1
## 5  41   0  1      130  204   0       2     172     0     1.4     0  0    1
## 6  56   1  1      120  236   0       0     178     0     0.8     0  0    1
##   target
## 1      0
## 2      1
## 3      1
## 4      0
## 5      0
## 6      0
hist(data$age)

Linear Model

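#quadratic term (computed here, but not included in the models below)
quadratic <- data$age^2

#age x chol product term (both variables are quantitative; computed here, but not included in the models below)
age_chol <- data$age * data$chol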

#multiple regression model
data_lm <- lm(age ~ chol + sex + cp + trestbps, data = data)
summary(data_lm)
## 
## Call:
## lm(formula = age ~ chol + sex + cp + trestbps, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.6069  -6.4945   0.4826   6.1839  21.3962 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.324160   4.485359   6.315 9.81e-10 ***
## chol         0.027410   0.009762   2.808  0.00532 ** 
## sex         -0.969238   1.071493  -0.905  0.36643    
## cp           0.968230   0.512457   1.889  0.05981 .  
## trestbps     0.136093   0.028135   4.837 2.12e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.517 on 298 degrees of freedom
## Multiple R-squared:  0.1239, Adjusted R-squared:  0.1122 
## F-statistic: 10.54 on 4 and 298 DF,  p-value: 5.352e-08
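The discussion prompt asks for a quadratic term, a dichotomous term, and a dichotomous vs. quantitative interaction term. One way these could be written directly in the model formula is sketched below. The data frame `sim` is simulated stand-in data with arbitrary coefficients, not the Cleveland file, and the choice of chol for the quadratic term and sex:chol for the interaction is an assumption made for illustration.

```r
set.seed(1)  # simulated stand-in data, NOT the Cleveland file
n <- 200
sim <- data.frame(
  chol     = round(rnorm(n, mean = 245, sd = 50)),
  sex      = rbinom(n, size = 1, prob = 0.68),
  trestbps = round(rnorm(n, mean = 130, sd = 17))
)
# Arbitrary data-generating process, chosen only so the model has signal
sim$age <- 30 + 0.03 * sim$chol + 0.12 * sim$trestbps + rnorm(n, sd = 8)

# I(chol^2) : quadratic term (I() protects ^ from formula syntax)
# sex       : dichotomous term (0/1)
# sex:chol  : dichotomous-by-quantitative interaction
fit <- lm(age ~ chol + I(chol^2) + sex + sex:chol + trestbps, data = sim)

# List the estimated terms
print(names(coef(fit)))
```

In a model of this form, the interaction coefficient is the difference in the linear chol slope between the sex = 1 and sex = 0 groups, and the sex coefficient is the group difference in predicted age at chol = 0.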

To try to improve the model, we add more attributes one at a time.

#added fbs
data_lm <- lm(age ~ chol + sex + cp + trestbps + fbs, data = data)
summary(data_lm)
## 
## Call:
## lm(formula = age ~ chol + sex + cp + trestbps + fbs, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2595  -6.4010   0.2956   5.9521  21.6572 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.994889   4.500961   6.442 4.75e-10 ***
## chol         0.027394   0.009744   2.811  0.00526 ** 
## sex         -1.061763   1.071425  -0.991  0.32250    
## cp           0.993977   0.511822   1.942  0.05308 .  
## trestbps     0.128801   0.028531   4.514 9.16e-06 ***
## fbs          2.026465   1.398329   1.449  0.14834    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.501 on 297 degrees of freedom
## Multiple R-squared:  0.1301, Adjusted R-squared:  0.1154 
## F-statistic: 8.881 on 5 and 297 DF,  p-value: 7.13e-08
#added restecg
data_lm <- lm(age ~ chol + sex + cp + trestbps + fbs + restecg, data = data)
summary(data_lm)
## 
## Call:
## lm(formula = age ~ chol + sex + cp + trestbps + fbs + restecg, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.115  -6.008  -0.044   6.173  21.432 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.660322   4.518537   6.564 2.34e-10 ***
## chol         0.025188   0.009854   2.556   0.0111 *  
## sex         -1.151032   1.071564  -1.074   0.2836    
## cp           0.948812   0.511999   1.853   0.0649 .  
## trestbps     0.123853   0.028701   4.315 2.18e-05 ***
## fbs          1.935493   1.397550   1.385   0.1671    
## restecg      0.708776   0.504776   1.404   0.1613    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.487 on 296 degrees of freedom
## Multiple R-squared:  0.1358, Adjusted R-squared:  0.1183 
## F-statistic: 7.754 on 6 and 296 DF,  p-value: 9.298e-08
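In the summaries above, adjusted R-squared rises slightly with each added attribute (0.1122, 0.1154, 0.1183), while the t-tests for fbs and restecg are not significant. A partial F-test on the nested models is one way to check whether the additions help. The sketch below runs on simulated stand-in data with arbitrary coefficients, not the Cleveland file, so its numbers say nothing about the real dataset.

```r
set.seed(2)  # simulated stand-in data, NOT the Cleveland file
n <- 250
sim <- data.frame(
  chol     = round(rnorm(n, mean = 245, sd = 50)),
  sex      = rbinom(n, size = 1, prob = 0.68),
  cp       = sample(0:3, n, replace = TRUE),
  trestbps = round(rnorm(n, mean = 130, sd = 17)),
  fbs      = rbinom(n, size = 1, prob = 0.15),
  restecg  = sample(0:2, n, replace = TRUE)
)
# Arbitrary data-generating process for illustration only
sim$age <- 28 + 0.03 * sim$chol + 0.13 * sim$trestbps + rnorm(n, sd = 8.5)

# Same sequence of nested models as in this post
m1 <- lm(age ~ chol + sex + cp + trestbps, data = sim)
m2 <- update(m1, . ~ . + fbs)
m3 <- update(m2, . ~ . + restecg)

# Each row tests whether the added term reduces the residual sum of squares
cmp <- anova(m1, m2, m3)
print(cmp)
```

A large p-value in a row of this table suggests the term added at that step does not explain meaningfully more variation than the smaller model.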

Based on the multiple R-squared of 0.1358, the final model explains 13.58% of the variability in age. The adjusted R-squared is 0.1183.
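The R-squared figure is one minus the ratio of the residual sum of squares to the total sum of squares of the response. The sketch below reproduces that calculation by hand and then draws two common residual diagnostics of the kind the discussion prompt asks for. It runs on simulated stand-in data with arbitrary coefficients, not the Cleveland file, and the two-predictor model is a simplification for illustration.

```r
set.seed(3)  # simulated stand-in data, NOT the Cleveland file
n <- 250
sim <- data.frame(
  chol     = round(rnorm(n, mean = 245, sd = 50)),
  trestbps = round(rnorm(n, mean = 130, sd = 17))
)
# Arbitrary data-generating process for illustration only
sim$age <- 28 + 0.03 * sim$chol + 0.13 * sim$trestbps + rnorm(n, sd = 8.5)

fit <- lm(age ~ chol + trestbps, data = sim)
res <- residuals(fit)

# R-squared by hand: 1 - RSS / TSS
rss <- sum(res^2)
tss <- sum((sim$age - mean(sim$age))^2)
r2  <- 1 - rss / tss
print(all.equal(r2, summary(fit)$r.squared))

# Residual diagnostics: look for curvature or funnel shapes (left)
# and for departures from the reference line (right)
par(mfrow = c(1, 2))
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
qqnorm(res)
qqline(res)
```

If the residuals scatter evenly around zero with no pattern and follow the Q-Q line closely, the linearity, constant-variance, and normality assumptions are reasonable. Systematic curvature or a funnel shape would argue against the linear model.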