Heart Disease Prediction Using Machine Learning and Data Analytic

In this project I will employ both Support Vector Machines and Logistic Regression to study and attempt to predict if, based on a number of risk factors,a given patient will have heart disease.The data set contains data on 303 patients consisting of 13 risk factors and a zero-one valued variable which indicates if the patient has heart disease.The data set can be downloaded

setwd("~/myRprojects")
#install.packages(c("tidyr","dplyr","ggplot2","latticeExtra","psych","funModeling", "GGally",
             #"corrplot","class","gmodels","magrittr","leaps","car",
             #"ROCR","e1071","caret","MASS"))
package <- c("tidyr","dplyr","ggplot2","latticeExtra","psych","funModeling"              , "GGally","corrplot","class","gmodels","magrittr","leaps",
             "car","ROCR","e1071","caret","MASS","stargazer")
lapply(package,require,character.only=TRUE)

## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE
## 
## [[5]]
## [1] TRUE
## 
## [[6]]
## [1] TRUE
## 
## [[7]]
## [1] TRUE
## 
## [[8]]
## [1] TRUE
## 
## [[9]]
## [1] TRUE
## 
## [[10]]
## [1] TRUE
## 
## [[11]]
## [1] TRUE
## 
## [[12]]
## [1] TRUE
## 
## [[13]]
## [1] TRUE
## 
## [[14]]
## [1] TRUE
## 
## [[15]]
## [1] TRUE
## 
## [[16]]
## [1] TRUE
## 
## [[17]]
## [1] TRUE
## 
## [[18]]
## [1] TRUE

The variables in the dataset are : **Chol: serum cholestoral in mg/dl

**Chest Pain: chest pain type - Value 1: typical angina - Value 2: atypical angina - Value 3: non-anginal pain - Value 4: asymptomatic

**Age: age in years

**Sex: sex (1 = male; 0 = female)

**fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

**Restecg: resting electrocardiographic results - Value 0: normal - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) - Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria

**RestBP: resting blood pressure (in mm Hg on admission to the hospital)

**MaxHR: maximum heart rate achieved

**Exang: exercise induced angina (1 = yes; 0 = no)

**oldpeak: ST depression induced by exercise relative to rest

**Slope: the slope of the peak exercise ST segment - Value 1: upsloping - Value 2: flat - Value 3: downsloping

**ca: number of major vessels (0-3) colored by flourosopy

**thal: 3 = normal; 6 = fixed defect; 7 = reversable defect

**AHD:diagnosis of heart disease

heart <- read.csv("Heart.csv")
heart <- heart[,-1]
sum(is.na(heart))

## [1] 6

heart <- na.omit(heart)
str(heart)

## 'data.frame':    297 obs. of  14 variables:
##  $ Age      : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ Sex      : int  1 1 1 1 0 1 0 0 1 1 ...
##  $ ChestPain: Factor w/ 4 levels "asymptomatic",..: 4 1 1 2 3 3 1 1 1 1 ...
##  $ RestBP   : int  145 160 120 130 130 120 140 120 130 140 ...
##  $ Chol     : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ Fbs      : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ RestECG  : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ MaxHR    : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ ExAng    : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ Oldpeak  : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ Slope    : int  3 2 2 3 1 1 3 1 2 3 ...
##  $ Ca       : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ Thal     : Factor w/ 3 levels "fixed","normal",..: 1 2 3 2 2 2 2 2 3 3 ...
##  $ AHD      : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 2 1 2 2 ...
##  - attr(*, "na.action")= 'omit' Named int  88 167 193 267 288 303
##   ..- attr(*, "names")= chr  "88" "167" "193" "267" ...

varqual <- heart[,c("Sex","ChestPain","Fbs","RestECG","ExAng","Slope","Thal","AHD")]
varquant <- heart[,c("Age","RestBP","Chol","MaxHR","Oldpeak","Ca")]
varqual <- lapply(varqual, factor)
varqual <- as.data.frame(varqual)
sapply(varqual, class)

##       Sex ChestPain       Fbs   RestECG     ExAng     Slope      Thal 
##  "factor"  "factor"  "factor"  "factor"  "factor"  "factor"  "factor" 
##       AHD 
##  "factor"

heart <- cbind(varqual,varquant)

Graphical Analysis for the categorical variables

Refering to the variable information,We recode the categorical variable in order to have a better understanding of the graph

levels(varqual$Fbs)[levels(varqual$Fbs)=="1"] <- "BlSugar"
levels(varqual$Fbs)[levels(varqual$Fbs)=="0"] <- "NoBlSugar"
levels(varqual$Sex)[levels(varqual$Sex)=="1"] <- "Male"
levels(varqual$Sex)[levels(varqual$Sex)=="0"] <- "Female"
levels(varqual$RestECG )[levels(varqual$RestECG)=="0"] <- "Normality"
levels(varqual$RestECG )[levels(varqual$RestECG)=="1"] <- "Abnormality"
levels(varqual$RestECG )[levels(varqual$RestECG)=="2"] <- "Hypertrophy"
levels(varqual$ExAng)[levels(varqual$ExAng)=="1"] <- "exang"
levels(varqual$ExAng)[levels(varqual$ExAng)=="0"] <- "NO.exang"
levels(varqual$Slope)[levels(varqual$Slope)=="1"] <- "upsloping"
levels(varqual$Slope)[levels(varqual$Slope)=="2"] <- "Flat"
levels(varqual$Slope)[levels(varqual$Slope)=="3"] <- "downsloping"

par(mfrow=c(2,2))
freq(data=varqual, input = names(varqual))

##      Sex frequency percentage cumulative_perc
## 1   Male       201      67.68           67.68
## 2 Female        96      32.32          100.00

##      ChestPain frequency percentage cumulative_perc
## 1 asymptomatic       142      47.81           47.81
## 2   nonanginal        83      27.95           75.76
## 3   nontypical        49      16.50           92.26
## 4      typical        23       7.74          100.00

##         Fbs frequency percentage cumulative_perc
## 1 NoBlSugar       254      85.52           85.52
## 2   BlSugar        43      14.48          100.00

##       RestECG frequency percentage cumulative_perc
## 1   Normality       147      49.49           49.49
## 2 Hypertrophy       146      49.16           98.65
## 3 Abnormality         4       1.35          100.00

##      ExAng frequency percentage cumulative_perc
## 1 NO.exang       200      67.34           67.34
## 2    exang        97      32.66          100.00

##         Slope frequency percentage cumulative_perc
## 1   upsloping       139      46.80           46.80
## 2        Flat       137      46.13           92.93
## 3 downsloping        21       7.07          100.00

##         Thal frequency percentage cumulative_perc
## 1     normal       164      55.22           55.22
## 2 reversable       115      38.72           93.94
## 3      fixed        18       6.06          100.00

##   AHD frequency percentage cumulative_perc
## 1  No       160      53.87           53.87
## 2 Yes       137      46.13          100.00

## [1] "Variables processed: Sex, ChestPain, Fbs, RestECG, ExAng, Slope, Thal, AHD"

Qualitative variables describing the ECG

restECG

cross_gender=cross_plot(varqual, input="RestECG", target="AHD")

RestECG is one of the three metrics of ECG present in our data set .The subsequent -RestECG- graph indicate that people with normal resting electrographic result and the one that present a definite left ventricular hypertrophy by the Estes’criteria have the highest rate of patient with AHD. From the plot on the left side(%): This stuck graph show that the chance of getting heart desease if the patient is diagnosed with an Normal resting ECG is higher than the chance of getting heart desease if the patient is diagnosed with a left ventricular hypertrophy (62.6 vs 45.9%). Patients wit abnormality resting ECG are relatively less likely to be diagnosed with heart desease. From the plot onthe right side(count): 147 patient are diagnosed with a normal resting ECG .55 of them have heart desease . 4 patients are diagnosed with an abnormal resting ECG ,3 of them have heart desease .This indicate that an eye need to be kept on the group of patient with an abnormal resting electrographic. 146 patients are diagnosed wit a definite left ventricular hypertrophy, 79nof them have heart desease and the remaining do not have it.

Slope

cross_gender=cross_plot(varqual, input="Slope" , target="AHD")

Slope:The visualization of the slope versus the target shows that patient with flat slope are the most exposed to a positive diagnostic. From the graph on the left(%): Here the likelihood of having heart desease is 65% if the patient has flat slop. the patients with flat slop recorded the highest rate of heart desease followed by the one that are diagnosed with a down slop (57.1%). From the graph on the right side(count): 139 patient are presented with an up slop but 36 of them have heart desease 137 patient are presented with a flat slop but 89 of them have heart desease 21 patient are presented with down slop but 12 of them have heart desease.

Variable qualitaive other than ECG metrics

Sex

cross_gender=cross_plot(varqual, input="Sex", target="AHD")

**Sex: We analize the relationship between the categorical features and the target feature.We found out that the likelihood of having heart disease is different given the female/male groups. A positive diagnosis occure more in the male group than female group. From the plot on the left side (%): The likelihood of having heart disease for males is 55.7%, while for females is: 26%. The heart disease rate for males doubles the rate for females (55.7 vs. 26, respectively). From the plot on the right side (count): There is a total of 96 females: 25 of them have heart disease (25/96=26%, which is the ratio of the left side plot). the remaining 71 do not have heart disease (74%) There is a total of 201 males: 112 of them have heart disease (55%) the remaining 89 do not have heart disease (44.3%)

ChestPain

cross_gender=cross_plot(varqual, input="ChestPain", target="AHD")

**Chestpain: The relation between the typical chestpain and target attribute is shown in these graphs. From the plot on the left side (%): The likelihood of having heart desease is higher in the group of patient that present an asymptomatic(72.5%) chest pain and lower in the groupe of patient with non typical chest pain (18.4%).The likelihood of having heart desease given the patient has asymptomatic chest pain is more than 2 times more likely if the patient has typical chest pain(72.5 vs30.4%), more than 3 time more if the patient has non anginal chest pain(72.5 vs 21.7%) and,more than 4 times more likely if the patients has non typical chest pain(72.5 vs 18.4). From the plot on the right side (count): There is a total of: -142 patient with asymptomatic chest pain,103 of them have heart desease -83 patient with non anginal chest pain,65 of them do not have heart desease -49 patients with non typical chest pain, 40 of them do not have heart desease -23 patients with pytical chest apin, 16 of them do not have heart desease People admited at the hospital with a non typical chest pain are less likely to be diagnosed with an heart deseas

Fbs

cross_gender=cross_plot(varqual, input="Fbs", target="AHD")

**Fbs:Patients with a fasting blood sugar greater than 120 mg/dl are not that much exposed to AHD compare to the ones with a fasting blood sugar less than 120 mg/dl From the plot on the left side(%): The likelihood of having heart desease is 46.1% if the patient has no fasting blood sugar.We noticed that the chance of having heart desease given the patient has blood sugar or non blood sugar while fasting is quite equal(46.5 vs 46.1%) From the plot on the right side (count): The total number of patient with no fasting blood sugar is 254 which is 85.52% of the total patient tested with a fasting blood sugar.117 of 254 of the patients tested with no blood sugar while fasting have heart desease. the remaining 137 do not have heart deseae.

ExAng

cross_gender=cross_plot(varqual, input="ExAng", target="AHD")

**Exang exercise induced angina From the graph on the left side(%): the graph shows that people with exang are presented with 76.3% of chance of getting heart desease where as patients with noexang are less exposed with 31.5%. From the plot on the right side(count): 63 of the 200 patients with no exercice induced angina are diagnosed with heart desease. In term of absolut value there is more patient in group exang that have heart desease than the patients in the group no exang(74 vs 63)

Thal

cross_gender=cross_plot(varqual, input="Thal" , target="AHD")

**Thal The patient with a reversable Thal have the most case of AHD. From grapph of %: 76.5% of Patients in the reversable group are positively diagnosed with heart desease.When a patient has fix Thal she/he is 66.7% likely to get heart desease.The likelihood of getting heart desease given the patient is in the group normal is 22.6%. From the graph on the right side(count): 88 of 115 patients with reversable Thal have heart desease. 164 patients have normal Thal . We noticed that 37 of them are positively diagnosed with hearth desease and the remaining are not. There is not a lot of people diagnosed with fixed Thal in our dataset, nevertheless 12 of the 18 patients that fixed Thal have heart desease.

Description

describe(varquant)

## varquant 
## 
##  6  Variables      297  Observations
## ---------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      297        0       41    0.999    54.54    10.31       40       42 
##      .25      .50      .75      .90      .95 
##       48       56       61       66       68 
## 
## lowest : 29 34 35 37 38, highest: 70 71 74 76 77
## ---------------------------------------------------------------------------
## RestBP 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      297        0       50    0.994    131.7    19.62    108.0    110.0 
##      .25      .50      .75      .90      .95 
##    120.0    130.0    140.0    152.8    160.8 
## 
## lowest :  94 100 101 102 104, highest: 174 178 180 192 200
## ---------------------------------------------------------------------------
## Chol 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      297        0      152        1    247.4    56.18    175.8    190.4 
##      .25      .50      .75      .90      .95 
##    211.0    243.0    276.0    309.0    327.6 
## 
## lowest : 126 131 141 149 157, highest: 394 407 409 417 564
## ---------------------------------------------------------------------------
## MaxHR 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      297        0       91        1    149.6    25.81    108.0    116.0 
##      .25      .50      .75      .90      .95 
##    133.0    153.0    166.0    177.4    182.0 
## 
## lowest :  71  88  90  95  96, highest: 190 192 194 195 202
## ---------------------------------------------------------------------------
## Oldpeak 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      297        0       40    0.965    1.056    1.235      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.8      1.6      2.8      3.4 
## 
## lowest : 0.0 0.1 0.2 0.3 0.4, highest: 4.0 4.2 4.4 5.6 6.2
## ---------------------------------------------------------------------------
## Ca 
##        n  missing distinct     Info     Mean      Gmd 
##      297        0        4    0.786   0.6768   0.9283 
##                                   
## Value          0     1     2     3
## Frequency    174    65    38    20
## Proportion 0.586 0.219 0.128 0.067
## ---------------------------------------------------------------------------

Quantitative vs AHD

varquant1 <- cbind(varquant,AHD=varqual$AHD)
cross_plot(varquant1, input=names(varquant1), target="AHD")

## Plotting transformed variable 'Age' with 'equal_freq', (too many values). Disable with 'auto_binning=FALSE'

## Plotting transformed variable 'RestBP' with 'equal_freq', (too many values). Disable with 'auto_binning=FALSE'

## Plotting transformed variable 'Chol' with 'equal_freq', (too many values). Disable with 'auto_binning=FALSE'

## Plotting transformed variable 'MaxHR' with 'equal_freq', (too many values). Disable with 'auto_binning=FALSE'

## Plotting transformed variable 'Oldpeak' with 'equal_freq', (too many values). Disable with 'auto_binning=FALSE'

**Oldpeak graph shows clearly how: the likelihood of having heart disease increases as oldpeak increases as well.This gives an order to the data.It look like it is following a patter . A close look the left side graph shows that oldpeak [1.3,1.6) was supposed to be higher than [0.9,1.3).This could denote some noise in our data. The likehood of having heart desease if the patient oldpeak is in the range [2.9,6.2] is the highest one with 88.5 %

**Age : The age’s histogram shows that patient in the age range of[57-64] are most likely the one that are diagnosed with AHD.The chance of having heart desease if the patien belong to the age’s range [57-64] is followed: Age in [57-59) is 62.9% [59-61) is 69.2% [61-64) is 71.4% this latter is the highest rate of getting heart desease in the age group

**RestBP: 37.8% of the patients with a ideal blood pressure around 120 have positive diagnosis of heart desease.The likelihood of having a heart desease given the patient has resting blood pressure in the range [142,146) is the highest one(70%).Patient with resting blood sugar that highly exposed are the following: 1- patients with [122-128) 2- patients with [132-135) 3- patients with [142-146) 4- patients with [155-200)

**MaxHR:Patient with maximal achieved heart rate between 140-150 are exposed to a positive diagnosis of desease. At first glance, max heart rate shows a negative and linear relationship. However, there are some buckets which add noise to the relationship. For example, the bucket (141, 147] has a higher heart disease rate than the previous bucket, and it was expected to have a lower. This could be noise in data

**Chol :Patient with chol in [126-192) , [270-289) and [311-564) are 53.3%, 65.5% and 53,6% respectively exposed to a positive diagnosis of heart desease.

**Ca: These plots show that when the number of major vessels colored by fluoroscopy increase the likelihood of having heart desease increase too.the patients with 3 major vessels colored by fluoroscopy have the highest rate of having heart desease(85%).A close look at the graph on the right side denote that the group of patient that do not have major vessels colored by fluoroscopy has more patient with heart desease(45 cases vs 17)

Graphical Analysis for the quantitative variables

Correlation between the continuous variables

ggpairs(varquant1, aes(color=AHD, alpha=0.75), lower=list(continuous="smooth"))+ theme_bw()+
labs(title="quantitative variable")+
theme(plot.title=element_text(face='bold',color='black',hjust=0.5,size=12))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The output indicate that there is no major correlation between the continous variables.The distribution of the RestBP ,Chol ,Age ,looks normal,MaxHR. No relationship is noted between the variables

Prepare Training And Test Data

To simulate a train and test set we are going to split randomly this data set into 80% train and 20% test.We select certain index number and then use the selected index numbers to divide the dataset into training and testing dataset.

set.seed(100) # set seed to replicate results
trainingIndex <- sample(1:nrow(heart), 0.8*nrow(heart)) # indices for 80% training data
trainingData <- heart[trainingIndex, ] # training data
testData <- heart[-trainingIndex, ] # test data

LOGISTIC REGRESSION

specifying and fitting the logistic regresion

Given what we’ve learned about our heart train dataset, we begin our exploration by creating a logistic regression model using all predictor variables

#specify and fit the logistic regresion
heart_model <- {AHD ~ .}
heart_fit <- glm(heart_model,family = binomial,data=trainingData)
#print(summary(heart_fit))
stargazer(heart_fit,type="text")

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 AHD            
## -----------------------------------------------
## Sex1                         1.669***          
##                               (0.601)          
##                                                
## ChestPainnonanginal          -1.962***         
##                               (0.563)          
##                                                
## ChestPainnontypical           -0.738           
##                               (0.606)          
##                                                
## ChestPaintypical             -2.573***         
##                               (0.818)          
##                                                
## Fbs1                          -0.345           
##                               (0.648)          
##                                                
## RestECG1                       0.658           
##                               (2.997)          
##                                                
## RestECG2                       0.454           
##                               (0.420)          
##                                                
## ExAng1                         0.690           
##                               (0.483)          
##                                                
## Slope2                        1.335**          
##                               (0.527)          
##                                                
## Slope3                         0.802           
##                               (1.182)          
##                                                
## Thalnormal                     0.031           
##                               (0.917)          
##                                                
## Thalreversable                 1.421           
##                               (0.915)          
##                                                
## Age                           -0.018           
##                               (0.028)          
##                                                
## RestBP                        0.028**          
##                               (0.013)          
##                                                
## Chol                           0.005           
##                               (0.004)          
##                                                
## MaxHR                         -0.017           
##                               (0.012)          
##                                                
## Oldpeak                        0.218           
##                               (0.252)          
##                                                
## Ca                           1.452***          
##                               (0.320)          
##                                                
## Constant                      -4.712           
##                               (3.144)          
##                                                
## -----------------------------------------------
## Observations                    237            
## Log Likelihood                -78.851          
## Akaike Inf. Crit.             195.701          
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Based on the above summary, there were multiple independent variables that appeared to be significant in building our logistic regression model. To fine tune this model we use the step function to find the best combination of independent variables..

Using the stepwise procedure to find the best regression

#Using the stepwise procedure to find the best regression
null <- glm(AHD~1,family = binomial,data= trainingData)
full <- glm(AHD~.,family = binomial,data = trainingData)
best_fit <- step(null,scope = list(upper=full),data=trainingData,direction = "both")

## Start:  AIC=329.03
## AHD ~ 1
## 
##             Df Deviance    AIC
## + Thal       2   258.66 264.66
## + ChestPain  3   266.32 274.32
## + Ca         1   274.86 278.86
## + MaxHR      1   284.90 288.90
## + Oldpeak    1   285.05 289.05
## + ExAng      1   288.04 292.04
## + Slope      2   295.64 301.64
## + Sex        1   310.95 314.95
## + Age        1   313.12 317.12
## + RestBP     1   320.57 324.57
## + RestECG    2   322.03 328.03
## + Chol       1   324.90 328.90
## <none>           327.03 329.03
## + Fbs        1   327.03 331.03
## 
## Step:  AIC=264.66
## AHD ~ Thal
## 
##             Df Deviance    AIC
## + Ca         1   219.38 227.38
## + ChestPain  3   223.87 235.87
## + MaxHR      1   235.12 243.12
## + ExAng      1   241.23 249.23
## + Oldpeak    1   241.53 249.53
## + Slope      2   245.59 255.59
## + Age        1   249.54 257.54
## + RestECG    2   252.99 262.99
## + RestBP     1   256.55 264.55
## <none>           258.66 264.66
## + Chol       1   256.74 264.74
## + Sex        1   257.61 265.61
## + Fbs        1   258.59 266.59
## - Thal       2   327.03 329.03
## 
## Step:  AIC=227.38
## AHD ~ Thal + Ca
## 
##             Df Deviance    AIC
## + ChestPain  3   191.37 205.37
## + ExAng      1   202.13 212.13
## + MaxHR      1   204.03 214.03
## + Slope      2   207.05 219.05
## + Oldpeak    1   209.91 219.91
## + RestBP     1   217.30 227.30
## + Sex        1   217.36 227.36
## <none>           219.38 227.38
## + RestECG    2   215.42 227.42
## + Fbs        1   217.78 227.78
## + Chol       1   218.76 228.76
## + Age        1   218.78 228.78
## - Ca         1   258.66 264.66
## - Thal       2   274.86 278.86
## 
## Step:  AIC=205.37
## AHD ~ Thal + Ca + ChestPain
## 
##             Df Deviance    AIC
## + Slope      2   179.23 197.23
## + Oldpeak    1   183.75 199.75
## + ExAng      1   185.26 201.26
## + MaxHR      1   185.29 201.29
## + RestBP     1   188.11 204.11
## + Sex        1   188.50 204.50
## <none>           191.37 205.37
## + RestECG    2   188.11 206.11
## + Chol       1   190.47 206.47
## + Age        1   190.80 206.80
## + Fbs        1   191.37 207.37
## - ChestPain  3   219.38 227.38
## - Ca         1   223.87 235.87
## - Thal       2   228.41 238.41
## 
## Step:  AIC=197.23
## AHD ~ Thal + Ca + ChestPain + Slope
## 
##             Df Deviance    AIC
## + Sex        1   172.28 192.28
## + RestBP     1   175.50 195.50
## + ExAng      1   175.59 195.59
## + Oldpeak    1   176.88 196.88
## <none>           179.23 197.23
## + MaxHR      1   177.37 197.37
## + RestECG    2   176.74 198.74
## + Chol       1   178.78 198.78
## + Age        1   179.21 199.21
## + Fbs        1   179.23 199.23
## - Slope      2   191.37 205.37
## - ChestPain  3   207.05 219.05
## - Thal       2   208.24 222.24
## - Ca         1   212.75 228.75
## 
## Step:  AIC=192.27
## AHD ~ Thal + Ca + ChestPain + Slope + Sex
## 
##             Df Deviance    AIC
## + RestBP     1   166.78 188.78
## + ExAng      1   169.06 191.06
## + MaxHR      1   170.17 192.17
## <none>           172.28 192.28
## + Chol       1   170.31 192.31
## + Oldpeak    1   170.44 192.44
## + Age        1   171.82 193.82
## + RestECG    2   170.16 194.16
## + Fbs        1   172.26 194.26
## - Sex        1   179.23 197.23
## - Thal       2   187.61 203.61
## - Slope      2   188.50 204.50
## - ChestPain  3   202.73 216.73
## - Ca         1   206.03 224.03
## 
## Step:  AIC=188.78
## AHD ~ Thal + Ca + ChestPain + Slope + Sex + RestBP
## 
##             Df Deviance    AIC
## + ExAng      1   164.04 188.04
## + MaxHR      1   164.64 188.64
## <none>           166.78 188.78
## + Oldpeak    1   165.15 189.15
## + Chol       1   165.22 189.22
## + Fbs        1   166.51 190.51
## + Age        1   166.78 190.78
## + RestECG    2   165.26 191.26
## - RestBP     1   172.28 192.28
## - Sex        1   175.50 195.50
## - Thal       2   180.10 198.10
## - Slope      2   184.30 202.30
## - ChestPain  3   200.53 216.53
## - Ca         1   201.87 221.87
## 
## Step:  AIC=188.04
## AHD ~ Thal + Ca + ChestPain + Slope + Sex + RestBP + ExAng
## 
##             Df Deviance    AIC
## <none>           164.04 188.04
## + Chol       1   162.50 188.50
## + MaxHR      1   162.53 188.53
## + Oldpeak    1   162.70 188.70
## - ExAng      1   166.78 188.78
## + Fbs        1   163.59 189.59
## + Age        1   164.04 190.04
## + RestECG    2   162.47 190.47
## - RestBP     1   169.06 191.06
## - Sex        1   172.28 194.28
## - Thal       2   175.05 195.05
## - Slope      2   178.10 198.10
## - ChestPain  3   187.43 205.43
## - Ca         1   199.66 221.66

Using the step function we’ve narrowed our predictor variables that optimized our logistic regression model and provided the smallest AIC:185.98. The resulting predictors included 8 variables: ChestPain,Thal,Ca,Oldpeak,RestECG,Sex,MaxHR, and RestBP.Note that the overall . The difference between the null deviance and the residual is 181.88 which indicate us that any addition of predictors in our model will decrease the deviance by 181 points on 11 degree of freedom. We can say that the residual was reduced with a loss of 11 degree of freedom.We also can say that our model is not over dispersed based on the ratio residual deviance over residual degree freedom which is 0.66 (less than 1.5)

logit2prob <- function(logit){
  odds <- exp(logit)
  prob <- odds / (1 + odds)
  return(prob)
}
logit2prob(coef(best_fit))

##         (Intercept)          Thalnormal      Thalreversable 
##         0.002135718         0.469009143         0.790738341 
##                  Ca ChestPainnonanginal ChestPainnontypical 
##         0.809455099         0.106838753         0.272868731 
##    ChestPaintypical              Slope2              Slope3 
##         0.069041917         0.845554963         0.787819517 
##                Sex1              RestBP              ExAng1 
##         0.819246635         0.506464913         0.685597205

prob <- data.frame(Odds = coef(best_fit)[-1],
                 Probability = logit2prob(coef(best_fit))[-1])
ggplot(prob) +
  aes(x = Odds, y = Probability) +
  geom_point(size = 2, alpha = .3) +
  labs(x = "logit", y = "probability of success")

We noticed from this output that, whenever the logit is negative, the associated probability is below 50%.Hence the probability of having AHD given the patient has heart desease is less tha 50%.More precisely in the group of patient that have chest pain, having a chance of getting heart desease is ~10% for the no anginal type,~27% for the no typical pain and ~6.9% for the one feeling typical pain. Patient with a flat slop are the most expose with 84.55% chance of having heart desease.Patients with a down slop are also exposed(78.78%).Males also happen to be at risk with a probability of 81.92%.

Diagnostic Plot

Standardized residual

#Plot the standardized residual and regression dianostic
residualPlots(best_fit)

##           Test stat Pr(>|Test stat|)  
## Thal                                  
## Ca           4.3896          0.03616 *
## ChestPain                             
## Slope                                 
## Sex                                   
## RestBP       0.0886          0.76590  
## ExAng                                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The graphs on the right that represent the Pearson residual vs fitted is telling us that all our variables are linear , therefore we don’t need any transformation or corrections.

Marginal plot

##Marginal plot
marginalModelPlots(best_fit)

## Warning in mmps(...): Interactions and/or factors skipped

This is maginalplot which drawn the response variable against each predictor and linear predictors.In these graphs fitted values are compared to the observed data. We can say that our best model reproduce the data in all case.

Outliers

##outlier
outlierTest(best_fit)

## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##     rstudent unadjusted p-value Bonferonni p
## 93 -2.967363          0.0030037      0.71187

The output show that there are no significant outliers as noticed by the Bonferonni P-value. . Looking at the Bonferonni we see that all its P-values are close to 1 . Observations 17,184 and 200 have the largest value of cook distance which indicates that they are possible influential observations. The plot studentized vs hat values confirm that observation 146 has the largest studentized residual but is not that influential. Observation 17 and 184 confirm their influence in the data set. We will try to remove this influential data point in our dataset to find out about their real influence.

levrage

#levrage
influenceIndexPlot(best_fit,id.n=4)

## Warning in plot.window(...): "id.n" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter

## Warning in box(...): "id.n" is not a graphical parameter

## Warning in title(...): "id.n" is not a graphical parameter

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter

## Warning in plot.window(...): "id.n" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter

## Warning in box(...): "id.n" is not a graphical parameter

## Warning in title(...): "id.n" is not a graphical parameter

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter

## Warning in plot.window(...): "id.n" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter

## Warning in box(...): "id.n" is not a graphical parameter

## Warning in title(...): "id.n" is not a graphical parameter

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter

## Warning in plot.window(...): "id.n" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter

## Warning in box(...): "id.n" is not a graphical parameter

## Warning in title(...): "id.n" is not a graphical parameter

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter

The diagnostic plot shows us the observations that are farthest from the average.Observations 93 and 146 are most likely to be outliers. Looking at the Bonferonni we see that all its P-values are close to 1 except the 146th observation.

Influence point

#Influence point
influencePlot(best_fit,col="blue",id.n=4)

## Warning in plot.window(...): "id.n" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter

## Warning in box(...): "id.n" is not a graphical parameter

## Warning in title(...): "id.n" is not a graphical parameter

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter

##       StudRes         Hat      CookD
## 93  -2.967363 0.015287688 0.06974475
## 17   1.386117 0.260503740 0.04519689
## 200  2.669264 0.023336067 0.05119951
## 184 -1.336433 0.280273872 0.04542461
## 146  2.944418 0.007576484 0.03830143

This output shows that observation 184 has the largest values in cook distance ,observation 146 has the largest value in studentized residuals and observation 200 has the largest hat value.We will try to remove these data point in our dataset to find out about their real influence.

now let examine the change when we remove the influence point 17 and 184

#now let examine the change when we remove the influence point 17 and 184
best_fit.update <- update(best_fit,subset=c(-146,-184,-200))
#compareCoefs(best_fit,best_fit.update)
stargazer(best_fit,best_fit.update,type="text")

## 
## ================================================
##                         Dependent variable:     
##                     ----------------------------
##                                 AHD             
##                          (1)            (2)     
## ------------------------------------------------
## Thalnormal              -0.124        -0.138    
##                        (0.867)        (0.866)   
##                                                 
## Thalreversable          1.329          1.321    
##                        (0.860)        (0.859)   
##                                                 
## Ca                     1.446***      1.435***   
##                        (0.289)        (0.291)   
##                                                 
## ChestPainnonanginal   -2.123***      -2.114***  
##                        (0.531)        (0.531)   
##                                                 
## ChestPainnontypical    -0.980*        -0.977*   
##                        (0.587)        (0.587)   
##                                                 
## ChestPaintypical      -2.602***      -2.590***  
##                        (0.791)        (0.791)   
##                                                 
## Slope2                 1.700***      1.692***   
##                        (0.480)        (0.480)   
##                                                 
## Slope3                  1.312          1.305    
##                        (0.926)        (0.924)   
##                                                 
## Sex1                   1.511***      1.498***   
##                        (0.549)        (0.550)   
##                                                 
## RestBP                 0.026**        0.026**   
##                        (0.012)        (0.012)   
##                                                 
## ExAng1                  0.780*        0.770*    
##                        (0.466)        (0.467)   
##                                                 
## Constant              -6.147***      -6.126***  
##                        (2.034)        (2.033)   
##                                                 
## ------------------------------------------------
## Observations             237            234     
## Log Likelihood         -82.019        -81.949   
## Akaike Inf. Crit.      188.037        187.898   
## ================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01

We can see from this output that the coefficients have not changed that much. we can also conclude that observations are not influential.

Logistic model prediction and performaance

Predicting the probability of having a heart desease

#Predicting the probability of having a heart desease
Prob_AHD_train <- predict.glm(best_fit,newdata= trainingData,type = "response")
Prob_AHD <- predict.glm(best_fit,newdata= testData,type = "response")
print(anova(best_fit,test = "Chisq"))

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: AHD
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                        236     327.03              
## Thal       2   68.371       234     258.66 1.424e-15 ***
## Ca         1   39.276       233     219.38 3.679e-10 ***
## ChestPain  3   28.009       230     191.37 3.616e-06 ***
## Slope      2   12.136       228     179.23  0.002316 ** 
## Sex        1    6.960       227     172.27  0.008337 ** 
## RestBP     1    5.493       226     166.78  0.019093 *  
## ExAng      1    2.745       225     164.04  0.097584 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

densityplot(~Prob_AHD|AHD,data = testData,layout=c(1,2),aspect = 1,col="blue",plot.point="rug")

#install.packages("rpivotTable") 
#library(rpivotTable)
#Predicting having a heart disease or not using 0.5 cut off
Predict_disease <- ifelse((Prob_AHD>0.5),1,0)
Predict_disease <-factor(Predict_disease,levels=c(0,1),labels=c("No Disease","Disease"))
#rpivotTable(data = testData, rows = "Predict_disease", cols="AHD", aggregatorName = "Count", rendererName = #"Table", width="100%", height="400px")

Confusion matrix and accuracy

#Checking over all prediction
confusion_matrix <- table(testData$AHD,Predict_disease)
print(confusion_matrix)

##      Predict_disease
##       No Disease Disease
##   No          29       3
##   Yes          4      24

#Checking over all accuracy of the prediction
predictive_accuracy <- (confusion_matrix[1,1]+confusion_matrix[2,2])/sum(confusion_matrix)
print(predictive_accuracy)

## [1] 0.8833333

Based on the Confusion matrix, our overall accuracy of our logistic regression model is 88.7%, states that 88% of the time our prediction of AHD is accurate; which is very good.

ROC for logistic

prediction_test <- prediction(Prob_AHD,testData$AHD)
prediction_train <- prediction(Prob_AHD_train,trainingData$AHD)
test_AUC <- as.numeric(performance(prediction_test,"auc")@y.values)
train_AUC <- as.numeric(performance(prediction_train,"auc")@y.values)
#plotting the ROC
testh_roc <- performance(prediction_test,"tpr","fpr")
trainh_roc <- performance(prediction_train,"tpr","fpr")

Plotting the ROC curve

plot_roc <- function(train_roc,train_auc,test_roc,test_auc){
  plot(train_roc,col="blue",lty="solid",main="",lwd=2,xlab="False Positive Rate",
       ylab="True Positive Rate")
  plot(test_roc,col="red",lty="dashed",lwd=2,add=TRUE)
  abline(c(0,1))  
  #legend
  train.legend <- paste("Training AUC = ", round(train_auc,digits = 3))
  test.legend <- paste("Test AUC = ", round(test_auc,digits = 3))
  legend("bottomright",legend = c(train.legend,test.legend),
         lty=c("solid","dashed"),lwd=2,col=c("blue","red"))
}
plot_roc(trainh_roc,train_AUC,testh_roc,test_AUC)

The ROC plot indicates that we are correctly labeling ore than 75% of AHD with a threshold of 0.5 with a very low false positive rate which is desireable. An AUC value indicates that our model can differentiate between patients with AHD and those without it.We notice that AUC is closer to 1 than 0.5 which is proof that our model has good predictive ability.

SUPPORT VECTOR MACHINE MODEL

Determine the tuning parametre prior to fitting model to train set fit the svm to the train set

#fit the svm to the train set
require("kernlab")

## Loading required package: kernlab

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:psych':
## 
##     alpha

## The following object is masked from 'package:ggplot2':
## 
##     alpha

train_svm_fit  <- ksvm( AHD ~ ., data=trainingData, type='C-svc', kernel='vanilladot', C=100, scale=c() )

##  Setting default kernel parameters

svm_predict_train <- predict(train_svm_fit,trainingData)
svm_train_prediction <- prediction(as.numeric(svm_predict_train), trainingData$AHD)
svm_train_AUC <- as.numeric(performance(svm_train_prediction,"auc")@y.values)

Evaluation on the test data

#we evaluation on the test data
svm_predict_test <- predict(train_svm_fit,testData)
svm_test_prediction <- prediction(as.numeric(svm_predict_test),testData$AHD)
svm_test_AUC <- as.numeric(performance(svm_test_prediction ,"auc")@y.values)
svm_test_AUC

## [1] 0.8973214

Confusion matrix and accuracy

#Checking over all prediction
confusion_matrix <- table(svm_predict_test,testData$AHD)
print(confusion_matrix)

##                 
## svm_predict_test No Yes
##              No  30   4
##              Yes  2  24

#Checking over all accuracy of the prediction
predictive_accuracy <- (confusion_matrix[1,1]+confusion_matrix[2,2])/sum(confusion_matrix)
print(predictive_accuracy)

## [1] 0.9

Based on the Confusion matrix, our overall accuracy of our SVM model is 90%, this states that 90% of the time our prediction of AHD is accurate; which is really good, even better than our logistic regression model in part 1 . Let’s look at our ROC curve

#ROC for svm
trains_vm_roc <- performance(svm_train_prediction,"tpr","fpr")  
test_svm_roc <- performance(svm_test_prediction,"tpr","fpr")
plot_roc(trains_vm_roc,svm_train_AUC,test_svm_roc,svm_test_AUC)

The ROC curve provides more evidences of a svm model being able to perform well. This curve indicates that we are labeling 90% of our data correctly with a less than 10% false positive rate. An AUC value indicates that our model can differentiate between patients with AHD and those without it.We notice that AUC is closer to 1 than 0.5 which is proof that our model has good predictive ability just as in part 1 for the logistic regression model.

Heart Disease Prediction Using Machine Learning and Data Analytic

Mamadou Michel Diakhate

Assigned Nov. 30, 2016

Graphical Analysis for the categorical variables

Qualitative variables describing the ECG

restECG

Slope

Variable qualitaive other than ECG metrics

Sex

ChestPain

Fbs

ExAng

Thal

Description

Quantitative vs AHD

Graphical Analysis for the quantitative variables

Correlation between the continuous variables

Prepare Training And Test Data

LOGISTIC REGRESSION

specifying and fitting the logistic regresion

Using the stepwise procedure to find the best regression

Diagnostic Plot

Standardized residual

Marginal plot

Outliers

levrage

Influence point

now let examine the change when we remove the influence point 17 and 184

Logistic model prediction and performaance

Confusion matrix and accuracy

ROC for logistic

Plotting the ROC curve

SUPPORT VECTOR MACHINE MODEL

Evaluation on the test data