Scope

The purpose of this summary is to provide insight into the key factors contributing to heart disease. The approach used to identify the most significant prediction model relies on the following assumptions:

    - Heart Disease data set will be used for the analysis (http://archive.ics.uci.edu/ml/datasets/heart+disease)
    - Records with missing data will be excluded.
    - Outliers for each factor will be evaluated and excluded only if values are due to data entry error.
    - Given the criticality of this exercise, outliers will not be excluded without a reasonable cause.
    - Both GLM & SVM are used to evaluate models. Variables are not limited to quantitative ones; qualitative (categorical) variables such as ChestPain are included in the models as well. Categorical variables (e.g. ChestPain & Thal) are excluded only from the boxplots and the correlation matrix.

Process | Logistic Regression Model

Both Logistic Regression & SVM Models will be evaluated to identify the optimal option

    - Logistic Regression Model (lgm)
      *  Different models are evaluated with variation of variables
      *  Interaction effect is incorporated
      *  Only the "Binomial" family is evaluated. After identifying the best fit model based on key statistics and the confusion matrix, the link functions "probit" & "cloglog" are compared to the default "logit" in case they offer more significance.
      *  Define a cutoff value to assign "0/1" to the resulting probabilities based on the model prediction range - (Max-Min)/2
      *  Construct a confusion matrix using the cutoff value. Use the result to further assess models for key factors:
         > Accuracy
         > Sensitivity
         > Significance
      *  Use ROC curve to compare side by side the most significant models.
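
The cutoff and confusion-matrix steps above can be sketched as follows. This is a minimal illustration on simulated data, not the heart data set: `toy_data`, `x1`, `x2` and `y` are hypothetical names, and the cutoff simply applies the (Max-Min)/2 rule as it is stated in the list.

```r
# Simulated stand-in data; NOT the heart data set (all names are hypothetical).
set.seed(42)
n <- 200
toy_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy_data$y <- factor(
  ifelse(toy_data$x1 + 0.5 * toy_data$x2 + rnorm(n) > 0, "Yes", "No"),
  levels = c("No", "Yes")
)

# Logistic regression; with a factor response the second level ("Yes") is modelled.
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = toy_data)
probs <- predict(fit, type = "response")  # predicted P(y == "Yes")

# Cutoff as described in the text: (Max - Min) / 2 of the predicted probabilities.
cutoff <- (max(probs) - min(probs)) / 2
pred <- factor(ifelse(probs > cutoff, "Yes", "No"), levels = c("No", "Yes"))

# Confusion matrix: rows = predicted class, columns = actual class.
conf_mat <- table(Predicted = pred, Actual = toy_data$y)

accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)
sensitivity <- conf_mat["Yes", "Yes"] / sum(conf_mat[, "Yes"])
print(conf_mat)
```

Fixing the factor levels on both the predictions and the actual values keeps the table 2 x 2 even if one class is never predicted.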

LGM Models | Preliminary - logit Function Only

Based on the summary below, Model 4 is selected because all of its parameters have the lowest p-values.

glm.fits_final = glm(AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial (link = logit), data = train_data)

LGM Models | Confusion Matrix & ROC Curve Comparison

Comparing the final model under different link functions, “cloglog” yields the most significant model.

Conclusion: By comparing summary results, the lgm model with link “cloglog” is more significant and is selected as the best fit.
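
A link-function comparison of this kind can be sketched as below. The report does not state which statistic decided the comparison, so AIC and residual deviance are used here as an assumption; the data are simulated and `toy_data`, `x` and `y` are hypothetical names.

```r
# Simulated stand-in data; NOT the heart data set.
set.seed(7)
n <- 300
toy_data <- data.frame(x = rnorm(n))
toy_data$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * toy_data$x))

# Fit the same binomial model under each candidate link function.
links <- c("logit", "probit", "cloglog")
fits <- lapply(links, function(lnk) {
  glm(y ~ x, family = binomial(link = lnk), data = toy_data)
})
names(fits) <- links

# Collect comparison statistics (lower AIC / deviance indicates a better fit).
comparison <- data.frame(
  link = links,
  AIC = sapply(fits, AIC),
  residual_deviance = sapply(fits, deviance),
  row.names = NULL
)
print(comparison[order(comparison$AIC), ])
```

Because the three fits share the same response and predictors, their AIC values are directly comparable.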

Process | Support Vector Machine (SVM)

Both Logistic Regression & SVM Models will be evaluated to identify the optimal option

    - Support Vector Machine (SVM)
      *  Preliminary model included key variables based on GLM model results:
         > Sex
         > ChestPain
         > Ca
         > RestBP
      *  Confusion matrix was evaluated and the results weren't promising; therefore, a brute-force approach is used by including all variables in the SVM model
      *  A more robust process to split the data is used by applying a data partitioning function.
      *  Confusion matrix is evaluated as well as different cost levels.
      *  The following methods are compared for the model:
         > svmLinear
         > pls
         > rda
      *  The more significant models are compared against each other. Best fit model is selected based on resampling results.
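
The partitioning and cost-level steps above might look like the sketch below. It assumes the caret package (the method names svmLinear, pls and rda match caret's naming, but the report does not show its SVM code), uses simulated data with hypothetical names, and the 70/30 split and cost grid are illustrative choices only.

```r
library(caret)  # method "svmLinear" also needs the kernlab package installed

# Simulated stand-in data; NOT the heart data set.
set.seed(123)
n <- 200
toy_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy_data$y <- factor(
  ifelse(toy_data$x1 - toy_data$x2 + rnorm(n) > 0, "Yes", "No"),
  levels = c("No", "Yes")
)

# Stratified split that preserves the class balance (70% is an assumption).
train_idx <- createDataPartition(toy_data$y, p = 0.7, list = FALSE)
train_set <- toy_data[train_idx, ]
test_set  <- toy_data[-train_idx, ]

# 5-fold cross-validation over a small grid of cost values.
ctrl <- trainControl(method = "cv", number = 5)
cost_grid <- expand.grid(C = c(0.01, 0.1, 1, 10))
svm_fit <- train(y ~ ., data = train_set, method = "svmLinear",
                 trControl = ctrl, tuneGrid = cost_grid,
                 preProcess = c("center", "scale"))

print(svm_fit$bestTune)  # cost value chosen by cross-validation
print(confusionMatrix(predict(svm_fit, newdata = test_set),
                      test_set$y, positive = "Yes"))
```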
      

Support Vector Machine (SVM) | Method - “svmLinear”

After training the model, the confusion matrix is evaluated, suggesting satisfactory results compared to the lgm model. The only caution is that all variables are included, which might cause overfitting and render the model unstable.

Different cost levels are evaluated as well to ensure the optimal value is incorporated in the model. The final value is set at 0.1.

Support Vector Machine (SVM) | Comparing different Methods

Both “pls” & “rda” methods are evaluated as well and compared against “svmLinear” in case of additional valuable insight before making final selection.

Conclusion:

    - Model set with svmLinear method & cost = .01 is the one selected.
    - No added value when applying “pls” or “rda”.
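
A resampling comparison across the three methods can be sketched with caret's `resamples()`. This is an illustration under assumptions: simulated data with hypothetical names, 5-fold cross-validation, and default tuning grids, none of which are taken from the report.

```r
library(caret)  # also needs kernlab (svmLinear), pls (pls) and klaR (rda)

# Simulated stand-in data; NOT the heart data set.
set.seed(1)
n <- 200
toy_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy_data$y <- factor(
  ifelse(toy_data$x1 - toy_data$x2 + rnorm(n) > 0, "Yes", "No"),
  levels = c("No", "Yes")
)

# Shared folds so that every method is resampled on identical partitions.
folds <- createFolds(toy_data$y, k = 5, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", number = 5, index = folds)

methods <- c("svmLinear", "pls", "rda")
fits <- lapply(methods, function(m) {
  train(y ~ ., data = toy_data, method = m, trControl = ctrl)
})
names(fits) <- methods

# Side-by-side accuracy and kappa distributions across the shared folds.
resampled <- resamples(fits)
print(summary(resampled))
```

Passing the same fold indices to every `train()` call is what makes the per-fold metrics comparable between methods.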

Confusion Matrix : pls vs. rda

Method : pls

Method : rda

ROC : pls vs. rda

Resampling Comparison : pls vs. rda

Process | Data review & initial splitting into training & test set

setwd("~/Introduction to Data Science/R Projects/data")
heart_data <- read.table(file = "heart.txt", sep = ",", header = TRUE)
View(heart_data)
dim(heart_data)
## [1] 303  15
str(heart_data)
## 'data.frame':    303 obs. of  15 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age      : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ Sex      : int  1 1 1 1 0 1 0 0 1 1 ...
##  $ ChestPain: Factor w/ 4 levels "asymptomatic",..: 4 1 1 2 3 3 1 1 1 1 ...
##  $ RestBP   : int  145 160 120 130 130 120 140 120 130 140 ...
##  $ Chol     : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ Fbs      : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ RestECG  : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ MaxHR    : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ ExAng    : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ Oldpeak  : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ Slope    : int  3 2 2 3 1 1 3 1 2 3 ...
##  $ Ca       : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ Thal     : Factor w/ 3 levels "fixed","normal",..: 1 2 3 2 2 2 2 2 3 3 ...
##  $ AHD      : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 2 1 2 2 ...
#Run summary to better understand the variables and identify NAs#
summary(heart_data)
##        X              Age             Sex                ChestPain  
##  Min.   :  1.0   Min.   :29.00   Min.   :0.0000   asymptomatic:144  
##  1st Qu.: 76.5   1st Qu.:48.00   1st Qu.:0.0000   nonanginal  : 86  
##  Median :152.0   Median :56.00   Median :1.0000   nontypical  : 50  
##  Mean   :152.0   Mean   :54.44   Mean   :0.6799   typical     : 23  
##  3rd Qu.:227.5   3rd Qu.:61.00   3rd Qu.:1.0000                     
##  Max.   :303.0   Max.   :77.00   Max.   :1.0000                     
##                                                                     
##      RestBP           Chol            Fbs            RestECG      
##  Min.   : 94.0   Min.   :126.0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:120.0   1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :130.0   Median :241.0   Median :0.0000   Median :1.0000  
##  Mean   :131.7   Mean   :246.7   Mean   :0.1485   Mean   :0.9901  
##  3rd Qu.:140.0   3rd Qu.:275.0   3rd Qu.:0.0000   3rd Qu.:2.0000  
##  Max.   :200.0   Max.   :564.0   Max.   :1.0000   Max.   :2.0000  
##                                                                   
##      MaxHR           ExAng           Oldpeak         Slope      
##  Min.   : 71.0   Min.   :0.0000   Min.   :0.00   Min.   :1.000  
##  1st Qu.:133.5   1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000  
##  Median :153.0   Median :0.0000   Median :0.80   Median :2.000  
##  Mean   :149.6   Mean   :0.3267   Mean   :1.04   Mean   :1.601  
##  3rd Qu.:166.0   3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000  
##  Max.   :202.0   Max.   :1.0000   Max.   :6.20   Max.   :3.000  
##                                                                 
##        Ca                 Thal      AHD     
##  Min.   :0.0000   fixed     : 18   No :164  
##  1st Qu.:0.0000   normal    :166   Yes:139  
##  Median :0.0000   reversable:117            
##  Mean   :0.6722   NA's      :  2            
##  3rd Qu.:1.0000                             
##  Max.   :3.0000                             
##  NA's   :4
#exclude NAs from "Thal" & "Ca"
heart_data_final <- na.omit(heart_data)
summary(heart_data_final)
##        X              Age             Sex                ChestPain  
##  Min.   :  1.0   Min.   :29.00   Min.   :0.0000   asymptomatic:142  
##  1st Qu.: 75.0   1st Qu.:48.00   1st Qu.:0.0000   nonanginal  : 83  
##  Median :150.0   Median :56.00   Median :1.0000   nontypical  : 49  
##  Mean   :150.7   Mean   :54.54   Mean   :0.6768   typical     : 23  
##  3rd Qu.:226.0   3rd Qu.:61.00   3rd Qu.:1.0000                     
##  Max.   :302.0   Max.   :77.00   Max.   :1.0000                     
##      RestBP           Chol            Fbs            RestECG      
##  Min.   : 94.0   Min.   :126.0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:120.0   1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :130.0   Median :243.0   Median :0.0000   Median :1.0000  
##  Mean   :131.7   Mean   :247.4   Mean   :0.1448   Mean   :0.9966  
##  3rd Qu.:140.0   3rd Qu.:276.0   3rd Qu.:0.0000   3rd Qu.:2.0000  
##  Max.   :200.0   Max.   :564.0   Max.   :1.0000   Max.   :2.0000  
##      MaxHR           ExAng           Oldpeak          Slope      
##  Min.   : 71.0   Min.   :0.0000   Min.   :0.000   Min.   :1.000  
##  1st Qu.:133.0   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000  
##  Median :153.0   Median :0.0000   Median :0.800   Median :2.000  
##  Mean   :149.6   Mean   :0.3266   Mean   :1.056   Mean   :1.603  
##  3rd Qu.:166.0   3rd Qu.:1.0000   3rd Qu.:1.600   3rd Qu.:2.000  
##  Max.   :202.0   Max.   :1.0000   Max.   :6.200   Max.   :3.000  
##        Ca                 Thal      AHD     
##  Min.   :0.0000   fixed     : 18   No :160  
##  1st Qu.:0.0000   normal    :164   Yes:137  
##  Median :0.0000   reversable:115            
##  Mean   :0.6768                             
##  3rd Qu.:1.0000                             
##  Max.   :3.0000
library(ggplot2)
par(mfrow=c(3,3))
boxplot_Age <- boxplot(heart_data$Age, xlab = "Age", outcol="red")
boxplot_RestBP <- boxplot(heart_data$RestBP, xlab = "RestBP", outcol="red")
boxplot_Chol <- boxplot(heart_data$Chol, xlab = "Chol", outcol="red")
boxplot_Fbs <- boxplot(heart_data$Fbs, xlab = "Fbs", outcol="red")
boxplot_RestECG <- boxplot(heart_data$RestECG, xlab = "RestECG", outcol="red")
boxplot_ExAng<- boxplot(heart_data$ExAng, xlab = "ExAng", outcol="red")
boxplot_MaxHR <- boxplot(heart_data$MaxHR, xlab = "MaxHR", outcol="red")
boxplot_Ca <- boxplot(heart_data$Ca, xlab = "Ca", outcol="red")
boxplot_Oldpeak <- boxplot(heart_data$Oldpeak, xlab="Oldpeak", outcol="red")

#boxplot_Thal <- boxplot(heart_data$Thal, xlab= "Thal")#excluded as it is a categorical variable
boxplot_Age$out
## numeric(0)
boxplot_RestBP$out
## [1] 172 180 200 174 178 192 180 178 180
boxplot_Chol$out
## [1] 417 407 564 394 409
boxplot_Fbs$out
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1
boxplot_ExAng$out
## numeric(0)
boxplot_Ca$out
##  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
boxplot_Oldpeak$out
## [1] 6.2 5.6 4.2 4.2 4.4
boxplot_MaxHR$out
## [1] 71
library(ISLR)
str(heart_data_final)
## 'data.frame':    297 obs. of  15 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age      : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ Sex      : int  1 1 1 1 0 1 0 0 1 1 ...
##  $ ChestPain: Factor w/ 4 levels "asymptomatic",..: 4 1 1 2 3 3 1 1 1 1 ...
##  $ RestBP   : int  145 160 120 130 130 120 140 120 130 140 ...
##  $ Chol     : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ Fbs      : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ RestECG  : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ MaxHR    : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ ExAng    : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ Oldpeak  : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ Slope    : int  3 2 2 3 1 1 3 1 2 3 ...
##  $ Ca       : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ Thal     : Factor w/ 3 levels "fixed","normal",..: 1 2 3 2 2 2 2 2 3 3 ...
##  $ AHD      : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 2 1 2 2 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:6] 88 167 193 267 288 303
##   .. ..- attr(*, "names")= chr [1:6] "88" "167" "193" "267" ...
dim(heart_data_final)
## [1] 297  15
pairs(heart_data_final)

#Code updated to exclude categorical variables - cor(heart_data_final)#
cor(heart_data_final [,c(-4,-14,-15)])
##                    X          Age         Sex      RestBP          Chol
## X        1.000000000  0.009262273 -0.08814079 -0.02225682 -8.396768e-02
## Age      0.009262273  1.000000000 -0.09239948  0.29047626  2.026435e-01
## Sex     -0.088140794 -0.092399479  1.00000000 -0.06634020 -1.980891e-01
## RestBP  -0.022256823  0.290476262 -0.06634020  1.00000000  1.315357e-01
## Chol    -0.083967682  0.202643546 -0.19808906  0.13153571  1.000000e+00
## Fbs     -0.051693004  0.132061989  0.03885030  0.18085954  1.270828e-02
## RestECG -0.136735719  0.149916512  0.03389683  0.14924228  1.650460e-01
## MaxHR   -0.117354904 -0.394562881 -0.06049601 -0.04910766 -7.456799e-05
## ExAng   -0.002661773  0.096488805  0.14358125  0.06669107  5.933893e-02
## Oldpeak -0.114655929  0.197122616  0.10656724  0.19124314  3.859579e-02
## Slope   -0.032451869  0.159404737  0.03334496  0.12117205 -9.215240e-03
## Ca       0.048687403  0.362210343  0.09192480  0.09795376  1.159446e-01
##                   Fbs     RestECG         MaxHR         ExAng      Oldpeak
## X       -0.0516930041 -0.13673572 -1.173549e-01 -0.0026617728 -0.114655929
## Age      0.1320619890  0.14991651 -3.945629e-01  0.0964888046  0.197122616
## Sex      0.0388502996  0.03389683 -6.049601e-02  0.1435812504  0.106567243
## RestBP   0.1808595428  0.14924228 -4.910766e-02  0.0666910687  0.191243136
## Chol     0.0127082808  0.16504603 -7.456799e-05  0.0593389323  0.038595794
## Fbs      1.0000000000  0.06883111 -7.842359e-03 -0.0008930821  0.008310667
## RestECG  0.0688311070  1.00000000 -7.228965e-02  0.0818739197  0.113726420
## MaxHR   -0.0078423590 -0.07228965  1.000000e+00 -0.3843675321 -0.347639972
## ExAng   -0.0008930821  0.08187392 -3.843675e-01  1.0000000000  0.289309666
## Oldpeak  0.0083106671  0.11372642 -3.476400e-01  0.2893096659  1.000000000
## Slope    0.0478190123  0.13514058 -3.893067e-01  0.2505715154  0.579037353
## Ca       0.1520858900  0.12902063 -2.687270e-01  0.1482322256  0.294452277
##               Slope          Ca
## X       -0.03245187  0.04868740
## Age      0.15940474  0.36221034
## Sex      0.03334496  0.09192480
## RestBP   0.12117205  0.09795376
## Chol    -0.00921524  0.11594459
## Fbs      0.04781901  0.15208589
## RestECG  0.13514058  0.12902063
## MaxHR   -0.38930674 -0.26872698
## ExAng    0.25057152  0.14823223
## Oldpeak  0.57903735  0.29445228
## Slope    1.00000000  0.10976112
## Ca       0.10976112  1.00000000
attach(heart_data_final)
## The following objects are masked from heart_data_final (pos = 3):
## 
##     Age, AHD, Ca, ChestPain, Chol, ExAng, Fbs, MaxHR, Oldpeak,
##     RestBP, RestECG, Sex, Slope, Thal, X
#confirming no missing value in final data set#
anyNA(heart_data_final)
## [1] FALSE
train_data <-  heart_data_final[1:250,]
test_data <- heart_data_final[251:297,]
head(train_data)
##   X Age Sex    ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope
## 1 1  63   1      typical    145  233   1       2   150     0     2.3     3
## 2 2  67   1 asymptomatic    160  286   0       2   108     1     1.5     2
## 3 3  67   1 asymptomatic    120  229   0       2   129     1     2.6     2
## 4 4  37   1   nonanginal    130  250   0       0   187     0     3.5     3
## 5 5  41   0   nontypical    130  204   0       2   172     0     1.4     1
## 6 6  56   1   nontypical    120  236   0       0   178     0     0.8     1
##   Ca       Thal AHD
## 1  0      fixed  No
## 2  3     normal Yes
## 3  2 reversable Yes
## 4  0     normal  No
## 5  0     normal  No
## 6  0     normal  No
head(test_data)
##       X Age Sex    ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak
## 254 254  51   0   nonanginal    120  295   0       2   157     0     0.6
## 255 255  43   1 asymptomatic    115  303   0       0   181     0     1.2
## 256 256  42   0   nonanginal    120  209   0       0   173     0     0.0
## 257 257  67   0 asymptomatic    106  223   0       0   142     0     0.3
## 258 258  76   0   nonanginal    140  197   0       1   116     0     1.1
## 259 259  70   1   nontypical    156  245   0       2   143     0     0.0
##     Slope Ca   Thal AHD
## 254     1  0 normal  No
## 255     2  0 normal  No
## 256     2  0 normal  No
## 257     1  2 normal  No
## 258     2  0 normal  No
## 259     1  0 normal  No

Process | Data Summary - Entire Data Set

## 'data.frame':    303 obs. of  15 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age      : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ Sex      : int  1 1 1 1 0 1 0 0 1 1 ...
##  $ ChestPain: Factor w/ 4 levels "asymptomatic",..: 4 1 1 2 3 3 1 1 1 1 ...
##  $ RestBP   : int  145 160 120 130 130 120 140 120 130 140 ...
##  $ Chol     : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ Fbs      : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ RestECG  : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ MaxHR    : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ ExAng    : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ Oldpeak  : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ Slope    : int  3 2 2 3 1 1 3 1 2 3 ...
##  $ Ca       : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ Thal     : Factor w/ 3 levels "fixed","normal",..: 1 2 3 2 2 2 2 2 3 3 ...
##  $ AHD      : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 2 1 2 2 ...
##        X              Age             Sex                ChestPain  
##  Min.   :  1.0   Min.   :29.00   Min.   :0.0000   asymptomatic:144  
##  1st Qu.: 76.5   1st Qu.:48.00   1st Qu.:0.0000   nonanginal  : 86  
##  Median :152.0   Median :56.00   Median :1.0000   nontypical  : 50  
##  Mean   :152.0   Mean   :54.44   Mean   :0.6799   typical     : 23  
##  3rd Qu.:227.5   3rd Qu.:61.00   3rd Qu.:1.0000                     
##  Max.   :303.0   Max.   :77.00   Max.   :1.0000                     
##                                                                     
##      RestBP           Chol            Fbs            RestECG      
##  Min.   : 94.0   Min.   :126.0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:120.0   1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :130.0   Median :241.0   Median :0.0000   Median :1.0000  
##  Mean   :131.7   Mean   :246.7   Mean   :0.1485   Mean   :0.9901  
##  3rd Qu.:140.0   3rd Qu.:275.0   3rd Qu.:0.0000   3rd Qu.:2.0000  
##  Max.   :200.0   Max.   :564.0   Max.   :1.0000   Max.   :2.0000  
##                                                                   
##      MaxHR           ExAng           Oldpeak         Slope      
##  Min.   : 71.0   Min.   :0.0000   Min.   :0.00   Min.   :1.000  
##  1st Qu.:133.5   1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000  
##  Median :153.0   Median :0.0000   Median :0.80   Median :2.000  
##  Mean   :149.6   Mean   :0.3267   Mean   :1.04   Mean   :1.601  
##  3rd Qu.:166.0   3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000  
##  Max.   :202.0   Max.   :1.0000   Max.   :6.20   Max.   :3.000  
##                                                                 
##        Ca                 Thal      AHD     
##  Min.   :0.0000   fixed     : 18   No :164  
##  1st Qu.:0.0000   normal    :166   Yes:139  
##  Median :0.0000   reversable:117            
##  Mean   :0.6722   NA's      :  2            
##  3rd Qu.:1.0000                             
##  Max.   :3.0000                             
##  NA's   :4

Process | Data Summary - Final Data Set after excluding NAs

##        X              Age             Sex                ChestPain  
##  Min.   :  1.0   Min.   :29.00   Min.   :0.0000   asymptomatic:142  
##  1st Qu.: 75.0   1st Qu.:48.00   1st Qu.:0.0000   nonanginal  : 83  
##  Median :150.0   Median :56.00   Median :1.0000   nontypical  : 49  
##  Mean   :150.7   Mean   :54.54   Mean   :0.6768   typical     : 23  
##  3rd Qu.:226.0   3rd Qu.:61.00   3rd Qu.:1.0000                     
##  Max.   :302.0   Max.   :77.00   Max.   :1.0000                     
##      RestBP           Chol            Fbs            RestECG      
##  Min.   : 94.0   Min.   :126.0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:120.0   1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :130.0   Median :243.0   Median :0.0000   Median :1.0000  
##  Mean   :131.7   Mean   :247.4   Mean   :0.1448   Mean   :0.9966  
##  3rd Qu.:140.0   3rd Qu.:276.0   3rd Qu.:0.0000   3rd Qu.:2.0000  
##  Max.   :200.0   Max.   :564.0   Max.   :1.0000   Max.   :2.0000  
##      MaxHR           ExAng           Oldpeak          Slope      
##  Min.   : 71.0   Min.   :0.0000   Min.   :0.000   Min.   :1.000  
##  1st Qu.:133.0   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000  
##  Median :153.0   Median :0.0000   Median :0.800   Median :2.000  
##  Mean   :149.6   Mean   :0.3266   Mean   :1.056   Mean   :1.603  
##  3rd Qu.:166.0   3rd Qu.:1.0000   3rd Qu.:1.600   3rd Qu.:2.000  
##  Max.   :202.0   Max.   :1.0000   Max.   :6.200   Max.   :3.000  
##        Ca                 Thal      AHD     
##  Min.   :0.0000   fixed     : 18   No :160  
##  1st Qu.:0.0000   normal    :164   Yes:137  
##  Median :0.0000   reversable:115            
##  Mean   :0.6768                             
##  3rd Qu.:1.0000                             
##  Max.   :3.0000

Boxplots - Quantitative Factors
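
The `$out` values returned by `boxplot()` are the points lying more than 1.5 times the hinge spread (roughly the interquartile range) beyond the box. A small sketch of that rule, using made-up values rather than the heart data:

```r
# Hypothetical values for illustration only (not taken from the heart data).
x <- c(120, 125, 130, 130, 135, 140, 145, 150, 200)

# boxplot.stats() is the helper boxplot() relies on; coef = 1.5 is the default.
stats <- boxplot.stats(x, coef = 1.5)
print(stats$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$out)    # points flagged as outliers

# The same flags computed by hand from the hinges.
hinges <- fivenum(x)[c(2, 4)]
spread <- diff(hinges)
manual_out <- x[x < hinges[1] - 1.5 * spread | x > hinges[2] + 1.5 * spread]
print(manual_out)
```

A flagged point is only a candidate for review; under the assumptions of this analysis it is excluded only if it turns out to be a data entry error.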

Relationship between Variables

Relationship between Variables - Correlation

By evaluating the correlation between variables, we can better identify the key factors that should be included in the model and avoid overfitting or potential collinearity.
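
One way to scan a correlation matrix for collinearity candidates is to list the variable pairs whose absolute correlation exceeds a threshold. The sketch below uses R's built-in `mtcars` data as a stand-in (not the heart data), and the 0.8 threshold is a hypothetical choice.

```r
# mtcars (built into R) is a stand-in here; it is NOT the heart data.
cor_mat <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

threshold <- 0.8  # hypothetical cutoff for "highly correlated"

# Look only at the upper triangle so each pair is reported once.
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)
```

For the heart data matrix in this section, the largest off-diagonal value is about 0.58 (Oldpeak and Slope), so a threshold of this size would flag no pairs.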

#Code updated to exclude categorical variables - cor(heart_data_final)#
cor(heart_data_final [,c(-4,-14,-15)])
##                    X          Age         Sex      RestBP          Chol
## X        1.000000000  0.009262273 -0.08814079 -0.02225682 -8.396768e-02
## Age      0.009262273  1.000000000 -0.09239948  0.29047626  2.026435e-01
## Sex     -0.088140794 -0.092399479  1.00000000 -0.06634020 -1.980891e-01
## RestBP  -0.022256823  0.290476262 -0.06634020  1.00000000  1.315357e-01
## Chol    -0.083967682  0.202643546 -0.19808906  0.13153571  1.000000e+00
## Fbs     -0.051693004  0.132061989  0.03885030  0.18085954  1.270828e-02
## RestECG -0.136735719  0.149916512  0.03389683  0.14924228  1.650460e-01
## MaxHR   -0.117354904 -0.394562881 -0.06049601 -0.04910766 -7.456799e-05
## ExAng   -0.002661773  0.096488805  0.14358125  0.06669107  5.933893e-02
## Oldpeak -0.114655929  0.197122616  0.10656724  0.19124314  3.859579e-02
## Slope   -0.032451869  0.159404737  0.03334496  0.12117205 -9.215240e-03
## Ca       0.048687403  0.362210343  0.09192480  0.09795376  1.159446e-01
##                   Fbs     RestECG         MaxHR         ExAng      Oldpeak
## X       -0.0516930041 -0.13673572 -1.173549e-01 -0.0026617728 -0.114655929
## Age      0.1320619890  0.14991651 -3.945629e-01  0.0964888046  0.197122616
## Sex      0.0388502996  0.03389683 -6.049601e-02  0.1435812504  0.106567243
## RestBP   0.1808595428  0.14924228 -4.910766e-02  0.0666910687  0.191243136
## Chol     0.0127082808  0.16504603 -7.456799e-05  0.0593389323  0.038595794
## Fbs      1.0000000000  0.06883111 -7.842359e-03 -0.0008930821  0.008310667
## RestECG  0.0688311070  1.00000000 -7.228965e-02  0.0818739197  0.113726420
## MaxHR   -0.0078423590 -0.07228965  1.000000e+00 -0.3843675321 -0.347639972
## ExAng   -0.0008930821  0.08187392 -3.843675e-01  1.0000000000  0.289309666
## Oldpeak  0.0083106671  0.11372642 -3.476400e-01  0.2893096659  1.000000000
## Slope    0.0478190123  0.13514058 -3.893067e-01  0.2505715154  0.579037353
## Ca       0.1520858900  0.12902063 -2.687270e-01  0.1482322256  0.294452277
##               Slope          Ca
## X       -0.03245187  0.04868740
## Age      0.15940474  0.36221034
## Sex      0.03334496  0.09192480
## RestBP   0.12117205  0.09795376
## Chol    -0.00921524  0.11594459
## Fbs      0.04781901  0.15208589
## RestECG  0.13514058  0.12902063
## MaxHR   -0.38930674 -0.26872698
## ExAng    0.25057152  0.14823223
## Oldpeak  0.57903735  0.29445228
## Slope    1.00000000  0.10976112
## Ca       0.10976112  1.00000000

Splitting Data to Train & Test the Model - Preliminary

The data set is split as required to train and test the models.

    - As the analysis progresses and the data is better understood, the preliminary split is subject to change.
    - Records with incomplete data are excluded.
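
The preliminary split takes the first 250 rows for training and the remaining 47 for testing. If the row order carries any structure, a random split of the same sizes is one alternative. A minimal sketch on a placeholder data frame (`toy_data` and its columns are hypothetical):

```r
# Placeholder data frame with the same number of rows as the final data set.
set.seed(2024)
toy_data <- data.frame(
  id = 1:297,
  y  = factor(rep(c("No", "Yes"), length.out = 297))
)

# Draw 250 row indices at random for training; the rest form the test set.
n_train <- 250
train_rows <- sample(nrow(toy_data), size = n_train)
train_set <- toy_data[train_rows, ]
test_set  <- toy_data[-train_rows, ]

print(dim(train_set))
print(dim(test_set))
```

Setting the seed makes the split reproducible between runs.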

attach(heart_data_final)
## The following objects are masked from heart_data_final (pos = 3):
## 
##     Age, AHD, Ca, ChestPain, Chol, ExAng, Fbs, MaxHR, Oldpeak,
##     RestBP, RestECG, Sex, Slope, Thal, X
## The following objects are masked from heart_data_final (pos = 4):
## 
##     Age, AHD, Ca, ChestPain, Chol, ExAng, Fbs, MaxHR, Oldpeak,
##     RestBP, RestECG, Sex, Slope, Thal, X
#confirming no missing value in final data set#
anyNA(heart_data_final)
## [1] FALSE
train_data <-  heart_data_final[1:250,]
test_data <- heart_data_final[251:297,]
head(train_data)
##   X Age Sex    ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope
## 1 1  63   1      typical    145  233   1       2   150     0     2.3     3
## 2 2  67   1 asymptomatic    160  286   0       2   108     1     1.5     2
## 3 3  67   1 asymptomatic    120  229   0       2   129     1     2.6     2
## 4 4  37   1   nonanginal    130  250   0       0   187     0     3.5     3
## 5 5  41   0   nontypical    130  204   0       2   172     0     1.4     1
## 6 6  56   1   nontypical    120  236   0       0   178     0     0.8     1
##   Ca       Thal AHD
## 1  0      fixed  No
## 2  3     normal Yes
## 3  2 reversable Yes
## 4  0     normal  No
## 5  0     normal  No
## 6  0     normal  No
head(test_data)
##       X Age Sex    ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak
## 254 254  51   0   nonanginal    120  295   0       2   157     0     0.6
## 255 255  43   1 asymptomatic    115  303   0       0   181     0     1.2
## 256 256  42   0   nonanginal    120  209   0       0   173     0     0.0
## 257 257  67   0 asymptomatic    106  223   0       0   142     0     0.3
## 258 258  76   0   nonanginal    140  197   0       1   116     0     1.1
## 259 259  70   1   nontypical    156  245   0       2   143     0     0.0
##     Slope Ca   Thal AHD
## 254     1  0 normal  No
## 255     2  0 normal  No
## 256     2  0 normal  No
## 257     1  2 normal  No
## 258     2  0 normal  No
## 259     1  0 normal  No

Logistic Regression Model - Evaluation

In this step, we use logistic regression to evaluate different models and ensure the right mix of factors is incorporated. Summary below:

    - Family is set to binomial
    - 5 models are created and evaluated based on coefficients, p-values, etc.
      *  Model 1 - All factors included
      *  Model 2 - Sex, ChestPain, and both MaxHR & Slope tested for interaction effect
      *  Model 3 - Sex, ChestPain, Ca, and both RestBP & Oldpeak tested for interaction effect
      *  Model 4 - Sex, ChestPain, Ca & RestBP
      *  Model 5 - Sex, ChestPain, Ca, and both RestBP & MaxHR tested for interaction effect

Conclusion: Comparing the summary results, Model 4 is considered the best fit and will be used for training and testing. The p-value is low for all of its parameters.
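
Candidate models can be compared side by side by pulling the coefficient p-values and AIC out of each fit. The sketch below is illustrative only: the data are simulated, the three formulas are hypothetical stand-ins for the candidate models, and using the largest non-intercept p-value as a summary is an assumption rather than the report's stated criterion.

```r
# Simulated stand-in data; NOT the heart data set.
set.seed(11)
n <- 250
toy_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy_data$y <- rbinom(n, 1, plogis(0.8 * toy_data$x1 - 0.6 * toy_data$x2))

# Hypothetical candidate formulas: all terms, an interaction, a reduced model.
candidates <- list(
  all_terms   = y ~ x1 + x2 + x3,
  interaction = y ~ x1 * x2,
  reduced     = y ~ x1 + x2
)
fits <- lapply(candidates, glm, family = binomial, data = toy_data)

# One row per model: AIC and the largest p-value among non-intercept terms.
model_table <- data.frame(
  model = names(fits),
  AIC = sapply(fits, AIC),
  max_p_value = sapply(fits, function(f) {
    p <- coef(summary(f))[, "Pr(>|z|)"]
    max(p[names(p) != "(Intercept)"])
  }),
  row.names = NULL
)
print(model_table)
```

`coef(summary(fit))` returns the same Estimate / Std. Error / z value / Pr(>|z|) matrix that appears in the output below.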

## 
## Call:
## glm(formula = AHD ~ Age + Sex + ChestPain + RestBP + Chol + Fbs + 
##     RestECG + MaxHR + ExAng + Oldpeak + Slope + Ca, family = binomial, 
##     data = heart_data_final)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5914  -0.5280  -0.1506   0.4217   2.4822  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -4.405544   2.645651  -1.665 0.095872 .  
## Age                 -0.009658   0.023677  -0.408 0.683333    
## Sex                  1.951369   0.456954   4.270 1.95e-05 ***
## ChestPainnonanginal -1.815575   0.461783  -3.932 8.44e-05 ***
## ChestPainnontypical -1.175942   0.534164  -2.201 0.027703 *  
## ChestPaintypical    -2.124587   0.641333  -3.313 0.000924 ***
## RestBP               0.025533   0.010828   2.358 0.018368 *  
## Chol                 0.006473   0.003982   1.625 0.104058    
## Fbs                 -0.779666   0.564135  -1.382 0.166955    
## RestECG              0.178182   0.180902   0.985 0.324641    
## MaxHR               -0.022278   0.010503  -2.121 0.033919 *  
## ExAng                0.926336   0.417580   2.218 0.026531 *  
## Oldpeak              0.351297   0.216763   1.621 0.105092    
## Slope                0.745698   0.361482   2.063 0.039123 *  
## Ca                   1.274989   0.260660   4.891 1.00e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 409.95  on 296  degrees of freedom
## Residual deviance: 208.35  on 282  degrees of freedom
## AIC: 238.35
## 
## Number of Fisher Scoring iterations: 6
##         (Intercept)                 Age                 Sex 
##        -4.405544055        -0.009658484         1.951369054 
## ChestPainnonanginal ChestPainnontypical    ChestPaintypical 
##        -1.815575013        -1.175942437        -2.124587104 
##              RestBP                Chol                 Fbs 
##         0.025532913         0.006473361        -0.779665936 
##             RestECG               MaxHR               ExAng 
##         0.178182093        -0.022277789         0.926335958 
##             Oldpeak               Slope                  Ca 
##         0.351297420         0.745697774         1.274989328
##                         Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)         -4.405544055 2.645651397 -1.6652020 9.587246e-02
## Age                 -0.009658484 0.023677444 -0.4079192 6.833330e-01
## Sex                  1.951369054 0.456954316  4.2703811 1.951393e-05
## ChestPainnonanginal -1.815575013 0.461782788 -3.9316645 8.435973e-05
## ChestPainnontypical -1.175942437 0.534163541 -2.2014652 2.770311e-02
## ChestPaintypical    -2.124587104 0.641332971 -3.3127676 9.237770e-04
## RestBP               0.025532913 0.010827710  2.3581083 1.836833e-02
## Chol                 0.006473361 0.003982402  1.6254917 1.040578e-01
## Fbs                 -0.779665936 0.564134970 -1.3820557 1.669546e-01
## RestECG              0.178182093 0.180901845  0.9849656 3.246410e-01
## MaxHR               -0.022277789 0.010503257 -2.1210362 3.391875e-02
## ExAng                0.926335958 0.417579569  2.2183460 2.653125e-02
## Oldpeak              0.351297420 0.216762954  1.6206525 1.050922e-01
## Slope                0.745697774 0.361481704  2.0628922 3.912287e-02
## Ca                   1.274989328 0.260660366  4.8913816 1.001306e-06
## 
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + MaxHR * Age, family = binomial, 
##     data = heart_data_final)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8151  -0.6278  -0.2051   0.5041   2.5984  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         20.7052828  8.3573782   2.477 0.013231 *  
## Sex                  1.6453767  0.3769823   4.365 1.27e-05 ***
## ChestPainnonanginal -1.9499460  0.4017587  -4.854 1.21e-06 ***
## ChestPainnontypical -1.7484936  0.4875348  -3.586 0.000335 ***
## ChestPaintypical    -1.7442403  0.5891636  -2.961 0.003071 ** 
## Ca                   1.0339271  0.2135209   4.842 1.28e-06 ***
## MaxHR               -0.1480578  0.0550228  -2.691 0.007127 ** 
## Age                 -0.3059567  0.1455806  -2.102 0.035586 *  
## MaxHR:Age            0.0021147  0.0009661   2.189 0.028608 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 409.95  on 296  degrees of freedom
## Residual deviance: 239.33  on 288  degrees of freedom
## AIC: 257.33
## 
## Number of Fisher Scoring iterations: 5
##         (Intercept)                 Sex ChestPainnonanginal 
##         20.70528282          1.64537673         -1.94994597 
## ChestPainnontypical    ChestPaintypical                  Ca 
##         -1.74849365         -1.74424035          1.03392712 
##               MaxHR                 Age           MaxHR:Age 
##         -0.14805782         -0.30595668          0.00211472
##                        Estimate   Std. Error   z value     Pr(>|z|)
## (Intercept)         20.70528282 8.3573781576  2.477485 1.323118e-02
## Sex                  1.64537673 0.3769823133  4.364599 1.273560e-05
## ChestPainnonanginal -1.94994597 0.4017586975 -4.853525 1.212859e-06
## ChestPainnontypical -1.74849365 0.4875347898 -3.586398 3.352775e-04
## ChestPaintypical    -1.74424035 0.5891635523 -2.960537 3.071035e-03
## Ca                   1.03392712 0.2135208807  4.842276 1.283600e-06
## MaxHR               -0.14805782 0.0550228048 -2.690845 7.127137e-03
## Age                 -0.30595668 0.1455805926 -2.101631 3.558561e-02
## MaxHR:Age            0.00211472 0.0009661371  2.188841 2.860840e-02
## 
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + MaxHR * Slope, family = binomial, 
##     data = heart_data_final)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6837  -0.5644  -0.2157   0.4773   2.1706  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          0.111992   3.922104   0.029  0.97722    
## Sex                  1.643526   0.385589   4.262 2.02e-05 ***
## ChestPainnonanginal -2.057107   0.418202  -4.919 8.70e-07 ***
## ChestPainnontypical -1.603459   0.495087  -3.239  0.00120 ** 
## ChestPaintypical    -1.991727   0.579625  -3.436  0.00059 ***
## Ca                   1.178514   0.222414   5.299 1.17e-07 ***
## MaxHR               -0.018707   0.024556  -0.762  0.44618    
## Slope                1.448169   2.142814   0.676  0.49915    
## MaxHR:Slope         -0.002829   0.013708  -0.206  0.83649    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 409.95  on 296  degrees of freedom
## Residual deviance: 231.83  on 288  degrees of freedom
## AIC: 249.83
## 
## Number of Fisher Scoring iterations: 5
##         (Intercept)                 Sex ChestPainnonanginal 
##         0.111992162         1.643525558        -2.057107132 
## ChestPainnontypical    ChestPaintypical                  Ca 
##        -1.603459408        -1.991726731         1.178513591 
##               MaxHR               Slope         MaxHR:Slope 
##        -0.018706827         1.448168629        -0.002829125
##                         Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)          0.111992162 3.92210393  0.0285541 9.772202e-01
## Sex                  1.643525558 0.38558858  4.2623813 2.022599e-05
## ChestPainnonanginal -2.057107132 0.41820240 -4.9189271 8.701986e-07
## ChestPainnontypical -1.603459408 0.49508739 -3.2387402 1.200589e-03
## ChestPaintypical    -1.991726731 0.57962523 -3.4362320 5.898657e-04
## Ca                   1.178513591 0.22241380  5.2987431 1.166026e-07
## MaxHR               -0.018706827 0.02455605 -0.7618011 4.461787e-01
## Slope                1.448168629 2.14281410  0.6758256 4.991514e-01
## MaxHR:Slope         -0.002829125 0.01370840 -0.2063790 8.364949e-01
## 
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP * Oldpeak, 
##     family = binomial, data = heart_data_final)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2771  -0.5393  -0.2028   0.4964   2.5592  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -6.144905   1.736289  -3.539 0.000401 ***
## Sex                  1.511065   0.391798   3.857 0.000115 ***
## ChestPainnonanginal -2.362805   0.426465  -5.540 3.02e-08 ***
## ChestPainnontypical -1.765082   0.492021  -3.587 0.000334 ***
## ChestPaintypical    -2.599348   0.602760  -4.312 1.61e-05 ***
## Ca                   1.082099   0.215806   5.014 5.33e-07 ***
## RestBP               0.034818   0.012476   2.791 0.005259 ** 
## Oldpeak              2.486976   1.079403   2.304 0.021221 *  
## RestBP:Oldpeak      -0.012607   0.007615  -1.656 0.097815 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 409.95  on 296  degrees of freedom
## Residual deviance: 230.67  on 288  degrees of freedom
## AIC: 248.67
## 
## Number of Fisher Scoring iterations: 5
##         (Intercept)                 Sex ChestPainnonanginal 
##         -6.14490542          1.51106518         -2.36280496 
## ChestPainnontypical    ChestPaintypical                  Ca 
##         -1.76508159         -2.59934764          1.08209890 
##              RestBP             Oldpeak      RestBP:Oldpeak 
##          0.03481803          2.48697576         -0.01260698
## 
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial, 
##     data = heart_data_final)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6880  -0.6163  -0.2476   0.6090   2.1251  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -4.265659   1.288263  -3.311 0.000929 ***
## Sex                  1.629896   0.370191   4.403 1.07e-05 ***
## ChestPainnonanginal -2.279957   0.392021  -5.816 6.03e-09 ***
## ChestPainnontypical -2.260915   0.472161  -4.788 1.68e-06 ***
## ChestPaintypical    -2.275540   0.581232  -3.915 9.04e-05 ***
## Ca                   1.150925   0.204851   5.618 1.93e-08 ***
## RestBP               0.025631   0.009126   2.809 0.004975 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 409.95  on 296  degrees of freedom
## Residual deviance: 253.72  on 290  degrees of freedom
## AIC: 267.72
## 
## Number of Fisher Scoring iterations: 5
##         (Intercept)                 Sex ChestPainnonanginal 
##         -4.26565914          1.62989631         -2.27995678 
## ChestPainnontypical    ChestPaintypical                  Ca 
##         -2.26091509         -2.27554019          1.15092493 
##              RestBP 
##          0.02563142
## 
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP * MaxHR, family = binomial, 
##     data = heart_data_final)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6281  -0.5899  -0.2022   0.5153   2.4074  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.9326987  9.1271635  -0.212 0.832300    
## Sex                  1.7357897  0.3818974   4.545 5.49e-06 ***
## ChestPainnonanginal -2.0480544  0.4041083  -5.068 4.02e-07 ***
## ChestPainnontypical -1.7942456  0.4977098  -3.605 0.000312 ***
## ChestPaintypical    -2.1223125  0.6068965  -3.497 0.000471 ***
## Ca                   1.1056820  0.2127394   5.197 2.02e-07 ***
## RestBP               0.0463943  0.0688582   0.674 0.500460    
## MaxHR               -0.0192594  0.0595729  -0.323 0.746475    
## RestBP:MaxHR        -0.0001183  0.0004479  -0.264 0.791673    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 409.95  on 296  degrees of freedom
## Residual deviance: 234.86  on 288  degrees of freedom
## AIC: 252.86
## 
## Number of Fisher Scoring iterations: 5
##         (Intercept)                 Sex ChestPainnonanginal 
##       -1.9326987063        1.7357897057       -2.0480544338 
## ChestPainnontypical    ChestPaintypical                  Ca 
##       -1.7942455816       -2.1223125277        1.1056820370 
##              RestBP               MaxHR        RestBP:MaxHR 
##        0.0463943109       -0.0192593912       -0.0001183069
##                          Estimate   Std. Error    z value     Pr(>|z|)
## (Intercept)         -1.9326987063 9.1271635228 -0.2117524 8.323002e-01
## Sex                  1.7357897057 0.3818973829  4.5451731 5.489013e-06
## ChestPainnonanginal -2.0480544338 0.4041083476 -5.0680825 4.018433e-07
## ChestPainnontypical -1.7942455816 0.4977097740 -3.6050037 3.121485e-04
## ChestPaintypical    -2.1223125277 0.6068964802 -3.4969926 4.705348e-04
## Ca                   1.1056820370 0.2127393892  5.1973546 2.021446e-07
## RestBP               0.0463943109 0.0688581655  0.6737663 5.004599e-01
## MaxHR               -0.0192593912 0.0595728608 -0.3232914 7.464746e-01
## RestBP:MaxHR        -0.0001183069 0.0004478974 -0.2641383 7.916733e-01
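
The five fits printed above can also be lined up on a single criterion. As a minimal sketch, the residual deviances and coefficient counts below are copied by hand from the summaries; the data frame name `fits` and its column names are illustrative only. For a binomial GLM with a binary response, AIC equals the residual deviance plus twice the number of estimated coefficients, which reproduces the AIC lines in the output.

```r
# Residual deviance and number of coefficients, copied from the summaries above.
fits <- data.frame(
  formula  = c("Sex + ChestPain + Ca + MaxHR * Age",
               "Sex + ChestPain + Ca + MaxHR * Slope",
               "Sex + ChestPain + Ca + RestBP * Oldpeak",
               "Sex + ChestPain + Ca + RestBP",
               "Sex + ChestPain + Ca + RestBP * MaxHR"),
  deviance = c(239.33, 231.83, 230.67, 253.72, 234.86),
  n_coef   = c(9, 9, 9, 7, 9),
  stringsAsFactors = FALSE
)

# For a binary-response binomial GLM: AIC = residual deviance + 2 * (number of coefficients).
fits$aic <- fits$deviance + 2 * fits$n_coef

# Lower AIC indicates a better trade-off between fit and complexity.
fits[order(fits$aic), c("formula", "deviance", "aic")]
```

On this criterion the RestBP * Oldpeak fit ranks first (248.67) and the RestBP-only fit last (267.72). The selection in this summary rests on coefficient p-values instead, so the two criteria need not agree.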

glm Models | Preliminary - logit Function Only

Logistic Regression Model - Training The Model

#Based on the coefficient p-values above, Model 4 (Sex + ChestPain + Ca + RestBP) is considered the best fit#
#Use "train_data" to train Model 4#
glm.fits_final = glm(AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial (link = logit), data = train_data)
glm.probs=predict(glm.fits_final, train_data, type = "response")
#Assessing the effect of different link functions on model significance#
#link = probit
glm.fits_final_p = glm(AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial (link = probit), data = train_data)
#Separate names keep the logit predictions from being overwritten#
glm.probs_p=predict(glm.fits_final_p, train_data, type = "response")
#link = cloglog
glm.fits_final_c = glm(AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial (link = cloglog), data = train_data)
glm.probs_c=predict(glm.fits_final_c, train_data, type = "response")
#Summaries for the same model with different link functions#
summary(glm.fits_final)
## 
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial(link = logit), 
##     data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6017  -0.6120  -0.2161   0.6290   2.3013  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -4.733568   1.437308  -3.293 0.000990 ***
## Sex                  1.795004   0.430600   4.169 3.06e-05 ***
## ChestPainnonanginal -2.297364   0.425138  -5.404 6.52e-08 ***
## ChestPainnontypical -2.729353   0.609057  -4.481 7.42e-06 ***
## ChestPaintypical    -2.518962   0.653739  -3.853 0.000117 ***
## Ca                   1.083789   0.215022   5.040 4.65e-07 ***
## RestBP               0.028122   0.009987   2.816 0.004865 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 343.86  on 249  degrees of freedom
## Residual deviance: 207.14  on 243  degrees of freedom
## AIC: 221.14
## 
## Number of Fisher Scoring iterations: 5
summary(glm.fits_final_p)
## 
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial(link = probit), 
##     data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6976  -0.6269  -0.1715   0.6396   2.3059  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -2.747753   0.816383  -3.366 0.000763 ***
## Sex                  1.064683   0.241008   4.418 9.98e-06 ***
## ChestPainnonanginal -1.362536   0.238931  -5.703 1.18e-08 ***
## ChestPainnontypical -1.589711   0.326448  -4.870 1.12e-06 ***
## ChestPaintypical    -1.490236   0.373971  -3.985 6.75e-05 ***
## Ca                   0.618915   0.117659   5.260 1.44e-07 ***
## RestBP               0.016339   0.005707   2.863 0.004196 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 343.86  on 249  degrees of freedom
## Residual deviance: 206.54  on 243  degrees of freedom
## AIC: 220.54
## 
## Number of Fisher Scoring iterations: 6
summary(glm.fits_final_c)
## 
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial(link = cloglog), 
##     data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9477  -0.6323  -0.2991   0.6560   2.2091  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -4.105311   0.974081  -4.215 2.50e-05 ***
## Sex                  1.212279   0.296741   4.085 4.40e-05 ***
## ChestPainnonanginal -1.627868   0.295904  -5.501 3.77e-08 ***
## ChestPainnontypical -1.986974   0.470081  -4.227 2.37e-05 ***
## ChestPaintypical    -1.788290   0.492948  -3.628 0.000286 ***
## Ca                   0.640640   0.119459   5.363 8.19e-08 ***
## RestBP               0.022593   0.006551   3.449 0.000563 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 343.86  on 249  degrees of freedom
## Residual deviance: 209.04  on 243  degrees of freedom
## AIC: 223.04
## 
## Number of Fisher Scoring iterations: 7
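#Based on the summaries, the "cloglog" model is considered more significant (e.g. RestBP p-value 0.000563 vs 0.004865 for logit), although its AIC (223.04) is higher than logit (221.14) and probit (220.54)#
#For validation purposes, both models, glm.fits_final and glm.fits_final_c, will be evaluated#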
predit_glm_final <- predict(glm.fits_final, newdata = test_data, type = "response")
predit_glm_final_c <- predict(glm.fits_final_c, newdata = test_data, type = "response")
#Prediction for the probit model added for ROC graph purposes only#
predit_glm_final_p <- predict(glm.fits_final_p, newdata = test_data, type = "response")
#Defining the cutoff value used to assign "0/1" to the resulting probabilities#
range_glm_final <- range(predit_glm_final)
range_glm_final_c <- range(predit_glm_final_c)
range_glm_final
## [1] 0.02296012 0.97898095
range_glm_final_c
## [1] 0.04361867 0.99970032
#Half the prediction range, (max - min)/2, is about 0.478 for both models, so the initial cutoff value is "0.48"#
(0.97898095 - 0.02296012)/2
## [1] 0.4780104
(0.99970032-0.04361867)/2
## [1] 0.4780408
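
The cutoff rule used here, half the prediction range, can be written as a small helper. This is a minimal sketch: `half_range_cutoff` is a hypothetical name that is not part of the original analysis, and the two ranges are the ones printed above. The midpoint of each range, (max + min)/2, is shown for comparison only.

```r
# Hypothetical helper: cutoff defined as half the prediction range, (max - min) / 2.
half_range_cutoff <- function(probs) {
  r <- range(probs)
  (r[2] - r[1]) / 2
}

# Ranges printed above for the logit and cloglog models.
range_logit   <- c(0.02296012, 0.97898095)
range_cloglog <- c(0.04361867, 0.99970032)

cutoffs <- c(logit   = half_range_cutoff(range_logit),
             cloglog = half_range_cutoff(range_cloglog))
round(cutoffs, 2)   # both round to 0.48

# For comparison only: the midpoint of each range, (max + min) / 2.
midpoints <- c(logit = mean(range_logit), cloglog = mean(range_cloglog))
round(midpoints, 2)
```

The two definitions coincide only when the smallest predicted probability is close to zero, which is roughly the case for the logit model here.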

Logistic Regression Model- Confusion Matrix

cutoff_glm_final <- ifelse (predit_glm_final > .48, 1, 0)
tbl_glm_final<- table(test_data$AHD,cutoff_glm_final)
cutoff_glm_final_c <- ifelse (predit_glm_final_c > .48, 1, 0)
tbl_glm_final_c <- table(test_data$AHD,cutoff_glm_final_c)
#Classification accuracy = (TP+TN)/(TP+FP+TN+FN)#
#glm_final_accuracy <- (16+19)/(19+3+9+16)#
glm_final_accuracy <- sum(diag(tbl_glm_final)) / nrow(test_data)
#glm_final_c_accuracy <- (15+20)/(20+2+10+15)#
glm_final_c_accuracy <- sum(diag(tbl_glm_final_c)) / nrow(test_data)
#Sensitivity = TP/(TP+FN); counts are hard-coded from tbl_glm_final and tbl_glm_final_c, so each ratio is a sensitivity only if 3 and 2 are the false negatives#
glm_final_sensitivity <- 16/(16+3)
glm_final_c_sensitivity <- 15/(15+2)
#Specificity = TN/(TN+FP); likewise, each ratio is a specificity only if 9 and 10 are the false positives#
glm_final_specificity <- 19/(19+9)
glm_final_c_specificity <- 20/(20+10)
#glm Models Comparison#
glm_models_summary <- matrix(c(glm_final_accuracy,glm_final_sensitivity,glm_final_specificity,glm_final_c_accuracy,glm_final_c_sensitivity,glm_final_c_specificity),ncol=3,byrow=TRUE)
colnames(glm_models_summary) <- c("Accuracy","Sensitivity", "Specificity")
rownames(glm_models_summary) <- c("Final glm Model - logit Link","Final glm Model - cloglog Link")
glm_models_summary<- as.table(glm_models_summary)
glm_models_summary
##                                 Accuracy Sensitivity Specificity
## Final glm Model - logit Link   0.7446809   0.8421053   0.6785714
## Final glm Model - cloglog Link 0.7446809   0.8823529   0.6666667
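
The hard-coded ratios above can also be read directly from a confusion table. This is a minimal sketch, assuming rows hold the actual class and columns the predicted class, as in `table(test_data$AHD, cutoff_glm_final)`. The helper name `binary_metrics` is hypothetical. The example counts are the four logit-model counts quoted in the comments above, arranged under the assumption that the test set holds 22 "No" and 25 "Yes" cases (the column totals of the SVM confusion matrix later in this summary); the table itself is not printed in the original.

```r
# Hypothetical helper: metrics from a 2x2 table with rows = actual, columns = predicted,
# and the negative class first in both dimensions.
binary_metrics <- function(tbl) {
  tn <- tbl[1, 1]; fp <- tbl[1, 2]
  fn <- tbl[2, 1]; tp <- tbl[2, 2]
  c(accuracy    = (tp + tn) / sum(tbl),
    sensitivity = tp / (tp + fn),   # share of actual positives that are detected
    specificity = tn / (tn + fp),   # share of actual negatives that are cleared
    precision   = tp / (tp + fp))   # share of predicted positives that are correct
}

# Assumed layout of the logit-model counts (19, 3, 9, 16); an assumption, not source output.
tbl_example <- matrix(c(19, 3,
                        9, 16),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(actual = c("No", "Yes"), predicted = c("0", "1")))
metrics <- binary_metrics(tbl_example)
round(metrics, 4)
```

Under this assumed layout, 16/(16+3) is the precision rather than the sensitivity, so the layout of `tbl_glm_final` is worth checking before the comparison table is interpreted.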

Logistic Regression Model | Visualization - AUC based Pruning

To be updated. Because of the high number of variables, this section could not be completed in time.
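
As a placeholder, the following is a minimal sketch of how an AUC could be computed in base R from predicted probabilities and 0/1 labels, using the rank (Mann-Whitney) formulation. The function name `auc_rank` and the toy vectors are illustrative assumptions; the library list in this summary includes ROCR, which is not used here.

```r
# Hypothetical helper: AUC as the probability that a random positive case
# receives a higher score than a random negative case (ties count as one half).
auc_rank <- function(scores, labels) {
  stopifnot(length(scores) == length(labels), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)                      # average ranks handle ties
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data for illustration only (not taken from the heart data).
toy_labels <- c(0, 0, 1, 0, 1, 1)
toy_scores <- c(0.10, 0.35, 0.40, 0.55, 0.80, 0.90)
toy_auc <- auc_rank(toy_scores, toy_labels)
toy_auc
```

With the objects defined earlier, a call such as `auc_rank(predit_glm_final, as.numeric(test_data$AHD == "Yes"))` would apply, assuming AHD is coded as No/Yes.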

Support vector machine(SVM) Model | Evaluation - Initial Scenario

For this exercise, the first SVM uses all explanatory variables with a linear kernel.

    - Different cost levels were compared to identify the optimal value.
    - A confusion matrix was used to evaluate accuracy and other key metrics.

Results of the first attempt were not encouraging, so I changed approach and used the following libraries for Scenario II:

    - caret
    - mlbench
    - knitr
    - lattice
    - ROCR
    - gmum.r (it took one day to install :-)

Scenario 1 code is included for reference only.

library(e1071)
plot(heart_data_final$Sex, heart_data_final$ChestPain, col=heart_data_final$AHD)

plot(heart_data_final$Sex, col=heart_data_final$AHD)

plot(heart_data_final$ChestPain, col=heart_data_final$AHD)

plot(heart_data_final$Ca, col=heart_data_final$AHD)

svmmodel <- svm(AHD ~. , data = train_data, kernel = "linear", cost = .1, scale = FALSE)
print(svmmodel)
## 
## Call:
## svm(formula = AHD ~ ., data = train_data, kernel = "linear", 
##     cost = 0.1, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
##       gamma:  0.05555556 
## 
## Number of Support Vectors:  108
svmmodel1 <- svm(AHD ~ Sex + ChestPain + Ca + RestBP, data = train_data, type='C-classification', kernel = "linear", cost = .1, scale = FALSE)
print(svmmodel1)
## 
## Call:
## svm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, data = train_data, 
##     type = "C-classification", kernel = "linear", cost = 0.1, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
##       gamma:  0.1428571 
## 
## Number of Support Vectors:  148
svmtuned_1 <- tune(svm, AHD ~ Sex + ChestPain + Ca + RestBP , data = train_data, kernel = "linear", ranges = list(cost = c(.001, .005,.01, .05, .1, .5, .075, .025) , scale = FALSE))
summary(svmtuned_1)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost scale
##   0.5 FALSE
## 
## - best performance: 0.208 
## 
## - Detailed performance results:
##    cost scale error dispersion
## 1 0.001 FALSE 0.448 0.09577752
## 2 0.005 FALSE 0.312 0.11895844
## 3 0.010 FALSE 0.260 0.12961481
## 4 0.050 FALSE 0.220 0.09092121
## 5 0.100 FALSE 0.212 0.08854377
## 6 0.500 FALSE 0.208 0.07955431
## 7 0.075 FALSE 0.212 0.08854377
## 8 0.025 FALSE 0.244 0.09512565
#For svmmodel1, the optimal cost value is .5#
library(tourr)
## Warning: package 'tourr' was built under R version 3.4.2
## 
## Attaching package: 'tourr'
## The following object is masked from 'package:e1071':
## 
##     interpolate
svmmodel1 <- svm(AHD ~ Sex + ChestPain + Ca + RestBP, data = train_data, type='C-classification', kernel = "linear", cost = .1, scale = FALSE)
print(svmmodel1)
## 
## Call:
## svm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, data = train_data, 
##     type = "C-classification", kernel = "linear", cost = 0.1, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
##       gamma:  0.1428571 
## 
## Number of Support Vectors:  148
svmtuned_1 <- tune(svm, AHD ~ Sex + ChestPain + Ca + RestBP , data = train_data, kernel = "linear", ranges = list(cost = c(.001, .005,.01, .05, .1, .5, .075, .025) , scale = FALSE))
summary(svmtuned_1)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost scale
##   0.1 FALSE
## 
## - best performance: 0.216 
## 
## - Detailed performance results:
##    cost scale error dispersion
## 1 0.001 FALSE 0.452 0.07315129
## 2 0.005 FALSE 0.316 0.08099383
## 3 0.010 FALSE 0.268 0.07067924
## 4 0.050 FALSE 0.224 0.03373096
## 5 0.100 FALSE 0.216 0.03864367
## 6 0.500 FALSE 0.216 0.05719363
## 7 0.075 FALSE 0.220 0.03399346
## 8 0.025 FALSE 0.232 0.07004760
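#In this rerun, cost = .1 and cost = .5 tie at an error of 0.216 (tune reports .1); cost = .5, the best value in the first run, is used for svmmodel1#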
svmmodel1 <- svm(AHD ~ Sex + ChestPain + Ca + RestBP, data = train_data, type='C-classification', kernel = "linear", cost = .5, scale = FALSE)
print(svmmodel1)
## 
## Call:
## svm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, data = train_data, 
##     type = "C-classification", kernel = "linear", cost = 0.5, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.5 
##       gamma:  0.1428571 
## 
## Number of Support Vectors:  121
svm_predict_1 <- predict(svmmodel1, newdata = test_data, type = "class")
svm_predict_1
## 254 255 256 257 258 259 260 261 262 263 264 265 266 268 269 270 271 272 
##  No Yes  No Yes  No  No  No  No  No  No  No Yes Yes  No Yes  No Yes Yes 
## 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 289 290 291 
## Yes  No Yes  No  No  No  No  No Yes  No  No  No Yes Yes Yes  No  No  No 
## 292 293 294 295 296 297 298 299 300 301 302 
##  No Yes Yes  No  No Yes  No  No Yes Yes  No 
## Levels: No Yes
head(svm_predict_1)
## 254 255 256 257 258 259 
##  No Yes  No Yes  No  No 
## Levels: No Yes
str(svm_predict_1)
##  Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 1 1 1 1 ...
##  - attr(*, "names")= chr [1:47] "254" "255" "256" "257" ...
summary(svm_predict_1)
##  No Yes 
##  29  18
#confusion matrix#
svmmodel1_eval <- table (svm_predict_1, test_data$AHD)
svmmodel1_eval
##              
## svm_predict_1 No Yes
##           No  19  10
##           Yes  3  15
#Test accuracy is (19+15)/47 = 0.72, below the 0.74 of the logistic models, so svmmodel1 is excluded#
#Visualization-Evaluating the equation of boundary plane#
w <- t(svmmodel1$coefs) %*% svmmodel1$SV
w
##            Sex ChestPainasymptomatic ChestPainnonanginal
## [1,] -1.287547             -1.204816           0.4015195
##      ChestPainnontypical ChestPaintypical         Ca       RestBP
## [1,]           0.3560102        0.4472863 -0.6666633 -0.009130642
#negative intercept#
svmmodel1$rho
## [1] -3.163924
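
The weight vector `w` and `rho` above define the linear decision function. This is a minimal sketch, assuming the usual e1071/libsvm convention that the decision value is `w . x - rho`. The weights are copied from the output above, while the patient vector `x` and the helper name `decision_value` are made up for illustration. Which class a positive value maps to depends on the factor level order in the fitted model, so the sign is not interpreted here.

```r
# Weights and rho copied from the printed output above.
w <- c(Sex = -1.287547, ChestPainasymptomatic = -1.204816,
       ChestPainnonanginal = 0.4015195, ChestPainnontypical = 0.3560102,
       ChestPaintypical = 0.4472863, Ca = -0.6666633, RestBP = -0.009130642)
rho <- -3.163924

# Hypothetical helper: linear decision value, assuming f(x) = w . x - rho.
decision_value <- function(x, w, rho) {
  stopifnot(identical(names(x), names(w)))
  sum(w * x) - rho
}

# Made-up patient: Sex = 1, asymptomatic chest pain, Ca = 1, RestBP = 130.
x <- c(Sex = 1, ChestPainasymptomatic = 1, ChestPainnonanginal = 0,
       ChestPainnontypical = 0, ChestPaintypical = 0, Ca = 1, RestBP = 130)
f_x <- decision_value(x, w, rho)
f_x
```

Because the model was fitted with `scale = FALSE`, the weights apply to the raw predictor values, which is why the RestBP weight is so much smaller than the others.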

Support vector machine(SVM) Model | Evaluation - Scenario II (svmLinear)

To use more robust tooling, I switched to caret with a different split and resampling scheme: 247 training rows and 50 test rows, with 10-fold cross-validation repeated 3 times.

## Warning: package 'caret' was built under R version 3.4.2
## Loading required package: lattice
## Warning: package 'mlbench' was built under R version 3.4.2
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
##        X              Age             Sex                ChestPain  
##  Min.   :  3.0   Min.   :29.00   Min.   :0.0000   asymptomatic:118  
##  1st Qu.: 79.5   1st Qu.:48.00   1st Qu.:0.0000   nonanginal  : 65  
##  Median :151.0   Median :56.00   Median :1.0000   nontypical  : 44  
##  Mean   :152.3   Mean   :54.58   Mean   :0.6478   typical     : 20  
##  3rd Qu.:227.5   3rd Qu.:61.00   3rd Qu.:1.0000                     
##  Max.   :302.0   Max.   :77.00   Max.   :1.0000                     
##      RestBP           Chol            Fbs            RestECG      
##  Min.   : 94.0   Min.   :126.0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:120.0   1st Qu.:210.5   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :130.0   Median :241.0   Median :0.0000   Median :0.0000  
##  Mean   :131.6   Mean   :246.5   Mean   :0.1336   Mean   :0.9838  
##  3rd Qu.:140.0   3rd Qu.:275.5   3rd Qu.:0.0000   3rd Qu.:2.0000  
##  Max.   :200.0   Max.   :564.0   Max.   :1.0000   Max.   :2.0000  
##      MaxHR           ExAng           Oldpeak          Slope      
##  Min.   : 71.0   Min.   :0.0000   Min.   :0.000   Min.   :1.000  
##  1st Qu.:135.0   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000  
##  Median :154.0   Median :0.0000   Median :0.600   Median :2.000  
##  Mean   :149.7   Mean   :0.3198   Mean   :1.001   Mean   :1.571  
##  3rd Qu.:164.5   3rd Qu.:1.0000   3rd Qu.:1.550   3rd Qu.:2.000  
##  Max.   :202.0   Max.   :1.0000   Max.   :6.200   Max.   :3.000  
##        Ca                 Thal      AHD     
##  Min.   :0.0000   fixed     : 10   No :133  
##  1st Qu.:0.0000   normal    :142   Yes:114  
##  Median :0.0000   reversable: 95            
##  Mean   :0.6275                             
##  3rd Qu.:1.0000                             
##  Max.   :3.0000
##        X              Age             Sex              ChestPain 
##  Min.   :  1.0   Min.   :35.00   Min.   :0.00   asymptomatic:24  
##  1st Qu.: 66.5   1st Qu.:48.00   1st Qu.:1.00   nonanginal  :18  
##  Median :143.0   Median :54.50   Median :1.00   nontypical  : 5  
##  Mean   :142.5   Mean   :54.34   Mean   :0.82   typical     : 3  
##  3rd Qu.:210.2   3rd Qu.:62.00   3rd Qu.:1.00                    
##  Max.   :300.0   Max.   :70.00   Max.   :1.00                    
##      RestBP           Chol            Fbs         RestECG    
##  Min.   : 94.0   Min.   :169.0   Min.   :0.0   Min.   :0.00  
##  1st Qu.:120.0   1st Qu.:225.2   1st Qu.:0.0   1st Qu.:0.00  
##  Median :130.0   Median :245.5   Median :0.0   Median :2.00  
##  Mean   :132.2   Mean   :251.4   Mean   :0.2   Mean   :1.06  
##  3rd Qu.:143.5   3rd Qu.:280.2   3rd Qu.:0.0   3rd Qu.:2.00  
##  Max.   :192.0   Max.   :409.0   Max.   :1.0   Max.   :2.00  
##      MaxHR           ExAng         Oldpeak          Slope     
##  Min.   : 99.0   Min.   :0.00   Min.   :0.000   Min.   :1.00  
##  1st Qu.:132.0   1st Qu.:0.00   1st Qu.:0.000   1st Qu.:1.00  
##  Median :150.0   Median :0.00   Median :1.500   Median :2.00  
##  Mean   :149.3   Mean   :0.36   Mean   :1.324   Mean   :1.76  
##  3rd Qu.:170.5   3rd Qu.:1.00   3rd Qu.:2.200   3rd Qu.:2.00  
##  Max.   :195.0   Max.   :1.00   Max.   :3.500   Max.   :3.00  
##        Ca               Thal     AHD    
##  Min.   :0.00   fixed     : 8   No :27  
##  1st Qu.:0.00   normal    :22   Yes:23  
##  Median :0.00   reversable:20           
##  Mean   :0.92                           
##  3rd Qu.:2.00                           
##  Max.   :3.00
## [1] 247  15
## [1] 50 15
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
## Support Vector Machines with Linear Kernel 
## 
## 247 samples
##  14 predictor
##   2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (17), scaled (17) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 222, 222, 223, 222, 222, 223, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.815094  0.6262081
## 
## Tuning parameter 'C' was held constant at a value of 1
##  [1] No  Yes No  No  Yes No  No  No  Yes No  Yes Yes Yes No  Yes Yes No 
## [18] No  Yes No  Yes No  No  Yes No  No  Yes Yes No  Yes Yes Yes Yes No 
## [35] Yes Yes No  No  No  Yes Yes Yes No  No  Yes No  No  Yes Yes Yes
## Levels: No Yes
## 
## Attaching package: 'gmum.r'
## The following object is masked from 'package:kernlab':
## 
##     centers
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  23   1
##        Yes  4  22
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.7819, 0.9667)
##     No Information Rate : 0.54            
##     P-Value [Acc > NIR] : 4.519e-08       
##                                           
##                   Kappa : 0.8006          
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.8519          
##             Specificity : 0.9565          
##          Pos Pred Value : 0.9583          
##          Neg Pred Value : 0.8462          
##              Prevalence : 0.5400          
##          Detection Rate : 0.4600          
##    Detection Prevalence : 0.4800          
##       Balanced Accuracy : 0.9042          
##                                           
##        'Positive' Class : No              
## 
##  [1] No  Yes No  No  Yes No  No  No  Yes No  Yes Yes Yes No  Yes Yes No 
## [18] No  No  No  Yes No  No  Yes No  Yes Yes Yes No  No  Yes No  Yes No 
## [35] Yes Yes No  No  No  Yes Yes No  No  No  Yes No  No  Yes Yes Yes
## Levels: No Yes
## Support Vector Machines with Linear Kernel 
## 
## 247 samples
##  14 predictor
##   2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (17), scaled (17) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 222, 222, 223, 222, 222, 223, ... 
## Resampling results across tuning parameters:
## 
##   C      Accuracy   Kappa    
##   0.001  0.7297735  0.4319236
##   0.005  0.8341581  0.6651568
##   0.010  0.8342607  0.6649947
##   0.025  0.8326496  0.6612919
##   0.050  0.8247564  0.6450954
##   0.075  0.8233162  0.6419452
##   0.100  0.8166966  0.6286843
##   0.175  0.8096496  0.6149714
##   0.250  0.8083120  0.6125596
##   0.750  0.8110385  0.6179633
##   1.000  0.8150940  0.6262081
##   1.500  0.8124786  0.6209638
##   1.750  0.8178162  0.6314467
##   2.000  0.8178675  0.6316306
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was C = 0.01.
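
caret picks the value of C with the highest resampled accuracy. As a minimal sketch, the table above is re-entered by hand (`cv_results` is an illustrative name) and the same selection rule is applied:

```r
# Resampling results copied from the caret output above.
cv_results <- data.frame(
  C        = c(0.001, 0.005, 0.010, 0.025, 0.050, 0.075, 0.100,
               0.175, 0.250, 0.750, 1.000, 1.500, 1.750, 2.000),
  Accuracy = c(0.7297735, 0.8341581, 0.8342607, 0.8326496, 0.8247564,
               0.8233162, 0.8166966, 0.8096496, 0.8083120, 0.8110385,
               0.8150940, 0.8124786, 0.8178162, 0.8178675)
)

# Same rule as caret: keep the row with the largest accuracy.
best <- cv_results[which.max(cv_results$Accuracy), ]
best

# Margin over the runner-up, to show how flat the accuracy curve is near the top.
gap <- diff(sort(cv_results$Accuracy, decreasing = TRUE)[2:1])
gap
```

The gap between C = 0.01 and C = 0.005 is about 0.0001, so the two settings are practically indistinguishable on this criterion.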

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  20   1
##        Yes  7  22
##                                           
##                Accuracy : 0.84            
##                  95% CI : (0.7089, 0.9283)
##     No Information Rate : 0.54            
##     P-Value [Acc > NIR] : 7.854e-06       
##                                           
##                   Kappa : 0.684           
##  Mcnemar's Test P-Value : 0.0771          
##                                           
##             Sensitivity : 0.7407          
##             Specificity : 0.9565          
##          Pos Pred Value : 0.9524          
##          Neg Pred Value : 0.7586          
##              Prevalence : 0.5400          
##          Detection Rate : 0.4000          
##    Detection Prevalence : 0.4200          
##       Balanced Accuracy : 0.8486          
##                                           
##        'Positive' Class : No              
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  23   1
##        Yes  4  22
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.7819, 0.9667)
##     No Information Rate : 0.54            
##     P-Value [Acc > NIR] : 4.519e-08       
##                                           
##                   Kappa : 0.8006          
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.8519          
##             Specificity : 0.9565          
##          Pos Pred Value : 0.9583          
##          Neg Pred Value : 0.8462          
##              Prevalence : 0.5400          
##          Detection Rate : 0.4600          
##    Detection Prevalence : 0.4800          
##       Balanced Accuracy : 0.9042          
##                                           
##        'Positive' Class : No              
## 
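
The headline statistics in the confusion matrices above can be reproduced from the four counts. This is a minimal sketch in base R; `cm_stats` is a hypothetical helper, and the counts are copied from the two distinct matrices printed above, with "No" as the positive class as in the caret output.

```r
# Hypothetical helper: accuracy, Cohen's kappa and balanced accuracy from a 2x2 table
# with rows = prediction, columns = reference, and the positive class first.
cm_stats <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                        # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2     # agreement expected by chance
  sens <- cm[1, 1] / sum(cm[, 1])
  spec <- cm[2, 2] / sum(cm[, 2])
  c(accuracy = po, kappa = (po - pe) / (1 - pe),
    sensitivity = sens, specificity = spec,
    balanced_accuracy = (sens + spec) / 2)
}

# Counts copied from the confusion matrices above, in the order they first appear.
first_matrix  <- matrix(c(23, 1, 4, 22), nrow = 2, byrow = TRUE,
                        dimnames = list(Prediction = c("No", "Yes"), Reference = c("No", "Yes")))
second_matrix <- matrix(c(20, 1, 7, 22), nrow = 2, byrow = TRUE,
                        dimnames = list(Prediction = c("No", "Yes"), Reference = c("No", "Yes")))
stats <- rbind(first = cm_stats(first_matrix), second = cm_stats(second_matrix))
round(stats, 4)
```

Both matrices misclassify one "Yes" case as "No"; the difference in accuracy (0.90 vs 0.84) comes entirely from "No" cases predicted as "Yes" (4 vs 7).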

Support vector machine(SVM) Model | Evaluation - Scenario II (pls)

To use more robust capabilities, I used a different approach to split and resample the data.

## Warning: package 'pls' was built under R version 3.4.2
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
##           No       Yes
## 1  0.3969739 0.6030261
## 2  0.2903513 0.7096487
## 4  0.5902840 0.4097160
## 6  0.7178610 0.2821390
## 9  0.3479927 0.6520073
## 19 0.7204744 0.2795256
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  19   2
##        Yes  8  21
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.6628, 0.8997)
##     No Information Rate : 0.54            
##     P-Value [Acc > NIR] : 0.0001186       
##                                           
##                   Kappa : 0.6051          
##  Mcnemar's Test P-Value : 0.1138463       
##                                           
##             Sensitivity : 0.7037          
##             Specificity : 0.9130          
##          Pos Pred Value : 0.9048          
##          Neg Pred Value : 0.7241          
##              Prevalence : 0.5400          
##          Detection Rate : 0.3800          
##    Detection Prevalence : 0.4200          
##       Balanced Accuracy : 0.8084          
##                                           
##        'Positive' Class : No              
## 
##  [1] Yes Yes No  No  Yes No  Yes No  Yes No  Yes Yes Yes Yes Yes Yes No 
## [18] No  Yes No  Yes Yes Yes Yes No  No  Yes Yes No  Yes Yes No  Yes No 
## [35] Yes No  No  No  No  Yes Yes Yes No  No  Yes No  No  Yes Yes Yes
## Levels: No Yes
##  [1] No  Yes No  No  Yes No  No  No  Yes No  Yes Yes Yes No  Yes Yes No 
## [18] No  No  No  Yes No  No  Yes No  Yes Yes Yes No  No  Yes No  Yes No 
## [35] Yes Yes No  No  No  Yes Yes No  No  No  Yes No  No  Yes Yes Yes
## Levels: No Yes
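
The pls output above can be checked by hand. The sketch below uses base R only and copies the six printed class probabilities and the four confusion matrix counts from the output. It assumes that a class is assigned by taking the larger of the two probabilities and that 'No' is the positive class, as the report states. The two factor vectors printed above appear to be the predicted and reference labels, since their 'Yes' counts (29 and 23) match the matrix margins. All variable names are illustrative.

```r
# First six class probabilities, copied from the output above
# (row names are the test-set row indices).
probs <- data.frame(
  No  = c(0.3969739, 0.2903513, 0.5902840, 0.7178610, 0.3479927, 0.7204744),
  Yes = c(0.6030261, 0.7096487, 0.4097160, 0.2821390, 0.6520073, 0.2795256),
  row.names = c("1", "2", "4", "6", "9", "19")
)

# Assumption: each row is assigned to the class with the larger probability.
pred_head <- factor(ifelse(probs$Yes > probs$No, "Yes", "No"),
                    levels = c("No", "Yes"))

# Confusion matrix counts copied from the output
# (rows = prediction, columns = reference; matrix() fills column by column).
cm <- matrix(c(19, 8, 2, 21), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

n  <- sum(cm)
tp <- cm["No", "No"]    # predicted No, actually No ('No' is the positive class)
fn <- cm["Yes", "No"]   # predicted Yes, actually No
fp <- cm["No", "Yes"]   # predicted No, actually Yes
tn <- cm["Yes", "Yes"]  # predicted Yes, actually Yes

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(pred_head)
print(round(c(Accuracy = accuracy, Kappa = cohen_kappa,
              Sensitivity = sensitivity, Specificity = specificity,
              PPV = ppv, NPV = npv, Balanced = balanced), 4))
```

The rounded values can be compared line by line against the statistics block printed above.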

Support vector machine(SVM) Model | Evaluation - Scenario II (rda)

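Using the same split and resampling approach (10-fold cross-validation repeated 3 times), this scenario fits a regularized discriminant analysis (rda) model and then compares its resampling results with those of the pls model.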

## Regularized Discriminant Analysis 
## 
## 247 samples
##  14 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 222, 222, 223, 222, 222, 223, ... 
## Resampling results across tuning parameters:
## 
##   gamma  ROC        Sens       Spec     
##   0.00   0.8943279  0.8721612  0.7666667
##   0.25   0.7363109  0.7919414  0.5782828
##   0.50   0.7243673  0.7846154  0.5601010
##   0.75   0.7015290  0.7641026  0.5313131
##   1.00   0.6259962  0.6560440  0.5171717
## 
## Tuning parameter 'lambda' was held constant at a value of 0.75
## ROC was used to select the optimal model using  the largest value.
## The final values used for the model were gamma = 0 and lambda = 0.75.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  20   1
##        Yes  7  22
##                                           
##                Accuracy : 0.84            
##                  95% CI : (0.7089, 0.9283)
##     No Information Rate : 0.54            
##     P-Value [Acc > NIR] : 7.854e-06       
##                                           
##                   Kappa : 0.684           
##  Mcnemar's Test P-Value : 0.0771          
##                                           
##             Sensitivity : 0.7407          
##             Specificity : 0.9565          
##          Pos Pred Value : 0.9524          
##          Neg Pred Value : 0.7586          
##              Prevalence : 0.5400          
##          Detection Rate : 0.4000          
##    Detection Prevalence : 0.4200          
##       Balanced Accuracy : 0.8486          
##                                           
##        'Positive' Class : No              
## 
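
The test statistics in the rda confusion matrix above can be recomputed from the four cell counts. The sketch below uses base R only. It assumes that the accuracy confidence interval and the "Acc > NIR" p-value come from an exact binomial test, and that McNemar's test uses the default continuity correction. Variable names are illustrative.

```r
# Confusion matrix counts copied from the rda output
# (rows = prediction, columns = reference).
cm_rda <- matrix(c(20, 7, 1, 22), nrow = 2,
                 dimnames = list(Prediction = c("No", "Yes"),
                                 Reference  = c("No", "Yes")))

n       <- sum(cm_rda)          # test-set size
correct <- sum(diag(cm_rda))    # correctly classified cases

# No Information Rate: share of the most frequent reference class.
nir <- max(colSums(cm_rda)) / n

# Assumption: exact (Clopper-Pearson) interval for the accuracy.
acc_ci <- binom.test(correct, n)$conf.int

# Assumption: one-sided exact test of H0: accuracy <= NIR.
acc_test <- binom.test(correct, n, p = nir, alternative = "greater")

# McNemar's test compares the two off-diagonal counts (1 vs 7),
# i.e. whether the two kinds of error are equally frequent.
mc <- mcnemar.test(cm_rda)

cat("Accuracy:", correct / n, "\n")
cat("NIR:", nir, "\n")
cat("95% CI:", round(acc_ci, 4), "\n")
cat("P-Value [Acc > NIR]:", signif(acc_test$p.value, 4), "\n")
cat("McNemar p-value:", round(mc$p.value, 4), "\n")
```

A McNemar p-value above 0.05 means the imbalance between the 7 false 'Yes' and the 1 false 'No' predictions is not significant at that level on a test set of 50.
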
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: pls, rda 
## Number of resamples: 30 
## 
## ROC 
##          Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## pls 0.7412587 0.8618881 0.9228480 0.9069070 0.9423077    1    0
## rda 0.7552448 0.8496503 0.9128788 0.8943279 0.9464286    1    0
## 
## Sens 
##          Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## pls 0.5384615 0.7857143 0.8571429 0.8521978 0.9230769    1    0
## rda 0.6923077 0.8461538 0.8571429 0.8721612 0.9230769    1    0
## 
## Spec 
##          Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## pls 0.6363636 0.7272727 0.8181818 0.8015152 0.8333333    1    0
## rda 0.5000000 0.6439394 0.8181818 0.7666667 0.8901515    1    0
## 
## Call:
## summary.diff.resamples(object = resampling_eval)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## ROC 
##     pls     rda    
## pls         0.01258
## rda 0.07428        
## 
## Sens 
##     pls    rda     
## pls        -0.01996
## rda 0.1817         
## 
## Spec 
##     pls     rda    
## pls         0.03485
## rda 0.03153
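
The difference table above can be read as follows. The sketch below uses base R only and copies the resampled means and the reported p-values from the output. It assumes the upper-diagonal estimates are the pls mean minus the rda mean across the 30 resamples. With only two models there is a single pairwise comparison per metric, so a Bonferroni adjustment over that one comparison would leave the p-value unchanged. Variable names are illustrative.

```r
# Mean resampled metrics, copied from the summary.resamples output.
means <- rbind(
  pls = c(ROC = 0.9069070, Sens = 0.8521978, Spec = 0.8015152),
  rda = c(ROC = 0.8943279, Sens = 0.8721612, Spec = 0.7666667)
)

# Assumption: the reported estimate is the pls mean minus the rda mean.
diff_est <- means["pls", ] - means["rda", ]

# p-values copied from the lower diagonal of the difference table.
p_vals <- c(ROC = 0.07428, Sens = 0.1817, Spec = 0.03153)

# Bonferroni over a single comparison multiplies by 1.
p_adj <- sapply(p_vals, p.adjust, method = "bonferroni")

significant <- p_adj < 0.05

print(round(diff_est, 5))
print(significant)
```

On these numbers only the specificity difference is significant at the 0.05 level. The ROC and sensitivity differences between pls and rda are not.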

Support vector machine(SVM) Model | Visualization

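This section needs an update. The PCA summary below is computed on the training data, with ChestPain, Thal, and AHD treated as supplementary qualitative variables.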

## 
## Call:
## PCA(X = trainData[, -1], quali.sup = c(3, 13, 14)) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               2.602   1.503   1.142   1.037   0.929   0.824
## % of var.             23.653  13.665  10.378   9.430   8.448   7.495
## Cumulative % of var.  23.653  37.318  47.696  57.126  65.573  73.068
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.790   0.755   0.620   0.433   0.365
## % of var.              7.179   6.866   5.638   3.934   3.315
## Cumulative % of var.  80.248  87.114  92.752  96.685 100.000
## 
## Individuals (the 10 first)
##                  Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 3            |  3.457 |  2.783  1.205  0.648 | -0.917  0.226  0.070 |
## 5            |  2.987 | -1.687  0.443  0.319 |  0.070  0.001  0.001 |
## 7            |  4.227 |  2.619  1.067  0.384 |  0.923  0.229  0.048 |
## 8            |  3.409 | -0.795  0.098  0.054 |  0.854  0.196  0.063 |
## 10           |  4.544 |  2.429  0.918  0.286 | -0.732  0.144  0.026 |
## 11           |  2.192 | -0.385  0.023  0.031 | -0.740  0.148  0.114 |
## 12           |  2.383 |  0.193  0.006  0.007 |  1.238  0.413  0.270 |
## 13           |  3.345 |  1.419  0.313  0.180 |  0.340  0.031  0.010 |
## 14           |  2.699 | -2.376  0.879  0.775 | -0.552  0.082  0.042 |
## 15           |  4.090 | -0.819  0.104  0.040 |  1.197  0.386  0.086 |
##               Dim.3    ctr   cos2  
## 3            -0.001  0.000  0.000 |
## 5            -0.811  0.233  0.074 |
## 7            -1.626  0.937  0.148 |
## 8            -1.561  0.864  0.210 |
## 10            1.149  0.468  0.064 |
## 11            0.539  0.103  0.061 |
## 12           -1.696  1.020  0.507 |
## 13            1.622  0.933  0.235 |
## 14            0.143  0.007  0.003 |
## 15            3.094  3.395  0.572 |
## 
## Variables (the 10 first)
##                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## Age          |  0.558 11.956  0.311 |  0.458 13.952  0.210 |  0.075  0.488
## Sex          |  0.113  0.495  0.013 | -0.513 17.530  0.263 |  0.517 23.418
## RestBP       |  0.356  4.882  0.127 |  0.468 14.553  0.219 |  0.227  4.501
## Chol         |  0.131  0.662  0.017 |  0.552 20.307  0.305 | -0.433 16.441
## Fbs          |  0.156  0.935  0.024 |  0.409 11.120  0.167 |  0.662 38.418
## RestECG      |  0.281  3.028  0.079 |  0.228  3.463  0.052 | -0.177  2.740
## MaxHR        | -0.682 17.896  0.466 |  0.216  3.107  0.047 |  0.008  0.006
## ExAng        |  0.505  9.817  0.255 | -0.349  8.096  0.122 | -0.011  0.010
## Oldpeak      |  0.731 20.566  0.535 | -0.220  3.233  0.049 | -0.141  1.735
## Slope        |  0.698 18.745  0.488 | -0.222  3.293  0.050 | -0.281  6.906
##                cos2  
## Age           0.006 |
## Sex           0.267 |
## RestBP        0.051 |
## Chol          0.188 |
## Fbs           0.439 |
## RestECG       0.031 |
## MaxHR         0.000 |
## ExAng         0.000 |
## Oldpeak       0.020 |
## Slope         0.079 |
## 
## Supplementary categories
##                  Dist    Dim.1   cos2 v.test    Dim.2   cos2 v.test  
## asymptomatic |  0.744 |  0.644  0.750  5.993 | -0.260  0.122 -3.175 |
## nonanginal   |  0.611 | -0.403  0.435 -2.343 |  0.328  0.288  2.508 |
## nontypical   |  1.222 | -1.198  0.961 -5.422 |  0.134  0.012  0.795 |
## typical      |  0.931 |  0.143  0.024  0.414 |  0.172  0.034  0.651 |
## fixed        |  1.705 |  1.325  0.604  2.646 | -0.437  0.066 -1.150 |
## normal       |  0.707 | -0.608  0.740 -6.874 |  0.231  0.107  3.443 |
## reversable   |  0.903 |  0.769  0.725  5.913 | -0.300  0.110 -3.032 |
## No           |  0.950 | -0.894  0.886 -9.392 |  0.214  0.051  2.959 |
## Yes          |  1.108 |  1.043  0.886  9.392 | -0.250  0.051 -2.959 |
##               Dim.3   cos2 v.test  
## asymptomatic -0.068  0.008 -0.961 |
## nonanginal    0.014  0.001  0.121 |
## nontypical   -0.015  0.000 -0.106 |
## typical       0.393  0.178  1.712 |
## fixed         0.620  0.132  1.870 |
## normal       -0.179  0.064 -3.049 |
## reversable    0.202  0.050  2.340 |
## No           -0.098  0.011 -1.552 |
## Yes           0.114  0.011  1.552 |
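
The columns of the PCA summary above are related by simple arithmetic. The sketch below uses base R only and copies the eigenvalues and the Age coordinate on Dim.1 from the output. It assumes the PCA was run on standardized variables, so the eigenvalues sum to the number of active variables (11), a variable's cos2 is its squared coordinate, and its contribution (ctr) is cos2 divided by the eigenvalue, in percent. Small mismatches with the printed values come from rounding in the output. Variable names are illustrative.

```r
# Eigenvalues (variance per dimension), copied from the output.
eig <- c(2.602, 1.503, 1.142, 1.037, 0.929, 0.824,
         0.790, 0.755, 0.620, 0.433, 0.365)

# Percentage of variance and cumulative percentage per dimension.
pct <- 100 * eig / sum(eig)
cum <- cumsum(pct)

# Number of dimensions with eigenvalue above 1 (a common rule of thumb
# for how many components to keep, not something the report states).
n_above_one <- sum(eig > 1)

# Age on Dim.1: coordinate copied from the "Variables" table.
coord_age <- 0.558
cos2_age  <- coord_age^2              # quality of representation
ctr_age   <- 100 * cos2_age / eig[1]  # contribution to Dim.1, in percent

print(round(data.frame(eigenvalue = eig, pct = pct, cumulative = cum), 3))
cat("Dimensions with eigenvalue > 1:", n_above_one, "\n")
cat("Age on Dim.1: cos2 =", round(cos2_age, 3),
    " ctr =", round(ctr_age, 2), "\n")
```

Read this way, the first two dimensions carry about 37% of the variance and the first four about 57%, so any two-dimensional plot of this PCA shows well under half of the variation in the data.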