The purpose of this summary is to provide insight into the key factors contributing to heart disease. The approach used to identify the most significant model for prediction relies on the following assumptions:
- The Heart Disease data set is used for the analysis (http://archive.ics.uci.edu/ml/datasets/heart+disease).
- Records with missing data are excluded.
- Outliers for each factor are evaluated and excluded only if the values are due to data entry error.
- Given the criticality of this exercise, outliers are not excluded without a reasonable cause.
- Both GLM and SVM are used to evaluate models. Variables are not limited to quantitative ones; qualitative variables such as ChestPain and Thal are included as well. They are left out only of the box plots and the correlation matrix, because they are categorical.
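The two data-cleaning assumptions above (dropping incomplete records and reviewing outliers before excluding them) can be sketched as follows. This is a minimal illustration on an invented data frame; the `toy` values and the helper name `flag_outliers` are hypothetical and not part of the analysis. It uses `na.omit()` to drop incomplete rows and `boxplot.stats()` to flag values beyond the 1.5 × IQR whiskers that `boxplot()` draws.

```r
# Toy data standing in for the heart data (values are invented for illustration).
toy <- data.frame(
  RestBP = c(120, 130, 125, 140, 135, 128, 200, NA),
  Chol   = c(210, 240, 230, 250, NA, 220, 235, 245)
)

# Step 1: exclude records with any missing value.
toy_complete <- na.omit(toy)

# Step 2: flag (do not drop) values beyond the boxplot whiskers, so that each
# one can be reviewed for data-entry error before any exclusion.
flag_outliers <- function(x) {
  x %in% boxplot.stats(x)$out
}

toy_complete$RestBP_outlier <- flag_outliers(toy_complete$RestBP)
print(toy_complete)
```

Flagging instead of deleting keeps the decision to exclude a value separate from its detection, which matches the stated rule that outliers are removed only with a reasonable cause.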
Both Logistic Regression & SVM Models will be evaluated to identify the optimal option
- Logistic Regression Model (lgm)
* Different models are evaluated with varying sets of variables.
* Interaction effects are incorporated.
* Only the "binomial" family is evaluated. After identifying the best-fit model based on key statistics and the confusion matrix, the link functions "probit" and "cloglog" are compared to the default "logit" in case they offer more significance.
* Define a cut-off value to assign "0/1" to the resulting probabilities based on the model prediction range: (Max-Min)/2.
* Construct the confusion matrix using the cut-off value. Use the result to further assess models on key factors:
> Accuracy
> Sensitivity
> Significance
* Use ROC curves to compare the most significant models side by side.
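The cut-off and confusion-matrix steps listed above can be sketched in base R. This is an illustration under assumptions, not the code used in the report: the function name `evaluate_cutoff` and the example vectors are hypothetical. The cut-off is passed as an argument so that a different rule (for example 0.5, or the midpoint of the range) can be substituted for the (Max-Min)/2 rule described in the text.

```r
# Classify predicted probabilities with a cut-off and summarise the result.
# `probs` are predicted probabilities of the positive class, `actual` is a
# 0/1 vector of observed outcomes (both hypothetical inputs for this sketch).
evaluate_cutoff <- function(probs, actual, cutoff) {
  predicted <- ifelse(probs > cutoff, 1, 0)
  # Fix the levels so the table is always 2 x 2, even if a class is never predicted.
  cm <- table(
    Predicted = factor(predicted, levels = c(0, 1)),
    Actual    = factor(actual,    levels = c(0, 1))
  )
  accuracy    <- sum(diag(cm)) / sum(cm)
  sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # true positives / all actual positives
  list(confusion = cm, accuracy = accuracy, sensitivity = sensitivity)
}

# Invented example values.
probs  <- c(0.05, 0.20, 0.35, 0.60, 0.80, 0.95, 0.40, 0.70)
actual <- c(0,    0,    0,    1,    1,    1,    1,    0)

# Cut-off derived from the prediction range, as described in the text.
cutoff <- (max(probs) - min(probs)) / 2
result <- evaluate_cutoff(probs, actual, cutoff)
print(cutoff)
print(result$confusion)
print(result$accuracy)
print(result$sensitivity)
```

With a fitted `glm`, `probs` would typically come from `predict(model, newdata, type = "response")`.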
Model 4 is selected based on the summaries below, as it is the most significant based on p-values.
glm.fits_final <- glm(AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial(link = logit), data = train_data)
Comparing the final model under different link functions, "cloglog" gives the most significant model.
## Warning: package 'ISLR' was built under R version 3.4.2
## [1] FALSE
##
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial(link = logit),
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6017 -0.6120 -0.2161 0.6290 2.3013
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.733568 1.437308 -3.293 0.000990 ***
## Sex 1.795004 0.430600 4.169 3.06e-05 ***
## ChestPainnonanginal -2.297364 0.425138 -5.404 6.52e-08 ***
## ChestPainnontypical -2.729353 0.609057 -4.481 7.42e-06 ***
## ChestPaintypical -2.518962 0.653739 -3.853 0.000117 ***
## Ca 1.083789 0.215022 5.040 4.65e-07 ***
## RestBP 0.028122 0.009987 2.816 0.004865 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 343.86 on 249 degrees of freedom
## Residual deviance: 207.14 on 243 degrees of freedom
## AIC: 221.14
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial(link = probit),
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6976 -0.6269 -0.1715 0.6396 2.3059
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.747753 0.816383 -3.366 0.000763 ***
## Sex 1.064683 0.241008 4.418 9.98e-06 ***
## ChestPainnonanginal -1.362536 0.238931 -5.703 1.18e-08 ***
## ChestPainnontypical -1.589711 0.326448 -4.870 1.12e-06 ***
## ChestPaintypical -1.490236 0.373971 -3.985 6.75e-05 ***
## Ca 0.618915 0.117659 5.260 1.44e-07 ***
## RestBP 0.016339 0.005707 2.863 0.004196 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 343.86 on 249 degrees of freedom
## Residual deviance: 206.54 on 243 degrees of freedom
## AIC: 220.54
##
## Number of Fisher Scoring iterations: 6
##
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial(link = cloglog),
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9477 -0.6323 -0.2991 0.6560 2.2091
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.105311 0.974081 -4.215 2.50e-05 ***
## Sex 1.212279 0.296741 4.085 4.40e-05 ***
## ChestPainnonanginal -1.627868 0.295904 -5.501 3.77e-08 ***
## ChestPainnontypical -1.986974 0.470081 -4.227 2.37e-05 ***
## ChestPaintypical -1.788290 0.492948 -3.628 0.000286 ***
## Ca 0.640640 0.119459 5.363 8.19e-08 ***
## RestBP 0.022593 0.006551 3.449 0.000563 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 343.86 on 249 degrees of freedom
## Residual deviance: 209.04 on 243 degrees of freedom
## AIC: 223.04
##
## Number of Fisher Scoring iterations: 7
Conclusion: Comparing the summary results, the lgm model with the "cloglog" link is the most significant (every coefficient, including RestBP, has p < 0.001) and is selected as the best fit.
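A comparison of link functions like the one reported above can be produced in a loop that refits one formula under each link. The sketch below is a minimal illustration on simulated data; `sim`, `fits`, and `link_summary` are hypothetical names, and in the report the data would be `train_data`. It tabulates residual deviance and AIC, which can be read alongside the coefficient p-values.

```r
# Simulated binary data; in the report this role is played by train_data.
set.seed(1)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.5))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1 + 0.8 * sim$x2))

links <- c("logit", "probit", "cloglog")

# Refit the same formula under each link and collect comparable statistics.
fits <- lapply(links, function(lk) {
  glm(y ~ x1 + x2, family = binomial(link = lk), data = sim)
})
names(fits) <- links

link_summary <- data.frame(
  link     = links,
  deviance = sapply(fits, deviance),
  AIC      = sapply(fits, AIC),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(link_summary)
```

Because all three fits use the same data and the same number of coefficients, their deviance and AIC values are directly comparable.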
Both Logistic Regression & SVM Models will be evaluated to identify the optimal option
- Support Vector Machine (SVM)
* The preliminary model included key variables based on the GLM model results:
> Sex
> ChestPain
> Ca
> RestBP
* The confusion matrix was evaluated and the results weren't promising; therefore, a brute-force approach is used by including all variables in the SVM model.
* A more robust process to split the data is used by applying a data partitioning function.
* The confusion matrix is evaluated, as well as different cost levels.
* The following methods are compared for the model:
> svmLinear
> pls
> rda
* The more significant models are compared against each other. Best fit model is selected based on resampling results.
After training the model, the confusion matrix is evaluated, suggesting satisfactory results compared to the lgm model. The only caution is that all variables are included, which might cause overfitting and render the model unstable.
Different cost levels are evaluated as well to ensure the optimal value is incorporated in the model. The final value is set at 0.1.
Both the "pls" and "rda" methods are evaluated as well and compared against "svmLinear" for any additional insight before making the final selection.
Conclusion:
- The model set with the svmLinear method and cost = 0.01 is the one selected.
- No value is added by applying "pls" or "rda".
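The method names "svmLinear", "pls", and "rda" match model identifiers of the caret package, so the workflow above can be sketched with caret under that assumption. The report does not show its SVM code, so everything below (simulated data, the 70/30 split, the resampling scheme, and the cost grid) is hypothetical. The sketch needs the caret and kernlab packages and is skipped if either is missing.

```r
# Requires the caret and kernlab packages; the sketch is skipped if either is missing.
svm_fit <- NULL
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("kernlab", quietly = TRUE)) {
  library(caret)
  set.seed(42)

  # Simulated two-class data standing in for the heart data.
  n <- 200
  sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  sim$AHD <- factor(ifelse(sim$x1 + sim$x2 + rnorm(n, sd = 0.5) > 0, "Yes", "No"))

  # Stratified 70/30 split on the outcome.
  idx      <- createDataPartition(sim$AHD, p = 0.7, list = FALSE)
  training <- sim[idx, ]
  testing  <- sim[-idx, ]

  # Repeated cross-validation over a small grid of cost values.
  ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
  svm_fit <- train(
    AHD ~ ., data = training,
    method     = "svmLinear",
    trControl  = ctrl,
    preProcess = c("center", "scale"),
    tuneGrid   = expand.grid(C = c(0.01, 0.1, 1))
  )
  print(svm_fit$results[, c("C", "Accuracy")])

  # Confusion matrix on the held-out rows.
  pred <- predict(svm_fit, newdata = testing)
  print(table(Predicted = pred, Actual = testing$AHD))
}
```

Other methods such as "pls" or "rda" could be fitted the same way by changing `method` and supplying a tuning grid suited to that method; they need additional packages.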
Confusion Matrix : pls vs. rda
Method : pls
Method : rda
ROC : pls vs. rda
Resampling Comparison : pls vs. rda
setwd("~/Introduction to Data Science/R Projects/data")
heart_data <- read.table(file = "heart.txt", sep = ",", header = TRUE)
View(heart_data)
dim(heart_data)
## [1] 303 15
str(heart_data)
## 'data.frame': 303 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 63 67 67 37 41 56 62 57 63 53 ...
## $ Sex : int 1 1 1 1 0 1 0 0 1 1 ...
## $ ChestPain: Factor w/ 4 levels "asymptomatic",..: 4 1 1 2 3 3 1 1 1 1 ...
## $ RestBP : int 145 160 120 130 130 120 140 120 130 140 ...
## $ Chol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ Fbs : int 1 0 0 0 0 0 0 0 0 1 ...
## $ RestECG : int 2 2 2 0 2 0 2 0 2 2 ...
## $ MaxHR : int 150 108 129 187 172 178 160 163 147 155 ...
## $ ExAng : int 0 1 1 0 0 0 0 1 0 1 ...
## $ Oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ Slope : int 3 2 2 3 1 1 3 1 2 3 ...
## $ Ca : int 0 3 2 0 0 0 2 0 1 0 ...
## $ Thal : Factor w/ 3 levels "fixed","normal",..: 1 2 3 2 2 2 2 2 3 3 ...
## $ AHD : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 2 1 2 2 ...
#Run summary to better understand the variables and identify NAs#
summary(heart_data)
## X Age Sex ChestPain
## Min. : 1.0 Min. :29.00 Min. :0.0000 asymptomatic:144
## 1st Qu.: 76.5 1st Qu.:48.00 1st Qu.:0.0000 nonanginal : 86
## Median :152.0 Median :56.00 Median :1.0000 nontypical : 50
## Mean :152.0 Mean :54.44 Mean :0.6799 typical : 23
## 3rd Qu.:227.5 3rd Qu.:61.00 3rd Qu.:1.0000
## Max. :303.0 Max. :77.00 Max. :1.0000
##
## RestBP Chol Fbs RestECG
## Min. : 94.0 Min. :126.0 Min. :0.0000 Min. :0.0000
## 1st Qu.:120.0 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000
## Median :130.0 Median :241.0 Median :0.0000 Median :1.0000
## Mean :131.7 Mean :246.7 Mean :0.1485 Mean :0.9901
## 3rd Qu.:140.0 3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:2.0000
## Max. :200.0 Max. :564.0 Max. :1.0000 Max. :2.0000
##
## MaxHR ExAng Oldpeak Slope
## Min. : 71.0 Min. :0.0000 Min. :0.00 Min. :1.000
## 1st Qu.:133.5 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000
## Median :153.0 Median :0.0000 Median :0.80 Median :2.000
## Mean :149.6 Mean :0.3267 Mean :1.04 Mean :1.601
## 3rd Qu.:166.0 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000
## Max. :202.0 Max. :1.0000 Max. :6.20 Max. :3.000
##
## Ca Thal AHD
## Min. :0.0000 fixed : 18 No :164
## 1st Qu.:0.0000 normal :166 Yes:139
## Median :0.0000 reversable:117
## Mean :0.6722 NA's : 2
## 3rd Qu.:1.0000
## Max. :3.0000
## NA's :4
#exclude NAs from "Thal" & "Ca"
heart_data_final <- na.omit(heart_data)
summary(heart_data_final)
## X Age Sex ChestPain
## Min. : 1.0 Min. :29.00 Min. :0.0000 asymptomatic:142
## 1st Qu.: 75.0 1st Qu.:48.00 1st Qu.:0.0000 nonanginal : 83
## Median :150.0 Median :56.00 Median :1.0000 nontypical : 49
## Mean :150.7 Mean :54.54 Mean :0.6768 typical : 23
## 3rd Qu.:226.0 3rd Qu.:61.00 3rd Qu.:1.0000
## Max. :302.0 Max. :77.00 Max. :1.0000
## RestBP Chol Fbs RestECG
## Min. : 94.0 Min. :126.0 Min. :0.0000 Min. :0.0000
## 1st Qu.:120.0 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000
## Median :130.0 Median :243.0 Median :0.0000 Median :1.0000
## Mean :131.7 Mean :247.4 Mean :0.1448 Mean :0.9966
## 3rd Qu.:140.0 3rd Qu.:276.0 3rd Qu.:0.0000 3rd Qu.:2.0000
## Max. :200.0 Max. :564.0 Max. :1.0000 Max. :2.0000
## MaxHR ExAng Oldpeak Slope
## Min. : 71.0 Min. :0.0000 Min. :0.000 Min. :1.000
## 1st Qu.:133.0 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000
## Median :153.0 Median :0.0000 Median :0.800 Median :2.000
## Mean :149.6 Mean :0.3266 Mean :1.056 Mean :1.603
## 3rd Qu.:166.0 3rd Qu.:1.0000 3rd Qu.:1.600 3rd Qu.:2.000
## Max. :202.0 Max. :1.0000 Max. :6.200 Max. :3.000
## Ca Thal AHD
## Min. :0.0000 fixed : 18 No :160
## 1st Qu.:0.0000 normal :164 Yes:137
## Median :0.0000 reversable:115
## Mean :0.6768
## 3rd Qu.:1.0000
## Max. :3.0000
library(ggplot2)
par(mfrow=c(3,3))
boxplot_Age <- boxplot(heart_data$Age, xlab = "Age", outcol="red")
boxplot_RestBP <- boxplot(heart_data$RestBP, xlab = "RestBP", outcol="red")
boxplot_Chol <- boxplot(heart_data$Chol, xlab = "Chol", outcol="red")
boxplot_Fbs <- boxplot(heart_data$Fbs, xlab = "Fbs", outcol="red")
boxplot_RestECG <- boxplot(heart_data$RestECG, xlab = "RestECG", outcol="red")
boxplot_ExAng<- boxplot(heart_data$ExAng, xlab = "ExAng", outcol="red")
boxplot_MaxHR <- boxplot(heart_data$MaxHR, xlab = "MaxHR", outcol="red")
boxplot_Ca <- boxplot(heart_data$Ca, xlab = "Ca", outcol="red")
boxplot_Oldpeak <- boxplot(heart_data$Oldpeak, xlab="Oldpeak", outcol="red")
#boxplot_Thal <- boxplot(heart_data$Thal, xlab= "Thal")#excluded as it is a categorical variable
boxplot_Age$out
## numeric(0)
boxplot_RestBP$out
## [1] 172 180 200 174 178 192 180 178 180
boxplot_Chol$out
## [1] 417 407 564 394 409
boxplot_Fbs$out
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1
boxplot_ExAng$out
## numeric(0)
boxplot_Ca$out
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
boxplot_Oldpeak$out
## [1] 6.2 5.6 4.2 4.2 4.4
boxplot_MaxHR$out
## [1] 71
library(ISLR)
str(heart_data_final)
## 'data.frame': 297 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 63 67 67 37 41 56 62 57 63 53 ...
## $ Sex : int 1 1 1 1 0 1 0 0 1 1 ...
## $ ChestPain: Factor w/ 4 levels "asymptomatic",..: 4 1 1 2 3 3 1 1 1 1 ...
## $ RestBP : int 145 160 120 130 130 120 140 120 130 140 ...
## $ Chol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ Fbs : int 1 0 0 0 0 0 0 0 0 1 ...
## $ RestECG : int 2 2 2 0 2 0 2 0 2 2 ...
## $ MaxHR : int 150 108 129 187 172 178 160 163 147 155 ...
## $ ExAng : int 0 1 1 0 0 0 0 1 0 1 ...
## $ Oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ Slope : int 3 2 2 3 1 1 3 1 2 3 ...
## $ Ca : int 0 3 2 0 0 0 2 0 1 0 ...
## $ Thal : Factor w/ 3 levels "fixed","normal",..: 1 2 3 2 2 2 2 2 3 3 ...
## $ AHD : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 2 1 2 2 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:6] 88 167 193 267 288 303
## .. ..- attr(*, "names")= chr [1:6] "88" "167" "193" "267" ...
dim(heart_data_final)
## [1] 297 15
pairs(heart_data_final)
# Code updated to exclude the categorical variables (ChestPain, Thal, AHD) from cor(heart_data_final)
cor(heart_data_final[, c(-4, -14, -15)])
## X Age Sex RestBP Chol
## X 1.000000000 0.009262273 -0.08814079 -0.02225682 -8.396768e-02
## Age 0.009262273 1.000000000 -0.09239948 0.29047626 2.026435e-01
## Sex -0.088140794 -0.092399479 1.00000000 -0.06634020 -1.980891e-01
## RestBP -0.022256823 0.290476262 -0.06634020 1.00000000 1.315357e-01
## Chol -0.083967682 0.202643546 -0.19808906 0.13153571 1.000000e+00
## Fbs -0.051693004 0.132061989 0.03885030 0.18085954 1.270828e-02
## RestECG -0.136735719 0.149916512 0.03389683 0.14924228 1.650460e-01
## MaxHR -0.117354904 -0.394562881 -0.06049601 -0.04910766 -7.456799e-05
## ExAng -0.002661773 0.096488805 0.14358125 0.06669107 5.933893e-02
## Oldpeak -0.114655929 0.197122616 0.10656724 0.19124314 3.859579e-02
## Slope -0.032451869 0.159404737 0.03334496 0.12117205 -9.215240e-03
## Ca 0.048687403 0.362210343 0.09192480 0.09795376 1.159446e-01
## Fbs RestECG MaxHR ExAng Oldpeak
## X -0.0516930041 -0.13673572 -1.173549e-01 -0.0026617728 -0.114655929
## Age 0.1320619890 0.14991651 -3.945629e-01 0.0964888046 0.197122616
## Sex 0.0388502996 0.03389683 -6.049601e-02 0.1435812504 0.106567243
## RestBP 0.1808595428 0.14924228 -4.910766e-02 0.0666910687 0.191243136
## Chol 0.0127082808 0.16504603 -7.456799e-05 0.0593389323 0.038595794
## Fbs 1.0000000000 0.06883111 -7.842359e-03 -0.0008930821 0.008310667
## RestECG 0.0688311070 1.00000000 -7.228965e-02 0.0818739197 0.113726420
## MaxHR -0.0078423590 -0.07228965 1.000000e+00 -0.3843675321 -0.347639972
## ExAng -0.0008930821 0.08187392 -3.843675e-01 1.0000000000 0.289309666
## Oldpeak 0.0083106671 0.11372642 -3.476400e-01 0.2893096659 1.000000000
## Slope 0.0478190123 0.13514058 -3.893067e-01 0.2505715154 0.579037353
## Ca 0.1520858900 0.12902063 -2.687270e-01 0.1482322256 0.294452277
## Slope Ca
## X -0.03245187 0.04868740
## Age 0.15940474 0.36221034
## Sex 0.03334496 0.09192480
## RestBP 0.12117205 0.09795376
## Chol -0.00921524 0.11594459
## Fbs 0.04781901 0.15208589
## RestECG 0.13514058 0.12902063
## MaxHR -0.38930674 -0.26872698
## ExAng 0.25057152 0.14823223
## Oldpeak 0.57903735 0.29445228
## Slope 1.00000000 0.10976112
## Ca 0.10976112 1.00000000
attach(heart_data_final)
## The following objects are masked from heart_data_final (pos = 3):
##
## Age, AHD, Ca, ChestPain, Chol, ExAng, Fbs, MaxHR, Oldpeak,
## RestBP, RestECG, Sex, Slope, Thal, X
#confirming no missing value in final data set#
anyNA(heart_data_final)
## [1] FALSE
train_data <- heart_data_final[1:250,]
test_data <- heart_data_final[251:297,]
head(train_data)
## X Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope
## 1 1 63 1 typical 145 233 1 2 150 0 2.3 3
## 2 2 67 1 asymptomatic 160 286 0 2 108 1 1.5 2
## 3 3 67 1 asymptomatic 120 229 0 2 129 1 2.6 2
## 4 4 37 1 nonanginal 130 250 0 0 187 0 3.5 3
## 5 5 41 0 nontypical 130 204 0 2 172 0 1.4 1
## 6 6 56 1 nontypical 120 236 0 0 178 0 0.8 1
## Ca Thal AHD
## 1 0 fixed No
## 2 3 normal Yes
## 3 2 reversable Yes
## 4 0 normal No
## 5 0 normal No
## 6 0 normal No
head(test_data)
## X Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak
## 254 254 51 0 nonanginal 120 295 0 2 157 0 0.6
## 255 255 43 1 asymptomatic 115 303 0 0 181 0 1.2
## 256 256 42 0 nonanginal 120 209 0 0 173 0 0.0
## 257 257 67 0 asymptomatic 106 223 0 0 142 0 0.3
## 258 258 76 0 nonanginal 140 197 0 1 116 0 1.1
## 259 259 70 1 nontypical 156 245 0 2 143 0 0.0
## Slope Ca Thal AHD
## 254 1 0 normal No
## 255 2 0 normal No
## 256 2 0 normal No
## 257 1 2 normal No
## 258 2 0 normal No
## 259 1 0 normal No
By evaluating the correlation between variables, we can better identify the key factors that should be included in the model and avoid overfitting or potential collinearity.
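One way to read a correlation matrix for this purpose is to list only the variable pairs whose absolute correlation exceeds a threshold. The sketch below is an illustration on invented data; the helper name `high_cor_pairs` and the 0.5 threshold are assumptions, not values from the analysis. In the correlation matrix reported in this document, the largest off-diagonal value is between Oldpeak and Slope (about 0.58).

```r
# Toy numeric data (invented); in the report the input would be the numeric
# columns of heart_data_final.
set.seed(7)
a <- rnorm(100)
toy <- data.frame(
  a = a,
  b = a + rnorm(100, sd = 0.1),  # built to be strongly correlated with a
  c = rnorm(100)
)

# Return variable pairs whose absolute correlation exceeds `threshold`.
# The 0.5 default is an arbitrary illustration, not a value from the analysis.
high_cor_pairs <- function(df, threshold = 0.5) {
  cm <- cor(df)
  cm[upper.tri(cm, diag = TRUE)] <- NA   # keep each pair once, drop the diagonal
  idx <- which(abs(cm) > threshold, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

print(high_cor_pairs(toy))
```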
The data set is split as required to train and test the models.
- As the analysis progresses and the data are better understood, the preliminary split is subject to change.
- Records with incomplete data are excluded.
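The report's preliminary split takes the first 250 complete rows for training and the rest for testing. Since the split is described as subject to change, one possible alternative is a random split stratified by outcome, sketched below in base R. The function name `stratified_split`, the 80% fraction, and the seed are assumptions for illustration; the class counts (160 "No", 137 "Yes") are those of `heart_data_final`.

```r
# Stratified random split: sample the same fraction of rows within each
# outcome class so both subsets keep a similar class balance.
stratified_split <- function(outcome, p = 0.8, seed = 123) {
  set.seed(seed)
  train_idx <- unlist(lapply(split(seq_along(outcome), outcome), function(rows) {
    rows[sample.int(length(rows), size = floor(p * length(rows)))]
  }))
  sort(unname(train_idx))
}

# Invented outcome vector with the same class counts as heart_data_final.
outcome <- factor(rep(c("No", "Yes"), times = c(160, 137)))
train_rows <- stratified_split(outcome, p = 0.8)
test_rows  <- setdiff(seq_along(outcome), train_rows)

print(table(outcome[train_rows]))
print(table(outcome[test_rows]))
```

The returned row indices could then be used as `heart_data_final[train_rows, ]` and `heart_data_final[test_rows, ]`.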
In this step, we use logistic regression to evaluate different models to ensure the right mix of factors is incorporated. Summary below:
- Family is set to binomial.
- 5 models are created and evaluated based on coefficients, p-values, etc.
* Model 1 - All factors included.
* Model 2 - Sex, ChestPain, Ca, with MaxHR and Slope tested for an interaction effect.
* Model 3 - Sex, ChestPain, Ca, with RestBP and Oldpeak tested for an interaction effect.
* Model 4 - Sex, ChestPain, Ca, and RestBP.
* Model 5 - Sex, ChestPain, Ca, with RestBP and MaxHR tested for an interaction effect.
Conclusion: Comparing the summary results, Model 4 is considered the best fit and will be used for training and testing. The p-value is low (below 0.01) for all of its parameters.
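Candidate models like those listed above can be fitted from a list of formulas and tabulated side by side. The sketch below is a minimal illustration on simulated data (`sim`, `candidates`, and `comparison` are hypothetical names); it reports the number of coefficients, residual deviance, and AIC, and adds a likelihood-ratio test for one nested pair. These statistics are only comparable when every model is fitted to the same rows.

```r
# Simulated data: y depends on x1 and x2 only; x3 is pure noise.
set.seed(2)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.3 + 1.0 * sim$x1 - 0.7 * sim$x2))

# Candidate formulas, loosely mirroring "all factors", "with interaction",
# and "main effects only".
candidates <- list(
  all_terms   = y ~ x1 + x2 + x3,
  interaction = y ~ x1 * x2,
  main_only   = y ~ x1 + x2
)

fits <- lapply(candidates, glm, family = binomial, data = sim)

comparison <- data.frame(
  model    = names(fits),
  n_coef   = sapply(fits, function(f) length(coef(f))),
  deviance = sapply(fits, deviance),
  AIC      = sapply(fits, AIC),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(comparison[order(comparison$AIC), ])

# Likelihood-ratio test: does the interaction improve on main effects only?
print(anova(fits$main_only, fits$interaction, test = "Chisq"))
```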
##
## Call:
## glm(formula = AHD ~ Age + Sex + ChestPain + RestBP + Chol + Fbs +
## RestECG + MaxHR + ExAng + Oldpeak + Slope + Ca, family = binomial,
## data = heart_data_final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5914 -0.5280 -0.1506 0.4217 2.4822
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.405544 2.645651 -1.665 0.095872 .
## Age -0.009658 0.023677 -0.408 0.683333
## Sex 1.951369 0.456954 4.270 1.95e-05 ***
## ChestPainnonanginal -1.815575 0.461783 -3.932 8.44e-05 ***
## ChestPainnontypical -1.175942 0.534164 -2.201 0.027703 *
## ChestPaintypical -2.124587 0.641333 -3.313 0.000924 ***
## RestBP 0.025533 0.010828 2.358 0.018368 *
## Chol 0.006473 0.003982 1.625 0.104058
## Fbs -0.779666 0.564135 -1.382 0.166955
## RestECG 0.178182 0.180902 0.985 0.324641
## MaxHR -0.022278 0.010503 -2.121 0.033919 *
## ExAng 0.926336 0.417580 2.218 0.026531 *
## Oldpeak 0.351297 0.216763 1.621 0.105092
## Slope 0.745698 0.361482 2.063 0.039123 *
## Ca 1.274989 0.260660 4.891 1.00e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 208.35 on 282 degrees of freedom
## AIC: 238.35
##
## Number of Fisher Scoring iterations: 6
## (Intercept) Age Sex
## -4.405544055 -0.009658484 1.951369054
## ChestPainnonanginal ChestPainnontypical ChestPaintypical
## -1.815575013 -1.175942437 -2.124587104
## RestBP Chol Fbs
## 0.025532913 0.006473361 -0.779665936
## RestECG MaxHR ExAng
## 0.178182093 -0.022277789 0.926335958
## Oldpeak Slope Ca
## 0.351297420 0.745697774 1.274989328
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.405544055 2.645651397 -1.6652020 9.587246e-02
## Age -0.009658484 0.023677444 -0.4079192 6.833330e-01
## Sex 1.951369054 0.456954316 4.2703811 1.951393e-05
## ChestPainnonanginal -1.815575013 0.461782788 -3.9316645 8.435973e-05
## ChestPainnontypical -1.175942437 0.534163541 -2.2014652 2.770311e-02
## ChestPaintypical -2.124587104 0.641332971 -3.3127676 9.237770e-04
## RestBP 0.025532913 0.010827710 2.3581083 1.836833e-02
## Chol 0.006473361 0.003982402 1.6254917 1.040578e-01
## Fbs -0.779665936 0.564134970 -1.3820557 1.669546e-01
## RestECG 0.178182093 0.180901845 0.9849656 3.246410e-01
## MaxHR -0.022277789 0.010503257 -2.1210362 3.391875e-02
## ExAng 0.926335958 0.417579569 2.2183460 2.653125e-02
## Oldpeak 0.351297420 0.216762954 1.6206525 1.050922e-01
## Slope 0.745697774 0.361481704 2.0628922 3.912287e-02
## Ca 1.274989328 0.260660366 4.8913816 1.001306e-06
##
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + MaxHR * Age, family = binomial,
## data = heart_data_final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8151 -0.6278 -0.2051 0.5041 2.5984
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 20.7052828 8.3573782 2.477 0.013231 *
## Sex 1.6453767 0.3769823 4.365 1.27e-05 ***
## ChestPainnonanginal -1.9499460 0.4017587 -4.854 1.21e-06 ***
## ChestPainnontypical -1.7484936 0.4875348 -3.586 0.000335 ***
## ChestPaintypical -1.7442403 0.5891636 -2.961 0.003071 **
## Ca 1.0339271 0.2135209 4.842 1.28e-06 ***
## MaxHR -0.1480578 0.0550228 -2.691 0.007127 **
## Age -0.3059567 0.1455806 -2.102 0.035586 *
## MaxHR:Age 0.0021147 0.0009661 2.189 0.028608 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 239.33 on 288 degrees of freedom
## AIC: 257.33
##
## Number of Fisher Scoring iterations: 5
## (Intercept) Sex ChestPainnonanginal
## 20.70528282 1.64537673 -1.94994597
## ChestPainnontypical ChestPaintypical Ca
## -1.74849365 -1.74424035 1.03392712
## MaxHR Age MaxHR:Age
## -0.14805782 -0.30595668 0.00211472
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 20.70528282 8.3573781576 2.477485 1.323118e-02
## Sex 1.64537673 0.3769823133 4.364599 1.273560e-05
## ChestPainnonanginal -1.94994597 0.4017586975 -4.853525 1.212859e-06
## ChestPainnontypical -1.74849365 0.4875347898 -3.586398 3.352775e-04
## ChestPaintypical -1.74424035 0.5891635523 -2.960537 3.071035e-03
## Ca 1.03392712 0.2135208807 4.842276 1.283600e-06
## MaxHR -0.14805782 0.0550228048 -2.690845 7.127137e-03
## Age -0.30595668 0.1455805926 -2.101631 3.558561e-02
## MaxHR:Age 0.00211472 0.0009661371 2.188841 2.860840e-02
##
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + MaxHR * Slope, family = binomial,
## data = heart_data_final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6837 -0.5644 -0.2157 0.4773 2.1706
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.111992 3.922104 0.029 0.97722
## Sex 1.643526 0.385589 4.262 2.02e-05 ***
## ChestPainnonanginal -2.057107 0.418202 -4.919 8.70e-07 ***
## ChestPainnontypical -1.603459 0.495087 -3.239 0.00120 **
## ChestPaintypical -1.991727 0.579625 -3.436 0.00059 ***
## Ca 1.178514 0.222414 5.299 1.17e-07 ***
## MaxHR -0.018707 0.024556 -0.762 0.44618
## Slope 1.448169 2.142814 0.676 0.49915
## MaxHR:Slope -0.002829 0.013708 -0.206 0.83649
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 231.83 on 288 degrees of freedom
## AIC: 249.83
##
## Number of Fisher Scoring iterations: 5
## (Intercept) Sex ChestPainnonanginal
## 0.111992162 1.643525558 -2.057107132
## ChestPainnontypical ChestPaintypical Ca
## -1.603459408 -1.991726731 1.178513591
## MaxHR Slope MaxHR:Slope
## -0.018706827 1.448168629 -0.002829125
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.111992162 3.92210393 0.0285541 9.772202e-01
## Sex 1.643525558 0.38558858 4.2623813 2.022599e-05
## ChestPainnonanginal -2.057107132 0.41820240 -4.9189271 8.701986e-07
## ChestPainnontypical -1.603459408 0.49508739 -3.2387402 1.200589e-03
## ChestPaintypical -1.991726731 0.57962523 -3.4362320 5.898657e-04
## Ca 1.178513591 0.22241380 5.2987431 1.166026e-07
## MaxHR -0.018706827 0.02455605 -0.7618011 4.461787e-01
## Slope 1.448168629 2.14281410 0.6758256 4.991514e-01
## MaxHR:Slope -0.002829125 0.01370840 -0.2063790 8.364949e-01
##
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP * Oldpeak,
## family = binomial, data = heart_data_final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2771 -0.5393 -0.2028 0.4964 2.5592
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.144905 1.736289 -3.539 0.000401 ***
## Sex 1.511065 0.391798 3.857 0.000115 ***
## ChestPainnonanginal -2.362805 0.426465 -5.540 3.02e-08 ***
## ChestPainnontypical -1.765082 0.492021 -3.587 0.000334 ***
## ChestPaintypical -2.599348 0.602760 -4.312 1.61e-05 ***
## Ca 1.082099 0.215806 5.014 5.33e-07 ***
## RestBP 0.034818 0.012476 2.791 0.005259 **
## Oldpeak 2.486976 1.079403 2.304 0.021221 *
## RestBP:Oldpeak -0.012607 0.007615 -1.656 0.097815 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 230.67 on 288 degrees of freedom
## AIC: 248.67
##
## Number of Fisher Scoring iterations: 5
## (Intercept) Sex ChestPainnonanginal
## -6.14490542 1.51106518 -2.36280496
## ChestPainnontypical ChestPaintypical Ca
## -1.76508159 -2.59934764 1.08209890
## RestBP Oldpeak RestBP:Oldpeak
## 0.03481803 2.48697576 -0.01260698
##
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial,
## data = heart_data_final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6880 -0.6163 -0.2476 0.6090 2.1251
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.265659 1.288263 -3.311 0.000929 ***
## Sex 1.629896 0.370191 4.403 1.07e-05 ***
## ChestPainnonanginal -2.279957 0.392021 -5.816 6.03e-09 ***
## ChestPainnontypical -2.260915 0.472161 -4.788 1.68e-06 ***
## ChestPaintypical -2.275540 0.581232 -3.915 9.04e-05 ***
## Ca 1.150925 0.204851 5.618 1.93e-08 ***
## RestBP 0.025631 0.009126 2.809 0.004975 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 253.72 on 290 degrees of freedom
## AIC: 267.72
##
## Number of Fisher Scoring iterations: 5
## (Intercept) Sex ChestPainnonanginal
## -4.26565914 1.62989631 -2.27995678
## ChestPainnontypical ChestPaintypical Ca
## -2.26091509 -2.27554019 1.15092493
## RestBP
## 0.02563142
##
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP * MaxHR, family = binomial,
## data = heart_data_final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6281 -0.5899 -0.2022 0.5153 2.4074
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9326987 9.1271635 -0.212 0.832300
## Sex 1.7357897 0.3818974 4.545 5.49e-06 ***
## ChestPainnonanginal -2.0480544 0.4041083 -5.068 4.02e-07 ***
## ChestPainnontypical -1.7942456 0.4977098 -3.605 0.000312 ***
## ChestPaintypical -2.1223125 0.6068965 -3.497 0.000471 ***
## Ca 1.1056820 0.2127394 5.197 2.02e-07 ***
## RestBP 0.0463943 0.0688582 0.674 0.500460
## MaxHR -0.0192594 0.0595729 -0.323 0.746475
## RestBP:MaxHR -0.0001183 0.0004479 -0.264 0.791673
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 234.86 on 288 degrees of freedom
## AIC: 252.86
##
## Number of Fisher Scoring iterations: 5
## (Intercept) Sex ChestPainnonanginal
## -1.9326987063 1.7357897057 -2.0480544338
## ChestPainnontypical ChestPaintypical Ca
## -1.7942455816 -2.1223125277 1.1056820370
## RestBP MaxHR RestBP:MaxHR
## 0.0463943109 -0.0192593912 -0.0001183069
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9326987063 9.1271635228 -0.2117524 8.323002e-01
## Sex 1.7357897057 0.3818973829 4.5451731 5.489013e-06
## ChestPainnonanginal -2.0480544338 0.4041083476 -5.0680825 4.018433e-07
## ChestPainnontypical -1.7942455816 0.4977097740 -3.6050037 3.121485e-04
## ChestPaintypical -2.1223125277 0.6068964802 -3.4969926 4.705348e-04
## Ca 1.1056820370 0.2127393892 5.1973546 2.021446e-07
## RestBP 0.0463943109 0.0688581655 0.6737663 5.004599e-01
## MaxHR -0.0192593912 0.0595728608 -0.3232914 7.464746e-01
## RestBP:MaxHR -0.0001183069 0.0004478974 -0.2641383 7.916733e-01
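The model is trained using the logit link as well as the probit and cloglog links.
Conclusion: Comparing the summary results, the model with the “cloglog” link gives the smallest p-value for RestBP (0.000563, versus 0.004865 for logit and 0.004196 for probit), although its AIC (223.04) is slightly higher than that of logit (221.14) and probit (220.54). To avoid gaps in the conclusion, both the logit and cloglog models are evaluated further.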
#Comparing coefficients, Model 4 is considered the best fit#
#Use "train_data" to train Model 4#
glm.fits_final = glm(AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial (link = logit), data = train_data)
glm.probs=predict(glm.fits_final, train_data, type = "response")
#Assessing the effect of different link functions on model significance#
#link = probit
glm.fits_final_p = glm(AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial (link = probit), data = train_data)
glm.probs=predict(glm.fits_final_p, train_data, type = "response")
#link = cloglog
glm.fits_final_c = glm(AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial (link = cloglog), data = train_data)
glm.probs=predict(glm.fits_final_c, train_data, type = "response")
#Summaries for the same model under different link functions#
summary(glm.fits_final)
##
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial(link = logit),
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6017 -0.6120 -0.2161 0.6290 2.3013
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.733568 1.437308 -3.293 0.000990 ***
## Sex 1.795004 0.430600 4.169 3.06e-05 ***
## ChestPainnonanginal -2.297364 0.425138 -5.404 6.52e-08 ***
## ChestPainnontypical -2.729353 0.609057 -4.481 7.42e-06 ***
## ChestPaintypical -2.518962 0.653739 -3.853 0.000117 ***
## Ca 1.083789 0.215022 5.040 4.65e-07 ***
## RestBP 0.028122 0.009987 2.816 0.004865 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 343.86 on 249 degrees of freedom
## Residual deviance: 207.14 on 243 degrees of freedom
## AIC: 221.14
##
## Number of Fisher Scoring iterations: 5
summary(glm.fits_final_p)
##
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial(link = probit),
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6976 -0.6269 -0.1715 0.6396 2.3059
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.747753 0.816383 -3.366 0.000763 ***
## Sex 1.064683 0.241008 4.418 9.98e-06 ***
## ChestPainnonanginal -1.362536 0.238931 -5.703 1.18e-08 ***
## ChestPainnontypical -1.589711 0.326448 -4.870 1.12e-06 ***
## ChestPaintypical -1.490236 0.373971 -3.985 6.75e-05 ***
## Ca 0.618915 0.117659 5.260 1.44e-07 ***
## RestBP 0.016339 0.005707 2.863 0.004196 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 343.86 on 249 degrees of freedom
## Residual deviance: 206.54 on 243 degrees of freedom
## AIC: 220.54
##
## Number of Fisher Scoring iterations: 6
summary(glm.fits_final_c)
##
## Call:
## glm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, family = binomial(link = cloglog),
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9477 -0.6323 -0.2991 0.6560 2.2091
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.105311 0.974081 -4.215 2.50e-05 ***
## Sex 1.212279 0.296741 4.085 4.40e-05 ***
## ChestPainnonanginal -1.627868 0.295904 -5.501 3.77e-08 ***
## ChestPainnontypical -1.986974 0.470081 -4.227 2.37e-05 ***
## ChestPaintypical -1.788290 0.492948 -3.628 0.000286 ***
## Ca 0.640640 0.119459 5.363 8.19e-08 ***
## RestBP 0.022593 0.006551 3.449 0.000563 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 343.86 on 249 degrees of freedom
## Residual deviance: 209.04 on 243 degrees of freedom
## AIC: 223.04
##
## Number of Fisher Scoring iterations: 7
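As a rough illustration of the link comparison above, the sketch below fits one formula under the three links and tabulates AIC and residual deviance side by side. The data frame `sim` and its columns `x1`, `x2` and `y` are simulated stand-ins, not the heart data. In the report's own summaries the AICs are 221.14 (logit), 220.54 (probit) and 223.04 (cloglog), so the three fits are close.

```r
# Simulated stand-in for train_data (hypothetical; not the heart data).
set.seed(1)
n <- 250
sim <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.5))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1 + 0.8 * sim$x2))

# Fit the same formula under each link function.
links <- c("logit", "probit", "cloglog")
fits <- lapply(links, function(l) {
  glm(y ~ x1 + x2, family = binomial(link = l), data = sim)
})
names(fits) <- links

# Collect AIC and residual deviance; lower values indicate a better fit.
link_comparison <- data.frame(
  link = links,
  AIC = sapply(fits, AIC),
  deviance = sapply(fits, deviance),
  row.names = NULL
)
print(link_comparison[order(link_comparison$AIC), ])
```

Because all three models use the same predictors, the AIC ranking here is equivalent to ranking by residual deviance.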
#Based on the summaries, the model with link "cloglog" has the most significant RestBP term#
#For validation purposes, both models (glm.fits_final and glm.fits_final_c) will be evaluated#
predit_glm_final <- predict(glm.fits_final, newdata = test_data, type = "response")
predit_glm_final_c <- predict(glm.fits_final_c, newdata = test_data, type = "response")
#Prediction for the probit model is added for the ROC graph only#
predit_glm_final_p <- predict(glm.fits_final_p, newdata = test_data, type = "response")
#Choosing a cutoff value to assign "0/1" to the resulting probabilities#
range_glm_final <- range(predit_glm_final)
range_glm_final_c <- range(predit_glm_final_c)
range_glm_final
## [1] 0.02296012 0.97898095
range_glm_final_c
## [1] 0.04361867 0.99970032
#Using the range results, the initial cutoff value for both models is 0.48#
(0.97898095 - 0.02296012)/2
## [1] 0.4780104
(0.99970032-0.04361867)/2
## [1] 0.4780408
cutoff_glm_final <- ifelse (predit_glm_final > .48, 1, 0)
tbl_glm_final<- table(test_data$AHD,cutoff_glm_final)
cutoff_glm_final_c <- ifelse (predit_glm_final_c > .48, 1, 0)
tbl_glm_final_c <- table(test_data$AHD,cutoff_glm_final_c)
#Classification accuracy=(TP+TN)/(TP+FP+TN+FN)#
#glm_final_accuracy <- (16+19)/(19+3+9+16)#
glm_final_accuracy <- sum(diag(tbl_glm_final)) / nrow(test_data)
#glm_final_c_accuracy <- (15+20)/(20+2+10+15)#
glm_final_c_accuracy <- sum(diag(tbl_glm_final_c)) / nrow(test_data)
#Sensitivity=TP/(TP+FN)#
glm_final_sensitivity <-16/(16+3)
glm_final_c_sensitivity <- 15/(15+2)
#Specificity=TN/(TN+FP)#
glm_final_specificity<- 19/(19+9)
glm_final_c_specificity<-20/(20+10)
#glm Models Comparison#
glm_models_summary <- matrix(c(glm_final_accuracy,glm_final_sensitivity,glm_final_specificity,glm_final_c_accuracy,glm_final_c_sensitivity,glm_final_c_specificity),ncol=3,byrow=TRUE)
colnames(glm_models_summary) <- c("Accuracy","Sensitivity", "Specificity")
rownames(glm_models_summary) <- c("Final glm Model - logit Link","Final glm Model - cloglog Link")
glm_models_summary<- as.table(glm_models_summary)
glm_models_summary
##                                 Accuracy Sensitivity Specificity
## Final glm Model - logit Link   0.7446809   0.8421053   0.6785714
## Final glm Model - cloglog Link 0.7446809   0.8823529   0.6666667
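The hard-coded counts used for the sensitivity and specificity values above depend on reading the right cells off the confusion table. A minimal sketch, assuming "Yes" is the positive class and using made-up labels and probabilities (`actual` and `prob` are hypothetical), derives the metrics from the table itself so that TP, FN, TN and FP are named explicitly. Note that TP/(TP+FP) would be precision rather than sensitivity, so it is worth checking which off-diagonal cell is which.

```r
# Hypothetical actual labels and predicted probabilities (not the heart data).
actual <- factor(c("Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"),
                 levels = c("No", "Yes"))
prob <- c(0.90, 0.70, 0.30, 0.20, 0.60, 0.10, 0.80, 0.40)

# Classify at a cutoff, keeping both levels so the table is always 2 x 2.
cutoff <- 0.48
predicted <- factor(ifelse(prob > cutoff, "Yes", "No"), levels = c("No", "Yes"))
tbl <- table(actual = actual, predicted = predicted)

# Name each cell explicitly, with "Yes" as the positive class.
TP <- tbl["Yes", "Yes"]; FN <- tbl["Yes", "No"]
TN <- tbl["No", "No"];   FP <- tbl["No", "Yes"]

metrics <- c(
  accuracy    = (TP + TN) / sum(tbl),
  sensitivity = TP / (TP + FN),  # share of actual positives detected
  specificity = TN / (TN + FP)   # share of actual negatives detected
)
print(round(metrics, 3))
```

With the report's objects, the same cell lookups could be applied to `tbl_glm_final` once its row and column labels are confirmed.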
The ROC graph is used hand in hand with AUC to compare and evaluate models. In general:
- The closer the ROC curve is to the upper-left corner, the better the model: sensitivity stays high while the false positive rate (1 - specificity) stays low.
- A curve along the 45-degree diagonal indicates that decisions are effectively random rather than driven by the model.
- When two curves are close in shape and direction, AUC is used to decide which model fits better.
- The higher the AUC, the better.
Conclusion:
The chart below compares the logistic regression models fitted with different link functions.
- The probit curve is essentially the same as the logit curve.
- In our example, there is no meaningful difference between the link functions when the same set of variables is used.
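The AUC mentioned above can also be computed without extra packages. The sketch below uses the rank-sum (Mann-Whitney) identity: AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The function name `auc_rank` and the toy `labels` and `scores` are illustrative assumptions, not part of the original analysis.

```r
# AUC via the rank-sum identity; labels are 1 (positive) or 0 (negative).
auc_rank <- function(labels, scores) {
  stopifnot(length(labels) == length(scores))
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # average ranks, so tied scores count as half a win
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example (hypothetical values).
labels <- c(1, 1, 1, 0, 0, 0, 1, 0)
scores <- c(0.90, 0.70, 0.30, 0.20, 0.60, 0.10, 0.80, 0.40)
auc_value <- auc_rank(labels, scores)
print(auc_value)

# A scorer that gives every case the same score is no better than chance.
auc_constant <- auc_rank(labels, rep(0.5, length(labels)))
print(auc_constant)
```

The constant-score case corresponds to the 45-degree diagonal described above.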
#Visualization#
library(pROC)
ROC_logit <- roc(test_data$AHD, predit_glm_final)
ROC_probit <- roc(test_data$AHD, predit_glm_final_p)
ROC_cloglog <- roc(test_data$AHD, predit_glm_final_c)
plot(ROC_logit, col = 'blue')
plot(ROC_probit, add=TRUE, col='red')
plot(ROC_cloglog, add=TRUE, col='black')
To be updated.
Because of the large number of variables, this part could not be completed in time.
For the purpose of this exercise, we use all explanatory variables and define the kernel as linear.
* Different cost levels were compared to identify the optimal value.
* A confusion matrix was used to evaluate accuracy and other key metrics.
Results were not encouraging on the first attempt, so I changed the approach and used the following libraries for Scenario II:
- caret
- mlbench
- knitr
- lattice
- ROCR
- gmum.r (it took one day to install :-)
Scenario I code is included for reference only.
library(e1071)
plot(heart_data_final$Sex, heart_data_final$ChestPain, col=heart_data_final$AHD)
plot(heart_data_final$Sex, col=heart_data_final$AHD)
plot(heart_data_final$ChestPain, col=heart_data_final$AHD)
plot(heart_data_final$Ca, col=heart_data_final$AHD)
svmmodel <- svm(AHD ~. , data = train_data, kernel = "linear", cost = .1, scale = FALSE)
print(svmmodel)
##
## Call:
## svm(formula = AHD ~ ., data = train_data, kernel = "linear",
## cost = 0.1, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.1
## gamma: 0.05555556
##
## Number of Support Vectors: 108
svmmodel1 <- svm(AHD ~ Sex + ChestPain + Ca + RestBP, data = train_data, type='C-classification', kernel = "linear", cost = .1, scale = FALSE)
print(svmmodel1)
##
## Call:
## svm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, data = train_data,
## type = "C-classification", kernel = "linear", cost = 0.1,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.1
## gamma: 0.1428571
##
## Number of Support Vectors: 148
svmtuned_1 <- tune(svm, AHD ~ Sex + ChestPain + Ca + RestBP , data = train_data, kernel = "linear", ranges = list(cost = c(.001, .005,.01, .05, .1, .5, .075, .025) , scale = FALSE))
summary(svmtuned_1)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost scale
## 0.5 FALSE
##
## - best performance: 0.208
##
## - Detailed performance results:
## cost scale error dispersion
## 1 0.001 FALSE 0.448 0.09577752
## 2 0.005 FALSE 0.312 0.11895844
## 3 0.010 FALSE 0.260 0.12961481
## 4 0.050 FALSE 0.220 0.09092121
## 5 0.100 FALSE 0.212 0.08854377
## 6 0.500 FALSE 0.208 0.07955431
## 7 0.075 FALSE 0.212 0.08854377
## 8 0.025 FALSE 0.244 0.09512565
#For svmmodel1, the optimal cost value is .5#
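The 10-fold cross-validation that `tune` reports can be sketched by hand in base R. To stay free of extra packages, this example swaps in a logistic regression and tunes its classification cutoff instead of an SVM cost; the data frame `sim`, the grid values and all variable names are hypothetical. The same loop structure would apply to cost by replacing the fitting and prediction calls.

```r
# Simulated two-class data (hypothetical; not the heart data).
set.seed(42)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(1.5 * sim$x1 - 1.0 * sim$x2))

k <- 10
fold <- sample(rep(seq_len(k), length.out = n))  # random fold labels 1..k
grid <- c(0.3, 0.4, 0.5, 0.6, 0.7)               # candidate parameter values

# For each candidate, average the held-out misclassification rate over folds.
cv_error <- sapply(grid, function(cutoff) {
  fold_err <- sapply(seq_len(k), function(i) {
    fit  <- glm(y ~ x1 + x2, family = binomial, data = sim[fold != i, ])
    prob <- predict(fit, newdata = sim[fold == i, ], type = "response")
    mean((prob > cutoff) != sim$y[fold == i])
  })
  mean(fold_err)
})

cv_results <- data.frame(cutoff = grid, error = cv_error)
best <- cv_results[which.min(cv_results$error), ]
print(cv_results)
print(best)
```

Because fold assignment is random, repeated runs can select different values when candidates are close, so fixing the seed before tuning helps make the choice reproducible.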
library(tourr)
svmtuned_1 <- tune(svm, AHD ~ Sex + ChestPain + Ca + RestBP , data = train_data, kernel = "linear", ranges = list(cost = c(.001, .005,.01, .05, .1, .5, .075, .025) , scale = FALSE))
summary(svmtuned_1)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost scale
## 0.1 FALSE
##
## - best performance: 0.216
##
## - Detailed performance results:
## cost scale error dispersion
## 1 0.001 FALSE 0.452 0.07315129
## 2 0.005 FALSE 0.316 0.08099383
## 3 0.010 FALSE 0.268 0.07067924
## 4 0.050 FALSE 0.224 0.03373096
## 5 0.100 FALSE 0.216 0.03864367
## 6 0.500 FALSE 0.216 0.05719363
## 7 0.075 FALSE 0.220 0.03399346
## 8 0.025 FALSE 0.232 0.07004760
#In this run cost = 0.1 and cost = 0.5 tie at error 0.216; cost = 0.5 is used for svmmodel1#
svmmodel1 <- svm(AHD ~ Sex + ChestPain + Ca + RestBP, data = train_data, type='C-classification', kernel = "linear", cost = .5, scale = FALSE)
print(svmmodel1)
##
## Call:
## svm(formula = AHD ~ Sex + ChestPain + Ca + RestBP, data = train_data,
## type = "C-classification", kernel = "linear", cost = 0.5,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.5
## gamma: 0.1428571
##
## Number of Support Vectors: 121
svm_predict_1 <- predict(svmmodel1, newdata = test_data, type = "class")
svm_predict_1
## 254 255 256 257 258 259 260 261 262 263 264 265 266 268 269 270 271 272
## No Yes No Yes No No No No No No No Yes Yes No Yes No Yes Yes
## 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 289 290 291
## Yes No Yes No No No No No Yes No No No Yes Yes Yes No No No
## 292 293 294 295 296 297 298 299 300 301 302
## No Yes Yes No No Yes No No Yes Yes No
## Levels: No Yes
head(svm_predict_1)
## 254 255 256 257 258 259
## No Yes No Yes No No
## Levels: No Yes
str(svm_predict_1)
## Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 1 1 1 1 ...
## - attr(*, "names")= chr [1:47] "254" "255" "256" "257" ...
summary(svm_predict_1)
## No Yes
## 29 18
#confusion matrix#
svmmodel1_eval <- table (svm_predict_1, test_data$AHD)
svmmodel1_eval
##
## svm_predict_1 No Yes
## No 19 10
## Yes 3 15
#Weak model: accuracy is (19+15)/47 = 0.72, so svmmodel1 is excluded#
#Visualization: evaluating the equation of the boundary plane#
w <- t(svmmodel1$coefs) %*% svmmodel1$SV
w
## Sex ChestPainasymptomatic ChestPainnonanginal
## [1,] -1.287547 -1.204816 0.4015195
## ChestPainnontypical ChestPaintypical Ca RestBP
## [1,] 0.3560102 0.4472863 -0.6666633 -0.009130642
#rho is the negative of the intercept#
svmmodel1$rho
## [1] -3.163924
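To make the boundary equation concrete, the sketch below evaluates the linear decision value, the dot product of `w` with a feature vector minus `rho`, using the weights and `rho` printed above. The patient vector `x` is hypothetical. Which sign corresponds to "Yes" depends on the label order e1071 used when fitting, so the sign interpretation here is an assumption to check against `predict`.

```r
# Weights and rho copied from the output above (as printed).
w <- c(Sex = -1.287547, ChestPainasymptomatic = -1.204816,
       ChestPainnonanginal = 0.4015195, ChestPainnontypical = 0.3560102,
       ChestPaintypical = 0.4472863, Ca = -0.6666633, RestBP = -0.009130642)
rho <- -3.163924

# Hypothetical patient: male, asymptomatic chest pain, Ca = 1, RestBP = 130.
x <- c(Sex = 1, ChestPainasymptomatic = 1, ChestPainnonanginal = 0,
       ChestPainnontypical = 0, ChestPaintypical = 0, Ca = 1, RestBP = 130)

# Linear decision value; points with value 0 lie on the separating plane.
decision_value <- sum(w * x[names(w)]) - rho
print(decision_value)
```

Since the model was fitted with `scale = FALSE`, the raw feature values can be plugged in directly; a scaled model would need the same centering and scaling applied first.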
To take advantage of more robust tooling, I used a different approach to split the data and resample.
## Warning: package 'caret' was built under R version 3.4.2
## Loading required package: lattice
## Warning: package 'mlbench' was built under R version 3.4.2
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
## X Age Sex ChestPain
## Min. : 3.0 Min. :29.00 Min. :0.0000 asymptomatic:118
## 1st Qu.: 79.5 1st Qu.:48.00 1st Qu.:0.0000 nonanginal : 65
## Median :151.0 Median :56.00 Median :1.0000 nontypical : 44
## Mean :152.3 Mean :54.58 Mean :0.6478 typical : 20
## 3rd Qu.:227.5 3rd Qu.:61.00 3rd Qu.:1.0000
## Max. :302.0 Max. :77.00 Max. :1.0000
## RestBP Chol Fbs RestECG
## Min. : 94.0 Min. :126.0 Min. :0.0000 Min. :0.0000
## 1st Qu.:120.0 1st Qu.:210.5 1st Qu.:0.0000 1st Qu.:0.0000
## Median :130.0 Median :241.0 Median :0.0000 Median :0.0000
## Mean :131.6 Mean :246.5 Mean :0.1336 Mean :0.9838
## 3rd Qu.:140.0 3rd Qu.:275.5 3rd Qu.:0.0000 3rd Qu.:2.0000
## Max. :200.0 Max. :564.0 Max. :1.0000 Max. :2.0000
## MaxHR ExAng Oldpeak Slope
## Min. : 71.0 Min. :0.0000 Min. :0.000 Min. :1.000
## 1st Qu.:135.0 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000
## Median :154.0 Median :0.0000 Median :0.600 Median :2.000
## Mean :149.7 Mean :0.3198 Mean :1.001 Mean :1.571
## 3rd Qu.:164.5 3rd Qu.:1.0000 3rd Qu.:1.550 3rd Qu.:2.000
## Max. :202.0 Max. :1.0000 Max. :6.200 Max. :3.000
## Ca Thal AHD
## Min. :0.0000 fixed : 10 No :133
## 1st Qu.:0.0000 normal :142 Yes:114
## Median :0.0000 reversable: 95
## Mean :0.6275
## 3rd Qu.:1.0000
## Max. :3.0000
## X Age Sex ChestPain
## Min. : 1.0 Min. :35.00 Min. :0.00 asymptomatic:24
## 1st Qu.: 66.5 1st Qu.:48.00 1st Qu.:1.00 nonanginal :18
## Median :143.0 Median :54.50 Median :1.00 nontypical : 5
## Mean :142.5 Mean :54.34 Mean :0.82 typical : 3
## 3rd Qu.:210.2 3rd Qu.:62.00 3rd Qu.:1.00
## Max. :300.0 Max. :70.00 Max. :1.00
## RestBP Chol Fbs RestECG
## Min. : 94.0 Min. :169.0 Min. :0.0 Min. :0.00
## 1st Qu.:120.0 1st Qu.:225.2 1st Qu.:0.0 1st Qu.:0.00
## Median :130.0 Median :245.5 Median :0.0 Median :2.00
## Mean :132.2 Mean :251.4 Mean :0.2 Mean :1.06
## 3rd Qu.:143.5 3rd Qu.:280.2 3rd Qu.:0.0 3rd Qu.:2.00
## Max. :192.0 Max. :409.0 Max. :1.0 Max. :2.00
## MaxHR ExAng Oldpeak Slope
## Min. : 99.0 Min. :0.00 Min. :0.000 Min. :1.00
## 1st Qu.:132.0 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:1.00
## Median :150.0 Median :0.00 Median :1.500 Median :2.00
## Mean :149.3 Mean :0.36 Mean :1.324 Mean :1.76
## 3rd Qu.:170.5 3rd Qu.:1.00 3rd Qu.:2.200 3rd Qu.:2.00
## Max. :195.0 Max. :1.00 Max. :3.500 Max. :3.00
## Ca Thal AHD
## Min. :0.00 fixed : 8 No :27
## 1st Qu.:0.00 normal :22 Yes:23
## Median :0.00 reversable:20
## Mean :0.92
## 3rd Qu.:2.00
## Max. :3.00
## [1] 247 15
## [1] 50 15
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
## Support Vector Machines with Linear Kernel
##
## 247 samples
## 14 predictor
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered (17), scaled (17)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 222, 222, 223, 222, 222, 223, ...
## Resampling results:
##
## Accuracy Kappa
## 0.815094 0.6262081
##
## Tuning parameter 'C' was held constant at a value of 1
## [1] No Yes No No Yes No No No Yes No Yes Yes Yes No Yes Yes No
## [18] No Yes No Yes No No Yes No No Yes Yes No Yes Yes Yes Yes No
## [35] Yes Yes No No No Yes Yes Yes No No Yes No No Yes Yes Yes
## Levels: No Yes
##
## Attaching package: 'gmum.r'
## The following object is masked from 'package:kernlab':
##
## centers
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 23 1
## Yes 4 22
##
## Accuracy : 0.9
## 95% CI : (0.7819, 0.9667)
## No Information Rate : 0.54
## P-Value [Acc > NIR] : 4.519e-08
##
## Kappa : 0.8006
## Mcnemar's Test P-Value : 0.3711
##
## Sensitivity : 0.8519
## Specificity : 0.9565
## Pos Pred Value : 0.9583
## Neg Pred Value : 0.8462
## Prevalence : 0.5400
## Detection Rate : 0.4600
## Detection Prevalence : 0.4800
## Balanced Accuracy : 0.9042
##
## 'Positive' Class : No
##
## [1] No Yes No No Yes No No No Yes No Yes Yes Yes No Yes Yes No
## [18] No No No Yes No No Yes No Yes Yes Yes No No Yes No Yes No
## [35] Yes Yes No No No Yes Yes No No No Yes No No Yes Yes Yes
## Levels: No Yes
## Support Vector Machines with Linear Kernel
##
## 247 samples
## 14 predictor
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered (17), scaled (17)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 222, 222, 223, 222, 222, 223, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.001 0.7297735 0.4319236
## 0.005 0.8341581 0.6651568
## 0.010 0.8342607 0.6649947
## 0.025 0.8326496 0.6612919
## 0.050 0.8247564 0.6450954
## 0.075 0.8233162 0.6419452
## 0.100 0.8166966 0.6286843
## 0.175 0.8096496 0.6149714
## 0.250 0.8083120 0.6125596
## 0.750 0.8110385 0.6179633
## 1.000 0.8150940 0.6262081
## 1.500 0.8124786 0.6209638
## 1.750 0.8178162 0.6314467
## 2.000 0.8178675 0.6316306
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.01.
## null device
## 1
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 20 1
## Yes 7 22
##
## Accuracy : 0.84
## 95% CI : (0.7089, 0.9283)
## No Information Rate : 0.54
## P-Value [Acc > NIR] : 7.854e-06
##
## Kappa : 0.684
## Mcnemar's Test P-Value : 0.0771
##
## Sensitivity : 0.7407
## Specificity : 0.9565
## Pos Pred Value : 0.9524
## Neg Pred Value : 0.7586
## Prevalence : 0.5400
## Detection Rate : 0.4000
## Detection Prevalence : 0.4200
## Balanced Accuracy : 0.8486
##
## 'Positive' Class : No
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 23 1
## Yes 4 22
##
## Accuracy : 0.9
## 95% CI : (0.7819, 0.9667)
## No Information Rate : 0.54
## P-Value [Acc > NIR] : 4.519e-08
##
## Kappa : 0.8006
## Mcnemar's Test P-Value : 0.3711
##
## Sensitivity : 0.8519
## Specificity : 0.9565
## Pos Pred Value : 0.9583
## Neg Pred Value : 0.8462
## Prevalence : 0.5400
## Detection Rate : 0.4600
## Detection Prevalence : 0.4800
## Balanced Accuracy : 0.9042
##
## 'Positive' Class : No
##
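The Kappa and Balanced Accuracy figures in the confusion-matrix output above can be reproduced by hand from the four cell counts. The sketch below rebuilds the 2 x 2 table (23, 1, 4, 22) and assumes, as the output states, that "No" is the positive class; the object names are illustrative.

```r
# Confusion matrix from the output above: rows are predictions, columns reference.
cm <- matrix(c(23, 4, 1, 22), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

n  <- sum(cm)
po <- sum(diag(cm)) / n                     # observed agreement (accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa_hat <- (po - pe) / (1 - pe)           # Cohen's kappa

# With "No" as the positive class:
sensitivity <- cm["No", "No"] / sum(cm[, "No"])
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = po, kappa = kappa_hat,
              balanced_accuracy = balanced_accuracy), 4))
```

Kappa discounts the agreement that the row and column totals alone would produce, which is why it is lower than the raw accuracy.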
To take advantage of more robust tooling, I used a different approach to split the data and resample.
## Warning: package 'pls' was built under R version 3.4.2
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
## No Yes
## 1 0.3969739 0.6030261
## 2 0.2903513 0.7096487
## 4 0.5902840 0.4097160
## 6 0.7178610 0.2821390
## 9 0.3479927 0.6520073
## 19 0.7204744 0.2795256
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 19 2
## Yes 8 21
##
## Accuracy : 0.8
## 95% CI : (0.6628, 0.8997)
## No Information Rate : 0.54
## P-Value [Acc > NIR] : 0.0001186
##
## Kappa : 0.6051
## Mcnemar's Test P-Value : 0.1138463
##
## Sensitivity : 0.7037
## Specificity : 0.9130
## Pos Pred Value : 0.9048
## Neg Pred Value : 0.7241
## Prevalence : 0.5400
## Detection Rate : 0.3800
## Detection Prevalence : 0.4200
## Balanced Accuracy : 0.8084
##
## 'Positive' Class : No
##
## [1] Yes Yes No No Yes No Yes No Yes No Yes Yes Yes Yes Yes Yes No
## [18] No Yes No Yes Yes Yes Yes No No Yes Yes No Yes Yes No Yes No
## [35] Yes No No No No Yes Yes Yes No No Yes No No Yes Yes Yes
## Levels: No Yes
## [1] No Yes No No Yes No No No Yes No Yes Yes Yes No Yes Yes No
## [18] No No No Yes No No Yes No Yes Yes Yes No No Yes No Yes No
## [35] Yes Yes No No No Yes Yes No No No Yes No No Yes Yes Yes
## Levels: No Yes
Support vector machine (SVM) Model | Evaluation - Scenario II (rda)
==============================================================================
To take advantage of more robust tooling, I used a different approach to split the data and resample.
## Warning: package 'klaR' was built under R version 3.4.2
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.4.2
## Regularized Discriminant Analysis
##
## 247 samples
## 14 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 222, 222, 223, 222, 222, 223, ...
## Resampling results across tuning parameters:
##
## gamma ROC Sens Spec
## 0.00 0.8943279 0.8721612 0.7666667
## 0.25 0.7363109 0.7919414 0.5782828
## 0.50 0.7243673 0.7846154 0.5601010
## 0.75 0.7015290 0.7641026 0.5313131
## 1.00 0.6259962 0.6560440 0.5171717
##
## Tuning parameter 'lambda' was held constant at a value of 0.75
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were gamma = 0 and lambda = 0.75.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 20 1
## Yes 7 22
##
## Accuracy : 0.84
## 95% CI : (0.7089, 0.9283)
## No Information Rate : 0.54
## P-Value [Acc > NIR] : 7.854e-06
##
## Kappa : 0.684
## Mcnemar's Test P-Value : 0.0771
##
## Sensitivity : 0.7407
## Specificity : 0.9565
## Pos Pred Value : 0.9524
## Neg Pred Value : 0.7586
## Prevalence : 0.5400
## Detection Rate : 0.4000
## Detection Prevalence : 0.4200
## Balanced Accuracy : 0.8486
##
## 'Positive' Class : No
##
##
## Call:
## summary.resamples(object = resamps)
##
## Models: pls, rda
## Number of resamples: 30
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.7412587 0.8618881 0.9228480 0.9069070 0.9423077 1 0
## rda 0.7552448 0.8496503 0.9128788 0.8943279 0.9464286 1 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.5384615 0.7857143 0.8571429 0.8521978 0.9230769 1 0
## rda 0.6923077 0.8461538 0.8571429 0.8721612 0.9230769 1 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.6363636 0.7272727 0.8181818 0.8015152 0.8333333 1 0
## rda 0.5000000 0.6439394 0.8181818 0.7666667 0.8901515 1 0
##
## Call:
## summary.diff.resamples(object = resampling_eval)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## ROC
##     pls     rda
## pls         0.01258
## rda 0.07428
##
## Sens
##     pls     rda
## pls         -0.01996
## rda 0.1817
##
## Spec
##     pls     rda
## pls         0.03485
## rda 0.03153
Need Update
## Warning: package 'factoextra' was built under R version 3.4.2
## Warning: package 'FactoMineR' was built under R version 3.4.2
## Warning: package 'rgl' was built under R version 3.4.2
##
## Call:
## PCA(X = trainData[, -1], quali.sup = c(3, 13, 14))
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 2.602 1.503 1.142 1.037 0.929 0.824
## % of var. 23.653 13.665 10.378 9.430 8.448 7.495
## Cumulative % of var. 23.653 37.318 47.696 57.126 65.573 73.068
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.790 0.755 0.620 0.433 0.365
## % of var. 7.179 6.866 5.638 3.934 3.315
## Cumulative % of var. 80.248 87.114 92.752 96.685 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## 3 | 3.457 | 2.783 1.205 0.648 | -0.917 0.226 0.070 |
## 5 | 2.987 | -1.687 0.443 0.319 | 0.070 0.001 0.001 |
## 7 | 4.227 | 2.619 1.067 0.384 | 0.923 0.229 0.048 |
## 8 | 3.409 | -0.795 0.098 0.054 | 0.854 0.196 0.063 |
## 10 | 4.544 | 2.429 0.918 0.286 | -0.732 0.144 0.026 |
## 11 | 2.192 | -0.385 0.023 0.031 | -0.740 0.148 0.114 |
## 12 | 2.383 | 0.193 0.006 0.007 | 1.238 0.413 0.270 |
## 13 | 3.345 | 1.419 0.313 0.180 | 0.340 0.031 0.010 |
## 14 | 2.699 | -2.376 0.879 0.775 | -0.552 0.082 0.042 |
## 15 | 4.090 | -0.819 0.104 0.040 | 1.197 0.386 0.086 |
## Dim.3 ctr cos2
## 3 -0.001 0.000 0.000 |
## 5 -0.811 0.233 0.074 |
## 7 -1.626 0.937 0.148 |
## 8 -1.561 0.864 0.210 |
## 10 1.149 0.468 0.064 |
## 11 0.539 0.103 0.061 |
## 12 -1.696 1.020 0.507 |
## 13 1.622 0.933 0.235 |
## 14 0.143 0.007 0.003 |
## 15 3.094 3.395 0.572 |
##
## Variables (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## Age | 0.558 11.956 0.311 | 0.458 13.952 0.210 | 0.075 0.488
## Sex | 0.113 0.495 0.013 | -0.513 17.530 0.263 | 0.517 23.418
## RestBP | 0.356 4.882 0.127 | 0.468 14.553 0.219 | 0.227 4.501
## Chol | 0.131 0.662 0.017 | 0.552 20.307 0.305 | -0.433 16.441
## Fbs | 0.156 0.935 0.024 | 0.409 11.120 0.167 | 0.662 38.418
## RestECG | 0.281 3.028 0.079 | 0.228 3.463 0.052 | -0.177 2.740
## MaxHR | -0.682 17.896 0.466 | 0.216 3.107 0.047 | 0.008 0.006
## ExAng | 0.505 9.817 0.255 | -0.349 8.096 0.122 | -0.011 0.010
## Oldpeak | 0.731 20.566 0.535 | -0.220 3.233 0.049 | -0.141 1.735
## Slope | 0.698 18.745 0.488 | -0.222 3.293 0.050 | -0.281 6.906
## cos2
## Age 0.006 |
## Sex 0.267 |
## RestBP 0.051 |
## Chol 0.188 |
## Fbs 0.439 |
## RestECG 0.031 |
## MaxHR 0.000 |
## ExAng 0.000 |
## Oldpeak 0.020 |
## Slope 0.079 |
##
## Supplementary categories
## Dist Dim.1 cos2 v.test Dim.2 cos2 v.test
## asymptomatic | 0.744 | 0.644 0.750 5.993 | -0.260 0.122 -3.175 |
## nonanginal | 0.611 | -0.403 0.435 -2.343 | 0.328 0.288 2.508 |
## nontypical | 1.222 | -1.198 0.961 -5.422 | 0.134 0.012 0.795 |
## typical | 0.931 | 0.143 0.024 0.414 | 0.172 0.034 0.651 |
## fixed | 1.705 | 1.325 0.604 2.646 | -0.437 0.066 -1.150 |
## normal | 0.707 | -0.608 0.740 -6.874 | 0.231 0.107 3.443 |
## reversable | 0.903 | 0.769 0.725 5.913 | -0.300 0.110 -3.032 |
## No | 0.950 | -0.894 0.886 -9.392 | 0.214 0.051 2.959 |
## Yes | 1.108 | 1.043 0.886 9.392 | -0.250 0.051 -2.959 |
## Dim.3 cos2 v.test
## asymptomatic -0.068 0.008 -0.961 |
## nonanginal 0.014 0.001 0.121 |
## nontypical -0.015 0.000 -0.106 |
## typical 0.393 0.178 1.712 |
## fixed 0.620 0.132 1.870 |
## normal -0.179 0.064 -3.049 |
## reversable 0.202 0.050 2.340 |
## No -0.098 0.011 -1.552 |
## Yes 0.114 0.011 1.552 |
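The "% of var." and cumulative rows of the Eigenvalues table above follow directly from the eigenvalues: each percentage is the eigenvalue divided by their sum (11, the number of standardized quantitative variables). The sketch below recomputes them from the printed values. The count of components with eigenvalue above 1 (the Kaiser rule) is added only as an illustrative heuristic and is not something the original analysis applies.

```r
# Eigenvalues copied from the PCA output above (rounded as printed).
eig <- c(2.602, 1.503, 1.142, 1.037, 0.929, 0.824,
         0.790, 0.755, 0.620, 0.433, 0.365)

pct     <- 100 * eig / sum(eig)  # percentage of variance per dimension
cum_pct <- cumsum(pct)           # cumulative percentage

pca_variance <- data.frame(dim = paste0("Dim.", seq_along(eig)),
                           eigenvalue = eig,
                           pct = round(pct, 3),
                           cum_pct = round(cum_pct, 3))
print(pca_variance)

# Illustrative heuristic (assumption): keep components with eigenvalue > 1.
n_kaiser <- sum(eig > 1)
print(n_kaiser)
```

Small differences in the third decimal place relative to the printed table come from the eigenvalues having been rounded in the output.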