This database contains 76 attributes, but all published experiments use a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient and is integer valued from 0 (no presence) to 4.

Attribute Information:

age: numeric
sex: category (1 = male, 0 = female)
cp: chest pain type, category (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)
trestbps: resting blood pressure, numeric (in mm Hg on admission to the hospital)
chol: serum cholesterol in mg/dl, numeric
fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
restecg: resting electrocardiographic results (0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy by Estes' criteria)
thalach: maximum heart rate achieved, numeric
exang: exercise-induced angina (1 = yes, 0 = no)
oldpeak: ST depression induced by exercise relative to rest, numeric
slope: the slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)
ca: number of major vessels (0-3) colored by fluoroscopy
thal: 3 = normal, 6 = fixed defect, 7 = reversible defect
target: diagnosis of heart disease (angiographic disease status); 0 = less than 50% diameter narrowing, 1 = more than 50% diameter narrowing in any major vessel (attributes 59 through 68 of the original 76-attribute database are the vessels)

Note that in the file analyzed below, cp, slope, ca, and thal appear with 0-based codes (see the summary output).

The names and social security numbers of the patients were removed from the database and replaced with dummy values. One file, the one containing the Cleveland database, has been "processed"; all four unprocessed files also exist in this directory.


Support Vector Machine Classierimplementation in R with caret

Importing packages

suppressMessages(library(dplyr))
suppressMessages(library(readr))
suppressMessages(library(corrplot))
suppressMessages(library(qgraph))
suppressMessages(library(ggplot2))
suppressMessages(library(PerformanceAnalytics))
suppressMessages(library(GGally))
suppressMessages(library(caret))
suppressMessages(library(rpart.plot))
suppressMessages(library(randomForest))
suppressMessages(library(corrr))
suppressMessages(library(psych))
suppressMessages(library(ggpubr))
suppressMessages(library(DataExplorer))

Loading dataset
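
The loading step is not shown in the original; a minimal sketch, assuming the data sits in a local heart.csv (the file name is an assumption):

heart <- read.csv("heart.csv")  # yields the 303 x 14 data frame used below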

Data wrangling

Changing sex from numeric to category, changing target and the other categorical columns to factors, and checking for missing values.

# Recode sex to labels, then convert the categorical columns to factors
heart <- heart %>% mutate(sex = ifelse(sex == 1, "male", "female"))
heart$sex <- factor(heart$sex)
heart$cp <- factor(heart$cp)
heart$restecg <- factor(heart$restecg)
heart$exang <- factor(heart$exang)
heart$ca <- factor(heart$ca)
heart$slope <- factor(heart$slope)
heart$thal <- factor(heart$thal)
heart$target <- factor(heart$target)

# Descriptive statistics for each variable
describe(heart)
##          vars   n   mean    sd median trimmed   mad min   max range  skew
## age         1 303  54.37  9.08   55.0   54.54 10.38  29  77.0  48.0 -0.20
## sex*        2 303   1.68  0.47    2.0    1.73  0.00   1   2.0   1.0 -0.78
## cp*         3 303   1.97  1.03    2.0    1.86  1.48   1   4.0   3.0  0.48
## trestbps    4 303 131.62 17.54  130.0  130.44 14.83  94 200.0 106.0  0.71
## chol        5 303 246.26 51.83  240.0  243.49 47.44 126 564.0 438.0  1.13
## fbs         6 303   0.15  0.36    0.0    0.06  0.00   0   1.0   1.0  1.97
## restecg*    7 303   1.53  0.53    2.0    1.52  0.00   1   3.0   2.0  0.16
## thalach     8 303 149.65 22.91  153.0  150.98 22.24  71 202.0 131.0 -0.53
## exang*      9 303   1.33  0.47    1.0    1.28  0.00   1   2.0   1.0  0.74
## oldpeak    10 303   1.04  1.16    0.8    0.86  1.19   0   6.2   6.2  1.26
## slope*     11 303   2.40  0.62    2.0    2.46  1.48   1   3.0   2.0 -0.50
## ca*        12 303   1.73  1.02    1.0    1.54  0.00   1   5.0   4.0  1.30
## thal*      13 303   3.31  0.61    3.0    3.36  0.00   1   4.0   3.0 -0.47
## target*    14 303   1.54  0.50    2.0    1.56  0.00   1   2.0   1.0 -0.18
##          kurtosis   se
## age         -0.57 0.52
## sex*        -1.39 0.03
## cp*         -1.21 0.06
## trestbps     0.87 1.01
## chol         4.36 2.98
## fbs          1.88 0.02
## restecg*    -1.37 0.03
## thalach     -0.10 1.32
## exang*      -1.46 0.03
## oldpeak      1.50 0.07
## slope*      -0.65 0.04
## ca*          0.78 0.06
## thal*        0.25 0.04
## target*     -1.97 0.03
# A summary of the dataset
summary(heart)
##       age            sex      cp         trestbps          chol      
##  Min.   :29.00   female: 96   0:143   Min.   : 94.0   Min.   :126.0  
##  1st Qu.:47.50   male  :207   1: 50   1st Qu.:120.0   1st Qu.:211.0  
##  Median :55.00                2: 87   Median :130.0   Median :240.0  
##  Mean   :54.37                3: 23   Mean   :131.6   Mean   :246.3  
##  3rd Qu.:61.00                        3rd Qu.:140.0   3rd Qu.:274.5  
##  Max.   :77.00                        Max.   :200.0   Max.   :564.0  
##       fbs         restecg    thalach      exang      oldpeak     slope  
##  Min.   :0.0000   0:147   Min.   : 71.0   0:204   Min.   :0.00   0: 21  
##  1st Qu.:0.0000   1:152   1st Qu.:133.5   1: 99   1st Qu.:0.00   1:140  
##  Median :0.0000   2:  4   Median :153.0           Median :0.80   2:142  
##  Mean   :0.1485           Mean   :149.6           Mean   :1.04          
##  3rd Qu.:0.0000           3rd Qu.:166.0           3rd Qu.:1.60          
##  Max.   :1.0000           Max.   :202.0           Max.   :6.20          
##  ca      thal    target 
##  0:175   0:  2   0:138  
##  1: 65   1: 18   1:165  
##  2: 38   2:166          
##  3: 20   3:117          
##  4:  5                  
## 
# checking for missing values
plot_missing(heart)

# Exploratory data analysis
prop.table(table(heart$target))
## 
##         0         1 
## 0.4554455 0.5445545
ggplot(heart, aes(x = sex, fill = sex)) + geom_bar() + ggtitle("Male/Female ratio")

plot_histogram(heart)

ggplot(data=heart, aes(x=age, fill=as.factor(target)))+
  geom_dotplot(alpha=.5, stackgroups = TRUE, binwidth = 1, binpositions = "all")+
  ggtitle("Age Distribution by Status")

Correlation, Variance and Covariance (Matrices)

var, cov and cor compute the variance of x, and the covariance or correlation of x and y if these are vectors. If x and y are matrices, then the covariances (or correlations) between the columns of x and the columns of y are computed. The chart below shows the correlations between the age, trestbps, chol, and thalach variables.
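
For instance, a sketch using the numeric columns analyzed below:

var(heart$age)                                         # variance of one vector
cov(heart$age, heart$thalach)                          # covariance of two vectors
cor(heart[, c("age", "trestbps", "chol", "thalach")])  # correlation matrix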

CLASSIFICATION BY DECISION TREES

Decision trees are popular and simple statistical tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class or class distribution. The topmost node in a tree is the root node.
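
No tree is actually fitted in this report; a minimal sketch with rpart (rpart.plot is loaded above) could be:

library(rpart)
tree_fit <- rpart(target ~ ., data = heart, method = "class")  # classification tree on all attributes
rpart.plot(tree_fit, main = "Heart disease decision tree")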

str(heart)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 1 2 2 2 ...
##  $ cp      : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : Factor w/ 3 levels "0","1","2": 1 2 1 2 2 2 1 2 2 2 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
##  $ ca      : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ thal    : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
##  $ target  : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
corrplot(cor(heart[,c(1,4,5,8)]), type="lower",method="number") 

pairs.panels(heart[,c(1,4,5,8)], method="pearson",
             hist.col = "#1fbbfa", density=TRUE, ellipses=TRUE, show.points = TRUE,
             pch=1, lm=TRUE, cex.cor=.7, smoother=FALSE, stars=TRUE, main="Heart Disease - pairs.panels")

Network graph of correlations

qgraph(cor(heart[,c(1,4,5,8)]))

set.seed(3033)
index <- createDataPartition(y = heart$target, p= 0.8, list = FALSE)
training <- heart[index,]
testing <- heart[-index,]
#Check statistical distributions of the two splits with the original dataset
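# A sketch of that check (not in the original): the class balance of each split
# should be close to that of the full dataset
prop.table(table(training$target))
prop.table(table(testing$target))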
 
colnames(training)
##  [1] "age"      "sex"      "cp"       "trestbps" "chol"     "fbs"     
##  [7] "restecg"  "thalach"  "exang"    "oldpeak"  "slope"    "ca"      
## [13] "thal"     "target"
str(training)
## 'data.frame':    243 obs. of  14 variables:
##  $ age     : int  63 37 56 57 57 56 44 57 54 48 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 2 2 2 1 ...
##  $ cp      : Factor w/ 4 levels "0","1","2","3": 4 3 2 1 1 2 2 3 1 3 ...
##  $ trestbps: int  145 130 120 120 140 140 120 150 140 130 ...
##  $ chol    : int  233 250 236 354 192 294 263 168 239 275 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ restecg : Factor w/ 3 levels "0","1","2": 1 2 2 2 2 1 2 2 2 2 ...
##  $ thalach : int  150 187 178 163 148 153 173 174 160 139 ...
##  $ exang   : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
##  $ oldpeak : num  2.3 3.5 0.8 0.6 0.4 1.3 0 1.6 1.2 0.2 ...
##  $ slope   : Factor w/ 3 levels "0","1","2": 1 1 3 3 2 2 3 3 3 3 ...
##  $ ca      : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ thal    : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 2 3 4 3 3 3 ...
##  $ target  : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
# Logistic regression on the training split, scored on the test split
glm_fit <- glm(target ~ ., data = training, family = "binomial")
glm_pred <- predict(glm_fit, newdata = testing[,-14], type = "response")
glm_pred <- as.data.frame(glm_pred)
glm_pred <- glm_pred %>% mutate(glm_model = ifelse(glm_pred > 0.5, 1, 0))  # 0.5 threshold
actual <- as.factor(testing$target)
confusionMatrix(as.factor(glm_pred$glm_model),actual)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 24  4
##          1  3 29
##                                           
##                Accuracy : 0.8833          
##                  95% CI : (0.7743, 0.9518)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 2.959e-08       
##                                           
##                   Kappa : 0.7651          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.8788          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.9062          
##              Prevalence : 0.4500          
##          Detection Rate : 0.4000          
##    Detection Prevalence : 0.4667          
##       Balanced Accuracy : 0.8838          
##                                           
##        'Positive' Class : 0               
## 

Using the caret package for model training

library(caret)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
set.seed(3233)
training$target = factor(training$target)
svm_fit <- train(target ~., data = training, method = "svmLinear",
                    trControl=trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: restecg2
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
svm_fit
## Support Vector Machines with Linear Kernel 
## 
## 243 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (22), scaled (22) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 219, 219, 218, 219, 218, 219, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8289333  0.6522771
## 
## Tuning parameter 'C' was held constant at a value of 1
svm_pred <- predict(svm_fit, newdata = testing)
confusionMatrix(svm_pred,actual)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 24  3
##          1  3 30
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.7949, 0.9624)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 4.558e-09       
##                                           
##                   Kappa : 0.798           
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.9091          
##          Pos Pred Value : 0.8889          
##          Neg Pred Value : 0.9091          
##              Prevalence : 0.4500          
##          Detection Rate : 0.4000          
##    Detection Prevalence : 0.4500          
##       Balanced Accuracy : 0.8990          
##                                           
##        'Positive' Class : 0               
## 

Support Vector Machine with Radial Kernel (svm_Radial)

set.seed(3233)
svm_Radial <- train(target ~age+sex+cp+chol+fbs+thalach+exang+oldpeak+slope+ca+thal, 
                    data = training, method = "svmRadial",
                    trControl=trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 10)
                    

svm_Radial
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 243 samples
##  11 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (19), scaled (19) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 219, 219, 218, 219, 218, 219, ... 
## Resampling results across tuning parameters:
## 
##   C       Accuracy   Kappa    
##     0.25  0.8057333  0.6082035
##     0.50  0.8114000  0.6172288
##     1.00  0.8155667  0.6244030
##     2.00  0.8099000  0.6139260
##     4.00  0.8080667  0.6108269
##     8.00  0.7990000  0.5931980
##    16.00  0.8170000  0.6314212
##    32.00  0.8071667  0.6115396
##    64.00  0.8014000  0.6006593
##   128.00  0.7990000  0.5959580
## 
## Tuning parameter 'sigma' was held constant at a value of 0.03632465
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.03632465 and C = 16.
Radial_pred <- predict(svm_Radial, newdata = testing)
confusionMatrix(Radial_pred,actual)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 21  6
##          1  6 27
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.6767, 0.8922)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 4.67e-05        
##                                           
##                   Kappa : 0.596           
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7778          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.7778          
##          Neg Pred Value : 0.8182          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3500          
##    Detection Prevalence : 0.4500          
##       Balanced Accuracy : 0.7980          
##                                           
##        'Positive' Class : 0               
## 

Random Forest (ranger via caret)

# fit a random forest model (using ranger)
rf_grid <- expand.grid(mtry = c(2, 3, 4, 5),
                      splitrule = c("gini", "extratrees"),
                      min.node.size = c(1, 3, 5))


ranger_fit <- train(target ~., data = training, method = "ranger",
                    trControl=trctrl,
                    preProcess = c("center", "scale"),
                    verbose=FALSE,
                    tuneLength = 10,
                    tuneGrid = rf_grid)
                    
ranger_fit
## Random Forest 
## 
## 243 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (22), scaled (22) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 219, 218, 219, 218, 219, 219, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   min.node.size  Accuracy   Kappa    
##   2     gini        1              0.7981462  0.5900654
##   2     gini        3              0.7981462  0.5904543
##   2     gini        5              0.7964128  0.5866494
##   2     extratrees  1              0.7972487  0.5896329
##   2     extratrees  3              0.7914154  0.5770773
##   2     extratrees  5              0.7872513  0.5687665
##   3     gini        1              0.7865769  0.5664680
##   3     gini        3              0.7949103  0.5836004
##   3     gini        5              0.7915128  0.5771035
##   3     extratrees  1              0.8022487  0.6008994
##   3     extratrees  3              0.7971487  0.5903926
##   3     extratrees  5              0.7947821  0.5856478
##   4     gini        1              0.7856487  0.5656830
##   4     gini        3              0.7906154  0.5752536
##   4     gini        5              0.7890462  0.5715309
##   4     extratrees  1              0.7971487  0.5910520
##   4     extratrees  3              0.7971487  0.5911288
##   4     extratrees  5              0.7971821  0.5907189
##   5     gini        1              0.7915154  0.5781145
##   5     gini        3              0.7815821  0.5565398
##   5     gini        5              0.7890154  0.5720712
##   5     extratrees  1              0.7971154  0.5914102
##   5     extratrees  3              0.7996154  0.5962054
##   5     extratrees  5              0.8013487  0.5991850
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 3, splitrule = extratrees
##  and min.node.size = 1.
ranger_pred <- predict(ranger_fit, newdata=testing)
# compare predicted outcome and true outcome
confusionMatrix(ranger_pred, actual)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 23  4
##          1  4 29
##                                           
##                Accuracy : 0.8667          
##                  95% CI : (0.7541, 0.9406)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 1.653e-07       
##                                           
##                   Kappa : 0.7306          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8519          
##             Specificity : 0.8788          
##          Pos Pred Value : 0.8519          
##          Neg Pred Value : 0.8788          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3833          
##    Detection Prevalence : 0.4500          
##       Balanced Accuracy : 0.8653          
##                                           
##        'Positive' Class : 0               
## 

Linear Regression

Linear regression is used to predict the value of an outcome variable Y from one or more predictor variables X. The aim is to establish a linear relationship (a mathematical formula) between the predictors and the response, so that we can use this formula to estimate the value of the response Y when only the predictor values (Xs) are known. lm() is R's linear model function, used for linear regression analysis.

lm(formula, data, subset, weights, …)

formula: a description of the model, such as y ~ x
data: optional; the data containing the variables in the model
subset: optional; a subset vector of observations to be used in the fitting process
weights: optional; a vector of weights to be used in the fitting process
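
A minimal sketch on this dataset (the choice of variables is illustrative only):

lm_fit <- lm(thalach ~ age, data = heart)  # regress maximum heart rate on age
summary(lm_fit)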

Generalized Linear Model

The Generalized Linear Model (GLZ) is a generalization of the general linear model. In its simplest form, a linear model specifies the (linear) relationship between a dependent (or response) variable Y and a set of predictor variables, the X's, so that

Y = b0 + b1X1 + b2X2 + … + bkXk

In this equation b0 is the regression coefficient for the intercept and the bi values are the regression coefficients (for variables 1 through k) computed from the data.
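
With a binary target and the binomial family used below, the linear predictor enters through the logit link, so the model fitted is

log(p / (1 - p)) = b0 + b1X1 + b2X2 + … + bkXk

where p is the probability that target = 1.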

Generalized Linear Modeling

# Training The Model
glm_model <- glm(target ~., data = training, family = binomial)
summary(glm_model)
## 
## Call:
## glm(formula = target ~ ., family = binomial, data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8277  -0.3153   0.1110   0.4859   3.1290  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.486e+00  4.213e+00   0.353 0.724289    
## age          2.060e-02  2.872e-02   0.717 0.473281    
## sexmale     -1.885e+00  6.291e-01  -2.997 0.002725 ** 
## cp1          7.706e-01  6.107e-01   1.262 0.207007    
## cp2          1.972e+00  6.108e-01   3.229 0.001241 ** 
## cp3          2.487e+00  7.897e-01   3.149 0.001638 ** 
## trestbps    -2.303e-02  1.231e-02  -1.871 0.061383 .  
## chol        -5.901e-03  4.456e-03  -1.324 0.185386    
## fbs          3.212e-01  6.460e-01   0.497 0.619076    
## restecg1     1.940e-01  4.353e-01   0.446 0.655809    
## restecg2    -1.362e+01  1.347e+03  -0.010 0.991933    
## thalach      1.778e-02  1.295e-02   1.374 0.169572    
## exang1      -1.030e+00  4.916e-01  -2.096 0.036088 *  
## oldpeak     -4.694e-01  2.520e-01  -1.863 0.062498 .  
## slope1      -1.049e+00  9.156e-01  -1.145 0.252107    
## slope2       3.676e-01  9.936e-01   0.370 0.711434    
## ca1         -2.202e+00  5.710e-01  -3.856 0.000115 ***
## ca2         -3.279e+00  8.683e-01  -3.776 0.000159 ***
## ca3         -1.218e+00  9.938e-01  -1.225 0.220483    
## ca4          1.209e+00  1.731e+00   0.699 0.484864    
## thal1        2.006e+00  3.134e+00   0.640 0.522083    
## thal2        2.096e+00  3.023e+00   0.693 0.488139    
## thal3        1.007e+00  3.027e+00   0.333 0.739305    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 153.00  on 220  degrees of freedom
## AIC: 199
## 
## Number of Fisher Scoring iterations: 15
par(mfrow=c(2,2))
plot(glm_model,sub.caption="Generalized Linear Model")

# Testing the Model
glm_probs <- predict(glm_model, newdata = testing, type = "response")
glm_fit <- as.factor(ifelse(glm_probs > 0.5, 1, 0))  # classify with a 0.5 threshold

actual <- as.factor(testing$target)
confusionMatrix(glm_fit,actual)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 24  4
##          1  3 29
##                                           
##                Accuracy : 0.8833          
##                  95% CI : (0.7743, 0.9518)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 2.959e-08       
##                                           
##                   Kappa : 0.7651          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.8788          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.9062          
##              Prevalence : 0.4500          
##          Detection Rate : 0.4000          
##    Detection Prevalence : 0.4667          
##       Balanced Accuracy : 0.8838          
##                                           
##        'Positive' Class : 0               
## 

ANOVA (analysis of variance) tests for differences between group means by comparing between-group variability to within-group variability (a variance ratio).
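
For the fitted logistic model, anova() produces a sequential analysis-of-deviance table; a sketch (not run in the original report):

anova(glm_model, test = "Chisq")  # chi-squared test for each term added in order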

Random Forest (randomForest package)

library(randomForest)
set.seed(123)
training$target <- as.factor(training$target)
fit.rf <- randomForest(target ~.,
                       data = training,
                       ntree = 500, # number of trees to create
                       importance = TRUE)
pred.rf <- predict(fit.rf, testing )
pred.rf <- as.data.frame(pred.rf)
reference <- as.factor(testing$target)
confusionMatrix(pred.rf$pred.rf, reference)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 24  4
##          1  3 29
##                                           
##                Accuracy : 0.8833          
##                  95% CI : (0.7743, 0.9518)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 2.959e-08       
##                                           
##                   Kappa : 0.7651          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.8788          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.9062          
##              Prevalence : 0.4500          
##          Detection Rate : 0.4000          
##    Detection Prevalence : 0.4667          
##       Balanced Accuracy : 0.8838          
##                                           
##        'Positive' Class : 0               
## 

Naive Bayes

library(naivebayes)
## naivebayes 0.9.6 loaded
nb <- naive_bayes(training[,-14],as.factor(training$target))
## Warning: naive_bayes(): Feature restecg - zero probabilities are present.
## Consider Laplace smoothing.
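# As the warning suggests, Laplace smoothing could be applied (a sketch, not run here):
# nb <- naive_bayes(training[,-14], as.factor(training$target), laplace = 1)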
nb_pred <-predict(nb,newdata=testing)
## Warning: predict.naive_bayes(): More features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
nb_pred<- as.data.frame(nb_pred)
nb_pred$nb_pred <- as.factor(nb_pred$nb_pred)
reference <- as.factor(testing$target)
confusionMatrix(nb_pred$nb_pred, reference)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 23  4
##          1  4 29
##                                           
##                Accuracy : 0.8667          
##                  95% CI : (0.7541, 0.9406)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 1.653e-07       
##                                           
##                   Kappa : 0.7306          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8519          
##             Specificity : 0.8788          
##          Pos Pred Value : 0.8519          
##          Neg Pred Value : 0.8788          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3833          
##    Detection Prevalence : 0.4500          
##       Balanced Accuracy : 0.8653          
##                                           
##        'Positive' Class : 0               
## 

k-nearest neighbors

set.seed(10)
objControl <- trainControl(method='repeatedcv', number=10,  repeats = 10)
kknn_model <- train(target ~., data = training, method = "kknn",
                    trControl=objControl, verbose=FALSE,
                    tuneLength = 10)
                    

kknn_Prediction <- predict(kknn_model, testing)
confusionMatrix(kknn_Prediction, actual)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 24  6
##          1  3 27
##                                          
##                Accuracy : 0.85           
##                  95% CI : (0.7343, 0.929)
##     No Information Rate : 0.55           
##     P-Value [Acc > NIR] : 8.068e-07      
##                                          
##                   Kappa : 0.7            
##                                          
##  Mcnemar's Test P-Value : 0.505          
##                                          
##             Sensitivity : 0.8889         
##             Specificity : 0.8182         
##          Pos Pred Value : 0.8000         
##          Neg Pred Value : 0.9000         
##              Prevalence : 0.4500         
##          Detection Rate : 0.4000         
##    Detection Prevalence : 0.5000         
##       Balanced Accuracy : 0.8535         
##                                          
##        'Positive' Class : 0              
## 

Generalized Additive Model using Splines

set.seed(10)
objControl <- trainControl(method='repeatedcv', number=10,  repeats = 10)
GAM_model <- train(target ~., data = training, method = "gam",
                    trControl=objControl, verbose=FALSE,
                    tuneLength = 10)
## Loading required package: mgcv
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
## Warning in newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L =
## G$L, : Fitting terminated with step failure - check results carefully
## Warning in newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L =
## G$L, : Iteration limit reached without full convergence - check carefully

## Warning in newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L =
## G$L, : Iteration limit reached without full convergence - check carefully
GAM_Prediction <- predict(GAM_model, testing)
confusionMatrix(GAM_Prediction, actual)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 23  6
##          1  4 27
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.7148, 0.9171)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 3.483e-06       
##                                           
##                   Kappa : 0.6656          
##                                           
##  Mcnemar's Test P-Value : 0.7518          
##                                           
##             Sensitivity : 0.8519          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.7931          
##          Neg Pred Value : 0.8710          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3833          
##    Detection Prevalence : 0.4833          
##       Balanced Accuracy : 0.8350          
##                                           
##        'Positive' Class : 0               
## 
Model Accuracy Rates (on the 60-observation test split)

Support Vector Machine (linear kernel): 90%
Generalized Linear Model: 88%
Random Forest (randomForest package): 88%
Random Forest (ranger via caret): 87%
Naive Bayes: 87%
k-Nearest Neighbors: 85%
Generalized Additive Model: 83%
Support Vector Machine (radial kernel): 80%

Joe Long, Data Analyst, Cabrillo Research