This database contains 76 attributes, but all published experiments use a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.
Attribute Information:
age: numeric, in years
sex: category (1 = male, 0 = female)
cp: chest pain type, category (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)
trestbps: resting blood pressure, numeric (in mm Hg on admission to the hospital)
chol: serum cholesterol in mg/dl, numeric
fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
restecg: resting electrocardiographic results (0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy by Estes' criteria)
thalach: maximum heart rate achieved, numeric
exang: exercise-induced angina (1 = yes, 0 = no)
oldpeak: ST depression induced by exercise relative to rest, numeric
slope: the slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)
ca: number of major vessels (0-3) colored by fluoroscopy
thal: 3 = normal, 6 = fixed defect, 7 = reversible defect
target: diagnosis of heart disease (angiographic disease status); 0 = < 50% diameter narrowing, 1 = > 50% diameter narrowing (in any major vessel; attributes 59 through 68 of the full database are vessels)
Note: in the version of the data analyzed below, the coded attributes cp, slope, ca, thal, and target use zero-based values; compare the str(heart) output further down.
The names and social security numbers of the patients were removed from the database and replaced with dummy values. One file, the one containing the Cleveland database, has been "processed"; all four unprocessed files also exist in this directory.
Support Vector Machine Classifier Implementation in R with caret
suppressMessages(library(dplyr))
suppressMessages(library(readr))
suppressMessages(library(corrplot))
suppressMessages(library(qgraph))
suppressMessages(library(ggplot2))
suppressMessages(library(PerformanceAnalytics))
suppressMessages(library(GGally))
suppressMessages(library(caret))
suppressMessages(library(rpart.plot))
suppressMessages(library(randomForest))
suppressMessages(library(corrr))
suppressMessages(library(psych))
suppressMessages(library(ggpubr))
suppressMessages(library(DataExplorer))
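The text never shows the data being read in; the analysis below assumes a data frame named heart containing the 14 attributes above. A minimal loading sketch, assuming the processed Cleveland data has been saved locally as heart.csv (a hypothetical file name):
# Hypothetical loading step: "heart.csv" is an assumed local copy of the
# processed 14-attribute Cleveland data.
heart <- read.csv("heart.csv")
str(heart)   # expect 303 obs. of 14 variables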
Data wrangling
Change sex from numeric to a category, convert target and the other coded variables to factors, and check for missing values.
heart <- heart %>% mutate(sex=ifelse(sex==1,"male","female"))
heart$sex <- factor(heart$sex)
heart$cp <- factor(heart$cp)
heart$restecg <- factor(heart$restecg)
heart$exang <- factor(heart$exang)
heart$ca <- factor(heart$ca)
heart$slope <- factor(heart$slope)
heart$thal <- factor(heart$thal)
heart$target <- factor(heart$target)
describe(heart)
## vars n mean sd median trimmed mad min max range skew
## age 1 303 54.37 9.08 55.0 54.54 10.38 29 77.0 48.0 -0.20
## sex* 2 303 1.68 0.47 2.0 1.73 0.00 1 2.0 1.0 -0.78
## cp* 3 303 1.97 1.03 2.0 1.86 1.48 1 4.0 3.0 0.48
## trestbps 4 303 131.62 17.54 130.0 130.44 14.83 94 200.0 106.0 0.71
## chol 5 303 246.26 51.83 240.0 243.49 47.44 126 564.0 438.0 1.13
## fbs 6 303 0.15 0.36 0.0 0.06 0.00 0 1.0 1.0 1.97
## restecg* 7 303 1.53 0.53 2.0 1.52 0.00 1 3.0 2.0 0.16
## thalach 8 303 149.65 22.91 153.0 150.98 22.24 71 202.0 131.0 -0.53
## exang* 9 303 1.33 0.47 1.0 1.28 0.00 1 2.0 1.0 0.74
## oldpeak 10 303 1.04 1.16 0.8 0.86 1.19 0 6.2 6.2 1.26
## slope* 11 303 2.40 0.62 2.0 2.46 1.48 1 3.0 2.0 -0.50
## ca* 12 303 1.73 1.02 1.0 1.54 0.00 1 5.0 4.0 1.30
## thal* 13 303 3.31 0.61 3.0 3.36 0.00 1 4.0 3.0 -0.47
## target* 14 303 1.54 0.50 2.0 1.56 0.00 1 2.0 1.0 -0.18
## kurtosis se
## age -0.57 0.52
## sex* -1.39 0.03
## cp* -1.21 0.06
## trestbps 0.87 1.01
## chol 4.36 2.98
## fbs 1.88 0.02
## restecg* -1.37 0.03
## thalach -0.10 1.32
## exang* -1.46 0.03
## oldpeak 1.50 0.07
## slope* -0.65 0.04
## ca* 0.78 0.06
## thal* 0.25 0.04
## target* -1.97 0.03
# A summary of the dataset
summary(heart)
## age sex cp trestbps chol
## Min. :29.00 female: 96 0:143 Min. : 94.0 Min. :126.0
## 1st Qu.:47.50 male :207 1: 50 1st Qu.:120.0 1st Qu.:211.0
## Median :55.00 2: 87 Median :130.0 Median :240.0
## Mean :54.37 3: 23 Mean :131.6 Mean :246.3
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:274.5
## Max. :77.00 Max. :200.0 Max. :564.0
## fbs restecg thalach exang oldpeak slope
## Min. :0.0000 0:147 Min. : 71.0 0:204 Min. :0.00 0: 21
## 1st Qu.:0.0000 1:152 1st Qu.:133.5 1: 99 1st Qu.:0.00 1:140
## Median :0.0000 2: 4 Median :153.0 Median :0.80 2:142
## Mean :0.1485 Mean :149.6 Mean :1.04
## 3rd Qu.:0.0000 3rd Qu.:166.0 3rd Qu.:1.60
## Max. :1.0000 Max. :202.0 Max. :6.20
## ca thal target
## 0:175 0: 2 0:138
## 1: 65 1: 18 1:165
## 2: 38 2:166
## 3: 20 3:117
## 4: 5
##
# checking for missing values
plot_missing(heart)
# Exploratory data analysis
prop.table(table(heart$target))
##
## 0 1
## 0.4554455 0.5445545
ggplot(heart, aes(x = sex, fill = sex)) + geom_bar() + ggtitle("Male / Female ratio")
plot_histogram(heart)
ggplot(data = heart, aes(x = age, fill = as.factor(target))) +
  geom_dotplot(alpha = .5, stackgroups = TRUE, binwidth = 1, binpositions = "all") +
  ggtitle("Age Distribution by Status")
Correlation, Variance and Covariance (Matrices): var, cov and cor compute the variance of x and the covariance or correlation of x and y if these are vectors. If x and y are matrices then the covariances (or correlations) between the columns of x and the columns of y are computed. The chart shows correlations between the age, trestbps, chol, and thalach variables.
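As a quick illustration of the relationship between these functions, the correlation is just the covariance rescaled by the two standard deviations:
# cor(x, y) equals cov(x, y) divided by sd(x) * sd(y):
cov(heart$age, heart$trestbps) / (sd(heart$age) * sd(heart$trestbps))
cor(heart$age, heart$trestbps)   # same value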
CLASSIFICATION BY DECISION TREES Decision trees are popular and simple statistical tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in a tree is the root node.
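Although rpart.plot is loaded above, the original text does not actually fit a tree; a minimal sketch of the structure just described:
# Fit and draw a classification tree (illustration only, on the full data).
# rpart.plot, loaded earlier, depends on the rpart package.
library(rpart)
tree_fit <- rpart(target ~ ., data = heart, method = "class")
rpart.plot(tree_fit, main = "Heart disease classification tree")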
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 1 2 2 2 ...
## $ cp : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 1 2 1 2 2 2 1 2 2 2 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
## $ ca : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ thal : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
## $ target : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
corrplot(cor(heart[,c(1,4,5,8)]), type="lower",method="number")
pairs.panels(heart[,c(1,4,5,8)], method = "pearson",
             hist.col = "#1fbbfa", density = TRUE, ellipses = TRUE, show.points = TRUE,
             pch = 1, lm = TRUE, cex.cor = .7, smoother = FALSE, stars = TRUE,
             main = "Heart Disease - pairs.panels")
Network graph of correlations
qgraph(cor(heart[,c(1,4,5,8)]))
set.seed(3033)
index <- createDataPartition(y = heart$target, p= 0.8, list = FALSE)
training <- heart[index,]
testing <- heart[-index,]
# Check that the two splits have distributions similar to the original dataset
colnames(training)
## [1] "age" "sex" "cp" "trestbps" "chol" "fbs"
## [7] "restecg" "thalach" "exang" "oldpeak" "slope" "ca"
## [13] "thal" "target"
str(training)
## 'data.frame': 243 obs. of 14 variables:
## $ age : int 63 37 56 57 57 56 44 57 54 48 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 2 2 2 1 ...
## $ cp : Factor w/ 4 levels "0","1","2","3": 4 3 2 1 1 2 2 3 1 3 ...
## $ trestbps: int 145 130 120 120 140 140 120 150 140 130 ...
## $ chol : int 233 250 236 354 192 294 263 168 239 275 ...
## $ fbs : int 1 0 0 0 0 0 0 0 0 0 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 1 2 2 2 2 1 2 2 2 2 ...
## $ thalach : int 150 187 178 163 148 153 173 174 160 139 ...
## $ exang : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
## $ oldpeak : num 2.3 3.5 0.8 0.6 0.4 1.3 0 1.6 1.2 0.2 ...
## $ slope : Factor w/ 3 levels "0","1","2": 1 1 3 3 2 2 3 3 3 3 ...
## $ ca : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ thal : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 2 3 4 3 3 3 ...
## $ target : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
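A quick sanity check, not shown in the original output: createDataPartition stratifies on target, so the class proportions of the two splits should be close.
prop.table(table(training$target))   # proportions in the training split
prop.table(table(testing$target))    # proportions in the testing split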
glm_fit <-glm(target ~ ., data=training, family="binomial")
glm_pred <- predict(glm_fit,newdata=testing[,-14],type="response")
glm_pred <- as.data.frame(glm_pred)
glm_pred <- glm_pred %>% mutate(glm_model=ifelse(glm_pred>0.5,1,0))
actual <- as.factor(testing$target)
confusionMatrix(as.factor(glm_pred$glm_model),actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 24 4
## 1 3 29
##
## Accuracy : 0.8833
## 95% CI : (0.7743, 0.9518)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 2.959e-08
##
## Kappa : 0.7651
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8889
## Specificity : 0.8788
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.9062
## Prevalence : 0.4500
## Detection Rate : 0.4000
## Detection Prevalence : 0.4667
## Balanced Accuracy : 0.8838
##
## 'Positive' Class : 0
##
library(caret)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
set.seed(3233)
training$target = factor(training$target)
svm_fit <- train(target ~., data = training, method = "svmLinear",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: restecg2
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
These warnings most likely reflect the rarity of restecg level 2 (only 4 cases in the full data), whose dummy variable can end up constant within a cross-validation resample.
svm_fit
## Support Vector Machines with Linear Kernel
##
## 243 samples
## 13 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 219, 219, 218, 219, 218, 219, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8289333 0.6522771
##
## Tuning parameter 'C' was held constant at a value of 1
svm_pred <- predict(svm_fit, newdata = testing)
confusionMatrix(svm_pred,actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 24 3
## 1 3 30
##
## Accuracy : 0.9
## 95% CI : (0.7949, 0.9624)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 4.558e-09
##
## Kappa : 0.798
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8889
## Specificity : 0.9091
## Pos Pred Value : 0.8889
## Neg Pred Value : 0.9091
## Prevalence : 0.4500
## Detection Rate : 0.4000
## Detection Prevalence : 0.4500
## Balanced Accuracy : 0.8990
##
## 'Positive' Class : 0
##
set.seed(3233)
svm_Radial <- train(target ~age+sex+cp+chol+fbs+thalach+exang+oldpeak+slope+ca+thal,
data = training, method = "svmRadial",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
svm_Radial
## Support Vector Machines with Radial Basis Function Kernel
##
## 243 samples
## 11 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (19), scaled (19)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 219, 219, 218, 219, 218, 219, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.8057333 0.6082035
## 0.50 0.8114000 0.6172288
## 1.00 0.8155667 0.6244030
## 2.00 0.8099000 0.6139260
## 4.00 0.8080667 0.6108269
## 8.00 0.7990000 0.5931980
## 16.00 0.8170000 0.6314212
## 32.00 0.8071667 0.6115396
## 64.00 0.8014000 0.6006593
## 128.00 0.7990000 0.5959580
##
## Tuning parameter 'sigma' was held constant at a value of 0.03632465
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.03632465 and C = 16.
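The accuracy profile across the tuned cost values can be visualized directly from the train object (plot not shown in the original):
plot(svm_Radial)   # cross-validated accuracy as a function of C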
Radial_pred <- predict(svm_Radial, newdata = testing)
confusionMatrix(Radial_pred,actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21 6
## 1 6 27
##
## Accuracy : 0.8
## 95% CI : (0.6767, 0.8922)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 4.67e-05
##
## Kappa : 0.596
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7778
## Specificity : 0.8182
## Pos Pred Value : 0.7778
## Neg Pred Value : 0.8182
## Prevalence : 0.4500
## Detection Rate : 0.3500
## Detection Prevalence : 0.4500
## Balanced Accuracy : 0.7980
##
## 'Positive' Class : 0
##
# fit a random forest model (using ranger)
rf_grid <- expand.grid(mtry = c(2, 3, 4, 5),
splitrule = c("gini", "extratrees"),
min.node.size = c(1, 3, 5))
ranger_fit <- train(target ~., data = training, method = "ranger",
                    trControl = trctrl,
                    preProcess = c("center", "scale"),
                    verbose = FALSE,
                    tuneLength = 10,  # ignored here: tuneGrid takes precedence when both are supplied
                    tuneGrid = rf_grid)
ranger_fit
## Random Forest
##
## 243 samples
## 13 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 219, 218, 219, 218, 219, 219, ...
## Resampling results across tuning parameters:
##
## mtry splitrule min.node.size Accuracy Kappa
## 2 gini 1 0.7981462 0.5900654
## 2 gini 3 0.7981462 0.5904543
## 2 gini 5 0.7964128 0.5866494
## 2 extratrees 1 0.7972487 0.5896329
## 2 extratrees 3 0.7914154 0.5770773
## 2 extratrees 5 0.7872513 0.5687665
## 3 gini 1 0.7865769 0.5664680
## 3 gini 3 0.7949103 0.5836004
## 3 gini 5 0.7915128 0.5771035
## 3 extratrees 1 0.8022487 0.6008994
## 3 extratrees 3 0.7971487 0.5903926
## 3 extratrees 5 0.7947821 0.5856478
## 4 gini 1 0.7856487 0.5656830
## 4 gini 3 0.7906154 0.5752536
## 4 gini 5 0.7890462 0.5715309
## 4 extratrees 1 0.7971487 0.5910520
## 4 extratrees 3 0.7971487 0.5911288
## 4 extratrees 5 0.7971821 0.5907189
## 5 gini 1 0.7915154 0.5781145
## 5 gini 3 0.7815821 0.5565398
## 5 gini 5 0.7890154 0.5720712
## 5 extratrees 1 0.7971154 0.5914102
## 5 extratrees 3 0.7996154 0.5962054
## 5 extratrees 5 0.8013487 0.5991850
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 3, splitrule = extratrees
## and min.node.size = 1.
ranger_pred <- predict(ranger_fit, newdata=testing)
# compare predicted outcome and true outcome
confusionMatrix(ranger_pred, actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23 4
## 1 4 29
##
## Accuracy : 0.8667
## 95% CI : (0.7541, 0.9406)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 1.653e-07
##
## Kappa : 0.7306
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8519
## Specificity : 0.8788
## Pos Pred Value : 0.8519
## Neg Pred Value : 0.8788
## Prevalence : 0.4500
## Detection Rate : 0.3833
## Detection Prevalence : 0.4500
## Balanced Accuracy : 0.8653
##
## 'Positive' Class : 0
##
Linear regression is used to predict the value of an outcome variable Y based on one or more input predictor variables X. The aim is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, so that we can use this formula to estimate the value of the response Y when only the predictor (X) values are known. lm() is the linear model function used for linear regression analysis.
lm(formula, data, subset, weights, …)
- formula: a model description, such as y ~ x
- data: optional; the variables in the model
- subset: optional; a subset vector of observations to be used in the fitting process
- weights: optional; a vector of weights to be used in the fitting process
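A minimal illustration of this interface on the heart data, regressing maximum heart rate on age (an ordinary linear model, distinct from the logistic fits used elsewhere in this analysis):
lm_demo <- lm(thalach ~ age, data = heart)   # maximum heart rate vs. age
coef(lm_demo)                                # intercept b0 and slope b1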
The Generalized Linear Model (GLZ) is a generalization of the general linear model (see, e.g., the General Linear Models, Multiple Regression, and ANOVA/MANOVA topics). In its simplest form, a linear model specifies the (linear) relationship between a dependent (or response) variable Y, and a set of predictor variables, the X’s, so that
Y = b0 + b1X1 + b2X2 + … + bkXk
In this equation b0 is the regression coefficient for the intercept and the bi values are the regression coefficients (for variables 1 through k) computed from the data.
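With family = binomial, as used below, glm() models the log-odds rather than Y itself, and probabilities are recovered from the linear predictor through the logistic (inverse-logit) function. A small sketch:
inv_logit <- function(eta) 1 / (1 + exp(-eta))   # maps the linear predictor to (0, 1)
inv_logit(0)   # 0.5: a linear predictor of zero corresponds to even odds
plogis(0)      # base R's built-in equivalent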
# Training The Model
glm_model <- glm(target ~., data = training, family = binomial)
summary(glm_model)
##
## Call:
## glm(formula = target ~ ., family = binomial, data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8277 -0.3153 0.1110 0.4859 3.1290
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.486e+00 4.213e+00 0.353 0.724289
## age 2.060e-02 2.872e-02 0.717 0.473281
## sexmale -1.885e+00 6.291e-01 -2.997 0.002725 **
## cp1 7.706e-01 6.107e-01 1.262 0.207007
## cp2 1.972e+00 6.108e-01 3.229 0.001241 **
## cp3 2.487e+00 7.897e-01 3.149 0.001638 **
## trestbps -2.303e-02 1.231e-02 -1.871 0.061383 .
## chol -5.901e-03 4.456e-03 -1.324 0.185386
## fbs 3.212e-01 6.460e-01 0.497 0.619076
## restecg1 1.940e-01 4.353e-01 0.446 0.655809
## restecg2 -1.362e+01 1.347e+03 -0.010 0.991933
## thalach 1.778e-02 1.295e-02 1.374 0.169572
## exang1 -1.030e+00 4.916e-01 -2.096 0.036088 *
## oldpeak -4.694e-01 2.520e-01 -1.863 0.062498 .
## slope1 -1.049e+00 9.156e-01 -1.145 0.252107
## slope2 3.676e-01 9.936e-01 0.370 0.711434
## ca1 -2.202e+00 5.710e-01 -3.856 0.000115 ***
## ca2 -3.279e+00 8.683e-01 -3.776 0.000159 ***
## ca3 -1.218e+00 9.938e-01 -1.225 0.220483
## ca4 1.209e+00 1.731e+00 0.699 0.484864
## thal1 2.006e+00 3.134e+00 0.640 0.522083
## thal2 2.096e+00 3.023e+00 0.693 0.488139
## thal3 1.007e+00 3.027e+00 0.333 0.739305
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 335.05 on 242 degrees of freedom
## Residual deviance: 153.00 on 220 degrees of freedom
## AIC: 199
##
## Number of Fisher Scoring iterations: 15
par(mfrow=c(2,2))
plot(glm_model,sub.caption="Generalized Linear Model")
# Testing the Model
glm_probs <- predict(glm_model, newdata = testing, type = "response")
glm_fit <- factor(ifelse(glm_probs > 0.5, 1, 0))  # threshold probabilities at 0.5
actual <- as.factor(testing$target)
confusionMatrix(glm_fit,actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 24 4
## 1 3 29
##
## Accuracy : 0.8833
## 95% CI : (0.7743, 0.9518)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 2.959e-08
##
## Kappa : 0.7651
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8889
## Specificity : 0.8788
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.9062
## Prevalence : 0.4500
## Detection Rate : 0.4000
## Detection Prevalence : 0.4667
## Balanced Accuracy : 0.8838
##
## 'Positive' Class : 0
##
ANOVA (analysis of variance) compares the variability between group means to the variability within groups (a variability ratio); for a fitted glm, anova() produces an analysis-of-deviance table.
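A minimal sketch applying this to the logistic model fitted above (not run in the original):
# Sequential analysis-of-deviance table for glm_model; test = "Chisq" assesses
# the drop in deviance as each term is added to the model.
anova(glm_model, test = "Chisq")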
library(randomForest)
set.seed(123)
training$target <- as.factor(training$target)
fit.rf <- randomForest(target ~.,
data = training,
ntree = 500, # number of trees to create
importance = TRUE)
pred.rf <- predict(fit.rf, testing)   # predict.randomForest already returns a factor
reference <- as.factor(testing$target)
confusionMatrix(pred.rf, reference)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 24 4
## 1 3 29
##
## Accuracy : 0.8833
## 95% CI : (0.7743, 0.9518)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 2.959e-08
##
## Kappa : 0.7651
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8889
## Specificity : 0.8788
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.9062
## Prevalence : 0.4500
## Detection Rate : 0.4000
## Detection Prevalence : 0.4667
## Balanced Accuracy : 0.8838
##
## 'Positive' Class : 0
##
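Since the forest was grown with importance = TRUE, variable importance can be inspected directly (plot not shown in the original):
varImpPlot(fit.rf, main = "randomForest variable importance")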
library(naivebayes)
## naivebayes 0.9.6 loaded
nb <- naive_bayes(training[,-14],as.factor(training$target))
## Warning: naive_bayes(): Feature restecg - zero probabilities are present.
## Consider Laplace smoothing.
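As the warning suggests, Laplace (add-one) smoothing removes the zero probabilities caused by sparse factor levels; a one-line variant (not fitted in the original):
# laplace is a built-in argument of naive_bayes(); laplace = 1 applies add-one smoothing.
nb_smooth <- naive_bayes(training[,-14], as.factor(training$target), laplace = 1)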
nb_pred <-predict(nb,newdata=testing)
## Warning: predict.naive_bayes(): More features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
reference <- as.factor(testing$target)
confusionMatrix(nb_pred, reference)   # predict.naive_bayes already returns a factor
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23 4
## 1 4 29
##
## Accuracy : 0.8667
## 95% CI : (0.7541, 0.9406)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 1.653e-07
##
## Kappa : 0.7306
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8519
## Specificity : 0.8788
## Pos Pred Value : 0.8519
## Neg Pred Value : 0.8788
## Prevalence : 0.4500
## Detection Rate : 0.3833
## Detection Prevalence : 0.4500
## Balanced Accuracy : 0.8653
##
## 'Positive' Class : 0
##
set.seed(10)
objControl <- trainControl(method='repeatedcv', number=10, repeats = 10)
kknn_model <- train(target ~., data = training, method = "kknn",
trControl=objControl, verbose=FALSE,
tuneLength = 10)
kknn_Prediction <- predict(kknn_model, testing)
confusionMatrix(kknn_Prediction, actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 24 6
## 1 3 27
##
## Accuracy : 0.85
## 95% CI : (0.7343, 0.929)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 8.068e-07
##
## Kappa : 0.7
##
## Mcnemar's Test P-Value : 0.505
##
## Sensitivity : 0.8889
## Specificity : 0.8182
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.9000
## Prevalence : 0.4500
## Detection Rate : 0.4000
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8535
##
## 'Positive' Class : 0
##
set.seed(10)
objControl <- trainControl(method='repeatedcv', number=10, repeats = 10)
GAM_model <- train(target ~., data = training, method = "gam",
trControl=objControl, verbose=FALSE,
tuneLength = 10)
## Loading required package: mgcv
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
## Warning in newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L =
## G$L, : Fitting terminated with step failure - check results carefully
## Warning in newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L =
## G$L, : Iteration limit reached without full convergence - check carefully
## Warning in newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L =
## G$L, : Iteration limit reached without full convergence - check carefully
GAM_Prediction <- predict(GAM_model, testing)
confusionMatrix(GAM_Prediction, actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23 6
## 1 4 27
##
## Accuracy : 0.8333
## 95% CI : (0.7148, 0.9171)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 3.483e-06
##
## Kappa : 0.6656
##
## Mcnemar's Test P-Value : 0.7518
##
## Sensitivity : 0.8519
## Specificity : 0.8182
## Pos Pred Value : 0.7931
## Neg Pred Value : 0.8710
## Prevalence : 0.4500
## Detection Rate : 0.3833
## Detection Prevalence : 0.4833
## Balanced Accuracy : 0.8350
##
## 'Positive' Class : 0
##
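The caret-trained models can also be compared on their cross-validated accuracy with resamples(); a minimal sketch covering the three models that share the trctrl resampling scheme (kknn and GAM used a different trainControl, and the hand-fitted glm/randomForest models are not train objects):
# Compare resampling distributions of the caret models trained with trctrl.
res <- resamples(list(svm_linear = svm_fit,
                      svm_radial = svm_Radial,
                      random_forest = ranger_fit))
summary(res)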
Model | Test Accuracy
---|---
Support Vector Machine (linear kernel) | 90.0%
Generalized Linear Model (logistic) | 88.3%
Random Forest (randomForest package) | 88.3%
Random Forest (caret, ranger) | 86.7%
Naive Bayes | 86.7%
k-Nearest Neighbors (kknn) | 85.0%
Generalized Additive Model | 83.3%
Support Vector Machine (radial kernel) | 80.0%
Joe Long, Data Analyst, Cabrillo Research