Executed by Neha Raut
To study the heart disease dataset and build a classifier that predicts whether a patient is suffering from heart disease or not
Loading the caret and e1071 libraries
#The caret package provides createDataPartition(), which partitions our data into training and test sets.
#e1071 provides SVM classification (caret's "svmLinear" method used below loads kernlab for the fit)
library(caret)
library(e1071)
heart_df <- read.csv("heart_tidy.csv", sep=',', header = FALSE)
str(heart_df)
## 'data.frame': 300 obs. of 14 variables:
## $ V1 : int 63 67 67 37 41 56 62 57 63 53 ...
## $ V2 : int 1 1 1 1 0 1 0 0 1 1 ...
## $ V3 : int 1 4 4 3 2 2 4 4 4 4 ...
## $ V4 : int 145 160 120 130 130 120 140 120 130 140 ...
## $ V5 : int 233 286 229 250 204 236 268 354 254 203 ...
## $ V6 : int 1 0 0 0 0 0 0 0 0 1 ...
## $ V7 : int 2 2 2 0 2 0 2 0 2 2 ...
## $ V8 : int 150 108 129 187 172 178 160 163 147 155 ...
## $ V9 : int 0 1 1 0 0 0 0 1 0 1 ...
## $ V10: num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ V11: int 3 2 2 3 1 1 3 1 2 3 ...
## $ V12: int 0 3 2 0 0 0 2 0 1 0 ...
## $ V13: int 6 3 7 3 3 3 3 3 7 7 ...
## $ V14: int 0 1 1 0 0 0 1 0 1 1 ...
head(heart_df)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3 3 1
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0 3 0
dim(heart_df)
## [1] 300 14
Divide the data in a 70 (training) to 30 (testing) format
The “y” parameter takes the variable according to which the data is partitioned. In our case the target variable is V14, so we pass heart_df$V14.
The “p” parameter holds a decimal value between 0 and 1 giving the proportion of the data that goes to training. We use p=0.7, so the data is split 70:30: 70% is used for training the model and the remaining 30% for testing it.
The “list” parameter controls whether the result is returned as a list or a matrix. We pass FALSE so that a matrix of row indices is returned instead of a list.
#v14 is target variable
#data slicing
set.seed(2)
intrain <- createDataPartition(y = heart_df$V14, p = 0.7, list = FALSE)
training <- heart_df[intrain,]
testing <- heart_df[-intrain,]
dim(training)
## [1] 210 14
dim(testing)
## [1] 90 14
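As a rough illustration of what a stratified 70:30 split does, the sketch below reproduces the idea in base R on made-up data. The data frame `toy` and the helper `stratified_index()` are hypothetical names introduced here, and this is not caret's actual implementation, only an approximation of the idea for a categorical outcome.

```r
# Hypothetical toy data: 100 rows with a binary outcome (54 zeros, 46 ones)
set.seed(2)
toy <- data.frame(x = rnorm(100), V14 = rep(c(0, 1), times = c(54, 46)))

# Hypothetical helper: sample a proportion p of the row indices within each class
stratified_index <- function(y, p) {
  rows_by_class <- split(seq_along(y), y)
  picked <- lapply(rows_by_class, function(rows) {
    rows[sample.int(length(rows), size = round(p * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

idx <- stratified_index(toy$V14, p = 0.7)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Class proportions stay close to the original 54:46 in both subsets
prop.table(table(toy_train$V14))
prop.table(table(toy_test$V14))
```

Because the sampling is done within each class, neither subset can end up with a very different share of positive cases than the full data.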
Check for any missing (NA) values
anyNA(heart_df)
## [1] FALSE
summary(heart_df)
## V1 V2 V3 V4
## Min. :29.00 Min. :0.00 Min. :1.000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.00 1st Qu.:3.000 1st Qu.:120.0
## Median :56.00 Median :1.00 Median :3.000 Median :130.0
## Mean :54.48 Mean :0.68 Mean :3.153 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.00 3rd Qu.:4.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.00 Max. :4.000 Max. :200.0
## V5 V6 V7 V8
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.8
## Median :241.5 Median :0.0000 Median :0.5000 Median :153.0
## Mean :246.9 Mean :0.1467 Mean :0.9867 Mean :149.7
## 3rd Qu.:275.2 3rd Qu.:0.0000 3rd Qu.:2.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## V9 V10 V11 V12
## Min. :0.0000 Min. :0.00 Min. :1.000 Min. :0.00
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.00
## Median :0.0000 Median :0.80 Median :2.000 Median :0.00
## Mean :0.3267 Mean :1.05 Mean :1.603 Mean :0.67
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.00
## Max. :1.0000 Max. :6.20 Max. :3.000 Max. :3.00
## V13 V14
## Min. :3.000 Min. :0.00
## 1st Qu.:3.000 1st Qu.:0.00
## Median :3.000 Median :0.00
## Mean :4.727 Mean :0.46
## 3rd Qu.:7.000 3rd Qu.:1.00
## Max. :7.000 Max. :1.00
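Our target variable should be categorical, so convert it to a factor
training[["V14"]] <- factor(training[["V14"]])
#also convert the test labels, since recent caret versions require confusionMatrix() inputs to be factors with the same levels
testing[["V14"]] <- factor(testing[["V14"]], levels = levels(training[["V14"]]))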
#before training the model, set up the resampling scheme with trainControl()
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3) #number is the number of folds, repeats is how many times the cross-validation is repeated
#train model
svm_linear <- train(V14 ~ ., data = training, method = "svmLinear",
                    trControl = trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 10)
## Loading required package: kernlab
## Warning: package 'kernlab' was built under R version 3.4.4
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
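To make the resampling scheme concrete, here is a minimal base-R sketch of 10-fold cross-validation repeated 3 times on 210 rows. It uses plain random folds, whereas caret builds folds that are balanced on the outcome, so fold sizes from caret can differ by a row or so. The variable names are illustrative only.

```r
n_rows  <- 210   # size of the training set
k       <- 10    # folds per repeat ("number" in trainControl)
repeats <- 3     # how many times the 10-fold split is redrawn

set.seed(2)
# For each repeat, assign every row to one of k folds at random
fold_ids <- replicate(repeats, sample(rep(seq_len(k), length.out = n_rows)))

# Each fold is held out once per repeat, so the model is fit k * repeats times
n_fits <- k * repeats

# Rows used for fitting when a given fold is held out
train_sizes <- n_rows - apply(fold_ids, 2, tabulate, nbins = k)

n_fits
range(train_sizes)
```

Each of the 30 fits trains on about 90% of the training rows and is scored on the remaining 10%; the reported accuracy is the average over those held-out scores.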
#the preProcess parameter centers and scales the data
svm_linear
## Support Vector Machines with Linear Kernel
##
## 210 samples
## 13 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 189, 189, 189, 189, 188, 189, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8637085 0.7246394
##
## Tuning parameter 'C' was held constant at a value of 1
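The "centered (13), scaled (13)" line refers to standardising each predictor. A minimal base-R sketch of the idea follows, assuming (as is standard practice) that the means and standard deviations are estimated on the training rows and then reused for new data. The small matrices and column names are invented for illustration.

```r
# Invented example: two predictors on very different scales
train_x <- cbind(small_scale = c(1.2, 0.4, 2.6, 0.8),
                 large_scale = c(233, 286, 250, 204))
test_x  <- cbind(small_scale = c(1.5, 3.1),
                 large_scale = c(236, 268))

# Estimate centering and scaling values on the training rows only
mu    <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)

# Apply the same transformation to both sets
train_scaled <- scale(train_x, center = mu, scale = sigma)
test_scaled  <- scale(test_x,  center = mu, scale = sigma)

round(colMeans(train_scaled), 10)  # training columns now have mean 0
apply(train_scaled, 2, sd)         # and standard deviation 1
test_scaled                        # test rows reuse the training statistics
```

Putting the predictors on a common scale matters for an SVM because the margin is measured in the predictors' units, so an unscaled column with large values would otherwise dominate.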
caret’s “svmLinear” method has a single tuning parameter, the cost C, and its default grid holds it constant at 1, so only C = 1 was evaluated. Our model is now trained with C = 1, and we are ready to predict classes for our test set.
test_pred <- predict(svm_linear, newdata = testing) # the first parameter is our trained model
test_pred
## [1] 0 1 1 0 0 0 0 0 1 1 0 1 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1 1 0
## [36] 0 0 0 1 1 1 0 0 1 0 1 1 1 0 0 1 0 0 1 1 0 1 1 0 1 0 0 0 1 0 0 1 0 0 1
## [71] 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 1 0 1 0 0
## Levels: 0 1
The output above gives the predicted values: 0 means the patient does not suffer from heart disease and 1 means the patient suffers from heart disease.
summary(test_pred)
## 0 1
## 49 41
Check the accuracy of our model
confusionMatrix(test_pred, testing$V14)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 40 9
## 1 10 31
##
## Accuracy : 0.7889
## 95% CI : (0.6901, 0.8679)
## No Information Rate : 0.5556
## P-Value [Acc > NIR] : 3.228e-06
##
## Kappa : 0.5736
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8000
## Specificity : 0.7750
## Pos Pred Value : 0.8163
## Neg Pred Value : 0.7561
## Prevalence : 0.5556
## Detection Rate : 0.4444
## Detection Prevalence : 0.5444
## Balanced Accuracy : 0.7875
##
## 'Positive' Class : 0
##
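The output shows that the model's accuracy on the test set is 78.89%. Note that caret reports class 0 (no heart disease) as the 'Positive' class, so sensitivity here is the rate at which patients without heart disease are correctly identified.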
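The headline statistics can be recomputed by hand from the four counts in the confusion matrix above, which is a useful check on how each one is defined. The sketch below treats class 0 as the positive class, as the caret output does; the variable names are illustrative.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(40, 10, 9, 31), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n <- sum(cm)
accuracy    <- sum(diag(cm)) / n               # (40 + 31) / 90
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # 40 / 50
specificity <- cm["1", "1"] / sum(cm[, "1"])   # 31 / 40
ppv         <- cm["0", "0"] / sum(cm["0", ])   # 40 / 49
npv         <- cm["1", "1"] / sum(cm["1", ])   # 31 / 41

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv, kappa = kappa_stat), 4)
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value, and Kappa lines in the output above.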
We can try to improve the performance by tuning the C value of the linear classifier. This can be done by supplying candidate values to a grid search.
The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example.
For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.
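The trade-off described above can be sketched with the usual soft-margin objective: half the squared norm of the weights plus C times the total hinge loss. The example below is a toy illustration in base R, not what kernlab runs internally. The one-dimensional data and the two candidate hyperplanes are invented, and `svm_objective()` is a hypothetical helper.

```r
# Invented 1-D data: two negatives and two positives, labels coded as -1 / +1
X <- matrix(c(-2, -0.2, 0.2, 2), ncol = 1)
y <- c(-1, -1, 1, 1)

# Hypothetical helper: soft-margin objective for a given hyperplane (w, b)
svm_objective <- function(w, b, X, y, C) {
  margins <- y * as.vector(X %*% w + b)
  slack   <- pmax(0, 1 - margins)      # hinge loss per point
  0.5 * sum(w^2) + C * sum(slack)
}

# Candidate A: small weight = wide margin, but two points fall inside it
# Candidate B: large weight = narrow margin, no point violates it
candidates <- list(wide = 0.5, narrow = 5)

for (C in c(0.01, 100)) {
  scores <- sapply(candidates, svm_objective, b = 0, X = X, y = y, C = C)
  cat("C =", C, "-> preferred hyperplane:", names(which.min(scores)), "\n")
}
```

With the small C the wide-margin candidate has the lower objective despite its margin violations, while with the large C the penalty on those violations dominates and the narrow-margin candidate wins.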
grid <- expand.grid(C = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 5))
svmgrid <- train(V14 ~ ., data = training, method = "svmLinear",
                 trControl = trctrl,
                 preProcess = c("center", "scale"),
                 tuneGrid = grid) #tuneLength is not needed when tuneGrid is supplied
svmgrid
## Support Vector Machines with Linear Kernel
##
## 210 samples
## 13 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 189, 189, 190, 189, 189, 189, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.01 0.8681674 0.7329887
## 0.05 0.8634055 0.7234586
## 0.10 0.8634055 0.7236340
## 0.25 0.8602309 0.7172408
## 0.50 0.8569769 0.7108924
## 0.75 0.8569769 0.7110924
## 1.00 0.8569769 0.7110649
## 1.25 0.8585642 0.7142323
## 1.50 0.8585642 0.7143156
## 1.75 0.8569769 0.7112607
## 2.00 0.8569769 0.7112607
## 5.00 0.8586436 0.7145267
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.01.
plot(svmgrid)
The plot above, like the resampling table, shows that our classifier gives its best cross-validated accuracy at C = 0.01.
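As a cross-check on the plot, the best value can be read programmatically from the resampling results. The sketch below retypes the printed table into a data frame (named `cv_results` here for illustration) so that it runs on its own; on the fitted object, the same information is held in `svmgrid$results` and `svmgrid$bestTune`.

```r
# Accuracy column retyped from the resampling table printed above
cv_results <- data.frame(
  C = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 5),
  Accuracy = c(0.8681674, 0.8634055, 0.8634055, 0.8602309, 0.8569769, 0.8569769,
               0.8569769, 0.8585642, 0.8585642, 0.8569769, 0.8569769, 0.8586436)
)

# Row with the highest cross-validated accuracy
best <- cv_results[which.max(cv_results$Accuracy), ]
best

# Gap between the best and worst C, as a fraction of accuracy
diff(range(cv_results$Accuracy))
```

The row selected is C = 0.01, matching the final value caret reports. The spread across the whole grid is only about one percentage point, so the choice of C appears to matter little on this dataset.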
Test the model with this C value
test_pred <- predict(svmgrid, newdata= testing)
test_pred
## [1] 0 1 1 0 0 0 0 0 1 1 0 1 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1 1 0
## [36] 0 0 0 1 1 1 0 0 1 0 0 1 1 0 0 1 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1
## [71] 0 1 1 0 0 0 1 1 0 0 1 1 0 0 1 1 0 1 0 0
## Levels: 0 1
summary(test_pred)
## 0 1
## 54 36
confusionMatrix(test_pred, testing$V14)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 43 11
## 1 7 29
##
## Accuracy : 0.8
## 95% CI : (0.7025, 0.8769)
## No Information Rate : 0.5556
## P-Value [Acc > NIR] : 1.034e-06
##
## Kappa : 0.5909
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.8600
## Specificity : 0.7250
## Pos Pred Value : 0.7963
## Neg Pred Value : 0.8056
## Prevalence : 0.5556
## Detection Rate : 0.4778
## Detection Prevalence : 0.6000
## Balanced Accuracy : 0.7925
##
## 'Positive' Class : 0
##
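To put the two test-set results side by side, the accuracies can be recomputed from the confusion matrices: 71 of 90 correct for the default model and 72 of 90 for the tuned one. The sketch below uses base R's `binom.test()` to attach an exact binomial interval to the tuned model's accuracy. Treating this interval as a rough guide for comparing the two models is an assumption made here, not a formal test.

```r
n_test <- 90
correct_default <- 40 + 31   # diagonal of the first confusion matrix (C = 1)
correct_tuned   <- 43 + 29   # diagonal of the second confusion matrix (C = 0.01)

acc_default <- correct_default / n_test
acc_tuned   <- correct_tuned / n_test

# Exact (Clopper-Pearson) 95% interval for the tuned model's accuracy
ci_tuned <- binom.test(correct_tuned, n_test)$conf.int

round(c(default = acc_default, tuned = acc_tuned), 4)
round(ci_tuned, 4)

# Is the default model's accuracy inside the tuned model's interval?
acc_default >= ci_tuned[1] && acc_default <= ci_tuned[2]
```

The difference is a single additional correct prediction out of 90, so the tuned model's gain on this test set should be read with caution.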