Execute by Neha Raut

Problem Statement:

Study the heart disease dataset and build a classifier that predicts whether a patient is suffering from heart disease.

Step 1: Load Data

Load the caret and e1071 libraries

#caret provides createDataPartition(), which partitions the data into training and test sets.
#e1071 provides SVM classification functions (caret's "svmLinear" method itself is fitted through kernlab, which caret loads on demand)

library(caret)
library(e1071)
heart_df  <- read.csv("heart_tidy.csv", sep=',', header = FALSE)
str(heart_df)
## 'data.frame':    300 obs. of  14 variables:
##  $ V1 : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ V2 : int  1 1 1 1 0 1 0 0 1 1 ...
##  $ V3 : int  1 4 4 3 2 2 4 4 4 4 ...
##  $ V4 : int  145 160 120 130 130 120 140 120 130 140 ...
##  $ V5 : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ V6 : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ V7 : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ V8 : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ V9 : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ V10: num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ V11: int  3 2 2 3 1 1 3 1 2 3 ...
##  $ V12: int  0 3 2 0 0 0 2 0 1 0 ...
##  $ V13: int  6 3 7 3 3 3 3 3 7 7 ...
##  $ V14: int  0 1 1 0 0 0 1 0 1 1 ...
head(heart_df)
##   V1 V2 V3  V4  V5 V6 V7  V8 V9 V10 V11 V12 V13 V14
## 1 63  1  1 145 233  1  2 150  0 2.3   3   0   6   0
## 2 67  1  4 160 286  0  2 108  1 1.5   2   3   3   1
## 3 67  1  4 120 229  0  2 129  1 2.6   2   2   7   1
## 4 37  1  3 130 250  0  0 187  0 3.5   3   0   3   0
## 5 41  0  2 130 204  0  2 172  0 1.4   1   0   3   0
## 6 56  1  2 120 236  0  0 178  0 0.8   1   0   3   0
dim(heart_df)
## [1] 300  14

Split the data into training set and testing set

Divide the data in a 70 (training) to 30 (testing) ratio.

The “y” parameter takes the variable according to which the data is partitioned. In our case the target variable is V14, so we pass heart_df$V14.

The “p” parameter holds a decimal value between 0 and 1 that gives the proportion of rows assigned to the training set. We use p=0.7, so 70% of the data is used for training and the remaining 30% for testing the model.

The “list” parameter controls whether the result is returned as a list or as a matrix. We pass FALSE so that a matrix of row indices is returned, which can be used directly to subset the data frame.

#V14 is the target variable
#data slicing
set.seed(2)
intrain <- createDataPartition(y = heart_df$V14, p = 0.7, list = FALSE)

training <- heart_df[intrain,]
testing <- heart_df[-intrain,]

dim(training)
## [1] 210  14
dim(testing)
## [1] 90 14
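
To make the idea of a class-aware 70:30 split concrete, here is a minimal base R sketch. It is not caret's implementation: `stratified_split` is a hypothetical helper, and `toy_y` is a made-up label vector whose class shares only mimic the 54% / 46% balance reported by summary() for V14. The exact row counts can differ slightly from what createDataPartition() returns.

```r
# Hypothetical helper (not part of caret): sample a fraction p of the
# row indices within each class, so both parts keep similar class shares.
stratified_split <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(2)
toy_y <- rep(c(0, 1), times = c(162, 138))   # made-up labels, not the real data
train_idx <- stratified_split(toy_y, p = 0.7)

length(train_idx)                      # rows assigned to training
prop.table(table(toy_y[train_idx]))    # class shares in the training part
prop.table(table(toy_y[-train_idx]))   # class shares in the testing part
```

Sampling within each class keeps the proportion of positive cases nearly identical in both parts, which matters for a dataset this small.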

Check for missing values

anyNA(heart_df)
## [1] FALSE
summary(heart_df)
##        V1              V2             V3              V4       
##  Min.   :29.00   Min.   :0.00   Min.   :1.000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.00   1st Qu.:3.000   1st Qu.:120.0  
##  Median :56.00   Median :1.00   Median :3.000   Median :130.0  
##  Mean   :54.48   Mean   :0.68   Mean   :3.153   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.00   3rd Qu.:4.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.00   Max.   :4.000   Max.   :200.0  
##        V5              V6               V7               V8       
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.8  
##  Median :241.5   Median :0.0000   Median :0.5000   Median :153.0  
##  Mean   :246.9   Mean   :0.1467   Mean   :0.9867   Mean   :149.7  
##  3rd Qu.:275.2   3rd Qu.:0.0000   3rd Qu.:2.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##        V9              V10            V11             V12      
##  Min.   :0.0000   Min.   :0.00   Min.   :1.000   Min.   :0.00  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.00  
##  Median :0.0000   Median :0.80   Median :2.000   Median :0.00  
##  Mean   :0.3267   Mean   :1.05   Mean   :1.603   Mean   :0.67  
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.00  
##  Max.   :1.0000   Max.   :6.20   Max.   :3.000   Max.   :3.00  
##       V13             V14      
##  Min.   :3.000   Min.   :0.00  
##  1st Qu.:3.000   1st Qu.:0.00  
##  Median :3.000   Median :0.00  
##  Mean   :4.727   Mean   :0.46  
##  3rd Qu.:7.000   3rd Qu.:1.00  
##  Max.   :7.000   Max.   :1.00

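The target variable must be categorical for classification, so convert V14 to a factor in both the training and the testing set.

training[["V14"]] <- factor(training[["V14"]])
testing[["V14"]] <- factor(testing[["V14"]])  #test labels too, so confusionMatrix() compares factors with the same levels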
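
As a small illustration of why the conversion matters, the sketch below uses a made-up vector in the style of V14 (not the real column). caret's train() fits a regression when the outcome is numeric and a classifier when it is a factor, so the type of the target decides which kind of model is built.

```r
raw_target <- c(0L, 1L, 1L, 0L, 0L)          # made-up values, 0 = no disease, 1 = disease
target <- factor(raw_target, levels = c(0, 1))

is.numeric(raw_target)   # TRUE: would be treated as a regression outcome
is.factor(target)        # TRUE: treated as a classification outcome
levels(target)           # "0" "1"
table(target)            # counts per class
```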

Train our model

#before training the model, define the resampling scheme with trainControl()
trctrl <- trainControl(method = "repeatedcv", number=10, repeats = 3) #number = number of folds, repeats = how many times the whole cross-validation is repeated
#train the model
svm_linear <- train(V14~. , data=training, method = "svmLinear",
                    trControl = trctrl,
                    preProcess = c("center", "scale"),  
                    tuneLength=10)
## Loading required package: kernlab
## Warning: package 'kernlab' was built under R version 3.4.4
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
#the preProcess argument centers and scales the predictors before fitting
svm_linear
## Support Vector Machines with Linear Kernel 
## 
## 210 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 189, 189, 189, 189, 188, 189, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8637085  0.7246394
## 
## Tuning parameter 'C' was held constant at a value of 1
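
The resampling line in this output can be reproduced in outline with base R. The sketch below assumes plain random folds. caret appears to balance the folds by class, which is presumably why its summary occasionally shows 188 instead of 189. It only counts rows and fits no model.

```r
n <- 210        # training rows
k <- 10         # folds (number = 10)
repeats <- 3    # repeats = 3

set.seed(1)
train_sizes <- sapply(seq_len(repeats), function(r) {
  # Randomly assign every row to one of k equally sized folds.
  fold_id <- sample(rep(seq_len(k), length.out = n))
  # Training-set size when each fold in turn is held out.
  n - tabulate(fold_id, nbins = k)
})

dim(train_sizes)      # one row per fold, one column per repeat
length(train_sizes)   # total number of model fits
range(train_sizes)    # size of each training subset
```

Each of the 30 fits trains on about 90% of the 210 rows and is scored on the held-out 10%. The reported accuracy is the average over those 30 scores.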

For the linear kernel, caret's default tuning grid contains a single value, so the model was evaluated only at C = 1. With the model trained at C = 1, we are ready to predict classes for the test set.

test_pred <- predict(svm_linear, newdata= testing)   # first parameter is our trained model
test_pred  
##  [1] 0 1 1 0 0 0 0 0 1 1 0 1 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1 1 0
## [36] 0 0 0 1 1 1 0 0 1 0 1 1 1 0 0 1 0 0 1 1 0 1 1 0 1 0 0 0 1 0 0 1 0 0 1
## [71] 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 1 0 1 0 0
## Levels: 0 1

The output above lists the predicted class for each test row: 0 means the patient does not suffer from heart disease and 1 means the patient does.

summary(test_pred)
##  0  1 
## 49 41

Check the accuracy of our Model

confusionMatrix(test_pred, testing$V14) 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 40  9
##          1 10 31
##                                           
##                Accuracy : 0.7889          
##                  95% CI : (0.6901, 0.8679)
##     No Information Rate : 0.5556          
##     P-Value [Acc > NIR] : 3.228e-06       
##                                           
##                   Kappa : 0.5736          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 0.7750          
##          Pos Pred Value : 0.8163          
##          Neg Pred Value : 0.7561          
##              Prevalence : 0.5556          
##          Detection Rate : 0.4444          
##    Detection Prevalence : 0.5444          
##       Balanced Accuracy : 0.7875          
##                                           
##        'Positive' Class : 0               
## 

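The output shows that the model's accuracy on the test set is 78.89% (71 of 90 predictions correct), which is lower than the cross-validated accuracy of about 86.4% obtained on the training data.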
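
The headline statistics can be recomputed by hand from the four counts in the confusion matrix. The base R sketch below copies those counts from the output above. Note that caret reports class 0 (no heart disease) as the 'Positive' class, so sensitivity here is the share of healthy patients that were correctly identified.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(c(40, 10, 9, 31), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # (40 + 31) / 90
sensitivity <- cm["0", "0"] / sum(cm[, "0"])    # 40 / 50, positive class is 0
specificity <- cm["1", "1"] / sum(cm[, "1"])    # 31 / 40
ppv         <- cm["0", "0"] / sum(cm["0", ])    # 40 / 49
npv         <- cm["1", "1"] / sum(cm["1", ])    # 31 / 41
balanced    <- (sensitivity + specificity) / 2

round(c(Accuracy = accuracy, Sensitivity = sensitivity, Specificity = specificity,
        PosPredValue = ppv, NegPredValue = npv, BalancedAccuracy = balanced), 4)
```

In a screening context the 9 patients with heart disease who were predicted as healthy are usually the costlier errors, so specificity (with class 0 as positive) deserves as much attention as overall accuracy.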

Building & tuning of an SVM classifier with different values of C

We can try to improve performance by tuning C instead of leaving it at its default of 1. This is done by supplying a set of candidate values to a grid search.

The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example.

For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.
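
The trade-off described above can be seen on a tiny one-dimensional example. The sketch below is only an illustration: the four points and the two candidate hyperplanes are made up, `svm_objective` is a hypothetical helper that evaluates the standard soft-margin objective (half the squared norm of w plus C times the summed hinge losses), and no optimizer is run.

```r
# Soft-margin objective: 0.5 * ||w||^2 + C * sum of hinge losses.
svm_objective <- function(w, b, X, y, C) {
  margins <- y * as.vector(X %*% w + b)
  slack <- pmax(0, 1 - margins)      # zero for points outside the margin
  0.5 * sum(w^2) + C * sum(slack)
}

# Made-up data: the third point is a +1 sitting close to the -1 group.
X <- matrix(c(-2, -1, -0.5, 2), ncol = 1)
y <- c(-1, -1, 1, 1)

narrow <- list(w = 4,     b = 3)       # classifies all points, small margin
wide   <- list(w = 2 / 3, b = -1 / 3)  # large margin, misclassifies the third point

compare <- sapply(c(small_C = 0.01, large_C = 5), function(C) {
  c(narrow = svm_objective(narrow$w, narrow$b, X, y, C),
    wide   = svm_objective(wide$w,   wide$b,   X, y, C))
})
round(compare, 3)   # the lower value in each column is the preferred hyperplane
```

With the small C the wide-margin hyperplane has the lower objective despite its one error. With the large C the penalty for that error dominates and the narrow-margin hyperplane wins.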

grid <- expand.grid(C = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 5))
#tuneLength is omitted here because it is ignored when tuneGrid is supplied
svmgrid <- train(V14 ~., data = training, method = "svmLinear",
                         trControl=trctrl,
                         preProcess = c("center", "scale"),
                         tuneGrid = grid)
svmgrid
## Support Vector Machines with Linear Kernel 
## 
## 210 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 189, 189, 190, 189, 189, 189, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.01  0.8681674  0.7329887
##   0.05  0.8634055  0.7234586
##   0.10  0.8634055  0.7236340
##   0.25  0.8602309  0.7172408
##   0.50  0.8569769  0.7108924
##   0.75  0.8569769  0.7110924
##   1.00  0.8569769  0.7110649
##   1.25  0.8585642  0.7142323
##   1.50  0.8585642  0.7143156
##   1.75  0.8569769  0.7112607
##   2.00  0.8569769  0.7112607
##   5.00  0.8586436  0.7145267
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was C = 0.01.
plot(svmgrid) 

The plot and the table above show that the classifier achieves its best cross-validated accuracy (0.8682) at C = 0.01, which is the value caret selected for the final model.
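
The selected value can be verified directly from the resampling table. The sketch below retypes the C and Accuracy columns printed above into a data frame. With the fitted object available, the same information should be found in `svmgrid$results` and `svmgrid$bestTune`.

```r
# Values copied from the resampling results printed above.
cv_results <- data.frame(
  C = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 5),
  Accuracy = c(0.8681674, 0.8634055, 0.8634055, 0.8602309, 0.8569769, 0.8569769,
               0.8569769, 0.8585642, 0.8585642, 0.8569769, 0.8569769, 0.8586436)
)

best <- cv_results[which.max(cv_results$Accuracy), ]
best$C                                          # 0.01

# Spread between the best and worst candidate.
accuracy_spread <- diff(range(cv_results$Accuracy))
round(accuracy_spread, 4)                       # 0.0112
```

The whole grid spans only about 1.1 percentage points of cross-validated accuracy, so the choice of C has a modest effect on this dataset.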

Let’s use the model tuned at this C value to make predictions on the test set.

test_pred <- predict(svmgrid, newdata= testing)
test_pred  
##  [1] 0 1 1 0 0 0 0 0 1 1 0 1 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1 1 0
## [36] 0 0 0 1 1 1 0 0 1 0 0 1 1 0 0 1 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1
## [71] 0 1 1 0 0 0 1 1 0 0 1 1 0 0 1 1 0 1 0 0
## Levels: 0 1
summary(test_pred)  
##  0  1 
## 54 36
confusionMatrix(test_pred, testing$V14) 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 43 11
##          1  7 29
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.7025, 0.8769)
##     No Information Rate : 0.5556          
##     P-Value [Acc > NIR] : 1.034e-06       
##                                           
##                   Kappa : 0.5909          
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.8600          
##             Specificity : 0.7250          
##          Pos Pred Value : 0.7963          
##          Neg Pred Value : 0.8056          
##              Prevalence : 0.5556          
##          Detection Rate : 0.4778          
##    Detection Prevalence : 0.6000          
##       Balanced Accuracy : 0.7925          
##                                           
##        'Positive' Class : 0               
## 

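The confusion matrix shows that the accuracy on the test set is now 80% (72 of 90 predictions correct), slightly higher than the 78.89% (71 of 90) obtained with the default C = 1. The gain is a single additional correct prediction, so it should be read as a small improvement rather than a decisive one.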
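
To put the two test-set accuracies side by side, the sketch below recomputes them from the confusion-matrix counts and adds exact binomial 95% intervals with base R's binom.test(). It assumes that caret derives its "95% CI" line the same way, in which case the intervals should agree with the ones printed above.

```r
n_test <- 90
correct_default <- 40 + 31   # diagonal of the first confusion matrix (C = 1)
correct_tuned   <- 43 + 29   # diagonal of the second confusion matrix (C = 0.01)

acc_default <- correct_default / n_test
acc_tuned   <- correct_tuned / n_test

# Exact (Clopper-Pearson) 95% confidence intervals for each accuracy.
ci_default <- binom.test(correct_default, n_test)$conf.int
ci_tuned   <- binom.test(correct_tuned, n_test)$conf.int

round(c(default = acc_default, tuned = acc_tuned), 4)
round(rbind(default = ci_default, tuned = ci_tuned), 4)

# TRUE when the two intervals overlap.
overlap <- ci_default[2] > ci_tuned[1]
overlap
```

Because the intervals overlap almost entirely, a test set of 90 rows cannot distinguish the two models reliably. A larger held-out set or a different seed could reverse the ordering.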