Dataset Information:-

This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. The “target” field refers to the presence of heart disease in the patient; it is integer-valued, with 0 = no disease and 1 = disease.

Objective:-

The main objective is to predict whether a given person has heart disease, using several contributing factors such as age, cholesterol level, and type of chest pain.

The algorithms used for this problem are:-

  • Binary Logistic Regression

  • Naive Bayes algorithm

  • Decision Tree

  • Random Forest

Description of the Dataset:-

The data has 303 observations and 14 variables. Each observation contains the following information about an individual.

  • age - Age of individual in years

  • sex - Gender of individual (1 = male; 0 = female)

  • cp - Chest pain type (coded 0-3 in this file: 0 = typical angina; 1 = atypical angina; 2 = non-anginal pain; 3 = asymptomatic)

  • trestbps - Resting blood pressure (in mm Hg on admission to the hospital)

  • chol - Serum cholesterol in mg/dl

  • fbs - Fasting blood sugar level > 120 mg/dl (1 = true; 0 = false)

  • restecg - Resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy)

  • thalach - Maximum heart rate achieved

  • exang - Exercise induced angina (1 = yes; 0 = no)

  • oldpeak - ST depression induced by exercise relative to rest

  • slope - The slope of the peak exercise ST segment (coded 0-2 in this file: 0 = upsloping; 1 = flat; 2 = downsloping)

  • ca - Number of major vessels (0-4) colored by fluoroscopy

  • thal - Thalassemia, an inherited blood disorder that affects the body’s ability to produce hemoglobin and red blood cells (1 = normal; 2 = fixed defect; 3 = reversable defect)

  • target - the predicted attribute - diagnosis of heart disease (angiographic disease status) (Value 0 = < 50% diameter narrowing; Value 1 = > 50% diameter narrowing)

Loading the data into RStudio

heart<-read.csv("C:/Users/sragh/Desktop/heart.csv",header = T)

header = T tells read.csv that the first row of the file contains the column names, so it is used as the header rather than being treated as an observation.

head(heart)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  3      145  233   1       0     150     0     2.3     0  0    1
## 2  37   1  2      130  250   0       1     187     0     3.5     0  0    2
## 3  41   0  1      130  204   0       0     172     0     1.4     2  0    2
## 4  56   1  1      120  236   0       1     178     0     0.8     2  0    2
## 5  57   0  0      120  354   0       1     163     1     0.6     2  0    2
## 6  57   1  0      140  192   0       1     148     0     0.4     1  0    1
##   target
## 1      1
## 2      1
## 3      1
## 4      1
## 5      1
## 6      1
# head() is used to view and check the first six observations of the data.
tail(heart)
##     age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 298  59   1  0      164  176   1       0      90     0     1.0     1  2
## 299  57   0  0      140  241   0       1     123     1     0.2     1  0
## 300  45   1  3      110  264   0       1     132     0     1.2     1  0
## 301  68   1  0      144  193   1       1     141     0     3.4     1  2
## 302  57   1  0      130  131   0       1     115     1     1.2     1  1
## 303  57   0  1      130  236   0       0     174     0     0.0     1  1
##     thal target
## 298    1      0
## 299    3      0
## 300    3      0
## 301    3      0
## 302    3      0
## 303    2      0
# tail() is similar to head() but shows the last six observations of the data.
colSums(is.na(heart))
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0
# colSums(is.na()) checks whether the data contains any NA values, column by column.
# As no NAs are found we can move forward; otherwise they would have to be handled first (see the sketch below).
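
If missing values had been found, they would need to be handled before modelling. A minimal sketch of one common option, dropping incomplete rows (it is a no-op here because there are no NAs):

# Drop any rows containing NA values; with this data nothing is removed.
heart <- na.omit(heart)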

To check the structure of our data

str(heart)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

To see the summary of our data

summary(heart)
##       age             sex               cp           trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0  
##  1st Qu.:47.50   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:120.0  
##  Median :55.00   Median :1.0000   Median :1.000   Median :130.0  
##  Mean   :54.37   Mean   :0.6832   Mean   :0.967   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0  
##       chol            fbs            restecg          thalach     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
##  Median :240.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :246.3   Mean   :0.1485   Mean   :0.5281   Mean   :149.6  
##  3rd Qu.:274.5   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak         slope             ca        
##  Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.80   Median :1.000   Median :0.0000  
##  Mean   :0.3267   Mean   :1.04   Mean   :1.399   Mean   :0.7294  
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.20   Max.   :2.000   Max.   :4.0000  
##       thal          target      
##  Min.   :1.00   Min.   :0.0000  
##  1st Qu.:2.00   1st Qu.:0.0000  
##  Median :2.00   Median :1.0000  
##  Mean   :2.32   Mean   :0.5446  
##  3rd Qu.:3.00   3rd Qu.:1.0000  
##  Max.   :3.00   Max.   :1.0000

From the summary and structure above, we can make the following observations:-

  • sex cannot be a continuous variable, since per the description it is either male or female. Hence we convert it from integer to factor and label it to avoid any further confusion.

  • cp is a type of chest pain, not a quantity, so it cannot be continuous. We convert it to a factor and label it for convenience.

  • fbs only indicates whether fasting blood sugar exceeds 120 mg/dl, so it cannot be a continuous variable or integer. We convert it to a factor and label it.

  • restecg is a type of ECG result, so it should be a factor rather than an integer. We convert it to a factor and label it.

  • exang, as per the dataset description, records whether exercise-induced angina occurred (yes or no). We convert it to a factor and label it.

  • slope describes the type of ST-segment slope observed on the ECG, so it should not be an integer. We convert it to a factor and label it.

  • ca is treated as categorical here, as per the description of the dataset, so we convert it to a factor.

  • thal is a type of thalassemia defect, which is not a numeric quantity. We convert it to a factor and label it.

  • target is the predicted variable and tells us whether the individual has heart disease. We convert it to a factor and label it for convenience.

Based on the above observations, we implement the changes:

heart$sex<-as.factor(heart$sex)
levels(heart$sex)<-c("Female","Male")
heart$cp<-as.factor(heart$cp)
levels(heart$cp)<-c("typical","atypical","non-anginal","asymptomatic")
heart$fbs<-as.factor(heart$fbs)
levels(heart$fbs)<-c("False","True")
heart$restecg<-as.factor(heart$restecg)
levels(heart$restecg)<-c("normal","stt","hypertrophy")
heart$exang<-as.factor(heart$exang)
levels(heart$exang)<-c("No","Yes")
heart$slope<-as.factor(heart$slope)
levels(heart$slope)<-c("upsloping","flat","downsloping")
heart$ca<-as.factor(heart$ca)
heart$thal<-as.factor(heart$thal)
levels(heart$thal)<-c("normal","fixed","reversable")
heart$target<-as.factor(heart$target)
levels(heart$target)<-c("No", "Yes")
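
As a side note, the same recoding can be done in a single factor() call per variable, which avoids any risk of the levels()<- assignment attaching labels in the wrong order. A minimal sketch on a throwaway copy of the raw data (heart_raw is used here only for illustration):

# Re-read the raw file and recode sex in one step.
heart_raw <- read.csv("C:/Users/sragh/Desktop/heart.csv", header = TRUE)
heart_raw$sex <- factor(heart_raw$sex, levels = c(0, 1), labels = c("Female", "Male"))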

Checking whether the above changes are implemented or not

str(heart)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 1 2 2 2 ...
##  $ cp      : Factor w/ 4 levels "typical","atypical",..: 4 3 2 2 1 1 2 2 3 3 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : Factor w/ 2 levels "False","True": 2 1 1 1 1 1 1 1 2 1 ...
##  $ restecg : Factor w/ 3 levels "normal","stt",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : Factor w/ 3 levels "upsloping","flat",..: 1 1 3 3 3 2 2 3 3 3 ...
##  $ ca      : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ thal    : Factor w/ 3 levels "normal","fixed",..: 1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
summary(heart)
##       age            sex                 cp         trestbps    
##  Min.   :29.00   Female: 96   typical     :143   Min.   : 94.0  
##  1st Qu.:47.50   Male  :207   atypical    : 50   1st Qu.:120.0  
##  Median :55.00                non-anginal : 87   Median :130.0  
##  Mean   :54.37                asymptomatic: 23   Mean   :131.6  
##  3rd Qu.:61.00                                   3rd Qu.:140.0  
##  Max.   :77.00                                   Max.   :200.0  
##       chol          fbs             restecg       thalach      exang    
##  Min.   :126.0   False:258   normal     :147   Min.   : 71.0   No :204  
##  1st Qu.:211.0   True : 45   stt        :152   1st Qu.:133.5   Yes: 99  
##  Median :240.0               hypertrophy:  4   Median :153.0            
##  Mean   :246.3                                 Mean   :149.6            
##  3rd Qu.:274.5                                 3rd Qu.:166.0            
##  Max.   :564.0                                 Max.   :202.0            
##     oldpeak             slope     ca              thal     target   
##  Min.   :0.00   upsloping  : 21   0:175   normal    : 20   No :138  
##  1st Qu.:0.00   flat       :140   1: 65   fixed     :166   Yes:165  
##  Median :0.80   downsloping:142   2: 38   reversable:117            
##  Mean   :1.04                     3: 20                             
##  3rd Qu.:1.60                     4:  5                             
##  Max.   :6.20

EDA

EDA stands for Exploratory Data Analysis, an approach/philosophy for data analysis that employs a variety of (mostly graphical) techniques to maximize insight into a dataset.

For graphical representation we require the library “ggplot2”.

library(ggplot2)
ggplot(heart,aes(x=age,fill=target,color=target)) + geom_histogram(binwidth = 1,color="black") + labs(x = "Age",y = "Frequency", title = "Heart Disease w.r.t. Age")

From the histogram we can see that most of the heart disease cases fall in the 40 to 60 age group, with noticeably fewer cases above 60.
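
To back this reading of the histogram with numbers, one rough check (a sketch, with arbitrary age bands) is to tabulate the proportion of disease within each band:

# Proportion of individuals with heart disease within each age band.
age_band <- cut(heart$age, breaks = c(28, 40, 60, 78))
round(prop.table(table(age_band, heart$target), margin = 1), 2)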

mytable <- table(heart$cp)
pct<-round(mytable/sum(mytable)*100)
lbls1<-paste(names(mytable),pct)
lbls<-paste(lbls1, "%", sep="")
pie(mytable, labels = lbls,col = rainbow(length(lbls)),main="Pie Chart of Chest Pain",radius = 0.9)

We can conclude that, of all chest pain types, typical chest pain is the most common among these individuals, followed by non-anginal pain.
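
To relate chest pain type to the outcome directly, a grouped bar chart is one option (a sketch using the ggplot2 library already loaded above):

# Count of individuals by chest pain type, split by disease status.
ggplot(heart, aes(x = cp, fill = target)) +
  geom_bar(position = "dodge") +
  labs(x = "Chest pain type", y = "Count", title = "Chest Pain Type by Heart Disease Status")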

Performing the Machine Learning Algorithms

Logistic Regression:-

First, we divide the dataset into training (75%) and testing (25%) data.

set.seed(100) 
# set.seed(100) fixes the random number generator so the sampling below is reproducible.
index<-sample(nrow(heart),0.75*nrow(heart))
train<-heart[index,]
test<-heart[-index,]

Model generation on training data and then validating the model with testing data.

modelblr<-glm(target~.,data = train,family = "binomial")
# family = "binomial" specifies a two-outcome (binary) response, i.e. logistic regression.

To check how well the model performs, we calculate the predicted scores and build a confusion matrix to measure the model's accuracy.

train$pred<-fitted(modelblr)
# fitted() returns predicted scores only for the data on which the model was fitted.
head(train)
##     age    sex          cp trestbps chol   fbs restecg thalach exang
## 202  60   Male     typical      125  258 False  normal     141   Yes
## 112  57   Male non-anginal      150  126  True     stt     173    No
## 206  52   Male     typical      128  255 False     stt     161   Yes
## 4    56   Male    atypical      120  236 False     stt     178    No
## 98   52   Male     typical      108  233  True     stt     147    No
## 7    56 Female    atypical      140  294 False  normal     153    No
##     oldpeak       slope ca       thal target        pred
## 202     2.8        flat  1 reversable     No 0.003267884
## 112     0.2 downsloping  1 reversable    Yes 0.663651641
## 206     0.0 downsloping  1 reversable     No 0.071402635
## 4       0.8 downsloping  0      fixed    Yes 0.983649596
## 98      0.1 downsloping  3 reversable    Yes 0.357415872
## 7       1.3        flat  0      fixed    Yes 0.913661147

As we can see, the predicted scores are probabilities of having heart disease. We still need a proper cutoff point to distinguish between having heart disease and not having it.

For that we use the ROC curve (receiver operating characteristic curve), which is a graph showing the performance of a classification model at all classification thresholds. It will allow us to choose a suitable cutoff.

library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
pred<-prediction(train$pred,train$target)
perf<-performance(pred,"tpr","fpr")
plot(perf,colorize = T,print.cutoffs.at = seq(0.1,by = 0.1))

From the ROC curve we can observe that a cutoff of 0.6 gives a good balance of sensitivity and specificity, so we select 0.6 as our cutoff.
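
Reading the cutoff from the colour scale is approximate. As a cross-check, the cutoff that maximizes sensitivity + specificity (Youden's index) can be extracted directly from the ROCR objects created above (a sketch):

# Cutoff maximizing tpr - fpr (Youden's J) along the ROC curve.
cutoffs <- perf@alpha.values[[1]]
youden <- perf@y.values[[1]] - perf@x.values[[1]]
cutoffs[which.max(youden)]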

train$pred1<-ifelse(train$pred<0.6,"No","Yes")
library(caret)
## Loading required package: lattice
confusionMatrix(factor(train$pred1),train$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No   92  14
##        Yes  12 109
##                                           
##                Accuracy : 0.8855          
##                  95% CI : (0.8367, 0.9238)
##     No Information Rate : 0.5419          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7697          
##                                           
##  Mcnemar's Test P-Value : 0.8445          
##                                           
##             Sensitivity : 0.8846          
##             Specificity : 0.8862          
##          Pos Pred Value : 0.8679          
##          Neg Pred Value : 0.9008          
##              Prevalence : 0.4581          
##          Detection Rate : 0.4053          
##    Detection Prevalence : 0.4670          
##       Balanced Accuracy : 0.8854          
##                                           
##        'Positive' Class : No              
## 
# Accuracy of training data
acc_tr<-(109+92)/(227);acc_tr 
## [1] 0.8854626

From the confusion matrix of the training data, we see that the model is 88.55% accurate.

Now validating the model on testing data

test$pred<-predict(modelblr,test,type = "response")
# type = "response" returns the outcome as the probability of having heart disease.
head(test)
##    age    sex           cp trestbps chol   fbs restecg thalach exang
## 6   57   Male      typical      140  192 False     stt     148    No
## 10  57   Male  non-anginal      150  168 False     stt     174    No
## 17  58 Female  non-anginal      120  340 False     stt     172    No
## 18  66 Female asymptomatic      150  226 False     stt     114    No
## 21  59   Male      typical      135  234 False     stt     161    No
## 22  44   Male  non-anginal      130  233 False     stt     179   Yes
##    oldpeak       slope ca       thal target      pred
## 6      0.4        flat  0     normal    Yes 0.5914325
## 10     1.6 downsloping  0      fixed    Yes 0.9810904
## 17     0.0 downsloping  0      fixed    Yes 0.9982063
## 18     2.6   upsloping  0      fixed    Yes 0.9948418
## 21     0.5        flat  0 reversable    Yes 0.2880055
## 22     0.4 downsloping  0      fixed    Yes 0.9647337

For the training data the cutoff was 0.6; we apply the same threshold (cutoff) to the testing data.

test$pred1<-ifelse(test$pred<0.6,"No","Yes")
confusionMatrix(factor(test$pred1),test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  25   5
##        Yes  9  37
##                                           
##                Accuracy : 0.8158          
##                  95% CI : (0.7103, 0.8955)
##     No Information Rate : 0.5526          
##     P-Value [Acc > NIR] : 1.294e-06       
##                                           
##                   Kappa : 0.6232          
##                                           
##  Mcnemar's Test P-Value : 0.4227          
##                                           
##             Sensitivity : 0.7353          
##             Specificity : 0.8810          
##          Pos Pred Value : 0.8333          
##          Neg Pred Value : 0.8043          
##              Prevalence : 0.4474          
##          Detection Rate : 0.3289          
##    Detection Prevalence : 0.3947          
##       Balanced Accuracy : 0.8081          
##                                           
##        'Positive' Class : No              
## 
# Accuracy of Testing data.
acc_tt<-(25+37)/(76);acc_tt
## [1] 0.8157895

To compute the area under the ROC curve (AUC)

auc<-performance(pred,"auc")
auc@y.values
## [[1]]
## [1] 0.9517667

We can conclude that logistic regression achieves an accuracy of 81.58% on the test data, with an AUC of about 0.95 on the training predictions. The misclassification rate is 18.42%.
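
Since the 0.95 AUC above was computed on the training predictions, the same check can be repeated on the test-set probabilities (a sketch reusing the objects created above):

# AUC on the test data, using the probabilities predicted earlier.
pred_test <- prediction(test$pred, test$target)
performance(pred_test, "auc")@y.values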

Naive Bayes algorithm

We need to remove the extra columns added while performing BLR before implementing the Naive Bayes algorithm.

train$pred<-NULL
train$pred1<-NULL
test$pred<-NULL
test$pred1<-NULL
# The e1071 package contains the naiveBayes() model, so we load this library first.
library(e1071)
model_nb<-naiveBayes(target~.,data = train)
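
To see what the model has learned, the class-conditional distributions it stores can be inspected; e1071 keeps them in the tables component (a sketch, shown for cp):

# Conditional distribution of chest pain type given the target class.
model_nb$tables$cp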

Checking the model on the training data and creating its confusion matrix to see how accurate the model is.

train$pred<-predict(model_nb,train)
confusionMatrix(train$pred,train$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No   85  14
##        Yes  19 109
##                                           
##                Accuracy : 0.8546          
##                  95% CI : (0.8019, 0.8978)
##     No Information Rate : 0.5419          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7061          
##                                           
##  Mcnemar's Test P-Value : 0.4862          
##                                           
##             Sensitivity : 0.8173          
##             Specificity : 0.8862          
##          Pos Pred Value : 0.8586          
##          Neg Pred Value : 0.8516          
##              Prevalence : 0.4581          
##          Detection Rate : 0.3744          
##    Detection Prevalence : 0.4361          
##       Balanced Accuracy : 0.8517          
##                                           
##        'Positive' Class : No              
## 
acc_tr_nb<-(85+109)/227;acc_tr_nb
## [1] 0.8546256

We can say that the Naive Bayes model is 85.46% accurate on the training data.

Now we validate the model on the testing data by predicting and creating a confusion matrix.

test$pred<-predict(model_nb,test)
confusionMatrix(test$pred,test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  23   5
##        Yes 11  37
##                                           
##                Accuracy : 0.7895          
##                  95% CI : (0.6808, 0.8746)
##     No Information Rate : 0.5526          
##     P-Value [Acc > NIR] : 1.428e-05       
##                                           
##                   Kappa : 0.567           
##                                           
##  Mcnemar's Test P-Value : 0.2113          
##                                           
##             Sensitivity : 0.6765          
##             Specificity : 0.8810          
##          Pos Pred Value : 0.8214          
##          Neg Pred Value : 0.7708          
##              Prevalence : 0.4474          
##          Detection Rate : 0.3026          
##    Detection Prevalence : 0.3684          
##       Balanced Accuracy : 0.7787          
##                                           
##        'Positive' Class : No              
## 
acc_tt_nb<-(23+37)/76;acc_tt_nb
## [1] 0.7894737

We can conclude that the model generated with the Naive Bayes algorithm is 78.95% accurate, or equivalently that its misclassification rate is 21.05%.

Decision Tree

We need to remove the extra columns added while performing the Naive Bayes algorithm before implementing the decision tree.

train$pred<-NULL
test$pred<-NULL

We need the following libraries to build the decision tree:

  • library(rpart)
  • library(rpart.plot)

rpart stands for Recursive Partitioning and Regression Trees.

rpart can be used whether the independent and dependent variables are continuous or categorical.

rpart automatically detects whether to perform regression or classification based on the dependent variable, although the method can also be stated explicitly (as with method = "class" below).

Implementing Decision tree

library(rpart)
tree<-rpart(target~.,method = "class",data = train)
library(rpart.plot)
rpart.plot(tree)

From the fitted decision tree, the most significant variables appear to be cp, ca, thal, and oldpeak.
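
This reading of the plot can be checked numerically, since rpart stores a variable importance vector on the fitted object (a sketch):

# Variable importance scores computed by rpart (larger = more important).
round(tree$variable.importance, 2)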

Let’s validate the model on the testing data and find out its accuracy.

test$pred<-predict(tree,test,type = "class")
confusionMatrix(test$pred,test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  21   5
##        Yes 13  37
##                                           
##                Accuracy : 0.7632          
##                  95% CI : (0.6518, 0.8532)
##     No Information Rate : 0.5526          
##     P-Value [Acc > NIR] : 0.0001166       
##                                           
##                   Kappa : 0.51            
##                                           
##  Mcnemar's Test P-Value : 0.0989602       
##                                           
##             Sensitivity : 0.6176          
##             Specificity : 0.8810          
##          Pos Pred Value : 0.8077          
##          Neg Pred Value : 0.7400          
##              Prevalence : 0.4474          
##          Detection Rate : 0.2763          
##    Detection Prevalence : 0.3421          
##       Balanced Accuracy : 0.7493          
##                                           
##        'Positive' Class : No              
## 
acc_tr_tree<-(21+37)/76;acc_tr_tree
## [1] 0.7631579

We can say that the decision tree is 76.32% accurate, i.e. its misclassification rate is 23.68%.

Random Forest

We need to remove the extra column added while performing the decision tree before implementing the random forest.

test$pred<-NULL

For the random forest we do not need to split the data into training and testing sets: each tree is built on a bootstrap sample, and the observations left out of that sample (the out-of-bag, or OOB, observations) provide a built-in validation estimate. We therefore generate the model directly on the whole data, using the randomForest library.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
# set.seed(100) makes the random forest's internal sampling reproducible.
set.seed(100)
model_rf<-randomForest(target~.,data = heart)
model_rf
## 
## Call:
##  randomForest(formula = target ~ ., data = heart) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 16.5%
## Confusion matrix:
##      No Yes class.error
## No  110  28   0.2028986
## Yes  22 143   0.1333333

To plot the random forest's class error rates against the number of trees:

plot(model_rf)

The red line represents the misclassification rate (MCR) for the class without heart disease, the green line the MCR for the class with heart disease, and the black line the overall MCR or OOB error. The overall error rate is what we are most interested in, and it looks reasonably good.
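
The plot method does not add a legend by default; one can be attached from the column names of the stored error-rate matrix (a sketch that re-draws the plot, assuming the default line colours and types):

# Re-plot the error rates and label the three curves (OOB, No, Yes).
plot(model_rf, main = "Random Forest Error Rates")
legend("topright", legend = colnames(model_rf$err.rate), col = 1:3, lty = 1:3)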

Conclusion

After performing the various classification techniques and taking their accuracies into account, we can conclude that all the models achieved an accuracy between roughly 76% and 84%. Of these, the random forest gave the best accuracy, about 83.5% (1 minus its OOB error rate).
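
A compact way to line the models up is to gather the accuracy objects created above into one table, taking the random forest accuracy as 1 minus its OOB error rate (a sketch):

# Test-set / OOB accuracy of each model, collected for comparison.
acc_rf <- 1 - 0.165   # from the OOB estimate printed above
data.frame(
  Model = c("Logistic Regression", "Naive Bayes", "Decision Tree", "Random Forest"),
  Accuracy = round(c(acc_tt, acc_tt_nb, acc_tr_tree, acc_rf), 4)
)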