This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. The “target” field refers to the presence of heart disease in the patient. It is integer valued 0 = no disease and 1 = disease.
The main objective is to predict whether a given person has heart disease, using several contributing factors such as age, cholesterol level, and type of chest pain.
The algorithms we use for this problem are:
Binary Logistic Regression
Naive Bayes
Decision Tree
Random Forest
The data has 303 observations and 14 variables. Each observation contains the following information about an individual.
age - Age of the individual in years
sex - Sex of the individual (1 = male; 0 = female)
cp - Chest pain type (0 = typical angina; 1 = atypical angina; 2 = non-anginal pain; 3 = asymptomatic)
trestbps - Resting blood pressure (in mm Hg on admission to the hospital)
chol - Serum cholesterol in mg/dl
fbs - Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg - Resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy)
thalach - Maximum heart rate achieved
exang - Exercise-induced angina (1 = yes; 0 = no)
oldpeak - ST depression induced by exercise relative to rest
slope - Slope of the peak exercise ST segment (0 = upsloping; 1 = flat; 2 = downsloping)
ca - Number of major vessels (0-4) colored by fluoroscopy
thal - Thalassemia, an inherited blood disorder that affects the body's ability to produce hemoglobin and red blood cells (1 = normal; 2 = fixed defect; 3 = reversible defect)
target - The predicted attribute: diagnosis of heart disease (angiographic disease status); 0 = less than 50% diameter narrowing, 1 = more than 50% diameter narrowing
Loading the data into RStudio
heart<-read.csv("C:/Users/sragh/Desktop/heart.csv",header = T)
header = T tells read.csv that the first row of the file contains the column names rather than a data observation.
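As a quick sanity check (a minimal sketch), we can confirm that the header row was picked up as column names and that the dimensions match the description above:
dim(heart)     # expected: 303 observations of 14 variables
names(heart)   # the column names read from the first row of the file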
head(heart)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 63 1 3 145 233 1 0 150 0 2.3 0 0 1
## 2 37 1 2 130 250 0 1 187 0 3.5 0 0 2
## 3 41 0 1 130 204 0 0 172 0 1.4 2 0 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2 0 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2 0 2
## 6 57 1 0 140 192 0 1 148 0 0.4 1 0 1
## target
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
# head() displays the first six observations of the data, so we can quickly check that it loaded correctly.
tail(heart)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 298 59 1 0 164 176 1 0 90 0 1.0 1 2
## 299 57 0 0 140 241 0 1 123 1 0.2 1 0
## 300 45 1 3 110 264 0 1 132 0 1.2 1 0
## 301 68 1 0 144 193 1 1 141 0 3.4 1 2
## 302 57 1 0 130 131 0 1 115 1 1.2 1 1
## 303 57 0 1 130 236 0 0 174 0 0.0 1 1
## thal target
## 298 1 0
## 299 3 0
## 300 3 0
## 301 3 0
## 302 3 0
## 303 2 0
# Same as head(), but it shows the last six observations of the data.
colSums(is.na(heart))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
# colSums(is.na()) checks whether our data contains any NA values.
# As no NAs were found we can move forward; otherwise we would have to handle them before moving on.
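Had any NA values been present, one simple option (a sketch only, since this data turns out to be complete) would be to drop the affected rows before continuing:
# Only needed if the check above reports missing values; na.omit() removes
# every row that contains at least one NA.
heart <- na.omit(heart)
sum(is.na(heart))   # should now be 0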
To check the structure of our data
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
To see the summary of our data
summary(heart)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :0.000 Min. : 94.0
## 1st Qu.:47.50 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:120.0
## Median :55.00 Median :1.0000 Median :1.000 Median :130.0
## Mean :54.37 Mean :0.6832 Mean :0.967 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.000 Max. :200.0
## chol fbs restecg thalach
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
## Median :240.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :246.3 Mean :0.1485 Mean :0.5281 Mean :149.6
## 3rd Qu.:274.5 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.00 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.80 Median :1.000 Median :0.0000
## Mean :0.3267 Mean :1.04 Mean :1.399 Mean :0.7294
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.20 Max. :2.000 Max. :4.0000
## thal target
## Min. :1.00 Min. :0.0000
## 1st Qu.:2.00 1st Qu.:0.0000
## Median :2.00 Median :1.0000
## Mean :2.32 Mean :0.5446
## 3rd Qu.:3.00 3rd Qu.:1.0000
## Max. :3.00 Max. :1.0000
Observing the summary and structure above, we note the following points:
sex cannot be a continuous variable, as it is either male or female according to the description. Hence we convert sex from integer to factor and label it to avoid any further confusion.
cp cannot be a continuous variable, as it encodes the type of chest pain. We therefore convert cp to a factor and label it for convenience.
fbs cannot be a continuous variable or integer, as it only indicates whether fasting blood sugar is above 120 mg/dl or not. Therefore we convert it to a factor and label it.
restecg should be a factor, as it encodes the type of ECG result; it cannot be an integer, so we convert it to a factor and label it.
exang should be a factor, as per the description of the dataset: exercise-induced angina either happens or it does not, i.e. yes or no. Therefore we convert the variable to a factor and label it.
slope cannot be an integer, as it encodes the type of slope observed in the ECG. Therefore we convert the variable to a factor and label it.
ca cannot be an integer as per the description of our dataset, so we convert the variable to a factor.
thal cannot be an integer, as it encodes the type of thalassemia. Therefore we convert the variable to a factor and label it.
target is the predicted variable and tells us whether the individual has heart disease or not. Therefore we convert the variable to a factor and label it for our convenience.
According to the above observations, we implement the changes:
heart$sex<-as.factor(heart$sex)
levels(heart$sex)<-c("Female","Male")
heart$cp<-as.factor(heart$cp)
levels(heart$cp)<-c("typical","atypical","non-anginal","asymptomatic")
heart$fbs<-as.factor(heart$fbs)
levels(heart$fbs)<-c("False","True")
heart$restecg<-as.factor(heart$restecg)
levels(heart$restecg)<-c("normal","stt","hypertrophy")
heart$exang<-as.factor(heart$exang)
levels(heart$exang)<-c("No","Yes")
heart$slope<-as.factor(heart$slope)
levels(heart$slope)<-c("upsloping","flat","downsloping")
heart$ca<-as.factor(heart$ca)
heart$thal<-as.factor(heart$thal)
levels(heart$thal)<-c("normal","fixed","reversable")
heart$target<-as.factor(heart$target)
levels(heart$target)<-c("No", "Yes")
Checking whether the above changes have taken effect
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 1 2 2 2 ...
## $ cp : Factor w/ 4 levels "typical","atypical",..: 4 3 2 2 1 1 2 2 3 3 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : Factor w/ 2 levels "False","True": 2 1 1 1 1 1 1 1 2 1 ...
## $ restecg : Factor w/ 3 levels "normal","stt",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : Factor w/ 3 levels "upsloping","flat",..: 1 1 3 3 3 2 2 3 3 3 ...
## $ ca : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ thal : Factor w/ 3 levels "normal","fixed",..: 1 2 2 2 2 1 2 3 3 2 ...
## $ target : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
summary(heart)
## age sex cp trestbps
## Min. :29.00 Female: 96 typical :143 Min. : 94.0
## 1st Qu.:47.50 Male :207 atypical : 50 1st Qu.:120.0
## Median :55.00 non-anginal : 87 Median :130.0
## Mean :54.37 asymptomatic: 23 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
## chol fbs restecg thalach exang
## Min. :126.0 False:258 normal :147 Min. : 71.0 No :204
## 1st Qu.:211.0 True : 45 stt :152 1st Qu.:133.5 Yes: 99
## Median :240.0 hypertrophy: 4 Median :153.0
## Mean :246.3 Mean :149.6
## 3rd Qu.:274.5 3rd Qu.:166.0
## Max. :564.0 Max. :202.0
## oldpeak slope ca thal target
## Min. :0.00 upsloping : 21 0:175 normal : 20 No :138
## 1st Qu.:0.00 flat :140 1: 65 fixed :166 Yes:165
## Median :0.80 downsloping:142 2: 38 reversable:117
## Mean :1.04 3: 20
## 3rd Qu.:1.60 4: 5
## Max. :6.20
EDA stands for Exploratory Data Analysis, an approach to data analysis that employs a variety of (mostly graphical) techniques to maximize insight into a dataset.
For graphical representation we require the library ggplot2.
library(ggplot2)
ggplot(heart,aes(x=age,fill=target,color=target)) + geom_histogram(binwidth = 1,color="black") + labs(x = "Age",y = "Frequency", title = "Heart Disease w.r.t. Age")
From the histogram we can see that, in this sample, patients aged roughly 40 to 60 account for the largest number of heart-disease cases, more than those above 60.
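A small cross-tabulation backs this up numerically (a sketch; the age breaks are an illustrative choice and age_group is a temporary helper column):
# Count disease status within coarse age bands.
heart$age_group <- cut(heart$age, breaks = c(0, 40, 60, 100),
                       labels = c("under 40", "40-60", "over 60"))
table(heart$age_group, heart$target)
heart$age_group <- NULL   # drop the helper column again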
mytable <- table(heart$cp)
pct<-round(mytable/sum(mytable)*100)
lbls1<-paste(names(mytable),pct)
lbls<-paste(lbls1, "%", sep="")
pie(mytable, labels = lbls,col = rainbow(length(lbls)),main="Pie Chart of Chest Pain",radius = 0.9)
From the pie chart we can conclude that, of all the chest pain types, typical chest pain is observed most often, followed by non-anginal pain.
First, we divide our dataset into training (75%) and testing (25%) data.
set.seed(100)
# set.seed(100) fixes the random seed so that the sample (and hence the train/test split) is reproducible.
index<-sample(nrow(heart),0.75*nrow(heart))
train<-heart[index,]
test<-heart[-index,]
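Before fitting anything, it is worth a quick check (a sketch) that the random split kept a similar class balance in the two partitions:
# Class proportions should be roughly comparable in train and test.
prop.table(table(train$target))
prop.table(table(test$target))
nrow(train); nrow(test)   # 227 and 76 rows in this split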
Generating the model on the training data and then validating it with the testing data.
modelblr<-glm(target~.,data = train,family = "binomial")
# family = " binomial" means it contains only two outcomes.
To check how well the model fits, we calculate the predicted scores and build a confusion matrix to measure the accuracy of the model.
train$pred<-fitted(modelblr)
# fitted() returns the predicted scores only for the data on which the model was generated.
head(train)
## age sex cp trestbps chol fbs restecg thalach exang
## 202 60 Male typical 125 258 False normal 141 Yes
## 112 57 Male non-anginal 150 126 True stt 173 No
## 206 52 Male typical 128 255 False stt 161 Yes
## 4 56 Male atypical 120 236 False stt 178 No
## 98 52 Male typical 108 233 True stt 147 No
## 7 56 Female atypical 140 294 False normal 153 No
## oldpeak slope ca thal target pred
## 202 2.8 flat 1 reversable No 0.003267884
## 112 0.2 downsloping 1 reversable Yes 0.663651641
## 206 0.0 downsloping 1 reversable No 0.071402635
## 4 0.8 downsloping 0 fixed Yes 0.983649596
## 98 0.1 downsloping 3 reversable Yes 0.357415872
## 7 1.3 flat 0 fixed Yes 0.913661147
As we can see, the predicted scores are probabilities of having heart disease. We still have to find a proper cutoff from which it is easy to distinguish between having heart disease and not having it.
For that we use the ROC curve (receiver operating characteristic curve), a graph showing the performance of a classification model at all classification thresholds. It will allow us to choose a proper cutoff.
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred<-prediction(train$pred,train$target)
perf<-performance(pred,"tpr","fpr")
plot(perf,colorize = T,print.cutoffs.at = seq(0.1,by = 0.1))
From the ROC curve we observe that a cutoff of 0.6 gives a good balance of sensitivity and specificity, so we select 0.6 as our threshold. A numeric alternative is sketched below.
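If we prefer not to read the threshold off the plot, the same perf object can be queried numerically, for example by maximising Youden's J statistic (sensitivity + specificity - 1); this is a sketch, and the value it returns may differ slightly from the visually chosen 0.6.
# Sensitivity (TPR) and specificity (1 - FPR) at every threshold examined by ROCR.
sens <- perf@y.values[[1]]
spec <- 1 - perf@x.values[[1]]
cutoffs <- perf@alpha.values[[1]]
# Threshold that maximises Youden's J = sensitivity + specificity - 1.
cutoffs[which.max(sens + spec - 1)]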
train$pred1<-ifelse(train$pred<0.6,"No","Yes")
library(caret)
## Loading required package: lattice
confusionMatrix(factor(train$pred1),train$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 92 14
## Yes 12 109
##
## Accuracy : 0.8855
## 95% CI : (0.8367, 0.9238)
## No Information Rate : 0.5419
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7697
##
## Mcnemar's Test P-Value : 0.8445
##
## Sensitivity : 0.8846
## Specificity : 0.8862
## Pos Pred Value : 0.8679
## Neg Pred Value : 0.9008
## Prevalence : 0.4581
## Detection Rate : 0.4053
## Detection Prevalence : 0.4670
## Balanced Accuracy : 0.8854
##
## 'Positive' Class : No
##
# Accuracy of training data
acc_tr<-(109+92)/(227);acc_tr
## [1] 0.8854626
From the confusion matrix of the training data, we see that our model is 88.55% accurate.
Now validating the model on testing data
test$pred<-predict(modelblr,test,type = "response")
# type = "response" is used to get the outcome in the form of probability of having heart diseases.
head(test)
## age sex cp trestbps chol fbs restecg thalach exang
## 6 57 Male typical 140 192 False stt 148 No
## 10 57 Male non-anginal 150 168 False stt 174 No
## 17 58 Female non-anginal 120 340 False stt 172 No
## 18 66 Female asymptomatic 150 226 False stt 114 No
## 21 59 Male typical 135 234 False stt 161 No
## 22 44 Male non-anginal 130 233 False stt 179 Yes
## oldpeak slope ca thal target pred
## 6 0.4 flat 0 normal Yes 0.5914325
## 10 1.6 downsloping 0 fixed Yes 0.9810904
## 17 0.0 downsloping 0 fixed Yes 0.9982063
## 18 2.6 upsloping 0 fixed Yes 0.9948418
## 21 0.5 flat 0 reversable Yes 0.2880055
## 22 0.4 downsloping 0 fixed Yes 0.9647337
For the training data the cutoff was 0.6; we apply the same threshold to the testing data.
test$pred1<-ifelse(test$pred<0.6,"No","Yes")
confusionMatrix(factor(test$pred1),test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 25 5
## Yes 9 37
##
## Accuracy : 0.8158
## 95% CI : (0.7103, 0.8955)
## No Information Rate : 0.5526
## P-Value [Acc > NIR] : 1.294e-06
##
## Kappa : 0.6232
##
## Mcnemar's Test P-Value : 0.4227
##
## Sensitivity : 0.7353
## Specificity : 0.8810
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.8043
## Prevalence : 0.4474
## Detection Rate : 0.3289
## Detection Prevalence : 0.3947
## Balanced Accuracy : 0.8081
##
## 'Positive' Class : No
##
# Accuracy of Testing data.
acc_tt<-(25+37)/(76);acc_tt
## [1] 0.8157895
To measure the area under the ROC curve (AUC):
auc<-performance(pred,"auc")
auc@y.values
## [[1]]
## [1] 0.9517667
We can conclude that we get a test accuracy of 81.58%, with an AUC of about 0.95 on the training-set predictions. Our misclassification rate is therefore 18.42%.
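Note that the AUC above was computed from the training-set predictions (the pred object built earlier); the same calculation on the testing data is a short sketch, using the test$pred probabilities before they are removed below:
# Area under the ROC curve for the test-set probabilities.
pred_test <- prediction(test$pred, test$target)
performance(pred_test, "auc")@y.values[[1]]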
We need to remove the extra columns added while performing BLR before implementing the Naive Bayes algorithm.
train$pred<-NULL
train$pred1<-NULL
test$pred<-NULL
test$pred1<-NULL
# The e1071 package contains the naiveBayes model, so we load this library first.
library(e1071)
model_nb<-naiveBayes(target~.,data = train)
Checking the model with the training data and creating its confusion matrix to see how accurate the model is.
train$pred<-predict(model_nb,train)
confusionMatrix(train$pred,train$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 85 14
## Yes 19 109
##
## Accuracy : 0.8546
## 95% CI : (0.8019, 0.8978)
## No Information Rate : 0.5419
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7061
##
## Mcnemar's Test P-Value : 0.4862
##
## Sensitivity : 0.8173
## Specificity : 0.8862
## Pos Pred Value : 0.8586
## Neg Pred Value : 0.8516
## Prevalence : 0.4581
## Detection Rate : 0.3744
## Detection Prevalence : 0.4361
## Balanced Accuracy : 0.8517
##
## 'Positive' Class : No
##
acc_tr_nb<-(85+109)/227;acc_tr_nb
## [1] 0.8546256
We can say that the Naive Bayes model is 85.46% accurate on the training data.
Now, validating the model with the testing data by predicting and creating a confusion matrix.
test$pred<-predict(model_nb,test)
confusionMatrix(test$pred,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 23 5
## Yes 11 37
##
## Accuracy : 0.7895
## 95% CI : (0.6808, 0.8746)
## No Information Rate : 0.5526
## P-Value [Acc > NIR] : 1.428e-05
##
## Kappa : 0.567
##
## Mcnemar's Test P-Value : 0.2113
##
## Sensitivity : 0.6765
## Specificity : 0.8810
## Pos Pred Value : 0.8214
## Neg Pred Value : 0.7708
## Prevalence : 0.4474
## Detection Rate : 0.3026
## Detection Prevalence : 0.3684
## Balanced Accuracy : 0.7787
##
## 'Positive' Class : No
##
acc_tt_nb<-(23+37)/76;acc_tt_nb
## [1] 0.7894737
We can conclude that the model generated with the Naive Bayes algorithm is 78.95% accurate; equivalently, its misclassification rate is 21.05%.
We need to remove the extra columns added while performing Naive Bayes before implementing the Decision Tree.
train$pred<-NULL
test$pred<-NULL
We need the following library to build the decision tree.
rpart stands for Recursive Partitioning and Regression Trees.
rpart handles both continuous and categorical independent and dependent variables.
rpart automatically detects whether to perform regression or classification based on the dependent variable, so there is no need to specify it; a small check of this is sketched below.
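As a small illustration of that auto-detection (a sketch using the training data prepared earlier; tree_auto is a throwaway name), fitting without a method argument still yields a classification tree because target is a factor:
library(rpart)
# No 'method' argument: rpart inspects the response and, since target is a
# factor, chooses classification automatically.
tree_auto <- rpart(target ~ ., data = train)
tree_auto$method   # "class"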
Implementing Decision tree
library(rpart)
tree<-rpart(target~.,method = "class",data = train)
library(rpart.plot)
rpart.plot(tree)
From the decision tree we can say that the most significant variables are cp, ca, thal and oldpeak.
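That reading can be cross-checked against the importance scores stored on the fitted object (a sketch; variable.importance is only present when the tree contains at least one split):
# Importance scores for the fitted tree, largest first; the top entries
# should broadly match the variables appearing near the root of the plot.
sort(tree$variable.importance, decreasing = TRUE)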
Let’s validate the model with the testing data and find out its accuracy.
test$pred<-predict(tree,test,type = "class")
confusionMatrix(test$pred,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 21 5
## Yes 13 37
##
## Accuracy : 0.7632
## 95% CI : (0.6518, 0.8532)
## No Information Rate : 0.5526
## P-Value [Acc > NIR] : 0.0001166
##
## Kappa : 0.51
##
## Mcnemar's Test P-Value : 0.0989602
##
## Sensitivity : 0.6176
## Specificity : 0.8810
## Pos Pred Value : 0.8077
## Neg Pred Value : 0.7400
## Prevalence : 0.4474
## Detection Rate : 0.2763
## Detection Prevalence : 0.3421
## Balanced Accuracy : 0.7493
##
## 'Positive' Class : No
##
acc_tr_tree<-(21+37)/76;acc_tr_tree
## [1] 0.7631579
We can say that the decision tree is 76.32% accurate, i.e. its misclassification rate is 23.68%.
We need to remove the extra column added while performing the Decision Tree before implementing Random Forest.
test$pred<-NULL
With random forest we do not need to split the data into training and testing sets, because the out-of-bag (OOB) samples provide an internal error estimate; we generate the model directly on the whole data. To do so we require the library randomForest.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# set.seed() fixes the random number generator so that the forest is reproducible.
set.seed(100)
model_rf<-randomForest(target~.,data = heart)
model_rf
##
## Call:
## randomForest(formula = target ~ ., data = heart)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.5%
## Confusion matrix:
## No Yes class.error
## No 110 28 0.2028986
## Yes 22 143 0.1333333
To plot the random forest error rates (per class and overall) against the number of trees:
plot(model_rf)
The red line represents the misclassification rate (MCR) of the class without heart disease, the green line the MCR of the class with heart disease, and the black line the overall MCR, i.e. the OOB error. The overall error rate is what we are interested in, and it looks reasonably good. A legend can be added as sketched below.
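The default plot does not label the three lines, so a legend helps; a small sketch, assuming the default colours and line types used by plot() on a randomForest object (they follow the column order of model_rf$err.rate):
# Add a legend after calling plot(model_rf); the err.rate columns are
# "OOB", "No" and "Yes", drawn with colours/line types 1 to 3.
legend("topright", legend = colnames(model_rf$err.rate),
       col = 1:3, lty = 1:3, cex = 0.8)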
After performing the various classification techniques and taking their accuracies into account, we can conclude that all the models reached an accuracy between roughly 76% and 84%. Random forest gave a slightly better accuracy of 83.5% (1 minus its OOB error rate).
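To make the comparison easier to read, the accuracies computed above can be gathered into a single table (a sketch; the random-forest figure is taken as 1 minus its final OOB error rate):
# Test-set accuracies from the earlier sections plus the random-forest
# accuracy derived from its out-of-bag error.
results <- data.frame(
  model    = c("Logistic regression", "Naive Bayes", "Decision tree", "Random forest"),
  accuracy = c(acc_tt, acc_tt_nb, acc_tr_tree,
               1 - model_rf$err.rate[model_rf$ntree, "OOB"])
)
results[order(-results$accuracy), ]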