As part of this analysis, I will be using the Kaggle Heart Disease dataset.
The dataset has 1025 records and 14 features.
Part 1:
Classification Problem - Supervised learning
We will use classification algorithms (Decision Tree and Random Forest) to predict the target variable (integer valued: 0 = no disease, 1 = disease). I will also measure the performance of each classifier.
Part 2:
Clustering - Unsupervised learning
We will use an unsupervised machine learning algorithm (K-Means) to identify the clusters that exist in the data.
load libraries
library(ggcorrplot)
library(tidyverse)
library(caTools)
library(dplyr)
library(psych)
library(ggplot2)
library(caret)
library(stats)
library(DataExplorer)
library(boot)
library(randomForest)
library(rpart)
library(rpart.plot)
library(parameters)
library(nnet)
library(factoextra)
library(skimr)
library(ggpubr)
library(see)
Let’s analyze the distribution of the various features and their values.
heartDf <- read.csv(file = 'heart.csv')
summary(heartDf)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :0.0000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:120.0
## Median :56.00 Median :1.0000 Median :1.0000 Median :130.0
## Mean :54.43 Mean :0.6956 Mean :0.9424 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.0000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.0000 Max. :200.0
## chol fbs restecg thalach
## Min. :126 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:132.0
## Median :240 Median :0.0000 Median :1.0000 Median :152.0
## Mean :246 Mean :0.1493 Mean :0.5298 Mean :149.1
## 3rd Qu.:275 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564 Max. :1.0000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.800 Median :1.000 Median :0.0000
## Mean :0.3366 Mean :1.072 Mean :1.385 Mean :0.7541
## 3rd Qu.:1.0000 3rd Qu.:1.800 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.200 Max. :2.000 Max. :4.0000
## thal target
## Min. :0.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :2.000 Median :1.0000
## Mean :2.324 Mean :0.5132
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000
glimpse(heartDf)
## Rows: 1,025
## Columns: 14
## $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3~
## $ sex <int> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1~
## $ cp <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2~
## $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1~
## $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2~
## $ fbs <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0~
## $ restecg <int> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0~
## $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1~
## $ exang <int> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0~
## $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0~
## $ slope <int> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1~
## $ ca <int> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0~
## $ thal <int> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2~
## $ target <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0~
head(heartDf)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 52 1 0 125 212 0 1 168 0 1.0 2 2 3
## 2 53 1 0 140 203 1 0 155 1 3.1 0 0 3
## 3 70 1 0 145 174 0 1 125 1 2.6 0 0 3
## 4 61 1 0 148 203 0 1 161 0 0.0 2 1 3
## 5 62 0 0 138 294 1 1 106 0 1.9 1 3 2
## 6 58 0 0 100 248 0 0 122 0 1.0 1 0 2
## target
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 1
Check for missing data
plot_missing(heartDf)
We can see that there is no missing data in the dataset.
Check the structure of all variables:
str(heartDf)
## 'data.frame': 1025 obs. of 14 variables:
## $ age : int 52 53 70 61 62 58 58 55 46 54 ...
## $ sex : int 1 1 1 1 0 0 1 1 1 1 ...
## $ cp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ trestbps: int 125 140 145 148 138 100 114 160 120 122 ...
## $ chol : int 212 203 174 203 294 248 318 289 249 286 ...
## $ fbs : int 0 1 0 0 1 0 0 0 0 0 ...
## $ restecg : int 1 0 1 1 1 0 2 0 0 0 ...
## $ thalach : int 168 155 125 161 106 122 140 145 144 116 ...
## $ exang : int 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
## $ slope : int 2 0 0 2 1 1 0 1 2 1 ...
## $ ca : int 2 0 0 1 3 0 3 1 0 2 ...
## $ thal : int 3 3 3 3 2 2 1 3 3 2 ...
## $ target : int 0 0 0 0 0 1 0 0 0 0 ...
Even though all the features are stored as numeric, we can see that some of them take only a small set of categorical values. Hence we need to convert those variables to factors.
Correlation
corr = cor(heartDf)
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE, lab_size = 3,
           method = "circle", colors = c("blue", "white", "red"),
           outline.color = "gray", show.legend = TRUE, show.diag = FALSE,
           title = "Correlogram of the data")
Convert the categorical variables into factors
heartDf[,"sex"] = as.factor(heartDf[,"sex"])
heartDf[,"cp"] = as.factor(heartDf[,"cp"])
heartDf[,"fbs"] = as.factor(heartDf[,"fbs"])
heartDf[,"restecg"] = as.factor(heartDf[,"restecg"])
heartDf[,"exang"] = as.factor(heartDf[,"exang"])
heartDf[,"slope"] = as.factor(heartDf[,"slope"])
heartDf[,"ca"] = as.factor(heartDf[,"ca"])
heartDf[,"thal"] = as.factor(heartDf[,"thal"])
heartDf[,"target"] = as.factor(heartDf[,"target"])
str(heartDf)
## 'data.frame': 1025 obs. of 14 variables:
## $ age : int 52 53 70 61 62 58 58 55 46 54 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 2 2 2 2 ...
## $ cp : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ trestbps: int 125 140 145 148 138 100 114 160 120 122 ...
## $ chol : int 212 203 174 203 294 248 318 289 249 286 ...
## $ fbs : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 1 1 1 1 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 2 1 2 2 2 1 3 1 1 1 ...
## $ thalach : int 168 155 125 161 106 122 140 145 144 116 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
## $ slope : Factor w/ 3 levels "0","1","2": 3 1 1 3 2 2 1 2 3 2 ...
## $ ca : Factor w/ 5 levels "0","1","2","3",..: 3 1 1 2 4 1 4 2 1 3 ...
## $ thal : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 3 3 2 4 4 3 ...
## $ target : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
Next, we’ll look at a full summary of our features, including rudimentary distributions of each of our continuous variables:
skimr::skim(heartDf)
| Data summary |  |
|---|---|
| Name | heartDf |
| Number of rows | 1025 |
| Number of columns | 14 |
| Column type frequency: |  |
| factor | 9 |
| numeric | 5 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| sex | 0 | 1 | FALSE | 2 | 1: 713, 0: 312 |
| cp | 0 | 1 | FALSE | 4 | 0: 497, 2: 284, 1: 167, 3: 77 |
| fbs | 0 | 1 | FALSE | 2 | 0: 872, 1: 153 |
| restecg | 0 | 1 | FALSE | 3 | 1: 513, 0: 497, 2: 15 |
| exang | 0 | 1 | FALSE | 2 | 0: 680, 1: 345 |
| slope | 0 | 1 | FALSE | 3 | 1: 482, 2: 469, 0: 74 |
| ca | 0 | 1 | FALSE | 5 | 0: 578, 1: 226, 2: 134, 3: 69 |
| thal | 0 | 1 | FALSE | 4 | 2: 544, 3: 410, 1: 64, 0: 7 |
| target | 0 | 1 | FALSE | 2 | 1: 526, 0: 499 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 54.43 | 9.07 | 29 | 48 | 56.0 | 61.0 | 77.0 | ▁▅▇▇▁ |
| trestbps | 0 | 1 | 131.61 | 17.52 | 94 | 120 | 130.0 | 140.0 | 200.0 | ▃▇▅▁▁ |
| chol | 0 | 1 | 246.00 | 51.59 | 126 | 211 | 240.0 | 275.0 | 564.0 | ▃▇▂▁▁ |
| thalach | 0 | 1 | 149.11 | 23.01 | 71 | 132 | 152.0 | 166.0 | 202.0 | ▁▂▅▇▂ |
| oldpeak | 0 | 1 | 1.07 | 1.18 | 0 | 0 | 0.8 | 1.8 | 6.2 | ▇▂▁▁▁ |
Distribution of the data across categorical variables
long_df1 <- heartDf %>% select(c('sex','cp','fbs','target')) %>% pivot_longer(cols=-c(target), names_to='kpi')
ggplot(long_df1, aes(x=value, fill=target)) +
geom_bar() +
facet_wrap(~kpi, scales="free_x") +
scale_fill_manual(values = c("#2bbac0", "#f06e64")) +
ggtitle('Comparing Categorical Features and Target')
long_df2 <- heartDf %>% select(c('restecg','exang','slope','target')) %>% pivot_longer(cols=-c(target), names_to='kpi')
ggplot(long_df2, aes(x=value, fill=target)) +
geom_bar() +
facet_wrap(~kpi, scales="free_x") +
scale_fill_manual(values = c("#2bbac0", "#f06e64")) +
ggtitle('Comparing Categorical Features and Target')
long_df3 <- heartDf %>% select(c('ca','thal','target')) %>% pivot_longer(cols=-c(target), names_to='kpi')
ggplot(long_df3, aes(x=value, fill=target)) +
geom_bar() +
facet_wrap(~kpi, scales="free_x") +
scale_fill_manual(values = c("#2bbac0", "#f06e64")) +
ggtitle('Comparing Categorical Features and Target')
#Create a subset with numerical data
Numerical <- heartDf %>%
select(age, thalach, chol, oldpeak, trestbps, target) %>%
gather(key = "key", value = "value", -target)
Numerical %>%
ggplot(aes(y = value)) +
geom_boxplot(aes(fill = target),
alpha = .6,
fatten = .7) +
labs(x = "",
y = "",
title = "Distribution of numerical variables") +
scale_fill_manual(
values = c("#fde725ff", "#20a486ff"),
name = "Heart\nDisease",
labels = c("No diagnosis", " Diagnosed")) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
facet_wrap(~ key,
scales = "free",
ncol = 2)
Next, split the data into training (80%) and test (20%) sets using createDataPartition, which produces a stratified split on the target:
set.seed(1234)
training.samples <- heartDf$target %>%
createDataPartition(p = 0.8, list=FALSE)
train <- heartDf [training.samples,]
test <- heartDf[-training.samples,]
round(prop.table(table(select(train, 'target'))),2)
##
## 0 1
## 0.49 0.51
round(prop.table(table(select(test, 'target'))),2)
##
## 0 1
## 0.49 0.51
X_test <- subset (test, select = -c(target))
Y_test <- subset (test, select = c(target))
We can see that the target variable has nearly the same class proportions in the training and test sets (roughly 49% / 51%), consistent with the original data, so there is no class imbalance.
Supervised learning, also known as supervised machine learning, is a subcategory of machine learning and artificial intelligence. It is defined by its use of labeled data sets to train algorithms to classify data or predict outcomes accurately.
From the Exploratory Data Analysis (EDA), we can see that the dataset has both numerical and categorical features, with the categorical variables taking a small, bounded set of values. Given the nature of the dataset and its characteristics, the best algorithms to try here are Decision Tree and Random Forest.
A Decision Tree is a supervised learning technique that can be used for both classification and regression problems, though it is most often used for classification. It is a tree-structured classifier in which internal nodes represent features of the dataset, branches represent decision rules, and each leaf node represents an outcome.
DT_model <- rpart(target~., data = train, method = 'class')
rpart.plot(DT_model, extra = 106)
Confusion Matrix
DT_model_predictions = predict(DT_model, X_test, type="class")
dtSub1CM <- confusionMatrix(data = DT_model_predictions, reference = Y_test$target, positive = "1")
dtSub1CM
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 81 11
## 1 18 94
##
## Accuracy : 0.8578
## 95% CI : (0.8023, 0.9027)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7149
##
## Mcnemar's Test P-Value : 0.2652
##
## Sensitivity : 0.8952
## Specificity : 0.8182
## Pos Pred Value : 0.8393
## Neg Pred Value : 0.8804
## Prevalence : 0.5147
## Detection Rate : 0.4608
## Detection Prevalence : 0.5490
## Balanced Accuracy : 0.8567
##
## 'Positive' Class : 1
##
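For clarity, the headline metrics above follow directly from the confusion matrix counts; here is a short sketch of the manual calculation, using the counts reported above with class 1 as the positive class:
# Sketch: recompute the key metrics by hand from the confusion matrix counts
TP <- 94; TN <- 81; FP <- 18; FN <- 11          # counts taken from the table above
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 175 / 204 = 0.8578
sensitivity <- TP / (TP + FN)                   # 94 / 105  = 0.8952
specificity <- TN / (TN + FP)                   # 81 / 99   = 0.8182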
Random Forest is a supervised machine learning algorithm that is widely used for classification and regression problems. It builds decision trees on different bootstrap samples of the data and takes their majority vote for classification (or their average for regression).
set.seed(123)
fit.forest <- randomForest(target~., data = train, importance=TRUE, ntree=200)
# display model details
fit.forest
##
## Call:
## randomForest(formula = target ~ ., data = train, importance = TRUE, ntree = 200)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.73%
## Confusion matrix:
## 0 1 class.error
## 0 399 1 0.00250000
## 1 5 416 0.01187648
Confusion Matrix
rf.pred <- predict(fit.forest, newdata=X_test, type = "class")
(forest.cm_train <- confusionMatrix(rf.pred, Y_test$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 99 0
## 1 0 105
##
## Accuracy : 1
## 95% CI : (0.9821, 1)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.4853
## Detection Rate : 0.4853
## Detection Prevalence : 0.4853
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
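Because the forest was fitted with importance = TRUE, we could also inspect which features it relies on most. This was not part of the original output; a brief sketch using functions from the randomForest package:
# Sketch: feature importance from the fitted random forest
importance(fit.forest)                                        # mean decrease in accuracy / Gini per feature
varImpPlot(fit.forest, main = "Variable importance (random forest)")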
Unsupervised learning refers to the use of artificial intelligence (AI) algorithms to identify patterns in data sets whose data points are neither classified nor labeled. As the name suggests, this type of machine learning requires little human supervision and preparation work. Because unsupervised learning does not rely on labels, the patterns it finds are not constrained by a predefined target variable.
Unsupervised learning models are used in the following ways:
Clustering: This is the process of finding similarities among unlabeled data and grouping them together.
Association: This unsupervised learning method finds relationships between the data in a given dataset.
Dimensionality Reduction: This technique is used when the number of features in a dataset is too high. It reduces the number of inputs to a more manageable size while preserving as much of the information in the data as possible (a brief sketch follows this list).
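As a quick illustration of dimensionality reduction on this dataset (it is not used in the clustering below), a principal component analysis of the numeric features might look like this sketch:
# Sketch: PCA on the numeric features as an example of dimensionality reduction
pca <- prcomp(heartDf %>% select(age, trestbps, chol, thalach, oldpeak), scale. = TRUE)
summary(pca)   # proportion of variance explained by each principal component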
Cluster analysis is an unsupervised learning technique. A cluster is a group of data points that share similar features, and clustering is more about discovery than prediction. More formally, a cluster is a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
We need to remove the output (target) and the other categorical variables from the data before we start, keeping only the numeric features.
## select only the numerical features and scale the values
heartDf2 <- heartDf %>% select(age, trestbps, chol, thalach, oldpeak) %>% scale()
k-means clustering is a partitional, exclusive, and complete clustering approach. This means that the cluster boundaries are independent of each other; each item can belong to only one cluster, and every item is assigned to a cluster. In k-means clustering, a user decides how many clusters (k) a given dataset should be partitioned into. The algorithm then attempts to assign every item within the dataset to one (and only one) of k non-overlapping clusters based on similarity.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
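Concretely, for $k$ clusters with centroids $\mu_1, \dots, \mu_k$, the algorithm seeks an assignment of the points to clusters $C_1, \dots, C_k$ that minimizes the total within-cluster sum of squares (WCSS, discussed further below):

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$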
One drawback of the K-Means algorithm is that the user has to supply the value of k. However, there are several statistical methods that provide some guidance as to how many clusters are reasonable when segmenting items within a dataset.
The Elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.
The degree to which items within a cluster are similar (or dissimilar) can be quantified using a measure called the within-cluster sum of squares (WCSS). The WCSS of a cluster is the sum of the squared distances between the items in the cluster and the cluster centroid.
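The fviz_nbclust() call below computes and plots this automatically; for transparency, a manual sketch of the same elbow computation on the scaled matrix heartDf2 could look like:
# Sketch: compute the total WCSS for k = 1..10 and plot the elbow curve by hand
wcss <- sapply(1:10, function(k) kmeans(heartDf2, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wcss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")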
fviz_nbclust(heartDf2, kmeans, method = "wss") + ggtitle("Elbow Method")
The silhouette of an item is a measure of how closely the item is matched with other items within the same cluster and how loosely it is matched with items in neighboring clusters.
A silhouette value close to 1 implies that an item is in the right cluster, while a silhouette value close to –1 implies that it is in the wrong cluster. The average silhouette method computes the average silhouette of all items in the dataset for different values of k. If most items have a high value, then the average will be high, and the clustering configuration is considered appropriate. However, if many points have a low silhouette value, then the average will also be low, and the clustering configuration is not optimal. As with the elbow method, we plot the average silhouette against different values of k; the k value with the highest average silhouette represents the optimal number of clusters.
fviz_nbclust(heartDf2, kmeans, method = "silhouette") + ggtitle("Silhouette Method")
The third statistical approach we consider compares the difference between clusters created from the observed data and clusters created from a randomly generated dataset, known as the reference dataset. For a given k, the gap statistic is the difference in the total WCSS for the observed data and that of the reference dataset. The optimal number of clusters is denoted by the k value that yields the largest gap statistic.
fviz_nbclust(heartDf2, kmeans, method = "gap_stat") + ggtitle("Gap Statistic")
Both the Silhouette Method and the Gap Statistic indicate that the best value of k (number of clusters) for this dataset is 2, so I proceed with building the model. The graphic below shows the 2 clusters and their boundaries, and we can see a tangible difference between them.
set.seed(345)
k_clust <- kmeans(heartDf2, centers = 2, nstart = 25)
fviz_cluster(
k_clust,
data = heartDf2,
main = "Heart disease cluster",
repel = TRUE
) + theme(text = element_text(size = 14))
k_clust$size
## [1] 430 595
k_clust$centers
## age trestbps chol thalach oldpeak
## 1 0.7165487 0.4880342 0.3754376 -0.6496950 0.6284718
## 2 -0.5178419 -0.3526970 -0.2713246 0.4695275 -0.4541897
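As a sanity check on the 2-cluster solution, the average silhouette width discussed earlier could also be computed for the fitted model; a sketch, assuming the cluster package is available:
# Sketch: average silhouette width for the final 2-cluster solution
library(cluster)
sil <- silhouette(k_clust$cluster, dist(heartDf2))
mean(sil[, "sil_width"])   # values closer to 1 indicate better-separated clusters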
The k-means clustering approach has certain pros and cons. Understanding its strengths and weaknesses is useful in deciding when it is or is not a good fit for the problem at hand.
Pros:
One of the reasons why the k-means clustering approach is so commonly used to segment data into subgroups is that it has a wide range of real-world applications.
The approach is also flexible, in that all one needs to vary is the value of k in order to change the number of subgroups that items are grouped into.
The underlying mathematical principles behind k-means clustering (such as Euclidean distance) are not difficult to understand.
Cons:
k-means clustering requires that the value of k be set by the user. Choosing the right number of clusters sometimes requires additional knowledge about the problem domain.
Because distance can be calculated only between numeric values, k-means clustering works only with numeric data.
The algorithm is sensitive to outliers.
The k-means algorithm is not good at modeling clusters that have a complex geometric shape.
The simplicity of k-means clustering makes it less than ideal for modeling complex relationships between items beyond the use of a distance measure.
The use of random or pseudorandom initial centroids means that the approach, to some extent, relies on chance; the sketch below shows why multiple random starts are used to mitigate this.
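This last point is the reason the model above was fitted with nstart = 25: k-means keeps the best of the 25 random initialisations, which makes the result far less dependent on any single starting configuration. A small sketch of the comparison (on this dataset the two runs will typically give the same or a very similar total WCSS):
# Sketch: single random start vs. the best of 25 random starts
set.seed(345)
kmeans(heartDf2, centers = 2, nstart = 1)$tot.withinss    # one random initialisation
kmeans(heartDf2, centers = 2, nstart = 25)$tot.withinss   # best (lowest WCSS) of 25 initialisations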
We tried solving the heart-disease classification problem using the Decision Tree and Random Forest algorithms.
The Random Forest yielded 100% accuracy on our test data, while the Decision Tree gave 85.78% accuracy; as we also saw in our earlier homework, a Random Forest generally outperforms a single Decision Tree.
Tree-based algorithms are also reported to perform well on disease-prediction tasks such as this one, given that the data consist mostly of bounded categorical codes together with a few numeric measurements.
At the same time, the unsupervised algorithm (K-Means clustering) was able to identify 2 clusters within the dataset. This underscores our earlier finding that there are clear differentiating factors in the data that separate it into 2 distinct clusters.