As part of this analysis, I will be using the Kaggle Heart Disease dataset.
The dataset has 1025 records and 14 features.
Part 1:
Classification Problem - Supervised learning
We will use classification algorithms (Decision Tree and Random Forest) to predict the target variable (integer valued: 0 = no disease, 1 = disease). I will also measure the performance of each classifier.
Part 2:
Clustering - Unsupervised learning
We will use an unsupervised machine learning algorithm (K-Means) to identify the clusters that exist in the data.
load libraries
library(ggcorrplot)
library(tidyverse)
library(caTools)
library(dplyr)
library(psych)
library(ggplot2)
library(caret)
library(stats)
library(DataExplorer)
library(boot)
library(randomForest)
library(rpart)
library(rpart.plot)
library(parameters)
library(nnet)
library(factoextra)
library(skimr)
library(ggpubr)
library(see)
Let’s analyze the distribution of the various features and their values.
heartDf <- read.csv(file = 'heart.csv')
summary(heartDf)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :0.0000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:120.0
## Median :56.00 Median :1.0000 Median :1.0000 Median :130.0
## Mean :54.43 Mean :0.6956 Mean :0.9424 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.0000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.0000 Max. :200.0
## chol fbs restecg thalach
## Min. :126 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:132.0
## Median :240 Median :0.0000 Median :1.0000 Median :152.0
## Mean :246 Mean :0.1493 Mean :0.5298 Mean :149.1
## 3rd Qu.:275 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564 Max. :1.0000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.800 Median :1.000 Median :0.0000
## Mean :0.3366 Mean :1.072 Mean :1.385 Mean :0.7541
## 3rd Qu.:1.0000 3rd Qu.:1.800 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.200 Max. :2.000 Max. :4.0000
## thal target
## Min. :0.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :2.000 Median :1.0000
## Mean :2.324 Mean :0.5132
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000
glimpse(heartDf)
## Rows: 1,025
## Columns: 14
## $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3~
## $ sex <int> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1~
## $ cp <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2~
## $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1~
## $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2~
## $ fbs <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0~
## $ restecg <int> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0~
## $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1~
## $ exang <int> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0~
## $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0~
## $ slope <int> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1~
## $ ca <int> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0~
## $ thal <int> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2~
## $ target <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0~
head(heartDf)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 52 1 0 125 212 0 1 168 0 1.0 2 2 3
## 2 53 1 0 140 203 1 0 155 1 3.1 0 0 3
## 3 70 1 0 145 174 0 1 125 1 2.6 0 0 3
## 4 61 1 0 148 203 0 1 161 0 0.0 2 1 3
## 5 62 0 0 138 294 1 1 106 0 1.9 1 3 2
## 6 58 0 0 100 248 0 0 122 0 1.0 1 0 2
## target
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 1
Check for missing data
plot_missing(heartDf)
We can see that there is no missing data in the dataset.
Check the structure of all variables:
str(heartDf)
## 'data.frame': 1025 obs. of 14 variables:
## $ age : int 52 53 70 61 62 58 58 55 46 54 ...
## $ sex : int 1 1 1 1 0 0 1 1 1 1 ...
## $ cp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ trestbps: int 125 140 145 148 138 100 114 160 120 122 ...
## $ chol : int 212 203 174 203 294 248 318 289 249 286 ...
## $ fbs : int 0 1 0 0 1 0 0 0 0 0 ...
## $ restecg : int 1 0 1 1 1 0 2 0 0 0 ...
## $ thalach : int 168 155 125 161 106 122 140 145 144 116 ...
## $ exang : int 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
## $ slope : int 2 0 0 2 1 1 0 1 2 1 ...
## $ ca : int 2 0 0 1 3 0 3 1 0 2 ...
## $ thal : int 3 3 3 3 2 2 1 3 3 2 ...
## $ target : int 0 0 0 0 0 1 0 0 0 0 ...
Even though all the features are stored as numeric, we can see that some of them take only a small set of categorical values. Hence we need to convert those variables to factors.
Correlation
corr = cor(heartDf)
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE, lab_size = 3,
           method = "circle", colors = c("blue", "white", "red"),
           outline.color = "gray", show.legend = TRUE, show.diag = FALSE,
           title = "Correlogram of the data")
Convert the categorical variables into factors
heartDf[,"sex"] = as.factor(heartDf[,"sex"])
heartDf[,"cp"] = as.factor(heartDf[,"cp"])
heartDf[,"fbs"] = as.factor(heartDf[,"fbs"])
heartDf[,"restecg"] = as.factor(heartDf[,"restecg"])
heartDf[,"exang"] = as.factor(heartDf[,"exang"])
heartDf[,"slope"] = as.factor(heartDf[,"slope"])
heartDf[,"ca"] = as.factor(heartDf[,"ca"])
heartDf[,"thal"] = as.factor(heartDf[,"thal"])
heartDf[,"target"] = as.factor(heartDf[,"target"])
str(heartDf)
## 'data.frame': 1025 obs. of 14 variables:
## $ age : int 52 53 70 61 62 58 58 55 46 54 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 2 2 2 2 ...
## $ cp : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ trestbps: int 125 140 145 148 138 100 114 160 120 122 ...
## $ chol : int 212 203 174 203 294 248 318 289 249 286 ...
## $ fbs : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 1 1 1 1 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 2 1 2 2 2 1 3 1 1 1 ...
## $ thalach : int 168 155 125 161 106 122 140 145 144 116 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
## $ slope : Factor w/ 3 levels "0","1","2": 3 1 1 3 2 2 1 2 3 2 ...
## $ ca : Factor w/ 5 levels "0","1","2","3",..: 3 1 1 2 4 1 4 2 1 3 ...
## $ thal : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 3 3 2 4 4 3 ...
## $ target : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
Next, we’ll look at a full summary of our features, including rudimentary distributions of each of our continuous variables:
skimr::skim(heartDf)
| Data summary |  |
|---|---|
| Name | heartDf |
| Number of rows | 1025 |
| Number of columns | 14 |
| Column type frequency: |  |
| factor | 9 |
| numeric | 5 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| sex | 0 | 1 | FALSE | 2 | 1: 713, 0: 312 |
| cp | 0 | 1 | FALSE | 4 | 0: 497, 2: 284, 1: 167, 3: 77 |
| fbs | 0 | 1 | FALSE | 2 | 0: 872, 1: 153 |
| restecg | 0 | 1 | FALSE | 3 | 1: 513, 0: 497, 2: 15 |
| exang | 0 | 1 | FALSE | 2 | 0: 680, 1: 345 |
| slope | 0 | 1 | FALSE | 3 | 1: 482, 2: 469, 0: 74 |
| ca | 0 | 1 | FALSE | 5 | 0: 578, 1: 226, 2: 134, 3: 69 |
| thal | 0 | 1 | FALSE | 4 | 2: 544, 3: 410, 1: 64, 0: 7 |
| target | 0 | 1 | FALSE | 2 | 1: 526, 0: 499 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 54.43 | 9.07 | 29 | 48 | 56.0 | 61.0 | 77.0 | ▁▅▇▇▁ |
| trestbps | 0 | 1 | 131.61 | 17.52 | 94 | 120 | 130.0 | 140.0 | 200.0 | ▃▇▅▁▁ |
| chol | 0 | 1 | 246.00 | 51.59 | 126 | 211 | 240.0 | 275.0 | 564.0 | ▃▇▂▁▁ |
| thalach | 0 | 1 | 149.11 | 23.01 | 71 | 132 | 152.0 | 166.0 | 202.0 | ▁▂▅▇▂ |
| oldpeak | 0 | 1 | 1.07 | 1.18 | 0 | 0 | 0.8 | 1.8 | 6.2 | ▇▂▁▁▁ |
Distribution of the data across categorical variables
long_df1 <- heartDf %>% select(c('sex','cp','fbs','target')) %>% pivot_longer(cols=-c(target), names_to='kpi')
ggplot(long_df1, aes(x=value, fill=target)) +
geom_bar() +
facet_wrap(~kpi, scales="free_x") +
scale_fill_manual(values = c("#2bbac0", "#f06e64")) +
ggtitle('Comparing Categorical Features and Target')
long_df2 <- heartDf %>% select(c('restecg','exang','slope','target')) %>% pivot_longer(cols=-c(target), names_to='kpi')
ggplot(long_df2, aes(x=value, fill=target)) +
geom_bar() +
facet_wrap(~kpi, scales="free_x") +
scale_fill_manual(values = c("#2bbac0", "#f06e64")) +
ggtitle('Comparing Categorical Features and Target')
long_df3 <- heartDf %>% select(c('ca','thal','target')) %>% pivot_longer(cols=-c(target), names_to='kpi')
ggplot(long_df3, aes(x=value, fill=target)) +
geom_bar() +
facet_wrap(~kpi, scales="free_x") +
scale_fill_manual(values = c("#2bbac0", "#f06e64")) +
ggtitle('Comparing Categorical Features and Target')
#Create a subset with numerical data
Numerical <- heartDf %>%
select(age, thalach, chol, oldpeak, trestbps, target) %>%
gather(key = "key", value = "value", -target)
Numerical %>%
ggplot(aes(y = value)) +
geom_boxplot(aes(fill = target),
alpha = .6,
fatten = .7) +
labs(x = "",
y = "",
title = "Distribution of numerical variables") +
scale_fill_manual(
values = c("#fde725ff", "#20a486ff"),
name = "Heart\nDisease",
labels = c("No diagnosis", " Diagnosed")) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
facet_wrap(~ key,
scales = "free",
ncol = 2)
Next, split the data into training (80%) and test (20%) sets using createDataPartition, which produces a stratified split on the target:
set.seed(1234)
training.samples <- heartDf$target %>%
createDataPartition(p = 0.8, list=FALSE)
train <- heartDf [training.samples,]
test <- heartDf[-training.samples,]
round(prop.table(table(select(train, 'target'))),2)
##
## 0 1
## 0.49 0.51
round(prop.table(table(select(test, 'target'))),2)
##
## 0 1
## 0.49 0.51
X_test <- subset (test, select = -c(target))
Y_test <- subset (test, select = c(target))
We can see that the target variable has nearly the same class proportions in the training and test sets (roughly 49% / 51%), consistent with the original data, so there is no class imbalance.
Supervised learning, also known as supervised machine learning, is a subcategory of machine learning and artificial intelligence. It is defined by its use of labeled data sets to train algorithms to classify data or predict outcomes accurately.
From the Exploratory Data Analysis (EDA), we can see that the dataset has both numerical and categorical features, with the categorical variables taking a small, bounded set of values. Given the nature of the dataset and its characteristics, the best algorithms to try here are Decision Tree and Random Forest.
A Decision Tree is a supervised learning technique that can be used for both classification and regression problems, though it is most often used for classification. It is a tree-structured classifier in which internal nodes represent features of the dataset, branches represent decision rules, and each leaf node represents an outcome.
DT_model <- rpart(target~., data = train, method = 'class')
rpart.plot(DT_model, extra = 106)
Confusion Matrix
DT_model_predictions = predict(DT_model, X_test, type="class")
dtSub1CM <- confusionMatrix(data = DT_model_predictions, reference = Y_test$target, positive = "1")
dtSub1CM
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 81 11
## 1 18 94
##
## Accuracy : 0.8578
## 95% CI : (0.8023, 0.9027)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7149
##
## Mcnemar's Test P-Value : 0.2652
##
## Sensitivity : 0.8952
## Specificity : 0.8182
## Pos Pred Value : 0.8393
## Neg Pred Value : 0.8804
## Prevalence : 0.5147
## Detection Rate : 0.4608
## Detection Prevalence : 0.5490
## Balanced Accuracy : 0.8567
##
## 'Positive' Class : 1
##
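For clarity, the headline metrics above follow directly from the confusion matrix counts; here is a short sketch of the manual calculation, using the counts reported above with class 1 as the positive class:
# Sketch: recompute the key metrics by hand from the confusion matrix counts
TP <- 94; TN <- 81; FP <- 18; FN <- 11          # counts taken from the table above
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 175 / 204 = 0.8578
sensitivity <- TP / (TP + FN)                   # 94 / 105  = 0.8952
specificity <- TN / (TN + FP)                   # 81 / 99   = 0.8182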
Random Forest is a supervised machine learning algorithm that is widely used for classification and regression problems. It builds decision trees on different bootstrap samples of the data and takes their majority vote for classification (or their average for regression).
set.seed(123)
fit.forest <- randomForest(target~., data = train, importance=TRUE, ntree=200)
# display model details
fit.forest
##
## Call:
## randomForest(formula = target ~ ., data = train, importance = TRUE, ntree = 200)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.73%
## Confusion matrix:
## 0 1 class.error
## 0 399 1 0.00250000
## 1 5 416 0.01187648
Confusion Matrix
rf.pred <- predict(fit.forest, newdata=X_test, type = "class")
(forest.cm_train <- confusionMatrix(rf.pred, Y_test$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 99 0
## 1 0 105
##
## Accuracy : 1
## 95% CI : (0.9821, 1)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.4853
## Detection Rate : 0.4853
## Detection Prevalence : 0.4853
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
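Because the forest was fitted with importance = TRUE, we could also inspect which features it relies on most. This was not part of the original output; a brief sketch using functions from the randomForest package:
# Sketch: feature importance from the fitted random forest
importance(fit.forest)                                        # mean decrease in accuracy / Gini per feature
varImpPlot(fit.forest, main = "Variable importance (random forest)")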
Unsupervised learning refers to the use of artificial intelligence (AI) algorithms to identify patterns in data sets whose data points are neither classified nor labeled. As the name suggests, this type of machine learning requires little human supervision and preparation work. Because unsupervised learning does not rely on labels, the patterns it finds are not constrained by a predefined target variable.
Unsupervised learning models are used in the following ways:
Clustering: This is the process of finding similarities among unlabeled data and grouping them together.
Association: This unsupervised learning method finds relationships between the data in a given dataset.
Dimensionality Reduction: This technique is used when the number of features in a dataset is too high. It reduces the number of inputs to a more manageable size while preserving as much of the information in the data as possible (a brief sketch follows this list).
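As a quick illustration of dimensionality reduction on this dataset (it is not used in the clustering below), a principal component analysis of the numeric features might look like this sketch:
# Sketch: PCA on the numeric features as an example of dimensionality reduction
pca <- prcomp(heartDf %>% select(age, trestbps, chol, thalach, oldpeak), scale. = TRUE)
summary(pca)   # proportion of variance explained by each principal component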
Cluster analysis is an unsupervised learning technique. A cluster is a group of data points that share similar features, and clustering is more about discovery than prediction. More formally, a cluster is a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
We need to remove the output (target) and the other categorical variables from the data before we start, keeping only the numeric features.
## select only the numerical features and scale the values
heartDf2 <- heartDf %>% select(age, trestbps, chol, thalach, oldpeak) %>% scale()
k-means clustering is a partitional, exclusive, and complete clustering approach. This means that the cluster boundaries are independent of each other; each item can belong to only one cluster, and every item is assigned to a cluster. In k-means clustering, a user decides how many clusters (k) a given dataset should be partitioned into. The algorithm then attempts to assign every item within the dataset to one (and only one) of k non-overlapping clusters based on similarity.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
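Concretely, for $k$ clusters with centroids $\mu_1, \dots, \mu_k$, the algorithm seeks an assignment of the points to clusters $C_1, \dots, C_k$ that minimizes the total within-cluster sum of squares (WCSS, discussed further below):

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$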
One drawback of the K-Means algorithm is that the user has to supply the value of k. However, there are several statistical methods that provide some guidance as to how many clusters are reasonable when segmenting items within a dataset.
The Elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.
The degree to which items within a cluster are similar (or dissimilar) can be quantified using a measure called the within-cluster sum of squares (WCSS). The WCSS of a cluster is the sum of the squared distances between the items in the cluster and the cluster centroid.
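The fviz_nbclust() call below computes and plots this automatically; for transparency, a manual sketch of the same elbow computation on the scaled matrix heartDf2 could look like:
# Sketch: compute the total WCSS for k = 1..10 and plot the elbow curve by hand
wcss <- sapply(1:10, function(k) kmeans(heartDf2, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wcss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")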
fviz_nbclust(heartDf2, kmeans, method = "wss") + ggtitle("Elbow Method")
The silhouette of an item is a measure of how closely the item is matched with other items within the same cluster and how loosely it is matched with items in neighboring clusters.
A silhouette value close to 1 implies that an item is in the right cluster, while a silhouette value close to –1 implies that it is in the wrong cluster. The average silhouette method computes the average silhouette of all items in the dataset for different values of k. If most items have a high value, then the average will be high, and the clustering configuration is considered appropriate. However, if many points have a low silhouette value, then the average will also be low, and the clustering configuration is not optimal. As with the elbow method, we plot the average silhouette against different values of k; the k value with the highest average silhouette represents the optimal number of clusters.
fviz_nbclust(heartDf2, kmeans, method = "silhouette") + ggtitle("Silhouette Method")
The third statistical approach we consider compares the difference between clusters created from the observed data and clusters created from a randomly generated dataset, known as the reference dataset. For a given k, the gap statistic is the difference in the total WCSS for the observed data and that of the reference dataset. The optimal number of clusters is denoted by the k value that yields the largest gap statistic.
fviz_nbclust(heartDf2, kmeans, method = "gap_stat") + ggtitle("Gap Statistic")
Both the Silhouette Method and the Gap Statistic indicate that the best value of k (number of clusters) for this dataset is 2, so I proceed with building the model. The graphic below shows the 2 clusters and their boundaries, and we can see a tangible difference between them.
set.seed(345)
k_clust <- kmeans(heartDf2, centers = 2, nstart = 25)
fviz_cluster(
k_clust,
data = heartDf2,
main = "Heart disease cluster",
repel = TRUE
) + theme(text = element_text(size = 14))
k_clust$size
## [1] 430 595
k_clust$centers
## age trestbps chol thalach oldpeak
## 1 0.7165487 0.4880342 0.3754376 -0.6496950 0.6284718
## 2 -0.5178419 -0.3526970 -0.2713246 0.4695275 -0.4541897
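As a sanity check on the 2-cluster solution, the average silhouette width discussed earlier could also be computed for the fitted model; a sketch, assuming the cluster package is available:
# Sketch: average silhouette width for the final 2-cluster solution
library(cluster)
sil <- silhouette(k_clust$cluster, dist(heartDf2))
mean(sil[, "sil_width"])   # values closer to 1 indicate better-separated clusters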
The k-means clustering approach has certain pros and cons. Understanding its strengths and weaknesses is useful in deciding when it is or is not a good fit for the problem at hand.
Pros:
One of the reasons why the k-means clustering approach is so commonly used to segment data into subgroups is that it has a wide range of real-world applications.
The approach is also flexible, in that all one needs to vary is the value of k in order to change the number of subgroups that items are grouped into.
The underlying mathematical principles behind k-means clustering (such as Euclidean distance) are not difficult to understand.
Cons:
k-means clustering requires that the value of k be set by the user. Choosing the right number of clusters sometimes requires additional knowledge about the problem domain.
Because distance can be calculated only between numeric values, k-means clustering works only with numeric data.
The algorithm is sensitive to outliers.
The k-means algorithm is not good at modeling clusters that have a complex geometric shape.
The simplicity of k-means clustering makes it less than ideal for modeling complex relationships between items beyond the use of a distance measure.
The use of random or pseudorandom initial centroids means that the approach, to some extent, relies on chance; the sketch below shows why multiple random starts are used to mitigate this.
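This last point is the reason the model above was fitted with nstart = 25: k-means keeps the best of the 25 random initialisations, which makes the result far less dependent on any single starting configuration. A small sketch of the comparison (on this dataset the two runs will typically give the same or a very similar total WCSS):
# Sketch: single random start vs. the best of 25 random starts
set.seed(345)
kmeans(heartDf2, centers = 2, nstart = 1)$tot.withinss    # one random initialisation
kmeans(heartDf2, centers = 2, nstart = 25)$tot.withinss   # best (lowest WCSS) of 25 initialisations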
We tried solving the heart-disease classification problem using the Decision Tree and Random Forest algorithms.
The Random Forest yielded 100% accuracy on our test data, while the Decision Tree gave 85.78% accuracy; as we also saw in our earlier homework, a Random Forest generally outperforms a single Decision Tree.
Tree-based algorithms are also reported to perform well on disease-prediction tasks such as this one, given that the data consist mostly of bounded categorical codes together with a few numeric measurements.
At the same time, the unsupervised algorithm (K-Means clustering) was able to identify 2 clusters within the dataset. This underscores our earlier finding that there are clear differentiating factors in the data that separate it into 2 distinct clusters.