Objective

You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homeworks. You can work on a problem from your job or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, the Census Bureau, USGS, or other open data portals.

Select one of the methodologies studied in weeks 1-10 and one methodology from weeks 11-15 to apply to the new dataset you selected. To complete this task:

- describe the problem you are trying to solve
- describe your dataset and what you did to prepare the data for analysis
- explain the methodologies you used for analyzing the data
- state the purpose of the analysis performed
- draw your conclusions from your analysis

Please be sure to address the business impact (it could be of any domain) of your solution.

Data

I will use the heart disease dataset to build a logistic regression model and apply PCA, with the goal of predicting whether or not a patient has heart disease.

Data Source

https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset?resource=download

#Required libraries
library(dplyr)
library(psych)
library(ggplot2)
library(ggcorrplot)
library(tidyverse)
library(caTools)
library(caret)
library(boot)
library(rpart)
library(rpart.plot)
library(parameters)
library(nnet)
library(factoextra)

Data Exploration

#Read the data
Data_v1 <- read.csv(file = 'heart.csv')
#Read the first few rows of the data
head(Data_v1)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  52   1  0      125  212   0       1     168     0     1.0     2  2    3
## 2  53   1  0      140  203   1       0     155     1     3.1     0  0    3
## 3  70   1  0      145  174   0       1     125     1     2.6     0  0    3
## 4  61   1  0      148  203   0       1     161     0     0.0     2  1    3
## 5  62   0  0      138  294   1       1     106     0     1.9     1  3    2
## 6  58   0  0      100  248   0       0     122     0     1.0     1  0    2
##   target
## 1      0
## 2      0
## 3      0
## 4      0
## 5      0
## 6      1

Summary Statistics

#Explore the data via summary.
summary(Data_v1)
##       age             sex               cp            trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.0000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:120.0  
##  Median :56.00   Median :1.0000   Median :1.0000   Median :130.0  
##  Mean   :54.43   Mean   :0.6956   Mean   :0.9424   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.0000   Max.   :200.0  
##       chol          fbs            restecg          thalach     
##  Min.   :126   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:132.0  
##  Median :240   Median :0.0000   Median :1.0000   Median :152.0  
##  Mean   :246   Mean   :0.1493   Mean   :0.5298   Mean   :149.1  
##  3rd Qu.:275   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak          slope             ca        
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.800   Median :1.000   Median :0.0000  
##  Mean   :0.3366   Mean   :1.072   Mean   :1.385   Mean   :0.7541  
##  3rd Qu.:1.0000   3rd Qu.:1.800   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.200   Max.   :2.000   Max.   :4.0000  
##       thal           target      
##  Min.   :0.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000  
##  Median :2.000   Median :1.0000  
##  Mean   :2.324   Mean   :0.5132  
##  3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000
describe(Data_v1)
##          vars    n   mean    sd median trimmed   mad min   max range  skew
## age         1 1025  54.43  9.07   56.0   54.66  8.90  29  77.0  48.0 -0.25
## sex         2 1025   0.70  0.46    1.0    0.74  0.00   0   1.0   1.0 -0.85
## cp          3 1025   0.94  1.03    1.0    0.83  1.48   0   3.0   3.0  0.53
## trestbps    4 1025 131.61 17.52  130.0  130.39 14.83  94 200.0 106.0  0.74
## chol        5 1025 246.00 51.59  240.0  243.26 48.93 126 564.0 438.0  1.07
## fbs         6 1025   0.15  0.36    0.0    0.06  0.00   0   1.0   1.0  1.97
## restecg     7 1025   0.53  0.53    1.0    0.52  0.00   0   2.0   2.0  0.18
## thalach     8 1025 149.11 23.01  152.0  150.40 23.72  71 202.0 131.0 -0.51
## exang       9 1025   0.34  0.47    0.0    0.30  0.00   0   1.0   1.0  0.69
## oldpeak    10 1025   1.07  1.18    0.8    0.89  1.19   0   6.2   6.2  1.21
## slope      11 1025   1.39  0.62    1.0    1.45  1.48   0   2.0   2.0 -0.48
## ca         12 1025   0.75  1.03    0.0    0.57  0.00   0   4.0   4.0  1.26
## thal       13 1025   2.32  0.62    2.0    2.38  0.00   0   3.0   3.0 -0.52
## target     14 1025   0.51  0.50    1.0    0.52  0.00   0   1.0   1.0 -0.05
##          kurtosis   se
## age         -0.53 0.28
## sex         -1.28 0.01
## cp          -1.15 0.03
## trestbps     0.97 0.55
## chol         3.96 1.61
## fbs          1.87 0.01
## restecg     -1.31 0.02
## thalach     -0.10 0.72
## exang       -1.52 0.01
## oldpeak      1.29 0.04
## slope       -0.65 0.02
## ca           0.68 0.03
## thal         0.24 0.02
## target      -2.00 0.02

We can see a few variables with high variance and skewness (chol and oldpeak in particular). Also, there are no missing values in this dataset.
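A quick way to confirm the absence of missing values (a minimal sketch on the data frame loaded above):

#Count missing values overall and per column; both should be zero
sum(is.na(Data_v1))
colSums(is.na(Data_v1))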

Check the structure of all variables:

str(Data_v1)
## 'data.frame':    1025 obs. of  14 variables:
##  $ age     : int  52 53 70 61 62 58 58 55 46 54 ...
##  $ sex     : int  1 1 1 1 0 0 1 1 1 1 ...
##  $ cp      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ trestbps: int  125 140 145 148 138 100 114 160 120 122 ...
##  $ chol    : int  212 203 174 203 294 248 318 289 249 286 ...
##  $ fbs     : int  0 1 0 0 1 0 0 0 0 0 ...
##  $ restecg : int  1 0 1 1 1 0 2 0 0 0 ...
##  $ thalach : int  168 155 125 161 106 122 140 145 144 116 ...
##  $ exang   : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num  1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
##  $ slope   : int  2 0 0 2 1 1 0 1 2 1 ...
##  $ ca      : int  2 0 0 1 3 0 3 1 0 2 ...
##  $ thal    : int  3 3 3 3 2 2 1 3 3 2 ...
##  $ target  : int  0 0 0 0 0 1 0 0 0 0 ...

We need to change the data type of some variables to factors.

Correlation

corr = cor(Data_v1)

ggcorrplot(corr,
           hc.order = TRUE,
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           method = "circle",
           colors = c("blue", "white", "red"),
           outline.color = "gray",
           show.legend = TRUE,
           show.diag = FALSE,
           title = "Correlogram of the data")

Data Preparation

To better understand the variables, I created the following table describing each feature's definition, data type, and range of values.

| Feature | Definition | Type | Values |
|---|---|---|---|
| age | Patient's age in years | Numerical | 29 - 77 |
| sex | Gender | Nominal | (0) female, (1) male |
| cp | Type of chest pain | Nominal | (0) typical angina, (1) atypical angina, (2) non-anginal pain, (3) asymptomatic |
| trestbps | Resting blood pressure in mmHg | Numerical | 94 - 200 |
| chol | Serum cholesterol in mg/dl | Numerical | 126 - 564 |
| fbs | Fasting blood sugar higher than 120 mg/dl | Nominal | (0) false, (1) true |
| restecg | Resting electrocardiographic results | Nominal | (0) normal, (1) ST-T wave abnormality, (2) probable left ventricular hypertrophy |
| thalach | Maximum heart rate achieved | Numerical | 71 - 202 |
| exang | Exercise-induced angina | Nominal | (0) no, (1) yes |
| oldpeak | ST depression induced by exercise relative to rest | Numerical | 0 - 6.2 |
| slope | The slope of the peak exercise ST segment | Nominal | 0, 1, 2: (0) upsloping, (1) flat, (2) downsloping |
| ca | Number of major vessels colored by fluoroscopy | Nominal | 0, 1, 2, 3, 4 |
| thal | Thalassemia | Nominal | (0) normal, (1) fixed defect, (2) reversible defect |
| target | Diagnosis of heart disease | Nominal | (0) heart disease not present, (1) heart disease present |
#Get the columns names
cols = colnames(Data_v1)
cols
##  [1] "age"      "sex"      "cp"       "trestbps" "chol"     "fbs"     
##  [7] "restecg"  "thalach"  "exang"    "oldpeak"  "slope"    "ca"      
## [13] "thal"     "target"

Check the number of unique values in each variable

for (i in cols){
  uniq = n_distinct(Data_v1[i])
  print(paste0(i, " : ", uniq))
  
}
## [1] "age : 41"
## [1] "sex : 2"
## [1] "cp : 4"
## [1] "trestbps : 49"
## [1] "chol : 152"
## [1] "fbs : 2"
## [1] "restecg : 3"
## [1] "thalach : 91"
## [1] "exang : 2"
## [1] "oldpeak : 40"
## [1] "slope : 3"
## [1] "ca : 5"
## [1] "thal : 4"
## [1] "target : 2"

Create empty lists to separate categorical and numerical variables. A variable with fewer than 5 unique values is treated as categorical (note that with this threshold, ca, which has 5 unique values, remains numeric).

Categorical_att = list()
Numerical_attr = list()

for (i in cols){
  uniq = n_distinct(Data_v1[i])
  # Treat variables with fewer than 5 unique values as categorical
  if (uniq < 5){
    Categorical_att = append(Categorical_att, i)
  }
  else{
    Numerical_attr = append(Numerical_attr, i)
  }
}
#Make a copy of the data
Data_v2 = Data_v1

Convert Categorical data to factor

for (i in Categorical_att){
  Data_v2[,i] = as.factor(Data_v2[,i])
}

Confirm the changes in the data type

str(Data_v2)
## 'data.frame':    1025 obs. of  14 variables:
##  $ age     : int  52 53 70 61 62 58 58 55 46 54 ...
##  $ sex     : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 2 2 2 2 ...
##  $ cp      : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ trestbps: int  125 140 145 148 138 100 114 160 120 122 ...
##  $ chol    : int  212 203 174 203 294 248 318 289 249 286 ...
##  $ fbs     : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 1 1 1 1 ...
##  $ restecg : Factor w/ 3 levels "0","1","2": 2 1 2 2 2 1 3 1 1 1 ...
##  $ thalach : int  168 155 125 161 106 122 140 145 144 116 ...
##  $ exang   : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
##  $ oldpeak : num  1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
##  $ slope   : Factor w/ 3 levels "0","1","2": 3 1 1 3 2 2 1 2 3 2 ...
##  $ ca      : int  2 0 0 1 3 0 3 1 0 2 ...
##  $ thal    : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 3 3 2 4 4 3 ...
##  $ target  : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...

Categorical Variables

1. Gender

Distribution of Heart Disease among males and females

Sex_condition <- table(Data_v2$target, Data_v2$sex)
Sex_condition
##    
##       0   1
##   0  86 413
##   1 226 300

Of the 312 females, 226 (about 72%) were diagnosed with heart disease, while of the 713 males, 300 (about 42%) were diagnosed. These proportions can also be computed directly, as sketched below.
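A minimal sketch (assuming the Sex_condition table above) to get the column-wise proportions directly:

#Column-wise proportions: share of each sex diagnosed with heart disease
prop.table(Sex_condition, margin = 2)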

2. cp - chest-pain

cp_condition <- table(Data_v2$target, Data_v2$cp)
cp_condition
##    
##       0   1   2   3
##   0 375  33  65  26
##   1 122 134 219  51

3. restecg - Resting electrocardiographic results

restecg_condition <- table(Data_v2$target, Data_v2$restecg)
restecg_condition
##    
##       0   1   2
##   0 283 204  12
##   1 214 309   3

4. exang - Exercise-induced angina

exang_condition <- table(Data_v2$target, Data_v2$exang)
exang_condition
##    
##       0   1
##   0 225 274
##   1 455  71

5. ca - Number of major vessels colored by fluoroscopy

ca_condition <- table(Data_v2$target, Data_v2$ca)
ca_condition
##    
##       0   1   2   3   4
##   0 163 160 113  60   3
##   1 415  66  21   9  15

6. thal - Thalassemia

thal_condition <- table(Data_v2$target, Data_v2$thal)
thal_condition
##    
##       0   1   2   3
##   0   4  43 132 320
##   1   3  21 412  90

Numeric variables

age_condition <- table(Data_v2$target, Data_v2$age)
age_condition
##    
##     29 34 35 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
##   0  0  0  7  0  4  4  8  3  4 11 10  6 11  8 11  8  9 10 15  8 21 20 22 36 44
##   1  4  6  8  6  8 10  3 29 22 15 26 19 12 10 12  9 12 29 28 18 32 10 17 21 24
##    
##     59 60 61 62 63 64 65 66 67 68 69 70 71 74 76 77
##   0 31 27 28 24 23 15 15 11 22  6  3 11  0  0  0  3
##   1 15 10  3 13  9 19 12 14  9  6  6  3 11  3  3  0
ggplot(Data_v2) +
  geom_point(aes(x = age, y= thalach, color = target)) 

barplot(age_condition, legend.text = TRUE)

#Visualize numeric variables using boxplots

#Create a subset with numerical data
Numerical <- Data_v2  %>%
  select(age, thalach, chol, oldpeak, trestbps, target) %>% 
  gather(key = "key", value = "value", -target)

 
Numerical %>% 
  ggplot(aes(y = value)) +
       geom_boxplot(aes(fill = target),
                      alpha  = .6,
                      fatten = .7) +
        labs(x = "",
             y = "",
             title = "Boxplots for Numeric Variables in relation with heart disease") +
      scale_fill_manual(
            values = c("#fde725ff", "#20a486ff"),
            name   = "Heart\nDisease",
            labels = c("No diagnosis", " Diagnosed")) +
      theme(
         axis.text.x  = element_blank(),
         axis.ticks.x = element_blank()) +
      facet_wrap(~ key, 
                 scales = "free", 
                 ncol   = 2) 

Data Splitting

set.seed(333)
sample <- sample.split(Data_v2$target, SplitRatio = 0.8)
train  <- subset(Data_v2, sample == TRUE)
test   <- subset(Data_v2, sample == FALSE)
X_test <- subset (test, select = -c(target))
Y_test <- subset (test, select = c(target))

Logistic regression model

I will start with a logistic regression model since the response variable is binary, including all features in the base model.

logistic_regression_model <- glm( target ~., data = train, family=binomial)
summary(logistic_regression_model)
## 
## Call:
## glm(formula = target ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6854  -0.4031   0.1216   0.4932   2.6934  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.260825   2.037376   0.619 0.536017    
## age         -0.013268   0.014365  -0.924 0.355658    
## sex1        -1.711465   0.319547  -5.356 8.51e-08 ***
## cp1          1.100787   0.340210   3.236 0.001214 ** 
## cp2          2.020901   0.288865   6.996 2.63e-12 ***
## cp3          2.295920   0.409963   5.600 2.14e-08 ***
## trestbps    -0.014422   0.006474  -2.228 0.025901 *  
## chol        -0.006030   0.002469  -2.442 0.014597 *  
## fbs1         0.331894   0.345824   0.960 0.337196    
## restecg1     0.529405   0.226993   2.332 0.019687 *  
## restecg2    -0.158371   1.268599  -0.125 0.900651    
## thalach      0.016917   0.006646   2.545 0.010918 *  
## exang1      -0.645600   0.264083  -2.445 0.014498 *  
## oldpeak     -0.527356   0.140291  -3.759 0.000171 ***
## slope1      -0.574139   0.507157  -1.132 0.257604    
## slope2       0.305994   0.554031   0.552 0.580740    
## ca          -0.818473   0.123692  -6.617 3.67e-11 ***
## thal1        2.211163   1.425322   1.551 0.120820    
## thal2        2.247562   1.376054   1.633 0.102398    
## thal3        0.969427   1.382058   0.701 0.483030    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1136.17  on 819  degrees of freedom
## Residual deviance:  550.93  on 800  degrees of freedom
## AIC: 590.93
## 
## Number of Fisher Scoring iterations: 6

We notice that age, fasting blood sugar (fbs), slope, and thal are not statistically significant in the base model, while sex, chest-pain type (cp), resting blood pressure (trestbps), cholesterol (chol), maximum heart rate (thalach), exercise-induced angina (exang), oldpeak, and ca all are (p < 0.05). A reduced model can be compared against the full model, as sketched below.
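As a quick check, a reduced model dropping the non-significant terms can be compared with the full model via a likelihood-ratio test. A minimal sketch, assuming the train data frame above (the variable selection here is illustrative, not definitive):

#Refit without the non-significant predictors and compare with the full
#model using a likelihood-ratio (chi-squared) test
reduced_model <- glm(target ~ . - age - fbs - slope - thal,
                     data = train, family = binomial)
anova(reduced_model, logistic_regression_model, test = "Chisq")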

Prediction

logistic_regression_predictions = predict(logistic_regression_model, X_test, type="response")

Confusion Matrix

A confusion matrix is a table that categorizes the predictions against the actual values: one dimension represents the predicted values and the other represents the actual values. A richer summary via caret is sketched after the table below.

Logistic_regression_CM = table(Y_test$target, logistic_regression_predictions >= 0.5)
Logistic_regression_CM
##    
##     FALSE TRUE
##   0    80   20
##   1    14   91
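For sensitivity, specificity, and other metrics beyond raw counts, caret's confusionMatrix can be applied to the same predictions. A minimal sketch, assuming the objects created above:

#Convert predicted probabilities to class labels and summarize with caret
pred_class <- factor(as.integer(logistic_regression_predictions >= 0.5),
                     levels = c(0, 1))
confusionMatrix(pred_class, Y_test$target, positive = "1")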

Accuracy

logistic_regression_accuracy <- sum(diag(Logistic_regression_CM)) / sum(Logistic_regression_CM)
print(paste('logistic_regression Accuracy ', logistic_regression_accuracy))
## [1] "logistic_regression Accuracy  0.834146341463415"

Decision tree model

DT_model <- rpart(target~., data = train, method = 'class')
rpart.plot(DT_model, extra = 106)

DT_model_predictions = predict(DT_model, X_test, type="class")
DT_CM = table(Y_test$target, DT_model_predictions)
DT_CM
##    DT_model_predictions
##       0   1
##   0  80  20
##   1   5 100
DT_accuracy <- sum(diag(DT_CM)) / sum(DT_CM)
print(paste('DT Accuracy ', DT_accuracy))
## [1] "DT Accuracy  0.878048780487805"

DT Model tuning

DT_control <- rpart.control(minsplit = 9, # minimum observations required to attempt a split
    minbucket = round(4 / 3),             # minimum observations in any terminal node
    maxdepth = 30,                        # maximum depth of the tree
    cp = 0)                               # no complexity penalty: grow the full tree
tune_DT <- rpart(target~., data = train, method = 'class', control = DT_control)
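With cp = 0 the tree is grown without a complexity penalty, which risks overfitting. A minimal sketch (assuming tune_DT above) of pruning back to the cross-validated optimum:

#Inspect the complexity-parameter table and prune to the cp value with the
#lowest cross-validated error (xerror)
printcp(tune_DT)
best_cp   <- tune_DT$cptable[which.min(tune_DT$cptable[, "xerror"]), "CP"]
pruned_DT <- prune(tune_DT, cp = best_cp)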

Prediction after model tuning

tuned_DT_model_predictions = predict(tune_DT, X_test, type="class")
tuned_DT_CM = table(Y_test$target, tuned_DT_model_predictions)
tuned_DT_accuracy <- sum(diag(tuned_DT_CM)) / sum(tuned_DT_CM)
print(paste('tuned_DT Accuracy ', tuned_DT_accuracy))
## [1] "tuned_DT Accuracy  0.970731707317073"

Principal component analysis (PCA)

I will use PCA with scale = TRUE, which standardizes each variable before extracting the principal components. Note that the PCA here is fitted on the full dataset, including the target column.

PCA_Model <- prcomp(Data_v1, scale = TRUE)
attributes(PCA_Model)
## $names
## [1] "sdev"     "rotation" "center"   "scale"    "x"       
## 
## $class
## [1] "prcomp"
#PCA_Model[["x"]]
#print(PCA_Model)

Check the eigenvalues and explained variance

eig.val <- get_eigenvalue(PCA_Model)
eig.val
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   3.3137491        23.669637                    23.66964
## Dim.2   1.5882035        11.344310                    35.01395
## Dim.3   1.2304486         8.788919                    43.80287
## Dim.4   1.1793388         8.423849                    52.22671
## Dim.5   0.9992057         7.137184                    59.36390
## Dim.6   0.9727274         6.948053                    66.31195
## Dim.7   0.8764322         6.260230                    72.57218
## Dim.8   0.7680307         5.485933                    78.05811
## Dim.9   0.7338385         5.241704                    83.29982
## Dim.10  0.6334123         4.524374                    87.82419
## Dim.11  0.5281293         3.772352                    91.59654
## Dim.12  0.4348326         3.105947                    94.70249
## Dim.13  0.3726179         2.661557                    97.36405
## Dim.14  0.3690333         2.635952                   100.00000
fviz_eig(PCA_Model)

summary(PCA_Model)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     1.8204 1.2602 1.10926 1.08597 0.99960 0.98627 0.9362
## Proportion of Variance 0.2367 0.1134 0.08789 0.08424 0.07137 0.06948 0.0626
## Cumulative Proportion  0.2367 0.3501 0.43803 0.52227 0.59364 0.66312 0.7257
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.87637 0.85664 0.79587 0.72673 0.65942 0.61042 0.60748
## Proportion of Variance 0.05486 0.05242 0.04524 0.03772 0.03106 0.02662 0.02636
## Cumulative Proportion  0.78058 0.83300 0.87824 0.91597 0.94702 0.97364 1.00000
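By the Kaiser criterion (eigenvalue greater than 1), the first four components would be retained, together explaining about 52% of the total variance. In the modeling below I instead keep the first 13 components and drop only PC14, so this is essentially scaling and rotation rather than real dimensionality reduction.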

Data splitting

sample2 <- sample.split(Data_v1$target, SplitRatio = 0.8)
train2  <- subset(Data_v1, sample2 == TRUE)
test2   <- subset(Data_v1, sample2 == FALSE)

test_pca <- predict(PCA_Model, test2)
test_pca = data.frame(test_pca,test2["target"])
test_pca <- test_pca[ , ! names(test_pca) %in% c("PC14")]
X_test_pca <- subset(test_pca, select = -c(target))
Y_test_pca <- subset(test_pca, select = c(target))
train_pca <- predict(PCA_Model, train2)
train_pca = data.frame(train_pca,train2["target"])
train_pca <- train_pca[ , ! names(train_pca) %in% c("PC14")]

Logistic regression model after PCA

PCA_logistic_regression_model <- glm(target~., data = train_pca, family = binomial)

Prediction

PCA_logistic_regression_predictions = predict(PCA_logistic_regression_model, X_test_pca, type="response")

Confusion Matrix

PCA_Logistic_regression_CM = table(Y_test_pca$target, PCA_logistic_regression_predictions >= 0.5)
PCA_Logistic_regression_CM
##    
##     FALSE TRUE
##   0   100    0
##   1     0  105
PCA_logistic_regression_accuracy <- sum(diag(PCA_Logistic_regression_CM)) / sum(PCA_Logistic_regression_CM)
print(paste('pca_logistic_regression Accuracy ', PCA_logistic_regression_accuracy))
## [1] "pca_logistic_regression Accuracy  1"

Unsupervised- Clustering

Clustering is an unsupervised learning problem: it deals with finding structure in a collection of unlabeled data. A cluster is a collection of objects that are "similar" to one another and "dissimilar" to the objects belonging to other clusters.

Note: we need to remove the output (target) from the data before applying an unsupervised technique.

clust_train <- train2[ , ! names(train2) %in% c("target")]
clust_test <- test2[ , ! names(test2) %in% c("target")]

K-means Clustering

kmeans_model <- cluster_analysis(clust_train,
                               n = 2,
                               method = "kmeans")
plot(kmeans_model)

As we can see, two groups (clusters) are formed.

km_model_predictions = predict(kmeans_model, clust_test)
#km_model_predictions
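As a cross-check, base R's kmeans can be run on the standardized features and its cluster assignments tabulated against the diagnosis to see how well the two clusters align with the target. A minimal sketch, assuming clust_train and train2 above:

#k-means on standardized features; tabulate clusters against the diagnosis
set.seed(333)
km_base <- kmeans(scale(clust_train), centers = 2, nstart = 25)
table(cluster = km_base$cluster, target = train2$target)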

Conclusion

The accuracy of the decision tree model is higher than that of the logistic regression model, and after tuning, the decision tree's accuracy increased to over 97%. Using PCA to scale the data, without any real dimensionality reduction, raised the accuracy to 100%; however, because the PCA was fitted on the full dataset (target included) before the train/test split, this perfect score should be read as a symptom of information leakage rather than genuine predictive power. Finally, by removing the target, I applied unsupervised learning (k-means clustering) to discover hidden patterns in the data, and the analysis formed two clusters.