You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homeworks. You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.gov, the Census Bureau, USGS, or other open data portals.
Select one of the methodologies studied in weeks 1-10, and one methodology from weeks 11-15, to apply to the newly selected dataset. To complete this task:
- describe the problem you are trying to solve;
- describe your dataset and what you did to prepare the data for analysis;
- describe the methodologies you used for analyzing the data;
- explain the purpose of the analysis performed;
- state your conclusions from your analysis.
Please be sure to address the business impact (it could be in any domain) of your solution.
I will use the Heart Disease dataset to build a logistic regression model and apply PCA, with the goal of predicting whether or not a patient has heart disease.
https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset?resource=download
#Required libraries
library(dplyr)
library(psych)
library(ggplot2)
library(ggcorrplot)
library(tidyverse)
library(caTools)
library(caret)
library(boot)
library(rpart)
library(rpart.plot)
library(parameters)
library(nnet)
library(factoextra)
#Read the data
Data_v1 <- read.csv(file = 'heart.csv')
#Read the first few rows of the data
head(Data_v1)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 52 1 0 125 212 0 1 168 0 1.0 2 2 3
## 2 53 1 0 140 203 1 0 155 1 3.1 0 0 3
## 3 70 1 0 145 174 0 1 125 1 2.6 0 0 3
## 4 61 1 0 148 203 0 1 161 0 0.0 2 1 3
## 5 62 0 0 138 294 1 1 106 0 1.9 1 3 2
## 6 58 0 0 100 248 0 0 122 0 1.0 1 0 2
## target
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 1
#Explore the data via summary.
summary(Data_v1)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :0.0000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:120.0
## Median :56.00 Median :1.0000 Median :1.0000 Median :130.0
## Mean :54.43 Mean :0.6956 Mean :0.9424 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.0000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.0000 Max. :200.0
## chol fbs restecg thalach
## Min. :126 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:132.0
## Median :240 Median :0.0000 Median :1.0000 Median :152.0
## Mean :246 Mean :0.1493 Mean :0.5298 Mean :149.1
## 3rd Qu.:275 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564 Max. :1.0000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.800 Median :1.000 Median :0.0000
## Mean :0.3366 Mean :1.072 Mean :1.385 Mean :0.7541
## 3rd Qu.:1.0000 3rd Qu.:1.800 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.200 Max. :2.000 Max. :4.0000
## thal target
## Min. :0.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :2.000 Median :1.0000
## Mean :2.324 Mean :0.5132
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000
describe(Data_v1)
## vars n mean sd median trimmed mad min max range skew
## age 1 1025 54.43 9.07 56.0 54.66 8.90 29 77.0 48.0 -0.25
## sex 2 1025 0.70 0.46 1.0 0.74 0.00 0 1.0 1.0 -0.85
## cp 3 1025 0.94 1.03 1.0 0.83 1.48 0 3.0 3.0 0.53
## trestbps 4 1025 131.61 17.52 130.0 130.39 14.83 94 200.0 106.0 0.74
## chol 5 1025 246.00 51.59 240.0 243.26 48.93 126 564.0 438.0 1.07
## fbs 6 1025 0.15 0.36 0.0 0.06 0.00 0 1.0 1.0 1.97
## restecg 7 1025 0.53 0.53 1.0 0.52 0.00 0 2.0 2.0 0.18
## thalach 8 1025 149.11 23.01 152.0 150.40 23.72 71 202.0 131.0 -0.51
## exang 9 1025 0.34 0.47 0.0 0.30 0.00 0 1.0 1.0 0.69
## oldpeak 10 1025 1.07 1.18 0.8 0.89 1.19 0 6.2 6.2 1.21
## slope 11 1025 1.39 0.62 1.0 1.45 1.48 0 2.0 2.0 -0.48
## ca 12 1025 0.75 1.03 0.0 0.57 0.00 0 4.0 4.0 1.26
## thal 13 1025 2.32 0.62 2.0 2.38 0.00 0 3.0 3.0 -0.52
## target 14 1025 0.51 0.50 1.0 0.52 0.00 0 1.0 1.0 -0.05
## kurtosis se
## age -0.53 0.28
## sex -1.28 0.01
## cp -1.15 0.03
## trestbps 0.97 0.55
## chol 3.96 1.61
## fbs 1.87 0.01
## restecg -1.31 0.02
## thalach -0.10 0.72
## exang -1.52 0.01
## oldpeak 1.29 0.04
## slope -0.65 0.02
## ca 0.68 0.03
## thal 0.24 0.02
## target -2.00 0.02
We can see a few variables with high variance and noticeable skew (e.g., chol and oldpeak). Also, there are no missing values in this dataset.
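As a quick sanity check (a minimal sketch), we can confirm the absence of missing values directly:
#Count missing values per column; all zeros confirm a complete dataset
colSums(is.na(Data_v1))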
Check the structure of all variables:
str(Data_v1)
## 'data.frame': 1025 obs. of 14 variables:
## $ age : int 52 53 70 61 62 58 58 55 46 54 ...
## $ sex : int 1 1 1 1 0 0 1 1 1 1 ...
## $ cp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ trestbps: int 125 140 145 148 138 100 114 160 120 122 ...
## $ chol : int 212 203 174 203 294 248 318 289 249 286 ...
## $ fbs : int 0 1 0 0 1 0 0 0 0 0 ...
## $ restecg : int 1 0 1 1 1 0 2 0 0 0 ...
## $ thalach : int 168 155 125 161 106 122 140 145 144 116 ...
## $ exang : int 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
## $ slope : int 2 0 0 2 1 1 0 1 2 1 ...
## $ ca : int 2 0 0 1 3 0 3 1 0 2 ...
## $ thal : int 3 3 3 3 2 2 1 3 3 2 ...
## $ target : int 0 0 0 0 0 1 0 0 0 0 ...
We need to change the data type of some variables to factors.
corr = cor(Data_v1)
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE, lab_size = 3, method="circle", colors = c("blue", "white", "red"), outline.color = "gray", show.legend = TRUE, show.diag = FALSE, title="Correlogram of the data")
To understand our variables better, I created the following table describing the definition, data type, and value range of each feature.
| Feature | Definition | Type | Value |
|---|---|---|---|
| age | Patient’s age in years | Numerical | 29 - 77 |
| sex | Gender | Nominal | (0)female, (1)male |
| cp | Type of chest-pain | Nominal | (0)typical angina, (1)atypical angina, (2)non-angina pain, (3)asymptomatic |
| trestbps | Resting blood pressure in mmHg | Numerical | 94 - 200 |
| chol | Serum cholesterol in mg/dl | Numerical | 126 - 564 |
| fbs | Fasting blood sugar higher than 120 mg/dl | Nominal | (0)False (1)True |
| restecg | Resting electrocardiographic results | Nominal | (0)normal, (1)having ST-T wave abnormality, (2)showing probable left ventricular hypertrophy |
| thalach | Maximum heart rate achieved | Numerical | 71 - 202 |
| exang | Exercise induced angina | Nominal | (0)No, (1)Yes |
| oldpeak | ST depression induced by exercise relative to rest | Numerical | 0 - 6.2 |
| slope | The slope of the peak exercise ST segment | Nominal | (0)upsloping, (1)flat, (2)downsloping |
| ca | Number of major vessels colored by fluoroscopy | Nominal | 0, 1, 2, 3, 4 |
| thal | Thalassemia | Nominal | commonly documented as (0)normal, (1)fixed defect, (2)reversible defect, though the data contains values 0 - 3 |
| target | Diagnosis of heart disease | Nominal | (0)heart disease not present, (1)heart disease present |
#Get the columns names
cols = colnames(Data_v1)
cols
## [1] "age" "sex" "cp" "trestbps" "chol" "fbs"
## [7] "restecg" "thalach" "exang" "oldpeak" "slope" "ca"
## [13] "thal" "target"
Check the number of unique values in each variable:
for (i in cols){
uniq = n_distinct(Data_v1[i])
print(paste0(i, " : ", uniq))
}
## [1] "age : 41"
## [1] "sex : 2"
## [1] "cp : 4"
## [1] "trestbps : 49"
## [1] "chol : 152"
## [1] "fbs : 2"
## [1] "restecg : 3"
## [1] "thalach : 91"
## [1] "exang : 2"
## [1] "oldpeak : 40"
## [1] "slope : 3"
## [1] "ca : 5"
## [1] "thal : 4"
## [1] "target : 2"
Create empty lists to separate the categorical and numerical variables. Variables with fewer than 5 distinct values are treated as categorical; note that ca, with 5 distinct values, therefore stays numerical.
Categorical_att = list()
Numerical_attr = list()
for (i in cols){
uniq = n_distinct(Data_v1[i])
if (uniq < 5 ){
Categorical_att= append(Categorical_att, i)
}
else{
Numerical_attr= append(Numerical_attr, i)
}
}
#Make a copy of the data
Data_v2 = Data_v1
Convert the categorical variables to factors:
for (i in Categorical_att){
Data_v2[,i] = as.factor(Data_v2[,i])
}
Confirm the changes in data types:
str(Data_v2)
## 'data.frame': 1025 obs. of 14 variables:
## $ age : int 52 53 70 61 62 58 58 55 46 54 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 2 2 2 2 ...
## $ cp : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ trestbps: int 125 140 145 148 138 100 114 160 120 122 ...
## $ chol : int 212 203 174 203 294 248 318 289 249 286 ...
## $ fbs : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 1 1 1 1 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 2 1 2 2 2 1 3 1 1 1 ...
## $ thalach : int 168 155 125 161 106 122 140 145 144 116 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
## $ slope : Factor w/ 3 levels "0","1","2": 3 1 1 3 2 2 1 2 3 2 ...
## $ ca : int 2 0 0 1 3 0 3 1 0 2 ...
## $ thal : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 3 3 2 4 4 3 ...
## $ target : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
1. Gender
Distribution of Heart Disease among males and females
Sex_condition <- table(Data_v2$target, Data_v2$sex)
Sex_condition
##
## 0 1
## 0 86 413
## 1 226 300
226 of the 312 females (72%) were diagnosed with heart disease, while 300 of the 713 males (42%) were diagnosed.
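The same comparison is easier to read as rates; a small sketch using column proportions:
#Disease rate within each sex (columns: 0 = female, 1 = male)
prop.table(Sex_condition, margin = 2)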
2. cp - chest-pain
cp_condition <- table(Data_v2$target, Data_v2$cp)
cp_condition
##
## 0 1 2 3
## 0 375 33 65 26
## 1 122 134 219 51
3. restecg - Resting electrocardiographic results
restecg_condition <- table(Data_v2$target, Data_v2$restecg)
restecg_condition
##
## 0 1 2
## 0 283 204 12
## 1 214 309 3
4. exang - Exercise induced angina
exang_condition <- table(Data_v2$target, Data_v2$exang)
exang_condition
##
## 0 1
## 0 225 274
## 1 455 71
5. ca - Number of major vessels colored by fluoroscopy
ca_condition <- table(Data_v2$target, Data_v2$ca)
ca_condition
##
## 0 1 2 3 4
## 0 163 160 113 60 3
## 1 415 66 21 9 15
6. thal - Thalassemia
thal_condition <- table(Data_v2$target, Data_v2$thal)
thal_condition
##
## 0 1 2 3
## 0 4 43 132 320
## 1 3 21 412 90
7. age - Age distribution by diagnosis
age_condition <- table(Data_v2$target, Data_v2$age)
age_condition
##
## 29 34 35 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
## 0 0 0 7 0 4 4 8 3 4 11 10 6 11 8 11 8 9 10 15 8 21 20 22 36 44
## 1 4 6 8 6 8 10 3 29 22 15 26 19 12 10 12 9 12 29 28 18 32 10 17 21 24
##
## 59 60 61 62 63 64 65 66 67 68 69 70 71 74 76 77
## 0 31 27 28 24 23 15 15 11 22 6 3 11 0 0 0 3
## 1 15 10 3 13 9 19 12 14 9 6 6 3 11 3 3 0
ggplot(Data_v2) +
geom_point(aes(x = age, y= thalach, color = target))
barplot(age_condition, legend.text = TRUE)
#Visualize numeric variables using boxplots
#Create a subset with numerical data
Numerical <- Data_v2 %>%
select(age, thalach, chol, oldpeak, trestbps, target) %>%
gather(key = "key", value = "value", -target)
Numerical %>%
ggplot(aes(y = value)) +
geom_boxplot(aes(fill = target),
alpha = .6,
fatten = .7) +
labs(x = "",
y = "",
title = "Boxplots for Numeric Variables in relation with heart disease") +
scale_fill_manual(
values = c("#fde725ff", "#20a486ff"),
name = "Heart\nDisease",
labels = c("No diagnosis", " Diagnosed")) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
facet_wrap(~ key,
scales = "free",
ncol = 2)
set.seed(333)
#Split the data 80/20, stratified on the target variable
sample <- sample.split(Data_v2$target, SplitRatio = 0.8)
train <- subset(Data_v2, sample == TRUE)
test <- subset(Data_v2, sample == FALSE)
#Separate the test features from the test labels
X_test <- subset(test, select = -c(target))
Y_test <- subset(test, select = c(target))
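Because sample.split stratifies on target, the class balance should be preserved in both subsets; a quick check (a sketch):
#Compare class proportions between the training and test sets
prop.table(table(train$target))
prop.table(table(test$target))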
I will start with a logistic regression model, since the response variable is binary, and include all features in the base model.
logistic_regression_model <- glm( target ~., data = train, family=binomial)
summary(logistic_regression_model)
##
## Call:
## glm(formula = target ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6854 -0.4031 0.1216 0.4932 2.6934
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.260825 2.037376 0.619 0.536017
## age -0.013268 0.014365 -0.924 0.355658
## sex1 -1.711465 0.319547 -5.356 8.51e-08 ***
## cp1 1.100787 0.340210 3.236 0.001214 **
## cp2 2.020901 0.288865 6.996 2.63e-12 ***
## cp3 2.295920 0.409963 5.600 2.14e-08 ***
## trestbps -0.014422 0.006474 -2.228 0.025901 *
## chol -0.006030 0.002469 -2.442 0.014597 *
## fbs1 0.331894 0.345824 0.960 0.337196
## restecg1 0.529405 0.226993 2.332 0.019687 *
## restecg2 -0.158371 1.268599 -0.125 0.900651
## thalach 0.016917 0.006646 2.545 0.010918 *
## exang1 -0.645600 0.264083 -2.445 0.014498 *
## oldpeak -0.527356 0.140291 -3.759 0.000171 ***
## slope1 -0.574139 0.507157 -1.132 0.257604
## slope2 0.305994 0.554031 0.552 0.580740
## ca -0.818473 0.123692 -6.617 3.67e-11 ***
## thal1 2.211163 1.425322 1.551 0.120820
## thal2 2.247562 1.376054 1.633 0.102398
## thal3 0.969427 1.382058 0.701 0.483030
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1136.17 on 819 degrees of freedom
## Residual deviance: 550.93 on 800 degrees of freedom
## AIC: 590.93
##
## Number of Fisher Scoring iterations: 6
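Logistic regression coefficients are on the log-odds scale; exponentiating them yields odds ratios, which are easier to interpret (a small sketch):
#Odds ratios: multiplicative change in the odds of heart disease
#per one-unit increase in each predictor (holding the others fixed)
exp(coef(logistic_regression_model))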
We notice that age, fasting blood sugar (fbs), slope, and thal are not statistically significant at the 5% level in the base model, while sex, chest pain type, resting blood pressure, cholesterol, restecg (level 1), thalach, exercise-induced angina, oldpeak, and ca all are.
logistic_regression_predictions = predict(logistic_regression_model, X_test, type="response")
A confusion matrix is a table that categorizes the predictions against the actual values: one dimension represents the predicted values and the other represents the actual values.
Logistic_regression_CM = table(Y_test$target, logistic_regression_predictions >= 0.5)
Logistic_regression_CM
##
## FALSE TRUE
## 0 80 20
## 1 14 91
logistic_regression_accuracy <- sum(diag(Logistic_regression_CM)) / sum(Logistic_regression_CM)
print(paste('logistic_regression Accuracy ', logistic_regression_accuracy))
## [1] "logistic_regression Accuracy 0.834146341463415"
Next, for comparison, I fit a decision tree classifier:
DT_model <- rpart(target~., data = train, method = 'class')
rpart.plot(DT_model, extra = 106)
DT_model_predictions = predict(DT_model, X_test, type="class")
DT_CM = table(Y_test$target, DT_model_predictions)
DT_CM
## DT_model_predictions
## 0 1
## 0 80 20
## 1 5 100
DT_accuracy <- sum(diag(DT_CM)) / sum(DT_CM)
print(paste('DT Accuracy ', DT_accuracy))
## [1] "DT Accuracy 0.878048780487805"
Next, I tune the decision tree by relaxing its control parameters:
DT_control <- rpart.control(minsplit = 9,
minbucket = round(4 / 3),
maxdepth = 30,
cp = 0)
tune_DT <- rpart(target~., data = train, method = 'class', control = DT_control)
tuned_DT_model_predictions = predict(tune_DT, X_test, type="class")
tuned_DT_CM = table(Y_test$target, tuned_DT_model_predictions)
tuned_DT_accuracy <- sum(diag(tuned_DT_CM)) / sum(tuned_DT_CM)
print(paste('tuned_DT Accuracy ', tuned_DT_accuracy))
## [1] "tuned_DT Accuracy 0.970731707317073"
I will use PCA with standardized variables (scale = TRUE) to transform the data into uncorrelated principal components. Note that the PCA below is fit on the full dataset, including the target column.
PCA_Model <- prcomp(Data_v1, scale = TRUE)
attributes(PCA_Model)
## $names
## [1] "sdev" "rotation" "center" "scale" "x"
##
## $class
## [1] "prcomp"
#PCA_Model[["x"]]
#print(PCA_Model)
eig.val <- get_eigenvalue(PCA_Model)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.3137491 23.669637 23.66964
## Dim.2 1.5882035 11.344310 35.01395
## Dim.3 1.2304486 8.788919 43.80287
## Dim.4 1.1793388 8.423849 52.22671
## Dim.5 0.9992057 7.137184 59.36390
## Dim.6 0.9727274 6.948053 66.31195
## Dim.7 0.8764322 6.260230 72.57218
## Dim.8 0.7680307 5.485933 78.05811
## Dim.9 0.7338385 5.241704 83.29982
## Dim.10 0.6334123 4.524374 87.82419
## Dim.11 0.5281293 3.772352 91.59654
## Dim.12 0.4348326 3.105947 94.70249
## Dim.13 0.3726179 2.661557 97.36405
## Dim.14 0.3690333 2.635952 100.00000
fviz_eig(PCA_Model)
summary(PCA_Model)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8204 1.2602 1.10926 1.08597 0.99960 0.98627 0.9362
## Proportion of Variance 0.2367 0.1134 0.08789 0.08424 0.07137 0.06948 0.0626
## Cumulative Proportion 0.2367 0.3501 0.43803 0.52227 0.59364 0.66312 0.7257
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.87637 0.85664 0.79587 0.72673 0.65942 0.61042 0.60748
## Proportion of Variance 0.05486 0.05242 0.04524 0.03772 0.03106 0.02662 0.02636
## Cumulative Proportion 0.78058 0.83300 0.87824 0.91597 0.94702 0.97364 1.00000
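If dimensionality reduction were the goal, a common heuristic (sketched below) is to keep enough components to explain a chosen share of the total variance:
#Number of components needed to explain at least 90% of the variance
which(eig.val$cumulative.variance.percent >= 90)[1]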
#Create a fresh 80/20 split of the raw numeric data
sample2 <- sample.split(Data_v1$target, SplitRatio = 0.8)
train2 <- subset(Data_v1, sample2 == TRUE)
test2 <- subset(Data_v1, sample2 == FALSE)
#Project the test rows into principal-component space
test_pca <- predict(PCA_Model, test2)
test_pca = data.frame(test_pca, test2["target"])
#Drop the last, least-informative component (PC14)
test_pca <- test_pca[ , ! names(test_pca) %in% c("PC14")]
X_test_pca <- subset(test_pca, select = -c(target))
Y_test_pca <- subset(test_pca, select = c(target))
#Project the training rows the same way
train_pca <- predict(PCA_Model, train2)
train_pca = data.frame(train_pca, train2["target"])
train_pca <- train_pca[ , ! names(train_pca) %in% c("PC14")]
#Fit a logistic regression on the principal-component scores
PCA_logistic_regression_model <- glm(target ~ ., data = train_pca, family = binomial)
PCA_logistic_regression_predictions = predict(PCA_logistic_regression_model, X_test_pca, type="response")
PCA_Logistic_regression_CM = table(Y_test_pca$target, PCA_logistic_regression_predictions >= 0.5)
PCA_Logistic_regression_CM
##
## FALSE TRUE
## 0 100 0
## 1 0 105
PCA_logistic_regression_accuracy <- sum(diag(PCA_Logistic_regression_CM)) / sum(PCA_Logistic_regression_CM)
print(paste('pca_logistic_regression Accuracy ', PCA_logistic_regression_accuracy))
## [1] "pca_logistic_regression Accuracy 1"
Clustering is an unsupervised learning problem: it deals with finding structure in a collection of unlabeled data. A cluster is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
Note: we need to remove the output (target) from the data before applying an unsupervised technique.
clust_train <- train2[ , ! names(train2) %in% c("target")]
clust_test <- test2[ , ! names(test2) %in% c("target")]
kmeans_model <- cluster_analysis(clust_train,
n = 2,
method = "kmeans")
plot(kmeans_model)
As we can see, two groups (clusters) are formed.
km_model_predictions = predict(kmeans_model, clust_test)
#km_model_predictions
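Although the clustering never sees the labels, we can check after the fact how well the two clusters align with the diagnosis; a sketch, assuming predict() returns a vector of cluster assignments:
#Cross-tabulate cluster assignments against the held-out target
table(km_model_predictions, test2$target)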
The accuracy of the decision tree model (88%) is higher than that of the logistic regression model (83%), and after tuning, the decision tree's accuracy increased to about 97%. Applying logistic regression to the PCA-transformed, standardized data (retaining 13 of the 14 components) raised test accuracy to 100%; this result should be interpreted with caution, however, because the PCA was fit on the full dataset with target included, so the component scores leak information about the outcome. Finally, by removing target, I applied unsupervised k-means clustering to discover hidden patterns in the data. In terms of business impact, a screening model with high sensitivity could help clinicians prioritize patients for further cardiac testing, catching disease earlier while reducing unnecessary procedures.