The data table comes from a cardiovascular (CV) study conducted at the Cleveland Clinic between May 1981 and September 1984. None of the patients had a history of CV disease. After providing their medical history, each patient underwent a number of clinical tests, and a subset of the resulting features was collected in the study data table.
The dataset was loaded and assigned to a variable called data.
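The loading step itself is not shown in the report; a minimal sketch, where the file name Heart.csv is an assumption:

# Assumed file name; reads the 303 x 15 data frame described below
data <- read.csv("Heart.csv")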
The data has 303 rows (one per patient) and 15 fields: X, Age, Sex, ChestPain, RestBP, Chol, Fbs, RestECG, MaxHR, ExAng, Oldpeak, Slope, Ca, Thal, and AHD. Because the first column, X, is just the patient ID, it is not needed for this report, so I remove it.
# Remove the first column
data_col <- colnames(data)
data_col <- data_col[data_col != "X"]
data <- data[data_col]
# Process the categorical variables
data$Sex <- factor(data$Sex) # In the Sex variable, 0 = 'female' and 1 = 'male'
data$ChestPain <- factor(data$ChestPain,
levels = c('asymptomatic','nonanginal','nontypical','typical'),
labels = c(4,3,2,1))
data$Fbs <- factor(data$Fbs)
data$RestECG <- factor(data$RestECG)
data$ExAng <- factor(data$ExAng)
data$Slope <- factor(data$Slope)
data$Ca <- factor(data$Ca)
data$Thal <- factor(data$Thal, levels = c('normal','fixed','reversable'),
labels = c(1,2,3))
data$AHD <- factor(data$AHD, levels = c('Yes','No'), labels = c(1,0))
str(data)

## 'data.frame': 303 obs. of 14 variables:
## $ Age : int 63 67 67 37 41 56 62 57 63 53 ...
## $ Sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 1 2 2 ...
## $ ChestPain: Factor w/ 4 levels "4","3","2","1": 4 1 1 2 3 3 1 1 1 1 ...
## $ RestBP : int 145 160 120 130 130 120 140 120 130 140 ...
## $ Chol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ Fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
## $ RestECG : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
## $ MaxHR : int 150 108 129 187 172 178 160 163 147 155 ...
## $ ExAng : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ Oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ Slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
## $ Ca : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
## $ Thal : Factor w/ 3 levels "1","2","3": 2 1 3 1 1 1 1 1 3 3 ...
## $ AHD : Factor w/ 2 levels "1","0": 2 1 1 2 2 2 1 2 1 1 ...
factor_variables <- c("Sex", "Fbs", "RestECG", "ExAng", "Slope", "Ca", "Thal")
numeric_variables <- c("Age", "RestBP", "Chol", "MaxHR", "Oldpeak")

The basic summary statistics of the data are shown below:
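The call that produced this output is not shown in the report; presumably it is base R's summary():

summary(data)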
##       Age        Sex    ChestPain     RestBP           Chol       Fbs
##  Min.   :29.00   0: 97   4:144    Min.   : 94.0   Min.   :126.0   0:258
##  1st Qu.:48.00   1:206   3: 86    1st Qu.:120.0   1st Qu.:211.0   1: 45
##  Median :56.00           2: 50    Median :130.0   Median :241.0
##  Mean   :54.44           1: 23    Mean   :131.7   Mean   :246.7
##  3rd Qu.:61.00                    3rd Qu.:140.0   3rd Qu.:275.0
##  Max.   :77.00                    Max.   :200.0   Max.   :564.0
##  RestECG     MaxHR       ExAng     Oldpeak     Slope     Ca        Thal
##  0:151   Min.   : 71.0   0:204   Min.   :0.00   1:142   0   :176   1   :166
##  1:  4   1st Qu.:133.5   1: 99   1st Qu.:0.00   2:140   1   : 65   2   : 18
##  2:148   Median :153.0           Median :0.80   3: 21   2   : 38   3   :117
##          Mean   :149.6           Mean   :1.04           3   : 20   NA's:  2
##          3rd Qu.:166.0           3rd Qu.:1.60           NA's:  4
##          Max.   :202.0           Max.   :6.20
##   AHD
##  1:139
##  0:164
Because the data contain missing values (in the Ca and Thal columns), I removed them before building the model. After removing the NAs, the table consisted of 172 rows of patients.
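The removal step is not shown in the report; a minimal sketch using base R's na.omit():

# Drop every row containing at least one NA
data <- na.omit(data)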
Because AHD is a categorical variable with two values, Yes and No, I use boxplots to show how the continuous variables differ between the two AHD groups.
p1 <- ggplot(data = data, aes(x=AHD, y=Age, color=AHD)) + geom_boxplot() + theme(legend.position = "none")
p2 <- ggplot(data = data, aes(x=AHD, y=RestBP, color=AHD)) + geom_boxplot() + theme(legend.position = "none")
p3 <- ggplot(data = data, aes(x=AHD, y=Chol, color=AHD)) + geom_boxplot() + theme(legend.position = "none")
p4 <- ggplot(data = data, aes(x=AHD, y=MaxHR, color=AHD)) + geom_boxplot() + theme(legend.position = "none")
p5 <- ggplot(data = data, aes(x=AHD, y=Oldpeak, color=AHD)) + geom_boxplot() + theme(legend.position = "none")
grid.arrange(p1, p2, p3, p4, p5)

Based on the charts above, the boxplots show a difference in the distributions of Age, MaxHR, and Oldpeak between the AHD groups. These hypotheses are tested with t-tests below.
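The t-test calls are not shown in the report; given the Welch output that follows, they were presumably of this form:

# Welch two-sample t-tests of each continuous variable across the AHD groups
t.test(Age ~ AHD, data = data)
t.test(MaxHR ~ AHD, data = data)
t.test(Oldpeak ~ AHD, data = data)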
##
## Welch Two Sample t-test
##
## data: Age by AHD
## t = 4.0636, df = 294.66, p-value = 6.204e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.122234 6.108514
## sample estimates:
## mean in group 1 mean in group 0
## 56.75912 52.64375
##
## Welch Two Sample t-test
##
## data: MaxHR by AHD
## t = -7.9286, df = 266.44, p-value = 6.108e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -24.30715 -14.63637
## sample estimates:
## mean in group 1 mean in group 0
## 139.1095 158.5813
##
## Welch Two Sample t-test
##
## data: Oldpeak by AHD
## t = 7.7558, df = 216, p-value = 3.429e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.738632 1.241970
## sample estimates:
## mean in group 1 mean in group 0
## 1.589051 0.598750
The t-tests show a significant difference in the means of Age, MaxHR, and Oldpeak between the patients with AHD and those without.
Now let's build a neural network model. In this report, I build a neural-network model of logistic-regression type on the AHD variable, in order to predict whether or not a patient has heart disease (corresponding to the two values Yes and No).
First, I used the sample.split() function from the caTools package to separate the data into two parts: training data and testing data.
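The split itself is not shown in the report; a minimal sketch with the 80/20 ratio described below (the seed is a hypothetical choice for reproducibility):

library(caTools)
set.seed(123) # hypothetical seed
split <- sample.split(data$AHD, SplitRatio = 0.8) # TRUE = training row, FALSE = testing row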
# Create y_train in training data
y_train <- subset(data, split == TRUE)
y_train <- y_train[14]
y_train <- as.matrix(y_train)
y_train <- as.vector(y_train)
y_train <- as.numeric(as.character(y_train))
# Create y_test in testing data
y_test <- subset(data, split == FALSE)
y_test <- y_test[14]
y_test <- as.matrix(y_test)
y_test <- as.vector(y_test)
y_test <- as.numeric(as.character(y_test))

In the code above, I created the y_train and y_test variables, which hold the actual values of AHD. Both were converted to the numeric type that the model requires. y_train contains 80% of the AHD values and y_test contains the remaining 20%. Next, I created x_train and x_test to accompany y_train and y_test; each contains the remaining 13 columns, with the AHD column excluded. These are the predictor variables used to model AHD.
# Create x_train
x_train <- subset(data, split == TRUE)
x_train <- x_train[1:13]
x_train$Age <- as.numeric(x_train$Age)
x_train$Sex <- as.numeric(as.character(x_train$Sex))
x_train$ChestPain <- as.numeric(as.character(x_train$ChestPain))
x_train$RestBP <- as.numeric(x_train$RestBP)
x_train$Chol <- as.numeric(x_train$Chol)
x_train$Fbs <- as.numeric(as.character(x_train$Fbs))
x_train$RestECG <- as.numeric(as.character(x_train$RestECG))
x_train$MaxHR <- as.numeric(x_train$MaxHR)
x_train$ExAng <- as.numeric(as.character(x_train$ExAng))
x_train$Oldpeak <- as.numeric(x_train$Oldpeak)
x_train$Slope <- as.numeric(as.character(x_train$Slope))
x_train$Ca <- as.numeric(as.character(x_train$Ca))
x_train$Thal<- as.numeric(as.character(x_train$Thal))
x_train <- as.matrix(x_train)
x_train <- scale(x_train)
# Create x_test
x_test <- subset(data, split == FALSE)
x_test <- x_test[1:13]
x_test$Age <- as.numeric(x_test$Age)
x_test$Sex <- as.numeric(as.character(x_test$Sex))
x_test$ChestPain <- as.numeric(as.character(x_test$ChestPain))
x_test$RestBP <- as.numeric(x_test$RestBP)
x_test$Chol <- as.numeric(x_test$Chol)
x_test$Fbs <- as.numeric(as.character(x_test$Fbs))
x_test$RestECG <- as.numeric(as.character(x_test$RestECG))
x_test$MaxHR <- as.numeric(x_test$MaxHR)
x_test$ExAng <- as.numeric(as.character(x_test$ExAng))
x_test$Oldpeak <- as.numeric(x_test$Oldpeak)
x_test$Slope <- as.numeric(as.character(x_test$Slope))
x_test$Ca <- as.numeric(as.character(x_test$Ca))
x_test$Thal <- as.numeric(as.character(x_test$Thal))
x_test <- as.matrix(x_test)
col_means_train <- attr(x_train, "scaled:center")
col_stddevs_train <- attr(x_train, "scaled:scale")
x_test <- scale(x_test, center = col_means_train, scale = col_stddevs_train)

Both x_train and x_test are now standardized with the scale() function. Note that x_test is scaled using the mean and standard deviation of x_train, so the test inputs are put on exactly the scale the model was trained on and no information from the test set leaks into the preprocessing.
After preprocessing the data and creating the training and testing sets, I built and trained the model.
model <- keras_model_sequential()
model %>%
layer_dense(units = 16, activation = 'relu',input_shape = dim(x_train)[2]) %>%
layer_dense(units = 1, activation = 'sigmoid')
summary(model)

## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense (Dense) (None, 16) 224
## ________________________________________________________________________________
## dense_1 (Dense) (None, 1) 17
## ================================================================================
## Total params: 241
## Trainable params: 241
## Non-trainable params: 0
## ________________________________________________________________________________
model %>% compile(
optimizer = optimizer_adam(),
loss = 'binary_crossentropy',
metrics = c('accuracy')
)

In the code above, I built the neural network. It has two dense layers: a 16-unit hidden layer with ReLU activation applied to the 13 input features, and a single-unit output layer with sigmoid activation. In the compile step, the loss function is binary_crossentropy, the optimizer is Adam, and accuracy is used as the evaluation metric.
Let's train the model with fit(), using 200 epochs.
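The training call is not shown in the report; a minimal sketch in which everything except the 200 epochs is an assumption:

history <- model %>% fit(
  x_train, y_train,
  epochs = 200,           # stated in the report
  batch_size = 32,        # assumed, not given in the report
  validation_split = 0.2  # assumed, not given in the report
)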
After training, I evaluated the loss and accuracy on the testing data. The results are loss: 0.3077 and accuracy: 0.8305. Is that result good enough?
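The evaluation was presumably done with evaluate(), along these lines:

model %>% evaluate(x_test, y_test) # returns the test loss and accuracy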
The model's predicted probabilities on the test set are stored in the p1 variable (reusing the name p1 from the earlier boxplot):
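The prediction call is not shown in the report; a plausible sketch that matches the printed output (showing only the first 20 test predictions is an assumption):

p1 <- model %>% predict(x_test) # predicted probability of AHD = 1 for each test patient
head(p1, 20)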
## [,1]
## [1,] 0.99867970
## [2,] 0.09159675
## [3,] 0.01610032
## [4,] 0.99753481
## [5,] 0.95574594
## [6,] 0.98702645
## [7,] 0.99302530
## [8,] 0.12536159
## [9,] 0.26144147
## [10,] 0.98831415
## [11,] 0.73331028
## [12,] 0.42922673
## [13,] 0.12150046
## [14,] 0.98800761
## [15,] 0.04571483
## [16,] 0.99111485
## [17,] 0.10394582
## [18,] 0.28652352
## [19,] 0.20704731
## [20,] 0.95554471
Finally, I used a confusion matrix to check the result once more:
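The code for this step is not shown in the report; a sketch using caret's confusionMatrix(), assuming the predicted probabilities are cut at a 0.5 threshold:

library(caret)
# Convert predicted probabilities to class labels at an assumed 0.5 cutoff
pred_class <- factor(ifelse(p1 > 0.5, 1, 0), levels = c(0, 1))
confusionMatrix(pred_class, factor(y_test, levels = c(0, 1)))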
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 30 3
## 1 2 24
##
## Accuracy : 0.9153
## 95% CI : (0.8132, 0.9719)
## No Information Rate : 0.5424
## P-Value [Acc > NIR] : 5.04e-10
##
## Kappa : 0.8288
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9375
## Specificity : 0.8889
## Pos Pred Value : 0.9091
## Neg Pred Value : 0.9231
## Prevalence : 0.5424
## Detection Rate : 0.5085
## Detection Prevalence : 0.5593
## Balanced Accuracy : 0.9132
##
## 'Positive' Class : 0
##
In this report, I built a neural network model for logistic-regression-style binary classification. Using the Adam method to optimize the loss function and the two activation functions ReLU and sigmoid, the result is quite good, with loss 0.3077 and accuracy 0.8305 on the test set (0.9153 according to the confusion matrix at a 0.5 threshold). This is not the most optimal model possible, but it is a useful tool for predicting whether or not a patient has heart disease, and a good assistant for a doctor.