Introduction

In the second group project, I used the “neuralnet” package implementation of feedforward neural networks to determine whether or not a person had a stroke given various categorical and numerical health information such as BMI, blood glucose levels, and the presence or lack of heart disease. For a sample of my best work, I will then explore the “neuralnet” feedforward neural network implementation to predict whether or not a person has heart disease—as indicated by more than 50% narrowing of the coronary arteries—using a simple neural network with one hidden layer. I will also perform a very basic hyperparameter search to optimize the number of nodes in the hidden layer in a small range.

Dataset

The dataset used will be the Kaggle Cleveland Heart Disease UCI Dataset, similar to the one used by the heart disease subgroup in the second group project. The dataset consists of 303 rows with 13 attributes of health data and a final target column indicating whether or not the person was diagnosed with heart disease.

Data Viewing and Wrangling

The data was first opened and viewed.

data <- read_csv("heart.csv")

head(data[c(1:9)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
age sex cp trestbps chol fbs restecg thalach exang
63 1 3 145 233 1 0 150 0
37 1 2 130 250 0 1 187 0
41 0 1 130 204 0 0 172 0
56 1 1 120 236 0 1 178 0
57 0 0 120 354 0 1 163 1
57 1 0 140 192 0 1 148 0
head(data[c(10:14)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
oldpeak slope ca thal target
2.3 0 0 1 1
3.5 0 0 2 1
1.4 2 0 2 1
0.8 2 0 2 1
0.6 2 0 2 1
0.4 1 0 1 1

We see that these columns correspond to age, sex, class of chest pain, resting blood pressure, serum cholesterol, resting electrocardiographic results, maximum heart rate achieved from tests, presence of exercise-induced chest pain, whether or not ST depression is indicated by the electrocardiogram following exercise, the integer slope of the ST segment of the electrocardiogram at peak exercise, and the number of major vessels which were marked by fluoroscopy.

We can then view verify the size of the dataset and determine the ratio of positive to negative cases.

nrow(data)
## [1] 303
num_cases = table(data$target)
num_cases
## 
##   0   1 
## 138 165
num_cases[2]/num_cases[1]
##        1 
## 1.195652

We see that there are indeed 303 rows in the dataset, and, fortunately, the ratio of positive to negative cases is 1.2 and therefore sufficiently close to 1, and so large biases towards either the negative or positive class should not be learned by the model.

Because, according to the dataset source, no NA values exist in this processed portion of the data, we do not need to attempt to remove or otherwise correct for rows with NA values. Instead, we can now one-hot encode the categorical columns of chest pain type and resting electrocardiogram results and view the dataset again before we split it into training, validation, and test sets.

data$cp <- factor(data$cp, levels = c(0, 1, 2, 3))
data$restecg <- factor(data$restecg, levels = c(0, 1, 2))
dummies <- dummyVars("~ .", data=data)
encoded_data <- data.frame(predict(dummies, newdata=data))

head(encoded_data[c(1:10)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
age sex cp.0 cp.1 cp.2 cp.3 trestbps chol fbs restecg.0
63 1 0 0 0 1 145 233 1 1
37 1 0 0 1 0 130 250 0 0
41 0 0 1 0 0 130 204 0 1
56 1 0 1 0 0 120 236 0 0
57 0 1 0 0 0 120 354 0 0
57 1 1 0 0 0 140 192 0 0
head(encoded_data[c(11:19)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
restecg.1 restecg.2 thalach exang oldpeak slope ca thal target
0 0 150 0 2.3 0 0 1 1
1 0 187 0 3.5 0 0 2 1
0 0 172 0 1.4 2 0 2 1
1 0 178 0 0.8 2 0 2 1
1 0 163 1 0.6 2 0 2 1
1 0 148 0 0.4 1 0 1 1

Next we will split the dataset randomly between a training set, consisting of 60% of the original dataset, a validation set for hyperparameter tuning, consisting of 20% of the original, and a test set for objective assessment of accuracy with the remaining 20%. We can achieve this by first splitting the dataset 60%-40%, and then we can randomly split the 40% subset 50%-50% so that we have three sets with sizes 60%-20%-20% of the original.

set.seed(101)
encoded_n <- nrow(encoded_data)
valTestIndex <- sample(1:encoded_n, size = round(0.4*encoded_n), replace = FALSE)
valIndex <- sample(1:round(0.4*encoded_n), size = round(0.5*round(0.4*encoded_n)), replace = FALSE)
train <- encoded_data[-valTestIndex ,]
valTest <- encoded_data[valTestIndex ,]
val <- valTest[valIndex ,]
test <- valTest[-valIndex ,]

Next we must scale the data and center it around the mean to allow the neural network to converge more quickly by standardizing the scales of each feature. In order to prevent data leakage, we create a scaling object and fit it to the training data before we apply the transformation to all three sets. The certain pre-processing methods used will subtract the mean from each feature and divide by the standard deviation. This method sets the mean of each column to 0 and reduces the range of the values while not being especially affected by outliers. We apply this only to the features which are composed of ratio data, not those with ordinal or categorical data.

preprocessor <- preProcess(train[, c(1, 7, 8, 13, 15)], method=c("center", "scale"))
normalized_train <- predict(preprocessor, train[, c(1, 7, 8, 13, 15)])
normalized_test <- predict(preprocessor, test[, c(1, 7, 8, 13, 15)])
normalized_val <- predict(preprocessor, val[, c(1, 7, 8, 13, 15)])
train[, c(1, 7, 8, 13, 15)] <- normalized_train
test[, c(1, 7, 8, 13, 15)] <- normalized_test
val[, c(1, 7, 8, 13, 15)] <- normalized_val

Now we can split test and validation sets into inputs vectors and labels.

test_X <- select(test, -target)
test_y <- select(test, target)
val_X <- select(val, -target)
val_y <- select(val, target)

Final Model Training

We will then retrain the model on the training and validation data combined for a model with 15 nodes in the hidden layer.

remove(model)
set.seed(101)
model <- neuralnet(target~., rbind(train, val), hidden = 15, linear.output = FALSE, err.fct = "ce")

We can now plot the model to see the structure of the neural network.

plot(model, rep = "best")

We see that the model has 19 nodes in the input layer, each connected by specific weights to every one of the 15 nodes in the hidden layer with an added bias, and the outputs of those 15 nodes feeding into the final output layer with a single node for classification.

Final Model Evaluation

Finally, we can evaluate the trained model on the test set.

predictions <- neuralnet::compute(model, test_X)
predictions <- ifelse(predictions$net.result > 0.5, 1, 0)
results <- data.frame(predictions = as.numeric(unlist(predictions[, 1])), true = as.numeric(unlist(test_y)))
performance <- evaluate(results, target_col = "true", prediction_cols = "predictions", type = "binomial")

We can now view a set of performance statistics for the model’s predictions on the test set.

performance[1:8] %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
Balanced Accuracy Accuracy F1 Sensitivity Specificity Pos Pred Value Neg Pred Value AUC
0.7272222 0.7213115 0.7462687 0.6944444 0.76 0.8064516 0.6333333 0.7272222

We see that the model, when trained on 60% of the original dataset with the optimized number of nodes in the hidden layer, is about 72% accurate on the test dataset with an AUC_ROC score of 0.73. A balanced accuracy score close of approximately 73%, which is close to the standard accuracy, indicates consistent accuracy on both the positive and negative class. Similar sensitivity and specificity scores also indicate consistent performance across both classes. We can also view a confusion matrix to specifically assess the distribution of predictions in relation to their true labels.

plot_confusion_matrix(performance$'Confusion Matrix'[[1]])

We confirm that there is consistent accuracy in the predictions for negative and positive classes, yet with a slight bias towards the positive class likely due to the slightly larger number of positive examples in the dataset.

Conclusion

In this sample of my best work, I demonstrated several useful skills I learned in the Introduction to Data Science course including loading and preprocessing data, analyzing patterns in data, training and tuning the hyperparameters of a machine learning model, displaying tables and graphs, and evaluating the performance of a machine learning model on a test set. Specifically, I used the Kaggle Heart Disease UCI Dataset to build and optimize a neural network to learn to classify a person as having or not having heart disease based on various health data. In doing so, I found a value of 15 nodes in the hidden layer to be optimal for the neural network’s performance and, following retraining, found the accuracy of the network to be ~72% with an AUC_ROC score of ~0.73 on the held-out test set. Thank you for the great semester Dr. Lanning and I look forward to learning more in my Data Analytics concentration!