In the second group project, I used the “neuralnet” package implementation of feedforward neural networks to determine whether or not a person had a stroke given various categorical and numerical health information such as BMI, blood glucose levels, and the presence or lack of heart disease. For a sample of my best work, I will then explore the “neuralnet” feedforward neural network implementation to predict whether or not a person has heart disease—as indicated by more than 50% narrowing of the coronary arteries—using a simple neural network with one hidden layer. I will also perform a very basic hyperparameter search to optimize the number of nodes in the hidden layer in a small range.
The dataset used will be the Kaggle Cleveland Heart Disease UCI Dataset, similar to the one used by the heart disease subgroup in the second group project. The dataset consists of 303 rows with 13 attributes of health data and a final target column indicating whether or not the person was diagnosed with heart disease.
The data was first opened and viewed.
data <- read_csv("heart.csv")
head(data[c(1:9)]) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang |
|---|---|---|---|---|---|---|---|---|
| 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 |
| 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 |
| 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 |
| 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 |
| 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 |
| 57 | 1 | 0 | 140 | 192 | 0 | 1 | 148 | 0 |
head(data[c(10:14)]) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| oldpeak | slope | ca | thal | target |
|---|---|---|---|---|
| 2.3 | 0 | 0 | 1 | 1 |
| 3.5 | 0 | 0 | 2 | 1 |
| 1.4 | 2 | 0 | 2 | 1 |
| 0.8 | 2 | 0 | 2 | 1 |
| 0.6 | 2 | 0 | 2 | 1 |
| 0.4 | 1 | 0 | 1 | 1 |
We see that these columns correspond to age, sex, class of chest pain, resting blood pressure, serum cholesterol, resting electrocardiographic results, maximum heart rate achieved from tests, presence of exercise-induced chest pain, whether or not ST depression is indicated by the electrocardiogram following exercise, the integer slope of the ST segment of the electrocardiogram at peak exercise, and the number of major vessels which were marked by fluoroscopy.
We can then view verify the size of the dataset and determine the ratio of positive to negative cases.
nrow(data)
## [1] 303
num_cases = table(data$target)
num_cases
##
## 0 1
## 138 165
num_cases[2]/num_cases[1]
## 1
## 1.195652
We see that there are indeed 303 rows in the dataset, and, fortunately, the ratio of positive to negative cases is 1.2 and therefore sufficiently close to 1, and so large biases towards either the negative or positive class should not be learned by the model.
Because, according to the dataset source, no NA values exist in this processed portion of the data, we do not need to attempt to remove or otherwise correct for rows with NA values. Instead, we can now one-hot encode the categorical columns of chest pain type and resting electrocardiogram results and view the dataset again before we split it into training, validation, and test sets.
data$cp <- factor(data$cp, levels = c(0, 1, 2, 3))
data$restecg <- factor(data$restecg, levels = c(0, 1, 2))
dummies <- dummyVars("~ .", data=data)
encoded_data <- data.frame(predict(dummies, newdata=data))
head(encoded_data[c(1:10)]) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| age | sex | cp.0 | cp.1 | cp.2 | cp.3 | trestbps | chol | fbs | restecg.0 |
|---|---|---|---|---|---|---|---|---|---|
| 63 | 1 | 0 | 0 | 0 | 1 | 145 | 233 | 1 | 1 |
| 37 | 1 | 0 | 0 | 1 | 0 | 130 | 250 | 0 | 0 |
| 41 | 0 | 0 | 1 | 0 | 0 | 130 | 204 | 0 | 1 |
| 56 | 1 | 0 | 1 | 0 | 0 | 120 | 236 | 0 | 0 |
| 57 | 0 | 1 | 0 | 0 | 0 | 120 | 354 | 0 | 0 |
| 57 | 1 | 1 | 0 | 0 | 0 | 140 | 192 | 0 | 0 |
head(encoded_data[c(11:19)]) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| restecg.1 | restecg.2 | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 0 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 1 | 0 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 1 | 0 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| 1 | 0 | 148 | 0 | 0.4 | 1 | 0 | 1 | 1 |
Next we will split the dataset randomly between a training set, consisting of 60% of the original dataset, a validation set for hyperparameter tuning, consisting of 20% of the original, and a test set for objective assessment of accuracy with the remaining 20%. We can achieve this by first splitting the dataset 60%-40%, and then we can randomly split the 40% subset 50%-50% so that we have three sets with sizes 60%-20%-20% of the original.
set.seed(101)
encoded_n <- nrow(encoded_data)
valTestIndex <- sample(1:encoded_n, size = round(0.4*encoded_n), replace = FALSE)
valIndex <- sample(1:round(0.4*encoded_n), size = round(0.5*round(0.4*encoded_n)), replace = FALSE)
train <- encoded_data[-valTestIndex ,]
valTest <- encoded_data[valTestIndex ,]
val <- valTest[valIndex ,]
test <- valTest[-valIndex ,]
Next we must scale the data and center it around the mean to allow the neural network to converge more quickly by standardizing the scales of each feature. In order to prevent data leakage, we create a scaling object and fit it to the training data before we apply the transformation to all three sets. The certain pre-processing methods used will subtract the mean from each feature and divide by the standard deviation. This method sets the mean of each column to 0 and reduces the range of the values while not being especially affected by outliers. We apply this only to the features which are composed of ratio data, not those with ordinal or categorical data.
preprocessor <- preProcess(train[, c(1, 7, 8, 13, 15)], method=c("center", "scale"))
normalized_train <- predict(preprocessor, train[, c(1, 7, 8, 13, 15)])
normalized_test <- predict(preprocessor, test[, c(1, 7, 8, 13, 15)])
normalized_val <- predict(preprocessor, val[, c(1, 7, 8, 13, 15)])
train[, c(1, 7, 8, 13, 15)] <- normalized_train
test[, c(1, 7, 8, 13, 15)] <- normalized_test
val[, c(1, 7, 8, 13, 15)] <- normalized_val
Now we can split test and validation sets into inputs vectors and labels.
test_X <- select(test, -target)
test_y <- select(test, target)
val_X <- select(val, -target)
val_y <- select(val, target)
We can now build the neural network and run a hyperparameter search. The network will have 19 nodes in the input layer, one for each feature, and we will optimize the number of nodes for the network between 10 and 20 inclusive with a hyperparameter search. A wider range of values should be searched for a proper hyperparameter search, but the slow speed of R compared to other languages limits this possibility. We will build ten different models, train them on the training set, and compare their performance on the validation set. The hyperparameters which result in the greatest performance will then be taken and used to train the final model on the combination of training and validation data, and this model will be evaluated on the test data. The metric used in the hyperparameter search will be the area under the receiver operating characteristic curve (AUC_ROC).
The general neural network structure will have the sigmoidal activation function after each non-input layer, which compresses the output of each node between 0 and 1 and allows the network to model nonlinear relationships. In a feedforward neural network, each node in one layer connects to every node in the next layer, multiplying each input by a learned weight and adding a learned bias. These learned values allow the model to extract complex relationships to produce accurate classifications for varying combinations and ranges of input data. The final output layer for this binary classification problem consists of a single node, with a positive classification being made for outputs greater than 0.5, or 50%, and a negative classification being made for outputs less than or equal to 0.5.
set.seed(101)
results <- c()
for (num_nodes in 10:20) {
model <- neuralnet(target~., train, hidden = num_nodes, linear.output = FALSE, err.fct = "ce")
predictions <- neuralnet::compute(model, val_X)
predictions <- ifelse(predictions$net.result > 0.5, 1, 0)
results <- c(results, auc_roc(as.numeric(unlist(predictions[, 1])), as.numeric(unlist(val_y))))
}
results <- data.frame(cbind(num_nodes = 10:20, AUC_ROC = results))
results$num_nodes <- as.integer(results$num_nodes)
We can now view the results of the hyperparameter search in table and graph form.
results %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| num_nodes | AUC_ROC |
|---|---|
| 10 | 0.6606335 |
| 11 | 0.6990950 |
| 12 | 0.6798643 |
| 13 | 0.7138009 |
| 14 | 0.6561086 |
| 15 | 0.7714932 |
| 16 | 0.7239819 |
| 17 | 0.7624434 |
| 18 | 0.6855204 |
| 19 | 0.7669683 |
| 20 | 0.6945701 |
ggplot(results) + geom_point(mapping = aes(x = num_nodes, y = AUC_ROC)) + geom_smooth(mapping = aes(x = num_nodes, y = AUC_ROC), se=F)
We see a clear peak at a value of 15 nodes, with a drop in AUC_ROC scores for models trained with more or less nodes.
We will then retrain the model on the training and validation data combined for a model with 15 nodes in the hidden layer.
remove(model)
set.seed(101)
model <- neuralnet(target~., rbind(train, val), hidden = 15, linear.output = FALSE, err.fct = "ce")
We can now plot the model to see the structure of the neural network.
plot(model, rep = "best")
We see that the model has 19 nodes in the input layer, each connected by specific weights to every one of the 15 nodes in the hidden layer with an added bias, and the outputs of those 15 nodes feeding into the final output layer with a single node for classification.
Finally, we can evaluate the trained model on the test set.
predictions <- neuralnet::compute(model, test_X)
predictions <- ifelse(predictions$net.result > 0.5, 1, 0)
results <- data.frame(predictions = as.numeric(unlist(predictions[, 1])), true = as.numeric(unlist(test_y)))
performance <- evaluate(results, target_col = "true", prediction_cols = "predictions", type = "binomial")
We can now view a set of performance statistics for the model’s predictions on the test set.
performance[1:8] %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| Balanced Accuracy | Accuracy | F1 | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | AUC |
|---|---|---|---|---|---|---|---|
| 0.7272222 | 0.7213115 | 0.7462687 | 0.6944444 | 0.76 | 0.8064516 | 0.6333333 | 0.7272222 |
We see that the model, when trained on 60% of the original dataset with the optimized number of nodes in the hidden layer, is about 72% accurate on the test dataset with an AUC_ROC score of 0.73. A balanced accuracy score close of approximately 73%, which is close to the standard accuracy, indicates consistent accuracy on both the positive and negative class. Similar sensitivity and specificity scores also indicate consistent performance across both classes. We can also view a confusion matrix to specifically assess the distribution of predictions in relation to their true labels.
plot_confusion_matrix(performance$'Confusion Matrix'[[1]])
We confirm that there is consistent accuracy in the predictions for negative and positive classes, yet with a slight bias towards the positive class likely due to the slightly larger number of positive examples in the dataset.
In this sample of my best work, I demonstrated several useful skills I learned in the Introduction to Data Science course including loading and preprocessing data, analyzing patterns in data, training and tuning the hyperparameters of a machine learning model, displaying tables and graphs, and evaluating the performance of a machine learning model on a test set. Specifically, I used the Kaggle Heart Disease UCI Dataset to build and optimize a neural network to learn to classify a person as having or not having heart disease based on various health data. In doing so, I found a value of 15 nodes in the hidden layer to be optimal for the neural network’s performance and, following retraining, found the accuracy of the network to be ~72% with an AUC_ROC score of ~0.73 on the held-out test set. Thank you for the great semester Dr. Lanning and I look forward to learning more in my Data Analytics concentration!