Introduction

In the second group project, I used the “neuralnet” package implementation of feedforward neural networks to determine whether or not a person had a stroke given various categorical and numerical health information such as BMI, blood glucose levels, and the presence or lack of heart disease. For a sample of my best work, I will then explore the “neuralnet” feedforward neural network implementation to predict whether or not a person has heart disease—as indicated by more than 50% narrowing of the coronary arteries—using a simple neural network with one hidden layer. I will also perform a very basic hyperparameter search to optimize the number of nodes in the hidden layer in a small range.

Dataset

The dataset used will be the Kaggle Cleveland Heart Disease UCI Dataset, similar to the one used by the heart disease subgroup in the second group project. The dataset consists of 303 rows with 13 attributes of health data and a final target column indicating whether or not the person was diagnosed with heart disease.

Data Viewing and Wrangling

The data was first opened and viewed.

data <- read_csv("heart.csv")

head(data[c(1:9)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))

age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang
63	1	3	145	233	1	0	150	0
37	1	2	130	250	0	1	187	0
41	0	1	130	204	0	0	172	0
56	1	1	120	236	0	1	178	0
57	0	0	120	354	0	1	163	1
57	1	0	140	192	0	1	148	0

head(data[c(10:14)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))

oldpeak	slope	thal	target
2.3	0	1	1
3.5	0	2	1
1.4	2	2	1
0.8	2	2	1
0.6	2	2	1
0.4	1	1	1

We see that these columns correspond to age, sex, class of chest pain, resting blood pressure, serum cholesterol, resting electrocardiographic results, maximum heart rate achieved from tests, presence of exercise-induced chest pain, whether or not ST depression is indicated by the electrocardiogram following exercise, the integer slope of the ST segment of the electrocardiogram at peak exercise, and the number of major vessels which were marked by fluoroscopy.

We can then view verify the size of the dataset and determine the ratio of positive to negative cases.

nrow(data)

## [1] 303

num_cases = table(data$target)
num_cases

## 
##   0   1 
## 138 165

num_cases[2]/num_cases[1]

##        1 
## 1.195652

We see that there are indeed 303 rows in the dataset, and, fortunately, the ratio of positive to negative cases is 1.2 and therefore sufficiently close to 1, and so large biases towards either the negative or positive class should not be learned by the model.

Because, according to the dataset source, no NA values exist in this processed portion of the data, we do not need to attempt to remove or otherwise correct for rows with NA values. Instead, we can now one-hot encode the categorical columns of chest pain type and resting electrocardiogram results and view the dataset again before we split it into training, validation, and test sets.

data$cp <- factor(data$cp, levels = c(0, 1, 2, 3))
data$restecg <- factor(data$restecg, levels = c(0, 1, 2))
dummies <- dummyVars("~ .", data=data)
encoded_data <- data.frame(predict(dummies, newdata=data))

head(encoded_data[c(1:10)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))

age	sex	cp.0	cp.1	cp.2	cp.3	trestbps	chol	fbs	restecg.0
63	1	0	0	0	1	145	233	1	1
37	1	0	0	1	0	130	250	0	0
41	0	0	1	0	0	130	204	0	1
56	1	0	1	0	0	120	236	0	0
57	0	1	0	0	0	120	354	0	0
57	1	1	0	0	0	140	192	0	0

head(encoded_data[c(11:19)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))

restecg.1	thalach	exang	oldpeak	slope	thal	target
0	150	0	2.3	0	1	1
1	187	0	3.5	0	2	1
0	172	0	1.4	2	2	1
1	178	0	0.8	2	2	1
1	163	1	0.6	2	2	1
1	148	0	0.4	1	1	1

Next we will split the dataset randomly between a training set, consisting of 60% of the original dataset, a validation set for hyperparameter tuning, consisting of 20% of the original, and a test set for objective assessment of accuracy with the remaining 20%. We can achieve this by first splitting the dataset 60%-40%, and then we can randomly split the 40% subset 50%-50% so that we have three sets with sizes 60%-20%-20% of the original.

set.seed(101)
encoded_n <- nrow(encoded_data)
valTestIndex <- sample(1:encoded_n, size = round(0.4*encoded_n), replace = FALSE)
valIndex <- sample(1:round(0.4*encoded_n), size = round(0.5*round(0.4*encoded_n)), replace = FALSE)
train <- encoded_data[-valTestIndex ,]
valTest <- encoded_data[valTestIndex ,]
val <- valTest[valIndex ,]
test <- valTest[-valIndex ,]

Next we must scale the data and center it around the mean to allow the neural network to converge more quickly by standardizing the scales of each feature. In order to prevent data leakage, we create a scaling object and fit it to the training data before we apply the transformation to all three sets. The certain pre-processing methods used will subtract the mean from each feature and divide by the standard deviation. This method sets the mean of each column to 0 and reduces the range of the values while not being especially affected by outliers. We apply this only to the features which are composed of ratio data, not those with ordinal or categorical data.

preprocessor <- preProcess(train[, c(1, 7, 8, 13, 15)], method=c("center", "scale"))
normalized_train <- predict(preprocessor, train[, c(1, 7, 8, 13, 15)])
normalized_test <- predict(preprocessor, test[, c(1, 7, 8, 13, 15)])
normalized_val <- predict(preprocessor, val[, c(1, 7, 8, 13, 15)])
train[, c(1, 7, 8, 13, 15)] <- normalized_train
test[, c(1, 7, 8, 13, 15)] <- normalized_test
val[, c(1, 7, 8, 13, 15)] <- normalized_val

Now we can split test and validation sets into inputs vectors and labels.

test_X <- select(test, -target)
test_y <- select(test, target)
val_X <- select(val, -target)
val_y <- select(val, target)

Neural Network Building and Hyperparameter Search

We can now build the neural network and run a hyperparameter search. The network will have 19 nodes in the input layer, one for each feature, and we will optimize the number of nodes for the network between 10 and 20 inclusive with a hyperparameter search. A wider range of values should be searched for a proper hyperparameter search, but the slow speed of R compared to other languages limits this possibility. We will build ten different models, train them on the training set, and compare their performance on the validation set. The hyperparameters which result in the greatest performance will then be taken and used to train the final model on the combination of training and validation data, and this model will be evaluated on the test data. The metric used in the hyperparameter search will be the area under the receiver operating characteristic curve (AUC_ROC).

The general neural network structure will have the sigmoidal activation function after each non-input layer, which compresses the output of each node between 0 and 1 and allows the network to model nonlinear relationships. In a feedforward neural network, each node in one layer connects to every node in the next layer, multiplying each input by a learned weight and adding a learned bias. These learned values allow the model to extract complex relationships to produce accurate classifications for varying combinations and ranges of input data. The final output layer for this binary classification problem consists of a single node, with a positive classification being made for outputs greater than 0.5, or 50%, and a negative classification being made for outputs less than or equal to 0.5.

set.seed(101)
results <- c()
for (num_nodes in 10:20) {
  model <- neuralnet(target~., train, hidden = num_nodes, linear.output = FALSE, err.fct = "ce")
  predictions <- neuralnet::compute(model, val_X)
  predictions <- ifelse(predictions$net.result > 0.5, 1, 0)
  results <- c(results, auc_roc(as.numeric(unlist(predictions[, 1])), as.numeric(unlist(val_y))))
}

results <- data.frame(cbind(num_nodes = 10:20, AUC_ROC = results))
results$num_nodes <- as.integer(results$num_nodes)

We can now view the results of the hyperparameter search in table and graph form.

results %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))

num_nodes	AUC_ROC
10	0.6606335
11	0.6990950
12	0.6798643
13	0.7138009
14	0.6561086
15	0.7714932
16	0.7239819
17	0.7624434
18	0.6855204
19	0.7669683
20	0.6945701

ggplot(results) + geom_point(mapping = aes(x = num_nodes, y = AUC_ROC)) + geom_smooth(mapping = aes(x = num_nodes, y = AUC_ROC), se=F)

We see a clear peak at a value of 15 nodes, with a drop in AUC_ROC scores for models trained with more or less nodes.

Final Model Training

We will then retrain the model on the training and validation data combined for a model with 15 nodes in the hidden layer.

remove(model)
set.seed(101)
model <- neuralnet(target~., rbind(train, val), hidden = 15, linear.output = FALSE, err.fct = "ce")

We can now plot the model to see the structure of the neural network.

plot(model, rep = "best")

We see that the model has 19 nodes in the input layer, each connected by specific weights to every one of the 15 nodes in the hidden layer with an added bias, and the outputs of those 15 nodes feeding into the final output layer with a single node for classification.

Final Model Evaluation

Finally, we can evaluate the trained model on the test set.

predictions <- neuralnet::compute(model, test_X)
predictions <- ifelse(predictions$net.result > 0.5, 1, 0)
results <- data.frame(predictions = as.numeric(unlist(predictions[, 1])), true = as.numeric(unlist(test_y)))
performance <- evaluate(results, target_col = "true", prediction_cols = "predictions", type = "binomial")

We can now view a set of performance statistics for the model’s predictions on the test set.

performance[1:8] %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))

Balanced Accuracy	Accuracy	F1	Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	AUC
0.7272222	0.7213115	0.7462687	0.6944444	0.76	0.8064516	0.6333333	0.7272222

We see that the model, when trained on 60% of the original dataset with the optimized number of nodes in the hidden layer, is about 72% accurate on the test dataset with an AUC_ROC score of 0.73. A balanced accuracy score close of approximately 73%, which is close to the standard accuracy, indicates consistent accuracy on both the positive and negative class. Similar sensitivity and specificity scores also indicate consistent performance across both classes. We can also view a confusion matrix to specifically assess the distribution of predictions in relation to their true labels.

plot_confusion_matrix(performance$'Confusion Matrix'[[1]])

We confirm that there is consistent accuracy in the predictions for negative and positive classes, yet with a slight bias towards the positive class likely due to the slightly larger number of positive examples in the dataset.

Conclusion

In this sample of my best work, I demonstrated several useful skills I learned in the Introduction to Data Science course including loading and preprocessing data, analyzing patterns in data, training and tuning the hyperparameters of a machine learning model, displaying tables and graphs, and evaluating the performance of a machine learning model on a test set. Specifically, I used the Kaggle Heart Disease UCI Dataset to build and optimize a neural network to learn to classify a person as having or not having heart disease based on various health data. In doing so, I found a value of 15 nodes in the hidden layer to be optimal for the neural network’s performance and, following retraining, found the accuracy of the network to be ~72% with an AUC_ROC score of ~0.73 on the held-out test set. Thank you for the great semester Dr. Lanning and I look forward to learning more in my Data Analytics concentration!

Best Work

Aden Eagle

4/27/2021