Introduction

The purpose of this analysis is to demonstrate how deep learning can lead to overfitting and a reduction in the accuracy of a neural network. The data set was created and made open by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian of the General Surgery and Computer Science Departments at the University of Wisconsin (special thank you to them). The data was originally used for the following paper:
W. N. Street, W. H. Wolberg, and O. L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.

At the time, the estimated accuracy was 97.5%, obtained using repeated 10-fold cross-validation. Since then, many researchers have used this data to build a variety of different models. A full list can be found here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Methodology

Three different neural networks are created and tested using 10-fold cross-validation. The first has one hidden layer, the second has two, and the third has three. All hidden layers have 30 neurons (equal to the number of input nodes). All 30 original variables are included as inputs; the output node is binary, indicating 1 for malignant and 0 for benign. The accuracy of each fold is stored and averaged to find the overall accuracy of each model. The 569 samples are randomly shuffled to ensure random sampling for the cross-validation; a sketch of the fold assignment is shown below.
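
As a minimal sketch (assuming the 569 shuffled rows described above), this is how the fold labels used later in the Analysis section are assigned:

# Sketch of the 10-fold assignment: 569 shuffled rows are cut into 10
# roughly equal folds; fold i is held out as test data in iteration i.
folds <- cut(seq(569), breaks = 10, labels = FALSE)
table(folds)  # each fold contains roughly 57 rows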

Analysis

Load Data

This data set is originally from the University of Wisconsin CS department at ftp ftp.cs.wisc.edu > cd math-prog/cpo-dataset/machine-learn/WDBC/, but I found it in the University of California Irvine Machine Learning Repository at the link below. The first column is a unique ID; the second column is a binary variable, 'M' for malignant and 'B' for benign. Then there are 30 independent variables, all of which are different measurements of cell nucleus size and shape. Variable descriptions can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names.

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
wdbc <- read.csv(url, header = FALSE)

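
A quick sanity check (my addition, not part of the original script) confirms the expected shape: 569 rows and 32 columns, with the diagnosis still coded 'B'/'M' in column 2 at this point.

# Sanity check of the download (not in the original script)
dim(wdbc)         # expect 569 rows, 32 columns
table(wdbc[[2]])  # counts of 'B' (benign) and 'M' (malignant)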

Clean and prep data

# Dependent variable Binary 0 = Benign, 1 = Malignant
colnames(wdbc)[2] <- "Malignant.Benign"
wdbc$Malignant.Benign <- as.numeric(wdbc$Malignant.Benign == 'M')

# Randomly Shuffle Data
wdbc <- wdbc[sample(nrow(wdbc)),]
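# Note (my addition): no RNG seed is set, so the shuffle (and therefore the
# folds) differs between runs; calling set.seed() before this line would make
# the cross-validation reproducible.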

# Create k equally sized folds
k <- 10
folds <- cut(seq(nrow(wdbc)), breaks = k, labels = FALSE)
accuracy1 <- rep(NA, k) # Stores accuracy of each fold for 1 hidden layer(s)
accuracy2 <- rep(NA, k) # Stores accuracy of each fold for 2 hidden layer(s)
accuracy3 <- rep(NA, k) # Stores accuracy of each fold for 3 hidden layer(s)

# Create formula input for neuralnet()
n <- paste(names(wdbc)[3:32], collapse = ' + ')
f <- as.formula(paste("Malignant.Benign ~", n))
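
Because the data file has no header row, read.csv assigned default names, so the 30 predictors are V3 through V32; the formula spells them out explicitly because neuralnet() historically did not accept the y ~ . shorthand. A quick check (my addition):

f  # should print: Malignant.Benign ~ V3 + V4 + ... + V32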

Create and test all models

library(neuralnet)
# Loop through, Create, Test k folds
for (i in seq(k))
{
  # Split train and test data
  test_indexes <- which(folds == i, arr.ind = TRUE)
  test_data <- wdbc[test_indexes,-c(1,2)]
  train_data <- wdbc[-test_indexes,]
  
  # Correct output for test data
  actual <- wdbc[test_indexes,2]
  
  # Create models with train data. threshold is the stopping criterion on the
  # partial derivatives of the error function; linear.output = FALSE applies
  # the logistic activation to the output node, as appropriate for classification.
  nn1 <- neuralnet(f, train_data, hidden = c(30), linear.output = FALSE, threshold = 0.000001)
  nn2 <- neuralnet(f, train_data, hidden = c(30,30), linear.output = FALSE, threshold = 0.000001)
  nn3 <- neuralnet(f, train_data, hidden = c(30,30,30), linear.output = FALSE, threshold = 0.000001)

  # Run test data through neural networks
  results1 <- compute(nn1,test_data)
  results2 <- compute(nn2,test_data)  
  results3 <- compute(nn3,test_data)  
  
  # Get estimates from the test results
  estimate1 <- round(results1$net.result)
  estimate2 <- round(results2$net.result)
  estimate3 <- round(results3$net.result) 
  
  # Calculate accuracies from estimates
  accuracy1[i] <- mean(estimate1 == actual)  
  accuracy2[i] <- mean(estimate2 == actual)
  accuracy3[i] <- mean(estimate3 == actual)
}
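
Before averaging, it is worth glancing at the per-fold accuracies themselves; a one-liner like the following (my addition, not in the original) makes the fold-to-fold variation visible:

# Per-fold accuracies, one row per model (my addition)
round(rbind(accuracy1, accuracy2, accuracy3), 4)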

Conclusion

I cannot see the exact numbers since I am writing this document in R Markdown before the code is run, but we should see a decrease in accuracy from model 1 to model 2, and from model 2 to model 3. There is some randomness in how the neural network algorithm initializes its parameters, but the results should show a decline in accuracy as the number of hidden layers in the neural network is increased. Presumably, this is due to overfitting. Deep learning is no doubt much more effective for image recognition and natural language processing, but for less complex pattern recognition tasks, a simpler model will sometimes produce better results than the more advanced method.

cat('The overall accuracy for model one is: ', mean(accuracy1), '\n',
    'The overall accuracy for model two is: ', mean(accuracy2), '\n',
    'The overall accuracy for model three is: ', mean(accuracy3), '\n')
## The overall accuracy for model one is:  0.9508458647 
##  The overall accuracy for model two is:  0.9296679198 
##  The overall accuracy for model three is:  0.9157581454
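
Since all three models are scored on the same ten folds, a paired t-test (my addition, not part of the original analysis) is one quick way to check whether the drop in accuracy is larger than the fold-to-fold noise:

# Paired comparisons across the ten shared folds (illustrative addition)
t.test(accuracy1, accuracy2, paired = TRUE)
t.test(accuracy2, accuracy3, paired = TRUE)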

All models are wrong, but some are useful. - George E.P. Box