1) Small Writeup on the Dataset

The Haberman’s Survival Dataset contains data from a study conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital. The study focused on the survival of patients who had undergone surgery for breast cancer. The dataset includes 306 instances with 3 features and 1 binary outcome variable indicating whether the patient survived 5 years or more (1) or died within 5 years (2) after the surgery.

2) Exploratory Data Analysis (EDA)

Load necessary libraries

options(warn = -1)
# Load required libraries
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(neuralnet)
## 
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
## 
##     compute
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Load the dataset
haberman <- read.csv("C:/Users/raksh/Downloads/haberman.csv", header = FALSE, 
                     col.names = c("Age", "Op_year", "Axil_nodes", "Surv_status"))

# Convert the relevant columns to numeric
haberman <- haberman %>%
  mutate(Age = as.numeric(Age),
         Op_year = as.numeric(Op_year),
         Axil_nodes = as.numeric(Axil_nodes),
         Surv_status = as.factor(Surv_status))

# Check for NA values
sum(is.na(haberman))  # There are NA values introduced by coercion, likely from the Surv_status column
## [1] 3
# Remove rows with NA values
haberman <- haberman[complete.cases(haberman), ]

# Summary statistics after removing NA values
summary(haberman)
##       Age           Op_year        Axil_nodes     Surv_status 
##  Min.   :30.00   Min.   :58.00   Min.   : 0.000   1     :225  
##  1st Qu.:44.00   1st Qu.:60.00   1st Qu.: 0.000   2     : 80  
##  Median :52.00   Median :63.00   Median : 1.000   status:  0  
##  Mean   :52.36   Mean   :62.87   Mean   : 4.033               
##  3rd Qu.:60.00   3rd Qu.:66.00   3rd Qu.: 4.000               
##  Max.   :78.00   Max.   :69.00   Max.   :52.000
# Distribution of Age (continuous)
ggplot(haberman, aes(x = Age)) + 
  geom_histogram(binwidth = 5, fill = "blue", color = "white") + 
  labs(title = "Age Distribution", x = "Age", y = "Frequency")

# Distribution of Axillary Nodes (continuous)
ggplot(haberman, aes(x = Axil_nodes)) + 
  geom_histogram(binwidth = 5, fill = "red", color = "white") + 
  labs(title = "Axillary Nodes Distribution", x = "Number of Nodes", y = "Frequency")

# Distribution of Operation Year (discrete, so we use geom_bar)
ggplot(haberman, aes(x = as.factor(Op_year))) + 
  geom_bar(fill = "green", color = "white") + 
  labs(title = "Operation Year Distribution", x = "Year of Operation", y = "Frequency")

# Survival status distribution (discrete, use geom_bar)
ggplot(haberman, aes(x = Surv_status)) + 
  geom_bar(fill = "purple", color = "white") + 
  labs(title = "Survival Status Distribution", x = "Survival Status", y = "Frequency")

# Pairwise scatter plots
pairs(haberman[,1:3], col = ifelse(haberman$Surv_status == 1, "blue", "red"), 
      main = "Pairwise Scatter Plots", pch = 19)

3) Preprocessing

  1. Variable Selection The dataset contains three features:

Age: Continuous variable. Year of operation: Continuous variable. Number of positive axillary nodes: Continuous variable. The target variable is:

Survival status (Surv_status): Binary variable (1 = survived 5 years or more, 2 = died within 5 years). b) Data Encoding No encoding is needed since the target variable is already binary.

  1. Normalization Normalize the features to a 0-1 range for better performance of the neural network.
# Normalization function
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

# Normalize the numeric columns
haberman_norm <- as.data.frame(lapply(haberman[,1:3], normalize))
haberman_norm$Surv_status <- as.numeric(haberman$Surv_status) - 1  # Convert factor to numeric (0 and 1)

# Check for NA values after normalization
sum(is.na(haberman_norm))
## [1] 0
# Handle NA values by removing rows with NA
haberman_norm <- haberman_norm[complete.cases(haberman_norm), ]

4) Model Selection

  1. Architecture We will create a simple feedforward neural network with one hidden layer.

  2. Loss Function Use binary cross-entropy as the loss function since this is a binary classification problem.

  3. Hyperparameters Number of hidden neurons: Experiment with 3-5 neurons. Learning rate: 0.01 Epochs: 100 Activation function: Sigmoid for hidden layer, Softmax for output.

# Load the neural network library
library(neuralnet)

# Split the data
set.seed(123)
index <- sample(1:nrow(haberman_norm), round(0.80 * nrow(haberman_norm)))
trainset <- haberman_norm[index,]
testset <- haberman_norm[-index,]

# Train the neural network
nn <- neuralnet(Surv_status ~ Age + Op_year + Axil_nodes, 
                data = trainset, hidden = 3, 
                linear.output = FALSE, 
                act.fct = "tanh", 
                err.fct = "ce", 
                likelihood = TRUE)

# Save the neural network plot as PDF
plot(nn)

# Include the PDF image
knitr::include_graphics("neuralnet_plot.png")

5) Performance Evaluation

# Predict on test data
predicted <- compute(nn, testset[,1:3])$net.result
predicted <- ifelse(predicted > 0.5, 1, 0)

# Convert predicted to factor for comparison with Surv_status
predicted <- factor(predicted, levels = c(0, 1))

# Confusion matrix
confusion_matrix <- table(predicted, testset$Surv_status)
print(confusion_matrix)
##          
## predicted  0  1
##         0 43 18
##         1  0  0
# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy, 2)))
## [1] "Accuracy: 0.7"
# Calculate MSE
mse <- mean((as.numeric(predicted) - 1 - testset$Surv_status)^2)
print(paste("MSE:", round(mse, 2)))
## [1] "MSE: 0.3"

6) Conclusion

The neural network model built to predict the survival status of patients using the Haberman’s Survival Dataset achieved an accuracy of 70% and a mean squared error (MSE) of 0.30 on the test set.

Accuracy: The accuracy of 70% indicates that the model correctly predicted the survival status for 70% of the test instances. This level of accuracy suggests that the model has some predictive power but may still have room for improvement, particularly in terms of distinguishing between the two classes.

Mean Squared Error (MSE): The MSE of 0.30 shows that the model’s predictions deviate from the actual survival status labels by an average squared difference of 0.30. While this is a relatively low error, it suggests that the model does not perfectly capture the relationship between the input features (Age, Operation Year, and Axillary Nodes) and the survival outcome.