The Haberman dataset records the survival of patients who underwent
surgery for breast cancer. It consists of 306 instances, each
representing one patient, with four attributes:
- age: Age of the patient at the time of operation.
- year: Year of the operation (year - 1900).
- nodes: Number of positive axillary nodes detected.
- survival: Survival status (1 = survived 5 years or more, 2 = died within 5 years).
The goal is to predict the survival status of patients based on the other three attributes.
Exploratory Data Analysis (EDA) is a preliminary examination of
the dataset to understand its structure and characteristics. In this
analysis:
- We checked the structure of the data using functions such as str() and summary().
- We observed that the survival attribute is categorical and is the target variable for classification.
- We reviewed basic summary statistics to understand the range and distribution of the numeric attributes (age, year, nodes).
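The checks listed above can be sketched as follows. This is a minimal illustration rather than the exact session from the analysis: the data frame `sample_rows` is a stand-in built from the six rows printed by head() later in this report, so the summaries describe those six rows only.

```r
# Stand-in data: the six rows shown by head(haberman) in this report.
sample_rows <- data.frame(
  age      = c(30, 30, 30, 31, 31, 33),
  year     = c(64, 62, 65, 59, 65, 58),
  nodes    = c(1, 3, 0, 2, 4, 10),
  survival = c(1, 1, 1, 1, 1, 1)
)

# Column types and a preview of the values.
str(sample_rows)

# Minimum, quartiles, mean, and maximum of each numeric attribute.
summary(sample_rows[, c("age", "year", "nodes")])

# Class counts for the target; on the full data this exposes class imbalance.
class_counts <- table(sample_rows$survival)
print(class_counts)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(sample_rows))
print(missing_per_column)
```

On the full dataset the same calls apply unchanged once `sample_rows` is replaced by the loaded data frame.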
Preprocessing steps were taken to prepare the data for the neural
network model:
- Normalization: The continuous variables (age, year, nodes) were min-max normalized to the range [0, 1]. Putting the inputs on a common scale generally speeds up training and keeps any single attribute from dominating the weight updates.
- Factor Conversion: The target variable survival was converted to a factor to facilitate classification.
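As a worked example of the normalization step, the sketch below scales a few ages taken from the head() output shown later in this report. It is a variant, not the code used in this analysis: it learns the minimum and maximum from one sample and reuses them for new values, which is a common way to keep test data out of the scaling step. The helper names `fit_minmax` and `apply_minmax` are hypothetical.

```r
# Learn the range of a reference sample (hypothetical helper).
fit_minmax <- function(x) {
  list(min = min(x), max = max(x))
}

# Scale values with a previously learned range (hypothetical helper).
apply_minmax <- function(x, range) {
  width <- range$max - range$min
  if (width == 0) {
    # A constant column carries no information; return 0 instead of dividing by zero.
    return(rep(0, length(x)))
  }
  (x - range$min) / width
}

# Ages from the six rows printed by head(haberman).
train_age <- c(30, 30, 30, 31, 31, 33)
age_range <- fit_minmax(train_age)

scaled_train <- apply_minmax(train_age, age_range)
print(scaled_train)

# A value outside the learned range lands outside [0, 1].
scaled_new <- apply_minmax(36, age_range)
print(scaled_new)

# Converting the 1/2 survival codes to a labelled factor.
survival <- factor(c(1, 1, 2), levels = c(1, 2), labels = c("survived", "died"))
print(levels(survival))
```

Values outside the learned range are not clipped here; whether to clip them is a design choice that depends on the model.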
A neural network model was selected for this classification task. The
architecture of the model was defined as follows:
- Input Layer: 3 neurons corresponding to the 3 input features (age, year, nodes).
- Hidden Layers: The design calls for two hidden layers, the first with 5 neurons and the second with 3 neurons, which corresponds to hidden = c(5, 3) in neuralnet. The call shown in the code below sets hidden = c(3), that is, a single hidden layer with 3 neurons.
- Output Layer: 1 neuron with a sigmoid activation function to predict the binary outcome (survival status).
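The layer structure described above can be made concrete with a small forward pass in base R. This is a sketch under stated assumptions, not the internals of neuralnet: the weights are random, the helpers `init_weights` and `forward` are hypothetical, and the layer sizes c(3, 5, 3, 1) follow the described design of 3 inputs, hidden layers of 5 and 3 neurons, and 1 output.

```r
# Logistic (sigmoid) activation.
sigmoid <- function(z) 1 / (1 + exp(-z))

# Build random weight matrices for consecutive layer sizes.
# Each matrix has one extra row that holds the bias terms.
init_weights <- function(sizes) {
  lapply(seq_len(length(sizes) - 1), function(i) {
    matrix(rnorm((sizes[i] + 1) * sizes[i + 1]), nrow = sizes[i] + 1)
  })
}

# Propagate a matrix of inputs (one row per patient) through every layer.
forward <- function(x, weights) {
  activation <- x
  for (w in weights) {
    # Prepend a column of ones so the first row of w acts as the bias.
    activation <- sigmoid(cbind(1, activation) %*% w)
  }
  activation
}

set.seed(1)
weights <- init_weights(c(3, 5, 3, 1))

# Two patients with already-normalized age, year, and nodes (made-up values).
inputs <- matrix(c(0.1, 0.5, 0.0,
                   0.9, 0.2, 0.4), nrow = 2, byrow = TRUE)
outputs <- forward(inputs, weights)

print(dim(outputs))            # one output value per patient
print(sum(lengths(weights)))   # total number of weights and biases
```

Passing c(3, 3, 1) to `init_weights` gives the single-hidden-layer variant instead; the `forward` function is unchanged.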
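Binary cross-entropy is the loss function suited to binary classification tasks; in neuralnet it is selected with err.fct = "ce". The call in the code below does not set err.fct, so it trains with the package default, the sum of squared errors ("sse").
The hidden layers use neuralnet's default activation function (logistic), and the output
layer uses the sigmoid function, because linear.output = FALSE applies the same activation to the output.
The model's performance was evaluated using accuracy as the metric:
- Accuracy: The model achieved an accuracy of 50% on the test dataset, meaning that half of the test instances were classified correctly. Because 225 of the 306 patients (about 73.5%) belong to class 1, always predicting class 1 would be expected to score well above 50%, so this figure suggests a problem in training or prediction rather than a usable classifier.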
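To put an accuracy figure in context, it helps to compare it with the majority-class baseline and to look at the confusion matrix. The sketch below uses made-up labels for ten patients, not results from this analysis; in the full Haberman data 225 of 306 patients are in class 1, so the baseline there is about 73.5%.

```r
# Hypothetical labels for ten test patients (1 = survived, 2 = died).
actual    <- factor(c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2), levels = c(1, 2))
predicted <- factor(c(1, 1, 1, 1, 1, 2, 2, 2, 1, 1), levels = c(1, 2))

# Accuracy: share of predictions that match the true label.
accuracy <- mean(predicted == actual)

# Majority-class baseline: accuracy of always predicting the most common class.
baseline <- max(table(actual)) / length(actual)

# Confusion matrix: rows are true classes, columns are predicted classes.
confusion <- table(actual = actual, predicted = predicted)
print(confusion)

cat("Accuracy:", accuracy, "Baseline:", baseline, "\n")
```

In this toy case the accuracy (0.6) falls below the baseline (0.7), which is the pattern to watch for on imbalanced data.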
library(neuralnet)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data"
haberman <- read.csv(url, header = FALSE, col.names = c("age", "year", "nodes", "survival"))
head(haberman)
## age year nodes survival
## 1 30 64 1 1
## 2 30 62 3 1
## 3 30 65 0 1
## 4 31 59 2 1
## 5 31 65 4 1
## 6 33 58 10 1
haberman$survival <- as.factor(haberman$survival)
# Min-max scaling: maps the smallest value to 0 and the largest to 1.
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
haberman_norm <- as.data.frame(lapply(haberman[, 1:3], normalize))
haberman_norm$survival <- haberman$survival
set.seed(123)
# 70/30 train/test split; floor() makes the sample size an explicit integer.
train_indices <- sample(seq_len(nrow(haberman_norm)), floor(0.7 * nrow(haberman_norm)))
train_data <- haberman_norm[train_indices, ]
test_data <- haberman_norm[-train_indices, ]
nn_formula <- survival ~ age + year + nodes
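nn_model <- neuralnet(
  formula = nn_formula,
  data = train_data,
  hidden = c(3),          # one hidden layer with 3 neurons
  linear.output = FALSE,  # apply the activation function to the output layer
  stepmax = 1e6           # upper limit on training steps
)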
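# Convert network outputs to class labels.
# The 0.5 threshold assumes a single output column that holds the
# probability of class "2"; check dim(output) if the accuracy looks off.
predict_neuralnet <- function(model, newdata) {
  output <- compute(model, newdata)$net.result
  predicted <- ifelse(output > 0.5, "2", "1")
  return(factor(predicted, levels = levels(haberman$survival)))
}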
predicted_classes <- predict_neuralnet(nn_model, test_data[, c("age", "year", "nodes")])
accuracy <- mean(predicted_classes == test_data$survival)
cat("Accuracy:", round(accuracy * 100, 2), "%\n")
## Accuracy: 50 %
plot(nn_model)
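An accuracy of exactly 50% can come from a shape mismatch rather than from the model itself. Recent versions of neuralnet appear to return one output column per factor level when the response is a factor. If that is the case here, thresholding the whole output matrix yields twice as many predictions as test rows, and R silently recycles the true labels in the comparison. This is an assumption about the installed version, not something the output above confirms. The sketch below shows a hypothetical helper, `outputs_to_classes`, that handles both output shapes; it assumes the columns follow the order of `class_levels` and uses made-up output values.

```r
# Turn a matrix of network outputs into class labels.
# `outputs_to_classes` is a hypothetical helper, not part of neuralnet.
outputs_to_classes <- function(output, class_levels) {
  output <- as.matrix(output)
  if (ncol(output) == 1) {
    # Single output: treat it as the probability of the second level.
    index <- ifelse(output[, 1] > 0.5, 2, 1)
  } else {
    # One output per class: pick the column with the largest value.
    index <- max.col(output, ties.method = "first")
  }
  factor(class_levels[index], levels = class_levels)
}

class_levels <- c("1", "2")

# Made-up outputs for three patients, in the two shapes discussed above.
two_columns <- matrix(c(0.8, 0.2,
                        0.3, 0.7,
                        0.6, 0.4), ncol = 2, byrow = TRUE)
one_column <- matrix(c(0.2, 0.7, 0.4), ncol = 1)

from_two <- outputs_to_classes(two_columns, class_levels)
from_one <- outputs_to_classes(one_column, class_levels)
print(from_two)
print(from_one)

# A length check like this catches a shape mismatch before accuracy is computed.
stopifnot(length(from_two) == nrow(two_columns))
```

Checking that the number of predictions equals the number of test rows before calling mean() is a cheap guard against silent recycling.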