Haberman’s Survival Data:
The Haberman dataset contains information about patients who had undergone breast cancer surgery. It includes the following columns:
Age: The age of the patient at the time of operation (numeric). Year: The year of the operation (numeric). Nodes: The number of axillary nodes detected (numeric). Survival_Status: The survival status of the patient. This is a categorical variable where: 1 indicates that the patient survived 5 years or longer. 0 indicates that the patient did not survive 5 years.
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.3.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:neuralnet':
##
## compute
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(neuralnet)
# Load and preprocess the data
haberman <- read.csv("C:/Users/Muraa/Desktop/haberman.csv", header = FALSE)
colnames(haberman) <- c("Age", "Year", "Nodes", "Survival_Status")
haberman <- haberman %>% mutate(Age = as.numeric(Age), Op_year = as.numeric(Op_year), Axil_nodes = as.numeric(Axil_nodes), Surv_status = as.factor(Surv_status))
sum(is.na(haberman)) # There are NA values introduced by coercion, likely from the Surv_status column
haberman <- haberman[complete.cases(haberman), ]
summary(haberman)
ggplot(haberman, aes(x = Age)) + geom_histogram(binwidth = 5, fill = “blue”, color = “white”) + labs(title = “Age Distribution”, x = “Age”, y = “Frequency”)
ggplot(haberman, aes(x = Axil_nodes)) + geom_histogram(binwidth = 5, fill = “red”, color = “white”) + labs(title = “Axillary Nodes Distribution”, x = “Number of Nodes”, y = “Frequency”)
ggplot(haberman, aes(x = as.factor(Op_year))) + geom_bar(fill = “green”, color = “white”) + labs(title = “Operation Year Distribution”, x = “Year of Operation”, y = “Frequency”)
ggplot(haberman, aes(x = Surv_status)) + geom_bar(fill = “purple”, color = “white”) + labs(title = “Survival Status Distribution”, x = “Survival Status”, y = “Frequency”)
pairs(haberman[,1:3], col = ifelse(haberman$Surv_status == 1, “blue”, “red”), main = “Pairwise Scatter Plots”, pch = 19)
## 3) Preprocessing
``` r
# Normalization function
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
# Normalize the numeric columns
haberman_norm <- as.data.frame(lapply(haberman[,1:3], normalize))
haberman_norm$Survival_Status <- as.numeric(haberman$Survival_Status) - 1 # Convert factor to numeric (0 and 1)
# Check for NA values after normalization
sum(is.na(haberman_norm))
## [1] 0
# Handle NA values by removing rows with NA
haberman_norm <- haberman_norm[complete.cases(haberman_norm), ]
# Load the neural network library
library(neuralnet)
# Split the data
set.seed(123)
index <- sample(1:nrow(haberman_norm), round(0.80 * nrow(haberman_norm)))
trainset <- haberman_norm[index,]
testset <- haberman_norm[-index,]
# Train the neural network
nn <- neuralnet(Survival_Status ~ Age + Year + Nodes,
data = trainset, hidden = 4,
linear.output = FALSE,
act.fct = "tanh",
err.fct = "ce",
likelihood = TRUE)
## Warning in log(x): NaNs produced
## Warning: 'err.fct' does not fit 'data' or 'act.fct'
# Save the neural network plot as PDF
plot(nn)
# Include the Png image
knitr::include_graphics("neuralnet.png")
# Predict on test data using neuralnet::compute
predicted <- neuralnet::compute(nn, testset[,1:3])$net.result
# Convert predictions to binary outcomes (0 or 1)
predicted <- ifelse(predicted > 0.5, 1, 0)
# Convert predicted to factor for comparison with Survival_Status
predicted <- factor(predicted, levels = c(0, 1))
# Confusion matrix
confusion_matrix <- table(predicted, testset$Survival_Status)
print(confusion_matrix)
##
## predicted 0 1
## 0 43 18
## 1 0 0
# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy, 2)))
## [1] "Accuracy: 0.7"
# Calculate MSE
mse <- mean((as.numeric(predicted) - as.numeric(testset$Survival_Status))^2)
print(paste("MSE:", round(mse, 2)))
## [1] "MSE: 0.7"
In this analysis, we applied a neural network model to the Haberman dataset to predict the survival status of patients based on their age, year of operation, and number of axillary nodes detected.
Performance Metrics:
Accuracy: The model achieved an accuracy of 70%. This indicates that the model correctly classified 70% of the cases in the test set. Mean Squared Error (MSE): The MSE of the model was 0.7, which reflects the average squared difference between the predicted and actual values. A lower MSE would indicate better model performance, but in this case, the MSE is relatively high.