1) Small Writeup on the Dataset

Haberman’s Survival Data:

The Haberman dataset contains information about patients who had undergone breast cancer surgery. It includes the following columns:

Age: The age of the patient at the time of operation (numeric). Year: The year of the operation (numeric). Nodes: The number of axillary nodes detected (numeric). Survival_Status: The survival status of the patient. This is a categorical variable where: 1 indicates that the patient survived 5 years or longer. 0 indicates that the patient did not survive 5 years.

library(neuralnet)

## Warning: package 'neuralnet' was built under R version 4.3.3

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.3

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:neuralnet':
## 
##     compute

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(neuralnet)

# Load and preprocess the data
haberman <- read.csv("C:/Users/Muraa/Desktop/haberman.csv", header = FALSE)
colnames(haberman) <- c("Age", "Year", "Nodes", "Survival_Status")

Convert the relevant columns to numeric

haberman <- haberman %>% mutate(Age = as.numeric(Age), Op_year = as.numeric(Op_year), Axil_nodes = as.numeric(Axil_nodes), Surv_status = as.factor(Surv_status))

Check for NA values

sum(is.na(haberman)) # There are NA values introduced by coercion, likely from the Surv_status column

Remove rows with NA values

haberman <- haberman[complete.cases(haberman), ]

Summary statistics after removing NA values

summary(haberman)

Distribution of Age (continuous)

ggplot(haberman, aes(x = Age)) + geom_histogram(binwidth = 5, fill = “blue”, color = “white”) + labs(title = “Age Distribution”, x = “Age”, y = “Frequency”)

Distribution of Axillary Nodes (continuous)

ggplot(haberman, aes(x = Axil_nodes)) + geom_histogram(binwidth = 5, fill = “red”, color = “white”) + labs(title = “Axillary Nodes Distribution”, x = “Number of Nodes”, y = “Frequency”)

Distribution of Operation Year (discrete, so we use geom_bar)

ggplot(haberman, aes(x = as.factor(Op_year))) + geom_bar(fill = “green”, color = “white”) + labs(title = “Operation Year Distribution”, x = “Year of Operation”, y = “Frequency”)

Survival status distribution (discrete, use geom_bar)

ggplot(haberman, aes(x = Surv_status)) + geom_bar(fill = “purple”, color = “white”) + labs(title = “Survival Status Distribution”, x = “Survival Status”, y = “Frequency”)

Pairwise scatter plots

pairs(haberman[,1:3], col = ifelse(haberman$Surv_status == 1, “blue”, “red”), main = “Pairwise Scatter Plots”, pch = 19)


## 3) Preprocessing


``` r
# Normalization function
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

# Normalize the numeric columns
haberman_norm <- as.data.frame(lapply(haberman[,1:3], normalize))
haberman_norm$Survival_Status <- as.numeric(haberman$Survival_Status) - 1  # Convert factor to numeric (0 and 1)

# Check for NA values after normalization
sum(is.na(haberman_norm))

## [1] 0

# Handle NA values by removing rows with NA
haberman_norm <- haberman_norm[complete.cases(haberman_norm), ]

4) Model Selection

# Load the neural network library
library(neuralnet)

# Split the data
set.seed(123)
index <- sample(1:nrow(haberman_norm), round(0.80 * nrow(haberman_norm)))
trainset <- haberman_norm[index,]
testset <- haberman_norm[-index,]

# Train the neural network
nn <- neuralnet(Survival_Status ~ Age + Year + Nodes, 
                data = trainset, hidden = 4, 
                linear.output = FALSE, 
                act.fct = "tanh", 
                err.fct = "ce", 
                likelihood = TRUE)

## Warning in log(x): NaNs produced

## Warning: 'err.fct' does not fit 'data' or 'act.fct'

# Save the neural network plot as PDF
plot(nn)

# Include the Png image
knitr::include_graphics("neuralnet.png")

5) Performance Evaluation

# Predict on test data using neuralnet::compute
predicted <- neuralnet::compute(nn, testset[,1:3])$net.result

# Convert predictions to binary outcomes (0 or 1)
predicted <- ifelse(predicted > 0.5, 1, 0)

# Convert predicted to factor for comparison with Survival_Status
predicted <- factor(predicted, levels = c(0, 1))

# Confusion matrix
confusion_matrix <- table(predicted, testset$Survival_Status)
print(confusion_matrix)

##          
## predicted  0  1
##         0 43 18
##         1  0  0

# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy, 2)))

## [1] "Accuracy: 0.7"

# Calculate MSE
mse <- mean((as.numeric(predicted) - as.numeric(testset$Survival_Status))^2)
print(paste("MSE:", round(mse, 2)))

## [1] "MSE: 0.7"

6) Conclusion

In this analysis, we applied a neural network model to the Haberman dataset to predict the survival status of patients based on their age, year of operation, and number of axillary nodes detected.

Performance Metrics:

Accuracy: The model achieved an accuracy of 70%. This indicates that the model correctly classified 70% of the cases in the test set. Mean Squared Error (MSE): The MSE of the model was 0.7, which reflects the average squared difference between the predicted and actual values. A lower MSE would indicate better model performance, but in this case, the MSE is relatively high.

NN MODEL

MURA AKHIL

2024-09-03