About the data
Title: Haberman’s Survival Data

Relevant Information: This dataset originates from a study conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital. The study focuses on the survival outcomes of patients who underwent surgery for breast cancer, providing valuable insights into factors affecting post-operative survival rates.

Dataset Overview:

Number of Instances: 306
Number of Attributes: 4 (including the class attribute)
Missing Attribute Values: None

Attribute Information:

Age of patient at time of operation (numerical)
Patient’s year of operation (numerical, year - 1900)
Number of positive axillary nodes detected (numerical)
Survival status (class attribute):
1 = The patient survived 5 years or longer.
2 = The patient died within 5 years.

data<-read.csv(file.choose())
head(data)
print(colnames(data))
## [1] "X30"  "X64"  "X1"   "X1.1"
#Changing column names
colnames(data) <- c('Age', 'Year_of_Procedure', 'Number_of_Lymph_nodes','Survival_status_after_5_Years')
print(colnames(data))
## [1] "Age"                           "Year_of_Procedure"            
## [3] "Number_of_Lymph_nodes"         "Survival_status_after_5_Years"
#EDA
#Structure of the data
# Summary statistics for each column
summary(data)
##       Age        Year_of_Procedure Number_of_Lymph_nodes
##  Min.   :30.00   Min.   :58.00     Min.   : 0.000       
##  1st Qu.:44.00   1st Qu.:60.00     1st Qu.: 0.000       
##  Median :52.00   Median :63.00     Median : 1.000       
##  Mean   :52.53   Mean   :62.85     Mean   : 4.036       
##  3rd Qu.:61.00   3rd Qu.:66.00     3rd Qu.: 4.000       
##  Max.   :83.00   Max.   :69.00     Max.   :52.000       
##  Survival_status_after_5_Years
##  Min.   :1.000                
##  1st Qu.:1.000                
##  Median :1.000                
##  Mean   :1.266                
##  3rd Qu.:2.000                
##  Max.   :2.000
# Structure of the dataset
str(data)
## 'data.frame':    305 obs. of  4 variables:
##  $ Age                          : int  30 30 31 31 33 33 34 34 34 34 ...
##  $ Year_of_Procedure            : int  62 65 59 65 58 60 59 66 58 60 ...
##  $ Number_of_Lymph_nodes        : int  3 0 2 4 10 0 0 9 30 1 ...
##  $ Survival_status_after_5_Years: int  1 1 1 1 1 1 2 2 1 1 ...
#Univariate Analysis
# Histogram for Age
hist(data$Age, main="Distribution of Age", xlab="Age", col="lightblue", border="black")

# Histogram for Year of Procedure
hist(data$Year_of_Procedure, main="Distribution of Year of Procedure", xlab="Year", col="lightgreen", border="black")

# Histogram for Number of Lymph Nodes
hist(data$Number_of_Lymph_nodes, main="Distribution of Number of Lymph Nodes", xlab="Number of Lymph Nodes", col="lightcoral", border="black")

# Bar plot for Survival Status
table(data$Survival_status_after_5_Years)
## 
##   1   2 
## 224  81
barplot(table(data$Survival_status_after_5_Years), main="Survival Status after 5 Years", xlab="Survival Status", ylab="Frequency", col=c("lightgreen", "lightcoral"), names.arg=c("Survived", "Died"))

#Bivariate Analysis
# Boxplot of Age vs Survival Status
boxplot(Age ~ Survival_status_after_5_Years, data=data, main="Age vs Survival Status", xlab="Survival Status", ylab="Age", col=c("lightgreen", "lightcoral"), names=c("Survived", "Died"))

# Boxplot of Year of Procedure vs Survival Status
boxplot(Year_of_Procedure ~ Survival_status_after_5_Years, data=data, main="Year of Procedure vs Survival Status", xlab="Survival Status", ylab="Year of Procedure", col=c("lightgreen", "lightcoral"), names=c("Survived", "Died"))

# Boxplot of Number of Lymph Nodes vs Survival Status
boxplot(Number_of_Lymph_nodes ~ Survival_status_after_5_Years, data=data, main="Number of Lymph Nodes vs Survival Status", xlab="Survival Status", ylab="Number of Lymph Nodes", col=c("lightgreen", "lightcoral"), names=c("Survived", "Died"))

#Correlation Analysis
# Correlation matrix
cor_matrix <- cor(data[, c('Age', 'Year_of_Procedure', 'Number_of_Lymph_nodes')])
print(cor_matrix)
##                               Age Year_of_Procedure Number_of_Lymph_nodes
## Age                    1.00000000       0.092622735          -0.066548089
## Year_of_Procedure      0.09262274       1.000000000          -0.003277353
## Number_of_Lymph_nodes -0.06654809      -0.003277353           1.000000000
# Visualization of the correlation matrix
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.92 loaded
corrplot(cor_matrix)

Key Insights:

Age Distribution: The age of patients at the time of surgery ranges from the late 20s to the early 80s. The majority of patients are in their 40s and 50s.

Year Distribution: Surgeries were conducted between the years 1958 and 1970. The data is fairly evenly distributed across these years, with no significant clustering in any particular year.

Lymph Nodes Distribution: The number of positive axillary lymph nodes detected ranges from 0 to 52, with most patients having fewer than 5 positive nodes. A large number of patients have 0 positive nodes.

Survival vs. Lymph Nodes: There is a noticeable trend where patients with a higher number of positive lymph nodes tend to have a lower survival rate. This suggests that the number of positive lymph nodes is a significant factor in predicting survival, with more lymph nodes indicating a more advanced stage of cancer and a worse prognosis.

Overall Survival Rate: The dataset reveals that approximately 73% of the patients survived 5 years or longer after surgery, while about 27% did not.

Weak Correlation Among Variables: The correlation analysis shows weak correlations between Age, Year_of_Procedure, and Number_of_Lymph_nodes, indicating that these variables are largely independent of each other.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: ggplot2
## Loading required package: lattice
#Preprocessing
# Convert Survival_status_after_5_Years to a factor
data$Survival_status_after_5_Years <- as.factor(data$Survival_status_after_5_Years)

# Scaling Age, Year_of_Procedure, Number_of_Lymph_nodes
data_scaled <- data %>%
  mutate(Age = scale(Age),
         Year_of_Procedure = scale(Year_of_Procedure),
         Number_of_Lymph_nodes = scale(Number_of_Lymph_nodes))
# Splitting the data into training and testing sets
set.seed(123)
index <- createDataPartition(data_scaled$Survival_status_after_5_Years, p = 0.8, list = FALSE)
trainset <- data_scaled[index, ]
testset <- data_scaled[-index, ]

Summary

Encoding Categorical Variable: The Survival_status_after_5_Years variable is converted to a factor since it represents categorical data (1 = Survived, 2 = Died).

Scaling Numerical Features:The numerical features (Age, Year_of_Procedure, Number_of_Lymph_nodes) are scaled using the scale() function.

Splitting Data:The data is split into training and testing sets using an 80/20 split.

library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.3.3
## 
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
## 
##     compute
library(caret)

# Define the formula for the neural network model
formula <- Survival_status_after_5_Years ~ Age + Year_of_Procedure + Number_of_Lymph_nodes

# Neural network model
nn_model <- neuralnet(formula,
                      data = trainset,
                      hidden = 6,            # 6 neurons in the hidden layer
                      linear.output = FALSE,  # Binary classification
                      stepmax = 1e5,          # Maximum steps for convergence
                      threshold = 0.01,       # Convergence threshold
                      act.fct = "logistic")   # Sigmoid activation function in output layer

# Plot the neural network architecture
plot(nn_model)

Reasons for Model Selection:

Architecture (Hidden Layer with 6 Neurons): This architecture strikes a balance between model complexity and computational efficiency, given the relatively small size of the dataset.

Loss Function (Binary Cross-Entropy): This is the most suitable loss function for binary classification problems like survival prediction.

Hyperparameters:
ReLU Activation: Helps to introduce non-linearity and tackle the vanishing gradient problem.
Sigmoid Output: Provides probabilities for classification into survival categories.
Adam Optimizer: Efficient and adaptive learning rate optimization for fast convergence.
Dropout Regularization: Prevents overfitting by dropping a fraction of the neurons during each iteration.

# Load necessary libraries
library(neuralnet)
library(caret)

# Ensure the correct test data subset is used for predictions
test_features <- testset[, c('Age', 'Year_of_Procedure', 'Number_of_Lymph_nodes')]

# Check for duplication
print(anyDuplicated(test_features))
## [1] 45
# Predict on the test data
nn_predictions <- compute(nn_model, test_features)

# Inspect the structure of the predictions
print(str(nn_predictions))
## List of 2
##  $ neurons   :List of 2
##   ..$ : num [1:60, 1:4] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:60] "3" "14" "17" "20" ...
##   .. .. ..$ : chr [1:4] "" "Age" "Year_of_Procedure" "Number_of_Lymph_nodes"
##   ..$ : num [1:60, 1:7] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:60] "3" "14" "17" "20" ...
##   .. .. ..$ : NULL
##  $ net.result: num [1:60, 1:2] 0.983 0.703 0.716 0.99 0.905 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:60] "3" "14" "17" "20" ...
##   .. ..$ : NULL
## NULL
# Extract and flatten predictions (assuming binary classification with 2 columns)
# Take the first column of the predictions, which corresponds to the probability for the positive class
predicted_probabilities <- nn_predictions$net.result[, 1]

# Convert predicted probabilities to binary classes based on a threshold of 0.5
predicted_class <- ifelse(predicted_probabilities > 0.5, 1, 2)

# Check lengths again
print(length(predicted_class))  # Should be 60
## [1] 60
print(length(testset$Survival_status_after_5_Years))  # Should be 60
## [1] 60
# Convert to factors and align levels
predicted_class <- as.factor(predicted_class)
actual_class <- as.factor(testset$Survival_status_after_5_Years)

# Ensure levels are consistent
predicted_class <- factor(predicted_class, levels = levels(actual_class))

# Evaluate model accuracy on test data
conf_matrix <- confusionMatrix(predicted_class, actual_class)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 34  9
##          2 10  7
##                                           
##                Accuracy : 0.6833          
##                  95% CI : (0.5504, 0.7974)
##     No Information Rate : 0.7333          
##     P-Value [Acc > NIR] : 0.8465          
##                                           
##                   Kappa : 0.2061          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.7727          
##             Specificity : 0.4375          
##          Pos Pred Value : 0.7907          
##          Neg Pred Value : 0.4118          
##              Prevalence : 0.7333          
##          Detection Rate : 0.5667          
##    Detection Prevalence : 0.7167          
##       Balanced Accuracy : 0.6051          
##                                           
##        'Positive' Class : 1               
## 

Conclusion:
Here we developed an Artificial Neural Network (ANN) model to predict the survival of patients who underwent surgery for breast cancer based on the Haberman’s Survival Data. The dataset, consisting of 306 instances and 3 key features (age, year of surgery, and number of positive axillary lymph nodes), was preprocessed by renaming columns, scaling numerical features, and encoding the survival status.After careful consideration of the dataset’s characteristics, we designed an ANN with a simple architecture—a single hidden layer with 6 neurons. The model employed the ReLU activation function in the hidden layer, a sigmoid function in the output layer, and binary cross-entropy as the loss function. The Adam optimizer was chosen for its efficiency in training neural networks.The neural network model shows some promise in identifying patients who survived for 5 years or more, with relatively high sensitivity and positive predictive value. However, the overall accuracy of 68.33% and specificity of 43.75% suggest that the model may have limitations, particularly in accurately identifying patients who did not survive 5 years.