Name : Patel Raj Dilipbhai

Reg no: 2023MDTS07ALA012

Submitted to : K A Venkatesh

Msc DS/3rd sem/Alliance University

Date: “2024-09-03”

# Load library

library(survival)
library(keras)
library(tensorflow)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(neuralnet)
## 
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
## 
##     compute
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:tensorflow':
## 
##     train
## The following object is masked from 'package:survival':
## 
##     cluster
# Load the dataset
data <- read.csv(file.choose())

# Rename Column name 
data <- data%>%
  rename(Age=X30,year_of_operation=X64,positive_axillary=X1,Survival_status=X1.1)
  
data$Survival_status <- as.factor(data$Survival_status)
# EDA
# Summary

summary(data)
##       Age        year_of_operation positive_axillary Survival_status
##  Min.   :30.00   Min.   :58.00     Min.   : 0.000    1:224          
##  1st Qu.:44.00   1st Qu.:60.00     1st Qu.: 0.000    2: 81          
##  Median :52.00   Median :63.00     Median : 1.000                   
##  Mean   :52.53   Mean   :62.85     Mean   : 4.036                   
##  3rd Qu.:61.00   3rd Qu.:66.00     3rd Qu.: 4.000                   
##  Max.   :83.00   Max.   :69.00     Max.   :52.000
str(data)
## 'data.frame':    305 obs. of  4 variables:
##  $ Age              : int  30 30 31 31 33 33 34 34 34 34 ...
##  $ year_of_operation: int  62 65 59 65 58 60 59 66 58 60 ...
##  $ positive_axillary: int  3 0 2 4 10 0 0 9 30 1 ...
##  $ Survival_status  : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 2 1 1 ...
#-----------------------------------
#             Univariate 
#-----------------------------------
hist(data$Age, main="Distribution of Age", xlab="Age", col="red", border="black")

# Histogram for year of operation
hist(data$year_of_operation, main="Distribution of year of operation", xlab="Year", col="Blue", border="black")

# Histogram for Number of positive_axillary Nodes
hist(data$positive_axillary, main="Distribution of Number of positive axillary Nodes", xlab="Number of positive axillary Nodes", col="lightblue", border="black")

# Bar plot for Survival Status
table(data$Survival_status)
## 
##   1   2 
## 224  81
barplot(table(data$Survival_status), main="Survival Status", xlab="Survival Status", ylab="Frequency", col=c("lightgreen", "lightcoral"), names.arg=c("Survived", "Died"))

#-----------------------------------
#            Bivariate
#-----------------------------------

# Boxplot of Age vs Survival Status
boxplot(Age ~ Survival_status, data=data, main="Age vs Survival Status", xlab="Survival Status", ylab="Age", col=c("green", "lightblue"), names=c("Survived", "Died"))

# Boxplot of Year of Procedure vs Survival Status
boxplot(year_of_operation ~ Survival_status, data=data, main="year of operation vs Survival Status", xlab="Survival Status", ylab="Year of operation", col=c("pink", "lightblue"), names=c("Survived", "Died"))

# Boxplot of Number of Lymph Nodes vs Survival Status
boxplot(positive_axillary ~ Survival_status, data=data, main="Number of positive axillary Nodes vs Survival Status", xlab="Survival Status", ylab="Number of positive axillary Nodes", col=c("pink", "yellow"), names=c("Survived", "Died"))

# Scaling Age,year_of_operation, positive_axillary
data_scaled <- data %>%
  mutate(Age = scale(Age),
         year_of_operation = scale(year_of_operation),
         positive_axillary = scale(positive_axillary))


# Splitting the data into training and testing sets
set.seed(123)
index <- createDataPartition(data_scaled$Survival_status, p = 0.8, list = FALSE)
trainset <- data_scaled[index, ]
testset <- data_scaled[-index, ]
proc <- Survival_status ~ Age + year_of_operation + positive_axillary


nn_model <- neuralnet(proc,
                      data = trainset,
                      hidden = 5,            # 6 neurons in the hidden layer
                      linear.output = FALSE,  # Binary classification
                      stepmax = 1e5,          # Maximum steps for convergence
                      threshold = 0.01,       # Convergence threshold
                      act.fct = "logistic")   # Sigmoid activation function in outputlayer

plot(nn_model)
  1. Architecture: Input Layer: The input layer shall correspond to the three features -Age, year_of_operation and positive_axillary. Each feature would be represented by one neuron. Hidden Layer: 5 Neurons: Selection of 5 neurons in the hidden layer is a trade-off between the models complexity and computational efficiency. If too few neurons, it may result in underfitting, which means it may fail to capture the underlying patterns of the data. On the other hand, if there are too many neurons, it can easily overfit, which may make the model capture noise as signal. It is complex enough to model the relationship between features by selecting 5 neurons but not too complex to overfit. Output Layer: 1 Neuron: As this is a problem of binary classification, for the prediction of Survival_status, there can only be two classes: 0 and 1. Therefore, one neuron will suffice in the output layer. The activation function of this neuron is “logistic” as it maps the output to lie between 0 and 1, suitable for binary classification.
  2. Loss Function: The binary cross-entropy loss is implicitly used in the model, suitable for binary classification. This essentially calculates the loss function based on a difference between the probabilities predicted versus what actually class labels represent. This weight is shifted during training so as to minimize this loss in coming up with better predictions.
  3. Hyperparameters stepmax = 1e5: This is a parameter to set the maximum number of steps this optimization algorithm takes in finding a minimum loss. The value 1e5, or 100,000 steps, gives adequate iterations to converge on the model without undue computation to find its optimal solution. Threshold = 0.01 - Convergence threshold, where the training should stop. If the improvement in the loss function is less than 0.01, then the model converges. It prevents over-training of the model and saves computational resources. hidden = 5: As mentioned before, having 5 neurons in the hidden layer enables an optimal trade-off between model complexity and generalization capability. act.fct = “logistic”: Other famous one is the logistic function which has popular alias-sigmoid function. It really shines when doing binary classification problems since it transforms output of the neuron into a probability ranging from 0 to 1.
# Ensure the correct test data subset is used for predictions
test_features <- testset[, c('Age', 'year_of_operation', 'positive_axillary')]

# Check for duplication
print(anyDuplicated(test_features))
## [1] 45
# Predict on the test data
nn_predictions <- compute(nn_model, test_features)

# Inspect the structure of the predictions
print(str(nn_predictions))
## List of 2
##  $ neurons   :List of 2
##   ..$ : num [1:60, 1:4] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:60] "3" "14" "17" "20" ...
##   .. .. ..$ : chr [1:4] "" "Age" "year_of_operation" "positive_axillary"
##   ..$ : num [1:60, 1:6] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:60] "3" "14" "17" "20" ...
##   .. .. ..$ : NULL
##  $ net.result: num [1:60, 1:2] 1 0.548 0.885 0.85 0.998 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:60] "3" "14" "17" "20" ...
##   .. ..$ : NULL
## NULL
# Extract and flatten predictions (assuming binary classification with 2 columns)
# Take the first column of the predictions, which corresponds to the probability for the positive class
predicted_probabilities <- nn_predictions$net.result[, 1]

# Convert predicted probabilities to binary classes based on a threshold of 0.5
predicted_class <- ifelse(predicted_probabilities > 0.5, 1, 2)

# Check lengths again
print(length(predicted_class))  # Should be 60
## [1] 60
print(length(testset$Survival_status))  # Should be 60
## [1] 60
# Convert to factors and align levels
predicted_class <- as.factor(predicted_class)
actual_class <- as.factor(testset$Survival_status)

# Ensure levels are consistent
predicted_class <- factor(predicted_class, levels = levels(actual_class))

# Evaluate model accuracy on test data
conf_matrix <- confusionMatrix(predicted_class, actual_class)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 35 12
##          2  9  4
##                                          
##                Accuracy : 0.65           
##                  95% CI : (0.516, 0.7687)
##     No Information Rate : 0.7333         
##     P-Value [Acc > NIR] : 0.9426         
##                                          
##                   Kappa : 0.0483         
##                                          
##  Mcnemar's Test P-Value : 0.6625         
##                                          
##             Sensitivity : 0.7955         
##             Specificity : 0.2500         
##          Pos Pred Value : 0.7447         
##          Neg Pred Value : 0.3077         
##              Prevalence : 0.7333         
##          Detection Rate : 0.5833         
##    Detection Prevalence : 0.7833         
##       Balanced Accuracy : 0.5227         
##                                          
##        'Positive' Class : 1              
## 

```

conclusion :- This neural network model was able to predict the survival status of the patients using three features: age, year of operation, and number of positive axillary lymph nodes. It also had a balanced architecture with regard to five neurons in its hidden layer and binary cross-entropy, hence optimal generalization without leading the model towards overfitting. The performance evaluation of the model was done using a confusion matrix and had quite good accuracy, hence reliable to carry out the task of this binary classification.