data<-read.csv(file.choose())
head(data)
print(colnames(data))
## [1] "X30" "X64" "X1" "X1.1"
colnames(data) <- c('Age', 'Year_of_Procedure', 'Number_of_Lymph_nodes','Survival_status_after_5_Years')
print(colnames(data))
## [1] "Age" "Year_of_Procedure"
## [3] "Number_of_Lymph_nodes" "Survival_status_after_5_Years"
summary(data)
## Age Year_of_Procedure Number_of_Lymph_nodes
## Min. :30.00 Min. :58.00 Min. : 0.000
## 1st Qu.:44.00 1st Qu.:60.00 1st Qu.: 0.000
## Median :52.00 Median :63.00 Median : 1.000
## Mean :52.53 Mean :62.85 Mean : 4.036
## 3rd Qu.:61.00 3rd Qu.:66.00 3rd Qu.: 4.000
## Max. :83.00 Max. :69.00 Max. :52.000
## Survival_status_after_5_Years
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.266
## 3rd Qu.:2.000
## Max. :2.000
str(data)
## 'data.frame': 305 obs. of 4 variables:
## $ Age : int 30 30 31 31 33 33 34 34 34 34 ...
## $ Year_of_Procedure : int 62 65 59 65 58 60 59 66 58 60 ...
## $ Number_of_Lymph_nodes : int 3 0 2 4 10 0 0 9 30 1 ...
## $ Survival_status_after_5_Years: int 1 1 1 1 1 1 2 2 1 1 ...
# Histogram for Age
hist(data$Age, main="Distribution of Age", xlab="Age", col="yellow", border="black")
# Histogram for Year of Procedure
hist(data$Year_of_Procedure, main="Distribution of Year of Procedure", xlab="Year", col="lightblue", border="black")
# Histogram for Number of Lymph Nodes
hist(data$Number_of_Lymph_nodes, main="Distribution of Number of Lymph Nodes", xlab="Number of Lymph Nodes", col="green", border="black")
# Bar plot for Survival Status
table(data$Survival_status_after_5_Years)
##
## 1 2
## 224 81
barplot(table(data$Survival_status_after_5_Years), main="Survival Status after 5 Years", xlab="Survival Status", ylab="Frequency", col=c("lightgreen", "yellow"), names.arg=c("Survived", "Died"))
#Bivariate Analysis
# Boxplot of Age vs Survival Status
boxplot(Age ~ Survival_status_after_5_Years, data=data, main="Age vs Survival Status", xlab="Survival Status", ylab="Age", col=c("lightgreen", "yellow"), names=c("Survived", "Died"))
# Boxplot of Year of Procedure vs Survival Status
boxplot(Year_of_Procedure ~ Survival_status_after_5_Years, data=data, main="Year of Procedure vs Survival Status", xlab="Survival Status", ylab="Year of Procedure", col=c("lightgreen", "yellow"), names=c("Survived", "Died"))
# Boxplot of Number of Lymph Nodes vs Survival Status
boxplot(Number_of_Lymph_nodes ~ Survival_status_after_5_Years, data=data, main="Number of Lymph Nodes vs Survival Status", xlab="Survival Status", ylab="Number of Lymph Nodes", col=c("lightgreen", "yellow"), names=c("Survived", "Died"))
#Correlation Analysis
# Correlation matrix
cor_matrix <- cor(data[, c('Age', 'Year_of_Procedure', 'Number_of_Lymph_nodes')])
print(cor_matrix)
## Age Year_of_Procedure Number_of_Lymph_nodes
## Age 1.00000000 0.092622735 -0.066548089
## Year_of_Procedure 0.09262274 1.000000000 -0.003277353
## Number_of_Lymph_nodes -0.06654809 -0.003277353 1.000000000
# Visualization of the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix)
This script is a comprehensive approach to exploring and understanding the dataset. The univariate analysis focuses on understanding the distribution of individual variables, while the bivariate analysis investigates how different variables relate to each other, particularly in the context of survival after 5 years. Finally, the correlation analysis helps identify any potential linear relationships between numerical variables in the dataset.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.3
## Loading required package: lattice
# Convert Survival_status_after_5_Years to a factor
data$Survival_status_after_5_Years <- as.factor(data$Survival_status_after_5_Years)
# Scaling Age, Year_of_Procedure, Number_of_Lymph_nodes
data_scaled <- data %>%
mutate(Age = scale(Age),
Year_of_Procedure = scale(Year_of_Procedure),
Number_of_Lymph_nodes = scale(Number_of_Lymph_nodes))
# Splitting the data into training and testing sets
set.seed(123)
index <- createDataPartition(data_scaled$Survival_status_after_5_Years, p = 0.8, list = FALSE)
trainset <- data_scaled[index, ]
testset <- data_scaled[-index, ]
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.3.3
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
library(caret)
# Formula for the neural network model
formula <- Survival_status_after_5_Years ~ Age + Year_of_Procedure + Number_of_Lymph_nodes
# Neural network model
nn_model <- neuralnet(formula,
data = trainset,
hidden = 6, # 6 neurons in the hidden layer
linear.output = FALSE, # Binary classification
stepmax = 1e5, # Maximum steps for convergence
threshold = 0.01, # Convergence threshold
act.fct = "logistic") # Sigmoid activation function in output layer
# Neural network architecture
plot(nn_model)
Architecture (6 Neurons in Hidden Layer): The choice of 6 neurons in the hidden layer is a balanced decision to allow the network to capture complex patterns in the data while avoiding overfitting. A single hidden layer is often sufficient for simpler problems, and 6 neurons are a modest number that prevents the model from being too complex.
Activation Function (Sigmoid): The sigmoid activation function is ideal for binary classification because it outputs a probability value between 0 and 1, which aligns with the two classes (Survived, Not Survived).
Loss Function (Implicit Cross-Entropy): Cross-entropy loss is standard for binary classification tasks, as it measures the difference between the predicted probability distribution and the true distribution, optimizing the model for accurate classification.
Hyperparameters (Stepmax and Threshold): stepmax = 1e5 ensures the model has sufficient iterations to converge, while threshold = 0.01 ensures that training stops once the gradient is small enough, indicating the model has found a minimum of the loss function.
This configuration is chosen to create a neural network that can effectively learn from the data without overfitting or underfitting, providing a good balance between model complexity and training efficiency.
library(neuralnet)
library(caret)
# check for the correct test data subset is used for predictions
test_features <- testset[, c('Age', 'Year_of_Procedure', 'Number_of_Lymph_nodes')]
# Check for duplication
print(anyDuplicated(test_features))
## [1] 45
# Predict on the test data
nn_predictions <- compute(nn_model, test_features)
# Inspect the structure of the predictions
print(str(nn_predictions))
## List of 2
## $ neurons :List of 2
## ..$ : num [1:60, 1:4] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:60] "3" "14" "17" "20" ...
## .. .. ..$ : chr [1:4] "" "Age" "Year_of_Procedure" "Number_of_Lymph_nodes"
## ..$ : num [1:60, 1:7] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:60] "3" "14" "17" "20" ...
## .. .. ..$ : NULL
## $ net.result: num [1:60, 1:2] 0.983 0.703 0.716 0.99 0.905 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:60] "3" "14" "17" "20" ...
## .. ..$ : NULL
## NULL
# Extract and flatten predictions
# Take the first column of the predictions, which corresponds to the probability for the positive class
predicted_probabilities <- nn_predictions$net.result[, 1]
# Convert predicted probabilities to binary classes based on a threshold of 0.5
predicted_class <- ifelse(predicted_probabilities > 0.5, 1, 2)
# Check lengths again
print(length(predicted_class)) # limit 60
## [1] 60
print(length(testset$Survival_status_after_5_Years)) # limit 60
## [1] 60
# Convert to factors and align levels
predicted_class <- as.factor(predicted_class)
actual_class <- as.factor(testset$Survival_status_after_5_Years)
# Ensure levels are consistent
predicted_class <- factor(predicted_class, levels = levels(actual_class))
# Evaluate model accuracy on test data
conf_matrix <- confusionMatrix(predicted_class, actual_class)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 34 9
## 2 10 7
##
## Accuracy : 0.6833
## 95% CI : (0.5504, 0.7974)
## No Information Rate : 0.7333
## P-Value [Acc > NIR] : 0.8465
##
## Kappa : 0.2061
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.7727
## Specificity : 0.4375
## Pos Pred Value : 0.7907
## Neg Pred Value : 0.4118
## Prevalence : 0.7333
## Detection Rate : 0.5667
## Detection Prevalence : 0.7167
## Balanced Accuracy : 0.6051
##
## 'Positive' Class : 1
##
The performance of a neural network model trained to predict the survival status after 5 years. After ensuring that the test data is correctly prepared, the model predicts probabilities for the survival status. These probabilities are then converted into binary classes using a threshold of 0.5. The code compares these predictions against the actual survival statuses in the test set using a confusion matrix. This matrix helps to assess the model’s accuracy and other performance metrics, providing insights into how well the model classifies survival outcomes.
Conclusion: The provided R code outlines the process of building, training, and evaluating a neural network model for predicting 5-year survival status based on patient characteristics such as age, year of procedure, and the number of lymph nodes. The model uses a single hidden layer with 6 neurons and applies a sigmoid activation function for binary classification. After training, predictions are made on a test set, with probabilities converted to binary classes based on a 0.5 threshold. The model’s performance is then evaluated using a confusion matrix, which assesses the accuracy of predictions compared to actual survival outcomes. This approach effectively demonstrates how to preprocess data, train a neural network, and interpret its performance in a real-world classification task.