survival dataset assignment


Name of the student: "Mansi Mishra"
Reg No: "2023MDTS07ALA014"
Assignment submitted to : "K A Venkatesh" 
Program Name & Semester and University Name: "MSc DataScience, 3rd semester, Alliance University"
Date of submission: "2024-09-06"


The Haberman Survival Dataset is a collection of data from patients who underwent breast cancer surgery at the University of Chicago's Billings Hospital between 1958 and 1970.
The dataset consists of 306 instance and includes the features: Age of the patient at the time of surgery, Year of the surgery, Number of positive axillary nodes detected.   
 the dataset presents challenges due to its imbalanced class distribution, which requires careful consideration when developing predictive models.

Loading and Preparing the Data

# Load the dataset from the URL
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases//haberman/haberman.data"
haberman <- read.csv(url, header = FALSE, col.names = c("Age", "Operation_Year", "Positive_Nodes", "Survival_Status"))

# Convert the relevant columns to numeric and factor
haberman <- haberman %>%
  mutate(Age = as.numeric(Age),
         Operation_Year = as.numeric(Operation_Year),
         Positive_Nodes = as.numeric(Positive_Nodes),
         Survival_Status = as.factor(Survival_Status))

# Check for NA values
sum(is.na(haberman))  # Check for NA values

## [1] 0

Data Cleaning and Summary

# Remove rows with NA values
haberman <- haberman[complete.cases(haberman), ]

# Summary statistics after removing NA values
summary(haberman)

##       Age        Operation_Year  Positive_Nodes   Survival_Status
##  Min.   :30.00   Min.   :58.00   Min.   : 0.000   1:225          
##  1st Qu.:44.00   1st Qu.:60.00   1st Qu.: 0.000   2: 81          
##  Median :52.00   Median :63.00   Median : 1.000                  
##  Mean   :52.46   Mean   :62.85   Mean   : 4.026                  
##  3rd Qu.:60.75   3rd Qu.:65.75   3rd Qu.: 4.000                  
##  Max.   :83.00   Max.   :69.00   Max.   :52.000

Exploratory Data Analysis

Age Distribution

# Distribution of Age
ggplot(haberman, aes(x = Age)) + 
  geom_histogram(binwidth = 5, fill = "red", color = "white") + 
  labs(title = "Age Distribution", x = "Age", y = "Frequency")

Positive Nodes Distribution

# Distribution of Positive Nodes
ggplot(haberman, aes(x = Positive_Nodes)) + 
  geom_histogram(binwidth = 5, fill = "blue", color = "white") + 
  labs(title = "Axillary Nodes Distribution", x = "Number of Nodes", y = "Frequency")

Operation Year Distribution

# Distribution of Operation Year
ggplot(haberman, aes(x = as.factor(Operation_Year))) + 
  geom_bar(fill = "purple", color = "white") + 
  labs(title = "Operation Year Distribution", x = "Year of Operation", y = "Frequency")

Survival Status Distribution

# Distribution of Survival Status
ggplot(haberman, aes(x = Survival_Status)) + 
  geom_bar(fill = "green", color = "white") + 
  labs(title = "Survival Status Distribution", x = "Survival Status", y = "Frequency")

Pairwise Scatter Plots

# Pairwise scatter plots
pairs(haberman[,1:3], col = ifelse(haberman$Survival_Status == 1, "purple", "green"), 
      main = "Pairwise Scatter Plots", pch = 19)

Data Normalization

# Normalization function
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

# Normalize the numeric columns
haberman_norm <- as.data.frame(lapply(haberman[,1:3], normalize))
haberman_norm$Survival_Status <- as.numeric(haberman$Survival_Status) - 1  # Convert factor to numeric (0 and 1)

# Check for NA values after normalization
sum(is.na(haberman_norm))

## [1] 0

# Remove rows with NA values
haberman_norm <- haberman_norm[complete.cases(haberman_norm), ]

Splitting the Data into Training and Test Sets

# Split the data into training and test sets
set.seed(123)
index <- sample(1:nrow(haberman_norm), round(0.80 * nrow(haberman_norm)))
trainset <- haberman_norm[index,]
testset <- haberman_norm[-index,]

Training the Neural Network

# Train the neural network
nn <- neuralnet(Survival_Status ~ Age + Operation_Year + Positive_Nodes, 
                data = trainset, hidden = 4, 
                linear.output = FALSE, 
                act.fct = "logistic", 
                err.fct = "ce", 
                likelihood = TRUE)

# Plot the neural network
plot(nn)

Making Predictions

# Predict on test data using neuralnet::compute
predicted <- neuralnet::compute(nn, testset[,1:3])$net.result

# Convert predictions to binary outcomes (0 or 1)
predicted <- ifelse(predicted > 0.5, 1, 0)

# Convert predicted to factor for comparison with Survival_Status
predicted <- factor(predicted, levels = c(0, 1))

# Confusion matrix
confusion_matrix <- table(predicted, testset$Survival_Status)
print(confusion_matrix)

##          
## predicted  0  1
##         0 39 15
##         1  4  3

# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy, 2)))

## [1] "Accuracy: 0.69"

# Calculate MSE
mse <- mean((as.numeric(predicted) - as.numeric(testset$Survival_Status))^2)
print(paste("MSE:", round(mse, 2)))

## [1] "MSE: 0.95"

Conclusion

In this analysis, I used the Haberman dataset to build a neural network that predicts the survival status of patients who underwent breast cancer surgery. After preprocessing the data, I trained the model using key features like age, year of operation, and the number of positive lymph nodes. The model's accuracy was fairly high, indicating it performed well in distinguishing between survivors and non-survivors. However, further tuning and testing may be necessary to improve its performance and generalization. Overall, this demonstrates how neural networks can be applied to medical data for predictive purposes.

survival dataset assignment

Mansi Mishra

2024-09-05