1. Name of the student - Madhu M
2. Reg No - 20203MDTS07ALA030
3. Assignment submitted to - KA Venkatesh
4. Program Name & Semester and University Name - MSc Datascience, Alliance University
5. Date of submission - 04/9/2024

# ANN Implementation with the Haberman’s Survival Data
The Haberman’s Survival Dataset is a collection of data from a study conducted on the survival of patients who had undergone surgery for breast cancer. The dataset includes 306 instances, with each entry containing three features: the patient’s age at the time of surgery, the year of surgery (spanning from 1958 to 1969), and the number of positive axillary lymph nodes detected. The target variable indicates whether the patient survived five years or longer after surgery. This dataset is often used in the field of machine learning and data analysis to explore patterns and develop predictive models related to patient survival outcomes.
(Dataset description).
We start with simple five-point summary statistics for two of the three independent variables: Age and the number of positive axillary lymph nodes.
Age.
## [1] "Oldest Patient Recorded: 83"
## [1] "Youngest Patient Recorded: 30"
The ages in this dataset range from 30 to 83, which means we are mostly working with middle-aged to older patients. This variable is important for understanding patients’ survival outcomes.
Visualizing the distribution of age, we see the following.
From this slightly skewed plot, we see that the most common age group, which also contains the median, is 50 to 55 years.
Now let’s look at the survival rates for different age groups.
## [1] "Survival Rate for Age Group [30,40) : 89.74 %"
## [1] "Survival Rate for Age Group [40,50) : 67.86 %"
## [1] "Survival Rate for Age Group [50,60) : 73.74 %"
## [1] "Survival Rate for Age Group [60,70) : 70.97 %"
## [1] "Survival Rate for Age Group [70,80) : 75 %"
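The grouping behind rates like these can be sketched as follows. This is an illustrative Python sketch, not the code that produced the output above; the function name `survival_rate_by_age_group`, the 10-year bin width, and the toy records are assumptions, and a status of `1` is taken to mean the patient survived.

```python
from collections import defaultdict

def survival_rate_by_age_group(records, width=10):
    """Return {(lo, hi): survival rate in %} for half-open bins [lo, hi).

    records: iterable of (age, status) pairs, status 1 = survived (assumed coding).
    """
    totals = defaultdict(int)
    survived = defaultdict(int)
    for age, status in records:
        lo = (age // width) * width      # e.g. age 34 falls in [30, 40)
        totals[(lo, lo + width)] += 1
        survived[(lo, lo + width)] += status
    return {
        group: round(100 * survived[group] / totals[group], 2)
        for group in sorted(totals)
    }

# Hypothetical toy records, NOT taken from the Haberman data.
toy = [(34, 0), (38, 1), (41, 1), (45, 0), (47, 1)]
rates = survival_rate_by_age_group(toy)
for (lo, hi), rate in rates.items():
    print(f"Survival Rate for Age Group [{lo},{hi}) : {rate} %")
```

Half-open bins mean a 40-year-old is counted in [40,50) and not in [30,40), which matches the interval notation used in the output.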
Next, we find the youngest and the oldest deaths recorded.
## [1] "The Oldest Death Recorded : 83"
## [1] "The Youngest Death recorded : 34"
But the real question is: does the patient’s survival status really depend on their age? It most likely does for older patients, but let’s still find out how strongly the two are correlated.
## [1] "The correlation percentage between the age and survival : 6.44 %"
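For reference, a correlation between a numeric variable and a 0/1 outcome is an ordinary Pearson correlation (sometimes called point-biserial). The sketch below shows the calculation in Python; the name `pearson` and the toy values are assumptions for illustration, and the 6.44% above was not produced by this code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data, NOT from the Haberman dataset.
ages = [30, 40, 50, 60]
status = [1, 1, 0, 0]   # assumed coding: 1 = survived
r = pearson(ages, status)
print(f"The correlation percentage between the age and survival : {100 * r:.2f} %")
```

The sign of the result depends on how the status is coded, so the coding should be checked before reading a direction into the number.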
Judging by the correlation percentage, age is neither a strong predictor nor a guarantee of whether the patient will survive. A value of 6.44% indicates only a very weak linear relationship, so any tendency for older patients to fare differently is slight.
But what does a patient’s number of positive axillary nodes tell us?
First, we look at the range of node counts in this group of patients.
## [1] "Highest No. of Nodes recorded in a patient : 52"
## [1] "Lowest No. of Nodes recorded in a patient : 0"
Let’s see how this variable is distributed.
## [1] "The average no. of Nodes presented: 4.03606557377049"
The outliers here suggest high variability among patients: many have zero nodes, while a single patient has 52, which could be either a rare case or simply a data entry error. Meanwhile, the average is about 4 nodes per patient.
Let’s also explore how the number of nodes varies within the age groups created in the prior analysis.
But what about the patient with 52 nodes? Did they survive or not?
## [1] "Did the patient with 52 nodes survive ? : 1"
## [1] "& how old were they ? : 41"
This appears to be a highly unusual medical case.
But what about the patients with 0 nodes?
## [1] "Number of Instances where 0 nodes were detected : 136"
## [1] "Number of patients with 0 nodes who survived: 117"
## [1] "Number of patients with 0 nodes who did not survive: 19"
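Put as a rate, these counts are easier to compare with the age-group figures. The short Python sketch below only redoes the arithmetic on the three counts printed above; the variable names are assumptions.

```python
# Counts copied from the output above.
zero_node_total = 136
zero_node_survived = 117
zero_node_died = 19

# Sanity check: the two outcome counts should add up to the total.
assert zero_node_survived + zero_node_died == zero_node_total

survival_rate = 100 * zero_node_survived / zero_node_total
print(f"Survival rate for patients with 0 nodes: {survival_rate:.2f} %")
```

That works out to roughly 86% survival among patients with no positive nodes.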
Let’s continue the exploration with the Year variable.
## [1] "The year with the most deaths is: 65 with 13 deaths. "
## [1] "The year with the highest number of patients recorded is: 58 with 36 patients."
## [1] "The year with the lowest number of patients recorded is: 69 with 11 patients."
## [1] "No. of NULL values present: 1"
Data before Normalizing.
##   Age Year_Of_Operation +ve_Auxiliary_Nodes Survival_Status
## 1  30                62                   3               1
## 2  30                65                   0               1
## 3  31                59                   2               1
## 4  31                65                   4               1
## 5  33                58                  10               1
## 6  33                60                   0               1
Scaling the features to a [0,1] range.
Data after normalizing.
##          Age Year_Of_Operation Auxiliary_Nodes Survival_Status
## 1 0.00000000        0.36363636      0.05769231               1
## 2 0.00000000        0.63636364      0.00000000               1
## 3 0.02083333        0.09090909      0.03846154               1
## 4 0.02083333        0.63636364      0.07692308               1
## 5 0.06250000        0.00000000      0.19230769               1
## 6 0.06250000        0.18181818      0.00000000               1
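The scaling step is the usual min-max transform, x' = (x - min) / (max - min). A minimal Python sketch follows, checked against the year and node columns of the rows shown above; the function name `min_max` is an assumption, and the minima and maxima (58 and 69 for the year, 0 and 52 for the nodes) are taken from the ranges reported earlier in the text.

```python
def min_max(x, lo, hi):
    """Scale x from the range [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)

# (Year_Of_Operation, nodes) for the first rows of the table before normalizing.
rows = [(62, 3), (65, 0), (59, 2), (65, 4), (58, 10), (60, 0)]

scaled = [
    (round(min_max(year, 58, 69), 8), round(min_max(nodes, 0, 52), 8))
    for year, nodes in rows
]
for year_scaled, nodes_scaled in scaled:
    print(f"{year_scaled:.8f} {nodes_scaled:.8f}")
```

For example, the year 62 maps to (62 - 58) / (69 - 58) = 4/11, which is the 0.36363636 in the first normalized row.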
Splitting and prepping the train and test data.
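The split ratio is not stated in the text. As a sketch only, a shuffled hold-out split could look like the Python below; the 80/20 ratio, the seed, the placeholder rows, and the name `train_test_split` are all assumptions.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)                  # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder rows standing in for the normalized records.
data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))
```

Shuffling before splitting matters here because the records appear to be sorted by age, so taking the last rows as the test set would hold out only the oldest patients.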
Exploring the architecture of the Artificial Neural Network.
After some experimentation, a neural network with 2 hidden layers of 3 neurons each yielded the lowest error rate.
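To make the shape of such a network concrete (3 inputs, two hidden layers of 3 neurons, 1 output), here is a minimal forward pass in plain Python. It is a sketch under assumptions: sigmoid activations in every layer, a single output unit, and random untrained weights. The names `init_layer` and `forward` are illustrative, and this is not the model that was trained for this report.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """One dense layer: a weight row plus a bias for each output neuron."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

def forward(x, layers):
    """Propagate one input vector through every layer in turn."""
    for weights, biases in layers:
        x = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x

rng = random.Random(0)
sizes = [3, 3, 3, 1]  # 3 inputs, two hidden layers of 3 neurons, 1 output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

# Total trainable parameters: weights plus biases for each layer.
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)

# First normalized row from the table above: age, year, nodes.
output = forward([0.0, 0.36363636, 0.05769231], layers)
print(n_params, output)
```

With these layer sizes the network has 28 trainable parameters (12 + 12 + 4), which is small relative to the 306 instances in the dataset.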
##          Actual
## Predicted  0  1
##         0  5  5
##         1 12 39
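The matrix can be summarized with a few standard metrics. The Python sketch below reads the four cells from the output above (rows are predictions, columns are actual labels) and treats class 1 as the positive class, which is an assumption; the variable names are illustrative.

```python
# Cells of the confusion matrix above: (predicted, actual) -> count.
matrix = {(0, 0): 5, (0, 1): 5, (1, 0): 12, (1, 1): 39}

tp = matrix[(1, 1)]  # predicted 1, actually 1
tn = matrix[(0, 0)]  # predicted 0, actually 0
fp = matrix[(1, 0)]  # predicted 1, actually 0
fn = matrix[(0, 1)]  # predicted 0, actually 1

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # of predicted 1s, how many were right
recall = tp / (tp + fn)         # of actual 1s, how many were found
specificity = tn / (tn + fp)    # of actual 0s, how many were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
```

On these 61 test cases the accuracy is about 72%, but only 5 of the 17 actual class-0 cases are identified, so the model leans heavily toward predicting class 1.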