library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read.csv("C:/Users/DELL/Desktop/habermans/haberman.csv")
head(df, 3)
summary(df)
##       age             year           nodes            status     
##  Min.   :30.00   Min.   :58.00   Min.   : 0.000   Min.   :1.000  
##  1st Qu.:44.00   1st Qu.:60.00   1st Qu.: 0.000   1st Qu.:1.000  
##  Median :52.00   Median :63.00   Median : 1.000   Median :1.000  
##  Mean   :52.36   Mean   :62.87   Mean   : 4.033   Mean   :1.262  
##  3rd Qu.:60.00   3rd Qu.:66.00   3rd Qu.: 4.000   3rd Qu.:2.000  
##  Max.   :78.00   Max.   :69.00   Max.   :52.000   Max.   :2.000
unique(df$nodes)
##  [1]  1  3  0  2  4 10  9 30  7 13  6 15 21 11  5 23  8 20 52 14 19 16 12 24 46
## [26] 18 22 35 17 25 28
table(df$status)
## 
##   1   2 
## 225  80

This dataset gives access to a detailed gist of the age, nodes found and status. The patients are 30 to 78 years old, with an average age of 52.36 Most patients, the data shows, are between 44 and 60 years of age. The year variable, which presumably gives the approximate diagnosis or data collection year ranges from 58 to 69 with a mean of around62.87 years. A distribution where most data is bunched around the middle year. The number of nodes, which is a relevant feature in the dataset ranges from 0 to 52 with average equal to standard mean i. e. ifootnotes . The distribution is highly skewed towards lower node counts, with the majority of patients having 0–4 nodes. The degree of severity or detection is variable and many individuals are found to have very few, if any nodes. The status is a category with two statuses - 1 (275 instances) and 2 (80 instances). This basically indicates that there is an imbalanced dataset since status 1 has more frequency. This imbalance can affect model quality and might need handling when building the models to prevent predicting more frequent class. The dataset encompasses a broad range of ages and node counts but is imbalanced with respect to status distribution that should be taken into consideration when constructing or validating predictors.

# distributions of AGE
ggplot(df, aes(x = age)) +
  geom_histogram(bins = 20, fill = 'blue', color = 'black', alpha = 0.7) +
  labs(title = "Age Distribution", x = "Age", y = "Frequency")

X-Axis (Horizontal Axis) : age intervals or bins. These age data are in ranges of 30-40, 41-50, 51-60 as intervals. Y-Axis (Vertical Axis): Frequency — Number of patients in each age bin. Observations and Explanation Shape of a Normal Curve: The histogram looks like a bell-shaped, chicken-filled curve. This indicates that the patients are, collectively, approximately normally distributed with respect to age; there’s a central peak at which most of the patients’ ages lie. Most Commons Age (50-55 Year): Histogram shows that peak frequency is between 50–55 age group. The Most patients fall between these ages. More succinctly, the vast majority of patients in this dataset are between 50 and 55 years old. Frequency Distribution: Mode: The central peak in the histogram (50 to 55 age range), reveals that this is frequently encountered group of patients. In general, Histograms have two key characteristics that summarize a distribution: how “spread out” the histogram is (to either side of its peak) and whether or not there are data points in all age intervals. Age of patients stratified by the spread. Bars of histogram: each bar represents the palingenesy range (the absolute number or part-checker) Quantity of patients are age binned and the bar heights represent count. If, for example, a bar is taller in the 50-55 range then more patients are of this age compared to other ranges. Distribution Insights: Symmetry: The bell shape of the curve indicates that there is an even distribution around us with fewer patients at younger and older extremes any_means / E + / Getty Images Age in Your Data

# distributions of NODES
ggplot(df, aes(x = nodes)) +
  geom_histogram(bins = 20, fill = 'green', color = 'black', alpha = 0.7) +
  labs(title = "Nodes Distribution", x = "Number of Nodes", y = "Frequency")

X-Axis (Horizontal Axis) : It will show the number of nodes and divided into consecutive intervals or bins such as 0–10, 11–20,21-30 etc. Y axes (Frequency): No. of Patients: For each interval [node count] the number of patients Observations and Explanation Right-Skewed Distribution: A right-skewed distribution of nodes. This translates into a curve in which most of the patients have only few nodes, but there is an extended tail towards many more nodes. High-Frequency at Zero Nodes: It has the highest frequency in zero node interval. It says that most of the patients have no nodes and hence zero — 69 this is our major node count in whole dataset. Histogram: A Histogram depicting the frequency of decreasing as node count increases They are few patient with many for_virgo nodes at first, whereas this provides fewer patients as the number of nodes increases. High Node Counts Are Present: The histogram also shows there are patients with greater than 40 nodes. Although much less frequent, these high node counts do exist in the dataset. Bars on the Histogram: Each bar of histogram represents how many number patients fall under range x in node count The bars show how many patients there are in each node count interval. Distribution Insights: Skewness: The histogram is right-skewed shows most of the patients have less number of nodes, and few with a very large amount. The long tail which is visible at the right side of the data demonstrates that there are less patients with a very high node count. Frequency Distribution: A high frequency of zero nodes as noted in bar-graph, and the decreasing rate for higher node counts signifies that 0 positive lymph nodes are too much concentrated at one place. Higher no of patients is having less count

# distributions of status survival
ggplot(df, aes(x = factor(status), fill = factor(status))) +
  geom_bar() +
  labs(title = "Survival Status Distribution", x = "Status", y = "Count") +
  scale_fill_manual(values = c('green', 'blue'), labels = c("died", "survival"))

X-Axis (Horizontal Axis) - Survival — 0(Dead),1(Alive). 1: Survived after 5 years 2: Died before 5 years Y-Axis (Vertical Axis): The frequency and the number of patients on each categorySurvival_Yes^{-}885+342 =1227Suvirval_No+340166) )=450 Observations and Explanation Survival Status (Two Categories): The X-axis of the histogram contains two different categories Survived (Status = 1): Patients outlied be more than five years. Died (Status = 2): These patients indeed survived less than or equal to 5 years. More Likely to Survive: The histogram indicates most patients have the class label “Survived”. This means a significant number of patients lived 5 + years in the dataset. Survival and Death Frequency: Survived (Status = 1): There is a large spike for the histogram’s bar in this category which means that more number of patients survived after 5 years. Died (Status = 2): The cutoff for this listing is different, so there were fewer subjects who perished before reaching the end of their five-year observation window. Outliers: The outlying cases (conceptually, the single or few survival records with extremely different feature values vis-a-vis the rest of the training set). Outliers could be visualized via the histogram as extra bars or sharp points if any in this instance. Distribution Insights: Pre-dominance of Survival: The pre-domninace in category “Survived” is an indicator that there are more patients (higher probability) who survived less than 5 years as compared to -5 year survival, perhaps a function of lower node counts relatd with better survivial.

# Relationship between age and nodes
ggplot(df, aes(x = age, y = nodes, color = factor(status))) +
  geom_point(alpha = 0.6) +
  labs(title = "Age vs. Nodes by Survival Status", x = "Age", y = "Number of Nodes") +
  scale_color_manual(values = c('green', 'blue'), labels = c("Died", "Survived"))

Scatter Plot Characteristics: Distribution Pattern: A point on scatter plot corresponding to each patient where their age is plotted against the number of nodes. Trending: By observing the plot we can call if there are any apparent trends or clusters. 3. Age v/s Number of Nodes Not a Perceived Relationship: If there is no particular trend or relationship, it might indicate that age has little influence over the number of nodes found. A range of node counts may be both found in patients aged ≤55 and >median age. Possible Patterns: You may observe that younger or older patients have an increased number of nodes. If the trend also was found elsewhere, then it implies there may be a relationship between age and node count. Survival Implications: Fewer Lymph Nodes, Better Survival: With your above conclusion — if zero or very few nodes (in total around 15%) correlate with better survival on the plot you can conclude that fewer number of lymph nodes equals to more likely to survive. Age Influence: A scatter plot of this trend can be visualized that whether there a similar pattern across various age-groups. For example, you may notice that the fewer number of nodes is a higher chances survival than any age group. Conclusion: Survival and Increased number of Nodes: If you have observed that patients with a high counting node tends to shows less survival, your scatter plot confirms it by putting In direct relation the amount of nodes even if we are considering not differentiating on age. Effect of Age : The scatter plot would also help us understand if there is anything like age effects into the relationship or it’s all about the node count that have much relied on survival.

names(df)
## [1] "age"    "year"   "nodes"  "status"
# Adjusting the year column
df$year <- df$year + 1900
df$year <- as.integer(df$year)

In our dataset, the column year was originally recorded as two-digit format that represents years in era of 20th century. Because your dataset goes from 1958 until 1970, the two-digit year values have to be converted into four-digit years so th… This is done by adding a time to the value of some columns in this dataframe, below example will show how you may add 1900 from each number on year column; It means extra two digit format conversion and converting the year values from 2digit to normal four-digit. For example, the value of 64 as a two-digit year would be converted to 1964 by adding in this case 1900. Also, make sure you force the year column to become integers as it should be numeric when doing numerical analysis or operations. Doing this conversion is a way to keep the year data uniformed through computations and visualizations. By making this adjustment, the dataset itself better represents what year each record was in, hence placing data into timeframes more conducive to analysis.

summary(df$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1958    1960    1963    1963    1966    1969

Minimum (Min.): 1958 This is the first year in your dataset This shows that is oldest data point belongs to a year 1958. 1st Quartile (1st Qu.): 1960 This value is also referred to the data-specific 25th percentile. This means that 25% of the data points belong to year 1960 or before. Median: 1963 So the median is the middle value when every year, and all passengers, are lined up in order from least to most. By this I mean the most central year of your data is 1963 and that has similar amounts either side. Mean: 1963 Since the mean year is also 1963, you can see that at least the average time field value of data points lies in 1963. Having this number equals to its median implies a quite symmetric distribution of years. 3rd Quartile (3rd Qu.): 1966 This is the same as 75th percentile. This means that 75% of our data points are older than the year 1966. Maximum (Max.): 1969 The most recent year within your dataset. From this, it is indicated that the most recent year for which data has been made public appears to be 1969.

df$status <- recode(df$status, `1` = 1, `2` = 0)
table(df$status)
## 
##   0   1 
##  80 225

Transformation: The column status was encoded in the first step with values 1 and 2. It was mapped its values into the binary of course, using recode(df\(status, 1=1,2 =0): 1 (survived) remains 1. 2 (died) is mapped to 0. After transformation we distribute: Put simply, the table(df\)status) command gives a look at each unique value in our newly recoded status column + its pecking order. Living (1): 225 — This suggests that 225 people survived more than 5 years. Dead (0): 80–this tells us that no one has died in the first five years

df_grouped <- df %>% group_by(status) %>% summarise(across(everything(), mean))
df_grouped

Age: Survivors(1) : the average of Dead is about 52.02 years apparent Death (0): Average Age 53.31 years approximately Answer: The mean age of those that survived beyond 5 years was only a little younger than the group whose patient did not survive. This could mean a broader trend toward younger patients are more likely to make it through. Year: Survived⠀⠀(1): Rounded off, the average year is about 1962.86 Killed(0): Avg age is around 1962.89 year They have a very similar mean year of diagnosis between the died and survived categories. Meaning that, in general there is not a continued gap between the year of diagnosis for both outcomes. Nodes: Lives (1): Mean rank: ~2.79 nodes Died (0): Node count mean around: 7.53 Answer: Patients who survived tended to have the lower average nodes as compared to those died. This is in imply having more nodes gives a higher probability of death.

df <- df %>% select(-year)
head(df, 5)

Now, Year is dropped from the df and so we have only Age, Nodes & status_columns left. Now I will sum up what each column shows us: age: The age of patients in years at the time of their disease being diagnosed. If a person has detected some positive lymph nodes then it is an indicative of cancer status. The status column labels the survival rime of patients.escaping 1 means Died and leavingWith no breathingAllowed to live half a decadeSurrounded with airAdult Lives past five yearsSuscipicious0MouthbreatherPast bottom5 Years OnwardsBgirlofBelow OneWeedgirlLiving over decadesUnderUTheresthere_restSolelyif not jot for more TimeStayPositiveMoreYearsOldChappyLess than five Authors Coming soon KadzBothGTKadzdawnspam comingsoon was intended22CharsetsGmailInReactAnotherOneGenderBenderBeyond_AngelYa_boi_Gabrielseyeatsingsupport_legaliseHashTagichever_name_to_behonest(rawValue)_BOTHNone Concern please? Finally the dataset is containing only patient’s age,Number of positive nodes and whether he survived or not. This focus facilitates analysis that pertains specifically to these factors without the additional layering-on of year of diagnosis. The latter columns are crucial to know the patient demographic distributions of alive and died based on number nodes detected in exams.

X <- df %>% select(age, nodes)
head(X)

Age: This indicates the age of each patient when their condition is being recorded. Knowledge of the age distribution can provide some information into how or if age factors into cancer survival outcomes, such as having a greater number of positive lymph nodes. Nodes: The positive lymph node count for each patient. Simple cancer prognosis of the number of nodes is very important, because usually if more nodes that mean it has indicated a stage 2 or higher level. The 1st few rows of X dataset present different AGEs and their respective NODES counts amongst the PATIENTS. To give an example, the first four patients are young (in their early thirties) and only have a few nodes. Between all this movement of the dataset you notice variation both in age and node counts. This variability is important for identification of patterns and relationships between patients’ age and the severity of their condition, as measured by being positive or negative at different numberof nodes.

if (!requireNamespace("neuralnet", quietly = TRUE)) {
  install.packages("neuralnet")
}
library(tidyverse)
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.2.3
## 
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
## 
##     compute

loading the required libraries for the further model building

X <- df %>% select(age, nodes)
y <- df$status
data_nn <- cbind(X, status = y)

Feature Selection: It selects the age and nodes columns from the original dataset. These features will be input columns to the model. Target variable(Hive Column status) : It splits lip the column,in simple words it is did patient survived or died. This will be the output or target variable for your Model. First we combine the selected features and target i.e [age,nodes] with status to take a new data frame. Now you can build a machine learning model using this generated new data frame.

any(is.na(data_nn))
## [1] FALSE
str(data_nn)
## 'data.frame':    305 obs. of  3 variables:
##  $ age   : int  30 30 30 31 31 33 33 34 34 34 ...
##  $ nodes : int  1 3 0 2 4 10 0 0 9 30 ...
##  $ status: num  1 1 1 1 1 1 1 0 0 1 ...
data_nn_scaled <- data_nn %>%
  mutate(across(c(age, nodes), scale))
str(data_nn_scaled)
## 'data.frame':    305 obs. of  3 variables:
##  $ age   : num [1:305, 1] -2.09 -2.09 -2.09 -2 -2 ...
##   ..- attr(*, "scaled:center")= num 52.4
##   ..- attr(*, "scaled:scale")= num 10.7
##  $ nodes : num [1:305, 1] -0.42119 -0.14343 -0.56007 -0.28231 -0.00455 ...
##   ..- attr(*, "scaled:center")= num 4.03
##   ..- attr(*, "scaled:scale")= num 7.2
##  $ status: num  1 1 1 1 1 1 1 0 0 1 ...
head(data_nn_scaled)

Scaling: Standardization is used to scale the “age” and “nodes”. This process standardizes these variables by scaling themto a mean of 0 and unit variance. Whereas a value of age= -2.093778 and nodes =-0.4211895 suggests that they are much lower than the mean for their respective columns as given by ( 9:30 ). status Column: Status column is kept as it belongs to Classification output which not requires scaling. Data Overview: Now data_nn_scaled is in standard format for age and nodes columns, which can be used effectively with other machine learning algorithms since many ML models are performance sensitive to feature scaling as well. This transformation is beneficial to reduce weights differences among features so it enhances model performance and convergence by distancing contribution between all feature extraction processes.

if (!requireNamespace("caTools", quietly = TRUE)) {
  install.packages("caTools")
}

library(caTools)
## Warning: package 'caTools' was built under R version 4.2.3
set.seed(42)
split <- sample.split(data_nn_scaled$status, SplitRatio = 0.8)
train_data <- subset(data_nn_scaled, split == TRUE)
test_data <- subset(data_nn_scaled, split == FALSE)
str(train_data)
## 'data.frame':    244 obs. of  3 variables:
##  $ age   : num [1:244, 1] -2.09 -2 -1.81 -1.81 -1.72 ...
##  $ nodes : num [1:244, 1] -0.56007 -0.00455 0.82872 -0.56007 0.68984 ...
##  $ status: num  1 1 1 1 0 1 1 1 1 1 ...
str(test_data)
## 'data.frame':    61 obs. of  3 variables:
##  $ age   : num [1:61, 1] -2.09 -2.09 -2 -1.72 -1.63 ...
##  $ nodes : num [1:61, 1] -0.421 -0.143 -0.282 -0.56 1.245 ...
##  $ status: num  1 1 1 0 1 1 1 1 1 1 ...

Training Data (train_data): [ package : knapsack ] Total 244 Observations (rows). Each observation has (A) three variables age: Age in scaled value-relative of patients. nodes: Number of nodes (Scaled for a convenient representation). status : The status of the patients (1= survived, 0=died) [Its binary form] A trace for the scaling in nodes and age have proceeded as such, that numerical values are normalized with a mean and standard deviation. Testing Data (test_data): Contains 61 observations. Just like the training data, it has column values scaled for age and nodes including status variable. Summary of Data Splitting: Training Set: Bigger part of the data — used to create a model. This split takes most of the data, providing enough examples to train a model. Testing Set: Smallest part of the data when used to evaluate model performance. This way, you can evaluate the extent to which your model generalizes data it has never encountered.

formula <- status ~ age + nodes
set.seed(42)
nn_model <- neuralnet(
  formula,
  data = train_data,
  hidden = c(6, 8),
  linear.output = FALSE,
  stepmax = 2e6
)

if (is.null(nn_model$weights)) {
  stop("Model has not been trained. Check your training process.")
}

plot(nn_model)
predictions <- predict(nn_model, X)
predicted_classes <- ifelse(predictions > 0.5, 1, 0)

accuracy <- mean(predicted_classes == y)
print(glue::glue("Accuracy: {accuracy * 100}%"))
## Accuracy: 75.0819672131148%
for (i in 1:5) {
  print(glue::glue("Input: {paste(X[i, ], collapse = ', ')} => Predicted: {predicted_classes[i]} (Expected: {y[i]})"))
}
## Input: 30, 1 => Predicted: 1 (Expected: 1)
## Input: 30, 3 => Predicted: 1 (Expected: 1)
## Input: 30, 0 => Predicted: 1 (Expected: 1)
## Input: 31, 2 => Predicted: 1 (Expected: 1)
## Input: 31, 4 => Predicted: 1 (Expected: 1)

Training Data (train_data): Data with 244 observations. Every observation has 3 variables: age: age of the patients (0.90) nodes: the number of nodes, scaled. status: Binary (the integer values have been 0-nontrace.value died(presence of intubation included in this)=1, trace.value survived=2) Accuracy: This model estimated the prediction so that they had accuracy 75.08%. That tells you that the model predicted 75.08% of actual values in dataset It is an acceptable accuracy though with some more parameter tuning, or adjusting the model architecture etc., better performance can be achieved using advanced methods. Predictions vs. Truths Input = 30,1 => Prediction: 1 (Expected: 1) Now let’s test the AP system on the whole model and moving both opacity parameter values: [30, 3]. Input: 30, 0 ==> Predicted :1 (Expected:1) 31, 2 => PRED:1(actu_1) Example 31, one vs four => Predicted =1 [Correct: No] These are some samples that the model correctly predicted their status for each case All five cases are classified as the correct status, so it would seem that this model is working correctly for these inputs.

predictions <- predict(nn_model, X)
predicted_classes <- ifelse(predictions > 0.5, 1, 0)

y <- factor(y, levels = c(0, 1))

results <- data.frame(
  Prediction = factor(predicted_classes, levels = c(0, 1)),
  Actual = y
)

conf_matrix_table <- table(Predicted = results$Prediction, Actual = results$Actual)

TP <- conf_matrix_table["1", "1"]  
TN <- conf_matrix_table["0", "0"]  
FP <- conf_matrix_table["1", "0"]  
FN <- conf_matrix_table["0", "1"]  

precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1_score <- 2 * (precision * recall) / (precision + recall)

cat("Confusion Matrix:\n")
## Confusion Matrix:
print(conf_matrix_table)
##          Actual
## Predicted   0   1
##         0   8   4
##         1  72 221
cat("\nPrecision:\n")
## 
## Precision:
print(precision)
## [1] 0.7542662
cat("\nRecall:\n")
## 
## Recall:
print(recall)
## [1] 0.9822222
cat("\nF1 Score:\n")
## 
## F1 Score:
print(f1_score)
## [1] 0.8532819
for (i in 1:5) {
  cat(sprintf("Input: %s => Predicted: %d (Expected: %d)\n", paste(X[i, ], collapse = ', '), predicted_classes[i], y[i]))
}
## Input: 30, 1 => Predicted: 1 (Expected: 2)
## Input: 30, 3 => Predicted: 1 (Expected: 2)
## Input: 30, 0 => Predicted: 1 (Expected: 2)
## Input: 31, 2 => Predicted: 1 (Expected: 2)
## Input: 31, 4 => Predicted: 1 (Expected: 2)

True Negatives (TN): The model correctly predicted died (0) 8 times, and the actual outcome was indeed died (0). False Positives (FP): The model predicted survival (1) 72 times, but the actual outcome was died (0). This high number of false positives suggests that the model is more likely to incorrectly label cases as survival when they are actually died. False Negatives (FN): The model predicted died (0) incorrectly for 4 instances where the actual outcome was survival (1). This indicates missed predictions of actual positive cases. True Positives (TP): The model correctly predicted survival (1) 221 times, and the actual outcome was indeed survival (1). Metrics Precision (0.754): This means that when the model predicted 1 (survival), it was correct about 75.4% of the time. Precision is the proportion of true positive predictions among all cases predicted as positive. Although precision is fairly high, the high number of false positives indicates that the model might be overly optimistic in predicting survival. Recall (0.982): This indicates that the model correctly identified 98.2% of the actual positive cases (survival). Recall measures how many of the actual positive cases were correctly identified by the model. A high recall shows the model is good at identifying true positives, but it comes at the cost of increased false positives. F1 Score (0.853): The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both. A high F1 score of 0.853 indicates that the model performs well overall, taking both precision and recall into account. It shows a good trade-off between the model’s ability to correctly identify positives and its precision in doing so. Predictions The output displays predictions for the first 5 instances. The model predicted 1 (survival) for each instance, but the expected value was 2, which suggests there might be a discrepancy in how the labels are encoded or matched. Ensure that survival and death labels are correctly represented as 0 and 1 respectively to avoid such errors. Additional Insights The high number of false positives compared to false negatives indicates that while the model is good at identifying actual positive cases (high recall), it tends to misclassify more cases as positive when they are actually negative. This results in an increased number of false positives, which could lead to suboptimal performance in practical scenarios where false positives are costly. The increased false positives contribute to higher Mean Squared Error (MSE) and could suggest a need for model adjustment or tuning to reduce these errors and improve overall prediction quality.