library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read.csv("C:/Users/DELL/Desktop/habermans/haberman.csv")
head(df, 3)
summary(df)
## age year nodes status
## Min. :30.00 Min. :58.00 Min. : 0.000 Min. :1.000
## 1st Qu.:44.00 1st Qu.:60.00 1st Qu.: 0.000 1st Qu.:1.000
## Median :52.00 Median :63.00 Median : 1.000 Median :1.000
## Mean :52.36 Mean :62.87 Mean : 4.033 Mean :1.262
## 3rd Qu.:60.00 3rd Qu.:66.00 3rd Qu.: 4.000 3rd Qu.:2.000
## Max. :78.00 Max. :69.00 Max. :52.000 Max. :2.000
unique(df$nodes)
## [1] 1 3 0 2 4 10 9 30 7 13 6 15 21 11 5 23 8 20 52 14 19 16 12 24 46
## [26] 18 22 35 17 25 28
table(df$status)
##
## 1 2
## 225 80
This dataset gives access to a detailed gist of the age, nodes found and status. The patients are 30 to 78 years old, with an average age of 52.36 Most patients, the data shows, are between 44 and 60 years of age. The year variable, which presumably gives the approximate diagnosis or data collection year ranges from 58 to 69 with a mean of around62.87 years. A distribution where most data is bunched around the middle year. The number of nodes, which is a relevant feature in the dataset ranges from 0 to 52 with average equal to standard mean i. e. ifootnotes . The distribution is highly skewed towards lower node counts, with the majority of patients having 0–4 nodes. The degree of severity or detection is variable and many individuals are found to have very few, if any nodes. The status is a category with two statuses - 1 (275 instances) and 2 (80 instances). This basically indicates that there is an imbalanced dataset since status 1 has more frequency. This imbalance can affect model quality and might need handling when building the models to prevent predicting more frequent class. The dataset encompasses a broad range of ages and node counts but is imbalanced with respect to status distribution that should be taken into consideration when constructing or validating predictors.
# distributions of AGE
ggplot(df, aes(x = age)) +
geom_histogram(bins = 20, fill = 'blue', color = 'black', alpha = 0.7) +
labs(title = "Age Distribution", x = "Age", y = "Frequency")
X-Axis (Horizontal Axis) : age intervals or bins. These age data are in
ranges of 30-40, 41-50, 51-60 as intervals. Y-Axis (Vertical Axis):
Frequency — Number of patients in each age bin. Observations and
Explanation Shape of a Normal Curve: The histogram looks like a
bell-shaped, chicken-filled curve. This indicates that the patients are,
collectively, approximately normally distributed with respect to age;
there’s a central peak at which most of the patients’ ages lie. Most
Commons Age (50-55 Year): Histogram shows that peak frequency is between
50–55 age group. The Most patients fall between these ages. More
succinctly, the vast majority of patients in this dataset are between 50
and 55 years old. Frequency Distribution: Mode: The central peak in the
histogram (50 to 55 age range), reveals that this is frequently
encountered group of patients. In general, Histograms have two key
characteristics that summarize a distribution: how “spread out” the
histogram is (to either side of its peak) and whether or not there are
data points in all age intervals. Age of patients stratified by the
spread. Bars of histogram: each bar represents the palingenesy range
(the absolute number or part-checker) Quantity of patients are age
binned and the bar heights represent count. If, for example, a bar is
taller in the 50-55 range then more patients are of this age compared to
other ranges. Distribution Insights: Symmetry: The bell shape of the
curve indicates that there is an even distribution around us with fewer
patients at younger and older extremes any_means / E + / Getty Images
Age in Your Data
# distributions of NODES
ggplot(df, aes(x = nodes)) +
geom_histogram(bins = 20, fill = 'green', color = 'black', alpha = 0.7) +
labs(title = "Nodes Distribution", x = "Number of Nodes", y = "Frequency")
X-Axis (Horizontal Axis) : It will show the number of nodes and divided
into consecutive intervals or bins such as 0–10, 11–20,21-30 etc. Y axes
(Frequency): No. of Patients: For each interval [node count] the number
of patients Observations and Explanation Right-Skewed Distribution: A
right-skewed distribution of nodes. This translates into a curve in
which most of the patients have only few nodes, but there is an extended
tail towards many more nodes. High-Frequency at Zero Nodes: It has the
highest frequency in zero node interval. It says that most of the
patients have no nodes and hence zero — 69 this is our major node count
in whole dataset. Histogram: A Histogram depicting the frequency of
decreasing as node count increases They are few patient with many
for_virgo nodes at first, whereas this provides fewer patients as the
number of nodes increases. High Node Counts Are Present: The histogram
also shows there are patients with greater than 40 nodes. Although much
less frequent, these high node counts do exist in the dataset. Bars on
the Histogram: Each bar of histogram represents how many number patients
fall under range x in node count The bars show how many patients there
are in each node count interval. Distribution Insights: Skewness: The
histogram is right-skewed shows most of the patients have less number of
nodes, and few with a very large amount. The long tail which is visible
at the right side of the data demonstrates that there are less patients
with a very high node count. Frequency Distribution: A high frequency of
zero nodes as noted in bar-graph, and the decreasing rate for higher
node counts signifies that 0 positive lymph nodes are too much
concentrated at one place. Higher no of patients is having less
count
# distributions of status survival
ggplot(df, aes(x = factor(status), fill = factor(status))) +
geom_bar() +
labs(title = "Survival Status Distribution", x = "Status", y = "Count") +
scale_fill_manual(values = c('green', 'blue'), labels = c("died", "survival"))
X-Axis (Horizontal Axis) - Survival — 0(Dead),1(Alive). 1: Survived
after 5 years 2: Died before 5 years Y-Axis (Vertical Axis): The
frequency and the number of patients on each
categorySurvival_Yes^{-}885+342 =1227Suvirval_No+340166) )=450
Observations and Explanation Survival Status (Two Categories): The
X-axis of the histogram contains two different categories Survived
(Status = 1): Patients outlied be more than five years. Died (Status =
2): These patients indeed survived less than or equal to 5 years. More
Likely to Survive: The histogram indicates most patients have the class
label “Survived”. This means a significant number of patients lived 5 +
years in the dataset. Survival and Death Frequency: Survived (Status =
1): There is a large spike for the histogram’s bar in this category
which means that more number of patients survived after 5 years. Died
(Status = 2): The cutoff for this listing is different, so there were
fewer subjects who perished before reaching the end of their five-year
observation window. Outliers: The outlying cases (conceptually, the
single or few survival records with extremely different feature values
vis-a-vis the rest of the training set). Outliers could be visualized
via the histogram as extra bars or sharp points if any in this instance.
Distribution Insights: Pre-dominance of Survival: The pre-domninace in
category “Survived” is an indicator that there are more patients (higher
probability) who survived less than 5 years as compared to -5 year
survival, perhaps a function of lower node counts relatd with better
survivial.
# Relationship between age and nodes
ggplot(df, aes(x = age, y = nodes, color = factor(status))) +
geom_point(alpha = 0.6) +
labs(title = "Age vs. Nodes by Survival Status", x = "Age", y = "Number of Nodes") +
scale_color_manual(values = c('green', 'blue'), labels = c("Died", "Survived"))
Scatter Plot Characteristics: Distribution Pattern: A point on scatter
plot corresponding to each patient where their age is plotted against
the number of nodes. Trending: By observing the plot we can call if
there are any apparent trends or clusters. 3. Age v/s Number of Nodes
Not a Perceived Relationship: If there is no particular trend or
relationship, it might indicate that age has little influence over the
number of nodes found. A range of node counts may be both found in
patients aged ≤55 and >median age. Possible Patterns: You may observe
that younger or older patients have an increased number of nodes. If the
trend also was found elsewhere, then it implies there may be a
relationship between age and node count. Survival Implications: Fewer
Lymph Nodes, Better Survival: With your above conclusion — if zero or
very few nodes (in total around 15%) correlate with better survival on
the plot you can conclude that fewer number of lymph nodes equals to
more likely to survive. Age Influence: A scatter plot of this trend can
be visualized that whether there a similar pattern across various
age-groups. For example, you may notice that the fewer number of nodes
is a higher chances survival than any age group. Conclusion: Survival
and Increased number of Nodes: If you have observed that patients with a
high counting node tends to shows less survival, your scatter plot
confirms it by putting In direct relation the amount of nodes even if we
are considering not differentiating on age. Effect of Age : The scatter
plot would also help us understand if there is anything like age effects
into the relationship or it’s all about the node count that have much
relied on survival.
names(df)
## [1] "age" "year" "nodes" "status"
# Adjusting the year column
df$year <- df$year + 1900
df$year <- as.integer(df$year)
In our dataset, the column year was originally recorded as two-digit format that represents years in era of 20th century. Because your dataset goes from 1958 until 1970, the two-digit year values have to be converted into four-digit years so th… This is done by adding a time to the value of some columns in this dataframe, below example will show how you may add 1900 from each number on year column; It means extra two digit format conversion and converting the year values from 2digit to normal four-digit. For example, the value of 64 as a two-digit year would be converted to 1964 by adding in this case 1900. Also, make sure you force the year column to become integers as it should be numeric when doing numerical analysis or operations. Doing this conversion is a way to keep the year data uniformed through computations and visualizations. By making this adjustment, the dataset itself better represents what year each record was in, hence placing data into timeframes more conducive to analysis.
summary(df$year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1958 1960 1963 1963 1966 1969
Minimum (Min.): 1958 This is the first year in your dataset This shows that is oldest data point belongs to a year 1958. 1st Quartile (1st Qu.): 1960 This value is also referred to the data-specific 25th percentile. This means that 25% of the data points belong to year 1960 or before. Median: 1963 So the median is the middle value when every year, and all passengers, are lined up in order from least to most. By this I mean the most central year of your data is 1963 and that has similar amounts either side. Mean: 1963 Since the mean year is also 1963, you can see that at least the average time field value of data points lies in 1963. Having this number equals to its median implies a quite symmetric distribution of years. 3rd Quartile (3rd Qu.): 1966 This is the same as 75th percentile. This means that 75% of our data points are older than the year 1966. Maximum (Max.): 1969 The most recent year within your dataset. From this, it is indicated that the most recent year for which data has been made public appears to be 1969.
df$status <- recode(df$status, `1` = 1, `2` = 0)
table(df$status)
##
## 0 1
## 80 225
Transformation: The column status was encoded in the first step with values 1 and 2. It was mapped its values into the binary of course, using recode(df\(status, 1=1,2 =0): 1 (survived) remains 1. 2 (died) is mapped to 0. After transformation we distribute: Put simply, the table(df\)status) command gives a look at each unique value in our newly recoded status column + its pecking order. Living (1): 225 — This suggests that 225 people survived more than 5 years. Dead (0): 80–this tells us that no one has died in the first five years
df_grouped <- df %>% group_by(status) %>% summarise(across(everything(), mean))
df_grouped
Age: Survivors(1) : the average of Dead is about 52.02 years apparent Death (0): Average Age 53.31 years approximately Answer: The mean age of those that survived beyond 5 years was only a little younger than the group whose patient did not survive. This could mean a broader trend toward younger patients are more likely to make it through. Year: Survived⠀⠀(1): Rounded off, the average year is about 1962.86 Killed(0): Avg age is around 1962.89 year They have a very similar mean year of diagnosis between the died and survived categories. Meaning that, in general there is not a continued gap between the year of diagnosis for both outcomes. Nodes: Lives (1): Mean rank: ~2.79 nodes Died (0): Node count mean around: 7.53 Answer: Patients who survived tended to have the lower average nodes as compared to those died. This is in imply having more nodes gives a higher probability of death.
df <- df %>% select(-year)
head(df, 5)
Now, Year is dropped from the df and so we have only Age, Nodes & status_columns left. Now I will sum up what each column shows us: age: The age of patients in years at the time of their disease being diagnosed. If a person has detected some positive lymph nodes then it is an indicative of cancer status. The status column labels the survival rime of patients.escaping 1 means Died and leavingWith no breathingAllowed to live half a decadeSurrounded with airAdult Lives past five yearsSuscipicious0MouthbreatherPast bottom5 Years OnwardsBgirlofBelow OneWeedgirlLiving over decadesUnderUTheresthere_restSolelyif not jot for more TimeStayPositiveMoreYearsOldChappyLess than five Authors Coming soon KadzBothGTKadzdawnspam comingsoon was intended22CharsetsGmailInReactAnotherOneGenderBenderBeyond_AngelYa_boi_Gabrielseyeatsingsupport_legaliseHashTagichever_name_to_behonest(rawValue)_BOTHNone Concern please? Finally the dataset is containing only patient’s age,Number of positive nodes and whether he survived or not. This focus facilitates analysis that pertains specifically to these factors without the additional layering-on of year of diagnosis. The latter columns are crucial to know the patient demographic distributions of alive and died based on number nodes detected in exams.
X <- df %>% select(age, nodes)
head(X)
Age: This indicates the age of each patient when their condition is being recorded. Knowledge of the age distribution can provide some information into how or if age factors into cancer survival outcomes, such as having a greater number of positive lymph nodes. Nodes: The positive lymph node count for each patient. Simple cancer prognosis of the number of nodes is very important, because usually if more nodes that mean it has indicated a stage 2 or higher level. The 1st few rows of X dataset present different AGEs and their respective NODES counts amongst the PATIENTS. To give an example, the first four patients are young (in their early thirties) and only have a few nodes. Between all this movement of the dataset you notice variation both in age and node counts. This variability is important for identification of patterns and relationships between patients’ age and the severity of their condition, as measured by being positive or negative at different numberof nodes.
if (!requireNamespace("neuralnet", quietly = TRUE)) {
install.packages("neuralnet")
}
library(tidyverse)
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.2.3
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
loading the required libraries for the further model building
X <- df %>% select(age, nodes)
y <- df$status
data_nn <- cbind(X, status = y)
Feature Selection: It selects the age and nodes columns from the original dataset. These features will be input columns to the model. Target variable(Hive Column status) : It splits lip the column,in simple words it is did patient survived or died. This will be the output or target variable for your Model. First we combine the selected features and target i.e [age,nodes] with status to take a new data frame. Now you can build a machine learning model using this generated new data frame.
any(is.na(data_nn))
## [1] FALSE
str(data_nn)
## 'data.frame': 305 obs. of 3 variables:
## $ age : int 30 30 30 31 31 33 33 34 34 34 ...
## $ nodes : int 1 3 0 2 4 10 0 0 9 30 ...
## $ status: num 1 1 1 1 1 1 1 0 0 1 ...
data_nn_scaled <- data_nn %>%
mutate(across(c(age, nodes), scale))
str(data_nn_scaled)
## 'data.frame': 305 obs. of 3 variables:
## $ age : num [1:305, 1] -2.09 -2.09 -2.09 -2 -2 ...
## ..- attr(*, "scaled:center")= num 52.4
## ..- attr(*, "scaled:scale")= num 10.7
## $ nodes : num [1:305, 1] -0.42119 -0.14343 -0.56007 -0.28231 -0.00455 ...
## ..- attr(*, "scaled:center")= num 4.03
## ..- attr(*, "scaled:scale")= num 7.2
## $ status: num 1 1 1 1 1 1 1 0 0 1 ...
head(data_nn_scaled)
Scaling: Standardization is used to scale the “age” and “nodes”. This process standardizes these variables by scaling themto a mean of 0 and unit variance. Whereas a value of age= -2.093778 and nodes =-0.4211895 suggests that they are much lower than the mean for their respective columns as given by ( 9:30 ). status Column: Status column is kept as it belongs to Classification output which not requires scaling. Data Overview: Now data_nn_scaled is in standard format for age and nodes columns, which can be used effectively with other machine learning algorithms since many ML models are performance sensitive to feature scaling as well. This transformation is beneficial to reduce weights differences among features so it enhances model performance and convergence by distancing contribution between all feature extraction processes.
if (!requireNamespace("caTools", quietly = TRUE)) {
install.packages("caTools")
}
library(caTools)
## Warning: package 'caTools' was built under R version 4.2.3
set.seed(42)
split <- sample.split(data_nn_scaled$status, SplitRatio = 0.8)
train_data <- subset(data_nn_scaled, split == TRUE)
test_data <- subset(data_nn_scaled, split == FALSE)
str(train_data)
## 'data.frame': 244 obs. of 3 variables:
## $ age : num [1:244, 1] -2.09 -2 -1.81 -1.81 -1.72 ...
## $ nodes : num [1:244, 1] -0.56007 -0.00455 0.82872 -0.56007 0.68984 ...
## $ status: num 1 1 1 1 0 1 1 1 1 1 ...
str(test_data)
## 'data.frame': 61 obs. of 3 variables:
## $ age : num [1:61, 1] -2.09 -2.09 -2 -1.72 -1.63 ...
## $ nodes : num [1:61, 1] -0.421 -0.143 -0.282 -0.56 1.245 ...
## $ status: num 1 1 1 0 1 1 1 1 1 1 ...
Training Data (train_data): [ package : knapsack ] Total 244 Observations (rows). Each observation has (A) three variables age: Age in scaled value-relative of patients. nodes: Number of nodes (Scaled for a convenient representation). status : The status of the patients (1= survived, 0=died) [Its binary form] A trace for the scaling in nodes and age have proceeded as such, that numerical values are normalized with a mean and standard deviation. Testing Data (test_data): Contains 61 observations. Just like the training data, it has column values scaled for age and nodes including status variable. Summary of Data Splitting: Training Set: Bigger part of the data — used to create a model. This split takes most of the data, providing enough examples to train a model. Testing Set: Smallest part of the data when used to evaluate model performance. This way, you can evaluate the extent to which your model generalizes data it has never encountered.
formula <- status ~ age + nodes
set.seed(42)
nn_model <- neuralnet(
formula,
data = train_data,
hidden = c(6, 8),
linear.output = FALSE,
stepmax = 2e6
)
if (is.null(nn_model$weights)) {
stop("Model has not been trained. Check your training process.")
}
plot(nn_model)
predictions <- predict(nn_model, X)
predicted_classes <- ifelse(predictions > 0.5, 1, 0)
accuracy <- mean(predicted_classes == y)
print(glue::glue("Accuracy: {accuracy * 100}%"))
## Accuracy: 75.0819672131148%
for (i in 1:5) {
print(glue::glue("Input: {paste(X[i, ], collapse = ', ')} => Predicted: {predicted_classes[i]} (Expected: {y[i]})"))
}
## Input: 30, 1 => Predicted: 1 (Expected: 1)
## Input: 30, 3 => Predicted: 1 (Expected: 1)
## Input: 30, 0 => Predicted: 1 (Expected: 1)
## Input: 31, 2 => Predicted: 1 (Expected: 1)
## Input: 31, 4 => Predicted: 1 (Expected: 1)
Training Data (train_data): Data with 244 observations. Every observation has 3 variables: age: age of the patients (0.90) nodes: the number of nodes, scaled. status: Binary (the integer values have been 0-nontrace.value died(presence of intubation included in this)=1, trace.value survived=2) Accuracy: This model estimated the prediction so that they had accuracy 75.08%. That tells you that the model predicted 75.08% of actual values in dataset It is an acceptable accuracy though with some more parameter tuning, or adjusting the model architecture etc., better performance can be achieved using advanced methods. Predictions vs. Truths Input = 30,1 => Prediction: 1 (Expected: 1) Now let’s test the AP system on the whole model and moving both opacity parameter values: [30, 3]. Input: 30, 0 ==> Predicted :1 (Expected:1) 31, 2 => PRED:1(actu_1) Example 31, one vs four => Predicted =1 [Correct: No] These are some samples that the model correctly predicted their status for each case All five cases are classified as the correct status, so it would seem that this model is working correctly for these inputs.
predictions <- predict(nn_model, X)
predicted_classes <- ifelse(predictions > 0.5, 1, 0)
y <- factor(y, levels = c(0, 1))
results <- data.frame(
Prediction = factor(predicted_classes, levels = c(0, 1)),
Actual = y
)
conf_matrix_table <- table(Predicted = results$Prediction, Actual = results$Actual)
TP <- conf_matrix_table["1", "1"]
TN <- conf_matrix_table["0", "0"]
FP <- conf_matrix_table["1", "0"]
FN <- conf_matrix_table["0", "1"]
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1_score <- 2 * (precision * recall) / (precision + recall)
cat("Confusion Matrix:\n")
## Confusion Matrix:
print(conf_matrix_table)
## Actual
## Predicted 0 1
## 0 8 4
## 1 72 221
cat("\nPrecision:\n")
##
## Precision:
print(precision)
## [1] 0.7542662
cat("\nRecall:\n")
##
## Recall:
print(recall)
## [1] 0.9822222
cat("\nF1 Score:\n")
##
## F1 Score:
print(f1_score)
## [1] 0.8532819
for (i in 1:5) {
cat(sprintf("Input: %s => Predicted: %d (Expected: %d)\n", paste(X[i, ], collapse = ', '), predicted_classes[i], y[i]))
}
## Input: 30, 1 => Predicted: 1 (Expected: 2)
## Input: 30, 3 => Predicted: 1 (Expected: 2)
## Input: 30, 0 => Predicted: 1 (Expected: 2)
## Input: 31, 2 => Predicted: 1 (Expected: 2)
## Input: 31, 4 => Predicted: 1 (Expected: 2)
True Negatives (TN): The model correctly predicted died (0) 8 times, and the actual outcome was indeed died (0). False Positives (FP): The model predicted survival (1) 72 times, but the actual outcome was died (0). This high number of false positives suggests that the model is more likely to incorrectly label cases as survival when they are actually died. False Negatives (FN): The model predicted died (0) incorrectly for 4 instances where the actual outcome was survival (1). This indicates missed predictions of actual positive cases. True Positives (TP): The model correctly predicted survival (1) 221 times, and the actual outcome was indeed survival (1). Metrics Precision (0.754): This means that when the model predicted 1 (survival), it was correct about 75.4% of the time. Precision is the proportion of true positive predictions among all cases predicted as positive. Although precision is fairly high, the high number of false positives indicates that the model might be overly optimistic in predicting survival. Recall (0.982): This indicates that the model correctly identified 98.2% of the actual positive cases (survival). Recall measures how many of the actual positive cases were correctly identified by the model. A high recall shows the model is good at identifying true positives, but it comes at the cost of increased false positives. F1 Score (0.853): The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both. A high F1 score of 0.853 indicates that the model performs well overall, taking both precision and recall into account. It shows a good trade-off between the model’s ability to correctly identify positives and its precision in doing so. Predictions The output displays predictions for the first 5 instances. The model predicted 1 (survival) for each instance, but the expected value was 2, which suggests there might be a discrepancy in how the labels are encoded or matched. Ensure that survival and death labels are correctly represented as 0 and 1 respectively to avoid such errors. Additional Insights The high number of false positives compared to false negatives indicates that while the model is good at identifying actual positive cases (high recall), it tends to misclassify more cases as positive when they are actually negative. This results in an increased number of false positives, which could lead to suboptimal performance in practical scenarios where false positives are costly. The increased false positives contribute to higher Mean Squared Error (MSE) and could suggest a need for model adjustment or tuning to reduce these errors and improve overall prediction quality.