This laboratory notebook entry explores the development of predictive models for breast cancer classification using machine learning techniques. The focus is on utilizing Logistic Regression and K-Nearest Neighbors (KNN) algorithms on a publicly available dataset obtained from Kaggle. This dataset consists of patient records containing a unique identifier, diagnosis (malignant or benign), visual characteristics of the cancer, and their average values.
To develop and compare the predictive performance of Logistic Regression and K-Nearest Neighbors (KNN) models in classifying tumors as malignant or benign based on their visual characteristics using the provided dataset.
Before moving forward, we must first load the necessary libraries required for this analysis.
library(dplyr)
library(inspectdf)
library(GGally)
library(gtools)
library(caret)
library(ggplot2)
library(lattice)
library(outliers)
library(cluster)
library(factoextra)
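If any of these packages are not yet available in the local R installation, they can be installed once from CRAN before loading. This is a one-off setup step, not part of the analysis itself; the rsample and class packages used later in this notebook are included as well.
# One-off installation of the packages used in this notebook (run only if needed)
install.packages(c("dplyr", "inspectdf", "GGally", "gtools", "caret",
                   "ggplot2", "lattice", "outliers", "cluster",
                   "factoextra", "rsample", "class"))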
The dataset comprises 33 columns in total: a unique patient identifier (id), the target variable diagnosis, 30 quantitative measurements, and a trailing column (X) that contains no data. The diagnosis feature is a categorical variable indicating the malignancy status of the tumor, with values “M” representing malignant and “B” representing benign.
The remaining 30 features are quantitative measurements of various visual characteristics of the cancer. They are divided into three groups based on the statistic they report:
Mean values: These features capture the average values of the specified characteristic across the entire tumor. They include radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, and fractal_dimension_mean.
Standard error (SE) values: These features represent the standard error of the corresponding mean values. They include radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave points_se, symmetry_se, and fractal_dimension_se.
Worst or largest values: These features represent the worst or largest values of the specified characteristic within the tumor. They include radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave points_worst, symmetry_worst, and fractal_dimension_worst.
Breast Cancer Dataset Attribute Information
| Feature Name | Description |
|---|---|
| id | Unique identifier for each patient |
| diagnosis | Cancer type (M: Malignant, B: Benign) |
| radius_mean | Mean of radii |
| texture_mean | Mean of textures |
| perimeter_mean | Mean of perimeters |
| area_mean | Mean of areas |
| smoothness_mean | Mean of smoothness |
| compactness_mean | Mean of compactness |
| concavity_mean | Mean of concavity |
| concave points_mean | Mean of concave points |
| symmetry_mean | Mean of symmetry |
| fractal_dimension_mean | Mean of fractal dimension |
| radius_se | Standard error of radius |
| texture_se | Standard error of texture |
| perimeter_se | Standard error of perimeter |
| area_se | Standard error of area |
| smoothness_se | Standard error of smoothness |
| compactness_se | Standard error of compactness |
| concavity_se | Standard error of concavity |
| concave points_se | Standard error of concave points |
| symmetry_se | Standard error of symmetry |
| fractal_dimension_se | Standard error of fractal dimension |
| radius_worst | Worst or largest radius |
| texture_worst | Worst or largest texture |
| perimeter_worst | Worst or largest perimeter |
| area_worst | Worst or largest area |
| smoothness_worst | Worst or largest smoothness |
| compactness_worst | Worst or largest compactness |
| concavity_worst | Worst or largest concavity |
| concave points_worst | Worst or largest concave points |
| symmetry_worst | Worst or largest symmetry |
| fractal_dimension_worst | Worst or largest fractal dimension |
Note: Target : “M” (Malignant) = 1 or “B” (Benign) = 0
The first step is to import the dataset using the
read.csv() function.
cancer <- read.csv("data_input/Cancer_Data.csv")
cancer
The subsequent phase involves an exploratory analysis of the imported
dataset. To achieve this, the initial and terminal data points of the
cancer dataset are examined through the application of the
head() and tail() functions respectively.
head(cancer)
tail(cancer)
The data type of each column is first checked with the glimpse() function.
cancer %>%
glimpse()
#> Rows: 569
#> Columns: 33
#> $ id <int> 842302, 842517, 84300903, 84348301, 84358402, …
#> $ diagnosis <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
#> $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
#> $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
#> $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
#> $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
#> $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
#> $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
#> $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
#> $ concave.points_mean <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
#> $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
#> $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
#> $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
#> $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
#> $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
#> $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
#> $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
#> $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
#> $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
#> $ concave.points_se <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
#> $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
#> $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
#> $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
#> $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
#> $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
#> $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
#> $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
#> $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
#> $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
#> $ concave.points_worst <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
#> $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
#> $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…
#> $ X <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
The id and X columns will be excluded from
the dataset. Additionally, the data type of the diagnosis
column will be converted to a factor. Following these data cleaning
steps, a new dataset named cancer_clean will be
created.
cancer_clean <-
cancer %>%
select(-id, -X) %>%
mutate(diagnosis = case_when(diagnosis == "M" ~ 1,
diagnosis == "B" ~ 0)) %>%
mutate(diagnosis = as.factor(diagnosis))
head(cancer_clean)
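For orientation, the three feature groups described earlier can be pulled out with dplyr selection helpers. This is an optional illustration rather than part of the modelling pipeline, and the object names (mean_feats, se_feats, worst_feats) are arbitrary.
# Optional: select each feature group by its column-name suffix (illustrative only)
mean_feats  <- cancer_clean %>% select(diagnosis, ends_with("_mean"))
se_feats    <- cancer_clean %>% select(diagnosis, ends_with("_se"))
worst_feats <- cancer_clean %>% select(diagnosis, ends_with("_worst"))
head(mean_feats)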
Once those steps are completed, it is also important to check the dataset for missing values.
cancer_clean %>%
is.na() %>%
colSums()
#> diagnosis radius_mean texture_mean
#> 0 0 0
#> perimeter_mean area_mean smoothness_mean
#> 0 0 0
#> compactness_mean concavity_mean concave.points_mean
#> 0 0 0
#> symmetry_mean fractal_dimension_mean radius_se
#> 0 0 0
#> texture_se perimeter_se area_se
#> 0 0 0
#> smoothness_se compactness_se concavity_se
#> 0 0 0
#> concave.points_se symmetry_se fractal_dimension_se
#> 0 0 0
#> radius_worst texture_worst perimeter_worst
#> 0 0 0
#> area_worst smoothness_worst compactness_worst
#> 0 0 0
#> concavity_worst concave.points_worst symmetry_worst
#> 0 0 0
#> fractal_dimension_worst
#> 0
Since there are no missing values in this dataset, it is ready to move on to the next stages.
Subsequently, a verification process is conducted to ensure the accuracy of data types across all remaining columns.
cancer_clean %>%
glimpse()
#> Rows: 569
#> Columns: 31
#> $ diagnosis <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
#> $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
#> $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
#> $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
#> $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
#> $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
#> $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
#> $ concave.points_mean <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
#> $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
#> $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
#> $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
#> $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
#> $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
#> $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
#> $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
#> $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
#> $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
#> $ concave.points_se <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
#> $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
#> $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
#> $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
#> $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
#> $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
#> $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
#> $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
#> $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
#> $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
#> $ concave.points_worst <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
#> $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
#> $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…
Having verified the data type consistency for all columns, we can proceed to the data preprocessing stage.
At this stage, we want to know how the categorical and numerical data are distributed. To accomplish this, we use:
- inspect_cat() to view summary values for categorical variables.
- inspect_num() to view summary values for numeric variables.
cancer_clean %>%
inspect_cat()
cancer_clean %>%
inspect_num()
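If a visual overview is preferred, the inspectdf summaries can also be rendered as plots with show_plot(); this is an optional step, assuming the inspectdf package loaded above.
# Optional: plot the categorical and numerical summaries
cancer_clean %>% inspect_cat() %>% show_plot()
cancer_clean %>% inspect_num() %>% show_plot()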
Insights:
To prepare the data for model training and evaluation using Logistic
Regression, we will split the cancer_clean dataset into two
subsets: a training set (train_cancer_lr) and a testing set
(test_cancer_lr). This split will be achieved using the
initial_split(), training() and
testing() functions. We will repeat this process for
building model using KNN.
Next, we will divide the dataset cancer_clean into train
(train_cancer_lr) and test (test_cancer_lr)
datasets, maintaining an 80%:20% ratio using training() and
testing() functions.
library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(123)
# split with proportion 80:20
splitter <- initial_split(data = cancer_clean, prop = 0.8)
# extract to dataframe
train_cancer_lr <- training(splitter)
test_cancer_lr <- testing(splitter)
head(train_cancer_lr)
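As a quick sanity check on the 80%:20% split, the row counts of the two subsets can be inspected; the 569 rows should split into roughly 455 training and 114 testing observations.
# Check the number of rows in the train and test sets
nrow(train_cancer_lr)
nrow(test_cancer_lr)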
This section will replicate the methodological approach employed in the Logistic Regression Model.
Next, we will divide the dataset cancer_clean into train
(train_cancer_knn) and test (test_cancer_knn)
datasets, maintaining an 80%:20% ratio using training() and
testing() functions.
library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(123)
#split with proportion 80:20
splitter <- initial_split(data = cancer_clean, prop = 0.8)
# extract to dataframe
train_cancer_knn <- training(splitter)
test_cancer_knn <- testing(splitter)
head(train_cancer_knn)
It’s important to check the class distribution in
train_cancer_lr$diagnosis before proceeding. This helps
mitigate potential bias in the model.
table(train_cancer_lr$diagnosis) %>%
prop.table()
#>
#> 0 1
#> 0.6263736 0.3736264
Insights : The dataset exhibits class imbalance, with 62.63% of samples classified as Benign (0) and 37.36% as Malignant (1). To address this issue, an upsampling technique will be employed.
Sampling techniques should only be performed on training data
train_cancer_lr. The testing data is treated as new data
for the model.
Upsampling is performed with the upSample() function, which takes the parameters x (predictors), y (target), and yname (name of the target column).
# upsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
train_cancer_lr_up <- upSample(
x = train_cancer_lr %>% select(-diagnosis),
y = train_cancer_lr$diagnosis,
yname = "diagnosis"
)
train_cancer_lr_up
Check the proportion of the target class after upsampling.
# your code here
table(train_cancer_lr_up$diagnosis) %>%
prop.table()
#>
#> 0 1
#> 0.5 0.5
Now, the training data employed for the Logistic Regression model is balanced.
Next, we repeat the process employed in previous section where we
check the class distribution in train_cancer_knn$diagnosis
before proceeding. This helps mitigate potential bias in the model.
# your code here
table(train_cancer_knn$diagnosis) %>%
prop.table()
#>
#> 0 1
#> 0.6263736 0.3736264
Insights : The dataset exhibits class imbalance, with 62.63% of samples classified as Benign (0) and 37.36% as Malignant (1). To address this issue, an upsampling technique will be employed.
Sampling techniques should only be performed on training data
train_cancer_knn. The testing data is treated as new data
for the model.
# upsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
train_cancer_knn_up <- upSample(
x = train_cancer_knn %>% select(-diagnosis),
y = train_cancer_knn$diagnosis,
yname = "diagnosis"
)
head(train_cancer_knn_up)
Check the proportion of the target class after upsampling.
# your code here
table(train_cancer_knn_up$diagnosis) %>%
prop.table()
#>
#> 0 1
#> 0.5 0.5
Now, the training data employed for the K-Nearest Neighbor model is balanced.
This section focuses on building and interpreting machine learning
models to predict whether a breast cancer is malignant or benign based
on its visual characteristics. Based on the cancer_clean
dataset, we’ll explore two common algorithms: Logistic Regression and
K-Nearest Neighbors (KNN). Our goal is to develop a model that can
effectively classify malignant or benign breast cancer based on their
characteristics.
Model Training: we will use the glm() function to build the logistic regression models. Each formula specifies the dependent variable (diagnosis in train_cancer_lr_up) and a set of independent variables whose visual characteristics may influence whether a tumor is malignant or benign.
We will create eight candidate models and choose the most suitable one. Those models are:
- model_cancer_null
- model_cancer_size
- model_cancer_texture
- model_cancer_ft
- model_cancer_all
- model_cancer_backward
- model_cancer_forward
- model_cancer_both

model_cancer_null <- glm(formula = diagnosis ~ 1,
data = train_cancer_lr_up,
family = "binomial")
summary(model_cancer_null)
#>
#> Call:
#> glm(formula = diagnosis ~ 1, family = "binomial", data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.00000 0.08377 0 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.19 on 569 degrees of freedom
#> Residual deviance: 790.19 on 569 degrees of freedom
#> AIC: 792.19
#>
#> Number of Fisher Scoring iterations: 2
The AIC value of the model without predictors is 792.19. We expect the selected model to have the smallest AIC among the models created.
We will create a model by grouping the predictor variables based on
the size and form of the cancer, consisting of radius_mean,
radius_se, radius_worst,
concavity_mean, concavity_se,
concavity_worst. Considerations: Malignant tumors tend to
be larger, have an irregular shape, and have deeper indentations
compared to benign tumors.
model_cancer_size <- glm(formula = diagnosis ~ radius_mean + radius_se + radius_worst +
concavity_mean + concavity_se + concavity_worst,
data = train_cancer_lr_up,
family = "binomial")
summary(model_cancer_size)
#>
#> Call:
#> glm(formula = diagnosis ~ radius_mean + radius_se + radius_worst +
#> concavity_mean + concavity_se + concavity_worst, family = "binomial",
#> data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -20.3358 3.1911 -6.373 0.000000000186 ***
#> radius_mean -1.2246 0.6065 -2.019 0.043461 *
#> radius_se 6.4332 4.0297 1.596 0.110392
#> radius_worst 1.9576 0.6062 3.229 0.001241 **
#> concavity_mean 47.6456 14.2086 3.353 0.000799 ***
#> concavity_se -117.7780 35.1753 -3.348 0.000813 ***
#> concavity_worst 12.0366 4.8181 2.498 0.012482 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.19 on 569 degrees of freedom
#> Residual deviance: 103.44 on 563 degrees of freedom
#> AIC: 117.44
#>
#> Number of Fisher Scoring iterations: 9
Insights :
- radius_mean: The coefficient is negative and significant (-1.2246), meaning that an increase in radius_mean is associated with a decreased risk of a malignant diagnosis.
- radius_worst: Positive and significant coefficient (1.9576), meaning that an increase in radius_worst is associated with an increased risk of a malignant diagnosis.
- concavity_mean: The coefficient is positive and highly significant (47.6456), meaning that an increase in concavity_mean is strongly associated with an increased risk of a malignant diagnosis.
- concavity_se: A negative and highly significant coefficient (-117.7780), meaning that an increase in concavity_se is strongly associated with a decreased risk of a malignant diagnosis.
- concavity_worst: Positive and significant coefficient (12.0366), meaning that an increase in concavity_worst is associated with an increased risk of a malignant diagnosis.
- radius_se has no significant effect on the diagnosis based on this model.

Next, we will create a model by grouping the predictor variables
based on the texture of the cancer, which consists of:
texture_mean, texture_se,
texture_worst. Considerations: Texture changes in breast
tissue can be an indicator of malignancy.
model_cancer_texture <- glm(formula = diagnosis ~ texture_mean + texture_se + texture_worst,
data = train_cancer_lr_up,
family = "binomial")
summary(model_cancer_texture)
#>
#> Call:
#> glm(formula = diagnosis ~ texture_mean + texture_se + texture_worst,
#> family = "binomial", data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.912718 0.557226 -8.816 < 0.0000000000000002 ***
#> texture_mean 0.007088 0.058386 0.121 0.903
#> texture_se -1.608945 0.250503 -6.423 0.000000000134 ***
#> texture_worst 0.253968 0.043405 5.851 0.000000004885 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.19 on 569 degrees of freedom
#> Residual deviance: 613.57 on 566 degrees of freedom
#> AIC: 621.57
#>
#> Number of Fisher Scoring iterations: 4
Insights :
The model has overall significance because the p-value is very small (p < 0.001), indicating that the model with all predictor variables is better than the model without predictor variables.
Significant predictor variables:
- texture_se: The coefficient is negative and significant (-1.608945), meaning that an increase in texture_se is associated with a decreased risk of a malignant diagnosis.
- texture_worst: Positive and significant coefficient (0.253968), meaning that an increase in texture_worst is associated with an increased risk of a malignant diagnosis.

Conclusion: The variables texture_worst and texture_se have an influence in predicting a malignant diagnosis, while texture_mean has no significant relationship. We are still looking for a better model.
Next, we will combine variables from the texture and form consisting
of texture_mean, texture_se,
texture_worst, concavity_mean,
concavity_se, concavity_worst.
model_cancer_ft <- glm(formula = diagnosis ~ texture_mean + texture_se + texture_worst + concavity_mean + concavity_se + concavity_worst,
data = train_cancer_lr_up,
family = "binomial")
summary(model_cancer_ft)
#>
#> Call:
#> glm(formula = diagnosis ~ texture_mean + texture_se + texture_worst +
#> concavity_mean + concavity_se + concavity_worst, family = "binomial",
#> data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -10.725505 1.444946 -7.423 0.0000000000001147 ***
#> texture_mean 0.003295 0.133168 0.025 0.98026
#> texture_se -2.055629 0.880958 -2.333 0.01963 *
#> texture_worst 0.328971 0.126296 2.605 0.00919 **
#> concavity_mean 119.185826 15.727713 7.578 0.0000000000000351 ***
#> concavity_se -171.550973 33.598204 -5.106 0.0000003291252522 ***
#> concavity_worst 0.150997 4.037077 0.037 0.97016
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.19 on 569 degrees of freedom
#> Residual deviance: 170.75 on 563 degrees of freedom
#> AIC: 184.75
#>
#> Number of Fisher Scoring iterations: 8
Insights :
The model has overall significance because the p-value is very small (p < 0.001), indicating that the model with all predictor variables is better than the model without predictor variables.
Significant predictor variables:
- texture_se: The coefficient is negative and significant (-2.055629), meaning that an increase in texture_se is associated with a decreased risk of a malignant diagnosis.
- texture_worst: Positive and significant coefficient (0.328971), meaning that an increase in texture_worst is associated with an increased risk of a malignant diagnosis.
- concavity_mean: Positive and highly significant coefficient (119.185826), meaning that an increase in concavity_mean is strongly associated with an increased risk of a malignant diagnosis.
- concavity_se: The coefficient is negative and highly significant (-171.550973), meaning that an increase in concavity_se is strongly associated with a decreased risk of a malignant diagnosis.

The remaining predictors, texture_mean and concavity_worst, have no significant influence on the diagnosis in this model. The AIC value of 184.75 is still relatively high.
General conclusion: Texture and concavity related variables have a role in cancer diagnosis prediction, but not all variables have the same influence. We are still looking for a good model with a small AIC value.
Next, we will build a model by including all the predictors.
model_cancer_all <- glm(formula = diagnosis ~ .,
data = train_cancer_lr_up,
family = "binomial")
summary(model_cancer_all)
#>
#> Call:
#> glm(formula = diagnosis ~ ., family = "binomial", data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -5534.0214 415179.0900 -0.013 0.989
#> radius_mean -1726.4092 80607.6070 -0.021 0.983
#> texture_mean 39.9573 2771.7389 0.014 0.988
#> perimeter_mean 258.3541 12926.2153 0.020 0.984
#> area_mean -1.2668 254.9928 -0.005 0.996
#> smoothness_mean -1239.0607 1256026.3101 -0.001 0.999
#> compactness_mean -17223.1279 718699.9907 -0.024 0.981
#> concavity_mean -258.4364 467064.3929 -0.001 1.000
#> concave.points_mean 10829.8845 852057.1122 0.013 0.990
#> symmetry_mean -2140.4349 226834.3975 -0.009 0.992
#> fractal_dimension_mean 30267.9394 2159131.7255 0.014 0.989
#> radius_se 3456.7452 264257.0376 0.013 0.990
#> texture_se 93.2084 14854.8186 0.006 0.995
#> perimeter_se -322.0273 32469.7600 -0.010 0.992
#> area_se -5.9391 597.7267 -0.010 0.992
#> smoothness_se -87436.6566 6470752.5575 -0.014 0.989
#> compactness_se 28449.0523 1337456.1292 0.021 0.983
#> concavity_se -10847.3337 574820.2061 -0.019 0.985
#> concave.points_se 52595.3744 3112311.6391 0.017 0.987
#> symmetry_se 2304.9277 821856.6103 0.003 0.998
#> fractal_dimension_se -237651.3818 14001657.2950 -0.017 0.986
#> radius_worst 126.7463 37241.4028 0.003 0.997
#> texture_worst -2.7761 2467.4273 -0.001 0.999
#> perimeter_worst 10.4596 5395.9051 0.002 0.998
#> area_worst 0.4787 299.4577 0.002 0.999
#> smoothness_worst 9609.8878 763310.2716 0.013 0.990
#> compactness_worst -2383.1934 200871.8576 -0.012 0.991
#> concavity_worst 2257.5377 220280.4545 0.010 0.992
#> concave.points_worst 72.3320 306852.9661 0.000 1.000
#> symmetry_worst 1807.1142 160275.6184 0.011 0.991
#> fractal_dimension_worst 10522.4000 1109609.7573 0.009 0.992
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.187785838 on 569 degrees of freedom
#> Residual deviance: 0.000004021 on 539 degrees of freedom
#> AIC: 62
#>
#> Number of Fisher Scoring iterations: 25
Insights: none of the coefficients in this full model are statistically significant even though the residual deviance is nearly zero and the AIC is low, so the model cannot be interpreted reliably.
We are still looking for the best model, so we will build additional models with stepwise regression. Three methods are performed:
1. Backward
2. Forward
3. Both
We use the model_cancer_all model which includes all
variables as predictors. The stepwise regression process uses the
step() function, by filling in some parameters:
model_cancer_all as the object, and “backward”
as the direction.
model_cancer_backward <- step(object = model_cancer_all,
direction = "backward",
trace = F)
summary(model_cancer_backward)
#>
#> Call:
#> glm(formula = diagnosis ~ radius_mean + texture_mean + perimeter_mean +
#> compactness_mean + concave.points_mean + fractal_dimension_mean +
#> radius_se + texture_se + perimeter_se + smoothness_se + compactness_se +
#> concavity_se + concave.points_se + fractal_dimension_se +
#> radius_worst + smoothness_worst + concavity_worst + symmetry_worst +
#> fractal_dimension_worst, family = "binomial", data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -15809.9 189960.0 -0.083 0.934
#> radius_mean -5936.7 84668.0 -0.070 0.944
#> texture_mean 105.2 1265.1 0.083 0.934
#> perimeter_mean 823.5 11541.9 0.071 0.943
#> compactness_mean -65002.9 830302.0 -0.078 0.938
#> concave.points_mean 42427.4 629829.7 0.067 0.946
#> fractal_dimension_mean 95704.9 1232295.2 0.078 0.938
#> radius_se 5731.0 73660.9 0.078 0.938
#> texture_se 641.0 10289.6 0.062 0.950
#> perimeter_se -676.4 10435.4 -0.065 0.948
#> smoothness_se -214910.7 2961870.2 -0.073 0.942
#> compactness_se 57413.4 739058.2 0.078 0.938
#> concavity_se -21701.3 297834.7 -0.073 0.942
#> concave.points_se 178343.8 2529163.5 0.071 0.944
#> fractal_dimension_se -679825.9 8307868.2 -0.082 0.935
#> radius_worst 810.9 11655.3 0.070 0.945
#> smoothness_worst 22475.4 296416.6 0.076 0.940
#> concavity_worst 4114.6 53542.6 0.077 0.939
#> symmetry_worst 4272.1 58374.2 0.073 0.942
#> fractal_dimension_worst 25969.5 411187.4 0.063 0.950
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.187785838 on 569 degrees of freedom
#> Residual deviance: 0.000028319 on 550 degrees of freedom
#> AIC: 40
#>
#> Number of Fisher Scoring iterations: 25
Insights and conclusion:
1. This model cannot be used for prediction or inference, as it has no reliable predictive power.
2. The insignificant results may be due to high multicollinearity between the predictor variables (a quick way to check this is sketched below).
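One way to check the multicollinearity suspicion raised in point 2 is to look for pairs of predictors with very high correlation. The sketch below uses base R on the upsampled training data; the 0.9 cutoff is an arbitrary choice for illustration.
# Find pairs of predictors with absolute correlation above 0.9 (a rough multicollinearity check)
pred_cor <- cor(train_cancer_lr_up %>% select(-diagnosis))
high_idx <- which(abs(pred_cor) > 0.9 & upper.tri(pred_cor), arr.ind = TRUE)
data.frame(var1 = rownames(pred_cor)[high_idx[, 1]],
           var2 = colnames(pred_cor)[high_idx[, 2]],
           corr = round(pred_cor[high_idx], 3))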
The stepwise regression process uses the step()
function, by filling in some parameters: model_cancer_null
as the object, and “forward” as the direction. For the
Forward Selection method, we need to define the scope
parameter to indicate the maximum upper limit of predictor combinations
with model_cancer_all.
model_cancer_forward <- step(object = model_cancer_null,
direction = "forward",
scope = list(upper= model_cancer_all),
trace=F)
summary(model_cancer_forward)
#>
#> Call:
#> glm(formula = diagnosis ~ perimeter_worst + smoothness_worst +
#> texture_worst + symmetry_worst + concave.points_worst + area_worst +
#> radius_mean + concave.points_mean + compactness_mean, family = "binomial",
#> data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -16.27649 9.53874 -1.706 0.08794 .
#> perimeter_worst 0.12727 0.16734 0.761 0.44693
#> smoothness_worst 30.50015 31.63750 0.964 0.33502
#> texture_worst 0.43109 0.09932 4.340 0.0000142 ***
#> symmetry_worst 21.61792 10.03283 2.155 0.03118 *
#> concave.points_worst 38.61845 26.27892 1.470 0.14168
#> area_worst 0.04172 0.01434 2.910 0.00362 **
#> radius_mean -4.17580 1.30565 -3.198 0.00138 **
#> concave.points_mean 213.04153 67.10194 3.175 0.00150 **
#> compactness_mean -76.90162 28.84636 -2.666 0.00768 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.188 on 569 degrees of freedom
#> Residual deviance: 48.495 on 560 degrees of freedom
#> AIC: 68.495
#>
#> Number of Fisher Scoring iterations: 11
Insights :
- texture_worst has a significant coefficient (very low p-value: 0.0000142), indicating a strong influence on the diagnosis result.
- symmetry_worst is also significant (p-value: 0.03118).
- area_worst, radius_mean, compactness_mean, and concave.points_mean also have a significant influence (p-value < 0.05).
- Variables that are not significant: perimeter_worst, smoothness_worst, and concave.points_worst (p-value > 0.05).
The intercept value (coefficient for (Intercept)) is
-16.27649.
The AIC (Akaike Information Criterion) value is 68.495. The lower the AIC value, the better the model.
The stepwise regression process uses the step()
function, by filling in some parameters: model_cancer_null
as the object, and “both” as the direction. For the Both
Selection method, we need to define the scope parameter to
indicate the maximum upper limit of predictor combinations with
model_cancer_all.
model_cancer_both <- step(object = model_cancer_null,
direction = "both",
scope = list(upper= model_cancer_all),
trace=F)
summary(model_cancer_both)
#>
#> Call:
#> glm(formula = diagnosis ~ texture_worst + symmetry_worst + concave.points_worst +
#> area_worst + radius_mean + concave.points_mean + compactness_mean,
#> family = "binomial", data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -9.41898 6.33055 -1.488 0.136787
#> texture_worst 0.42359 0.09534 4.443 0.00000887 ***
#> symmetry_worst 23.56378 10.09028 2.335 0.019528 *
#> concave.points_worst 43.34106 24.40428 1.776 0.075739 .
#> area_worst 0.04650 0.01150 4.045 0.00005236 ***
#> radius_mean -3.82827 1.10560 -3.463 0.000535 ***
#> concave.points_mean 208.37648 62.55820 3.331 0.000866 ***
#> compactness_mean -66.39121 26.48332 -2.507 0.012179 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.188 on 569 degrees of freedom
#> Residual deviance: 50.218 on 562 degrees of freedom
#> AIC: 66.218
#>
#> Number of Fisher Scoring iterations: 11
Insights:
- texture_worst has a significant coefficient (very low p-value: 0.00000887), indicating a strong influence on the diagnosis result.
- symmetry_worst is also significant (p-value: 0.019528).
- area_worst, radius_mean, and concave.points_mean also have a significant influence (p-value < 0.05).
- compactness_mean is also significant (p-value: 0.012179).
- The intercept value (coefficient for (Intercept)) is -9.41898.
The AIC (Akaike Information Criterion) value is 66.218. The lower the AIC value, the better the model.
Conclusion: the stepwise model built with the “both” direction, model_cancer_both, will be used to make predictions.
Next, we will work with model_cancer_both. In a logistic regression model, the coefficients (Estimate) are on the log-odds scale, and the inverse logit converts log-odds into probabilities. We use the inv.logit() function from the gtools library to obtain probabilities between 0 and 1.
# Converting the log of odds value into probability
# texture_worst
inv.logit(0.42359)
#> [1] 0.604342
# symmetry_worst
inv.logit(23.56378)
#> [1] 1
# area_worst
inv.logit(0.04650)
#> [1] 0.5116229
# radius_mean
inv.logit(-3.82827)
#> [1] 0.02128433
# concave.points_mean
inv.logit(208.37648)
#> [1] 1
# compactness_mean
inv.logit(-66.39121)
#> [1] 0.0000000000000000000000000000146779
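For reference, inv.logit() is simply the standard logistic function 1 / (1 + exp(-x)); base R provides the same transformation as plogis(), so the values above can be reproduced without gtools.
# The inverse logit computed by hand and with base R's plogis()
1 / (1 + exp(-0.42359))   # texture_worst, matches inv.logit(0.42359)
plogis(0.42359)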
Insights :
- texture_worst, symmetry_worst, and concave.points_mean appear to be the most influential variables in predicting the probability of a tumor being malignant.
- Some variables have a positive relationship with the likelihood of a tumor being malignant (inverse-logit value close to 1), while others, such as radius_mean and compactness_mean, have a negative relationship (inverse-logit value close to 0).
A series of models were constructed, encompassing:
- model_cancer_null
- model_cancer_size
- model_cancer_texture
- model_cancer_ft
- model_cancer_all
- model_cancer_backward
- model_cancer_forward
- model_cancer_both

Upon completion of model development, a selection process will be initiated to identify a model characterized by the following criteria:
There are two primary types of deviance in this context: the null deviance and the residual deviance.
model_cancer_null$deviance
#> [1] 790.1878
model_cancer_size$deviance
#> [1] 103.4436
model_cancer_texture$deviance
#> [1] 613.5696
model_cancer_ft$deviance
#> [1] 170.7509
model_cancer_all$deviance
#> [1] 0.000004020951
model_cancer_backward$deviance
#> [1] 0.00002831909
model_cancer_forward$deviance
#> [1] 48.49527
model_cancer_both$deviance
#> [1] 50.2181
Residual deviance values:
- model_cancer_null = 790.1878
- model_cancer_size = 103.4436
- model_cancer_texture = 613.5696
- model_cancer_ft = 170.7509
- model_cancer_all = 0.000004020951
- model_cancer_backward = 0.00002831909
- model_cancer_forward = 48.49527
- model_cancer_both = 50.2181

Insights:
The models model_cancer_forward (48.49527) and model_cancer_both (50.2181) have low residual deviance. The residual deviance of model_cancer_all and model_cancer_backward is even lower, but all of their predictor coefficients are insignificant.
AIC describes the amount of information lost from a model. The smaller the AIC value, the less information is lost.
# aic
model_cancer_null$aic
#> [1] 792.1878
model_cancer_size$aic
#> [1] 117.4436
model_cancer_texture$aic
#> [1] 621.5696
model_cancer_ft$aic
#> [1] 184.7509
model_cancer_all$aic
#> [1] 62
model_cancer_backward$aic
#> [1] 40.00003
model_cancer_forward$aic
#> [1] 68.49527
model_cancer_both$aic
#> [1] 66.2181
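For convenience, the deviance and AIC figures shown above can also be collected into a single comparison table. This is a small helper sketch using the model objects already created; deviance() and AIC() are base R extractors.
# Collect residual deviance and AIC of every candidate model in one data frame
model_list <- list(null     = model_cancer_null,
                   size     = model_cancer_size,
                   texture  = model_cancer_texture,
                   ft       = model_cancer_ft,
                   all      = model_cancer_all,
                   backward = model_cancer_backward,
                   forward  = model_cancer_forward,
                   both     = model_cancer_both)
data.frame(model    = names(model_list),
           deviance = sapply(model_list, deviance),
           AIC      = sapply(model_list, AIC))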
Insights :
Among the models whose predictor coefficients are significant (see the discussion above), model_cancer_both has the smallest AIC. Conclusion: based on deviance and AIC, the model_cancer_both model is chosen.
We’ll use the knn() function from the class package to build a KNN model. This model classifies new observations based on their similarity (distance) to observations in the training set that are labeled as malignant or benign.
The predictor data will be scaled using z-score standardization. The test data should also be scaled using parameters from the train data (since it assumes the test data is unseen data).
library(dplyr)
# For predictor
cancer_train_x <- train_cancer_knn_up %>% select(-diagnosis)
cancer_test_x <- test_cancer_knn %>% select(-diagnosis)
# For Target
cancer_train_y <- train_cancer_knn_up$diagnosis
cancer_test_y <- test_cancer_knn$diagnosis
The scale() function takes several parameters:
- x: the object to be scaled
- center: the mean values (taken from the centering values of the scaled cancer_train_x data)
- scale: the standard deviation values (taken from the scaling values of the scaled cancer_train_x data)
# Scaling data
# Data Train
cancer_train_x_sc <- scale(cancer_train_x)
# Data Test
cancer_test_x_sc <- scale(cancer_test_x,
center = attr(cancer_train_x_sc, "scaled:center"),
scale = attr(cancer_train_x_sc, "scaled:scale"))
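As an optional sanity check, the scaled training predictors should now have means of approximately 0 and standard deviations of approximately 1 (the test set will deviate slightly, since it was scaled with the training parameters).
# Verify the z-score standardization of the training predictors
summary(colMeans(cancer_train_x_sc))       # all means ~ 0
summary(apply(cancer_train_x_sc, 2, sd))   # all standard deviations ~ 1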
Next, we will make predictions: first with the selected logistic regression model (model_cancer_both), and afterwards with KNN using the scaled train (cancer_train_x_sc) and test (cancer_test_x_sc) data.
Syntax: predict(object, newdata, type)
- object: the model used for prediction
- newdata: test data / unseen data / new data
- type: the type of predicted value to return

The type parameter can take the following values:
- link: returns the log of odds
- response: returns probabilities

We use type = "response" because we need the probability of each class: “M” (Malignant) = 1 or “B” (Benign) = 0.
predict(object = model_cancer_both,
newdata = head(test_cancer_lr),
type = "response")
#> 1 2 3 4 5 6
#> 0.9999533 0.9994761 0.9999997 0.9999806 1.0000000 0.9999991
Convert the predicted probabilities into class labels with the ifelse() function:
- test: the condition being evaluated (predicted probability > 0.5)
- yes: 1 (Malignant)
- no: 0 (Benign)
# Convert odds into prediction labels
pred_cancer_lr <- predict(object = model_cancer_both,
newdata = head(test_cancer_lr),
type = "response")
ifelse(pred_cancer_lr > 0.5, yes = 1, no = 0)
#> 1 2 3 4 5 6
#> 1 1 1 1 1 1
Predict the probability diagnosis for the
test_cancer_lr data and save it into a new column named
pred in the test data.
test_cancer_lr$pred <- predict(object = model_cancer_both,
newdata = test_cancer_lr,
type = "response")
test_cancer_lr
Classify the test_cancer_lr data based on
pred and save it in a new column named
pred_label.
# ifelse(condition, value_if_true, value_if_false)
test_cancer_lr$pred_label <- ifelse(test_cancer_lr$pred > 0.5, yes = 1, no = 0)
test_cancer_lr
Here are the prediction results and the actual values:
test_cancer_lr %>%
select(diagnosis,
pred,
pred_label) %>%
head(10)
The knn() function does not build a model object; it predicts the test data directly from the training data. Load the class library and fill in the parameters of the knn() function:
- train: training predictors, already scaled, numeric type
- test: test predictors, scaled with the training parameters, numeric type
- cl: actual (categorical) labels (target) of the training data
- k: the chosen value of k (a common heuristic: the square root of the number of rows)
sqrt(nrow(train_cancer_knn_up))
#> [1] 23.87467
Optimum k value = 23 (an odd k is chosen because the number of target classes is 2: 1 and 0).
library(class)
cancer_knn <- knn(train = cancer_train_x_sc,
test = cancer_test_x_sc,
cl = cancer_train_y,
k = 23)
head(cancer_knn)
#> [1] 1 1 1 1 1 1
#> Levels: 0 1
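As an optional alternative to the square-root heuristic, k could also be tuned with cross-validation, for example via caret’s train() function. This is a sketch rather than part of the original workflow, and the grid of odd k values is an arbitrary choice.
# Optional: tune k with 5-fold cross-validation on the scaled, upsampled training data
set.seed(123)
knn_cv <- train(x = as.data.frame(cancer_train_x_sc),
                y = cancer_train_y,
                method = "knn",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = data.frame(k = seq(3, 35, by = 2)))
knn_cv$bestTune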
# Save into the data test
test_cancer_knn$pred_label <- cancer_knn
head(test_cancer_knn)
Here are the prediction results (pred_label) and the
actual values (diagnosis):
test_cancer_knn %>%
select(diagnosis,
pred_label) %>%
head(10)
Syntax: confusionMatrix(data, reference, positive)
- data: predicted labels (factor)
- reference: actual labels (factor)
- positive: name of the positive class
# confusion matrix
library(caret)
confusionMatrix(data = as.factor(test_cancer_lr$pred_label),
reference = test_cancer_lr$diagnosis,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 72 2
#> 1 0 40
#>
#> Accuracy : 0.9825
#> 95% CI : (0.9381, 0.9979)
#> No Information Rate : 0.6316
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.9619
#>
#> Mcnemar's Test P-Value : 0.4795
#>
#> Sensitivity : 0.9524
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9730
#> Prevalence : 0.3684
#> Detection Rate : 0.3509
#> Detection Prevalence : 0.3509
#> Balanced Accuracy : 0.9762
#>
#> 'Positive' Class : 1
#>
Insights:
Accuracy is the proportion of correct predictions out of all predictions; here it is about 98.25%.
Sensitivity (recall) is the proportion of actual positive cases that are correctly predicted (Malignant predicted as Malignant); here it is about 95.24%. In a medical context, correctly identifying Malignant cases is very important, so sensitivity is a highly relevant metric.
Positive predictive value (precision) is the proportion of positive predictions that are correct (of all Malignant predictions, how many are actually Malignant). A value of 100% is excellent, meaning that all Malignant predictions are correct.
Conclusion: the Logistic Regression model performed very well in predicting the “M” (Malignant) and “B” (Benign) classes.
Syntax: confusionMatrix(data, reference, positive)
- data: predicted labels (factor)
- reference: actual labels (factor)
- positive: name of the positive class
# confusion matrix KNN
library(caret)
confusionMatrix(data = test_cancer_knn$pred_label,
reference = test_cancer_knn$diagnosis,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 72 1
#> 1 0 41
#>
#> Accuracy : 0.9912
#> 95% CI : (0.9521, 0.9998)
#> No Information Rate : 0.6316
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.9811
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9762
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9863
#> Prevalence : 0.3684
#> Detection Rate : 0.3596
#> Detection Prevalence : 0.3596
#> Balanced Accuracy : 0.9881
#>
#> 'Positive' Class : 1
#>
Insights:
Accuracy is the proportion of correct predictions out of all predictions; here it is about 99.12%.
Sensitivity (recall) is the proportion of actual positive cases that are correctly predicted (Malignant predicted as Malignant); here it is about 97.62%. In a medical context, correctly identifying Malignant cases is very important, so sensitivity is a highly relevant metric.
Positive predictive value (precision) is the proportion of positive predictions that are correct (of all Malignant predictions, how many are actually Malignant). A value of 100% is excellent, meaning that all Malignant predictions are correct.
Conclusion: the KNN model performed very well in predicting the “M” (Malignant) and “B” (Benign) classes.
A comparative analysis of Logistic Regression and K-Nearest Neighbors (KNN) models for predicting cancer malignancy reveals comparable performance. Both algorithms demonstrated high accuracy in classifying tumors as malignant or benign. Based on the evaluation metrics, the KNN model exhibited slightly superior accuracy (99.12%) compared to Logistic Regression (98.25%). Similarly, the KNN model achieved a higher sensitivity (97.62%) in identifying malignant cases, whereas Logistic Regression attained a sensitivity of 95.24%. Both models demonstrated perfect positive predictive value (100%).
Considering the primary objective of accurately identifying malignant cases, sensitivity, or recall, emerges as the most critical evaluation metric for this study.
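For reference, this metric can also be read directly off the confusion matrices above, since recall = TP / (TP + FN) for the positive (Malignant) class.
# Sensitivity (recall) recomputed from the confusion-matrix counts shown above
recall_lr  <- 40 / (40 + 2)   # Logistic Regression: 40 true positives, 2 false negatives
recall_knn <- 41 / (41 + 1)   # KNN: 41 true positives, 1 false negative
c(logistic = recall_lr, knn = recall_knn)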