This laboratory notebook entry explores the development of predictive models for breast cancer classification using machine learning techniques. The focus is on utilizing Logistic Regression and K-Nearest Neighbors (KNN) algorithms on a publicly available dataset obtained from Kaggle. This dataset consists of patient records containing a unique identifier, diagnosis (malignant or benign), visual characteristics of the cancer, and their average values.
To develop and compare the predictive performance of Logistic Regression and K-Nearest Neighbors (KNN) models in classifying tumors as malignant or benign based on their visual characteristics using the provided dataset.
Before moving forward, we must first load the necessary libraries required for this analysis.
library(dplyr)
library(inspectdf)
library(GGally)
library(gtools)
library(caret)
library(ggplot2)
library(lattice)
library(outliers)
library(cluster)
library(factoextra)
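If any of these packages are not yet available in the local R installation, they can be installed once from CRAN before loading. This is a one-off setup step, not part of the analysis itself; the rsample and class packages used later in this notebook are included as well.
# One-off installation of the packages used in this notebook (run only if needed)
install.packages(c("dplyr", "inspectdf", "GGally", "gtools", "caret",
                   "ggplot2", "lattice", "outliers", "cluster",
                   "factoextra", "rsample", "class"))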
The dataset comprises 33 columns in total: a unique patient identifier (id), the target variable diagnosis, 30 quantitative measurements, and a trailing column (X) that contains no data. The diagnosis feature is a categorical variable indicating the malignancy status of the tumor, with values “M” representing malignant and “B” representing benign.
The remaining 30 features are quantitative measurements of various visual characteristics of the cancer. They are divided into three groups based on the statistic they report:
Mean values: These features capture the average values of the specified characteristic across the entire tumor. They include radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, and fractal_dimension_mean.
Standard error (SE) values: These features represent the standard error of the corresponding mean values. They include radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave points_se, symmetry_se, and fractal_dimension_se.
Worst or largest values: These features represent the worst or largest values of the specified characteristic within the tumor. They include radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave points_worst, symmetry_worst, and fractal_dimension_worst.
Breast Cancer Dataset Attribute Information
| Feature Name | Description |
|---|---|
| id | Unique identifier for each patient |
| diagnosis | Cancer type (M: Malignant, B: Benign) |
| radius_mean | Mean of radii |
| texture_mean | Mean of textures |
| perimeter_mean | Mean of perimeters |
| area_mean | Mean of areas |
| smoothness_mean | Mean of smoothness |
| compactness_mean | Mean of compactness |
| concavity_mean | Mean of concavity |
| concave points_mean | Mean of concave points |
| symmetry_mean | Mean of symmetry |
| fractal_dimension_mean | Mean of fractal dimension |
| radius_se | Standard error of radius |
| texture_se | Standard error of texture |
| perimeter_se | Standard error of perimeter |
| area_se | Standard error of area |
| smoothness_se | Standard error of smoothness |
| compactness_se | Standard error of compactness |
| concavity_se | Standard error of concavity |
| concave points_se | Standard error of concave points |
| symmetry_se | Standard error of symmetry |
| fractal_dimension_se | Standard error of fractal dimension |
| radius_worst | Worst or largest radius |
| texture_worst | Worst or largest texture |
| perimeter_worst | Worst or largest perimeter |
| area_worst | Worst or largest area |
| smoothness_worst | Worst or largest smoothness |
| compactness_worst | Worst or largest compactness |
| concavity_worst | Worst or largest concavity |
| concave points_worst | Worst or largest concave points |
| symmetry_worst | Worst or largest symmetry |
| fractal_dimension_worst | Worst or largest fractal dimension |
Note: Target : “M” (Malignant) = 1 or “B” (Benign) = 0
The first step is to import the dataset using the
read.csv() function.
cancer <- read.csv("data_input/Cancer_Data.csv")
cancer
The subsequent phase involves an exploratory analysis of the imported
dataset. To achieve this, the initial and terminal data points of the
cancer dataset are examined through the application of the
head() and tail() functions respectively.
head(cancer)
tail(cancer)
The data type of each column is first checked with the glimpse() function.
cancer %>%
glimpse()
#> Rows: 569
#> Columns: 33
#> $ id <int> 842302, 842517, 84300903, 84348301, 84358402, …
#> $ diagnosis <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
#> $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
#> $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
#> $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
#> $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
#> $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
#> $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
#> $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
#> $ concave.points_mean <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
#> $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
#> $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
#> $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
#> $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
#> $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
#> $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
#> $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
#> $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
#> $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
#> $ concave.points_se <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
#> $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
#> $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
#> $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
#> $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
#> $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
#> $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
#> $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
#> $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
#> $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
#> $ concave.points_worst <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
#> $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
#> $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…
#> $ X <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
The id and X columns will be excluded from
the dataset. Additionally, the data type of the diagnosis
column will be converted to a factor. Following these data cleaning
steps, a new dataset named cancer_clean will be
created.
cancer_clean <-
cancer %>%
select(-id, -X) %>%
mutate(diagnosis = case_when(diagnosis == "M" ~ 1,
diagnosis == "B" ~ 0)) %>%
mutate(diagnosis = as.factor(diagnosis))
head(cancer_clean)
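For orientation, the three feature groups described earlier can be pulled out with dplyr selection helpers. This is an optional illustration rather than part of the modelling pipeline, and the object names (mean_feats, se_feats, worst_feats) are arbitrary.
# Optional: select each feature group by its column-name suffix (illustrative only)
mean_feats  <- cancer_clean %>% select(diagnosis, ends_with("_mean"))
se_feats    <- cancer_clean %>% select(diagnosis, ends_with("_se"))
worst_feats <- cancer_clean %>% select(diagnosis, ends_with("_worst"))
head(mean_feats)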
Once those steps are completed, it is also important to check the dataset for missing values.
cancer_clean %>%
is.na() %>%
colSums()
#> diagnosis radius_mean texture_mean
#> 0 0 0
#> perimeter_mean area_mean smoothness_mean
#> 0 0 0
#> compactness_mean concavity_mean concave.points_mean
#> 0 0 0
#> symmetry_mean fractal_dimension_mean radius_se
#> 0 0 0
#> texture_se perimeter_se area_se
#> 0 0 0
#> smoothness_se compactness_se concavity_se
#> 0 0 0
#> concave.points_se symmetry_se fractal_dimension_se
#> 0 0 0
#> radius_worst texture_worst perimeter_worst
#> 0 0 0
#> area_worst smoothness_worst compactness_worst
#> 0 0 0
#> concavity_worst concave.points_worst symmetry_worst
#> 0 0 0
#> fractal_dimension_worst
#> 0
Since there are no missing values in this dataset, it is ready to move on to the next stages.
Subsequently, a verification process is conducted to ensure the accuracy of data types across all remaining columns.
cancer_clean %>%
glimpse()
#> Rows: 569
#> Columns: 31
#> $ diagnosis <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
#> $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
#> $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
#> $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
#> $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
#> $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
#> $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
#> $ concave.points_mean <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
#> $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
#> $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
#> $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
#> $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
#> $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
#> $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
#> $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
#> $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
#> $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
#> $ concave.points_se <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
#> $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
#> $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
#> $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
#> $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
#> $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
#> $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
#> $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
#> $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
#> $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
#> $ concave.points_worst <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
#> $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
#> $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…
Having verified the data type consistency for all columns, we can proceed to the data preprocessing stage.
At this stage, we want to know how the categorical and numerical data are distributed. To accomplish this, we use:
- inspect_cat() to view summary values for categorical variables.
- inspect_num() to view summary values for numeric variables.
cancer_clean %>%
inspect_cat()
cancer_clean %>%
inspect_num()
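If a visual overview is preferred, the inspectdf summaries can also be rendered as plots with show_plot(); this is an optional step, assuming the inspectdf package loaded above.
# Optional: plot the categorical and numerical summaries
cancer_clean %>% inspect_cat() %>% show_plot()
cancer_clean %>% inspect_num() %>% show_plot()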
Insights:
To prepare the data for model training and evaluation using Logistic
Regression, we will split the cancer_clean dataset into two
subsets: a training set (train_cancer_lr) and a testing set
(test_cancer_lr). This split will be achieved using the
initial_split(), training() and
testing() functions. We will repeat this process for
building model using KNN.
Next, we will divide the dataset cancer_clean into train
(train_cancer_lr) and test (test_cancer_lr)
datasets, maintaining an 80%:20% ratio using training() and
testing() functions.
library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(123)
# split with proportion 80:20
splitter <- initial_split(data = cancer_clean, prop = 0.8)
# extract to dataframe
train_cancer_lr <- training(splitter)
test_cancer_lr <- testing(splitter)
head(train_cancer_lr)
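As a quick sanity check on the 80%:20% split, the row counts of the two subsets can be inspected; the 569 rows should split into roughly 455 training and 114 testing observations.
# Check the number of rows in the train and test sets
nrow(train_cancer_lr)
nrow(test_cancer_lr)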
This section will replicate the methodological approach employed in the Logistic Regression Model.
Next, we will divide the dataset cancer_clean into train
(train_cancer_knn) and test (test_cancer_knn)
datasets, maintaining an 80%:20% ratio using training() and
testing() functions.
library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(123)
#split with proportion 80:20
splitter <- initial_split(data = cancer_clean, prop = 0.8)
# extract to dataframe
train_cancer_knn <- training(splitter)
test_cancer_knn <- testing(splitter)
head(train_cancer_knn)
It’s important to check the class distribution in
train_cancer_lr$diagnosis before proceeding. This helps
mitigate potential bias in the model.
table(train_cancer_lr$diagnosis) %>%
prop.table()
#>
#> 0 1
#> 0.6263736 0.3736264
Insights : The dataset exhibits class imbalance, with 62.63% of samples classified as Benign (0) and 37.36% as Malignant (1). To address this issue, an upsampling technique will be employed.
Sampling techniques should only be performed on training data
train_cancer_lr. The testing data is treated as new data
for the model.
Upsampling is performed with the upSample() function, which takes the parameters x (predictors), y (target), and yname (name of the target column).
# upsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
train_cancer_lr_up <- upSample(
x = train_cancer_lr %>% select(-diagnosis),
y = train_cancer_lr$diagnosis,
yname = "diagnosis"
)
train_cancer_lr_up
Check the proportion of the target class after upsampling.
# your code here
table(train_cancer_lr_up$diagnosis) %>%
prop.table()
#>
#> 0 1
#> 0.5 0.5
Now, the training data employed for the Logistic Regression model is balanced.
Next, we repeat the process employed in previous section where we
check the class distribution in train_cancer_knn$diagnosis
before proceeding. This helps mitigate potential bias in the model.
# your code here
table(train_cancer_knn$diagnosis) %>%
prop.table()
#>
#> 0 1
#> 0.6263736 0.3736264
Insights : The dataset exhibits class imbalance, with 62.63% of samples classified as Benign (0) and 37.36% as Malignant (1). To address this issue, an upsampling technique will be employed.
Sampling techniques should only be performed on training data
train_cancer_knn. The testing data is treated as new data
for the model.
# upsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
train_cancer_knn_up <- upSample(
x = train_cancer_knn %>% select(-diagnosis),
y = train_cancer_knn$diagnosis,
yname = "diagnosis"
)
head(train_cancer_knn_up)
Check the proportion of the target class after upsampling.
# your code here
table(train_cancer_knn_up$diagnosis) %>%
prop.table()
#>
#> 0 1
#> 0.5 0.5
Now, the training data employed for the K-Nearest Neighbor model is balanced.
This section focuses on building and interpreting machine learning
models to predict whether a breast cancer is malignant or benign based
on its visual characteristics. Based on the cancer_clean
dataset, we’ll explore two common algorithms: Logistic Regression and
K-Nearest Neighbors (KNN). Our goal is to develop a model that can
effectively classify malignant or benign breast cancer based on their
characteristics.
Model Training: we will use the glm() function to build the logistic regression models. Each formula specifies the dependent variable (diagnosis in train_cancer_lr_up) and a set of independent variables whose visual characteristics may influence whether a tumor is malignant or benign.
We will create eight candidate models and choose the most suitable one. Those models are:
- model_cancer_null
- model_cancer_size
- model_cancer_texture
- model_cancer_ft
- model_cancer_all
- model_cancer_backward
- model_cancer_forward
- model_cancer_both

model_cancer_null <- glm(formula = diagnosis ~ 1,
data = train_cancer_lr_up,
family = "binomial")
summary(model_cancer_null)
#>
#> Call:
#> glm(formula = diagnosis ~ 1, family = "binomial", data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.00000 0.08377 0 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.19 on 569 degrees of freedom
#> Residual deviance: 790.19 on 569 degrees of freedom
#> AIC: 792.19
#>
#> Number of Fisher Scoring iterations: 2
The AIC value of the model without predictors is 792.19. We expect the selected model to have the smallest AIC among the models created.
We will create a model by grouping the predictor variables based on
the size and form of the cancer, consisting of radius_mean,
radius_se, radius_worst,
concavity_mean, concavity_se,
concavity_worst. Considerations: Malignant tumors tend to
be larger, have an irregular shape, and have deeper indentations
compared to benign tumors.
model_cancer_size <- glm(formula = diagnosis ~ radius_mean + radius_se + radius_worst +
concavity_mean + concavity_se + concavity_worst,
data = train_cancer_lr_up,
family = "binomial")
summary(model_cancer_size)
#>
#> Call:
#> glm(formula = diagnosis ~ radius_mean + radius_se + radius_worst +
#> concavity_mean + concavity_se + concavity_worst, family = "binomial",
#> data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -20.3358 3.1911 -6.373 0.000000000186 ***
#> radius_mean -1.2246 0.6065 -2.019 0.043461 *
#> radius_se 6.4332 4.0297 1.596 0.110392
#> radius_worst 1.9576 0.6062 3.229 0.001241 **
#> concavity_mean 47.6456 14.2086 3.353 0.000799 ***
#> concavity_se -117.7780 35.1753 -3.348 0.000813 ***
#> concavity_worst 12.0366 4.8181 2.498 0.012482 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.19 on 569 degrees of freedom
#> Residual deviance: 103.44 on 563 degrees of freedom
#> AIC: 117.44
#>
#> Number of Fisher Scoring iterations: 9
Insights :
- radius_mean: The coefficient is negative and significant (-1.2246), meaning that an increase in radius_mean is associated with a decreased risk of a malignant diagnosis.
- radius_worst: Positive and significant coefficient (1.9576), meaning that an increase in radius_worst is associated with an increased risk of a malignant diagnosis.
- concavity_mean: The coefficient is positive and highly significant (47.6456), meaning that an increase in concavity_mean is strongly associated with an increased risk of a malignant diagnosis.
- concavity_se: A negative and highly significant coefficient (-117.7780), meaning that an increase in concavity_se is strongly associated with a decreased risk of a malignant diagnosis.
- concavity_worst: Positive and significant coefficient (12.0366), meaning that an increase in concavity_worst is associated with an increased risk of a malignant diagnosis.
- radius_se has no significant effect on the diagnosis based on this model.

Next, we will create a model by grouping the predictor variables
based on the texture of the cancer, which consists of:
texture_mean, texture_se,
texture_worst. Considerations: Texture changes in breast
tissue can be an indicator of malignancy.
model_cancer_texture <- glm(formula = diagnosis ~ texture_mean + texture_se + texture_worst,
data = train_cancer_lr_up,
family = "binomial")
summary(model_cancer_texture)
#>
#> Call:
#> glm(formula = diagnosis ~ texture_mean + texture_se + texture_worst,
#> family = "binomial", data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.912718 0.557226 -8.816 < 0.0000000000000002 ***
#> texture_mean 0.007088 0.058386 0.121 0.903
#> texture_se -1.608945 0.250503 -6.423 0.000000000134 ***
#> texture_worst 0.253968 0.043405 5.851 0.000000004885 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.19 on 569 degrees of freedom
#> Residual deviance: 613.57 on 566 degrees of freedom
#> AIC: 621.57
#>
#> Number of Fisher Scoring iterations: 4
Insights :
The model has overall significance because the p-value is very small (p < 0.001), indicating that the model with all predictor variables is better than the model without predictor variables.
Significant predictor variables:
- texture_se: The coefficient is negative and significant (-1.608945), meaning that an increase in texture_se is associated with a decreased risk of a malignant diagnosis.
- texture_worst: Positive and significant coefficient (0.253968), meaning that an increase in texture_worst is associated with an increased risk of a malignant diagnosis.

Conclusion: The variables texture_worst and texture_se have an influence in predicting a malignant diagnosis, while texture_mean has no significant relationship. We are still looking for a better model.
Next, we will combine variables from the texture and form consisting
of texture_mean, texture_se,
texture_worst, concavity_mean,
concavity_se, concavity_worst.
model_cancer_ft <- glm(formula = diagnosis ~ texture_mean + texture_se + texture_worst + concavity_mean + concavity_se + concavity_worst,
data = train_cancer_lr_up,
family = "binomial")
summary(model_cancer_ft)
#>
#> Call:
#> glm(formula = diagnosis ~ texture_mean + texture_se + texture_worst +
#> concavity_mean + concavity_se + concavity_worst, family = "binomial",
#> data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -10.725505 1.444946 -7.423 0.0000000000001147 ***
#> texture_mean 0.003295 0.133168 0.025 0.98026
#> texture_se -2.055629 0.880958 -2.333 0.01963 *
#> texture_worst 0.328971 0.126296 2.605 0.00919 **
#> concavity_mean 119.185826 15.727713 7.578 0.0000000000000351 ***
#> concavity_se -171.550973 33.598204 -5.106 0.0000003291252522 ***
#> concavity_worst 0.150997 4.037077 0.037 0.97016
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.19 on 569 degrees of freedom
#> Residual deviance: 170.75 on 563 degrees of freedom
#> AIC: 184.75
#>
#> Number of Fisher Scoring iterations: 8
Insights :
The model has overall significance because the p-value is very small (p < 0.001), indicating that the model with all predictor variables is better than the model without predictor variables.
Significant predictor variables:
- texture_se: The coefficient is negative and significant (-2.055629), meaning that an increase in texture_se is associated with a decreased risk of a malignant diagnosis.
- texture_worst: Positive and significant coefficient (0.328971), meaning that an increase in texture_worst is associated with an increased risk of a malignant diagnosis.
- concavity_mean: Positive and highly significant coefficient (119.185826), meaning that an increase in concavity_mean is strongly associated with an increased risk of a malignant diagnosis.
- concavity_se: The coefficient is negative and highly significant (-171.550973), meaning that an increase in concavity_se is strongly associated with a decreased risk of a malignant diagnosis.

The remaining predictors, texture_mean and concavity_worst, have no significant influence on the diagnosis in this model. The AIC value of 184.75 is still relatively high.
General conclusion: Texture and concavity related variables have a role in cancer diagnosis prediction, but not all variables have the same influence. We are still looking for a good model with a small AIC value.
Next, we will build a model by including all the predictors.
model_cancer_all <- glm(formula = diagnosis ~ .,
data = train_cancer_lr_up,
family = "binomial")
summary(model_cancer_all)
#>
#> Call:
#> glm(formula = diagnosis ~ ., family = "binomial", data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -5534.0214 415179.0900 -0.013 0.989
#> radius_mean -1726.4092 80607.6070 -0.021 0.983
#> texture_mean 39.9573 2771.7389 0.014 0.988
#> perimeter_mean 258.3541 12926.2153 0.020 0.984
#> area_mean -1.2668 254.9928 -0.005 0.996
#> smoothness_mean -1239.0607 1256026.3101 -0.001 0.999
#> compactness_mean -17223.1279 718699.9907 -0.024 0.981
#> concavity_mean -258.4364 467064.3929 -0.001 1.000
#> concave.points_mean 10829.8845 852057.1122 0.013 0.990
#> symmetry_mean -2140.4349 226834.3975 -0.009 0.992
#> fractal_dimension_mean 30267.9394 2159131.7255 0.014 0.989
#> radius_se 3456.7452 264257.0376 0.013 0.990
#> texture_se 93.2084 14854.8186 0.006 0.995
#> perimeter_se -322.0273 32469.7600 -0.010 0.992
#> area_se -5.9391 597.7267 -0.010 0.992
#> smoothness_se -87436.6566 6470752.5575 -0.014 0.989
#> compactness_se 28449.0523 1337456.1292 0.021 0.983
#> concavity_se -10847.3337 574820.2061 -0.019 0.985
#> concave.points_se 52595.3744 3112311.6391 0.017 0.987
#> symmetry_se 2304.9277 821856.6103 0.003 0.998
#> fractal_dimension_se -237651.3818 14001657.2950 -0.017 0.986
#> radius_worst 126.7463 37241.4028 0.003 0.997
#> texture_worst -2.7761 2467.4273 -0.001 0.999
#> perimeter_worst 10.4596 5395.9051 0.002 0.998
#> area_worst 0.4787 299.4577 0.002 0.999
#> smoothness_worst 9609.8878 763310.2716 0.013 0.990
#> compactness_worst -2383.1934 200871.8576 -0.012 0.991
#> concavity_worst 2257.5377 220280.4545 0.010 0.992
#> concave.points_worst 72.3320 306852.9661 0.000 1.000
#> symmetry_worst 1807.1142 160275.6184 0.011 0.991
#> fractal_dimension_worst 10522.4000 1109609.7573 0.009 0.992
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.187785838 on 569 degrees of freedom
#> Residual deviance: 0.000004021 on 539 degrees of freedom
#> AIC: 62
#>
#> Number of Fisher Scoring iterations: 25
Insights: none of the coefficients in this full model are statistically significant even though the residual deviance is nearly zero and the AIC is low, so the model cannot be interpreted reliably.
We are still looking for the best model, so we will build additional models with stepwise regression. Three methods are performed:
1. Backward
2. Forward
3. Both
We use the model_cancer_all model which includes all
variables as predictors. The stepwise regression process uses the
step() function, by filling in some parameters:
model_cancer_all as the object, and “backward”
as the direction.
model_cancer_backward <- step(object = model_cancer_all,
direction = "backward",
trace = F)
summary(model_cancer_backward)
#>
#> Call:
#> glm(formula = diagnosis ~ radius_mean + texture_mean + perimeter_mean +
#> compactness_mean + concave.points_mean + fractal_dimension_mean +
#> radius_se + texture_se + perimeter_se + smoothness_se + compactness_se +
#> concavity_se + concave.points_se + fractal_dimension_se +
#> radius_worst + smoothness_worst + concavity_worst + symmetry_worst +
#> fractal_dimension_worst, family = "binomial", data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -15809.9 189960.0 -0.083 0.934
#> radius_mean -5936.7 84668.0 -0.070 0.944
#> texture_mean 105.2 1265.1 0.083 0.934
#> perimeter_mean 823.5 11541.9 0.071 0.943
#> compactness_mean -65002.9 830302.0 -0.078 0.938
#> concave.points_mean 42427.4 629829.7 0.067 0.946
#> fractal_dimension_mean 95704.9 1232295.2 0.078 0.938
#> radius_se 5731.0 73660.9 0.078 0.938
#> texture_se 641.0 10289.6 0.062 0.950
#> perimeter_se -676.4 10435.4 -0.065 0.948
#> smoothness_se -214910.7 2961870.2 -0.073 0.942
#> compactness_se 57413.4 739058.2 0.078 0.938
#> concavity_se -21701.3 297834.7 -0.073 0.942
#> concave.points_se 178343.8 2529163.5 0.071 0.944
#> fractal_dimension_se -679825.9 8307868.2 -0.082 0.935
#> radius_worst 810.9 11655.3 0.070 0.945
#> smoothness_worst 22475.4 296416.6 0.076 0.940
#> concavity_worst 4114.6 53542.6 0.077 0.939
#> symmetry_worst 4272.1 58374.2 0.073 0.942
#> fractal_dimension_worst 25969.5 411187.4 0.063 0.950
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.187785838 on 569 degrees of freedom
#> Residual deviance: 0.000028319 on 550 degrees of freedom
#> AIC: 40
#>
#> Number of Fisher Scoring iterations: 25
Insights and conclusion:
1. This model cannot be used for prediction or inference, as it has no reliable predictive power.
2. The insignificant results may be due to high multicollinearity between the predictor variables (a quick way to check this is sketched below).
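One way to check the multicollinearity suspicion raised in point 2 is to look for pairs of predictors with very high correlation. The sketch below uses base R on the upsampled training data; the 0.9 cutoff is an arbitrary choice for illustration.
# Find pairs of predictors with absolute correlation above 0.9 (a rough multicollinearity check)
pred_cor <- cor(train_cancer_lr_up %>% select(-diagnosis))
high_idx <- which(abs(pred_cor) > 0.9 & upper.tri(pred_cor), arr.ind = TRUE)
data.frame(var1 = rownames(pred_cor)[high_idx[, 1]],
           var2 = colnames(pred_cor)[high_idx[, 2]],
           corr = round(pred_cor[high_idx], 3))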
The stepwise regression process uses the step()
function, by filling in some parameters: model_cancer_null
as the object, and “forward” as the direction. For the
Forward Selection method, we need to define the scope
parameter to indicate the maximum upper limit of predictor combinations
with model_cancer_all.
model_cancer_forward <- step(object = model_cancer_null,
direction = "forward",
scope = list(upper= model_cancer_all),
trace=F)
summary(model_cancer_forward)
#>
#> Call:
#> glm(formula = diagnosis ~ perimeter_worst + smoothness_worst +
#> texture_worst + symmetry_worst + concave.points_worst + area_worst +
#> radius_mean + concave.points_mean + compactness_mean, family = "binomial",
#> data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -16.27649 9.53874 -1.706 0.08794 .
#> perimeter_worst 0.12727 0.16734 0.761 0.44693
#> smoothness_worst 30.50015 31.63750 0.964 0.33502
#> texture_worst 0.43109 0.09932 4.340 0.0000142 ***
#> symmetry_worst 21.61792 10.03283 2.155 0.03118 *
#> concave.points_worst 38.61845 26.27892 1.470 0.14168
#> area_worst 0.04172 0.01434 2.910 0.00362 **
#> radius_mean -4.17580 1.30565 -3.198 0.00138 **
#> concave.points_mean 213.04153 67.10194 3.175 0.00150 **
#> compactness_mean -76.90162 28.84636 -2.666 0.00768 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.188 on 569 degrees of freedom
#> Residual deviance: 48.495 on 560 degrees of freedom
#> AIC: 68.495
#>
#> Number of Fisher Scoring iterations: 11
Insights :
- texture_worst has a significant coefficient (very low p-value: 0.0000142), indicating a strong influence on the diagnosis result.
- symmetry_worst is also significant (p-value: 0.03118).
- area_worst, radius_mean, compactness_mean, and concave.points_mean also have a significant influence (p-value < 0.05).
- Variables that are not significant: perimeter_worst, smoothness_worst, and concave.points_worst (p-value > 0.05).
The intercept value (coefficient for (Intercept)) is
-16.27649.
The AIC (Akaike Information Criterion) value is 68.495. The lower the AIC value, the better the model.
The stepwise regression process uses the step()
function, by filling in some parameters: model_cancer_null
as the object, and “both” as the direction. For the Both
Selection method, we need to define the scope parameter to
indicate the maximum upper limit of predictor combinations with
model_cancer_all.
model_cancer_both <- step(object = model_cancer_null,
direction = "both",
scope = list(upper= model_cancer_all),
trace=F)
summary(model_cancer_both)
#>
#> Call:
#> glm(formula = diagnosis ~ texture_worst + symmetry_worst + concave.points_worst +
#> area_worst + radius_mean + concave.points_mean + compactness_mean,
#> family = "binomial", data = train_cancer_lr_up)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -9.41898 6.33055 -1.488 0.136787
#> texture_worst 0.42359 0.09534 4.443 0.00000887 ***
#> symmetry_worst 23.56378 10.09028 2.335 0.019528 *
#> concave.points_worst 43.34106 24.40428 1.776 0.075739 .
#> area_worst 0.04650 0.01150 4.045 0.00005236 ***
#> radius_mean -3.82827 1.10560 -3.463 0.000535 ***
#> concave.points_mean 208.37648 62.55820 3.331 0.000866 ***
#> compactness_mean -66.39121 26.48332 -2.507 0.012179 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 790.188 on 569 degrees of freedom
#> Residual deviance: 50.218 on 562 degrees of freedom
#> AIC: 66.218
#>
#> Number of Fisher Scoring iterations: 11
Insights:
- texture_worst has a significant coefficient (very low p-value: 0.00000887), indicating a strong influence on the diagnosis result.
- symmetry_worst is also significant (p-value: 0.019528).
- area_worst, radius_mean, and concave.points_mean also have a significant influence (p-value < 0.05).
- compactness_mean is also significant (p-value: 0.012179).
- The intercept value (coefficient for (Intercept)) is -9.41898.
The AIC (Akaike Information Criterion) value is 66.218. The lower the AIC value, the better the model.
Conclusion: the stepwise model built with the “both” direction, model_cancer_both, will be used to make predictions.
Next, we will work with model_cancer_both. In a logistic regression model, the coefficients (Estimate) are on the log-odds scale, and the inverse logit converts log-odds into probabilities. We use the inv.logit() function from the gtools library to obtain probabilities between 0 and 1.
# Converting the log of odds value into probability
# texture_worst
inv.logit(0.42359)
#> [1] 0.604342
# symmetry_worst
inv.logit(23.56378)
#> [1] 1
# area_worst
inv.logit(0.04650)
#> [1] 0.5116229
# radius_mean
inv.logit(-3.82827)
#> [1] 0.02128433
# concave.points_mean
inv.logit(208.37648)
#> [1] 1
# compactness_mean
inv.logit(-66.39121)
#> [1] 0.0000000000000000000000000000146779
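For reference, inv.logit() is simply the standard logistic function 1 / (1 + exp(-x)); base R provides the same transformation as plogis(), so the values above can be reproduced without gtools.
# The inverse logit computed by hand and with base R's plogis()
1 / (1 + exp(-0.42359))   # texture_worst, matches inv.logit(0.42359)
plogis(0.42359)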
Insights :
- texture_worst, symmetry_worst, and concave.points_mean appear to be the most influential variables in predicting the probability of a tumor being malignant.
- Some variables have a positive relationship with the likelihood of a tumor being malignant (inverse-logit value close to 1), while others, such as radius_mean and compactness_mean, have a negative relationship (inverse-logit value close to 0).
A series of models were constructed, encompassing:
- model_cancer_null
- model_cancer_size
- model_cancer_texture
- model_cancer_ft
- model_cancer_all
- model_cancer_backward
- model_cancer_forward
- model_cancer_both

Upon completion of model development, a selection process will be initiated to identify a model characterized by the following criteria:
There are two primary types of deviance in this context: the null deviance and the residual deviance.
model_cancer_null$deviance
#> [1] 790.1878
model_cancer_size$deviance
#> [1] 103.4436
model_cancer_texture$deviance
#> [1] 613.5696
model_cancer_ft$deviance
#> [1] 170.7509
model_cancer_all$deviance
#> [1] 0.000004020951
model_cancer_backward$deviance
#> [1] 0.00002831909
model_cancer_forward$deviance
#> [1] 48.49527
model_cancer_both$deviance
#> [1] 50.2181
Residual deviance values:
- model_cancer_null = 790.1878
- model_cancer_size = 103.4436
- model_cancer_texture = 613.5696
- model_cancer_ft = 170.7509
- model_cancer_all = 0.000004020951
- model_cancer_backward = 0.00002831909
- model_cancer_forward = 48.49527
- model_cancer_both = 50.2181

Insights:
The models model_cancer_forward (48.49527) and model_cancer_both (50.2181) have low residual deviance. The residual deviance of model_cancer_all and model_cancer_backward is even lower, but all of their predictor coefficients are insignificant.
AIC describes the amount of information lost from a model. The smaller the AIC value, the less information is lost.
# aic
model_cancer_null$aic
#> [1] 792.1878
model_cancer_size$aic
#> [1] 117.4436
model_cancer_texture$aic
#> [1] 621.5696
model_cancer_ft$aic
#> [1] 184.7509
model_cancer_all$aic
#> [1] 62
model_cancer_backward$aic
#> [1] 40.00003
model_cancer_forward$aic
#> [1] 68.49527
model_cancer_both$aic
#> [1] 66.2181
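For convenience, the deviance and AIC figures shown above can also be collected into a single comparison table. This is a small helper sketch using the model objects already created; deviance() and AIC() are base R extractors.
# Collect residual deviance and AIC of every candidate model in one data frame
model_list <- list(null     = model_cancer_null,
                   size     = model_cancer_size,
                   texture  = model_cancer_texture,
                   ft       = model_cancer_ft,
                   all      = model_cancer_all,
                   backward = model_cancer_backward,
                   forward  = model_cancer_forward,
                   both     = model_cancer_both)
data.frame(model    = names(model_list),
           deviance = sapply(model_list, deviance),
           AIC      = sapply(model_list, AIC))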
Insights :
Among the models whose predictor coefficients are significant (see the discussion above), model_cancer_both has the smallest AIC. Conclusion: based on deviance and AIC, the model_cancer_both model is chosen.
We’ll use the knn() function from the class package to build a KNN model. This model classifies new observations based on their similarity (distance) to observations in the training set that are labeled as malignant or benign.
The predictor data will be scaled using z-score standardization. The test data should also be scaled using parameters from the train data (since it assumes the test data is unseen data).
library(dplyr)
# For predictor
cancer_train_x <- train_cancer_knn_up %>% select(-diagnosis)
cancer_test_x <- test_cancer_knn %>% select(-diagnosis)
# For Target
cancer_train_y <- train_cancer_knn_up$diagnosis
cancer_test_y <- test_cancer_knn$diagnosis
The scale() function takes several parameters:
- x: the object to be scaled
- center: the mean values (taken from the centering values of the scaled cancer_train_x data)
- scale: the standard deviation values (taken from the scaling values of the scaled cancer_train_x data)
# Scaling data
# Data Train
cancer_train_x_sc <- scale(cancer_train_x)
# Data Test
cancer_test_x_sc <- scale(cancer_test_x,
center = attr(cancer_train_x_sc, "scaled:center"),
scale = attr(cancer_train_x_sc, "scaled:scale"))
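As an optional sanity check, the scaled training predictors should now have means of approximately 0 and standard deviations of approximately 1 (the test set will deviate slightly, since it was scaled with the training parameters).
# Verify the z-score standardization of the training predictors
summary(colMeans(cancer_train_x_sc))       # all means ~ 0
summary(apply(cancer_train_x_sc, 2, sd))   # all standard deviations ~ 1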
Next, we will make predictions: first with the selected logistic regression model (model_cancer_both), and afterwards with KNN using the scaled train (cancer_train_x_sc) and test (cancer_test_x_sc) data.
Syntax: predict(object, newdata, type)
- object: the model used for prediction
- newdata: test data / unseen data / new data
- type: the type of predicted value to return

The type parameter can take the following values:
- link: returns the log of odds
- response: returns probabilities

We use type = "response" because we need the probability of each class: “M” (Malignant) = 1 or “B” (Benign) = 0.
predict(object = model_cancer_both,
newdata = head(test_cancer_lr),
type = "response")
#> 1 2 3 4 5 6
#> 0.9999533 0.9994761 0.9999997 0.9999806 1.0000000 0.9999991
Convert the predicted probabilities into class labels with the ifelse() function:
- test: the condition being evaluated (predicted probability > 0.5)
- yes: 1 (Malignant)
- no: 0 (Benign)
# Convert odds into prediction labels
pred_cancer_lr <- predict(object = model_cancer_both,
newdata = head(test_cancer_lr),
type = "response")
ifelse(pred_cancer_lr > 0.5, yes = 1, no = 0)
#> 1 2 3 4 5 6
#> 1 1 1 1 1 1
Predict the probability diagnosis for the
test_cancer_lr data and save it into a new column named
pred in the test data.
test_cancer_lr$pred <- predict(object = model_cancer_both,
newdata = test_cancer_lr,
type = "response")
test_cancer_lr
Classify the test_cancer_lr data based on
pred and save it in a new column named
pred_label.
# ifelse(condition, value_if_true, value_if_false)
test_cancer_lr$pred_label <- ifelse(test_cancer_lr$pred > 0.5, yes = 1, no = 0)
test_cancer_lr
Here are the prediction results and the actual values:
test_cancer_lr %>%
select(diagnosis,
pred,
pred_label) %>%
head(10)
The knn() function does not build a model object; it predicts the test data directly from the training data. Load the class library and fill in the parameters of the knn() function:
- train: training predictors, already scaled, numeric type
- test: test predictors, scaled with the training parameters, numeric type
- cl: actual (categorical) labels (target) of the training data
- k: the chosen value of k (a common heuristic: the square root of the number of rows)
sqrt(nrow(train_cancer_knn_up))
#> [1] 23.87467
Optimum k value = 23 (an odd k is chosen because the number of target classes is 2: 1 and 0).
library(class)
cancer_knn <- knn(train = cancer_train_x_sc,
test = cancer_test_x_sc,
cl = cancer_train_y,
k = 23)
head(cancer_knn)
#> [1] 1 1 1 1 1 1
#> Levels: 0 1
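As an optional alternative to the square-root heuristic, k could also be tuned with cross-validation, for example via caret’s train() function. This is a sketch rather than part of the original workflow, and the grid of odd k values is an arbitrary choice.
# Optional: tune k with 5-fold cross-validation on the scaled, upsampled training data
set.seed(123)
knn_cv <- train(x = as.data.frame(cancer_train_x_sc),
                y = cancer_train_y,
                method = "knn",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = data.frame(k = seq(3, 35, by = 2)))
knn_cv$bestTune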
# Save into the data test
test_cancer_knn$pred_label <- cancer_knn
head(test_cancer_knn)
Here are the prediction results (pred_label) and the
actual values (diagnosis):
test_cancer_knn %>%
select(diagnosis,
pred_label) %>%
head(10)
Syntax: confusionMatrix(data, reference, positive)
- data: predicted labels (factor)
- reference: actual labels (factor)
- positive: name of the positive class
# confusion matrix
library(caret)
confusionMatrix(data = as.factor(test_cancer_lr$pred_label),
reference = test_cancer_lr$diagnosis,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 72 2
#> 1 0 40
#>
#> Accuracy : 0.9825
#> 95% CI : (0.9381, 0.9979)
#> No Information Rate : 0.6316
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.9619
#>
#> Mcnemar's Test P-Value : 0.4795
#>
#> Sensitivity : 0.9524
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9730
#> Prevalence : 0.3684
#> Detection Rate : 0.3509
#> Detection Prevalence : 0.3509
#> Balanced Accuracy : 0.9762
#>
#> 'Positive' Class : 1
#>
Insights:
Accuracy is the proportion of correct predictions out of all predictions; here it is about 98.25%.
Sensitivity (recall) is the proportion of actual positive cases that are correctly predicted (Malignant predicted as Malignant); here it is about 95.24%. In a medical context, correctly identifying Malignant cases is very important, so sensitivity is a highly relevant metric.
Positive predictive value (precision) is the proportion of positive predictions that are correct (of all Malignant predictions, how many are actually Malignant). A value of 100% is excellent, meaning that all Malignant predictions are correct.
Conclusion: the Logistic Regression model performed very well in predicting the “M” (Malignant) and “B” (Benign) classes.
Syntax: confusionMatrix(data, reference, positive)
- data: predicted labels (factor)
- reference: actual labels (factor)
- positive: name of the positive class
# confusion matrix KNN
library(caret)
confusionMatrix(data = test_cancer_knn$pred_label,
reference = test_cancer_knn$diagnosis,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 72 1
#> 1 0 41
#>
#> Accuracy : 0.9912
#> 95% CI : (0.9521, 0.9998)
#> No Information Rate : 0.6316
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.9811
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9762
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9863
#> Prevalence : 0.3684
#> Detection Rate : 0.3596
#> Detection Prevalence : 0.3596
#> Balanced Accuracy : 0.9881
#>
#> 'Positive' Class : 1
#>
Insights:
Accuracy is the proportion of correct predictions out of all predictions; here it is about 99.12%.
Sensitivity (recall) is the proportion of actual positive cases that are correctly predicted (Malignant predicted as Malignant); here it is about 97.62%. In a medical context, correctly identifying Malignant cases is very important, so sensitivity is a highly relevant metric.
Positive predictive value (precision) is the proportion of positive predictions that are correct (of all Malignant predictions, how many are actually Malignant). A value of 100% is excellent, meaning that all Malignant predictions are correct.
Conclusion: the KNN model performed very well in predicting the “M” (Malignant) and “B” (Benign) classes.
A comparative analysis of Logistic Regression and K-Nearest Neighbors (KNN) models for predicting cancer malignancy reveals comparable performance. Both algorithms demonstrated high accuracy in classifying tumors as malignant or benign. Based on the evaluation metrics, the KNN model exhibited slightly superior accuracy (99.12%) compared to Logistic Regression (98.25%). Similarly, the KNN model achieved a higher sensitivity (97.62%) in identifying malignant cases, whereas Logistic Regression attained a sensitivity of 95.24%. Both models demonstrated perfect positive predictive value (100%).
Considering the primary objective of accurately identifying malignant cases, sensitivity, or recall, emerges as the most critical evaluation metric for this study.
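For reference, this metric can also be read directly off the confusion matrices above, since recall = TP / (TP + FN) for the positive (Malignant) class.
# Sensitivity (recall) recomputed from the confusion-matrix counts shown above
recall_lr  <- 40 / (40 + 2)   # Logistic Regression: 40 true positives, 2 false negatives
recall_knn <- 41 / (41 + 1)   # KNN: 41 true positives, 1 false negative
c(logistic = recall_lr, knn = recall_knn)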