1 Data Understanding

1.1 Introduction

These situations are alarming, and this is where our project comes into the picture.

1.2 Project Objectives

  1. To develop a machine learning model to classify breast tumors as malignant or benign.
  2. To utilize the Breast Cancer Wisconsin (Diagnostic) Dataset to train and evaluate the model.
  3. To identify key features and factors that contribute to accurate tumor classification.

1.3 Data Background

We retrieved our dataset from Kaggle; the data originates from the Wisconsin Breast Cancer Database (University of Wisconsin Hospitals, Madison, USA, 1991).
- Data source: Breast Cancer Dataset @ Kaggle
- Dataset name: breast-cancer.csv (122 KB)
- The dataset is loaded into a data frame named ‘bc_df’

2 Data Preprocessing

2.1 Load all libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(pryr)
## 
## Attaching package: 'pryr'
## The following object is masked from 'package:dplyr':
## 
##     where
library(Rcpp)
library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(caret)
## Loading required package: lattice
library(infotheo)
library(corrplot)
## corrplot 0.92 loaded
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(coefplot)
library(writexl)
library(scales)

2.2 Read the dataset

# Adjust this path to wherever breast-cancer.csv is saved locally
bc_df <- read.csv("C:/Users/ahaen/Documents/breast-cancer.csv")
head(bc_df)
##         id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1   842302         M       17.99        10.38         122.80    1001.0
## 2   842517         M       20.57        17.77         132.90    1326.0
## 3 84300903         M       19.69        21.25         130.00    1203.0
## 4 84348301         M       11.42        20.38          77.58     386.1
## 5 84358402         M       20.29        14.34         135.10    1297.0
## 6   843786         M       12.45        15.70          82.57     477.1
##   smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1         0.11840          0.27760         0.3001             0.14710
## 2         0.08474          0.07864         0.0869             0.07017
## 3         0.10960          0.15990         0.1974             0.12790
## 4         0.14250          0.28390         0.2414             0.10520
## 5         0.10030          0.13280         0.1980             0.10430
## 6         0.12780          0.17000         0.1578             0.08089
##   symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1        0.2419                0.07871    1.0950     0.9053        8.589
## 2        0.1812                0.05667    0.5435     0.7339        3.398
## 3        0.2069                0.05999    0.7456     0.7869        4.585
## 4        0.2597                0.09744    0.4956     1.1560        3.445
## 5        0.1809                0.05883    0.7572     0.7813        5.438
## 6        0.2087                0.07613    0.3345     0.8902        2.217
##   area_se smoothness_se compactness_se concavity_se concave.points_se
## 1  153.40      0.006399        0.04904      0.05373           0.01587
## 2   74.08      0.005225        0.01308      0.01860           0.01340
## 3   94.03      0.006150        0.04006      0.03832           0.02058
## 4   27.23      0.009110        0.07458      0.05661           0.01867
## 5   94.44      0.011490        0.02461      0.05688           0.01885
## 6   27.19      0.007510        0.03345      0.03672           0.01137
##   symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1     0.03003             0.006193        25.38         17.33          184.60
## 2     0.01389             0.003532        24.99         23.41          158.80
## 3     0.02250             0.004571        23.57         25.53          152.50
## 4     0.05963             0.009208        14.91         26.50           98.87
## 5     0.01756             0.005115        22.54         16.67          152.20
## 6     0.02165             0.005082        15.47         23.75          103.40
##   area_worst smoothness_worst compactness_worst concavity_worst
## 1     2019.0           0.1622            0.6656          0.7119
## 2     1956.0           0.1238            0.1866          0.2416
## 3     1709.0           0.1444            0.4245          0.4504
## 4      567.7           0.2098            0.8663          0.6869
## 5     1575.0           0.1374            0.2050          0.4000
## 6      741.6           0.1791            0.5249          0.5355
##   concave.points_worst symmetry_worst fractal_dimension_worst
## 1               0.2654         0.4601                 0.11890
## 2               0.1860         0.2750                 0.08902
## 3               0.2430         0.3613                 0.08758
## 4               0.2575         0.6638                 0.17300
## 5               0.1625         0.2364                 0.07678
## 6               0.1741         0.3985                 0.12440

2.3 View the content of the data

glimpse(bc_df)
## Rows: 569
## Columns: 32
## $ id                      <int> 842302, 842517, 84300903, 84348301, 84358402, …
## $ diagnosis               <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
## $ radius_mean             <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
## $ texture_mean            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
## $ perimeter_mean          <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
## $ area_mean               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
## $ smoothness_mean         <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
## $ compactness_mean        <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
## $ concavity_mean          <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
## $ concave.points_mean     <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
## $ symmetry_mean           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
## $ fractal_dimension_mean  <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
## $ radius_se               <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
## $ texture_se              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
## $ perimeter_se            <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
## $ area_se                 <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
## $ smoothness_se           <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
## $ compactness_se          <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
## $ concavity_se            <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
## $ concave.points_se       <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
## $ symmetry_se             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
## $ fractal_dimension_se    <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
## $ radius_worst            <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
## $ texture_worst           <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
## $ perimeter_worst         <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
## $ area_worst              <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
## $ smoothness_worst        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
## $ compactness_worst       <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
## $ concavity_worst         <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
## $ concave.points_worst    <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
## $ symmetry_worst          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…
summary(bc_df[, !names(bc_df) %in% "id", drop = FALSE])
##   diagnosis          radius_mean      texture_mean   perimeter_mean  
##  Length:569         Min.   : 6.981   Min.   : 9.71   Min.   : 43.79  
##  Class :character   1st Qu.:11.700   1st Qu.:16.17   1st Qu.: 75.17  
##  Mode  :character   Median :13.370   Median :18.84   Median : 86.24  
##                     Mean   :14.127   Mean   :19.29   Mean   : 91.97  
##                     3rd Qu.:15.780   3rd Qu.:21.80   3rd Qu.:104.10  
##                     Max.   :28.110   Max.   :39.28   Max.   :188.50  
##    area_mean      smoothness_mean   compactness_mean  concavity_mean   
##  Min.   : 143.5   Min.   :0.05263   Min.   :0.01938   Min.   :0.00000  
##  1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492   1st Qu.:0.02956  
##  Median : 551.1   Median :0.09587   Median :0.09263   Median :0.06154  
##  Mean   : 654.9   Mean   :0.09636   Mean   :0.10434   Mean   :0.08880  
##  3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040   3rd Qu.:0.13070  
##  Max.   :2501.0   Max.   :0.16340   Max.   :0.34540   Max.   :0.42680  
##  concave.points_mean symmetry_mean    fractal_dimension_mean   radius_se     
##  Min.   :0.00000     Min.   :0.1060   Min.   :0.04996        Min.   :0.1115  
##  1st Qu.:0.02031     1st Qu.:0.1619   1st Qu.:0.05770        1st Qu.:0.2324  
##  Median :0.03350     Median :0.1792   Median :0.06154        Median :0.3242  
##  Mean   :0.04892     Mean   :0.1812   Mean   :0.06280        Mean   :0.4052  
##  3rd Qu.:0.07400     3rd Qu.:0.1957   3rd Qu.:0.06612        3rd Qu.:0.4789  
##  Max.   :0.20120     Max.   :0.3040   Max.   :0.09744        Max.   :2.8730  
##    texture_se      perimeter_se       area_se        smoothness_se     
##  Min.   :0.3602   Min.   : 0.757   Min.   :  6.802   Min.   :0.001713  
##  1st Qu.:0.8339   1st Qu.: 1.606   1st Qu.: 17.850   1st Qu.:0.005169  
##  Median :1.1080   Median : 2.287   Median : 24.530   Median :0.006380  
##  Mean   :1.2169   Mean   : 2.866   Mean   : 40.337   Mean   :0.007041  
##  3rd Qu.:1.4740   3rd Qu.: 3.357   3rd Qu.: 45.190   3rd Qu.:0.008146  
##  Max.   :4.8850   Max.   :21.980   Max.   :542.200   Max.   :0.031130  
##  compactness_se      concavity_se     concave.points_se   symmetry_se      
##  Min.   :0.002252   Min.   :0.00000   Min.   :0.000000   Min.   :0.007882  
##  1st Qu.:0.013080   1st Qu.:0.01509   1st Qu.:0.007638   1st Qu.:0.015160  
##  Median :0.020450   Median :0.02589   Median :0.010930   Median :0.018730  
##  Mean   :0.025478   Mean   :0.03189   Mean   :0.011796   Mean   :0.020542  
##  3rd Qu.:0.032450   3rd Qu.:0.04205   3rd Qu.:0.014710   3rd Qu.:0.023480  
##  Max.   :0.135400   Max.   :0.39600   Max.   :0.052790   Max.   :0.078950  
##  fractal_dimension_se  radius_worst   texture_worst   perimeter_worst 
##  Min.   :0.0008948    Min.   : 7.93   Min.   :12.02   Min.   : 50.41  
##  1st Qu.:0.0022480    1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11  
##  Median :0.0031870    Median :14.97   Median :25.41   Median : 97.66  
##  Mean   :0.0037949    Mean   :16.27   Mean   :25.68   Mean   :107.26  
##  3rd Qu.:0.0045580    3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40  
##  Max.   :0.0298400    Max.   :36.04   Max.   :49.54   Max.   :251.20  
##    area_worst     smoothness_worst  compactness_worst concavity_worst 
##  Min.   : 185.2   Min.   :0.07117   Min.   :0.02729   Min.   :0.0000  
##  1st Qu.: 515.3   1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145  
##  Median : 686.5   Median :0.13130   Median :0.21190   Median :0.2267  
##  Mean   : 880.6   Mean   :0.13237   Mean   :0.25427   Mean   :0.2722  
##  3rd Qu.:1084.0   3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829  
##  Max.   :4254.0   Max.   :0.22260   Max.   :1.05800   Max.   :1.2520  
##  concave.points_worst symmetry_worst   fractal_dimension_worst
##  Min.   :0.00000      Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.06493      1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.09993      Median :0.2822   Median :0.08004        
##  Mean   :0.11461      Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.16140      3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :0.29100      Max.   :0.6638   Max.   :0.20750

2.4 View missing values and duplicate values

# Calculate the total number of missing values
total_missing_values <- sum(is.na(bc_df))
print(paste("Total missing values      :", total_missing_values))
## [1] "Total missing values      : 0"
# Count the total number of duplicate rows
total_duplicate_rows <- sum(duplicated(bc_df))
print(paste("Total number of duplicate rows :", total_duplicate_rows))
## [1] "Total number of duplicate rows : 0"

There are no missing values and no duplicate rows in the dataset.

2.5 View the target

cols <- c("#008b8b", "#cd853f")

plt <- ggplot(bc_df, aes(x = factor(diagnosis))) +
  geom_bar(fill = cols) +
  stat_count(geom = "text",
             aes(label = paste0(round(prop.table(after_stat(count)) * 100, 2), "%")),
             position = position_stack(vjust = 0.5), size = 4) +
  labs(x = "", y = "Count") +
  ggtitle("Counts and rates of malignant and benign tumors") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, hjust = 1))

print(plt)

# print the count of B and M
print(table(bc_df$diagnosis))
## 
##   B   M 
## 357 212

Many machine learning algorithms handle a binary target most naturally as numeric 0/1 labels. To make the data easier for the models to process, the diagnosis column is recoded from “M” (malignant) to 1 and “B” (benign) to 0.

bc_df$diagnosis <- ifelse(bc_df$diagnosis == "M", 1, 0)
head(bc_df)
##         id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1   842302         1       17.99        10.38         122.80    1001.0
## 2   842517         1       20.57        17.77         132.90    1326.0
## 3 84300903         1       19.69        21.25         130.00    1203.0
## 4 84348301         1       11.42        20.38          77.58     386.1
## 5 84358402         1       20.29        14.34         135.10    1297.0
## 6   843786         1       12.45        15.70          82.57     477.1
##   smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1         0.11840          0.27760         0.3001             0.14710
## 2         0.08474          0.07864         0.0869             0.07017
## 3         0.10960          0.15990         0.1974             0.12790
## 4         0.14250          0.28390         0.2414             0.10520
## 5         0.10030          0.13280         0.1980             0.10430
## 6         0.12780          0.17000         0.1578             0.08089
##   symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1        0.2419                0.07871    1.0950     0.9053        8.589
## 2        0.1812                0.05667    0.5435     0.7339        3.398
## 3        0.2069                0.05999    0.7456     0.7869        4.585
## 4        0.2597                0.09744    0.4956     1.1560        3.445
## 5        0.1809                0.05883    0.7572     0.7813        5.438
## 6        0.2087                0.07613    0.3345     0.8902        2.217
##   area_se smoothness_se compactness_se concavity_se concave.points_se
## 1  153.40      0.006399        0.04904      0.05373           0.01587
## 2   74.08      0.005225        0.01308      0.01860           0.01340
## 3   94.03      0.006150        0.04006      0.03832           0.02058
## 4   27.23      0.009110        0.07458      0.05661           0.01867
## 5   94.44      0.011490        0.02461      0.05688           0.01885
## 6   27.19      0.007510        0.03345      0.03672           0.01137
##   symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1     0.03003             0.006193        25.38         17.33          184.60
## 2     0.01389             0.003532        24.99         23.41          158.80
## 3     0.02250             0.004571        23.57         25.53          152.50
## 4     0.05963             0.009208        14.91         26.50           98.87
## 5     0.01756             0.005115        22.54         16.67          152.20
## 6     0.02165             0.005082        15.47         23.75          103.40
##   area_worst smoothness_worst compactness_worst concavity_worst
## 1     2019.0           0.1622            0.6656          0.7119
## 2     1956.0           0.1238            0.1866          0.2416
## 3     1709.0           0.1444            0.4245          0.4504
## 4      567.7           0.2098            0.8663          0.6869
## 5     1575.0           0.1374            0.2050          0.4000
## 6      741.6           0.1791            0.5249          0.5355
##   concave.points_worst symmetry_worst fractal_dimension_worst
## 1               0.2654         0.4601                 0.11890
## 2               0.1860         0.2750                 0.08902
## 3               0.2430         0.3613                 0.08758
## 4               0.2575         0.6638                 0.17300
## 5               0.1625         0.2364                 0.07678
## 6               0.1741         0.3985                 0.12440

2.6 Drop useless features

Based on the data structure listed above, one attribute is unnecessary: the “id” attribute, which has no bearing on whether a tumor is malignant or benign. Since it carries no predictive information, it is dropped before the modelling step.

bc_df1 <- bc_df[, !(names(bc_df) %in% c("id"))]

2.7 Correlation between features

# Calculate correlation matrix
corr <- cor(bc_df1)

# Create mask for upper triangle
mask <- upper.tri(corr)

# Set upper triangle values to NA
diag(corr) <- NA
corr[mask] <- NA

# Convert correlation matrix to long format
corr_long <- melt(corr, na.rm = TRUE)
print(head(corr_long, 20))
##                      Var1      Var2        value
## 2             radius_mean diagnosis  0.730028511
## 3            texture_mean diagnosis  0.415185300
## 4          perimeter_mean diagnosis  0.742635530
## 5               area_mean diagnosis  0.708983837
## 6         smoothness_mean diagnosis  0.358559965
## 7        compactness_mean diagnosis  0.596533678
## 8          concavity_mean diagnosis  0.696359707
## 9     concave.points_mean diagnosis  0.776613840
## 10          symmetry_mean diagnosis  0.330498554
## 11 fractal_dimension_mean diagnosis -0.012837603
## 12              radius_se diagnosis  0.567133821
## 13             texture_se diagnosis -0.008303333
## 14           perimeter_se diagnosis  0.556140703
## 15                area_se diagnosis  0.548235940
## 16          smoothness_se diagnosis -0.067016011
## 17         compactness_se diagnosis  0.292999244
## 18           concavity_se diagnosis  0.253729766
## 19      concave.points_se diagnosis  0.408042333
## 20            symmetry_se diagnosis -0.006521756
## 21   fractal_dimension_se diagnosis  0.077972417
# Plot heatmap
ggplot(corr_long, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#008b8b", high =  "#cd853f", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab",
                       name="Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 1, size = 8, hjust = 1),
        axis.text.y = element_text(angle = 0, vjust = 0.5, hjust = 0.5, size = 8)) +
  coord_fixed() +
  labs(title = "Correlation between features", x = "Variables", y = "Variables")

# Find features to drop: any pairwise correlation above 0.9
# (na.rm = TRUE ignores the masked upper triangle and diagonal)
to_drop <- colnames(corr)[apply(corr > 0.9, 2, any, na.rm = TRUE)]

# Drop features
bc_df2 <- bc_df1[, !(names(bc_df1) %in% to_drop)]

# Print remaining number of features
print(colnames(bc_df2))
##  [1] "diagnosis"               "smoothness_mean"        
##  [3] "compactness_mean"        "symmetry_mean"          
##  [5] "fractal_dimension_mean"  "texture_se"             
##  [7] "area_se"                 "smoothness_se"          
##  [9] "compactness_se"          "concavity_se"           
## [11] "concave.points_se"       "symmetry_se"            
## [13] "fractal_dimension_se"    "texture_worst"          
## [15] "area_worst"              "smoothness_worst"       
## [17] "compactness_worst"       "concavity_worst"        
## [19] "concave.points_worst"    "symmetry_worst"         
## [21] "fractal_dimension_worst"

Dropping highly correlated columns
The main purpose of removing highly correlated variables is to avoid multicollinearity. To improve the interpretability, stability, and performance of the model, we therefore delete highly correlated variables. During feature selection, retaining features that correlate strongly with the target variable but weakly with one another helps to build a more reliable model.
Features with a pairwise correlation greater than 0.9 were deleted, leaving 20 features.
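
As a cross-check, the already-loaded caret package offers findCorrelation(), which applies a similar pairwise-correlation cutoff (it uses a mean-absolute-correlation heuristic, so its picks may differ slightly from the manual filter above). A minimal sketch; full_corr and high_corr_idx are illustrative names:

# A sketch: caret's built-in pairwise-correlation filter with the same 0.9 cutoff
full_corr <- cor(bc_df1[, names(bc_df1) != "diagnosis"])
high_corr_idx <- findCorrelation(full_corr, cutoff = 0.9)
colnames(full_corr)[high_corr_idx]  # candidate columns to drop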

2.8 Correlation between features and target variable

cor_matrix <- cor(bc_df2)
diagnosis_corr <- cor_matrix["diagnosis", ]
diagnosis_corr <- diagnosis_corr[names(diagnosis_corr) != "diagnosis"]
sorted_corr <- sort(abs(diagnosis_corr), decreasing = FALSE)
print(sorted_corr)
##             symmetry_se              texture_se  fractal_dimension_mean 
##             0.006521756             0.008303333             0.012837603 
##           smoothness_se    fractal_dimension_se            concavity_se 
##             0.067016011             0.077972417             0.253729766 
##          compactness_se fractal_dimension_worst           symmetry_mean 
##             0.292999244             0.323872189             0.330498554 
##         smoothness_mean       concave.points_se          symmetry_worst 
##             0.358559965             0.408042333             0.416294311 
##        smoothness_worst           texture_worst                 area_se 
##             0.421464861             0.456902821             0.548235940 
##       compactness_worst        compactness_mean         concavity_worst 
##             0.590998238             0.596533678             0.659610210 
##              area_worst    concave.points_worst 
##             0.733825035             0.793566017
par(mar = c(5, 10, 4, 2) + 0.1)
barplot(sorted_corr,
        names.arg = names(sorted_corr),
        col = "#008b8b",
        horiz = TRUE,
        main = "Correlation between features and target variable",
        xlab = "Correlation",
        cex.names = 0.7,
        las = 2,
        xlim = c(0, 1))

# Find features to drop
to_drop <- names(diagnosis_corr)[abs(diagnosis_corr) < 0.1]

# Remove features with correlation less than 0.1
bc_df3 <- bc_df2[, !(names(bc_df2) %in% to_drop)]

# Print remaining number of features
print(colnames(bc_df3))
##  [1] "diagnosis"               "smoothness_mean"        
##  [3] "compactness_mean"        "symmetry_mean"          
##  [5] "area_se"                 "compactness_se"         
##  [7] "concavity_se"            "concave.points_se"      
##  [9] "texture_worst"           "area_worst"             
## [11] "smoothness_worst"        "compactness_worst"      
## [13] "concavity_worst"         "concave.points_worst"   
## [15] "symmetry_worst"          "fractal_dimension_worst"

Features with an absolute correlation below 0.1 with the target were deleted, leaving 15 features.
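
Correlation only captures linear association. As a complementary, non-linear relevance check, the infotheo package loaded in Section 2.1 can estimate the mutual information between each feature and the target after discretization. A minimal sketch; features_disc and mi are illustrative names:

# A sketch: mutual information between each (discretized) feature and the target
features_disc <- discretize(bc_df2[, names(bc_df2) != "diagnosis"])
mi <- sapply(features_disc, function(col) mutinformation(col, bc_df2$diagnosis))
head(sort(mi, decreasing = TRUE))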

3 Machine Learning Modeling

For modelling, we imported all the necessary packages, in particular the ML classification models from the R library ecosystem. We trained and tested four classification models: Random Forest, Logistic Regression, Decision Tree, and Gaussian Naive Bayes.

3.1 Data Splitting

The dataset was divided into training and testing sets, with 70% allocated for training and 30% for testing. The features (x_train, x_test) and the corresponding labels (y_train, y_test) were then separated, and the number of feature columns was printed as a quick check.

library(caTools)
# Splitting: 70% train, 30% test (stratified by diagnosis)
# Note: no seed is fixed here, so the split changes between runs
split = sample.split(bc_df3$diagnosis, SplitRatio = 0.7)
train_data = subset(bc_df3, split == TRUE)
test_data = subset(bc_df3, split == FALSE)

# Separate features and target variable for training set
x_train = subset(train_data, select = -diagnosis)
y_train = train_data$diagnosis
length(x_train)  # number of feature columns
## [1] 15
# Separate features and target variable for testing set
x_test = subset(test_data, select = -diagnosis)
y_test = test_data$diagnosis

The “diagnosis” variable in both the training and test sets is converted into a factor. In R, a factor represents categorical data, with each category stored as a level. Converting “diagnosis” to a factor signals that it carries categorical rather than numerical information.

train_data$diagnosis <- factor(train_data$diagnosis)
test_data$diagnosis <- factor(test_data$diagnosis)
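
An equivalent stratified split can also be made with the already-loaded caret package, which makes it easy to fix a seed. A minimal sketch; train_alt and test_alt are illustrative names and the seed value is arbitrary:

# A sketch: reproducible, stratified 70/30 split with caret
set.seed(42)  # any fixed seed makes the split reproducible
idx <- createDataPartition(factor(bc_df3$diagnosis), p = 0.7, list = FALSE)
train_alt <- bc_df3[idx, ]
test_alt  <- bc_df3[-idx, ]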

3.2 Random Forest

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Train the model
set.seed(42)
rf_model <- randomForest(diagnosis ~ ., data = train_data, importance = TRUE)


# Print model summary
print(rf_model)
## 
## Call:
##  randomForest(formula = diagnosis ~ ., data = train_data, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 4.77%
## Confusion matrix:
##     0   1 class.error
## 0 245   5  0.02000000
## 1  14 134  0.09459459
# Predict on test data (before tuning)
y_pred <- predict(rf_model, newdata = test_data)
accuracy_rf <- sum(y_pred == test_data$diagnosis) / nrow(test_data)
print(paste("Accuracy:", round(accuracy_rf, 4)))
## [1] "Accuracy: 0.9708"
# Generate confusion matrix
conf_matrix_rf <- confusionMatrix(y_pred, test_data$diagnosis)

# Print confusion matrix
print(conf_matrix_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 106   4
##          1   1  60
##                                           
##                Accuracy : 0.9708          
##                  95% CI : (0.9331, 0.9904)
##     No Information Rate : 0.6257          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.937           
##                                           
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9375          
##          Pos Pred Value : 0.9636          
##          Neg Pred Value : 0.9836          
##              Prevalence : 0.6257          
##          Detection Rate : 0.6199          
##    Detection Prevalence : 0.6433          
##       Balanced Accuracy : 0.9641          
##                                           
##        'Positive' Class : 0               
## 
# Calculate precision, recall, and F1 score
# (note: caret takes "0", the benign class, as the positive class here)
precision_rf <- conf_matrix_rf$byClass["Pos Pred Value"]
recall_rf <- conf_matrix_rf$byClass["Sensitivity"]

f1_score_rf <- 2 * (precision_rf * recall_rf) / (precision_rf + recall_rf)  

# Print F1 score
print(paste("F1 Score:", round(f1_score_rf, 4)))
## [1] "F1 Score: 0.977"
print(paste("Recall:", round(recall_rf, 4)))
## [1] "Recall: 0.9907"
print(paste("Precision:", round(precision_rf, 4)))
## [1] "Precision: 0.9636"

3.3 Logistic Regression

# Load libraries
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-8
# Train logistic regression model
log_model <- glm(diagnosis ~ ., data = train_data, family = "binomial", maxit = 1000)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Predict on test data
y_pred_log <- predict(log_model, newdata = test_data, type = "response")
y_pred_class <- ifelse(y_pred_log > 0.5, 1, 0)

# Calculate accuracy
accuracy_log <- sum(y_pred_class == test_data$diagnosis) / nrow(test_data)
print(paste("Accuracy (Logistic Regression):", round(accuracy_log, 4)))
## [1] "Accuracy (Logistic Regression): 0.9766"
# Calculate precision
precision_log <- sum(y_pred_class[test_data$diagnosis == 1] == 1) / sum(y_pred_class == 1)
print(paste("Precision (Logistic Regression):", round(precision_log, 4)))
## [1] "Precision (Logistic Regression): 0.9839"
# Calculate recall
recall_log <- sum(y_pred_class[test_data$diagnosis == 1] == 1) / sum(test_data$diagnosis == 1)
print(paste("Recall (Logistic Regression):", round(recall_log, 4)))
## [1] "Recall (Logistic Regression): 0.9531"
# Calculate F1 score
f1_score_log <- 2 * (precision_log * recall_log) / (precision_log + recall_log)
print(paste("F1 Score (Logistic Regression):", round(f1_score_log, 4)))
## [1] "F1 Score (Logistic Regression): 0.9683"
# Generate confusion matrix
conf_matrix_log <- table(y_pred_class, test_data$diagnosis)

# Print confusion matrix
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(conf_matrix_log)
##             
## y_pred_class   0   1
##            0 106   3
##            1   1  61
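
The glm.fit warning above signals (quasi-)separation: some fitted probabilities reach 0 or 1 exactly. A common mitigation is penalized logistic regression, for which glmnet is already loaded. A minimal sketch with a ridge penalty (alpha = 0); cv_fit and pred_ridge are illustrative names:

# A sketch: ridge-penalized logistic regression as a remedy for separation
x_train_mat <- as.matrix(x_train)
x_test_mat  <- as.matrix(x_test)
cv_fit <- cv.glmnet(x_train_mat, y_train, family = "binomial", alpha = 0)
pred_ridge <- predict(cv_fit, newx = x_test_mat, s = "lambda.min", type = "response")
mean(ifelse(pred_ridge > 0.5, 1, 0) == y_test)  # test accuracy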

3.4 Decision Tree

# Load libraries
library(rpart)

# Train decision tree model
tree_model <- rpart(diagnosis ~ ., data = train_data, method = "class")

# Predict on test data
y_pred_tree <- predict(tree_model, newdata = test_data, type = "class")

# Calculate accuracy
accuracy_tree <- sum(y_pred_tree == test_data$diagnosis) / nrow(test_data)
print(paste("Accuracy (Decision Tree):", round(accuracy_tree, 4)))
## [1] "Accuracy (Decision Tree): 0.9591"
# Calculate precision, recall, and F1 score
conf_matrix_tree <- confusionMatrix(y_pred_tree, test_data$diagnosis)
precision_tree <- conf_matrix_tree$byClass["Pos Pred Value"]
recall_tree <- conf_matrix_tree$byClass["Sensitivity"]
f1_score_tree <- 2 * (precision_tree * recall_tree) / (precision_tree + recall_tree)
print(paste("Precision (Decision Tree):", round(precision_tree, 4)))
## [1] "Precision (Decision Tree): 0.9717"
print(paste("Recall (Decision Tree):", round(recall_tree, 4)))
## [1] "Recall (Decision Tree): 0.9626"
print(paste("F1 Score (Decision Tree):", round(f1_score_tree, 4)))
## [1] "F1 Score (Decision Tree): 0.9671"
# Generate confusion matrix
conf_matrix_tree <- confusionMatrix(y_pred_tree, test_data$diagnosis)

# Print confusion matrix
print("Confusion Matrix (Decision Tree):")
## [1] "Confusion Matrix (Decision Tree):"
print(conf_matrix_tree)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 103   3
##          1   4  61
##                                           
##                Accuracy : 0.9591          
##                  95% CI : (0.9175, 0.9834)
##     No Information Rate : 0.6257          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9129          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9626          
##             Specificity : 0.9531          
##          Pos Pred Value : 0.9717          
##          Neg Pred Value : 0.9385          
##              Prevalence : 0.6257          
##          Detection Rate : 0.6023          
##    Detection Prevalence : 0.6199          
##       Balanced Accuracy : 0.9579          
##                                           
##        'Positive' Class : 0               
## 
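
The fitted tree itself is easy to inspect with rpart's built-in functions. A minimal sketch:

# A sketch: show the complexity table and draw the fitted tree
printcp(tree_model)
plot(tree_model, margin = 0.1)
text(tree_model, use.n = TRUE, cex = 0.8)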

3.5 Gaussian Naive Bayes

# Load libraries
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:coefplot':
## 
##     extractPath
# Train Gaussian Naive Bayes model
nb_model <- naiveBayes(diagnosis ~ ., data = train_data)

# Predict on test data
y_pred_nb <- predict(nb_model, newdata = test_data)

# Calculate accuracy
accuracy_nb <- sum(y_pred_nb == test_data$diagnosis) / nrow(test_data)
print(paste("Accuracy (Gaussian Naive Bayes):", round(accuracy_nb, 4)))
## [1] "Accuracy (Gaussian Naive Bayes): 0.9181"
# Calculate precision, recall, and F1 score
conf_matrix_nb <- confusionMatrix(y_pred_nb, test_data$diagnosis)
precision_nb <- conf_matrix_nb$byClass["Pos Pred Value"]
recall_nb <- conf_matrix_nb$byClass["Sensitivity"]
f1_score_nb <- 2 * (precision_nb * recall_nb) / (precision_nb + recall_nb)
print(paste("Precision (Gaussian Naive Bayes):", round(precision_nb, 4)))
## [1] "Precision (Gaussian Naive Bayes): 0.9189"
print(paste("Recall (Gaussian Naive Bayes):", round(recall_nb, 4)))
## [1] "Recall (Gaussian Naive Bayes): 0.9533"
print(paste("F1 Score (Gaussian Naive Bayes):", round(f1_score_nb, 4)))
## [1] "F1 Score (Gaussian Naive Bayes): 0.9358"
# Generate confusion matrix
conf_matrix_nb <- confusionMatrix(y_pred_nb, test_data$diagnosis)

# Print confusion matrix
print("Confusion Matrix (Gaussian Naive Bayes):")
## [1] "Confusion Matrix (Gaussian Naive Bayes):"
print(conf_matrix_nb)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 102   9
##          1   5  55
##                                           
##                Accuracy : 0.9181          
##                  95% CI : (0.8664, 0.9545)
##     No Information Rate : 0.6257          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.823           
##                                           
##  Mcnemar's Test P-Value : 0.4227          
##                                           
##             Sensitivity : 0.9533          
##             Specificity : 0.8594          
##          Pos Pred Value : 0.9189          
##          Neg Pred Value : 0.9167          
##              Prevalence : 0.6257          
##          Detection Rate : 0.5965          
##    Detection Prevalence : 0.6491          
##       Balanced Accuracy : 0.9063          
##                                           
##        'Positive' Class : 0               
## 
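
To see where the model is uncertain, the per-class posterior probabilities behind these predictions can be inspected. A minimal sketch; nb_probs is an illustrative name:

# A sketch: posterior class probabilities from the fitted naive Bayes model
nb_probs <- predict(nb_model, newdata = test_data, type = "raw")
head(round(nb_probs, 4))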

3.6 Models’ Comparison using the performance metrics

We compared the four models with the accuracy, precision, recall and F1 score values.

# Calculate evaluation metrics for each model
metrics_df <- data.frame(
  Model = c("Logistic Regression", "Decision Tree", "Random Forest", "Gaussian Naive Bayes"),
  Accuracy = c(accuracy_log, accuracy_tree, accuracy_rf, accuracy_nb),
  Precision = c(precision_log, precision_tree, precision_rf, precision_nb),
  Recall = c(recall_log, recall_tree, recall_rf, recall_nb),
  F1_Score = c(f1_score_log, f1_score_tree, f1_score_rf, f1_score_nb)
)


# Melt the dataframe for plotting
library(reshape2)
melted_metrics <- melt(metrics_df, id.vars = "Model")

model_colors <- c("Logistic Regression" = "blue", 
                  "Decision Tree" = "red", 
                  "Random Forest" = "green",
                  "Gaussian Naive Bayes" = "orange")

# Create grouped bar plots with angled x-axis labels and custom colors
ggplot(melted_metrics, aes(x = Model, y = value, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Model Comparison - Evaluation Metrics", x = "Model", y = "Value", fill = "") +
  scale_fill_manual(values = model_colors) +
  theme_minimal() +
  theme(legend.position = "right", axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~variable, scales = "free_y")
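
Alongside the plot, the same numbers can be shown as a compact table. A minimal sketch (knitr is available whenever the report is knitted):

# A sketch: the comparison metrics as a table
knitr::kable(metrics_df, digits = 4)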

4 Evaluation

4.1 Evaluating the Models

# F1_score, Recall, Precision, Accuracy Comparison

# Load the necessary packages
library(ggplot2)
library(reshape2)

# Create the data frame
data <- data.frame(
  Model = c("Random Forest", "Logistic Regression", "Decision Tree", "Naive Bayes"),
  Accuracy = c(0.9357, 0.9415, 0.9123, 0.9006),
  Precision = c(0.9286, 0.9219, 0.934, 0.9018),
  Recall = c(0.972, 0.9219, 0.952, 0.9439),
  F1_Score = c(0.9498, 0.9219, 0.9296, 0.9224)
)

# Converts a data frame from a wide format to a long format
data_long <- melt(data, id.vars = "Model", variable.name = "Metric", value.name = "Value")

# Create a stacked bar chart and add data labels
ggplot(data_long, aes(x = Metric, y = Value, fill = Model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(Value, 4)), position = position_stack(vjust = 0.5), size = 3) +
  theme_minimal() +
  labs(title = "Model Performance Metrics",
       x = "Performance Metric",
       y = "Value") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  • Accuracy is the most intuitive indicator of model quality: it is the proportion of correct predictions among all predictions and reflects the overall correctness of the model, so larger values are better. From the model analysis in the third part, Logistic Regression and Random Forest achieve relatively high accuracy.
  • Precision is the proportion of predicted positives that are actually positive; it measures how reliable the model’s positive predictions are. From the analysis above, Random Forest and Decision Tree have the larger values.
  • Recall is the proportion of actual positives that the model correctly identifies. In most cases a larger Recall is better, so Random Forest has the stronger ability to identify positive instances.
  • F1 Score, the harmonic mean of Precision and Recall, measures the accuracy of a binary classification model while balancing both quantities; the closer its value is to 1, the better the model. As shown in the figure above, Random Forest has the largest F1 Score, indicating that it performs better than the other three models.
  • Summary: Based on the above analysis of the four performance indicators, the Random Forest model performs well in both accuracy and precision, and is outstanding at identifying positive instances.
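
For reference, all four metrics can be derived from a single 2x2 confusion matrix, as in the sketch below, which reuses conf_matrix_log from Section 3.3 (rows are predictions, columns are actual labels, and class 1, malignant, is taken as positive):

# A sketch: accuracy, precision, recall and F1 from a 2x2 confusion table
cm <- conf_matrix_log            # table(y_pred_class, test_data$diagnosis)
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]
accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1)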

4.2 Cross Validation

4.2.1 K-fold Cross Validation

# Load necessary library
library(pROC)

train_control <- trainControl(method="cv", number=10)

# 1.Random Forest
rf_cv_model <- train(diagnosis~., data=train_data, trControl=train_control, method="rf")
rf_cv_model
## Random Forest 
## 
## 398 samples
##  15 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 359, 358, 358, 358, 358, 358, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9596795  0.9120115
##    8    0.9571154  0.9069542
##   15    0.9395513  0.8694974
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

The above results show that the model performs best in cross-validation when the number of features randomly sampled at each split (mtry) is 2. This model has a high accuracy and Kappa coefficient and is suitable for classifying new data.
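
The per-fold results behind this summary are kept on the returned train object. A one-line sketch:

# A sketch: inspect the individual fold accuracies for the chosen mtry
head(rf_cv_model$resample)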

# 2.Logistic Regression
tryCatch({
  log_cv_model <- train(diagnosis~., data=train_data, trControl=train_control, method="glm")
  log_cv_model
}, warning = function(w){
  print('Warning: glm.fit:fitted probabilities numerically 0 or 1 occurred.')
})
## [1] "Warning: glm.fit:fitted probabilities numerically 0 or 1 occurred."
  • This situation can be understood as a form of overfitting: during the search for the regression coefficients, the linear predictor for observations of one class (y = 1) is pushed very high while that for the other class (y = 0) is pushed very low.
  • Because our target variable (diagnosis) takes only the values 0 and 1, logistic regression often runs into this problem when the classes are perfectly, or almost perfectly, separable.
# 3.Decision Tree
tree_cv_model <- train(diagnosis~., data=train_data, trControl=train_control,method="rpart")
tree_cv_model
## CART 
## 
## 398 samples
##  15 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 358, 358, 359, 359, 358, 358, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.01013514  0.9373718  0.8659589
##   0.09459459  0.8969872  0.7740047
##   0.78378378  0.7207051  0.2719683
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01013514.

The above results show that the model performs best in cross-validation when the complexity parameter cp is 0.01013514. This model has a high accuracy and Kappa coefficient and is suitable for classifying new data.

# 4.Gaussian Naive Bayes
tryCatch({
  nb_cv_model <- train(diagnosis~., data=train_data, trControl=train_control,method="nb")
  nb_cv_model
}, warning = function(w){
  print('Warning: Numerical 0 probability for all classes with observation 1.')
})
## [1] "Warning: Numerical 0 probability for all classes with observation 1."

In the Gaussian Naive Bayes model, the warning “Numerical 0 probability for all classes with observation 1” usually means that, for a particular observation, the model computes a probability of 0 for every possible class. If the feature distributions differ significantly from a Gaussian, the model may not estimate the class probabilities accurately, resulting in a probability of 0 for all classes.
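
One way to relax the Gaussian assumption is the kernel-density variant of naive Bayes that caret's "nb" method (backed by the klaR package) exposes via its usekernel tuning parameter. A minimal sketch, assuming klaR is installed; nb_grid and nb_kde_model are illustrative names:

# A sketch: naive Bayes with kernel density estimates instead of Gaussians
nb_grid <- expand.grid(usekernel = TRUE, fL = 0, adjust = 1)
nb_kde_model <- train(diagnosis ~ ., data = train_data,
                      trControl = train_control, method = "nb",
                      tuneGrid = nb_grid)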

4.2.2 Leave-one-out Cross Validation (LOOCV)

train_control <- trainControl(method="LOOCV")

# 1.Random Forest
rf_loocv_model <- train(diagnosis~., data=train_data, trControl=train_control, method="rf")
rf_loocv_model
## Random Forest 
## 
## 398 samples
##  15 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 397, 397, 397, 397, 397, 397, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9572864  0.9071599
##    8    0.9472362  0.8865851
##   15    0.9472362  0.8868982
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# 2.Logistic Regression
tryCatch({
  log_loocv_model <- train(diagnosis~., data=train_data, trControl=train_control, method="glm")
  log_loocv_model
}, warning = function(w){
  print('Warning: glm.fit:fitted probabilities numerically 0 or 1 occurred.')
})
## [1] "Warning: glm.fit:fitted probabilities numerically 0 or 1 occurred."
# 3.Decision Tree
tree_loocv_model <- train(diagnosis~., data=train_data, trControl=train_control,method="rpart")
tree_loocv_model
## CART 
## 
## 398 samples
##  15 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 397, 397, 397, 397, 397, 397, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa      
##   0.01013514  0.9396985   0.87091892
##   0.09459459  0.9095477   0.80088938
##   0.78378378  0.6130653  -0.02984072
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01013514.
# 4.Gaussian Naive Bayes
tryCatch({
  nb_loocv_model <- train(diagnosis~., data=train_data, trControl=train_control,method="nb")
  nb_loocv_model
}, warning = function(w){
  print('Warning: Numerical 0 probability for all classes with observation 1.')
})
## [1] "Warning: Numerical 0 probability for all classes with observation 1."

4.2.3 Bootstrapped K-fold Cross Validation

train_control <- trainControl(method="boot", number=10)

# 1.Random Forest
rf_boot_model <- train(diagnosis~., data=train_data, trControl=train_control, method="rf")
rf_boot_model
## Random Forest 
## 
## 398 samples
##  15 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (10 reps) 
## Summary of sample sizes: 398, 398, 398, 398, 398, 398, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9611778  0.9150011
##    8    0.9502959  0.8918923
##   15    0.9392609  0.8678641
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# 2.Logistic Regression
tryCatch({
  log_boot_model <- train(diagnosis~., data=train_data, trControl=train_control, method="glm")
  log_boot_model
}, warning = function(w){
  print('Warning: glm.fit:fitted probabilities numerically 0 or 1 occurred.')
})
## [1] "Warning: glm.fit:fitted probabilities numerically 0 or 1 occurred."
# 3.Decision Tree
tree_boot_model <- train(diagnosis~., data=train_data, trControl=train_control,method="rpart")
tree_boot_model
## CART 
## 
## 398 samples
##  15 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (10 reps) 
## Summary of sample sizes: 398, 398, 398, 398, 398, 398, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.01013514  0.9208516  0.8277488
##   0.09459459  0.9012010  0.7802165
##   0.78378378  0.7874591  0.4512162
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01013514.
# 4.Gaussian Naive Bayes
tryCatch({
  nb_boot_model <- train(diagnosis~., data=train_data, trControl=train_control,method="nb")
  nb_boot_model
}, warning = function(w){
  print('Warning: Numerical 0 probability for all classes with some observations.')
})
## [1] "Warning: Numerical 0 probability for all classes with some observations."

The cross-validation results above suggest that the characteristics of this dataset are not well suited to the Logistic Regression and Naive Bayes models, so it is worth using more flexible models, such as random forests or decision trees, which may better handle complex relationships in the data.

4.3 Receiver Operating Characteristic(ROC)

# Load necessary libraries
library(caret)
library(pROC)

# Preparation
# Ensure the target variable is of factor type
train_data$diagnosis <- as.factor(train_data$diagnosis)

# View the levels of the target variable
levels(train_data$diagnosis)
## [1] "0" "1"
# If level names are invalid, use make.names to convert them
levels(train_data$diagnosis) <- make.names(levels(train_data$diagnosis))

# Check the levels again to ensure they are valid R variable names
levels(train_data$diagnosis)
## [1] "X0" "X1"
# 1.Random Forest
# Train the 'Random Forest' model using cross-validation

# Set K-fold cross-validation parameters
ctrl <- trainControl(method = "repeatedcv", number = 10, classProbs = TRUE, summaryFunction = twoClassSummary)
ctrl
## $method
## [1] "repeatedcv"
## 
## $number
## [1] 10
## 
## $repeats
## [1] 1
## 
## $search
## [1] "grid"
## 
## $p
## [1] 0.75
## 
## $initialWindow
## NULL
## 
## $horizon
## [1] 1
## 
## $fixedWindow
## [1] TRUE
## 
## $skip
## [1] 0
## 
## $verboseIter
## [1] FALSE
## 
## $returnData
## [1] TRUE
## 
## $returnResamp
## [1] "final"
## 
## $savePredictions
## [1] FALSE
## 
## $classProbs
## [1] TRUE
## 
## $summaryFunction
## function (data, lev = NULL, model = NULL) 
## {
##     if (length(lev) > 2) {
##         stop(paste("Your outcome has", length(lev), "levels. The twoClassSummary() function isn't appropriate."))
##     }
##     requireNamespaceQuietStop("pROC")
##     if (!all(levels(data[, "pred"]) == lev)) {
##         stop("levels of observed and predicted data do not match")
##     }
##     rocObject <- try(pROC::roc(data$obs, data[, lev[1]], direction = ">", 
##         quiet = TRUE), silent = TRUE)
##     rocAUC <- if (inherits(rocObject, "try-error")) 
##         NA
##     else rocObject$auc
##     out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"], 
##         lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
##     names(out) <- c("ROC", "Sens", "Spec")
##     out
## }
## <bytecode: 0x00000200149093b8>
## <environment: namespace:caret>
## 
## $selectionFunction
## [1] "best"
## 
## $preProcOptions
## $preProcOptions$thresh
## [1] 0.95
## 
## $preProcOptions$ICAcomp
## [1] 3
## 
## $preProcOptions$k
## [1] 5
## 
## $preProcOptions$freqCut
## [1] 19
## 
## $preProcOptions$uniqueCut
## [1] 10
## 
## $preProcOptions$cutoff
## [1] 0.9
## 
## 
## $sampling
## NULL
## 
## $index
## NULL
## 
## $indexOut
## NULL
## 
## $indexFinal
## NULL
## 
## $timingSamps
## [1] 0
## 
## $predictionBounds
## [1] FALSE FALSE
## 
## $seeds
## [1] NA
## 
## $adaptive
## $adaptive$min
## [1] 5
## 
## $adaptive$alpha
## [1] 0.05
## 
## $adaptive$method
## [1] "gls"
## 
## $adaptive$complete
## [1] TRUE
## 
## 
## $trim
## [1] FALSE
## 
## $allowParallel
## [1] TRUE
rf_Fit <- train(diagnosis ~ ., data = train_data, method = "rf", preProc = c("center", "scale"), trControl = ctrl, metric = "ROC")

# Print the results
print(rf_Fit)
## Random Forest 
## 
## 398 samples
##  15 predictor
##   2 classes: 'X0', 'X1' 
## 
## Pre-processing: centered (15), scaled (15) 
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 358, 358, 358, 358, 358, 358, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens   Spec     
##    2    0.9933905  0.980  0.9185714
##    8    0.9914095  0.972  0.9042857
##   15    0.9900571  0.948  0.9109524
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Plot the resampling profile (cross-validated ROC versus mtry)
ggplot(rf_Fit)

# 2.Decision Tree
# Train the 'Decision Tree' model using cross-validation                       
tree_Fit <- train(diagnosis ~ ., data = train_data, method = "rpart", preProc = c("center", "scale"), trControl = ctrl, metric = "ROC")

# Print the results
print(tree_Fit)
## CART 
## 
## 398 samples
##  15 predictor
##   2 classes: 'X0', 'X1' 
## 
## Pre-processing: centered (15), scaled (15) 
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 359, 358, 358, 358, 358, 359, ... 
## Resampling results across tuning parameters:
## 
##   cp          ROC        Sens   Spec     
##   0.01013514  0.9369333  0.952  0.8985714
##   0.09459459  0.8675333  0.952  0.7833333
##   0.78378378  0.6746190  0.984  0.3652381
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01013514.
# Plot the resampling profile (cross-validated ROC versus cp)
ggplot(tree_Fit)
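
ggplot() on a caret train object draws the tuning profile rather than a true ROC curve. An ROC curve on the held-out test set can be drawn with the already-loaded pROC package. A minimal sketch; rf_test_probs and roc_obj are illustrative names, and the probability column is "X1" because of the level renaming above:

# A sketch: a genuine test-set ROC curve for the tuned random forest
rf_test_probs <- predict(rf_Fit, newdata = test_data, type = "prob")[, "X1"]
roc_obj <- roc(response = test_data$diagnosis, predictor = rf_test_probs)
plot(roc_obj, main = paste("Random Forest ROC, AUC =", round(auc(roc_obj), 3)))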

5 Conclusion

Breast cancer is the most common cancer among women worldwide, accounting for about 25 percent of all cancer cases in women. The key challenge in detecting tumors is classifying them as malignant (cancerous) or benign (non-cancerous). We used data analysis models to classify these tumors, using a dataset from an open data platform. Starting from data cleaning, classification models were built and their performance evaluated to predict whether a tumor is malignant or benign.

In the data cleaning part, missing and duplicate values in the dataset are first handled, the target variable is identified, and useless features are removed. The categorical diagnosis is then converted into a binary classification target, and the correlations between the features and the target variable are explored.

The third part is machine learning modeling. The dataset is divided into train_data and test_data, and four mainstream models are selected to predict the data: Random Forest, Logistic Regression, Decision Tree, and Naive Bayes. Accuracy, Kappa, Sensitivity, and Specificity are used to measure the predictive performance of each model.

The fourth part is model evaluation. Based on the four machine learning models from the third part, the performance of each model is first estimated by comparing the values of Accuracy, Precision, Recall, and F1 Score; among them, the Random Forest model performs best. Then, through K-fold, LOOCV, and bootstrapped K-fold cross-validation, we conclude that Logistic Regression and Naive Bayes are not well suited to this dataset, while the Random Forest and Decision Tree models are clearly better.