Breast cancer is the most common cancer among women worldwide.
According to WHO statistics, breast cancer is one of the leading causes
of cancer death among women in Malaysia, with a mortality rate of 19.8
deaths per 100,000 female Malaysians.
[1] https://data.who.int/countries/458
Cancer Research Malaysia also states that, locally, 1 out of 4
Malaysians is affected by breast cancer, while globally 2.3 million
women were diagnosed with this cancer and 685,000 women died from the
disease in 2020, with rapid growth among Asian women.
[2] https://www.cancerresearch.my/our-work/breast-cancer/
At an event held at Thomson Hospital Kota Damansara, Dr Tan Gie
Hooi, a consultant breast and oncoplastic surgeon at the hospital,
cited Universiti Putra Malaysia (UPM)'s "Review of Breast Cancer in Young
Women", published in 2020, which reported that 13.6 per cent of women
in Malaysia are diagnosed with breast cancer before the age of 40, in
sharp contrast to the United States, where the rate in the same age
group is 5 per cent.
[3] https://codeblue.galencentre.org/2023/10/31/13-of-women-in-malaysia-diagnosed-with-breast-cancer-before-40/,
https://medic.upm.edu.my/upload/dokumen/2020120211081348_MJMHS_0472.pdf
These figures are alarming, and this is where our project comes into the picture.
We retrieved our dataset from Kaggle; the data originates from the
Wisconsin Breast Cancer Database (University of Wisconsin Hospitals,
Madison, USA, 1991).
- Data source: Breast Cancer Dataset @ Kaggle
- Dataset name: breast-cancer.csv (122 KB)
- The dataset is loaded into a dataframe named 'bc_df'
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(pryr)
##
## Attaching package: 'pryr'
## The following object is masked from 'package:dplyr':
##
## where
library(Rcpp)
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(caret)
## Loading required package: lattice
library(infotheo)
library(corrplot)
## corrplot 0.92 loaded
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(coefplot)
library(writexl)
library(scales)
# Adjust this path to the local location of breast-cancer.csv
bc_df <- read.csv("C:/Users/ahaen/Documents/breast-cancer.csv")
head(bc_df)
## id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1 842302 M 17.99 10.38 122.80 1001.0
## 2 842517 M 20.57 17.77 132.90 1326.0
## 3 84300903 M 19.69 21.25 130.00 1203.0
## 4 84348301 M 11.42 20.38 77.58 386.1
## 5 84358402 M 20.29 14.34 135.10 1297.0
## 6 843786 M 12.45 15.70 82.57 477.1
## smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1 0.11840 0.27760 0.3001 0.14710
## 2 0.08474 0.07864 0.0869 0.07017
## 3 0.10960 0.15990 0.1974 0.12790
## 4 0.14250 0.28390 0.2414 0.10520
## 5 0.10030 0.13280 0.1980 0.10430
## 6 0.12780 0.17000 0.1578 0.08089
## symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1 0.2419 0.07871 1.0950 0.9053 8.589
## 2 0.1812 0.05667 0.5435 0.7339 3.398
## 3 0.2069 0.05999 0.7456 0.7869 4.585
## 4 0.2597 0.09744 0.4956 1.1560 3.445
## 5 0.1809 0.05883 0.7572 0.7813 5.438
## 6 0.2087 0.07613 0.3345 0.8902 2.217
## area_se smoothness_se compactness_se concavity_se concave.points_se
## 1 153.40 0.006399 0.04904 0.05373 0.01587
## 2 74.08 0.005225 0.01308 0.01860 0.01340
## 3 94.03 0.006150 0.04006 0.03832 0.02058
## 4 27.23 0.009110 0.07458 0.05661 0.01867
## 5 94.44 0.011490 0.02461 0.05688 0.01885
## 6 27.19 0.007510 0.03345 0.03672 0.01137
## symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1 0.03003 0.006193 25.38 17.33 184.60
## 2 0.01389 0.003532 24.99 23.41 158.80
## 3 0.02250 0.004571 23.57 25.53 152.50
## 4 0.05963 0.009208 14.91 26.50 98.87
## 5 0.01756 0.005115 22.54 16.67 152.20
## 6 0.02165 0.005082 15.47 23.75 103.40
## area_worst smoothness_worst compactness_worst concavity_worst
## 1 2019.0 0.1622 0.6656 0.7119
## 2 1956.0 0.1238 0.1866 0.2416
## 3 1709.0 0.1444 0.4245 0.4504
## 4 567.7 0.2098 0.8663 0.6869
## 5 1575.0 0.1374 0.2050 0.4000
## 6 741.6 0.1791 0.5249 0.5355
## concave.points_worst symmetry_worst fractal_dimension_worst
## 1 0.2654 0.4601 0.11890
## 2 0.1860 0.2750 0.08902
## 3 0.2430 0.3613 0.08758
## 4 0.2575 0.6638 0.17300
## 5 0.1625 0.2364 0.07678
## 6 0.1741 0.3985 0.12440
glimpse(bc_df)
## Rows: 569
## Columns: 32
## $ id <int> 842302, 842517, 84300903, 84348301, 84358402, …
## $ diagnosis <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
## $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
## $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
## $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
## $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
## $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
## $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
## $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
## $ concave.points_mean <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
## $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
## $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
## $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
## $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
## $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
## $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
## $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
## $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
## $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
## $ concave.points_se <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
## $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
## $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
## $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
## $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
## $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
## $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
## $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
## $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
## $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
## $ concave.points_worst <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
## $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…
summary(bc_df[, !names(bc_df) %in% "id", drop = FALSE])
## diagnosis radius_mean texture_mean perimeter_mean
## Length:569 Min. : 6.981 Min. : 9.71 Min. : 43.79
## Class :character 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17
## Mode :character Median :13.370 Median :18.84 Median : 86.24
## Mean :14.127 Mean :19.29 Mean : 91.97
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10
## Max. :28.110 Max. :39.28 Max. :188.50
## area_mean smoothness_mean compactness_mean concavity_mean
## Min. : 143.5 Min. :0.05263 Min. :0.01938 Min. :0.00000
## 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492 1st Qu.:0.02956
## Median : 551.1 Median :0.09587 Median :0.09263 Median :0.06154
## Mean : 654.9 Mean :0.09636 Mean :0.10434 Mean :0.08880
## 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040 3rd Qu.:0.13070
## Max. :2501.0 Max. :0.16340 Max. :0.34540 Max. :0.42680
## concave.points_mean symmetry_mean fractal_dimension_mean radius_se
## Min. :0.00000 Min. :0.1060 Min. :0.04996 Min. :0.1115
## 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770 1st Qu.:0.2324
## Median :0.03350 Median :0.1792 Median :0.06154 Median :0.3242
## Mean :0.04892 Mean :0.1812 Mean :0.06280 Mean :0.4052
## 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612 3rd Qu.:0.4789
## Max. :0.20120 Max. :0.3040 Max. :0.09744 Max. :2.8730
## texture_se perimeter_se area_se smoothness_se
## Min. :0.3602 Min. : 0.757 Min. : 6.802 Min. :0.001713
## 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850 1st Qu.:0.005169
## Median :1.1080 Median : 2.287 Median : 24.530 Median :0.006380
## Mean :1.2169 Mean : 2.866 Mean : 40.337 Mean :0.007041
## 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190 3rd Qu.:0.008146
## Max. :4.8850 Max. :21.980 Max. :542.200 Max. :0.031130
## compactness_se concavity_se concave.points_se symmetry_se
## Min. :0.002252 Min. :0.00000 Min. :0.000000 Min. :0.007882
## 1st Qu.:0.013080 1st Qu.:0.01509 1st Qu.:0.007638 1st Qu.:0.015160
## Median :0.020450 Median :0.02589 Median :0.010930 Median :0.018730
## Mean :0.025478 Mean :0.03189 Mean :0.011796 Mean :0.020542
## 3rd Qu.:0.032450 3rd Qu.:0.04205 3rd Qu.:0.014710 3rd Qu.:0.023480
## Max. :0.135400 Max. :0.39600 Max. :0.052790 Max. :0.078950
## fractal_dimension_se radius_worst texture_worst perimeter_worst
## Min. :0.0008948 Min. : 7.93 Min. :12.02 Min. : 50.41
## 1st Qu.:0.0022480 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11
## Median :0.0031870 Median :14.97 Median :25.41 Median : 97.66
## Mean :0.0037949 Mean :16.27 Mean :25.68 Mean :107.26
## 3rd Qu.:0.0045580 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40
## Max. :0.0298400 Max. :36.04 Max. :49.54 Max. :251.20
## area_worst smoothness_worst compactness_worst concavity_worst
## Min. : 185.2 Min. :0.07117 Min. :0.02729 Min. :0.0000
## 1st Qu.: 515.3 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145
## Median : 686.5 Median :0.13130 Median :0.21190 Median :0.2267
## Mean : 880.6 Mean :0.13237 Mean :0.25427 Mean :0.2722
## 3rd Qu.:1084.0 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829
## Max. :4254.0 Max. :0.22260 Max. :1.05800 Max. :1.2520
## concave.points_worst symmetry_worst fractal_dimension_worst
## Min. :0.00000 Min. :0.1565 Min. :0.05504
## 1st Qu.:0.06493 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.09993 Median :0.2822 Median :0.08004
## Mean :0.11461 Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.16140 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.29100 Max. :0.6638 Max. :0.20750
# Calculate the total number of missing values
total_missing_values <- sum(is.na(bc_df))
print(paste("Total missing values :", total_missing_values))
## [1] "Total missing values : 0"
# Count the total number of duplicate rows
total_duplicate_rows <- sum(duplicated(bc_df))
print(paste("Total number of duplicate rows :", total_duplicate_rows))
## [1] "Total number of duplicate rows : 0"
There are no missing values and no duplicate rows in the dataset.
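A quick per-column check (a sketch) confirms that no single attribute hides missing entries behind the aggregate count above:
# Per-column missing-value counts (all should be 0)
colSums(is.na(bc_df))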
cols <- c("#008b8b", "#cd853f")
plt <- ggplot(bc_df, aes(x = factor(diagnosis))) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 0, hjust = 1)) +
labs(x = "", y = "Count") +
geom_bar(fill = cols) +
stat_count(geom = "text", aes(label = paste0(round(prop.table(after_stat(count)) * 100, 2), "%")),position = position_stack(vjust = 0.5), size = 4) + ggtitle("Counts and rates of malignant and benign tumors")
print(plt)
# print the count of B and M
print(table(bc_df$diagnosis))
##
## B M
## 357 212
Many machine learning models expect a numeric binary target for classification. To make the data easier for the models to understand and process, the diagnosis column is recoded from "M" (malignant) to 1 and "B" (benign) to 0.
bc_df$diagnosis <- ifelse(bc_df$diagnosis == "M", 1, 0)
head(bc_df)
## id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1 842302 1 17.99 10.38 122.80 1001.0
## 2 842517 1 20.57 17.77 132.90 1326.0
## 3 84300903 1 19.69 21.25 130.00 1203.0
## 4 84348301 1 11.42 20.38 77.58 386.1
## 5 84358402 1 20.29 14.34 135.10 1297.0
## 6 843786 1 12.45 15.70 82.57 477.1
## smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1 0.11840 0.27760 0.3001 0.14710
## 2 0.08474 0.07864 0.0869 0.07017
## 3 0.10960 0.15990 0.1974 0.12790
## 4 0.14250 0.28390 0.2414 0.10520
## 5 0.10030 0.13280 0.1980 0.10430
## 6 0.12780 0.17000 0.1578 0.08089
## symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1 0.2419 0.07871 1.0950 0.9053 8.589
## 2 0.1812 0.05667 0.5435 0.7339 3.398
## 3 0.2069 0.05999 0.7456 0.7869 4.585
## 4 0.2597 0.09744 0.4956 1.1560 3.445
## 5 0.1809 0.05883 0.7572 0.7813 5.438
## 6 0.2087 0.07613 0.3345 0.8902 2.217
## area_se smoothness_se compactness_se concavity_se concave.points_se
## 1 153.40 0.006399 0.04904 0.05373 0.01587
## 2 74.08 0.005225 0.01308 0.01860 0.01340
## 3 94.03 0.006150 0.04006 0.03832 0.02058
## 4 27.23 0.009110 0.07458 0.05661 0.01867
## 5 94.44 0.011490 0.02461 0.05688 0.01885
## 6 27.19 0.007510 0.03345 0.03672 0.01137
## symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1 0.03003 0.006193 25.38 17.33 184.60
## 2 0.01389 0.003532 24.99 23.41 158.80
## 3 0.02250 0.004571 23.57 25.53 152.50
## 4 0.05963 0.009208 14.91 26.50 98.87
## 5 0.01756 0.005115 22.54 16.67 152.20
## 6 0.02165 0.005082 15.47 23.75 103.40
## area_worst smoothness_worst compactness_worst concavity_worst
## 1 2019.0 0.1622 0.6656 0.7119
## 2 1956.0 0.1238 0.1866 0.2416
## 3 1709.0 0.1444 0.4245 0.4504
## 4 567.7 0.2098 0.8663 0.6869
## 5 1575.0 0.1374 0.2050 0.4000
## 6 741.6 0.1791 0.5249 0.5355
## concave.points_worst symmetry_worst fractal_dimension_worst
## 1 0.2654 0.4601 0.11890
## 2 0.1860 0.2750 0.08902
## 3 0.2430 0.3613 0.08758
## 4 0.2575 0.6638 0.17300
## 5 0.1625 0.2364 0.07678
## 6 0.1741 0.3985 0.12440
Based on the data structure listed above, one attribute is unnecessary: the "id" column, which has no bearing on whether the outcome is malignant or benign. Since it carries no predictive information, it is dropped before modelling.
bc_df1 <- bc_df[, !(names(bc_df) %in% c("id"))]
# Calculate correlation matrix
corr <- cor(bc_df1)
# Create mask for upper triangle
mask <- upper.tri(corr)
# Set upper triangle values to NA
diag(corr) <- NA
corr[mask] <- NA
# Convert correlation matrix to long format
corr_long <- melt(corr, na.rm = TRUE)
print(head(corr_long, 20))
## Var1 Var2 value
## 2 radius_mean diagnosis 0.730028511
## 3 texture_mean diagnosis 0.415185300
## 4 perimeter_mean diagnosis 0.742635530
## 5 area_mean diagnosis 0.708983837
## 6 smoothness_mean diagnosis 0.358559965
## 7 compactness_mean diagnosis 0.596533678
## 8 concavity_mean diagnosis 0.696359707
## 9 concave.points_mean diagnosis 0.776613840
## 10 symmetry_mean diagnosis 0.330498554
## 11 fractal_dimension_mean diagnosis -0.012837603
## 12 radius_se diagnosis 0.567133821
## 13 texture_se diagnosis -0.008303333
## 14 perimeter_se diagnosis 0.556140703
## 15 area_se diagnosis 0.548235940
## 16 smoothness_se diagnosis -0.067016011
## 17 compactness_se diagnosis 0.292999244
## 18 concavity_se diagnosis 0.253729766
## 19 concave.points_se diagnosis 0.408042333
## 20 symmetry_se diagnosis -0.006521756
## 21 fractal_dimension_se diagnosis 0.077972417
# Plot heatmap
ggplot(corr_long, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "#008b8b", high = "#cd853f", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 1, size = 8, hjust = 1),
axis.text.y = element_text(angle = 0, vjust = 0.5, hjust = 0.5, size = 8)) +
coord_fixed() +
labs(title = "Correlation between features", x = "Variables", y = "Variables")
# Find features to drop (any pairwise correlation above 0.9)
to_drop <- colnames(corr)[apply(corr > 0.9, 2, any, na.rm = TRUE)]
# Drop features
bc_df2 <- bc_df1[, !(names(bc_df1) %in% to_drop)]
# Print remaining number of features
print(colnames(bc_df2))
## [1] "diagnosis" "smoothness_mean"
## [3] "compactness_mean" "symmetry_mean"
## [5] "fractal_dimension_mean" "texture_se"
## [7] "area_se" "smoothness_se"
## [9] "compactness_se" "concavity_se"
## [11] "concave.points_se" "symmetry_se"
## [13] "fractal_dimension_se" "texture_worst"
## [15] "area_worst" "smoothness_worst"
## [17] "compactness_worst" "concavity_worst"
## [19] "concave.points_worst" "symmetry_worst"
## [21] "fractal_dimension_worst"
Dropping highly correlated columns
The main purpose of removing highly correlated variables is to avoid
multicollinearity. Deleting them improves the interpretability,
stability and performance of the model. During feature selection,
retaining features that correlate strongly with the target variable but
weakly with one another helps to build a more reliable model.
Features with a pairwise correlation greater than 0.9 were deleted,
leaving 20 features.
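caret (already loaded) ships a helper for exactly this filter; a minimal sketch, noting that its greedy mean-correlation heuristic may pick a slightly different drop set than the manual pass above:
# findCorrelation() flags columns whose pairwise correlation exceeds the cutoff
feature_corr <- cor(bc_df1[, names(bc_df1) != "diagnosis"])
high_corr <- findCorrelation(feature_corr, cutoff = 0.9, names = TRUE)
bc_df2_alt <- bc_df1[, !(names(bc_df1) %in% high_corr)]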
cor_matrix <- cor(bc_df2)
diagnosis_corr <- cor_matrix["diagnosis", ]
diagnosis_corr <- diagnosis_corr[names(diagnosis_corr) != "diagnosis"]
sorted_corr <- sort(abs(diagnosis_corr), decreasing = FALSE)
print(sorted_corr)
## symmetry_se texture_se fractal_dimension_mean
## 0.006521756 0.008303333 0.012837603
## smoothness_se fractal_dimension_se concavity_se
## 0.067016011 0.077972417 0.253729766
## compactness_se fractal_dimension_worst symmetry_mean
## 0.292999244 0.323872189 0.330498554
## smoothness_mean concave.points_se symmetry_worst
## 0.358559965 0.408042333 0.416294311
## smoothness_worst texture_worst area_se
## 0.421464861 0.456902821 0.548235940
## compactness_worst compactness_mean concavity_worst
## 0.590998238 0.596533678 0.659610210
## area_worst concave.points_worst
## 0.733825035 0.793566017
par(mar = c(5, 10, 4, 2) + 0.1)
barplot(sorted_corr,
names.arg = names(sorted_corr),
col = "#008b8b",
horiz = TRUE,
main = "Correlation between features and target variable",
xlab = "Correlation",
cex.names = 0.7,
las = 2,
xlim = c(0, 1))
# Find features to drop
to_drop <- names(diagnosis_corr)[abs(diagnosis_corr) < 0.1]
# Remove features with correlation less than 0.1
bc_df3 <- bc_df2[, !(names(bc_df2) %in% to_drop)]
# Print remaining number of features
print(colnames(bc_df3))
## [1] "diagnosis" "smoothness_mean"
## [3] "compactness_mean" "symmetry_mean"
## [5] "area_se" "compactness_se"
## [7] "concavity_se" "concave.points_se"
## [9] "texture_worst" "area_worst"
## [11] "smoothness_worst" "compactness_worst"
## [13] "concavity_worst" "concave.points_worst"
## [15] "symmetry_worst" "fractal_dimension_worst"
Features whose correlation with the target variable is below 0.1 were deleted, leaving 15 features.
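A quick dimension check (sketch) confirms the reduced feature set:
dim(bc_df3) # 569 rows, 16 columns (15 features plus diagnosis)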
For modelling, we imported the necessary packages, especially the ML classification models from the R libraries. We trained and tested four classification models: Random Forest, Logistic Regression, Decision Tree and Gaussian Naive Bayes.
The dataset was divided into training and testing sets, with 70% allocated for training and 30% for testing. The features (x_train, x_test) were separated from their labels (y_train, y_test), and the number of feature columns was checked to confirm the shape of the data.
library(caTools)
# Splitting train 70% and test 30%
# Creating training and testing sets
set.seed(42) # fix the random seed so the 70/30 split is reproducible
split = sample.split(bc_df3$diagnosis, SplitRatio = 0.7)
train_data = subset(bc_df3, split == TRUE)
test_data = subset(bc_df3, split == FALSE)
# Separate features and target variable for training set
x_train = subset(train_data, select = -diagnosis)
y_train = train_data$diagnosis
length(x_train) # length() on a data frame returns its number of columns
## [1] 15
# Separate features and target variable for testing set
x_test = subset(test_data, select = -diagnosis)
y_test = test_data$diagnosis
The “diagnosis” variable in both the training and test datasets is converted into a factor variable. In R, a factor is used to represent categorical data, where each category is represented by a level. Converting the “diagnosis” variable to a factor indicates that it contains categorical information rather than numerical.
train_data$diagnosis <- factor(train_data$diagnosis)
test_data$diagnosis <- factor(test_data$diagnosis)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Train the model
set.seed(42)
rf_model <- randomForest(diagnosis ~ ., data = train_data, importance = TRUE)
# Print model summary
print(rf_model)
##
## Call:
## randomForest(formula = diagnosis ~ ., data = train_data, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 4.77%
## Confusion matrix:
## 0 1 class.error
## 0 245 5 0.02000000
## 1 14 134 0.09459459
# Predict on test data (before tuning)
y_pred <- predict(rf_model, newdata = test_data)
accuracy_rf <- sum(y_pred == test_data$diagnosis) / nrow(test_data)
print(paste("Accuracy:", round(accuracy_rf, 4)))
## [1] "Accuracy: 0.9708"
# Generate confusion matrix
conf_matrix_rf <- confusionMatrix(y_pred, test_data$diagnosis)
# Print confusion matrix
print(conf_matrix_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 106 4
## 1 1 60
##
## Accuracy : 0.9708
## 95% CI : (0.9331, 0.9904)
## No Information Rate : 0.6257
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.937
##
## Mcnemar's Test P-Value : 0.3711
##
## Sensitivity : 0.9907
## Specificity : 0.9375
## Pos Pred Value : 0.9636
## Neg Pred Value : 0.9836
## Prevalence : 0.6257
## Detection Rate : 0.6199
## Detection Prevalence : 0.6433
## Balanced Accuracy : 0.9641
##
## 'Positive' Class : 0
##
# Calculate precision, recall, and F1 score
# Note: caret takes the first factor level ("0" = benign) as the positive class,
# so these metrics are reported with respect to the benign class
precision_rf <- conf_matrix_rf$byClass["Pos Pred Value"]
recall_rf <- conf_matrix_rf$byClass["Sensitivity"]
f1_score_rf <- 2 * (precision_rf * recall_rf) / (precision_rf + recall_rf)
# Print F1 score
print(paste("F1 Score:", round(f1_score_rf, 4)))
## [1] "F1 Score: 0.977"
print(paste("Recall:", round(recall_rf, 4)))
## [1] "Recall: 0.9907"
print(paste("Precision:", round(precision_rf, 4)))
## [1] "Precision: 0.9636"
# Load libraries
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
# Train logistic regression model
log_model <- glm(diagnosis ~ ., data = train_data, family = "binomial", maxit = 1000)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Predict on test data
y_pred_log <- predict(log_model, newdata = test_data, type = "response")
y_pred_class <- ifelse(y_pred_log > 0.5, 1, 0)
# Calculate accuracy
accuracy_log <- sum(y_pred_class == test_data$diagnosis) / nrow(test_data)
print(paste("Accuracy (Logistic Regression):", round(accuracy_log, 4)))
## [1] "Accuracy (Logistic Regression): 0.9766"
# Calculate precision
precision_log <- sum(y_pred_class[test_data$diagnosis == 1] == 1) / sum(y_pred_class == 1)
print(paste("Precision (Logistic Regression):", round(precision_log, 4)))
## [1] "Precision (Logistic Regression): 0.9839"
# Calculate recall
recall_log <- sum(y_pred_class[test_data$diagnosis == 1] == 1) / sum(test_data$diagnosis == 1)
print(paste("Recall (Logistic Regression):", round(recall_log, 4)))
## [1] "Recall (Logistic Regression): 0.9531"
# Calculate F1 score
f1_score_log <- 2 * (precision_log * recall_log) / (precision_log + recall_log)
print(paste("F1 Score (Logistic Regression):", round(f1_score_log, 4)))
## [1] "F1 Score (Logistic Regression): 0.9683"
# Generate confusion matrix
conf_matrix_log <- table(y_pred_class, test_data$diagnosis)
# Print confusion matrix
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(conf_matrix_log)
##
## y_pred_class 0 1
## 0 106 3
## 1 1 61
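The 0.5 cutoff used above is only the default threshold; as a sketch, pROC (loaded earlier) summarises the model's performance across all thresholds:
# ROC curve and AUC for the logistic model's predicted probabilities
roc_log <- roc(test_data$diagnosis, y_pred_log, quiet = TRUE)
auc(roc_log)
plot(roc_log, main = "Logistic Regression ROC curve")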
# Load libraries
library(rpart)
# Train decision tree model
tree_model <- rpart(diagnosis ~ ., data = train_data, method = "class")
# Predict on test data
y_pred_tree <- predict(tree_model, newdata = test_data, type = "class")
# Calculate accuracy
accuracy_tree <- sum(y_pred_tree == test_data$diagnosis) / nrow(test_data)
print(paste("Accuracy (Decision Tree):", round(accuracy_tree, 4)))
## [1] "Accuracy (Decision Tree): 0.9591"
# Calculate precision, recall, and F1 score
conf_matrix_tree <- confusionMatrix(y_pred_tree, test_data$diagnosis)
precision_tree <- conf_matrix_tree$byClass["Pos Pred Value"]
recall_tree <- conf_matrix_tree$byClass["Sensitivity"]
f1_score_tree <- 2 * (precision_tree * recall_tree) / (precision_tree + recall_tree)
print(paste("Precision (Decision Tree):", round(precision_tree, 4)))
## [1] "Precision (Decision Tree): 0.9717"
print(paste("Recall (Decision Tree):", round(recall_tree, 4)))
## [1] "Recall (Decision Tree): 0.9626"
print(paste("F1 Score (Decision Tree):", round(f1_score_tree, 4)))
## [1] "F1 Score (Decision Tree): 0.9671"
# Generate confusion matrix
conf_matrix_tree <- confusionMatrix(y_pred_tree, test_data$diagnosis)
# Print confusion matrix
print("Confusion Matrix (Decision Tree):")
## [1] "Confusion Matrix (Decision Tree):"
print(conf_matrix_tree)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 103 3
## 1 4 61
##
## Accuracy : 0.9591
## 95% CI : (0.9175, 0.9834)
## No Information Rate : 0.6257
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9129
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9626
## Specificity : 0.9531
## Pos Pred Value : 0.9717
## Neg Pred Value : 0.9385
## Prevalence : 0.6257
## Detection Rate : 0.6023
## Detection Prevalence : 0.6199
## Balanced Accuracy : 0.9579
##
## 'Positive' Class : 0
##
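A sketch to visualise the fitted tree with base rpart plotting (the rpart.plot package, if installed, gives a prettier rendering):
# Draw the tree structure with split labels and per-node class counts
plot(tree_model, margin = 0.1)
text(tree_model, use.n = TRUE, cex = 0.8)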
# Load libraries
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:coefplot':
##
## extractPath
# Train Gaussian Naive Bayes model
nb_model <- naiveBayes(diagnosis ~ ., data = train_data)
# Predict on test data
y_pred_nb <- predict(nb_model, newdata = test_data)
# Calculate accuracy
accuracy_nb <- sum(y_pred_nb == test_data$diagnosis) / nrow(test_data)
print(paste("Accuracy (Gaussian Naive Bayes):", round(accuracy_nb, 4)))
## [1] "Accuracy (Gaussian Naive Bayes): 0.9181"
# Calculate precision, recall, and F1 score
conf_matrix_nb <- confusionMatrix(y_pred_nb, test_data$diagnosis)
precision_nb <- conf_matrix_nb$byClass["Pos Pred Value"]
recall_nb <- conf_matrix_nb$byClass["Sensitivity"]
f1_score_nb <- 2 * (precision_nb * recall_nb) / (precision_nb + recall_nb)
print(paste("Precision (Gaussian Naive Bayes):", round(precision_nb, 4)))
## [1] "Precision (Gaussian Naive Bayes): 0.9189"
print(paste("Recall (Gaussian Naive Bayes):", round(recall_nb, 4)))
## [1] "Recall (Gaussian Naive Bayes): 0.9533"
print(paste("F1 Score (Gaussian Naive Bayes):", round(f1_score_nb, 4)))
## [1] "F1 Score (Gaussian Naive Bayes): 0.9358"
# Generate confusion matrix
conf_matrix_nb <- confusionMatrix(y_pred_nb, test_data$diagnosis)
# Print confusion matrix
print("Confusion Matrix (Gaussian Naive Bayes):")
## [1] "Confusion Matrix (Gaussian Naive Bayes):"
print(conf_matrix_nb)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 102 9
## 1 5 55
##
## Accuracy : 0.9181
## 95% CI : (0.8664, 0.9545)
## No Information Rate : 0.6257
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.823
##
## Mcnemar's Test P-Value : 0.4227
##
## Sensitivity : 0.9533
## Specificity : 0.8594
## Pos Pred Value : 0.9189
## Neg Pred Value : 0.9167
## Prevalence : 0.6257
## Detection Rate : 0.5965
## Detection Prevalence : 0.6491
## Balanced Accuracy : 0.9063
##
## 'Positive' Class : 0
##
We compared the four models using their accuracy, precision, recall and F1 score values.
# Calculate evaluation metrics for each model
metrics_df <- data.frame(
Model = c("Logistic Regression", "Decision Tree", "Random Forest", "Gaussian Naive Bayes"),
Accuracy = c(accuracy_log, accuracy_tree, accuracy_rf, accuracy_nb),
Precision = c(precision_log, precision_tree, precision_rf, precision_nb),
Recall = c(recall_log, recall_tree, recall_rf, recall_nb),
F1_Score = c(f1_score_log, f1_score_tree, f1_score_rf, f1_score_nb)
)
# Melt the dataframe for plotting
library(reshape2)
melted_metrics <- melt(metrics_df, id.vars = "Model")
model_colors <- c("Logistic Regression" = "blue",
"Decision Tree" = "red",
"Random Forest" = "green",
"Gaussian Naive Bayes" = "orange")
# Create grouped bar plot with angled x-axis labels and custom colors
ggplot(melted_metrics, aes(x = Model, y = value, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Model Comparison - Evaluation Metrics", x = "Model", y = "Value", fill = "") +
  scale_fill_manual(values = model_colors) +
  theme_minimal() +
  theme(legend.position = "right", axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~variable, scales = "free_y")
# F1_score, Recall, Precision, Accuracy Comparison
# Load the necessary packages
library(ggplot2)
library(reshape2)
# Create the data frame
# Note: the metric values below are hard-coded (apparently from a separate run)
# rather than taken from the metric objects computed above
data <- data.frame(
Model = c("Random Forest", "Logistic Regression", "Decision Tree", "Naive Bayes"),
Accuracy = c(0.9357, 0.9415, 0.9123, 0.9006),
Precision = c(0.9286, 0.9219, 0.934, 0.9018),
Recall = c(0.972, 0.9219, 0.952, 0.9439),
F1_Score = c(0.9498, 0.9219, 0.9296, 0.9224)
)
# Convert the data frame from wide format to long format
data_long <- melt(data, id.vars = "Model", variable.name = "Metric", value.name = "Value")
# Create a stacked bar chart and add data labels
ggplot(data_long, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = round(Value, 4)), position = position_stack(vjust = 0.5), size = 3) +
theme_minimal() +
labs(title = "Model Performance Metrics",
x = "Performance Metric",
y = "Value") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Load necessary library
library(pROC)
train_control <- trainControl(method="cv", number=10)
# 1.Random Forest
rf_cv_model <- train(diagnosis~., data=train_data, trControl=train_control, method="rf")
rf_cv_model
## Random Forest
##
## 398 samples
## 15 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 359, 358, 358, 358, 358, 358, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9596795 0.9120115
## 8 0.9571154 0.9069542
## 15 0.9395513 0.8694974
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
The above results show that the model performs best in cross-validation when the number of features randomly sampled at each split (mtry) is 2. This model has a high accuracy and Kappa coefficient and is suitable for classifying new data.
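By default caret tries only three mtry values; a sketch (the grid below is an assumption, not part of the original run) of widening the search with an explicit tuneGrid:
# Evaluate a custom set of mtry values under the same 10-fold CV
rf_grid <- expand.grid(mtry = c(2, 3, 4, 6, 8))
rf_cv_tuned <- train(diagnosis ~ ., data = train_data,
                     trControl = train_control, method = "rf",
                     tuneGrid = rf_grid)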
# 2.Logistic Regression
tryCatch({
log_cv_model <- train(diagnosis~., data=train_data, trControl=train_control, method="glm")
log_cv_model
}, warning = function(w){
print('Warning: glm.fit:fitted probabilities numerically 0 or 1 occurred.')
})
## [1] "Warning: glm.fit:fitted probabilities numerically 0 or 1 occurred."
# 3.Decision Tree
tree_cv_model <- train(diagnosis~., data=train_data, trControl=train_control,method="rpart")
tree_cv_model
## CART
##
## 398 samples
## 15 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 358, 358, 359, 359, 358, 358, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01013514 0.9373718 0.8659589
## 0.09459459 0.8969872 0.7740047
## 0.78378378 0.7207051 0.2719683
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01013514.
The above results show that the model performs best in cross-validation
when the complexity parameter cp is 0.01013514. This model has a high
accuracy and Kappa coefficient and is suitable for classifying new
data.
# 4.Gaussian Naive Bayes
tryCatch({
nb_cv_model <- train(diagnosis~., data=train_data, trControl=train_control,method="nb")
nb_cv_model
}, warning = function(w){
print('Warning: Numerical 0 probability for all classes with observation 1.')
})
## [1] "Warning: Numerical 0 probability for all classes with observation 1."
In the Gaussian Naive Bayes model, the warning "Numerical 0 probability for all classes with observation 1" usually means that, for a particular observation, the model assigns a probability of 0 to every class. If the distribution of a feature differs significantly from a Gaussian, the estimated class-conditional densities can underflow to zero, leaving every class with probability 0.
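One common remedy is to drop the Gaussian assumption in favour of kernel density estimates; a sketch assuming the klaR package (which caret's "nb" method wraps) is installed:
library(klaR)
# usekernel = TRUE fits a kernel density per feature instead of a Gaussian;
# extreme observations may still trigger the same warning
nb_kernel <- NaiveBayes(diagnosis ~ ., data = train_data, usekernel = TRUE)
pred_kernel <- predict(nb_kernel, newdata = test_data)
mean(pred_kernel$class == test_data$diagnosis) # test accuracy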
train_control <- trainControl(method="LOOCV")
# 1.Random Forest
rf_loocv_model <- train(diagnosis~., data=train_data, trControl=train_control, method="rf")
rf_loocv_model
## Random Forest
##
## 398 samples
## 15 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 397, 397, 397, 397, 397, 397, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9572864 0.9071599
## 8 0.9472362 0.8865851
## 15 0.9472362 0.8868982
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# 2.Logistic Regression
tryCatch({
log_loocv_model <- train(diagnosis~., data=train_data, trControl=train_control, method="glm")
log_loocv_model
}, warning = function(w){
print('Warning: glm.fit:fitted probabilities numerically 0 or 1 occurred.')
})
## [1] "Warning: glm.fit:fitted probabilities numerically 0 or 1 occurred."
# 3.Decision Tree
tree_loocv_model <- train(diagnosis~., data=train_data, trControl=train_control,method="rpart")
tree_loocv_model
## CART
##
## 398 samples
## 15 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 397, 397, 397, 397, 397, 397, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01013514 0.9396985 0.87091892
## 0.09459459 0.9095477 0.80088938
## 0.78378378 0.6130653 -0.02984072
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01013514.
# 4.Gaussian Naive Bayes
tryCatch({
nb_loocv_model <- train(diagnosis~., data=train_data, trControl=train_control,method="nb")
nb_loocv_model
}, warning = function(w){
print('Warning: Numerical 0 probability for all classes with observation 1.')
})
## [1] "Warning: Numerical 0 probability for all classes with observation 1."
train_control <- trainControl(method="boot", number=10)
# 1.Random Forest
rf_boot_model <- train(diagnosis~., data=train_data, trControl=train_control, method="rf")
rf_boot_model
## Random Forest
##
## 398 samples
## 15 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (10 reps)
## Summary of sample sizes: 398, 398, 398, 398, 398, 398, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9611778 0.9150011
## 8 0.9502959 0.8918923
## 15 0.9392609 0.8678641
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# 2.Logistic Regression
tryCatch({
log_boot_model <- train(diagnosis~., data=train_data, trControl=train_control, method="glm")
log_boot_model
}, warning = function(w){
print('Warning: glm.fit:fitted probabilities numerically 0 or 1 occurred.')
})
## [1] "Warning: glm.fit:fitted probabilities numerically 0 or 1 occurred."
# 3.Decision Tree
tree_boot_model <- train(diagnosis~., data=train_data, trControl=train_control,method="rpart")
tree_boot_model
## CART
##
## 398 samples
## 15 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (10 reps)
## Summary of sample sizes: 398, 398, 398, 398, 398, 398, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01013514 0.9208516 0.8277488
## 0.09459459 0.9012010 0.7802165
## 0.78378378 0.7874591 0.4512162
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01013514.
# 4.Gaussian Naive Bayes
tryCatch({
nb_boot_model <- train(diagnosis~., data=train_data, trControl=train_control,method="nb")
nb_boot_model
}, warning = function(w){
print('Warning: Numerical 0 probability for all classes with some observations.')
})
## [1] "Warning: Numerical 0 probability for all classes with some observations."
According to the above cross-validation results, the characteristics of this dataset are not well suited to the Logistic Regression and Naive Bayes models, so it is worth turning to more flexible models, such as random forests or decision trees, which may better handle complex relationships in the data.
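caret can also compare the fitted CV models' resampling distributions directly; a sketch using the Random Forest and decision tree objects from above (for a strictly paired comparison the two models should share fold indices, e.g. via trainControl(index = ...)):
# Collect resampling results and compare Accuracy/Kappa side by side
cv_results <- resamples(list(RandomForest = rf_cv_model, DecisionTree = tree_cv_model))
summary(cv_results)
bwplot(cv_results) # lattice box-and-whisker comparison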
# Load necessary libraries
library(caret)
library(pROC)
# Preparation
# Ensure the target variable is of factor type
train_data$diagnosis <- as.factor(train_data$diagnosis)
# View the levels of the target variable
levels(train_data$diagnosis)
## [1] "0" "1"
# If level names are invalid, use make.names to convert them
levels(train_data$diagnosis) <- make.names(levels(train_data$diagnosis))
# Check the levels again to ensure they are valid R variable names
levels(train_data$diagnosis)
## [1] "X0" "X1"
# 1.Random Forest
# Train the 'Random Forest' model using cross-validation
# Set K-fold cross-validation parameters
ctrl <- trainControl(method = "repeatedcv", number = 10, classProbs = TRUE, summaryFunction = twoClassSummary)
ctrl
## $method
## [1] "repeatedcv"
##
## $number
## [1] 10
##
## $repeats
## [1] 1
##
## $search
## [1] "grid"
##
## $p
## [1] 0.75
##
## $initialWindow
## NULL
##
## $horizon
## [1] 1
##
## $fixedWindow
## [1] TRUE
##
## $skip
## [1] 0
##
## $verboseIter
## [1] FALSE
##
## $returnData
## [1] TRUE
##
## $returnResamp
## [1] "final"
##
## $savePredictions
## [1] FALSE
##
## $classProbs
## [1] TRUE
##
## $summaryFunction
## function (data, lev = NULL, model = NULL)
## {
## if (length(lev) > 2) {
## stop(paste("Your outcome has", length(lev), "levels. The twoClassSummary() function isn't appropriate."))
## }
## requireNamespaceQuietStop("pROC")
## if (!all(levels(data[, "pred"]) == lev)) {
## stop("levels of observed and predicted data do not match")
## }
## rocObject <- try(pROC::roc(data$obs, data[, lev[1]], direction = ">",
## quiet = TRUE), silent = TRUE)
## rocAUC <- if (inherits(rocObject, "try-error"))
## NA
## else rocObject$auc
## out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"],
## lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
## names(out) <- c("ROC", "Sens", "Spec")
## out
## }
## <bytecode: 0x00000200149093b8>
## <environment: namespace:caret>
##
## $selectionFunction
## [1] "best"
##
## $preProcOptions
## $preProcOptions$thresh
## [1] 0.95
##
## $preProcOptions$ICAcomp
## [1] 3
##
## $preProcOptions$k
## [1] 5
##
## $preProcOptions$freqCut
## [1] 19
##
## $preProcOptions$uniqueCut
## [1] 10
##
## $preProcOptions$cutoff
## [1] 0.9
##
##
## $sampling
## NULL
##
## $index
## NULL
##
## $indexOut
## NULL
##
## $indexFinal
## NULL
##
## $timingSamps
## [1] 0
##
## $predictionBounds
## [1] FALSE FALSE
##
## $seeds
## [1] NA
##
## $adaptive
## $adaptive$min
## [1] 5
##
## $adaptive$alpha
## [1] 0.05
##
## $adaptive$method
## [1] "gls"
##
## $adaptive$complete
## [1] TRUE
##
##
## $trim
## [1] FALSE
##
## $allowParallel
## [1] TRUE
rf_Fit <- train(diagnosis ~ ., data = train_data, method = "rf", preProc = c("center", "scale"), trControl = ctrl, metric = "ROC")
# Print the results
print(rf_Fit)
## Random Forest
##
## 398 samples
## 15 predictor
## 2 classes: 'X0', 'X1'
##
## Pre-processing: centered (15), scaled (15)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 358, 358, 358, 358, 358, 358, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.9933905 0.980 0.9185714
## 8 0.9914095 0.972 0.9042857
## 15 0.9900571 0.948 0.9109524
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Plot resampled ROC against the tuning parameter (mtry)
ggplot(rf_Fit)
# 2.Decision Tree
# Train the 'Decision Tree' model using cross-validation
tree_Fit <- train(diagnosis ~ ., data = train_data, method = "rpart", preProc = c("center", "scale"), trControl = ctrl, metric = "ROC")
# Print the results
print(tree_Fit)
## CART
##
## 398 samples
## 15 predictor
## 2 classes: 'X0', 'X1'
##
## Pre-processing: centered (15), scaled (15)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 359, 358, 358, 358, 358, 359, ...
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.01013514 0.9369333 0.952 0.8985714
## 0.09459459 0.8675333 0.952 0.7833333
## 0.78378378 0.6746190 0.984 0.3652381
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01013514.
# Plot resampled ROC against the tuning parameter (cp)
ggplot(tree_Fit)
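The tuning plots above show resampled ROC against the tuning parameter; a sketch of an actual ROC curve on the held-out test set for the tuned random forest (the probability column is named "X1" because of the relabelled factor levels above):
# Class probabilities on the test set, then ROC/AUC via pROC
test_probs <- predict(rf_Fit, newdata = test_data, type = "prob")
roc_rf <- roc(test_data$diagnosis, test_probs[, "X1"], quiet = TRUE)
plot(roc_rf, main = paste("Random Forest test ROC, AUC =", round(auc(roc_rf), 3)))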
Breast cancer is the most common cancer among women worldwide, accounting for 25 per cent of all cancer cases. The key challenge in detecting tumors is classifying them as malignant (cancerous) or benign (non-cancerous). We used data analysis models to classify these tumors, working with a dataset from an open-source data platform. Starting from data cleaning, classification models were built and their performance evaluated to predict whether a tumor is malignant or benign.
In the data cleaning part, missing and duplicate values in the dataset were handled first, the target variable was identified, and useless feature columns were removed. Finally, the category prediction was transformed into a binary classification problem, and the correlations between the features and the target variable were explored.
The third part is machine learning modelling. The dataset was divided into train_data and test_data, and four mainstream models were used to make predictions: random forest, logistic regression, decision tree and Naive Bayes. Accuracy, Kappa, sensitivity and specificity were used to measure the predictive performance of each model.
Then comes the model evaluation part. Based on the four machine learning models from the third part, the performance of each model was first estimated by comparing Accuracy, Precision, Recall and F1 Score; among them, the random forest model performed best. Then, through K-fold, LOOCV and bootstrap resampling, we concluded that the logistic regression and Naive Bayes models are not suitable for this dataset, while the random forest and decision tree models are clearly better.