Introduction

Background

Breast cancer is the most common cancer among women worldwide. It accounts for 25% of all cancer cases and affected over 2.1 million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen on an X-ray or felt as lumps in the breast area. In medical diagnostics, quickly and accurately identifying whether a breast tumor is malignant or benign is crucial for patient treatment and prognosis.

Brief Overview of The Project

The dataset is a health dataset provided for the Social Good: Women Coders’ Bootcamp, organized by Artificial Intelligence for Development in collaboration with UNDP Nepal. Its features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image.

Objectives of The Predictive Analysis

Objective 1: Develop an efficient predictive model to assist doctors in diagnosing high-risk patients (using classification methods).

Objective 2: Develop an efficient predictive model to assist doctors in diagnosing high-risk patients (using regression methods).

Success Criteria

Model accuracy >= 90%, with high precision and recall, especially for malignant tumor detection.
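To make the success criteria concrete, the following minimal sketch shows how accuracy, precision, and recall can be derived from predicted and actual labels, assuming malignant cases are coded as 1. The helper name `classification_metrics` and the toy vectors are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper (not part of the original analysis): accuracy,
# precision, and recall for a binary problem where 1 = malignant.
classification_metrics <- function(actual, predicted, positive = 1) {
  tp <- sum(predicted == positive & actual == positive)  # true positives
  fp <- sum(predicted == positive & actual != positive)  # false positives
  fn <- sum(predicted != positive & actual == positive)  # false negatives
  tn <- sum(predicted != positive & actual != positive)  # true negatives
  c(
    accuracy  = (tp + tn) / length(actual),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn)
  )
}

# Toy labels for illustration only; these are not results from the dataset.
actual    <- c(1, 1, 1, 0, 0, 0, 0, 1)
predicted <- c(1, 1, 0, 0, 0, 1, 0, 1)
metrics <- classification_metrics(actual, predicted)
print(metrics)
```

Recall for the malignant class is the share of malignant tumors the model catches, which is why the criteria single it out: a missed malignant case is usually costlier than a false alarm.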

library(dplyr)
library(ggplot2)
library(caret)
library(randomForest)
library(xgboost)
library(corrplot)
library(e1071)
library(pROC)

library(gridExtra)

Analysis

Objective 1

Data Cleaning and Preprocessing

df <- read.csv("C:\\Users\\15537\\Desktop\\data.csv")
head(df)
##         id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1   842302         M       17.99        10.38         122.80    1001.0
## 2   842517         M       20.57        17.77         132.90    1326.0
## 3 84300903         M       19.69        21.25         130.00    1203.0
## 4 84348301         M       11.42        20.38          77.58     386.1
## 5 84358402         M       20.29        14.34         135.10    1297.0
## 6   843786         M       12.45        15.70          82.57     477.1
##   smoothness_mean compactness_mean concavity_mean concave_points_mean
## 1         0.11840          0.27760         0.3001             0.14710
## 2         0.08474          0.07864         0.0869             0.07017
## 3         0.10960          0.15990         0.1974             0.12790
## 4         0.14250          0.28390         0.2414             0.10520
## 5         0.10030          0.13280         0.1980             0.10430
## 6         0.12780          0.17000         0.1578             0.08089
##   symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1        0.2419                0.07871    1.0950     0.9053        8.589
## 2        0.1812                0.05667    0.5435     0.7339        3.398
## 3        0.2069                0.05999    0.7456     0.7869        4.585
## 4        0.2597                0.09744    0.4956     1.1560        3.445
## 5        0.1809                0.05883    0.7572     0.7813        5.438
## 6        0.2087                0.07613    0.3345     0.8902        2.217
##   area_se smoothness_se compactness_se concavity_se concave_points_se
## 1  153.40      0.006399        0.04904      0.05373           0.01587
## 2   74.08      0.005225        0.01308      0.01860           0.01340
## 3   94.03      0.006150        0.04006      0.03832           0.02058
## 4   27.23      0.009110        0.07458      0.05661           0.01867
## 5   94.44      0.011490        0.02461      0.05688           0.01885
## 6   27.19      0.007510        0.03345      0.03672           0.01137
##   symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1     0.03003             0.006193        25.38         17.33          184.60
## 2     0.01389             0.003532        24.99         23.41          158.80
## 3     0.02250             0.004571        23.57         25.53          152.50
## 4     0.05963             0.009208        14.91         26.50           98.87
## 5     0.01756             0.005115        22.54         16.67          152.20
## 6     0.02165             0.005082        15.47         23.75          103.40
##   area_worst smoothness_worst compactness_worst concavity_worst
## 1     2019.0           0.1622            0.6656          0.7119
## 2     1956.0           0.1238            0.1866          0.2416
## 3     1709.0           0.1444            0.4245          0.4504
## 4      567.7           0.2098            0.8663          0.6869
## 5     1575.0           0.1374            0.2050          0.4000
## 6      741.6           0.1791            0.5249          0.5355
##   concave_points_worst symmetry_worst fractal_dimension_worst
## 1               0.2654         0.4601                 0.11890
## 2               0.1860         0.2750                 0.08902
## 3               0.2430         0.3613                 0.08758
## 4               0.2575         0.6638                 0.17300
## 5               0.1625         0.2364                 0.07678
## 6               0.1741         0.3985                 0.12440
str(df)
## 'data.frame':    569 obs. of  32 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave_points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave_points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave_points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
dim(df)
## [1] 569  32
sum(duplicated(df))
## [1] 0
colSums(is.na(df))
##                      id               diagnosis             radius_mean 
##                       0                       0                       0 
##            texture_mean          perimeter_mean               area_mean 
##                       0                       0                       0 
##         smoothness_mean        compactness_mean          concavity_mean 
##                       0                       0                       0 
##     concave_points_mean           symmetry_mean  fractal_dimension_mean 
##                       0                       0                       0 
##               radius_se              texture_se            perimeter_se 
##                       0                       0                       0 
##                 area_se           smoothness_se          compactness_se 
##                       0                       0                       0 
##            concavity_se       concave_points_se             symmetry_se 
##                       0                       0                       0 
##    fractal_dimension_se            radius_worst           texture_worst 
##                       0                       0                       0 
##         perimeter_worst              area_worst        smoothness_worst 
##                       0                       0                       0 
##       compactness_worst         concavity_worst    concave_points_worst 
##                       0                       0                       0 
##          symmetry_worst fractal_dimension_worst 
##                       0                       0

Exploratory Data Analysis

ggplot(df, aes(x = diagnosis, fill = diagnosis)) +
  geom_bar() +
  ggtitle("Diagnosis proportion in dataset") +
  theme_minimal()
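The bar chart shows the class balance visually. As a complement, the sketch below tabulates class counts and proportions numerically. It uses a toy `diagnosis` vector as a stand-in for `df$diagnosis`; the values are illustrative and are not taken from the dataset.

```r
# Toy stand-in for df$diagnosis; illustrative values only.
diagnosis <- c("M", "B", "B", "M", "B")

counts <- table(diagnosis)        # number of cases per class
proportions <- prop.table(counts) # share of each class

print(counts)
print(round(proportions, 3))
```

A strongly imbalanced split would be a reason to rely less on accuracy and more on precision and recall when judging the models.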

df <- df %>%
  select(-id)
df$diagnosis <- ifelse(df$diagnosis == 'M', 1, 0)

# Extract the column names of features_mean
features_mean <- colnames(df)[2:11]

# Extract the two classes of data separately
dfM <- subset(df, diagnosis == 1)  # malignant (M) class
dfB <- subset(df, diagnosis == 0)  # benign (B) class

# Create an overlaid density histogram for each feature
plots <- lapply(features_mean, function(feature) {
  binwidth <- (max(df[[feature]], na.rm = TRUE) - min(df[[feature]], na.rm = TRUE)) / 50

  # aes() with .data[[feature]] and after_stat(density) replaces the deprecated
  # aes_string() and ..density.. forms; alpha = 0.5 keeps both classes visible
  ggplot() +
    geom_histogram(data = dfM, aes(x = .data[[feature]], y = after_stat(density)),
                   binwidth = binwidth, fill = "red", alpha = 0.5) +
    geom_histogram(data = dfB, aes(x = .data[[feature]], y = after_stat(density)),
                   binwidth = binwidth, fill = "green", alpha = 0.5) +
    ggtitle(feature) +
    labs(x = feature, y = "Density") +
    theme_minimal() +
    theme(plot.title = element_text(size = 10))
})
# Use gridExtra to arrange multiple plots into a grid
do.call(grid.arrange, c(plots, ncol = 2))

correlation_matrix <- cor(df)
corrplot(
  correlation_matrix, 
  method = "color", 
  tl.cex = 1, 
  addCoef.col = "black"
  
)

# Calculate the correlation matrix
corr_matrix <- abs(cor(df))

# Collect the names of columns whose absolute correlation with an earlier column exceeds 0.92
to_drop <- c()
for (i in 1:(ncol(corr_matrix) - 1)) {
  for (j in (i + 1):ncol(corr_matrix)) {
    if (corr_matrix[i, j] > 0.92) {
      to_drop <- c(to_drop, colnames(corr_matrix)[j])
    }
  }
}

# De-duplicate the collected names and drop those columns
to_drop <- unique(to_drop)
df <- df[, !(colnames(df) %in% to_drop)]

# Print the number of columns after deletion
print(ncol(df))
## [1] 23
# Fix the random seed so the split is reproducible
set.seed(42)
# Create an index for splitting the data (70% training, 30% testing)
index <- createDataPartition(df$diagnosis, p = 0.7, list = FALSE)
# Split the data
train_data <- df[index, ]
test_data <- df[-index, ]
train_data$diagnosis <- factor(train_data$diagnosis)
test_data$diagnosis <- factor(test_data$diagnosis)
# Check the dimensions of the training and testing sets
dim(train_data)
## [1] 399  23
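The nested loop used above to find highly correlated columns can also be written without explicit loops. The sketch below is one possible vectorized equivalent: it marks column j for removal when its absolute correlation with any earlier column i < j exceeds the threshold. The function name `drop_correlated` and the toy data frame are hypothetical.

```r
# Hypothetical vectorized version of the pairwise correlation filter.
drop_correlated <- function(data, threshold = 0.92) {
  corr <- abs(cor(data))
  # upper.tri() keeps only pairs (i, j) with i < j, as in the nested loop
  high <- corr > threshold & upper.tri(corr)
  # a column is dropped if any earlier column is too strongly correlated with it
  colnames(corr)[colSums(high) > 0]
}

# Toy data for illustration: b is an exact multiple of a, c is unrelated.
toy <- data.frame(
  a = 1:10,
  b = 2 * (1:10),
  c = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)
dropped <- drop_correlated(toy)
print(dropped)

toy_reduced <- toy[, !(colnames(toy) %in% dropped), drop = FALSE]
print(colnames(toy_reduced))
```

The caret package also offers `findCorrelation()` for the same purpose, but it uses a different rule for choosing which column of a pair to remove, so it may not select the same set of columns.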

Modeling and Evaluation

# Train a Random Forest model
rf_model <- randomForest(diagnosis ~ ., data = train_data, ntree = 100)

# Obtain predicted probabilities from the Random Forest model
rf_predictions_prob <- predict(rf_model, newdata = test_data, type = "prob")[, 2]

# Obtain predicted classes from the Random Forest model (based on probabilities > 0.5)
rf_predictions_class <- ifelse(rf_predictions_prob > 0.5, 1, 0)

# Train a Logistic Regression model
logit_model <- glm(diagnosis ~ ., data = train_data, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Obtain predicted probabilities from the Logistic Regression model
logit_predictions_prob <- predict(logit_model, newdata = test_data, type = "response")

# Obtain predicted classes from the Logistic Regression model (based on probabilities > 0.5)
logit_predictions_class <- ifelse(logit_predictions_prob > 0.5, 1, 0)

# Compute the ROC curves for both models
roc_rf <- roc(test_data$diagnosis, rf_predictions_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_logit <- roc(test_data$diagnosis, logit_predictions_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot the ROC curve using 1 - specificity as the x-axis and TPR as the y-axis
plot(1 - roc_rf$specificities, roc_rf$sensitivities, type = "l", col = "blue", 
     main = "ROC Curve Comparison", lwd = 2, ylim = c(0, 1), xlim = c(0, 1))

lines(1 - roc_logit$specificities, roc_logit$sensitivities, col = "red", lwd = 2)

# Add legend
legend("bottomright", legend = c("Random Forest", "Logistic Regression"),
       col = c("blue", "red"), lwd = 2)

# Print AUC values
cat("Random Forest AUC: ", auc(roc_rf), "\n")
## Random Forest AUC:  0.9965644
cat("Logistic Regression AUC: ", auc(roc_logit), "\n")
## Logistic Regression AUC:  0.9948084
# Compute and print the accuracy of the Random Forest model
rf_accuracy <- sum(rf_predictions_class == test_data$diagnosis) / length(test_data$diagnosis)
cat("Random Forest Accuracy: ", rf_accuracy, "\n")
## Random Forest Accuracy:  0.9705882
# Compute and print the accuracy of the Logistic Regression model
logit_accuracy <- sum(logit_predictions_class == test_data$diagnosis) / length(test_data$diagnosis)
cat("Logistic Regression Accuracy: ", logit_accuracy, "\n")
## Logistic Regression Accuracy:  0.9470588
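The AUC values reported above can be read as the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign case. The sketch below illustrates that interpretation with a rank-based (Mann-Whitney) calculation in base R. It is a hypothetical helper for intuition, not the method pROC uses internally, and the toy labels and scores are not from the dataset.

```r
# Hypothetical helper: rank-based AUC for labels coded 0 (benign) / 1 (malignant).
auc_rank <- function(labels, scores) {
  r <- rank(scores)            # average ranks are used for tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # sum of positive ranks minus its minimum possible value,
  # divided by the number of positive/negative pairs
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example for illustration only.
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
toy_auc <- auc_rank(labels, scores)
print(toy_auc)
```

In the toy example, three of the four malignant/benign pairs are ordered correctly, which gives an AUC of 0.75.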

Objective 2

Data Cleaning and Preprocessing

# Load data
data <- read.csv("data.csv")
# Data preprocessing: drop the identifier, the diagnosis label, and the target (area_mean) from the features
features <- data[, !(names(data) %in% c("id", "diagnosis", "area_mean"))]
target <- data$area_mean
# Divide the training set and test set
set.seed(42)
trainIndex <- createDataPartition(target, p = 0.8, list = FALSE)
X_train <- features[trainIndex, ]
X_test <- features[-trainIndex, ]
y_train <- target[trainIndex]
y_test <- target[-trainIndex]

Modeling and Evaluation

# Multiple linear regression model
lr_model <- lm(y_train ~ ., data = cbind(X_train, y_train = y_train))
lr_predictions <- predict(lr_model, newdata = X_test)
# Random forest with hyperparameter tuning
rf_grid <- expand.grid(mtry = c(2, 4, 6, 8)) # Candidate numbers of variables randomly sampled at each split
rf_control <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
rf_model <- train(x = X_train, y = y_train,
                  method = "rf",
                  tuneGrid = rf_grid,
                  trControl = rf_control)
rf_best <- rf_model$finalModel
rf_predictions <- predict(rf_model, newdata = X_test)
# Model evaluation
lr_mse <- mean((y_test - lr_predictions)^2)
lr_r2 <- cor(y_test, lr_predictions)^2
cat("Multiple linear regression MSE:", lr_mse, "R²:", lr_r2, "\n")
## Multiple linear regression MSE: 402.6589 R²: 0.9970166
rf_mse <- mean((y_test - rf_predictions)^2)
rf_r2 <- cor(y_test, rf_predictions)^2
cat("Best Random Forest Model:", paste(names(rf_model$bestTune), rf_model$bestTune, sep = "=", collapse = ", "), "\n")
## Best Random Forest Model: mtry=8
cat("Hyperparameter optimization: Random forest MSE:", rf_mse, "R²:", rf_r2, "\n")
## Hyperparameter optimization: Random forest MSE: 2452.416 R²: 0.9822163
# Visualize the prediction results
# Tuned random forest
rf_plot <- ggplot(data.frame(actual = y_test, predicted = rf_predictions), aes(x = actual, y = predicted)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Tuned random forest: predicted vs. actual", x = "Actual Area Mean", y = "Predicted Area Mean") +
  theme_minimal()

# Multiple linear regression model
lr_plot <- ggplot(data.frame(actual = y_test, predicted = lr_predictions), aes(x = actual, y = predicted)) +
  geom_point(color = "green", alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Multiple linear regression: predicted vs. actual", x = "Actual Area Mean", y = "Predicted Area Mean") +
  theme_minimal()

# Display the plots
print(rf_plot)

print(lr_plot)
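The evaluation above reports R² as the squared correlation between actual and predicted values. That measure ignores systematic bias, so it can differ from the coefficient of determination, 1 - SSE/SST. The sketch below computes both alongside the MSE. The helper name `regression_metrics` and the toy vectors are hypothetical and are not results from the dataset.

```r
# Hypothetical helper: MSE plus two common definitions of R-squared.
regression_metrics <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)     # residual sum of squares
  sst <- sum((actual - mean(actual))^2)  # total sum of squares
  c(
    mse    = sse / length(actual),
    r2_cor = cor(actual, predicted)^2,   # squared correlation, as used above
    r2_det = 1 - sse / sst               # coefficient of determination
  )
}

# Toy example: every prediction is off by exactly 1.
actual    <- c(1, 2, 3, 4)
predicted <- c(2, 3, 4, 5)
reg_metrics <- regression_metrics(actual, predicted)
print(reg_metrics)
```

In the toy example the predictions are perfectly correlated with the actual values, so the squared correlation is 1, while the constant offset pulls the coefficient of determination down to 0.2. Reporting both can help reveal biased predictions.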

Conclusion

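The analysis shows that both the Random Forest and Logistic Regression models perform well for breast cancer diagnosis. On the held-out test set, Random Forest reached an accuracy of 0.971 and an AUC of 0.997, while Logistic Regression reached an accuracy of 0.947 and an AUC of 0.995, so both exceed the 90% accuracy target. Logistic Regression serves as a simple and effective baseline, complementing the Random Forest model.

For the regression task, which predicts area_mean, multiple linear regression achieved a test MSE of 402.66 and an R² of 0.997, ahead of the tuned Random Forest (mtry = 8) with an MSE of 2452.42 and an R² of 0.982.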

Data exploration revealed strong correlations among the features; removing features with an absolute pairwise correlation above 0.92 left 22 predictors for the classification models. These models can assist healthcare professionals in the early detection of breast cancer and help improve patient outcomes, especially by identifying malignant cases promptly.

The small dataset (569 observations) may limit the generalizability of the models. Future improvements could include larger, more diverse datasets and more advanced modeling techniques, which would further strengthen the models’ robustness and adaptability in real-world applications.