Introduction
Analysis
- Objective 1
- Objective 2
  - Data Cleaning and Preprocessing
  - Modeling and Evaluation
Conclusion

Introduction

Background

Breast cancer diagnosis is a critical challenge in healthcare, where accurately distinguishing between malignant and benign tumors can significantly impact patient treatment and prognosis. Leveraging predictive models can enhance diagnostic accuracy and assist doctors in identifying high-risk patients more effectively.

Brief Overview of The Project

This project develops machine learning-based predictive models using the Wisconsin Breast Cancer Dataset. It covers data preprocessing, feature selection, and model training and evaluation with algorithms such as Random Forest and Logistic Regression, with the eventual goal of supporting doctors in diagnosing breast cancer in a clinical setting.

Objectives of The Predictive Analysis

Objective 1: For the classification task, two models are used to distinguish malignant from benign tumors: Random Forest, an ensemble model capable of capturing complex, non-linear patterns, and Logistic Regression, a simple yet effective baseline. The models are evaluated on accuracy, sensitivity, and ROC-AUC, with the aim of supporting reliable early diagnosis and helping healthcare professionals identify high-risk patients.

Objective 2: For the regression task, two models are used to predict the mean tumor area (area_mean), a key quantitative feature: Linear Regression, a straightforward approach to predicting continuous values, and Random Forest Regression, an ensemble method able to model non-linear relationships. Performance is evaluated using Mean Squared Error (MSE) and R-squared (R²).

library(dplyr)
library(ggplot2)
library(caret)
library(randomForest)
library(xgboost)
library(corrplot)
library(e1071)
library(pROC)

library(gridExtra)

Analysis

Objective 1

Data Cleaning and Preprocessing

df <- read.csv("C:\\Users\\98631\\Desktop\\WQD7004\\data.csv")

head(df)

##         id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1   842302         M       17.99        10.38         122.80    1001.0
## 2   842517         M       20.57        17.77         132.90    1326.0
## 3 84300903         M       19.69        21.25         130.00    1203.0
## 4 84348301         M       11.42        20.38          77.58     386.1
## 5 84358402         M       20.29        14.34         135.10    1297.0
## 6   843786         M       12.45        15.70          82.57     477.1
##   smoothness_mean compactness_mean concavity_mean concave_points_mean
## 1         0.11840          0.27760         0.3001             0.14710
## 2         0.08474          0.07864         0.0869             0.07017
## 3         0.10960          0.15990         0.1974             0.12790
## 4         0.14250          0.28390         0.2414             0.10520
## 5         0.10030          0.13280         0.1980             0.10430
## 6         0.12780          0.17000         0.1578             0.08089
##   symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1        0.2419                0.07871    1.0950     0.9053        8.589
## 2        0.1812                0.05667    0.5435     0.7339        3.398
## 3        0.2069                0.05999    0.7456     0.7869        4.585
## 4        0.2597                0.09744    0.4956     1.1560        3.445
## 5        0.1809                0.05883    0.7572     0.7813        5.438
## 6        0.2087                0.07613    0.3345     0.8902        2.217
##   area_se smoothness_se compactness_se concavity_se concave_points_se
## 1  153.40      0.006399        0.04904      0.05373           0.01587
## 2   74.08      0.005225        0.01308      0.01860           0.01340
## 3   94.03      0.006150        0.04006      0.03832           0.02058
## 4   27.23      0.009110        0.07458      0.05661           0.01867
## 5   94.44      0.011490        0.02461      0.05688           0.01885
## 6   27.19      0.007510        0.03345      0.03672           0.01137
##   symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1     0.03003             0.006193        25.38         17.33          184.60
## 2     0.01389             0.003532        24.99         23.41          158.80
## 3     0.02250             0.004571        23.57         25.53          152.50
## 4     0.05963             0.009208        14.91         26.50           98.87
## 5     0.01756             0.005115        22.54         16.67          152.20
## 6     0.02165             0.005082        15.47         23.75          103.40
##   area_worst smoothness_worst compactness_worst concavity_worst
## 1     2019.0           0.1622            0.6656          0.7119
## 2     1956.0           0.1238            0.1866          0.2416
## 3     1709.0           0.1444            0.4245          0.4504
## 4      567.7           0.2098            0.8663          0.6869
## 5     1575.0           0.1374            0.2050          0.4000
## 6      741.6           0.1791            0.5249          0.5355
##   concave_points_worst symmetry_worst fractal_dimension_worst
## 1               0.2654         0.4601                 0.11890
## 2               0.1860         0.2750                 0.08902
## 3               0.2430         0.3613                 0.08758
## 4               0.2575         0.6638                 0.17300
## 5               0.1625         0.2364                 0.07678
## 6               0.1741         0.3985                 0.12440

str(df)

## 'data.frame':    569 obs. of  32 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave_points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave_points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave_points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...

dim(df)

## [1] 569  32

sum(duplicated(df))

## [1] 0

colSums(is.na(df))

##                      id               diagnosis             radius_mean 
##                       0                       0                       0 
##            texture_mean          perimeter_mean               area_mean 
##                       0                       0                       0 
##         smoothness_mean        compactness_mean          concavity_mean 
##                       0                       0                       0 
##     concave_points_mean           symmetry_mean  fractal_dimension_mean 
##                       0                       0                       0 
##               radius_se              texture_se            perimeter_se 
##                       0                       0                       0 
##                 area_se           smoothness_se          compactness_se 
##                       0                       0                       0 
##            concavity_se       concave_points_se             symmetry_se 
##                       0                       0                       0 
##    fractal_dimension_se            radius_worst           texture_worst 
##                       0                       0                       0 
##         perimeter_worst              area_worst        smoothness_worst 
##                       0                       0                       0 
##       compactness_worst         concavity_worst    concave_points_worst 
##                       0                       0                       0 
##          symmetry_worst fractal_dimension_worst 
##                       0                       0

Exploratory Data Analysis

ggplot(df, aes(x = diagnosis, fill = diagnosis)) +
  geom_bar() +
  ggtitle("Diagnosis proportion in dataset") +
  theme_minimal()

df <- df %>%
  select(-id)

df$diagnosis <- ifelse(df$diagnosis == 'M', 1, 0)

# Extract the column names of features_mean
features_mean <- colnames(df)[2:11]

# Box plot of every numeric feature except the target
numeric_features <- colnames(df)[sapply(df, is.numeric)]
numeric_features <- setdiff(numeric_features, "diagnosis")

plot_list <- lapply(numeric_features, function(feature) {
  ggplot(df, aes(y = .data[[feature]])) + 
    geom_boxplot(fill = "steelblue", alpha = 0.7) +
    labs(title = feature, y = feature) +
    theme_minimal()
})

grid.arrange(grobs = plot_list, ncol = 6)

# Detect outliers with the IQR rule: values more than 1.5 * IQR beyond the quartiles
detect_outliers_iqr <- function(data, feature) {
  q1 <- quantile(data[[feature]], 0.25, na.rm = TRUE)
  q3 <- quantile(data[[feature]], 0.75, na.rm = TRUE)
  iqr <- q3 - q1  # lower-case name avoids shadowing stats::IQR()
  lower_bound <- q1 - 1.5 * iqr
  upper_bound <- q3 + 1.5 * iqr
  data[data[[feature]] < lower_bound | data[[feature]] > upper_bound, ]
}

# Count the outlier rows for each feature
numeric_features <- setdiff(colnames(df), "diagnosis")
for (feature in numeric_features) {
  outliers <- detect_outliers_iqr(df, feature)
  cat(sprintf("Feature: %s, Outliers: %d\n", feature, nrow(outliers)))
}

## Feature: radius_mean, Outliers: 14
## Feature: texture_mean, Outliers: 7
## Feature: perimeter_mean, Outliers: 13
## Feature: area_mean, Outliers: 25
## Feature: smoothness_mean, Outliers: 6
## Feature: compactness_mean, Outliers: 16
## Feature: concavity_mean, Outliers: 18
## Feature: concave_points_mean, Outliers: 10
## Feature: symmetry_mean, Outliers: 15
## Feature: fractal_dimension_mean, Outliers: 15
## Feature: radius_se, Outliers: 38
## Feature: texture_se, Outliers: 20
## Feature: perimeter_se, Outliers: 38
## Feature: area_se, Outliers: 65
## Feature: smoothness_se, Outliers: 30
## Feature: compactness_se, Outliers: 28
## Feature: concavity_se, Outliers: 22
## Feature: concave_points_se, Outliers: 19
## Feature: symmetry_se, Outliers: 27
## Feature: fractal_dimension_se, Outliers: 28
## Feature: radius_worst, Outliers: 17
## Feature: texture_worst, Outliers: 5
## Feature: perimeter_worst, Outliers: 15
## Feature: area_worst, Outliers: 35
## Feature: smoothness_worst, Outliers: 7
## Feature: compactness_worst, Outliers: 16
## Feature: concavity_worst, Outliers: 12
## Feature: concave_points_worst, Outliers: 0
## Feature: symmetry_worst, Outliers: 23
## Feature: fractal_dimension_worst, Outliers: 24
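
To make the IQR rule concrete, here is a minimal sketch on a small made-up vector. The names `toy_values` and `iqr_bounds` are hypothetical and used only for illustration; they are not part of the analysis above.

```r
# Toy data (made up for illustration): one value sits far from the rest
toy_values <- c(10, 12, 13, 14, 15, 16, 18, 40)

# Return the lower and upper fences of the 1.5 * IQR rule
iqr_bounds <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - 1.5 * spread, upper = q[2] + 1.5 * spread)
}

bounds <- iqr_bounds(toy_values)
toy_outliers <- toy_values[toy_values < bounds["lower"] | toy_values > bounds["upper"]]

print(bounds)        # lower = 7.125, upper = 22.125
print(toy_outliers)  # 40
```

Here the quartiles are 12.75 and 16.5, so the IQR is 3.75 and only the value 40 falls outside the fences.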

# Extract the two classes of data separately
dfM <- subset(df, diagnosis == 1)  # M 
dfB <- subset(df, diagnosis == 0)  # B 

# Create overlaid density histograms (malignant vs. benign) for each mean feature
plots <- lapply(features_mean, function(feature) {
  binwidth <- diff(range(df[[feature]], na.rm = TRUE)) / 50

  # Semi-transparent fills keep both classes visible where they overlap
  ggplot() +
    geom_histogram(data = dfM, aes(x = .data[[feature]], y = after_stat(density)),
                   binwidth = binwidth, fill = "pink", alpha = 0.6) +
    geom_histogram(data = dfB, aes(x = .data[[feature]], y = after_stat(density)),
                   binwidth = binwidth, fill = "darkgreen", alpha = 0.6) +
    ggtitle(feature) +
    labs(x = feature, y = "Density") +
    theme_minimal() +
    theme(plot.title = element_text(size = 10))
})

# Use gridExtra to arrange the plots into a two-column grid
grid.arrange(grobs = plots, ncol = 2)

correlation_matrix <- cor(df)
corrplot(
  correlation_matrix, 
  method = "color", 
  tl.cex = 1, 
  addCoef.col = "black"
  
)

# Compute the absolute correlation matrix
corr_matrix <- abs(cor(df))

# Step 1: Remove features that are weakly correlated with 'diagnosis' (|r| < 0.2)
# Identify features (excluding 'diagnosis') that have correlation < 0.2 with 'diagnosis'
weak_corr_features <- names(which(corr_matrix["diagnosis", ] < 0.2 & colnames(corr_matrix) != "diagnosis"))

# Drop these weakly correlated features
df <- df[, !(colnames(df) %in% weak_corr_features)]

# Recalculate the correlation matrix after removing weakly correlated features
corr_matrix <- abs(cor(df))

# Step 2: Remove features that are highly correlated with each other (|r| > 0.92)
to_drop <- c()
for (i in 1:(ncol(corr_matrix) - 1)) {
  for (j in (i + 1):ncol(corr_matrix)) {
    if (corr_matrix[i, j] > 0.92) {
      to_drop <- c(to_drop, colnames(corr_matrix)[j])
    }
  }
}

# Remove duplicates
to_drop <- unique(to_drop)
df <- df[, !(colnames(df) %in% to_drop)]

# Print the number of remaining columns
print(ncol(df))

## [1] 18
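
The two filtering steps above can also be wrapped in a reusable function. The following is a minimal sketch under stated assumptions, not the code used to produce the results in this report: `filter_by_correlation` and the toy data frame are hypothetical, and unlike the loop above the sketch leaves the target out of the pairwise comparison so that it can never be dropped.

```r
# Hypothetical helper: apply both correlation filters to a data frame
filter_by_correlation <- function(data, target, min_target_r = 0.2, max_pair_r = 0.92) {
  features <- setdiff(colnames(data), target)

  # Step 1: keep features whose |r| with the target reaches the minimum
  r_target <- abs(cor(data[features], data[[target]]))[, 1]
  kept <- features[r_target >= min_target_r]

  # Step 2: for each highly correlated pair, drop the later feature
  to_drop <- character(0)
  if (length(kept) > 1) {
    r_pair <- abs(cor(data[kept]))
    for (i in 1:(length(kept) - 1)) {
      for (j in (i + 1):length(kept)) {
        if (r_pair[i, j] > max_pair_r) to_drop <- union(to_drop, kept[j])
      }
    }
  }
  data[c(target, setdiff(kept, to_drop))]
}

# Toy data (made up): x2 is a linear copy of x1, x3 is unrelated to y
toy <- data.frame(
  y  = c(0, 0, 0, 0, 1, 1, 1, 1),
  x1 = c(1, 2, 1, 2, 5, 6, 5, 6),
  x3 = c(1, -1, 1, -1, 1, -1, 1, -1)
)
toy$x2 <- 2 * toy$x1 + 1

print(colnames(filter_by_correlation(toy, target = "y")))  # "y" "x1"
```

In the toy data, `x3` is removed in step 1 because it is uncorrelated with `y`, and `x2` is removed in step 2 because it is perfectly correlated with `x1`.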

# Features and labels
set.seed(42)
# Create an index for splitting the data (70% training, 30% testing)
index <- createDataPartition(df$diagnosis, p = 0.7, list = FALSE)
# Split the data
train_data <- df[index, ]
test_data <- df[-index, ]
train_data$diagnosis <- factor(train_data$diagnosis)
test_data$diagnosis <- factor(test_data$diagnosis)
# Check the dimensions of the training and testing sets
dim(train_data)

## [1] 399  18

Modeling and Evaluation

# Train a Random Forest model
rf_model <- randomForest(diagnosis ~ ., data = train_data, ntree = 100)

# Obtain predicted probabilities from the Random Forest model
rf_predictions_prob <- predict(rf_model, newdata = test_data, type = "prob")[, 2]

# Obtain predicted classes from the Random Forest model (based on probabilities > 0.5)
rf_predictions_class <- ifelse(rf_predictions_prob > 0.5, 1, 0)

# Train a Logistic Regression model
logit_model <- glm(diagnosis ~ ., data = train_data, family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Obtain predicted probabilities from the Logistic Regression model
logit_predictions_prob <- predict(logit_model, newdata = test_data, type = "response")

# Obtain predicted classes from the Logistic Regression model (based on probabilities > 0.5)
logit_predictions_class <- ifelse(logit_predictions_prob > 0.5, 1, 0)

# Compute the ROC curves for both models
roc_rf <- roc(test_data$diagnosis, rf_predictions_prob)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

roc_logit <- roc(test_data$diagnosis, logit_predictions_prob)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

# Plot the ROC curve using 1 - specificity as the x-axis and TPR as the y-axis
plot(1 - roc_rf$specificities, roc_rf$sensitivities, type = "l", col = "blue", 
     main = "ROC Curve Comparison", lwd = 2, ylim = c(0, 1), xlim = c(0, 1))

lines(1 - roc_logit$specificities, roc_logit$sensitivities, col = "red", lwd = 2)

# Add legend
legend("bottomright", legend = c("Random Forest", "Logistic Regression"),
       col = c("blue", "red"), lwd = 2)

# Print AUC values
cat("Random Forest AUC: ", auc(roc_rf), "\n")

## Random Forest AUC:  0.9941212

cat("Logistic Regression AUC: ", auc(roc_logit), "\n")

## Logistic Regression AUC:  0.9975569
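
An AUC value can be read as the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign one. The sketch below illustrates that rank-based reading on made-up labels and scores. The helper `auc_by_rank` is hypothetical and is not necessarily how pROC computes the value internally.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) statistic
auc_by_rank <- function(labels, scores) {
  is_pos <- labels == 1
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  ranks <- rank(scores)  # tied scores receive their average rank
  (sum(ranks[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data (made up): 1 = malignant, 0 = benign
toy_labels <- c(0, 0, 1, 1)
toy_scores <- c(0.1, 0.4, 0.35, 0.8)

print(auc_by_rank(toy_labels, toy_scores))  # 0.75
```

Of the four malignant/benign pairs in the toy data, three are ranked correctly, which gives 3/4 = 0.75.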

# Compute and print the accuracy of the Random Forest model
rf_accuracy <- sum(rf_predictions_class == test_data$diagnosis) / length(test_data$diagnosis)
cat("Random Forest Accuracy: ", rf_accuracy, "\n")

## Random Forest Accuracy:  0.9529412

# Compute and print the accuracy of the Logistic Regression model
logit_accuracy <- sum(logit_predictions_class == test_data$diagnosis) / length(test_data$diagnosis)
cat("Logistic Regression Accuracy: ", logit_accuracy, "\n")

## Logistic Regression Accuracy:  0.9705882

# Install any missing packages
if (!require(caret)) install.packages("caret")
if (!require(ggplot2)) install.packages("ggplot2")
if (!require(gridExtra)) install.packages("gridExtra")
if (!require(iml)) install.packages("iml")

## Loading required package: iml

## Warning: package 'iml' was built under R version 4.4.2

if (!require(randomForest)) install.packages("randomForest")
if (!require(shapper)) install.packages("shapper")

## Loading required package: shapper

## Warning: package 'shapper' was built under R version 4.4.2

# Load required libraries (harmless if already attached)
library(caret)
library(ggplot2)
library(gridExtra)
library(randomForest)
library(iml)
library(shapper)
# Ensure diagnosis column is factor with levels 0 and 1
train_data$diagnosis <- factor(train_data$diagnosis, levels = c(0, 1))
test_data$diagnosis <- factor(test_data$diagnosis, levels = c(0, 1))

# Train Random Forest model
rf_model <- randomForest(diagnosis ~ ., data = train_data, ntree = 100)
rf_predictions_prob <- predict(rf_model, newdata = test_data, type = "prob")[, 2]
rf_predictions_class <- factor(ifelse(rf_predictions_prob > 0.5, 1, 0), levels = c(0, 1))

# Train Logistic Regression model
logit_model <- glm(diagnosis ~ ., data = train_data, family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

logit_predictions_prob <- predict(logit_model, newdata = test_data, type = "response")
logit_predictions_class <- factor(ifelse(logit_predictions_prob > 0.5, 1, 0), levels = c(0, 1))

# Real labels
test_data_diagnosis <- factor(test_data$diagnosis, levels = c(0, 1))

# Compute confusion matrices, treating malignant (1) as the positive class
rf_confusion <- confusionMatrix(rf_predictions_class, test_data_diagnosis, positive = "1")
logit_confusion <- confusionMatrix(logit_predictions_class, test_data_diagnosis, positive = "1")

# Extract confusion matrix data
rf_table <- as.data.frame(rf_confusion$table)
logit_table <- as.data.frame(logit_confusion$table)

# Add model names
rf_table$Model <- "Random Forest"
logit_table$Model <- "Logistic Regression"

# Merge confusion matrix data
conf_matrix_all <- rbind(rf_table, logit_table)

# Function to plot confusion matrix
plot_confusion_matrix <- function(conf_table, model_name) {
  ggplot(data = conf_table[conf_table$Model == model_name, ], aes(x = Prediction, y = Reference, fill = Freq)) +
    geom_tile(color = "white") +
    geom_text(aes(label = Freq), size = 6, color = "white") +
    scale_fill_gradient(low = "lightblue", high = "darkblue") +
    labs(title = paste(model_name, "Confusion Matrix"),
         x = "Predicted Label",
         y = "True Label") +
    theme_minimal(base_size = 10) +
    theme(plot.title = element_text(hjust = 0.5))
}

# Plot confusion matrices
rf_plot <- plot_confusion_matrix(conf_matrix_all, "Random Forest")
logit_plot <- plot_confusion_matrix(conf_matrix_all, "Logistic Regression")

# Arrange the two confusion matrices in a grid
grid.arrange(rf_plot, logit_plot, ncol = 2)

# Recompute the ROC curves for the retrained models and print AUC values
roc_rf <- roc(test_data$diagnosis, rf_predictions_prob)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

roc_logit <- roc(test_data$diagnosis, logit_predictions_prob)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

cat("Random Forest AUC: ", auc(roc_rf), "\n")

## Random Forest AUC:  0.9940449

cat("Logistic Regression AUC: ", auc(roc_logit), "\n")

## Logistic Regression AUC:  0.9975569
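
The objectives name sensitivity as a target metric, which the confusion matrices above contain but do not print. As a minimal sketch on made-up labels (all `toy_` names are hypothetical), accuracy, sensitivity, and specificity can be derived from a confusion table as follows, with malignant (1) as the positive class:

```r
# Toy labels (made up): 1 = malignant, 0 = benign
toy_truth <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
toy_pred  <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 1)

# Cross-tabulate predictions against the true labels
toy_cm <- table(
  Predicted = factor(toy_pred, levels = c(0, 1)),
  Actual    = factor(toy_truth, levels = c(0, 1))
)

tp <- toy_cm["1", "1"]  # malignant, predicted malignant
fn <- toy_cm["0", "1"]  # malignant, predicted benign
tn <- toy_cm["0", "0"]  # benign, predicted benign
fp <- toy_cm["1", "0"]  # benign, predicted malignant

toy_metrics <- c(
  accuracy    = (tp + tn) / sum(toy_cm),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp)
)
print(round(toy_metrics, 3))  # accuracy 0.8, sensitivity 0.75, specificity 0.833
```

caret's `confusionMatrix()` reports the same quantities in its `byClass` element, for whichever factor level is treated as positive.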

Objective 2

Data Cleaning and Preprocessing

# Load data
data <- read.csv("C:\\Users\\98631\\Desktop\\WQD7004\\data.csv")

# Data preprocessing: drop the identifier, the class label, and the target itself from the predictors
features <- data[, !(names(data) %in% c("id", "diagnosis", "area_mean"))]
target <- data$area_mean

# Split into training (80%) and test (20%) sets
set.seed(42)
trainIndex <- createDataPartition(target, p = 0.8, list = FALSE)
X_train <- features[trainIndex, ]
X_test <- features[-trainIndex, ]
y_train <- target[trainIndex]
y_test <- target[-trainIndex]

Modeling and Evaluation

# Multiple linear regression model
lr_model <- lm(y_train ~ ., data = cbind(X_train, y_train = y_train))
lr_predictions <- predict(lr_model, newdata = X_test)

# Hyperparameter optimization: Random Forest
rf_grid <- expand.grid(mtry = c(2, 4, 6, 8)) # Candidate numbers of variables sampled at each split
rf_control <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
rf_model <- train(x = X_train, y = y_train,
                  method = "rf",
                  tuneGrid = rf_grid,
                  trControl = rf_control)
rf_best <- rf_model$finalModel
rf_predictions <- predict(rf_model, newdata = X_test)

# Model evaluation (R² is computed as the squared correlation between actual and predicted values)
lr_mse <- mean((y_test - lr_predictions)^2)
lr_r2 <- cor(y_test, lr_predictions)^2
cat("Multiple linear regression MSE:", lr_mse, "R²:", lr_r2, "\n")

## Multiple linear regression MSE: 402.6589 R²: 0.9970166

rf_mse <- mean((y_test - rf_predictions)^2)
rf_r2 <- cor(y_test, rf_predictions)^2
cat("Best Random Forest Model:", paste(names(rf_model$bestTune), rf_model$bestTune, sep = "=", collapse = ", "), "\n")

## Best Random Forest Model: mtry=8

cat("Hyperparameter optimization: Random forest MSE:", rf_mse, "R²:", rf_r2, "\n")

## Hyperparameter optimization: Random forest MSE: 2452.416 R²: 0.9822163
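
The R² values above are squared correlations between actual and predicted values. This can differ from the coefficient of determination, 1 - SSE/SST, when predictions are systematically biased. A minimal sketch on made-up numbers (all names are hypothetical) shows the difference:

```r
# Toy data (made up): every prediction is off by exactly +1
toy_actual    <- c(2, 4, 6, 8)
toy_predicted <- toy_actual + 1

# Mean squared error
toy_mse <- mean((toy_actual - toy_predicted)^2)

# R² as squared correlation: ignores the constant offset
r2_correlation <- cor(toy_actual, toy_predicted)^2

# R² as 1 - SSE/SST: penalizes the constant offset
r2_determination <- 1 - sum((toy_actual - toy_predicted)^2) /
  sum((toy_actual - mean(toy_actual))^2)

cat("MSE:", toy_mse,
    "squared correlation:", r2_correlation,
    "1 - SSE/SST:", r2_determination, "\n")
```

For the toy data the MSE is 1, the squared correlation is 1, and 1 - SSE/SST is 0.8, so it can be worth reporting MSE alongside R² when comparing the two regression models.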

# Visual prediction results
# Hyperparameter optimization: Random forest
rf_plot <- ggplot(data.frame(actual = y_test, predicted = rf_predictions), aes(x = actual, y = predicted)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Tuned Random Forest: predicted vs. actual", x = "Actual Area Mean", y = "Predicted Area Mean") +
  theme_minimal()

# Multiple linear regression model
lr_plot <- ggplot(data.frame(actual = y_test, predicted = lr_predictions), aes(x = actual, y = predicted)) +
  geom_point(color = "green", alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Multiple linear regression: predicted vs. actual", x = "Actual Area Mean", y = "Predicted Area Mean") +
  theme_minimal()

# Display Image
print(rf_plot)

print(lr_plot)

Conclusion

In this project, we developed and evaluated machine learning models to assist in breast cancer diagnosis, covering classification of tumor malignancy and regression on a tumor characteristic. For classification, both models separated malignant from benign tumors well on the held-out test set: Logistic Regression reached an accuracy of about 0.971 and an AUC of about 0.998, and Random Forest an accuracy of about 0.953 and an AUC of about 0.994. For regression, both models predicted the mean tumor area (area_mean) closely, with Linear Regression (MSE of about 402.7, R² of about 0.997) ahead of the tuned Random Forest (MSE of about 2452.4, R² of about 0.982). Together with the data preprocessing and feature selection steps, these results illustrate the potential of machine learning to support diagnostic accuracy and early detection in clinical workflows. The dataset is small (569 observations), so the findings are best read as a foundation for future research using more advanced techniques and more diverse datasets to improve model generalizability and robustness in real-world applications.

Breast Cancer Prediction

CHEN XI [23122459], CHEN ZHAOYANG [23098590], XU HAN [23087780], LUO QINGZHEN [23102307], ZHANG XIOAMENG [24057547]

2024-11-15
