| Matric Number | Full Name |
|---|---|
| S2034194 | Phoon Hao Xian |
| 23096526 | Nur Ariana Sofea binti Badrul Hisham |
| S2033073 | Vijaykumar Kartha Ramchandran |
The code first ensures that the readr package is installed, which is used to read the CSV file efficiently into R. If readr is already installed, it skips this step.
# Install required packages if not already installed
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
Once readr is installed, it’s loaded into the R environment using library(readr) to make the read_csv() function available for reading the CSV file.
# Load necessary libraries
library(readr)
The read_csv() function is used to read the file ‘breast_cancer_data.csv’ into a data frame called df. This function automatically detects the data types for each column based on the content.
# Now, read the file into R
df <- read_csv("breast_cancer_data.csv")
## New names:
## • `` -> `...33`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 568 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): diagnosis
## dbl (31): id, radius_mean, texture_mean, perimeter_mean, area_mean, smoothne...
## lgl (1): ...33
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data cleaning is a critical step in the data analysis process. The raw dataset often contains errors, inconsistencies and irrelevant information. In this project, data cleaning involves handling missing values, removing duplicate entries, converting variables to appropriate formats and ensuring that the dataset is ready for further analysis. Proper data cleaning enhances the accuracy and reliability of the results obtained from data analysis and model building.
In our breast cancer dataset, we have clinical attributes that need to be processed before they can be used for classification and regression tasks. Specifically, we will check for missing values, drop the empty X column, convert the diagnosis column to a factor, check for and remove duplicate rows, and save the cleaned dataset for later use.
The first step in data cleaning is to load the necessary libraries for data manipulation and visualization. In this case, we will use dplyr for data manipulation tasks and ggplot2 for visualizations.
# Load necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
By loading these libraries, we gain access to the functions and methods that allow us to perform operations like filtering, selecting and transforming the dataset. dplyr is a popular R package for efficient data manipulation, and ggplot2 helps in visualizing the cleaned dataset.
The next step is to load the ‘breast_cancer_data.csv’ dataset into R. We will use the read.csv() function to read the dataset into a data frame.
# Load the renamed CSV file
df <- read.csv("breast_cancer_data.csv", stringsAsFactors = FALSE)
# Preview the data
head(df)
# Get the dimensions of the dataset
dim(df) # Returns the number of rows and columns in the dataset
## [1] 569 33
This step loads the dataset into the df variable. The head() function is used to preview the first six rows of the data to understand its structure before any processing is done. By setting stringsAsFactors = FALSE, we ensure that categorical variables are not automatically converted to factors (which can interfere with analysis).
Now, we will check the structure of the dataset using the str() function. This allows us to see the data types of each column.
str(df)
## 'data.frame': 569 obs. of 33 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
## $ X : logi NA NA NA NA NA NA ...
The str() function provides an overview of the dataset, including the number of observations (rows) and variables (columns), as well as the data type of each column. This is useful for identifying potential issues, such as columns that should be factors but are currently stored as characters.
We use the summary() function to get a statistical summary of the dataset. This function will provide basic descriptive statistics for each numeric column.
summary(df)
## id diagnosis radius_mean texture_mean
## Min. : 8670 Length:569 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 Class :character 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Mode :character Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.06154 Median :0.03350 Median :0.1792 Median :0.06154
## Mean :0.08880 Mean :0.04892 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.42680 Max. :0.20120 Max. :0.3040 Max. :0.09744
## radius_se texture_se perimeter_se area_se
## Min. :0.1115 Min. :0.3602 Min. : 0.757 Min. : 6.802
## 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850
## Median :0.3242 Median :1.1080 Median : 2.287 Median : 24.530
## Mean :0.4052 Mean :1.2169 Mean : 2.866 Mean : 40.337
## 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190
## Max. :2.8730 Max. :4.8850 Max. :21.980 Max. :542.200
## smoothness_se compactness_se concavity_se concave.points_se
## Min. :0.001713 Min. :0.002252 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509 1st Qu.:0.007638
## Median :0.006380 Median :0.020450 Median :0.02589 Median :0.010930
## Mean :0.007041 Mean :0.025478 Mean :0.03189 Mean :0.011796
## 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205 3rd Qu.:0.014710
## Max. :0.031130 Max. :0.135400 Max. :0.39600 Max. :0.052790
## symmetry_se fractal_dimension_se radius_worst texture_worst
## Min. :0.007882 Min. :0.0008948 Min. : 7.93 Min. :12.02
## 1st Qu.:0.015160 1st Qu.:0.0022480 1st Qu.:13.01 1st Qu.:21.08
## Median :0.018730 Median :0.0031870 Median :14.97 Median :25.41
## Mean :0.020542 Mean :0.0037949 Mean :16.27 Mean :25.68
## 3rd Qu.:0.023480 3rd Qu.:0.0045580 3rd Qu.:18.79 3rd Qu.:29.72
## Max. :0.078950 Max. :0.0298400 Max. :36.04 Max. :49.54
## perimeter_worst area_worst smoothness_worst compactness_worst
## Min. : 50.41 Min. : 185.2 Min. :0.07117 Min. :0.02729
## 1st Qu.: 84.11 1st Qu.: 515.3 1st Qu.:0.11660 1st Qu.:0.14720
## Median : 97.66 Median : 686.5 Median :0.13130 Median :0.21190
## Mean :107.26 Mean : 880.6 Mean :0.13237 Mean :0.25427
## 3rd Qu.:125.40 3rd Qu.:1084.0 3rd Qu.:0.14600 3rd Qu.:0.33910
## Max. :251.20 Max. :4254.0 Max. :0.22260 Max. :1.05800
## concavity_worst concave.points_worst symmetry_worst fractal_dimension_worst
## Min. :0.0000 Min. :0.00000 Min. :0.1565 Min. :0.05504
## 1st Qu.:0.1145 1st Qu.:0.06493 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2267 Median :0.09993 Median :0.2822 Median :0.08004
## Mean :0.2722 Mean :0.11461 Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3829 3rd Qu.:0.16140 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :1.2520 Max. :0.29100 Max. :0.6638 Max. :0.20750
## X
## Mode:logical
## NA's:569
##
##
##
##
The summary() function gives us a quick overview of the central tendencies (mean, median), range (min, max) and the spread (quartiles) for numeric columns. This step helps us understand the distribution of each attribute and detect potential outliers or data quality issues.
Next, we check for missing values across the dataset. This will help us identify if any columns contain NA (missing) values that need to be addressed.
colSums(is.na(df))
## id diagnosis radius_mean
## 0 0 0
## texture_mean perimeter_mean area_mean
## 0 0 0
## smoothness_mean compactness_mean concavity_mean
## 0 0 0
## concave.points_mean symmetry_mean fractal_dimension_mean
## 0 0 0
## radius_se texture_se perimeter_se
## 0 0 0
## area_se smoothness_se compactness_se
## 0 0 0
## concavity_se concave.points_se symmetry_se
## 0 0 0
## fractal_dimension_se radius_worst texture_worst
## 0 0 0
## perimeter_worst area_worst smoothness_worst
## 0 0 0
## compactness_worst concavity_worst concave.points_worst
## 0 0 0
## symmetry_worst fractal_dimension_worst X
## 0 0 569
The is.na() function checks for missing values in the dataset. colSums() will sum the missing values in each column. This step is crucial because missing data can impact the performance of machine learning models, so we need to decide whether to remove or impute the missing values.
We will remove the ‘X’ column, which is empty and not useful for our analysis.
df <- df %>% select(-X)
This step uses dplyr’s select() function to remove the irrelevant column. Removing unnecessary columns ensures that our dataset only contains relevant variables for analysis, reducing the complexity of the data.
Since the diagnosis column contains categorical values (Benign and Malignant), we will convert it to a factor variable.
df$diagnosis <- as.factor(df$diagnosis)
str(df$diagnosis)
## Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
table(df$diagnosis)
##
## B M
## 357 212
Converting the diagnosis column to a factor ensures that R treats it as a categorical variable. The str() function confirms the data type, and table() shows the count of each category (Benign vs. Malignant). This is important for classification tasks.
We check for duplicate rows in the dataset to ensure data integrity.
sum(duplicated(df))
## [1] 0
The duplicated() function checks for duplicate rows. If duplicates are found, they can introduce bias or overfitting in machine learning models, so they need to be removed.
If any duplicate rows exist, we will remove them to ensure that each observation is unique.
df <- df[!duplicated(df), ]
sum(duplicated(df))
## [1] 0
This step removes duplicate rows using the duplicated() function, and then rechecks for any remaining duplicates. Removing duplicates ensures that the dataset only contains unique observations.
Once the dataset is cleaned and preprocessed, we save it as a new CSV file. This allows us to retain a copy of the cleaned data for future use and analysis, which can be helpful for reproducibility or sharing the dataset with others.
By using a relative file path (“cleaned_breast_cancer_data.csv”), we ensure that the file will be saved in the current working directory, making it portable across different environments.
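The chunk that writes the file is not shown above; a minimal sketch using base R (assuming the cleaned data frame is still called df) would be:

# Save the cleaned data frame to the current working directory
# (row.names = FALSE avoids writing an extra index column)
write.csv(df, "cleaned_breast_cancer_data.csv", row.names = FALSE)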
Exploratory Data Analysis (EDA) is an essential step in data analysis that helps us uncover patterns, relationships and trends within the dataset. In the context of our project, EDA will allow us to understand the distribution of key variables and their relationships to each other. This step also involves visualizing the data to detect any potential anomalies or outliers that might affect the results.
In this section, we will visualize the class balance of the diagnosis variable, compare key features (radius_mean, texture_mean, smoothness_mean) across diagnosis groups using boxplots and histograms, examine pairwise relationships with scatter and pair plots, and inspect the correlation structure of the numeric features.
Diagnosis Counts
We begin with a bar plot to visualize the distribution of
diagnoses (Benign vs. Malignant).
# Bar plot of diagnosis counts
ggplot(df, aes(x = diagnosis)) +
geom_bar(fill = c("lightblue", "salmon")) +
labs(title = "Count of Diagnosis (Benign vs Malignant)", x = "Diagnosis", y = "Count") +
theme_minimal()
Explanation for the Output of Step 1 (Bar Plot of Diagnosis Counts)
The bar plot visualizes the distribution of the diagnosis variable, which indicates whether a tumor is Benign ('B') or Malignant ('M'). In this case, the dataset contains 357 benign and 212 malignant tumors (roughly 63% vs. 37%).
Conclusion:
The dataset is imbalanced with more benign tumors
compared to malignant ones. This imbalance is important because it may
affect how we train machine learning models, especially for
classification tasks. We may need to consider techniques like
oversampling, undersampling or using
weighted models to account for this imbalance during model training.
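Should we later decide to rebalance the classes, one simple option is random oversampling of the minority class with caret's upSample(). The sketch below is illustrative only (it assumes the train_data split created in the classification section later in this report) and is not applied here:

# Illustrative only: randomly oversample malignant cases so both classes have equal counts
library(caret)
balanced_train <- upSample(x = train_data[, setdiff(names(train_data), "diagnosis")],
                           y = train_data$diagnosis, yname = "diagnosis")
table(balanced_train$diagnosis)   # both classes now have the same number of rows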
radius_mean
We visualize the distribution of radius_mean across
diagnosis categories using a boxplot.
# Boxplot for radius_mean
ggplot(df, aes(x = diagnosis, y = radius_mean, fill = diagnosis)) +
geom_boxplot() +
labs(title = "Distribution of Radius Mean by Diagnosis", x = "Diagnosis", y = "Radius Mean") +
theme_minimal()
Explanation for the Output of Step 2 (Boxplot for
radius_mean by Diagnosis)
The boxplot shown here visualizes the distribution
of the radius_mean variable across the two diagnosis
categories: Benign (B) and Malignant
(M).
Key Observations from the Boxplot:
- The distribution of radius_mean for benign tumors is relatively narrow, with a median around 12.8.
- The radius_mean for malignant tumors is generally higher, with a median around 17.3.

Conclusion:
The radius_mean variable appears to be higher for malignant
tumors compared to benign ones. The distribution of
malignant tumor sizes is more spread out, and it has a
higher median compared to benign tumors. This
difference could be an important feature in classification
models for predicting whether a tumor is benign or
malignant.
texture_mean
A similar boxplot for texture_mean helps compare its
distribution across diagnosis categories.
# Boxplot for texture_mean
ggplot(df, aes(x = diagnosis, y = texture_mean, fill = diagnosis)) +
geom_boxplot() +
labs(title = "Distribution of Texture Mean by Diagnosis", x = "Diagnosis", y = "Texture Mean") +
theme_minimal()
Explanation for the Output of Step 3 (Boxplot for
texture_mean by Diagnosis)
The boxplot shown here visualizes the distribution
of the texture_mean variable across the two diagnosis
categories: Benign (B) and Malignant
(M).
Key Observations from the Boxplot:
- The texture_mean in benign tumors mostly falls between approximately 15 and 20.
- The texture_mean for malignant tumors ranges from approximately 19 to 24.

Conclusion:
The texture_mean variable exhibits a higher
median for malignant tumors compared to benign tumors.
Additionally, malignant tumors tend to have a wider
spread of texture values, indicating more variability
in their texture compared to benign tumors.
radius_mean
We plot a histogram to examine the distribution of
radius_mean.
# Histogram for radius_mean distribution by diagnosis
ggplot(df, aes(x = radius_mean, fill = diagnosis)) +
geom_histogram(position = "dodge", bins = 30) +
labs(title = "Distribution of Radius Mean by Diagnosis", x = "Radius Mean", y = "Count") +
theme_minimal()
Explanation for the Output of Step 4 (Histogram for
radius_mean by Diagnosis)
The histogram shown here visualizes the distribution
of the radius_mean variable, which represents the mean of
the tumor’s radius. This variable is important as it gives us an idea
about the overall size of the tumor.
Key Observations from the Histogram:
- Benign tumors mostly have radius_mean values in the range of 6 to 18. The histogram shows a right-skewed distribution, indicating that the majority of benign tumors are smaller in size.
- The benign counts peak around radius_mean = 10, which suggests that many benign tumors have a radius close to this value.
- The number of benign tumors falls off as radius_mean increases above 12.
- Malignant tumors cover a wider range of radius_mean values; however, there is a greater concentration in the range between 13 and 21.
- There is more spread in radius_mean for malignant tumors, with a long tail on the right side of the distribution. This indicates that while most malignant tumors have a radius_mean similar to benign tumors, a few malignant tumors are significantly larger.

Conclusion:
The radius_mean variable exhibits a different distribution
for Benign and Malignant tumors. While
both types of tumors show some overlap in their radius_mean
values (especially between 12 and 18), malignant tumors
seem to exhibit wider variation and a tendency towards
larger tumor sizes. The histogram also reveals that the
majority of Benign tumors are smaller in size, whereas
Malignant tumors exhibit more variability and larger
values. This information is valuable for building a
classification model to distinguish between Benign and
Malignant tumors based on their radius_mean.
radius_mean vs. texture_mean
We create a scatter plot to visualize the relationship between
radius_mean and texture_mean.
# Scatter plot of radius_mean vs. texture_mean, colored by diagnosis
ggplot(df, aes(x = radius_mean, y = texture_mean, color = diagnosis)) +
geom_point() +
labs(title = "Radius Mean vs. Texture Mean by Diagnosis", x = "Radius Mean", y = "Texture Mean") +
theme_minimal()
Explanation for the Output of Step 5 (Scatter Plot of
radius_mean vs. texture_mean by
Diagnosis)
The scatter plot shown here visualizes the
relationship between two features of the dataset:
radius_mean and texture_mean. These two
features are critical in identifying characteristics of the tumors in
the dataset.
Key Observations from the Scatter Plot:
- Benign tumors (red points) cluster toward the lower end of the radius_mean axis, with radius_mean values mostly between 8 and 16.
- The texture_mean values for Benign tumors range from about 10 to 28. The red points form a relatively dense cloud, indicating that most Benign tumors have smaller radii and moderate texture values.
- Malignant tumors (blue points) spread toward higher radius_mean and texture_mean values, indicating that Malignant tumors tend to be larger and have more varied textures.

Conclusion: The scatter plot indicates that there is
a potential relationship between the
radius_mean and texture_mean of the tumors.
Malignant tumors tend to have larger
radii and more varied textures, while
Benign tumors show smaller radii and
more consistent textures. This distinction is helpful
in identifying the type of tumor and may assist in building a
classification model for tumor diagnosis based on these
features.
smoothness_mean by Diagnosis
We use a boxplot to visualize the distribution of
smoothness_mean for each diagnosis.
# Boxplot for smoothness_mean by diagnosis
ggplot(df, aes(x = diagnosis, y = smoothness_mean, fill = diagnosis)) +
geom_boxplot() +
labs(title = "Smoothness Mean by Diagnosis", x = "Diagnosis", y = "Smoothness Mean") +
theme_minimal()
Explanation for the Output of Step 6 (Boxplot for
smoothness_mean by Diagnosis)
The boxplot displayed here visualizes the
distribution of the smoothness_mean feature, segmented by
the diagnosis of the tumor (Benign vs. Malignant). The
smoothness_mean measures the smoothness of the tumor’s
surface, which can be indicative of the tumor’s texture and
consistency.
Key Observations from the Boxplot:
- Benign tumors have a lower median smoothness_mean (around 0.090), as shown by the red box.
- Malignant tumors show a slightly higher median and a wider spread of smoothness_mean.

Conclusion: The boxplot reveals that Benign tumors tend to have a more consistent smoothness with a lower median value, whereas Malignant tumors exhibit more variability in smoothness and have a slightly higher median. These observations can be useful in distinguishing between Benign and Malignant tumors based on smoothness, providing a useful feature for classification models.
We compute and visualize the correlation matrix to understand the relationships between numeric features.
numeric_cols <- sapply(df, is.numeric)
cor_matrix <- cor(df[, numeric_cols], use = "complete.obs")
library(corrplot)
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.7)
Explanation for the Output of Step 7 (Correlation Matrix)
The correlation matrix displayed here visualizes the relationships between numeric features in the dataset, particularly focusing on the correlation between various tumor-related measurements. The matrix is rendered as a heatmap in which the color intensity indicates the strength of the relationship between pairs of variables. The color scale on the right side of the plot indicates the strength and direction of the correlation:
- Blue represents positive correlations, where both variables increase or decrease together.
- Red represents negative correlations, where as one variable increases, the other decreases.
- White/light colors indicate a weak or no correlation between variables.
Key Observations from the Correlation Matrix:
- radius_mean and perimeter_mean: These two features show a very strong positive correlation (close to 1), meaning that as the radius increases, the perimeter also increases. This is expected because a larger radius would typically result in a larger perimeter.
- area_mean and perimeter_mean: The correlation between these two features is also strong, as a larger area tends to correlate with a larger perimeter.
- smoothness_mean and symmetry_mean: These features also show a positive correlation (almost 0.7), which suggests that smoother tumors tend to be more symmetrical.
- radius_mean and fractal_dimension_mean show only a weak correlation. The fractal dimension measures the irregularity or complexity of the tumor, which doesn't necessarily increase or decrease with the radius.

Conclusion: The correlation matrix is a useful tool
for identifying relationships between variables. It helps us identify
redundant features (e.g., radius_mean,
perimeter_mean and area_mean are highly
correlated) and suggests which features might provide complementary
information when building predictive models. The matrix also indicates
which variables might be combined or excluded for model building based
on their correlations.
We use a pair plot to visualize the relationships between selected
features (e.g., radius_mean,
texture_mean).
library(GGally)
features_subset <- c("diagnosis", "radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean")
ggpairs(df[, features_subset], mapping = aes(color = diagnosis), title = "Pairplot of Selected Features")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Explanation for the Output of Step 8 (Pair Plot of Selected Features)
The pair plot is a powerful visualization tool for exploring relationships between multiple features in the dataset. It provides a matrix of scatter plots, histograms and correlation coefficients to help identify patterns, trends and possible relationships between pairs of variables.
Key Observations from the Pair Plot:
- radius_mean and perimeter_mean show distinct distributions for Benign and Malignant tumors, with Malignant tumors (denoted by blue) having higher values.
- texture_mean and area_mean also exhibit different distributions between Benign and Malignant cases, which suggests these features can help differentiate the two classes.
- There is a strong positive relationship between radius_mean and perimeter_mean (as expected, since a larger radius leads to a larger perimeter); the correlation value displayed in the plot (0.998) confirms this.
- The relationship between radius_mean and texture_mean shows a weak positive correlation (0.324), suggesting that these two features do not have a strong linear relationship.
- The relationship between area_mean and perimeter_mean also demonstrates a strong positive correlation (0.987), similar to the relationship between radius_mean and perimeter_mean.

Correlation coefficients reported in the plot:
- radius_mean vs. perimeter_mean: 0.998 (very strong correlation, indicating that as radius increases, perimeter also increases).
- perimeter_mean vs. area_mean: 0.987 (strong positive correlation, as expected for geometric properties of tumors).
- radius_mean vs. texture_mean: 0.324 (weak positive correlation).
- area_mean vs. smoothness_mean: 0.177 (weak positive correlation).
- texture_mean vs. smoothness_mean: -0.023 (essentially no linear correlation).

Conclusion: The pair plot visually confirms the relationships between various features in the dataset. Some features like radius_mean, perimeter_mean and area_mean are strongly correlated, which can inform our feature selection and model-building process. On the other hand, features with weak correlations, like radius_mean and texture_mean, may offer additional unique information for classification or regression models.
Based on exploratory data analysis (EDA) and the cleaned breast cancer dataset, we will train a machine learning model to predict the size of a tumor (measured by area_mean) using various clinical measurements. Since we already have labeled data with tumor size and diagnosis (benign or malignant), supervised learning is the appropriate choice.
We begin with Linear Regression, which is suitable for predicting continuous numerical variables using numerical input features. The objective is to find the best-fit linear relationship between the input features and the target variable.
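Concretely, with predictors x_1, ..., x_p, the fitted model has the form

$$\hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,$$

and lm() chooses the coefficients by ordinary least squares, i.e. by minimizing the residual sum of squares $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ over the training observations.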
The original dataset includes many clinical features. However, for
our model, we remove those that are directly correlated with or
derivable from the target variable (area_mean), to prevent
data leakage. Specifically, we exclude features like
radius_*, perimeter_*, area_se,
and area_worst.
Target Variable:
- area_mean (tumor size)

Excluded Features:
- radius_mean, radius_se, radius_worst
- perimeter_mean, perimeter_se, perimeter_worst
- area_se, area_worst

Selected Input Features:
- texture_mean, texture_se, texture_worst
- the smoothness_, compactness_, concavity_, concave.points_, symmetry_ and fractal_dimension_ variables (mean, se and worst variants of each)

In this section, we build a linear regression model to predict the
average tumor area (area_mean) using several clinical
features from the breast cancer dataset. We first load the data and
inspect its structure, then carefully select input features to avoid
data leakage by excluding variables directly related to area, radius, or
perimeter. We then fit a linear model with the selected predictors and
display the summary of model coefficients and overall performance.
The steps, numbered as comments in the code below, include loading the required packages, reading the cleaned dataset, selecting the input features, combining them with the target variable, fitting the linear model and inspecting its summary.
# 1. Load required packages for data manipulation and analysis
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# 2. Read in the cleaned breast cancer dataset
data <- read.csv("cleaned_breast_cancer_data.csv")
# 3. Display all column names in the dataset
colnames(data)
## [1] "id" "diagnosis"
## [3] "radius_mean" "texture_mean"
## [5] "perimeter_mean" "area_mean"
## [7] "smoothness_mean" "compactness_mean"
## [9] "concavity_mean" "concave.points_mean"
## [11] "symmetry_mean" "fractal_dimension_mean"
## [13] "radius_se" "texture_se"
## [15] "perimeter_se" "area_se"
## [17] "smoothness_se" "compactness_se"
## [19] "concavity_se" "concave.points_se"
## [21] "symmetry_se" "fractal_dimension_se"
## [23] "radius_worst" "texture_worst"
## [25] "perimeter_worst" "area_worst"
## [27] "smoothness_worst" "compactness_worst"
## [29] "concavity_worst" "concave.points_worst"
## [31] "symmetry_worst" "fractal_dimension_worst"
# 4. Select input features (clinical variables to use as predictors)
input_features <- data %>%
select(
texture_mean, texture_se, texture_worst,
smoothness_mean, smoothness_se, smoothness_worst,
compactness_mean, compactness_se, compactness_worst,
concavity_mean, concavity_se, concavity_worst,
concave.points_mean, concave.points_se, concave.points_worst,
symmetry_mean, symmetry_se, symmetry_worst,
fractal_dimension_mean, fractal_dimension_se, fractal_dimension_worst
)
# 5. Create a new data frame that combines the target variable (area_mean) with selected predictors
model_data <- cbind(area_mean = data$area_mean, input_features)
# 6. Fit a linear regression model to predict area_mean using the selected clinical features
model <- lm(area_mean ~ ., data = model_data)
# 7. Display a summary of the model, including coefficients and overall fit statistics
summary(model)
##
## Call:
## lm(formula = area_mean ~ ., data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -344.16 -79.47 -4.35 60.31 737.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1899.863 94.586 20.086 < 2e-16 ***
## texture_mean 4.464 3.977 1.123 0.2621
## texture_se -27.704 18.233 -1.519 0.1292
## texture_worst -1.209 3.451 -0.350 0.7263
## smoothness_mean -2081.443 988.983 -2.105 0.0358 *
## smoothness_se 1178.960 3316.281 0.356 0.7223
## smoothness_worst -651.244 719.240 -0.905 0.3656
## compactness_mean 858.997 504.839 1.702 0.0894 .
## compactness_se -713.588 1089.052 -0.655 0.5126
## compactness_worst -110.845 188.287 -0.589 0.5563
## concavity_mean 181.210 488.417 0.371 0.7108
## concavity_se 1029.424 633.448 1.625 0.1047
## concavity_worst -266.272 133.468 -1.995 0.0465 *
## concave.points_mean 8265.114 905.487 9.128 < 2e-16 ***
## concave.points_se -12134.831 2448.102 -4.957 9.57e-07 ***
## concave.points_worst 870.287 445.687 1.953 0.0514 .
## symmetry_mean -493.813 368.524 -1.340 0.1808
## symmetry_se 1980.721 1328.944 1.490 0.1367
## symmetry_worst -359.753 242.540 -1.483 0.1386
## fractal_dimension_mean -21533.110 2318.737 -9.287 < 2e-16 ***
## fractal_dimension_se 14127.010 5627.863 2.510 0.0124 *
## fractal_dimension_worst 991.675 1193.376 0.831 0.4063
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 121.1 on 547 degrees of freedom
## Multiple R-squared: 0.886, Adjusted R-squared: 0.8816
## F-statistic: 202.5 on 21 and 547 DF, p-value: < 2.2e-16
We now assess the predictive performance of our linear regression model. To do this, we randomly split the dataset into training (80%) and test (20%) sets. The model is trained on the training set and then used to predict tumor size on the test set. We use Root Mean Squared Error (RMSE) as our evaluation metric. RMSE provides an interpretable measure of the average prediction error (in the same units as the target variable), with lower values indicating better model performance.
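For reference, RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$

which is exactly what the sqrt(mean(...)) line in the code below computes.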
# Split into training and test sets
set.seed(123)
index <- sample(1:nrow(model_data), 0.8 * nrow(model_data))
train <- model_data[index, ]
test <- model_data[-index, ]
# Train model
model_train <- lm(area_mean ~ ., data = train)
# Predict on test data
predictions <- predict(model_train, newdata = test)
# Calculate RMSE
rmse <- sqrt(mean((test$area_mean - predictions)^2))
print(paste("RMSE:", round(rmse, 2)))
## [1] "RMSE: 121.41"
We now apply Ridge and Lasso regression, which help reduce overfitting by penalizing large coefficients. Ridge regression applies L2 regularization, which shrinks coefficients toward zero without setting them exactly to zero. Lasso regression uses L1 regularization, which can shrink some coefficients all the way to zero, effectively performing feature selection.
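For a Gaussian response, glmnet fits all three of these models by minimizing the same penalized least-squares objective,

$$\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right],$$

where alpha = 0 gives Ridge, alpha = 1 gives Lasso and intermediate values (we use alpha = 0.5 below) give the Elastic Net; cv.glmnet() selects lambda by cross-validation.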
We reload and preprocess the data, selecting and scaling input features to ensure fair evaluation across models.
#install.packages("glmnet")
#install.packages("caret") # for splitting
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Load the data
data <- read.csv("cleaned_breast_cancer_data.csv")
# Select only input features (excluding radius, area, perimeter)
features <- data %>%
select(
texture_mean, texture_se, texture_worst,
smoothness_mean, smoothness_se, smoothness_worst,
compactness_mean, compactness_se, compactness_worst,
concavity_mean, concavity_se, concavity_worst,
concave.points_mean, concave.points_se, concave.points_worst,
symmetry_mean, symmetry_se, symmetry_worst,
fractal_dimension_mean, fractal_dimension_se, fractal_dimension_worst
)
# Scale the features
scaled_features <- scale(features)
# Prepare input matrix and target
X <- as.matrix(scaled_features)
y <- data$area_mean
# Split data into training and test set
set.seed(123)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[trainIndex, ]
X_test <- X[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]
We fit a Ridge regression model and evaluate its RMSE on the test set.
ridge_model <- cv.glmnet(X_train, y_train, alpha = 0) # alpha = 0 for Ridge
plot(ridge_model)
# Best lambda
ridge_lambda <- ridge_model$lambda.min
# Predict and evaluate
ridge_preds <- predict(ridge_model, s = ridge_lambda, newx = X_test)
ridge_rmse <- sqrt(mean((y_test - ridge_preds)^2))
cat("Ridge RMSE:", round(ridge_rmse, 2), "\n")
## Ridge RMSE: 138.61
We fit a Lasso regression model and report the test RMSE.
lasso_model <- cv.glmnet(X_train, y_train, alpha = 1) # alpha = 1 for Lasso
plot(lasso_model)
# Best lambda
lasso_lambda <- lasso_model$lambda.min
# Predict and evaluate
lasso_preds <- predict(lasso_model, s = lasso_lambda, newx = X_test)
lasso_rmse <- sqrt(mean((y_test - lasso_preds)^2))
cat("Lasso RMSE:", round(lasso_rmse, 2), "\n")
## Lasso RMSE: 125.33
We fit an Elastic Net model, which blends Ridge and Lasso penalties and report the RMSE.
enet_model <- cv.glmnet(X_train, y_train, alpha = 0.5)
enet_lambda <- enet_model$lambda.min
enet_preds <- predict(enet_model, s = enet_lambda, newx = X_test)
enet_rmse <- sqrt(mean((y_test - enet_preds)^2))
cat("Elastic Net RMSE:", round(enet_rmse, 2), "\n")
## Elastic Net RMSE: 133.05
We evaluate the Ridge model separately on benign and malignant tumors to check for differences in model accuracy by diagnosis class.
data$diagnosis <- ifelse(data$diagnosis == "M", 1, 0)
# Reconstruct test_data
test_data <- data[-trainIndex, ]
# Predict on test set
preds <- as.vector(predict(ridge_model, s = ridge_lambda, newx = X_test))
# Compute RMSE for benign (diagnosis == 0)
rmse_benign <- sqrt(mean((preds[test_data$diagnosis == 0] - y_test[test_data$diagnosis == 0])^2))
# Compute RMSE for malignant (diagnosis == 1)
rmse_malignant <- sqrt(mean((preds[test_data$diagnosis == 1] - y_test[test_data$diagnosis == 1])^2))
# Print
cat("RMSE (Benign):", round(rmse_benign, 2), "\n")
## RMSE (Benign): 106
cat("RMSE (Malignant):", round(rmse_malignant, 2), "\n")
## RMSE (Malignant): 177.6
The model performs better on benign cases than on malignant ones, likely due to class imbalance and higher variability in malignant tumors. Additional steps like upsampling, class-weighted loss, or separate modeling may help improve results.
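As a sketch of the "separate modeling" idea (illustrative only, reusing the objects created above; we do not pursue it further here), one Ridge model could be fitted per diagnosis class:

# Illustrative sketch: fit one Ridge model per diagnosis class
# (data$diagnosis is 0/1 at this point; X_train, y_train, trainIndex come from above)
diag_train <- data$diagnosis[trainIndex]
ridge_benign <- cv.glmnet(X_train[diag_train == 0, ], y_train[diag_train == 0], alpha = 0)
ridge_malignant <- cv.glmnet(X_train[diag_train == 1, ], y_train[diag_train == 1], alpha = 0)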
To obtain a more robust performance estimate, we use 10-fold cross-validation on the linear regression model.
# Install caret if needed
#install.packages("caret")
library(caret)
# Create model_data again (if not already in environment)
model_data <- cbind(area_mean = data$area_mean, input_features)
# Set up 10-fold cross-validation
set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
# Train linear regression model using caret with cross-validation
cv_model <- train(
area_mean ~ .,
data = model_data,
method = "lm",
trControl = train_control,
metric = "RMSE"
)
# View cross-validated performance
print(cv_model)
## Linear Regression
##
## 569 samples
## 21 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 512, 513, 511, 512, 512, 513, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 126.1271 0.8759337 93.27927
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
In this section, we use several machine learning models to predict whether a tumor is benign or malignant based on clinical features from the breast cancer dataset. We will compare models using accuracy and other classification metrics.
First, we load the required libraries, import the data, clean it and split it into training and test sets for model evaluation.
# install.packages("randomForest")
# install.packages("xgboost")
# Load required libraries
library(tidyverse)
library(caret)
library(ggplot2)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(rpart)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(e1071)
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
# Load the dataset
data <- read.csv("cleaned_breast_cancer_data.csv")
# Remove the 'id' column as it's not relevant for prediction
data <- data %>% select(-id)
# Convert diagnosis to a factor (M = Malignant, B = Benign)
data$diagnosis <- as.factor(ifelse(data$diagnosis == "M", 1, 0))
# Check the distribution of the target variable
table(data$diagnosis)
##
## 0 1
## 357 212
prop.table(table(data$diagnosis))
##
## 0 1
## 0.6274165 0.3725835
# Set seed for reproducibility
set.seed(42)
# Split data into training (80%) and testing (20%) sets
train_index <- createDataPartition(data$diagnosis, p = 0.8, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
# Check distribution in both sets
table(train_data$diagnosis)
##
## 0 1
## 286 170
table(test_data$diagnosis)
##
## 0 1
## 71 42
We fit a regularized logistic regression model to classify tumor diagnosis, using cross-validation for model selection.
# Train regularized logistic regression model
logit_model <- train(diagnosis ~ .,
data = train_data,
method = "glmnet",
trControl = trainControl(method = "cv", number = 5),
preProcess = c("center", "scale"),
tuneLength = 5)
# Predictions on test set
logit_pred <- predict(logit_model, newdata = test_data)
logit_prob <- predict(logit_model, newdata = test_data, type = "prob")[, "1"]
# Confusion matrix
logit_cm <- confusionMatrix(logit_pred, test_data$diagnosis, positive = "1")
print(logit_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 69 0
## 1 2 42
##
## Accuracy : 0.9823
## 95% CI : (0.9375, 0.9978)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9625
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 1.0000
## Specificity : 0.9718
## Pos Pred Value : 0.9545
## Neg Pred Value : 1.0000
## Prevalence : 0.3717
## Detection Rate : 0.3717
## Detection Prevalence : 0.3894
## Balanced Accuracy : 0.9859
##
## 'Positive' Class : 1
##
# Accuracy
cat("Logistic Regression's Accuracy:", round(logit_cm$overall['Accuracy'], 4), "\n")
## Logistic Regression's Accuracy: 0.9823
We train a random forest classifier, an ensemble method known for robustness and high accuracy.
# Train random forest model
rf_model <- train(diagnosis ~ .,
data = train_data,
method = "rf",
trControl = trainControl(method = "cv", number = 5),
importance = TRUE)
# Predictions on test set
rf_pred <- predict(rf_model, newdata = test_data)
rf_prob <- predict(rf_model, newdata = test_data, type = "prob")[,"1"]
# Confusion matrix
rf_cm <- confusionMatrix(rf_pred, test_data$diagnosis, positive = "1")
print(rf_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 70 0
## 1 1 42
##
## Accuracy : 0.9912
## 95% CI : (0.9517, 0.9998)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9811
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 1.0000
## Specificity : 0.9859
## Pos Pred Value : 0.9767
## Neg Pred Value : 1.0000
## Prevalence : 0.3717
## Detection Rate : 0.3717
## Detection Prevalence : 0.3805
## Balanced Accuracy : 0.9930
##
## 'Positive' Class : 1
##
# Accuracy
cat("Random Forest's Accuracy:", round(rf_cm$overall['Accuracy'], 4), "\n")
## Random Forest's Accuracy: 0.9912
We fit a support vector machine for binary classification, extracting probabilities for ROC analysis.
# Train the SVM model with probability = TRUE
svm_model <- svm(diagnosis ~ ., data = train_data, probability = TRUE)
# Predict on test set
svm_pred <- predict(svm_model, newdata = test_data, probability = TRUE)
# Extract probabilities for class "1" (Malignant)
svm_prob <- attr(svm_pred, "probabilities")[, "1"]
# Confusion Matrix
svm_cm <- confusionMatrix(svm_pred, test_data$diagnosis, positive = "1")
print(svm_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 68 0
## 1 3 42
##
## Accuracy : 0.9735
## 95% CI : (0.9244, 0.9945)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.944
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 0.9577
## Pos Pred Value : 0.9333
## Neg Pred Value : 1.0000
## Prevalence : 0.3717
## Detection Rate : 0.3717
## Detection Prevalence : 0.3982
## Balanced Accuracy : 0.9789
##
## 'Positive' Class : 1
##
# Accuracy
cat("SVM's Accuracy:", round(svm_cm$overall['Accuracy'], 4), "\n")
## SVM's Accuracy: 0.9735
We use XGBoost, a powerful gradient boosting algorithm, for classification and report accuracy on the test set.
# Ensure diagnosis is numeric (1 = Malignant, 0 = Benign)
train_label <- as.numeric(as.character(train_data$diagnosis))
test_label <- as.numeric(as.character(test_data$diagnosis))
# Prepare model matrices
train_matrix <- model.matrix(diagnosis ~ . - 1, data = train_data)
test_matrix <- model.matrix(diagnosis ~ . - 1, data = test_data)
# Convert to xgb.DMatrix
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)
# Set parameters
params <- list(
objective = "binary:logistic",
eval_metric = "auc",
max_depth = 6,
eta = 0.1,
gamma = 0,
colsample_bytree = 0.8,
min_child_weight = 1,
subsample = 0.8
)
# Train XGBoost model
xgb_model <- xgb.train(
params = params,
data = dtrain,
nrounds = 100,
watchlist = list(train = dtrain, test = dtest),
print_every_n = 10,
verbose = 1
)
## [1] train-auc:0.987258 test-auc:0.980382
## [11] train-auc:0.999044 test-auc:0.999665
## [21] train-auc:0.999609 test-auc:0.999665
## [31] train-auc:0.999897 test-auc:0.999329
## [41] train-auc:0.999959 test-auc:1.000000
## [51] train-auc:0.999979 test-auc:1.000000
## [61] train-auc:1.000000 test-auc:1.000000
## [71] train-auc:1.000000 test-auc:0.999665
## [81] train-auc:1.000000 test-auc:0.999665
## [91] train-auc:1.000000 test-auc:0.999665
## [100] train-auc:1.000000 test-auc:0.999665
# Predictions
xgb_prob <- predict(xgb_model, dtest)
xgb_pred <- ifelse(xgb_prob > 0.5, "1", "0")
xgb_pred <- factor(xgb_pred, levels = c("0", "1"))
test_data$diagnosis <- factor(test_data$diagnosis, levels = c("0", "1"))
# Confusion matrix
xgb_cm <- confusionMatrix(xgb_pred, test_data$diagnosis, positive = "1")
print(xgb_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 69 0
## 1 2 42
##
## Accuracy : 0.9823
## 95% CI : (0.9375, 0.9978)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9625
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 1.0000
## Specificity : 0.9718
## Pos Pred Value : 0.9545
## Neg Pred Value : 1.0000
## Prevalence : 0.3717
## Detection Rate : 0.3717
## Detection Prevalence : 0.3894
## Balanced Accuracy : 0.9859
##
## 'Positive' Class : 1
##
# Accuracy
cat("XGBoost's Accuracy:", round(xgb_cm$overall['Accuracy'], 4), "\n")
## XGBoost's Accuracy: 0.9823
This section presents graphical summaries to interpret our regression and classification model results. We visualize predicted vs. actual values for the regression model and plot ROC curves to evaluate classifier performance.
This scatter plot compares the predicted tumor area to the actual tumor area for our test set. The dashed red line represents perfect prediction. Points closer to the line indicate more accurate predictions.
library(ggplot2)
# Create a data frame with actual and predicted values
results <- data.frame(
Actual = test$area_mean,
Predicted = predictions
)
# Plot predicted vs actual
ggplot(results, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.6, color = "blue") +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(
title = "Predicted vs Actual Tumor Area",
x = "Actual area_mean",
y = "Predicted area_mean"
) +
theme_minimal()
We plot the Receiver Operating Characteristic (ROC) curves for each classifier. ROC curves illustrate the trade-off between sensitivity and specificity for different thresholds. The area under the curve (AUC) quantifies overall model performance.
# Logistic Regression ROC curve
logit_roc <- suppressMessages(roc(test_data$diagnosis, logit_prob))
plot(logit_roc, main = "ROC Curve - Logistic Regression", col = "blue")
# Random Forest ROC curve
rf_roc <- suppressMessages(roc(test_data$diagnosis, rf_prob))
plot(rf_roc, main = "ROC Curve - Random Forest", col = "green")
# SVM ROC Curve
numeric_test_data <- as.numeric(as.character(test_data$diagnosis)) # Convert actual labels to numeric (for ROC)
svm_roc <- suppressMessages(roc(numeric_test_data, svm_prob))
plot(svm_roc, col = "purple", main = "ROC Curve - SVM")
# XGBoost ROC Curve
xgb_roc <- suppressMessages(roc(response = test_data$diagnosis,
predictor = xgb_prob,
levels = c("0", "1"),
direction = "<"))
plot(xgb_roc, main = "ROC Curve - XGBoost", col = "red")
This section summarizes and compares the performance of all trained classification models for breast cancer diagnosis. Models evaluated include:
- Logistic Regression
- Random Forest
- Support Vector Machine (SVM)
- XGBoost
Each model is assessed on the test set using key metrics: Accuracy, Sensitivity, Specificity and Area Under the ROC Curve (AUC). We also visualize the results for easier interpretation.
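For reference, these metrics are computed from the confusion-matrix counts (with malignant, class 1, as the positive class):

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \mathrm{Sensitivity}=\frac{TP}{TP+FN},\qquad \mathrm{Specificity}=\frac{TN}{TN+FP}.$$

As a worked check against the Random Forest confusion matrix above (TP = 42, TN = 70, FP = 1, FN = 0): accuracy = 112/113 ≈ 0.9912, sensitivity = 42/42 = 1 and specificity = 70/71 ≈ 0.9859, matching the reported values.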
# Create a data frame to compare model performance
model_comparison <- data.frame(
Model = c("Logistic Regression", "Random Forest", "SVM", "XGBoost"),
Accuracy = c(logit_cm$overall["Accuracy"],
rf_cm$overall["Accuracy"],
svm_cm$overall["Accuracy"],
xgb_cm$overall["Accuracy"]),
Sensitivity = c(logit_cm$byClass["Sensitivity"],
rf_cm$byClass["Sensitivity"],
svm_cm$byClass["Sensitivity"],
xgb_cm$byClass["Sensitivity"]),
Specificity = c(logit_cm$byClass["Specificity"],
rf_cm$byClass["Specificity"],
svm_cm$byClass["Specificity"],
xgb_cm$byClass["Specificity"]),
AUC = c(auc(logit_roc),
auc(rf_roc),
auc(svm_roc),
auc(xgb_roc))
)
# Print the comparison table
print(model_comparison)
## Model Accuracy Sensitivity Specificity AUC
## 1 Logistic Regression 0.9823009 1 0.9718310 0.9993293
## 2 Random Forest 0.9911504 1 0.9859155 0.9993293
## 3 SVM 0.9734513 1 0.9577465 1.0000000
## 4 XGBoost 0.9823009 1 0.9718310 0.9996647
# Visualize model comparison
ggplot(model_comparison, aes(x = Model, y = Accuracy, fill = Model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = round(Accuracy, 3)), vjust = -0.5) +
labs(title = "Model Accuracy Comparison",
y = "Accuracy",
x = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot ROC curves together
plot(logit_roc, col = "blue", main = "ROC Curves Comparison")
plot(rf_roc, col = "green", add = TRUE)
plot(svm_roc, col = "purple", add = TRUE)
plot(xgb_roc, col = "red", add = TRUE)
legend("bottomright",
legend = c(paste("Logistic Regression (AUC =", round(auc(logit_roc), 3), ")"),
paste("Random Forest (AUC =", round(auc(rf_roc), 3), ")"),
paste("SVM (AUC =", round(auc(svm_roc), 3), ")"),
paste("XGBoost (AUC =", round(auc(xgb_roc), 3), ")")),
col = c("blue", "green", "purple", "red"), lwd = 2)
This section summarizes the performance of various regression models used to predict tumor size (area_mean). The models evaluated include:
- Linear Regression (ordinary least squares)
- Ridge Regression (L2 regularization)
- Lasso Regression (L1 regularization)
- Elastic Net Regression (combined L1 and L2 regularization)
Each model’s performance is assessed using Root Mean Squared Error (RMSE) on a test set. Additionally, 10-fold cross-validation was used for Linear Regression to estimate its generalization ability.
| Model | RMSE (Overall) | RMSE (Benign) | RMSE (Malignant) |
|---|---|---|---|
| Linear Regression | 121.41 | — | — |
| Ridge Regression | 138.61 | 106.00 | 177.60 |
| Lasso Regression | 125.33 | — | — |
| Elastic Net | 133.05 | — | — |
| Linear Regression (10-fold CV) | 126.13 | — | — |
Linear Regression yields the lowest RMSE on the test set (121.41), serving as a strong baseline.
Regularized models (Ridge, Lasso, Elastic Net) produce slightly higher RMSE values but help address overfitting.
Among regularized models, Lasso Regression achieves the lowest RMSE (125.33), providing a good balance between prediction accuracy and feature selection.
Ridge Regression is much better at predicting benign tumors (RMSE = 106.00) than malignant tumors (RMSE = 177.60), possibly due to higher variability or class imbalance in malignant cases.
10-fold cross-validation for Linear Regression supports the stability and generalizability of its performance estimate (RMSE = 126.13).
This project developed and evaluated a range of machine learning models to predict tumor malignancy (classification) and tumor size (regression) using the Breast Cancer Wisconsin dataset. Models such as Logistic Regression, Random Forest, SVM and XGBoost demonstrated high predictive power for diagnosing malignancy, with Random Forest achieving the top accuracy (99.12%), closely followed by Logistic Regression and XGBoost (both 98.23%). All models achieved perfect sensitivity (1.000), ensuring detection of malignant tumors, and excellent specificity, especially in Random Forest (98.59%).
Regression models for tumor size (area_mean) were
evaluated with Linear Regression,
Ridge, Lasso and Elastic
Net. Linear Regression gave the lowest test
RMSE (121.41) and a high R-squared
(0.8759). Lasso Regression was a
strong regularized alternative. These findings support both accurate
diagnosis and meaningful size prediction to guide treatment
planning.
Accuracy and Sensitivity: The classification models provided high accuracy, particularly Random Forest and Logistic Regression, with perfect sensitivity. This is crucial as it ensures that malignant tumors are accurately detected, minimizing the risk of false negatives.
Model Comparisons: While Random Forest showed the best accuracy, all models demonstrated reliable performance in detecting malignant tumors, emphasizing the reliability of ensemble models like Random Forest and XGBoost. On the other hand, SVM showed slightly lower performance, which could be improved with parameter tuning.
Tumor Size Prediction: The regression models, particularly Linear Regression with an R-squared of 0.8759, effectively captured tumor size variation, which is a crucial factor for treatment planning.
Model Robustness: All models were assessed with confusion matrices, providing robust classification performance metrics. The Confusion Matrix for Logistic Regression and Random Forest showed exceptional performance with very low false positives, reflecting the model’s high predictive capability.
Improvement Potential: While the models are strong, further improvements could be made with hyperparameter tuning, feature engineering and addressing data imbalance. Specifically, Random Forest and XGBoost can be fine-tuned to enhance performance and reduce overfitting.
Model Interpretability: Models like
Logistic Regression are interpretable, providing
transparency on which features (such as concave.points_mean
and fractal_dimension_mean) impact predictions. In
contrast, models like XGBoost and Random
Forest are harder to interpret, which is a limitation in
clinical applications where interpretability is important.
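A lightweight first step toward interpretability, using only objects already created in this report, is caret's variable-importance summary for the random forest; the sketch below is illustrative (deeper methods such as SHAP or LIME are discussed under Future Work):

# Sketch: rank predictors by importance for the caret-trained random forest (rf_model)
rf_importance <- varImp(rf_model)
print(rf_importance)
plot(rf_importance, top = 10)   # show the ten most influential predictors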
While this project provides valuable insights into breast cancer diagnosis prediction, there are a few limitations that should be considered:
Class Imbalance: The dataset is imbalanced, with a larger proportion of benign cases compared to malignant ones. This imbalance may affect the model’s ability to accurately predict malignant cases, potentially leading to biased predictions. Research indicates that imbalanced data can result in biased predictions, where the model may predict the majority class more accurately, while the minority class (malignant cases) receives less attention (Chawla et al., 2002).
Data Quality: The dataset may contain some noisy or missing values. Although preprocessing was done to handle missing values and irrelevant columns, slight inconsistencies or inaccuracies in the data might still impact model performance. Missing data, when not handled correctly, can lead to biased results and overfitting (Joel et al., 2022).
Feature Selection: The models were built using a set of features chosen based on the dataset, but there might be additional relevant features that could improve the prediction accuracy. The feature selection process was not exhaustive and could benefit from further exploration. Feature engineering is critical, as better or more informative features can significantly improve model performance (Guyon & Elisseeff, 2003).
Model Generalization: While the models performed well during training, there may still be overfitting due to the specific characteristics of the dataset. Cross-validation was applied, but it is still possible that the models might not generalize well to unseen data or different populations. Overfitting is a common issue in machine learning, where models are too complex and fail to generalize well on new data (Kuhn & Johnson, 2013).
Lack of External Validation: The dataset used for model training and testing is from a single source (the Kaggle dataset). Validation on an external dataset or in a clinical setting would be necessary to assess the real-world applicability of the models. External validation ensures that the model’s performance is consistent across different datasets and settings (Riley et al., 2016).
To improve the robustness and effectiveness of the models in predicting breast cancer malignancy, several future steps could be considered:
Address Class Imbalance: Techniques like oversampling, undersampling or using weighted loss functions should be explored to mitigate the effects of class imbalance. Synthetic data generation (e.g., using SMOTE) could also help in generating more malignant samples. These techniques have been shown to significantly improve the performance of classifiers on imbalanced datasets (He & Garcia, 2009).
Additional Feature Engineering: More in-depth feature engineering, such as creating new derived features or incorporating domain knowledge from medical experts, could enhance the models’ ability to discriminate between benign and malignant cases. Feature engineering is crucial, as it can allow models to capture more relevant patterns in the data (Domingos, 2012).
Model Selection and Tuning: Although various models like logistic regression, random forests, SVM and XGBoost were explored, further experimentation with ensemble methods, deep learning or neural networks might yield even better results. Hyperparameter optimization through grid search or random search could also enhance model performance. Ensemble methods have been shown to outperform individual models in various machine learning tasks (Dietterich, 2000).
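A minimal sketch of such a grid search is given below, assuming the caret and randomForest packages are available. It tunes the random forest mtry parameter with 5-fold cross-validation on a copy of the data with the id and empty columns dropped, mirroring the cleaning steps applied earlier.
# Illustrative sketch: grid search over mtry for a random forest with 5-fold CV
library(caret)
set.seed(123)
df_model <- df[, !(names(df) %in% c("id", "X", "...33"))]  # drop id/empty columns if still present
df_model$diagnosis <- as.factor(df_model$diagnosis)
rf_tuned <- train(
  diagnosis ~ .,
  data      = df_model,
  method    = "rf",                                        # fits via the randomForest package
  tuneGrid  = expand.grid(mtry = c(2, 4, 6, 8)),
  trControl = trainControl(method = "cv", number = 5)
)
rf_tuned$bestTune                                          # mtry value with the best cross-validated accuracy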
External Validation and Real-world Testing: Testing the model on an external dataset or in collaboration with medical institutions will help in understanding how well the model performs on real-world clinical data. Collaborative efforts with hospitals could lead to more clinically relevant insights. Real-world validation is necessary to confirm the generalizability of models developed in controlled settings.
Interpretability and Explainability: To increase trust in the models, efforts should be made to improve model interpretability. Techniques like SHAP values or LIME could provide insights into how the models make their predictions and help in clinical decision-making. Explainable AI has gained significant attention in healthcare for providing transparent reasoning for predictions (Caruana et al., 2015).
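As one possible route, the sketch below uses the iml package (our choice for illustration; lime or fastshap would serve a similar purpose) to compute Shapley values for a single patient, reusing the rf_tuned model and df_model data frame from the tuning sketch above.
# Illustrative sketch: local explanation via Shapley values with the iml package
library(iml)
X <- df_model[, setdiff(names(df_model), "diagnosis")]       # predictors only
predictor <- Predictor$new(rf_tuned, data = X, y = df_model$diagnosis)
shap <- Shapley$new(predictor, x.interest = X[1, ])          # explain the prediction for the first patient
plot(shap)                                                   # per-feature contributions to this prediction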
Battineni, G., Chintalapudi, N., & Amenta, F. (2020). Performance analysis of different machine learning algorithms in breast cancer predictions. EAI Endorsed Transactions on Pervasive Health and Technology, 6(e4). https://doi.org/10.4108/eai.28-5-2020.166010
The study explores how machine learning models, including Logistic Regression (LR) and Support Vector Machines (SVM), can be employed for breast cancer diagnosis. These models are evaluated for their predictive power in classifying tumors as benign or malignant, offering critical insights into model selection for our classification task (Battineni et al., 2020).
Caruana, R., Gehrke, J., Koch, P., Sturm, M., & Elhadad, N. (2015). Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1721-1730. https://doi.org/10.1145/2783258.2788613
This paper focuses on making machine learning models interpretable in healthcare, which is essential for clinical applications. The authors used models to predict pneumonia risk and 30-day hospital readmission, demonstrating the importance of explainability in critical healthcare predictions (Caruana et al., 2015).
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357. https://doi.org/10.1613/jair.953
This foundational paper introduces the SMOTE technique for handling class imbalance in machine learning datasets. SMOTE generates synthetic samples of the minority class, improving model performance for imbalanced data, which is crucial for our classification task in breast cancer prediction (Chawla et al., 2002).
Chtouki, K., Rhanoui, M., Mikram, M., Amazian, K., & Yousfi, S. (2023). Supervised machine learning for breast cancer risk factors analysis and survival prediction. arXiv. https://doi.org/10.48550/arXiv.2304.07299
The authors demonstrate that machine learning algorithms, including Decision Trees, Random Forest, and SVM, can effectively predict the survival of breast cancer patients. This directly supports the regression task in our project to predict tumor size and understand growth patterns based on clinical data (Chtouki et al., 2023).
Dietterich, T. G. (2000). Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems (pp. 1-15). Springer. https://doi.org/10.1007/3-540-45014-9_1
This paper provides a comprehensive discussion on ensemble learning, particularly the advantages of combining multiple models to improve prediction accuracy, an approach used in our Random Forest and XGBoost models (Dietterich, 2000).
Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78-87. https://doi.org/10.1145/2347736.2347755
This article provides essential insights into key concepts in machine learning, emphasizing the importance of model selection and feature engineering. The concepts covered in this paper were directly applied in our feature selection and model comparison steps (Domingos, 2012).
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182. https://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf
This paper outlines essential methods for feature selection, which is a critical step in reducing dimensionality and improving model performance. The authors highlight several techniques, including recursive feature elimination, which was useful for our feature selection process (Guyon & Elisseeff, 2003).
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284. https://doi.org/10.1109/TKDE.2008.239
This paper explores methods for learning from imbalanced datasets, where one class is significantly more prevalent than the other. Techniques such as SMOTE and cost-sensitive learning are discussed, both of which are highly relevant to our classification models, especially with the class imbalance in breast cancer datasets (He & Garcia, 2009).
Islam, M., & Poly, T. N. (2019). Machine learning models of breast cancer risk prediction. bioRxiv. https://doi.org/10.1101/723304
In this paper, the authors explore multiple machine learning techniques, such as decision trees and KNN, and compare their effectiveness in predicting breast cancer. The study reveals that KNN provides high accuracy, suggesting it as a viable candidate for the classification task in our project (Islam & Poly, 2019).
Joel, L. O., Doorsamy, W., & Paul, B. S. (2022). A review of missing data handling techniques for machine learning. International Journal of Innovative Technology & Interdisciplinary Sciences, 5(3), 971-1005. https://doi.org/10.15157/IJITIS.2022.5.3.971-1005
This paper reviews techniques for handling missing data in machine learning models, a crucial preprocessing step. Although we have handled missing values in the dataset, this reference provided insights into advanced techniques for future data processing (Joel et al., 2022).
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. Springer. https://doi.org/10.1007/978-1-4614-6849-3
This book provides comprehensive coverage of the process of building predictive models. It discusses practical aspects of model fitting, evaluation, and performance metrics, which were foundational in our model development and evaluation phases (Kuhn & Johnson, 2013).
Moturi, S., Rao, S., & Vemuru, S. (2021). Risk prediction-based breast cancer diagnosis using personal health records and machine learning models. Springer. https://doi.org/10.1007/978-981-15-9516-5_37
This research examines the use of various machine learning models like Random Forest and SVM for classifying breast cancer as benign or malignant. The findings emphasize how data preprocessing and feature selection improve the accuracy of predictions, aligning with our data cleaning and feature selection steps (Moturi et al., 2021).
Riley, R. D., Ensor, J., Snell, K. I., Debray, T. P., Altman, D. G., Moons, K. G., & Collins, G. S. (2016). External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: Opportunities and challenges. BMJ, 353, i3140. https://doi.org/10.1136/bmj.i3140
This paper discusses the challenges and opportunities of external validation for clinical prediction models. It emphasizes the importance of testing models on different datasets, which is critical for our models to generalize beyond the current dataset (Riley et al., 2016).
Sudarsa, S., & Reddy, K. (2024). Systematic review on breast cancer prediction and classification using machine learning and deep learning methods. In 2024 8th International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC). https://doi.org/10.1109/I-SMAC61858.2024.10714683
This review highlights various machine learning and deep learning methods for breast cancer classification and prediction, offering a comprehensive overview of methodologies that could inform our model selection and evaluation processes, especially for improving diagnostic accuracy (Sudarsa & Reddy, 2024).