| Matric Number | Full Name |
|---|---|
| S2034194 | Phoon Hao Xian |
| 23096526 | Nur Ariana Sofea binti Badrul Hisham |
| S2033073 | Vijaykumar Kartha Ramchandran |
The code first ensures that the readr package is installed, which is used to read the CSV file efficiently into R. If readr is already installed, it skips this step.
# Install required packages if not already installed
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
Once readr is installed, it’s loaded into the R environment using library(readr) to make the read_csv() function available for reading the CSV file.
# Load necessary libraries
library(readr)
The read_csv() function is used to read the file ‘breast_cancer_data.csv’ into a data frame called df. This function automatically detects the data types for each column based on the content.
# Now, read the file into R
df <- read_csv("breast_cancer_data.csv")
## New names:
## • `` -> `...33`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 568 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): diagnosis
## dbl (31): id, radius_mean, texture_mean, perimeter_mean, area_mean, smoothne...
## lgl (1): ...33
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data cleaning is a critical step in the data analysis process. The raw dataset often contains errors, inconsistencies and irrelevant information. In this project, data cleaning involves handling missing values, removing duplicate entries, converting variables to appropriate formats and ensuring that the dataset is ready for further analysis. Proper data cleaning enhances the accuracy and reliability of the results obtained from data analysis and model building.
In our breast cancer dataset, we have clinical attributes that need to be processed before they can be used for classification and regression tasks. Specifically, we will check for missing values, drop the empty X column, convert the diagnosis column to a factor, check for and remove duplicate rows, and save the cleaned dataset for later use.
The first step in data cleaning is to load the necessary libraries for data manipulation and visualization. In this case, we will use dplyr for data manipulation tasks and ggplot2 for visualizations.
# Load necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
By loading these libraries, we gain access to the functions and methods that allow us to perform operations like filtering, selecting and transforming the dataset. dplyr is a popular R package for efficient data manipulation, and ggplot2 helps in visualizing the cleaned dataset.
The next step is to load the ‘breast_cancer_data.csv’ dataset into R. We will use the read.csv() function to read the dataset into a data frame.
# Load the renamed CSV file
df <- read.csv("breast_cancer_data.csv", stringsAsFactors = FALSE)
# Preview the data
head(df)
# Get the dimensions of the dataset
dim(df) # Returns the number of rows and columns in the dataset
## [1] 569 33
This step loads the dataset into the df variable. The head() function is used to preview the first six rows of the data to understand its structure before any processing is done. By setting stringsAsFactors = FALSE, we ensure that categorical variables are not automatically converted to factors (which can interfere with analysis).
Now, we will check the structure of the dataset using the str() function. This allows us to see the data types of each column.
str(df)
## 'data.frame': 569 obs. of 33 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
## $ X : logi NA NA NA NA NA NA ...
The str() function provides an overview of the dataset, including the number of observations (rows) and variables (columns), as well as the data type of each column. This is useful for identifying potential issues, such as columns that should be factors but are currently stored as characters.
We use the summary() function to get a statistical summary of the dataset. This function will provide basic descriptive statistics for each numeric column.
summary(df)
## id diagnosis radius_mean texture_mean
## Min. : 8670 Length:569 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 Class :character 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Mode :character Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.06154 Median :0.03350 Median :0.1792 Median :0.06154
## Mean :0.08880 Mean :0.04892 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.42680 Max. :0.20120 Max. :0.3040 Max. :0.09744
## radius_se texture_se perimeter_se area_se
## Min. :0.1115 Min. :0.3602 Min. : 0.757 Min. : 6.802
## 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850
## Median :0.3242 Median :1.1080 Median : 2.287 Median : 24.530
## Mean :0.4052 Mean :1.2169 Mean : 2.866 Mean : 40.337
## 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190
## Max. :2.8730 Max. :4.8850 Max. :21.980 Max. :542.200
## smoothness_se compactness_se concavity_se concave.points_se
## Min. :0.001713 Min. :0.002252 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509 1st Qu.:0.007638
## Median :0.006380 Median :0.020450 Median :0.02589 Median :0.010930
## Mean :0.007041 Mean :0.025478 Mean :0.03189 Mean :0.011796
## 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205 3rd Qu.:0.014710
## Max. :0.031130 Max. :0.135400 Max. :0.39600 Max. :0.052790
## symmetry_se fractal_dimension_se radius_worst texture_worst
## Min. :0.007882 Min. :0.0008948 Min. : 7.93 Min. :12.02
## 1st Qu.:0.015160 1st Qu.:0.0022480 1st Qu.:13.01 1st Qu.:21.08
## Median :0.018730 Median :0.0031870 Median :14.97 Median :25.41
## Mean :0.020542 Mean :0.0037949 Mean :16.27 Mean :25.68
## 3rd Qu.:0.023480 3rd Qu.:0.0045580 3rd Qu.:18.79 3rd Qu.:29.72
## Max. :0.078950 Max. :0.0298400 Max. :36.04 Max. :49.54
## perimeter_worst area_worst smoothness_worst compactness_worst
## Min. : 50.41 Min. : 185.2 Min. :0.07117 Min. :0.02729
## 1st Qu.: 84.11 1st Qu.: 515.3 1st Qu.:0.11660 1st Qu.:0.14720
## Median : 97.66 Median : 686.5 Median :0.13130 Median :0.21190
## Mean :107.26 Mean : 880.6 Mean :0.13237 Mean :0.25427
## 3rd Qu.:125.40 3rd Qu.:1084.0 3rd Qu.:0.14600 3rd Qu.:0.33910
## Max. :251.20 Max. :4254.0 Max. :0.22260 Max. :1.05800
## concavity_worst concave.points_worst symmetry_worst fractal_dimension_worst
## Min. :0.0000 Min. :0.00000 Min. :0.1565 Min. :0.05504
## 1st Qu.:0.1145 1st Qu.:0.06493 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2267 Median :0.09993 Median :0.2822 Median :0.08004
## Mean :0.2722 Mean :0.11461 Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3829 3rd Qu.:0.16140 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :1.2520 Max. :0.29100 Max. :0.6638 Max. :0.20750
## X
## Mode:logical
## NA's:569
##
##
##
##
The summary() function gives us a quick overview of the central tendencies (mean, median), range (min, max) and the spread (quartiles) for numeric columns. This step helps us understand the distribution of each attribute and detect potential outliers or data quality issues.
Next, we check for missing values across the dataset. This will help us identify if any columns contain NA (missing) values that need to be addressed.
colSums(is.na(df))
## id diagnosis radius_mean
## 0 0 0
## texture_mean perimeter_mean area_mean
## 0 0 0
## smoothness_mean compactness_mean concavity_mean
## 0 0 0
## concave.points_mean symmetry_mean fractal_dimension_mean
## 0 0 0
## radius_se texture_se perimeter_se
## 0 0 0
## area_se smoothness_se compactness_se
## 0 0 0
## concavity_se concave.points_se symmetry_se
## 0 0 0
## fractal_dimension_se radius_worst texture_worst
## 0 0 0
## perimeter_worst area_worst smoothness_worst
## 0 0 0
## compactness_worst concavity_worst concave.points_worst
## 0 0 0
## symmetry_worst fractal_dimension_worst X
## 0 0 569
The is.na() function checks for missing values in the dataset. colSums() will sum the missing values in each column. This step is crucial because missing data can impact the performance of machine learning models, so we need to decide whether to remove or impute the missing values.
We will remove the ‘X’ column, which is empty and not useful for our analysis.
df <- df %>% select(-X)
This step uses dplyr’s select() function to remove the irrelevant column. Removing unnecessary columns ensures that our dataset only contains relevant variables for analysis, reducing the complexity of the data.
Since the diagnosis column contains categorical values (Benign and Malignant), we will convert it to a factor variable.
df$diagnosis <- as.factor(df$diagnosis)
str(df$diagnosis)
## Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
table(df$diagnosis)
##
## B M
## 357 212
Converting the diagnosis column to a factor ensures that R treats it as a categorical variable. The str() function confirms the data type, and table() shows the count of each category (Benign vs. Malignant). This is important for classification tasks.
We check for duplicate rows in the dataset to ensure data integrity.
sum(duplicated(df))
## [1] 0
The duplicated() function checks for duplicate rows. If duplicates are found, they can introduce bias or overfitting in machine learning models, so they need to be removed.
If any duplicate rows exist, we will remove them to ensure that each observation is unique.
df <- df[!duplicated(df), ]
sum(duplicated(df))
## [1] 0
This step removes duplicate rows using the duplicated() function, and then rechecks for any remaining duplicates. Removing duplicates ensures that the dataset only contains unique observations.
Once the dataset is cleaned and preprocessed, we save it as a new CSV file. This allows us to retain a copy of the cleaned data for future use and analysis, which can be helpful for reproducibility or sharing the dataset with others.
By using a relative file path (“cleaned_breast_cancer_data.csv”), we ensure that the file will be saved in the current working directory, making it portable across different environments.
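The chunk that writes the file is not shown above; a minimal sketch using base R (assuming the cleaned data frame is still called df) would be:

# Save the cleaned data frame to the current working directory
# (row.names = FALSE avoids writing an extra index column)
write.csv(df, "cleaned_breast_cancer_data.csv", row.names = FALSE)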
Exploratory Data Analysis (EDA) is an essential step in data analysis that helps us uncover patterns, relationships and trends within the dataset. In the context of our project, EDA will allow us to understand the distribution of key variables and their relationships to each other. This step also involves visualizing the data to detect any potential anomalies or outliers that might affect the results.
In this section, we will visualize the class balance of the diagnosis variable, compare key features (radius_mean, texture_mean, smoothness_mean) across diagnosis groups using boxplots and histograms, examine pairwise relationships with scatter and pair plots, and inspect the correlation structure of the numeric features.
Diagnosis Counts
We begin with a bar plot to visualize the distribution of
diagnoses (Benign vs. Malignant).
# Bar plot of diagnosis counts
ggplot(df, aes(x = diagnosis)) +
geom_bar(fill = c("lightblue", "salmon")) +
labs(title = "Count of Diagnosis (Benign vs Malignant)", x = "Diagnosis", y = "Count") +
theme_minimal()
Explanation for the Output of Step 1 (Bar Plot of Diagnosis Counts)
The bar plot visualizes the distribution of the diagnosis variable, which indicates whether a tumor is Benign ('B') or Malignant ('M'). In this case, the dataset contains 357 benign and 212 malignant tumors (roughly 63% vs. 37%).
Conclusion:
The dataset is imbalanced with more benign tumors
compared to malignant ones. This imbalance is important because it may
affect how we train machine learning models, especially for
classification tasks. We may need to consider techniques like
oversampling, undersampling or using
weighted models to account for this imbalance during model training.
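Should we later decide to rebalance the classes, one simple option is random oversampling of the minority class with caret's upSample(). The sketch below is illustrative only (it assumes the train_data split created in the classification section later in this report) and is not applied here:

# Illustrative only: randomly oversample malignant cases so both classes have equal counts
library(caret)
balanced_train <- upSample(x = train_data[, setdiff(names(train_data), "diagnosis")],
                           y = train_data$diagnosis, yname = "diagnosis")
table(balanced_train$diagnosis)   # both classes now have the same number of rows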
radius_mean
We visualize the distribution of radius_mean across
diagnosis categories using a boxplot.
# Boxplot for radius_mean
ggplot(df, aes(x = diagnosis, y = radius_mean, fill = diagnosis)) +
geom_boxplot() +
labs(title = "Distribution of Radius Mean by Diagnosis", x = "Diagnosis", y = "Radius Mean") +
theme_minimal()
Explanation for the Output of Step 2 (Boxplot for
radius_mean by Diagnosis)
The boxplot shown here visualizes the distribution
of the radius_mean variable across the two diagnosis
categories: Benign (B) and Malignant
(M).
Key Observations from the Boxplot:
- The distribution of radius_mean for benign tumors is relatively narrow, with a median around 12.8.
- The radius_mean for malignant tumors is generally higher, with a median around 17.3.

Conclusion:
The radius_mean variable appears to be higher for malignant
tumors compared to benign ones. The distribution of
malignant tumor sizes is more spread out, and it has a
higher median compared to benign tumors. This
difference could be an important feature in classification
models for predicting whether a tumor is benign or
malignant.
texture_mean
A similar boxplot for texture_mean helps compare its
distribution across diagnosis categories.
# Boxplot for texture_mean
ggplot(df, aes(x = diagnosis, y = texture_mean, fill = diagnosis)) +
geom_boxplot() +
labs(title = "Distribution of Texture Mean by Diagnosis", x = "Diagnosis", y = "Texture Mean") +
theme_minimal()
Explanation for the Output of Step 3 (Boxplot for
texture_mean by Diagnosis)
The boxplot shown here visualizes the distribution
of the texture_mean variable across the two diagnosis
categories: Benign (B) and Malignant
(M).
Key Observations from the Boxplot:
- The texture_mean in benign tumors mostly falls between approximately 15 and 20.
- The texture_mean for malignant tumors ranges from approximately 19 to 24.

Conclusion:
The texture_mean variable exhibits a higher
median for malignant tumors compared to benign tumors.
Additionally, malignant tumors tend to have a wider
spread of texture values, indicating more variability
in their texture compared to benign tumors.
radius_mean
We plot a histogram to examine the distribution of
radius_mean.
# Histogram for radius_mean distribution by diagnosis
ggplot(df, aes(x = radius_mean, fill = diagnosis)) +
geom_histogram(position = "dodge", bins = 30) +
labs(title = "Distribution of Radius Mean by Diagnosis", x = "Radius Mean", y = "Count") +
theme_minimal()
Explanation for the Output of Step 4 (Histogram for
radius_mean by Diagnosis)
The histogram shown here visualizes the distribution
of the radius_mean variable, which represents the mean of
the tumor’s radius. This variable is important as it gives us an idea
about the overall size of the tumor.
Key Observations from the Histogram:
- Benign tumors mostly have radius_mean values in the range of 6 to 18. The histogram shows a right-skewed distribution, indicating that the majority of benign tumors are smaller in size.
- The benign counts peak around radius_mean = 10, which suggests that many benign tumors have a radius close to this value.
- The number of benign tumors falls off as radius_mean increases above 12.
- Malignant tumors cover a wider range of radius_mean values; however, there is a greater concentration in the range between 13 and 21.
- There is more spread in radius_mean for malignant tumors, with a long tail on the right side of the distribution. This indicates that while most malignant tumors have a radius_mean similar to benign tumors, a few malignant tumors are significantly larger.

Conclusion:
The radius_mean variable exhibits a different distribution
for Benign and Malignant tumors. While
both types of tumors show some overlap in their radius_mean
values (especially between 12 and 18), malignant tumors
seem to exhibit wider variation and a tendency towards
larger tumor sizes. The histogram also reveals that the
majority of Benign tumors are smaller in size, whereas
Malignant tumors exhibit more variability and larger
values. This information is valuable for building a
classification model to distinguish between Benign and
Malignant tumors based on their radius_mean.
radius_mean vs. texture_mean
We create a scatter plot to visualize the relationship between
radius_mean and texture_mean.
# Scatter plot of radius_mean vs. texture_mean, colored by diagnosis
ggplot(df, aes(x = radius_mean, y = texture_mean, color = diagnosis)) +
geom_point() +
labs(title = "Radius Mean vs. Texture Mean by Diagnosis", x = "Radius Mean", y = "Texture Mean") +
theme_minimal()
Explanation for the Output of Step 5 (Scatter Plot of
radius_mean vs. texture_mean by
Diagnosis)
The scatter plot shown here visualizes the
relationship between two features of the dataset:
radius_mean and texture_mean. These two
features are critical in identifying characteristics of the tumors in
the dataset.
Key Observations from the Scatter Plot:
- Benign tumors (red points) cluster toward the lower end of the radius_mean axis, with radius_mean values mostly between 8 and 16.
- The texture_mean values for Benign tumors range from about 10 to 28. The red points form a relatively dense cloud, indicating that most Benign tumors have smaller radii and moderate texture values.
- Malignant tumors (blue points) spread toward higher radius_mean and texture_mean values, indicating that Malignant tumors tend to be larger and have more varied textures.

Conclusion: The scatter plot indicates that there is
a potential relationship between the
radius_mean and texture_mean of the tumors.
Malignant tumors tend to have larger
radii and more varied textures, while
Benign tumors show smaller radii and
more consistent textures. This distinction is helpful
in identifying the type of tumor and may assist in building a
classification model for tumor diagnosis based on these
features.
smoothness_mean by Diagnosis
We use a boxplot to visualize the distribution of
smoothness_mean for each diagnosis.
# Boxplot for smoothness_mean by diagnosis
ggplot(df, aes(x = diagnosis, y = smoothness_mean, fill = diagnosis)) +
geom_boxplot() +
labs(title = "Smoothness Mean by Diagnosis", x = "Diagnosis", y = "Smoothness Mean") +
theme_minimal()
Explanation for the Output of Step 6 (Boxplot for
smoothness_mean by Diagnosis)
The boxplot displayed here visualizes the
distribution of the smoothness_mean feature, segmented by
the diagnosis of the tumor (Benign vs. Malignant). The
smoothness_mean measures the smoothness of the tumor’s
surface, which can be indicative of the tumor’s texture and
consistency.
Key Observations from the Boxplot:
- Benign tumors have a lower median smoothness_mean (around 0.090), as shown by the red box.
- Malignant tumors show a slightly higher median and a wider spread of smoothness_mean.

Conclusion: The boxplot reveals that Benign tumors tend to have a more consistent smoothness with a lower median value, whereas Malignant tumors exhibit more variability in smoothness and have a slightly higher median. These observations can be useful in distinguishing between Benign and Malignant tumors based on smoothness, providing a useful feature for classification models.
We compute and visualize the correlation matrix to understand the relationships between numeric features.
numeric_cols <- sapply(df, is.numeric)
cor_matrix <- cor(df[, numeric_cols], use = "complete.obs")
library(corrplot)
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.7)
Explanation for the Output of Step 7 (Correlation Matrix)
The correlation matrix displayed here visualizes the relationships between numeric features in the dataset, particularly focusing on the correlation between various tumor-related measurements. The matrix is rendered as a heatmap in which the color intensity indicates the strength of the relationship between pairs of variables. The color scale on the right side of the plot indicates the strength and direction of the correlation:
- Blue represents positive correlations, where both variables increase or decrease together.
- Red represents negative correlations, where as one variable increases, the other decreases.
- White/light colors indicate a weak or no correlation between variables.
Key Observations from the Correlation Matrix:
- radius_mean and perimeter_mean: These two features show a very strong positive correlation (close to 1), meaning that as the radius increases, the perimeter also increases. This is expected because a larger radius would typically result in a larger perimeter.
- area_mean and perimeter_mean: The correlation between these two features is also strong, as a larger area tends to correlate with a larger perimeter.
- smoothness_mean and symmetry_mean: These features also show a positive correlation (almost 0.7), which suggests that smoother tumors tend to be more symmetrical.
- radius_mean and fractal_dimension_mean show only a weak correlation. The fractal dimension measures the irregularity or complexity of the tumor, which doesn't necessarily increase or decrease with the radius.

Conclusion: The correlation matrix is a useful tool
for identifying relationships between variables. It helps us identify
redundant features (e.g., radius_mean,
perimeter_mean and area_mean are highly
correlated) and suggests which features might provide complementary
information when building predictive models. The matrix also indicates
which variables might be combined or excluded for model building based
on their correlations.
We use a pair plot to visualize the relationships between selected
features (e.g., radius_mean,
texture_mean).
library(GGally)
features_subset <- c("diagnosis", "radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean")
ggpairs(df[, features_subset], mapping = aes(color = diagnosis), title = "Pairplot of Selected Features")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Explanation for the Output of Step 8 (Pair Plot of Selected Features)
The pair plot is a powerful visualization tool for exploring relationships between multiple features in the dataset. It provides a matrix of scatter plots, histograms and correlation coefficients to help identify patterns, trends and possible relationships between pairs of variables.
Key Observations from the Pair Plot:
- radius_mean and perimeter_mean show distinct distributions for Benign and Malignant tumors, with Malignant tumors (denoted by blue) having higher values.
- texture_mean and area_mean also exhibit different distributions between Benign and Malignant cases, which suggests these features can help differentiate the two classes.
- There is a strong positive relationship between radius_mean and perimeter_mean (as expected, since a larger radius leads to a larger perimeter); the correlation value displayed in the plot (0.998) confirms this.
- The relationship between radius_mean and texture_mean shows a weak positive correlation (0.324), suggesting that these two features do not have a strong linear relationship.
- The relationship between area_mean and perimeter_mean also demonstrates a strong positive correlation (0.987), similar to the relationship between radius_mean and perimeter_mean.

Correlation coefficients reported in the plot:
- radius_mean vs. perimeter_mean: 0.998 (very strong correlation, indicating that as radius increases, perimeter also increases).
- perimeter_mean vs. area_mean: 0.987 (strong positive correlation, as expected for geometric properties of tumors).
- radius_mean vs. texture_mean: 0.324 (weak positive correlation).
- area_mean vs. smoothness_mean: 0.177 (weak positive correlation).
- texture_mean vs. smoothness_mean: -0.023 (essentially no linear correlation).

Conclusion: The pair plot visually confirms the relationships between various features in the dataset. Some features like radius_mean, perimeter_mean and area_mean are strongly correlated, which can inform our feature selection and model-building process. On the other hand, features with weak correlations, like radius_mean and texture_mean, may offer additional unique information for classification or regression models.
Based on exploratory data analysis (EDA) and the cleaned breast cancer dataset, we will train a machine learning model to predict the size of a tumor (measured by area_mean) using various clinical measurements. Since we already have labeled data with tumor size and diagnosis (benign or malignant), supervised learning is the appropriate choice.
We begin with Linear Regression, which is suitable for predicting continuous numerical variables using numerical input features. The objective is to find the best-fit linear relationship between the input features and the target variable.
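Concretely, with predictors x_1, ..., x_p, the fitted model has the form

$$\hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,$$

and lm() chooses the coefficients by ordinary least squares, i.e. by minimizing the residual sum of squares $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ over the training observations.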
The original dataset includes many clinical features. However, for
our model, we remove those that are directly correlated with or
derivable from the target variable (area_mean), to prevent
data leakage. Specifically, we exclude features like
radius_*, perimeter_*, area_se,
and area_worst.
Target Variable:
- area_mean (tumor size)

Excluded Features:
- radius_mean, radius_se, radius_worst
- perimeter_mean, perimeter_se, perimeter_worst
- area_se, area_worst

Selected Input Features:
- texture_mean, texture_se, texture_worst
- the smoothness_, compactness_, concavity_, concave.points_, symmetry_ and fractal_dimension_ variables (mean, se and worst variants of each)

In this section, we build a linear regression model to predict the
average tumor area (area_mean) using several clinical
features from the breast cancer dataset. We first load the data and
inspect its structure, then carefully select input features to avoid
data leakage by excluding variables directly related to area, radius, or
perimeter. We then fit a linear model with the selected predictors and
display the summary of model coefficients and overall performance.
The steps, numbered as comments in the code below, include loading the required packages, reading the cleaned dataset, selecting the input features, combining them with the target variable, fitting the linear model and inspecting its summary.
# 1. Load required packages for data manipulation and analysis
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# 2. Read in the cleaned breast cancer dataset
data <- read.csv("cleaned_breast_cancer_data.csv")
# 3. Display all column names in the dataset
colnames(data)
## [1] "id" "diagnosis"
## [3] "radius_mean" "texture_mean"
## [5] "perimeter_mean" "area_mean"
## [7] "smoothness_mean" "compactness_mean"
## [9] "concavity_mean" "concave.points_mean"
## [11] "symmetry_mean" "fractal_dimension_mean"
## [13] "radius_se" "texture_se"
## [15] "perimeter_se" "area_se"
## [17] "smoothness_se" "compactness_se"
## [19] "concavity_se" "concave.points_se"
## [21] "symmetry_se" "fractal_dimension_se"
## [23] "radius_worst" "texture_worst"
## [25] "perimeter_worst" "area_worst"
## [27] "smoothness_worst" "compactness_worst"
## [29] "concavity_worst" "concave.points_worst"
## [31] "symmetry_worst" "fractal_dimension_worst"
# 4. Select input features (clinical variables to use as predictors)
input_features <- data %>%
select(
texture_mean, texture_se, texture_worst,
smoothness_mean, smoothness_se, smoothness_worst,
compactness_mean, compactness_se, compactness_worst,
concavity_mean, concavity_se, concavity_worst,
concave.points_mean, concave.points_se, concave.points_worst,
symmetry_mean, symmetry_se, symmetry_worst,
fractal_dimension_mean, fractal_dimension_se, fractal_dimension_worst
)
# 5. Create a new data frame that combines the target variable (area_mean) with selected predictors
model_data <- cbind(area_mean = data$area_mean, input_features)
# 6. Fit a linear regression model to predict area_mean using the selected clinical features
model <- lm(area_mean ~ ., data = model_data)
# 7. Display a summary of the model, including coefficients and overall fit statistics
summary(model)
##
## Call:
## lm(formula = area_mean ~ ., data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -344.16 -79.47 -4.35 60.31 737.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1899.863 94.586 20.086 < 2e-16 ***
## texture_mean 4.464 3.977 1.123 0.2621
## texture_se -27.704 18.233 -1.519 0.1292
## texture_worst -1.209 3.451 -0.350 0.7263
## smoothness_mean -2081.443 988.983 -2.105 0.0358 *
## smoothness_se 1178.960 3316.281 0.356 0.7223
## smoothness_worst -651.244 719.240 -0.905 0.3656
## compactness_mean 858.997 504.839 1.702 0.0894 .
## compactness_se -713.588 1089.052 -0.655 0.5126
## compactness_worst -110.845 188.287 -0.589 0.5563
## concavity_mean 181.210 488.417 0.371 0.7108
## concavity_se 1029.424 633.448 1.625 0.1047
## concavity_worst -266.272 133.468 -1.995 0.0465 *
## concave.points_mean 8265.114 905.487 9.128 < 2e-16 ***
## concave.points_se -12134.831 2448.102 -4.957 9.57e-07 ***
## concave.points_worst 870.287 445.687 1.953 0.0514 .
## symmetry_mean -493.813 368.524 -1.340 0.1808
## symmetry_se 1980.721 1328.944 1.490 0.1367
## symmetry_worst -359.753 242.540 -1.483 0.1386
## fractal_dimension_mean -21533.110 2318.737 -9.287 < 2e-16 ***
## fractal_dimension_se 14127.010 5627.863 2.510 0.0124 *
## fractal_dimension_worst 991.675 1193.376 0.831 0.4063
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 121.1 on 547 degrees of freedom
## Multiple R-squared: 0.886, Adjusted R-squared: 0.8816
## F-statistic: 202.5 on 21 and 547 DF, p-value: < 2.2e-16
We now assess the predictive performance of our linear regression model. To do this, we randomly split the dataset into training (80%) and test (20%) sets. The model is trained on the training set and then used to predict tumor size on the test set. We use Root Mean Squared Error (RMSE) as our evaluation metric. RMSE provides an interpretable measure of the average prediction error (in the same units as the target variable), with lower values indicating better model performance.
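For reference, RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$

which is exactly what the sqrt(mean(...)) line in the code below computes.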
# Split into training and test sets
set.seed(123)
index <- sample(1:nrow(model_data), 0.8 * nrow(model_data))
train <- model_data[index, ]
test <- model_data[-index, ]
# Train model
model_train <- lm(area_mean ~ ., data = train)
# Predict on test data
predictions <- predict(model_train, newdata = test)
# Calculate RMSE
rmse <- sqrt(mean((test$area_mean - predictions)^2))
print(paste("RMSE:", round(rmse, 2)))
## [1] "RMSE: 121.41"
We now apply Ridge and Lasso regression, which help reduce overfitting by penalizing large coefficients. Ridge regression applies L2 regularization, which shrinks coefficients toward zero without setting them exactly to zero. Lasso regression uses L1 regularization, which can shrink some coefficients all the way to zero, effectively performing feature selection.
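For a Gaussian response, glmnet fits all three of these models by minimizing the same penalized least-squares objective,

$$\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right],$$

where alpha = 0 gives Ridge, alpha = 1 gives Lasso and intermediate values (we use alpha = 0.5 below) give the Elastic Net; cv.glmnet() selects lambda by cross-validation.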
We reload and preprocess the data, selecting and scaling input features to ensure fair evaluation across models.
#install.packages("glmnet")
#install.packages("caret") # for splitting
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Load the data
data <- read.csv("cleaned_breast_cancer_data.csv")
# Select only input features (excluding radius, area, perimeter)
features <- data %>%
select(
texture_mean, texture_se, texture_worst,
smoothness_mean, smoothness_se, smoothness_worst,
compactness_mean, compactness_se, compactness_worst,
concavity_mean, concavity_se, concavity_worst,
concave.points_mean, concave.points_se, concave.points_worst,
symmetry_mean, symmetry_se, symmetry_worst,
fractal_dimension_mean, fractal_dimension_se, fractal_dimension_worst
)
# Scale the features
scaled_features <- scale(features)
# Prepare input matrix and target
X <- as.matrix(scaled_features)
y <- data$area_mean
# Split data into training and test set
set.seed(123)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[trainIndex, ]
X_test <- X[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]
We fit a Ridge regression model and evaluate its RMSE on the test set.
ridge_model <- cv.glmnet(X_train, y_train, alpha = 0) # alpha = 0 for Ridge
plot(ridge_model)
# Best lambda
ridge_lambda <- ridge_model$lambda.min
# Predict and evaluate
ridge_preds <- predict(ridge_model, s = ridge_lambda, newx = X_test)
ridge_rmse <- sqrt(mean((y_test - ridge_preds)^2))
cat("Ridge RMSE:", round(ridge_rmse, 2), "\n")
## Ridge RMSE: 138.61
We fit a Lasso regression model and report the test RMSE.
lasso_model <- cv.glmnet(X_train, y_train, alpha = 1) # alpha = 1 for Lasso
plot(lasso_model)
# Best lambda
lasso_lambda <- lasso_model$lambda.min
# Predict and evaluate
lasso_preds <- predict(lasso_model, s = lasso_lambda, newx = X_test)
lasso_rmse <- sqrt(mean((y_test - lasso_preds)^2))
cat("Lasso RMSE:", round(lasso_rmse, 2), "\n")
## Lasso RMSE: 125.33
We fit an Elastic Net model, which blends Ridge and Lasso penalties and report the RMSE.
enet_model <- cv.glmnet(X_train, y_train, alpha = 0.5)
enet_lambda <- enet_model$lambda.min
enet_preds <- predict(enet_model, s = enet_lambda, newx = X_test)
enet_rmse <- sqrt(mean((y_test - enet_preds)^2))
cat("Elastic Net RMSE:", round(enet_rmse, 2), "\n")
## Elastic Net RMSE: 133.05
We evaluate the Ridge model separately on benign and malignant tumors to check for differences in model accuracy by diagnosis class.
data$diagnosis <- ifelse(data$diagnosis == "M", 1, 0)
# Reconstruct test_data
test_data <- data[-trainIndex, ]
# Predict on test set
preds <- as.vector(predict(ridge_model, s = ridge_lambda, newx = X_test))
# Compute RMSE for benign (diagnosis == 0)
rmse_benign <- sqrt(mean((preds[test_data$diagnosis == 0] - y_test[test_data$diagnosis == 0])^2))
# Compute RMSE for malignant (diagnosis == 1)
rmse_malignant <- sqrt(mean((preds[test_data$diagnosis == 1] - y_test[test_data$diagnosis == 1])^2))
# Print
cat("RMSE (Benign):", round(rmse_benign, 2), "\n")
## RMSE (Benign): 106
cat("RMSE (Malignant):", round(rmse_malignant, 2), "\n")
## RMSE (Malignant): 177.6
The model performs better on benign cases than on malignant ones, likely due to class imbalance and higher variability in malignant tumors. Additional steps like upsampling, class-weighted loss, or separate modeling may help improve results.
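As a sketch of the "separate modeling" idea (illustrative only, reusing the objects created above; we do not pursue it further here), one Ridge model could be fitted per diagnosis class:

# Illustrative sketch: fit one Ridge model per diagnosis class
# (data$diagnosis is 0/1 at this point; X_train, y_train, trainIndex come from above)
diag_train <- data$diagnosis[trainIndex]
ridge_benign <- cv.glmnet(X_train[diag_train == 0, ], y_train[diag_train == 0], alpha = 0)
ridge_malignant <- cv.glmnet(X_train[diag_train == 1, ], y_train[diag_train == 1], alpha = 0)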
To obtain a more robust performance estimate, we use 10-fold cross-validation on the linear regression model.
# Install caret if needed
#install.packages("caret")
library(caret)
# Create model_data again (if not already in environment)
model_data <- cbind(area_mean = data$area_mean, input_features)
# Set up 10-fold cross-validation
set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
# Train linear regression model using caret with cross-validation
cv_model <- train(
area_mean ~ .,
data = model_data,
method = "lm",
trControl = train_control,
metric = "RMSE"
)
# View cross-validated performance
print(cv_model)
## Linear Regression
##
## 569 samples
## 21 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 512, 513, 511, 512, 512, 513, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 126.1271 0.8759337 93.27927
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
In this section, we use several machine learning models to predict whether a tumor is benign or malignant based on clinical features from the breast cancer dataset. We will compare models using accuracy and other classification metrics.
First, we load the required libraries, import the data, clean it and split it into training and test sets for model evaluation.
# install.packages("randomForest")
# install.packages("xgboost")
# Load required libraries
library(tidyverse)
library(caret)
library(ggplot2)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(rpart)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(e1071)
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
# Load the dataset
data <- read.csv("cleaned_breast_cancer_data.csv")
# Remove the 'id' column as it's not relevant for prediction
data <- data %>% select(-id)
# Convert diagnosis to a factor (M = Malignant, B = Benign)
data$diagnosis <- as.factor(ifelse(data$diagnosis == "M", 1, 0))
# Check the distribution of the target variable
table(data$diagnosis)
##
## 0 1
## 357 212
prop.table(table(data$diagnosis))
##
## 0 1
## 0.6274165 0.3725835
# Set seed for reproducibility
set.seed(42)
# Split data into training (80%) and testing (20%) sets
train_index <- createDataPartition(data$diagnosis, p = 0.8, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
# Check distribution in both sets
table(train_data$diagnosis)
##
## 0 1
## 286 170
table(test_data$diagnosis)
##
## 0 1
## 71 42
We fit a regularized logistic regression model to classify tumor diagnosis, using cross-validation for model selection.
# Train regularized logistic regression model
logit_model <- train(diagnosis ~ .,
data = train_data,
method = "glmnet",
trControl = trainControl(method = "cv", number = 5),
preProcess = c("center", "scale"),
tuneLength = 5)
# Predictions on test set
logit_pred <- predict(logit_model, newdata = test_data)
logit_prob <- predict(logit_model, newdata = test_data, type = "prob")[, "1"]
# Confusion matrix
logit_cm <- confusionMatrix(logit_pred, test_data$diagnosis, positive = "1")
print(logit_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 69 0
## 1 2 42
##
## Accuracy : 0.9823
## 95% CI : (0.9375, 0.9978)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9625
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 1.0000
## Specificity : 0.9718
## Pos Pred Value : 0.9545
## Neg Pred Value : 1.0000
## Prevalence : 0.3717
## Detection Rate : 0.3717
## Detection Prevalence : 0.3894
## Balanced Accuracy : 0.9859
##
## 'Positive' Class : 1
##
# Accuracy
cat("Logistic Regression's Accuracy:", round(logit_cm$overall['Accuracy'], 4), "\n")
## Logistic Regression's Accuracy: 0.9823
We train a random forest classifier, an ensemble method known for robustness and high accuracy.
# Train random forest model
rf_model <- train(diagnosis ~ .,
data = train_data,
method = "rf",
trControl = trainControl(method = "cv", number = 5),
importance = TRUE)
# Predictions on test set
rf_pred <- predict(rf_model, newdata = test_data)
rf_prob <- predict(rf_model, newdata = test_data, type = "prob")[,"1"]
# Confusion matrix
rf_cm <- confusionMatrix(rf_pred, test_data$diagnosis, positive = "1")
print(rf_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 70 0
## 1 1 42
##
## Accuracy : 0.9912
## 95% CI : (0.9517, 0.9998)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9811
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 1.0000
## Specificity : 0.9859
## Pos Pred Value : 0.9767
## Neg Pred Value : 1.0000
## Prevalence : 0.3717
## Detection Rate : 0.3717
## Detection Prevalence : 0.3805
## Balanced Accuracy : 0.9930
##
## 'Positive' Class : 1
##
# Accuracy
cat("Random Forest's Accuracy:", round(rf_cm$overall['Accuracy'], 4), "\n")
## Random Forest's Accuracy: 0.9912
We fit a support vector machine for binary classification, extracting probabilities for ROC analysis.
# Train the SVM model with probability = TRUE
svm_model <- svm(diagnosis ~ ., data = train_data, probability = TRUE)
# Predict on test set
svm_pred <- predict(svm_model, newdata = test_data, probability = TRUE)
# Extract probabilities for class "1" (Malignant)
svm_prob <- attr(svm_pred, "probabilities")[, "1"]
# Confusion Matrix
svm_cm <- confusionMatrix(svm_pred, test_data$diagnosis, positive = "1")
print(svm_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 68 0
## 1 3 42
##
## Accuracy : 0.9735
## 95% CI : (0.9244, 0.9945)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.944
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 0.9577
## Pos Pred Value : 0.9333
## Neg Pred Value : 1.0000
## Prevalence : 0.3717
## Detection Rate : 0.3717
## Detection Prevalence : 0.3982
## Balanced Accuracy : 0.9789
##
## 'Positive' Class : 1
##
# Accuracy
cat("SVM's Accuracy:", round(svm_cm$overall['Accuracy'], 4), "\n")
## SVM's Accuracy: 0.9735
We use XGBoost, a powerful gradient boosting algorithm, for classification and report accuracy on the test set.
# Ensure diagnosis is numeric (1 = Malignant, 0 = Benign)
train_label <- as.numeric(as.character(train_data$diagnosis))
test_label <- as.numeric(as.character(test_data$diagnosis))
# Prepare model matrices
train_matrix <- model.matrix(diagnosis ~ . - 1, data = train_data)
test_matrix <- model.matrix(diagnosis ~ . - 1, data = test_data)
# Convert to xgb.DMatrix
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)
# Set parameters
params <- list(
objective = "binary:logistic",
eval_metric = "auc",
max_depth = 6,
eta = 0.1,
gamma = 0,
colsample_bytree = 0.8,
min_child_weight = 1,
subsample = 0.8
)
# Train XGBoost model
xgb_model <- xgb.train(
params = params,
data = dtrain,
nrounds = 100,
watchlist = list(train = dtrain, test = dtest),
print_every_n = 10,
verbose = 1
)
## [1] train-auc:0.987258 test-auc:0.980382
## [11] train-auc:0.999044 test-auc:0.999665
## [21] train-auc:0.999609 test-auc:0.999665
## [31] train-auc:0.999897 test-auc:0.999329
## [41] train-auc:0.999959 test-auc:1.000000
## [51] train-auc:0.999979 test-auc:1.000000
## [61] train-auc:1.000000 test-auc:1.000000
## [71] train-auc:1.000000 test-auc:0.999665
## [81] train-auc:1.000000 test-auc:0.999665
## [91] train-auc:1.000000 test-auc:0.999665
## [100] train-auc:1.000000 test-auc:0.999665
# Predictions
xgb_prob <- predict(xgb_model, dtest)
xgb_pred <- ifelse(xgb_prob > 0.5, "1", "0")
xgb_pred <- factor(xgb_pred, levels = c("0", "1"))
test_data$diagnosis <- factor(test_data$diagnosis, levels = c("0", "1"))
# Confusion matrix
xgb_cm <- confusionMatrix(xgb_pred, test_data$diagnosis, positive = "1")
print(xgb_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 69 0
## 1 2 42
##
## Accuracy : 0.9823
## 95% CI : (0.9375, 0.9978)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9625
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 1.0000
## Specificity : 0.9718
## Pos Pred Value : 0.9545
## Neg Pred Value : 1.0000
## Prevalence : 0.3717
## Detection Rate : 0.3717
## Detection Prevalence : 0.3894
## Balanced Accuracy : 0.9859
##
## 'Positive' Class : 1
##
# Accuracy
cat("XGBoost's Accuracy:", round(xgb_cm$overall['Accuracy'], 4), "\n")
## XGBoost's Accuracy: 0.9823
This section presents graphical summaries to interpret our regression and classification model results. We visualize predicted vs. actual values for the regression model and plot ROC curves to evaluate classifier performance.
This scatter plot compares the predicted tumor area to the actual tumor area for our test set. The dashed red line represents perfect prediction. Points closer to the line indicate more accurate predictions.
library(ggplot2)
# Create a data frame with actual and predicted values
results <- data.frame(
Actual = test$area_mean,
Predicted = predictions
)
# Plot predicted vs actual
ggplot(results, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.6, color = "blue") +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(
title = "Predicted vs Actual Tumor Area",
x = "Actual area_mean",
y = "Predicted area_mean"
) +
theme_minimal()
We plot the Receiver Operating Characteristic (ROC) curves for each classifier. ROC curves illustrate the trade-off between sensitivity and specificity for different thresholds. The area under the curve (AUC) quantifies overall model performance.
# Logistic Regression ROC curve
logit_roc <- suppressMessages(roc(test_data$diagnosis, logit_prob))
plot(logit_roc, main = "ROC Curve - Logistic Regression", col = "blue")
# Random Forest ROC curve
rf_roc <- suppressMessages(roc(test_data$diagnosis, rf_prob))
plot(rf_roc, main = "ROC Curve - Random Forest", col = "green")
# SVM ROC Curve
numeric_test_data <- as.numeric(as.character(test_data$diagnosis)) # Convert actual labels to numeric (for ROC)
svm_roc <- suppressMessages(roc(numeric_test_data, svm_prob))
plot(svm_roc, col = "purple", main = "ROC Curve - SVM")
# XGBoost ROC Curve
xgb_roc <- suppressMessages(roc(response = test_data$diagnosis,
predictor = xgb_prob,
levels = c("0", "1"),
direction = "<"))
plot(xgb_roc, main = "ROC Curve - XGBoost", col = "red")
This section summarizes and compares the performance of all trained classification models for breast cancer diagnosis. Models evaluated include:
- Logistic Regression
- Random Forest
- Support Vector Machine (SVM)
- XGBoost
Each model is assessed on the test set using key metrics: Accuracy, Sensitivity, Specificity and Area Under the ROC Curve (AUC). We also visualize the results for easier interpretation.
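For reference, these metrics are computed from the confusion-matrix counts (with malignant, class 1, as the positive class):

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \mathrm{Sensitivity}=\frac{TP}{TP+FN},\qquad \mathrm{Specificity}=\frac{TN}{TN+FP}.$$

As a worked check against the Random Forest confusion matrix above (TP = 42, TN = 70, FP = 1, FN = 0): accuracy = 112/113 ≈ 0.9912, sensitivity = 42/42 = 1 and specificity = 70/71 ≈ 0.9859, matching the reported values.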
# Create a data frame to compare model performance
model_comparison <- data.frame(
Model = c("Logistic Regression", "Random Forest", "SVM", "XGBoost"),
Accuracy = c(logit_cm$overall["Accuracy"],
rf_cm$overall["Accuracy"],
svm_cm$overall["Accuracy"],
xgb_cm$overall["Accuracy"]),
Sensitivity = c(logit_cm$byClass["Sensitivity"],
rf_cm$byClass["Sensitivity"],
svm_cm$byClass["Sensitivity"],
xgb_cm$byClass["Sensitivity"]),
Specificity = c(logit_cm$byClass["Specificity"],
rf_cm$byClass["Specificity"],
svm_cm$byClass["Specificity"],
xgb_cm$byClass["Specificity"]),
AUC = c(auc(logit_roc),
auc(rf_roc),
auc(svm_roc),
auc(xgb_roc))
)
# Print the comparison table
print(model_comparison)
## Model Accuracy Sensitivity Specificity AUC
## 1 Logistic Regression 0.9823009 1 0.9718310 0.9993293
## 2 Random Forest 0.9911504 1 0.9859155 0.9993293
## 3 SVM 0.9734513 1 0.9577465 1.0000000
## 4 XGBoost 0.9823009 1 0.9718310 0.9996647
# Visualize model comparison
ggplot(model_comparison, aes(x = Model, y = Accuracy, fill = Model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = round(Accuracy, 3)), vjust = -0.5) +
labs(title = "Model Accuracy Comparison",
y = "Accuracy",
x = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot ROC curves together
plot(logit_roc, col = "blue", main = "ROC Curves Comparison")
plot(rf_roc, col = "green", add = TRUE)
plot(svm_roc, col = "purple", add = TRUE)
plot(xgb_roc, col = "red", add = TRUE)
legend("bottomright",
legend = c(paste("Logistic Regression (AUC =", round(auc(logit_roc), 3), ")"),
paste("Random Forest (AUC =", round(auc(rf_roc), 3), ")"),
paste("SVM (AUC =", round(auc(svm_roc), 3), ")"),
paste("XGBoost (AUC =", round(auc(xgb_roc), 3), ")")),
col = c("blue", "green", "purple", "red"), lwd = 2)
This section summarizes the performance of various regression models used to predict tumor size (area_mean). The models evaluated include:
- Linear Regression (ordinary least squares)
- Ridge Regression (L2 regularization)
- Lasso Regression (L1 regularization)
- Elastic Net Regression (combined L1 and L2 regularization)
Each model’s performance is assessed using Root Mean Squared Error (RMSE) on a test set. Additionally, 10-fold cross-validation was used for Linear Regression to estimate its generalization ability.
| Model | RMSE (Overall) | RMSE (Benign) | RMSE (Malignant) |
|---|---|---|---|
| Linear Regression | 121.41 | — | — |
| Ridge Regression | 138.61 | 106.00 | 177.60 |
| Lasso Regression | 125.33 | — | — |
| Elastic Net | 133.05 | — | — |
| Linear Regression (10-fold CV) | 126.13 | — | — |
Linear Regression yields the lowest RMSE on the test set (121.41), serving as a strong baseline.
Regularized models (Ridge, Lasso, Elastic Net) produce slightly higher RMSE values but help address overfitting.
Among regularized models, Lasso Regression achieves the lowest RMSE (125.33), providing a good balance between prediction accuracy and feature selection.
Ridge Regression is much better at predicting benign tumors (RMSE = 106.00) than malignant tumors (RMSE = 177.60), possibly due to higher variability or class imbalance in malignant cases.
10-fold cross-validation for Linear Regression supports the stability and generalizability of its performance estimate (RMSE = 126.13).
This project developed and evaluated a range of machine learning models to predict tumor malignancy (classification) and tumor size (regression) using the Breast Cancer Wisconsin dataset. Models such as Logistic Regression, Random Forest, SVM and XGBoost demonstrated high predictive power for diagnosing malignancy, with Random Forest achieving the top accuracy (99.12%), closely followed by Logistic Regression and XGBoost (both 98.23%). All models achieved perfect sensitivity (1.000), ensuring detection of malignant tumors, and excellent specificity, especially in Random Forest (98.59%).
Regression models for tumor size (area_mean) were
evaluated with Linear Regression,
Ridge, Lasso and Elastic
Net. Linear Regression gave the lowest test
RMSE (121.41) and a high R-squared
(0.8759). Lasso Regression was a
strong regularized alternative. These findings support both accurate
diagnosis and meaningful size prediction to guide treatment
planning.
Accuracy and Sensitivity: The classification models provided high accuracy, particularly Random Forest and Logistic Regression, with perfect sensitivity. This is crucial as it ensures that malignant tumors are accurately detected, minimizing the risk of false negatives.
Model Comparisons: While Random Forest showed the best accuracy, all models demonstrated reliable performance in detecting malignant tumors, emphasizing the reliability of ensemble models like Random Forest and XGBoost. On the other hand, SVM showed slightly lower performance, which could be improved with parameter tuning.
Tumor Size Prediction: The regression models, particularly Linear Regression with an R-squared of 0.8759, effectively captured tumor size variation, which is a crucial factor for treatment planning.
Model Robustness: All models were assessed with confusion matrices, providing robust classification performance metrics. The Confusion Matrix for Logistic Regression and Random Forest showed exceptional performance with very low false positives, reflecting the model’s high predictive capability.
Improvement Potential: While the models are strong, further improvements could be made with hyperparameter tuning, feature engineering and addressing data imbalance. Specifically, Random Forest and XGBoost can be fine-tuned to enhance performance and reduce overfitting.
Model Interpretability: Models like
Logistic Regression are interpretable, providing
transparency on which features (such as concave.points_mean
and fractal_dimension_mean) impact predictions. In
contrast, models like XGBoost and Random
Forest are harder to interpret, which is a limitation in
clinical applications where interpretability is important.
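A lightweight first step toward interpretability, using only objects already created in this report, is caret's variable-importance summary for the random forest; the sketch below is illustrative (deeper methods such as SHAP or LIME are discussed under Future Work):

# Sketch: rank predictors by importance for the caret-trained random forest (rf_model)
rf_importance <- varImp(rf_model)
print(rf_importance)
plot(rf_importance, top = 10)   # show the ten most influential predictors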
While this project provides valuable insights into breast cancer diagnosis prediction, there are a few limitations that should be considered:
Class Imbalance: The dataset is imbalanced, with a larger proportion of benign cases compared to malignant ones. This imbalance may affect the model’s ability to accurately predict malignant cases, potentially leading to biased predictions. Research indicates that imbalanced data can result in biased predictions, where the model may predict the majority class more accurately, while the minority class (malignant cases) receives less attention (Chawla et al., 2002).
Data Quality: The dataset may contain some noisy or missing values. Although preprocessing was done to handle missing values and irrelevant columns, slight inconsistencies or inaccuracies in the data might still impact model performance. Missing data, when not handled correctly, can lead to biased results and overfitting (Joel et al., 2022).
Feature Selection: The models were built using a set of features chosen based on the dataset, but there might be additional relevant features that could improve the prediction accuracy. The feature selection process was not exhaustive and could benefit from further exploration. Feature engineering is critical, as better or more informative features can significantly improve model performance (Guyon & Elisseeff, 2003).
Model Generalization: While the models performed well during training, there may still be overfitting due to the specific characteristics of the dataset. Cross-validation was applied, but it is still possible that the models might not generalize well to unseen data or different populations. Overfitting is a common issue in machine learning, where models are too complex and fail to generalize well on new data (Kuhn & Johnson, 2013).
Lack of External Validation: The dataset used for model training and testing is from a single source (the Kaggle dataset). Validation on an external dataset or in a clinical setting would be necessary to assess the real-world applicability of the models. External validation ensures that the model’s performance is consistent across different datasets and settings (Riley et al., 2016).
To improve the robustness and effectiveness of the models in predicting breast cancer malignancy, several future steps could be considered:
Address Class Imbalance: Techniques like oversampling, undersampling or using weighted loss functions should be explored to mitigate the effects of class imbalance. Synthetic data generation (e.g., using SMOTE) could also help in generating more malignant samples. These techniques have been shown to significantly improve the performance of classifiers on imbalanced datasets (He & Garcia, 2009).
Additional Feature Engineering: More in-depth feature engineering, such as creating new derived features or incorporating domain knowledge from medical experts, could enhance the models’ ability to discriminate between benign and malignant cases. Feature engineering is crucial, as it can allow models to capture more relevant patterns in the data (Domingos, 2012).
Model Selection and Tuning: Although various models like logistic regression, random forests, SVM and XGBoost were explored, further experimentation with ensemble methods, deep learning or neural networks might yield even better results. Hyperparameter optimization through grid search or random search could also enhance model performance. Ensemble methods have been shown to outperform individual models in various machine learning tasks (Dietterich, 2000).
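A minimal sketch of such a grid search is given below, assuming the caret and randomForest packages are available. It tunes the random forest mtry parameter with 5-fold cross-validation on a copy of the data with the id and empty columns dropped, mirroring the cleaning steps applied earlier.
# Illustrative sketch: grid search over mtry for a random forest with 5-fold CV
library(caret)
set.seed(123)
df_model <- df[, !(names(df) %in% c("id", "X", "...33"))]  # drop id/empty columns if still present
df_model$diagnosis <- as.factor(df_model$diagnosis)
rf_tuned <- train(
  diagnosis ~ .,
  data      = df_model,
  method    = "rf",                                        # fits via the randomForest package
  tuneGrid  = expand.grid(mtry = c(2, 4, 6, 8)),
  trControl = trainControl(method = "cv", number = 5)
)
rf_tuned$bestTune                                          # mtry value with the best cross-validated accuracy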
External Validation and Real-world Testing: Testing the model on an external dataset or in collaboration with medical institutions will help in understanding how well the model performs on real-world clinical data. Collaborative efforts with hospitals could lead to more clinically relevant insights. Real-world validation is necessary to confirm the generalizability of models developed in controlled settings.
Interpretability and Explainability: To increase trust in the models, efforts should be made to improve model interpretability. Techniques like SHAP values or LIME could provide insights into how the models make their predictions and help in clinical decision-making. Explainable AI has gained significant attention in healthcare for providing transparent reasoning for predictions (Caruana et al., 2015).
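As one possible route, the sketch below uses the iml package (our choice for illustration; lime or fastshap would serve a similar purpose) to compute Shapley values for a single patient, reusing the rf_tuned model and df_model data frame from the tuning sketch above.
# Illustrative sketch: local explanation via Shapley values with the iml package
library(iml)
X <- df_model[, setdiff(names(df_model), "diagnosis")]       # predictors only
predictor <- Predictor$new(rf_tuned, data = X, y = df_model$diagnosis)
shap <- Shapley$new(predictor, x.interest = X[1, ])          # explain the prediction for the first patient
plot(shap)                                                   # per-feature contributions to this prediction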
Battineni, G., Chintalapudi, N., & Amenta, F. (2020). Performance analysis of different machine learning algorithms in breast cancer predictions. EAI Endorsed Transactions on Pervasive Health and Technology, 6(e4). https://doi.org/10.4108/eai.28-5-2020.166010
The study explores how machine learning models, including Logistic Regression (LR) and Support Vector Machines (SVM), can be employed for breast cancer diagnosis. These models are evaluated for their predictive power in classifying tumors as benign or malignant, offering critical insights into model selection for our classification task (Battineni et al., 2020).
Caruana, R., Gehrke, J., Koch, P., Sturm, M., & Elhadad, N. (2015). Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1721-1730. https://doi.org/10.1145/2783258.2788613
This paper focuses on making machine learning models interpretable in healthcare, which is essential for clinical applications. The authors used models to predict pneumonia risk and 30-day hospital readmission, demonstrating the importance of explainability in critical healthcare predictions (Caruana et al., 2015).
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357. https://doi.org/10.1613/jair.953
This foundational paper introduces the SMOTE technique for handling class imbalance in machine learning datasets. SMOTE generates synthetic samples of the minority class, improving model performance for imbalanced data, which is crucial for our classification task in breast cancer prediction (Chawla et al., 2002).
Chtouki, K., Rhanoui, M., Mikram, M., Amazian, K., & Yousfi, S. (2023). Supervised machine learning for breast cancer risk factors analysis and survival prediction. arXiv. https://doi.org/10.48550/arXiv.2304.07299
The authors demonstrate that machine learning algorithms, including Decision Trees, Random Forest, and SVM, can effectively predict the survival of breast cancer patients. This directly supports the regression task in our project to predict tumor size and understand growth patterns based on clinical data (Chtouki et al., 2023).
Dietterich, T. G. (2000). Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems (pp. 1-15). Springer. https://doi.org/10.1007/3-540-45014-9_1
This paper provides a comprehensive discussion on ensemble learning, particularly the advantages of combining multiple models to improve prediction accuracy, an approach used in our Random Forest and XGBoost models (Dietterich, 2000).
Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78-87. https://doi.org/10.1145/2347736.2347755
This article provides essential insights into key concepts in machine learning, emphasizing the importance of model selection and feature engineering. The concepts covered in this paper were directly applied in our feature selection and model comparison steps (Domingos, 2012).
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182. https://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf
This paper outlines essential methods for feature selection, which is a critical step in reducing dimensionality and improving model performance. The authors highlight several techniques, including recursive feature elimination, which was useful for our feature selection process (Guyon & Elisseeff, 2003).
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284. https://doi.org/10.1109/TKDE.2008.239
This paper explores methods for learning from imbalanced datasets, where one class is significantly more prevalent than the other. Techniques such as SMOTE and cost-sensitive learning are discussed, both of which are highly relevant to our classification models, especially with the class imbalance in breast cancer datasets (He & Garcia, 2009).
Islam, M., & Poly, T. N. (2019). Machine learning models of breast cancer risk prediction. bioRxiv. https://doi.org/10.1101/723304
In this paper, the authors explore multiple machine learning techniques, such as decision trees and KNN, and compare their effectiveness in predicting breast cancer. The study reveals that KNN provides high accuracy, suggesting it as a viable candidate for the classification task in our project (Islam & Poly, 2019).
Joel, L. O., Doorsamy, W., & Paul, B. S. (2022). A review of missing data handling techniques for machine learning. International Journal of Innovative Technology & Interdisciplinary Sciences, 5(3), 971-1005. https://doi.org/10.15157/IJITIS.2022.5.3.971-1005
This paper reviews techniques for handling missing data in machine learning models, a crucial preprocessing step. Although we have handled missing values in the dataset, this reference provided insights into advanced techniques for future data processing (Joel et al., 2022).
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. Springer. https://doi.org/10.1007/978-1-4614-6849-3
This book provides comprehensive coverage of the process of building predictive models. It discusses practical aspects of model fitting, evaluation, and performance metrics, which were foundational in our model development and evaluation phases (Kuhn & Johnson, 2013).
Moturi, S., Rao, S., & Vemuru, S. (2021). Risk prediction-based breast cancer diagnosis using personal health records and machine learning models. Springer. https://doi.org/10.1007/978-981-15-9516-5_37
This research examines the use of various machine learning models like Random Forest and SVM for classifying breast cancer as benign or malignant. The findings emphasize how data preprocessing and feature selection improve the accuracy of predictions, aligning with our data cleaning and feature selection steps (Moturi et al., 2021).
Riley, R. D., Ensor, J., Snell, K. I., Debray, T. P., Altman, D. G., Moons, K. G., & Collins, G. S. (2016). External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: Opportunities and challenges. BMJ, 353, i3140. https://doi.org/10.1136/bmj.i3140
This paper discusses the challenges and opportunities of external validation for clinical prediction models. It emphasizes the importance of testing models on different datasets, which is critical for our models to generalize beyond the current dataset (Riley et al., 2016).
Sudarsa, S., & Reddy, K. (2024). Systematic review on breast cancer prediction and classification using machine learning and deep learning methods. In 2024 8th International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC). https://doi.org/10.1109/I-SMAC61858.2024.10714683
This review highlights various machine learning and deep learning methods for breast cancer classification and prediction, offering a comprehensive overview of methodologies that could inform our model selection and evaluation processes, especially for improving diagnostic accuracy (Sudarsa & Reddy, 2024).