Classification & Regression Models for Breast Cancer Predictive Analysis

Ahmed, Salah, Luqman
1/14/2022

Introduction

Breast cancer is one of the most common types of cancer, accounting for about 25% of cancer cases among women.
It starts when breast cells begin to grow uncontrollably and form a tumor
that can either be seen on an X-ray or felt as a lump in the breast area.

Objectives

Our main goal is to analyse the data set and extract useful insights that can support data-driven decisions.
The scope is to use the correlations among the breast cancer data set features
to predict whether a patient has a benign (non-cancerous) tumor or a malignant (cancerous) tumor.

Process Methodology Flowchart

Used Libraries.
library(tidyverse) #For tidying and manipulating the data
library(skimr) #For summary statistics about the variables in data frames
library(ggplot2) #For plotting
library(RColorBrewer) #For color palettes
library(tidymodels) #For splitting the data and training
library(caret) #For machine learning (data partitioning, training, confusion matrices)
library(party) #For the ctree() decision tree model
library(yardstick) #For plotting the confusion matrix

Importing The Data Set

df <- read.csv("/Users/salahkaf/Desktop/breast_cancer_1.csv")

Details about the Dataset

Title: Breast Cancer Wisconsin (Diagnostic) Data Set
Last Update: 2021-12-29
Purpose: Analysing this data set aims to help protect women from breast cancer.
The analysis uses classification and regression models
to predict a tumor's diagnosis early, so that the necessary preventive measures can be taken.
Link: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Dataset Dimensions and Content

dim(df) # dimension of the dataset
[1] 569  32
names(df) # column names
 [1] "id"                      "diagnosis"              
 [3] "radius_mean"             "texture_mean"           
 [5] "perimeter_mean"          "area_mean"              
 [7] "smoothness_mean"         "compactness_mean"       
 [9] "concavity_mean"          "concave.points_mean"    
[11] "symmetry_mean"           "fractal_dimension_mean" 
[13] "radius_se"               "texture_se"             
[15] "perimeter_se"            "area_se"                
[17] "smoothness_se"           "compactness_se"         
[19] "concavity_se"            "concave.points_se"      
[21] "symmetry_se"             "fractal_dimension_se"   
[23] "radius_worst"            "texture_worst"          
[25] "perimeter_worst"         "area_worst"             
[27] "smoothness_worst"        "compactness_worst"      
[29] "concavity_worst"         "concave.points_worst"   
[31] "symmetry_worst"          "fractal_dimension_worst"

Dataset Structure and Summary

str(df) # structure of the data
'data.frame':   569 obs. of  32 variables:
 $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
 $ diagnosis              : chr  NA "M" "M" "M" ...
 $ radius_mean            : num  NA 20.6 19.7 11.4 20.3 ...
 $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
 $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
 $ area_mean              : num  1001 1326 1203 386 1297 ...
 $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
 $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
 $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
 $ concave.points_mean    : num  NA 0.0702 0.1279 0.1052 0.1043 ...
 $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
 $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
 $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
 $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
 $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
 $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
 $ smoothness_se          : num  NA 0.00522 0.00615 0.00911 0.01149 ...
 $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
 $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
 $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
 $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
 $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
 $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
 $ texture_worst          : num  NA 23.4 25.5 26.5 16.7 ...
 $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
 $ area_worst             : num  2019 1956 1709 568 1575 ...
 $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
 $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
 $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
 $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
 $ symmetry_worst         : num  NA 0.275 0.361 0.664 0.236 ...
 $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
summary(df) # Summary of each column in the data
       id             diagnosis          radius_mean    
 Min.   :     8670   Length:569         Min.   : 6.981  
 1st Qu.:   869218   Class :character   1st Qu.:11.685  
 Median :   906024   Mode  :character   Median :13.300  
 Mean   : 30371831                      Mean   :14.107  
 3rd Qu.:  8813129                      3rd Qu.:15.765  
 Max.   :911320502                      Max.   :28.110  
                                        NA's   :6       
  texture_mean   perimeter_mean     area_mean      smoothness_mean  
 Min.   : 9.71   Min.   : 43.79   Min.   : 143.5   Min.   :0.05263  
 1st Qu.:16.17   1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637  
 Median :18.84   Median : 86.24   Median : 551.1   Median :0.09587  
 Mean   :19.29   Mean   : 91.97   Mean   : 654.9   Mean   :0.09636  
 3rd Qu.:21.80   3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530  
 Max.   :39.28   Max.   :188.50   Max.   :2501.0   Max.   :0.16340  
                                                                    
 compactness_mean  concavity_mean    concave.points_mean
 Min.   :0.01938   Min.   :0.00000   Min.   :0.00000    
 1st Qu.:0.06492   1st Qu.:0.02956   1st Qu.:0.02029    
 Median :0.09263   Median :0.06154   Median :0.03326    
 Mean   :0.10434   Mean   :0.08880   Mean   :0.04852    
 3rd Qu.:0.13040   3rd Qu.:0.13070   3rd Qu.:0.07352    
 Max.   :0.34540   Max.   :0.42680   Max.   :0.20120    
                                     NA's   :6          
 symmetry_mean    fractal_dimension_mean   radius_se     
 Min.   :0.1060   Min.   :0.04996        Min.   :0.1115  
 1st Qu.:0.1619   1st Qu.:0.05770        1st Qu.:0.2324  
 Median :0.1792   Median :0.06154        Median :0.3242  
 Mean   :0.1812   Mean   :0.06280        Mean   :0.4052  
 3rd Qu.:0.1957   3rd Qu.:0.06612        3rd Qu.:0.4789  
 Max.   :0.3040   Max.   :0.09744        Max.   :2.8730  
                                                         
   texture_se      perimeter_se       area_se       
 Min.   :0.3602   Min.   : 0.757   Min.   :  6.802  
 1st Qu.:0.8339   1st Qu.: 1.606   1st Qu.: 17.850  
 Median :1.1080   Median : 2.287   Median : 24.530  
 Mean   :1.2169   Mean   : 2.866   Mean   : 40.337  
 3rd Qu.:1.4740   3rd Qu.: 3.357   3rd Qu.: 45.190  
 Max.   :4.8850   Max.   :21.980   Max.   :542.200  
                                                    
 smoothness_se      compactness_se      concavity_se    
 Min.   :0.001713   Min.   :0.002252   Min.   :0.00000  
 1st Qu.:0.005163   1st Qu.:0.013080   1st Qu.:0.01509  
 Median :0.006380   Median :0.020450   Median :0.02589  
 Mean   :0.007050   Mean   :0.025478   Mean   :0.03189  
 3rd Qu.:0.008182   3rd Qu.:0.032450   3rd Qu.:0.04205  
 Max.   :0.031130   Max.   :0.135400   Max.   :0.39600  
 NA's   :6                                              
 concave.points_se   symmetry_se       fractal_dimension_se
 Min.   :0.000000   Min.   :0.007882   Min.   :0.0008948   
 1st Qu.:0.007638   1st Qu.:0.015160   1st Qu.:0.0022480   
 Median :0.010930   Median :0.018730   Median :0.0031870   
 Mean   :0.011796   Mean   :0.020542   Mean   :0.0037949   
 3rd Qu.:0.014710   3rd Qu.:0.023480   3rd Qu.:0.0045580   
 Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400   
                                                           
  radius_worst   texture_worst   perimeter_worst    area_worst    
 Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
 1st Qu.:13.01   1st Qu.:21.09   1st Qu.: 84.11   1st Qu.: 515.3  
 Median :14.97   Median :25.40   Median : 97.66   Median : 686.5  
 Mean   :16.27   Mean   :25.67   Mean   :107.26   Mean   : 880.6  
 3rd Qu.:18.79   3rd Qu.:29.80   3rd Qu.:125.40   3rd Qu.:1084.0  
 Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
                 NA's   :6                                        
 smoothness_worst  compactness_worst concavity_worst 
 Min.   :0.07117   Min.   :0.02729   Min.   :0.0000  
 1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145  
 Median :0.13130   Median :0.21190   Median :0.2267  
 Mean   :0.13237   Mean   :0.25427   Mean   :0.2722  
 3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829  
 Max.   :0.22260   Max.   :1.05800   Max.   :1.2520  
                                                     
 concave.points_worst symmetry_worst   fractal_dimension_worst
 Min.   :0.00000      Min.   :0.1565   Min.   :0.05504        
 1st Qu.:0.06493      1st Qu.:0.2503   1st Qu.:0.07146        
 Median :0.09993      Median :0.2822   Median :0.08004        
 Mean   :0.11461      Mean   :0.2899   Mean   :0.08395        
 3rd Qu.:0.16140      3rd Qu.:0.3177   3rd Qu.:0.09208        
 Max.   :0.29100      Max.   :0.6638   Max.   :0.20750        
                      NA's   :6                               
We observe from the summary that the data set contains missing values, so this part needs to be tidied before analysis.

Preparing Data for Analysis by Checking for Missing Values & Replacing Them.

any(is.na(df)) # To confirm whether there are NA values in the data
[1] TRUE
colnames(df) <- str_to_lower(colnames(df)) # Changing all column names to lowercase
# Finding the columns that have NA values
colSums(is.na(df))
                     id               diagnosis 
                      0                       6 
            radius_mean            texture_mean 
                      6                       0 
         perimeter_mean               area_mean 
                      0                       0 
        smoothness_mean        compactness_mean 
                      0                       0 
         concavity_mean     concave.points_mean 
                      0                       6 
          symmetry_mean  fractal_dimension_mean 
                      0                       0 
              radius_se              texture_se 
                      0                       0 
           perimeter_se                 area_se 
                      0                       0 
          smoothness_se          compactness_se 
                      6                       0 
           concavity_se       concave.points_se 
                      0                       0 
            symmetry_se    fractal_dimension_se 
                      0                       0 
           radius_worst           texture_worst 
                      0                       6 
        perimeter_worst              area_worst 
                      0                       0 
       smoothness_worst       compactness_worst 
                      0                       0 
        concavity_worst    concave.points_worst 
                      0                       0 
         symmetry_worst fractal_dimension_worst 
                      6                       0 

Replacing NA Values in Numeric Columns by The Mean

# Replacing the NA values in these columns by the column mean
df$radius_mean[is.na(df$radius_mean)] <- mean(df$radius_mean, na.rm = TRUE)
df$concave.points_mean[is.na(df$concave.points_mean)] <- mean(df$concave.points_mean,na.rm = TRUE)
df$smoothness_se[is.na(df$smoothness_se)] <- mean(df$smoothness_se, na.rm = TRUE)
df$texture_worst[is.na(df$texture_worst)] <- mean(df$texture_worst, na.rm = TRUE)
df$symmetry_worst[is.na(df$symmetry_worst)] <- mean(df$symmetry_worst, na.rm = TRUE)
# Checking again whether any NA values remain in the numeric columns
colSums(is.na(df))
                     id               diagnosis 
                      0                       6 
            radius_mean            texture_mean 
                      0                       0 
         perimeter_mean               area_mean 
                      0                       0 
        smoothness_mean        compactness_mean 
                      0                       0 
         concavity_mean     concave.points_mean 
                      0                       0 
          symmetry_mean  fractal_dimension_mean 
                      0                       0 
              radius_se              texture_se 
                      0                       0 
           perimeter_se                 area_se 
                      0                       0 
          smoothness_se          compactness_se 
                      0                       0 
           concavity_se       concave.points_se 
                      0                       0 
            symmetry_se    fractal_dimension_se 
                      0                       0 
           radius_worst           texture_worst 
                      0                       0 
        perimeter_worst              area_worst 
                      0                       0 
       smoothness_worst       compactness_worst 
                      0                       0 
        concavity_worst    concave.points_worst 
                      0                       0 
         symmetry_worst fractal_dimension_worst 
                      0                       0 
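A more concise alternative (a sketch using dplyr's across(), loaded with the tidyverse) applies the same mean imputation to every numeric column in one step:

# Sketch: impute the column mean into every numeric column that still contains NAs
df <- df %>%
  mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))))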

Replacing NA Values in Character Columns by The Mode.

# The only character column is "diagnosis"
# A helper function to find the mode (most frequent value) of a column
calc_mode <- function(x)
  {
  # List the distinct / unique values
  distinct_values <- unique(x)
  # Count the occurrence of each distinct value
  distinct_tabulate <- tabulate(match(x, distinct_values))
  # Return the value with the highest occurrence
  distinct_values[which.max(distinct_tabulate)]
  }
# replacing the NA values by the mode
df <- df %>% 
  mutate(diagnosis = if_else(is.na(diagnosis), 
                         calc_mode(diagnosis), 
                         diagnosis))
# checking again for NA values in the data.
any(is.na(df)) #Should Be FALSE
[1] FALSE
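As a quick sanity check (a sketch), the helper can be called directly; for this data set it returns the majority class:

calc_mode(df$diagnosis) # returns "B", the most frequent diagnosis (361 of 569 rows)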

Statistical Analysis Questions

Q1- How can we predict whether a breast tumor is malignant or benign? (Classification problem)
Q2- How can we predict the mean of the lobes radius of a breast cancer tumor? (Regression problem)

Making a Copy of The Data Set for Classification Manipulation
df_c <- df

Question 1 - Is The Tumor Malignant or Benign?

EDAs

Extracting Some Useful Information About Each Column

skim_without_charts(df_c) # extracting useful summary statistics
Table 1: Data summary
Name df_c
Number of rows 569
Number of columns 32
_______________________
Column type frequency:
character 1
numeric 31
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
diagnosis 0 1 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
id 0 1 30371831.43 125020585.61 8670.00 869218.00 906024.00 8813129.00 911320502.00
radius_mean 0 1 14.11 3.51 6.98 11.70 13.38 15.75 28.11
texture_mean 0 1 19.29 4.30 9.71 16.17 18.84 21.80 39.28
perimeter_mean 0 1 91.97 24.30 43.79 75.17 86.24 104.10 188.50
area_mean 0 1 654.89 351.91 143.50 420.30 551.10 782.70 2501.00
smoothness_mean 0 1 0.10 0.01 0.05 0.09 0.10 0.11 0.16
compactness_mean 0 1 0.10 0.05 0.02 0.06 0.09 0.13 0.35
concavity_mean 0 1 0.09 0.08 0.00 0.03 0.06 0.13 0.43
concave.points_mean 0 1 0.05 0.04 0.00 0.02 0.03 0.07 0.20
symmetry_mean 0 1 0.18 0.03 0.11 0.16 0.18 0.20 0.30
fractal_dimension_mean 0 1 0.06 0.01 0.05 0.06 0.06 0.07 0.10
radius_se 0 1 0.41 0.28 0.11 0.23 0.32 0.48 2.87
texture_se 0 1 1.22 0.55 0.36 0.83 1.11 1.47 4.88
perimeter_se 0 1 2.87 2.02 0.76 1.61 2.29 3.36 21.98
area_se 0 1 40.34 45.49 6.80 17.85 24.53 45.19 542.20
smoothness_se 0 1 0.01 0.00 0.00 0.01 0.01 0.01 0.03
compactness_se 0 1 0.03 0.02 0.00 0.01 0.02 0.03 0.14
concavity_se 0 1 0.03 0.03 0.00 0.02 0.03 0.04 0.40
concave.points_se 0 1 0.01 0.01 0.00 0.01 0.01 0.01 0.05
symmetry_se 0 1 0.02 0.01 0.01 0.02 0.02 0.02 0.08
fractal_dimension_se 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.03
radius_worst 0 1 16.27 4.83 7.93 13.01 14.97 18.79 36.04
texture_worst 0 1 25.67 6.12 12.02 21.18 25.45 29.66 49.54
perimeter_worst 0 1 107.26 33.60 50.41 84.11 97.66 125.40 251.20
area_worst 0 1 880.58 569.36 185.20 515.30 686.50 1084.00 4254.00
smoothness_worst 0 1 0.13 0.02 0.07 0.12 0.13 0.15 0.22
compactness_worst 0 1 0.25 0.16 0.03 0.15 0.21 0.34 1.06
concavity_worst 0 1 0.27 0.21 0.00 0.11 0.23 0.38 1.25
concave.points_worst 0 1 0.11 0.07 0.00 0.06 0.10 0.16 0.29
symmetry_worst 0 1 0.29 0.06 0.16 0.25 0.28 0.32 0.66
fractal_dimension_worst 0 1 0.08 0.02 0.06 0.07 0.08 0.09 0.21

Figure 1: Bar Chart for The Number of Benign and Malignant Tumors

df_c %>% count(diagnosis) # count the number of benign and malignant samples
  diagnosis   n
1         B 361
2         M 208
# setting diagnosis column to factor for plotting and modeling
df_c$diagnosis <- as.factor(df_c$diagnosis)
# plotting the bar plot
ggplot(df_c, aes(x=diagnosis, fill= diagnosis)) +
geom_bar(stat="count") +
theme_classic() +
scale_y_continuous(breaks = seq(0, 400, by = 25)) +
labs(title="Distribution of Diagnosis")

Figure 2: Boxplot for the Radius Mean of The Tumors Vs The Diagnosis

fig2 <- df_c[c("radius_mean", "diagnosis")]
# plotting the box plot
ggplot(fig2, aes(diagnosis, radius_mean, fill = diagnosis)) + 
  geom_boxplot() +
  labs(fill = "Type of The Tumor") + ylab("lobes radius mean") +
  labs(title = "Distribution of Radius Mean by Diagnosis") +
  # changing the color of the boxplots
  scale_fill_manual(values = c("dodgerblue1", "red3"))

Figure 3: Scatter Plot of the Concavity Mean Vs Radius Mean

ggplot(df_c, aes(radius_mean, concavity_mean)) +
  geom_point(aes(color = diagnosis)) +
  labs(title = "Radious mean Vs Concavity mean",
       y = "Mean of Concavity", x = "Radius of Lobes",
       col="Type of The Tumor")+
   scale_colour_manual(labels = c("Benign", "Malignant"),
                       values = c("dodgerblue1","red2")) +
  theme_bw()

Machine Learning Analysis

Splitting the data into training set and test set

# setting the outcome (dependent) variable to a factor so that it can be fit in the classification model
df_c$diagnosis <- as.factor(df_c$diagnosis) 
df_c <- df_c[,-1] #Removing the ID column, as it carries no predictive information
split=0.80 # define an 80%/20% train/test split of the dataset
trainIndex <- createDataPartition(df_c$diagnosis, p=split, list=FALSE)
data_train <- df_c[ trainIndex,]
data_test <- df_c[-trainIndex,]
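Note that createDataPartition() samples at random, so the exact split (and the accuracy figures reported below) will vary slightly between runs. A minimal sketch of making the split reproducible, assuming any fixed seed value:

set.seed(123) # illustrative seed; any fixed value makes the partition reproducible
trainIndex <- createDataPartition(df_c$diagnosis, p=split, list=FALSE)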

Building a Cross validation Object

#use train() and trainControl()
train_control <- trainControl(method="cv", number=10)
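With method="cv" and number=10, each candidate model is trained on nine folds and evaluated on the held-out fold, and the ten accuracy estimates are averaged. If more stable estimates are wanted, repeated cross-validation is one option (a sketch, not used in the rest of this analysis):

train_control_rep <- trainControl(method="repeatedcv", number=10, repeats=3)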

Building a K-Nearest Neighbor (KNN) Classification Model

# train the model
model1 <- train(diagnosis~., data=data_train, trControl=train_control,
method="knn")
predictions1<-predict(model1, newdata = data_test)
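By default caret tunes k over a small grid and keeps the value with the best cross-validated accuracy. A sketch of a more explicit alternative, assuming the same train_control object; the grid of odd k values and the centering/scaling step are illustrative choices, since KNN is distance-based and sensitive to feature scale:

model1_tuned <- train(diagnosis~., data=data_train, trControl=train_control,
                      method="knn", tuneGrid=expand.grid(k=seq(3, 21, by=2)),
                      preProcess=c("center", "scale"))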

Plotting Model1

plot(model1)

Checking The Accuracy Using Confusion Matrix

cm1<-confusionMatrix(predictions1,data_test$diagnosis) # using confusion matrix for accuracy
cm1
Confusion Matrix and Statistics

          Reference
Prediction  B  M
         B 69  5
         M  3 36
                                          
               Accuracy : 0.9292          
                 95% CI : (0.8653, 0.9689)
    No Information Rate : 0.6372          
    P-Value [Acc > NIR] : 4.941e-13       
                                          
                  Kappa : 0.8453          
                                          
 Mcnemar's Test P-Value : 0.7237          
                                          
            Sensitivity : 0.9583          
            Specificity : 0.8780          
         Pos Pred Value : 0.9324          
         Neg Pred Value : 0.9231          
             Prevalence : 0.6372          
         Detection Rate : 0.6106          
   Detection Prevalence : 0.6549          
      Balanced Accuracy : 0.9182          
                                          
       'Positive' Class : B               
                                          

Plotting The Confusion Matrix for K-Nearest Neighbor

# making a data frame for the predicted and actual values
cm1_plot <- data.frame(predictions1,data_test$diagnosis )
# renaming the columns
names(cm1_plot) <- c("Predicted", "Actual")


cm1_plot <- conf_mat(cm1_plot,  Actual,Predicted)
autoplot(cm1_plot, type = "heatmap") +
  scale_fill_gradient(low="#D6EAF8",high = "#2E86C1") +
  theme(legend.position = "right") + labs(title = "Confusion matrix for KNN model")
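The headline metrics can also be pulled out of the caret confusionMatrix object programmatically, which is convenient when comparing several models (a short sketch):

cm1$overall["Accuracy"] # about 0.93 for this split
cm1$byClass[c("Sensitivity", "Specificity")]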

Training a Decision Tree (DT) Model

model2 <- ctree(diagnosis ~ ., data_train) # training the DT model
predictions2<-predict(model2, newdata = data_test)
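The fitted tree itself can be inspected with party's plot method, which shows the split variables and the class distribution in each terminal node (a sketch; the plot is not reproduced here):

plot(model2) # party's plot method for ctree objects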

Checking The Accuracy Using The Confusion Matrix

cm2<-confusionMatrix(predictions2,data_test$diagnosis)
cm2
Confusion Matrix and Statistics

          Reference
Prediction  B  M
         B 67  0
         M  5 41
                                          
               Accuracy : 0.9558          
                 95% CI : (0.8998, 0.9855)
    No Information Rate : 0.6372          
    P-Value [Acc > NIR] : 6.935e-16       
                                          
                  Kappa : 0.9068          
                                          
 Mcnemar's Test P-Value : 0.07364         
                                          
            Sensitivity : 0.9306          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.8913          
             Prevalence : 0.6372          
         Detection Rate : 0.5929          
   Detection Prevalence : 0.5929          
      Balanced Accuracy : 0.9653          
                                          
       'Positive' Class : B               
                                          

Plotting The Confusion Matrix for Decision Tree

# making a data frame for the predicted and actual values
cm2_plot <- data.frame(predictions2,data_test$diagnosis )
# renaming the columns
names(cm2_plot) <- c("Predicted", "Actual")


cm2_plot <- conf_mat(cm2_plot,  Actual,Predicted)

autoplot(cm2_plot, type = "heatmap") +
  scale_fill_gradient(low="#D6EAF8",high = "#2E86C1") +
  theme(legend.position = "right") + labs(title = "Confusion matrix for DT model")

Question 2 - Mean of The Lobes Radius - Prediction

Making a Copy of The Data set for Regression Manipulation

# Picking the columns that are highly correlated with the mean of the lobes radius
df_R <- df[c("perimeter_mean", "area_mean", "concave.points_mean", "radius_worst",
       "perimeter_worst", "area_worst", "radius_mean")]

EDAs

Building a Correlation Matrix for The radius_mean Column

cor(df_R)[, "radius_mean"]
     perimeter_mean           area_mean concave.points_mean 
          0.9924522           0.9831023           0.8222758 
       radius_worst     perimeter_worst          area_worst 
          0.9626530           0.9569919           0.9340473 
        radius_mean 
          1.0000000 
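The same shortlist could also be produced programmatically instead of by hand, e.g. by ranking all numeric columns by their correlation with radius_mean (a sketch; the 0.8 cutoff is an arbitrary illustrative threshold):

cor_with_radius <- cor(df[sapply(df, is.numeric)])[, "radius_mean"]
sort(cor_with_radius[abs(cor_with_radius) > 0.8], decreasing = TRUE)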

Figure 4: Scatter Plot for The Radius Mean Vs The Area Mean

ggplot(df_R, aes(radius_mean, area_mean)) + 
  geom_point(alpha = 0.1) + geom_smooth(method = "lm") + 
  labs(title = "The mean of lobes radius Vs the mean of the lobes area",
       y = "The mean of lobes areas", x = "The mean of lobes radius") + theme_bw()

Figure 5: Scatter plot for The Radius Mean Vs The Perimeter Mean

ggplot(df_R, aes(radius_mean, perimeter_mean)) + 
  geom_point(alpha = 0.1) + geom_smooth(method = "lm") + 
  labs(title = "The mean of lobes radius Vs the mean of the lobes perimeter", 
       y = "The mean of lobes perimeter", x = "The mean of lobes radius") + theme_bw()

Splitting The Data to Training and Testing Sets

split=0.80 # define an 80%/20% train/test split of the dataset
trainIndex <- createDataPartition(df_R$radius_mean, p=split, list=FALSE)
data_train_R <- df_R[ trainIndex,]
data_test_R <- df_R[-trainIndex,]

Training The Multiple Linear Regression Model

model3 <- lm(radius_mean ~ perimeter_mean + area_mean + concave.points_mean + radius_worst +
               perimeter_worst + area_worst, data = data_train_R)

Examining The Coefficient Table

summary(model3)$coefficient
                        Estimate   Std. Error     t value
(Intercept)          1.452766700 0.2556453324  5.68274291
perimeter_mean       0.087434825 0.0113911580  7.67567485
area_mean            0.004246361 0.0007016029  6.05237018
concave.points_mean  0.044078471 0.9869770440  0.04466008
radius_worst         0.498508836 0.0494352482 10.08407673
perimeter_worst     -0.042312350 0.0056869708 -7.44022626
area_worst          -0.001986717 0.0003754594 -5.29143073
                        Pr(>|t|)
(Intercept)         2.384706e-08
perimeter_mean      1.032026e-13
area_mean           3.009551e-09
concave.points_mean 9.643981e-01
radius_worst        1.071058e-21
perimeter_worst     5.140728e-13
area_worst          1.899586e-07

Using The Model on The Test Data to Predict The Radius Mean

prediction_R = predict(model3, newdata =data_test_R, interval = "prediction")
head(prediction_R) #Exploring the head of our prediction table
        fit      lwr      upr
6  12.56525 11.79421 13.33629
15 13.59875 12.82401 14.37348
23 15.17628 14.40568 15.94688
42 11.00685 10.23759 11.77612
53 11.87405 11.10747 12.64064
55 14.98148 14.21370 15.74927

Figure 6: Scatter Plot for The Actual Values Vs the Predicted Values for Radius Mean

# binding the predicted values with the actual ones
act_pred <- as.data.frame(cbind(data_test_R$radius_mean,prediction_R )) 
# taking only the predicted and actual values without the upper and lower boundaries
act_pred <- act_pred[,1:2]
#renaming the columns
act_pred <- act_pred %>% rename( Actual= V1, predicted= fit )
# assigning row names to NULL to reset the row index
row.names(act_pred) <- NULL
head(act_pred)
  Actual predicted
1  12.45  12.56525
2  13.73  13.59875
3  15.34  15.17628
4  10.95  11.00685
5  11.94  11.87405
6  15.10  14.98148
ggplot(act_pred, aes(Actual, predicted)) +
  geom_point(alpha = 0.4) +
  labs(title = "Predicted Values Vs Actual Values For Radius Mean",
       x = "Actual Values of The Test Data", y = "Predicted Values of The Test Data") +
  theme_gray()

There is a near-perfect linear relationship between the actual and the predicted values.

Checking The Accuracy Using The Residual Standard Error (RSE), or Sigma:

# The error rate can be estimated by dividing the RSE by the mean of the outcome variable:
# the smaller the error, the more accurate the model.
sigma(model3)/mean(data_test_R$radius_mean)
[1] 0.02775463
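The out-of-sample error can also be checked directly against the held-out test set (a sketch computing the root-mean-square error of the point predictions and expressing it relative to the mean radius, comparable to the RSE ratio above):

rmse_test <- sqrt(mean((data_test_R$radius_mean - prediction_R[, "fit"])^2))
rmse_test / mean(data_test_R$radius_mean)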

Conclusion

Data science and statistics are powerful tools that can be used to predict disasters
before they happen. Breast cancer is a terrible disease that has a strong negative impact on
society. In this project we developed supervised learning models: two classifiers, namely a
KNN model with about 93% prediction accuracy and a decision tree (DT) model with about 96%
accuracy, as well as a multiple linear regression model that predicts the mean of the lobes
radius from six predictors with a relative error (RSE divided by the mean radius) of roughly 0.03.