Introduction

The objective of this project is to build a predictive model that forecasts the likelihood of breast cancer from one, several or all of 9 descriptive factors (described later) and their interactions. The project consists of two phases. Phase I, covered in this report, focuses on data preprocessing and exploration of the descriptive factors. Model building, hypothesis testing, validation of model assumptions and confidence intervals for model parameters will be done in Phase II. The rest of this report is organised as follows. Section 2 describes the dataset. Section 3 covers data pre-processing. In Section 4, we explore the factors and their inter-relationships. The last section ends with a summary.

Dataset

This analysis is based on the dataset extracted from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)). The dataset provides cytological characteristics of a breast mass for 699 patients, computed from a digitized image of a fine needle aspirate (FNA). Each of the 9 descriptive factors, which are reported to differ between benign and malignant samples (the binary response we are interested in), was graded 1 to 10, with 1 being the closest to benign and 10 the most anaplastic (malignant). The details of the descriptive factors and the binary response are given below.

Data Pre-processing

In this phase of the project the following R packages were used.

library(reader)
library(dplyr)
library(knitr)
library(cowplot)
library(ggplot2)
library(mlr)

To make sure the data are free from missing values, errors and other anomalies, and that each column has a proper data type, the following column-wise summaries are used.

str(mydata)
## 'data.frame':    699 obs. of  11 variables:
##  $ ID             : int  1000025 1002945 1015425 1016277 1017023 1017122 1018099 1018561 1033078 1033078 ...
##  $ Clump_thickness: int  5 5 3 6 4 8 1 2 2 4 ...
##  $ Cell_Size      : int  1 4 1 8 1 10 1 1 1 2 ...
##  $ Cell_Shape     : int  1 4 1 8 1 10 1 2 1 1 ...
##  $ Adhesion       : int  1 5 1 1 3 8 1 1 1 1 ...
##  $ Epi.Cell_Size  : int  2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare_Nuclei    : Factor w/ 11 levels "?","1","10","2",..: 2 3 4 6 2 3 3 2 2 2 ...
##  $ Bland_Chromatin: int  3 3 3 3 3 9 3 3 1 2 ...
##  $ Normal_Nucleoli: int  1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses        : int  1 1 1 1 1 1 1 1 5 1 ...
##  $ Class          : int  2 2 2 2 2 4 2 2 2 2 ...
summarizeColumns(mydata) %>% knitr::kable( caption =  'Variables Summary before Data Preprocessing')
Variables Summary before Data Preprocessing

| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| ID | integer | 0 | 1.071704e+06 | 6.170957e+05 | 1171710 | 154755.2706 | 61634 | 13454352 | 0 |
| Clump_thickness | integer | 0 | 4.417740e+00 | 2.815741e+00 | 4 | 2.9652 | 1 | 10 | 0 |
| Cell_Size | integer | 0 | 3.134478e+00 | 3.051459e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Cell_Shape | integer | 0 | 3.207439e+00 | 2.971913e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Adhesion | integer | 0 | 2.806867e+00 | 2.855379e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Epi.Cell_Size | integer | 0 | 3.216023e+00 | 2.214300e+00 | 2 | 0.0000 | 1 | 10 | 0 |
| Bare_Nuclei | factor | 0 | NA | 4.248927e-01 | NA | NA | 4 | 402 | 11 |
| Bland_Chromatin | integer | 0 | 3.437768e+00 | 2.438364e+00 | 3 | 1.4826 | 1 | 10 | 0 |
| Normal_Nucleoli | integer | 0 | 2.866953e+00 | 3.053634e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Mitoses | integer | 0 | 1.589413e+00 | 1.715078e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Class | integer | 0 | 2.689557e+00 | 9.512725e-01 | 2 | 0.0000 | 2 | 4 | 0 |


We can see that all the variables are in integer format except `Bare_Nuclei`, which was read as a factor because its missing values are coded as "?". The R chunk below shows that there are 16 missing values in `Bare_Nuclei` (row numbers of the missing values: 24, 41, 140, 146, 159, 165, 236, 250, 276, 293, 295, 298, 316, 322, 412, 618). To obtain reliable estimates for the missing values, we first examine which class of breast cancer they come from.

Of the 16 missing values of `Bare_Nuclei`, only 2 came from malignant samples and all the others came from benign samples, which suggests that the imputed value for most of them should be small. As per the graph below, the most frequent value of `Bare_Nuclei` is 1 for benign samples and 10 for malignant samples.

The missing values in rows 24 and 293 came from malignant samples, so their imputed value is 10. All other missing values came from benign samples, so their imputed value is 1.

summary(mydata$Bare_Nuclei)
##   ?   1  10   2   3   4   5   6   7   8   9 
##  16 402 132  30  28  19  30   4   8  21   9
# Examining missing values of Bare_Nuclei with levels of Breast Cancer
p <- ggplot(mydata, aes(x = Bare_Nuclei, fill = Class)) + 
  geom_bar() + facet_grid(Class~.) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
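
The class-wise imputation described above can be sketched in base R. This is only an illustration under stated assumptions: the data frame `toy` and the helper `class_mode()` are hypothetical and not part of the original analysis, which works on `mydata`.

```r
# Toy data (hypothetical): "?" marks a missing Bare_Nuclei value
toy <- data.frame(
  Bare_Nuclei = c("1", "1", "?", "2", "10", "10", "?", "8"),
  Class       = c(2, 2, 2, 2, 4, 4, 4, 4),
  stringsAsFactors = FALSE
)

# Row numbers of the missing values and the class each one belongs to
missing_rows <- which(toy$Bare_Nuclei == "?")
missing_by_class <- table(toy$Class[missing_rows])

# Most frequent observed value within one class (hypothetical helper)
class_mode <- function(values) {
  counts <- table(values[values != "?"])
  names(counts)[which.max(counts)]
}
modes <- tapply(toy$Bare_Nuclei, toy$Class, class_mode)

# Replace each "?" with the mode of its own class, then convert to integer
imputed <- toy$Bare_Nuclei
imputed[missing_rows] <- modes[as.character(toy$Class[missing_rows])]
imputed <- as.integer(imputed)
```

On the real data, the same steps would be expected to reproduce the choice of 1 for benign and 10 for malignant, since those are the tallest bars in each facet of the plot.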

Even though the 9 descriptive variables are measured on an ordinal scale, the spacing between levels is taken to be the same for all levels, so it is better to treat them as quantitative variables on a discrete scale of 1 to 10 rather than as nominal variables. Only the Class variable should be on a nominal scale; as the event of interest is malignant, we assign 1 to that level and 0 to benign.

mydata$Bare_Nuclei <- factor(mydata$Bare_Nuclei,levels = c("?",1,2,3,4,5,6,7,8,9,10),
                             labels = c(1,1,2,3,4,5,6,7,8,9,10))
mydata$Bare_Nuclei <- as.integer(paste(mydata$Bare_Nuclei))
mydata$Class <- factor(mydata$Class,levels = c(2,4),labels = c(0,1))
summarizeColumns(mydata) %>% knitr::kable( caption =  'Variables Summary after Data Preprocessing')
Variables Summary after Data Preprocessing

| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| ID | integer | 0 | 1.071704e+06 | 6.170957e+05 | 1171710 | 154755.2706 | 61634 | 13454352 | 0 |
| Clump_thickness | integer | 0 | 4.417740e+00 | 2.815741e+00 | 4 | 2.9652 | 1 | 10 | 0 |
| Cell_Size | integer | 0 | 3.134478e+00 | 3.051459e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Cell_Shape | integer | 0 | 3.207439e+00 | 2.971913e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Adhesion | integer | 0 | 2.806867e+00 | 2.855379e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Epi.Cell_Size | integer | 0 | 3.216023e+00 | 2.214300e+00 | 2 | 0.0000 | 1 | 10 | 0 |
| Bare_Nuclei | integer | 0 | 3.486409e+00 | 3.621929e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Bland_Chromatin | integer | 0 | 3.437768e+00 | 2.438364e+00 | 3 | 1.4826 | 1 | 10 | 0 |
| Normal_Nucleoli | integer | 0 | 2.866953e+00 | 3.053634e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Mitoses | integer | 0 | 1.589413e+00 | 1.715078e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Class | factor | 0 | NA | 3.447783e-01 | NA | NA | 241 | 458 | 2 |

mydata[24,'Bare_Nuclei'] <- 10
mydata[293,'Bare_Nuclei'] <- 10
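
The recoding above passes the factor through `paste()` before calling `as.integer()`. The short sketch below, on a made-up factor `f` (hypothetical, for illustration only), shows why the intermediate character step matters: converting a factor directly returns its internal level codes, not the numeric labels.

```r
# A factor stores integer codes for its levels, not the labels themselves
f <- factor(c("1", "10", "2", "10"))

# Levels are sorted as text, so "10" comes before "2"
lev <- levels(f)

codes  <- as.integer(f)                # position of each label in levels(f)
values <- as.integer(as.character(f))  # the numeric labels themselves
```

Here `as.character()` plays the same role as `paste()` in the chunk above; either one turns the labels into text before the integer conversion.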

Data Exploration

In this data set, there are 458 (65.5%) benign cases and 241 (34.5%) malignant cases. Therefore, there is a considerable imbalance toward benign.
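
The class counts can be cross-checked against the first summary table, where `Class` is still coded 2 (benign) and 4 (malignant) and has mean 2.689557. A minimal sketch of that arithmetic follows; the names `class_mean`, `p_malignant` and `class_toy` are hypothetical and introduced only for this check.

```r
# Class is coded 2 (benign) and 4 (malignant); its mean in the summary is 2.689557
n <- 699
class_mean <- 2.689557

# mean = 2 + 2 * p, where p is the share of cases coded 4
p_malignant <- (class_mean - 2) / 2
n_malignant <- round(p_malignant * n)
n_benign    <- n - n_malignant

# The same shares, in percent, from a rebuilt 0/1 class vector
class_toy <- factor(rep(c(0, 1), times = c(n_benign, n_malignant)))
shares <- round(100 * prop.table(table(class_toy)), 1)
```

The result is 241 cases coded 4 and 458 coded 2, matching the level counts in the `Class` row of the second summary table.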

Univariate Visualisation

Clump Thickness

p1 <- ggplot(data=mydata,aes(x=Clump_thickness))+
      geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
      labs(title="Histogram for Clump Thickness")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Clump_thickness))+
      geom_boxplot(fill='#FFAAAA', color="black")+
      stat_summary(fun=mean, colour="red", geom="point")+  # red dot marks the class mean
      labs(title="Clump Thickness with Levels of Breast Cancer", y="Clump Thickness")
plot_grid(p1, p2, ncol = 1)

As per the histogram, Clump Thickness is slightly right-skewed. More than 75% of the values for benign (level 0) are less than 5, and more than 75% of the values for malignant (level 1) are greater than 5.
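
Statements such as "more than 75% of values are below 5" are read off the box edges, and they can also be checked numerically. The sketch below uses a small made-up data frame `toy` (hypothetical values, not the report's data) to show one way of doing this in base R.

```r
# Toy data (hypothetical): one feature graded 1-10 and a 0/1 class label
toy <- data.frame(
  Clump_thickness = c(1, 2, 3, 3, 4, 5, 5, 6, 8, 8, 10, 10),
  Class = factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1))
)

# Quartiles per class: the box edges and the median line of each boxplot
quartiles <- tapply(toy$Clump_thickness, toy$Class,
                    quantile, probs = c(0.25, 0.5, 0.75))

# Share of each class that lies strictly below / above a cut-off of 5
below_5_benign    <- mean(toy$Clump_thickness[toy$Class == 0] < 5)
above_5_malignant <- mean(toy$Clump_thickness[toy$Class == 1] > 5)
```

Replacing `toy` with `mydata` would give the corresponding figures for the real data, and the same pattern applies to the other eight features.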

Uniformity of Cell Size

p1 <- ggplot(data=mydata,aes(x=Cell_Size))+
      geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
      labs(title="Histogram for Uniformity of Cell Size")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Cell_Size))+
      geom_boxplot(fill='#FFAAAA', color="black")+
      stat_summary(fun=mean, colour="red", geom="point")+
      labs(title="Cell Size with Levels of Breast Cancer", y="Cell Size")
plot_grid(p1, p2, ncol = 1)

Over 50% of the Cell Size values are 1, and there are many outliers to the right, making the distribution strongly right-skewed. Uniformity of Cell Size is also strongly related to the class of breast cancer: more than 75% of the values for benign are very small, while many values for malignant are above 5.

Uniformity of Cell Shape

p1 <- ggplot(data=mydata,aes(x=Cell_Shape))+
      geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
      labs(title="Histogram for Uniformity of Cell Shape")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Cell_Shape))+
      geom_boxplot(fill='#FFAAAA', color="black")+
      stat_summary(fun=mean, colour="red", geom="point")+
      labs(title="Cell Shape with Levels of Breast Cancer", y="Cell Shape")
plot_grid(p1, p2, ncol = 1)

The distribution of Uniformity of Cell Shape is also positively skewed, with most values around 1. As with Cell Size, Cell Shape is strongly related to the class of breast cancer: many values for benign are very small, while many values for malignant are quite large.

Marginal Adhesion

p1 <- ggplot(data=mydata,aes(x=Adhesion))+
      geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
      labs(title="Histogram for Marginal Adhesion")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Adhesion))+
      geom_boxplot(fill='#FFAAAA', color="black")+
      stat_summary(fun=mean, colour="red", geom="point")+
      labs(title="Marginal Adhesion with Levels of Breast Cancer", y="Adhesion")
plot_grid(p1, p2, ncol = 1)

Both the histogram and the boxplots for Marginal Adhesion are quite similar to those for Uniformity of Cell Shape and Uniformity of Cell Size.

Single Epithelial Cell Size

p1 <- ggplot(data=mydata,aes(x=Epi.Cell_Size))+
      geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
      labs(title="Histogram for Single Epithelial Cell Size")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Epi.Cell_Size))+
      geom_boxplot(fill='#FFAAAA', color="black")+
      stat_summary(fun=mean, colour="red", geom="point")+
      labs(title="Epithelial Cell Size with Levels of Breast Cancer", y="Epi.Cell_Size")
plot_grid(p1, p2, ncol = 1)

This histogram is quite different from the other histograms we have observed so far, as the mode is located in the second bar (mode = 2). As seen earlier for the other features, more than 75% of the values for benign are less than 2.5 and more than 75% of the values for malignant are greater than 2.5.

Bare Nuclei

p1 <- ggplot(data=mydata,aes(x=Bare_Nuclei))+
      geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
      labs(title="Histogram for Bare Nuclei")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Bare_Nuclei))+
      geom_boxplot(fill='#FFAAAA', color="black")+
      stat_summary(fun=mean, colour="red", geom="point")+
      labs(title="Bare Nuclei with Levels of Breast Cancer", y="Bare Nuclei")
plot_grid(p1, p2, ncol = 1)

The histogram for Bare Nuclei shows that most values are concentrated at 1 (around 400) and 10 (around 100). The class-wise boxplots show that more than 75% of the benign values are around 1, while the median for malignant is 10.

Blandness of nuclear chromatin

p1 <- ggplot(data=mydata,aes(x=Bland_Chromatin))+
      geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
      labs(title="Histogram for Blandness of nuclear chromatin")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Bland_Chromatin))+
      geom_boxplot(fill='#FFAAAA', color="black")+
      stat_summary(fun=mean, colour="red", geom="point")+
      labs(title="Bland_Chromatin with Levels of Breast Cancer", y="Bland_Chromatin")
plot_grid(p1, p2, ncol = 1)

As per the histogram, around 450 values of Blandness of Nuclear Chromatin are spread roughly equally over 1, 2 and 3; beyond that the histogram tails off toward the end, with a small bump at 7. As before, the variable takes mostly small values for benign, but here the benign boxplot is not collapsed onto a single value: its whole box and both whiskers are visible.

Normal nucleoli

p1 <- ggplot(data=mydata,aes(x=Normal_Nucleoli))+
      geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
      labs(title="Histogram for Normal Nucleoli")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Normal_Nucleoli))+
      geom_boxplot(fill='#FFAAAA', color="black")+
      stat_summary(fun=mean, colour="red", geom="point")+
      labs(title="Normal Nucleoli with Levels of Breast Cancer", y="Normal Nucleoli")
plot_grid(p1, p2, ncol = 1)

The histogram for Normal Nucleoli shows that around 60% of the values are concentrated at 1, while the other values are spread fairly evenly over 2 to 10. Normal Nucleoli is very small (around 1) for more than 75% of the benign category, whereas the values for malignant are comparatively high, as with the other features.

Infrequent mitoses

p1 <- ggplot(data=mydata,aes(x=Mitoses))+
      geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
      labs(title="Histogram for Infrequent Mitoses")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Mitoses))+
      geom_boxplot(fill='#FFAAAA', color="black")+
      stat_summary(fun=mean, colour="red", geom="point")+
      labs(title="Infrequent Mitoses with Levels of Breast Cancer", y="Mitoses")
plot_grid(p1, p2, ncol = 1)

As per the histogram of Infrequent Mitoses, about 85% of the values are 1. Unlike the previous boxplots, the class-wise boxplots show no big difference in values. That is, the class of breast cancer does not depend much on Infrequent Mitoses.

Multivariate Visualisation

The scatter plot matrix below for the 9 descriptive variables is used to detect any multicollinearity between variables.

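# Upper-panel function: scatter plot coloured by class, with Pearson's r printed on top
upper.panel <- function(x, y){
  # Class is a factor with levels 0/1, so its integer codes 1/2 pick black/red
  points(x, y, pch=19, col=c("black", "red")[mydata$Class])
  r <- round(cor(x, y), digits=2)
  txt <- paste0("R = ", r)
  # Switch to 0-1 coordinates to place the label; restore the old ones on exit
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.9, txt)
}
# Columns 2 to 10 hold the 9 descriptive variables
pairs(mydata[,2:10], lower.panel = NULL, 
      upper.panel = upper.panel)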

As per the scatter plot matrix, Uniformity of Cell Size and Uniformity of Cell Shape show a very high correlation (r = 0.91), so these two variables appear to be collinear.
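
Highly correlated pairs can also be listed directly from the correlation matrix instead of being read off the panels. The following is a minimal sketch under stated assumptions: the data frame `toy` holds made-up values, and the cut-off of 0.9 is an illustrative choice, not a threshold taken from this report.

```r
# Toy data (hypothetical): three graded features, two of them nearly identical
toy <- data.frame(
  Cell_Size  = c(1, 1, 2, 4, 8, 10, 3, 1),
  Cell_Shape = c(1, 2, 2, 4, 8, 10, 4, 1),
  Mitoses    = c(1, 1, 1, 2, 1, 1, 1, 5)
)

# Pairwise Pearson correlations, the same statistic printed in each panel
r <- cor(toy)

# Keep each pair once (upper triangle) and flag those above the cut-off
cutoff <- 0.9   # assumption: illustrative threshold only
idx <- which(upper.tri(r) & abs(r) > cutoff, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = round(r[idx], 2)
)
```

Applied to `mydata[, 2:10]`, the same steps would list every pair whose correlation exceeds the chosen cut-off.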

Summary

In this project our aim is to predict the probability of breast cancer based on 9 descriptive features. As per the univariate exploration above, all features except Infrequent Mitoses look relevant for deciding whether a sample is benign or malignant. Two features (Uniformity of Cell Size and Uniformity of Cell Shape) appear to be strongly collinear, and this issue can be handled in the modelling section in Phase II.