The objective of this project is to build a predictive model that forecasts the likelihood of breast cancer when one, several or all of 9 descriptive factors (described later) and their interactions are present. The project consists of two phases. Phase I, covered in this report, focuses on data preprocessing and exploration of the descriptive factors. Model building, hypothesis testing, validation of model assumptions and confidence intervals for the model parameters will be addressed in Phase II. The rest of this report is organised as follows. Section 2 describes the data set. Section 3 covers data pre-processing. In Section 4, we explore the factors and their inter-relationships. The last section ends with a summary.
This analysis is based on a dataset extracted from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)). The dataset provides cytological characteristics of a breast mass for 699 patients, computed from a digitized image of a fine needle aspirate (FNA). It contains 9 descriptive factors reported to differ between benign and malignant samples (the binary response we are interested in). Each factor is graded from 1 to 10, with 1 being the closest to benign and 10 the most anaplastic (malignant). The details of the descriptive factors and the binary response are given below.
In this phase of the project the following R packages were used.
library(readr)
library(dplyr)
library(knitr)
library(cowplot)
library(ggplot2)
library(mlr)
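The report does not show how `mydata` was read in. The following is a minimal sketch, assuming a headerless comma-separated file and base R's `read.csv`; it is not necessarily the author's method. The first three records are copied from the `str()` output in this report, while the fourth record is made up to show how a "?" entry affects the column type.

```r
# Column names as they appear in the str() output of this report.
col_names <- c("ID", "Clump_thickness", "Cell_Size", "Cell_Shape", "Adhesion",
               "Epi.Cell_Size", "Bare_Nuclei", "Bland_Chromatin",
               "Normal_Nucleoli", "Mitoses", "Class")

# First three records taken from the str() output; the fourth is a made-up
# record added only to show how a "?" entry behaves.
raw_text <- "1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
9999999,8,4,5,1,2,?,7,3,1,4"

# Without na.strings, the "?" forces Bare_Nuclei to be read as text
# (a factor in R < 4.0, a character vector in R >= 4.0).
as_text <- read.csv(text = raw_text, header = FALSE, col.names = col_names)

# With na.strings = "?", the column is read as integer with a proper NA.
as_na <- read.csv(text = raw_text, header = FALSE, col.names = col_names,
                  na.strings = "?")

str(as_text$Bare_Nuclei)
str(as_na$Bare_Nuclei)
```

Reading "?" as `NA` at load time would make the missing values visible in the `na` column of a column summary instead of hiding them inside a factor level.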
To make sure the data are free from missing values, errors and other anomalies and are stored with proper data types, the following column-wise summaries are used.
str(mydata)
## 'data.frame': 699 obs. of 11 variables:
## $ ID : int 1000025 1002945 1015425 1016277 1017023 1017122 1018099 1018561 1033078 1033078 ...
## $ Clump_thickness: int 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell_Size : int 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell_Shape : int 1 4 1 8 1 10 1 2 1 1 ...
## $ Adhesion : int 1 5 1 1 3 8 1 1 1 1 ...
## $ Epi.Cell_Size : int 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare_Nuclei : Factor w/ 11 levels "?","1","10","2",..: 2 3 4 6 2 3 3 2 2 2 ...
## $ Bland_Chromatin: int 3 3 3 3 3 9 3 3 1 2 ...
## $ Normal_Nucleoli: int 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : int 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : int 2 2 2 2 2 4 2 2 2 2 ...
summarizeColumns(mydata) %>% knitr::kable( caption = 'Variables Summary before Data Preprocessing')
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| ID | integer | 0 | 1.071704e+06 | 6.170957e+05 | 1171710 | 154755.2706 | 61634 | 13454352 | 0 |
| Clump_thickness | integer | 0 | 4.417740e+00 | 2.815741e+00 | 4 | 2.9652 | 1 | 10 | 0 |
| Cell_Size | integer | 0 | 3.134478e+00 | 3.051459e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Cell_Shape | integer | 0 | 3.207439e+00 | 2.971913e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Adhesion | integer | 0 | 2.806867e+00 | 2.855379e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Epi.Cell_Size | integer | 0 | 3.216023e+00 | 2.214300e+00 | 2 | 0.0000 | 1 | 10 | 0 |
| Bare_Nuclei | factor | 0 | NA | 4.248927e-01 | NA | NA | 4 | 402 | 11 |
| Bland_Chromatin | integer | 0 | 3.437768e+00 | 2.438364e+00 | 3 | 1.4826 | 1 | 10 | 0 |
| Normal_Nucleoli | integer | 0 | 2.866953e+00 | 3.053634e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Mitoses | integer | 0 | 1.589413e+00 | 1.715078e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Class | integer | 0 | 2.689557e+00 | 9.512725e-01 | 2 | 0.0000 | 2 | 4 | 0 |

We can see that all the variables are in integer format except `Bare_Nuclei`, which is stored as a factor. Although the summary reports no `NA` values, `Bare_Nuclei` contains missing values coded as "?". The R chunk below shows that there are 16 missing values in `Bare_Nuclei` (row numbers: 24, 41, 140, 146, 159, 165, 236, 250, 276, 293, 295, 298, 316, 322, 412, 618). To obtain reliable estimates for the missing values, we first examine which class of breast cancer they come from.
Of the 16 missing values of `Bare_Nuclei`, only 2 come from malignant cases and all the others come from benign cases, which suggests that the imputed value for most of them should be small. As the graph below shows, the most frequent value is 1 for benign samples and 10 for malignant samples.
The missing values in rows 24 and 293 come from malignant cases, so their imputed value is 10. All other missing values come from benign cases and are imputed as 1.
summary(mydata$Bare_Nuclei)
## ? 1 10 2 3 4 5 6 7 8 9
## 16 402 132 30 28 19 30 4 8 21 9
# Examining missing values of Bare_Nuclei with levels of Breast Cancer
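# Class is still an integer here, so convert it to a factor for a discrete fill legend
p <- ggplot(mydata, aes(x = Bare_Nuclei, fill = factor(Class))) +
geom_bar() + facet_grid(Class~.) +
labs(fill = "Class") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))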
p
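The same check can also be done numerically. The sketch below cross-tabulates missing entries against the class on a small made-up data frame (`toy` is hypothetical and not part of the report's data), using the report's coding of "?" for missing values and 2/4 for the class.

```r
# Toy data (made up for illustration): Bare_Nuclei as text with "?" for
# missing entries, Class coded 2 = benign and 4 = malignant as in the report.
toy <- data.frame(
  Bare_Nuclei = c("1", "?", "10", "?", "1", "10", "?", "2"),
  Class       = c(2, 2, 4, 4, 2, 4, 2, 2)
)

# Count missing vs. observed entries within each class.
missing_by_class <- table(Missing = toy$Bare_Nuclei == "?", Class = toy$Class)
print(missing_by_class)

# Row numbers of the missing entries, analogous to the list given in the text.
missing_rows <- which(toy$Bare_Nuclei == "?")
print(missing_rows)
```

Applied to `mydata`, the `TRUE` row of such a table would give the number of missing values per class directly.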
Although the 9 descriptive variables are measured on an ordinal scale, the spacing between adjacent levels is taken to be the same for all levels, so it is better to treat them as quantitative variables on a discrete scale of 1 to 10 rather than as nominal variables. Only the Class variable should be on a nominal scale; as the event of interest is malignant, we assign 1 to that level and 0 to benign.
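# Recode "?" as 1 (the most frequent benign value) and keep the other levels as they are
mydata$Bare_Nuclei <- factor(mydata$Bare_Nuclei,levels = c("?",1,2,3,4,5,6,7,8,9,10),
labels = c(1,1,2,3,4,5,6,7,8,9,10))
# Convert via character so that the labels, not the internal factor codes, become integers
mydata$Bare_Nuclei <- as.integer(as.character(mydata$Bare_Nuclei))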
mydata$Class <- factor(mydata$Class,levels = c(2,4),labels = c(0,1))
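Converting a factor of digit labels to integers has to go through the labels as text, which is why the conversion above does not call `as.integer` on the factor directly. A small sketch of the difference, using a made-up factor `f` whose levels sort as text in the same way as `Bare_Nuclei` in the `str()` output:

```r
# A factor whose levels are sorted as text: "1", "10", "2", "4".
f <- factor(c("1", "10", "2", "4", "1"))

codes  <- as.integer(f)                 # internal level codes: 1 2 3 4 1
values <- as.integer(as.character(f))   # recorded values:      1 10 2 4 1

print(codes)
print(values)
```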
summarizeColumns(mydata) %>% knitr::kable( caption = 'Variables Summary after Data Preprocessing')
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| ID | integer | 0 | 1.071704e+06 | 6.170957e+05 | 1171710 | 154755.2706 | 61634 | 13454352 | 0 |
| Clump_thickness | integer | 0 | 4.417740e+00 | 2.815741e+00 | 4 | 2.9652 | 1 | 10 | 0 |
| Cell_Size | integer | 0 | 3.134478e+00 | 3.051459e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Cell_Shape | integer | 0 | 3.207439e+00 | 2.971913e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Adhesion | integer | 0 | 2.806867e+00 | 2.855379e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Epi.Cell_Size | integer | 0 | 3.216023e+00 | 2.214300e+00 | 2 | 0.0000 | 1 | 10 | 0 |
| Bare_Nuclei | integer | 0 | 3.486409e+00 | 3.621929e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Bland_Chromatin | integer | 0 | 3.437768e+00 | 2.438364e+00 | 3 | 1.4826 | 1 | 10 | 0 |
| Normal_Nucleoli | integer | 0 | 2.866953e+00 | 3.053634e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Mitoses | integer | 0 | 1.589413e+00 | 1.715078e+00 | 1 | 0.0000 | 1 | 10 | 0 |
| Class | factor | 0 | NA | 3.447783e-01 | NA | NA | 241 | 458 | 2 |
# Rows 24 and 293 are malignant cases, so their imputed value is set to 10 instead of 1
mydata[24,'Bare_Nuclei'] <- 10L
mydata[293,'Bare_Nuclei'] <- 10L
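The imputation above hard-codes the value 1 and two row numbers. As an alternative sketch of the same idea, the missing values can be replaced by the most frequent value within each class. This assumes missing entries are stored as `NA`; the helper `mode_value` and the data frame `toy` are hypothetical and made up for illustration.

```r
# Most frequent non-missing value of an integer vector.
mode_value <- function(x) {
  counts <- table(x)                       # table() drops NA by default
  as.integer(names(counts)[which.max(counts)])
}

# Toy data (made up): NA marks a missing Bare_Nuclei value; Class is 0/1.
toy <- data.frame(
  Bare_Nuclei = c(1L, 1L, NA, 2L, 10L, 10L, NA, 8L),
  Class       = factor(c(0, 0, 0, 0, 1, 1, 1, 1))
)

# Mode of Bare_Nuclei within each class, named by class label.
class_modes <- tapply(toy$Bare_Nuclei, toy$Class, mode_value)

# Replace each missing value by the mode of its own class.
is_missing <- is.na(toy$Bare_Nuclei)
toy$Bare_Nuclei[is_missing] <- class_modes[as.character(toy$Class[is_missing])]

print(toy)
```

Looking the mode up by class label avoids depending on row positions, which change if the data are filtered or reordered.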
In this data set, there are 458 (65.5%) benign cases and 241 (34.5%) malignant cases. Therefore, there is a considerable imbalance toward benign.
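The class shares can be cross-checked from figures already reported in the summary tables. Before recoding, `Class` takes only the codes 2 and 4, which the preprocessing code maps to 0 (benign) and 1 (malignant), so its mean determines the share of code 4. A short arithmetic sketch, with variable names chosen here for illustration:

```r
n_total    <- 699        # number of records, from str(mydata)
class_mean <- 2.689557   # mean of Class before recoding, from the first summary table

# With only the codes 2 and 4, mean = 2 + 2 * (share of code 4).
share_code_4 <- (class_mean - 2) / 2
n_code_4     <- round(share_code_4 * n_total)   # 241 records with code 4 (malignant)
n_code_2     <- n_total - n_code_4              # 458 records with code 2 (benign)

round(100 * c(code_2 = n_code_2, code_4 = n_code_4) / n_total, 1)
```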
p1 <- ggplot(data=mydata,aes(x=Clump_thickness))+
geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
labs(title="Histogram for Clump Thickness")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Clump_thickness))+
geom_boxplot(fill='#FFAAAA', color="black")+
stat_summary(fun.y=mean, colour="red", geom="point")+
labs(title="Clump Thickness with Levels of Breast Cancer", y="Clump Thickness")
plot_grid(p1, p2, ncol = 1)
As per the histogram, Clump Thickness is slightly right skewed. In the boxplots, more than 75% of the values for benign (level 0) are less than 5 and more than 75% of the values for malignant (level 1) are greater than 5.
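Statements about quartiles read off a boxplot can be checked numerically. The sketch below computes class-wise quartiles and the share of values at or below a threshold on made-up data; `toy` and its values are hypothetical and do not reproduce the report's results.

```r
# Toy data (made up): a 1-10 score and a 0/1 class label.
toy <- data.frame(
  Clump_thickness = c(1, 2, 3, 3, 4, 5, 6, 8, 8, 9, 10, 10),
  Class           = factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1))
)

# Quartiles of the score within each class: one row per class.
quartiles <- t(sapply(split(toy$Clump_thickness, toy$Class),
                      quantile, probs = c(0.25, 0.50, 0.75)))
print(quartiles)

# Share of each class at or below a threshold of 5.
share_le_5 <- tapply(toy$Clump_thickness <= 5, toy$Class, mean)
print(share_le_5)
```

The same two calls could be applied to any of the nine variables in `mydata` to back up the percentages quoted alongside each pair of plots.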
p1 <- ggplot(data=mydata,aes(x=Cell_Size))+
geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
labs(title="Histogram for Uniformity of Cell Size")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Cell_Size))+
geom_boxplot(fill='#FFAAAA', color="black")+
stat_summary(fun.y=mean, colour="red", geom="point")+
labs(title="Cell Size with Levels of Breast Cancer", y="Cell Size")
plot_grid(p1, p2, ncol = 1)
Over 50% of the Cell Size values are around 1, and there are many outliers toward the right, making the distribution strongly right skewed.
Uniformity of Cell Size is also highly related to the levels of breast cancer, as more than 75% of the values for benign are very small while many values for malignant are above 5.
p1 <- ggplot(data=mydata,aes(x=Cell_Shape))+
geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
labs(title="Histogram for Uniformity of Cell Shape")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Cell_Shape))+
geom_boxplot(fill='#FFAAAA', color="black")+
stat_summary(fun.y=mean, colour="red", geom="point")+
labs(title="Cell Shape with Levels of Breast Cancer", y="Cell Shape")
plot_grid(p1, p2, ncol = 1)
The distribution of Uniformity of Cell Shape is also positively skewed, with most values around 1. As noted earlier for Cell Size, Cell Shape is also highly related to the levels of breast cancer, as many values for benign are very small while many values for malignant are quite large.
p1 <- ggplot(data=mydata,aes(x=Adhesion))+
geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
labs(title="Histogram for Marginal Adhesion")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Adhesion))+
geom_boxplot(fill='#FFAAAA', color="black")+
stat_summary(fun.y=mean, colour="red", geom="point")+
labs(title="Marginal Adhesion with Levels of Breast Cancer", y="Adhesion")
plot_grid(p1, p2, ncol = 1)
Both the histogram and the boxplots for Marginal Adhesion are quite similar to those for Uniformity of Cell Shape and Uniformity of Cell Size.
p1 <- ggplot(data=mydata,aes(x=Epi.Cell_Size))+
geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
labs(title="Histogram for Single Epithelial Cell Size")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Epi.Cell_Size))+
geom_boxplot(fill='#FFAAAA', color="black")+
stat_summary(fun.y=mean, colour="red", geom="point")+
labs(title="Epithelial Cell Size with Levels of Breast Cancer", y="Epi.Cell_Size")
plot_grid(p1, p2, ncol = 1)
This histogram is quite different from the other histograms we have observed so far, as the mode is located in the second bar (mode = 2). As seen earlier, more than 75% of the values for benign are less than 2.5 and more than 75% of the values for malignant are greater than 2.5.
p1 <- ggplot(data=mydata,aes(x=Bare_Nuclei))+
geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
labs(title="Histogram for Bare Nuclei")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Bare_Nuclei))+
geom_boxplot(fill='#FFAAAA', color="black")+
stat_summary(fun.y=mean, colour="red", geom="point")+
labs(title="Bare Nuclei with Levels of Breast Cancer", y="Bare Nuclei")
plot_grid(p1, p2, ncol = 1)
The histogram for Bare Nuclei shows that many values are concentrated at 1 (around 400) and 10 (around 100). The level-wise boxplots show that more than 75% of the benign values are around 1 and that the median for malignant is 10.
p1 <- ggplot(data=mydata,aes(x=Bland_Chromatin))+
geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
labs(title="Histogram for Blandness of nuclear chromatin")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Bland_Chromatin))+
geom_boxplot(fill='#FFAAAA', color="black")+
stat_summary(fun.y=mean, colour="red", geom="point")+
labs(title="Bland_Chromatin with Levels of Breast Cancer", y="Bland_Chromatin")
plot_grid(p1, p2, ncol = 1)
As per the histogram, around 450 values of Blandness of Nuclear Chromatin are spread roughly equally over 1, 2 and 3; afterwards the histogram tails off toward the right, with a bump at 7. As before, the variable takes many smaller values for benign, but the boxplot for benign looks quite regular, as its whole box and both whiskers are visible.
p1 <- ggplot(data=mydata,aes(x=Normal_Nucleoli))+
geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
labs(title="Histogram for Normal Nucleoli")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Normal_Nucleoli))+
geom_boxplot(fill='#FFAAAA', color="black")+
stat_summary(fun.y=mean, colour="red", geom="point")+
labs(title="Normal Nucleoli with Levels of Breast Cancer", y="Normal Nucleoli")
plot_grid(p1, p2, ncol = 1)
The histogram for Normal Nucleoli shows that around 60% of the values are concentrated at 1 and the other values are spread fairly evenly over 2-10. Normal Nucleoli is very small (around 1) for more than 75% of the benign cases, while the values for malignant are comparatively high, as for the previous variables.
p1 <- ggplot(data=mydata,aes(x=Mitoses))+
geom_histogram(bins = 10,colour="black",fill="#80DFFF")+
labs(title="Histogram for Infrequent Mitoses")+xlab("")
p2 <- ggplot(data = mydata, aes(x = Class, y = Mitoses))+
geom_boxplot(fill='#FFAAAA', color="black")+
stat_summary(fun.y=mean, colour="red", geom="point")+
labs(title="Infrequent Mitoses with Levels of Breast Cancer", y="Mitoses")
plot_grid(p1, p2, ncol = 1)
As per the histogram of Infrequent Mitoses, about 85% of the values are 1. In contrast to the previous boxplots, there is no big difference between the level-wise boxplots. That is, the level of breast cancer does not depend much on Infrequent Mitoses.
The scatter plot matrix below for the 9 descriptive variables is used to detect any multicollinearity between variables.
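# Custom upper panel: scatter plot coloured by class plus the Pearson correlation
upper.panel<-function(x, y){
points(x,y, pch=19, col=c("black", "red")[mydata$Class])
r <- round(cor(x, y), digits=2)
txt <- paste0("R = ", r)
# Switch to unit coordinates for the label and restore the original ones on exit
usr <- par("usr"); on.exit(par(usr = usr))
par(usr = c(0, 1, 0, 1))
text(0.5, 0.9, txt)
}
# Columns 2 to 10 hold the 9 descriptive variables
pairs(mydata[,2:10], lower.panel = NULL,
upper.panel = upper.panel)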
As per the scatter plot matrix, Uniformity of Cell Size and Uniformity of Cell Shape show a very high correlation (r = 0.91), and these two variables appear to be collinear.
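Highly correlated pairs can also be listed programmatically instead of being read off the plot. The following is a minimal sketch on made-up data; the data frame `toy`, its column names and the cutoff of 0.9 are illustrative assumptions, not values from the report.

```r
# Toy data (made up): x2 is nearly a copy of x1, x3 is only weakly related.
toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  x2 = c(1, 2, 3, 5, 5, 6, 7, 8, 9, 10),
  x3 = c(5, 1, 9, 2, 8, 3, 10, 4, 6, 7)
)

r <- cor(toy)                      # Pearson correlation matrix

# Keep each pair once (upper triangle) and flag |r| above a chosen cutoff.
cutoff <- 0.9                      # illustrative threshold, not from the report
high   <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)

pairs_found <- data.frame(var1 = rownames(r)[high[, "row"]],
                          var2 = colnames(r)[high[, "col"]],
                          r    = round(r[high], 2))
print(pairs_found)
```

Running the same steps on `mydata[, 2:10]` would list every pair of descriptive variables whose correlation exceeds the chosen cutoff.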
In this project our aim is to predict the probability of having breast cancer based on 9 descriptive features. As per the above univariate exploration, all features except Infrequent Mitoses look relevant for deciding whether a breast mass is benign or malignant. Two features (Uniformity of Cell Size and Uniformity of Cell Shape) appear to be strongly collinear, and this issue can be handled in the modeling section in Phase II.