This document is intended to be a concise report to explain a few takeaways of a dataset containing information about breast cancer (available here) obtained at UCI Machine Learning Repository on this link. The analysis was created as part of the Data Science Certificate in the class Methods for Data Analysis at University of Washington.
The idea is to perform an exploratory analysis of the information contained in the dataset, figuring out ways of making the dataset tidier. The ultimate objective is to, in the end, build and compare models to predict if a given tumor is benign or malignant (breast cancer) using the information available on this dataset. Some functions created for this purpose are included in the appendix.
The analysis show that, with a Random Forest model, we can predict if a given tumor is malignant with 97.86% of Accuracy. This result is 1.96% higher than the Accuracy of 95.90% reported in the UCI Machine Learning as the highest for this dataset. We also conclude that the most important information for this prediction is the ‘uniformity of the cell size’.
First we load the functions created (loading omitted from the report, loaded behind the scenes but included in the appendix) for this report. Then the data is loaded using the function read.breast.cancer.data, that also puts the values ? as NAs, name the columns appropriately, give real names for the Class column; and removes the ID proxy column.
cancer <- read.breast.cancer.data(".") # function is included in the appendix
summary(cancer)
## Clump.Thickness Uniformity.of.Cell.Size Uniformity.of.Cell.Shape Marginal.Adhesion
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 4.000 Median : 1.000 Median : 1.000 Median : 1.000
## Mean : 4.418 Mean : 3.134 Mean : 3.207 Mean : 2.807
## 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 4.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
##
## Single.Epithelial.Cell.Size Bare.Nuclei Bland.Chromatin Normal.Nucleoli Mitoses
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 2.000 Median : 1.000 Median : 3.000 Median : 1.000 Median : 1.000
## Mean : 3.216 Mean : 3.545 Mean : 3.438 Mean : 2.867 Mean : 1.589
## 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.000 3rd Qu.: 1.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
## NA's :16
## Class
## Benign :458
## Malignant:241
##
##
##
##
##
We can see that the column Bare.Nuclei have 16 NA values, and we have only numerical features in the dataset. Also we now define our baseline accuray as 458/(458+241), that is, 65.52% – as this is our metric if we guess every instance as the majority class (Benign).
To deal with the NA values of the column Bare.Nuclei, we use the method called MICE (Multiple Imputation by Chained Equations), from the mice package, to impute the NA values with the ones most suited considering all 9 columns of featuers.
cancer <- impute.missing.data(cancer, cols.to.consider=c(1:9)) # function is included in the appendix
summary(cancer)
## Clump.Thickness Uniformity.of.Cell.Size Uniformity.of.Cell.Shape Marginal.Adhesion
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 4.000 Median : 1.000 Median : 1.000 Median : 1.000
## Mean : 4.418 Mean : 3.134 Mean : 3.207 Mean : 2.807
## 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 4.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
## Single.Epithelial.Cell.Size Bare.Nuclei Bland.Chromatin Normal.Nucleoli Mitoses
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 2.000 Median : 1.000 Median : 3.000 Median : 1.000 Median : 1.000
## Mean : 3.216 Mean : 3.525 Mean : 3.438 Mean : 2.867 Mean : 1.589
## 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.000 3rd Qu.: 1.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
## Class
## Benign :458
## Malignant:241
##
##
##
##
From this last summary of the data we can see that the column Bare.Nuclei has no more NA values. Also if we compare this summary of that column with the previous summary (before imputation), we see that the stats are basically the same.
To start exploring our data and later also be able to create models correctly, we need to separate our data into train and test data. This is done to simulate a real world dataset (test) that have class information and is not used in anyway during the analysis (instead we use train). This ensures that our test dataset is really simulating real world data, since it has not been seen during exploration or modeling.
For this purpose we use the R package caTools, as displayed below.
library(caTools)
set.seed(1000)
split = sample.split(cancer$Class, SplitRatio = 0.80)
train = subset(cancer, split==TRUE)
test = subset(cancer, split==FALSE)
Separating 80% of the dataset for training, we have 559 instances, thus remaining 140 for testing.
To start with an initial general exploration of the whole dataset, we plot boxplots of the features.
library(ggplot2)
ggplot(stack(train[,1:9]), aes(x = ind, y = values)) + geom_boxplot() +
theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust=1)) +
labs(title = "Boxplots of columns") + labs(x = "", y = "Values") +
scale_y_continuous(breaks = seq(1, 10, by = 1))
We can notice that several columns have outliers, with the column Mitoses being the most critical. In the next subsections we aim to explore and adjust the most important of such columns, starting with Mitoses.
First, to have a good look at the column Mitoses, we observe its values along side the Class column.
table(train$Mitoses, train$Class)
##
## Benign Malignant
## 1 356 111
## 2 6 21
## 3 1 23
## 4 0 11
## 5 1 3
## 6 0 2
## 7 1 6
## 8 1 6
## 10 0 10
We can see that starting from Mitoses of 2, the proportion of Benign and Malignant tumors seems to remain close. Therefore we can group these values to make the information simpler, helping reduce chances of overfitting or influence of outliers in the extremes, thus making future models better for generalization.
train <- adjust.with.discretization(train, c(1), c(9)) # function is included in the appendix
## Correlation with Class BEFORE: 0.407756
## ---------------------------------------
## - AFTER adjustment:
## COLUMN: Mitoses
## Benign Malignant
## 1 356 111
## 2 10 82
## ---------------------------------------
## Correlation with Class AFTER: 0.5097491
From the results above, wee that the correlation of the column with the class improved considerably after adjustment, therefore confirming the validity of the changes made.
To more formally validate these adjustments, we can also perform a hypothesis test.
hyp.test.col(train[,c(9,10)]) # function is included in the appendix
## Null Hyp.: for column ' Mitoses ', chance of malignant tumor for value 2 is less or equal than 1
## Alt. Hyp.: for column ' Mitoses ', chance of malignant tumor for value 2 is greater than 1
##
## Welch Two Sample t-test
##
## data: data.to.test$Class[data.to.test[, 1] == value1] and data.to.test$Class[data.to.test[, 1] == value2]
## t = 17.144, df = 165.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.5905549 Inf
## sample estimates:
## mean of x mean of y
## 0.8913043 0.2376874
With a very small p-value, we can reject the null hypothesis, confirming our finding that the grouping created for the column Mitoses has a real impact in the chance of the tumor being Malignant. Specifically, we see that the chance of malignant tumor for values of Mitoses equal to 2 (or more) is greater than the chance for the value of 1.
First, to have a good look at the column Single.Epithelial.Cell.Size, we observe its values along side the Class column.
table(train$Single.Epithelial.Cell.Size, train$Class)
##
## Benign Malignant
## 1 37 1
## 2 290 18
## 3 23 32
## 4 5 33
## 5 4 26
## 6 1 32
## 7 3 9
## 8 2 16
## 9 0 2
## 10 1 24
We can see that starting from Single.Epithelial.Cell.Size of 4, the proportion of Benign and Malignant tumors seems to remain close, and also values of 1 and 2 seem to have the same proportion. Therefore we can group these values to make the information simpler, helping reduce chances of overfitting or influence of outliers in the extremes, thus making future models better for generalization.
train <- adjust.with.discretization(train, c(2,3), c(5)) # function is included in the appendix
## Correlation with Class BEFORE: 0.6857513
## ---------------------------------------
## - AFTER adjustment:
## COLUMN: Single.Epithelial.Cell.Size
## Benign Malignant
## 2 327 19
## 3 23 32
## 4 16 142
## ---------------------------------------
## Correlation with Class AFTER: 0.7962077
From the results above, wee that the correlation of the column with the class improved considerably after adjustment, therefore confirming the validity of the changes made.
To more formally validate these adjustments, we can also perform hypothesis tests.
hyp.test.col(train[,c(5,10)], formula(Class ~ Single.Epithelial.Cell.Size)) # function is included in the appendix
## ANOVA
## Df Sum Sq Mean Sq F value Pr(>F)
## Single.Epithelial.Cell.Size 2 80.65 40.32 490.4 <2e-16 ***
## Residuals 556 45.72 0.08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ----------------------------------------------------------------
##
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = formula.to.anova, data = data.to.anova)
##
## $Single.Epithelial.Cell.Size
## diff lwr upr p adj
## 3-2 0.5269049 0.4290847 0.6247251 0
## 4-2 0.8438209 0.7791179 0.9085239 0
## 4-3 0.3169160 0.2114152 0.4224168 0
The first test above is the standard ANOVA, showing a very small p-value, which indicates that must exist some difference in the chance of the tumor being Malignant for different values of Single.Epithelial.Cell.Size. Then to find out which values have differences we also performed the Tukey’s Range Test above, that shows a significant positive value for the difference in the chance of the tumor being Malignant for all the groups of values of Single.Epithelial.Cell.Size that we created.
Therefore, this confirms our finding that the groups created for the column Single.Epithelial.Cell.Size have a real impact in the chance of the tumor being Malignant.
First, to have a good look at the column Normal.Nucleoli, we observe its values along side the Class column.
table(train$Normal.Nucleoli, train$Class)
##
## Benign Malignant
## 1 324 33
## 2 22 5
## 3 9 25
## 4 1 16
## 5 1 16
## 6 4 16
## 7 1 13
## 8 3 17
## 9 1 13
## 10 0 39
We can see that starting from Normal.Nucleoli of 4, the proportion of Benign and Malignant tumors seems to remain close, and also values of 1 and 2 seem to have the same proportion. Therefore we can group these values to make the information simpler, helping reduce chances of overfitting or influence of outliers in the extremes, thus making future models better for generalization.
train <- adjust.with.discretization(train, c(2,3), c(8)) # function is included in the appendix
## Correlation with Class BEFORE: 0.7080206
## ---------------------------------------
## - AFTER adjustment:
## COLUMN: Normal.Nucleoli
## Benign Malignant
## 2 346 38
## 3 9 25
## 4 11 130
## ---------------------------------------
## Correlation with Class AFTER: 0.7641009
From the results above, wee that the correlation of the column with the class improved considerably after adjustment, therefore confirming the validity of the changes made.
To more formally validate these adjustments, we can also perform hypothesis tests.
hyp.test.col(train[,c(8,10)], formula(Class ~ Normal.Nucleoli)) # function is included in the appendix
## ANOVA
## Df Sum Sq Mean Sq F value Pr(>F)
## Normal.Nucleoli 2 75.37 37.68 410.8 <2e-16 ***
## Residuals 556 51.00 0.09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ----------------------------------------------------------------
##
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = formula.to.anova, data = data.to.anova)
##
## $Normal.Nucleoli
## diff lwr upr p adj
## 3-2 0.6363358 0.50898680 0.7636848 0.0000000
## 4-2 0.8230275 0.75294380 0.8931112 0.0000000
## 4-3 0.1866917 0.05070938 0.3226740 0.0037916
The first test above is the standard ANOVA, showing a very small p-value, which indicates that must exist some difference in the chance of the tumor being Malignant for different values of Normal.Nucleoli. Then to find out which values have differences we also performed the Tukey’s Range Test above, that shows a significant positive value for the difference in the chance of the tumor being Malignant for all the groups of values of Normal.Nucleoli that we created.
Therefore, this confirms our finding that the groups created for the column Normal.Nucleoli have a real impact in the chance of the tumor being Malignant.
To observe how the remaining numeric features are related to each other, we look at the correlation matrix for the dataset.
feat.correlation.analysis(train, c(1:4,6:7), corr.cutoff=0.75) # function is included in the appendix
## Summary of correlation before:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5030 0.6538 0.6812 0.6738 0.7006 0.9089
##
## Suggestion: Remove column(s) Uniformity.of.Cell.Size
##
## Summary of correlation after eliminating ' Uniformity.of.Cell.Size ':
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5030 0.5861 0.6681 0.6404 0.6867 0.7331
We can see from the correlation plot that all the numeric columns have some certain correlation with each other. Although, from the first summary of the correlations above we can also see that 75% of the correlations are less than 0.70. Therefore, we used a cut-off of 0.75 and received a suggestion to remove the column Uniformity.of.Cell.Size. The last summary of correlations confirms the good choice for removing Uniformity.of.Cell.Size, since after removing it, the maximum correlation is 0.73.
With this correlation analysis in mind, we will not consider the column Uniformity.of.Cell.Size for building a logistic regression model, since it can be negatively affected by the presence of highly correlated features.
As a final step we are going to create several models for the dataset and analyze the results. Specifically, we are going to create models with four different methods:
models <- lapply(c('rf','glm','rpart','svmLinear'),
function(x) {
if(x=='glm'){create.model(method=x, train[,-2])} # function is included in the appendix
else{create.model(method=x, train)} # function is included in the appendix
})
## [1] "Method: rf -- 10-fold CV Accuracy = 96.958%"
## [1] "Method: glm -- 10-fold CV Accuracy = 96.958%"
## [1] "Method: rpart -- 10-fold CV Accuracy = 93.912%"
## [1] "Method: svmLinear -- 10-fold CV Accuracy = 96.779%"
The results show above are those of the 10-fold Cross Validation of the modeling, therefore trying to simulate real world results even before using the test dataset. From the accuracy results, we see that the Random Forest model got the best results.
Therefore, with Random Forest being the chosen model we display the plot that gives us the overall importance of the columns used.
chosen.model <- models[[1]]
varImpPlot(chosen.model$finalModel, main = 'Importance of Features')
Above we see that the first four columns in order of importance are:
Finally, to confirm the validity of our Random Forest model, we apply it to predict the results of the test dataset, simulating a situation of real world labeled data.
First we have to transform the columns of the test dataset in order to be in the exact same format expected by the model we built, that is, adjust the columns changed in the train dataset in the same way.
test <- adjust.with.discretization(test, c(1), c(9))
## Correlation with Class BEFORE: 0.4816347
## ---------------------------------------
## - AFTER adjustment:
## COLUMN: Mitoses
## Benign Malignant
## 1 89 23
## 2 3 25
## ---------------------------------------
## Correlation with Class AFTER: 0.5793569
test <- adjust.with.discretization(test, c(2,3), c(5))
## Correlation with Class BEFORE: 0.6728089
## ---------------------------------------
## - AFTER adjustment:
## COLUMN: Single.Epithelial.Cell.Size
## Benign Malignant
## 2 82 5
## 3 6 11
## 4 4 32
## ---------------------------------------
## Correlation with Class AFTER: 0.7751312
test <- adjust.with.discretization(test, c(2,3), c(8))
## Correlation with Class BEFORE: 0.7380251
## ---------------------------------------
## - AFTER adjustment:
## COLUMN: Normal.Nucleoli
## Benign Malignant
## 2 86 9
## 3 3 7
## 4 3 32
## ---------------------------------------
## Correlation with Class AFTER: 0.7596943
From the results above we can see that in the test dataset, the information seem to also be tidier with the changes performed.
To evaluate the Random Forest model in the test dataset, we predict its labels and compare them with the given class values by displaying the confusion matrix and all its derived metrics.
confusionMatrix(predict(chosen.model, newdata=test), test$Class, positive='Malignant')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 90 1
## Malignant 2 47
##
## Accuracy : 0.9786
## 95% CI : (0.9387, 0.9956)
## No Information Rate : 0.6571
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9527
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9792
## Specificity : 0.9783
## Pos Pred Value : 0.9592
## Neg Pred Value : 0.9890
## Prevalence : 0.3429
## Detection Rate : 0.3357
## Detection Prevalence : 0.3500
## Balanced Accuracy : 0.9787
##
## 'Positive' Class : Malignant
##
Observing the results above we conclude that, with a Random Forest model, we can predict if a given tumor is malignant with 97.86% of Accuracy. This result is 1.96% higher than the Accuracy of 95.90% reported in the UCI Machine Learning as the highest for this dataset.
From the analysis above we can conclude that the model created gives excellent accuracy in predicting breast cancer from tumor data, therefore all the exploration and manipulation of the dataset were valid for this purpose.
Functions used in this report are showed below.
# Read the txt file into a data frame
read.breast.cancer.data <- function(path="SET-PATH-HERE"){
filePath <- file.path(path, 'breast-cancer-wisconsin.data.txt')
# https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)
cancer <- read.csv(filePath, na.strings='?', stringsAsFactors=F, header=F)
names(cancer) <- c('ID', 'Clump.Thickness', 'Uniformity.of.Cell.Size', 'Uniformity.of.Cell.Shape',
'Marginal.Adhesion', 'Single.Epithelial.Cell.Size', 'Bare.Nuclei',
'Bland.Chromatin', 'Normal.Nucleoli', 'Mitoses', 'Class')
cancer$Class <- as.factor(ifelse(cancer$Class > 3, 'Malignant', 'Benign'))
cancer$ID <- NULL
return(cancer)
}
# Multiple Imputation by Chained Equations - mice package
impute.missing.data <- function(givenDF, cols.to.consider){
library(mice)
data.to.consider = givenDF[,cols.to.consider] # only columns with features (not class)
set.seed(144)
imputed = complete(mice(data.to.consider, printFlag = FALSE))
return(cbind(imputed, Class=givenDF$Class))
}
# Does ANOVA and Tukey's Range Test, outputing results in text and plot for Tukey's
anova.tests <- function(formula.to.anova, data.to.anova, leftpar=4){
aov <- aov(formula.to.anova, data = data.to.anova) # Calculates ANOVA
# Print ANOVA
cat(" ANOVA\n")
print(summary(aov))
cat("\n----------------------------------------------------------------\n\n")
tukey_anova <- TukeyHSD(aov) # Calculates Tukey's
# Print Tukey's
print(tukey_anova)
par(mar=c(5,leftpar,4,2) + 0.1)
plot(tukey_anova, las=1)
par(mar=c(5,4,4,2) + 0.1)
}
# Adjusts column for thresh values given, showing correlation with Class before and after
adjust.with.discretization <- function(givenDF, thresh.values, col.nr){
cat("Correlation with Class BEFORE:", cor(givenDF[,col.nr], ifelse(givenDF$Class == 'Benign', 0, 1)), "\n")
# Adjusts ranges given thresh values
if(length(thresh.values)==1){
givenDF[,col.nr] <- ifelse(givenDF[,col.nr] <= thresh.values, thresh.values, thresh.values+1)
}
else if(length(thresh.values)==2){
givenDF[,col.nr] <- ifelse(givenDF[,col.nr] <= thresh.values[1], thresh.values[1],
ifelse(givenDF[,col.nr] <= thresh.values[2], thresh.values[2],
thresh.values[2]+1)
)
}
cat("---------------------------------------\n")
cat("- AFTER adjustment:\n")
cat("COLUMN:", names(givenDF)[col.nr])
print(table(givenDF[,col.nr], givenDF$Class))
cat("---------------------------------------\n")
cat("Correlation with Class AFTER:", cor(givenDF[,col.nr], ifelse(givenDF$Class == 'Benign', 0, 1)), "\n")
givenDF[,col.nr] <- as.factor(givenDF[,col.nr])
return(givenDF)
}
# Controls hypothesis testing for given data
hyp.test.col <- function(data.to.test, formula.to.test=NULL, leftpar=4){
data.to.test$Class <- as.numeric(data.to.test$Class)-1
# if data.to.test[,1] is column with more than 2 levels = anova+tukey
if(length(table(data.to.test[,1])) > 2){
anova.tests(formula.to.test, data.to.test, leftpar)
}
else { # if data.to.test[,1] is just column that has only 2 levels = t.test
value1 <- names(table(data.to.test[,1]))[2]
value2 <- names(table(data.to.test[,1]))[1]
cat("Null Hyp.: for column '", names(data.to.test)[1], "', chance of malignant tumor for value", value1, "is less or equal than", value2, "\n")
cat("Alt. Hyp.: for column '", names(data.to.test)[1], "', chance of malignant tumor for value", value1, "is greater than", value2, "\n")
print(t.test(data.to.test$Class[data.to.test[,1] == value1],
data.to.test$Class[data.to.test[,1] == value2], "greater", 0, FALSE, FALSE, 0.95))
}
}
# Analyse the correlation of numeric columns of given data frame and suggests elimination for given cutoff
feat.correlation.analysis <- function(givenDF, numeric.cols, corr.cutoff=0.75){
library(corrplot)
library(caret)
corrplot(cor(givenDF[,numeric.cols]), title = 'Correlation Plot')
# Before
descrCor <- cor(givenDF[,numeric.cols])
cat("Summary of correlation before: \n")
print(summary(descrCor[upper.tri(descrCor)]))
# Finding correlation
highlyCorDescr <- findCorrelation(descrCor, cutoff = corr.cutoff, exact=TRUE)
cols.to.eliminate <- names(givenDF[,numeric.cols])[highlyCorDescr] # TO ELIMINATE
cat("\nSuggestion: Remove column(s)", cols.to.eliminate, "\n")
# After
filteredDescr <- (givenDF[,c(numeric.cols)])[,-highlyCorDescr]
descrCor2 <- cor(filteredDescr)
cat("\nSummary of correlation after eliminating '", cols.to.eliminate, "': \n")
print(summary(descrCor2[upper.tri(descrCor2)]))
}
# Creates model and uses 10-fold cross validation
create.model <- function(method, data.to.model){
library(caret)
cvCtrl <- trainControl("cv", 10, savePred=T) # 10-fold Cross Validation
set.seed(123)
model <- train(Class ~ .,
data = data.to.model,
method = method,
trControl = cvCtrl)
model
print(paste0("Method: ", method, " -- 10-fold CV Accuracy = ", round(100*max(model$results$Accuracy), 3), "%"))
return(model)
}