library(ggplot2)
library(caret)
library(mice)
Read in the data file, give the columns more meaningful names according to the dataset's description, and replace all missing values, coded as "?" in the dataset, with NA in the dataframe. The Class column (2 = benign, 4 = malignant) is also converted to a factor with levels 0 and 1. Finally, print a summary of the dataframe.
The missing values (n = 16) are all located in the Bare_Nuclei predictor column (see the check after the summary output below).
df <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", header=FALSE, na.strings="?")
#http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/
#breast-cancer-wisconsin.names
colnames(df) <- c("ID", "Clump_Thickness", "Uniform_Cell_Size", "Uniform_Cell_Shape",
"Marg_Adhesion", "Single_Epith_Cell_Size", "Bare_Nuclei", "Bland_Chromatin",
"Normal_Nucleoli", "Mitoses", "Class")
# recode Class (2 = benign, 4 = malignant) as a factor with levels 0 and 1
df$Class <- as.factor(df$Class)
levels(df$Class) <- c(0, 1)
summary(df)
## ID Clump_Thickness Uniform_Cell_Size Uniform_Cell_Shape
## Min. : 61634 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 870688 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 1171710 Median : 4.000 Median : 1.000 Median : 1.000
## Mean : 1071704 Mean : 4.418 Mean : 3.134 Mean : 3.207
## 3rd Qu.: 1238298 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000
## Max. :13454352 Max. :10.000 Max. :10.000 Max. :10.000
##
## Marg_Adhesion Single_Epith_Cell_Size Bare_Nuclei Bland_Chromatin
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 1.000 Median : 2.000 Median : 1.000 Median : 3.000
## Mean : 2.807 Mean : 3.216 Mean : 3.545 Mean : 3.438
## 3rd Qu.: 4.000 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
## NA's :16
## Normal_Nucleoli Mitoses Class
## Min. : 1.000 Min. : 1.000 0:458
## 1st Qu.: 1.000 1st Qu.: 1.000 1:241
## Median : 1.000 Median : 1.000
## Mean : 2.867 Mean : 1.589
## 3rd Qu.: 4.000 3rd Qu.: 1.000
## Max. :10.000 Max. :10.000
##
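As a quick programmatic check (a small sketch, not part of the original analysis), the per-column NA counts confirm that all 16 missing values sit in Bare_Nuclei:
# count missing values per column; only Bare_Nuclei should be non-zero (n = 16)
colSums(is.na(df))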
A very brief exploration of the data shows that the samples belong to either Class 2 or Class 4, representing benign and malignant, respectively (recoded above as factor levels 0 and 1). Although the purpose of this assignment is not to explore the data graphically, it is worth noting that the observations can be separated reasonably well by, for example, Clump_Thickness and Uniform_Cell_Size, as shown in the graph below. Note that the separation is not perfect.
ggplot(df, aes(x=Clump_Thickness, y=Uniform_Cell_Size, color=Class)) +
  geom_jitter()
Since we will be using datasets with different (imputed) values for Bare_Nuclei, let’s make a function that can return a training and test set for a particular dataframe.
# function to split a dataframe into a training and a test set (80/20)
# removes the ID column but keeps the response 'Class'
splitData <- function(df) {
  set.seed(777)  # fixed seed for a reproducible split
  df <- subset(df, select = -c(ID))
  smpl_size <- floor(0.8 * nrow(df))
  train_ind <- sample(seq_len(nrow(df)), size = smpl_size)
  train <- df[train_ind, ]
  test <- df[-train_ind, ]
  return(list("training" = train, "test" = test))
}
In addition, we will define a function that takes a training and a test set, as well as a set of candidate k values for the k-nearest neighbor classifier, and returns the 'best' k and its associated test-set accuracy.
runKNN <- function(train, test, kList) {
  bestK <- 0; bestAccuracy <- 0
  for (k in kList) {
    fit <- knn3(Class ~ ., data = train, k = k)
    # predict on the test set, excluding the response column 'Class'
    predictions <- predict(fit, subset(test, select = -c(Class)), type = "class")
    accuracy <- round(sum(predictions == test$Class) / length(test$Class), digits = 3)
    if (accuracy > bestAccuracy) {
      bestK <- k
      bestAccuracy <- accuracy
    }
  }
  return(list("best_k" = bestK, "best_accuracy" = bestAccuracy))
}
The classifier used to group observations will be the k-nearest neighbor classification method from the caret package. As a baseline, we will use the dataset from which all observations with a missing value have been removed.
kList <- seq(1,10) # use for all subsequent knn analyses
dfDropped <- na.omit(df)
trainTest <- splitData(dfDropped)
res <- runKNN(trainTest$training, trainTest$test, kList)
print("Results from dataset with dropped NA's")
## [1] "Results from dataset with dropped NA's"
print(paste0("Best k: ", res$best_k))
## [1] "Best k: 2"
print(paste0("Accuracy: ", res$best_accuracy))
## [1] "Accuracy: 0.993"
Impute the mean for the missing values and run k-nearest neighbor.
dfNew <- df
dfNew$Bare_Nuclei[is.na(dfNew$Bare_Nuclei)] <- mean(dfNew$Bare_Nuclei, na.rm=TRUE)
trainTest <- splitData(dfNew)
res <- runKNN(trainTest$training, trainTest$test, kList)
print("Results from dataset with imputed mean")
## [1] "Results from dataset with imputed mean"
print(paste0("Best k: ", res$best_k))
## [1] "Best k: 2"
print(paste0("Accuracy: ", res$best_accuracy))
## [1] "Accuracy: 0.993"
Impute the mode for the missing values and run k-nearest neighbor.
dfNew <- df
# base R's mode() returns the storage type, not the statistical mode,
# so impute the most frequent value instead
statMode <- function(x) as.numeric(names(which.max(table(x))))
dfNew$Bare_Nuclei[is.na(dfNew$Bare_Nuclei)] <- statMode(dfNew$Bare_Nuclei)
trainTest <- splitData(dfNew)
res <- runKNN(trainTest$training, trainTest$test, kList)
print("Results from dataset with imputed mode")
## [1] "Results from dataset with imputed mode"
print(paste0("Best k: ", res$best_k))
## [1] "Best k: 2"
print(paste0("Accuracy: ", res$best_accuracy))
## [1] "Accuracy: 0.993"
Use regression to impute the values for the missing data and run k-nearest neighbor.
imp <- mice(df, method="norm.predict", m=1)
##
## iter imp variable
## 1 1 Bare_Nuclei
## 2 1 Bare_Nuclei
## 3 1 Bare_Nuclei
## 4 1 Bare_Nuclei
## 5 1 Bare_Nuclei
data_imp <- complete(imp)
trainTest <- splitData(data_imp)
res <- runKNN(trainTest$training, trainTest$test, kList)
print("Results from dataset using regression to impute missing data")
## [1] "Results from dataset using regression to impute missing data"
print(paste0("Best k: ", res$best_k))
## [1] "Best k: 4"
print(paste0("Accuracy: ", res$best_accuracy))
## [1] "Accuracy: 0.964"
Use regression with perturbation to impute the values for the missing data and run k-nearest neighbor.
imp <- mice(df, method="norm.nob", m=1)
##
## iter imp variable
## 1 1 Bare_Nuclei
## 2 1 Bare_Nuclei
## 3 1 Bare_Nuclei
## 4 1 Bare_Nuclei
## 5 1 Bare_Nuclei
data_imp <- complete(imp)
trainTest <- splitData(data_imp)
res <- runKNN(trainTest$training, trainTest$test, kList)
print("Results from dataset using regression with perturbation to impute missing data")
## [1] "Results from dataset using regression with perturbation to impute missing data"
print(paste0("Best k: ", res$best_k))
## [1] "Best k: 5"
print(paste0("Accuracy: ", res$best_accuracy))
## [1] "Accuracy: 0.964"
Based on the k-nearest neighbor results summarized above, simply deleting the observations with missing Bare_Nuclei values and imputing the mean or mode gave similar results: with two neighbors, the accuracy on the test set was approximately 0.99. In contrast, regression imputation, with or without perturbation, resulted in a somewhat lower accuracy. Note, however, that more neighbors were selected in those latter two analyses, which suggests that the model needed more information, i.e., more neighbors, to make accurate predictions.
It is not unreasonable to expect that using (linear) regression to impute values would produce a better model ('fit') than simply replacing missing values with the mean or mode. Here, however, imputation with the mean or mode gave the best set of predictions, where 'best' is defined as the smallest k (two neighbors) combined with the higher accuracy.