library(ggplot2)
library(caret)
library(mice)

Methods for Imputation

Read in the data file, give the columns more meaningful names according to the dataset’s description, and replace all missing values, coded as “?” in the dataset, with NA in the dataframe. Finally, print a summary of the dataframe.

The missing values (n = 16) are all located in the Bare_Nuclei predictor column.

df <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", header=FALSE, na.strings="?")
#http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/
#breast-cancer-wisconsin.names
colnames(df) <- c("ID", "Clump_Thickness", "Uniform_Cell_Size", "Uniform_Cell_Shape",
                 "Marg_Adhesion", "Single_Epith_Cell_Size", "Bare_Nuclei", "Bland_Chromatin",
                 "Normal_Nucleoli", "Mitoses", "Class")
# the raw Class labels are 2 (benign) and 4 (malignant); recode them as 0 and 1
df$Class <- as.factor(df$Class)
levels(df$Class) <- c(0, 1)
summary(df)
##        ID           Clump_Thickness  Uniform_Cell_Size Uniform_Cell_Shape
##  Min.   :   61634   Min.   : 1.000   Min.   : 1.000    Min.   : 1.000    
##  1st Qu.:  870688   1st Qu.: 2.000   1st Qu.: 1.000    1st Qu.: 1.000    
##  Median : 1171710   Median : 4.000   Median : 1.000    Median : 1.000    
##  Mean   : 1071704   Mean   : 4.418   Mean   : 3.134    Mean   : 3.207    
##  3rd Qu.: 1238298   3rd Qu.: 6.000   3rd Qu.: 5.000    3rd Qu.: 5.000    
##  Max.   :13454352   Max.   :10.000   Max.   :10.000    Max.   :10.000    
##                                                                          
##  Marg_Adhesion    Single_Epith_Cell_Size  Bare_Nuclei     Bland_Chromatin 
##  Min.   : 1.000   Min.   : 1.000         Min.   : 1.000   Min.   : 1.000  
##  1st Qu.: 1.000   1st Qu.: 2.000         1st Qu.: 1.000   1st Qu.: 2.000  
##  Median : 1.000   Median : 2.000         Median : 1.000   Median : 3.000  
##  Mean   : 2.807   Mean   : 3.216         Mean   : 3.545   Mean   : 3.438  
##  3rd Qu.: 4.000   3rd Qu.: 4.000         3rd Qu.: 6.000   3rd Qu.: 5.000  
##  Max.   :10.000   Max.   :10.000         Max.   :10.000   Max.   :10.000  
##                                          NA's   :16                       
##  Normal_Nucleoli     Mitoses       Class  
##  Min.   : 1.000   Min.   : 1.000   0:458  
##  1st Qu.: 1.000   1st Qu.: 1.000   1:241  
##  Median : 1.000   Median : 1.000          
##  Mean   : 2.867   Mean   : 1.589          
##  3rd Qu.: 4.000   3rd Qu.: 1.000          
##  Max.   :10.000   Max.   :10.000          
## 
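
As a quick sanity check that the 16 NA’s really are confined to Bare_Nuclei, the per-column NA counts can be printed directly (a small addition; this output is not shown above):

# count missing values per column; only Bare_Nuclei should report 16
colSums(is.na(df))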

A very brief exploration of the data shows that the samples belong to either Class 2 or Class 4, representing benign and malignant, respectively (recoded above as 0 and 1). Although the purpose of this assignment is not to explore the data graphically, it is worth noting that the observations can be separated reasonably well by, for example, Clump_Thickness and Uniform_Cell_Size, as shown in the graph below. Note that the separation is not perfect.

# jittered scatterplot of two predictors, colored by class
ggplot(df, aes(x=Clump_Thickness, y=Uniform_Cell_Size, color=Class)) +
  geom_jitter()

Since we will be using datasets with different (imputed) values for Bare_Nuclei, let’s make a function that can return a training and test set for a particular dataframe.

#function to split data into training and test set
#removes the ID column but not the response 'Class'
splitData <- function(df) {
  set.seed(777)
  df <- subset(df, select=-c(ID))
  smpl_size <- floor(0.8 * nrow(df))
  train_ind <- sample(seq_len(nrow(df)), size = smpl_size)
  train <- df[train_ind, ]
  test <- df[-train_ind, ]
  
  return (list("training" = train, "test" = test))
}

In addition, we will define a function that takes a training and a test set, together with a set of candidate values of k for the k-nearest neighbor classifier, and returns the ‘best’ k value and the associated test-set accuracy.

# fit knn3 (caret) for each candidate k and keep the k with the highest test-set accuracy
runKNN <- function(train, test, kList) {
  bestK <- 0; bestAccuracy <- 0
  for (k in kList) {
    fit <- knn3(Class ~ ., data=train, k=k)
    # predict on the test predictors (every column except the response 'Class')
    predictions <- predict(fit, test[, names(test) != "Class"], type="class")
    accuracy <- round(sum(predictions == test$Class) / length(test$Class), digits=3)
    if (accuracy > bestAccuracy) {
      bestK <- k
      bestAccuracy <- accuracy
    }
  }
  return(list("best_k" = bestK, "best_accuracy" = bestAccuracy))
}
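
Note that runKNN selects k by accuracy on the held-out test set. As a sketch of an alternative (not used in the analyses below), k could instead be chosen by cross-validation on the training set alone using caret; here trainSet stands for the $training element returned by splitData().

# sketch: choose k by 10-fold cross-validation on the training set only
# (trainSet is assumed to be the $training element returned by splitData())
ctrl <- trainControl(method = "cv", number = 10)
cvFit <- train(Class ~ ., data = trainSet, method = "knn",
               tuneGrid = data.frame(k = kList), trControl = ctrl)
cvFit$bestTune  # k with the best cross-validated accuracy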

The classifier used to group observations will be the k-nearest neighbor classification method (knn3) in the caret package. As a baseline, we will use the dataset from which all observations with a missing value have been removed.

kList <- seq(1,10) # use for all subsequent knn analyses
dfDropped <- na.omit(df)
trainTest <- splitData(dfDropped)
res <- runKNN(trainTest$training, trainTest$test, kList)
print("Results from dataset with dropped NA's")
## [1] "Results from dataset with dropped NA's"
print(paste0("Best k: ", res$best_k))
## [1] "Best k: 2"
print(paste0("Accuracy: ", res$best_accuracy))
## [1] "Accuracy: 0.993"

Impute the mean for the missing values and run k-nearest neighbor.

dfNew <- df
# replace the 16 missing Bare_Nuclei values with the column mean
dfNew$Bare_Nuclei[is.na(dfNew$Bare_Nuclei)] <- mean(dfNew$Bare_Nuclei, na.rm=TRUE)
trainTest <- splitData(dfNew)
res <- runKNN(trainTest$training, trainTest$test, kList)
print("Results from dataset with imputed mean")
## [1] "Results from dataset with imputed mean"
print(paste0("Best k: ", res$best_k))
## [1] "Best k: 2"
print(paste0("Accuracy: ", res$best_accuracy))
## [1] "Accuracy: 0.993"

Impute the mode for the missing values and run k-nearest neighbor.

dfNew <- df
# base R's mode() returns the storage mode, so compute the statistical mode
# (the most frequent Bare_Nuclei value) explicitly
modeVal <- as.numeric(names(which.max(table(dfNew$Bare_Nuclei))))
dfNew$Bare_Nuclei[is.na(dfNew$Bare_Nuclei)] <- modeVal
trainTest <- splitData(dfNew)
res <- runKNN(trainTest$training, trainTest$test, kList)
print("Results from dataset with imputed mode")
## [1] "Results from dataset with imputed mode"
print(paste0("Best k: ", res$best_k))
## [1] "Best k: 2"
print(paste0("Accuracy: ", res$best_accuracy))
## [1] "Accuracy: 0.993"

Use regression to impute the values for the missing data and run k-nearest neighbor.

imp <- mice(df, method="norm.predict", m=1)
## 
##  iter imp variable
##   1   1  Bare_Nuclei
##   2   1  Bare_Nuclei
##   3   1  Bare_Nuclei
##   4   1  Bare_Nuclei
##   5   1  Bare_Nuclei
data_imp <- complete(imp)
trainTest <- splitData(data_imp)
res <- runKNN(trainTest$training, trainTest$test, kList)
print("Results from dataset using regression to impute missing data")
## [1] "Results from dataset using regression to impute missing data"
print(paste0("Best k: ", res$best_k))
## [1] "Best k: 4"
print(paste0("Accuracy: ", res$best_accuracy))
## [1] "Accuracy: 0.964"

Use regression with perturbation to impute the values for the missing data and run k-nearest neighbor.

imp <- mice(df, method="norm.nob", m=1)
## 
##  iter imp variable
##   1   1  Bare_Nuclei
##   2   1  Bare_Nuclei
##   3   1  Bare_Nuclei
##   4   1  Bare_Nuclei
##   5   1  Bare_Nuclei
data_imp <- complete(imp)
trainTest <- splitData(data_imp)
res <- runKNN(trainTest$training, trainTest$test, kList)
print("Results from dataset using regression with perturbation to impute missing data")
## [1] "Results from dataset using regression with perturbation to impute missing data"
print(paste0("Best k: ", res$best_k))
## [1] "Best k: 5"
print(paste0("Accuracy: ", res$best_accuracy))
## [1] "Accuracy: 0.964"

Based on the k-nearest neighbor results, deleting the observations with missing values and imputing the mean or the mode for Bare_Nuclei all gave similar results: two neighbors yielded an accuracy of approximately 0.99 on the test set. In contrast, regression imputation, with or without perturbation, resulted in a somewhat lower accuracy. Note, however, that more neighbors were chosen in those two analyses, which suggests that the model needed more information, i.e., more neighbors, to make accurate predictions.

It is not unreasonable to expect that using (linear) regression to impute values would result in a better model (‘fit’) than simply replacing missing values with the mean or mode. Here, however, imputation with the mean or mode gave the best set of predictions, where ‘best’ means a smaller k (two neighbors) combined with a higher accuracy.