In this post, I will be discussing functions knnImputation and regr.eval from DMwR package. Dataset I will be using is from assignment 5, Wine and is available on Github.

To keep the discussion relevant, I will be using partial dataset, with variables TARGET, pH, Density, AcidIndex, LabelAppeal and Stars.

Let’s load dataset.

library(tidyverse)
library(DMwR)
library(knitr)
library(kableExtra)
library(VIM)

wine.data.df <- as.data.frame(read.csv("https://raw.githubusercontent.com/akulapa/Akula-DATA621-Project04/master/wine-training-data.csv", header=T, stringsAsFactors = F, na.strings=c("","NA")))

wine.df <- wine.data.df %>% select(CasesSold = TARGET, pH, Density, AcidIndex, LabelAppeal, Stars=STARS)

wine.df <- wine.df %>% mutate(LabelAppeal = + LabelAppeal + 3)

wine.df$Stars <- factor(wine.df$Stars)
wine.df$LabelAppeal <- factor(wine.df$LabelAppeal)

As variable LabelAppeal has negative values lets convert them to positive values by adding 3. Both LabelAppeal and Stars is categorical variables.

Let’s check for missing values in the dataset.

#Generate summary of missing values
aggr_plot <- aggr(wine.df, numbers=F, sortVars=F, labels=names(wine.df), cex.axis=.45, gap=3, ylab=c("Missing data","Pattern"))

summary(aggr_plot)$missings %>%  
  data.frame() %>% filter(Count > 0) %>% 
  mutate(Percentage = Count*100/nrow(wine.df)) %>% 
  mutate(Percentage = paste0(round(Percentage,2),'%')) %>% 
  kable("html",caption = "Variables With Missing Values", row.names = F, digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
Variables With Missing Values
Variable Count Percentage
pH 395 3.09%
Stars 3359 26.25%

Variable pH has approximately 3% missing values and Stars has about 26%. I will be using knnImputation function to impute missing values. Function accepcts five different parameters.

knnImputation(data, k = 10, scale = T, meth = "weighAvg", distData = NULL)
#Get complet cases from the dataset
wine.complete.cases.df <- wine.df[complete.cases(wine.df), ]

#Remove independent variable
wine.test.df <- wine.complete.cases.df
wine.test.df$CasesSold <- NULL

#Generate test set
set.seed(7374)
wine.test.df[sample(1:nrow(wine.test.df), round(nrow(wine.test.df)*3/100),0), "Stars"] <- NA
wine.test.df[sample(1:nrow(wine.test.df), round(nrow(wine.test.df)*26/100),0), "pH"] <- NA

#Loop through to find optimal 'k' value
for(i in 220:240){
  wine.imp.set <- wine.test.df
  
  #Imputation using kNN
  wine.imp.df <- knnImputation(wine.imp.set, i, meth='weighAvg')
  actual <- wine.complete.cases.df$pH[is.na(wine.test.df$pH)]
  predicts <- wine.imp.df$pH[is.na(wine.test.df$pH)]
  
  #Get errors and save them to dataframe
  error.rate <- regr.eval(actual, predicts)
  if (i==220){
    Accuracy.df <- data.frame(error.rate, stringsAsFactors = F) %>% 
      t() %>% 
      data.frame()
      row.names(Accuracy.df) <- paste0("kNN-",i)
  }else{
    
    A <- data.frame(error.rate, stringsAsFactors = F) %>% 
                  t() %>% 
                  data.frame()
    row.names(A) <- paste0("kNN-",i)
    Accuracy.df <- rbind(Accuracy.df, A)
  }
}

Accuracy.df <- tibble::rownames_to_column(Accuracy.df, "kNN")
Accuracy.df$Neighbors <- 220:240

#Generate graph to identify optimal 'k' value
#Lower error better
Accuracy.df %>% 
  reshape2::melt(id.vars = c('kNN','Neighbors')) %>% 
  ggplot(aes(x = Neighbors, y = value)) + 
    geom_point() + 
    geom_line() +
    facet_wrap(~variable, scales = "free", nrow = 2, ncol = 2) +
    scale_x_continuous(breaks=seq(220,240,2))

While identifying k value make sure dependent variable is removed from the dataset. In this case, CasesSold will be removed before imputing the data. Most of the times, if test dataset needs imputation before prediction, dependent variable won’t be available for imputation.

Following are four error values regr.eval outputs and mape is most popular error measure used by practitioners. Lower values are better.

In our case 232 is optimal k value. In other words, we will be imputing missing values with weighted average generated using 232 neighbors.

wine.df$CasesSold <- NULL
wine.knn <- knnImputation(wine.df, 232, meth='weighAvg')

wine.knn <- cbind(CasesSold = wine.data.df$TARGET, wine.knn)

Now that we have a clean dataset let’s build Poisson model to predict the number of cases of wine that can be sold based on predictor variables. Prediction ranges from zero to infinity positive values.

wine.glm <- glm(CasesSold ~ ., data = wine.knn, family = poisson)
summary(wine.glm)
## 
## Call:
## glm(formula = CasesSold ~ ., family = poisson, data = wine.knn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4937  -0.4944   0.2113   0.6372   2.7495  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.728404   0.197217   8.764  < 2e-16 ***
## pH           -0.020310   0.007639  -2.659  0.00784 ** 
## Density      -0.465430   0.191483  -2.431  0.01507 *  
## AcidIndex    -0.111461   0.004452 -25.037  < 2e-16 ***
## LabelAppeal2  0.337366   0.037933   8.894  < 2e-16 ***
## LabelAppeal3  0.533144   0.037208  14.329  < 2e-16 ***
## LabelAppeal4  0.661202   0.037928  17.433  < 2e-16 ***
## LabelAppeal5  0.770322   0.042737  18.025  < 2e-16 ***
## Stars2        0.189285   0.013733  13.783  < 2e-16 ***
## Stars3        0.509729   0.015595  32.684  < 2e-16 ***
## Stars4        0.693826   0.021663  32.029  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18136  on 12784  degrees of freedom
## AIC: 50100
## 
## Number of Fisher Scoring iterations: 5

Summary says all the variables are significant to the model. How do we measure the accuracy of the model?

Since we have actual cases sold and predicted cases, we can use regr.eval function.

wine.test <- wine.knn %>% select(-CasesSold)

wine.knn$Predict <- round(predict(wine.glm, newdata = wine.test, type = "response"))

error.rate <- regr.eval(wine.knn$CasesSold, wine.knn$Predict)
error.rate
##      mae      mse     rmse     mape 
## 1.257991 2.697460 1.642395      Inf

This time output shows Inf in mape error measure. The reason behind it we have zeros in observed values. When the dependent variable can take zero as one of the outputs, we cannot use mape as error measure. In this case other error measures should be used.

References