MAPE - Infinity Output

In this post, I will be discussing functions knnImputation and regr.eval from DMwR package. Dataset I will be using is from assignment 5, Wine and is available on Github.

To keep the discussion relevant, I will be using partial dataset, with variables TARGET, pH, Density, AcidIndex, LabelAppeal and Stars.

Let’s load dataset.

library(tidyverse)
library(DMwR)
library(knitr)
library(kableExtra)
library(VIM)

wine.data.df <- as.data.frame(read.csv("https://raw.githubusercontent.com/akulapa/Akula-DATA621-Project04/master/wine-training-data.csv", header=T, stringsAsFactors = F, na.strings=c("","NA")))

wine.df <- wine.data.df %>% select(CasesSold = TARGET, pH, Density, AcidIndex, LabelAppeal, Stars=STARS)

wine.df <- wine.df %>% mutate(LabelAppeal = + LabelAppeal + 3)

wine.df$Stars <- factor(wine.df$Stars)
wine.df$LabelAppeal <- factor(wine.df$LabelAppeal)

As variable LabelAppeal has negative values lets convert them to positive values by adding 3. Both LabelAppeal and Stars is categorical variables.

Let’s check for missing values in the dataset.

#Generate summary of missing values
aggr_plot <- aggr(wine.df, numbers=F, sortVars=F, labels=names(wine.df), cex.axis=.45, gap=3, ylab=c("Missing data","Pattern"))

summary(aggr_plot)$missings %>%  
  data.frame() %>% filter(Count > 0) %>% 
  mutate(Percentage = Count*100/nrow(wine.df)) %>% 
  mutate(Percentage = paste0(round(Percentage,2),'%')) %>% 
  kable("html",caption = "Variables With Missing Values", row.names = F, digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)

Variables With Missing Values
Variable	Count	Percentage
pH	395	3.09%
Stars	3359	26.25%

Variable pH has approximately 3% missing values and Stars has about 26%. I will be using knnImputation function to impute missing values. Function accepcts five different parameters.

knnImputation(data, k = 10, scale = T, meth = "weighAvg", distData = NULL)

data - it is the data frame that needs to be imputed.
k - The number of nearest neighbours to use (defaults to 10). Following R code shows how to find optimal value.
scale - Boolean setting if the data should be scale before finding the nearest neighbours (defaults to T)
meth - String indicating the method used to calculate the value to fill in each NA. Available values are ‘median’ or ‘weighAvg’ (the default).
distData - Optionally you may sepecify here a data frame containing the data set that should be used to find the neighbours. This is usefull when filling in NA values on a test set, where you should use only information from the training set. This defaults to NULL, which means that the neighbours will be searched in imputing dataset

#Get complet cases from the dataset
wine.complete.cases.df <- wine.df[complete.cases(wine.df), ]

#Remove independent variable
wine.test.df <- wine.complete.cases.df
wine.test.df$CasesSold <- NULL

#Generate test set
set.seed(7374)
wine.test.df[sample(1:nrow(wine.test.df), round(nrow(wine.test.df)*3/100),0), "Stars"] <- NA
wine.test.df[sample(1:nrow(wine.test.df), round(nrow(wine.test.df)*26/100),0), "pH"] <- NA

#Loop through to find optimal 'k' value
for(i in 220:240){
  wine.imp.set <- wine.test.df
  
  #Imputation using kNN
  wine.imp.df <- knnImputation(wine.imp.set, i, meth='weighAvg')
  actual <- wine.complete.cases.df$pH[is.na(wine.test.df$pH)]
  predicts <- wine.imp.df$pH[is.na(wine.test.df$pH)]
  
  #Get errors and save them to dataframe
  error.rate <- regr.eval(actual, predicts)
  if (i==220){
    Accuracy.df <- data.frame(error.rate, stringsAsFactors = F) %>% 
      t() %>% 
      data.frame()
      row.names(Accuracy.df) <- paste0("kNN-",i)
  }else{
    
    A <- data.frame(error.rate, stringsAsFactors = F) %>% 
                  t() %>% 
                  data.frame()
    row.names(A) <- paste0("kNN-",i)
    Accuracy.df <- rbind(Accuracy.df, A)
  }
}

Accuracy.df <- tibble::rownames_to_column(Accuracy.df, "kNN")
Accuracy.df$Neighbors <- 220:240

#Generate graph to identify optimal 'k' value
#Lower error better
Accuracy.df %>% 
  reshape2::melt(id.vars = c('kNN','Neighbors')) %>% 
  ggplot(aes(x = Neighbors, y = value)) + 
    geom_point() + 
    geom_line() +
    facet_wrap(~variable, scales = "free", nrow = 2, ncol = 2) +
    scale_x_continuous(breaks=seq(220,240,2))

While identifying k value make sure dependent variable is removed from the dataset. In this case, CasesSold will be removed before imputing the data. Most of the times, if test dataset needs imputation before prediction, dependent variable won’t be available for imputation.

Following are four error values regr.eval outputs and mape is most popular error measure used by practitioners. Lower values are better.

mae - mean absolute percentage error. it is calculated as, \[\frac{1}{n} \times \sum_{i=1}^n |O_i - E_i|\]
mse - mean squared error. it is calculated as, \[\frac{1}{n} \times \sum_{i=1}^n (O_i - E_i)^2\]
rmse - square root mean squared error. it is calculated as, \[\frac{1}{n} \times \sum_{i=1}^n \sqrt{(O_i - E_i)^2}\]
mape - mean absolute percentage error. it is calculated as, \[\frac{1}{n} \times \sum_{i=1}^n \bigg|\frac{O_i - E_i}{O_i}\bigg|\]

In our case 232 is optimal k value. In other words, we will be imputing missing values with weighted average generated using 232 neighbors.

wine.df$CasesSold <- NULL
wine.knn <- knnImputation(wine.df, 232, meth='weighAvg')

wine.knn <- cbind(CasesSold = wine.data.df$TARGET, wine.knn)

Now that we have a clean dataset let’s build Poisson model to predict the number of cases of wine that can be sold based on predictor variables. Prediction ranges from zero to infinity positive values.

wine.glm <- glm(CasesSold ~ ., data = wine.knn, family = poisson)
summary(wine.glm)

## 
## Call:
## glm(formula = CasesSold ~ ., family = poisson, data = wine.knn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4937  -0.4944   0.2113   0.6372   2.7495  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.728404   0.197217   8.764  < 2e-16 ***
## pH           -0.020310   0.007639  -2.659  0.00784 ** 
## Density      -0.465430   0.191483  -2.431  0.01507 *  
## AcidIndex    -0.111461   0.004452 -25.037  < 2e-16 ***
## LabelAppeal2  0.337366   0.037933   8.894  < 2e-16 ***
## LabelAppeal3  0.533144   0.037208  14.329  < 2e-16 ***
## LabelAppeal4  0.661202   0.037928  17.433  < 2e-16 ***
## LabelAppeal5  0.770322   0.042737  18.025  < 2e-16 ***
## Stars2        0.189285   0.013733  13.783  < 2e-16 ***
## Stars3        0.509729   0.015595  32.684  < 2e-16 ***
## Stars4        0.693826   0.021663  32.029  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18136  on 12784  degrees of freedom
## AIC: 50100
## 
## Number of Fisher Scoring iterations: 5

Summary says all the variables are significant to the model. How do we measure the accuracy of the model?

Since we have actual cases sold and predicted cases, we can use regr.eval function.

wine.test <- wine.knn %>% select(-CasesSold)

wine.knn$Predict <- round(predict(wine.glm, newdata = wine.test, type = "response"))

error.rate <- regr.eval(wine.knn$CasesSold, wine.knn$Predict)
error.rate

##      mae      mse     rmse     mape 
## 1.257991 2.697460 1.642395      Inf

This time output shows Inf in mape error measure. The reason behind it we have zeros in observed values. When the dependent variable can take zero as one of the outputs, we cannot use mape as error measure. In this case other error measures should be used.

MAPE - Infinity Output

Pavan Akula

May 20, 2018

References