In this post, I will be discussing functions knnImputation and regr.eval from DMwR package. Dataset I will be using is from assignment 5, Wine and is available on Github.
To keep the discussion relevant, I will be using partial dataset, with variables TARGET, pH, Density, AcidIndex, LabelAppeal and Stars.
Let’s load dataset.
library(tidyverse)
library(DMwR)
library(knitr)
library(kableExtra)
library(VIM)
wine.data.df <- as.data.frame(read.csv("https://raw.githubusercontent.com/akulapa/Akula-DATA621-Project04/master/wine-training-data.csv", header=T, stringsAsFactors = F, na.strings=c("","NA")))
wine.df <- wine.data.df %>% select(CasesSold = TARGET, pH, Density, AcidIndex, LabelAppeal, Stars=STARS)
wine.df <- wine.df %>% mutate(LabelAppeal = + LabelAppeal + 3)
wine.df$Stars <- factor(wine.df$Stars)
wine.df$LabelAppeal <- factor(wine.df$LabelAppeal)
As variable LabelAppeal has negative values lets convert them to positive values by adding 3. Both LabelAppeal and Stars is categorical variables.
Let’s check for missing values in the dataset.
#Generate summary of missing values
aggr_plot <- aggr(wine.df, numbers=F, sortVars=F, labels=names(wine.df), cex.axis=.45, gap=3, ylab=c("Missing data","Pattern"))
summary(aggr_plot)$missings %>%
data.frame() %>% filter(Count > 0) %>%
mutate(Percentage = Count*100/nrow(wine.df)) %>%
mutate(Percentage = paste0(round(Percentage,2),'%')) %>%
kable("html",caption = "Variables With Missing Values", row.names = F, digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
| Variable | Count | Percentage |
|---|---|---|
| pH | 395 | 3.09% |
| Stars | 3359 | 26.25% |
Variable pH has approximately 3% missing values and Stars has about 26%. I will be using knnImputation function to impute missing values. Function accepcts five different parameters.
knnImputation(data, k = 10, scale = T, meth = "weighAvg", distData = NULL)
data - it is the data frame that needs to be imputed.k - The number of nearest neighbours to use (defaults to 10). Following R code shows how to find optimal value.scale - Boolean setting if the data should be scale before finding the nearest neighbours (defaults to T)meth - String indicating the method used to calculate the value to fill in each NA. Available values are ‘median’ or ‘weighAvg’ (the default).distData - Optionally you may sepecify here a data frame containing the data set that should be used to find the neighbours. This is usefull when filling in NA values on a test set, where you should use only information from the training set. This defaults to NULL, which means that the neighbours will be searched in imputing dataset#Get complet cases from the dataset
wine.complete.cases.df <- wine.df[complete.cases(wine.df), ]
#Remove independent variable
wine.test.df <- wine.complete.cases.df
wine.test.df$CasesSold <- NULL
#Generate test set
set.seed(7374)
wine.test.df[sample(1:nrow(wine.test.df), round(nrow(wine.test.df)*3/100),0), "Stars"] <- NA
wine.test.df[sample(1:nrow(wine.test.df), round(nrow(wine.test.df)*26/100),0), "pH"] <- NA
#Loop through to find optimal 'k' value
for(i in 220:240){
wine.imp.set <- wine.test.df
#Imputation using kNN
wine.imp.df <- knnImputation(wine.imp.set, i, meth='weighAvg')
actual <- wine.complete.cases.df$pH[is.na(wine.test.df$pH)]
predicts <- wine.imp.df$pH[is.na(wine.test.df$pH)]
#Get errors and save them to dataframe
error.rate <- regr.eval(actual, predicts)
if (i==220){
Accuracy.df <- data.frame(error.rate, stringsAsFactors = F) %>%
t() %>%
data.frame()
row.names(Accuracy.df) <- paste0("kNN-",i)
}else{
A <- data.frame(error.rate, stringsAsFactors = F) %>%
t() %>%
data.frame()
row.names(A) <- paste0("kNN-",i)
Accuracy.df <- rbind(Accuracy.df, A)
}
}
Accuracy.df <- tibble::rownames_to_column(Accuracy.df, "kNN")
Accuracy.df$Neighbors <- 220:240
#Generate graph to identify optimal 'k' value
#Lower error better
Accuracy.df %>%
reshape2::melt(id.vars = c('kNN','Neighbors')) %>%
ggplot(aes(x = Neighbors, y = value)) +
geom_point() +
geom_line() +
facet_wrap(~variable, scales = "free", nrow = 2, ncol = 2) +
scale_x_continuous(breaks=seq(220,240,2))
While identifying k value make sure dependent variable is removed from the dataset. In this case, CasesSold will be removed before imputing the data. Most of the times, if test dataset needs imputation before prediction, dependent variable won’t be available for imputation.
Following are four error values regr.eval outputs and mape is most popular error measure used by practitioners. Lower values are better.
mae - mean absolute percentage error. it is calculated as, \[\frac{1}{n} \times \sum_{i=1}^n |O_i - E_i|\]mse - mean squared error. it is calculated as, \[\frac{1}{n} \times \sum_{i=1}^n (O_i - E_i)^2\]rmse - square root mean squared error. it is calculated as, \[\frac{1}{n} \times \sum_{i=1}^n \sqrt{(O_i - E_i)^2}\]mape - mean absolute percentage error. it is calculated as, \[\frac{1}{n} \times \sum_{i=1}^n \bigg|\frac{O_i - E_i}{O_i}\bigg|\]In our case 232 is optimal k value. In other words, we will be imputing missing values with weighted average generated using 232 neighbors.
wine.df$CasesSold <- NULL
wine.knn <- knnImputation(wine.df, 232, meth='weighAvg')
wine.knn <- cbind(CasesSold = wine.data.df$TARGET, wine.knn)
Now that we have a clean dataset let’s build Poisson model to predict the number of cases of wine that can be sold based on predictor variables. Prediction ranges from zero to infinity positive values.
wine.glm <- glm(CasesSold ~ ., data = wine.knn, family = poisson)
summary(wine.glm)
##
## Call:
## glm(formula = CasesSold ~ ., family = poisson, data = wine.knn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4937 -0.4944 0.2113 0.6372 2.7495
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.728404 0.197217 8.764 < 2e-16 ***
## pH -0.020310 0.007639 -2.659 0.00784 **
## Density -0.465430 0.191483 -2.431 0.01507 *
## AcidIndex -0.111461 0.004452 -25.037 < 2e-16 ***
## LabelAppeal2 0.337366 0.037933 8.894 < 2e-16 ***
## LabelAppeal3 0.533144 0.037208 14.329 < 2e-16 ***
## LabelAppeal4 0.661202 0.037928 17.433 < 2e-16 ***
## LabelAppeal5 0.770322 0.042737 18.025 < 2e-16 ***
## Stars2 0.189285 0.013733 13.783 < 2e-16 ***
## Stars3 0.509729 0.015595 32.684 < 2e-16 ***
## Stars4 0.693826 0.021663 32.029 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 18136 on 12784 degrees of freedom
## AIC: 50100
##
## Number of Fisher Scoring iterations: 5
Summary says all the variables are significant to the model. How do we measure the accuracy of the model?
Since we have actual cases sold and predicted cases, we can use regr.eval function.
wine.test <- wine.knn %>% select(-CasesSold)
wine.knn$Predict <- round(predict(wine.glm, newdata = wine.test, type = "response"))
error.rate <- regr.eval(wine.knn$CasesSold, wine.knn$Predict)
error.rate
## mae mse rmse mape
## 1.257991 2.697460 1.642395 Inf
This time output shows Inf in mape error measure. The reason behind it we have zeros in observed values. When the dependent variable can take zero as one of the outputs, we cannot use mape as error measure. In this case other error measures should be used.