Mayuri Ingle
01/21/2019
In this project, we will build a kNN (k-Nearest Neighbour) classifier to predict wine quality using the red wine quality dataset from the UCI repository.
Before we start, a few details about the kNN algorithm…
kNN is a non-parametric supervised learning technique in which we try to classify a data point into a given category with the help of a training set. In simple words, it stores the information of all training cases and classifies new cases based on their similarity to them.
Predictions are made for a new instance (x) by searching through the entire training set for the K most similar cases (neighbours) and summarising the output variable for those K cases. In classification, this is the mode (most common) class value.
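To make this concrete, below is a minimal from-scratch sketch of the idea in R (illustrative only; knn_predict, train_x, train_y and new_x are hypothetical names, not objects from this project):
# Minimal kNN sketch: classify one new case by majority vote of its
# k nearest training cases, using Euclidean distance
knn_predict <- function(train_x, train_y, new_x, k) {
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2)) # distance to every training case
  nearest <- train_y[order(dists)[1:k]]              # labels of the k closest cases
  names(which.max(table(nearest)))                   # mode (most common) class
}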
Pros:- Easy to understand (it is one of the simplest algorithms in ML).
No assumptions about the data (e.g., about its distribution).
Can be applied to both classification and regression.
Works easily on multi-class problems.
Cons:- Memory intensive / computationally expensive.
Sensitive to the scale of the data (standardisation is very important here).
Does not work well on rare-event (skewed) target variables.
Struggles when the number of independent variables is high.
For any given problem, a small value of k will lead to a large variance in predictions. Alternatively, setting k to a large value may lead to a large model bias. Hence, finding the optimal value of k is important.
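As an illustration of how one might search for that optimal k, here is a sketch that picks the k with the lowest hold-out error (train_x, train_y, val_x and val_y are hypothetical training and validation objects, not from this project):
# Sketch: evaluate a range of k values on a hold-out set
ks <- seq(1, 51, by = 2)
err <- sapply(ks, function(k) {
  pred <- class::knn(train = train_x, test = val_x, cl = train_y, k = k)
  mean(pred != val_y) # misclassification rate for this k
})
ks[which.min(err)]    # the k with the lowest hold-out error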
Let’s get started with our classifier….
#Import all required libraries
library(class)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)
library(tidyverse)
## -- Attaching packages ----------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.5
## v tibble 2.0.1 v dplyr 0.7.8
## v tidyr 0.8.2 v stringr 1.3.1
## v readr 1.3.1 v forcats 0.3.0
## -- Conflicts -------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
# Load red wine quality dataset
redWine <- read_csv("F:\\RFiles\\Datasets\\wineQualityReds.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## fixed.acidity = col_double(),
## volatile.acidity = col_double(),
## citric.acid = col_double(),
## residual.sugar = col_double(),
## chlorides = col_double(),
## free.sulfur.dioxide = col_double(),
## total.sulfur.dioxide = col_double(),
## density = col_double(),
## pH = col_double(),
## sulphates = col_double(),
## alcohol = col_double(),
## quality = col_double()
## )
# View structure of dataset
str(redWine)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1599 obs. of 13 variables:
## $ X1 : num 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : num 5 5 5 6 5 5 5 7 7 5 ...
## - attr(*, "spec")=
## .. cols(
## .. X1 = col_double(),
## .. fixed.acidity = col_double(),
## .. volatile.acidity = col_double(),
## .. citric.acid = col_double(),
## .. residual.sugar = col_double(),
## .. chlorides = col_double(),
## .. free.sulfur.dioxide = col_double(),
## .. total.sulfur.dioxide = col_double(),
## .. density = col_double(),
## .. pH = col_double(),
## .. sulphates = col_double(),
## .. alcohol = col_double(),
## .. quality = col_double()
## .. )
The dataset has 13 columns: the first is an index, the last is the wine quality, and the rest describe the wine's composition.
# Removing first column
redWine <- redWine[,-1]
# Summary of dataset
summary(redWine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The summary tells us that wine quality is available in the range 3 to 8.
# Visualise the quality variable
redWine %>% ggplot(aes(x = quality)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The dataset is quite imbalanced, as the observations for wine qualities 3, 4 and 8 are far fewer than those for 5 and 6. This would get our model trained mainly on the majority (5 and 6) quality observations. We will convert the quality variable into two categories: "Low" if the wine quality is 5 or less, otherwise "High".
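To see the exact counts behind the histogram, we can tabulate the quality values (a quick added check, not part of the original script):
# Count the observations per quality score
table(redWine$quality)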
# Recoding quality as Low or High
redWine1 <- redWine %>%
mutate(quality = ifelse(quality <= 5, "Low", "High"))
redWine1$quality <- as.factor(redWine1$quality)
# Plot new variable
redWine1 %>% ggplot(aes(x = quality)) + geom_bar(width = 0.2)
Now we can say we have a reasonably balanced dataset, with a better ratio between the response categories.
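A quick way to verify the new balance (again an added check):
# Proportion of each class after recoding
prop.table(table(redWine1$quality))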
We will split our data into train and test sets, and then standardise the predictor variables.
# Create data partition
set.seed(123)
wine_split <- createDataPartition(redWine1$quality, p = 0.7 , list= FALSE)
wine_train <- redWine1[wine_split ,]
wine_test <- redWine1[-wine_split,]
# Standardise data
preProcValues <- preProcess(wine_train, method = c("center", "scale"))
wine_trainsc <- predict(preProcValues, wine_train )
wine_testsc <- predict(preProcValues, wine_test )
# Check processed data
summary(wine_trainsc$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.8926 -0.8666 -0.2136 0.0000 0.6258 4.1702
sd(wine_trainsc$alcohol)
## [1] 1
Our variables are now standardised around mean zero with a standard deviation of 1. Now we can train our model using this data.
Here we will build a simple model using the knn function from the class library, where we have to explicitly define the k value. A common rule of thumb is to set k to the square root of the number of observations, which is roughly 40 here (sqrt of 1599); we use the nearby odd value 39, since an odd k avoids ties between the two classes.
# Separate the predictors (columns 1-11) from the quality label (column 12)
wineTrain <- wine_trainsc[, -12]
wineTest <- wine_testsc[, -12]
wineTrain_label <- wine_trainsc[, 12, drop = TRUE]
wineTest_label <- wine_testsc[, 12, drop = TRUE]
# Basic knn model
set.seed(123)
wineknns <- knn(train = wineTrain, test = wineTest, cl = wineTrain_label, k = 39)
# Evaluating model performance
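# (Note: confusionMatrix() conventionally takes the predictions as its first
# argument and the reference labels as its second; with the reversed order
# used here, Sensitivity/Specificity trade places with Pos/Neg Pred Value,
# though Accuracy and Kappa are unaffected.)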
confusionMatrix(wineTest_label, wineknns)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 196 60
## Low 72 151
##
## Accuracy : 0.7244
## 95% CI : (0.6821, 0.764)
## No Information Rate : 0.5595
## P-Value [Acc > NIR] : 7.08e-14
##
## Kappa : 0.4443
## Mcnemar's Test P-Value : 0.3384
##
## Sensitivity : 0.7313
## Specificity : 0.7156
## Pos Pred Value : 0.7656
## Neg Pred Value : 0.6771
## Prevalence : 0.5595
## Detection Rate : 0.4092
## Detection Prevalence : 0.5344
## Balanced Accuracy : 0.7235
##
## 'Positive' Class : High
##
The accuracy of our model is ~72%, which means there is scope for improvement in our model's predictions.
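As a cross-check, the same accuracy can be computed directly from the predictions (an added line, not in the original script):
# Share of test cases predicted correctly
mean(wineknns == wineTest_label)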
We will now train our model using the train method from the caret library; this method will automatically find the value of k using cross-validation. Here we use the repeated cross-validation method via trainControl: 'number' denotes the number of folds and 'repeats' the number of times the k-fold cross-validation is repeated. In this case, 3 separate 10-fold cross-validations are used.
# Setting up train controls
repeats = 3
numbers = 10
tunel = 20
trnCntrl = trainControl(method = "repeatedcv",
number = numbers,
repeats = repeats,
classProbs = TRUE,
summaryFunction = twoClassSummary)
# kNN using the train method from caret
set.seed(123)
wineKnn <- train(quality ~ ., data = wine_trainsc, method = "knn",
trControl = trnCntrl,
metric = "ROC",
tuneLength = tunel)
# Summary of model
wineKnn
## k-Nearest Neighbors
##
## 1120 samples
## 11 predictor
## 2 classes: 'High', 'Low'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 1008, 1008, 1009, 1008, 1007, 1008, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.7994567 0.7846704 0.6737421
## 7 0.8006876 0.7712806 0.6679487
## 9 0.8111298 0.7829473 0.6903120
## 11 0.8109276 0.7696045 0.6979923
## 13 0.8112747 0.7729379 0.6999032
## 15 0.8123058 0.7667797 0.6973149
## 17 0.8131111 0.7740113 0.6998670
## 19 0.8161605 0.7690113 0.7069424
## 21 0.8131647 0.7701036 0.7037978
## 23 0.8139441 0.7701130 0.7069787
## 25 0.8145111 0.7712524 0.6998791
## 27 0.8135681 0.7701412 0.6922109
## 29 0.8129872 0.7696045 0.6979560
## 31 0.8114864 0.7745763 0.6896710
## 33 0.8120940 0.7746045 0.6934930
## 35 0.8145213 0.7706968 0.6992501
## 37 0.8147393 0.7679190 0.6947750
## 39 0.8140626 0.7779567 0.6941461
## 41 0.8151652 0.7751412 0.6966860
## 43 0.8151473 0.7729190 0.6934930
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 19.
# Plot to visualize optimal k selection
plot(wineKnn)
The plot above shows the optimal value for k as 19, and the final model was built using k = 19.
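If we want the selected k programmatically, caret stores it in the bestTune element of the fitted object:
# The tuned parameter value chosen by train()
wineKnn$bestTune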
Finally, to make predictions on our test set, we use the predict function, in which the first argument is the trained model and the second argument is the new data on which we want the predictions. With type = "prob" we get class probabilities, which we then convert into class labels using a 0.5 cutoff (type = "raw" would return the predicted classes directly).
# Predict values using the model
test_pred <- predict(wineKnn, wine_testsc, type = "prob")
# Recoding the probabilities into the two classes
test_pred$final <- as.factor(ifelse(test_pred$High > 0.5, "High", "Low"))
# Evaluating model performance
confusionMatrix(wineTest_label, test_pred$final)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 200 56
## Low 78 145
##
## Accuracy : 0.7203
## 95% CI : (0.6777, 0.76)
## No Information Rate : 0.5804
## P-Value [Acc > NIR] : 1.513e-10
##
## Kappa : 0.4342
## Mcnemar's Test P-Value : 0.06966
##
## Sensitivity : 0.7194
## Specificity : 0.7214
## Pos Pred Value : 0.7813
## Neg Pred Value : 0.6502
## Prevalence : 0.5804
## Detection Rate : 0.4175
## Detection Prevalence : 0.5344
## Balanced Accuracy : 0.7204
##
## 'Positive' Class : High
##
Here we have an accuracy of ~72%. Our classifier can be tuned further to increase accuracy.
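One possible next step, sketched under the same settings (not run here), is to search a wider custom grid of k values with tuneGrid instead of tuneLength:
# Sketch: search a custom grid of k values for further tuning
set.seed(123)
wineKnn2 <- train(quality ~ ., data = wine_trainsc, method = "knn",
                  trControl = trnCntrl,
                  metric = "ROC",
                  tuneGrid = data.frame(k = seq(5, 75, by = 2)))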
This concludes our exercise of building a kNN classifier.