Wine Color Prediction KNN Model

0.1 Let’s start with Data Processing

The wine dataset contains 3198 observations, each recording a wine's color, its chemical properties, and a quality rating. This project uses the following R packages.

The rio package will be employed in importing the dataset.
The janitor package will rename the variables so that they consistently follow UpperCamelCase.
The tidymodels package will be used to streamline data engineering and machine learning tasks.
The kableExtra package will be used to create the tables in our report.
The kknn package will be used to create the KNN model.

Now we will start by importing our Wine dataset.

##Let's load our packages first
library(rio); library(janitor); library(tidyverse)

##Loading the Dataset
DataWineAllFeatures=import("https://ai.lange-analytics.com/data/WineData.rds")

##All the features
head(DataWineAllFeatures)

##   wineColor acidity volatileAcidity citricAcid residualSugar Chlorides
## 1       red    10.8           0.320       0.44           1.6     0.063
## 2     white     6.4           0.310       0.39           7.5     0.040
## 3     white     9.4           0.280       0.30           1.6     0.045
## 4     white     8.2           0.220       0.36           6.8     0.034
## 5     white     6.4           0.290       0.44           3.6     0.197
## 6       red     6.7           0.855       0.02           1.9     0.064
##   freeSulfurDioxide totalSulfurDioxide Density   pH sulphates alcohol quality
## 1                16                 37 0.99850 3.22      0.78   10.00       6
## 2                57                213 0.99475 3.32      0.43   10.00       5
## 3                36                139 0.99534 3.11      0.49    9.30       5
## 4                12                 90 0.99440 3.01      0.38   10.50       8
## 5                75                183 0.99420 3.01      0.38    9.10       5
## 6                29                 38 0.99472 3.30      0.56   10.75       6

##Select Acidity, Sulfur, and WineColor
DataWine=DataWineAllFeatures|>
  clean_names("upper_camel")|>
  select(WineColor, Sulfur=TotalSulfurDioxide, Acidity)|>
  mutate(WineColor=as.factor(WineColor))##Convert Wine Colors to factor variable

head(DataWine)

##   WineColor Sulfur Acidity
## 1       red     37    10.8
## 2     white    213     6.4
## 3     white    139     9.4
## 4     white     90     8.2
## 5     white    183     6.4
## 6       red     38     6.7

library(tidymodels)
##Let's split DataWine into 70-30 (training-testing), matching prop=0.7
set.seed(500)
Split7030=initial_split(DataWine,prop=0.7,strata="WineColor")
DataTrain=training(Split7030)
DataTest=testing(Split7030)
head(DataTrain)

##   WineColor Sulfur Acidity
## 1       red     38     6.7
## 2       red     66     7.6
## 3       red    102     7.9
## 4       red     11     5.8
## 5       red     49     6.8
## 6       red    110     7.0

0.2 Data Exploration

The two features, acidity and total sulfur dioxide, may help to distinguish red wines from white wines. A simple scatter plot of acidity against sulfur dioxide content, colored by wine color, is enough to explore this. Figure 1 shows how wine color varies with acidity and sulfur dioxide content.

##Let's visualize the data
library(ggplot2)
ggplot(DataWine, aes(x=Sulfur,y=Acidity,color=WineColor))+
  geom_point()+
  labs(
    title="Wine Color Separated by Acidity and Sulfur Dioxide", #title
    x="Sulfur (Total Sulfur Dioxide in mg/liter)", #xlab
    y="Acidity (Tartaric acid in g/liter)", #ylab
    color="Wine Color" #Title of Legend
  )+
  scale_color_manual(values=c("red"="red","white"="gold"))+ #Adding colors
  theme_minimal()

Figure 1: Wine Color by Acidity vs Sulfur Dioxide

Figure 1 reveals a clear pattern: wines with higher sulfur dioxide content tend to be white, while wines with high acidity are mostly red.

Let’s first classify wine color along a single axis. Along the acidity axis, it is fair to say that most wines with \(acidity>8\) could be classified as red and all other wines as white. The plot below illustrates this: a horizontal segment drawn at an acidity of 8 separates many of the red wines from the white wines.

ggplot(DataWine,aes(x=Sulfur,y=Acidity,color=WineColor))+
  geom_point()+
  annotate("segment",x=0,y=8,xend=450,yend=8,linewidth=1,color="tomato")+ #one segment; annotate() avoids the length-1 aesthetics warning
  labs(x="Sulfur",
       y="Acidity",
       title="Wine Classification along Acidity Axis",
       color="WineColor")+
  scale_color_manual(values = c("red"="red","white"="gold"))+
  theme_minimal()
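
The single-axis rule can also be written as code. The following is a minimal sketch that only illustrates the mechanics: it uses the six wines printed by head(DataWine) earlier, so the resulting accuracy says nothing about the full dataset. The names Wines, PredAcidity, and AccuracyAcidity are introduced here for illustration and do not appear in the original analysis.

```r
# Six wines copied from the head(DataWine) output shown earlier (illustration only)
Wines <- data.frame(
  WineColor = factor(c("red", "white", "white", "white", "white", "red")),
  Sulfur    = c(37, 213, 139, 90, 183, 38),
  Acidity   = c(10.8, 6.4, 9.4, 8.2, 6.4, 6.7)
)

# Single-axis rule: red if Acidity > 8, otherwise white
PredAcidity <- factor(ifelse(Wines$Acidity > 8, "red", "white"),
                      levels = levels(Wines$WineColor))

# Share of wines whose predicted color matches the recorded color
AccuracyAcidity <- mean(PredAcidity == Wines$WineColor)

# Cross-tabulate recorded against predicted colors
table(Actual = Wines$WineColor, Predicted = PredAcidity)
```

Replacing Wines with DataWine would apply the same rule to all 3198 observations.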

1 Classification along Acidity and Sulfur Axes

When both axes are used to classify a wine, it is natural to expect the classification accuracy to increase. Looking back at the scatter plot, an acidity value of 8 and a sulfur value of 75 appear to provide a reasonable prediction of wine color. Accordingly, a wine is classified as red if its acidity is above 8. Otherwise, it is classified as white if its sulfur content is above 75 and as red if its sulfur content is 75 or below. The plot below shows this bi-axial classification.

seg_data <- data.frame(
  x    = c(0, 75),
  y    = c(8, 0),
  xend = c(450, 75),
  yend = c(8, 8)
)

ggplot(DataWine, aes(x = Sulfur, y = Acidity, color = WineColor)) +
  geom_point() +
  geom_segment(
    data = seg_data,
    aes(x = x, y = y, xend = xend, yend = yend),
    inherit.aes = FALSE,
    linewidth = 1,
    color = "tomato"
  ) +
  annotate("point", x = 69, y = 5, size = 2, col="blue")+
  labs(
    x = "Sulfur",
    y = "Acidity",
    title = "Wine Classification along Acidity and Sulfur Axes",
    color = "WineColor"
  ) +
  ylim(0,16)+
  scale_color_manual(values = c("red" = "red", "white" = "gold")) +
  theme_minimal()
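
The two-threshold rule described above can be sketched as a small function. ClassifyWine and its arguments AcidityCut and SulfurCut are hypothetical names introduced for this illustration, with the cut-offs 8 and 75 read off the plot. The six wines again come from the head(DataWine) output, so the accuracy computed here is not representative of the full dataset.

```r
# Hypothetical helper implementing the two-threshold rule:
# red if Acidity > 8; otherwise white if Sulfur > 75; otherwise red
ClassifyWine <- function(Acidity, Sulfur, AcidityCut = 8, SulfurCut = 75) {
  ifelse(Acidity > AcidityCut, "red",
         ifelse(Sulfur > SulfurCut, "white", "red"))
}

# Six wines copied from the head(DataWine) output shown earlier (illustration only)
Wines <- data.frame(
  WineColor = factor(c("red", "white", "white", "white", "white", "red")),
  Sulfur    = c(37, 213, 139, 90, 183, 38),
  Acidity   = c(10.8, 6.4, 9.4, 8.2, 6.4, 6.7)
)

# Apply the rule and compute the share of correct predictions
Wines$Predicted <- ClassifyWine(Wines$Acidity, Wines$Sulfur)
AccuracyBoth <- mean(Wines$Predicted == Wines$WineColor)

# Classify the blue point marked in the plot (Sulfur = 69, Acidity = 5)
BluePoint <- ClassifyWine(Acidity = 5, Sulfur = 69)
```

Because the function is vectorized through ifelse(), it could be applied to DataWine$Acidity and DataWine$Sulfur in the same way.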

2 Data Scaling

Let’s zoom in on an unknown wine that lies among wines whose colors are already known.

It would be misleading to judge from the raw values whether two wines differ more in acidity or in sulfur. The problem arises because the two features are on different scales: acidity is measured in g/l, whereas sulfur is measured in mg/l. Such a large difference in scale can lead to misleading conclusions about how similar two wines are. Standardization resolves this issue. The step_normalize() step from the recipes package, which is part of tidymodels, can be helpful here.
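
The effect of scale on distance can be checked with a small base R sketch. It assumes that the unknown wine is the blue point from the earlier plot (Sulfur = 69, Acidity = 5) and compares it with the sixth wine from the head(DataWine) output. The z-score \(z=(x-\bar{x})/s\) is computed here from only six wines, so the numbers are illustrative. SulfurShare and Standardize are hypothetical helper names.

```r
# Six known wines copied from the head(DataWine) output (illustration only)
Known <- data.frame(
  Sulfur  = c(37, 213, 139, 90, 183, 38),
  Acidity = c(10.8, 6.4, 9.4, 8.2, 6.4, 6.7)
)

# Assumed unknown wine: the blue point from the earlier plot
Unknown <- data.frame(Sulfur = 69, Acidity = 5)

# Share of the squared Euclidean distance that is contributed by Sulfur
SulfurShare <- function(a, b) {
  d2 <- (unlist(a) - unlist(b))^2
  unname(d2["Sulfur"] / sum(d2))
}

# z-score standardization using the means and standard deviations of the known wines
Means <- sapply(Known, mean)
Sds   <- sapply(Known, sd)
Standardize <- function(df) as.data.frame(scale(df, center = Means, scale = Sds))

# Compare the sixth known wine with the unknown wine, before and after scaling
ShareRaw    <- SulfurShare(Known[6, ], Unknown)
ShareScaled <- SulfurShare(Standardize(Known)[6, ], Standardize(Unknown))
```

On the raw scale almost all of the squared distance comes from Sulfur, while after standardization Acidity contributes the larger part for this pair of wines.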

Figure ?? shows the normalized version of the data.
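
One way the normalized data could be produced with tidymodels is sketched below. This is an assumption about the workflow rather than the code behind the figure: it builds a recipe on the six wines from the head(DataWine) output, whereas in the project the recipe would presumably be built on DataTrain. RecipeWine and WinesNorm are names introduced for illustration.

```r
library(recipes)  # part of tidymodels

# Six wines copied from the head(DataWine) output shown earlier (illustration only)
Wines <- data.frame(
  WineColor = factor(c("red", "white", "white", "white", "white", "red")),
  Sulfur    = c(37, 213, 139, 90, 183, 38),
  Acidity   = c(10.8, 6.4, 9.4, 8.2, 6.4, 6.7)
)

# Define a recipe that z-score normalizes both predictors
RecipeWine <- recipe(WineColor ~ Sulfur + Acidity, data = Wines) |>
  step_normalize(all_predictors())

# Estimate the means and standard deviations, then return the normalized data
WinesNorm <- RecipeWine |>
  prep() |>
  bake(new_data = NULL)
```

After this step each predictor has a mean of 0 and a standard deviation of 1 in the data used to prepare the recipe, so both features contribute on a comparable scale.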