Introduction:
The evaluation of wine quality is a fine art traditionally gatekept by a select few, conventionally dependent on a blend of chemical characteristics and subjective evaluations by experts. This study analyses two separate datasets containing “Vinho Verde” wine variations from Portugal, carefully assembled by Paulo Cortez and his team in 2009. The datasets contain a range of physicochemical characteristics along with quality ratings ranging from 0 to 10. Our objective is to utilise statistical learning methods, notably regression and classification, to predict and classify the quality of wine.

About Dataset:
The dataset titled “Wine Quality” is available on the UCI Machine Learning Repository. It contains information about both red and white versions of Portuguese “Vinho Verde” wine. The dataset consists of 1,599 samples of red wine and 4,898 samples of white wine. It provides information on 11 physicochemical characteristics, including acidity, sugar content, and alcohol levels. Additionally, the dataset includes quality ratings assigned by experts on a scale ranging from 0 to 10. This dataset is commonly utilized for regression and classification tasks in the field of machine learning, with the specific objective of predicting the quality of wine based on its chemical makeup.

Wine-Quality UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/186/wine+quality


Objectives:
Objective 1: To determine the essential physicochemical properties that affect the quality of wine

To determine exactly which attributes of wine, including acidity, sugar level, alcohol content, sulphates, pH level, and others, have an impact on wine quality and taste.

Objective 2: To build regression models for predicting wine quality

The objective is to create and optimize regression models that can accurately predict the numerical quality score of wines from their physicochemical attributes.

Objective 3: To build classification models for predicting wine type

The objective is to create a classification model that can reliably identify red and white wines based on quantitative properties including acidity, alcohol content, and sugar levels.

Data Preprocessing:

The necessary libraries “tidyr”, “dplyr”, and “ggplot2” are loaded.

library(tidyr)
library(dplyr)
library(ggplot2)

Set the working directory, load the winequality-red and winequality-white datasets, and check the number of rows and columns in each. The UCI files are semicolon-delimited, so sep = ";" is passed to read.csv.

file_path <- "C:/Users/Sowjanya/OneDrive/Documents/R Project"
red <- read.csv("winequality-red.csv", header = TRUE)
red$type <- "red"
white <- read.csv("winequality-white.csv", header = TRUE)
white$type <- "white"
#View(red)
dim(red)
## [1] 1599    2
#View(white)
dim(white)
## [1] 4898    2

Combine the winequality-red and winequality-white datasets into a single dataset

wine <- rbind(red, white)
dim(wine)
## [1] 6497   13

The combined dataset is cleaned by removing fully empty rows/columns (Findings: no empty rows/columns)

wine <- wine[rowSums(is.na(wine)) != ncol(wine),]
wine <- wine[, colSums(is.na(wine)) != nrow(wine)]
dim(wine)
## [1] 6497   13

Check for missing values (Findings: No missing values)

any(is.na(wine))
## [1] FALSE
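
Had any values been missing, a per-column count would show where they sit; a minimal sketch (all counts are zero for this dataset):

# Count missing values per column
colSums(is.na(wine))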

Remove duplicate rows (1,177 duplicates are dropped, leaving 5,320 rows)

wine1 <- wine %>% distinct()
dim(wine1)
## [1] 5320   13

Summary stats of the dataset

summary(wine1)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2400   1st Qu.: 1.800  
##  Median : 7.000   Median :0.3000   Median :0.3100   Median : 2.700  
##  Mean   : 7.215   Mean   :0.3441   Mean   :0.3185   Mean   : 5.048  
##  3rd Qu.: 7.700   3rd Qu.:0.4100   3rd Qu.:0.4000   3rd Qu.: 7.500  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.00900   Min.   :  1.00      Min.   :  6.0        Min.   :0.9871  
##  1st Qu.:0.03800   1st Qu.: 16.00      1st Qu.: 74.0        1st Qu.:0.9922  
##  Median :0.04700   Median : 28.00      Median :116.0        Median :0.9947  
##  Mean   :0.05669   Mean   : 30.04      Mean   :114.1        Mean   :0.9945  
##  3rd Qu.:0.06600   3rd Qu.: 41.00      3rd Qu.:153.2        3rd Qu.:0.9968  
##  Max.   :0.61100   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.210   Median :0.5100   Median :10.40   Median :6.000  
##  Mean   :3.225   Mean   :0.5334   Mean   :10.55   Mean   :5.796  
##  3rd Qu.:3.330   3rd Qu.:0.6000   3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :9.000  
##      type          
##  Length:5320       
##  Class :character  
##  Mode  :character

Classes of wine1 dataset

sapply(wine1, class)
##        fixed.acidity     volatile.acidity          citric.acid 
##            "numeric"            "numeric"            "numeric" 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##            "numeric"            "numeric"            "numeric" 
## total.sulfur.dioxide              density                   pH 
##            "numeric"            "numeric"            "numeric" 
##            sulphates              alcohol              quality 
##            "numeric"            "numeric"            "integer" 
##                 type 
##          "character"

A new binary column type_bin is created to differentiate red and white wines. (“red - 1”, “white - 0”)

wine1$type_bin <- ifelse(wine1$type == "red", 1, 0)
head(wine1)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.66        0.00            1.8     0.075
## 6           7.9             0.60        0.06            1.6     0.069
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  13                   40  0.9978 3.51      0.56     9.4
## 6                  15                   59  0.9964 3.30      0.46     9.4
##   quality type type_bin
## 1       5  red        1
## 2       5  red        1
## 3       5  red        1
## 4       6  red        1
## 5       5  red        1
## 6       5  red        1
tail(wine1)
##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 5315           6.5             0.23        0.38            1.3     0.032
## 5316           6.2             0.21        0.29            1.6     0.039
## 5317           6.6             0.32        0.36            8.0     0.047
## 5318           6.5             0.24        0.19            1.2     0.041
## 5319           5.5             0.29        0.30            1.1     0.022
## 5320           6.0             0.21        0.38            0.8     0.020
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 5315                  29                  112 0.99298 3.29      0.54     9.7
## 5316                  24                   92 0.99114 3.27      0.50    11.2
## 5317                  57                  168 0.99490 3.15      0.46     9.6
## 5318                  30                  111 0.99254 2.99      0.46     9.4
## 5319                  20                  110 0.98869 3.34      0.38    12.8
## 5320                  22                   98 0.98941 3.26      0.32    11.8
##      quality  type type_bin
## 5315       5 white        0
## 5316       6 white        0
## 5317       5 white        0
## 5318       6 white        0
## 5319       7 white        0
## 5320       6 white        0

The cleaned data is exported to wine_cleaned.csv and then read back in for the analysis that follows.

# Use absolute path
output_file <- "C:/Users/Sowjanya/OneDrive/Documents/R Project/wine_cleaned.csv"
write.csv(wine1, output_file, row.names = FALSE)
data <- read.csv(output_file)

Data Visualization and Exploratory Data Analysis:

Histograms: Created for various physicochemical variables like fixed acidity, volatile acidity, citric acid, etc., with differentiation between red and white wine types.

Bar Charts: Illustrate the distribution of wine quality across different wine types.

Grouped Bar Plots: Comparing different chemical attributes that determine wine taste, such as fixed acidity, alcohol, and residual sugar.

Box Plot: Focus on the distribution of citric acid across different wine types.

Scatter Plot: Analyse the relationship between selected pairs of variables.

Correlation Heatmap: Visualizing the correlation matrix to identify relationships between various physicochemical properties.

Necessary libraries like “tidyr”, “dplyr”, “ggplot2”, “corrplot”, “plotly” are used.

head(data)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.66        0.00            1.8     0.075
## 6           7.9             0.60        0.06            1.6     0.069
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  13                   40  0.9978 3.51      0.56     9.4
## 6                  15                   59  0.9964 3.30      0.46     9.4
##   quality type type_bin
## 1       5  red        1
## 2       5  red        1
## 3       5  red        1
## 4       6  red        1
## 5       5  red        1
## 6       5  red        1
summary(data)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2400   1st Qu.: 1.800  
##  Median : 7.000   Median :0.3000   Median :0.3100   Median : 2.700  
##  Mean   : 7.215   Mean   :0.3441   Mean   :0.3185   Mean   : 5.048  
##  3rd Qu.: 7.700   3rd Qu.:0.4100   3rd Qu.:0.4000   3rd Qu.: 7.500  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.00900   Min.   :  1.00      Min.   :  6.0        Min.   :0.9871  
##  1st Qu.:0.03800   1st Qu.: 16.00      1st Qu.: 74.0        1st Qu.:0.9922  
##  Median :0.04700   Median : 28.00      Median :116.0        Median :0.9947  
##  Mean   :0.05669   Mean   : 30.04      Mean   :114.1        Mean   :0.9945  
##  3rd Qu.:0.06600   3rd Qu.: 41.00      3rd Qu.:153.2        3rd Qu.:0.9968  
##  Max.   :0.61100   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.210   Median :0.5100   Median :10.40   Median :6.000  
##  Mean   :3.225   Mean   :0.5334   Mean   :10.55   Mean   :5.796  
##  3rd Qu.:3.330   3rd Qu.:0.6000   3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :9.000  
##      type              type_bin     
##  Length:5320        Min.   :0.0000  
##  Class :character   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000  
##                     Mean   :0.2555  
##                     3rd Qu.:1.0000  
##                     Max.   :1.0000
# Check for missing values
any(is.na(data))
## [1] FALSE
#Visualization of all Variables-Histogram
numerical_vars <- c(
  "fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar", "chlorides", "free.sulfur.dioxide", 
  "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol", "quality"
)

data <- read.csv("wine_cleaned.csv")
library(ggplot2)

# Define colors for 'red' and 'white' types
type_colors <- c("red" = "darkred", "white" = "skyblue")

for (var in numerical_vars) {
  print(
    ggplot(data, aes(x = .data[[var]], fill = type)) +
      # bins = 30 adapts the bin width to each variable's scale; a fixed
      # binwidth of 1 would lump small-scale variables such as chlorides
      # or density into a single bar
      geom_histogram(bins = 30, color = "white") +
      scale_fill_manual(values = type_colors) +
      labs(title = paste("Histogram:", var), x = var, y = "Count") +
      theme_minimal()
  )
}


Examining variables such as fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality reveals noticeable differences in concentration levels between red and white wines, highlighted by the colour-coded histograms. For example, the histograms for alcohol content and quality show different distributions for the two wine types, suggesting that these factors may affect the perceived quality of each type differently. These visualizations motivate further analysis of how chemical properties relate to the sensory attributes and consumer preferences of red and white wines.
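
To put numbers on these visual differences, group means by wine type can be computed; a small sketch using dplyr (column names as in wine_cleaned.csv):

library(dplyr)
# Compare average levels of a few attributes across the two wine types
data %>%
  group_by(type) %>%
  summarise(mean_alcohol = mean(alcohol),
            mean_volatile_acidity = mean(volatile.acidity),
            mean_residual_sugar = mean(residual.sugar))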


# Visualization of Wine Quality by Wine Type - Bar Chart
data <- read.csv("wine_cleaned.csv")
ggplot(data, aes(x = factor(quality), fill = factor(type))) + geom_bar(position = "dodge", alpha = 0.7) +
  labs(title = "Bar Chart of Wine Quality by Wine Type", x = "Quality", y = "Count") + theme_minimal()



The bar chart displays the distribution of quality ratings separately for red and white wines, showing how frequently each quality score occurs for each type. Most wines of both types receive scores of 5 or 6.
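
The counts behind the bar chart can also be cross-tabulated directly; a quick sketch:

# Counts of each quality score by wine type
table(data$type, data$quality)
# Row-wise proportions, so the two types are comparable despite different sample sizes
round(prop.table(table(data$type, data$quality), margin = 1), 3)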


# Visualization - Grouped Bar Plot
library(ggplot2)
data <- read.csv("wine_cleaned.csv")
# fixed acidity, alcohol and residual sugar determine the taste of the Wine 
numerical_vars <- c("fixed.acidity", "alcohol", "residual.sugar")
data_long <- tidyr::pivot_longer(data, cols = all_of(numerical_vars), names_to = "variable", values_to = "value")
ggplot(data_long, aes(x = type, fill = variable, y = value)) +
  # plot the mean of each attribute per wine type; stacking raw rows with
  # stat = "identity" would sum thousands of values and distort the comparison
  stat_summary(fun = mean, geom = "bar", position = "dodge", color = "black") +
  scale_fill_manual(values = c("fixed.acidity" = "orange", "alcohol" = "darkblue", "residual.sugar" = "purple")) +
  facet_wrap(~variable, scales = "free_y", ncol = 1) +
  labs(title = "Grouped Bar Plot of Chemical Attributes that Determine Wine Taste",
       x = "Wine Type",
       y = "Mean Value") +
  theme_minimal()


The grouped bar plot shows the main chemical characteristics that impact the flavour of wine, with emphasis on alcohol, fixed acidity, and residual sugar. The differences in these characteristics by wine type are clear, enabling an easy comparison of how these factors differ between red and white wines; for instance, white wines carry noticeably more residual sugar on average. This information can provide insights to winemakers and enhance the understanding of tasting profiles.


#Visualization - Box Plot
# Bivariate Analysis
data <- read.csv("wine_cleaned.csv")
library(ggplot2)
ggplot(data, aes(x = type, y = citric.acid, fill = type)) +
  geom_boxplot(alpha = 1.0, color = "black") + 
  labs(title = "Distribution of Citric Acid by Wine Type", x = "Wine Type", y = "Citric Acid") +
  scale_fill_manual(values = c("red" = "darkred", "white" = "lightblue")) +
  theme_minimal() +
  theme(legend.position = "top", legend.title = element_blank())



The box plot shows the difference in citric acid content between red and white wines. It is observed that red wines generally have lower levels of citric acid compared to white wines. The visualization displays the distribution and central tendency of the citric acid variable for each wine type.


#Visualization - Scatter Plot
#Multivariate Analysis
library(ggplot2)
data <- read.csv("wine_cleaned.csv")
selected_vars <- c("fixed.acidity", "alcohol", "volatile.acidity", "sulphates", "quality", "type")
ggplot(data, aes(x = .data[[selected_vars[1]]], y = .data[[selected_vars[2]]], color = type)) +
  geom_point(alpha = 0.5) +
  labs(title = "Scatter Plot: Selected Variables by Wine Type", x = selected_vars[1], y = selected_vars[2]) +
  theme_minimal()



The scatter plot shows how fixed acidity and alcohol vary in red and white wines. Both types span a wide range of alcohol levels, while red wines tend toward higher fixed acidity. The substantial overlap between the two clouds of points shows that many red and white wines share similar acidity and alcohol levels, allowing a convenient visual comparison of the two wine types.


#Visualization
#Correlation Heatmap

#install.packages("corrplot")
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
data <- read.csv("wine_cleaned.csv")
correlation_matrix <- cor(data[, sapply(data, is.numeric)])

par(mar = c(6, 6, 6, 6))  
corrplot(correlation_matrix, method = "color", type = "upper", addCoef.col = "black", number.cex = 0.7)



The correlation heatmap provides a visual representation of the relationships between the physicochemical properties of wines, showcasing their strength and direction. Stronger correlations are observed between free sulfur dioxide and total sulfur dioxide, with darker shades indicating this relationship. On the other hand, the light areas indicate less strong correlations. This map provides valuable insights into the detailed connections between various factors and their potential impact on the quality of wine.
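
The strongest relationships can also be pulled out of the matrix programmatically rather than read off the plot; a sketch reusing correlation_matrix from above:

# Flatten the matrix and rank variable pairs by absolute correlation
cor_pairs <- as.data.frame(as.table(correlation_matrix))
names(cor_pairs) <- c("var1", "var2", "correlation")
cor_pairs <- cor_pairs[cor_pairs$var1 != cor_pairs$var2, ]
# Each pair appears twice because the matrix is symmetric
head(cor_pairs[order(-abs(cor_pairs$correlation)), ], 6)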

Model Building:

Necessary libraries like “rpart”, “rattle”, “randomForest”, “xgboost”, “caret”, “e1071”, “caTools” are used.

library(caTools)
set.seed(123)
#wine quality is turned into nominal data
data$qualitytype<-ifelse(data$quality<6,'bad','good')
data$qualitytype[data$quality==6]<-'normal'
data$qualitytype<-as.factor(data$qualitytype)
ggplot(data=data)+geom_bar(mapping = aes(x=quality, fill = factor(quality)), stat = "count") +
  # the combined data also contains a handful of quality-9 wines, which need a colour
  scale_fill_manual(values = c("3" = "red", "4" = "blue", "5" = "green", "6" = "orange", "7" = "purple", "8" = "pink", "9" = "brown"))+
  labs(x = "Quality", y = "Count") +
  theme_minimal()


A new variable called “qualitytype” is created to group the wines by quality (bad for scores below 6, normal for 6, good above 6). The bar chart displays the distribution of quality ratings, with a significant portion falling into the middle scores of 5 and 6. These quality categories are later used when preparing the data for training machine learning models.
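
A quick tabulation confirms how balanced the three categories are; a one-line check:

# Counts of wines in each quality category
table(data$qualitytype)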


#split data into two parts as train and test
options(repos = c(CRAN = "https://cloud.r-project.org"))
#install.packages("rattle")
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## 
## Attaching package: 'rattle'
## The following object is masked _by_ '.GlobalEnv':
## 
##     wine
# sample.split stratifies on its first argument; splitting on fixed.acidity
# amounts to a simple 70/30 random split here
part<-sample.split(data$fixed.acidity,SplitRatio = 0.7)
train<-data[part,]
test<-data[!part,]
library(rpart) 
library(plotly) 
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(rpart.plot)


Objective 1: To determine the essential physicochemical properties that affect the quality of wine

#Random Forest
#Determine the essential physicochemical properties 
#Load necessary libraries
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
## 
##     importance
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
## Loading required package: lattice
# Read the cleaned wine data
wine_data <- read.csv('C:/Users/Sowjanya/OneDrive/Documents/R Project/wine_cleaned.csv')
wine_data$quality <- as.factor(wine_data$quality)
# Split the data into training and testing sets
set.seed(123)  # For reproducibility
trainIndex <- createDataPartition(wine_data$quality, p = 0.7, list = FALSE)
trainData <- wine_data[trainIndex, ]
testData <- wine_data[-trainIndex, ]
# Train the model
rfModel <- randomForest(quality ~ ., data = trainData, importance = TRUE, ntree = 500)
# View the model results
print(rfModel)
## 
## Call:
##  randomForest(formula = quality ~ ., data = trainData, importance = TRUE,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 43.51%
## Confusion matrix:
##   3 4   5    6   7 8 9 class.error
## 3 0 0  12    8   1 0 0   1.0000000
## 4 0 7  91   45   2 0 0   0.9517241
## 5 0 7 758  448  14 0 0   0.3822331
## 6 0 3 364 1131 127 2 0   0.3048556
## 7 0 0  18  373 205 4 0   0.6583333
## 8 0 0   1   63  35 5 0   0.9519231
## 9 0 0   0    3   1 0 0   1.0000000
# Make predictions
predictions <- predict(rfModel, newdata = testData)
# Evaluate the model
confMatrix <- confusionMatrix(predictions, testData$quality)
print(confMatrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   1   0   0   0   0   0
##          4   0   1   3   1   0   0   0
##          5   5  35 349 152   8   0   0
##          6   4  24 167 492 172  26   0
##          7   0   0   6  50  73  17   1
##          8   0   0   0   1   3   1   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5754          
##                  95% CI : (0.5507, 0.5998)
##     No Information Rate : 0.4372          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.324           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                       Class: 3  Class: 4 Class: 5 Class: 6 Class: 7  Class: 8
## Sensitivity          0.0000000 0.0163934   0.6648   0.7069  0.28516 0.0227273
## Specificity          0.9993683 0.9973873   0.8126   0.5614  0.94461 0.9974160
## Pos Pred Value       0.0000000 0.2000000   0.6357   0.5559  0.49660 0.2000000
## Neg Pred Value       0.9943432 0.9621928   0.8313   0.7115  0.87336 0.9729049
## Prevalence           0.0056533 0.0383166   0.3298   0.4372  0.16080 0.0276382
## Detection Rate       0.0000000 0.0006281   0.2192   0.3090  0.04585 0.0006281
## Detection Prevalence 0.0006281 0.0031407   0.3448   0.5559  0.09234 0.0031407
## Balanced Accuracy    0.4996841 0.5068904   0.7387   0.6341  0.61488 0.5100716
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993719
## Prevalence           0.0006281
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000
# Get the importance matrix
importanceMatrix <- importance(rfModel)
# Filter out 'type' and 'type_bin' from the importance matrix
varImportance <- data.frame(Variable = rownames(importanceMatrix), Importance = importanceMatrix[, "MeanDecreaseGini"])
varImportance <- varImportance[!varImportance$Variable %in% c("type", "type_bin"), ]
# Visualize the importance (Using the correct importance measure)
ggplot(varImportance, aes(x = reorder(Variable, Importance), y = Importance, fill = Importance)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  scale_fill_gradient(low = "Lavender", high = "Thistle") +
  xlab("Physicochemical Properties") +
  ylab("Essential") +
  ggtitle("Essential physicochemical properties using the Random Forest Model\n")


The Random Forest model indicates that alcohol, density, and volatile.acidity are the physicochemical properties with the greatest impact on wine quality. These properties carry considerably more importance than others such as total sulfur dioxide, sulphates, and chlorides, while free sulfur dioxide, pH, residual sugar, citric acid, and fixed acidity make smaller but still meaningful contributions. This suggests that winemaking techniques that optimize alcohol content, control acidity levels, and monitor density may be key to enhancing wine quality.
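
The same ranking can be read off programmatically from the varImportance data frame built above; a short sketch:

# List the three most important physicochemical properties
top_vars <- varImportance[order(-varImportance$Importance), ]
head(top_vars, 3)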


Objective 2: To build regression models for predicting wine quality


Random Forest for predicting the quality of wine:

# RANDOM FOREST - Regression

library(randomForest)
library(caret)

# Ensure the 'quality' column is numeric again. Note: as.numeric() on a factor
# returns level indices, so convert via as.character() to recover the scores
wine_data$quality <- as.numeric(as.character(wine_data$quality))

# Split the data into training and testing sets
set.seed(123) # Setting seed for reproducibility
indexes <- createDataPartition(wine_data$quality, p = 0.7, list = FALSE)
train_set <- wine_data[indexes, ]
test_set <- wine_data[-indexes, ]

# Train the Random Forest regression model
rf_model <- randomForest(quality ~ ., data = train_set)

# Make predictions on the test set
predictions <- predict(rf_model, newdata = test_set)

# Calculate RMSE
rmse <- sqrt(mean((test_set$quality - predictions)^2))

# Return the RMSE
rmse
## [1] 0.6985049


The Random Forest regression model performs well, achieving an RMSE (Root Mean Square Error) of 0.6985049, the lowest of the three regression models tested. An average prediction error of roughly 0.7 quality points indicates a good fit to the data.

In the context of regression, the model predicts wine quality as a continuous variable with reasonable accuracy: its low RMSE reflects a small prediction error, and it outperforms both the Decision Tree and XGBoost regression models.
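
RMSE on its own can be hard to interpret, so complementary metrics such as MAE and R-squared are worth reporting; a sketch reusing the test-set predictions above:

# Mean Absolute Error: average size of the prediction error
mae_rf <- mean(abs(test_set$quality - predictions))
# R-squared: share of the variance in quality explained by the model
r2_rf <- 1 - sum((test_set$quality - predictions)^2) /
             sum((test_set$quality - mean(test_set$quality))^2)
mae_rf
r2_rf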


XGBoost for predicting the quality of wine:

# XGBoost-Regression

set.seed(123)

# Define predictors and responses: drop quality (col 12), type (13) and
# qualitytype (15); the 11 physicochemical features plus type_bin remain
train_x <- data.matrix(train[,-c(12, 13, 15)])
train_y <- data.matrix(train[, 12])
test_x <- data.matrix(test[,-c(12, 13, 15)])
test_y <- data.matrix(test[, 12])

# Load the xgboost library
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:plotly':
## 
##     slice
## The following object is masked from 'package:rattle':
## 
##     xgboost
## The following object is masked from 'package:dplyr':
## 
##     slice
# Create DMatrix objects for training and testing data
train_xgb <- xgb.DMatrix(data = train_x, label = train_y)
test_xgb <- xgb.DMatrix(data = test_x, label = test_y)

# Train the XGBoost model
xgb <- xgb.train(
  data = train_xgb,
  max.depth = 3,
  watchlist = list(train = train_xgb, test = test_xgb),
  nrounds = 100,
  verbose = 0
)

# Retrain with the round count that minimised test RMSE in the watchlist above (88 here)
xgb_final <- xgboost(data = train_xgb, max.depth = 3, nrounds = 88, verbose = 0)

# Make predictions
pred_xgb <- predict(xgb_final, newdata = test_xgb)

# Calculate Mean Squared Error (MSE)
mse <- mean((test_y - pred_xgb)^2)

# Calculate Mean Absolute Error (MAE)
mae <- caret::MAE(test_y, pred_xgb)

# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((test_y - pred_xgb)^2))  
# Convert numeric predictions to a factor with thresholds 
pred_xgb_factor <- as.factor(ifelse(pred_xgb < 5, 'Low', ifelse(pred_xgb > 7, 'High', 'Medium')))

# Convert the actual numeric values to the same categorical levels for consistency
test_y_factor <- as.factor(ifelse(test_y < 5, 'Low', ifelse(test_y > 7, 'High', 'Medium')))

# Create a confusion matrix
conf_matrix_xgb <- confusionMatrix(pred_xgb_factor, test_y_factor)

# Output the confusion matrix and RMSE
print(conf_matrix_xgb)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High  Low Medium
##     High      4    0      9
##     Low       0   28     84
##     Medium   46   43   1378
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8857         
##                  95% CI : (0.869, 0.9009)
##     No Information Rate : 0.924          
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.2124         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: High Class: Low Class: Medium
## Sensitivity             0.080000    0.39437        0.9368
## Specificity             0.994163    0.94477        0.2645
## Pos Pred Value          0.307692    0.25000        0.9393
## Neg Pred Value          0.970868    0.97095        0.2560
## Prevalence              0.031407    0.04460        0.9240
## Detection Rate          0.002513    0.01759        0.8656
## Detection Prevalence    0.008166    0.07035        0.9215
## Balanced Accuracy       0.537082    0.66957        0.6006
print(paste("RMSE:", rmse)) 
## [1] "RMSE: 0.708532210672326"


The XGBoost regression model also performs well, with an RMSE (Root Mean Square Error) of 0.7085322, only slightly higher than that of the Random Forest.

In terms of regression analysis, the model predicts wine quality as a continuous variable with a small average error and a good fit to the dataset. This performance is better than that of the Decision Tree regression model, although it falls just short of the Random Forest.
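
XGBoost can also report which features drive its predictions, which links this model back to Objective 1; a sketch using the booster trained above:

# Gain-based feature importance from the trained model
importance_xgb <- xgb.importance(feature_names = colnames(train_x), model = xgb_final)
head(importance_xgb)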


Decision Tree for predicting the quality of wine:

#Decision tree
#predict the quality of wine.
library(readr)
library(caret)
library(rpart)
library(rpart.plot)

wine_data$quality <- as.factor(wine_data$quality)

set.seed(123)
dtModel<-rpart(quality~.,data = train[,-c(14,15)])
dtModel
## n= 3728 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 3728 2866.5170 5.799088  
##    2) alcohol< 10.91667 2435 1393.5330 5.536345  
##      4) volatile.acidity>=0.285 1297  590.4225 5.329221 *
##      5) volatile.acidity< 0.285 1138  684.0536 5.772408  
##       10) alcohol< 10.11667 692  380.8367 5.643064 *
##       11) alcohol>=10.11667 446  273.6771 5.973094 *
##    3) alcohol>=10.91667 1293  988.3217 6.293890  
##      6) alcohol< 11.675 561  420.0570 6.040998 *
##      7) alcohol>=11.675 732  504.8893 6.487705  
##       14) free.sulfur.dioxide< 21.5 262  206.0305 6.213740 *
##       15) free.sulfur.dioxide>=21.5 470  268.2319 6.640426 *
#visualize the decision tree
fancyRpartPlot(dtModel)

#model evaluation
pred_dtModel<-predict(dtModel,newdata = test)
#summary of the prediction vs summary of the real test data
summary(pred_dtModel)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.329   5.329   5.643   5.806   6.041   6.640
summary(test$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.788   6.000   9.000
RMSE_dtModel<-sqrt(mean((pred_dtModel-test$quality)^2))
print("RMSE of the model")
## [1] "RMSE of the model"
RMSE_dtModel
## [1] 0.7730765
#data frame with real response and the prediction
dat<-data.frame(test$quality,pred_dtModel)
head(dat)
##    test.quality pred_dtModel
## 5             5     5.329221
## 6             5     5.329221
## 7             7     5.329221
## 15            7     5.973094
## 16            5     5.329221
## 18            6     5.329221


Comparison Table for Regression Models:

# RMSE for the Random Forest model
rmse_rf <- 0.6985049

# RMSE for the Decision Tree model (Regression)
rmse_dt <- 0.7730765

# RMSE for the XGBoost model (Regression)
rmse_xgb <- 0.7085322

# Create a data frame with the model performance metrics based on RMSE
model_comparison_rmse <- data.frame(
  Model = c("Random Forest (Regression)", "XGBoost (Regression)", "Decision Tree (Regression)"),
  RMSE = c(rmse_rf, rmse_xgb, rmse_dt)
)

# Display the comparison as a table
knitr::kable(model_comparison_rmse, format = "pipe", caption = "Comparative Analysis of Models Based on RMSE")

Table: Comparative Analysis of Models Based on RMSE

|Model                      |      RMSE|
|:--------------------------|---------:|
|Random Forest (Regression) | 0.6985049|
|XGBoost (Regression)       | 0.7085322|
|Decision Tree (Regression) | 0.7730765|


Comparative Analysis Explanation for Regression Model:

1. Random Forest (Regression):
RMSE = 0.6985049.
Interpretation: The Random Forest regression model has a slightly lower RMSE than the XGBoost and Decision Tree regression models. This shows that Random Forest makes the most accurate predictions with the lowest prediction error.

2. XGBoost Regression:
RMSE = 0.7085322.
Interpretation: XGBoost has a moderate prediction error for the continuous target variable. Its RMSE is greater than that of Random Forest, indicating that it is less accurate in comparison.

3. Decision Tree (Regression):
RMSE = 0.7730765.
Interpretation: The Decision Tree regression model has the highest RMSE of the three models, indicating the least accurate predictions and the greatest prediction error.

In a comparative analysis, Random Forest outperforms the other two regression models with lower RMSE and higher prediction accuracy. XGBoost, with a moderate RMSE, is less accurate than Random Forest but outperforms the Decision Tree regression model. The Decision Tree regression model has the highest RMSE, implying the least accurate predictions.
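
The same comparison can be shown graphically; a minimal sketch reusing model_comparison_rmse:

library(ggplot2)
# Bar chart of RMSE per model; shorter bars indicate better fits
ggplot(model_comparison_rmse, aes(x = reorder(Model, RMSE), y = RMSE)) +
  geom_col(fill = "darkred") +
  coord_flip() +
  labs(title = "RMSE by Regression Model", x = "Model", y = "RMSE") +
  theme_minimal()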


Objective 3: To build classification models for predicting wine type

Decision Tree for predicting the wine type:

#Decision Tree
#predict the type of wine
library(rpart)
library(rpart.plot)
# drop type (col 13) and qualitytype (col 15); type_bin is the response
ctree1<-rpart(type_bin~.,data=train[,-c(13,15)])
ctree1
## n= 3728 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 3728 705.454700 0.253487100  
##    2) chlorides< 0.0615 2692  88.855870 0.034175330  
##      4) total.sulfur.dioxide>=49.5 2590  18.860620 0.007335907 *
##      5) total.sulfur.dioxide< 49.5 102  20.754900 0.715686300  
##       10) chlorides< 0.0405 26   2.653846 0.115384600 *
##       11) chlorides>=0.0405 76   5.526316 0.921052600 *
##    3) chlorides>=0.0615 1036 150.674700 0.823359100  
##      6) total.sulfur.dioxide>=113.5 195  33.517950 0.220512800  
##       12) fixed.acidity< 7.65 147   6.666667 0.047619050 *
##       13) fixed.acidity>=7.65 48   9.000000 0.750000000 *
##      7) total.sulfur.dioxide< 113.5 841  29.857310 0.963139100  
##       14) density< 0.993215 29   6.206897 0.310344800 *
##       15) density>=0.993215 812  10.850990 0.986453200 *
fancyRpartPlot(ctree1)

pred_ctree1<-predict(ctree1,newdata = test[,-13])
pred_ctree1<-ifelse(pred_ctree1>0.5,1,0)
#accuracy, specificity, precision ,etc of the model
# confusionMatrix expects the predictions first, then the reference labels
conf_ctree1<-confusionMatrix(as.factor(pred_ctree1),as.factor(test$type_bin), mode = "everything")
conf_ctree1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1159   39
##          1   19  375
##                                           
##                Accuracy : 0.9636          
##                  95% CI : (0.9532, 0.9722)
##     No Information Rate : 0.7399          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9038          
##                                           
##  Mcnemar's Test P-Value : 0.0126          
##                                           
##             Sensitivity : 0.9839          
##             Specificity : 0.9058          
##          Pos Pred Value : 0.9674          
##          Neg Pred Value : 0.9518          
##               Precision : 0.9674          
##                  Recall : 0.9839          
##                      F1 : 0.9756          
##              Prevalence : 0.7399          
##          Detection Rate : 0.7280          
##    Detection Prevalence : 0.7525          
##       Balanced Accuracy : 0.9448          
##                                           
##        'Positive' Class : 0               
## 
preci<-precision(conf_ctree1$table)
paste("precision is",preci)
## [1] "precision is 0.967445742904841"
recal<-recall(conf_ctree1$table)
paste("recall is ",recal)
## [1] "recall is  0.983870967741935"
fscore<-2*preci*recal/(preci+recal)
paste("f1score is ",fscore)
## [1] "f1score is  0.975589225589226"


The Decision Tree model relies on factors such as chlorides and sulfur dioxide levels to differentiate between wine types. The evaluation metrics obtained from the confusion matrix show a high level of accuracy (96.36%) and precision (96.74%). The F1 score of approximately 97.56% indicates a strong balance between precision and recall, suggesting that the model can consistently identify and classify wine types based on their chemical properties.

SVM for predicting the wine type:

#SVM MODEL-Classification
#Predict the type of wine
#Ensure the e1071 package is installed and loaded
#install.packages("e1071")
library(e1071)
train$type_bin <- as.factor(train$type_bin)
test$type_bin <- as.factor(test$type_bin)
#Fit the SVM model on the training data (drop the quality, type and qualitytype columns)
svm_model <- svm(type_bin ~ ., data = train[,-c(12,13,15)], kernel = "radial")

#Predict the wine type on the test data
svm_predictions <- predict(svm_model, newdata = test[,-c(12,13,15)])
#Evaluate the model's performance
svm_conf_matrix <- confusionMatrix(svm_predictions, test$type_bin)
print(svm_conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1173    7
##          1    5  407
##                                           
##                Accuracy : 0.9925          
##                  95% CI : (0.9869, 0.9961)
##     No Information Rate : 0.7399          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9804          
##                                           
##  Mcnemar's Test P-Value : 0.7728          
##                                           
##             Sensitivity : 0.9958          
##             Specificity : 0.9831          
##          Pos Pred Value : 0.9941          
##          Neg Pred Value : 0.9879          
##              Prevalence : 0.7399          
##          Detection Rate : 0.7368          
##    Detection Prevalence : 0.7412          
##       Balanced Accuracy : 0.9894          
##                                           
##        'Positive' Class : 0               
## 
# Extract precision, recall, and F1 score from the confusion matrix
precision_value <- svm_conf_matrix$byClass["Pos Pred Value"]
recall_value <- svm_conf_matrix$byClass["Sensitivity"]
f1_score <- 2 * (precision_value * recall_value) / (precision_value + recall_value)

# Print precision, recall, and F1 score
print(paste("Precision:", precision_value))
## [1] "Precision: 0.994067796610169"
print(paste("Recall:", recall_value))
## [1] "Recall: 0.995755517826825"
print(paste("F1 Score:", f1_score))
## [1] "F1 Score: 0.994910941475827"


The SVM model showed exceptional performance in identifying the two types of wine, achieving an accuracy of 99.25% together with very high precision and recall. This demonstrates its reliability as a tool for wine classification tasks.
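
The radial kernel was used with its default cost and gamma; e1071 also provides tune.svm for a cross-validated grid search over these hyperparameters. A sketch with an illustrative grid (the values shown are assumptions, not the settings used above):

# Grid-search cost and gamma by cross-validation (can be slow on the full training set)
tuned <- tune.svm(type_bin ~ ., data = train[,-c(12,13,15)],
                  gamma = 10^(-2:0), cost = 10^(0:2))
tuned$best.parameters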


K-NN for predicting the wine type:

##k-NN MODEL- Classification
#install.packages("class")
#install.packages("caret")
library(class)
library(caret)

numeric_train_data <- train[, sapply(train, is.numeric)]
numeric_test_data <- test[, sapply(test, is.numeric)]

maxs <- apply(numeric_train_data, 2, max)
mins <- apply(numeric_train_data, 2, min)

scaled_train_data <- as.data.frame(scale(numeric_train_data, center = mins, scale = maxs - mins))

# Scaling the test data using the same scaling parameters
scaled_test_data <- as.data.frame(scale(numeric_test_data, center = mins, scale = maxs - mins))

# Train the k-NN model
set.seed(123) # for reproducibility
k <- 5 

# Train the model using scaled training data and original labels
knn_pred <- knn(train = scaled_train_data, test = scaled_test_data, cl = train$type_bin, k = k)

# Evaluate the model's performance
knn_conf_matrix <- confusionMatrix(knn_pred, test$type_bin)
knn_conf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1168   15
##          1   10  399
##                                           
##                Accuracy : 0.9843          
##                  95% CI : (0.9769, 0.9898)
##     No Information Rate : 0.7399          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.959           
##                                           
##  Mcnemar's Test P-Value : 0.4237          
##                                           
##             Sensitivity : 0.9915          
##             Specificity : 0.9638          
##          Pos Pred Value : 0.9873          
##          Neg Pred Value : 0.9756          
##              Prevalence : 0.7399          
##          Detection Rate : 0.7337          
##    Detection Prevalence : 0.7431          
##       Balanced Accuracy : 0.9776          
##                                           
##        'Positive' Class : 0               
## 
# Output precision, recall, and F1 score
preci <- precision(knn_conf_matrix$table)
recal <- recall(knn_conf_matrix$table)
fscore <- 2 * preci * recal / (preci + recal)

# Print the performance metrics
print(paste("Precision is", preci))
## [1] "Precision is 0.987320371935757"
print(paste("Recall is", recal))
## [1] "Recall is 0.99151103565365"
print(paste("F1 score is", fscore))
## [1] "F1 score is 0.989411266412537"


The k-NN model employed for classifying wines into types achieved an accuracy of 98.43%. It excelled at correctly labelling wines (wines classified as a given type were almost always of that type) and it also captured the large majority of wines belonging to each type. The F1 score confirms a good balance between precision and recall. Based on these findings, k-NN is a reliable option for classifying wine types.
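
The choice of k = 5 was fixed in advance; a simple sweep over candidate values shows how sensitive the accuracy is to k. A sketch reusing the scaled data from above:

# Evaluate test accuracy for a range of odd k values
ks <- seq(1, 21, by = 2)
acc <- sapply(ks, function(k) {
  pred <- knn(train = scaled_train_data, test = scaled_test_data,
              cl = train$type_bin, k = k)
  mean(pred == test$type_bin)
})
data.frame(k = ks, accuracy = acc)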


Comparative Analysis on the Classification Models:

# Performance metrics for Decision Tree
accuracy_dt <- "96.36%"
kappa_dt <- "0.9038"
sensitivity_dt <- "98.39%"
specificity_dt <- "90.58%"
ppv_dt <- "96.74%" # Positive Predictive Value is equivalent to Precision
npv_dt <- "95.18%" # Negative Predictive Value
f1_dt <- "97.56%" 

#Performance metrics for SVM 
accuracy_svm <- "99.25%"
kappa_svm <- "0.9804"
sensitivity_svm <- "99.58%"
specificity_svm <- "98.31%"
ppv_svm <- "99.41%"
npv_svm <- "98.79%"
f1_svm <- "99.49%"

#Performance metrics for k-NN
accuracy_knn <- "98.43%"
kappa_knn <- "0.959"
sensitivity_knn <- "99.15%"
specificity_knn <- "96.38%"
ppv_knn <- "98.73%" # Positive Predictive Value is equivalent to Precision
npv_knn <- "97.56%" # Negative Predictive Value
f1_knn <- "98.94%"

# Create a data frame with the model performance metrics
# (check.names = FALSE keeps the readable column headers in the table)
model_comparison <- data.frame(
  Metric = c("Accuracy", "Kappa", "Sensitivity (Recall)", "Specificity", "Positive Predictive Value (Precision)", "Negative Predictive Value", "F1 Score"),
  `Model 1 (Decision Tree)` = c(accuracy_dt, kappa_dt, sensitivity_dt, specificity_dt, ppv_dt, npv_dt, f1_dt),
  `Model 2 (SVM)` = c(accuracy_svm, kappa_svm, sensitivity_svm, specificity_svm, ppv_svm, npv_svm, f1_svm),
  `Model 3 (k-NN)` = c(accuracy_knn, kappa_knn, sensitivity_knn, specificity_knn, ppv_knn, npv_knn, f1_knn),
  check.names = FALSE
)

#Comparison Table 
knitr::kable(model_comparison, format = "pipe", caption = "Comparative Analysis of Decision Tree Models, SVM, and k-NN")

Table: Comparative Analysis of Decision Tree Models, SVM, and k-NN

|Metric                                |Model 1 (Decision Tree) |Model 2 (SVM) |Model 3 (k-NN) |
|:-------------------------------------|:-----------------------|:-------------|:--------------|
|Accuracy                              |96.36%                  |99.25%        |98.43%         |
|Kappa                                 |0.9038                  |0.9804        |0.959          |
|Sensitivity (Recall)                  |98.39%                  |99.58%        |99.15%         |
|Specificity                           |90.58%                  |98.31%        |96.38%         |
|Positive Predictive Value (Precision) |96.74%                  |99.41%        |98.73%         |
|Negative Predictive Value             |95.18%                  |98.79%        |97.56%         |
|F1 Score                              |97.56%                  |99.49%        |98.94%         |

Comparative Analysis Explanation for Classification Model:

Accuracy: The SVM model is the most accurate (99.25%), followed by the k-NN model (98.43%). The Decision Tree model’s accuracy is lower at 96.36%.

Kappa: The SVM has the highest kappa value of 0.9804, indicating very strong agreement. The k-NN’s kappa is also high at 0.959, and the Decision Tree’s kappa is 0.9038, which is good but lower than the other two models.

Sensitivity and Specificity: The SVM and k-NN models both show very high sensitivity and specificity, with SVM slightly outperforming k-NN. The Decision Tree’s sensitivity is comparable, but its specificity is noticeably lower.

Predictive Values: The SVM model has the highest positive and negative predictive values, indicating the greatest likelihood that its predictions are correct. The k-NN model’s predictive values are slightly lower but still very high, while the Decision Tree trails with a precision of 96.74% and a negative predictive value of 95.18%.

F1 Score: The SVM and k-NN models both have very high F1 scores, at 99.49% and 98.94% respectively. The F1 score for the Decision Tree is 97.56%, slightly lower than the other two models but still indicating a strong balance between precision and recall.

Overall, the SVM model is the top-performing model among the three. It achieves the highest accuracy, kappa, sensitivity, specificity, precision, and F1 score. The k-NN model is also highly effective, with only marginally lower metrics than SVM. The Decision Tree model, while not performing at the same level as SVM and k-NN, still shows robust performance across all metrics.


Conclusion:

A thorough analysis of wine quality prediction using regression and classification methods has provided valuable insights and practically applicable models. The project centered on Portuguese “Vinho Verde” wines, using datasets that contained physicochemical properties and expert quality ratings. Thorough preprocessing carefully prepared the data and ensured its suitability for analysis, which played a vital role in the subsequent modeling stages.

Through the use of different visualization techniques like histograms, bar charts, box plots, and heatmaps, a comprehensive understanding of the distribution and correlation of the physicochemical variables was obtained. The selection of appropriate features for predictive modeling was guided by this foundational knowledge.

In the regression analysis, three models were trained to predict numerical wine quality scores: Random Forest, XGBoost, and Decision Trees. The Random Forest model stood out as the top performer, with the lowest RMSE, indicating its proficiency in handling the continuous quality variable and delivering the most accurate predictions.

Three models were used for the classification task of distinguishing red from white wines: Decision Tree, SVM, and k-NN. The SVM model showed remarkable accuracy and precision, performing better than the other models on nearly all evaluation metrics; hence it is the most dependable option for classifying the wine type.

The project achieved its goals by discovering important physicochemical properties that impact wine quality, constructing regression models to forecast wine quality scores, and designing classification models to differentiate between different types of wine.

Future studies could build on this foundation by integrating additional information, exploring more complex machine learning techniques, and possibly including sensory data from tasters to enhance the models even further. In addition, implementing these models in practical environments could offer valuable insights for wine producers, empowering them to make informed decisions based on data and improve the quality and categorization of their wines.


References:
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Wine Quality [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C56S3T