Introduction:
The
evaluation of wine quality is a fine art usually gate kept with a select
flew, conventionally dependent on a blend of chemical characteristics
and subjective evaluations by experts. This study analyses two separate
datasets containing “Vinho Verde” wine variations from Portugal. Paulo
Cortez and his team carefully assembled these datasets in 2009. The
datasets contain a range of physical/chemical characteristics, and
quality ratings ranging from 0 to 10. Our objective is to utilise
statistical learning methods, notably regression and classification, to
forecast and classify the quality of wine.
About Dataset:
The
dataset titled “Wine Quality” is available on the UCI Machine Learning
Repository. It contains information about both red and white versions of
Portuguese “Vinho Verde” wine. The dataset consists of 1,599 samples of
red wine and 4,898 samples of white wine. It provides information on 11
physicochemical characteristics, including acidity, sugar content, and
alcohol levels. Additionally, the dataset includes quality ratings
assigned by experts on a scale ranging from 0 to 10. This dataset is
commonly utilized for regression and classification tasks in the field
of machine learning, with the specific objective of predicting the
quality of wine based on its chemical makeup.
Wine-Quality UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/186/wine+quality
Objectives:
Objective 1: To determine the essential
physicochemical properties that affect the quality of wine
To determine exactly what attributes of wine. Attributes include acidity, sugar level, alcohol content, sulphates, pH level, and others which may have impact on wine taste.
Objective 2: To build regression models for predicting wine quality
The objective is to create and optimize regression model that can accurately predict the numerical quality score of wines using their physicochemical attributes.
Objective 3: To build classification models for predicting wine type
The objective is to create a classification model that can reliably identify red and white wines based on quantitative properties including acidity, alcohol content, and sugar levels.
Data Preprocessing:
Necessary libraries “tidyr”, “dplyr”, “ggplot2” are loaded.
library(tidyr)
library(dplyr)
library(ggplot2)
Set the working directory and load the winequality-red and winequality-white datasets, check number of rows and columns in both datasets
file_path <- "C:/Users/Sowjanya/OneDrive/Documents/R Project"
red <- read.csv("winequality-red.csv", header = TRUE)
red$type <- "red"
white <- read.csv("winequality-white.csv", header = TRUE)
white$type <- "white"
#View(red)
dim(red)
## [1] 1599 2
#View(white)
dim(white)
## [1] 4898 2
Append both winequality-red and winequality-white datasets together
wine <- rbind(red, white)
dim(wine)
## [1] 6497 2
Combined dataset is cleaned by removing empty rows/columns (Findings: No empty rows/columns)
wine <- wine[rowSums(is.na(wine)) != ncol(wine),]
wine <- wine[, colSums(is.na(wine)) != nrow(wine)]
dim(wine)
## [1] 6497 2
Check for missing values (Findings: No missing values)
any(is.na(wine))
## [1] FALSE
Remove duplicate rows
wine1 <- wine %>% distinct()
dim(wine1)
## [1] 5320 2
Summary stats of the dataset
summary(wine1)
## fixed.acidity.volatile.acidity.citric.acid.residual.sugar.chlorides.free.sulfur.dioxide.total.sulfur.dioxide.density.pH.sulphates.alcohol.quality
## Length:5320
## Class :character
## Mode :character
## type
## Length:5320
## Class :character
## Mode :character
Classes of wine1 dataset
sapply(wine1, class)
## fixed.acidity.volatile.acidity.citric.acid.residual.sugar.chlorides.free.sulfur.dioxide.total.sulfur.dioxide.density.pH.sulphates.alcohol.quality
## "character"
## type
## "character"
A new binary column type_bin is created to differentiate red and white wines. (“red - 1”, “white - 0”)
wine1$type_bin <- ifelse(wine1$type == "red", 1, 0)
head(wine1)
## fixed.acidity.volatile.acidity.citric.acid.residual.sugar.chlorides.free.sulfur.dioxide.total.sulfur.dioxide.density.pH.sulphates.alcohol.quality
## 1 7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
## 2 7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
## 3 7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
## 4 11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
## 5 7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
## 6 7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5
## type type_bin
## 1 red 1
## 2 red 1
## 3 red 1
## 4 red 1
## 5 red 1
## 6 red 1
tail(wine1)
## fixed.acidity.volatile.acidity.citric.acid.residual.sugar.chlorides.free.sulfur.dioxide.total.sulfur.dioxide.density.pH.sulphates.alcohol.quality
## 5315 6.5;0.23;0.38;1.3;0.032;29;112;0.99298;3.29;0.54;9.7;5
## 5316 6.2;0.21;0.29;1.6;0.039;24;92;0.99114;3.27;0.5;11.2;6
## 5317 6.6;0.32;0.36;8;0.047;57;168;0.9949;3.15;0.46;9.6;5
## 5318 6.5;0.24;0.19;1.2;0.041;30;111;0.99254;2.99;0.46;9.4;6
## 5319 5.5;0.29;0.3;1.1;0.022;20;110;0.98869;3.34;0.38;12.8;7
## 5320 6;0.21;0.38;0.8;0.02;22;98;0.98941;3.26;0.32;11.8;6
## type type_bin
## 5315 white 0
## 5316 white 0
## 5317 white 0
## 5318 white 0
## 5319 white 0
## 5320 white 0
The cleaned data is exported to wine_cleaned.csv.
# Use absolute path
file_path <- "C:/Users/Sowjanya/OneDrive/Documents/R Project/wine_cleaned.csv"
data <- read.csv(file_path)
output_file <- "wine_cleaned.csv"
Data Visualization and Exploratory Data Analysis:
Histograms: Created for various physicochemical variables like fixed acidity, volatile acidity, citric acid, etc., with differentiation between red and white wine types.
Bar Charts: Illustrate the distribution of wine quality across different wine types.
Grouped Bar Plots: Comparing different chemical attributes that determine wine taste, such as fixed acidity, alcohol, and residual sugar.
Box Plot: Focus on the distribution of citric acid across different wine types.
Scatter Plot: Analyse the relationship between selected pairs of variables.
Correlation Heatmap: Visualizing the correlation matrix to identify relationships between various physicochemical properties.
Necessary libraries like “tidyr”, “dplyr”, “ggplot2”, “corrplot”, “plotly” are used.
head(data)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.66 0.00 1.8 0.075
## 6 7.9 0.60 0.06 1.6 0.069
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 13 40 0.9978 3.51 0.56 9.4
## 6 15 59 0.9964 3.30 0.46 9.4
## quality type type_bin
## 1 5 red 1
## 2 5 red 1
## 3 5 red 1
## 4 6 red 1
## 5 5 red 1
## 6 5 red 1
summary(data)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2400 1st Qu.: 1.800
## Median : 7.000 Median :0.3000 Median :0.3100 Median : 2.700
## Mean : 7.215 Mean :0.3441 Mean :0.3185 Mean : 5.048
## 3rd Qu.: 7.700 3rd Qu.:0.4100 3rd Qu.:0.4000 3rd Qu.: 7.500
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.00900 Min. : 1.00 Min. : 6.0 Min. :0.9871
## 1st Qu.:0.03800 1st Qu.: 16.00 1st Qu.: 74.0 1st Qu.:0.9922
## Median :0.04700 Median : 28.00 Median :116.0 Median :0.9947
## Mean :0.05669 Mean : 30.04 Mean :114.1 Mean :0.9945
## 3rd Qu.:0.06600 3rd Qu.: 41.00 3rd Qu.:153.2 3rd Qu.:0.9968
## Max. :0.61100 Max. :289.00 Max. :440.0 Max. :1.0390
## pH sulphates alcohol quality
## Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000
## 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.210 Median :0.5100 Median :10.40 Median :6.000
## Mean :3.225 Mean :0.5334 Mean :10.55 Mean :5.796
## 3rd Qu.:3.330 3rd Qu.:0.6000 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :9.000
## type type_bin
## Length:5320 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000
## Mean :0.2555
## 3rd Qu.:1.0000
## Max. :1.0000
# Check for missing values
any(is.na(data))
## [1] FALSE
#Visualization of all Variables-Histogram
numerical_vars <- c(
"fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar", "chlorides", "free.sulfur.dioxide",
"total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol", "quality"
)
data <- read.csv("wine_cleaned.csv")
library(ggplot2)
# Define colors for 'red' and 'white' types
type_colors <- c("red" = "darkred", "white" = "skyblue")
for (var in numerical_vars) {
print(
ggplot(data, aes(x = get(var), fill = type)) +
geom_histogram(binwidth = 1, color = "white") +
scale_fill_manual(values = type_colors) +
labs(title = paste("Histogram: ", var), x = var, y = "Count") + theme_minimal()
)
}
When examining variables such as fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality, it is evident the noticeable variations in concentration levels between red and white wines. These differences are highlighted through color-coded histograms. For example, the histograms for alcohol content and quality indicate a difference in the distribution between the two types of wines, suggesting that these factors may have varying impacts on the perceived quality of each wine type. These visualizations allow for further analysis on the impact of chemical properties on the sensory attributes and consumer preferences of red and white wines.
# Visualization of Wine Quality by Wine Type - Bar Chart
data <- read.csv("wine_cleaned.csv")
ggplot(data, aes(x = factor(quality), fill = factor(type))) + geom_bar(position = "dodge", alpha = 0.7) +
labs(title = "Bar Chart of Wine Quality by Wine Type", x = "Quality", y = "Count") + theme_minimal()
The bar chart provides a clear distinction between the quality of red
and white wines, displaying the distribution of wine ratings. The
analysis showcases the frequency of specific quality scores for each
type of wine.
# Visualization - Grouped Bar Plot
library(ggplot2)
data <- read.csv("wine_cleaned.csv")
# fixed acidity, alcohol and residual sugar determine the taste of the Wine
numerical_vars <- c("fixed.acidity", "alcohol", "residual.sugar")
data_long <- tidyr::gather(data, key = "variable", value = "value", all_of(numerical_vars))
ggplot(data_long, aes(x = type, fill = variable, y = value)) +
geom_bar(stat = "identity", position = "dodge", color = "black") +
scale_fill_manual(values = c("fixed.acidity" = "orange", "alcohol" = "darkblue", "residual.sugar" = "purple")) +
facet_wrap(~variable, scales = "free_y", ncol = 1) +
labs(title = "Grouped Bar Plot of Chemical Attributes that Determine Wine Taste",
x = "Wine Type",
y = "Variable Value") +
theme_minimal()
The grouped bar plot shows the main chemical characteristics that impact the flavor of wine. With emphasis on alcohol, fixed acidity, and residual sugar. The difference between each of these characteristics based on wine type is clear, enabling an easy analysis of how these factors differ between red and white wines. This information can provide insights to winemakers and enhance the understanding of tasting profiles.
#Visualization - Box Plot
# Bivariate Analysis
data <- read.csv("wine_cleaned.csv")
library(ggplot2)
ggplot(data, aes(x = type, y = citric.acid, fill = type)) +
geom_boxplot(alpha = 1.0, color = "black") +
labs(title = "Distribution of Citric Acid by Wine Type", x = "Wine Type", y = "Citric Acid") +
scale_fill_manual(values = c("red" = "darkred", "white" = "lightblue")) +
theme_minimal() +
theme(legend.position = "top", legend.title = element_blank())
The box plot shows the difference in citric acid content between red and
white wines. It is observed that red wines generally have lower levels
of citric acid compared to white wines. The visualization displays the
distribution and central tendency of the citric acid variable for each
wine type.
#Visualization - Scatter Plot
#Multivariate Analysis
library(ggplot2)
data <- read.csv("wine_cleaned.csv")
selected_vars <- c("fixed.acidity", "alcohol", "volatile.acidity", "sulphates", "quality", "type")
ggplot(data, aes(x = .data[[selected_vars[1]]], y = .data[[selected_vars[2]]], color = type)) +
geom_point(alpha = 0.5) +
labs(title = "Scatter Plot: Selected Variables by Wine Type", x = selected_vars[1], y = selected_vars[2]) +
theme_minimal()
The scatter plot shows the variations of fixed acidity and alcohol
levels in red and white wines. Both types show a wide range, with white
wines typically displaying lower acidity. The overlap suggests that red
and white wines exhibit similar acid and alcohol levels, allowing for a
convenient visual comparison between the two wine types.
#Visualization
#Correlation Heatmap
#install.packages("corrplot")
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
data <- read.csv("wine_cleaned.csv")
correlation_matrix <- cor(data[, sapply(data, is.numeric)])
par(mar = c(6, 6, 6, 6))
corrplot(correlation_matrix, method = "color", type = "upper", addCoef.col = "black", number.cex = 0.7)
The correlation heatmap provides a visual representation of the
relationships between the physicochemical properties of wines,
showcasing their strength and direction. Stronger correlations are
observed between free sulfur dioxide and total sulfur dioxide, with
darker shades indicating this relationship. On the other hand, the light
areas indicate less strong correlations. This map provides valuable
insights into the detailed connections between various factors and their
potential impact on the quality of wine.
Model Building:
Necessary libraries like “rpart”, “rattle”, “randomForest”, “xgboost”, “caret”, “e1071”, “caTools” are used.
library(caTools)
set.seed(123)
#wine quality is turned into nominal data
data$qualitytype<-ifelse(data$quality<6,'bad','good')
data$qualitytype[data$quality==6]<-'normal'
data$qualitytype<-as.factor(data$qualitytype)
ggplot(data=data)+geom_bar(mapping = aes(x=quality, fill = factor(quality)), stat = "count") +
scale_fill_manual(values = c("3" = "red", "4" = "blue", "5" = "green", "6" = "orange", "7" = "purple", "8" = "pink"))+
labs(x = "Quality", y = "Count") +
theme_minimal()
A new variable called “qualitytype” is created to assess the quality of wine. The bar chart displays the ratings of wines in terms of quality, with a significant portion falling into the average category. Wines are categorized into different quality ratings, which are then utilized to prepare the data for training machine learning models.
#split data into two parts as train and test
options(repos = c(CRAN = "https://cloud.r-project.org"))
#install.packages("rattle")
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked _by_ '.GlobalEnv':
##
## wine
part<-sample.split(data$fixed.acidity,SplitRatio = 0.7)
train<-data[part,]
test<-data[!part,]
library(rpart)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(rpart.plot)
Objective 1: To determine the essential physicochemical
properties that affect the quality of wine
#Random Forest
#Determine the essential physicochemical properties
#Load necessary libraries
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
# Read the cleaned wine data
wine_data <- read.csv('C:/Users/Sowjanya/OneDrive/Documents/R Project/wine_cleaned.csv')
wine_data$quality <- as.factor(wine_data$quality)
# Split the data into training and testing sets
set.seed(123) # For reproducibility
trainIndex <- createDataPartition(wine_data$quality, p = 0.7, list = FALSE)
trainData <- wine_data[trainIndex, ]
testData <- wine_data[-trainIndex, ]
# Train the model
rfModel <- randomForest(quality ~ ., data = trainData, importance = TRUE, ntree = 500)
# View the model results
print(rfModel)
##
## Call:
## randomForest(formula = quality ~ ., data = trainData, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 43.51%
## Confusion matrix:
## 3 4 5 6 7 8 9 class.error
## 3 0 0 12 8 1 0 0 1.0000000
## 4 0 7 91 45 2 0 0 0.9517241
## 5 0 7 758 448 14 0 0 0.3822331
## 6 0 3 364 1131 127 2 0 0.3048556
## 7 0 0 18 373 205 4 0 0.6583333
## 8 0 0 1 63 35 5 0 0.9519231
## 9 0 0 0 3 1 0 0 1.0000000
# Make predictions
predictions <- predict(rfModel, newdata = testData)
# Evaluate the model
confMatrix <- confusionMatrix(predictions, testData$quality)
print(confMatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8 9
## 3 0 1 0 0 0 0 0
## 4 0 1 3 1 0 0 0
## 5 5 35 349 152 8 0 0
## 6 4 24 167 492 172 26 0
## 7 0 0 6 50 73 17 1
## 8 0 0 0 1 3 1 0
## 9 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5754
## 95% CI : (0.5507, 0.5998)
## No Information Rate : 0.4372
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.324
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.0000000 0.0163934 0.6648 0.7069 0.28516 0.0227273
## Specificity 0.9993683 0.9973873 0.8126 0.5614 0.94461 0.9974160
## Pos Pred Value 0.0000000 0.2000000 0.6357 0.5559 0.49660 0.2000000
## Neg Pred Value 0.9943432 0.9621928 0.8313 0.7115 0.87336 0.9729049
## Prevalence 0.0056533 0.0383166 0.3298 0.4372 0.16080 0.0276382
## Detection Rate 0.0000000 0.0006281 0.2192 0.3090 0.04585 0.0006281
## Detection Prevalence 0.0006281 0.0031407 0.3448 0.5559 0.09234 0.0031407
## Balanced Accuracy 0.4996841 0.5068904 0.7387 0.6341 0.61488 0.5100716
## Class: 9
## Sensitivity 0.0000000
## Specificity 1.0000000
## Pos Pred Value NaN
## Neg Pred Value 0.9993719
## Prevalence 0.0006281
## Detection Rate 0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy 0.5000000
# Get the importance matrix
importanceMatrix <- importance(rfModel)
# Filter out 'type' and 'type_bin' from the importance matrix
varImportance <- data.frame(Variable = rownames(importanceMatrix), Importance = importanceMatrix[, "MeanDecreaseGini"])
varImportance <- varImportance[!varImportance$Variable %in% c("type", "type_bin"), ]
# Visualize the importance (Using the correct importance measure)
ggplot(varImportance, aes(x = reorder(Variable, Importance), y = Importance, fill = Importance)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_minimal() +
scale_fill_gradient(low = "Lavender", high = "Thistle") +
xlab("Physicochemical Properties") +
ylab("Essential") +
ggtitle("Essential physicochemical properties using the Random Forest Model\n")
The Random Forest model determines that alcohol, density, and
volatile.acidity are the primary physicochemical properties that have an
impact on wine quality. These properties hold a lot more importance when
compared to others, such as total sulfur dioxide, sulphates, and
chlorides. Free sulfur dioxide, pH, residual sugar, citric acid, and
fixed acidity all make smaller but still significant contributions. This
shows that the secret to enhancing wine quality may lie in wine making
techniques that optimize alcohol content, control acidity levels, and
monitor density.
Objective 2: To build regression models for predicting wine quality
Random Forest for predicting the quality of
wine:
# RANDOM FOREST - Regression
library(randomForest)
library(caret)
# Ensure the 'quality' column is treated as numeric
wine_data$quality <- as.numeric(wine_data$quality)
# Split the data into training and testing sets
set.seed(123) # Setting seed for reproducibility
indexes <- createDataPartition(wine_data$quality, p = 0.7, list = FALSE)
train_set <- wine_data[indexes, ]
test_set <- wine_data[-indexes, ]
# Train the Random Forest regression model
rf_model <- randomForest(quality ~ ., data = train_set)
# Make predictions on the test set
predictions <- predict(rf_model, newdata = test_set)
# Calculate RMSE
rmse <- sqrt(mean((test_set$quality - predictions)^2))
# Return the RMSE
rmse
## [1] 0.6985049
The Random Forest regression model performs exceptionally well, with a remarkably low RMSE (Root Mean Square Error) of 0.6985049. This RMSE demonstrates the model’s ability to predict wine quality with high precision, making it ideal for regression tasks.
In the context of regression, the model excels at making accurate predictions about wine quality as a continuous variable. Its low RMSE indicates minimal prediction error and a good fit to the data. The RF model outperforms that of the Decision Tree and XGBoost regression models.
XGBoost for predicting the quality of wine:
# XGBoost-Regression
set.seed(123)
# Define predictors and responses
train_x <- data.matrix(train[,-c(12, 13, 15)])
train_y <- data.matrix(train[, 12])
test_x <- data.matrix(test[,-c(12, 13, 15)])
test_y <- data.matrix(test[, 12])
# Load the xgboost library
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:plotly':
##
## slice
## The following object is masked from 'package:rattle':
##
## xgboost
## The following object is masked from 'package:dplyr':
##
## slice
# Create DMatrix objects for training and testing data
train_xgb <- xgb.DMatrix(data = train_x, label = train_y)
test_xgb <- xgb.DMatrix(data = test_x, label = test_y)
# Train the XGBoost model
xgb <- xgb.train(
data = train_xgb,
max.depth = 3,
watchlist = list(train = train_xgb, test = test_xgb),
nrounds = 100,
verbose = 0
)
# Determine the number of rounds with the minimum RMSE
xgb_final <- xgboost(data = train_xgb, max.depth = 3, nrounds = 88, verbose = 0)
# Make predictions
pred_xgb <- predict(xgb_final, newdata = test_xgb)
# Calculate Mean Squared Error (MSE)
mse <- mean((test_y - pred_xgb)^2)
# Calculate Mean Absolute Error (MAE)
mae <- caret::MAE(test_y, pred_xgb)
# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((test_y - pred_xgb)^2))
# Convert numeric predictions to a factor with thresholds
pred_xgb_factor <- as.factor(ifelse(pred_xgb < 5, 'Low', ifelse(pred_xgb > 7, 'High', 'Medium')))
# Convert the actual numeric values to the same categorical levels for consistency
test_y_factor <- as.factor(ifelse(test_y < 5, 'Low', ifelse(test_y > 7, 'High', 'Medium')))
# Create a confusion matrix
conf_matrix_xgb <- confusionMatrix(pred_xgb_factor, test_y_factor)
# Output the confusion matrix and RMSE
print(conf_matrix_xgb)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low Medium
## High 4 0 9
## Low 0 28 84
## Medium 46 43 1378
##
## Overall Statistics
##
## Accuracy : 0.8857
## 95% CI : (0.869, 0.9009)
## No Information Rate : 0.924
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2124
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: High Class: Low Class: Medium
## Sensitivity 0.080000 0.39437 0.9368
## Specificity 0.994163 0.94477 0.2645
## Pos Pred Value 0.307692 0.25000 0.9393
## Neg Pred Value 0.970868 0.97095 0.2560
## Prevalence 0.031407 0.04460 0.9240
## Detection Rate 0.002513 0.01759 0.8656
## Detection Prevalence 0.008166 0.07035 0.9215
## Balanced Accuracy 0.537082 0.66957 0.6006
print(paste("RMSE:", rmse))
## [1] "RMSE: 0.708532210672326"
The XGBoost regression model performs admirably, with an RMSE (Root Mean Square Error) of only 0.7085322. The RMSE value demonstrates the model’s ability to provide highly precise predictions for wine quality, making it ideal for regression tasks.
In terms of regression analysis, the model excels at making accurate predictions for wine quality as a continuous variable. The low RMSE indicates minimal prediction error and a strong fit to the dataset. This level of performance outperforms the Decision Tree regression model, demonstrating XGBoost’s superiority in regression applications.
Decision Tree for predicting the quality of
wine:
#Decision tree
#predict the quality of wine.
library(readr)
library(caret)
library(rpart)
library(rpart.plot)
wine_data$quality <- as.factor(wine_data$quality)
set.seed(123)
dtModel<-rpart(quality~.,data = train[,-c(14,15)])
dtModel
## n= 3728
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3728 2866.5170 5.799088
## 2) alcohol< 10.91667 2435 1393.5330 5.536345
## 4) volatile.acidity>=0.285 1297 590.4225 5.329221 *
## 5) volatile.acidity< 0.285 1138 684.0536 5.772408
## 10) alcohol< 10.11667 692 380.8367 5.643064 *
## 11) alcohol>=10.11667 446 273.6771 5.973094 *
## 3) alcohol>=10.91667 1293 988.3217 6.293890
## 6) alcohol< 11.675 561 420.0570 6.040998 *
## 7) alcohol>=11.675 732 504.8893 6.487705
## 14) free.sulfur.dioxide< 21.5 262 206.0305 6.213740 *
## 15) free.sulfur.dioxide>=21.5 470 268.2319 6.640426 *
#visualize the decision tree
fancyRpartPlot(dtModel)
#model evaluation
pred_dtModel<-predict(dtModel,newdata = test)
#summary of the prediction vs summary of the real test data
summary(pred_dtModel)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.329 5.329 5.643 5.806 6.041 6.640
summary(test$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.788 6.000 9.000
RMSE_dtModel<-sqrt(mean((pred_dtModel-test$quality)^2))
print("RMSE of the model")
## [1] "RMSE of the model"
RMSE_dtModel
## [1] 0.7730765
#data frame with real response and the prediction
dat<-data.frame(test$quality,pred_dtModel)
head(dat)
## test.quality pred_dtModel
## 5 5 5.329221
## 6 5 5.329221
## 7 7 5.329221
## 15 7 5.973094
## 16 5 5.329221
## 18 6 5.329221
Comparison Table for Regression Models:
# RMSE for the Random Forest model
rmse_rf <- 0.6985049
# RMSE for the Decision Tree model (Regression)
rmse_dt <- 0.7730765
# RMSE for the XGBoost model (Regression)
rmse_xgb <- 0.7085322
# Create a data frame with the model performance metrics based on RMSE
model_comparison_rmse <- data.frame(
Model = c("Random Forest (Regression)", "XGBoost (Regression)", "Decision Tree (Regression)"),
RMSE = c(rmse_rf, rmse_xgb, rmse_dt)
)
#
knitr::kable(model_comparison_rmse, format = "pipe", caption = "Comparative Analysis of Models Based on RMSE")
| Model | RMSE |
|---|---|
| Random Forest (Regression) | 0.6985049 |
| XGBoost (Regression) | 0.7085322 |
| Decision Tree (Regression) | 0.7730765 |
Comparative Analysis Explanation for Regression
Model:
1. Random Forest (Regression):
RMSE = 0.6985049.
Interpretation: The Random Forest regression model has a slightly lower
RMSE than the XGBoost and Decision Tree regression models. This shows
that Random Forest makes the most accurate predictions with the lowest
prediction error.
2. XGBoost Regression:
RMSE = 0.7085322.
Interpretation: XGBoost has a moderate prediction error for the
continuous target variable. Its RMSE is greater than that of Random
Forest, indicating that it is less accurate in comparison.
3. Decision Tree (Regression):
RMSE = 0.7849993.
Interpretation: The Decision Tree regression model has the highest RMSE
of the three models, indicating the least accurate predictions and the
greatest prediction error.
In a comparative analysis, Random Forest outperforms the other two regression models with lower RMSE and higher prediction accuracy. XGBoost, with a moderate RMSE, is less accurate than Random Forest but outperforms the Decision Tree regression model. The Decision Tree regression model has the highest RMSE, implying the least accurate predictions.
Objective 3: To build classification models for
predicting wine type
Decision Tree for predicting the wine type:
#Decision Tree
#predict the type of wine
library(rpart)
library(rpart.plot)
ctree1<-rpart(type_bin~.,data=train[,-c(13,15)])
ctree1
## n= 3728
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3728 705.454700 0.253487100
## 2) chlorides< 0.0615 2692 88.855870 0.034175330
## 4) total.sulfur.dioxide>=49.5 2590 18.860620 0.007335907 *
## 5) total.sulfur.dioxide< 49.5 102 20.754900 0.715686300
## 10) chlorides< 0.0405 26 2.653846 0.115384600 *
## 11) chlorides>=0.0405 76 5.526316 0.921052600 *
## 3) chlorides>=0.0615 1036 150.674700 0.823359100
## 6) total.sulfur.dioxide>=113.5 195 33.517950 0.220512800
## 12) fixed.acidity< 7.65 147 6.666667 0.047619050 *
## 13) fixed.acidity>=7.65 48 9.000000 0.750000000 *
## 7) total.sulfur.dioxide< 113.5 841 29.857310 0.963139100
## 14) density< 0.993215 29 6.206897 0.310344800 *
## 15) density>=0.993215 812 10.850990 0.986453200 *
fancyRpartPlot(ctree1)
pred_ctree1<-predict(ctree1,newdata = test[,-13])
pred_ctree1<-ifelse(pred_ctree1>0.5,1,0)
#accuracy, specificity, precision ,etc of the model
conf_ctree1<-confusionMatrix(as.factor(test$type_bin),as.factor(pred_ctree1), mode = "everything")
conf_ctree1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1159 19
## 1 39 375
##
## Accuracy : 0.9636
## 95% CI : (0.9532, 0.9722)
## No Information Rate : 0.7525
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9038
##
## Mcnemar's Test P-Value : 0.0126
##
## Sensitivity : 0.9674
## Specificity : 0.9518
## Pos Pred Value : 0.9839
## Neg Pred Value : 0.9058
## Precision : 0.9839
## Recall : 0.9674
## F1 : 0.9756
## Prevalence : 0.7525
## Detection Rate : 0.7280
## Detection Prevalence : 0.7399
## Balanced Accuracy : 0.9596
##
## 'Positive' Class : 0
##
preci<-precision(conf_ctree1$table)
paste("precision is",preci)
## [1] "precision is 0.983870967741935"
recal<-recall(conf_ctree1$table)
paste("recall is ",recal)
## [1] "recall is 0.967445742904841"
fscore<-2*preci*recal/(preci+recal)
paste("f1score is ",fscore)
## [1] "f1score is 0.975589225589226"
The Decision Tree model takes into account factors such as
chlorides and sulphur dioxide levels in order to differentiate between
different types of wines. The evaluation metrics obtained from the
confusion matrix show a remarkable level of accuracy (96.36%) and
precision (98.39%), which serves as an evidence to the exceptional
performance of the model. The F1 score indicates a strong balance
between precision and recall, with a high value of approximately 97.56%.
This suggests that the model is able to consistently identify and
classify wine types based on their chemical properties.
SVM for predicting the wine type:
#SVM MODEL-Classification
#Predict the type of wine
#Ensure the e1071 package is installed and loaded
#install.packages("e1071")
library(e1071)
train$type_bin <- as.factor(train$type_bin)
test$type_bin <- as.factor(test$type_bin)
#Fit the SVM model on the training data
svm_model <- svm(type_bin ~ ., data = train[,-c(12,13,15)], kernel = "radial")
#Predict the wine type on the test data
svm_predictions <- predict(svm_model, newdata = test[,-c(12,13,15)])
#Evaluate the model's performance
svm_conf_matrix <- confusionMatrix(svm_predictions, test$type_bin)
print(svm_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1173 7
## 1 5 407
##
## Accuracy : 0.9925
## 95% CI : (0.9869, 0.9961)
## No Information Rate : 0.7399
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9804
##
## Mcnemar's Test P-Value : 0.7728
##
## Sensitivity : 0.9958
## Specificity : 0.9831
## Pos Pred Value : 0.9941
## Neg Pred Value : 0.9879
## Prevalence : 0.7399
## Detection Rate : 0.7368
## Detection Prevalence : 0.7412
## Balanced Accuracy : 0.9894
##
## 'Positive' Class : 0
##
# Extract precision, recall, and F1 score from the confusion matrix
precision_value <- svm_conf_matrix$byClass["Pos Pred Value"]
recall_value <- svm_conf_matrix$byClass["Sensitivity"]
f1_score <- 2 * (precision_value * recall_value) / (precision_value + recall_value)
# Print precision, recall, and F1 score
print(paste("Precision:", precision_value))
## [1] "Precision: 0.994067796610169"
print(paste("Recall:", recall_value))
## [1] "Recall: 0.995755517826825"
print(paste("F1 Score:", f1_score))
## [1] "F1 Score: 0.994910941475827"
The SVM model shown exceptional performance in accurately identifying different types of wine, achieving an impressive accuracy rate of over 99.25%. The model demonstrated a remarkable level of precision and recall. This demonstrates its reliability as a tool for wine classification tasks.
K-NN for predicting the wine type:
##k-NN MODEL- Classification
#install.packages("class")
#install.packages("caret")
library(class)
library(caret)
numeric_train_data <- train[, sapply(train, is.numeric)]
numeric_test_data <- test[, sapply(test, is.numeric)]
maxs <- apply(numeric_train_data, 2, max)
mins <- apply(numeric_train_data, 2, min)
scaled_train_data <- as.data.frame(scale(numeric_train_data, center = mins, scale = maxs - mins))
# Scaling the test data using the same scaling parameters
scaled_test_data <- as.data.frame(scale(numeric_test_data, center = mins, scale = maxs - mins))
# Train the k-NN model
set.seed(123) # for reproducibility
k <- 5
# Train the model using scaled training data and original labels
knn_pred <- knn(train = scaled_train_data, test = scaled_test_data, cl = train$type_bin, k = k)
# Evaluate the model's performance
knn_conf_matrix <- confusionMatrix(knn_pred, test$type_bin)
knn_conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1168 15
## 1 10 399
##
## Accuracy : 0.9843
## 95% CI : (0.9769, 0.9898)
## No Information Rate : 0.7399
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.959
##
## Mcnemar's Test P-Value : 0.4237
##
## Sensitivity : 0.9915
## Specificity : 0.9638
## Pos Pred Value : 0.9873
## Neg Pred Value : 0.9756
## Prevalence : 0.7399
## Detection Rate : 0.7337
## Detection Prevalence : 0.7431
## Balanced Accuracy : 0.9776
##
## 'Positive' Class : 0
##
# Output precision, recall, and F1 score
preci <- precision(knn_conf_matrix$table)
recal <- recall(knn_conf_matrix$table)
fscore <- 2 * preci * recal / (preci + recal)
# Print the performance metrics
print(paste("Precision is", preci))
## [1] "Precision is 0.987320371935757"
print(paste("Recall is", recal))
## [1] "Recall is 0.99151103565365"
print(paste("F1 score is", fscore))
## [1] "F1 score is 0.989411266412537"
The k-NN model employed for classifying wines into different types achieve an accuracy rate exceeding 98.43%. The system excelled in accurately identifying and categorizing wines, ensuring that wines classified as a specific type were indeed of that type. It also captured a significant majority of wines belonging to a particular category. The F1 score suggests that the model achieved a good balance between precision and recall. Based on the findings, it appears that k-NN is a reliable option for classifying wine types.
Comparative Analysis on the Classification
Models:
# Performance metrics for Decision Tree
accuracy_dt <- "96.36%"
kappa_dt <- "0.9038"
sensitivity_dt <- "96.74%"
specificity_dt <- "95.18%"
ppv_dt <- "98.39%"
npv_dt <- "90.58%"
f1_dt <- "97.55%"
#Performance metrics for SVM
accuracy_svm <- "99.25%"
kappa_svm <- "0.9804"
sensitivity_svm <- "99.58%"
specificity_svm <- "98.31%"
ppv_svm <- "99.41%"
npv_svm <- "98.79%"
f1_svm <- "99.49%"
#Performance metrics for k-NN
accuracy_knn <- "98.43%"
kappa_knn <- "0.959"
sensitivity_knn <- "99.15%"
specificity_knn <- "96.38%"
ppv_knn <- "98.73%" # Positive Predictive Value is equivalent to Precision
npv_knn <- "97.56%" # Negative Predictive Value
f1_knn <- "98.94%"
# Create a data frame with the model performance metrics
model_comparison <- data.frame(
Metric = c("Accuracy", "Kappa", "Sensitivity (Recall)", "Specificity", "Positive Predictive Value (Precision)", "Negative Predictive Value", "F1 Score"),
`Model 1 (Decision Tree)` = c(accuracy_dt, kappa_dt, sensitivity_dt, specificity_dt, ppv_dt, npv_dt, f1_dt),
`Model 2 (SVM)` = c(accuracy_svm, kappa_svm, sensitivity_svm, specificity_svm, ppv_svm, npv_svm, f1_svm),
`Model 3 (k-NN)` = c(accuracy_knn, kappa_knn, sensitivity_knn, specificity_knn, ppv_knn, npv_knn, f1_knn)
)
#Comparison Table
knitr::kable(model_comparison, format = "pipe", caption = "Comparative Analysis of Decision Tree Models, SVM, and k-NN")
| Metric | Model.1..Decision.Tree. | Model.2..SVM. | Model.3..k.NN. |
|---|---|---|---|
| Accuracy | 96.36% | 99.25% | 98.43% |
| Kappa | 0.9038 | 0.9804 | 0.959 |
| Sensitivity (Recall) | 96.74% | 99.58% | 99.15% |
| Specificity | 95.18% | 98.31% | 96.38% |
| Positive Predictive Value (Precision) | 98.39% | 99.41% | 98.73% |
| Negative Predictive Value | 90.58% | 98.79% | 97.56% |
| F1 Score | 97.55% | 99.49% | 98.94% |
Comparative Analysis Explanation for Classification
Model:
Accuracy: The SVM model is the most accurate
(99.25%), followed by the k-NN model (98.43%). The Decision Tree model’s
accuracy is lower at 96.36%.
Kappa: The SVM has the highest kappa value of 0.98990.9804, indicating very strong agreement. The k-NN’s kappa is also high at 0.959, and the Decision Tree’s kappa is 0.9038, which is good but lower than the other two models.
Sensitivity and Specificity: The SVM and k-NN models both show very high sensitivity and specificity, with SVM slightly outperforming k-NN. The Decision Tree has lower sensitivity and specificity in comparison but still performs well.
Predictive Values: The SVM model has the highest precision (positive predictive value) and negative predictive value, indicating a higher likelihood that its predictions are correct. The k-NN model’s predictive values are slightly lower but still very high. The Decision Tree model has a lower precision, but its negative predictive value is not provided.
F1 Score: The SVM and k-NN models both have very high F1 scores with slight difference of 99.49% and 98.94% respectively. The F1 score for the Decision Tree is 97.55%, which is slightly lower than the other two models but still indicates a high precision and recall balance.
Overall, the SVM model is the top-performing model among the three. It achieves the highest accuracy, kappa, sensitivity, specificity, precision, and F1 score. The k-NN model is also highly effective, with only marginally lower metrics than SVM. The Decision Tree model, while not performing at the same level as SVM and k-NN, still shows robust performance across all metrics.
Conclusion:
A thorough analysis into wine quality prediction using regression and classification methods has provided valuable insights and practical applications for models. The project centered around Portuguese “Vinho Verde” wines, using datasets that contained physicochemical properties and expert quality ratings. With thorough preprocessing, the data was carefully prepared to ensure its suitability for analysis, which played a vital role in the following modeling stages.
Through the use of different visualization techniques like histograms, bar charts, box plots, and heatmaps, a comprehensive understanding of the distribution and correlation of the physicochemical variables was obtained. The selection of appropriate features for predictive modeling was guided by this foundational knowledge.
In the regression analysis, three models were trained to predict wine numerical quality scores: Random Forest, XGBoost, and Decision Trees. The RF model stood out as the top performer, boasting the lowest RMSE. This suggests its proficiency in handling the continuous quality variable and delivering the most accurate predictions.
Three models were used for the classification task, which involved predicting between the red and white wines: Decision Tree, SVM, and k-NN. The SVM model showcased remarkable accuracy and precision, better in nearly all evaluation metrics vs other models.Hence it is the most dependable option for classifying the wine type.
The project achieved its goals by discovering important physicochemical properties that impact wine quality, constructing regression models to forecast wine quality scores, and designing classification models to differentiate between different types of wine.
Future studies could build on this foundation by integrating additional information, exploring more complex machine learning techniques, and possibly including sensory data from tasters to enhance the models even further. In addition, implementing these models in practical environments could offer valuable insights for wine producers, empowering them to make informed decisions based on data and improve the quality and categorization of their wines.
References:
Cortez,Paulo, Cerdeira,A., Almeida,F., Matos,T., and Reis,J.. (2009).
Wine Quality. UCI Machine Learning Repository. https://doi.org/10.24432/C56S3T.