Introduction

This document outlines a step-by-step implementation of a Random Forest classification model on the wine quality dataset using R. The process involves setting up the Java environment, loading necessary libraries, reading the data, splitting it into training and testing sets, training a Random Forest model, and evaluating the results.

Setting Up the Environment

Set JAVA_HOME for Java-based Packages

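The xlsx package reads Excel files through rJava, which needs a Java installation. Setting the JAVA_HOME environment variable before loading these packages tells rJava where to find it. Adjust the path below to match the JDK installed on your machine.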

Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jdk-23.0.1")
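The path above is specific to one machine. A minimal sketch of a more defensive setup checks that the directory exists before setting the variable. The helper name `set_java_home` is hypothetical, not part of any package.

```r
# Hypothetical helper: set JAVA_HOME only if the directory exists.
set_java_home <- function(path) {
  if (!dir.exists(path)) {
    warning("JAVA_HOME not set; directory not found: ", path)
    return(invisible(FALSE))
  }
  Sys.setenv(JAVA_HOME = path)
  invisible(TRUE)
}

# The JDK path is the one used in this document; adjust it for your machine.
ok <- set_java_home("C:/Program Files/Java/jdk-23.0.1")
Sys.getenv("JAVA_HOME")
```

If the directory is missing, the helper warns and leaves JAVA_HOME untouched, which makes a wrong path visible before rJava fails to load.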

Load the Necessary Packages

The following packages are required for the analysis.

#install.packages("rJava")  # Install if not already installed
library(rJava)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(xlsx)
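A call to library() stops with an error when a package is not installed. As a rough sketch, the check below lists any of the three packages that are missing before they are loaded. The helper name `missing_packages` is an assumption made for illustration.

```r
# Packages loaded in this document.
required <- c("rJava", "randomForest", "xlsx")

# Hypothetical helper: return the packages that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```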

Data Loading

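We read the first sheet of the Excel file into a data frame and inspect the first rows. The file path is specific to the author's machine.

path <- "C:\\Users\\Dhikrullah\\Documents\\winequality-red.xls"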
data <- read.xlsx(path, sheetIndex = 1)
head(data)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
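Before modeling, it can help to confirm that the file was read as expected. The sketch below checks the column names, the column types, and the absence of missing values. The helper `check_wine_data` is hypothetical, and `demo` is a stand-in built from the first two rows printed above, not the full dataset.

```r
# Column names as shown in the head(data) output above.
expected_cols <- c("fixed.acidity", "volatile.acidity", "citric.acid",
                   "residual.sugar", "chlorides", "free.sulfur.dioxide",
                   "total.sulfur.dioxide", "density", "pH", "sulphates",
                   "alcohol", "quality")

# Hypothetical helper: stop with an error if the data look wrong.
check_wine_data <- function(df) {
  stopifnot(identical(names(df), expected_cols))       # same columns, same order
  stopifnot(all(vapply(df, is.numeric, logical(1))))   # every column is numeric
  stopifnot(!any(is.na(df)))                           # no missing values
  invisible(TRUE)
}

# Stand-in for `data`: the first two rows printed above.
demo <- data.frame(
  fixed.acidity = c(7.4, 7.8), volatile.acidity = c(0.70, 0.88),
  citric.acid = c(0, 0), residual.sugar = c(1.9, 2.6),
  chlorides = c(0.076, 0.098), free.sulfur.dioxide = c(11, 25),
  total.sulfur.dioxide = c(34, 67), density = c(0.9978, 0.9968),
  pH = c(3.51, 3.20), sulphates = c(0.56, 0.68),
  alcohol = c(9.4, 9.8), quality = c(5, 5)
)
check_wine_data(demo)
```

On the real data, the call would be check_wine_data(data), run before quality is converted to a factor.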

Data Preparation

We will treat the quality variable as a categorical outcome for classification.

data$quality <- as.factor(data$quality)
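Converting to a factor is what makes randomForest fit a classifier rather than a regression. The short sketch below shows what the conversion does and one common pitfall when converting back to numbers. The vector `quality` is a stand-in holding the six values printed by head(data).

```r
# Stand-in for data$quality: the six values printed by head(data).
quality <- c(5, 5, 5, 6, 5, 5)
quality_f <- as.factor(quality)

levels(quality_f)   # the distinct scores, stored as character labels
table(quality_f)    # class counts, useful for spotting imbalance

# as.integer() returns internal level codes, not the original scores.
codes  <- as.integer(quality_f)                  # 1 1 1 2 1 1
scores <- as.numeric(as.character(quality_f))    # 5 5 5 6 5 5
```

Running table() on the full quality column would show how unevenly the classes are represented, which matters when reading the per-class errors later.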

We’ll randomly split the dataset into training (80%) and testing (20%) sets.

data_set_size <- floor(nrow(data) * 0.80)
index <- sample(seq_len(nrow(data)), size = data_set_size)
training <- data[index, ]
testing <- data[-index, ]
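Because sample() draws at random, the split above changes on every run. A minimal sketch of a repeatable split fixes the random seed first. The seed value 42 and the stand-in data frame `demo` are assumptions made for illustration. The output shown in this document came from an unseeded run, so a seeded run will give slightly different numbers.

```r
# Hypothetical stand-in for `data`: 100 rows with an id and a label.
demo <- data.frame(id = 1:100, quality = factor(rep(c(5, 6), each = 50)))

set.seed(42)  # arbitrary seed (an assumption); any fixed value works
train_size <- floor(nrow(demo) * 0.80)
index <- sample(seq_len(nrow(demo)), size = train_size)

training_demo <- demo[index, ]
testing_demo  <- demo[-index, ]

# The two sets should be disjoint and together cover every row.
nrow(training_demo)                                    # 80
nrow(testing_demo)                                     # 20
length(intersect(training_demo$id, testing_demo$id))   # 0
```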

Model Building

We train a Random Forest model with 2001 trees, trying 4 randomly selected variables at each split (mtry = 4). Setting importance = TRUE records variable importance measures during training.

rf <- randomForest(quality ~ ., data = training, mtry = 4, ntree = 2001, importance = TRUE)
rf
## 
## Call:
##  randomForest(formula = quality ~ ., data = training, mtry = 4,      ntree = 2001, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 29.55%
## Confusion matrix:
##   3 4   5   6  7 8 class.error
## 3 0 1   7   2  0 0   1.0000000
## 4 1 0  27  13  1 0   1.0000000
## 5 0 0 446  98  2 0   0.1831502
## 6 0 0 108 378 26 1   0.2631579
## 7 0 0   9  71 75 1   0.5192308
## 8 0 0   0   3  7 2   0.8333333
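The error figures in this output follow directly from the confusion matrix. As a worked check, the sketch below rebuilds the matrix from the printed counts and recomputes the per-class errors and the overall out-of-bag (OOB) error. On a fitted model the same matrix is stored in rf$confusion, while the object `cm` here is typed in by hand.

```r
# Counts copied from the confusion matrix printed above
# (rows = actual quality, columns = predicted quality).
classes <- c("3", "4", "5", "6", "7", "8")
cm <- matrix(
  c(0, 1,   7,   2,  0, 0,
    1, 0,  27,  13,  1, 0,
    0, 0, 446,  98,  2, 0,
    0, 0, 108, 378, 26, 1,
    0, 0,   9,  71, 75, 1,
    0, 0,   0,   3,  7, 2),
  nrow = 6, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(cm) / rowSums(cm)
round(class_error, 4)

# Overall OOB error: share of all training rows that were misclassified.
oob_error <- 1 - sum(diag(cm)) / sum(cm)
round(100 * oob_error, 2)   # 29.55
```

The rows sum to 1279 training wines, of which 901 lie on the diagonal. Qualities 5 and 6 account for most rows and have the lowest errors, while no wine of quality 3 or 4 is classified correctly.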

Visualizing the Random Forest Model

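Plotting the fitted model shows the OOB error rate and the error rate of each class as a function of the number of trees.

plot(rf)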
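Since the model was fit with importance = TRUE, the recorded variable importance can also be inspected. The sketch below is illustrative only: it uses the built-in iris data as a stand-in for the wine data, an arbitrary seed, and a small forest, and it runs only when the randomForest package is installed.

```r
# Sketch only: iris stands in for the wine data.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)  # arbitrary seed, chosen only for repeatability
  rf_demo <- randomForest::randomForest(Species ~ ., data = iris,
                                        ntree = 101, importance = TRUE)

  # type = 1 selects the permutation-based mean decrease in accuracy.
  imp <- randomForest::importance(rf_demo, type = 1)
  print(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE])

  # varImpPlot() draws the importance measures as a dot chart.
  randomForest::varImpPlot(rf_demo)
}
```

For the model in this document, the corresponding calls would be importance(rf, type = 1) and varImpPlot(rf).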

Predicting on the Test Set

We predict the quality of the wines in the testing set with the trained Random Forest model and place the predictions next to the actual values.

# Name the columns explicitly so the result is easy to read
result <- data.frame(actual = testing$quality,
                     predicted = predict(rf, testing[, 1:11], type = "response"))
head(result, 5)
##    actual predicted
## 10      5         5
## 16      5         5
## 26      5         5
## 31      5         5
## 33      5         5
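The first five rows alone do not show how well the model performs on the test set. A common way to summarize this is a confusion matrix together with the overall accuracy, sketched below. The label vectors are hypothetical; on the real data they would be testing$quality and the output of predict(rf, testing).

```r
# Hypothetical actual and predicted labels, for illustration only.
lv <- c("5", "6", "7")
actual    <- factor(c("5", "5", "6", "6", "6", "7", "7", "5"), levels = lv)
predicted <- factor(c("5", "6", "6", "6", "5", "7", "6", "5"), levels = lv)

# Cross-tabulate actual against predicted classes.
test_cm <- table(actual = actual, predicted = predicted)
test_cm

# Accuracy: share of test rows whose prediction matches the true class.
accuracy <- mean(actual == predicted)
accuracy   # 0.625
```

Unlike the OOB estimate, this figure is computed on rows the model never saw during training, so comparing the two gives a quick check for overfitting.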

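Visualizing the Predictions

Plotting the two columns of result against each other compares the actual and predicted quality classes.

plot(result)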

Conclusion

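The Random Forest model was trained on 80% of the wine quality dataset and used to predict quality on the remaining 20%. On the training data, the out-of-bag error estimate is 29.55%. The model handles the common quality classes best, with class errors of about 18% for quality 5 and 26% for quality 6, and performs poorly on the rare classes: no wine of quality 3 or 4 was classified correctly.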