This document walks through a Random Forest classification model for the red wine quality dataset in R. The steps are: setting up the Java environment, loading the required libraries, reading the data, splitting it into training and testing sets, training the model, and evaluating the results.
Setting Up the Environment
Set JAVA_HOME for Java-based Packages
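The xlsx package depends on rJava, which needs a Java installation. We set the JAVA_HOME environment variable before loading these packages so that they can find it. The path below is specific to the author's machine and must point to your own JDK.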
Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jdk-23.0.1")
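A hard-coded JDK path fails without a clear message when that folder does not exist on another machine. The sketch below wraps the call in a small guard; `set_java_home` is a hypothetical helper name introduced here for illustration, not part of rJava or base R.

```r
# Hypothetical helper: set JAVA_HOME only if the directory actually exists.
set_java_home <- function(path) {
  if (!dir.exists(path)) {
    warning("JAVA_HOME not set, directory not found: ", path)
    return(invisible(FALSE))
  }
  Sys.setenv(JAVA_HOME = path)
  invisible(TRUE)
}

# Path taken from this document; replace it with your own JDK location.
ok <- set_java_home("C:/Program Files/Java/jdk-23.0.1")
Sys.getenv("JAVA_HOME")
```

If `ok` is FALSE, the variable is left untouched, so correct the path before calling library(rJava).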
Load the Necessary Packages
The following packages are required for the analysis.
#install.packages("rJava") # Install if not already installed
library(rJava)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(xlsx)
Next, we read the first sheet of the workbook into a data frame and preview the first rows. Adjust the path to wherever the file is stored on your machine.
path <- "C:\\Users\\Dhikrullah\\Documents\\winequality-red.xls"
data <- read.xlsx(path, sheetIndex = 1)
head(data)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
We will treat the quality variable as a categorical outcome for classification.
data$quality <- as.factor(data$quality)
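To make the effect of the factor conversion concrete, here is a minimal sketch that uses only the six quality values printed by head(data) above. It shows the resulting levels, the class counts, and a common pitfall when converting a factor back to numbers.

```r
# Quality values from the six rows printed by head(data) above
quality <- c(5, 5, 5, 6, 5, 5)

quality_f <- as.factor(quality)
levels(quality_f)   # the distinct classes present in these rows
table(quality_f)    # class counts, useful for spotting imbalance

# Pitfall: as.numeric() on a factor returns internal level codes,
# not the original scores. Convert via character to recover them.
codes  <- as.numeric(quality_f)
scores <- as.numeric(as.character(quality_f))
```

Running table() on the full data$quality column in the same way shows how unevenly the quality levels are represented.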
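We’ll split the dataset into training (80%) and testing (20%) sets by sampling row indices at random. No seed is set here, so the split, and every result that follows, changes from run to run.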
data_set_size <- floor(nrow(data) * 0.80)
index <- sample(1:nrow(data), size = data_set_size)
training <- data[index, ]
testing <- data[-index, ]
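The split can be made repeatable by fixing the random seed first. The sketch below assumes a stand-in data frame of 1599 rows; that row count is an assumption, chosen because it is consistent with the 1279 training rows that appear in the confusion matrix later in this document. The seed value 123 is arbitrary.

```r
set.seed(123)  # arbitrary seed, chosen only for illustration

# Stand-in for the wine data: 1599 rows with an id column only (assumption)
data <- data.frame(id = seq_len(1599))

data_set_size <- floor(nrow(data) * 0.80)
index <- sample(seq_len(nrow(data)), size = data_set_size)

training <- data[index, , drop = FALSE]
testing  <- data[-index, , drop = FALSE]

nrow(training)                               # 1279
nrow(testing)                                # 320
length(intersect(training$id, testing$id))   # 0, the sets do not overlap
```

seq_len(nrow(data)) is used instead of 1:nrow(data) because it also behaves correctly for an empty data frame.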
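We train a Random Forest model with 2001 trees and 4 randomly selected variables tried at each split (mtry = 4). Setting importance = TRUE also stores variable importance measures in the fitted object.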
rf <- randomForest(quality ~ ., data = training, mtry = 4, ntree = 2001, importance = TRUE)
rf
##
## Call:
## randomForest(formula = quality ~ ., data = training, mtry = 4, ntree = 2001, importance = TRUE)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 29.55%
## Confusion matrix:
## 3 4 5 6 7 8 class.error
## 3 0 1 7 2 0 0 1.0000000
## 4 1 0 27 13 1 0 1.0000000
## 5 0 0 446 98 2 0 0.1831502
## 6 0 0 108 378 26 1 0.2631579
## 7 0 0 9 71 75 1 0.5192308
## 8 0 0 0 3 7 2 0.8333333
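The OOB error rate and the class.error column can both be derived from the confusion matrix. As a check, the sketch below retypes the counts printed above by hand and recomputes them. In a live session the same matrix should be available as rf$confusion, which also carries the class.error column.

```r
classes <- c("3", "4", "5", "6", "7", "8")

# Counts copied from the printed confusion matrix (rows = actual class)
conf <- matrix(
  c(0, 1,   7,   2,  0, 0,
    1, 0,  27,  13,  1, 0,
    0, 0, 446,  98,  2, 0,
    0, 0, 108, 378, 26, 1,
    0, 0,   9,  71, 75, 1,
    0, 0,   0,   3,  7, 2),
  nrow = 6, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall OOB error: share of all training rows that were misclassified
oob_error <- 1 - sum(diag(conf)) / sum(conf)

round(100 * oob_error, 2)   # 29.55
round(class_error, 7)
```

The diagonal holds 901 correct predictions out of 1279 training rows, which gives the 29.55% reported above. Classes 3 and 4 have no correct predictions at all, so the overall rate hides very poor performance on the rare quality levels.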
Visualizing the Random Forest Model
Calling plot() on the fitted model draws the OOB error rate and the per-class error rates against the number of trees.
plot(rf)
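Because the model was fitted with importance = TRUE, the variable importance measures can be inspected as well. The sketch below uses the built-in iris data as a stand-in so that it runs without the wine file; the seed and ntree = 200 are arbitrary choices for illustration. With the wine model, the same calls would be importance(rf) and varImpPlot(rf).

```r
library(randomForest)

set.seed(1)  # arbitrary seed for a repeatable illustration

# iris is a stand-in dataset (assumption), not the wine data
fit <- randomForest(Species ~ ., data = iris, ntree = 200, importance = TRUE)

# Matrix with one row per predictor; for classification it includes
# per-class columns plus MeanDecreaseAccuracy and MeanDecreaseGini
imp <- importance(fit)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]

# Dot chart of both importance measures
varImpPlot(fit)
```

Predictors near the top of the sorted table contribute most to the splits, which is a useful first indication of which chemical properties drive the predicted quality.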
We predict the quality of the wines in the testing set using the trained Random Forest model and place the predictions next to the actual values.
result <- data.frame(testing$quality, predict(rf, testing[, 1:11], type = "response"))
head(result, 5)
## testing.quality predict.rf..testing...1.11...type....response..
## 10 5 5
## 16 5 5
## 26 5 5
## 31 5 5
## 33 5 5
plot(result)
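A side-by-side listing does not say how often the predictions are right. One way to summarize it is a confusion table and an overall accuracy, sketched below on hypothetical label vectors that are invented for illustration and are not results from this analysis.

```r
# Hypothetical actual and predicted labels, for illustration only
lv <- as.character(3:8)
actual    <- factor(c(5, 5, 6, 6, 7, 5), levels = lv)
predicted <- factor(c(5, 6, 6, 6, 6, 5), levels = lv)

# Cross-tabulate actual against predicted classes
conf <- table(actual = actual, predicted = predicted)
conf

# Share of rows where the prediction matches the actual class
accuracy <- mean(actual == predicted)
accuracy
```

With the data frame built above, the two vectors would be its first and second columns. Naming them when calling data.frame(), for example actual = and predicted =, avoids the long auto-generated column name seen in the output.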
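The Random Forest model was trained on 80% of the wine quality dataset and reached an out-of-bag error estimate of 29.55%. Most errors are confusions between adjacent quality levels, and the rare classes 3 and 4 were never classified correctly (class error of 1). For the first five rows of the testing set shown above, the predicted quality matches the actual quality.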