Below are the libraries used to complete this assignment:
library(tidyverse)
library(skimr)
library(rpart)
library(rpart.plot)
library(knitr)
library(tidyr)
library(gridExtra)
library(stringr)
library(tidymodels)
library(corrplot)
library(randomForest)
library(caret)
For this assignment I have chosen to work with the Wine Quality data set, which can be accessed at https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009. The data set contains two subsets, for red and white wine respectively. For the purposes of this analysis, I will be working with the red wine subset. The goal of this data set is to model wine quality based on physicochemical tests. It contains 12 attributes, listed below.
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
For this analysis, I will attempt to model the quality of the wine based on different combinations of attributes.
The data was downloaded from Kaggle.com and uploaded to my GitHub repository. The Wine Quality data contains two subsets, white and red wine; I decided to use the red wine data. The first few rows are shown below:
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
Using the skimr library we can obtain quick summary statistics for the dataset. It has 1599 observations of 12 variables, all numeric, with no missing values.
| Data summary | |
|---|---|
| Name | wine_data |
| Number of rows | 1599 |
| Number of columns | 12 |
| Column type frequency: | |
| numeric | 12 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| fixed.acidity | 0 | 1 | 8.32 | 1.74 | 4.60 | 7.10 | 7.90 | 9.20 | 15.90 | ▂▇▂▁▁ |
| volatile.acidity | 0 | 1 | 0.53 | 0.18 | 0.12 | 0.39 | 0.52 | 0.64 | 1.58 | ▅▇▂▁▁ |
| citric.acid | 0 | 1 | 0.27 | 0.19 | 0.00 | 0.09 | 0.26 | 0.42 | 1.00 | ▇▆▅▁▁ |
| residual.sugar | 0 | 1 | 2.54 | 1.41 | 0.90 | 1.90 | 2.20 | 2.60 | 15.50 | ▇▁▁▁▁ |
| chlorides | 0 | 1 | 0.09 | 0.05 | 0.01 | 0.07 | 0.08 | 0.09 | 0.61 | ▇▁▁▁▁ |
| free.sulfur.dioxide | 0 | 1 | 15.87 | 10.46 | 1.00 | 7.00 | 14.00 | 21.00 | 72.00 | ▇▅▁▁▁ |
| total.sulfur.dioxide | 0 | 1 | 46.47 | 32.90 | 6.00 | 22.00 | 38.00 | 62.00 | 289.00 | ▇▂▁▁▁ |
| density | 0 | 1 | 1.00 | 0.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▃▇▂▁ |
| pH | 0 | 1 | 3.31 | 0.15 | 2.74 | 3.21 | 3.31 | 3.40 | 4.01 | ▁▅▇▂▁ |
| sulphates | 0 | 1 | 0.66 | 0.17 | 0.33 | 0.55 | 0.62 | 0.73 | 2.00 | ▇▅▁▁▁ |
| alcohol | 0 | 1 | 10.42 | 1.07 | 8.40 | 9.50 | 10.20 | 11.10 | 14.90 | ▇▇▃▁▁ |
| quality | 0 | 1 | 5.64 | 0.81 | 3.00 | 5.00 | 6.00 | 6.00 | 8.00 | ▁▇▇▂▁ |
There is no correlation between a wine’s residual sugar and its quality rating.
There is no visible relationship between wine quality and either chloride content or free sulfur dioxide.
Wines with higher levels of total sulfur dioxide are not consistently rated as low quality, so total sulfur dioxide is not a reliable indicator of wine quality.
There is a slight negative relationship between a wine’s density and its quality rating. Higher density wines tend to have a slightly lower quality rating.
There is very little to no correlation between pH and wine quality.
There is a slight positive relationship between alcohol content and wine quality. The higher the alcohol content, the higher the average wine quality.
Now that I’ve visualized the data, I want to make one minor change to the column names. Most of them contain a “.”, which I’m changing to an “_”, and I’m capitalizing each word. Since there are no missing values and all values are already numeric, not much else is needed to prepare the data.
| Fixed_Acidity | Volatile_Acidity | Citric_Acid | Residual_Sugar | Chlorides | Free_Sulfur_Dioxide | Total_Sulfur_Dioxide | Density | pH | Sulphates | Alcohol | Quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
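The renaming step described above could be sketched in base R as follows. The helper `clean_names()` is hypothetical (it is not from the original analysis), and the rule of leaving `pH` uncapitalized is inferred from the renamed table.

```r
# Hypothetical helper (an assumption, not the report's own code): turn names
# such as "fixed.acidity" into "Fixed_Acidity" while leaving "pH" untouched.
clean_names <- function(x) {
  parts <- strsplit(x, ".", fixed = TRUE)
  vapply(parts, function(p) {
    # Capitalise the first letter of every word except "pH"
    p <- ifelse(p == "pH", p,
                paste0(toupper(substring(p, 1, 1)), substring(p, 2)))
    paste(p, collapse = "_")
  }, character(1))
}

# Column names as they appear in the raw data
old_names <- c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates",
               "alcohol", "quality")
new_names <- clean_names(old_names)
print(new_names)

# With the data loaded as `wine_data` (the name shown in the skimr summary):
# names(wine_data) <- clean_names(names(wine_data))
```

Keeping the rule in one function means the same cleaning can be reapplied if the data is reloaded.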
The correlation plot below measures the degree of linear relationship between each pair of variables in the dataset. Correlation values fall between -1 and +1, with +1 being a perfect positive correlation and -1 a perfect negative correlation. The darker the dot, the stronger the correlation (whether positive or negative). From the results below, there are strong positive correlations among citric acid, density and fixed acidity, as well as between free sulfur dioxide and total sulfur dioxide. Strong negative correlations are only seen between fixed acidity and pH, citric acid and volatile acidity, citric acid and pH, and density and alcohol.
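As a rough sketch of what sits behind such a plot, the snippet below computes a Pearson correlation matrix and lists the strongly correlated pairs. It uses only four columns of the six rows printed earlier, so the numbers are illustrative and are not the results for the full data. The helper `strong_pairs()` and the 0.6 cutoff are assumptions made for this example.

```r
# Four columns of the six rows printed earlier (illustration only; the
# correlations for the full 1599-row data set will differ).
wine_head <- data.frame(
  Fixed_Acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  Citric_Acid   = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  Alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
)

# Pearson correlation matrix (the default method of cor())
cor_mat <- cor(wine_head)

# Hypothetical helper: list each variable pair once (upper triangle) whose
# absolute correlation reaches the cutoff. The 0.6 cutoff is an assumption.
strong_pairs <- function(cor_mat, cutoff = 0.6) {
  idx <- which(abs(cor_mat) >= cutoff & upper.tri(cor_mat), arr.ind = TRUE)
  data.frame(var1 = rownames(cor_mat)[idx[, "row"]],
             var2 = colnames(cor_mat)[idx[, "col"]],
             r    = cor_mat[idx])
}

strong <- strong_pairs(cor_mat)
print(strong)

# On the full data, the same matrix can be passed to the plotting function:
# corrplot::corrplot(cor(wine_data))
```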
We have to create two decision tree models and one random forest model. The first decision tree models Quality using all other variables in the data set. I started by splitting the data into training and validation sets using a 75:25 ratio. After that, I created the decision tree seen below:
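A minimal sketch of the split-then-fit workflow is shown below. It runs on simulated stand-in data, not the wine data, and uses a simple random split with an arbitrary seed; the report's actual split function and seed are not shown, so these are assumptions.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(123)  # arbitrary seed, not the one used in the report

# Stand-in data: simulated for illustration only, NOT the wine data.
# In the analysis this would be the renamed wine data frame.
n <- 400
toy <- data.frame(Alcohol   = runif(n, 8.4, 14.9),
                  Sulphates = runif(n, 0.33, 2))
toy$Quality <- factor(ifelse(toy$Alcohol + rnorm(n) > 10.5, 6, 5))

# 75:25 split into training and validation rows
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
toy_train <- toy[train_idx, ]
toy_valid <- toy[-train_idx, ]

# Classification tree of Quality on all other columns
tree_fit <- rpart(Quality ~ ., data = toy_train, method = "class")

# Predicted classes for the held-out rows, cross-tabulated against the truth
tree_pred <- predict(tree_fit, newdata = toy_valid, type = "class")
conf <- table(Actual = toy_valid$Quality, Predicted = tree_pred)
print(conf)
```

Treating `Quality` as a factor makes `rpart()` fit a classification tree, which is what a table of predicted against actual classes requires.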
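Then we test the model on the validation dataset to create the prediction table below. Rows appear to be the actual quality scores and columns the predicted ones, since the columns for 3, 4 and 8 are all zero (the tree never predicts those scores):
| | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|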
| 3 | 0 | 0 | 1 | 1 | 0 | 0 |
| 4 | 0 | 0 | 7 | 8 | 0 | 0 |
| 5 | 0 | 0 | 110 | 55 | 4 | 0 |
| 6 | 0 | 0 | 42 | 106 | 11 | 0 |
| 7 | 0 | 0 | 2 | 37 | 13 | 0 |
| 8 | 0 | 0 | 0 | 1 | 1 | 0 |
We then check the accuracy, which is 57.4%:
| x |
|---|
| 0.5739348 |
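The accuracy can be reproduced by hand from the prediction table: it is the sum of the diagonal (correct predictions) divided by the total number of validation rows. The sketch below re-enters the table as a matrix; labelling rows as actual and columns as predicted is an assumption, and it does not affect the accuracy.

```r
# Prediction table for the first tree, entered row by row
scores <- as.character(3:8)
conf1 <- matrix(c(0, 0,   1,   1,  0, 0,
                  0, 0,   7,   8,  0, 0,
                  0, 0, 110,  55,  4, 0,
                  0, 0,  42, 106, 11, 0,
                  0, 0,   2,  37, 13, 0,
                  0, 0,   0,   1,  1, 0),
                nrow = 6, byrow = TRUE,
                dimnames = list(Actual = scores, Predicted = scores))

# Accuracy = correctly classified rows / all validation rows
n_valid  <- sum(conf1)        # 399 validation rows
accuracy <- sum(diag(conf1)) / n_valid
print(accuracy)
```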
For the second decision tree I will be looking at the relationship between Quality and Density, pH, and Alcohol. I created a new dataset from the original, keeping only the variables above. Following the same steps used for the first decision tree, we create the second:
Same as before, we create the prediction table:
| | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|
| 3 | 0 | 0 | 1 | 1 | 0 | 0 |
| 4 | 0 | 0 | 11 | 3 | 0 | 0 |
| 5 | 0 | 0 | 138 | 32 | 0 | 0 |
| 6 | 0 | 0 | 70 | 89 | 0 | 0 |
| 7 | 0 | 0 | 9 | 41 | 0 | 0 |
| 8 | 0 | 0 | 1 | 3 | 0 | 0 |
The accuracy is now 56.9%, which is slightly lower than that of the first decision tree:
| x |
|---|
| 0.5689223 |
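Overall accuracy hides how the second tree behaves per class. Assuming rows are actual scores and columns are predicted scores, the sketch below computes how often each score is predicted and the share of each actual score that is classified correctly.

```r
# Prediction table for the second tree, entered row by row
scores <- as.character(3:8)
conf2 <- matrix(c(0, 0,   1,  1, 0, 0,
                  0, 0,  11,  3, 0, 0,
                  0, 0, 138, 32, 0, 0,
                  0, 0,  70, 89, 0, 0,
                  0, 0,   9, 41, 0, 0,
                  0, 0,   1,  3, 0, 0),
                nrow = 6, byrow = TRUE,
                dimnames = list(Actual = scores, Predicted = scores))

# How often each score is predicted (column totals)
predicted_counts <- colSums(conf2)

# Share of each actual score that is classified correctly (per-class recall)
recall <- diag(conf2) / rowSums(conf2)

print(predicted_counts)
print(round(recall, 3))
```

Under that assumption, the second tree only ever predicts 5 or 6, so every wine rated 3, 4, 7 or 8 in the validation set is misclassified.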
A Random Forest is an ensemble learning technique in machine learning that combines multiple decision trees to make accurate predictions. It works by creating a collection of decision trees, each trained on a bootstrapped dataset (randomly sampled with replacement) from the original data and considering only a subset of features at each split. The final prediction in a classification task is determined by a majority vote of the individual trees, while in a regression task, it’s an average of their predictions. Random Forests are valued for their high accuracy, resistance to overfitting, and the ability to assess feature importance.
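The bootstrap-and-vote idea can be sketched by hand with `rpart` trees. This is only an illustration of bagging: it uses R's built-in `iris` data as a stand-in, an arbitrary seed and tree count, and it does not sample a subset of features at each split, which is the extra step a real random forest adds.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1)     # arbitrary seed for this illustration
n_trees <- 25   # arbitrary number of trees

# Stand-in data: the built-in iris data, NOT the wine data
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit each tree on a bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  rpart(Species ~ ., data = boot, method = "class")
})

# Collect every tree's vote: one row per test observation, one column per tree
votes <- sapply(trees, function(fit) {
  as.character(predict(fit, newdata = test, type = "class"))
})

# Majority vote across trees for each test observation
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

# Share of test observations classified correctly by the ensemble
bagged_acc <- mean(bagged_pred == test$Species)
print(bagged_acc)
```

For a regression task, the majority vote would be replaced by the mean of the trees' numeric predictions.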
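For the random forest model, I am using the same setup as the first decision tree (Quality against all other variables), as it had a higher accuracy than the second model. First we fit the random forest on the training data and then apply it to the validation data. Because Quality is stored as a number, randomForest() fits a regression forest rather than a classification forest, as the model summary shows.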
##
## Call:
## randomForest(formula = Quality ~ ., data = wine_train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.340287
## % Var explained: 48.36
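The two summary figures are linked. As far as I understand the `randomForest` output, "% Var explained" is 100 × (1 − MSE / variance of the response), with both quantities based on out-of-bag predictions. Under that assumption, the sketch below backs out the variance of Quality implied by the printed numbers.

```r
# Figures copied from the model summary above
mse     <- 0.340287   # Mean of squared residuals
pct_var <- 48.36      # % Var explained

# Assumed relation: pct_var = 100 * (1 - mse / var(Quality)),
# so the implied variance of Quality in the training data is:
implied_var <- mse / (1 - pct_var / 100)
implied_sd  <- sqrt(implied_var)

print(implied_var)
print(implied_sd)
```

The implied standard deviation is close to the 0.81 reported for quality in the skimr summary, which is consistent with this reading of the output.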
From the random forest model, we can create a variable importance plot, which shows how much each variable contributes to predicting the quality of the wine. From the plot below we note that Alcohol, Sulphates and Volatile_Acidity are the top three variables and play the largest role in predicting wine quality.
Numerically, we can see the same result below:
| | Overall |
|---|---|
| Fixed_Acidity | 48.62991 |
| Volatile_Acidity | 98.69333 |
| Citric_Acid | 51.91002 |
| Residual_Sugar | 41.04249 |
| Chlorides | 51.96039 |
| Free_Sulfur_Dioxide | 36.21257 |
| Total_Sulfur_Dioxide | 58.15330 |
| Density | 64.25719 |
| pH | 43.23137 |
| Sulphates | 107.61902 |
| Alcohol | 149.20112 |
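Lastly, I apply the random forest to the validation data to check the accuracy of the model:
# set a seed for reproducibility
set.seed(4)
# predict quality for the validation data with the fitted random forest
rf_pred <- predict(rf_model, newdata = wine_valid)
# rf_model is a regression forest, so rf_pred is numeric; round it to the
# nearest score and use factors with shared levels for the confusion matrix
quality_levels <- 3:8
confusionMatrix(factor(round(rf_pred), levels = quality_levels),
                factor(wine_valid$Quality, levels = quality_levels))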
To counter the popular perception of decision trees, especially given their drawbacks and the instances where they have gone wrong, you can adopt several strategies when using a decision tree to address real problems. Like any tool, decision trees have limitations, so acknowledge them and be transparent about what the model can and cannot do. Focus on data quality and preprocessing to ensure the best input. Control overfitting with techniques such as pruning or ensembling. Choose relevant features and maintain interpretability by explaining the tree’s decisions. Continuously monitor and update the model, document the process, and conduct sensitivity analyses. Finally, consider ethical aspects and educate stakeholders on the strengths and weaknesses of decision trees, which promotes a more informed and realistic view of their utility. In this homework, we were able to correct the error of the random forest.