Introduction: This report analyzes the Red Wine Quality dataset to
explore how different chemical properties relate to the quality of the
wine. Data tidying and transformations were performed using
tidyr
and dplyr
. here is the link for the
dataset: https://archive.ics.uci.edu/dataset/186/wine+quality
To tidy and transform the Red Wine Quality dataset following actions have been taken:
# Load the required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the red wine dataset with the correct delimiter
wine_red <- read.csv("/Users/aribarazzaq/Desktop/wine+quality/winequality-red.csv", sep = ";")
# Preview the first few rows to understand the structure
head(wine_red)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
str(wine_red)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
# Identify missing values
sum(is.na(wine_red))
## [1] 0
# If missing values exist, we can remove them or use imputation
wine_red <- wine_red %>% drop_na()
# Rename columns to remove spaces and make them more readable
wine_red <- wine_red %>%
rename(
fixed_acidity = `fixed.acidity`,
volatile_acidity = `volatile.acidity`,
citric_acid = `citric.acid`,
residual_sugar = `residual.sugar`,
free_sulfur_dioxide = `free.sulfur.dioxide`,
total_sulfur_dioxide = `total.sulfur.dioxide`
)
# Confirm new column names
colnames(wine_red)
## [1] "fixed_acidity" "volatile_acidity" "citric_acid"
## [4] "residual_sugar" "chlorides" "free_sulfur_dioxide"
## [7] "total_sulfur_dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
# Convert the data to long format for analysis
wine_red_long <- wine_red %>%
pivot_longer(
cols = fixed_acidity:alcohol,
names_to = "characteristic",
values_to = "value"
)
# View the long format dataset
head(wine_red_long)
## # A tibble: 6 × 3
## quality characteristic value
## <int> <chr> <dbl>
## 1 5 fixed_acidity 7.4
## 2 5 volatile_acidity 0.7
## 3 5 citric_acid 0
## 4 5 residual_sugar 1.9
## 5 5 chlorides 0.076
## 6 5 free_sulfur_dioxide 11
# Add a column to classify wine quality
wine_red <- wine_red %>%
mutate(quality_label = ifelse(quality >= 7, "High Quality", "Low Quality"))
# View the dataset with the new column
head(wine_red)
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality quality_label
## 1 5 Low Quality
## 2 5 Low Quality
## 3 5 Low Quality
## 4 6 Low Quality
## 5 5 Low Quality
## 6 5 Low Quality
# Summarize the average fixed acidity for each quality label
acidity_summary <- wine_red %>%
group_by(quality_label) %>%
summarise(mean_fixed_acidity = mean(fixed_acidity, na.rm = TRUE))
print(acidity_summary)
## # A tibble: 2 × 2
## quality_label mean_fixed_acidity
## <chr> <dbl>
## 1 High Quality 8.85
## 2 Low Quality 8.24
# Filter wines with high alcohol content (> 10%) and group by quality label
high_alcohol_wines <- wine_red %>%
filter(alcohol > 10) %>%
group_by(quality_label) %>%
summarise(count = n())
print(high_alcohol_wines)
## # A tibble: 2 × 2
## quality_label count
## <chr> <int>
## 1 High Quality 196
## 2 Low Quality 656
To tidy and transform the Red Wine Quality dataset:
Renamed Columns: Made column names more descriptive. Handled Missing Values: Checked for and removed missing data. Long Format Transformation: Optionally transformed data to a long format for further analysis. Data Transformation: Added new variables and summarized key statistics for better insight into wine quality.