Introduction: This report analyzes the Red Wine Quality dataset to explore how different chemical properties relate to the quality of the wine. Data tidying and transformations were performed using tidyr and dplyr. here is the link for the dataset: https://archive.ics.uci.edu/dataset/186/wine+quality

To tidy and transform the Red Wine Quality dataset following actions have been taken:

  1. Renamed Columns: Made column names more descriptive.
  2. Handled Missing Values: Checked for and removed missing data.
  3. Long Format Transformation: transformed data to a long format for further analysis.
  4. Data Transformation: Added new variables and summarized key statistics for better insight into wine quality.
# Load the required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the red wine dataset with the correct delimiter
wine_red <- read.csv("/Users/aribarazzaq/Desktop/wine+quality/winequality-red.csv", sep = ";")

# Preview the first few rows to understand the structure
head(wine_red)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
str(wine_red)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
# Identify missing values
sum(is.na(wine_red))
## [1] 0
# If missing values exist, we can remove them or use imputation
wine_red <- wine_red %>% drop_na()
# Rename columns to remove spaces and make them more readable
wine_red <- wine_red %>%
  rename(
    fixed_acidity = `fixed.acidity`,
    volatile_acidity = `volatile.acidity`,
    citric_acid = `citric.acid`,
    residual_sugar = `residual.sugar`,
    free_sulfur_dioxide = `free.sulfur.dioxide`,
    total_sulfur_dioxide = `total.sulfur.dioxide`
  )

# Confirm new column names
colnames(wine_red)
##  [1] "fixed_acidity"        "volatile_acidity"     "citric_acid"         
##  [4] "residual_sugar"       "chlorides"            "free_sulfur_dioxide" 
##  [7] "total_sulfur_dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
# Convert the data to long format for analysis
wine_red_long <- wine_red %>%
  pivot_longer(
    cols = fixed_acidity:alcohol,
    names_to = "characteristic",
    values_to = "value"
  )

# View the long format dataset
head(wine_red_long)
## # A tibble: 6 × 3
##   quality characteristic       value
##     <int> <chr>                <dbl>
## 1       5 fixed_acidity        7.4  
## 2       5 volatile_acidity     0.7  
## 3       5 citric_acid          0    
## 4       5 residual_sugar       1.9  
## 5       5 chlorides            0.076
## 6       5 free_sulfur_dioxide 11
# Add a column to classify wine quality
wine_red <- wine_red %>%
  mutate(quality_label = ifelse(quality >= 7, "High Quality", "Low Quality"))

# View the dataset with the new column
head(wine_red)
##   fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free_sulfur_dioxide total_sulfur_dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality quality_label
## 1       5   Low Quality
## 2       5   Low Quality
## 3       5   Low Quality
## 4       6   Low Quality
## 5       5   Low Quality
## 6       5   Low Quality
# Summarize the average fixed acidity for each quality label
acidity_summary <- wine_red %>%
  group_by(quality_label) %>%
  summarise(mean_fixed_acidity = mean(fixed_acidity, na.rm = TRUE))

print(acidity_summary)
## # A tibble: 2 × 2
##   quality_label mean_fixed_acidity
##   <chr>                      <dbl>
## 1 High Quality                8.85
## 2 Low Quality                 8.24
# Filter wines with high alcohol content (> 10%) and group by quality label
high_alcohol_wines <- wine_red %>%
  filter(alcohol > 10) %>%
  group_by(quality_label) %>%
  summarise(count = n())

print(high_alcohol_wines)
## # A tibble: 2 × 2
##   quality_label count
##   <chr>         <int>
## 1 High Quality    196
## 2 Low Quality     656

To tidy and transform the Red Wine Quality dataset:

Renamed Columns: Made column names more descriptive. Handled Missing Values: Checked for and removed missing data. Long Format Transformation: Optionally transformed data to a long format for further analysis. Data Transformation: Added new variables and summarized key statistics for better insight into wine quality.