Wine Data Analysis

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   4.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.4     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)

wine <- read.csv("/Users/madhurimasarkar/Documents/MS Uni/1. Applied Stats Method - BANA7095/3. Assignment - Wine Data/winequality-red.csv",header = TRUE,sep = ";")

head(wine,10)

##    fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1            7.4             0.70        0.00            1.9     0.076
## 2            7.8             0.88        0.00            2.6     0.098
## 3            7.8             0.76        0.04            2.3     0.092
## 4           11.2             0.28        0.56            1.9     0.075
## 5            7.4             0.70        0.00            1.9     0.076
## 6            7.4             0.66        0.00            1.8     0.075
## 7            7.9             0.60        0.06            1.6     0.069
## 8            7.3             0.65        0.00            1.2     0.065
## 9            7.8             0.58        0.02            2.0     0.073
## 10           7.5             0.50        0.36            6.1     0.071
##    free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                   11                   34  0.9978 3.51      0.56     9.4
## 2                   25                   67  0.9968 3.20      0.68     9.8
## 3                   15                   54  0.9970 3.26      0.65     9.8
## 4                   17                   60  0.9980 3.16      0.58     9.8
## 5                   11                   34  0.9978 3.51      0.56     9.4
## 6                   13                   40  0.9978 3.51      0.56     9.4
## 7                   15                   59  0.9964 3.30      0.46     9.4
## 8                   15                   21  0.9946 3.39      0.47    10.0
## 9                    9                   18  0.9968 3.36      0.57     9.5
## 10                  17                  102  0.9978 3.35      0.80    10.5
##    quality
## 1        5
## 2        5
## 3        5
## 4        6
## 5        5
## 6        5
## 7        5
## 8        7
## 9        7
## 10       5

a. What is the sample size?

str(wine)

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Sample size: 1599 observations of 12 variables.

b. Any outliers? Do you have any concerns about the data quality?

# Reshape all numeric columns to long format, then draw one boxplot per variable
wine_box <- wine %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value") %>%
  ggplot(aes(y = value)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  facet_wrap(~ variable, scales = "free_y") +
  theme_minimal() +
  labs(title = "Boxplots of Numeric Variables",
       x = NULL, y = "Value")

wine_box

Observations:

Every variable has outliers, that is, points plotted beyond the boxplot whiskers.
These outliers raise a data quality concern, particularly for sulphates, chlorides and free.sulfur.dioxide, which show the largest number of outliers.
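
The outliers seen in the boxplots can also be counted numerically. The following is a minimal sketch, not part of the original analysis: `count_outliers` is a hypothetical helper that applies the usual 1.5 × IQR rule, and the demo vector holds the ten residual.sugar values printed by head(wine, 10) above. Quartiles from `quantile()` may differ slightly from the hinges drawn in the plot, so counts can differ by a point or two.

```r
# Hypothetical helper: count values lying outside the 1.5 * IQR fences.
count_outliers <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# First ten residual.sugar values, copied from the head(wine, 10) output
residual_sugar_head <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)
n_out <- count_outliers(residual_sugar_head)
print(n_out)

# On the full data (assumes `wine` has been read in as above):
# sapply(wine, count_outliers)   # outlier count per variable
# colSums(is.na(wine))           # missing values per variable
# sum(duplicated(wine))          # number of fully duplicated rows
```

The commented lines at the end are further basic data quality checks (missing values and duplicated rows) that could be run alongside the outlier counts.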

c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

summary(wine)

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

Observations:

The mean is greater than the median for most variables, which indicates that the data are right skewed.
Range = Max - Min. The range is large for residual.sugar (14.6), free.sulfur.dioxide (71) and total.sulfur.dioxide (283), so the spread of these variables is very high.
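
summary() does not report the standard deviation, IQR or range directly. As a rough sketch of a more compact table, assuming a hypothetical helper named `describe_numeric` (not part of the original analysis), the statistics can be collected into one row per variable. The demo data frame holds only the ten residual.sugar and alcohol values printed by head(wine, 10) above.

```r
# Hypothetical helper: one row of summary statistics per numeric column.
describe_numeric <- function(df) {
  num <- df[sapply(df, is.numeric)]
  t(sapply(num, function(x) c(
    mean   = mean(x),
    median = median(x),
    sd     = sd(x),
    IQR    = IQR(x),
    range  = diff(range(x))   # Max - Min
  )))
}

# Two columns copied from the head(wine, 10) output
wine_head <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
)
stats_head <- describe_numeric(wine_head)
print(round(stats_head, 3))

# On the full data (assumes `wine` has been read in as above):
# round(describe_numeric(wine), 3)
```

Reporting the IQR next to the range is a deliberate choice: the range depends only on the two most extreme values, so it is strongly affected by the outliers noted in part b.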

d. How can you visualize the distribution of each variable?

Histograms of each variable, used to check for skewness:

# Reshape all numeric columns to long format, then draw one histogram per variable
wine_hist <- wine %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~ variable, scales = "free_x") +
  theme_minimal()

wine_hist

Observations:

The data are positively skewed for 9 of the 12 variables: most values are concentrated at the low end, while a few large values pull the mean to the right of the median.

e. Do you see any skewed distributions?

All variables in the red wine dataset are right skewed (mean > median) except the following:

density: approximately symmetric (mean 0.9967 and median 0.9968 are almost equal)
pH: approximately symmetric (mean 3.311 and median 3.310 are almost equal)
quality: mean (5.636) < median (6.000), which suggests left skew
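
Comparing the mean with the median is only a rough indicator of skewness. As a sketch of a more direct check, the moment-based sample skewness can be computed per variable. `sample_skewness` is a hypothetical helper written for illustration (base R has no built-in skewness function), and the demo vector again uses the ten residual.sugar values printed by head(wine, 10) above.

```r
# Hypothetical helper: moment-based sample skewness,
# third central moment divided by the 1.5 power of the second.
# Positive values indicate right skew, negative values left skew,
# and values near zero an approximately symmetric distribution.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# First ten residual.sugar values, copied from the head(wine, 10) output
residual_sugar_head <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)
skew_head <- sample_skewness(residual_sugar_head)
print(skew_head)

# On the full data (assumes `wine` has been read in as above):
# round(sapply(wine, sample_skewness), 2)
```

Because quality is an integer score with few distinct values, its mean-versus-median comparison is especially coarse, so a coefficient of this kind is a useful cross-check for that variable.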

Wine Data Analysis

Madhurima Sarkar

2025-09-15
