library(here)
library(tidyverse)

Introduction

The aim of this wine project is to study the quality of different types of wine. The main purpose is to predict wine preferences from objective analytical tests that are available at the certification step. Such a model is valuable not only for certification entities but also for wine producers and even consumers. It can be used to:

support the oenologist’s wine evaluations,
potentially improving the quality and speed of their decisions.

Moreover, measuring the impact of the physicochemical tests in the final wine quality is useful for improving the production process. Furthermore, it can help in target marketing, i.e. by applying similar techniques to model the consumer’s preferences of niche and/or profitable markets.

The dataset consists of 2 different types of wine: Red wine & White Wine.

Question 5a. What is the sample size?

# Import the datasets into R
wine_red <- read.csv(file = "C:/Users/tanma/Desktop/University of Cincinnati/Subjects/Stat Methods/data/winequality-red.csv", header = TRUE)

wine_white <- read.csv(file = "C:/Users/tanma/Desktop/University of Cincinnati/Subjects/Stat Methods/data/winequality-white.csv", header = TRUE)

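# Attach only the red wine data so its columns can be referenced by name.
# wine_white has the same column names, so attaching it as well would mask
# the red wine columns used in the rest of this section.
attach(wine_red)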
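Both data frames share the same column names, so once attached a bare name such as alcohol can only ever refer to one of them. A minimal sketch of an alternative is to qualify each column with its data frame via $ or with(). It uses two made-up toy data frames; red_toy and white_toy are illustrative names, not part of the project data.

```r
# Toy stand-ins for the two wine tables; the values are invented for illustration.
red_toy   <- data.frame(alcohol = c(9.4, 9.8, 10.5))
white_toy <- data.frame(alcohol = c(8.8, 9.5, 10.1, 12.0))

# Qualify the column with its data frame so there is no ambiguity.
mean_red   <- mean(red_toy$alcohol)
mean_white <- with(white_toy, mean(alcohol))

cat("Mean alcohol (red toy):  ", mean_red, "\n")
cat("Mean alcohol (white toy):", mean_white, "\n")
```

This costs a few extra characters per call, but each statistic is visibly tied to one dataset.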

x <- dim(wine_red)
sample_size_red <- x[1]
cat("The sample size of the red wine dataset is: ", sample_size_red)

## The sample size of the red wine dataset is:  1599

y <- dim(wine_white)
sample_size_white <- y[1]
cat("The sample size of the white wine dataset is: ", sample_size_white)

## The sample size of the white wine dataset is:  4898
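The row count can also be read directly with nrow(), and the number of variables with ncol(). The sketch below assumes nothing about the real files: it writes a three-row toy CSV to a temporary directory (toy_path and its contents are invented) and reads it back.

```r
# Hypothetical layout: a temporary directory stands in for the project's data folder.
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
toy_path <- file.path(data_dir, "toy-wine.csv")

# Three invented rows, written without row names so the file has two columns.
write.csv(data.frame(alcohol = c(9.4, 9.8, 10.5), quality = c(5L, 5L, 6L)),
          toy_path, row.names = FALSE)

toy_wine <- read.csv(toy_path, header = TRUE)
n_obs  <- nrow(toy_wine)   # sample size = number of rows
n_vars <- ncol(toy_wine)   # number of variables = number of columns
cat("Rows:", n_obs, " Columns:", n_vars, "\n")
```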

Because the two datasets have different sample sizes, we will study them separately.

Red Wine

The dataset for red wine looks as follows:

The summary of the red wine dataset is as follows:

str(wine_red)

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

summary(wine_red)

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

Fixed Acidity

We will first check the summary of the fixed.acidity column.

summary(fixed.acidity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Question 5b. Any outliers? Do you have any concerns about the data quality?

For fixed.acidity, the mean (8.32) is about 0.4 units above the median (7.90), which suggests that a few large values pull the mean upward. The summary also shows that 75% of the values are below 9.20 while the maximum is 15.90, so there appear to be a few high outliers. This gap between the maximum and the bulk of the data is a possible data quality concern that is worth checking.
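One way to make the outlier judgement concrete is Tukey's 1.5 × IQR rule, which is also the default whisker convention of boxplot() (using hinges in place of quartiles). The sketch below uses only the quartiles printed in the summary output above; the variable names are illustrative.

```r
# Quartiles and extremes copied from the summary(fixed.acidity) output above.
q1 <- 7.10
q3 <- 9.20
observed_min <- 4.60
observed_max <- 15.90

# Tukey's rule: points more than 1.5 * IQR beyond the quartiles are flagged.
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

cat("Fences:", lower_fence, "to", upper_fence, "\n")
cat("Minimum flagged:", observed_min < lower_fence, "\n")
cat("Maximum flagged:", observed_max > upper_fence, "\n")
```

With these inputs the fences are 3.95 and 12.35, so the maximum is flagged while the minimum is not.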

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

The summary() function condenses each variable into a handful of statistics (minimum, quartiles, median, mean and maximum). For a more in-depth understanding of the data, I also present the following statistics:

variance
standard deviation
quantile distribution

var(fixed.acidity)

## [1] 3.031416

sd(fixed.acidity)

## [1] 1.741096

quantile(fixed.acidity, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

##  10%  25%  50%  75%  90% 
##  6.5  7.1  7.9  9.2 10.7
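The same three statistics can be computed for every column in one pass instead of variable by variable. This is only a sketch: describe_column is a hypothetical helper, and toy is an invented data frame standing in for the wine data.

```r
# Toy data frame standing in for the wine data; the values are invented.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.2, 12.0),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.39, 3.30)
)

# Return variance, standard deviation and selected quantiles for one vector.
describe_column <- function(x, probs = c(0.1, 0.25, 0.5, 0.75, 0.9)) {
  c(variance = var(x), sd = sd(x), quantile(x, probs = probs))
}

# Apply the helper to every column; the result has one column per variable.
stats_table <- sapply(toy, describe_column)
print(round(stats_table, 3))
```

The result is a matrix with one row per statistic, which keeps all variables comparable at a glance.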

Question 5d. How can you visualize the distribution of each variable?

We can use boxplot and histogram to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(fixed.acidity, main = "BoxPlot")
hist(fixed.acidity, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(fixed.acidity), lwd = 2, col = "blue")
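Since the same pair of plots is drawn for every variable, the plotting calls can be wrapped in a function and looped over the columns. A possible sketch, where plot_distribution is a hypothetical helper and toy is simulated data rather than the wine data:

```r
# Draw a boxplot and a density-scaled histogram side by side for one variable.
plot_distribution <- function(x, label) {
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))            # restore the previous layout afterwards
  boxplot(x, main = paste("BoxPlot:", label))
  hist(x, freq = FALSE, main = paste("Histogram:", label),
       xlab = label, col = "lightblue")
  lines(density(x), lwd = 2, col = "blue")
  invisible(NULL)
}

# Simulated stand-in data: one right-skewed and one roughly symmetric column.
set.seed(42)
toy <- data.frame(a = rexp(200), b = rnorm(200, mean = 10))

# One pair of plots per column.
for (name in names(toy)) {
  plot_distribution(toy[[name]], name)
}
```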

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a long tail of points beyond the upper whisker.
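Skewness can also be quantified rather than judged by eye. The sketch below uses the moment-based coefficient (third central moment divided by the 3/2 power of the second); sample_skewness is a hypothetical helper and both vectors are invented. Positive values indicate a right skew, and values near zero indicate symmetry.

```r
# Moment-based sample skewness: third central moment over variance^(3/2).
sample_skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

symmetric_toy    <- c(1, 2, 3, 4, 5)      # invented, perfectly symmetric
right_skewed_toy <- c(1, 1, 2, 2, 3, 15)  # invented, one large value

cat("Symmetric toy:   ", sample_skewness(symmetric_toy), "\n")
cat("Right-skewed toy:", sample_skewness(right_skewed_toy), "\n")
```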

Volatile Acidity

We will first check the summary of the volatile.acidity column.

summary(volatile.acidity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Question 5b. Any outliers? Do you have any concerns about the data quality?

For volatile.acidity, the mean (0.5278) and median (0.5200) are very close to each other, so the bulk of the data appears roughly symmetric. Closeness of the mean and median alone does not establish normality, however. There are also some outliers: the maximum (1.5800) in particular lies far from the mean and median.
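A mean close to the median is consistent with a normal distribution but does not establish it. Two common checks are a normal Q-Q plot and the Shapiro-Wilk test, sketched here on simulated right-skewed data (skewed_toy is invented, not the wine data):

```r
set.seed(1)
skewed_toy <- rexp(500)   # simulated, strongly right-skewed sample

# Points that bend away from the reference line indicate non-normality.
qqnorm(skewed_toy, main = "Normal Q-Q plot (toy data)")
qqline(skewed_toy, col = "blue", lwd = 2)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(skewed_toy)
cat("Shapiro-Wilk p-value:", format.pval(sw$p.value), "\n")
```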

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

The summary() function condenses each variable into a handful of statistics (minimum, quartiles, median, mean and maximum). For a more in-depth understanding of the data, I also present the following statistics:

variance
standard deviation
quantile distribution

var(volatile.acidity)

## [1] 0.03206238

sd(volatile.acidity)

## [1] 0.1790597

quantile(volatile.acidity, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

##   10%   25%   50%   75%   90% 
## 0.310 0.390 0.520 0.640 0.745

Question 5d. How can you visualize the distribution of each variable?

We can use boxplot and histogram to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(volatile.acidity, main = "BoxPlot")
hist(volatile.acidity, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(volatile.acidity), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a long tail of points beyond the upper whisker.

Citric Acid

We will first check the summary of the citric.acid column.

summary(citric.acid)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Question 5b. Any outliers? Do you have any concerns about the data quality?

For citric.acid, the mean (0.271) and median (0.260) are very close to each other. There nevertheless appear to be outliers at the upper end: the summary shows that 75% of the values are below 0.420, whereas the maximum is 1.000.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

variance
standard deviation
quantile distribution

var(citric.acid)

## [1] 0.03794748

sd(citric.acid)

## [1] 0.1948011

quantile(citric.acid, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

##   10%   25%   50%   75%   90% 
## 0.010 0.090 0.260 0.420 0.522

Question 5d. How can you visualize the distribution of each variable?

We can use boxplot and histogram to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(citric.acid, main = "BoxPlot")
hist(citric.acid, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(citric.acid), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a tail extending beyond the upper whisker.

Residual Sugar

We will first check the summary of the residual.sugar column.

summary(residual.sugar)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Question 5b. Any outliers? Do you have any concerns about the data quality?

For residual.sugar, the mean (2.539) is noticeably above the median (2.200). There are definitely outliers present, as 75% of the values fall below 2.600 whereas the maximum is 15.500. Comparing the minimum, mean and maximum, the maximum is very large relative to the rest of the distribution, which raises a possible data quality concern.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

variance
standard deviation
quantile distribution

var(residual.sugar)

## [1] 1.987897

sd(residual.sugar)

## [1] 1.409928

quantile(residual.sugar, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

## 10% 25% 50% 75% 90% 
## 1.7 1.9 2.2 2.6 3.6

Question 5d. How can you visualize the distribution of each variable?

We can use boxplot and histogram to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(residual.sugar, main = "BoxPlot")
hist(residual.sugar, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(residual.sugar), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a long tail of points beyond the upper whisker.

Chlorides

We will first check the summary of the chlorides column.

summary(chlorides)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Question 5b. Any outliers? Do you have any concerns about the data quality?

For chlorides, the mean (0.08747) is somewhat above the median (0.07900). There are definitely outliers present, as 75% of the values fall below 0.09000 whereas the maximum is 0.61100. Comparing the minimum, mean and maximum, the maximum is very large relative to the rest of the distribution, which raises a possible data quality concern.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

variance
standard deviation
quantile distribution

var(chlorides)

## [1] 0.002215143

sd(chlorides)

## [1] 0.0470653

quantile(chlorides, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

##   10%   25%   50%   75%   90% 
## 0.060 0.070 0.079 0.090 0.109

Question 5d. How can you visualize the distribution of each variable?

We can use boxplot and histogram to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(chlorides, main = "BoxPlot")
hist(chlorides, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(chlorides), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a long tail of points beyond the upper whisker.

Free Sulphur Dioxide

We will first check the summary of the free.sulfur.dioxide column.

summary(free.sulfur.dioxide)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Question 5b. Any outliers? Do you have any concerns about the data quality?

For free.sulfur.dioxide, the mean (15.87) is somewhat above the median (14.00). There are definitely outliers present, as 75% of the values fall below 21.00 whereas the maximum is 72.00. Comparing the minimum, mean and maximum, the maximum is very large relative to the rest of the distribution, which raises a possible data quality concern. The minimum (1.00) also appears low relative to the rest of the distribution.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

variance
standard deviation
quantile distribution

var(free.sulfur.dioxide)

## [1] 109.4149

sd(free.sulfur.dioxide)

## [1] 10.46016

quantile(free.sulfur.dioxide, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

## 10% 25% 50% 75% 90% 
##   5   7  14  21  31

Question 5d. How can you visualize the distribution of each variable?

We can use boxplot and histogram to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(free.sulfur.dioxide, main = "BoxPlot")
hist(free.sulfur.dioxide, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(free.sulfur.dioxide), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a long tail of points beyond the upper whisker.

Total Sulphur Dioxide

We will first check the summary of the total.sulfur.dioxide column.

summary(total.sulfur.dioxide)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Question 5b. Any outliers? Do you have any concerns about the data quality?

For total.sulfur.dioxide, the mean (46.47) is well above the median (38.00). There are definitely outliers present, as 75% of the values fall below 62.00 whereas the maximum is 289.00. Comparing the minimum, mean and maximum, the maximum is very large relative to the rest of the distribution, which raises a possible data quality concern.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

variance
standard deviation
quantile distribution

var(total.sulfur.dioxide)

## [1] 1082.102

sd(total.sulfur.dioxide)

## [1] 32.89532

quantile(total.sulfur.dioxide, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

##  10%  25%  50%  75%  90% 
## 14.0 22.0 38.0 62.0 93.2

Question 5d. How can you visualize the distribution of each variable?

We can use boxplot and histogram to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(total.sulfur.dioxide, main = "BoxPlot")
hist(total.sulfur.dioxide, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(total.sulfur.dioxide), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a long tail of points beyond the upper whisker.

Density

We will first check the summary of the density column.

summary(density)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Question 5b. Any outliers? Do you have any concerns about the data quality?

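For density, the mean (0.9967) and median (0.9968) are almost equal, so the distribution looks roughly symmetric. This alone does not rule out outliers: by the 1.5 × IQR rule the fences are about 0.9922 and 1.0012, and both the minimum (0.9901) and the maximum (1.0037) fall outside them. Apart from these few extreme values, nothing in the summary points to a data quality problem.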
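A near-equal mean and median indicates symmetry but says little about extreme values, which can occur on both sides and cancel out. A small illustration with an invented vector, using boxplot.stats() to list the points a boxplot would draw beyond its whiskers:

```r
# Invented symmetric data with one extreme value on each side.
x <- c(-10, 1, 2, 3, 4, 5, 6, 7, 18)

cat("Mean:  ", mean(x), "\n")
cat("Median:", median(x), "\n")

# Points lying more than 1.5 * IQR beyond the hinges (the boxplot() default).
flagged <- boxplot.stats(x)$out
cat("Flagged points:", flagged, "\n")
```

Here the mean and median are both 4, yet the two extreme values are still flagged.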

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

variance
standard deviation
quantile distribution

var(density)

## [1] 3.562029e-06

sd(density)

## [1] 0.001887334

quantile(density, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

##      10%      25%      50%      75%      90% 
## 0.994556 0.995600 0.996750 0.997835 0.999140

Question 5d. How can you visualize the distribution of each variable?

We can use boxplot and histogram to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(density, main = "BoxPlot")
hist(density, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(density), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

No, the data is not noticeably skewed. The distribution appears to be approximately normal.

pH

We will first check the summary of the pH column.

summary(pH)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Question 5b. Any outliers? Do you have any concerns about the data quality?

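For pH, the mean (3.311) and median (3.310) are almost equal, so the distribution looks roughly symmetric. This alone does not rule out outliers: by the 1.5 × IQR rule the fences are about 2.925 and 3.685, and both the minimum (2.740) and the maximum (4.010) fall outside them. Apart from these few extreme values, nothing in the summary points to a data quality problem.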

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

variance
standard deviation
quantile distribution

var(pH)

## [1] 0.02383518

sd(pH)

## [1] 0.1543865

quantile(pH, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

##  10%  25%  50%  75%  90% 
## 3.12 3.21 3.31 3.40 3.51

Question 5d. How can you visualize the distribution of each variable?

We can use boxplot and histogram to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(pH, main = "BoxPlot")
hist(pH, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(pH), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the data seems slightly skewed to the right. This is evident in the boxplot, where there are somewhat more points at the upper end.

Sulphates

We will first check the summary of the sulphates column.

summary(sulphates)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Question 5b. Any outliers? Do you have any concerns about the data quality?

For sulphates, the mean (0.6581) is somewhat above the median (0.6200). There are definitely outliers present, as 75% of the values fall below 0.7300 whereas the maximum is 2.0000. Comparing the minimum, mean and maximum, the maximum is very large relative to the rest of the distribution, which raises a possible data quality concern.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

variance
standard deviation
quantile distribution

var(sulphates)

## [1] 0.02873262

sd(sulphates)

## [1] 0.169507

quantile(sulphates, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

##  10%  25%  50%  75%  90% 
## 0.50 0.55 0.62 0.73 0.85

Question 5d. How can you visualize the distribution of each variable?

We can use boxplot and histogram to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(sulphates, main = "BoxPlot")
hist(sulphates, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(sulphates), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a long tail of points beyond the upper whisker.

Alcohol

We will first check the summary of the alcohol column.

summary(alcohol)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Question 5b. Any outliers? Do you have any concerns about the data quality?

For alcohol, the mean (10.42) is slightly above the median (10.20), which hints at some high values. In addition, 75% of the values fall below 11.10 whereas the maximum is 14.90, indicating some outliers at the upper end.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

variance
standard deviation
quantile distribution

var(alcohol)

## [1] 1.135647

sd(alcohol)

## [1] 1.065668

quantile(alcohol, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

##  10%  25%  50%  75%  90% 
##  9.3  9.5 10.2 11.1 12.0

Question 5d. How can you visualize the distribution of each variable?

We can use boxplot and histogram to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(alcohol, main = "BoxPlot")
hist(alcohol, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(alcohol), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a tail of points beyond the upper whisker.

Quality - Target Variable

We will first check the summary of the quality column.

summary(quality)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

hist(quality)

The quality column is the target variable, i.e., the response we want to predict from the other (physicochemical) variables.

From the histogram we can also see that the most frequent quality score is 5; the maximum score is 8, as shown in the summary.
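Because quality takes only a few integer values, a frequency table gives the most frequent score and the highest score directly. A minimal sketch with invented scores (quality_toy is not the wine data):

```r
# Invented toy scores standing in for the quality column.
quality_toy <- c(5, 6, 5, 7, 5, 6, 4, 5, 6, 8)

# Count how often each score occurs.
score_counts <- table(quality_toy)
print(score_counts)

# The most frequent score (the mode) is the label of the largest count.
modal_score <- as.integer(names(score_counts)[which.max(score_counts)])
cat("Most frequent score:", modal_score, "\n")
cat("Highest score:      ", max(quality_toy), "\n")

# A bar chart of the counts is one option for a discrete score.
barplot(score_counts, xlab = "quality", ylab = "count")
```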

Wine_Project_Part A

Tanmay Khairnar

2023-09-05
