library(here)
library(tidyverse)

Introduction

The aim of this wine project is to study the quality of different types of wine. Its main purpose is to predict wine preferences from the objective analytical tests that are available at the certification step. Building such a model is valuable not only for certification entities but also for wine producers and even consumers.

Moreover, measuring the impact of the physicochemical tests on the final wine quality is useful for improving the production process. It can also help in target marketing, e.g. by applying similar techniques to model consumers’ preferences in niche and/or profitable markets.

The data consist of two datasets, one for each type of wine: red wine and white wine.

Question 5a. What is the sample size?

#importing the dataset into R
wine_red <- read.csv(file = "C:/Users/tanma/Desktop/University of Cincinnati/Subjects/Stat Methods/data/winequality-red.csv", header = TRUE)

wine_white <- read.csv(file = "C:/Users/tanma/Desktop/University of Cincinnati/Subjects/Stat Methods/data/winequality-white.csv", header = TRUE)

#attaching only the red wine dataset to access its variables without calling the dataframe
#(wine_white has the same column names, so attaching it as well would mask the red wine variables)
attach(wine_red)
#nrow() returns the number of observations (rows)
sample_size_red <- nrow(wine_red)
cat("The sample size of the red wine dataset is: ", sample_size_red)
## The sample size of the red wine dataset is:  1599
sample_size_white <- nrow(wine_white)
cat("The sample size of the white wine dataset is: ", sample_size_white)
## The sample size of the white wine dataset is:  4898

Since the sample sizes of the two datasets differ, we will study the datasets separately.
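Because the two data frames share the same column names, one way to keep them apart is to hold them in a named list and refer to columns explicitly instead of relying on attach(). The following is only a minimal sketch: the tiny `red` and `white` data frames are made-up stand-ins for wine_red and wine_white, and the names `wines`, `sizes` and `mean_alcohol` are hypothetical.

```r
# Made-up miniature stand-ins for wine_red and wine_white (illustration only)
red   <- data.frame(alcohol = c(9.4, 9.8, 10.5),      quality = c(5L, 5L, 7L))
white <- data.frame(alcohol = c(8.8, 9.5, 10.1, 12.0), quality = c(6L, 6L, 6L, 7L))

# Keep the datasets in a named list so each analysis step names its source
wines <- list(red = red, white = white)

# Sample size of each dataset
sizes <- vapply(wines, nrow, integer(1))
print(sizes)

# The same statistic computed separately for each dataset
mean_alcohol <- vapply(wines, function(d) mean(d$alcohol), numeric(1))
print(mean_alcohol)

# with() gives attach-like convenience for a single expression, without masking
with(red, summary(alcohol))
```

With the real data, `wines <- list(red = wine_red, white = wine_white)` would play the same role.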

Red Wine

The dataset for red wine looks as follows:

The structure and summary of the red wine dataset are as follows:

str(wine_red)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
summary(wine_red)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000
  1. Fixed Acidity
summary(fixed.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Question 5b. Any outliers? Do you have any concerns about the data quality?

For fixed.acidity the mean (8.32) is about 0.4 units above the median (7.90), which suggests a right-skewed distribution. 75% of the values are below 9.20, whereas the maximum is 15.90. By the 1.5 × IQR rule the upper fence is 9.20 + 1.5 × (9.20 - 7.10) = 12.35, so the maximum is an outlier, while the minimum of 4.60 lies above the lower fence of 3.95 and is not. The largest values are therefore far from the bulk of the data and are worth checking, although they may be genuine measurements rather than data quality problems.
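The outlier judgement can be made reproducible with the common 1.5 × IQR (Tukey fence) rule, which is also what boxplot() uses by default to draw points beyond the whiskers. The sketch below is an illustration rather than part of the original analysis: `iqr_fences()` and `count_outliers()` are hypothetical helper names, and `toy` is a made-up vector.

```r
# Hypothetical helper: lower and upper Tukey fences of a numeric vector
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Hypothetical helper: number of values outside the fences
count_outliers <- function(x, k = 1.5) {
  f <- iqr_fences(x, k)
  sum(x < f["lower"] | x > f["upper"], na.rm = TRUE)
}

# Made-up values with one clearly extreme observation
toy <- c(7.1, 7.4, 7.8, 7.9, 8.3, 9.2, 15.9)
print(iqr_fences(toy))
print(count_outliers(toy))
```

Applied to the real data, `count_outliers(wine_red$fixed.acidity)` would report how many observations fall outside the fences, which is more informative than comparing the maximum with the third quartile alone.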

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

The summary() function gives a concise overview of a variable: the minimum, first quartile, median, mean, third quartile and maximum. To describe the spread in more detail, I also present the variance, the standard deviation and selected quantiles:

var(fixed.acidity)
## [1] 3.031416
sd(fixed.acidity)
## [1] 1.741096
quantile(fixed.acidity, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
##  10%  25%  50%  75%  90% 
##  6.5  7.1  7.9  9.2 10.7
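The same statistics can be collected for every numeric variable at once instead of being computed one variable at a time. The following is a minimal sketch under that assumption; `describe_numeric()` is a hypothetical helper, and `toy_df` is a made-up data frame used only to show the shape of the result.

```r
# Hypothetical helper: one row of summary statistics per numeric column
describe_numeric <- function(df, probs = c(0.1, 0.25, 0.5, 0.75, 0.9)) {
  num <- df[vapply(df, is.numeric, logical(1))]   # keep numeric columns only
  stats <- vapply(
    num,
    function(x) c(mean(x), var(x), sd(x),
                  quantile(x, probs = probs, names = FALSE)),
    numeric(3 + length(probs))
  )
  out <- t(stats)                                  # variables in rows
  colnames(out) <- c("mean", "var", "sd", paste0(probs * 100, "%"))
  out
}

# Made-up data: two numeric columns and one non-numeric column
toy_df <- data.frame(a = c(1, 2, 3, 4, 5),
                     b = c(2, 4, 6, 8, 10),
                     g = letters[1:5])
res <- describe_numeric(toy_df)
print(res)
```

With the real data, `round(describe_numeric(wine_red), 3)` would give a single table covering all twelve variables.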

Question 5d. How can you visualize the distribution of each variable?

We can use a boxplot and a histogram with an overlaid density curve to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(fixed.acidity, main = "BoxPlot")
hist(fixed.acidity, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(fixed.acidity), lwd = 2, col = "blue")
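Since the same pair of plots is drawn for every variable, the plotting code can be wrapped in a small function. This is a sketch only; `plot_distribution()` is a hypothetical name, and the `toy` values are simulated rather than taken from the wine data.

```r
# Hypothetical helper: boxplot and density-scaled histogram side by side
plot_distribution <- function(x, name) {
  old_par <- par(mfrow = c(1, 2))   # two panels in one row
  on.exit(par(old_par))             # restore the previous layout afterwards
  boxplot(x, main = paste("Boxplot of", name))
  hist(x, freq = FALSE, main = paste("Histogram of", name),
       xlab = name, col = "lightblue")
  lines(stats::density(x), lwd = 2, col = "blue")  # kernel density overlay
  invisible(NULL)
}

# Simulated right-skewed values, for illustration only
set.seed(42)
toy <- rexp(200, rate = 1)
plot_distribution(toy, "toy variable")
```

With the real data, a loop such as `for (v in names(wine_red)) plot_distribution(wine_red[[v]], v)` would produce the plots for all variables. Calling `stats::density()` explicitly also avoids confusion with the dataset's own `density` column.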

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a long tail of points beyond the upper whisker.
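The visual impression of skewness can be backed by a number. The sketch below computes the moment-based sample skewness (positive for a right tail, near zero for a symmetric sample); `sample_skewness()` is a hypothetical helper and both example vectors are made up.

```r
# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

symmetric  <- c(1, 2, 3, 4, 5)    # made-up symmetric sample
right_tail <- c(1, 2, 3, 4, 20)   # made-up sample with a long right tail

print(sample_skewness(symmetric))
print(sample_skewness(right_tail))
```

A quicker but rougher check is to compare the mean with the median: a mean clearly above the median, as for fixed.acidity, points to right skew.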

  2. Volatile Acidity
summary(volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Question 5b. Any outliers? Do you have any concerns about the data quality?

For volatile.acidity the mean (0.5278) and median (0.5200) are very close to each other, so the centre of the distribution looks roughly symmetric. That alone does not rule out outliers: by the 1.5 × IQR rule the upper fence is 0.64 + 1.5 × (0.64 - 0.39) = 1.015, so the maximum of 1.58 is an outlier. The minimum of 0.12 lies above the lower fence of 0.015 and is not.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

The summary() function gives a concise overview of a variable: the minimum, first quartile, median, mean, third quartile and maximum. To describe the spread in more detail, I also present the variance, the standard deviation and selected quantiles:

var(volatile.acidity)
## [1] 0.03206238
sd(volatile.acidity)
## [1] 0.1790597
quantile(volatile.acidity, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
##   10%   25%   50%   75%   90% 
## 0.310 0.390 0.520 0.640 0.745

Question 5d. How can you visualize the distribution of each variable?

We can use a boxplot and a histogram with an overlaid density curve to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(volatile.acidity, main = "BoxPlot")
hist(volatile.acidity, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(volatile.acidity), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a long tail of points beyond the upper whisker.

  3. Citric Acid
summary(citric.acid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Question 5b. Any outliers? Do you have any concerns about the data quality?

For citric.acid the mean (0.271) and median (0.260) are very close to each other, but there are still outliers. The summary() output shows that 75% of the values are below 0.420, whereas the maximum is 1.000. By the 1.5 × IQR rule the upper fence is 0.42 + 1.5 × (0.42 - 0.09) = 0.915, so the maximum is an outlier.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

The summary() function gives a concise overview of a variable: the minimum, first quartile, median, mean, third quartile and maximum. To describe the spread in more detail, I also present the variance, the standard deviation and selected quantiles:

var(citric.acid)
## [1] 0.03794748
sd(citric.acid)
## [1] 0.1948011
quantile(citric.acid, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
##   10%   25%   50%   75%   90% 
## 0.010 0.090 0.260 0.420 0.522

Question 5d. How can you visualize the distribution of each variable?

We can use a boxplot and a histogram with an overlaid density curve to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(citric.acid, main = "BoxPlot")
hist(citric.acid, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(citric.acid), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, where the upper tail extends beyond the upper whisker.

  4. Residual Sugar
summary(residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Question 5b. Any outliers? Do you have any concerns about the data quality?

For residual.sugar the mean (2.539) is noticeably above the median (2.200). There are clearly outliers: 75% of the values fall below 2.600, whereas the maximum is 15.500, far beyond the 1.5 × IQR upper fence of 2.6 + 1.5 × (2.6 - 1.9) = 3.65. The maximum is very large compared with the rest of the distribution, so these extreme values are worth checking, although they may be genuine measurements rather than data quality problems.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

The summary() function gives a concise overview of a variable: the minimum, first quartile, median, mean, third quartile and maximum. To describe the spread in more detail, I also present the variance, the standard deviation and selected quantiles:

var(residual.sugar)
## [1] 1.987897
sd(residual.sugar)
## [1] 1.409928
quantile(residual.sugar, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90% 
## 1.7 1.9 2.2 2.6 3.6

Question 5d. How can you visualize the distribution of each variable?

We can use a boxplot and a histogram with an overlaid density curve to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(residual.sugar, main = "BoxPlot")
hist(residual.sugar, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(residual.sugar), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a long tail of points beyond the upper whisker.

  5. Chlorides
summary(chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Question 5b. Any outliers? Do you have any concerns about the data quality?

For chlorides the mean (0.08747) is somewhat above the median (0.07900). There are clearly outliers: 75% of the values fall below 0.09000, whereas the maximum is 0.61100, far beyond the 1.5 × IQR upper fence of 0.09 + 1.5 × (0.09 - 0.07) = 0.12. The minimum of 0.012 also lies below the lower fence of 0.04. The maximum is very large compared with the rest of the distribution, so these extreme values are worth checking, although they may be genuine measurements rather than data quality problems.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

The summary() function gives a concise overview of a variable: the minimum, first quartile, median, mean, third quartile and maximum. To describe the spread in more detail, I also present the variance, the standard deviation and selected quantiles:

var(chlorides)
## [1] 0.002215143
sd(chlorides)
## [1] 0.0470653
quantile(chlorides, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
##   10%   25%   50%   75%   90% 
## 0.060 0.070 0.079 0.090 0.109

Question 5d. How can you visualize the distribution of each variable?

We can use a boxplot and a histogram with an overlaid density curve to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(chlorides, main = "BoxPlot")
hist(chlorides, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(chlorides), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a long tail of points beyond the upper whisker.

  6. Free Sulphur Dioxide
summary(free.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Question 5b. Any outliers? Do you have any concerns about the data quality?

For free.sulfur.dioxide the mean (15.87) is somewhat above the median (14.00). There are clearly outliers at the upper end: 75% of the values fall below 21.00, whereas the maximum is 72.00, well beyond the 1.5 × IQR upper fence of 21 + 1.5 × (21 - 7) = 42. The maximum is large compared with the rest of the distribution, so these extreme values are worth checking, although they may be genuine measurements rather than data quality problems. The minimum of 1.00 looks low, but the lower fence is negative (7 - 21 = -14), so it is not an outlier by this rule.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

The summary() function gives a concise overview of a variable: the minimum, first quartile, median, mean, third quartile and maximum. To describe the spread in more detail, I also present the variance, the standard deviation and selected quantiles:

var(free.sulfur.dioxide)
## [1] 109.4149
sd(free.sulfur.dioxide)
## [1] 10.46016
quantile(free.sulfur.dioxide, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90% 
##   5   7  14  21  31

Question 5d. How can you visualize the distribution of each variable?

We can use a boxplot and a histogram with an overlaid density curve to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(free.sulfur.dioxide, main = "BoxPlot")
hist(free.sulfur.dioxide, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(free.sulfur.dioxide), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a long tail of points beyond the upper whisker.

  7. Total Sulphur Dioxide
summary(total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Question 5b. Any outliers? Do you have any concerns about the data quality?

For total.sulfur.dioxide the mean (46.47) is well above the median (38.00). There are clearly outliers: 75% of the values fall below 62.00, whereas the maximum is 289.00, far beyond the 1.5 × IQR upper fence of 62 + 1.5 × (62 - 22) = 122. The maximum is very large compared with the rest of the distribution, so these extreme values are worth checking, although they may be genuine measurements rather than data quality problems.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

The summary() function gives a concise overview of a variable: the minimum, first quartile, median, mean, third quartile and maximum. To describe the spread in more detail, I also present the variance, the standard deviation and selected quantiles:

var(total.sulfur.dioxide)
## [1] 1082.102
sd(total.sulfur.dioxide)
## [1] 32.89532
quantile(total.sulfur.dioxide, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
##  10%  25%  50%  75%  90% 
## 14.0 22.0 38.0 62.0 93.2

Question 5d. How can you visualize the distribution of each variable?

We can use a boxplot and a histogram with an overlaid density curve to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(total.sulfur.dioxide, main = "BoxPlot")
hist(total.sulfur.dioxide, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(total.sulfur.dioxide), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a long tail of points beyond the upper whisker.

  8. Density
summary(density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Question 5b. Any outliers? Do you have any concerns about the data quality?

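For density the mean (0.9967) and median (0.9968) are almost equal, which indicates a symmetric distribution. Equal mean and median do not by themselves rule out outliers, however: by the 1.5 × IQR rule the fences are roughly 0.9923 and 1.0011, so both the minimum (0.9901) and the maximum (1.0037) fall slightly outside them. These extremes are balanced on both sides and close to the fences, so the summary gives no reason to doubt the data quality.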

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

The summary() function gives a concise overview of a variable: the minimum, first quartile, median, mean, third quartile and maximum. To describe the spread in more detail, I also present the variance, the standard deviation and selected quantiles:

var(density)
## [1] 3.562029e-06
sd(density)
## [1] 0.001887334
quantile(density, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
##      10%      25%      50%      75%      90% 
## 0.994556 0.995600 0.996750 0.997835 0.999140

Question 5d. How can you visualize the distribution of each variable?

We can use a boxplot and a histogram with an overlaid density curve to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(density, main = "BoxPlot")
hist(density, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(density), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

No, the data do not appear to be skewed. The distribution looks approximately normal.
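The impression of normality can be checked with a normal Q-Q plot and, more formally, with the Shapiro-Wilk test from base R. The sketch below uses simulated values whose mean and standard deviation only roughly mimic the density column; it is not the original data, and `toy` and `sw` are illustrative names.

```r
# Simulated stand-in for the density column (illustration only)
set.seed(123)
toy <- rnorm(300, mean = 0.9967, sd = 0.0019)

# Points close to the reference line suggest approximate normality
qqnorm(toy, main = "Normal Q-Q plot")
qqline(toy, col = "blue", lwd = 2)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(toy)
print(sw$p.value)
```

With the real data the corresponding calls would be `qqnorm(wine_red$density)` and `shapiro.test(wine_red$density)`. With samples of this size the test can reject normality for small departures, so the plot is usually the more useful guide.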

  9. pH
summary(pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Question 5b. Any outliers? Do you have any concerns about the data quality?

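For pH the mean (3.311) and median (3.310) are almost equal, which indicates a roughly symmetric distribution. Equal mean and median do not by themselves rule out outliers, however: by the 1.5 × IQR rule the fences are 3.21 - 0.285 = 2.925 and 3.40 + 0.285 = 3.685, so both the minimum (2.74) and the maximum (4.01) fall outside them. All values are plausible for wine, so the summary gives no reason to doubt the data quality.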

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

The summary() function gives a concise overview of a variable: the minimum, first quartile, median, mean, third quartile and maximum. To describe the spread in more detail, I also present the variance, the standard deviation and selected quantiles:

var(pH)
## [1] 0.02383518
sd(pH)
## [1] 0.1543865
quantile(pH, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
##  10%  25%  50%  75%  90% 
## 3.12 3.21 3.31 3.40 3.51

Question 5d. How can you visualize the distribution of each variable?

We can use a boxplot and a histogram with an overlaid density curve to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(pH, main = "BoxPlot")
hist(pH, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(pH), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the data seem slightly skewed to the right, as the boxplot shows somewhat more points beyond the upper whisker than below the lower one.

  10. Sulphates
summary(sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Question 5b. Any outliers? Do you have any concerns about the data quality?

For sulphates the mean (0.6581) is somewhat above the median (0.6200). There are clearly outliers: 75% of the values fall below 0.7300, whereas the maximum is 2.0000, twice the 1.5 × IQR upper fence of 0.73 + 1.5 × (0.73 - 0.55) = 1.00. The maximum is very large compared with the rest of the distribution, so these extreme values are worth checking, although they may be genuine measurements rather than data quality problems.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

The summary() function gives a concise overview of a variable: the minimum, first quartile, median, mean, third quartile and maximum. To describe the spread in more detail, I also present the variance, the standard deviation and selected quantiles:

var(sulphates)
## [1] 0.02873262
sd(sulphates)
## [1] 0.169507
quantile(sulphates, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
##  10%  25%  50%  75%  90% 
## 0.50 0.55 0.62 0.73 0.85

Question 5d. How can you visualize the distribution of each variable?

We can use a boxplot and a histogram with an overlaid density curve to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(sulphates, main = "BoxPlot")
hist(sulphates, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(sulphates), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a long tail of points beyond the upper whisker.

  11. Alcohol
summary(alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Question 5b. Any outliers? Do you have any concerns about the data quality?

For alcohol the mean (10.42) is somewhat above the median (10.20), which suggests a mild right skew. 75% of the values fall below 11.10, whereas the maximum is 14.90. This exceeds the 1.5 × IQR upper fence of 11.1 + 1.5 × (11.1 - 9.5) = 13.5, so there are some outliers at the upper end.

Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

The summary() function gives a concise overview of a variable: the minimum, first quartile, median, mean, third quartile and maximum. To describe the spread in more detail, I also present the variance, the standard deviation and selected quantiles:

var(alcohol)
## [1] 1.135647
sd(alcohol)
## [1] 1.065668
quantile(alcohol, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
##  10%  25%  50%  75%  90% 
##  9.3  9.5 10.2 11.1 12.0

Question 5d. How can you visualize the distribution of each variable?

We can use a boxplot and a histogram with an overlaid density curve to visualize the distribution.

par(mfrow = c(1, 2))
boxplot(alcohol, main = "BoxPlot")
hist(alcohol, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(alcohol), lwd = 2, col = "blue")

Question 5e. Do you see any skewed distributions?

Yes, the distribution is skewed to the right. This is evident from the boxplot, which shows a tail of points beyond the upper whisker.

  12. Quality - Target Variable
summary(quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
hist(quality)

The quality column is the target variable, i.e. the response that we would model as a function of the other (physicochemical) variables.

From the histogram we can also see that the most frequent quality score is 5, while the observed scores range from 3 to 8.
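Because quality takes only a few integer values, a frequency table and a bar plot describe it more directly than a histogram. The sketch below uses only the first ten quality scores visible in the str(wine_red) output, so the counts are illustrative and not those of the full dataset; `quality_head`, `quality_counts` and `most_common` are hypothetical names.

```r
# First ten quality scores shown in the str(wine_red) output
quality_head <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Frequency of each score
quality_counts <- table(quality_head)
print(quality_counts)

# Most frequent score (the mode)
most_common <- as.integer(names(quality_counts)[which.max(quality_counts)])
print(most_common)

# Bar plot: one bar per distinct score
barplot(quality_counts, xlab = "quality", ylab = "count")
```

On the full data, `table(wine_red$quality)` would give the exact count for each score.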