library(here)
library(tidyverse)
The main aim of the wine project is to study the quality of different types of wine by predicting wine preferences from objective analytical tests that are available at the certification step. Building such a model is valuable not only for certification entities but also for wine producers and even consumers. It can be used to support the oenologist’s wine evaluations, potentially improving the quality and speed of their decisions.
Moreover, measuring the impact of the physicochemical tests in the final wine quality is useful for improving the production process. Furthermore, it can help in target marketing, i.e. by applying similar techniques to model the consumer’s preferences of niche and/or profitable markets.
The data consist of two datasets, one for red wine and one for white wine.
Question 5a. What is the sample size?
# Import the datasets into R (TRUE is spelled out because T is an ordinary variable that can be reassigned)
wine_red <- read.csv(file = "C:/Users/tanma/Desktop/University of Cincinnati/Subjects/Stat Methods/data/winequality-red.csv", header = TRUE)
wine_white <- read.csv(file = "C:/Users/tanma/Desktop/University of Cincinnati/Subjects/Stat Methods/data/winequality-white.csv", header = TRUE)
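# Attach the red wine dataset so its variables can be used without the data frame prefix.
# wine_white is not attached: its columns share names with wine_red, so attaching it afterwards would mask the red wine columns.
attach(wine_red)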
x <- dim(wine_red)
sample_size_red <- x[1]
cat("The sample size of the red wine dataset is: ", sample_size_red)
## The sample size of the red wine dataset is: 1599
y <- dim(wine_white)
sample_size_white <- y[1]
cat("The sample size of the white wine dataset is: ", sample_size_white)
## The sample size of the white wine dataset is: 4898
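Using attach() as above puts columns on R's search path, where columns that share a name across two data frames can mask one another. A minimal sketch of two attach-free alternatives follows; toy_red and toy_white are made-up stand-ins for wine_red and wine_white, not the real data.

```r
# Toy stand-ins for wine_red and wine_white (values are made up for illustration)
toy_red   <- data.frame(pH = c(3.5, 3.2, 3.3), quality = c(5L, 5L, 6L))
toy_white <- data.frame(pH = c(3.0, 3.1), quality = c(6L, 7L))

# Refer to a column through its data frame, so there is no ambiguity
mean_ph_red <- mean(toy_red$pH)

# with() evaluates an expression inside one data frame only
mean_ph_white <- with(toy_white, mean(pH))

# nrow() returns the sample size directly, without indexing dim()
n_red   <- nrow(toy_red)
n_white <- nrow(toy_white)

cat("Red:", n_red, "rows, mean pH", mean_ph_red, "\n")
cat("White:", n_white, "rows, mean pH", mean_ph_white, "\n")
```

Both forms keep the two datasets clearly separated, which matters here because they are analysed one after the other.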
Since the sample sizes of the two datasets differ, we will study them separately.
The structure and summary of the red wine dataset are as follows:
str(wine_red)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
summary(wine_red)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
summary(fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Question 5b. Any outliers? Do you have any concerns about the data quality?
For fixed.acidity, the mean (8.32) is about 0.4 units above the median (7.90), which suggests a few high values pulling the mean upward. By the 1.5 × IQR rule, the upper fence is 9.20 + 1.5 × 2.10 = 12.35, so the maximum of 15.90 is an outlier, while the minimum of 4.60 lies inside the lower fence of 3.95. 75% of the values are below 9.20, so the maximum sits far in the upper tail. Such values are worth checking, although an extreme value alone does not prove a data quality problem.
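One common way to make "outlier" concrete is the 1.5 × IQR rule that boxplots use. The sketch below applies it to the fixed.acidity quartiles reported by summary() above; the helper flag_outliers and the toy vector are illustrative assumptions, not part of the original analysis.

```r
# Quartiles of fixed.acidity as reported by summary() above
q1 <- 7.10
q3 <- 9.20

iqr <- q3 - q1                    # interquartile range
lower_fence <- q1 - 1.5 * iqr     # values below this are flagged
upper_fence <- q3 + 1.5 * iqr     # values above this are flagged
cat("Fences:", lower_fence, upper_fence, "\n")

# On a full column, boxplot.stats() returns the points a boxplot would draw
# beyond the whiskers. It works from hinges, which can differ slightly from
# the quartiles that quantile() reports.
flag_outliers <- function(x) boxplot.stats(x)$out

# Toy vector for illustration only (not the real data)
toy_values <- c(7.1, 7.4, 7.8, 7.9, 8.3, 9.2, 15.9)
toy_outliers <- flag_outliers(toy_values)
print(toy_outliers)
```

On the real data, flag_outliers(wine_red$fixed.acidity) would list the flagged observations directly.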
Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?
The summary() function gives a concise overview of each variable: the minimum, first quartile, median, mean, third quartile, and maximum. Alongside it, I present the following statistics for a more in-depth understanding of the data:
variance
standard deviation
quantile distribution
var(fixed.acidity)
## [1] 3.031416
sd(fixed.acidity)
## [1] 1.741096
quantile(fixed.acidity, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 6.5 7.1 7.9 9.2 10.7
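The same statistics can be computed for every numeric column in one step rather than one variable at a time. The following is a minimal sketch under that assumption; describe_columns and the toy data frame are hypothetical names introduced for illustration.

```r
# Compute mean, variance, standard deviation and selected quantiles
# for every numeric column of a data frame, one row per variable.
describe_columns <- function(df) {
  numeric_cols <- df[sapply(df, is.numeric)]
  t(sapply(numeric_cols, function(x) {
    c(mean = mean(x),
      var  = var(x),
      sd   = sd(x),
      quantile(x, probs = c(0.1, 0.25, 0.5, 0.75, 0.9)))
  }))
}

# Toy data for illustration only (not the wine data)
toy <- data.frame(a = c(1, 2, 3, 4, 5), b = c(10, 20, 30, 40, 100))
stats_table <- describe_columns(toy)
print(round(stats_table, 3))
```

Calling describe_columns(wine_red) would give one table covering all twelve variables.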
Question 5d. How can you visualize the distribution of each variable?
We can use boxplot and histogram to visualize the distribution.
par(mfrow = c(1, 2))
boxplot(fixed.acidity, main = "BoxPlot")
hist(fixed.acidity, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(fixed.acidity), lwd = 2, col = "blue")
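Because the same boxplot and histogram pair is drawn for every variable, the plotting code can be wrapped in a small function and applied in a loop. This is a sketch only; plot_distribution and the simulated toy columns are assumptions made for illustration.

```r
# Draw a boxplot and a density-scaled histogram side by side for one variable.
plot_distribution <- function(x, label) {
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))           # restore the previous layout afterwards
  boxplot(x, main = paste("Boxplot of", label))
  hist(x, freq = FALSE, main = paste("Histogram of", label),
       xlab = label, col = "lightblue")
  lines(density(x), lwd = 2, col = "blue")
  invisible(NULL)
}

# Simulated toy columns for illustration only (not the wine data)
set.seed(42)
toy <- data.frame(
  skewed_var    = rexp(100, rate = 0.5),
  symmetric_var = rnorm(100, mean = 3.3, sd = 0.15)
)

for (name in names(toy)) {
  plot_distribution(toy[[name]], name)
}
```

With the real data, the loop would run over names(wine_red) and pass wine_red[[name]] to the function.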
Question 5e. Do you see any skewed distributions?
Yes, the distribution is skewed to the right, as is evident from the boxplot, which shows a long tail of points above the upper whisker.
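Skewness can also be quantified rather than judged by eye. The sketch below uses the moment-based sample skewness (third central moment divided by the second central moment raised to the power 1.5); sample_skewness and the toy vectors are illustrative assumptions. A positive value indicates a right skew and a value near zero indicates symmetry.

```r
# Moment-based sample skewness: m3 / m2^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)   # second central moment
  m3 <- mean(centered^3)   # third central moment
  m3 / m2^1.5
}

# Toy vectors for illustration only
right_tailed <- c(1, 1, 2, 2, 3, 10)
symmetric    <- c(1, 2, 3, 4, 5)

skew_right <- sample_skewness(right_tailed)
skew_sym   <- sample_skewness(symmetric)
cat("Right-tailed:", skew_right, " Symmetric:", skew_sym, "\n")
```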
summary(volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Question 5b. Any outliers? Do you have any concerns about the data quality?
For volatile.acidity, the mean (0.5278) and median (0.5200) are very close, so the centre of the distribution looks roughly symmetric, although this alone does not establish normality. There are some high outliers: the upper fence under the 1.5 × IQR rule is 0.64 + 1.5 × 0.25 = 1.015, and the maximum is 1.58. The minimum of 0.12 lies inside the lower fence of 0.015.
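Comparing the mean and median is only a rough check of shape. A normal Q-Q plot and the Shapiro-Wilk test are more direct ways to assess normality; the sketch below applies them to simulated data, which is an assumption made purely for illustration.

```r
# Simulated samples for illustration only (not the wine data)
set.seed(1)
normal_like  <- rnorm(200, mean = 0.5, sd = 0.1)
right_skewed <- rexp(200, rate = 2)

# Shapiro-Wilk test: a small p-value is evidence against normality
p_normal <- shapiro.test(normal_like)$p.value
p_skewed <- shapiro.test(right_skewed)$p.value
cat("p-value, normal-like sample: ", p_normal, "\n")
cat("p-value, right-skewed sample:", p_skewed, "\n")

# Q-Q plot: points bending away from the line indicate non-normality
qqnorm(right_skewed, main = "Normal Q-Q plot, right-skewed sample")
qqline(right_skewed, col = "blue")
```

shapiro.test() accepts between 3 and 5000 observations, so it can be applied to the 1599 red wine rows.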
Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?
The summary() function gives a concise overview of each variable: the minimum, first quartile, median, mean, third quartile, and maximum. Alongside it, I present the following statistics for a more in-depth understanding of the data:
variance
standard deviation
quantile distribution
var(volatile.acidity)
## [1] 0.03206238
sd(volatile.acidity)
## [1] 0.1790597
quantile(volatile.acidity, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 0.310 0.390 0.520 0.640 0.745
Question 5d. How can you visualize the distribution of each variable?
We can use boxplot and histogram to visualize the distribution.
par(mfrow = c(1, 2))
boxplot(volatile.acidity, main = "BoxPlot")
hist(volatile.acidity, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(volatile.acidity), lwd = 2, col = "blue")
Question 5e. Do you see any skewed distributions?
Yes, the distribution is skewed to the right, as is evident from the boxplot, which shows a long tail of points above the upper whisker.
summary(citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Question 5b. Any outliers? Do you have any concerns about the data quality?
For citric.acid, the mean (0.271) and median (0.260) are very close to each other, but there is at least one outlier: 75% of the values are below 0.420, the upper fence under the 1.5 × IQR rule is 0.42 + 1.5 × 0.33 = 0.915, and the maximum is 1.000.
Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?
The summary() function gives a concise overview of each variable: the minimum, first quartile, median, mean, third quartile, and maximum. Alongside it, I present the following statistics for a more in-depth understanding of the data:
variance
standard deviation
quantile distribution
var(citric.acid)
## [1] 0.03794748
sd(citric.acid)
## [1] 0.1948011
quantile(citric.acid, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 0.010 0.090 0.260 0.420 0.522
Question 5d. How can you visualize the distribution of each variable?
We can use boxplot and histogram to visualize the distribution.
par(mfrow = c(1, 2))
boxplot(citric.acid, main = "BoxPlot")
hist(citric.acid, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(citric.acid), lwd = 2, col = "blue")
Question 5e. Do you see any skewed distributions?
Yes, the distribution is skewed to the right, as is evident from the boxplot, which shows a tail of points above the upper whisker.
summary(residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Question 5b. Any outliers? Do you have any concerns about the data quality?
For residual.sugar, the mean (2.539) is noticeably above the median (2.200). There are clearly outliers: 75% of the values fall below 2.600 and the upper fence under the 1.5 × IQR rule is 2.6 + 1.5 × 0.7 = 3.65, whereas the maximum is 15.500. A maximum this far from the bulk of the data raises a data quality concern and should be checked.
Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?
The summary() function gives a concise overview of each variable: the minimum, first quartile, median, mean, third quartile, and maximum. Alongside it, I present the following statistics for a more in-depth understanding of the data:
variance
standard deviation
quantile distribution
var(residual.sugar)
## [1] 1.987897
sd(residual.sugar)
## [1] 1.409928
quantile(residual.sugar, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 1.7 1.9 2.2 2.6 3.6
Question 5d. How can you visualize the distribution of each variable?
We can use boxplot and histogram to visualize the distribution.
par(mfrow = c(1, 2))
boxplot(residual.sugar, main = "BoxPlot")
hist(residual.sugar, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(residual.sugar), lwd = 2, col = "blue")
Question 5e. Do you see any skewed distributions?
Yes, the distribution is skewed to the right, as is evident from the boxplot, which shows a long tail of points above the upper whisker.
summary(chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Question 5b. Any outliers? Do you have any concerns about the data quality?
For chlorides, the mean (0.08747) is somewhat above the median (0.07900). There are clearly outliers: 75% of the values fall below 0.09000 and the upper fence under the 1.5 × IQR rule is 0.09 + 1.5 × 0.02 = 0.12, whereas the maximum is 0.61100. The minimum of 0.01200 also falls below the lower fence of 0.04. A maximum this far from the bulk of the data raises a data quality concern and should be checked.
Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?
The summary() function gives a concise overview of each variable: the minimum, first quartile, median, mean, third quartile, and maximum. Alongside it, I present the following statistics for a more in-depth understanding of the data:
variance
standard deviation
quantile distribution
var(chlorides)
## [1] 0.002215143
sd(chlorides)
## [1] 0.0470653
quantile(chlorides, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 0.060 0.070 0.079 0.090 0.109
Question 5d. How can you visualize the distribution of each variable?
We can use boxplot and histogram to visualize the distribution.
par(mfrow = c(1, 2))
boxplot(chlorides, main = "BoxPlot")
hist(chlorides, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(chlorides), lwd = 2, col = "blue")
Question 5e. Do you see any skewed distributions?
Yes, the distribution is skewed to the right, as is evident from the boxplot, which shows a long tail of points above the upper whisker.
summary(free.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Question 5b. Any outliers? Do you have any concerns about the data quality?
For free.sulfur.dioxide, the mean (15.87) is somewhat above the median (14.00). There are clearly outliers: 75% of the values fall below 21.00 and the upper fence under the 1.5 × IQR rule is 21 + 1.5 × 14 = 42, whereas the maximum is 72.00. A maximum this far from the bulk of the data raises a data quality concern and should be checked. The minimum of 1.00 is low relative to the rest of the data, but it is not an outlier by the same rule, since the lower fence is negative.
Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?
The summary() function gives a concise overview of each variable: the minimum, first quartile, median, mean, third quartile, and maximum. Alongside it, I present the following statistics for a more in-depth understanding of the data:
variance
standard deviation
quantile distribution
var(free.sulfur.dioxide)
## [1] 109.4149
sd(free.sulfur.dioxide)
## [1] 10.46016
quantile(free.sulfur.dioxide, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 5 7 14 21 31
Question 5d. How can you visualize the distribution of each variable?
We can use boxplot and histogram to visualize the distribution.
par(mfrow = c(1, 2))
boxplot(free.sulfur.dioxide, main = "BoxPlot")
hist(free.sulfur.dioxide, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(free.sulfur.dioxide), lwd = 2, col = "blue")
Question 5e. Do you see any skewed distributions?
Yes, the distribution is skewed to the right, as is evident from the boxplot, which shows a long tail of points above the upper whisker.
summary(total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Question 5b. Any outliers? Do you have any concerns about the data quality?
For total.sulfur.dioxide, the mean (46.47) is well above the median (38.00). There are clearly outliers: 75% of the values fall below 62.00 and the upper fence under the 1.5 × IQR rule is 62 + 1.5 × 40 = 122, whereas the maximum is 289.00. A maximum this far from the bulk of the data raises a data quality concern and should be checked.
Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?
The summary() function gives a concise overview of each variable: the minimum, first quartile, median, mean, third quartile, and maximum. Alongside it, I present the following statistics for a more in-depth understanding of the data:
variance
standard deviation
quantile distribution
var(total.sulfur.dioxide)
## [1] 1082.102
sd(total.sulfur.dioxide)
## [1] 32.89532
quantile(total.sulfur.dioxide, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 14.0 22.0 38.0 62.0 93.2
Question 5d. How can you visualize the distribution of each variable?
We can use boxplot and histogram to visualize the distribution.
par(mfrow = c(1, 2))
boxplot(total.sulfur.dioxide, main = "BoxPlot")
hist(total.sulfur.dioxide, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(total.sulfur.dioxide), lwd = 2, col = "blue")
Question 5e. Do you see any skewed distributions?
Yes, the distribution is skewed to the right, as is evident from the boxplot, which shows a long tail of points above the upper whisker.
summary(density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Question 5b. Any outliers? Do you have any concerns about the data quality?
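For density, the mean (0.9967) and median (0.9968) are almost equal, so the distribution looks symmetric. This does not rule out outliers, however: under the 1.5 × IQR rule the fences are roughly 0.9923 and 1.0011, and both the minimum (0.9901) and the maximum (1.0037) fall outside them. The overall range is narrow, so the summary gives no cause for concern about data quality.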
Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?
The summary() function gives a concise overview of each variable: the minimum, first quartile, median, mean, third quartile, and maximum. Alongside it, I present the following statistics for a more in-depth understanding of the data:
variance
standard deviation
quantile distribution
var(density)
## [1] 3.562029e-06
sd(density)
## [1] 0.001887334
quantile(density, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 0.994556 0.995600 0.996750 0.997835 0.999140
Question 5d. How can you visualize the distribution of each variable?
We can use boxplot and histogram to visualize the distribution.
par(mfrow = c(1, 2))
boxplot(density, main = "BoxPlot")
hist(density, freq = FALSE, main = "Histogram", col = "lightblue" )
# stats::density() is the kernel density function; the inner density is the data column
lines(stats::density(density), lwd = 2, col = "blue")
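In the call above, the name density refers both to R's kernel density function and to a column of the dataset. The call still works because R skips non-function objects when it looks up a name in function-call position. The sketch below illustrates this on a made-up data frame (toy is a hypothetical stand-in) and shows a more explicit spelling.

```r
# Toy data frame with a column named like the stats::density() function
toy <- data.frame(density = c(0.9950, 0.9962, 0.9968, 0.9971, 0.9980))

# Inside with(), the column `density` is visible, yet density(...) still
# resolves to the function because R looks for a function in call position.
kde_implicit <- with(toy, density(density))

# A more explicit spelling that avoids relying on this lookup rule
kde_explicit <- stats::density(toy$density)

print(class(kde_implicit))
```

Both calls produce the same estimate; the explicit form is simply easier to read.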
Question 5e. Do you see any skewed distributions?
No, the distribution does not appear skewed; it looks approximately normal.
summary(pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Question 5b. Any outliers? Do you have any concerns about the data quality?
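For pH, the mean (3.311) and median (3.310) are almost equal, so the distribution looks symmetric. This does not rule out outliers, however: under the 1.5 × IQR rule the fences are 3.21 - 1.5 × 0.19 = 2.925 and 3.40 + 1.5 × 0.19 = 3.685, and both the minimum (2.740) and the maximum (4.010) fall outside them. The summary gives no cause for concern about data quality.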
Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?
The summary() function gives a concise overview of each variable: the minimum, first quartile, median, mean, third quartile, and maximum. Alongside it, I present the following statistics for a more in-depth understanding of the data:
variance
standard deviation
quantile distribution
var(pH)
## [1] 0.02383518
sd(pH)
## [1] 0.1543865
quantile(pH, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 3.12 3.21 3.31 3.40 3.51
Question 5d. How can you visualize the distribution of each variable?
We can use boxplot and histogram to visualize the distribution.
par(mfrow = c(1, 2))
boxplot(pH, main = "BoxPlot")
hist(pH, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(pH), lwd = 2, col = "blue")
Question 5e. Do you see any skewed distributions?
Yes, the data seem slightly skewed to the right, as is evident from the boxplot, where slightly more points lie beyond the upper whisker than beyond the lower one.
summary(sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Question 5b. Any outliers? Do you have any concerns about the data quality?
For sulphates, the mean (0.6581) is somewhat above the median (0.6200). There are clearly outliers: 75% of the values fall below 0.7300 and the upper fence under the 1.5 × IQR rule is 0.73 + 1.5 × 0.18 = 1.00, whereas the maximum is 2.0000. A maximum this far from the bulk of the data raises a data quality concern and should be checked.
Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?
The summary() function gives a concise overview of each variable: the minimum, first quartile, median, mean, third quartile, and maximum. Alongside it, I present the following statistics for a more in-depth understanding of the data:
variance
standard deviation
quantile distribution
var(sulphates)
## [1] 0.02873262
sd(sulphates)
## [1] 0.169507
quantile(sulphates, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 0.50 0.55 0.62 0.73 0.85
Question 5d. How can you visualize the distribution of each variable?
We can use boxplot and histogram to visualize the distribution.
par(mfrow = c(1, 2))
boxplot(sulphates, main = "BoxPlot")
hist(sulphates, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(sulphates), lwd = 2, col = "blue")
Question 5e. Do you see any skewed distributions?
Yes, the distribution is skewed to the right, as is evident from the boxplot, which shows a long tail of points above the upper whisker.
summary(alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Question 5b. Any outliers? Do you have any concerns about the data quality?
For alcohol, the mean (10.42) is somewhat above the median (10.20), which hints at some high values. 75% of the values fall below 11.10 and the upper fence under the 1.5 × IQR rule is 11.1 + 1.5 × 1.6 = 13.5, whereas the maximum is 14.90, so there are some high outliers.
Question 5c. How can you summarize the data of each variable in a concise way? What statistics are you going to present?
The summary() function gives a concise overview of each variable: the minimum, first quartile, median, mean, third quartile, and maximum. Alongside it, I present the following statistics for a more in-depth understanding of the data:
variance
standard deviation
quantile distribution
var(alcohol)
## [1] 1.135647
sd(alcohol)
## [1] 1.065668
quantile(alcohol, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 9.3 9.5 10.2 11.1 12.0
Question 5d. How can you visualize the distribution of each variable?
We can use boxplot and histogram to visualize the distribution.
par(mfrow = c(1, 2))
boxplot(alcohol, main = "BoxPlot")
hist(alcohol, freq = FALSE, main = "Histogram", col = "lightblue" )
lines(density(alcohol), lwd = 2, col = "blue")
Question 5e. Do you see any skewed distributions?
Yes, the distribution is skewed to the right, as is evident from the boxplot, which shows a tail of points above the upper whisker.
summary(quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
hist(quality)
The quality column is the target variable, which we aim to predict from the other variables.
The histogram shows that the most frequent quality score is 5, while the summary above gives a median of 6 and a maximum of 8.
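Because quality takes only a few integer values, a frequency table reads the most common score off directly and keeps it distinct from the maximum score. The sketch below uses a made-up vector (toy_quality is hypothetical, not the real column).

```r
# Toy quality scores for illustration only (not the wine data)
toy_quality <- c(5L, 5L, 6L, 5L, 7L, 6L, 3L, 8L)

# Count how often each score occurs
counts <- table(toy_quality)
print(counts)

# Most frequent score (the mode) versus the largest score observed
most_common <- as.integer(names(counts)[which.max(counts)])
highest     <- max(toy_quality)
cat("Most frequent score:", most_common, " Highest score:", highest, "\n")

# A bar plot of the counts shows one bar per score
barplot(counts, xlab = "quality", ylab = "count", col = "lightblue")
```

On the real data, table(wine_red$quality) gives the count for each score.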