A. Import the red wine data set into your R system.
library("readxl")
red_wine <- read_excel("C:/Users/Adam Deuber/OneDrive/UC/BANA Masters/Statistical Methods/Homework/Wine Project/winequality-red.xlsx")
#What is the sample size?
str(red_wine)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1599 obs. of 12 variables:
## $ fixed acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free sulfur dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total sulfur dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : num 5 5 5 6 5 5 5 7 7 5 ...
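The str() output above reports 1599 observations of 12 variables, so the sample size is 1599. As a minimal sketch, the same numbers can be read directly with nrow(), ncol(), or dim(). The toy_wine data frame below is a stand-in built from the first five alcohol and quality values shown by str(), so the block runs without the Excel file.

```r
# Stand-in for red_wine: first five alcohol and quality values from str() above
toy_wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality = c(5, 5, 5, 6, 5)
)

n_obs  <- nrow(toy_wine)  # number of observations (the sample size)
n_vars <- ncol(toy_wine)  # number of variables
dims   <- dim(toy_wine)   # both at once: c(rows, columns)
print(dims)
```

Applied to the real data, nrow(red_wine) should agree with the 1599 reported by str().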
B. Any outliers? Do you have any concerns about the data quality?
summary(red_wine)
## fixed acidity volatile acidity citric acid residual sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free sulfur dioxide total sulfur dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.88 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
attach(red_wine)
Total sulfur dioxide has the largest gap between mean (46.47) and median (38.00) of any variable, so I investigate it first.
sulfur_sorted <- sort(`total sulfur dioxide`)
mean(sulfur_sorted)
## [1] 46.46842
median(sulfur_sorted)
## [1] 38
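The comparison behind that choice can be sketched from the summary(red_wine) output alone. The block below copies the printed means and medians into two named vectors (the vector names are my own) and ranks the variables by the absolute gap. Note that the raw gap depends on each variable's units, so this is only a rough screen.

```r
# Means and medians copied from the summary(red_wine) output above
means <- c("fixed acidity" = 8.32, "volatile acidity" = 0.5278,
           "citric acid" = 0.271, "residual sugar" = 2.539,
           "chlorides" = 0.08747, "free sulfur dioxide" = 15.88,
           "total sulfur dioxide" = 46.47, "density" = 0.9967,
           "pH" = 3.311, "sulphates" = 0.6581,
           "alcohol" = 10.42, "quality" = 5.636)
medians <- c("fixed acidity" = 7.90, "volatile acidity" = 0.52,
             "citric acid" = 0.26, "residual sugar" = 2.2,
             "chlorides" = 0.079, "free sulfur dioxide" = 14,
             "total sulfur dioxide" = 38, "density" = 0.9968,
             "pH" = 3.31, "sulphates" = 0.62,
             "alcohol" = 10.2, "quality" = 6)

gap <- means - medians                 # positive gap: mean pulled above the median
print(sort(abs(gap), decreasing = TRUE))

largest <- names(which.max(abs(gap)))  # variable with the largest absolute gap
print(largest)
```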
boxplot(sulfur_sorted)
Many outliers in total sulfur dioxide
boxplot(`fixed acidity`)
Many outliers in fixed acidity
boxplot(`volatile acidity`)
Many outliers in volatile acidity
boxplot(`citric acid`)
One outlier in citric acid
boxplot(`residual sugar`)
Many outliers in residual sugar
boxplot(chlorides)
Many outliers in chlorides
boxplot(`free sulfur dioxide`)
Many outliers in free sulfur dioxide
boxplot(`total sulfur dioxide`)
Many outliers in total sulfur dioxide
boxplot(density)
Many outliers in density
boxplot(`pH`)
Many outliers in pH
boxplot(sulphates)
Many outliers in sulphates
boxplot(alcohol)
Four outliers in alcohol
boxplot(quality)
Two outliers in quality. However, quality is a bounded score on a 1-10 scale (observed values run only from 3 to 8), so these points are not far from the rest of the data.
I am not too concerned about the data quality, since some outliers are normal in a large dataset. Further investigation before analysis would be useful to identify specifically which outliers are significant.
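One possible way to start that investigation is to count the points each boxplot flags, rather than reading them off the plots. The sketch below uses boxplot.stats() from base R, which returns the flagged points in its out component (points more than 1.5 times the interquartile range beyond the hinges). The helper name count_outliers and the toy data are hypothetical.

```r
# Count the points a boxplot would draw as outliers
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Toy data frame with made-up values, standing in for red_wine
toy_wine <- data.frame(
  a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100),  # 100 lies far above the upper fence
  b = c(5, 5, 6, 6, 5, 6, 5, 6, 5, 6)     # no extreme values
)

outlier_counts <- sapply(toy_wine, count_outliers)
print(outlier_counts)
```

On the real data, sapply(red_wine, count_outliers) would give one count per variable, which could replace descriptions such as "many outliers" with actual numbers.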
C. How can you summarize the data of each variable in a concise way? What statistics are you going to present?
summary(red_wine)
## fixed acidity volatile acidity citric acid residual sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free sulfur dioxide total sulfur dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.88 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
summary(`fixed acidity`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
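The summary function is a concise way to summarize the data: for the whole dataset or for an individual variable, it reports the minimum, first quartile, median, mean, third quartile, and maximum.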
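summary() does not report spread measures such as the standard deviation. As a sketch of one possible complement, the block below builds a table with one row per variable. The helper name describe is hypothetical, and toy_wine reuses the first five alcohol and quality values printed by str() in part A.

```r
# Statistics to report for a single numeric variable
describe <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x),
    median = median(x), IQR = IQR(x), min = min(x), max = max(x))
}

# Stand-in for red_wine: first five values from the str() output
toy_wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality = c(5, 5, 5, 6, 5)
)

# sapply() returns one column per variable; t() turns that into one row per variable
stats_table <- t(sapply(toy_wine, describe))
print(round(stats_table, 3))
```

With the real data, t(sapply(red_wine, describe)) should give a 12-row table.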
D. How can you visualize the distribution of each variable?
hist(`fixed acidity`)
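hist() and boxplot() are two good functions for visualizing a distribution: a histogram shows its shape, and a boxplot shows the median, quartiles, and outliers. Part B shows boxplot() for every variable, and hist() is called the same way.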
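To view several variables at once instead of one plot per call, the histograms can be drawn in a grid. This is a minimal sketch using par(mfrow = ...) and a loop over column names. toy_wine is a stand-in built from the first ten fixed acidity and pH values printed by str() in part A, and the object name hists is my own.

```r
# Stand-in for red_wine: first ten values from the str() output
toy_wine <- data.frame(
  fixed_acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

op <- par(mfrow = c(1, 2))  # 1 row, 2 columns of plots; keep the old settings
hists <- lapply(names(toy_wine), function(v) {
  hist(toy_wine[[v]], main = v, xlab = v)  # one histogram per variable
})
par(op)                     # restore the previous plotting layout
```

For the twelve variables in red_wine, a layout such as par(mfrow = c(3, 4)) would fit them all on one page.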
E. Do you see any skewed distributions?
hist(sulfur_sorted)
Skewed to the right
hist(`fixed acidity`)
Skewed to the right
hist(`volatile acidity`)
Skewed to the right
hist(`citric acid`)
Skewed to the right
hist(`residual sugar`)
Skewed to the right
hist(chlorides)
Skewed to the right
hist(`free sulfur dioxide`)
Skewed to the right
hist(`total sulfur dioxide`)
Skewed to the right
hist(density)
Approximately normal distribution
hist(`pH`)
Approximately normal distribution
hist(sulphates)
Skewed to the right
hist(alcohol)
Skewed to the right
hist(quality)
Approximately normal distribution (quality takes only integer values)
Nine of the twelve variables are skewed to the right; only density, pH, and quality look approximately normal. We can conclude that the dataset as a whole leans toward right skew.
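Reading skewness off a histogram is subjective, so a numeric check may help. The sketch below computes the moment-based sample skewness, the third central moment divided by the second central moment raised to the power 1.5, in base R. The helper name sample_skewness and the two example vectors are hypothetical. Positive values indicate a longer right tail, and values near zero indicate rough symmetry.

```r
# Moment-based sample skewness: m3 / m2^(3/2)
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

right_tail <- c(1, 1, 2, 2, 3, 10)  # made-up values with one large observation
symmetric  <- c(1, 2, 3, 4, 5)      # made-up values, symmetric around 3

skew_right <- sample_skewness(right_tail)
skew_sym   <- sample_skewness(symmetric)
print(c(right_tail = skew_right, symmetric = skew_sym))
```

On the real data, sapply(red_wine, sample_skewness) would return one value per variable to compare against the histogram-based labels above.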