A. Import the red wine data set into your R system.

library("readxl")
red_wine <- read_excel("C:/Users/Adam Deuber/OneDrive/UC/BANA Masters/Statistical Methods/Homework/Wine Project/winequality-red.xlsx")

#What is the sample size?
str(red_wine)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1599 obs. of  12 variables:
##  $ fixed acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free sulfur dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total sulfur dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : num  5 5 5 6 5 5 5 7 7 5 ...
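
The str() output above reports 1599 observations of 12 variables, so the sample size is 1599. As a minimal sketch, the same numbers can be read off directly with nrow(), ncol(), and dim(). The data frame wine_head below is a hypothetical stand-in built from the first 10 rows shown in the str() output, so that the snippet runs without the Excel file; on the real data the calls would be applied to red_wine.

```r
# Toy stand-in for red_wine (hypothetical name): the first 10 rows of three
# columns, copied from the str() output above. The real data has 1599 rows.
wine_head <- data.frame(
  `fixed acidity` = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH              = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality         = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  check.names = FALSE  # keep the space in "fixed acidity"
)

n_obs  <- nrow(wine_head)  # number of observations (the sample size)
n_vars <- ncol(wine_head)  # number of variables
dims   <- dim(wine_head)   # both at once: rows, then columns

print(n_obs)
print(n_vars)
```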

B. Any outliers? Do you have any concerns about the data quality?

summary(red_wine)
##  fixed acidity   volatile acidity  citric acid    residual sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free sulfur dioxide total sulfur dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.88       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000
attach(red_wine)

Total sulfur dioxide has the largest gap between mean (46.47) and median (38.00), so I investigate it first.
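
Rather than reading the mean-median gap off the summary table by eye, it can be computed for every column at once. The snippet below is a sketch on a hypothetical toy data frame (wine_head, the first 10 rows of three columns from the str() output); on the real data the same sapply() call would be applied to red_wine.

```r
# Toy stand-in for red_wine (hypothetical name), first 10 rows only.
wine_head <- data.frame(
  `fixed acidity`        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  `total sulfur dioxide` = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  pH                     = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  check.names = FALSE
)

# Mean minus median for each column; a positive value hints at a right tail.
gap <- sapply(wine_head, function(x) mean(x) - median(x))

# Rank columns by the absolute size of the gap, largest first.
gap_sorted <- sort(abs(gap), decreasing = TRUE)
largest    <- names(gap_sorted)[1]

print(gap_sorted)
print(largest)
```

Note that the raw gap depends on each variable's units, so a variable measured on a larger scale will tend to rank higher. Dividing the gap by the column's standard deviation or IQR would give a scale-free comparison.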

sulfur_sorted <- sort(`total sulfur dioxide`)
mean(sulfur_sorted)
## [1] 46.46842
median(sulfur_sorted)
## [1] 38
boxplot(sulfur_sorted)

Many outliers in total sulfur dioxide

boxplot(`fixed acidity`)

Many outliers in fixed acidity

boxplot(`volatile acidity`)

Many outliers in volatile acidity

boxplot(`citric acid`)

One outlier in citric acid

boxplot(`residual sugar`)

Many outliers in residual sugar

boxplot(chlorides)

Many outliers in chlorides

boxplot(`free sulfur dioxide`)

Many outliers in free sulfur dioxide

boxplot(`total sulfur dioxide`)

Many outliers in total sulfur dioxide

boxplot(density)

Many outliers in density

boxplot(`pH`)

Many outliers in pH

boxplot(sulphates)

Many outliers in sulphates

boxplot(alcohol)

Four outliers in alcohol

boxplot(quality)

Two outliers in quality. However, quality is a score on a narrow, bounded scale (the observed values range from 3 to 8), so these points are not far from the rest of the data.
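
The outlier counts above are read off the boxplots by eye. As a sketch, they can also be counted programmatically: boxplot.stats() returns the points that boxplot() would draw beyond the whiskers (by default, more than 1.5 times the box length from the hinges). The data frame wine_head and the helper count_outliers are hypothetical names for illustration; the toy rows come from the str() output.

```r
# Toy stand-in for red_wine (hypothetical name), first 10 rows only.
wine_head <- data.frame(
  `fixed acidity`  = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  `residual sugar` = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  check.names = FALSE
)

# Number of points flagged beyond the boxplot whiskers for one variable.
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Apply the helper to every column at once.
outlier_counts <- sapply(wine_head, count_outliers)
print(outlier_counts)
```

On the full data, sapply(red_wine, count_outliers) would give one count per variable, which is easier to compare than thirteen separate plots.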

I am not too concerned about the data quality, as some outliers are normal in a large dataset. Further investigation of the data before analysis would be useful to identify the significant outliers specifically.
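
Beyond outliers, two quick data-quality checks are missing values and repeated rows. The sketch below runs both on a hypothetical toy data frame (wine_head, the first 10 rows of four columns from the str() output). It says nothing about the full dataset, where the same calls would be applied to red_wine.

```r
# Toy stand-in for red_wine (hypothetical name), first 10 rows only.
wine_head <- data.frame(
  `fixed acidity`    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  `volatile acidity` = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH                 = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality            = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  check.names = FALSE
)

# Missing values per column (is.na() gives a logical matrix, colSums() counts TRUEs).
missing_per_column <- colSums(is.na(wine_head))

# Rows that exactly repeat an earlier row (only the columns present are compared).
n_duplicate_rows <- sum(duplicated(wine_head))

print(missing_per_column)
print(n_duplicate_rows)
```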

C. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

summary(red_wine)
##  fixed acidity   volatile acidity  citric acid    residual sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free sulfur dioxide total sulfur dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.88       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000
summary(`fixed acidity`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

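The summary() function is an effective way to do this, both for the whole dataset and for individual variables. For each variable it reports the minimum, first quartile, median, mean, third quartile, and maximum.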
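
summary() does not report a measure of spread such as the standard deviation. One possible way to get a compact table with one row per variable is sketched below. The helper summarise_column and the toy data frame wine_head (first 10 rows from the str() output) are hypothetical names for illustration.

```r
# Toy stand-in for red_wine (hypothetical name), first 10 rows only.
wine_head <- data.frame(
  `fixed acidity` = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH              = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol         = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  check.names = FALSE
)

# Six statistics for one numeric vector, returned as a named vector.
summarise_column <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x),
    min = min(x), median = median(x), max = max(x))
}

# sapply() gives statistics in rows; t() flips it to one row per variable.
summary_table <- t(sapply(wine_head, summarise_column))
print(round(summary_table, 3))
```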

D. How can you visualize the distribution of each variable?

hist(`fixed acidity`)

hist() and boxplot() are two useful functions for visualizing a distribution. See the code above for an example of each.
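
To see every variable at once instead of calling hist() one column at a time, the histograms can be drawn in a grid with par(mfrow = ...). This is a sketch on a hypothetical toy data frame (wine_head, first 10 rows from the str() output); with the 12 columns of red_wine the same grid rule gives a 3 by 4 layout.

```r
# Toy stand-in for red_wine (hypothetical name), first 10 rows only.
wine_head <- data.frame(
  `fixed acidity`  = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  `residual sugar` = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  check.names = FALSE
)

# Choose a near-square grid with enough panels for every column.
n_panels    <- ncol(wine_head)
n_cols_grid <- ceiling(sqrt(n_panels))
n_rows_grid <- ceiling(n_panels / n_cols_grid)

# Draw one histogram per column, then restore the previous plot settings.
old_par <- par(mfrow = c(n_rows_grid, n_cols_grid))
for (v in names(wine_head)) {
  hist(wine_head[[v]], main = v, xlab = v)
}
par(old_par)
```

Swapping hist() for boxplot() in the loop gives the corresponding grid of boxplots.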

E. Do you see any skewed distributions?

hist(sulfur_sorted)

Skewed to the right

hist(`fixed acidity`)

Skewed to the right

hist(`volatile acidity`)

Skewed to the right

hist(`citric acid`)

Skewed to the right

hist(`residual sugar`)

Skewed to the right

hist(chlorides)

Skewed to the right

hist(`free sulfur dioxide`)

Skewed to the right

hist(`total sulfur dioxide`)

Skewed to the right

hist(density)

Approximately normal distribution

hist(`pH`)

Approximately normal distribution

hist(sulphates)

Skewed to the right

hist(alcohol)

Skewed to the right

hist(quality)

Roughly symmetric and bell-shaped, although quality is a discrete score, so it is only approximately normal

Based on the individual histograms, 9 of the 12 variables are skewed to the right. Only density, pH, and quality look roughly symmetric, so the dataset as a whole is predominantly right-skewed.
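
The skewness judgments above come from reading histograms. As a sketch, they can be backed by a number: the moment-based skewness coefficient, the third central moment divided by the second central moment raised to the power 1.5. Positive values indicate a right skew, values near zero indicate rough symmetry. The helper sample_skewness and the toy data frame wine_head (first 10 rows from the str() output) are hypothetical names; base R has no built-in skewness function, so it is written out here.

```r
# Toy stand-in for red_wine (hypothetical name), first 10 rows only.
wine_head <- data.frame(
  `residual sugar` = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  check.names = FALSE
)

# Moment-based skewness: m3 / m2^(3/2), using population (divide-by-n) moments.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  centered <- x - mean(x)    # deviations from the mean
  mean(centered^3) / mean(centered^2)^1.5
}

# One skewness value per column.
skew <- sapply(wine_head, sample_skewness)
print(round(skew, 3))
```

With only 10 rows these values are illustrative. On the full data, sapply(red_wine, sample_skewness) would give one coefficient per variable to compare against the histograms.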