Import the red wine data set into your R system.

library("readxl")
red_wine <- read_excel("C:/Users/katie/OneDrive - University of Cincinnati/FS20/First Half/Statistical Methods (BANA 7051)/Wine Project/winequality-red.xlsx")

What is the sample size?

head(red_wine)
## # A tibble: 6 x 12
##   `fixed acidity` `volatile acidi~ `citric acid` `residual sugar` chlorides
##             <dbl>            <dbl>         <dbl>            <dbl>     <dbl>
## 1             7.4             0.7           0                 1.9     0.076
## 2             7.8             0.88          0                 2.6     0.098
## 3             7.8             0.76          0.04              2.3     0.092
## 4            11.2             0.28          0.56              1.9     0.075
## 5             7.4             0.7           0                 1.9     0.076
## 6             7.4             0.66          0                 1.8     0.075
## # ... with 7 more variables: `free sulfur dioxide` <dbl>, `total sulfur
## #   dioxide` <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>, alcohol <dbl>,
## #   quality <dbl>
summary(red_wine)
##  fixed acidity   volatile acidity  citric acid    residual sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free sulfur dioxide total sulfur dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.88       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000
str(red_wine)
## tibble [1,599 x 12] (S3: tbl_df/tbl/data.frame)
##  $ fixed acidity       : num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile acidity    : num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric acid         : num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual sugar      : num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free sulfur dioxide : num [1:1599] 11 25 15 17 11 13 15 15 9 17 ...
##  $ total sulfur dioxide: num [1:1599] 34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num [1:1599] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : num [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
attach(red_wine)

The sample size, n, is the number of observations (rows) in the data set. The str() output above reports 1,599 rows and 12 columns, so the sample size is n = 1,599.
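
The row count can also be read directly instead of from the str() header. The sketch below is a minimal illustration, not the full analysis: wine_demo is a hypothetical stand-in built from the first six values of two columns in the head() output, and its column names are simplified assumptions. On the real data the same calls would be applied to red_wine.

```r
# Illustrative subset only: first six values of two columns, copied from the
# head(red_wine) output above. The real data frame has 1,599 rows.
wine_demo <- data.frame(
  fixed_acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  chlorides     = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075)
)

n <- nrow(wine_demo)   # number of observations (rows), i.e. the sample size
p <- ncol(wine_demo)   # number of variables (columns)
dim(wine_demo)         # both at once: rows first, then columns
```

Calling nrow(red_wine) on the imported data should agree with the 1,599 shown by str().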

Any outliers? Do you have any concerns about the data quality?

An outlier is a value that lies far from the other values in a data set. A box plot draws the first quartile (25th percentile), the median (50th percentile), and the third quartile (75th percentile), with whiskers that by default extend to the most extreme points within 1.5 times the interquartile range of the box. Here, outliers are defined as data points located beyond the whiskers. It also helps to compare the mean and median from the summary() function: a mean well above the median suggests large values in the upper tail.
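
To put a number on "many" or "few" outliers, the whisker rule can be applied directly. This is a rough sketch under stated assumptions: iqr_outliers is a hypothetical helper written for this illustration, and the input is only the first ten fixed acidity values shown in the str() output, not the full column. boxplot.stats() is the base R function that boxplot() relies on; it uses hinges rather than quantile() quartiles, so the two approaches can differ slightly on other data.

```r
# First ten fixed acidity values, copied from the str(red_wine) output above.
fixed_acidity_demo <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# Hypothetical helper: return the values lying more than coef * IQR
# below the first quartile or above the third quartile.
iqr_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- coef * (q[2] - q[1])
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
}

flagged_manual  <- iqr_outliers(fixed_acidity_demo)
flagged_boxplot <- boxplot.stats(fixed_acidity_demo)$out

length(flagged_manual)   # number of flagged points in the demo vector
```

On the full data, something like sapply(red_wine, function(x) length(boxplot.stats(x)$out)) would give one count per variable.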

boxplot(red_wine)

boxplot(`fixed acidity`)

Fixed acidity appears to have many outliers.

boxplot(`volatile acidity`)

Volatile acidity appears to have many outliers.

boxplot(`citric acid`)

Citric acid appears to have one outlier.

boxplot(`residual sugar`) 

Residual sugar appears to have a large number of outliers.

boxplot(`chlorides`)

Chlorides appears to have many outliers.

boxplot(`free sulfur dioxide`)

Free sulfur dioxide appears to have outliers.

boxplot(`total sulfur dioxide`)

Total sulfur dioxide has a lot of outliers.

boxplot(`density`)

There appear to be many outliers for density.

boxplot(`pH`)

There appear to be outliers for pH.

boxplot(`sulphates`)

There appear to be many outliers for sulphates.

boxplot(`alcohol`)

There appear to be about three outliers for alcohol.

boxplot(`quality`)

There appear to be two outliers for quality.

I do not have any major concerns regarding the data quality, largely because it is hard to say what a “normal” range is for each variable without domain knowledge. The number of outliers also seems reasonable for a data set of this size.
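
Beyond outliers, a few quick checks can support a data quality judgment: missing values, duplicated rows, and values outside a physically plausible range. The sketch below is only an illustration on a hypothetical wine_demo data frame built from the first six pH and quality values shown above; the pH bounds of 0 and 14 are an assumption chosen for the example, and the results say nothing about the full data set.

```r
# Illustrative subset only: first six pH and quality values from the output above.
wine_demo <- data.frame(
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  quality = c(5, 5, 5, 6, 5, 5)
)

# Count missing values in each column
missing_per_column <- colSums(is.na(wine_demo))

# Count rows that exactly repeat an earlier row
duplicate_rows <- sum(duplicated(wine_demo))

# Check that every pH value falls in an assumed plausible range
ph_in_range <- all(wine_demo$pH > 0 & wine_demo$pH < 14)
```

The same three checks could be run on red_wine itself to back up the conclusion with numbers.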

How can you summarize the data of each variable in a concise way? What statistics are you going to present?

summary(red_wine)
##  fixed acidity   volatile acidity  citric acid    residual sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free sulfur dioxide total sulfur dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.88       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

Or, for a specific variable example:

summary(`fixed acidity`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The summary() function returns the minimum, first quartile, median, mean, third quartile, and maximum of the data for each variable.
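
The summary() function does not report any measure of spread other than the quartiles. A minimal sketch of how the standard deviation and interquartile range could be added is shown below; wine_demo and its column names are hypothetical, built from the first six values in the head() output, so the numbers are not the statistics of the full data set.

```r
# Illustrative subset only: first six values of two columns from head(red_wine).
wine_demo <- data.frame(
  fixed_acidity  = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  residual_sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8)
)

# One column of results per variable: standard deviation, IQR, and count
spread <- sapply(wine_demo, function(x) {
  c(sd = sd(x), IQR = IQR(x), n = length(x))
})

# Transpose so that each variable is a row, then round for display
round(t(spread), 3)
```

Replacing wine_demo with red_wine would give one row per wine variable.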

How can you visualize the distribution of each variable?

You can visualize the distributions with box plots, as in the outlier examples above. Histograms, created with the hist() function, show the shape of each distribution more directly; two examples follow. For more advanced graphics, you can also use the ggplot2 package.
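
Rather than calling hist() once per variable, the histograms can be drawn side by side in a loop. This is a sketch under assumptions: wine_demo is a hypothetical data frame holding only the first ten values of three columns from the str() output, and the one-row panel layout is just one possible choice.

```r
# Illustrative subset only: first ten values of three columns from str(red_wine).
wine_demo <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  total_sulfur_dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# One panel per variable, arranged in a single row
old_par <- par(mfrow = c(1, ncol(wine_demo)))

# hist() draws the plot and also returns the bin counts, which are kept here
hists <- lapply(names(wine_demo), function(name) {
  hist(wine_demo[[name]], main = name, xlab = name)
})

par(old_par)   # restore the previous plotting layout
```

For all 12 variables, a layout such as par(mfrow = c(3, 4)) with names(red_wine) would fit every histogram on one page.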

hist(`chlorides`)

hist(`free sulfur dioxide`)

Do you see any skewed distributions?

hist(`fixed acidity`)

The distribution of fixed acidity appears to be skewed to the right.

hist(`volatile acidity`)

The distribution of volatile acidity is skewed to the right.

hist(`citric acid`)

The distribution of citric acid appears to be skewed to the right.

hist(`residual sugar`) 

The distribution of residual sugar is skewed to the right.

hist(`chlorides`)

The distribution of chlorides is skewed to the right.

hist(`free sulfur dioxide`)

The distribution of free sulfur dioxide is skewed to the right.

hist(`total sulfur dioxide`)

The distribution of total sulfur dioxide is skewed to the right.

hist(`density`)

The distribution of density appears to be approximately normal.

hist(`pH`)

The distribution of pH appears to be approximately normal; if anything, it is slightly skewed to the right.

hist(`sulphates`)

The distribution of sulphates appears to be skewed to the right.

hist(`alcohol`)

The distribution for alcohol is skewed to the right.

hist(`quality`)

The distribution of quality appears to be approximately normal, with a slight skew to the right.

As the above histograms show, most variables in the data set are skewed to the right; density, pH, and quality are the closest to symmetric.
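
The visual impression can be cross-checked numerically. The sketch below compares the means and medians copied from the summary() output above, since a mean above the median is a common (though not conclusive) sign of right skew. It also defines sample_skewness, a hypothetical helper written for this illustration that computes the moment-based skewness coefficient, and applies it to the first ten total sulfur dioxide values from the str() output rather than the full column.

```r
# Means and medians copied from the summary(red_wine) output above.
wine_summary <- data.frame(
  variable = c("fixed acidity", "volatile acidity", "citric acid",
               "residual sugar", "chlorides", "free sulfur dioxide",
               "total sulfur dioxide", "density", "pH", "sulphates",
               "alcohol", "quality"),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.88,
             46.47, 0.9967, 3.311, 0.6581, 10.42, 5.636),
  median = c(7.90, 0.5200, 0.260, 2.200, 0.07900, 14.00,
             38.00, 0.9968, 3.310, 0.6200, 10.20, 6.000)
)

# Heuristic only: mean above median hints at a longer right tail
wine_summary$mean_above_median <- wine_summary$mean > wine_summary$median
not_flagged <- wine_summary$variable[!wine_summary$mean_above_median]

# Hypothetical helper: third central moment divided by variance^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# First ten total sulfur dioxide values from str(red_wine); illustration only
tsd_demo <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
tsd_skew <- sample_skewness(tsd_demo)   # positive values indicate right skew
```

Applying sample_skewness to each column with sapply(red_wine, sample_skewness) would give one coefficient per variable on the full data.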