library("readxl")
red_wine <- read_excel("C:/Users/katie/OneDrive - University of Cincinnati/FS20/First Half/Statistical Methods (BANA 7051)/Wine Project/winequality-red.xlsx")
head(red_wine)
## # A tibble: 6 x 12
## `fixed acidity` `volatile acidi~ `citric acid` `residual sugar` chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## # ... with 7 more variables: `free sulfur dioxide` <dbl>, `total sulfur
## # dioxide` <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>, alcohol <dbl>,
## # quality <dbl>
summary(red_wine)
## fixed acidity volatile acidity citric acid residual sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free sulfur dioxide total sulfur dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.88 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
str(red_wine)
## tibble [1,599 x 12] (S3: tbl_df/tbl/data.frame)
## $ fixed acidity : num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile acidity : num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric acid : num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual sugar : num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free sulfur dioxide : num [1:1599] 11 25 15 17 11 13 15 15 9 17 ...
## $ total sulfur dioxide: num [1:1599] 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num [1:1599] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : num [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
attach(red_wine)
The sample size, n, is the number of records (observations) in the data set. As the str() output above shows, this data set has 1,599 observations of 12 variables, so the sample size is 1,599.
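The sample size can also be read off programmatically rather than from the printed output. The sketch below uses a small hypothetical data frame, toy_wine, as a stand-in for red_wine; its values are copied from the first few rows shown above, and the same calls should apply to the full data set.

```r
# Hypothetical stand-in for red_wine (the real data has 1,599 rows and 12 columns).
toy_wine <- data.frame(
  pH      = c(3.51, 3.20, 3.26, 3.16),
  alcohol = c(9.4, 9.8, 9.8, 9.8),
  quality = c(5, 5, 5, 6)
)

n <- nrow(toy_wine)  # number of observations (rows), i.e. the sample size
p <- ncol(toy_wine)  # number of variables (columns)
dim(toy_wine)        # both at once: rows first, then columns
```

On the actual data, nrow(red_wine) would be expected to return 1599, matching the str() output.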
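An outlier is a value that lies far from the other values in a data set. A box plot displays the first quartile (25th percentile), the median (50th percentile), and the third quartile (75th percentile). By default, R's boxplot() draws each whisker out to the most extreme data point that is no more than 1.5 times the interquartile range from the box, and data points beyond the whiskers are treated as outliers here. It also helps to compare the mean with the median in the summary() output: a mean well above the median suggests that a few large values are pulling the distribution to the right.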
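The whisker rule can be checked numerically with boxplot.stats(), which computes the same statistics that boxplot() draws. This is a minimal sketch on a hypothetical toy vector (toy_acidity is not part of the original analysis); the same call could be made on any column of red_wine.

```r
# Hypothetical toy vector standing in for a single numeric column.
toy_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 15.9)

# coef = 1.5 is the default and matches the whiskers drawn by boxplot().
bp <- boxplot.stats(toy_acidity, coef = 1.5)

bp$stats               # lower whisker, lower hinge, median, upper hinge, upper whisker
outliers <- bp$out     # points beyond the whiskers, in their original order
outliers
```

Here the hinges are 7.4 and 7.9, so the upper fence sits at 7.9 + 1.5 * 0.5 = 8.65, and the values 11.2 and 15.9 fall outside it.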
boxplot(red_wine)
boxplot(`fixed acidity`)
Fixed acidity appears to have many outliers.
boxplot(`volatile acidity`)
Volatile acidity appears to have many outliers.
boxplot(`citric acid`)
Citric acid appears to have one outlier.
boxplot(`residual sugar`)
Residual sugar appears to have a plethora of outliers.
boxplot(`chlorides`)
Chlorides appears to have many outliers.
boxplot(`free sulfur dioxide`)
Free sulfur dioxide appears to have outliers.
boxplot(`total sulfur dioxide`)
Total sulfur dioxide has a lot of outliers.
boxplot(`density`)
There appear to be many outliers for density.
boxplot(`pH`)
There appear to be outliers for pH.
boxplot(`sulphates`)
There appear to be many outliers for sulphates.
boxplot(`alcohol`)
There appear to be about three outliers for alcohol.
boxplot(`quality`)
There appear to be two outliers for quality.
I do not have any major concerns regarding the data quality. Without domain knowledge, it is hard to say what a “normal” range is for each variable, so the values beyond the whiskers cannot be dismissed as errors. The number of outliers also seems reasonable for a data set of this size.
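The visual impressions above can be backed by counts. The sketch below tallies missing values and boxplot-rule outliers per column; count_outliers() and toy_data are hypothetical names introduced for illustration, and sapply(red_wine, count_outliers) would be the analogous call on the real data.

```r
# Count the points that boxplot() would draw beyond the whiskers (1.5 * IQR rule).
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Hypothetical toy data: column `a` has one extreme value, column `b` has none.
toy_data <- data.frame(
  a = c(1, 2, 3, 4, 100),
  b = c(5, 6, 7, 8, 9)
)

missing_counts <- colSums(is.na(toy_data))        # missing values per column
outlier_counts <- sapply(toy_data, count_outliers)  # outliers per column

missing_counts
outlier_counts
```

Dividing the outlier counts by nrow() gives the share of flagged observations, which is easier to judge than a raw count when the data set is large.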
summary(red_wine)
## fixed acidity volatile acidity citric acid residual sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free sulfur dioxide total sulfur dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.88 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
Or, for a single variable:
summary(`fixed acidity`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The summary() function returns the minimum, first quartile, median, mean, third quartile, and maximum of the data for each variable.
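Comparing the mean and median from these statistics gives a quick hint about skewness. As an illustration, the sketch below uses the first ten residual sugar values printed in the str() output above; sugar_sample is a hypothetical name, and the comparison is only a rough indicator, not a formal test.

```r
# First ten residual sugar values, as shown in the str() output.
sugar_sample <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

summary(sugar_sample)  # min, quartiles, median, mean, max

sample_mean   <- mean(sugar_sample)
sample_median <- median(sugar_sample)

# A mean noticeably above the median suggests a longer right tail.
mean_above_median <- sample_mean > sample_median
mean_above_median
```

In this sample the single large value (6.1) pulls the mean to 2.33 while the median stays at 1.9.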
Box plots, such as the ones in the outlier discussion above, are one way to visualize the distributions. Histograms, drawn with the hist() function, show the shape of each distribution more directly; examples follow. For more advanced graphics, the ggplot2 package provides additional data visualization tools.
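For a quick overview, all histograms can also be drawn in one grid instead of one call per variable. The sketch below is one possible base R approach; plot_all_histograms() and toy_wine are hypothetical names introduced for illustration, and the function would be called on red_wine in the actual analysis.

```r
# Draw one histogram per numeric column, arranged in a grid.
plot_all_histograms <- function(df) {
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  n_cols <- ceiling(sqrt(length(numeric_cols)))
  n_rows <- ceiling(length(numeric_cols) / n_cols)

  old_par <- par(mfrow = c(n_rows, n_cols))  # split the device into panels
  on.exit(par(old_par))                      # restore the layout afterwards

  for (col in numeric_cols) {
    hist(df[[col]], main = col, xlab = col)
  }
  invisible(numeric_cols)  # return the plotted column names
}

# Hypothetical toy data; check.names = FALSE keeps the space in the column name.
toy_wine <- data.frame(
  `fixed acidity` = c(7.4, 7.8, 7.8, 11.2, 7.4),
  pH              = c(3.51, 3.20, 3.26, 3.16, 3.51),
  check.names     = FALSE
)

plotted <- plot_all_histograms(toy_wine)
```

Indexing columns with df[[col]] avoids relying on attach(), so the function works on any data frame passed to it. With twelve variables the panels are small, so a large plotting window may be needed.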
hist(`chlorides`)
hist(`free sulfur dioxide`)
hist(`fixed acidity`)
The distribution of fixed acidity appears to be skewed to the right.
hist(`volatile acidity`)
The distribution of volatile acidity is skewed to the right.
hist(`citric acid`)
The distribution of citric acid appears to be skewed to the right.
hist(`residual sugar`)
The distribution of residual sugar is skewed to the right.
hist(`chlorides`)
The distribution of chlorides is skewed to the right.
hist(`free sulfur dioxide`)
The distribution of free sulfur dioxide is skewed to the right.
hist(`total sulfur dioxide`)
The distribution of total sulfur dioxide is skewed to the right.
hist(`density`)
Density appears to be approximately normally distributed.
hist(`pH`)
pH appears to be approximately normally distributed, with at most a slight skew to the right.
hist(`sulphates`)
The distribution of sulphates appears to be skewed to the right.
hist(`alcohol`)
The distribution for alcohol is skewed to the right.
hist(`quality`)
The distribution of quality appears to be approximately normal, with a slight skew to the right.
As the histograms above show, most variables in the data set are skewed to the right; density, pH, and quality are the closest to a normal shape.
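The visual judgment of skewness can be supplemented with a numeric measure. The sketch below computes the moment-based sample skewness in base R; sample_skewness() is a hypothetical helper written for illustration (base R has no built-in skewness function), and the toy vectors are not from the wine data.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Hypothetical toy vectors for illustration.
right_skewed <- c(1, 1, 2, 2, 3, 10)  # one large value creates a long right tail
symmetric    <- c(1, 2, 3, 4, 5)      # evenly spread around the mean

skew_right <- sample_skewness(right_skewed)  # positive for a right-skewed sample
skew_sym   <- sample_skewness(symmetric)     # zero for a symmetric sample
```

Applied to the real data with sapply(red_wine, sample_skewness), positive values would indicate right skew and values near zero would indicate a roughly symmetric distribution.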