Address the following questions in R Code:
wine <- read.csv(file = "winequality-red.csv", sep = ";", header = TRUE) # Load the dataset
knitr::kable(head(wine,100), caption = "Red Wine Dataset")
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.4 | 0.700 | 0.00 | 1.90 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.8 | 0.880 | 0.00 | 2.60 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 7.8 | 0.760 | 0.04 | 2.30 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 11.2 | 0.280 | 0.56 | 1.90 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 7.4 | 0.700 | 0.00 | 1.90 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.4 | 0.660 | 0.00 | 1.80 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.9 | 0.600 | 0.06 | 1.60 | 0.069 | 15 | 59 | 0.9964 | 3.30 | 0.46 | 9.4 | 5 |
| 7.3 | 0.650 | 0.00 | 1.20 | 0.065 | 15 | 21 | 0.9946 | 3.39 | 0.47 | 10.0 | 7 |
| 7.8 | 0.580 | 0.02 | 2.00 | 0.073 | 9 | 18 | 0.9968 | 3.36 | 0.57 | 9.5 | 7 |
| 7.5 | 0.500 | 0.36 | 6.10 | 0.071 | 17 | 102 | 0.9978 | 3.35 | 0.80 | 10.5 | 5 |
| 6.7 | 0.580 | 0.08 | 1.80 | 0.097 | 15 | 65 | 0.9959 | 3.28 | 0.54 | 9.2 | 5 |
| 7.5 | 0.500 | 0.36 | 6.10 | 0.071 | 17 | 102 | 0.9978 | 3.35 | 0.80 | 10.5 | 5 |
| 5.6 | 0.615 | 0.00 | 1.60 | 0.089 | 16 | 59 | 0.9943 | 3.58 | 0.52 | 9.9 | 5 |
| 7.8 | 0.610 | 0.29 | 1.60 | 0.114 | 9 | 29 | 0.9974 | 3.26 | 1.56 | 9.1 | 5 |
| 8.9 | 0.620 | 0.18 | 3.80 | 0.176 | 52 | 145 | 0.9986 | 3.16 | 0.88 | 9.2 | 5 |
| 8.9 | 0.620 | 0.19 | 3.90 | 0.170 | 51 | 148 | 0.9986 | 3.17 | 0.93 | 9.2 | 5 |
| 8.5 | 0.280 | 0.56 | 1.80 | 0.092 | 35 | 103 | 0.9969 | 3.30 | 0.75 | 10.5 | 7 |
| 8.1 | 0.560 | 0.28 | 1.70 | 0.368 | 16 | 56 | 0.9968 | 3.11 | 1.28 | 9.3 | 5 |
| 7.4 | 0.590 | 0.08 | 4.40 | 0.086 | 6 | 29 | 0.9974 | 3.38 | 0.50 | 9.0 | 4 |
| 7.9 | 0.320 | 0.51 | 1.80 | 0.341 | 17 | 56 | 0.9969 | 3.04 | 1.08 | 9.2 | 6 |
| 8.9 | 0.220 | 0.48 | 1.80 | 0.077 | 29 | 60 | 0.9968 | 3.39 | 0.53 | 9.4 | 6 |
| 7.6 | 0.390 | 0.31 | 2.30 | 0.082 | 23 | 71 | 0.9982 | 3.52 | 0.65 | 9.7 | 5 |
| 7.9 | 0.430 | 0.21 | 1.60 | 0.106 | 10 | 37 | 0.9966 | 3.17 | 0.91 | 9.5 | 5 |
| 8.5 | 0.490 | 0.11 | 2.30 | 0.084 | 9 | 67 | 0.9968 | 3.17 | 0.53 | 9.4 | 5 |
| 6.9 | 0.400 | 0.14 | 2.40 | 0.085 | 21 | 40 | 0.9968 | 3.43 | 0.63 | 9.7 | 6 |
| 6.3 | 0.390 | 0.16 | 1.40 | 0.080 | 11 | 23 | 0.9955 | 3.34 | 0.56 | 9.3 | 5 |
| 7.6 | 0.410 | 0.24 | 1.80 | 0.080 | 4 | 11 | 0.9962 | 3.28 | 0.59 | 9.5 | 5 |
| 7.9 | 0.430 | 0.21 | 1.60 | 0.106 | 10 | 37 | 0.9966 | 3.17 | 0.91 | 9.5 | 5 |
| 7.1 | 0.710 | 0.00 | 1.90 | 0.080 | 14 | 35 | 0.9972 | 3.47 | 0.55 | 9.4 | 5 |
| 7.8 | 0.645 | 0.00 | 2.00 | 0.082 | 8 | 16 | 0.9964 | 3.38 | 0.59 | 9.8 | 6 |
| 6.7 | 0.675 | 0.07 | 2.40 | 0.089 | 17 | 82 | 0.9958 | 3.35 | 0.54 | 10.1 | 5 |
| 6.9 | 0.685 | 0.00 | 2.50 | 0.105 | 22 | 37 | 0.9966 | 3.46 | 0.57 | 10.6 | 6 |
| 8.3 | 0.655 | 0.12 | 2.30 | 0.083 | 15 | 113 | 0.9966 | 3.17 | 0.66 | 9.8 | 5 |
| 6.9 | 0.605 | 0.12 | 10.70 | 0.073 | 40 | 83 | 0.9993 | 3.45 | 0.52 | 9.4 | 6 |
| 5.2 | 0.320 | 0.25 | 1.80 | 0.103 | 13 | 50 | 0.9957 | 3.38 | 0.55 | 9.2 | 5 |
| 7.8 | 0.645 | 0.00 | 5.50 | 0.086 | 5 | 18 | 0.9986 | 3.40 | 0.55 | 9.6 | 6 |
| 7.8 | 0.600 | 0.14 | 2.40 | 0.086 | 3 | 15 | 0.9975 | 3.42 | 0.60 | 10.8 | 6 |
| 8.1 | 0.380 | 0.28 | 2.10 | 0.066 | 13 | 30 | 0.9968 | 3.23 | 0.73 | 9.7 | 7 |
| 5.7 | 1.130 | 0.09 | 1.50 | 0.172 | 7 | 19 | 0.9940 | 3.50 | 0.48 | 9.8 | 4 |
| 7.3 | 0.450 | 0.36 | 5.90 | 0.074 | 12 | 87 | 0.9978 | 3.33 | 0.83 | 10.5 | 5 |
| 7.3 | 0.450 | 0.36 | 5.90 | 0.074 | 12 | 87 | 0.9978 | 3.33 | 0.83 | 10.5 | 5 |
| 8.8 | 0.610 | 0.30 | 2.80 | 0.088 | 17 | 46 | 0.9976 | 3.26 | 0.51 | 9.3 | 4 |
| 7.5 | 0.490 | 0.20 | 2.60 | 0.332 | 8 | 14 | 0.9968 | 3.21 | 0.90 | 10.5 | 6 |
| 8.1 | 0.660 | 0.22 | 2.20 | 0.069 | 9 | 23 | 0.9968 | 3.30 | 1.20 | 10.3 | 5 |
| 6.8 | 0.670 | 0.02 | 1.80 | 0.050 | 5 | 11 | 0.9962 | 3.48 | 0.52 | 9.5 | 5 |
| 4.6 | 0.520 | 0.15 | 2.10 | 0.054 | 8 | 65 | 0.9934 | 3.90 | 0.56 | 13.1 | 4 |
| 7.7 | 0.935 | 0.43 | 2.20 | 0.114 | 22 | 114 | 0.9970 | 3.25 | 0.73 | 9.2 | 5 |
| 8.7 | 0.290 | 0.52 | 1.60 | 0.113 | 12 | 37 | 0.9969 | 3.25 | 0.58 | 9.5 | 5 |
| 6.4 | 0.400 | 0.23 | 1.60 | 0.066 | 5 | 12 | 0.9958 | 3.34 | 0.56 | 9.2 | 5 |
| 5.6 | 0.310 | 0.37 | 1.40 | 0.074 | 12 | 96 | 0.9954 | 3.32 | 0.58 | 9.2 | 5 |
| 8.8 | 0.660 | 0.26 | 1.70 | 0.074 | 4 | 23 | 0.9971 | 3.15 | 0.74 | 9.2 | 5 |
| 6.6 | 0.520 | 0.04 | 2.20 | 0.069 | 8 | 15 | 0.9956 | 3.40 | 0.63 | 9.4 | 6 |
| 6.6 | 0.500 | 0.04 | 2.10 | 0.068 | 6 | 14 | 0.9955 | 3.39 | 0.64 | 9.4 | 6 |
| 8.6 | 0.380 | 0.36 | 3.00 | 0.081 | 30 | 119 | 0.9970 | 3.20 | 0.56 | 9.4 | 5 |
| 7.6 | 0.510 | 0.15 | 2.80 | 0.110 | 33 | 73 | 0.9955 | 3.17 | 0.63 | 10.2 | 6 |
| 7.7 | 0.620 | 0.04 | 3.80 | 0.084 | 25 | 45 | 0.9978 | 3.34 | 0.53 | 9.5 | 5 |
| 10.2 | 0.420 | 0.57 | 3.40 | 0.070 | 4 | 10 | 0.9971 | 3.04 | 0.63 | 9.6 | 5 |
| 7.5 | 0.630 | 0.12 | 5.10 | 0.111 | 50 | 110 | 0.9983 | 3.26 | 0.77 | 9.4 | 5 |
| 7.8 | 0.590 | 0.18 | 2.30 | 0.076 | 17 | 54 | 0.9975 | 3.43 | 0.59 | 10.0 | 5 |
| 7.3 | 0.390 | 0.31 | 2.40 | 0.074 | 9 | 46 | 0.9962 | 3.41 | 0.54 | 9.4 | 6 |
| 8.8 | 0.400 | 0.40 | 2.20 | 0.079 | 19 | 52 | 0.9980 | 3.44 | 0.64 | 9.2 | 5 |
| 7.7 | 0.690 | 0.49 | 1.80 | 0.115 | 20 | 112 | 0.9968 | 3.21 | 0.71 | 9.3 | 5 |
| 7.5 | 0.520 | 0.16 | 1.90 | 0.085 | 12 | 35 | 0.9968 | 3.38 | 0.62 | 9.5 | 7 |
| 7.0 | 0.735 | 0.05 | 2.00 | 0.081 | 13 | 54 | 0.9966 | 3.39 | 0.57 | 9.8 | 5 |
| 7.2 | 0.725 | 0.05 | 4.65 | 0.086 | 4 | 11 | 0.9962 | 3.41 | 0.39 | 10.9 | 5 |
| 7.2 | 0.725 | 0.05 | 4.65 | 0.086 | 4 | 11 | 0.9962 | 3.41 | 0.39 | 10.9 | 5 |
| 7.5 | 0.520 | 0.11 | 1.50 | 0.079 | 11 | 39 | 0.9968 | 3.42 | 0.58 | 9.6 | 5 |
| 6.6 | 0.705 | 0.07 | 1.60 | 0.076 | 6 | 15 | 0.9962 | 3.44 | 0.58 | 10.7 | 5 |
| 9.3 | 0.320 | 0.57 | 2.00 | 0.074 | 27 | 65 | 0.9969 | 3.28 | 0.79 | 10.7 | 5 |
| 8.0 | 0.705 | 0.05 | 1.90 | 0.074 | 8 | 19 | 0.9962 | 3.34 | 0.95 | 10.5 | 6 |
| 7.7 | 0.630 | 0.08 | 1.90 | 0.076 | 15 | 27 | 0.9967 | 3.32 | 0.54 | 9.5 | 6 |
| 7.7 | 0.670 | 0.23 | 2.10 | 0.088 | 17 | 96 | 0.9962 | 3.32 | 0.48 | 9.5 | 5 |
| 7.7 | 0.690 | 0.22 | 1.90 | 0.084 | 18 | 94 | 0.9961 | 3.31 | 0.48 | 9.5 | 5 |
| 8.3 | 0.675 | 0.26 | 2.10 | 0.084 | 11 | 43 | 0.9976 | 3.31 | 0.53 | 9.2 | 4 |
| 9.7 | 0.320 | 0.54 | 2.50 | 0.094 | 28 | 83 | 0.9984 | 3.28 | 0.82 | 9.6 | 5 |
| 8.8 | 0.410 | 0.64 | 2.20 | 0.093 | 9 | 42 | 0.9986 | 3.54 | 0.66 | 10.5 | 5 |
| 8.8 | 0.410 | 0.64 | 2.20 | 0.093 | 9 | 42 | 0.9986 | 3.54 | 0.66 | 10.5 | 5 |
| 6.8 | 0.785 | 0.00 | 2.40 | 0.104 | 14 | 30 | 0.9966 | 3.52 | 0.55 | 10.7 | 6 |
| 6.7 | 0.750 | 0.12 | 2.00 | 0.086 | 12 | 80 | 0.9958 | 3.38 | 0.52 | 10.1 | 5 |
| 8.3 | 0.625 | 0.20 | 1.50 | 0.080 | 27 | 119 | 0.9972 | 3.16 | 1.12 | 9.1 | 4 |
| 6.2 | 0.450 | 0.20 | 1.60 | 0.069 | 3 | 15 | 0.9958 | 3.41 | 0.56 | 9.2 | 5 |
| 7.8 | 0.430 | 0.70 | 1.90 | 0.464 | 22 | 67 | 0.9974 | 3.13 | 1.28 | 9.4 | 5 |
| 7.4 | 0.500 | 0.47 | 2.00 | 0.086 | 21 | 73 | 0.9970 | 3.36 | 0.57 | 9.1 | 5 |
| 7.3 | 0.670 | 0.26 | 1.80 | 0.401 | 16 | 51 | 0.9969 | 3.16 | 1.14 | 9.4 | 5 |
| 6.3 | 0.300 | 0.48 | 1.80 | 0.069 | 18 | 61 | 0.9959 | 3.44 | 0.78 | 10.3 | 6 |
| 6.9 | 0.550 | 0.15 | 2.20 | 0.076 | 19 | 40 | 0.9961 | 3.41 | 0.59 | 10.1 | 5 |
| 8.6 | 0.490 | 0.28 | 1.90 | 0.110 | 20 | 136 | 0.9972 | 2.93 | 1.95 | 9.9 | 6 |
| 7.7 | 0.490 | 0.26 | 1.90 | 0.062 | 9 | 31 | 0.9966 | 3.39 | 0.64 | 9.6 | 5 |
| 9.3 | 0.390 | 0.44 | 2.10 | 0.107 | 34 | 125 | 0.9978 | 3.14 | 1.22 | 9.5 | 5 |
| 7.0 | 0.620 | 0.08 | 1.80 | 0.076 | 8 | 24 | 0.9978 | 3.48 | 0.53 | 9.0 | 5 |
| 7.9 | 0.520 | 0.26 | 1.90 | 0.079 | 42 | 140 | 0.9964 | 3.23 | 0.54 | 9.5 | 5 |
| 8.6 | 0.490 | 0.28 | 1.90 | 0.110 | 20 | 136 | 0.9972 | 2.93 | 1.95 | 9.9 | 6 |
| 8.6 | 0.490 | 0.29 | 2.00 | 0.110 | 19 | 133 | 0.9972 | 2.93 | 1.98 | 9.8 | 5 |
| 7.7 | 0.490 | 0.26 | 1.90 | 0.062 | 9 | 31 | 0.9966 | 3.39 | 0.64 | 9.6 | 5 |
| 5.0 | 1.020 | 0.04 | 1.40 | 0.045 | 41 | 85 | 0.9938 | 3.75 | 0.48 | 10.5 | 4 |
| 4.7 | 0.600 | 0.17 | 2.30 | 0.058 | 17 | 106 | 0.9932 | 3.85 | 0.60 | 12.9 | 6 |
| 6.8 | 0.775 | 0.00 | 3.00 | 0.102 | 8 | 23 | 0.9965 | 3.45 | 0.56 | 10.7 | 5 |
| 7.0 | 0.500 | 0.25 | 2.00 | 0.070 | 3 | 22 | 0.9963 | 3.25 | 0.63 | 9.2 | 5 |
| 7.6 | 0.900 | 0.06 | 2.50 | 0.079 | 5 | 10 | 0.9967 | 3.39 | 0.56 | 9.8 | 5 |
| 8.1 | 0.545 | 0.18 | 1.90 | 0.080 | 13 | 35 | 0.9972 | 3.30 | 0.59 | 9.0 | 6 |
Get the number of rows (n)
nrow(wine)
## [1] 1599
Using the “nrow” function, I determined that the dataset has 1,599 rows, i.e., a sample size (n) of 1,599. I also verified this count by visually inspecting the red wine dataset in Excel.
Data Quality Concerns:
Data quality concerns could include the following:
Values outside of defined requirements/expectations (https://archive.ics.uci.edu/dataset/186/wine+quality):
Quality scores must be 0-10
Values outside of requirements/expectations we know from life experience/semantics:
No negative values can be in this dataset
pH values must be 0-14
Missing/Null Values
Let’s see if the dataset exhibits any of the data quality concerns noted above:
range(wine[["quality"]]) # see if the values fall within the range of 0-10
## [1] 3 8
sum(wine < 0) # check for negative values in the dataset; if sum is zero, then no negative values in the dataset
## [1] 0
range(wine[["pH"]]) # see if the values fall within the range of 0-14
## [1] 2.74 4.01
sum(is.na(wine)) # get a count of null values
## [1] 0
Interpreting the above output:
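Given the output above, I do not believe there are any data quality issues with the red wine dataset: the quality scores (3 to 8) fall within 0-10, there are no negative values, the pH values (2.74 to 4.01) fall within 0-14, and there are no missing values.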
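The four checks above could also be bundled into one reusable function. The following is only a minimal sketch: the name check_quality and the two small data frames are made up for illustration and are not part of the assignment or the dataset.

```r
# Hypothetical helper: runs the four checks and returns a named logical vector
# (TRUE means the check passed).
check_quality <- function(df) {
  c(
    quality_in_range = all(df[["quality"]] >= 0 & df[["quality"]] <= 10, na.rm = TRUE),
    no_negatives     = !any(df < 0, na.rm = TRUE),
    ph_in_range      = all(df[["pH"]] >= 0 & df[["pH"]] <= 14, na.rm = TRUE),
    no_missing       = !any(is.na(df))
  )
}

# Made-up data for illustration only (not rows from the wine dataset)
clean <- data.frame(pH = c(3.51, 3.20), alcohol = c(9.4, 9.8), quality = c(5, 6))
dirty <- data.frame(pH = c(3.51, 15.0), alcohol = c(9.4, NA), quality = c(5, 11))

clean_result <- check_quality(clean)
dirty_result <- check_quality(dirty)
print(clean_result)
print(dirty_result)
```

With the real data, the call would be check_quality(wine); every element being TRUE would match the conclusion drawn from the individual checks.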
Outliers:
There are several methods for detecting/identifying outliers in datasets. To identify outliers in this wine dataset, I will generate a boxplot for each variable and use these plots to determine whether outliers exist. A boxplot with labeled components, including outliers, can be seen below.
As can be seen in the diagram above, outliers (here, the values that lie very far from the others) are points outside of the interval [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR], where IQR = Q3 - Q1 is the interquartile range. Equivalently, they are the points that fall beyond the “whiskers” on the diagram.
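To make the 1.5 × IQR rule concrete, here is a small sketch that applies it by hand. The function name iqr_outliers and the numbers are my own, chosen only for illustration. Note that boxplot() computes its quartiles as "hinges", which can differ slightly from quantile(), so the two approaches may occasionally disagree on borderline points.

```r
# Hypothetical helper: returns the values of x that fall outside
# [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
iqr_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])  # 1.5 times the interquartile range
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
}

# Made-up values for illustration only: 1 to 9 plus one extreme point
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
out <- iqr_outliers(x)
print(out)

# The points boxplot() would draw beyond the whiskers, for comparison
print(boxplot.stats(x)$out)
```

For these values both approaches flag only the point 50.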
Generation of box-and-whisker plots for each of the variables can be seen below:
boxplot(wine[["fixed.acidity"]])
title("Boxplot of Fixed Acidity")
boxplot(wine[["volatile.acidity"]])
title("Boxplot of Volatile Acidity")
boxplot(wine[["citric.acid"]])
title("Boxplot of Citric Acid")
boxplot(wine[["residual.sugar"]])
title("Boxplot of Residual Sugar")
boxplot(wine[["chlorides"]])
title("Boxplot of Chlorides")
boxplot(wine[["free.sulfur.dioxide"]])
title("Boxplot of Free Sulfur Dioxide")
boxplot(wine[["total.sulfur.dioxide"]])
title("Boxplot of Total Sulfur Dioxide")
boxplot(wine[["density"]])
title("Boxplot of Density")
boxplot(wine[["pH"]])
title("Boxplot of pH")
boxplot(wine[["sulphates"]])
title("Boxplot of Sulphates")
boxplot(wine[["alcohol"]])
title("Boxplot of Alcohol")
boxplot(wine[["quality"]])
title("Boxplot of Quality")
It can be seen in the boxplots above that at least one outlier exists for every column within the dataset.
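Reading outliers off twelve plots by eye can be double-checked numerically. The sketch below counts, for each column, the points that boxplot() would draw beyond the whiskers, using boxplot.stats() from grDevices. The function name count_outliers and the toy data frame are assumptions made for illustration.

```r
# Hypothetical helper: number of boxplot outliers in each column of df
count_outliers <- function(df) {
  vapply(df, function(col) length(boxplot.stats(col)$out), integer(1))
}

# Made-up data for illustration only
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50),  # one extreme value
  b = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)   # no extreme values
)
counts <- count_outliers(toy)
print(counts)
```

With the real data, count_outliers(wine) would return one count per variable; a count of at least 1 in every position would agree with what the boxplots show.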
As highlighted in lecture 1-3, summary statistics can be provided via the summary function. This function will provide the following statistics for each variable in the output: minimum, first quartile, median, mean, third quartile, and maximum.
See the “summary” function output for each of the variables below:
summary(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
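As a quick illustration of what the summary function reports for a numeric column, the sketch below computes the same six statistics by hand and prints them next to summary(). It uses only the first five fixed.acidity values from the table at the top, so the numbers describe that small subset and not the full dataset. The variable names x and manual are my own.

```r
# First five fixed.acidity values from the table at the top of the document
x <- c(7.4, 7.8, 7.8, 11.2, 7.4)

# The six statistics computed individually
manual <- c(
  Min    = min(x),
  Q1     = unname(quantile(x, 0.25)),
  Median = median(x),
  Mean   = mean(x),
  Q3     = unname(quantile(x, 0.75)),
  Max    = max(x)
)
print(manual)

# The same statistics as reported by summary()
print(summary(x))
```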
There are several ways to visualize the distribution of each variable. One option is boxplots, as I used for the previous question; another is histograms. See histograms for each of the dataset’s variables below:
hist(wine[["fixed.acidity"]], main= "Histogram of Fixed Acidity", xlab = "Fixed Acidity")
hist(wine[["volatile.acidity"]], main="Histogram of Volatile Acidity", xlab = "Volatile Acidity")
hist(wine[["citric.acid"]], main="Histogram of Citric Acid", xlab = "Citric Acid")
hist(wine[["residual.sugar"]], main="Histogram of Residual Sugar", xlab = "Residual Sugar")
hist(wine[["chlorides"]], main="Histogram of Chlorides", xlab = "Chlorides")
hist(wine[["free.sulfur.dioxide"]], main="Histogram of Free Sulfur Dioxide", xlab = "Free Sulfur Dioxide")
hist(wine[["total.sulfur.dioxide"]], main="Histogram of Total Sulfur Dioxide", xlab = "Total Sulfur Dioxide")
hist(wine[["density"]], main="Histogram of Density", xlab = "Density")
hist(wine[["pH"]], main="Histogram of pH", xlab = "pH")
hist(wine[["sulphates"]], main="Histogram of Sulphates", xlab = "Sulphates")
hist(wine[["alcohol"]], main="Histogram of Alcohol", xlab = "Alcohol")
hist(wine[["quality"]], main="Histogram of Quality", xlab = "Quality")
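Since the twelve calls above differ only in the column name, they could be written as a loop. This is a sketch under my own assumptions: the helper name plot_histograms is hypothetical, and the labels are derived from the column names (dots replaced by spaces) rather than typed by hand, so their capitalization differs from the titles above.

```r
# Hypothetical helper: draws one histogram per column of df
plot_histograms <- function(df) {
  for (col in names(df)) {
    label <- gsub(".", " ", col, fixed = TRUE)  # e.g. "fixed.acidity" -> "fixed acidity"
    hist(df[[col]], main = paste("Histogram of", label), xlab = label)
  }
  invisible(names(df))  # return the plotted column names without printing them
}

# First three rows of two columns from the table at the top, for illustration
toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8),
  pH            = c(3.51, 3.20, 3.26)
)

# Usage: plot_histograms(toy), or plot_histograms(wine) for the full dataset
```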
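Whether a distribution is skewed can be judged from the histograms in the previous question.
Yes, all of the columns/variables except density, pH, and quality are skewed right. The summary statistics support this: for each of the right-skewed variables the mean exceeds the median (for example, residual sugar has a mean of 2.539 versus a median of 2.200), which is typical of a long right tail.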
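Skewness can also be quantified rather than judged by eye. The sketch below computes the moment-based sample skewness (third central moment divided by the 1.5 power of the second); positive values indicate a right skew, negative values a left skew, and values near zero a roughly symmetric distribution. The function name skewness and the test vectors are assumptions for illustration; base R does not ship a skewness function.

```r
# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2)
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Made-up values for illustration only
right_skewed <- c(1, 1, 2, 2, 3, 10)  # long right tail
symmetric    <- c(1, 2, 3, 4, 5)      # mirror image around 3

print(skewness(right_skewed))  # positive
print(skewness(symmetric))     # zero
```

With the real data, sapply(wine, skewness) would give one value per variable to compare against the histograms.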