Requirements for the Assignment

Address the following questions in R Code:

Pre-Requisite to answering the questions – Load in the Red Wine Dataset and Preview top 100 rows

wine <- read.csv(file = "winequality-red.csv", sep = ";", header = TRUE) # Load in the dataset (semicolon-delimited, first row holds column names)
knitr::kable(head(wine,100), caption = "Red Wine Dataset")
Red Wine Dataset
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
7.4 0.700 0.00 1.90 0.076 11 34 0.9978 3.51 0.56 9.4 5
7.8 0.880 0.00 2.60 0.098 25 67 0.9968 3.20 0.68 9.8 5
7.8 0.760 0.04 2.30 0.092 15 54 0.9970 3.26 0.65 9.8 5
11.2 0.280 0.56 1.90 0.075 17 60 0.9980 3.16 0.58 9.8 6
7.4 0.700 0.00 1.90 0.076 11 34 0.9978 3.51 0.56 9.4 5
7.4 0.660 0.00 1.80 0.075 13 40 0.9978 3.51 0.56 9.4 5
7.9 0.600 0.06 1.60 0.069 15 59 0.9964 3.30 0.46 9.4 5
7.3 0.650 0.00 1.20 0.065 15 21 0.9946 3.39 0.47 10.0 7
7.8 0.580 0.02 2.00 0.073 9 18 0.9968 3.36 0.57 9.5 7
7.5 0.500 0.36 6.10 0.071 17 102 0.9978 3.35 0.80 10.5 5
6.7 0.580 0.08 1.80 0.097 15 65 0.9959 3.28 0.54 9.2 5
7.5 0.500 0.36 6.10 0.071 17 102 0.9978 3.35 0.80 10.5 5
5.6 0.615 0.00 1.60 0.089 16 59 0.9943 3.58 0.52 9.9 5
7.8 0.610 0.29 1.60 0.114 9 29 0.9974 3.26 1.56 9.1 5
8.9 0.620 0.18 3.80 0.176 52 145 0.9986 3.16 0.88 9.2 5
8.9 0.620 0.19 3.90 0.170 51 148 0.9986 3.17 0.93 9.2 5
8.5 0.280 0.56 1.80 0.092 35 103 0.9969 3.30 0.75 10.5 7
8.1 0.560 0.28 1.70 0.368 16 56 0.9968 3.11 1.28 9.3 5
7.4 0.590 0.08 4.40 0.086 6 29 0.9974 3.38 0.50 9.0 4
7.9 0.320 0.51 1.80 0.341 17 56 0.9969 3.04 1.08 9.2 6
8.9 0.220 0.48 1.80 0.077 29 60 0.9968 3.39 0.53 9.4 6
7.6 0.390 0.31 2.30 0.082 23 71 0.9982 3.52 0.65 9.7 5
7.9 0.430 0.21 1.60 0.106 10 37 0.9966 3.17 0.91 9.5 5
8.5 0.490 0.11 2.30 0.084 9 67 0.9968 3.17 0.53 9.4 5
6.9 0.400 0.14 2.40 0.085 21 40 0.9968 3.43 0.63 9.7 6
6.3 0.390 0.16 1.40 0.080 11 23 0.9955 3.34 0.56 9.3 5
7.6 0.410 0.24 1.80 0.080 4 11 0.9962 3.28 0.59 9.5 5
7.9 0.430 0.21 1.60 0.106 10 37 0.9966 3.17 0.91 9.5 5
7.1 0.710 0.00 1.90 0.080 14 35 0.9972 3.47 0.55 9.4 5
7.8 0.645 0.00 2.00 0.082 8 16 0.9964 3.38 0.59 9.8 6
6.7 0.675 0.07 2.40 0.089 17 82 0.9958 3.35 0.54 10.1 5
6.9 0.685 0.00 2.50 0.105 22 37 0.9966 3.46 0.57 10.6 6
8.3 0.655 0.12 2.30 0.083 15 113 0.9966 3.17 0.66 9.8 5
6.9 0.605 0.12 10.70 0.073 40 83 0.9993 3.45 0.52 9.4 6
5.2 0.320 0.25 1.80 0.103 13 50 0.9957 3.38 0.55 9.2 5
7.8 0.645 0.00 5.50 0.086 5 18 0.9986 3.40 0.55 9.6 6
7.8 0.600 0.14 2.40 0.086 3 15 0.9975 3.42 0.60 10.8 6
8.1 0.380 0.28 2.10 0.066 13 30 0.9968 3.23 0.73 9.7 7
5.7 1.130 0.09 1.50 0.172 7 19 0.9940 3.50 0.48 9.8 4
7.3 0.450 0.36 5.90 0.074 12 87 0.9978 3.33 0.83 10.5 5
7.3 0.450 0.36 5.90 0.074 12 87 0.9978 3.33 0.83 10.5 5
8.8 0.610 0.30 2.80 0.088 17 46 0.9976 3.26 0.51 9.3 4
7.5 0.490 0.20 2.60 0.332 8 14 0.9968 3.21 0.90 10.5 6
8.1 0.660 0.22 2.20 0.069 9 23 0.9968 3.30 1.20 10.3 5
6.8 0.670 0.02 1.80 0.050 5 11 0.9962 3.48 0.52 9.5 5
4.6 0.520 0.15 2.10 0.054 8 65 0.9934 3.90 0.56 13.1 4
7.7 0.935 0.43 2.20 0.114 22 114 0.9970 3.25 0.73 9.2 5
8.7 0.290 0.52 1.60 0.113 12 37 0.9969 3.25 0.58 9.5 5
6.4 0.400 0.23 1.60 0.066 5 12 0.9958 3.34 0.56 9.2 5
5.6 0.310 0.37 1.40 0.074 12 96 0.9954 3.32 0.58 9.2 5
8.8 0.660 0.26 1.70 0.074 4 23 0.9971 3.15 0.74 9.2 5
6.6 0.520 0.04 2.20 0.069 8 15 0.9956 3.40 0.63 9.4 6
6.6 0.500 0.04 2.10 0.068 6 14 0.9955 3.39 0.64 9.4 6
8.6 0.380 0.36 3.00 0.081 30 119 0.9970 3.20 0.56 9.4 5
7.6 0.510 0.15 2.80 0.110 33 73 0.9955 3.17 0.63 10.2 6
7.7 0.620 0.04 3.80 0.084 25 45 0.9978 3.34 0.53 9.5 5
10.2 0.420 0.57 3.40 0.070 4 10 0.9971 3.04 0.63 9.6 5
7.5 0.630 0.12 5.10 0.111 50 110 0.9983 3.26 0.77 9.4 5
7.8 0.590 0.18 2.30 0.076 17 54 0.9975 3.43 0.59 10.0 5
7.3 0.390 0.31 2.40 0.074 9 46 0.9962 3.41 0.54 9.4 6
8.8 0.400 0.40 2.20 0.079 19 52 0.9980 3.44 0.64 9.2 5
7.7 0.690 0.49 1.80 0.115 20 112 0.9968 3.21 0.71 9.3 5
7.5 0.520 0.16 1.90 0.085 12 35 0.9968 3.38 0.62 9.5 7
7.0 0.735 0.05 2.00 0.081 13 54 0.9966 3.39 0.57 9.8 5
7.2 0.725 0.05 4.65 0.086 4 11 0.9962 3.41 0.39 10.9 5
7.2 0.725 0.05 4.65 0.086 4 11 0.9962 3.41 0.39 10.9 5
7.5 0.520 0.11 1.50 0.079 11 39 0.9968 3.42 0.58 9.6 5
6.6 0.705 0.07 1.60 0.076 6 15 0.9962 3.44 0.58 10.7 5
9.3 0.320 0.57 2.00 0.074 27 65 0.9969 3.28 0.79 10.7 5
8.0 0.705 0.05 1.90 0.074 8 19 0.9962 3.34 0.95 10.5 6
7.7 0.630 0.08 1.90 0.076 15 27 0.9967 3.32 0.54 9.5 6
7.7 0.670 0.23 2.10 0.088 17 96 0.9962 3.32 0.48 9.5 5
7.7 0.690 0.22 1.90 0.084 18 94 0.9961 3.31 0.48 9.5 5
8.3 0.675 0.26 2.10 0.084 11 43 0.9976 3.31 0.53 9.2 4
9.7 0.320 0.54 2.50 0.094 28 83 0.9984 3.28 0.82 9.6 5
8.8 0.410 0.64 2.20 0.093 9 42 0.9986 3.54 0.66 10.5 5
8.8 0.410 0.64 2.20 0.093 9 42 0.9986 3.54 0.66 10.5 5
6.8 0.785 0.00 2.40 0.104 14 30 0.9966 3.52 0.55 10.7 6
6.7 0.750 0.12 2.00 0.086 12 80 0.9958 3.38 0.52 10.1 5
8.3 0.625 0.20 1.50 0.080 27 119 0.9972 3.16 1.12 9.1 4
6.2 0.450 0.20 1.60 0.069 3 15 0.9958 3.41 0.56 9.2 5
7.8 0.430 0.70 1.90 0.464 22 67 0.9974 3.13 1.28 9.4 5
7.4 0.500 0.47 2.00 0.086 21 73 0.9970 3.36 0.57 9.1 5
7.3 0.670 0.26 1.80 0.401 16 51 0.9969 3.16 1.14 9.4 5
6.3 0.300 0.48 1.80 0.069 18 61 0.9959 3.44 0.78 10.3 6
6.9 0.550 0.15 2.20 0.076 19 40 0.9961 3.41 0.59 10.1 5
8.6 0.490 0.28 1.90 0.110 20 136 0.9972 2.93 1.95 9.9 6
7.7 0.490 0.26 1.90 0.062 9 31 0.9966 3.39 0.64 9.6 5
9.3 0.390 0.44 2.10 0.107 34 125 0.9978 3.14 1.22 9.5 5
7.0 0.620 0.08 1.80 0.076 8 24 0.9978 3.48 0.53 9.0 5
7.9 0.520 0.26 1.90 0.079 42 140 0.9964 3.23 0.54 9.5 5
8.6 0.490 0.28 1.90 0.110 20 136 0.9972 2.93 1.95 9.9 6
8.6 0.490 0.29 2.00 0.110 19 133 0.9972 2.93 1.98 9.8 5
7.7 0.490 0.26 1.90 0.062 9 31 0.9966 3.39 0.64 9.6 5
5.0 1.020 0.04 1.40 0.045 41 85 0.9938 3.75 0.48 10.5 4
4.7 0.600 0.17 2.30 0.058 17 106 0.9932 3.85 0.60 12.9 6
6.8 0.775 0.00 3.00 0.102 8 23 0.9965 3.45 0.56 10.7 5
7.0 0.500 0.25 2.00 0.070 3 22 0.9963 3.25 0.63 9.2 5
7.6 0.900 0.06 2.50 0.079 5 10 0.9967 3.39 0.56 9.8 5
8.1 0.545 0.18 1.90 0.080 13 35 0.9972 3.30 0.59 9.0 6

What is the sample size?

Get the number of rows (n)

nrow(wine)
## [1] 1599

Using the “nrow” function, I determined that the dataset has a row count, or sample size (n), of 1,599. I also verified this answer by visually examining the red wine dataset in Excel.

Any outliers? Do you have any concerns about data quality?

Data Quality Concerns:

  • Data quality concerns could include the following:

    • Values outside of defined requirements/expectations (https://archive.ics.uci.edu/dataset/186/wine+quality):

      • Quality: score between 0 and 10
    • Values outside of requirements/expectations we know from life experience/semantics:

      • No negative values can be in this dataset

      • pH values must be 0-14

    • Missing/Null Values

Let’s see if the dataset exhibits any of the data quality concerns noted above

range(wine[["quality"]]) # see if the values fall within the range of 0-10
## [1] 3 8
sum(wine < 0) # check for negative values in the dataset; if sum is zero, then no negative values in the dataset
## [1] 0
range(wine[["pH"]]) # see if the values fall within the range of 0-14
## [1] 2.74 4.01
sum(is.na(wine)) # get a count of null values
## [1] 0

Interpreting the above output:

  • The “Quality” values fall in the range of 3-8
  • There are no negative values in the dataset
  • The “pH” values fall in the range of 2.74-4.01
  • There are no null values in the dataset

Given the output above, every check passes, so I do not believe there are any data quality issues with the red wine dataset.
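The four checks above can also be bundled into one helper so they can be rerun in a single call. This is only a minimal sketch: the function name “check_wine_quality” is hypothetical, and the toy data frame is built from the first three rows of the preview (three columns only) so that the example runs without the CSV file.

```r
# Hypothetical helper (not part of the original analysis): runs the same
# range, negative-value, and missing-value checks and returns named flags.
check_wine_quality <- function(df) {
  c(
    quality_in_range = all(df[["quality"]] >= 0 & df[["quality"]] <= 10, na.rm = TRUE),
    ph_in_range      = all(df[["pH"]] >= 0 & df[["pH"]] <= 14, na.rm = TRUE),
    no_negatives     = !any(df < 0, na.rm = TRUE),
    no_missing       = !any(is.na(df))
  )
}

# Toy data: first three preview rows, restricted to three columns.
toy <- data.frame(
  pH      = c(3.51, 3.20, 3.26),
  alcohol = c(9.4, 9.8, 9.8),
  quality = c(5, 5, 5)
)

check_wine_quality(toy)
```

With the full dataset loaded, check_wine_quality(wine) should return TRUE for every flag if the output above holds.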

Outliers:

There are several methods for detecting outliers in a dataset. To identify outliers in this wine dataset, I generate a boxplot for each variable and use the plots to determine whether outliers exist. A boxplot with labeled components, including outliers, can be seen below.

As can be seen in the diagram above, outliers (in this case, the values I am treating as very distant from the others) are points outside the interval [Q1 - 1.5 x IQR, Q3 + 1.5 x IQR], where IQR = Q3 - Q1 is the interquartile range. Equivalently, they are the points that fall beyond the “whiskers” on the diagram.
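The 1.5 x IQR rule can be sketched directly in code. The helper names “iqr_fences” and “flag_outliers” are hypothetical, and the toy vector uses alcohol values taken from the preview table. One caveat: “quantile” with its default settings can give slightly different quartiles from the hinges that “boxplot” uses, so the flagged points may not match the plots exactly on every dataset.

```r
# Hypothetical helper: lower and upper fences of the 1.5 x IQR rule.
iqr_fences <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

# Hypothetical helper: TRUE for every value that falls outside the fences.
flag_outliers <- function(x) {
  f <- iqr_fences(x)
  x < f[["lower"]] | x > f[["upper"]]
}

# Toy vector of alcohol values from the preview (not the full column).
x <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 13.1)

iqr_fences(x)        # fences for the toy vector
x[flag_outliers(x)]  # values beyond the fences
```

For the toy vector, Q1 = 9.4 and Q3 = 9.8, so the fences are roughly 8.8 and 10.4 and only 13.1 is flagged.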

Generation of box-and-whisker plots for each of the variables can be seen below:

boxplot(wine[["fixed.acidity"]]) 
title("Boxplot of Fixed Acidity")

boxplot(wine[["volatile.acidity"]])
title("Boxplot of Volatile Acidity")

boxplot(wine[["citric.acid"]])
title("Boxplot of Citric Acid")

boxplot(wine[["residual.sugar"]])
title("Boxplot of Residual Sugar")

boxplot(wine[["chlorides"]])
title("Boxplot of Chlorides")

boxplot(wine[["free.sulfur.dioxide"]])
title("Boxplot of Free Sulfur Dioxide")

boxplot(wine[["total.sulfur.dioxide"]])
title("Boxplot of Total Sulfur Dioxide")

boxplot(wine[["density"]])
title("Boxplot of Density")

boxplot(wine[["pH"]])
title("Boxplot of pH")

boxplot(wine[["sulphates"]])
title("Boxplot of Sulphates")

boxplot(wine[["alcohol"]])
title("Boxplot of Alcohol")

boxplot(wine[["quality"]])
title("Boxplot of Quality")

The boxplots above show at least one point beyond the whiskers, and therefore at least one outlier, for every column in the dataset.
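Reading outliers off twelve plots can be cross-checked numerically. The sketch below counts, per column, the points that “boxplot.stats” (the function “boxplot” relies on for its statistics) reports as outliers. The helper name “count_outliers” is hypothetical, and the toy data frame reuses values from the preview rather than the full dataset.

```r
# Hypothetical helper: number of boxplot outliers in each column of a data frame.
count_outliers <- function(df) {
  sapply(df, function(col) length(boxplot.stats(col)$out))
}

# Toy data built from preview values (not the full dataset).
toy <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 13.1),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

count_outliers(toy)
```

With the full dataset loaded, count_outliers(wine) returns one count per variable; a count of zero for any column would contradict the reading of the plots above.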

How can you summarize the data of each variable in a concise way? What statistics are you going to present?

As highlighted in lecture 1-3, summary statistics can be provided via the summary function. This function will provide the following statistics for each variable in the output:

  • Minimum
  • 1st Quartile
  • Median
  • Mean
  • 3rd Quartile
  • Maximum

See the “summary” function output for each of the variables below:

summary(wine)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000
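The “summary” output covers location and range but not spread. As an optional addition, the sketch below computes the standard deviation and interquartile range for each variable. The helper name “spread_stats” is hypothetical, and the toy data frame uses the first four preview rows (two columns only).

```r
# Hypothetical helper: standard deviation and IQR for every column,
# returned as a matrix with one row per variable.
spread_stats <- function(df) {
  t(sapply(df, function(col) {
    c(sd = sd(col, na.rm = TRUE), iqr = IQR(col, na.rm = TRUE))
  }))
}

# Toy data: first four preview rows, two columns only.
toy <- data.frame(
  pH      = c(3.51, 3.20, 3.26, 3.16),
  quality = c(5, 5, 5, 6)
)

spread_stats(toy)
```

With the full dataset loaded, spread_stats(wine) can be read alongside the “summary” table.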

How can you visualize the distribution of each variable?

There are several ways to visualize the distribution of each variable. One option is a boxplot, as I used for the outlier question above. Another is a histogram, which shows the shape of the distribution more directly. See the histograms for each of the dataset’s variables below:

hist(wine[["fixed.acidity"]], main= "Histogram of Fixed Acidity", xlab = "Fixed Acidity")

hist(wine[["volatile.acidity"]], main="Histogram of Volatile Acidity", xlab = "Volatile Acidity")

hist(wine[["citric.acid"]], main="Histogram of Citric Acid", xlab = "Citric Acid")

hist(wine[["residual.sugar"]], main="Histogram of Residual Sugar", xlab = "Residual Sugar")

hist(wine[["chlorides"]], main="Histogram of Chlorides", xlab = "Chlorides")

hist(wine[["free.sulfur.dioxide"]], main="Histogram of Free Sulfur Dioxide", xlab = "Free Sulfur Dioxide")

hist(wine[["total.sulfur.dioxide"]], main="Histogram of Total Sulfur Dioxide", xlab = "Total Sulfur Dioxide")

hist(wine[["density"]], main="Histogram of Density", xlab = "Density")

hist(wine[["pH"]], main="Histogram of pH", xlab = "pH")

hist(wine[["sulphates"]], main="Histogram of Sulphates", xlab = "Sulphates")

hist(wine[["alcohol"]], main="Histogram of Alcohol", xlab = "Alcohol")

hist(wine[["quality"]], main="Histogram of Quality", xlab = "Quality")
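The twelve calls above can also be written as a loop that draws all histograms in one grid, which makes the distributions easier to compare side by side. This is a sketch under assumptions: the helper name “plot_histograms” and its “n_col” argument are hypothetical, the panel titles use the raw column names rather than the hand-written labels above, and the toy data frame is taken from the preview.

```r
# Hypothetical helper: one histogram per column, arranged in a grid.
plot_histograms <- function(df, n_col = 4) {
  n_row <- ceiling(ncol(df) / n_col)
  old_par <- par(mfrow = c(n_row, n_col))  # split the plotting area into a grid
  on.exit(par(old_par))                    # restore the previous layout afterwards
  for (name in names(df)) {
    hist(df[[name]], main = paste("Histogram of", name), xlab = name)
  }
  invisible(names(df))
}

# Toy data built from preview values (not the full dataset).
toy <- data.frame(
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
)

plot_histograms(toy, n_col = 2)
```

With the full dataset loaded, plot_histograms(wine) would lay out the twelve variables in a 3 x 4 grid.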

Do you see any skewed Distributions?

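To see if a distribution is skewed, refer to the histograms from the previous question.

Yes, all of the variables except density, pH, and quality are skewed right. The “summary” output is consistent with this: for each of those nine variables the mean exceeds the median (for example, residual sugar has a mean of 2.539 against a median of 2.200), which is typical when a long right tail pulls the mean upward.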
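The visual judgement can be backed by a number. The sketch below computes the moment coefficient of skewness, g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. Positive values indicate right skew, and values near zero indicate a roughly symmetric distribution. The helper name “sample_skewness” is hypothetical, this measure was not part of the original analysis, and the toy vectors are made up for illustration.

```r
# Hypothetical helper: moment coefficient of skewness (g1).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Made-up vectors for illustration only.
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 1, 2, 10)

sample_skewness(symmetric)     # zero for perfectly symmetric data
sample_skewness(right_skewed)  # positive for a long right tail
```

With the full dataset loaded, sapply(wine, sample_skewness) gives one value per variable to compare against the histograms.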