Requirements for the Assignment

Address the following questions in R Code:

What is the sample size?
Any outliers? Do you have any concerns about the data quality?
How can you summarize the data of each variable in a concise way? What statistics are you going to present?
How can you visualize the distribution of each variable?
Do you see any skewed distributions?

Prerequisite to answering the questions – Load the Red Wine dataset and preview the first 100 rows

wine <- read.csv(file = "winequality-red.csv", sep = ";", header = TRUE) # Load in the dataset (TRUE rather than T, which can be reassigned)
knitr::kable(head(wine,100), caption = "Red Wine Dataset")

Red Wine Dataset
fixed.acidity	volatile.acidity	citric.acid	residual.sugar	chlorides	free.sulfur.dioxide	total.sulfur.dioxide	density	pH	sulphates	alcohol	quality
7.4	0.700	0.00	1.90	0.076	11	34	0.9978	3.51	0.56	9.4	5
7.8	0.880	0.00	2.60	0.098	25	67	0.9968	3.20	0.68	9.8	5
7.8	0.760	0.04	2.30	0.092	15	54	0.9970	3.26	0.65	9.8	5
11.2	0.280	0.56	1.90	0.075	17	60	0.9980	3.16	0.58	9.8	6
7.4	0.700	0.00	1.90	0.076	11	34	0.9978	3.51	0.56	9.4	5
7.4	0.660	0.00	1.80	0.075	13	40	0.9978	3.51	0.56	9.4	5
7.9	0.600	0.06	1.60	0.069	15	59	0.9964	3.30	0.46	9.4	5
7.3	0.650	0.00	1.20	0.065	15	21	0.9946	3.39	0.47	10.0	7
7.8	0.580	0.02	2.00	0.073	9	18	0.9968	3.36	0.57	9.5	7
7.5	0.500	0.36	6.10	0.071	17	102	0.9978	3.35	0.80	10.5	5
6.7	0.580	0.08	1.80	0.097	15	65	0.9959	3.28	0.54	9.2	5
7.5	0.500	0.36	6.10	0.071	17	102	0.9978	3.35	0.80	10.5	5
5.6	0.615	0.00	1.60	0.089	16	59	0.9943	3.58	0.52	9.9	5
7.8	0.610	0.29	1.60	0.114	9	29	0.9974	3.26	1.56	9.1	5
8.9	0.620	0.18	3.80	0.176	52	145	0.9986	3.16	0.88	9.2	5
8.9	0.620	0.19	3.90	0.170	51	148	0.9986	3.17	0.93	9.2	5
8.5	0.280	0.56	1.80	0.092	35	103	0.9969	3.30	0.75	10.5	7
8.1	0.560	0.28	1.70	0.368	16	56	0.9968	3.11	1.28	9.3	5
7.4	0.590	0.08	4.40	0.086	6	29	0.9974	3.38	0.50	9.0	4
7.9	0.320	0.51	1.80	0.341	17	56	0.9969	3.04	1.08	9.2	6
8.9	0.220	0.48	1.80	0.077	29	60	0.9968	3.39	0.53	9.4	6
7.6	0.390	0.31	2.30	0.082	23	71	0.9982	3.52	0.65	9.7	5
7.9	0.430	0.21	1.60	0.106	10	37	0.9966	3.17	0.91	9.5	5
8.5	0.490	0.11	2.30	0.084	9	67	0.9968	3.17	0.53	9.4	5
6.9	0.400	0.14	2.40	0.085	21	40	0.9968	3.43	0.63	9.7	6
6.3	0.390	0.16	1.40	0.080	11	23	0.9955	3.34	0.56	9.3	5
7.6	0.410	0.24	1.80	0.080	4	11	0.9962	3.28	0.59	9.5	5
7.9	0.430	0.21	1.60	0.106	10	37	0.9966	3.17	0.91	9.5	5
7.1	0.710	0.00	1.90	0.080	14	35	0.9972	3.47	0.55	9.4	5
7.8	0.645	0.00	2.00	0.082	8	16	0.9964	3.38	0.59	9.8	6
6.7	0.675	0.07	2.40	0.089	17	82	0.9958	3.35	0.54	10.1	5
6.9	0.685	0.00	2.50	0.105	22	37	0.9966	3.46	0.57	10.6	6
8.3	0.655	0.12	2.30	0.083	15	113	0.9966	3.17	0.66	9.8	5
6.9	0.605	0.12	10.70	0.073	40	83	0.9993	3.45	0.52	9.4	6
5.2	0.320	0.25	1.80	0.103	13	50	0.9957	3.38	0.55	9.2	5
7.8	0.645	0.00	5.50	0.086	5	18	0.9986	3.40	0.55	9.6	6
7.8	0.600	0.14	2.40	0.086	3	15	0.9975	3.42	0.60	10.8	6
8.1	0.380	0.28	2.10	0.066	13	30	0.9968	3.23	0.73	9.7	7
5.7	1.130	0.09	1.50	0.172	7	19	0.9940	3.50	0.48	9.8	4
7.3	0.450	0.36	5.90	0.074	12	87	0.9978	3.33	0.83	10.5	5
7.3	0.450	0.36	5.90	0.074	12	87	0.9978	3.33	0.83	10.5	5
8.8	0.610	0.30	2.80	0.088	17	46	0.9976	3.26	0.51	9.3	4
7.5	0.490	0.20	2.60	0.332	8	14	0.9968	3.21	0.90	10.5	6
8.1	0.660	0.22	2.20	0.069	9	23	0.9968	3.30	1.20	10.3	5
6.8	0.670	0.02	1.80	0.050	5	11	0.9962	3.48	0.52	9.5	5
4.6	0.520	0.15	2.10	0.054	8	65	0.9934	3.90	0.56	13.1	4
7.7	0.935	0.43	2.20	0.114	22	114	0.9970	3.25	0.73	9.2	5
8.7	0.290	0.52	1.60	0.113	12	37	0.9969	3.25	0.58	9.5	5
6.4	0.400	0.23	1.60	0.066	5	12	0.9958	3.34	0.56	9.2	5
5.6	0.310	0.37	1.40	0.074	12	96	0.9954	3.32	0.58	9.2	5
8.8	0.660	0.26	1.70	0.074	4	23	0.9971	3.15	0.74	9.2	5
6.6	0.520	0.04	2.20	0.069	8	15	0.9956	3.40	0.63	9.4	6
6.6	0.500	0.04	2.10	0.068	6	14	0.9955	3.39	0.64	9.4	6
8.6	0.380	0.36	3.00	0.081	30	119	0.9970	3.20	0.56	9.4	5
7.6	0.510	0.15	2.80	0.110	33	73	0.9955	3.17	0.63	10.2	6
7.7	0.620	0.04	3.80	0.084	25	45	0.9978	3.34	0.53	9.5	5
10.2	0.420	0.57	3.40	0.070	4	10	0.9971	3.04	0.63	9.6	5
7.5	0.630	0.12	5.10	0.111	50	110	0.9983	3.26	0.77	9.4	5
7.8	0.590	0.18	2.30	0.076	17	54	0.9975	3.43	0.59	10.0	5
7.3	0.390	0.31	2.40	0.074	9	46	0.9962	3.41	0.54	9.4	6
8.8	0.400	0.40	2.20	0.079	19	52	0.9980	3.44	0.64	9.2	5
7.7	0.690	0.49	1.80	0.115	20	112	0.9968	3.21	0.71	9.3	5
7.5	0.520	0.16	1.90	0.085	12	35	0.9968	3.38	0.62	9.5	7
7.0	0.735	0.05	2.00	0.081	13	54	0.9966	3.39	0.57	9.8	5
7.2	0.725	0.05	4.65	0.086	4	11	0.9962	3.41	0.39	10.9	5
7.2	0.725	0.05	4.65	0.086	4	11	0.9962	3.41	0.39	10.9	5
7.5	0.520	0.11	1.50	0.079	11	39	0.9968	3.42	0.58	9.6	5
6.6	0.705	0.07	1.60	0.076	6	15	0.9962	3.44	0.58	10.7	5
9.3	0.320	0.57	2.00	0.074	27	65	0.9969	3.28	0.79	10.7	5
8.0	0.705	0.05	1.90	0.074	8	19	0.9962	3.34	0.95	10.5	6
7.7	0.630	0.08	1.90	0.076	15	27	0.9967	3.32	0.54	9.5	6
7.7	0.670	0.23	2.10	0.088	17	96	0.9962	3.32	0.48	9.5	5
7.7	0.690	0.22	1.90	0.084	18	94	0.9961	3.31	0.48	9.5	5
8.3	0.675	0.26	2.10	0.084	11	43	0.9976	3.31	0.53	9.2	4
9.7	0.320	0.54	2.50	0.094	28	83	0.9984	3.28	0.82	9.6	5
8.8	0.410	0.64	2.20	0.093	9	42	0.9986	3.54	0.66	10.5	5
8.8	0.410	0.64	2.20	0.093	9	42	0.9986	3.54	0.66	10.5	5
6.8	0.785	0.00	2.40	0.104	14	30	0.9966	3.52	0.55	10.7	6
6.7	0.750	0.12	2.00	0.086	12	80	0.9958	3.38	0.52	10.1	5
8.3	0.625	0.20	1.50	0.080	27	119	0.9972	3.16	1.12	9.1	4
6.2	0.450	0.20	1.60	0.069	3	15	0.9958	3.41	0.56	9.2	5
7.8	0.430	0.70	1.90	0.464	22	67	0.9974	3.13	1.28	9.4	5
7.4	0.500	0.47	2.00	0.086	21	73	0.9970	3.36	0.57	9.1	5
7.3	0.670	0.26	1.80	0.401	16	51	0.9969	3.16	1.14	9.4	5
6.3	0.300	0.48	1.80	0.069	18	61	0.9959	3.44	0.78	10.3	6
6.9	0.550	0.15	2.20	0.076	19	40	0.9961	3.41	0.59	10.1	5
8.6	0.490	0.28	1.90	0.110	20	136	0.9972	2.93	1.95	9.9	6
7.7	0.490	0.26	1.90	0.062	9	31	0.9966	3.39	0.64	9.6	5
9.3	0.390	0.44	2.10	0.107	34	125	0.9978	3.14	1.22	9.5	5
7.0	0.620	0.08	1.80	0.076	8	24	0.9978	3.48	0.53	9.0	5
7.9	0.520	0.26	1.90	0.079	42	140	0.9964	3.23	0.54	9.5	5
8.6	0.490	0.28	1.90	0.110	20	136	0.9972	2.93	1.95	9.9	6
8.6	0.490	0.29	2.00	0.110	19	133	0.9972	2.93	1.98	9.8	5
7.7	0.490	0.26	1.90	0.062	9	31	0.9966	3.39	0.64	9.6	5
5.0	1.020	0.04	1.40	0.045	41	85	0.9938	3.75	0.48	10.5	4
4.7	0.600	0.17	2.30	0.058	17	106	0.9932	3.85	0.60	12.9	6
6.8	0.775	0.00	3.00	0.102	8	23	0.9965	3.45	0.56	10.7	5
7.0	0.500	0.25	2.00	0.070	3	22	0.9963	3.25	0.63	9.2	5
7.6	0.900	0.06	2.50	0.079	5	10	0.9967	3.39	0.56	9.8	5
8.1	0.545	0.18	1.90	0.080	13	35	0.9972	3.30	0.59	9.0	6

What is the sample size?

Get the number of rows (n)

nrow(wine)

## [1] 1599

Using the “nrow” function, I determined that the dataset has a sample size (n) of 1,599 rows. I also verified this count by visually inspecting the red wine dataset in Excel.

Any outliers? Do you have any concerns about data quality?

Data Quality Concerns:

Data quality concerns could include the following:
- Values outside of defined requirements/expectations (https://archive.ics.uci.edu/dataset/186/wine+quality):
  - Quality: score between 0 and 10
- Values outside of requirements/expectations we know from life experience/semantics:
  - No negative values can be in this dataset
  - pH values must be 0-14
- Missing/Null Values

Let’s see whether the dataset exhibits any of the data quality concerns noted above:

range(wine[["quality"]]) # see if the values fall within the range of 0-10

## [1] 3 8

sum(wine < 0) # check for negative values in the dataset; if sum is zero, then no negative values in the dataset

## [1] 0

range(wine[["pH"]]) # see if the values fall within the range of 0-14

## [1] 2.74 4.01

sum(is.na(wine)) # get a count of null values

## [1] 0

Interpreting the above output:

- The “quality” values fall in the range 3-8, which is within the expected 0-10 scale
- There are no negative values in the dataset
- The “pH” values fall in the range 2.74-4.01, which is within the valid 0-14 scale
- There are no null values in the dataset

Given the output above, I do not see any data quality issues among the checks performed on the red wine dataset.
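The individual checks above can also be bundled into a single helper, which makes them easier to rerun. The sketch below is only an illustration: `check_quality` is a hypothetical name, it assumes every column is numeric, and it adds a duplicate-row count that the checks above do not perform (rows 1 and 5 of the preview are identical, so it may be worth a look). The stand-in data frame is taken from the first five previewed rows, restricted to three columns.

```r
# Hypothetical helper: bundles the range, sign, and missing-value checks,
# plus a duplicate-row count, into one named list.
check_quality <- function(df) {
  list(
    n_missing        = sum(is.na(df)),
    n_negative       = sum(df < 0, na.rm = TRUE),
    quality_in_range = all(df[["quality"]] >= 0 & df[["quality"]] <= 10),
    ph_in_range      = all(df[["pH"]] >= 0 & df[["pH"]] <= 14),
    n_duplicate_rows = sum(duplicated(df))
  )
}

# Stand-in: first five previewed rows, three columns only.
# With the full data this would be check_quality(wine).
toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51),
  quality       = c(5, 5, 5, 6, 5)
)
str(check_quality(toy))
```

Whether repeated rows count as a quality problem depends on how the data were collected; identical measurements from different wines are possible, so a nonzero count is a prompt to investigate rather than an error in itself.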

Outliers:

There are several methods for identifying outliers in a dataset. For this wine dataset, I will generate a boxplot for each variable and use the plots to decide whether outliers exist. A boxplot with labeled components, including outliers, can be seen below.

As can be seen in the diagram above, outliers (here, the values I am treating as very distant from the others) are points outside the interval [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR], where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1. On the diagram, they are the points that fall beyond the “whiskers”.
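The 1.5 × IQR rule can also be applied numerically. The following is a minimal sketch, not part of the original analysis: `iqr_outliers` is a hypothetical helper, and the input vector is made up for illustration.

```r
# Return the values of x lying outside [Q1 - k*IQR, Q3 + k*IQR].
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
}

x <- c(1, 2, 3, 4, 5, 100)  # made-up values for illustration
iqr_outliers(x)             # only 100 falls outside the fences
```

Note that `boxplot()` computes its box from hinges (see `fivenum`) rather than from `quantile()`, so the points it draws beyond the whiskers can differ slightly from this function's result on some data.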

Generation of box-and-whisker plots for each of the variables can be seen below:

boxplot(wine[["fixed.acidity"]]) 
title("Boxplot of Fixed Acidity")

boxplot(wine[["volatile.acidity"]])
title("Boxplot of Volatile Acidity")

boxplot(wine[["citric.acid"]])
title("Boxplot of Citric Acid")

boxplot(wine[["residual.sugar"]])
title("Boxplot of Residual Sugar")

boxplot(wine[["chlorides"]])
title("Boxplot of Chlorides")

boxplot(wine[["free.sulfur.dioxide"]])
title("Boxplot of Free Sulfur Dioxide")

boxplot(wine[["total.sulfur.dioxide"]])
title("Boxplot of Total Sulfur Dioxide")

boxplot(wine[["density"]])
title("Boxplot of Density")

boxplot(wine[["pH"]])
title("Boxplot of pH")

boxplot(wine[["sulphates"]])
title("Boxplot of Sulphates")

boxplot(wine[["alcohol"]])
title("Boxplot of Alcohol")

boxplot(wine[["quality"]])
title("Boxplot of Quality")

The boxplots above show at least one point beyond the whiskers, that is, at least one outlier under the 1.5 × IQR rule, for every column in the dataset.
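Reading outliers off twelve separate plots is easy to get wrong, so the visual impression could be cross-checked with a count per column. The sketch below uses `boxplot.stats()`, which returns in `$out` the points that `boxplot()` draws beyond the whiskers. `count_boxplot_outliers` is a hypothetical helper, and the data frame is made up for illustration.

```r
# Count the points that boxplot() would draw beyond the whiskers, per column.
count_boxplot_outliers <- function(df) {
  sapply(df, function(x) length(boxplot.stats(x)$out))
}

# Made-up stand-in data; with the full data this would be
# count_boxplot_outliers(wine).
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 100),
  b = c(10, 11, 12, 13, 14, 15)
)
count_boxplot_outliers(toy)
```

A column with a count of zero would contradict the statement above, so running this on `wine` is a quick way to confirm it.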

How can you summarize the data of each variable in a concise way? What statistics are you going to present?

As highlighted in lecture 1-3, summary statistics can be obtained with the summary function. For each numeric variable, its output includes the following statistics:

Minimum
1st Quartile
Median
Mean
3rd Quartile
Maximum

See the “summary” function output for each of the variables below:

summary(wine)

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000
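The `summary` output covers location and range but not the standard deviation. If a measure of spread is wanted alongside it, a compact table could be built as in the sketch below. `describe` is a hypothetical helper name, and the stand-in data frame uses two columns from the first five previewed rows.

```r
# Hypothetical helper: one row per variable with centre and spread.
describe <- function(df) {
  data.frame(
    n      = sapply(df, function(x) sum(!is.na(x))),
    mean   = sapply(df, mean, na.rm = TRUE),
    sd     = sapply(df, sd, na.rm = TRUE),
    median = sapply(df, median, na.rm = TRUE),
    IQR    = sapply(df, IQR, na.rm = TRUE)
  )
}

# Stand-in: pH and alcohol from the first five previewed rows.
# With the full data this would be describe(wine).
toy <- data.frame(
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4)
)
round(describe(toy), 3)
```

Reporting the median and IQR next to the mean and standard deviation is useful here because the former are less sensitive to the outliers identified earlier.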

How can you visualize the distribution of each variable?

There are several ways to visualize the distribution of each variable. One option is boxplots, as I used for the outlier question. Another is histograms, which show the shape of the distribution more directly. See histograms for each of the dataset’s variables below:

hist(wine[["fixed.acidity"]], main= "Histogram of Fixed Acidity", xlab = "Fixed Acidity")

hist(wine[["volatile.acidity"]], main="Histogram of Volatile Acidity", xlab = "Volatile Acidity")

hist(wine[["citric.acid"]], main="Histogram of Citric Acid", xlab = "Citric Acid")

hist(wine[["residual.sugar"]], main="Histogram of Residual Sugar", xlab = "Residual Sugar")

hist(wine[["chlorides"]], main="Histogram of Chlorides", xlab = "Chlorides")

hist(wine[["free.sulfur.dioxide"]], main="Histogram of Free Sulfur Dioxide", xlab = "Free Sulfur Dioxide")

hist(wine[["total.sulfur.dioxide"]], main="Histogram of Total Sulfur Dioxide", xlab = "Total Sulfur Dioxide")

hist(wine[["density"]], main="Histogram of Density", xlab = "Density")

hist(wine[["pH"]], main="Histogram of pH", xlab = "pH")

hist(wine[["sulphates"]], main="Histogram of Sulphates", xlab = "Sulphates")

hist(wine[["alcohol"]], main="Histogram of Alcohol", xlab = "Alcohol")

hist(wine[["quality"]], main="Histogram of Quality", xlab = "Quality")
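The twelve calls above differ only in the column name, so they could be replaced by a loop that draws every histogram on one page. This is a sketch under assumptions: `plot_histograms` and its `n_col` argument are hypothetical names, and the panel titles use the raw column names rather than the hand-written titles above.

```r
# Draw one histogram per column of df, arranged in a grid on a single page.
plot_histograms <- function(df, n_col = 4) {
  n_row <- ceiling(length(df) / n_col)
  old <- par(mfrow = c(n_row, n_col), mar = c(4, 4, 2, 1))
  on.exit(par(old))  # restore the previous plot layout afterwards
  for (name in names(df)) {
    hist(df[[name]], main = name, xlab = name)
  }
  invisible(names(df))
}

# Usage with the full data (12 panels in a 3 x 4 grid):
# plot_histograms(wine)
```

Seeing all variables side by side makes it easier to compare shapes; replacing `hist` with `boxplot` inside the loop (and dropping `xlab`) gives the equivalent grid of boxplots.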

Do you see any skewed distributions?

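To judge whether a distribution is skewed, refer to the histograms from the previous question.

Yes, all of the variables except density, pH, and quality are skewed right, meaning they have a longer tail toward high values. The “summary” output is consistent with this: for each of the right-skewed variables, the mean exceeds the median.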
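The visual judgement can be backed by a number. The sketch below computes the moment-based sample skewness (third central moment divided by the second central moment raised to the power 1.5) in base R. `skewness` is a hypothetical helper defined here, not a base R function, and the inputs are made-up values.

```r
# Moment-based sample skewness. Values near 0 suggest symmetry;
# positive values suggest a longer right tail, negative a longer left tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skewness(c(1, 2, 3, 4, 5))   # symmetric made-up values: 0
skewness(c(1, 1, 2, 2, 10))  # long right tail: positive
```

With the full data, `sort(sapply(wine, skewness), decreasing = TRUE)` would rank the variables from most to least right-skewed. Any cutoff for calling a variable "skewed" is a judgement call, so the histograms remain the primary evidence.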

Assignment 1: Part A

Benjamin Harris

2024-01-21