Wine Project Part A Report

Data Quality

To assess data quality, we first check whether the data set contains any missing values.

missingValues <- sum(is.na(wineDataSet))
missingValues

## [1] 0

There are no missing values in the data set, which is a good sign for data quality. However, we still need to check each variable for typos and outliers so that they do not distort our analysis.
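A single total hides where missing values occur. As a minimal sketch, the per-column and per-row view could look like this; the data frame `toy` and its values are hypothetical stand-ins for `wineDataSet`.

```r
# Toy data frame standing in for wineDataSet (hypothetical values).
toy <- data.frame(
  `fixed acidity` = c(7.4, 7.8, NA, 11.2),
  pH              = c(3.51, 3.20, 3.26, NA),
  quality         = c(5, 5, 5, 6),
  check.names     = FALSE
)

# Number of missing values in each column, rather than one overall total.
missingByColumn <- colSums(is.na(toy))
missingByColumn

# Indices of rows that contain at least one missing value.
incompleteRows <- which(!complete.cases(toy))
incompleteRows
```

On a data set with no missing values, every count is zero and `incompleteRows` is empty.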

To see which variables could have outliers, let's print a summary of all variables.

summary(wineDataSet)

##  fixed acidity   volatile acidity  citric acid    residual sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free sulfur dioxide total sulfur dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

Looking at the summary above, two major issues stand out:

There is a large gap between the 3rd quartile and the maximum in the variables below; we need to check them for outliers:

Fixed Acidity
Volatile Acidity
Citric Acid
Residual Sugar
Chlorides
Free Sulfur dioxide
Total Sulfur dioxide
Sulphates
Alcohol

The variables below need a closer look to decide whether the mean, median, or trimmed mean should define the center of the data:

Residual Sugar
Chlorides
Free Sulfur dioxide
Total Sulfur dioxide
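The first issue above can be screened numerically. The sketch below copies Q1, Q3 and Max for four variables from the summary() output and expresses the gap between the 3rd quartile and the maximum in units of the interquartile range (IQR). The names `stats` and `gapInIQRs` are made up for illustration; a gap above 1.5 IQRs is the usual box-plot criterion for flagging a point.

```r
# Q1, Q3 and Max copied from the summary() output above for four variables.
stats <- data.frame(
  variable = c("fixed acidity", "residual sugar", "chlorides", "density"),
  q1  = c(7.10, 1.900, 0.07000, 0.9956),
  q3  = c(9.20, 2.600, 0.09000, 0.9978),
  max = c(15.90, 15.500, 0.61100, 1.0037)
)

# Distance from the 3rd quartile to the maximum, in units of the IQR.
stats$gapInIQRs <- (stats$max - stats$q3) / (stats$q3 - stats$q1)

# Largest gaps first.
stats[order(-stats$gapInIQRs), c("variable", "gapInIQRs")]
```

This is only a ranking aid: it shows which variables have the most extreme maximum, not how many outliers there are.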

VAR 01 :: Fixed Acidity

hist(`fixed acidity`)

Initial look at the histogram suggests it is right skewed. However, we need to consider removing any possible outliers and re-assess the distribution.

a <- mean(`fixed acidity`)-1.96*sd(`fixed acidity`)
b <- mean(`fixed acidity`)+1.96*sd(`fixed acidity`)
sum((`fixed acidity`>a & `fixed acidity`<b) == TRUE)/rows

## [1] 0.9474672

Since 94.7% of the data is within two standard deviations of the mean, close to the 95% expected for a normal distribution, the distribution could be normal.
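A caution on this check: coverage near 95% is necessary for normality but not sufficient. The sketch below wraps the calculation in a hypothetical helper, `coverageWithin`, and applies it to simulated data (not the wine data). A strongly right-skewed exponential sample also lands near 95%, so the check is best read together with the histogram.

```r
# Share of values lying strictly within k standard deviations of the mean.
coverageWithin <- function(x, k = 1.96) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  mean(x > m - k * s & x < m + k * s, na.rm = TRUE)
}

set.seed(1)                      # simulated data, not the wine data
normalSample <- rnorm(10000)     # symmetric
skewedSample <- rexp(10000)      # strongly right skewed

coverageWithin(normalSample)
coverageWithin(skewedSample)
```

Taking the mean of a logical vector gives the proportion of TRUE values, so no separate row count is needed.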

We can check for outliers by checking the quantile at right side:

quantile(`fixed acidity`, p=c(0.75,0.80,0.85,0.90,0.95,1.0))

##  75%  80%  85%  90%  95% 100% 
##  9.2  9.7 10.2 10.7 11.8 15.9

There appears to be an abrupt increase in values in the top 5% of the tail.

tail(sort(`fixed acidity`),80) # Check last 5% of values

##  [1] 11.8 11.8 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 12.0
## [16] 12.0 12.0 12.0 12.0 12.0 12.0 12.1 12.2 12.2 12.2 12.2 12.3 12.3 12.3 12.3
## [31] 12.3 12.4 12.4 12.4 12.4 12.5 12.5 12.5 12.5 12.5 12.5 12.5 12.6 12.6 12.6
## [46] 12.6 12.7 12.7 12.7 12.7 12.8 12.8 12.8 12.8 12.8 12.9 12.9 13.0 13.0 13.0
## [61] 13.2 13.2 13.2 13.3 13.3 13.3 13.4 13.5 13.7 13.7 13.8 14.0 14.3 15.0 15.0
## [76] 15.5 15.5 15.6 15.6 15.9
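The jump in the tail can be made explicit by differencing the quantiles. This sketch re-enters the quantile values printed above by hand; the name `steps` is illustrative.

```r
# Quantiles of fixed acidity copied from the quantile() output above.
q <- c(`75%` = 9.2, `80%` = 9.7, `85%` = 10.2,
       `90%` = 10.7, `95%` = 11.8, `100%` = 15.9)

# Increase between consecutive quantiles; each step spans 5 percentage points.
steps <- diff(q)
steps

# Label of the step with the largest increase.
names(steps)[which.max(steps)]
```

The steps are about 0.5 up to the 90% quantile, then 1.1, then 4.1 over the last 5% of values, which is the abrupt increase described above.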

We can now try removing the outliers

outliers <- `fixed acidity` > 11.79
newDataSet <- wineDataSet
newDataSet[outliers, 'fixed acidity'] <- NA
hist(newDataSet$'fixed acidity')

After we remove the outliers (the top 5% of values), the distribution looks close to normal.

old_mean <- mean(`fixed acidity`)
new_mean <- mean(newDataSet$`fixed acidity`, na.rm = TRUE)

The old mean was 8.3196373, whereas the new mean is 8.0812912. The difference (about 0.24) is minor, so we do not need to update our mean.
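The trimmed mean mentioned earlier is available in base R through the `trim` argument of `mean()`. The sketch below uses simulated values, not the wine data. Note that `trim = 0.05` drops 5% from each end, whereas the approach above removes only the upper tail.

```r
set.seed(42)                     # simulated data, not the wine data
# 950 roughly symmetric values plus 50 large values in the right tail.
x <- c(rnorm(950, mean = 8, sd = 1.5), runif(50, min = 12, max = 16))

plainMean   <- mean(x)                       # uses every value
trimmedMean <- mean(x, trim = 0.05)          # drops 5% at each end
oneSided    <- mean(x[x <= quantile(x, 0.95)])  # drops only the top 5%

c(plain = plainMean, trimmed = trimmedMean, oneSided = oneSided)
```

If the three values are close, the choice of center matters little; a large gap suggests the tail is pulling the plain mean.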

VAR 02 :: Volatile Acidity

hist(`volatile acidity`)

Initial look at the histogram suggests it is slightly right skewed. However, we need to consider removing any possible outliers and re-assess the distribution.

a <- mean(`volatile acidity`)-1.96*sd(`volatile acidity`)
b <- mean(`volatile acidity`)+1.96*sd(`volatile acidity`)
sum((`volatile acidity`>a & `volatile acidity`<b) == TRUE)/rows

## [1] 0.9587242

Since 95.9% of the data is within two standard deviations, a normal distribution is possible.

We can check for outliers by checking the quantile on the right side:

quantile(`volatile acidity`, p=c(0.75,0.80,0.85,0.90,0.95,0.975,1.0))

##    75%    80%    85%    90%    95%  97.5%   100% 
## 0.6400 0.6600 0.6915 0.7450 0.8400 0.9150 1.5800

There appears to be an abrupt increase in values in the top 2.5% of the tail.

We can try removing the outliers.

tail(sort(`volatile acidity`),40)  # check for last 2.5% of values

##  [1] 0.915 0.920 0.935 0.935 0.950 0.955 0.960 0.960 0.960 0.965 0.965 0.965
## [13] 0.975 0.980 0.980 0.980 1.000 1.000 1.000 1.005 1.010 1.020 1.020 1.020
## [25] 1.020 1.025 1.035 1.040 1.040 1.040 1.070 1.090 1.115 1.130 1.180 1.185
## [37] 1.240 1.330 1.330 1.580

outliers <- `volatile acidity` > 0.914
newDataSet <- wineDataSet
newDataSet[outliers, 'volatile acidity'] <- NA
hist(newDataSet$'volatile acidity')

After we remove the outliers (2.5% at the tail), the distribution looks similar to a normal distribution.

old_mean <- mean(`volatile acidity`)
new_mean <- mean(newDataSet$`volatile acidity`, na.rm = TRUE)

Old mean was 0.5278205 whereas new mean is 0.5137629.

Difference is minor so we do not need to update our mean.

VAR 03 :: Citric Acid

hist(`citric acid`)

The histogram suggests it is clearly right skewed.

boxplot(`citric acid`)

Our evaluation is further confirmed by box plot above.
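The points a box plot draws individually can also be listed with `boxplot.stats()`, which applies the same 1.5 × IQR rule that `boxplot()` uses by default. The sketch below runs on simulated right-skewed values, not the wine data.

```r
set.seed(7)                       # simulated data, not the wine data
x <- rexp(500, rate = 4)          # right-skewed values

bs <- boxplot.stats(x)            # default coef = 1.5, same rule as boxplot()
upperWhisker <- bs$stats[5]       # end of the upper whisker
flagged <- bs$out                 # points beyond the whiskers

length(flagged)                   # how many points are flagged
range(flagged)                    # smallest and largest flagged value
```

For a skewed variable the rule flags many points in the long tail, so a flagged point is a candidate for inspection rather than a confirmed error.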

a <- mean(`citric acid`)-1.96*sd(`citric acid`)
b <- mean(`citric acid`)+1.96*sd(`citric acid`)
sum((`citric acid`>a & `citric acid`<b) == TRUE)/rows

## [1] 0.9693558

97% of the data in this set is within two standard deviations, whereas a normal distribution would give about 95%.

quantile(`citric acid`, p=c(0.75,0.80,0.85,0.90,0.95,0.975,1.0))

##   75%   80%   85%   90%   95% 97.5%  100% 
## 0.420 0.460 0.490 0.522 0.600 0.660 1.000

The outliers are in the top 2.5% of the data (right tail).

tail(sort(`citric acid`),40) #last 2.5% values

##  [1] 0.66 0.66 0.66 0.66 0.66 0.67 0.67 0.68 0.68 0.68 0.68 0.68 0.68 0.68 0.68
## [16] 0.68 0.68 0.68 0.69 0.69 0.69 0.69 0.70 0.70 0.71 0.72 0.73 0.73 0.73 0.74
## [31] 0.74 0.74 0.74 0.75 0.76 0.76 0.76 0.78 0.79 1.00

mn <- mean(`citric acid`)
md <- median(`citric acid`)

Since the data is skewed, the median is the recommended measure of center; however, the two differ little in this case: the mean is 0.2709756 and the median is 0.26.
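Skewness can also be put into a number to complement the histogram. The sketch below defines a hypothetical `skewness` helper (the moment-based coefficient; base R has no built-in) and applies it to simulated samples, not the wine data. Values near 0 indicate symmetry, and positive values indicate a right skew.

```r
# Moment-based sample skewness: third central moment over variance^1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(2)                       # simulated data, not the wine data
symmetricSample <- rnorm(5000)
skewedSample    <- rexp(5000)

c(symmetric = skewness(symmetricSample), skewed = skewness(skewedSample))

# In a right-skewed sample the mean sits above the median.
c(mean = mean(skewedSample), median = median(skewedSample))
```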

VAR 04 :: Residual Sugar

hist(`residual sugar`)

This histogram clearly suggests that the distribution is right skewed.

boxplot(`residual sugar`)

The box plot above confirms this.

a <- mean(`residual sugar`)-1.96*sd(`residual sugar`)
b <- mean(`residual sugar`)+1.96*sd(`residual sugar`)
sum((`residual sugar`>a & `residual sugar`<b) == TRUE)/rows

## [1] 0.9530957

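Even though 95.3% of the data is within two standard deviations, the histogram and box plot show a clear right skew, so this check alone does not indicate a normal distribution.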

We can check for outliers by checking the quantile (right side):

quantile(`residual sugar`, p=c(0.75,0.80,0.85,0.90,0.95,1.0))

##    75%    80%    85%    90%    95%   100% 
##  2.600  2.700  2.965  3.600  5.100 15.500

There appears to be an abrupt increase in values in the top 5% of the tail.

tail(sort(`residual sugar`),80)  # check for last 5% of values

##  [1]  5.10  5.15  5.20  5.20  5.20  5.40  5.50  5.50  5.50  5.50  5.50  5.50
## [13]  5.50  5.50  5.60  5.60  5.60  5.60  5.60  5.60  5.70  5.80  5.80  5.80
## [25]  5.80  5.90  5.90  5.90  6.00  6.00  6.00  6.00  6.10  6.10  6.10  6.10
## [37]  6.20  6.20  6.20  6.30  6.30  6.40  6.40  6.40  6.55  6.55  6.60  6.60
## [49]  6.70  6.70  7.00  7.20  7.30  7.50  7.80  7.80  7.90  7.90  7.90  8.10
## [61]  8.10  8.30  8.30  8.30  8.60  8.80  8.80  8.90  9.00 10.70 11.00 11.00
## [73] 12.90 13.40 13.80 13.80 13.90 15.40 15.40 15.50

We can try removing the outliers.

outliers <- `residual sugar` > 5.1
newDataSet <- wineDataSet
newDataSet[outliers, 'residual sugar'] <- NA
hist(newDataSet$'residual sugar')

Even after removing the outliers (5% at the tail), the distribution does not clearly look normal; therefore this variable should be classified as skewed.

mn <- mean(`residual sugar`)
md <- median(`residual sugar`)

For skewed distributions, the median (2.2) is recommended over the mean (2.5388055) to define the center of the data.

VAR 05 :: Chlorides

hist(`chlorides`)

This histogram shows that the distribution is right skewed.

boxplot(chlorides)

The box plot above confirms this.

a <- mean(chlorides)-1.96*sd(chlorides)
b <- mean(chlorides)+1.96*sd(chlorides)
sum((chlorides>a & chlorides<b) == TRUE)/rows

## [1] 0.9718574

97.2% of the data is within two standard deviations, noticeably more than the 95% expected for a normal distribution; therefore the distribution is not normal.

We can check for outliers by checking the quantile on the right side:

quantile(`chlorides`, p=c(0.75,0.80,0.85,0.90,0.95,0.975,1.0))

##    75%    80%    85%    90%    95%  97.5%   100% 
## 0.0900 0.0940 0.0990 0.1090 0.1261 0.2050 0.6110

There appears to be an abrupt increase in values in the top 5% of the tail; these are our outliers.

tail(sort(`chlorides`),80)  # check for last 5% of values

##  [1] 0.127 0.128 0.132 0.132 0.132 0.132 0.136 0.137 0.143 0.145 0.146 0.147
## [13] 0.148 0.152 0.152 0.153 0.157 0.157 0.157 0.159 0.161 0.165 0.166 0.166
## [25] 0.166 0.168 0.169 0.170 0.171 0.171 0.172 0.174 0.176 0.178 0.178 0.186
## [37] 0.190 0.194 0.200 0.205 0.205 0.213 0.214 0.214 0.214 0.216 0.222 0.226
## [49] 0.226 0.230 0.235 0.236 0.241 0.243 0.250 0.263 0.267 0.270 0.332 0.337
## [61] 0.341 0.343 0.358 0.360 0.368 0.369 0.387 0.401 0.403 0.413 0.414 0.414
## [73] 0.415 0.415 0.415 0.422 0.464 0.467 0.610 0.611

mn <- mean(`chlorides`)
md <- median(`chlorides`)

For skewed distributions, the median (0.079) is recommended over the mean (0.0874665) to define the center of the data.

VAR 06 :: Free sulfur dioxide

hist(`free sulfur dioxide`)

Initial look at the histogram suggests it is right skewed. However, we need to consider removing any possible outliers and re-assess the distribution.

a <- mean(`free sulfur dioxide`)-1.96*sd(`free sulfur dioxide`)
b <- mean(`free sulfur dioxide`)+1.96*sd(`free sulfur dioxide`)
sum((`free sulfur dioxide`>a & `free sulfur dioxide`<b) == TRUE)/rows

## [1] 0.9587242

Since 95.9% of the data is within two standard deviations, a normal distribution may be possible.

We can check for outliers by checking the quantile on the right side:

quantile(`free sulfur dioxide`, p=c(0.75,0.80,0.85,0.90,0.95,0.975,1.0))

##    75%    80%    85%    90%    95%  97.5%   100% 
## 21.000 24.000 27.000 31.000 35.000 40.525 72.000

There appears to be an abrupt increase in values in the top 2.5% of the tail.

tail(sort(`free sulfur dioxide`),40)  # check for last 2.5% of values

##  [1] 41 41 41 41 41 41 41 42 42 42 43 43 43 45 45 45 46 47 48 48 48 48 50 50 51
## [26] 51 51 51 52 52 52 53 54 55 55 57 66 68 68 72

We can try removing the outliers.

outliers <- `free sulfur dioxide` > 40
newDataSet <- wineDataSet
newDataSet[outliers, 'free sulfur dioxide'] <- NA
hist(newDataSet$'free sulfur dioxide')

Even after we remove the outliers (2.5% at the tail), the distribution does not look normal. We therefore conclude that it is right skewed, which the box plot below supports.

boxplot(`free sulfur dioxide`)

mn <- mean(`free sulfur dioxide`)
md <- median(`free sulfur dioxide`)

For skewed distributions, the median (14) is recommended over the mean (15.8749218) to define the center of the data, even though the difference is not large.

VAR 07 :: Total sulfur dioxide

hist(`total sulfur dioxide`)

Initial look at the histogram suggests it is right skewed. However, we need to consider removing any possible outliers and re-assess the distribution.

a <- mean(`total sulfur dioxide`)-1.96*sd(`total sulfur dioxide`)
b <- mean(`total sulfur dioxide`)+1.96*sd(`total sulfur dioxide`)
sum((`total sulfur dioxide`>a & `total sulfur dioxide`<b) == TRUE)/rows

## [1] 0.9462164

Since 94.6% of the data is within two standard deviations, a normal distribution may be possible.

We can check for outliers by checking the quantile at tail (right side):

quantile(`total sulfur dioxide`, p=c(0.85,0.875,0.90,0.925,0.95,0.975,1.0))

##   85% 87.5%   90% 92.5%   95% 97.5%  100% 
##  82.0  88.0  93.2 102.0 112.1 131.0 289.0

There appears to be an abrupt increase in values in the top 5% of the tail.

tail(sort(`total sulfur dioxide`),80)  # check for last 5% of values

##  [1] 113 113 113 113 114 114 115 115 116 119 119 119 119 119 119 119 120 120 121
## [20] 121 121 121 122 122 122 124 124 124 125 125 126 127 127 128 128 129 129 129
## [39] 130 131 131 131 133 133 133 134 134 135 135 136 136 139 140 141 141 141 142
## [58] 143 143 144 144 144 145 145 145 147 147 147 148 148 149 151 151 152 153 155
## [77] 160 165 278 289

We can try removing the outliers.

outliers <- `total sulfur dioxide` > 112
newDataSet <- wineDataSet
newDataSet[outliers, 'total sulfur dioxide'] <- NA
hist(newDataSet$'total sulfur dioxide')

Even after we remove the outliers (5% at the tail), the distribution does not look normal. We therefore conclude that it is right skewed, which the box plot below supports.

boxplot(`total sulfur dioxide`)

mn <- mean(`total sulfur dioxide`)
md <- median(`total sulfur dioxide`)

For skewed distributions, the median (38) is recommended over the mean (46.4677924) to define the center of the data.

VAR 08 :: Density

hist(`density`)

This histogram clearly indicates a normal distribution.

a <- mean(density)-1.96*sd(density)
b <- mean(density)+1.96*sd(density)
sum((density>a & density<b) == TRUE)/rows

## [1] 0.9493433

Moreover, 94.9% of the data is within two standard deviations, which is consistent with a normal distribution.

We can check for outliers by checking the quantile (both sides):

quantile(`density`, p=c(0.75,0.80,0.85,0.90,0.95,1.0))

##      75%      80%      85%      90%      95%     100% 
## 0.997835 0.998174 0.998600 0.999140 1.000000 1.003690

quantile(`density`, p=c(0.00,0.05,0.1,0.15,0.20,0.25))

##       0%       5%      10%      15%      20%      25% 
## 0.990070 0.993598 0.994556 0.995000 0.995340 0.995600

The values at the head and the tail show no abrupt changes; therefore it is safe to say this variable has no outliers.

boxplot(density)

Even though the box plot above flags “possible” outliers at both ends, they should not affect our analysis because the distribution is close to normal.

mn <- mean(density)

The center of the data is well described by its mean, 0.9967467.
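A normal Q-Q plot and the Shapiro-Wilk test are two common complements to the histogram when judging normality. The sketch below uses simulated samples, not the wine data; `shapiro.test()` from the stats package accepts between 3 and 5000 values. With samples this large, the test can reject for small departures from normality, so the Q-Q plot is often the more useful of the two.

```r
set.seed(5)                       # simulated data, not the wine data
normalSample <- rnorm(1599)
skewedSample <- rexp(1599)

# Q-Q plot: points close to the reference line suggest normality.
qqnorm(normalSample)
qqline(normalSample)

# Shapiro-Wilk test on both samples.
normalTest <- shapiro.test(normalSample)
skewedTest <- shapiro.test(skewedSample)

normalTest$statistic              # W statistic for the normal sample
skewedTest$p.value                # p-value for the skewed sample
```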

VAR 09 :: pH

hist(`pH`)

This histogram clearly indicates a normal distribution.

a <- mean(pH)-1.96*sd(pH)
b <- mean(pH)+1.96*sd(pH)
sum((pH>a & pH<b) == TRUE)/rows

## [1] 0.9530957

Moreover, 95.3% of the data is within two standard deviations, which is consistent with a normal distribution.

We can check for outliers by checking the quantile (both sides):

quantile(`pH`, p=c(0.75,0.80,0.85,0.90,0.95,1.0))

##   75%   80%   85%   90%   95%  100% 
## 3.400 3.424 3.460 3.510 3.570 4.010

quantile(`pH`, p=c(0.00,0.015,0.02,0.05,0.1))

##   0% 1.5%   2%   5%  10% 
## 2.74 2.98 3.00 3.06 3.12

There appears to be an abrupt change in values in the top 5% (tail) and the bottom 2% (head) of the data.

head(sort(`pH`),32)  # check for first 2% of values

##  [1] 2.74 2.86 2.87 2.88 2.88 2.89 2.89 2.89 2.89 2.90 2.92 2.92 2.92 2.92 2.93
## [16] 2.93 2.93 2.94 2.94 2.94 2.94 2.95 2.98 2.98 2.98 2.98 2.98 2.99 2.99 3.00
## [31] 3.00 3.00

tail(sort(`pH`),80)  # check for last 5% of values

##  [1] 3.57 3.57 3.57 3.57 3.57 3.57 3.57 3.58 3.58 3.58 3.58 3.58 3.58 3.58 3.58
## [16] 3.58 3.58 3.59 3.59 3.59 3.59 3.59 3.59 3.59 3.59 3.60 3.60 3.60 3.60 3.60
## [31] 3.60 3.60 3.61 3.61 3.61 3.61 3.61 3.61 3.61 3.61 3.62 3.62 3.62 3.62 3.63
## [46] 3.63 3.63 3.66 3.66 3.66 3.66 3.67 3.67 3.67 3.68 3.68 3.68 3.68 3.68 3.69
## [61] 3.69 3.69 3.69 3.70 3.71 3.71 3.71 3.71 3.72 3.72 3.72 3.74 3.75 3.78 3.78
## [76] 3.85 3.90 3.90 4.01 4.01

We can try removing the outliers.

outliers1 <- `pH` > 3.56 
outliers2 <- `pH` < 3.01
newDataSet <- wineDataSet
newDataSet[outliers1, 'pH'] <- NA
newDataSet[outliers2, 'pH'] <- NA

After removing the outliers (2% at the head and 5% at the tail), we need to recalculate the center of the data.

old_mean <- mean(`pH`)
new_mean <- mean(newDataSet$`pH`, na.rm = TRUE)

To define the center of the data, we compare the new mean (3.3010263) with the old mean (3.3111132); the difference is negligible.
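The two-sided removal can be written once as a helper that derives the cut-offs from quantiles instead of hard-coding them. This is a sketch under assumptions: `maskOutside` is a hypothetical name, the data is simulated, and values exactly equal to a cut-off are kept, which differs slightly from the thresholds used above.

```r
# Replace values below the lowerP quantile or above the upperP quantile by NA.
maskOutside <- function(x, lowerP = 0.02, upperP = 0.95) {
  cutoffs <- quantile(x, c(lowerP, upperP), na.rm = TRUE, names = FALSE)
  x[x < cutoffs[1] | x > cutoffs[2]] <- NA
  x
}

set.seed(3)                       # simulated data, not the wine data
ph <- rnorm(1599, mean = 3.31, sd = 0.15)

masked <- maskOutside(ph)
sum(is.na(masked))                # number of values set to NA

c(old = mean(ph), new = mean(masked, na.rm = TRUE))
```

Because more values are removed from the top than from the bottom, the new mean is expected to sit slightly below the old one.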

VAR 10 :: Sulphates

hist(`sulphates`)

This histogram clearly suggests that the distribution is right skewed.

a <- mean(sulphates)-1.96*sd(sulphates)
b <- mean(sulphates)+1.96*sd(sulphates)
sum((sulphates>a & sulphates<b) == TRUE)/rows

## [1] 0.9631019

Since 96.3% of data is within two standard deviations, there may be a slight chance of normal distribution after removal of outliers.

We can check for outliers by checking the quantile on the right side:

quantile(`sulphates`, p=c(0.75,0.80,0.85,0.90,0.95,1.0))

##  75%  80%  85%  90%  95% 100% 
## 0.73 0.76 0.80 0.85 0.93 2.00

There appears to be an abrupt increase in values in the top 5% of the tail.

tail(sort(`sulphates`),80)  # check for last 5% of values

##  [1] 0.93 0.94 0.94 0.94 0.94 0.95 0.95 0.96 0.96 0.96 0.97 0.97 0.97 0.97 0.97
## [16] 0.97 0.98 0.98 0.99 0.99 0.99 1.00 1.01 1.02 1.02 1.02 1.03 1.03 1.04 1.04
## [31] 1.05 1.05 1.05 1.06 1.06 1.06 1.06 1.07 1.07 1.08 1.08 1.08 1.09 1.10 1.10
## [46] 1.11 1.12 1.13 1.13 1.14 1.14 1.15 1.16 1.17 1.17 1.17 1.17 1.17 1.18 1.18
## [61] 1.18 1.20 1.22 1.26 1.28 1.28 1.31 1.33 1.34 1.36 1.36 1.36 1.56 1.59 1.61
## [76] 1.62 1.95 1.95 1.98 2.00

We can try removing outliers.

outliers <- `sulphates` > 0.92
newDataSet <- wineDataSet
newDataSet[outliers, 'sulphates'] <- NA
hist(newDataSet$'sulphates')

After removing the outliers (5% at tail) our distribution looks somewhat similar to normal distribution.

old_mean <- mean(`sulphates`)
new_mean <- mean(newDataSet$`sulphates`, na.rm = TRUE)
md <- median(sulphates)

Comparing the new mean (0.6301258) with the old mean (0.6581488) shows little difference. However, because the distribution initially looked skewed and the new mean is close to the median, the median (0.62) should be used to define the center of the data.

VAR 11 :: Alcohol

hist(`alcohol`)

This histogram is clearly showing the distribution to be right skewed.

a <- mean(alcohol)-1.96*sd(alcohol)
b <- mean(alcohol)+1.96*sd(alcohol)
sum((alcohol>a & alcohol<b) == TRUE)/rows

## [1] 0.9562226

95.6% of data is within two standard deviations, suggesting a chance of normal distribution after removal of outliers.

We can check for outliers by checking the quantile on the right side:

quantile(`alcohol`, p=c(0.75,0.80,0.85,0.90,0.95,0.975,1.0))

##   75%   80%   85%   90%   95% 97.5%  100% 
##  11.1  11.3  11.6  12.0  12.5  12.8  14.9

There appears to be an abrupt increase in values in the top 2.5% of the tail.

tail(sort(`alcohol`),40)  # check for last 2.5% of values

##  [1] 12.80000 12.80000 12.90000 12.90000 12.90000 12.90000 12.90000 12.90000
##  [9] 12.90000 12.90000 12.90000 13.00000 13.00000 13.00000 13.00000 13.00000
## [17] 13.00000 13.10000 13.10000 13.20000 13.30000 13.30000 13.30000 13.40000
## [25] 13.40000 13.40000 13.50000 13.56667 13.60000 13.60000 13.60000 13.60000
## [33] 14.00000 14.00000 14.00000 14.00000 14.00000 14.00000 14.00000 14.90000

We can try removing outliers.

outliers <- `alcohol` > 12.79
newDataSet <- wineDataSet
newDataSet[outliers, 'alcohol'] <- NA
hist(newDataSet$'alcohol')

Even after removing the outliers (2.5% at the tail), the distribution does not look normal.

boxplot(alcohol)

The box plot further confirms that the distribution is skewed.

mn <- mean(alcohol)
md <- median(alcohol)

To define the center of the data, we should take the median (10.2) over the mean (10.4229831) because the distribution is skewed.

VAR 12 :: Quality

hist(`quality`)

This is our output variable. It takes discrete values, and the histogram above shows that it can be described as approximately normal.

a <- mean(quality)-1.96*sd(quality)
b <- mean(quality)+1.96*sd(quality)
sum((quality>a & quality<b) == TRUE)/rows

## [1] 0.9493433

94.9% of the data is within two standard deviations, which is consistent with a normal distribution.
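Because quality takes only a few integer values, a frequency table and bar plot can describe it more directly than a histogram. The sketch below draws a simulated `quality` vector; the probabilities are made up for illustration and are not estimated from the wine data.

```r
set.seed(11)                      # simulated scores, not the wine data
quality <- sample(3:8, size = 1599, replace = TRUE,
                  prob = c(0.01, 0.03, 0.43, 0.40, 0.12, 0.01))

counts <- table(quality)                  # count per score
shares <- round(prop.table(counts), 3)    # share per score

counts
shares

# Bar plot of the counts, one bar per discrete score.
barplot(counts, xlab = "quality", ylab = "count")
```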

Summary

Variable	Re-assessed after outlier removal	Final conclusion
Fixed Acidity	Yes	Looks Normal after removal of outliers
Volatile Acidity	Yes	Looks Normal after removal of outliers
Citric Acid	No	Right Skewed
Residual Sugar	Yes	Right Skewed
Chlorides	No	Right Skewed
Free Sulfur dioxide	Yes	Right Skewed
Total Sulfur dioxide	Yes	Right Skewed
Density	N/A	Normal
pH	Yes	Normal
Sulphates	Yes	Looks Normal after removal of outliers
Alcohol	Yes	Right Skewed
Quality	N/A	Normal (output variable)
