wineDataSet <- data.table::fread(here::here("data","winequality-red.csv"))
attach(wineDataSet)
str(wineDataSet)
## Classes 'data.table' and 'data.frame': 1599 obs. of 12 variables:
## $ fixed acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free sulfur dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total sulfur dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## - attr(*, ".internal.selfref")=<externalptr>
dim(wineDataSet)
## [1] 1599 12
rows <- dim(wineDataSet)[1]
columns <- dim(wineDataSet)[2]
The output above shows 12 variables (columns) and 1599 observations (rows), so the sample size is 1599.
To assess data quality, we first check whether the data set contains any missing values:
missingValues <- sum(is.na(wineDataSet))
missingValues
## [1] 0
There are no missing values in the data set, which is a good sign for data quality. However, we still need to check each variable for typos and outliers so that they do not distort our analysis.
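A single total of zero is enough here, but when the count is not zero it helps to know where the gaps are. The sketch below is a minimal illustration on a small made-up data frame (`toy` is hypothetical and not part of the wine data). It counts missing values per column with `colSums(is.na(...))` and lists incomplete rows with `complete.cases()`.

```r
# Toy stand-in for wineDataSet (hypothetical values, not the real data).
toy <- data.frame(
  pH      = c(3.51, 3.20, NA, 3.16),
  alcohol = c(9.4, 9.8, 9.8, NA),
  quality = c(5L, 5L, 5L, 6L)
)

# Count missing values per column rather than over the whole table.
naPerColumn <- colSums(is.na(toy))
print(naPerColumn)

# Indices of rows that contain at least one missing value.
incompleteRows <- which(!complete.cases(toy))
print(incompleteRows)
```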
To see which variables could have outliers, let's print a summary of all variables.
summary(wineDataSet)
## fixed acidity volatile acidity citric acid residual sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free sulfur dioxide total sulfur dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
Looking at the summary above, two major issues can be highlighted:
Fixed Acidity
Volatile Acidity
Citric Acid
Residual Sugar
Chlorides
Free Sulfur dioxide
Total Sulfur dioxide
Sulphates
Alcohol
Residual Sugar
Chlorides
Free Sulfur dioxide
Total Sulfur dioxide
hist(`fixed acidity`)
Initial look at the histogram suggests it is right skewed. However, we need to consider removing any possible outliers and re-assess the distribution.
a <- mean(`fixed acidity`)-1.96*sd(`fixed acidity`)
b <- mean(`fixed acidity`)+1.96*sd(`fixed acidity`)
sum((`fixed acidity`>a & `fixed acidity`<b) == TRUE)/rows
## [1] 0.9474672
Since 94.7% of the data is within 1.96 standard deviations of the mean, close to the 95% expected under a normal distribution, the distribution could be normal.
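The same coverage check is repeated for every variable in this report, so it could be wrapped in a small helper. The following is a sketch under stated assumptions, not the code used here: `coverageWithin` is a hypothetical name, and both samples are synthetic rather than taken from the wine data. It also illustrates a caveat. For an exponential distribution the theoretical coverage is about 94.8%, so a clearly right-skewed sample lands near 95% as well, and this check should be read together with the histogram rather than on its own.

```r
# Share of values lying strictly within mean +/- k * sd.
coverageWithin <- function(x, k = 1.96) {
  x <- x[!is.na(x)]
  lower <- mean(x) - k * sd(x)
  upper <- mean(x) + k * sd(x)
  mean(x > lower & x < upper)
}

set.seed(1)
normalSample <- rnorm(1599)   # synthetic, symmetric
skewedSample <- rexp(1599)    # synthetic, right skewed

# Both proportions come out close to 0.95.
print(coverageWithin(normalSample))
print(coverageWithin(skewedSample))
```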
We can check for outliers by checking the quantile at right side:
quantile(`fixed acidity`, p=c(0.75,0.80,0.85,0.90,0.95,1.0))
## 75% 80% 85% 90% 95% 100%
## 9.2 9.7 10.2 10.7 11.8 15.9
There appears to be an abrupt increase in values in the top 5% of the data.
tail(sort(`fixed acidity`),80) # Check last 5% of values
## [1] 11.8 11.8 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 12.0
## [16] 12.0 12.0 12.0 12.0 12.0 12.0 12.1 12.2 12.2 12.2 12.2 12.3 12.3 12.3 12.3
## [31] 12.3 12.4 12.4 12.4 12.4 12.5 12.5 12.5 12.5 12.5 12.5 12.5 12.6 12.6 12.6
## [46] 12.6 12.7 12.7 12.7 12.7 12.8 12.8 12.8 12.8 12.8 12.9 12.9 13.0 13.0 13.0
## [61] 13.2 13.2 13.2 13.3 13.3 13.3 13.4 13.5 13.7 13.7 13.8 14.0 14.3 15.0 15.0
## [76] 15.5 15.5 15.6 15.6 15.9
We can now try removing the outliers.
outliers <- `fixed acidity` > 11.79
newDataSet <- wineDataSet
newDataSet[outliers, 'fixed acidity'] <- NA
hist(newDataSet$'fixed acidity')
After we remove the outliers (the top 5% of values), the distribution looks approximately normal.
old_mean <- mean(`fixed acidity`)
new_mean <- mean(newDataSet$`fixed acidity`, na.rm = TRUE)
The old mean was 8.3196373, whereas the new mean is 8.0812912. The difference is minor, so we do not need to update our mean.
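The trim-and-compare step above could be expressed as a reusable function. This is a sketch only: `trimUpperTail` is a hypothetical helper, the cutoff is taken from `quantile()` instead of being typed by hand, and the vector `x` is made up so that the arithmetic is easy to follow.

```r
# Replace values at or above the given upper quantile with NA.
trimUpperTail <- function(x, prob = 0.95) {
  cutoff <- quantile(x, probs = prob, na.rm = TRUE, names = FALSE)
  x[!is.na(x) & x >= cutoff] <- NA
  x
}

# Made-up data: 95 typical values plus 5 large ones.
x <- c(rep(8, 95), 12:16)
trimmed <- trimUpperTail(x, prob = 0.95)

oldMean <- mean(x)
newMean <- mean(trimmed, na.rm = TRUE)
print(c(old = oldMean, new = newMean, removed = sum(is.na(trimmed))))
```

Because ties at the cutoff are removed together, the share of values dropped can exceed `1 - prob` on data with many repeated values.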
hist(`volatile acidity`)
Initial look at the histogram suggests it is slightly right skewed. However, we need to consider removing any possible outliers and re-assess the distribution.
a <- mean(`volatile acidity`)-1.96*sd(`volatile acidity`)
b <- mean(`volatile acidity`)+1.96*sd(`volatile acidity`)
sum((`volatile acidity`>a & `volatile acidity`<b) == TRUE)/rows
## [1] 0.9587242
Since 95.8% of the data is within 1.96 standard deviations of the mean, a normal distribution is possible.
We can check for outliers by checking the quantile on the right side:
quantile(`volatile acidity`, p=c(0.75,0.80,0.85,0.90,0.95,0.975,1.0))
## 75% 80% 85% 90% 95% 97.5% 100%
## 0.6400 0.6600 0.6915 0.7450 0.8400 0.9150 1.5800
There appears to be an abrupt increase in values in the top 2.5% of the data.
We can try removing the outliers.
tail(sort(`volatile acidity`),40) # check for last 2.5% of values
## [1] 0.915 0.920 0.935 0.935 0.950 0.955 0.960 0.960 0.960 0.965 0.965 0.965
## [13] 0.975 0.980 0.980 0.980 1.000 1.000 1.000 1.005 1.010 1.020 1.020 1.020
## [25] 1.020 1.025 1.035 1.040 1.040 1.040 1.070 1.090 1.115 1.130 1.180 1.185
## [37] 1.240 1.330 1.330 1.580
outliers <- `volatile acidity` > 0.914
newDataSet <- wineDataSet
newDataSet[outliers, 'volatile acidity'] <- NA
hist(newDataSet$'volatile acidity')
After we remove the outliers (the top 2.5% of values), the distribution looks similar to a normal distribution.
old_mean <- mean(`volatile acidity`)
new_mean <- mean(newDataSet$`volatile acidity`, na.rm = TRUE)
Old mean was 0.5278205 whereas new mean is 0.5137629.
Difference is minor so we do not need to update our mean.
hist(`citric acid`)
The histogram suggests it is clearly right skewed.
boxplot(`citric acid`)
Our evaluation is further confirmed by box plot above.
a <- mean(`citric acid`)-1.96*sd(`citric acid`)
b <- mean(`citric acid`)+1.96*sd(`citric acid`)
sum((`citric acid`>a & `citric acid`<b) == TRUE)/rows
## [1] 0.9693558
About 97% of the data is within 1.96 standard deviations of the mean, whereas a normal distribution would have about 95%.
quantile(`citric acid`, p=c(0.75,0.80,0.85,0.90,0.95,0.975,1.0))
## 75% 80% 85% 90% 95% 97.5% 100%
## 0.420 0.460 0.490 0.522 0.600 0.660 1.000
The outliers lie in the top 2.5% of the data (right tail).
tail(sort(`citric acid`),40) #last 2.5% values
## [1] 0.66 0.66 0.66 0.66 0.66 0.67 0.67 0.68 0.68 0.68 0.68 0.68 0.68 0.68 0.68
## [16] 0.68 0.68 0.68 0.69 0.69 0.69 0.69 0.70 0.70 0.71 0.72 0.73 0.73 0.73 0.74
## [31] 0.74 0.74 0.74 0.75 0.76 0.76 0.76 0.78 0.79 1.00
mn <- mean(`citric acid`)
md <- median(`citric acid`)
Since the data is skewed, the median is the recommended measure of center. In this case, however, the two barely differ: the mean is 0.2709756 and the median is 0.26.
hist(`residual sugar`)
This histogram clearly suggests that the distribution is right skewed.
boxplot(`residual sugar`)
The box plot above further confirms this.
a <- mean(`residual sugar`)-1.96*sd(`residual sugar`)
b <- mean(`residual sugar`)+1.96*sd(`residual sugar`)
sum((`residual sugar`>a & `residual sugar`<b) == TRUE)/rows
## [1] 0.9530957
Even though 95.3% of the data is within 1.96 standard deviations of the mean, the histogram and box plot above show a clear right skew.
We can check for outliers by checking the quantile (right side):
quantile(`residual sugar`, p=c(0.75,0.80,0.85,0.90,0.95,1.0))
## 75% 80% 85% 90% 95% 100%
## 2.600 2.700 2.965 3.600 5.100 15.500
There appears to be an abrupt increase in values in the top 5% of the data.
tail(sort(`residual sugar`),80) # check for last 5% of values
## [1] 5.10 5.15 5.20 5.20 5.20 5.40 5.50 5.50 5.50 5.50 5.50 5.50
## [13] 5.50 5.50 5.60 5.60 5.60 5.60 5.60 5.60 5.70 5.80 5.80 5.80
## [25] 5.80 5.90 5.90 5.90 6.00 6.00 6.00 6.00 6.10 6.10 6.10 6.10
## [37] 6.20 6.20 6.20 6.30 6.30 6.40 6.40 6.40 6.55 6.55 6.60 6.60
## [49] 6.70 6.70 7.00 7.20 7.30 7.50 7.80 7.80 7.90 7.90 7.90 8.10
## [61] 8.10 8.30 8.30 8.30 8.60 8.80 8.80 8.90 9.00 10.70 11.00 11.00
## [73] 12.90 13.40 13.80 13.80 13.90 15.40 15.40 15.50
We can try removing the outliers.
outliers <- `residual sugar` > 5.1
newDataSet <- wineDataSet
newDataSet[outliers, 'residual sugar'] <- NA
hist(newDataSet$'residual sugar')
Even after removing the outliers (5% at tail) our distribution doesn’t clearly look normal; therefore this data set should be classified as skewed.
mn <- mean(`residual sugar`)
md <- median(`residual sugar`)
For skewed distributions, it is recommended to use median 2.2 rather than mean 2.5388055 to define center of data.
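The report judges skew from histograms and from the gap between mean and median. A numeric measure could back that up. The sketch below computes the moment-based sample skewness in base R. This is a technique added here for illustration, not one used elsewhere in this report; `sampleSkewness` is a hypothetical helper and both vectors are made up.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5.
sampleSkewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

symmetric   <- c(1, 2, 3, 4, 5)       # made-up, symmetric
rightSkewed <- c(1, 1, 2, 2, 3, 15)   # made-up, long right tail

print(sampleSkewness(symmetric))      # zero for symmetric data
print(sampleSkewness(rightSkewed))    # positive for a right tail
print(c(mean = mean(rightSkewed), median = median(rightSkewed)))
```

A positive value indicates a longer right tail, which matches the pattern of the mean sitting above the median.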
hist(`chlorides`)
This histogram shows that the distribution is right skewed.
boxplot(chlorides)
The box plot above further confirms this.
a <- mean(chlorides)-1.96*sd(chlorides)
b <- mean(chlorides)+1.96*sd(chlorides)
sum((chlorides>a & chlorides<b) == TRUE)/rows
## [1] 0.9718574
97.2% of the data is within 1.96 standard deviations of the mean, noticeably more than the 95% expected under a normal distribution, so the distribution is not normal.
We can check for outliers by checking the quantile on the right side:
quantile(`chlorides`, p=c(0.75,0.80,0.85,0.90,0.95,0.975,1.0))
## 75% 80% 85% 90% 95% 97.5% 100%
## 0.0900 0.0940 0.0990 0.1090 0.1261 0.2050 0.6110
There appears to be an abrupt increase in values in the top 5% of the data; these are our outliers.
tail(sort(`chlorides`),80) # check for last 5% of values
## [1] 0.127 0.128 0.132 0.132 0.132 0.132 0.136 0.137 0.143 0.145 0.146 0.147
## [13] 0.148 0.152 0.152 0.153 0.157 0.157 0.157 0.159 0.161 0.165 0.166 0.166
## [25] 0.166 0.168 0.169 0.170 0.171 0.171 0.172 0.174 0.176 0.178 0.178 0.186
## [37] 0.190 0.194 0.200 0.205 0.205 0.213 0.214 0.214 0.214 0.216 0.222 0.226
## [49] 0.226 0.230 0.235 0.236 0.241 0.243 0.250 0.263 0.267 0.270 0.332 0.337
## [61] 0.341 0.343 0.358 0.360 0.368 0.369 0.387 0.401 0.403 0.413 0.414 0.414
## [73] 0.415 0.415 0.415 0.422 0.464 0.467 0.610 0.611
mn <- mean(`chlorides`)
md <- median(`chlorides`)
For skewed distributions, it is recommended to use median 0.079 rather than mean 0.0874665 to define center of data.
hist(`free sulfur dioxide`)
Initial look at the histogram suggests it is right skewed. However, we need to consider removing any possible outliers and re-assess the distribution.
a <- mean(`free sulfur dioxide`)-1.96*sd(`free sulfur dioxide`)
b <- mean(`free sulfur dioxide`)+1.96*sd(`free sulfur dioxide`)
sum((`free sulfur dioxide`>a & `free sulfur dioxide`<b) == TRUE)/rows
## [1] 0.9587242
Since 95.8% of the data is within 1.96 standard deviations of the mean, a normal distribution may be possible.
We can check for outliers by checking the quantile on the right side:
quantile(`free sulfur dioxide`, p=c(0.75,0.80,0.85,0.90,0.95,0.975,1.0))
## 75% 80% 85% 90% 95% 97.5% 100%
## 21.000 24.000 27.000 31.000 35.000 40.525 72.000
There appears to be an abrupt increase in values in the top 2.5% of the data.
tail(sort(`free sulfur dioxide`),40) # check for last 2.5% of values
## [1] 41 41 41 41 41 41 41 42 42 42 43 43 43 45 45 45 46 47 48 48 48 48 50 50 51
## [26] 51 51 51 52 52 52 53 54 55 55 57 66 68 68 72
We can try removing the outliers.
outliers <- `free sulfur dioxide` > 40
newDataSet <- wineDataSet
newDataSet[outliers, 'free sulfur dioxide'] <- NA
hist(newDataSet$'free sulfur dioxide')
Even after we remove the outliers (the top 2.5% of values), the distribution does not look normal. We can therefore conclude that it is right skewed, which is further supported by the box plot below.
boxplot(`free sulfur dioxide`)
mn <- mean(`free sulfur dioxide`)
md <- median(`free sulfur dioxide`)
For skewed distributions, it is recommended to use median 14 rather than mean 15.8749218 to define center of data even though there isn’t that big of a difference.
hist(`total sulfur dioxide`)
Initial look at the histogram suggests it is right skewed. However, we need to consider removing any possible outliers and re-assess the distribution.
a <- mean(`total sulfur dioxide`)-1.96*sd(`total sulfur dioxide`)
b <- mean(`total sulfur dioxide`)+1.96*sd(`total sulfur dioxide`)
sum((`total sulfur dioxide`>a & `total sulfur dioxide`<b) == TRUE)/rows
## [1] 0.9462164
Since 94.6% of the data is within 1.96 standard deviations of the mean, a normal distribution may be possible.
We can check for outliers by checking the quantile at tail (right side):
quantile(`total sulfur dioxide`, p=c(0.85,0.875,0.90,0.925,0.95,0.975,1.0))
## 85% 87.5% 90% 92.5% 95% 97.5% 100%
## 82.0 88.0 93.2 102.0 112.1 131.0 289.0
There appears to be an abrupt increase in values in the top 5% of the data.
tail(sort(`total sulfur dioxide`),80) # check for last 5% of values
## [1] 113 113 113 113 114 114 115 115 116 119 119 119 119 119 119 119 120 120 121
## [20] 121 121 121 122 122 122 124 124 124 125 125 126 127 127 128 128 129 129 129
## [39] 130 131 131 131 133 133 133 134 134 135 135 136 136 139 140 141 141 141 142
## [58] 143 143 144 144 144 145 145 145 147 147 147 148 148 149 151 151 152 153 155
## [77] 160 165 278 289
We can try removing the outliers.
outliers <- `total sulfur dioxide` > 112
newDataSet <- wineDataSet
newDataSet[outliers, 'total sulfur dioxide'] <- NA
hist(newDataSet$'total sulfur dioxide')
Even after we remove the outliers (the top 5% of values), the distribution does not look normal. We can therefore conclude that it is right skewed, which is further supported by the box plot below.
boxplot(`total sulfur dioxide`)
mn <- mean(`total sulfur dioxide`)
md <- median(`total sulfur dioxide`)
For skewed distributions, it is recommended to use median 38 rather than mean 46.4677924 to define center of data.
hist(`density`)
This histogram clearly indicates a normal distribution.
a <- mean(density)-1.96*sd(density)
b <- mean(density)+1.96*sd(density)
sum((density>a & density<b) == TRUE)/rows
## [1] 0.9493433
Moreover, 94.9% of data is within two standard deviations thereby confirming our analysis.
We can check for outliers by checking the quantile (both sides):
quantile(`density`, p=c(0.75,0.80,0.85,0.90,0.95,1.0))
## 75% 80% 85% 90% 95% 100%
## 0.997835 0.998174 0.998600 0.999140 1.000000 1.003690
quantile(`density`, p=c(0.00,0.05,0.1,0.15,0.20,0.25))
## 0% 5% 10% 15% 20% 25%
## 0.990070 0.993598 0.994556 0.995000 0.995340 0.995600
Both sets of values at head and tail do not suggest any abrupt changes; therefore it is safe to say there are no outliers for this data set.
boxplot(density)
Even though the box plot above suggests “possible” outliers at both ends, they should not affect our analysis since the distribution is strongly normal.
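The points a box plot draws as outliers are those lying more than 1.5 times the interquartile range beyond the hinges, and even perfectly normal data produces a few of them (roughly 0.7% of observations in theory). The sketch below illustrates this with `boxplot.stats()` from base R on a synthetic normal sample of the same size as our data. The sample is simulated and is not the density column.

```r
set.seed(42)
simulated <- rnorm(1599)   # synthetic normal sample, same size as the data

# Values a box plot would draw as individual points (1.5 * IQR rule).
flagged <- boxplot.stats(simulated)$out

print(length(flagged))
print(length(flagged) / length(simulated))
```

A handful of flagged points in a sample of this size is therefore expected and does not by itself contradict normality.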
mn <- mean(density)
Center of data is confidently defined by its mean 0.9967467 in this data set.
hist(`pH`)
This histogram clearly indicates a normal distribution.
a <- mean(pH)-1.96*sd(pH)
b <- mean(pH)+1.96*sd(pH)
sum((pH>a & pH<b) == TRUE)/rows
## [1] 0.9530957
Moreover, 95.3% of data is within two standard deviations thereby confirming our analysis.
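As a complement to the visual check, a formal normality test could be run. The sketch below uses `shapiro.test()` from base R's `stats` package on synthetic samples, not on the wine data. One assumption worth stating: with around 1600 observations, and with values rounded to two decimals, the test can reject normality for small departures, so the p-value is best treated as supporting evidence next to the histogram.

```r
set.seed(7)
normalSample <- rnorm(500)   # synthetic, drawn from a normal distribution
skewedSample <- rexp(500)    # synthetic, right skewed

# Shapiro-Wilk test: a small p-value is evidence against normality.
normalResult <- shapiro.test(normalSample)
skewedResult <- shapiro.test(skewedSample)

print(normalResult$p.value)
print(skewedResult$p.value)
```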
We can check for outliers by checking the quantile (both sides):
quantile(`pH`, p=c(0.75,0.80,0.85,0.90,0.95,1.0))
## 75% 80% 85% 90% 95% 100%
## 3.400 3.424 3.460 3.510 3.570 4.010
quantile(`pH`, p=c(0.00,0.015,0.02,0.05,0.1))
## 0% 1.5% 2% 5% 10%
## 2.74 2.98 3.00 3.06 3.12
There appears to be an abrupt change in values in the top 5% and the bottom 2% of the data.
head(sort(`pH`),32) # check for first 2% of values
## [1] 2.74 2.86 2.87 2.88 2.88 2.89 2.89 2.89 2.89 2.90 2.92 2.92 2.92 2.92 2.93
## [16] 2.93 2.93 2.94 2.94 2.94 2.94 2.95 2.98 2.98 2.98 2.98 2.98 2.99 2.99 3.00
## [31] 3.00 3.00
tail(sort(`pH`),80) # check for last 5% of values
## [1] 3.57 3.57 3.57 3.57 3.57 3.57 3.57 3.58 3.58 3.58 3.58 3.58 3.58 3.58 3.58
## [16] 3.58 3.58 3.59 3.59 3.59 3.59 3.59 3.59 3.59 3.59 3.60 3.60 3.60 3.60 3.60
## [31] 3.60 3.60 3.61 3.61 3.61 3.61 3.61 3.61 3.61 3.61 3.62 3.62 3.62 3.62 3.63
## [46] 3.63 3.63 3.66 3.66 3.66 3.66 3.67 3.67 3.67 3.68 3.68 3.68 3.68 3.68 3.69
## [61] 3.69 3.69 3.69 3.70 3.71 3.71 3.71 3.71 3.72 3.72 3.72 3.74 3.75 3.78 3.78
## [76] 3.85 3.90 3.90 4.01 4.01
We can try removing the outliers.
outliers1 <- `pH` > 3.56
outliers2 <- `pH` < 3.01
newDataSet <- wineDataSet
newDataSet[outliers1, 'pH'] <- NA
newDataSet[outliers2, 'pH'] <- NA
After we remove the outliers (2% at head and 5% at tail), we need to re-calculate center of data.
old_mean <- mean(`pH`)
new_mean <- mean(newDataSet$`pH`, na.rm = TRUE)
To define the center of the data, we compare the new mean (3.3010263) with the old mean (3.3111132); there is little difference between them.
hist(`sulphates`)
This histogram clearly suggests that the distribution is right skewed.
a <- mean(sulphates)-1.96*sd(sulphates)
b <- mean(sulphates)+1.96*sd(sulphates)
sum((sulphates>a & sulphates<b) == TRUE)/rows
## [1] 0.9631019
Since 96.3% of data is within two standard deviations, there may be a slight chance of normal distribution after removal of outliers.
We can check for outliers by checking the quantile on the right side:
quantile(`sulphates`, p=c(0.75,0.80,0.85,0.90,0.95,1.0))
## 75% 80% 85% 90% 95% 100%
## 0.73 0.76 0.80 0.85 0.93 2.00
There appears to be an abrupt increase in values in the top 5% of the data.
tail(sort(`sulphates`),80) # check for last 5% of values
## [1] 0.93 0.94 0.94 0.94 0.94 0.95 0.95 0.96 0.96 0.96 0.97 0.97 0.97 0.97 0.97
## [16] 0.97 0.98 0.98 0.99 0.99 0.99 1.00 1.01 1.02 1.02 1.02 1.03 1.03 1.04 1.04
## [31] 1.05 1.05 1.05 1.06 1.06 1.06 1.06 1.07 1.07 1.08 1.08 1.08 1.09 1.10 1.10
## [46] 1.11 1.12 1.13 1.13 1.14 1.14 1.15 1.16 1.17 1.17 1.17 1.17 1.17 1.18 1.18
## [61] 1.18 1.20 1.22 1.26 1.28 1.28 1.31 1.33 1.34 1.36 1.36 1.36 1.56 1.59 1.61
## [76] 1.62 1.95 1.95 1.98 2.00
We can try removing outliers.
outliers <- `sulphates` > 0.92
newDataSet <- wineDataSet
newDataSet[outliers, 'sulphates'] <- NA
hist(newDataSet$'sulphates')
After removing the outliers (5% at tail) our distribution looks somewhat similar to normal distribution.
old_mean <- mean(`sulphates`)
new_mean <- mean(newDataSet$`sulphates`, na.rm = TRUE)
md <- median(sulphates)
Comparing the new mean (0.6301258) with the old mean (0.6581488), there is little difference. In fact, we should use the median, since the new mean is close to it and the distribution initially looked skewed. The median is 0.62 and should be used to define the center of the data.
hist(`alcohol`)
This histogram is clearly showing the distribution to be right skewed.
a <- mean(alcohol)-1.96*sd(alcohol)
b <- mean(alcohol)+1.96*sd(alcohol)
sum((alcohol>a & alcohol<b) == TRUE)/rows
## [1] 0.9562226
95.6% of data is within two standard deviations, suggesting a chance of normal distribution after removal of outliers.
We can check for outliers by checking the quantile on the right side:
quantile(`alcohol`, p=c(0.75,0.80,0.85,0.90,0.95,0.975,1.0))
## 75% 80% 85% 90% 95% 97.5% 100%
## 11.1 11.3 11.6 12.0 12.5 12.8 14.9
There appears to be an abrupt increase in values in the top 2.5% of the data.
tail(sort(`alcohol`),40) # check for last 2.5% of values
## [1] 12.80000 12.80000 12.90000 12.90000 12.90000 12.90000 12.90000 12.90000
## [9] 12.90000 12.90000 12.90000 13.00000 13.00000 13.00000 13.00000 13.00000
## [17] 13.00000 13.10000 13.10000 13.20000 13.30000 13.30000 13.30000 13.40000
## [25] 13.40000 13.40000 13.50000 13.56667 13.60000 13.60000 13.60000 13.60000
## [33] 14.00000 14.00000 14.00000 14.00000 14.00000 14.00000 14.00000 14.90000
We can try removing outliers.
outliers <- `alcohol` > 12.79
newDataSet <- wineDataSet
newDataSet[outliers, 'alcohol'] <- NA
hist(newDataSet$'alcohol')
Even after removing the outliers (the top 2.5% of values), the distribution does not look normal.
boxplot(alcohol)
The box plot further confirms that the distribution is skewed.
mn <- mean(alcohol)
md <- median(alcohol)
To define the center of the data, we should use the median (10.2) rather than the mean (10.4229831), because the distribution is skewed.
hist(`quality`)
This is our output variable. It takes discrete values, and the histogram above shows that its distribution is approximately normal.
a <- mean(quality)-1.96*sd(quality)
b <- mean(quality)+1.96*sd(quality)
sum((quality>a & quality<b) == TRUE)/rows
## [1] 0.9493433
94.9% of the data is within 1.96 standard deviations of the mean, which is consistent with a normal distribution.
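Because quality is an integer score, a frequency table can be easier to read than a histogram. The sketch below is illustrative only: it uses just the first ten quality values printed by `str()` at the top of this report, so the counts are not those of the full data set, and `firstQuality` is a name introduced here.

```r
# First ten quality values as printed by str() earlier in this report.
firstQuality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Count and share of wines at each quality score.
counts <- table(firstQuality)
shares <- prop.table(counts)

print(counts)
print(round(shares, 2))
```

On the full data, `barplot(table(quality))` would draw one bar per score without the binning choices that `hist()` makes.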
The following table summarizes the distribution assessed for each variable:
| Variable | Re-assessed after outlier(s) removals | Final conclusion |
|---|---|---|
| Fixed Acidity | Yes | Looks Normal after removal of outliers |
| Volatile Acidity | Yes | Looks Normal after removal of outliers |
| Citric Acid | No | Right Skewed |
| Residual Sugar | Yes | Right Skewed |
| Chlorides | No | Right Skewed |
| Free Sulfur dioxide | Yes | Right Skewed |
| Total Sulfur dioxide | Yes | Right Skewed |
| Density | N/A | Normal |
| pH | N/A | Normal |
| Sulphates | Yes | Looks Normal after removal of outliers |
| Alcohol | Yes | Right Skewed |
| Quality | N/A | Normal (output variable) |
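The mean and median comparisons made variable by variable above could also be collected in one pass. The sketch below is a minimal illustration, assuming a data frame of numeric columns. To stay self-contained it uses only the first ten rows of three columns as printed by `str()` at the top of this report, so its numbers are not the full-data means; `firstRows` and `centerSummary` are names introduced here.

```r
# First ten values of three columns, copied from the str() output.
firstRows <- data.frame(
  `fixed acidity`  = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  `residual sugar` = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  check.names = FALSE
)

# One row per variable with its mean, median and their difference.
centerSummary <- data.frame(
  variable = names(firstRows),
  mean     = unname(vapply(firstRows, mean, numeric(1))),
  median   = unname(vapply(firstRows, median, numeric(1)))
)
centerSummary$meanMinusMedian <- centerSummary$mean - centerSummary$median

print(centerSummary)
```

A clearly positive `meanMinusMedian` points to a right tail pulling the mean upward.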