DATA DICTIONARY

The Wine-Quality data frame has 4898 rows and 12 columns. The target variable is a discrete (non-continuous) quality score.

This data frame contains the following columns:

fixed.acidity :

most acids involved with wine are fixed or nonvolatile (they do not evaporate readily)

volatile.acidity :

the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegary taste

citric.acid :

found in small quantities, citric acid can add ‘freshness’ and flavor to wines

residual.sugar :

the amount of sugar remaining after fermentation stops.

chlorides :

the amount of salt in the wine.

free.sulfur.dioxide :

the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion.

total.sulfur.dioxide :

amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

density :

the density of wine is close to that of water, varying with the percent alcohol and sugar content

pH :

describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.

sulphates :

a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

alcohol :

the percent alcohol content of the wine.

quality :

Dependent variable (based on sensory data; score between 0 and 10). 0 signifies very bad quality and 10 signifies very good quality.

Importing the dataset

dataset = read.csv('winequality-white.csv')
str(dataset)
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
names(dataset)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

Before converting the dependent variable into categories, I wanted a quick look at the correlations between all the variables. Our dependent variable is stored as an integer, so we first convert it to the numeric type, because correlation is only defined for numeric variables.

dataset$quality = as.numeric(dataset$quality)

Checking the correlations, using Pearson as the default method

cr=cor(dataset)

Plotting the correlation matrix

library(corrplot)
## corrplot 0.84 loaded
corrplot(cr,method= "number")

Finding:

Quality is most strongly correlated with the amount of alcohol. No surprise there!
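
To read the same finding off numerically rather than from the plot, here is a quick sketch using the cr matrix computed above:

# Correlation of each variable with quality, sorted from strongest to weakest
sort(cr[, "quality"], decreasing = TRUE)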

Let's check the distribution of our dependent variable by plotting a histogram.

table(dataset$quality)
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
hist(dataset$quality)

So the dependent variable is highly concentrated: the minimum value is 3, the maximum is 9, and most of the values are 5, 6, or 7.

To build a good model, we need to categorize our dependent variable properly.

dataset$quality = ifelse(dataset$quality <=5, "0", ifelse(dataset$quality <=7,"1","2"))

A quality score of 5 or less indicates low-quality wine ("0"), 6 or 7 indicates medium quality ("1"), and 8 or 9 indicates high quality ("2").
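
As an aside, the same binning could have been done with cut() on the raw numeric scores before the conversion above; here is a minimal sketch on a toy vector (raw_quality is illustrative, not a column of the dataset):

# Toy example of raw 0-10 scores; cut() reproduces the ifelse() thresholds above
raw_quality = c(3, 5, 6, 7, 8, 9)
cut(raw_quality, breaks = c(-Inf, 5, 7, Inf), labels = c("0", "1", "2"))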

Checking the distribution now

table(dataset$quality)
## 
##    0    1    2 
## 1640 3078  180

Now our dependent variable is reasonably well distributed. Let's convert it to a factor, since R currently stores the new labels as character strings.

dataset$quality = factor(dataset$quality)
class(dataset$quality)
## [1] "factor"

Checking for NA values in the whole dataset

sum(is.na(dataset))
## [1] 0

Splitting the dataset into the Training set and Test set

# install.packages('caTools')
library(caTools)
set.seed(150)
split = sample.split(dataset$quality, SplitRatio = 0.8)

Create training and testing sets

training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
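
A quick check (just a sketch): sample.split() stratifies on quality, so the class proportions in both subsets should closely match the full dataset.

# Class proportions of the full data, the training set and the test set
prop.table(table(dataset$quality))
prop.table(table(training_set$quality))
prop.table(table(test_set$quality))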

Decision Tree Classification

Fitting a decision tree classifier to the training set. We don't need feature scaling here, because decision trees don't rely on Euclidean distances to perform the classification.

library(rpart)
classifier = rpart(formula = quality ~ .,
                   data = training_set)

Predicting the test set with the decision tree classifier

y_pred = predict(classifier, newdata = test_set[-12],type='class')

cm = as.matrix(table(actual=test_set$quality,predicted=y_pred))
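
The accuracy quoted below can be read straight off this confusion matrix; here is a small helper (a sketch, not part of the original code) that also works for the later matrices cm1, cm2 and cm3:

# Overall accuracy: correct predictions sit on the diagonal of the confusion matrix
accuracy_from_cm = function(cm) sum(diag(cm)) / sum(cm)
accuracy_from_cm(cm) * 100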

So the decision tree gives an accuracy of 72.1%. We cannot plot the classification on a graph because we have 11 independent variables in our dataset. We would need to apply a dimensionality-reduction tool like PCA to bring it down to 2 dimensions for plotting.
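
As a rough illustration of that idea (a sketch, not part of the original analysis), prcomp() could project the 11 predictors onto their first two principal components and colour the points by the predicted class:

# Project the scaled predictors onto PC1/PC2 and plot the tree's predictions
pca = prcomp(test_set[-12], center = TRUE, scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2],
     col = as.integer(y_pred),   # colour by predicted class
     xlab = "PC1", ylab = "PC2",
     main = "Decision tree predictions in PCA space")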

RANDOM FOREST

Fitting Random Forest Classification to the Training set

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(123)
classifier1 = randomForest(x = training_set[-12],
                          y = training_set$quality,
                          ntree = 500)

Predicting the Test set results

y_pred1 = predict(classifier1, newdata = test_set[-12])

Making the Confusion Matrix

cm1 = table(actual=test_set[, 12],predicted= y_pred1)

So the random forest gives an accuracy of 81.6%.

Kernel SVM

We need feature scaling for Kernel SVM

training_set1 = training_set
training_set1[-12] = scale(training_set[-12])
test_set1 = test_set
test_set1[-12] = scale(test_set[-12])
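
A common refinement worth noting (a sketch, not what is done above): scale the test set with the training set's centering and scaling parameters, so that no information from the test set leaks into the preprocessing.

# Reuse the training-set means and standard deviations when scaling the test set
train_scaled = scale(training_set[-12])
test_scaled  = scale(test_set[-12],
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))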

Fitting Kernel SVM to the Training set

library(e1071)
classifier2 = svm(formula = quality ~ .,
                 data = training_set1,
                 type = 'C-classification',
                 kernel = 'radial')

Predicting the Test set results

y_pred2 = predict(classifier2, newdata = test_set1[-12])

Making the Confusion Matrix

cm2 = table(actual=test_set1[, 12], predicted= y_pred2)

So the kernel SVM gives an accuracy of 75%.

So the random forest proves to be the best model, but there are a number of hyperparameters we could tune to improve it further.

IMPROVING THE RANDOM FOREST MODEL

Applying Grid Search to find the best parameters

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
classifier3 = train(form = quality ~ ., data = training_set, method = 'rf')
(classifier3$bestTune)
##   mtry
## 1    2

So according to the grid search, the random forest model is best tuned at mtry = 2.
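
For finer control, here is a sketch of a more explicit search (assumed settings, not what was run above): 5-fold cross-validation over a small grid of mtry values.

# Cross-validated grid search over mtry (illustrative values)
ctrl = trainControl(method = "cv", number = 5)
rf_grid = expand.grid(mtry = c(2, 3, 4, 6))
rf_tuned = train(quality ~ ., data = training_set,
                 method = "rf", trControl = ctrl, tuneGrid = rf_grid)
rf_tuned$bestTune

With mtry = 2 selected, the random forest is refit below with a larger number of trees.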

set.seed(123)
classifier4 = randomForest(x = training_set[-12],
                           y = training_set$quality,
                           ntree = 1000, mtry = 2)

Predicting the Test set results

y_pred4 = predict(classifier4, newdata = test_set[-12])

Making the Confusion Matrix

cm3 = table(actual=test_set[, 12],predicted= y_pred4)
# Overall accuracy: the diagonal of cm3 (correct predictions) divided by the 980 test rows
accuracy = ((223+554+17)/980)*100

Clustering

K-Means Clustering

Let's do clustering on our dataset and see what kind of results it gives us. There are two well-known clustering tools: k-means clustering and hierarchical clustering, and they often give similar results. I am going to apply k-means clustering to our dataset.

The first thing to decide is how many clusters to divide our dataset into. The number of clusters should be optimal, and for that we use the elbow method.

We don't want the graph to be too cluttered, because we have to plot it in 2-D, so we split off a small portion of the data.

library(caTools)
set.seed(150)
split = sample.split(dataset$quality, SplitRatio = 0.95)

cluster_set = subset(dataset, split == F)

Scaling our cluster_set

K-means clustering uses Euclidean distances, so we don't want some variables to dominate the others.

cluster_set = scale(cluster_set[,-12])

Applying Elbow Method to find optimal number of clusters

The metric I am using here to estimate the optimal number of clusters is the total within-cluster sum of squares (WSS).

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(x=cluster_set,FUNcluster = kmeans,method = "wss")+
        labs(subtitle = "Elbow method")
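
Roughly what fviz_nbclust() computes under the hood, as a sketch: the total within-cluster sum of squares for k = 1 to 10.

# Total within-cluster sum of squares for each candidate number of clusters
set.seed(150)
wss = sapply(1:10, function(k)
        kmeans(cluster_set, centers = k, iter.max = 300, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")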

distance <- get_dist(cluster_set)
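
The distance object above is not used further in this analysis; one way it could be used (a sketch, assuming that was the intent) is to visualize the pairwise distance matrix with factoextra's fviz_dist():

# Heat map of pairwise distances between the scaled observations
fviz_dist(distance, show_labels = FALSE)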

We see a kink at 3 (or arguably 2) in the elbow graph, so I am going to use 3 clusters for our dataset.

Applying k-means to the dataset to develop the clusters

set.seed(12554)
final = kmeans(cluster_set,centers = 3,iter.max = 300, nstart=100)

Here final holds the result of the k-means clustering, and cluster_set is the dataset on which the clustering was done.

Visualizing the clusters on cluster_set

fviz_cluster(final, data = cluster_set)

1. As can be seen here, the 3 clusters have neatly grouped all the data points.

2. There is some overlap between the clusters as well, but that is common when data points are very close to each other.

3. Further, the points are plotted in only two dimensions (fviz_cluster projects the data onto its first two principal components), whereas we had 11 independent variables (leaving out the DV) in our dataset.

4. If the dataset had only 2 independent variables, then we could clearly see where each point lies and plan different strategies for each group.

5. This is what marketing agencies do after segmenting the market into clusters based on, say, customers' salaries and how much they spend.

6. Agencies can then run campaigns for luxury items, such as Rolex watches, targeted at their extravagant customers, and advertise everyday commodities to customers with lower spending habits.

So final here is a list with all the information about our clusters

(final)
## K-means clustering with 3 clusters of sizes 91, 87, 67
## 
## Cluster means:
##   fixed.acidity volatile.acidity citric.acid residual.sugar  chlorides
## 1     0.1306014        0.1741813   0.4206541      0.8826870  0.5376086
## 2    -0.5627541       -0.4130637  -0.2788546     -0.5527319 -0.3119964
## 3     0.5533564        0.2997917  -0.2092414     -0.4811468 -0.3250551
##   free.sulfur.dioxide total.sulfur.dioxide    density         pH   sulphates
## 1           0.7337749            0.8731176  1.0031609 -0.2324269  0.06241205
## 2          -0.1973467           -0.3397761 -0.5447638  0.7677282  0.31264097
## 3          -0.7403634           -0.7446743 -0.6551222 -0.6812166 -0.49073524
##      alcohol
## 1 -0.8683244
## 2  0.4073541
## 3  0.6504136
## 
## Clustering vector:
##   25   45   69   71   87  101  117  122  132  153  154  162  194  200  214  236 
##    2    2    2    1    1    1    2    1    1    3    2    1    2    1    2    1 
##  250  252  273  293  302  307  337  342  385  399  404  422  428  468  495  523 
##    2    1    1    1    3    1    2    3    2    1    1    1    1    3    2    3 
##  538  541  580  586  624  653  693  701  709  748  752  761  826  856  858  879 
##    1    2    2    3    3    1    2    1    1    3    1    1    1    2    1    1 
##  901  903  904  947  952  953  971 1006 1073 1168 1180 1196 1225 1273 1317 1326 
##    2    1    1    1    1    2    2    2    3    2    2    1    2    1    3    3 
## 1371 1380 1406 1410 1417 1437 1445 1452 1471 1501 1554 1555 1587 1620 1660 1663 
##    3    3    3    3    2    1    1    3    1    3    2    1    3    2    1    3 
## 1685 1694 1767 1770 1877 1894 1924 1944 1973 1976 1997 2002 2019 2032 2047 2061 
##    1    1    1    3    2    1    3    1    1    1    1    2    2    3    1    1 
## 2135 2150 2170 2214 2219 2224 2227 2267 2270 2273 2292 2295 2315 2376 2391 2410 
##    2    2    1    2    2    1    1    3    1    2    3    2    2    2    2    2 
## 2421 2425 2431 2510 2538 2543 2547 2549 2563 2567 2594 2597 2642 2648 2651 2669 
##    3    1    1    1    2    3    1    1    3    3    1    2    1    2    1    3 
## 2696 2742 2744 2757 2759 2767 2776 2789 2801 2824 2832 2834 2837 2847 2852 2861 
##    2    1    3    1    2    3    3    1    1    3    1    3    1    2    3    2 
## 2939 2945 2961 2962 2973 2977 2997 3031 3055 3058 3072 3099 3115 3133 3173 3209 
##    2    2    1    2    2    3    3    3    2    2    3    2    2    2    3    2 
## 3244 3246 3284 3285 3301 3339 3363 3385 3407 3408 3413 3447 3452 3461 3464 3512 
##    3    3    1    3    3    3    2    2    2    1    1    1    3    1    2    2 
## 3514 3517 3560 3590 3591 3611 3615 3641 3647 3683 3693 3708 3730 3781 3854 3855 
##    3    2    3    1    2    1    1    2    3    2    3    3    3    1    2    2 
## 3864 3868 3880 3913 3921 3958 3960 3972 4014 4047 4062 4078 4106 4127 4171 4212 
##    1    1    3    2    3    3    2    2    1    1    3    2    3    2    1    1 
## 4220 4245 4270 4275 4312 4313 4320 4324 4376 4382 4391 4398 4408 4427 4431 4477 
##    1    1    1    1    3    3    3    3    2    2    1    1    3    2    1    2 
## 4492 4509 4519 4531 4535 4591 4643 4645 4677 4722 4741 4745 4788 4791 4801 4812 
##    3    2    3    1    3    2    2    3    1    2    1    2    2    3    2    2 
## 4813 4815 4821 4823 4863 
##    2    2    1    2    2 
## 
## Within cluster sum of squares by cluster:
## [1] 831.7875 536.6894 486.1313
##  (between_SS / total_SS =  30.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

1. So, as can be seen above, the clusters contain 91, 87, and 67 data points respectively.

2. The cluster means (centers) can also be seen here.

3. For every data point, we can see which cluster it belongs to.

4. The within-cluster sums of squares are 831.79, 536.69, and 486.13 respectively. The within-cluster sum of squares is a measure of the variability of the observations within each cluster. In general, a cluster with a small sum of squares is more compact than a cluster with a large sum of squares.

5. It also shows all the components accessible via final$ that we can use to learn more about the clusters.

End Notes

1. Random forest gives the highest accuracy of 81.6%, whereas the kernel SVM and the decision tree give accuracies of 75% and 72.1% respectively.

2. The quality of wine is highly dependent on its alcohol content.

3. Even after tuning the random forest model, the accuracy stays more or less the same.

4. So a neural network might be better suited to this dataset for higher accuracy, which I have decided to try on another dataset.