The objective of this report is to practice data visualization with ggplot2, machine learning with caret, and cluster analysis, comparing several methods along the way. The report should help the company characterize high-quality wine, or at least identify the most important factors for improving wine quality.
A wine company sent 178 samples for characteristic testing. They want to know how the wines differ across customer segments.
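The original setup chunk is not shown; I assume the following packages are loaded (MASS first, since it otherwise masks dplyr::select):
library(MASS)        # lda()
library(tidyverse)   # ggplot2, dplyr, tidyr (gather)
library(caret)       # createDataPartition(), train(), confusionMatrix()
library(corrplot)    # corrplot()
library(ggfortify)   # autoplot() methods for prcomp, lfda, pam objects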
wine <- read.csv("C:/Users/dell/Desktop/Data/Bai tap R/Cluster/Wine.csv")
The first step is to get an overview of the data: the number of observations (rows), the number of columns (variables), and the variable types (factor vs. numeric, continuous vs. discrete, …).
str(wine)
## 'data.frame': 178 obs. of 14 variables:
## $ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
## $ Malic_Acid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
## $ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
## $ Ash_Alcanity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
## $ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
## $ Total_Phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
## $ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
## $ Nonflavanoid_Phenols: num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
## $ Proanthocyanins : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
## $ Color_Intensity : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
## $ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
## $ OD280 : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
## $ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
## $ Customer_Segment : int 1 1 1 1 1 1 1 1 1 1 ...
There are 14 variables in total, covering two data types: numeric and integer. However, the 14th variable, Customer_Segment, is a class label and should be a factor rather than an integer, so I will convert it.
wine$Customer_Segment <- as.factor(wine$Customer_Segment)
Now create some graphs for a quick view of the data.
wine %>% gather(1:13, key = "variables", value = "result") %>%
  ggplot(aes(Customer_Segment, result, fill = Customer_Segment)) +
  geom_jitter(color = "grey", alpha = 0.5) +
  geom_boxplot() +
  theme_classic() +
  facet_wrap(~variables, scales = "free")
wine %>% gather(1:13, key = "variables", value = "result") %>%
  ggplot(aes(result)) +
  geom_histogram(aes(fill = variables), color = "white") +
  theme_classic() +
  facet_wrap(~variables, scales = "free") +
  theme(legend.position = "none")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
wine %>% gather(1:13, key = "variables", value = "result") %>%
  ggplot(aes(result, fill = Customer_Segment)) +
  geom_density(alpha = 0.5) +
  theme_classic() +
  facet_wrap(~variables, scales = "free")
Next, plot the correlation matrix of the variables.
corrplot(cor(wine[,-14]),
         method = "color",
         type = "upper",
         addCoef.col = "white", number.cex = 0.7,
         tl.col = "black", tl.srt = 35, tl.cex = 0.7,
         cl.cex = 0.7, order = "hclust",
         title = "Correlation matrix of wine variables")
Flavanoids is highly correlated with several other variables. I will consider removing it later, after evaluating its impact on model accuracy.
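caret's findCorrelation() makes this concrete: it flags the variables whose pairwise correlations exceed a cutoff. A minimal sketch (the 0.75 cutoff and the name high_cor are my choices):
# Flag variables with absolute pairwise correlation above 0.75
high_cor <- findCorrelation(cor(wine[,-14]), cutoff = 0.75, names = TRUE)
high_cor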
Split the data into two sets: one for training the model and one for validating its accuracy.
set.seed(123)
index <- createDataPartition(wine$Customer_Segment, times = 1, p = 0.8, list = F)
train_set <- wine[index,]
test_set <- wine[-index,]
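createDataPartition() stratifies on the outcome by default; a quick sanity check (my addition) that the class proportions survived the split:
# Class counts in each partition
table(train_set$Customer_Segment)
table(test_set$Customer_Segment)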
Now fit a classification model with caret: a random forest (method = "rf") with 3-fold cross-validation on the training set.
fit_rf <- train(train_set[,-14],
                train_set$Customer_Segment,
                method = "rf",
                trControl = trainControl(method = "cv",
                                         number = 3))
y_hat <- predict(fit_rf,
test_set)
confusionMatrix(y_hat,
test_set$Customer_Segment)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 11 0 0
## 2 0 14 0
## 3 0 0 9
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.8972, 1)
## No Information Rate : 0.4118
## P-Value [Acc > NIR] : 7.908e-14
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.3235 0.4118 0.2647
## Detection Rate 0.3235 0.4118 0.2647
## Detection Prevalence 0.3235 0.4118 0.2647
## Balanced Accuracy 1.0000 1.0000 1.0000
The results look excellent: accuracy on the held-out test set is 100%, although with only 34 test samples the 95% confidence interval is still (0.8972, 1).
So which variables are the most important for classifying the quality of the wine?
plot(varImp(fit_rf))
The results show that Flavanoids, Color_Intensity, and Proline are the three most important variables.
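As promised earlier, here is a sketch of how the impact of dropping Flavanoids could be evaluated: refit on the training set without that column and compare test accuracy (fit_rf2 is a hypothetical name):
# Refit the random forest without the Flavanoids column
fit_rf2 <- train(train_set[, setdiff(names(train_set), c("Flavanoids", "Customer_Segment"))],
                 train_set$Customer_Segment,
                 method = "rf",
                 trControl = trainControl(method = "cv", number = 3))
confusionMatrix(predict(fit_rf2, test_set), test_set$Customer_Segment)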
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. [1]
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known, or labelled, outcomes.
AndreyBu, who has more than 5 years of machine learning experience and currently teaches people his skills, says that “the objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.” You’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of the cluster.
Every data point is allocated to one of the clusters by reducing the in-cluster sum of squares.
In other words, the K-means algorithm identifies k number of centroids, and then allocates every data point to the nearest cluster, while keeping the centroids as small as possible.
The ‘means’ in the K-means refers to averaging of the data; that is, finding the centroid.[2]
wine_kmean <- kmeans(wine[,-14], centers = 2)
wine_kmean$centers
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols
## 1 12.70285 2.544553 2.339106 20.40813 96.81301 2.062114
## 2 13.66655 1.870727 2.427818 17.45273 106.29091 2.816182
## Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## 1 1.641463 0.3926829 1.454065 4.851382
## 2 2.896545 0.2929091 1.896909 5.520364
## Hue OD280 Proline
## 1 0.9086179 2.408211 565.8699
## 2 1.0666545 3.066727 1151.7273
What are the cluster sizes?
wine_kmean$size
## [1] 123 55
The kmeans() function returns several quantities that indicate how compact each cluster is and how well-separated the clusters are:
betweenss: the between-cluster sum of squares. In a good segmentation this should be as high as possible, since we want the clusters to be well separated from each other.
withinss: a vector of within-cluster sums of squares, one component per cluster. In a good segmentation each component should be as low as possible, since we want homogeneity within each cluster.
tot.withinss: the total within-cluster sum of squares.
totss: the total sum of squares.
In the k-means model above I chose centers = 2. Let's review the performance of this model.
data.frame(betweenss = wine_kmean$betweenss,
withinss = wine_kmean$withinss,
tot.withinss = wine_kmean$tot.withinss,
totss = wine_kmean$totss)
## betweenss withinss tot.withinss totss
## 1 13048601 2566072 4543801 17592403
## 2 13048601 1977729 4543801 17592403
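A single-number compactness summary is betweenss / totss; here 13048601 / 17592403 ≈ 0.74, i.e. about 74% of the total variance lies between the two clusters:
# Proportion of total variance explained by the clustering
wine_kmean$betweenss / wine_kmean$totss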
Now, to choose the best value for centers, I will loop over a range of k values and compare the total within-cluster sum of squares, as sketched below.
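The loop chunk itself is not shown in the original; a minimal sketch of the elbow method (nstart = 20 and the name tot_within are my additions):
set.seed(123)
# Total within-cluster SS for k = 1..10
tot_within <- sapply(1:10, function(k) {
  kmeans(wine[,-14], centers = k, nstart = 20)$tot.withinss
})
plot(1:10, tot_within, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")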
centers = 3 or 4 looks like the best choice in this case, since the curve flattens there. Now build the k-means model with centers = 3.
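The fitting chunk is likewise not shown; a minimal sketch (wine_kmean3 is a hypothetical name), again excluding the Customer_Segment label:
set.seed(123)
wine_kmean3 <- kmeans(wine[,-14], centers = 3, nstart = 20)
wine_kmean3$size
# Cross-tabulate clusters against the known segments
table(wine_kmean3$cluster, wine$Customer_Segment)
Next, principal component analysis (PCA) projects the data into a lower-dimensional space.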
wine_pca <- prcomp(wine[,-14])
summary(wine_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 314.9632 13.13527 3.07215 2.23409 1.10853 0.91710
## Proportion of Variance 0.9981 0.00174 0.00009 0.00005 0.00001 0.00001
## Cumulative Proportion 0.9981 0.99983 0.99992 0.99997 0.99998 0.99999
## PC7 PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.5282 0.3891 0.3348 0.2678 0.1938 0.1452 0.09057
## Proportion of Variance 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.00000
## Cumulative Proportion 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.00000
The first PC describes 99.81% of the variance.
plot(wine_pca, col = "steelblue")
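The 99.81% figure is mostly an artifact of scale: prcomp() was run on raw values, and Proline (in the hundreds to thousands) dwarfs every other variable. Standardizing first spreads the variance across more components; a sketch, where scale. = TRUE is the only change and wine_pca_scaled is my name:
# PCA on standardized variables
wine_pca_scaled <- prcomp(wine[,-14], scale. = TRUE)
summary(wine_pca_scaled)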
head(wine_pca$x)
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] -318.56298 -21.4921307 3.1307347 0.2501138 0.6770782 -0.56808104
## [2,] -303.09742 5.3647177 6.8228355 0.8640347 -0.4860960 -0.01433987
## [3,] -438.06113 6.5373094 -1.1132230 -0.9124107 0.3806514 -0.67240375
## [4,] -733.24014 -0.1927290 -0.9172570 0.5412506 0.8586623 -0.59912165
## [5,] 11.57143 -18.4899946 -0.5544221 -1.3608961 0.2764416 -0.76888353
## [6,] -703.23119 0.3321587 0.9493753 0.3599938 0.1568271 -0.06101136
## PC7 PC8 PC9 PC10 PC11
## [1,] 0.61964183 -0.19955538 0.70128028 0.09500757 0.08873400
## [2,] -0.10886512 0.60471445 0.28671685 0.04578198 0.03977819
## [3,] -0.78581886 -0.50088570 0.02454666 0.20895977 0.23777003
## [4,] -0.01877023 0.19042835 0.05427684 -0.53168357 -0.09604379
## [5,] 0.30997581 0.11909101 -0.19584299 -0.06177064 0.31646644
## [6,] -0.02607580 0.09780938 -0.38809618 -0.10195325 -0.03064081
## PC12 PC13
## [1,] -0.038547563 0.08026443
## [2,] -0.057191577 0.01359275
## [3,] -0.048797875 -0.03540816
## [4,] -0.166353072 0.01634436
## [5,] -0.007118042 0.01527761
## [6,] -0.031614149 0.07486414
data.frame(PC1 = wine_pca$x[,1], PC2 = wine_pca$x[,2], Customer_Segment = wine[,14])%>%
ggplot(aes(PC1, PC2, color = Customer_Segment))+
geom_point(size = 3, alpha = 0.6)+
stat_ellipse()+
theme_bw()
autoplot(wine_pca, data = wine, colour = "Customer_Segment",
         frame = TRUE, frame.colour = "Customer_Segment")
Next, fit a linear discriminant analysis (LDA) on the customer segments with MASS::lda().
wlda <- lda(Customer_Segment ~., data = wine)
wlda
## Call:
## lda(Customer_Segment ~ ., data = wine)
##
## Prior probabilities of groups:
## 1 2 3
## 0.3314607 0.3988764 0.2696629
##
## Group means:
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols
## 1 13.74475 2.010678 2.455593 17.03729 106.3390 2.840169
## 2 12.27873 1.932676 2.244789 20.23803 94.5493 2.258873
## 3 13.15375 3.333750 2.437083 21.41667 99.3125 1.678750
## Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## 1 2.9823729 0.290000 1.899322 5.528305
## 2 2.0808451 0.363662 1.630282 3.086620
## 3 0.7814583 0.447500 1.153542 7.396250
## Hue OD280 Proline
## 1 1.0620339 3.157797 1115.7119
## 2 1.0562817 2.785352 519.5070
## 3 0.6827083 1.683542 629.8958
##
## Coefficients of linear discriminants:
## LD1 LD2
## Alcohol -0.403399781 0.8717930699
## Malic_Acid 0.165254596 0.3053797325
## Ash -0.369075256 2.3458497486
## Ash_Alcanity 0.154797889 -0.1463807654
## Magnesium -0.002163496 -0.0004627565
## Total_Phenols 0.618052068 -0.0322128171
## Flavanoids -1.661191235 -0.4919980543
## Nonflavanoid_Phenols -1.495818440 -1.6309537953
## Proanthocyanins 0.134092628 -0.3070875776
## Color_Intensity 0.355055710 0.2532306865
## Hue -0.818036073 -1.5156344987
## OD280 -1.157559376 0.0511839665
## Proline -0.002691206 0.0028529846
##
## Proportion of trace:
## LD1 LD2
## 0.6875 0.3125
wpredict <- predict(wlda)
head(wpredict$x)
## LD1 LD2
## 1 -4.700244 1.9791383
## 2 -4.301958 1.1704129
## 3 -3.420720 1.4291014
## 4 -4.205754 4.0028715
## 5 -1.509982 0.4512239
## 6 -4.518689 3.2131376
ldahist(data = wpredict$x[,1], g = wine$Customer_Segment)
ldahist(data = wpredict$x[,2], g = wine$Customer_Segment)
data.frame(LD1 = wpredict$x[,1], LD2 = wpredict$x[,2], Type = wine$Customer_Segment) %>%
ggplot(aes(LD1, LD2, color = Type))+
geom_point(size = 3, alpha = 0.6)+
stat_ellipse()+
theme_bw()
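To quantify the separation visible in the plot, tabulate the LDA predictions against the true segments (a sketch; this is resubstitution accuracy on the full data, so it is optimistic):
# Predicted vs. actual segments
table(Predicted = wpredict$class, Actual = wine$Customer_Segment)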
library(lfda)
## Warning: package 'lfda' was built under R version 3.5.3
Local Fisher discriminant analysis (lfda) provides another supervised projection:
m <- lfda(wine[,-14],
          wine[,14],
          r = 4,
          metric = "plain")
autoplot(m, data = wine, frame = T, frame.colour = "Customer_Segment")
Finally, hierarchical clustering on the Euclidean distance matrix gives another view of the cluster structure.
d <- dist(wine[,-14])
hc <- hclust(d)
plot(hc, labels = FALSE, hang = -1)
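Cutting the dendrogram into three groups allows a comparison with the known segments (a sketch using stats::cutree; hc_clusters is my name):
hc_clusters <- cutree(hc, k = 3)
table(hc_clusters, wine$Customer_Segment)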
library(cluster)
wpam <- pam(d, 3)
autoplot(pam(wine[,-14], 3), frame = TRUE, frame.type = "norm")
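The PAM assignments can be compared with the segments the same way; the $clustering element holds the cluster labels:
table(wpam$clustering, wine$Customer_Segment)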
[1] Wikipedia, "k-means clustering". https://en.wikipedia.org/wiki/K-means_clustering
[2] "Understanding K-means Clustering in Machine Learning", Towards Data Science. https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1