This short report addresses a segmentation problem: distinguishing wines on the basis of a chemical analysis. The wines were grown in the same region of Italy but derive from three different cultivars.
The methods used to identify groups of wines were Principal Component Analysis (PCA) and K-Means Clustering. PCA is a dimensionality reduction algorithm, which makes it particularly useful for large datasets with many variables: it decomposes the data into principal components (PCs), uncorrelated linear combinations of the original variables ordered by the amount of variance they explain. The leading components are a good basis for subsequent clustering with an algorithm such as K-Means, which groups similar observations using Euclidean distance as the measure of dissimilarity. The researcher chooses the number of groups k, and the algorithm searches for the centroids that best represent those k groups. The resulting groups can then be examined to determine which attributes their members have in common.
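The two steps described above can be sketched in base R with `prcomp()` and `kmeans()`. This is only an illustration on synthetic data, not the analysis performed in this report: the helper `make_group()`, the object `toy`, the choice of two retained components, and `nstart = 25` are all assumptions made for the example.

```r
# Minimal sketch: PCA followed by k-means on synthetic data (not the wine data).
set.seed(42)  # k-means starts from random centroids; fix the seed for reproducibility

# Hypothetical helper: n observations of 4 continuous features around one centre
make_group <- function(n, centre) {
  matrix(rnorm(n * 4, mean = centre, sd = 1), ncol = 4)
}

# Three well-separated groups of 50 observations each
toy <- rbind(make_group(50, 0), make_group(50, 5), make_group(50, 10))
colnames(toy) <- paste0("x", 1:4)

# PCA on centred, unit-variance features
pca <- prcomp(toy, center = TRUE, scale. = TRUE)
summary(pca)  # proportion of variance explained by each PC

# Keep the scores on the first two principal components as clustering input
scores <- pca$x[, 1:2]

# k-means with k = 3; nstart = 25 repeats the search from 25 random starts
km <- kmeans(scores, centers = 3, nstart = 25)

table(km$cluster)  # cluster sizes
km$centers         # centroids expressed in PC coordinates
```

Clustering on the leading components rather than on all original variables is one common design choice; running `kmeans()` directly on the scaled data is an equally valid alternative.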
The dataset contains the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars. For each record, the analysis determined the quantities of 13 constituents found in each of the three types of wine. The attributes are: 1) Alcohol 2) Malic acid 3) Ash 4) Alcalinity of ash 5) Magnesium 6) Total phenols 7) Flavanoids 8) Nonflavanoid phenols 9) Proanthocyanins 10) Color intensity 11) Hue 12) OD280/OD315 of diluted wines 13) Proline.
All attributes are continuous.
head(wine_data)
## # A tibble: 6 x 14
##       A Alcohol Malic_acid   Ash Alcalinity_of_ash Magnesium Total_phenols
##   <int>   <dbl>      <dbl> <dbl>             <dbl>     <int>         <dbl>
## 1     1   14.23       1.71  2.43              15.6       127          2.80
## 2     1   13.20       1.78  2.14              11.2       100          2.65
## 3     1   13.16       2.36  2.67              18.6       101          2.80
## 4     1   14.37       1.95  2.50              16.8       113          3.85
## 5     1   13.24       2.59  2.87              21.0       118          2.80
## 6     1   14.20       1.76  2.45              15.2       112          3.27
## # ... with 7 more variables: Flavanoids <dbl>, Nonflavanoid <dbl>,
## # Proanthocyanins <dbl>, Color_intensity <dbl>, Hue <dbl>, OD280 <dbl>,
## # Proline <int>
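The rows printed above show that the attributes are measured on very different scales (Magnesium is around 100, Ash around 2.5). Because both PCA and K-Means are sensitive to scale, the variables are usually standardized first. The sketch below illustrates this with `scale()` on the six printed rows only. The object name `wine_head` is hypothetical, the column `A` is left out on the assumption that it holds the cultivar label, and the report does not state whether this exact preprocessing was applied.

```r
# First six rows as printed by head(wine_data); only the visible feature columns.
# Column A is omitted here (assumed to be the cultivar label, not a feature).
wine_head <- data.frame(
  Alcohol           = c(14.23, 13.20, 13.16, 14.37, 13.24, 14.20),
  Malic_acid        = c(1.71, 1.78, 2.36, 1.95, 2.59, 1.76),
  Ash               = c(2.43, 2.14, 2.67, 2.50, 2.87, 2.45),
  Alcalinity_of_ash = c(15.6, 11.2, 18.6, 16.8, 21.0, 15.2),
  Magnesium         = c(127L, 100L, 101L, 113L, 118L, 112L),
  Total_phenols     = c(2.80, 2.65, 2.80, 3.85, 2.80, 3.27)
)

# Raw standard deviations differ widely between columns
sapply(wine_head, sd)

# scale() centres each column to mean 0 and rescales it to standard deviation 1
wine_scaled <- scale(wine_head)

round(colMeans(wine_scaled), 10)  # column means after scaling
apply(wine_scaled, 2, sd)         # column standard deviations after scaling
```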