1 Introduction

The main goal of this paper is to present some unsupervised learning practices - Principal Component Analysis and Principal Component Regression. Thanks to data set with information about given wines, I will be able to show PCA and PCR analysis with ‘quality’ as the dependent variable and other characteristics as independent variables.

2 Data

I will use data from Kaggle - free and open source of data sets. In this data set, we have 13 variables, 4898 observations and no NA values:
- id - wine id,
- fixed_acidity - the level of fixed acidity in the wine,
- volatile_acidity - the level of volatile acidity in the wine,
- citric_acid - the level of citric acid in the wine,
- sugar - the level of sugar in the wine,
- chlorides - the level of chloride in the wine,
- free_sulfur - the level of free sulfur dioxide in the wine,
- total_sulfur - the level of total sulfur dioxide in the wine,
- density - wine density,
- ph - wine pH,
- sulphates - the level of sulphates in the wine,
- alcohol - the level of alcohol in the wine,
- quality - wine quality (score between 0 and 10).
After short introduction about variables meaning, we can move to looking at data and its properties:

data <- read.csv("data.csv")
str(data)

## 'data.frame':    4898 obs. of  13 variables:
##  $ id              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed_acidity   : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile_acidity: num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric_acid     : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ sugar           : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides       : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free_sulfur     : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total_sulfur    : num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density         : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ ph              : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates       : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol         : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality         : int  6 6 6 6 6 6 6 6 6 6 ...

head(x = data, 5)

##   id fixed_acidity volatile_acidity citric_acid sugar chlorides free_sulfur
## 1  1           7.0             0.27        0.36  20.7     0.045          45
## 2  2           6.3             0.30        0.34   1.6     0.049          14
## 3  3           8.1             0.28        0.40   6.9     0.050          30
## 4  4           7.2             0.23        0.32   8.5     0.058          47
## 5  5           7.2             0.23        0.32   8.5     0.058          47
##   total_sulfur density   ph sulphates alcohol quality
## 1          170  1.0010 3.00      0.45     8.8       6
## 2          132  0.9940 3.30      0.49     9.5       6
## 3           97  0.9951 3.26      0.44    10.1       6
## 4          186  0.9956 3.19      0.40     9.9       6
## 5          186  0.9956 3.19      0.40     9.9       6

Basic statistics about our data (without ‘id’ column):

summary(data[, -c(1)])

##  fixed_acidity    volatile_acidity  citric_acid         sugar       
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides        free_sulfur      total_sulfur      density      
##  Min.   :0.00900   Min.   :  2.00   Min.   :  9.0   Min.   :0.9871  
##  1st Qu.:0.03600   1st Qu.: 23.00   1st Qu.:108.0   1st Qu.:0.9917  
##  Median :0.04300   Median : 34.00   Median :134.0   Median :0.9937  
##  Mean   :0.04577   Mean   : 35.31   Mean   :138.4   Mean   :0.9959  
##  3rd Qu.:0.05000   3rd Qu.: 46.00   3rd Qu.:167.0   3rd Qu.:0.9961  
##  Max.   :0.34600   Max.   :289.00   Max.   :440.0   Max.   :1.2024  
##        ph          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.180   Median :0.4700   Median :10.40   Median :6.000  
##  Mean   :3.188   Mean   :0.4898   Mean   :10.51   Mean   :5.878  
##  3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :3.820   Max.   :1.0800   Max.   :14.20   Max.   :9.000

Now we need to create another data set with our dependent variable and delete ‘id’ and ‘quality’ from ‘data’:

library(dplyr)
data <- data %>% select(-c(id))
data.y <- data %>% select(obj = quality)
data.pca <- data %>% select(obj = -c(quality))

After all we can check dimensions of our data sets:

dim(data.pca)

## [1] 4898   11

dim(data.y)

## [1] 4898    1

Moreover, we need to look at correlation plot, to analyze correlations between given variables:

library(corrplot)
corr_data <- cor(data.pca, use = "pairwise.complete.obs")
p.mat <- cor.mtest(data.pca)
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(corr_data, method="color", col=col(200), type="upper",
         addCoef.col = "black", tl.col="black", tl.srt=45,
         p.mat = p.mat$p, sig.level = 0.01, insig = "blank", diag=FALSE)

Finally, normalizing data with ‘clusterSim’ package:

library(clusterSim)
data.n <- data.Normalization(data.pca, type="n1", normalization="column")
data.y.n<-data.Normalization(data.y, type="n1", normalization="column")

3 Principal Component Analysis

3.1 About PCA

PCA is used in exploratory data analysis. It is helpful in making predictive models and commonly used for dimensionality reduction by transforming a large set of variables into a smaller one without loosing all important information. Smaller data sets are easier to explore, visualize and analysis. To be more precise, PCA finds the direction that have the most variance. This solution may be influenced by skewness or magnitude.

3.2 PCA Analysis

Now we can run PCA:

data.n.pca1 <- prcomp(data.n, center=TRUE, scale.=TRUE)
summary(data.n.pca1)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     1.6058 1.2554 1.0968 1.01123 1.00910 0.97570 0.8687
## Proportion of Variance 0.2344 0.1433 0.1094 0.09296 0.09257 0.08654 0.0686
## Cumulative Proportion  0.2344 0.3777 0.4871 0.58002 0.67259 0.75913 0.8277
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.79751 0.75871 0.62972 0.53540
## Proportion of Variance 0.05782 0.05233 0.03605 0.02606
## Cumulative Proportion  0.88556 0.93789 0.97394 1.00000

Results of PCA are satisfying. First principal component explaining 23.44% of variance. If we sum up all 11 principal components (the same number as number of variables in data set), we will achieve 100% of the explained variance. In next steps, we will be able to choose the optimal number of components.
Only “rotation” part:

data.n.pca1$rotation

##                           PC1          PC2         PC3         PC4          PC5
## fixed_acidity    -0.182987647  0.584435477 -0.11215682 -0.13231401  0.016718395
## volatile_acidity  0.007755622 -0.050949142  0.62628256 -0.29711560  0.248376330
## citric_acid      -0.194474737  0.338370284 -0.46650646  0.12772793  0.147811391
## sugar            -0.430624461 -0.012882209  0.20101197 -0.22440359 -0.268398916
## chlorides        -0.249912537 -0.001469266  0.19049689  0.36684004  0.700433940
## free_sulfur      -0.379663530 -0.306106790 -0.15176549  0.27338115 -0.327963951
## total_sulfur     -0.477563031 -0.259087715 -0.01553003  0.07814553 -0.080103426
## density          -0.269091952  0.034575573 -0.05671710 -0.66987375 -0.006007026
## ph                0.168332528 -0.573025785 -0.19455506 -0.11356800  0.111430284
## sulphates        -0.057825832 -0.220872903 -0.46991847 -0.37566830  0.452472949
## alcohol           0.452150699  0.044834463 -0.12530777 -0.09681488 -0.149580511
##                         PC6         PC7         PC8         PC9         PC10
## fixed_acidity    -0.2205020 -0.16818675 -0.39793313  0.43756948 -0.373494357
## volatile_acidity -0.5465779  0.27888434  0.06645780  0.06646678  0.147705982
## citric_acid      -0.1395431  0.62675354  0.38228648  0.03606745  0.176243996
## sugar             0.1181850 -0.03953182  0.58101291 -0.05610062 -0.543359250
## chlorides         0.2176943  0.09419390 -0.17363968 -0.26019346 -0.346405324
## free_sulfur      -0.2771425  0.06349654 -0.27775631 -0.30696666 -0.039384030
## total_sulfur     -0.3210936  0.02702248 -0.19310562  0.20009445  0.100426535
## density           0.3898126  0.28307909 -0.37839128 -0.27103003  0.159015457
## ph                0.1497856  0.34428067 -0.07244774  0.55054869 -0.329499377
## sulphates        -0.3159606 -0.46038413  0.19418496 -0.16656592 -0.002696584
## alcohol          -0.3430320  0.27441385 -0.15095730 -0.43976661 -0.496877672
##                         PC11
## fixed_acidity     0.16840803
## volatile_acidity  0.21902116
## citric_acid       0.03310863
## sugar            -0.01014158
## chlorides        -0.04429270
## free_sulfur       0.55050513
## total_sulfur     -0.70767103
## density          -0.01998802
## ph                0.14917197
## sulphates         0.06281394
## alcohol          -0.30009335

3.3 PCA Visualization

Firstly, visualization of results quality:

library(factoextra)
fviz_eig(data.n.pca1)

Now we can clearly seen the percentage of variance which each principal component explains. To explain about 83% of variance we will need 7 components.

fviz_pca_var(data.n.pca1, col.var="steelblue")

On the graph above we can observe few groups of variables - especially visible are three groups on the left side of the graph - (total_sulfur, free_sulfur), (density, chlorides, sugar) and (citric_acid, fixed_acidity). These results are in accordance with correlation plot - both for negative and positive correlation.
Next plot consists unlabeled observations in two dimensions with coloured quality of representation. The last two plots should show similar results.

library(ggfortify)
library(pca3d)
fviz_pca_ind(data.n.pca1, col.ind="cos2", geom="point", 
             gradient.cols=c("white", "#2E9FDF", "#FC4E07"))

autoplot(data.n.pca1, loadings=TRUE, loadings.colour='steelblue', 
         loadings.label=TRUE, loadings.label.size=3)

pca2d(data.n.pca1, biplot=TRUE, biplot.vars=3, col = "steelblue")

We have some outliers in our data set but generally it seems to be equally distributed.

4 Principal Component Regression

As we have just analyzed the principal components and wine characteristics, we can move to PCR to check how to predict values with Unsupervised Learning.

4.1 About PCR

PCR is a regression analysis technique which is based on PCA. We will be using PCR to estimate the unknown regression coefficients in a standard linear regression model. PCR uses not only independent variables to regress the dependent variable but also the principal components of the explanatory variables. What’s important, not only the principal components with highest variances are selected as regressors (quite common situation, but not always), these with low variances also can be important.
We have some steps of PCR:
1. Performing PCA on independent variables to get the principal components. Then, select a right subset.
2. Regression on the observed vector of outcomes on the selected subset as covariates (using linear regression) to get a vector of estimated regression coefficients.
3. Transformation of this vector back to the scale of the actual covariates thanks to the selected PCA loadings. Then we have the final PCR estimator for estimating the regression coefficients characterizing the original model.

4.2 PCR Analysis

Now we can create PCR model on raw data and normalized ones thanks to ‘pls’ package.

library(pls)
data.n$quality <- data.y.n$obj
pcr.model <- pcr(quality~., data = data, scale = TRUE, validation = "CV")
pcr.model.n <- pcr(quality~., data = data.n, scale = TRUE, validation = "CV")
summary(pcr.model)

## Data:    X dimension: 4898 11 
##  Y dimension: 4898 1
## Fit method: svdpc
## Number of components considered: 11
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV          0.8857   0.8538    0.853   0.8276   0.8208   0.8113   0.8095
## adjCV       0.8857   0.8537    0.853   0.8275   0.8272   0.8109   0.8094
##        7 comps  8 comps  9 comps  10 comps  11 comps
## CV      0.8084   0.8088   0.7860    0.7597    0.7580
## adjCV   0.8083   0.8088   0.7859    0.7596    0.7579
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         23.443   37.770    48.71    58.00    67.26    75.91    82.77    88.56
## quality    7.133    7.421    12.93    12.95    16.51    16.82    17.08    17.10
##          9 comps  10 comps  11 comps
## X          93.79     97.39    100.00
## quality    21.75     26.87     27.29

summary(pcr.model.n)

## Data:    X dimension: 4898 11 
##  Y dimension: 4898 1
## Fit method: svdpc
## Number of components considered: 11
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV               1   0.9643   0.9633   0.9343   0.9279   0.9155   0.9138
## adjCV            1   0.9643   0.9633   0.9342   0.9336   0.9154   0.9137
##        7 comps  8 comps  9 comps  10 comps  11 comps
## CV      0.9127   0.9131   0.8872    0.8579    0.8555
## adjCV   0.9126   0.9130   0.8871    0.8577    0.8554
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         23.443   37.770    48.71    58.00    67.26    75.91    82.77    88.56
## quality    7.133    7.421    12.93    12.95    16.51    16.82    17.08    17.10
##          9 comps  10 comps  11 comps
## X          93.79     97.39    100.00
## quality    21.75     26.87     27.29

As we can see, the validation error and the cumulative percentage of variance explained using 11 components are printed. In both cases we have the same results so for predictions I will use original data set.
Predictions:

test <- data[, 1:11]
pcr.pred <- predict(pcr.model, test, ncomp = 7)
pred <- predict(pcr.model, test)

4.3 PCR Visualization

Comparison of raw data (black points), PCR prediction (blue point) and OLS predictions (grey points):

plot(data.y$obj, col = "black", ylab = "Predictions vs raw data")
points(pcr.pred, col = "steelblue")
points(pred, col = "darkgrey")

Results are similar for all three situations. Predictions are pretty good - they are between two the most popular values of ‘quality’ variable and close enough to all observations. In raw data set ‘quality’ values are only integers, in predictions we have some double type values.
Thanks to ‘validationplot’ function we can plot the root mean squared error. I will try two options - the usual mean squared error and the R2 by setting the val.type argument equal to “MSEP” or “R2” respectively:

validationplot(pcr.model, val.type="MSEP")

validationplot(pcr.model, val.type = "R2")

Also I will plot the predicted vs measured values using the ‘predplot’ function as below:

predplot(pcr.model)

At the end I will plot coefficients:

coefplot(pcr.model)

5 Conclusion

Thanks to Unsupervised Learning we can not only analyze characterizes of a given product/country/etc. and draw conclusions based on these analyzes. We can also predict some values and check our predictions with raw data after all. Principal Component Analysis and Principal Component Regression can be useful not only for students and researchers but also for business.

How good is your wine - components analysis

Ewelina Plachimowicz