The main goal of this paper is to present some unsupervised learning practices - Principal Component Analysis and Principal Component Regression. Thanks to data set with information about given wines, I will be able to show PCA and PCR analysis with ‘quality’ as the dependent variable and other characteristics as independent variables.
I will use data from Kaggle - free and open source of data sets. In this data set, we have 13 variables, 4898 observations and no NA values:
- id - wine id,
- fixed_acidity - the level of fixed acidity in the wine,
- volatile_acidity - the level of volatile acidity in the wine,
- citric_acid - the level of citric acid in the wine,
- sugar - the level of sugar in the wine,
- chlorides - the level of chloride in the wine,
- free_sulfur - the level of free sulfur dioxide in the wine,
- total_sulfur - the level of total sulfur dioxide in the wine,
- density - wine density,
- ph - wine pH,
- sulphates - the level of sulphates in the wine,
- alcohol - the level of alcohol in the wine,
- quality - wine quality (score between 0 and 10).
After short introduction about variables meaning, we can move to looking at data and its properties:
data <- read.csv("data.csv")
str(data)
## 'data.frame': 4898 obs. of 13 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed_acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile_acidity: num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric_acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free_sulfur : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total_sulfur : num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ ph : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
head(x = data, 5)
## id fixed_acidity volatile_acidity citric_acid sugar chlorides free_sulfur
## 1 1 7.0 0.27 0.36 20.7 0.045 45
## 2 2 6.3 0.30 0.34 1.6 0.049 14
## 3 3 8.1 0.28 0.40 6.9 0.050 30
## 4 4 7.2 0.23 0.32 8.5 0.058 47
## 5 5 7.2 0.23 0.32 8.5 0.058 47
## total_sulfur density ph sulphates alcohol quality
## 1 170 1.0010 3.00 0.45 8.8 6
## 2 132 0.9940 3.30 0.49 9.5 6
## 3 97 0.9951 3.26 0.44 10.1 6
## 4 186 0.9956 3.19 0.40 9.9 6
## 5 186 0.9956 3.19 0.40 9.9 6
Basic statistics about our data (without ‘id’ column):
summary(data[, -c(1)])
## fixed_acidity volatile_acidity citric_acid sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free_sulfur total_sulfur density
## Min. :0.00900 Min. : 2.00 Min. : 9.0 Min. :0.9871
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0 1st Qu.:0.9917
## Median :0.04300 Median : 34.00 Median :134.0 Median :0.9937
## Mean :0.04577 Mean : 35.31 Mean :138.4 Mean :0.9959
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0 3rd Qu.:0.9961
## Max. :0.34600 Max. :289.00 Max. :440.0 Max. :1.2024
## ph sulphates alcohol quality
## Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000
## 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.180 Median :0.4700 Median :10.40 Median :6.000
## Mean :3.188 Mean :0.4898 Mean :10.51 Mean :5.878
## 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :3.820 Max. :1.0800 Max. :14.20 Max. :9.000
Now we need to create another data set with our dependent variable and delete ‘id’ and ‘quality’ from ‘data’:
library(dplyr)
data <- data %>% select(-c(id))
data.y <- data %>% select(obj = quality)
data.pca <- data %>% select(obj = -c(quality))
After all we can check dimensions of our data sets:
dim(data.pca)
## [1] 4898 11
dim(data.y)
## [1] 4898 1
Moreover, we need to look at correlation plot, to analyze correlations between given variables:
library(corrplot)
corr_data <- cor(data.pca, use = "pairwise.complete.obs")
p.mat <- cor.mtest(data.pca)
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(corr_data, method="color", col=col(200), type="upper",
addCoef.col = "black", tl.col="black", tl.srt=45,
p.mat = p.mat$p, sig.level = 0.01, insig = "blank", diag=FALSE)
Finally, normalizing data with ‘clusterSim’ package:
library(clusterSim)
data.n <- data.Normalization(data.pca, type="n1", normalization="column")
data.y.n<-data.Normalization(data.y, type="n1", normalization="column")
PCA is used in exploratory data analysis. It is helpful in making predictive models and commonly used for dimensionality reduction by transforming a large set of variables into a smaller one without loosing all important information. Smaller data sets are easier to explore, visualize and analysis. To be more precise, PCA finds the direction that have the most variance. This solution may be influenced by skewness or magnitude.
Now we can run PCA:
data.n.pca1 <- prcomp(data.n, center=TRUE, scale.=TRUE)
summary(data.n.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6058 1.2554 1.0968 1.01123 1.00910 0.97570 0.8687
## Proportion of Variance 0.2344 0.1433 0.1094 0.09296 0.09257 0.08654 0.0686
## Cumulative Proportion 0.2344 0.3777 0.4871 0.58002 0.67259 0.75913 0.8277
## PC8 PC9 PC10 PC11
## Standard deviation 0.79751 0.75871 0.62972 0.53540
## Proportion of Variance 0.05782 0.05233 0.03605 0.02606
## Cumulative Proportion 0.88556 0.93789 0.97394 1.00000
Results of PCA are satisfying. First principal component explaining 23.44% of variance. If we sum up all 11 principal components (the same number as number of variables in data set), we will achieve 100% of the explained variance. In next steps, we will be able to choose the optimal number of components.
Only “rotation” part:
data.n.pca1$rotation
## PC1 PC2 PC3 PC4 PC5
## fixed_acidity -0.182987647 0.584435477 -0.11215682 -0.13231401 0.016718395
## volatile_acidity 0.007755622 -0.050949142 0.62628256 -0.29711560 0.248376330
## citric_acid -0.194474737 0.338370284 -0.46650646 0.12772793 0.147811391
## sugar -0.430624461 -0.012882209 0.20101197 -0.22440359 -0.268398916
## chlorides -0.249912537 -0.001469266 0.19049689 0.36684004 0.700433940
## free_sulfur -0.379663530 -0.306106790 -0.15176549 0.27338115 -0.327963951
## total_sulfur -0.477563031 -0.259087715 -0.01553003 0.07814553 -0.080103426
## density -0.269091952 0.034575573 -0.05671710 -0.66987375 -0.006007026
## ph 0.168332528 -0.573025785 -0.19455506 -0.11356800 0.111430284
## sulphates -0.057825832 -0.220872903 -0.46991847 -0.37566830 0.452472949
## alcohol 0.452150699 0.044834463 -0.12530777 -0.09681488 -0.149580511
## PC6 PC7 PC8 PC9 PC10
## fixed_acidity -0.2205020 -0.16818675 -0.39793313 0.43756948 -0.373494357
## volatile_acidity -0.5465779 0.27888434 0.06645780 0.06646678 0.147705982
## citric_acid -0.1395431 0.62675354 0.38228648 0.03606745 0.176243996
## sugar 0.1181850 -0.03953182 0.58101291 -0.05610062 -0.543359250
## chlorides 0.2176943 0.09419390 -0.17363968 -0.26019346 -0.346405324
## free_sulfur -0.2771425 0.06349654 -0.27775631 -0.30696666 -0.039384030
## total_sulfur -0.3210936 0.02702248 -0.19310562 0.20009445 0.100426535
## density 0.3898126 0.28307909 -0.37839128 -0.27103003 0.159015457
## ph 0.1497856 0.34428067 -0.07244774 0.55054869 -0.329499377
## sulphates -0.3159606 -0.46038413 0.19418496 -0.16656592 -0.002696584
## alcohol -0.3430320 0.27441385 -0.15095730 -0.43976661 -0.496877672
## PC11
## fixed_acidity 0.16840803
## volatile_acidity 0.21902116
## citric_acid 0.03310863
## sugar -0.01014158
## chlorides -0.04429270
## free_sulfur 0.55050513
## total_sulfur -0.70767103
## density -0.01998802
## ph 0.14917197
## sulphates 0.06281394
## alcohol -0.30009335
Firstly, visualization of results quality:
library(factoextra)
fviz_eig(data.n.pca1)
Now we can clearly seen the percentage of variance which each principal component explains. To explain about 83% of variance we will need 7 components.
fviz_pca_var(data.n.pca1, col.var="steelblue")
On the graph above we can observe few groups of variables - especially visible are three groups on the left side of the graph - (total_sulfur, free_sulfur), (density, chlorides, sugar) and (citric_acid, fixed_acidity). These results are in accordance with correlation plot - both for negative and positive correlation.
Next plot consists unlabeled observations in two dimensions with coloured quality of representation. The last two plots should show similar results.
library(ggfortify)
library(pca3d)
fviz_pca_ind(data.n.pca1, col.ind="cos2", geom="point",
gradient.cols=c("white", "#2E9FDF", "#FC4E07"))
autoplot(data.n.pca1, loadings=TRUE, loadings.colour='steelblue',
loadings.label=TRUE, loadings.label.size=3)
pca2d(data.n.pca1, biplot=TRUE, biplot.vars=3, col = "steelblue")
We have some outliers in our data set but generally it seems to be equally distributed.
As we have just analyzed the principal components and wine characteristics, we can move to PCR to check how to predict values with Unsupervised Learning.
PCR is a regression analysis technique which is based on PCA. We will be using PCR to estimate the unknown regression coefficients in a standard linear regression model. PCR uses not only independent variables to regress the dependent variable but also the principal components of the explanatory variables. What’s important, not only the principal components with highest variances are selected as regressors (quite common situation, but not always), these with low variances also can be important.
We have some steps of PCR:
1. Performing PCA on independent variables to get the principal components. Then, select a right subset.
2. Regression on the observed vector of outcomes on the selected subset as covariates (using linear regression) to get a vector of estimated regression coefficients.
3. Transformation of this vector back to the scale of the actual covariates thanks to the selected PCA loadings. Then we have the final PCR estimator for estimating the regression coefficients characterizing the original model.
Now we can create PCR model on raw data and normalized ones thanks to ‘pls’ package.
library(pls)
data.n$quality <- data.y.n$obj
pcr.model <- pcr(quality~., data = data, scale = TRUE, validation = "CV")
pcr.model.n <- pcr(quality~., data = data.n, scale = TRUE, validation = "CV")
summary(pcr.model)
## Data: X dimension: 4898 11
## Y dimension: 4898 1
## Fit method: svdpc
## Number of components considered: 11
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 0.8857 0.8538 0.853 0.8276 0.8208 0.8113 0.8095
## adjCV 0.8857 0.8537 0.853 0.8275 0.8272 0.8109 0.8094
## 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 0.8084 0.8088 0.7860 0.7597 0.7580
## adjCV 0.8083 0.8088 0.7859 0.7596 0.7579
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 23.443 37.770 48.71 58.00 67.26 75.91 82.77 88.56
## quality 7.133 7.421 12.93 12.95 16.51 16.82 17.08 17.10
## 9 comps 10 comps 11 comps
## X 93.79 97.39 100.00
## quality 21.75 26.87 27.29
summary(pcr.model.n)
## Data: X dimension: 4898 11
## Y dimension: 4898 1
## Fit method: svdpc
## Number of components considered: 11
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 1 0.9643 0.9633 0.9343 0.9279 0.9155 0.9138
## adjCV 1 0.9643 0.9633 0.9342 0.9336 0.9154 0.9137
## 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 0.9127 0.9131 0.8872 0.8579 0.8555
## adjCV 0.9126 0.9130 0.8871 0.8577 0.8554
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 23.443 37.770 48.71 58.00 67.26 75.91 82.77 88.56
## quality 7.133 7.421 12.93 12.95 16.51 16.82 17.08 17.10
## 9 comps 10 comps 11 comps
## X 93.79 97.39 100.00
## quality 21.75 26.87 27.29
As we can see, the validation error and the cumulative percentage of variance explained using 11 components are printed. In both cases we have the same results so for predictions I will use original data set.
Predictions:
test <- data[, 1:11]
pcr.pred <- predict(pcr.model, test, ncomp = 7)
pred <- predict(pcr.model, test)
Comparison of raw data (black points), PCR prediction (blue point) and OLS predictions (grey points):
plot(data.y$obj, col = "black", ylab = "Predictions vs raw data")
points(pcr.pred, col = "steelblue")
points(pred, col = "darkgrey")
Results are similar for all three situations. Predictions are pretty good - they are between two the most popular values of ‘quality’ variable and close enough to all observations. In raw data set ‘quality’ values are only integers, in predictions we have some double type values.
Thanks to ‘validationplot’ function we can plot the root mean squared error. I will try two options - the usual mean squared error and the R2 by setting the val.type argument equal to “MSEP” or “R2” respectively:
validationplot(pcr.model, val.type="MSEP")
validationplot(pcr.model, val.type = "R2")
Also I will plot the predicted vs measured values using the ‘predplot’ function as below:
predplot(pcr.model)
At the end I will plot coefficients:
coefplot(pcr.model)
Thanks to Unsupervised Learning we can not only analyze characterizes of a given product/country/etc. and draw conclusions based on these analyzes. We can also predict some values and check our predictions with raw data after all. Principal Component Analysis and Principal Component Regression can be useful not only for students and researchers but also for business.