The aim of this article is to apply PCA (principal component analysis) for dimension reduction on a cereal dataset. PCA transforms a large set of variables into a smaller set of uncorrelated components that still contains most of the information in the original data, reducing the dimensionality of the dataset while preserving as much variability as possible.
The cereal dataset of nutritional values was loaded. Its first three columns (name, mfr, type) are of character type and the remaining thirteen are numeric. The character features were removed because they cannot be used in feature scaling.
df <- read.csv("cereal.csv", header = TRUE)
suppressPackageStartupMessages(library(factoextra))
suppressPackageStartupMessages(library(psych))
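Before subsetting, the column types can be confirmed programmatically (a quick sketch):
table(sapply(df, class))  # expect 3 character columns; the rest integer/numeric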
The structure of the dataset is as follows:
head(df)
## name mfr type calories protein fat sodium fiber carbo
## 1 100% Bran N C 70 4 1 130 10.0 5.0
## 2 100% Natural Bran Q C 120 3 5 15 2.0 8.0
## 3 All-Bran K C 70 4 1 260 9.0 7.0
## 4 All-Bran with Extra Fiber K C 50 4 0 140 14.0 8.0
## 5 Almond Delight R C 110 2 2 200 1.0 14.0
## 6 Apple Cinnamon Cheerios G C 110 2 2 180 1.5 10.5
## sugars potass vitamins shelf weight cups rating
## 1 6 280 25 3 1 0.33 68.40297
## 2 8 135 0 3 1 1.00 33.98368
## 3 5 320 25 3 1 0.33 59.42551
## 4 0 330 25 3 1 0.50 93.70491
## 5 8 -1 25 3 1 0.75 34.38484
## 6 10 70 25 1 1 0.75 29.50954
A brief summary of the dataset shows the descriptive statistics for each variable under analysis:
summary(df)
## name mfr type calories
## Length:77 Length:77 Length:77 Min. : 50.0
## Class :character Class :character Class :character 1st Qu.:100.0
## Mode :character Mode :character Mode :character Median :110.0
## Mean :106.9
## 3rd Qu.:110.0
## Max. :160.0
## protein fat sodium fiber
## Min. :1.000 Min. :0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:0.000 1st Qu.:130.0 1st Qu.: 1.000
## Median :3.000 Median :1.000 Median :180.0 Median : 2.000
## Mean :2.545 Mean :1.013 Mean :159.7 Mean : 2.152
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:210.0 3rd Qu.: 3.000
## Max. :6.000 Max. :5.000 Max. :320.0 Max. :14.000
## carbo sugars potass vitamins
## Min. :-1.0 Min. :-1.000 Min. : -1.00 Min. : 0.00
## 1st Qu.:12.0 1st Qu.: 3.000 1st Qu.: 40.00 1st Qu.: 25.00
## Median :14.0 Median : 7.000 Median : 90.00 Median : 25.00
## Mean :14.6 Mean : 6.922 Mean : 96.08 Mean : 28.25
## 3rd Qu.:17.0 3rd Qu.:11.000 3rd Qu.:120.00 3rd Qu.: 25.00
## Max. :23.0 Max. :15.000 Max. :330.00 Max. :100.00
## shelf weight cups rating
## Min. :1.000 Min. :0.50 Min. :0.250 Min. :18.04
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:0.670 1st Qu.:33.17
## Median :2.000 Median :1.00 Median :0.750 Median :40.40
## Mean :2.208 Mean :1.03 Mean :0.821 Mean :42.67
## 3rd Qu.:3.000 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:50.83
## Max. :3.000 Max. :1.50 Max. :1.500 Max. :93.70
The analysis uses the ten numeric features in columns 4 through 13 (calories through shelf) for dimension reduction. anyNA() confirms there are no NA values, although the minimums of -1 for carbo, sugars, and potass in the summary above suggest sentinel codes for unknown entries rather than true measurements.
pca_df <- df[, 4:13]  # the ten numeric features, calories through shelf
anyNA(pca_df)
## [1] FALSE
final <- princomp(pca_df, cor = TRUE)  # PCA on the correlation matrix (variables standardized)
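Running princomp with cor = TRUE is equivalent to standardizing each variable before the PCA. As a sketch, base R's prcomp on scaled data should reproduce the same component standard deviations (component signs may differ):
pca_alt <- prcomp(pca_df, center = TRUE, scale. = TRUE)  # PCA on standardized data
round(pca_alt$sdev, 4)  # should match final$sdev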
These are the elements that can be extracted from the fitted object final:
names(final)
## [1] "sdev" "loadings" "center" "scale" "n.obs" "scores" "call"
Next, we identify how many components are needed to obtain a reduced dataset:
summary(final)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.6659980 1.4722311 1.2769335 1.0168383 0.94680921
## Proportion of Variance 0.2775549 0.2167465 0.1630559 0.1033960 0.08964477
## Cumulative Proportion 0.2775549 0.4943014 0.6573573 0.7607533 0.85039806
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.73226471 0.7277486 0.54878618 0.270322110 0.236536446
## Proportion of Variance 0.05362116 0.0529618 0.03011663 0.007307404 0.005594949
## Cumulative Proportion 0.90401922 0.9569810 0.98709765 0.994405051 1.000000000
eigenvectors <- final$loadings  # note: these are scaled so each column's sum of squares = 1
eigenvalues <- final$sdev * final$sdev  # component variances
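Because the PCA is correlation-based, the eigenvalues sum to the number of input variables, and the Kaiser criterion keeps components whose eigenvalue exceeds 1. A quick check:
sum(eigenvalues)       # equals 10, the number of input variables
sum(eigenvalues > 1)   # components retained under the Kaiser criterion (4 here)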
Correlating the original variables with the component scores returns a matrix that shows how strongly each variable is associated with each component: the closer a correlation is to 1 in absolute value, the more relevant that component is to the variable.
round(cor(pca_df[, 1:8], final$scores), 3)  # first eight variables vs. all component scores
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## calories 0.263 0.843 0.004 0.318 0.000 0.067 0.298 0.003 0.139 0.078
## protein -0.661 0.023 -0.316 0.528 0.073 0.305 -0.013 -0.288 -0.038 -0.046
## fat -0.257 0.641 0.342 0.422 0.230 -0.162 -0.298 0.232 -0.049 -0.055
## sodium 0.279 0.381 -0.555 0.058 -0.507 -0.308 -0.302 -0.151 0.004 0.009
## fiber -0.894 -0.064 -0.215 -0.132 -0.243 -0.084 0.115 0.147 0.135 -0.116
## carbo 0.563 -0.039 -0.661 0.193 0.169 -0.157 0.354 0.123 -0.088 -0.079
## sugars 0.122 0.689 0.474 -0.312 -0.322 0.111 0.200 -0.120 -0.093 -0.096
## potass -0.896 0.169 -0.178 -0.048 -0.223 -0.059 0.179 0.137 -0.124 0.126
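For a correlation-based PCA, these values can also be recovered from the fitted model itself: the correlation between a variable and a component equals the loading multiplied by the component's standard deviation. A small verification sketch:
# Loadings scaled by component standard deviations reproduce the table above
round(sweep(unclass(final$loadings), 2, final$sdev, "*")[1:8, ], 3)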
A scree plot helps establish how many components to keep using the elbow method; the guideline is to retain components whose variance (eigenvalue) is greater than one.
screeplot(final, type = 'l', main = "Screeplot for Cereal")
abline(h = 1, col = "blue", lty = 2)  # reference line at eigenvalue = 1
The scree plot shows that the optimal number of components is 4: the first four components should be chosen because their eigenvalues are greater than 1.
fviz_eig(final, addlabels = TRUE, barfill = "#41729F",barcolor = "#274472",linecolor = "darkred")
The scree plot graphically represents the percentage of variance explained by each component. Comp.1 alone explains about 27.8% of the variation, and four components are needed to explain about 76% of the variance.
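These percentages can be reproduced directly from the eigenvalues:
prop_var <- eigenvalues / sum(eigenvalues)
round(prop_var, 3)          # proportion of variance per component
round(cumsum(prop_var), 3)  # cumulative proportion, about 0.761 at Comp.4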
# Scatter plot of the scores on the first two principal components
plot(final$scores[, 1:2], type = 'n', xlab = 'C1', ylab = 'C2')
points(final$scores[, 1:2], cex = 0.5)
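A biplot overlays the variable loadings on this score plot, which helps relate the components back to the original nutrients. With factoextra already loaded, one option is the following sketch:
fviz_pca_biplot(final, repel = TRUE, col.var = "darkred")  # observations plus variable arrows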
# rotate = "none" keeps unrotated components; "varimax" is a common alternative
principal(pca_df, nfactors=4,rotate="none")
## Principal Components Analysis
## Call: principal(r = pca_df, nfactors = 4, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 PC3 PC4 h2 u2 com
## calories -0.26 0.84 0.00 0.32 0.88 0.12 1.5
## protein 0.66 0.02 0.32 0.53 0.81 0.19 2.4
## fat 0.26 0.64 -0.34 0.42 0.77 0.23 2.7
## sodium -0.28 0.38 0.55 0.06 0.53 0.47 2.3
## fiber 0.89 -0.06 0.21 -0.13 0.87 0.13 1.2
## carbo -0.56 -0.04 0.66 0.19 0.79 0.21 2.1
## sugars -0.12 0.69 -0.47 -0.31 0.81 0.19 2.3
## potass 0.90 0.17 0.18 -0.05 0.86 0.14 1.2
## vitamins -0.12 0.47 0.58 -0.39 0.72 0.28 2.8
## shelf 0.42 0.41 0.16 -0.41 0.54 0.46 3.3
##
## PC1 PC2 PC3 PC4
## SS loadings 2.78 2.17 1.63 1.03
## Proportion Var 0.28 0.22 0.16 0.10
## Cumulative Var 0.28 0.49 0.66 0.76
## Proportion Explained 0.36 0.28 0.21 0.14
## Cumulative Proportion 0.36 0.65 0.86 1.00
##
## Mean item complexity = 2.2
## Test of the hypothesis that 4 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.09
## with the empirical chi square 58.13 with prob < 2.1e-08
##
## Fit based upon off diagonal values = 0.9
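As noted in the comment before the principal() call, a varimax-rotated solution can be obtained by changing the rotate argument. Rotation redistributes the explained variance across the four retained components to make the loadings easier to interpret (a sketch, not run here):
principal(pca_df, nfactors = 4, rotate = "varimax")  # varimax-rotated 4-component solution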
Dimension reduction refers to the process of reducing the number of features in a dataset while preserving as much information as possible. From the tabulation above, the four retained components explain about 76% of the total variance, meaning roughly 24% of the variance is lost in the reduction.
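To finish the reduction, the retained component scores can be kept as the new feature set (a sketch; the name reduced_df is illustrative):
reduced_df <- as.data.frame(final$scores[, 1:4])  # 77 rows, 4 components instead of 10 features
dim(reduced_df)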