The aim of this article is to apply PCA (principal component analysis) for dimension reduction on a cereal dataset. PCA transforms a large set of variables into a smaller set of uncorrelated components that still contains most of the information in the original data, reducing the dimensionality of the dataset while preserving as much variability as possible.
The cereal dataset of nutritional values was loaded. Its first three columns (name, mfr, type) are of character type and the remaining thirteen are numeric. The character features were removed because they cannot be used in feature scaling.
df <- read.csv("cereal.csv", header = TRUE)
suppressPackageStartupMessages(library(factoextra))
suppressPackageStartupMessages(library(psych))
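Before subsetting, the column types can be confirmed programmatically (a quick sketch):
table(sapply(df, class))  # expect 3 character columns; the rest integer/numeric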
The structure of the dataset is as follows:
head(df)
## name mfr type calories protein fat sodium fiber carbo
## 1 100% Bran N C 70 4 1 130 10.0 5.0
## 2 100% Natural Bran Q C 120 3 5 15 2.0 8.0
## 3 All-Bran K C 70 4 1 260 9.0 7.0
## 4 All-Bran with Extra Fiber K C 50 4 0 140 14.0 8.0
## 5 Almond Delight R C 110 2 2 200 1.0 14.0
## 6 Apple Cinnamon Cheerios G C 110 2 2 180 1.5 10.5
## sugars potass vitamins shelf weight cups rating
## 1 6 280 25 3 1 0.33 68.40297
## 2 8 135 0 3 1 1.00 33.98368
## 3 5 320 25 3 1 0.33 59.42551
## 4 0 330 25 3 1 0.50 93.70491
## 5 8 -1 25 3 1 0.75 34.38484
## 6 10 70 25 1 1 0.75 29.50954
A brief summary of the dataset shows the descriptive statistics for each variable under analysis:
summary(df)
## name mfr type calories
## Length:77 Length:77 Length:77 Min. : 50.0
## Class :character Class :character Class :character 1st Qu.:100.0
## Mode :character Mode :character Mode :character Median :110.0
## Mean :106.9
## 3rd Qu.:110.0
## Max. :160.0
## protein fat sodium fiber
## Min. :1.000 Min. :0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:0.000 1st Qu.:130.0 1st Qu.: 1.000
## Median :3.000 Median :1.000 Median :180.0 Median : 2.000
## Mean :2.545 Mean :1.013 Mean :159.7 Mean : 2.152
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:210.0 3rd Qu.: 3.000
## Max. :6.000 Max. :5.000 Max. :320.0 Max. :14.000
## carbo sugars potass vitamins
## Min. :-1.0 Min. :-1.000 Min. : -1.00 Min. : 0.00
## 1st Qu.:12.0 1st Qu.: 3.000 1st Qu.: 40.00 1st Qu.: 25.00
## Median :14.0 Median : 7.000 Median : 90.00 Median : 25.00
## Mean :14.6 Mean : 6.922 Mean : 96.08 Mean : 28.25
## 3rd Qu.:17.0 3rd Qu.:11.000 3rd Qu.:120.00 3rd Qu.: 25.00
## Max. :23.0 Max. :15.000 Max. :330.00 Max. :100.00
## shelf weight cups rating
## Min. :1.000 Min. :0.50 Min. :0.250 Min. :18.04
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:0.670 1st Qu.:33.17
## Median :2.000 Median :1.00 Median :0.750 Median :40.40
## Mean :2.208 Mean :1.03 Mean :0.821 Mean :42.67
## 3rd Qu.:3.000 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:50.83
## Max. :3.000 Max. :1.50 Max. :1.500 Max. :93.70
The analysis uses the ten numeric features in columns 4 through 13 (calories through shelf) for dimension reduction. anyNA() confirms there are no NA values, although the minimums of -1 for carbo, sugars, and potass in the summary above suggest sentinel codes for unknown entries rather than true measurements.
pca_df <- df[, 4:13]  # the ten numeric features, calories through shelf
anyNA(pca_df)
## [1] FALSE
final <- princomp(pca_df, cor = TRUE)  # PCA on the correlation matrix (variables standardized)
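Running princomp with cor = TRUE is equivalent to standardizing each variable before the PCA. As a sketch, base R's prcomp on scaled data should reproduce the same component standard deviations (component signs may differ):
pca_alt <- prcomp(pca_df, center = TRUE, scale. = TRUE)  # PCA on standardized data
round(pca_alt$sdev, 4)  # should match final$sdev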
These are the elements that can be extracted from the fitted object final:
names(final)
## [1] "sdev" "loadings" "center" "scale" "n.obs" "scores" "call"
Next, we identify how many components are needed to obtain a reduced dataset:
summary(final)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.6659980 1.4722311 1.2769335 1.0168383 0.94680921
## Proportion of Variance 0.2775549 0.2167465 0.1630559 0.1033960 0.08964477
## Cumulative Proportion 0.2775549 0.4943014 0.6573573 0.7607533 0.85039806
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.73226471 0.7277486 0.54878618 0.270322110 0.236536446
## Proportion of Variance 0.05362116 0.0529618 0.03011663 0.007307404 0.005594949
## Cumulative Proportion 0.90401922 0.9569810 0.98709765 0.994405051 1.000000000
eigenvectors <- final$loadings  # note: these are scaled so each column's sum of squares = 1
eigenvalues <- final$sdev * final$sdev  # component variances
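Because the PCA is correlation-based, the eigenvalues sum to the number of input variables, and the Kaiser criterion keeps components whose eigenvalue exceeds 1. A quick check:
sum(eigenvalues)       # equals 10, the number of input variables
sum(eigenvalues > 1)   # components retained under the Kaiser criterion (4 here)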
Correlating the original variables with the component scores returns a matrix that shows how strongly each variable is associated with each component: the closer a correlation is to 1 in absolute value, the more relevant that component is to the variable.
round(cor(pca_df[, 1:8], final$scores), 3)  # first eight variables vs. all component scores
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## calories 0.263 0.843 0.004 0.318 0.000 0.067 0.298 0.003 0.139 0.078
## protein -0.661 0.023 -0.316 0.528 0.073 0.305 -0.013 -0.288 -0.038 -0.046
## fat -0.257 0.641 0.342 0.422 0.230 -0.162 -0.298 0.232 -0.049 -0.055
## sodium 0.279 0.381 -0.555 0.058 -0.507 -0.308 -0.302 -0.151 0.004 0.009
## fiber -0.894 -0.064 -0.215 -0.132 -0.243 -0.084 0.115 0.147 0.135 -0.116
## carbo 0.563 -0.039 -0.661 0.193 0.169 -0.157 0.354 0.123 -0.088 -0.079
## sugars 0.122 0.689 0.474 -0.312 -0.322 0.111 0.200 -0.120 -0.093 -0.096
## potass -0.896 0.169 -0.178 -0.048 -0.223 -0.059 0.179 0.137 -0.124 0.126
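For a correlation-based PCA, these values can also be recovered from the fitted model itself: the correlation between a variable and a component equals the loading multiplied by the component's standard deviation. A small verification sketch:
# Loadings scaled by component standard deviations reproduce the table above
round(sweep(unclass(final$loadings), 2, final$sdev, "*")[1:8, ], 3)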
A scree plot helps establish how many components to keep using the elbow method; the guideline is to retain components whose variance (eigenvalue) is greater than one.
screeplot(final, type = 'l', main = "Screeplot for Cereal")
abline(h = 1, col = "blue", lty = 2)  # reference line at eigenvalue = 1
The scree plot shows that the optimal number of components is 4: the first four components should be chosen because their eigenvalues are greater than 1.
fviz_eig(final, addlabels = TRUE, barfill = "#41729F",barcolor = "#274472",linecolor = "darkred")
The scree plot graphically represents the percentage of variance explained by each component. Comp.1 alone explains about 27.8% of the variation, and four components are needed to explain about 76% of the variance.
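These percentages can be reproduced directly from the eigenvalues:
prop_var <- eigenvalues / sum(eigenvalues)
round(prop_var, 3)          # proportion of variance per component
round(cumsum(prop_var), 3)  # cumulative proportion, about 0.761 at Comp.4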
# Scatter plot of the scores on the first two principal components
plot(final$scores[, 1:2], type = 'n', xlab = 'C1', ylab = 'C2')
points(final$scores[, 1:2], cex = 0.5)
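A biplot overlays the variable loadings on this score plot, which helps relate the components back to the original nutrients. With factoextra already loaded, one option is the following sketch:
fviz_pca_biplot(final, repel = TRUE, col.var = "darkred")  # observations plus variable arrows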
# rotate = "none" keeps unrotated components; "varimax" is a common alternative
principal(pca_df, nfactors=4,rotate="none")
## Principal Components Analysis
## Call: principal(r = pca_df, nfactors = 4, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 PC3 PC4 h2 u2 com
## calories -0.26 0.84 0.00 0.32 0.88 0.12 1.5
## protein 0.66 0.02 0.32 0.53 0.81 0.19 2.4
## fat 0.26 0.64 -0.34 0.42 0.77 0.23 2.7
## sodium -0.28 0.38 0.55 0.06 0.53 0.47 2.3
## fiber 0.89 -0.06 0.21 -0.13 0.87 0.13 1.2
## carbo -0.56 -0.04 0.66 0.19 0.79 0.21 2.1
## sugars -0.12 0.69 -0.47 -0.31 0.81 0.19 2.3
## potass 0.90 0.17 0.18 -0.05 0.86 0.14 1.2
## vitamins -0.12 0.47 0.58 -0.39 0.72 0.28 2.8
## shelf 0.42 0.41 0.16 -0.41 0.54 0.46 3.3
##
## PC1 PC2 PC3 PC4
## SS loadings 2.78 2.17 1.63 1.03
## Proportion Var 0.28 0.22 0.16 0.10
## Cumulative Var 0.28 0.49 0.66 0.76
## Proportion Explained 0.36 0.28 0.21 0.14
## Cumulative Proportion 0.36 0.65 0.86 1.00
##
## Mean item complexity = 2.2
## Test of the hypothesis that 4 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.09
## with the empirical chi square 58.13 with prob < 2.1e-08
##
## Fit based upon off diagonal values = 0.9
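As noted in the comment before the principal() call, a varimax-rotated solution can be obtained by changing the rotate argument. Rotation redistributes the explained variance across the four retained components to make the loadings easier to interpret (a sketch, not run here):
principal(pca_df, nfactors = 4, rotate = "varimax")  # varimax-rotated 4-component solution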
Dimension reduction refers to the process of reducing the number of features in a dataset while preserving as much information as possible. From the tabulation above, the four retained components explain about 76% of the total variance, meaning roughly 24% of the variance is lost in the reduction.
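To finish the reduction, the retained component scores can be kept as the new feature set (a sketch; the name reduced_df is illustrative):
reduced_df <- as.data.frame(final$scores[, 1:4])  # 77 rows, 4 components instead of 10 features
dim(reduced_df)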