The aim of this project is to apply the PCA (Principal Component Analysis) algorithm for dimension reduction to the McDonald’s menu dataset.
The problem of high-dimensional data arises when the number of dimensions in a dataset (each numeric variable is a dimension) is large compared to the number of observations. The goal of dimension reduction is to decrease the size of the dataset while preserving as much information as possible (source: wikipedia.org).
The dataset used in this project contains nutritional values of 260 items from the McDonald’s menu. The whole dataset can be found on Kaggle (https://www.kaggle.com/mcdonalds/nutrition-facts). Each item is described by 8 variables: saturated fat, trans fat, cholesterol, sodium, carbohydrates, dietary fiber, sugars and protein.
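The code chunks below assume the required packages and the data frame df are already loaded. A minimal setup sketch, assuming the Kaggle CSV has been saved locally as menu.csv (the file name and the column names are assumptions, not part of the original report):
# Packages inferred from the function calls used throughout this report
library(caret)      # preProcess(), predict() for normalization
library(Hmisc)      # hist.data.frame()
library(GGally)     # ggpairs()
library(corrplot)   # corrplot()
library(factoextra) # fviz_eig(), fviz_pca_ind(), fviz_pca_var(), fviz_contrib()
library(gridExtra)  # grid.arrange()
# Load the menu data and keep the eight numeric variables analysed here
# (file name and column selection are assumptions)
menu <- read.csv("menu.csv")
df <- menu[, c("Saturated.Fat", "Trans.Fat", "Cholesterol", "Sodium",
               "Carbohydrates", "Dietary.Fiber", "Sugars", "Protein")]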
summary(df)
## Saturated.Fat Trans.Fat Cholesterol Sodium
## Min. : 0.000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 1.000 1st Qu.:0.0000 1st Qu.: 5.00 1st Qu.: 107.5
## Median : 5.000 Median :0.0000 Median : 35.00 Median : 190.0
## Mean : 6.008 Mean :0.2038 Mean : 54.94 Mean : 495.8
## 3rd Qu.:10.000 3rd Qu.:0.0000 3rd Qu.: 65.00 3rd Qu.: 865.0
## Max. :20.000 Max. :2.5000 Max. :575.00 Max. :3600.0
## Carbohydrates Dietary.Fiber Sugars Protein
## Min. : 0.00 Min. :0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 30.00 1st Qu.:0.000 1st Qu.: 5.75 1st Qu.: 4.00
## Median : 44.00 Median :1.000 Median : 17.50 Median :12.00
## Mean : 47.35 Mean :1.631 Mean : 29.42 Mean :13.34
## 3rd Qu.: 60.00 3rd Qu.:3.000 3rd Qu.: 48.00 3rd Qu.:19.00
## Max. :141.00 Max. :7.000 Max. :128.00 Max. :87.00
hist.data.frame(df)
dim(df)
## [1] 260 8
Before further analysis, the data was normalized (centered to mean 0 and scaled to unit variance).
preproc <- preProcess(df, method=c("center", "scale"))
df_norm <- predict(preproc, df)
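A quick sanity check of the transformation (a sketch; rounding is only for readability):
round(colMeans(df_norm), 10) # every column mean is 0 after centering
apply(df_norm, 2, sd)        # every standard deviation is 1 after scaling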
ggpairs(df_norm)
cor_matrix <- cor(df_norm, method = "pearson") # avoid shadowing base cor()
corrplot(cor_matrix)
The correlation plot shows that several variables are positively correlated with saturated fat. The strong correlations between sugars and carbohydrates, and between protein and sodium, are also easy to spot.
pca <- prcomp(df_norm, center = FALSE, scale. = FALSE) # the data is already centered and scaled
The PCA projection was calculated with the prcomp function, which is based on the singular value decomposition (SVD) of the data matrix.
pca$rotation
## PC1 PC2 PC3 PC4 PC5
## Saturated.Fat 0.43015581 -0.20009279 0.24851949 -0.16854817 0.02562347
## Trans.Fat 0.26972412 -0.35490773 0.59813586 0.46222382 -0.35609428
## Cholesterol 0.38495069 0.09019691 0.17808516 -0.77036509 -0.22396712
## Sodium 0.43534374 0.29004074 -0.04388044 0.07805226 0.38904047
## Carbohydrates 0.28112075 -0.49009859 -0.43369106 -0.03968195 0.12560935
## Dietary.Fiber 0.34879197 0.26371707 -0.52107052 0.23873634 -0.67274191
## Sugars -0.01137713 -0.64401117 -0.29283307 -0.08508928 0.05224139
## Protein 0.45131975 0.13134138 -0.03844621 0.30426344 0.44714426
## PC6 PC7 PC8
## Saturated.Fat 0.754695383 -0.33459351 -0.0508323501
## Trans.Fat -0.228437310 0.22112893 0.0445791009
## Cholesterol -0.409813854 0.01980159 -0.0001305312
## Sodium 0.120147811 0.53703482 0.5139456870
## Carbohydrates -0.005284034 0.43172477 -0.5375127853
## Dietary.Fiber 0.072357358 -0.11878811 0.0914333823
## Sugars -0.160991017 -0.25891590 0.6296380445
## Protein -0.405816175 -0.53180480 -0.1938155433
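Because prcomp is SVD-based, the rotation matrix above can be reproduced directly from the singular value decomposition of the standardized data (a sketch; the signs of individual columns may flip, which is an inherent ambiguity of the SVD):
sv <- svd(as.matrix(df_norm))
head(sv$v) # right singular vectors: equal to pca$rotation up to column signs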
Three methods are most commonly used to select the number of components:
The Kaiser rule focuses on the components’ eigenvalues. An eigenvalue indicates how good a component is as a summary of the data: an eigenvalue equal to 1 means that the component contains as much information as a single (standardized) variable. This rule suggests keeping only the components with eigenvalues greater than 1.
# For standardized data the covariance matrix equals the correlation matrix,
# so its eigenvalues are exactly the PCA eigenvalues
df_norm.cov <- cov(df_norm)
df_norm.eigen <- eigen(df_norm.cov)
df_norm.eigen$values
## [1] 3.84428921 2.19472808 0.74712560 0.56469152 0.31383510 0.22552415 0.08849565
## [8] 0.02131069
The eigenvalues displayed above show that only the first two exceed 1, so two components should be chosen.
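The same rule can be applied programmatically (a one-line sketch using the eigenvalues computed above):
sum(df_norm.eigen$values > 1) # number of components with eigenvalue > 1; returns 2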
The second approach relies on the scree plot, which visualizes the eigenvalues of the components in descending order. The scree-plot approach suggests that the appropriate number of components is the number of bars preceding the bend (the “elbow”) of the line connecting the eigenvalues.
fviz_eig(pca, choice='eigenvalue')
This approach, like the Kaiser rule, indicates that the right number of components is 2.
The last approach selects the smallest number of components that together explain over 2/3 of the variance.
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9607 1.4815 0.86436 0.75146 0.56021 0.47489 0.29748
## Proportion of Variance 0.4805 0.2743 0.09339 0.07059 0.03923 0.02819 0.01106
## Cumulative Proportion 0.4805 0.7549 0.84827 0.91885 0.95808 0.98627 0.99734
## PC8
## Standard deviation 0.14598
## Proportion of Variance 0.00266
## Cumulative Proportion 1.00000
fviz_eig(pca)
The cumulative proportion of explained variance shows that the first two components already explain over 3/4 of the variance, which exceeds the 2/3 threshold, so two components are enough by this criterion as well. (Four components would explain over 90% of the variance, i.e. that proportion of information can be preserved after reducing the number of dimensions by half.) All three methods therefore point to the same answer: two components.
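Reducing the dataset to the chosen two components is then a one-step projection (a sketch using the scores that prcomp stores in pca$x):
df_reduced <- as.data.frame(pca$x[, 1:2]) # observations projected onto PC1 and PC2
dim(df_reduced)                           # 260 observations, 2 dimensions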
The “cloud of points” graph below shows the quality of representation (cos2) of the individual observations.
fviz_pca_ind(pca, col.ind="cos2", geom = "point", gradient.cols = c("green", "yellow", "red" ))
fviz_pca_var(pca, col.var = "red")
The plot displayed above shows the relations between the variables as well as the quality of their representation. Positively correlated variables lie close to each other, whereas negatively correlated ones lie on opposite sides of the plot. The quality of a variable is indicated by its distance from the center; the best-represented variables are protein and sodium. From this graph alone, it is hard to clearly distinguish the components.
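The same quality measure can be computed from the loadings (a sketch: for standardized data, a variable’s coordinates are its loadings scaled by the component standard deviations, and cos2 is the sum of their squares over the retained components):
coord <- pca$rotation %*% diag(pca$sdev) # variable coordinates on each component
round(rowSums(coord[, 1:2]^2), 2)        # cos2 on the first two components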
The percentage contribution of each variable to the first two components is shown on the plots below.
PC1 <- fviz_contrib(pca, choice = "var", axes = 1)
PC2 <- fviz_contrib(pca, choice = "var", axes = 2)
grid.arrange(PC1, PC2)
The first plot shows that the first component is built mainly from protein, sodium, saturated fat and cholesterol; the second one is dominated by sugars and carbohydrates.
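These contributions can also be obtained directly from the rotation matrix (a sketch: since each rotation column has unit norm, a variable’s percentage contribution to a component is simply its squared loading times 100):
contrib <- pca$rotation^2 * 100 # percentage contributions of variables
round(contrib[, 1:2], 1)        # contributions to PC1 and PC2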
Dimension reduction refers to the process of reducing the number of dimensions in a dataset while preserving as much information as possible. The analysis conducted here shows that over 90% of the variance can be explained by half of the dimensions, and that just 2 components out of the 8 original variables retain over 3/4 of the information included in the dataset. Dimension reduction techniques are therefore very powerful for the analysis and storage of huge datasets.