Danica Drmanac 19620251
HOMEWORK ASSIGNMENT
Pizza The sample I have chosen is a sample about measurements that capture the kind of things that make a pizza tasty. Each observation represents a single pizza sample from a particular brand, with measurements of various nutrient compositions such as moisture, protein, fat, ash, sodium, carbohydrates, and calories. This sample consists of 300 observations. Below are definitions of all 9 variables: brand: Pizza brand (class label) id: Identification variable mois: Amount of water per 100 grams in the sample prot: Amount of protein per 100 grams in the sample fat: Amount of fat per 100 grams in the sample ash: Amount of ash per 100 grams in the sample sodium: Amount of sodium per 100 grams in the sample carb: Amount of carbohydrates per 100 grams in the sample cal: Amount of calories per 100 grams in the sample
Research question Can we identify patterns and similarities in the nutrient compositions of pizzas from different brands using Principal Component Analysis? PCA is expected to help in reducing the dimensionality of the dataset by transforming the original variables (nutrient compositions of pizzas) into a smaller set of uncorrelated variables called principal components. This simplification makes it easier to analyze and interpret the data.
Importing data set found online at https://data.world/sdhilip/pizza-datasets
mydata <- read.table("./Pizza.csv", header = TRUE, sep = ",", dec = ".")
I have created new object that contains only numerical variables, since only numerical variables can be used in the analysis. This object will be further used for the analysis.
mydata_PCA <- mydata[, c("mois", "prot", "fat", "ash", "sodium", "carb", "cal")]
Showing first 10 rows of data set.
head(mydata_PCA, 10)
## mois prot fat ash sodium carb cal
## 1 27.82 21.43 44.87 5.11 1.77 0.77 4.93
## 2 28.49 21.26 43.89 5.34 1.79 1.02 4.84
## 3 28.35 19.99 45.78 5.08 1.63 0.80 4.95
## 4 30.55 20.15 43.13 4.79 1.61 1.38 4.74
## 5 30.49 21.28 41.65 4.82 1.64 1.76 4.67
## 6 31.14 20.23 42.31 4.92 1.65 1.40 4.67
## 7 31.21 20.97 41.34 4.71 1.58 1.77 4.63
## 8 28.76 21.41 41.60 5.28 1.75 2.95 4.72
## 9 28.22 20.48 45.10 5.02 1.71 1.18 4.93
## 10 27.72 21.19 45.29 5.16 1.66 0.64 4.95
I showed descriptive data of the dataset. For example, median of moisture is telling us that half of pizzas have less than 43.3ml of water per 100g in it, while others have more.
library(psych)
## Warning: package 'psych' was built under R version 4.3.2
describe(mydata_PCA)
## vars n mean sd median trimmed mad min max range skew
## mois 1 300 40.90 9.55 43.30 40.83 12.24 25.00 57.22 32.22 -0.04
## prot 2 300 13.37 6.43 10.44 12.54 3.99 6.98 28.48 21.50 0.86
## fat 3 300 20.23 8.98 17.13 18.55 4.71 4.38 47.20 42.82 1.55
## ash 4 300 2.63 1.27 2.22 2.49 1.21 1.17 5.43 4.26 0.67
## sodium 5 300 0.67 0.37 0.49 0.58 0.12 0.25 1.79 1.54 1.80
## carb 6 300 22.86 18.03 23.24 22.56 28.50 0.51 48.64 48.13 0.05
## cal 7 300 3.27 0.62 3.21 3.20 0.45 2.18 5.08 2.90 1.06
## kurtosis se
## mois -1.54 0.55
## prot -0.78 0.37
## fat 1.82 0.52
## ash -0.90 0.07
## sodium 2.17 0.02
## carb -1.67 1.04
## cal 1.12 0.04
Additional descriptive statistics. For example, we can notice that carbohydrates have the highest variation, since coefficient of variation (0.79) is the highest for them.
library(pastecs)
## Warning: package 'pastecs' was built under R version 4.3.2
round(stat.desc(mydata_PCA, basic = FALSE), 2)
## mois prot fat ash sodium carb cal
## median 43.30 10.44 17.13 2.22 0.49 23.24 3.21
## mean 40.90 13.37 20.23 2.63 0.67 22.86 3.27
## SE.mean 0.55 0.37 0.52 0.07 0.02 1.04 0.04
## CI.mean.0.95 1.09 0.73 1.02 0.14 0.04 2.05 0.07
## var 91.26 41.40 80.56 1.61 0.14 325.07 0.38
## std.dev 9.55 6.43 8.98 1.27 0.37 18.03 0.62
## coef.var 0.23 0.48 0.44 0.48 0.55 0.79 0.19
I have created new object called R, that aims to show us the correlation between all variables. Also, I have rounded the results up to 3 decimals for simplicity.
R <- cor(mydata_PCA)
round(R, 3)
## mois prot fat ash sodium carb cal
## mois 1.000 0.360 -0.171 0.266 -0.102 -0.592 -0.764
## prot 0.360 1.000 0.498 0.824 0.429 -0.854 0.070
## fat -0.171 0.498 1.000 0.792 0.933 -0.640 0.765
## ash 0.266 0.824 0.792 1.000 0.808 -0.899 0.326
## sodium -0.102 0.429 0.933 0.808 1.000 -0.620 0.672
## carb -0.592 -0.854 -0.640 -0.899 -0.620 1.000 -0.023
## cal -0.764 0.070 0.765 0.326 0.672 -0.023 1.000
Correlation plot of the same object is presented below.
library(psych)
corPlot(R)
I did the Bartlett’s test of sphericity, testing the assumption that population correlation matrix is equal to identity matrix. Based on the p-value, we can reject the null hypothesis, meaning that the population correlation matrix is not equal to identity matrix, which is what we need for the analysis.
library(psych)
cortest.bartlett(R, n = nrow(mydata) )
## $chisq
## [1] 7135.057
##
## $p.value
## [1] 0
##
## $df
## [1] 21
The determinant of the correlation matrix, det(R), provides valuable information about the strength and nature of correlation relationships. In factor analysis it has to be lower than 0.00001, but here that is not the case.
det(R)
## [1] 3.353267e-11
Kaiser-Mayer-Olkin statistics shows us general measure of sampling adequancy, while MSA shows us adequacy of a single variable. The results show us that data is not so suitable (it falls into category below 0.5 which is unacceptable), since we will lose a lot of information. We could drop all variables but cal and sodium. Yet, I have decided to proceed with analysis, even though some information will be lost.
library(psych)
KMO(R)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA = 0.46
## MSA for each item =
## mois prot fat ash sodium carb cal
## 0.24 0.34 0.44 0.44 0.98 0.41 0.91
Principal components were made. We can see eigenvalues, variances and cumulative variances. First PC has 4.17 more information than initial individual variable. From here, it is clear that by using Kaiser’s rule only 2 principal components would be taken.
library("FactoMineR")
## Warning: package 'FactoMineR' was built under R version 4.3.2
components <- PCA(mydata_PCA,
scale.unit = TRUE,
graph = FALSE )
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
get_eigenvalue (components)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.171782e+00 5.959688e+01 59.59688
## Dim.2 2.290457e+00 3.272082e+01 92.31770
## Dim.3 4.145623e-01 5.922319e+00 98.24002
## Dim.4 9.517423e-02 1.359632e+00 99.59966
## Dim.5 2.767702e-02 3.953860e-01 99.99504
## Dim.6 3.376094e-04 4.822991e-03 99.99986
## Dim.7 9.518780e-06 1.359826e-04 100.00000
Scree plot is showing very similar results. Based on diagram of scree plot, it is clear that “elbow” is at third principal components, so one less is taken, leading us to the same result as Kaiser’s rule.
library(factoextra)
fviz_eig(components,
choice = "eigenvalue",
main = "Scree plot",
ylab = "Eigenvalue",
xlab = "Principal component",
addlabels = TRUE)
Yet another analysis is consistent with the previous two. According to Parallel analysis, only principal components that have higher empirical than theoretical eigenvalues are analytically important and are therefore taken.
library(psych)
fa.parallel(mydata_PCA,
sim = FALSE,
fa = "pc")
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.
## Parallel analysis suggests that the number of factors = NA and the number of components = 2
Results for the Principal Component Analysis are shown below. I will further focus on 4, 6, 8.
library(FactoMineR)
components <- PCA(mydata_PCA,
scale.unit = TRUE,
graph = FALSE,
ncp = 2)
components
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 300 individuals, described by 7 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
Showing components loadings (rescaled coefficients). Each value below is simultaneously also showing Pearson correlation coefficient between each original variable and each principal component. For example, correlation between principal component 2 and variable cal is positive and strong (0.85). 98.6% of information of variable “cal” was transferred to PC1 and PC2, while 25% was transferred only to PC1. Sum of all values by column is equal to eigenvalue of PC1 or PC2, depending on which column we are summing.
print(components$var$cor)
## Dim.1 Dim.2
## mois 0.1321685 -0.9508488
## prot 0.7736169 -0.4081810
## fat 0.9123125 0.3547153
## ash 0.9638315 -0.1679758
## sodium 0.8899205 0.3051999
## carb -0.8678837 0.4847685
## cal 0.4993639 0.8588049
This parts shows contributions. For example, it is clear that variable ash contributed the most to PC1 (22.26), while variable cal contributed the most to PC2 (32.20). Sum of all values by column is equal to 100%.
print(components$var$contrib)
## Dim.1 Dim.2
## mois 0.4187303 39.473056
## prot 14.3459822 7.274168
## fat 19.9510440 5.493356
## ash 22.2679725 1.231887
## sodium 18.9837006 4.066742
## carb 18.0551662 10.259983
## cal 5.9774042 32.200808
I found the contrast between ash, mois, prot = GROUP 1, and carb, cal, fat, sodium = GROUP 2. Based on literature, I decided to name group 1 as essential nutrients and group 2 as supplementary nutrients.
library(factoextra)
fviz_pca_var(components,
repel = TRUE )
Biplot shows all units and their relative positions. If unit 249 is observed, it can be said that it is generally under average, but much better nutrients due to high contrast.
library(factoextra)
fviz_pca_biplot(components)
Below are positions of six units.
head(components$ind$coord)
## Dim.1 Dim.2
## 1 5.010343 2.679215
## 2 5.023755 2.529295
## 3 4.805439 2.673700
## 4 4.469543 2.285029
## 5 4.471893 2.159152
## 6 4.504800 2.167973
Below are positions for both PC1 and PC2, for unit 249, that I was explaining before.
components$ind$coord[249, ]
## Dim.1 Dim.2
## -0.7635428 -1.6831108
Data was standardized first, and then values of each variable were shown. It can be concluded that majority of them are below average, except mois, which is source of contrast.
mydata_std <- scale(mydata_PCA)
mydata_std[249, ]
## mois prot fat ash sodium carb cal
## 1.3385273 -0.3688875 -0.5570102 -0.3018242 -0.3223910 -0.2792482 -1.2112234
Based on analysis, I conclude that using PCA was useful for identifying contrast, however I believe that maybe too much information was lost due to low correlations. There are similarities between some variables, as seen in the analysis, yet I would rather use another multivariate method for further investigation.