For the dataset
Cereals.csv, we will be using Principal Component Analysis (PCA) to extract data with many variables and create visualizations to display that data. It is a very useful technique for exploratory data analysis which means we investigate the dataset to discover patterns using statistical graphs and other visualization methods.
#We read our dataset.
mydata<- read.csv("Cereals.csv")
#We remove NA values from our dataset.
mydata1<- na.omit(mydata)
mydata11 <- mydata1[, c(4:16)]
#'Using Principal Component Analysis (PCA) to compute principal components. We only need numeric variables for our PCA so we remove columns name, mfr and type. We obtain 13 principal components, which you call PC1-13. Each of these explains a percentage of the total variation in the dataset. That is to say: PC1 explains 28% of the total variance and PC3 explains 14.7% of the variance.
mydata.pca <- prcomp(mydata1[, c(4:16)], center=TRUE , scale. = TRUE)
summary(mydata.pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.9062 1.7743 1.3818 1.00969 0.9947 0.84974 0.81946
Proportion of Variance 0.2795 0.2422 0.1469 0.07842 0.0761 0.05554 0.05166
Cumulative Proportion 0.2795 0.5217 0.6685 0.74696 0.8231 0.87861 0.93026
PC8 PC9 PC10 PC11 PC12 PC13
Standard deviation 0.64515 0.56192 0.30301 0.25194 0.13897 1.499e-08
Proportion of Variance 0.03202 0.02429 0.00706 0.00488 0.00149 0.000e+00
Cumulative Proportion 0.96228 0.98657 0.99363 0.99851 1.00000 1.000e+00
#Using str() to have a look at our PCA object.
str(mydata.pca)
List of 5
$ sdev : num [1:13] 1.906 1.774 1.382 1.01 0.995 ...
$ rotation: num [1:13, 1:13] 0.2995 -0.3074 0.0399 0.1834 -0.4535 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:13] "calories" "protein" "fat" "sodium" ...
.. ..$ : chr [1:13] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:13] 107.03 2.51 1 162.36 2.18 ...
..- attr(*, "names")= chr [1:13] "calories" "protein" "fat" "sodium" ...
$ scale : Named num [1:13] 19.84 1.08 1.01 82.77 2.42 ...
..- attr(*, "names")= chr [1:13] "calories" "protein" "fat" "sodium" ...
$ x : num [1:74, 1:13] -5.708 -0.409 -5.133 -7.803 1.032 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:74] "1" "2" "3" "4" ...
.. ..$ : chr [1:13] "PC1" "PC2" "PC3" "PC4" ...
- attr(*, "class")= chr "prcomp"
suppressMessages(library(devtools))
suppressMessages(library(ggbiplot))
#We will plot our PCA using ggbiplot.
#Note: A biplot is a type of plot that will allow you to visualize how the samples relate to one another in our PCA (which samples are similar and which are different) and will simultaneously reveal how each variable contributes to each principal component.
ggbiplot(mydata.pca)
#Below is the graph of variables. Positive correlated variables point to the same side of the plot. Negative correlated variables point to opposite sides of the graph.
suppressMessages(library(factoextra))
fviz_pca_var(mydata.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
#We can see here which cereals are similar to each other. The cereals are indicated by row numbers.
ggbiplot(mydata.pca, labels=rownames(mydata1))
# Add ellipses
# mfr: Manufacturer of cereal where A = American Home Food Products; G = General Mills; K = Kelloggs; N = Nabisco; P = Post; Q = Quaker Oats; R = Ralston Purina
p <- fviz_pca_ind(mydata.pca, label="none", habillage=mydata1$mfr,
addEllipses=TRUE, ellipse.level=0.95)
print(p)
## Too few points to calculate an ellipse
#Here we can see the percentage of explained variances.
#Eigenvalues correspond to the amount of the variation explained by each principal component (PC)
suppressMessages(library(factoextra))
fviz_screeplot(mydata.pca) #fviz_screeplot() is an alias of fviz_eig()
#Individuals with similar profile are grouped together
fviz_pca_ind(mydata.pca,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)