The data consists of 24 cars measured on the following six variables:
library(readxl)
cars2004 <- read_excel("cars2004.xls")
cars2004 <- as.data.frame(cars2004)
# nomear as linhas
nomes <- c(cars2004[1:24,1])
rownames(cars2004) <- nomes
# excluir coluna 1
cars2004 <- cars2004[,-1]
head(cars2004)
descriptive statistics
cars_stats = data.frame(
Minimum = apply(cars2004, 2, min),
Maximum = apply(cars2004, 2, max),
Mean = apply(cars2004, 2, mean),
Std_Dev = apply(cars2004, 2, sd))
print(cars_stats, print.gap = 3)
Before performing a PCA (or any other multivariate method) we should start with some preliminary explorations
# Stars plot
#Since we have a small number of observations (24 cars), we can use
#the function stars() to get an idea of the (dis)similarities between the cars:
# star plot
stars(cars2004, labels = abbreviate(rownames(cars2004), 6),
nrow = 4, key.loc = c(8, 11.2))
abbreviate used with non-ASCII chars
abline(h = 9.85, col = "gray90")
Look at the pair-wise scatterplot:
# scatterplot to inspect pair-wise relations
pairs(cars2004)
We can also examine the correlations among variables: all variables are positively correlated
# show lower triangular part of matrix of correlations
as.dist(round(cor(cars2004), 3))
Cylinders Horsepower Speed Weight Width
Horsepower 0.954
Speed 0.885 0.934
Weight 0.692 0.529 0.466
Width 0.706 0.730 0.619 0.477
Length 0.664 0.527 0.578 0.795 0.591
?as.dist
PCA functions in R
Function Package Author
The minimal output from any PCA should contain 3 things:
PCA with prcomp() - one of the default PCA functions in R is prcomp():
# PCA with prcomp()
cars_prcomp = prcomp(cars2004, scale. = TRUE)
# what does prcomp() provide?
names(cars_prcomp)
[1] "sdev" "rotation" "center" "scale" "x"
# eigenvalues
cars_prcomp$sdev^2
[1] 4.41126759 0.85340979 0.43566395 0.23587059 0.05143668 0.01235140
# scores
round(head(cars_prcomp$x, 5), 2)
PC1 PC2 PC3 PC4 PC5 PC6
Citroën C2 1.1 Base -2.54 -0.50 -0.18 0.16 -0.20 0.03
Smart Fortwo Coupé -4.06 -1.63 0.27 -0.90 -0.03 -0.03
Mini 1.6 170 -1.35 -0.80 0.36 -0.05 0.45 0.05
Nissan Micra 1.2 65 -2.46 -0.40 -0.17 0.12 -0.28 0.05
Renault Clio 3.0 V6 0.00 -0.90 0.38 -0.27 0.28 -0.13
# loadings
round(head(cars_prcomp$rotation, 5), 2)
PC1 PC2 PC3 PC4 PC5 PC6
Cylinders 0.46 -0.14 0.21 -0.23 -0.65 -0.50
Horsepower 0.44 -0.38 0.14 -0.17 -0.09 0.78
Speed 0.42 -0.37 0.31 0.41 0.57 -0.31
Weight 0.36 0.62 0.22 -0.53 0.39 -0.01
Width 0.38 -0.12 -0.88 -0.14 0.15 -0.13
PCA with princomp() - The other default PCA function is princomp()
# PCA with princomp()
cars_princomp = princomp(cars2004, cor = TRUE)
# what does princomp() provide?
names(cars_princomp)
[1] "sdev" "loadings" "center" "scale" "n.obs" "scores" "call"
cars_princomp$sdev^2
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
4.41126759 0.85340979 0.43566395 0.23587059 0.05143668 0.01235140
# scores
round(head(cars_princomp$scores, 5), 3)
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Citroën C2 1.1 Base 2.596 0.510 0.179 0.166 0.207 0.032
Smart Fortwo Coupé 4.150 1.666 -0.274 -0.924 0.029 -0.034
Mini 1.6 170 1.382 0.816 -0.372 -0.051 -0.464 0.051
Nissan Micra 1.2 65 2.513 0.404 0.174 0.124 0.289 0.051
Renault Clio 3.0 V6 0.003 0.916 -0.385 -0.274 -0.286 -0.134
# loadings
round(head(unclass(cars_princomp$loadings), 5), 3)
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Cylinders -0.458 0.137 -0.214 -0.232 0.652 -0.496
Horsepower -0.440 0.382 -0.137 -0.173 0.094 0.777
Speed -0.422 0.367 -0.313 0.410 -0.569 -0.314
Weight -0.360 -0.623 -0.223 -0.529 -0.389 -0.008
Width -0.381 0.120 0.883 -0.140 -0.153 -0.132
PCA() with FactoMineR
#load FactoMineR
library(FactoMineR)
# nice PCA
cars_pca = PCA(cars2004, graph = FALSE)
# what does PCA provide?
cars_pca
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 24 individuals, described by 6 variables
*The results are available in the following objects:
name description
1 "$eig" "eigenvalues"
2 "$var" "results for the variables"
3 "$var$coord" "coord. for the variables"
4 "$var$cor" "correlations variables - dimensions"
5 "$var$cos2" "cos2 for the variables"
6 "$var$contrib" "contributions of the variables"
7 "$ind" "results for the individuals"
8 "$ind$coord" "coord. for the individuals"
9 "$ind$cos2" "cos2 for the individuals"
10 "$ind$contrib" "contributions of the individuals"
11 "$call" "summary statistics"
12 "$call$centre" "mean of the variables"
13 "$call$ecart.type" "standard error of the variables"
14 "$call$row.w" "weights for the individuals"
15 "$call$col.w" "weights for the variables"
Extensive output: As you can tell, there is an extensive list of results provided in the output of PCA() —by FactoMineR
Reading results: We’ll discuss how to interpret the main results from PCA() and what things we should pay attention to
Graphical Examination: With the obtained scores and loadings we can get several graphical displays
Some graphics
Some questions to keep in mind
How many PCs to retain? There is no universal criterion to determine the number of PCs to retain. But we must look at the eigenvalues and see what’s the percentage of variance captured by each dimension:
# table of eigenvalues
cars_pca$eig
# screeplot of eigenvalues
barplot(cars_pca$eig[,"eigenvalue"], border = NA, col = "gray80", names.arg = rownames(cars_pca$eig))
What variables characterize each PC?
To see how each PC is characterized, we either check the loadings or the correlations between the variables and the PCs:
# correlations between variables and PCs
round(cars_pca$var$coord[,1:2], 4)
Dim.1 Dim.2
Cylinders 0.9624 -0.1269
Horsepower 0.9233 -0.3527
Speed 0.8861 -0.3387
Weight 0.7569 0.5757
Width 0.8012 -0.1110
Length 0.7953 0.5044
Circle of Correlations
# plot circle of correlations
plot(cars_pca, choix = "var")
Influence of variables on each PC?
We can also examine the contributions of the variables
If all variables were to contribute uniformly, they would have a contribution of 1/6 or 16.67%.
# Contribution of variables
print(rbind(cars_pca$var$contrib, TOTAL = colSums(cars_pca$var$contrib)), print.gap = 3)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
Cylinders 20.99685 1.888053 4.5963914 5.385321 42.5754114
Horsepower 19.32581 14.572896 1.8738150 2.986455 0.8920753
Speed 17.79960 13.445643 9.7721157 16.807651 32.3352042
Weight 12.98761 38.836725 4.9895558 28.029702 15.1507636
Width 14.55312 1.444225 77.9635912 1.958483 2.3414907
Length 14.33701 29.812457 0.8045309 44.832387 6.7050548
TOTAL 100.00000 100.000000 100.0000000 100.000000 100.0000000
To inspect what variables are above and below 16.67 we can create a barplot of variable contributions in the following form:
library(RColorBrewer)
# color palette
colpal = brewer.pal(n = 5, name = "Blues")[5:1]
# Contribution of variables
barplot(t(cars_pca$var$contrib), beside = TRUE,
border = NA, ylim = c(0, 90), col = colpal,
legend.text = colnames(cars_pca$var$contrib),
args.legend = list(x = "top", ncol = 5, bty = 'n'))
abline(h = 16, col = "#ff572255", lwd = 2)
PC scores: We can use the scores as coordinates to plot the objects in a scatterplot
# PC scores (first 2 dimesions)
print(round(cars_pca$ind$coord[,1:2], 3),
print.gap = 3)
Dim.1 Dim.2
Citroën C2 1.1 Base -2.596 -0.510
Smart Fortwo Coupé -4.150 -1.666
Mini 1.6 170 -1.382 -0.816
Nissan Micra 1.2 65 -2.513 -0.404
Renault Clio 3.0 V6 -0.003 -0.916
Audi A3 1.9 TDI -1.121 0.169
Peugeot 307 1.4 HDI 70 -1.725 0.300
Peugeot 407 3.0 V6 BVA 0.553 0.523
Mercedes Classe C 270 CDI 0.078 0.482
BMW 530d 0.838 0.460
Jaguar S-Type 2.7 V6 Bi-Turbo 0.721 0.898
BMW 745i 2.126 0.610
Mercedes Classe S 400 CDI 2.167 0.810
Citroën C3 Pluriel 1.6i -1.623 -0.218
BMW Z4 2.5i -0.399 -0.596
Audi TT 1.8T 180 -0.751 -0.459
Aston Martin Vanquish 3.155 -0.639
Bentley Continental GT 4.161 0.064
Ferrari Enzo 4.946 -2.580
Renault Scenic 1.9 dCi 120 -0.842 0.380
Volkswagen Touran 1.9 TDI 105 -0.805 0.713
Land Rover Defender Td5 -1.072 0.751
Land Rover Discovery Td5 0.851 1.920
Nissan X-Trail 2.2 dCi -0.614 0.722
Default plot of objects in FactoMineR
# plot of scores
plot(cars_pca, choix = "ind")
Alternative plot of objects with ggplot2
# load ggplot2
library(ggplot2)
# data frame with observations from PCA results
cars_pca_obs = data.frame(cars_pca$ind$coord[,1:3])
# PCA plots of observations
ggplot(cars_pca_obs, aes(x = Dim.1, y = Dim.2, label = rownames(cars2004))) +
geom_hline(yintercept = 0, color = "gray70") +
geom_vline(xintercept = 0, color = "gray70") +
geom_point(color = "#55555544", size = 5) +
geom_text(alpha = 0.55, size = 4) +
xlab("PC1") +
ylab("PC2") +
xlim(-5, 6) +
ggtitle("PCA plot of observations")
Contributions of objects to PCs
The contributions (in percentage) reflect the influence that each object has on the formation of the PCs. If all objects had the same contribution on each PC, they would contribute with a value of 4.16 = 100/24
# Contributions on PCs (first 2 dimesions)
print(round(cars_pca$ind$contrib[,1:2], 3),
print.gap = 3)
Dim.1 Dim.2
Citroën C2 1.1 Base 6.365 1.270
Smart Fortwo Coupé 16.269 13.550
Mini 1.6 170 1.804 3.249
Nissan Micra 1.2 65 5.967 0.795
Renault Clio 3.0 V6 0.000 4.092
Audi A3 1.9 TDI 1.186 0.139
Peugeot 307 1.4 HDI 70 2.812 0.441
Peugeot 407 3.0 V6 BVA 0.288 1.336
Mercedes Classe C 270 CDI 0.006 1.133
BMW 530d 0.663 1.033
Jaguar S-Type 2.7 V6 Bi-Turbo 0.492 3.935
BMW 745i 4.271 1.816
Mercedes Classe S 400 CDI 4.434 3.201
Citroën C3 Pluriel 1.6i 2.487 0.231
BMW Z4 2.5i 0.150 1.734
Audi TT 1.8T 180 0.533 1.030
Aston Martin Vanquish 9.404 1.997
Bentley Continental GT 16.352 0.020
Ferrari Enzo 23.110 32.503
Renault Scenic 1.9 dCi 120 0.669 0.706
Volkswagen Touran 1.9 TDI 105 0.612 2.481
Land Rover Defender Td5 1.085 2.757
Land Rover Discovery Td5 0.683 18.004
Nissan X-Trail 2.2 dCi 0.357 2.547
Barplots of object contributions to PCs
op = par(mfrow = c(2,1))
# barplot of object contributions for PC1
barplot(cars_pca$ind$contrib[,1], border = NA, las = 2,
names.arg = abbreviate(rownames(cars2004), 8), cex.names = 0.8)
abbreviate used with non-ASCII chars
title("Object Contributions on PC1", cex.main = 0.9)
abline(h = 4.16, col = "gray50")
# barplot of object contributions for PC2
barplot(cars_pca$ind$contrib[,2], border = NA, las = 2,
names.arg = abbreviate(rownames(cars2004), 8), cex.names = 0.8)
abbreviate used with non-ASCII chars
title("Object Contributions on PC2", cex.main = 0.9)
abline(h = 4.16, col = "gray50")
par(op)
PCA with Clustering
We can gain some insight by combining PCA and Clustering
One option is to apply a hierarchical clustering to the obtained scores, and then add the clustered groups to the Scores scatterplot
Hierarchical Clustering
# clustering
cars_clustering = hclust(dist(cars_pca$ind$coord), method = "ward")
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
plot(cars_clustering, xlab = "", sub = "")
PC plot with clustering partition
# get 3 cluster
cars_clusters = cutree(cars_clustering, k = 3)
# add cluster to data frame of scores
cars_pca_obs$cluster = as.factor(cars_clusters)
# ggplot
ggplot(cars_pca_obs, aes(x=Dim.1, y=Dim.2, label=rownames(cars2004))) +
geom_hline(yintercept = 0, color = "gray70") +
geom_vline(xintercept = 0, color = "gray70") +
geom_point(aes(color = cluster), alpha = 0.55, size = 3) +
geom_text(aes(color = cluster), alpha = 0.55, size = 4) +
xlab("PC1") +
ylab("PC2") +
xlim(-5, 6) +
ggtitle("PCA plot of observations")
library(ggrepel)
# vamos arrumar o gráfico anterior.
# geom_text_repel
# get 3 cluster
cars_clusters = cutree(cars_clustering, k = 3)
# add cluster to data frame of scores
cars_pca_obs$cluster = as.factor(cars_clusters)
# ggplot
ggplot(cars_pca_obs, aes(x=Dim.1, y=Dim.2, label=rownames(cars2004))) +
geom_hline(yintercept = 0, color = "gray70") +
geom_vline(xintercept = 0, color = "gray70") +
geom_point(aes(color = cluster), alpha = 0.55, size = 3) +
geom_text_repel(aes(color = cluster), alpha = 0.55, size = 4) +
xlab("PC1") +
ylab("PC2") +
xlim(-5, 6) +
ggtitle("PCA plot of observations")