S<-matrix(c(5,0,0,0,9,0,0,0,9),nrow=3,ncol=3,byrow=T)
S
## [,1] [,2] [,3]
## [1,] 5 0 0
## [2,] 0 9 0
## [3,] 0 0 9
# Finding the eigenvalues and eigenvectors of S
eigen_vecs <- eigen(S)$vectors
cat("Eigen vectors:",eigen_vecs)
## Eigen vectors: 0 0 1 0 1 0 1 0 0
eigen_vals <- eigen(S)$values
cat("Eigen values:",eigen_vals)
## Eigen values: 9 9 5
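As a quick sanity check (a sketch, not part of the original output), the decomposition should reconstruct S exactly:
# Reconstruct S from its eigen decomposition: S = V diag(lambda) t(V)
S_rebuilt <- eigen_vecs %*% diag(eigen_vals) %*% t(eigen_vecs)
all.equal(S_rebuilt, S)  # should return TRUE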
In the case of matrix S, the eigenvectors we found
already represent the principal components, because the eigenvectors
obtained for S are the coordinate unit vectors ([1, 0, 0], [0, 1, 0],
and [0, 0, 1]). These point along the original axes (X1,
X2, and X3).
Therefore, in our case, the principal components are directly given by the eigenvectors themselves.
Since the eigenvectors are unit vectors, these principal components directly correspond to the original variables (X1, X2, and X3) without any scaling or weighting.
When two or more principal components are associated with the same eigenvalue, the variance explained by those components is equal. This condition is known as degeneracy in the eigenvalues, and it indicates that the data variance is spread equally along each of those principal axes.
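We can verify this degeneracy numerically (a minimal sketch): any unit vector in the plane spanned by the two eigenvectors with eigenvalue 9 carries the same variance.
# An arbitrary unit vector in the degenerate eigenspace of eigenvalue 9
v <- c(0, 1, 1) / sqrt(2)
as.numeric(t(v) %*% S %*% v)  # should be 9, matching the repeated eigenvalue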
my.pc <- princomp(covmat= S)
summary(my.pc, loadings=T)
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 3.0000000 3.0000000 2.2360680
## Proportion of Variance 0.3913043 0.3913043 0.2173913
## Cumulative Proportion 0.3913043 0.7826087 1.0000000
##
## Loadings:
##      Comp.1 Comp.2 Comp.3
## [1,]                   1
## [2,]            1
## [3,]      1
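As expected, the component variances reported by princomp() are exactly the eigenvalues computed with eigen():
# Component variances from princomp() vs. eigenvalues from eigen()
my.pc$sdev^2  # 9 9 5
eigen_vals    # 9 9 5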
The implications are as described above: each principal component simply reproduces one of the original variables. Next, consider a 2x2 covariance matrix whose principal components are genuine mixtures of the variables:
my.s<-matrix(c(36,5,5,4),nrow=2,ncol=2,byrow=T)
my.s
## [,1] [,2]
## [1,] 36 5
## [2,] 5 4
my.pc_2 <- princomp(covmat= my.s)
summary(my.pc_2, loadings=T)
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 6.0632545 1.79915130
## Proportion of Variance 0.9190764 0.08092363
## Cumulative Proportion 0.9190764 1.00000000
##
## Loadings:
## Comp.1 Comp.2
## [1,] 0.989 0.151
## [2,] 0.151 -0.989
A principal component is a linear combination of the original variables; using the loadings above:
\[ PC1 = 0.989 \cdot X_1 + 0.151 \cdot X_2 \]
\[ PC2 = 0.151 \cdot X_1 - 0.989 \cdot X_2 \]
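Because my.pc_2 was fitted from a covariance matrix alone, princomp() returns no scores. As a minimal sketch of how the score equations above would be applied (X is simulated here, not data from the report):
# Simulate a hypothetical sample with covariance my.s
set.seed(1)
X <- MASS::mvrnorm(n = 100, mu = c(0, 0), Sigma = my.s)
# Scores = centered data projected onto the loadings
scores_manual <- scale(X, center = TRUE, scale = FALSE) %*% loadings(my.pc_2)
head(scores_manual)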
my.R <- cov2cor(my.s)
my.R
## [,1] [,2]
## [1,] 1.0000000 0.4166667
## [2,] 0.4166667 1.0000000
my.pc_R <- princomp(covmat=my.R)
summary(my.pc_R, loadings=T)
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 1.1902381 0.7637626
## Proportion of Variance 0.7083333 0.2916667
## Cumulative Proportion 0.7083333 1.0000000
##
## Loadings:
## Comp.1 Comp.2
## [1,] 0.707 0.707
## [2,] 0.707 -0.707
We see that PCA based on the correlation matrix gives different results from PCA based on the covariance matrix. The difference is a scaling effect: with the correlation matrix the variables are standardized and contribute equally regardless of their original scale, whereas with the covariance matrix the components are driven by the original variances. This changes the relative importance of the variables in determining the principal components.
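To make the scaling effect concrete (a sketch): rescaling a variable changes the covariance-based PCA but leaves the correlation-based PCA untouched.
# Multiply X2 by 10: the covariance matrix (and its PCA) changes...
D <- diag(c(1, 10))
my.s.scaled <- D %*% my.s %*% D
princomp(covmat = my.s.scaled)$sdev^2           # eigenvalues change
# ...but the correlation matrix is unchanged, so its PCA is identical
princomp(covmat = cov2cor(my.s.scaled))$sdev^2  # same as my.pc_R$sdev^2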
my.cor.mat <- matrix(c(1,.402,.396,.301,.305,.339,.340,
.402,1,.618,.150,.135,.206,.183,
.396,.618,1,.321,.289,.363,.345,
.301,.150,.321,1,.846,.759,.661,
.305,.135,.289,.846,1,.797,.800,
.339,.206,.363,.759,.797,1,.736,
.340,.183,.345,.661,.800,.736,1),
ncol=7, nrow=7, byrow=T)
row.names(my.cor.mat) <- c('head length', 'head breadth', 'face breadth',
'left finger length', 'left forearm length', 'left foot length','height')
# Note: my.cor.mat is passed as a data matrix here, so princomp() treats its
# rows as observations; princomp(covmat = my.cor.mat) would instead perform
# PCA on the correlation structure directly.
my.pc_3 <- princomp(my.cor.mat)
summary(my.pc_3, loadings = T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.6538962 0.2478205 0.14235053 0.12823663 0.09102269
## Proportion of Variance 0.7979163 0.1146078 0.03781446 0.03068767 0.01546105
## Cumulative Proportion 0.7979163 0.9125241 0.95033859 0.98102626 0.99648731
## Comp.6 Comp.7
## Standard deviation 0.043386036 3.416096e-09
## Proportion of Variance 0.003512689 2.177710e-17
## Cumulative Proportion 1.000000000 1.000000e+00
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## [1,] 0.166 0.809 0.260 0.481
## [2,] 0.424 -0.322 -0.508 -0.128 0.135 0.648
## [3,] 0.262 -0.467 0.794 0.129 0.102 0.237
## [4,] -0.433 0.610 -0.249 -0.455 0.395
## [5,] -0.479 -0.220 0.819 0.210
## [6,] -0.396 0.907
## [7,] -0.387 0.178 -0.768 -0.197 -0.318 0.294
# Scree plot of the component variances
plot(my.pc_3$sdev^2, xlab = "Component number",
ylab = "Component variance", type = "l", main = "Scree diagram")
We see that the elbow in the curve corresponds to the first two components. This is in line with the PCA summary above, where Comp.1 explains 79.8% of the variance and Comp.2 a further 11.5%.
Given these proportions, the first two principal components are sufficient on their own: together they explain more than 90% of the variance in the dataset.
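The cumulative proportions quoted here can be recomputed directly from the component standard deviations:
# Cumulative proportion of variance explained
cumsum(my.pc_3$sdev^2) / sum(my.pc_3$sdev^2)  # first two: 0.798, 0.913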
Next we plot the first two principal component scores to see how the observations vary.
#Plot of the first two principal component scores for the data.
xlim <- range(my.pc_3$scores[,1:2])
plot(my.pc_3$scores, xlim = xlim, ylim = xlim, pch = 19,
main = "Scatterplot of the First Two Principal Components")
# Add abbreviated row labels (f1 through f7); they follow the row order above,
# shortened for easier readability
text(my.pc_3$scores[,1:2],
labels=c('f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7'),
cex=0.7,lwd=2, pos = 4)
Spread out points on the first couple of principal components indicate that head length (f1), head breadth (f2), and face breadth (f3) could be the measurements with the most variability.
Looking at the actual scores:
# Extract PC scores
pc_scores <- my.pc_3$scores
# Extract the first two principal components
pc1 <- pc_scores[, 1]
pc2 <- pc_scores[, 2]
# Combine them into a data frame
first_two_pcs <- data.frame(PC1 = pc1, PC2 = pc2)
print(first_two_pcs)
## PC1 PC2
## head length 0.5349371 0.563368936
## head breadth 1.0075352 -0.172776655
## face breadth 0.6715763 -0.280424282
## left finger length -0.5604736 -0.038350593
## left forearm length -0.6504354 -0.019728003
## left foot length -0.5160711 -0.046221753
## height -0.4870684 -0.005867651
For Component 1 (Comp.1): the three head measurements (head length, head breadth, face breadth) all have positive scores, while the limb and height measurements are all negative, so Comp.1 separates the head measurements from the body measurements.
For Component 2 (Comp.2): head length stands apart with a large positive score (0.56), while face breadth and head breadth have moderate negative scores and the limb measurements sit near zero.
These spread-out scores again indicate that head length (f1), head breadth (f2), and face breadth (f3) are the measurements with the most variability along the first two components.
food.full <- read.table("C:/Users/HP/Downloads/foodstuffs.txt", header=T)
food.labels <- as.character(food.full[,1])
food.data <- food.full[,-1]
plot(food.data)
It seems necessary to extract the principal components from the correlation rather than the covariance matrix, since the five variables to be used are on very different scales.
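A quick check of the raw scales supports this choice (output omitted here, since it depends on the data file):
# Standard deviation of each variable on its original scale
sapply(food.data, sd)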
cor(food.data)
## Energy Protein Fat Calcium Iron
## Energy 1.00000000 0.17384812 0.98706740 -0.32038440 -0.09977765
## Protein 0.17384812 1.00000000 0.02491163 -0.08508934 -0.17462478
## Fat 0.98706740 0.02491163 1.00000000 -0.30813212 -0.06060118
## Calcium -0.32038440 -0.08508934 -0.30813212 1.00000000 0.04429196
## Iron -0.09977765 -0.17462478 -0.06060118 0.04429196 1.00000000
food.pc <- princomp(food.data, cor=T)
# Showing the coefficients of the components:
summary(food.pc,loadings=T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.4824903 1.0696751 0.9211811 0.8988007 0.0400021115
## Proportion of Variance 0.4395555 0.2288410 0.1697149 0.1615686 0.0003200338
## Cumulative Proportion 0.4395555 0.6683965 0.8381114 0.9996800 1.0000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Energy 0.654 0.149 0.199 0.709
## Protein 0.151 -0.691 -0.463 0.525 -0.104
## Fat 0.639 0.202 0.216 0.134 -0.697
## Calcium -0.355 0.652 0.670
## Iron -0.122 0.689 -0.540 0.468
# Showing the eigenvalues of the correlation matrix:
(food.pc$sdev)^2
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## 2.197777619 1.144204758 0.848574671 0.807842783 0.001600169
From the resulting output we see that the first two components both have variances (eigenvalues) greater than one and together account for about 67% of the variance of the original variables. The third component has a variance of 0.85 and brings the cumulative proportion to about 84%.
# A scree plot
plot(1:(length(food.pc$sdev)), (food.pc$sdev)^2, type='b',
main="Scree Plot", xlab="Number of Components", ylab="Eigenvalue Size")
From the scree plot above, we see that the elbow occurs at 2. This could indicate that the scores on these two components might be used to summarize the data in further analyses with little loss of information.
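Kaiser's rule (retain components whose eigenvalue exceeds one) points to the same choice:
# Components with eigenvalue > 1
which(food.pc$sdev^2 > 1)  # Comp.1 and Comp.2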
# Plotting the PC scores for the sample data in the space of the first two principal components:
par(pty="s")
plot(food.pc$scores[,1], food.pc$scores[,2], ylim=range(food.pc$scores[,1]),
xlab="PC 1", ylab="PC 2", type ='n', lwd=2)
# labeling points with food labels:
text(food.pc$scores[,1], food.pc$scores[,2],labels=food.labels, cex=0.7, lwd=2,pos = 4)
The plots demonstrate clearly that AC and AR are outliers.
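One way to flag these outliers numerically (a sketch, not part of the original analysis) is the Mahalanobis distance of each food in the space of the first two components:
# Mahalanobis distance of each observation in (PC1, PC2) space
sc <- food.pc$scores[, 1:2]
md <- mahalanobis(sc, center = colMeans(sc), cov = cov(sc))
food.labels[order(md, decreasing = TRUE)][1:2]  # should return AC and AR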
# Extract PC scores
food_pc_scores <- food.pc$scores
# Extract the first two principal components
food_pc1 <- food_pc_scores[, 1]
food_pc2 <- food_pc_scores[, 2]
# Combine them into a data frame
first_two_pcs_food <- data.frame(Name = food.labels ,PC1 = food_pc1, PC2 = food_pc2)
print(first_two_pcs_food)
## Name PC1 PC2
## 1 BB 1.8926793 0.32612213
## 2 HR 0.6581727 -0.07589716
## 3 BR 2.9352834 1.13683995
## 4 BS 2.3269385 0.54744724
## 5 BC -0.2609008 0.05347606
## 6 CB -0.9291517 -0.90529899
## 7 CC -0.1813311 -1.56364639
## 8 BH -0.7103667 0.34009492
## 9 LL 0.9357840 0.11396640
## 10 LS 1.4087095 0.42295507
## 11 HS 1.9011883 0.27805930
## 12 PR 1.9228365 0.46184851
## 13 PS 2.0879935 0.44524724
## 14 BT 0.1387200 0.23467050
## 15 VC -0.1274858 -0.60588711
## 16 FB -0.6777718 -1.58636560
## 17 AR -2.4014878 2.71293056
## 18 AC -2.6229342 3.06527635
## 19 TC -1.4569532 -0.24336419
## 20 HF -0.7824842 -0.62227973
## 21 MB 0.2210494 -0.67428876
## 22 MC -1.1875477 0.08150040
## 23 PF -0.1035628 -0.07528291
## 24 SC -1.5289615 -0.71674399
## 25 DC -1.8376955 -0.57049499
## 26 UC -0.1326466 -1.70742152
## 27 RC -1.4880737 -0.87346329
USairpol.full <- read.table("C:/Users/HP/Downloads/USairpol.txt", header=T)
city.names <- as.character(USairpol.full[,1])
USairpol.data <- USairpol.full[,-1]
cor(USairpol.data)
## SO2 Neg.Temp Manuf Pop Wind Precip
## SO2 1.00000000 0.43360020 0.64476873 0.49377958 0.09469045 0.05429434
## Neg.Temp 0.43360020 1.00000000 0.19004216 0.06267813 0.34973963 -0.38625342
## Manuf 0.64476873 0.19004216 1.00000000 0.95526935 0.23794683 -0.03241688
## Pop 0.49377958 0.06267813 0.95526935 1.00000000 0.21264375 -0.02611873
## Wind 0.09469045 0.34973963 0.23794683 0.21264375 1.00000000 -0.01299438
## Precip 0.05429434 -0.38625342 -0.03241688 -0.02611873 -0.01299438 1.00000000
## Days 0.36956363 0.43024212 0.13182930 0.04208319 0.16410559 0.49609671
## Days
## SO2 0.36956363
## Neg.Temp 0.43024212
## Manuf 0.13182930
## Pop 0.04208319
## Wind 0.16410559
## Precip 0.49609671
## Days 1.00000000
usair.pc<-princomp(USairpol.data,cor=TRUE)
summary(usair.pc,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.6517021 1.2297702 1.1810897 0.9444529 0.58887916
## Proportion of Variance 0.3897314 0.2160478 0.1992819 0.1274273 0.04953981
## Cumulative Proportion 0.3897314 0.6057792 0.8050611 0.9324884 0.98202821
## Comp.6 Comp.7
## Standard deviation 0.3166822 0.159733920
## Proportion of Variance 0.0143268 0.003644989
## Cumulative Proportion 0.9963550 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SO2 0.490 0.404 0.730 0.183 0.150
## Neg.Temp 0.315 -0.677 0.185 -0.162 -0.611
## Manuf 0.541 -0.226 0.267 -0.164 -0.745
## Pop 0.488 -0.282 0.345 -0.113 -0.349 0.649
## Wind 0.250 -0.311 -0.862 0.268 0.150
## Precip 0.626 0.492 -0.184 0.161 -0.554
## Days 0.260 0.678 -0.110 0.110 -0.440 0.505
# Showing the eigenvalues of the correlation matrix:
(usair.pc$sdev)^2
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 2.72811968 1.51233485 1.39497299 0.89199129 0.34677866 0.10028759 0.02551493
The correlation matrix shows a very high correlation between Manuf and Pop (0.96). We see that the first three components all have variances (eigenvalues) greater than one and together account for just over 80% of the variance of the original variables. Scores on these three components might be used to graph the data with little loss of information.
par(pty="s")
plot(usair.pc$scores[,1],usair.pc$scores[,2],
ylim=range(usair.pc$scores[,1]),
xlab="PC1",ylab="PC2",type="n",lwd=2)
text(usair.pc$scores[,1],usair.pc$scores[,2],
labels=city.names,cex=0.7,lwd=2)
Next we create bivariate boxplots of the first three principal components.
# bvbox() draws a bivariate boxplot (available, e.g., in the MVA package)
pairs(usair.pc$scores[,1:3], ylim = c(-6, 4), xlim = c(-6, 4),
      panel = function(x, y, ...) {
        text(x, y, city.names, cex = 0.6)
        bvbox(cbind(x, y), add = TRUE)
      })
The plots show Chicago, Phoenix, Philadelphia, and Albuquerque as clear outliers. Let's drop them.
out.row1 <- 23
out.row2 <- 1
out.row3 <- 11
USairpol.data.reduced <- USairpol.data[-c(out.row1,out.row2, out.row3),]
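An equivalent, name-based removal (a sketch) avoids hard-coded row numbers and keeps a matching label vector for later plots:
# Remove outliers by name rather than by row index
outlier.cities <- city.names[c(out.row1, out.row2, out.row3)]
keep <- !(city.names %in% outlier.cities)
USairpol.data.reduced <- USairpol.data[keep, ]
city.names.reduced <- city.names[keep]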
# Re-examine the correlations, then redo the PCA on the reduced data
cor(USairpol.data.reduced)
## SO2 Neg.Temp Manuf Pop Wind
## SO2 1.00000000 0.42583476 0.39005350 0.13200907 -0.02540176
## Neg.Temp 0.42583476 1.00000000 0.14618114 -0.03838933 0.24783607
## Manuf 0.39005350 0.14618114 1.00000000 0.89458617 0.24065445
## Pop 0.13200907 -0.03838933 0.89458617 1.00000000 0.22138396
## Wind -0.02540176 0.24783607 0.24065445 0.22138396 1.00000000
## Precip -0.03991113 -0.67988165 -0.14589521 -0.05765077 -0.25704283
## Days 0.36637757 0.34795808 0.07317241 -0.03491977 -0.06050038
## Precip Days
## SO2 -0.03991113 0.36637757
## Neg.Temp -0.67988165 0.34795808
## Manuf -0.14589521 0.07317241
## Pop -0.05765077 -0.03491977
## Wind -0.25704283 -0.06050038
## Precip 1.00000000 0.25045464
## Days 0.25045464 1.00000000
usair.reduced.pc<-princomp(USairpol.data.reduced,cor=TRUE)
summary(usair.reduced.pc,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.5343387 1.2845013 1.2216248 0.9027976 0.72917673
## Proportion of Variance 0.3363136 0.2357062 0.2131953 0.1164348 0.07595696
## Cumulative Proportion 0.3363136 0.5720198 0.7852151 0.9016499 0.97760689
## Comp.6 Comp.7
## Standard deviation 0.30970678 0.246644404
## Proportion of Variance 0.01370261 0.008690495
## Cumulative Proportion 0.99130951 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SO2 0.373 0.174 0.451 0.239 0.710 0.120 0.228
## Neg.Temp 0.417 0.564 -0.180 -0.674 -0.138
## Manuf 0.534 -0.408 0.118 0.152 -0.709
## Pop 0.435 -0.542 -0.265 -0.236 0.622
## Wind 0.290 -0.367 -0.815 0.334
## Precip -0.323 -0.365 0.517 -0.321 0.192 -0.570 -0.176
## Days 0.153 0.243 0.623 -0.396 -0.491 0.351
# Showing the eigenvalues of the correlation matrix:
(usair.reduced.pc$sdev)^2
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 2.35419517 1.64994360 1.49236719 0.81504357 0.53169871 0.09591829 0.06083346
The full-dataset loadings show Manuf and Pop loading strongly on Comp.1, whereas Wind has a particularly strong negative loading on Comp.4 and Precip correlates strongly and positively with Comp.2. Similar to the full dataset, Manuf and Pop are still substantial contributors to Comp.1 in the reduced dataset.
There is a slight decrease in the proportion of total variance explained by the first three components—from 80% to 78.5% —after removing outliers. The total variance explained is slightly higher when all variables (including outliers) are included. This indicates that the outliers do have an effect on the spread of the data; they seem to add to the variance that’s being captured by the principal components. However, the difference isn’t very large, suggesting that the main structure of the data is not heavily dependent on those outliers.
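The comparison can be read off directly (both analyses have seven variables, so the eigenvalues sum to 7):
# Cumulative proportion of variance, full vs. reduced analysis
rbind(full    = cumsum(usair.pc$sdev^2) / 7,
      reduced = cumsum(usair.reduced.pc$sdev^2) / 7)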
# Label with the city names of the remaining rows so labels match the scores
pairs(usair.reduced.pc$scores[,1:3], ylim = c(-6, 4), xlim = c(-6, 4),
      panel = function(x, y, ...) {
        text(x, y, city.names[-c(out.row1, out.row2, out.row3)], cex = 0.6)
        bvbox(cbind(x, y), add = TRUE)
      })