S<-matrix(c(5,0,0,0,9,0,0,0,9),nrow=3,ncol=3,byrow=T)
S
## [,1] [,2] [,3]
## [1,] 5 0 0
## [2,] 0 9 0
## [3,] 0 0 9
# Finding the eigenvalues and eigenvectors of S
eigen_vecs <- eigen(S)$vectors
cat("Eigen vectors:",eigen_vecs)
## Eigen vectors: 0 0 1 0 1 0 1 0 0
eigen_vals <- eigen(S)$values
cat("Eigen values:",eigen_vals)
## Eigen values: 9 9 5
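As a quick sanity check (a sketch, not part of the original output), the decomposition should reconstruct S exactly:
# Reconstruct S from its eigen decomposition: S = V diag(lambda) t(V)
S_rebuilt <- eigen_vecs %*% diag(eigen_vals) %*% t(eigen_vecs)
all.equal(S_rebuilt, S)  # should return TRUE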
In the case of matrix S, the eigenvectors we found
already represent the principal components, because the eigenvectors
obtained for S are the coordinate unit vectors ([1, 0, 0], [0, 1, 0],
and [0, 0, 1]). These point along the original axes (X1,
X2, and X3).
Therefore, in our case, the principal components are directly given by the eigenvectors themselves.
Since the eigenvectors are unit vectors, these principal components directly correspond to the original variables (X1, X2, and X3) without any scaling or weighting.
When two or more principal components are associated with the same eigenvalue, the variance explained by those components is equal. This condition is known as degeneracy in the eigenvalues, and it indicates that the data variance is spread equally along each of those principal axes.
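We can verify this degeneracy numerically (a minimal sketch): any unit vector in the plane spanned by the two eigenvectors with eigenvalue 9 carries the same variance.
# An arbitrary unit vector in the degenerate eigenspace of eigenvalue 9
v <- c(0, 1, 1) / sqrt(2)
as.numeric(t(v) %*% S %*% v)  # should be 9, matching the repeated eigenvalue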
my.pc <- princomp(covmat= S)
summary(my.pc, loadings=T)
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 3.0000000 3.0000000 2.2360680
## Proportion of Variance 0.3913043 0.3913043 0.2173913
## Cumulative Proportion 0.3913043 0.7826087 1.0000000
##
## Loadings:
##      Comp.1 Comp.2 Comp.3
## [1,]                   1
## [2,]            1
## [3,]      1
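As expected, the component variances reported by princomp() are exactly the eigenvalues computed with eigen():
# Component variances from princomp() vs. eigenvalues from eigen()
my.pc$sdev^2  # 9 9 5
eigen_vals    # 9 9 5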
The implications are as described above: each principal component simply reproduces one of the original variables. Next, consider a 2x2 covariance matrix whose principal components are genuine mixtures of the variables:
my.s<-matrix(c(36,5,5,4),nrow=2,ncol=2,byrow=T)
my.s
## [,1] [,2]
## [1,] 36 5
## [2,] 5 4
my.pc_2 <- princomp(covmat= my.s)
summary(my.pc_2, loadings=T)
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 6.0632545 1.79915130
## Proportion of Variance 0.9190764 0.08092363
## Cumulative Proportion 0.9190764 1.00000000
##
## Loadings:
## Comp.1 Comp.2
## [1,] 0.989 0.151
## [2,] 0.151 -0.989
A principal component is a linear combination of the original variables; using the loadings above:
\[ PC1 = 0.989 \cdot X_1 + 0.151 \cdot X_2 \]
\[ PC2 = 0.151 \cdot X_1 - 0.989 \cdot X_2 \]
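Because my.pc_2 was fitted from a covariance matrix alone, princomp() returns no scores. As a minimal sketch of how the score equations above would be applied (X is simulated here, not data from the report):
# Simulate a hypothetical sample with covariance my.s
set.seed(1)
X <- MASS::mvrnorm(n = 100, mu = c(0, 0), Sigma = my.s)
# Scores = centered data projected onto the loadings
scores_manual <- scale(X, center = TRUE, scale = FALSE) %*% loadings(my.pc_2)
head(scores_manual)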
my.R <- cov2cor(my.s)
my.R
## [,1] [,2]
## [1,] 1.0000000 0.4166667
## [2,] 0.4166667 1.0000000
my.pc_R <- princomp(covmat=my.R)
summary(my.pc_R, loadings=T)
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 1.1902381 0.7637626
## Proportion of Variance 0.7083333 0.2916667
## Cumulative Proportion 0.7083333 1.0000000
##
## Loadings:
## Comp.1 Comp.2
## [1,] 0.707 0.707
## [2,] 0.707 -0.707
We see that PCA based on the correlation matrix gives different results from PCA based on the covariance matrix. The difference is a scaling effect: with the correlation matrix the variables are standardized and contribute equally regardless of their original scale, whereas with the covariance matrix the components are driven by the original variances. This changes the relative importance of the variables in determining the principal components.
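To make the scaling effect concrete (a sketch): rescaling a variable changes the covariance-based PCA but leaves the correlation-based PCA untouched.
# Multiply X2 by 10: the covariance matrix (and its PCA) changes...
D <- diag(c(1, 10))
my.s.scaled <- D %*% my.s %*% D
princomp(covmat = my.s.scaled)$sdev^2           # eigenvalues change
# ...but the correlation matrix is unchanged, so its PCA is identical
princomp(covmat = cov2cor(my.s.scaled))$sdev^2  # same as my.pc_R$sdev^2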
my.cor.mat <- matrix(c(1,.402,.396,.301,.305,.339,.340,
.402,1,.618,.150,.135,.206,.183,
.396,.618,1,.321,.289,.363,.345,
.301,.150,.321,1,.846,.759,.661,
.305,.135,.289,.846,1,.797,.800,
.339,.206,.363,.759,.797,1,.736,
.340,.183,.345,.661,.800,.736,1),
ncol=7, nrow=7, byrow=T)
row.names(my.cor.mat) <- c('head length', 'head breadth', 'face breadth',
'left finger length', 'left forearm length', 'left foot length','height')
# Note: my.cor.mat is passed as a data matrix here, so princomp() treats its
# rows as observations; princomp(covmat = my.cor.mat) would instead perform
# PCA on the correlation structure directly.
my.pc_3 <- princomp(my.cor.mat)
summary(my.pc_3, loadings = T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.6538962 0.2478205 0.14235053 0.12823663 0.09102269
## Proportion of Variance 0.7979163 0.1146078 0.03781446 0.03068767 0.01546105
## Cumulative Proportion 0.7979163 0.9125241 0.95033859 0.98102626 0.99648731
## Comp.6 Comp.7
## Standard deviation 0.043386036 3.416096e-09
## Proportion of Variance 0.003512689 2.177710e-17
## Cumulative Proportion 1.000000000 1.000000e+00
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## [1,] 0.166 0.809 0.260 0.481
## [2,] 0.424 -0.322 -0.508 -0.128 0.135 0.648
## [3,] 0.262 -0.467 0.794 0.129 0.102 0.237
## [4,] -0.433 0.610 -0.249 -0.455 0.395
## [5,] -0.479 -0.220 0.819 0.210
## [6,] -0.396 0.907
## [7,] -0.387 0.178 -0.768 -0.197 -0.318 0.294
# Scree plot of the component variances
plot(my.pc_3$sdev^2, xlab = "Component number",
ylab = "Component variance", type = "l", main = "Scree diagram")
We see that the elbow in the curve corresponds to the first two components. This is in line with the PCA summary above, where Comp.1 explains 79.8% of the variance and Comp.2 a further 11.5%.
Given these proportions, the first two principal components are sufficient on their own: together they explain more than 90% of the variance in the dataset.
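The cumulative proportions quoted here can be recomputed directly from the component standard deviations:
# Cumulative proportion of variance explained
cumsum(my.pc_3$sdev^2) / sum(my.pc_3$sdev^2)  # first two: 0.798, 0.913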
Next we plot the first two principal component scores to see how the observations vary.
#Plot of the first two principal component scores for the data.
xlim <- range(my.pc_3$scores[,1:2])
plot(my.pc_3$scores, xlim = xlim, ylim = xlim, pch = 19,
main = "Scatterplot of the First Two Principal Components")
# Add abbreviated row labels (f1 through f7); they follow the row order above,
# shortened for easier readability
text(my.pc_3$scores[,1:2],
labels=c('f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7'),
cex=0.7,lwd=2, pos = 4)
Spread out points on the first couple of principal components indicate that head length (f1), head breadth (f2), and face breadth (f3) could be the measurements with the most variability.
Looking at the actual scores:
# Extract PC scores
pc_scores <- my.pc_3$scores
# Extract the first two principal components
pc1 <- pc_scores[, 1]
pc2 <- pc_scores[, 2]
# Combine them into a data frame
first_two_pcs <- data.frame(PC1 = pc1, PC2 = pc2)
print(first_two_pcs)
## PC1 PC2
## head length 0.5349371 0.563368936
## head breadth 1.0075352 -0.172776655
## face breadth 0.6715763 -0.280424282
## left finger length -0.5604736 -0.038350593
## left forearm length -0.6504354 -0.019728003
## left foot length -0.5160711 -0.046221753
## height -0.4870684 -0.005867651
For Component 1 (Comp.1): the three head measurements (head length, head breadth, face breadth) all have positive scores, while the limb and height measurements are all negative, so Comp.1 separates the head measurements from the body measurements.
For Component 2 (Comp.2): head length stands apart with a large positive score (0.56), while face breadth and head breadth have moderate negative scores and the limb measurements sit near zero.
These spread-out scores again indicate that head length (f1), head breadth (f2), and face breadth (f3) are the measurements with the most variability along the first two components.
food.full <- read.table("C:/Users/HP/Downloads/foodstuffs.txt", header=T)
food.labels <- as.character(food.full[,1])
food.data <- food.full[,-1]
plot(food.data)
It seems necessary to extract the principal components from the correlation rather than the covariance matrix, since the five variables to be used are on very different scales.
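A quick check of the raw scales supports this choice (output omitted here, since it depends on the data file):
# Standard deviation of each variable on its original scale
sapply(food.data, sd)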
cor(food.data)
## Energy Protein Fat Calcium Iron
## Energy 1.00000000 0.17384812 0.98706740 -0.32038440 -0.09977765
## Protein 0.17384812 1.00000000 0.02491163 -0.08508934 -0.17462478
## Fat 0.98706740 0.02491163 1.00000000 -0.30813212 -0.06060118
## Calcium -0.32038440 -0.08508934 -0.30813212 1.00000000 0.04429196
## Iron -0.09977765 -0.17462478 -0.06060118 0.04429196 1.00000000
food.pc <- princomp(food.data, cor=T)
# Showing the coefficients of the components:
summary(food.pc,loadings=T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.4824903 1.0696751 0.9211811 0.8988007 0.0400021115
## Proportion of Variance 0.4395555 0.2288410 0.1697149 0.1615686 0.0003200338
## Cumulative Proportion 0.4395555 0.6683965 0.8381114 0.9996800 1.0000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Energy 0.654 0.149 0.199 0.709
## Protein 0.151 -0.691 -0.463 0.525 -0.104
## Fat 0.639 0.202 0.216 0.134 -0.697
## Calcium -0.355 0.652 0.670
## Iron -0.122 0.689 -0.540 0.468
# Showing the eigenvalues of the correlation matrix:
(food.pc$sdev)^2
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## 2.197777619 1.144204758 0.848574671 0.807842783 0.001600169
From the resulting output we see that the first two components both have variances (eigenvalues) greater than one and together account for about 67% of the variance of the original variables. The third component has a variance of 0.85 and brings the cumulative proportion to about 84%.
# A scree plot
plot(1:(length(food.pc$sdev)), (food.pc$sdev)^2, type='b',
main="Scree Plot", xlab="Number of Components", ylab="Eigenvalue Size")
From the scree plot above, we see that the elbow occurs at 2. This could indicate that the scores on these two components might be used to summarize the data in further analyses with little loss of information.
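Kaiser's rule (retain components whose eigenvalue exceeds one) points to the same choice:
# Components with eigenvalue > 1
which(food.pc$sdev^2 > 1)  # Comp.1 and Comp.2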
# Plotting the PC scores for the sample data in the space of the first two principal components:
par(pty="s")
plot(food.pc$scores[,1], food.pc$scores[,2], ylim=range(food.pc$scores[,1]),
xlab="PC 1", ylab="PC 2", type ='n', lwd=2)
# labeling points with food labels:
text(food.pc$scores[,1], food.pc$scores[,2],labels=food.labels, cex=0.7, lwd=2,pos = 4)
The plots demonstrate clearly that AC and AR are outliers.
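One way to flag these outliers numerically (a sketch, not part of the original analysis) is the Mahalanobis distance of each food in the space of the first two components:
# Mahalanobis distance of each observation in (PC1, PC2) space
sc <- food.pc$scores[, 1:2]
md <- mahalanobis(sc, center = colMeans(sc), cov = cov(sc))
food.labels[order(md, decreasing = TRUE)][1:2]  # should return AC and AR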
# Extract PC scores
food_pc_scores <- food.pc$scores
# Extract the first two principal components
food_pc1 <- food_pc_scores[, 1]
food_pc2 <- food_pc_scores[, 2]
# Combine them into a data frame
first_two_pcs_food <- data.frame(Name = food.labels ,PC1 = food_pc1, PC2 = food_pc2)
print(first_two_pcs_food)
## Name PC1 PC2
## 1 BB 1.8926793 0.32612213
## 2 HR 0.6581727 -0.07589716
## 3 BR 2.9352834 1.13683995
## 4 BS 2.3269385 0.54744724
## 5 BC -0.2609008 0.05347606
## 6 CB -0.9291517 -0.90529899
## 7 CC -0.1813311 -1.56364639
## 8 BH -0.7103667 0.34009492
## 9 LL 0.9357840 0.11396640
## 10 LS 1.4087095 0.42295507
## 11 HS 1.9011883 0.27805930
## 12 PR 1.9228365 0.46184851
## 13 PS 2.0879935 0.44524724
## 14 BT 0.1387200 0.23467050
## 15 VC -0.1274858 -0.60588711
## 16 FB -0.6777718 -1.58636560
## 17 AR -2.4014878 2.71293056
## 18 AC -2.6229342 3.06527635
## 19 TC -1.4569532 -0.24336419
## 20 HF -0.7824842 -0.62227973
## 21 MB 0.2210494 -0.67428876
## 22 MC -1.1875477 0.08150040
## 23 PF -0.1035628 -0.07528291
## 24 SC -1.5289615 -0.71674399
## 25 DC -1.8376955 -0.57049499
## 26 UC -0.1326466 -1.70742152
## 27 RC -1.4880737 -0.87346329
USairpol.full <- read.table("C:/Users/HP/Downloads/USairpol.txt", header=T)
city.names <- as.character(USairpol.full[,1])
USairpol.data <- USairpol.full[,-1]
cor(USairpol.data)
## SO2 Neg.Temp Manuf Pop Wind Precip
## SO2 1.00000000 0.43360020 0.64476873 0.49377958 0.09469045 0.05429434
## Neg.Temp 0.43360020 1.00000000 0.19004216 0.06267813 0.34973963 -0.38625342
## Manuf 0.64476873 0.19004216 1.00000000 0.95526935 0.23794683 -0.03241688
## Pop 0.49377958 0.06267813 0.95526935 1.00000000 0.21264375 -0.02611873
## Wind 0.09469045 0.34973963 0.23794683 0.21264375 1.00000000 -0.01299438
## Precip 0.05429434 -0.38625342 -0.03241688 -0.02611873 -0.01299438 1.00000000
## Days 0.36956363 0.43024212 0.13182930 0.04208319 0.16410559 0.49609671
## Days
## SO2 0.36956363
## Neg.Temp 0.43024212
## Manuf 0.13182930
## Pop 0.04208319
## Wind 0.16410559
## Precip 0.49609671
## Days 1.00000000
usair.pc<-princomp(USairpol.data,cor=TRUE)
summary(usair.pc,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.6517021 1.2297702 1.1810897 0.9444529 0.58887916
## Proportion of Variance 0.3897314 0.2160478 0.1992819 0.1274273 0.04953981
## Cumulative Proportion 0.3897314 0.6057792 0.8050611 0.9324884 0.98202821
## Comp.6 Comp.7
## Standard deviation 0.3166822 0.159733920
## Proportion of Variance 0.0143268 0.003644989
## Cumulative Proportion 0.9963550 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SO2 0.490 0.404 0.730 0.183 0.150
## Neg.Temp 0.315 -0.677 0.185 -0.162 -0.611
## Manuf 0.541 -0.226 0.267 -0.164 -0.745
## Pop 0.488 -0.282 0.345 -0.113 -0.349 0.649
## Wind 0.250 -0.311 -0.862 0.268 0.150
## Precip 0.626 0.492 -0.184 0.161 -0.554
## Days 0.260 0.678 -0.110 0.110 -0.440 0.505
# Showing the eigenvalues of the correlation matrix:
(usair.pc$sdev)^2
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 2.72811968 1.51233485 1.39497299 0.89199129 0.34677866 0.10028759 0.02551493
The correlation matrix shows a very high correlation between Manuf and Pop (0.96). We see that the first three components all have variances (eigenvalues) greater than one and together account for just over 80% of the variance of the original variables. Scores on these three components might be used to graph the data with little loss of information.
par(pty="s")
plot(usair.pc$scores[,1],usair.pc$scores[,2],
ylim=range(usair.pc$scores[,1]),
xlab="PC1",ylab="PC2",type="n",lwd=2)
text(usair.pc$scores[,1],usair.pc$scores[,2],
labels=city.names,cex=0.7,lwd=2)
Next we create bivariate boxplots of the first three principal components.
# bvbox() draws a bivariate boxplot (available, e.g., in the MVA package)
pairs(usair.pc$scores[,1:3], ylim = c(-6, 4), xlim = c(-6, 4),
      panel = function(x, y, ...) {
        text(x, y, city.names, cex = 0.6)
        bvbox(cbind(x, y), add = TRUE)
      })
The plots show Chicago, Phoenix, Philadelphia, and Albuquerque as clear outliers. Let's drop them.
out.row1 <- 23
out.row2 <- 1
out.row3 <- 11
USairpol.data.reduced <- USairpol.data[-c(out.row1,out.row2, out.row3),]
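An equivalent, name-based removal (a sketch) avoids hard-coded row numbers and keeps a matching label vector for later plots:
# Remove outliers by name rather than by row index
outlier.cities <- city.names[c(out.row1, out.row2, out.row3)]
keep <- !(city.names %in% outlier.cities)
USairpol.data.reduced <- USairpol.data[keep, ]
city.names.reduced <- city.names[keep]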
# Re-examine the correlations, then redo the PCA on the reduced data
cor(USairpol.data.reduced)
## SO2 Neg.Temp Manuf Pop Wind
## SO2 1.00000000 0.42583476 0.39005350 0.13200907 -0.02540176
## Neg.Temp 0.42583476 1.00000000 0.14618114 -0.03838933 0.24783607
## Manuf 0.39005350 0.14618114 1.00000000 0.89458617 0.24065445
## Pop 0.13200907 -0.03838933 0.89458617 1.00000000 0.22138396
## Wind -0.02540176 0.24783607 0.24065445 0.22138396 1.00000000
## Precip -0.03991113 -0.67988165 -0.14589521 -0.05765077 -0.25704283
## Days 0.36637757 0.34795808 0.07317241 -0.03491977 -0.06050038
## Precip Days
## SO2 -0.03991113 0.36637757
## Neg.Temp -0.67988165 0.34795808
## Manuf -0.14589521 0.07317241
## Pop -0.05765077 -0.03491977
## Wind -0.25704283 -0.06050038
## Precip 1.00000000 0.25045464
## Days 0.25045464 1.00000000
usair.reduced.pc<-princomp(USairpol.data.reduced,cor=TRUE)
summary(usair.reduced.pc,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.5343387 1.2845013 1.2216248 0.9027976 0.72917673
## Proportion of Variance 0.3363136 0.2357062 0.2131953 0.1164348 0.07595696
## Cumulative Proportion 0.3363136 0.5720198 0.7852151 0.9016499 0.97760689
## Comp.6 Comp.7
## Standard deviation 0.30970678 0.246644404
## Proportion of Variance 0.01370261 0.008690495
## Cumulative Proportion 0.99130951 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SO2 0.373 0.174 0.451 0.239 0.710 0.120 0.228
## Neg.Temp 0.417 0.564 -0.180 -0.674 -0.138
## Manuf 0.534 -0.408 0.118 0.152 -0.709
## Pop 0.435 -0.542 -0.265 -0.236 0.622
## Wind 0.290 -0.367 -0.815 0.334
## Precip -0.323 -0.365 0.517 -0.321 0.192 -0.570 -0.176
## Days 0.153 0.243 0.623 -0.396 -0.491 0.351
# Showing the eigenvalues of the correlation matrix:
(usair.reduced.pc$sdev)^2
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 2.35419517 1.64994360 1.49236719 0.81504357 0.53169871 0.09591829 0.06083346
The full-dataset loadings show Manuf and Pop loading strongly on Comp.1, whereas Wind has a particularly strong negative loading on Comp.4 and Precip correlates strongly and positively with Comp.2. Similar to the full dataset, Manuf and Pop are still substantial contributors to Comp.1 in the reduced dataset.
There is a slight decrease in the proportion of total variance explained by the first three components—from 80% to 78.5% —after removing outliers. The total variance explained is slightly higher when all variables (including outliers) are included. This indicates that the outliers do have an effect on the spread of the data; they seem to add to the variance that’s being captured by the principal components. However, the difference isn’t very large, suggesting that the main structure of the data is not heavily dependent on those outliers.
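The comparison can be read off directly (both analyses have seven variables, so the eigenvalues sum to 7):
# Cumulative proportion of variance, full vs. reduced analysis
rbind(full    = cumsum(usair.pc$sdev^2) / 7,
      reduced = cumsum(usair.reduced.pc$sdev^2) / 7)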
# Label with the city names of the remaining rows so labels match the scores
pairs(usair.reduced.pc$scores[,1:3], ylim = c(-6, 4), xlim = c(-6, 4),
      panel = function(x, y, ...) {
        text(x, y, city.names[-c(out.row1, out.row2, out.row3)], cex = 0.6)
        bvbox(cbind(x, y), add = TRUE)
      })