1 a) Suppose our multivariate data have covariance matrix S find the eigenvalues and eigenvectors of S.

my.S<-matrix(c(5,0,0,0,9,0,0,0,9),nrow=3,ncol=3,byrow=T)
my.S
##      [,1] [,2] [,3]
## [1,]    5    0    0
## [2,]    0    9    0
## [3,]    0    0    9
#Finding eigen values and vectors
eigen_vec_vals<-eigen(my.S)
Eigen_vectors <- eigen_vec_vals$vectors
Eigen_values <- eigen_vec_vals$values
cat("Eigen vectors:",Eigen_vectors)
## Eigen vectors: 0 0 1 0 1 0 1 0 0
cat("Eigen values:",Eigen_values)
## Eigen values: 9 9 5

1 b) Determine all three principal components for such a data set, using PCA based on S.

Eigenvalues and Eigen vectors

Given the covariance matrix \(S\):

\[ S = \begin{pmatrix} 5 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 0 & 9 \end{pmatrix} \]

The eigenvalues and eigenvectors are computed as follows:

  • Eigen values: \(\lambda_1 = 5\), \(\lambda_2 = 9\), \(\lambda_3 = 9\)

  • Eigen vectors: \[ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \]

Principal Components

Using the eigen vectors, we can determine the principal components as Each principal component is a linear combination of the original variables

\(X_1, X_2, X_3\):

  1. First Principal Component (PC1): Corresponding to the eigenvalue 5 \[ \text{PC1} = 1 \cdot X_1 + 0 \cdot X_2 + 0 \cdot X_3 = X_1 \]

  2. Second Principal Component (PC2): Corresponding to the first eigenvalue 9 \[ \text{PC2} = 0 \cdot X_1 + 1 \cdot X_2 + 0 \cdot X_3 = X_2 \]

  3. Third Principal Component (PC3): Corresponding to the second eigenvalue 9 \[ \text{PC3} = 0 \cdot X_1 + 0 \cdot X_2 + 1 \cdot X_3 = X_3 \]

Therefore, the principal components are:

  • \(\text{PC1} = X_1\)
  • \(\text{PC2} = X_2\)
  • \(\text{PC3} = X_3\)

In this specific case, the principal components are identical to the original variables due to the diagonal nature of the covariance matrix \(S\).

1 c) What can you say about the principal components associated with eigenvalues that are the same value?

HELPFUL NOTE: In R, we can do a PCA where the covariance matrix is input rather than the data matrix using code such as:

my.pc <- princomp(covmat=my.S)

summary(my.pc, loadings=T)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3
## Standard deviation     3.0000000 3.0000000 2.2360680
## Proportion of Variance 0.3913043 0.3913043 0.2173913
## Cumulative Proportion  0.3913043 0.7826087 1.0000000
## 
## Loadings:
##      Comp.1 Comp.2 Comp.3
## [1,]               1     
## [2,]        1            
## [3,] 1

Since \(\lambda_2 = \lambda_3\), any linear combination of \(X_2\) and \(X_3\) can also serve as a principal component. For example, if \(v_2\) and \(v_3\) are the eigenvectors corresponding to \(\lambda_2\) and \(\lambda_3\):

\[ v_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad v_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \]

Any vector of the form:

\[ a v_2 + b v_3 = \begin{pmatrix} 0 \\ a \\ b \end{pmatrix} \]

is also a principal component, where \(a\) and \(b\) are arbitrary constants. This implies that within the subspace spanned by \(X_2\) and \(X_3\), the principal components are not unique as evidenced by the output of the Loadings where PC1 and PC2 are actually not unique due to same standard deviation (3) and proportion of variance (0.3913043).

2) Suppose a multivariate data set has sample covariance matrix S

my.R<-matrix(c(36,5,5,4),nrow=2,ncol=2,byrow=T)
my.R
##      [,1] [,2]
## [1,]   36    5
## [2,]    5    4

2a) Determine both principal components for such a data set, using a PCA based on S.

(Refer to HELPFUL NOTE and R TIP from Problem 1). Remember that a principal component is a linear combination of the original variables, so your answer should be in the form of linear combinations of X1 and X2.

# Calculate the eigenvalues and eigenvectors
Eigen_val_vars_R <- eigen(my.R)
Eigen_valR <- Eigen_val_vars_R$values
Eigen_vectorR <- Eigen_val_vars_R$vectors

Eigen_vectorR
##            [,1]       [,2]
## [1,] -0.9885545  0.1508642
## [2,] -0.1508642 -0.9885545
# Principal components
# The principal components are obtained by multiplying the eigen vectors by the original variables
# Since there are 2 variables (X1 and X2), there will be 2 principal components
# Each principal component is a linear combination of X1 and X2

# First principal component
pc1 <- paste0(Eigen_vectorR[1, 1], "*X1", Eigen_vectorR[2, 1], "*X2")

# Second principal component
pc2 <- paste0(Eigen_vectorR[1, 2], "*X1", Eigen_vectorR[2, 2], "*X2")

# Print principal components
cat("Principal Component 1:", pc1, "\n")
## Principal Component 1: -0.988554494713121*X1-0.150864213723749*X2
cat("Principal Component 2:", pc2)
## Principal Component 2: 0.150864213723749*X1-0.988554494713121*X2

2b) Determine the correlation matrix R that corresponds to the covariance matrix S.

corr_matrix_R<-cov2cor(my.S)
corr_matrix_R
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

2c) Determine both principal components for such a data set, using a PCA based on R.

2d) Are the PCA results different from those in part (a)? If so, try to explain why they are different.

-NOTE: For EACH of the following problems, also write (at least) a couple of paragraphs explaining the choices you made in doing the PCA and explaining in words what substantive conclusions about the data that you can draw from the PCA. You can use relevant results and graphics to support your conclusions.

my.pc <- princomp(covmat=corr_matrix_R)

summary(my.pc, loadings=T)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3
## Standard deviation     1.0000000 1.0000000 1.0000000
## Proportion of Variance 0.3333333 0.3333333 0.3333333
## Cumulative Proportion  0.3333333 0.6666667 1.0000000
## 
## Loadings:
##      Comp.1 Comp.2 Comp.3
## [1,]               1     
## [2,]        1            
## [3,] 1

Comparison of Results:

The results from PCA based on R (correlation matrix) and S (covariance matrix) are different:

PCA based on R:

  • Standard deviation for each principal component is 1.
  • Proportion of variance for each principal component is approximately 0.3333.
  • Cumulative proportion of variance increases steadily with each component.

PCA based on S:

  • Standard deviation for each principal component is not 1, indicating different scaling.

  • Proportion of variance for each principal component varies. Cumulative proportion of variance also differs from the correlation matrix-based PCA.

The differences arise due to the scaling effect caused by using correlation matrix (R) versus covariance matrix (S).

When using the correlation matrix, the variables are standardized (scaled to have unit variance), which means all variables contribute equally to the analysis, irrespective of their original scale. However, when using the covariance matrix, the scaling depends on the original variances of the variables. This leads to differences in the importance of variables in determining the principal components.

3) Problem 3.3 from Everitt & Hothorn (2011) asks you to do and interpret a principal components analysis on the given correlation matrix,

my.cor.mat <- matrix(c(1,.402,.396,.301,.305,.339,.340,
.402,1,.618,.150,.135,.206,.183,
.396,.618,1,.321,.289,.363,.345,
.301,.150,.321,1,.846,.759,.661,
.305,.135,.289,.846,1,.797,.800,
.339,.206,.363,.759,.797,1,.736,
.340,.183,.345,.661,.800,.736,1),
ncol=7, nrow=7, byrow=T)

#Perform PCA.

pc <- princomp(my.cor.mat)

summary(pc, loadings=T)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     0.6538962 0.2478205 0.14235053 0.12823663 0.09102269
## Proportion of Variance 0.7979163 0.1146078 0.03781446 0.03068767 0.01546105
## Cumulative Proportion  0.7979163 0.9125241 0.95033859 0.98102626 0.99648731
##                             Comp.6 Comp.7
## Standard deviation     0.043386036      0
## Proportion of Variance 0.003512689      0
## Cumulative Proportion  1.000000000      1
## 
## Loadings:
##      Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## [1,]  0.166  0.809  0.260                       0.481
## [2,]  0.424 -0.322 -0.508 -0.128  0.135         0.648
## [3,]  0.262 -0.467  0.794  0.129         0.102  0.237
## [4,] -0.433                0.610 -0.249 -0.455  0.395
## [5,] -0.479                      -0.220  0.819  0.210
## [6,] -0.396                       0.907              
## [7,] -0.387         0.178 -0.768 -0.197 -0.318  0.294
# Extract Variance from the PCA output
prop_variance <- pc$sdev^2
# Create a scree plot with enhanced aesthetics
plot(1:length(prop_variance), prop_variance * 100, type = "b", 
     main = "Scree Plot: Proportion of Variance of Components", 
     col = "blue", lwd = 2, pch = 19, cex = 1.5, xaxt = 'n', yaxt = 'n',ylab='Proportion of variance',xlab='Component')

# Customizing the axes

# Adding gridlines
grid(nx = NA, ny = NULL, col = "gray", lty = "dotted")

# Enhancing title and axis labels
title( col.main = "darkblue", font.main = 4, cex.main = 1.2)

# Customizing the axes
axis(1, at = 1:length(prop_variance), labels = 1:length(prop_variance), las = 1, col.axis = "darkblue", cex.axis = 0.8)
axis(2, at = seq(0, 100, by = 10), labels = seq(0, 100, by = 10), las = 1, col.axis = "darkblue", cex.axis = 0.8)

#Getting scores
data<-data.frame(pc$scores[,1:2])
plot(Comp.2~Comp.1,data=data,col='red',main='Plot of the first two principal component scores for the head size data')
# Adding gridlines
grid(nx = NA, ny = NULL, col = "gray", lty = "dotted")

Discussion

Features

  • ‘head length’- Feature 1

  • ’head breadth`- Feature 2

  • ‘face breadth’- Feature 3

  • ‘left finger length’- Feature 4

  • ‘left forearm length’ - Feature 5

  • ‘left foot length’- Feature 6

  • ‘height’ - Feature 7

  • Component 1:

Variables with high positive loadings: Head breadth (0.424), face breadth (0.262) and head length (0.166)

Interpretation: Component 1 seems to represent overall head size . Variables related to body dimensions such as head breadth, face breadth contribute positively to this component.

  • Component 2:

Variables with high positive loadings: Head length (0.809)

Interpretation: Component 2 appears to represent facial dimensions. Variables like head length contribute significantly to this component, suggesting that it captures variations in facial structure.

  • Component 3:

Variables with high positive loadings: face breadth (0.794)

Interpretation: Component 3 seems to represent face characteristics. The high positive loading of face breath indicates that this component captures variations in face width.

  • Component 4:

Variables with high positive loadings: Left finger length (0.610)

Variables with negative loadings : height -0.768

Interpretation: Component 4 seems to represent the body parts proportions. Finger length and height have positive and negative loadings, respectively, indicating that this component captures variations in stature lengths relative to finger dimensions.

Based on the cumulative proportion of variance and the scree plot, we will then consider retaining the first three principal components (C), as they explain a significant portion of the variability in the data (around 95% with two components and around 98% with four components).

Conclusion

Based on the scree plot, it seems reasonable to pick the first three principal components as they have eigenvalues notably larger than the subsequent components Head breadth (0.424), face breadth (0.262) and head length (0.166) are the features that explain most of the variability in the data and they belong to the two components that best represent the data.

From the scree plot we will be retaining the first three principal components as they capture a significant amount of variability in the data while reducing its dimensionality.

  • These three components likely represent the most prominent patterns or structures in the dataset.

See HELPFUL NOTE 2 and R TIP above, which will also help with this problem.

4a) The “Contents of Foodstuffs” data set is given in Table 3.6, Everitt (2005). Perform an appropriate PCA on the “Contents of Foodstuffs” data set, including the appropriate display of PC scores. Also make an attempt to interpret your PCs.

4b) The “Contents of Foodstuffs” data set is given in Table 3.6, Everitt (2005).Make an attempt to interpret your PCs.

Full descriptions of the observation names are on p. 63 of Everitt (2005). Perform an appropriate PCA on the “Contents of Foodstuffs” data set, including the appropriate display of PC scores. Also make an attempt to interpret your PCs.

# Create a dataframe with the provided data
meat_data <- data.frame(
  Item = c("BB Beef, braised", "HR Hamburger", "BR Beef roast", "BS Beef, steak", "BC Beef, canned",
           "CB Chicken, broiled", "CC Chicken, canned", "BH Beef, heart", "LL Lamb leg, roast", 
           "LS Lamb shoulder, roast", "HS Smoked ham", "PR Pork roast", "PS Pork simmered", 
           "BT Beef tongue", "VC Veal cutlet", "FB Bluefish, baked", "AR Clams, raw", 
           "AC Clams, canned", "TC Crabmeat, canned", "HF Haddock, fried", "MB Mackerel, broiled", 
           "MC Mackerel, canned", "PF Perch, fried", "SC Salmon, canned", "DC Sardines, canned", 
           "UC Tuna, canned", "RC Shrimp, canned"),
  Energy = c(340, 245, 420, 375, 180, 115, 170, 160, 265, 300, 340, 340, 355, 205, 185, 135, 70, 45, 90, 
             135, 200, 155, 195, 120, 180, 170, 110),
  Protein = c(20, 21, 15, 19, 22, 20, 25, 26, 20, 18, 20, 19, 19, 18, 23, 22, 11, 7, 14, 16, 19, 16, 16, 
              17, 22, 25, 23),
  Fat = c(28, 17, 39, 32, 10, 3, 7, 5, 20, 25, 28, 29, 30, 14, 9, 4, 1, 1, 2, 5, 13, 9, 11, 5, 9, 7, 1),
  Calcium = c(9, 9, 7, 9, 17, 8, 12, 14, 9, 9, 9, 9, 9, 7, 9, 25, 82, 74, 38, 15, 5, 157, 14, 159, 367, 
              7, 98),
  Iron = c(2.6, 2.7, 2.0, 2.5, 3.7, 1.4, 1.5, 5.9, 2.6, 2.3, 2.5, 2.5, 2.4, 2.5, 2.7, 0.6, 6.0, 5.4, 0.8, 
           0.5, 1.0, 1.8, 1.3, 0.7, 2.5, 1.2, 2.6)
)

# Display the dataframe
meat_data
##                       Item Energy Protein Fat Calcium Iron
## 1         BB Beef, braised    340      20  28       9  2.6
## 2             HR Hamburger    245      21  17       9  2.7
## 3            BR Beef roast    420      15  39       7  2.0
## 4           BS Beef, steak    375      19  32       9  2.5
## 5          BC Beef, canned    180      22  10      17  3.7
## 6      CB Chicken, broiled    115      20   3       8  1.4
## 7       CC Chicken, canned    170      25   7      12  1.5
## 8           BH Beef, heart    160      26   5      14  5.9
## 9       LL Lamb leg, roast    265      20  20       9  2.6
## 10 LS Lamb shoulder, roast    300      18  25       9  2.3
## 11           HS Smoked ham    340      20  28       9  2.5
## 12           PR Pork roast    340      19  29       9  2.5
## 13        PS Pork simmered    355      19  30       9  2.4
## 14          BT Beef tongue    205      18  14       7  2.5
## 15          VC Veal cutlet    185      23   9       9  2.7
## 16      FB Bluefish, baked    135      22   4      25  0.6
## 17           AR Clams, raw     70      11   1      82  6.0
## 18        AC Clams, canned     45       7   1      74  5.4
## 19     TC Crabmeat, canned     90      14   2      38  0.8
## 20       HF Haddock, fried    135      16   5      15  0.5
## 21    MB Mackerel, broiled    200      19  13       5  1.0
## 22     MC Mackerel, canned    155      16   9     157  1.8
## 23         PF Perch, fried    195      16  11      14  1.3
## 24       SC Salmon, canned    120      17   5     159  0.7
## 25     DC Sardines, canned    180      22   9     367  2.5
## 26         UC Tuna, canned    170      25   7       7  1.2
## 27       RC Shrimp, canned    110      23   1      98  2.6

4b.1)Use S-PLUS or R to create a scatterplot matrix of the data labeling the foodstuffs appropriately in each panel.

colors <- rep(c("red", "blue","green"), length.out = nrow(meat_data))

plot(meat_data,labels=meat_data[,1],col=colors)

#Conducting principal component analysis
#To ensure we dont have bloated variance,we will scale the data by Getting correlation matrix from the data
cor_meat<-cor(meat_data[,-1])
#getting eigen values and eigen vectors
eigen_vec_meat<-eigen(cor_meat)
#Getting eigen values
eigen_vals_meat<-eigen_vec_meat$values
#Getting eigen vectors
eigen_vecs_meat<-eigen_vec_meat$vectors
#Getting principal components

eigen_vec_meat
## eigen() decomposition
## $values
## [1] 2.197777619 1.144204758 0.848574671 0.807842783 0.001600169
## 
## $vectors
##            [,1]        [,2]       [,3]      [,4]         [,5]
## [1,] -0.6539155  0.08725829 -0.1490040 0.1985936  0.709322816
## [2,] -0.1511882 -0.69052953  0.4629211 0.5245825 -0.104059181
## [3,] -0.6394332  0.20196122 -0.2157528 0.1336768 -0.697078234
## [4,]  0.3546581 -0.00633049 -0.6521357 0.6699900  0.003161132
## [5,]  0.1219811  0.68900403  0.5400663 0.4675657  0.010235855
# Original variables
variables <- c("Energy", "Protein", "Fat", "Calcium", "Iron")

# Provided eigen decomposition output
eigen_values <- c(2.197777619, 1.144204758, 0.848574671, 0.807842783, 0.001600169)
eigen_vectors <- matrix(c(-0.6539155, 0.08725829, -0.1490040, 0.1985936, 0.709322816,
                          -0.1511882, -0.69052953, 0.4629211, 0.5245825, -0.104059181,
                          -0.6394332, 0.20196122, -0.2157528, 0.1336768, -0.697078234,
                           0.3546581, -0.00633049, -0.6521357, 0.6699900, 0.003161132,
                           0.1219811, 0.68900403, 0.5400663, 0.4675657, 0.010235855),
                        nrow = 5, byrow = TRUE)
  • Principal Component 1 : -0.65Energy -0.15Protein -0.64 Fat + .35 Calcium + 0.12Iron

  • Principal Component 2 : 0.09Energy -0.69Protein + 0.2Fat -0.006Calcium +0.69*Iron

  • Principal Component 3 : -0.15Energy +.46Protein - 0.22Fat -0.65Calcium +0.54*Iron

  • Principal Component 4 : 0.19Energy + 0.52Protein+ 0.13Fat +0.67Calcium + 0.47*Iron

  • Principal Component 5 : 0.71Energy -0.10Protein-0.69Fat +0.003Calcium + 0.01*Iron

#Getting principal components using pricom
princomp_meat<-princomp(cor(meat_data[,-1]))
summary(princomp_meat,loadings=T)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3       Comp.4 Comp.5
## Standard deviation     0.9000323 0.5090856 0.3794933 0.0139255839      0
## Proportion of Variance 0.6675742 0.2135822 0.1186838 0.0001598123      0
## Cumulative Proportion  0.6675742 0.8811564 0.9998402 1.0000000000      1
## 
## Loadings:
##         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Energy   0.606         0.148  0.354  0.692
## Protein        -0.737 -0.466  0.464 -0.128
## Fat      0.598  0.203  0.215  0.228 -0.710
## Calcium -0.464         0.648  0.598       
## Iron    -0.229  0.634 -0.543  0.500

Discussion

The provided loadings represent the correlation between the original variables and the principal components. Below are the discussions

  • Loadings: The values in the table represent the correlation coefficients between each original variable and each principal component.

  • In Component 1 (Comp.1), Energy has a loading of 0.606, indicating a relatively strong positive correlation with this component.

  • Protein has a loading of -0.737 in Component 2 (Comp.2), indicating a strong negative correlation with this component.

  • In Component 3 (Comp.3), Fat and Calcium have loadings of 0.215 and 0.598, respectively, indicating positive correlations with this component.

These loadings help us understand how each original variable contributes to each principal component.

Positive loadings indicate a positive relationship, while negative loadings indicate a negative relationship. Higher absolute values of loadings suggest stronger relationships.

5a) The U.S. air pollution data set is available from the MVA package and in Table 3.1, Everitt (2005). Perform an appropriate PCA on the dataset, including the appropriate display of PC scores. Identify any notable outliers.

This R code will read in the data:

USairpol.full <- read.table("/cloud/project/airpoll.txt", quote="\"", comment.char="",header=T)
city.names <- as.character(USairpol.full[,1])
USairpol.data <- USairpol.full[,-1]
#EDA  - Scatter plot

pairs(USairpol.data,panel=function(x,y) {abline(lsfit(x,y)$coef,
lwd=2)
lines(lowess(x,y),lty=2,
lwd=2)
points(x,y)})

# Define a custom panel function for Coplot of SO2 and Mortality conditional on population density
custom_panel <- function(x, y, col, pch) {
  panel.smooth(x, y, span = 1)
}

# Call coplot with the custom panel function
coplot(Mortality ~ SO2 | Popden, panel = custom_panel, data = USairpol.data)

# Function to detect outliers using Z-score
detect_outliers <- function(x, threshold = 3) {
  z_scores <- scale(x)
  outliers <- abs(z_scores) > threshold
  return(outliers)
}

# Detect outliers per column
outliers_per_column <- apply(USairpol.data, 2, detect_outliers)

# Filter dataframe to return only rows with outliers
rows_with_outliers <- apply(outliers_per_column, 1, any)
data_with_outliers <- USairpol.data[rows_with_outliers, ]

# Filter dataframe to return only columns with outliers
columns_with_outliers <- apply(outliers_per_column, 2, any)
data_with_outliers <- data_with_outliers[, columns_with_outliers]

# Print dataframe with outliers
print(data_with_outliers)
##    Popden NOX SO2
## 12   6122  63 278
## 29   4700 319 130
## 40   3437  59 263
## 48   4253 171  86
## 59   9699   8  49

}

Perfoming PCA

pc_us_airpol<-princomp(cor(USairpol.data))

summary(pc_us_airpol,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     0.9012738 0.5789673 0.34222488 0.17167377 0.15697445
## Proportion of Variance 0.6129862 0.2529562 0.08838131 0.02224053 0.01859496
## Cumulative Proportion  0.6129862 0.8659424 0.95432374 0.97656427 0.99515923
##                            Comp.6       Comp.7
## Standard deviation     0.08009190 7.430301e-09
## Proportion of Variance 0.00484077 4.166295e-17
## Cumulative Proportion  1.00000000 1.000000e+00
## 
## Loadings:
##           Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Rainfall   0.531  0.260  0.179  0.290  0.291  0.167  0.650
## Education -0.495  0.312        -0.220 -0.448         0.626
## Popden           -0.511  0.568 -0.595                0.223
## Nonwhite   0.350        -0.594 -0.550         0.462       
## NOX       -0.344 -0.389 -0.445         0.645 -0.156  0.302
## SO2              -0.614 -0.173  0.448 -0.474  0.367  0.177
## Mortality  0.476 -0.196 -0.237 -0.102 -0.264 -0.765  0.115

Discussion

  1. Importance of Components:
    • The PCA analysis identified seven components (Comp.1 to Comp.7) with varying standard deviations and proportions of variance.
    • Comp.1 has the highest standard deviation (0.901) and proportion of variance (61.3%), indicating it captures a significant amount of variability in the data.
    • The cumulative proportion of variance shows that the first three components (Comp.1 to Comp.3) explain approximately 95.4% of the total variance, this suggests that they are the most important components in summarizing the data.
  2. Loadings:
    • Loadings represent the correlation between the original variables (Rainfall, Education, Popden, Nonwhite, NOX, SO2, Mortality) and the identified components.
    • For instance in the above results, Rainfall has a relatively high loading on Comp.1 (0.531), indicating a strong positive association with this component.
    • Education has a negative loading on Comp.1 (-0.495), suggesting an inverse relationship with Rainfall in this component.
    • Popden has notable loadings on Comp.2 and Comp.4, while Nonwhite has strong loadings on Comp.3 and Comp.4, indicating their contributions to different patterns of variability captured by these components.

5b) The U.S. air pollution data set is available from the MVA package and in Table 3.1, Everitt (2005). Perform another PCA after removing one or two of the most severe outliers. Comment on any differences in the PCA results from this PCA compared to the PCA on the full data set.

To remove one or two outlying observations, try R code such as:

out.row1 <-12
USairpol.data.reduced <- USairpol.data[-out.row1,]
out.row2 <- 29
USairpol.data.reduced.more <-
USairpol.data[-c(out.row1,out.row2),]

PCA Redone after removal of Outliers

pc_us_airpol<-princomp(cor(USairpol.data.reduced.more ))

summary(pc_us_airpol,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     0.8479981 0.5748198 0.3728810 0.18313149 0.13362071
## Proportion of Variance 0.5765435 0.2649145 0.1114764 0.02688861 0.01431495
## Cumulative Proportion  0.5765435 0.8414580 0.9529344 0.97982297 0.99413792
##                             Comp.6 Comp.7
## Standard deviation     0.085507589      0
## Proportion of Variance 0.005862079      0
## Cumulative Proportion  1.000000000      1
## 
## Loadings:
##           Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Rainfall   0.514  0.324  0.144  0.314  0.386         0.594
## Education -0.537  0.257 -0.126 -0.151 -0.360         0.684
## Popden     0.109 -0.440  0.618 -0.575                0.266
## Nonwhite   0.350  0.141 -0.543 -0.537 -0.102  0.513       
## NOX       -0.231 -0.483 -0.468         0.647 -0.155  0.219
## SO2        0.130 -0.605 -0.133  0.483 -0.442  0.347  0.224
## Mortality  0.493 -0.128 -0.223 -0.143 -0.310 -0.752
# Define data for comparison
results_before <- data.frame(
  Component = c("Comp.1", "Comp.2", "Comp.3", "Comp.4", "Comp.5"),
  `Standard deviation` = c(0.901, 0.579, 0.342, 0.172, 0.157),
  `Proportion of Variance` = c("61.3%", "25.3%", "8.8%", "2.2%", "1.9%")
)

results_after <- data.frame(
  Component = c("Comp.1", "Comp.2", "Comp.3", "Comp.4", "Comp.5"),
  `Standard deviation` = c(0.848, 0.575, 0.373, 0.183, 0.134),
  `Proportion of Variance` = c("57.7%", "26.5%", "11.1%", "2.7%", "1.4%")
)

# Print comparison table
cat("Before Removing Outliers:")
## Before Removing Outliers:
knitr::kable(results_before)
Component Standard.deviation Proportion.of.Variance
Comp.1 0.901 61.3%
Comp.2 0.579 25.3%
Comp.3 0.342 8.8%
Comp.4 0.172 2.2%
Comp.5 0.157 1.9%
cat("\n\n")
cat("After Removing Outliers:")
## After Removing Outliers:
knitr::kable(results_after)
Component Standard.deviation Proportion.of.Variance
Comp.1 0.848 57.7%
Comp.2 0.575 26.5%
Comp.3 0.373 11.1%
Comp.4 0.183 2.7%
Comp.5 0.134 1.4%
# Loadings before removing outliers
loadings_before <- data.frame(
  Variable = c("Rainfall", "Education", "Popden", "Nonwhite", "NOX", "SO2", "Mortality"),
  Comp.1 = c(0.531, -0.495, NA, 0.350, -0.344, -0.614, 0.476),
  Comp.2 = c(0.260, 0.312, NA, NA, -0.389, -0.173, -0.196),
  Comp.3 = c(0.179, NA, 0.568, -0.594, -0.445, 0.448, -0.237),
  Comp.4 = c(0.290, -0.220, -0.595, -0.550, NA, -0.474, -0.102),
  Comp.5 = c(0.291, -0.448, NA, 0.462, 0.645, 0.367, -0.264),
  Comp.6 = c(0.167, NA, 0.223, NA, -0.156, 0.177, -0.765),
  Comp.7 = c(0.650, 0.626, NA, NA, 0.302, 0.177, 0.115)
)

# Print loadings before removing outliers
knitr::kable(loadings_before)
Variable Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Rainfall 0.531 0.260 0.179 0.290 0.291 0.167 0.650
Education -0.495 0.312 NA -0.220 -0.448 NA 0.626
Popden NA NA 0.568 -0.595 NA 0.223 NA
Nonwhite 0.350 NA -0.594 -0.550 0.462 NA NA
NOX -0.344 -0.389 -0.445 NA 0.645 -0.156 0.302
SO2 -0.614 -0.173 0.448 -0.474 0.367 0.177 0.177
Mortality 0.476 -0.196 -0.237 -0.102 -0.264 -0.765 0.115
# Loadings after removing outliers
loadings_after <- data.frame(
  Variable = c("Rainfall", "Education", "Popden", "Nonwhite", "NOX", "SO2", "Mortality"),
  Comp.1 = c(0.514, -0.537, 0.109, 0.350, -0.231, 0.130, 0.493),
  Comp.2 = c(0.324, 0.257, -0.440, 0.141, -0.483, -0.605, -0.128),
  Comp.3 = c(0.144, -0.126, 0.618, -0.543, -0.468, -0.133, -0.223),
  Comp.4 = c(0.314, -0.151, -0.575, -0.537, NA, 0.483, -0.143),
  Comp.5 = c(0.386, -0.360, NA, -0.102, 0.647, -0.442, -0.310),
  Comp.6 = c(NA, NA, 0.266, 0.513, -0.155, 0.347, -0.752),
  Comp.7 = c(0.594, 0.684, NA, NA, 0.219, 0.224, 0.115)
)

# Print loadings after removing outliers
knitr::kable(loadings_after)
Variable Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Rainfall 0.514 0.324 0.144 0.314 0.386 NA 0.594
Education -0.537 0.257 -0.126 -0.151 -0.360 NA 0.684
Popden 0.109 -0.440 0.618 -0.575 NA 0.266 NA
Nonwhite 0.350 0.141 -0.543 -0.537 -0.102 0.513 NA
NOX -0.231 -0.483 -0.468 NA 0.647 -0.155 0.219
SO2 0.130 -0.605 -0.133 0.483 -0.442 0.347 0.224
Mortality 0.493 -0.128 -0.223 -0.143 -0.310 -0.752 0.115

Discussion of the change in PCA before and after removing outliers

In summary, the analysis of principal components (PCA) before and after removing outliers revealed notable changes in both variance and loadings across several components. Here’s a breakdown of some key findings:

  1. Comp.1:
    • Before Outlier Removal: Captured a high proportion of variance (e.g., 61.3%) and featured significant loadings for variables like Rainfall(0.531), Nonwhite(0.35), and Mortality(0.493).
    • After Outlier Removal: While still capturing a substantial proportion of variance(57.1%), there were minor shifts in loadings for Rainfall(0.514) and Mortality(0.493), indicating potential changes in variable relationships.
  2. Comp.2:
    • Before Outlier Removal: Captured a moderate proportion of variance (e.g., 25.3%) and showed significant loadings for Education(.312) and Rainfall(.26).
    • After Outlier Removal: The proportion of variance (26.5%) captured remained similar, with notable adjustments observed in loadings, particularly for Education(.257) and Rainfall (.324).
  3. Comp.3:
    • Before Outlier Removal: Captured a relatively lower proportion of variance (e.g., 8.8%) and exhibited notable loadings for Popden (0.568) and Rainfall(0.179).
    • After Outlier Removal: Despite a comparable proportion of variance captured (11.1%), loadings for Popden(0.618) and Rainfall(0.144) shifted, reflecting changes in their relationships with other variables.
  4. Comp.4:
    • Before Outlier Removal: Captured a smaller proportion of variance (e.g., 2.2%) and featured significant loadings for SO2(-0.474) and Rainfall (0.29).
    • After Outlier Removal: While still capturing a minor proportion of variance, there were notable adjustments in loadings, indicating potential changes in the importance of Rainfall(0.314) and SO2(0.483) in this component.

These observations underscore the impact of outlier removal on PCA results, highlighting how it refines the representation of underlying data structures and leads to more accurate interpretations of principal components.