my.S<-matrix(c(5,0,0,0,9,0,0,0,9),nrow=3,ncol=3,byrow=T)
my.S
## [,1] [,2] [,3]
## [1,] 5 0 0
## [2,] 0 9 0
## [3,] 0 0 9
# Finding eigenvalues and eigenvectors
eigen_vec_vals<-eigen(my.S)
Eigen_vectors <- eigen_vec_vals$vectors
Eigen_values <- eigen_vec_vals$values
cat("Eigen vectors:",Eigen_vectors)
## Eigen vectors: 0 0 1 0 1 0 1 0 0
cat("Eigen values:",Eigen_values)
## Eigen values: 9 9 5
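As a quick sanity check (a minimal sketch using the objects created above), the spectral decomposition should reconstruct \(S\) exactly:

# Reconstruct S from its eigenvectors and eigenvalues: V diag(lambda) V'
Eigen_vectors %*% diag(Eigen_values) %*% t(Eigen_vectors)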
Given the covariance matrix \(S\):
\[ S = \begin{pmatrix} 5 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 0 & 9 \end{pmatrix} \]
R returns the eigenvalues in decreasing order: \(\lambda_1 = 9\), \(\lambda_2 = 9\), \(\lambda_3 = 5\), with corresponding eigenvectors
\[ v_1 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}, \quad v_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad v_3 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} \]
Each principal component is the linear combination of the original variables \(X_1, X_2, X_3\) whose coefficients are the entries of the corresponding eigenvector, ordered by decreasing eigenvalue:
First Principal Component (PC1), eigenvalue 9: \[ \text{PC1} = 0 \cdot X_1 + 0 \cdot X_2 + 1 \cdot X_3 = X_3 \]
Second Principal Component (PC2), eigenvalue 9: \[ \text{PC2} = 0 \cdot X_1 + 1 \cdot X_2 + 0 \cdot X_3 = X_2 \]
Third Principal Component (PC3), eigenvalue 5: \[ \text{PC3} = 1 \cdot X_1 + 0 \cdot X_2 + 0 \cdot X_3 = X_1 \]
The principal components therefore coincide with the original variables (up to ordering): because the covariance matrix \(S\) is diagonal, the variables are already uncorrelated and no rotation is needed.
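As a numerical illustration (a sketch with simulated data, not part of the original analysis): data drawn with covariance \(S\) recover the \(X_1\) axis cleanly, while the sample PCs in the \(X_2\)/\(X_3\) plane can mix those two variables arbitrarily, because their population eigenvalues are tied at 9.

set.seed(1)
# Simulate 1000 observations with the diagonal covariance S
sim <- cbind(X1 = rnorm(1000, sd = sqrt(5)), X2 = rnorm(1000, sd = 3), X3 = rnorm(1000, sd = 3))
sim.pc <- princomp(sim)
# X1 should load (almost) entirely on Comp.3; Comp.1 and Comp.2 are
# near-arbitrary rotations within the X2-X3 plane
round(unclass(sim.pc$loadings), 2)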
HELPFUL NOTE: In R, we can do a PCA where the covariance matrix is input rather than the data matrix using code such as:
my.pc <- princomp(covmat=my.S)
summary(my.pc, loadings=T)
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 3.0000000 3.0000000 2.2360680
## Proportion of Variance 0.3913043 0.3913043 0.2173913
## Cumulative Proportion 0.3913043 0.7826087 1.0000000
##
## Loadings:
##      Comp.1 Comp.2 Comp.3
## [1,]                1
## [2,]         1
## [3,]  1
Since the eigenvalue 9 has multiplicity two (\(\lambda_1 = \lambda_2 = 9\)), the principal components in its eigenspace are not unique. Writing the corresponding eigenvectors as
\[ v_1 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}, \quad v_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \]
any normalized vector of the form
\[ a v_2 + b v_1 = \begin{pmatrix} 0 \\ a \\ b \end{pmatrix}, \quad a^2 + b^2 = 1, \]
is also an eigenvector with eigenvalue 9 and hence a valid principal component. Within the subspace spanned by \(X_2\) and \(X_3\) the principal components are therefore not unique, which the Loadings output reflects: Comp.1 and Comp.2 have the same standard deviation (3) and the same proportion of variance (0.3913).
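This can be verified numerically (a minimal sketch with an arbitrary angle):

# Any unit vector in the X2-X3 plane has variance t(w) %*% S %*% w = 9
theta <- pi/7                      # arbitrary angle
w <- c(0, sin(theta), cos(theta))  # sin(theta)*v2 + cos(theta)*v1, so a^2 + b^2 = 1
t(w) %*% my.S %*% w                # equals 9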
my.R<-matrix(c(36,5,5,4),nrow=2,ncol=2,byrow=T)
my.R
## [,1] [,2]
## [1,] 36 5
## [2,] 5 4
(Refer to HELPFUL NOTE and R TIP from Problem 1). Remember that a principal component is a linear combination of the original variables, so your answer should be in the form of linear combinations of X1 and X2.
# Calculate the eigenvalues and eigenvectors
Eigen_val_vars_R <- eigen(my.R)
Eigen_valR <- Eigen_val_vars_R$values
Eigen_vectorR <- Eigen_val_vars_R$vectors
Eigen_vectorR
## [,1] [,2]
## [1,] -0.9885545 0.1508642
## [2,] -0.1508642 -0.9885545
# Principal components
# Each principal component is a linear combination of X1 and X2, with the
# entries of the corresponding eigenvector as coefficients; two variables
# give two principal components.
# (The simple paste below works here because each X2 coefficient is negative
# and so prints its own minus sign; in general, sprintf("%+f") would be safer.)
# First principal component
pc1 <- paste0(Eigen_vectorR[1, 1], "*X1", Eigen_vectorR[2, 1], "*X2")
# Second principal component
pc2 <- paste0(Eigen_vectorR[1, 2], "*X1", Eigen_vectorR[2, 2], "*X2")
# Print principal components
cat("Principal Component 1:", pc1, "\n")
## Principal Component 1: -0.988554494713121*X1-0.150864213723749*X2
cat("Principal Component 2:", pc2)
## Principal Component 2: 0.150864213723749*X1-0.988554494713121*X2
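As a cross-check (a sketch following the HELPFUL NOTE from Problem 1; my.pc.R is a name introduced here), princomp with the covmat= argument should report standard deviations equal to the square roots of these eigenvalues, with the same loadings up to sign:

# Cross-check the hand computation against princomp on the covariance matrix
my.pc.R <- princomp(covmat = my.R)
my.pc.R$sdev     # should match sqrt(Eigen_valR)
sqrt(Eigen_valR)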
# Correlation matrix (conventionally denoted R) corresponding to the covariance matrix S
corr_matrix_R<-cov2cor(my.S)
corr_matrix_R
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
NOTE: For EACH of the following problems, also write (at least) a couple of paragraphs explaining the choices you made in doing the PCA and explaining in words what substantive conclusions about the data you can draw from the PCA. You can use relevant results and graphics to support your conclusions.
my.pc <- princomp(covmat=corr_matrix_R)
summary(my.pc, loadings=T)
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 1.0000000 1.0000000 1.0000000
## Proportion of Variance 0.3333333 0.3333333 0.3333333
## Cumulative Proportion 0.3333333 0.6666667 1.0000000
##
## Loadings:
##      Comp.1 Comp.2 Comp.3
## [1,]  1
## [2,]         1
## [3,]                1
The results from PCA based on R (the correlation matrix) and S (the covariance matrix) differ:
With S, the component standard deviations are 3, 3, and 2.24; with R they are all 1, reflecting the different scaling.
With S, the proportions of variance are 0.391, 0.391, and 0.217; with R each component explains exactly one third, so the cumulative proportions differ as well.
These differences are a scaling effect. Using the correlation matrix standardizes the variables to unit variance, so all variables contribute equally regardless of their original scale. Using the covariance matrix retains the original variances, so the variables with larger variances (here \(X_2\) and \(X_3\), with variance 9) dominate the leading components.
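The standardization can be made explicit (a minimal sketch): the correlation matrix is \(R = D^{-1/2} S D^{-1/2}\), where \(D\) is the diagonal matrix of variances.

# cov2cor is equivalent to pre- and post-multiplying by diag(1/sd)
D.inv.sqrt <- diag(1 / sqrt(diag(my.S)))
D.inv.sqrt %*% my.S %*% D.inv.sqrt  # identical to cov2cor(my.S)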
my.cor.mat <- matrix(c(1,.402,.396,.301,.305,.339,.340,
.402,1,.618,.150,.135,.206,.183,
.396,.618,1,.321,.289,.363,.345,
.301,.150,.321,1,.846,.759,.661,
.305,.135,.289,.846,1,.797,.800,
.339,.206,.363,.759,.797,1,.736,
.340,.183,.345,.661,.800,.736,1),
ncol=7, nrow=7, byrow=T)
#Perform PCA.
# (Note: princomp(x) treats x as a data matrix. To decompose the correlation
# matrix itself, the covmat= argument from the HELPFUL NOTE would be used;
# the output below comes from the call as written.)
pc <- princomp(my.cor.mat)
summary(pc, loadings=T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.6538962 0.2478205 0.14235053 0.12823663 0.09102269
## Proportion of Variance 0.7979163 0.1146078 0.03781446 0.03068767 0.01546105
## Cumulative Proportion 0.7979163 0.9125241 0.95033859 0.98102626 0.99648731
## Comp.6 Comp.7
## Standard deviation 0.043386036 0
## Proportion of Variance 0.003512689 0
## Cumulative Proportion 1.000000000 1
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## [1,] 0.166 0.809 0.260 0.481
## [2,] 0.424 -0.322 -0.508 -0.128 0.135 0.648
## [3,] 0.262 -0.467 0.794 0.129 0.102 0.237
## [4,] -0.433 0.610 -0.249 -0.455 0.395
## [5,] -0.479 -0.220 0.819 0.210
## [6,] -0.396 0.907
## [7,] -0.387 0.178 -0.768 -0.197 -0.318 0.294
# Each component's share of the total variance, from the PCA output
prop_variance <- pc$sdev^2 / sum(pc$sdev^2)
# Create a scree plot
plot(1:length(prop_variance), prop_variance * 100, type = "b",
     main = "Scree Plot: Proportion of Variance of Components",
     col = "blue", lwd = 2, pch = 19, cex = 1.5, xaxt = 'n', yaxt = 'n',
     ylab = 'Proportion of variance (%)', xlab = 'Component')
# Adding gridlines
grid(nx = NA, ny = NULL, col = "gray", lty = "dotted")
# Enhancing the title
title(col.main = "darkblue", font.main = 4, cex.main = 1.2)
# Customizing the axes
axis(1, at = 1:length(prop_variance), labels = 1:length(prop_variance), las = 1, col.axis = "darkblue", cex.axis = 0.8)
axis(2, at = seq(0, 100, by = 10), labels = seq(0, 100, by = 10), las = 1, col.axis = "darkblue", cex.axis = 0.8)
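For comparison (an alternative display, not part of the original analysis), base R's built-in screeplot() gives an equivalent plot in one line:

# Built-in alternative: scree plot of the same princomp object
screeplot(pc, npcs = length(pc$sdev), type = "lines", main = "Scree plot")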
#Getting scores
data<-data.frame(pc$scores[,1:2])
plot(Comp.2~Comp.1,data=data,col='red',main='Plot of the first two principal component scores for the head size data')
# Adding gridlines
grid(nx = NA, ny = NULL, col = "gray", lty = "dotted")
The seven measured variables are:
Feature 1 - head length
Feature 2 - head breadth
Feature 3 - face breadth
Feature 4 - left finger length
Feature 5 - left forearm length
Feature 6 - left foot length
Feature 7 - height
Component 1:
Variables with positive loadings: head breadth (0.424), face breadth (0.262), and head length (0.166); the limb and stature measurements (Features 4-7) all load negatively.
Interpretation: Component 1 contrasts the head and face measurements with the limb lengths and height, separating overall head size from overall body size.
Component 2:
Variable with a high positive loading: head length (0.809).
Interpretation: Component 2 is dominated by head length, capturing variation in head length beyond what is shared with the other measurements.
Component 3:
Variable with a high positive loading: face breadth (0.794).
Interpretation: Component 3 captures variation in face width.
Component 4:
Variable with a high positive loading: left finger length (0.610); height loads negatively (-0.768).
Interpretation: Component 4 contrasts finger length with height, capturing variation in limb proportions relative to stature.
Based on the cumulative proportion of variance and the scree plot, we retain the first three principal components: two components explain about 91% of the variability, three about 95%, and four about 98%, and the first three have eigenvalues notably larger than the remaining components. Retaining three components therefore preserves most of the information in the data while substantially reducing its dimensionality.
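The cumulative proportions quoted above can be read directly from the PCA object:

# Cumulative proportion of variance explained by the leading components
cumsum(pc$sdev^2) / sum(pc$sdev^2)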
See HELPFUL NOTE 2 and R TIP above, which will also help with this problem.
Full descriptions of the observation names are on p. 63 of Everitt (2005). Perform an appropriate PCA on the “Contents of Foodstuffs” data set, including the appropriate display of PC scores. Also make an attempt to interpret your PCs.
# Create a dataframe with the provided data
meat_data <- data.frame(
Item = c("BB Beef, braised", "HR Hamburger", "BR Beef roast", "BS Beef, steak", "BC Beef, canned",
"CB Chicken, broiled", "CC Chicken, canned", "BH Beef, heart", "LL Lamb leg, roast",
"LS Lamb shoulder, roast", "HS Smoked ham", "PR Pork roast", "PS Pork simmered",
"BT Beef tongue", "VC Veal cutlet", "FB Bluefish, baked", "AR Clams, raw",
"AC Clams, canned", "TC Crabmeat, canned", "HF Haddock, fried", "MB Mackerel, broiled",
"MC Mackerel, canned", "PF Perch, fried", "SC Salmon, canned", "DC Sardines, canned",
"UC Tuna, canned", "RC Shrimp, canned"),
Energy = c(340, 245, 420, 375, 180, 115, 170, 160, 265, 300, 340, 340, 355, 205, 185, 135, 70, 45, 90,
135, 200, 155, 195, 120, 180, 170, 110),
Protein = c(20, 21, 15, 19, 22, 20, 25, 26, 20, 18, 20, 19, 19, 18, 23, 22, 11, 7, 14, 16, 19, 16, 16,
17, 22, 25, 23),
Fat = c(28, 17, 39, 32, 10, 3, 7, 5, 20, 25, 28, 29, 30, 14, 9, 4, 1, 1, 2, 5, 13, 9, 11, 5, 9, 7, 1),
Calcium = c(9, 9, 7, 9, 17, 8, 12, 14, 9, 9, 9, 9, 9, 7, 9, 25, 82, 74, 38, 15, 5, 157, 14, 159, 367,
7, 98),
Iron = c(2.6, 2.7, 2.0, 2.5, 3.7, 1.4, 1.5, 5.9, 2.6, 2.3, 2.5, 2.5, 2.4, 2.5, 2.7, 0.6, 6.0, 5.4, 0.8,
0.5, 1.0, 1.8, 1.3, 0.7, 2.5, 1.2, 2.6)
)
# Display the dataframe
meat_data
## Item Energy Protein Fat Calcium Iron
## 1 BB Beef, braised 340 20 28 9 2.6
## 2 HR Hamburger 245 21 17 9 2.7
## 3 BR Beef roast 420 15 39 7 2.0
## 4 BS Beef, steak 375 19 32 9 2.5
## 5 BC Beef, canned 180 22 10 17 3.7
## 6 CB Chicken, broiled 115 20 3 8 1.4
## 7 CC Chicken, canned 170 25 7 12 1.5
## 8 BH Beef, heart 160 26 5 14 5.9
## 9 LL Lamb leg, roast 265 20 20 9 2.6
## 10 LS Lamb shoulder, roast 300 18 25 9 2.3
## 11 HS Smoked ham 340 20 28 9 2.5
## 12 PR Pork roast 340 19 29 9 2.5
## 13 PS Pork simmered 355 19 30 9 2.4
## 14 BT Beef tongue 205 18 14 7 2.5
## 15 VC Veal cutlet 185 23 9 9 2.7
## 16 FB Bluefish, baked 135 22 4 25 0.6
## 17 AR Clams, raw 70 11 1 82 6.0
## 18 AC Clams, canned 45 7 1 74 5.4
## 19 TC Crabmeat, canned 90 14 2 38 0.8
## 20 HF Haddock, fried 135 16 5 15 0.5
## 21 MB Mackerel, broiled 200 19 13 5 1.0
## 22 MC Mackerel, canned 155 16 9 157 1.8
## 23 PF Perch, fried 195 16 11 14 1.3
## 24 SC Salmon, canned 120 17 5 159 0.7
## 25 DC Sardines, canned 180 22 9 367 2.5
## 26 UC Tuna, canned 170 25 7 7 1.2
## 27 RC Shrimp, canned 110 23 1 98 2.6
colors <- rep(c("red", "blue", "green"), length.out = nrow(meat_data))
# Scatterplot matrix of the numeric variables (the Item column is character
# and must be excluded for plotting)
pairs(meat_data[,-1], col = colors)
# Conducting principal component analysis.
# To keep large-scale variables (e.g., Calcium) from dominating the variance,
# we work with the correlation matrix, i.e., with standardized data.
cor_meat<-cor(meat_data[,-1])
# Getting eigenvalues and eigenvectors
eigen_vec_meat<-eigen(cor_meat)
# Eigenvalues
eigen_vals_meat<-eigen_vec_meat$values
# Eigenvectors (columns give the principal component directions)
eigen_vecs_meat<-eigen_vec_meat$vectors
# Display the full decomposition
eigen_vec_meat
## eigen() decomposition
## $values
## [1] 2.197777619 1.144204758 0.848574671 0.807842783 0.001600169
##
## $vectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.6539155 0.08725829 -0.1490040 0.1985936 0.709322816
## [2,] -0.1511882 -0.69052953 0.4629211 0.5245825 -0.104059181
## [3,] -0.6394332 0.20196122 -0.2157528 0.1336768 -0.697078234
## [4,] 0.3546581 -0.00633049 -0.6521357 0.6699900 0.003161132
## [5,] 0.1219811 0.68900403 0.5400663 0.4675657 0.010235855
# Original variables
variables <- c("Energy", "Protein", "Fat", "Calcium", "Iron")
# Provided eigen decomposition output
eigen_values <- c(2.197777619, 1.144204758, 0.848574671, 0.807842783, 0.001600169)
eigen_vectors <- matrix(c(-0.6539155, 0.08725829, -0.1490040, 0.1985936, 0.709322816,
-0.1511882, -0.69052953, 0.4629211, 0.5245825, -0.104059181,
-0.6394332, 0.20196122, -0.2157528, 0.1336768, -0.697078234,
0.3546581, -0.00633049, -0.6521357, 0.6699900, 0.003161132,
0.1219811, 0.68900403, 0.5400663, 0.4675657, 0.010235855),
nrow = 5, byrow = TRUE)
Principal Component 1: PC1 = -0.65*Energy - 0.15*Protein - 0.64*Fat + 0.35*Calcium + 0.12*Iron
Principal Component 2: PC2 = 0.09*Energy - 0.69*Protein + 0.20*Fat - 0.006*Calcium + 0.69*Iron
Principal Component 3: PC3 = -0.15*Energy + 0.46*Protein - 0.22*Fat - 0.65*Calcium + 0.54*Iron
Principal Component 4: PC4 = 0.20*Energy + 0.52*Protein + 0.13*Fat + 0.67*Calcium + 0.47*Iron
Principal Component 5: PC5 = 0.71*Energy - 0.10*Protein - 0.70*Fat + 0.003*Calcium + 0.01*Iron
#Getting principal components using princomp
# (Note: as with my.cor.mat above, princomp() here treats the correlation
# matrix as a data matrix; princomp(covmat = cor(meat_data[,-1])) would
# decompose the correlation matrix itself, matching the eigen() results.)
princomp_meat<-princomp(cor(meat_data[,-1]))
summary(princomp_meat,loadings=T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.9000323 0.5090856 0.3794933 0.0139255839 0
## Proportion of Variance 0.6675742 0.2135822 0.1186838 0.0001598123 0
## Cumulative Proportion 0.6675742 0.8811564 0.9998402 1.0000000000 1
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Energy 0.606 0.148 0.354 0.692
## Protein -0.737 -0.466 0.464 -0.128
## Fat 0.598 0.203 0.215 0.228 -0.710
## Calcium -0.464 0.648 0.598
## Iron -0.229 0.634 -0.543 0.500
The loadings are the coefficients of each original variable in each principal component; their signs and magnitudes show how each variable contributes.
In Component 1 (Comp.1), Energy has a loading of 0.606, a relatively strong positive weight on this component.
Protein has a loading of -0.737 on Component 2 (Comp.2), a strong negative weight.
In Component 3 (Comp.3), Fat and Calcium have loadings of 0.215 and 0.598, respectively, both positive.
Positive loadings indicate that the variable and the component move together, negative loadings that they move in opposite directions; larger absolute values indicate stronger contributions.
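The problem also asks for an appropriate display of the PC scores. A sketch using the eigen() results above (meat_scores is a name introduced here): project the standardized data onto the eigenvectors and label each item by its two-letter code.

# PC scores: standardized data projected onto the eigenvectors
meat_scores <- scale(as.matrix(meat_data[,-1])) %*% eigen_vecs_meat
plot(meat_scores[,1], meat_scores[,2], type = "n", xlab = "PC1", ylab = "PC2",
     main = "Foodstuffs: first two PC scores")
# Label points with the two-letter item codes
text(meat_scores[,1], meat_scores[,2], labels = substr(meat_data$Item, 1, 2), cex = 0.8)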
This R code will read in the data:
USairpol.full <- read.table("/cloud/project/airpoll.txt", quote="\"", comment.char="",header=T)
city.names <- as.character(USairpol.full[,1])
USairpol.data <- USairpol.full[,-1]
#EDA - Scatterplot matrix with least-squares (solid) and lowess (dashed) lines
pairs(USairpol.data, panel = function(x, y) {
  abline(lsfit(x, y)$coef, lwd = 2)
  lines(lowess(x, y), lty = 2, lwd = 2)
  points(x, y)
})
# Define a custom panel function for Coplot of SO2 and Mortality conditional on population density
custom_panel <- function(x, y, col, pch) {
  panel.smooth(x, y, col = col, pch = pch, span = 1)
}
# Call coplot with the custom panel function
coplot(Mortality ~ SO2 | Popden, panel = custom_panel, data = USairpol.data)
# Function to detect outliers using Z-score
detect_outliers <- function(x, threshold = 3) {
z_scores <- scale(x)
outliers <- abs(z_scores) > threshold
return(outliers)
}
# Detect outliers per column
outliers_per_column <- apply(USairpol.data, 2, detect_outliers)
# Filter dataframe to return only rows with outliers
rows_with_outliers <- apply(outliers_per_column, 1, any)
data_with_outliers <- USairpol.data[rows_with_outliers, ]
# Filter dataframe to return only columns with outliers
columns_with_outliers <- apply(outliers_per_column, 2, any)
data_with_outliers <- data_with_outliers[, columns_with_outliers]
# Print dataframe with outliers
print(data_with_outliers)
## Popden NOX SO2
## 12 6122 63 278
## 29 4700 319 130
## 40 3437 59 263
## 48 4253 171 86
## 59 9699 8 49
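Z-scores flag univariate outliers only. A complementary multivariate check (a sketch, not part of the original analysis) uses the Mahalanobis distance from the mean vector:

# Multivariate outlier check via Mahalanobis distance
md <- mahalanobis(USairpol.data, colMeans(USairpol.data), cov(USairpol.data))
# Flag observations beyond the 0.999 chi-squared quantile (df = number of variables)
which(md > qchisq(0.999, df = ncol(USairpol.data)))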
# PCA based on the correlation matrix
# (Same caveat as above: princomp(covmat = cor(USairpol.data)) would
# decompose the correlation matrix itself; the output below is from the
# call as written.)
pc_us_airpol<-princomp(cor(USairpol.data))
summary(pc_us_airpol,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.9012738 0.5789673 0.34222488 0.17167377 0.15697445
## Proportion of Variance 0.6129862 0.2529562 0.08838131 0.02224053 0.01859496
## Cumulative Proportion 0.6129862 0.8659424 0.95432374 0.97656427 0.99515923
## Comp.6 Comp.7
## Standard deviation 0.08009190 7.430301e-09
## Proportion of Variance 0.00484077 4.166295e-17
## Cumulative Proportion 1.00000000 1.000000e+00
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Rainfall 0.531 0.260 0.179 0.290 0.291 0.167 0.650
## Education -0.495 0.312 -0.220 -0.448 0.626
## Popden -0.511 0.568 -0.595 0.223
## Nonwhite 0.350 -0.594 -0.550 0.462
## NOX -0.344 -0.389 -0.445 0.645 -0.156 0.302
## SO2 -0.614 -0.173 0.448 -0.474 0.367 0.177
## Mortality 0.476 -0.196 -0.237 -0.102 -0.264 -0.765 0.115
To remove one or two outlying observations, try R code such as:
out.row1 <- 12
USairpol.data.reduced <- USairpol.data[-out.row1,]
out.row2 <- 29
USairpol.data.reduced.more <- USairpol.data[-c(out.row1, out.row2),]
# Re-run the PCA on the reduced data (same caveat about covmat= as above)
pc_us_airpol<-princomp(cor(USairpol.data.reduced.more))
summary(pc_us_airpol,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.8479981 0.5748198 0.3728810 0.18313149 0.13362071
## Proportion of Variance 0.5765435 0.2649145 0.1114764 0.02688861 0.01431495
## Cumulative Proportion 0.5765435 0.8414580 0.9529344 0.97982297 0.99413792
## Comp.6 Comp.7
## Standard deviation 0.085507589 0
## Proportion of Variance 0.005862079 0
## Cumulative Proportion 1.000000000 1
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Rainfall 0.514 0.324 0.144 0.314 0.386 0.594
## Education -0.537 0.257 -0.126 -0.151 -0.360 0.684
## Popden 0.109 -0.440 0.618 -0.575 0.266
## Nonwhite 0.350 0.141 -0.543 -0.537 -0.102 0.513
## NOX -0.231 -0.483 -0.468 0.647 -0.155 0.219
## SO2 0.130 -0.605 -0.133 0.483 -0.442 0.347 0.224
## Mortality 0.493 -0.128 -0.223 -0.143 -0.310 -0.752
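Rather than retyping the loadings by hand for the comparison below (which risks transcription slips), the full matrices can be extracted programmatically (a sketch; load_before and load_after are names introduced here). Note that print() merely hides small loadings; they are present in the object.

# Full loadings matrices, before and after outlier removal
load_before <- unclass(princomp(cor(USairpol.data))$loadings)
load_after <- unclass(pc_us_airpol$loadings)
round(load_before, 3)
round(load_after, 3)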
# Define data for comparison
results_before <- data.frame(
Component = c("Comp.1", "Comp.2", "Comp.3", "Comp.4", "Comp.5"),
`Standard deviation` = c(0.901, 0.579, 0.342, 0.172, 0.157),
`Proportion of Variance` = c("61.3%", "25.3%", "8.8%", "2.2%", "1.9%")
)
results_after <- data.frame(
Component = c("Comp.1", "Comp.2", "Comp.3", "Comp.4", "Comp.5"),
`Standard deviation` = c(0.848, 0.575, 0.373, 0.183, 0.134),
`Proportion of Variance` = c("57.7%", "26.5%", "11.1%", "2.7%", "1.4%")
)
# Print comparison table
cat("Before Removing Outliers:")
## Before Removing Outliers:
knitr::kable(results_before)
| Component | Standard.deviation | Proportion.of.Variance |
|---|---|---|
| Comp.1 | 0.901 | 61.3% |
| Comp.2 | 0.579 | 25.3% |
| Comp.3 | 0.342 | 8.8% |
| Comp.4 | 0.172 | 2.2% |
| Comp.5 | 0.157 | 1.9% |
cat("\n\n")
cat("After Removing Outliers:")
## After Removing Outliers:
knitr::kable(results_after)
| Component | Standard.deviation | Proportion.of.Variance |
|---|---|---|
| Comp.1 | 0.848 | 57.7% |
| Comp.2 | 0.575 | 26.5% |
| Comp.3 | 0.373 | 11.1% |
| Comp.4 | 0.183 | 2.7% |
| Comp.5 | 0.134 | 1.4% |
# Loadings before removing outliers
loadings_before <- data.frame(
Variable = c("Rainfall", "Education", "Popden", "Nonwhite", "NOX", "SO2", "Mortality"),
Comp.1 = c(0.531, -0.495, NA, 0.350, -0.344, -0.614, 0.476),
Comp.2 = c(0.260, 0.312, NA, NA, -0.389, -0.173, -0.196),
Comp.3 = c(0.179, NA, 0.568, -0.594, -0.445, 0.448, -0.237),
Comp.4 = c(0.290, -0.220, -0.595, -0.550, NA, -0.474, -0.102),
Comp.5 = c(0.291, -0.448, NA, 0.462, 0.645, 0.367, -0.264),
Comp.6 = c(0.167, NA, 0.223, NA, -0.156, 0.177, -0.765),
Comp.7 = c(0.650, 0.626, NA, NA, 0.302, 0.177, 0.115)
)
# Print loadings before removing outliers
knitr::kable(loadings_before)
| Variable | Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 | Comp.7 |
|---|---|---|---|---|---|---|---|
| Rainfall | 0.531 | 0.260 | 0.179 | 0.290 | 0.291 | 0.167 | 0.650 |
| Education | -0.495 | 0.312 | NA | -0.220 | -0.448 | NA | 0.626 |
| Popden | NA | NA | 0.568 | -0.595 | NA | 0.223 | NA |
| Nonwhite | 0.350 | NA | -0.594 | -0.550 | 0.462 | NA | NA |
| NOX | -0.344 | -0.389 | -0.445 | NA | 0.645 | -0.156 | 0.302 |
| SO2 | -0.614 | -0.173 | 0.448 | -0.474 | 0.367 | 0.177 | 0.177 |
| Mortality | 0.476 | -0.196 | -0.237 | -0.102 | -0.264 | -0.765 | 0.115 |
# Loadings after removing outliers
loadings_after <- data.frame(
Variable = c("Rainfall", "Education", "Popden", "Nonwhite", "NOX", "SO2", "Mortality"),
Comp.1 = c(0.514, -0.537, 0.109, 0.350, -0.231, 0.130, 0.493),
Comp.2 = c(0.324, 0.257, -0.440, 0.141, -0.483, -0.605, -0.128),
Comp.3 = c(0.144, -0.126, 0.618, -0.543, -0.468, -0.133, -0.223),
Comp.4 = c(0.314, -0.151, -0.575, -0.537, NA, 0.483, -0.143),
Comp.5 = c(0.386, -0.360, NA, -0.102, 0.647, -0.442, -0.310),
Comp.6 = c(NA, NA, 0.266, 0.513, -0.155, 0.347, -0.752),
Comp.7 = c(0.594, 0.684, NA, NA, 0.219, 0.224, 0.115)
)
# Print loadings after removing outliers
knitr::kable(loadings_after)
| Variable | Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 | Comp.7 |
|---|---|---|---|---|---|---|---|
| Rainfall | 0.514 | 0.324 | 0.144 | 0.314 | 0.386 | NA | 0.594 |
| Education | -0.537 | 0.257 | -0.126 | -0.151 | -0.360 | NA | 0.684 |
| Popden | 0.109 | -0.440 | 0.618 | -0.575 | NA | 0.266 | NA |
| Nonwhite | 0.350 | 0.141 | -0.543 | -0.537 | -0.102 | 0.513 | NA |
| NOX | -0.231 | -0.483 | -0.468 | NA | 0.647 | -0.155 | 0.219 |
| SO2 | 0.130 | -0.605 | -0.133 | 0.483 | -0.442 | 0.347 | 0.224 |
| Mortality | 0.493 | -0.128 | -0.223 | -0.143 | -0.310 | -0.752 | 0.115 |
In summary, the PCA before and after removing outliers shows notable changes in both the variance structure and the loadings. From the tables above: the standard deviation of Comp.1 fell from 0.901 to 0.848 and its share of the variance from 61.3% to 57.7%, while Comp.2 and Comp.3 picked up more of the variance (25.3% to 26.5% and 8.8% to 11.1%, respectively). Among the loadings, SO2's weight on Comp.1 changed most visibly (from -0.614 to 0.130), consistent with the two removed cities being extreme in SO2, NOX, and population density.
These observations underscore the impact of outlier removal on PCA: dropping a few extreme observations redistributes the explained variance across components and can materially change how individual variables load, yielding a more faithful representation of the underlying data structure.