Calculate the covariance matrix and correlation plot. Comment on any characteristics that are important for PCA.
# Covariance matrix:
print("Stars Covariance Matrix:")
## [1] "Stars Covariance Matrix:"
round(cov(stars), digits = 1)
##             Ascension Declination Mag10 Mag_Earth Log_Dist
## Ascension        48.2         4.7   0.1       0.6     -0.1
## Declination       4.7      1730.7   2.4       5.9     -0.7
## Mag10             0.1         2.4   0.7       0.4      0.1
## Mag_Earth         0.6         5.9   0.4       2.8     -0.5
## Log_Dist         -0.1        -0.7   0.1      -0.5      0.1
The covariance matrix shows that Declination has a far larger variance (1730.7) than any other variable, while Mag10 (0.7) and Log_Dist (0.1) have variances below 1. The variables therefore need to be rescaled before PCA. Otherwise, Declination will dominate the first PC, and Mag10 and Log_Dist will be effectively ignored.
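To put a number on how dominant Declination would be without rescaling, a quick sketch can compute each variable's share of the total variance. The inputs are the rounded diagonal entries of the covariance matrix printed above, so the shares are approximate, and the names `variances` and `variance_share` are introduced here only for illustration.

```r
# Variances copied from the (rounded) diagonal of the covariance matrix above
variances <- c(
  Ascension   = 48.2,
  Declination = 1730.7,
  Mag10       = 0.7,
  Mag_Earth   = 2.8,
  Log_Dist    = 0.1
)

# Share of the total variance contributed by each variable
variance_share <- variances / sum(variances)
round(variance_share, digits = 3)
```

Declination accounts for roughly 97% of the total variance. Because the first PC of an unscaled (covariance-based) PCA has a variance at least as large as that of the most variable input, it would be driven almost entirely by Declination.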
ggcorrplot(
  corr = cor(stars),
  method = "square",
  type = "lower",
  colors = c("tomato", "white", "steelblue"),
  hc.order = TRUE,
  ggtheme = theme_test,
  lab = TRUE
)
There is a strong negative correlation between Log_Dist and Mag_Earth, but no other pair of variables is strongly associated. The correlation plot alone gives no clear evidence of a linear dependency.
Create the scatterplot matrix for the stars. Does there appear to be a problem with multicollinearity?
ggpairs(
  data = stars
)
There is no clear problem with multicollinearity, since most of the scatterplots show no obvious linear pattern.
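Pairwise scatterplots only show relationships between two variables at a time, so a dependency that involves three or more variables can go unnoticed. The sketch below uses simulated data (not the stars data; the names `x`, `y`, `z`, and `toy` are hypothetical) in which `z` is an exact linear combination of `x` and `y`, yet its pairwise correlations with each of them are expected to be weak. The smallest eigenvalue of the correlation matrix is what exposes the dependency.

```r
set.seed(1)
n <- 500

# Simulated variables: y is strongly negatively related to x, and z = x + y exactly
x <- rnorm(n)
y <- -0.9 * x + rnorm(n, sd = 0.4)
z <- x + y
toy <- data.frame(x = x, y = y, z = z)

# Pairwise correlations: x and y are strongly related, but z looks only
# weakly related to either one
toy_cor <- cor(toy)
round(toy_cor, digits = 2)

# An eigenvalue that is numerically zero signals an exact linear dependency
toy_eigenvalues <- eigen(toy_cor, symmetric = TRUE)$values
toy_eigenvalues
```

This is why a scatterplot matrix that looks unremarkable does not rule out multicollinearity on its own, and why the eigenvalues from the PCA are worth checking as well.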
Using the correlation matrix, perform PCA. How many PCs should be included? Justify your answer.
# Performing PCA on the correlation matrix (scale. = TRUE standardizes each variable)
star_pc <-
  prcomp(
    x = stars,
    scale. = TRUE
  )
# Creating the scree plot
fviz_screeplot(
  star_pc,
  choice = "eigenvalue",
  geom = "line"
)
Based on the scree plot, there appear to be 4 meaningful PCs. The last PC has an eigenvalue (variance) of 0, which indicates that there is a linear dependency in the data.
Let’s check the cumulative variance:
summary(star_pc)
## Importance of components:
##                          PC1    PC2    PC3    PC4       PC5
## Standard deviation     1.373 1.0784 0.9957 0.9802 1.746e-15
## Proportion of Variance 0.377 0.2326 0.1983 0.1922 0.000e+00
## Cumulative Proportion  0.377 0.6096 0.8078 1.0000 1.000e+00
The first PC has a larger variance than PC2, PC3, and PC4, which all have variances of about 1. Because those three variances are nearly equal, keeping PC2 means we should also keep PC3 and PC4. So we should use 4 PCs.
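The proportions in the summary table follow directly from the standard deviations: each eigenvalue is a squared standard deviation, and because the PCA uses the correlation matrix of 5 variables, the eigenvalues sum to 5. A small sketch that reproduces the table from the printed values (the object names are illustrative, and the inputs are rounded, so the results agree only to about 3 decimals):

```r
# Standard deviations copied from summary(star_pc) above
pc_sd <- c(PC1 = 1.373, PC2 = 1.0784, PC3 = 0.9957, PC4 = 0.9802, PC5 = 1.746e-15)

# Eigenvalues (variances of the PCs) are the squared standard deviations
eigenvalues <- pc_sd^2

# Proportion and cumulative proportion of the total variance
prop_var <- eigenvalues / sum(eigenvalues)
cum_var <- cumsum(prop_var)

round(
  rbind(Eigenvalue = eigenvalues, Proportion = prop_var, Cumulative = cum_var),
  digits = 3
)
```

The eigenvalues for PC2 through PC4 come out at roughly 1.16, 0.99, and 0.96, which is the "about 1" referred to above.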
Interpret the first 2 PCs. Only include the variable in the interpretation if its contribution to the PC is greater than 0.45.
star_pc$rotation |>
  data.frame() |>
  dplyr::select(PC1, PC2)
##                    PC1       PC2
## Ascension   0.07188780 0.1534198
## Declination 0.10895180 0.3718865
## Mag10       0.01654733 0.8588054
## Mag_Earth   0.70465407 0.1710202
## Log_Dist   -0.69724492 0.2671482
PC1: A contrast between Mag_Earth (loading 0.70) and Log_Dist (loading -0.70), so it is essentially the difference between the two.
PC2: Mostly Mag10 (loading 0.86).
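The 0.45 cutoff can also be applied programmatically. The sketch below re-enters the loadings printed above and compares their absolute values to the cutoff, which is how Log_Dist (with a negative loading) qualifies for PC1. The names `loadings`, `cutoff`, and `important` are made up for this example.

```r
# Loadings for PC1 and PC2 copied from the output above
loadings <- matrix(
  c(0.07188780, 0.10895180, 0.01654733, 0.70465407, -0.69724492,
    0.1534198,  0.3718865,  0.8588054,  0.1710202,   0.2671482),
  ncol = 2,
  dimnames = list(
    c("Ascension", "Declination", "Mag10", "Mag_Earth", "Log_Dist"),
    c("PC1", "PC2")
  )
)

cutoff <- 0.45

# For each PC, keep the variables whose absolute loading exceeds the cutoff
important <- lapply(
  colnames(loadings),
  function(pc) rownames(loadings)[abs(loadings[, pc]) > cutoff]
)
names(important) <- colnames(loadings)
important
```

Only Mag_Earth and Log_Dist pass the cutoff for PC1, and only Mag10 passes it for PC2, matching the interpretation above.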
fviz_pca_biplot(
  star_pc,
  geom = "point"
)
The horizontal axis (PC1) is mostly the difference between Mag_Earth and Log_Dist, since their arrows are the longest and the closest to horizontal, and they point in opposite directions.
The vertical axis (PC2) is almost entirely Mag10.
Yes, the biplot confirms our previous interpretation.
Create a biplot for the 3rd and 4th PC. What can you determine about these 2 PCs?
fviz_pca_biplot(
  star_pc,
  geom = "point",
  axes = c(3, 4)
)
PC3 is just Ascension while PC4 is just Declination.
What does the last PC tell you about the 5 variables?
star_pc$rotation |>
  data.frame() |>
  dplyr::select(PC5) |>
  round(digits = 3)
##                PC5
## Ascension    0.000
## Declination  0.000
## Mag10        0.339
## Mag_Earth   -0.666
## Log_Dist    -0.665
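The last PC shows that the linear dependency in the data, expressed in terms of the standardized variables, is approximately:
Mag10 = 2\(\times\)Mag_Earth + 2\(\times\)Log_Dist
or, equivalently,
1/3\(\times\)Mag10 = 2/3\(\times\)Mag_Earth + 2/3\(\times\)Log_Dist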
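# Plot Mag10 against the combination suggested by PC5;
# points falling on a straight line confirm the dependency
ggplot(
  data = stars,
  mapping = aes(
    x = 2 * Log_Dist / sd(Log_Dist) + 2 * Mag_Earth / sd(Mag_Earth),
    y = Mag10
  )
) +
  geom_point() +
  labs(
    x = "2 Log Distance + 2 Magnitude at Earth (each scaled by its SD)",
    y = "Magnitude 10"
  )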
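The coefficients in the dependency can be recovered from the PC5 loadings. Since PC5 has zero variance, its loadings define a combination of the standardized variables that is always zero, and solving that equation for Mag10 gives the coefficients on the other two variables. A sketch using the rounded loadings printed above (`pc5` and `coefs` are names made up for this example):

```r
# PC5 loadings copied from the rounded output above
pc5 <- c(
  Ascension   = 0,
  Declination = 0,
  Mag10       = 0.339,
  Mag_Earth   = -0.666,
  Log_Dist    = -0.665
)

# 0.339*Mag10 - 0.666*Mag_Earth - 0.665*Log_Dist = 0 (standardized variables),
# so solving for Mag10 divides the negated remaining loadings by 0.339
coefs <- -pc5[c("Mag_Earth", "Log_Dist")] / pc5[["Mag10"]]
round(coefs, digits = 2)
```

Both coefficients come out at about 1.96, which is where the approximate factor of 2 on Mag_Earth and Log_Dist comes from.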
From either biplot in e) or f), can you tell that there are different groups in the stars?
fviz_pca_biplot(
  star_pc,
  geom = "point"
)
fviz_pca_biplot(
  star_pc,
  geom = "point",
  axes = c(3, 4)
)
There aren’t any clear groups in the data, but the 3rd PC (Ascension) gives some indication of 2 groups. Looking back at the density plot for Ascension, it is bimodal, which suggests there are probably 2 groups.