Question 3a) Covariance Matrix and Correlation Plot

Calculate the covariance matrix and correlation plot. Comment on any important characteristics for PCA.

# Covariance matrix:
print("Stars Covariance Matrix:")
## [1] "Stars Covariance Matrix:"
round(cov(stars), digits = 1)
##             Ascension Declination Mag10 Mag_Earth Log_Dist
## Ascension        48.2         4.7   0.1       0.6     -0.1
## Declination       4.7      1730.7   2.4       5.9     -0.7
## Mag10             0.1         2.4   0.7       0.4      0.1
## Mag_Earth         0.6         5.9   0.4       2.8     -0.5
## Log_Dist         -0.1        -0.7   0.1      -0.5      0.1

The covariance matrix shows that Declination has a far larger variance (1730.7) than any other variable, while Mag10 (0.7) and Log_Dist (0.1) have variances below 1. The data therefore needs to be rescaled before PCA. Otherwise, Declination will dominate the first PC and Mag10 and Log_Dist will be effectively ignored.
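The effect of scaling can be sketched on simulated data. This is only an illustration, not the stars data: the variables `big` and `small` and their standard deviations are assumptions chosen to mimic one high-variance and one low-variance column.

```r
# Simulated stand-in data (NOT the stars data): one high-variance and one
# low-variance variable that are mildly correlated.
set.seed(1)
n <- 300
big   <- rnorm(n, mean = 0, sd = 40)                 # variance around 1600
small <- 0.002 * big + rnorm(n, mean = 0, sd = 0.3)  # variance around 0.1
sim <- data.frame(big = big, small = small)

pc_raw    <- prcomp(sim, scale. = FALSE)  # PCA on the covariance matrix
pc_scaled <- prcomp(sim, scale. = TRUE)   # PCA on the correlation matrix

# Loadings of the first PC under each choice
round(pc_raw$rotation[, "PC1"], 3)
round(pc_scaled$rotation[, "PC1"], 3)
```

Without scaling, the first PC is almost entirely `big`. With scaling, both variables get equal weight in absolute value.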

ggcorrplot(
  corr = cor(stars),
  method = "square",
  type = "lower",
  colors = c("tomato", "white", "steelblue"),
  hc.order = TRUE,
  ggtheme = theme_test,
  lab = TRUE
)

Log_Dist and Mag_Earth are strongly negatively correlated, but no other pair of variables shows a strong association. The correlation plot alone gives no clear evidence of a linear dependency.

Question 3b) Create the scatterplot matrix

Create the scatterplot matrix for the stars. Does there appear to be a problem with multicollinearity?

ggpairs(
  data = stars
)

There is no clear problem with multicollinearity, since most of the pairwise plots show no strong linear pattern.
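A scatterplot matrix only shows two variables at a time, so a dependency involving three or more variables may not be visible in it. As a hedged illustration on simulated data (the variables `x1`, `x2`, `x3` are made up and are not from the stars data), the smallest eigenvalue of the correlation matrix can flag a dependency even when no pairwise correlation is extreme:

```r
# Simulated example (NOT the stars data): x3 is an exact sum of x1 and x2.
set.seed(2)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2
sim <- data.frame(x1, x2, x3)

r <- cor(sim)

# Largest pairwise correlation in absolute value
max_pairwise <- max(abs(r[upper.tri(r)]))

# Smallest eigenvalue of the correlation matrix; a value near 0 signals
# a linear dependency among the variables
min_eigen <- min(eigen(r, symmetric = TRUE)$values)

round(r, 2)
max_pairwise
min_eigen
```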

Question 3c) Perform PCA

Using the correlation matrix, perform PCA. How many PCs should be included? Justify your answer.

# Performing PCA
star_pc <- 
  prcomp(
    x = stars,
    scale. = TRUE
  )

# Creating the scree plot
fviz_screeplot(
  star_pc,
  choice = "eigenvalue",
  geom = "line"
)

Based on the scree plot, there appear to be 4 PCs with nonzero variance. The last PC has an eigenvalue (variance) of 0, indicating that there is a linear dependency in the data.

Let’s check the cumulative variance:

summary(star_pc)
## Importance of components:
##                          PC1    PC2    PC3    PC4       PC5
## Standard deviation     1.373 1.0784 0.9957 0.9802 1.746e-15
## Proportion of Variance 0.377 0.2326 0.1983 0.1922 0.000e+00
## Cumulative Proportion  0.377 0.6096 0.8078 1.0000 1.000e+00

PC1 has a larger variance than PC2, PC3, and PC4, which all have variances of about 1. Because those three variances are nearly equal, if we keep PC2 we should also keep PC3 and PC4. So we should use 4 PCs.
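The eigenvalues behind this comparison are the squared standard deviations from the summary. A minimal sketch that recomputes them from the printed values (typed in by hand here, so they carry the rounding of the printed output):

```r
# Standard deviations copied from the summary(star_pc) output above
sdev <- c(PC1 = 1.373, PC2 = 1.0784, PC3 = 0.9957, PC4 = 0.9802, PC5 = 1.746e-15)

eigenvalues <- sdev^2                          # variance of each PC
prop_var    <- eigenvalues / sum(eigenvalues)  # proportion of variance
cum_var     <- cumsum(prop_var)                # cumulative proportion

round(rbind(eigenvalues, prop_var, cum_var), 3)
```

With the actual fitted object, the same quantities would come from `star_pc$sdev^2`.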

Question 3d) Interpret the first 2 PCs

Interpret the first 2 PCs. Only include the variable in the interpretation if its contribution to the PC is greater than 0.45.

star_pc$rotation |> 
  data.frame() |> 
  dplyr::select(PC1, PC2)
##                     PC1       PC2
## Ascension    0.07188780 0.1534198
## Declination  0.10895180 0.3718865
## Mag10        0.01654733 0.8588054
## Mag_Earth    0.70465407 0.1710202
## Log_Dist    -0.69724492 0.2671482

PC1: It appears to be the difference between Mag_Earth (0.70) and Log_Dist (-0.70).

PC2: It is mostly Mag10 (0.86).
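The 0.45 cutoff can also be applied programmatically. The sketch below retypes the printed loadings into a matrix (an assumption made so the example is self-contained; with the fitted object one would use `star_pc$rotation[, 1:2]` instead) and compares absolute loadings to the cutoff:

```r
# Loadings copied from the star_pc$rotation output above
loadings <- matrix(
  c(0.07188780, 0.10895180, 0.01654733, 0.70465407, -0.69724492,
    0.1534198,  0.3718865,  0.8588054,  0.1710202,   0.2671482),
  ncol = 2,
  dimnames = list(
    c("Ascension", "Declination", "Mag10", "Mag_Earth", "Log_Dist"),
    c("PC1", "PC2")
  )
)

cutoff <- 0.45

# For each PC, keep the variables whose absolute loading exceeds the cutoff
important <- lapply(
  colnames(loadings),
  function(pc) names(which(abs(loadings[, pc]) > cutoff))
)
names(important) <- colnames(loadings)
important
```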

Question 3e) Biplot of PC 1 and PC 2

Create a biplot for the first 2 PCs. Does the plot agree with your interpretation from 3d)?

fviz_pca_biplot(
  star_pc,
  geom = "point"
)

The horizontal axis is mostly the difference between Mag_Earth and Log_Dist, since their arrows are the most horizontal and the longest.

The vertical axis is almost all Mag10.

Yes, the biplot confirms our previous interpretation.

Question 3f) Biplot of PC 3 and PC 4

Create a biplot for the 3rd and 4th PC. What can you determine about these 2 PCs?

fviz_pca_biplot(
  star_pc,
  geom = "point",
  axes = c(3, 4)
)

PC3 is just Ascension while PC4 is just Declination.

Question 3g) Last PC

What does the last PC tell you about the 5 variables?

star_pc$rotation |> 
  data.frame() |> 
  dplyr::select(PC5) |> 
  round(digits = 3)
##                PC5
## Ascension    0.000
## Declination  0.000
## Mag10        0.339
## Mag_Earth   -0.666
## Log_Dist    -0.665

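Because the last PC has a variance of 0, its loadings give the linear dependency in the data. Ascension and Declination have loadings of 0, so they are not part of it. For the standardized variables (centered and divided by their standard deviations), the dependency is approximately:

Mag10 = 2\(\times\)Mag_Earth + 2\(\times\)Log_Dist

or

\(\frac{1}{3}\)Mag10 = \(\frac{2}{3}\)Mag_Earth + \(\frac{2}{3}\)Log_Dist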

ggplot(
  data = stars,
  mapping = aes(
    # Divide by the SD because the PC5 loadings apply to standardized variables
    x = 2 * Log_Dist / sd(Log_Dist) + 2 * Mag_Earth / sd(Mag_Earth),
    y = Mag10
  )
) +
  geom_point() +
  labs(
    x = "2 Log Distance / SD + 2 Magnitude at Earth / SD",
    y = "Magnitude 10"
  )
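Loadings from a scaled PCA describe standardized variables, so dividing each loading by that variable's standard deviation expresses the dependency in the original units. The sketch below checks this on simulated data, not the stars data: the coefficients 1 and 5 and all means and standard deviations are hypothetical values chosen only so the recovered numbers can be verified.

```r
# Simulated stand-in data (NOT the stars data). The coefficients 1 and 5
# below are hypothetical, chosen only so the recovered values can be checked.
set.seed(3)
n <- 200
sim <- data.frame(
  Ascension   = rnorm(n, mean = 12, sd = 7),
  Declination = rnorm(n, mean = 0,  sd = 40),
  Mag_Earth   = rnorm(n, mean = 5,  sd = 1.7),
  Log_Dist    = rnorm(n, mean = 2,  sd = 0.3)
)
sim$Mag10 <- 1 * sim$Mag_Earth + 5 * sim$Log_Dist  # exact linear dependency

sim_pc <- prcomp(sim, scale. = TRUE)

# Loadings of the zero-variance PC apply to standardized variables;
# dividing by each variable's SD moves them back to the original units.
last_pc   <- sim_pc$rotation[, ncol(sim_pc$rotation)]
raw_coefs <- last_pc / sim_pc$scale

# Solve the dependency for Mag10:
# Mag10 = a * Mag_Earth + b * Log_Dist + constant
solved <- -raw_coefs[c("Mag_Earth", "Log_Dist")] / raw_coefs[["Mag10"]]
round(solved, 3)
```

Applied to the fitted object, the same steps would use `star_pc$rotation[, "PC5"]` and `star_pc$scale`.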

Question 3h) Groups in Stars?

From either biplot in e) or f), can you tell that there are different groups in the stars?

fviz_pca_biplot(
  star_pc,
  geom = "point"
)

fviz_pca_biplot(
  star_pc,
  geom = "point",
  axes = c(3, 4)
)

There aren’t any clear groups in the data, but the 3rd PC (Ascension) gives some indication of 2 groups. The density plot of Ascension is bimodal, which suggests there are probably 2 groups.