library(tidyverse)
library(ggplot2)
library(MASS)
library(GGally)
library(plot3D)

At the end of this session you should be able to:

  1. explore PCA through multivariate data;

  2. understand how to apply PCA using covariance;

  3. understand how to apply PCA using correlation (standardising/scaling);

  4. understand how to interpret the R output using princomp and pcomp in R.

Things you may need to know/do.

Question 1 Revisiting Aircraft data

Consider the aircraft data (aircraft.csv) with variables Year, Period, Power, Span, Length, Weight, Speed and Range. The data are available under Week 1 folder or in the Data Sets folder on LMS.

Data description (from the R help for package sm)

These data record six characteristics of aircraft designs which appeared during the twentieth century.

The variables are: - Year: year of first manufacture - Period: a code to indicate one of three broad time periods - Power: total engine power (kW) - Span: wing span (m) - Length: length (m) - Weight: maximum take-off weight (kg) - Speed: maximum speed (km/h) - Range: range (km)

Source: The data were collected by P. Saviotti and are described in detail in Saviotti (1996), “Technological Evolution, Variety and Economy”, Edward Elgar: Cheltenham.

The dataframe aircraft refers to the aircraft data from the previous lab, the version with the logged variables. In this lab we work only with the logged variables plus Year and Period.

Use prcomp for this question.

  Year Period Power Span Length Weight Speed Range
1   14      1  82.0 12.8   7.60   1070   105   400
2   14      1  82.0 11.0   9.00    830   145   402
3   14      1 223.6 17.9  10.35   2200   135   500
4   15      1 164.0 14.5   9.80   1946   138   500
5   15      1 119.0 12.9   7.90   1190   140   400
6   15      1  74.5  7.5   6.30    653   177   350

Lab Method:

  1. Conduct a PCA of the raw (unscaled) data, that is, the six logged variables. Report the eigenvectors and the eigenvalues.
# Creating a new dataframe with logged variables
aircraft_logged <- mutate(aircraft, Period = factor(Period), logPower = log10(Power),
    logSpan = log10(Span), logLength = log10(Length), logWeight = log10(Weight),
    logSpeed = log10(Speed), logRange = log10(Range)) %>%
    dplyr::select(Year, Period, starts_with("log"))


# Method 1 pca.aircraft=prcomp(select(aircraft, -Year, -Period)) Method 2
noms = names(aircraft_logged)[3:8]

rhs = paste0(noms, collapse = " + ")

fmla = as.formula(paste("~ ", rhs))

print(fmla)
~logPower + logSpan + logLength + logWeight + logSpeed + logRange
pca.aircraft = prcomp(fmla, data = aircraft_logged)

# Display the eigenvectors (rotation matrix)
print(pca.aircraft$rotation)
                PC1        PC2        PC3         PC4          PC5         PC6
logPower  0.6789041 -0.3370631  0.1337237  0.63152740 -0.092702676  0.01323106
logSpan   0.1475060  0.4778834  0.2150124  0.06127707 -0.048909835 -0.83515987
logLength 0.2128740  0.2307281  0.1568771 -0.27071156 -0.863486982  0.24071613
logWeight 0.5880074  0.2578687  0.2481639 -0.48182127  0.480207328  0.25182325
logSpeed  0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange  0.2715571  0.3756882 -0.8800761  0.09264058 -0.009418365  0.04370613
# Display the eigenvalues (squared singular values)
eigenvalues <- (pca.aircraft$sdev)^2
print(eigenvalues)
[1] 0.900183940 0.086144332 0.044225537 0.005585039 0.002357161 0.002158438

Eigenvectors:

logPower 0.6789041 -0.3370631 0.1337237 0.63152740 -0.092702676 0.01323106 logSpan 0.1475060 0.4778834 0.2150124 0.06127707 -0.048909835 -0.83515987 logLength 0.2128740 0.2307281 0.1568771 -0.27071156 -0.863486982 0.24071613 logWeight 0.5880074 0.2578687 0.2481639 -0.48182127 0.480207328 0.25182325 logSpeed 0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895 logRange 0.2715571 0.3756882 -0.8800761 0.09264058 -0.009418365 0.04370613

Eigenvalues:

[1] 0.900183940 0.086144332 0.044225537 0.005585039 0.002357161 0.002158438

  1. Apply print, summary and str to the object produced by prcomp and work out the correspondence between these outputs and the terminology of PCA concepts used in the lecture notes. Note that, unlike the other two commands, str shows what the object actually is (i.e. its structure).
cat("*****print(pca_result)****: This command prints the standard deviations of the principal components and the rotation matrix, which contains the eigenvectors.
The standard deviations are related to the square roots of the eigenvalues of the covariance matrix. The larger the standard deviation, the more variation that principal component explains.
The rotation matrix shows how the original variables are weighted to form each principal component (PC). Each column corresponds to a principal component, and each row corresponds to the original variables.

",
    sep = "\n")
*****print(pca_result)****: This command prints the standard deviations of the principal components and the rotation matrix, which contains the eigenvectors.
The standard deviations are related to the square roots of the eigenvalues of the covariance matrix. The larger the standard deviation, the more variation that principal component explains.
The rotation matrix shows how the original variables are weighted to form each principal component (PC). Each column corresponds to a principal component, and each row corresponds to the original variables.
print(pca.aircraft)
Standard deviations (1, .., p=6):
[1] 0.94878024 0.29350355 0.21029869 0.07473312 0.04855061 0.04645899

Rotation (n x k) = (6 x 6):
                PC1        PC2        PC3         PC4          PC5         PC6
logPower  0.6789041 -0.3370631  0.1337237  0.63152740 -0.092702676  0.01323106
logSpan   0.1475060  0.4778834  0.2150124  0.06127707 -0.048909835 -0.83515987
logLength 0.2128740  0.2307281  0.1568771 -0.27071156 -0.863486982  0.24071613
logWeight 0.5880074  0.2578687  0.2481639 -0.48182127  0.480207328  0.25182325
logSpeed  0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange  0.2715571  0.3756882 -0.8800761  0.09264058 -0.009418365  0.04370613
cat("****summary(pca_result)****: This provides a summary of the PCA, including the importance of components.The 'Standard deviation' section reiterates the standard deviations of the principal components.
'Proportion of Variance' indicates the fraction of the total variance explained by each principal component.
'Cumulative Proportion' shows the cumulative variance explained when you consider the components together up to that point.

",
    sep = "\n")
****summary(pca_result)****: This provides a summary of the PCA, including the importance of components.The 'Standard deviation' section reiterates the standard deviations of the principal components.
'Proportion of Variance' indicates the fraction of the total variance explained by each principal component.
'Cumulative Proportion' shows the cumulative variance explained when you consider the components together up to that point.
summary(pca.aircraft)
Importance of components:
                          PC1     PC2    PC3     PC4     PC5     PC6
Standard deviation     0.9488 0.29350 0.2103 0.07473 0.04855 0.04646
Proportion of Variance 0.8650 0.08278 0.0425 0.00537 0.00227 0.00207
Cumulative Proportion  0.8650 0.94780 0.9903 0.99566 0.99793 1.00000
cat("****str(pca_result)*****: This function reveals the structure of the prcomp object.
$ sdev: Numerical vector of the standard deviations (square roots of the eigenvalues) of the principal components.
$ rotation: The matrix of eigenvectors (each column is an eigenvector).
$ center: The means of the original variables, which are subtracted from the data if center = TRUE.
$ scale: Logical value indicating whether the variables have been scaled to unit variance before the analysis.
$ x: The matrix of the principal component scores, which are the original data projected onto the principal components.

",
    sep = "\n")
****str(pca_result)*****: This function reveals the structure of the prcomp object.
$ sdev: Numerical vector of the standard deviations (square roots of the eigenvalues) of the principal components.
$ rotation: The matrix of eigenvectors (each column is an eigenvector).
$ center: The means of the original variables, which are subtracted from the data if center = TRUE.
$ scale: Logical value indicating whether the variables have been scaled to unit variance before the analysis.
$ x: The matrix of the principal component scores, which are the original data projected onto the principal components.
str(pca.aircraft)
List of 6
 $ sdev    : num [1:6] 0.9488 0.2935 0.2103 0.0747 0.0486 ...
 $ rotation: num [1:6, 1:6] 0.679 0.148 0.213 0.588 0.229 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
  .. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
 $ center  : Named num [1:6] 3.08 1.2 1.13 3.8 2.64 ...
  ..- attr(*, "names")= chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
 $ scale   : logi FALSE
 $ x       : num [1:709, 1:6] -1.61 -1.64 -1.03 -1.17 -1.44 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:709] "1" "2" "3" "4" ...
  .. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
 $ call    : language prcomp(formula = fmla, data = aircraft_logged)
 - attr(*, "class")= chr "prcomp"
  1. Make a scree plot and a plot of the cumulative contribution to total variance. (Hint: cumsum.)
plot(pca.aircraft)

# Assuming pca.aircraft is the result from prcomp Extract eigenvalues (squared
# singular values) eigenvalues <- pca.aircraft$sdev^2

# Create a sequence for the x-axis representing each principal component pc_seq
# <- 1:length(eigenvalues)

# Create the scree plot using base R plot function plot(pc_seq, eigenvalues,
# type = 'b', xlab = 'Principal Component', ylab = 'Eigenvalue', main = 'Scree
# Plot', pch = 19, col = 'blue')

# Create the scree plot using base R plot function
plot(1:length(eigenvalues), pca.aircraft$sdev^2, type = "b", xlab = "Principal Component",
    ylab = "Eigenvalue", main = "Scree Plot", col = "black")

plot(cumsum(pca.aircraft$sdev^2)/sum(pca.aircraft$sdev^2), type = "b", ylab = "Cumulative variance")

  1. Plot a smoothed histogram (density) of the PC1 scores and separately of the PC2 scores.
forplot = cbind(aircraft, pca.aircraft$x)
names(forplot)
 [1] "Year"   "Period" "Power"  "Span"   "Length" "Weight" "Speed"  "Range" 
 [9] "PC1"    "PC2"    "PC3"    "PC4"    "PC5"    "PC6"   
ggplot(forplot, aes(PC1)) + geom_density()

ggplot(forplot, aes(PC2)) + geom_density()

ggplot(forplot, aes(PC3)) + geom_density()

  1. Make a scatterplot of the PC1 scores (x-axis) against the PC2 scores (y-axis) and colour the points according to the value of Period. Overlay contours of a 2D density estimate.
ggplot(forplot, aes(PC1, PC2)) + geom_point() + geom_density_2d()

ggplot(forplot, aes(PC1, PC2, colour = Period)) + geom_point() + geom_density_2d()
Warning: The following aesthetics were dropped during statistical transformation: colour
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

ggplot(forplot, aes(PC1, PC3, colour = Period)) + geom_point() + geom_density_2d()
Warning: The following aesthetics were dropped during statistical transformation: colour
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

ggpairs(forplot, columns = 9:14)

  1. Make a 2D biplot. (Hint: biplot, with col = c("white", "blue").)
biplot(pca.aircraft, col = c("white", "blue"), choices = 1:2)

biplot(pca.aircraft, col = c("red", "blue"), choices = 1:2)

ChatGPT Method:

  1. Conduct a PCA of the raw (unscaled) data, that is, the six logged variables. Report the eigenvectors and the eigenvalues.
# Assuming 'aircraft' is already loaded with read.csv()
# Note: Ensure that you only take the log of positive, non-zero values.
# The log function cannot be applied to zero or negative values.

# Creating a new dataframe with logged variables
# Creating a new dataframe with logged variables
aircraft_logged <- mutate(aircraft, 
    # Convert the 'Period' variable to a factor (categorical variable)
    Period = factor(Period), 

    # Apply logarithmic transformation (base 10) to the 'Power' variable
    logPower = log10(Power), 

    # Apply logarithmic transformation (base 10) to the 'Span' variable
    logSpan = log10(Span),

    # Apply logarithmic transformation (base 10) to the 'Length' variable
    logLength = log10(Length),

    # Apply logarithmic transformation (base 10) to the 'Weight' variable
    logWeight = log10(Weight),

    # Apply logarithmic transformation (base 10) to the 'Speed' variable
    logSpeed = log10(Speed),

    # Apply logarithmic transformation (base 10) to the 'Range' variable
    logRange = log10(Range)
) %>%
    # Select only specific columns: 'Year', 'Period', and those that start with "log"
    dplyr::select(Year, Period, starts_with("log"))


# Perform PCA on the logged variables
# Select only the logged variables for PCA
pca_result <- prcomp(aircraft_logged[,c('logPower', 'logSpan', 'logLength', 'logWeight', 'logSpeed', 'logRange')],
                     center = TRUE, scale. = FALSE)

# Display the eigenvectors (rotation matrix)
print(pca_result$rotation)
                PC1        PC2        PC3         PC4          PC5         PC6
logPower  0.6789041 -0.3370631  0.1337237  0.63152740 -0.092702676  0.01323106
logSpan   0.1475060  0.4778834  0.2150124  0.06127707 -0.048909835 -0.83515987
logLength 0.2128740  0.2307281  0.1568771 -0.27071156 -0.863486982  0.24071613
logWeight 0.5880074  0.2578687  0.2481639 -0.48182127  0.480207328  0.25182325
logSpeed  0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange  0.2715571  0.3756882 -0.8800761  0.09264058 -0.009418365  0.04370613
# Display the eigenvalues (squared singular values)
eigenvalues <- (pca_result$sdev)^2
print(eigenvalues)
[1] 0.900183940 0.086144332 0.044225537 0.005585039 0.002357161 0.002158438

Eigenvectors Rotation matrix

PC1 PC2 PC3 PC4 PC5 PC6 logPower 0.6789041 -0.3370631 0.1337237 0.63152740 -0.092702676 0.01323106 logSpan 0.1475060 0.4778834 0.2150124 0.06127707 -0.048909835 -0.83515987 logLength 0.2128740 0.2307281 0.1568771 -0.27071156 -0.863486982 0.24071613 logWeight 0.5880074 0.2578687 0.2481639 -0.48182127 0.480207328 0.25182325 logSpeed 0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895 logRange 0.2715571 0.3756882 -0.8800761 0.09264058 -0.009418365 0.04370613

Eigenvalues

[1] 0.900183940 0.086144332 0.044225537 0.005585039 0.002357161 0.002158438

  1. Apply print, summary and str to the object produced by prcomp and work out the correspondence between these outputs and the terminology of PCA concepts used in the lecture notes. Note that, unlike the other two commands, str shows what the object actually is (i.e. its structure).
cat("*****print(pca_result)****: This command prints the standard deviations of the principal components and the rotation matrix, which contains the eigenvectors.
The standard deviations are related to the square roots of the eigenvalues of the covariance matrix. The larger the standard deviation, the more variation that principal component explains.
The rotation matrix shows how the original variables are weighted to form each principal component (PC). Each column corresponds to a principal component, and each row corresponds to the original variables.

",
    sep = "\n")
*****print(pca_result)****: This command prints the standard deviations of the principal components and the rotation matrix, which contains the eigenvectors.
The standard deviations are related to the square roots of the eigenvalues of the covariance matrix. The larger the standard deviation, the more variation that principal component explains.
The rotation matrix shows how the original variables are weighted to form each principal component (PC). Each column corresponds to a principal component, and each row corresponds to the original variables.
print(pca_result)
Standard deviations (1, .., p=6):
[1] 0.94878024 0.29350355 0.21029869 0.07473312 0.04855061 0.04645899

Rotation (n x k) = (6 x 6):
                PC1        PC2        PC3         PC4          PC5         PC6
logPower  0.6789041 -0.3370631  0.1337237  0.63152740 -0.092702676  0.01323106
logSpan   0.1475060  0.4778834  0.2150124  0.06127707 -0.048909835 -0.83515987
logLength 0.2128740  0.2307281  0.1568771 -0.27071156 -0.863486982  0.24071613
logWeight 0.5880074  0.2578687  0.2481639 -0.48182127  0.480207328  0.25182325
logSpeed  0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange  0.2715571  0.3756882 -0.8800761  0.09264058 -0.009418365  0.04370613
cat("****summary(pca_result)****: This provides a summary of the PCA, including the importance of components.The 'Standard deviation' section reiterates the standard deviations of the principal components.
'Proportion of Variance' indicates the fraction of the total variance explained by each principal component.
'Cumulative Proportion' shows the cumulative variance explained when you consider the components together up to that point.

",
    sep = "\n")
****summary(pca_result)****: This provides a summary of the PCA, including the importance of components.The 'Standard deviation' section reiterates the standard deviations of the principal components.
'Proportion of Variance' indicates the fraction of the total variance explained by each principal component.
'Cumulative Proportion' shows the cumulative variance explained when you consider the components together up to that point.
summary(pca_result)
Importance of components:
                          PC1     PC2    PC3     PC4     PC5     PC6
Standard deviation     0.9488 0.29350 0.2103 0.07473 0.04855 0.04646
Proportion of Variance 0.8650 0.08278 0.0425 0.00537 0.00227 0.00207
Cumulative Proportion  0.8650 0.94780 0.9903 0.99566 0.99793 1.00000
cat("****str(pca_result)*****: This function reveals the structure of the prcomp object.
$ sdev: Numerical vector of the standard deviations (square roots of the eigenvalues) of the principal components.
$ rotation: The matrix of eigenvectors (each column is an eigenvector).
$ center: The means of the original variables, which are subtracted from the data if center = TRUE.
$ scale: Logical value indicating whether the variables have been scaled to unit variance before the analysis.
$ x: The matrix of the principal component scores, which are the original data projected onto the principal components.

",
    sep = "\n")
****str(pca_result)*****: This function reveals the structure of the prcomp object.
$ sdev: Numerical vector of the standard deviations (square roots of the eigenvalues) of the principal components.
$ rotation: The matrix of eigenvectors (each column is an eigenvector).
$ center: The means of the original variables, which are subtracted from the data if center = TRUE.
$ scale: Logical value indicating whether the variables have been scaled to unit variance before the analysis.
$ x: The matrix of the principal component scores, which are the original data projected onto the principal components.
str(pca_result)
List of 5
 $ sdev    : num [1:6] 0.9488 0.2935 0.2103 0.0747 0.0486 ...
 $ rotation: num [1:6, 1:6] 0.679 0.148 0.213 0.588 0.229 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
  .. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
 $ center  : Named num [1:6] 3.08 1.2 1.13 3.8 2.64 ...
  ..- attr(*, "names")= chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
 $ scale   : logi FALSE
 $ x       : num [1:709, 1:6] -1.61 -1.64 -1.03 -1.17 -1.44 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
 - attr(*, "class")= chr "prcomp"
  1. Plot a smoothed histogram (density) of the PC1 scores and separately of the PC2 scores.
library(ggplot2)

# Assuming pca_result is already obtained from the prcomp function and contains
# the principal component scores in pca_result$x

# Convert the PC scores to a dataframe for easier manipulation with ggplot2
pc_scores <- as.data.frame(pca_result$x)

# Plot smoothed histogram (density) for PC1
ggplot(pc_scores, aes(x = PC1)) + geom_density(fill = "blue", alpha = 0.5) + labs(title = "Density Plot of PC1 Scores",
    x = "PC1 Score", y = "Density") + theme_minimal()

# Plot smoothed histogram (density) for PC2
ggplot(pc_scores, aes(x = PC2)) + geom_density(fill = "red", alpha = 0.5) + labs(title = "Density Plot of PC2 Scores",
    x = "PC2 Score", y = "Density") + theme_minimal()

library(ggplot2)

# Assuming pca_result is already obtained from the prcomp function

# Extract eigenvalues
eigenvalues <- (pca_result$sdev)^2  # Squared singular values are the eigenvalues

# Create a dataframe for plotting the scree plot
pca_df <- data.frame(PC = paste0('PC', seq_along(eigenvalues)), 
                     Eigenvalues = eigenvalues)


# Generate the scree plot
ggplot(pca_df, aes(x = PC, y = Eigenvalues)) +
  geom_point() +  # Add points for each principal component
  geom_line(aes(group = 1)) +  # Connect points with a line
  theme_minimal() +  # Use a minimal theme
  labs(title = "Scree Plot", x = "Principal Component", y = "Eigenvalue")

# Calculate the cumulative variance
pca_df$CumulativeVariance <- cumsum(pca_df$Eigenvalues) / sum(pca_df$Eigenvalues)

# Generate the plot for cumulative contribution to total variance
ggplot(pca_df, aes(x = PC, y = CumulativeVariance)) +
  geom_point() +  # Add points for each principal component
  geom_line(aes(group = 1)) +  # Connect points with a line
  theme_minimal() +  # Use a minimal theme
  labs(title = "Cumulative Contribution to Total Variance", 
       x = "Principal Component", y = "Cumulative Proportion of Variance Explained")

Question 2 Revisiting Aircraft data: scaled

Your turn to answer this question.

Repeat Question 1 but with scaled log data. How do the two analyses differ? Why do we see these differences?

Question 3 Simulation

We work with the matrix \(\Sigma_0\) from Q1 of Lab 2 which is given again below. As in Q2 of Lab 2 we generate random samples, with \(n = 250\), from each of the following distributions

   (i) the multivariate normal distribution ${\cal N}(0,\Sigma_0 )$

  (ii) the multivariate $t$-distribution $t_6(0,\Sigma_0)$ in $\nu =6$ degrees of freedom. 
  1. Start with the random sample of the \({\cal N}(0,\Sigma_0 )\) distribution and calculate the covariance matrix of the random sample.

  2. Calculate the eigenvalues and eigenvectors of the covariance matrix and compare to those of the population matrix \(\Sigma_0\) (from Lab 2).

  3. Show a graph of the eigenvalues and of the cumulative contribution to total variance against the index.

  4. Plot a smoothed histogram (density) of all six PC scores.

  5. Make a scatterplot of the PC1 scores (x-axis) against the PC2 scores (y-axis). Overlay contours of a 2D density estimate.

  6. Repeat parts (a) to (e) for the random sample of the \(t_6(0,\Sigma_0)\) distribution.

  7. Compare \(\Sigma_0\) with the two sample covariance matrices: how different are the eigenvalues and eigenvectors of the three matrices? Which are most similar? Why do they differ?

Sigma0 = matrix(c(3.1386518, 0.38872659, 0.6178228, 1.7576847, 0.77433973, 0.7508789,
    0.3887266, 1.21417949, 0.1941453, 0.451892, 0.01236855, 0.2155466, 0.6178228,
    0.19414529, 1.2437919, 0.597032, 0.15151088, 0.2665018, 1.7576847, 0.45189196,
    0.597032, 1.7083497, 0.52685668, 0.7109476, 0.7743397, 0.01236855, 0.1515109,
    0.5268567, 0.53406192, 0.2299193, 0.7508789, 0.21554658, 0.2665018, 0.7109476,
    0.22991933, 0.6642375), byrow = TRUE, nrow = 6)

Code chunk for generating random samples from the t-distribution in df degrees of freedom.

Remember that you will need the library mvtnorm here.

Sigma0sym = 0.5 * (Sigma0 + t(Sigma0))
Xmat = rmvt(100, sigma = Sigma0sym, df = 6)

Question 4 Example 8.1 JW (Calculating the population principal components)

Suppose the random variables \(X_1, X_2, X_3\) have the covariance matrix

\[ \Sigma=\begin{pmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 &0 &2 \end{pmatrix}.\]

  1. Compute the eigenvalues and eigenvector.

  2. Write the principal component equations.

  3. Calculate \(Var(Y_1)\) and \(Cov(Y_1, Y_2)\) and the total variance.

  4. Calculate the proportions of the total variance of the first PC and the first two PC. Comment.

Question 5 Head lengths of first and second sons (EH, 2011)

The headsize data provided in (a) give the head lengths and head breadths (in millimetres) for each of the first two adult sons in 25 families. Here we shall use only the head lengths (head1 and head2).

(a). Read the data set in R and assign an appropriate name to the data object.

"headsize" <- +matrix(c(191, 195, 181, 183, 176, 208, 189, 197, 188, 192, 179, 183,
    174, 190, 188, 163, 195, 186, 181, 175, 192, 174, +176, 197, 190, 155, 149, 148,
    153, 144, 157, 150, 159, 152, 150, 158, 147, 150, 159, 151, 137, 155, 153, +145,
    140, 154, 143, 139, 167, 163, 179, 201, 185, 188, 171, 192, 190, 189, 197, 187,
    186, 174, 185, 195, +187, 161, 183, 173, 182, 165, 185, 178, 176, 200, 187, 145,
    152, 149, 149, 142, 152, 149, 152, 159, 151, +148, 147, 152, 157, 158, 130, 158,
    148, 146, 137, 152, 147, 143, 158, 150), nrow = 25, ncol = 4, dimnames = list(character(0),
    c("head1", "breadth1", "head2", "breadth2")))

(b). Compute the array of sample means, variance-covariance matrix \(S\)

(c). Perform PCA on the dataset based on \(S\).

(d). Write the equations in (c) and determine the proportion of the total sample variance explained by the first two principal components.

(e). Interpret the first two sample PCA components, obtained in (d).