library(tidyverse)
library(ggplot2)
library(MASS)
library(GGally)
library(plot3D)
At the end of this session you should be able to:
explore PCA through multivariate data;
understand how to apply PCA using covariance;
understand how to apply PCA using correlation (standardising/scaling);
understand how to interpret the R output using princomp and prcomp in R.
Relates to lecture 2.
Libraries ggplot2, MASS, tidyverse, GGally may be useful. Others may be mentioned below in the hints.
For Q3 you will also need the library mvtnorm.
You might like to set up a project for each lab if you are using RStudio. Then you can copy a .Rmd file into that directory and write your answers in that file.
Consider the aircraft data (aircraft.csv) with variables Year, Period, Power, Span, Length, Weight, Speed and Range. The data are available under Week 1 folder or in the Data Sets folder on LMS.
These data record six characteristics of aircraft designs which appeared during the twentieth century.
The variables are:
- Year: year of first manufacture
- Period: a code to indicate one of three broad time periods
- Power: total engine power (kW)
- Span: wing span (m)
- Length: length (m)
- Weight: maximum take-off weight (kg)
- Speed: maximum speed (km/h)
- Range: range (km)
Source: The data were collected by P. Saviotti and are described in detail in Saviotti (1996), “Technological Evolution, Variety and Economy”, Edward Elgar: Cheltenham.
The dataframe aircraft refers to the aircraft data from the previous lab, the version with the logged variables. In this lab we work only with the logged variables plus Year and Period.
Use prcomp for this question.
Year Period Power Span Length Weight Speed Range
1 14 1 82.0 12.8 7.60 1070 105 400
2 14 1 82.0 11.0 9.00 830 145 402
3 14 1 223.6 17.9 10.35 2200 135 500
4 15 1 164.0 14.5 9.80 1946 138 500
5 15 1 119.0 12.9 7.90 1190 140 400
6 15 1 74.5 7.5 6.30 653 177 350
Lab Method:
# Creating a new dataframe with logged variables
aircraft_logged <- mutate(aircraft, Period = factor(Period), logPower = log10(Power),
logSpan = log10(Span), logLength = log10(Length), logWeight = log10(Weight),
logSpeed = log10(Speed), logRange = log10(Range)) %>%
dplyr::select(Year, Period, starts_with("log"))
# Method 1: pca.aircraft = prcomp(select(aircraft_logged, -Year, -Period))
# Method 2: build a formula from the names of the logged variables
noms = names(aircraft_logged)[3:8]
rhs = paste0(noms, collapse = " + ")
fmla = as.formula(paste("~ ", rhs))
print(fmla)
~logPower + logSpan + logLength + logWeight + logSpeed + logRange
pca.aircraft = prcomp(fmla, data = aircraft_logged)
# Display the eigenvectors (rotation matrix)
print(pca.aircraft$rotation)
PC1 PC2 PC3 PC4 PC5 PC6
logPower 0.6789041 -0.3370631 0.1337237 0.63152740 -0.092702676 0.01323106
logSpan 0.1475060 0.4778834 0.2150124 0.06127707 -0.048909835 -0.83515987
logLength 0.2128740 0.2307281 0.1568771 -0.27071156 -0.863486982 0.24071613
logWeight 0.5880074 0.2578687 0.2481639 -0.48182127 0.480207328 0.25182325
logSpeed 0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange 0.2715571 0.3756882 -0.8800761 0.09264058 -0.009418365 0.04370613
# Display the eigenvalues (squared singular values)
eigenvalues <- (pca.aircraft$sdev)^2
print(eigenvalues)
[1] 0.900183940 0.086144332 0.044225537 0.005585039 0.002357161 0.002158438
Eigenvectors:
                PC1        PC2        PC3         PC4          PC5         PC6
logPower  0.6789041 -0.3370631  0.1337237  0.63152740 -0.092702676  0.01323106
logSpan   0.1475060  0.4778834  0.2150124  0.06127707 -0.048909835 -0.83515987
logLength 0.2128740  0.2307281  0.1568771 -0.27071156 -0.863486982  0.24071613
logWeight 0.5880074  0.2578687  0.2481639 -0.48182127  0.480207328  0.25182325
logSpeed  0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange  0.2715571  0.3756882 -0.8800761  0.09264058 -0.009418365  0.04370613
Eigenvalues:
[1] 0.900183940 0.086144332 0.044225537 0.005585039 0.002357161 0.002158438
Apply print, summary and str to the object produced by prcomp and work out the correspondence between these outputs and the terminology of PCA concepts used in the lecture notes. Note that, unlike the other two commands, str shows what the object actually is (i.e. its structure).
cat("*****print(pca_result)****: This command prints the standard deviations of the principal components and the rotation matrix, which contains the eigenvectors.
The standard deviations are the square roots of the eigenvalues of the covariance matrix. The larger the standard deviation, the more variation that principal component explains.
The rotation matrix shows how the original variables are weighted to form each principal component (PC). Each column corresponds to a principal component, and each row corresponds to the original variables.
",
sep = "\n")
*****print(pca_result)****: This command prints the standard deviations of the principal components and the rotation matrix, which contains the eigenvectors.
The standard deviations are the square roots of the eigenvalues of the covariance matrix. The larger the standard deviation, the more variation that principal component explains.
The rotation matrix shows how the original variables are weighted to form each principal component (PC). Each column corresponds to a principal component, and each row corresponds to the original variables.
print(pca.aircraft)
Standard deviations (1, .., p=6):
[1] 0.94878024 0.29350355 0.21029869 0.07473312 0.04855061 0.04645899
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5 PC6
logPower 0.6789041 -0.3370631 0.1337237 0.63152740 -0.092702676 0.01323106
logSpan 0.1475060 0.4778834 0.2150124 0.06127707 -0.048909835 -0.83515987
logLength 0.2128740 0.2307281 0.1568771 -0.27071156 -0.863486982 0.24071613
logWeight 0.5880074 0.2578687 0.2481639 -0.48182127 0.480207328 0.25182325
logSpeed 0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange 0.2715571 0.3756882 -0.8800761 0.09264058 -0.009418365 0.04370613
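As a cross-check (a minimal sketch, assuming aircraft_logged from the chunk above is still in scope), the printed standard deviations should be the square roots of the eigenvalues of the sample covariance matrix of the six logged variables:
# Eigen-decomposition of the sample covariance matrix of the logged variables
S <- cov(aircraft_logged[, 3:8])
round(sqrt(eigen(S)$values), 6)  # should match pca.aircraft$sdev
round(pca.aircraft$sdev, 6)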
cat("****summary(pca_result)****: This provides a summary of the PCA, including the importance of components.The 'Standard deviation' section reiterates the standard deviations of the principal components.
'Proportion of Variance' indicates the fraction of the total variance explained by each principal component.
'Cumulative Proportion' shows the cumulative variance explained when you consider the components together up to that point.
",
sep = "\n")
****summary(pca_result)****: This provides a summary of the PCA, including the importance of components. The 'Standard deviation' section reiterates the standard deviations of the principal components.
'Proportion of Variance' indicates the fraction of the total variance explained by each principal component.
'Cumulative Proportion' shows the cumulative variance explained when you consider the components together up to that point.
summary(pca.aircraft)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 0.9488 0.29350 0.2103 0.07473 0.04855 0.04646
Proportion of Variance 0.8650 0.08278 0.0425 0.00537 0.00227 0.00207
Cumulative Proportion 0.8650 0.94780 0.9903 0.99566 0.99793 1.00000
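The 'Proportion of Variance' and 'Cumulative Proportion' rows can be reproduced directly from the eigenvalues (a quick sketch using the objects above):
# Each eigenvalue divided by the total variance
round(pca.aircraft$sdev^2 / sum(pca.aircraft$sdev^2), 5)
# Running total of the proportions
round(cumsum(pca.aircraft$sdev^2) / sum(pca.aircraft$sdev^2), 5)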
cat("****str(pca_result)*****: This function reveals the structure of the prcomp object.
$ sdev: Numerical vector of the standard deviations (square roots of the eigenvalues) of the principal components.
$ rotation: The matrix of eigenvectors (each column is an eigenvector).
$ center: The means of the original variables, which are subtracted from the data if center = TRUE.
$ scale: Logical value indicating whether the variables have been scaled to unit variance before the analysis.
$ x: The matrix of the principal component scores, which are the original data projected onto the principal components.
",
sep = "\n")
****str(pca_result)*****: This function reveals the structure of the prcomp object.
$ sdev: Numerical vector of the standard deviations (square roots of the eigenvalues) of the principal components.
$ rotation: The matrix of eigenvectors (each column is an eigenvector).
$ center: The means of the original variables, which are subtracted from the data if center = TRUE.
$ scale: Logical value indicating whether the variables have been scaled to unit variance before the analysis.
$ x: The matrix of the principal component scores, which are the original data projected onto the principal components.
str(pca.aircraft)
List of 6
$ sdev : num [1:6] 0.9488 0.2935 0.2103 0.0747 0.0486 ...
$ rotation: num [1:6, 1:6] 0.679 0.148 0.213 0.588 0.229 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
.. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:6] 3.08 1.2 1.13 3.8 2.64 ...
..- attr(*, "names")= chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
$ scale : logi FALSE
$ x : num [1:709, 1:6] -1.61 -1.64 -1.03 -1.17 -1.44 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:709] "1" "2" "3" "4" ...
.. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
$ call : language prcomp(formula = fmla, data = aircraft_logged)
- attr(*, "class")= chr "prcomp"
cumsum.)
plot(pca.aircraft)
# Assuming pca.aircraft is the result from prcomp, extract the eigenvalues
# (squared singular values):
#   eigenvalues <- pca.aircraft$sdev^2
# Create a sequence for the x-axis representing each principal component:
#   pc_seq <- 1:length(eigenvalues)
# Create the scree plot using the base R plot function:
#   plot(pc_seq, eigenvalues, type = "b", xlab = "Principal Component",
#        ylab = "Eigenvalue", main = "Scree Plot", pch = 19, col = "blue")
# Create the scree plot using base R plot function
plot(1:length(eigenvalues), pca.aircraft$sdev^2, type = "b", xlab = "Principal Component",
ylab = "Eigenvalue", main = "Scree Plot", col = "black")
plot(cumsum(pca.aircraft$sdev^2)/sum(pca.aircraft$sdev^2), type = "b", ylab = "Cumulative variance")
Index = c(1:6)
plot(Index, pca.aircraft$sdev^2, type = "b", ylim = c(0, 1))
lines(Index, cumsum(pca.aircraft$sdev^2)/sum(pca.aircraft$sdev^2), type = "b", col = 2)
forplot = cbind(aircraft, pca.aircraft$x)
names(forplot)
[1] "Year" "Period" "Power" "Span" "Length" "Weight" "Speed" "Range"
[9] "PC1" "PC2" "PC3" "PC4" "PC5" "PC6"
ggplot(forplot, aes(PC1)) + geom_density()
ggplot(forplot, aes(PC2)) + geom_density()
ggplot(forplot, aes(PC3)) + geom_density()
# Assuming forplot is a dataframe that includes PC1, PC2, PC3 and a categorical
# variable Period
ggplot(forplot, aes(x = PC1, y = PC2)) + geom_point() + geom_density_2d()
# Adjusted plot to include color for different Periods
ggplot(forplot, aes(x = PC2, y = PC3, colour = Period, group = Period)) + geom_point() +
geom_density_2d()
# Adjusted plot for PC1 vs PC3 with color
ggplot(forplot, aes(x = PC1, y = PC3, colour = Period, group = Period)) + geom_point() +
geom_density_2d()
# For ggpairs - ensure columns 9 to 14 are correctly specified for your dataset
ggpairs(forplot, columns = 9:14)
biplot, with col = c("white", "blue").)
biplot(pca.aircraft, col = c("white", "blue"), choices = 1:2)
biplot(pca.aircraft, col = c("red", "blue"), choices = 1:2)
# Assuming 'aircraft' is already loaded with read.csv()
# Note: Ensure that you only take the log of positive, non-zero values.
# The log function cannot be applied to zero or negative values.
# Creating a new dataframe with logged variables
aircraft_logged <- mutate(aircraft,
# Convert the 'Period' variable to a factor (categorical variable)
Period = factor(Period),
# Apply logarithmic transformation (base 10) to the 'Power' variable
logPower = log10(Power),
# Apply logarithmic transformation (base 10) to the 'Span' variable
logSpan = log10(Span),
# Apply logarithmic transformation (base 10) to the 'Length' variable
logLength = log10(Length),
# Apply logarithmic transformation (base 10) to the 'Weight' variable
logWeight = log10(Weight),
# Apply logarithmic transformation (base 10) to the 'Speed' variable
logSpeed = log10(Speed),
# Apply logarithmic transformation (base 10) to the 'Range' variable
logRange = log10(Range)
) %>%
# Select only specific columns: 'Year', 'Period', and those that start with "log"
dplyr::select(Year, Period, starts_with("log"))
# Perform PCA on the logged variables
# Select only the logged variables for PCA
pca_result <- prcomp(aircraft_logged[,c('logPower', 'logSpan', 'logLength', 'logWeight', 'logSpeed', 'logRange')],
center = TRUE, scale. = FALSE)
# Display the eigenvectors (rotation matrix)
print(pca_result$rotation)
PC1 PC2 PC3 PC4 PC5 PC6
logPower 0.6789041 -0.3370631 0.1337237 0.63152740 -0.092702676 0.01323106
logSpan 0.1475060 0.4778834 0.2150124 0.06127707 -0.048909835 -0.83515987
logLength 0.2128740 0.2307281 0.1568771 -0.27071156 -0.863486982 0.24071613
logWeight 0.5880074 0.2578687 0.2481639 -0.48182127 0.480207328 0.25182325
logSpeed 0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange 0.2715571 0.3756882 -0.8800761 0.09264058 -0.009418365 0.04370613
# Display the eigenvalues (squared singular values)
eigenvalues <- (pca_result$sdev)^2
print(eigenvalues)
[1] 0.900183940 0.086144332 0.044225537 0.005585039 0.002357161 0.002158438
Eigenvectors (rotation matrix):
                PC1        PC2        PC3         PC4          PC5         PC6
logPower  0.6789041 -0.3370631  0.1337237  0.63152740 -0.092702676  0.01323106
logSpan   0.1475060  0.4778834  0.2150124  0.06127707 -0.048909835 -0.83515987
logLength 0.2128740  0.2307281  0.1568771 -0.27071156 -0.863486982  0.24071613
logWeight 0.5880074  0.2578687  0.2481639 -0.48182127  0.480207328  0.25182325
logSpeed  0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange  0.2715571  0.3756882 -0.8800761  0.09264058 -0.009418365  0.04370613
Eigenvalues
[1] 0.900183940 0.086144332 0.044225537 0.005585039 0.002357161 0.002158438
Apply print, summary and str to the object produced by prcomp and work out the correspondence between these outputs and the terminology of PCA concepts used in the lecture notes. Note that, unlike the other two commands, str shows what the object actually is (i.e. its structure).
cat("*****print(pca_result)****: This command prints the standard deviations of the principal components and the rotation matrix, which contains the eigenvectors.
The standard deviations are the square roots of the eigenvalues of the covariance matrix. The larger the standard deviation, the more variation that principal component explains.
The rotation matrix shows how the original variables are weighted to form each principal component (PC). Each column corresponds to a principal component, and each row corresponds to the original variables.
",
sep = "\n")
*****print(pca_result)****: This command prints the standard deviations of the principal components and the rotation matrix, which contains the eigenvectors.
The standard deviations are the square roots of the eigenvalues of the covariance matrix. The larger the standard deviation, the more variation that principal component explains.
The rotation matrix shows how the original variables are weighted to form each principal component (PC). Each column corresponds to a principal component, and each row corresponds to the original variables.
print(pca_result)
Standard deviations (1, .., p=6):
[1] 0.94878024 0.29350355 0.21029869 0.07473312 0.04855061 0.04645899
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5 PC6
logPower 0.6789041 -0.3370631 0.1337237 0.63152740 -0.092702676 0.01323106
logSpan 0.1475060 0.4778834 0.2150124 0.06127707 -0.048909835 -0.83515987
logLength 0.2128740 0.2307281 0.1568771 -0.27071156 -0.863486982 0.24071613
logWeight 0.5880074 0.2578687 0.2481639 -0.48182127 0.480207328 0.25182325
logSpeed 0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange 0.2715571 0.3756882 -0.8800761 0.09264058 -0.009418365 0.04370613
cat("****summary(pca_result)****: This provides a summary of the PCA, including the importance of components.The 'Standard deviation' section reiterates the standard deviations of the principal components.
'Proportion of Variance' indicates the fraction of the total variance explained by each principal component.
'Cumulative Proportion' shows the cumulative variance explained when you consider the components together up to that point.
",
sep = "\n")
****summary(pca_result)****: This provides a summary of the PCA, including the importance of components. The 'Standard deviation' section reiterates the standard deviations of the principal components.
'Proportion of Variance' indicates the fraction of the total variance explained by each principal component.
'Cumulative Proportion' shows the cumulative variance explained when you consider the components together up to that point.
summary(pca_result)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 0.9488 0.29350 0.2103 0.07473 0.04855 0.04646
Proportion of Variance 0.8650 0.08278 0.0425 0.00537 0.00227 0.00207
Cumulative Proportion 0.8650 0.94780 0.9903 0.99566 0.99793 1.00000
cat("****str(pca_result)*****: This function reveals the structure of the prcomp object.
$ sdev: Numerical vector of the standard deviations (square roots of the eigenvalues) of the principal components.
$ rotation: The matrix of eigenvectors (each column is an eigenvector).
$ center: The means of the original variables, which are subtracted from the data if center = TRUE.
$ scale: Logical value indicating whether the variables have been scaled to unit variance before the analysis.
$ x: The matrix of the principal component scores, which are the original data projected onto the principal components.
",
sep = "\n")
****str(pca_result)*****: This function reveals the structure of the prcomp object.
$ sdev: Numerical vector of the standard deviations (square roots of the eigenvalues) of the principal components.
$ rotation: The matrix of eigenvectors (each column is an eigenvector).
$ center: The means of the original variables, which are subtracted from the data if center = TRUE.
$ scale: Logical value indicating whether the variables have been scaled to unit variance before the analysis.
$ x: The matrix of the principal component scores, which are the original data projected onto the principal components.
str(pca_result)
List of 5
$ sdev : num [1:6] 0.9488 0.2935 0.2103 0.0747 0.0486 ...
$ rotation: num [1:6, 1:6] 0.679 0.148 0.213 0.588 0.229 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
.. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:6] 3.08 1.2 1.13 3.8 2.64 ...
..- attr(*, "names")= chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
$ scale : logi FALSE
$ x : num [1:709, 1:6] -1.61 -1.64 -1.03 -1.17 -1.44 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
- attr(*, "class")= chr "prcomp"
cumsum.)
library(ggplot2)
# Assuming pca_result is already obtained from the prcomp function
# Extract eigenvalues
eigenvalues <- (pca_result$sdev)^2 # Squared singular values are the eigenvalues
# Create a dataframe for plotting the scree plot
pca_df <- data.frame(PC = paste0('PC', seq_along(eigenvalues)),
Eigenvalues = eigenvalues)
# Generate the scree plot
ggplot(pca_df, aes(x = PC, y = Eigenvalues)) +
geom_point() + # Add points for each principal component
geom_line(aes(group = 1)) + # Connect points with a line
theme_minimal() + # Use a minimal theme
labs(title = "Scree Plot", x = "Principal Component", y = "Eigenvalue")
# Calculate the cumulative variance
pca_df$CumulativeVariance <- cumsum(pca_df$Eigenvalues) / sum(pca_df$Eigenvalues)
# Generate the plot for cumulative contribution to total variance
ggplot(pca_df, aes(x = PC, y = CumulativeVariance)) +
geom_point() + # Add points for each principal component
geom_line(aes(group = 1)) + # Connect points with a line
theme_minimal() + # Use a minimal theme
labs(title = "Cumulative Contribution to Total Variance",
x = "Principal Component", y = "Cumulative Proportion of Variance Explained")
To plot smoothed histograms (density plots) of the PC1 and PC2 scores from a PCA in R, you can use the ggplot2 package, which is part of the tidyverse. pca_result$x provides the principal component scores. Here's how you can create these plots:
library(ggplot2)
# Assuming pca_result is already obtained from the prcomp function and contains
# the principal component scores in pca_result$x
# Convert the PC scores to a dataframe for easier manipulation with ggplot2
pc_scores <- as.data.frame(pca_result$x)
# Plot smoothed histogram (density) for PC1
ggplot(pc_scores, aes(x = PC1)) + geom_density(fill = "blue", alpha = 0.5) + labs(title = "Density Plot of PC1 Scores",
x = "PC1 Score", y = "Density") + theme_minimal()
# Plot smoothed histogram (density) for PC2
ggplot(pc_scores, aes(x = PC2)) + geom_density(fill = "red", alpha = 0.5) + labs(title = "Density Plot of PC2 Scores",
x = "PC2 Score", y = "Density") + theme_minimal()
as.data.frame(pca_result$x): This converts the principal component scores (stored in pca_result$x) into a dataframe. Each column represents scores for a principal component.
ggplot(pc_scores, aes(x = PC1)) + geom_density(...): This line creates a density plot for PC1 scores. The aes function sets the x-axis to represent PC1 scores. geom_density is used to create the smoothed histogram, with fill = "blue" and alpha = 0.5 determining the color and transparency.
The second plot is similar but for PC2 scores, with a different fill color for distinction.
labs and theme_minimal() are used to add labels and apply a minimal theme for better aesthetics.
These plots will provide a visual representation of the distribution of the scores for the first and second principal components, helping to understand the spread and density of the data along these dimensions.
library(ggplot2)
# Assuming pca_result is already obtained from the prcomp function
# and contains the principal component scores in pca_result$x
# Also assuming that the 'Period' variable is in the same order in the original dataset
# Add PC scores to the original dataset for plotting
aircraft_with_pc <- cbind(aircraft, pca_result$x)
# Convert 'Period' to a factor if it's not already
aircraft_with_pc$Period <- factor(aircraft_with_pc$Period)
# Create the scatterplot
ggplot(aircraft_with_pc, aes(x = PC1, y = PC2, color = Period)) +
geom_point(alpha = 0.7) + # Add scatter plot points
geom_density_2d() + # Add 2D density contours
scale_color_brewer(type = "qual", palette = "Set1") + # Set color palette
labs(title = "Scatterplot of PC1 vs PC2 Scores Colored by Period",
x = "PC1 Score", y = "PC2 Score") +
theme_minimal() # Use a minimal theme
# Assuming pca_result is already obtained from the prcomp function
# Create a biplot
biplot(pca_result, col = c("white", "blue"))
Repeat Question 1 but with scaled log data. How do the two analyses differ? Why do we see these differences?
# Method 1: pca.aircraft.s = prcomp(select(aircraft_logged.s, -Year, -Period), scale. = TRUE)
# Method 2: select the logged columns explicitly
aircraft_logged.s <- mutate(aircraft, Period = factor(Period), logPower = log10(Power),
logSpan = log10(Span), logLength = log10(Length), logWeight = log10(Weight),
logSpeed = log10(Speed), logRange = log10(Range)) %>%
dplyr::select(Year, Period, starts_with("log"))
# Select only the logged variables for PCA
pca_result.s <- prcomp(aircraft_logged.s[, c("logPower", "logSpan", "logLength",
"logWeight", "logSpeed", "logRange")], center = TRUE, scale. = TRUE)
# Display the eigenvectors (rotation matrix)
print(pca_result.s$rotation)
PC1 PC2 PC3 PC4 PC5
logPower 0.4480816 -0.254273152 0.2164887 -0.26154271 0.60960341
logSpan 0.3626533 0.599992407 0.1579815 -0.52435247 -0.42784071
logLength 0.4509822 0.193935691 0.2319303 0.79774800 -0.16024667
logWeight 0.4676854 0.008620396 0.1908513 -0.05827653 0.24171946
logSpeed 0.3066279 -0.729579739 0.0263650 -0.12527514 -0.59771868
logRange 0.3893202 0.073364280 -0.9150210 0.03402966 0.06293724
PC6
logPower -0.497674411
logSpan -0.159813465
logLength -0.207681242
logWeight 0.826406275
logSpeed -0.006011923
logRange -0.025784977
# Display the eigenvalues (squared singular values)
eigenvalues.s <- (pca_result.s$sdev)^2
print(eigenvalues.s)
[1] 4.45908692 1.06876561 0.38004455 0.04888992 0.03031952 0.01289349
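With scale. = TRUE, prcomp performs PCA on the correlation matrix, so the eigenvalues above should agree with eigen() applied to cor() and should sum to 6, the number of standardised variables (a quick sketch using objects from this chunk):
# PCA on scaled data = eigen-analysis of the correlation matrix
round(eigen(cor(aircraft_logged.s[, 3:8]))$values, 6)  # should match eigenvalues.s
sum(eigenvalues.s)  # total variance of standardised variables = 6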
Apply print, summary and str to the object produced by prcomp and work out the correspondence between these outputs and the terminology of PCA concepts used in the lecture notes. Note that, unlike the other two commands, str shows what the object actually is (i.e. its structure).
summary(pca_result.s)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 2.1117 1.0338 0.61648 0.22111 0.17413 0.11355
Proportion of Variance 0.7432 0.1781 0.06334 0.00815 0.00505 0.00215
Cumulative Proportion 0.7432 0.9213 0.98465 0.99280 0.99785 1.00000
print(pca_result.s)
Standard deviations (1, .., p=6):
[1] 2.1116550 1.0338112 0.6164775 0.2211106 0.1741250 0.1135495
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5
logPower 0.4480816 -0.254273152 0.2164887 -0.26154271 0.60960341
logSpan 0.3626533 0.599992407 0.1579815 -0.52435247 -0.42784071
logLength 0.4509822 0.193935691 0.2319303 0.79774800 -0.16024667
logWeight 0.4676854 0.008620396 0.1908513 -0.05827653 0.24171946
logSpeed 0.3066279 -0.729579739 0.0263650 -0.12527514 -0.59771868
logRange 0.3893202 0.073364280 -0.9150210 0.03402966 0.06293724
PC6
logPower -0.497674411
logSpan -0.159813465
logLength -0.207681242
logWeight 0.826406275
logSpeed -0.006011923
logRange -0.025784977
str(pca_result.s)
List of 5
$ sdev : num [1:6] 2.112 1.034 0.616 0.221 0.174 ...
$ rotation: num [1:6, 1:6] 0.448 0.363 0.451 0.468 0.307 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
.. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:6] 3.08 1.2 1.13 3.8 2.64 ...
..- attr(*, "names")= chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
$ scale : Named num [1:6] 0.654 0.207 0.221 0.567 0.295 ...
..- attr(*, "names")= chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
$ x : num [1:709, 1:6] -3.41 -3.32 -2.1 -2.43 -3.09 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
- attr(*, "class")= chr "prcomp"
Sigma0 = matrix(c(3.1386518, 0.38872659, 0.6178228, 1.7576847, 0.77433973, 0.7508789,
0.3887266, 1.21417949, 0.1941453, 0.451892, 0.01236855, 0.2155466, 0.6178228,
0.19414529, 1.2437919, 0.597032, 0.15151088, 0.2665018, 1.7576847, 0.45189196,
0.597032, 1.7083497, 0.52685668, 0.7109476, 0.7743397, 0.01236855, 0.1515109,
0.5268567, 0.53406192, 0.2299193, 0.7508789, 0.21554658, 0.2665018, 0.7109476,
0.22991933, 0.6642375), byrow = TRUE, nrow = 6)
We work with the matrix \(\Sigma_0\) from Q1 of Lab 2 which is given again below. As in Q2 of Lab 2 we generate random samples, with \(n = 250\), from each of the following distributions
(i) the multivariate normal distribution \({\cal N}(0, \Sigma_0)\);
(ii) the multivariate \(t\)-distribution \(t_6(0, \Sigma_0)\) with \(\nu = 6\) degrees of freedom.
library(mvtnorm)
Sigma0 = matrix(c(3.1386518, 0.38872659, 0.6178228, 1.7576847, 0.77433973, 0.7508789,
0.3887266, 1.21417949, 0.1941453, 0.451892, 0.01236855, 0.2155466, 0.6178228,
0.19414529, 1.2437919, 0.597032, 0.15151088, 0.2665018, 1.7576847, 0.45189196,
0.597032, 1.7083497, 0.52685668, 0.7109476, 0.7743397, 0.01236855, 0.1515109,
0.5268567, 0.53406192, 0.2299193, 0.7508789, 0.21554658, 0.2665018, 0.7109476,
0.22991933, 0.6642375), byrow = TRUE, nrow = 6)
Sigma0sym = 0.5 * (Sigma0 + t(Sigma0))
Xmat = rmvt(100, sigma = Sigma0sym, df = 6)  # note: 100 draws here, although the text specifies n = 250
Eigenvalues and eigenvectors are important in understanding the properties of a matrix, especially a covariance matrix in statistical analyses. Eigenvalues give the amount of variance along each principal direction, while eigenvectors give those directions. This analysis differs from generating data from a multivariate normal or t-distribution and calculating its covariance matrix, as in the previous code snippets: here you are analysing the properties of the covariance matrix itself, not generating new data.
n = 250 # sample size
# (a)
eig0 = eigen(Sigma0)  # eigendecomposition of the population covariance matrix
eig0
eigen() decomposition
$values
[1] 5.1206485 1.2257488 1.0134712 0.5661258 0.3053749 0.2719033
$vectors
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.7393233 -0.3611919 0.1542310412 -0.4969983 -0.22377747 0.04550286
[2,] -0.1595917 0.7948012 0.5221492521 -0.2155711 0.10579799 0.11188232
[3,] -0.2310636 0.4293893 -0.8387525653 -0.2290801 0.01673765 0.07727039
[4,] -0.5241308 0.1091430 -0.0042880344 0.5587806 0.18855315 -0.60462223
[5,] -0.2051337 -0.1833666 -0.0069591464 0.1186071 0.80847142 0.50651324
[6,] -0.2403087 0.0889987 0.0005115246 0.5724864 -0.49927143 0.59776033
# Eigenvalues are in the 'values' component of the eigen() result
eigenvalues <- eig0$values
# Eigenvectors are in the 'vectors' component of the eigen() result
eigenvectors <- eig0$vectors
eigenvalues
[1] 5.1206485 1.2257488 1.0134712 0.5661258 0.3053749 0.2719033
eigenvectors
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.7393233 -0.3611919 0.1542310412 -0.4969983 -0.22377747 0.04550286
[2,] -0.1595917 0.7948012 0.5221492521 -0.2155711 0.10579799 0.11188232
[3,] -0.2310636 0.4293893 -0.8387525653 -0.2290801 0.01673765 0.07727039
[4,] -0.5241308 0.1091430 -0.0042880344 0.5587806 0.18855315 -0.60462223
[5,] -0.2051337 -0.1833666 -0.0069591464 0.1186071 0.80847142 0.50651324
[6,] -0.2403087 0.0889987 0.0005115246 0.5724864 -0.49927143 0.59776033
This code generates a sample of multivariate normal data with specified means and covariance, converts this sample into a data frame, and then calculates the covariance matrix of this sample. This process is often used in statistical simulations to understand the properties and behaviors of multivariate datasets.
# Set the seed for random number generation to ensure reproducibility
set.seed(19501111)
# Generate a sample from a multivariate normal distribution:
#   n     = number of samples to generate
#   mu    = mean vector for the distribution, here a vector of six zeros
#   Sigma = covariance matrix of the distribution
Xnmat = mvrnorm(n = n, mu = rep(0, 6), Sigma = Sigma0)
# Convert the matrix Xnmat into a data frame for easier data handling
Xnmat = data.frame(Xnmat)
# Calculate and return the covariance matrix of the data frame Xnmat
cov(Xnmat)
X1 X2 X3 X4 X5 X6
X1 2.8825000 0.1445171 0.4873341 1.6265526 0.7598350 0.6760389
X2 0.1445171 1.2587985 0.2004843 0.3382075 -0.1106208 0.1713120
X3 0.4873341 0.2004843 1.1137943 0.5048360 0.1303875 0.2616829
X4 1.6265526 0.3382075 0.5048360 1.6675770 0.5375387 0.6666841
X5 0.7598350 -0.1106208 0.1303875 0.5375387 0.5692657 0.2072031
X6 0.6760389 0.1713120 0.2616829 0.6666841 0.2072031 0.5998595
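Before moving on, a quick elementwise comparison shows how far the sample estimate is from \(\Sigma_0\) (a small sketch; Sigma0 is defined above):
# Elementwise gap between the sample covariance and the population matrix
round(cov(Xnmat) - Sigma0, 2)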
This code snippet is creating a symmetric covariance matrix, generating a sample of data from a multivariate t-distribution based on this symmetric matrix, converting this sample into a data frame, and then calculating the covariance matrix of the sample. This process is useful in statistical simulations, especially when dealing with distributions that have heavier tails than the normal distribution.
# Create a symmetric covariance matrix: Sigma0 is averaged with its transpose t(Sigma0)
Sigma0sym = 0.5 * (Sigma0 + t(Sigma0))
# Generate a sample from a multivariate t-distribution:
#   100   = number of samples to generate
#   sigma = scale matrix for the distribution, set to the symmetric matrix Sigma0sym
#   df    = degrees of freedom for the t-distribution
Xtmat = rmvt(100, sigma = Sigma0sym, df = 6)
# Convert the generated matrix Xtmat into a data frame
Xtmat = data.frame(Xtmat)
# Calculate and return the covariance matrix of the data frame Xtmat
cov(Xtmat)
X1 X2 X3 X4 X5 X6
X1 4.5481804 0.62368776 1.0914450 2.6039032 1.01063742 1.2934574
X2 0.6236878 1.68992385 0.3049420 0.7567985 -0.07013976 0.3697011
X3 1.0914450 0.30494200 1.8773226 1.1360912 0.17203263 0.2653673
X4 2.6039032 0.75679849 1.1360912 2.3098496 0.57903361 1.0569777
X5 1.0106374 -0.07013976 0.1720326 0.5790336 0.62089138 0.2748526
X6 1.2934574 0.36970115 0.2653673 1.0569777 0.27485257 1.0082849
The question “Calculate the eigenvalues and eigenvectors of the covariance matrix and compare to those of the population matrix” pertains to a common procedure in statistics and data analysis involving eigenvalues and eigenvectors, particularly in the context of understanding covariance matrices.
Here’s a step-by-step breakdown of what this question entails:
Covariance Matrix Analysis:
Calculate the Covariance Matrix: This involves computing a matrix that encapsulates the covariance (joint variability) between each pair of elements in your dataset. If you’re working with a sample, you’d calculate the sample covariance matrix; if you’re working with a population, you’d calculate the population covariance matrix.
Eigenvalues and Eigenvectors of the Covariance Matrix: Eigenvalues and eigenvectors are then calculated from this covariance matrix.
Eigenvalues give you information about the magnitude of variance in the direction of its associated eigenvector.
Eigenvectors are directions in which the data varies the most (principal axes of variance). Each eigenvector is orthogonal (perpendicular) to the others.
Population Matrix Analysis:
The “population matrix” here might refer to the true covariance matrix of the entire population from which the sample is drawn.
As in the first step, you'd compute the eigenvalues and eigenvectors for this population covariance matrix.
Comparison:
The final part of the question involves comparing the eigenvalues and eigenvectors of the sample covariance matrix with those of the population covariance matrix.
This comparison is crucial to understand how well the sample represents the population. In ideal scenarios, especially with large samples, the eigenvalues and eigenvectors of the sample covariance matrix should closely approximate those of the population matrix.
This comparison is often used in Principal Component Analysis (PCA), a method used to reduce the dimensionality of large datasets, by transforming the data into a new set of variables (principal components) that are linear combinations of the original variables. The comparison helps in understanding the representativeness and reliability of PCA results derived from sample data in relation to the entire population.
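A compact way to make this comparison concrete (a minimal sketch, assuming eig0, Xnmat and Xtmat from the chunks above are in scope):
# Population vs sample eigenvalues, one row per matrix
rbind(population = eig0$values,
      normal_sample = eigen(cov(Xnmat))$values,
      t6_sample = eigen(cov(Xtmat))$values)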
This code is used to perform PCA on a dataset to understand the underlying structure and to reduce its dimensions by identifying the principal components.
# Perform Principal Component Analysis (PCA) on the data frame Xnmat
pca.Xnmat = prcomp(Xnmat)
# Extract the eigenvalues (squared standard deviations) of each principal
# component
eval.n = pca.Xnmat$sdev^2
# Extract the eigenvectors (principal component loadings) from the PCA;
# prcomp stores them in $rotation ($loadings belongs to princomp)
evec.n = pca.Xnmat$rotation
# Print the summary of the PCA results
print(pca.Xnmat)
Standard deviations (1, .., p=6):
[1] 2.1623665 1.1810863 0.9563652 0.7367037 0.5593483 0.5007634
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5 PC6
X1 -0.7396735 -0.2851213 -0.21018012 0.5441762 0.17512137 -0.02486000
X2 -0.1017659 0.8494873 -0.43258594 0.1892485 -0.17173681 -0.12480373
X3 -0.2092180 0.3528685 0.87476102 0.2532462 -0.04765246 -0.01002063
X4 -0.5396211 0.1301406 -0.00442740 -0.6088807 -0.10147189 0.55751290
X5 -0.2235083 -0.2077019 0.02992805 -0.1678422 -0.79509940 -0.49563513
X6 -0.2400163 0.1117337 0.05071078 -0.4527878 0.54322059 -0.65362893
The code is used to analyze the Xtmat dataset to uncover its underlying structure and reduce its complexity by focusing on the principal components that explain most of the variability in the data.
# Perform Principal Component Analysis (PCA) on the data frame Xtmat
pca.Xtmat = prcomp(Xtmat)
# Extract the eigenvalues (squared standard deviations) of each principal
# component
eval.t = pca.Xtmat$sdev^2
# Extract the eigenvectors (principal component loadings) from the PCA;
# prcomp stores them in $rotation ($loadings belongs to princomp)
evec.t = pca.Xtmat$rotation
# Print the summary of the PCA results
print(pca.Xtmat)
Standard deviations (1, .., p=6):
[1] 2.7545556 1.3145789 1.2187891 0.7998447 0.5928688 0.5119243
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5 PC6
X1 -0.7357956 0.38881222 -0.11239660 -0.44486273 0.30442973 -0.06493504
X2 -0.1717658 -0.74012410 -0.55655072 -0.27520384 -0.11115336 -0.15770821
X3 -0.2689875 -0.46125214 0.79387489 -0.06965788 0.07201535 -0.27316002
X4 -0.5146105 -0.15613913 0.02999639 0.40986375 -0.29005090 0.67659367
X5 -0.1644729 0.25232170 0.02369732 -0.04126024 -0.85903664 -0.41118670
X6 -0.2546992 0.01556161 -0.21428861 0.74284391 0.26020720 -0.51907771
summary(pca.Xnmat)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 2.1624 1.1811 0.9564 0.73670 0.55935 0.50076
Proportion of Variance 0.5778 0.1724 0.1130 0.06707 0.03867 0.03099
Cumulative Proportion 0.5778 0.7502 0.8633 0.93034 0.96901 1.00000
Standard Deviation: These values (2.1624, 1.1811, etc.) represent the standard deviations of each principal component (PC). A higher standard deviation indicates that the component accounts for more variance in the dataset.
Proportion of Variance: This shows how much of the total variance in the data each principal component accounts for. For example, PC1 accounts for 57.78% of the variance, while PC2 accounts for 17.24%, and so on.
Cumulative Proportion: This cumulative value indicates the total variance accounted for by the first N components. For instance, the first two components (PC1 and PC2) together account for 75.02% of the total variance.
summary(pca.Xtmat)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 2.7546 1.3146 1.2188 0.79984 0.59287 0.51192
Proportion of Variance 0.6294 0.1434 0.1232 0.05307 0.02916 0.02174
Cumulative Proportion 0.6294 0.7728 0.8960 0.94910 0.97826 1.00000
Standard Deviation: Similar to pca.Xnmat, these values indicate the standard deviations of each principal component in this dataset.
Proportion of Variance: This part shows the proportion of total variance accounted for by each principal component. PC1 in this dataset accounts for 62.94% of the variance, and so on.
Cumulative Proportion: The cumulative proportion of variance accounted for by the first N components. Here, the first two components account for 77.28% of the total variance.
Comparison
Standard Deviation
pca.Xnmat: The standard deviations for the principal components decrease from 2.1624 for PC1 to 0.50076 for PC6.
pca.Xtmat: The standard deviations are generally higher, ranging from 2.7546 for PC1 to 0.5119 for PC6.
Comparison: The higher standard deviations in pca.Xtmat suggest that the individual principal components account for more variance within this dataset compared to pca.Xnmat. This can imply that the Xtmat dataset might have more spread or variability in its data.
Proportion of Variance
pca.Xnmat: PC1 accounts for the most variance (57.78%), and the proportion decreases with each subsequent component.
pca.Xtmat: Similarly, PC1 accounts for the most variance (62.94%), but the overall distribution of variance across components is slightly different.
Comparison: Both datasets have their first principal component accounting for a significant proportion of the total variance. However, the exact proportions differ slightly, indicating differences in how the data is structured or spread in the multidimensional space.
Cumulative Proportion
pca.Xnmat: The cumulative proportion reaches 100% with PC6, with significant jumps initially (from 57.78% in PC1 to 75.02% in PC2).
pca.Xtmat: The cumulative proportion similarly reaches 100% with PC6, with the first two components accounting for 77.28%.
Comparison: The cumulative proportions suggest that in both cases, a few principal components (particularly the first two) capture a substantial amount of the total variance in the data.
However, the exact cumulative values at each component level vary slightly, indicating differences in how the variance is distributed among the components.
Overall Comparison
The PCA on both datasets suggests a similar pattern where a few principal components capture most of the variance, a common outcome in PCA analyses.
The exact values for standard deviation, proportion of variance, and cumulative proportion differ between the two datasets, reflecting differences in the underlying data structure, spread, and variability.
The pca.Xtmat dataset seems to have a slightly higher variance spread across its components compared to pca.Xnmat.
In conclusion, while both datasets exhibit typical PCA characteristics, the differences in their PCA statistics reflect variations in how their data are distributed and the extent of variability captured by their principal components.
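One concrete driver of the larger spread in the t sample: for a multivariate \(t_\nu\) distribution with scale matrix \(\Sigma_0\), the covariance matrix is \(\frac{\nu}{\nu-2}\Sigma_0\) (for \(\nu > 2\)), which equals \(1.5\,\Sigma_0\) at \(\nu = 6\), so the population eigenvalues for the t sample are 1.5 times those of \(\Sigma_0\). A quick sketch using objects defined above:
# Population eigenvalues implied for the t_6 sample: 1.5 * those of Sigma0
round(1.5 * eig0$values, 4)
round(pca.Xtmat$sdev^2, 4)  # sample analogue (only 100 draws, so still noisy)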
plot(pca.Xnmat)
plot(cumsum(pca.Xnmat$sdev^2)/sum(pca.Xnmat$sdev^2), type = "b")
plot(pca.Xtmat)
plot(cumsum(pca.Xtmat$sdev^2)/sum(pca.Xtmat$sdev^2), type = "b")
forplot = cbind(Xnmat, pca.Xnmat$x)
names(forplot)
[1] "X1" "X2" "X3" "X4" "X5" "X6" "PC1" "PC2" "PC3" "PC4" "PC5" "PC6"
ggplot(forplot, aes(PC1)) + geom_density()
ggplot(forplot, aes(PC2)) + geom_density()
ggplot(forplot, aes(PC3)) + geom_density()
ggplot(forplot, aes(PC4)) + geom_density()
ggplot(forplot, aes(PC5)) + geom_density()
ggplot(forplot, aes(PC6)) + geom_density()
forplot = cbind(Xtmat, pca.Xtmat$x)
names(forplot)
[1] "X1" "X2" "X3" "X4" "X5" "X6" "PC1" "PC2" "PC3" "PC4" "PC5" "PC6"
ggplot(forplot, aes(PC1)) + geom_density()
ggplot(forplot, aes(PC2)) + geom_density()
ggplot(forplot, aes(PC3)) + geom_density()
ggplot(forplot, aes(PC4)) + geom_density()
ggplot(forplot, aes(PC5)) + geom_density()
ggplot(forplot, aes(PC6)) + geom_density()
forplot = cbind(Xnmat, pca.Xnmat$x)
names(forplot)
[1] "X1" "X2" "X3" "X4" "X5" "X6" "PC1" "PC2" "PC3" "PC4" "PC5" "PC6"
ggplot(forplot, aes(PC1, PC2)) + geom_point() + geom_density_2d()
forplot = cbind(Xtmat, pca.Xtmat$x)
names(forplot)
[1] "X1" "X2" "X3" "X4" "X5" "X6" "PC1" "PC2" "PC3" "PC4" "PC5" "PC6"
ggplot(forplot, aes(PC1, PC2)) + geom_point() + geom_density_2d()
# Load necessary libraries
library(tidyverse) # For data manipulation and visualization
library(ggplot2) # For creating custom plots
library(MASS) # For statistical functions
library(GGally) # For enhanced pair plots
# Assuming Xtmat is already generated and available as per your previous code
# Descriptive statistics for Xtmat: mean, median, range, variance and standard
# deviation for each variable (the mode is computed separately below)
descriptive_stats_xtmat <- Xtmat %>%
summarise(across(everything(), list(mean = mean, median = median, range = ~diff(range(.)),
variance = var, sd = sd)))
# Define a function to calculate the mode
mode_function_xtmat <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
# Apply mode function to each column in Xtmat
modes_xtmat <- apply(Xtmat, 2, mode_function_xtmat)
modes_df_xtmat <- data.frame(modes_xtmat)
# Print descriptive statistics
print(descriptive_stats_xtmat)
X1_mean X1_median X1_range X1_variance X1_sd X2_mean X2_median
1 -0.2673069 -0.1140094 12.95437 4.54818 2.132646 0.1122905 0.1136951
X2_range X2_variance X2_sd X3_mean X3_median X3_range X3_variance
1 7.482452 1.689924 1.299971 -0.09909023 -0.2253782 8.763167 1.877323
X3_sd X4_mean X4_median X4_range X4_variance X4_sd X5_mean
1 1.370154 -0.1536837 0.0209533 8.073807 2.30985 1.519819 0.02210862
X5_median X5_range X5_variance X5_sd X6_mean X6_median X6_range
1 -0.01191906 4.845459 0.6208914 0.7879666 -0.09207192 -0.03542006 6.194098
X6_variance X6_sd
1 1.008285 1.004134
print(modes_df_xtmat)
modes_xtmat
X1 1.46570565
X2 -3.45758589
X3 -0.01266486
X4 0.04866006
X5 0.20850686
X6 0.03070821
# Visualization for Xtmat: histograms for each variable
Xtmat %>%
gather(key = "variable", value = "value") %>%
ggplot(aes(x = value)) + geom_histogram(bins = 30, fill = "blue", color = "black") +
facet_wrap(~variable, scales = "free") + theme_minimal() + labs(title = "Histograms of Xtmat Variables")
# Creating box plots for each variable in Xtmat
Xtmat %>%
gather(key = "variable", value = "value") %>%
ggplot(aes(x = variable, y = value)) + geom_boxplot(fill = "orange", color = "darkred") +
theme_minimal() + labs(title = "Box Plots of Xtmat Variables")
# Correlation analysis for Xtmat: calculate the correlation matrix
correlation_matrix_xtmat <- cor(Xtmat)
# Use ggpairs for a comprehensive visualization of pair-wise relationships and
# correlations
ggpairs(Xtmat)
# Covariance matrix for Xtmat: calculate and print the covariance matrix
covariance_matrix_xtmat <- cov(Xtmat)
print(covariance_matrix_xtmat)
X1 X2 X3 X4 X5 X6
X1 4.5481804 0.62368776 1.0914450 2.6039032 1.01063742 1.2934574
X2 0.6236878 1.68992385 0.3049420 0.7567985 -0.07013976 0.3697011
X3 1.0914450 0.30494200 1.8773226 1.1360912 0.17203263 0.2653673
X4 2.6039032 0.75679849 1.1360912 2.3098496 0.57903361 1.0569777
X5 1.0106374 -0.07013976 0.1720326 0.5790336 0.62089138 0.2748526
X6 1.2934574 0.36970115 0.2653673 1.0569777 0.27485257 1.0082849
# Perform Principal Component Analysis (PCA) on the data frame Xtmat
pca.Xtmat = prcomp(Xtmat)
# Extract the eigenvalues (squared standard deviations) of each principal
# component
eval.t = pca.Xtmat$sdev^2
# Extract the eigenvectors (principal component loadings) from the PCA;
# prcomp stores them in $rotation ($loadings belongs to princomp)
evec.t = pca.Xtmat$rotation
# Print the summary of the PCA results
print(pca.Xtmat)
Standard deviations (1, .., p=6):
[1] 2.7545556 1.3145789 1.2187891 0.7998447 0.5928688 0.5119243
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5 PC6
X1 -0.7357956 0.38881222 -0.11239660 -0.44486273 0.30442973 -0.06493504
X2 -0.1717658 -0.74012410 -0.55655072 -0.27520384 -0.11115336 -0.15770821
X3 -0.2689875 -0.46125214 0.79387489 -0.06965788 0.07201535 -0.27316002
X4 -0.5146105 -0.15613913 0.02999639 0.40986375 -0.29005090 0.67659367
X5 -0.1644729 0.25232170 0.02369732 -0.04126024 -0.85903664 -0.41118670
X6 -0.2546992 0.01556161 -0.21428861 0.74284391 0.26020720 -0.51907771
summary(pca.Xtmat)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 2.7546 1.3146 1.2188 0.79984 0.59287 0.51192
Proportion of Variance 0.6294 0.1434 0.1232 0.05307 0.02916 0.02174
Cumulative Proportion 0.6294 0.7728 0.8960 0.94910 0.97826 1.00000
Compare \(\Sigma_0\) with the two sample covariance matrices: how different are the eigenvalues and eigenvectors of the three matrices? Which are most similar? Why do they differ?
X1 X2 X3 X4 X5 X6
X1 4.5481804  0.62368776 1.0914450 2.6039032  1.01063742 1.2934574
X2 0.6236878  1.68992385 0.3049420 0.7567985 -0.07013976 0.3697011
X3 1.0914450  0.30494200 1.8773226 1.1360912  0.17203263 0.2653673
X4 2.6039032  0.75679849 1.1360912 2.3098496  0.57903361 1.0569777
X5 1.0106374 -0.07013976 0.1720326 0.5790336  0.62089138 0.2748526
X6 1.2934574  0.36970115 0.2653673 1.0569777  0.27485257 1.0082849
original:
Standard deviations (1, .., p=6):
[1] 2.7545556 1.3145789 1.2187891 0.7998447 0.5928688 0.5119243
Rotation (n x k) = (6 x 6):
          PC1         PC2         PC3         PC4         PC5         PC6
X1 -0.7357956  0.38881222 -0.11239660 -0.44486273  0.30442973 -0.06493504
X2 -0.1717658 -0.74012410 -0.55655072 -0.27520384 -0.11115336 -0.15770821
X3 -0.2689875 -0.46125214  0.79387489 -0.06965788  0.07201535 -0.27316002
X4 -0.5146105 -0.15613913  0.02999639  0.40986375 -0.29005090  0.67659367
X5 -0.1644729  0.25232170  0.02369732 -0.04126024 -0.85903664 -0.41118670
X6 -0.2546992  0.01556161 -0.21428861  0.74284391  0.26020720 -0.51907771
Suppose the random variables \(X_1, X_2, X_3\) have the covariance matrix
\[ \Sigma=\begin{pmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 &0 &2 \end{pmatrix}.\]
sigma <- matrix(c(1, -2, 0, -2, 5, 0, 0, 0, 2), nrow = 3, ncol = 3)
sigma
[,1] [,2] [,3]
[1,] 1 -2 0
[2,] -2 5 0
[3,] 0 0 2
eig_sigma = eigen(sigma)
eig_sigma
eigen() decomposition
$values
[1] 5.8284271 2.0000000 0.1715729
$vectors
[,1] [,2] [,3]
[1,] -0.3826834 0 0.9238795
[2,] 0.9238795 0 0.3826834
[3,] 0.0000000 1 0.0000000
eigenvalues = eig_sigma$values
eigenvalues
[1] 5.8284271 2.0000000 0.1715729
eigenvectors = eig_sigma$vectors
eigenvectors
[,1] [,2] [,3]
[1,] -0.3826834 0 0.9238795
[2,] 0.9238795 0 0.3826834
[3,] 0.0000000 1 0.0000000
\[ \lambda_1 = 5.8284, \quad \lambda_2 = 2.0000, \quad \lambda_3 = 0.1716 \]
\[ e_1 = (-0.3827, 0.9239, 0)^T, \quad e_2 = (0, 0, 1)^T, \quad e_3 = (0.9239, 0.3827, 0)^T \]
# Note: prcomp(sigma) treats the rows of sigma as three observations, so it is
# not the same as the eigen-analysis of sigma above, which gives the population PCs
pca_sigma = prcomp(sigma)
print(pca_sigma)
Standard deviations (1, .., p=3):
[1] 3.925094e+00 1.122632e+00 2.660133e-16
Rotation (n x k) = (3 x 3):
PC1 PC2 PC3
[1,] -0.38832732 -0.08944370 0.91717049
[2,] 0.91850794 0.04285707 0.39307307
[3,] -0.07446515 0.99506939 0.06551218
\[ Y_1 = e_1^T X = (-0.3827, 0.9239, 0)(X_1, X_2, X_3)^T = -0.3827X_1 + 0.9239X_2 \]
\[ Y_2 = e_2^T X = (0, 0, 1)(X_1, X_2, X_3)^T = X_3 \]
\[ Y_3 = e_3^T X = (0.9239, 0.3827, 0)(X_1, X_2, X_3)^T = 0.9239X_1 + 0.3827X_2 \]
\[ \mathrm{Var}(Y_1) = \lambda_1 = 5.8284 \]
\[ \mathrm{Cov}(Y_1, Y_2) = \mathrm{Cov}(-0.3827X_1 + 0.9239X_2,\, X_3) = -0.3827\,\mathrm{Cov}(X_1, X_3) + 0.9239\,\mathrm{Cov}(X_2, X_3) = 0 \]
\[ \lambda_1 = 5.8284, \quad \lambda_2 = 2.0000, \quad \lambda_3 = 0.1716 \]
Total variance \(= \lambda_1 + \lambda_2 + \lambda_3 = 5.8284 + 2.0000 + 0.1716 = 8\).
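These identities can be verified numerically (a quick sketch; e1 and e3 are names introduced here, taken from the eig_sigma decomposition above):
# Quadratic-form checks: Var(Y1) = lambda1 and Cov(Y1, Y3) = 0
e1 <- eig_sigma$vectors[, 1]
e3 <- eig_sigma$vectors[, 3]
t(e1) %*% sigma %*% e1  # lambda1 = 5.8284
t(e1) %*% sigma %*% e3  # 0: distinct principal components are uncorrelated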
The headsize data provided in (a) give the head lengths and head breadths (in millimetres) for each of the first two adult sons in 25 families. Here we shall use only the head lengths (head1 and head2).
(a). Read the data set in R and assign an appropriate name to the data object.
"headsize" <- +matrix(c(191, 195, 181, 183, 176, 208, 189, 197, 188, 192, 179, 183,
174, 190, 188, 163, 195, 186, 181, 175, 192, 174, +176, 197, 190, 155, 149, 148,
153, 144, 157, 150, 159, 152, 150, 158, 147, 150, 159, 151, 137, 155, 153, +145,
140, 154, 143, 139, 167, 163, 179, 201, 185, 188, 171, 192, 190, 189, 197, 187,
186, 174, 185, 195, +187, 161, 183, 173, 182, 165, 185, 178, 176, 200, 187, 145,
152, 149, 149, 142, 152, 149, 152, 159, 151, +148, 147, 152, 157, 158, 130, 158,
148, 146, 137, 152, 147, 143, 158, 150), nrow = 25, ncol = 4, dimnames = list(character(0),
c("head1", "breadth1", "head2", "breadth2")))
x <- headsize
headsize <- as.data.frame(headsize)
headsize
head1 breadth1 head2 breadth2
1 191 155 179 145
2 195 149 201 152
3 181 148 185 149
4 183 153 188 149
5 176 144 171 142
6 208 157 192 152
7 189 150 190 149
8 197 159 189 152
9 188 152 197 159
10 192 150 187 151
11 179 158 186 148
12 183 147 174 147
13 174 150 185 152
14 190 159 195 157
15 188 151 187 158
16 163 137 161 130
17 195 155 183 158
18 186 153 173 148
19 181 145 182 146
20 175 140 165 137
21 192 154 185 152
22 174 143 178 147
23 176 139 176 143
24 197 167 200 158
25 190 163 187 150
(b). Compute the array of sample means and the variance-covariance matrix \(S\).
head_dat <- headsize[, c("head1", "head2")]
head(head_dat)
head1 head2
1 191 179
2 195 201
3 181 185
4 183 188
5 176 171
6 208 192
str(head_dat)
'data.frame': 25 obs. of 2 variables:
$ head1: num 191 195 181 183 176 208 189 197 188 192 ...
$ head2: num 179 201 185 188 171 192 190 189 197 187 ...
## Means and var-cov
head.mean = colMeans(head_dat)
cat("The mean vector of Head Lengths", sep = "\n")
The mean vector of Head Lengths
head.mean
head1 head2
185.72 183.84
cat("Covariance Matrix of Head Lengths", sep = "\n")
Covariance Matrix of Head Lengths
head.cov = cov(head_dat)
head.cov
head1 head2
head1 95.29333 69.66167
head2 69.66167 100.80667
(c). Perform PCA on the dataset based on \(S\).
head_pca <- prcomp(x = head_dat)
###
cat("Head dat PCA", sep = "\n")
Head dat PCA
head_pca
Standard deviations (1, .., p=2):
[1] 12.952459 5.322951
Rotation (n x k) = (2 x 2):
PC1 PC2
head1 -0.6929858 0.7209512
head2 -0.7209512 -0.6929858
cat("The loadings (eigenvectors) of Head Lengths", sep = "\n")
The loadings (eigenvectors) of Head Lengths
head_pca$rotation
PC1 PC2
head1 -0.6929858 0.7209512
head2 -0.7209512 -0.6929858
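As a check (a small sketch using head.cov from (b)), the squared standard deviations of the components should equal the eigenvalues of \(S\):
# Eigenvalues of S vs variances of the principal components
eigen(head.cov)$values
head_pca$sdev^2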
(d). Write the equations in (c) and determine the proportion of the total sample variance explained by the first two principal components.
###
cat("The standard deviation of Head Lengths", sep = "\n")
The standard deviation of Head Lengths
head_pca$sdev
[1] 12.952459 5.322951
###
cat("The summary of PCA of Head Lengths", sep = "\n")
The summary of PCA of Head Lengths
summary(head_pca)
Importance of components:
PC1 PC2
Standard deviation 12.9525 5.3230
Proportion of Variance 0.8555 0.1445
Cumulative Proportion 0.8555 1.0000
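Reading the loadings off the rotation matrix in (c) (with head1 and head2 mean-centred), the sample principal components can be written as
\[ y_1 = -0.6930\,\text{head1} - 0.7210\,\text{head2}, \qquad y_2 = 0.7210\,\text{head1} - 0.6930\,\text{head2}, \]
and the proportions of the total sample variance follow from the variances above:
\[ \frac{12.9525^2}{12.9525^2 + 5.3230^2} \approx 0.8555, \qquad \frac{5.3230^2}{12.9525^2 + 5.3230^2} \approx 0.1445. \]
So the first component alone explains about 85.6% of the total sample variance, and the first two together explain all of it.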
(e). Interpret the first two sample PCA components, obtained in (d).
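A quick numerical reading of these components (a sketch; head_scores is a name introduced here): the first component is essentially the (negative) sum of the two head lengths, i.e. an overall 'size' component, while the second is their difference, a 'shape' contrast between first and second sons.
head_scores <- head_pca$x
# PC1 tracks overall size: the sum of the two head lengths (up to sign)
cor(head_scores[, 1], head_dat$head1 + head_dat$head2)  # expect close to -1
# PC2 tracks the difference between the sons' head lengths
cor(head_scores[, 2], head_dat$head1 - head_dat$head2)  # expect close to 1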