library(tidyverse)
library(ggplot2)
library(MASS)
library(GGally)
library(plot3D)
At the end of this session you should be able to:
explore PCA through multivariate data;
understand how to apply PCA using covariance;
understand how to apply PCA using correlation (standardising/scaling);
understand how to interpret the R output using princomp and prcomp in R.
Relates to lecture 2.
Libraries ggplot2, MASS, tidyverse, GGally may be useful. Others may be mentioned below in the hints.
For Q3 you will also need the library mvtnorm.
You might like to set up a project for each lab if you are using RStudio. Then you can copy a .Rmd file into that directory and write your answers in that file.
Consider the aircraft data (aircraft.csv) with variables Year, Period, Power, Span, Length, Weight, Speed and Range. The data are available under Week 1 folder or in the Data Sets folder on LMS.
These data record six characteristics of aircraft designs which appeared during the twentieth century.
The variables are:
- Year: year of first manufacture
- Period: a code to indicate one of three broad time periods
- Power: total engine power (kW)
- Span: wing span (m)
- Length: length (m)
- Weight: maximum take-off weight (kg)
- Speed: maximum speed (km/h)
- Range: range (km)
Source: The data were collected by P. Saviotti and are described in detail in Saviotti (1996), “Technological Evolution, Variety and Economy”, Edward Elgar: Cheltenham.
The dataframe aircraft refers to the aircraft data from the previous lab, the version with the logged variables. In this lab we work only with the logged variables plus Year and Period.
Use prcomp for this question.
Year Period Power Span Length Weight Speed Range
1 14 1 82.0 12.8 7.60 1070 105 400
2 14 1 82.0 11.0 9.00 830 145 402
3 14 1 223.6 17.9 10.35 2200 135 500
4 15 1 164.0 14.5 9.80 1946 138 500
5 15 1 119.0 12.9 7.90 1190 140 400
6 15 1 74.5 7.5 6.30 653 177 350
Lab Method:
# Creating a new dataframe with logged variables
aircraft_logged <- mutate(aircraft, Period = factor(Period), logPower = log10(Power),
logSpan = log10(Span), logLength = log10(Length), logWeight = log10(Weight),
logSpeed = log10(Speed), logRange = log10(Range)) %>%
dplyr::select(Year, Period, starts_with("log"))
# Method 1: pca.aircraft = prcomp(select(aircraft_logged, -Year, -Period))
# Method 2: build a formula from the names of the logged variables
noms = names(aircraft_logged)[3:8]
rhs = paste0(noms, collapse = " + ")
fmla = as.formula(paste("~ ", rhs))
print(fmla)
~logPower + logSpan + logLength + logWeight + logSpeed + logRange
pca.aircraft = prcomp(fmla, data = aircraft_logged)
# Display the eigenvectors (rotation matrix)
print(pca.aircraft$rotation)
PC1 PC2 PC3 PC4 PC5 PC6
logPower 0.6789041 -0.3370631 0.1337237 0.63152740 -0.092702676 0.01323106
logSpan 0.1475060 0.4778834 0.2150124 0.06127707 -0.048909835 -0.83515987
logLength 0.2128740 0.2307281 0.1568771 -0.27071156 -0.863486982 0.24071613
logWeight 0.5880074 0.2578687 0.2481639 -0.48182127 0.480207328 0.25182325
logSpeed 0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange 0.2715571 0.3756882 -0.8800761 0.09264058 -0.009418365 0.04370613
# Display the eigenvalues (squared singular values)
eigenvalues <- (pca.aircraft$sdev)^2
print(eigenvalues)
[1] 0.900183940 0.086144332 0.044225537 0.005585039 0.002357161 0.002158438
Eigenvectors:
                PC1        PC2        PC3         PC4          PC5         PC6
logPower  0.6789041 -0.3370631  0.1337237  0.63152740 -0.092702676  0.01323106
logSpan   0.1475060  0.4778834  0.2150124  0.06127707 -0.048909835 -0.83515987
logLength 0.2128740  0.2307281  0.1568771 -0.27071156 -0.863486982  0.24071613
logWeight 0.5880074  0.2578687  0.2481639 -0.48182127  0.480207328  0.25182325
logSpeed  0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange  0.2715571  0.3756882 -0.8800761  0.09264058 -0.009418365  0.04370613
Eigenvalues:
[1] 0.900183940 0.086144332 0.044225537 0.005585039 0.002357161 0.002158438
Apply print, summary and str to the object produced by prcomp and work out the correspondence between these outputs and the terminology of PCA concepts used in the lecture notes. Note that, unlike the other two commands, str shows what the object actually is (i.e. its structure).
cat("*****print(pca_result)****: This command prints the standard deviations of the principal components and the rotation matrix, which contains the eigenvectors.
The standard deviations are the square roots of the eigenvalues of the covariance matrix. The larger the standard deviation, the more variation that principal component explains.
The rotation matrix shows how the original variables are weighted to form each principal component (PC). Each column corresponds to a principal component, and each row corresponds to the original variables.
",
sep = "\n")
*****print(pca_result)****: This command prints the standard deviations of the principal components and the rotation matrix, which contains the eigenvectors.
The standard deviations are the square roots of the eigenvalues of the covariance matrix. The larger the standard deviation, the more variation that principal component explains.
The rotation matrix shows how the original variables are weighted to form each principal component (PC). Each column corresponds to a principal component, and each row corresponds to the original variables.
print(pca.aircraft)
Standard deviations (1, .., p=6):
[1] 0.94878024 0.29350355 0.21029869 0.07473312 0.04855061 0.04645899
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5 PC6
logPower 0.6789041 -0.3370631 0.1337237 0.63152740 -0.092702676 0.01323106
logSpan 0.1475060 0.4778834 0.2150124 0.06127707 -0.048909835 -0.83515987
logLength 0.2128740 0.2307281 0.1568771 -0.27071156 -0.863486982 0.24071613
logWeight 0.5880074 0.2578687 0.2481639 -0.48182127 0.480207328 0.25182325
logSpeed 0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange 0.2715571 0.3756882 -0.8800761 0.09264058 -0.009418365 0.04370613
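As a cross-check (a minimal sketch, assuming aircraft_logged from the chunk above is still in scope), the printed standard deviations should be the square roots of the eigenvalues of the sample covariance matrix of the six logged variables:
# Eigen-decomposition of the sample covariance matrix of the logged variables
S <- cov(aircraft_logged[, 3:8])
round(sqrt(eigen(S)$values), 6)  # should match pca.aircraft$sdev
round(pca.aircraft$sdev, 6)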
cat("****summary(pca_result)****: This provides a summary of the PCA, including the importance of components.The 'Standard deviation' section reiterates the standard deviations of the principal components.
'Proportion of Variance' indicates the fraction of the total variance explained by each principal component.
'Cumulative Proportion' shows the cumulative variance explained when you consider the components together up to that point.
",
sep = "\n")
****summary(pca_result)****: This provides a summary of the PCA, including the importance of components. The 'Standard deviation' section reiterates the standard deviations of the principal components.
'Proportion of Variance' indicates the fraction of the total variance explained by each principal component.
'Cumulative Proportion' shows the cumulative variance explained when you consider the components together up to that point.
summary(pca.aircraft)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 0.9488 0.29350 0.2103 0.07473 0.04855 0.04646
Proportion of Variance 0.8650 0.08278 0.0425 0.00537 0.00227 0.00207
Cumulative Proportion 0.8650 0.94780 0.9903 0.99566 0.99793 1.00000
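The 'Proportion of Variance' and 'Cumulative Proportion' rows can be reproduced directly from the eigenvalues (a quick sketch using the objects above):
# Each eigenvalue divided by the total variance
round(pca.aircraft$sdev^2 / sum(pca.aircraft$sdev^2), 5)
# Running total of the proportions
round(cumsum(pca.aircraft$sdev^2) / sum(pca.aircraft$sdev^2), 5)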
cat("****str(pca_result)*****: This function reveals the structure of the prcomp object.
$ sdev: Numerical vector of the standard deviations (square roots of the eigenvalues) of the principal components.
$ rotation: The matrix of eigenvectors (each column is an eigenvector).
$ center: The means of the original variables, which are subtracted from the data if center = TRUE.
$ scale: Logical value indicating whether the variables have been scaled to unit variance before the analysis.
$ x: The matrix of the principal component scores, which are the original data projected onto the principal components.
",
sep = "\n")
****str(pca_result)*****: This function reveals the structure of the prcomp object.
$ sdev: Numerical vector of the standard deviations (square roots of the eigenvalues) of the principal components.
$ rotation: The matrix of eigenvectors (each column is an eigenvector).
$ center: The means of the original variables, which are subtracted from the data if center = TRUE.
$ scale: Logical value indicating whether the variables have been scaled to unit variance before the analysis.
$ x: The matrix of the principal component scores, which are the original data projected onto the principal components.
str(pca.aircraft)
List of 6
$ sdev : num [1:6] 0.9488 0.2935 0.2103 0.0747 0.0486 ...
$ rotation: num [1:6, 1:6] 0.679 0.148 0.213 0.588 0.229 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
.. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:6] 3.08 1.2 1.13 3.8 2.64 ...
..- attr(*, "names")= chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
$ scale : logi FALSE
$ x : num [1:709, 1:6] -1.61 -1.64 -1.03 -1.17 -1.44 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:709] "1" "2" "3" "4" ...
.. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
$ call : language prcomp(formula = fmla, data = aircraft_logged)
- attr(*, "class")= chr "prcomp"
cumsum.)
plot(pca.aircraft)
# Assuming pca.aircraft is the result from prcomp, extract the eigenvalues
# (squared singular values):
#   eigenvalues <- pca.aircraft$sdev^2
# Create a sequence for the x-axis representing each principal component:
#   pc_seq <- 1:length(eigenvalues)
# Create the scree plot using the base R plot function:
#   plot(pc_seq, eigenvalues, type = "b", xlab = "Principal Component",
#        ylab = "Eigenvalue", main = "Scree Plot", pch = 19, col = "blue")
# Create the scree plot using base R plot function
plot(1:length(eigenvalues), pca.aircraft$sdev^2, type = "b", xlab = "Principal Component",
ylab = "Eigenvalue", main = "Scree Plot", col = "black")
plot(cumsum(pca.aircraft$sdev^2)/sum(pca.aircraft$sdev^2), type = "b", ylab = "Cumulative variance")
Index = c(1:6)
plot(Index, pca.aircraft$sdev^2, type = "b", ylim = c(0, 1))
lines(Index, cumsum(pca.aircraft$sdev^2)/sum(pca.aircraft$sdev^2), type = "b", col = 2)
forplot = cbind(aircraft, pca.aircraft$x)
names(forplot)
[1] "Year" "Period" "Power" "Span" "Length" "Weight" "Speed" "Range"
[9] "PC1" "PC2" "PC3" "PC4" "PC5" "PC6"
ggplot(forplot, aes(PC1)) + geom_density()
ggplot(forplot, aes(PC2)) + geom_density()
ggplot(forplot, aes(PC3)) + geom_density()
# Assuming forplot is a dataframe that includes PC1, PC2, PC3 and a categorical
# variable Period
ggplot(forplot, aes(x = PC1, y = PC2)) + geom_point() + geom_density_2d()
# Adjusted plot to include color for different Periods
ggplot(forplot, aes(x = PC2, y = PC3, colour = Period, group = Period)) + geom_point() +
geom_density_2d()
# Adjusted plot for PC1 vs PC3 with color
ggplot(forplot, aes(x = PC1, y = PC3, colour = Period, group = Period)) + geom_point() +
geom_density_2d()
# For ggpairs - ensure columns 9 to 14 are correctly specified for your dataset
ggpairs(forplot, columns = 9:14)
biplot, with col = c("white", "blue").)
biplot(pca.aircraft, col = c("white", "blue"), choices = 1:2)
biplot(pca.aircraft, col = c("red", "blue"), choices = 1:2)
# Assuming 'aircraft' is already loaded with read.csv()
# Note: Ensure that you only take the log of positive, non-zero values.
# The log function cannot be applied to zero or negative values.
# Creating a new dataframe with logged variables
aircraft_logged <- mutate(aircraft,
# Convert the 'Period' variable to a factor (categorical variable)
Period = factor(Period),
# Apply logarithmic transformation (base 10) to the 'Power' variable
logPower = log10(Power),
# Apply logarithmic transformation (base 10) to the 'Span' variable
logSpan = log10(Span),
# Apply logarithmic transformation (base 10) to the 'Length' variable
logLength = log10(Length),
# Apply logarithmic transformation (base 10) to the 'Weight' variable
logWeight = log10(Weight),
# Apply logarithmic transformation (base 10) to the 'Speed' variable
logSpeed = log10(Speed),
# Apply logarithmic transformation (base 10) to the 'Range' variable
logRange = log10(Range)
) %>%
# Select only specific columns: 'Year', 'Period', and those that start with "log"
dplyr::select(Year, Period, starts_with("log"))
# Perform PCA on the logged variables
# Select only the logged variables for PCA
pca_result <- prcomp(aircraft_logged[,c('logPower', 'logSpan', 'logLength', 'logWeight', 'logSpeed', 'logRange')],
center = TRUE, scale. = FALSE)
# Display the eigenvectors (rotation matrix)
print(pca_result$rotation)
PC1 PC2 PC3 PC4 PC5 PC6
logPower 0.6789041 -0.3370631 0.1337237 0.63152740 -0.092702676 0.01323106
logSpan 0.1475060 0.4778834 0.2150124 0.06127707 -0.048909835 -0.83515987
logLength 0.2128740 0.2307281 0.1568771 -0.27071156 -0.863486982 0.24071613
logWeight 0.5880074 0.2578687 0.2481639 -0.48182127 0.480207328 0.25182325
logSpeed 0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange 0.2715571 0.3756882 -0.8800761 0.09264058 -0.009418365 0.04370613
# Display the eigenvalues (squared singular values)
eigenvalues <- (pca_result$sdev)^2
print(eigenvalues)
[1] 0.900183940 0.086144332 0.044225537 0.005585039 0.002357161 0.002158438
Eigenvectors (rotation matrix):
                PC1        PC2        PC3         PC4          PC5         PC6
logPower  0.6789041 -0.3370631  0.1337237  0.63152740 -0.092702676  0.01323106
logSpan   0.1475060  0.4778834  0.2150124  0.06127707 -0.048909835 -0.83515987
logLength 0.2128740  0.2307281  0.1568771 -0.27071156 -0.863486982  0.24071613
logWeight 0.5880074  0.2578687  0.2481639 -0.48182127  0.480207328  0.25182325
logSpeed  0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange  0.2715571  0.3756882 -0.8800761  0.09264058 -0.009418365  0.04370613
Eigenvalues
[1] 0.900183940 0.086144332 0.044225537 0.005585039 0.002357161 0.002158438
Apply print, summary and str to the object produced by prcomp and work out the correspondence between these outputs and the terminology of PCA concepts used in the lecture notes. Note that, unlike the other two commands, str shows what the object actually is (i.e. its structure).
cat("*****print(pca_result)****: This command prints the standard deviations of the principal components and the rotation matrix, which contains the eigenvectors.
The standard deviations are the square roots of the eigenvalues of the covariance matrix. The larger the standard deviation, the more variation that principal component explains.
The rotation matrix shows how the original variables are weighted to form each principal component (PC). Each column corresponds to a principal component, and each row corresponds to the original variables.
",
sep = "\n")
*****print(pca_result)****: This command prints the standard deviations of the principal components and the rotation matrix, which contains the eigenvectors.
The standard deviations are the square roots of the eigenvalues of the covariance matrix. The larger the standard deviation, the more variation that principal component explains.
The rotation matrix shows how the original variables are weighted to form each principal component (PC). Each column corresponds to a principal component, and each row corresponds to the original variables.
print(pca_result)
Standard deviations (1, .., p=6):
[1] 0.94878024 0.29350355 0.21029869 0.07473312 0.04855061 0.04645899
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5 PC6
logPower 0.6789041 -0.3370631 0.1337237 0.63152740 -0.092702676 0.01323106
logSpan 0.1475060 0.4778834 0.2150124 0.06127707 -0.048909835 -0.83515987
logLength 0.2128740 0.2307281 0.1568771 -0.27071156 -0.863486982 0.24071613
logWeight 0.5880074 0.2578687 0.2481639 -0.48182127 0.480207328 0.25182325
logSpeed 0.2291724 -0.6301924 -0.2741496 -0.53235281 -0.112767421 -0.42315895
logRange 0.2715571 0.3756882 -0.8800761 0.09264058 -0.009418365 0.04370613
cat("****summary(pca_result)****: This provides a summary of the PCA, including the importance of components.The 'Standard deviation' section reiterates the standard deviations of the principal components.
'Proportion of Variance' indicates the fraction of the total variance explained by each principal component.
'Cumulative Proportion' shows the cumulative variance explained when you consider the components together up to that point.
",
sep = "\n")
****summary(pca_result)****: This provides a summary of the PCA, including the importance of components. The 'Standard deviation' section reiterates the standard deviations of the principal components.
'Proportion of Variance' indicates the fraction of the total variance explained by each principal component.
'Cumulative Proportion' shows the cumulative variance explained when you consider the components together up to that point.
summary(pca_result)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 0.9488 0.29350 0.2103 0.07473 0.04855 0.04646
Proportion of Variance 0.8650 0.08278 0.0425 0.00537 0.00227 0.00207
Cumulative Proportion 0.8650 0.94780 0.9903 0.99566 0.99793 1.00000
cat("****str(pca_result)*****: This function reveals the structure of the prcomp object.
$ sdev: Numerical vector of the standard deviations (square roots of the eigenvalues) of the principal components.
$ rotation: The matrix of eigenvectors (each column is an eigenvector).
$ center: The means of the original variables, which are subtracted from the data if center = TRUE.
$ scale: Logical value indicating whether the variables have been scaled to unit variance before the analysis.
$ x: The matrix of the principal component scores, which are the original data projected onto the principal components.
",
sep = "\n")
****str(pca_result)*****: This function reveals the structure of the prcomp object.
$ sdev: Numerical vector of the standard deviations (square roots of the eigenvalues) of the principal components.
$ rotation: The matrix of eigenvectors (each column is an eigenvector).
$ center: The means of the original variables, which are subtracted from the data if center = TRUE.
$ scale: Logical value indicating whether the variables have been scaled to unit variance before the analysis.
$ x: The matrix of the principal component scores, which are the original data projected onto the principal components.
str(pca_result)
List of 5
$ sdev : num [1:6] 0.9488 0.2935 0.2103 0.0747 0.0486 ...
$ rotation: num [1:6, 1:6] 0.679 0.148 0.213 0.588 0.229 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
.. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:6] 3.08 1.2 1.13 3.8 2.64 ...
..- attr(*, "names")= chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
$ scale : logi FALSE
$ x : num [1:709, 1:6] -1.61 -1.64 -1.03 -1.17 -1.44 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
- attr(*, "class")= chr "prcomp"
cumsum.)
library(ggplot2)
# Assuming pca_result is already obtained from the prcomp function
# Extract eigenvalues
eigenvalues <- (pca_result$sdev)^2 # Squared singular values are the eigenvalues
# Create a dataframe for plotting the scree plot
pca_df <- data.frame(PC = paste0('PC', seq_along(eigenvalues)),
Eigenvalues = eigenvalues)
# Generate the scree plot
ggplot(pca_df, aes(x = PC, y = Eigenvalues)) +
geom_point() + # Add points for each principal component
geom_line(aes(group = 1)) + # Connect points with a line
theme_minimal() + # Use a minimal theme
labs(title = "Scree Plot", x = "Principal Component", y = "Eigenvalue")
# Calculate the cumulative variance
pca_df$CumulativeVariance <- cumsum(pca_df$Eigenvalues) / sum(pca_df$Eigenvalues)
# Generate the plot for cumulative contribution to total variance
ggplot(pca_df, aes(x = PC, y = CumulativeVariance)) +
geom_point() + # Add points for each principal component
geom_line(aes(group = 1)) + # Connect points with a line
theme_minimal() + # Use a minimal theme
labs(title = "Cumulative Contribution to Total Variance",
x = "Principal Component", y = "Cumulative Proportion of Variance Explained")
To plot smoothed histograms (density plots) of the PC1 and PC2 scores from a PCA in R, you can use the ggplot2 package, which is part of the tidyverse. pca_result$x provides the principal component scores. Here's how you can create these plots:
library(ggplot2)
# Assuming pca_result is already obtained from the prcomp function and contains
# the principal component scores in pca_result$x
# Convert the PC scores to a dataframe for easier manipulation with ggplot2
pc_scores <- as.data.frame(pca_result$x)
# Plot smoothed histogram (density) for PC1
ggplot(pc_scores, aes(x = PC1)) + geom_density(fill = "blue", alpha = 0.5) + labs(title = "Density Plot of PC1 Scores",
x = "PC1 Score", y = "Density") + theme_minimal()
# Plot smoothed histogram (density) for PC2
ggplot(pc_scores, aes(x = PC2)) + geom_density(fill = "red", alpha = 0.5) + labs(title = "Density Plot of PC2 Scores",
x = "PC2 Score", y = "Density") + theme_minimal()
as.data.frame(pca_result$x): This converts the principal component scores (stored in pca_result$x) into a dataframe. Each column represents scores for a principal component.
ggplot(pc_scores, aes(x = PC1)) + geom_density(...): This line creates a density plot for PC1 scores. The aes function sets the x-axis to represent PC1 scores. geom_density is used to create the smoothed histogram, with fill = "blue" and alpha = 0.5 determining the color and transparency.
The second plot is similar but for PC2 scores, with a different fill color for distinction.
labs and theme_minimal() are used to add labels and apply a minimal theme for better aesthetics.
These plots will provide a visual representation of the distribution of the scores for the first and second principal components, helping to understand the spread and density of the data along these dimensions.
library(ggplot2)
# Assuming pca_result is already obtained from the prcomp function
# and contains the principal component scores in pca_result$x
# Also assuming that the 'Period' variable is in the same order in the original dataset
# Add PC scores to the original dataset for plotting
aircraft_with_pc <- cbind(aircraft, pca_result$x)
# Convert 'Period' to a factor if it's not already
aircraft_with_pc$Period <- factor(aircraft_with_pc$Period)
# Create the scatterplot
ggplot(aircraft_with_pc, aes(x = PC1, y = PC2, color = Period)) +
geom_point(alpha = 0.7) + # Add scatter plot points
geom_density_2d() + # Add 2D density contours
scale_color_brewer(type = "qual", palette = "Set1") + # Set color palette
labs(title = "Scatterplot of PC1 vs PC2 Scores Colored by Period",
x = "PC1 Score", y = "PC2 Score") +
theme_minimal() # Use a minimal theme
# Assuming pca_result is already obtained from the prcomp function
# Create a biplot
biplot(pca_result, col = c("white", "blue"))
Repeat Question 1 but with scaled log data. How do the two analyses differ? Why do we see these differences?
# Method 1: pca.aircraft.s = prcomp(select(aircraft_logged.s, -Year, -Period), scale. = TRUE)
# Method 2: select the logged columns explicitly
aircraft_logged.s <- mutate(aircraft, Period = factor(Period), logPower = log10(Power),
logSpan = log10(Span), logLength = log10(Length), logWeight = log10(Weight),
logSpeed = log10(Speed), logRange = log10(Range)) %>%
dplyr::select(Year, Period, starts_with("log"))
# Select only the logged variables for PCA
pca_result.s <- prcomp(aircraft_logged.s[, c("logPower", "logSpan", "logLength",
"logWeight", "logSpeed", "logRange")], center = TRUE, scale. = TRUE)
# Display the eigenvectors (rotation matrix)
print(pca_result.s$rotation)
PC1 PC2 PC3 PC4 PC5
logPower 0.4480816 -0.254273152 0.2164887 -0.26154271 0.60960341
logSpan 0.3626533 0.599992407 0.1579815 -0.52435247 -0.42784071
logLength 0.4509822 0.193935691 0.2319303 0.79774800 -0.16024667
logWeight 0.4676854 0.008620396 0.1908513 -0.05827653 0.24171946
logSpeed 0.3066279 -0.729579739 0.0263650 -0.12527514 -0.59771868
logRange 0.3893202 0.073364280 -0.9150210 0.03402966 0.06293724
PC6
logPower -0.497674411
logSpan -0.159813465
logLength -0.207681242
logWeight 0.826406275
logSpeed -0.006011923
logRange -0.025784977
# Display the eigenvalues (squared singular values)
eigenvalues.s <- (pca_result.s$sdev)^2
print(eigenvalues.s)
[1] 4.45908692 1.06876561 0.38004455 0.04888992 0.03031952 0.01289349
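With scale. = TRUE, prcomp performs PCA on the correlation matrix, so the eigenvalues above should agree with eigen() applied to cor() and should sum to 6, the number of standardised variables (a quick sketch using objects from this chunk):
# PCA on scaled data = eigen-analysis of the correlation matrix
round(eigen(cor(aircraft_logged.s[, 3:8]))$values, 6)  # should match eigenvalues.s
sum(eigenvalues.s)  # total variance of standardised variables = 6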
Apply print, summary and str to the object produced by prcomp and work out the correspondence between these outputs and the terminology of PCA concepts used in the lecture notes. Note that, unlike the other two commands, str shows what the object actually is (i.e. its structure).
summary(pca_result.s)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 2.1117 1.0338 0.61648 0.22111 0.17413 0.11355
Proportion of Variance 0.7432 0.1781 0.06334 0.00815 0.00505 0.00215
Cumulative Proportion 0.7432 0.9213 0.98465 0.99280 0.99785 1.00000
print(pca_result.s)
Standard deviations (1, .., p=6):
[1] 2.1116550 1.0338112 0.6164775 0.2211106 0.1741250 0.1135495
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5
logPower 0.4480816 -0.254273152 0.2164887 -0.26154271 0.60960341
logSpan 0.3626533 0.599992407 0.1579815 -0.52435247 -0.42784071
logLength 0.4509822 0.193935691 0.2319303 0.79774800 -0.16024667
logWeight 0.4676854 0.008620396 0.1908513 -0.05827653 0.24171946
logSpeed 0.3066279 -0.729579739 0.0263650 -0.12527514 -0.59771868
logRange 0.3893202 0.073364280 -0.9150210 0.03402966 0.06293724
PC6
logPower -0.497674411
logSpan -0.159813465
logLength -0.207681242
logWeight 0.826406275
logSpeed -0.006011923
logRange -0.025784977
str(pca_result.s)
List of 5
$ sdev : num [1:6] 2.112 1.034 0.616 0.221 0.174 ...
$ rotation: num [1:6, 1:6] 0.448 0.363 0.451 0.468 0.307 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
.. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:6] 3.08 1.2 1.13 3.8 2.64 ...
..- attr(*, "names")= chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
$ scale : Named num [1:6] 0.654 0.207 0.221 0.567 0.295 ...
..- attr(*, "names")= chr [1:6] "logPower" "logSpan" "logLength" "logWeight" ...
$ x : num [1:709, 1:6] -3.41 -3.32 -2.1 -2.43 -3.09 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:6] "PC1" "PC2" "PC3" "PC4" ...
- attr(*, "class")= chr "prcomp"
Sigma0 = matrix(c(3.1386518, 0.38872659, 0.6178228, 1.7576847, 0.77433973, 0.7508789,
0.3887266, 1.21417949, 0.1941453, 0.451892, 0.01236855, 0.2155466, 0.6178228,
0.19414529, 1.2437919, 0.597032, 0.15151088, 0.2665018, 1.7576847, 0.45189196,
0.597032, 1.7083497, 0.52685668, 0.7109476, 0.7743397, 0.01236855, 0.1515109,
0.5268567, 0.53406192, 0.2299193, 0.7508789, 0.21554658, 0.2665018, 0.7109476,
0.22991933, 0.6642375), byrow = TRUE, nrow = 6)
We work with the matrix \(\Sigma_0\) from Q1 of Lab 2 which is given again below. As in Q2 of Lab 2 we generate random samples, with \(n = 250\), from each of the following distributions
(i) the multivariate normal distribution \({\cal N}(0, \Sigma_0)\);
(ii) the multivariate \(t\)-distribution \(t_6(0, \Sigma_0)\) with \(\nu = 6\) degrees of freedom.
library(mvtnorm)
Sigma0 = matrix(c(3.1386518, 0.38872659, 0.6178228, 1.7576847, 0.77433973, 0.7508789,
0.3887266, 1.21417949, 0.1941453, 0.451892, 0.01236855, 0.2155466, 0.6178228,
0.19414529, 1.2437919, 0.597032, 0.15151088, 0.2665018, 1.7576847, 0.45189196,
0.597032, 1.7083497, 0.52685668, 0.7109476, 0.7743397, 0.01236855, 0.1515109,
0.5268567, 0.53406192, 0.2299193, 0.7508789, 0.21554658, 0.2665018, 0.7109476,
0.22991933, 0.6642375), byrow = TRUE, nrow = 6)
Sigma0sym = 0.5 * (Sigma0 + t(Sigma0))
Xmat = rmvt(100, sigma = Sigma0sym, df = 6)  # note: 100 draws here, although the text specifies n = 250
Eigenvalues and eigenvectors are important in understanding the properties of a matrix, especially a covariance matrix in statistical analyses. Eigenvalues give the amount of variance along each principal direction, while eigenvectors give those directions. This analysis differs from generating data from a multivariate normal or t-distribution and calculating its covariance matrix, as in the previous code snippets: here you are analysing the properties of the covariance matrix itself, not generating new data.
n = 250 # sample size
# (a)
eig0 = eigen(Sigma0)  # eigendecomposition of the population covariance matrix
eig0
eigen() decomposition
$values
[1] 5.1206485 1.2257488 1.0134712 0.5661258 0.3053749 0.2719033
$vectors
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.7393233 -0.3611919 0.1542310412 -0.4969983 -0.22377747 0.04550286
[2,] -0.1595917 0.7948012 0.5221492521 -0.2155711 0.10579799 0.11188232
[3,] -0.2310636 0.4293893 -0.8387525653 -0.2290801 0.01673765 0.07727039
[4,] -0.5241308 0.1091430 -0.0042880344 0.5587806 0.18855315 -0.60462223
[5,] -0.2051337 -0.1833666 -0.0069591464 0.1186071 0.80847142 0.50651324
[6,] -0.2403087 0.0889987 0.0005115246 0.5724864 -0.49927143 0.59776033
# Eigenvalues are in the 'values' component of the eigen() result
eigenvalues <- eig0$values
# Eigenvectors are in the 'vectors' component of the eigen() result
eigenvectors <- eig0$vectors
eigenvalues
[1] 5.1206485 1.2257488 1.0134712 0.5661258 0.3053749 0.2719033
eigenvectors
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.7393233 -0.3611919 0.1542310412 -0.4969983 -0.22377747 0.04550286
[2,] -0.1595917 0.7948012 0.5221492521 -0.2155711 0.10579799 0.11188232
[3,] -0.2310636 0.4293893 -0.8387525653 -0.2290801 0.01673765 0.07727039
[4,] -0.5241308 0.1091430 -0.0042880344 0.5587806 0.18855315 -0.60462223
[5,] -0.2051337 -0.1833666 -0.0069591464 0.1186071 0.80847142 0.50651324
[6,] -0.2403087 0.0889987 0.0005115246 0.5724864 -0.49927143 0.59776033
This code generates a sample of multivariate normal data with specified means and covariance, converts this sample into a data frame, and then calculates the covariance matrix of this sample. This process is often used in statistical simulations to understand the properties and behaviors of multivariate datasets.
# Set the seed for random number generation to ensure reproducibility
set.seed(19501111)
# Generate a sample from a multivariate normal distribution:
#   n     = number of samples to generate
#   mu    = mean vector for the distribution, here a vector of six zeros
#   Sigma = covariance matrix of the distribution
Xnmat = mvrnorm(n = n, mu = rep(0, 6), Sigma = Sigma0)
# Convert the matrix Xnmat into a data frame for easier data handling
Xnmat = data.frame(Xnmat)
# Calculate and return the covariance matrix of the data frame Xnmat
cov(Xnmat)
X1 X2 X3 X4 X5 X6
X1 2.8825000 0.1445171 0.4873341 1.6265526 0.7598350 0.6760389
X2 0.1445171 1.2587985 0.2004843 0.3382075 -0.1106208 0.1713120
X3 0.4873341 0.2004843 1.1137943 0.5048360 0.1303875 0.2616829
X4 1.6265526 0.3382075 0.5048360 1.6675770 0.5375387 0.6666841
X5 0.7598350 -0.1106208 0.1303875 0.5375387 0.5692657 0.2072031
X6 0.6760389 0.1713120 0.2616829 0.6666841 0.2072031 0.5998595
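Before moving on, a quick elementwise comparison shows how far the sample estimate is from \(\Sigma_0\) (a small sketch; Sigma0 is defined above):
# Elementwise gap between the sample covariance and the population matrix
round(cov(Xnmat) - Sigma0, 2)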
This code snippet is creating a symmetric covariance matrix, generating a sample of data from a multivariate t-distribution based on this symmetric matrix, converting this sample into a data frame, and then calculating the covariance matrix of the sample. This process is useful in statistical simulations, especially when dealing with distributions that have heavier tails than the normal distribution.
# Create a symmetric covariance matrix: Sigma0 is averaged with its transpose t(Sigma0)
Sigma0sym = 0.5 * (Sigma0 + t(Sigma0))
# Generate a sample from a multivariate t-distribution:
#   100   = number of samples to generate
#   sigma = scale matrix for the distribution, set to the symmetric matrix Sigma0sym
#   df    = degrees of freedom for the t-distribution
Xtmat = rmvt(100, sigma = Sigma0sym, df = 6)
# Convert the generated matrix Xtmat into a data frame
Xtmat = data.frame(Xtmat)
# Calculate and return the covariance matrix of the data frame Xtmat
cov(Xtmat)
X1 X2 X3 X4 X5 X6
X1 4.5481804 0.62368776 1.0914450 2.6039032 1.01063742 1.2934574
X2 0.6236878 1.68992385 0.3049420 0.7567985 -0.07013976 0.3697011
X3 1.0914450 0.30494200 1.8773226 1.1360912 0.17203263 0.2653673
X4 2.6039032 0.75679849 1.1360912 2.3098496 0.57903361 1.0569777
X5 1.0106374 -0.07013976 0.1720326 0.5790336 0.62089138 0.2748526
X6 1.2934574 0.36970115 0.2653673 1.0569777 0.27485257 1.0082849
The question “Calculate the eigenvalues and eigenvectors of the covariance matrix and compare to those of the population matrix” pertains to a common procedure in statistics and data analysis involving eigenvalues and eigenvectors, particularly in the context of understanding covariance matrices.
Here’s a step-by-step breakdown of what this question entails:
Covariance Matrix Analysis:
Calculate the Covariance Matrix: This involves computing a matrix that encapsulates the covariance (joint variability) between each pair of elements in your dataset. If you’re working with a sample, you’d calculate the sample covariance matrix; if you’re working with a population, you’d calculate the population covariance matrix.
Eigenvalues and Eigenvectors of the Covariance Matrix: Eigenvalues and eigenvectors are then calculated from this covariance matrix.
Eigenvalues give you information about the magnitude of variance in the direction of its associated eigenvector.
Eigenvectors are directions in which the data varies the most (principal axes of variance). Each eigenvector is orthogonal (perpendicular) to the others.
Population Matrix Analysis:
The “population matrix” here might refer to the true covariance matrix of the entire population from which the sample is drawn.
As in the first step, you'd compute the eigenvalues and eigenvectors for this population covariance matrix.
Comparison:
The final part of the question involves comparing the eigenvalues and eigenvectors of the sample covariance matrix with those of the population covariance matrix.
This comparison is crucial to understand how well the sample represents the population. In ideal scenarios, especially with large samples, the eigenvalues and eigenvectors of the sample covariance matrix should closely approximate those of the population matrix.
This comparison is often used in Principal Component Analysis (PCA), a method used to reduce the dimensionality of large datasets, by transforming the data into a new set of variables (principal components) that are linear combinations of the original variables. The comparison helps in understanding the representativeness and reliability of PCA results derived from sample data in relation to the entire population.
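A compact way to make this comparison concrete (a minimal sketch, assuming eig0, Xnmat and Xtmat from the chunks above are in scope):
# Population vs sample eigenvalues, one row per matrix
rbind(population = eig0$values,
      normal_sample = eigen(cov(Xnmat))$values,
      t6_sample = eigen(cov(Xtmat))$values)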
This code is used to perform PCA on a dataset to understand the underlying structure and to reduce its dimensions by identifying the principal components.
# Perform Principal Component Analysis (PCA) on the data frame Xnmat
pca.Xnmat = prcomp(Xnmat)
# Extract the eigenvalues (squared standard deviations) of each principal
# component
eval.n = pca.Xnmat$sdev^2
# Extract the eigenvectors (principal component loadings) from the PCA;
# prcomp stores them in $rotation ($loadings belongs to princomp)
evec.n = pca.Xnmat$rotation
# Print the summary of the PCA results
print(pca.Xnmat)
Standard deviations (1, .., p=6):
[1] 2.1623665 1.1810863 0.9563652 0.7367037 0.5593483 0.5007634
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5 PC6
X1 -0.7396735 -0.2851213 -0.21018012 0.5441762 0.17512137 -0.02486000
X2 -0.1017659 0.8494873 -0.43258594 0.1892485 -0.17173681 -0.12480373
X3 -0.2092180 0.3528685 0.87476102 0.2532462 -0.04765246 -0.01002063
X4 -0.5396211 0.1301406 -0.00442740 -0.6088807 -0.10147189 0.55751290
X5 -0.2235083 -0.2077019 0.02992805 -0.1678422 -0.79509940 -0.49563513
X6 -0.2400163 0.1117337 0.05071078 -0.4527878 0.54322059 -0.65362893
The code is used to analyze the Xtmat dataset to uncover its underlying structure and reduce its complexity by focusing on the principal components that explain most of the variability in the data.
# Perform Principal Component Analysis (PCA) on the data frame Xtmat
pca.Xtmat = prcomp(Xtmat)
# Extract the eigenvalues (squared standard deviations) of each principal
# component
eval.t = pca.Xtmat$sdev^2
# Extract the eigenvectors (principal component loadings) from the PCA;
# prcomp stores them in $rotation ($loadings belongs to princomp)
evec.t = pca.Xtmat$rotation
# Print the summary of the PCA results
print(pca.Xtmat)
Standard deviations (1, .., p=6):
[1] 2.7545556 1.3145789 1.2187891 0.7998447 0.5928688 0.5119243
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5 PC6
X1 -0.7357956 0.38881222 -0.11239660 -0.44486273 0.30442973 -0.06493504
X2 -0.1717658 -0.74012410 -0.55655072 -0.27520384 -0.11115336 -0.15770821
X3 -0.2689875 -0.46125214 0.79387489 -0.06965788 0.07201535 -0.27316002
X4 -0.5146105 -0.15613913 0.02999639 0.40986375 -0.29005090 0.67659367
X5 -0.1644729 0.25232170 0.02369732 -0.04126024 -0.85903664 -0.41118670
X6 -0.2546992 0.01556161 -0.21428861 0.74284391 0.26020720 -0.51907771
summary(pca.Xnmat)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 2.1624 1.1811 0.9564 0.73670 0.55935 0.50076
Proportion of Variance 0.5778 0.1724 0.1130 0.06707 0.03867 0.03099
Cumulative Proportion 0.5778 0.7502 0.8633 0.93034 0.96901 1.00000
Standard Deviation: These values (2.1624, 1.1811, etc.) represent the standard deviations of each principal component (PC). A higher standard deviation indicates that the component accounts for more variance in the dataset.
Proportion of Variance: This shows how much of the total variance in the data each principal component accounts for. For example, PC1 accounts for 57.78% of the variance, while PC2 accounts for 17.24%, and so on.
Cumulative Proportion: This cumulative value indicates the total variance accounted for by the first N components. For instance, the first two components (PC1 and PC2) together account for 75.02% of the total variance.
summary(pca.Xtmat)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 2.7546 1.3146 1.2188 0.79984 0.59287 0.51192
Proportion of Variance 0.6294 0.1434 0.1232 0.05307 0.02916 0.02174
Cumulative Proportion 0.6294 0.7728 0.8960 0.94910 0.97826 1.00000
Standard Deviation: Similar to pca.Xnmat, these values indicate the standard deviations of each principal component in this dataset.
Proportion of Variance: This part shows the proportion of total variance accounted for by each principal component. PC1 in this dataset accounts for 62.94% of the variance, and so on.
Cumulative Proportion: The cumulative proportion of variance accounted for by the first N components. Here, the first two components account for 77.28% of the total variance.
Comparison
Standard Deviation
pca.Xnmat: The standard deviations for the principal components decrease from 2.1624 for PC1 to 0.50076 for PC6.
pca.Xtmat: The standard deviations are generally higher, ranging from 2.7546 for PC1 to 0.5119 for PC6.
Comparison: The higher standard deviations in pca.Xtmat suggest that the individual principal components account for more variance within this dataset compared to pca.Xnmat. This can imply that the Xtmat dataset might have more spread or variability in its data.
Proportion of Variance
pca.Xnmat: PC1 accounts for the most variance (57.78%), and the proportion decreases with each subsequent component.
pca.Xtmat: Similarly, PC1 accounts for the most variance (62.94%), but the overall distribution of variance across components is slightly different.
Comparison: Both datasets have their first principal component accounting for a significant proportion of the total variance. However, the exact proportions differ slightly, indicating differences in how the data is structured or spread in the multidimensional space.
Cumulative Proportion
pca.Xnmat: The cumulative proportion reaches 100% with PC6, with significant jumps initially (from 57.78% in PC1 to 75.02% in PC2).
pca.Xtmat: The cumulative proportion similarly reaches 100% with PC6, with the first two components accounting for 77.28%.
Comparison: The cumulative proportions suggest that in both cases, a few principal components (particularly the first two) capture a substantial amount of the total variance in the data.
However, the exact cumulative values at each component level vary slightly, indicating differences in how the variance is distributed among the components.
Overall Comparison
The PCA on both datasets suggests a similar pattern where a few principal components capture most of the variance, a common outcome in PCA analyses.
The exact values for standard deviation, proportion of variance, and cumulative proportion differ between the two datasets, reflecting differences in the underlying data structure, spread, and variability.
The pca.Xtmat dataset seems to have a slightly higher variance spread across its components compared to pca.Xnmat.
In conclusion, while both datasets exhibit typical PCA characteristics, the differences in their PCA statistics reflect variations in how their data are distributed and the extent of variability captured by their principal components.
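One concrete driver of the larger spread in the t sample: for a multivariate \(t_\nu\) distribution with scale matrix \(\Sigma_0\), the covariance matrix is \(\frac{\nu}{\nu-2}\Sigma_0\) (for \(\nu > 2\)), which equals \(1.5\,\Sigma_0\) at \(\nu = 6\), so the population eigenvalues for the t sample are 1.5 times those of \(\Sigma_0\). A quick sketch using objects defined above:
# Population eigenvalues implied for the t_6 sample: 1.5 * those of Sigma0
round(1.5 * eig0$values, 4)
round(pca.Xtmat$sdev^2, 4)  # sample analogue (only 100 draws, so still noisy)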
plot(pca.Xnmat)
plot(cumsum(pca.Xnmat$sdev^2)/sum(pca.Xnmat$sdev^2), type = "b")
plot(pca.Xtmat)
plot(cumsum(pca.Xtmat$sdev^2)/sum(pca.Xtmat$sdev^2), type = "b")
forplot = cbind(Xnmat, pca.Xnmat$x)
names(forplot)
[1] "X1" "X2" "X3" "X4" "X5" "X6" "PC1" "PC2" "PC3" "PC4" "PC5" "PC6"
ggplot(forplot, aes(PC1)) + geom_density()
ggplot(forplot, aes(PC2)) + geom_density()
ggplot(forplot, aes(PC3)) + geom_density()
ggplot(forplot, aes(PC4)) + geom_density()
ggplot(forplot, aes(PC5)) + geom_density()
ggplot(forplot, aes(PC6)) + geom_density()
forplot = cbind(Xtmat, pca.Xtmat$x)
names(forplot)
[1] "X1" "X2" "X3" "X4" "X5" "X6" "PC1" "PC2" "PC3" "PC4" "PC5" "PC6"
ggplot(forplot, aes(PC1)) + geom_density()
ggplot(forplot, aes(PC2)) + geom_density()
ggplot(forplot, aes(PC3)) + geom_density()
ggplot(forplot, aes(PC4)) + geom_density()
ggplot(forplot, aes(PC5)) + geom_density()
ggplot(forplot, aes(PC6)) + geom_density()
forplot = cbind(Xnmat, pca.Xnmat$x)
names(forplot)
[1] "X1" "X2" "X3" "X4" "X5" "X6" "PC1" "PC2" "PC3" "PC4" "PC5" "PC6"
ggplot(forplot, aes(PC1, PC2)) + geom_point() + geom_density_2d()
forplot = cbind(Xtmat, pca.Xtmat$x)
names(forplot)
[1] "X1" "X2" "X3" "X4" "X5" "X6" "PC1" "PC2" "PC3" "PC4" "PC5" "PC6"
ggplot(forplot, aes(PC1, PC2)) + geom_point() + geom_density_2d()
# Load necessary libraries
library(tidyverse) # For data manipulation and visualization
library(ggplot2) # For creating custom plots
library(MASS) # For statistical functions
library(GGally) # For enhanced pair plots
# Assuming Xtmat is already generated and available as per your previous code
# Descriptive statistics for Xtmat: mean, median, range, variance and standard
# deviation for each variable (the mode is computed separately below)
descriptive_stats_xtmat <- Xtmat %>%
summarise(across(everything(), list(mean = mean, median = median, range = ~diff(range(.)),
variance = var, sd = sd)))
# Define a function to calculate the mode
mode_function_xtmat <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
# Apply mode function to each column in Xtmat
modes_xtmat <- apply(Xtmat, 2, mode_function_xtmat)
modes_df_xtmat <- data.frame(modes_xtmat)
# Print descriptive statistics
print(descriptive_stats_xtmat)
X1_mean X1_median X1_range X1_variance X1_sd X2_mean X2_median
1 -0.2673069 -0.1140094 12.95437 4.54818 2.132646 0.1122905 0.1136951
X2_range X2_variance X2_sd X3_mean X3_median X3_range X3_variance
1 7.482452 1.689924 1.299971 -0.09909023 -0.2253782 8.763167 1.877323
X3_sd X4_mean X4_median X4_range X4_variance X4_sd X5_mean
1 1.370154 -0.1536837 0.0209533 8.073807 2.30985 1.519819 0.02210862
X5_median X5_range X5_variance X5_sd X6_mean X6_median X6_range
1 -0.01191906 4.845459 0.6208914 0.7879666 -0.09207192 -0.03542006 6.194098
X6_variance X6_sd
1 1.008285 1.004134
print(modes_df_xtmat)
modes_xtmat
X1 1.46570565
X2 -3.45758589
X3 -0.01266486
X4 0.04866006
X5 0.20850686
X6 0.03070821
# Visualization for Xtmat: histograms for each variable
Xtmat %>%
gather(key = "variable", value = "value") %>%
ggplot(aes(x = value)) + geom_histogram(bins = 30, fill = "blue", color = "black") +
facet_wrap(~variable, scales = "free") + theme_minimal() + labs(title = "Histograms of Xtmat Variables")
# Creating box plots for each variable in Xtmat
Xtmat %>%
gather(key = "variable", value = "value") %>%
ggplot(aes(x = variable, y = value)) + geom_boxplot(fill = "orange", color = "darkred") +
theme_minimal() + labs(title = "Box Plots of Xtmat Variables")
# Correlation analysis for Xtmat: calculate the correlation matrix
correlation_matrix_xtmat <- cor(Xtmat)
# Use ggpairs for a comprehensive visualization of pair-wise relationships and
# correlations
ggpairs(Xtmat)
# Covariance matrix for Xtmat: calculate and print the covariance matrix
covariance_matrix_xtmat <- cov(Xtmat)
print(covariance_matrix_xtmat)
X1 X2 X3 X4 X5 X6
X1 4.5481804 0.62368776 1.0914450 2.6039032 1.01063742 1.2934574
X2 0.6236878 1.68992385 0.3049420 0.7567985 -0.07013976 0.3697011
X3 1.0914450 0.30494200 1.8773226 1.1360912 0.17203263 0.2653673
X4 2.6039032 0.75679849 1.1360912 2.3098496 0.57903361 1.0569777
X5 1.0106374 -0.07013976 0.1720326 0.5790336 0.62089138 0.2748526
X6 1.2934574 0.36970115 0.2653673 1.0569777 0.27485257 1.0082849
# Perform Principal Component Analysis (PCA) on the data frame Xtmat
pca.Xtmat = prcomp(Xtmat)
# Extract the eigenvalues (squared standard deviations) of each principal
# component
eval.t = pca.Xtmat$sdev^2
# Extract the eigenvectors (principal component loadings) from the PCA;
# prcomp stores them in $rotation ($loadings belongs to princomp)
evec.t = pca.Xtmat$rotation
# Print the summary of the PCA results
print(pca.Xtmat)
Standard deviations (1, .., p=6):
[1] 2.7545556 1.3145789 1.2187891 0.7998447 0.5928688 0.5119243
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5 PC6
X1 -0.7357956 0.38881222 -0.11239660 -0.44486273 0.30442973 -0.06493504
X2 -0.1717658 -0.74012410 -0.55655072 -0.27520384 -0.11115336 -0.15770821
X3 -0.2689875 -0.46125214 0.79387489 -0.06965788 0.07201535 -0.27316002
X4 -0.5146105 -0.15613913 0.02999639 0.40986375 -0.29005090 0.67659367
X5 -0.1644729 0.25232170 0.02369732 -0.04126024 -0.85903664 -0.41118670
X6 -0.2546992 0.01556161 -0.21428861 0.74284391 0.26020720 -0.51907771
summary(pca.Xtmat)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 2.7546 1.3146 1.2188 0.79984 0.59287 0.51192
Proportion of Variance 0.6294 0.1434 0.1232 0.05307 0.02916 0.02174
Cumulative Proportion 0.6294 0.7728 0.8960 0.94910 0.97826 1.00000
Compare \(\Sigma_0\) with the two sample covariance matrices: how different are the eigenvalues and eigenvectors of the three matrices? Which are most similar? Why do they differ?
X1 X2 X3 X4 X5 X6
X1 4.5481804  0.62368776 1.0914450 2.6039032  1.01063742 1.2934574
X2 0.6236878  1.68992385 0.3049420 0.7567985 -0.07013976 0.3697011
X3 1.0914450  0.30494200 1.8773226 1.1360912  0.17203263 0.2653673
X4 2.6039032  0.75679849 1.1360912 2.3098496  0.57903361 1.0569777
X5 1.0106374 -0.07013976 0.1720326 0.5790336  0.62089138 0.2748526
X6 1.2934574  0.36970115 0.2653673 1.0569777  0.27485257 1.0082849
original:
Standard deviations (1, .., p=6):
[1] 2.7545556 1.3145789 1.2187891 0.7998447 0.5928688 0.5119243
Rotation (n x k) = (6 x 6):
          PC1         PC2         PC3         PC4         PC5         PC6
X1 -0.7357956  0.38881222 -0.11239660 -0.44486273  0.30442973 -0.06493504
X2 -0.1717658 -0.74012410 -0.55655072 -0.27520384 -0.11115336 -0.15770821
X3 -0.2689875 -0.46125214  0.79387489 -0.06965788  0.07201535 -0.27316002
X4 -0.5146105 -0.15613913  0.02999639  0.40986375 -0.29005090  0.67659367
X5 -0.1644729  0.25232170  0.02369732 -0.04126024 -0.85903664 -0.41118670
X6 -0.2546992  0.01556161 -0.21428861  0.74284391  0.26020720 -0.51907771
Suppose the random variables \(X_1, X_2, X_3\) have the covariance matrix
\[ \Sigma=\begin{pmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 &0 &2 \end{pmatrix}.\]
sigma <- matrix(c(1, -2, 0, -2, 5, 0, 0, 0, 2), nrow = 3, ncol = 3)
sigma
[,1] [,2] [,3]
[1,] 1 -2 0
[2,] -2 5 0
[3,] 0 0 2
eig_sigma = eigen(sigma)
eig_sigma
eigen() decomposition
$values
[1] 5.8284271 2.0000000 0.1715729
$vectors
[,1] [,2] [,3]
[1,] -0.3826834 0 0.9238795
[2,] 0.9238795 0 0.3826834
[3,] 0.0000000 1 0.0000000
eigenvalues = eig_sigma$values
eigenvalues
[1] 5.8284271 2.0000000 0.1715729
eigenvectors = eig_sigma$vectors
eigenvectors
[,1] [,2] [,3]
[1,] -0.3826834 0 0.9238795
[2,] 0.9238795 0 0.3826834
[3,] 0.0000000 1 0.0000000
\[ \lambda_1 = 5.8284, \quad \lambda_2 = 2.0000, \quad \lambda_3 = 0.1716 \]
\[ e_1 = (-0.3827, 0.9239, 0)^T, \quad e_2 = (0, 0, 1)^T, \quad e_3 = (0.9239, 0.3827, 0)^T \]
# Note: prcomp(sigma) treats the rows of sigma as three observations, so it is
# not the same as the eigen-analysis of sigma above, which gives the population PCs
pca_sigma = prcomp(sigma)
print(pca_sigma)
Standard deviations (1, .., p=3):
[1] 3.925094e+00 1.122632e+00 2.660133e-16
Rotation (n x k) = (3 x 3):
PC1 PC2 PC3
[1,] -0.38832732 -0.08944370 0.91717049
[2,] 0.91850794 0.04285707 0.39307307
[3,] -0.07446515 0.99506939 0.06551218
\[ Y_1 = e_1^T X = (-0.3827, 0.9239, 0)(X_1, X_2, X_3)^T = -0.3827X_1 + 0.9239X_2 \]
\[ Y_2 = e_2^T X = (0, 0, 1)(X_1, X_2, X_3)^T = X_3 \]
\[ Y_3 = e_3^T X = (0.9239, 0.3827, 0)(X_1, X_2, X_3)^T = 0.9239X_1 + 0.3827X_2 \]
\[ \mathrm{Var}(Y_1) = \lambda_1 = 5.8284 \]
\[ \mathrm{Cov}(Y_1, Y_2) = \mathrm{Cov}(-0.3827X_1 + 0.9239X_2,\, X_3) = -0.3827\,\mathrm{Cov}(X_1, X_3) + 0.9239\,\mathrm{Cov}(X_2, X_3) = 0 \]
\[ \lambda_1 = 5.8284, \quad \lambda_2 = 2.0000, \quad \lambda_3 = 0.1716 \]
Total variance \(= \lambda_1 + \lambda_2 + \lambda_3 = 5.8284 + 2.0000 + 0.1716 = 8\).
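These identities can be verified numerically (a quick sketch; e1 and e3 are names introduced here, taken from the eig_sigma decomposition above):
# Quadratic-form checks: Var(Y1) = lambda1 and Cov(Y1, Y3) = 0
e1 <- eig_sigma$vectors[, 1]
e3 <- eig_sigma$vectors[, 3]
t(e1) %*% sigma %*% e1  # lambda1 = 5.8284
t(e1) %*% sigma %*% e3  # 0: distinct principal components are uncorrelated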
The headsize data provided in (a) give the head lengths and head breadths (in millimetres) for each of the first two adult sons in 25 families. Here we shall use only the head lengths (head1 and head2).
(a). Read the data set in R and assign an appropriate name to the data object.
"headsize" <- +matrix(c(191, 195, 181, 183, 176, 208, 189, 197, 188, 192, 179, 183,
174, 190, 188, 163, 195, 186, 181, 175, 192, 174, +176, 197, 190, 155, 149, 148,
153, 144, 157, 150, 159, 152, 150, 158, 147, 150, 159, 151, 137, 155, 153, +145,
140, 154, 143, 139, 167, 163, 179, 201, 185, 188, 171, 192, 190, 189, 197, 187,
186, 174, 185, 195, +187, 161, 183, 173, 182, 165, 185, 178, 176, 200, 187, 145,
152, 149, 149, 142, 152, 149, 152, 159, 151, +148, 147, 152, 157, 158, 130, 158,
148, 146, 137, 152, 147, 143, 158, 150), nrow = 25, ncol = 4, dimnames = list(character(0),
c("head1", "breadth1", "head2", "breadth2")))
x <- headsize
headsize <- as.data.frame(headsize)
headsize
head1 breadth1 head2 breadth2
1 191 155 179 145
2 195 149 201 152
3 181 148 185 149
4 183 153 188 149
5 176 144 171 142
6 208 157 192 152
7 189 150 190 149
8 197 159 189 152
9 188 152 197 159
10 192 150 187 151
11 179 158 186 148
12 183 147 174 147
13 174 150 185 152
14 190 159 195 157
15 188 151 187 158
16 163 137 161 130
17 195 155 183 158
18 186 153 173 148
19 181 145 182 146
20 175 140 165 137
21 192 154 185 152
22 174 143 178 147
23 176 139 176 143
24 197 167 200 158
25 190 163 187 150
(b). Compute the array of sample means and the variance-covariance matrix \(S\).
head_dat <- headsize[, c("head1", "head2")]
head(head_dat)
head1 head2
1 191 179
2 195 201
3 181 185
4 183 188
5 176 171
6 208 192
str(head_dat)
'data.frame': 25 obs. of 2 variables:
$ head1: num 191 195 181 183 176 208 189 197 188 192 ...
$ head2: num 179 201 185 188 171 192 190 189 197 187 ...
## Means and var-cov
head.mean = colMeans(head_dat)
cat("The mean vector of Head Lengths", sep = "\n")
The mean vector of Head Lengths
head.mean
head1 head2
185.72 183.84
cat("Covariance Matrix of Head Lengths", sep = "\n")
Covariance Matrix of Head Lengths
head.cov = cov(head_dat)
head.cov
head1 head2
head1 95.29333 69.66167
head2 69.66167 100.80667
(c). Perform PCA on the dataset based on \(S\).
head_pca <- prcomp(x = head_dat)
###
cat("Head dat PCA", sep = "\n")
Head dat PCA
head_pca
Standard deviations (1, .., p=2):
[1] 12.952459 5.322951
Rotation (n x k) = (2 x 2):
PC1 PC2
head1 -0.6929858 0.7209512
head2 -0.7209512 -0.6929858
cat("The loadings (eigenvectors) of Head Lengths", sep = "\n")
The loadings (eigenvectors) of Head Lengths
head_pca$rotation
PC1 PC2
head1 -0.6929858 0.7209512
head2 -0.7209512 -0.6929858
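As a check (a small sketch using head.cov from (b)), the squared standard deviations of the components should equal the eigenvalues of \(S\):
# Eigenvalues of S vs variances of the principal components
eigen(head.cov)$values
head_pca$sdev^2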
(d). Write the equations in (c) and determine the proportion of the total sample variance explained by the first two principal components.
###
cat("The standard deviation of Head Lengths", sep = "\n")
The standard deviation of Head Lengths
head_pca$sdev
[1] 12.952459 5.322951
###
cat("The summary of PCA of Head Lengths", sep = "\n")
The summary of PCA of Head Lengths
summary(head_pca)
Importance of components:
PC1 PC2
Standard deviation 12.9525 5.3230
Proportion of Variance 0.8555 0.1445
Cumulative Proportion 0.8555 1.0000
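Reading the loadings off the rotation matrix in (c) (with head1 and head2 mean-centred), the sample principal components can be written as
\[ y_1 = -0.6930\,\text{head1} - 0.7210\,\text{head2}, \qquad y_2 = 0.7210\,\text{head1} - 0.6930\,\text{head2}, \]
and the proportions of the total sample variance follow from the variances above:
\[ \frac{12.9525^2}{12.9525^2 + 5.3230^2} \approx 0.8555, \qquad \frac{5.3230^2}{12.9525^2 + 5.3230^2} \approx 0.1445. \]
So the first component alone explains about 85.6% of the total sample variance, and the first two together explain all of it.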
(e). Interpret the first two sample PCA components, obtained in (d).
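A quick numerical reading of these components (a sketch; head_scores is a name introduced here): the first component is essentially the (negative) sum of the two head lengths, i.e. an overall 'size' component, while the second is their difference, a 'shape' contrast between first and second sons.
head_scores <- head_pca$x
# PC1 tracks overall size: the sum of the two head lengths (up to sign)
cor(head_scores[, 1], head_dat$head1 + head_dat$head2)  # expect close to -1
# PC2 tracks the difference between the sons' head lengths
cor(head_scores[, 2], head_dat$head1 - head_dat$head2)  # expect close to 1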