Chapter Eight – Part Two.
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.2 ✓ purrr 0.3.4
✓ tibble 3.0.4 ✓ dplyr 1.0.2
✓ tidyr 1.1.2 ✓ stringr 1.4.0
✓ readr 1.4.0 ✓ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(ggfortify)
library(here)
here() starts at /Users/thawks/Library/Mobile Documents/com~apple~CloudDocs/Documents/John Jay/2020 Fall/MAT 367
library(kableExtra)
Attaching package: ‘kableExtra’
The following object is masked from ‘package:dplyr’:
group_rows
Swiss Bank Notes: The data consist of p = 6 measurements in millimeters on n = 100 genuine Swiss Bank notes (old ones). Data file named Swiss_Bank_Notes_Data.csv has the data for this example.
file = here("Data","Swiss_Bank_Notes_Data.csv")
notes = read_csv(file)
── Column specification ────────────────────────────────────────────────────────
cols(
length = col_double(),
left_width = col_double(),
right_width = col_double(),
bottom_margin = col_double(),
top_margin = col_double(),
diag_length = col_double()
)
notes
Compute the principal components and select first three components.Use these principal components and correlations between original variables and components to interpret. Use covariance matrix S for this analysis.
q1a_S = cov(notes)
q1a_eig = eigen(q1a_S)
q1a_formatted = function(){
for (i in 1:3){
coef = round(q1a_eig$vectors[i,],3)
pcs = paste(coef[1],"X_1", sep = "")
for (j in 2:length(coef)){
if (sign(coef[j]) == 1){
pcs = paste(pcs, " + ", coef[j],"X_", j, sep = "")
}
else if (sign(coef[j]) == -1){
pcs = paste(pcs," - ", -coef[j],"X_", j, sep = "")
}
}
rv = paste0("- $Y_", i, " = ", pcs, " $\n", sep = "")
cat(rv)
}
}
q1a_formatted()
Construct plots and see any violations.
pca_notes = prcomp(notes)
qqnorm(pca_notes$x[,1])
qqnorm(pca_notes$x[,2])
qqnorm(pca_notes$x[,3])
autoplot(pca_notes, label = TRUE)
Plotting the first three principle components with QQplot show the distributions to be approximately normal, with only very minor outliers. A scatterplot of the first component against the second confirms that the outliers are small in magnitude.
Then find the 95% Bonferroni simultaneous confidence intervals for variances of the principal components.
alpha = .05
q1c_z = qnorm(alpha/2*3, lower.tail = F)
q1c_bf.min = q1a_eig$values[1:3]/(1 + q1c_z * sqrt(2/dim(notes)[1]))
q1c_bf.max = q1a_eig$values[1:3]/(1 - q1c_z * sqrt(2/dim(notes)[1]))
q1c = tibble(pca = 1:3, min = q1c_bf.min, max = q1c_bf.max)
q1c %>%
kbl() %>%
kable_minimal(full_width = F)
| pca | min | max |
|---|---|---|
| 1 | 0.5724990 | 0.8651831 |
| 2 | 0.2985055 | 0.4511133 |
| 3 | 0.1542132 | 0.2330530 |
The data on national track records for men for 54 countries are given in data file named Track_Records_Men.txt.
track_file = here("Data", "Track_Records_Men.txt")
track = read.table(track_file, header = T)
track = tibble(track)
track
Determine the first two principal components for the standardized variables. Then use the component coefficients and correlation coefficients to interpret two principal components.
q2_pca = prcomp(track %>% select(-Country), scale = T)
q2a_formatted = function(){
for (i in 1:2){
coef = round(q2_pca$rotation[,i],3)
pcs = paste(coef[1],"X_1", sep = "")
for (j in 2:length(coef)){
if (sign(coef[j]) == 1){
pcs = paste(pcs, " + ", coef[j],"X_", j, sep = "")
}
else if (sign(coef[j]) == -1){
pcs = paste(pcs," - ", -coef[j],"X_", j, sep = "")
}
}
rv = paste0("- $Y_", i, " = ", pcs, " $\n", sep = "")
cat(rv)
}
}
q2a_formatted()
cor(track %>% select(-Country), q2_pca$x)
PC1 PC2 PC3 PC4 PC5
X100m.s. -0.8605755 -0.42299290 -0.164019263 -0.172746424 0.09360945
X200m.s. -0.8959509 -0.37584469 0.001805954 -0.098464749 -0.16912991
X400m.s. -0.8780162 -0.27592007 0.031987545 0.386239872 0.04154145
X800m.m. -0.9139768 0.07147524 0.373349523 -0.060923243 -0.07099713
X1500.m. -0.9475610 0.12276915 0.116515655 -0.105722740 0.20355032
X5000m.m. -0.9574913 0.23551480 -0.087224857 0.024783427 0.02243356
X10000m.m. -0.9474679 0.26655325 -0.116377639 0.039503854 -0.01915871
Marathon.m. -0.9172508 0.30886432 -0.159618346 -0.008221675 -0.10554831
PC6 PC7 PC8
X100m.s. -0.09625553 0.07532160 -0.0064771301
X200m.s. 0.09268091 -0.09530856 0.0059895281
X400m.s. 0.02049444 0.02460301 -0.0003420605
X800m.m. -0.09074435 0.05608990 -0.0038718020
X1500.m. 0.14085356 -0.03185703 -0.0039182827
X5000m.m. -0.09548628 -0.07113424 0.0695694120
X10000m.m. -0.07260596 -0.07607261 -0.0687311626
Marathon.m. 0.09974715 0.12873076 0.0068335563
Under the first principle component, we see that none of the events are over represented; all the weights are very similar (none more than .1 different). There is more variation in the second PC, favoring the shortest courses. When we find the correlations to the original data set, we see something similar, all of the courses are highly correlated with the first principle components; in the second PC, the shortest courses are most highly correlated.
Rank the nations based on their score on the first principal component (value of the principal component for each country). Does this ranking correspond with your intuitive notion of athletic excellence for the various countries.
q2b = tibble(country = track$Country, pca_1 = q2_pca$x[,1])
arrange(q2b, -pca_1)
Not at all surprised by this ranking — these countries (predominantly western and wealthy and otherwise traditional athletic powerhouses) often do well in global competition.
Note that the data file records for 800m, 1500m, 5000m, 10,000m and the marathon are given in minutes while short distances are given in seconds. Convert the records to speeds measured in meters per second for all variables. The marathon is 26.2 miles, or 42,195 meters long. Perform the principal component analysis using the sample covariance matrix S of the speed data. Compare the results with the results obtained using correlation matrix in above two parts.
track2 = track %>%
mutate(X100m.s. = X100m.s./100,
X200m.s. = X200m.s./200,
X400m.s. = X400m.s./400,
X800m.m. = X800m.m.*60/800,
X1500.m. = X1500.m.*60/1500,
X5000m.m. = X5000m.m.*60/5000,
X10000m.m. = X10000m.m.*60/10000,
Marathon.m. = Marathon.m.*60/42195)
q2c_pca = prcomp(track2 %>% select(-Country))
q2c_pca$rotation[,1:2]
PC1 PC2
X100m.s. -0.08266859 -0.25917553
X200m.s. -0.10767856 -0.31622357
X400m.s. -0.14167701 -0.42498895
X800m.m. -0.17191254 -0.34342836
X1500.m. -0.28393113 -0.39628663
X5000m.m. -0.45330875 -0.15893260
X10000m.m. -0.50116935 -0.02605457
Marathon.m. -0.62822648 0.59269769
cor(track2 %>% select(-Country), q2c_pca$x[,1:2])
PC1 PC2
X100m.s. -0.7421812 -0.427120112
X200m.s. -0.7799973 -0.420479384
X400m.s. -0.7825755 -0.430914534
X800m.m. -0.8682220 -0.318380193
X1500.m. -0.9292147 -0.238066967
X5000m.m. -0.9864334 -0.063485316
X10000m.m. -0.9883011 -0.009431382
Marathon.m. -0.9805309 0.169810649
By transforming the dataset in this way, we have given greater emphasis to the longer courses. This is most clearly seen in the rotations, where the magnitude of the rotation shrinks as the courses get longer.
q2c = tibble(country = track$Country, pca_1c = q2c_pca$x[,1])
arrange(q2c, -pca_1c)
The new ranking is similar to the first one. Now Kenya reigns, but otherwise it is largely the same colleciton of countries.