library(knitr)  # kable()
library(dplyr)  # %>%, filter(), select()
library(MASS)   # lda(), qda()
df <- read.csv("https://www.dropbox.com/s/9x9aovbfbmfbixy/T5_6_PILOT.txt?dl=1",
               sep = "", header = FALSE)
names(df) <- c("group", "y1", "y2", "y3", "y4", "y5", "y6")
kable(head(df))

| group | y1 | y2 | y3 | y4 | y5 | y6 |
|---|---|---|---|---|---|---|
| 1 | 121 | 22 | 74 | 223 | 54 | 254 |
| 1 | 108 | 30 | 80 | 175 | 40 | 300 |
| 1 | 122 | 49 | 87 | 266 | 41 | 223 |
| 1 | 77 | 37 | 66 | 178 | 80 | 209 |
| 1 | 140 | 35 | 71 | 175 | 38 | 261 |
| 1 | 108 | 37 | 57 | 241 | 59 | 245 |
eng <- df %>% filter(group == 1) %>% dplyr::select(-group)
pil <- df %>% filter(group == 2) %>% dplyr::select(-group)
s1 <- var(eng)  # within-group covariance, engineers
s2 <- var(pil)  # within-group covariance, pilots
n1 <- dim(eng)[1]
n2 <- dim(pil)[1]
p <- dim(eng)[2]
# pooled within-group covariance matrix
sp <- ((n1 - 1) * s1 + (n2 - 1) * s2)/(n1 + n2 - 2)
s.eig <- eigen(sp)
s.eig
eigen() decomposition
$values
[1] 1050.59628 858.31581 398.90346 259.14837 108.08917 43.35349
$vectors
            [,1]          [,2]        [,3]         [,4]         [,5]        [,6]
[1,] -0.44092830  0.1897832542  0.86398819  0.079577443  0.105698146  0.07472808
[2,] -0.04088997  0.0377138213  0.08242069 -0.227027690 -0.057349251 -0.96709979
[3,]  0.03933709 -0.0313490863  0.14336529 -0.023912831 -0.985372990  0.07337894
[4,] -0.44974010 -0.8923295867 -0.03264727  0.003570544  0.004135096 -0.01964830
[5,]  0.01936520  0.0009446983 -0.05392188  0.970251939 -0.047798944 -0.23031058
[6,] -0.77441699  0.4066008303 -0.47138542 -0.012347547 -0.110802522  0.01789487
# proportion of variance explained by each component of the pooled matrix
# (calls reconstructed; values match the eigenvalues above)
s.eig$values / sum(s.eig$values)
[1] 0.38647504 0.31574225 0.14674165 0.09533098 0.03976196 0.01594813
# cumulative proportion of variance explained
cumsum(s.eig$values) / sum(s.eig$values)
[1] 0.3864750 0.7022173 0.8489589 0.9442899 0.9840519 1.0000000
df.sub <- df[, 2:7]
cov.sub <- cov(df.sub)
# princomp() accepts the list returned by cov.wt() (covariance plus n.obs)
pca.cov <- princomp(covmat = cov.wt(df.sub), cor = FALSE, scores = TRUE)
pca.cov

Call:
princomp(cor = FALSE, scores = TRUE, covmat = cov.wt(df.sub))
Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
41.497499 29.637102 20.035932 16.157875 11.353640 7.097781
6 variables and 40 observations.
summary(pca.cov)
Importance of components:
                           Comp.1     Comp.2     Comp.3      Comp.4      Comp.5     Comp.6
Standard deviation     41.4974985 29.6371023 20.0359318 16.15787527 11.35364032 7.09778133
Proportion of Variance  0.5002739  0.2551734  0.1166227  0.07584597  0.03744848 0.01463556
Cumulative Proportion   0.5002739  0.7554473  0.8720700  0.94791596  0.98536444 1.00000000
The approach from part (b) appears more successful: it explains a higher cumulative proportion of variance at every number of retained components than the pooled-covariance method for pilots versus engineers in part (a). The clearest indicator is the proportion of variance explained by the first principal component: 0.3865 for method (a) versus 0.5003 for method (b).
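As a quick check, here is a minimal sketch that lines up the cumulative proportions from the two approaches, reusing the `s.eig` and `pca.cov` objects fitted above:

# cumulative proportion of variance: pooled within-group vs. overall covariance
pooled  <- cumsum(s.eig$values) / sum(s.eig$values)
overall <- cumsum(pca.cov$sdev^2) / sum(pca.cov$sdev^2)
round(rbind(pooled, overall), 4)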
# equivalent PCA via prcomp (call reconstructed; object name `pca` assumed)
pca <- prcomp(df.sub)
pca
Standard deviations (1, .., p=6):
[1] 41.497499 29.637102 20.035932 16.157875 11.353640 7.097781
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5 PC6
y1 0.21165160 -0.38949336 0.88819049 -0.03082062 -0.04760343 -0.10677164
y2 -0.03883125 -0.06379320 0.09571590 0.19128493 -0.14793191 0.96269790
y3 0.08012946 0.06602004 0.08145863 0.12854488 0.97505667 0.12379748
y4 0.77552673 0.60795970 0.08071120 -0.08125631 -0.10891968 0.06295166
y5 -0.09593926 -0.01046493 0.01494473 -0.96813856 0.10919120 0.20309559
y6 0.58019734 -0.68566916 -0.43426141 -0.04518327 0.03644629 0.03572141
The scree plot shows a sharp drop in the marginal variance explained after the second principal component. Thus m should be set to either 2 or 3, depending on the theory behind the decision. Either seems defensible, but as a statistician I prefer explaining more variance over the simpler interpretation, so I would choose m = 3.
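The scree plot referenced above can be reproduced with something like:

# scree plot of the covariance-based PCA; the elbow falls after Comp.2
screeplot(pca.cov, type = "lines", main = "Scree plot, pilot data")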
# principal-component factor loadings sqrt(lambda_j) * e_j, for j = 1, 2, 3
# (calls reconstructed from the printed values)
pca$sdev[1] * pca$rotation[, 1]
        y1         y2         y3         y4         y5         y6
  8.783012  -1.611400   3.325172  32.182419  -3.981239  24.076738
pca$sdev[2] * pca$rotation[, 2]
         y1          y2          y3          y4          y5          y6
-11.5434545  -1.8906456   1.9566427  18.0181637  -0.3101502 -20.3212470
pca$sdev[3] * pca$rotation[, 3]
        y1         y2         y3         y4         y5         y6
17.7957240   1.9177572   1.6320995   1.6171240   0.2994317  -8.7008320
$loadings
Loadings:
l1 l2 l3
y1 5.246 -5.720 21.606
y2 -1.773 0.590 2.521
y3 4.143 -0.166 0.601
y4 35.522 -8.086 -5.979
y5 -3.156 2.456 0.210
y6 4.445 -32.148 3.887
l1 l2 l3
SS loadings 1339.397 1137.992 524.449
Proportion Var 223.233 189.665 87.408
Cumulative Var 223.233 412.898 500.306
$rotmat
[,1] [,2] [,3]
[1,] 0.7697299 -0.6363192 0.05112505
[2,] 0.5710494 0.6505477 -0.50068983
[3,] 0.2853393 0.4145909 0.86411569
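The rotation output above can be reproduced approximately with the sketch below; the construction of the unrotated loadings matrix `L` and the `normalize = FALSE` setting are assumptions, not the author's confirmed call:

# varimax rotation of the three unstandardized principal-component loadings
L <- cbind(l1 = pca$sdev[1] * pca$rotation[, 1],
           l2 = pca$sdev[2] * pca$rotation[, 2],
           l3 = pca$sdev[3] * pca$rotation[, 3])
varimax(L, normalize = FALSE)  # returns $loadings and $rotmat as printed above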
# ML factor analysis, unrotated (object name fa.none is a reconstruction)
fa.none <- factanal(x = df.sub, factors = 3, rotation = "none")
fa.none

Call:
factanal(x = df.sub, factors = 3, rotation = "none")
Uniquenesses:
y1 y2 y3 y4 y5 y6
0.005 0.703 0.866 0.378 0.005 0.726
Loadings:
Factor1 Factor2 Factor3
y1 0.725 0.685
y2 0.298 -0.455
y3 0.179 0.316
y4 0.217 0.757
y5 -0.723 0.687
y6 0.372 0.325 0.172
Factor1 Factor2 Factor3
SS loadings 1.355 0.985 0.977
Proportion Var 0.226 0.164 0.163
Cumulative Var 0.226 0.390 0.553
The degrees of freedom for the model is 0 and the fit was 0.0165
# uniquenesses at full precision
fa.none$uniquenesses
       y1        y2        y3        y4        y5        y6
0.0050000 0.7027878 0.8663403 0.3775656 0.0050000 0.7260919
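Communalities are the complement of these uniquenesses; a quick check, assuming the unrotated fit is stored as `fa.none` as above:

# communalities = 1 - uniquenesses; with p = 6 and m = 3 the model has 0 df,
# so factanal reports the fit statistic but no chi-square p-value
1 - fa.none$uniquenesses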
# ML factor analysis with varimax rotation (object name fa.vmax is a reconstruction)
fa.vmax <- factanal(x = df.sub, factors = 3, rotation = "varimax")
fa.vmax

Call:
factanal(x = df.sub, factors = 3, rotation = "varimax")
Uniquenesses:
y1 y2 y3 y4 y5 y6
0.005 0.703 0.866 0.378 0.005 0.726
Loadings:
Factor1 Factor2 Factor3
y1 0.996
y2 0.199 -0.328 -0.387
y3 0.345
y4 0.784
y5 0.977 -0.201
y6 0.367 0.366
Factor1 Factor2 Factor3
SS loadings 1.178 1.078 1.061
Proportion Var 0.196 0.180 0.177
Cumulative Var 0.196 0.376 0.553
The degrees of freedom for the model is 0 and the fit was 0.0165
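Because varimax is an orthogonal rotation, the total variance explained is identical across the two factanal fits; a minimal sketch verifying this, using the object names assumed above:

# total sum of squared loadings is rotation-invariant (~3.32 = 0.553 * 6)
sum(fa.none$loadings^2)
sum(fa.vmax$loadings^2)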
Method (e) seems to outperform method (f), as it explains a higher cumulative proportion of variance.
# Fisher's linear discriminant function computed by hand
n1 <- dim(eng)[1]
n2 <- dim(pil)[1]
xbar1 <- colMeans(eng)
xbar2 <- colMeans(pil)
s1 <- var(eng)
s2 <- var(pil)
sp <- ((n1 - 1) * s1 + (n2 - 1) * s2)/(n1 + n2 - 2)
# discriminant coefficients and the cutoff midway between the projected means
a.hat <- solve(sp) %*% (xbar1 - xbar2)
ybar1 <- t(a.hat) %*% xbar1
ybar2 <- t(a.hat) %*% xbar2
mhat <- (1/2) * (ybar1 + ybar2)
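To classify with this hand-computed rule, a minimal sketch (assuming group 1 projects above the cutoff, i.e. ybar1 > ybar2):

# project each observation onto a.hat and split at the midpoint mhat
scores <- as.matrix(df[, 2:7]) %*% a.hat
manual.class <- ifelse(scores > c(mhat), 1, 2)
table(truth = df$group, manual.class)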
z <- lda(group ~ ., df)
pred.class <- predict(z, df)$class
mean(pred.class == df$group)
[1] 0.9
table(df$group, pred.class)
   pred.class
     1  2
  1 18  2
  2  2 18
# same assessment with quadratic discriminant analysis
# (QDA call reconstructed from the prose below; object name zq assumed)
zq <- qda(group ~ ., df)
pred.class <- predict(zq, df)$class
mean(pred.class == df$group)
[1] 0.9
table(df$group, pred.class)
   pred.class
     1  2
  1 18  2
  2  2 18
The results of LDA and QDA suggest very similar performance; they each classify correctly 90% of the time in this particular example.
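Both accuracies above are resubstitution estimates and may be optimistic; a minimal sketch of a leave-one-out comparison using MASS's built-in `CV = TRUE` option:

# leave-one-out cross-validated accuracy for LDA and QDA
lda.cv <- lda(group ~ ., df, CV = TRUE)
qda.cv <- qda(group ~ ., df, CV = TRUE)
mean(lda.cv$class == df$group)
mean(qda.cv$class == df$group)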