library(knitr)  # kable()
library(dplyr)  # %>%, filter(), select()
library(MASS)   # lda(), qda()
df <- read.csv("https://www.dropbox.com/s/9x9aovbfbmfbixy/T5_6_PILOT.txt?dl=1",
               sep = "", header = FALSE)
names(df) <- c("group", "y1", "y2", "y3", "y4", "y5", "y6")
kable(head(df))

| group | y1 | y2 | y3 | y4 | y5 | y6 |
|---|---|---|---|---|---|---|
| 1 | 121 | 22 | 74 | 223 | 54 | 254 |
| 1 | 108 | 30 | 80 | 175 | 40 | 300 |
| 1 | 122 | 49 | 87 | 266 | 41 | 223 |
| 1 | 77 | 37 | 66 | 178 | 80 | 209 |
| 1 | 140 | 35 | 71 | 175 | 38 | 261 |
| 1 | 108 | 37 | 57 | 241 | 59 | 245 |
eng <- df %>% filter(group == 1) %>% dplyr::select(-group)
pil <- df %>% filter(group == 2) %>% dplyr::select(-group)
s1 <- var(eng)  # within-group covariance, engineers
s2 <- var(pil)  # within-group covariance, pilots
n1 <- dim(eng)[1]
n2 <- dim(pil)[1]
p <- dim(eng)[2]
# pooled within-group covariance matrix
sp <- ((n1 - 1) * s1 + (n2 - 1) * s2)/(n1 + n2 - 2)
s.eig <- eigen(sp)
s.eig
eigen() decomposition
$values
[1] 1050.59628 858.31581 398.90346 259.14837 108.08917 43.35349
$vectors
            [,1]          [,2]        [,3]         [,4]         [,5]        [,6]
[1,] -0.44092830  0.1897832542  0.86398819  0.079577443  0.105698146  0.07472808
[2,] -0.04088997  0.0377138213  0.08242069 -0.227027690 -0.057349251 -0.96709979
[3,]  0.03933709 -0.0313490863  0.14336529 -0.023912831 -0.985372990  0.07337894
[4,] -0.44974010 -0.8923295867 -0.03264727  0.003570544  0.004135096 -0.01964830
[5,]  0.01936520  0.0009446983 -0.05392188  0.970251939 -0.047798944 -0.23031058
[6,] -0.77441699  0.4066008303 -0.47138542 -0.012347547 -0.110802522  0.01789487
# proportion of variance explained by each component of the pooled matrix
# (calls reconstructed; values match the eigenvalues above)
s.eig$values / sum(s.eig$values)
[1] 0.38647504 0.31574225 0.14674165 0.09533098 0.03976196 0.01594813
# cumulative proportion of variance explained
cumsum(s.eig$values) / sum(s.eig$values)
[1] 0.3864750 0.7022173 0.8489589 0.9442899 0.9840519 1.0000000
df.sub <- df[, 2:7]
cov.sub <- cov(df.sub)
# princomp() accepts the list returned by cov.wt() (covariance plus n.obs)
pca.cov <- princomp(covmat = cov.wt(df.sub), cor = FALSE, scores = TRUE)
pca.cov

Call:
princomp(cor = FALSE, scores = TRUE, covmat = cov.wt(df.sub))
Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
41.497499 29.637102 20.035932 16.157875 11.353640 7.097781
6 variables and 40 observations.
summary(pca.cov)
Importance of components:
                           Comp.1     Comp.2     Comp.3      Comp.4      Comp.5     Comp.6
Standard deviation     41.4974985 29.6371023 20.0359318 16.15787527 11.35364032 7.09778133
Proportion of Variance  0.5002739  0.2551734  0.1166227  0.07584597  0.03744848 0.01463556
Cumulative Proportion   0.5002739  0.7554473  0.8720700  0.94791596  0.98536444 1.00000000
The approach from part (b) appears more successful: it explains a higher cumulative proportion of variance at every number of retained components than the pooled-covariance method for pilots versus engineers in part (a). The clearest indicator is the proportion of variance explained by the first principal component: 0.3865 for method (a) versus 0.5003 for method (b).
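As a quick check, here is a minimal sketch that lines up the cumulative proportions from the two approaches, reusing the `s.eig` and `pca.cov` objects fitted above:

# cumulative proportion of variance: pooled within-group vs. overall covariance
pooled  <- cumsum(s.eig$values) / sum(s.eig$values)
overall <- cumsum(pca.cov$sdev^2) / sum(pca.cov$sdev^2)
round(rbind(pooled, overall), 4)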
# equivalent PCA via prcomp (call reconstructed; object name `pca` assumed)
pca <- prcomp(df.sub)
pca
Standard deviations (1, .., p=6):
[1] 41.497499 29.637102 20.035932 16.157875 11.353640 7.097781
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5 PC6
y1 0.21165160 -0.38949336 0.88819049 -0.03082062 -0.04760343 -0.10677164
y2 -0.03883125 -0.06379320 0.09571590 0.19128493 -0.14793191 0.96269790
y3 0.08012946 0.06602004 0.08145863 0.12854488 0.97505667 0.12379748
y4 0.77552673 0.60795970 0.08071120 -0.08125631 -0.10891968 0.06295166
y5 -0.09593926 -0.01046493 0.01494473 -0.96813856 0.10919120 0.20309559
y6 0.58019734 -0.68566916 -0.43426141 -0.04518327 0.03644629 0.03572141
The scree plot shows a sharp drop in the marginal variance explained after the second principal component. Thus m should be set to either 2 or 3, depending on the theory behind the decision. Either seems defensible, but as a statistician I prefer explaining more variance over the simpler interpretation, so I would choose m = 3.
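The scree plot referenced above can be reproduced with something like:

# scree plot of the covariance-based PCA; the elbow falls after Comp.2
screeplot(pca.cov, type = "lines", main = "Scree plot, pilot data")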
# principal-component factor loadings sqrt(lambda_j) * e_j, for j = 1, 2, 3
# (calls reconstructed from the printed values)
pca$sdev[1] * pca$rotation[, 1]
        y1         y2         y3         y4         y5         y6
  8.783012  -1.611400   3.325172  32.182419  -3.981239  24.076738
pca$sdev[2] * pca$rotation[, 2]
         y1          y2          y3          y4          y5          y6
-11.5434545  -1.8906456   1.9566427  18.0181637  -0.3101502 -20.3212470
pca$sdev[3] * pca$rotation[, 3]
        y1         y2         y3         y4         y5         y6
17.7957240   1.9177572   1.6320995   1.6171240   0.2994317  -8.7008320
$loadings
Loadings:
l1 l2 l3
y1 5.246 -5.720 21.606
y2 -1.773 0.590 2.521
y3 4.143 -0.166 0.601
y4 35.522 -8.086 -5.979
y5 -3.156 2.456 0.210
y6 4.445 -32.148 3.887
l1 l2 l3
SS loadings 1339.397 1137.992 524.449
Proportion Var 223.233 189.665 87.408
Cumulative Var 223.233 412.898 500.306
$rotmat
[,1] [,2] [,3]
[1,] 0.7697299 -0.6363192 0.05112505
[2,] 0.5710494 0.6505477 -0.50068983
[3,] 0.2853393 0.4145909 0.86411569
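The rotation output above can be reproduced approximately with the sketch below; the construction of the unrotated loadings matrix `L` and the `normalize = FALSE` setting are assumptions, not the author's confirmed call:

# varimax rotation of the three unstandardized principal-component loadings
L <- cbind(l1 = pca$sdev[1] * pca$rotation[, 1],
           l2 = pca$sdev[2] * pca$rotation[, 2],
           l3 = pca$sdev[3] * pca$rotation[, 3])
varimax(L, normalize = FALSE)  # returns $loadings and $rotmat as printed above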
# ML factor analysis, unrotated (object name fa.none is a reconstruction)
fa.none <- factanal(x = df.sub, factors = 3, rotation = "none")
fa.none

Call:
factanal(x = df.sub, factors = 3, rotation = "none")
Uniquenesses:
y1 y2 y3 y4 y5 y6
0.005 0.703 0.866 0.378 0.005 0.726
Loadings:
Factor1 Factor2 Factor3
y1 0.725 0.685
y2 0.298 -0.455
y3 0.179 0.316
y4 0.217 0.757
y5 -0.723 0.687
y6 0.372 0.325 0.172
Factor1 Factor2 Factor3
SS loadings 1.355 0.985 0.977
Proportion Var 0.226 0.164 0.163
Cumulative Var 0.226 0.390 0.553
The degrees of freedom for the model is 0 and the fit was 0.0165
# uniquenesses at full precision
fa.none$uniquenesses
       y1        y2        y3        y4        y5        y6
0.0050000 0.7027878 0.8663403 0.3775656 0.0050000 0.7260919
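Communalities are the complement of these uniquenesses; a quick check, assuming the unrotated fit is stored as `fa.none` as above:

# communalities = 1 - uniquenesses; with p = 6 and m = 3 the model has 0 df,
# so factanal reports the fit statistic but no chi-square p-value
1 - fa.none$uniquenesses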
# ML factor analysis with varimax rotation (object name fa.vmax is a reconstruction)
fa.vmax <- factanal(x = df.sub, factors = 3, rotation = "varimax")
fa.vmax

Call:
factanal(x = df.sub, factors = 3, rotation = "varimax")
Uniquenesses:
y1 y2 y3 y4 y5 y6
0.005 0.703 0.866 0.378 0.005 0.726
Loadings:
Factor1 Factor2 Factor3
y1 0.996
y2 0.199 -0.328 -0.387
y3 0.345
y4 0.784
y5 0.977 -0.201
y6 0.367 0.366
Factor1 Factor2 Factor3
SS loadings 1.178 1.078 1.061
Proportion Var 0.196 0.180 0.177
Cumulative Var 0.196 0.376 0.553
The degrees of freedom for the model is 0 and the fit was 0.0165
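Because varimax is an orthogonal rotation, the total variance explained is identical across the two factanal fits; a minimal sketch verifying this, using the object names assumed above:

# total sum of squared loadings is rotation-invariant (~3.32 = 0.553 * 6)
sum(fa.none$loadings^2)
sum(fa.vmax$loadings^2)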
Method (e) seems to outperform method (f), as it explains a higher cumulative proportion of variance.
# Fisher's linear discriminant function computed by hand
n1 <- dim(eng)[1]
n2 <- dim(pil)[1]
xbar1 <- colMeans(eng)
xbar2 <- colMeans(pil)
s1 <- var(eng)
s2 <- var(pil)
sp <- ((n1 - 1) * s1 + (n2 - 1) * s2)/(n1 + n2 - 2)
# discriminant coefficients and the cutoff midway between the projected means
a.hat <- solve(sp) %*% (xbar1 - xbar2)
ybar1 <- t(a.hat) %*% xbar1
ybar2 <- t(a.hat) %*% xbar2
mhat <- (1/2) * (ybar1 + ybar2)
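To classify with this hand-computed rule, a minimal sketch (assuming group 1 projects above the cutoff, i.e. ybar1 > ybar2):

# project each observation onto a.hat and split at the midpoint mhat
scores <- as.matrix(df[, 2:7]) %*% a.hat
manual.class <- ifelse(scores > c(mhat), 1, 2)
table(truth = df$group, manual.class)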
z <- lda(group ~ ., df)
pred.class <- predict(z, df)$class
mean(pred.class == df$group)
[1] 0.9
table(df$group, pred.class)
   pred.class
     1  2
  1 18  2
  2  2 18
# same assessment with quadratic discriminant analysis
# (QDA call reconstructed from the prose below; object name zq assumed)
zq <- qda(group ~ ., df)
pred.class <- predict(zq, df)$class
mean(pred.class == df$group)
[1] 0.9
table(df$group, pred.class)
   pred.class
     1  2
  1 18  2
  2  2 18
The results of LDA and QDA suggest very similar performance; they each classify correctly 90% of the time in this particular example.
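Both accuracies above are resubstitution estimates and may be optimistic; a minimal sketch of a leave-one-out comparison using MASS's built-in `CV = TRUE` option:

# leave-one-out cross-validated accuracy for LDA and QDA
lda.cv <- lda(group ~ ., df, CV = TRUE)
qda.cv <- qda(group ~ ., df, CV = TRUE)
mean(lda.cv$class == df$group)
mean(qda.cv$class == df$group)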