Multivariate Statistical Methods

Principal Component Analysis

Patrick Oter

2019-04-09

Men & Women’s Track Records by Country

  a. Test for a mean difference in the track records between women and men. You may assume they share the same covariance structure. Calculate the p-value.
  b. Determine the first two principal components for the standardized variables of women’s track records. Prepare a table showing the correlations of the standardized variables with the principal components and the cumulative percentage of the total variance explained by the two components.
  c. Interpret the two principal components obtained in part (b). (Note that the first component is essentially a normalized unit vector and might measure the athletic excellence of a given nation, and the second component might measure the relative strength of a nation at the various running distances.)
  d. Rank the nations based on their score on the first principal component. Does this ranking correspond with your intuitive notion of athletic excellence for the various countries?
  e. Repeat parts (b), (c) and (d) with the men’s track data.
  f. Using the first two principal components of the women’s data and the first two of the men’s data, test for a difference in the two population principal components. Calculate the p-value and compare it to your answer in part (a). Note that the principal components returned by prcomp when you use standardized values are also standardized. You’ll want to use the loadings to transform the original data. Also, pay attention to the signs on the loadings. The first two principal components of the two data sets can be interpreted as the same thing, but you may get opposite signs; this can throw off Hotelling’s T-squared statistic.
  g. Perform a factor analysis of the national track records for women. Use the sample covariance matrix S and interpret the factors. Repeat the analysis with the sample correlation matrix R. Does it make a difference if R, rather than S, is factored? Explain.
  h. Perform a factor analysis of the national track records for men. Is the appropriate factor model for the men’s data different from the one for the women’s data? If not, are the interpretations of the factors roughly the same? If the models are different, explain the differences.
  i. (STAT488 only) Convert the national track records for women to speeds measured in meters per second. Notice the records for 800m, 1500m, 3000m and marathon are given in minutes. A marathon is 26.2 miles or 42195 meters long. Perform a principal components analysis using the covariance matrix S of this speed data. Compare the results with parts (b)-(d) above. Do your interpretations of the components differ? Rank the nations based on their score on the first principal component. How does it compare? Which analysis do you prefer?
  j. (STAT488 only) Repeat the process with the men’s track data, that is, convert it to speed and perform the analysis in parts (b)-(d) above.
  k. (STAT488 only) Using the speed data, use the first two principal components to test for a difference in the two populations. Calculate the p-value and compare it to your answers in parts (a) and (f) above. As with part (f), you may need to adjust the loadings so they match in the two principal components.

a. Testing for mean difference in track records between men & women (\(\alpha = 0.05\)).

\(H_{0}: \mu_{men} = \mu_{women}\)
\(H_{A}: \mu_{men} \neq \mu_{women}\)

Test stat:  82.838 
Numerator df:  6 
Denominator df:  101 
P-value:  0 

Conclusion: The reported p-value rounds to 0, which is below 0.05, so we reject the null hypothesis. There is statistically significant evidence that the mean national track records differ between men and women.
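
For reference, the degrees of freedom reported above are what the standard two-sample Hotelling \(T^{2}\) test would give. This is a sketch under assumptions, since the report does not show the computation: \(p = 6\) events and \(n_{men} = n_{women} = 54\) nations (the n.obs value that appears in the factanal calls in parts (g) and (h)), with a pooled covariance matrix \(S_{pooled}\).

\[
T^{2} = \frac{n_{men}\, n_{women}}{n_{men} + n_{women}}\,
(\bar{x}_{men} - \bar{x}_{women})^{\top} S_{pooled}^{-1} (\bar{x}_{men} - \bar{x}_{women}),
\qquad
F = \frac{n_{men} + n_{women} - p - 1}{(n_{men} + n_{women} - 2)\, p}\, T^{2} \sim F_{p,\; n_{men} + n_{women} - p - 1}
\]

With these values the numerator degrees of freedom are \(p = 6\) and the denominator degrees of freedom are \(54 + 54 - 6 - 1 = 101\), matching the output. The output does not say whether 82.838 is \(T^{2}\) or the scaled \(F\) value.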

b. Loadings of first two principal components for the standardized variables of women’s track records with a table showing the correlations of the standardized variables with the principal components and the cumulative percentage of the total variance explained by the two components.

Standard deviations (1, .., p=6):
[1] 2.2399302 0.7239524 0.4643059 0.3514499 0.2716888 0.2137607

Rotation (n x k) = (6 x 6):
                PC1        PC2         PC3        PC4        PC5
100m     -0.4137534  0.3846299 -0.12054812  0.5735489 -0.3939281
200m     -0.4202377  0.3824032 -0.06535754  0.1897090  0.3176902
400m     -0.4061645  0.3929376  0.44254638 -0.5768634  0.1985368
800m     -0.4226394 -0.2858530 -0.09940074 -0.3891141 -0.6981406
1500m    -0.4067668 -0.3159300 -0.67477321 -0.1371183  0.4276702
marathon -0.3783590 -0.6081973  0.56581787  0.3634138  0.1848636
                 PC6
100m     -0.42684706
200m      0.73210645
400m     -0.33555163
800m      0.30161790
1500m    -0.27875957
marathon -0.02337895

          100m       200m       400m       800m      1500m   marathon
PC1 -0.4137534 -0.4202377 -0.4061645 -0.4226394 -0.4067668 -0.3783590
PC2  0.3846299  0.3824032  0.3929376 -0.2858530 -0.3159300 -0.6081973

                           PC1       PC2
Standard deviation     2.23993 0.7239524
Proportion of Variance 0.83621 0.0873500
Cumulative Proportion  0.83621 0.9235700
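
Part (b) asks for correlations, while the output above lists loadings. For standardized variables the correlation between variable k and component i is the loading multiplied by that component's standard deviation, and each component's share of variance is its squared standard deviation over the total. The following is a minimal Python sketch of both calculations using only the numbers printed above (the report itself used R's prcomp; the variable names here are illustrative):

```python
# Values copied from the women's prcomp output above.
sdev = [2.2399302, 0.7239524, 0.4643059, 0.3514499, 0.2716888, 0.2137607]
events = ["100m", "200m", "400m", "800m", "1500m", "marathon"]
pc1 = [-0.4137534, -0.4202377, -0.4061645, -0.4226394, -0.4067668, -0.3783590]
pc2 = [0.3846299, 0.3824032, 0.3929376, -0.2858530, -0.3159300, -0.6081973]

# Each component's variance is its squared standard deviation.
variances = [s ** 2 for s in sdev]
total = sum(variances)
proportion = [v / total for v in variances]
cumulative_two = proportion[0] + proportion[1]

# For standardized variables, corr(variable k, PC i) = loading_ik * sdev_i.
correlations = {
    event: (l1 * sdev[0], l2 * sdev[1])
    for event, l1, l2 in zip(events, pc1, pc2)
}

print(f"PC1 share: {proportion[0]:.5f}, PC1+PC2 share: {cumulative_two:.5f}")
for event, (r1, r2) in correlations.items():
    print(f"{event:>8}  r(PC1) = {r1:+.3f}  r(PC2) = {r2:+.3f}")
```

The shares reproduce the "Proportion of Variance" and "Cumulative Proportion" rows, which is a quick check that the loadings and standard deviations were copied consistently.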

c. Interpretation of principal components from part (b)

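The loadings of the first principal component are nearly equal across all six events (about -0.38 to -0.42), which suggests that a single latent variable, such as a nation’s athletic excellence (as indicated in the exercise), is responsible for much of the variance (83.6%). Because these loadings are negative and the variables are times, faster nations receive higher PC1 scores. The second principal component’s loadings are positive for the shorter distances (100m to 400m) and negative for the longer distances (800m to marathon), so it contrasts a nation’s sprinting ability with its distance-running ability.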

d. Ranking Countries based on PC1

As far as I know, the ranking aligns fairly well with track talent: the countries that rank highest on this principal component are at least competitive in the Olympics.
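
The ranking itself is not printed in this report. As a rough Python sketch of how such a ranking can be produced, each nation's PC1 score is the dot product of its standardized record times with the PC1 loadings. The three nations and their z-scores below are hypothetical, invented only to show the mechanics; the loadings are the women's PC1 loadings from part (b).

```python
# Women's PC1 loadings, copied from part (b).
pc1 = [-0.4137534, -0.4202377, -0.4061645, -0.4226394, -0.4067668, -0.3783590]

# HYPOTHETICAL standardized record times (z-scores) for three made-up nations;
# a negative z means faster than the average nation at that distance.
z_scores = {
    "Nation A": [-1.2, -1.1, -0.9, -1.0, -0.8, -0.7],
    "Nation B": [0.1, 0.0, 0.2, -0.1, 0.1, 0.0],
    "Nation C": [1.3, 1.4, 1.1, 0.9, 1.2, 1.5],
}

def pc_score(z, loadings):
    """Score of one nation on a component: the dot product of z and loadings."""
    return sum(zk * lk for zk, lk in zip(z, loadings))

scores = {nation: pc_score(z, pc1) for nation, z in z_scores.items()}

# All PC1 loadings are negative and the variables are times, so the fastest
# nations get the largest scores; sort in descending order.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

If the sign of PC1 were flipped (which PCA software is free to do), the sort order would need to be reversed to keep the fastest nations first.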

e. Parts (b)-(d) with men’s track records

Standard deviations (1, .., p=6):
[1] 2.2264396 0.7001597 0.4577477 0.4326613 0.3111293 0.2433370

Rotation (n x k) = (6 x 6):
                PC1        PC2        PC3         PC4        PC5
100m     -0.4013958 -0.5138807  0.4789843 -0.01687632 -0.2969661
200m     -0.4180798 -0.3917657  0.1620888 -0.17179519  0.5364074
400m     -0.4035829 -0.2881765 -0.7458936  0.42347137 -0.1323242
800m     -0.4104405  0.3266872 -0.2985046 -0.65176976  0.2267076
1500m    -0.4208110  0.3494365  0.1138092 -0.13633585 -0.6628797
marathon -0.3945481  0.5201637  0.2930638  0.58947631  0.3402392
                 PC6
100m      0.50686134
200m     -0.57289582
400m      0.02966769
800m      0.39938562
1500m    -0.47944005
marathon  0.15693998

          100m       200m       400m       800m      1500m   marathon
PC1 -0.4013958 -0.4180798 -0.4035829 -0.4104405 -0.4208110 -0.3945481
PC2 -0.5138807 -0.3917657 -0.2881765  0.3266872  0.3494365  0.5201637

                           PC1       PC2
Standard deviation     2.22644 0.7001597
Proportion of Variance 0.82617 0.0817000
Cumulative Proportion  0.82617 0.9078800

The loadings of the first principal component are nearly equal (about -0.39 to -0.42), which suggests that a latent variable such as a nation’s athletic excellence (as indicated in the exercise) is responsible for much of the variance (82.6%). The second principal component’s loadings are negative for the shorter distances and positive for the longer distances, the reverse of the signs in the women’s analysis, so it may again be a proxy for a nation’s relative strength at sprints versus distance events. Additionally, the national rankings based on the first principal component seem to align with national rankings as I understand them.

f. Testing for differences in Population Principal components using PC1 & PC2 for Men’s/Women’s National Track Records
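
The exercise warns that opposite signs on matching components can distort Hotelling’s T-squared statistic, and the printed loadings show exactly that for PC2 (positive on the sprints for women, negative for men). One possible way to handle this, sketched here in Python, is to flip any men's loading vector whose dot product with the women's vector is negative before computing scores. The helper name align_sign is hypothetical, and this sketch covers only the sign alignment, not the test itself.

```python
# First two loading vectors, copied from parts (b) and (e).
women = {
    "PC1": [-0.4137534, -0.4202377, -0.4061645, -0.4226394, -0.4067668, -0.3783590],
    "PC2": [0.3846299, 0.3824032, 0.3929376, -0.2858530, -0.3159300, -0.6081973],
}
men = {
    "PC1": [-0.4013958, -0.4180798, -0.4035829, -0.4104405, -0.4208110, -0.3945481],
    "PC2": [-0.5138807, -0.3917657, -0.2881765, 0.3266872, 0.3494365, 0.5201637],
}

def align_sign(reference, candidate):
    """Flip candidate if it points away from reference (negative dot product)."""
    dot = sum(r * c for r, c in zip(reference, candidate))
    return [-c for c in candidate] if dot < 0 else list(candidate)

men_aligned = {pc: align_sign(women[pc], men[pc]) for pc in women}
flipped = [pc for pc in men if men_aligned[pc] != men[pc]]
print("Components flipped for the men's data:", flipped)
```

Scores for the test would then be computed from the aligned loadings, so that a given score means the same thing in both groups.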

g. Factor analysis of National Track Records for Women.


Call:
factanal(x = df.w, factors = 2, n.obs = 54, rotation = "none",     method = "mle")

Uniquenesses:
    100m     200m     400m     800m    1500m marathon 
   0.106    0.005    0.160    0.006    0.167    0.264 

Loadings:
         Factor1 Factor2
100m      0.926  -0.192 
200m      0.962  -0.263 
400m      0.905  -0.145 
800m      0.942   0.327 
1500m     0.889   0.209 
marathon  0.795   0.323 

               Factor1 Factor2
SS loadings      4.910   0.382
Proportion Var   0.818   0.064
Cumulative Var   0.818   0.882

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 6.41 on 4 degrees of freedom.
The p-value is 0.171 

Call:
factanal(factors = 2, covmat = w.rm, n.obs = 54, rotation = "varimax",     method = "mle")

Uniquenesses:
    100m     200m     400m     800m    1500m marathon 
   0.106    0.005    0.160    0.006    0.167    0.264 

Loadings:
         Factor1 Factor2
100m     0.824   0.464  
200m     0.898   0.433  
400m     0.778   0.485  
800m     0.495   0.865  
1500m    0.533   0.741  
marathon 0.387   0.766  

               Factor1 Factor2
SS loadings      2.770   2.522
Proportion Var   0.462   0.420
Cumulative Var   0.462   0.882

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 6.41 on 4 degrees of freedom.
The p-value is 0.171 

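Both fits return identical uniquenesses, the same cumulative proportion of variance explained (88.2%), and the same test of fit (chi-square 6.41 on 4 degrees of freedom, p-value 0.171), even though the first call was given the data (x = df.w) and the second a matrix (covmat = w.rm). For these fits, then, the form of the input made no difference to the fitted model. What does change how the variables load on the factors is the rotation: an orthogonal rotation such as varimax leaves each variable’s communality unchanged and only redistributes it between the factors. Without rotation, Factor1 loads heavily on every event (0.795 to 0.962) and Factor2 contrasts the 100m to 400m (negative loadings) with the 800m to marathon (positive loadings). With varimax, Factor1 loads most heavily on the 100m to 400m and Factor2 on the 800m to marathon, which reads more naturally as a sprint factor and a distance factor.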
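
The identical uniquenesses in the two fits can be checked directly from the printed loadings: a variable's communality is the sum of its squared loadings, and it should equal one minus the uniqueness under either rotation. A small Python sketch using only the rounded values printed above (so agreement is to about three decimals):

```python
events = ["100m", "200m", "400m", "800m", "1500m", "marathon"]
uniqueness = [0.106, 0.005, 0.160, 0.006, 0.167, 0.264]

# (Factor1, Factor2) loadings copied from the two factanal fits above.
unrotated = [(0.926, -0.192), (0.962, -0.263), (0.905, -0.145),
             (0.942, 0.327), (0.889, 0.209), (0.795, 0.323)]
varimax = [(0.824, 0.464), (0.898, 0.433), (0.778, 0.485),
           (0.495, 0.865), (0.533, 0.741), (0.387, 0.766)]

def communalities(loadings):
    """Communality of each variable: sum of its squared factor loadings."""
    return [f1 ** 2 + f2 ** 2 for f1, f2 in loadings]

h2_unrotated = communalities(unrotated)
h2_varimax = communalities(varimax)

for event, u, a, b in zip(events, uniqueness, h2_unrotated, h2_varimax):
    print(f"{event:>8}  1-u = {1 - u:.3f}  unrotated = {a:.3f}  varimax = {b:.3f}")
```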

h. Factor analysis of National Track Records for Men.


Call:
factanal(x = df.m, factors = 2, n.obs = 54, rotation = "none",     method = "mle")

Uniquenesses:
    100m     200m     400m     800m    1500m marathon 
   0.154    0.009    0.250    0.164    0.027    0.208 

Loadings:
         Factor1 Factor2
100m      0.914  -0.102 
200m      0.983  -0.158 
400m      0.865         
800m      0.859   0.312 
1500m     0.881   0.445 
marathon  0.797   0.396 

               Factor1 Factor2
SS loadings      4.700   0.488
Proportion Var   0.783   0.081
Cumulative Var   0.783   0.865

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 5.5 on 4 degrees of freedom.
The p-value is 0.239 

Call:
factanal(factors = 2, covmat = m.rm, n.obs = 54, rotation = "varimax",     method = "mle")

Uniquenesses:
    100m     200m     400m     800m    1500m marathon 
   0.154    0.009    0.250    0.164    0.027    0.208 

Loadings:
         Factor1 Factor2
100m     0.806   0.443  
200m     0.895   0.436  
400m     0.691   0.522  
800m     0.523   0.750  
1500m    0.464   0.870  
marathon 0.424   0.782  

               Factor1 Factor2
SS loadings      2.598   2.589
Proportion Var   0.433   0.432
Cumulative Var   0.433   0.865

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 5.5 on 4 degrees of freedom.
The p-value is 0.239 

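Under the varimax rotation the factor loadings are similar for men and women: in both fits Factor1 loads most heavily on the 100m to 400m and Factor2 on the 800m to marathon. The two factors also account for similar proportions of the variance in national track records (88.2% for women and 86.5% for men), and in both cases the hypothesis that two factors are sufficient is not rejected (p-values 0.171 and 0.239).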
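
One way to put a number on "similar" is Tucker's congruence coefficient between matching loading vectors, which equals 1 when two vectors are proportional. This was not part of the original analysis; it is offered here only as an illustrative Python sketch using the varimax loadings printed in parts (g) and (h).

```python
import math

# Varimax loadings copied from the women's (part g) and men's (part h) fits.
women = {"Factor1": [0.824, 0.898, 0.778, 0.495, 0.533, 0.387],
         "Factor2": [0.464, 0.433, 0.485, 0.865, 0.741, 0.766]}
men = {"Factor1": [0.806, 0.895, 0.691, 0.523, 0.464, 0.424],
       "Factor2": [0.443, 0.436, 0.522, 0.750, 0.870, 0.782]}

def congruence(x, y):
    """Tucker's congruence coefficient between two loading vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

phi = {factor: congruence(women[factor], men[factor]) for factor in women}
for factor, value in phi.items():
    print(f"{factor}: congruence = {value:.3f}")
```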

i. Converting Women’s National Track Records to Average Speed (m/s)

Units: avg. m/s
country     s100     s200     s400     s800    s1500     smar
ARG     8.643042 8.718396 7.619048 6.504065 5.882353 4.678353
AUS     8.992806 8.996851 8.225375 6.734007 6.218905 4.900355
AUT     8.968610 8.810573 7.902015 6.872852 6.172840 4.556203
BEL     8.976661 8.896797 7.774538 6.768190 6.127451 4.916113
BER     8.726003 8.676790 7.504690 6.441224 5.827506 4.037490
BRA     8.952552 8.849557 7.902015 6.768190 5.995204 4.770708
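
The conversion behind this table is distance divided by time in seconds, with the 800m, 1500m and marathon records first converted from minutes. A minimal Python sketch follows. The original record times are not shown in this report, so the ARG times below are an assumption, back-calculated from the speeds in the table; the function and constant names are illustrative.

```python
# Distances in meters; the marathon length (42195 m) is given in the exercise.
DISTANCE_M = {"100m": 100, "200m": 200, "400m": 400,
              "800m": 800, "1500m": 1500, "marathon": 42195}
# Events whose records are reported in minutes rather than seconds.
IN_MINUTES = {"800m", "1500m", "marathon"}

def to_speed(event, record):
    """Convert a record time to average speed in meters per second."""
    seconds = record * 60 if event in IN_MINUTES else record
    return DISTANCE_M[event] / seconds

# ASSUMED record times for ARG, back-calculated from the speeds in the table
# above (seconds for the sprints, minutes for 800m and longer).
arg_times = {"100m": 11.57, "200m": 22.94, "400m": 52.5,
             "800m": 2.05, "1500m": 4.25, "marathon": 150.32}

arg_speeds = {event: to_speed(event, t) for event, t in arg_times.items()}
for event, speed in arg_speeds.items():
    print(f"{event:>8}: {speed:.6f} m/s")
```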
Standard deviations (1, .., p=6):
[1] 2.2394797 0.7372135 0.4298389 0.3581696 0.2840273 0.2180105

Rotation (n x k) = (6 x 6):
            PC1        PC2         PC3        PC4        PC5         PC6
s100  0.4110369 -0.3985261  0.19870602 -0.5452687  0.3871234  0.43076577
s200  0.4194769 -0.3817117  0.11579002 -0.1503919 -0.2699084 -0.75462556
s400  0.4059147 -0.3822980 -0.55004814  0.4729653 -0.2383517  0.32560833
s800  0.4213789  0.2724519  0.07729831  0.4566153  0.6932049 -0.23066019
s1500 0.4105739  0.3252716  0.61695997  0.1892135 -0.4742597  0.29028244
smar  0.3797236  0.6076921 -0.50787892 -0.4605094 -0.1225483 -0.03863049

          100m       200m       400m       800m      1500m   marathon
PC1  0.4110369  0.4194769  0.4059147  0.4213789  0.4105739  0.3797236
PC2 -0.3985261 -0.3817117 -0.3822980  0.2724519  0.3252716  0.6076921

                           PC1       PC2
Standard deviation     2.23948 0.7372135
Proportion of Variance 0.83588 0.0905800
Cumulative Proportion  0.83588 0.9264600

j. Repeating (i) with Men’s Track Records

Units: avg. m/s
country        s100      s200      s400      s800     s1500      smar
Argentina  9.775171  9.818360  8.661758  7.532957  6.793478  5.427568
Australia 10.070493  9.970090  9.013069  7.662835  7.082153  5.515254
Austria    9.852217  9.779951  8.733624  7.532957  6.983240  5.318787
Belgium    9.861933  9.905894  8.884940  7.707129  7.002801  5.528695
Bermuda    9.737098  9.852217  8.837826  7.448790  6.756757  4.804605
Brazil    10.000000 10.055304  9.031384  7.843137  7.002801  5.579135
Standard deviations (1, .., p=6):
[1] 2.2135294 0.7218360 0.4710989 0.4387464 0.3252378 0.2429571

Rotation (n x k) = (6 x 6):
            PC1        PC2         PC3         PC4        PC5         PC6
s100  0.4013532  0.5044835 -0.48337352  0.03846601  0.3388272 -0.48422984
s200  0.4180942  0.3929087 -0.16824214  0.17014061 -0.5665582  0.54090531
s400  0.4026901  0.3010357  0.73492220 -0.43653013  0.1240380 -0.03411095
s800  0.4111819 -0.3304516  0.30737166  0.63871947 -0.2199149 -0.41343524
s1500 0.4217625 -0.3523522 -0.08502157  0.14181362  0.6319581  0.52082007
smar  0.3936996 -0.5168620 -0.31020634 -0.59240216 -0.3179448 -0.17203808

          100m       200m       400m       800m      1500m   marathon
PC1  0.4013532  0.4180942  0.4026901  0.4111819  0.4217625  0.3936996
PC2  0.5044835  0.3929087  0.3010357 -0.3304516 -0.3523522 -0.5168620

                            PC1      PC2
Standard deviation     2.213529 0.721836
Proportion of Variance 0.816620 0.086840
Cumulative Proportion  0.816620 0.903460

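For both parts (i) and (j), the principal components are loaded similarly to those from the time-based analysis, apart from the signs. The first principal component weights every variable almost equally and may capture a latent national talent variable; because its loadings are positive and the variables are speeds, a higher PC1 score corresponds to a faster nation. The second principal component still splits in sign between the short distances (100m to 400m) and the long distances (800m to marathon). The first two components explain 92.6% of the variance for women and 90.3% for men, nearly the same as with the times (92.4% and 90.8%). I prefer the speed analysis because all of the variables share the same unit of measurement; it does not seem to provide much other benefit in terms of the analysis.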
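
One check worth making on the speed-based output: the exercise asks for a principal components analysis of the covariance matrix S, but squared component standard deviations that sum to the number of variables (6) would be consistent with an analysis of standardized speeds instead. The Python sketch below runs that check on the printed standard deviations; it is a diagnostic under that reading, not a statement of how the original analysis was run.

```python
# Component standard deviations copied from the speed-based output above.
sdev_women = [2.2394797, 0.7372135, 0.4298389, 0.3581696, 0.2840273, 0.2180105]
sdev_men = [2.2135294, 0.7218360, 0.4710989, 0.4387464, 0.3252378, 0.2429571]

def total_variance(sdev):
    """Sum of component variances, i.e. the trace of the analysed matrix."""
    return sum(s ** 2 for s in sdev)

for label, sdev in [("women", sdev_women), ("men", sdev_men)]:
    total = total_variance(sdev)
    # A correlation matrix of 6 variables has trace exactly 6.
    looks_standardized = abs(total - len(sdev)) < 1e-3
    print(f"{label}: total variance = {total:.4f}, standardized? {looks_standardized}")
```

If the totals come out at 6, the comparison with parts (b)-(d) is between two standardized analyses, in which case the choice of units (seconds, minutes, or m/s) would not by itself change the scale of the components.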

k. Testing for differences in Population Principal components using the first two speed-based principal components for Men’s/Women’s National Track Records