1 The Datasets

The dataset is constructed from the national records for men and women at various track races from 100 meters to the marathon, as given in Belcham and Hymans (1984). This is not a well-known data set, yet it can be analyzed instructively using a variety of techniques appropriate to exploratory data analysis (EDA), ranging from a simple examination of the various one and two-dimensional marginals to more sophisticated methods, all of which lead to interesting results. I will consider here only the results of applying multivariate analysis. Such analyses are often an effective tool in the attempt to gain insight into high-dimensional data sets and, make no probabilistic assumptions, concentrating on what are purely geometric properties of the data.

For the analysis, only a subset of the whole data set as reported in Belcham and Hymans (1984) was used. There are \(55\) countries for which a complete set of records for the “flat” races in both men’s and women’s events are available. Thus the hurdles and steeplechase events, as well as the field events, are excluded from the main analysis. The full data set has many missing values, and handling these is an important and interesting problem, but for this work, I will restrict attention to the countries with a complete set of observations. The men’s and women’s sets will be treated separately. Thus the data matrix for the women was \(55 \times 7\), with the events represented being the \(100\) meters, \(200\) meters, \(400\) meters, \(800\) meters, \(1,500\) meters, \(3,000\) meters, and marathon. For the men the data matrix was \(55 \times 8\), differing from the women’s events in that the \(3,000\) meters was excluded but the \(5,000\) and \(10,000\) meters were included.

One can choose to first rescale the variables in each of the two data sets to have mean 0 and standard deviation 1, making them unit free, because all the variables are equally important, and hence should somehow be brought to an equal footing. This, in turn, amounts to using the spectral decomposition of the sample correlation matrix instead of the sample variance-covariance matrix (of the time taken by the athletes in the events) to obtain the principal components. Here the objection to the sample variance-covariance matrix as a choice for the analysis is based on the argument that if the raw data are to be analyzed using the same time unit, the variable represented by the time taken in the marathon, due to its larger amount of variability, would be weighted excessively in the analysis. These objections are well taken, and one would readily agree with these concerns. However, this choice of the correlation matrix as the focal point of analysis is suspect in that by forcing all the track records variables to have equal variance by such scaling, the purpose of partitioning the total variability and perhaps the very objective of identifying those variables that contribute more significantly to the total variability have been defeated.

How do we bring all the variables to an “equal footing” while still admitting the possibility that more variability across the nations may be found in certain specific track events? For this, one must use the variance-covariance matrix as the basic object for the analysis, but it should correspond to variables that represent characteristics common to all of the events. “Total time took” is certainly not such a variable. To compare the athletic performances of nations, the appropriate variables should be rate or speed related rather than the total time taken. A variable that may be more relevant in this context is the speed itself, defined as the “distance covered per unit time.” This variable succeeds in retaining the possibility of having different degrees of variability in different variables. I shall, therefore, use the speeds in the track events as the variables for the principal component analysis. The separate analyses will be performed for the data sets on women and men, respectively. These two sets (rounded to two digits after the decimal) are presented in Tables 1 and 2, respectively.

men = read.csv("/cloud/project/track_data_men.csv", row.names=1)
men[,1] = 100/men[,1]
men[,2] = 200/men[,2]
men[,3] = 400/men[,3]
men[,4] = 800/(60*men[,4])
men[,5] = 1500/(60*men[,5])
men[,6] = 5000/(60*men[,6])
men[,7] = 10000/(60*men[,7])
men[,8] = 42195/(60*men[,8])
men = round(men, 2)
colnames(men) = c('100m','200m','400m','800m','1500m','5000m','10000m','Marathon')
datatable(men, autoHideNavigation = TRUE, fillContainer = FALSE)
women = read.csv("/cloud/project/track_data_women.csv", row.names=1)
women[,1] = 100/women[,1]
women[,2] = 200/women[,2]
women[,3] = 400/women[,3]
women[,4] = 800/(60*women[,4])
women[,5] = 1500/(60*women[,5])
women[,6] = 3000/(60*women[,6])
women[,7] = 42195/(60*women[,7])
women = round(women, 2)
colnames(women) = c('100m','200m','400m','800m','1500m','3000m', 'Marathon')
datatable(women, autoHideNavigation = TRUE, fillContainer = FALSE)

2 Denisty Plots

mdensity1 = density(men[,1])
mdensity2 = density(men[,2])
mdensity3 = density(men[,3])
mdensity4 = density(men[,4])
mdensity5 = density(men[,5])
mdensity6 = density(men[,6])
mdensity7 = density(men[,7])
mdensity8 = density(men[,8])

wdensity1 = density(women[,1])
wdensity2 = density(women[,2])
wdensity3 = density(women[,3])
wdensity4 = density(women[,4])
wdensity5 = density(women[,5])
wdensity6 = density(women[,6])
wdensity7 = density(women[,7])
p1 = plot_ly(x = ~mdensity1$x, y = ~mdensity1$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = '100m Men') %>%
  add_trace(x = ~wdensity1$x, y = ~wdensity1$y, name = '100m Women', fill = 'tozeroy') %>% 
  layout(xaxis = list(title = 'Speed in 100m Track'),
         yaxis = list(title = 'Density')) 
p1
p2 = plot_ly(x = ~mdensity2$x, y = ~mdensity2$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = '200m Men') %>%
  add_trace(x = ~wdensity2$x, y = ~wdensity2$y, name = '200m Women', fill = 'tozeroy') %>%
  layout(xaxis = list(title = 'Speed in 200m Track'),
         yaxis = list(title = 'Density'))
p2
p3 = plot_ly(x = ~mdensity3$x, y = ~mdensity3$y, type = 'scatter', name = '400m Men', mode = 'none', fill = 'tozeroy')  %>%
  add_trace(x = ~wdensity3$x, y = ~wdensity3$y, name = '400m Women', fill = 'tozeroy') %>%
  layout(xaxis = list(title = 'Speed in 400m Track'),
         yaxis = list(title = 'Density')) 
p3
p4 = plot_ly(x = ~mdensity4$x, y = ~mdensity4$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = '800m Men') %>%
  add_trace(x = ~wdensity4$x, y = ~wdensity4$y, name = '800m Women', fill = 'tozeroy') %>%
  layout(xaxis = list(title = 'Speed in 800m Track'),
         yaxis = list(title = 'Density'))
p4
p5 = plot_ly(x = ~mdensity5$x, y = ~mdensity5$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = '1500m Men') %>%
  add_trace(x = ~wdensity5$x, y = ~wdensity5$y, name = '1500m Women', fill = 'tozeroy') %>%
  layout(xaxis = list(title = 'Speed in 1500m'),
         yaxis = list(title = 'Density')) 
p5
p6 = plot_ly(x = ~mdensity6$x, y = ~mdensity6$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = '5000m Men') %>%
  layout(xaxis = list(title = 'Speed in 5000m Track Men'),
         yaxis = list(title = 'Density')) 
p6
p7 = plot_ly(x = ~mdensity7$x, y = ~mdensity7$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = '10000m Men') %>%
  layout(xaxis = list(title = 'Speed in 10000m Track Men'),
         yaxis = list(title = 'Density')) 
p7
p8 = plot_ly(x = ~wdensity6$x, y = ~wdensity6$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = '3000m Women') %>%
  layout(xaxis = list(title = 'Speed in 3000m Track Women'),
         yaxis = list(title = 'Density')) 
p8
p9 = plot_ly(x = ~mdensity8$x, y = ~mdensity8$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = 'Marathon Men') %>%
  add_trace(x = ~wdensity7$x, y = ~wdensity7$y, name = 'Marathon Women', fill = 'tozeroy') %>%
  layout(xaxis = list(title = 'Speed in Marathon Track'),
         yaxis = list(title = 'Density')) 
p9
subplot(p1,p2,p3,p4,p5,p6,p7,p8,p9)

Few things are revealed from the density plots. These are

3 Sample Correlation Matrix Plots

cor_men = cor(men)
cor_women = cor(women)
p10 = plot_ly(x = rownames(cor_men), y = colnames(cor_men), z = cor_men, type = "heatmap")
p11 = plot_ly(x = rownames(cor_women), y = colnames(cor_women), z = cor_women, type = "heatmap")
p10
p11

4 Bartlett’s Test Of Sphericity

Null hypothesis assumes the correlation matrix to be an identity matrix.

bart_spher(men)
##  Bartlett's Test of Sphericity
## 
## Call: bart_spher(x = men)
## 
##      X2 = 702.525
##      df = 28
## p-value < 2.22e-16
bart_spher(women)
##  Bartlett's Test of Sphericity
## 
## Call: bart_spher(x = women)
## 
##      X2 = 606.846
##      df = 21
## p-value < 2.22e-16

None of the correlation matrices is an identity matrix.

5 KMO Test

Henry Kaiser (1970) introduced an Measure of Sampling Adequacy (MSA) of data matrices. Kaiser and Rice (1974) then modified it. This is just a function of the squared elements of the ‘image’ matrix compared to the squares of the original correlations. The overall MSA as well as estimates for each item are found. The index is known as the Kaiser-Meyer-Olkin (KMO) index. In his delightfully flamboyant style, Kaiser (1975) suggested that KMO > 0.9 were marvelous, in the 0.8s, mertitourious, in the 0.7s, middling, in the 0.6s, medicore, in the 0.5s, miserable, and less than 0.5, unacceptable.

KMO(men)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = men)
## Overall MSA =  0.91
## MSA for each item = 
##     100m     200m     400m     800m    1500m    5000m   10000m Marathon 
##     0.84     0.87     0.96     0.95     0.94     0.92     0.87     0.93
KMO(women)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = women)
## Overall MSA =  0.84
## MSA for each item = 
##     100m     200m     400m     800m    1500m    3000m Marathon 
##     0.85     0.80     0.83     0.81     0.82     0.84     0.93

6 Principal Component Analysis

6.1 PCA for Men

V1 = cov(men)
ev1 = eigen(V1)
index1 = seq(1, length(ev1$values), 1)
aa = data.frame(index1, ev1$values)
p12 = plot_ly(data = aa, x = ~index1, y = ~ev1$values, name = 'Scree Plot for Men', type = 'scatter', mode = 'lines') %>% 
  layout(xaxis = list(title = 'Principal Components'),
         yaxis = list(title = 'Eigen Values'))
p12
princomp.track.men = prcomp(men, retx = TRUE)
summary(princomp.track.men)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     0.7518 0.2884 0.1111 0.09639 0.08226 0.07689 0.06495
## Proportion of Variance 0.8197 0.1206 0.0179 0.01347 0.00981 0.00857 0.00612
## Cumulative Proportion  0.8197 0.9404 0.9583 0.97172 0.98153 0.99011 0.99623
##                            PC8
## Standard deviation     0.05101
## Proportion of Variance 0.00377
## Cumulative Proportion  1.00000

94.0% of the total variability in the dataset is explained by the first two sample PCs. This is also evident from the position of the elbow shape in the scree plot. So we are going to consider first two sample PCS.

The coefficient of the variables in the first two sample PCs are

M = princomp.track.men$rotation
M[, c(1, 2)]
##                 PC1         PC2
## 100m     -0.3153454 -0.59955831
## 200m     -0.3251131 -0.47163652
## 400m     -0.3090385 -0.23213403
## 800m     -0.3123950 -0.05861613
## 1500m    -0.3417021  0.07902820
## 5000m    -0.4063680  0.29596157
## 10000m   -0.4186956  0.29760231
## Marathon -0.3802133  0.42232798

Several facts are revealed by PCA. These are:

  1. The uniformity of weights in the first principal component is really a reflection of the considerable amount of structure in the original data. The coefficients of the variables in PC1 are nearly equal. This means that PC1 considers all the variables to almost same extent which is also clear from the previous table. Using the values of PC1 we can actually sort of rank the individuals. So the analysis of the first PC results in ranking the individuals that are least plausible from the viewpoint of subjective judgements, and, in any case, this has a considerable intrinsic interest. This technique will be very useful when such judgements are not so easy to come by.

  2. The second PC can be interpreted as a measure of relative strength of a given nation at various distances. Hence the a value near to 0 of the second PC would indicate that the particular nation had achieved at about the same level in both long and short distances. Extreme values in either end indicate imbalance in achievement. Thus the second PC serves as a measure of differential achievement.

Let us try to look at the values of PC2 for all the individuals.

Mx = princomp.track.men$x
Mx[, 2]
##    argentina    australia      austria      belgium      bermuda       brazil 
## -0.117419763 -0.060367323  0.140273728  0.088983611 -0.557515352 -0.217869563 
##        burma       canada        chile        china     colombia       cookis 
##  0.064411222 -0.169462822 -0.070864959  0.106758597  0.177640813  0.474945898 
##        costa        czech      denmark       domrep      finland       france 
##  0.514976841 -0.021864896  0.127601896 -0.765363394  0.134192489 -0.178143788 
##          gdr          frg         gbni       greece    guatemala      hungary 
## -0.105888738 -0.142135548 -0.102960451 -0.214925122  0.383602579 -0.065312578 
##        india    indonesia      ireland       israel        italy        japan 
##  0.216243136 -0.175248800  0.293585524  0.190988963 -0.335127996  0.144978466 
##        kenya        korea     dprkorea   luxembourg     malaysia    mauritius 
##  0.181411820 -0.105226541  0.481269685 -0.111034425 -0.527459267  0.197227644 
##       mexico  netherlands           nz       norway          png  philippines 
##  0.267287839  0.234597981  0.291173843  0.331173993  0.058655817 -0.054057615 
##       poland     portugal      rumania    singapore        spain       sweden 
## -0.155662847  0.425589259  0.170143261 -0.516778940  0.152090915  0.004860714 
##  switzerland       taipei     thailand       turkey          usa         ussr 
##  0.057510844  0.004053239 -0.510206290  0.422426665 -0.376942136 -0.248016640 
##       wsamoa 
## -0.432801487

Since PC2 serves as a measure of differential achievement, extreme negative values will indicate better performances in shorter distances and extreme positive values will indicate better performances in longer distances. So, looking at the values of PC2 we can actually classify the individuals into three groups - one with better performances in shorter distances, one with even performances in both shorter and longer distances and one with better performances in longer distances. This classification is shown below.

levels(cut(Mx[, 2], 3))
## [1] "(-0.767,-0.339]" "(-0.339,0.0882]" "(0.0882,0.516]"
names(Mx[, 2][Mx[, 2] > -0.767 & Mx[, 2] <= -0.339])
## [1] "bermuda"   "domrep"    "malaysia"  "singapore" "thailand"  "usa"      
## [7] "wsamoa"
names(Mx[, 2][Mx[, 2] > -0.339 & Mx[, 2] <= 0.0882])
##  [1] "argentina"   "australia"   "brazil"      "burma"       "canada"     
##  [6] "chile"       "czech"       "france"      "gdr"         "frg"        
## [11] "gbni"        "greece"      "hungary"     "indonesia"   "italy"      
## [16] "korea"       "luxembourg"  "png"         "philippines" "poland"     
## [21] "sweden"      "switzerland" "taipei"      "ussr"
names(Mx[, 2][Mx[, 2] > 0.0882 & Mx[, 2] <= 0.516])
##  [1] "austria"     "belgium"     "china"       "colombia"    "cookis"     
##  [6] "costa"       "denmark"     "finland"     "guatemala"   "india"      
## [11] "ireland"     "israel"      "japan"       "kenya"       "dprkorea"   
## [16] "mauritius"   "mexico"      "netherlands" "nz"          "norway"     
## [21] "portugal"    "rumania"     "spain"       "turkey"

Now by plotting the PC1 vs PC2 let us see how good this grouping is:

PC_men = data.frame(Mx[, 1], Mx[, 2], cut(Mx[, 2], 3, labels = c('Dominant Performance in Short Distance Performance', 'Equivalent Performances in Both', 'Dominant Performance in Long Distance')))
colnames(PC_men) = c('PC1', 'PC2', 'Group')
p13 = plot_ly(data = PC_men, x = ~ PC1, y = ~PC2, color = ~Group, text = ~paste("Country: ", rownames(PC_men) ))
p13

6.2 PCA for Women

V2 = cov(women)
ev2 = eigen(V2)
index2 = seq(1, length(ev2$values), 1)
bb = data.frame(index2, ev2$values)
p14 = plot_ly(data = bb, x = ~index2, y = ~ev2$values, name = 'Scree Plot for Women', type = 'scatter', mode = 'lines') %>% 
  layout(xaxis = list(title = 'Principal Components'),
         yaxis = list(title = 'Eigen Values'))
p14
princomp.track.women = prcomp(women, retx = TRUE)
summary(princomp.track.women)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     0.9894 0.3136 0.23001 0.15263 0.08494 0.07216 0.05833
## Proportion of Variance 0.8372 0.0841 0.04525 0.01993 0.00617 0.00445 0.00291
## Cumulative Proportion  0.8372 0.9213 0.96654 0.98647 0.99264 0.99709 1.00000

92.13% of the total variability in the dataset is explained by the first two sample PCs. This is also evident from the position of the elbow shape in the scree plot. So we are going to consider first two sample PCS.

The coefficient of the variables in the first two sample PCs are

W = princomp.track.women$rotation
W[, c(1, 2)]
##                PC1          PC2
## 100m     0.2909154 -0.426435627
## 200m     0.3414473 -0.558102524
## 400m     0.3390391 -0.383489492
## 800m     0.3053161 -0.006696912
## 1500m    0.3857370  0.198885662
## 3000m    0.3999272  0.254144679
## Marathon 0.5309254  0.505391101

Several facts are revealed by PCA. These are:

  1. The uniformity of weights in the first principal component is really a reflection of the considerable amount of structure in the original data. The coefficients of the variables in PC1 are nearly equal. This means that PC1 considers all the variables to almost same extent which is also clear from the previous table. Using the values of PC1 we can actually sort of rank the individuals. So the analysis of the first PC results in ranking the individuals that are least plausible from the viewpoint of subjective judgements, and, in any case, this has a considerable intrinsic interest. This technique will be very useful when such judgements are not so easy to come by.

  2. The second PC can be interpreted as a measure of relative strength of a given nation at various distances. Hence the a value near to 0 of the second PC would indicate that the particular nation had achieved at about the same level in both long and short distances. Extreme values in either end indicate imbalance in achievement. Thus the second PC serves as a measure of differential achievement.

Let us try to look at the values of PC2 for all the individuals.

Wx = princomp.track.women$x
Wx[, 2]
##   argentina   australia     austria     belgium     bermuda      brazil 
## -0.26245939 -0.19187465 -0.15256028  0.03304336 -0.28972763 -0.29849297 
##       burma      canada       chile       china    colombia      cookis 
##  0.13240049 -0.23281774  0.37966038  0.36545522  0.10923410  0.41745950 
##       costa       czech     denmark      domrep     finland      france 
##  0.28962521 -0.58429058  0.32431335 -0.23934620 -0.25345101 -0.14659643 
##         gdr         frg        gbni      greece   guatemala     hungary 
## -0.63704447 -0.16173667 -0.20471643  0.01448173 -0.32532294  0.02597314 
##       india   indonesia     ireland      israel       italy       japan 
## -0.03590875 -0.18587910  0.33411349  0.16145977  0.15267934  0.42190027 
##       kenya       korea    dprkorea  luxembourg    malaysia   mauritius 
## -0.01103023  0.40609402  0.38483113  0.42038839  0.02309732 -0.46578552 
##      mexico netherlands          nz      norway         png philippines 
##  0.24757004  0.02569444  0.26528256  0.48529016 -0.20089492 -0.39556902 
##      poland    portugal     rumania   singapore       spain      sweden 
## -0.42039442  0.61345358  0.08348082  0.28644576  0.34723175 -0.04885176 
## switzerland      taipei    thailand      turkey         usa        ussr 
##  0.26179941 -0.52643650  0.13277297  0.09511220 -0.17498458 -0.18855441 
##      wsamoa 
## -0.60561728

Since PC2 serves as a measure of differential achievement, extreme negative values will indicate better performances in shorter distances and extreme positive values will indicate better performances in longer distances. So, looking at the values of PC2 we can actually classify the individuals into three groups - one with better performances in shorter distances, one with even performances in both shorter and longer distances and one with better performances in longer distances. This classification is shown below.

levels(cut(Wx[, 2], 3))
## [1] "(-0.638,-0.22]" "(-0.22,0.197]"  "(0.197,0.615]"
names(Wx[, 2][Wx[, 2] > -0.638 & Wx[, 2] <= -0.22])
##  [1] "argentina"   "bermuda"     "brazil"      "canada"      "czech"      
##  [6] "domrep"      "finland"     "gdr"         "guatemala"   "mauritius"  
## [11] "philippines" "poland"      "taipei"      "wsamoa"
names(Wx[, 2][Wx[, 2] > -0.22 & Wx[, 2] <= 0.197])
##  [1] "australia"   "austria"     "belgium"     "burma"       "colombia"   
##  [6] "france"      "frg"         "gbni"        "greece"      "hungary"    
## [11] "india"       "indonesia"   "israel"      "italy"       "kenya"      
## [16] "malaysia"    "netherlands" "png"         "rumania"     "sweden"     
## [21] "thailand"    "turkey"      "usa"         "ussr"
names(Wx[, 2][Wx[, 2] > 0.197 & Wx[, 2] <= 0.615])
##  [1] "chile"       "china"       "cookis"      "costa"       "denmark"    
##  [6] "ireland"     "japan"       "korea"       "dprkorea"    "luxembourg" 
## [11] "mexico"      "nz"          "norway"      "portugal"    "singapore"  
## [16] "spain"       "switzerland"

Now by plotting the PC1 vs PC2 let us see how good this grouping is:

PC_women = data.frame(Wx[, 1], Wx[, 2], cut(Wx[, 2], 3, labels = c('Dominant Performance in Short Distance Performance', 'Equivalent Performances in Both', 'Dominant Performance in Long Distance')))
colnames(PC_women) = c('PC1', 'PC2', 'Group')
p15 = plot_ly(data = PC_women, x = ~ PC1, y = ~PC2, color = ~Group, text = ~paste("Country: ", rownames(PC_women) ))
p15

7 Factor Analysis

7.1 FA for Men

We first factor analyze the sample covariance matrix by the iterative principal component method. The estimated loading matrix is shown below.

f1.men <- fa(r = men, nfactors = 2, rotate = 'none', scores = 'Thurstone', residuals = TRUE, fm = 'pa', covar = FALSE)
f1.men$loadings
## 
## Loadings:
##          PA1    PA2   
## 100m      0.803  0.521
## 200m      0.860  0.424
## 400m      0.898  0.207
## 800m      0.933       
## 1500m     0.953 -0.117
## 5000m     0.933 -0.295
## 10000m    0.943 -0.297
## Marathon  0.870 -0.381
## 
##                  PA1   PA2
## SS loadings    6.486 0.829
## Proportion Var 0.811 0.104
## Cumulative Var 0.811 0.914

The estimated communalities are obtained as: 0.9160849, 0.9190559, 0.8498386, 0.8711599, 0.9213064, 0.9578236, 0.9773479, 0.9015345. The specific variances are obtained as: 0.0839151, 0.0809441, 0.1501614, 0.1288401, 0.0786936, 0.0421764, 0.0226521, 0.0984655. The residual matrix comes cout to be the following.

f1.men$residual - diag(f1.men$residual)
##                 100m        200m        400m        800m       1500m
## 100m      0.00000000 -0.07184424 -0.08477023 -0.09633006 -0.09684761
## 200m     -0.06887325  0.00000000 -0.09295282 -0.09289630 -0.07894811
## 400m     -0.15101659 -0.16217017  0.00000000 -0.12287829 -0.15006664
## 800m     -0.14125505 -0.14079228 -0.10155693  0.00000000 -0.10657636
## 1500m    -0.09162619 -0.07669768 -0.07859885 -0.05642994  0.00000000
## 5000m    -0.03911968 -0.03432554 -0.05161649 -0.05137271 -0.04123162
## 10000m   -0.01281943 -0.02057460 -0.02897111 -0.03684212 -0.02487059
## Marathon -0.09459245 -0.09748370 -0.09846762 -0.10258663 -0.11107967
##                5000m      10000m    Marathon
## 100m     -0.08085835 -0.07408242 -0.08004198
## 200m     -0.07309322 -0.07886660 -0.07996225
## 400m     -0.15960152 -0.15648047 -0.15016351
## 800m     -0.13803638 -0.14303011 -0.13296115
## 1500m    -0.07774886 -0.08091215 -0.09130778
## 5000m     0.00000000 -0.03812993 -0.03853653
## 10000m   -0.01860561  0.00000000 -0.01384641
## Marathon -0.09482567 -0.08965987  0.00000000

Now we will make use of factor rotation. Let us try the previous thing with varimax rotation. The estimated loading matrix is shown below.

f2.men <- fa(r = men, nfactors = 2, rotate = 'varimax', scores = 'Thurstone', residuals = TRUE, fm = 'pa', covar = FALSE)
f2.men$loadings
## 
## Loadings:
##          PA1   PA2  
## 100m     0.275 0.917
## 200m     0.380 0.880
## 400m     0.551 0.739
## 800m     0.691 0.627
## 1500m    0.802 0.528
## 5000m    0.902 0.380
## 10000m   0.911 0.384
## Marathon 0.909 0.273
## 
##                  PA1   PA2
## SS loadings    4.114 3.200
## Proportion Var 0.514 0.400
## Cumulative Var 0.514 0.914

The estimates of communalities and specific variances will remain the same. Also the percentage of total variation explained by the two factors and the residual matrix will remain the same. Let us look at the loading plot in this case.

fa.diagram(f2.men)

The following things can be interpreted:

  • Focusing our attention on the iterative PC method and the cumulative proportion of the total sample variance explained which is 0.914, we see that a two-factor solution appears to be warranted.

  • The factor loadings change abruptly after the varimax rotation.

  • Before rotation, all the variables load highly on the first factor and the estimated loadings are about equal. This factor can be interpreted as a general running factor.

  • The second factor contrasts the shorter distances with the longer distances. (The shorter distance performances have positive estimated loadings and the longer distance performances have negative estimated loadings on this factor.) Thus, this factor seems to differentiate the performances in different distances and might be called a differentiating factor. Also the absolute value of the estimated loadings of this factor increases as we move to extreme running distances in both the directions.

  • After rotation, we see that the longer distance performances load highly on the first factor rather than the shorter distance performances. All the estimated loadings are positive for this factor. In this case the first factor can be interpreted as a factor affecting the longer distance performances.

  • The shorter distance performances load highly on the second factor rather than the longer distance performances. All the estimated loadings are positive for this factor. In this case the second factor can be interpreted as a factor affecting the shorter distance performances.

  • The off-diagonal elements of the residual matrix are close to zero. This supports the fact of retaining two factors.

7.2 FA for Women

We first factor analyze the sample covariance matrix by the iterative principal component method. The estimated loading matrix is shown below.

f1.women <- fa(r = women, nfactors = 2, rotate = 'none', scores = 'Thurstone', residuals = TRUE, fm = 'pa', covar = FALSE)
f1.women$loadings
## 
## Loadings:
##          PA1    PA2   
## 100m      0.888  0.346
## 200m      0.895  0.441
## 400m      0.903  0.125
## 800m      0.915 -0.156
## 1500m     0.945 -0.292
## 3000m     0.935 -0.273
## Marathon  0.860 -0.165
## 
##                  PA1   PA2
## SS loadings    5.751 0.542
## Proportion Var 0.822 0.077
## Cumulative Var 0.822 0.899

The estimated communalities are obtained as: 0.9090072, 0.9967272, 0.830982, 0.8623495, 0.9791393, 0.9484347, 0.7659236. The specific variances are obtained as: 0.0909928, 0.0032728, 0.169018, 0.1376505, 0.0208607, 0.0515653, 0.2340764. The residual matrix comes cout to be the following.

f1.women$residual - diag(f1.women$residual)
##                  100m        200m        400m        800m         1500m
## 100m      0.000000000 -0.08446987 -0.09953291 -0.12070567 -0.0866300178
## 200m      0.003250154  0.00000000 -0.00571118 -0.02535116 -0.0009517203
## 400m     -0.177558063 -0.17145636  0.00000000 -0.07668479 -0.1820736151
## 800m     -0.167363326 -0.15972885 -0.04531729  0.00000000 -0.1294677822
## 1500m    -0.016497854 -0.01853959 -0.03391630 -0.01267796  0.0000000000
## 3000m    -0.038137974 -0.04840133 -0.07997792 -0.07604424 -0.0520686672
## Marathon -0.221358588 -0.21789650 -0.27595560 -0.25973114 -0.2353518322
##                  3000m    Marathon
## 100m     -0.0775655540 -0.07827505
## 200m     -0.0001088836  0.01290707
## 400m     -0.1974306564 -0.21089722
## 800m     -0.1621294848 -0.16330526
## 1500m    -0.0213640840 -0.02213613
## 3000m     0.0000000000 -0.01233176
## Marathon -0.1948428806  0.00000000

Now we will make use of factor rotation. Let us try the previous thing with varimax rotation. The estimated loading matrix is shown below.

f2.women <- fa(r = women, nfactors = 2, rotate = 'varimax', scores = 'Thurstone', residuals = TRUE, fm = 'pa', covar = FALSE)
f2.women$loadings
## 
## Loadings:
##          PA1   PA2  
## 100m     0.439 0.846
## 200m     0.382 0.922
## 400m     0.597 0.689
## 800m     0.791 0.486
## 1500m    0.903 0.404
## 3000m    0.883 0.412
## Marathon 0.755 0.443
## 
##                  PA1   PA2
## SS loadings    3.486 2.806
## Proportion Var 0.498 0.401
## Cumulative Var 0.498 0.899

The estimates of communalities and specific variances will remain the same. Also the percentage of total variation explained by the two factors and the residual matrix will remain the same. Let us look at the loading plot in this case.

fa.diagram(f2.women)

The interpretations are similar.