The dataset is constructed from the national records for men and women at various track races from 100 meters to the marathon, as given in Belcham and Hymans (1984). This is not a well-known data set, yet it can be analyzed instructively using a variety of techniques appropriate to exploratory data analysis (EDA), ranging from a simple examination of the various one and two-dimensional marginals to more sophisticated methods, all of which lead to interesting results. I will consider here only the results of applying multivariate analysis. Such analyses are often an effective tool in the attempt to gain insight into high-dimensional data sets and, make no probabilistic assumptions, concentrating on what are purely geometric properties of the data.
For the analysis, only a subset of the whole data set as reported in Belcham and Hymans (1984) was used. There are \(55\) countries for which a complete set of records for the “flat” races in both men’s and women’s events are available. Thus the hurdles and steeplechase events, as well as the field events, are excluded from the main analysis. The full data set has many missing values, and handling these is an important and interesting problem, but for this work, I will restrict attention to the countries with a complete set of observations. The men’s and women’s sets will be treated separately. Thus the data matrix for the women was \(55 \times 7\), with the events represented being the \(100\) meters, \(200\) meters, \(400\) meters, \(800\) meters, \(1,500\) meters, \(3,000\) meters, and marathon. For the men the data matrix was \(55 \times 8\), differing from the women’s events in that the \(3,000\) meters was excluded but the \(5,000\) and \(10,000\) meters were included.
One can choose to first rescale the variables in each of the two data sets to have mean 0 and standard deviation 1, making them unit free, because all the variables are equally important, and hence should somehow be brought to an equal footing. This, in turn, amounts to using the spectral decomposition of the sample correlation matrix instead of the sample variance-covariance matrix (of the time taken by the athletes in the events) to obtain the principal components. Here the objection to the sample variance-covariance matrix as a choice for the analysis is based on the argument that if the raw data are to be analyzed using the same time unit, the variable represented by the time taken in the marathon, due to its larger amount of variability, would be weighted excessively in the analysis. These objections are well taken, and one would readily agree with these concerns. However, this choice of the correlation matrix as the focal point of analysis is suspect in that by forcing all the track records variables to have equal variance by such scaling, the purpose of partitioning the total variability and perhaps the very objective of identifying those variables that contribute more significantly to the total variability have been defeated.
How do we bring all the variables to an “equal footing” while still admitting the possibility that more variability across the nations may be found in certain specific track events? For this, one must use the variance-covariance matrix as the basic object for the analysis, but it should correspond to variables that represent characteristics common to all of the events. “Total time took” is certainly not such a variable. To compare the athletic performances of nations, the appropriate variables should be rate or speed related rather than the total time taken. A variable that may be more relevant in this context is the speed itself, defined as the “distance covered per unit time.” This variable succeeds in retaining the possibility of having different degrees of variability in different variables. I shall, therefore, use the speeds in the track events as the variables for the principal component analysis. The separate analyses will be performed for the data sets on women and men, respectively. These two sets (rounded to two digits after the decimal) are presented in Tables 1 and 2, respectively.
men = read.csv("/cloud/project/track_data_men.csv", row.names=1)
men[,1] = 100/men[,1]
men[,2] = 200/men[,2]
men[,3] = 400/men[,3]
men[,4] = 800/(60*men[,4])
men[,5] = 1500/(60*men[,5])
men[,6] = 5000/(60*men[,6])
men[,7] = 10000/(60*men[,7])
men[,8] = 42195/(60*men[,8])
men = round(men, 2)
colnames(men) = c('100m','200m','400m','800m','1500m','5000m','10000m','Marathon')
datatable(men, autoHideNavigation = TRUE, fillContainer = FALSE)
women = read.csv("/cloud/project/track_data_women.csv", row.names=1)
women[,1] = 100/women[,1]
women[,2] = 200/women[,2]
women[,3] = 400/women[,3]
women[,4] = 800/(60*women[,4])
women[,5] = 1500/(60*women[,5])
women[,6] = 3000/(60*women[,6])
women[,7] = 42195/(60*women[,7])
women = round(women, 2)
colnames(women) = c('100m','200m','400m','800m','1500m','3000m', 'Marathon')
datatable(women, autoHideNavigation = TRUE, fillContainer = FALSE)
mdensity1 = density(men[,1])
mdensity2 = density(men[,2])
mdensity3 = density(men[,3])
mdensity4 = density(men[,4])
mdensity5 = density(men[,5])
mdensity6 = density(men[,6])
mdensity7 = density(men[,7])
mdensity8 = density(men[,8])
wdensity1 = density(women[,1])
wdensity2 = density(women[,2])
wdensity3 = density(women[,3])
wdensity4 = density(women[,4])
wdensity5 = density(women[,5])
wdensity6 = density(women[,6])
wdensity7 = density(women[,7])
p1 = plot_ly(x = ~mdensity1$x, y = ~mdensity1$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = '100m Men') %>%
add_trace(x = ~wdensity1$x, y = ~wdensity1$y, name = '100m Women', fill = 'tozeroy') %>%
layout(xaxis = list(title = 'Speed in 100m Track'),
yaxis = list(title = 'Density'))
p1
p2 = plot_ly(x = ~mdensity2$x, y = ~mdensity2$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = '200m Men') %>%
add_trace(x = ~wdensity2$x, y = ~wdensity2$y, name = '200m Women', fill = 'tozeroy') %>%
layout(xaxis = list(title = 'Speed in 200m Track'),
yaxis = list(title = 'Density'))
p2
p3 = plot_ly(x = ~mdensity3$x, y = ~mdensity3$y, type = 'scatter', name = '400m Men', mode = 'none', fill = 'tozeroy') %>%
add_trace(x = ~wdensity3$x, y = ~wdensity3$y, name = '400m Women', fill = 'tozeroy') %>%
layout(xaxis = list(title = 'Speed in 400m Track'),
yaxis = list(title = 'Density'))
p3
p4 = plot_ly(x = ~mdensity4$x, y = ~mdensity4$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = '800m Men') %>%
add_trace(x = ~wdensity4$x, y = ~wdensity4$y, name = '800m Women', fill = 'tozeroy') %>%
layout(xaxis = list(title = 'Speed in 800m Track'),
yaxis = list(title = 'Density'))
p4
p5 = plot_ly(x = ~mdensity5$x, y = ~mdensity5$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = '1500m Men') %>%
add_trace(x = ~wdensity5$x, y = ~wdensity5$y, name = '1500m Women', fill = 'tozeroy') %>%
layout(xaxis = list(title = 'Speed in 1500m'),
yaxis = list(title = 'Density'))
p5
p6 = plot_ly(x = ~mdensity6$x, y = ~mdensity6$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = '5000m Men') %>%
layout(xaxis = list(title = 'Speed in 5000m Track Men'),
yaxis = list(title = 'Density'))
p6
p7 = plot_ly(x = ~mdensity7$x, y = ~mdensity7$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = '10000m Men') %>%
layout(xaxis = list(title = 'Speed in 10000m Track Men'),
yaxis = list(title = 'Density'))
p7
p8 = plot_ly(x = ~wdensity6$x, y = ~wdensity6$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = '3000m Women') %>%
layout(xaxis = list(title = 'Speed in 3000m Track Women'),
yaxis = list(title = 'Density'))
p8
p9 = plot_ly(x = ~mdensity8$x, y = ~mdensity8$y, type = 'scatter', mode = 'none', fill = 'tozeroy', name = 'Marathon Men') %>%
add_trace(x = ~wdensity7$x, y = ~wdensity7$y, name = 'Marathon Women', fill = 'tozeroy') %>%
layout(xaxis = list(title = 'Speed in Marathon Track'),
yaxis = list(title = 'Density'))
p9
subplot(p1,p2,p3,p4,p5,p6,p7,p8,p9)
Few things are revealed from the density plots. These are
Both for men and women none of the speeds is normally distributed.
Speeds are usually higher for men.
Density plots for men show the mixture of different populations in the dataset because of the multimodal shapes.
The multimodal nature in not so prominent in case of women dataset.
cor_men = cor(men)
cor_women = cor(women)
p10 = plot_ly(x = rownames(cor_men), y = colnames(cor_men), z = cor_men, type = "heatmap")
p11 = plot_ly(x = rownames(cor_women), y = colnames(cor_women), z = cor_women, type = "heatmap")
p10
p11
Null hypothesis assumes the correlation matrix to be an identity matrix.
bart_spher(men)
## Bartlett's Test of Sphericity
##
## Call: bart_spher(x = men)
##
## X2 = 702.525
## df = 28
## p-value < 2.22e-16
bart_spher(women)
## Bartlett's Test of Sphericity
##
## Call: bart_spher(x = women)
##
## X2 = 606.846
## df = 21
## p-value < 2.22e-16
None of the correlation matrices is an identity matrix.
Henry Kaiser (1970) introduced an Measure of Sampling Adequacy (MSA) of data matrices. Kaiser and Rice (1974) then modified it. This is just a function of the squared elements of the ‘image’ matrix compared to the squares of the original correlations. The overall MSA as well as estimates for each item are found. The index is known as the Kaiser-Meyer-Olkin (KMO) index. In his delightfully flamboyant style, Kaiser (1975) suggested that KMO > 0.9 were marvelous, in the 0.8s, mertitourious, in the 0.7s, middling, in the 0.6s, medicore, in the 0.5s, miserable, and less than 0.5, unacceptable.
KMO(men)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = men)
## Overall MSA = 0.91
## MSA for each item =
## 100m 200m 400m 800m 1500m 5000m 10000m Marathon
## 0.84 0.87 0.96 0.95 0.94 0.92 0.87 0.93
KMO(women)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = women)
## Overall MSA = 0.84
## MSA for each item =
## 100m 200m 400m 800m 1500m 3000m Marathon
## 0.85 0.80 0.83 0.81 0.82 0.84 0.93
V1 = cov(men)
ev1 = eigen(V1)
index1 = seq(1, length(ev1$values), 1)
aa = data.frame(index1, ev1$values)
p12 = plot_ly(data = aa, x = ~index1, y = ~ev1$values, name = 'Scree Plot for Men', type = 'scatter', mode = 'lines') %>%
layout(xaxis = list(title = 'Principal Components'),
yaxis = list(title = 'Eigen Values'))
p12
princomp.track.men = prcomp(men, retx = TRUE)
summary(princomp.track.men)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 0.7518 0.2884 0.1111 0.09639 0.08226 0.07689 0.06495
## Proportion of Variance 0.8197 0.1206 0.0179 0.01347 0.00981 0.00857 0.00612
## Cumulative Proportion 0.8197 0.9404 0.9583 0.97172 0.98153 0.99011 0.99623
## PC8
## Standard deviation 0.05101
## Proportion of Variance 0.00377
## Cumulative Proportion 1.00000
94.0% of the total variability in the dataset is explained by the first two sample PCs. This is also evident from the position of the elbow shape in the scree plot. So we are going to consider first two sample PCS.
The coefficient of the variables in the first two sample PCs are
M = princomp.track.men$rotation
M[, c(1, 2)]
## PC1 PC2
## 100m -0.3153454 -0.59955831
## 200m -0.3251131 -0.47163652
## 400m -0.3090385 -0.23213403
## 800m -0.3123950 -0.05861613
## 1500m -0.3417021 0.07902820
## 5000m -0.4063680 0.29596157
## 10000m -0.4186956 0.29760231
## Marathon -0.3802133 0.42232798
Several facts are revealed by PCA. These are:
The uniformity of weights in the first principal component is really a reflection of the considerable amount of structure in the original data. The coefficients of the variables in PC1 are nearly equal. This means that PC1 considers all the variables to almost same extent which is also clear from the previous table. Using the values of PC1 we can actually sort of rank the individuals. So the analysis of the first PC results in ranking the individuals that are least plausible from the viewpoint of subjective judgements, and, in any case, this has a considerable intrinsic interest. This technique will be very useful when such judgements are not so easy to come by.
The second PC can be interpreted as a measure of relative strength of a given nation at various distances. Hence the a value near to 0 of the second PC would indicate that the particular nation had achieved at about the same level in both long and short distances. Extreme values in either end indicate imbalance in achievement. Thus the second PC serves as a measure of differential achievement.
Let us try to look at the values of PC2 for all the individuals.
Mx = princomp.track.men$x
Mx[, 2]
## argentina australia austria belgium bermuda brazil
## -0.117419763 -0.060367323 0.140273728 0.088983611 -0.557515352 -0.217869563
## burma canada chile china colombia cookis
## 0.064411222 -0.169462822 -0.070864959 0.106758597 0.177640813 0.474945898
## costa czech denmark domrep finland france
## 0.514976841 -0.021864896 0.127601896 -0.765363394 0.134192489 -0.178143788
## gdr frg gbni greece guatemala hungary
## -0.105888738 -0.142135548 -0.102960451 -0.214925122 0.383602579 -0.065312578
## india indonesia ireland israel italy japan
## 0.216243136 -0.175248800 0.293585524 0.190988963 -0.335127996 0.144978466
## kenya korea dprkorea luxembourg malaysia mauritius
## 0.181411820 -0.105226541 0.481269685 -0.111034425 -0.527459267 0.197227644
## mexico netherlands nz norway png philippines
## 0.267287839 0.234597981 0.291173843 0.331173993 0.058655817 -0.054057615
## poland portugal rumania singapore spain sweden
## -0.155662847 0.425589259 0.170143261 -0.516778940 0.152090915 0.004860714
## switzerland taipei thailand turkey usa ussr
## 0.057510844 0.004053239 -0.510206290 0.422426665 -0.376942136 -0.248016640
## wsamoa
## -0.432801487
Since PC2 serves as a measure of differential achievement, extreme negative values will indicate better performances in shorter distances and extreme positive values will indicate better performances in longer distances. So, looking at the values of PC2 we can actually classify the individuals into three groups - one with better performances in shorter distances, one with even performances in both shorter and longer distances and one with better performances in longer distances. This classification is shown below.
levels(cut(Mx[, 2], 3))
## [1] "(-0.767,-0.339]" "(-0.339,0.0882]" "(0.0882,0.516]"
names(Mx[, 2][Mx[, 2] > -0.767 & Mx[, 2] <= -0.339])
## [1] "bermuda" "domrep" "malaysia" "singapore" "thailand" "usa"
## [7] "wsamoa"
names(Mx[, 2][Mx[, 2] > -0.339 & Mx[, 2] <= 0.0882])
## [1] "argentina" "australia" "brazil" "burma" "canada"
## [6] "chile" "czech" "france" "gdr" "frg"
## [11] "gbni" "greece" "hungary" "indonesia" "italy"
## [16] "korea" "luxembourg" "png" "philippines" "poland"
## [21] "sweden" "switzerland" "taipei" "ussr"
names(Mx[, 2][Mx[, 2] > 0.0882 & Mx[, 2] <= 0.516])
## [1] "austria" "belgium" "china" "colombia" "cookis"
## [6] "costa" "denmark" "finland" "guatemala" "india"
## [11] "ireland" "israel" "japan" "kenya" "dprkorea"
## [16] "mauritius" "mexico" "netherlands" "nz" "norway"
## [21] "portugal" "rumania" "spain" "turkey"
Now by plotting the PC1 vs PC2 let us see how good this grouping is:
PC_men = data.frame(Mx[, 1], Mx[, 2], cut(Mx[, 2], 3, labels = c('Dominant Performance in Short Distance Performance', 'Equivalent Performances in Both', 'Dominant Performance in Long Distance')))
colnames(PC_men) = c('PC1', 'PC2', 'Group')
p13 = plot_ly(data = PC_men, x = ~ PC1, y = ~PC2, color = ~Group, text = ~paste("Country: ", rownames(PC_men) ))
p13
V2 = cov(women)
ev2 = eigen(V2)
index2 = seq(1, length(ev2$values), 1)
bb = data.frame(index2, ev2$values)
p14 = plot_ly(data = bb, x = ~index2, y = ~ev2$values, name = 'Scree Plot for Women', type = 'scatter', mode = 'lines') %>%
layout(xaxis = list(title = 'Principal Components'),
yaxis = list(title = 'Eigen Values'))
p14
princomp.track.women = prcomp(women, retx = TRUE)
summary(princomp.track.women)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 0.9894 0.3136 0.23001 0.15263 0.08494 0.07216 0.05833
## Proportion of Variance 0.8372 0.0841 0.04525 0.01993 0.00617 0.00445 0.00291
## Cumulative Proportion 0.8372 0.9213 0.96654 0.98647 0.99264 0.99709 1.00000
92.13% of the total variability in the dataset is explained by the first two sample PCs. This is also evident from the position of the elbow shape in the scree plot. So we are going to consider first two sample PCS.
The coefficient of the variables in the first two sample PCs are
W = princomp.track.women$rotation
W[, c(1, 2)]
## PC1 PC2
## 100m 0.2909154 -0.426435627
## 200m 0.3414473 -0.558102524
## 400m 0.3390391 -0.383489492
## 800m 0.3053161 -0.006696912
## 1500m 0.3857370 0.198885662
## 3000m 0.3999272 0.254144679
## Marathon 0.5309254 0.505391101
Several facts are revealed by PCA. These are:
The uniformity of weights in the first principal component is really a reflection of the considerable amount of structure in the original data. The coefficients of the variables in PC1 are nearly equal. This means that PC1 considers all the variables to almost same extent which is also clear from the previous table. Using the values of PC1 we can actually sort of rank the individuals. So the analysis of the first PC results in ranking the individuals that are least plausible from the viewpoint of subjective judgements, and, in any case, this has a considerable intrinsic interest. This technique will be very useful when such judgements are not so easy to come by.
The second PC can be interpreted as a measure of relative strength of a given nation at various distances. Hence the a value near to 0 of the second PC would indicate that the particular nation had achieved at about the same level in both long and short distances. Extreme values in either end indicate imbalance in achievement. Thus the second PC serves as a measure of differential achievement.
Let us try to look at the values of PC2 for all the individuals.
Wx = princomp.track.women$x
Wx[, 2]
## argentina australia austria belgium bermuda brazil
## -0.26245939 -0.19187465 -0.15256028 0.03304336 -0.28972763 -0.29849297
## burma canada chile china colombia cookis
## 0.13240049 -0.23281774 0.37966038 0.36545522 0.10923410 0.41745950
## costa czech denmark domrep finland france
## 0.28962521 -0.58429058 0.32431335 -0.23934620 -0.25345101 -0.14659643
## gdr frg gbni greece guatemala hungary
## -0.63704447 -0.16173667 -0.20471643 0.01448173 -0.32532294 0.02597314
## india indonesia ireland israel italy japan
## -0.03590875 -0.18587910 0.33411349 0.16145977 0.15267934 0.42190027
## kenya korea dprkorea luxembourg malaysia mauritius
## -0.01103023 0.40609402 0.38483113 0.42038839 0.02309732 -0.46578552
## mexico netherlands nz norway png philippines
## 0.24757004 0.02569444 0.26528256 0.48529016 -0.20089492 -0.39556902
## poland portugal rumania singapore spain sweden
## -0.42039442 0.61345358 0.08348082 0.28644576 0.34723175 -0.04885176
## switzerland taipei thailand turkey usa ussr
## 0.26179941 -0.52643650 0.13277297 0.09511220 -0.17498458 -0.18855441
## wsamoa
## -0.60561728
Since PC2 serves as a measure of differential achievement, extreme negative values will indicate better performances in shorter distances and extreme positive values will indicate better performances in longer distances. So, looking at the values of PC2 we can actually classify the individuals into three groups - one with better performances in shorter distances, one with even performances in both shorter and longer distances and one with better performances in longer distances. This classification is shown below.
levels(cut(Wx[, 2], 3))
## [1] "(-0.638,-0.22]" "(-0.22,0.197]" "(0.197,0.615]"
names(Wx[, 2][Wx[, 2] > -0.638 & Wx[, 2] <= -0.22])
## [1] "argentina" "bermuda" "brazil" "canada" "czech"
## [6] "domrep" "finland" "gdr" "guatemala" "mauritius"
## [11] "philippines" "poland" "taipei" "wsamoa"
names(Wx[, 2][Wx[, 2] > -0.22 & Wx[, 2] <= 0.197])
## [1] "australia" "austria" "belgium" "burma" "colombia"
## [6] "france" "frg" "gbni" "greece" "hungary"
## [11] "india" "indonesia" "israel" "italy" "kenya"
## [16] "malaysia" "netherlands" "png" "rumania" "sweden"
## [21] "thailand" "turkey" "usa" "ussr"
names(Wx[, 2][Wx[, 2] > 0.197 & Wx[, 2] <= 0.615])
## [1] "chile" "china" "cookis" "costa" "denmark"
## [6] "ireland" "japan" "korea" "dprkorea" "luxembourg"
## [11] "mexico" "nz" "norway" "portugal" "singapore"
## [16] "spain" "switzerland"
Now by plotting the PC1 vs PC2 let us see how good this grouping is:
PC_women = data.frame(Wx[, 1], Wx[, 2], cut(Wx[, 2], 3, labels = c('Dominant Performance in Short Distance Performance', 'Equivalent Performances in Both', 'Dominant Performance in Long Distance')))
colnames(PC_women) = c('PC1', 'PC2', 'Group')
p15 = plot_ly(data = PC_women, x = ~ PC1, y = ~PC2, color = ~Group, text = ~paste("Country: ", rownames(PC_women) ))
p15
We first factor analyze the sample covariance matrix by the iterative principal component method. The estimated loading matrix is shown below.
f1.men <- fa(r = men, nfactors = 2, rotate = 'none', scores = 'Thurstone', residuals = TRUE, fm = 'pa', covar = FALSE)
f1.men$loadings
##
## Loadings:
## PA1 PA2
## 100m 0.803 0.521
## 200m 0.860 0.424
## 400m 0.898 0.207
## 800m 0.933
## 1500m 0.953 -0.117
## 5000m 0.933 -0.295
## 10000m 0.943 -0.297
## Marathon 0.870 -0.381
##
## PA1 PA2
## SS loadings 6.486 0.829
## Proportion Var 0.811 0.104
## Cumulative Var 0.811 0.914
The estimated communalities are obtained as: 0.9160849, 0.9190559, 0.8498386, 0.8711599, 0.9213064, 0.9578236, 0.9773479, 0.9015345. The specific variances are obtained as: 0.0839151, 0.0809441, 0.1501614, 0.1288401, 0.0786936, 0.0421764, 0.0226521, 0.0984655. The residual matrix comes cout to be the following.
f1.men$residual - diag(f1.men$residual)
## 100m 200m 400m 800m 1500m
## 100m 0.00000000 -0.07184424 -0.08477023 -0.09633006 -0.09684761
## 200m -0.06887325 0.00000000 -0.09295282 -0.09289630 -0.07894811
## 400m -0.15101659 -0.16217017 0.00000000 -0.12287829 -0.15006664
## 800m -0.14125505 -0.14079228 -0.10155693 0.00000000 -0.10657636
## 1500m -0.09162619 -0.07669768 -0.07859885 -0.05642994 0.00000000
## 5000m -0.03911968 -0.03432554 -0.05161649 -0.05137271 -0.04123162
## 10000m -0.01281943 -0.02057460 -0.02897111 -0.03684212 -0.02487059
## Marathon -0.09459245 -0.09748370 -0.09846762 -0.10258663 -0.11107967
## 5000m 10000m Marathon
## 100m -0.08085835 -0.07408242 -0.08004198
## 200m -0.07309322 -0.07886660 -0.07996225
## 400m -0.15960152 -0.15648047 -0.15016351
## 800m -0.13803638 -0.14303011 -0.13296115
## 1500m -0.07774886 -0.08091215 -0.09130778
## 5000m 0.00000000 -0.03812993 -0.03853653
## 10000m -0.01860561 0.00000000 -0.01384641
## Marathon -0.09482567 -0.08965987 0.00000000
Now we will make use of factor rotation. Let us try the previous thing with varimax rotation. The estimated loading matrix is shown below.
f2.men <- fa(r = men, nfactors = 2, rotate = 'varimax', scores = 'Thurstone', residuals = TRUE, fm = 'pa', covar = FALSE)
f2.men$loadings
##
## Loadings:
## PA1 PA2
## 100m 0.275 0.917
## 200m 0.380 0.880
## 400m 0.551 0.739
## 800m 0.691 0.627
## 1500m 0.802 0.528
## 5000m 0.902 0.380
## 10000m 0.911 0.384
## Marathon 0.909 0.273
##
## PA1 PA2
## SS loadings 4.114 3.200
## Proportion Var 0.514 0.400
## Cumulative Var 0.514 0.914
The estimates of communalities and specific variances will remain the same. Also the percentage of total variation explained by the two factors and the residual matrix will remain the same. Let us look at the loading plot in this case.
fa.diagram(f2.men)
The following things can be interpreted:
Focusing our attention on the iterative PC method and the cumulative proportion of the total sample variance explained which is 0.914, we see that a two-factor solution appears to be warranted.
The factor loadings change abruptly after the varimax rotation.
Before rotation, all the variables load highly on the first factor and the estimated loadings are about equal. This factor can be interpreted as a general running factor.
The second factor contrasts the shorter distances with the longer distances. (The shorter distance performances have positive estimated loadings and the longer distance performances have negative estimated loadings on this factor.) Thus, this factor seems to differentiate the performances in different distances and might be called a differentiating factor. Also the absolute value of the estimated loadings of this factor increases as we move to extreme running distances in both the directions.
After rotation, we see that the longer distance performances load highly on the first factor rather than the shorter distance performances. All the estimated loadings are positive for this factor. In this case the first factor can be interpreted as a factor affecting the longer distance performances.
The shorter distance performances load highly on the second factor rather than the longer distance performances. All the estimated loadings are positive for this factor. In this case the second factor can be interpreted as a factor affecting the shorter distance performances.
The off-diagonal elements of the residual matrix are close to zero. This supports the fact of retaining two factors.
We first factor analyze the sample covariance matrix by the iterative principal component method. The estimated loading matrix is shown below.
f1.women <- fa(r = women, nfactors = 2, rotate = 'none', scores = 'Thurstone', residuals = TRUE, fm = 'pa', covar = FALSE)
f1.women$loadings
##
## Loadings:
## PA1 PA2
## 100m 0.888 0.346
## 200m 0.895 0.441
## 400m 0.903 0.125
## 800m 0.915 -0.156
## 1500m 0.945 -0.292
## 3000m 0.935 -0.273
## Marathon 0.860 -0.165
##
## PA1 PA2
## SS loadings 5.751 0.542
## Proportion Var 0.822 0.077
## Cumulative Var 0.822 0.899
The estimated communalities are obtained as: 0.9090072, 0.9967272, 0.830982, 0.8623495, 0.9791393, 0.9484347, 0.7659236. The specific variances are obtained as: 0.0909928, 0.0032728, 0.169018, 0.1376505, 0.0208607, 0.0515653, 0.2340764. The residual matrix comes cout to be the following.
f1.women$residual - diag(f1.women$residual)
## 100m 200m 400m 800m 1500m
## 100m 0.000000000 -0.08446987 -0.09953291 -0.12070567 -0.0866300178
## 200m 0.003250154 0.00000000 -0.00571118 -0.02535116 -0.0009517203
## 400m -0.177558063 -0.17145636 0.00000000 -0.07668479 -0.1820736151
## 800m -0.167363326 -0.15972885 -0.04531729 0.00000000 -0.1294677822
## 1500m -0.016497854 -0.01853959 -0.03391630 -0.01267796 0.0000000000
## 3000m -0.038137974 -0.04840133 -0.07997792 -0.07604424 -0.0520686672
## Marathon -0.221358588 -0.21789650 -0.27595560 -0.25973114 -0.2353518322
## 3000m Marathon
## 100m -0.0775655540 -0.07827505
## 200m -0.0001088836 0.01290707
## 400m -0.1974306564 -0.21089722
## 800m -0.1621294848 -0.16330526
## 1500m -0.0213640840 -0.02213613
## 3000m 0.0000000000 -0.01233176
## Marathon -0.1948428806 0.00000000
Now we will make use of factor rotation. Let us try the previous thing with varimax rotation. The estimated loading matrix is shown below.
f2.women <- fa(r = women, nfactors = 2, rotate = 'varimax', scores = 'Thurstone', residuals = TRUE, fm = 'pa', covar = FALSE)
f2.women$loadings
##
## Loadings:
## PA1 PA2
## 100m 0.439 0.846
## 200m 0.382 0.922
## 400m 0.597 0.689
## 800m 0.791 0.486
## 1500m 0.903 0.404
## 3000m 0.883 0.412
## Marathon 0.755 0.443
##
## PA1 PA2
## SS loadings 3.486 2.806
## Proportion Var 0.498 0.401
## Cumulative Var 0.498 0.899
The estimates of communalities and specific variances will remain the same. Also the percentage of total variation explained by the two factors and the residual matrix will remain the same. Let us look at the loading plot in this case.
fa.diagram(f2.women)
The interpretations are similar.