We have data on men's track records from a sports competition. For each country we are given its record in eight events: 100 m, 200 m, 400 m, 800 m, 1500 m, 5000 m, 10000 m and Marathon; these are our variables. The records for the first three events are given in seconds, while the remaining events are recorded in minutes, so for ease of computation, analysis and interpretation we convert all variables to the same unit (minutes).
We first check the data for normality in several ways and look for possible outliers. We then remove those observations and look for transformations that might normalize the data. Based on these analyses we classify the countries into two groups and carry out a profile analysis. We also perform PCA and factor analysis, which yield some interesting results. Finally we attempt discriminant analysis, although the shortage of observations in one group restricts what can be checked there.
Loading data
track<-read.csv('track.csv')
dimnames(track)<-list(x=seq(1:nrow(track)),y=c("100 m","200 m","400 m","800 m","1500 m","5000 m","10000 m","Marathon","Country"))
head(track,10)
## 100 m 200 m 400 m 800 m 1500 m 5000 m 10000 m Marathon
## 1 0.1731667 0.3468333 0.7806667 1.81 3.70 14.04 29.36 137.72
## 2 0.1718333 0.3343333 0.7473333 1.74 3.57 13.28 27.66 128.30
## 3 0.1740000 0.3468333 0.7803333 1.79 3.60 13.26 27.72 135.90
## 4 0.1723333 0.3446667 0.7506667 1.73 3.60 13.22 27.45 129.95
## 5 0.1713333 0.3430000 0.7651667 1.80 3.75 14.68 30.55 146.62
## 6 0.1703333 0.3405000 0.7535000 1.73 3.66 13.62 28.62 133.13
## 7 0.1773333 0.3586667 0.8050000 1.80 3.85 14.45 30.28 139.95
## 8 0.1695000 0.3370000 0.7613333 1.76 3.63 13.55 28.09 130.15
## 9 0.1723333 0.3466667 0.7700000 1.79 3.71 13.61 29.30 134.03
## 10 0.1751667 0.3506667 0.7883333 1.81 3.73 13.90 29.13 133.53
## Country
## 1 argentin
## 2 australi
## 3 austria
## 4 belgium
## 5 bermuda
## 6 brazil
## 7 burma
## 8 canada
## 9 chile
## 10 china
First we plot the densities of the variables, to get an overall idea of the shapes of the densities and to judge whether they are marginally normally distributed.
We define
vec<-c("100 m","200 m","400 m","800 m","1500 m","5000 m","10000 m","Marathon")
and plot the densities:
par(mfrow=c(2,4))
for(i in 1:length(vec))
plot(density(track[,i]),main = paste(vec[i]),xlab = "Run time in minutes")
What we see is that all the densities have multiple modes, which suggests that the countries can be divided into groups: some countries are better at these events than others. The Q-Q plots give:
par(mfrow=c(2,4))
for(i in 1:length(vec))
{
qqnorm(y=track[,i],main= paste("QQ plot of",vec[i],sep = " "))
qqline(y=track[,i])
}
par(mfrow=c(1,1))
library(MASS)
## Warning: package 'MASS' was built under R version 3.5.3
par(mfrow=c(2,4))
for(i in 1:length(vec))
boxcox(track[,i]~1,lambda = seq(-6,6,by=0.1))
The Box-Cox plots show which values of \(\lambda\) the profile likelihood favours for each variable.
From the first plot, for the 100 m timings, we take \(\lambda\)=-4 and plot the density of the transformed variable; we do the same for the fourth variable (800 m).
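For reference, the transformation applied below is the usual Box-Cox family
\[
y^{(\lambda)}=\begin{cases}\dfrac{y^{\lambda}-1}{\lambda}, & \lambda\neq 0,\\[1ex] \log y, & \lambda=0,\end{cases}
\]
so with \(\lambda=-4\) we compute \((y^{-4}-1)/(-4)\).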
par(mfrow=c(1,2))
trans_y<-((track$`100 m`)^(-4)-1)/(-4)
plot(density(trans_y),main = "Transformed first variable, lambda=-4",xlab = "Run-time in minutes")
trans_y1<-((track$`800 m`)^(-4)-1)/(-4)
plot(density(trans_y1),main = "Transformed fourth variable, lambda=-4",xlab = "Run-time in minutes")
The transformed densities give enough evidence against transforming the variables; instead we should group the data accordingly.
For grouping we look for possible outliers in the data, put those into one group and the rest into another. We plot the Mahalanobis distances and use them to identify the outliers.
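For each country \(i\) the squared Mahalanobis distance is
\[
d_i^2=(x_i-\bar{x})'S^{-1}(x_i-\bar{x}),
\]
which under multivariate normality is approximately \(\chi^2_8\) distributed, so observations with \(d_i^2>\chi^2_{8,\,0.95}\) (the dashed line in the plot) are flagged as potential outliers.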
par(mfrow=c(1,1))
mahal_d<-mahalanobis(track[,-9],center=rowMeans(t(track[-9])),cov = var(track[-9]))
plot(mahal_d,main = "Mahalanobis Distance")
abline(h=qchisq(.95,8),lty=2)
The outliers are:
(index<-which(mahal_d>qchisq(.95,8)))
## 12 22 36 51 55
## 12 22 36 51 55
track[index,9]
## [1] cookis greece mauritiu thailand wsamoa
## 55 Levels: argentin australi austria belgium bermuda brazil ... wsamoa
Plotting the densities after removing the outliers gives:
par(mfrow=c(2,4))
for(i in 1:length(vec))
plot(density(track[-index,i]),main = paste(vec[i]),xlab = "Run time in minutes")
par(mfrow=c(1,1))
The Mahalanobis distances for the outlier-free data are:
par(mfrow=c(1,1))
mahal_d_1<-mahalanobis(track[-index,-9],center=rowMeans(t(track[-9])),cov = var(track[-9]))
plot(mahal_d_1,main = "Mahalanobis Distance")
abline(h=qchisq(.95,8),lty=2)
This shows the absence of further outliers.
The Q-Q plots for the outlier-free data are:
par(mfrow=c(2,4))
for(i in 1:length(vec))
{
qqnorm(y=track[-index,i],main= paste("QQ plot of",vec[i],sep = " "))
qqline(y=track[-index,i])
}
par(mfrow=c(1,1))
We perform the Shapiro-Wilk test for univariate normality and an energy test for multivariate normality on the outlier-free data.
library(MVN)
mvn(track[-index,-9],mvnTest="energy",univariateTest="SW",univariatePlot="histogram")
## $multivariateNormality
## Test Statistic p value MVN
## 1 E-statistic 1.930065 0 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk 100 m 0.9716 0.2678 YES
## 2 Shapiro-Wilk 200 m 0.9845 0.7517 YES
## 3 Shapiro-Wilk 400 m 0.9791 0.515 YES
## 4 Shapiro-Wilk 800 m 0.9405 0.0142 NO
## 5 Shapiro-Wilk 1500 m 0.9276 0.0045 NO
## 6 Shapiro-Wilk 5000 m 0.8446 <0.001 NO
## 7 Shapiro-Wilk 10000 m 0.8453 <0.001 NO
## 8 Shapiro-Wilk Marathon 0.7827 <0.001 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max
## 100 m 50 0.1737033 0.004069310 0.1734167 0.1655000 0.1830000
## 200 m 50 0.3474700 0.008815937 0.3468333 0.3286667 0.3656667
## 400 m 50 0.7700067 0.018319381 0.7676667 0.7310000 0.8110000
## 800 m 50 1.7820000 0.045355737 1.7800000 1.7000000 1.9000000
## 1500 m 50 3.6722000 0.115251101 3.6350000 3.5100000 4.0100000
## 5000 m 50 13.6732000 0.556132564 13.4750000 13.0100000 15.1100000
## 10000 m 50 28.6306000 1.242877520 28.0300000 27.3800000 31.4500000
## Marathon 50 135.0212000 7.314385223 132.1050000 128.2200000 157.7700000
## 25th 75th Skew Kurtosis
## 100 m 0.1710833 0.1759583 0.4076383 -0.18428149
## 200 m 0.3422500 0.3527917 0.1078462 -0.41757710
## 400 m 0.7565417 0.7805833 0.3033403 -0.58304617
## 800 m 1.7425000 1.8075000 0.6981109 0.04485093
## 1500 m 3.5925000 3.7475000 0.8214838 -0.03533428
## 5000 m 13.2625000 14.0125000 1.0578375 -0.11826838
## 10000 m 27.6750000 29.2825000 0.9217064 -0.48690964
## Marathon 130.5700000 137.3075000 1.5564469 1.54540140
This shows marginal normality for the first three variables, but the remaining five do not pass the test. It is also worth noting that the seventh variable (10000 m) has a rather flat distribution. So we look at the distributions of the outlier-free data and cut off the run times that are too long, because those countries possibly form a group that performs below average.
The cutpoints we use are:
1.85 min for 800m
3.80 min for 1500m
14.25 min for 5000m
29.5 min for 10000m
140 min for Marathon
These are solely based on visual observation.
time_points<-c(1.85,3.8,14.25,29.5,140)
for(i in 1:length(time_points))
print(which(track[,3+i]>time_points[i]))
## [1] 12 13 23 36 41 46 55
## [1] 7 12 13 16 26 36 41 42 46 51 55
## [1] 5 7 12 16 22 26 35 36 41 42 46 51 55
## [1] 5 7 12 16 23 26 33 35 36 41 42 46 50 51 55
## [1] 5 12 16 26 34 35 36 41 42 46 51 55
From this we take the indices that appear for almost all of the variables, remove the corresponding observations to obtain a new dataset, and then check that data for normality.
The common indices that we take are: 5, 7, 13, 16, 23, 26, 35, 41, 42, 46.
index_1<-c(5,7,13,16,23,26,35,41,42,46)
Altogether we treat the countries indexed by these numbers, together with the outlier indices found earlier, as outliers (countries that perform well below average) and put them into one group; the rest form the other group.
These countries are:
track[sort(c(index,index_1)),9]
## [1] bermuda burma cookis costa domrep greece guatemal
## [8] indonesi malaysia mauritiu png philippi singapor thailand
## [15] wsamoa
## 55 Levels: argentin australi austria belgium bermuda brazil ... wsamoa
Removing these observations, we check the data for normality again:
library(MVN)
mvn(track[-sort(c(index,index_1)),-9],mvnTest="energy",univariateTest="SW",univariatePlot="histogram")
## $multivariateNormality
## Test Statistic p value MVN
## 1 E-statistic 1.605336 0.001 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk 100 m 0.9877 0.9358 YES
## 2 Shapiro-Wilk 200 m 0.9872 0.9240 YES
## 3 Shapiro-Wilk 400 m 0.9820 0.7639 YES
## 4 Shapiro-Wilk 800 m 0.9574 0.1369 YES
## 5 Shapiro-Wilk 1500 m 0.9430 0.0438 NO
## 6 Shapiro-Wilk 5000 m 0.8993 0.0018 NO
## 7 Shapiro-Wilk 10000 m 0.8565 0.0001 NO
## 8 Shapiro-Wilk Marathon 0.8855 0.0007 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max
## 100 m 40 0.1729250 0.003438087 0.1730000 0.1655000 0.1818333
## 200 m 40 0.3453458 0.007711074 0.3461667 0.3286667 0.3656667
## 400 m 40 0.7651292 0.015362685 0.7644167 0.7310000 0.7966667
## 800 m 40 1.7667500 0.032375006 1.7600000 1.7000000 1.8500000
## 1500 m 40 3.6275000 0.072102136 3.6150000 3.5100000 3.7700000
## 5000 m 40 13.4372500 0.280439102 13.3400000 13.0100000 14.1300000
## 10000 m 40 28.1302500 0.717175684 27.8850000 27.3800000 30.0700000
## Marathon 40 132.0067500 3.046697571 131.4150000 128.2200000 141.2700000
## 25th 75th Skew Kurtosis
## 100 m 0.1707083 0.1752083 0.08510908 -0.16342089
## 200 m 0.3402917 0.3497500 0.01632403 0.07692336
## 400 m 0.7541250 0.7783750 0.14483432 -0.65038257
## 800 m 1.7400000 1.7900000 0.29832458 -0.46679776
## 1500 m 3.5700000 3.6700000 0.49336226 -0.79448600
## 5000 m 13.2275000 13.5725000 0.95319143 0.01292578
## 10000 m 27.6100000 28.5900000 1.01313076 -0.14051673
## Marathon 129.9425000 132.7175000 1.20772171 1.00336107
This tells us that after removing those observations the first four variables pass the Shapiro-Wilk test, with the 1500 m variable borderline (p ≈ 0.044). For the remaining three variables we try a Box-Cox transformation:
library(MASS)
par(mfrow=c(1,3))
for(i in 6:8)
boxcox(track[-sort(c(index,index_1)),i]~1,lambda = seq(-6,6,by=0.1))
Taking \(\lambda\)=-10 we get the following density plots for these variables:
par(mfrow=c(1,3))
for(i in 6:8)
plot(density((((track[-sort(c(index,index_1)),i])^(-10))-1)/(-10)),main=paste("Transformed",vec[i],sep=" "))
## Warning in plot.window(...): relative range of values = 0 * EPS, is small
## (axis 1)
For these transformations the data has too little variability; even if the transformed values did come from a normal distribution we would not be able to work with them. We scale the transformed variables (on a copy of the data, so that the original run times are kept for the later analyses) and check for normality:
track_trans<-track
for(i in 6:8)
track_trans[-sort(c(index,index_1)),i]<-scale((((track_trans[-sort(c(index,index_1)),i])^(-10))-1)/(-10))
mvn(track_trans[-sort(c(index,index_1)),-9],mvnTest="energy",univariateTest="SW",univariatePlot="histogram")
This shows that the data finally attains normality, but using this transformed data for further analysis would not carry any meaning. So we discard the transformed data and proceed with the original data with the outliers removed; the PCA, factor analysis, etc. below are done on this data.
We group the data as described before for the profile analysis and discriminant analysis. The code used for grouping is:
library(dplyr)
track_group<-track%>%mutate(Running_ability=ifelse(Country %in% track[sort(c(index,index_1)),9],"0","1"))
head(track_group,10)
## 100 m 200 m 400 m 800 m 1500 m 5000 m 10000 m Marathon
## 1 0.1731667 0.3468333 0.7806667 1.81 3.70 14.04 29.36 137.72
## 2 0.1718333 0.3343333 0.7473333 1.74 3.57 13.28 27.66 128.30
## 3 0.1740000 0.3468333 0.7803333 1.79 3.60 13.26 27.72 135.90
## 4 0.1723333 0.3446667 0.7506667 1.73 3.60 13.22 27.45 129.95
## 5 0.1713333 0.3430000 0.7651667 1.80 3.75 14.68 30.55 146.62
## 6 0.1703333 0.3405000 0.7535000 1.73 3.66 13.62 28.62 133.13
## 7 0.1773333 0.3586667 0.8050000 1.80 3.85 14.45 30.28 139.95
## 8 0.1695000 0.3370000 0.7613333 1.76 3.63 13.55 28.09 130.15
## 9 0.1723333 0.3466667 0.7700000 1.79 3.71 13.61 29.30 134.03
## 10 0.1751667 0.3506667 0.7883333 1.81 3.73 13.90 29.13 133.53
## Country Running_ability
## 1 argentin 1
## 2 australi 1
## 3 austria 1
## 4 belgium 1
## 5 bermuda 0
## 6 brazil 1
## 7 burma 0
## 8 canada 1
## 9 chile 1
## 10 china 1
track_group_1<-track_group[track_group$Running_ability=="1",]
track_group_0<-track_group[track_group$Running_ability=="0",]
We now look at the densities of the grouped data:
library(ggplot2)
track_group%>%ggplot(aes(track_group[,1],fill=Running_ability))+geom_density(alpha=0.3)+xlab("100 m")
track_group%>%ggplot(aes(track_group[,2],fill=Running_ability))+geom_density(alpha=0.3)+xlab("200 m")
track_group%>%ggplot(aes(track_group[,3],fill=Running_ability))+geom_density(alpha=0.3)+xlab("400 m")
track_group%>%ggplot(aes(track_group[,4],fill=Running_ability))+geom_density(alpha=0.3)+xlab("800 m")
track_group%>%ggplot(aes(track_group[,5],fill=Running_ability))+geom_density(alpha=0.3)+xlab("1500 m")
track_group%>%ggplot(aes(track_group[,6],fill=Running_ability))+geom_density(alpha=0.3)+xlab("5000 m")
track_group%>%ggplot(aes(track_group[,7],fill=Running_ability))+geom_density(alpha=0.3)+xlab("10000 m")
track_group%>%ggplot(aes(track_group[,8],fill=Running_ability))+geom_density(alpha=0.3)+xlab("Marathon")
The overlap between the two groups in some of these densities is an indication that we will see quite a number of misclassifications in the discriminant analysis.
Now we look at one- and two-sigma intervals around the mean and compare the empirical proportions with the values expected under normality, which are 0.683 and 0.954 for the one-sigma and two-sigma intervals respectively.
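These reference values come directly from the standard normal distribution and can be reproduced in R:
pnorm(1)-pnorm(-1)   # ~ 0.683
pnorm(2)-pnorm(-2)   # ~ 0.954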
Code used:
normality_check<-function(x)
{
rm<-rowMeans(t(x))   # column means of the data
rv<-apply(x,2,sd)    # column standard deviations
proportion<-matrix(numeric(2*dim(x)[2]),ncol = 2)
for (i in 1:dim(x)[2]) {
for(j in 1:2)
{
# proportion of observations within j standard deviations of the mean of variable i
proportion[i,j]<-mean(between(as.matrix(x[i]),as.matrix(rm[i]-j*rv[i]),
as.matrix(rm[i]+j*rv[i])))
}
}
return(proportion)
}
normality_check(track[,-9])
## [,1] [,2]
## [1,] 0.8000000 0.9636364
## [2,] 0.7272727 0.9636364
## [3,] 0.7454545 0.9818182
## [4,] 0.8545455 0.9636364
## [5,] 0.8363636 0.9636364
## [6,] 0.8000000 0.9636364
## [7,] 0.8545455 0.9636364
## [8,] 0.8181818 0.9454545
normality_check(track[-index,-9])
## [,1] [,2]
## [1,] 0.66 0.92
## [2,] 0.68 0.94
## [3,] 0.60 0.94
## [4,] 0.70 0.94
## [5,] 0.68 0.96
## [6,] 0.80 0.96
## [7,] 0.76 0.94
## [8,] 0.86 0.94
normality_check(track_group_1[,-c(9,10)])
## [,1] [,2]
## [1,] 0.650 0.950
## [2,] 0.700 0.925
## [3,] 0.650 0.950
## [4,] 0.625 0.950
## [5,] 0.600 1.000
## [6,] 0.775 0.925
## [7,] 0.750 0.950
## [8,] 0.750 0.950
normality_check(track_group_0[,-c(9,10)])
## [,1] [,2]
## [1,] 0.8666667 0.9333333
## [2,] 0.6666667 0.9333333
## [3,] 0.8666667 0.9333333
## [4,] 0.8000000 0.8666667
## [5,] 0.8000000 0.8666667
## [6,] 0.7333333 0.9333333
## [7,] 0.7333333 0.9333333
## [8,] 0.6000000 1.0000000
Q-Q plots for the grouped data (groups 1 and 0 respectively) are:
par(mfrow=c(2,4))
for(i in 1:8)
{
qqnorm(y=track_group_1[,i],main= paste("QQ plot of",vec[i],sep = " "))
qqline(y=track_group_1[,i])
}
for(i in 1:8)
{
qqnorm(y=track_group_0[,i],main= paste("QQ plot of",vec[i],sep = " "))
qqline(y=track_group_0[,i])
}
Bonferroni intervals for the variables, for the original data, the outlier-free data, and the two groups respectively, are computed below.
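The intervals computed by the function below are the standard Bonferroni simultaneous confidence intervals for the \(p=8\) means,
\[
\bar{x}_j \pm t_{n-1}\left(\frac{\alpha}{2p}\right)\frac{s_j}{\sqrt{n}},\qquad j=1,\dots,p,\quad \alpha=0.05,
\]
and the "actual" column reports the proportion of individual observations falling inside each interval.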
Bonferroni_interval<-function(x,confidence=0.95)
{
rm<-rowMeans(t(x))
rv<-apply(x,2,sd)
proportion<-matrix(numeric(2*dim(x)[2]),ncol = 2)
actual<-numeric(dim(x)[2])
for (i in 1:dim(x)[2]) {
proportion[i,]<-rm[i]+qt((1-confidence)/(2*dim(x)[2]),df=nrow(x)-1,lower.tail = FALSE)*rv[i]*c(-1,1)/sqrt(nrow(x))
actual[i]<-mean(between(x[,i],proportion[i,1],proportion[i,2]))
}
return(cbind(proportion,actual))
}
Bonferroni_interval(track[,-9])
## actual
## [1,] 0.1722707 0.1767657 0.4909091
## [2,] 0.3448834 0.3531288 0.3454545
## [3,] 0.7646608 0.7832968 0.3454545
## [4,] 1.7688358 1.8177096 0.3636364
## [5,] 3.6383568 3.7580068 0.2545455
## [6,] 13.5383997 14.1532366 0.2363636
## [7,] 28.2954346 29.6827473 0.2363636
## [8,] 133.0834355 140.1645645 0.2363636
Bonferroni_interval(track[-index,-9])
## actual
## [1,] 0.1720589 0.1753478 0.40
## [2,] 0.3439073 0.3510327 0.40
## [3,] 0.7626035 0.7774099 0.28
## [4,] 1.7636709 1.8003291 0.32
## [5,] 3.6256249 3.7187751 0.18
## [6,] 13.4484567 13.8979433 0.22
## [7,] 28.1283304 29.1328696 0.18
## [8,] 132.0653230 137.9770770 0.28
Bonferroni_interval(track_group_1[,-c(9,10)])
## actual
## [1,] 0.1713535 0.1744965 0.400
## [2,] 0.3418213 0.3488704 0.375
## [3,] 0.7581073 0.7721510 0.400
## [4,] 1.7519523 1.7815477 0.300
## [5,] 3.5945441 3.6604559 0.400
## [6,] 13.3090690 13.5654310 0.325
## [7,] 27.8024487 28.4580513 0.275
## [8,] 130.6141884 133.3993116 0.450
Bonferroni_interval(track_group_0[,-c(9,10)])
## actual
## [1,] 0.1716869 0.1858464 0.6666667
## [2,] 0.3489751 0.3685582 0.6000000
## [3,] 0.7741590 0.8209965 0.7333333
## [4,] 1.8032861 1.9247139 0.6666667
## [5,] 3.7504843 4.0228491 0.7333333
## [6,] 14.3440173 15.5266494 0.7333333
## [7,] 29.7393809 32.8192858 0.7333333
## [8,] 141.5529856 156.3203478 0.5333333
T-squared simultaneous intervals for the variables, for the original data, the outlier-free data, and the two groups respectively, are:
T_square_interval<-function(x,confidence=0.95)
{
rm<-rowMeans(t(x))
rv<-apply(x,2,sd)
proportion<-matrix(numeric(2*dim(x)[2]),ncol = 2)
actual<-numeric(dim(x)[2])
for (i in 1:dim(x)[2]) {
proportion[i,]<-rm[i]+qf(1-(confidence^(1/ncol(x))),df1 = ncol(x),df2 = nrow(x)-ncol(x),lower.tail = FALSE)*sqrt(ncol(x)*(nrow(x)-1))*rv[i]*c(-1,1)/sqrt(nrow(x)*(nrow(x)-ncol(x)))
actual[i]<-mean(between(x[,i],proportion[i,1],proportion[i,2]))
}
return(cbind(proportion,actual))
}
T_square_interval(track[,-9])
## actual
## [1,] 0.1670179 0.1820185 0.8727273
## [2,] 0.3352478 0.3627644 0.8000000
## [3,] 0.7428826 0.8050750 0.8909091
## [4,] 1.7117217 1.8748238 0.8727273
## [5,] 3.4985334 3.8978302 0.9272727
## [6,] 12.8198994 14.8717370 0.8909091
## [7,] 26.6742165 31.3039653 0.8727273
## [8,] 124.8084056 148.4395944 0.8545455
T_square_interval(track[-index,-9])
## actual
## [1,] 0.1680886 0.1793181 0.84
## [2,] 0.3353059 0.3596341 0.82
## [3,] 0.7447299 0.7952834 0.82
## [4,] 1.7194190 1.8445810 0.88
## [5,] 3.5131787 3.8312213 0.88
## [6,] 12.9058587 14.4405413 0.84
## [7,] 26.9157009 30.3454991 0.86
## [8,] 124.9289486 145.1134514 0.86
T_square_interval(track_group_1[,-c(9,10)])
## actual
## [1,] 0.1671875 0.1786625 0.925
## [2,] 0.3324775 0.3582142 0.925
## [3,] 0.7394917 0.7907666 0.925
## [4,] 1.7127222 1.8207778 0.950
## [5,] 3.5071751 3.7478249 0.925
## [6,] 12.9692499 13.9052501 0.900
## [7,] 26.9334187 29.3270813 0.925
## [8,] 126.9223848 137.0911152 0.900
T_square_interval(track_group_0[,-c(9,10)])
## actual
## [1,] 0.1083936 0.2491398 1
## [2,] 0.2614378 0.4560955 1
## [3,] 0.5647938 1.0303617 1
## [4,] 1.2604997 2.4675003 1
## [5,] 2.5330049 5.2403285 1
## [6,] 9.0576130 20.8130537 1
## [7,] 15.9721052 46.5865615 1
## [8,] 75.5423944 222.3309389 1
Since, after removing the odd observations, the first four variables were found to be (marginally) normally distributed, we draw 95% confidence ellipses for each pair of those variables (6 plots) for every version of the data frame considered. We do the same for the last four variables among themselves. We also look at the scatterplot matrices:
library(psych)
pairs.panels(track[,-c(9)],gap=0,pch=21)
pairs.panels(track_group[,-c(9,10)],gap=0,bg=c("red","green")[track_group$Running_ability],pch=21)
pairs.panels(track_group_1[,-c(9,10)],gap=0,pch=21)
pairs.panels(track_group_0[,-c(9,10)],gap=0,pch=21)
Here the first four variables are quite strongly linearly related to one another, and the last four are likewise nearly linearly related among themselves.
* For the first four variables:
library('car')
delp<-function(x){
for(i in 1:4)
{
for(j in i+1:4)
{
if(j<=4)
{
dataEllipse(x[,i],x[,j],levels = 0.95,xlab = paste(vec[i]),ylab = paste(vec[j]),xlim=c(min(x[,i])-0.05,max(x[,i])+0.05),ylim=c(min(x[,j])-0.05,max(x[,j])+0.05))
}
}
}
}
par(mfrow=c(2,3))
delp(track[,-9])
delp(track[-index,-9])
delp(track_group_1[,-c(9,10)])
delp(track_group_0[,-c(9,10)])
* For the last four variables:
delp<-function(x){
for(i in 5:8)
{
for(j in i+1:8)
{
if(j<=8)
{
dataEllipse(x[,i],x[,j],levels = 0.95,xlab = paste(vec[i]),ylab = paste(vec[j]),xlim=c(min(x[,i])-0.5,max(x[,i])+0.5),ylim=c(min(x[,j])-0.5,max(x[,j])+0.5))
}
}
}
}
par(mfrow=c(2,3))
delp(track[,-9])
delp(track[-index,-9])
delp(track_group_1[,-c(9,10)])
delp(track_group_0[,-c(9,10)])
aa=read.csv("track.csv")
bb=aa[,-9]
X=as.matrix(bb)
Z=scale(X)
colnames(Z)=c('z1','z2','z3','z4','z5','z6','z7','z8')
R=cov(Z)
R
## z1 z2 z3 z4 z5 z6 z7
## z1 1.0000000 0.9226384 0.8411468 0.7560278 0.7002382 0.6194618 0.6325389
## z2 0.9226384 1.0000000 0.8507270 0.8066265 0.7749513 0.6953770 0.6965391
## z3 0.8411468 0.8507270 1.0000000 0.8701714 0.8352694 0.7786139 0.7872046
## z4 0.7560278 0.8066265 0.8701714 1.0000000 0.9180442 0.8635939 0.8690489
## z5 0.7002382 0.7749513 0.8352694 0.9180442 1.0000000 0.9281140 0.9346970
## z6 0.6194618 0.6953770 0.7786139 0.8635939 0.9281140 1.0000000 0.9746354
## z7 0.6325389 0.6965391 0.7872046 0.8690489 0.9346970 0.9746354 1.0000000
## z8 0.5199490 0.5961837 0.7049905 0.8064764 0.8655492 0.9321884 0.9431763
## z8
## z1 0.5199490
## z2 0.5961837
## z3 0.7049905
## z4 0.8064764
## z5 0.8655492
## z6 0.9321884
## z7 0.9431763
## z8 1.0000000
ev=eigen(R)
ev$values
## [1] 6.62214614 0.87761829 0.15932114 0.12404938 0.07988027 0.06796515
## [7] 0.04641953 0.02260010
plot(ev$values,type='l',ylab="Eigenvalues",col='magenta')
The elbow in the scree plot occurs at the second largest eigenvalue, so we will consider only the first two sample PCs.
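As a quick cross-check (using the eigenvalues ev$values already computed, whose total equals the number of variables for a correlation matrix), the share of total variance carried by the first two components is:
sum(ev$values[1:2])/sum(ev$values)   # approximately 0.9375, matching the prcomp summary below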
prcomp.track=prcomp(x=Z,retx=T)
A=prcomp.track$rotation
A
## PC1 PC2 PC3 PC4 PC5 PC6
## z1 0.3175565 0.56687750 0.3322620 -0.12762824 0.2625555 -0.5937042
## z2 0.3369792 0.46162589 0.3606567 0.25911578 -0.1539571 0.6561367
## z3 0.3556454 0.24827331 -0.5604674 -0.65234081 -0.2183229 0.1566252
## z4 0.3686841 0.01242993 -0.5324823 0.47999891 0.5400528 -0.0146918
## z5 0.3728099 -0.13979665 -0.1534427 0.40451038 -0.4877151 -0.1578430
## z6 0.3643741 -0.31203045 0.1897643 -0.02958753 -0.2539792 -0.1412987
## z7 0.3667726 -0.30685985 0.1817517 -0.08006860 -0.1331764 -0.2190168
## z8 0.3419261 -0.43896266 0.2632087 -0.29951212 0.4979282 0.3152849
## PC7 PC8
## z1 0.136241272 -0.1055416946
## z2 -0.112639536 0.0960543400
## z3 -0.002853716 0.0001272086
## z4 -0.238016091 0.0381651160
## z5 0.610011481 -0.1392909902
## z6 -0.591298854 -0.5466969178
## z7 -0.176871008 0.7967952159
## z8 0.398822205 -0.1581638586
The first two sample PCs are given by y1=0.32z1 + 0.34z2 + 0.35z3 + 0.37z4 + 0.37z5 + 0.36z6 + 0.37z7 + 0.34z8 and
y2=0.57z1 + 0.46z2 + 0.25z3 + 0.01z4 - 0.14z5 - 0.31z6 - 0.31z7 - 0.44z8.
Now let us find the proportion of total variability of the data explained by these two PCs.
summary(prcomp.track)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5734 0.9368 0.39915 0.35221 0.28263 0.2607 0.2155
## Proportion of Variance 0.8278 0.1097 0.01992 0.01551 0.00999 0.0085 0.0058
## Cumulative Proportion 0.8278 0.9375 0.95739 0.97289 0.98288 0.9914 0.9972
## PC8
## Standard deviation 0.15033
## Proportion of Variance 0.00283
## Cumulative Proportion 1.00000
93.75% of the total variation in the data, which is a substantial amount, is explained by these two principal components.
Y=prcomp.track$x
r1=numeric(8)
r2=numeric(8)
for(i in 1:8)
{
r1[i]=cor(Z[,i],Y[,1])
r2[i]=cor(Z[,i],Y[,2])
}
B=data.frame(r1,r2)
rownames(B)=c('z1','z2','z3','z4','z5','z6','z7','z8')
colnames(B)=c('PC1','PC2')
B
## PC1 PC2
## z1 0.8171850 0.53105812
## z2 0.8671665 0.43245706
## z3 0.9152012 0.23258562
## z4 0.9487545 0.01164451
## z5 0.9593714 -0.13096330
## z6 0.9376633 -0.29231413
## z7 0.9438353 -0.28747024
## z8 0.8798965 -0.41122586
The near-uniformity of the weights in the first principal component is really a reflection of the considerable amount of structure in the original data. The coefficients of the zi's in y1 are nearly equal, so y1 gives almost the same weight to every event, which is also clear from the table above. Using the values of y1 we can therefore rank the nations by overall running performance. Such a data-driven ranking has considerable intrinsic interest, and the technique is particularly useful when subjective judgements of relative performance are not easy to come by.
The second PC can be interpreted as a measure of the relative strength of a nation at the various distances. A value near 0 on the second PC indicates that the nation achieves about the same level over both short and long distances, while extreme values at either end indicate an imbalance in achievement. Thus the second PC serves as a measure of differential achievement.
Y[,2]
## [1] -0.34487666 -0.21616901 0.48689073 0.26194657 -1.76693743
## [6] -0.64115405 0.25720348 -0.50034502 -0.20141106 0.35777693
## [11] 0.49997999 1.50876822 1.67064116 -0.02878301 0.38730438
## [16] -2.44900163 0.40882391 -0.50289374 -0.31066724 -0.41137495
## [21] -0.27890616 -0.60178374 1.27121076 -0.15152669 0.67647278
## [26] -0.60316736 0.94602435 0.67428204 -0.98985879 0.41357459
## [31] 0.53371105 -0.30152973 1.56472721 -0.27933562 -1.72272831
## [36] 0.66695377 0.84175219 0.70244240 0.92336255 1.05663679
## [41] 0.08554195 -0.18932031 -0.46259920 1.30473235 0.53076660
## [46] -1.78901618 0.50665093 0.02323023 0.19593360 0.04202950
## [51] -1.66978711 1.38303678 -1.11019109 -0.75696176 -1.90208197
summary(Y[,2])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.44900 -0.48147 0.04203 0.00000 0.60033 1.67064
Since PC2 serves as a measure of differential achievement, extreme negative values indicate relatively better performances at the shorter distances and extreme positive values indicate relatively better performances at the longer distances. Looking at the values of PC2 we can therefore classify the countries into three groups: one with relatively better performances at shorter distances, one with even performances over both shorter and longer distances, and one with relatively better performances at longer distances. This classification is shown below.
g1=which(Y[,2] < -1)
g2=which(Y[,2] >= -1 & Y[,2] <= 1)
g3=which(Y[,2] > 1)
The countries corresponding to group 1 are bermuda, domrep, malaysia, singapor, thailand, usa, wsamoa.
The countries corresponding to group 2 are argentin, australi, austria, belgium, brazil, burma, canada, chile, china, columbia, czech, denmark, finland, france, gdr, frg, gbni, greece, hungary, india, indonesi, ireland, israel, italy, japan, kenya, korea, luxembou, mauritiu, mexico, netherla, nz, png, philippi, poland, rumania, spain, sweden, switzerl, taipei, ussr.
The countries corresponding to group 3 are cookis, costa, guatemal, dprkorea, norway, portugal, turkey.
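These country names can be read off directly from the Country column of the data, for example:
aa[g1,9]
aa[g2,9]
aa[g3,9]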
par(mfrow=c(2,3))
plot(Y[,3],ylab='PC3',col='lightseagreen',ylim=c(-1,1))
plot(Y[,4],ylab='PC4',col='lightseagreen',ylim=c(-1,1))
plot(Y[,5],ylab='PC5',col='lightseagreen',ylim=c(-1,1))
plot(Y[,6],ylab='PC6',col='lightseagreen',ylim=c(-1,1))
plot(Y[,7],ylab='PC7',col='lightseagreen',ylim=c(-1,1))
plot(Y[,8],ylab='PC8',col='lightseagreen',ylim=c(-1,1))
For PC3 and PC4 if the absolute value for a sample observation is greater than 0.85 we suspect it as an outlier.
For PC5, PC6 and PC7 if the absolute value for a sample observation is greater than 0.75 we suspect it as an outlier.
For PC8 if the absolute value for a sample observation is greater than 0.60 we suspect it as an outlier.
which(abs(Y[,3])>0.85)
## [1] 36 42
which(abs(Y[,4])>0.85)
## [1] 55
which(abs(Y[,5])>0.75)
## [1] 7
which(abs(Y[,6])>0.75)
## [1] 12
which(abs(Y[,7])>0.75)
## [1] 22
which(abs(Y[,8])>0.60)
## [1] 22
aa[c(7,12,22,36,42,55),9]
## [1] burma cookis greece mauritiu philippi wsamoa
## 55 Levels: argentin australi austria belgium bermuda brazil ... wsamoa
The suspected observations are the 7th, 12th, 22nd, 36th, 42nd and 55th observations in the dataset. The corresponding countries are Burma, Cook Islands, Greece, Mauritius, Philippines and Western Samoa.
aa1=aa[-c(7,12,36,42,55),]
bb1=aa1[,-9]
X1=as.matrix(bb1)
Z1=scale(X1)
colnames(Z1)=c('z1','z2','z3','z4','z5','z6','z7','z8')
R1=cov(Z1)
R1
## z1 z2 z3 z4 z5 z6 z7
## z1 1.0000000 0.8967167 0.7352504 0.6425212 0.5311355 0.2822338 0.3198075
## z2 0.8967167 1.0000000 0.8062597 0.7508055 0.6853210 0.4776137 0.4861586
## z3 0.7352504 0.8062597 1.0000000 0.8446809 0.7587542 0.6406367 0.6569769
## z4 0.6425212 0.7508055 0.8446809 1.0000000 0.8529926 0.7406845 0.7423022
## z5 0.5311355 0.6853210 0.7587542 0.8529926 1.0000000 0.8644899 0.8756224
## z6 0.2822338 0.4776137 0.6406367 0.7406845 0.8644899 1.0000000 0.9482835
## z7 0.3198075 0.4861586 0.6569769 0.7423022 0.8756224 0.9482835 1.0000000
## z8 0.1677496 0.3434233 0.5582540 0.6658721 0.7843441 0.8886124 0.9091055
## z8
## z1 0.1677496
## z2 0.3434233
## z3 0.5582540
## z4 0.6658721
## z5 0.7843441
## z6 0.8886124
## z7 0.9091055
## z8 1.0000000
ev1=eigen(R1)
ev1$values
## [1] 5.77048282 1.53269755 0.21178746 0.16062546 0.11487878 0.08570121
## [7] 0.08121911 0.04260761
plot(ev1$values,type='l',ylab="Eigenvalues",col='thistle3')
The elbow in the scree plot again occurs at the second largest eigenvalue, so we will again consider only the first two sample PCs.
prcomp.track1=prcomp(x=Z1,retx=T)
A1=prcomp.track1$rotation
A1
## PC1 PC2 PC3 PC4 PC5 PC6
## z1 0.2719659 0.57546658 -0.314388819 0.173910120 -0.33750082 0.17239103
## z2 0.3283778 0.44740468 -0.344698977 0.008642004 0.07541809 -0.39344746
## z3 0.3669861 0.22830208 0.637396531 0.497858882 0.34663360 0.18278873
## z4 0.3843617 0.08740350 0.501252928 -0.628082021 -0.30210086 -0.27509050
## z5 0.3933679 -0.09184117 -0.220133144 -0.443845752 0.28212124 0.68145517
## z6 0.3648331 -0.33594328 -0.176143455 0.037359251 0.36673246 -0.47908120
## z7 0.3702270 -0.32276430 -0.212838065 0.176838970 0.09099747 -0.02349439
## z8 0.3329869 -0.42829495 0.005318866 0.312511292 -0.66829626 0.10927944
## PC7 PC8
## z1 0.52054738 -0.22625793
## z2 -0.60812939 0.20686922
## z3 -0.07081342 -0.02191945
## z4 0.15155052 0.09475171
## z5 -0.18280661 -0.11822623
## z6 0.21596083 -0.55764333
## z7 0.36251657 0.73616053
## z8 -0.34605111 -0.17227385
The first two sample PCs are given by y1=0.27z1 + 0.33z2 + 0.37z3 + 0.38z4 + 0.39z5 + 0.36z6 + 0.37z7 + 0.33z8 and
y2=0.57z1 + 0.45z2 + 0.23z3 + 0.09z4 - 0.09z5 - 0.33z6 - 0.32z7 - 0.43z8.
Now let us find the proportion of total variability of the data explained by these two PCs.
summary(prcomp.track1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.4022 1.2380 0.46020 0.40078 0.33894 0.29275
## Proportion of Variance 0.7213 0.1916 0.02647 0.02008 0.01436 0.01071
## Cumulative Proportion 0.7213 0.9129 0.93937 0.95945 0.97381 0.98452
## PC7 PC8
## Standard deviation 0.28499 0.20642
## Proportion of Variance 0.01015 0.00533
## Cumulative Proportion 0.99467 1.00000
91.29% of the total variation in the data is explained by these two principal components, which is still a substantial amount, although it has decreased compared to the full dataset.
Y1=prcomp.track1$x
r11=numeric(8)
r21=numeric(8)
for(i in 1:8)
{
r11[i]=cor(Z1[,i],Y1[,1])
r21[i]=cor(Z1[,i],Y1[,2])
}
B1=data.frame(r11,r21)
rownames(B1)=c('z1','z2','z3','z4','z5','z6','z7','z8')
colnames(B1)=c('PC1','PC2')
B1
## PC1 PC2
## z1 0.6533119 0.7124401
## z2 0.7888235 0.5538967
## z3 0.8815677 0.2826429
## z4 0.9233072 0.1082074
## z5 0.9449417 -0.1137014
## z6 0.8763960 -0.4159050
## z7 0.8893530 -0.3995892
## z8 0.7998955 -0.5302384
Assumptions: we fit an orthogonal factor model with two common factors to the correlation matrix; the standard assumptions of that model are stated below.
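For reference (this is the standard orthogonal factor model, not anything specific to this dataset), for the standardized variables \(Z\) the model with \(m=2\) common factors is
\[
Z=LF+\varepsilon,\qquad E(F)=0,\ \operatorname{Cov}(F)=I,\qquad E(\varepsilon)=0,\ \operatorname{Cov}(\varepsilon)=\Psi\ \text{(diagonal)},\qquad \operatorname{Cov}(F,\varepsilon)=0,
\]
which implies \(\operatorname{Cov}(Z)=LL'+\Psi\); since we work with standardized variables, \(\operatorname{Cov}(Z)\) is the correlation matrix \(R\) used below.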
library('psych')
f1=fac(r=R,nfactors=2,rotate='none',scores='Thurstone',residuals=TRUE,fm='pa')
The estimated loading matrix is shown below.
L1=f1$loadings
L1
##
## Loadings:
## PA1 PA2
## z1 0.812 0.511
## z2 0.861 0.417
## z3 0.901 0.211
## z4 0.937
## z5 0.955 -0.118
## z6 0.938 -0.285
## z7 0.947 -0.286
## z8 0.873 -0.378
##
## PA1 PA2
## SS loadings 6.540 0.800
## Proportion Var 0.818 0.100
## Cumulative Var 0.818 0.918
The estimated communalities are obtained as:
f1$communality
## z1 z2 z3 z4 z5 z6 z7
## 0.9200913 0.9149326 0.8558811 0.8784953 0.9249959 0.9621504 0.9786024
## z8
## 0.9048802
The specific variances are obtained as:
1-f1$communality
## z1 z2 z3 z4 z5 z6
## 0.07990870 0.08506742 0.14411890 0.12150473 0.07500406 0.03784962
## z7 z8
## 0.02139765 0.09511978
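As a quick check (using the loadings object L1 defined above), the communality of each variable in a two-factor orthogonal solution is the sum of its squared loadings, and the specific variance is one minus that; the line below should reproduce f1$communality up to rounding:
rowSums(L1[,1:2]^2)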
The residual matrix is obtained as:
f1$residual-diag(f1$residual)
## z1 z2 z3 z4 z5 z6
## z1 0.00000000 -0.06909616 -0.07785252 -0.09377579 -0.09422145 -0.07623959
## z2 -0.07425488 0.00000000 -0.09789737 -0.09286702 -0.08290860 -0.07878014
## z3 -0.14206272 -0.15694885 0.00000000 -0.12180296 -0.14366282 -0.15039006
## z4 -0.13537182 -0.12930434 -0.09918880 0.00000000 -0.09581518 -0.13214786
## z5 -0.08931681 -0.07284525 -0.07454798 -0.04931451 0.00000000 -0.07627891
## z6 -0.03418051 -0.03156233 -0.04412077 -0.04849274 -0.03912447 0.00000000
## z7 -0.01123598 -0.02101583 -0.02661870 -0.03456410 -0.02430590 -0.01710207
## z8 -0.09086196 -0.09340770 -0.09671101 -0.09999440 -0.10745583 -0.09007141
## z7 z8
## z1 -0.06974703 -0.07565088
## z2 -0.08468560 -0.08335534
## z3 -0.14933996 -0.14571013
## z4 -0.13467119 -0.12637935
## z5 -0.07791231 -0.08734011
## z6 -0.03355404 -0.03280124
## z7 0.00000000 -0.01305434
## z8 -0.08677648 0.00000000
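A small caveat about the display above: diag(f1$residual) is a vector, so the subtraction recycles it down the rows, shifting every entry in row i by the specific variance of variable i (which is why the printed matrix is not symmetric). To zero out only the diagonal one would instead use, for example,
f1$residual-diag(diag(f1$residual))
after which the off-diagonal residuals are small (around 0.01 in absolute value), consistent with the two-factor fit; the same remark applies to the residual matrices printed later.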
f2=fac(r=R,nfactors=2,rotate='varimax',scores='Thurstone',residuals=TRUE,fm='pa')
The estimated loading matrix is shown below.
L2=f2$loadings
L2
##
## Loadings:
## PA1 PA2
## z1 0.287 0.915
## z2 0.386 0.875
## z3 0.549 0.744
## z4 0.702 0.621
## z5 0.804 0.528
## z6 0.900 0.390
## z7 0.907 0.395
## z8 0.910 0.278
##
## PA1 PA2
## SS loadings 4.133 3.207
## Proportion Var 0.517 0.401
## Cumulative Var 0.517 0.918
The estimates of communalities and specific variances will remain the same. Also the percentage of total variation explained by the two factors and the residual matrix will remain the same.
Focusing our attention on the (iterated) principal axis method and the cumulative proportion of the total sample variance explained, which is 0.918, we see that a two-factor solution appears to be warranted.
The factor loadings change considerably after the varimax rotation.
Before rotation, all the variables load highly on the first factor and the estimated loadings are about equal. This factor can be interpreted as a general running factor.
The second factor contrasts the shorter distances with the longer distances. (The shorter distance performances have positive estimated loadings and the longer distance performances have negative estimated loadings on this factor.) Thus, this factor seems to differentiate the performances in different distances and might be called a differentiating factor. Also the absolute value of the estimated loadings of this factor increases as we move to extreme running distances in both the directions.
After rotation, we see that the longer distance performances load highly on the first factor rather than the shorter distance performances. All the estimated loadings are positive for this factor. In this case the first factor can be interpreted as a factor affecting the longer distance performances.
Similarly, the shorter distance performances load highly on the second factor rather than the longer distance performances. All the estimated loadings on this factor are also positive, so the second factor can be interpreted as a factor affecting the shorter distance performances.
The off-diagonal elements of the residual matrix are close to zero, which supports the decision to retain two factors.
s=factor.scores(x=Z,f=f1,method='Thurstone')
scores=s$scores
plot(scores[,1],scores[,2],xlab='Factor Scores of F1',ylab='Factor Scores of F2',col='palevioletred4')
par(mfrow=c(1,2))
plot(scores[,1],ylab='Factor Scores of F1',col='slateblue')
plot(scores[,2],ylab='Factor Scores of F2',col='slateblue')
From the scatter plot of the factor scores of F1 against the factor scores of F2 we see that two points are not consistent with the remaining observations: they lie at the extreme right-hand side of the diagram, i.e. their F1 scores are far higher than those of the others.
The individual factor-score plots then show that these two points differ from the rest only with respect to the F1 score. We can treat them as potential outliers.
which(scores[,1]>2)
## [1] 12 55
Hence we can treat the 12th and 55th observations in the dataset as potential outliers.
f3=fac(r=R,nfactors=2,rotate='varimax',residuals=TRUE,fm='ml')
The estimated loading matrix is shown below.
L3=f3$loadings
L3
##
## Loadings:
## ML1 ML2
## z1 0.291 0.914
## z2 0.382 0.882
## z3 0.543 0.744
## z4 0.691 0.622
## z5 0.799 0.530
## z6 0.901 0.394
## z7 0.907 0.399
## z8 0.915 0.278
##
## ML1 ML2
## SS loadings 4.112 3.225
## Proportion Var 0.514 0.403
## Cumulative Var 0.514 0.917
The estimated communalities are obtained as:
f3$communality
## z1 z2 z3 z4 z5 z6 z7
## 0.9190082 0.9241862 0.8485663 0.8646685 0.9182821 0.9662060 0.9820430
## z8
## 0.9140073
The specific variances are obtained as:
1-f3$communality
## z1 z2 z3 z4 z5 z6
## 0.08099178 0.07581381 0.15143372 0.13533154 0.08171787 0.03379396
## z7 z8
## 0.01795698 0.08599274
The residual matrix is obtained as:
f3$residual-diag(f3$residual)
## z1 z2 z3 z4 z5 z6
## z1 0.00000000 -0.07528918 -0.07744923 -0.09409695 -0.09676523 -0.08299463
## z2 -0.07011122 0.00000000 -0.08897102 -0.08196194 -0.07321198 -0.07182454
## z3 -0.14789117 -0.16459093 0.00000000 -0.11944082 -0.14394549 -0.15485326
## z4 -0.14843672 -0.14147967 -0.10333864 0.00000000 -0.09874046 -0.13918121
## z5 -0.09749133 -0.07911604 -0.07422963 -0.04512679 0.00000000 -0.08138941
## z6 -0.03579682 -0.02980470 -0.03721350 -0.03764363 -0.03346551 0.00000000
## z7 -0.01339955 -0.01976209 -0.02009879 -0.02401797 -0.01894842 -0.01741195
## z8 -0.08573227 -0.08434142 -0.08445635 -0.08462657 -0.09811029 -0.08709447
## z7 z8
## z1 -0.07643435 -0.08073130
## z2 -0.07761892 -0.07416249
## z3 -0.15357553 -0.14989733
## z4 -0.14139253 -0.13396536
## z5 -0.08270931 -0.09383542
## z6 -0.03324893 -0.03489569
## z7 0.00000000 -0.01546208
## z8 -0.08349784 0.00000000
We find very similar results with the maximum likelihood method, which is somewhat surprising.
aa2=aa[-c(12,55),]
bb2=aa2[,-9]
X2=as.matrix(bb2)
Z2=scale(X2)
colnames(Z2)=c('z1','z2','z3','z4','z5','z6','z7','z8')
R2=cov(Z2)
R2
## z1 z2 z3 z4 z5 z6 z7
## z1 1.0000000 0.9181822 0.7221096 0.6805015 0.5791556 0.4132462 0.4410473
## z2 0.9181822 1.0000000 0.7897826 0.7688061 0.7142773 0.5745055 0.5793445
## z3 0.7221096 0.7897826 1.0000000 0.8292540 0.7661279 0.6513068 0.6693626
## z4 0.6805015 0.7688061 0.8292540 1.0000000 0.8441634 0.7561770 0.7612732
## z5 0.5791556 0.7142773 0.7661279 0.8441634 1.0000000 0.8749984 0.8836742
## z6 0.4132462 0.5745055 0.6513068 0.7561770 0.8749984 1.0000000 0.9560610
## z7 0.4410473 0.5793445 0.6693626 0.7612732 0.8836742 0.9560610 1.0000000
## z8 0.3060880 0.4508726 0.5708201 0.6973173 0.7957968 0.9005466 0.9191016
## z8
## z1 0.3060880
## z2 0.4508726
## z3 0.5708201
## z4 0.6973173
## z5 0.7957968
## z6 0.9005466
## z7 0.9191016
## z8 1.0000000
library('psych')
f4=fac(r=R2,nfactors=2,rotate='none',scores='Thurstone',residuals=TRUE,fm='pa')
The estimated loading matrix is shown below.
L4=f4$loadings
L4
##
## Loadings:
## PA1 PA2
## z1 0.710 0.615
## z2 0.827 0.498
## z3 0.838 0.230
## z4 0.900
## z5 0.931 -0.107
## z6 0.896 -0.365
## z7 0.910 -0.360
## z8 0.818 -0.456
##
## PA1 PA2
## SS loadings 5.866 1.171
## Proportion Var 0.733 0.146
## Cumulative Var 0.733 0.880
The estimated communalities are obtained as:
f4$communality
## z1 z2 z3 z4 z5 z6 z7
## 0.8818386 0.9319128 0.7560770 0.8182939 0.8776741 0.9357628 0.9584378
## z8
## 0.8766711
The specific variances are obtained as:
1-f4$communality
## z1 z2 z3 z4 z5 z6
## 0.11816143 0.06808721 0.24392298 0.18170611 0.12232590 0.06423717
## z7 z8
## 0.04156224 0.12332893
The residual matrix is obtained as:
f4$residual-diag(f4$residual)
## z1 z2 z3 z4 z5 z6
## z1 0.00000000 -0.09319257 -0.13285244 -0.13499519 -0.13373570 -0.11609090
## z2 -0.04311835 0.00000000 -0.08639428 -0.09077390 -0.07001953 -0.05226367
## z3 -0.25861399 -0.26223005 0.00000000 -0.19091701 -0.23343556 -0.25944202
## z4 -0.19853987 -0.20439280 -0.12870014 0.00000000 -0.16446101 -0.19627115
## z5 -0.13790017 -0.12425822 -0.11183847 -0.10508079 0.00000000 -0.12011029
## z6 -0.06216664 -0.04841363 -0.07975621 -0.07880220 -0.06202156 0.00000000
## z7 -0.02564867 -0.03593682 -0.05273161 -0.06488581 -0.04384899 -0.03247618
## z8 -0.11737739 -0.12154989 -0.13311245 -0.11798075 -0.13750046 -0.12185548
## z7 z8
## z1 -0.10224786 -0.11220989
## z2 -0.06246179 -0.06630817
## z3 -0.25509236 -0.25370650
## z4 -0.20502968 -0.17635793
## z5 -0.12461265 -0.13649742
## z6 -0.05515111 -0.06276371
## z7 0.00000000 -0.03108369
## z8 -0.11285039 0.00000000
f5=fac(r=R2,nfactors=2,rotate='varimax',scores='Thurstone',residuals=TRUE,fm='pa')
The estimated loading matrix is shown below.
L5=f5$loadings
L5
##
## Loadings:
## PA1 PA2
## z1 0.147 0.928
## z2 0.311 0.914
## z3 0.493 0.716
## z4 0.626 0.653
## z5 0.781 0.518
## z6 0.920 0.298
## z7 0.928 0.312
## z8 0.919 0.178
##
## PA1 PA2
## SS loadings 3.916 3.12
## Proportion Var 0.490 0.39
## Cumulative Var 0.490 0.88
The estimates of communalities and specific variances will remain the same. Also the percentage of total variation explained by the two factors and the residual matrix will remain the same.
Comparing with the previous case, where we used the complete dataset, we see that removing the two observations makes no change in the interpretation of the factors. However, the cumulative proportion of the total sample variance explained decreases from 0.918 to 0.88, and the residual matrix in the latter case no longer has all of its off-diagonal elements close to zero. So removing those observations actually results in a fall in the quality of the estimation.
Now we proceed to a profile analysis of the dataset.
We proceed without checking for equality of the variance-covariance matrices, since the second group does not have enough observations to estimate the required parameters. We use the function 'pbg' from the profileR package for the required tests.
library('profileR')
mod <- pbg(data=track_group[,1:8], group=track_group[,10], original.names=TRUE, profile.plot=TRUE)
print(mod)
##
## Data Summary:
## 0 1
## 100 m 0.1787667 0.1729250
## 200 m 0.3587667 0.3453458
## 400 m 0.7975778 0.7651292
## 800 m 1.8640000 1.7667500
## 1500 m 3.8866667 3.6275000
## 5000 m 14.9353333 13.4372500
## 10000 m 31.2793333 28.1302500
## Marathon 148.9366667 132.0067500
summary(mod)
## Call:
## pbg(data = track_group[, 1:8], group = track_group[, 10], original.names = TRUE,
## profile.plot = TRUE)
##
## Hypothesis Tests:
## $`Ho: Profiles are parallel`
## Multivariate.Test Statistic Approx.F num.df den.df p.value
## 1 Wilks 0.2159207 24.38179 7 47 1.223905e-13
## 2 Pillai 0.7840793 24.38179 7 47 1.223905e-13
## 3 Hotelling-Lawley 3.6313304 24.38179 7 47 1.223905e-13
## 4 Roy 3.6313304 24.38179 7 47 1.223905e-13
##
## $`Ho: Profiles have equal levels`
## Df Sum Sq Mean Sq F value Pr(>F)
## group 1 82.39 82.39 115.9 6.02e-15 ***
## Residuals 53 37.69 0.71
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $`Ho: Profiles are flat`
## F df1 df2 p-value
## 1 17695.82 7 47 3.881274e-78
From these results we see that the profiles are clearly not parallel, so the further tests are not required.
We therefore move on to discriminant analysis.
LDA gives:
track_group_1<-track_group[,-9]  # note: this reuses the name track_group_1 for the full grouped data without the Country column
linear<-lda(Running_ability~.,data=track_group_1)
linear
## Call:
## lda(Running_ability ~ ., data = track_group_1)
##
## Prior probabilities of groups:
## 0 1
## 0.2727273 0.7272727
##
## Group means:
## `100 m` `200 m` `400 m` `800 m` `1500 m` `5000 m` `10000 m`
## 0 0.1787667 0.3587667 0.7975778 1.86400 3.886667 14.93533 31.27933
## 1 0.1729250 0.3453458 0.7651292 1.76675 3.627500 13.43725 28.13025
## Marathon
## 0 148.9367
## 1 132.0068
##
## Coefficients of linear discriminants:
## LD1
## `100 m` 79.9904054
## `200 m` -68.3596062
## `400 m` 4.3383901
## `800 m` 3.1282279
## `1500 m` 0.1041853
## `5000 m` -4.0058080
## `10000 m` 1.4499209
## Marathon -0.1650255
The first discriminant function LD1 is a linear combination of the eight event variables, with the coefficients shown in the LD1 column above.
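As a sketch of what these coefficients do (ld1_scores is a hypothetical name; MASS's predict.lda additionally centres the data, so the scores agree only up to an additive constant):
ld1_scores<-as.matrix(track_group_1[,1:8])%*%linear$scaling
# each country's LD1 score is this linear combination of its eight event times,
# matching predict(linear,track_group_1)$x up to a constant shift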
Prediction and a histogram based on LD1 give:
p<- predict(linear, track_group_1)
ldahist(data = p$x, g= track_group_1$Running_ability)
The histograms of the LD1 scores for the two groups show no overlap except for one observation.
We check accuracy of the model as:
p1<-predict(linear,track_group_1)$class
tab<-table(Predicted=p1, Actual= track_group_1$Running_ability)
tab
## Actual
## Predicted 0 1
## 0 13 0
## 1 2 40
sum(diag(tab))/sum(tab)
## [1] 0.9636364
The accuracy is 96.36% on the training data (which here is the entire dataset).
But we do not know whether the variance-covariance matrices of the two groups are the same, and we have no way of testing this formally, since group '0' has far fewer observations than the number of parameters to be estimated. So instead of testing equality of the variance-covariance matrices, we compute the Euclidean distance between them to see whether it is small.
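The distance computed below is the Frobenius (element-wise Euclidean) norm of the difference of the two sample covariance matrices,
\[
\lVert S_0-S_1\rVert_F=\sqrt{\sum_{i,j}\left(s^{(0)}_{ij}-s^{(1)}_{ij}\right)^2}.
\]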
sqrt(sum((var(track_group_0[,-c(9,10)])-var(track_group_1[,-c(9,10)]))^2))
## [1] 6.482418
This distance is not small, so we also try quadratic discriminant analysis.
quadratic<-qda(Running_ability~.,data=track_group_1)
quadratic
## Call:
## qda(Running_ability ~ ., data = track_group_1)
##
## Prior probabilities of groups:
## 0 1
## 0.2727273 0.7272727
##
## Group means:
## `100 m` `200 m` `400 m` `800 m` `1500 m` `5000 m` `10000 m`
## 0 0.1787667 0.3587667 0.7975778 1.86400 3.886667 14.93533 31.27933
## 1 0.1729250 0.3453458 0.7651292 1.76675 3.627500 13.43725 28.13025
## Marathon
## 0 148.9367
## 1 132.0068
p_1<- predict(quadratic, track_group_1)
p1_1<-predict(quadratic,track_group_1)$class
tab_1<-table(Predicted=p1_1, Actual= track_group_1$Running_ability)
tab_1
## Actual
## Predicted 0 1
## 0 15 0
## 1 0 40
sum(diag(tab_1))/sum(tab_1)
## [1] 1
The accuracy is 100% on the training data (which here is the entire dataset). QDA seems to perform better than LDA here, which is what we expect when the covariance matrices of the groups are not equal.