We have data on men's track records from a sports competition. For each country we are given its record in eight events: 100 m, 200 m, 400 m, 800 m, 1500 m, 5000 m, 10000 m and Marathon; these are our variables. The records for the first three events are given in seconds, while the remaining events are recorded in minutes, so for ease of computation, analysis and interpretation we convert all variables to the same unit (minutes).
We first check the data for normality in several ways and look for possible outliers. We then remove those observations and look for transformations that might normalize the data. Based on these analyses we classify the countries into two groups and carry out a profile analysis. We also perform PCA and factor analysis, which yield some interesting results. Finally we attempt discriminant analysis, although the shortage of observations in one group restricts what can be checked there.
Loading data
track<-read.csv('track.csv')
dimnames(track)<-list(x=seq(1:nrow(track)),y=c("100 m","200 m","400 m","800 m","1500 m","5000 m","10000 m","Marathon","Country"))
head(track,10)
## 100 m 200 m 400 m 800 m 1500 m 5000 m 10000 m Marathon
## 1 0.1731667 0.3468333 0.7806667 1.81 3.70 14.04 29.36 137.72
## 2 0.1718333 0.3343333 0.7473333 1.74 3.57 13.28 27.66 128.30
## 3 0.1740000 0.3468333 0.7803333 1.79 3.60 13.26 27.72 135.90
## 4 0.1723333 0.3446667 0.7506667 1.73 3.60 13.22 27.45 129.95
## 5 0.1713333 0.3430000 0.7651667 1.80 3.75 14.68 30.55 146.62
## 6 0.1703333 0.3405000 0.7535000 1.73 3.66 13.62 28.62 133.13
## 7 0.1773333 0.3586667 0.8050000 1.80 3.85 14.45 30.28 139.95
## 8 0.1695000 0.3370000 0.7613333 1.76 3.63 13.55 28.09 130.15
## 9 0.1723333 0.3466667 0.7700000 1.79 3.71 13.61 29.30 134.03
## 10 0.1751667 0.3506667 0.7883333 1.81 3.73 13.90 29.13 133.53
## Country
## 1 argentin
## 2 australi
## 3 austria
## 4 belgium
## 5 bermuda
## 6 brazil
## 7 burma
## 8 canada
## 9 chile
## 10 china
First we plot the densities of the variables, to get an overall idea of the shapes of the densities and to judge whether they are marginally normally distributed.
We define
vec<-c("100 m","200 m","400 m","800 m","1500 m","5000 m","10000 m","Marathon")
and plot the densities:
par(mfrow=c(2,4))
for(i in 1:length(vec))
plot(density(track[,i]),main = paste(vec[i]),xlab = "Run time in minutes")
What we see is that all the densities have multiple modes, which suggests that the countries can be divided into groups: some countries are better at these events than others. The Q-Q plots give:
par(mfrow=c(2,4))
for(i in 1:length(vec))
{
qqnorm(y=track[,i],main= paste("QQ plot of",vec[i],sep = " "))
qqline(y=track[,i])
}
par(mfrow=c(1,1))
library(MASS)
## Warning: package 'MASS' was built under R version 3.5.3
par(mfrow=c(2,4))
for(i in 1:length(vec))
boxcox(track[,i]~1,lambda = seq(-6,6,by=0.1))
The Box-Cox plots show which values of \(\lambda\) the profile likelihood favours for each variable.
From the first plot, for the 100 m timings, we take \(\lambda\)=-4 and plot the density of the transformed variable; we do the same for the fourth variable (800 m).
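For reference, the transformation applied below is the usual Box-Cox family
\[
y^{(\lambda)}=\begin{cases}\dfrac{y^{\lambda}-1}{\lambda}, & \lambda\neq 0,\\[1ex] \log y, & \lambda=0,\end{cases}
\]
so with \(\lambda=-4\) we compute \((y^{-4}-1)/(-4)\).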
par(mfrow=c(1,2))
trans_y<-((track$`100 m`)^(-4)-1)/(-4)
plot(density(trans_y),main = "Transformed first variable, lambda=-4",xlab = "Run-time in minutes")
trans_y1<-((track$`800 m`)^(-4)-1)/(-4)
plot(density(trans_y1),main = "Transformed fourth variable, lambda=-4",xlab = "Run-time in minutes")
The transformed densities give enough evidence against transforming the variables; instead we should group the data accordingly.
For grouping we look for possible outliers in the data, put those into one group and the rest into another. We plot the Mahalanobis distances and use them to identify the outliers.
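For each country \(i\) the squared Mahalanobis distance is
\[
d_i^2=(x_i-\bar{x})'S^{-1}(x_i-\bar{x}),
\]
which under multivariate normality is approximately \(\chi^2_8\) distributed, so observations with \(d_i^2>\chi^2_{8,\,0.95}\) (the dashed line in the plot) are flagged as potential outliers.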
par(mfrow=c(1,1))
mahal_d<-mahalanobis(track[,-9],center=rowMeans(t(track[-9])),cov = var(track[-9]))
plot(mahal_d,main = "Mahalanobis Distance")
abline(h=qchisq(.95,8),lty=2)
The outliers are:
(index<-which(mahal_d>qchisq(.95,8)))
## 12 22 36 51 55
## 12 22 36 51 55
track[index,9]
## [1] cookis greece mauritiu thailand wsamoa
## 55 Levels: argentin australi austria belgium bermuda brazil ... wsamoa
Plotting the densities after removing the outliers gives:
par(mfrow=c(2,4))
for(i in 1:length(vec))
plot(density(track[-index,i]),main = paste(vec[i]),xlab = "Run time in minutes")
par(mfrow=c(1,1))
The Mahalanobis distances for the outlier-free data are:
par(mfrow=c(1,1))
mahal_d_1<-mahalanobis(track[-index,-9],center=rowMeans(t(track[-9])),cov = var(track[-9]))
plot(mahal_d_1,main = "Mahalanobis Distance")
abline(h=qchisq(.95,8),lty=2)
This shows the absence of further outliers.
The Q-Q plots for the outlier-free data are:
par(mfrow=c(2,4))
for(i in 1:length(vec))
{
qqnorm(y=track[-index,i],main= paste("QQ plot of",vec[i],sep = " "))
qqline(y=track[-index,i])
}
par(mfrow=c(1,1))
We perform the Shapiro-Wilk test for univariate normality and an energy test for multivariate normality on the outlier-free data.
library(MVN)
mvn(track[-index,-9],mvnTest="energy",univariateTest="SW",univariatePlot="histogram")
## $multivariateNormality
## Test Statistic p value MVN
## 1 E-statistic 1.930065 0 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk 100 m 0.9716 0.2678 YES
## 2 Shapiro-Wilk 200 m 0.9845 0.7517 YES
## 3 Shapiro-Wilk 400 m 0.9791 0.515 YES
## 4 Shapiro-Wilk 800 m 0.9405 0.0142 NO
## 5 Shapiro-Wilk 1500 m 0.9276 0.0045 NO
## 6 Shapiro-Wilk 5000 m 0.8446 <0.001 NO
## 7 Shapiro-Wilk 10000 m 0.8453 <0.001 NO
## 8 Shapiro-Wilk Marathon 0.7827 <0.001 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max
## 100 m 50 0.1737033 0.004069310 0.1734167 0.1655000 0.1830000
## 200 m 50 0.3474700 0.008815937 0.3468333 0.3286667 0.3656667
## 400 m 50 0.7700067 0.018319381 0.7676667 0.7310000 0.8110000
## 800 m 50 1.7820000 0.045355737 1.7800000 1.7000000 1.9000000
## 1500 m 50 3.6722000 0.115251101 3.6350000 3.5100000 4.0100000
## 5000 m 50 13.6732000 0.556132564 13.4750000 13.0100000 15.1100000
## 10000 m 50 28.6306000 1.242877520 28.0300000 27.3800000 31.4500000
## Marathon 50 135.0212000 7.314385223 132.1050000 128.2200000 157.7700000
## 25th 75th Skew Kurtosis
## 100 m 0.1710833 0.1759583 0.4076383 -0.18428149
## 200 m 0.3422500 0.3527917 0.1078462 -0.41757710
## 400 m 0.7565417 0.7805833 0.3033403 -0.58304617
## 800 m 1.7425000 1.8075000 0.6981109 0.04485093
## 1500 m 3.5925000 3.7475000 0.8214838 -0.03533428
## 5000 m 13.2625000 14.0125000 1.0578375 -0.11826838
## 10000 m 27.6750000 29.2825000 0.9217064 -0.48690964
## Marathon 130.5700000 137.3075000 1.5564469 1.54540140
This shows marginal normality for the first three variables, but the remaining five do not pass the test. It is also worth noting that the seventh variable (10000 m) has a rather flat distribution. So we look at the distributions of the outlier-free data and cut off the run times that are too long, because those countries possibly form a group that performs below average.
The cutpoints we use are:
1.85 min for 800m
3.80 min for 1500m
14.25 min for 5000m
29.5 min for 10000m
140 min for Marathon
These are solely based on visual observation.
time_points<-c(1.85,3.8,14.25,29.5,140)
for(i in 1:length(time_points))
print(which(track[,3+i]>time_points[i]))
## [1] 12 13 23 36 41 46 55
## [1] 7 12 13 16 26 36 41 42 46 51 55
## [1] 5 7 12 16 22 26 35 36 41 42 46 51 55
## [1] 5 7 12 16 23 26 33 35 36 41 42 46 50 51 55
## [1] 5 12 16 26 34 35 36 41 42 46 51 55
From this we take the indices that appear for almost all of the variables, remove the corresponding observations to obtain a new dataset, and then check that data for normality.
The common indices that we take are: 5, 7, 13, 16, 23, 26, 35, 41, 42, 46.
index_1<-c(5,7,13,16,23,26,35,41,42,46)
Altogether we treat the countries indexed by these numbers, together with the outlier indices found earlier, as outliers (countries that perform well below average) and put them into one group; the rest form the other group.
These countries are:
track[sort(c(index,index_1)),9]
## [1] bermuda burma cookis costa domrep greece guatemal
## [8] indonesi malaysia mauritiu png philippi singapor thailand
## [15] wsamoa
## 55 Levels: argentin australi austria belgium bermuda brazil ... wsamoa
Removing these observations, we check the data for normality again:
library(MVN)
mvn(track[-sort(c(index,index_1)),-9],mvnTest="energy",univariateTest="SW",univariatePlot="histogram")
## $multivariateNormality
## Test Statistic p value MVN
## 1 E-statistic 1.605336 0.001 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk 100 m 0.9877 0.9358 YES
## 2 Shapiro-Wilk 200 m 0.9872 0.9240 YES
## 3 Shapiro-Wilk 400 m 0.9820 0.7639 YES
## 4 Shapiro-Wilk 800 m 0.9574 0.1369 YES
## 5 Shapiro-Wilk 1500 m 0.9430 0.0438 NO
## 6 Shapiro-Wilk 5000 m 0.8993 0.0018 NO
## 7 Shapiro-Wilk 10000 m 0.8565 0.0001 NO
## 8 Shapiro-Wilk Marathon 0.8855 0.0007 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max
## 100 m 40 0.1729250 0.003438087 0.1730000 0.1655000 0.1818333
## 200 m 40 0.3453458 0.007711074 0.3461667 0.3286667 0.3656667
## 400 m 40 0.7651292 0.015362685 0.7644167 0.7310000 0.7966667
## 800 m 40 1.7667500 0.032375006 1.7600000 1.7000000 1.8500000
## 1500 m 40 3.6275000 0.072102136 3.6150000 3.5100000 3.7700000
## 5000 m 40 13.4372500 0.280439102 13.3400000 13.0100000 14.1300000
## 10000 m 40 28.1302500 0.717175684 27.8850000 27.3800000 30.0700000
## Marathon 40 132.0067500 3.046697571 131.4150000 128.2200000 141.2700000
## 25th 75th Skew Kurtosis
## 100 m 0.1707083 0.1752083 0.08510908 -0.16342089
## 200 m 0.3402917 0.3497500 0.01632403 0.07692336
## 400 m 0.7541250 0.7783750 0.14483432 -0.65038257
## 800 m 1.7400000 1.7900000 0.29832458 -0.46679776
## 1500 m 3.5700000 3.6700000 0.49336226 -0.79448600
## 5000 m 13.2275000 13.5725000 0.95319143 0.01292578
## 10000 m 27.6100000 28.5900000 1.01313076 -0.14051673
## Marathon 129.9425000 132.7175000 1.20772171 1.00336107
This tells us that after removing those observations the first four variables pass the Shapiro-Wilk test, with the 1500 m variable borderline (p ≈ 0.044). For the remaining three variables we try a Box-Cox transformation:
library(MASS)
par(mfrow=c(1,3))
for(i in 6:8)
boxcox(track[-sort(c(index,index_1)),i]~1,lambda = seq(-6,6,by=0.1))
Taking \(\lambda\)=-10 we get the following density plots for these variables:
par(mfrow=c(1,3))
for(i in 6:8)
plot(density((((track[-sort(c(index,index_1)),i])^(-10))-1)/(-10)),main=paste("Transformed",vec[i],sep=" "))
## Warning in plot.window(...): relative range of values = 0 * EPS, is small
## (axis 1)
For these transformations the data has too little variability; even if the transformed values did come from a normal distribution we would not be able to work with them. We scale the transformed variables (on a copy of the data, so that the original run times are kept for the later analyses) and check for normality:
track_trans<-track
for(i in 6:8)
track_trans[-sort(c(index,index_1)),i]<-scale((((track_trans[-sort(c(index,index_1)),i])^(-10))-1)/(-10))
mvn(track_trans[-sort(c(index,index_1)),-9],mvnTest="energy",univariateTest="SW",univariatePlot="histogram")
This shows that the data finally attains normality, but using this transformed data for further analysis would not carry any meaning. So we discard the transformed data and proceed with the original data with the outliers removed; the PCA, factor analysis, etc. below are done on this data.
We group the data as described before for the profile analysis and discriminant analysis. The code used for grouping is:
library(dplyr)
track_group<-track%>%mutate(Running_ability=ifelse(Country %in% track[sort(c(index,index_1)),9],"0","1"))
head(track_group,10)
## 100 m 200 m 400 m 800 m 1500 m 5000 m 10000 m Marathon
## 1 0.1731667 0.3468333 0.7806667 1.81 3.70 14.04 29.36 137.72
## 2 0.1718333 0.3343333 0.7473333 1.74 3.57 13.28 27.66 128.30
## 3 0.1740000 0.3468333 0.7803333 1.79 3.60 13.26 27.72 135.90
## 4 0.1723333 0.3446667 0.7506667 1.73 3.60 13.22 27.45 129.95
## 5 0.1713333 0.3430000 0.7651667 1.80 3.75 14.68 30.55 146.62
## 6 0.1703333 0.3405000 0.7535000 1.73 3.66 13.62 28.62 133.13
## 7 0.1773333 0.3586667 0.8050000 1.80 3.85 14.45 30.28 139.95
## 8 0.1695000 0.3370000 0.7613333 1.76 3.63 13.55 28.09 130.15
## 9 0.1723333 0.3466667 0.7700000 1.79 3.71 13.61 29.30 134.03
## 10 0.1751667 0.3506667 0.7883333 1.81 3.73 13.90 29.13 133.53
## Country Running_ability
## 1 argentin 1
## 2 australi 1
## 3 austria 1
## 4 belgium 1
## 5 bermuda 0
## 6 brazil 1
## 7 burma 0
## 8 canada 1
## 9 chile 1
## 10 china 1
track_group_1<-track_group[track_group$Running_ability=="1",]
track_group_0<-track_group[track_group$Running_ability=="0",]
We now look at the densities of the grouped data:
library(ggplot2)
track_group%>%ggplot(aes(track_group[,1],fill=Running_ability))+geom_density(alpha=0.3)+xlab("100 m")
track_group%>%ggplot(aes(track_group[,2],fill=Running_ability))+geom_density(alpha=0.3)+xlab("200 m")
track_group%>%ggplot(aes(track_group[,3],fill=Running_ability))+geom_density(alpha=0.3)+xlab("400 m")
track_group%>%ggplot(aes(track_group[,4],fill=Running_ability))+geom_density(alpha=0.3)+xlab("800 m")
track_group%>%ggplot(aes(track_group[,5],fill=Running_ability))+geom_density(alpha=0.3)+xlab("1500 m")
track_group%>%ggplot(aes(track_group[,6],fill=Running_ability))+geom_density(alpha=0.3)+xlab("5000 m")
track_group%>%ggplot(aes(track_group[,7],fill=Running_ability))+geom_density(alpha=0.3)+xlab("10000 m")
track_group%>%ggplot(aes(track_group[,8],fill=Running_ability))+geom_density(alpha=0.3)+xlab("Marathon")
The overlap between the two groups in some of these densities is an indication that we will see quite a number of misclassifications in the discriminant analysis.
Now we look at one- and two-sigma intervals around the mean and compare the empirical proportions with the values expected under normality, which are 0.683 and 0.954 for the one-sigma and two-sigma intervals respectively.
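These reference values come directly from the standard normal distribution and can be reproduced in R:
pnorm(1)-pnorm(-1)   # ~ 0.683
pnorm(2)-pnorm(-2)   # ~ 0.954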
Code used:
normality_check<-function(x)
{
rm<-rowMeans(t(x))   # column means of the data
rv<-apply(x,2,sd)    # column standard deviations
proportion<-matrix(numeric(2*dim(x)[2]),ncol = 2)
for (i in 1:dim(x)[2]) {
for(j in 1:2)
{
# proportion of observations within j standard deviations of the mean of variable i
proportion[i,j]<-mean(between(as.matrix(x[i]),as.matrix(rm[i]-j*rv[i]),
as.matrix(rm[i]+j*rv[i])))
}
}
return(proportion)
}
normality_check(track[,-9])
## [,1] [,2]
## [1,] 0.8000000 0.9636364
## [2,] 0.7272727 0.9636364
## [3,] 0.7454545 0.9818182
## [4,] 0.8545455 0.9636364
## [5,] 0.8363636 0.9636364
## [6,] 0.8000000 0.9636364
## [7,] 0.8545455 0.9636364
## [8,] 0.8181818 0.9454545
normality_check(track[-index,-9])
## [,1] [,2]
## [1,] 0.66 0.92
## [2,] 0.68 0.94
## [3,] 0.60 0.94
## [4,] 0.70 0.94
## [5,] 0.68 0.96
## [6,] 0.80 0.96
## [7,] 0.76 0.94
## [8,] 0.86 0.94
normality_check(track_group_1[,-c(9,10)])
## [,1] [,2]
## [1,] 0.650 0.950
## [2,] 0.700 0.925
## [3,] 0.650 0.950
## [4,] 0.625 0.950
## [5,] 0.600 1.000
## [6,] 0.775 0.925
## [7,] 0.750 0.950
## [8,] 0.750 0.950
normality_check(track_group_0[,-c(9,10)])
## [,1] [,2]
## [1,] 0.8666667 0.9333333
## [2,] 0.6666667 0.9333333
## [3,] 0.8666667 0.9333333
## [4,] 0.8000000 0.8666667
## [5,] 0.8000000 0.8666667
## [6,] 0.7333333 0.9333333
## [7,] 0.7333333 0.9333333
## [8,] 0.6000000 1.0000000
Q-Q plots for the grouped data (groups 1 and 0 respectively) are:
par(mfrow=c(2,4))
for(i in 1:8)
{
qqnorm(y=track_group_1[,i],main= paste("QQ plot of",vec[i],sep = " "))
qqline(y=track_group_1[,i])
}
for(i in 1:8)
{
qqnorm(y=track_group_0[,i],main= paste("QQ plot of",vec[i],sep = " "))
qqline(y=track_group_0[,i])
}
Bonferroni intervals for the variables, for the original data, the outlier-free data, and the two groups respectively, are computed below.
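The intervals computed by the function below are the standard Bonferroni simultaneous confidence intervals for the \(p=8\) means,
\[
\bar{x}_j \pm t_{n-1}\left(\frac{\alpha}{2p}\right)\frac{s_j}{\sqrt{n}},\qquad j=1,\dots,p,\quad \alpha=0.05,
\]
and the "actual" column reports the proportion of individual observations falling inside each interval.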
Bonferroni_interval<-function(x,confidence=0.95)
{
rm<-rowMeans(t(x))
rv<-apply(x,2,sd)
proportion<-matrix(numeric(2*dim(x)[2]),ncol = 2)
actual<-numeric(dim(x)[2])
for (i in 1:dim(x)[2]) {
proportion[i,]<-rm[i]+qt((1-confidence)/(2*dim(x)[2]),df=nrow(x)-1,lower.tail = FALSE)*rv[i]*c(-1,1)/sqrt(nrow(x))
actual[i]<-mean(between(x[,i],proportion[i,1],proportion[i,2]))
}
return(cbind(proportion,actual))
}
Bonferroni_interval(track[,-9])
## actual
## [1,] 0.1722707 0.1767657 0.4909091
## [2,] 0.3448834 0.3531288 0.3454545
## [3,] 0.7646608 0.7832968 0.3454545
## [4,] 1.7688358 1.8177096 0.3636364
## [5,] 3.6383568 3.7580068 0.2545455
## [6,] 13.5383997 14.1532366 0.2363636
## [7,] 28.2954346 29.6827473 0.2363636
## [8,] 133.0834355 140.1645645 0.2363636
Bonferroni_interval(track[-index,-9])
## actual
## [1,] 0.1720589 0.1753478 0.40
## [2,] 0.3439073 0.3510327 0.40
## [3,] 0.7626035 0.7774099 0.28
## [4,] 1.7636709 1.8003291 0.32
## [5,] 3.6256249 3.7187751 0.18
## [6,] 13.4484567 13.8979433 0.22
## [7,] 28.1283304 29.1328696 0.18
## [8,] 132.0653230 137.9770770 0.28
Bonferroni_interval(track_group_1[,-c(9,10)])
## actual
## [1,] 0.1713535 0.1744965 0.400
## [2,] 0.3418213 0.3488704 0.375
## [3,] 0.7581073 0.7721510 0.400
## [4,] 1.7519523 1.7815477 0.300
## [5,] 3.5945441 3.6604559 0.400
## [6,] 13.3090690 13.5654310 0.325
## [7,] 27.8024487 28.4580513 0.275
## [8,] 130.6141884 133.3993116 0.450
Bonferroni_interval(track_group_0[,-c(9,10)])
## actual
## [1,] 0.1716869 0.1858464 0.6666667
## [2,] 0.3489751 0.3685582 0.6000000
## [3,] 0.7741590 0.8209965 0.7333333
## [4,] 1.8032861 1.9247139 0.6666667
## [5,] 3.7504843 4.0228491 0.7333333
## [6,] 14.3440173 15.5266494 0.7333333
## [7,] 29.7393809 32.8192858 0.7333333
## [8,] 141.5529856 156.3203478 0.5333333
T-squared simultaneous intervals for the variables, for the original data, the outlier-free data, and the two groups respectively, are:
T_square_interval<-function(x,confidence=0.95)
{
rm<-rowMeans(t(x))
rv<-apply(x,2,sd)
proportion<-matrix(numeric(2*dim(x)[2]),ncol = 2)
actual<-numeric(dim(x)[2])
for (i in 1:dim(x)[2]) {
proportion[i,]<-rm[i]+qf(1-(confidence^(1/ncol(x))),df1 = ncol(x),df2 = nrow(x)-ncol(x),lower.tail = FALSE)*sqrt(ncol(x)*(nrow(x)-1))*rv[i]*c(-1,1)/sqrt(nrow(x)*(nrow(x)-ncol(x)))
actual[i]<-mean(between(x[,i],proportion[i,1],proportion[i,2]))
}
return(cbind(proportion,actual))
}
T_square_interval(track[,-9])
## actual
## [1,] 0.1670179 0.1820185 0.8727273
## [2,] 0.3352478 0.3627644 0.8000000
## [3,] 0.7428826 0.8050750 0.8909091
## [4,] 1.7117217 1.8748238 0.8727273
## [5,] 3.4985334 3.8978302 0.9272727
## [6,] 12.8198994 14.8717370 0.8909091
## [7,] 26.6742165 31.3039653 0.8727273
## [8,] 124.8084056 148.4395944 0.8545455
T_square_interval(track[-index,-9])
## actual
## [1,] 0.1680886 0.1793181 0.84
## [2,] 0.3353059 0.3596341 0.82
## [3,] 0.7447299 0.7952834 0.82
## [4,] 1.7194190 1.8445810 0.88
## [5,] 3.5131787 3.8312213 0.88
## [6,] 12.9058587 14.4405413 0.84
## [7,] 26.9157009 30.3454991 0.86
## [8,] 124.9289486 145.1134514 0.86
T_square_interval(track_group_1[,-c(9,10)])
## actual
## [1,] 0.1671875 0.1786625 0.925
## [2,] 0.3324775 0.3582142 0.925
## [3,] 0.7394917 0.7907666 0.925
## [4,] 1.7127222 1.8207778 0.950
## [5,] 3.5071751 3.7478249 0.925
## [6,] 12.9692499 13.9052501 0.900
## [7,] 26.9334187 29.3270813 0.925
## [8,] 126.9223848 137.0911152 0.900
T_square_interval(track_group_0[,-c(9,10)])
## actual
## [1,] 0.1083936 0.2491398 1
## [2,] 0.2614378 0.4560955 1
## [3,] 0.5647938 1.0303617 1
## [4,] 1.2604997 2.4675003 1
## [5,] 2.5330049 5.2403285 1
## [6,] 9.0576130 20.8130537 1
## [7,] 15.9721052 46.5865615 1
## [8,] 75.5423944 222.3309389 1
Since, after removing the odd observations, the first four variables were found to be (marginally) normally distributed, we draw 95% confidence ellipses for each pair of those variables (6 plots) for every version of the data frame considered. We do the same for the last four variables among themselves. We also look at the scatterplot matrices:
library(psych)
pairs.panels(track[,-c(9)],gap=0,pch=21)
pairs.panels(track_group[,-c(9,10)],gap=0,bg=c("red","green")[track_group$Running_ability],pch=21)
pairs.panels(track_group_1[,-c(9,10)],gap=0,pch=21)
pairs.panels(track_group_0[,-c(9,10)],gap=0,pch=21)
Here the first four variables are quite strongly linearly related to one another, and the last four are likewise nearly linearly related among themselves.
* For the first four variables:
library('car')
delp<-function(x){
for(i in 1:4)
{
for(j in i+1:4)
{
if(j<=4)
{
dataEllipse(x[,i],x[,j],levels = 0.95,xlab = paste(vec[i]),ylab = paste(vec[j]),xlim=c(min(x[,i])-0.05,max(x[,i])+0.05),ylim=c(min(x[,j])-0.05,max(x[,j])+0.05))
}
}
}
}
par(mfrow=c(2,3))
delp(track[,-9])
delp(track[-index,-9])
delp(track_group_1[,-c(9,10)])
delp(track_group_0[,-c(9,10)])
* For the last four variables:
delp<-function(x){
for(i in 5:8)
{
for(j in i+1:8)
{
if(j<=8)
{
dataEllipse(x[,i],x[,j],levels = 0.95,xlab = paste(vec[i]),ylab = paste(vec[j]),xlim=c(min(x[,i])-0.5,max(x[,i])+0.5),ylim=c(min(x[,j])-0.5,max(x[,j])+0.5))
}
}
}
}
par(mfrow=c(2,3))
delp(track[,-9])
delp(track[-index,-9])
delp(track_group_1[,-c(9,10)])
delp(track_group_0[,-c(9,10)])
aa=read.csv("track.csv")
bb=aa[,-9]
X=as.matrix(bb)
Z=scale(X)
colnames(Z)=c('z1','z2','z3','z4','z5','z6','z7','z8')
R=cov(Z)
R
## z1 z2 z3 z4 z5 z6 z7
## z1 1.0000000 0.9226384 0.8411468 0.7560278 0.7002382 0.6194618 0.6325389
## z2 0.9226384 1.0000000 0.8507270 0.8066265 0.7749513 0.6953770 0.6965391
## z3 0.8411468 0.8507270 1.0000000 0.8701714 0.8352694 0.7786139 0.7872046
## z4 0.7560278 0.8066265 0.8701714 1.0000000 0.9180442 0.8635939 0.8690489
## z5 0.7002382 0.7749513 0.8352694 0.9180442 1.0000000 0.9281140 0.9346970
## z6 0.6194618 0.6953770 0.7786139 0.8635939 0.9281140 1.0000000 0.9746354
## z7 0.6325389 0.6965391 0.7872046 0.8690489 0.9346970 0.9746354 1.0000000
## z8 0.5199490 0.5961837 0.7049905 0.8064764 0.8655492 0.9321884 0.9431763
## z8
## z1 0.5199490
## z2 0.5961837
## z3 0.7049905
## z4 0.8064764
## z5 0.8655492
## z6 0.9321884
## z7 0.9431763
## z8 1.0000000
ev=eigen(R)
ev$values
## [1] 6.62214614 0.87761829 0.15932114 0.12404938 0.07988027 0.06796515
## [7] 0.04641953 0.02260010
plot(ev$values,type='l',ylab="Eigenvalues",col='magenta')
The elbow in the scree plot occurs at the second largest eigenvalue, so we will consider only the first two sample PCs.
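As a quick cross-check (using the eigenvalues ev$values already computed, whose total equals the number of variables for a correlation matrix), the share of total variance carried by the first two components is:
sum(ev$values[1:2])/sum(ev$values)   # approximately 0.9375, matching the prcomp summary below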
prcomp.track=prcomp(x=Z,retx=T)
A=prcomp.track$rotation
A
## PC1 PC2 PC3 PC4 PC5 PC6
## z1 0.3175565 0.56687750 0.3322620 -0.12762824 0.2625555 -0.5937042
## z2 0.3369792 0.46162589 0.3606567 0.25911578 -0.1539571 0.6561367
## z3 0.3556454 0.24827331 -0.5604674 -0.65234081 -0.2183229 0.1566252
## z4 0.3686841 0.01242993 -0.5324823 0.47999891 0.5400528 -0.0146918
## z5 0.3728099 -0.13979665 -0.1534427 0.40451038 -0.4877151 -0.1578430
## z6 0.3643741 -0.31203045 0.1897643 -0.02958753 -0.2539792 -0.1412987
## z7 0.3667726 -0.30685985 0.1817517 -0.08006860 -0.1331764 -0.2190168
## z8 0.3419261 -0.43896266 0.2632087 -0.29951212 0.4979282 0.3152849
## PC7 PC8
## z1 0.136241272 -0.1055416946
## z2 -0.112639536 0.0960543400
## z3 -0.002853716 0.0001272086
## z4 -0.238016091 0.0381651160
## z5 0.610011481 -0.1392909902
## z6 -0.591298854 -0.5466969178
## z7 -0.176871008 0.7967952159
## z8 0.398822205 -0.1581638586
The first two sample PCs are given by y1=0.32z1 + 0.34z2 + 0.35z3 + 0.37z4 + 0.37z5 + 0.36z6 + 0.37z7 + 0.34z8 and
y2=0.57z1 + 0.46z2 + 0.25z3 + 0.01z4 - 0.14z5 - 0.31z6 - 0.31z7 - 0.44z8.
Now let us find the proportion of total variability of the data explained by these two PCs.
summary(prcomp.track)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5734 0.9368 0.39915 0.35221 0.28263 0.2607 0.2155
## Proportion of Variance 0.8278 0.1097 0.01992 0.01551 0.00999 0.0085 0.0058
## Cumulative Proportion 0.8278 0.9375 0.95739 0.97289 0.98288 0.9914 0.9972
## PC8
## Standard deviation 0.15033
## Proportion of Variance 0.00283
## Cumulative Proportion 1.00000
93.75% of the total variation in the data, which is a substantial amount, is explained by these two principal components.
Y=prcomp.track$x
r1=numeric(8)
r2=numeric(8)
for(i in 1:8)
{
r1[i]=cor(Z[,i],Y[,1])
r2[i]=cor(Z[,i],Y[,2])
}
B=data.frame(r1,r2)
rownames(B)=c('z1','z2','z3','z4','z5','z6','z7','z8')
colnames(B)=c('PC1','PC2')
B
## PC1 PC2
## z1 0.8171850 0.53105812
## z2 0.8671665 0.43245706
## z3 0.9152012 0.23258562
## z4 0.9487545 0.01164451
## z5 0.9593714 -0.13096330
## z6 0.9376633 -0.29231413
## z7 0.9438353 -0.28747024
## z8 0.8798965 -0.41122586
The near-uniformity of the weights in the first principal component is really a reflection of the considerable amount of structure in the original data. The coefficients of the zi's in y1 are nearly equal, so y1 gives almost the same weight to every event, which is also clear from the table above. Using the values of y1 we can therefore rank the nations by overall running performance. Such a data-driven ranking has considerable intrinsic interest, and the technique is particularly useful when subjective judgements of relative performance are not easy to come by.
The second PC can be interpreted as a measure of the relative strength of a nation at the various distances. A value near 0 on the second PC indicates that the nation achieves about the same level over both short and long distances, while extreme values at either end indicate an imbalance in achievement. Thus the second PC serves as a measure of differential achievement.
Y[,2]
## [1] -0.34487666 -0.21616901 0.48689073 0.26194657 -1.76693743
## [6] -0.64115405 0.25720348 -0.50034502 -0.20141106 0.35777693
## [11] 0.49997999 1.50876822 1.67064116 -0.02878301 0.38730438
## [16] -2.44900163 0.40882391 -0.50289374 -0.31066724 -0.41137495
## [21] -0.27890616 -0.60178374 1.27121076 -0.15152669 0.67647278
## [26] -0.60316736 0.94602435 0.67428204 -0.98985879 0.41357459
## [31] 0.53371105 -0.30152973 1.56472721 -0.27933562 -1.72272831
## [36] 0.66695377 0.84175219 0.70244240 0.92336255 1.05663679
## [41] 0.08554195 -0.18932031 -0.46259920 1.30473235 0.53076660
## [46] -1.78901618 0.50665093 0.02323023 0.19593360 0.04202950
## [51] -1.66978711 1.38303678 -1.11019109 -0.75696176 -1.90208197
summary(Y[,2])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.44900 -0.48147 0.04203 0.00000 0.60033 1.67064
Since PC2 serves as a measure of differential achievement, extreme negative values indicate relatively better performances at the shorter distances and extreme positive values indicate relatively better performances at the longer distances. Looking at the values of PC2 we can therefore classify the countries into three groups: one with relatively better performances at shorter distances, one with even performances over both shorter and longer distances, and one with relatively better performances at longer distances. This classification is shown below.
g1=which(Y[,2] < -1)
g2=which(Y[,2] >= -1 & Y[,2] <= 1)
g3=which(Y[,2] > 1)
The countries corresponding to group 1 are bermuda, domrep, malaysia, singapor, thailand, usa, wsamoa.
The countries corresponding to group 2 are argentin, australi, austria, belgium, brazil, burma, canada, chile, china, columbia, czech, denmark, finland, france, gdr, frg, gbni, greece, hungary, india, indonesi, ireland, israel, italy, japan, kenya, korea, luxembou, mauritiu, mexico, netherla, nz, png, philippi, poland, rumania, spain, sweden, switzerl, taipei, ussr.
The countries corresponding to group 3 are cookis, costa, guatemal, dprkorea, norway, portugal, turkey.
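These country names can be read off directly from the Country column of the data, for example:
aa[g1,9]
aa[g2,9]
aa[g3,9]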
par(mfrow=c(2,3))
plot(Y[,3],ylab='PC3',col='lightseagreen',ylim=c(-1,1))
plot(Y[,4],ylab='PC4',col='lightseagreen',ylim=c(-1,1))
plot(Y[,5],ylab='PC5',col='lightseagreen',ylim=c(-1,1))
plot(Y[,6],ylab='PC6',col='lightseagreen',ylim=c(-1,1))
plot(Y[,7],ylab='PC7',col='lightseagreen',ylim=c(-1,1))
plot(Y[,8],ylab='PC8',col='lightseagreen',ylim=c(-1,1))
For PC3 and PC4 if the absolute value for a sample observation is greater than 0.85 we suspect it as an outlier.
For PC5, PC6 and PC7 if the absolute value for a sample observation is greater than 0.75 we suspect it as an outlier.
For PC8 if the absolute value for a sample observation is greater than 0.60 we suspect it as an outlier.
which(abs(Y[,3])>0.85)
## [1] 36 42
which(abs(Y[,4])>0.85)
## [1] 55
which(abs(Y[,5])>0.75)
## [1] 7
which(abs(Y[,6])>0.75)
## [1] 12
which(abs(Y[,7])>0.75)
## [1] 22
which(abs(Y[,8])>0.60)
## [1] 22
aa[c(7,12,22,36,42,55),9]
## [1] burma cookis greece mauritiu philippi wsamoa
## 55 Levels: argentin australi austria belgium bermuda brazil ... wsamoa
The suspected observations are the 7th, 12th, 22nd, 36th, 42nd and 55th observations in the dataset. The corresponding countries are Burma, Cook Islands, Greece, Mauritius, Philippines and Western Samoa.
aa1=aa[-c(7,12,36,42,55),]
bb1=aa1[,-9]
X1=as.matrix(bb1)
Z1=scale(X1)
colnames(Z1)=c('z1','z2','z3','z4','z5','z6','z7','z8')
R1=cov(Z1)
R1
## z1 z2 z3 z4 z5 z6 z7
## z1 1.0000000 0.8967167 0.7352504 0.6425212 0.5311355 0.2822338 0.3198075
## z2 0.8967167 1.0000000 0.8062597 0.7508055 0.6853210 0.4776137 0.4861586
## z3 0.7352504 0.8062597 1.0000000 0.8446809 0.7587542 0.6406367 0.6569769
## z4 0.6425212 0.7508055 0.8446809 1.0000000 0.8529926 0.7406845 0.7423022
## z5 0.5311355 0.6853210 0.7587542 0.8529926 1.0000000 0.8644899 0.8756224
## z6 0.2822338 0.4776137 0.6406367 0.7406845 0.8644899 1.0000000 0.9482835
## z7 0.3198075 0.4861586 0.6569769 0.7423022 0.8756224 0.9482835 1.0000000
## z8 0.1677496 0.3434233 0.5582540 0.6658721 0.7843441 0.8886124 0.9091055
## z8
## z1 0.1677496
## z2 0.3434233
## z3 0.5582540
## z4 0.6658721
## z5 0.7843441
## z6 0.8886124
## z7 0.9091055
## z8 1.0000000
ev1=eigen(R1)
ev1$values
## [1] 5.77048282 1.53269755 0.21178746 0.16062546 0.11487878 0.08570121
## [7] 0.08121911 0.04260761
plot(ev1$values,type='l',ylab="Eigenvalues",col='thistle3')
The elbow in the scree plot again occurs at the second largest eigenvalue, so we will again consider only the first two sample PCs.
prcomp.track1=prcomp(x=Z1,retx=T)
A1=prcomp.track1$rotation
A1
## PC1 PC2 PC3 PC4 PC5 PC6
## z1 0.2719659 0.57546658 -0.314388819 0.173910120 -0.33750082 0.17239103
## z2 0.3283778 0.44740468 -0.344698977 0.008642004 0.07541809 -0.39344746
## z3 0.3669861 0.22830208 0.637396531 0.497858882 0.34663360 0.18278873
## z4 0.3843617 0.08740350 0.501252928 -0.628082021 -0.30210086 -0.27509050
## z5 0.3933679 -0.09184117 -0.220133144 -0.443845752 0.28212124 0.68145517
## z6 0.3648331 -0.33594328 -0.176143455 0.037359251 0.36673246 -0.47908120
## z7 0.3702270 -0.32276430 -0.212838065 0.176838970 0.09099747 -0.02349439
## z8 0.3329869 -0.42829495 0.005318866 0.312511292 -0.66829626 0.10927944
## PC7 PC8
## z1 0.52054738 -0.22625793
## z2 -0.60812939 0.20686922
## z3 -0.07081342 -0.02191945
## z4 0.15155052 0.09475171
## z5 -0.18280661 -0.11822623
## z6 0.21596083 -0.55764333
## z7 0.36251657 0.73616053
## z8 -0.34605111 -0.17227385
The first two sample PCs are given by y1=0.27z1 + 0.33z2 + 0.37z3 + 0.38z4 + 0.39z5 + 0.36z6 + 0.37z7 + 0.33z8 and
y2=0.57z1 + 0.45z2 + 0.23z3 + 0.09z4 - 0.09z5 - 0.33z6 - 0.32z7 - 0.43z8.
Now let us find the proportion of total variability of the data explained by these two PCs.
summary(prcomp.track1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.4022 1.2380 0.46020 0.40078 0.33894 0.29275
## Proportion of Variance 0.7213 0.1916 0.02647 0.02008 0.01436 0.01071
## Cumulative Proportion 0.7213 0.9129 0.93937 0.95945 0.97381 0.98452
## PC7 PC8
## Standard deviation 0.28499 0.20642
## Proportion of Variance 0.01015 0.00533
## Cumulative Proportion 0.99467 1.00000
91.29% of the total variation in the data is explained by these two principal components, which is still a substantial amount, although it has decreased compared to the full dataset.
Y1=prcomp.track1$x
r11=numeric(8)
r21=numeric(8)
for(i in 1:8)
{
r11[i]=cor(Z1[,i],Y1[,1])
r21[i]=cor(Z1[,i],Y1[,2])
}
B1=data.frame(r11,r21)
rownames(B1)=c('z1','z2','z3','z4','z5','z6','z7','z8')
colnames(B1)=c('PC1','PC2')
B1
## PC1 PC2
## z1 0.6533119 0.7124401
## z2 0.7888235 0.5538967
## z3 0.8815677 0.2826429
## z4 0.9233072 0.1082074
## z5 0.9449417 -0.1137014
## z6 0.8763960 -0.4159050
## z7 0.8893530 -0.3995892
## z8 0.7998955 -0.5302384
Assumptions: we fit an orthogonal factor model with two common factors to the correlation matrix; the standard assumptions of that model are stated below.
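For reference (this is the standard orthogonal factor model, not anything specific to this dataset), for the standardized variables \(Z\) the model with \(m=2\) common factors is
\[
Z=LF+\varepsilon,\qquad E(F)=0,\ \operatorname{Cov}(F)=I,\qquad E(\varepsilon)=0,\ \operatorname{Cov}(\varepsilon)=\Psi\ \text{(diagonal)},\qquad \operatorname{Cov}(F,\varepsilon)=0,
\]
which implies \(\operatorname{Cov}(Z)=LL'+\Psi\); since we work with standardized variables, \(\operatorname{Cov}(Z)\) is the correlation matrix \(R\) used below.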
library('psych')
f1=fac(r=R,nfactors=2,rotate='none',scores='Thurstone',residuals=TRUE,fm='pa')
The estimated loading matrix is shown below.
L1=f1$loadings
L1
##
## Loadings:
## PA1 PA2
## z1 0.812 0.511
## z2 0.861 0.417
## z3 0.901 0.211
## z4 0.937
## z5 0.955 -0.118
## z6 0.938 -0.285
## z7 0.947 -0.286
## z8 0.873 -0.378
##
## PA1 PA2
## SS loadings 6.540 0.800
## Proportion Var 0.818 0.100
## Cumulative Var 0.818 0.918
The estimated communalities are obtained as:
f1$communality
## z1 z2 z3 z4 z5 z6 z7
## 0.9200913 0.9149326 0.8558811 0.8784953 0.9249959 0.9621504 0.9786024
## z8
## 0.9048802
The specific variances are obtained as:
1-f1$communality
## z1 z2 z3 z4 z5 z6
## 0.07990870 0.08506742 0.14411890 0.12150473 0.07500406 0.03784962
## z7 z8
## 0.02139765 0.09511978
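As a quick check (using the loadings object L1 defined above), the communality of each variable in a two-factor orthogonal solution is the sum of its squared loadings, and the specific variance is one minus that; the line below should reproduce f1$communality up to rounding:
rowSums(L1[,1:2]^2)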
The residual matrix is obtained as:
f1$residual-diag(f1$residual)
## z1 z2 z3 z4 z5 z6
## z1 0.00000000 -0.06909616 -0.07785252 -0.09377579 -0.09422145 -0.07623959
## z2 -0.07425488 0.00000000 -0.09789737 -0.09286702 -0.08290860 -0.07878014
## z3 -0.14206272 -0.15694885 0.00000000 -0.12180296 -0.14366282 -0.15039006
## z4 -0.13537182 -0.12930434 -0.09918880 0.00000000 -0.09581518 -0.13214786
## z5 -0.08931681 -0.07284525 -0.07454798 -0.04931451 0.00000000 -0.07627891
## z6 -0.03418051 -0.03156233 -0.04412077 -0.04849274 -0.03912447 0.00000000
## z7 -0.01123598 -0.02101583 -0.02661870 -0.03456410 -0.02430590 -0.01710207
## z8 -0.09086196 -0.09340770 -0.09671101 -0.09999440 -0.10745583 -0.09007141
## z7 z8
## z1 -0.06974703 -0.07565088
## z2 -0.08468560 -0.08335534
## z3 -0.14933996 -0.14571013
## z4 -0.13467119 -0.12637935
## z5 -0.07791231 -0.08734011
## z6 -0.03355404 -0.03280124
## z7 0.00000000 -0.01305434
## z8 -0.08677648 0.00000000
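A small caveat about the display above: diag(f1$residual) is a vector, so the subtraction recycles it down the rows, shifting every entry in row i by the specific variance of variable i (which is why the printed matrix is not symmetric). To zero out only the diagonal one would instead use, for example,
f1$residual-diag(diag(f1$residual))
after which the off-diagonal residuals are small (around 0.01 in absolute value), consistent with the two-factor fit; the same remark applies to the residual matrices printed later.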
f2=fac(r=R,nfactors=2,rotate='varimax',scores='Thurstone',residuals=TRUE,fm='pa')
The estimated loading matrix is shown below.
L2=f2$loadings
L2
##
## Loadings:
## PA1 PA2
## z1 0.287 0.915
## z2 0.386 0.875
## z3 0.549 0.744
## z4 0.702 0.621
## z5 0.804 0.528
## z6 0.900 0.390
## z7 0.907 0.395
## z8 0.910 0.278
##
## PA1 PA2
## SS loadings 4.133 3.207
## Proportion Var 0.517 0.401
## Cumulative Var 0.517 0.918
The estimates of communalities and specific variances will remain the same. Also the percentage of total variation explained by the two factors and the residual matrix will remain the same.
Focusing our attention on the (iterated) principal axis method and the cumulative proportion of the total sample variance explained, which is 0.918, we see that a two-factor solution appears to be warranted.
The factor loadings change considerably after the varimax rotation.
Before rotation, all the variables load highly on the first factor and the estimated loadings are about equal. This factor can be interpreted as a general running factor.
The second factor contrasts the shorter distances with the longer distances. (The shorter distance performances have positive estimated loadings and the longer distance performances have negative estimated loadings on this factor.) Thus, this factor seems to differentiate the performances in different distances and might be called a differentiating factor. Also the absolute value of the estimated loadings of this factor increases as we move to extreme running distances in both the directions.
After rotation, we see that the longer distance performances load highly on the first factor rather than the shorter distance performances. All the estimated loadings are positive for this factor. In this case the first factor can be interpreted as a factor affecting the longer distance performances.
Similarly, the shorter distance performances load highly on the second factor rather than the longer distance performances. All the estimated loadings on this factor are also positive, so the second factor can be interpreted as a factor affecting the shorter distance performances.
The off-diagonal elements of the residual matrix are close to zero, which supports the decision to retain two factors.
s=factor.scores(x=Z,f=f1,method='Thurstone')
scores=s$scores
plot(scores[,1],scores[,2],xlab='Factor Scores of F1',ylab='Factor Scores of F2',col='palevioletred4')
par(mfrow=c(1,2))
plot(scores[,1],ylab='Factor Scores of F1',col='slateblue')
plot(scores[,2],ylab='Factor Scores of F2',col='slateblue')
From the scatter plot of the factor scores of F1 against the factor scores of F2 we see that two points are not consistent with the remaining observations: they lie at the extreme right-hand side of the diagram, i.e. their F1 scores are far higher than those of the others.
The individual factor-score plots then show that these two points differ from the rest only with respect to the F1 score. We can treat them as potential outliers.
which(scores[,1]>2)
## [1] 12 55
Hence we can treat the 12th and 55th observations in the dataset as potential outliers.
f3=fac(r=R,nfactors=2,rotate='varimax',residuals=TRUE,fm='ml')
The estimated loading matrix is shown below.
L3=f3$loadings
L3
##
## Loadings:
## ML1 ML2
## z1 0.291 0.914
## z2 0.382 0.882
## z3 0.543 0.744
## z4 0.691 0.622
## z5 0.799 0.530
## z6 0.901 0.394
## z7 0.907 0.399
## z8 0.915 0.278
##
## ML1 ML2
## SS loadings 4.112 3.225
## Proportion Var 0.514 0.403
## Cumulative Var 0.514 0.917
The estimated communalities are obtained as:
f3$communality
## z1 z2 z3 z4 z5 z6 z7
## 0.9190082 0.9241862 0.8485663 0.8646685 0.9182821 0.9662060 0.9820430
## z8
## 0.9140073
The specific variances are obtained as:
1-f3$communality
## z1 z2 z3 z4 z5 z6
## 0.08099178 0.07581381 0.15143372 0.13533154 0.08171787 0.03379396
## z7 z8
## 0.01795698 0.08599274
The residual matrix is obtained as:
f3$residual-diag(f3$residual)
## z1 z2 z3 z4 z5 z6
## z1 0.00000000 -0.07528918 -0.07744923 -0.09409695 -0.09676523 -0.08299463
## z2 -0.07011122 0.00000000 -0.08897102 -0.08196194 -0.07321198 -0.07182454
## z3 -0.14789117 -0.16459093 0.00000000 -0.11944082 -0.14394549 -0.15485326
## z4 -0.14843672 -0.14147967 -0.10333864 0.00000000 -0.09874046 -0.13918121
## z5 -0.09749133 -0.07911604 -0.07422963 -0.04512679 0.00000000 -0.08138941
## z6 -0.03579682 -0.02980470 -0.03721350 -0.03764363 -0.03346551 0.00000000
## z7 -0.01339955 -0.01976209 -0.02009879 -0.02401797 -0.01894842 -0.01741195
## z8 -0.08573227 -0.08434142 -0.08445635 -0.08462657 -0.09811029 -0.08709447
## z7 z8
## z1 -0.07643435 -0.08073130
## z2 -0.07761892 -0.07416249
## z3 -0.15357553 -0.14989733
## z4 -0.14139253 -0.13396536
## z5 -0.08270931 -0.09383542
## z6 -0.03324893 -0.03489569
## z7 0.00000000 -0.01546208
## z8 -0.08349784 0.00000000
We find very similar results with the maximum likelihood method, which is somewhat surprising.
aa2=aa[-c(12,55),]
bb2=aa2[,-9]
X2=as.matrix(bb2)
Z2=scale(X2)
colnames(Z2)=c('z1','z2','z3','z4','z5','z6','z7','z8')
R2=cov(Z2)
R2
## z1 z2 z3 z4 z5 z6 z7
## z1 1.0000000 0.9181822 0.7221096 0.6805015 0.5791556 0.4132462 0.4410473
## z2 0.9181822 1.0000000 0.7897826 0.7688061 0.7142773 0.5745055 0.5793445
## z3 0.7221096 0.7897826 1.0000000 0.8292540 0.7661279 0.6513068 0.6693626
## z4 0.6805015 0.7688061 0.8292540 1.0000000 0.8441634 0.7561770 0.7612732
## z5 0.5791556 0.7142773 0.7661279 0.8441634 1.0000000 0.8749984 0.8836742
## z6 0.4132462 0.5745055 0.6513068 0.7561770 0.8749984 1.0000000 0.9560610
## z7 0.4410473 0.5793445 0.6693626 0.7612732 0.8836742 0.9560610 1.0000000
## z8 0.3060880 0.4508726 0.5708201 0.6973173 0.7957968 0.9005466 0.9191016
## z8
## z1 0.3060880
## z2 0.4508726
## z3 0.5708201
## z4 0.6973173
## z5 0.7957968
## z6 0.9005466
## z7 0.9191016
## z8 1.0000000
library('psych')
f4=fac(r=R2,nfactors=2,rotate='none',scores='Thurstone',residuals=TRUE,fm='pa')
The estimated loading matrix is shown below.
L4=f4$loadings
L4
##
## Loadings:
## PA1 PA2
## z1 0.710 0.615
## z2 0.827 0.498
## z3 0.838 0.230
## z4 0.900
## z5 0.931 -0.107
## z6 0.896 -0.365
## z7 0.910 -0.360
## z8 0.818 -0.456
##
## PA1 PA2
## SS loadings 5.866 1.171
## Proportion Var 0.733 0.146
## Cumulative Var 0.733 0.880
The estimated communalities are obtained as:
f4$communality
## z1 z2 z3 z4 z5 z6 z7
## 0.8818386 0.9319128 0.7560770 0.8182939 0.8776741 0.9357628 0.9584378
## z8
## 0.8766711
The specific variances are obtained as:
1-f4$communality
## z1 z2 z3 z4 z5 z6
## 0.11816143 0.06808721 0.24392298 0.18170611 0.12232590 0.06423717
## z7 z8
## 0.04156224 0.12332893
The residual matrix is obtained as:
f4$residual-diag(f4$residual)
## z1 z2 z3 z4 z5 z6
## z1 0.00000000 -0.09319257 -0.13285244 -0.13499519 -0.13373570 -0.11609090
## z2 -0.04311835 0.00000000 -0.08639428 -0.09077390 -0.07001953 -0.05226367
## z3 -0.25861399 -0.26223005 0.00000000 -0.19091701 -0.23343556 -0.25944202
## z4 -0.19853987 -0.20439280 -0.12870014 0.00000000 -0.16446101 -0.19627115
## z5 -0.13790017 -0.12425822 -0.11183847 -0.10508079 0.00000000 -0.12011029
## z6 -0.06216664 -0.04841363 -0.07975621 -0.07880220 -0.06202156 0.00000000
## z7 -0.02564867 -0.03593682 -0.05273161 -0.06488581 -0.04384899 -0.03247618
## z8 -0.11737739 -0.12154989 -0.13311245 -0.11798075 -0.13750046 -0.12185548
## z7 z8
## z1 -0.10224786 -0.11220989
## z2 -0.06246179 -0.06630817
## z3 -0.25509236 -0.25370650
## z4 -0.20502968 -0.17635793
## z5 -0.12461265 -0.13649742
## z6 -0.05515111 -0.06276371
## z7 0.00000000 -0.03108369
## z8 -0.11285039 0.00000000
f5=fac(r=R2,nfactors=2,rotate='varimax',scores='Thurstone',residuals=TRUE,fm='pa')
The estimated loading matrix is shown below.
L5=f5$loadings
L5
##
## Loadings:
## PA1 PA2
## z1 0.147 0.928
## z2 0.311 0.914
## z3 0.493 0.716
## z4 0.626 0.653
## z5 0.781 0.518
## z6 0.920 0.298
## z7 0.928 0.312
## z8 0.919 0.178
##
## PA1 PA2
## SS loadings 3.916 3.12
## Proportion Var 0.490 0.39
## Cumulative Var 0.490 0.88
The estimates of communalities and specific variances will remain the same. Also the percentage of total variation explained by the two factors and the residual matrix will remain the same.
Comparing with the previous case, where we used the complete dataset, we see that removing the two observations makes no change in the interpretation of the factors. However, the cumulative proportion of the total sample variance explained decreases from 0.918 to 0.88, and the residual matrix in the latter case no longer has all of its off-diagonal elements close to zero. So removing those observations actually results in a fall in the quality of the estimation.
Now we proceed to a profile analysis of the dataset.
We proceed without checking for equality of the variance-covariance matrices, since the second group does not have enough observations to estimate the required parameters. We use the function 'pbg' from the profileR package for the required tests.
library('profileR')
mod <- pbg(data=track_group[,1:8], group=track_group[,10], original.names=TRUE, profile.plot=TRUE)
print(mod)
##
## Data Summary:
## 0 1
## 100 m 0.1787667 0.1729250
## 200 m 0.3587667 0.3453458
## 400 m 0.7975778 0.7651292
## 800 m 1.8640000 1.7667500
## 1500 m 3.8866667 3.6275000
## 5000 m 14.9353333 13.4372500
## 10000 m 31.2793333 28.1302500
## Marathon 148.9366667 132.0067500
summary(mod)
## Call:
## pbg(data = track_group[, 1:8], group = track_group[, 10], original.names = TRUE,
## profile.plot = TRUE)
##
## Hypothesis Tests:
## $`Ho: Profiles are parallel`
## Multivariate.Test Statistic Approx.F num.df den.df p.value
## 1 Wilks 0.2159207 24.38179 7 47 1.223905e-13
## 2 Pillai 0.7840793 24.38179 7 47 1.223905e-13
## 3 Hotelling-Lawley 3.6313304 24.38179 7 47 1.223905e-13
## 4 Roy 3.6313304 24.38179 7 47 1.223905e-13
##
## $`Ho: Profiles have equal levels`
## Df Sum Sq Mean Sq F value Pr(>F)
## group 1 82.39 82.39 115.9 6.02e-15 ***
## Residuals 53 37.69 0.71
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $`Ho: Profiles are flat`
## F df1 df2 p-value
## 1 17695.82 7 47 3.881274e-78
From these results we see that the profiles are clearly not parallel, so the further tests are not required.
We therefore move on to discriminant analysis.
LDA gives:
track_group_1<-track_group[,-9]  # note: this reuses the name track_group_1 for the full grouped data without the Country column
linear<-lda(Running_ability~.,data=track_group_1)
linear
## Call:
## lda(Running_ability ~ ., data = track_group_1)
##
## Prior probabilities of groups:
## 0 1
## 0.2727273 0.7272727
##
## Group means:
## `100 m` `200 m` `400 m` `800 m` `1500 m` `5000 m` `10000 m`
## 0 0.1787667 0.3587667 0.7975778 1.86400 3.886667 14.93533 31.27933
## 1 0.1729250 0.3453458 0.7651292 1.76675 3.627500 13.43725 28.13025
## Marathon
## 0 148.9367
## 1 132.0068
##
## Coefficients of linear discriminants:
## LD1
## `100 m` 79.9904054
## `200 m` -68.3596062
## `400 m` 4.3383901
## `800 m` 3.1282279
## `1500 m` 0.1041853
## `5000 m` -4.0058080
## `10000 m` 1.4499209
## Marathon -0.1650255
The first discriminant function LD1 is a linear combination of the eight event variables, with the coefficients shown in the LD1 column above.
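As a sketch of what these coefficients do (ld1_scores is a hypothetical name; MASS's predict.lda additionally centres the data, so the scores agree only up to an additive constant):
ld1_scores<-as.matrix(track_group_1[,1:8])%*%linear$scaling
# each country's LD1 score is this linear combination of its eight event times,
# matching predict(linear,track_group_1)$x up to a constant shift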
Prediction and a histogram based on LD1 give:
p<- predict(linear, track_group_1)
ldahist(data = p$x, g= track_group_1$Running_ability)
The histograms of the LD1 scores for the two groups show no overlap except for one observation.
We check accuracy of the model as:
p1<-predict(linear,track_group_1)$class
tab<-table(Predicted=p1, Actual= track_group_1$Running_ability)
tab
## Actual
## Predicted 0 1
## 0 13 0
## 1 2 40
sum(diag(tab))/sum(tab)
## [1] 0.9636364
The accuracy is 96.36% on the training data (which here is the entire dataset).
But we do not know whether the variance-covariance matrices of the two groups are the same, and we have no way of testing this formally, since group '0' has far fewer observations than the number of parameters to be estimated. So instead of testing equality of the variance-covariance matrices, we compute the Euclidean distance between them to see whether it is small.
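The distance computed below is the Frobenius (element-wise Euclidean) norm of the difference of the two sample covariance matrices,
\[
\lVert S_0-S_1\rVert_F=\sqrt{\sum_{i,j}\left(s^{(0)}_{ij}-s^{(1)}_{ij}\right)^2}.
\]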
sqrt(sum((var(track_group_0[,-c(9,10)])-var(track_group_1[,-c(9,10)]))^2))
## [1] 6.482418
This distance is not small, so we also try quadratic discriminant analysis.
quadratic<-qda(Running_ability~.,data=track_group_1)
quadratic
## Call:
## qda(Running_ability ~ ., data = track_group_1)
##
## Prior probabilities of groups:
## 0 1
## 0.2727273 0.7272727
##
## Group means:
## `100 m` `200 m` `400 m` `800 m` `1500 m` `5000 m` `10000 m`
## 0 0.1787667 0.3587667 0.7975778 1.86400 3.886667 14.93533 31.27933
## 1 0.1729250 0.3453458 0.7651292 1.76675 3.627500 13.43725 28.13025
## Marathon
## 0 148.9367
## 1 132.0068
p_1<- predict(quadratic, track_group_1)
p1_1<-predict(quadratic,track_group_1)$class
tab_1<-table(Predicted=p1_1, Actual= track_group_1$Running_ability)
tab_1
## Actual
## Predicted 0 1
## 0 15 0
## 1 0 40
sum(diag(tab_1))/sum(tab_1)
## [1] 1
The accuracy is 100% on the training data (which here is the entire dataset). QDA seems to perform better than LDA here, which is what we expect when the covariance matrices of the groups are not equal.