Dataset

Two files (T8-6.DAT and T1-9.DAT) contain information about the national track records for both men and women.

Dataset for Men

men <- read.table("E:/ASDS/2nd Semester/Multivariate Analysis/datasets/T8-6.DAT")
colnames(men) <- c("country", "100m (s)", "200m (s)", "400m (s)", "800m (min)",
                   "1500m (min)", "5000m (min)", "10,000m (min)", "Marathon (min)")
str(men)
## 'data.frame':    54 obs. of  9 variables:
##  $ country       : chr  "Argentina" "Australia" "Austria" "Belgium" ...
##  $ 100m (s)      : num  10.23 9.93 10.15 10.14 10.27 ...
##  $ 200m (s)      : num  20.4 20.1 20.4 20.2 20.3 ...
##  $ 400m (s)      : num  46.2 44.4 45.8 45 45.3 ...
##  $ 800m (min)    : num  1.77 1.74 1.77 1.73 1.79 1.7 1.75 1.76 1.77 1.8 ...
##  $ 1500m (min)   : num  3.68 3.53 3.58 3.57 3.7 3.57 3.53 3.65 3.61 3.72 ...
##  $ 5000m (min)   : num  13.3 12.9 13.3 12.8 14.6 ...
##  $ 10,000m (min) : num  27.6 27.5 27.7 26.9 30.5 ...
##  $ Marathon (min): num  130 128 132 127 146 ...

Dataset for Women

women <- read.table("E:/ASDS/2nd Semester/Multivariate Analysis/datasets/T1-9.dat")
colnames(women) <- c("country", "100m (s)", "200m (s)", "400m (s)", "800m (min)",
                     "1500m (min)", "3000m (min)", "Marathon (min)")
str(women)
## 'data.frame':    54 obs. of  8 variables:
##  $ country       : chr  "ARG" "AUS" "AUT" "BEL" ...
##  $ 100m (s)      : num  11.6 11.1 11.2 11.1 11.5 ...
##  $ 200m (s)      : num  22.9 22.2 22.7 22.5 23.1 ...
##  $ 400m (s)      : num  52.5 48.6 50.6 51.5 53.3 ...
##  $ 800m (min)    : num  2.05 1.98 1.94 1.97 2.07 1.97 1.97 2 1.93 2.04 ...
##  $ 1500m (min)   : num  4.25 4.02 4.05 4.08 4.29 4.17 4 4.22 3.84 4.34 ...
##  $ 3000m (min)   : num  9.19 8.63 8.78 8.82 9.81 9.04 8.54 9.26 8.1 9.37 ...
##  $ Marathon (min): num  150 144 154 143 174 ...


Data Preprocessing

Removing Irrelevant Data

The “country” column is removed from both datasets before computing the correlation matrix and its eigenvalues and eigenvectors, since only the numeric event variables enter that calculation.

library(dplyr)
#Drop the non-numeric country column from both datasets
numeric_men <- select(men, -country)
head(numeric_men)
numeric_women <- select(women, -country)
head(numeric_women)

Finding Outliers

The most straightforward way to identify outliers in a dataset is a boxplot.

Outliers in Men’s National Track Record

par(mfrow = c(2,4))
par(mar = c(4, 4, 2, 1))
par(mgp = c(3,1,0))
boxplot_list_men <- lapply(colnames(numeric_men), 
                           function(x) boxplot(numeric_men[[x]], main = x))

par(mfrow = c(1, 1))

Examining the boxplots shows that every variable in the men’s track records contains at least one outlier, with Marathon (min) showing the largest number of outliers.
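To put a number on this visual impression, one could count the boxplot outliers per event. A minimal sketch (note that boxplot.stats() applies the standard Tukey fences, not the median-based bounds used for removal later):

#Count boxplot outliers per event (Tukey fences: Q1/Q3 -/+ 1.5*IQR)
sapply(numeric_men, function(x) length(boxplot.stats(x)$out))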

Outliers in Women’s National Track Record

par(mfrow = c(2,4))
par(mar = c(4, 4, 2, 1))
par(mgp = c(3,1,0))
boxplot_list_women <- lapply(colnames(numeric_women),
                             function(x) boxplot(numeric_women[[x]], main = x))
par(mfrow = c(1, 1))

The women’s national track records also contain outliers. Before conducting principal component analysis, these outliers are removed from both datasets.

Removing Outliers

To remove the outliers, they are first replaced with NA values; the NA values are then replaced with the mean of the corresponding column. The bounds used below are median ± 1.5 × IQR, a variant of the conventional boxplot fences (Q1 − 1.5 × IQR, Q3 + 1.5 × IQR); an alternative using the conventional fences is sketched at the end of this subsection.

Removing Outliers from Men’s National Track Record

#Replace outliers with NA (bounds: median +/- 1.5 * IQR)
numeric_men_without_outlier <- numeric_men
for (i in colnames(numeric_men_without_outlier)) {
  med <- median(numeric_men_without_outlier[[i]], na.rm = TRUE)
  iqr <- IQR(numeric_men_without_outlier[[i]], na.rm = TRUE)
  lower_bound <- med - 1.5 * iqr
  upper_bound <- med + 1.5 * iqr
  out <- numeric_men_without_outlier[[i]] < lower_bound |
    numeric_men_without_outlier[[i]] > upper_bound
  numeric_men_without_outlier[[i]][out] <- NA
}
head(numeric_men_without_outlier)
#Replacing NA by mean
for (i in colnames(numeric_men_without_outlier)) {
  column_mean <- mean(numeric_men_without_outlier[[i]], na.rm = TRUE)
  numeric_men_without_outlier[[i]][is.na(numeric_men_without_outlier[[i]])] <- 
    column_mean
}
head(numeric_men_without_outlier)

Removing Outliers from Women’s National Track Record

#Replace outliers with NA (bounds: median +/- 1.5 * IQR)
numeric_women_without_outlier <- numeric_women
for (i in colnames(numeric_women_without_outlier)) {
  med <- median(numeric_women_without_outlier[[i]], na.rm = TRUE)
  iqr <- IQR(numeric_women_without_outlier[[i]], na.rm = TRUE)
  lower_bound <- med - 1.5 * iqr
  upper_bound <- med + 1.5 * iqr
  out <- numeric_women_without_outlier[[i]] < lower_bound |
    numeric_women_without_outlier[[i]] > upper_bound
  numeric_women_without_outlier[[i]][out] <- NA
}
head(numeric_women_without_outlier)
#Replacing NA by mean
for (i in colnames(numeric_women_without_outlier)) {
  column_mean <- mean(numeric_women_without_outlier[[i]], na.rm = TRUE)
  numeric_women_without_outlier[[i]][is.na(numeric_women_without_outlier[[i]])] <- 
    column_mean
}
head(numeric_women_without_outlier)
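As noted above, the code uses median ± 1.5 × IQR bounds. A minimal sketch of the same replace-then-impute step using the conventional Tukey fences instead (the helper name replace_outliers_tukey is ours, not from the original analysis; its results would differ slightly from the median-based version):

#Alternative sketch: conventional Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
replace_outliers_tukey <- function(df) {
  for (i in colnames(df)) {
    q <- quantile(df[[i]], c(0.25, 0.75), na.rm = TRUE)
    iqr <- q[2] - q[1]
    out <- df[[i]] < q[1] - 1.5 * iqr | df[[i]] > q[2] + 1.5 * iqr
    df[[i]][out] <- mean(df[[i]][!out], na.rm = TRUE)  #impute with mean of the rest
  }
  df
}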

Scaling the Datasets

The datasets contain values measured in different units (seconds and minutes), so the variables must be brought to a common scale before PCA. One way to do this is z-score standardization (often loosely called normalization): each variable is centred by subtracting its mean and divided by its standard deviation, so that every variable ends up with mean zero and standard deviation one. This is exactly what R’s scale() function does below.

scale_men <- scale(numeric_men_without_outlier)
head(scale_men)
##        100m (s)   200m (s)   400m (s) 800m (min) 1500m (min) 5000m (min)
## [1,]  0.3087310 -0.2890627  0.8545271  0.2863158   0.4897456  -0.2517226
## [2,] -2.3802163 -1.0920148 -1.4967832 -0.5249124  -0.9240484  -1.2627398
## [3,] -0.4083216 -0.0818493  0.3581394  0.2863158  -0.4527837  -0.4286506
## [4,] -0.4979532 -0.7552929 -0.6607617 -0.7953218  -0.5470366  -1.5154941
## [5,]  0.6672573 -0.4703745 -0.3472537  0.8271347   0.6782515   0.0000000
## [6,] -1.7527953 -1.5323433 -1.6143487 -1.6065500  -0.5470366   0.1274088
##      10,000m (min) Marathon (min)
## [1,]   -0.54699386    -0.07699177
## [2,]   -0.70444936    -0.88599899
## [3,]   -0.45514482     0.96372140
## [4,]   -1.57045463    -1.00774279
## [5,]    0.00000000     0.00000000
## [6,]    0.08282816    -1.45937304
scale_women <- scale(numeric_women_without_outlier)
head(scale_women)
##        100m (s)    200m (s)   400m (s) 800m (min) 1500m (min) 3000m (min)
## [1,]  0.8816068 -0.05889089  0.3115352  0.6797313   0.5836379   0.5921674
## [2,] -0.6802950 -1.16040392 -1.4276858 -0.4489388  -0.6888663  -0.6234255
## [3,] -0.5761682 -0.43123333 -0.5333577 -1.0938932  -0.5228875  -0.2978203
## [4,] -0.6108771 -0.77254722 -0.1603465 -0.6101774  -0.3569087  -0.2109922
## [5,]  0.4998086  0.11176605  0.6710641  1.0022085   0.8049429   1.9380024
## [6,] -0.5067504 -0.58637601 -0.5333577 -0.6101774   0.1410277   0.2665622
##      Marathon (min)
## [1,]      0.1054535
## [2,]     -0.7570163
## [3,]      0.6158430
## [4,]     -0.8152742
## [5,]      0.0000000
## [6,]     -0.2630909
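A quick sanity check (a sketch) confirms the standardization: after scale(), every column should have mean approximately 0 and standard deviation approximately 1.

#Column means should be ~0 and standard deviations ~1 after scale()
round(colMeans(scale_men), 10)
round(apply(scale_men, 2, sd), 10)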

Principal Component Analysis (PCA)

Principal component analysis (PCA) is a statistical method used to reduce the dimensionality of high-dimensional data by finding a smaller set of linearly uncorrelated variables, known as principal components, that capture the majority of the variance in the original data.

Assumptions of PCA

PCA rests on some assumptions that should be considered when using the method. These include:

  • Linearity: PCA assumes that the relationships between variables are linear. If the relationships are strongly non-linear, PCA may not be an appropriate method.

  • Independence: the observations (rows of the dataset) are assumed to be independent of one another. The variables themselves, by contrast, are expected to be correlated; PCA is only informative when they are.

  • Normality: the data are assumed to be approximately normally distributed. Strictly, PCA itself does not require normality, but several associated tests and interpretations do, so strong departures from normality are worth correcting.


Linearity Check of the Dataset

Linearity can be checked with a correlation plot (the corrplot package).

#Linearity Check of Men's National Track Record
library(corrplot)
corr_men <- cor(scale_men)
corrplot_men <- corrplot(corr_men)

#Linearity Check of Women's National Track Record
corr_women <- cor(scale_women)
corrplot_women <- corrplot(corr_women)

The correlation plots show strong linear associations among the variables of both the men’s and the women’s national track records.
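To put numbers on this, one could inspect the range of the pairwise correlations (a sketch):

#Range of off-diagonal correlations; values well above 0 indicate
#strong linear association between the events
range(corr_men[upper.tri(corr_men)])
range(corr_women[upper.tri(corr_women)])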

Independence Check of the Dataset

The chi-squared test of independence can be used to examine the association between variables in a dataset. Its output reports the test statistic, degrees of freedom, and p-value. If the p-value is greater than the significance level (typically 0.05), the null hypothesis of independence is not rejected, meaning there is no evidence of a significant association between the variables. Note that the chi-squared test is designed for count data in a contingency table, so applying it to continuous measurements, as done here, should be interpreted with caution.

#Independence check of Men's National Track Record
chisq.test(numeric_men_without_outlier)
## 
##  Pearson's Chi-squared test
## 
## data:  numeric_men_without_outlier
## X-squared = 2.0765, df = 371, p-value = 1
#Independence check of Women's National Track Record
chisq.test(numeric_women_without_outlier)
## 
##  Pearson's Chi-squared test
## 
## data:  numeric_women_without_outlier
## X-squared = 8.0524, df = 318, p-value = 1

In both datasets the p-value exceeds the significance level (usually 0.05), so the null hypothesis of independence is not rejected. On this basis the analysis proceeds as if the variables were independent, although this sits uneasily with the strong correlations seen in the correlation plots above.

Normality Check of the Dataset

Normality of a dataset can be assessed with visual methods such as the Q-Q plot and with statistical tests such as the Shapiro-Wilk test. These tests assess whether the data follow a normal distribution, based on the sample statistics and the theoretical properties of the normal distribution. If the p-value of the test is greater than the significance level (usually 0.05), the null hypothesis of normality is not rejected and the data can be treated as approximately normally distributed.

#Shapiro-Wilk test of Men's National Track Record
matrix_men <- as.matrix(scale_men)
shapiro.test(matrix_men)
## 
##  Shapiro-Wilk normality test
## 
## data:  matrix_men
## W = 0.98135, p-value = 2.343e-05

The p-value for the Men’s National Track Record dataset is extremely low at 2.343e-05, well below the significance level of 0.05. As a result, the null hypothesis is rejected, indicating that the dataset is not normally distributed. Therefore, it is necessary to apply a normal transformation such as logarithmic, square root, or inverse transformations to the dataset.Box-Cox transformation is a robust method of transformation that can convert non-normally distributed data into normally distributed data using statistical techniques.

#Box-Cox transformation of Men's National Track Record
library(MASS)  #note: MASS masks dplyr::select from this point on
#Shift the scaled data so all values are positive, as Box-Cox requires
positive_men <- abs(min(scale_men)) + 1
scale_men_transformed <- scale_men + positive_men
#Profile the Box-Cox log-likelihood and pick the lambda that maximizes it
boxcox_men <- boxcox(lm(scale_men_transformed ~ 1))

lambda_men <- boxcox_men$x[which.max(boxcox_men$y)]
bcx_men <- (scale_men_transformed ^ lambda_men - 1) / lambda_men
shapiro.test(bcx_men)
## 
##  Shapiro-Wilk normality test
## 
## data:  bcx_men
## W = 0.99085, p-value = 0.008905

Following the Box-Cox transformation, the p-value has increased to 0.008905. This is still below the 0.05 threshold, so strict normality is still rejected, but the increase shows that the dataset has moved considerably closer to a normal distribution. Note that the formula applied above is the λ ≠ 0 branch of the Box-Cox transform; a helper covering both branches is sketched below.
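A minimal sketch of the full transform (the helper name box_cox is ours): for λ = 0 the transform falls back to log(y), which is the limit of (y^λ − 1)/λ.

#Box-Cox transform for positive y; lambda = 0 falls back to log(y)
box_cox <- function(y, lambda, eps = 1e-8) {
  stopifnot(all(y > 0))
  if (abs(lambda) < eps) log(y) else (y^lambda - 1) / lambda
}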

#Shapiro-Wilk test of Women's National Track Record
matrix_women <- as.matrix(scale_women)
shapiro.test(matrix_women)
## 
##  Shapiro-Wilk normality test
## 
## data:  matrix_women
## W = 0.97672, p-value = 9.006e-06

The Shapiro-Wilk test on the Women’s National Track Record dataset yields a p-value of 9.006e-06, far below the 0.05 significance level, so this dataset also needs to be transformed. The Box-Cox transformation is used again.

#Box-Cox transformation of Women's National Track Record
#Shift to positive values, profile the log-likelihood, apply the transform
positive_women <- abs(min(scale_women)) + 1
scale_women_transformed <- scale_women + positive_women
boxcox_women <- boxcox(lm(scale_women_transformed ~ 1))

lambda_women <- boxcox_women$x[which.max(boxcox_women$y)]
bcx_women <- (scale_women_transformed ^ lambda_women - 1) / lambda_women
shapiro.test(bcx_women)
## 
##  Shapiro-Wilk normality test
## 
## data:  bcx_women
## W = 0.9909, p-value = 0.0197

After the Box-Cox transformation, the Shapiro-Wilk test on the Women’s National Track Record dataset yields a p-value of 0.0197. This is close to, but still below, the 0.05 significance level, so strict normality is still rejected; the transformation has, however, brought the data much closer to a normal distribution.


PCA of Men’s National Track Record

library(factoextra)
cor_men <- cor(bcx_men)
pca_men <- princomp(cor_men)
summary(pca_men)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     0.4658678 0.2067061 0.17243295 0.13316873 0.09399884
## Proportion of Variance 0.6799581 0.1338639 0.09315309 0.05555987 0.02768227
## Cumulative Proportion  0.6799581 0.8138220 0.90697510 0.96253497 0.99021725
##                             Comp.6      Comp.7       Comp.8
## Standard deviation     0.050014796 0.024920562 1.210219e-09
## Proportion of Variance 0.007837071 0.001945684 4.588648e-18
## Cumulative Proportion  0.998054316 1.000000000 1.000000e+00

The table above displays the standard deviation, proportion of variance, and cumulative proportion of each principal component. The first principal component has a standard deviation of 0.4658678 and a proportion of variance of 0.6799581, meaning it accounts for 67.996% of the variation in the dataset.
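One caveat: princomp() is applied here to the 8 × 8 correlation matrix itself, so the “observations” being decomposed are the rows of that matrix rather than the 54 countries. The more conventional call runs PCA on the data and requests the correlation matrix internally; a sketch of that variant (its numerical output would differ from the tables shown here, and the same remark applies to the women’s analysis below):

#Conventional alternative (sketch): PCA on the data, correlation-based
pca_men_alt <- princomp(bcx_men, cor = TRUE)
summary(pca_men_alt)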

pca_men$loadings
## 
## Loadings:
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## 100m (s)        0.454  0.248  0.419         0.140  0.341         0.649
## 200m (s)        0.255  0.303         0.473 -0.775               -0.132
## 400m (s)        0.203        -0.685 -0.558 -0.280  0.127         0.285
## 800m (min)     -0.256  0.481 -0.409  0.410  0.287 -0.366         0.386
## 1500m (min)    -0.409  0.232 -0.160  0.119         0.829  0.138 -0.156
## 5000m (min)    -0.466  0.178  0.232 -0.197 -0.248        -0.751  0.179
## 10,000m (min)  -0.464         0.284 -0.248 -0.361 -0.151  0.636  0.292
## Marathon (min) -0.149 -0.727 -0.165  0.427 -0.136  0.150         0.435
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.125  0.125  0.125  0.125  0.125  0.125  0.125  0.125
## Cumulative Var  0.125  0.250  0.375  0.500  0.625  0.750  0.875  1.000

The table above displays the loadings of each variable on each principal component, which indicate how strongly each variable contributes to each component. A higher absolute loading means the variable is more important in determining that component. According to the table, the variable “5000m (min)” has the largest absolute loading on Comp.1, at 0.466, so it carries the greatest weight in the first principal component.
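This reading of the table can be confirmed programmatically (a sketch):

#Variable with the largest absolute loading on Comp.1
which.max(abs(pca_men$loadings[, 1]))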


fviz_eig(pca_men, addlabels = TRUE)

The plot above is a scree plot, which shows the share of variance explained by each principal component. The first principal component accounts for 68% of the total variance and the second for 13.4%.


fviz_cos2(pca_men, choice = "var", axes = 1:2)

The plot displays each variable’s cos2 value: the squared cosine of the angle between the variable and the principal component axes, which measures how well the variable is represented by the variation captured by those components. Higher cos2 values indicate greater importance of a variable in explaining the variation in the data. The plot shows that “5000m (min)” has the highest cos2 value, so it contributes the most to the first two principal components of the Men’s National Track Record dataset.
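A complementary view (a sketch using factoextra’s variable map) colors each variable by its cos2:

#Variable map colored by cos2; variables far from the origin
#contribute most to the first two components
fviz_pca_var(pca_men, col.var = "cos2", repel = TRUE)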

PCA of Women’s National Track Record

cor_women <- cor(bcx_women)
pca_women <- princomp(cor_women)
summary(pca_women)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     0.3666425 0.1766601 0.09762406 0.08050947 0.04592459
## Proportion of Variance 0.7297086 0.1694107 0.05173417 0.03518501 0.01144865
## Cumulative Proportion  0.7297086 0.8991193 0.95085346 0.98603847 0.99748712
##                             Comp.6 Comp.7
## Standard deviation     0.021515637      0
## Proportion of Variance 0.002512882      0
## Cumulative Proportion  1.000000000      1

The table above shows the standard deviation and proportion of variance for each principal component. The first component, Comp.1, has a proportion of variance of 0.7297086, meaning it accounts for 72.971% of the variance in the Women’s National Track Record dataset.

pca_women$loadings
## 
## Loadings:
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 100m (s)        0.495         0.542         0.306  0.283  0.526
## 200m (s)        0.301  0.436         0.510 -0.665 -0.108       
## 400m (s)        0.376  0.169 -0.639 -0.544 -0.120  0.200  0.266
## 800m (min)             0.545 -0.369  0.377  0.639              
## 1500m (min)    -0.293  0.502  0.224 -0.201 -0.110  0.670 -0.333
## 3000m (min)    -0.389  0.422  0.256 -0.418 -0.100 -0.521  0.387
## Marathon (min) -0.528 -0.211 -0.214  0.293 -0.140  0.372  0.622
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.143  0.143  0.143  0.143  0.143  0.143  0.143
## Cumulative Var  0.143  0.286  0.429  0.571  0.714  0.857  1.000

The table above presents the loadings of the variables on the principal components; higher absolute loadings indicate greater importance of a variable in determining a component. The variable “Marathon (min)” has the largest absolute loading on Comp.1, at 0.528, so it carries the greatest weight in the first principal component of the Women’s National Track Record dataset.

fviz_eig(pca_women, addlabels = TRUE)

The Scree plot above indicates that the first principal component accounts for 73% of the variation in the dataset, while the second principal component accounts for 16.9% of the variation.

fviz_cos2(pca_women, choice = "var", axes = 1:2)

The plot above shows that “Marathon (min)” has the highest cos2 value, so it contributes the most to the variation captured by the first two principal components of the Women’s National Track Record dataset.


Ranking of Men’s National Track Record

The variable “5000m (min)” carries the largest weight in the first principal component of the Men’s National Track Record dataset. The table below shows the top ten countries ranked by this variable:

rank_men <- arrange(men, `5000m (min)`)
head(rank_men, 10)

Ranking of Women’s National Track Record

The variable “Marathon (min)” carries the largest weight in the first principal component of the Women’s National Track Record dataset. The table below shows the top ten countries ranked by this variable:

rank_women <- arrange(women, `Marathon (min)`)
head(rank_women, 10)

Q-Q Plot of First Principal Component

A Q-Q plot is a scatter plot used to assess the distribution of a dataset: the x-axis shows the theoretical quantiles and the y-axis the sample quantiles. If the sample follows the theoretical (here, normal) distribution, the points fall along a straight reference line known as the Q-Q line. Systematic deviations from this line indicate departures from normality: points curving above the line in the upper tail (or below it in the lower tail) suggest heavier tails than the normal distribution, the opposite pattern suggests lighter tails, and an S-shaped curve indicates skewness.

#Q-Q plot of first principal component of Men's National Track Record
qqplot_men <- qqnorm(bcx_men[, 6], col = "blue",
                     main = "Q-Q Plot of Men's First PC 5000m (min)")
qqline_men <- qqline(bcx_men[, 6], col = "black")

The plot above shows the distribution of the Men’s National Track Record data in the 5000m (min) category. The points are mostly aligned along the Q-Q line, with a slight upward deviation in some points, indicating a somewhat heavier tail than the theoretical normal distribution. Overall, however, the shape of the data is close to normal.

#Q-Q plot of first principal component of Women's National Track Record
qqplot_women <- qqnorm(bcx_women[, 7], col = "black",
                       main = "Q-Q Plot of Women's First PC (Marathon (min))")
qqline_women <- qqline(bcx_women[, 7], col = "navy")

The plot above is the Q-Q plot for the Women’s National Track Record in the Marathon (min) category. The points align closely with the Q-Q line, with only minor deviations, suggesting that the data are consistent with a normal distribution.
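These visual impressions could be complemented numerically (a sketch; output not shown): a Shapiro-Wilk p-value above 0.05 for a single column would support treating that variable as normally distributed.

#Numerical complement to the Q-Q plots (output not shown)
shapiro.test(bcx_men[, 6])   #men's 5000m (min)
shapiro.test(bcx_women[, 7]) #women's Marathon (min)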