Md. Makfidunnabi | Roll: 20228025
Two files (T8-6.DAT and T1-9.DAT) contain information about the national track records for both men and women.
men <- read.table("E:/ASDS/2nd Semester/Multivariate Analysis/datasets/T8-6.DAT")
colnames(men) <- c("country", "100m (s)", "200m (s)", "400m (s)", "800m (min)",
"1500m (min)", "5000m (min)", "10,000m (min)", "Marathon (min)")
str(men)
## 'data.frame': 54 obs. of 9 variables:
## $ country : chr "Argentina" "Australia" "Austria" "Belgium" ...
## $ 100m (s) : num 10.23 9.93 10.15 10.14 10.27 ...
## $ 200m (s) : num 20.4 20.1 20.4 20.2 20.3 ...
## $ 400m (s) : num 46.2 44.4 45.8 45 45.3 ...
## $ 800m (min) : num 1.77 1.74 1.77 1.73 1.79 1.7 1.75 1.76 1.77 1.8 ...
## $ 1500m (min) : num 3.68 3.53 3.58 3.57 3.7 3.57 3.53 3.65 3.61 3.72 ...
## $ 5000m (min) : num 13.3 12.9 13.3 12.8 14.6 ...
## $ 10,000m (min) : num 27.6 27.5 27.7 26.9 30.5 ...
## $ Marathon (min): num 130 128 132 127 146 ...
women <- read.table("E:/ASDS/2nd Semester/Multivariate Analysis/datasets/T1-9.dat")
colnames(women) <- c("country", "100m (s)", "200m (s)", "400m (s)", "800m (min)",
"1500m (min)", "3000m (min)", "Marathon (min)")
str(women)
## 'data.frame': 54 obs. of 8 variables:
## $ country : chr "ARG" "AUS" "AUT" "BEL" ...
## $ 100m (s) : num 11.6 11.1 11.2 11.1 11.5 ...
## $ 200m (s) : num 22.9 22.2 22.7 22.5 23.1 ...
## $ 400m (s) : num 52.5 48.6 50.6 51.5 53.3 ...
## $ 800m (min) : num 2.05 1.98 1.94 1.97 2.07 1.97 1.97 2 1.93 2.04 ...
## $ 1500m (min) : num 4.25 4.02 4.05 4.08 4.29 4.17 4 4.22 3.84 4.34 ...
## $ 3000m (min) : num 9.19 8.63 8.78 8.82 9.81 9.04 8.54 9.26 8.1 9.37 ...
## $ Marathon (min): num 150 144 154 143 174 ...
The data sets used to calculate the correlation matrix and its corresponding eigenvalues and eigenvectors no longer include the column for “Country”.
library(dplyr)
numeric_men <- select(men, -country)
head(numeric_men)
numeric_women <- select(women, -country)
head(numeric_women)
The most straightforward method for identifying any outlier in a dataset is to use a boxplot.
par(mfrow = c(2,4))
par(mar = c(4, 4, 2, 1))
par(mgp = c(3,1,0))
boxplot_list_men <- lapply(colnames(numeric_men),
function(x) boxplot(numeric_men[[x]], main = x))
par(mfrow = c(1, 1))
By examining the boxplot, it is apparent that each variable within the men’s track record exhibits an outlier. Marathon (min) displays the highest number of outliers.
par(mfrow = c(2,4))
par(mar = c(4, 4, 2, 1))
par(mgp = c(3,1,0))
boxplot_list_women <- lapply(colnames(numeric_women),
function(x) boxplot(numeric_women[[x]], main = x))
par(mfrow = c(1, 1))
It has been noted that the dataset containing the women’s national track
record also contains outliers. In order to conduct Principal Component
Analysis, it is necessary to eliminate these outliers from both
datasets.
First, to eliminate outliers, they are substituted with NA values. Next, the NA values are substituted with the mean of the corresponding column.
#Replacing outliers by NA
numeric_men_without_outlier <- numeric_men
for (i in colnames(numeric_men_without_outlier)) {
med <- median(numeric_men_without_outlier[[i]], na.rm = TRUE)
iqr <- IQR(numeric_men_without_outlier[[i]], na.rm = TRUE)
lower_bound <- med - 1.5 * iqr
upper_bound <- med + 1.5 * iqr
numeric_men_without_outlier[[i]][numeric_men_without_outlier[[i]] <
lower_bound |
numeric_men_without_outlier[[i]] >
upper_bound] <- NA
}
head(numeric_men_without_outlier)
#Replacing NA by mean
for (i in colnames(numeric_men_without_outlier)) {
column_mean <- mean(numeric_men_without_outlier[[i]], na.rm = TRUE)
numeric_men_without_outlier[[i]][is.na(numeric_men_without_outlier[[i]])] <-
column_mean
}
head(numeric_men_without_outlier)
#Replacing outliers by NA
numeric_women_without_outlier <- numeric_women
for (i in colnames(numeric_women_without_outlier)) {
med <- median(numeric_women_without_outlier[[i]], na.rm = TRUE)
iqr <- IQR(numeric_women_without_outlier[[i]], na.rm = TRUE)
lower_bound <- med - 1.5 * iqr
upper_bound <- med + 1.5 * iqr
numeric_women_without_outlier[[i]][numeric_women_without_outlier[[i]] <
lower_bound |
numeric_women_without_outlier[[i]] >
upper_bound] <- NA
}
head(numeric_women_without_outlier)
#Replacing NA by mean
for (i in colnames(numeric_women_without_outlier)) {
column_mean <- mean(numeric_women_without_outlier[[i]], na.rm = TRUE)
numeric_women_without_outlier[[i]][is.na(numeric_women_without_outlier[[i]])] <-
column_mean
}
head(numeric_women_without_outlier)
The provided data sets include values that are measured in different units, such as seconds and minutes. To address this issue, we need to remove the units by scaling the data. One possible method of scaling the data is to normalize it. Normalization is a process of transforming the data to make them comparable by bringing them to a common scale. Specifically, in this method, the data are adjusted in a way that their mean value becomes zero and their spread or variability, as measured by the standard deviation, becomes equal to one.
scale_men <- scale(numeric_men_without_outlier)
head(scale_men)
## 100m (s) 200m (s) 400m (s) 800m (min) 1500m (min) 5000m (min)
## [1,] 0.3087310 -0.2890627 0.8545271 0.2863158 0.4897456 -0.2517226
## [2,] -2.3802163 -1.0920148 -1.4967832 -0.5249124 -0.9240484 -1.2627398
## [3,] -0.4083216 -0.0818493 0.3581394 0.2863158 -0.4527837 -0.4286506
## [4,] -0.4979532 -0.7552929 -0.6607617 -0.7953218 -0.5470366 -1.5154941
## [5,] 0.6672573 -0.4703745 -0.3472537 0.8271347 0.6782515 0.0000000
## [6,] -1.7527953 -1.5323433 -1.6143487 -1.6065500 -0.5470366 0.1274088
## 10,000m (min) Marathon (min)
## [1,] -0.54699386 -0.07699177
## [2,] -0.70444936 -0.88599899
## [3,] -0.45514482 0.96372140
## [4,] -1.57045463 -1.00774279
## [5,] 0.00000000 0.00000000
## [6,] 0.08282816 -1.45937304
scale_women <- scale(numeric_women_without_outlier)
head(scale_women)
## 100m (s) 200m (s) 400m (s) 800m (min) 1500m (min) 3000m (min)
## [1,] 0.8816068 -0.05889089 0.3115352 0.6797313 0.5836379 0.5921674
## [2,] -0.6802950 -1.16040392 -1.4276858 -0.4489388 -0.6888663 -0.6234255
## [3,] -0.5761682 -0.43123333 -0.5333577 -1.0938932 -0.5228875 -0.2978203
## [4,] -0.6108771 -0.77254722 -0.1603465 -0.6101774 -0.3569087 -0.2109922
## [5,] 0.4998086 0.11176605 0.6710641 1.0022085 0.8049429 1.9380024
## [6,] -0.5067504 -0.58637601 -0.5333577 -0.6101774 0.1410277 0.2665622
## Marathon (min)
## [1,] 0.1054535
## [2,] -0.7570163
## [3,] 0.6158430
## [4,] -0.8152742
## [5,] 0.0000000
## [6,] -0.2630909
Principal component analysis (PCA) is a statistical method used to reduce the dimensionality of high-dimensional data by finding a smaller set of linearly uncorrelated variables, known as principal components, that capture the majority of the variance in the original data.
PCA has some assumptions that should be considered when using the method. These include:
Linearity: PCA assumes that the relationships between variables are linear. If the relationships are non-linear, PCA may not be an appropriate method.
Independence: PCA assumes that the variables are independent. If there are strong correlations between variables, PCA may not be an appropriate method.
Normality: PCA assumes that the data are normally distributed. If the data are not normally distributed, PCA may not be an appropriate method.
Linearity can be check by corrplot.
#Linearity Chech of Men's National Track Record
library(corrplot)
corr_men <- cor(scale_men)
corrplot_men <- corrplot(corr_men)
#Linearity Chech of Women's National Track Record
corr_women <- cor(scale_women)
corrplot_women <- corrplot(corr_women)
We can see from the corrplot that there is a linear association among the Men’s National Track Record variables and Women’s National Track Record variables.
The chi-squared test of independence can be used to evaluate the independence between variables in a dataset.The output of chi-squared test will show the test statistic, degrees of freedom, and p-value of the . If the p-value is greater than the significance level (typically 0.05), then the null hypothesis of independence can be accepted and concluding that there has been a significant association between the variables.
#Independency check of Men's National Track Record
chisq.test(numeric_men_without_outlier)
##
## Pearson's Chi-squared test
##
## data: numeric_men_without_outlier
## X-squared = 2.0765, df = 371, p-value = 1
#Independency check of Women's National Track Record
chisq.test(numeric_women_without_outlier)
##
## Pearson's Chi-squared test
##
## data: numeric_women_without_outlier
## X-squared = 8.0524, df = 318, p-value = 1
In both datasets, the p-value exceeds the significance level (usually 0.05), which implies that we can reject the alternative hypothesis and accept the null hypothesis of independence. Therefore, we can conclude that the variables in both datasets are independent.
Normality of a dataset can be assessed using various methods, such as the Q-Q plot and statistical tests like the Shapiro-Wilk test. This tests assess whether the data follow a normal distribution or not, based on the sample statistics and the theoretical properties of the normal distribution. If the p-value of the test is greater than the significance level (usually 0.05), we can accept the null hypothesis of normality and conclude that the data are normally distributed.
#Shapiro-Wilk test of Men's National Track Record
matrix_men <- as.matrix(scale_men)
shapiro.test(matrix_men)
##
## Shapiro-Wilk normality test
##
## data: matrix_men
## W = 0.98135, p-value = 2.343e-05
The p-value for the Men’s National Track Record dataset is extremely low at 2.343e-05, well below the significance level of 0.05. As a result, the null hypothesis is rejected, indicating that the dataset is not normally distributed. Therefore, it is necessary to apply a normal transformation such as logarithmic, square root, or inverse transformations to the dataset.Box-Cox transformation is a robust method of transformation that can convert non-normally distributed data into normally distributed data using statistical techniques.
#Box-Cox transformation of Men's National Track Record
library(MASS)
positive_men <- abs(min(scale_men)) + 1
scale_men_transformed <- scale_men + positive_men
boxcox_men <- boxcox(lm(scale_men_transformed ~ 1))
lambda_men <- boxcox_men$x[which.max(boxcox_men$y)]
bcx_men<- (scale_men_transformed ^ lambda_men - 1) / lambda_men
shapiro.test(bcx_men)
##
## Shapiro-Wilk normality test
##
## data: bcx_men
## W = 0.99085, p-value = 0.008905
Following the Box-Cox transformation, the p-value has increased to 0.008905, which is still below the threshold of 0.05, indicating that the dataset is not entirely normally distributed. However, the increase in the p-value from the previous value suggests that the dataset has moved closer to a normal distribution.
#Shapiro-Wilk test of Women's National Track Record
matrix_women <- as.matrix(scale_women)
shapiro.test(matrix_women)
##
## Shapiro-Wilk normality test
##
## data: matrix_women
## W = 0.97672, p-value = 9.006e-06
The Shapiro-Wilk test was conducted on the Women’s National Track Record dataset and resulted in a p-value of 9.006e-06, which is significantly smaller than the significance level of 0.05. Therefore, it is necessary to transform the dataset. The chosen transformation method will be Box-Cox transformation.
#Box-Cox transformation of Women's National Track Record
positive_women <- abs(min(scale_women)) + 1
scale_women_transformed <- scale_women + positive_women
boxcox_women <- boxcox(lm(scale_women_transformed ~ 1))
lambda_women <- boxcox_women$x[which.max(boxcox_women$y)]
bcx_women<- (scale_women_transformed ^ lambda_women - 1) / lambda_women
shapiro.test(bcx_women)
##
## Shapiro-Wilk normality test
##
## data: bcx_women
## W = 0.9909, p-value = 0.0197
After Box-Cox transformation the Women’s National Track Record dataset was subjected to a Shapiro-Wilk test, which yielded a p-value of 0.0197, much near to the significance level of 0.05. As a result, it can be concluded that the dataset now conforms to a normal distribution.
library(factoextra)
cor_men <- cor(bcx_men)
pca_men <- princomp(cor_men)
summary(pca_men)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.4658678 0.2067061 0.17243295 0.13316873 0.09399884
## Proportion of Variance 0.6799581 0.1338639 0.09315309 0.05555987 0.02768227
## Cumulative Proportion 0.6799581 0.8138220 0.90697510 0.96253497 0.99021725
## Comp.6 Comp.7 Comp.8
## Standard deviation 0.050014796 0.024920562 1.210219e-09
## Proportion of Variance 0.007837071 0.001945684 4.588648e-18
## Cumulative Proportion 0.998054316 1.000000000 1.000000e+00
The table above displays the standard deviation, proportion of variance, and cumulative proportion of each principal component. The data reveals that the first principal component has a standard deviation of 0.4658678 and a proportion of variance of 0.6799581. This indicates that the first principal component is capable of accounting for 67.996% of the variation in the dataset.
pca_men$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## 100m (s) 0.454 0.248 0.419 0.140 0.341 0.649
## 200m (s) 0.255 0.303 0.473 -0.775 -0.132
## 400m (s) 0.203 -0.685 -0.558 -0.280 0.127 0.285
## 800m (min) -0.256 0.481 -0.409 0.410 0.287 -0.366 0.386
## 1500m (min) -0.409 0.232 -0.160 0.119 0.829 0.138 -0.156
## 5000m (min) -0.466 0.178 0.232 -0.197 -0.248 -0.751 0.179
## 10,000m (min) -0.464 0.284 -0.248 -0.361 -0.151 0.636 0.292
## Marathon (min) -0.149 -0.727 -0.165 0.427 -0.136 0.150 0.435
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
## Cumulative Var 0.125 0.250 0.375 0.500 0.625 0.750 0.875 1.000
The table above displays the loadings of each variable on each principal component, which represents the correlation between the variables and the principal components. A higher absolute value of the loading indicates that the variable is more important in determining the principal component. According to the table, the variable “5000m (min)” has the maximum absolute value of 0.466 in Comp.1. Therefore, it can be inferred that “5000m (min)” is the most important variable in determining the first principal component.
fviz_eig(pca_men, addlabels = TRUE)
The plot shown above is referred to as a Scree plot, which graphically represents the amount of variance explained by each principal component. According to the plot, the first principal component accounts for 68% of the total variance, while the second principal component accounts for 13.4% of the total variance.
fviz_cos2(pca_men, choice = "var", axes = 1:2)
The plot shown displays the squared cosine of the angle between each original variable and the principal component (PC) axis, which is known as the cos2 value. These values demonstrate the degree to which each variable contributes to the variation captured by each PC. Higher cos2 values indicate a greater importance of a variable in explaining the variation in the data. The plot indicates that the variable “5000m (min)” has the highest cos2 value, which implies that it is the principal component of the Men’s National Track Record dataset.
cor_women <- cor(bcx_women)
pca_women <- princomp(cor_women)
summary(pca_women)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.3666425 0.1766601 0.09762406 0.08050947 0.04592459
## Proportion of Variance 0.7297086 0.1694107 0.05173417 0.03518501 0.01144865
## Cumulative Proportion 0.7297086 0.8991193 0.95085346 0.98603847 0.99748712
## Comp.6 Comp.7
## Standard deviation 0.021515637 0
## Proportion of Variance 0.002512882 0
## Cumulative Proportion 1.000000000 1
The table presented illustrates the standard deviation and proportion of variance for each principal component. The first component, denoted as “Comp.1”, demonstrates a proportion of variance equal to 0.7297086. This indicates that the first principal component accounts for 72.971% of the variance in the Women’s National Track Record dataset.
pca_women$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 100m (s) 0.495 0.542 0.306 0.283 0.526
## 200m (s) 0.301 0.436 0.510 -0.665 -0.108
## 400m (s) 0.376 0.169 -0.639 -0.544 -0.120 0.200 0.266
## 800m (min) 0.545 -0.369 0.377 0.639
## 1500m (min) -0.293 0.502 0.224 -0.201 -0.110 0.670 -0.333
## 3000m (min) -0.389 0.422 0.256 -0.418 -0.100 -0.521 0.387
## Marathon (min) -0.528 -0.211 -0.214 0.293 -0.140 0.372 0.622
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143 0.143
## Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857 1.000
The table above presents the correlation between variables and principal components, and higher absolute loading values indicate greater importance of a variable in determining the principal component. According to the table, the variable “Marathon (min)” has the highest absolute loading value of 0.528. This indicates that “Marathon (min)” is the first principal component of the Women’s National Track Record dataset.
fviz_eig(pca_women, addlabels = TRUE)
The Scree plot above indicates that the first principal component accounts for 73% of the variation in the dataset, while the second principal component accounts for 16.9% of the variation.
fviz_cos2(pca_women, choice = "var", axes = 1:2)
The plot presented above indicates that “Marathon (min)” is the first principal component because it explains the majority of the variation in the Women’s National Track Record dataset.
The variable “5000m (min)” is the primary principal component of the Men’s National Track Record dataset. The table below displays the top ten countries ranked by this variable:
rank_men <- arrange(men, `5000m (min)`)
head(rank_men, 10)
The variable “Marathon (min)” is the main principal component of the Women’s National Track Record dataset. The table presented below shows the top ten countries ranked according to this variable:
rank_women <- arrange(women, `Marathon (min)`)
head(rank_women, 10)
A Q-Q plot is a type of scatter plot used to visually represent the distribution of a dataset. The x-axis of the plot shows the theoretical quantiles, while the y-axis shows the sample quantiles. When the sample quantiles and theoretical quantiles are identical, the points on the scatter plot fall along a straight line known as the Q-Q line. However, if the points on the scatter plot deviate from the Q-Q line, it indicates that the dataset does not follow a normal distribution. If the points on the scatter plot deviate upwards from the Q-Q line, it suggests that the distribution has a heavier tail than the theoretical normal distribution, whereas if the points deviate downwards, it suggests a lighter tail. If the points fall in the middle, it indicates that the distribution has a different shape compared to the theoretical normal distribution.
#Q-Q plot of first principal component of Men's National Track Record
qqplot_men <- qqnorm(bcx_men[, 6], col = 'blue', main =
"Q-Q Plot of Men's First PC 5000m (min)")
qqline_men <- qqline(bcx_men[, 6], col = 'black')
The plot displayed above illustrates the distribution of data for the Men’s National Track Record in the 5000m (min) category. The scatter points on the plot are mostly aligned along the Q-Q line, but there is a slight upward deviation in some points, indicating that the data has a heavier tail than the theoretical normal distribution. However, the overall shape of the data appears to be similar to that of a normal distribution.
#Q-Q plot of first principal component of Women's National Track Record
qqplot_women <- qqnorm(bcx_women[, 7], col = "black", main =
"Q-Q Plot of Women's First PC (Marathon (min))")
qqpline_women <- qqline(bcx_women[, 7], col = "navy")
The plot above represents the Q-Q plot of the Women’s National Track Record in the Marathon (min) category. The points on the plot align closely with the Q-Q line, with only minor deviations that can be ignored. This suggests that the data follows a theoretical normal distribution.