First we set the working directory and read the data tables for the analysis.
getwd()
## [1] "I:/Varsity DOCU/WM ASDS/Semester 2/PM ASDS22 Machine Learning for Data Science"
setwd("I:/Varsity DOCU/WM ASDS/Semester 2/PM ASDS06 Multivariate Analysis/datasets/datasets")
male_data <- read.table("T8-6.DAT") # male dataset
female_data <- read.table("T1-9.DAT", fill = TRUE) # female dataset; fill = TRUE pads rows with missing fields
dim(male_data);dim(female_data) # dimensions of the data
## [1] 54 9
## [1] 56 8
# changing the column names
colnames(female_data)<-c("Country","100 m (s)","200 m (s)","400 m (s)","800 m (min)","1500 m (min)","3000 m (min)","Marathon (min)")
colnames(male_data)<-c("Country","100 m (s)","200 m (s)","400 m (s)","800 m (min)","1500 m (min)","5000 m (min)","10000 m (min)", "Marathon (min)")
head(male_data);head(female_data) # first rows of each dataset
Convert the female data's '100 m (s)' variable from character ("chr") to numeric. This overwrites the '100 m (s)' column with the converted values; if any of the original character values cannot be converted to numeric, a warning is issued and the corresponding entry in the new column is NA.
female_data$`100 m (s)` <- as.numeric(female_data$`100 m (s)`)
## Warning: NAs introduced by coercion
female_data$`100 m (s)`
## [1] 11.57 11.12 11.15 11.14 11.46 11.17 10.98 11.65 10.79 11.31 12.52 11.72
## [13] 11.09 11.42 11.63 11.13 10.73 10.81 11.10 10.83 11.92 11.41 11.56 11.38
## [25] 11.43 11.45 11.14 11.36 11.62 NA NA NA NA 11.76 11.50 11.72
## [37] 11.09 11.66 11.08 11.32 11.41 11.96 11.28 10.93 11.30 11.30 10.77 12.38
## [49] 12.13 11.06 11.16 11.34 11.22 11.33 11.25 10.49
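To see which rows were affected by the coercion, the NA positions in the converted column can be inspected directly; a minimal check (row indices and the corresponding records):
# rows whose '100 m (s)' entry could not be converted and became NA
which(is.na(female_data$`100 m (s)`))
female_data[is.na(female_data$`100 m (s)`), ]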
Correlation Matrix for Male_data without data normalization
# Compute the correlation matrix for Male
cor_matrix_male <- cor(male_data[2:9])
cor_matrix_male
## 100 m (s) 200 m (s) 400 m (s) 800 m (min) 1500 m (min)
## 100 m (s) 1.0000000 0.9147554 0.8041147 0.7119388 0.7657919
## 200 m (s) 0.9147554 1.0000000 0.8449159 0.7969162 0.7950871
## 400 m (s) 0.8041147 0.8449159 1.0000000 0.7677488 0.7715522
## 800 m (min) 0.7119388 0.7969162 0.7677488 1.0000000 0.8957609
## 1500 m (min) 0.7657919 0.7950871 0.7715522 0.8957609 1.0000000
## 5000 m (min) 0.7398803 0.7613028 0.7796929 0.8606959 0.9165224
## 10000 m (min) 0.7147921 0.7479519 0.7657481 0.8431074 0.9013380
## Marathon (min) 0.6764873 0.7211157 0.7126823 0.8069657 0.8777788
## 5000 m (min) 10000 m (min) Marathon (min)
## 100 m (s) 0.7398803 0.7147921 0.6764873
## 200 m (s) 0.7613028 0.7479519 0.7211157
## 400 m (s) 0.7796929 0.7657481 0.7126823
## 800 m (min) 0.8606959 0.8431074 0.8069657
## 1500 m (min) 0.9165224 0.9013380 0.8777788
## 5000 m (min) 1.0000000 0.9882324 0.9441466
## 10000 m (min) 0.9882324 1.0000000 0.9541630
## Marathon (min) 0.9441466 0.9541630 1.0000000
Correlation Matrix for Female_data without data normalization
cor_matrix_female_with_NA_value <- cor(female_data[2:8])
cor_matrix_female_with_NA_value
## 100 m (s) 200 m (s) 400 m (s) 800 m (min) 1500 m (min)
## 100 m (s) 1 NA NA NA NA
## 200 m (s) NA 1 NA NA NA
## 400 m (s) NA NA 1 NA NA
## 800 m (min) NA NA NA 1 NA
## 1500 m (min) NA NA NA NA 1
## 3000 m (min) NA NA NA NA NA
## Marathon (min) NA NA NA NA NA
## 3000 m (min) Marathon (min)
## 100 m (s) NA NA
## 200 m (s) NA NA
## 400 m (s) NA NA
## 800 m (min) NA NA
## 1500 m (min) NA NA
## 3000 m (min) 1 NA
## Marathon (min) NA 1
This correlation matrix contains NA entries because there are NA values in the female_data columns (for example the NAs introduced by coercion in the '100 m (s)' column). Now we identify the rows with missing values and create a new dataset that excludes them.
# Identify rows with missing values
missing_rows <- complete.cases(female_data)
# Create a new data set that excludes the rows with missing values; rows 30-33 have missing values
female_data <- female_data[missing_rows,]
female_data
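An alternative that avoids dropping whole rows is to let cor() handle the missing values itself; a minimal sketch, assuming it is run on the data as originally read in (before the incomplete rows are removed):
# correlations computed from pairwise complete observations
cor(female_data[2:8], use = "pairwise.complete.obs")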
Now the correlation matrix for the female data after removing the rows with NA values
cor_matrix_female <- cor(female_data[2:8])
cor_matrix_female
## 100 m (s) 200 m (s) 400 m (s) 800 m (min) 1500 m (min)
## 100 m (s) 1.0000000 0.9501109 0.8687133 0.8357212 0.7866444
## 200 m (s) 0.9501109 1.0000000 0.9038381 0.8858583 0.8322940
## 400 m (s) 0.8687133 0.9038381 1.0000000 0.8481705 0.7333062
## 800 m (min) 0.8357212 0.8858583 0.8481705 1.0000000 0.9139765
## 1500 m (min) 0.7866444 0.8322940 0.7333062 0.9139765 1.0000000
## 3000 m (min) 0.7418504 0.7790246 0.7012080 0.8744640 0.9754223
## Marathon (min) 0.6955105 0.7488635 0.7241690 0.8668722 0.7987052
## 3000 m (min) Marathon (min)
## 100 m (s) 0.7418504 0.6955105
## 200 m (s) 0.7790246 0.7488635
## 400 m (s) 0.7012080 0.7241690
## 800 m (min) 0.8744640 0.8668722
## 1500 m (min) 0.9754223 0.7987052
## 3000 m (min) 1.0000000 0.8003613
## Marathon (min) 0.8003613 1.0000000
Corrplots for both the male and female datasets
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix_male, method=c("ellipse"),type = "lower", title = "Male's Correlated Data", addCoef.col = "black", cex.main=0.5);corrplot(cor_matrix_female, method=c("ellipse"),type = "lower", title = "Female's Correlated Data", addCoef.col = "black", cex.main=0.5)
Eigenvalues and eigenvectors are below
eig_male <- eigen(cor_matrix_male)
eig_female <- eigen(cor_matrix_female)
eig_male;eig_female
## eigen() decomposition
## $values
## [1] 6.703289951 0.638410110 0.227524494 0.205849181 0.097577441 0.070687912
## [7] 0.046942050 0.009718862
##
## $vectors
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.3323877 -0.52939911 -0.343859303 0.38074525 0.29967117 -0.36203713
## [2,] -0.3460511 -0.47039050 0.003786104 0.21702322 -0.54143422 0.34859224
## [3,] -0.3391240 -0.34532929 0.067060507 -0.85129980 0.13298631 0.07708385
## [4,] -0.3530134 0.08945523 0.782711152 0.13427911 -0.22728254 -0.34130845
## [5,] -0.3659849 0.15365241 0.244270040 0.23302034 0.65162403 0.52977961
## [6,] -0.3698204 0.29475985 -0.182863147 -0.05462441 0.07181636 -0.35914382
## [7,] -0.3659489 0.33360619 -0.243980694 -0.08706927 -0.06133263 -0.27308617
## [8,] -0.3542779 0.38656085 -0.334632969 0.01812115 -0.33789097 0.37516986
## [,7] [,8]
## [1,] 0.3476470 -0.065701445
## [2,] -0.4398969 0.060755403
## [3,] 0.1135553 -0.003469726
## [4,] 0.2588830 -0.039274027
## [5,] -0.1470362 -0.039745509
## [6,] -0.3283202 0.705684585
## [7,] -0.3511133 -0.697181715
## [8,] 0.5941571 0.069316891
## eigen() decomposition
## $values
## [1] 5.93888605 0.53078156 0.27482696 0.12645231 0.07311967 0.04184642 0.01408704
##
## $vectors
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.3744596 -0.4367624 -0.2268423 0.55088047 0.12222759 0.54263718
## [2,] -0.3886178 -0.3532148 -0.1158835 0.22440384 -0.11910666 -0.77486230
## [3,] -0.3678277 -0.4653831 0.2186249 -0.69319469 0.31240786 0.08050071
## [4,] -0.3965282 0.1000993 0.1324223 -0.19047340 -0.80779684 0.26722912
## [5,] -0.3848945 0.3662518 -0.3792578 -0.07387975 -0.06870531 -0.13222465
## [6,] -0.3740130 0.4648406 -0.3578193 -0.13587603 0.40777728 0.08446189
## [7,] -0.3580431 0.3281150 0.7732464 0.32473699 0.22301768 -0.05122863
## [,7]
## [1,] 0.06853581
## [2,] -0.21408297
## [3,] 0.12540559
## [4,] -0.23448362
## [5,] 0.73906521
## [6,] -0.56931927
## [7,] 0.09176257
The eigen decomposition for the male data is eight by eight and for the female data seven by seven. For the male data, the first eigenvalue is 6.703, and the subsequent eigenvalues decrease in magnitude. The eigenvalues indicate the amount of variance explained by each eigenvector: the first eigenvector explains the most variance, followed by the second, and so on. For the male data the first eigenvalue is much larger than the others, indicating that the first principal component explains most of the variance in the data. For the female data the first eigenvalue is still the largest, but the difference is not quite as pronounced.
Additionally, looking at the eigenvectors, those of the male dataset are not as clearly separated and defined as those of the female dataset. This could be an indication of more noise or overlap in the data. In contrast, the eigenvectors of the female dataset are more distinct and easier to interpret.
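Because the decompositions are based on correlation matrices, the proportion of variance explained by each component can be computed directly from the eigenvalues; a small sketch using the objects above:
# proportion and cumulative proportion of variance per component
prop_male <- eig_male$values / sum(eig_male$values)
prop_female <- eig_female$values / sum(eig_female$values)
round(prop_male, 3); round(cumsum(prop_male), 3)
round(prop_female, 3); round(cumsum(prop_female), 3)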
#str(male_data);str(female_data)
#plot(male_data);plot(female_data)
data_normalized_male <- scale(male_data[2:9]) #male data normalization
data_normalized_female <- scale(female_data[2:8]) #female data normalization
head(data_normalized_male);head(data_normalized_female)
## 100 m (s) 200 m (s) 400 m (s) 800 m (min) 1500 m (min) 5000 m (min)
## [1,] 0.0594137 -0.3126106 0.24391299 0.03530558 0.1757051 -0.3779943
## [2,] -1.2962228 -0.8777403 -1.00718688 -0.53664481 -0.8126362 -0.9037300
## [3,] -0.3020894 -0.1667707 -0.02020809 0.03530558 -0.4831891 -0.4699980
## [4,] -0.3472772 -0.6407504 -0.56235137 -0.72729495 -0.5490785 -1.0351640
## [5,] 0.2401652 -0.4402205 -0.39553805 0.41660584 0.3074840 1.3437903
## [6,] -0.9799076 -1.1876501 -1.06974187 -1.29924534 -0.5490785 -0.1808434
## 10000 m (min) Marathon (min)
## [1,] -0.5271604 -0.4366164
## [2,] -0.5986248 -0.6667369
## [3,] -0.4854728 -0.1405878
## [4,] -0.9916792 -0.7013666
## [5,] 1.1641643 1.4400937
## [6,] -0.2413027 -0.8298319
## 100 m (s) 200 m (s) 400 m (s) 800 m (min) 1500 m (min) 3000 m (min)
## 1 0.5632928 -0.14161313 0.2434145 0.3177879 0.22597853 0.12706018
## 2 -0.5720260 -0.93139514 -1.2642718 -0.4799693 -0.60307153 -0.54689108
## 3 -0.4963381 -0.40858170 -0.4890016 -0.9358306 -0.49493456 -0.36636842
## 4 -0.5215674 -0.65330288 -0.1656477 -0.5939347 -0.38679760 -0.31822905
## 5 0.2857704 -0.01925254 0.5550809 0.5457186 0.37016114 0.87322051
## 6 -0.4458794 -0.51981860 -0.4890016 -0.5939347 -0.06238671 -0.05346248
## Marathon (min)
## 1 -0.21598614
## 2 -0.62418924
## 3 0.02557898
## 4 -0.65176243
## 5 1.21422322
## 6 -0.39041654
print("Correlation matrix For Male Dataset")
## [1] "Correlation matrix For Male Dataset"
# Compute the correlation matrix for Male after normalization
cor_matrix_male <- cor(data_normalized_male)
cor_matrix_male
## 100 m (s) 200 m (s) 400 m (s) 800 m (min) 1500 m (min)
## 100 m (s) 1.0000000 0.9147554 0.8041147 0.7119388 0.7657919
## 200 m (s) 0.9147554 1.0000000 0.8449159 0.7969162 0.7950871
## 400 m (s) 0.8041147 0.8449159 1.0000000 0.7677488 0.7715522
## 800 m (min) 0.7119388 0.7969162 0.7677488 1.0000000 0.8957609
## 1500 m (min) 0.7657919 0.7950871 0.7715522 0.8957609 1.0000000
## 5000 m (min) 0.7398803 0.7613028 0.7796929 0.8606959 0.9165224
## 10000 m (min) 0.7147921 0.7479519 0.7657481 0.8431074 0.9013380
## Marathon (min) 0.6764873 0.7211157 0.7126823 0.8069657 0.8777788
## 5000 m (min) 10000 m (min) Marathon (min)
## 100 m (s) 0.7398803 0.7147921 0.6764873
## 200 m (s) 0.7613028 0.7479519 0.7211157
## 400 m (s) 0.7796929 0.7657481 0.7126823
## 800 m (min) 0.8606959 0.8431074 0.8069657
## 1500 m (min) 0.9165224 0.9013380 0.8777788
## 5000 m (min) 1.0000000 0.9882324 0.9441466
## 10000 m (min) 0.9882324 1.0000000 0.9541630
## Marathon (min) 0.9441466 0.9541630 1.0000000
print("Correlation matrix For Female Dataset")
## [1] "Correlation matrix For Female Dataset"
# Compute the correlation matrix for Female after normalization
cor_matrix_female <- cor(data_normalized_female)
cor_matrix_female
## 100 m (s) 200 m (s) 400 m (s) 800 m (min) 1500 m (min)
## 100 m (s) 1.0000000 0.9501109 0.8687133 0.8357212 0.7866444
## 200 m (s) 0.9501109 1.0000000 0.9038381 0.8858583 0.8322940
## 400 m (s) 0.8687133 0.9038381 1.0000000 0.8481705 0.7333062
## 800 m (min) 0.8357212 0.8858583 0.8481705 1.0000000 0.9139765
## 1500 m (min) 0.7866444 0.8322940 0.7333062 0.9139765 1.0000000
## 3000 m (min) 0.7418504 0.7790246 0.7012080 0.8744640 0.9754223
## Marathon (min) 0.6955105 0.7488635 0.7241690 0.8668722 0.7987052
## 3000 m (min) Marathon (min)
## 100 m (s) 0.7418504 0.6955105
## 200 m (s) 0.7790246 0.7488635
## 400 m (s) 0.7012080 0.7241690
## 800 m (min) 0.8744640 0.8668722
## 1500 m (min) 0.9754223 0.7987052
## 3000 m (min) 1.0000000 0.8003613
## Marathon (min) 0.8003613 1.0000000
####################################### PCA
# performing PCA for both datasets
male_pca <- princomp(cor_matrix_male)
female_pca <- princomp(cor_matrix_female)
#male_pca # it does not show all details, so we use the summary function
#female_pca
summary(male_pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.2375035 0.08075860 0.07335707 0.04416490 0.03296589
## Proportion of Variance 0.7835686 0.09059714 0.07475166 0.02709514 0.01509616
## Cumulative Proportion 0.7835686 0.87416576 0.94891742 0.97601256 0.99110872
## Comp.6 Comp.7 Comp.8
## Standard deviation 0.024892956 0.0045178371 0
## Proportion of Variance 0.008607753 0.0002835293 0
## Cumulative Proportion 0.999716471 1.0000000000 1
summary(female_pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.2006489 0.1133731 0.06233540 0.04754000 0.016557783
## Proportion of Variance 0.6758809 0.2157828 0.06523279 0.03794153 0.004602574
## Cumulative Proportion 0.6758809 0.8916637 0.95689644 0.99483798 0.999440551
## Comp.6 Comp.7
## Standard deviation 0.0057727389 1.156933e-09
## Proportion of Variance 0.0005594487 2.247050e-17
## Cumulative Proportion 1.0000000000 1.000000e+00
The PCA result summaries show the standard deviations, proportion of variance, and cumulative proportion for each principal component (PC) for two datasets, one with 8 variables and 8 observations, and the other with 7 variables and 7 observations.
For the male dataset, the first principal component (PC1) has the highest proportion of variance (78.36%), followed by PC2 (9.06%), PC3 (7.48%), and PC4 (2.71%). Together, these four components account for 97.60% of the total variance in the data. The remaining components have much lower proportions of variance and can be considered less important for describing the variability in the data.
For the female dataset, the first principal component (PC1) also has the highest proportion of variance (67.59%), followed by PC2 (21.58%), PC3 (6.52%), and PC4 (3.79%). Together, these four components account for 99.48% of the total variance in the data. The remaining components have very low proportions of variance and can be considered less important for describing the variability in the data.
Overall, these results suggest that the first few principal components capture the majority of the variability in both datasets, with PC1 being the most important in each case. However, it’s important to note that without more context on the datasets and their variables, it’s difficult to interpret the meaning of each principal component and their respective loadings.
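Note that princomp() above is applied to the correlation matrices themselves, so the "observations" are the 8 (respectively 7) rows of those matrices. A more conventional alternative, sketched here as an assumption rather than a replacement, is to run the PCA directly on the standardized records; this is also the approach used later for the country rankings:
# alternative: PCA on the standardized records rather than on the correlation matrix
male_pca_records <- prcomp(male_data[, 2:9], scale. = TRUE)
female_pca_records <- prcomp(female_data[, 2:8], scale. = TRUE)
summary(male_pca_records); summary(female_pca_records)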
Loadings of the principal components
################loading/select component/variation of the component
load_male<-male_pca$loadings
str(load_male) #structure
## 'loadings' num [1:8, 1:8] 0.397 0.335 0.223 -0.198 -0.266 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:8] "100 m (s)" "200 m (s)" "400 m (s)" "800 m (min)" ...
## ..$ : chr [1:8] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
#load_male[,1]
sum(load_male[,1]^2) ##sum square loading for component 1
## [1] 1
load_female<-female_pca$loadings
str(load_female) #structure of the load data
## 'loadings' num [1:7, 1:7] 0.444 0.361 0.472 -0.092 -0.358 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:7] "100 m (s)" "200 m (s)" "400 m (s)" "800 m (min)" ...
## ..$ : chr [1:7] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
#load_female[,1]
sum(load_female[,1]^2) ##sum square loading for component 1
## [1] 1
The loadings are stored as matrices: an 8x8 matrix for the male data and a 7x7 matrix for the female data, with entries ranging roughly from -0.5 to 0.5. The rows are named after the running events and the columns after the components (Comp.1, Comp.2, ...). Each column gives the weights of the original variables in the corresponding principal component, and, as the checks above show, each column has unit sum of squared loadings.
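The loading vectors are in fact orthonormal: each column has unit sum of squares (checked above) and different columns are orthogonal. A quick sanity check, assuming the load_male and load_female objects from above:
# crossprod of the loading matrix should be (numerically) the identity matrix
round(crossprod(unclass(load_male)), 10)
round(crossprod(unclass(load_female)), 10)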
For further interpretation we draw the scree plots.
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_eig(male_pca)
fviz_eig(female_pca)
These scree plots show the eigenvalues in a downward curve, from highest to lowest. For both the male and the female data, the first two components can be considered the most important, since together they contain more than 85% of the total variance in the data.
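The percentages behind the scree plots can also be extracted as numbers; a minimal sketch using factoextra's get_eigenvalue():
# eigenvalue, percentage of variance and cumulative percentage per component
get_eigenvalue(male_pca)
get_eigenvalue(female_pca)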
Biplot of the attributes and Contribution of each variable
With the biplot, it is possible to visualize the similarities and dissimilarities between the samples; it also shows the impact of each attribute on each of the principal components.
########## Biplot and contribution
fviz_pca_var(male_pca, col.var = "black")
fviz_cos2(male_pca,choice ="var",axes=1:2)
fviz_pca_var(female_pca, col.var = "black")
fviz_cos2(female_pca,choice ="var",axes=1:2)
#Biplot combined with cos2 male data
fviz_pca_var(male_pca, col.var = "cos2",
gradient.cols = c("black", "orange", "green"),
repel = TRUE)
#Biplot combined with cos2 female data
fviz_pca_var(female_pca, col.var = "cos2",
gradient.cols = c("black", "orange", "green"),
repel = TRUE)
Biplot of the variables:
Three main pieces of information can be observed from the previous plots.
First, variables that are grouped together are positively correlated with each other; this is the case, for instance, for "100 m (s)", "200 m (s)" and "400 m (s)", which are positively correlated with one another and have the highest positive values in the loading matrix with respect to the first principal component.
Then, the greater the distance between a variable and the origin, the better represented that variable is. Variables that are negatively correlated are displayed on opposite sides of the biplot's origin.
Based on the scree plots and the cumulative proportion of variance explained, for the male data the first two principal components explain the majority of the variability (over 87% of the total variance). For the female data, the first two principal components explain over 89% of the total variance. With three components, both datasets reach approximately 95% of the total variance.
Therefore, for both the male and female data we can summarize the variability with just two principal components, which cover more than 85% of the total variance.
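The choice of two components can be made explicit by taking the smallest number of components whose cumulative proportion of variance reaches a chosen threshold; a small sketch assuming an 85% cutoff:
# smallest number of components explaining at least 85% of the variance
prop_var_male <- male_pca$sdev^2 / sum(male_pca$sdev^2)
prop_var_female <- female_pca$sdev^2 / sum(female_pca$sdev^2)
which(cumsum(prop_var_male) >= 0.85)[1]
which(cumsum(prop_var_female) >= 0.85)[1]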
male_pca$loadings[,1:2] #Loading matrix of the first two principal components male data
## Comp.1 Comp.2
## 100 m (s) 0.3971133 0.27234892
## 200 m (s) 0.3345827 -0.06834248
## 400 m (s) 0.2228388 -0.07156477
## 800 m (min) -0.1982670 -0.81099114
## 1500 m (min) -0.2656437 -0.29157647
## 5000 m (min) -0.3973884 0.15451249
## 10000 m (min) -0.4323405 0.21967555
## Marathon (min) -0.4753799 0.31808635
The loading matrix for the male data shows that the first principal component has high positive values for "100 m (s)", "200 m (s)" and "400 m (s)" in the national track records data, while the values for 800 m (min), 1500 m (min), 5000 m (min), 10000 m (min) and Marathon (min) are negative. This suggests that the first component contrasts the sprint events with the middle- and long-distance events. The second principal component is dominated by a large negative value for 800 m (min), with smaller negative values for 200 m (s), 400 m (s) and 1500 m (min).
female_pca$loadings[,1:2] #Loading matrix of the first two principal components female data
## Comp.1 Comp.2
## 100 m (s) 0.4443220 0.35086755
## 200 m (s) 0.3611640 0.29056994
## 400 m (s) 0.4717280 -0.03935761
## 800 m (min) -0.0920500 0.11996092
## 1500 m (min) -0.3576415 0.52351114
## 3000 m (min) -0.4568553 0.47110756
## Marathon (min) -0.3232860 -0.52963684
As for the male data, the loading matrix for the female data shows that the first principal component has high positive values for "100 m (s)", "200 m (s)" and "400 m (s)", while the values for 800 m (min), 1500 m (min), 3000 m (min) and Marathon (min) are negative, so the first component again contrasts the sprint events with the middle- and long-distance events. The second principal component has its largest negative value for Marathon (min), together with positive values for the 1500 m and 3000 m events.
Ranking the nations in the male data by their scores on the first principal component
male_pca <- prcomp(male_data[,2:ncol(male_data)], scale = TRUE)
scores <- data.frame(nation = male_data[,1], score = male_pca$x[,1])
ranked_scores <- scores[order(scores$score, decreasing = TRUE),]
print(ranked_scores)
## nation score
## 54 U.S.A. 3.82842499
## 19 GreatBritain 2.91498277
## 29 Kenya 2.60834729
## 17 France 2.40202950
## 2 Australia 2.35250215
## 27 Italy 2.24390003
## 6 Brazil 2.20825258
## 18 Germany 2.13786396
## 43 Portugal 2.07397725
## 7 Canada 2.00766173
## 4 Belgium 1.97977654
## 42 Poland 1.90966469
## 45 Russia 1.82365926
## 48 Spain 1.72965172
## 28 Japan 1.65463811
## 50 Switzerland 1.58538872
## 39 Norway 1.43996309
## 37 Netherlands 1.43961748
## 35 Mexico 1.22404745
## 38 NewZealand 1.21196841
## 14 Denmark 1.12932147
## 20 Greece 1.09870962
## 22 Hungary 1.02549631
## 16 Finland 0.91160091
## 25 Ireland 0.80892204
## 49 Sweden 0.76690338
## 3 Austria 0.73063182
## 8 Chile 0.72256355
## 9 China 0.72128317
## 13 CzechRepublic 0.56671042
## 44 Romania 0.56233650
## 1 Argentina 0.41633265
## 30 Korea,South 0.37364381
## 23 India -0.03970824
## 10 Columbia -0.38603599
## 53 Turkey -0.42141313
## 26 Israel -0.79065356
## 34 Mauritius -0.86045396
## 32 Luxembourg -0.99164938
## 51 Taiwan -1.34431318
## 15 DominicanRepub -1.47568252
## 5 Bermuda -1.48613375
## 52 Thailand -1.51481290
## 24 Indonesia -1.80145079
## 12 CostaRica -1.80827499
## 31 Korea,North -1.85611284
## 33 Malaysia -1.97947150
## 21 Guatemala -1.98836313
## 41 Philippines -2.12382696
## 36 Myanmar(Burma) -3.23684547
## 40 PapuaNewGuinea -3.68134494
## 47 Singapore -3.69149229
## 46 Samoa -8.42153735
## 11 CookIslands -10.71119651
Ranking the nations in the female data by their scores on the first principal component
female_pca <- prcomp(female_data[,2:ncol(female_data)], scale = TRUE)
scores_fe <- data.frame(nation = female_data[,1], score = female_pca$x[,1])
ranked_scores_fe <- scores_fe[order(scores_fe$score, decreasing = TRUE),]
print(ranked_scores_fe)
## nation score
## 52 Thailand -1.51481290
## 18 Germany 2.13786396
## 43 Portugal 2.07397725
## 9 China 0.72128317
## 17 France 2.40202950
## 19 GreatBritain 2.91498277
## 13 CzechRepublic 0.56671042
## 40 PapuaNewGuinea -3.68134494
## 42 Poland 1.90966469
## 2 Australia 2.35250215
## 46 Samoa -8.42153735
## 7 Canada 2.00766173
## 27 Italy 2.24390003
## 35 Mexico 1.22404745
## 4 Belgium 1.97977654
## 16 Finland 0.91160091
## 3 Austria 0.73063182
## 20 Greece 1.09870962
## 41 Philippines -2.12382696
## 48 Spain 1.72965172
## 25 Ireland 0.80892204
## 6 Brazil 2.20825258
## 33 Malaysia -1.97947150
## 29 Kenya 2.60834729
## 51 Taiwan -1.34431318
## 47 Singapore -3.69149229
## 22 Hungary 1.02549631
## 36 Myanmar(Burma) -3.23684547
## 37 Netherlands 1.43961748
## 28 Japan 1.65463811
## 24 Indonesia -1.80145079
## 14 Denmark 1.12932147
## 10 Columbia -0.38603599
## 1 Argentina 0.41633265
## 26 Israel -0.79065356
## 49 Sweden 0.76690338
## 8 Chile 0.72256355
## 34 Mauritius -0.86045396
## 50 Switzerland 1.58538872
## 5 Bermuda -1.48613375
## 31 Korea,North -1.85611284
## 30 Korea,South 0.37364381
## 23 India -0.03970824
## 39 Norway 1.43996309
## 32 Luxembourg -0.99164938
## 12 CostaRica -1.80827499
## 15 DominicanRepub -1.47568252
## 45 Russia 1.82365926
## 21 Guatemala -1.98836313
## 38 NewZealand 1.21196841
## 11 CookIslands -10.71119651
## 44 Romania 0.56233650
# Construct a Q-Q plot using the first principal component for the male data
qqnorm(predict(male_pca)[,1], main = "Normal Q-Q Plot, Male Data")
qqline(predict(male_pca)[,1])
# Construct a Q-Q plot using the first principal component for the female data
qqnorm(predict(female_pca)[,1], main = "Normal Q-Q Plot, Female Data")
qqline(predict(female_pca)[,1])
A Q-Q plot is a graphical way to compare the distribution of a dataset to a normal distribution. If the data is normally distributed, the points in the Q-Q plot should fall along a straight line. In this case, we are constructing Q-Q plots using the scores on the first principal component for each dataset. The qqnorm() function is used to create the plot, and the qqline() function adds a reference line for a normal distribution.
Interpretation:
For the male data, we can see that the data points deviate slightly from the QQ line in the tails, suggesting that the data may be slightly skewed. However, overall, the points lie close to the line, indicating that the data is reasonably well approximated by a normal distribution.
For the female data, the points deviate more from the QQ line, particularly in the tails. This suggests that the data may be more heavily skewed and/or have heavier tails than a normal distribution. Therefore, we may need to consider non-normal distributions when modeling or analyzing this data.
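A formal test can complement the visual check; a minimal sketch applying the Shapiro-Wilk normality test to the first principal component scores (male_pca and female_pca are the prcomp objects used for the rankings):
# Shapiro-Wilk test of normality for the scores on the first principal component
shapiro.test(male_pca$x[, 1])
shapiro.test(female_pca$x[, 1])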
The results obtained from the men’s data and women’s data are generally consistent. Both datasets show that the first two principal components explain the majority of the variability in the data. In addition, the countries are ranked similarly based on their scores on the first principal component.
For the male data, the first principal component (PC1) explains 78.36% of the variance in the data, followed by PC2 with 9.06% and PC3 with 7.48%. Together, these three components explain approximately 95% of the total variance in the data. The remaining components each explain less than 3% of the variance.
For the female data, the first principal component (PC1) explains 67.59% of the variance in the data, followed by PC2 with 21.58% and PC3 with 6.52%. Together, these three components explain more than 95% of the total variance in the data. The remaining components each explain less than 4% of the variance.
From this, we can see that the male data has a higher proportion of variance explained by the first principal component (PC1) compared to the female data. This suggests that the male data may be more consistent and have stronger patterns than the female data, at least in terms of the variables included in the PCA.
Comparing the eigenvalues of the two datasets, the male dataset has a larger spread of eigenvalues, with the first eigenvalue being much larger than the rest. This suggests that the male dataset has a dominant factor explaining a large amount of the variation in the data. The female dataset has a smaller spread of eigenvalues, with the first eigenvalue still the largest but not as dominant as in the male dataset.
Looking at the eigenvectors, those of the male dataset are not as clearly separated and defined as those of the female dataset, which could indicate more noise or overlap in the data. In contrast, the eigenvectors of the female dataset are more distinct and easier to interpret. Based on these observations, the female dataset appears to be the more consistent of the two.
However, there are some differences in the exact values of the eigenvalues and eigenvectors. These differences can be attributed to the men's and women's data having different sample sizes and different sets of events. Nonetheless, the results for the men's data are overall consistent with those for the women's data.