Math 56 Questions
1. What is multivariate analysis? Multivariate analysis is a statistical technique used to examine multiple variables simultaneously to understand relationships, patterns, and trends within complex datasets.
2. What is multivariate data? Multivariate data occurs when multiple variables are measured or observed simultaneously, often arising in complex real-world scenarios such as economics, biology, and social sciences.
3. Why is knowledge of measurement scales important to an understanding of multivariate data analysis? Knowledge of measurement scales is crucial in multivariate data analysis because it determines the appropriate statistical methods, influences data interpretation, and ensures meaningful comparisons between variables.
4. Discuss the approaches in analyzing multivariate data. Analyzing multivariate data involves exploring several variables at once, and different approaches are used depending on the nature of the data, the objective of the analysis, and the assumptions that can be made. Two common approaches are: PCA (Principal Component Analysis), which simplifies the data by identifying the directions of greatest variance, so that a small number of components represents the original data in a simpler form with minimal loss of information; and SEM (Structural Equation Modeling), which analyzes the relationships between observed variables and latent variables by specifying models that include both direct and indirect effects among variables.
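To make the PCA approach concrete, here is a minimal sketch using R's built-in USArrests data; the dataset and the choice to standardize are illustrative assumptions only, and an SEM would instead be fitted with a dedicated package such as lavaan (not shown here).
# Minimal PCA sketch on a built-in dataset (illustrative only)
pca_demo <- prcomp(USArrests, scale. = TRUE)   # standardize, then extract components
summary(pca_demo)    # proportion of variance explained by each component
pca_demo$rotation    # loadings of the original variables on each component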
5.1 Provide an example of multivariate data in matrix form.
data.frame(Employees= c(1,2,3,4,5),
Sales = c(120,120,130,150,110),
Customer_service = c(90,70,80,90,80),
Teamwork = c(85,95,88,80,75))
Employees Sales Customer_service Teamwork
1 1 120 90 85
2 2 120 70 95
3 3 130 80 88
4 4 150 90 80
5 5 110 80 75
5.2 Of the given example in 5.1, how many variables are there? There are three variables: Sales, Customer Service, and Teamwork (Employees is an identifier rather than a variable).
5.3 Of the given example in 5.1, how many cases are there? There are 5 cases, since there are 5 employees.
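These counts can also be checked in R; a minimal sketch, re-creating the data frame from 5.1 under the hypothetical name emp_data:
emp_data <- data.frame(Employees = 1:5,
                       Sales = c(120, 120, 130, 150, 110),
                       Customer_service = c(90, 70, 80, 90, 80),
                       Teamwork = c(85, 95, 88, 80, 75))
nrow(emp_data)       # number of cases: 5 employees
ncol(emp_data) - 1   # number of variables, excluding the Employees identifier: 3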
Is it possible to display the data graphically to show how the goblets are related and, if so, are there any obvious groupings of similar goblets?
Are there any goblets that are particularly unusual? Carry out a principal components analysis and see whether the values of the principal components help to answer these questions.
One point that needs consideration with this exercise is the extent to which differences between goblets are due to shape differences rather than size differences. It may well be considered that two goblets that are almost the same shape but have very different sizes are really ‘similar’. The problem of separating size and shape differences has generated a considerable scientific literature that will not be considered here. However, it can be noted that one way to remove the effects of size involves dividing the measurements for a goblet by the total height of the body of the goblet. Alternatively, the measurements of a goblet can be expressed as a proportion of the sum of all measurements on that goblet. These types of standardization of variables will clearly ensure that the data values are similar for two goblets with the same shape but different sizes.
Table 1. Measurements (in cm) taken on 25 prehistoric goblets from Thailand. The variables are defined in Fig. 6.3. The data were kindly provided by Professor C.F.W. Higham of the University of Otago.
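The two size standardizations described above are straightforward to compute; a minimal sketch, assuming the measurements have been read into the data frame goblets (as done below) and, for the first option, that X1 is the total height of the body, which would need to be checked against Fig. 6.3.
# Express each goblet's measurements as proportions of their sum (removes size, keeps shape)
goblets_prop <- goblets / rowSums(goblets)
# Alternatively, divide by the body height; X1 is assumed here to be that height
goblets_by_height <- goblets / goblets$X1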
library(factoextra)
goblets <- read.csv("C:/Users/USER/Dropbox/PC/Desktop/Second semester 24-25/Multivariate Analysis, Lab/Midterm exam/vase_midterm.csv", header = TRUE, row.names = 1)
goblets
X1 X2 X3 X4 X5 X6
1 13 21 23 14 7 8
2 14 14 24 19 5 9
3 19 23 24 20 6 12
4 17 18 16 16 11 8
5 19 20 16 16 10 7
6 12 20 24 17 6 9
7 12 19 22 16 6 10
8 12 22 25 15 7 7
9 11 15 17 11 6 5
10 11 13 14 11 7 4
11 12 20 25 18 5 12
12 13 21 23 15 9 8
13 12 15 19 12 5 6
14 13 22 26 17 7 10
15 14 22 26 15 7 9
16 14 19 20 17 5 10
17 15 16 15 15 9 7
18 19 21 20 16 9 10
19 12 20 26 16 7 10
20 17 20 27 18 6 14
21 13 20 27 17 6 9
22 9 9 10 7 4 3
23 8 8 7 5 2 2
24 9 9 8 4 2 2
25 12 19 27 18 5 12
Let us perform a Principal Component Analysis on the data.
# Standardize the six measurements before extracting the components
my_goblets <- prcomp(goblets, scale. = TRUE)
my_goblets
Standard deviations (1, .., p=6):
[1] 2.0668279 1.0450729 0.6202804 0.3773544 0.2555262 0.2088231
Rotation (n x k) = (6 x 6):
PC1 PC2 PC3 PC4 PC5 PC6
X1 0.3660233 0.48592912 -0.6179335 -0.32436829 0.27835629 0.2556581
X2 0.4515367 -0.03412653 0.3752732 -0.67427405 -0.08391876 -0.4386709
X3 0.4111609 -0.44135161 0.3163501 0.02019451 0.38254463 0.6239630
X4 0.4618586 -0.11457532 -0.1588367 0.54119094 0.38182563 -0.5564635
X5 0.2963653 0.68277080 0.4914536 0.35921044 -0.22136144 0.1625790
X6 0.4381125 -0.29768029 -0.3324080 0.13346207 -0.75785442 0.1295892
The observed standard deviations for PC1 to PC6 are 2.0668279, 1.0450729, 0.6202804, 0.3773544, 0.2555262, and 0.2088231, respectively. The rotation (loadings) matrix shows how each original measurement contributes to each component; PC1 has positive loadings of broadly similar size on all six measurements, suggesting that this principal component predominantly reflects overall size variation among the goblets.
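As an informal check of this size interpretation, the PC1 scores can be correlated with a crude overall-size proxy such as the sum of the six measurements per goblet; the choice of proxy is an assumption made here for illustration, and a large correlation in absolute value would support reading PC1 as a size component.
# Correlate PC1 scores with a simple size proxy (row sums of the raw measurements)
size_proxy <- rowSums(goblets)
cor(my_goblets$x[, 1], size_proxy)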
Next, we determine how many principal components to retain by using a scree plot.
fviz_eig(my_goblets,
addlabels = TRUE,
choice ="eigenvalue",
ncp = ncol(goblets)) +
geom_hline(yintercept = 1,
linetype = "dashed",
color = "red")
The scree plot shows that only two principal components, PC1 and PC2, have eigenvalues greater than 1, so we retain these two.
summary(my_goblets)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 2.067 1.045 0.62028 0.37735 0.25553 0.20882
Proportion of Variance 0.712 0.182 0.06412 0.02373 0.01088 0.00727
Cumulative Proportion 0.712 0.894 0.95812 0.98185 0.99273 1.00000
num_pcgoblets <- sum(my_goblets$sdev^2 > 1)
print(num_pcgoblets)
[1] 2
This confirms that only two principal components have eigenvalues greater than 1.
fviz_pca_biplot(my_goblets,
label = "var",
col.var = "#353436",) +
labs(x = "PC1",
y = "PC2")
From the biplot, X1 and X5 point in a similar direction and are the main positive contributors to PC2, while all six measurements contribute with the same sign to PC1, consistent with the loadings obtained from the PCA above. Before performing k-means clustering, we need to determine how many clusters to use.
my_pca_goblets <- data.frame(my_goblets$x[ , 1:2])
fviz_nbclust(my_pca_goblets,
FUNcluster = kmeans,
method = "wss")
The total within-cluster sum of squares begins to level off at k = 3 (the elbow), so we will use three clusters.
my_goblets_pca_scores <- as.data.frame(my_goblets$x[,1:2])
my_kmeans_goblet <- kmeans(my_goblets_pca_scores,
centers = 3)
fviz_pca_ind(my_goblets,
habillage = my_kmeans_goblet$cluster,
repel = TRUE,
addEllipses = TRUE,
ellipse.type = "convex") +
guides(color = guide_legend(override.aes = list(label = ""))) +
labs(x = "PC1",
y = "PC2")
my_kmeans_goblet
K-means clustering with 3 clusters of sizes 6, 15, 4
Cluster means:
PC1 PC2
1 -3.1948479 0.0508247
2 1.0284111 -0.5523108
3 0.9357302 1.9949284
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
2 2 2 3 3 2 2 2 1 1 2 2 1 2 2 2 3 3 2 2 2 1 1 1 2
Within cluster sum of squares by cluster:
[1] 15.186120 9.935051 2.494463
(between_SS / total_SS = 78.5 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
a.) Yes, the data can be represented graphically to illustrate the relationships among the goblets, and the k-means results suggest three groupings. Group 1 consists of goblets 1, 2, 3, 6, 7, 8, 11, 12, 14, 15, 16, 19, 20, 21, and 25, which share similar measurements. Group 2 includes goblets 4, 5, 17, and 18, while Group 3 comprises goblets 9, 10, 13, 22, 23, and 24, each group showing distinct characteristics.
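These group memberships can be read directly from the fitted k-means object; a minimal sketch using the objects created above (note that the cluster labels returned by kmeans() are arbitrary and may not match the group numbers used in the text).
# List which goblets fall into each k-means cluster
split(rownames(my_goblets_pca_scores), my_kmeans_goblet$cluster)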
b.) The scatterplot visualizes the observations in the space of the first two principal components, with each red point representing one goblet plotted by its PC1 and PC2 scores. The 95% confidence ellipse highlights the region where most of the points are concentrated; points falling outside the ellipse can be regarded as unusual, since they lie away from the main concentration of the data.
goblet_pca_scores <- as.data.frame(my_goblets$x[, 1:2])
ggplot(goblet_pca_scores, aes(x = PC1, y = PC2)) +
geom_point(size = 2, color = "red") +
theme_minimal() +
stat_ellipse(type = "t", level = 0.95)
mahal_dist <- mahalanobis(my_goblets_pca_scores, colMeans(my_goblets_pca_scores), cov(my_goblets_pca_scores))
mahal_dist
1 2 3 4 5 6 7
0.05284764 0.54101529 1.54405977 5.32224449 5.78497860 0.60885271 0.44418301
8 9 10 11 12 13 14
0.14228860 0.70161594 1.66856335 2.20666819 0.48010680 0.46027735 0.70557856
15 16 17 18 19 20 21
0.35204402 0.32649269 2.39913589 2.93767708 0.58948610 1.60396641 0.81232165
22 23 24 25
3.81505022 6.18920620 5.70680660 2.60453284
Using the Mahalanobis distance, we can see that goblets 4, 5, 22, 23, and 24 have distances greater than 3, suggesting that these goblets are unusual, which is consistent with the scatterplot above.
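The unusual goblets can also be flagged programmatically; a minimal sketch using the cutoff of 3 adopted above, noting that mahalanobis() returns squared distances, so a chi-square based cutoff such as qchisq(0.95, df = 2) is a common alternative.
# Flag goblets whose Mahalanobis distance exceeds the cutoff used above
names(mahal_dist)[mahal_dist > 3]
# A common alternative cutoff based on the chi-square distribution with 2 degrees of freedom
qchisq(0.95, df = 2)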
library(factoextra)
Protein <- read.csv("C:/Users/USER/Dropbox/PC/Desktop/Second semester 24-25/Multivariate Analysis, Lab/Midterm exam/Protein_midterm.csv", header = TRUE, row.names = 1)
Protein
Red.Meat White.Meat Eggs Milk Fish Cereals Starchy.Foods
Albania 10 1 1 9 9 42 1
Austria 9 14 4 20 2 28 4
Belgium 14 9 4 18 5 27 6
Bulgaria 8 6 2 8 1 57 1
Czechoslovakia 10 11 3 13 2 34 5
Denmark 11 11 4 25 10 22 5
E. Germany 8 12 4 11 5 25 7
Finland 10 5 3 34 6 26 5
France 18 10 3 20 6 28 5
Greece 10 3 3 18 6 42 2
Hungary 5 12 3 10 0 40 4
Ireland 14 10 5 26 2 24 6
Italy 9 5 3 14 3 37 2
Netherlands 10 14 4 23 3 22 4
Norway 9 5 3 23 10 23 5
Poland 7 10 3 19 3 36 6
Portugal 6 4 1 5 14 27 6
Romania 6 6 2 11 1 50 3
Spain 7 3 3 9 7 29 6
Sweden 10 8 4 25 8 20 4
Switzerland 13 10 3 24 2 26 3
UK 17 6 5 21 4 24 5
USSR 9 5 2 17 3 44 6
W. Germany 11 13 4 19 3 19 5
Yugoslavia 4 5 1 10 1 56 3
Pulses..Nuts..Oilseeds Fruits...Vegetables Total
Albania 6 2 72
Austria 1 4 86
Belgium 2 4 89
Bulgaria 4 4 91
Czechoslovakia 1 4 83
Denmark 1 2 91
E. Germany 1 4 77
Finland 1 1 91
France 2 7 99
Greece 8 7 99
Hungary 5 4 83
Ireland 2 3 92
Italy 4 7 84
Netherlands 2 4 86
Norway 2 3 83
Poland 2 7 93
Portugal 5 8 76
Romania 5 3 87
Spain 6 7 77
Sweden 1 2 82
Switzerland 2 5 88
UK 3 3 88
USSR 3 3 92
W. Germany 2 4 80
Yugoslavia 6 3 89
First, we check for outliers in the data by making a scatterplot matrix.
pairs(Protein, gap = 0.1, pch = ".", cex.labels = 0.8)
round(sapply(Protein,var),2)
Red.Meat White.Meat Eggs
11.58 13.99 1.24
Milk Fish Cereals
50.38 12.07 121.23
Starchy.Foods Pulses..Nuts..Oilseeds Fruits...Vegetables
2.74 4.08 3.67
Total
45.81
We can see that the variances range from 1.24 (Eggs) to 121.23 (Cereals), so the variables are on quite different scales and should be standardized before the PCA. Next, we standardize the data and calculate its correlation matrix.
Nor_Protein <- scale(Protein)
round(cor(Nor_Protein),2)
Red.Meat White.Meat Eggs Milk Fish Cereals
Red.Meat 1.00 0.19 0.58 0.54 0.07 -0.51
White.Meat 0.19 1.00 0.60 0.30 -0.40 -0.44
Eggs 0.58 0.60 1.00 0.61 -0.15 -0.70
Milk 0.54 0.30 0.61 1.00 0.04 -0.59
Fish 0.07 -0.40 -0.15 0.04 1.00 -0.42
Cereals -0.51 -0.44 -0.70 -0.59 -0.42 1.00
Starchy.Foods 0.15 0.33 0.41 0.21 0.22 -0.58
Pulses..Nuts..Oilseeds -0.41 -0.67 -0.60 -0.62 0.03 0.64
Fruits...Vegetables -0.06 -0.07 -0.16 -0.40 0.11 0.04
Total 0.37 0.10 0.19 0.46 -0.32 0.19
Starchy.Foods Pulses..Nuts..Oilseeds Fruits...Vegetables
Red.Meat 0.15 -0.41 -0.06
White.Meat 0.33 -0.67 -0.07
Eggs 0.41 -0.60 -0.16
Milk 0.21 -0.62 -0.40
Fish 0.22 0.03 0.11
Cereals -0.58 0.64 0.04
Starchy.Foods 1.00 -0.50 0.07
Pulses..Nuts..Oilseeds -0.50 1.00 0.35
Fruits...Vegetables 0.07 0.35 1.00
Total -0.04 -0.08 0.07
Total
Red.Meat 0.37
White.Meat 0.10
Eggs 0.19
Milk 0.46
Fish -0.32
Cereals 0.19
Starchy.Foods -0.04
Pulses..Nuts..Oilseeds -0.08
Fruits...Vegetables 0.07
Total 1.00
Calculating the eigenvalues and eigenvectors of the correlation matrix of the standardized data, we have
eigen(cor(Nor_Protein))
eigen() decomposition
$values
[1] 4.08102042 1.77649203 1.29332073 1.15617590 0.64090232 0.40846280
[7] 0.35173106 0.17668554 0.11086093 0.00434827
$vectors
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.323162712 0.085761181 -0.45555126 0.13466528 -0.4149609 -0.3378958
[2,] -0.327966557 0.167793167 0.52378576 0.16677840 -0.1194565 0.3658457
[3,] -0.427603002 0.069514798 0.05369812 0.10457088 -0.3052443 -0.3220580
[4,] -0.391384747 0.151749080 -0.33065384 -0.21545795 0.1829815 0.2987262
[5,] -0.002912023 -0.620011310 -0.40426232 -0.09156400 0.1109589 0.3017640
[6,] 0.407352461 0.379047870 0.04161765 -0.03365992 0.2043929 -0.1377447
[7,] -0.276316909 -0.336781738 0.19817040 0.25489398 0.6218107 -0.5160563
[8,] 0.423039223 0.006822824 -0.20122839 0.15536750 -0.1800047 -0.2882860
[9,] 0.132974407 -0.157406379 -0.04272630 0.83950049 -0.1221485 0.2947784
[10,] -0.114182012 0.519897886 -0.39894376 0.30545868 0.4458952 0.1125718
[,7] [,8] [,9] [,10]
[1,] 0.55843275 0.002529492 -0.20345101 -0.15145045
[2,] 0.12212643 0.492031222 -0.32884690 -0.22102950
[3,] -0.49516831 0.199127568 0.54936105 -0.12234518
[4,] -0.37188877 -0.359851908 -0.27811906 -0.44740553
[5,] 0.06768754 0.501579332 0.20518330 -0.20525235
[6,] 0.20980494 0.099997238 0.30264163 -0.69365849
[7,] 0.03944011 -0.023409408 -0.18280865 -0.13228566
[8,] -0.48684034 0.334997648 -0.53173185 -0.09419728
[9,] -0.04931532 -0.329761355 0.11236684 -0.15918232
[10,] -0.01201355 0.325051449 0.09403944 0.37157289
Next, we extract the principal components.
Protein_PCA<-princomp(Nor_Protein,cor = TRUE)
summary(Protein_PCA, loadings = TRUE)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation 2.020154 1.3328511 1.1372426 1.0752562 0.80056375
Proportion of Variance 0.408102 0.1776492 0.1293321 0.1156176 0.06409023
Cumulative Proportion 0.408102 0.5857512 0.7150833 0.8307009 0.89479114
Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
Standard deviation 0.63911094 0.59306918 0.42033979 0.33295784 0.065941413
Proportion of Variance 0.04084628 0.03517311 0.01766855 0.01108609 0.000434827
Cumulative Proportion 0.93563742 0.97081053 0.98847908 0.99956517 1.000000000
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
Red.Meat 0.323 0.456 0.135 0.415 0.338 0.558
White.Meat 0.328 0.168 -0.524 0.167 0.119 -0.366 0.122 0.492
Eggs 0.428 0.105 0.305 0.322 -0.495 0.199
Milk 0.391 0.152 0.331 -0.215 -0.183 -0.299 -0.372 -0.360
Fish -0.620 0.404 -0.111 -0.302 0.502
Cereals -0.407 0.379 -0.204 0.138 0.210
Starchy.Foods 0.276 -0.337 -0.198 0.255 -0.622 0.516
Pulses..Nuts..Oilseeds -0.423 0.201 0.155 0.180 0.288 -0.487 0.335
Fruits...Vegetables -0.133 -0.157 0.840 0.122 -0.295 -0.330
Total 0.114 0.520 0.399 0.305 -0.446 -0.113 0.325
Comp.9 Comp.10
Red.Meat 0.203 0.151
White.Meat 0.329 0.221
Eggs -0.549 0.122
Milk 0.278 0.447
Fish -0.205 0.205
Cereals -0.303 0.694
Starchy.Foods 0.183 0.132
Pulses..Nuts..Oilseeds 0.532
Fruits...Vegetables -0.112 0.159
Total -0.372
PC1 explains 40.81% of the variance, PC2 explains 17.77%, PC3 explains 12.93%, and PC4 explains 11.56%, so together the first four components account for about 83% of the total variance. Let us now determine how many principal components to retain in our data.
fviz_eig(Protein_PCA,
addlabels = TRUE,
choice ="eigenvalue",
ncp = ncol(Protein)) +
geom_hline(yintercept = 1,
linetype = "dashed",
color = "red")
Here we calculate the scores of each country on each principal component.
round(Protein_PCA$scores,1)
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
Albania -3.5 -1.3 1.0 -2.3 1.5 0.1 0.7 0.1 0.3
Austria 1.5 0.8 -1.4 0.0 0.4 -0.7 -0.1 0.1 -0.2
Belgium 1.7 -0.1 0.3 0.5 0.0 0.8 0.5 0.2 -0.1
Bulgaria -2.9 2.2 -0.1 -0.3 0.4 -0.3 0.7 0.2 -0.8
Czechoslovakia 0.5 0.2 -1.4 0.0 0.0 0.1 0.9 -0.2 -0.2
Denmark 2.4 -0.5 0.7 -0.9 -0.6 -0.7 0.0 1.1 -0.3
E. Germany 1.2 -1.5 -2.2 0.1 -0.1 0.6 0.2 0.2 -0.4
Finland 1.8 0.3 1.3 -1.9 -1.4 -0.3 -0.4 -0.7 0.2
France 1.6 0.6 1.8 2.2 0.0 -0.3 1.5 0.2 0.3
Greece -2.2 1.1 2.5 1.5 0.3 -0.2 -1.3 0.6 0.0
Hungary -1.3 0.9 -2.1 0.2 0.2 0.2 -0.7 0.5 0.4
Ireland 2.8 0.9 0.3 0.2 0.0 1.0 -0.5 0.0 0.1
Italy -1.6 0.3 0.2 0.8 1.2 -0.5 -0.3 -0.9 -0.6
Netherlands 1.8 0.5 -0.9 0.0 0.6 -0.8 -0.5 0.2 0.4
Norway 0.7 -1.6 0.8 -1.1 -0.6 -0.4 -0.3 -0.1 -0.2
Poland 0.3 0.4 -0.6 1.7 -1.4 -0.6 -0.2 -0.4 -0.2
Portugal -2.6 -4.0 0.2 1.2 -0.6 -0.6 0.4 0.2 0.2
Romania -2.5 1.3 -0.6 -0.7 -0.3 0.3 -0.1 0.1 0.0
Spain -1.8 -2.3 -0.2 1.2 0.1 1.0 -0.9 -0.4 0.0
Sweden 1.8 -0.9 0.3 -1.6 0.3 -0.5 -0.4 0.0 -0.4
Switzerland 1.1 0.9 0.3 0.2 0.8 -0.8 0.3 -0.7 0.5
UK 2.0 0.2 1.3 -0.1 1.0 1.6 0.0 0.0 -0.2
USSR -0.7 0.7 0.2 -0.3 -1.8 0.8 0.5 -0.3 0.1
W. Germany 1.8 -0.4 -1.2 0.0 0.9 0.0 -0.1 -0.1 0.5
Yugoslavia -3.7 1.5 -0.5 -0.7 -1.0 0.2 -0.1 0.1 0.4
Comp.10
Albania 0.2
Austria 0.0
Belgium 0.0
Bulgaria -0.1
Czechoslovakia 0.0
Denmark 0.0
E. Germany 0.0
Finland 0.0
France 0.0
Greece 0.0
Hungary 0.0
Ireland 0.1
Italy 0.0
Netherlands 0.0
Norway 0.0
Poland 0.2
Portugal -0.1
Romania -0.1
Spain 0.0
Sweden 0.0
Switzerland -0.1
UK -0.1
USSR 0.0
W. Germany 0.0
Yugoslavia -0.1
Here we use a biplot showing the variables in the PC1-PC2 plane.
fviz_pca_biplot(Protein_PCA,
label = "var",
col.var = "#353436",
palette = c("#3fdf05",
"#f25c10",
"#1b98e0")) +
labs(x = "PC1",
y = "PC2")
From the biplot, Starchy Foods, White Meat, Red Meat, Eggs, and Milk load together on the positive side of PC1, with Cereals and Pulses, Nuts and Oilseeds pointing in the opposite direction, while PC2 is dominated by Fish (negative loading) and Total (positive loading), with smaller contributions from Cereals and Starchy Foods.
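These contributions can be examined more directly with factoextra's contribution plots; a minimal sketch using the object fitted above.
# Bar plots of the variables' contributions to PC1 and PC2
fviz_contrib(Protein_PCA, choice = "var", axes = 1)
fviz_contrib(Protein_PCA, choice = "var", axes = 2)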
biplot(Protein_PCA,xlim=c(-0.4,0.5),ylim=c(-0.3,0.3), xlabs=abbreviate(row.names(Protein)))
The dietary patterns reveal that Albania, Spain, and Portugal are
characterized by high protein consumption from fruits and vegetables. In
Belgium, Norway, Western Germany, Sweden, Eastern Germany, and Denmark,
starchy foods are the distinguishing factor. Meanwhile, Switzerland,
Ireland, Austria, France, Poland, Czechoslovakia, Finland, the
Netherlands, and the UK show a preference for white meat, red meat,
eggs, and milk. Lastly, Yugoslavia, Romania, Italy, Hungary, the USSR,
Bulgaria, and Greece stand out due to their notable intake of cereals,
pulses, nuts, and oilseeds.