Preliminary Information on Multivariate Analysis
Introduction
Multivariate analysis: According to K.C. Bhuyan, multivariate analysis is a statistical technique for the simultaneous analysis of two or more variables observed on one or more sample objects. In addition to univariate summaries such as the mean and variance of each variable, multivariate analysis studies the inter-relationships among the variables.
The main objective is to estimate the extent of the relationship among the variables.
Types of variables in multivariate data:
- Dependent variable/Response variable/Criterion variable
- Independent variable/Predictor variable
Types of multivariate data analysis: Besides studying the extent of relationship between the dependent and independent sets we also need to study the structure of inter-relationships of the variables altogether. Thus the analysis of multivariate data can be classified mainly into two types -
- Inter-dependence Analysis
  - Principal Component Analysis (PCA)
  - Factor Analysis
  - Cluster Analysis
- Dependence Analysis
  - Discriminant Analysis
  - Canonical Correlation Analysis
  - Multivariate Analysis of Variance (MANOVA)
  - Multivariate Regression Analysis
Data reduction techniques: When there are too many variables, calculations become unwieldy. Data reduction techniques analyze the data with as few variables as possible while preserving the main information.
Two such data reduction techniques are PCA and factor analysis.
The size of the data can also be reduced with respect to the number of individuals or objects. This is done using cluster analysis.
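As a rough illustration of both kinds of reduction, the sketch below uses base R's prcomp() and hclust() on the built-in USArrests data set. The data set, the 90% variance threshold and the choice of four clusters are assumptions made only for this example; none of them comes from the data analysed in this document.

```r
# Illustrative only: USArrests is a built-in R data set, not this document's data.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each principal component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(prop_var)
print(round(cum_var, 3))

# Reduce the variables: keep the fewest components retaining at least 90% of the variance
k <- which(cum_var >= 0.90)[1]
scores <- pca$x[, 1:k, drop = FALSE]
dim(scores)

# Reduce the objects: group the rows into 4 clusters (4 is an arbitrary choice)
groups <- cutree(hclust(dist(scale(USArrests))), k = 4)
table(groups)
```

PCA shrinks the number of columns that need to be analysed, while cluster analysis replaces many individual rows by a handful of groups.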
Discriminant analysis: It is used to classify objects into mutually exclusive classes on the basis of pre-existing information. Thus, it helps us to find the class of a newly observed object based on one or more predictor variables. The criterion variable (dependent variable) is measured on a nominal scale.
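A minimal sketch of this idea, assuming the lda() function from the MASS package (a recommended package distributed with R) and the built-in iris data set. The new observation is hypothetical and is made up purely for illustration.

```r
library(MASS)  # provides lda(); assumed to be installed alongside R

# Illustrative only: iris is a built-in data set, not this document's data.
# Species is the nominal criterion variable; the four measurements are predictors.
fit <- lda(Species ~ ., data = iris)

# A hypothetical newly found object described by its predictor values
new_obj <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.4,
                      Petal.Length = 1.5, Petal.Width = 0.2)

pred <- predict(fit, newdata = new_obj)
pred$class      # predicted class of the new object
pred$posterior  # posterior probability of each class
```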
Canonical correlation analysis: Canonical correlation analysis is the best known and most commonly used dependence analysis technique. It is an alternative multivariate method for studying the correlations between the set of dependent variables and the set of independent variables, considering the independence of the dependent variables.
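The following sketch shows what such an analysis can look like with cancor() from base R's stats package. It uses the built-in LifeCycleSavings data set, and the split of its columns into two sets is an assumption made for the example only.

```r
# Illustrative only: LifeCycleSavings is a built-in data set, not this document's data.
# The split into two variable sets below is an assumption for the example.
set_a <- as.matrix(LifeCycleSavings[, c("pop15", "pop75")])
set_b <- as.matrix(LifeCycleSavings[, c("sr", "dpi", "ddpi")])

cc <- cancor(set_a, set_b)
cc$cor    # canonical correlations, largest first
cc$xcoef  # coefficients defining the canonical variates of set_a
cc$ycoef  # coefficients defining the canonical variates of set_b
```

The number of canonical correlations equals the size of the smaller set, here two.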
Multivariate analysis of variance: MANOVA studies the significance of the differences among the groups of objects.
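A short sketch with manova() from base R, again on the built-in iris data set (an assumption for illustration, not this document's data). Two response variables are compared across the three species, and Wilks' lambda is one of several test statistics that summary() can report.

```r
# Illustrative only: iris is a built-in data set, not this document's data.
# Two response variables are compared across the three groups (species).
fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)

res <- summary(fit, test = "Wilks")
res$stats  # Wilks' lambda, approximate F, degrees of freedom and p-value
```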
Data
Usually the observed data are arranged in n rows (number of objects/individuals) and p columns (number of variables), as follows -
\[ X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p}\\ x_{21} & x_{22} & \cdots & x_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}_{n \times p} \]
Reading data (part of a work by Bhuyan and Ghffar) -
df <- as.matrix(read.delim("data/x0 data.txt"))
df
      x1 x2 x3 x4 x5
 [1,] 10  3  5  0 18
 [2,]  3  0 10  5  8
 [3,]  5  0 12  4 15
 [4,]  2  0 14 14  6
 [5,] 12  2  0  0 17
 [6,]  1  0 10  0  3
 [7,]  8  1  5  5 15
 [8,]  7  2  0  0 18
 [9,]  4  0  8  4 10
[10,]  2  0 12 10  5
[11,]  1  0 10  5  3
[12,]  5  1  5  0 12
[13,]  4  0  0  0 11
[14,]  6  2  5  0 13
[15,]  7  1  0  0 10
[16,]  4  1  6  0 12
[17,]  8  2  0  0 15
[18,]  3  1  0  0  8
[19,]  6  1  0  0 14
[20,]  5  0  6  0 12
Here, x1 = No. of children ever born
x2 = No. of dead children under 5
x3 = Father’s level of education (in completed years)
x4 = Mother’s level of education (in completed years)
x5 = Duration of marriage (in years)
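If the text file is not at hand, the same matrix can be rebuilt from the values printed above. The sketch below enters the data column by column with cbind(); this is only a convenience for reproducing the later results, not how the data were originally read.

```r
# Rebuild the 20 x 5 data matrix from the printed values,
# entered column by column (one vector per variable).
df <- cbind(
  x1 = c(10, 3, 5, 2, 12, 1, 8, 7, 4, 2, 1, 5, 4, 6, 7, 4, 8, 3, 6, 5),
  x2 = c(3, 0, 0, 0, 2, 0, 1, 2, 0, 0, 0, 1, 0, 2, 1, 1, 2, 1, 1, 0),
  x3 = c(5, 10, 12, 14, 0, 10, 5, 0, 8, 12, 10, 5, 0, 5, 0, 6, 0, 0, 0, 6),
  x4 = c(0, 5, 4, 14, 0, 0, 5, 0, 4, 10, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  x5 = c(18, 8, 15, 6, 17, 3, 15, 18, 10, 5, 3, 12, 11, 13, 10, 12, 15, 8, 14, 12)
)

dim(df)       # n rows (objects) and p columns (variables)
colnames(df)
```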
Mean Vector (Centroid)
Formula for calculating the mean vector in matrix notation - \(\bar{X} = n^{-1}X'1\), where X is the data matrix of order \(n \times p\)
and 1 is the \(n \times 1\) vector of ones, so that \(1' = [1\ 1\ \cdots\ 1]_{1 \times n}\)
Mean vector -
apply(df, 2, mean)
   x1    x2    x3    x4    x5
 5.15  0.85  5.40  2.35 11.25
This mean vector is also called centroid.
Manual code that implements the matrix formula -
I1 <- matrix(rep(1, nrow(df)), nrow = nrow(df))
centroid <- 1/nrow(df) * t(df) %*% I1
centroid
    [,1]
x1  5.15
x2  0.85
x3  5.40
x4  2.35
x5 11.25
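As a cross-check, the matrix formula can be compared with base R's colMeans(), which computes the same column means directly. The sketch rebuilds the data from the printed values so that it runs on its own; the object names are chosen for this example only.

```r
# Data rebuilt from the values printed in the Data section
df <- cbind(
  x1 = c(10, 3, 5, 2, 12, 1, 8, 7, 4, 2, 1, 5, 4, 6, 7, 4, 8, 3, 6, 5),
  x2 = c(3, 0, 0, 0, 2, 0, 1, 2, 0, 0, 0, 1, 0, 2, 1, 1, 2, 1, 1, 0),
  x3 = c(5, 10, 12, 14, 0, 10, 5, 0, 8, 12, 10, 5, 0, 5, 0, 6, 0, 0, 0, 6),
  x4 = c(0, 5, 4, 14, 0, 0, 5, 0, 4, 10, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  x5 = c(18, 8, 15, 6, 17, 3, 15, 18, 10, 5, 3, 12, 11, 13, 10, 12, 15, 8, 14, 12)
)

n <- nrow(df)
ones <- matrix(1, nrow = n, ncol = 1)        # the n x 1 vector of ones

centroid_matrix <- crossprod(df, ones) / n   # X'1 / n, a p x 1 matrix
centroid_direct <- colMeans(df)              # same means as a named vector

# Compare the two results after dropping the matrix dimension
all.equal(drop(centroid_matrix), centroid_direct)
```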
Variance-Covariance Matrix
\(\Sigma\) represents the population variance-covariance matrix.
R’s built-in function var() gives the sample variance-covariance matrix -
var(df)
          x1         x2         x3        x4         x5
x1  8.555263  2.1815789  -8.589474 -5.160526  11.907895
x2  2.181579  0.8710526  -2.673684 -1.839474   3.144737
x3 -8.589474 -2.6736842  22.989474 14.063158 -12.473684
x4 -5.160526 -1.8394737  14.063158 15.397368  -8.671053
x5 11.907895  3.1447368 -12.473684 -8.671053  21.355263
R’s var() always uses the divisor n - 1, so the population version (divisor n) is obtained by rescaling the sample variance-covariance matrix by (n - 1)/n.
So, code for population variance-covariance matrix -
var(df)*(nrow(df)-1)/nrow(df)
        x1      x2     x3      x4       x5
x1  8.1275  2.0725  -8.16 -4.9025  11.3125
x2  2.0725  0.8275  -2.54 -1.7475   2.9875
x3 -8.1600 -2.5400  21.84 13.3600 -11.8500
x4 -4.9025 -1.7475  13.36 14.6275  -8.2375
x5 11.3125  2.9875 -11.85 -8.2375  20.2875
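The rescaling can be cross-checked against cov.wt() from the stats package, whose method = "ML" option uses the divisor n when the observations are equally weighted. The sketch below rebuilds the data from the printed values so that it is self-contained; the object names are chosen for this example.

```r
# Data rebuilt from the values printed in the Data section
df <- cbind(
  x1 = c(10, 3, 5, 2, 12, 1, 8, 7, 4, 2, 1, 5, 4, 6, 7, 4, 8, 3, 6, 5),
  x2 = c(3, 0, 0, 0, 2, 0, 1, 2, 0, 0, 0, 1, 0, 2, 1, 1, 2, 1, 1, 0),
  x3 = c(5, 10, 12, 14, 0, 10, 5, 0, 8, 12, 10, 5, 0, 5, 0, 6, 0, 0, 0, 6),
  x4 = c(0, 5, 4, 14, 0, 0, 5, 0, 4, 10, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  x5 = c(18, 8, 15, 6, 17, 3, 15, 18, 10, 5, 3, 12, 11, 13, 10, 12, 15, 8, 14, 12)
)

n <- nrow(df)
S_rescaled <- var(df) * (n - 1) / n        # sample version rescaled to divisor n
S_ml <- cov.wt(df, method = "ML")$cov      # divisor n with the default equal weights

all.equal(S_rescaled, S_ml)
```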
Formula for population variance-covariance matrix -
\(\Sigma=n^{-1}X'X - \bar{X}\bar{X}'\)
\(\Rightarrow\ \Sigma=n^{-1}X'HX\)
where, \(H = I - n^{-1}11'\)
\(I\) is an identity matrix of order \(n \times n\)
\(1' = [1\ 1\ \cdots\ 1]_{1 \times n}\)
and, \[
\Sigma=
\begin{bmatrix}
Var(x_1) & Cov(x_1, x_2) & \cdots & Cov(x_1, x_p) \\
Cov(x_2, x_1) & Var(x_2) & \cdots & Cov(x_2, x_p)\\
\vdots & \vdots & \ddots & \vdots\\
Cov(x_p, x_1) & Cov(x_p, x_2) & \cdots & Var(x_p)
\end{bmatrix}_{p \times p}
\]
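The step between the two forms of \(\Sigma\) can be verified by expanding \(H\) and using the mean vector \(\bar{X} = n^{-1}X'1\), whose transpose is \(\bar{X}' = n^{-1}1'X\):
\[
n^{-1}X'HX = n^{-1}X'\left(I - n^{-1}11'\right)X = n^{-1}X'X - \left(n^{-1}X'1\right)\left(n^{-1}1'X\right) = n^{-1}X'X - \bar{X}\bar{X}'
\]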
Implemented code of the formula for population variance-covariance matrix -
S <- nrow(df)^(-1)*t(df)%*%df - apply(df, 2, mean)%*%t(apply(df, 2, mean))
S
        x1      x2     x3      x4       x5
x1  8.1275  2.0725  -8.16 -4.9025  11.3125
x2  2.0725  0.8275  -2.54 -1.7475   2.9875
x3 -8.1600 -2.5400  21.84 13.3600 -11.8500
x4 -4.9025 -1.7475  13.36 14.6275  -8.2375
x5 11.3125  2.9875 -11.85 -8.2375  20.2875
Another way,
I1 <- matrix(rep(1, nrow(df)), nrow = nrow(df))
H <- diag(nrow(df)) - 1/nrow(df) * I1 %*% t(I1)
S <- 1/nrow(df) * t(df) %*% H %*% df
S
        x1      x2     x3      x4       x5
x1  8.1275  2.0725  -8.16 -4.9025  11.3125
x2  2.0725  0.8275  -2.54 -1.7475   2.9875
x3 -8.1600 -2.5400  21.84 13.3600 -11.8500
x4 -4.9025 -1.7475  13.36 14.6275  -8.2375
x5 11.3125  2.9875 -11.85 -8.2375  20.2875
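The matrix H acts as a centering matrix: multiplying the data by it subtracts the column means. The sketch below, with data rebuilt from the printed values and object names chosen for this example, checks that property and shows an equivalent computation that avoids building the n x n matrix H.

```r
# Data rebuilt from the values printed in the Data section
df <- cbind(
  x1 = c(10, 3, 5, 2, 12, 1, 8, 7, 4, 2, 1, 5, 4, 6, 7, 4, 8, 3, 6, 5),
  x2 = c(3, 0, 0, 0, 2, 0, 1, 2, 0, 0, 0, 1, 0, 2, 1, 1, 2, 1, 1, 0),
  x3 = c(5, 10, 12, 14, 0, 10, 5, 0, 8, 12, 10, 5, 0, 5, 0, 6, 0, 0, 0, 6),
  x4 = c(0, 5, 4, 14, 0, 0, 5, 0, 4, 10, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  x5 = c(18, 8, 15, 6, 17, 3, 15, 18, 10, 5, 3, 12, 11, 13, 10, 12, 15, 8, 14, 12)
)

n <- nrow(df)
ones <- matrix(1, nrow = n, ncol = 1)
H <- diag(n) - ones %*% t(ones) / n    # H = I - 11'/n

# H is symmetric and idempotent
all.equal(H, t(H))
all.equal(H %*% H, H)

# HX is the column-centered data matrix
Xc <- scale(df, center = TRUE, scale = FALSE)
all.equal(H %*% df, Xc, check.attributes = FALSE)

# Hence X'HX / n equals the cross-product of the centered data divided by n
S_H <- t(df) %*% H %*% df / n
S_centered <- crossprod(Xc) / n
all.equal(S_H, S_centered, check.attributes = FALSE)
```

For large n the centered cross-product is the cheaper route, since H has n^2 entries.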
The formula for an unbiased estimator of the variance-covariance matrix is as follows -
\(S_u = (n-1)^{-1}X'HX\)
\(\Rightarrow S_u = n(n-1)^{-1}\Sigma\)
where the population variance-covariance matrix is \(\Sigma=n^{-1}X'HX\)
Manual code for sample variance-covariance matrix -
I1 <- matrix(rep(1, nrow(df)), nrow = nrow(df))
H <- diag(nrow(df)) - 1/nrow(df) * I1 %*% t(I1)
Su <- 1/(nrow(df)-1) * t(df) %*% H %*% df
Su
          x1         x2         x3        x4         x5
x1  8.555263  2.1815789  -8.589474 -5.160526  11.907895
x2  2.181579  0.8710526  -2.673684 -1.839474   3.144737
x3 -8.589474 -2.6736842  22.989474 14.063158 -12.473684
x4 -5.160526 -1.8394737  14.063158 15.397368  -8.671053
x5 11.907895  3.1447368 -12.473684 -8.671053  21.355263
This is the same as the result of var() -
var(df)
          x1         x2         x3        x4         x5
x1  8.555263  2.1815789  -8.589474 -5.160526  11.907895
x2  2.181579  0.8710526  -2.673684 -1.839474   3.144737
x3 -8.589474 -2.6736842  22.989474 14.063158 -12.473684
x4 -5.160526 -1.8394737  14.063158 15.397368  -8.671053
x5 11.907895  3.1447368 -12.473684 -8.671053  21.355263
Correlation Matrix
In R correlation coefficients can be easily calculated using -
cor(df)
           x1         x2         x3         x4         x5
x1  1.0000000  0.7991569 -0.6124707 -0.4496287  0.8809796
x2  0.7991569  1.0000000 -0.5974800 -0.5022826  0.7291379
x3 -0.6124707 -0.5974800  1.0000000  0.7474723 -0.5629603
x4 -0.4496287 -0.5022826  0.7474723  1.0000000 -0.4781852
x5  0.8809796  0.7291379 -0.5629603 -0.4781852  1.0000000
Matrix formula for the calculation -
\(R=D^{-1}SD^{-1}\)
where,
S is the population variance-covariance matrix and
\(D=\sqrt{diag(S)}\), i.e. the diagonal matrix whose diagonal entries are the standard deviations \(\sqrt{Var(x_i)}\)
and, \[
R=
\begin{bmatrix}
r_{11} & r_{12} & \cdots & r_{1p}\\
r_{21} & r_{22} & \cdots & r_{2p}\\
\vdots & \vdots & \ddots & \vdots\\
r_{p1} & r_{p2} & \cdots & r_{pp}
\end{bmatrix}_{p \times p}
\]
Manual code -
D <- sqrt(diag(S))
solve(diag(D)) %*% S %*% solve(diag(D))
           [,1]       [,2]       [,3]       [,4]       [,5]
[1,]  1.0000000  0.7991569 -0.6124707 -0.4496287  0.8809796
[2,]  0.7991569  1.0000000 -0.5974800 -0.5022826  0.7291379
[3,] -0.6124707 -0.5974800  1.0000000  0.7474723 -0.5629603
[4,] -0.4496287 -0.5022826  0.7474723  1.0000000 -0.4781852
[5,]  0.8809796  0.7291379 -0.5629603 -0.4781852  1.0000000
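The manual result can be compared with cor() and with base R's cov2cor(), which converts a covariance matrix to a correlation matrix. The sketch below rebuilds the data from the printed values; it also checks that the population and sample covariance matrices lead to the same correlations, because the factor n/(n - 1) cancels in \(D^{-1}SD^{-1}\). Object names are chosen for this example.

```r
# Data rebuilt from the values printed in the Data section
df <- cbind(
  x1 = c(10, 3, 5, 2, 12, 1, 8, 7, 4, 2, 1, 5, 4, 6, 7, 4, 8, 3, 6, 5),
  x2 = c(3, 0, 0, 0, 2, 0, 1, 2, 0, 0, 0, 1, 0, 2, 1, 1, 2, 1, 1, 0),
  x3 = c(5, 10, 12, 14, 0, 10, 5, 0, 8, 12, 10, 5, 0, 5, 0, 6, 0, 0, 0, 6),
  x4 = c(0, 5, 4, 14, 0, 0, 5, 0, 4, 10, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  x5 = c(18, 8, 15, 6, 17, 3, 15, 18, 10, 5, 3, 12, 11, 13, 10, 12, 15, 8, 14, 12)
)

n <- nrow(df)
S_samp <- var(df)                  # divisor n - 1
S_pop <- S_samp * (n - 1) / n      # divisor n

# Inverse of the diagonal matrix of standard deviations
D_inv <- diag(1 / sqrt(diag(S_pop)))
R_manual <- D_inv %*% S_pop %*% D_inv
dimnames(R_manual) <- dimnames(S_pop)  # the products with diag() drop the names

all.equal(R_manual, cor(df))
all.equal(cov2cor(S_pop), cov2cor(S_samp))
```

Since a diagonal matrix is inverted by taking reciprocals of its diagonal, diag(1 / D) gives the same matrix as solve(diag(D)) without a general matrix inversion.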