Multivariate Data Analytics

Department of Statistics

Ronald WESONGA (PhD)

13 September 2022

Outline

  1. Introduction
  2. Data Organization
  3. Descriptive Statistics
  4. Presentation
  5. Graphics
  6. Distance

Introduction

Objective                 Description
Data reduction            index of measurement
Sorting and grouping      develop clusters; screen procedures; separation rules
Investigate dependence    identify responsible factors; examine relationships
Predict                   predict outcomes; identify risks
Test hypotheses           test for differences

How to Organize Data

Through the VAM system:

\[ X_{np} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{j1} & x_{j2} & \cdots & x_{jp} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \]

Example

\[ \begin{bmatrix} Sales(OMR) &| & 32 & 42 & 52 & 48 & 58 \\ \hline Books(No.) &| & 2 & 4 & 6 & 3 & 6 \\ \end{bmatrix} \]

\[ X_{5 \times 2} = \begin{bmatrix} 32 & 2 \\ 42 & 4 \\ 52 & 6 \\ 48 & 3 \\ 58 & 6 \end{bmatrix} \]
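The same arrangement can be sketched in code. The snippet below is a minimal illustration, assuming NumPy is available; the variable names `sales`, `books`, and `X` are chosen here for illustration and do not come from the slides.

```python
import numpy as np

# Rows are the n = 5 items (observations); columns are the p = 2 variables.
sales = [32, 42, 52, 48, 58]   # Sales (OMR)
books = [2, 4, 6, 3, 6]        # Books (No.)

# column_stack places each variable in its own column, giving an n x p array.
X = np.column_stack([sales, books])

n, p = X.shape
print(X)
print("n =", n, "p =", p)
```

Entry `X[j - 1, k - 1]` then plays the role of \(x_{jk}\), because NumPy indexes from zero.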

Exercise. You are required to organize the data:

\[ \begin{bmatrix} Age(Yrs.) &| & 23 & 20 & 22 & 20 \\ \hline Mark(No.) &| & 82 & 4 & 6 & 3 \\ \hline Height(M) &| & 1.53 & 1.58 & 1.60 & 1.65 \\ \end{bmatrix} \]

Can We Generate Descriptive Statistics for MD?

Yes, we can. Let the measurements on the first variable be \(x_{11}, x_{21}, \cdots, x_{n1}\), and those on the \(i^{th}\) variable be \(x_{1i}, x_{2i}, \cdots, x_{ji}, \cdots, x_{ni}\). Thus,

\[ X_{n1} = \begin{bmatrix} x_{11} \\ \vdots \\ x_{j1} \\ \vdots \\ x_{n1} \end{bmatrix} \quad X_{ni} = \begin{bmatrix} x_{1i} \\ \vdots \\ x_{ji} \\ \vdots \\ x_{ni} \end{bmatrix}\]

Multivariate Descriptive Analysis (MDA)

\[\bar{X}_k = \frac{1}{n}\sum_{j=1}^n X_{jk}\]

\[S_{kk}=S_k^2=\frac{1}{n}\sum_{j=1}^n \left(X_{jk}-\bar{X}_k\right)^2\]

\[S_{ik}=\frac{1}{n}\sum_{j=1}^n \left(X_{ji}-\bar{X}_i\right)\left(X_{jk}-\bar{X}_k\right)\]

\[r_{ik}=\frac{\sum_{j=1}^n \left(X_{ji}-\bar{X}_i\right)\left(X_{jk}-\bar{X}_k\right)}{\sqrt{\sum_{j=1}^n \left(X_{ji}-\bar{X}_i\right)^2}\sqrt{\sum_{j=1}^n \left(X_{jk}-\bar{X}_k\right)^2}}\]
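The formulas above translate almost line by line into array operations. The following is a minimal sketch, assuming NumPy; the function names `sample_mean`, `sample_cov`, and `sample_corr` are hypothetical helpers written for this illustration, and the data are the five Sales/Books observations from the earlier example.

```python
import numpy as np

def sample_mean(X):
    """Mean of each column: (1/n) * sum over j of X[j, k]."""
    return X.sum(axis=0) / X.shape[0]

def sample_cov(X):
    """Covariance matrix with divisor n, matching S_ik above."""
    n = X.shape[0]
    D = X - sample_mean(X)          # deviations from the column means
    return D.T @ D / n

def sample_corr(X):
    """Correlation matrix: r_ik = S_ik / sqrt(S_ii * S_kk)."""
    S = sample_cov(X)
    s = np.sqrt(np.diag(S))         # standard deviations
    return S / np.outer(s, s)

X = np.array([[32, 2], [42, 4], [52, 6], [48, 3], [58, 6]], dtype=float)

xbar = sample_mean(X)
S = sample_cov(X)
R = sample_corr(X)
print(xbar, S, R, sep="\n")
```

Because the divisor \(1/n\) appears in both the numerator and the denominator of \(r_{ik}\), it cancels, which is why `sample_corr` can be computed directly from `S`.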

How Can We Present MDA?

\[ \bar{X} = \begin{bmatrix} \bar{X}_{1} \\ \vdots \\ \bar{X}_{i} \\ \vdots \\ \bar{X}_{p} \end{bmatrix}\]

\[ S_{n} = \begin{bmatrix} S_{11} & S_{12} & \cdots & S_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ S_{i1} & S_{i2} & \cdots & S_{ip} \\ \vdots & \vdots & \ddots & \vdots \\ S_{p1} & S_{p2} & \cdots & S_{pp} \end{bmatrix} \quad R = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{i1} & r_{i2} & \cdots & r_{ip} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix}\]

A Bivariate Example

\[ X_{4 \times 2} = \begin{bmatrix} 42 & 4 \\ 52 & 6 \\ 48 & 3 \\ 58 & 6 \end{bmatrix} \] \[\bar{X}_1=\frac{1}{4}\sum_{j=1}^4X_{j1}=50;~~~~ \bar{X}_2=\frac{1}{4}\sum_{j=1}^4X_{j2}=4.75\]

\[ S_{4} = \begin{bmatrix} 34 & 5.5 \\ 5.5 & 1.6875 \\ \end{bmatrix} \quad R = \begin{bmatrix} 1 & 0.73 \\ 0.73 & 1 \\ \end{bmatrix} \]
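The mean vector, \(S_n\), and \(R\) for the four data rows shown above can also be obtained with NumPy's built-in routines. This is a sketch under the assumption that NumPy is available; `bias=True` is passed so that `np.cov` uses the divisor \(n\) rather than its default \(n-1\).

```python
import numpy as np

# The four observations of the bivariate example (rows = items).
X = np.array([[42, 4], [52, 6], [48, 3], [58, 6]], dtype=float)

xbar = X.mean(axis=0)                     # mean vector
S = np.cov(X, rowvar=False, bias=True)    # covariance matrix, divisor n
R = np.corrcoef(X, rowvar=False)          # correlation matrix

print("mean vector:", xbar)
print("S_n:\n", S)
print("R:\n", np.round(R, 2))
```

`rowvar=False` tells NumPy that variables are stored in columns, which matches the \(n \times p\) layout used throughout.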

Multivariate Data Graphics (MDGs)

Distance Measurement of MD using Euclidean Distance Theory

If \(P(x_1,x_2, \cdots,x_p)\) and \(O(0,0,\cdots,0)\) are two points, then

\[d(O,P)=\sqrt{x_1^2+x_2^2+\cdots+x_p^2}\] and if \(P(x_1,x_2, \cdots,x_p)\) and \(Q(y_1,y_2, \cdots,y_p)\) are any two points, then

\[d(P,Q)=\sqrt{(x_1-y_1)^2+(x_2-y_2)^2+\cdots+(x_p-y_p)^2}\]
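As a small illustration of the two cases, the sketch below computes the Euclidean distance from the origin and between two points. It assumes NumPy; the function name `euclidean_distance` is a hypothetical helper, and the points are chosen only for illustration.

```python
import numpy as np

def euclidean_distance(p, q=None):
    """Euclidean distance between p and q; q defaults to the origin O."""
    p = np.asarray(p, dtype=float)
    q = np.zeros_like(p) if q is None else np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

d_origin = euclidean_distance([3, 4])            # distance from O to P(3, 4)
d_points = euclidean_distance([42, 4], [52, 6])  # distance between two points

print(d_origin)   # 5.0
print(d_points)   # square root of 10^2 + 2^2 = 104
```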

Distance Measurement of MD using Statistical Distance Theory

If \(P(x_1,x_2, \cdots,x_p)\) and \(O(0,0,\cdots,0)\) are two points, then

\[d(O,P)=\sqrt{\left(\frac{x_1}{\sqrt{s_{11}}}\right)^2+\cdots+\left(\frac{x_p}{\sqrt{s_{pp}}}\right)^2}\] and if \(P(x_1,x_2, \cdots,x_p)\) and \(Q(y_1,y_2, \cdots,y_p)\) are any two points, then

\[d(P,Q)=\sqrt{\left(\frac{x_1-y_1}{\sqrt{s_{11}}}\right)^2+\cdots+\left(\frac{x_p-y_p}{\sqrt{s_{pp}}}\right)^2}\]
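The statistical distance differs from the Euclidean one only in that each squared coordinate difference is divided by that variable's variance. A minimal sketch follows, assuming NumPy; `statistical_distance` is a hypothetical helper, and the variances are taken (with divisor \(n\)) from the four rows of the bivariate example.

```python
import numpy as np

def statistical_distance(p, q, variances):
    """Distance with each coordinate difference scaled by its variance s_kk."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2 / np.asarray(variances, dtype=float))))

X = np.array([[42, 4], [52, 6], [48, 3], [58, 6]], dtype=float)
variances = X.var(axis=0)      # s_11 and s_22; NumPy's default divisor is n

# Distance between the first two observations, P = X[0] and Q = X[1].
d_pq = statistical_distance(X[0], X[1], variances)

# Distance from the origin O to P = X[0].
d_op = statistical_distance(X[0], np.zeros(2), variances)

print(d_pq, d_op)
```

Dividing by the variances makes a unit of difference count for less on a variable that fluctuates widely, so the variables contribute on a comparable scale.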

Important Note About Distance Measurements

When the coordinate axes are rotated through an angle \(\theta\), the distance from \(P(\tilde{x}_1, \tilde{x}_2)\) to \(O(0,0)\) is

\[d(O,P)=\sqrt{\frac{\tilde{x}_1^2}{\tilde{s}_{11}}+\frac{\tilde{x}_2^2}{\tilde{s}_{22}}}\]

where

\[\tilde{x}_1=x_1\cos\theta+x_2\sin\theta\]

\[\tilde{x}_2=-x_1\sin\theta+x_2\cos\theta\]

implying that, in the original coordinates,

\[d(O,P)=\sqrt{a_{11}x_1^2+2a_{12}x_1x_2+a_{22}x_2^2}\]

and, for two points \(P(x_1,x_2)\) and \(Q(y_1,y_2)\),

\[d(P,Q)=\sqrt{a_{11}(x_1-y_1)^2+2a_{12}(x_1-y_1)(x_2-y_2)+a_{22}(x_2-y_2)^2}\]

where:

\[{\scriptstyle a_{11}=\frac{\cos^2\theta}{\cos^2\theta\, s_{11}+2\sin\theta \cos\theta\, s_{12}+\sin^2\theta\, s_{22}} + \frac{\sin^2\theta}{\cos^2\theta\, s_{22}-2\sin\theta \cos\theta\, s_{12}+\sin^2\theta\, s_{11}}} \]

\[{\scriptstyle a_{22}=\frac{\sin^2\theta}{\cos^2\theta\, s_{11}+2\sin\theta \cos\theta\, s_{12}+\sin^2\theta\, s_{22}} + \frac{\cos^2\theta}{\cos^2\theta\, s_{22}-2\sin\theta \cos\theta\, s_{12}+\sin^2\theta\, s_{11}}}\]

\[{\scriptstyle a_{12}=\frac{\cos\theta \sin\theta}{\cos^2\theta\, s_{11}+2\sin\theta \cos\theta\, s_{12}+\sin^2\theta\, s_{22}} - \frac{\sin\theta \cos\theta}{\cos^2\theta\, s_{22}-2\sin\theta \cos\theta\, s_{12}+\sin^2\theta\, s_{11}}}\]
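A numerical check can make the link between the rotated and the original coordinates concrete. The sketch below assumes NumPy and uses illustrative values for \(s_{11}\), \(s_{12}\), \(s_{22}\), \(\theta\), and the point \(P\); none of these numbers come from the slides. The coefficients are obtained by substituting \(\tilde{x}_1\) and \(\tilde{x}_2\) into the rotated expression and collecting terms, so the two fractions in the cross-term coefficient \(a_{12}\) enter with opposite signs.

```python
import numpy as np

def quadratic_coefficients(s11, s12, s22, theta):
    """Coefficients a11, a12, a22 plus the rotated variances st11, st22."""
    c, s = np.cos(theta), np.sin(theta)
    st11 = c**2 * s11 + 2 * s * c * s12 + s**2 * s22   # variance of x~1
    st22 = c**2 * s22 - 2 * s * c * s12 + s**2 * s11   # variance of x~2
    a11 = c**2 / st11 + s**2 / st22
    a22 = s**2 / st11 + c**2 / st22
    a12 = c * s / st11 - s * c / st22                  # cross-term coefficient
    return a11, a12, a22, st11, st22

# Illustrative (assumed) values, not taken from the slides.
s11, s12, s22, theta = 4.0, 1.0, 2.0, np.pi / 6
x1, x2 = 1.5, -0.5

a11, a12, a22, st11, st22 = quadratic_coefficients(s11, s12, s22, theta)

# Distance computed in the rotated coordinates.
xt1 = x1 * np.cos(theta) + x2 * np.sin(theta)
xt2 = -x1 * np.sin(theta) + x2 * np.cos(theta)
d_rotated = np.sqrt(xt1**2 / st11 + xt2**2 / st22)

# The same distance computed in the original coordinates.
d_quadratic = np.sqrt(a11 * x1**2 + 2 * a12 * x1 * x2 + a22 * x2**2)

print(d_rotated, d_quadratic)
```

Both printed values should agree up to floating-point rounding, which confirms that the quadratic form in \(x_1, x_2\) is the rotated distance rewritten in the original coordinates.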

Properties of distance measure

  1. \(d(P,Q)=d(Q,P)\)
  2. \(d(P,Q)>0~~\forall~~ P\ne Q\)
  3. \(d(P,Q)=0~~\forall~~ P = Q\)
  4. \(d(P,Q)\le d(P,R)+d(R,Q)\), commonly referred to as the triangle inequality.
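The four properties can be spot-checked numerically for a given distance. The sketch below does so for the statistical distance, assuming NumPy; the points \(P\), \(Q\), \(R\) and the variances are illustrative values, and a check on three points is of course not a proof.

```python
import numpy as np

def distance(p, q, variances):
    """Statistical distance between p and q for the given variances."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2 / variances)))

variances = np.array([4.0, 1.0])                 # assumed s_11, s_22
P, Q, R = [1.0, 2.0], [3.0, -1.0], [0.0, 0.5]    # assumed points

symmetric = np.isclose(distance(P, Q, variances), distance(Q, P, variances))
positive = distance(P, Q, variances) > 0          # P differs from Q
zero_self = distance(P, P, variances) == 0        # P equals P
triangle = distance(P, Q, variances) <= (
    distance(P, R, variances) + distance(R, Q, variances)
)

print(symmetric, positive, zero_self, triangle)
```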