Problem

Consider five measurements of normal patients and diabetics in the “Diabetes” data set from the “heplots” package. The variables are described in Part 1 of the R Code section.

The original data are from Reaven and Miller (1979, Diabetologia). Focus on the three x variables of main interest: glutest, instest and sspg.

Answer the following questions:

  1. Are the 3-dimensional data normally distributed? Why?

  2. Compute the sample mean and covariance matrix of the x variables.

  3. Compute the eigenvalues and eigenvectors of the sample covariance matrix.

Theory

If a \(p \times 1\) random vector \(X\) follows a multivariate normal distribution with population mean vector \(\mu\) and population variance-covariance matrix \(\Sigma\), then \(X\) has the joint density function

\[\phi(\textbf{x})=\left(\frac{1}{2\pi}\right)^{p/2}|\Sigma|^{-1/2}\exp\{-\frac{1}{2}(\textbf{x}-\mathbf{\mu})'\Sigma^{-1}(\textbf{x}-\mathbf{\mu})\}\]

Here \(|\Sigma|\) denotes the determinant of the variance-covariance matrix \(\Sigma\), and \(\Sigma^{-1}\) is its inverse. The density attains its maximum when \(\textbf{x}\) equals the mean vector \(\mu\) and decreases as \(\textbf{x}\) moves away from it.
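To make the formula concrete, the following is a minimal sketch that evaluates the density directly in base R. The function name `dmvn_manual` and the example values of `mu` and `Sigma` are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical helper: multivariate normal density evaluated from the formula
dmvn_manual <- function(x, mu, Sigma) {
  p <- length(mu)
  dev <- x - mu                                # deviation from the mean vector
  quad <- drop(t(dev) %*% solve(Sigma, dev))   # (x - mu)' Sigma^{-1} (x - mu)
  (2 * pi)^(-p / 2) * det(Sigma)^(-1 / 2) * exp(-quad / 2)
}

# Illustrative values (assumed, not taken from the Diabetes data)
mu <- c(0, 0)
Sigma <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)

dmvn_manual(mu, mu, Sigma)        # density at the mean, where it is largest
dmvn_manual(c(1, 1), mu, Sigma)   # smaller, since x is away from the mean
```

Using `solve(Sigma, dev)` instead of forming the inverse explicitly is a design choice for numerical stability; the result is the same quadratic form.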

The shorthand notation, similar to the univariate case, is \[\mathbf{X} \sim N(\mathbf{\mu},\Sigma)\]

If the components of the random vector are uncorrelated, then the variance-covariance matrix is diagonal: \[\Sigma = \left(\begin{array}{cccc}\sigma^2_1 & 0 & \dots & 0\\ 0 & \sigma^2_2 & \dots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \dots & \sigma^2_p \end{array}\right)\]
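With a diagonal \(\Sigma\), the determinant is the product of the variances and the quadratic form is a sum of squared standardized deviations, so the joint density factorizes into univariate normal densities. A short derivation, writing \(x_j\) and \(\mu_j\) for the \(j\)-th components of \(\textbf{x}\) and \(\mu\) (notation introduced here for illustration):

\[|\Sigma| = \prod_{j=1}^{p}\sigma_j^2, \qquad (\textbf{x}-\mathbf{\mu})'\Sigma^{-1}(\textbf{x}-\mathbf{\mu}) = \sum_{j=1}^{p}\frac{(x_j-\mu_j)^2}{\sigma_j^2}\]

\[\phi(\textbf{x}) = \prod_{j=1}^{p}\frac{1}{\sqrt{2\pi\sigma_j^2}}\exp\left\{-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right\}\]

For a multivariate normal vector, uncorrelated components are therefore also independent.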

R Code

Part 1

library("heplots")
## Loading required package: car
## Loading required package: carData
library(ggpubr) 
## Loading required package: ggplot2
library(ggplot2)

data(Diabetes,package = "heplots")
head(Diabetes) # The actual dataset from which we will extract columns
##   relwt glufast glutest instest sspg  group
## 1  0.81      80     356     124   55 Normal
## 2  0.95      97     289     117   76 Normal
## 3  0.94     105     319     143  105 Normal
## 4  1.04      90     356     199  108 Normal
## 5  1.00      90     323     240  143 Normal
## 6  0.76      86     381     157  165 Normal
str(Diabetes)
## 'data.frame':    145 obs. of  6 variables:
##  $ relwt  : num  0.81 0.95 0.94 1.04 1 0.76 0.91 1.1 0.99 0.78 ...
##  $ glufast: int  80 97 105 90 90 86 100 85 97 97 ...
##  $ glutest: int  356 289 319 356 323 381 350 301 379 296 ...
##  $ instest: int  124 117 143 199 240 157 221 186 142 131 ...
##  $ sspg   : int  55 76 105 108 143 165 119 105 98 94 ...
##  $ group  : Factor w/ 3 levels "Normal","Chemical_Diabetic",..: 1 1 1 1 1 1 1 1 1 1 ...
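Y <- Diabetes[,1:2]   # relwt and glufast (not used in the analysis below)

X <- Diabetes[,3:5]   # the x variables of interest: glutest, instest, sspg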

The variables are:

  • relwt: relative weight, expressed as the ratio of actual weight to expected weight, given the person’s height

  • glufast: fasting plasma glucose level

  • glutest: test plasma glucose level, a measure of glucose intolerance

  • instest: plasma insulin during test, a measure of insulin response to oral glucose

  • sspg: steady state plasma glucose, a measure of insulin resistance

  • group: diagnostic group

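# Normal Q-Q plot of each x variable. Colours are set as constants on the
# geoms (not mapped as aesthetics), so the points are actually drawn in salmon.
A <- ggplot(X, aes(sample = glutest)) +
  geom_qq(colour = "salmon") + geom_qq_line(colour = "green") +
  ggtitle("Glucose Intolerance")
A

B <- ggplot(X, aes(sample = instest)) +
  geom_qq(colour = "salmon") + geom_qq_line(colour = "green") +
  ggtitle("Insulin response to oral glucose")
B

C <- ggplot(X, aes(sample = sspg)) +
  geom_qq(colour = "salmon") + geom_qq_line(colour = "green") +
  ggtitle("Insulin resistance")
C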
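The three Q-Q plots above assess each variable separately, and marginal normality alone does not establish joint normality. One common complement is a chi-square Q-Q plot of the squared Mahalanobis distances, which should be approximately chi-square distributed with 3 degrees of freedom if the data are trivariate normal. The sketch below is an addition rather than part of the original analysis; the helper name `chisq_qq` is hypothetical and only base R functions are used.

```r
# Hypothetical helper: pairs sorted squared Mahalanobis distances with
# the matching chi-square quantiles (df = number of variables)
chisq_qq <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  d2 <- mahalanobis(X, center = colMeans(X), cov = var(X))
  data.frame(theoretical = qchisq(ppoints(n), df = p),
             observed    = sort(d2))
}

# Usage with the X defined earlier (columns glutest, instest, sspg):
# qq <- chisq_qq(X)
# plot(qq$theoretical, qq$observed,
#      xlab = "Chi-square quantiles", ylab = "Squared Mahalanobis distance")
# abline(0, 1)   # points close to this line are consistent with joint normality
```

Systematic departures from the reference line, especially in the upper tail, would suggest that the joint distribution is not multivariate normal.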

Parts 2 and 3

colMeans(X) #sample mean
##  glutest  instest     sspg 
## 543.6138 186.1172 184.2069
var(X)      #cov matrix
##           glutest     instest       sspg
## glutest 100457.85 -12918.1627 25908.4902
## instest -12918.16  14625.3125   101.4825
## sspg     25908.49    101.4825 11242.3319
eigen(var(X))   # eigenvalues and eigenvectors of the sample covariance matrix
## eigen() decomposition
## $values
## [1] 109078.28  14170.86   3076.36
## 
## $vectors
##            [,1]       [,2]       [,3]
## [1,]  0.9584067 0.03568944 -0.2831658
## [2,] -0.1308070 0.93673946 -0.3246671
## [3,]  0.2536654 0.34820318  0.9024458
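As a sanity check on the decomposition, the sketch below rebuilds the covariance matrix from its eigenvalues and eigenvectors and compares the sum of the eigenvalues with the total variance. It is an illustrative addition: the matrix `S` is typed in from the printed `var(X)` output above, so it carries that output's rounding, and the names `S`, `e` and `S_rebuilt` are assumptions.

```r
# Sample covariance matrix, copied from the printed var(X) output
# (rows/columns in the order glutest, instest, sspg)
S <- matrix(c(100457.85,  -12918.1627, 25908.4902,
              -12918.1627, 14625.3125,   101.4825,
               25908.4902,   101.4825, 11242.3319),
            nrow = 3, byrow = TRUE)

e <- eigen(S)

# Spectral decomposition: S = V diag(lambda) V'
S_rebuilt <- e$vectors %*% diag(e$values) %*% t(e$vectors)
max(abs(S - S_rebuilt))         # should be numerically negligible

# The eigenvalues sum to the total variance (the trace of S)
sum(e$values) - sum(diag(S))    # should be close to 0

# Share of the total variance along each eigenvector
round(e$values / sum(e$values), 3)
```

The last line gives the proportion of total variance carried by each eigenvector direction, which is the quantity a principal component analysis of the covariance matrix would report.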
sessioninfo::platform_info()
##  setting  value                       
##  version  R version 3.6.1 (2019-07-05)
##  os       Ubuntu 19.10                
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_IN:en                    
##  collate  en_IN.UTF-8                 
##  ctype    en_IN.UTF-8                 
##  tz       Asia/Kolkata                
##  date     2021-01-03

References

  1. Friendly, M. (2020). Diabetes data: heplots and candisc examples. Retrieved 20 September 2020, from https://cran.r-project.org/web/packages/candisc/vignettes/diabetes.html

  2. Lesson 4: Multivariate Normal Distribution. (2020). Retrieved 20 September 2020, from https://online.stat.psu.edu/stat505/book/export/html/636