Problem

Consider five measurements taken on normal and diabetic patients in the “Diabetes” data set from the “heplots” package. The variables are:

y1: relative weight
y2: fasting plasma glucose
x1: glucose intolerance
x2: insulin response to oral glucose
x3: insulin resistance

The original data are from Reaven and Miller (1979, Diabetologia). Focus on the x variables, which are of main interest.

Answer the following questions:

1. Are the 3-dimensional data normally distributed? Why?
2. Compute the sample mean and covariance matrix of the x variables.
3. Compute the eigenvalues and eigenvectors of the sample covariance matrix.

Theory

If a \(p \times 1\) random vector \(X\) follows a multivariate normal distribution with population mean vector \(\mu\) and population variance-covariance matrix \(\Sigma\), then \(X\) has the joint density function

\[\phi(\textbf{x})=\left(\frac{1}{2\pi}\right)^{p/2}|\Sigma|^{-1/2}\exp\{-\frac{1}{2}(\textbf{x}-\mathbf{\mu})'\Sigma^{-1}(\textbf{x}-\mathbf{\mu})\}\]

\(|\Sigma|\) denotes the determinant of the variance-covariance matrix \(\Sigma\), and \(\Sigma^{-1}\) is its inverse. As in the univariate case, the density attains its maximum at \(\textbf{x} = \mathbf{\mu}\) and decreases as \(\textbf{x}\) moves away from the mean.

The shorthand notation, analogous to the univariate case, is \[\mathbf{X} \sim N(\mathbf{\mu},\Sigma)\]

If the components of \(\mathbf{X}\) are uncorrelated, the variance-covariance matrix is diagonal: \[\Sigma = \left(\begin{array}{cccc}\sigma^2_1 & 0 & \dots & 0\\ 0 & \sigma^2_2 & \dots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \dots & \sigma^2_p \end{array}\right)\]
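The density formula and the diagonal case can be checked numerically. The sketch below evaluates the multivariate normal density directly from its definition and compares it with the product of univariate normal densities when \(\Sigma\) is diagonal. The helper name `dmvn_manual` and the parameter values are made up for illustration; they are not taken from the Diabetes data or from any package.

```r
# Multivariate normal density evaluated directly from its definition.
# `dmvn_manual` is a hypothetical helper name, not a function from any package.
dmvn_manual <- function(x, mu, Sigma) {
  p <- length(mu)
  d <- x - mu
  quad <- drop(t(d) %*% solve(Sigma, d))   # (x - mu)' Sigma^{-1} (x - mu)
  (2 * pi)^(-p / 2) * det(Sigma)^(-1 / 2) * exp(-quad / 2)
}

# Illustrative (made-up) parameters with a diagonal covariance matrix
mu    <- c(0, 1, 2)
sds   <- c(1, 2, 0.5)
Sigma <- diag(sds^2)
x     <- c(0.3, -0.5, 2.2)

joint   <- dmvn_manual(x, mu, Sigma)
product <- prod(dnorm(x, mean = mu, sd = sds))  # product of univariate densities
all.equal(joint, product)                       # the two should agree

# The density at the mean is the largest value the density takes
peak <- dmvn_manual(mu, mu, Sigma)
joint < peak
```

With a diagonal \(\Sigma\) the quadratic form splits into a sum of squared standardized deviations, which is why the joint density factorizes into univariate normal densities.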

R Code

Part 1

library("heplots")

## Loading required package: car

## Loading required package: carData

library(ggpubr)

## Loading required package: ggplot2

library(ggplot2)

data(Diabetes,package = "heplots")
head(Diabetes) # The actual dataset from which we will extract columns

##   relwt glufast glutest instest sspg  group
## 1  0.81      80     356     124   55 Normal
## 2  0.95      97     289     117   76 Normal
## 3  0.94     105     319     143  105 Normal
## 4  1.04      90     356     199  108 Normal
## 5  1.00      90     323     240  143 Normal
## 6  0.76      86     381     157  165 Normal

str(Diabetes) # str() prints the structure and returns NULL, so its result is not assigned

## 'data.frame':    145 obs. of  6 variables:
##  $ relwt  : num  0.81 0.95 0.94 1.04 1 0.76 0.91 1.1 0.99 0.78 ...
##  $ glufast: int  80 97 105 90 90 86 100 85 97 97 ...
##  $ glutest: int  356 289 319 356 323 381 350 301 379 296 ...
##  $ instest: int  124 117 143 199 240 157 221 186 142 131 ...
##  $ sspg   : int  55 76 105 108 143 165 119 105 98 94 ...
##  $ group  : Factor w/ 3 levels "Normal","Chemical_Diabetic",..: 1 1 1 1 1 1 1 1 1 1 ...

Y <- Diabetes[,1:2]

X <- Diabetes[,3:5]

The variables are:

relwt: relative weight, expressed as the ratio of actual weight to expected weight, given the person’s height
glufast: fasting plasma glucose level
glutest: test plasma glucose level, a measure of glucose intolerance
instest: plasma insulin during test, a measure of insulin response to oral glucose
sspg: steady state plasma glucose, a measure of insulin resistance
group: diagnostic group

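# Normal Q-Q plot of each x variable. Constant colours are set inside the
# geoms; passing col = "salmon" to qplot() maps it as an aesthetic instead.
A <- ggplot(X, aes(sample = glutest)) +
  geom_qq(colour = "salmon") + geom_qq_line(colour = "green") +
  ggtitle("Glucose intolerance")
A

B <- ggplot(X, aes(sample = instest)) +
  geom_qq(colour = "salmon") + geom_qq_line(colour = "green") +
  ggtitle("Insulin response to oral glucose")
B

C <- ggplot(X, aes(sample = sspg)) +
  geom_qq(colour = "salmon") + geom_qq_line(colour = "green") +
  ggtitle("Insulin resistance")
C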
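The three Q-Q plots assess each variable on its own. Marginal normality is necessary for joint normality but not sufficient, so a joint check is also useful for question 1. One common option, sketched here as a suggestion rather than as part of the original analysis, is a chi-square Q-Q plot: if the data are trivariate normal, the squared Mahalanobis distances should roughly follow a chi-square distribution with 3 degrees of freedom. The helper name `chisq_qq_data` is hypothetical.

```r
# Squared Mahalanobis distances versus chi-square quantiles.
# `chisq_qq_data` is a hypothetical helper, not part of heplots or ggplot2.
chisq_qq_data <- function(X) {
  X  <- as.matrix(X)
  n  <- nrow(X)
  p  <- ncol(X)
  d2 <- mahalanobis(X, center = colMeans(X), cov = var(X))
  data.frame(
    theoretical = qchisq(ppoints(n), df = p),  # chi-square quantiles, df = p
    observed    = sort(d2)                     # ordered squared distances
  )
}

# Applied to the three x variables when heplots is installed
if (requireNamespace("heplots", quietly = TRUE)) {
  data(Diabetes, package = "heplots")
  qq <- chisq_qq_data(Diabetes[, 3:5])
  plot(qq$theoretical, qq$observed,
       xlab = "Chi-square quantiles (df = 3)",
       ylab = "Squared Mahalanobis distance")
  abline(0, 1, col = "green")  # points near this line support joint normality
}
```

Points that bend away from the reference line, especially in the upper tail, indicate departures from multivariate normality.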

Parts 2 and 3

colMeans(X) #sample mean

##  glutest  instest     sspg 
## 543.6138 186.1172 184.2069

var(X)      #cov matrix

##           glutest     instest       sspg
## glutest 100457.85 -12918.1627 25908.4902
## instest -12918.16  14625.3125   101.4825
## sspg     25908.49    101.4825 11242.3319
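For reference, `colMeans()` and `var()` compute the sample mean vector \(\bar{\textbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\textbf{x}_i\) and the unbiased sample covariance matrix \(S = \frac{1}{n-1}\sum_{i=1}^{n}(\textbf{x}_i-\bar{\textbf{x}})(\textbf{x}_i-\bar{\textbf{x}})'\). The sketch below reproduces both by hand on the first six rows of the x variables, copied from the `head(Diabetes)` output, so it runs without loading the package. The object names (`X6`, `xbar`, `Xc`, `S`) are illustrative only.

```r
# First six rows of the x variables, copied from the head(Diabetes) output
X6 <- cbind(glutest = c(356, 289, 319, 356, 323, 381),
            instest = c(124, 117, 143, 199, 240, 157),
            sspg    = c( 55,  76, 105, 108, 143, 165))

n    <- nrow(X6)
xbar <- colSums(X6) / n          # sample mean vector
Xc   <- sweep(X6, 2, xbar)       # subtract the column means
S    <- crossprod(Xc) / (n - 1)  # unbiased sample covariance matrix

all.equal(xbar, colMeans(X6))    # agrees with colMeans()
all.equal(S, var(X6))            # agrees with var()
```

The divisor \(n-1\) rather than \(n\) is what `var()` uses, which makes \(S\) an unbiased estimate of \(\Sigma\).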

eigen(var(X))   #eigen values and vectors

## eigen() decomposition
## $values
## [1] 109078.28  14170.86   3076.36
## 
## $vectors
##            [,1]       [,2]       [,3]
## [1,]  0.9584067 0.03568944 -0.2831658
## [2,] -0.1308070 0.93673946 -0.3246671
## [3,]  0.2536654 0.34820318  0.9024458
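A quick way to sanity-check the decomposition is to rebuild the covariance matrix from its eigenvalues and eigenvectors, and to confirm that the eigenvalues sum to the total variance (the trace). The sketch below uses the covariance values printed in the output above, so it does not need the data set; the object names are illustrative.

```r
# Covariance matrix copied from the var(X) output above
vars <- c("glutest", "instest", "sspg")
S <- matrix(c(100457.85,   -12918.1627, 25908.4902,
              -12918.1627,  14625.3125,   101.4825,
               25908.4902,    101.4825, 11242.3319),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

e <- eigen(S)  # symmetric input, so the eigenvectors are orthonormal

# Spectral reconstruction: S = V diag(lambda) V'
S_rebuilt <- e$vectors %*% diag(e$values) %*% t(e$vectors)
max(abs(S_rebuilt - S))          # should be numerically negligible

# Total variance: the trace of S equals the sum of the eigenvalues
c(trace = sum(diag(S)), eigen_sum = sum(e$values))

# Share of the total variance along each eigenvector
round(e$values / sum(e$values), 3)
```

From the printed eigenvalues, the first direction carries roughly 86% of the total variance, and its largest loading is on glutest, the variable with by far the largest variance. Eigenvectors are determined only up to sign, so a different platform may print some columns with all signs flipped.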

sessioninfo::platform_info()

##  setting  value                       
##  version  R version 3.6.1 (2019-07-05)
##  os       Ubuntu 19.10                
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_IN:en                    
##  collate  en_IN.UTF-8                 
##  ctype    en_IN.UTF-8                 
##  tz       Asia/Kolkata                
##  date     2021-01-03

References

Friendly, M. (2020). Diabetes data: heplots and candisc examples. Retrieved 20 September 2020, from https://cran.r-project.org/web/packages/candisc/vignettes/diabetes.html
Lesson 4: Multivariate Normal Distribution. (2020). Retrieved 20 September 2020, from https://online.stat.psu.edu/stat505/book/export/html/636