Multivariate distributions are designed to describe the probability distributions of more than one random variable at the same time. Since the variables are often correlated, exploring them individually wpuld only provide limited insight

Structure of multivariate data

# Read in the wine dataset 
wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep = ",")

# Print the first four entries
head(wine, n = 4)
##   V1    V2   V3   V4   V5  V6   V7   V8   V9  V10  V11  V12  V13  V14
## 1  1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
## 2  1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
## 3  1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
## 4  1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
# Find the dimensions of the data
dim(wine)
## [1] 178  14
# Check the names of the wine dataset 
names(wine)
##  [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12"
## [13] "V13" "V14"
# Assign new names
colnames(wine) <- c('Type', 'Alcohol', 'Malic', 'Ash', 'Alcalinity', 'Magnesium', 'Phenols', 'Flavanoids', 'Nonflavanoids','Proanthocyanins', 'Color', 'Hue', 'Dilution', 'Proline')
          
# Check the new column names
names(wine)
##  [1] "Type"            "Alcohol"         "Malic"           "Ash"            
##  [5] "Alcalinity"      "Magnesium"       "Phenols"         "Flavanoids"     
##  [9] "Nonflavanoids"   "Proanthocyanins" "Color"           "Hue"            
## [13] "Dilution"        "Proline"
# Check data type/structure of each variable
str(wine)
## 'data.frame':    178 obs. of  14 variables:
##  $ Type           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Alcohol        : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash            : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Alcalinity     : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Magnesium      : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Phenols        : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids     : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoids  : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanthocyanins: num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color          : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue            : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ Dilution       : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ Proline        : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
# Change the Type variable data type
wine$Type <- as.factor(wine$Type)

# Check data type/structure again
str(wine)
## 'data.frame':    178 obs. of  14 variables:
##  $ Type           : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Alcohol        : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash            : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Alcalinity     : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Magnesium      : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Phenols        : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids     : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoids  : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanthocyanins: num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color          : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue            : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ Dilution       : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ Proline        : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

Mean vector and variance covariance matrix

# Calculate the mean of the Alcohol, Malic, Ash, and Alcalinity variables 
colMeans(wine[, 2:5])
##    Alcohol      Malic        Ash Alcalinity 
##  13.000618   2.336348   2.366517  19.494944
# Calculate the mean of the variables by wine type
by(wine[2:5], wine$Type, colMeans)
## wine$Type: 1
##    Alcohol      Malic        Ash Alcalinity 
##  13.744746   2.010678   2.455593  17.037288 
## ------------------------------------------------------------ 
## wine$Type: 2
##    Alcohol      Malic        Ash Alcalinity 
##  12.278732   1.932676   2.244789  20.238028 
## ------------------------------------------------------------ 
## wine$Type: 3
##    Alcohol      Malic        Ash Alcalinity 
##  13.153750   3.333750   2.437083  21.416667
# Calculate the variance-covariance matrix of the variables Alcohol, Malic, Ash, Alcalinity
var.wine <- var(wine[2:5])

# Round the matrix values to two decimal places 
round(var.wine, 2)
##            Alcohol Malic  Ash Alcalinity
## Alcohol       0.66  0.09 0.05      -0.84
## Malic         0.09  1.25 0.05       1.08
## Ash           0.05  0.05 0.08       0.41
## Alcalinity   -0.84  1.08 0.41      11.15
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.3
## corrplot 0.84 loaded
# Calculate the correlation matrix 
cor.wine <- cor(wine[, 2:5])

# Round the matrix to two decimal places
round(cor.wine, 2)
##            Alcohol Malic  Ash Alcalinity
## Alcohol       1.00  0.09 0.21      -0.31
## Malic         0.09  1.00 0.16       0.29
## Ash           0.21  0.16 1.00       0.44
## Alcalinity   -0.31  0.29 0.44       1.00
# Plot the correlations 
corrplot(cor.wine, method = "ellipse")

Plotting multivariate data

library(lattice)
# Scatter plot matrix using the base R plot function
pairs(wine[,2:5])

# Scatter plot matrix colored by groups
splom( ~ wine[,2:5],  pch = 16, col = wine$Type)

library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# Color the points by wine type 
wine.gg <- ggpairs(data = wine, mapping = aes(color = Type), columns = 2:5)
wine.gg

Multivarite normal distribution

It is a generalization of the univariate normal and is specified by a mean vector and variance covariance matrix. Even if all variables individually follow a univariate normal, the joint distribution might not be a multivariate normal.

Functions for statistical distributions inr R

  • Probability
  • Quantile
  • Density
  • random
library(mvtnorm)
## Warning: package 'mvtnorm' was built under R version 4.0.3
mu.sim <- c(2, -2, 1)
sigma.sim <- matrix(data = c(9,5,5,5,5,3,5,3,4), nrow = 3)
# Generate 100 bivariate normal samples
multnorm.sample <- rmvnorm(n = 100, mean = mu.sim, sigma = sigma.sim)

# View the first 6 samples
head(multnorm.sample)
##            [,1]       [,2]       [,3]
## [1,] -0.1031677 -4.0362852 -1.0112424
## [2,] 10.4023207  3.6987711  7.4120582
## [3,] -0.8593295 -0.5760292 -1.1337082
## [4,]  3.5361230 -0.6839956  1.3939369
## [5,]  0.3137523 -3.5272389 -0.7769091
## [6,]  5.4426615  1.4807898  2.8038625
# Scatterplot of the bivariate samples 
ggpairs(data.frame(multnorm.sample))

Dentiry of a multivariate normal distribution

library(scatterplot3d) 
## Warning: package 'scatterplot3d' was built under R version 4.0.3
# Calculate density
multnorm.dens <- dmvnorm(multnorm.sample, mean = mu.sim, sigma = sigma.sim)

# Create scatter plot of density heights 
scatterplot3d(cbind(multnorm.sample, multnorm.dens),    
                color="blue", pch="", type = "h",             
                              xlab = "x", ylab = "y", zlab = "density")

cumulative distribution and inverse CDF

# Volume under a bivariate standard normal
pmvnorm(lower = c(-1, -1), upper = c(1, 1))
## [1] 0.4660649
## attr(,"error")
## [1] 1e-15
## attr(,"msg")
## [1] "Normal Completion"
# Volume under specified mean and variance-covariance matrix
pmvnorm(lower = c(-5, -5), upper = c(5, 5), mean = mu.sim, sigma = sigma.sim)
## Warning in cbind(lower, upper, mean): number of rows of result is not a multiple
## of vector length (arg 1)
## [1] 0.7485242
## attr(,"error")
## [1] 0.0001896578
## attr(,"msg")
## [1] "Normal Completion"
# Probability contours for a standard bivariate normal
qmvnorm(0.9, tail = "both", sigma = diag(2))
## $quantile
## [1] 1.948779
## 
## $f.quantile
## [1] -1.537507e-06
## 
## attr(,"message")
## [1] "Normal Completion"
# Probability contours for a bivariate normal 
qmvnorm(0.95, tail = "both", mean = mu.sim, sigma = sigma.sim)
## $quantile
## [1] 7.253087
## 
## $f.quantile
## [1] 5.054298e-06
## 
## attr(,"message")
## [1] "Normal Completion"

Checking normality of multivariate data

Why check normality?

  • Classical statistical techniques that assume univariate/multivariate nromality
    • Multivariate regression
    • Discriminant analysis
    • model-based clustering
    • PCA
    • Multivariate analysis of variance (MANOVA)

univariate normality tests

If any single variable fails to follow normality we cannot have joint multivariate normality

multivarite normality tests

  • MV normality tets by
    • Mardia
    • Henze-Zirkler
    • Royston
  • Graphical approaches
    • Chi-square QQplot
    • perspective
    • contour plots
# Test sample normality 
qqnorm(multnorm.sample[, 1])
qqline(multnorm.sample[, 1])

library(MVN)
## Warning: package 'MVN' was built under R version 4.0.3
## sROC 0.1-2 loaded
# Create qqnorm plot 
# uniPlot(wine[, 2:5], type = "qqplot")
# mardiaTest qqplot 
wine.mvntest <- mardiaTest(wine[, 2:5], qqplot = TRUE)
## Warning: 'mardiaTest' is deprecated.
## Use 'mvn' instead.
## See help("Deprecated")
wine.mvntest
## [1] "'mardiaTest' is deprecated.\nUse 'mvn' instead.\nSee help(\"Deprecated\")"
# Use mardiaTest
mardiaTest(multnorm.sample)
## Warning: 'mardiaTest' is deprecated.
## Use 'mvn' instead.
## See help("Deprecated")
# Use hzTest
hzTest(wine[, 2:5])
## Warning: 'hzTest' is deprecated.
## Use 'mvn' instead.
## See help("Deprecated")

Multivariate t-distribution

It can model heavier tails and are generalization of univariate Stundent’s t-distribution

\[ t_{df}(\delta, \Sigma) \]

Where \(df\) are the degrees of freedom in all dimensions

# Generate the t-samples 
multt.sample <- rmvt(n = 200,sigma = sigma.sim, df = 5, delta = mu.sim)

# Print the first 6 samples
head(multt.sample)
##           [,1]       [,2]       [,3]
## [1,]  3.130277 -1.9400025  0.9762581
## [2,]  1.361335 -3.0098776  0.1205345
## [3,]  3.802967 -0.7209706  2.6175912
## [4,] -1.776014 -6.8858459 -2.2717340
## [5,]  4.637783 -1.2810504  3.8066385
## [6,]  3.051600 -2.3924482  1.4898844
# Check multivariate normality
mardiaTest(multt.sample, qqplot = TRUE)
## Warning: 'mardiaTest' is deprecated.
## Use 'mvn' instead.
## See help("Deprecated")

Density and cumulative density for multivariate-t

# Calculate densities
multt.dens <- dmvt(x = multt.sample, delta = mu.sim, sigma = sigma.sim, df = 5)

# Plot 3D heights of densities
#scatterplot3d(cbind(multt.sample, multt.dens),    
#                color = "blue", pch = "", type = "h",             
#                              xlab = "x", ylab = "y", zlab = "density")
# Calculate the volume under a t-distribution
pmvt(lower = c(-5, -5), upper = c(5, 5), delta = mu.sim, df = 5, sigma = sigma.sim)
## Warning in cbind(lower, upper, mean): number of rows of result is not a multiple
## of vector length (arg 1)
## [1] 0.6434244
## attr(,"error")
## [1] 0.0003743722
## attr(,"msg")
## [1] "Normal Completion"
# Calculate the equal probability contour
qmvt(p = 0.9, tail = "both", sigma = diag(2))
## $quantile
## [1] 8.956747
## 
## $f.quantile
## [1] -1.35426e-07
## 
## attr(,"message")
## [1] "Normal Completion"

Multivariate Skewed distributions

Univariate skew-normal distribution

  • general skew-normal is denoted by \(SN(\xi, \omega, \alpha)\)
  • \(\xi\) and $omega$ are the location and scale parameters
  • \(\alpha\) is the skewness parameter
    • For \(\alpha > 0\) skewed to the right
    • For \(\alpha < 0\) skewed to the left
    • Sn(0) is the same as a standard normal

Multivariate skew-normal dsitribution

\(SN(\xi, \omega, \alpha)\)

library(sn)
## Warning: package 'sn' was built under R version 4.0.3
## Loading required package: stats4
## 
## Attaching package: 'sn'
## The following object is masked from 'package:stats':
## 
##     sd
# Generate the skew-normal samples 
skewnorm.sample <- rmsn(n = 100, xi = mu.sim, Omega = sigma.sim, alpha = c(4, -4))
## Warning in matrix(dp$Omega, d, d): la longitud de los datos [9] no es un
## submúltiplo o múltiplo del número de filas [2] en la matriz
## Warning in dp[[1]] + mu.z * omega: longitud de objeto mayor no es múltiplo de la
## longitud de uno menor
## Warning in dp0$xi + lot$aux$omega * z: longitud de objeto mayor no es múltiplo
## de la longitud de uno menor
# Print first six samples
head(skewnorm.sample)
##             [,1]       [,2]
## [1,] -0.27268774 -4.0253926
## [2,]  1.00484814  0.2731621
## [3,]  0.05980053  2.5412612
## [4,]  2.60591642 -1.8870326
## [5,]  3.63497414  1.4335588
## [6,]  0.18127730  2.1798305
# Generate the skew-t samples 
skewt.sample <- rmst(n = 100, xi = mu.sim, Omega = sigma.sim, alpha = c(4, -4), nu = 5)
## Warning in matrix(dp$Omega, d, d): la longitud de los datos [9] no es un
## submúltiplo o múltiplo del número de filas [2] en la matriz
## Warning in xi + t(z/sqrt(v)): longitud de objeto mayor no es múltiplo de la
## longitud de uno menor
# Print first six samples
head(skewt.sample)
##            [,1]       [,2]
## [1,]  3.7659280 -3.5896138
## [2,]  3.4277042  0.7036523
## [3,]  0.1525090  1.5846982
## [4,]  0.6648123 -4.6884818
## [5,] -6.6229018 -4.6319690
## [6,] -5.9962971 -2.4466693
# Contour plot for skew-normal sample
#ggplot(data.frame(skewnorm.sample), aes(x = x, y = y)) + geom_point() + geom_density_2d() 
# Normality test for skew-t sample
skewt.Test <- mardiaTest(skewt.sample, qqplot = TRUE)
## Warning: 'mardiaTest' is deprecated.
## Use 'mvn' instead.
## See help("Deprecated")

PCA

# Calculate PCs
pca.state <- princomp(state.x77, cor = TRUE, scores = TRUE) 

# Plot the PCA object  
plot(pca.state) 

# Print the summary of the PCs
summary(pca.state) 
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     1.8970755 1.2774659 1.0544862 0.84113269 0.62019488
## Proportion of Variance 0.4498619 0.2039899 0.1389926 0.08843803 0.04808021
## Cumulative Proportion  0.4498619 0.6538519 0.7928445 0.88128252 0.92936273
##                            Comp.6    Comp.7     Comp.8
## Standard deviation     0.55449226 0.3800642 0.33643379
## Proportion of Variance 0.03843271 0.0180561 0.01414846
## Cumulative Proportion  0.96779544 0.9858515 1.00000000
# Variance explained by each PC
pca.var <- pca.state$sdev^2

# Proportion of variance explained by each PC
pca.pvar <- pca.var/sum(pca.var)

# Proportion of variance explained by each principal component
pca.pvar
##     Comp.1     Comp.2     Comp.3     Comp.4     Comp.5     Comp.6     Comp.7 
## 0.44986195 0.20398990 0.13899264 0.08843803 0.04808021 0.03843271 0.01805610 
##     Comp.8 
## 0.01414846
# Cumulative variance explained plot
plot(cumsum(pca.pvar), xlab = "Principal component", ylab = "Cumulative Proportion of variance explained", ylim = c(0,1), type = 'b')
grid()

# Add a horizontal line
abline(h = 0.95, col = "blue")

# Draw screeplot
screeplot(pca.state, type = "l")
grid()

Interpreting PCA outputs

Loadings

Show the weights that were used to construct the PCA. the blank entries in the data frame are values that are close to zero and have been ommitted to provide a clear interpretation.

PCA scores

  • projection of the original dataset on the principal components
# Create dataframe of scores
scores.state <- data.frame(pca.state$scores)

# Plot of scores colored by region
ggplot(data = scores.state, aes(x = Comp.1, y = Comp.2, label = rownames(scores.state), color = state.region)) + 
    geom_text(alpha = 0.8, size = 3) + 
    ggtitle("PCA of states data colored by region")

library(factoextra)
## Warning: package 'factoextra' was built under R version 4.0.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_pca_ind(pca.state)

fviz_pca_var(pca.state)

fviz_pca_biplot(pca.state)

Multidimensional Scaling

Technique for performing dimension reduction. The primary goal of multidimensional scaling is to providea visual representation of the pattern of proximities, meaning the similarities or distances, among a set of objects.

This is accomplished by assigning objects to specific locations in a conceptual space usually in two or three dimensions, so the distances between points in the space match the given similarities as closely as possible

Classicla multidimensional scaling or Principal coordinates analysis

  • Input; matrix distances
  • Output: Set of points in given dimensions such that the distances closely match the input distances
# Calculate distance 
state.dist <- dist(state.x77)

# Perform multidimensional scaling 
mds.state <- cmdscale(state.dist) 

# Change mds.state to a dataframe for use in ggplot
mds.state_df <- data.frame(mds.state)

# Plot the representation of the data in two dimensions 
ggplot(data = mds.state_df, aes(x = X1, y = X2, label = rownames(mds.state), color = state.region)) + 
    geom_text(alpha = 0.8, size = 3)  

# Calculate distance 
wine.dist <- dist(wine[,-1])

# Perform multidimensional scaling 
mds.wine <- cmdscale(wine.dist, k = 3) 
mds.wine_df <- data.frame(mds.wine)

# Plot the representation of the data in three dimensions 
scatterplot3d(mds.wine_df, color = wine$Type, pch = 19, type = "h", lty.hplot = 2)