Let’s clean the data by:

- dropping any non-numeric columns
- removing any rows with a missing value
- making the column names a little easier to use
bones <-
  bone_measures |>
  # Keeping only the numeric columns
  select(where(is.numeric)) |>
  # Dropping any rows with a missing value
  drop_na() |>
  # Using clean_names() from the janitor package to make the names all lower case
  janitor::clean_names()
# Recording the sample size and number of numeric variables
n <- nrow(bones); p <- ncol(bones)
# Calculating the mean vector and covariance matrix
bones_mean <- colMeans(bones)
bone_Sigma <- cov(bones)
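As an aside, the correlation matrix we are about to plot can also be recovered directly from the covariance matrix using base R’s cov2cor() (a quick sketch):

# cov2cor() rescales a covariance matrix into a correlation matrix,
# so this should match cor(bones)
bone_corr <- cov2cor(bone_Sigma)
round(bone_corr[1:4, 1:4], digits = 2)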
We can visualize the correlation matrix using ggcorrplot() in the ggcorrplot package:

bones |>
  # Calculating the correlation matrix
  cor() |>
  # Plotting the correlation matrix, with the variables ordered by correlation patterns
  ggcorrplot::ggcorrplot(
    lab = TRUE,
    colors = c("red", "white", "blue"),
    type = "lower",
    outline.color = "white",
    ggtheme = theme_void,
    hc.order = TRUE
  )
An alternative way to view the correlations is with a correlation bar plot, using the function defined in the R script sourced below:
source("correlation bar plot.R")
# Just need to give the function a data frame and optionally a separator for the variable names
gg_cor_bars(corr_df = bones,
var_seperator = ":")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Maybe too many variables, let's just pick the first 8 columns:
gg_cor_bars(corr_df = bones |> select(lhum:rtib),
var_seperator = ":")
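Since the sourced script isn’t shown here, below is a minimal sketch of the general idea behind a correlation bar plot using only tidyverse tools (this is an assumption about the approach, not the gg_cor_bars() function itself):

# Reshaping the correlation matrix into one row per pair of variables,
# then drawing one bar per pair, sorted by the correlation
bones |>
  cor() |>
  as.data.frame() |>
  tibble::rownames_to_column("var1") |>
  tidyr::pivot_longer(cols = -var1,
                      names_to = "var2",
                      values_to = "corr") |>
  dplyr::filter(var1 < var2) |> # keeping each pair only once
  ggplot(mapping = aes(x = reorder(paste(var1, var2, sep = ":"), corr),
                       y = corr)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Correlation")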
First, let’s pick two cases: rows 150 and 250
(skele <- bones |> slice(c(150, 250)))
## # A tibble: 2 × 13
## lhum rhum lrad rrad lfem rfem ltib rtib lib rib l_ace r_ace bib
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 342 335 255 256. 492 491 406 404 164 162 54.0 53.8 276.
## 2 273 274 212 216 382 385 316. 317 143 143 40.6 41.6 262.
The dist() function will calculate the non-statistical distances:
?dist
# Manhattan distance: sum(|y_1i - y_2i|)
dist(x = skele,
     method = "manhattan")
## 1
## 2 686.59
# Euclidean distance: sqrt(sum((y_1i - y_2i)^2))
dist(x = skele,
     method = "euclidean")
## 1
## 2 228.8943
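As a sanity check (a small sketch), the same two distances can be computed by hand:

# Manual checks against dist():
sum(abs(skele[1, ] - skele[2, ]))      # Manhattan
sqrt(sum((skele[1, ] - skele[2, ])^2)) # Euclidean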
Pearson distance:
\[\sqrt{\sum_{i=1}^{p} \frac{(y_{1i} - y_{2i})^2}{s_{ii}}}\]
(pear_dist <- sqrt(sum(( skele[1, ]-skele[2, ] )^2 / diag(bone_Sigma))))
## [1] 9.772264
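Equivalently (a quick sketch of the identity), the Pearson distance is just the Euclidean distance after dividing each column by its standard deviation:

# Scaling each column by its standard deviation, then taking the
# Euclidean distance; this should reproduce pear_dist above
skele_std <- scale(skele, center = FALSE, scale = sqrt(diag(bone_Sigma)))
dist(x = skele_std, method = "euclidean")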
Mahalanobis (Statistical) Distance:
\[(\textbf{y}_1-\textbf{y}_2)'\textbf{S}^{-1}(\textbf{y}_1-\textbf{y}_2)\]
The mahalanobis_distance() function in the rstatix package will find the distance of each observation from the mean vector \(\bar{\textbf{y}}\) and flag whether it is a multivariate outlier:
\[d_i^2 = (\textbf{y}_i-\bar{\textbf{y}})'\textbf{S}^{-1} (\textbf{y}_i-\bar{\textbf{y}}), \qquad d_i^2 \sim \chi^2_p\]
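Base R’s mahalanobis() computes the same squared distances, so as a quick check (a sketch using the mean vector and covariance matrix from earlier) the mahal.dist column below should agree with:

# Squared Mahalanobis distance of every row from the mean vector
d2 <- mahalanobis(x = bones, center = bones_mean, cov = bone_Sigma)
head(d2)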
# Adding the distances to the data
mahalanobis_distance(data = bones)
## # A tibble: 619 × 15
## lhum rhum lrad rrad lfem rfem ltib rtib lib rib l_ace r_ace bib
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 295 302 208 205 395 396 306 305 160 155 45.6 44.8 286.
## 2 318 322 234 238 434. 437 361 364 158 156 50.7 52.4 261
## 3 304. 307 233 235 402. 407 338 338 160 161 52.2 52.4 269
## 4 287 292. 210. 214 390 386 311 315 160 160 47.7 47.7 283
## 5 269 272 198. 200 369 367 298 296 145 147 47.7 47.8 254
## 6 305 298 223 220. 393 397 316 312. 146 144 52.0 51.6 256
## 7 306. 310. 228. 227 421 421 337 333 162 165 52.3 53.1 295
## 8 280 284. 203 204. 402. 400 308. 306. 154 151 47.4 45.3 260
## 9 287 291 221 224 408. 407 320 324 166 168 52.9 52.9 279
## 10 352. 361 259 263 514. 513 377 381 171 169 56.0 56.4 287
## # ℹ 609 more rows
## # ℹ 2 more variables: mahal.dist <dbl>, is.outlier <lgl>
# Using it to create a distance plot to identify outliers:
gg_outlier <-
  bones |>
  mahalanobis_distance() |>
  mutate(id = row_number()) |>
  ggplot(
    mapping = aes(
      x = id,
      y = mahal.dist
    )
  ) +
  geom_segment(
    mapping = aes(
      xend = id,
      yend = 0
    )
  ) +
  geom_point(
    mapping = aes(
      color = if_else(is.outlier, "red", "black")
    )
  ) +
  # Adding a horizontal line at the outlier cut-off
  # (the 99.9th percentile of the chi-squared distribution):
  geom_hline(
    yintercept = qchisq(p = 0.999, df = ncol(bones)),
    color = "red"
  ) +
  labs(
    x = "Skeleton ID",
    y = "Mahalanobis distance"
  ) +
  scale_color_identity()
gg_outlier
Adding the row number to the plot using geom_text_repel() in the ggrepel package:
pacman::p_load(ggrepel)
gg_outlier +
  geom_text_repel(
    data = mahalanobis_distance(bones) |>
      mutate(id = row_number()) |>
      filter(is.outlier), # Keeping only the rows that are outliers
    mapping = aes(label = id)
  )
We often start by looking at the normality of each variable individually. Start with Shapiro-Wilk tests using shapiro_test() in the rstatix package:
# Just need to list the columns that we want to test:
bones |>
  shapiro_test(lhum, lrad, lfem, ltib, lib, l_ace) |>
  # Let's put them in order from lowest p-value to highest p-value
  arrange(p) |>
  # Round the p-value to 3 decimal places
  mutate(p = round(p, digits = 3))
## # A tibble: 6 × 3
## variable statistic p
## <chr> <dbl> <dbl>
## 1 lib 0.993 0.007
## 2 l_ace 0.995 0.044
## 3 lrad 0.996 0.132
## 4 ltib 0.996 0.157
## 5 lfem 0.998 0.697
## 6 lhum 0.998 0.769
We can add in a group_by() statement if we want to separate the data based on a group:
bone_measures |>
  janitor::clean_names() |> # fixing the names
  drop_na() |>              # removing rows with missing values
  group_by(sex) |>          # separate tests for the two sexes
  shapiro_test(lhum, lrad, lfem, ltib, lib, l_ace) |>
  arrange(sex, p)
## # A tibble: 12 × 4
## sex variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 F ltib 0.989 0.0793
## 2 F l_ace 0.989 0.0962
## 3 F lfem 0.991 0.209
## 4 F lrad 0.994 0.468
## 5 F lib 0.994 0.580
## 6 F lhum 0.995 0.710
## 7 M lib 0.978 0.0000110
## 8 M ltib 0.994 0.0863
## 9 M l_ace 0.994 0.136
## 10 M lfem 0.996 0.476
## 11 M lrad 0.997 0.538
## 12 M lhum 0.999 0.996
We can create the normal probability plots for each variable using pivot_longer() and facet_wrap():
Lbones_long <-
  bone_measures |>
  janitor::clean_names() |>
  drop_na() |>
  # Picking just the left-side bones and the sex column
  dplyr::select(starts_with("l"), sex) |>
  pivot_longer(cols = -sex,
               names_to = "bone",
               values_to = "length") |>
  # as_factor() keeps the bone groups in the same order as the columns
  mutate(bone = as_factor(bone))
head(Lbones_long)
## # A tibble: 6 × 3
## sex bone length
## <chr> <fct> <dbl>
## 1 F lhum 295
## 2 F lrad 208
## 3 F lfem 395
## 4 F ltib 306
## 5 F lib 160
## 6 F l_ace 45.6
Now we can use Lbones_long to create the graphs:
ggplot(data = Lbones_long,
       mapping = aes(sample = length)) +
  # Adds the dots that represent the data
  stat_qq() +
  stat_qq_line(color = "red") +
  # Creates a separate QQ plot for each different bone measurement
  facet_wrap(facets = ~ bone,
             scales = "free")
Most of the plots look close to Normal, with the left radius being the one that appears the most non-Normal.
We can separate by sex by using facet_grid() in place of facet_wrap():
ggplot(
  data = Lbones_long,
  mapping = aes(sample = length)
) +
  # Adds the dots that represent the data
  stat_qq() +
  stat_qq_line(color = "red") +
  # Creates a separate QQ plot for each sex and bone combination
  facet_grid(
    rows = vars(sex),
    cols = vars(bone),
    scales = "free"
  )
Let’s just check the left-side bones:
left_bones <-
  bones |>
  # Picking just the left-side bones
  dplyr::select(starts_with("l"))
Then we can get the Mahalanobis distances using mahalanobis_distance() in rstatix:
left_bones |>
  mahalanobis_distance() |>
  ggplot(mapping = aes(sample = mahal.dist)) +
  stat_qq(
    distribution = qchisq,
    dparams = list(df = ncol(left_bones))
  ) +
  stat_qq_line(
    distribution = qchisq,
    dparams = list(df = ncol(left_bones)),
    color = "red"
  ) +
  labs(
    title = "QQ Plot for Multivariate Normality",
    subtitle = "Six different bones",
    x = "Theoretical Quantiles",
    y = "Mahalanobis Distance"
  ) +
  coord_flip()
The graph definitely indicates that the data are not multivariate normal :(
The MVN package has a function called mvn() with a lot of helpful options:

- data = the data set (only numeric columns)
- mvnTest = "mardia" to perform Mardia's MVN test
- multivariatePlot = "qq" to include a QQ plot
- univariateTest = "SW" to perform a Shapiro-Wilk test for each individual variable
- desc = TRUE or FALSE to include or exclude the descriptive statistics of each individual variable
pacman::p_load(MVN)

mvn(
  data = left_bones,
  mvnTest = "mardia",
  multivariatePlot = "qq",
  univariateTest = "SW",
  desc = FALSE
)
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 257.216419529803 1.44766735856602e-27 NO
## 2 Mardia Kurtosis 15.2094470107533 0 NO
## 3 MVN <NA> <NA> NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk lhum 0.9982 0.7689 YES
## 2 Shapiro-Wilk lrad 0.9961 0.1319 YES
## 3 Shapiro-Wilk lfem 0.9980 0.6973 YES
## 4 Shapiro-Wilk ltib 0.9963 0.1567 YES
## 5 Shapiro-Wilk lib 0.9932 0.0069 NO
## 6 Shapiro-Wilk l_ace 0.9950 0.0436 NO
We get the same plot that we had earlier!
Also, for the Mardia test to conclude that there isn’t evidence of non-MVN, both tests (skewness and kurtosis) need a p-value above the significance level. Here both p-values are below any reasonable significance level, so we conclude there is strong evidence that the data are not multivariate normal!
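For reference, Mardia’s skewness and kurtosis measures are built from the quantities \(g_{ij} = (\textbf{y}_i-\bar{\textbf{y}})'\textbf{S}^{-1}(\textbf{y}_j-\bar{\textbf{y}})\), with the (asymptotic) null distributions:
\[b_{1,p} = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} g_{ij}^3, \qquad \frac{n\,b_{1,p}}{6} \sim \chi^2_{p(p+1)(p+2)/6}\]
\[b_{2,p} = \frac{1}{n}\sum_{i=1}^{n} g_{ii}^2, \qquad \frac{b_{2,p} - p(p+2)}{\sqrt{8p(p+2)/n}} \sim N(0,1)\]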
We can include univariatePlot = "histogram" to create a histogram for each variable:
mvn(
  data = left_bones,
  mvnTest = "mardia",
  univariatePlot = "histogram",
  univariateTest = "SW",
  desc = FALSE
)
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 257.216419529803 1.44766735856602e-27 NO
## 2 Mardia Kurtosis 15.2094470107533 0 NO
## 3 MVN <NA> <NA> NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk lhum 0.9982 0.7689 YES
## 2 Shapiro-Wilk lrad 0.9961 0.1319 YES
## 3 Shapiro-Wilk lfem 0.9980 0.6973 YES
## 4 Shapiro-Wilk ltib 0.9963 0.1567 YES
## 5 Shapiro-Wilk lib 0.9932 0.0069 NO
## 6 Shapiro-Wilk l_ace 0.9950 0.0436 NO
While the p-values are low for lib and l_ace, the graphs themselves don’t look that non-Normal. So why is there such strong evidence that the combined data are not multivariate Normal?
Normality tests are sensitive to outliers, and if there is a multivariate outlier (as seen in the QQ plot from the MVN test), it can cause us to falsely reject the null hypothesis.
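As a quick check (a small sketch using the same rstatix helper), we can count how many rows are flagged before removing them:

left_bones |>
  mahalanobis_distance() |>
  count(is.outlier)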
Let’s remove any outliers and rerun the test!
left_bones |>
  # Using the Mahalanobis distance to find outliers:
  mahalanobis_distance() |>
  # Using filter() to remove the outliers:
  filter(!is.outlier) |>
  # Picking the original 6 variables (not including mahal.dist & is.outlier)
  select(lhum:l_ace) |>
  # Using mvn() to conduct our test
  mvn(
    mvnTest = "mardia",
    multivariatePlot = "qq",
    univariateTest = "SW",
    desc = FALSE
  )
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 85.1413347317432 0.00726556579118107 NO
## 2 Mardia Kurtosis 0.972905593884909 0.330600224534011 YES
## 3 MVN <NA> <NA> NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk lhum 0.9969 0.3002 YES
## 2 Shapiro-Wilk lrad 0.9961 0.1361 YES
## 3 Shapiro-Wilk lfem 0.9981 0.7446 YES
## 4 Shapiro-Wilk ltib 0.9963 0.1629 YES
## 5 Shapiro-Wilk lib 0.9953 0.0611 YES
## 6 Shapiro-Wilk l_ace 0.9950 0.0448 NO
Looks better without the 3 outliers! There is a little bit of a wiggle near the end of the QQ plot, and the p-value for the skewness is below 0.01, but it’s not as bad as before!
If some of the individual variables are non-Normal, we can use a Box-Cox transformation to try to “fix” the non-Normality by including the arguments bc = TRUE and bcType = "rounded":
left_bones |>
  # Using the Mahalanobis distance to find outliers:
  mahalanobis_distance() |>
  # Using filter() to remove the outliers:
  filter(!is.outlier) |>
  # Picking the original 6 variables (not including mahal.dist & is.outlier)
  select(lhum:l_ace) |>
  # Using mvn() to conduct our test
  mvn(
    mvnTest = "mardia",
    multivariatePlot = "qq",
    univariateTest = "SW",
    desc = FALSE,
    bc = TRUE,
    bcType = "rounded"
  )
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 72.6405694231352 0.0667169283604157 YES
## 2 Mardia Kurtosis 0.833790377463341 0.404399118668008 YES
## 3 MVN <NA> <NA> YES
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk lhum 0.9958 0.0943 YES
## 2 Shapiro-Wilk lrad 0.9961 0.1361 YES
## 3 Shapiro-Wilk lfem 0.9970 0.3303 YES
## 4 Shapiro-Wilk ltib 0.9964 0.1722 YES
## 5 Shapiro-Wilk lib 0.9953 0.0611 YES
## 6 Shapiro-Wilk l_ace 0.9950 0.0448 NO
##
## $BoxCoxPowerTransformation
## lhum lrad lfem ltib lib l_ace
## 0 1 0 0 1 1
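A rounded \(\lambda\) of 0 corresponds to a log transformation, while \(\lambda = 1\) leaves the variable unchanged, so here mvn() log-transformed lhum, lfem, and ltib and left the other three alone. As a sketch, we could apply those rounded powers ourselves:

# Applying the rounded Box-Cox powers by hand
# (lambda = 0 -> log transform, lambda = 1 -> unchanged)
left_bones |>
  mahalanobis_distance() |>
  filter(!is.outlier) |>
  select(lhum:l_ace) |>
  mutate(across(c(lhum, lfem, ltib), log))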