Let’s clean the data by:

- dropping any non-numeric columns
- removing any rows with a missing value
- making the column names a little easier to use
bones <-
  bone_measures |>
  # Keeping only the numeric columns
  select(where(is.numeric)) |>
  # Dropping any rows with a missing value
  drop_na() |>
  # Using clean_names() from the janitor package to make the names all lower case
  janitor::clean_names()
# Recording the sample size and number of numeric variables
n <- nrow(bones); p <- ncol(bones)
# Calculating the mean vector and covariance matrix
bones_mean <- colMeans(bones)
bone_Sigma <- cov(bones)
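As an aside, the correlation matrix we are about to plot can also be recovered directly from the covariance matrix using base R’s cov2cor() (a quick sketch):

# cov2cor() rescales a covariance matrix into a correlation matrix,
# so this should match cor(bones)
bone_corr <- cov2cor(bone_Sigma)
round(bone_corr[1:4, 1:4], digits = 2)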
We can visualize the correlation matrix using ggcorrplot() in the ggcorrplot package:

bones |>
  # Calculating the correlation matrix
  cor() |>
  # Plotting the correlation matrix, with the variables ordered by correlation patterns
  ggcorrplot::ggcorrplot(
    lab = TRUE,
    colors = c("red", "white", "blue"),
    type = "lower",
    outline.color = "white",
    ggtheme = theme_void,
    hc.order = TRUE
  )
An alternative way to view the correlations is with a correlation bar plot, using the function defined in the R script sourced below:
source("correlation bar plot.R")
# Just need to give the function a data frame and optionally a separator for the variable names
gg_cor_bars(corr_df = bones,
var_seperator = ":")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Maybe too many variables, let's just pick the first 8 columns:
gg_cor_bars(corr_df = bones |> select(lhum:rtib),
var_seperator = ":")
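Since the sourced script isn’t shown here, below is a minimal sketch of the general idea behind a correlation bar plot using only tidyverse tools (this is an assumption about the approach, not the gg_cor_bars() function itself):

# Reshaping the correlation matrix into one row per pair of variables,
# then drawing one bar per pair, sorted by the correlation
bones |>
  cor() |>
  as.data.frame() |>
  tibble::rownames_to_column("var1") |>
  tidyr::pivot_longer(cols = -var1,
                      names_to = "var2",
                      values_to = "corr") |>
  dplyr::filter(var1 < var2) |> # keeping each pair only once
  ggplot(mapping = aes(x = reorder(paste(var1, var2, sep = ":"), corr),
                       y = corr)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Correlation")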
First, let’s pick two cases: rows 150 and 250
(skele <- bones |> slice(c(150, 250)))
## # A tibble: 2 × 13
## lhum rhum lrad rrad lfem rfem ltib rtib lib rib l_ace r_ace bib
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 342 335 255 256. 492 491 406 404 164 162 54.0 53.8 276.
## 2 273 274 212 216 382 385 316. 317 143 143 40.6 41.6 262.
The dist() function will calculate the non-statistical distances:
?dist
# Manhattan distance: sum(|y_1i - y_2i|)
dist(x = skele,
     method = "manhattan")
## 1
## 2 686.59
# Euclidean distance: sqrt(sum((y_1i - y_2i)^2))
dist(x = skele,
     method = "euclidean")
## 1
## 2 228.8943
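As a sanity check (a small sketch), the same two distances can be computed by hand:

# Manual checks against dist():
sum(abs(skele[1, ] - skele[2, ]))      # Manhattan
sqrt(sum((skele[1, ] - skele[2, ])^2)) # Euclidean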
Pearson distance:
\[\sqrt{\sum_{i=1}^{p} \frac{(y_{1i} - y_{2i})^2}{s_{ii}}}\]
(pear_dist <- sqrt(sum(( skele[1, ]-skele[2, ] )^2 / diag(bone_Sigma))))
## [1] 9.772264
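Equivalently (a quick sketch of the identity), the Pearson distance is just the Euclidean distance after dividing each column by its standard deviation:

# Scaling each column by its standard deviation, then taking the
# Euclidean distance; this should reproduce pear_dist above
skele_std <- scale(skele, center = FALSE, scale = sqrt(diag(bone_Sigma)))
dist(x = skele_std, method = "euclidean")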
Mahalanobis (Statistical) Distance:
\[(\textbf{y}_1-\textbf{y}_2)'\textbf{S}^{-1}(\textbf{y}_1-\textbf{y}_2)\]
The mahalanobis_distance() function in the rstatix package will find the distance of each observation from the mean vector \(\bar{\textbf{y}}\) and flag whether it is a multivariate outlier:
\[d_i^2 = (\textbf{y}_i-\bar{\textbf{y}})'\textbf{S}^{-1} (\textbf{y}_i-\bar{\textbf{y}}), \qquad d_i^2 \sim \chi^2_p\]
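Base R’s mahalanobis() computes the same squared distances, so as a quick check (a sketch using the mean vector and covariance matrix from earlier) the mahal.dist column below should agree with:

# Squared Mahalanobis distance of every row from the mean vector
d2 <- mahalanobis(x = bones, center = bones_mean, cov = bone_Sigma)
head(d2)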
# Adding the distances to the data
mahalanobis_distance(data = bones)
## # A tibble: 619 × 15
## lhum rhum lrad rrad lfem rfem ltib rtib lib rib l_ace r_ace bib
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 295 302 208 205 395 396 306 305 160 155 45.6 44.8 286.
## 2 318 322 234 238 434. 437 361 364 158 156 50.7 52.4 261
## 3 304. 307 233 235 402. 407 338 338 160 161 52.2 52.4 269
## 4 287 292. 210. 214 390 386 311 315 160 160 47.7 47.7 283
## 5 269 272 198. 200 369 367 298 296 145 147 47.7 47.8 254
## 6 305 298 223 220. 393 397 316 312. 146 144 52.0 51.6 256
## 7 306. 310. 228. 227 421 421 337 333 162 165 52.3 53.1 295
## 8 280 284. 203 204. 402. 400 308. 306. 154 151 47.4 45.3 260
## 9 287 291 221 224 408. 407 320 324 166 168 52.9 52.9 279
## 10 352. 361 259 263 514. 513 377 381 171 169 56.0 56.4 287
## # ℹ 609 more rows
## # ℹ 2 more variables: mahal.dist <dbl>, is.outlier <lgl>
# Using it to create a distance plot to identify outliers:
gg_outlier <-
  bones |>
  mahalanobis_distance() |>
  mutate(id = row_number()) |>
  ggplot(
    mapping = aes(
      x = id,
      y = mahal.dist
    )
  ) +
  geom_segment(
    mapping = aes(
      xend = id,
      yend = 0
    )
  ) +
  geom_point(
    mapping = aes(
      color = if_else(is.outlier, "red", "black")
    )
  ) +
  # Adding a horizontal line at the outlier cut-off
  # (the 99.9th percentile of the chi-squared distribution):
  geom_hline(
    yintercept = qchisq(p = 0.999, df = ncol(bones)),
    color = "red"
  ) +
  labs(
    x = "Skeleton ID",
    y = "Mahalanobis distance"
  ) +
  scale_color_identity()
gg_outlier
Adding the row number to the plot using geom_text_repel() in the ggrepel package:
pacman::p_load(ggrepel)
gg_outlier +
  geom_text_repel(
    data = mahalanobis_distance(bones) |>
      mutate(id = row_number()) |>
      filter(is.outlier), # Keeping only the rows that are outliers
    mapping = aes(label = id)
  )
We often start by looking at the normality of each variable individually. Start with Shapiro-Wilk tests using shapiro_test() in the rstatix package:
# Just need to list the columns that we want to test:
bones |>
  shapiro_test(lhum, lrad, lfem, ltib, lib, l_ace) |>
  # Let's put them in order from lowest p-value to highest p-value
  arrange(p) |>
  # Round the p-value to 3 decimal places
  mutate(p = round(p, digits = 3))
## # A tibble: 6 × 3
## variable statistic p
## <chr> <dbl> <dbl>
## 1 lib 0.993 0.007
## 2 l_ace 0.995 0.044
## 3 lrad 0.996 0.132
## 4 ltib 0.996 0.157
## 5 lfem 0.998 0.697
## 6 lhum 0.998 0.769
We can add in a group_by() statement if we want to separate the data based on a group:
bone_measures |>
  janitor::clean_names() |> # fixing the names
  drop_na() |>              # removing rows with missing values
  group_by(sex) |>          # separate tests for the two sexes
  shapiro_test(lhum, lrad, lfem, ltib, lib, l_ace) |>
  arrange(sex, p)
## # A tibble: 12 × 4
## sex variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 F ltib 0.989 0.0793
## 2 F l_ace 0.989 0.0962
## 3 F lfem 0.991 0.209
## 4 F lrad 0.994 0.468
## 5 F lib 0.994 0.580
## 6 F lhum 0.995 0.710
## 7 M lib 0.978 0.0000110
## 8 M ltib 0.994 0.0863
## 9 M l_ace 0.994 0.136
## 10 M lfem 0.996 0.476
## 11 M lrad 0.997 0.538
## 12 M lhum 0.999 0.996
We can create the normal probability plots for each variable using pivot_longer() and facet_wrap():
Lbones_long <-
  bone_measures |>
  janitor::clean_names() |>
  drop_na() |>
  # Picking just the left-side bones and the sex column
  dplyr::select(starts_with("l"), sex) |>
  pivot_longer(cols = -sex,
               names_to = "bone",
               values_to = "length") |>
  # as_factor() keeps the bone groups in the same order as the columns
  mutate(bone = as_factor(bone))
head(Lbones_long)
## # A tibble: 6 × 3
## sex bone length
## <chr> <fct> <dbl>
## 1 F lhum 295
## 2 F lrad 208
## 3 F lfem 395
## 4 F ltib 306
## 5 F lib 160
## 6 F l_ace 45.6
Now we can use Lbones_long to create the graphs:
ggplot(data = Lbones_long,
       mapping = aes(sample = length)) +
  # Adds the dots that represent the data
  stat_qq() +
  stat_qq_line(color = "red") +
  # Creates a separate QQ plot for each different bone measurement
  facet_wrap(facets = ~ bone,
             scales = "free")
Most of the plots look close to Normal, with the left radius being the one that appears the most non-Normal.
We can separate by sex by using facet_grid() in place of facet_wrap():
ggplot(
  data = Lbones_long,
  mapping = aes(sample = length)
) +
  # Adds the dots that represent the data
  stat_qq() +
  stat_qq_line(color = "red") +
  # Creates a separate QQ plot for each sex and bone combination
  facet_grid(
    rows = vars(sex),
    cols = vars(bone),
    scales = "free"
  )
Let’s just check the left-side bones:
left_bones <-
  bones |>
  # Picking just the left-side bones
  dplyr::select(starts_with("l"))
Then we can get the Mahalanobis distances using mahalanobis_distance() in rstatix:
left_bones |>
  mahalanobis_distance() |>
  ggplot(mapping = aes(sample = mahal.dist)) +
  stat_qq(
    distribution = qchisq,
    dparams = list(df = ncol(left_bones))
  ) +
  stat_qq_line(
    distribution = qchisq,
    dparams = list(df = ncol(left_bones)),
    color = "red"
  ) +
  labs(
    title = "QQ Plot for Multivariate Normality",
    subtitle = "Six different bones",
    x = "Theoretical Quantiles",
    y = "Mahalanobis Distance"
  ) +
  coord_flip()
The graph definitely indicates that the data are not multivariate normal :(
The MVN package has a function called mvn() with a lot of helpful options:

- data = the data set (only numeric columns)
- mvnTest = "mardia" to perform Mardia's MVN test
- multivariatePlot = "qq" to include a QQ plot
- univariateTest = "SW" to perform a Shapiro-Wilk test for each individual variable
- desc = TRUE or FALSE to include or exclude the descriptive statistics of each individual variable
pacman::p_load(MVN)

mvn(
  data = left_bones,
  mvnTest = "mardia",
  multivariatePlot = "qq",
  univariateTest = "SW",
  desc = FALSE
)
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 257.216419529803 1.44766735856602e-27 NO
## 2 Mardia Kurtosis 15.2094470107533 0 NO
## 3 MVN <NA> <NA> NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk lhum 0.9982 0.7689 YES
## 2 Shapiro-Wilk lrad 0.9961 0.1319 YES
## 3 Shapiro-Wilk lfem 0.9980 0.6973 YES
## 4 Shapiro-Wilk ltib 0.9963 0.1567 YES
## 5 Shapiro-Wilk lib 0.9932 0.0069 NO
## 6 Shapiro-Wilk l_ace 0.9950 0.0436 NO
We get the same plot that we had earlier!
Also, for the Mardia test to conclude that there isn’t evidence of non-MVN, both tests (skewness and kurtosis) need a p-value above the significance level. Here both p-values are below any reasonable significance level, so we conclude there is strong evidence that the data are not multivariate normal!
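For reference, Mardia’s skewness and kurtosis measures are built from the quantities \(g_{ij} = (\textbf{y}_i-\bar{\textbf{y}})'\textbf{S}^{-1}(\textbf{y}_j-\bar{\textbf{y}})\), with the (asymptotic) null distributions:
\[b_{1,p} = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} g_{ij}^3, \qquad \frac{n\,b_{1,p}}{6} \sim \chi^2_{p(p+1)(p+2)/6}\]
\[b_{2,p} = \frac{1}{n}\sum_{i=1}^{n} g_{ii}^2, \qquad \frac{b_{2,p} - p(p+2)}{\sqrt{8p(p+2)/n}} \sim N(0,1)\]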
We can include univariatePlot = "histogram" to create a histogram for each variable:
mvn(
  data = left_bones,
  mvnTest = "mardia",
  univariatePlot = "histogram",
  univariateTest = "SW",
  desc = FALSE
)
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 257.216419529803 1.44766735856602e-27 NO
## 2 Mardia Kurtosis 15.2094470107533 0 NO
## 3 MVN <NA> <NA> NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk lhum 0.9982 0.7689 YES
## 2 Shapiro-Wilk lrad 0.9961 0.1319 YES
## 3 Shapiro-Wilk lfem 0.9980 0.6973 YES
## 4 Shapiro-Wilk ltib 0.9963 0.1567 YES
## 5 Shapiro-Wilk lib 0.9932 0.0069 NO
## 6 Shapiro-Wilk l_ace 0.9950 0.0436 NO
While the p-values are low for lib and l_ace, the graphs themselves don’t look that non-Normal. So why is there such strong evidence that the combined data are not multivariate Normal?
Normality tests are sensitive to outliers, and if there is a multivariate outlier (as seen in the QQ plot from the MVN test), it can cause us to falsely reject the null hypothesis.
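As a quick check (a small sketch using the same rstatix helper), we can count how many rows are flagged before removing them:

left_bones |>
  mahalanobis_distance() |>
  count(is.outlier)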
Let’s remove any outliers and rerun the test!
left_bones |>
  # Using the Mahalanobis distance to find outliers:
  mahalanobis_distance() |>
  # Using filter() to remove the outliers:
  filter(!is.outlier) |>
  # Picking the original 6 variables (not including mahal.dist & is.outlier)
  select(lhum:l_ace) |>
  # Using mvn() to conduct our test
  mvn(
    mvnTest = "mardia",
    multivariatePlot = "qq",
    univariateTest = "SW",
    desc = FALSE
  )
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 85.1413347317432 0.00726556579118107 NO
## 2 Mardia Kurtosis 0.972905593884909 0.330600224534011 YES
## 3 MVN <NA> <NA> NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk lhum 0.9969 0.3002 YES
## 2 Shapiro-Wilk lrad 0.9961 0.1361 YES
## 3 Shapiro-Wilk lfem 0.9981 0.7446 YES
## 4 Shapiro-Wilk ltib 0.9963 0.1629 YES
## 5 Shapiro-Wilk lib 0.9953 0.0611 YES
## 6 Shapiro-Wilk l_ace 0.9950 0.0448 NO
Looks better without the 3 outliers! There is a little bit of a wiggle near the end of the QQ plot, and the p-value for the skewness is below 0.01, but it’s not as bad as before!
If some of the individual variables are non-Normal, we can use a Box-Cox transformation to try to “fix” the non-Normality by including the arguments bc = TRUE and bcType = "rounded":
left_bones |>
  # Using the Mahalanobis distance to find outliers:
  mahalanobis_distance() |>
  # Using filter() to remove the outliers:
  filter(!is.outlier) |>
  # Picking the original 6 variables (not including mahal.dist & is.outlier)
  select(lhum:l_ace) |>
  # Using mvn() to conduct our test
  mvn(
    mvnTest = "mardia",
    multivariatePlot = "qq",
    univariateTest = "SW",
    desc = FALSE,
    bc = TRUE,
    bcType = "rounded"
  )
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 72.6405694231352 0.0667169283604157 YES
## 2 Mardia Kurtosis 0.833790377463341 0.404399118668008 YES
## 3 MVN <NA> <NA> YES
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk lhum 0.9958 0.0943 YES
## 2 Shapiro-Wilk lrad 0.9961 0.1361 YES
## 3 Shapiro-Wilk lfem 0.9970 0.3303 YES
## 4 Shapiro-Wilk ltib 0.9964 0.1722 YES
## 5 Shapiro-Wilk lib 0.9953 0.0611 YES
## 6 Shapiro-Wilk l_ace 0.9950 0.0448 NO
##
## $BoxCoxPowerTransformation
## lhum lrad lfem ltib lib l_ace
## 0 1 0 0 1 1
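A rounded \(\lambda\) of 0 corresponds to a log transformation, while \(\lambda = 1\) leaves the variable unchanged, so here mvn() log-transformed lhum, lfem, and ltib and left the other three alone. As a sketch, we could apply those rounded powers ourselves:

# Applying the rounded Box-Cox powers by hand
# (lambda = 0 -> log transform, lambda = 1 -> unchanged)
left_bones |>
  mahalanobis_distance() |>
  filter(!is.outlier) |>
  select(lhum:l_ace) |>
  mutate(across(c(lhum, lfem, ltib), log))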