Make sure to place this R markdown file and the sparrow.xlsx file in the same folder! Then delete this line before you submit your results
The sparrow.xlsx data set contains the following 5 measurements on 49 sparrows:
All measurements are in millimeters
# y_bar: Average of all 5 lengths
(sparrow_mean <- colMeans(sparrow))
## length wing_span beak humerus sternum
## 158.0 241.3 31.5 18.5 20.8
# S: covariance matrix
(sparrow_cov <- cov(sparrow))
## length wing_span beak humerus sternum
## length 13.35 13.61 1.922 1.331 2.192
## wing_span 13.61 25.68 2.714 2.198 2.658
## beak 1.92 2.71 0.632 0.342 0.415
## humerus 1.33 2.20 0.342 0.318 0.339
## sternum 2.19 2.66 0.415 0.339 0.983
# R: correlation matrix
(sparrow_cor <- cor(sparrow))
## length wing_span beak humerus sternum
## length 1.000 0.735 0.662 0.645 0.605
## wing_span 0.735 1.000 0.674 0.769 0.529
## beak 0.662 0.674 1.000 0.763 0.526
## humerus 0.645 0.769 0.763 1.000 0.607
## sternum 0.605 0.529 0.526 0.607 1.000
Let’s look at the correlation plot instead of the correlation matrix. Which variables have the strongest and weakest correlations?
# Using ggcorr() in the GGally package
sparrow |>
cor() |>
ggcorrplot::ggcorrplot(
lab = T,
colors = c("red", "white", "blue"),
type = "lower",
outline.color = "white",
ggtheme = theme_void,
hc.order = T
)
The strongest correlation is between humerus with wing_span and humerus with beak.
The weakest correlation is between sternum and wing_span and beak!
Overall, the correlations are between 0.50 and 0.80, indicating that at least all of the variables are somewhat related to the other variables.
# Using ggpairs() from the GGally package to create the scatterplot matrix:
ggpairs(sparrow)
There aren’t any obvious non-linear trends. Phew!
# Using eigen() to store the eigenvalues and eigenvectors of the covariance matrix
sparrow_eigen <- eigen(sparrow_cov)
# Viewing the eigenvalues only:
sparrow_eigen$values
## [1] 35.3258 4.6225 0.6309 0.3128 0.0775
The smallest eigenvalue of the covariance matrix is 0.078.
But is that small? We can check the percentage of the last eigenvalue comprises of the total of the eigenvalues:
\(\frac{\lambda_5}{\sum \lambda_i}\) = 0.002, which is pretty small.
While none of the individual correlation pairs are extremely strong, there may be an approximate linear combination of the sparrow measurements.
Calculate the total variability and generalized variability of the 5 sparrow measurements.
# Calculating the total variability -> Sum of the variances:
tot_var_spar <- sum(diag(sparrow_cov))
# Calculating the generalized variability -> Determinant of the covariance matrix
gen_var_spar <- det(sparrow_cov)
The total variability is \(\sum s_{ii} =\) 40.969
The generalized variability is \(|\mathbf{S}| =\) 2.498
Plot the results with birdno on the x-axis and Mahalanobis distance on the y-axis (don’t report the individual distances).
# Adding the Mahalanobis distance to the sparrow data and renaming it as sparrow2:
sparrow2 <-
sparrow |>
rstatix::mahalanobis_distance()|>
rownames_to_column(var = "birdno") |>
mutate(birdno = as.numeric(birdno))
# Creating the graph and adding a line for a Mahalanobis distance indicating a 99% cut off (assuming y ~ MVN)
ggplot(
data = sparrow2,
mapping = aes(
x = birdno,
y = mahal.dist
)
) +
geom_col(
color = "black",
fill = "steelblue"
) +
labs(
y = "Mahalanobis Distance \n(p = 5)",
x = "Bird Number"
) +
scale_y_continuous(
expand = expansion(mult = c(0, 0.05),
add = c(0,0))
) +
scale_x_continuous(expand = c(0,0)) +
geom_hline(
yintercept = qchisq(p = 0.995, df = 5),
color = "darkred"
) +
geom_text(
data = sparrow2 |>
filter(mahal.dist > qchisq(p = 0.995, df = 5)),
mapping = aes(
x = birdno,
y = mahal.dist,
label = birdno
),
color = "darkred",
vjust = -0.5
)
Yes, the 31st bird has a Mahalanobis distance close to 20, which is much larger than the next largest Mahalanobis distance of about 12.2
First step is to check if the individual variable are univariate Normal:
# Creating a QQ Plot for each of the 5 individual sparrow measurements
sparrow |>
pivot_longer(
cols = length:sternum,
values_to = "length",
names_to = "Measure"
) |>
mutate(measure = as_factor(Measure)) |>
ggplot(
mapping = aes(
sample = length,
color = measure
)
) +
stat_qq_line(color = "black") +
theme(legend.position = "none") +
stat_qq(alpha = 0.50) +
facet_wrap(
facet = ~ measure,
scales = "free"
)
Individually, all the plots look about what we would expect except length is a little too flat at the tails. The other 4 measurements all seem reasonable!
# Using mvn() in the MVN package to perform the Shapiro-Wilks test for each of the variables individually
mvn(
data = sparrow,
covariance = FALSE,
mvnTest = "mardia",
univariateTest = "SW",
desc = F
)$univariateNormality
With the exception of length, all the p-values are quite large, with the second smallest being 0.34. No evidence that the individual variables are not Normally distributed.
The p-value for length is about 0.04, indicating some evidence of non-Normality, but not incredibly strong evidence (possibly because the sample of 49 sparrows may not be that large).
For a set of random variables to be multivariate Normal, any linear combination of the variables should also be Normal:
\[z = \sum_{i = 1} a_i y_i \sim N(\mu_z, \sigma^2_z)\]
which is impossible to completely check! Instead we’ll just perform a
MVN test created by Mardia using the mvn()
function:
mvn(
data = sparrow,
covariance = FALSE,
mvnTest = "mardia",
univariateTest = "SW",
multivariatePlot = "qq",
desc = F
)$multivariateNormality
Overall, the test doesn’t imply there is evidence that the 5 measurements for the 49 sparrows are not multivariate Normal!
The Q-Q plot does give a little cause for concern, but we can’t be certain if it is just due to a small sample size causing a somewhat unusual looking graph.
ggplot(
data = sparrow,
mapping = aes(sample = length)
) +
stat_qq_line(color = "black") +
theme(legend.position = "none") +
stat_qq(
color = "darkred",
size = 2
)
Many of the sparrows in the data have the same body length measurement, indicating that length is measured too discretely (not enough decimal places, or in this case, no decimal places).
If a variable is measured too discretely, then the variable won’t appear Normal and no transformation will fix it. If two birds have the same body length, applying any transformation will result in the two sparrows having the same transformed body length and the issue persists.
If possible, we would remeasure the lengths of the sparrows, but unfortunately that isn’t possible since these data are very, very old :(
mvn(
data = sparrow %>% dplyr::select(-length),
covariance = FALSE,
mvnTest = "mardia",
multivariatePlot = "qq",
univariateTest = "SW",
desc = F
)
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 27.5874176935894 0.119531450977439 YES
## 2 Mardia Kurtosis 0.260352501001285 0.794591878514289 YES
## 3 MVN <NA> <NA> YES
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk wing_span 0.979 0.519 YES
## 2 Shapiro-Wilk beak 0.974 0.339 YES
## 3 Shapiro-Wilk humerus 0.981 0.607 YES
## 4 Shapiro-Wilk sternum 0.987 0.853 YES
The test conclusions remain the same - No evidence of the 4 measurements being not MVN.
The Q-Q plot is more “in line” with what we would expect it to look like if the data were MVN