Make sure to place this R markdown file and the sparrow.xlsx file in the same folder! Then delete this line before you submit your results

Data Description

The sparrow.xlsx data set contains the following 5 measurements on 49 sparrows:

All measurements are in millimeters

Question 1: Calculate the Multivariate Summary Stats

# y_bar: Average of all 5 lengths
(sparrow_mean <- colMeans(sparrow))
##    length wing_span      beak   humerus   sternum 
##     158.0     241.3      31.5      18.5      20.8
# S: covariance matrix
(sparrow_cov <- cov(sparrow))
##           length wing_span  beak humerus sternum
## length     13.35     13.61 1.922   1.331   2.192
## wing_span  13.61     25.68 2.714   2.198   2.658
## beak        1.92      2.71 0.632   0.342   0.415
## humerus     1.33      2.20 0.342   0.318   0.339
## sternum     2.19      2.66 0.415   0.339   0.983
# R: correlation matrix
(sparrow_cor <- cor(sparrow))
##           length wing_span  beak humerus sternum
## length     1.000     0.735 0.662   0.645   0.605
## wing_span  0.735     1.000 0.674   0.769   0.529
## beak       0.662     0.674 1.000   0.763   0.526
## humerus    0.645     0.769 0.763   1.000   0.607
## sternum    0.605     0.529 0.526   0.607   1.000

Let’s look at the correlation plot instead of the correlation matrix. Which variables have the strongest and weakest correlations?

# Using ggcorr() in the GGally package
sparrow |> 
  cor() |> 
  
  ggcorrplot::ggcorrplot(
    lab = T,
    colors = c("red", "white", "blue"),
    type = "lower",
    outline.color = "white",
    ggtheme = theme_void,
    hc.order = T
  )

The strongest correlation is between humerus with wing_span and humerus with beak.

The weakest correlation is between sternum and wing_span and beak!

Overall, the correlations are between 0.50 and 0.80, indicating that at least all of the variables are somewhat related to the other variables.

Question 2: Scatterplot Matrix

# Using ggpairs() from the GGally package to create the scatterplot matrix:
ggpairs(sparrow)

There aren’t any obvious non-linear trends. Phew!

Question 3: Eigenvalues

Calculate the eigenvalues of the covariance matrix.

# Using eigen() to store the eigenvalues and eigenvectors of the covariance matrix
sparrow_eigen <- eigen(sparrow_cov)

# Viewing the eigenvalues only:
sparrow_eigen$values
## [1] 35.3258  4.6225  0.6309  0.3128  0.0775

Using the smallest eigenvalue, could there be an issue with linear dependence (aka multicollinearity)? **

The smallest eigenvalue of the covariance matrix is 0.078.

But is that small? We can check the percentage of the last eigenvalue comprises of the total of the eigenvalues:

\(\frac{\lambda_5}{\sum \lambda_i}\) = 0.002, which is pretty small.

While none of the individual correlation pairs are extremely strong, there may be an approximate linear combination of the sparrow measurements.

Question 4: Total and Generalized Variability

Calculate the total variability and generalized variability of the 5 sparrow measurements.

# Calculating the total variability -> Sum of the variances:
tot_var_spar <- sum(diag(sparrow_cov))

# Calculating the generalized variability -> Determinant of the covariance matrix
gen_var_spar <- det(sparrow_cov)

The total variability is \(\sum s_{ii} =\) 40.969

The generalized variability is \(|\mathbf{S}| =\) 2.498

Question 5: Mahalanobis Distance and Outliers

Calculate the Mahalanobis distance each sparrow is from \(\bar{\mathbf{y}}\).

Plot the results with birdno on the x-axis and Mahalanobis distance on the y-axis (don’t report the individual distances).

# Adding the Mahalanobis distance to the sparrow data and renaming it as sparrow2:
sparrow2 <- 
  sparrow |>  
  
  rstatix::mahalanobis_distance()|>  
  
  rownames_to_column(var = "birdno") |>  
  
  mutate(birdno = as.numeric(birdno))

# Creating the graph and adding a line for a Mahalanobis distance indicating a 99% cut off (assuming y ~ MVN)
ggplot(
  data = sparrow2,
  mapping = aes(
    x = birdno, 
    y = mahal.dist
  )
) +
  
  geom_col(
    color = "black", 
    fill = "steelblue"
  ) + 
  
  labs(
    y = "Mahalanobis Distance \n(p = 5)",
    x = "Bird Number"
  ) + 
  
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.05),
                       add = c(0,0))
  ) +

  scale_x_continuous(expand = c(0,0)) +
  
  geom_hline(
    yintercept = qchisq(p = 0.995, df = 5),
    color = "darkred"
  ) + 
  
  geom_text(
    data = sparrow2 |> 
           filter(mahal.dist > qchisq(p = 0.995, df = 5)),
    mapping = aes(
      x = birdno,
      y = mahal.dist,
      label = birdno
    ),
    color = "darkred", 
    vjust = -0.5
  )

Do there appear to be any outliers? Justify your answer.

Yes, the 31st bird has a Mahalanobis distance close to 20, which is much larger than the next largest Mahalanobis distance of about 12.2

Question 6: Multivariate Normality?

Part a) Checking Univariate Normality

First step is to check if the individual variable are univariate Normal:

# Creating a QQ Plot for each of the 5 individual sparrow measurements
sparrow |>  
  
  pivot_longer(
    cols = length:sternum, 
    values_to = "length", 
    names_to = "Measure"
  ) |>  
  
  mutate(measure = as_factor(Measure)) |>  
  
  ggplot(
    mapping = aes(
      sample = length, 
      color = measure
    )
  ) + 
  
  stat_qq_line(color = "black") + 
  
  theme(legend.position = "none") +
  
  stat_qq(alpha = 0.50) +
  
  facet_wrap(
    facet = ~ measure,
    scales = "free"
  ) 

Individually, all the plots look about what we would expect except length is a little too flat at the tails. The other 4 measurements all seem reasonable!

# Using mvn() in the MVN package to perform the Shapiro-Wilks test for each of the variables individually
mvn(
  data = sparrow,
  covariance = FALSE,
  mvnTest = "mardia",
  univariateTest = "SW",
  desc = F
)$univariateNormality

With the exception of length, all the p-values are quite large, with the second smallest being 0.34. No evidence that the individual variables are not Normally distributed.

The p-value for length is about 0.04, indicating some evidence of non-Normality, but not incredibly strong evidence (possibly because the sample of 49 sparrows may not be that large).

Part b) Checking Multivariate Normality

For a set of random variables to be multivariate Normal, any linear combination of the variables should also be Normal:

\[z = \sum_{i = 1} a_i y_i \sim N(\mu_z, \sigma^2_z)\]

which is impossible to completely check! Instead we’ll just perform a MVN test created by Mardia using the mvn() function:

mvn(
  data = sparrow,
  covariance = FALSE,
  mvnTest = "mardia",
  univariateTest = "SW",
  multivariatePlot = "qq",
  desc = F
)$multivariateNormality

Overall, the test doesn’t imply there is evidence that the 5 measurements for the 49 sparrows are not multivariate Normal!

The Q-Q plot does give a little cause for concern, but we can’t be certain if it is just due to a small sample size causing a somewhat unusual looking graph.

Question 7: Normality of Length

Create the univariate Q-Q plot for length.

ggplot(
  data = sparrow, 
  mapping = aes(sample = length)
) + 
  
  stat_qq_line(color = "black") + 
  
  theme(legend.position = "none") +
  
  stat_qq(
    color = "darkred",
    size = 2
    )

What characteristic from the QQ plot may prevent length from being able to be transformed to a more Normal shape?

Many of the sparrows in the data have the same body length measurement, indicating that length is measured too discretely (not enough decimal places, or in this case, no decimal places).

If a variable is measured too discretely, then the variable won’t appear Normal and no transformation will fix it. If two birds have the same body length, applying any transformation will result in the two sparrows having the same transformed body length and the issue persists.

If possible, we would remeasure the lengths of the sparrows, but unfortunately that isn’t possible since these data are very, very old :(

Question 8: Checking MVN Again

mvn(
  data = sparrow %>% dplyr::select(-length),
  covariance = FALSE,
  mvnTest = "mardia",
  multivariatePlot = "qq",
  univariateTest = "SW",
  desc = F
)

## $multivariateNormality
##              Test         Statistic           p value Result
## 1 Mardia Skewness  27.5874176935894 0.119531450977439    YES
## 2 Mardia Kurtosis 0.260352501001285 0.794591878514289    YES
## 3             MVN              <NA>              <NA>    YES
## 
## $univariateNormality
##           Test  Variable Statistic   p value Normality
## 1 Shapiro-Wilk wing_span     0.979     0.519    YES   
## 2 Shapiro-Wilk   beak        0.974     0.339    YES   
## 3 Shapiro-Wilk  humerus      0.981     0.607    YES   
## 4 Shapiro-Wilk  sternum      0.987     0.853    YES

The test conclusions remain the same - No evidence of the 4 measurements being not MVN.

The Q-Q plot is more “in line” with what we would expect it to look like if the data were MVN