IMPORTANT NOTE: For EACH of these problems, write a couple of sentences explaining in words what substantive conclusions about the data that you can draw from the plots and/or analyses.
Use Hotelling’s T^2 test and the data in the test score data set (scores on Math and Reading tests given to a sample of girls and a sample of boys) to test for a difference in the mean score vector of the boys and the mean vector of the girls.
The following R code will read in the data:
testdata <- read.table("/cloud/project/testscoredata.txt", header=TRUE)
testdata.noIDs <- testdata[,-1] # to remove the ID numbers
## Calculate means and standard errors
summary_data <- testdata %>%
group_by(sex) %>%
summarise(
mean_math = mean(math),
se_math = sd(math) / sqrt(n()),
mean_reading = mean(reading),
se_reading = sd(reading) / sqrt(n())
)
# Reshape data to long format
summary_data_long <- summary_data %>%
pivot_longer(cols = starts_with("mean_"),
names_to = "variable",
values_to = "mean") %>%
mutate(subject = gsub("mean_", "", variable))
# Add standard errors to the long format data
summary_data_long <- summary_data_long %>%
mutate(se = ifelse(grepl("math", variable), se_math, se_reading))
# Plotting
ggplot(summary_data_long, aes(x = subject, y = mean, fill = sex)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.7), width = 0.7) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
width = 0.2,
position = position_dodge(width = 0.7)) +
labs(title = "Mean Scores with Standard Error Bars",
x = "Subject",
y = "Score") +
theme_minimal() +
theme(legend.title = element_blank(),
panel.grid.major = element_blank(), # Remove major grid lines
panel.grid.minor = element_blank(), # Remove minor grid lines
panel.background = element_blank()) # Remove background grid
# Select columns directly without using a vector or variable
# Filter data for boys and girls
boys <- testdata %>% filter(sex == 'boy')
girls <- testdata %>% filter(sex == 'girl')
# Perform Hotelling's T^2 test
#hotelling_result <- hotelling.test(boys, girls)
# View the result
#kable(print(hotelling_result), format = "markdown")
Hotelling’s T^2 test, like many multivariate statistical tests, relies on certain assumptions for its validity. Here are the key assumptions:
Multivariate Normality: The scores for math and reading within each group (boys and girls) should follow a multivariate normal distribution. This means that for each group, the combination of ‘math’ and ‘reading’ scores should be normally distributed.
Homogeneity of Covariance Matrices: The covariance matrices (which describe how variables vary together) of the ‘math’ and ‘reading’ scores should be equal across groups. This assumption is crucial because Hotelling’s T^2 test is sensitive to differences in covariance structures between groups.
Independence of Observations: Observations within each group should be independent of each other. This means that one observation’s score should not influence another observation’s score within the same group.
Continuous Variables: The variables being tested (here, ‘math’ and ‘reading’ scores) should be continuous. Hotelling’s T^2 test assumes variables are measured on a continuous scale.
While formal tests can be conducted for some of these assumptions (like testing for normality or homogeneity of covariance matrices), informal checks can also be informative:
Visual Inspection: Plotting histograms or density plots of ‘math’ and ‘reading’ scores for each group can provide a visual check for normality assumptions.
Boxplots or Scatterplots: Examining scatterplots or boxplots of ‘math’ versus ‘reading’ scores for each group can help assess whether there are substantial departures from multivariate normality or if there are outliers affecting the results.
Covariance Matrix Comparison: Calculating and comparing sample covariance matrices between groups can give insights into whether there are significant differences in covariance structures.
Data Transformation: Sometimes, transforming the data (e.g., using logarithms or other transformations) can help meet assumptions such as normality or equal variances.
Before interpreting the results of Hotelling’s T^2 test, it’s essential to consider these assumptions. If any assumptions are violated, alternative tests or adjustments to the analysis may be necessary to ensure the validity of the conclusions drawn from the statistical test.
Therefore we will adopt a mixed methods approach to verify if the assumptions above are met.
Below are checks and tests for the assumptions required for Hotelling’s T^2 test in R:
Formal Test using Mshapiro Test:
#conducting mshapiro test
# Add new columns with logarithms of 'math' and 'reading'
testdata <- testdata %>%
mutate(log_math = log(math),
log_reading = log(reading))
df<-testdata%>% dplyr::select(log_math,log_reading)
mshapiro.test(t(df))
##
## Shapiro-Wilk normality test
##
## data: Z
## W = 0.96024, p-value = 0.0426
Informal Check using Plots:
# Histograms for 'math' and 'reading' scores
par(mfrow=c(2, 2)) # Set up a 2x2 grid of plots
hist(testdata$log_math, main='Boys Math Scores')
hist(testdata$log_reading, main='Boys Reading Scores')
hist(testdata$log_math, main='Girls Math Scores')
hist(testdata$log_reading, main='Girls Reading Scores')
Formal Test using Box’s M Test:
boys_data <-testdata %>%filter(sex=='boy')%>%
dplyr::select(log_math, log_reading,sex)
girls_data <- testdata %>%filter(sex=='girl')%>%
dplyr::select(log_math, log_reading,sex)
testdata$sex <- as.factor(testdata$sex) # Coerce to factor if needed
kable(box_m(as.matrix(as.numeric(testdata$log_math)), group = testdata$sex))
| statistic | p.value | parameter | method |
|---|---|---|---|
| 0.0284457 | 0.8660651 | 1 | Box’s M-test for Homogeneity of Covariance Matrices |
kable(box_m(as.matrix(as.numeric(testdata$log_reading)), group = testdata$sex))
| statistic | p.value | parameter | method |
|---|---|---|---|
| 0.2974337 | 0.5854957 | 1 | Box’s M-test for Homogeneity of Covariance Matrices |
Informal Check using Sample Covariance Matrices:
# Sample covariance matrices
cov_boys <- cov(boys[, c('math', 'reading')])
cov_girls <- cov(girls[, c('math', 'reading')])
# Compare covariance matrices
print("Covariance Matrix for Boys:")
## [1] "Covariance Matrix for Boys:"
print(cov_boys)
## math reading
## math 28.53811 28.58915
## reading 28.58915 30.53372
print("Covariance Matrix for Girls:")
## [1] "Covariance Matrix for Girls:"
print(cov_girls)
## math reading
## math 25.60840 24.05203
## reading 24.05203 25.10327
Assessment: This assumption will be typically presumed since the data is not of longitudinal/time series nature.
The Mshapiro test indicated data was not multivariate normal and hence did apply a log transformation which from the histogram the transformed data was then fairly normal.
The box_m test for testing similarity of covariance structure for the math and reading scores across sex was also conducted and p-value of 0.87 for math score \(\le alpha =0.05\) and p-value of 0.59 reading and math scores confirmed similarity in covariance structures of the math and reading scores.
Since the homogeneity of variance ,normality and independence of observations Hotelling test was appropriate test for exploring the difference of math and reading scores across sex.
Test Statistic: Hotelling’s T^2 test statistic value is 18.356.
Numerator Degrees of Freedom: The numerator degrees of freedom (df) is 2.
Denominator Degrees of Freedom: The denominator degrees of freedom (df) is 59.
P-value: The computed p-value for the test is 0.0003805.
Hotelling’s T^2 test is a multivariate statistical test used to compare the mean vectors of two groups (in this case, boys and girls) based on multiple variables (‘math’ and ‘reading’ scores). The test statistic measures the difference between the mean vectors of the two groups relative to their variability.
Significance: The small p-value (0.0003805) suggests strong evidence against the null hypothesis of equal mean vectors between boys and girls in terms of both ‘math’ and ‘reading’ scores.
Degrees of Freedom: The degrees of freedom provide context about the variability and distribution of the test statistic under the null hypothesis.
The test results above indicates that there is a statistically significant difference in academic performance (math and reading scores) between boys and girls in the dataset testdata.
Consider the ‘hsb’ data set in the G/drive. Suppose our goal is to compare the mean vectors (where the variables are the scores on: read, write, math, science, socst) among the different levels of ‘ses’ (high, middle, and low socioeconomic classes).
hsb <- read.table("/cloud/project/hsbdata.txt", header=TRUE)
attach(hsb)
hsb.prob2 <- hsb[,c(5,8,9,10,11,12)]
response.variables <- cbind(read, write, math, science, socst)
response.variables1 <- cbind(hsb.prob2$read, hsb.prob2$write, hsb.prob2$math, hsb.prob2$science, hsb.prob2$socst)
manova.test1 <- manova(response.variables1 ~ hsb.prob2$ses)
summary.manova(manova.test1, test="Wilks")
## Df Wilks approx F num Df den Df Pr(>F)
## hsb.prob2$ses 2 0.85105 3.2418 10 386 0.0004946 ***
## Residuals 197
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
hsb.prob2<-hsb.prob2%>%mutate(log_read=log(hsb.prob2$read),log_math=log(hsb.prob2$math),log_socst=log(hsb.prob2$socst))
response.variables <- cbind(hsb.prob2$log_read, hsb.prob2$write, hsb.prob2$log_math, hsb.prob2$science,hsb.prob2$log_socst)
manova.test <- manova(response.variables ~ hsb.prob2$ses)
summary.manova(manova.test, test="Wilks")
## Df Wilks approx F num Df den Df Pr(>F)
## hsb.prob2$ses 2 0.85327 3.1872 10 386 0.0006006 ***
## Residuals 197
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Below is the summary based on the Manova test:
Conclusion: - Since the p-value is much less than the conventional alpha level of 0.05, we reject the null hypothesis. This indicates that there are statistically significant differences in the mean vectors of the dependent variables (read, write, math, science, socst) across different levels of the socio-economic status (ses) variable. - The MANOVA results suggest that socio-economic status has a significant effect on the combination of the dependent variables after log transformation of some of the variables
Multivariate Normality:
Each group should have multivariate normal distribution.
We will check this by examining the residuals of the MANOVA model,from the results the read,math and socst showed a significant deviation from normality and hence a log transform was applied.
Homogeneity of Variance-Covariance Matrices:
Absence of Multicollinearity:
Bivariate plots were used to assess this assumption and from the plot we can see no notable relationship across the variables.
manova.residuals <- residuals(manova.test)
library(MVN)
mvn(manova.residuals, multivariatePlot = "qq",multivariateOutlierMethod = "quan", mvnTest = "hz")
## $multivariateNormality
## Test HZ p value MVN
## 1 Henze-Zirkler 1.280676 3.447879e-07 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Anderson-Darling Column1 0.6859 0.0722 YES
## 2 Anderson-Darling Column2 2.6410 <0.001 NO
## 3 Anderson-Darling Column3 0.2717 0.6686 YES
## 4 Anderson-Darling Column4 0.5667 0.1404 YES
## 5 Anderson-Darling Column5 3.8843 <0.001 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th
## 1 200 8.411037e-18 0.1916697 -8.164177e-03 -0.5261073 0.3641101 -0.1206422
## 2 200 8.232130e-17 9.2481587 2.073684e+00 -22.9137931 15.0736842 -6.6170213
## 3 200 3.411984e-18 0.1714860 1.119158e-02 -0.4429316 0.3960374 -0.1194286
## 4 200 6.196064e-16 9.4963421 1.294737e+00 -29.4482759 22.2947368 -7.7021277
## 5 200 -5.983847e-18 0.2118984 8.592037e-05 -0.6736432 0.4330799 -0.1030983
## 75th Skew Kurtosis
## 1 0.1281047 -0.2934455 -0.3795060
## 2 7.0736842 -0.5267365 -0.5925622
## 3 0.1112308 -0.0159192 -0.3989184
## 4 6.2947368 -0.1700743 -0.2449623
## 5 0.1616520 -0.8471420 0.4803907
kable(box_m(as.matrix(as.numeric(hsb.prob2$log_read)), group = hsb.prob2$ses))
| statistic | p.value | parameter | method |
|---|---|---|---|
| 0.5161475 | 0.7725382 | 2 | Box’s M-test for Homogeneity of Covariance Matrices |
kable(box_m(as.matrix(as.numeric(hsb.prob2$write)), group = hsb.prob2$ses))
| statistic | p.value | parameter | method |
|---|---|---|---|
| 0.1462012 | 0.9295073 | 2 | Box’s M-test for Homogeneity of Covariance Matrices |
kable(box_m(as.matrix(as.numeric(hsb.prob2$log_math)), group = hsb.prob2$ses))
| statistic | p.value | parameter | method |
|---|---|---|---|
| 0.7126749 | 0.7002363 | 2 | Box’s M-test for Homogeneity of Covariance Matrices |
kable(box_m(as.matrix(as.numeric(hsb.prob2$science)), group = hsb.prob2$ses))
| statistic | p.value | parameter | method |
|---|---|---|---|
| 1.612382 | 0.4465557 | 2 | Box’s M-test for Homogeneity of Covariance Matrices |
kable(box_m(as.matrix(as.numeric(hsb.prob2$log_socst)), group = hsb.prob2$ses))
| statistic | p.value | parameter | method |
|---|---|---|---|
| 2.321207 | 0.3132971 | 2 | Box’s M-test for Homogeneity of Covariance Matrices |
columns <- colnames(hsb.prob2 %>%dplyr::select(-read, -math, -socst))
# Set up the layout for multiple plots (2 plots per row)
par(mfrow = c(1, 1))
# Loop through the columns to create bivariate boxplots
for (i in 1:(length(columns) - 1)) {
for (j in (i + 1):length(columns)) {
# Get column names and labels for labeling
col1 <- columns[i]
col2 <- columns[j]
# Check if both columns are numeric
if (is.numeric(hsb.prob2[[col1]]) & is.numeric(hsb.prob2[[col2]])) {
# Select the variables for adjacent columns
x <- hsb.prob2[, c(col1, col2)]
# Create the bivariate boxplot
bvbox(x, mtitle = paste("Bivariate Box plot of", col1, "and", col2), xlab = col1, ylab = col2, col = 'red')
} else {
message(paste("Skipping non-numeric columns:", col1, "and", col2))
}
# Add a new row for the next pair of plots if necessary
if (j %% 2 == 0 && j != length(columns)) {
par(mfrow = c(1, 1))
}
}
}
## Skipping non-numeric columns: ses and write
## Skipping non-numeric columns: ses and science
## Skipping non-numeric columns: ses and log_read
## Skipping non-numeric columns: ses and log_math
## Skipping non-numeric columns: ses and log_socst
# Filter only numeric columns
numeric_data <- hsb.prob2 %>% dplyr::select(-read,-math,-socst)%>%
select_if(is.numeric)
# Compute the correlation matrix
correlation_matrix <- cor(numeric_data, use = "complete.obs")
# Print the correlation matrix
kable(correlation_matrix)
| write | science | log_read | log_math | log_socst | |
|---|---|---|---|---|---|
| write | 1.0000000 | 0.5704416 | 0.5992765 | 0.6190473 | 0.5951147 |
| science | 0.5704416 | 1.0000000 | 0.6262250 | 0.6257274 | 0.4407427 |
| log_read | 0.5992765 | 0.6262250 | 1.0000000 | 0.6523513 | 0.5939006 |
| log_math | 0.6190473 | 0.6257274 | 0.6523513 | 1.0000000 | 0.5303334 |
| log_socst | 0.5951147 | 0.4407427 | 0.5939006 | 0.5303334 | 1.0000000 |
Homogeneity of Covariance Matrices (Box’s M Test):
All p-values are greater than 0.05, indicating that we fail to reject the null hypothesis of homogeneity of covariance matrices. This suggests that the assumption of homogeneity of variances is met.
Correlation Matrix:
Multivariate Normality (Henze-Zirkler Test):
Test statistic = 1.2807, p-value = 3.447879e-07
The p-value is very low, indicating that the assumption of multivariate normality is not met.
Univariate Normality (Anderson-Darling Test):
Some of the variables (social studies,wiriting and reading)do not meet the assumption of univariate normality.
Multicollinearity:
Multicollinearity refers to a situation where predictor variables in a regression model are highly correlated. High multicollinearity can inflate the variances of the parameter estimates and make the model unstable.
The correlation matrix shows that there are several pairs of variables with correlations above 0.6, which indicates moderate to high correlation.
Specifically:
High correlations among predictors suggest potential multicollinearity issues.
Outliers:
The presence of outliers can significantly impact the results of a MANOVA.
In the dataset, there are 17 outliers and 183 non-outliers.
Outliers can distort the mean and increase the variance, which in turn can affect the results of the MANOVA.
Based on the results:
Homogeneity of covariance matrices is satisfied, which supports the validity of the MANOVA test.
Multivariate normality is not satisfied, which raises concerns about the validity of the MANOVA test.
Univariate normality is not met for some variables, which further questions the validity of the MANOVA test.
Multicollinearity is indicated by the high correlations among several predictor variables.
Outliers are present in the dataset, which can distort the analysis and affect the validity of the results.
Validity of Inference:
While the homogeneity of variances assumption is met, the violation of multivariate normality is a significant concern. The presence of multicollinearity can also affect the stability and interpretability of the MANOVA results.
Outliers can further distort the results.
These violations suggest that the results of the MANOVA should be interpreted with caution. Non-parametric alternatives or robust statistical methods might be considered to validate the findings further.
Additionally, techniques such as variance inflation factor (VIF) analysis could be employed to quantify the extent of multicollinearity and potentially mitigate its effects.
For outliers, robust statistical techniques or transformation of variables might be considered to minimize their impact.
# Calculate mean vectors for each SES group
mean_vectors <- aggregate(cbind(read, write, math, science, socst) ~ ses, data = hsb.prob2, mean)
kable(mean_vectors)
| ses | read | write | math | science | socst |
|---|---|---|---|---|---|
| high | 56.50000 | 55.91379 | 56.17241 | 55.44828 | 57.13793 |
| low | 48.27660 | 50.61702 | 49.17021 | 47.70213 | 47.31915 |
| middle | 51.57895 | 51.92632 | 52.21053 | 51.70526 | 52.03158 |
Reading:
The mean reading score increases with SES: 48.28 for the low group, 51.58 for the medium group, and 56.50 for the high group.
This suggests that higher SES is associated with higher reading scores.
Writing:
The writing scores also increase with SES: 50.62 for the low group, 51.93 for the medium group, and 55.91 for the high group.
This indicates that students from higher SES backgrounds tend to have better writing scores.
Math:
Math scores show a similar pattern: 49.17 for the low group, 52.21 for the medium group, and 56.17 for the high group.
This suggests a positive association between SES and math scores.
Science:
Science scores follow the same trend: 47.70 for the low group, 51.71 for the medium group, and 55.45 for the high group.
Higher SES groups tend to have higher science scores.
Social Studies:
The social studies scores also increase with SES: 47.32 for the low group, 52.03 for the medium group, and 57.14 for the high group.
This indicates that SES positively influences social studies scores as well.
Overall, the examination of sample mean vectors indicates that students from higher SES backgrounds tend to have better academic performance across all measured variables (reading, writing, math, science, and social studies). This pattern is consistent across all variables, suggesting that SES is an important factor in determining academic success.
These results highlight the impact of socioeconomic status on academic performance, with higher SES associated with better outcomes in multiple subject areas. The differences in mean scores between the SES groups are substantial and consistent, underscoring the importance of addressing socioeconomic disparities to improve educational equity.
Exercise 6.24, page 347, Johnson & Wichern (2007).
6.24. Researchers have suggested that a change in skull size over time is evidence of the interbreeding of a resident population with immigrant populations. Four measurements were made of male Egyptian skulls for three different time periods: period 1 is 4000 B.C., period 2 is 3300 B.C., and period 3 is 1850 B.C. The data are shown in Table 6.13 on page 349 (see the skull data on the website www.prenhall.com/sratistics). The measured variables are X1 = maximum breadth of skull ( mm) X2 = basibregmatic height of skull ( mm) X3 = basialveolar length of skull ( mm) x4 = nasal height of skull (mm)
skull <- read.table("/cloud/project/egyptianskulls.txt", header=TRUE)
skull$TimePeriod <- as.factor(skull$TimePeriod)
#conducting manova
# Install and load the emmeans package if you haven't already
install.packages("emmeans")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(emmeans)
## Welcome to emmeans.
## Caution: You lose important information if you filter this package's results.
## See '? untidy'
# Fit the MANOVA model
skull_model <- manova(cbind(MaxBreath, BasHeight, BasLength, NasHeight) ~ TimePeriod, data=skull)
summary.manova(skull_model,test="Wilks")
## Df Wilks approx F num Df den Df Pr(>F)
## TimePeriod 2 0.70209 1.1607 8 48 0.3421
## Residuals 27
# Install and load the emmeans package if you haven't already
install.packages("emmeans")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(emmeans)
# Perform post hoc tests using the emmeans package
skull_emmeans <- emmeans(skull_model, ~ TimePeriod)
# Fit linear models for each response variable
model_MaxBreath <- lm(MaxBreath ~ TimePeriod, data=skull)
model_BasHeight <- lm(BasHeight ~ TimePeriod, data=skull)
model_BasLength <- lm(BasLength ~ TimePeriod, data=skull)
model_NasHeight <- lm(NasHeight ~ TimePeriod, data=skull)
# Perform post hoc tests for each model
emmeans_MaxBreath <- emmeans(model_MaxBreath, ~ TimePeriod)
emmeans_BasHeight <- emmeans(model_BasHeight, ~ TimePeriod)
emmeans_BasLength <- emmeans(model_BasLength, ~ TimePeriod)
emmeans_NasHeight <- emmeans(model_NasHeight, ~ TimePeriod)
# Get simultaneous confidence intervals
Simulateneousci_MaxBreath <- confint(emmeans_MaxBreath, level=0.95, adjust="bonferroni")
Simulateneousci_BasHeight <- confint(emmeans_BasHeight, level=0.95, adjust="bonferroni")
Simulateneousci_BasLength <- confint(emmeans_BasLength, level=0.95, adjust="bonferroni")
Simulateneousci_NasHeight <- confint(emmeans_NasHeight, level=0.95, adjust="bonferroni")
# Print the results
kable(Simulateneousci_MaxBreath)
| TimePeriod | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| 1 | 130.9 | 1.74547 | 27 | 126.4448 | 135.3552 |
| 2 | 133.3 | 1.74547 | 27 | 128.8448 | 137.7552 |
| 3 | 134.8 | 1.74547 | 27 | 130.3448 | 139.2552 |
kable(Simulateneousci_BasHeight)
| TimePeriod | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| 1 | 134.7 | 1.553371 | 27 | 130.7351 | 138.6649 |
| 2 | 132.9 | 1.553371 | 27 | 128.9351 | 136.8649 |
| 3 | 131.5 | 1.553371 | 27 | 127.5351 | 135.4649 |
kable(Simulateneousci_BasLength)
| TimePeriod | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| 1 | 96.7 | 1.453221 | 27 | 92.99071 | 100.40929 |
| 2 | 98.5 | 1.453221 | 27 | 94.79071 | 102.20929 |
| 3 | 95.8 | 1.453221 | 27 | 92.09071 | 99.50929 |
kable(Simulateneousci_NasHeight)
| TimePeriod | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| 1 | 49.9 | 1.013063 | 27 | 47.3142 | 52.4858 |
| 2 | 48.7 | 1.013063 | 27 | 46.1142 | 51.2858 |
| 3 | 50.3 | 1.013063 | 27 | 47.7142 | 52.8858 |
Based on the simultaneous confidence intervals above:
For Maximum Breadth, the 95% confidence intervals for the two time periods overlap, suggesting no significant difference.
For Basibregmatic Height, Basialveolar Length, and Nasal Height, we also observe overlapping confidence intervals, indicating no significant differences in these measurements across time periods.
The post hoc tests with simultaneous confidence intervals confirm the MANOVA results, showing no significant differences in skull measurements among the different time periods. This consistency across different statistical methods strengthens the conclusion that the skull measurements did not significantly change over the studied time periods.
manova.residuals <- residuals(skull_model)
# Load necessary packages
library(MVN)
# Check for NA and infinite values in manova residuals
if (any(is.na(manova.residuals)) || any(!is.finite(manova.residuals))) {
stop("Data contains NA or infinite values. Please clean your data.")
}
# Perform multivariate normality check and generate plots
mvn_results <- mvn(manova.residuals,
multivariatePlot = "qq", # Q-Q plot
multivariateOutlierMethod = "quan", # Method for outliers
mvnTest = "hz") # Henze-Zirkler test
# Perform Box's M test for each skull measurement
box_m_MaxBreath <- box_m(as.matrix(skull$MaxBreath), group = skull$TimePeriod)
box_m_BasHeight <- box_m(as.matrix(skull$BasHeight), group = skull$TimePeriod)
box_m_BasLength <- box_m(as.matrix(skull$BasLength), group = skull$TimePeriod)
box_m_NasHeight <- box_m(as.matrix(skull$NasHeight), group = skull$TimePeriod)
# Print results
library(knitr)
kable(box_m_MaxBreath)
| statistic | p.value | parameter | method |
|---|---|---|---|
| 6.117974 | 0.0469352 | 2 | Box’s M-test for Homogeneity of Covariance Matrices |
kable(box_m_BasHeight)
| statistic | p.value | parameter | method |
|---|---|---|---|
| 2.248725 | 0.3248595 | 2 | Box’s M-test for Homogeneity of Covariance Matrices |
kable(box_m_BasLength)
| statistic | p.value | parameter | method |
|---|---|---|---|
| 3.667054 | 0.1598487 | 2 | Box’s M-test for Homogeneity of Covariance Matrices |
kable(box_m_NasHeight)
| statistic | p.value | parameter | method |
|---|---|---|---|
| 0.1088941 | 0.9470086 | 2 | Box’s M-test for Homogeneity of Covariance Matrices |
# Load necessary library
library(MASS)
# Set up the layout for multiple plots (2 plots per row)
par(mfrow = c(2, 2))
# Pairwise scatter plots
# Load necessary packages
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:emmeans':
##
## pigs
library(ggplot2)
# Create pairwise scatter plots with GGally::ggpairs
ggpairs(skull[, c("MaxBreath", "BasHeight", "BasLength", "NasHeight")],
title = "Pairwise Scatter Plots",
aes(color = skull$TimePeriod),
lower = list(continuous = wrap("points", alpha = 0.6, size = 2))) +
theme_minimal() +
scale_color_manual(values = c("red", "blue", "green", "purple")) +
theme(panel.grid = element_blank()) +
theme(strip.text = element_text(size = 12)) +
theme(axis.text = element_text(size = 10)) +
theme(axis.title = element_text(size = 12)) +
theme(plot.title = element_text(size = 16, hjust = 0.5)) +
theme(legend.position = "bottom") +
theme(legend.title = element_blank())
# Compute the correlation matrix for skull measurements
correlation_matrix <- cor(skull[, c("MaxBreath", "BasHeight", "BasLength", "NasHeight")], use = "complete.obs")
# Print the correlation matrix
kable(correlation_matrix)
| MaxBreath | BasHeight | BasLength | NasHeight | |
|---|---|---|---|---|
| MaxBreath | 1.0000000 | 0.0339449 | 0.2379248 | 0.4377288 |
| BasHeight | 0.0339449 | 1.0000000 | 0.0198849 | 0.4717779 |
| BasLength | 0.2379248 | 0.0198849 | 1.0000000 | 0.0808361 |
| NasHeight | 0.4377288 | 0.4717779 | 0.0808361 | 1.0000000 |
Based on these results, most of the assumptions for MANOVA are met, although there is a potential violation regarding the homogeneity of variance-covariance matrices for MaxBreath. This should be taken into account when interpreting the MANOVA results.