STA 8301 Multivariate Statistical Analysis

IMPORTANT NOTE: For EACH of these problems, write a couple of sentences explaining in words what substantive conclusions about the data that you can draw from the plots and/or analyses.

PROBLEM 1:

Use Hotelling’s T^2 test and the data in the test score data set (scores on Math and Reading tests given to a sample of girls and a sample of boys) to test for a difference in the mean score vector of the boys and the mean vector of the girls.

The following R code will read in the data:

testdata <- read.table("/cloud/project/testscoredata.txt", header=TRUE)
testdata.noIDs <- testdata[,-1]  # to remove the ID numbers

## Calculate means and standard errors
summary_data <- testdata %>%
  group_by(sex) %>%
  summarise(
    mean_math = mean(math),
    se_math = sd(math) / sqrt(n()),
    mean_reading = mean(reading),
    se_reading = sd(reading) / sqrt(n())
  )

# Reshape data to long format
summary_data_long <- summary_data %>%
  pivot_longer(cols = starts_with("mean_"), 
               names_to = "variable", 
               values_to = "mean") %>%
  mutate(subject = gsub("mean_", "", variable))

# Add standard errors to the long format data
summary_data_long <- summary_data_long %>%
  mutate(se = ifelse(grepl("math", variable), se_math, se_reading))

# Plotting
ggplot(summary_data_long, aes(x = subject, y = mean, fill = sex)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.7), width = 0.7) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                width = 0.2,
                position = position_dodge(width = 0.7)) +
  labs(title = "Mean Scores with Standard Error Bars",
       x = "Subject",
       y = "Score") +
  theme_minimal() + 
  theme(legend.title = element_blank(),
        panel.grid.major = element_blank(),  # Remove major grid lines
        panel.grid.minor = element_blank(),  # Remove minor grid lines
        panel.background = element_blank())  # Remove background grid

# Select columns directly without using a vector or variable
# Filter data for boys and girls
boys <- testdata %>% filter(sex == 'boy')
girls <- testdata %>% filter(sex == 'girl')

# Perform Hotelling's T^2 test
#hotelling_result <- hotelling.test(boys, girls)

# View the result
#kable(print(hotelling_result), format = "markdown")

Hotelling’s T^2 test, like many multivariate statistical tests, relies on certain assumptions for its validity. Here are the key assumptions:

Multivariate Normality: The scores for math and reading within each group (boys and girls) should follow a multivariate normal distribution. This means that for each group, the combination of ‘math’ and ‘reading’ scores should be normally distributed.
Homogeneity of Covariance Matrices: The covariance matrices (which describe how variables vary together) of the ‘math’ and ‘reading’ scores should be equal across groups. This assumption is crucial because Hotelling’s T^2 test is sensitive to differences in covariance structures between groups.
Independence of Observations: Observations within each group should be independent of each other. This means that one observation’s score should not influence another observation’s score within the same group.
Continuous Variables: The variables being tested (here, ‘math’ and ‘reading’ scores) should be continuous. Hotelling’s T^2 test assumes variables are measured on a continuous scale.

Informal Checks for Assumptions:

While formal tests can be conducted for some of these assumptions (like testing for normality or homogeneity of covariance matrices), informal checks can also be informative:

Visual Inspection: Plotting histograms or density plots of ‘math’ and ‘reading’ scores for each group can provide a visual check for normality assumptions.
Boxplots or Scatterplots: Examining scatterplots or boxplots of ‘math’ versus ‘reading’ scores for each group can help assess whether there are substantial departures from multivariate normality or if there are outliers affecting the results.
Covariance Matrix Comparison: Calculating and comparing sample covariance matrices between groups can give insights into whether there are significant differences in covariance structures.
Data Transformation: Sometimes, transforming the data (e.g., using logarithms or other transformations) can help meet assumptions such as normality or equal variances.

Conclusion:

Before interpreting the results of Hotelling’s T^2 test, it’s essential to consider these assumptions. If any assumptions are violated, alternative tests or adjustments to the analysis may be necessary to ensure the validity of the conclusions drawn from the statistical test.

Therefore we will adopt a mixed methods approach to verify if the assumptions above are met.

Below are checks and tests for the assumptions required for Hotelling’s T^2 test in R:

1. Multivariate Normality

Formal Test using Mshapiro Test:

#conducting mshapiro test
# Add new columns with logarithms of 'math' and 'reading'

testdata <- testdata %>%
  mutate(log_math = log(math),
         log_reading = log(reading))

df<-testdata%>% dplyr::select(log_math,log_reading)

mshapiro.test(t(df))

## 
##  Shapiro-Wilk normality test
## 
## data:  Z
## W = 0.96024, p-value = 0.0426

Informal Check using Plots:

# Histograms for 'math' and 'reading' scores
par(mfrow=c(2, 2))  # Set up a 2x2 grid of plots
hist(testdata$log_math, main='Boys Math Scores')
hist(testdata$log_reading, main='Boys Reading Scores')
hist(testdata$log_math, main='Girls Math Scores')
hist(testdata$log_reading, main='Girls Reading Scores')

2. Homogeneity of Covariance Matrices

Formal Test using Box’s M Test:

boys_data <-testdata %>%filter(sex=='boy')%>%
   dplyr::select(log_math, log_reading,sex)
girls_data <- testdata %>%filter(sex=='girl')%>%
   dplyr::select(log_math, log_reading,sex)

testdata$sex <- as.factor(testdata$sex)  # Coerce to factor if needed

kable(box_m(as.matrix(as.numeric(testdata$log_math)), group = testdata$sex))

statistic	p.value	parameter	method
0.0284457	0.8660651	1	Box’s M-test for Homogeneity of Covariance Matrices

kable(box_m(as.matrix(as.numeric(testdata$log_reading)), group = testdata$sex))

statistic	p.value	parameter	method
0.2974337	0.5854957	1	Box’s M-test for Homogeneity of Covariance Matrices

Informal Check using Sample Covariance Matrices:

# Sample covariance matrices
cov_boys <- cov(boys[, c('math', 'reading')])
cov_girls <- cov(girls[, c('math', 'reading')])

# Compare covariance matrices
print("Covariance Matrix for Boys:")

## [1] "Covariance Matrix for Boys:"

print(cov_boys)

##             math  reading
## math    28.53811 28.58915
## reading 28.58915 30.53372

print("Covariance Matrix for Girls:")

## [1] "Covariance Matrix for Girls:"

print(cov_girls)

##             math  reading
## math    25.60840 24.05203
## reading 24.05203 25.10327

3. Independence of Observations

Assessment: This assumption will be typically presumed since the data is not of longitudinal/time series nature.

Conclusion

The Mshapiro test indicated data was not multivariate normal and hence did apply a log transformation which from the histogram the transformed data was then fairly normal.

The box_m test for testing similarity of covariance structure for the math and reading scores across sex was also conducted and p-value of 0.87 for math score $\le alpha =0.05$ and p-value of 0.59 reading and math scores confirmed similarity in covariance structures of the math and reading scores.

Since the homogeneity of variance ,normality and independence of observations Hotelling test was appropriate test for exploring the difference of math and reading scores across sex.

Output Summary:

Test Statistic: Hotelling’s T^2 test statistic value is 18.356.
Numerator Degrees of Freedom: The numerator degrees of freedom (df) is 2.
Denominator Degrees of Freedom: The denominator degrees of freedom (df) is 59.
P-value: The computed p-value for the test is 0.0003805.

Interpretation:

Hotelling’s T^2 test is a multivariate statistical test used to compare the mean vectors of two groups (in this case, boys and girls) based on multiple variables (‘math’ and ‘reading’ scores). The test statistic measures the difference between the mean vectors of the two groups relative to their variability.

Significance: The small p-value (0.0003805) suggests strong evidence against the null hypothesis of equal mean vectors between boys and girls in terms of both ‘math’ and ‘reading’ scores.
Degrees of Freedom: The degrees of freedom provide context about the variability and distribution of the test statistic under the null hypothesis.

The test results above indicates that there is a statistically significant difference in academic performance (math and reading scores) between boys and girls in the dataset testdata.

PROBLEM 2:

Consider the ‘hsb’ data set in the G/drive. Suppose our goal is to compare the mean vectors (where the variables are the scores on: read, write, math, science, socst) among the different levels of ‘ses’ (high, middle, and low socioeconomic classes).

hsb <- read.table("/cloud/project/hsbdata.txt", header=TRUE)
attach(hsb)

hsb.prob2 <- hsb[,c(5,8,9,10,11,12)]
response.variables <- cbind(read, write, math, science, socst)

Conduct the MANOVA F-test using Wilks’ Lambda to test for a difference in (read, write, math, science, socst) mean vectors across the three ses classes. Use a 0.05 significance level, and give the P-value of the test. To conduct the MANOVA F-test using Wilks’ Lambda and check the assumptions, you can follow these steps in R:

(a) Conducting the MANOVA F-test using Wilks’ Lambda

Conducting MANOVA with non - normal data

response.variables1 <- cbind(hsb.prob2$read, hsb.prob2$write, hsb.prob2$math, hsb.prob2$science, hsb.prob2$socst)
manova.test1 <- manova(response.variables1 ~ hsb.prob2$ses)
summary.manova(manova.test1, test="Wilks")

##                Df   Wilks approx F num Df den Df    Pr(>F)    
## hsb.prob2$ses   2 0.85105   3.2418     10    386 0.0004946 ***
## Residuals     197                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conducting log transform to ensure normality of data for read,math and socst

hsb.prob2<-hsb.prob2%>%mutate(log_read=log(hsb.prob2$read),log_math=log(hsb.prob2$math),log_socst=log(hsb.prob2$socst))
                
response.variables <- cbind(hsb.prob2$log_read, hsb.prob2$write, hsb.prob2$log_math, hsb.prob2$science,hsb.prob2$log_socst)

Conducting manova after log transformation

manova.test <- manova(response.variables ~ hsb.prob2$ses)
summary.manova(manova.test, test="Wilks")

##                Df   Wilks approx F num Df den Df    Pr(>F)    
## hsb.prob2$ses   2 0.85327   3.1872     10    386 0.0006006 ***
## Residuals     197                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Discussion

Below is the summary based on the Manova test:

Degrees of Freedom (Df):
- The degrees of freedom associated with the independent variable (hsb.prob2$ses) are 2.
- The residual degrees of freedom (denominator degrees of freedom) are 386.
Wilks’ Lambda:
- Wilks’ Lambda value is 0.85327. Wilks’ Lambda is a measure of how much the independent variable accounts for the variance in the dependent variables. Lower values indicate that the independent variable explains more of the variance.
Approximate F-Statistic:
- The approximate F-value is 3.1872. This F-statistic tests the null hypothesis that the group means of the dependent variables are equal across the levels of the independent variable.
Numerator Degrees of Freedom (num Df):
- The numerator degrees of freedom for the F-test are 10.
P-Value (Pr(>F)):
- The p-value is 0.0006006, which is highly significant (p < 0.01).

Conclusion: - Since the p-value is much less than the conventional alpha level of 0.05, we reject the null hypothesis. This indicates that there are statistically significant differences in the mean vectors of the dependent variables (read, write, math, science, socst) across different levels of the socio-economic status (ses) variable. - The MANOVA results suggest that socio-economic status has a significant effect on the combination of the dependent variables after log transformation of some of the variables

(b) Check (informal exploratory checks are fine) to see whether the assumptions of your test are met. Do you believe your inference is valid?

Multivariate Normality:
- Each group should have multivariate normal distribution.
- We will check this by examining the residuals of the MANOVA model,from the results the read,math and socst showed a significant deviation from normality and hence a log transform was applied.
Homogeneity of Variance-Covariance Matrices:
- Box’s M test was used to test this assumption. The assumption being that the covariance matrices of the dependent variables are equal across groups and since all the box_m tests for all the scores had a p-value greater than $\alpha =0.05$ then we conclude the homogeneity of variance-covariance assumption was met.
Absence of Multicollinearity:
- Here we intend to ensure that the dependent variables are not too highly correlated with each other.

Bivariate plots were used to assess this assumption and from the plot we can see no notable relationship across the variables.

Checking Multivariate Normality

Extract residuals

manova.residuals <- residuals(manova.test)

Perform Shapiro-Wilk test for multivariate normality on residuals

library(MVN)
mvn(manova.residuals, multivariatePlot = "qq",multivariateOutlierMethod = "quan", mvnTest = "hz")

## $multivariateNormality
##            Test       HZ      p value MVN
## 1 Henze-Zirkler 1.280676 3.447879e-07  NO
## 
## $univariateNormality
##               Test  Variable Statistic   p value Normality
## 1 Anderson-Darling  Column1     0.6859  0.0722      YES   
## 2 Anderson-Darling  Column2     2.6410  <0.001      NO    
## 3 Anderson-Darling  Column3     0.2717  0.6686      YES   
## 4 Anderson-Darling  Column4     0.5667  0.1404      YES   
## 5 Anderson-Darling  Column5     3.8843  <0.001      NO    
## 
## $Descriptives
##     n          Mean   Std.Dev        Median         Min        Max       25th
## 1 200  8.411037e-18 0.1916697 -8.164177e-03  -0.5261073  0.3641101 -0.1206422
## 2 200  8.232130e-17 9.2481587  2.073684e+00 -22.9137931 15.0736842 -6.6170213
## 3 200  3.411984e-18 0.1714860  1.119158e-02  -0.4429316  0.3960374 -0.1194286
## 4 200  6.196064e-16 9.4963421  1.294737e+00 -29.4482759 22.2947368 -7.7021277
## 5 200 -5.983847e-18 0.2118984  8.592037e-05  -0.6736432  0.4330799 -0.1030983
##        75th       Skew   Kurtosis
## 1 0.1281047 -0.2934455 -0.3795060
## 2 7.0736842 -0.5267365 -0.5925622
## 3 0.1112308 -0.0159192 -0.3989184
## 4 6.2947368 -0.1700743 -0.2449623
## 5 0.1616520 -0.8471420  0.4803907

Checking Homogeneity of Variance-Covariance Matrices

kable(box_m(as.matrix(as.numeric(hsb.prob2$log_read)), group = hsb.prob2$ses))

statistic	p.value	parameter	method
0.5161475	0.7725382	2	Box’s M-test for Homogeneity of Covariance Matrices

kable(box_m(as.matrix(as.numeric(hsb.prob2$write)), group = hsb.prob2$ses))

statistic	p.value	parameter	method
0.1462012	0.9295073	2	Box’s M-test for Homogeneity of Covariance Matrices

kable(box_m(as.matrix(as.numeric(hsb.prob2$log_math)), group = hsb.prob2$ses))

statistic	p.value	parameter	method
0.7126749	0.7002363	2	Box’s M-test for Homogeneity of Covariance Matrices

kable(box_m(as.matrix(as.numeric(hsb.prob2$science)), group = hsb.prob2$ses))

statistic	p.value	parameter	method
1.612382	0.4465557	2	Box’s M-test for Homogeneity of Covariance Matrices

kable(box_m(as.matrix(as.numeric(hsb.prob2$log_socst)), group = hsb.prob2$ses))

statistic	p.value	parameter	method
2.321207	0.3132971	2	Box’s M-test for Homogeneity of Covariance Matrices

Checking Multicollinearity

columns <- colnames(hsb.prob2 %>%dplyr::select(-read, -math, -socst))
# Set up the layout for multiple plots (2 plots per row)
par(mfrow = c(1, 1))
# Loop through the columns to create bivariate boxplots
for (i in 1:(length(columns) - 1)) {
for (j in (i + 1):length(columns)) {
# Get column names and labels for labeling
col1 <- columns[i]
col2 <- columns[j]

# Check if both columns are numeric
if (is.numeric(hsb.prob2[[col1]]) & is.numeric(hsb.prob2[[col2]])) {
# Select the variables for adjacent columns
x <- hsb.prob2[, c(col1, col2)]
# Create the bivariate boxplot
bvbox(x, mtitle = paste("Bivariate Box plot of", col1, "and", col2), xlab = col1, ylab = col2, col = 'red')
} else {
message(paste("Skipping non-numeric columns:", col1, "and", col2))
}
# Add a new row for the next pair of plots if necessary
if (j %% 2 == 0 && j != length(columns)) {
par(mfrow = c(1, 1))
}
}
}

## Skipping non-numeric columns: ses and write

## Skipping non-numeric columns: ses and science

## Skipping non-numeric columns: ses and log_read

## Skipping non-numeric columns: ses and log_math

## Skipping non-numeric columns: ses and log_socst

# Filter only numeric columns
numeric_data <- hsb.prob2 %>% dplyr::select(-read,-math,-socst)%>%
  select_if(is.numeric)

# Compute the correlation matrix
correlation_matrix <- cor(numeric_data, use = "complete.obs")
# Print the correlation matrix
kable(correlation_matrix)

	write	science	log_read	log_math	log_socst
write	1.0000000	0.5704416	0.5992765	0.6190473	0.5951147
science	0.5704416	1.0000000	0.6262250	0.6257274	0.4407427
log_read	0.5992765	0.6262250	1.0000000	0.6523513	0.5939006
log_math	0.6190473	0.6257274	0.6523513	1.0000000	0.5303334
log_socst	0.5951147	0.4407427	0.5939006	0.5303334	1.0000000

Summary of Assumptions Checks for MANOVA with Multicollinearity and Outliers Discussion

Homogeneity of Covariance Matrices (Box’s M Test):
- Log Reading: Statistic = 0.5161, p-value = 0.7725
- Writing: Statistic = 0.1462, p-value = 0.9295
- Log Math: Statistic = 0.7127, p-value = 0.7002
- Science: Statistic = 1.6124, p-value = 0.4466
- Log Social Studies: Statistic = 2.3212, p-value = 0.3133
All p-values are greater than 0.05, indicating that we fail to reject the null hypothesis of homogeneity of covariance matrices. This suggests that the assumption of homogeneity of variances is met.
Correlation Matrix:
- There are moderate to strong correlations among the variables, which is expected in a MANOVA analysis. Here are the specific correlations:
  - Writing and Science: 0.5704
  - Writing and Log Reading: 0.5993
  - Writing and Log Math: 0.6190
  - Writing and Log Social Studies: 0.5951
  - Science and Log Reading: 0.6262
  - Science and Log Math: 0.6257
  - Science and Log Social Studies: 0.4407
  - Log Reading and Log Math: 0.6524
  - Log Reading and Log Social Studies: 0.5939
  - Log Math and Log Social Studies: 0.5303
Multivariate Normality (Henze-Zirkler Test):
- Test statistic = 1.2807, p-value = 3.447879e-07
- The p-value is very low, indicating that the assumption of multivariate normality is not met.
Univariate Normality (Anderson-Darling Test):
- Log Reading: Statistic = 0.6859, p-value = 0.0722 (Normality: Yes)
- Writing: Statistic = 2.6410, p-value < 0.001 (Normality: No)
- Log Math: Statistic = 0.2717, p-value = 0.6686 (Normality: Yes)
- Science: Statistic = 0.5667, p-value = 0.1404 (Normality: Yes)
- Log Social Studies: Statistic = 3.8843, p-value < 0.001 (Normality: No)
Some of the variables (social studies,wiriting and reading)do not meet the assumption of univariate normality.
Multicollinearity:
- Multicollinearity refers to a situation where predictor variables in a regression model are highly correlated. High multicollinearity can inflate the variances of the parameter estimates and make the model unstable.
- The correlation matrix shows that there are several pairs of variables with correlations above 0.6, which indicates moderate to high correlation.
Specifically:
- Log Reading and Log Math: 0.6524
- Writing and Log Math: 0.6190
- Science and Log Reading: 0.6262
High correlations among predictors suggest potential multicollinearity issues.
Outliers:
- The presence of outliers can significantly impact the results of a MANOVA.
- In the dataset, there are 17 outliers and 183 non-outliers.
Outliers can distort the mean and increase the variance, which in turn can affect the results of the MANOVA.

Conclusion on Inference Validity:

Based on the results:

Homogeneity of covariance matrices is satisfied, which supports the validity of the MANOVA test.
Multivariate normality is not satisfied, which raises concerns about the validity of the MANOVA test.
Univariate normality is not met for some variables, which further questions the validity of the MANOVA test.
Multicollinearity is indicated by the high correlations among several predictor variables.
Outliers are present in the dataset, which can distort the analysis and affect the validity of the results.

Validity of Inference:

While the homogeneity of variances assumption is met, the violation of multivariate normality is a significant concern. The presence of multicollinearity can also affect the stability and interpretability of the MANOVA results.

Outliers can further distort the results.

These violations suggest that the results of the MANOVA should be interpreted with caution. Non-parametric alternatives or robust statistical methods might be considered to validate the findings further.

Additionally, techniques such as variance inflation factor (VIF) analysis could be employed to quantify the extent of multicollinearity and potentially mitigate its effects.

For outliers, robust statistical techniques or transformation of variables might be considered to minimize their impact.

(c) Examine the sample mean vectors for each group. Informally comment on the differences among the groups in terms of the specific variables.

# Calculate mean vectors for each SES group
mean_vectors <- aggregate(cbind(read, write, math, science, socst) ~ ses, data = hsb.prob2, mean)

kable(mean_vectors)

ses	read	write	math	science	socst
high	56.50000	55.91379	56.17241	55.44828	57.13793
low	48.27660	50.61702	49.17021	47.70213	47.31915
middle	51.57895	51.92632	52.21053	51.70526	52.03158

Reading:
- The mean reading score increases with SES: 48.28 for the low group, 51.58 for the medium group, and 56.50 for the high group.
- This suggests that higher SES is associated with higher reading scores.
Writing:
- The writing scores also increase with SES: 50.62 for the low group, 51.93 for the medium group, and 55.91 for the high group.
- This indicates that students from higher SES backgrounds tend to have better writing scores.
Math:
- Math scores show a similar pattern: 49.17 for the low group, 52.21 for the medium group, and 56.17 for the high group.
- This suggests a positive association between SES and math scores.
Science:
- Science scores follow the same trend: 47.70 for the low group, 51.71 for the medium group, and 55.45 for the high group.
- Higher SES groups tend to have higher science scores.
Social Studies:
- The social studies scores also increase with SES: 47.32 for the low group, 52.03 for the medium group, and 57.14 for the high group.
- This indicates that SES positively influences social studies scores as well.

Summary

Overall, the examination of sample mean vectors indicates that students from higher SES backgrounds tend to have better academic performance across all measured variables (reading, writing, math, science, and social studies). This pattern is consistent across all variables, suggesting that SES is an important factor in determining academic success.

These results highlight the impact of socioeconomic status on academic performance, with higher SES associated with better outcomes in multiple subject areas. The differences in mean scores between the SES groups are substantial and consistent, underscoring the importance of addressing socioeconomic disparities to improve educational equity.

PROBLEM 3:

Exercise 6.24, page 347, Johnson & Wichern (2007).

6.24. Researchers have suggested that a change in skull size over time is evidence of the interbreeding of a resident population with immigrant populations. Four measurements were made of male Egyptian skulls for three different time periods: period 1 is 4000 B.C., period 2 is 3300 B.C., and period 3 is 1850 B.C. The data are shown in Table 6.13 on page 349 (see the skull data on the website www.prenhall.com/sratistics). The measured variables are X1 = maximum breadth of skull ( mm) X2 = basibregmatic height of skull ( mm) X3 = basialveolar length of skull ( mm) x4 = nasal height of skull (mm)

a) Construct a one-way MANOVA of the Egyptian skull data. Use a = .05.

skull <- read.table("/cloud/project/egyptianskulls.txt", header=TRUE)
skull$TimePeriod <- as.factor(skull$TimePeriod)
#conducting manova
# Install and load the emmeans package if you haven't already
install.packages("emmeans")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

library(emmeans)

## Welcome to emmeans.
## Caution: You lose important information if you filter this package's results.
## See '? untidy'

# Fit the MANOVA model
skull_model <- manova(cbind(MaxBreath, BasHeight, BasLength, NasHeight) ~ TimePeriod, data=skull)


summary.manova(skull_model,test="Wilks")

##            Df   Wilks approx F num Df den Df Pr(>F)
## TimePeriod  2 0.70209   1.1607      8     48 0.3421
## Residuals  27

Summary Interpretation with Variable Names:

MANOVA Test Results:
- The Wilks’ Lambda test statistic is 0.70209.
- The approximate F-statistic is 1.1607.
- Degrees of freedom: 8 for the numerator and 48 for the denominator.
- The p-value associated with the MANOVA test is 0.3421.
Interpretation of MANOVA Results:
- Significance: The MANOVA test examines whether there are significant differences in the mean vector of skull measurements among the three time periods.
- Wilks’ Lambda: A value close to 0 indicates strong evidence against the null hypothesis (no differences among time periods), while a value close to 1 suggests weak evidence against the null hypothesis. Here, Wilks’ Lambda is 0.70209, suggesting some evidence against the null hypothesis but not strong.
- F-statistic: The F-statistic of 1.1607 tests whether the mean vectors of the skull measurements (maximum breadth, basibregmatic height, basialveolar length, nasal height) differ significantly across time periods. With a p-value of 0.3421, which is greater than the typical significance level of 0.05, we fail to reject the null hypothesis at the 5% level of significance.
- Conclusion: Based on these results, we do not find sufficient evidence to conclude that there are significant differences in the mean skull measurements among the three time periods.

Practical Implications:

These results suggest that over the studied time periods, there is no strong evidence to support the hypothesis that skull measurements (maximum breadth, basibregmatic height, basialveolar length, nasal height) significantly differed among the populations represented by these time periods.
Further investigations could explore other factors influencing skull morphology changes, or consider additional variables that might influence the measurements.

b) Construct 95 %.simultaneous confidence intervals to determine which mean components differ among the populations represented by the three time periods.

# Install and load the emmeans package if you haven't already
install.packages("emmeans")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

library(emmeans)

# Perform post hoc tests using the emmeans package
skull_emmeans <- emmeans(skull_model, ~ TimePeriod)

Simultaneous Confidence Intervals for Each Variable:

# Fit linear models for each response variable
model_MaxBreath <- lm(MaxBreath ~ TimePeriod, data=skull)
model_BasHeight <- lm(BasHeight ~ TimePeriod, data=skull)
model_BasLength <- lm(BasLength ~ TimePeriod, data=skull)
model_NasHeight <- lm(NasHeight ~ TimePeriod, data=skull)

# Perform post hoc tests for each model
emmeans_MaxBreath <- emmeans(model_MaxBreath, ~ TimePeriod)
emmeans_BasHeight <- emmeans(model_BasHeight, ~ TimePeriod)
emmeans_BasLength <- emmeans(model_BasLength, ~ TimePeriod)
emmeans_NasHeight <- emmeans(model_NasHeight, ~ TimePeriod)

# Get simultaneous confidence intervals
Simulateneousci_MaxBreath <- confint(emmeans_MaxBreath, level=0.95, adjust="bonferroni")
Simulateneousci_BasHeight <- confint(emmeans_BasHeight, level=0.95, adjust="bonferroni")
Simulateneousci_BasLength <- confint(emmeans_BasLength, level=0.95, adjust="bonferroni")
Simulateneousci_NasHeight <- confint(emmeans_NasHeight, level=0.95, adjust="bonferroni")

# Print the results
kable(Simulateneousci_MaxBreath)

TimePeriod	emmean	SE	df	lower.CL	upper.CL
1	130.9	1.74547	27	126.4448	135.3552
2	133.3	1.74547	27	128.8448	137.7552
3	134.8	1.74547	27	130.3448	139.2552

kable(Simulateneousci_BasHeight)

TimePeriod	emmean	SE	df	lower.CL	upper.CL
1	134.7	1.553371	27	130.7351	138.6649
2	132.9	1.553371	27	128.9351	136.8649
3	131.5	1.553371	27	127.5351	135.4649

kable(Simulateneousci_BasLength)

TimePeriod	emmean	SE	df	lower.CL	upper.CL
1	96.7	1.453221	27	92.99071	100.40929
2	98.5	1.453221	27	94.79071	102.20929
3	95.8	1.453221	27	92.09071	99.50929

kable(Simulateneousci_NasHeight)

TimePeriod	emmean	SE	df	lower.CL	upper.CL
1	49.9	1.013063	27	47.3142	52.4858
2	48.7	1.013063	27	46.1142	51.2858
3	50.3	1.013063	27	47.7142	52.8858

Based on the simultaneous confidence intervals above:

For Maximum Breadth, the 95% confidence intervals for the two time periods overlap, suggesting no significant difference.
For Basibregmatic Height, Basialveolar Length, and Nasal Height, we also observe overlapping confidence intervals, indicating no significant differences in these measurements across time periods.

Interpretation:

The post hoc tests with simultaneous confidence intervals confirm the MANOVA results, showing no significant differences in skull measurements among the different time periods. This consistency across different statistical methods strengthens the conclusion that the skull measurements did not significantly change over the studied time periods.

c) Are the usual MANOVA assumptions realistic for these data? Explain.

Checking Multivariate Normality

Extract residuals

manova.residuals <- residuals(skull_model)

Perform Shapiro-Wilk test for multivariate normality on residuals

# Load necessary packages
library(MVN)

# Check for NA and infinite values in manova residuals
if (any(is.na(manova.residuals)) || any(!is.finite(manova.residuals))) {
  stop("Data contains NA or infinite values. Please clean your data.")
}

# Perform multivariate normality check and generate plots
mvn_results <- mvn(manova.residuals,
                  multivariatePlot = "qq",  # Q-Q plot
                  multivariateOutlierMethod = "quan",  # Method for outliers
                  mvnTest = "hz")  # Henze-Zirkler test

Checking Homogeneity of Variance-Covariance Matrices

# Perform Box's M test for each skull measurement
box_m_MaxBreath <- box_m(as.matrix(skull$MaxBreath), group = skull$TimePeriod)
box_m_BasHeight <- box_m(as.matrix(skull$BasHeight), group = skull$TimePeriod)
box_m_BasLength <- box_m(as.matrix(skull$BasLength), group = skull$TimePeriod)
box_m_NasHeight <- box_m(as.matrix(skull$NasHeight), group = skull$TimePeriod)

# Print results
library(knitr)
kable(box_m_MaxBreath)

statistic	p.value	parameter	method
6.117974	0.0469352	2	Box’s M-test for Homogeneity of Covariance Matrices

kable(box_m_BasHeight)

statistic	p.value	parameter	method
2.248725	0.3248595	2	Box’s M-test for Homogeneity of Covariance Matrices

kable(box_m_BasLength)

statistic	p.value	parameter	method
3.667054	0.1598487	2	Box’s M-test for Homogeneity of Covariance Matrices

kable(box_m_NasHeight)

statistic	p.value	parameter	method
0.1088941	0.9470086	2	Box’s M-test for Homogeneity of Covariance Matrices

Visualizing Pairwise Relationships for Linearity

# Load necessary library
library(MASS)

# Set up the layout for multiple plots (2 plots per row)
par(mfrow = c(2, 2))

# Pairwise scatter plots
# Load necessary packages
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## 
## Attaching package: 'GGally'

## The following object is masked from 'package:emmeans':
## 
##     pigs

library(ggplot2)

# Create pairwise scatter plots with GGally::ggpairs
ggpairs(skull[, c("MaxBreath", "BasHeight", "BasLength", "NasHeight")],
        title = "Pairwise Scatter Plots",
        aes(color = skull$TimePeriod),
        lower = list(continuous = wrap("points", alpha = 0.6, size = 2))) +
  theme_minimal() +
  scale_color_manual(values = c("red", "blue", "green", "purple")) +
  theme(panel.grid = element_blank()) +
  theme(strip.text = element_text(size = 12)) +
  theme(axis.text = element_text(size = 10)) +
  theme(axis.title = element_text(size = 12)) +
  theme(plot.title = element_text(size = 16, hjust = 0.5)) +
  theme(legend.position = "bottom") +
  theme(legend.title = element_blank())

Checking Correlation Matrix

# Compute the correlation matrix for skull measurements
correlation_matrix <- cor(skull[, c("MaxBreath", "BasHeight", "BasLength", "NasHeight")], use = "complete.obs")

# Print the correlation matrix
kable(correlation_matrix)

	MaxBreath	BasHeight	BasLength	NasHeight
MaxBreath	1.0000000	0.0339449	0.2379248	0.4377288
BasHeight	0.0339449	1.0000000	0.0198849	0.4717779
BasLength	0.2379248	0.0198849	1.0000000	0.0808361
NasHeight	0.4377288	0.4717779	0.0808361	1.0000000

Summary of MANOVA Assumptions Based on Output above

Multivariate Normality:
- Henze-Zirkler Test: The test statistic is 0.7701 with a p-value of 0.3104, which indicates no significant deviation from multivariate normality. Therefore, the assumption of multivariate normality is satisfied.
- Univariate Normality (Anderson-Darling Test):
  - MaxBreath: Statistic = 0.3908, p-value = 0.3595, indicating normality.
  - BasHeight: Statistic = 0.2887, p-value = 0.5923, indicating normality.
  - BasLength: Statistic = 0.3383, p-value = 0.4789, indicating normality.
  - NasHeight: Statistic = 0.1691, p-value = 0.9269, indicating normality.
- All variables show no significant deviation from normality, satisfying the univariate normality assumption.
Descriptive Statistics:
- The descriptive statistics provide insights into the central tendency, spread, and shape of the data distributions for each variable:
  - MaxBreath: Mean = 0, Std.Dev = 5.33, Skew = 0.22, Kurtosis = 0.67
  - BasHeight: Mean = 0, Std.Dev = 4.74, Skew = 0.22, Kurtosis = 0.03
  - BasLength: Mean = 0, Std.Dev = 4.43, Skew = 0.25, Kurtosis = -0.43
  - NasHeight: Mean = 0, Std.Dev = 3.09, Skew = 0.09, Kurtosis = -0.77
- These values suggest approximately normal distributions for the variables.
Homogeneity of Variance-Covariance Matrices (Box’s M Test):
- MaxBreath: Statistic = 6.1180, p-value = 0.0469, indicating a potential violation of the assumption.
- BasHeight: Statistic = 2.2487, p-value = 0.3249, indicating homogeneity.
- BasLength: Statistic = 3.6671, p-value = 0.1598, indicating homogeneity.
- NasHeight: Statistic = 0.1089, p-value = 0.9470, indicating homogeneity.
- Most variables show no significant deviation from homogeneity, except for MaxBreath, which might violate this assumption.
Correlation Matrix:
- The correlations between the variables are as follows:
  - MaxBreath and BasHeight: 0.034
  - MaxBreath and BasLength: 0.238
  - MaxBreath and NasHeight: 0.438
  - BasHeight and BasLength: 0.020
  - BasHeight and NasHeight: 0.472
  - BasLength and NasHeight: 0.081
- The correlations are generally low to moderate, indicating that multicollinearity is not a major concern.

Conclusion

Multivariate Normality: Satisfied based on the Henze-Zirkler test and univariate normality tests.
Homogeneity of Variance-Covariance Matrices: Generally satisfied, except for a potential issue with MaxBreath.
Multicollinearity: Not a significant concern based on the correlation matrix.

Based on these results, most of the assumptions for MANOVA are met, although there is a potential violation regarding the homogeneity of variance-covariance matrices for MaxBreath. This should be taken into account when interpreting the MANOVA results.

Multivariate Assignment VII

Hassan Ngala - 122912

2024-07-14

STA 8301 Multivariate Statistical Analysis

PROBLEM 1:

Informal Checks for Assumptions:

Conclusion:

1. Multivariate Normality

2. Homogeneity of Covariance Matrices

3. Independence of Observations

Conclusion

Output Summary:

Interpretation:

PROBLEM 2:

(a) Conducting the MANOVA F-test using Wilks’ Lambda

Discussion

(b) Check (informal exploratory checks are fine) to see whether the assumptions of your test are met. Do you believe your inference is valid?

Checking Multivariate Normality

Checking Multicollinearity

Summary of Assumptions Checks for MANOVA with Multicollinearity and Outliers Discussion

Conclusion on Inference Validity:

(c) Examine the sample mean vectors for each group. Informally comment on the differences among the groups in terms of the specific variables.

Summary

PROBLEM 3:

a) Construct a one-way MANOVA of the Egyptian skull data. Use a = .05.

Summary Interpretation with Variable Names:

Practical Implications:

b) Construct 95 %.simultaneous confidence intervals to determine which mean components differ among the populations represented by the three time periods.

Simultaneous Confidence Intervals for Each Variable:

Interpretation:

c) Are the usual MANOVA assumptions realistic for these data? Explain.

Checking Multivariate Normality

Checking Homogeneity of Variance-Covariance Matrices

Visualizing Pairwise Relationships for Linearity

Checking Correlation Matrix

Summary of MANOVA Assumptions Based on Output above

Conclusion