ANOVA LAB

Exercises to Answer Research Questions

1. How many observations does the dataset contain? How many variables?

The dataset has 1280 observations and 26 variables.

dim(edu_data) # The first number in the output is the number of observations in the dataset, and the second number is the number of variables.

## [1] 1280   26

2. What does each observation represent?

The dataset contains aggregates from all state-level sources in one CSV. Each observation represents the data for a state in a specific year. Each observation includes the state’s data related to students enrolled across the state, and financial information related to education expenses from 1992 to 2016.

3. Create a scatterplot of AVG_MATH_8_SCORE vs. AVG_READING_8_SCORE and interpret the plot

The scatterplot shows a positive linear relationship that appears weak because the points are widely dispersed and include a number of outliers. However, there is a cluster in the upper-right corner, which suggests that states with a high average math score tend to have a reading score that is not low (on average above 250), although the reverse does not necessarily hold. Overall, math scores are higher than reading scores.

plot(edu_data$AVG_MATH_8_SCORE, edu_data$AVG_READING_8_SCORE, main = "Reading Scores vs. Math Scores for 8th Grade Students", xlab = "Average Math Score", ylab = "Average Reading Score") # Scatterplot of 8th grade math scores (x-axis) vs. 8th grade reading scores (y-axis)
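
The strength of the linear relationship read off the scatterplot can also be quantified with a correlation coefficient. The following is a minimal sketch, assuming a small toy data frame (`toy_scores`, with invented values) in place of `edu_data`. The argument `use = "complete.obs"` drops rows where either score is missing, which matters here because many rows of the real dataset have no recorded scores.

```r
# Toy stand-in for edu_data (values are invented for illustration only)
toy_scores <- data.frame(
  AVG_MATH_8_SCORE    = c(270, 281, NA, 265, 290, 276),
  AVG_READING_8_SCORE = c(258, 266, 261, NA, 270, 262)
)

# Pearson correlation using only rows where both scores are present
r <- cor(toy_scores$AVG_MATH_8_SCORE,
         toy_scores$AVG_READING_8_SCORE,
         use = "complete.obs")
print(r)
```

A value near 1 would indicate a strong positive linear relationship, and a value near 0 a weak one.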

4. Create a scatterplot of AVG_MATH_8_SCORE vs. AVG_MATH_4_SCORE and interpret the plot.

The scatterplot suggests a strong positive linear relationship: states whose fourth grade students earned a high average score most likely also had a high average score among eighth grade students. The points are spread evenly and there are no clear outliers. However, there is a possible cluster in the range 220-250 for fourth grade students and 270-290 for eighth grade students. Overall, eighth grade math scores are higher than fourth grade math scores.

# Create scatter plot for 8th grade Math scores (x-axis) vs. 4th grade Math scores (y-axis)
plot(edu_data$AVG_MATH_8_SCORE, edu_data$AVG_MATH_4_SCORE, main = "8th Grade Math Scores vs. 4th Grade Math Scores", xlab = "Average Score for Eighth Grade Students", ylab = "Average Score for Fourth Grade Students")

5. Create a scatterplot of AVG_MATH_8_SCORE vs. INSTRUCTION_EXPENDITURE and interpret the plot.

Instruction expenditure was set as the independent variable, with the math scores as the response variable. The linear relationship is weak, and the data might be better represented by a logarithmic relationship: math scores increase sharply with a small increase in instruction expenditure at the low end, and as expenditure keeps increasing the scores level off and the values start converging. However, the residuals are larger in the middle part of the plot. In short, the plot suggests that a small increase in instruction expenditure is associated with higher math scores at first, and that as expenditure increases further the scores stay high but stop increasing.

# Create scatter plot for 8th grade Math scores (y-axis) vs. Instruction expenditure (x-axis)

plot(edu_data$INSTRUCTION_EXPENDITURE, edu_data$AVG_MATH_8_SCORE, main = "8th Grade Math Scores vs. Instruction Expenditure", ylab = "Average Score for Eighth Grade Students", xlab = "Instruction Expenditure ($)")
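
One way to check whether a logarithmic relationship describes the data better than a straight line is to fit both models and compare how much variation each explains. This is only a sketch under stated assumptions: `toy` is an invented data frame whose scores level off as spending grows, standing in for `edu_data`, and R-squared is used as a rough comparison rather than a formal test.

```r
# Toy stand-in for edu_data (invented values that level off as spending grows)
toy <- data.frame(
  INSTRUCTION_EXPENDITURE = c(5e5, 1e6, 2e6, 4e6, 8e6, 1.6e7, 3.2e7),
  AVG_MATH_8_SCORE        = c(262, 270, 276, 280, 283, 285, 286)
)

# Fit a straight line and a logarithmic curve to the same data
fit_linear <- lm(AVG_MATH_8_SCORE ~ INSTRUCTION_EXPENDITURE, data = toy)
fit_log    <- lm(AVG_MATH_8_SCORE ~ log(INSTRUCTION_EXPENDITURE), data = toy)

# Compare the share of variation explained by each model
r2 <- c(linear = summary(fit_linear)$r.squared,
        log    = summary(fit_log)$r.squared)
print(r2)
```

On the real data, `lm()` drops rows with missing scores or expenditure by default, so the same two calls could be run on `edu_data` directly.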

6. How many regions are represented in the data?

According to the summary, there are 5 regions represented in the dataset: Midwest, Northeast, Southeast, Southwest, and West.

summary(edu_data$REGION) # The summary is only concerned with the data contained under the Region Variable.

##   MIDWEST NORTHEAST SOUTHEAST SOUTHWEST      WEST 
##       300       225       380       100       275

7. Which region has the most states? The fewest?

According to the summary shown above, the region with the most observations is the Southeast, while the region with the fewest observations is the Southwest. Based on the plot, the Southeast region is also the one with the most states (15), while the Southwest region is the one with the fewest states.

# Use of a graphical solution to find the regions with the least and most states.

# Create Subsets for each region to identify states that are aligned with each region
Mid <- subset(edu_data, edu_data$REGION == "MIDWEST")
NE <-  subset(edu_data, edu_data$REGION == "NORTHEAST")
SE <-  subset(edu_data, edu_data$REGION == "SOUTHEAST")
SW <-  subset(edu_data, edu_data$REGION == "SOUTHWEST")
We<-  subset(edu_data, edu_data$REGION == "WEST")

# Set 3x2 plot window with increased upper margin for title
par(mfrow = c(3,2), oma= c(0,0,2,0))
# Plot state counts per region (assuming STATE is a factor, plot() draws one bar per state)
plot(Mid$STATE, xlab= "States for Midwest", ylab= "Frequency")
plot(NE$STATE, xlab= "States for Northeast", ylab= "Frequency")
plot(SE$STATE, xlab= "States for Southeast", ylab= "Frequency")
plot(SW$STATE, xlab= "States for Southwest", ylab= "Frequency")
plot(We$STATE, xlab= "States for West", ylab= "Frequency")

#Add overall title
title(" Histograms of states per Region", outer = TRUE, cex=1.5)
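
The number of states per region can also be counted directly instead of being read off the plots. The following is a minimal sketch, assuming a toy data frame with invented state labels (`A` to `F`) and one row per state-year, as in `edu_data`; the column names `STATE` and `REGION` match the ones used above.

```r
# Toy stand-in for edu_data (invented rows; one row per state-year)
toy <- data.frame(
  STATE  = c("A", "A", "B", "C", "D", "D", "E", "F"),
  REGION = c("MIDWEST", "MIDWEST", "MIDWEST", "MIDWEST",
             "SOUTHWEST", "SOUTHWEST", "SOUTHWEST", "NORTHEAST")
)

# Count distinct states within each region
states_per_region <- tapply(toy$STATE, toy$REGION,
                            function(s) length(unique(s)))
print(states_per_region)

# Regions with the most and the fewest states
most   <- names(states_per_region)[which.max(states_per_region)]
fewest <- names(states_per_region)[which.min(states_per_region)]
```

Counting distinct states avoids confusing the number of observations in a region with the number of states in it.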

8. Create a boxplot of AVG_MATH_8_SCORE vs. REGION. Add points to show the within-region means and a horizontal line to show the overall mean.

# Baseline boxplot
boxplot(AVG_MATH_8_SCORE ~ REGION,
        data = edu_data,
        main = "Eighth Grade Math Scores by Region",
        ylab = "Math Scores",
        xlab = "Region",
        col = "light blue")

# Calculate the means per region
means <- aggregate(AVG_MATH_8_SCORE ~ REGION, data = edu_data, FUN = mean, na.rm = TRUE)

# Plot means as points on boxplot
points(means, col="black", pch=18)

# Plot horizontal line for overall mean
abline(h = mean(edu_data$AVG_MATH_8_SCORE, na.rm=TRUE), 
       col = "red", 
       lwd = 2,  # change line width
       lty = 2)  # change line type (dashed line)

# Add legend
legend("topleft",
       legend = c("Within-region mean", "Overall mean"),
       pch = c(18, NA),
       lty = c(NA, 2),
       col = c("black", "red"))

9. Interpret the boxplot. What does the boxplot suggest regarding the research questions?

The boxplot suggests that eighth grade math scores may differ across geographic regions. The largest differences in means are between the Southeast region and the Midwest and Northeast regions. The boxplot also suggests that the data within the Southeast region are more spread out than in the other regions, with more visible outliers.

10. Perform an analysis of variance that answers the first research question. Use the PCCC method.

Prepare:

Let \(\mu_{Mid}\), \(\mu_{NE}\), \(\mu_{SE}\), \(\mu_{SW}\), and \(\mu_{We}\) be the average math scores for eighth grade students from 1992-2016 for the Midwest, Northeast, Southeast, Southwest, and West regions, respectively.

List out hypotheses. \[H_0: \mu_{Mid} = \mu_{NE} = \mu_{SE} = \mu_{SW} = \mu_{We}\] \[H_a: \text{not } H_0\] Please see above for data visualization.

Check:

Independence:

1. Are the observations independent within and across groups? Based on the data description, there is no reason to believe that knowing about one observation would give us information about another observation; this holds both within groups and across groups. Therefore, we can assume independence.

2. Are the data within each group nearly normal? Please refer to the histograms in question 7, in which each column represents a different state. Based on the Normal Q-Q plots, the data within each region do not appear to be perfectly normal, but there is not sufficient reason to dismiss the idea that they are nearly normal.

## To set Normal Q-Q plots

# Set 3x2 plot window with increased upper margin for title
par(mfrow = c(3,2), oma = c(0,0,2,0))

# Plot q-q plots using the subsets already created for histograms

   # Midwest
   qqnorm(Mid$AVG_MATH_8_SCORE,
        main = "",
        ylab = "Midwest Math Scores")
   qqline(Mid$AVG_MATH_8_SCORE, col = "red")

   # Northeast
   qqnorm(NE$AVG_MATH_8_SCORE,
        main = "",
        ylab = "Northeast Math Scores")
   qqline(NE$AVG_MATH_8_SCORE, col = "red")

   # Southeast
   qqnorm(SE$AVG_MATH_8_SCORE,
        main = "",
        ylab = "Southeast Math Scores")
   qqline(SE$AVG_MATH_8_SCORE, col = "red")

   # Southwest
   qqnorm(SW$AVG_MATH_8_SCORE,
        main = "",
        ylab = "Southwest Math Scores")
   qqline(SW$AVG_MATH_8_SCORE, col = "red")
   
   # West
   qqnorm(We$AVG_MATH_8_SCORE,
        main = "",
        ylab = "West Math scores")
   qqline(We$AVG_MATH_8_SCORE, col = "red")  

# Add overall title
title("Normal QQ Plots of Eighth Grade Math Scores by Region", outer = TRUE, cex = 1.5)

3. Is the variability across groups about equal? Based on the boxplots, the IQRs all have a similar width, which supports treating the variability across groups as about equal. The boxplots do not give us enough information to conclude that the variance differs across groups.
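
The visual comparison of spread can be backed up with numbers by computing the standard deviation and IQR within each region. This is a minimal sketch, assuming a toy data frame with invented scores in place of `edu_data`. A commonly quoted rule of thumb treats a ratio of largest to smallest standard deviation below about 2 as acceptable for ANOVA, but that threshold is a convention, not part of this lab.

```r
# Toy stand-in for edu_data (invented values for illustration only)
toy <- data.frame(
  REGION = rep(c("MIDWEST", "SOUTHEAST"), each = 5),
  AVG_MATH_8_SCORE = c(280, 284, 286, 288, 292,
                       262, 268, 272, 278, NA)
)

# Standard deviation and IQR of scores within each region, ignoring NAs
sds  <- tapply(toy$AVG_MATH_8_SCORE, toy$REGION, sd,  na.rm = TRUE)
iqrs <- tapply(toy$AVG_MATH_8_SCORE, toy$REGION, IQR, na.rm = TRUE)

# Ratio of the largest to the smallest within-region standard deviation
sd_ratio <- max(sds) / min(sds)
print(sds)
print(iqrs)
print(sd_ratio)
```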

Calculate:

# Build ANOVA model (aov() has no na.rm argument; rows with missing values are dropped by the default na.action)
model.act <- aov(edu_data$AVG_MATH_8_SCORE ~ edu_data$REGION, data = edu_data)

# look at ANOVA results
summary.aov(model.act)

##                  Df Sum Sq Mean Sq F value Pr(>F)    
## edu_data$REGION   4  16152    4038   54.22 <2e-16 ***
## Residuals       476  35451      74                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 799 observations deleted due to missingness

From the ANOVA output we can see that \(p\text{-value} < 2 \times 10^{-16}\).
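
The F value in the table is the ratio of the two mean squares, and each mean square is a sum of squares divided by its degrees of freedom. As a quick check, this sketch recomputes those quantities from the rounded numbers printed in the ANOVA table; the variable names are illustrative only, and small differences from the printed values are due to rounding.

```r
# Values copied from the ANOVA table above (already rounded by summary())
ss_region <- 16152; df_region <- 4
ss_resid  <- 35451; df_resid  <- 476

# Mean square = sum of squares / degrees of freedom
ms_region <- ss_region / df_region
ms_resid  <- ss_resid / df_resid

# F statistic = between-group mean square / within-group mean square
f_stat <- ms_region / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_region, df_resid, lower.tail = FALSE)
print(c(ms_region = ms_region, ms_resid = ms_resid, f_stat = f_stat))
print(p_value)
```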

Conclude:

Since the \(p\text{-value} < 2 \times 10^{-16} < 0.05 = \alpha\), there is statistically significant evidence that at least one pair of regions differs in mean eighth grade math score. Thus, we have significant evidence to reject \[H_0: \mu_{Mid} = \mu_{NE} = \mu_{SE} = \mu_{SW} = \mu_{We}\] This answers the first research question, “Do 8th grade math scores vary across geographic regions of the United States?”: yes, they do vary.

11. Make pairwise comparisons, and answer the second research question. Use an overall significance level of \(\alpha_{overall}\ =\ 0.05\ \).

# Pairwise comparison test using Bonferroni approach to maintain an overall Type 1 error alpha = 0.05
pairwise.t.test(x = edu_data$AVG_MATH_8_SCORE, 
                g = edu_data$REGION, 
                p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  edu_data$AVG_MATH_8_SCORE and edu_data$REGION 
## 
##           MIDWEST NORTHEAST SOUTHEAST SOUTHWEST
## NORTHEAST 1.0000  -         -         -        
## SOUTHEAST < 2e-16 < 2e-16   -         -        
## SOUTHWEST 5.1e-09 8.1e-09   0.4591    -        
## WEST      0.0025  0.0027    1.0e-13   0.0040   
## 
## P value adjustment method: bonferroni

We conclude from the data that there are statistically significant differences in average 8th grade math scores between the Southeast and Midwest regions, Northeast and Southeast regions, Midwest and Southwest regions, Northeast and Southwest regions, West and Southeast regions, West and Midwest regions, Northeast and West regions, and Southwest and West regions. The Midwest and Northeast regions (adjusted p-value = 1.0000) and the Southeast and Southwest regions (adjusted p-value = 0.4591) do not differ significantly. This answers the second research question, “Which regions?”.
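
The Bonferroni approach keeps the overall Type 1 error at 0.05 by accounting for the number of pairwise comparisons. With 5 regions there are 10 pairs, so each adjusted p-value is the raw p-value multiplied by 10 and capped at 1. The sketch below illustrates this arithmetic; the raw p-values are hypothetical and are not taken from the lab output.

```r
# Number of pairwise comparisons among 5 regions
k <- 5
n_comparisons <- choose(k, 2)

# Equivalent per-comparison significance level
alpha_each <- 0.05 / n_comparisons

# Hypothetical raw p-values (invented for illustration only)
raw_p <- c(0.0004, 0.02, 0.3)

# Bonferroni adjustment: multiply by the number of comparisons, cap at 1
adj_p <- p.adjust(raw_p, method = "bonferroni", n = n_comparisons)
print(n_comparisons)
print(alpha_each)
print(adj_p)
```

Comparing an adjusted p-value with 0.05 is equivalent to comparing the raw p-value with `alpha_each`.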

12. Propose a possible explanation for your findings.

First of all, we have to keep in mind that these scores were recorded from 1992-2016, a period in which politics, culture, society, and public spending probably differed across the country. In some regions math may have been valued more than in others due to social or politico-economic factors, which could cause a difference in means and in variability across the regions. If we go back to the plot of “8th Grade Math Scores vs. Instruction Expenditure,” we can see that there is no clear linear relationship; we also do not know whether inflation was taken into account when the dataset was built. Therefore, we are not sure whether instruction expenditure is one of the reasons for the differences. We would also need to look at education expenditure by region and by year or decade to notice any possible influence that might have caused the variance not to be constant among regions.

Documentation Statement:

I used the lab file provided by Capt. Forbes to understand the analysis and the coding. In addition, I used the link provided (https://www.statmethods.net/input/missingdata.html) to treat the missing data in the dataset for the boxplot and the means.
I discussed in the ANOVA Lab channel on Teams on 5/3/2020 whether it was necessary to stack our data and, if so, why.