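# Load the dataset (read.csv() is base R, so the readxl package is not required)
# stringsAsFactors = TRUE keeps STATE and REGION as factors, which summary() and plot() rely on below (R >= 4.0 defaults to FALSE)
edu_data <- read.csv(file = "C:\\Users\\C21Itzel.ChanTopete\\Documents\\2019-2020\\SPRING SEMESTER\\Math 356\\Annova Lab\\states_edu(1).csv",
header = TRUE, stringsAsFactors = TRUE)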
The dataset has 1280 observations and 26 variables.
dim(edu_data) ## The first number in the output is the number of observations in the dataset, and the second number is the number of variables.
## [1] 1280 26
The dataset aggregates all state-level sources into one CSV. Each observation represents one state in a specific year and includes data on students enrolled across the state and financial information related to education expenses, covering 1992 to 2016.
The scatter plot shows a weak positive linear relationship. This interpretation is based on the number of outliers, which suggests that the data are widely dispersed. However, the plot shows a cluster in the upper right corner, which might suggest that states with a high average math score tend to have an average reading score that is not low (above about 250), while the reverse does not hold as clearly. Overall, math scores are higher than reading scores.
plot(edu_data$AVG_MATH_8_SCORE, edu_data$AVG_READING_8_SCORE, main= "Reading Scores vs. Math Scores for 8th Grade Students", xlab= "Average Math Score", ylab= "Average Reading Score") # Scatter plot of average reading score against average math score
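The strength of the linear relationship described above can also be summarized with a correlation coefficient. The following is only a sketch: `pairwise_cor` is a hypothetical helper name, not part of the lab code, and it drops rows where either score is missing.

```r
# Hypothetical helper: Pearson correlation between two columns of a data frame,
# using only the rows where both values are present.
pairwise_cor <- function(data, x, y) {
  cor(data[[x]], data[[y]], use = "complete.obs")
}

# Possible usage, assuming edu_data has been loaded as above:
# pairwise_cor(edu_data, "AVG_MATH_8_SCORE", "AVG_READING_8_SCORE")
```

A value near 0 would support the "weak" reading of the plot, while a value near 1 would point to a stronger linear relationship than the outliers suggest.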
The scatter plot suggests a strong positive linear relationship. In this case, a state with a high average score in fourth grade most likely also has a high average score in eighth grade. The points are uniformly spread and there are no clear outliers. However, there is a possible cluster in the range 220-250 for fourth grade students and 270-290 for eighth grade students. Overall, eighth grade scores are higher than fourth grade scores for the math portion.
# Create scatter plot for 8th grade Math scores (x-axis) vs. 4th grade Math scores (y-axis)
plot(edu_data$AVG_MATH_8_SCORE, edu_data$AVG_MATH_4_SCORE, main= "8th Grade Math Scores vs 4th Grade Math Scores", xlab= "Average Score for Eighth Grade Students", ylab= "Average Score for Fourth Grade Students") # Scatter plot of 4th grade scores against 8th grade scores
Instruction expenditure was set as the independent variable and the math scores as the response variable. In this case, there is a weak linear relationship, and the data might be better represented by a logarithmic relationship: math scores increase sharply with a small increase in instruction expenditure at the low end, and as expenditure keeps growing the scores stay high but stop increasing, so the values converge. However, the residuals are larger in the middle part of the plot.
# Create scatter plot for 8th grade Math scores (y-axis) vs. Instruction expenditure (x-axis)
plot(edu_data$INSTRUCTION_EXPENDITURE, edu_data$AVG_MATH_8_SCORE, main= "8th Grade Math Scores vs. Instruction Expenditure", ylab= "Average Score for Eighth Grade Students", xlab= "Instruction Expenditure ($)") # Scatter plot of math scores against instruction expenditure
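One rough way to check whether a logarithmic relationship fits better is to compare the linear correlation on the raw expenditure scale with the correlation after taking the logarithm of expenditure. This is an illustrative sketch only: `cor_raw_vs_log` is a hypothetical helper name, and it assumes every expenditure value is positive.

```r
# Hypothetical helper: correlation of y with x and with log10(x),
# ignoring rows where either value is missing.
cor_raw_vs_log <- function(x, y) {
  c(raw   = cor(x, y, use = "complete.obs"),
    log10 = cor(log10(x), y, use = "complete.obs"))
}

# Possible usage, assuming edu_data has been loaded as above:
# cor_raw_vs_log(edu_data$INSTRUCTION_EXPENDITURE, edu_data$AVG_MATH_8_SCORE)
```

If the `log10` correlation is noticeably larger than the `raw` one, that would be consistent with scores rising quickly at low expenditure and leveling off afterwards.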
summary(edu_data$REGION) # The summary is only concerned with the data contained under the Region Variable.
##   MIDWEST NORTHEAST SOUTHEAST SOUTHWEST      WEST
##       300       225       380       100       275
According to the summary shown above, the region with the most observations is the Southeast, while the region with the fewest observations is the Southwest. Based on the plot, the Southeast region is the one with the most states (15), while the Southwest region is the one with the fewest states.
# Use a graphical approach to find the regions with the fewest and the most states.
# Create subsets for each region to identify the states that belong to it
Mid <- subset(edu_data, REGION == "MIDWEST")
NE <- subset(edu_data, REGION == "NORTHEAST")
SE <- subset(edu_data, REGION == "SOUTHEAST")
SW <- subset(edu_data, REGION == "SOUTHWEST")
We <- subset(edu_data, REGION == "WEST")
# Set 3x2 plot window with increased upper margin for title
par(mfrow = c(3,2), oma= c(0,0,2,0))
# Plot the number of observations per state (plot() on a factor draws a bar chart)
plot(Mid$STATE, xlab= "States for Midwest", ylab= "Frequency")
plot(NE$STATE, xlab= "States for Northeast", ylab= "Frequency")
plot(SE$STATE, xlab= "States for Southeast", ylab= "Frequency")
plot(SW$STATE, xlab= "States for Southwest", ylab= "Frequency")
plot(We$STATE, xlab= "States for West", ylab= "Frequency")
#Add overall title
title(" Histograms of states per Region", outer = TRUE, cex=1.5)
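The number of states in each region can also be counted directly instead of being read off the bar charts. The sketch below is one possible way to do it; `states_per_region` is a hypothetical helper name that is not part of the lab code.

```r
# Hypothetical helper: number of distinct states observed in each region.
states_per_region <- function(state, region) {
  tapply(state, region, function(s) length(unique(s)))
}

# Possible usage, assuming edu_data has been loaded as above:
# states_per_region(edu_data$STATE, edu_data$REGION)
```

Because each state appears once per year, counting distinct states separates "more states" from "more observations", which the summary of REGION alone cannot do.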
# Baseline boxplot
boxplot(AVG_MATH_8_SCORE ~ REGION,
data = edu_data,
main = "Eighth Grade Math Scores by Region",
ylab = "Math Scores",
xlab = "Region",
col = "lightblue")
# Calculate the means per region (the formula interface drops rows with missing scores by default)
means <- aggregate(AVG_MATH_8_SCORE ~ REGION, data = edu_data, FUN = mean)
# Plot means as points on boxplot
points(means, col="black", pch=18)
# Plot horizontal line for overall mean
abline(h = mean(edu_data$AVG_MATH_8_SCORE, na.rm=TRUE),
col = "red",
lwd = 2, # change line width
lty = 2) # change line type (dashed line)
# Add legend
legend("topleft",
legend = c("Within-region mean", "Overall mean"),
pch = c(18, NA),
lty = c(NA, 2),
col = c("black", "red"))
The boxplot suggests that the math scores for eighth grade may differ across geographic regions. The mean scores differ mostly between the Southeast region and the Midwest and Northeast regions. The boxplot also suggests that the data within the Southeast region are more spread out than in the other regions, with more visible outliers.
Let \(\mu_{Mid}\), \(\mu_{NE}\), \(\mu_{SE}\), \(\mu_{SW}\), and \(\mu_{We}\) be the average math scores for eighth grade students from 1992-2016 for the Midwest, Northeast, Southeast, Southwest, and West regions, respectively.
The hypotheses are: \[H_0: ~ \mu_{Mid}= \mu_{NE}= \mu_{SE}=\mu_{SW}= \mu_{We}\] \[H_a: \text{not } H_0\] Please see above for the data visualization.
## Draw Normal Q-Q plots
# Set 3x2 plot window with increased upper margin for title
par(mfrow = c(3,2), oma = c(0,0,2,0))
# Plot q-q plots using the subsets already created for histograms
# Midwest
qqnorm(Mid$AVG_MATH_8_SCORE,
main = "",
ylab = "Midwest Math Scores")
qqline(Mid$AVG_MATH_8_SCORE, col = "red")
# Northeast
qqnorm(NE$AVG_MATH_8_SCORE,
main = "",
ylab = "Northeast Math Scores")
qqline(NE$AVG_MATH_8_SCORE, col = "red")
# Southeast
qqnorm(SE$AVG_MATH_8_SCORE,
main = "",
ylab = "Southeast Math Scores")
qqline(SE$AVG_MATH_8_SCORE, col = "red")
# Southwest
qqnorm(SW$AVG_MATH_8_SCORE,
main = "",
ylab = "Southwest Math Scores")
qqline(SW$AVG_MATH_8_SCORE, col = "red")
# West
qqnorm(We$AVG_MATH_8_SCORE,
main = "",
ylab = "West Math Scores")
qqline(We$AVG_MATH_8_SCORE, col = "red")
# Add overall title
title("Normal QQ Plots of Eighth Grade Math Scores by Region", outer = TRUE, cex = 1.5)
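Besides normality, ANOVA also relies on the regions having roughly equal spread. A common textbook rule of thumb, used here as an assumption rather than a formal test, is that the largest group standard deviation should be less than about twice the smallest. A minimal sketch follows; `sd_ratio_by_group` is a hypothetical helper name.

```r
# Hypothetical helper: standard deviation per group and the ratio of the
# largest to the smallest, ignoring missing scores.
sd_ratio_by_group <- function(score, group) {
  sds <- tapply(score, group, sd, na.rm = TRUE)
  list(sds = sds, ratio = max(sds) / min(sds))
}

# Possible usage, assuming edu_data has been loaded as above:
# sd_ratio_by_group(edu_data$AVG_MATH_8_SCORE, edu_data$REGION)
```

A ratio well above 2 would be a reason to treat the ANOVA results with more caution.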
# Build ANOVA model (aov() drops rows with missing values by default, so no na.rm argument is needed)
model.act <- aov(edu_data$AVG_MATH_8_SCORE ~ edu_data$REGION, data = edu_data)
# look at ANOVA results
summary.aov(model.act)
##                  Df Sum Sq Mean Sq F value Pr(>F)
## edu_data$REGION   4  16152    4038   54.22 <2e-16 ***
## Residuals       476  35451      74
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 799 observations deleted due to missingness
From the ANOVA output we can see that the \(p\text{-value} < 2 \times 10^{-16}\).
Since the \(p\text{-value} < 2 \times 10^{-16} < 0.05 = \alpha\), there is statistically significant evidence that at least one pair of mean math scores for eighth grade students differs between regions. Thus, we have significant evidence to reject \[H_0: ~ \mu_{Mid}= \mu_{NE}= \mu_{SE}=\mu_{SW}= \mu_{We}\] This answers the first research question, “Do 8th grade math scores vary across geographic regions of the United States?” The answer is yes, they do vary.
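The F value in the ANOVA table can be reproduced by hand from the sums of squares and degrees of freedom. The sketch below uses the rounded figures printed in the output, so it matches the table only approximately.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_region <- 16152; df_region <- 4
ss_resid  <- 35451; df_resid  <- 476

# Mean squares are sums of squares divided by their degrees of freedom
ms_region <- ss_region / df_region
ms_resid  <- ss_resid / df_resid

# F statistic and its upper-tail p-value under H_0
f_stat  <- ms_region / ms_resid
p_value <- pf(f_stat, df_region, df_resid, lower.tail = FALSE)

round(f_stat, 1)
p_value < 0.05
```

The between-region mean square is far larger than the residual mean square, which is what drives the large F value and the very small p-value.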
# Pairwise comparison test using Bonferroni approach to maintain an overall Type 1 error alpha = 0.05
pairwise.t.test(x = edu_data$AVG_MATH_8_SCORE,
g = edu_data$REGION,
p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: edu_data$AVG_MATH_8_SCORE and edu_data$REGION
##
##           MIDWEST NORTHEAST SOUTHEAST SOUTHWEST
## NORTHEAST 1.0000  -         -         -
## SOUTHEAST < 2e-16 < 2e-16   -         -
## SOUTHWEST 5.1e-09 8.1e-09   0.4591    -
## WEST      0.0025  0.0027    1.0e-13   0.0040
##
## P value adjustment method: bonferroni
We conclude from the data that there is a statistically significant difference in average 8th grade math scores between the following pairs of regions: Southeast and Midwest, Northeast and Southeast, Midwest and Southwest, Northeast and Southwest, West and Southeast, West and Midwest, Northeast and West, and Southwest and West. The Midwest and Northeast pair (adjusted p-value 1.0000) and the Southeast and Southwest pair (adjusted p-value 0.4591) do not differ significantly. This answers the second research question, “Which regions?”
First of all, we have to understand that these scores were recorded from 1992-2016, a period in which politics, culture, society, and public spending probably differed across the country. In some regions math may have been valued more than in others due to social or politico-economic factors, which could cause differences in means and variances across the regions. If we go back to the plot of “8th Grade Math Scores vs. Instruction Expenditure,” we can see that there is no clear linear relationship, and we do not know whether inflation was taken into account when building the dataset. Therefore, we cannot be sure whether instruction expenditure is one of the reasons for the variation. We would also need to look at education expenditure by region and by year or decade to notice any possible influence that might have caused the variance not to be constant among regions.
I used the lab file provided by Capt. Forbes to understand the analysis and the coding. In addition, I used the link provided (https://www.statmethods.net/input/missingdata.html) to treat the missing data in the dataset for the boxplot and the means.
I discussed in the ANOVA Lab channel on Teams on 5/3/2020 whether it was necessary to stack our data and, if so, why.
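For context on the Bonferroni adjustment: with five regions there are ten pairwise comparisons, and each unadjusted p-value is multiplied by ten and capped at 1. The sketch below illustrates this with made-up unadjusted p-values; they are hypothetical and are not taken from the lab data.

```r
n_regions <- 5
n_pairs <- choose(n_regions, 2)   # number of pairwise comparisons among 5 regions

# Hypothetical unadjusted p-values, for illustration only
raw_p <- c(0.001, 0.04, 0.2)

# Bonferroni: multiply each p-value by the number of comparisons, capped at 1
adj_p <- p.adjust(raw_p, method = "bonferroni", n = n_pairs)
adj_p
```

This is why some adjusted p-values in the table are exactly 1.0000: the multiplied value exceeded 1 and was capped.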