library(knitr)
library(formatR)
knitr::opts_chunk$set(tidy.opts = list(width.cutoff = 60), tidy = TRUE)
Problem Statement: I hope to explore the impact of educational spending on the standardized reading test scores of students in California schools. I would particularly like to explore this effect between schools with similar incomes. I want to know if schools with more per-pupil spending get better results than schools with less, even when the income of the districts is within the same range.
In this section, I import the data from github and load a library I will use later.
csvurl = "https://raw.githubusercontent.com/Marley-Myrianthopoulos/Grad-School-HW-2/main/CASchools.csv"
alltestscores <- read.csv(url(csvurl))
library(ExcelFunctionsR)
In this section, I perform the most basic analysis of this question, a scatterplot of each school’s per-student spending and average reading test score. Overlaid on this plot is a line of best fit found through linear regression. The line of best fit has a positive slope, implying that higher per-student spending is correlated with a higher average test score. However, the correlation coefficient is around 0.13, suggesting a very weak correlation.
coef <- cor(alltestscores$expenditure, alltestscores$read, method = "spearman")
plot(alltestscores$expenditure, alltestscores$read, main = "Per-Pupil School Expenditures vs. Reading Test Scores",
sub = CONCAT("Correlation Coefficient: r = ", coef), xlab = "Expenditure per Student",
ylab = "Average Reading Test Score")
abline(lm(alltestscores$read ~ alltestscores$expenditure))
In this section, I classify schools as being “High Spending” or “Low Spending” if they are in the 4th or 1st quartile in expenditures, respectively. I similarly classify schools in the 4th and 1st quartile of reading test scores as “High Reading” and “Low Reading,” respectively. I then count the number of schools that are in each combination of high/low spending and high/low reading scores and display this information in a bar plot.
I observe that among low-spending schools there are more low-performing schools than high-performing schools, while among high-spending schools there are quite a few more high-performing schools than low-performing schools. This reinforces the previous conclusion that higher spending is correlated with better reading standardized test results.
histotestscores <- subset(alltestscores, select = c("expenditure",
"read"))
histospendingq1 <- quantile(histotestscores[["expenditure"]],
probs = 0.25)
histospendingq3 <- quantile(histotestscores[["expenditure"]],
probs = 0.75)
historeadq1 <- quantile(histotestscores[["read"]], probs = 0.25)
historeadq3 <- quantile(histotestscores[["read"]], probs = 0.75)
suppressWarnings(histotestscores$expenditure <- replace(histotestscores$expenditure,
as.numeric(histotestscores$expenditure) < histospendingq1,
"Q1"))
suppressWarnings(histotestscores$expenditure <- replace(histotestscores$expenditure,
as.numeric(histotestscores$expenditure) > histospendingq3,
"Q4"))
suppressWarnings(histotestscores$read <- replace(histotestscores$read,
as.numeric(histotestscores$read) < historeadq1, "Q1"))
suppressWarnings(histotestscores$read <- replace(histotestscores$read,
as.numeric(histotestscores$read) > historeadq3, "Q4"))
LexpLread <- suppressWarnings(COUNTIFS(histotestscores$expenditure,
"Q1", histotestscores$read, "Q1"))
LexpHread <- suppressWarnings(COUNTIFS(histotestscores$expenditure,
"Q1", histotestscores$read, "Q4"))
HexpLread <- suppressWarnings(COUNTIFS(histotestscores$expenditure,
"Q4", histotestscores$read, "Q1"))
HexpHread <- suppressWarnings(COUNTIFS(histotestscores$expenditure,
"Q4", histotestscores$read, "Q4"))
bardata <- c(LexpLread, LexpHread, HexpLread, HexpHread)
barplot(bardata, main = "Effect of Spending on Reading Test Scores",
ylab = "Number of Schools", names.arg = c("L Spd L Rd", "L Spd H Rd",
"H Spd L Rd", "H Spd H Rd"))
In this section, I create box plots of the reading test results in “High Spending” schools (schools in the 4th quartile of per-student expenditures) and “Low Spending” schools (schools in the 1st quartile of per-student expenditures). Consistent with the previous observations, the High Spending schools have better results at each point measured by the box plot (minimum, first quartile, median, third quartile, maximum) than the Low Spending schools.
highspending <- subset(alltestscores, expenditure >= quantile(alltestscores[["expenditure"]],
probs = 0.75), select = c("district", "school", "county",
"expenditure", "income", "read"))
lowspending <- subset(alltestscores, expenditure <= quantile(alltestscores[["expenditure"]],
probs = 0.25), select = c("district", "school", "county",
"expenditure", "income", "read"))
spendingcomparisonallread <- data.frame(A = lowspending[["read"]],
B = highspending[["read"]])
colnames(spendingcomparisonallread) <- c("Low Spending", "High Spending")
boxplot(spendingcomparisonallread, horizontal = TRUE, range = 0,
main = "Reading Test Results in High Spending vs Low Spending Schools")
In this section, I create a separate data frame for each income quartile. My goal is to use these different data frames to compare the results from higher-spending and lower-spending schools with similar income levels to see if the effect of higher spending changes based on the income of the district.
firstquartileincome <- quantile(alltestscores[["income"]], probs = 0.25)
medianincome <- median(alltestscores[["income"]])
thirdquartileincome <- quantile(alltestscores[["income"]], probs = 0.75)
suppressWarnings(alltestscores$income <- replace(alltestscores$income,
as.numeric(alltestscores$income) > thirdquartileincome, "Q4"))
suppressWarnings(alltestscores$income <- replace(alltestscores$income,
as.numeric(alltestscores$income) <= thirdquartileincome &
as.numeric(alltestscores$income) > medianincome, "Q3"))
suppressWarnings(alltestscores$income <- replace(alltestscores$income,
as.numeric(alltestscores$income) <= medianincome & as.numeric(alltestscores$income) >
firstquartileincome, "Q2"))
suppressWarnings(alltestscores$income <- replace(alltestscores$income,
as.numeric(alltestscores$income) <= firstquartileincome,
"Q1"))
q4incomesub <- subset(alltestscores, income == "Q4", select = c("district",
"school", "county", "expenditure", "income", "read"))
q3incomesub <- subset(alltestscores, income == "Q3", select = c("district",
"school", "county", "expenditure", "income", "read"))
q2incomesub <- subset(alltestscores, income == "Q2", select = c("district",
"school", "county", "expenditure", "income", "read"))
q1incomesub <- subset(alltestscores, income == "Q1", select = c("district",
"school", "county", "expenditure", "income", "read"))
In this section, I perform the same box plot analysis as before but limit the analysis to schools with average incomes in the highest quartile from the original data set. There is a huge difference in results, with the first quartile result in high-spending schools being higher than the third quartile result in low-spending schools.
highspendingq4 <- subset(q4incomesub, expenditure >= quantile(q4incomesub[["expenditure"]],
probs = 0.75), select = c("district", "school", "county",
"expenditure", "income", "read"))
lowspendingq4 <- subset(q4incomesub, expenditure <= quantile(q4incomesub[["expenditure"]],
probs = 0.25), select = c("district", "school", "county",
"expenditure", "income", "read"))
spendingcomparisonq4read <- data.frame(A = lowspendingq4[["read"]],
B = highspendingq4[["read"]])
colnames(spendingcomparisonq4read) <- c("Low Spending", "High Spending")
boxplot(spendingcomparisonq4read, horizontal = TRUE, range = 0,
main = "Reading Test Results - 4th Income Quartile")
In this section, I perform the same box plot analysis as before but limit the analysis to schools with average incomes in the 3rd quartile from the original data set. Intriguingly, the high-spending schools get better results from their middle students (first quartile, third quartile, and median are all higher), but worse results at the extremes.
highspendingq3 <- subset(q3incomesub, expenditure >= quantile(q3incomesub[["expenditure"]],
probs = 0.75), select = c("district", "school", "county",
"expenditure", "income", "read"))
lowspendingq3 <- subset(q3incomesub, expenditure <= quantile(q3incomesub[["expenditure"]],
probs = 0.25), select = c("district", "school", "county",
"expenditure", "income", "read"))
spendingcomparisonq3read <- data.frame(A = lowspendingq3[["read"]],
B = highspendingq3[["read"]])
colnames(spendingcomparisonq3read) <- c("Low Spending", "High Spending")
boxplot(spendingcomparisonq3read, horizontal = TRUE, range = 0,
main = "Reading Test Results - 3rd Income Quartile")
In this section, I perform the same box plot analysis as before but limit the analysis to schools with average incomes in the 2nd quartile from the original data set. These data sets are very similar, although the high-spending schools seem to have a higher ceiling for student performance
highspendingq2 <- subset(q2incomesub, expenditure >= quantile(q2incomesub[["expenditure"]],
probs = 0.75), select = c("district", "school", "county",
"expenditure", "income", "read"))
lowspendingq2 <- subset(q2incomesub, expenditure <= quantile(q2incomesub[["expenditure"]],
probs = 0.25), select = c("district", "school", "county",
"expenditure", "income", "read"))
spendingcomparisonq2read <- data.frame(A = lowspendingq2[["read"]],
B = highspendingq2[["read"]])
colnames(spendingcomparisonq2read) <- c("Low Spending", "High Spending")
boxplot(spendingcomparisonq2read, horizontal = TRUE, range = 0,
main = "Reading Test Results - 2nd Income Quartile")
In this section, I perform the same box plot analysis as before but limit the analysis to schools with average incomes in the 1st quartile from the original data set. The low-spending schools actually outperform the high-spending schools in this income quartile.
highspendingq1 <- subset(q1incomesub, expenditure >= quantile(q1incomesub[["expenditure"]],
probs = 0.75), select = c("district", "school", "county",
"expenditure", "income", "read"))
lowspendingq1 <- subset(q1incomesub, expenditure <= quantile(q1incomesub[["expenditure"]],
probs = 0.25), select = c("district", "school", "county",
"expenditure", "income", "read"))
spendingcomparisonq1read <- data.frame(A = lowspendingq1[["read"]],
B = highspendingq1[["read"]])
colnames(spendingcomparisonq1read) <- c("Low Spending", "High Spending")
boxplot(spendingcomparisonq1read, horizontal = TRUE, range = 0,
main = "Reading Test Results - 1st Income Quartile")
In this final section, I use what I have observed (the relationship between per-pupil expenditures and standardized test results appears much stronger in higher income schools) to run two more linear regressions, one on only the 4th income quartile schools and one on onle the 1st income quartile schools. The correlation
coefq4 <- round(cor(q4incomesub$expenditure, q4incomesub$read,
method = "spearman"), 2)
coefq1 <- round(cor(q1incomesub$expenditure, q1incomesub$read,
method = "spearman"), 2)
print(paste0("In the highest income schools, the correlation coefficient between per-pupil expenditure and reading test score is about r = ",
coefq4, " , while in the lowest income schools, the correlation coefficient is about r = ",
coefq1, "."))
## [1] "In the highest income schools, the correlation coefficient between per-pupil expenditure and reading test score is about r = 0.46 , while in the lowest income schools, the correlation coefficient is about r = -0.11."
CONCLUSIONS: Higher per-pupil spending is correlated with increased student achievement on standardized reading tests. However, the size of this effect varies considerably based on the average income of the district in which the school is located. For schools in districts that are in the 4th income quartile, there is a moderate positive correlation between per-pupil expenditures and student performance, while for schools in districts that are in the 1st income quartile, there is a very weak negative correlation. Further analysis would be needed to determine the reason for this difference in the effect of higher educational spending.