Final Project

1. Introduction

Throughout schooling in the United States, from kindergarten through high school, students take standardized tests that track educational attainment and allow comparisons across the nation. Examples include STAR testing and the SAT, along with several other tests administered throughout the school system. In global comparisons, students in the United States score lower than students in many other countries, and while local and federal governments have taken up this issue, it is consistently linked to funding.

Do financial constraints cause students to get worse grades? The objective of this project is to examine data from around the United States on schools' test scores and their percentage of economically disadvantaged students. Understanding the impact of wealth on schooling is crucial, especially for budget allocation and for choosing the method used to distribute funds to schools. I chose this question because I have seen the effects of a lack of money in education firsthand: my high school was in one of the lowest-funded districts on the Bay Area peninsula. Seeing underqualified teachers, or teachers who conducted themselves poorly, was very frustrating, as was seeing amazing teachers being vastly underpaid. That was the reality of my education, and I had to make the best of it. I know for a fact that I did not go to the poorest school in the region, let alone in the nation, so what students at poorer schools must be facing is unacceptable. It needs to be brought to light and addressed nationally.

In this project I will use data from the Educational Opportunity Project at Stanford University to analyze how schools' test scores vary with their percentage of economically disadvantaged students. The analysis takes the form of graphs built from the raw data, statistical tests, and various computations to further understand the data.

2. Data

The data come from the Educational Opportunity Project at Stanford University and cover schools in districts across the nation.

#First I have merged two of the data sets, one containing the percentage of economically disadvantaged students and the other containing the test scores. I then load the necessary packages.

setwd("~/Downloads/School Stuff/EC320/Final Project Stuff")

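# haven provides read_dta() for Stata files; it only needs to be loaded once
library(haven)
SEDA_cov_school_pool_v30 <- read_dta("SEDA_cov_school_pool_v30.dta")
seda_school_pool_cs_v30 <- read_dta("seda_school_pool_cs_v30.dta")

# With no `by` argument, merge() joins on every column name the two files share
total <- merge(SEDA_cov_school_pool_v30, seda_school_pool_cs_v30)

pacman::p_load(ggplot2, dplyr, tidyr, stargazer)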

The merged dataset contains 56 variables and 63,395 observations. The two variables we will be focusing on are perecd, the percent of economically disadvantaged students in each school, and mn_avg_ol, the average test score based on grade level for each school.
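The merge step above relies on the default behavior of merge(). The following is a minimal sketch of that behavior on toy data, not the SEDA files themselves; the column name school_id and all values are hypothetical stand-ins chosen for illustration.

```r
# Toy stand-ins for the two SEDA files; the column names are hypothetical.
covariates <- data.frame(
  school_id = c(1, 2, 3),
  perecd    = c(0.20, 0.55, 0.90)
)
scores <- data.frame(
  school_id = c(1, 2, 4),
  mn_avg_ol = c(0.8, 0.0, -0.5)
)

# With no `by` argument, merge() joins on all shared column names.
shared_keys <- intersect(names(covariates), names(scores))
print(shared_keys)

# Naming the key explicitly documents the join and guards against
# accidental matches on other shared columns.
merged <- merge(covariates, scores, by = "school_id")
print(merged)

# merge() keeps only matching rows by default (an inner join), so
# schools present in just one file are dropped.
print(nrow(merged))
```

Checking intersect(names(...)) on the real files before merging would show which columns the default join actually uses.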

The dataset contains many NAs, which will be identified and removed.

total1 <- drop_na(total)

perecd1 <- total1$perecd
mn_avg_ol1 <- total1$mn_avg_ol
perecd2 <- perecd1*100

drop_na() removes every row that has an NA in any column, producing the filtered data frame total1. This leaves 38,097 schools, the number of observations reported in the regressions below. perecd1 and mn_avg_ol1 are shorthand vectors for the percentage of economically disadvantaged students and the average test score in total1. perecd2 expresses the percentage of economically disadvantaged students as a whole-number percent (0 to 100) rather than a decimal (0 to 1), which makes the regression coefficients easier to interpret later.
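Because the filter looks at every column, a school can be dropped for a missing value in a variable the regression never uses. The sketch below contrasts the two approaches on toy data with base R's complete.cases(); the column other and all values are hypothetical, and this is an illustration rather than the filtering used in this report.

```r
# Toy data; `other` stands in for an unrelated column with missing values.
schools <- data.frame(
  perecd    = c(0.10, 0.40, NA, 0.80),
  mn_avg_ol = c(0.9, 0.2, 0.1, -0.6),
  other     = c(NA, 5, 7, 9)
)

# Dropping rows with an NA in any column (what drop_na(total) does).
all_complete <- schools[complete.cases(schools), ]

# Dropping rows only when a regression variable is missing.
needed <- c("perecd", "mn_avg_ol")
model_complete <- schools[complete.cases(schools[, needed]), ]

print(nrow(all_complete))    # rows complete in every column
print(nrow(model_complete))  # rows complete in the two model variables
```

With tidyr, the second filter could be written by passing the column names to drop_na(), which then inspects only those columns.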

Finally, before proceeding to the regression analysis, I construct summary statistics of the main regression variables.

#Compute summary statistics for percent of students economically disadvantaged
mean(perecd2)

## [1] 53.66844

median(perecd2)

## [1] 53.89661

max(perecd2)

## [1] 100

min(perecd2)

## [1] 0

#Compute summary statistics for test score (grade levels)
mean(total1$mn_avg_ol)

## [1] 0.0157451

median(total1$mn_avg_ol)

## [1] 0.01788848

max(total1$mn_avg_ol)

## [1] 2.156946

min(total1$mn_avg_ol)

## [1] -1.762436

The percentage of economically disadvantaged students ranges from 0% to 100%, with a mean of 53.67%. The mean test score is 0.016 grade levels above the national average. The highest-scoring school was 2.16 grade levels above the national average, with under 20% of its students economically disadvantaged. The lowest-scoring school was 1.76 grade levels below the national average, with over 95% of its students economically disadvantaged.
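The statistics above can be gathered in fewer calls, and the schools at the extremes can be looked up directly. The sketch below shows one way to do this on a made-up data frame named toy, which stands in for total1; none of the values are from the SEDA data.

```r
# Toy data standing in for total1; values are made up for illustration.
toy <- data.frame(
  perecd2   = c(12, 35, 54, 78, 97),
  mn_avg_ol = c(1.9, 0.6, 0.0, -0.4, -1.5)
)

# summary() reports min, quartiles, median, mean, and max in one call.
print(summary(toy$perecd2))
print(summary(toy$mn_avg_ol))

# Look up the schools with the highest and lowest average score to see
# what share of their students is economically disadvantaged.
top_school    <- toy[which.max(toy$mn_avg_ol), ]
bottom_school <- toy[which.min(toy$mn_avg_ol), ]
print(top_school)
print(bottom_school)
```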

3. Regression analysis

Baseline estimation

The main specification is:

$\text{mn\_avg\_ol}_i = \beta_1 + \beta_2 \, \text{perecd2}_i + u_i$

This regresses each school's average test score (in grade levels) on its percentage of economically disadvantaged students. The regression is estimated below, using OLS:

baseline <- lm(mn_avg_ol1 ~ perecd2, data = total1)
summary(baseline)

## 
## Call:
## lm(formula = mn_avg_ol1 ~ perecd2, data = total1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.37220 -0.13873 -0.00373  0.13690  1.68997 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.941e-01  2.554e-03   271.7   <2e-16 ***
## perecd2     -1.264e-02  4.269e-05  -296.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2203 on 38095 degrees of freedom
## Multiple R-squared:  0.697,  Adjusted R-squared:  0.697 
## F-statistic: 8.765e+04 on 1 and 38095 DF,  p-value: < 2.2e-16

stargazer(baseline, type = "text", title = "Baseline Regression Results", dep.var.labels = c("Average Test Grade Level Score"), covariate.labels = c("Percent Economically Disadvantaged Students"))

## 
## Baseline Regression Results
## ==========================================================================
##                                                  Dependent variable:      
##                                             ------------------------------
##                                             Average Test Grade Level Score
## --------------------------------------------------------------------------
## Percent Economically Disadvantaged Students           -0.013***           
##                                                       (0.00004)           
##                                                                           
## Constant                                               0.694***           
##                                                        (0.003)            
##                                                                           
## --------------------------------------------------------------------------
## Observations                                            38,097            
## R2                                                      0.697             
## Adjusted R2                                             0.697             
## Residual Std. Error                               0.220 (df = 38095)      
## F Statistic                                 87,650.580*** (df = 1; 38095) 
## ==========================================================================
## Note:                                          *p<0.1; **p<0.05; ***p<0.01

The results above suggest that when the percentage of economically disadvantaged students is 0%, the predicted average test score is 0.694 grade levels above the national average. This makes sense: schools with no economically disadvantaged students are predicted to score above average. The slope is negative, which is also logical: when the percentage of economically disadvantaged students rises by 1 percentage point, the average test score falls by about 0.013 grade levels. While this seems small, the predicted difference between a school at 0% and a school at 100% is about 1.3 grade levels, and the coefficient is statistically significant at the 1% level.
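As a quick arithmetic check on this interpretation, the fitted line can be evaluated at a few values of perecd2, using the coefficients from the summary output (0.6941 and -0.01264) and the sample mean of perecd2 (53.67):

$$
\begin{aligned}
\widehat{\text{mn\_avg\_ol}} &= 0.6941 - 0.01264 \times \text{perecd2} \\
\text{perecd2} = 0: \quad \widehat{\text{mn\_avg\_ol}} &= 0.6941 \\
\text{perecd2} = 53.67: \quad \widehat{\text{mn\_avg\_ol}} &= 0.6941 - 0.6784 = 0.0157 \\
\text{perecd2} = 100: \quad \widehat{\text{mn\_avg\_ol}} &= 0.6941 - 1.264 = -0.5699
\end{aligned}
$$

The prediction at the sample mean reproduces the mean test score of 0.0157 computed earlier, as it should, because an OLS line with an intercept passes through the point of means.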

Testing and confidence interval

cor.test(perecd1, mn_avg_ol1, conf.level = 0.99)

## 
##  Pearson's product-moment correlation
## 
## data:  perecd1 and mn_avg_ol1
## t = -296.06, df = 38095, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  -0.8388478 -0.8308509
## sample estimates:
##        cor 
## -0.8348934

The estimated correlation is -0.835, and the 99% confidence interval (-0.839 to -0.831) excludes zero, so the null hypothesis of no correlation is rejected at the 1% level.
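The t-statistic of the correlation test (-296.06) matches the t value on perecd2 in the baseline regression (-296.1), and the squared correlation matches the R-squared (0.697). This is expected in a regression with one regressor. The sketch below checks these identities on simulated data and also prints a 99% confidence interval for the slope; the variable names and simulated values are assumptions for illustration, not the SEDA data.

```r
set.seed(320)  # arbitrary seed so the sketch is reproducible

# Simulated stand-ins for the real variables; not the SEDA data.
n <- 500
share   <- runif(n)                        # disadvantaged share, 0 to 1
percent <- share * 100                     # same variable, 0 to 100
score   <- 0.7 - 1.3 * share + rnorm(n, sd = 0.2)

fit      <- lm(score ~ percent)
slope_t  <- summary(fit)$coefficients["percent", "t value"]
r_square <- summary(fit)$r.squared

ct <- cor.test(share, score, conf.level = 0.99)

# Correlation ignores the unit of measurement, so share and percent agree.
print(all.equal(cor(share, score), cor(percent, score)))

# In a one-regressor model the correlation t-statistic equals the slope's.
print(all.equal(unname(ct$statistic), slope_t))

# The squared correlation equals the regression R-squared.
print(all.equal(unname(ct$estimate)^2, r_square))

# 99% confidence interval for the slope itself, in grade levels per point.
print(confint(fit, "percent", level = 0.99))
```

On the real data, confint(baseline, level = 0.99) would give the matching interval for the effect size, which the correlation test alone does not report.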

Graphical representation

ggplot(total1, mapping = aes(perecd2, mn_avg_ol1)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "Average Test Scores", x = "Percentage of Economically Disadvantaged Students")

## `geom_smooth()` using formula 'y ~ x'

The scatterplot and fitted regression line show a clear negative relationship between the percentage of economically disadvantaged students and average test scores.

4. Further Checks

While the two variables are strongly correlated and I see no obvious signs that the OLS assumptions are violated, there could be some omitted variable bias (OVB). The model does not account for financial activities such as donations made to the school or fundraising by the school. Another omitted variable is parental involvement. I bring this up because my elementary school required parental involvement in the form of volunteering (driving for field trips or helping in the classroom), which brought more motivation to schooling.

Additionally, the data do not specify the extent to which students are economically disadvantaged. The disadvantage could be minor or severe, which would also affect students' test scores. Many students are also weak test-takers regardless of what they know, which could skew the data, since students who test well will rank higher than those who do not. Differences could also arise from access to academic support, such as tutoring or technology that gives students resources like Khan Academy.
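To make the omitted variable concern concrete, the sketch below simulates a case where an omitted factor z (playing the role of something like parental involvement) is correlated with the regressor x and also affects the outcome y. All names and numbers are made up for illustration. It verifies the standard OLS identity that the slope from the short regression equals the slope from the long regression plus the omitted variable's coefficient times the slope from regressing z on x.

```r
set.seed(320)  # arbitrary seed so the sketch is reproducible

# Simulated data: z plays the role of an omitted factor (for example,
# parental involvement) that is correlated with x and also affects y.
n <- 1000
x <- rnorm(n)
z <- -0.6 * x + rnorm(n)
y <- 1 - 0.5 * x + 0.8 * z + rnorm(n)

short <- lm(y ~ x)       # omits z
long  <- lm(y ~ x + z)   # includes z
aux   <- lm(z ~ x)       # how z moves with x

b_short <- unname(coef(short)["x"])
b_long  <- unname(coef(long)["x"])
b_z     <- unname(coef(long)["z"])
delta   <- unname(coef(aux)["x"])

# Omitted variable bias identity: short slope = long slope + b_z * delta.
print(c(short = b_short, long = b_long, bias = b_z * delta))
print(all.equal(b_short, b_long + b_z * delta))
```

In this simulation the short regression overstates the negative effect of x, because z falls as x rises and z raises y.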

First check: Adding Free/Reduced Lunch

The proportion of students in the free/reduced lunch program can also be accounted for in the regression.

It is now added to the model as a second regressor.

$\text{mn\_avg\_ol}_i = \beta_1 + \beta_2 \, \text{perecd2}_i + \beta_3 \, \text{perfrl1}_i + u_i$

perfrl1 <- total1$perfrl*100

The line above converts the proportion of students receiving free/reduced lunch to a whole-number percentage, matching the scale of perecd2.

multiple <- lm(mn_avg_ol1 ~ perecd2 + perfrl1, data = total1)
summary(multiple)

## 
## Call:
## lm(formula = mn_avg_ol1 ~ perecd2 + perfrl1, data = total1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.37594 -0.13783 -0.00416  0.13533  1.74716 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.7029414  0.0025606  274.53   <2e-16 ***
## perecd2     -0.0074041  0.0002194  -33.75   <2e-16 ***
## perfrl1     -0.0054404  0.0002237  -24.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2186 on 38094 degrees of freedom
## Multiple R-squared:  0.7017, Adjusted R-squared:  0.7017 
## F-statistic: 4.48e+04 on 2 and 38094 DF,  p-value: < 2.2e-16

stargazer(multiple, type = "text", title = "Multiple Regression Results", dep.var.labels = c("Average Test Grade Level Score"), covariate.labels = c("Percent Economically Disadvantaged Students", "Proportion of Students Receiving Free/Reduced Lunch"))

## 
## Multiple Regression Results
## ==================================================================================
##                                                          Dependent variable:      
##                                                     ------------------------------
##                                                     Average Test Grade Level Score
## ----------------------------------------------------------------------------------
## Percent Economically Disadvantaged Students                   -0.007***           
##                                                                (0.0002)           
##                                                                                   
## Proportion of Students Receiving Free/Reduced Lunch           -0.005***           
##                                                                (0.0002)           
##                                                                                   
## Constant                                                       0.703***           
##                                                                (0.003)            
##                                                                                   
## ----------------------------------------------------------------------------------
## Observations                                                    38,097            
## R2                                                              0.702             
## Adjusted R2                                                     0.702             
## Residual Std. Error                                       0.219 (df = 38094)      
## F Statistic                                         44,800.710*** (df = 2; 38094) 
## ==================================================================================
## Note:                                                  *p<0.1; **p<0.05; ***p<0.01

The constant changes only slightly (from 0.694 to 0.703). Holding the free/reduced lunch percentage fixed, one additional percentage point of economically disadvantaged students is associated with a 0.007 grade-level decline in the average test score, compared with 0.013 in the baseline regression without free/reduced lunch. The R-squared barely moves (0.697 to 0.702), which suggests the two measures capture largely the same variation.
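The change in the perecd2 coefficient can be read through the standard OLS identity linking the regression without perfrl1 (short) to the one with it (long). Here $\hat{\delta}$ denotes the slope from an auxiliary regression of perfrl1 on perecd2. That regression is not estimated in this report, so the value below is only backed out from the rounded coefficients above and should be treated as approximate:

$$
\begin{aligned}
\hat{\beta}_2^{\text{short}} &= \hat{\beta}_2^{\text{long}} + \hat{\beta}_3 \, \hat{\delta} \\
-0.01264 &= -0.00740 + (-0.00544) \, \hat{\delta} \\
\hat{\delta} &= \frac{0.01264 - 0.00740}{0.00544} \approx 0.96
\end{aligned}
$$

An implied slope close to 1 would mean that the free/reduced lunch percentage rises almost one-for-one with the percentage of economically disadvantaged students, so the second regression mostly splits a single effect between two overlapping measures.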

cor.test(perecd1 + perfrl1, mn_avg_ol1, conf.level = 0.99)

## 
##  Pearson's product-moment correlation
## 
## data:  perecd1 + perfrl1 and mn_avg_ol1
## t = -293.29, df = 38095, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  -0.8365114 -0.8284093
## sample estimates:
##        cor 
## -0.8325049

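In the regression, the free/reduced lunch coefficient is significant at the 1% level (t = -24.32). The correlation test above says less about the added variable: it correlates test scores with the sum perecd1 + perfrl1, which appears to mix a 0-to-1 proportion with a 0-to-100 percentage, so the sum is driven almost entirely by perfrl1. Its correlation with test scores (-0.833) is significant at the 1% level, and it is close to the -0.835 found for perecd1 alone.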

5. Conclusion

As the regression models and significance tests show, there is a clear negative relationship between the share of economically disadvantaged students in a school and its average test scores. This matters because evidence that finances play a big part in education can be a major point of interest for advocates of reforming how school budgets are assigned. Schools are funded largely at the local level, so the tax dollars their cities bring in pay for their education systems, and the trend in test scores follows that pattern. Funding schools from the taxes of the cities their districts belong to keeps wealthy schools wealthy and leaves poor schools poor. Using data such as those above to push lawmakers to move from district-based funding to statewide funding is crucial, and could benefit poor communities throughout the United States that need extra funding for their education programs.