ECMT 461-904 Final Term Project
Analyzing Texas High School Spending & AP Performance Outcomes of Economically Disadvantaged Students
1. Introduction
In this project I will be using the Texas High School AP and SAT data to analyze the potential relationship between AP Pass Rates for Economically Disadvantaged Students and a school's Total Expenditures per Student. Thus, AP Pass Rates for Economically Disadvantaged Students will be my criterion variable, and my predictor variable will be Total Expenditures per Student. For my category separation variable, I will use Total School Enrollment divided into 3 subgroups - Small, Medium, and Large. Each subgroup contains roughly one-third of the sample, with the cutoffs between groups at 684 and 1955 students respectively.
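As a sketch, this subgroup construction can be reproduced in R as follows. The data frame name `hs` and the column names match the code and variable names used later in this paper; the file name and the `SizeGroup` column are hypothetical.

```r
# Minimal sketch of the school-size subgroup construction
library(readxl)

hs <- read_excel("TexasHSData.xlsx")  # hypothetical file name

# Tercile cutoffs for Total Enrollment (in our sample: 684 and 1955 students)
cuts <- quantile(hs$`TOTAL ENROLLMENT`, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)
hs$SizeGroup <- cut(hs$`TOTAL ENROLLMENT`, breaks = cuts,
                    labels = c("Small", "Medium", "Large"),
                    include.lowest = TRUE)
table(hs$SizeGroup)  # roughly one-third of the schools in each group
```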
The criterion variable, AP Pass Rates for Economically Disadvantaged Students, is economically significant because economically disadvantaged students typically score lower on standardized tests due to the extra barriers they face in education. However, one would expect a positive correlation between AP Pass Rates for Economically Disadvantaged Students and Total Expenditures per Student if our funding is being allocated wisely.
This is also why I created subgroups by School Size - I wanted to examine whether the potential correlation between increased funding and increased AP test performance for economically disadvantaged students holds for all schools or depends on the size of the school as well. The basis for expecting differences among the subgroups is that larger schools might overlook funding their underprivileged students in favor of programs with a broader scope (ones that primarily reach their better-off students). Smaller schools, I believe, would be more invested in each individual student, and would thus direct their spending per student into programs that specifically target their economically disadvantaged population, resulting in a stronger correlation between Total Expenditures per Student and AP Pass Rates for smaller schools' economically disadvantaged populations. Thus, the null hypothesis is that AP Pass Rates for Economically Disadvantaged Students have no correlation with Total Expenditures per Student, and that there are no statistically significant differences across the Total Enrollment subgroups. The alternative hypothesis is that AP Pass Rates for Economically Disadvantaged Students increase as Total Expenditures per Student increase, and that there are statistically significant differences across the Total Enrollment subgroups.
2. Literature Review
When we examine the literature on this topic, experts typically agree that an increase in spending correlates with an increase in educational attainment, and that this benefit should extend to economically disadvantaged students. For example, Daarel Burnette II of Education Week found in 2020 that when “politicians and taxpayers invested more money in teacher salaries, school construction, and schools with high populations of low-income students… students’ test scores [jumped].” This supports the alternative hypothesis stated earlier in this proposal. C. Kirabo Jackson reaches a similar conclusion in his 2020 paper “Does School Spending Matter? The New Literature on an Old Question,” writing that “the recent quasi-experimental literature that relates school spending to student outcomes overwhelmingly support a causal relationship between increased school spending and student outcomes.” With this research in mind, we should expect to find a positive correlation between a school’s Total Expenditures per Student and its AP Pass Rates for Economically Disadvantaged Students, and likely reject the stated null hypothesis.
3. Basic Descriptive Statistics & Frequency Distributions
Basic Descriptive Statistics for AP Pass Rates for Econ. Disadv. Students (Tabular)
| | Min | Q1 | Median | Mean | Q3 | Max | Std.Dev | Range | Correlation |
|---|---|---|---|---|---|---|---|---|---|
| Small | 0.5 | 18.800 | 36.80 | 42.122 | 61.10 | 100.0 | 28.555 | 99.5 | 0.036815 |
| Medium | 0.7 | 14.150 | 26.70 | 29.945 | 40.20 | 100.0 | 19.660 | 99.3 | -0.135894 |
| Large | 3.1 | 20.550 | 36.00 | 37.120 | 51.85 | 88.1 | 19.636 | 85.0 | -0.204066 |
| Overall | 0.5 | 17.275 | 33.15 | 35.891 | 50.00 | 100.0 | 22.884 | 99.5 | -0.007243 |
Basic Descriptive Statistics for AP Pass Rates for Econ. Disadv. Students (Graphical)
Discussion of Descriptive Statistics
When we analyze the basic descriptive statistics of the AP Pass Rate for Economically Disadvantaged Students for the overall sample, we see that the Median (33.15) is below the Mean (35.891). This tells us that overall, the AP Pass Rate for Economically Disadvantaged Students has a positive/right skew. This holds for all subgroups, as each has a higher Mean than Median. One interesting observation is that the “Small” subgroup has a far larger standard deviation (28.555) than the “Medium” (19.660) and “Large” (19.636) subgroups, and even than the overall sample (22.884). This means that smaller schools show greater spread in their AP Pass Rates for Economically Disadvantaged Students than medium and large schools. Another important difference among subgroups is the Maximum value - all subgroups have a Maximum of 100, except for the “Large” group, which only reaches 88.1. Accordingly, while the “Small” and “Medium” subgroups have Ranges of 99.5 and 99.3 respectively, the “Large” subgroup has a Range of only 85.0, which is explained by its far smaller Maximum value. We can also see that the “Medium” subgroup has 4 high outliers while the other subgroups have none.
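As a sketch, the table above can be reproduced with base R, reusing the `hs` data frame and the assumed `SizeGroup` factor from the earlier sketch:

```r
# Descriptive statistics of the pass rate, by school-size group and overall
desc <- function(x) {
  round(c(Min = min(x), Q1 = unname(quantile(x, .25)), Median = median(x),
          Mean = mean(x), Q3 = unname(quantile(x, .75)), Max = max(x),
          Std.Dev = sd(x), Range = diff(range(x))), 3)
}
t(sapply(split(hs$Pass_Rate_EconDis, hs$SizeGroup), desc))  # subgroup rows
desc(hs$Pass_Rate_EconDis)                                  # Overall row
```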
Frequency Distribution for AP Pass Rates for Econ. Disadv. Students (Tabular)
| AP Pass Rate for Econ. Dis. Students (%) | Count - Overall | Count - Small | Count - Medium | Count - Large |
|---|---|---|---|---|
| 0 - 5 | 26 | 8 | 13 | 5 |
| 5 - 10 | 60 | 17 | 28 | 15 |
| 10 - 15 | 79 | 16 | 38 | 25 |
| 15 - 20 | 71 | 16 | 29 | 26 |
| 20 - 25 | 66 | 15 | 26 | 25 |
| 25 - 30 | 55 | 14 | 21 | 20 |
| 30 - 35 | 66 | 16 | 25 | 25 |
| 35 - 40 | 73 | 17 | 29 | 27 |
| 40 - 45 | 40 | 9 | 15 | 16 |
| 45 - 50 | 55 | 17 | 13 | 25 |
| 50 - 55 | 35 | 4 | 10 | 21 |
| 55 - 60 | 37 | 7 | 11 | 19 |
| 60 - 65 | 29 | 8 | 6 | 15 |
| 65 - 70 | 21 | 9 | 5 | 7 |
| 70 - 75 | 15 | 4 | 1 | 10 |
| 75 - 80 | 13 | 5 | 3 | 5 |
| 80 - 85 | 4 | 2 | 0 | 2 |
| 85 - 90 | 11 | 5 | 4 | 2 |
| 90 - 95 | 0 | 0 | 0 | 0 |
| 95 - 100 | 22 | 20 | 2 | 0 |
Frequency Distribution for AP Pass Rates for Econ. Disadv. Students (Graphical)
By Subgroups (Graphical)
Discussion of Frequency Distributions
When we look at the frequency distributions for the AP Pass Rate for Economically Disadvantaged Students, we see the right skew confirmed. However, when we separate the data into subgroups, the “Medium” subgroup shows the most pronounced right skew, the “Large” subgroup a somewhat more balanced distribution, and the “Small” subgroup an even more balanced one. Another interesting observation is that no school in the sample has a value between 90% and 95%. In addition, the overwhelming majority of observations falling between 95% and 100% belong to the “Small” subgroup (20), with the “Medium” subgroup having 2 and the “Large” subgroup having 0.
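As a sketch, the binning behind this frequency table looks roughly like the following (treating the bins as left-closed, with 100 folded into the top bin):

```r
# Frequency counts in bins of width 5
brks <- seq(0, 100, by = 5)
bins <- cut(hs$Pass_Rate_EconDis, breaks = brks,
            right = FALSE, include.lowest = TRUE)
table(bins)                # Count - Overall
table(bins, hs$SizeGroup)  # Counts by subgroup
```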
4. Single Sample Confidence Intervals and Hypothesis Tests
Confidence Intervals for AP Pass Rate Mean by School Size (Tabular)
| | S - 99% CI | S - 95% CI | S - 90% CI | M - 99% CI | M - 95% CI | M - 90% CI | L - 99% CI | L - 95% CI | L - 90% CI | Overall - 99% CI | Overall - 95% CI | Overall - 90% CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample Mean | 42.122 | 42.122 | 42.122 | 29.945 | 29.945 | 29.945 | 37.12 | 37.120 | 37.120 | 35.891 | 35.891 | 35.891 |
| Lower CL | 36.987 | 38.228 | 38.859 | 26.892 | 27.628 | 28.003 | 34.13 | 34.851 | 35.218 | 33.772 | 34.280 | 34.540 |
| Upper CL | 47.257 | 46.016 | 45.385 | 32.998 | 32.262 | 31.888 | 40.11 | 39.390 | 39.023 | 38.009 | 37.501 | 37.242 |
Confidence Intervals for AP Pass Rate Mean by School Size (Graphical)
Discussion of Mean Confidence Interval Results
When analyzing the Confidence Intervals for the Mean of the AP Pass Rate for Economically Disadvantaged Students, we see that the lowest population mean likely belongs to the “Medium” subgroup, while the “Small” subgroup likely has the largest population mean. The “Large” subgroup likely has a population mean closest to the Overall population mean. Using the Overall sample’s confidence interval, we can be 95% confident that the population mean AP Pass Rate for Economically Disadvantaged Students lies between 34.280 and 37.501.
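These intervals come from the Student's t distribution; a minimal sketch for the Overall column, assuming the same `hs` objects:

```r
# t-based confidence intervals for the mean at the three confidence levels
ci_mean <- function(x, level) {
  n <- length(x)
  mean(x) + c(-1, 1) * qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
}
sapply(c(0.90, 0.95, 0.99), ci_mean, x = hs$Pass_Rate_EconDis)  # Overall CIs
```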
Confidence Intervals for AP Pass Rate Variance by School Size (Tabular)
| | S - 99% CI | S - 95% CI | S - 90% CI | M - 99% CI | M - 95% CI | M - 90% CI | L - 99% CI | L - 95% CI | L - 90% CI | Overall - 99% CI | Overall - 95% CI | Overall - 90% CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample Variance | 815.36 | 815.36 | 815.36 | 386.52 | 386.52 | 386.52 | 385.58 | 385.58 | 385.58 | 523.68 | 523.68 | 523.68 |
| Lower CL | 641.72 | 678.83 | 698.94 | 313.74 | 329.53 | 338.02 | 314.18 | 329.70 | 338.04 | 461.18 | 475.27 | 482.70 |
| Upper CL | 1065.16 | 997.91 | 965.66 | 486.17 | 459.78 | 447.00 | 482.75 | 457.07 | 444.62 | 599.07 | 579.92 | 570.43 |
Confidence Intervals for AP Pass Rate Variance by School Size (Graphical)
Discussion of Variance Confidence Interval Results
When analyzing the Confidence Intervals for the Variance of the AP Pass Rate for Economically Disadvantaged Students, we see that none of the subgroup intervals overlap the Overall interval: the “Small” subgroup has a considerably higher Variance Confidence Interval (95% likely to be between 678.83 and 997.91), while the “Medium” and “Large” intervals sit below the Overall interval. Using the Overall sample’s confidence interval, we can be 95% confident that the population variance of the AP Pass Rate for Economically Disadvantaged Students lies between 475.27 and 579.92.
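The variance intervals follow from the chi-squared distribution; a minimal sketch for the Overall column:

```r
# Chi-squared confidence intervals for the variance
ci_var <- function(x, level) {
  n <- length(x); a <- 1 - level
  (n - 1) * var(x) / qchisq(c(1 - a / 2, a / 2), df = n - 1)  # (lower, upper)
}
sapply(c(0.90, 0.95, 0.99), ci_var, x = hs$Pass_Rate_EconDis)  # Overall CIs
```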
Hypothesis Testing for AP Pass Rate Mean by School Size (Tabular)
| | Mean | T-stat | T-crit 90% | Conc. 90% | T-crit 95% | Conc. 95% | T-crit 99% | Conc. 99% |
|---|---|---|---|---|---|---|---|---|
| Small | 42.122 | 3.1547 | 1.6522 | Reject | 1.9714 | Reject | 2.5996 | Reject |
| Medium | 29.945 | -5.0515 | 1.6503 | Reject | 1.9685 | Reject | 2.5936 | Reject |
| Large | 37.120 | 1.0662 | 1.6501 | Fail to Reject | 1.9682 | Fail to Reject | 2.5929 | Fail to Reject |
Discussion of Mean Hypothesis Testing Results
When conducting our Mean Hypothesis Test, we must use the Student’s T-Distribution because the population variance of our criterion variable is unknown. We set our Null Hypothesis as each subgroup having the same Mean as the overall sample (35.891), and our Alternative Hypothesis as the subgroup’s Mean differing from the overall sample mean. At every confidence level, we Rejected H0 for the “Small” and “Medium” subgroups, and Failed to Reject H0 for the “Large” subgroup. According to our findings, then, the “Small” and “Medium” subgroups have means that differ from the overall sample at every tested confidence level, while for the “Large” subgroup we cannot say that its mean differs from the overall sample at any tested confidence level.
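A sketch of these tests with R's stock one-sample t-test, again assuming the earlier `hs` objects:

```r
# One-sample t-tests of each subgroup mean against the overall sample mean
mu0 <- mean(hs$Pass_Rate_EconDis)                 # 35.891
by(hs$Pass_Rate_EconDis, hs$SizeGroup, t.test, mu = mu0)
qt(1 - 0.05 / 2, df = 209 - 1)  # e.g. two-sided 95% t-crit for "Small": 1.9714
```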
Hypothesis Testing for AP Pass Rate Variance by School Size (Tabular)
| | Variance | ChiSq Stat | Var CI-L 90% | Var CI-U 90% | Conc. 90% | Var CI-L 95% | Var CI-U 95% | Conc. 95% | Var CI-L 99% | Var CI-U 99% | Conc. 99% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Small | 815.36 | 323.85 | 698.94 | 965.66 | Reject | 678.83 | 997.91 | Reject | 641.72 | 1065.16 | Reject |
| Medium | 386.52 | 205.19 | 338.02 | 447.00 | Reject | 329.53 | 459.78 | Reject | 313.74 | 486.17 | Reject |
| Large | 385.58 | 212.79 | 338.04 | 444.62 | Reject | 329.70 | 457.07 | Reject | 314.18 | 482.75 | Reject |
Discussion of Variance Hypothesis Testing Results
For our Variance Hypothesis Test, we must use the Chi-Squared Distribution in order to check whether each subgroup can be said to have the same variance as the overall sample. Thus, our Null Hypothesis is that each subgroup has the same variance as the overall sample (523.68), while our Alternative Hypothesis is that each subgroup’s variance differs from the overall sample’s. Across all groups and all Confidence Levels, we Rejected H0: the hypothesized variance of 523.68 falls outside every subgroup’s variance confidence interval listed above. Thus, our findings indicate that at every tested confidence level, none of our subgroups share the same variance as the overall sample.
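A sketch of the test statistic, which reproduces the ChiSq Stat column above:

```r
# Chi-squared statistic for testing a subgroup variance against sigma0^2
sigma2_0   <- var(hs$Pass_Rate_EconDis)  # 523.68, the overall sample variance
chisq_stat <- function(x) (length(x) - 1) * var(x) / sigma2_0
by(hs$Pass_Rate_EconDis, hs$SizeGroup, chisq_stat)  # 323.85, 205.19, 212.79
```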
5. Two Sample Confidence Intervals and Hypothesis Tests
Pair-wise Hypothesis Tests of Equal Variance (Tabular)
| | Group 1 Variance | Group 2 Variance | P Value | Conc. 90% | Conc. 95% | Conc. 99% |
|---|---|---|---|---|---|---|
| Small vs. Medium | 815.36 | 386.52 | 0.00000 | Reject | Reject | Reject |
| Medium vs. Large | 386.52 | 385.58 | 0.95459 | Fail To Reject | Fail To Reject | Fail To Reject |
| Small vs. Large | 815.36 | 385.58 | 0.00000 | Reject | Reject | Reject |
Discussion of Pair-wise Hypothesis Tests of Equal Variance
When we conduct our Pair-wise Hypothesis Tests of Equal Variance, our Null Hypothesis is that the two subgroups have equal variances, while our Alternative Hypothesis is that they have different variances. Using the calculated P-values, we determine whether to Reject or Fail to Reject H0: if the P-value is below the significance level, we Reject H0; otherwise we Fail to Reject H0. We Reject H0 for both the “Small vs. Medium” and the “Small vs. Large” tests (each with a P-value of effectively 0). This means we have statistically significant evidence that the “Small” subgroup has a different population variance than the “Medium” and “Large” subgroups. However, we Fail To Reject H0 for the “Medium vs. Large” test (P-value of .955), meaning we have no evidence against the “Medium” and “Large” subgroups having equal population variances.
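A sketch using R's stock F-test for equality of two variances:

```r
# Pair-wise F-tests of equal variance
small  <- hs$Pass_Rate_EconDis[hs$SizeGroup == "Small"]
medium <- hs$Pass_Rate_EconDis[hs$SizeGroup == "Medium"]
large  <- hs$Pass_Rate_EconDis[hs$SizeGroup == "Large"]
var.test(small, medium)$p.value   # ~0
var.test(medium, large)$p.value   # ~0.955
var.test(small, large)$p.value    # ~0
```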
Difference in Mean Pair-Wise Confidence Intervals (Tabular)
| | Group 1 Mean | Group 2 Mean | Xbar - Ybar | LCL - 90% | UCL - 90% | LCL - 95% | UCL - 95% | LCL - 99% | UCL - 99% |
|---|---|---|---|---|---|---|---|---|---|
| Small vs. Medium | 42.12 | 29.95 | 12.177 | 8.385 | 15.969 | 7.6547 | 16.699 | 6.2218 | 18.132 |
| Small vs. Large | 42.12 | 37.12 | 5.002 | 1.230 | 8.774 | 0.5032 | 9.500 | -0.9223 | 10.926 |
| Medium vs. Large | 29.95 | 37.12 | -7.175 | -9.890 | -4.461 | -10.412 | -3.939 | -11.434 | -2.917 |
Difference in Mean Pair-wise Hypothesis Testings (Tabular)
| | Xbar - Ybar | P-value | Conc. 90% | Conc. 95% | Conc. 99% |
|---|---|---|---|---|---|
| Small vs. Medium | 12.1768 | 2.0999e-07 | Reject | Reject | Reject |
| Medium vs. Large | -7.1752 | 4.1122e-08 | Reject | Reject | Reject |
| Small vs. Large | 5.0017 | 2.9421e-02 | Reject | Reject | Fail To Reject |
Discussion of Pair-wise Difference in Means Hypothesis Testing Results
When analyzing our Difference in Means Hypothesis Testing Results, our Null Hypothesis is that the two subgroups share the same mean, and our Alternative Hypothesis is that they have differing means. Our findings tell us that every pair of subgroups has differing means at the 90% and 95% confidence levels. The one exception comes at the 99% level: the “Small vs. Large” P-value (2.9421e-02) exceeds .01, so we Fail to Reject H0 there, which is consistent with the 99% confidence interval for that difference containing zero.
An important note: the P-value for our “Medium vs. Large” test was calculated under the assumption that both subgroups have equal population variances (per our earlier findings). Every other pair-wise test was calculated allowing the subgroups’ population variances to differ.
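A sketch of these tests, with the variance assumption for each pair following the F-test results above:

```r
# Two-sample t-tests for pair-wise differences in means
t.test(small, medium, var.equal = FALSE)  # Welch: unequal variances
t.test(small, large,  var.equal = FALSE)  # Welch: unequal variances
t.test(medium, large, var.equal = TRUE)   # pooled: equal variances
```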
6. Single Factor Analysis of Variance (ANOVA) Tests for Joint Equality of Means
| | df | Sum Sq. | Mean Sq. | F-value | P-value | Conc. 90% | Conc. 95% | Conc. 99% |
|---|---|---|---|---|---|---|---|---|
| School Size | 2 | 18416.3 | 9208.143 | 18.3697 | 1.6049e-08 | Reject | Reject | Reject |
| Residuals | 775 | 388482.2 | 501.267 | — | — | — | — | — |
Discussion of ANOVA Tests for Joint Equality of Means Results
When we conduct our ANOVA test for joint equality of means, our Null Hypothesis is that every subgroup has the same mean, and our Alternative Hypothesis is that not every subgroup has the same mean. We get a P-value of 1.6049e-08 from our ANOVA test, far lower than any of the significance levels we are testing at. This means we Reject H0 at every tested confidence level: there is a statistically significant difference in mean among the subgroups.
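A sketch using R's stock one-way ANOVA:

```r
# One-way ANOVA: joint test of equal mean pass rates across school sizes
summary(aov(Pass_Rate_EconDis ~ SizeGroup, data = hs))
```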
7. Correlation Analysis
Scatterplots
Significant Correlation Hypothesis Testing
| | Corr. Coeff. | Sample Size | P-Value | Conc. 90% | Conc. 95% | Conc. 99% |
|---|---|---|---|---|---|---|
| Overall | -0.02115 | 778 | 5.6e-01 | Fail To Reject | Fail To Reject | Fail To Reject |
| Small | 0.06758 | 209 | 3.3e-01 | Fail To Reject | Fail To Reject | Fail To Reject |
| Medium | -0.13589 | 279 | 2.3e-02 | Reject | Reject | Fail To Reject |
| Large | -0.31822 | 290 | 3.2e-08 | Reject | Reject | Reject |
Discussion of Significant Correlation Hypothesis Testing Results
When conducting our Significant Correlation Hypothesis Tests, our Null Hypothesis is that there is no correlation between our predictor and criterion variable, while our Alternative Hypothesis is that there is a correlation between the two. For “Overall” and for the “Small” subgroup, we Fail To Reject H0 at every tested confidence level, meaning we find no statistically significant correlation for these two groups. For the “Medium” subgroup, we Reject H0 at the 90% and 95% confidence levels, but Fail To Reject H0 at the 99% level: at 90% and 95% we conclude a correlation exists, but at 99% we cannot. For the “Large” subgroup, we Reject H0 at every tested confidence level. Our analysis therefore finds a statistically significant correlation in the “Medium” (at 90% and 95%) and “Large” subgroups.
One important thing to note here is that the correlation we are testing for is not necessarily the positive one predicted in our original hypothesis. In fact, the sample correlation coefficients were negative for every tested group except “Small”. For the “Medium” and “Large” groups (where we Rejected H0), the correlation we found is therefore a negative one between Total Expenditures per Student and AP Pass Rates for Economically Disadvantaged Students.
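A sketch of these tests with R's stock correlation test:

```r
# Correlation significance tests, overall and by subgroup
cor.test(hs$`Total Exp per Student`, hs$Pass_Rate_EconDis)
by(hs, hs$SizeGroup, function(d)
  cor.test(d$`Total Exp per Student`, d$Pass_Rate_EconDis)$p.value)
```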
Pair-Wise Differences in Degree of Correlation Hypothesis Testing
| | Group 1 Zr | Group 2 Zr | Z Calc | Conc. 90% | Conc. 95% | Conc. 99% |
|---|---|---|---|---|---|---|
| Small vs. Medium | 0.06768 | -0.1367 | 2.220 | Reject | Reject | Fail To Reject |
| Medium vs. Large | -0.13674 | -0.3297 | 2.288 | Reject | Reject | Fail To Reject |
| Small vs. Large | 0.06768 | -0.3297 | 4.351 | Reject | Reject | Reject |
Discussion of Pair-Wise Difference in Degree of Correlation Hypothesis Testing
(*Note - the critical Z-value for each Confidence Level is not listed, simply for brevity. However, we compared our calculated Z against the Standard Normal Distribution for each probability listed.)
For our Pair-Wise Difference in Correlation Hypothesis Tests, our Null Hypothesis is that the tested subgroups have the same correlation between Total Expenditures per Student and AP Pass Rates for Economically Disadvantaged Students. Our Alternative Hypothesis is that the two subgroups’ correlations are not equal. When comparing the “Small” subgroup to the “Medium” subgroup, we Rejected H0 at the 90% and 95% confidence levels but Failed To Reject H0 at the 99% level, and we had the same findings when comparing the “Medium” and “Large” subgroups. Our analysis thus found a statistically significant difference between the “Small” & “Medium” correlations and between the “Medium” & “Large” correlations at the 90% and 95% levels, though not at the 99% level. When we compare the “Small” subgroup with the “Large” subgroup, however, we Reject H0 at every tested confidence level, so these two subgroups’ correlations differ significantly even at 99%.
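A sketch of the Fisher z statistic behind the Z Calc column (atanh() is the Fisher transformation):

```r
# Fisher z test for the difference between two correlation coefficients
fisher_z_diff <- function(r1, n1, r2, n2) {
  z1 <- atanh(r1); z2 <- atanh(r2)               # transform r to z
  (z1 - z2) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # compare to N(0, 1)
}
fisher_z_diff( 0.0676, 209, -0.1359, 279)  # Small vs. Medium: ~2.22
fisher_z_diff(-0.1359, 279, -0.3182, 290)  # Medium vs. Large: ~2.29
fisher_z_diff( 0.0676, 209, -0.3182, 290)  # Small vs. Large:  ~4.35
```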
Correlation Joint Equality Hypothesis Testing
| | Corr. Coeff. | Sample Size | Zr | (n-3) | (n-3)Zr^2 | (n-3)Zr |
|---|---|---|---|---|---|---|
| Small | 0.0676 | 209 | 0.0677 | 206 | 0.944 | 13.9 |
| Medium | -0.1359 | 279 | -0.1367 | 276 | 5.160 | -37.7 |
| Large | -0.3182 | 290 | -0.3297 | 287 | 31.191 | -94.6 |
| Sum | NA | NA | NA | 769 | 37.295 | -118.4 |
Applying the formula to find our Chi-Squared calculated test statistic:
χ² = 37.449
And now we can do our actual Hypothesis Testing at our 3 different significance levels:
| ChiSq Calc | ChiSq Crit 90% | Conc. 90% | ChiSq Crit 95% | Conc. 95% | ChiSq Crit 99% | Conc. 99% |
|---|---|---|---|---|---|---|
| 37.449 | 4.6052 | Reject | 5.9915 | Reject | 9.2103 | Reject |
Discussion of Correlation Joint Equality Hypothesis Testing
When we do our Correlation Joint Equality Hypothesis Testing, our Null Hypothesis is that the correlation between Total Expenditures per Student and the AP Pass Rate for Economically Disadvantaged Students is jointly equal across all 3 School Size subgroups. Our Alternative Hypothesis is that at least one of these correlation coefficients differs. At every significance level, we Reject H0, meaning we have statistically significant evidence that at least one subgroup’s correlation differs from the others’. In real-world terms, according to our data the correlation between Total Expenditures per Student and AP Pass Rates for Economically Disadvantaged Students differs across school sizes.
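A sketch of one common formulation of this joint test: a chi-squared statistic computed from the Fisher-z values, with k - 1 = 2 degrees of freedom.

```r
# Joint test of equality of k correlation coefficients via Fisher z
joint_corr_test <- function(r, n) {
  z <- atanh(r); w <- n - 3                     # Fisher z values and weights
  stat <- sum(w * z^2) - sum(w * z)^2 / sum(w)  # chi-squared, df = k - 1
  c(stat = stat,
    p.value = pchisq(stat, df = length(r) - 1, lower.tail = FALSE))
}
joint_corr_test(r = c(0.0676, -0.1359, -0.3182), n = c(209, 279, 290))
```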
8. Conclusion
The bulk of our findings lie in Section #7, where we analyzed correlations between our predictor and criterion variable, separating the data into subgroups and comparing those subgroups individually. Our findings, if they hold true, are quite troubling.
Overall, we found no statistically significant correlation between Total Expenditures per Student and AP Pass Rates for Economically Disadvantaged Students, and we had the same finding for the “Small” subgroup. This would suggest that an increase in per-pupil spending has no measurable association with economically disadvantaged students’ AP test performance. When we look at our other subgroups, however, the findings become more troublesome.
When we tested the “Medium” and “Large” subgroups, we found a statistically significant negative correlation between Total Expenditures per Student and AP Pass Rates for Economically Disadvantaged Students. This runs directly against the original hypothesis posited in Section #1 of this paper, and against the previous literature on this topic. The divergence of the “Medium” and “Large” subgroups does agree with one part of our original hypothesis - that smaller schools direct their funding toward their economically disadvantaged students more effectively than larger schools do. Even so, this is little consolation, since increased spending at small schools was still found to carry no measurable benefit for their economically disadvantaged students.
Possible Limitations
Our findings don’t imply causality, however, because there are a few limitations to our analysis.
For one, we are performing a cross-sectional analysis and not a time series analysis. We aren’t taking multiple schools, controlling for external factors, increasing or decreasing their Total Expenditures per Student, and then measuring their AP Pass Rates for Economically Disadvantaged Students. To ascertain causality, we would need to ensure that a change in Total Expenditures per Student is the only factor affecting the measured AP Pass Rate.
Furthermore, we shouldn’t necessarily sound the alarm that our school spending isn’t helping the students who need the most help. AP Pass Rates are not the defining standard of learning outcomes for high school students. In fact, maybe we should even celebrate that funding isn’t pushing students to perform better on AP tests - it might be doing things that are more beneficial for our students’ lives than letting them skip 3 credit hours of college coursework.
Possible Extensions
One of the main ways to extend our research project would be to create an experimental setting to test our hypothesis after excluding all possible confounding variables. We could also include a “Control” group to compare our findings against, to ensure they aren’t spurious. We could also expand our data set from just Texas to the entire country. While this might introduce more variability into our data, the additional observations could deliver more precision than is possible with only 200-300 observations per subgroup. We could also attempt a similar project on high schools in other countries to determine whether a correlation exists in general, exists only in other countries, or yields any number of further insights into our data.
Two other extensions that we could pursue involve Machine Learning. I have explained and worked those extensions out below in Sections #9 and #10.
9. Extension 1 - Imputing Missing Data
When analyzing our dataset, we had to cut out every observation that was missing a recorded value for any of “Total Enrollment”, “Pass_Rate_EconDis”, or “Total Exp per Student”. While we were still left with many data points to analyze for this project, this is not ideal. In the best-case scenario, we are given a completed dataset that we do not have to delete entries from - but this is unrealistic. Thus, to get a (reasonably accurate) completed dataset, we have to impute the missing values. We can do this in multiple ways: we can take the mean/median/mode of the dataset and plug it in for the missing values (not very accurate), or we can use a Machine Learning model to predict the missing values based on the school’s other statistics. We can use the “caret” (Classification And REgression Training) package in R to build a simple model that imputes these missing values using Bagged Decision Trees.
There are some steps to take before jumping straight into the model, however. We must first choose which features we want our model to take in when predicting our 3 variables. We can obviously cut out the basic identifying information for each High School, such as “Campus,” “CampName,” “District,” “DistName,” “County,” “CntyName,” “Region,” and “RegnName.” We also have to cut out the different subcategories of spending, such as Instructional, Leadership, Guidance, and Extracurricular spending per student: combined, they are equivalent to “Total Exp per Student,” which means that if Total Expenditure per Student is missing, these values must be missing as well. It is also important to note that we DO include the variables we are imputing as features in our model.
This leaves us with the features “Campus_Type”, “Part_Rate_All”, “Part_Rate_Female”, “Pass_Rate_Female”, “Part_Rate_Male”, “Pass_Rate_Male”, “Part_Rate_EconDis”, “SAT_All”, “SAT_EconDis”, “SAT_Female”, “SAT_Male”, “EconDis Enr”, “Pass_Rate_EconDis”, “TOTAL ENROLLMENT”, and “Total Exp per Student” that will be used to predict “Pass_Rate_EconDis”, “TOTAL ENROLLMENT”, and “Total Exp per Student”.
Next, we need to add new columns that track whether the value of each variable was missing in the original dataset. We tack on 3 new binary columns, which read “Y” (yes, was missing) or “N” (no, was not missing): “MissingTotalEnrollment”, “MissingPassRateEconDis”, and “MissingTotalExpPerStudent”. This way, looking back at the dataset, we lose no information and can easily tell what was imputed and what was originally given. An example of the dataset’s new columns is shown below. Note that wherever there is an “NA” in our dataframe for “Pass_Rate_EconDis”, our “MissingPassRateEconDis” variable takes on a value of “Y”. The same happens throughout the dataset for “Total Exp per Student” and “Total Enrollment”.
| Pass_Rate_EconDis | TOTAL ENROLLMENT | Total Exp per Student | MissingPassRateEconDis | MissingTotalExpPerStudent | MissingTotalEnrollment |
|---|---|---|---|---|---|
| NA | 410 | 10100 | Y | N | N |
| 35.0 | 858 | 6855 | N | N | N |
| 25.0 | 231 | 13552 | N | N | N |
| 50.0 | 319 | 10579 | N | N | N |
| NA | 417 | 12710 | Y | N | N |
| 100.0 | 461 | 11646 | N | N | N |
| 25.4 | 1031 | 7986 | N | N | N |
| NA | 481 | 10657 | Y | N | N |
| 52.4 | 716 | 9528 | N | N | N |
| 35.7 | 806 | 7735 | N | N | N |
| NA | 217 | 9913 | Y | N | N |
Once we have done all of this, we can run the model and replace our original dataset’s missing values with our imputed values, and compare their basic descriptive statistics.
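A minimal sketch of this imputation step with caret, assuming a data frame `hs.features` that holds only the feature columns listed above (preProcess() ignores non-numeric columns, so a categorical feature like “Campus_Type” would need to be dummy-encoded to contribute, and caret’s bagged-tree imputation relies on the ipred package being installed):

```r
# Bagged decision tree imputation of missing values with caret
library(caret)

# Flag the originally missing values before they get filled in
hs.features$MissingPassRateEconDis <-
  ifelse(is.na(hs.features$Pass_Rate_EconDis), "Y", "N")

# Fit bagged trees on the observed values, then predict the NAs
pre        <- preProcess(hs.features, method = "bagImpute")
hs.imputed <- predict(pre, newdata = hs.features)
```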
Original Data Set’s Basic Statistics
| | Min. | Q1 | Median | Mean | Q3 | Max. |
|---|---|---|---|---|---|---|
| Total Enrollment | 116.0 | 623.250 | 1525.00 | 1551.323 | 2271.5 | 5098 |
| AP Pass Rate (Econ. Dis.) | 0.5 | 17.275 | 33.15 | 35.891 | 50.0 | 100 |
| Total Exp Per Student | 4261.0 | 7250.750 | 8094.00 | 8636.021 | 9192.8 | 59298 |
New Data Set’s Basic Statistics
| | Min. | Q1 | Median | Mean | Q3 | Max. |
|---|---|---|---|---|---|---|
| Total Enrollment | 10.0 | 474.0 | 1121.2 | 1339.714 | 2056.50 | 5098 |
| AP Pass Rate (Econ. Dis.) | 0.5 | 16.7 | 33.2 | 36.615 | 50.45 | 100 |
| Total Exp Per Student | 0.0 | 7416.0 | 8437.0 | 9320.330 | 9953.00 | 118688 |
Discussion of Differences in Basic Statistics
When we compare the basic descriptive statistics of the original dataset (with NAs excluded) and the new dataset (with NAs imputed), we see that, broadly speaking, the missing values for “Total Enrollment” were generally imputed lower than the original dataset’s values, while the missing values for “Pass_Rate_EconDis” were generally imputed higher. However, “Total Exp Per Student” shows a far larger range that does not seem reasonable. A value of $0 for Total Exp Per Student makes little sense (how would a school operate if it spent $0?), as does a value of $118,688, almost double the original maximum of $59,298. We also see a somewhat implausible phenomenon in our new dataset’s minimum for “Total Enrollment” - a school of 10 students is possible but very unlikely, and thus most likely a shortcoming of our imputation model.
10. Extension 2 - Creating & Training a Machine Learning Regression Model
Though we attempted to examine a relationship between Total Expenditures per Student and AP Pass Rates for Economically Disadvantaged Students, we can take this one step further and ask: how is “Pass_Rate_EconDis” affected by every other variable we are given data for? We do this by training a regression model. First, we split the data into two sets: our Training set and our Testing set. We ask our model to analyze a portion of our data (the Training set) and learn the relationship between all of our features (ex: SAT Scores, Instructional Expenditure per Student, etc.) and the AP Pass Rate for Economically Disadvantaged Students. Then, we apply this model to the remaining portion of our data (the Testing set) to evaluate how effective it is on “new” data. The Train/Test split used in this paper is 70/30.
Below is the code used to create our Machine Learning model. The code is set to display rather than run, with interspersed comments explaining exactly what is happening at each step.
```r
# Packages used below: caret for modeling, doSNOW for parallel processing
# (method = "xgbTree" additionally requires the xgboost package)
library(caret)
library(doSNOW)

# Creating our 70/30 Train/Test split of our data with createDataPartition()
indexes <- createDataPartition(hs$Pass_Rate_EconDis,
times = 1,
p = .7,
list = F
)
# Train gets the chosen training rows (70%), Test gets everything else (30%)
hs.train <- hs[indexes, ]
hs.test <- hs[-indexes, ]
# Training Our Model ####
# We need to set precise parameters for our trainControl(), and tuning grid which we will later
# pass to our final training function
# We tell our model to run with the "repeatedcv" method. We divide our training data set into
# 10 parts and then train each part based off of the other 9. We do this 3 times, and find the
# average error across all attempts. We then conduct a grid search, which means we exhaustively
# look through every single tested value to find the best one.
train.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
search = "grid"
)
# Now we construct our tune grid. This consists of multiple hyperparameters which we have to
# tune to get our most efficient model. Hence, we have it try out multiple different
# combinations (it will try eta = .025, nrounds = 75, etc. and then do the exact same thing
# except eta = .05).
tune.grid <- expand.grid(eta = c(0.025, .05),
nrounds = c(75, 100, 125),
max_depth = c(6, 7, 8),
min_child_weight = c(1.5, 2, 2.5),
colsample_bytree = c(0.6, 0.7, 0.8),
gamma = 0,
subsample = 1
)
# In broad terms, this starts 3 parallel R worker processes that all run through
# our train() resampling work at the same time to speed it up, because this will
# take a long time to run.
c1 <- makeCluster(3, type = "SOCK")
registerDoSNOW(c1)
# Training our model. We want to predict "Pass_Rate_EconDis" as a function of every other given
# variable. We pass it our tune grid, train.control, and we tell it to give us the model with
# the highest R squared value.
caret.cv <- train(Pass_Rate_EconDis ~ .,
                  data = hs.train,
method = "xgbTree",
metric = "Rsquared",
tuneGrid = tune.grid,
trControl = train.control,
na.action = na.pass,
verbose = F,
verbosity = 0
)
# Shutting down the parallel workers now that training is complete
stopCluster(c1)
```
Machine Learning Model Results
| | eta | max_depth | colsample_bytree | min_child_weight | nrounds | RMSE | R^2 | MAE |
|---|---|---|---|---|---|---|---|---|
| Value | 0.025 | 7 | 0.7 | 2.5 | 125 | 9.412 | 0.8469 | 6.248 |
The metrics used to evaluate our Machine Learning model’s predictions of AP Pass Rates for Economically Disadvantaged Students are RMSE (Root Mean Square Error), R^2, and MAE (Mean Absolute Error). Because we told our model to prioritize R^2, the R^2 value of .8469 matters most for our analysis. An R^2 of .8469 tells us that 84.69% of the variation in AP Pass Rates for Economically Disadvantaged Students is explained by our model, with the remaining 15.31% left unexplained.
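As a follow-up sketch, the held-out Testing set from the 70/30 split can be scored the same way (the table above reports the cross-validated metrics; these calls are an assumption about how the test-set check would look):

```r
# Evaluating the tuned model on the 30% held-out test set
preds <- predict(caret.cv, newdata = hs.test, na.action = na.pass)
postResample(pred = preds, obs = hs.test$Pass_Rate_EconDis)  # RMSE, R^2, MAE
```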
Works Cited
Alboukadel Kassambara (2020). ggpubr: ‘ggplot2’ Based Publication Ready Plots. R package version 0.4.0. https://CRAN.R-project.org/package=ggpubr
Andri Signorell et al. (2021). DescTools: Tools for Descriptive Statistics. R package version 0.99.44. https://CRAN.R-project.org/package=DescTools
Burnette, Daarel, II. “Student Outcomes: Does More Money Really Matter?” Education Week, 8 Dec. 2020, www.edweek.org/policy-politics/student-outcomes-does-more-money-really-matter/2019/06.
Hadley Wickham and Jennifer Bryan (2022). readxl: Read Excel Files. R package version 1.4.0. https://CRAN.R-project.org/package=readxl
Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Jackson, C. Kirabo. “Does School Spending Matter? The New Literature on an Old Question.” Confronting Inequality: How Policies and Practices Shape Children’s Opportunities, 2020, pp. 165–186, https://doi.org/10.1037/0000187-008.
Jonathan M. Lees (2020). PEIP: Geophysical Inverse Theory and Optimization. R package version 2.2-3. https://CRAN.R-project.org/package=PEIP
Julien Barnier (2021). rmdformats: HTML Output Formats and Templates for ‘rmarkdown’ Documents. R package version 1.0.3. https://CRAN.R-project.org/package=rmdformats
Kun Ren and Kenton Russell (2021). formattable: Create ‘Formattable’ Data Structures. R package version 0.2.1. https://CRAN.R-project.org/package=formattable
Max Kuhn (2022). caret: Classification and Regression Training. R package version 6.0-91. https://CRAN.R-project.org/package=caret
Microsoft Corporation and Stephen Weston (2022). doSNOW: Foreach Parallel Adaptor for the ‘snow’ Package. R package version 1.0.20. https://CRAN.R-project.org/package=doSNOW
Millard SP (2013). _EnvStats: An R Package for Environmental Statistics_. Springer, New York. ISBN 978-1-4614-8455-4, <URL: https://www.springer.com>.
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Rinker, T. W. & Kurkiewicz, D. (2017). pacman: Package Management for R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
RStudio Team (2020). RStudio: Integrated Development for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.
Schulman, Craig T. “Texas High School AP and SAT Data.” Econometrics 461, http://people.tamu.edu/~cschulman/ECMT461/TermPr461.html.
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
Yihui Xie (2022). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.38.