ECMT 461-904 Final Term Project

Analyzing Texas High School Spending & AP Performance Outcomes of Economically Disadvantaged Students


1. Introduction

In this project I will use the Texas High School AP and SAT data to analyze the potential relationship between AP Pass Rates for Economically Disadvantaged Students and each school’s Total Expenditures per Student. AP Pass Rates for Economically Disadvantaged Students will be my criterion variable, and Total Expenditures per Student will be my predictor variable. For my category separation variable, I will use Total School Enrollment divided into 3 subgroups - Small, Medium and Large. Each subgroup holds roughly one-third of the schools, with the cutoffs between groups being 684 and 1955 students respectively.
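A minimal sketch of how the three size subgroups could be built in R with cut(). The enrollment values are made up for illustration, and which side of each cutoff a school lands on is my own assumption, since the cutoffs are only given as 684 and 1955.

```r
# Hypothetical enrollment figures; the real values come from the Texas data set.
enrollment <- c(116, 684, 685, 1500, 1955, 1956, 5098)

# Assumption: a school sitting exactly on a cutoff goes into the lower group
# (right = TRUE closes each interval on the right).
size_group <- cut(enrollment,
                  breaks = c(-Inf, 684, 1955, Inf),
                  labels = c("Small", "Medium", "Large"),
                  right = TRUE)

# Count how many schools fall into each subgroup
table(size_group)
```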

The criterion variable of AP Pass Rates for Economically Disadvantaged Students is economically significant because economically disadvantaged students typically have lower scores on standardized tests due to the extra barriers they face in education. However, one would expect a positive correlation between AP Pass Rates for Economically Disadvantaged Students and Total School Expenditures per Student if our funding is being allocated wisely.

This is also why I created subgroups by School Size - I wanted to examine whether the potential correlation between an increase in funding and an increase in AP test performance for economically disadvantaged students holds for all schools or depends upon the size of the school as well. The basis for expecting differences in the results among each subgroup is that larger schools might overlook funding their underprivileged students in favor of other programs that might have a larger scope (in terms of impacting the higher class students). However, I believe that smaller schools would be more invested in each individual student, and thus push their total spending per student into programs that specifically target their economically disadvantaged population, resulting in a stronger correlation between Total Expenditures per Student and AP pass rates for smaller schools’ economically disadvantaged population. Thus, the null hypothesis is that AP Pass Rates for Economically Disadvantaged Students have no correlation with Total Expenditure per Student, and there are no statistically significant differences as the Total Enrollment subgroup changes. The alternative hypothesis is that if Total Expenditure per Student increases then the AP Pass Rate for Economically Disadvantaged Students increases, and there are statistically significant differences as we change Total Enrollment.


2. Literature Review

When we examine the literature on this topic, experts typically agree that an increase in spending correlates with an increase in educational attainment, and that this benefit should extend to economically disadvantaged students. For example, Daarel Burnette II from Education Week found in 2020 that when “politicians and taxpayers invested more money in teacher salaries, school construction, and schools with high populations of low-income students… students’ test scores [jumped].” This would support the alternative hypothesis stated earlier in this paper. C. Kirabo Jackson reaches a similar conclusion in his 2020 paper “Does School Spending Matter? The New Literature on an Old Question”, writing that “the recent quasi-experimental literature that relates school spending to student outcomes overwhelmingly support a causal relationship between increased school spending and student outcomes.” With this research in mind, we should expect to find a positive correlation between schools’ Total Expenditure per Student and their AP Pass Rates for Economically Disadvantaged Students, and likely reject the stated null hypothesis.


3. Basic Descriptive Statistics & Frequency Distributions

Basic Descriptive Statistics for AP Pass Rates for Econ. Disadv. Students (Tabular)

Min Q1 Median Mean Q3 Max Std.Dev Range Correlation
Small 0.5 18.800 36.80 42.122 61.10 100.0 28.555 99.5 0.036815
Medium 0.7 14.150 26.70 29.945 40.20 100.0 19.660 99.3 -0.135894
Large 3.1 20.550 36.00 37.120 51.85 88.1 19.636 85.0 -0.204066
Overall 0.5 17.275 33.15 35.891 50.00 100.0 22.884 99.5 -0.007243

Basic Descriptive Statistics for AP Pass Rates for Econ. Disadv. Students (Graphical)

Discussion of Descriptive Statistics

When we analyze the basic descriptive statistics of the AP Pass Rate for Economically Disadvantaged Students for the overall sample, we see that the Median (33.15) is below the Mean (35.89087). This tells us that overall, the AP Pass Rate for Economically Disadvantaged Students has a positive/right skew. This seems to hold true for all subgroups, as they all have a higher Mean than Median. One interesting observation is that the “Small” subgroup has a far larger standard deviation (28.55457) than the “Medium” (19.66015) and “Large” (19.63628) groups, and even than the overall standard deviation (22.88403). This means that smaller schools have a larger variance among their AP Pass Rates for Economically Disadvantaged Students when compared to the medium and larger schools. Another important difference among subgroups is the Maximum value - all subgroups have a Maximum of 100, except for the “Large” group, which only has a Maximum of 88.1. When we look at the Range, the “Small” and “Medium” subgroups have values of 99.5 and 99.3 respectively. However, the “Large” subgroup has a Range of only 85.0, which is explained by its far smaller Maximum value. We can also see that the “Medium” subgroup has 4 high outliers while the other subgroups have no outliers.

Frequency Distribution for AP Pass Rates for Econ. Disadv. Students (Tabular)

AP Pass Rate for Econ. Dis. Students (%) Count - Overall Count - Small Count - Medium Count - Large
0 - 5 26 8 13 5
5 - 10 60 17 28 15
10 - 15 79 16 38 25
15 - 20 71 16 29 26
20 - 25 66 15 26 25
25 - 30 55 14 21 20
30 - 35 66 16 25 25
35 - 40 73 17 29 27
40 - 45 40 9 15 16
45 - 50 55 17 13 25
50 - 55 35 4 10 21
55 - 60 37 7 11 19
60 - 65 29 8 6 15
65 - 70 21 9 5 7
70 - 75 15 4 1 10
75 - 80 13 5 3 5
80 - 85 4 2 0 2
85 - 90 11 5 4 2
90 - 95 0 0 0 0
95 - 100 22 20 2 0

Frequency Distribution for AP Pass Rates for Econ. Disadv. Students (Graphical)

By Subgroups (Graphical)

Discussion of Frequency Distributions

When we look at the frequency distributions for the AP Pass Rate for Economically Disadvantaged Students, we can see the right skew confirmed. However, when we separate the data into subgroups, we see that the “Medium” subgroup has the most defined right skew, while the “Large” subgroup has a slightly more balanced distribution and the “Small” subgroup is more balanced still. Another interesting observation is that none of the schools in the sample have a value between 90% and 95%. In addition, the overwhelming majority of observations that fall between 95% and 100% belong to the “Small” subgroup (20), with the “Medium” subgroup having 2 and the “Large” subgroup having 0.
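As a rough sketch, a frequency table with 5-point bins like the one above could be produced with cut() and table(). The pass rates below are hypothetical, and closing each bin on the left (so that 5 falls in "5 - 10") is my assumption about how the bins were defined.

```r
# Hypothetical pass rates (%), for illustration only
pass_rate <- c(0.5, 4.9, 5, 12.3, 33.15, 50, 88.1, 95, 100)

# Assumption: bins are [0,5), [5,10), ..., with 100 included in the last bin
bins <- cut(pass_rate,
            breaks = seq(0, 100, by = 5),
            right = FALSE,
            include.lowest = TRUE)

# Count of observations per bin (empty bins are kept with a count of 0)
freq <- table(bins)
freq
```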


4. Single Sample Confidence Intervals and Hypothesis Tests

Confidence Intervals for AP Pass Rate Mean by School Size (Tabular)

S - 99% CI S - 95% CI S - 90% CI M - 99% CI M - 95% CI M - 90% CI L - 99% CI L - 95% CI L - 90% CI Overall - 99% CI Overall - 95% CI Overall - 90% CI
Sample Mean 42.122 42.122 42.122 29.945 29.945 29.945 37.12 37.120 37.120 35.891 35.891 35.891
Lower CL 36.987 38.228 38.859 26.892 27.628 28.003 34.13 34.851 35.218 33.772 34.280 34.540
Upper CL 47.257 46.016 45.385 32.998 32.262 31.888 40.11 39.390 39.023 38.009 37.501 37.242

Confidence Intervals for AP Pass Rate Mean by School Size (Graphical)

Discussion of Mean Confidence Interval Results

When analyzing the Confidence Intervals for the Mean of the AP Pass Rate for Economically Disadvantaged Students, we see that the lowest population mean is likely for the “Medium” subgroup while the “Small” subgroup likely has the largest population mean. The “Large” subgroup likely has a population mean closest to the Overall population mean. Using the Overall sample’s confidence interval, we can be 95% confident that the population mean AP Pass Rate for Economically Disadvantaged Students lies between 34.280 and 37.501.
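The interval above can be reproduced from the summary statistics alone. This is a minimal sketch that assumes the usual t-based interval (mean plus or minus the critical t value times s / sqrt(n)); the helper name mean_ci is my own, and the inputs are the Overall mean, standard deviation and sample size reported in this paper.

```r
# t-based confidence interval for a mean from summary statistics
mean_ci <- function(xbar, s, n, level = 0.95) {
  se <- s / sqrt(n)                                # standard error of the mean
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)    # two-sided critical value
  c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
}

# Overall sample: mean 35.89087, standard deviation 22.88403, n = 778
overall_ci <- mean_ci(xbar = 35.89087, s = 22.88403, n = 778)
overall_ci
```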

Confidence Intervals for AP Pass Rate Variance by School Size (Tabular)

S - 99% CI S - 95% CI S - 90% CI M - 99% CI M - 95% CI M - 90% CI L - 99% CI L - 95% CI L - 90% CI Overall - 99% CI Overall - 95% CI Overall - 90% CI
Sample Variance 815.36 815.36 815.36 386.52 386.52 386.52 385.58 385.58 385.58 523.68 523.68 523.68
Lower CL 641.72 678.83 698.94 313.74 329.53 338.02 314.18 329.70 338.04 461.18 475.27 482.70
Upper CL 1065.16 997.91 965.66 486.17 459.78 447.00 482.75 457.07 444.62 599.07 579.92 570.43

Confidence Intervals for AP Pass Rate Variance by School Size (Graphical)

Discussion of Variance Confidence Interval Results

When analyzing the Confidence Intervals for the Variance of the AP Pass Rate for Economically Disadvantaged Students, we see that the “Small” subgroup stands apart with a considerably higher Variance Confidence Interval (95% likely to be between 678.83075 and 997.90849). The “Medium” and “Large” intervals are nearly identical to each other and sit below the Overall interval, since their 95% upper limits (459.78 and 457.07) fall short of the Overall lower limit (475.27). Using the Overall sample’s confidence interval, we can be 95% confident that the population variance for the AP Pass Rate for Economically Disadvantaged Students lies between 475.27108 and 579.91829.
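A short sketch of the variance interval, assuming the standard chi-squared interval (n - 1)s² / χ² with n - 1 degrees of freedom. The helper name var_ci is hypothetical; the inputs are the Overall sample variance and sample size from this paper.

```r
# Chi-squared confidence interval for a variance from summary statistics
var_ci <- function(s2, n, level = 0.95) {
  alpha <- 1 - level
  df <- n - 1
  c(lower = df * s2 / qchisq(1 - alpha / 2, df),   # divide by the upper quantile
    upper = df * s2 / qchisq(alpha / 2, df))       # divide by the lower quantile
}

# Overall sample: variance 523.67893, n = 778
overall_var_ci <- var_ci(s2 = 523.67893, n = 778)
overall_var_ci
```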

Hypothesis Testing for AP Pass Rate Mean by School Size (Tabular)

Mean T-stat T-crit 90% Conc. 90% T-crit 95% Conc. 95% T-crit 99% Conc. 99%
Small 42.122 3.1547 1.6522 Reject 1.9714 Reject 2.5996 Reject
Medium 29.945 -5.0515 1.6503 Reject 1.9685 Reject 2.5936 Reject
Large 37.120 1.0662 1.6501 Fail to Reject 1.9682 Fail to Reject 2.5929 Fail to Reject

Discussion of Mean Hypothesis Testing Results

When conducting our Mean Hypothesis Test, we must use the Student’s T-Distribution because the population variance of our criterion variable is unknown. When we conduct this test, we set our Null Hypothesis as each subgroup having the exact same Mean as the overall sample (35.89087). Our Alternative Hypothesis is that the Mean of each of our subgroups is not equal to the overall sample mean. At every confidence level, we Rejected H0 for the “Small” and “Medium” subgroups, and Failed to Reject H0 for the “Large” subgroup. This means that according to our findings, the “Small” and “Medium” subgroups have differing means than the overall sample (at every tested confidence level). However, because we Failed to Reject H0 for our “Large” subgroup, this means that we cannot say that the “Large” subgroup has a different mean than the overall sample (at every tested confidence level).
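To make the test concrete, here is a minimal sketch of the one-sample t-test computed from summary statistics, using the “Small” subgroup’s reported mean, standard deviation and sample size against the overall mean. The function name one_sample_t is my own.

```r
# One-sample, two-sided t-test from summary statistics
one_sample_t <- function(xbar, s, n, mu0, level = 0.95) {
  t_stat <- (xbar - mu0) / (s / sqrt(n))
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)
  list(t_stat = t_stat, t_crit = t_crit, reject = abs(t_stat) > t_crit)
}

# "Small" subgroup: mean 42.122, sd 28.55457, n = 209; H0 mean = 35.89087
small <- one_sample_t(xbar = 42.122, s = 28.55457, n = 209, mu0 = 35.89087)
small
```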

Hypothesis Testing for AP Pass Rate Variance by School Size (Tabular)

Variance ChiSq Stat Var. LCL 90% Var. UCL 90% Conc. 90% Var. LCL 95% Var. UCL 95% Conc. 95% Var. LCL 99% Var. UCL 99% Conc. 99%
Small 815.36 323.85 698.94 965.66 Reject 678.83 997.91 Reject 641.72 1065.16 Reject
Medium 386.52 205.19 338.02 447.00 Reject 329.53 459.78 Reject 313.74 486.17 Reject
Large 385.58 212.79 338.04 444.62 Reject 329.70 457.07 Reject 314.18 482.75 Reject

Discussion of Variance Hypothesis Testing Results

When we conduct our Variance Hypothesis Test, we must use the Chi-Squared Distribution for Variance testing in order to check if each subgroup can be said to have the same variance as the overall sample. Thus, our Null Hypothesis is that each subgroup has the same variance as the overall sample (523.67893) while our Alternative Hypothesis is that each subgroup has a different variance than the overall sample. When considering the findings, we see that across all groups and across all Confidence Levels, we Rejected H0. Thus, our findings indicate that for every tested confidence level, none of our subgroups share the same variance as the overall sample.
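A minimal sketch of the variance test, assuming the usual statistic (n - 1)s² / σ0² compared against the two-sided chi-squared critical values with n - 1 degrees of freedom. The function name is hypothetical; the inputs are the subgroup variances and sample sizes reported in this paper and the overall variance as σ0².

```r
# Two-sided chi-squared test of a single variance from summary statistics
var_test_one_sample <- function(s2, n, sigma2_0, level = 0.95) {
  df <- n - 1
  alpha <- 1 - level
  stat <- df * s2 / sigma2_0
  crit <- qchisq(c(alpha / 2, 1 - alpha / 2), df)   # lower and upper critical values
  list(stat = stat, crit = crit, reject = stat < crit[1] || stat > crit[2])
}

# "Small" (n = 209) and "Medium" (n = 279) subgroups at the 99% confidence level
small_var  <- var_test_one_sample(815.36, 209, sigma2_0 = 523.67893, level = 0.99)
medium_var <- var_test_one_sample(386.52, 279, sigma2_0 = 523.67893, level = 0.99)
```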


5. Two Sample Confidence Intervals and Hypothesis Tests

Pair-wise Hypothesis Tests of Equal Variance (Tabular)

Group 1 Variance Group 2 Variance P Value Conc. 90% Conc. 95% Conc. 99%
Small vs. Medium 815.36 386.52 0.00000 Reject Reject Reject
Medium vs. Large 386.52 385.58 0.95459 Fail To Reject Fail To Reject Fail To Reject
Small vs. Large 815.36 385.58 0.00000 Reject Reject Reject

Discussion of Pair-wise Hypothesis Tests of Equal Variance

When we conduct our Pair-wise Hypothesis Tests of Equal Variance, our Null Hypothesis is that our subgroups have equal variances, while our Alternative Hypothesis is that our subgroups have different variances. Using the calculated P-values, we can determine if we Reject or Fail to Reject H0. If our P-value is below our significance level, we Reject H0, while if our P-value is greater than the significance level, we Fail to Reject H0. We Reject H0 for both the “Small vs. Medium” and the “Small vs. Large” pair-wise tests (each had a P-value that rounds to 0). This means that we have statistically significant evidence that the “Small” subgroup has a different population variance than the “Medium” and the “Large” subgroups. However, we Fail To Reject H0 for the “Medium vs. Large” pair-wise test (P-value of .955). This does not prove that the two variances are equal, but we have no evidence that they differ, so we treat the “Medium” and “Large” subgroups as having equal population variances in the tests that follow.
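As a sketch of the test behind these P-values, the ratio of two sample variances can be compared with the F distribution. This assumes a two-sided F test from summary statistics; the function name is my own, and because the inputs are the rounded variances from the table, the P-values will not match the table to every digit.

```r
# Two-sided F test for equal variances from summary statistics
f_test_equal_var <- function(s2_1, n1, s2_2, n2) {
  f_stat <- s2_1 / s2_2
  p_lower <- pf(f_stat, df1 = n1 - 1, df2 = n2 - 1)
  p_value <- 2 * min(p_lower, 1 - p_lower)   # double the smaller tail
  list(f_stat = f_stat, p_value = p_value)
}

small_vs_medium <- f_test_equal_var(815.36, 209, 386.52, 279)
medium_vs_large <- f_test_equal_var(386.52, 279, 385.58, 290)
```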

Difference in Mean Pair-Wise Confidence Intervals (Tabular)

Group 1 Mean Group 2 Mean Xbar - Ybar LCL - 90% UCL - 90% LCL - 95% UCL - 95% LCL - 99% UCL - 99%
Small vs. Medium 42.12 29.95 12.177 8.385 15.969 7.6547 16.699 6.2218 18.132
Small vs. Large 42.12 37.12 5.002 1.230 8.774 0.5032 9.500 -0.9223 10.926
Medium vs. Large 29.95 37.12 -7.175 -15.776 -8.577 -16.4685 -7.885 -17.8252 -6.529

Difference in Mean Pair-wise Hypothesis Testings (Tabular)

Xbar - Ybar P-value Conc. 90% Conc. 95% Conc. 99%
Small vs. Medium 12.1768 2.0999e-07 Reject Reject Reject
Small vs. Large 5.0017 2.9421e-02 Reject Reject Fail To Reject
Medium vs. Large -7.1752 4.1122e-08 Reject Reject Reject

Discussion of Pair-wise Difference in Means Hypothesis Testing Results

When analyzing our Difference in Means Hypothesis Testing Results, our Null Hypothesis is that our subgroups share the same mean. Our Alternative Hypothesis is that our subgroups have differing means. We Reject H0 at every tested confidence level for the “Small vs. Medium” and “Medium vs. Large” tests, which tells us that these pairs of subgroups have differing means. For the “Small vs. Large” test, the P-value of 0.0294 lets us Reject H0 at the 90% and 95% confidence levels but not at the 99% confidence level. This agrees with the 99% confidence interval for that difference (-0.9223 to 10.926), which contains zero.

An important thing to note when we look at our data is that the P-value for our “Medium vs. Large” Pair-wise test was calculated under the assumption that both subgroups had equal population variance (due to our earlier findings). Every other pair-wise test was calculated under the assumption that subgroups had differing population variances from each other.
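For the unequal-variance case, the interval and P-value can be sketched from summary statistics with Welch’s t-test and its approximate degrees of freedom. The function name is hypothetical, and the inputs are the “Small” and “Large” means, variances and sample sizes reported in this paper.

```r
# Welch (unequal variance) two-sample t-test from summary statistics
welch_from_summary <- function(m1, s2_1, n1, m2, s2_2, n2, level = 0.95) {
  v1 <- s2_1 / n1
  v2 <- s2_2 / n2
  se <- sqrt(v1 + v2)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  diff <- m1 - m2
  t_stat <- diff / se
  t_crit <- qt(1 - (1 - level) / 2, df)
  list(diff = diff,
       ci = c(lower = diff - t_crit * se, upper = diff + t_crit * se),
       p_value = 2 * pt(-abs(t_stat), df))
}

# "Small" vs. "Large" at the 95% and 99% confidence levels
sl_95 <- welch_from_summary(42.122, 815.36, 209, 37.120, 385.58, 290, level = 0.95)
sl_99 <- welch_from_summary(42.122, 815.36, 209, 37.120, 385.58, 290, level = 0.99)
```

A quick consistency check on any such interval is that it must contain its own point estimate, and that it contains zero exactly when the two-sided test fails to reject at the matching level.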


6. Single Factor Analysis of Variance (ANOVA) Tests for Joint Equality of Means

df Sum Sq. Mean Sq. F-value P-value Conc. 90% Conc. 95% Conc. 99%
School Size 2 18416.3 9208.143 18.3697 1.6049e-08 Reject Reject Reject
Residuals 775 388482.2 501.267

Discussion of ANOVA Tests for Joint Equality of Means Results

When we conduct our ANOVA test for joint equality of means, our Null Hypothesis is that every subgroup has the same mean, and our Alternative Hypothesis is that not every subgroup has the same mean. We get a P-value of 1.6049e-08 from our ANOVA test, which is far lower than any of the significance levels we are testing at. What this means is that we Reject H0 at every tested confidence level, meaning that at least one subgroup mean differs significantly from the others. The ANOVA test by itself does not tell us which subgroups differ.
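The F-value and P-value follow directly from the sums of squares and degrees of freedom in the ANOVA table. A small sketch using those reported numbers:

```r
# Values taken from the ANOVA table above
ss_between <- 18416.3;  df_between <- 2
ss_within  <- 388482.2; df_within  <- 775

ms_between <- ss_between / df_between   # mean square for School Size
ms_within  <- ss_within / df_within     # mean square for Residuals

f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

c(F = f_value, p = p_value)
```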


7. Correlation Analysis

Scatterplots

Significant Correlation Hypothesis Testing

Corr. Coeff. Sample Size P-Value Conc. 90% Conc. 95% Conc. 99%
Overall -0.02115 778 5.6e-01 Fail To Reject Fail To Reject Fail To Reject
Small 0.06758 209 3.3e-01 Fail To Reject Fail To Reject Fail To Reject
Medium -0.13589 279 2.3e-02 Reject Reject Fail To Reject
Large -0.31822 290 3.2e-08 Reject Reject Reject

Discussion of Significant Correlation Hypothesis Testing Results

When conducting our Significant Correlation Hypothesis Tests, our Null Hypothesis is that there is no correlation between our predictor and criterion variable, while our Alternative Hypothesis is that there is a correlation between the two. Examining our results, for “Overall” and for the “Small” subgroup, we Fail To Reject H0 at every tested confidence level, meaning that we find no statistically significant correlation between the tested variables for these two groups. For the “Medium” subgroup we Reject H0 at the 90% and 95% confidence levels, but Fail To Reject H0 at the 99% confidence level. What this means is that the evidence of a correlation in the “Medium” subgroup is significant at the 10% and 5% significance levels, but not at the stricter 1% level. For the “Large” subgroup, we Reject H0 at every tested confidence level. Thus, from our analysis, there is evidence of a correlation in both the “Medium” and “Large” subgroups, with the evidence being strongest for “Large”.

One important thing to note here is that the correlation we are testing for is not necessarily a positive one like we predicted in our original hypothesis. In fact, the sample correlation coefficients were negative for every tested group except for “Small”. What this means for our “Medium” and “Large” groups (Where we Rejected H0) is that we found a correlation - a negative correlation between Total Expenditures Per Student, and AP Pass Rates for Economically Disadvantaged Students.
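The P-values in the table can be approximated from each correlation coefficient and sample size alone. This is a sketch assuming the usual t-test for a Pearson correlation, t = r·sqrt(n - 2) / sqrt(1 - r²) with n - 2 degrees of freedom; the function name is my own.

```r
# Two-sided test of H0: correlation = 0, from r and n
cor_test_from_r <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  p_value <- 2 * pt(-abs(t_stat), df = n - 2)
  c(t_stat = t_stat, p_value = p_value)
}

# Correlation coefficients and sample sizes from the table above
overall <- cor_test_from_r(-0.02115, 778)
small   <- cor_test_from_r( 0.06758, 209)
medium  <- cor_test_from_r(-0.13589, 279)
large   <- cor_test_from_r(-0.31822, 290)
```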

Pair-Wise Differences in Degree of Correlation Hypothesis Testing

Group 1 Fisher Zr Group 2 Fisher Zr Z Calc Conc. 90% Conc. 95% Conc. 99%
Small vs. Medium 0.06768 -0.1367 2.220 Reject Reject Fail To Reject
Medium vs. Large -0.13674 -0.3297 2.288 Reject Reject Fail To Reject
Small vs. Large 0.06768 -0.3297 4.351 Reject Reject Reject

Discussion of Pair-Wise Difference in Degree of Correlation Hypothesis Testing

(*Note - Z-stat for each Confidence Level is not listed, simply for efficiency sake. However, we compared our Z-calc vs. the Standard Normal Distribution for each probability listed)

For our Pair-Wise Difference in Correlation Hypothesis Tests, our Null Hypothesis is that the tested subgroups have the same correlation between Total Expenditures Per Student and AP Pass Rates for Economically Disadvantaged Students. Our Alternative Hypothesis is that the subgroups’ correlations between our predictor and criterion variable are not equal. When comparing the “Small” subgroup to the “Medium” subgroup, we Rejected H0 at the 90% and 95% confidence levels, but Failed To Reject H0 at the 99% confidence level (Z-calc of 2.220 against a critical value of about 2.576). We had the same findings when comparing the “Medium” and “Large” subgroups (Z-calc of 2.288). This means that our analysis found a statistically significant difference in correlation within each of these pairs at the 10% and 5% significance levels, but not at the 1% level. However, when we compare the “Small” subgroup with the “Large” subgroup, we Reject H0 at every tested confidence level (Z-calc of 4.351). This means that the difference in correlation is clearest between the two subgroups that are furthest apart in school size.
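A minimal sketch of the Z-calc values, assuming the standard Fisher transformation Zr = atanh(r) and a standard error of sqrt(1/(n1 - 3) + 1/(n2 - 3)). The function name is hypothetical; the correlations and sample sizes are the ones reported in the Significant Correlation table.

```r
# Test for a difference between two independent correlations via Fisher's z
fisher_z_diff <- function(r1, n1, r2, n2) {
  z1 <- atanh(r1)   # Fisher transformation of each correlation
  z2 <- atanh(r2)
  z_calc <- (z1 - z2) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
  c(z_calc = z_calc, p_value = 2 * pnorm(-abs(z_calc)))
}

small_medium <- fisher_z_diff( 0.06758, 209, -0.13589, 279)
medium_large <- fisher_z_diff(-0.13589, 279, -0.31822, 290)
small_large  <- fisher_z_diff( 0.06758, 209, -0.31822, 290)

# Two-sided critical values at the 90%, 95% and 99% confidence levels
z_crit <- qnorm(1 - c(0.10, 0.05, 0.01) / 2)
```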

Correlation Joint Equality Hypothesis Testing

Corr. Coeff. Sample Size Zr (n-3) (n-3)Zr^2 (n-3)Zr
Small 0.0676 209 0.0677 206 0.944 13.9
Medium -0.1359 279 -0.1367 276 5.160 -37.7
Large -0.3182 290 -0.3297 287 31.191 -94.6
Sum NA NA NA 769 37.295 -118.4

Applying the formula to find our Chi-Squared Calculated Test Statistic, with k - 1 = 2 degrees of freedom for our k = 3 subgroups and using the sums from the table above:

𝛘² = Σ(n-3)Zr² - [Σ(n-3)Zr]² / Σ(n-3) = 37.295 - (-118.4)² / 769 ≈ 19.065

And now we can do our actual Hypothesis Testing at our 3 different significance levels:

ChiSq Calc ChiSq Stat 90% Conc. 90% ChiSq Stat 95% Conc. 95% ChiSq Stat 99% Conc. 99%
19.065 4.6052 Reject 5.9915 Reject 9.2103 Reject
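The joint test can be sketched in a few lines of R. This assumes the standard homogeneity-of-correlations statistic, Σ(n-3)Zr² minus [Σ(n-3)Zr]² / Σ(n-3), with k - 1 degrees of freedom; the variable names are my own and the inputs are the reported correlations and sample sizes.

```r
# Reported correlation coefficients and sample sizes by School Size
r <- c(Small = 0.06758, Medium = -0.13589, Large = -0.31822)
n <- c(Small = 209, Medium = 279, Large = 290)

z <- atanh(r)   # Fisher Zr for each subgroup
w <- n - 3      # weights (n - 3)

# Weighted sum of squares of Zr around the weighted mean Zr
chi_sq <- sum(w * z^2) - sum(w * z)^2 / sum(w)
df <- length(r) - 1

# Critical values at the 90%, 95% and 99% confidence levels
crit <- qchisq(c(0.90, 0.95, 0.99), df)
reject <- chi_sq > crit
```

Note that the second term squares the weighted sum before dividing, so the statistic is equivalent to sum(w * (z - weighted.mean(z, w))^2).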

Discussion of Correlation Joint Equality Hypothesis Testing

When we are doing our Correlation Joint Equality Hypothesis Testing, our Null Hypothesis is that the correlation between Total Expenditures Per Student and the AP Pass Rate for Economically Disadvantaged Students is jointly equal between all 3 subgroups of School Size. Our Alternative Hypothesis is that at least one of these correlation coefficients is different. At every significance level, we Reject H0, meaning that we have evidence that at least one group has a statistically significant difference in correlation from the other(s). In real world terms, we can say that the correlation between Total Expenditures per Student and AP Pass Rates for Economically Disadvantaged Students is different for different sized schools according to our data.


8. Conclusion

The bulk of our findings lie in Section #7, where we analyzed correlations between our predictor and criterion variable, even separating among subgroups and comparing these subgroups individually. Our findings seem quite troubling if they hold true.

Overall, we found no statistically significant correlation between Total Expenditures Per Student and AP Pass Rates for Economically Disadvantaged Students. We had the same findings for the “Small” subgroup. This would suggest that, in our sample, higher per pupil spending is not associated with better AP test performance for economically disadvantaged students. However, when we look at our other subgroups, the findings become more troublesome.

When we tested the “Medium” and “Large” subgroups, we found a statistically significant negative correlation between Total Expenditures Per Student and AP Pass Rates for Economically Disadvantaged students (for the “Medium” subgroup, only at the 90% and 95% confidence levels). This goes directly against the original hypothesis posited in Section #1 of this paper, and against the previous literature on this topic. However, the deviation of the “Medium” and “Large” subgroups would agree with one part of our original hypothesis - that smaller schools focus their funding to help their economically disadvantaged students better than larger schools do. Even so, this finding is not much help considering that small schools still showed no significant benefit to their economically disadvantaged students when spending is higher.

Possible Limitations

Our findings don’t imply causality, however, because there are a few limitations to our analysis.

For one, we are performing a cross sectional analysis and not a time series analysis. We aren’t taking multiple schools, controlling for external factors, increasing/decreasing their Total Expenditures Per Student and then measuring their AP Pass Rates for Economically Disadvantaged Students. To ascertain causality, we would need to ensure that a change in Total Expenditures Per Student is the only factor affecting the measured AP Pass Rate.

Furthermore, we shouldn’t necessarily sound the alarm that our school spending isn’t helping our students who need the most help. AP Pass Rates are not necessarily the defining standard on learning outcomes for High School students. In fact, maybe we should celebrate that funding isn’t pushing students to perform better on AP tests - it might be doing things that are more beneficial for our students’ lives than letting them skip out on 3 credit hours of college coursework.

Possible Extensions

One of the main ways to extend our research project would be to actually create an experimental setting to test our hypothesis while controlling for as many confounding variables as possible. We could also include a “Control” group to compare our findings to in order to ensure that our findings aren’t spurious. We also could increase our data set from just Texas to the entire country. While this might induce more variability in our data, a larger sample would generally give us more precise estimates than we can get with only 200-300 observations per subgroup. We also could attempt a similar project on high schools in other countries to determine whether a correlation exists in general or only in certain countries.

Two other extensions that we could pursue involve Machine Learning. I have explained and worked those extensions out below in Sections #9 and #10.


9. Extension 1 - Imputing Missing Data

When analyzing our dataset, we had to cut out every observation that did not have a recorded value for any of “Total Enrollment”, “Pass_Rate_EconDis” or “Total Exp per Student”. While we were still left with many data points to analyze for this project, this is still not ideal. In the best case scenario, we are given a completed dataset that we do not have to delete entries from - but this is unrealistic. Thus, to get a (reasonably accurate) completed dataset, we have to impute for the missing values. We can do this in multiple ways - we can take the mean/median/mode of the dataset and plug it in for the missing values (not very accurate), or we can use a Machine Learning Model to predict these missing values based on where the other stats of the school lie. We can use the “caret” (Classification And REgression Training) package in R to make a simple model to impute these missing values, using Bagged Decision Trees.

There are some steps to do before jumping straight into the model, however. We must first choose which features we want our model to take in when attempting to predict our 3 variables. We obviously can cut out all the basic information for each High School, such as “Campus,” “CampName,” “District,” “DistName,” “County,” “CntyName,” “Region,” and “RegnName.” We also have to cut out all the different subgroups of spending, such as Instructional, Leadership, Guidance, and Extracurricular spending per student, as all of those combined are equivalent to “Total Exp per Student,” which means that if Total Expenditure Per Student is missing, these values must be missing as well. It is also important to note that we DO include the variables we are testing for as features in our model.

This leaves us with the features “Campus_Type”, “Part_Rate_All”, “Part_Rate_Female”, “Pass_Rate_Female”, “Part_Rate_Male”, “Pass_Rate_Male”, “Part_Rate_EconDis”, “SAT_All”, “SAT_EconDis”, “SAT_Female”, “SAT_Male”, “EconDis Enr”, “Pass_Rate_EconDis”, “TOTAL ENROLLMENT”, and “Total Exp per Student” that will be used to predict “Pass_Rate_EconDis”, “TOTAL ENROLLMENT”, and “Total Exp per Student”.

Next, we need to add new columns that track whether or not the value of our variables was missing in the original dataset. As such, we tack on 3 new binary columns to our dataset, which can read “Y” (Yes, was missing), or “N” (No, was not missing): “MissingTotalEnrollment”, “MissingPassRateEconDis”, and “MissingTotalExpPerStudent”. This is so that looking back on the dataset we don’t lose any information and can easily tell what was imputed and what was originally given. An example of what our dataset’s new columns look like is shown below. Note that when there is an “NA” in our dataframe for “Pass_Rate_EconDis”, our “MissingPassRateEconDis” variable takes on a value of “Y” for yes. This happens throughout the dataset for “Total Exp Per Student” and “Total Enrollment” as well.

Pass_Rate_EconDis TOTAL ENROLLMENT Total Exp per Student MissingPassRateEconDis MissingTotalExpPerStudent MissingTotalEnrollment
NA 410 10100 Y N N
35.0 858 6855 N N N
25.0 231 13552 N N N
50.0 319 10579 N N N
NA 417 12710 Y N N
100.0 461 11646 N N N
25.4 1031 7986 N N N
NA 481 10657 Y N N
52.4 716 9528 N N N
35.7 806 7735 N N N
NA 217 9913 Y N N
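The indicator columns can be built with a small helper like the one sketched here. The data frame reuses the first five rows of the example above; the helper name flag_missing is my own, and this is an illustration rather than the exact code used for the project.

```r
# First five rows of the example above (column names contain spaces,
# so check.names = FALSE keeps them as written)
hs_example <- data.frame(
  Pass_Rate_EconDis = c(NA, 35.0, 25.0, 50.0, NA),
  `TOTAL ENROLLMENT` = c(410, 858, 231, 319, 417),
  `Total Exp per Student` = c(10100, 6855, 13552, 10579, 12710),
  check.names = FALSE
)

# "Y" if the value was missing in the original data, "N" otherwise
flag_missing <- function(x) ifelse(is.na(x), "Y", "N")

hs_example$MissingPassRateEconDis    <- flag_missing(hs_example$Pass_Rate_EconDis)
hs_example$MissingTotalExpPerStudent <- flag_missing(hs_example$`Total Exp per Student`)
hs_example$MissingTotalEnrollment    <- flag_missing(hs_example$`TOTAL ENROLLMENT`)
```

The flags have to be created before imputation, since afterwards the NA values are gone and there is nothing left to detect.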

Once we have done all of this, we can run the model and replace our original dataset’s missing values with our imputed values, and compare their basic descriptive statistics.

Original Data Set’s Basic Statistics

Min. Q1 Median Mean Q3 Max.
Total Enrollment 116.0 623.250 1525.00 1551.323 2271.5 5098
AP Pass Rate (Econ. Dis.) 0.5 17.275 33.15 35.891 50.0 100
Total Exp Per Student 4261.0 7250.750 8094.00 8636.021 9192.8 59298

New Data Set’s Basic Statistics

Min. Q1 Median Mean Q3 Max.
Total Enrollment 10.0 474.0 1121.2 1339.714 2056.50 5098
AP Pass Rate (Econ. Dis.) 0.5 16.7 33.2 36.615 50.45 100
Total Exp Per Student 0.0 7416.0 8437.0 9320.330 9953.00 118688

Discussion of Differences in Basic Statistics

When we compare the basic descriptive statistics of both the original dataset (with N/A’s excluded) and the new dataset (with N/A’s imputed), we see that broadly speaking, the missing values for “Total Enrollment” were generally imputed as lower than the original dataset, while the missing values for “Pass_Rate_EconDis” were generally imputed as higher than the original dataset. However, when we look at “Total Exp Per Student”, we see a far larger range that does not seem reasonable. A value of $0 for Total Exp Per Student makes little sense (how would a school operate if it spent $0?), as does a value of $118,688, as that is roughly double the original maximum of $59,298. We also can see a somewhat implausible phenomenon in our new dataset’s minimum for “Total Enrollment” - a school size of 10 kids is possible but not very likely, and thus most likely a shortcoming of our imputation model.
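One way to check where such extreme values come from is to summarize the variable separately for values that were imputed and values that were originally given, using the missing flags. The numbers below are hypothetical and only illustrate the check; they are not taken from the real dataset.

```r
# Hypothetical spending values and their "was missing" flags
exp_per_student <- c(0, 8437, 9953, 118688, 7416)
was_missing     <- c("N", "Y", "N", "N", "Y")

# Range of the variable within each group ("N" = originally given, "Y" = imputed)
range_by_origin <- lapply(split(exp_per_student, was_missing), range)
range_by_origin
```

If the extremes show up under “N”, they were already present in rows that the original analysis dropped for other reasons, rather than being produced by the imputation model.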


10. Extension 2 - Creating & Training a Machine Learning Regression Model

Though we attempted to examine a relationship between Total Expenditures per Student and AP Pass Rates for Economically Disadvantaged students, we can take this one step further. We can now ask the question: how is “Pass_Rate_EconDis” related to every other variable we are given data for? We do this by training a model. First, we split the data into two sets: our Training set and our Testing set. Basically, we ask our Model to analyze a proportion of our data (the Training set) and determine the relationship between all of our factors (ex: SAT Scores, Instructional Expenditure Per Student, etc.) and the AP Pass Rate for Economically Disadvantaged Students. Then, we use this Model on the remaining portion of our data (the Testing set) to evaluate how effective our created Model is when examining “new” data. The split between Train & Test used in this paper will be 70/30.

Below is the code used to create our Machine Learning model. I have set this code to not run, and to just display, with interspersed comments to explain exactly what is happening within each line.

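# Packages assumed by this code: caret (createDataPartition, trainControl, train) and
# doSNOW (registerDoSNOW; it also loads snow for makeCluster/stopCluster). The "xgbTree"
# method additionally needs the xgboost package to be installed.
library(caret)
library(doSNOW)

# Creating our 70/30 Train/Test split of our data with createDataPartition()
indexes <- createDataPartition(hs$Pass_Rate_EconDis,
                               times = 1,
                               p = .7,
                               list = FALSE
                               )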

# Train gets the chosen training rows (70%), Test gets everything else (30%)
hs.train <- hs[indexes, ]
hs.test <- hs[-indexes, ]

# Training Our Model ####
# We need to set precise parameters for our trainControl(), and tuning grid which we will later
# pass to our final training function

# We tell our model to run with the "repeatedcv" method. We divide our training data set into
# 10 parts, hold out each part in turn, and train on the other 9. We repeat this 3 times and
# average the error across all attempts. We then conduct a grid search, which means we
# exhaustively try every combination in our tune grid to find the best one.
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3,
                              search = "grid"
                              )
# Now we construct our tune grid. This consists of multiple hyperparameters which we have to
# tune to get our most efficient model. Hence, we have it try out multiple different 
# combinations (it will try eta = .025, nrounds = 75, etc. and then do the exact same thing
# except eta = .05).
tune.grid <- expand.grid(eta = c(0.025, .05),
                         nrounds = c(75, 100, 125),
                         max_depth = c(6, 7, 8),
                         min_child_weight = c(1.5, 2, 2.5),
                         colsample_bytree = c(0.6, 0.7, 0.8),
                         gamma = 0,
                         subsample = 1
                         )

# In broad terms, this starts 3 worker R processes that run through our train() function
# in parallel to speed it up, because this will take a long time to run.
c1 <- makeCluster(3, type = "SOCK")
registerDoSNOW(c1)

# Training our model. We want to predict "Pass_Rate_EconDis" as a function of every other given
# variable, using only the training rows (hs.train) so that hs.test stays unseen. We pass it our
# tune grid and train.control, and we tell it to keep the hyperparameter combination with the
# highest cross-validated R squared value.
caret.cv <- train(Pass_Rate_EconDis ~ .,
                  data = hs.train,
                  method = "xgbTree",
                  metric = "Rsquared",
                  tuneGrid = tune.grid,
                  trControl = train.control,
                  na.action = na.pass,
                  verbose = FALSE,
                  verbosity = 0
                  )

# Shutting down the background R worker processes now that training is done
stopCluster(c1)
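
To give a sense of why the search above takes a long time, the short sketch below counts the work implied by the tune grid and the resampling scheme. It is only a back-of-the-envelope check in base R: it rebuilds the same grid with expand.grid() and multiplies by the 10 folds and 3 repeats from trainControl(). The names n.combos, n.resamples, and n.evals are introduced here purely for illustration. The count should be read as an upper bound on model evaluations, since caret may take internal shortcuts (for example, reusing one boosted model for several nrounds values), and it ignores the final refit on the winning combination.

```r
# Rebuild the same tune grid used in the listing above
tune.grid <- expand.grid(eta = c(0.025, 0.05),
                         nrounds = c(75, 100, 125),
                         max_depth = c(6, 7, 8),
                         min_child_weight = c(1.5, 2, 2.5),
                         colsample_bytree = c(0.6, 0.7, 0.8),
                         gamma = 0,
                         subsample = 1
                         )

# Number of hyperparameter combinations: 2 * 3 * 3 * 3 * 3 * 1 * 1
n.combos <- nrow(tune.grid)

# Number of resamples per combination: 10 folds repeated 3 times
n.resamples <- 10 * 3

# Upper bound on the number of (combination, resample) evaluations
n.evals <- n.combos * n.resamples

cat("Combinations:", n.combos, "\n")
cat("Resamples per combination:", n.resamples, "\n")
cat("Evaluations (upper bound):", n.evals, "\n")
```

With 162 combinations and 30 resamples each, the search involves up to 4,860 evaluations, which is why spreading the work across several worker processes is worthwhile.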

Machine Learning Model Results

        eta     max_depth   colsample_bytree   min_child_weight   nrounds   RMSE    R^2      MAE
Value   0.025   7           0.7                2.5                125       9.412   0.8469   6.248

We evaluate our machine learning model’s predictions of AP Pass Rates for Economically Disadvantaged Students with three metrics: RMSE (Root Mean Square Error), R^2, and MAE (Mean Absolute Error). Because we told our model to prioritize R^2, the R^2 value of .8469 matters the most for our analysis. An R^2 of .8469 tells us that the model accounts for about 84.69% of the variation in the AP Pass Rates for Economically Disadvantaged Students, while the remaining 15.31% is left unexplained by the variables in our model. The RMSE of 9.412 and the MAE of 6.248 are expressed in the same units as the pass rate itself, so they describe how far a typical prediction falls from the observed pass rate.
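
As a small illustration of what these three metrics measure, the sketch below computes them by hand in base R. The obs and pred vectors are made-up numbers and not values from the Texas data, and the variable names are introduced here only for the example. Two versions of R^2 are shown because, to my understanding, caret's default Rsquared is the squared correlation between observed and predicted values, which is close to but not always identical to the traditional "share of variation explained" formula.

```r
# Toy observed and predicted pass rates (made-up numbers, for illustration only)
obs  <- c(10, 20, 30, 40, 50)
pred <- c(12, 18, 33, 37, 52)

# Prediction errors
err <- obs - pred

# RMSE: square the errors, average them, then take the square root.
# Large misses are penalized more heavily than small ones.
rmse <- sqrt(mean(err^2))

# MAE: average size of the errors, ignoring their sign
mae <- mean(abs(err))

# R^2 as the squared correlation between observed and predicted values
r2.corr <- cor(obs, pred)^2

# R^2 as 1 minus (unexplained variation / total variation)
r2.trad <- 1 - sum(err^2) / sum((obs - mean(obs))^2)

cat("RMSE:", rmse, "\n")
cat("MAE:", mae, "\n")
cat("R^2 (squared correlation):", r2.corr, "\n")
cat("R^2 (1 - SSE/SST):", r2.trad, "\n")
```

RMSE is never smaller than MAE, and the gap between them widens when a few predictions miss by a lot. In our results the gap (9.412 versus 6.248) suggests that some schools are predicted noticeably worse than the typical school.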


Works Cited

Alboukadel Kassambara (2020). ggpubr: ‘ggplot2’ Based Publication Ready Plots. R package version 0.4.0. https://CRAN.R-project.org/package=ggpubr

Andri Signorell et mult. al. (2021). DescTools: Tools for descriptive statistics. R package version 0.99.44. https://cran.r-project.org/web/packages/DescTools/index.html

Burnette, Daarel. “Student Outcomes: Does More Money Really Matter?” Education Week, Education Week, 8 Dec. 2020, www.edweek.org/policy-politics/student-outcomes-does-more-money-really-matter/2019/06.

Hadley Wickham and Jennifer Bryan (2022). readxl: Read Excel Files. R package version 1.4.0. https://CRAN.R-project.org/package=readxl

Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

Jackson, C. Kirabo. “Does School Spending Matter? The New Literature on an Old Question.” Confronting Inequality: How Policies and Practices Shape Children’s Opportunities, 2020, pp. 165–186, https://doi.org/10.1037/0000187-008.

Jonathan M. Lees (2020). PEIP: Geophysical Inverse Theory and Optimization. R package version 2.2-3. https://CRAN.R-project.org/package=PEIP

Julien Barnier (2021). rmdformats: HTML Output Formats and Templates for ‘rmarkdown’ Documents. R package version 1.0.3. https://CRAN.R-project.org/package=rmdformats

Kun Ren and Kenton Russell (2021). formattable: Create ‘Formattable’ Data Structures. R package version 0.2.1. https://CRAN.R-project.org/package=formattable

Max Kuhn (2022). caret: Classification and Regression Training. R package version 6.0-91. https://CRAN.R-project.org/package=caret

Microsoft Corporation and Stephen Weston (2022). doSNOW: Foreach Parallel Adaptor for the ‘snow’ Package. R package version 1.0.20. https://CRAN.R-project.org/package=doSNOW

Millard SP (2013). _EnvStats: An R Package for Environmental Statistics_. Springer, New York. ISBN 978-1-4614-8455-4, <URL: https://www.springer.com>.

R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Rinker, T. W. & Kurkiewicz, D. (2017). pacman: Package Management for R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman

RStudio Team (2020). RStudio: Integrated Development for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.

Schulman, Craig T. “Texas High School AP and SAT Data.” Econometrics 461, http://people.tamu.edu/~cschulman/ECMT461/TermPr461.html.

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

Yihui Xie (2022). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.38.