DS 803: Fundamentals of Statistical Analysis
For this project, our group analyzed a combination of race and ethnicity data from the National Center for Education Statistics (NCES) which provides the Common Core of Data (CCD). and educational funding from various governing entities.
The ElSi (Elementary/Secondary Information System) Table Generator which we used to pull various data can be found here.
The datasets are joined using the name of the school district and includes the number of students of each ethnicity (White, Black, Hispanic, etc.) in each respective district, as well as the dollar amounts of funding from different sources (federal, state, local, and miscellaneous). Based purely on initial assumption, we expected to see some correlation between the proportion of certain races and the amount of funding those districts received. However, while cleaning and combining the data, we realized that the type of district may also play a role and it might be interesting to include. For example, metropolitan areas may receive more money than rural ones, and that information may skew the results.
The research question we will attempt to answer the following question: Is there a relationship between the amount of funding given to school districts and the concentration of different ethnicities within a given community? While this data is from 2017-2019, in today’s socioeconomic climate these potential relationships could be very interesting to examine. Understanding any connections between the makeup of a community and the allocation of resources it currently receives may help suggest a more efficient way to adequately fund communities and regions that are being neglected.
## Rows: 15,589
## Columns: 34
## $ Row <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ AgencyName <chr> "21ST CENTURY CHARTER SCH OF GARY…
## $ State <chr> "Indiana", "Pennsylvania", "Arizo…
## $ FallMembership_16_17 <int> 888, 961, 64, 2278, 1069, 629, 12…
## $ RevenueLocalSources_16_17 <dbl> 516000, 12717000, 47000, 363000, …
## $ RevenueGeneral_16_17 <dbl> 9342000, 12848000, 492000, 201980…
## $ RevenueStateSources_16_17 <dbl> 7563000, 40000, 440000, 17778000,…
## $ RevenueFederalSources_16_17 <int> 1263000, 91000, 5000, 2057000, 14…
## $ CurrentExpendituresSecEdu_16_17 <dbl> 7447000, 11940000, 481000, 184750…
## $ CurrentExpenditures_16_17 <dbl> 8637000, 12528000, 495000, 185050…
## $ RevenuePerPupil_16_17 <int> 10520, 13369, 7688, 8867, 11573, …
## $ RevenueLocalSourcesPerPupil_16_17 <int> 581, 13233, 734, 159, 458, 167, 1…
## $ RevenueStateSourcesPerPupil_16_17 <int> 8517, 42, 6875, 7804, 9792, 7593,…
## $ RevenueFederalSourcesPerPupil_16_17 <int> 1422, 95, 78, 903, 1323, 1727, 73…
## $ AgencyID <int> 1800046, 4200091, 400328, 4800095…
## $ AgencyType_18_19 <chr> "7-Independent Charter District",…
## $ StartOfYearStatus_18_19 <chr> "1-Open", "1-Open", "1-Open", "1-…
## $ TotalStudentsAllGrades <int> 883, 1235, 65, 2084, 1409, 701, 1…
## $ AmerIndAlaska0tive <int> 0, 4, 2, 4, 2, 1, 0, 0, 0, 1, 5, …
## $ Asian_AsianPacificIsl <int> 0, 18, 0, 1, 7, 0, 3, 0, 0, 2, 1,…
## $ Hispanic <int> 23, 97, 22, 89, 1276, 7, 9, 51, 1…
## $ Black <int> 837, 128, 1, 1951, 53, 688, 48, 1…
## $ White <int> 0, 918, 37, 7, 61, 2, 53, 19, 393…
## $ Hawaiian_PacificIsl <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, …
## $ TwoOrMoreRaces <int> 23, 70, 3, 30, 10, 3, 17, 6, 2, 2…
## $ AmerIndAlaska0tive. <chr> "0.0%", "0.3%", "3.1%", "0.2%", "…
## $ Asian_AsianPacificIsl. <chr> "0.0%", "1.5%", "0.0%", "0.0%", "…
## $ Hispanic. <chr> "2.6%", "7.9%", "33.8%", "4.3%", …
## $ Black. <chr> "94.8%", "10.4%", "1.5%", "93.6%"…
## $ White. <chr> "0.0%", "74.3%", "56.9%", "0.3%",…
## $ Hawaiian_PacificIsl. <chr> "0.0%", "0.0%", "0.0%", "0.1%", "…
## $ TwoOrMoreRaces. <chr> "2.6%", "5.7%", "4.6%", "1.4%", "…
## $ CountyName_18_19 <chr> "Marion County", "Chester County"…
## $ UrbanCentricLocale_18_19 <chr> "13-City: Small", "21-Suburb: Lar…
We decided to take a stratified sample of our dataset using the Urban/Central Locale Variable. With this data especially, it is important to try to take a sample that is representative of the entire population. If we had taken a simple random sample, there is a chance it would not include a sufficient number of districts in any given type of community (rural, urban, suburban, etc.). Therefore, our sample, which includes fifty percent of our full dataset, incorporates adequate representation from each type of community.
set.seed(1)
stratified_sample <- data %>%
group_by(UrbanCentricLocale_18_19) %>%
sample_frac(0.5) %>%
ungroup
table(stratified_sample[, "UrbanCentricLocale_18_19"])##
## 11-City: Large 12-City: Mid-size 13-City: Small 21-Suburb: Large
## 599 188 310 1501
## 22-Suburb: Mid-size 23-Suburb: Small 31-Town: Fringe 32-Town: Distant
## 190 142 272 592
## 33-Town: Remote 41-Rural: Fringe 42-Rural: Distant 43-Rural: Remote
## 388 912 1539 1160
This scatter plot shows a positive correlation between the total number of students and the total amount of federal revenue. It is important to note that there is a high concentration of data points towards the origin (0).
ggplot(stratified_sample, aes(TotalStudentsAllGrades, RevenueFederalSources_16_17)) + geom_jitter() + geom_smooth(method="lm") + xlab("Total Students (All Grades)") + ylab("Revenue (Federal Sources 16/17)") + ggtitle("Total Students ~ Federal Funding") + scale_y_continuous(labels = comma) + scale_x_continuous(labels = comma)After realizing the concentration of data points towards the origin (0), we decided to investigate this further with a boxplot analysis. By changing the Y limit of the plot, we can see that although there is a vast distribution of both of these variables, the 1st quartile, mean, and 3rd quartile are all still very close to the origin (0). We can confirm this with the summary() function, which shows the descriptive statistics for the respective variable. Therefore, we can infer that these graphs are heavily skewed to the right by looking at the median and max values.
stu1 <- ggplot(stratified_sample, aes(y=TotalStudentsAllGrades, x=0)) + geom_boxplot() + scale_y_continuous(labels = comma) + ylab("Total Number of Students") + xlab("No Y Limit")
stu2 <- ggplot(stratified_sample, aes(y=TotalStudentsAllGrades, x=0)) + geom_boxplot() + scale_y_continuous(labels = comma)+ coord_cartesian(ylim = c(0,50000))+ ylab("Total Number of Students") + xlab("Y Limit = 50,000")
stu3 <- ggplot(stratified_sample, aes(y=TotalStudentsAllGrades, x=0)) + geom_boxplot() + scale_y_continuous(labels = comma)+ coord_cartesian(ylim = c(0,10000))+ ylab("Total Number of Students") + xlab("Y Limit = 10,000")
Federal1 <- ggplot(stratified_sample, aes(y=RevenueFederalSources_16_17, x=0)) + geom_boxplot() + scale_y_continuous(labels = comma)+ ylab("Amount of Federal Funding") + xlab("No Y Limit")
Federal2 <- ggplot(stratified_sample, aes(y=RevenueFederalSources_16_17, x=0)) + geom_boxplot() + scale_y_continuous(labels = comma)+ coord_cartesian(ylim = c(0,100000000))+ ylab("Amount of Federal Funding") + xlab("Y Limit = 100,000,000")
Federal3 <- ggplot(stratified_sample, aes(y=RevenueFederalSources_16_17, x=0)) + geom_boxplot() + scale_y_continuous(labels = comma)+ coord_cartesian(ylim = c(0,10000000))+ ylab("Amount of Federal Funding") + xlab("Y Limit = 10,000,000")
library(ggpubr)
ggarrange(Federal1, Federal2, Federal3 + rremove("x.text"),
ncol = 3, nrow = 1)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 274000 743000 3255798 2130000 451479000
We can use the summary() function to see that the mean amount of federal funding for a given school is ~ $3.2 million while the maximum is ~ $450 million. This explains the large skew in the data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 342 889 3041 2439 350434
We also used the summary() function to see that the mean number of students in a given school is 3,041 while the max is over 350,000.
We then plotted the distribution of Total Students, Federal Revenue, as well as each major Ethnicity group using density plots. It is important to note here that the distribution of Ethnicity groups share a common distribution with the exception of White students. We adjusted the scales for each of these plots to account for the more narrow distribution from the origin. If these plots shared the same scale, the data would be much more difficult to interpret and the difference in the different distributions would be much more apparent.
d5 <- ggplot(stratified_sample, aes(TotalStudentsAllGrades)) + geom_density() + coord_cartesian(xlim = c(0,30000)) + xlab("Distribution of Total Students") + scale_x_continuous(labels = comma)+ scale_y_continuous(labels = comma)
d6 <- ggplot(stratified_sample, aes(RevenueFederalSources_16_17)) + geom_density() + coord_cartesian(xlim = c(0,10000000)) + xlab("Distribution of Federal Revenue") + scale_x_continuous(labels = comma)+ scale_y_continuous(labels = comma)
d1 <- ggplot(stratified_sample, aes(White)) + geom_density() + coord_cartesian(xlim = c(0,25000)) + xlab("Distribution of White Students") + scale_x_continuous(labels = comma)+ scale_y_continuous(labels = comma)
d2 <- ggplot(stratified_sample, aes(Black)) + geom_density() + coord_cartesian(xlim = c(0,10000)) + xlab("Distribution of Black Students") + scale_x_continuous(labels = comma)+ scale_y_continuous(labels = comma)
d3 <- ggplot(stratified_sample, aes(Hispanic)) + geom_density() + coord_cartesian(xlim = c(0,15000)) + xlab("Distribution of Hispanic Students") + scale_x_continuous(labels = comma)+ scale_y_continuous(labels = comma)
d4 <- ggplot(stratified_sample, aes(Asian_AsianPacificIsl)) + geom_density() + coord_cartesian(xlim = c(0,3000)) + xlab("Distribution of Asian/IslandPacific Students") + scale_x_continuous(labels = comma)+ scale_y_continuous(labels = comma)
ggarrange(d5, d6 + rremove("x.text"),
ncol = 2, nrow = 1, align="h")To get a different perspective on the data, we plotted the same data using histograms rather than a line. This shows the same distributions as the density distribution from before, but uses bins to get a better idea of the count of each occurrence. Again, we adjusted the scales for each of these plots in an effort to prioritize interpretability over scale consistency.
h1<- ggplot(stratified_sample, aes(RevenueFederalSources_16_17)) + geom_histogram(bins=1000) + coord_cartesian(xlim=c(0,10000000)) + scale_x_continuous(labels = comma) + xlab("Amount of Federal Revenue")
h2 <- ggplot(stratified_sample, aes(TotalStudentsAllGrades)) + geom_histogram(bins=1000) + coord_cartesian(xlim=c(0,10000)) + scale_x_continuous(labels = comma) + xlab("Number of Students")
h3 <- ggplot(stratified_sample, aes(Black)) + geom_histogram(bins=1000) + coord_cartesian(xlim=c(0,1000)) + scale_x_continuous(labels = comma) + xlab("Number of Black Students")
h4 <- ggplot(stratified_sample, aes(White)) + geom_histogram(bins=1000) + coord_cartesian(xlim=c(0,5000)) + scale_x_continuous(labels = comma) + xlab("Number of White Students")
h5 <- ggplot(stratified_sample, aes(Hispanic)) + geom_histogram(bins=1000) + coord_cartesian(xlim=c(0,4000)) + scale_x_continuous(labels = comma) + xlab("Number of Hispanic Students")
h6 <- ggplot(stratified_sample, aes(Asian_AsianPacificIsl)) + geom_histogram(bins=1000) + coord_cartesian(xlim=c(0,1000)) + scale_x_continuous(labels = comma) + xlab("Number of Asian/Pacific Island Students")
ggarrange(h1, h2,
ncol = 2, nrow = 1, align="h")To finish our exploratory analysis, we decided to run a quick correlation analysis between Federal Revenue and both Total Students and each Ethnicity group. The strongest correlation was between Federal Revenue and Total Students, and the weakest correlations were between Federal Revenue and White Students, and Federal Revenue and Asian/Pacific Islands Students. This will prove true as we create a linear model for each Ethnicity group.
## [1] 0.8864186
## [1] 0.8088773
## [1] 0.5572023
## [1] 0.8045561
## [1] 0.4966789
Our first linear model is created using the data from the stratified sample we took earlier, namely the variables “RevenueFederalSources_16_17” and “TotalStudentsAllGrades”. Using the summary() function on our linear model, we can see that our p-value for both the intercept and the Total Students are very significant.
linearMod <- lm(RevenueFederalSources_16_17 ~ TotalStudentsAllGrades, data=stratified_sample) # build linear regression model on full data
print(linearMod)##
## Call:
## lm(formula = RevenueFederalSources_16_17 ~ TotalStudentsAllGrades,
## data = stratified_sample)
##
## Coefficients:
## (Intercept) TotalStudentsAllGrades
## -292848 1167
##
## Call:
## lm(formula = RevenueFederalSources_16_17 ~ TotalStudentsAllGrades,
## data = stratified_sample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75090769 -429091 140349 358779 276933956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.928e+05 7.268e+04 -4.029 5.65e-05 ***
## TotalStudentsAllGrades 1.167e+03 6.904e+00 169.030 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6142000 on 7791 degrees of freedom
## Multiple R-squared: 0.7857, Adjusted R-squared: 0.7857
## F-statistic: 2.857e+04 on 1 and 7791 DF, p-value: < 2.2e-16
To create training and test datasets, we first set a seed to achieve reproducible results. We created a training dataset by taking a simple random sample of the stratified sample from earlier, with 80% of that data comprising the training data and the remaining 20% comprising the test data. Afterwards, we created a new linear model using just the training data and applied this linear model to the predict() function. This predict() function provides predictions for a specific object (in this case, our linear model).
set.seed(100) # setting seed to reproduce results of random sampling
trainingRowIndex <- sample(1:nrow(stratified_sample), 0.8*nrow(stratified_sample)) # row indices for training data
trainingData <- stratified_sample[trainingRowIndex, ] # model training data
testData <- stratified_sample[-trainingRowIndex, ] # test data
lmMod <- lm(RevenueFederalSources_16_17 ~ TotalStudentsAllGrades, data=trainingData) # build the model
FedRevenuePred <- predict(lmMod, testData) # predict revenueggplot(testData, aes(TotalStudentsAllGrades, FedRevenuePred)) +geom_smooth(method="lm") + xlab("Total Students (All Grades)") + ylab("Revenue (Federal Sources 16/17)") + ggtitle("Total Students ~ Federal Funding") + scale_y_continuous(labels = comma) + scale_x_continuous(labels = comma)Lastly, we plotted these predictions using a simple line chart. The X-axis of this chart is the Total Number of Students as taken from the test dataset, and the Y-axis of this chart is the Predicted Federal Revenue that was obtained using our linear model. This is the plot that would be used to predict the amount of Federal Revenue obtained by a school based on the Total Number of Students. We wanted to obtain these predictive models for each Ethicity rather than just the Total Number of Students. To do this, we applied the same methodology as before with the exception of changing the Y-variable to match a specific Ethnicity rather than Total Number of Students.
library(ggpubr)
ggarrange(WhitePlot, BlackPlot, HispanicPlot, AsianPlot,
ncol = 2, nrow = 2, align="hv")x = stratified_sample$TotalStudentsAllGrades
n = length(stratified_sample$TotalStudentsAllGrades)
mean_MoM = sum(x)/n
var_MoM = sum(x^2)/n - (sum(x)/n)^2
list(Mean = mean_MoM, Variance = var_MoM)## $Mean
## [1] 3040.843
##
## $Variance
## [1] 101569766
Norm.CI=function(obs, sigma, alpha){
n=length(obs)
ME=qnorm(1-alpha/2)*sigma/sqrt(n)
c(mean(obs)-ME, mean(obs)+ME)
}
x = stratified_sample$TotalStudentsAllGrades
var_MoM = sum(x^2)/n - (sum(x)/n)^2
Norm.CI(x, sqrt(var_MoM), 0.05)## [1] 2817.085 3264.601
Comparing these population parameters to the true values, we can see that Method of Moments estimation does well in estimating the parameters.
Normt.CI=function(x, alpha){
n=length(x)
df=n-1
ME=qt(1-alpha/2, df)*sd(x)/sqrt(n)
c(mean(x)-ME, mean(x)+ME)
}
Normt.CI(x = stratified_sample$TotalStudentsAllGrades, 0.05)## [1] 2817.036 3264.650
Sigma.CI=function(x, alpha){
n=length(x)
df=n-1
var_MoM = sum(x^2)/n - (sum(x)/n)^2
c(df*var_MoM/qchisq(1-alpha/2, df), df*var_MoM/qchisq(alpha/2, df))
}
Sigma.CI(x = stratified_sample$TotalStudentsAllGrades, 0.05)## [1] 98454369 104836128
After conducting various exploratory analyses and creating linear models with respect to each ethnicity, we found that the number of Black students, followed closely by the number of Hispanic students, are the most accurate student population predictors of the amount of Federal Revenue for any given school.