Statistical Analysis of Common Core Data

Conor O’Regan, Jon Olson, Carter Mercer

12/3/2020

DS 803: Fundamentals of Statistical Analysis

For this project, our group analyzed a combination of race and ethnicity data from the National Center for Education Statistics (NCES) which provides the Common Core of Data (CCD). and educational funding from various governing entities.

The ElSi (Elementary/Secondary Information System) Table Generator which we used to pull various data can be found here.

The datasets are joined using the name of the school district and includes the number of students of each ethnicity (White, Black, Hispanic, etc.) in each respective district, as well as the dollar amounts of funding from different sources (federal, state, local, and miscellaneous). Based purely on initial assumption, we expected to see some correlation between the proportion of certain races and the amount of funding those districts received. However, while cleaning and combining the data, we realized that the type of district may also play a role and it might be interesting to include. For example, metropolitan areas may receive more money than rural ones, and that information may skew the results.

The research question we will attempt to answer the following question: Is there a relationship between the amount of funding given to school districts and the concentration of different ethnicities within a given community? While this data is from 2017-2019, in today’s socioeconomic climate these potential relationships could be very interesting to examine. Understanding any connections between the makeup of a community and the allocation of resources it currently receives may help suggest a more efficient way to adequately fund communities and regions that are being neglected.

glimpse(data)

## Rows: 15,589
## Columns: 34
## $ Row                                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ AgencyName                          <chr> "21ST CENTURY CHARTER SCH OF GARY…
## $ State                               <chr> "Indiana", "Pennsylvania", "Arizo…
## $ FallMembership_16_17                <int> 888, 961, 64, 2278, 1069, 629, 12…
## $ RevenueLocalSources_16_17           <dbl> 516000, 12717000, 47000, 363000, …
## $ RevenueGeneral_16_17                <dbl> 9342000, 12848000, 492000, 201980…
## $ RevenueStateSources_16_17           <dbl> 7563000, 40000, 440000, 17778000,…
## $ RevenueFederalSources_16_17         <int> 1263000, 91000, 5000, 2057000, 14…
## $ CurrentExpendituresSecEdu_16_17     <dbl> 7447000, 11940000, 481000, 184750…
## $ CurrentExpenditures_16_17           <dbl> 8637000, 12528000, 495000, 185050…
## $ RevenuePerPupil_16_17               <int> 10520, 13369, 7688, 8867, 11573, …
## $ RevenueLocalSourcesPerPupil_16_17   <int> 581, 13233, 734, 159, 458, 167, 1…
## $ RevenueStateSourcesPerPupil_16_17   <int> 8517, 42, 6875, 7804, 9792, 7593,…
## $ RevenueFederalSourcesPerPupil_16_17 <int> 1422, 95, 78, 903, 1323, 1727, 73…
## $ AgencyID                            <int> 1800046, 4200091, 400328, 4800095…
## $ AgencyType_18_19                    <chr> "7-Independent Charter District",…
## $ StartOfYearStatus_18_19             <chr> "1-Open", "1-Open", "1-Open", "1-…
## $ TotalStudentsAllGrades              <int> 883, 1235, 65, 2084, 1409, 701, 1…
## $ AmerIndAlaska0tive                  <int> 0, 4, 2, 4, 2, 1, 0, 0, 0, 1, 5, …
## $ Asian_AsianPacificIsl               <int> 0, 18, 0, 1, 7, 0, 3, 0, 0, 2, 1,…
## $ Hispanic                            <int> 23, 97, 22, 89, 1276, 7, 9, 51, 1…
## $ Black                               <int> 837, 128, 1, 1951, 53, 688, 48, 1…
## $ White                               <int> 0, 918, 37, 7, 61, 2, 53, 19, 393…
## $ Hawaiian_PacificIsl                 <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, …
## $ TwoOrMoreRaces                      <int> 23, 70, 3, 30, 10, 3, 17, 6, 2, 2…
## $ AmerIndAlaska0tive.                 <chr> "0.0%", "0.3%", "3.1%", "0.2%", "…
## $ Asian_AsianPacificIsl.              <chr> "0.0%", "1.5%", "0.0%", "0.0%", "…
## $ Hispanic.                           <chr> "2.6%", "7.9%", "33.8%", "4.3%", …
## $ Black.                              <chr> "94.8%", "10.4%", "1.5%", "93.6%"…
## $ White.                              <chr> "0.0%", "74.3%", "56.9%", "0.3%",…
## $ Hawaiian_PacificIsl.                <chr> "0.0%", "0.0%", "0.0%", "0.1%", "…
## $ TwoOrMoreRaces.                     <chr> "2.6%", "5.7%", "4.6%", "1.4%", "…
## $ CountyName_18_19                    <chr> "Marion County", "Chester County"…
## $ UrbanCentricLocale_18_19            <chr> "13-City: Small", "21-Suburb: Lar…

Sampling the Data

We decided to take a stratified sample of our dataset using the Urban/Central Locale Variable. With this data especially, it is important to try to take a sample that is representative of the entire population. If we had taken a simple random sample, there is a chance it would not include a sufficient number of districts in any given type of community (rural, urban, suburban, etc.). Therefore, our sample, which includes fifty percent of our full dataset, incorporates adequate representation from each type of community.

set.seed(1)
stratified_sample <- data %>%
  group_by(UrbanCentricLocale_18_19) %>%
  sample_frac(0.5) %>%
  ungroup

table(stratified_sample[, "UrbanCentricLocale_18_19"])

## 
##      11-City: Large   12-City: Mid-size      13-City: Small    21-Suburb: Large 
##                 599                 188                 310                1501 
## 22-Suburb: Mid-size    23-Suburb: Small     31-Town: Fringe    32-Town: Distant 
##                 190                 142                 272                 592 
##     33-Town: Remote    41-Rural: Fringe   42-Rural: Distant    43-Rural: Remote 
##                 388                 912                1539                1160

Exploratory Analysis

This scatter plot shows a positive correlation between the total number of students and the total amount of federal revenue. It is important to note that there is a high concentration of data points towards the origin (0).

ggplot(stratified_sample, aes(TotalStudentsAllGrades, RevenueFederalSources_16_17)) + geom_jitter() + geom_smooth(method="lm") + xlab("Total Students (All Grades)") + ylab("Revenue (Federal Sources 16/17)") + ggtitle("Total Students ~ Federal Funding") + scale_y_continuous(labels = comma) + scale_x_continuous(labels = comma)

Boxplots

After realizing the concentration of data points towards the origin (0), we decided to investigate this further with a boxplot analysis. By changing the Y limit of the plot, we can see that although there is a vast distribution of both of these variables, the 1st quartile, mean, and 3rd quartile are all still very close to the origin (0). We can confirm this with the summary() function, which shows the descriptive statistics for the respective variable. Therefore, we can infer that these graphs are heavily skewed to the right by looking at the median and max values.

stu1 <- ggplot(stratified_sample, aes(y=TotalStudentsAllGrades, x=0)) + geom_boxplot()  + scale_y_continuous(labels = comma) + ylab("Total Number of Students") + xlab("No Y Limit")
stu2 <- ggplot(stratified_sample, aes(y=TotalStudentsAllGrades, x=0)) + geom_boxplot()  + scale_y_continuous(labels = comma)+ coord_cartesian(ylim = c(0,50000))+ ylab("Total Number of Students") + xlab("Y Limit = 50,000")
stu3 <- ggplot(stratified_sample, aes(y=TotalStudentsAllGrades, x=0)) + geom_boxplot()  + scale_y_continuous(labels = comma)+ coord_cartesian(ylim = c(0,10000))+ ylab("Total Number of Students") + xlab("Y Limit = 10,000")

Federal1 <- ggplot(stratified_sample, aes(y=RevenueFederalSources_16_17, x=0)) + geom_boxplot() + scale_y_continuous(labels = comma)+ ylab("Amount of Federal Funding") + xlab("No Y Limit")
Federal2 <- ggplot(stratified_sample, aes(y=RevenueFederalSources_16_17, x=0)) + geom_boxplot() + scale_y_continuous(labels = comma)+ coord_cartesian(ylim = c(0,100000000))+ ylab("Amount of Federal Funding") + xlab("Y Limit = 100,000,000")
Federal3 <- ggplot(stratified_sample, aes(y=RevenueFederalSources_16_17, x=0)) + geom_boxplot() + scale_y_continuous(labels = comma)+ coord_cartesian(ylim = c(0,10000000))+ ylab("Amount of Federal Funding") + xlab("Y Limit = 10,000,000")

library(ggpubr)
ggarrange(Federal1, Federal2, Federal3 + rremove("x.text"), 
          ncol = 3, nrow = 1)

summary(stratified_sample$RevenueFederalSources_16_17)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##         0    274000    743000   3255798   2130000 451479000

We can use the summary() function to see that the mean amount of federal funding for a given school is ~ $3.2 million while the maximum is ~ $450 million. This explains the large skew in the data.

ggarrange(stu1, stu2, stu3 + rremove("x.text"),
          ncol = 3, nrow = 1)

summary(stratified_sample$TotalStudentsAllGrades)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     342     889    3041    2439  350434

We also used the summary() function to see that the mean number of students in a given school is 3,041 while the max is over 350,000.

Density Plots

We then plotted the distribution of Total Students, Federal Revenue, as well as each major Ethnicity group using density plots. It is important to note here that the distribution of Ethnicity groups share a common distribution with the exception of White students. We adjusted the scales for each of these plots to account for the more narrow distribution from the origin. If these plots shared the same scale, the data would be much more difficult to interpret and the difference in the different distributions would be much more apparent.

d5 <- ggplot(stratified_sample, aes(TotalStudentsAllGrades)) + geom_density() + coord_cartesian(xlim = c(0,30000)) + xlab("Distribution of Total Students") + scale_x_continuous(labels = comma)+ scale_y_continuous(labels = comma)
d6 <- ggplot(stratified_sample, aes(RevenueFederalSources_16_17)) + geom_density() + coord_cartesian(xlim = c(0,10000000)) + xlab("Distribution of Federal Revenue") + scale_x_continuous(labels = comma)+ scale_y_continuous(labels = comma)

d1 <- ggplot(stratified_sample, aes(White)) + geom_density() + coord_cartesian(xlim = c(0,25000)) + xlab("Distribution of White Students") + scale_x_continuous(labels = comma)+ scale_y_continuous(labels = comma)
d2 <- ggplot(stratified_sample, aes(Black)) + geom_density() + coord_cartesian(xlim = c(0,10000)) + xlab("Distribution of Black Students") + scale_x_continuous(labels = comma)+ scale_y_continuous(labels = comma)
d3 <- ggplot(stratified_sample, aes(Hispanic)) + geom_density() + coord_cartesian(xlim = c(0,15000)) + xlab("Distribution of Hispanic Students") + scale_x_continuous(labels = comma)+ scale_y_continuous(labels = comma)
d4 <- ggplot(stratified_sample, aes(Asian_AsianPacificIsl)) + geom_density() + coord_cartesian(xlim = c(0,3000)) + xlab("Distribution of Asian/IslandPacific Students") + scale_x_continuous(labels = comma)+ scale_y_continuous(labels = comma)

ggarrange(d5, d6 + rremove("x.text"), 
          ncol = 2, nrow = 1, align="h")

ggarrange(d1, d2, d3, d4 + rremove("x.text"), 
          ncol = 2, nrow = 2, align="h")

Histograms

To get a different perspective on the data, we plotted the same data using histograms rather than a line. This shows the same distributions as the density distribution from before, but uses bins to get a better idea of the count of each occurrence. Again, we adjusted the scales for each of these plots in an effort to prioritize interpretability over scale consistency.

h1<- ggplot(stratified_sample, aes(RevenueFederalSources_16_17)) + geom_histogram(bins=1000) + coord_cartesian(xlim=c(0,10000000)) + scale_x_continuous(labels = comma) + xlab("Amount of Federal Revenue")
h2 <- ggplot(stratified_sample, aes(TotalStudentsAllGrades)) + geom_histogram(bins=1000) + coord_cartesian(xlim=c(0,10000)) + scale_x_continuous(labels = comma) + xlab("Number of Students")
h3 <- ggplot(stratified_sample, aes(Black)) + geom_histogram(bins=1000) + coord_cartesian(xlim=c(0,1000)) + scale_x_continuous(labels = comma) + xlab("Number of Black Students")
h4 <- ggplot(stratified_sample, aes(White)) + geom_histogram(bins=1000) + coord_cartesian(xlim=c(0,5000)) + scale_x_continuous(labels = comma) + xlab("Number of White Students")
h5 <- ggplot(stratified_sample, aes(Hispanic)) + geom_histogram(bins=1000) + coord_cartesian(xlim=c(0,4000)) + scale_x_continuous(labels = comma) + xlab("Number of Hispanic Students")
h6 <- ggplot(stratified_sample, aes(Asian_AsianPacificIsl)) + geom_histogram(bins=1000) + coord_cartesian(xlim=c(0,1000)) + scale_x_continuous(labels = comma) + xlab("Number of Asian/Pacific Island Students")

ggarrange(h1, h2, 
          ncol = 2, nrow = 1, align="h")

ggarrange(h3, h4, h5, h6, 
          ncol = 2, nrow = 2, align="h")

Correlation

To finish our exploratory analysis, we decided to run a quick correlation analysis between Federal Revenue and both Total Students and each Ethnicity group. The strongest correlation was between Federal Revenue and Total Students, and the weakest correlations were between Federal Revenue and White Students, and Federal Revenue and Asian/Pacific Islands Students. This will prove true as we create a linear model for each Ethnicity group.

cor(stratified_sample$RevenueFederalSources_16_17, stratified_sample$TotalStudentsAllGrades)

## [1] 0.8864186

cor(stratified_sample$RevenueFederalSources_16_17, stratified_sample$Black)

## [1] 0.8088773

cor(stratified_sample$RevenueFederalSources_16_17, stratified_sample$White)

## [1] 0.5572023

cor(stratified_sample$RevenueFederalSources_16_17, stratified_sample$Hispanic)

## [1] 0.8045561

cor(stratified_sample$RevenueFederalSources_16_17, stratified_sample$Asian_AsianPacificIsl)

## [1] 0.4966789

Simple Linear Regression Model

Our first linear model is created using the data from the stratified sample we took earlier, namely the variables “RevenueFederalSources_16_17” and “TotalStudentsAllGrades”. Using the summary() function on our linear model, we can see that our p-value for both the intercept and the Total Students are very significant.

linearMod <- lm(RevenueFederalSources_16_17 ~ TotalStudentsAllGrades, data=stratified_sample)  # build linear regression model on full data
print(linearMod)

## 
## Call:
## lm(formula = RevenueFederalSources_16_17 ~ TotalStudentsAllGrades, 
##     data = stratified_sample)
## 
## Coefficients:
##            (Intercept)  TotalStudentsAllGrades  
##                -292848                    1167

summary(linearMod)

## 
## Call:
## lm(formula = RevenueFederalSources_16_17 ~ TotalStudentsAllGrades, 
##     data = stratified_sample)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -75090769   -429091    140349    358779 276933956 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.928e+05  7.268e+04  -4.029 5.65e-05 ***
## TotalStudentsAllGrades  1.167e+03  6.904e+00 169.030  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6142000 on 7791 degrees of freedom
## Multiple R-squared:  0.7857, Adjusted R-squared:  0.7857 
## F-statistic: 2.857e+04 on 1 and 7791 DF,  p-value: < 2.2e-16

Train/Test Split

To create training and test datasets, we first set a seed to achieve reproducible results. We created a training dataset by taking a simple random sample of the stratified sample from earlier, with 80% of that data comprising the training data and the remaining 20% comprising the test data. Afterwards, we created a new linear model using just the training data and applied this linear model to the predict() function. This predict() function provides predictions for a specific object (in this case, our linear model).

set.seed(100)  # setting seed to reproduce results of random sampling
trainingRowIndex <- sample(1:nrow(stratified_sample), 0.8*nrow(stratified_sample))  # row indices for training data
trainingData <- stratified_sample[trainingRowIndex, ]  # model training data
testData  <- stratified_sample[-trainingRowIndex, ]   # test data
lmMod <- lm(RevenueFederalSources_16_17 ~ TotalStudentsAllGrades, data=trainingData)  # build the model
FedRevenuePred <- predict(lmMod, testData)  # predict revenue

Plotting Revenue Predictions (Total Number of Students)

ggplot(testData, aes(TotalStudentsAllGrades, FedRevenuePred)) +geom_smooth(method="lm") + xlab("Total Students (All Grades)") + ylab("Revenue (Federal Sources 16/17)") + ggtitle("Total Students ~ Federal Funding") + scale_y_continuous(labels = comma) + scale_x_continuous(labels = comma)

Lastly, we plotted these predictions using a simple line chart. The X-axis of this chart is the Total Number of Students as taken from the test dataset, and the Y-axis of this chart is the Predicted Federal Revenue that was obtained using our linear model. This is the plot that would be used to predict the amount of Federal Revenue obtained by a school based on the Total Number of Students. We wanted to obtain these predictive models for each Ethicity rather than just the Total Number of Students. To do this, we applied the same methodology as before with the exception of changing the Y-variable to match a specific Ethnicity rather than Total Number of Students.

library(ggpubr)
ggarrange(WhitePlot, BlackPlot, HispanicPlot, AsianPlot, 
          ncol = 2, nrow = 2, align="hv")

Inferential Analysis

Estimating Parameters Using Method of Moments

x = stratified_sample$TotalStudentsAllGrades
n = length(stratified_sample$TotalStudentsAllGrades)
mean_MoM = sum(x)/n
var_MoM = sum(x^2)/n - (sum(x)/n)^2
list(Mean = mean_MoM, Variance = var_MoM)

## $Mean
## [1] 3040.843
## 
## $Variance
## [1] 101569766

Norm.CI=function(obs, sigma, alpha){
  n=length(obs)
  ME=qnorm(1-alpha/2)*sigma/sqrt(n)
  c(mean(obs)-ME, mean(obs)+ME)
}
x = stratified_sample$TotalStudentsAllGrades
var_MoM = sum(x^2)/n - (sum(x)/n)^2
Norm.CI(x, sqrt(var_MoM), 0.05)

## [1] 2817.085 3264.601

Comparing these population parameters to the true values, we can see that Method of Moments estimation does well in estimating the parameters.

Confidence Intervals for Mean and Variance of Total Students

Normt.CI=function(x, alpha){
  n=length(x)
  df=n-1
  ME=qt(1-alpha/2, df)*sd(x)/sqrt(n)
  c(mean(x)-ME, mean(x)+ME)
  }
Normt.CI(x = stratified_sample$TotalStudentsAllGrades, 0.05)

## [1] 2817.036 3264.650

Sigma.CI=function(x, alpha){
  n=length(x)
  df=n-1
  var_MoM = sum(x^2)/n - (sum(x)/n)^2
  c(df*var_MoM/qchisq(1-alpha/2, df), df*var_MoM/qchisq(alpha/2, df))
}
Sigma.CI(x = stratified_sample$TotalStudentsAllGrades, 0.05)

## [1]  98454369 104836128

Conclusion

After conducting various exploratory analyses and creating linear models with respect to each ethnicity, we found that the number of Black students, followed closely by the number of Hispanic students, are the most accurate student population predictors of the amount of Federal Revenue for any given school.