I began by loading the assessment data from an Excel file and cleaning it by removing unnecessary rows and columns. My goal was to understand the correlations among several key variables: revenue, expenses, socioeconomic status, and proficiency. I hypothesized that local revenue and socioeconomic status would both be correlated with proficiency, and that a county's region would affect the relationship between local revenue and proficiency. I looked at region first to see what the data looked like and whether it mattered. As shown below, a few regions have more counties than others, which is worth keeping in mind.
I loaded the spending data and kept only West Virginia public schools, removing cooperatives, which are not schools. I then renamed the columns to make them easier to understand and selected only the columns I needed. I also created a county column, which is needed for later analysis, and cleaned the county names so they are consistent.
I loaded the demographic data and cleaned it by removing unnecessary rows and columns. I renamed the columns to make them easier to understand, selected only the columns I needed, and cleaned the county names so they are consistent throughout. I also removed any counties that are not in West Virginia.
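The county-name cleanup matters because the three data sets are later matched on county. The report does not show the exact cleaning rules, so the following is only a minimal sketch of one possible approach; `clean_county` is a hypothetical helper, and the rules it applies (trimming, collapsing spaces, dropping a trailing "County", upper-casing) are assumptions.

```r
# Hypothetical helper (not from the report): normalize county names so the
# same county is spelled identically in every data set.
clean_county <- function(x) {
  x <- trimws(x)                                      # drop leading/trailing spaces
  x <- gsub("\\s+", " ", x)                           # collapse repeated whitespace
  x <- sub("\\s+county$", "", x, ignore.case = TRUE)  # drop a trailing "County"
  toupper(x)                                          # use one consistent case
}

clean_county(c(" Kanawha County", "kanawha", "Kanawha  county"))
```

All three inputs normalize to the same string, which is what a later join on county needs.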
Below, I joined the assessment, spending, and demographic data sets, then removed unnecessary columns and renamed the rest for clarity. I also took the log of local revenue, both to make its relationship with proficiency more linear and to keep extremely large values from skewing the data, as shown below. The new variable is called log_local_revenue.
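As a rough illustration of the join and log steps, here is a small base R sketch. The three toy data frames, their values, and the `local_revenue` column name are made up for the example; only `log_local_revenue`, `proficiency`, and `student_status_economically_disadvantaged` are names taken from this report.

```r
# Toy stand-ins for the three cleaned data sets (all values are made up).
assessment <- data.frame(county = c("BARBOUR", "BERKELEY", "BOONE"),
                         proficiency = c(30, 42, 28))
spending <- data.frame(county = c("BARBOUR", "BERKELEY", "BOONE"),
                       local_revenue = c(2500, 6100, 4800))
demographics <- data.frame(county = c("BARBOUR", "BERKELEY", "BOONE"),
                           student_status_economically_disadvantaged = c(22, 35, 20))

# Join the three sets on the shared county key (inner join by default).
merged <- Reduce(function(x, y) merge(x, y, by = "county"),
                 list(assessment, spending, demographics))

# Natural log of local revenue; log() in R uses base e unless told otherwise.
merged$log_local_revenue <- log(merged$local_revenue)
merged
```

Because `merge()` keeps only counties present in every input, inconsistent county names would silently drop rows, which is why the name cleanup comes first.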
Although I originally believed location would play a crucial role in
determining proficiency, it was not as significant as I had thought.
The map below shows proficiency by county in West Virginia. The darker
the shade of purple, the less proficient the county is, on average.
Below, we can see that proficiency is correlated with both local revenue and the economically disadvantaged student status variable. I chose these two variables as the predictors in my linear regression model.
Variables:
log_local_revenue: Log of local revenue per pupil
student_status_economically_disadvantaged: Percentage of students who are economically disadvantaged
This is another way to visualize the correlation between the variables.
We can see that as proficiency increases, revenue tends to increase as
well. We can also see an upward trend between overall proficiency and
proficiency among economically disadvantaged students. In other words,
counties where economically disadvantaged students have higher
proficiency also tend to have higher overall proficiency. Although this
may seem obvious, it matters because it is consistent with the idea
that if revenue increases and economically disadvantaged students do
better, overall proficiency rises as well.
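The pairwise relationships described above can also be summarized numerically with a correlation matrix. This is a minimal sketch on simulated data, not the report's data; the column names match the model, but every value and the simulated relationship are assumptions for illustration.

```r
# Simulated stand-in data (NOT the report's data); names mirror the model.
set.seed(123)
n <- 100
toy <- data.frame(
  log_local_revenue = rnorm(n, mean = 8, sd = 0.5),
  student_status_economically_disadvantaged = rnorm(n, mean = 25, sd = 5)
)
toy$proficiency <- -8 + 1.0 * toy$log_local_revenue +
  0.9 * toy$student_status_economically_disadvantaged + rnorm(n, sd = 5)

# Pearson correlations among the three variables, using complete rows only.
vars <- c("proficiency", "log_local_revenue",
          "student_status_economically_disadvantaged")
cor_matrix <- cor(toy[, vars], use = "complete.obs")
round(cor_matrix, 2)
```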
From the regression model shown below, we can see that both log local revenue and proficiency among economically disadvantaged students are significant predictors of overall proficiency in West Virginia. Although the model explains only about 14 percent of the variance (R-squared = 0.1437), the relationship is statistically significant. Holding the other predictor constant, a one-unit increase in log_local_revenue is associated with an increase of about 1.04 points in overall proficiency, and a one-point increase in proficiency among economically disadvantaged students is associated with an increase of about 0.95 points.
##
## Call:
## lm(formula = proficiency ~ log_local_revenue + student_status_economically_disadvantaged,
## data = merged_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.821 -2.749 0.042 2.093 37.640
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -8.5906 2.1845 -3.932
## log_local_revenue 1.0393 0.2308 4.503
## student_status_economically_disadvantaged 0.9548 0.1575 6.063
## Pr(>|t|)
## (Intercept) 0.00009667486 ***
## log_local_revenue 0.00000843702 ***
## student_status_economically_disadvantaged 0.00000000273 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.961 on 473 degrees of freedom
## Multiple R-squared: 0.1437, Adjusted R-squared: 0.1401
## F-statistic: 39.69 on 2 and 473 DF, p-value: < 0.00000000000000022
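Because the revenue predictor is on a log scale, its coefficient in the summary above describes a one-unit change in log revenue, not a one-dollar change. The worked example below translates it into percentage changes in revenue. It assumes log_local_revenue was created with the natural log (the default of R's `log()`), which the report does not state explicitly.

```latex
% Predicted change in proficiency when local revenue rises by p percent,
% holding the other predictor fixed (natural log assumed):
\Delta\,\text{proficiency} = \hat{\beta}_{\text{rev}} \,\ln\!\left(1 + \tfrac{p}{100}\right),
\qquad \hat{\beta}_{\text{rev}} = 1.0393

% A 10 percent increase in local revenue:
1.0393 \times \ln(1.10) \approx 1.0393 \times 0.0953 \approx 0.10 \text{ points}

% A doubling of local revenue:
1.0393 \times \ln(2) \approx 1.0393 \times 0.6931 \approx 0.72 \text{ points}
```

Under that assumption, the full 1.04-point increase corresponds to multiplying local revenue by e, roughly 2.72.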
To further understand the region data, I created a box plot of proficiency by region. The median lines are fairly close together, falling between 5 and 10 proficiency points. The boxes largely overlap, which indicates that the middle 50% of values is similar across regions. Every region has outliers, and no region shows a pattern of consistently higher or lower outliers.
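One way to check the region question more formally, beyond the box plot, is to compare the regression with and without region using a nested-model F-test. This is not necessarily what was done in this report; it is a sketch on simulated data in which region has no built-in effect, and the region labels are hypothetical.

```r
# Simulated stand-in data (NOT the report's data); region labels are made up
# and region is given no effect on proficiency.
set.seed(7)
n <- 120
toy <- data.frame(
  region = factor(sample(c("North", "South", "East", "West"), n, replace = TRUE)),
  log_local_revenue = rnorm(n, mean = 8, sd = 0.5),
  student_status_economically_disadvantaged = rnorm(n, mean = 25, sd = 5)
)
toy$proficiency <- -8 + 1.0 * toy$log_local_revenue +
  0.9 * toy$student_status_economically_disadvantaged + rnorm(n, sd = 5)

# Baseline model, then the same model with region added.
fit_base <- lm(proficiency ~ log_local_revenue +
                 student_status_economically_disadvantaged, data = toy)
fit_region <- update(fit_base, . ~ . + region)

# F-test of whether adding region improves the fit.
region_test <- anova(fit_base, fit_region)
region_test
```

A large p-value in the second row would support leaving region out of the model.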
# Test Data
Below, I created some test data so that I can compare it against my
model's predictions and see how well the model does. The RMSE on the
test data is 4.49, meaning the model's predictions are typically off
by about 4.49 points from the actual values.
## [1] "RMSE: 4.49"
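For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below shows one way to compute it on held-out data. The data are simulated, and the 80/20 random split is an assumption, since the report does not say how its test data was built; `rmse` is a helper defined here, not a base R function.

```r
# Root mean squared error: square root of the mean squared prediction error.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Simulated stand-in data (NOT the report's data); names mirror the model.
set.seed(2024)
n <- 200
sim <- data.frame(
  log_local_revenue = rnorm(n, mean = 8, sd = 0.5),
  student_status_economically_disadvantaged = rnorm(n, mean = 25, sd = 5)
)
sim$proficiency <- -8 + 1.0 * sim$log_local_revenue +
  0.9 * sim$student_status_economically_disadvantaged + rnorm(n, sd = 5)

# Assumed 80/20 random train/test split.
train_rows <- sample(nrow(sim), size = 0.8 * nrow(sim))
train <- sim[train_rows, ]
test <- sim[-train_rows, ]

# Fit on the training rows, then score predictions on the held-out rows.
fit <- lm(proficiency ~ log_local_revenue +
            student_status_economically_disadvantaged, data = train)
test_rmse <- rmse(test$proficiency, predict(fit, newdata = test))
print(paste("RMSE:", round(test_rmse, 2)))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.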
As seen below, my model captures the general trend in proficiency. Although it captures only the general trend, that trend is worth noting.
The graph below of residuals versus predicted proficiency shows that the prediction errors are centered around zero for most values, which indicates no major bias. However, some outliers appear as the predicted proficiency increases.
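A residual check like the one described above can be sketched as follows. The data are simulated, and the cutoff of two standard deviations for flagging large residuals is an arbitrary choice for illustration, not something taken from this report.

```r
# Simulated stand-in data (NOT the report's data); names mirror the model.
set.seed(42)
n <- 200
sim <- data.frame(
  log_local_revenue = rnorm(n, mean = 8, sd = 0.5),
  student_status_economically_disadvantaged = rnorm(n, mean = 25, sd = 5)
)
sim$proficiency <- -8 + 1.0 * sim$log_local_revenue +
  0.9 * sim$student_status_economically_disadvantaged + rnorm(n, sd = 5)

fit <- lm(proficiency ~ log_local_revenue +
            student_status_economically_disadvantaged, data = sim)

# Residual = actual minus predicted, paired with the predicted value.
checks <- data.frame(predicted = fitted(fit), residual = resid(fit))

# On the fitting data, least squares with an intercept forces the mean
# residual to zero; on held-out data it can drift away from zero.
mean(checks$residual)

# Flag unusually large residuals (assumed cutoff: 2 standard deviations).
flagged <- checks[abs(checks$residual) > 2 * sd(checks$residual), ]
nrow(flagged)

# Residuals vs predicted plot, drawn only in an interactive session.
if (interactive()) {
  plot(checks$predicted, checks$residual,
       xlab = "Predicted proficiency", ylab = "Residual")
  abline(h = 0, lty = 2)
}
```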
In the end, this multiple linear regression model shows that local revenue and the proficiency of economically disadvantaged students are significant predictors of overall proficiency in West Virginia counties. Although I looked at region as a variable, I concluded that it was not a significant predictor of proficiency. The correlation plots showed that proficiency is correlated with both local revenue and economically disadvantaged students. As local revenue increases, overall proficiency increases, and as proficiency increases among economically disadvantaged students, overall proficiency increases as well. Given these trends, we should increase local revenue for school districts and counties, especially economically disadvantaged ones. This could be done in several ways, such as raising taxes or changing or replacing certain state policies. By increasing local revenue, especially in places with a high percentage of economically disadvantaged students, we could raise overall proficiency in West Virginia.