Thesis: In West Virginia, the strongest predictors of county-level
educational outcomes are average family income and the percentage of
residents with at least a bachelor’s degree.
This analysis tests a range of county-level factors as predictors of
educational proficiency rates in West Virginia. The predictors include
federal, state, and local revenues allocated to each county;
county-level expenditures; enrollment in each county; unemployment
rates; the percentage of residents with less than a 9th-grade
education; the percentage with less than a high school education; the
percentage with at least a bachelor’s degree; and income metrics such
as average vast majority income, average household income, and average
family income. Each factor is evaluated against past proficiency rates
to identify the strongest predictors of educational outcomes.
Data Description
The data was gathered from U.S. Census websites for West Virginia
counties.
Key variables included:
- enroll: The number of students enrolled in each county.
- tfedrev: The amount of federal revenue allocated to each county.
- tstrev: The amount of state revenue allocated to each county.
- tlocrev: The amount of local revenue allocated to each county.
- totalexp: The total expenditures of each county.
- unemployed: The unemployment rate (percent) in each county.
- less_than_9th_grade_education: The percentage of people in each
county with less than a 9th-grade education.
- less_than_high_school_grade_education: The percentage of people in
each county with less than a high school education.
- at_least_bachelor_education: The percentage of people in each county
with at least a bachelor’s degree.
- vast_majority_income: The average vast majority income in each
county.
- household_income: The average household income in each county.
- family_income: The average family income in each county.
- proficiency: The percentage of students in each county who are
educationally proficient.
- proficiency_range: The percentage range (for example, 18 to 25) that
contains each county’s proficiency rate.
Methods
Load Assessment Data
In this section, the assessment data provided by the professor was
loaded.
Load Spending Data
In this section, the spending data provided by the professor was
loaded.
Load Demographic Data
In this section, the demographic data provided by the professor was
loaded.
Add in New Data
In this section, new data was loaded and cleaned. Additional
educational outcome data, including math proficiency, reading
proficiency, and an average proficiency score, was incorporated.
County-level educational attainment data and income level information
were also added.
Join Data
In this section, the loaded data was merged to create a unified
dataset, facilitating more efficient analysis and ensuring all relevant
variables are accessible for the study.
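As a rough illustration of the join step, the sketch below merges three small county-level tables with base R's merge(). The table names, the `county` key column, and every value are made-up stand-ins, since the code and key used for the actual join are not shown here.

```r
# Hypothetical stand-ins for the assessment, spending, and demographic tables.
# All values are invented for illustration only.
assessment <- data.frame(
  county = c("Barbour", "Berkeley", "Boone"),
  proficiency = c(28.5, 35.1, 24.9)
)
spending <- data.frame(
  county = c("Barbour", "Berkeley", "Boone"),
  totalexp = c(31000, 250000, 52000)
)
demographics <- data.frame(
  county = c("Barbour", "Berkeley", "Boone"),
  family_income = c(52000, 78000, 49000)
)

# Merge the tables one at a time on the shared county key.
# merge() performs an inner join by default, so unmatched counties are dropped.
joined <- Reduce(
  function(x, y) merge(x, y, by = "county"),
  list(assessment, spending, demographics)
)
print(joined)
```

An inner join is only one option; passing all = TRUE to merge() would keep counties that are missing from one of the tables.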
View Proficiency Data
In this section, a statewide graph was created to visualize
proficiency rates by county.

Correlations
In this section, correlation analysis was performed to identify the
factors most strongly associated with educational proficiency.
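A minimal sketch of this kind of correlation screen is shown below. It uses simulated data with a few of the variable names from the data description; the values, the choice of Pearson correlation, and the ranking by absolute value are all assumptions rather than the original procedure.

```r
set.seed(1)  # reproducible simulated data, not the real county data
n <- 55      # West Virginia has 55 counties

df <- data.frame(
  family_income = rnorm(n, mean = 60000, sd = 8000),
  at_least_bachelor_education = rnorm(n, mean = 18, sd = 5),
  unemployed = rnorm(n, mean = 6, sd = 1.5)
)
# Simulated outcome that depends on income and education plus noise.
df$proficiency <- 12 + 0.3 * df$at_least_bachelor_education +
  0.0002 * df$family_income + rnorm(n, sd = 3)

# Pearson correlation of each predictor with proficiency.
predictors <- setdiff(names(df), "proficiency")
cors <- sapply(predictors, function(v) cor(df[[v]], df$proficiency))

# Rank predictors by the strength (absolute value) of the correlation.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

A correlation near zero here only rules out a linear association; it does not show that a factor has no effect.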

Create Test/Training Data
In this section, the data was divided into training and testing sets
to facilitate model training and evaluation.
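The split could look something like the sketch below. The regression output reports 42 residual degrees of freedom with three estimated coefficients, which implies 45 training rows, so the sketch samples 45 of 55 counties. The random sampling method, the seed, the placeholder data, and the `t_test` name are assumptions.

```r
set.seed(42)  # assumed seed, for reproducibility only

# Placeholder county table with invented proficiency values.
counties <- data.frame(
  id = 1:55,
  proficiency = runif(55, min = 18, max = 47)
)

# Randomly pick 45 rows for training; the remaining 10 form the test set.
train_idx <- sample(nrow(counties), size = 45)
t_train <- counties[train_idx, ]
t_test <- counties[-train_idx, ]

print(c(train = nrow(t_train), test = nrow(t_test)))
```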
Linear Regression Model
In this section, a linear regression model was created to predict
educational proficiency levels.
Call:
lm(formula = proficiency ~ at_least_bachelor_education + family_income,
    data = t_train)

Residuals:
    Min      1Q  Median      3Q     Max
-6.4377 -2.0840 -0.2013  1.9891  8.5273

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                 12.01459814 3.64431128   3.297  0.00200 **
at_least_bachelor_education  0.32319107 0.11888552   2.719  0.00949 **
family_income                0.00017861 0.00007458   2.395  0.02117 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.776 on 42 degrees of freedom
Multiple R-squared:  0.5767, Adjusted R-squared:  0.5566
F-statistic: 28.61 on 2 and 42 DF,  p-value: 0.00000001442
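To make the fitted coefficients concrete, the sketch below plugs them into the regression equation for one hypothetical county. The coefficients are copied from the summary above; the county's inputs (20 percent with a bachelor's degree, $60,000 average family income) are invented for illustration.

```r
# Coefficients copied from the regression summary.
b0 <- 12.01459814
b_bachelor <- 0.32319107
b_income <- 0.00017861

# Hypothetical county, not taken from the data.
bachelor_pct <- 20
family_income <- 60000

# Predicted proficiency = intercept + slope * value for each predictor.
predicted <- b0 + b_bachelor * bachelor_pct + b_income * family_income
print(round(predicted, 2))

# Effect sizes in more readable units.
effect_per_point_bachelor <- b_bachelor      # per 1 percentage point
effect_per_10k_income <- b_income * 10000    # per $10,000 of family income
print(c(effect_per_point_bachelor, effect_per_10k_income))
```

For this hypothetical county the predicted proficiency is about 29.2 percent. Read this way, each additional percentage point of bachelor's degree holders is associated with roughly 0.32 points of proficiency, and each additional $10,000 of average family income with roughly 1.79 points, holding the other predictor fixed.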
View Residuals of Linear Regression Model
In this section, a histogram was created to visualize the
distribution of residuals based on the linear regression model.
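A residual histogram of this kind can be produced along the following lines. The data is simulated so that the block runs on its own; only the model formula and the `t_train` name come from the regression output, and the plot settings are assumptions.

```r
set.seed(7)  # simulated data, not the real training set

t_train <- data.frame(
  at_least_bachelor_education = rnorm(45, mean = 18, sd = 5),
  family_income = rnorm(45, mean = 60000, sd = 8000)
)
t_train$proficiency <- 12 + 0.3 * t_train$at_least_bachelor_education +
  0.0002 * t_train$family_income + rnorm(45, sd = 3.8)

# Same formula as the model in the regression output.
model <- lm(proficiency ~ at_least_bachelor_education + family_income,
            data = t_train)
res <- residuals(model)

# A roughly symmetric histogram centered on zero supports the
# linear model's assumption of well-behaved errors.
hist(res, breaks = 10,
     main = "Residuals of linear regression model",
     xlab = "Residual (percentage points)")
```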

PCA
In this section, principal component analysis (PCA) was performed to
identify the main patterns of variation across counties.
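One way to run such a PCA in base R is sketched below with prcomp(). The data is simulated, and the choice of variables and the decision to standardize them are assumptions; standardizing is common when variables are on very different scales, as incomes and percentages are.

```r
set.seed(3)  # simulated data, not the real county data
n <- 55

X <- data.frame(
  family_income = rnorm(n, mean = 60000, sd = 8000),
  household_income = rnorm(n, mean = 50000, sd = 7000),
  at_least_bachelor_education = rnorm(n, mean = 18, sd = 5),
  unemployed = rnorm(n, mean = 6, sd = 1.5)
)

# Center and scale so that large-valued variables do not dominate.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings show how each variable contributes to the first two components.
print(round(pca$rotation[, 1:2], 2))
```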

K-Means
In this section, k-means clustering was performed to group counties
into distinct clusters with similar characteristics.
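A minimal k-means sketch using base R's kmeans() follows. The data is simulated, and the number of clusters (3), the variables used, and the scaling step are assumptions, since the original settings are not stated.

```r
set.seed(11)  # simulated data and reproducible cluster starts
n <- 55

X <- data.frame(
  family_income = rnorm(n, mean = 60000, sd = 8000),
  at_least_bachelor_education = rnorm(n, mean = 18, sd = 5),
  unemployed = rnorm(n, mean = 6, sd = 1.5)
)

# k-means relies on Euclidean distance, so put variables on a common scale.
X_scaled <- scale(X)

# nstart = 25 reruns the algorithm from 25 random starts and keeps the best fit.
km <- kmeans(X_scaled, centers = 3, nstart = 25)

print(table(km$cluster))       # number of counties in each cluster
print(round(km$centers, 2))    # cluster centers in standardized units
```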

Decision Tree
In this section, a decision tree was created to classify counties
into proficiency ranges based on key factors.
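A classification tree for the proficiency ranges might be fit as in the sketch below. The use of the rpart package is an assumption, since the package behind the original tree is not named. The data is simulated, and the two predictors are borrowed from the regression model purely for illustration; the range boundaries follow the labels in the confusion matrix.

```r
library(rpart)  # assumption: the original package is not stated

set.seed(5)  # simulated data, not the real county data
n <- 55
df <- data.frame(
  at_least_bachelor_education = rnorm(n, mean = 18, sd = 5),
  family_income = rnorm(n, mean = 60000, sd = 8000)
)
score <- 12 + 0.3 * df$at_least_bachelor_education +
  0.0002 * df$family_income + rnorm(n, sd = 3)
score <- pmin(pmax(score, 18), 47)  # keep simulated scores inside the bins

# Bin the scores into the ranges used in the confusion matrix.
df$proficiency_range <- droplevels(cut(
  score,
  breaks = c(18, 25, 30, 35, 40, 47),
  labels = c("18 to 25", "25 to 30", "30 to 35", "35 to 40", "40 to 47"),
  include.lowest = TRUE
))

# Fit a classification tree and predict a range for every county.
tree <- rpart(proficiency_range ~ at_least_bachelor_education + family_income,
              data = df, method = "class")
pred <- predict(tree, newdata = df, type = "class")
print(table(predicted = pred, actual = df$proficiency_range))
```

Predicting on the same rows used for fitting, as done here for brevity, tends to overstate accuracy; a held-out test set gives a fairer estimate.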

Decision Tree Results
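In this section, a confusion matrix was generated to evaluate the
performance of the decision tree by comparing the predicted values with
the actual outcomes. The output below reports the overall accuracy (35
of 55 counties classified correctly, about 63.6%), the confusion matrix
of counts, and the same matrix expressed as column proportions.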
[1] 0.6363636

         18 to 25 25 to 30 30 to 35 35 to 40 40 to 47
18 TO 25        4        0        3        0        0
25 TO 30        5       22        3        2        1
30 TO 35        0        2        5        1        0
35 TO 40        0        0        1        4        2

           18 to 25   25 to 30   30 to 35   35 to 40   40 to 47
18 TO 25 0.44444444 0.00000000 0.25000000 0.00000000 0.00000000
25 TO 30 0.55555556 0.91666667 0.25000000 0.28571429 0.33333333
30 TO 35 0.00000000 0.08333333 0.41666667 0.14285714 0.00000000
35 TO 40 0.00000000 0.00000000 0.08333333 0.57142857 0.66666667
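The accuracy and the proportion table can be reproduced from the counts alone, as the sketch below shows. The counts are copied from the confusion matrix above. It assumes the uppercase row labels and lowercase column labels refer to the same ranges, so that matching classes sit on the main diagonal.

```r
# Counts copied from the confusion matrix (4 rows, 5 columns).
cm <- matrix(
  c(4,  0, 3, 0, 0,
    5, 22, 3, 2, 1,
    0,  2, 5, 1, 0,
    0,  0, 1, 4, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("18 TO 25", "25 TO 30", "30 TO 35", "35 TO 40"),
    c("18 to 25", "25 to 30", "30 to 35", "35 to 40", "40 to 47")
  )
)

# Overall accuracy: matching cells on the diagonal over all counties (35 / 55).
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)

# Column proportions: the share of each column's counties in each row.
col_props <- prop.table(cm, margin = 2)
print(round(col_props, 4))
```

The matrix has no "40 TO 47" row, so none of the three counties in that column is matched to its own range. The "25 to 30" column is matched most often, at about 92 percent.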
Limitations
The data sources cover different time periods. My demographic,
educational attainment, and income data are the most recent available,
while the educational proficiency rate data is a few years older.
County conditions may have been different when proficiency was
measured, which could have affected the results. Additionally, there
are several factors I did not consider, such as incarceration rates,
smoking rates, and the size of non-English-speaking populations, that
could also influence proficiency rates.
