library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- read_csv("https://corgis-edu.github.io/corgis/datasets/csv/county_demographics/county_demographics.csv")
## Rows: 3139 Columns: 43
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): County, State
## dbl (41): Age.Percent 65 and Older, Age.Percent Under 18 Years, Age.Percent ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
texas <- data %>%
  filter(State %in% c("TX", "Texas"))
names(texas) <- make.names(names(texas))
The dependent variable in this analysis is median household income. The independent variables include the percentage of residents with a bachelor’s degree or higher, the percentage of residents aged 65 and older, and total population.
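# Keep only the outcome and the three predictors, then drop incomplete rows.
# Note: "Houseold" matches the column name as it is spelled in the data.
model_data <- texas %>%
  select(
    Income.Median.Houseold.Income,
    Education.Bachelor.s.Degree.or.Higher,
    Age.Percent.65.and.Older,
    Population.2020.Population
  ) %>%
  na.omit()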
model <- lm(
  Income.Median.Houseold.Income ~
    Education.Bachelor.s.Degree.or.Higher +
    Age.Percent.65.and.Older +
    Population.2020.Population,
  data = model_data
)
summary(model)
##
## Call:
## lm(formula = Income.Median.Houseold.Income ~ Education.Bachelor.s.Degree.or.Higher +
## Age.Percent.65.and.Older + Population.2020.Population, data = model_data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -33216  -4796     18   5564  45175
##
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            4.998e+04  2.608e+03  19.167  < 2e-16 ***
## Education.Bachelor.s.Degree.or.Higher  9.872e+02  8.474e+01  11.650  < 2e-16 ***
## Age.Percent.65.and.Older              -8.386e+02  1.146e+02  -7.317 3.45e-12 ***
## Population.2020.Population            -3.355e-03  1.681e-03  -1.996   0.0471 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9610 on 250 degrees of freedom
## Multiple R-squared: 0.4417, Adjusted R-squared: 0.4349
## F-statistic: 65.92 on 3 and 250 DF, p-value: < 2.2e-16
plot(model, which = 1)
The model’s R-squared is 0.442 (adjusted R-squared 0.435), meaning that approximately 44.2% of the variation in median household income across Texas counties is explained by the independent variables included in the model. This is a moderate level of explanatory power: more than half of the variation remains unexplained by these three predictors.
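As a rough illustration of where these two figures come from, the sketch below recomputes R-squared and adjusted R-squared by hand and compares them with `summary()`. It uses the built-in `mtcars` data as a stand-in (an assumption made so the example runs without downloading the county file); the variable names `fit`, `rss`, and `tss` are illustrative only.

```r
# Stand-in data: mtcars ships with R. This is NOT the county data used in the report.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

rss <- sum(residuals(fit)^2)                    # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
r2  <- 1 - rss / tss                            # share of variation explained

n <- nrow(mtcars)                               # number of observations
p <- length(coef(fit)) - 1                      # number of predictors (excluding intercept)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalizes additional predictors

# Both values should agree with what summary() reports
all.equal(r2, summary(fit)$r.squared)
all.equal(adj_r2, summary(fit)$adj.r.squared)
```

Applying the same adjustment to the reported model (254 counties, 3 predictors, R-squared 0.4417) gives a value close to the 0.4349 shown in the output, so the two statistics are consistent with each other.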
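All three independent variables are statistically significant at the 0.05 level. The percentage of residents with a bachelor’s degree or higher and the percentage aged 65 and older are highly significant (p < 0.001), while total population is only marginally significant (p = 0.047). No variable in this model is statistically insignificant. Significance alone does not mean an effect is large, however: the population coefficient of about -0.0034 corresponds to roughly $34 lower median household income per additional 10,000 residents.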
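To put the population estimate in perspective, the following sketch rescales it and builds an approximate 95% confidence interval. The estimate, standard error, and degrees of freedom are copied from the `summary(model)` output above; the variable names are illustrative, and the interval is the usual t-based one rather than anything specific to this report.

```r
# Values taken from the summary(model) output above
coef_pop <- -3.355e-03   # estimate for Population.2020.Population (dollars per person)
se_pop   <-  1.681e-03   # its standard error
df_resid <- 250          # residual degrees of freedom

# Rescale: implied change in median income per 10,000 additional residents
per_10k <- coef_pop * 10000

# Approximate 95% confidence interval: estimate +/- t quantile * standard error
ci <- coef_pop + c(-1, 1) * qt(0.975, df = df_resid) * se_pop

per_10k
ci
```

The upper end of the interval sits just below zero, which matches a p-value only slightly under 0.05. With the fitted `model` object available, `confint(model)` would give the same kind of interval directly.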
The coefficient for Education.Bachelor.s.Degree.or.Higher is approximately 987, meaning that for every one percentage point increase in the proportion of residents with a bachelor’s degree or higher, median household income is expected to increase by about $987, holding all other variables constant. This highlights the strong positive relationship between educational attainment and income levels across counties.
The residual plot shows a mostly random scatter around zero with no clear pattern, suggesting that the assumption of linearity is reasonably satisfied.
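A visual reading of the residual plot can be supplemented with a quick numeric check. The sketch below fits a lowess smooth to residuals against fitted values, similar in spirit to the trend line drawn on the residuals-versus-fitted plot, and reports how far that smooth strays from zero relative to the residual standard error. It again uses `mtcars` as stand-in data (an assumption), and the ratio is an informal diagnostic, not a formal test.

```r
# Stand-in data: mtcars ships with R. This is NOT the county data used in the report.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Smooth the residuals as a function of the fitted values
smooth <- lowess(fitted(fit), residuals(fit))

# Largest departure of the smoothed trend from zero, in units of residual SD.
# Values near 0 suggest no systematic curvature; larger values hint at non-linearity.
ratio <- max(abs(smooth$y)) / sigma(fit)
ratio
```

A smooth that stays close to zero across the range of fitted values supports the linearity assumption; a clear bend suggests that a transformation or an added polynomial term may be worth trying.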
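Overall, educational attainment has the largest t-statistic of the three predictors, which suggests it is the variable most strongly associated with income differences across Texas counties. Because the model is fit to observational data, this association should not be read as proof that education causes higher income.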