This project analyzes county-level data from the United States to explore how economic factors relate to median household income. The dataset contains quantitative variables, including poverty rate, unemployment rate, and median household income, as well as the categorical variable metro, which indicates whether a county is classified as metro or non-metro. The goal is to examine how poverty and unemployment relate to median household income using regression analysis, and to visualize patterns in the data, particularly the difference between metro and non-metro counties.
Data Loading and Cleaning
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
# Loading dataset
county <- readr::read_csv("county.csv")
## Rows: 3142 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, state, metro, median_edu, smoking_ban
## dbl (10): pop2000, pop2010, pop2017, pop_change, poverty, homeownership, mul...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(county)
## spc_tbl_ [3,142 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name : chr [1:3142] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
## $ state : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ pop2000 : num [1:3142] 43671 140415 29038 20826 51024 ...
## $ pop2010 : num [1:3142] 54571 182265 27457 22915 57322 ...
## $ pop2017 : num [1:3142] 55504 212628 25270 22668 58013 ...
## $ pop_change : num [1:3142] 1.48 9.19 -6.22 0.73 0.68 -2.28 -2.69 -1.51 -1.2 -0.6 ...
## $ poverty : num [1:3142] 13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
## $ homeownership : num [1:3142] 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
## $ multi_unit : num [1:3142] 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
## $ unemployment_rate: num [1:3142] 3.86 3.99 5.9 4.39 4.02 4.93 5.49 4.93 4.08 4.05 ...
## $ metro : chr [1:3142] "yes" "yes" "no" "yes" ...
## $ median_edu : chr [1:3142] "some_college" "some_college" "hs_diploma" "hs_diploma" ...
## $ per_capita_income: num [1:3142] 27842 27780 17892 20572 21367 ...
## $ median_hh_income : num [1:3142] 55317 52562 33368 43404 47412 ...
## $ smoking_ban : chr [1:3142] "none" "none" "partial" "none" ...
## - attr(*, "spec")=
## .. cols(
## .. name = col_character(),
## .. state = col_character(),
## .. pop2000 = col_double(),
## .. pop2010 = col_double(),
## .. pop2017 = col_double(),
## .. pop_change = col_double(),
## .. poverty = col_double(),
## .. homeownership = col_double(),
## .. multi_unit = col_double(),
## .. unemployment_rate = col_double(),
## .. metro = col_character(),
## .. median_edu = col_character(),
## .. per_capita_income = col_double(),
## .. median_hh_income = col_double(),
## .. smoking_ban = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Data Cleaning
# Keep only the variables used in the analysis and drop rows with missing values
county_clean <- county %>%
  select(name, median_hh_income, poverty, unemployment_rate, metro) %>%
  drop_na()
head(county_clean)
## # A tibble: 6 × 5
## name median_hh_income poverty unemployment_rate metro
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Autauga County 55317 13.7 3.86 yes
## 2 Baldwin County 52562 11.8 3.99 yes
## 3 Barbour County 33368 27.2 5.9 no
## 4 Bibb County 43404 15.2 4.39 yes
## 5 Blount County 47412 15.6 4.02 yes
## 6 Bullock County 29655 28.5 4.93 no
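Before dropping rows, it can help to see how many values are missing in each column and how many rows the cleaning step removes. The following is a minimal sketch in base R rather than the dplyr pipeline used here. The data frame `toy` and all of its values are hypothetical and are not rows from county.csv.

```r
# Hypothetical toy data standing in for a few county rows (values are made up).
toy <- data.frame(
  name              = c("A County", "B County", "C County", "D County"),
  median_hh_income  = c(50000, NA, 42000, 61000),
  poverty           = c(12.5, 18.0, NA, 9.4),
  unemployment_rate = c(4.1, 5.3, 6.0, 3.2),
  metro             = c("yes", "no", "no", "yes")
)

# Count missing values per column.
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Keep only rows with no missing values, which is what drop_na() does
# when it is called without column arguments.
toy_clean <- toy[complete.cases(toy), ]

# Number of rows removed by the cleaning step.
rows_removed <- nrow(toy) - nrow(toy_clean)
print(rows_removed)
```

Running the same two checks on the selected columns of county would show how many rows drop_na() removes there.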
Regression Analysis
model <- lm(median_hh_income ~ poverty + unemployment_rate, data = county)
summary(model)
##
## Call:
## lm(formula = median_hh_income ~ poverty + unemployment_rate,
## data = county)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28233 -5288 -1482 3173 61239
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73915.97 492.59 150.055 <2e-16 ***
## poverty -1516.76 28.27 -53.652 <2e-16 ***
## unemployment_rate 14.75 111.73 0.132 0.895
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8693 on 3136 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.5634, Adjusted R-squared: 0.5631
## F-statistic: 2023 on 2 and 3136 DF, p-value: < 2.2e-16
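This model estimates how median household income varies with the poverty rate and the unemployment rate. It was fit on the original county data rather than county_clean, and lm() dropped the 3 rows with missing values. Counties with higher poverty have lower income: each additional percentage point of poverty is associated with about $1,517 less median household income, holding unemployment constant (p < 2e-16). The unemployment coefficient is small and not statistically significant (p = 0.895), so unemployment adds little once poverty is in the model. Together the two predictors explain about 56% of the variation in income (R-squared = 0.5634). The regression equation is:
median_hh_income = 73916 - 1517 * poverty + 14.8 * unemployment_rate
Of the two predictors, poverty is the one strongly associated with income in these counties.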
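As a worked example of the regression equation, the sketch below plugs in the values for Autauga County from the head(county_clean) output and compares the fitted value with the observed income. The coefficients are copied from the summary(model) output. The variable names are illustrative and not part of the original analysis.

```r
# Coefficients copied from the summary(model) output.
intercept      <- 73915.97
b_poverty      <- -1516.76
b_unemployment <- 14.75

# Autauga County values from the head(county_clean) output.
poverty           <- 13.7
unemployment_rate <- 3.86
observed_income   <- 55317

# Fitted value from the regression equation, and the residual for this county.
predicted_income <- intercept + b_poverty * poverty + b_unemployment * unemployment_rate
residual <- observed_income - predicted_income

print(round(predicted_income, 2))  # 53193.29
print(round(residual, 2))          # 2123.71
```

The residual of about $2,124 is small relative to the residual standard error of 8693, so this county sits close to the fitted line.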
Data Visualization for Regression Analysis
ggplot(county, aes(x = poverty, y = median_hh_income, color = unemployment_rate)) +
  geom_point(size = 2) + # each county as a point
  geom_smooth(method = "lm", se = FALSE, color = "black") + # regression line for poverty
  labs(
    title = "Effect of Poverty and Unemployment on Median Household Income",
    x = "Poverty Rate (%)",
    y = "Median Household Income ($)",
    caption = "Source: county.csv dataset",
    color = "Unemployment Rate (%)"
  ) +
  scale_color_gradient(low = "blue", high = "red") + # low unemployment = blue, high = red
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Data Visualization
This plot compares median household income in metro and non-metro counties. Each point represents a county, colored by metro status, so income differences between metro and non-metro areas can be compared at each poverty level.
ggplot(county, aes(x = poverty, y = median_hh_income, color = metro)) +
  geom_point(size = 2) +
  labs(
    title = "Relationship Between Poverty and Median Household Income",
    x = "Poverty Rate (%)",
    y = "Median Household Income ($)",
    caption = "Source: county.csv dataset",
    color = "Metro Status"
  ) +
  scale_color_manual(values = c("red", "blue")) + # categorical colors
  theme_classic()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
This scatterplot shows the negative relationship between poverty and
income. Points are colored by metro status, showing that metro counties
(blue) generally have higher income than non-metro counties (red). The
plot includes a title, labeled axes, and a caption for the data
source.
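The visual comparison between metro and non-metro counties can be backed up with a group summary. The sketch below uses base R and only the six rows printed by head(county_clean), so the numbers illustrate the calculation and do not describe the full dataset. The names `six_rows` and `income_by_metro` are illustrative.

```r
# The six rows printed by head(county_clean); this is not the full dataset.
six_rows <- data.frame(
  name = c("Autauga County", "Baldwin County", "Barbour County",
           "Bibb County", "Blount County", "Bullock County"),
  median_hh_income = c(55317, 52562, 33368, 43404, 47412, 29655),
  metro = c("yes", "yes", "no", "yes", "yes", "no")
)

# Median of median_hh_income within each metro group.
income_by_metro <- tapply(six_rows$median_hh_income, six_rows$metro, median)
print(income_by_metro)
```

In these six rows the metro group has the higher median (49987 versus 31511.5). Applying the same tapply() call to county_clean would give the comparison for all counties.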
Final Essay
I cleaned the data by keeping only the columns I needed and removing rows with missing values, which produced county_clean. The regression and the plots were run on the original county data, where lm() and ggplot2 dropped the incomplete rows automatically (3 rows in the regression and 2 in each plot). The scatterplots show that income falls as poverty rises and that metro counties usually have higher income than non-metro counties. I did not label every county because, with more than 3,000 points, the labels would overlap and be unreadable. Together, the cleaning, regression, and plots help explain the differences in income between counties.