This project analyzes county-level data from the United States to explore how economic factors relate to median household income. The dataset includes quantitative variables such as poverty rate, unemployment rate, and median household income, along with the categorical variable metro, which indicates whether a county is classified as metro or non-metro. The goal is to use regression analysis to examine how poverty and unemployment relate to median household income, and to visualize patterns in the data, particularly the difference between metro and non-metro counties.

Data Loading

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Loading dataset 
county <- readr::read_csv("county.csv")
## Rows: 3142 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): name, state, metro, median_edu, smoking_ban
## dbl (10): pop2000, pop2010, pop2017, pop_change, poverty, homeownership, mul...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(county)
## spc_tbl_ [3,142 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name             : chr [1:3142] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
##  $ state            : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ pop2000          : num [1:3142] 43671 140415 29038 20826 51024 ...
##  $ pop2010          : num [1:3142] 54571 182265 27457 22915 57322 ...
##  $ pop2017          : num [1:3142] 55504 212628 25270 22668 58013 ...
##  $ pop_change       : num [1:3142] 1.48 9.19 -6.22 0.73 0.68 -2.28 -2.69 -1.51 -1.2 -0.6 ...
##  $ poverty          : num [1:3142] 13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
##  $ homeownership    : num [1:3142] 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
##  $ multi_unit       : num [1:3142] 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
##  $ unemployment_rate: num [1:3142] 3.86 3.99 5.9 4.39 4.02 4.93 5.49 4.93 4.08 4.05 ...
##  $ metro            : chr [1:3142] "yes" "yes" "no" "yes" ...
##  $ median_edu       : chr [1:3142] "some_college" "some_college" "hs_diploma" "hs_diploma" ...
##  $ per_capita_income: num [1:3142] 27842 27780 17892 20572 21367 ...
##  $ median_hh_income : num [1:3142] 55317 52562 33368 43404 47412 ...
##  $ smoking_ban      : chr [1:3142] "none" "none" "partial" "none" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   state = col_character(),
##   ..   pop2000 = col_double(),
##   ..   pop2010 = col_double(),
##   ..   pop2017 = col_double(),
##   ..   pop_change = col_double(),
##   ..   poverty = col_double(),
##   ..   homeownership = col_double(),
##   ..   multi_unit = col_double(),
##   ..   unemployment_rate = col_double(),
##   ..   metro = col_character(),
##   ..   median_edu = col_character(),
##   ..   per_capita_income = col_double(),
##   ..   median_hh_income = col_double(),
##   ..   smoking_ban = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Data Cleaning

county_clean <- county %>%
  select(name, median_hh_income, poverty, unemployment_rate, metro) %>%
  drop_na()
head(county_clean)
## # A tibble: 6 × 5
##   name           median_hh_income poverty unemployment_rate metro
##   <chr>                     <dbl>   <dbl>             <dbl> <chr>
## 1 Autauga County            55317    13.7              3.86 yes  
## 2 Baldwin County            52562    11.8              3.99 yes  
## 3 Barbour County            33368    27.2              5.9  no   
## 4 Bibb County               43404    15.2              4.39 yes  
## 5 Blount County             47412    15.6              4.02 yes  
## 6 Bullock County            29655    28.5              4.93 no
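
Before relying on drop_na(), it can help to see where the missing values are and how many rows the step removes. The following is a minimal sketch, not part of the original analysis. It uses a small stand-in table, county_demo, built from the six rows printed above plus one hypothetical row with a missing poverty value. The names county_demo, county_demo_clean, na_per_column, and rows_removed are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Stand-in data: the six rows printed by head(county_clean) above, plus one
# hypothetical row with a missing poverty value (this row is NOT in county.csv).
county_demo <- tibble::tibble(
  name = c("Autauga County", "Baldwin County", "Barbour County",
           "Bibb County", "Blount County", "Bullock County",
           "Hypothetical County"),
  median_hh_income = c(55317, 52562, 33368, 43404, 47412, 29655, 40000),
  poverty = c(13.7, 11.8, 27.2, 15.2, 15.6, 28.5, NA),
  unemployment_rate = c(3.86, 3.99, 5.9, 4.39, 4.02, 4.93, 5.0),
  metro = c("yes", "yes", "no", "yes", "yes", "no", "no")
)

# Count missing values in each column before dropping anything.
na_per_column <- colSums(is.na(county_demo))

# Drop incomplete rows, then work out how many rows were removed.
county_demo_clean <- drop_na(county_demo)
rows_removed <- nrow(county_demo) - nrow(county_demo_clean)

print(na_per_column)
print(rows_removed)
```

Running the same two checks on the selected columns of county would show which variables account for the rows that drop_na() removes from the real data.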

Regression Analysis

model <- lm(median_hh_income ~ poverty + unemployment_rate, data = county)
summary(model)
## 
## Call:
## lm(formula = median_hh_income ~ poverty + unemployment_rate, 
##     data = county)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28233  -5288  -1482   3173  61239 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       73915.97     492.59 150.055   <2e-16 ***
## poverty           -1516.76      28.27 -53.652   <2e-16 ***
## unemployment_rate    14.75     111.73   0.132    0.895    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8693 on 3136 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.5634, Adjusted R-squared:  0.5631 
## F-statistic:  2023 on 2 and 3136 DF,  p-value: < 2.2e-16

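This model estimates how median household income varies with the poverty rate and the unemployment rate. The model was fitted to the full county data, and lm() dropped 3 counties with missing values, leaving 3,139 observations. Poverty has a strong negative association with income: each additional percentage point of poverty corresponds to about $1,517 less in median household income, holding unemployment constant (p < 2e-16). Once poverty is accounted for, the unemployment coefficient is small and not statistically significant (14.75, p = 0.895). Together the two predictors explain about 56% of the variation in median household income across counties (R-squared = 0.563). The regression equation is:

median_hh_income = 73916 - 1517 * poverty + 14.8 * unemployment_rate

Of the two predictors, poverty is by far the stronger predictor of county income.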
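
To make the fitted equation concrete, it can be evaluated for a single county. The sketch below copies the coefficients from the summary(model) output above and plugs in a hypothetical county with 15% poverty and 5% unemployment. Those two input values, and the names coefs, new_poverty, new_unemployment, and predicted_income, are illustrative assumptions and do not come from the dataset.

```r
# Coefficients copied from the summary(model) output above.
coefs <- c(intercept = 73915.97, poverty = -1516.76, unemployment_rate = 14.75)

# Hypothetical county: 15% poverty, 5% unemployment (illustrative values only).
new_poverty <- 15
new_unemployment <- 5

# Evaluate the regression equation term by term.
predicted_income <- coefs[["intercept"]] +
  coefs[["poverty"]] * new_poverty +
  coefs[["unemployment_rate"]] * new_unemployment

print(predicted_income)  # 51238.32
```

With the fitted object available, predict(model, newdata = data.frame(poverty = 15, unemployment_rate = 5)) should give nearly the same value, differing only because the printed coefficients are rounded.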

Data Visualization for Regression Analysis

ggplot(county, aes(x = poverty, y = median_hh_income, color = unemployment_rate)) +
  geom_point(size = 2) +  # each county as a point
  geom_smooth(method = "lm", se = FALSE, color = "black") +  # regression line for poverty
  labs(
    title = "Effect of Poverty and Unemployment on Median Household Income",
    x = "Poverty Rate (%)",
    y = "Median Household Income ($)",
    caption = "Source: county.csv dataset",
    color = "Unemployment Rate (%)"
  ) +
  scale_color_gradient(low = "blue", high = "red") +  # low unemployment = blue, high = red
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Data Visualization

This visualization shows whether metro counties tend to have higher median household income than non-metro counties. Each point represents a county, colored by metro status, so income can be compared between metro and non-metro areas at similar poverty rates.

ggplot(county, aes(x = poverty, y = median_hh_income, color = metro)) +
  geom_point(size = 2) + 
  labs(
    title = "Relationship Between Poverty and Median Household Income",
    x = "Poverty Rate (%)",
    y = "Median Household Income ($)",
    caption = "Source: county.csv dataset",
    color = "Metro Status"
  ) +
  scale_color_manual(values = c("no" = "red", "yes" = "blue")) +  # non-metro = red, metro = blue
  theme_classic()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

This scatterplot shows the negative relationship between poverty and income. Points are colored by metro status, showing that metro counties (blue) generally have higher income than non-metro counties (red). The plot includes a title, labeled axes, and a caption for the data source.
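
The visual comparison between metro and non-metro counties can be backed up with a grouped summary. The sketch below is one possible way to do that and is not part of the original analysis. To stay self-contained, it uses only the six Alabama counties printed by head(county_clean) earlier, so its numbers describe those six rows and not the full dataset. The names county_head and income_by_metro are assumptions made for illustration.

```r
library(dplyr)

# Stand-in data: only the six rows shown in the head(county_clean) output.
county_head <- tibble::tibble(
  name = c("Autauga County", "Baldwin County", "Barbour County",
           "Bibb County", "Blount County", "Bullock County"),
  median_hh_income = c(55317, 52562, 33368, 43404, 47412, 29655),
  metro = c("yes", "yes", "no", "yes", "yes", "no")
)

# Number of counties and median of median household income per metro group.
income_by_metro <- county_head %>%
  group_by(metro) %>%
  summarise(
    n_counties = n(),
    median_income = median(median_hh_income)
  )

print(income_by_metro)
```

Replacing county_head with county_clean would give the comparison for all counties with complete data.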

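Final Essay

I cleaned the data by keeping only the columns I needed (name, median household income, poverty, unemployment rate, and metro status) and removing rows with missing values. The regression shows that poverty is strongly and negatively related to median household income, while unemployment adds little once poverty is included. The scatterplots show the same negative relationship and also show that metro counties usually have higher income than non-metro counties. I did not label individual counties on the plots because more than 3,000 labels would make them unreadable. Overall, cleaning the data, fitting a regression, and plotting the results together help explain the differences in income between counties.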