Predicting Poverty Rates in Counties

Source: CDC

Image Source: CDC

I. Introduction

The county dataset shows population data on every county in the United States. This dataset was collected by U.S. Census Quick Facts. Smoking data was collected by a variety of other sources. I found this dataset on OpenIntro and after obtaining it, the variable names were already cleaned; all that is needed is to remove some N/A values which I do below. Some of the important variables in this dataset include population data for a few years, the percentage of the population that is in poverty, smoking ban or no ban, and unemployment rate. This dataset has 14 variables and 3142 observations. 4 of them are qualitative and 10 are quantitative.

For this project, my research question will be: What variables are most significant at predicting the percent of poverty (or poverty rates) in a county? For this research question, I will need every variable. For the linear regression, I will do backwards elimination to figure out the most significant variables for a linear regression. Furthermore, I’ll figuring out what factors explain poverty rates the best using the R^2 statistic. I chose this topic and dataset because I want to figure out what variables, such as unemployment rate, population, and/or house ownership rates, are most significant at predicting poverty rates in a county. This is meaningful in today’s day and age because if this project can develop meaningful insights, then we can figure out ways to reduce poverty in an area.

A. Load the Libraries and Dataset

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.5.2
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
setwd("C:/Users/sarah/Downloads") 
county <- readr::read_csv("county - county.csv")
## Rows: 3142 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): name, state, metro, median_edu, smoking_ban
## dbl (10): pop2000, pop2010, pop2017, pop_change, poverty, homeownership, mul...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(county)
## # A tibble: 6 × 15
##   name           state  pop2000 pop2010 pop2017 pop_change poverty homeownership
##   <chr>          <chr>    <dbl>   <dbl>   <dbl>      <dbl>   <dbl>         <dbl>
## 1 Autauga County Alaba…   43671   54571   55504       1.48    13.7          77.5
## 2 Baldwin County Alaba…  140415  182265  212628       9.19    11.8          76.7
## 3 Barbour County Alaba…   29038   27457   25270      -6.22    27.2          68  
## 4 Bibb County    Alaba…   20826   22915   22668       0.73    15.2          82.9
## 5 Blount County  Alaba…   51024   57322   58013       0.68    15.6          82  
## 6 Bullock County Alaba…   11714   10914   10309      -2.28    28.5          76.9
## # ℹ 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>, metro <chr>,
## #   median_edu <chr>, per_capita_income <dbl>, median_hh_income <dbl>,
## #   smoking_ban <chr>

II. Cleaning and EDA

A. Cleaning

Check for N/A values

colSums(is.na(county))
##              name             state           pop2000           pop2010 
##                 0                 0                 3                 0 
##           pop2017        pop_change           poverty     homeownership 
##                 3                 3                 2                 0 
##        multi_unit unemployment_rate             metro        median_edu 
##                 0                 3                 3                 2 
## per_capita_income  median_hh_income       smoking_ban 
##                 2                 2               580

We have some N/A’s to deal with.

Since we’re focusing on poverty, we will only remove N/A values from the poverty variable. For visualizations and the final model, I will remove N/A values when I create them (if R doesn’t already do so by default).

county_clean <- county |>
  filter(poverty != is.na(poverty))
head(county_clean)
## # A tibble: 6 × 15
##   name           state  pop2000 pop2010 pop2017 pop_change poverty homeownership
##   <chr>          <chr>    <dbl>   <dbl>   <dbl>      <dbl>   <dbl>         <dbl>
## 1 Autauga County Alaba…   43671   54571   55504       1.48    13.7          77.5
## 2 Baldwin County Alaba…  140415  182265  212628       9.19    11.8          76.7
## 3 Barbour County Alaba…   29038   27457   25270      -6.22    27.2          68  
## 4 Bibb County    Alaba…   20826   22915   22668       0.73    15.2          82.9
## 5 Blount County  Alaba…   51024   57322   58013       0.68    15.6          82  
## 6 Bullock County Alaba…   11714   10914   10309      -2.28    28.5          76.9
## # ℹ 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>, metro <chr>,
## #   median_edu <chr>, per_capita_income <dbl>, median_hh_income <dbl>,
## #   smoking_ban <chr>

B. EDA

Check the structure of the data set.

str(county)
## spc_tbl_ [3,142 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name             : chr [1:3142] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
##  $ state            : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ pop2000          : num [1:3142] 43671 140415 29038 20826 51024 ...
##  $ pop2010          : num [1:3142] 54571 182265 27457 22915 57322 ...
##  $ pop2017          : num [1:3142] 55504 212628 25270 22668 58013 ...
##  $ pop_change       : num [1:3142] 1.48 9.19 -6.22 0.73 0.68 -2.28 -2.69 -1.51 -1.2 -0.6 ...
##  $ poverty          : num [1:3142] 13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
##  $ homeownership    : num [1:3142] 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
##  $ multi_unit       : num [1:3142] 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
##  $ unemployment_rate: num [1:3142] 3.86 3.99 5.9 4.39 4.02 4.93 5.49 4.93 4.08 4.05 ...
##  $ metro            : chr [1:3142] "yes" "yes" "no" "yes" ...
##  $ median_edu       : chr [1:3142] "some_college" "some_college" "hs_diploma" "hs_diploma" ...
##  $ per_capita_income: num [1:3142] 27842 27780 17892 20572 21367 ...
##  $ median_hh_income : num [1:3142] 55317 52562 33368 43404 47412 ...
##  $ smoking_ban      : chr [1:3142] "none" "none" "partial" "none" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   state = col_character(),
##   ..   pop2000 = col_double(),
##   ..   pop2010 = col_double(),
##   ..   pop2017 = col_double(),
##   ..   pop_change = col_double(),
##   ..   poverty = col_double(),
##   ..   homeownership = col_double(),
##   ..   multi_unit = col_double(),
##   ..   unemployment_rate = col_double(),
##   ..   metro = col_character(),
##   ..   median_edu = col_character(),
##   ..   per_capita_income = col_double(),
##   ..   median_hh_income = col_double(),
##   ..   smoking_ban = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

All variables line up with their intended class.

Check the first 10 unique county names.

head(unique(county$name,),10)
##  [1] "Autauga County"  "Baldwin County"  "Barbour County"  "Bibb County"    
##  [5] "Blount County"   "Bullock County"  "Butler County"   "Calhoun County" 
##  [9] "Chambers County" "Cherokee County"

Check the unique values for the state variable.

unique(county$state)
##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District of Columbia"
## [10] "Florida"              "Georgia"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "Minnesota"           
## [25] "Mississippi"          "Missouri"             "Montana"             
## [28] "Nebraska"             "Nevada"               "New Hampshire"       
## [31] "New Jersey"           "New Mexico"           "New York"            
## [34] "North Carolina"       "North Dakota"         "Ohio"                
## [37] "Oklahoma"             "Oregon"               "Pennsylvania"        
## [40] "Rhode Island"         "South Carolina"       "South Dakota"        
## [43] "Tennessee"            "Texas"                "Utah"                
## [46] "Vermont"              "Virginia"             "Washington"          
## [49] "West Virginia"        "Wisconsin"            "Wyoming"

We can see that this dataset considers the District of Columbia as a state, therefore, there are a total of 51 values for state.

unique(county$metro)
## [1] "yes" "no"  NA
unique(county$median_edu)
## [1] "some_college" "hs_diploma"   NA             "bachelors"    "below_hs"
unique(county$smoking_ban)
## [1] "none"    "partial" NA

Create a summary of the variables in the dataset.

summary(county_clean)
##      name              state              pop2000           pop2010       
##  Length:3140        Length:3140        Min.   :     67   Min.   :     82  
##  Class :character   Class :character   1st Qu.:  11236   1st Qu.:  11118  
##  Mode  :character   Mode  :character   Median :  24653   Median :  25890  
##                                        Mean   :  89701   Mean   :  98318  
##                                        3rd Qu.:  61792   3rd Qu.:  66898  
##                                        Max.   :9519338   Max.   :9818605  
##                                        NA's   :3                          
##     pop2017           pop_change          poverty      homeownership  
##  Min.   :      88   Min.   :-33.6300   Min.   : 2.40   Min.   : 0.00  
##  1st Qu.:   10981   1st Qu.: -1.9700   1st Qu.:11.30   1st Qu.:69.50  
##  Median :   25862   Median : -0.0700   Median :15.20   Median :74.60  
##  Mean   :  103822   Mean   :  0.5329   Mean   :15.97   Mean   :73.28  
##  3rd Qu.:   67773   3rd Qu.:  2.3700   3rd Qu.:19.40   3rd Qu.:78.40  
##  Max.   :10163507   Max.   : 37.1900   Max.   :52.00   Max.   :91.30  
##  NA's   :3          NA's   :3                                         
##    multi_unit    unemployment_rate    metro            median_edu       
##  Min.   : 0.00   Min.   : 1.620    Length:3140        Length:3140       
##  1st Qu.: 6.10   1st Qu.: 3.520    Class :character   Class :character  
##  Median : 9.70   Median : 4.360    Mode  :character   Mode  :character  
##  Mean   :12.33   Mean   : 4.611                                         
##  3rd Qu.:15.90   3rd Qu.: 5.355                                         
##  Max.   :98.50   Max.   :19.070                                         
##                  NA's   :1                                              
##  per_capita_income median_hh_income smoking_ban       
##  Min.   :10467     Min.   : 19264   Length:3140       
##  1st Qu.:21772     1st Qu.: 41126   Class :character  
##  Median :25445     Median : 48073   Mode  :character  
##  Mean   :26093     Mean   : 49765                     
##  3rd Qu.:29276     3rd Qu.: 55771                     
##  Max.   :69533     Max.   :129588                     
## 

We can see that the average poverty percentage across the counties in the United States is 15.97%.

III. Simple Visualizations

In this section, I will create simple visualizations of the poverty variable with the homeownership variable, unemployment rate, smoking bans, and whether the county includes a metropolitan area (metro).

A. Scatterplot of Poverty by Home Ownership

ggplot(county_clean, aes(x = homeownership, y = poverty)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE, color = "blue") # Add a regression line
## `geom_smooth()` using formula = 'y ~ x'

Interpretation: We can see that observations are clustered together towards high ownership rates and low poverty rates. This makes sense as counties with low poverty rates have high home ownership rates. The line fitted shows a downward relationship; as homeownership increases, poverty rates decrease.

B. Scatterplot of Poverty by Unemployment Rate

ggplot(county_clean, aes(x = unemployment_rate, y = poverty)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE, color = "blue") # Add a regression line
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation: We can see that as the unemployment rate increases, poverty rates increase. A line is fitted to best represent this positive linear relationship; as unemployment rate increases, poverty increases.

C. Box Plots of Smoking Bans by Poverty

Filter out N/A values for the smoking ban variable.

smoke_viz <- county_clean |>
  filter(smoking_ban != is.na(smoking_ban))
ggplot(smoke_viz, aes(x = smoking_ban, y = poverty)) +
  geom_boxplot()

Interpretation: We can see that the median poverty percentage for no smoke ban is greater than the median poverty percentage for a partial ban. There are a lot of outliers for both categories too. The difference doesn’t look that significant.

D. Box Plots of Metropolitan Area by Poverty

Remove N/A values

metroviz <- county_clean |>
  filter(metro != is.na(metro))
ggplot(metroviz, aes(x = metro, y = poverty)) +
  geom_boxplot()

Interpretation: We can see that the median poverty rate for counties with no metropolitan area is higher than counties with a metropolitan area. There are also more outliers for counties with no metropolitan areas. The difference in medians seems to be insignificant.

IV. Inclusion and Exclusion Criteria for Model and Visualization

Inclusion:

  • For the model, the predicted variable should be a numerical variable. The rest can be categorical and/or numerical.

  • For the visualization, the x and y variables must include poverty and the most significant variable at predicting poverty. (This inclusion criteria will be followed through in the visualization section).

Exclusion:

  • Greater than 20 unique categories of a qualitative variable. The only variables that meet this exclusion criteria are county (name) and state variables. These two are ineffective at predicting the poverty percentage because they are simply names and are just used as identifiers or groups.

Create a new dataset of the cleaned version that doesn’t include the county and state variables.

county3 <- county_clean |>
  select(-name, -state) # Remove name and state, but keep every other variable
head(county3)
## # A tibble: 6 × 13
##   pop2000 pop2010 pop2017 pop_change poverty homeownership multi_unit
##     <dbl>   <dbl>   <dbl>      <dbl>   <dbl>         <dbl>      <dbl>
## 1   43671   54571   55504       1.48    13.7          77.5        7.2
## 2  140415  182265  212628       9.19    11.8          76.7       22.6
## 3   29038   27457   25270      -6.22    27.2          68         11.1
## 4   20826   22915   22668       0.73    15.2          82.9        6.6
## 5   51024   57322   58013       0.68    15.6          82          3.7
## 6   11714   10914   10309      -2.28    28.5          76.9        9.9
## # ℹ 6 more variables: unemployment_rate <dbl>, metro <chr>, median_edu <chr>,
## #   per_capita_income <dbl>, median_hh_income <dbl>, smoking_ban <chr>

V. Multiple Linear Regression

Primary Model

fit <- lm(poverty ~ ., data = county3) # a dot tells R to include every variable

summary(fit)
## 
## Call:
## lm(formula = poverty ~ ., data = county3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2045  -2.0457  -0.2781   1.7276  21.2969 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.204e+01  1.303e+00  39.935  < 2e-16 ***
## pop2000                 4.033e-06  4.310e-06   0.936 0.349552    
## pop2010                -9.166e-06  1.160e-05  -0.790 0.429733    
## pop2017                 5.539e-06  7.611e-06   0.728 0.466789    
## pop_change              9.421e-02  1.926e-02   4.890 1.07e-06 ***
## homeownership          -2.055e-01  1.346e-02 -15.272  < 2e-16 ***
## multi_unit              7.747e-03  1.347e-02   0.575 0.565147    
## unemployment_rate       8.908e-01  4.531e-02  19.658  < 2e-16 ***
## metroyes                5.462e-01  1.650e-01   3.309 0.000948 ***
## median_edubelow_hs     -1.874e+00  3.392e+00  -0.552 0.580733    
## median_eduhs_diploma   -6.035e+00  6.818e-01  -8.851  < 2e-16 ***
## median_edusome_college -7.188e+00  6.279e-01 -11.448  < 2e-16 ***
## per_capita_income      -2.436e-04  2.411e-05 -10.103  < 2e-16 ***
## median_hh_income       -2.519e-04  1.071e-05 -23.513  < 2e-16 ***
## smoking_banpartial     -8.281e-03  1.530e-01  -0.054 0.956825    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.295 on 2545 degrees of freedom
##   (580 observations deleted due to missingness)
## Multiple R-squared:  0.7478, Adjusted R-squared:  0.7465 
## F-statistic: 539.1 on 14 and 2545 DF,  p-value: < 2.2e-16

Interpretation:

8 variables are significant at predicting the poverty percentage (p-value < 0.05).

The adjusted R^2 value is 0.7465, meaning 74.65% of the variation in poverty is explained by every other variable (besides county and state). This is decent.

The overall p-value of this model is less than 2.2e-16, meaning this model is very significant at predicting the poverty percentage.

I will now take out the variable with the highest p-value for my next model: smoking_ban (p-value of 0.956825).

Backwards Elimination Models

county4 <- county3 |>
  select(-smoking_ban)
fit2 <- lm(poverty ~ ., data = county4)

summary(fit2)
## 
## Call:
## lm(formula = poverty ~ ., data = county4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.2817  -2.0438  -0.2732   1.6736  21.6130 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.141e+01  1.164e+00  44.157  < 2e-16 ***
## pop2000                -1.969e-06  3.897e-06  -0.505 0.613443    
## pop2010                 9.184e-06  1.043e-05   0.880 0.378718    
## pop2017                -6.452e-06  6.812e-06  -0.947 0.343664    
## pop_change              8.854e-02  1.784e-02   4.962 7.36e-07 ***
## homeownership          -2.001e-01  1.205e-02 -16.601  < 2e-16 ***
## multi_unit              1.048e-02  1.173e-02   0.894 0.371521    
## unemployment_rate       8.974e-01  4.192e-02  21.411  < 2e-16 ***
## metroyes                5.509e-01  1.505e-01   3.660 0.000256 ***
## median_edubelow_hs     -7.472e+00  2.455e+00  -3.043 0.002359 ** 
## median_eduhs_diploma   -6.278e+00  6.136e-01 -10.231  < 2e-16 ***
## median_edusome_college -7.291e+00  5.678e-01 -12.842  < 2e-16 ***
## per_capita_income      -2.531e-04  2.217e-05 -11.416  < 2e-16 ***
## median_hh_income       -2.413e-04  9.910e-06 -24.344  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.337 on 3121 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.7386, Adjusted R-squared:  0.7375 
## F-statistic: 678.5 on 13 and 3121 DF,  p-value: < 2.2e-16

Next variable with the highest p-value: pop2000

county5 <- county4 |>
  select(-pop2000 )
fit3 <- lm(poverty ~ ., data = county5)

summary(fit3)
## 
## Call:
## lm(formula = poverty ~ ., data = county5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.2798  -2.0427  -0.2704   1.6681  21.6085 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.132e+01  1.162e+00  44.179  < 2e-16 ***
## pop2010                 4.227e-06  4.075e-06   1.037 0.299655    
## pop2017                -3.505e-06  3.837e-06  -0.914 0.360940    
## pop_change              8.732e-02  1.781e-02   4.902 9.97e-07 ***
## homeownership          -2.001e-01  1.205e-02 -16.609  < 2e-16 ***
## multi_unit              1.002e-02  1.168e-02   0.858 0.391177    
## unemployment_rate       8.991e-01  4.182e-02  21.498  < 2e-16 ***
## metroyes                5.595e-01  1.494e-01   3.745 0.000184 ***
## median_edubelow_hs     -7.387e+00  2.454e+00  -3.011 0.002628 ** 
## median_eduhs_diploma   -6.191e+00  6.086e-01 -10.174  < 2e-16 ***
## median_edusome_college -7.199e+00  5.623e-01 -12.802  < 2e-16 ***
## per_capita_income      -2.544e-04  2.200e-05 -11.565  < 2e-16 ***
## median_hh_income       -2.405e-04  9.801e-06 -24.533  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.337 on 3123 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.7387, Adjusted R-squared:  0.7377 
## F-statistic: 735.9 on 12 and 3123 DF,  p-value: < 2.2e-16

Next variable: multi_unit

county6 <- county5 |>
  select(-multi_unit)
fit4 <- lm(poverty ~ ., data = county6)

summary(fit4)
## 
## Call:
## lm(formula = poverty ~ ., data = county6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5773  -2.0338  -0.2671   1.6756  21.4812 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.188e+01  9.584e-01  54.131  < 2e-16 ***
## pop2010                 4.376e-06  4.071e-06   1.075  0.28246    
## pop2017                -3.597e-06  3.835e-06  -0.938  0.34840    
## pop_change              8.705e-02  1.781e-02   4.888 1.07e-06 ***
## homeownership          -2.075e-01  8.461e-03 -24.522  < 2e-16 ***
## unemployment_rate       9.000e-01  4.180e-02  21.529  < 2e-16 ***
## metroyes                5.855e-01  1.463e-01   4.003 6.40e-05 ***
## median_edubelow_hs     -7.453e+00  2.452e+00  -3.039  0.00239 ** 
## median_eduhs_diploma   -6.231e+00  6.067e-01 -10.270  < 2e-16 ***
## median_edusome_college -7.226e+00  5.614e-01 -12.872  < 2e-16 ***
## per_capita_income      -2.494e-04  2.121e-05 -11.759  < 2e-16 ***
## median_hh_income       -2.408e-04  9.793e-06 -24.589  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.337 on 3124 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.7387, Adjusted R-squared:  0.7378 
## F-statistic: 802.8 on 11 and 3124 DF,  p-value: < 2.2e-16

Next variable: pop2017

Final Model

county7 <- county6 |>
  select(-pop2017)
fit5 <- lm(poverty ~ ., data = county7)

summary(fit5)
## 
## Call:
## lm(formula = poverty ~ ., data = county7)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5299  -2.0256  -0.2671   1.6721  21.4893 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.180e+01  9.543e-01  54.276  < 2e-16 ***
## pop2010                 5.635e-07  2.111e-07   2.669  0.00765 ** 
## pop_change              8.202e-02  1.699e-02   4.829 1.44e-06 ***
## homeownership          -2.073e-01  8.457e-03 -24.506  < 2e-16 ***
## unemployment_rate       9.012e-01  4.178e-02  21.569  < 2e-16 ***
## metroyes                5.929e-01  1.460e-01   4.060 5.03e-05 ***
## median_edubelow_hs     -7.397e+00  2.452e+00  -3.017  0.00257 ** 
## median_eduhs_diploma   -6.180e+00  6.043e-01 -10.228  < 2e-16 ***
## median_edusome_college -7.174e+00  5.586e-01 -12.842  < 2e-16 ***
## per_capita_income      -2.485e-04  2.119e-05 -11.730  < 2e-16 ***
## median_hh_income       -2.410e-04  9.790e-06 -24.617  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.337 on 3125 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.7386, Adjusted R-squared:  0.7378 
## F-statistic:   883 on 10 and 3125 DF,  p-value: < 2.2e-16

There are still a total of 8 variables that are significant when predicting the poverty percentage. These are the final variables for the multiple linear model because they are the most significant at predicting poverty rates, and this final multiple linear model was created using backwards elimination.

The following most 8 most significant variables at predicting the poverty percentage are:

  • pop2010
  • pop_change
  • homeownership
  • unemployment_rate
  • metro
  • median_edu
  • per_capita_income
  • median_hh_income

We answer our research question; these variables are the most significant at predicting the poverty percentage of a county in the United States.

Equation: y = 5.635e-07(pop2010) + 8.202e-02(pop_change) + -2.073e-01(homeownership) + 9.012e-01(unemployment_rate) + 5.929e-01(metroyes) + -7.397e+00(median_edubelow_hs) + -6.180e+00(median_eduhs_diploma) + -7.174e+00(median_edusome_college) + -2.485e-04(per_capita_income) + -2.410e-04(median_hh_income) + 5.180e+01

Adjusted-R^2: 0.7378. 73.78% of the variation in poverty rates are explained by this model (these variables).

VI. Visualization

Since there are multiple very significant variables when predicting poverty rates, I will take a look at the median_edu variable with the poverty rates.

library(highcharter) # for interactive charts
library(RColorBrewer) # for color palettes

An external source was used to create this box plot. References are linked in the references section.

cols <- brewer.pal(6, "Pastel1") # Set the colors

box_plot <- data_to_boxplot(data = county7, poverty, median_edu, group_var = median_edu) # Create the mapping

highchart() |> # High charters
  hc_xAxis(type = "category") |> # Categorical Plot
  hc_add_series_list(box_plot) |>
  hc_title(text = "Box Plots of the Median Education level by Poverty Rates") |> # Title
  hc_xAxis(title = list(text = "Education")) |> # X-label
  hc_yAxis(title = list(text = "Poverty Percentage")) |> # Y-label
  hc_colors(cols) |> # Add colors
  hc_caption(text= "Data provided by U.S. Census Quick Facts.") |> # Caption
  hc_add_theme(hc_theme_darkunica()) # Theme. Check references for theme

Interpretation: There is a lot of useful insight in these box plots. We can see that for counties with their median education of below a high school education have the highest median poverty rate. For counties with a median education of a bachelors degree, which is the highest education level in this dataset, have the lowest median poverty rate.

I initially wanted to do shiny apps, but I realized that RPUBS doesn’t show the shiny apps if you create it, nor can you interact with it in a slide show. So I figured it’ll be best to do highcharts. If in the future I get the chance to do this again, then I will hopefully use shiny apps.