##Are a higher unemployment rate and a higher number of persons per household correlated with a higher poverty rate for children under 5 years of age?
The data set is “United States Counties”, which includes socioeconomic, educational, housing, and employment data gathered by the US Census Bureau, the Bureau of Labor Statistics, and the USDA Economic Research Service on 3,142 US counties. It contains 3,142 observations of 188 variables. For this project I will use only the following three columns, all for 2017: poverty_age_under_5_2017, unemployment_rate_2017, and persons_per_household_2017.
Link to data set: https://www.openintro.org/data/index.php?data=county_complete
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
CNTY_Data <- read.csv("county_complete.csv")
In the next three chunks I will look at the structure of the data, count the total observations, count and remove missing values, look at the summary statistics of the columns I’m using, and create a data frame that includes only the three columns used to answer the project’s question. This new data frame will then be used to plot histograms showing the distribution of each variable.
Structure and count.
str(CNTY_Data)
## 'data.frame': 3142 obs. of 188 variables:
## $ fips : int 1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
## $ state : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ name : chr "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
## $ pop2000 : int 43671 140415 29038 20826 51024 11714 21399 112249 36583 23988 ...
## $ pop2010 : int 54571 182265 27457 22915 57322 10914 20947 118572 34215 25989 ...
## $ pop2011 : int 55199 186534 27351 22745 57562 10675 20880 117785 34031 25993 ...
## $ pop2012 : int 54927 190048 27175 22658 57595 10612 20688 117219 34092 25958 ...
## $ pop2013 : int 54695 194736 26947 22503 57623 10549 20372 116482 34122 26014 ...
## $ pop2014 : int 54864 199064 26749 22533 57546 10673 20327 115941 33948 25897 ...
## $ pop2015 : int 54838 202863 26264 22561 57590 10419 20141 115505 33968 25741 ...
## $ pop2016 : int 55278 207509 25774 22633 57562 10441 19965 114980 33717 25766 ...
## $ pop2017 : int 55504 212628 25270 22668 58013 10309 19825 114728 33713 25857 ...
## $ age_under_5_2010 : num 6.6 6.1 6.2 6 6.3 6.8 6.5 6.1 5.7 5.3 ...
## $ age_under_5_2017 : num 5.7 5.7 5.5 5.7 6.1 5.8 5.9 5.7 6.1 4.5 ...
## $ age_under_18_2010 : num 26.8 23 21.9 22.7 24.6 22.3 24.1 22.9 22.5 21.4 ...
## $ age_over_65_2010 : num 12 16.8 14.2 12.7 14.7 13.5 16.7 14.3 16.7 17.9 ...
## $ age_over_65_2017 : num 14.3 19 17.4 15.1 17.4 15.2 18.5 16.5 18.6 21 ...
## $ median_age_2017 : num 37.8 42.6 39.7 39.8 40.9 40.8 40.7 39.1 43 46.1 ...
## $ female_2010 : num 51.3 51.1 46.9 46.3 50.5 45.8 53 51.8 52.2 50.4 ...
## $ white_2010 : num 78.5 85.7 48 75.8 92.6 23 54.4 74.9 58.8 92.7 ...
## $ black_2010 : num 17.7 9.4 46.9 22 1.3 70.2 43.4 20.6 38.7 4.6 ...
## $ black_2017 : num 9.55 4.77 24.02 11.03 0.79 ...
## $ native_2010 : num 0.4 0.7 0.4 0.3 0.5 0.2 0.3 0.5 0.2 0.5 ...
## $ native_2017 : num 0.15 0.41 0.1 0.18 0.18 0.52 0.03 0.18 0.14 0.24 ...
## $ asian_2010 : num 0.9 0.7 0.4 0.1 0.2 0.2 0.8 0.7 0.5 0.2 ...
## $ asian_2017 : num 0.47 0.35 0.31 0 0.07 0.35 0.56 0.5 0.5 0.1 ...
## $ pac_isl_2010 : num NA NA NA NA NA NA 0 0.1 0 0 ...
## $ pac_isl_2017 : num 0.04 0 0 0 0 0 0 0 0 0 ...
## $ other_single_race_2017 : num 0.65 0.39 1.87 0.02 0.37 0.01 0.03 0.63 0.35 0.1 ...
## $ two_plus_races_2010 : num 1.6 1.5 0.9 0.9 1.2 0.8 0.8 1.7 1.1 1.5 ...
## $ two_plus_races_2017 : num 0.84 0.82 0.41 0.42 0.85 0.33 0.74 1.14 0.49 0.52 ...
## $ hispanic_2010 : num 2.4 4.4 5.1 1.8 8.1 7.1 0.9 3.3 1.6 1.2 ...
## $ hispanic_2017 : num 2.67 4.44 4.21 2.35 9.01 0.33 0.32 3.57 2.15 1.58 ...
## $ white_not_hispanic_2010 : num 77.2 83.5 46.8 75 88.9 21.9 54.1 73.6 58.1 92.1 ...
## $ white_not_hispanic_2017 : num 75.4 83.1 45.7 74.6 87.4 ...
## $ speak_english_only_2017 : num 96.2 94.5 94.3 97.8 92.3 97.2 98.5 95.9 98.6 99 ...
## $ no_move_in_one_plus_year_2010 : num 86.3 83 83 90.5 87.2 88.5 92.8 82.9 86.2 88.1 ...
## $ foreign_born_2010 : num 2 3.6 2.8 0.7 4.7 1.1 1.1 2.5 0.9 0.5 ...
## $ foreign_spoken_at_home_2010 : num 3.7 5.5 4.7 1.5 7.2 3.8 1.6 4.5 1.6 1.4 ...
## $ women_16_to_50_birth_rate_2017 : num 7.4 5.1 7.2 7.6 5.6 3.5 4.8 5.2 4.3 3.8 ...
## $ hs_grad_2010 : num 85.3 87.6 71.9 74.5 74.7 74.7 74.8 78.5 71.8 73.4 ...
## $ hs_grad_2016 : num 87.6 90 73.8 80.7 80 66.6 81.1 82.4 80.3 81.4 ...
## $ hs_grad_2017 : num 87.7 90.2 73.1 82.1 79.8 71.4 81.1 83.2 80.9 79.5 ...
## $ some_college_2016 : num 28.7 31.8 26 26.9 34 22.2 25.1 32.6 28.4 31.4 ...
## $ some_college_2017 : num 29.1 31.6 25.5 25 34.4 21.3 24.5 33.2 29.1 28.9 ...
## $ bachelors_2010 : num 21.7 26.8 13.5 10 12.5 12 11 16.1 10.8 10.5 ...
## $ bachelors_2016 : num 24.6 29.5 12.9 12 13.1 10.3 16.1 17.7 12.5 14 ...
## $ bachelors_2017 : num 25 30.7 12 13.2 13.1 13.4 16.1 17.9 13.3 12.5 ...
## $ veterans_2010 : int 5817 20396 2327 1883 4072 943 1675 11757 2893 2172 ...
## $ veterans_2017 : num 12.6 11.9 8 7.4 9.6 4.5 8.4 10.9 9.2 11.3 ...
## $ mean_work_travel_2010 : num 25.1 25.8 23.8 28.3 33.2 28.1 25.1 22.1 23.6 26.2 ...
## $ mean_work_travel_2017 : num 25.8 27 23.4 30 35 29.8 23.2 24.8 23.6 26.5 ...
## $ broadband_2017 : num 76.6 74.5 57.2 62 65.8 49.4 58.2 71 62.8 67.5 ...
## $ computer_2017 : num 86.2 86.9 73.4 74.8 78.2 64.2 68.3 82.9 72.7 79.4 ...
## $ housing_units_2010 : int 22135 104061 11829 8981 23887 4493 9964 53289 17004 16267 ...
## $ homeownership_2010 : num 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
## $ housing_multi_unit_2010 : num 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
## $ median_val_owner_occupied_2010 : int 133900 177200 88200 81200 113700 66300 70200 98200 82200 97100 ...
## $ households_2010 : int 19718 69476 9795 7441 20605 3732 8019 46421 13681 11352 ...
## $ households_2017 : int 21054 76133 9191 6916 20690 3670 7050 45099 13694 10795 ...
## $ persons_per_household_2010 : num 2.7 2.5 2.52 3.02 2.73 2.85 2.58 2.46 2.51 2.22 ...
## $ persons_per_household_2017 : num 2.59 2.63 2.54 2.97 2.76 2.74 2.81 2.49 2.44 2.37 ...
## $ per_capita_income_2010 : int 24568 26469 15875 19918 21070 20289 16916 20574 16626 21322 ...
## $ per_capita_income_2017 : num 27842 27780 17892 20572 21367 ...
## $ metro_2013 : int 1 1 0 1 1 0 0 1 0 0 ...
## $ median_household_income_2010 : int 53255 50147 33219 41770 45549 31602 30659 38407 31467 40690 ...
## $ median_household_income_2016 : int 54487 56460 32884 43079 47213 34278 35409 41778 39530 41456 ...
## $ median_household_income_2017 : int 55317 52562 33368 43404 47412 29655 36326 43686 37342 40041 ...
## $ private_nonfarm_establishments_2009 : int 877 4812 522 318 749 120 446 2444 568 350 ...
## $ private_nonfarm_employment_2009 : int 10628 52233 7990 2927 6968 1919 5400 38324 6241 3600 ...
## $ percent_change_private_nonfarm_employment_2009: num 16.6 17.4 -27 -14 -11.4 -18.5 2.1 -5.6 -45.8 5.4 ...
## $ nonemployment_establishments_2009 : int 2971 14175 1527 1192 3501 390 1180 6329 2074 1627 ...
## $ firms_2007 : int 4067 19035 1667 1385 4458 417 1769 8713 1981 2180 ...
## $ black_owned_firms_2007 : num 15.2 2.7 NA 14.9 NA NA NA 7.2 NA NA ...
## $ native_owned_firms_2007 : num NA 0.4 NA NA NA NA NA NA NA NA ...
## $ asian_owned_firms_2007 : num 1.3 1 NA NA NA NA 3.3 1.6 NA NA ...
## $ pac_isl_owned_firms_2007 : num NA NA NA NA NA NA NA NA NA NA ...
## $ hispanic_owned_firms_2007 : num 0.7 1.3 NA NA NA NA NA 0.5 NA NA ...
## $ women_owned_firms_2007 : num 31.7 27.3 27 NA 23.2 38.8 NA 24.7 29.3 14.5 ...
## $ manufacturer_shipments_2007 : int NA 1410273 NA 0 341544 NA 399132 2679991 667283 307439 ...
## $ mercent_whole_sales_2007 : int NA NA NA NA NA NA 56712 NA NA 62293 ...
## $ sales_2007 : int 598175 2966489 188337 124707 319700 43810 229277 1542981 264650 186321 ...
## $ sales_per_capita_2007 : int 12003 17166 6334 5804 5622 3995 11326 13678 7620 7613 ...
## $ accommodation_food_service_2007 : int 88157 436955 NA 10757 20941 3670 28427 186533 23237 13948 ...
## $ building_permits_2010 : int 191 696 10 8 18 1 3 107 10 6 ...
## $ fed_spending_2009 : int 331142 1119082 240308 163201 294114 108846 195055 1830659 294718 184642 ...
## $ area_2010 : num 594 1590 885 623 645 ...
## $ density_2010 : num 91.8 114.6 31 36.8 88.9 ...
## $ smoking_ban_2010 : chr "none" "none" "partial" "none" ...
## $ poverty_2010 : num 10.6 12.2 25 12.6 13.4 25.3 25 19.5 20.3 17.6 ...
## $ poverty_2016 : num 13.5 11.7 29.9 20.1 14.1 32.6 24.8 17.1 19.9 16.8 ...
## $ poverty_2017 : num 13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
## $ poverty_age_under_5_2017 : num 17.2 19.4 56.8 21.6 29.5 59.7 30.1 31.1 31.9 12.8 ...
## $ poverty_age_under_18_2017 : num 20 15.9 44.9 25.9 25.3 50.2 34.8 26.3 28.9 20.1 ...
## $ civilian_labor_force_2007 : int 24383 82659 10334 8791 26629 3653 9099 54861 15474 11984 ...
## $ employed_2007 : int 23577 80099 9684 8432 25780 3308 8539 52709 14469 11484 ...
## $ unemployed_2007 : int 806 2560 650 359 849 345 560 2152 1005 500 ...
## $ unemployment_rate_2007 : num 3.31 3.1 6.29 4.08 3.19 9.44 6.15 3.92 6.49 4.17 ...
## $ civilian_labor_force_2008 : int 24687 83223 10161 8749 26698 3634 9051 54564 15012 11996 ...
## [list output truncated]
count(CNTY_Data)
## n
## 1 3142
Summary statistics and missing values.
summary(CNTY_Data$persons_per_household_2017)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.400 2.360 2.480 2.517 2.630 4.130 2
summary(CNTY_Data$unemployment_rate_2017)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.620 3.520 4.360 4.611 5.355 19.070 3
summary(CNTY_Data$poverty_age_under_5_2017)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 16.50 23.80 25.17 32.30 90.50 4
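The three summaries report 2, 3, and 4 missing values, yet the regression output further down drops only 4 rows, which suggests the missing values overlap within the same counties. The sketch below shows how per-column and per-row counts can differ. It uses a small toy data frame with hypothetical values (not the real county data) so that it runs on its own; on the real data, the same calls could be applied to the three columns of CNTY_Data.

```r
# Toy stand-in for CNTY_Data: hypothetical values, NOT the real county data
toy <- data.frame(
  persons_per_household_2017 = c(2.59, NA, 2.54, 2.97),
  unemployment_rate_2017     = c(3.9, 4.0, NA, 4.4),
  poverty_age_under_5_2017   = c(17.2, 19.4, NA, 21.6)
)

# Missing values counted column by column
na_per_column <- colSums(is.na(toy))

# Rows with at least one missing value (what filtering or lm() actually drops)
rows_with_na <- sum(!complete.cases(toy))

# Keep only complete rows
toy_clean <- toy[complete.cases(toy), ]

print(na_per_column)
print(rows_with_na)
print(nrow(toy_clean))
```

Here the column counts add up to 3, but only 2 rows are removed, because one row is missing two values at once.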
New data frame and histograms.
CNTY_Analysis <- CNTY_Data |>
  select(persons_per_household_2017, unemployment_rate_2017, poverty_age_under_5_2017) |>
  filter(!is.na(persons_per_household_2017),
         !is.na(unemployment_rate_2017),
         !is.na(poverty_age_under_5_2017))
summary(CNTY_Analysis)
## persons_per_household_2017 unemployment_rate_2017 poverty_age_under_5_2017
## Min. :1.830 Min. : 1.620 Min. : 0.00
## 1st Qu.:2.360 1st Qu.: 3.520 1st Qu.:16.50
## Median :2.480 Median : 4.360 Median :23.80
## Mean :2.518 Mean : 4.611 Mean :25.17
## 3rd Qu.:2.630 3rd Qu.: 5.357 3rd Qu.:32.30
## Max. :4.130 Max. :19.070 Max. :90.50
ggplot(CNTY_Analysis, aes(x = persons_per_household_2017)) +
  geom_histogram(binwidth = 0.25, fill = "#1f77b4", color = "black") +
  labs(title = "Persons per household", x = "Persons per household", y = "Number of counties") +
  theme_minimal()
ggplot(CNTY_Analysis, aes(x = unemployment_rate_2017)) +
  geom_histogram(binwidth = 1, fill = "#1f77b4", color = "black") +
  labs(title = "Unemployment rate", x = "Unemployment rate", y = "Number of counties") +
  theme_minimal()
ggplot(CNTY_Analysis, aes(x = poverty_age_under_5_2017)) +
  geom_histogram(binwidth = 1, fill = "#1f77b4", color = "black") +
  labs(title = "Poverty rate for children under 5", x = "Percentage", y = "Number of counties") +
  theme_minimal()
Multiple Linear Regression
I will use a multiple linear regression to analyze the data.
CNTY_MLG2 <- lm(poverty_age_under_5_2017 ~ persons_per_household_2017 + unemployment_rate_2017, data = CNTY_Data)
summary(CNTY_MLG2)
##
## Call:
## lm(formula = poverty_age_under_5_2017 ~ persons_per_household_2017 +
## unemployment_rate_2017, data = CNTY_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.595 -7.367 -1.060 5.936 61.283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7759 1.9439 4.000 6.48e-05 ***
## persons_per_household_2017 0.3627 0.7727 0.469 0.639
## unemployment_rate_2017 3.5738 0.1231 29.028 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.21 on 3135 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.2175, Adjusted R-squared: 0.217
## F-statistic: 435.8 on 2 and 3135 DF, p-value: < 2.2e-16
Coefficients: intercept (7.7759), persons per household (0.3627), unemployment rate (3.5738). Residual standard error: 11.21. P-values: persons per household (0.639), unemployment rate (<2e-16). R-squared: 0.2175.
These results indicate that, holding the other predictor constant, one extra person per household is associated with a 0.36 percentage point increase in a county's poverty rate for children under five, and a one percentage point increase in the unemployment rate is associated with a 3.57 percentage point increase. However, only unemployment rate has a statistically significant p-value, so according to this model it is the only significant predictor of the two. The R-squared value (0.2175) indicates that the model explains 21.75% of the variability in county poverty rates for children under 5.
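To make the coefficients concrete, the sketch below plugs the median values from the summary statistics (2.48 persons per household, 4.36% unemployment) into the fitted equation and builds an approximate 95% confidence interval for the unemployment coefficient. All numbers are copied from the output above. The helper name predict_poverty is hypothetical, and the snippet is an illustration, not part of the original analysis.

```r
# Coefficients and standard error copied from the summary(CNTY_MLG2) output
b0       <- 7.7759   # intercept
b_pph    <- 0.3627   # persons_per_household_2017
b_unemp  <- 3.5738   # unemployment_rate_2017
se_unemp <- 0.1231   # standard error of the unemployment coefficient

# Hypothetical helper: predicted under-5 poverty rate (in percent)
predict_poverty <- function(pph, unemp) {
  b0 + b_pph * pph + b_unemp * unemp
}

# A county at the median of both predictors
median_county <- predict_poverty(pph = 2.48, unemp = 4.36)

# Same county with unemployment one percentage point higher
higher_unemp <- predict_poverty(pph = 2.48, unemp = 5.36)

# Approximate 95% confidence interval, using the 3135 residual degrees of freedom
ci_unemp <- b_unemp + c(-1, 1) * qt(0.975, df = 3135) * se_unemp

print(round(median_county, 2))
print(round(higher_unemp - median_county, 2))
print(round(ci_unemp, 2))
```

The predicted rate for the median county is about 24.3%, close to the observed median of 23.8%, and the one-point change in unemployment shifts the prediction by exactly the coefficient, 3.57 percentage points.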
##Assumptions and Diagnostics
Linearity
Linearity is not satisfied: the lines in the component-plus-residual plots are not straight, and the residuals are unevenly distributed around them, with most clustering on the left side.
crPlots(CNTY_MLG2)
Independence
Independence is satisfied: the residuals are spread evenly around the zero line, and there are no noticeable patterns or clusters.
plot(resid(CNTY_MLG2), type="b",
main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)
Homoscedasticity and normality of residuals
Homoscedasticity is not satisfied: the Residuals vs Fitted and Scale-Location plots both show unevenly distributed residuals, with clusters on the left side. The trend line in the Scale-Location plot is also not horizontal.
Normality is satisfied: the Q-Q plot shows only very slight tails at both ends. The Residuals vs Leverage plot has a few outliers, but none appears to be very influential.
par(mfrow=c(2,2)); plot(CNTY_MLG2); par(mfrow=c(1,1))
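The visual check for homoscedasticity could be supplemented with a formal test. The sketch below computes a studentized Breusch-Pagan style statistic by hand in base R: it regresses the squared residuals on the fitted values and compares n times the auxiliary R-squared to a chi-squared distribution with 1 degree of freedom. It runs on simulated data (an assumption, so that the example is self-contained), not on the county data; the same steps could be applied to CNTY_MLG2.

```r
# Simulated data with error spread that grows with x (NOT the county data)
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on fitted values
aux <- lm(resid(fit)^2 ~ fitted(fit))

# Test statistic n * R^2, compared to chi-squared with 1 degree of freedom
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(bp_stat)
print(bp_p)
```

A small p-value is evidence against constant residual variance. The car package, already loaded in this report, also provides ncvTest() for a related score test.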
Multicollinearity
There is no problematic multicollinearity between the predictors: the correlation matrix shows a correlation of 0.17 between them, which is very low and indicates only a weak relationship.
cor(CNTY_Analysis[, c("persons_per_household_2017", "unemployment_rate_2017")], use = "complete.obs")
## persons_per_household_2017 unemployment_rate_2017
## persons_per_household_2017 1.0000000 0.1668269
## unemployment_rate_2017 0.1668269 1.0000000
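The correlation can also be expressed as a variance inflation factor (VIF). With exactly two predictors, both share the same VIF, equal to 1 / (1 - r^2), where r is their correlation. The sketch below applies that formula to the value printed above; a common rule of thumb, used here as an assumption, treats values under about 5 as unproblematic.

```r
# Correlation between the two predictors, copied from the matrix above
r <- 0.1668269

# With exactly two predictors, both have the same VIF: 1 / (1 - r^2)
vif_two_predictors <- 1 / (1 - r^2)

print(round(vif_two_predictors, 3))

# On the fitted model, the equivalent call would be car::vif(CNTY_MLG2)
```

A VIF of about 1.03 means the variance of each coefficient estimate is inflated by roughly 3% relative to uncorrelated predictors, which supports the conclusion drawn from the correlation matrix.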
Conclusions
The main takeaway from this analysis is that, of the two predictors (number of persons per household and unemployment rate), only unemployment rate has a statistically significant association with the poverty rate for children under five years of age: the poverty rate increases by about 3.57 percentage points for every one percentage point increase in a county's unemployment rate. The main limitation of this model is its low predictive power, with the two predictors accounting for only 21.75% of the variability in the poverty rate. The next step in researching the factors that affect the poverty rate for children under 5 would be to remove persons per household from the model and add different predictors to try to raise its predictive power.