A. Cleaning
Check for missing (NA) values in each column.
## name state pop2000 pop2010
## 0 0 3 0
## pop2017 pop_change poverty homeownership
## 3 3 2 0
## multi_unit unemployment_rate metro median_edu
## 0 3 3 2
## per_capita_income median_hh_income smoking_ban
## 2 2 580
Several variables contain NA values, most notably smoking_ban with 580.
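The per-column counts above can be produced in several ways. The sketch below shows one common base R approach on a small made-up data frame; `toy` and its values are assumptions for illustration and are not the real `county` data.

```r
# Hypothetical toy data standing in for `county` (values are made up).
toy <- data.frame(
  name    = c("A County", "B County", "C County"),
  poverty = c(13.7, NA, 27.2),
  metro   = c("yes", NA, NA)
)

# is.na() on a data frame returns a logical matrix of the same shape;
# colSums() then counts the TRUE entries (the NAs) in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Swapping `toy` for the real data frame should give a named vector with one NA count per variable.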
Since we’re focusing on poverty, we will only remove rows with a missing
poverty value at this stage. For visualizations and the final model, we
will drop the remaining NA values at the point of use (if R doesn’t
already do so by default).
# Keep only the rows where poverty is recorded (not NA)
county_clean <- county |>
  filter(!is.na(poverty))
head(county_clean)
## # A tibble: 6 × 15
## name state pop2000 pop2010 pop2017 pop_change poverty homeownership
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Autauga County Alaba… 43671 54571 55504 1.48 13.7 77.5
## 2 Baldwin County Alaba… 140415 182265 212628 9.19 11.8 76.7
## 3 Barbour County Alaba… 29038 27457 25270 -6.22 27.2 68
## 4 Bibb County Alaba… 20826 22915 22668 0.73 15.2 82.9
## 5 Blount County Alaba… 51024 57322 58013 0.68 15.6 82
## 6 Bullock County Alaba… 11714 10914 10309 -2.28 28.5 76.9
## # ℹ 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>, metro <chr>,
## # median_edu <chr>, per_capita_income <dbl>, median_hh_income <dbl>,
## # smoking_ban <chr>
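One subtlety when filtering out missing values: testing with `!is.na()` and comparing a column against `is.na()` of itself are not equivalent. The sketch below illustrates the difference on a made-up tibble; `toy` and its values are assumptions, chosen so that one row has a poverty value of exactly 0.

```r
library(dplyr)

# Hypothetical toy data (values are made up); row "b" has poverty == 0.
toy <- tibble(
  name    = c("a", "b", "c", "d"),
  poverty = c(13.7, 0, NA, 27.2)
)

# Explicit NA test: keeps every row with a recorded value, including 0.
kept_explicit <- toy |> filter(!is.na(poverty))

# Comparison form: is.na() is FALSE for recorded values, and FALSE is
# coerced to 0 in the comparison, so a poverty of exactly 0 is dropped
# together with the NA row.
kept_compare <- toy |> filter(poverty != is.na(poverty))

nrow(kept_explicit)
nrow(kept_compare)
```

On this toy data the explicit test keeps three rows and the comparison form keeps two. In the real data the minimum poverty value is 2.40, so both forms would keep the same rows, but the explicit test states the intent directly.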
B. EDA
Check the structure of the dataset.
## spc_tbl_ [3,142 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name : chr [1:3142] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
## $ state : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ pop2000 : num [1:3142] 43671 140415 29038 20826 51024 ...
## $ pop2010 : num [1:3142] 54571 182265 27457 22915 57322 ...
## $ pop2017 : num [1:3142] 55504 212628 25270 22668 58013 ...
## $ pop_change : num [1:3142] 1.48 9.19 -6.22 0.73 0.68 -2.28 -2.69 -1.51 -1.2 -0.6 ...
## $ poverty : num [1:3142] 13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
## $ homeownership : num [1:3142] 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
## $ multi_unit : num [1:3142] 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
## $ unemployment_rate: num [1:3142] 3.86 3.99 5.9 4.39 4.02 4.93 5.49 4.93 4.08 4.05 ...
## $ metro : chr [1:3142] "yes" "yes" "no" "yes" ...
## $ median_edu : chr [1:3142] "some_college" "some_college" "hs_diploma" "hs_diploma" ...
## $ per_capita_income: num [1:3142] 27842 27780 17892 20572 21367 ...
## $ median_hh_income : num [1:3142] 55317 52562 33368 43404 47412 ...
## $ smoking_ban : chr [1:3142] "none" "none" "partial" "none" ...
## - attr(*, "spec")=
## .. cols(
## .. name = col_character(),
## .. state = col_character(),
## .. pop2000 = col_double(),
## .. pop2010 = col_double(),
## .. pop2017 = col_double(),
## .. pop_change = col_double(),
## .. poverty = col_double(),
## .. homeownership = col_double(),
## .. multi_unit = col_double(),
## .. unemployment_rate = col_double(),
## .. metro = col_character(),
## .. median_edu = col_character(),
## .. per_capita_income = col_double(),
## .. median_hh_income = col_double(),
## .. smoking_ban = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Every variable was read in with its intended class: numeric for the
counts, rates, and incomes, and character for the names and categories.
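Reading the structure output by eye works for 15 columns. As an optional cross-check, the classes can also be compared against an expected set in code. This is a minimal sketch on a made-up data frame; `toy` and the `expected` vector are assumptions for illustration.

```r
# Hypothetical toy data with a subset of the column names (values made up).
toy <- data.frame(
  name    = c("A County", "B County"),
  pop2010 = c(54571, 182265),
  metro   = c("yes", "no"),
  stringsAsFactors = FALSE
)

# The classes we intend each column to have.
expected <- c(name = "character", pop2010 = "numeric", metro = "character")

# Take the first class of each column and compare with the expectation.
actual <- vapply(toy, function(col) class(col)[1], character(1))
classes_match <- identical(actual, expected)
print(actual)
```

`classes_match` is a single logical value, so it can be wrapped in `stopifnot()` at the top of an analysis to catch an unexpected import type early.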
Check the first 10 unique county names.
head(unique(county$name), 10)
## [1] "Autauga County" "Baldwin County" "Barbour County" "Bibb County"
## [5] "Blount County" "Bullock County" "Butler County" "Calhoun County"
## [9] "Chambers County" "Cherokee County"
Check the unique values for the state variable.
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "District of Columbia"
## [10] "Florida" "Georgia" "Hawaii"
## [13] "Idaho" "Illinois" "Indiana"
## [16] "Iowa" "Kansas" "Kentucky"
## [19] "Louisiana" "Maine" "Maryland"
## [22] "Massachusetts" "Michigan" "Minnesota"
## [25] "Mississippi" "Missouri" "Montana"
## [28] "Nebraska" "Nevada" "New Hampshire"
## [31] "New Jersey" "New Mexico" "New York"
## [34] "North Carolina" "North Dakota" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania"
## [40] "Rhode Island" "South Carolina" "South Dakota"
## [43] "Tennessee" "Texas" "Utah"
## [46] "Vermont" "Virginia" "Washington"
## [49] "West Virginia" "Wisconsin" "Wyoming"
This dataset treats the District of Columbia as a state, so the state
variable has 51 unique values.
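The count of distinct values can also be confirmed in code instead of from the printed indices. The sketch below uses a short made-up vector; `states` is an assumption standing in for the state column.

```r
# Hypothetical stand-in for the state column (a few repeated values).
states <- c("Alabama", "Alabama", "Alaska", "District of Columbia", "Alaska")

# Number of distinct values, and whether DC is among them.
n_states <- length(unique(states))
has_dc   <- "District of Columbia" %in% states

n_states
has_dc
```

Applied to the real column, the same two lines should report the number of distinct state values and confirm that the District of Columbia is one of them.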
unique(county$metro)
## [1] "yes" "no" NA
unique(county$median_edu)
## [1] "some_college" "hs_diploma" NA "bachelors" "below_hs"
unique(county$smoking_ban)
## [1] "none" "partial" NA
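`unique()` shows which values occur but not how often. When a variable has many missing entries, as smoking_ban does, a frequency table that includes NA can be more informative. This is a minimal sketch on a made-up vector; `smoking_ban` here is an assumption and not the real column.

```r
# Hypothetical stand-in for a categorical column with missing values.
smoking_ban <- c("none", "partial", NA, "none", NA, NA)

# useNA = "ifany" adds an NA category to the table when NAs are present.
ban_counts <- table(smoking_ban, useNA = "ifany")
print(ban_counts)
```

Without `useNA`, `table()` silently leaves the NA entries out, so the counts would not add up to the number of rows.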
Create a summary of the variables in the cleaned dataset.
## name state pop2000 pop2010
## Length:3140 Length:3140 Min. : 67 Min. : 82
## Class :character Class :character 1st Qu.: 11236 1st Qu.: 11118
## Mode :character Mode :character Median : 24653 Median : 25890
## Mean : 89701 Mean : 98318
## 3rd Qu.: 61792 3rd Qu.: 66898
## Max. :9519338 Max. :9818605
## NA's :3
## pop2017 pop_change poverty homeownership
## Min. : 88 Min. :-33.6300 Min. : 2.40 Min. : 0.00
## 1st Qu.: 10981 1st Qu.: -1.9700 1st Qu.:11.30 1st Qu.:69.50
## Median : 25862 Median : -0.0700 Median :15.20 Median :74.60
## Mean : 103822 Mean : 0.5329 Mean :15.97 Mean :73.28
## 3rd Qu.: 67773 3rd Qu.: 2.3700 3rd Qu.:19.40 3rd Qu.:78.40
## Max. :10163507 Max. : 37.1900 Max. :52.00 Max. :91.30
## NA's :3 NA's :3
## multi_unit unemployment_rate metro median_edu
## Min. : 0.00 Min. : 1.620 Length:3140 Length:3140
## 1st Qu.: 6.10 1st Qu.: 3.520 Class :character Class :character
## Median : 9.70 Median : 4.360 Mode :character Mode :character
## Mean :12.33 Mean : 4.611
## 3rd Qu.:15.90 3rd Qu.: 5.355
## Max. :98.50 Max. :19.070
## NA's :1
## per_capita_income median_hh_income smoking_ban
## Min. :10467 Min. : 19264 Length:3140
## 1st Qu.:21772 1st Qu.: 41126 Class :character
## Median :25445 Median : 48073 Mode :character
## Mean :26093 Mean : 49765
## 3rd Qu.:29276 3rd Qu.: 55771
## Max. :69533 Max. :129588
##
The mean county-level poverty rate is 15.97% (median 15.20%). This is an
unweighted average across counties, so each county counts equally
regardless of its population.
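An average across counties is not the same as a population-weighted rate, because small and large counties contribute equally to the former. The sketch below contrasts the two on made-up numbers; `poverty` and `pop` are assumptions for illustration, and the weighting by population is only one possible choice.

```r
# Hypothetical county-level values (made up); the last county has no
# recorded poverty rate.
poverty <- c(10, 20, 30, NA)
pop     <- c(1000, 100, 100, 50)

# Unweighted mean across counties, ignoring the missing value.
county_mean <- mean(poverty, na.rm = TRUE)

# Population-weighted mean over the counties with a recorded rate.
ok <- !is.na(poverty)
weighted_rate <- weighted.mean(poverty[ok], w = pop[ok])

county_mean
weighted_rate
```

Here the unweighted mean is 20 while the weighted mean is 12.5, because the largest county has the lowest rate. Which figure is appropriate depends on whether the question is about typical counties or about people.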