In this session we’re going to:
1. Get data into R! We will scrape an HTML table and load data from Excel files, local CSVs and CSVs on the web.
2. Clean data in R! We will practice cleaning up column names, converting data types, combining multiple data sets and manipulating strings. There’s also some reshaping at the end, because it’s a crucial step in getting data ready for analysis.
You will walk away with hands-on practice in both.
We’ll need to load several packages for this class.
library(tidyverse) # we'll be using mostly tidyr and stringr packages within the tidyverse, handy for data cleaning and string manipulation respectively.
library(lubridate) # this package allows us to manipulate dates
library(here) # easier file paths
library(readxl) # read data from Excel files
library(googlesheets4) #write to and pull data from Google Sheets
library(janitor) #data cleaning package - clean_names() is great
library(campfin) # developed for the Accountability Project to clean campaign finance data, but it has tons of general data-cleaning tools
library(rvest) #web scraping
Let’s say an EPA webpage has a table with some useful information: the percentage of each state’s area with groundwater nitrate contamination. You want to know how Nebraska is doing compared to other states, but there’s no sort button to view a column in descending order. So let’s scrape it.
The rvest package can help us do that:
read_html(): reads the raw HTML of a web page.
html_element(): selects the first element matching a CSS selector, such as "table".
html_table(): parses an HTML table into a data frame.
Tips are included as comments in the code below:
epa_url <- "https://www.epa.gov/nutrient-policy-data/estimated-nitrate-concentrations-groundwater-used-drinking"
## creating a variable for the url
epa_tab <- read_html(epa_url) %>% ##read in the html page
html_element("table") %>% ##select the "table" element
html_table() ##parse the html table into data frame
# Clean up column names
names(epa_tab) <- c("state","area","percent","private_perc_2005","private_perc_2015")
Now you can manipulate the data frame. But first let’s check the data
types with str().
str(epa_tab)
## tibble [50 × 5] (S3: tbl_df/tbl/data.frame)
## $ state : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ area : chr [1:50] "5" "No data" "216" "0" ...
## $ percent : chr [1:50] "0" "No data" "0" "0" ...
## $ private_perc_2005: int [1:50] 11 35 4 7 7 6 24 10 10 18 ...
## $ private_perc_2015: int [1:50] 11 26 3 5 4 5 24 19 12 15 ...
Some columns are character type, but to perform any sorting they need to be numeric. Let’s use the following functions to convert them:
mutate(): creates or modifies columns.
epa_tab <- epa_tab %>% mutate(area = as.numeric(area))
across(): applies a function to several columns at once inside mutate().
## to double check: epa_tab %>% mutate(across(2:4, as.numeric)) %>% View()
epa_tab <- epa_tab %>% mutate(across(2:4, as.numeric))
You might have seen “Warning: NAs introduced by coercion.” That’s because the “No data” entries are character strings that R couldn’t turn into numbers, so it coerced them to NA.
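If you want to confirm which rows were coerced, here’s a quick check (a sketch, using the column names we assigned above):
epa_tab %>% filter(is.na(area)) %>% select(state, area, percent)
## should show the "No data" states, such as Alaska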
Now let’s rank the states based on the percentage of area with nitrate contamination, from highest to lowest.
epa_tab %>% arrange(desc(percent))
## # A tibble: 50 × 5
## state area percent private_perc_2005 private_perc_2015
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Oklahoma NA 12 8 9
## 2 Kansas NA 11 5 5
## 3 Delaware 189 10 10 19
## 4 Maryland 347 4 17 24
## 5 Texas NA 4 10 5
## 6 Nebraska NA 2 18 9
## 7 Pennsylvania 772 2 20 27
## 8 California NA 1 7 4
## 9 Colorado NA 1 6 5
## 10 Iowa 463 1 18 16
## # ℹ 40 more rows
Now that the data are numeric you can also perform calculations across multiple columns, for example the change in the percentage of residents on a private well in 2015 vs. 2005.
epa_tab <- epa_tab %>%
mutate(perc_change = private_perc_2015-private_perc_2005)
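For instance, to see which states’ private-well share changed the most, sort on the new column:
epa_tab %>% arrange(desc(perc_change)) %>% select(state, perc_change)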
We are going to work with court records from an eviction court in Nebraska. The data is maintained by the administrative office of the courts and was provided to the Flatwater Free Press by a researcher who originally requested it.
eviction_p1 <- read_xlsx("../data/evictions 2020-01-01 to 2021-01-01.xlsx") %>% clean_names()
eviction_p2 <- read_xlsx("../data/evictions 2019-04-01 to 2019-12-31.xlsx") %>% clean_names()
eviction_p3 <- read_xlsx("../data/evictions 04-01-2017 to 03-31-2019.xlsx") %>% clean_names()
We can easily combine data files with the same structure using bind_rows(). Note that the column types need to be consistent across data frames, so first let’s make sure the date columns are read in as dates.
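If we tried bind_rows() right away, dplyr would refuse to combine a character date column with a datetime one; a sketch of the failure mode (error text paraphrased):
# bind_rows(eviction_p1, eviction_p3)
# Error: can't combine `file_date` <datetime> and `file_date` <character>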
We can look at specific columns using select(). And if we don’t want to type out each column name, we can use ends_with() or starts_with() to select columns based on a string pattern.
eviction_p1 %>% select(ends_with("date"))
## # A tibble: 6,140 × 2
## file_date jdg_date
## <dttm> <dttm>
## 1 2020-01-01 00:00:00 NA
## 2 2020-01-02 00:00:00 2020-01-15 00:00:00
## 3 2020-01-02 00:00:00 2020-01-22 00:00:00
## 4 2020-01-02 00:00:00 2020-01-15 00:00:00
## 5 2020-01-02 00:00:00 2020-01-15 00:00:00
## 6 2020-01-02 00:00:00 NA
## 7 2020-01-02 00:00:00 2020-01-15 00:00:00
## 8 2020-01-02 00:00:00 NA
## 9 2020-01-02 00:00:00 2020-01-15 00:00:00
## 10 2020-01-02 00:00:00 2020-01-27 00:00:00
## # ℹ 6,130 more rows
eviction_p2 %>% select(ends_with("date"))
## # A tibble: 7,045 × 2
## file_date jdg_date
## <dttm> <dttm>
## 1 2019-04-01 00:00:00 2019-04-23 00:00:00
## 2 2019-04-01 00:00:00 2019-04-17 00:00:00
## 3 2019-04-01 00:00:00 2019-04-15 00:00:00
## 4 2019-04-01 00:00:00 NA
## 5 2019-04-01 00:00:00 2019-04-15 00:00:00
## 6 2019-04-01 00:00:00 2019-04-15 00:00:00
## 7 2019-04-01 00:00:00 2019-04-16 00:00:00
## 8 2019-04-01 00:00:00 NA
## 9 2019-04-01 00:00:00 2019-04-16 00:00:00
## 10 2019-04-01 00:00:00 NA
## # ℹ 7,035 more rows
eviction_p3 %>% select(ends_with("date"))
## # A tibble: 9,425 × 2
## file_date jdg_date
## <chr> <chr>
## 1 04/03/2017 04/17/2017
## 2 04/03/2017 04/17/2017
## 3 04/03/2017 <NA>
## 4 04/03/2017 04/17/2017
## 5 04/03/2017 04/19/2017
## 6 04/03/2017 <NA>
## 7 04/03/2017 04/17/2017
## 8 04/04/2017 04/18/2017
## 9 04/04/2017 04/24/2017
## 10 04/04/2017 <NA>
## # ℹ 9,415 more rows
## read more about POSIX classes here: https://www.stat.berkeley.edu/~s133/dates.html
Same with the zip code columns.
eviction_p1 %>% select(starts_with("zip"))
## # A tibble: 6,140 × 2
## zip1 zip2
## <dbl> <lgl>
## 1 68111 NA
## 2 68102 NA
## 3 68134 NA
## 4 68108 NA
## 5 68111 NA
## 6 68112 NA
## 7 68132 NA
## 8 68111 NA
## 9 68111 NA
## 10 68104 NA
## # ℹ 6,130 more rows
eviction_p2 %>% select(starts_with("zip"))
## # A tibble: 7,045 × 2
## zip1 zip2
## <dbl> <dbl>
## 1 NA NA
## 2 68104 NA
## 3 68107 NA
## 4 68144 NA
## 5 68466 NA
## 6 68025 NA
## 7 68701 NA
## 8 68803 2064
## 9 66083 NA
## 10 69138 NA
## # ℹ 7,035 more rows
eviction_p3 %>% select(starts_with("zip"))
## # A tibble: 9,425 × 2
## zip1 zip2
## <chr> <chr>
## 1 68111 <NA>
## 2 68107 <NA>
## 3 68107 <NA>
## 4 68107 <NA>
## 5 68132 <NA>
## 6 68110 <NA>
## 7 68104 <NA>
## 8 68127 <NA>
## 9 68105 <NA>
## 10 68144 <NA>
## # ℹ 9,415 more rows
The lubridate package has some handy functions for converting dates, such as ymd(), which parses dates in year-month-day order. As far as formatting goes, it works for all of these: “yyyy-mm-dd”, “yyyy/mm/dd”, “yyyymmdd”.
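For example, all three of these parse to the same date:
ymd("2020-01-15")
ymd("2020/01/15")
ymd("20200115")
## all return "2020-01-15"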
eviction_p1 <- eviction_p1 %>%
mutate(across(.cols = ends_with("date"), ymd)) %>%
mutate(across(.cols = starts_with("zip"), as.double))
eviction_p2 <- eviction_p2 %>%
mutate(across(.cols = ends_with("date"), ymd)) %>%
mutate(across(.cols = starts_with("zip"), as.double))
eviction_p3 <- eviction_p3 %>%
mutate(across(.cols = ends_with("date"), mdy)) %>%
mutate(across(.cols = starts_with("zip"), as.double))
We need to use mdy() for eviction_p3 because its date columns have a different format from the other two: “mm/dd/yyyy” rather than “yyyy-mm-dd”.
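A quick check of the difference:
mdy("04/03/2017") ## returns "2017-04-03"
ymd("04/03/2017") ## fails to parse and returns NA, with a warning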
Now that we’ve got a working date field, we can combine the data and save to a new variable.
eviction <- eviction_p1 %>% bind_rows(eviction_p2) %>% bind_rows(eviction_p3)
We can extract the year from the file_date field with
the year() function in lubridate.
eviction <- eviction %>% mutate(file_year = year(file_date))
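lubridate has similar accessors if you need other parts of a date, such as month() and day():
eviction %>% mutate(file_month = month(file_date)) %>% select(file_date, file_month)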
Now the fun begins. We need to isolate the plaintiff in each case caption and figure out which cases are tied to the Omaha Housing Authority.
str_extract(): extracts the first complete match from each string. You pass a regular expression to the function as an argument to tell R what pattern to match.
Look for these on the cheat sheet and try running the example code after this list:
- Match characters: \\. , \\s , .
- Anchors: ^a , a$
- Look-arounds: a(?=c) , (?<=b)a
- Quantifiers: a+
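A few toy examples (the strings here are made up just to demonstrate each pattern):
str_extract("file.txt", "\\.") ## matches the literal period
str_extract("a b", "\\s") ## matches the whitespace character
str_extract("abc", ".") ## . matches any single character, here "a"
str_extract("abc", "^a") ## "a", but only at the start of the string
str_extract("banana", "a$") ## "a", but only at the end
str_extract("ac", "a(?=c)") ## "a", only if followed by "c"
str_extract("ba", "(?<=b)a") ## "a", only if preceded by "b"
str_extract("aaab", "a+") ## "aaa": one or more a's in a row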
eviction <- eviction %>%
mutate(caption_up = str_to_upper(caption), .after = "caption")
## create a "caption_up" column right after "caption", containing the caption converted to upper case
eviction <- eviction %>%
mutate(
plaintiff = caption_up %>% str_extract("^.+(?=\\sV\\.)") %>% str_trim(),
def = caption_up %>% str_extract("(?<=\\sV\\.).+$") %>% str_trim()
)
## plaintiff: everything from the start of the string up to (but not including) ' V.'
## defendant: everything after ' V.' to the end of the string
landlord_names <- eviction %>% tabyl(plaintiff)
We have 5,000+ distinct landlord names. Now we can use the str_detect() function to capture the ones denoting the Omaha Housing Authority. We don’t know what all the variations are, so let’s cast a wide net first.
But let’s first narrow the data to everything in Douglas County. We can use str_detect() to tell R to give us all the entries whose court_name variable contains the string “Douglas County”.
eviction <- eviction %>%
filter(court_name %>% str_detect("Douglas County"))
Next we will look for every possible name variation of the Omaha Housing Authority and generate a frequency table.
oha <- eviction %>%
mutate(caption = str_to_upper(caption)) %>%
filter(caption %>% str_detect(str_to_upper("Omaha Housing Authority|OHA|Housing Authority|Housing Authority-City of Omaha|Housing Auth. City of Omaha|Housing Authority Of Omaha|Omaha Housing Auth|Housing Auth. City of Omaha|Housing Auhtority City of Omaha|Housing Authority City Omaha|^OMA")))
## | is an alternator, means "OR", allows for searching multiple phrases;
## ^ is an anchor, means "starts with"
oha_names <- oha %>% tabyl(plaintiff)
## janitor::tabyl() generates a frequency table; it returns a data.frame with the frequency and percentage of each value of the tabulated variable
We can manually review the plaintiff names to see which ones are truly OHA. For example, are the “THE HOUSING AUTHORITY” cases really OHA filings? We can verify them by checking the addresses.
When you need to quickly inspect a certain column, use
select to rearrange column positions for easier
viewing.
oha %>% filter(plaintiff == "THE HOUSING AUTHORITY") %>%
#addr1 now shifts to the left and everything else stays in place
select(addr1, everything())
## # A tibble: 40 × 28
## addr1 court_name casetype caseyr casenum csealind caption caption_up subtype
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 4207 … 01 - Doug… CI 18 223 N THE HO… THE HOUSI… REALFED
## 2 4204 … 01 - Doug… CI 18 227 N THE HO… THE HOUSI… REALFED
## 3 2908 … 01 - Doug… CI 18 621 N THE HO… THE HOUSI… REALFED
## 4 5640 … 01 - Doug… CI 18 622 N THE HO… THE HOUSI… REALFED
## 5 5100 … 01 - Doug… CI 18 2049 N THE HO… THE HOUSI… REALFED
## 6 600 S… 01 - Doug… CI 18 2050 N THE HO… THE HOUSI… REALFED
## 7 600 S… 01 - Doug… CI 18 2052 N THE HO… THE HOUSI… REALFED
## 8 1601 … 01 - Doug… CI 18 2056 N THE HO… THE HOUSI… REALFED
## 9 5100 … 01 - Doug… CI 18 2058 N THE HO… THE HOUSI… REALFED
## 10 5405 … 01 - Doug… CI 18 2076 N THE HO… THE HOUSI… REALFED
## # ℹ 30 more rows
## # ℹ 19 more variables: dscrptn <chr>, file_date <date>, cstatusx <chr>,
## # prtyrole <chr>, prtyseq <dbl>, pstatus <chr>, name <chr>, addr2 <chr>,
## # city <chr>, state <chr>, zip1 <dbl>, zip2 <dbl>, jdg_date <date>,
## # jdgcode <chr>, atty_bar <dbl>, atty_name <chr>, file_year <dbl>,
## # plaintiff <chr>, def <chr>
Once we find the plaintiff names that are actually OHA, let’s save them to an object.
oha_clean_names <- oha_names$plaintiff[c(7:13, 30:34, 59:67)]
We will filter the plaintiff column based on whether the plaintiff name matches any of the names in oha_clean_names.
# cases we are confident were filed by OHA
actual_oha <- oha %>% filter(plaintiff %in% oha_clean_names)
# the ones that are not OHA; %out% is campfin's negation of %in%
oha_out <- eviction %>% filter(plaintiff %out% oha_names$plaintiff)
oha_out_names <- oha_out %>% tabyl(plaintiff)
Now that we have a clean data set of all OHA filings, let’s see how many cases they filed each year, and what share of all filings were OHA’s.
count() in the dplyr package lets you quickly count the unique values of one or more variables.
# OHA cases each year
x <- actual_oha %>% count(file_year)
names(x) <- c("filing_year","oha_cases")
# Let's see how many cases in total are filed each year
y <- eviction %>% count(file_year)
names(y) <- c("filing_year","all_cases")
# join and calculate percent
z <- x %>% left_join(y) %>%
mutate(oha_perc = oha_cases/all_cases)
z %>% head()
## # A tibble: 4 × 4
## filing_year oha_cases all_cases oha_perc
## <dbl> <int> <int> <dbl>
## 1 2017 346 3302 0.105
## 2 2018 347 4918 0.0706
## 3 2019 312 5071 0.0615
## 4 2020 217 3245 0.0669
Say you’d like to send this to another reporter in your newsroom for some spot checks to see if you’ve gotten everything right. An easy way to do this is to connect to your Google account and upload the data to Google Drive.
Let’s get started. First, you’ll need a Google account and the googlesheets4 package installed.
Let’s create a sheet from scratch. You’ll first need to authenticate your account.
googlesheets4::gs4_auth()
Next we will create a Google Sheet.
googlesheets4::gs4_create("oha",sheets = actual_oha)
# first argument is the name of the new spreadsheet; the second is the data frame to write, whose name also becomes the tab name
You can then make changes in the Google Sheet. Grab the ID of your sheet from the address URL: it’s the string after “https://docs.google.com/spreadsheets/d/”.
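If you ever need to pull the sheet back into R, read_sheet() takes that ID (the one below is from the sheet we just created):
read_sheet("1yyMApNbIVAWxvFPEw_11ZrkRQdip9to_UIs7GL4Op74")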
Let’s add the summary table to the same Google sheet in another tab.
sheet_write(z, ss = "1yyMApNbIVAWxvFPEw_11ZrkRQdip9to_UIs7GL4Op74", sheet = "summary_table")
Next we’ll take a look at OSHA workplace injury data, which we’ll load into a variable called osha.
osha <- read_csv("../data/osha.csv")
# or read from github repo
# osha <- read_csv("https://raw.githubusercontent.com/ireapps/teaching-guide-R123/main/data/osha.csv")
glimpse(osha)
## Rows: 307,902
## Columns: 31
## $ id <dbl> 1426059, 1426060, 1426061, 1426062, 14260…
## $ company_name <chr> "Aza Services, LLC", "All Steel Welding L…
## $ establishment_name <chr> "Aza Services LLC", "All Steel Welding LL…
## $ ein <chr> "812104657", "853696720", "872099752", "8…
## $ street_address <chr> "613 Davis Rd", "220 Grand St", "429 Dela…
## $ city <chr> "Jefferson", "Channelview", "Newark", "Ea…
## $ state <chr> "GA", "TX", "NJ", "NC", "HI", "WV", "CO",…
## $ zip_code <chr> "30549", "77530", "07105", "28312", "9681…
## $ naics_code <dbl> 238140, 332312, 492110, 221122, 236118, 5…
## $ industry_description <chr> "Masonry", "Fabrication, Erection, and We…
## $ annual_average_employees <dbl> 12, 20, 25, 1, 4, 20, 14, 275, 55, 115, 1…
## $ total_hours_worked <dbl> 700, 38400, 700, 2080, 10800, 49000, 2348…
## $ no_injuries_illnesses <dbl> 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2,…
## $ total_deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_dafw_cases <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,…
## $ total_djtr_cases <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_other_cases <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,…
## $ total_dafw_days <dbl> 0, 0, 0, 0, 0, 0, 0, 14, 1, 80, 0, 0, 0, …
## $ total_djtr_days <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_injuries <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0, 0, 1, 0,…
## $ total_poisonings <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_respiratory_conditions <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_skin_disorders <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_hearing_loss <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_other_illnesses <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ establishment_id <dbl> 752183, 795137, 795138, 795139, 779923, 6…
## $ establishment_type <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1,…
## $ size <dbl> 1, 1, 2, 1, 1, 1, 1, 3, 2, 2, 1, 2, 2, 1,…
## $ year_filing_for <dbl> 2021, 2021, 2021, 2021, 2021, 2021, 2021,…
## $ created_timestamp <chr> "1/1/2022 10:05:03", "1/1/2022 10:05:11",…
## $ change_reason <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
If you look through the documentation
for this dataset, you’ll notice that some of these fields are coded,
such as size and establishment_type. For
columns that have many value options, we might want to join to a lookup
table. But for just a few values, we can add a _desc column
into our data and code in values based on the original column.
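For the lookup-table route, a sketch might look like this (the naics_lookup file is hypothetical):
# hypothetical two-column lookup file: naics_code, naics_desc
naics_lookup <- read_csv("../data/naics_lookup.csv")
osha %>% left_join(naics_lookup, by = "naics_code")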
We’ll add an estab_type_desc column based on the values
in establishment_type, using a function called
case_when(). This is something like an if or
ifelse statement:
# test it out
osha %>% mutate(estab_type_desc = case_when(
establishment_type==1 ~ "Not a Government Entity",
establishment_type==2 ~ "State Government Entity",
establishment_type==3 ~ "Local Government Entity",
TRUE ~ "Error"
)) %>%
count(establishment_type, estab_type_desc)
## # A tibble: 4 × 3
## establishment_type estab_type_desc n
## <dbl> <chr> <int>
## 1 1 Not a Government Entity 291709
## 2 2 State Government Entity 5447
## 3 3 Local Government Entity 8847
## 4 NA Error 1899
# make it permanent
osha <- osha %>% mutate(estab_type_desc = case_when(
establishment_type==1 ~ "Not a Government Entity",
establishment_type==2 ~ "State Government Entity",
establishment_type==3 ~ "Local Government Entity",
TRUE ~ "Error"
))
This is the original file from the Census Bureau.
poverty <- read_csv("../data/poverty_original.csv")
#poverty <- read_csv("https://raw.githubusercontent.com/ireapps/teaching-guide-R123/main/data/poverty_original.csv")
In this example, each variable (e.g. below50, below125, etc.) is its own row. To make it easier to do calculations by county, I reshaped this data so that each variable would be its own column rather than a row. I did that using pivot_wider() (for the sake of this example, I’m going to exclude the margin of error, or moe).
poverty %>%
select(-moe) %>%
pivot_wider(names_from=variable, values_from=estimate)
## # A tibble: 3,221 × 11
## GEOID NAME population below50 below125 below150 below185 below200 below300
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 Autaug… 55639 3504 10760 12611 15883 17499 27259
## 2 01003 Baldwi… 218289 7250 28632 36413 50285 57127 92639
## 3 01005 Barbou… 25026 3018 7553 8965 10606 11152 14933
## 4 01007 Bibb C… 22374 1875 4646 5730 7811 8370 12236
## 5 01009 Blount… 57755 3834 10542 13624 18517 20042 32176
## 6 01011 Bulloc… 10173 1074 3472 4128 4989 5054 6468
## 7 01013 Butler… 19726 1194 4997 6133 7478 8150 12138
## 8 01015 Calhou… 114324 7486 25293 30323 39419 42196 64372
## 9 01017 Chambe… 33427 1628 7294 9277 12781 14169 20226
## 10 01019 Cherok… 26035 973 5424 6662 9132 9798 14649
## # ℹ 3,211 more rows
## # ℹ 2 more variables: below400 <dbl>, below500 <dbl>
There is a function called pivot_longer() that does the
opposite:
# First I'll create a new variable with the wider data:
poverty_wide <- poverty %>%
select(-moe) %>%
pivot_wider(names_from=variable, values_from=estimate)
# Then I'll turn it back to long using pivot_longer()
poverty_wide %>% pivot_longer(cols = population:below500, names_to="variable", values_to="estimate")
## # A tibble: 28,989 × 4
## GEOID NAME variable estimate
## <chr> <chr> <chr> <dbl>
## 1 01001 Autauga County, Alabama population 55639
## 2 01001 Autauga County, Alabama below50 3504
## 3 01001 Autauga County, Alabama below125 10760
## 4 01001 Autauga County, Alabama below150 12611
## 5 01001 Autauga County, Alabama below185 15883
## 6 01001 Autauga County, Alabama below200 17499
## 7 01001 Autauga County, Alabama below300 27259
## 8 01001 Autauga County, Alabama below400 35423
## 9 01001 Autauga County, Alabama below500 42806
## 10 01003 Baldwin County, Alabama population 218289
## # ℹ 28,979 more rows