In this session we’re going to:
1. Get data into R! We will scrape an HTML table and load data from Excel files, local CSVs and CSVs on the web.
2. Clean data in R! We will practice cleaning up column names, converting data types, combining multiple data sets and manipulating strings. There’s also some reshaping at the end, because it’s a crucial step in getting data ready for analysis.
You will walk away with hands-on practice in both.
We’ll need to load several packages for this class.
library(tidyverse) # we'll be using mostly tidyr and stringr packages within the tidyverse, handy for data cleaning and string manipulation respectively.
library(lubridate) # this package allows us to manipulate dates
library(here) # easier file paths
library(readxl) # read data from Excel files
library(googlesheets4) #write to and pull data from Google Sheets
library(janitor) #data cleaning package - clean_names() is great
library(campfin) # developed for the Accountability Project to clean campaign finance data, but it has tons of general data-cleaning tools
library(rvest) #web scraping
Let’s say an EPA webpage has a table with some useful information: the percentage of each state’s area with groundwater nitrate contamination. You want to know how Nebraska is doing compared to other states, but there’s no sort button to view a column in descending order. So let’s scrape it.
The rvest package can help us do that:
read_html(): reads the raw HTML of a web page.
html_element(): selects the first element matching a CSS selector, such as "table".
html_table(): parses an HTML table into a data frame.
Tips are included as comments in the code below:
epa_url <- "https://www.epa.gov/nutrient-policy-data/estimated-nitrate-concentrations-groundwater-used-drinking"
## creating a variable for the url
epa_tab <- read_html(epa_url) %>% ##read in the html page
html_element("table") %>% ##select the "table" element
html_table() ##parse the html table into data frame
# Clean up column names
names(epa_tab) <- c("state","area","percent","private_perc_2005","private_perc_2015")
Now you can manipulate the data frame. But first let’s check the data
types with str().
str(epa_tab)
## tibble [50 × 5] (S3: tbl_df/tbl/data.frame)
## $ state : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ area : chr [1:50] "5" "No data" "216" "0" ...
## $ percent : chr [1:50] "0" "No data" "0" "0" ...
## $ private_perc_2005: int [1:50] 11 35 4 7 7 6 24 10 10 18 ...
## $ private_perc_2015: int [1:50] 11 26 3 5 4 5 24 19 12 15 ...
Some columns are character type, but to perform any sorting they need to be numeric. Let’s use the following functions to convert them:
mutate(): creates or modifies columns.
epa_tab <- epa_tab %>% mutate(area = as.numeric(area))
across(): applies a function to several columns at once inside mutate().
## to double check: epa_tab %>% mutate(across(2:4, as.numeric)) %>% View()
epa_tab <- epa_tab %>% mutate(across(2:4, as.numeric))
You might have seen “Warning: NAs introduced by coercion.” That’s because the “No data” entries are character strings that R couldn’t turn into numbers, so it coerced them to NA.
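If you want to confirm which rows were coerced, here’s a quick check (a sketch, using the column names we assigned above):
epa_tab %>% filter(is.na(area)) %>% select(state, area, percent)
## should show the "No data" states, such as Alaska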
Now let’s rank the states based on the percentage of area with nitrate contamination, from highest to lowest.
epa_tab %>% arrange(desc(percent))
## # A tibble: 50 × 5
## state area percent private_perc_2005 private_perc_2015
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Oklahoma NA 12 8 9
## 2 Kansas NA 11 5 5
## 3 Delaware 189 10 10 19
## 4 Maryland 347 4 17 24
## 5 Texas NA 4 10 5
## 6 Nebraska NA 2 18 9
## 7 Pennsylvania 772 2 20 27
## 8 California NA 1 7 4
## 9 Colorado NA 1 6 5
## 10 Iowa 463 1 18 16
## # ℹ 40 more rows
Now that the data are numeric you can also perform calculations across multiple columns, for example the change in the percentage of residents on a private well in 2015 vs. 2005.
epa_tab <- epa_tab %>%
mutate(perc_change = private_perc_2015-private_perc_2005)
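For instance, to see which states’ private-well share changed the most, sort on the new column:
epa_tab %>% arrange(desc(perc_change)) %>% select(state, perc_change)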
We are going to work with court records from an eviction court in Nebraska. The data is maintained by the administrative office of the courts and was provided to the Flatwater Free Press by a researcher who originally requested it.
eviction_p1 <- read_xlsx("../data/evictions 2020-01-01 to 2021-01-01.xlsx") %>% clean_names()
eviction_p2 <- read_xlsx("../data/evictions 2019-04-01 to 2019-12-31.xlsx") %>% clean_names()
eviction_p3 <- read_xlsx("../data/evictions 04-01-2017 to 03-31-2019.xlsx") %>% clean_names()
We can easily combine data files with the same structure using bind_rows(). Note that the column types need to be consistent across data frames, so first let’s make sure the date columns are read in as dates.
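If we tried bind_rows() right away, dplyr would refuse to combine a character date column with a datetime one; a sketch of the failure mode (error text paraphrased):
# bind_rows(eviction_p1, eviction_p3)
# Error: can't combine `file_date` <datetime> and `file_date` <character>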
We can look at specific columns using select(). And if we don’t want to type out each column name, we can use ends_with() or starts_with() to select columns based on a string pattern.
eviction_p1 %>% select(ends_with("date"))
## # A tibble: 6,140 × 2
## file_date jdg_date
## <dttm> <dttm>
## 1 2020-01-01 00:00:00 NA
## 2 2020-01-02 00:00:00 2020-01-15 00:00:00
## 3 2020-01-02 00:00:00 2020-01-22 00:00:00
## 4 2020-01-02 00:00:00 2020-01-15 00:00:00
## 5 2020-01-02 00:00:00 2020-01-15 00:00:00
## 6 2020-01-02 00:00:00 NA
## 7 2020-01-02 00:00:00 2020-01-15 00:00:00
## 8 2020-01-02 00:00:00 NA
## 9 2020-01-02 00:00:00 2020-01-15 00:00:00
## 10 2020-01-02 00:00:00 2020-01-27 00:00:00
## # ℹ 6,130 more rows
eviction_p2 %>% select(ends_with("date"))
## # A tibble: 7,045 × 2
## file_date jdg_date
## <dttm> <dttm>
## 1 2019-04-01 00:00:00 2019-04-23 00:00:00
## 2 2019-04-01 00:00:00 2019-04-17 00:00:00
## 3 2019-04-01 00:00:00 2019-04-15 00:00:00
## 4 2019-04-01 00:00:00 NA
## 5 2019-04-01 00:00:00 2019-04-15 00:00:00
## 6 2019-04-01 00:00:00 2019-04-15 00:00:00
## 7 2019-04-01 00:00:00 2019-04-16 00:00:00
## 8 2019-04-01 00:00:00 NA
## 9 2019-04-01 00:00:00 2019-04-16 00:00:00
## 10 2019-04-01 00:00:00 NA
## # ℹ 7,035 more rows
eviction_p3 %>% select(ends_with("date"))
## # A tibble: 9,425 × 2
## file_date jdg_date
## <chr> <chr>
## 1 04/03/2017 04/17/2017
## 2 04/03/2017 04/17/2017
## 3 04/03/2017 <NA>
## 4 04/03/2017 04/17/2017
## 5 04/03/2017 04/19/2017
## 6 04/03/2017 <NA>
## 7 04/03/2017 04/17/2017
## 8 04/04/2017 04/18/2017
## 9 04/04/2017 04/24/2017
## 10 04/04/2017 <NA>
## # ℹ 9,415 more rows
## read more about POSIX classes here: https://www.stat.berkeley.edu/~s133/dates.html
Same with the zip code columns.
eviction_p1 %>% select(starts_with("zip"))
## # A tibble: 6,140 × 2
## zip1 zip2
## <dbl> <lgl>
## 1 68111 NA
## 2 68102 NA
## 3 68134 NA
## 4 68108 NA
## 5 68111 NA
## 6 68112 NA
## 7 68132 NA
## 8 68111 NA
## 9 68111 NA
## 10 68104 NA
## # ℹ 6,130 more rows
eviction_p2 %>% select(starts_with("zip"))
## # A tibble: 7,045 × 2
## zip1 zip2
## <dbl> <dbl>
## 1 NA NA
## 2 68104 NA
## 3 68107 NA
## 4 68144 NA
## 5 68466 NA
## 6 68025 NA
## 7 68701 NA
## 8 68803 2064
## 9 66083 NA
## 10 69138 NA
## # ℹ 7,035 more rows
eviction_p3 %>% select(starts_with("zip"))
## # A tibble: 9,425 × 2
## zip1 zip2
## <chr> <chr>
## 1 68111 <NA>
## 2 68107 <NA>
## 3 68107 <NA>
## 4 68107 <NA>
## 5 68132 <NA>
## 6 68110 <NA>
## 7 68104 <NA>
## 8 68127 <NA>
## 9 68105 <NA>
## 10 68144 <NA>
## # ℹ 9,415 more rows
The lubridate package has some handy functions for converting dates, such as ymd(), which parses dates in year-month-day order. As far as formatting goes, it works for all of these: “yyyy-mm-dd”, “yyyy/mm/dd”, “yyyymmdd”.
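For example, all three of these parse to the same date:
ymd("2020-01-15")
ymd("2020/01/15")
ymd("20200115")
## all return "2020-01-15"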
eviction_p1 <- eviction_p1 %>%
mutate(across(.cols = ends_with("date"), ymd)) %>%
mutate(across(.cols = starts_with("zip"), as.double))
eviction_p2 <- eviction_p2 %>%
mutate(across(.cols = ends_with("date"), ymd)) %>%
mutate(across(.cols = starts_with("zip"), as.double))
eviction_p3 <- eviction_p3 %>%
mutate(across(.cols = ends_with("date"), mdy)) %>%
mutate(across(.cols = starts_with("zip"), as.double))
We need to use mdy() for eviction_p3 because its date columns have a different format from the other two: “mm/dd/yyyy” rather than “yyyy-mm-dd”.
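A quick check of the difference:
mdy("04/03/2017") ## returns "2017-04-03"
ymd("04/03/2017") ## fails to parse and returns NA, with a warning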
Now that we’ve got a working date field, we can combine the data and save to a new variable.
eviction <- eviction_p1 %>% bind_rows(eviction_p2) %>% bind_rows(eviction_p3)
We can extract the year from the file_date field with
the year() function in lubridate.
eviction <- eviction %>% mutate(file_year = year(file_date))
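lubridate has similar accessors if you need other parts of a date, such as month() and day():
eviction %>% mutate(file_month = month(file_date)) %>% select(file_date, file_month)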
Now the fun begins. We need to isolate the plaintiff in each case caption and figure out which cases are tied to the Omaha Housing Authority.
str_extract(): extracts the first complete match from each string. You pass a regular expression to the function as an argument to tell R what pattern to match.
Look for these on the cheat sheet and try running the example code after this list:
- Match characters: \\. , \\s , .
- Anchors: ^a , a$
- Look-arounds: a(?=c) , (?<=b)a
- Quantifiers: a+
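A few toy examples (the strings here are made up just to demonstrate each pattern):
str_extract("file.txt", "\\.") ## matches the literal period
str_extract("a b", "\\s") ## matches the whitespace character
str_extract("abc", ".") ## . matches any single character, here "a"
str_extract("abc", "^a") ## "a", but only at the start of the string
str_extract("banana", "a$") ## "a", but only at the end
str_extract("ac", "a(?=c)") ## "a", only if followed by "c"
str_extract("ba", "(?<=b)a") ## "a", only if preceded by "b"
str_extract("aaab", "a+") ## "aaa": one or more a's in a row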
eviction <- eviction %>%
mutate(caption_up = str_to_upper(caption), .after = "caption")
## create a "caption_up" column right after "caption", containing the caption converted to upper case
eviction <- eviction %>%
mutate(
plaintiff = caption_up %>% str_extract("^.+(?=\\sV\\.)") %>% str_trim(),
def = caption_up %>% str_extract("(?<=\\sV\\.).+$") %>% str_trim()
)
## plaintiff: everything from the start of the string up to (but not including) ' V.'
## defendant: everything after ' V.' to the end of the string
landlord_names <- eviction %>% tabyl(plaintiff)
We have 5,000+ distinct landlord names. Now we can use the str_detect() function to capture the ones denoting the Omaha Housing Authority. We don’t know what all the variations are, so let’s cast a wide net first.
But let’s first narrow the data to everything in Douglas County. We can use str_detect() to tell R to give us all the entries whose court_name variable contains the string “Douglas County”.
eviction <- eviction %>%
filter(court_name %>% str_detect("Douglas County"))
Next we will look for every possible name variation of the Omaha Housing Authority and generate a frequency table.
oha <- eviction %>%
mutate(caption = str_to_upper(caption)) %>%
filter(caption %>% str_detect(str_to_upper("Omaha Housing Authority|OHA|Housing Authority|Housing Authority-City of Omaha|Housing Auth. City of Omaha|Housing Authority Of Omaha|Omaha Housing Auth|Housing Auth. City of Omaha|Housing Auhtority City of Omaha|Housing Authority City Omaha|^OMA")))
## | is an alternator, means "OR", allows for searching multiple phrases;
## ^ is an anchor, means "starts with"
oha_names <- oha %>% tabyl(plaintiff)
## janitor::tabyl() generates a frequency table; it returns a data.frame with the frequency and percentage of each value of the tabulated variable
We can manually review the plaintiff names to see which ones are truly OHA. For example, are the “THE HOUSING AUTHORITY” cases really OHA filings? We can verify them by checking the addresses.
When you need to quickly inspect a certain column, use
select to rearrange column positions for easier
viewing.
oha %>% filter(plaintiff == "THE HOUSING AUTHORITY") %>%
#addr1 now shifts to the left and everything else stays in place
select(addr1, everything())
## # A tibble: 40 × 28
## addr1 court_name casetype caseyr casenum csealind caption caption_up subtype
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 4207 … 01 - Doug… CI 18 223 N THE HO… THE HOUSI… REALFED
## 2 4204 … 01 - Doug… CI 18 227 N THE HO… THE HOUSI… REALFED
## 3 2908 … 01 - Doug… CI 18 621 N THE HO… THE HOUSI… REALFED
## 4 5640 … 01 - Doug… CI 18 622 N THE HO… THE HOUSI… REALFED
## 5 5100 … 01 - Doug… CI 18 2049 N THE HO… THE HOUSI… REALFED
## 6 600 S… 01 - Doug… CI 18 2050 N THE HO… THE HOUSI… REALFED
## 7 600 S… 01 - Doug… CI 18 2052 N THE HO… THE HOUSI… REALFED
## 8 1601 … 01 - Doug… CI 18 2056 N THE HO… THE HOUSI… REALFED
## 9 5100 … 01 - Doug… CI 18 2058 N THE HO… THE HOUSI… REALFED
## 10 5405 … 01 - Doug… CI 18 2076 N THE HO… THE HOUSI… REALFED
## # ℹ 30 more rows
## # ℹ 19 more variables: dscrptn <chr>, file_date <date>, cstatusx <chr>,
## # prtyrole <chr>, prtyseq <dbl>, pstatus <chr>, name <chr>, addr2 <chr>,
## # city <chr>, state <chr>, zip1 <dbl>, zip2 <dbl>, jdg_date <date>,
## # jdgcode <chr>, atty_bar <dbl>, atty_name <chr>, file_year <dbl>,
## # plaintiff <chr>, def <chr>
Once we find the plaintiff names that are actually OHA, let’s save them to an object.
oha_clean_names <- oha_names$plaintiff[c(7:13, 30:34, 59:67)]
We will filter the plaintiff column based on whether the plaintiff name matches any of the names in oha_clean_names.
# cases we are confident were filed by OHA
actual_oha <- oha %>% filter(plaintiff %in% oha_clean_names)
# the ones that are not OHA; %out% is campfin's negation of %in%
oha_out <- eviction %>% filter(plaintiff %out% oha_names$plaintiff)
oha_out_names <- oha_out %>% tabyl(plaintiff)
Now that we have a clean data set of all OHA filings, let’s see how many cases they filed each year, and what share of all filings were OHA’s.
count() in the dplyr package lets you quickly count the unique values of one or more variables.
# OHA cases each year
x <- actual_oha %>% count(file_year)
names(x) <- c("filing_year","oha_cases")
# Let's see how many cases in total are filed each year
y <- eviction %>% count(file_year)
names(y) <- c("filing_year","all_cases")
# join and calculate percent
z <- x %>% left_join(y) %>%
mutate(oha_perc = oha_cases/all_cases)
z %>% head()
## # A tibble: 4 × 4
## filing_year oha_cases all_cases oha_perc
## <dbl> <int> <int> <dbl>
## 1 2017 346 3302 0.105
## 2 2018 347 4918 0.0706
## 3 2019 312 5071 0.0615
## 4 2020 217 3245 0.0669
Say you’d like to send this to another reporter in your newsroom for some spot checks to see if you’ve gotten everything right. An easy way to do this is to connect to your Google account and upload the data to Google Drive.
Let’s get started. First, you’ll need a Google account and the googlesheets4 package installed.
Let’s create a sheet from scratch. You’ll first need to authenticate your account.
googlesheets4::gs4_auth()
Next we will create a Google Sheet.
googlesheets4::gs4_create("oha",sheets = actual_oha)
# first argument is the name of the new spreadsheet; the second is the data frame to write, whose name also becomes the tab name
You can then make changes in the Google Sheet. Grab the ID of your sheet from the address URL: it’s the string after “https://docs.google.com/spreadsheets/d/”.
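If you ever need to pull the sheet back into R, read_sheet() takes that ID (the one below is from the sheet we just created):
read_sheet("1yyMApNbIVAWxvFPEw_11ZrkRQdip9to_UIs7GL4Op74")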
Let’s add the summary table to the same Google sheet in another tab.
sheet_write(z, ss = "1yyMApNbIVAWxvFPEw_11ZrkRQdip9to_UIs7GL4Op74", sheet = "summary_table")
Next we’ll take a look at OSHA workplace injury data, which we’ll load into a variable called osha.
osha <- read_csv("../data/osha.csv")
# or read from github repo
# osha <- read_csv("https://raw.githubusercontent.com/ireapps/teaching-guide-R123/main/data/osha.csv")
glimpse(osha)
## Rows: 307,902
## Columns: 31
## $ id <dbl> 1426059, 1426060, 1426061, 1426062, 14260…
## $ company_name <chr> "Aza Services, LLC", "All Steel Welding L…
## $ establishment_name <chr> "Aza Services LLC", "All Steel Welding LL…
## $ ein <chr> "812104657", "853696720", "872099752", "8…
## $ street_address <chr> "613 Davis Rd", "220 Grand St", "429 Dela…
## $ city <chr> "Jefferson", "Channelview", "Newark", "Ea…
## $ state <chr> "GA", "TX", "NJ", "NC", "HI", "WV", "CO",…
## $ zip_code <chr> "30549", "77530", "07105", "28312", "9681…
## $ naics_code <dbl> 238140, 332312, 492110, 221122, 236118, 5…
## $ industry_description <chr> "Masonry", "Fabrication, Erection, and We…
## $ annual_average_employees <dbl> 12, 20, 25, 1, 4, 20, 14, 275, 55, 115, 1…
## $ total_hours_worked <dbl> 700, 38400, 700, 2080, 10800, 49000, 2348…
## $ no_injuries_illnesses <dbl> 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2,…
## $ total_deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_dafw_cases <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,…
## $ total_djtr_cases <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_other_cases <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,…
## $ total_dafw_days <dbl> 0, 0, 0, 0, 0, 0, 0, 14, 1, 80, 0, 0, 0, …
## $ total_djtr_days <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_injuries <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0, 0, 1, 0,…
## $ total_poisonings <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_respiratory_conditions <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_skin_disorders <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_hearing_loss <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_other_illnesses <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ establishment_id <dbl> 752183, 795137, 795138, 795139, 779923, 6…
## $ establishment_type <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1,…
## $ size <dbl> 1, 1, 2, 1, 1, 1, 1, 3, 2, 2, 1, 2, 2, 1,…
## $ year_filing_for <dbl> 2021, 2021, 2021, 2021, 2021, 2021, 2021,…
## $ created_timestamp <chr> "1/1/2022 10:05:03", "1/1/2022 10:05:11",…
## $ change_reason <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
If you look through the documentation
for this dataset, you’ll notice that some of these fields are coded,
such as size and establishment_type. For
columns that have many value options, we might want to join to a lookup
table. But for just a few values, we can add a _desc column
into our data and code in values based on the original column.
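For the lookup-table route, a sketch might look like this (the naics_lookup file is hypothetical):
# hypothetical two-column lookup file: naics_code, naics_desc
naics_lookup <- read_csv("../data/naics_lookup.csv")
osha %>% left_join(naics_lookup, by = "naics_code")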
We’ll add an estab_type_desc column based on the values
in establishment_type, using a function called
case_when(). This is something like an if or
ifelse statement:
# test it out
osha %>% mutate(estab_type_desc = case_when(
establishment_type==1 ~ "Not a Government Entity",
establishment_type==2 ~ "State Government Entity",
establishment_type==3 ~ "Local Government Entity",
TRUE ~ "Error"
)) %>%
count(establishment_type, estab_type_desc)
## # A tibble: 4 × 3
## establishment_type estab_type_desc n
## <dbl> <chr> <int>
## 1 1 Not a Government Entity 291709
## 2 2 State Government Entity 5447
## 3 3 Local Government Entity 8847
## 4 NA Error 1899
# make it permanent
osha <- osha %>% mutate(estab_type_desc = case_when(
establishment_type==1 ~ "Not a Government Entity",
establishment_type==2 ~ "State Government Entity",
establishment_type==3 ~ "Local Government Entity",
TRUE ~ "Error"
))
This is the original file from the Census Bureau.
poverty <- read_csv("../data/poverty_original.csv")
#poverty <- read_csv("https://raw.githubusercontent.com/ireapps/teaching-guide-R123/main/data/poverty_original.csv")
In this example, each variable (e.g. below50, below125, etc.) is its own row. To make it easier to do calculations by county, I reshaped this data so that each variable would be its own column rather than a row. I did that using pivot_wider() (for the sake of this example, I’m going to exclude the margin of error, or moe).
poverty %>%
select(-moe) %>%
pivot_wider(names_from=variable, values_from=estimate)
## # A tibble: 3,221 × 11
## GEOID NAME population below50 below125 below150 below185 below200 below300
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 Autaug… 55639 3504 10760 12611 15883 17499 27259
## 2 01003 Baldwi… 218289 7250 28632 36413 50285 57127 92639
## 3 01005 Barbou… 25026 3018 7553 8965 10606 11152 14933
## 4 01007 Bibb C… 22374 1875 4646 5730 7811 8370 12236
## 5 01009 Blount… 57755 3834 10542 13624 18517 20042 32176
## 6 01011 Bulloc… 10173 1074 3472 4128 4989 5054 6468
## 7 01013 Butler… 19726 1194 4997 6133 7478 8150 12138
## 8 01015 Calhou… 114324 7486 25293 30323 39419 42196 64372
## 9 01017 Chambe… 33427 1628 7294 9277 12781 14169 20226
## 10 01019 Cherok… 26035 973 5424 6662 9132 9798 14649
## # ℹ 3,211 more rows
## # ℹ 2 more variables: below400 <dbl>, below500 <dbl>
There is a function called pivot_longer() that does the
opposite:
# First I'll create a new variable with the wider data:
poverty_wide <- poverty %>%
select(-moe) %>%
pivot_wider(names_from=variable, values_from=estimate)
# Then I'll turn it back to long using pivot_longer()
poverty_wide %>% pivot_longer(cols = population:below500, names_to="variable", values_to="estimate")
## # A tibble: 28,989 × 4
## GEOID NAME variable estimate
## <chr> <chr> <chr> <dbl>
## 1 01001 Autauga County, Alabama population 55639
## 2 01001 Autauga County, Alabama below50 3504
## 3 01001 Autauga County, Alabama below125 10760
## 4 01001 Autauga County, Alabama below150 12611
## 5 01001 Autauga County, Alabama below185 15883
## 6 01001 Autauga County, Alabama below200 17499
## 7 01001 Autauga County, Alabama below300 27259
## 8 01001 Autauga County, Alabama below400 35423
## 9 01001 Autauga County, Alabama below500 42806
## 10 01003 Baldwin County, Alabama population 218289
## # ℹ 28,979 more rows