Intro to text analysis with R

Text data representation in R

Text is a sequence of characters: letters, numbers, symbols, and so on. In R, text data is usually handled as character vectors. A single piece of text, whether it’s a word, a sentence, or an entire document, is typically represented as a string (i.e., a sequence of characters).

When you have many pieces of text data, they can be stored in a character vector, where each element of the vector is a string. For example, you might have a vector of city names, where each city name is a string. You can also handle text data in data frames, where one or more columns are made up of character vectors.
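
As a minimal sketch of the two representations described above, the following base R snippet builds a character vector and then stores it as a column of a data frame. The object names (cities, city_df) and the id column are made up for illustration.

```r
# A character vector: each element is one string
cities <- c("Charlotte", "Raleigh", "Durham")
class(cities)    # "character"
length(cities)   # 3 elements
nchar(cities)    # number of characters in each string: 9 7 6

# The same strings stored as a column of a data frame
city_df <- data.frame(
  id = 1:3,
  city = cities,
  stringsAsFactors = FALSE  # keep text as character, not factor
)
class(city_df$city)  # "character"
```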

R provides a variety of tools for processing and analyzing text data. Here are a few key examples:

The stringr package provides many functions for manipulating strings in R. This includes functions for doing things like converting text to upper or lower case, finding and replacing patterns in text, splitting text into pieces, and more.
The tm (Text Mining) and tidytext packages provide tools for more advanced text processing and analysis tasks, such as creating term-document matrices, performing sentiment analysis, and more.
Regular expressions (regex) provide a powerful way to identify and manipulate patterns in text. We can use regex in R through functions like grep(), gsub(), and regexpr() in base R, or through similar functions provided by stringr. We will practice some regex in this tutorial.
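
To make the comparison between base R and stringr concrete, here is a small sketch that performs the same detection and replacement with both. The streets vector is a made-up example; the functions themselves are standard base R and stringr calls.

```r
library(stringr)

streets <- c("Main St", "Pine Ave", "Oak St")

# Detect a pattern: base R vs stringr
base_hits    <- grepl("St$", streets)
stringr_hits <- str_detect(streets, "St$")

# Replace a pattern: base R vs stringr
base_long    <- gsub("St$", "Street", streets)
stringr_long <- str_replace(streets, "St$", "Street")

# Change case and split into pieces with stringr
upper  <- str_to_upper(streets)
pieces <- str_split(streets, " ")
```

The stringr functions take the string as the first argument and the pattern as the second, which makes them easier to use in pipelines than their base R counterparts.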

library(tidyverse)

Working with text in R

Combining strings with `paste` and `paste0`

names <- c("Orange", "Durham", "Chatham", "Wake")
names

[1] "Orange"  "Durham"  "Chatham" "Wake"

Combining strings with paste

counties <- paste(names, "County")
counties

[1] "Orange County"  "Durham County"  "Chatham County" "Wake County"

paste inserts a space between the pieces it joins. We can use paste0 if we don’t want the space.

counties <- paste0(names, "County")
counties

[1] "OrangeCounty"  "DurhamCounty"  "ChathamCounty" "WakeCounty"

Alternatively, we can use paste0 and supply the separator ourselves, for example a space " " or any other string such as "+".

counties <- paste0(names, " ", "County")
counties

[1] "Orange County"  "Durham County"  "Chatham County" "Wake County"

counties <- paste0(names, "+", "County")
counties

[1] "Orange+County"  "Durham+County"  "Chatham+County" "Wake+County"

Alternatively, we can use collapse, which joins the elements of a single vector into one string.

counties <- paste0(c(names[1], "County"), collapse = " ")
counties

[1] "Orange County"
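
The difference between the sep and collapse arguments of paste is easy to mix up, so here is a short sketch contrasting them. The vector is named county_names here (an illustrative name) so that the snippet stands on its own.

```r
county_names <- c("Orange", "Durham", "Chatham", "Wake")

# sep: the string placed between the arguments, element by element
with_sep <- paste(county_names, "County", sep = "_")
with_sep
# "Orange_County" "Durham_County" "Chatham_County" "Wake_County"

# collapse: joins the elements of the result into a single string
one_string <- paste(county_names, collapse = ", ")
one_string
# "Orange, Durham, Chatham, Wake"

# Both together: build each label first, then join the labels
all_labels <- paste(county_names, "County", sep = " ", collapse = "; ")
all_labels
# "Orange County; Durham County; Chatham County; Wake County"
```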

A short intro to Regular Expressions

Regular expressions (regex) are a powerful tool for working with text data. They provide a way to search, match, extract, replace, or split text based on complex patterns.
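
As a quick preview of those operations, the sketch below applies one stringr function per verb to a single string. The replacement word "Zone" is an arbitrary choice for illustration.

```r
library(stringr)

description <- "Residential Plan for District 11"

has_district <- str_detect(description, "District")         # search
number       <- str_extract(description, "\\d+")            # extract
renamed      <- str_replace(description, "District", "Zone") # replace
words        <- str_split(description, " ")[[1]]            # split

has_district  # TRUE
number        # "11"
renamed       # "Residential Plan for Zone 11"
words         # "Residential" "Plan" "for" "District" "11"
```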

# Create dummy dataset
urban_planning_data <- data.frame(
  Plan_ID = c("PLAN_01", "PLAN_02", "PLAN_03", "PLAN_04", "PLAN_05"),
  Description = c("Residential Plan for District 11", "Commercial Project in Sector 4", 
                  "Industrial Development Plan for District 9", "Residential Plan for District 4",
                  "Mixed-Use Plan for Sector 8"),
  stringsAsFactors = FALSE
)

urban_planning_data

  Plan_ID                                Description
1 PLAN_01           Residential Plan for District 11
2 PLAN_02             Commercial Project in Sector 4
3 PLAN_03 Industrial Development Plan for District 9
4 PLAN_04            Residential Plan for District 4
5 PLAN_05                Mixed-Use Plan for Sector 8

Identify plan types

# Load required package
library(stringr) #part of tidyverse

# Create a regex pattern to identify plan types
plan_type_pattern <- "Residential|Commercial|Industrial|Mixed-Use"

# Extract plan types
urban_planning_data$Plan_Type <- str_extract(urban_planning_data$Description, plan_type_pattern)

urban_planning_data

  Plan_ID                                Description   Plan_Type
1 PLAN_01           Residential Plan for District 11 Residential
2 PLAN_02             Commercial Project in Sector 4  Commercial
3 PLAN_03 Industrial Development Plan for District 9  Industrial
4 PLAN_04            Residential Plan for District 4 Residential
5 PLAN_05                Mixed-Use Plan for Sector 8   Mixed-Use

Exercise

Extract whether the implementation area is a “District” or a “Sector” in a new column.

# Create a regex pattern to identify location numbers
location_number_pattern <- "\\b\\d+\\b"  # \b is word boundary, \d+ is one or more digits

# Extract location numbers
urban_planning_data$Location_Number <- str_extract(urban_planning_data$Description, location_number_pattern)

urban_planning_data

  Plan_ID                                Description   Plan_Type
1 PLAN_01           Residential Plan for District 11 Residential
2 PLAN_02             Commercial Project in Sector 4  Commercial
3 PLAN_03 Industrial Development Plan for District 9  Industrial
4 PLAN_04            Residential Plan for District 4 Residential
5 PLAN_05                Mixed-Use Plan for Sector 8   Mixed-Use
  Location_Number
1              11
2               4
3               9
4               4
5               8
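
Note that str_extract always returns character values, even when the match consists only of digits. A minimal sketch of why this matters, and how to convert with as.integer, is shown below; the two descriptions are copied from the dummy dataset and the object names are illustrative.

```r
library(stringr)

descriptions <- c("Residential Plan for District 11",
                  "Commercial Project in Sector 4")

# The extracted numbers are strings, not numbers
location_chr <- str_extract(descriptions, "\\b\\d+\\b")
class(location_chr)                  # "character"

# Strings are compared character by character, so "11" sorts before "4"
location_chr[1] < location_chr[2]    # TRUE

# Convert to integer before doing arithmetic or numeric comparisons
location_int <- as.integer(location_chr)
location_int[1] < location_int[2]    # FALSE
```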

# Create a regex pattern to identify complete location information
complete_location_pattern <- "(District|Sector) \\d+"

# Extract complete location information
urban_planning_data$Location <- str_extract(urban_planning_data$Description, complete_location_pattern)

urban_planning_data

  Plan_ID                                Description   Plan_Type
1 PLAN_01           Residential Plan for District 11 Residential
2 PLAN_02             Commercial Project in Sector 4  Commercial
3 PLAN_03 Industrial Development Plan for District 9  Industrial
4 PLAN_04            Residential Plan for District 4 Residential
5 PLAN_05                Mixed-Use Plan for Sector 8   Mixed-Use
  Location_Number    Location
1              11 District 11
2               4    Sector 4
3               9  District 9
4               4  District 4
5               8    Sector 8

Some more text pattern extraction

Let’s assume we have a dataset with an “address” column. Our goal will be to extract the street numbers, street names, and types from these addresses.

# Create a data frame with addresses
df <- data.frame(address = c("123 Main St", "456 Pine Ave", "789 Oak Blvd", "321 Elm Dr"))
df

       address
1  123 Main St
2 456 Pine Ave
3 789 Oak Blvd
4   321 Elm Dr

Extracting Street Numbers

We can use str_extract() to pull out the street numbers, which are the series of digits at the beginning of each address.

# Extract street numbers
df$street_number <- str_extract(df$address, pattern = "\\d+")
df$street_n <- str_extract(df$address, pattern = "\\d")
df

       address street_number street_n
1  123 Main St           123        1
2 456 Pine Ave           456        4
3 789 Oak Blvd           789        7
4   321 Elm Dr           321        3

In the regex pattern \\d+, \\d represents any digit, and + means one or more of the preceding element. Without the +, \\d matches only the first digit, which is why street_n holds a single digit.
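
The pattern \\d+ matches the first run of digits anywhere in the string. If the number must be at the very beginning of the address, the ^ anchor can be added. The sketch below uses a made-up second address to show the difference.

```r
library(stringr)

addresses <- c("123 Main St", "Suite 5 456 Pine Ave")

# First run of digits anywhere in the string
first_digits <- str_extract(addresses, "\\d+")
first_digits     # "123" "5"

# Only digits at the start of the string; NA when there are none
leading_digits <- str_extract(addresses, "^\\d+")
leading_digits   # "123" NA
```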

Extracting Street Types

Let’s extract the street types. We’ll use str_extract() again with a new pattern.

# Extract street types
df$street_type <- str_extract(df$address, pattern = "\\b\\w+$")
df

       address street_number street_n street_type
1  123 Main St           123        1          St
2 456 Pine Ave           456        4         Ave
3 789 Oak Blvd           789        7        Blvd
4   321 Elm Dr           321        3          Dr

In this pattern, \\b\\w+$ represents “match a word boundary (\\b), followed by one or more word characters at the end of the string (\\w+$)”.
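
The street name itself can be pulled out in the same spirit. One possible approach, sketched here rather than prescribed, is to use capture groups with str_match, which returns the full match in the first column and one column per group. The third address and the street_parts data frame are illustrative.

```r
library(stringr)

addresses <- c("123 Main St", "456 Pine Ave", "100 West Trade Street")

# Group 1: leading digits; group 2: everything in between; group 3: last word
parts <- str_match(addresses, "^(\\d+)\\s+(.+)\\s+(\\w+)$")

street_parts <- data.frame(
  street_number = parts[, 2],
  street_name   = parts[, 3],
  street_type   = parts[, 4],
  stringsAsFactors = FALSE
)
street_parts
```

Because the middle group is greedy, a multi-word name such as "West Trade" stays together while the last word is still captured as the street type.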

nc_cities <- c("Charlotte", "Raleigh", "Greensboro", "Durham", "Winston-Salem", "Fayetteville", "Cary", "Wilmington", "High Point")

str_view(nc_cities, "al")

[2] │ R<al>eigh
[5] │ Winston-S<al>em

str_detect(nc_cities, "a")

[1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
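
The logical vector returned by str_detect can be used to subset the original vector, and str_subset does both steps at once. A brief sketch, with illustrative object names:

```r
library(stringr)

nc_cities <- c("Charlotte", "Raleigh", "Greensboro", "Durham", "Winston-Salem",
               "Fayetteville", "Cary", "Wilmington", "High Point")

# Keep only the cities whose name contains an "a"
with_a <- nc_cities[str_detect(nc_cities, "a")]

# str_subset() is a shortcut for detect-then-subset
ends_in_ville <- str_subset(nc_cities, "ville$")

# Names that contain a space or a hyphen
multi_part <- str_subset(nc_cities, "[ -]")
```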

How many cities start with “C”?

sum(str_detect(nc_cities, "^C"))

[1] 2

sum(str_detect(nc_cities, "^c"))

[1] 0

Exercise

Why are the results different?

str_count(nc_cities, "e")

[1] 1 1 2 0 1 3 0 0 0

Exercise

What is the above function doing?

While regex is very powerful and helpful in many tasks, it takes some time to master. At this point, you should understand how patterns can be used to extract the elements we need from text.

Exercise

Explore Regex Translator to understand and explore how various text patterns are converted into regex.

Matching messy texts

The Worker Adjustment and Retraining Notification (WARN) Act requires employers with 100 or more employees (generally not counting those who have worked less than six months in the last 12 months and those who work an average of less than 20 hours a week) to provide at least 60 calendar days’ advance written notice of a plant closing and mass lay-off affecting 50 or more employees at a single site of employment. In North Carolina, the Department of Commerce is in charge of collecting and archiving the notices. Research by the Federal Reserve Bank of Cleveland suggests that these notices are useful bellwethers for economic conditions in the state.

We will use another business listings dataset, ReferenceUSA from Infogroup, to establish additional information about the businesses listed in the WARN database. We limit our analysis to Mecklenburg County in North Carolina. ReferenceUSA data for the US can be obtained from the UNC library. The WARN dataset comes from the Federal Reserve Bank of Cleveland. The citation for the data is:

Krolikowski, Pawel M. and Kurt G. Lunsford. 2020. “Advance Layoff Notices and Labor Market Forecasting.” Federal Reserve Bank of Cleveland, Working Paper no. 20-03. https://doi.org/10.26509/frbc-wp-202003

We use the postmastr package, which is not available on CRAN. We can install it from GitHub using the remotes package.

#install.packages("remotes")
#remotes::install_github("slu-openGIS/postmastr")

warn <- read_csv("./textmatching/WARN_Mecklenburg.csv")

Rows: 111 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): county, notice_date, received_date, effective_date, company, type, ...
dbl (3): number_affected, zipcode_city, warn_no

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

names(warn)

 [1] "county"          "notice_date"     "received_date"   "effective_date" 
 [5] "company"         "type"            "number_affected" "zipcode_city"   
 [9] "warn_no"         "address"

Parsing the Addresses

Most geocoders expect an address that is properly standardised (e.g. South as S, Boulevard as Blvd) with spelling errors corrected. This is important because user-entered addresses rely on personal and local convention rather than on a standard.
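
To illustrate what standardisation means in practice, here is a minimal sketch that abbreviates a few words with str_replace_all and a named vector of pattern-replacement pairs. The addresses and the abbreviations lookup are made up for illustration; this is not how postmastr works internally.

```r
library(stringr)

raw_addresses <- c("100 SOUTH TRYON STREET", "5501 CARNEGIE BOULEVARD")

# Hypothetical lookup: names are regex patterns, values are replacements
abbreviations <- c(
  "\\bSOUTH\\b"     = "S",
  "\\bSTREET\\b"    = "ST",
  "\\bBOULEVARD\\b" = "BLVD"
)

standardised <- str_replace_all(raw_addresses, abbreviations)
standardised
# "100 S TRYON ST" "5501 CARNEGIE BLVD"
```

A blind replacement like this would also shorten a street that is actually named SOUTH STREET, which is one reason to prefer a dedicated parser that looks at where a word sits within the address.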

First, we need to create a unique ID for each unique address, so that only unique addresses are parsed.

In the following code, I replace all the diacritics (e.g. ř, ü, é, è) by transliterating them to plain ASCII Latin characters. This is not strictly necessary, but it is useful to remember that some place names in the US are in Spanish. Some databases store the diacritics and some don’t. Apologies to speakers of other languages.

It is always a good idea to use a single case. We are going to use upper case.

warn <- warn %>%
        mutate(address = str_replace_all(address, "[[:space:]]", " "), #any whitespace like tabs or spaces
               address = stringi::stri_trans_general(address, "Latin-ASCII"),
               address = str_remove_all(address, "[[:punct:]]"),
               address = str_to_upper(address))
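
To see what each cleaning step does, the sketch below applies the same four functions one at a time to a single made-up address containing a tab, a diacritic, and punctuation. The intermediate names (step1 to step3) are only for illustration.

```r
library(stringr)

# Made-up address; \u00f1 is the letter n with a tilde, \t is a tab
messy <- "123 N. Pe\u00f1a\tBlvd., Charlotte"

step1   <- str_replace_all(messy, "[[:space:]]", " ")        # tabs etc. become spaces
step2   <- stringi::stri_trans_general(step1, "Latin-ASCII") # strip diacritics
step3   <- str_remove_all(step2, "[[:punct:]]")              # drop punctuation
cleaned <- str_to_upper(step3)                               # single case

cleaned
# "123 N PENA BLVD CHARLOTTE"
```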

Exercise

Explore stringr documentation for examples of matching various text patterns using the package.
Explore postmastr workflow to understand how we can work with address data with the package.

library(postmastr)
warn <- pm_identify(warn, var = "address")
warn_min <- pm_prep(warn, var = "address", type ='street') # There do not seem to be any addresses that are based on intersections. So we are using the type=street.
nrow(warn)

[1] 111

nrow(warn_min)

[1] 91

You should notice the difference in the number of rows: 20 observations are dropped because pm_prep keeps only the unique addresses.
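
The idea of keeping only unique addresses can be illustrated with base R on a toy vector. This is a conceptual sketch only and makes no claim about how postmastr implements it.

```r
addresses <- c("5501 JOSH BIRMINGHAM PKWY CHARLOTTE NC 28208",
               "4800 HANGAR ROAD CHARLOTTE NC 28208",
               "5501 JOSH BIRMINGHAM PKWY CHARLOTTE NC 28208")

n_all       <- length(addresses)          # all rows
n_unique    <- length(unique(addresses))  # distinct addresses only
n_duplicate <- sum(duplicated(addresses)) # rows that repeat an earlier address

c(all = n_all, unique = n_unique, dropped = n_duplicate)
```

Several companies can share one address, so the number of notices is usually larger than the number of distinct addresses that need to be parsed.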

Extract Zipcodes and States

Zip codes come in two formats: a 5-digit variety and a 5+4-digit variety. pm_postal_parse can parse both, though in this instance only 5-digit codes are present.
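
For intuition, both formats can be matched with a single regular expression. The sketch below is a hand-written pattern, not the one postmastr uses, and the second address is made up. It assumes the hyphen is still present, so it would not work on text from which punctuation has already been removed.

```r
library(stringr)

addresses <- c("895 WEST TRADE STREET CHARLOTTE NC 28202",
               "100 MAIN ST CHARLOTTE NC 28202-1234")

# Five digits, optionally followed by a hyphen and four digits, at the end
zip_pattern <- "\\d{5}(-\\d{4})?$"

zips <- str_extract(addresses, zip_pattern)
zips
# "28202" "28202-1234"
```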

warn_min <- pm_postal_parse(warn_min)

In this instance, the only state present in the dataset is NC. First we need to create a dictionary in case NC is spelled out in different ways, such as NORTH CAROLINA, NC, or N CAROLINA. Fortunately, this dataset only contains NC. If it did not, we could use pm_append to add other variants of the state name to the dictionary.

ncDict <- pm_dictionary(locale = "us", type = "state", filter = "NC", case = "upper")
ncDict

# A tibble: 2 × 2
  state.output state.input   
  <chr>        <chr>         
1 NC           NC            
2 NC           NORTH CAROLINA
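
Conceptually, such a dictionary is a lookup table from the variants found in the data to a single standard output. The base R sketch below mimics that idea with a hand-built data frame that borrows the column names shown above. The N CAROLINA row and the object names are assumptions for illustration, and the lookup with match is not postmastr's implementation.

```r
# Hypothetical dictionary with one extra variant added by hand
state_dict <- data.frame(
  state.output = c("NC", "NC", "NC"),
  state.input  = c("NC", "NORTH CAROLINA", "N CAROLINA"),
  stringsAsFactors = FALSE
)

raw_states <- c("NORTH CAROLINA", "NC", "N CAROLINA", "SC")

# Look up each raw value; values missing from the dictionary become NA
standard_states <- state_dict$state.output[match(raw_states, state_dict$state.input)]
standard_states
# "NC" "NC" "NC" NA
```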

(warn_min <- pm_state_parse(warn_min, dictionary = ncDict))

# A tibble: 91 × 4
   pm.uid pm.address                          pm.state pm.zip
    <int> <chr>                               <chr>    <chr> 
 1      1 895 WEST TRADE STREET CHARLOTTE     NC       28202 
 2      2 5501 JOSH BIRMINGAHM PKWY CHARLOTTE NC       28208 
 3      3 4800 HANGAR ROAD CHARLOTTE          NC       28208 
 4      4 5020 HANGAR ROAD CHARLOTTE          NC       28208 
 5      5 4716 YORKMONT ROAD CHARLOTTE        NC       28208 
 6      6 5501 JOSH BIRMINGHAM PKWY CHARLOTTE NC       28208 
 7      7 5000 HANGAR ROAD CHARLOTTE          NC       28208 
 8      8 100 WEST TRADE STREET CHARLOTTE     NC       28202 
 9      9 5501 CARNEGIE BLVD CHARLOTTE        NC       28209 
10     10 2200 REXFORD ROAD CHARLOTTE         NC       28211 
# ℹ 81 more rows

ncCityDict <- pm_dictionary(locale = "us", type = "city", filter = "NC", case = "upper")
(warn_min <- pm_city_parse(warn_min, dictionary = ncCityDict))

# A tibble: 91 × 5
   pm.uid pm.address                pm.city   pm.state pm.zip
    <int> <chr>                     <chr>     <chr>    <chr> 
 1      1 895 WEST TRADE STREET     CHARLOTTE NC       28202 
 2      2 5501 JOSH BIRMINGAHM PKWY CHARLOTTE NC       28208 
 3      3 4800 HANGAR ROAD          CHARLOTTE NC       28208 
 4      4 5020 HANGAR ROAD          CHARLOTTE NC       28208 
 5      5 4716 YORKMONT ROAD        CHARLOTTE NC       28208 
 6      6 5501 JOSH BIRMINGHAM PKWY CHARLOTTE NC       28208 
 7      7 5000 HANGAR ROAD          CHARLOTTE NC       28208 
 8      8 100 WEST TRADE STREET     CHARLOTTE NC       28202 
 9      9 5501 CARNEGIE BLVD        CHARLOTTE NC       28209 
10     10 2200 REXFORD ROAD         CHARLOTTE NC       28211 
# ℹ 81 more rows

Parsing Street, Numbers and Direction

We can use similar functions to parse out the street number.

warn_min <- warn_min %>%
             pm_house_parse()

The directionality of a street is a bit of a challenge. North could be a direction or part of a street name, as in North St. postmastr already has logic built in to distinguish these two cases. By default, postmastr uses the dic_us_dir dictionary.

dic_us_dir

# A tibble: 20 × 2
   dir.output dir.input 
   <chr>      <chr>     
 1 E          E         
 2 E          East      
 3 N          N         
 4 N          North     
 5 NE         NE        
 6 NE         Northeast 
 7 NE         North East
 8 NW         NW        
 9 NW         Northwest 
10 NW         North West
11 S          S         
12 S          South     
13 SE         SE        
14 SE         Southeast 
15 SE         South East
16 SW         SW        
17 SW         Southwest 
18 SW         South West
19 W          W         
20 W          West

We have already converted our strings to upper case, whereas dic_us_dir assumes title case (e.g. North East). We will have to modify the dictionary to fit our use case.

dic_us_dir <- dic_us_dir %>%
              mutate(dir.input = str_to_upper(dir.input))
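
The reason the case has to agree is that dictionary lookups are exact string comparisons. A small sketch with a hand-built, two-row stand-in for the direction dictionary (dir_dict is a hypothetical name) shows the effect:

```r
library(stringr)

# Two rows in the same shape as the direction dictionary
dir_dict <- data.frame(
  dir.output = c("N", "N"),
  dir.input  = c("N", "North"),
  stringsAsFactors = FALSE
)

# Our addresses are upper case, so "NORTH" is not found at first
found_before <- "NORTH" %in% dir_dict$dir.input   # FALSE

# After converting the dictionary input to upper case, it is found
dir_dict$dir.input <- str_to_upper(dir_dict$dir.input)
found_after <- "NORTH" %in% dir_dict$dir.input    # TRUE
```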


warn_min <- warn_min %>%  
             pm_streetDir_parse(dictionary = dic_us_dir) %>%
             pm_streetSuf_parse() %>%
             pm_street_parse(ordinal = TRUE, drop = TRUE)

Once we have parsed the data, we add the parsed data back into the source.

warn_parsed <- pm_replace(warn_min, source = warn) %>%
               pm_rebuild(output="short", keep_parsed = 'yes')

Now it is straightforward to geocode the addresses using the Census geocoder. You can quickly visualise the results using tmap.

# install.packages("remotes")
#remotes::install_github("chris-prener/censusxy")
library(censusxy)

warn_sf <- cxy_geocode(warn_parsed, street = "pm.address", city = "pm.city", state = "pm.state", zip = "pm.zip",
    output = "full", class = "sf", parallel = 4)

35 rows removed to create an sf object. These were addresses that the geocoder could not match.

# The parallel argument only works on non-Windows operating systems; set it to the number of cores you want to use

Other geocoding tools are available and may provide better results, depending on their databases and algorithms.

library(tmap)

The legacy packages maptools, rgdal, and rgeos, underpinning this package
will retire shortly. Please refer to R-spatial evolution reports on
https://r-spatial.org/r/2023/05/15/evolution4.html for details.
This package is now running under evolution status 0

tmap_mode('view')

tmap mode set to interactive viewing

m1 <-
tm_shape(warn_sf)+
  tm_symbols(col ='red') + 
  tm_basemap(leaflet::providers$Stamen.TonerHybrid)
  
m1
