Latest Versions & Updates: This markdown document was built using the following versions of R and RStudio:

  • R v. 3.5.1
  • RStudio v. 1.1.456
  • Document v. 1.1
  • Last Updated: 2018-12-06


1 Introductory Review

The following provides an overview of techniques we’ve learned, including links to the original session.


1.1 Objects & Assignment

Objects in R contain single values, multiple values (vectors), and tabular data (data frames).

The Assignment Operator, <-, names and stores one or more values, functions, or data structures.

my_value <- 5                                            # Store a single value

my_vector <- c(5, 10, 15)                                # Vectors: Concatenated values

my_dataframe <- data.frame(x = c(1, 2, 3),
                           y = c("a", "b", "c"),
                           z = c(TRUE, TRUE, FALSE))     # Data Frames: Tabular structures


Print objects by simply entering the object name or explicitly using the function print().

my_value             # Autoprints using only the object name
## [1] 5
print(my_vector)     # Explicitly prints with function print() 
## [1]  5 10 15


Built-In Objects already exist in R, such as letters, a vector of all lowercase letters, or mtcars, a dataset on cars from the 1974 Motor Trend magazine.

letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Original Session: Intro to R: Operators


1.2 Operators

Arithmetic Operators in R are used for addition, subtraction, multiplication, division, exponentiation, and controlling operator precedence with parentheses.

Class Numeric data are required.

(5^2 * 4) / 2
## [1] 50


Relational Operators in R are used in relational statements that compare one or a series of values, e.g. <, >, ==, !=.

Class Logical values result from relational statements, i.e. TRUE or FALSE.

10 < c(8, 9, 11, 12)
## [1] FALSE FALSE  TRUE  TRUE


Logical Operators bind multiple relational statements.

OR, i.e. |, requires at least one statement to be TRUE.

5 > 1 | 10 < 5
## [1] TRUE

AND, i.e. &, requires all statements to be TRUE.

5 > 1 & 10 < 5
## [1] FALSE

Original Session: Intro to R: Operators


1.3 Subsetting & Indexing

The Dollar Sign Operator, i.e. $, subsets or extracts a specific variable from a dataset.

mtcars$mpg          # Combine the dataset name and variable name to subset the variable
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4


Indexing a subset variable is done with brackets, [ & ], and the number or numbers of the element(s) by position.

mtcars$mpg[5]      # Combine the dataset, variable, and position to extract a specific value
## [1] 18.7


Index by Row & Column Position using the row number and column number in brackets, separated by a comma, ,.

mtcars[25, 1]
## [1] 19.2


Index by Name using the row name and column name in the same manner.

mtcars["Pontiac Firebird", "mpg"]
## [1] 19.2


Index Multiple Positions by concatenating more than one position number using function c().

mtcars["Honda Civic", c(1, 2, 4, 6)]
##              mpg cyl hp    wt
## Honda Civic 30.4   4 52 1.615


Subset All Rows or All Columns by leaving the position empty within the brackets.

mtcars[1:5, ]        # Subset rows 1-5 and all columns
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
mtcars[, c(1, 2)]    # Subset columns 1-2 and all rows
##                      mpg cyl
## Mazda RX4           21.0   6
## Mazda RX4 Wag       21.0   6
## Datsun 710          22.8   4
## Hornet 4 Drive      21.4   6
## Hornet Sportabout   18.7   8
## Valiant             18.1   6
## Duster 360          14.3   8
## Merc 240D           24.4   4
## Merc 230            22.8   4
## Merc 280            19.2   6
## Merc 280C           17.8   6
## Merc 450SE          16.4   8
## Merc 450SL          17.3   8
## Merc 450SLC         15.2   8
## Cadillac Fleetwood  10.4   8
## Lincoln Continental 10.4   8
## Chrysler Imperial   14.7   8
## Fiat 128            32.4   4
## Honda Civic         30.4   4
## Toyota Corolla      33.9   4
## Toyota Corona       21.5   4
## Dodge Challenger    15.5   8
## AMC Javelin         15.2   8
## Camaro Z28          13.3   8
## Pontiac Firebird    19.2   8
## Fiat X1-9           27.3   4
## Porsche 914-2       26.0   4
## Lotus Europa        30.4   4
## Ford Pantera L      15.8   8
## Ferrari Dino        19.7   6
## Maserati Bora       15.0   8
## Volvo 142E          21.4   4


Filter with Relational Operators by placing a relational statement in the row position, in brackets.

mtcars[mtcars$mpg < 15, ]     # Subset only cars with less than 15 mpg
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4


Assign Subset Data to New Objects using the assignment operator, <-, an object name, and the subset data.

gas_guzzlers <- mtcars[mtcars$mpg < 15, ]


Save Objects to Index Data using the assignment operator, <- and one or more relational statements.

index <- mtcars$cyl == 8 & mtcars$hp > 240     # Store logical values: TRUE or FALSE

dream_cars <- mtcars[index, ]                  # Use the indexing object in the row position

print(dream_cars)                              # Print results
##                 mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Duster 360     14.3   8  360 245 3.21 3.57 15.84  0  0    3    4
## Camaro Z28     13.3   8  350 245 3.73 3.84 15.41  0  0    3    4
## Ford Pantera L 15.8   8  351 264 4.22 3.17 14.50  0  1    5    4
## Maserati Bora  15.0   8  301 335 3.54 3.57 14.60  0  1    5    8

Original Session: Intro to R: Subsets & Indices


2 On Classes

Classes of both variables and single values dictate how R will recognize and work with them.


2.1 Determining Class

Identify Class by using the class() function and inputting either one or more values or an object.

class(10L)                # Call class() on a single value; here, "L" indicates an integer
## [1] "integer"
class(c(TRUE, FALSE))     # Call class() on multiple values, e.g. "logical" values
## [1] "logical"
class(mtcars)             # Call class() on an object with stored data to determine structure
## [1] "data.frame"
class(mtcars$mpg)         # Call class() on a subset variable for the class of its values
## [1] "numeric"


Numeric data include any quantitative data, including:

  • Class numeric is an all-encompassing term for quantitative data
  • Class integer, or values comprised of whole numbers
  • Class double, or values with floating decimals


Logical data contain logical values, e.g. TRUE or FALSE.

Under the hood, logical data are represented by 1 and 0.

TRUE == 1
## [1] TRUE
FALSE == 0
## [1] TRUE
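
Because TRUE behaves as 1 and FALSE as 0, arithmetic functions can count logical values. The following is a minimal sketch of that idea; the object names mt, low_mpg, n_low, and p_low are made up for illustration.

```r
mt <- datasets::mtcars     # Pristine copy of the built-in dataset

low_mpg <- mt$mpg < 15     # Logical vector: TRUE where mpg is below 15

n_low <- sum(low_mpg)      # Each TRUE counts as 1, so sum() counts the matches
p_low <- mean(low_mpg)     # mean() returns the proportion of TRUE values

n_low
p_low
```

This is why a relational statement can be summed directly to answer "how many?" without filtering first.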


Character data contain uncategorized text, e.g. “Onondaga County”.

my_county <- "Onondaga County"
class(my_county)
## [1] "character"


Factor data represent categorical data where each category is a “level”, e.g. gender, race, or census tract.

cylinders <- factor(mtcars$cyl)        # Create factors using the factor() function
class(cylinders)
## [1] "factor"
levels(cylinders)                      # Function levels() prints each category in a factor
## [1] "4" "6" "8"


2.2 Coercing Classes

Coercion is the act of converting values and objects to new classes, usually with an as.*() function, e.g. as.character().

class(mtcars$mpg)                          # Print the class of variable "mpg"
## [1] "numeric"
mtcars$mpg <- as.character(mtcars$mpg)     # Coerce the class from "numeric" to "character"
class(mtcars$mpg)                          # Re-print the class to confirm changes
## [1] "character"


The Purpose of Coercion is so R will treat your values in the manner you intend.

Function Overloading is the quality in R which allows functions to behave differently depending on object class.

class(mtcars$cyl)       # Determine class for variable "cyl", or number of cylinders
## [1] "numeric"
summary(mtcars$cyl)     # Print descriptive statistics for numeric data with summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   4.000   6.000   6.188   8.000   8.000
mtcars$cyl <- as.character(mtcars$cyl)     # Coerce to class "character" with as.character()
summary(mtcars$cyl)                        # Function summary() now prints the number of elements
##    Length     Class      Mode 
##        32 character character
mtcars$cyl <- as.factor(mtcars$cyl)     # Coerce to class "factor" with as.factor()
summary(mtcars$cyl)                     # Prints each "level" (category) and frequency of each
##  4  6  8 
## 11  7 14
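
One caveat when coercing a factor back to numbers: as.numeric() returns the internal level codes rather than the original values. The sketch below uses a small made-up vector (cyl_factor) to show the difference; coercing through as.character() first is one common workaround.

```r
cyl_factor <- factor(c(6, 6, 4, 8))                # Levels are "4", "6", "8"

codes  <- as.numeric(cyl_factor)                   # Level codes, not the original values
values <- as.numeric(as.character(cyl_factor))     # Text labels first, then numbers

codes
values
```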


Identify All Classes in a Dataset by using the function str(), or “structure”, which prints the:

  • Dimensions of the dataset, i.e. total rows (observations) and columns (variables)
  • Class of the data structure
  • Class of each variable
  • First few values of each variable
  • Quantity and name of each factor level
str(iris)     # Print the structure of the "iris" dataset, or 150 measures of iris species
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


2.3 Importance of Coercion

Coercion in Data Visualization is also very important. Observe the following. What do you notice about the x-axis?

data(mtcars)
plot(x = mtcars$cyl, 
     y = mtcars$mpg,
     col = "tomato",
     xlab = "Number of Cylinders",
     ylab = "Miles per Gallon",
     main = "Cylinders vs. MPG")

Note: R identifies two continuous variables and makes a scatterplot, assuming 5- and 7-cylinder engines are missing.


Prevent Categorical Variables from Appearing Continuous by coercing “numeric” variables to class “factor”.

data(mtcars)
plot(x = as.factor(mtcars$cyl),        # The only change is nesting the variable in as.factor()
     y = mtcars$mpg, 
     col = "tomato",
     xlab = "Number of Cylinders",
     ylab = "Miles per Gallon",
     main = "Cylinders vs. MPG")

Function Overloading occurs as function plot() now acknowledges the “factor”, creating a box plot.


Coercion in Regression is even more important. Using lm(), we’ll try to create a linear model with the original mtcars.

data(mtcars)
my_lm <- lm(mpg ~ cyl, 
            data = mtcars)
print(my_lm)
## 
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl  
##      37.885       -2.876

Incorrect Interpretation: Per the coefficients, every unit of cyl added reduces mpg by 2.88. This is absurd.


data(mtcars)
my_lm <- lm(mpg ~ as.factor(cyl), 
            data = mtcars)
print(my_lm)
## 
## Call:
## lm(formula = mpg ~ as.factor(cyl), data = mtcars)
## 
## Coefficients:
##     (Intercept)  as.factor(cyl)6  as.factor(cyl)8  
##          26.664           -6.921          -11.564

Correct Interpretation: Relative to 4-cylinder engines (the reference level, averaging 26.66 mpg), 6-cylinder engines average 6.92 fewer mpg, while 8-cylinder engines average 11.56 fewer mpg.
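
To see where the factor coefficients come from, the sketch below compares them with the average mpg in each cylinder group. The object names (mt, group_means, fit, base_mean, diffs) are made up for illustration.

```r
mt <- datasets::mtcars                                  # Pristine copy of the dataset

group_means <- tapply(mt$mpg, mt$cyl, mean)             # Mean mpg for 4, 6, and 8 cylinders

fit   <- lm(mpg ~ as.factor(cyl), data = mt)            # Same model as above
coefs <- unname(coef(fit))                              # Intercept, then the two differences

base_mean <- as.numeric(group_means["4"])               # Reference group: 4 cylinders
diffs     <- as.numeric(group_means[c("6", "8")]) - base_mean

round(group_means, 3)
round(coefs, 3)
round(diffs, 3)
```

The intercept matches the 4-cylinder mean, and each remaining coefficient matches the gap between that group's mean and the 4-cylinder mean.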


2.4 Practice

Coercing Numeric Data is equally important, as demonstrated in the following scenario.

Scenario: Your colleague has written a PDF scraper to extract key Form 990 data, seen in dataset form_990:

form_990 <- data.frame("FY_2017" = c("764882", "240739", "49212"), 
                       "FY_2018" = c("841912", "263997", "41315"), 
                       stringsAsFactors = FALSE, 
                       row.names = c("Programming Expenses", 
                                     "Administrative Expenses", 
                                     "Fund Development Expenses"))
print(form_990)
##                           FY_2017 FY_2018
## Programming Expenses       764882  841912
## Administrative Expenses    240739  263997
## Fund Development Expenses   49212   41315

Practice: Find the sum total of all expenses in fiscal years 2017 and 2018.

  • Determine the class of each variable using class().
  • Subset each variable in form_990 using the $ operator.
  • Use the appropriate coercion function to render the data usable.
  • Use function sum() to find the total of each fiscal year.
  • Use function sum() again on the totals.


Conclusions: Identifying variable classes is a crucial first step in exploratory data analysis. As demonstrated above, failing to identify and coerce classes can be fatal to the accuracy of your analyses and visualizations. We’ve only looked at coercion with “numeric” and “factor” classes, but for nearly every data class (and there are many more), there is a way to coerce it to a more appropriate and actionable class.


2.5 Further Resources: Factors

Learn More about factor() and as.factor() by calling help(factor) and help(as.factor) within R. In addition, I highly recommend exploring the fourth module in DataCamp’s free Introduction to R.


3 Text Data

The following provides an overview of base R functions for data of class “character”. Run the following in R.

url <- "https://tinyurl.com/y9xuc5pa"
construct <- read.csv(file = url, stringsAsFactors = FALSE); rm(url)

These are the records of Quality Structures, Inc., the largest of multiple contractors working on Syracuse International Airport’s 2018 renovations, retrieved via a Freedom of Information Act (FOIA) request.

Read the documentation here: REIS GitHub Repository.


3.1 An Introduction

Overview: Data of class “character” is often easily distinguishable due to quotations, e.g. "this".

Any values you write or store are automatically converted to class “character” when using quotations. Observe:

my_word <- "perspicacity"     # Quotes guarantee that value will be stored as class "character"
class(my_word)
## [1] "character"
print(my_word)
## [1] "perspicacity"


String Manipulation is the act of manipulating text data, most often referred to as strings.

We can think of “strings” as a sequence of characters, which may be alphabetical or numeric.


3.2 Pasting Strings

Pasting is the act of combining multiple strings to form a longer or more complex string, performed with paste().

x <- "I'm"
y <- "learning"
z <- "R!"
paste(x, y, z, sep = " ")     # Argument "sep =" specifies the character between pasted strings
## [1] "I'm learning R!"


Notice that we’ve pasted together objects, but you can just as easily input the strings by hand:

paste("I'm", "learning", "R!", sep = " ")
## [1] "I'm learning R!"


The versatility of paste() is often underappreciated at first glance. We could goof off by tampering with sep =:

paste("Millennial:", x, y, z, sep = ", like, ")
## [1] "Millennial:, like, I'm, like, learning, like, R!"


We could do something more useful, like combine names in a character roster. First, let’s create one:

first <- c("Luis", "Cody", "Shannon", "Jamison")
last <- c("Escoboza", "Peck", "Connor", "Crawford")
roster <- data.frame(first, last)
print(roster)
##     first     last
## 1    Luis Escoboza
## 2    Cody     Peck
## 3 Shannon   Connor
## 4 Jamison Crawford


Now we can use paste to make a “Surname, First Name” format, like so:

paste(roster$last, roster$first, sep = ", ")
## [1] "Escoboza, Luis"    "Peck, Cody"        "Connor, Shannon"  
## [4] "Crawford, Jamison"


Now we can add it as a new variable using $.

roster$both <- paste(roster$last, roster$first, sep = ", ")
print(roster)
##     first     last              both
## 1    Luis Escoboza    Escoboza, Luis
## 2    Cody     Peck        Peck, Cody
## 3 Shannon   Connor   Connor, Shannon
## 4 Jamison Crawford Crawford, Jamison


We could also create a sequence of URLs for a web crawler, e.g. adult literacy programs around Dallas, TX:

url <- "https://www.nationalliteracydirectory.org/programs?q=75201&radius=25&student="
iteration <- as.character(c(1:6))

paste(url, iteration, sep = "")
## [1] "https://www.nationalliteracydirectory.org/programs?q=75201&radius=25&student=1"
## [2] "https://www.nationalliteracydirectory.org/programs?q=75201&radius=25&student=2"
## [3] "https://www.nationalliteracydirectory.org/programs?q=75201&radius=25&student=3"
## [4] "https://www.nationalliteracydirectory.org/programs?q=75201&radius=25&student=4"
## [5] "https://www.nationalliteracydirectory.org/programs?q=75201&radius=25&student=5"
## [6] "https://www.nationalliteracydirectory.org/programs?q=75201&radius=25&student=6"
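
As a side note, paste0() is shorthand for paste() with sep = "", and both coerce numbers to text automatically. A minimal sketch, assuming a placeholder address (base_url is hypothetical and not a real endpoint):

```r
base_url <- "https://example.org/programs?page="     # Hypothetical placeholder URL
pages    <- 1:3                                       # Numeric values are coerced to text

with_paste  <- paste(base_url, pages, sep = "")       # Explicit empty separator
with_paste0 <- paste0(base_url, pages)                # Same result, less typing

identical(with_paste, with_paste0)
```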


Conclusions: The National Literacy Directory, which provides the search results for the above pages, is owned by the Dollar General Literacy Foundation - and they absolutely do not want those data shared, despite the tremendous potential they could have in the hands of researchers for ameliorating today’s adult literacy crisis. Fortunately, you can only search within a 25-mile radius, which limits the number of search options.

Of course, if you just change radius=25 in url to, say, radius=6000, or 1/4 of the circumference of the earth, you’d have every adult education program in the United States, including Hawaii and Alaska. That’s 537 individual pages of search results through which one could sequence, covering 10,730 programs. But you should definitely not do that.

In sum, paste() is extremely useful. Never forget it.


4 Mid-Session Review

The following reviews key concepts with which we’ve practiced, emphasizing elements in the first half of the present work.


4.1 Assignment

  • Assignment is the process of storing information within an object using <-
  • Classes depict variable type and determine how variables behave, e.g. “numeric”, “character”, etc.
  • Concatenation binds separate values into a vector using function c() and preserves distinctness
my_object <- c(1, 3, 5)

class(my_object)
## [1] "numeric"


4.2 Operators

  • Arithmetic Operators are used like any scientific calculator, (+, -, *, /)
  • Relational Operators compare values and evaluated expressions, e.g. <, >, ==, etc.
  • Logical Operators combine relational operator statements, i.e. | and &
    • Relational statements typically output logical values
      • If a value meets the condition: TRUE
      • If a value does not meet the condition: FALSE
3 + 3 == 6 & 3 <= 12 / 4
## [1] TRUE


4.3 Data Structures

  • Objects store values, functions, tabular datasets, and more advanced data structures
  • Vectors are one-dimensional arrays of one or more elements
  • Matrices are tabular data structures containing data of a uniform class
  • Data Frames are also tabular data structures, but can contain mixed classes
my_object <- 10

my_vector <- c("a", "b", "c")

my_matrix <- matrix(data = 1:4, 
                    nrow = 2, 
                    ncol = 2)

my_dataframe <- data.frame("x" = c(1, 2, 3), 
                           "y" = c("a", "b", "c"))
  • Printing is the act of producing a value or values
    • Autoprinting occurs simply by evaluating an object name
    • Explicit Printing occurs when using function print()
my_vector
## [1] "a" "b" "c"
print(my_dataframe)
##   x y
## 1 1 a
## 2 2 b
## 3 3 c


4.4 Subsetting & Indexing

  • Subset Variables by using the $ operator, e.g. dataframe_name$variable_name
  • Index Vector Values by inputting element position within brackets, e.g. vector_name[3]
  • Index Tabular Values by inputting row and column position, e.g. dataframe_name[12, 5]
  • Subset All Rows or Columns by leaving row or column position empty, e.g. df[12, ]
  • Determine Position of values meeting a specified condition using function which()
mtcars$mpg[5]
## [1] 18.7
mtcars[5, "mpg"]
## [1] 18.7
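
Function which(), mentioned above, has no example in this review, so here is a minimal sketch. The object names mt and positions are made up for illustration.

```r
mt <- datasets::mtcars               # Pristine copy of the built-in dataset

positions <- which(mt$mpg > 30)      # Row positions where the condition is TRUE

positions
rownames(mt)[positions]              # Use the positions to index the row names
```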


4.5 Filtering

  • Filtering Operations provide “logical” output based on conditions using relational operators
    • Filter Vectors by inserting relational operators within brackets, e.g. vector[vector > 5]
    • Filter Tabular Data by inserting conditions in row or column positions, e.g. df[df$var1 > 5, ]
  • Filtering with Objects allows filtering conditions to be assigned to objects
    • Assign Filtering Conditions using <- and comparators, e.g. index <- df$variable < 15
my_filter <- mtcars$mpg > 25

mtcars[my_filter, ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2


4.6 Classes

  • Variable Class determines how a variable behaves, which may differ by function
    • Critical to determine variable classes at outset of any analysis
    • Determine a single variable’s class using function class()
    • Determine multiple variables’ classes using function str() on a data frame
    • Determine if a variable is a specified class using is.*() functions, e.g. is.character()
  • Variable Classes include:
    • Numeric variables contain quantitative data which may or may not be integers, e.g. 5.0 or 5
    • Integer variables contain quantitative data comprised of whole numbers, e.g. 5
      • Integers may be explicitly defined using L, e.g. 5L
    • Double variables contain quantitative data with decimal points, e.g. 5.2
    • Logical variables are binary values comprised of TRUE and FALSE
      • Logicals typically appear when detecting or filtering with relational operators
    • Character variables contain qualitative data or text, e.g. "Some text."
      • Characters may be explicitly defined using quotation marks, i.e. ""
    • Factor variables contain discrete, nominal, or categorical data
      • Factors may be created explicitly using function factor()
      • Categories are referred to as levels, which may be ordered to create ordinal variables
      • View factor levels and their labels using function levels()
      • Order or rename levels by using arguments levels = and labels = in factor()
  • Coercion is the act of explicitly converting classes
    • Coerce classes using as.*() functions, e.g. as.numeric()
class(15)
## [1] "numeric"
is.character("Some text.")
## [1] TRUE
as.logical(0)
## [1] FALSE
as.logical(1)
## [1] TRUE
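
The levels = and labels = arguments noted above can be sketched as follows. The ratings are made-up values, and ordered = TRUE is used here to make the factor ordinal.

```r
ratings <- c("low", "high", "medium", "low")               # Made-up example values

ratings_ord <- factor(ratings,
                      levels  = c("low", "medium", "high"),     # Set the order of levels
                      labels  = c("Low", "Medium", "High"),     # Rename each level
                      ordered = TRUE)                           # Make the factor ordinal

levels(ratings_ord)
as.integer(ratings_ord)     # Position of each value within the ordered levels
ratings_ord < "High"        # Ordered factors support relational operators
```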


4.7 Strings & Pasting

  • Strings are sequences of characters nested in quotations to comprise character data
    • Strings are explicitly written using quotes, e.g. "This is a string."
  • Pasting is the act of combining strings using the extremely versatile function paste()
    • Argument sep = takes a character string with which to separate pasted values
    • Function paste0() collapses pasted strings with no separator by default
values <- 1:3

suffixes <- c("st", "nd", "rd")

paste("This", "is", "my", 
      paste0(values, suffixes), "string.", 
      sep = " ")
## [1] "This is my 1st string." "This is my 2nd string."
## [3] "This is my 3rd string."


5 String Fundamentals

Quotation marks in strings are possible by using single quotes: 'Excel is "fine", thanks.'

Strings without quotes are possible using functions:

  • noquote() for numbered strings
  • writeLines() for bare strings
writeLines("You could make this an error message if you're writing a custom R function.")
## You could make this an error message if you're writing a custom R function.

Convert numeric data to character data using functions:

  • as.character() for simple coercion
  • format() for customized formatting via arguments, including:
    • Scientific Notation: scientific =
    • Comma Separators: big.interval = or big.mark =
    • Alignment: justify =
  • formatC() for C-style formatting via arguments, including:
    • Explicit Plus Sign on Positive Numbers: flag = "+"
    • Left Alignment: flag = "-"
    • Leading Zeroes: flag = "0"
format(x = 00003500, 
       big.mark = ",", 
       drop0trailing = TRUE)
## [1] "3,500"
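
A few of the arguments listed above, sketched with made-up numbers. The object names are hypothetical; the calls are base R format() and formatC().

```r
with_commas <- format(x = 1234567, big.mark = ",")         # Comma separators
sci_note    <- format(x = 0.00001234, scientific = TRUE)   # Scientific notation

zero_padded <- formatC(x = 42, width = 6, flag = "0")      # Leading zeroes up to width 6
signed      <- formatC(x = 42, flag = "+")                 # Explicit plus sign
left_align  <- formatC(x = 42, width = 6, flag = "-")      # Left-aligned within width 6

print(c(with_commas, sci_note, zero_padded, signed, left_align))
```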

Further formatting options are available using the “scales” package.


6 Regular Expressions: Briefly

Regular Expressions, in short, use sequences of characters and metacharacters for powerful pattern recognition.

For example, the following metacharacters can be used in any pattern = string.

  • . indicates “any character”
  • * indicates “any number of times”
  • ^ indicates “beginning of string”
  • $ indicates “end of string”
  • \ indicates “ignore the following character, it’s actually a [insert metacharacter]”

Important Caveat: Since many patterns contain metacharacters, like ., you must keep “regex” in mind.

  • Imagine you’re searching for pattern = "Census Tract 5.00"
  • Since . is a metacharacter, R interprets this as “5”, any character, and “00”
  • Therefore, you have to use an escape sequence for certain characters: \ (typed as \\ inside an R string)
    • pattern = "Census Tract 5\\.00" will help ensure . really means .
    • Otherwise, it may detect, e.g. "Census Tract 5200"
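
The caveat above can be checked with base R's grepl(). This is a minimal sketch using two made-up labels (tracts); note that the backslash is doubled inside an R string, and that fixed = TRUE is an alternative that turns off regular expressions entirely.

```r
tracts <- c("Census Tract 5.00", "Census Tract 5200")     # Made-up labels

unescaped <- grepl(pattern = "Census Tract 5.00",         # "." matches any character
                   x = tracts)

escaped   <- grepl(pattern = "Census Tract 5\\.00",       # "\\." matches a literal period
                   x = tracts)

literal   <- grepl(pattern = "Census Tract 5.00",         # Treat the whole pattern literally
                   x = tracts,
                   fixed = TRUE)

unescaped
escaped
literal
```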

Don’t Worry: You won’t memorize “regex” unless you use them every day, but some things stick over time.

  • It’s perfectly fine to simply look them up as needed.
  • Just be aware of them when using pattern detection!

Learn More: To learn more about “regex”, I recommend Jenny Bryan’s Stat 545: “Regular Expression in R”.


Even a rudimentary understanding of regex is powerful for pattern detection.


7 Package “stringr”

Overview: The “stringr” package is designed specifically for working with character data:

Unified, Consistent Framework: All “stringr” functions:

  • Begin with str_ for easy autocompletion
  • Typically have a less intuitive counterpart function in base R
  • Always accept a vector of character values as the first argument

Installing & Loading: The following installs and loads package “stringr” if undetected:

if(!require("stringr")){install.packages("stringr")}
## Loading required package: stringr
library(stringr)


7.1 Pasting in “stringr”

The “stringr” equivalent of function paste() is str_c(). Advantages over paste() include:

  • Propagates rather than coerces missing, or NA, values
  • Collapses strings by default, i.e. sep = is "", similar to paste0()
paste("The", "quick", "brown", "fox", NA, "over", "the", "lazy", NA)
## [1] "The quick brown fox NA over the lazy NA"
str_c("The", "quick", "brown", "fox", NA, "over", "the", "lazy", NA)
## [1] NA
str_c("The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog.", 
      sep = " ")
## [1] "The quick brown fox jumps over the lazy dog."


7.2 Determining Length

Function str_length() determines the number of characters in a given string:

str_length("Duffle kerfuffle.")
## [1] 17

The base R equivalent is function nchar().


7.3 Extracting Substrings

Function str_sub() extracts a subset of characters determined by beginning and ending positions:

order_records <- c("The order arrived at 03:02 PM EST, 03 December 2018.",
                   "The order arrived at 12:19 PM EST, 03 December 2018.",
                   "The order arrived at 09:53 AM EST, 03 December 2018.")

arrivals <- str_sub(string = order_records,     # Input string or vector of strings
                    start = 22,                 # Indicate position number to begin extraction
                    end = 33)                   # Indicate position number to end extraction

data.frame("arrival_time" = arrivals)           # Organize extracted data
##   arrival_time
## 1 03:02 PM EST
## 2 12:19 PM EST
## 3 09:53 AM EST
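
For comparison, the base R counterpart of str_sub() is substr(), which takes start = and stop =. The sketch below reuses two of the records above under a new, made-up object name (records) and also pulls the date by counting back from the end of each string.

```r
records <- c("The order arrived at 03:02 PM EST, 03 December 2018.",
             "The order arrived at 09:53 AM EST, 03 December 2018.")

times <- substr(x = records, start = 22, stop = 33)     # Same positions as above

dates <- substr(x = records,
                start = 36,                             # First character of the date
                stop = nchar(records) - 1)              # Stop before the final period

data.frame("arrival_time" = times, "arrival_date" = dates)
```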


7.4 Pattern Recognition

Several “stringr” functions involve pattern recognition, which can be:

  • Written for an entire value, e.g. pattern = "Tract 32"
  • Written as part of a value with fixed(), e.g. pattern = fixed("Tract 32")
  • Written with more latitude using regular expressions, e.g. pattern = ".* 32"


7.5 Detect Matches

Detect patterns with str_detect(), which returns TRUE for each value containing the pattern and FALSE otherwise.

inconsistent_labels <- c("Tract 32", 
                         "census tract 32.00", 
                         "trAct 32", 
                         "ct 32",
                         "tract 5.00",
                         "CT 5")

str_detect(string = inconsistent_labels, 
           pattern = "32")                   # Only detects values containing "32"
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
str_detect(string = inconsistent_labels, 
           pattern = "32|5")                 # Detects values containing either "32" or "5"
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
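
Inconsistent capitalization is a common obstacle in detection. One way to handle it, sketched here with the base R counterpart grepl() and its ignore.case = argument, is to match regardless of case. The object names tract_labels and has_tract are made up for illustration.

```r
tract_labels <- c("Tract 32", "census tract 32.00", "trAct 32",
                  "ct 32", "tract 5.00", "CT 5")

has_tract <- grepl(pattern = "tract",
                   x = tract_labels,
                   ignore.case = TRUE)     # Match "tract", "Tract", "trAct", etc.

has_tract
tract_labels[has_tract]                    # Use the logical output to filter
```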


7.6 Return Matches

Return values containing a specified pattern using str_subset():

str_subset(string = inconsistent_labels,
           pattern = "32")
## [1] "Tract 32"           "census tract 32.00" "trAct 32"          
## [4] "ct 32"
str_subset(string = inconsistent_labels,
           pattern = "00")
## [1] "census tract 32.00" "tract 5.00"


7.7 Counting Matches

Count the occurrences of a specified pattern within each value using str_count():

str_count(string = inconsistent_labels, 
          pattern = "Tract|tract|trAct")
## [1] 1 1 1 0 1 0
str_count(string = inconsistent_labels, 
          pattern = "0")
## [1] 0 2 0 0 2 0


7.8 Splitting Strings

Split strings into composite parts with str_split() by specifying a pattern on which to split:

  • Here, we’ll use part of a review for Fallout 76 on Metacritic by user “noises1990”
  • Note: Punctuation has been modified slightly for instructional purposes
fo_review <- "This is complete crap! I've been waiting 20 minutes on the loading screen to connect to the server! After I connect because of desync and replication issues with the server, I get killed! Have to wait another 5 minutes on the loading screen to respawn! I go into a building and get stuck on a pile of trash so I need to fast travel somewhere!"

split_rev <- str_split(string = fo_review,
                       pattern = "! ",         # Split string for every occurrence of "! "
                       simplify = TRUE)        # "FALSE" returns list, "TRUE" returns matrix

data.frame("sentences" = split_rev[, 1:5], stringsAsFactors = FALSE)
##                                                                                   sentences
## 1                                                                     This is complete crap
## 2               I've been waiting 20 minutes on the loading screen to connect to the server
## 3    After I connect because of desync and replication issues with the server, I get killed
## 4                           Have to wait another 5 minutes on the loading screen to respawn
## 5 I go into a building and get stuck on a pile of trash so I need to fast travel somewhere!
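
The base R counterpart is strsplit(), which always returns a list. A minimal sketch with a short made-up string (short_review); fixed = TRUE treats the split pattern literally rather than as a regular expression.

```r
short_review <- "Great game! Terrible servers! Would play again!"     # Made-up text

parts <- strsplit(x = short_review,
                  split = "! ",
                  fixed = TRUE)[[1]]     # Extract the first (only) list element

parts
```

As in the example above, the final "!" survives because it is not followed by a space.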


7.9 Find & Replace Operations

Find & Replace the detected patterns with:

  • str_replace() for only the first pattern detected in string
  • str_replace_all() for all patterns detected in string
print(inconsistent_labels)
## [1] "Tract 32"           "census tract 32.00" "trAct 32"          
## [4] "ct 32"              "tract 5.00"         "CT 5"
str_replace_all(string = inconsistent_labels, 
                pattern = "\\.00| |[a-zA-Z]+", # Detect ".00", OR " ", OR runs of letters
                replacement = "")              # Replace with "", or nothing!
## [1] "32" "32" "32" "32" "5"  "5"
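
The base R counterparts are sub() for the first match and gsub() for all matches. The sketch below assumes the same labels under a made-up object name (tract_labels) and escapes the period so it is matched literally.

```r
tract_labels <- c("Tract 32", "census tract 32.00", "trAct 32",
                  "ct 32", "tract 5.00", "CT 5")

cleaned <- gsub(pattern = "\\.00| |[a-zA-Z]+",     # Literal ".00", OR " ", OR letters
                replacement = "",
                x = tract_labels)

first_only <- sub(pattern = "0",                   # sub() replaces only the first match
                  replacement = "",
                  x = "tract 5.00")

cleaned
first_only
```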


7.10 Trimming Whitespace

Trimming eliminates any extra spaces surrounding characters using str_trim().

  • Argument side = indicates which side to trim: "left", "right", or "both"
str_trim(string = "        mad whitespace    ", 
         side = "both")
## [1] "mad whitespace"
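
The base R counterpart is trimws(), where argument which = plays the role of side =. A minimal sketch using a made-up object name (messy):

```r
messy <- "        mad whitespace    "

trimmed_both <- trimws(x = messy, which = "both")     # Trim both sides
trimmed_left <- trimws(x = messy, which = "left")     # Trim only the left side

trimmed_both
nchar(trimmed_left)     # Trailing spaces are still counted
```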


7.11 Padding Strings

Padding is the opposite of trimming, where str_pad() allows you to add characters.

  • Argument side = indicates which side to pad: "left", "right", or "both"
  • Argument width = indicates the minimum number of characters in the padded result
  • Argument pad = indicates the character with which to pad

Here, we’ll use Syracuse’s Census Tract 61.02. Notably:

  • Tract-level FIPS codes must have 6 characters
  • Since we don’t have 6 characters, we can pad with leading zeroes
str_pad(string = "6102", 
        width = 6, 
        side = "left", 
        pad = "0")
## [1] "006102"

Use in concert with paste(), paste0(), or str_c() to create a full FIPS code!
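
As a rough sketch of that last step, the following builds an 11-digit tract-level FIPS code from its parts. The state and county codes are assumed here for illustration ("36" and "067"), and base R's sprintf() stands in for str_pad() so the example runs without extra packages.

```r
state  <- "36"        # Assumed state code, for illustration
county <- "067"       # Assumed county code, for illustration
tract  <- "61.02"     # Census Tract 61.02

tract_digits <- gsub(pattern = ".", replacement = "",
                     x = tract, fixed = TRUE)                  # Drop the period

tract_padded <- sprintf("%06d", as.integer(tract_digits))     # Pad to 6 digits with zeroes

fips <- paste0(state, county, tract_padded)                   # Combine with no separator

fips
```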


8 Applied Practice

Instructions: Run the following code to read in Census Geocoder output: geocoded.

  • These data are real output for the location of reported crimes in Syracuse.
  • Column names are found in “Record Layouts for Output” under “Batch Geocoding Process”.
  • See the Census output documentation here.
library(readr)

url <- "https://tinyurl.com/y92r2qcd"

names <- c("id", "input", "indicator", "type", "output", "coords", 
           "line_id", "id_side", "state", "county", "tract", "block")

geocoded <- read_csv(file = url, 
                     col_names = names)

geocoded <- geocoded[which(complete.cases(geocoded)), ]


Challenges: Perform the following tasks using the geocoded dataset:

  1. Recall that “tract” codes are not numeric values, but labels.
    1. Coerce the appropriate columns to class “character” using an as.*() function.
  2. Extract the ZIP code as a substring from variable output.
    1. Store these extracted data in a new column: zip
    2. Hint: Initialize this new column using geocoded$zip <- NA
  3. Using column output, detect which rows contain “EAST SYRACUSE, NY”.
    1. Save these rows in a new dataset: es_geocoded
  4. Initialize variable fips and paste variables state, county, and tract.
    1. Hint: Initialize this new column in geocoded using geocoded$fips <- NA
  5. Transform variable tract in geocoded to eliminate leading and trailing zeroes.
    1. Store these in a new variable, ct_abbr, by running geocoded$ct_abbr <- ...