Introduction

Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparation tasks. The stringr package provides a cohesive set of functions designed to make working with strings as easy as possible.

Package stringr is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations. stringr focuses on the most important and commonly used string manipulation functions whereas stringi provides a comprehensive set covering almost anything you can imagine. If you find that stringr is missing a function that you need, try looking in stringi.

For a detailed overview of stringr visit https://stringr.tidyverse.org/.


Getting started

To get started, load packages tidyverse, tidytext, and wordcloud. Package stringr will automatically be loaded when you load package tidyverse. Install any packages with install.packages("package_to_install").
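As a minimal sketch of the setup described above (assuming all three packages are already installed; the commented-out install line only needs to be run once):

```r
# install.packages(c("tidyverse", "tidytext", "wordcloud"))  # run once if needed
library(tidyverse)
library(tidytext)
library(wordcloud)

# stringr is part of the core tidyverse, so it is now attached as well
"package:stringr" %in% search()
```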

String manipulation

In the following tasks, use functions available in package stringr. Reference the stringr RStudio cheat sheet available at https://www.rstudio.com/resources/cheatsheets/.

Functions in stringr

  • are structured as str_*(), where * gives a hint as to the function’s purpose,

  • mainly have the first argument as string: either a character vector, or something coercible to one,

  • have subsequent arguments that are function dependent, but a common argument is pattern: a pattern to look for with the default being a regular expression,

  • are vectorized.
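The conventions listed above can be seen in a short sketch. The vector fruits is made-up illustration data, not part of the tasks that follow.

```r
library(stringr)

# Illustrative data only (hypothetical, not taken from the tasks below).
fruits <- c("apple", "banana", "kiwi")

# The first argument is the string; the function is applied to every element.
lens  <- str_length(fruits)       # one length per element
upper <- str_to_upper(fruits)     # one upper-cased string per element

# `pattern` is interpreted as a regular expression by default.
has_an   <- str_detect(fruits, "an")   # TRUE where "an" occurs
starts_k <- str_detect(fruits, "^k")   # "^" anchors the match at the start

# Wrap the pattern in fixed() to match it literally instead.
literal_dot <- str_detect(c("a.b", "acb"), fixed("."))

lens; upper; has_an; starts_k; literal_dot
```

Each call returns one result per input element, which is what "vectorized" means in the list above.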

Task 1

Determine the length of each string.

  1. “coffee”
  2. “what is the character count here?”
  3. “whatisthecharactercounthere?”

Task 2

Determine the length of each string in each character vector.

  1. c("coffee", "tea", "whiskey", "water")
  2. c("a", "ab", "abc", "abcd")
  3. c("789", "pi", "e", "0")

Task 3

Extract a substring from phrase.

phrase <- "extract a substring from this phrase"

Task 4

Extract the first two letters from each word in presidents.

presidents <- c("Clinton", "Bush", "Reagan", "Carter")

Task 5

Extract the last two letters from each word in presidents.

Task 6

Split big.cats at each comma.

big.cats <- "lion, tiger, jaguar, cougar, leopard, snow leopard, cheetah"

Task 7

What structure was returned to you in Task 6? Unlist it.

Task 8

Replace each “a” in big.cats with “A”.

Task 9

Replace the first “a” in big.cats with “A”.

Task 10

Replace every vowel in big.cats with an “@” symbol. Hint: use a regexp.
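As a hedged illustration of the regexp hint, here is a sketch of a character class on an unrelated, made-up string (id is hypothetical and not part of the tasks). A class such as [0-9] matches any single character listed inside the brackets.

```r
library(stringr)

# Unrelated example string, chosen only to illustrate the idea.
id <- "order-2041-B7"

# [0-9] is a character class: it matches any single digit.
all_digits  <- str_replace_all(id, "[0-9]", "#")  # every digit replaced
first_digit <- str_replace(id, "[0-9]", "#")      # only the first digit replaced

all_digits   # "order-####-B#"
first_digit  # "order-#041-B7"
```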

Task 11

Extract every occurrence of the words “fruit” or “flies” from phrases.

phrases <-  c("time flies when you're having fun in 191",
              "fruit flies when you throw it",
              "a fruit fly is a beautiful creature",
              "how do you spell fruitfly?")

Task 12

Tongue twister: Something in a 30 acre thermal thicket of thorns and thistles thumped and thundered threatening the 3-D thoughts of Matthew the thug - although, theatrically, it was only the 13000 thistles and thorns through the underneath of his thigh that the 30 year old thug thought of that morning.

Extract the numeric values from the tongue twister above. Unlist the resulting object.

twister <- paste("Something in a 30 acre thermal thicket of thorns and",
                 "thistles thumped and thundered threatening the 3-D",
                 "thoughts of Matthew the thug - although, theatrically,",
                 "it was only the 13000 thistles and thorns through the",
                 "underneath of his thigh that the 30 year old thug",
                 "thought of that morning.", sep = " ")

Word cloud and sentiment analysis

Create a word cloud or perform a sentiment analysis on one of the three documents below. Make use of the functions in package tidytext.
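One possible starting point is sketched below. It assumes the text has been read into a data frame with one row per line; doc is a hypothetical stand-in, to be replaced with the document you choose.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in text; replace with the document you chose.
doc <- tibble(line = 1:2,
              text = c("The quick brown fox jumps over the lazy dog.",
                       "The dog sleeps; the fox runs on."))

word_counts <- doc %>%
  unnest_tokens(word, text) %>%            # one lower-cased word per row
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(word, sort = TRUE)                 # word frequencies, largest first

word_counts

# A word cloud could then be drawn with, for example,
#   wordcloud::wordcloud(word_counts$word, word_counts$n, min.freq = 1)
# and a sentiment analysis could start from a lexicon join such as
#   inner_join(word_counts, get_sentiments("bing"), by = "word")
```

The tidy one-word-per-row format is what lets the same counts feed either a word cloud or a sentiment lexicon join.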

Abraham Lincoln
Gettysburg Address
November 19, 1863, Address Delivered at the Dedication of the Cemetery at Gettysburg

https://www.d.umn.edu/~rmaclin/gettysburg-address.html

Dr. Martin Luther King Jr.
“I Have a Dream” speech
August 28, 1963, Lincoln Memorial in Washington, D.C.

http://www.analytictech.com/mb021/mlk.htm

Theodore J. Kaczynski aka Unabomber
Manifesto
September 19, 1995, The Washington Post

https://www.josharcher.uk/static/files/2018/01/Industrial_Society_and_Its_Future-Ted_Kaczynski.txt