TidyVerse CREATE Vignette

Stringr Package within TidyVerse

The “stringr” (pronounced “stringer”) package is part of the tidyverse and helps with string manipulation. For background on strings, see chapter 14 of R for Data Science (Hadley Wickham et al.): https://r4ds.hadley.nz/strings. stringr is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations.

Installation

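You can install the whole tidyverse, which includes stringr, or install just the stringr package with install.packages("stringr"). The examples below load the full tidyverse because they also use read_csv() from readr.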

install.packages("tidyverse")
library(tidyverse)

Usage

To illustrate usage of stringr functions, let’s load a dataset. The dataset below is a Kaggle dataset with fast food nutrition information: https://www.kaggle.com/datasets/tan5577/nutritonal-fast-food-dataset

food <- read_csv("https://raw.githubusercontent.com/jacshap/Data607/refs/heads/main/Tidyverse_FastFoodNutritionMenuV2.csv")

head(food)

## # A tibble: 6 × 14
##   Company    Item                 Calories `Calories from\nFat` `Total Fat\n(g)`
##   <chr>      <chr>                <chr>    <chr>                <chr>           
## 1 McDonald’s Hamburger            250      80                   9               
## 2 McDonald’s Cheeseburger         300      110                  12              
## 3 McDonald’s Double Cheeseburger  440      210                  23              
## 4 McDonald’s McDouble             390      170                  19              
## 5 McDonald’s Quarter Pounder® wi… 510      230                  26              
## 6 McDonald’s Double Quarter Poun… 740      380                  42              
## # ℹ 9 more variables: `Saturated Fat\n(g)` <chr>, `Trans Fat\n(g)` <chr>,
## #   `Cholesterol\n(mg)` <chr>, `Sodium \n(mg)` <chr>, `Carbs\n(g)` <chr>,
## #   `Fiber\n(g)` <chr>, `Sugars\n(g)` <chr>, `Protein\n(g)` <chr>,
## #   `Weight Watchers\nPnts` <chr>

stringr functions start with the prefix “str_” and operate on vectors of strings, generally returning one result per input element.
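As a small illustration of that vectorized behavior, the sketch below applies two stringr functions to a short character vector. The `items` values are made up for the example and are not read from the data set.

```r
library(stringr)

# Hypothetical sample values, not taken from the CSV
items <- c("Hamburger", "Cheeseburger", "McDouble")

# Each function returns one result per input element
upper  <- str_to_upper(items)
prefix <- str_sub(items, 1, 3)  # first three characters of each string

print(upper)
print(prefix)
```

Because the functions are vectorized, no explicit loop over the elements is needed.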

str_length

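We can check the length of the first Item. Extracting the column as a character vector with food$Item[1] passes str_length a string rather than a one-cell tibble:

str_length(food$Item[1])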

## [1] 9

We get a length of 9, which matches the 9 characters of “Hamburger” above.
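str_length also accepts a whole vector, which makes it easy to find the longest entry. The following is a minimal sketch on a made-up `items` vector (an assumption for illustration); the same idea should carry over to food$Item.

```r
library(stringr)

# Hypothetical sample values, not taken from the CSV
items <- c("Hamburger", "Cheeseburger", "McDouble")

# One length per element
lens <- str_length(items)

# Pick the element with the largest length
longest <- items[which.max(lens)]

print(lens)
print(longest)
```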

str_count

We can count the lowercase vowels in the first 6 items. The pattern “[aeiou]” is case-sensitive, so uppercase vowels are not counted.

food6 <- head(food, 6)

str_count(food6$Item, "[aeiou]")

## [1]  3  5  8  3 10 13
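If uppercase vowels should be counted too, one option is to wrap the pattern in stringr's regex() helper with ignore_case = TRUE. The sketch below uses a single made-up item name (not read from the data set) to show the difference.

```r
library(stringr)

# Hypothetical item name chosen because it starts with an uppercase vowel
item <- "Egg McMuffin"

# Case-sensitive: only lowercase vowels (u, i) are counted
lower_only <- str_count(item, "[aeiou]")

# Case-insensitive: the leading "E" is counted as well
any_case <- str_count(item, regex("[aeiou]", ignore_case = TRUE))

print(lower_only)
print(any_case)
```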

It may also be interesting to see how many times “Double” and “Triple” appear in the Item names across the 1148 observations in the data set.

str_count(food$Item, "Double") %>% sum()

## [1] 18

str_count(food$Item, "Triple") %>% sum()

## [1] 15

“Double” appears 18 times and “Triple” 15 times, for 33 matches in total across the 1148 items.
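Note that str_count counts occurrences, so an item whose name contains the word twice contributes 2 to the sum. To count items instead, str_detect returns TRUE or FALSE per string. The sketch below contrasts the two on a made-up vector (the item names are assumptions for illustration).

```r
library(stringr)

# Hypothetical item names; the second one contains "Double" twice
items <- c("Double Cheeseburger", "Double Double", "Triple Stack", "Hamburger")

# Total number of occurrences of the word
occurrences <- sum(str_count(items, "Double"))

# Number of items that contain the word at least once
n_items <- sum(str_detect(items, "Double"))

print(occurrences)
print(n_items)
```

On data where no name repeats the word, both approaches give the same total.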

str_replace

We can use str_replace to replace the first match of a pattern in each string. Let’s replace “McDonald” with “The Clown”.

food6$Company

## [1] "McDonald’s" "McDonald’s" "McDonald’s" "McDonald’s" "McDonald’s"
## [6] "McDonald’s"

food6$Company %>% str_replace("McDonald", "The Clown")

## [1] "The Clown’s" "The Clown’s" "The Clown’s" "The Clown’s" "The Clown’s"
## [6] "The Clown’s"
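str_replace only changes the first match in each string. When a pattern can occur more than once, str_replace_all replaces every match. A minimal sketch on a made-up string (not from the data set):

```r
library(stringr)

# Hypothetical string with two occurrences of "Mc"
x <- "McDouble and McChicken"

# Only the first "Mc" is replaced
first_only <- str_replace(x, "Mc", "Big")

# Every "Mc" is replaced
all_matches <- str_replace_all(x, "Mc", "Big")

print(first_only)
print(all_matches)
```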

str_split

We can split strings based on patterns as well. Let’s split “McDonald’s” on “Mc”. The pattern itself is removed, leaving an empty string before it and “Donald’s” after it.

food6$Company %>% str_split("Mc")

## [[1]]
## [1] ""         "Donald’s"
## 
## [[2]]
## [1] ""         "Donald’s"
## 
## [[3]]
## [1] ""         "Donald’s"
## 
## [[4]]
## [1] ""         "Donald’s"
## 
## [[5]]
## [1] ""         "Donald’s"
## 
## [[6]]
## [1] ""         "Donald’s"
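str_split returns a list with one character vector per input string, and a match at the very start of the string yields a leading empty piece. The sketch below shows two ways one might get just the remainder: filtering out empty pieces, or removing the prefix with str_remove. The `company` value is typed in by hand with a plain apostrophe rather than read from the data set.

```r
library(stringr)

# Hand-typed example value (plain ASCII apostrophe)
company <- "McDonald's"

# Split, then take the pieces for the first (and only) string
pieces <- str_split(company, "Mc")[[1]]

# Drop the empty piece produced by the match at the start
non_empty <- pieces[pieces != ""]

# Alternative: remove the leading "Mc" directly
trimmed <- str_remove(company, "^Mc")

print(non_empty)
print(trimmed)
```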

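We can also use regular expressions (regex) to split on capital letters. The regex “(?=[A-Z])” is a lookahead: it matches the position just before each uppercase letter without consuming it, so the letter stays at the start of the next piece. Because the string itself starts with an uppercase letter, the first piece is empty. With simplify = TRUE, the result is returned as a character matrix instead of a list.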

str_split(food6$Company, "(?=[A-Z])", simplify = TRUE)

##      [,1] [,2] [,3]      
## [1,] ""   "Mc" "Donald’s"
## [2,] ""   "Mc" "Donald’s"
## [3,] ""   "Mc" "Donald’s"
## [4,] ""   "Mc" "Donald’s"
## [5,] ""   "Mc" "Donald’s"
## [6,] ""   "Mc" "Donald’s"
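An alternative that avoids the empty first piece is to extract the pieces rather than split on a position. The sketch below uses str_extract_all with the pattern “[A-Z][^A-Z]*” (an uppercase letter followed by any run of non-uppercase characters). This pattern is one possible choice for the example, not the only one, and the `company` value is typed in by hand.

```r
library(stringr)

# Hand-typed example value (plain ASCII apostrophe)
company <- "McDonald's"

# Each match starts at an uppercase letter and runs until the next one
parts <- str_extract_all(company, "[A-Z][^A-Z]*")[[1]]

print(parts)
```

Unlike the lookahead split, this drops any text that appears before the first uppercase letter, so it suits strings that begin with a capital.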

Summary

In this vignette we explored the stringr package and applied str_length, str_count, str_replace, and str_split to a sample fast food data set.