The “stringr” (pronounced “stringer”) package is part of the tidyverse and helps with string manipulation. For background on strings, see chapter 14 of Hadley Wickham’s R for Data Science: https://r4ds.hadley.nz/strings. stringr is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations.
You can install the whole tidyverse, which includes stringr, or just the stringr package. The code below installs and loads the whole tidyverse.
install.packages("tidyverse")
library(tidyverse)
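If you only need the string functions, a minimal sketch of the stringr-only route looks like the following. The install line is commented out on the assumption that the package is already installed. Note that read_csv(), used later in this vignette, comes from readr, so loading stringr alone would not be enough to follow every step.

```r
# install.packages("stringr")  # run once if stringr is not installed yet
library(stringr)

# stringr functions are now available without attaching the rest of the tidyverse
upper <- str_to_upper("stringr")
upper
```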
To illustrate the stringr functions, let’s load a Kaggle dataset of fast food nutrition information: https://www.kaggle.com/datasets/tan5577/nutritonal-fast-food-dataset
food <- read_csv("https://raw.githubusercontent.com/jacshap/Data607/refs/heads/main/Tidyverse_FastFoodNutritionMenuV2.csv")
head(food)
## # A tibble: 6 × 14
## Company Item Calories `Calories from\nFat` `Total Fat\n(g)`
## <chr> <chr> <chr> <chr> <chr>
## 1 McDonald’s Hamburger 250 80 9
## 2 McDonald’s Cheeseburger 300 110 12
## 3 McDonald’s Double Cheeseburger 440 210 23
## 4 McDonald’s McDouble 390 170 19
## 5 McDonald’s Quarter Pounder® wi… 510 230 26
## 6 McDonald’s Double Quarter Poun… 740 380 42
## # ℹ 9 more variables: `Saturated Fat\n(g)` <chr>, `Trans Fat\n(g)` <chr>,
## # `Cholesterol\n(mg)` <chr>, `Sodium \n(mg)` <chr>, `Carbs\n(g)` <chr>,
## # `Fiber\n(g)` <chr>, `Sugars\n(g)` <chr>, `Protein\n(g)` <chr>,
## # `Weight Watchers\nPnts` <chr>
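stringr functions start with the prefix “str_” and are vectorized over character vectors.
We can check how many characters the first Item contains. Extracting the value with $ passes str_length() a character vector, which is the input stringr functions expect, rather than a one-cell tibble.
str_length(food$Item[1])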
## [1] 9
We get a length of 9, which matches the nine characters of “Hamburger” above.
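Because stringr functions are vectorized, the same call works on several strings at once. The sketch below uses a hand-typed vector holding the first four item names shown in the head(food) output, so it runs without downloading the dataset; the name items is just an illustrative placeholder.

```r
library(stringr)

# Hand-typed copy of the first four Item values (a stand-in for food$Item)
items <- c("Hamburger", "Cheeseburger", "Double Cheeseburger", "McDouble")

# One length per element; spaces count as characters
lens <- str_length(items)
lens
# 9 12 19 8
```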
We can count the lowercase vowels in each of the first 6 items. The pattern “[aeiou]” is case-sensitive, so uppercase vowels are not counted.
food6 <- head(food, 6)
str_count(food6$Item, "[aeiou]")
## [1] 3 5 8 3 10 13
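If uppercase vowels should count too, one option is to wrap the pattern in regex() with ignore_case = TRUE. This is a small sketch on a hand-typed vector; “Egg McMuffin” is an illustrative string chosen because it starts with an uppercase vowel, and it is not taken from the output above.

```r
library(stringr)

# Illustrative strings; the second one begins with an uppercase vowel
x <- c("Hamburger", "Egg McMuffin")

# Case-sensitive: only lowercase a, e, i, o, u are matched
lower_only <- str_count(x, "[aeiou]")

# Case-insensitive: regex() lets us switch off case sensitivity
any_case <- str_count(x, regex("[aeiou]", ignore_case = TRUE))

lower_only  # 3 2
any_case    # 3 3
```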
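It may also be interesting to see how often “Double” or “Triple” appears in the Item names of the 1148 observations in the data set. str_count() returns the number of matches in each string, so summing the result gives the total number of occurrences.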
str_count(food$Item, "Double") %>% sum()
## [1] 18
str_count(food$Item, "Triple") %>% sum()
## [1] 15
That gives 18 matches for “Double” and 15 for “Triple”, 33 in total across the 1148 items.
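Summing str_count() counts occurrences, so a name that contained the word twice would be counted twice. To count items instead, str_detect() returns TRUE or FALSE per string. The sketch below uses made-up item names (not rows from the dataset) to show where the two approaches can differ.

```r
library(stringr)

# Made-up item names for illustration; the third repeats "Double"
items <- c("Double Cheeseburger", "Triple Cheeseburger",
           "Double Double Burger", "Hamburger")

# Total occurrences of the word across all names
n_occurrences <- sum(str_count(items, "Double"))

# Number of names that contain the word at least once
n_items <- sum(str_detect(items, "Double"))

# Names containing either word, using regex alternation
n_either <- sum(str_detect(items, "Double|Triple"))

c(n_occurrences, n_items, n_either)
# 3 2 3
```

On the real data the two counts agree whenever no item name repeats the word.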
We can use str_replace() to replace the first match of a pattern in each string. Let’s replace McDonald with The Clown.
food6$Company
## [1] "McDonald’s" "McDonald’s" "McDonald’s" "McDonald’s" "McDonald’s"
## [6] "McDonald’s"
food6$Company %>% str_replace("McDonald", "The Clown")
## [1] "The Clown’s" "The Clown’s" "The Clown’s" "The Clown’s" "The Clown’s"
## [6] "The Clown’s"
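str_replace() only changes the first match in each string; str_replace_all() changes every match. A short sketch of the difference, using an illustrative string that contains “Mc” twice:

```r
library(stringr)

# Illustrative string with two occurrences of "Mc"
x <- "McDonald's McDouble"

first_only <- str_replace(x, "Mc", "Mac")      # replaces the first match only
every_one  <- str_replace_all(x, "Mc", "Mac")  # replaces all matches

first_only  # "MacDonald's McDouble"
every_one   # "MacDonald's MacDouble"
```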
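We can split strings based on patterns as well. Let’s split McDonald’s on Mc. The matched pattern is dropped from the result, and because it sits at the very start of the string, each result holds an empty string followed by Donald’s.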
food6$Company %>% str_split("Mc")
## [[1]]
## [1] "" "Donald’s"
##
## [[2]]
## [1] "" "Donald’s"
##
## [[3]]
## [1] "" "Donald’s"
##
## [[4]]
## [1] "" "Donald’s"
##
## [[5]]
## [1] "" "Donald’s"
##
## [[6]]
## [1] "" "Donald’s"
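If the empty first piece is not wanted, two simple options are to remove the prefix with str_remove() or to pick the second piece out of each split result. The sketch below uses a hand-typed vector with a straight apostrophe as a stand-in for food6$Company.

```r
library(stringr)

# Stand-in for food6$Company (hand-typed, straight apostrophe)
company <- c("McDonald's", "McDonald's")

# Option 1: strip a leading "Mc" instead of splitting on it
no_prefix <- str_remove(company, "^Mc")

# Option 2: split, then keep the second piece of every result
pieces <- str_split(company, "Mc")
second <- vapply(pieces, function(p) p[2], character(1))

no_prefix  # "Donald's" "Donald's"
second     # "Donald's" "Donald's"
```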
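We can also use regular expressions (regex) to split before capital letters. The regex “(?=[A-Z])” is a lookahead: it matches the empty position just before an uppercase letter without consuming it, so the letter stays at the start of the next piece. Because the string itself starts with an uppercase letter, the first piece is empty. With simplify = TRUE the result is a character matrix instead of a list.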
str_split(food6$Company, "(?=[A-Z])", simplify = TRUE)
## [,1] [,2] [,3]
## [1,] "" "Mc" "Donald’s"
## [2,] "" "Mc" "Donald’s"
## [3,] "" "Mc" "Donald’s"
## [4,] "" "Mc" "Donald’s"
## [5,] "" "Mc" "Donald’s"
## [6,] "" "Mc" "Donald’s"
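When only the non-empty pieces matter, one simple approach is to filter out empty strings after splitting. This is a sketch on a single hand-typed string rather than the dataset column:

```r
library(stringr)

# Split before each uppercase letter; [[1]] takes the result for the one input string
parts <- str_split("McDonald's", "(?=[A-Z])")[[1]]

# Drop any empty pieces, such as the one produced by a match at the start
parts <- parts[parts != ""]
parts
# "Mc" "Donald's"
```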
In this vignette we explored the stringr package and applied a few of its functions (str_length, str_count, str_replace, and str_split) to a sample fast food data set.