In this document, I’ll demonstrate some of the capabilities of the Tidyverse package “stringr.” I’ll do this by working on a dataset with information about movies from 2007 - 2009 (source: https://www.kaggle.com/datasets/sujaykapadnis/hollywood-hits-and-flops-2007-2023).
The stringr package has a number of functions designed to ease our work with strings. This cheatsheet is incredibly helpful: https://github.com/rstudio/cheatsheets/blob/main/strings.pdf
As outlined in the cheatsheet, these functions fall into 7 categories:
1. Detect matches
2. Subset strings
3. Manage lengths
4. Mutate strings
5. Join and split
6. Order strings
7. Helpers
I’ll load the dataset, using a different Tidyverse package (readr) to read the three yearly files before binding them into one:
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ purrr 1.0.1
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
movies07 = read_csv("https://raw.githubusercontent.com/gsteinmetzsilber/DATA607--Tidyverse/main/The%20Hollywood%20Inider%20-%20all%20data%20-%202007.csv")
## Rows: 90 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): Film, Script Type, Primary Genre, Genre, of Gross earned abroad, B...
## dbl (18): Year, Rotten Tomatoes critics, Metacritic critics, Average criti...
## lgl (4): None, Distributor, IMDb Rating, IMDB vs RT disparity
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
movies08 = read_csv("https://raw.githubusercontent.com/gsteinmetzsilber/DATA607--Tidyverse/main/The%20Hollywood%20Inider%20-%20all%20data%20-%202008.csv")
## Rows: 147 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Film, Script Type, Genre, of Gross earned abroad, Budget recovered...
## dbl (19): Year, Rotten Tomatoes critics, Metacritic critics, Average criti...
## lgl (4): Primary Genre, Distributor, IMDb Rating, IMDB vs RT disparity
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
movies09 = read_csv("https://raw.githubusercontent.com/gsteinmetzsilber/DATA607--Tidyverse/main/The%20Hollywood%20Inider%20-%20all%20data%20-%202009.csv")
## Rows: 136 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Film, Script Type, Genre, of Gross earned abroad, Budget recovered...
## dbl (18): Year, Rotten Tomatoes critics, Metacritic critics, Average criti...
## num (1): Foreign Gross ($million)
## lgl (4): Primary Genre, Distributor, IMDb Rating, IMDB vs RT disparity
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
movies = rbind(movies07, movies08, movies09)
head(movies)
## # A tibble: 6 × 33
## Film Year `Script Type` Rotten Tomatoes cri…¹ `Metacritic critics`
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 300 2007 adaptation 60 51
## 2 3:10 to Yuma 2007 remake 88 76
## 3 30 Days of N… 2007 adaptation 50 53
## 4 Across the U… 2007 original scr… 54 56
## 5 Alien vs. Pr… 2007 sequel 14 29
## 6 Alvin and th… 2007 adaptation 26 39
## # ℹ abbreviated name: ¹`Rotten Tomatoes critics`
## # ℹ 28 more variables: `Average critics` <dbl>,
## # `Rotten Tomatoes Audience` <dbl>, `Metacritic Audience` <dbl>,
## # `Rotten Tomatoes vs Metacritic deviance` <dbl>, `Average audience` <dbl>,
## # `Audience vs Critics deviance` <dbl>, `Primary Genre` <chr>, Genre <chr>,
## # `Opening Weekend` <dbl>, `Opening weekend ($million)` <dbl>,
## # `Domestic Gross` <dbl>, `Domestic gross ($million)` <dbl>, …
Now we have this dataset of 373 movies. A few columns are strings, namely:
names(movies)[sapply(movies, is.character)]
## [1] "Film" "Script Type"
## [3] "Primary Genre" "Genre"
## [5] "of Gross earned abroad" "Budget recovered"
## [7] "Budget recovered opening weekend" "Oscar Winners"
## [9] "Oscar Detail" "Link"
## [11] "Release Date (US)"
stringr is potentially helpful with any of these columns.
The first category of stringr functions I’ll discuss is functions that help detect matches. str_detect is aptly named; it checks whether each string contains a pattern:
library(stringr)
str_detect(movies$Film, "Christmas")
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [349] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [361] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [373] FALSE
This is neither aesthetically pleasing nor helpful. We can wrap the call in sum() to figure out how many of the movies in the dataset have Christmas in the title:
sum(str_detect(movies$Film, "Christmas"))
## [1] 2
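Here is a minimal sketch of the same idea on a small, made-up vector of titles (these titles are my own illustration, not rows from the dataset). Because str_detect returns a logical vector, it can be summed, averaged, or used as an index:

```r
library(stringr)

# Hypothetical titles, made up for illustration (not taken from the dataset)
titles <- c("A Christmas Carol", "Iron Man", "Four Christmases", "Up")

has_christmas <- str_detect(titles, "Christmas")

sum(has_christmas)               # number of matching titles
mean(has_christmas)              # share of matching titles
titles[has_christmas]            # the matching titles themselves
str_subset(titles, "Christmas")  # same result as the line above
```

Indexing with the logical vector (or calling str_subset) is usually more readable than scanning a long printout of TRUEs and FALSEs.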
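On a similar note, str_starts and str_ends check whether a string has a certain pattern at its beginning or end (respectively).
Let’s see how many movies start with the word “I.” Note, I’ll use regex here (and it’s generally extremely useful with stringr). The pattern "I\\s" matches an “I” followed by a whitespace character, so titles that merely begin with the letter I are not counted: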
sum(str_starts(movies$Film, "I\\s"))
## [1] 4
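To see why the whitespace matters, here is a small sketch on made-up titles (again my own examples, not dataset rows). It also shows one possible alternative, the word boundary "\\b", which is my suggestion rather than something the analysis above relies on:

```r
library(stringr)

# Hypothetical titles, made up for illustration
titles <- c("I Am Legend", "Iron Man", "I, Robot", "Juno")

str_starts(titles, "I")     # TRUE TRUE TRUE FALSE: any title beginning with the letter I
str_starts(titles, "I\\s")  # TRUE FALSE FALSE FALSE: "I" followed by whitespace
str_starts(titles, "I\\b")  # TRUE FALSE TRUE FALSE: "I" as a whole word, even before punctuation
str_ends(titles, "Man")     # FALSE TRUE FALSE FALSE
```

Note that "I\\s" would miss a title where the word “I” is followed by punctuation rather than a space, so the count above is best read as a lower bound.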
The functions thus far have returned logical vectors, which we can sum to count the TRUEs (and that may or may not give interesting results, depending on what one considers interesting). str_which instead returns the indices of the matching strings. For example, let’s see which rows have films with the word “Love” in the title:
str_which(movies$Film, "Love")
## [1] 48 130 273 301 346 347
Now, I don’t mind the occasional romance movie, but suppose someone hated romance movies and couldn’t even stand having them in the dataset. Well then having the indices of these “Love” movies is incredibly useful:
indices = str_which(movies$Film, "Love")
movies = movies[-indices, ]
And just to prove it worked:
str_which(movies$Film, "Love")
## integer(0)
There are now no movies in the dataset with “Love” in the title. Bitter people might now be much happier, so maybe this dataset is better from a utilitarian perspective.
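One caveat with dropping rows by negative index: if the pattern matches nothing, str_which returns integer(0), and indexing with -integer(0) selects zero rows rather than all of them. A hedged sketch on a tiny made-up data frame (the films data frame below is hypothetical) shows the pitfall and a safer alternative based on str_detect:

```r
library(stringr)

# Hypothetical data frame, made up for illustration
films <- data.frame(Film = c("Love Actually", "Up", "Juno"))

# Pitfall: no title contains "Zombie", so str_which() returns integer(0)
no_match <- str_which(films$Film, "Zombie")
nrow(films[-no_match, , drop = FALSE])  # 0 rows: everything was dropped

# Safer: negate a logical vector, which works whether or not anything matches
kept <- films[!str_detect(films$Film, "Love"), , drop = FALSE]
nrow(kept)                              # 2 rows remain

all_kept <- films[!str_detect(films$Film, "Zombie"), , drop = FALSE]
nrow(all_kept)                          # 3 rows: nothing matched, nothing dropped
```

In the analysis above the pattern did match, so the negative index behaved as intended; the logical version just guards against the empty case.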
Let’s wrap up this section by seeing how common digits are in titles. In particular, I want to get a sense of how many digits there are in each title. We can use str_count to this end:
str_count(movies$Film, "\\d")
## [1] 3 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 5 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 2 1 0 1 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0
## [260] 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2
## [334] 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 4 0 1 0 0 0 0 0 0 0 3 0 0 0 0 0 0
This gives a nice, albeit slightly dizzying, overview of the matter at hand. Most titles don’t have any digits, a bunch have 1 or 2, and then a few have more than 2. This concludes my overview of stringr’s functions that detect matches. I now move on to another of stringr’s capabilities:
In the genre column, there are sometimes multiple genres listed.
count = str_count(movies$Genre, ",")
count+1
## [1] 2 1 1 1 2 2 2 1 1 2 2 2 2 2 2 1 2 2 2 1 1 1 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1
## [38] 1 2 1 2 1 2 2 1 1 2 2 2 1 1 2 1 1 1 1 1 2 2 2 1 1 1 2 2 2 1 2 1 1 1 2 1 1
## [75] 1 2 1 1 1 1 2 1 1 2 1 1 2 1 1 3 2 3 3 3 3 3 3 3 2 3 2 3 3 2 3 3 3 2 3 2 3
## [112] 3 3 3 3 3 2 2 2 3 3 3 2 2 2 3 1 3 2 3 2 3 2 2 3 3 2 1 2 3 3 3 3 3 3 2 1 2
## [149] 2 3 3 2 3 2 2 3 2 2 2 2 2 3 2 2 2 1 1 2 2 2 3 2 2 2 3 1 2 3 2 1 2 3 1 2 2
## [186] 2 3 2 2 2 2 1 2 1 3 2 2 2 1 2 3 3 2 1 1 2 3 3 2 2 2 2 1 2 3 2 1 3 3 1 2 3
## [223] 1 3 3 3 2 2 3 2 2 2 2 1 1 3 2 2 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 2 3
## [260] 3 2 2 2 3 1 3 2 2 3 3 3 3 2 1 2 1 3 2 3 2 2 2 3 3 3 2 3 2 3 3 3 1 3 3 2 3
## [297] 2 2 2 2 3 2 2 2 3 2 1 3 3 3 3 2 2 2 2 2 2 1 3 2 2 2 1 1 3 3 2 2 2 1 2 3 2
## [334] 3 3 1 1 2 2 1 3 2 2 2 2 3 1 1 2 2 1 3 2 3 3 2 2 3 2 1 2 3 1 1 1 3 2
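A long vector of counts like this is easier to read once it is tabulated. Here is a minimal sketch on a made-up vector of genre strings (the values are hypothetical and only mimic the comma-separated layout of the Genre column):

```r
library(stringr)

# Hypothetical genre strings, made up for illustration
genres <- c("action, adventure", "comedy", "drama, romance, war", "horror")

# Number of genres = number of commas + 1
n_genres <- str_count(genres, ",") + 1
n_genres

# Frequency table: how many strings list 1, 2 or 3 genres
table(n_genres)
```

The same table() call applied to count + 1 would summarise the real column in one line.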
In fact, a bunch of movies have two or three genres. Let’s say (and I don’t know that this is true) that the first listed genre is the primary genre. Let’s create a column with that primary genre. Note: there already is a column meant to list the primary genres. But it’s largely filled with missing values.
We essentially want to subset that first genre from the list of genres. We can use str_sub to this end. For this function, we need the start and end indices for the part that we want to subset. Now, we can just use 1 for the start. But for the end, we will have to use the str_locate function to locate the place of that first comma.
movies$Actual_Primary_Genre <- ifelse(
  str_detect(movies$Genre, ","),
  # str_locate() returns a matrix with start and end columns; take the start of the first comma, -1 to not include the comma
  str_sub(movies$Genre, start = 1, end = str_locate(movies$Genre, ",")[, "start"] - 1),
  movies$Genre # if there are no commas, then Actual_Primary_Genre should just be the same as Genre
)
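To make the moving parts visible, here is a small sketch of what str_locate returns and how it feeds str_sub, using made-up genre strings (hypothetical values, not dataset rows). str_locate gives a two-column matrix (start, end) with NA where there is no match:

```r
library(stringr)

# Hypothetical genre strings, made up for illustration
genres <- c("action, adventure", "comedy")

loc <- str_locate(genres, ",")
loc                          # matrix with columns "start" and "end"; NA row for "comedy"

comma_pos <- loc[, "start"]  # position of the first comma, or NA

# Take everything before the first comma; fall back to the whole string if there is none
first_genre <- ifelse(
  is.na(comma_pos),
  genres,
  str_sub(genres, start = 1, end = comma_pos - 1)
)
first_genre
```

Pulling out the "start" column explicitly keeps the end argument a plain vector with one position per string.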
Let’s also delete the old primary genre column:
movies = movies %>%
select(-"Primary Genre")
We could have also used str_extract to figure out the first genre in that column. In fact, it would have been much easier than str_sub. str_extract, well, extracts the first match in each string. Here my pattern will be everything before a comma or everything other than a comma before the end of the string (in case there’s no comma):
p_genres_again = str_extract(movies$Genre, "^[^,]+(?=[,])|^[^,]+$")
head(p_genres_again, 10)
## [1] "period" "western" "horror" "musical" "sci-fi" "family"
## [7] "crime" "animation" "animation" "sports"
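As a side note, the pattern can probably be shortened. Since "[^,]+" already stops at the first comma or at the end of the string, anchoring it at the start should be enough. The sketch below checks that on a few made-up strings (hypothetical values); I have not verified it against every row of the real column:

```r
library(stringr)

# Hypothetical genre strings, made up for illustration
genres <- c("action, adventure", "comedy", "drama, romance, war")

original <- str_extract(genres, "^[^,]+(?=[,])|^[^,]+$")
simpler  <- str_extract(genres, "^[^,]+")

original
identical(original, simpler)  # TRUE for these examples
```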
I’ll show one final stringr function, str_match, which returns the first match in each string as a matrix. Let’s say we didn’t have a year column. We do have a release date column, though, so we can leverage that to extract the year:
years = str_match(movies$`Release Date (US)`, "(?<=[,]\\s)20\\d\\d")
movies$again_year = years[, 1] # str_match returns a matrix; keep its first column (the full match) as a plain vector
And just to confirm that this did a pretty good job:
years_same = movies %>%
filter(Year == again_year) %>%
nrow()
nrow(movies) - years_same
## [1] 2
100 * years_same / nrow(movies)
## [1] 99.45504
OK, so something went slightly wrong; for 2 movies, the year extracted from the US release date doesn’t match the Year column. But that’s a pretty good result; we got 99.5% of the years right just by quickly matching a pattern in the US release date (which might not even have been the first release date).
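Here is a small sketch of how str_match differs from str_extract, on made-up dates. The "Mon D, YYYY" layout is my assumption, inferred from the pattern used above. With a capture group, str_match returns the full match in the first column and the group in the second:

```r
library(stringr)

# Hypothetical release dates; the "Mon D, YYYY" layout is an assumption
dates <- c("Mar 9, 2007", "Dec 25, 2008", "unknown")

m <- str_match(dates, ",\\s(20\\d\\d)")
m                                # column 1: full match (", 2007"); column 2: capture group ("2007")

year_chr <- m[, 2]               # plain character vector; NA where nothing matched
year_num <- as.integer(year_chr) # numeric years

# str_extract with a lookbehind gives the same years as a character vector
str_extract(dates, "(?<=,\\s)20\\d\\d")
```

Converting to integer makes the extracted year directly comparable with a numeric Year column.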
In this exercise, I took a look at the tidyverse package “stringr” and some of its capabilities. I worked with a dataset of 2007-2009 movies and focused on using stringr to detect matches and subset strings.
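This work is ready to be extended. As a reminder, I covered two of stringr’s capabilities (detecting matches and subsetting strings); the other ones are managing lengths, mutating strings, joining and splitting, ordering strings, and helpers.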
I think managing lengths, and joining and splitting are particularly interesting.
And as one more reminder to the extender, the cheatsheet I linked to in the introduction will be helpful.
Using str_extract to extract the release day for each movie (the first run of digits in the US release date), then adding it to a new column:
movies_extended <- movies %>%
mutate(release_day = str_extract(`Release Date (US)`, "[0-9]+"))
head(movies_extended)
## # A tibble: 6 × 35
## Film Year `Script Type` Rotten Tomatoes cri…¹ `Metacritic critics`
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 300 2007 adaptation 60 51
## 2 3:10 to Yuma 2007 remake 88 76
## 3 30 Days of N… 2007 adaptation 50 53
## 4 Across the U… 2007 original scr… 54 56
## 5 Alien vs. Pr… 2007 sequel 14 29
## 6 Alvin and th… 2007 adaptation 26 39
## # ℹ abbreviated name: ¹`Rotten Tomatoes critics`
## # ℹ 30 more variables: `Average critics` <dbl>,
## # `Rotten Tomatoes Audience` <dbl>, `Metacritic Audience` <dbl>,
## # `Rotten Tomatoes vs Metacritic deviance` <dbl>, `Average audience` <dbl>,
## # `Audience vs Critics deviance` <dbl>, Genre <chr>, `Opening Weekend` <dbl>,
## # `Opening weekend ($million)` <dbl>, `Domestic Gross` <dbl>,
## # `Domestic gross ($million)` <dbl>, `Foreign Gross ($million)` <dbl>, …
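One thing to keep in mind is that "[0-9]+" simply grabs the first run of digits, so it returns the day only when the day comes before the year. The sketch below uses made-up dates in an assumed "Mon D, YYYY" layout and shows a slightly more explicit pattern (digits immediately before a comma) as a possible alternative:

```r
library(stringr)

# Hypothetical release dates; the "Mon D, YYYY" layout is an assumption
dates <- c("Mar 9, 2007", "Dec 25, 2008")

# First run of digits: the day, given this layout
str_extract(dates, "[0-9]+")

# More explicit: one or two digits directly followed by a comma, converted to integer
release_day <- as.integer(str_extract(dates, "\\d{1,2}(?=,)"))
release_day

# With a different layout the first pattern would pick up the year instead
str_extract("2007-03-09", "[0-9]+")
```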