Tidyverse: using stringr, dplyr, and tibble to clean up catch phrases

Cleaning up catch phrases from classic movies

Source: https://www.kaggle.com/thomaskonstantin/150-famous-movie-catchphrases-with-context?select=Catchphrase.csv

I chose a text-only dataset to demonstrate stringr's string manipulation functions. Most examples also store the data in a tibble and use dplyr's data management functions.

# Attach the tidyverse (readr, dplyr, stringr, and tibble are all part of it)
library(tidyverse)

# Load the data as a tibble directly from GitHub and rename the columns to snake_case
catch_phrases <- read_csv("https://gist.githubusercontent.com/mattlucich/afc4b9c362e303c1f6ba8880877f0b60/raw/a08a96361705b00c4ee8f4ec0b3d324f864ae419/catchphrase.csv") %>%
  rename(catchphrase = Catchphrase,
         movie_name = `Movie Name`,
         context = Context)

## Parsed with column specification:
## cols(
##   Catchphrase = col_character(),
##   `Movie Name` = col_character(),
##   Context = col_character()
## )
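
The column specification message above can be avoided by stating the column types up front. The following is a minimal sketch, not the author's code: it writes a hypothetical two-row sample (with placeholder context text) to a temporary file so that it runs without network access, then applies the same rename() step.

```r
library(readr)
library(dplyr)

# Hypothetical two-row sample standing in for the real CSV
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Catchphrase,Movie Name,Context",
  "It's showtime!,BEETLEJUICE,placeholder context",
  "They're heeeere!,POLTERGEIST,placeholder context"
), tmp)

# Declaring every column as character means readr has nothing to guess
sample_phrases <- read_csv(tmp, col_types = cols(.default = col_character())) %>%
  rename(catchphrase = Catchphrase,
         movie_name = `Movie Name`,
         context = Context)

names(sample_phrases)
```

Backticks are needed around `Movie Name` because the original column name contains a space.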

1: How do I remove repeated extraneous characters from my data?

Answer: Use stringr's str_replace_all() function, which replaces every match of a character or pattern with the replacement you supply. Supplying an empty string removes the match.

# Remove extraneous line breaks
catch_phrases <- catch_phrases %>% mutate(catchphrase = str_replace_all(catchphrase, "\n", ""))

head(catch_phrases)

## # A tibble: 6 x 3
##   catchphrase                         movie_name     context                    
##   <chr>                               <chr>          <chr>                      
## 1 Beetlejuice, Beetlejuice, Beetleju… BEETLEJUICE    Lydia, summoning Beetlejui…
## 2 It's showtime!                      BEETLEJUICE    Beetlejuice, being summone…
## 3 They're heeeere!                    POLTERGEIST    Carol Anne Freeling, notif…
## 4 Hey you guys!                       THE GOONIES    Sloth, calling the attenti…
## 5 Good morning, Vietnam!              GOOD MORNING,… Adrian Cronauer's greeting…
## 6 I love the smell of napalm in the … APOCALYPSE NOW Lt. Col. Bill Kilgore, des…
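
One thing to watch for: replacing a line break with an empty string joins the words on either side of it. The sketch below uses hypothetical toy strings (not rows from the dataset) to show an alternative that replaces the break with a space and then tidies the whitespace with str_squish().

```r
library(stringr)

# Hypothetical toy strings, not taken from the dataset
raw <- c("Here's\nJohnny!", "  I'll   be\nback.  ")

# Replace each line break with a space so adjacent words stay separated
spaced <- str_replace_all(raw, "\n", " ")

# Trim both ends and collapse internal runs of whitespace to a single space
clean <- str_squish(spaced)

clean
```

Which replacement is appropriate depends on whether the line breaks in your data fall between words or in the middle of them.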

2: How do I convert uppercase text to title case?

Answer: Call stringr's str_to_title() inside dplyr's mutate(). mutate() applies the conversion to the whole movie_name column and overwrites the previous values.

# Use stringr and dplyr to convert all of the movie names to title case
catch_phrases <- catch_phrases %>% mutate(movie_name = str_to_title(movie_name))

head(catch_phrases)

## # A tibble: 6 x 3
##   catchphrase                         movie_name     context                    
##   <chr>                               <chr>          <chr>                      
## 1 Beetlejuice, Beetlejuice, Beetleju… Beetlejuice    Lydia, summoning Beetlejui…
## 2 It's showtime!                      Beetlejuice    Beetlejuice, being summone…
## 3 They're heeeere!                    Poltergeist    Carol Anne Freeling, notif…
## 4 Hey you guys!                       The Goonies    Sloth, calling the attenti…
## 5 Good morning, Vietnam!              Good Morning,… Adrian Cronauer's greeting…
## 6 I love the smell of napalm in the … Apocalypse Now Lt. Col. Bill Kilgore, des…
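
stringr has sibling functions for other cases. As a small sketch on a hypothetical two-row tibble, the code below adds title-case and lower-case versions as new columns instead of overwriting the original, which makes the results easy to compare side by side.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy tibble with upper-case movie names
movies <- tibble(movie_name = c("THE GOONIES", "APOCALYPSE NOW"))

# Add converted copies next to the original column
movies <- movies %>%
  mutate(
    title_case = str_to_title(movie_name),
    lower_case = str_to_lower(movie_name)
  )

movies
```

Note that str_to_title() capitalizes every word, including short words such as "of" or "the" that a style guide might keep in lower case.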

3: How do I filter for rows in a tibble containing certain characters/patterns?

Answer: Use stringr's str_detect() to detect the character/pattern of interest, then use dplyr's filter() to return only rows where the str_detect() condition is true.

# Filter for the high energy quotes (i.e. ones with exclamation points)
exclamation_points <- catch_phrases %>% filter(str_detect(catchphrase, '!'))

# Proportion of quotes that have exclamation points
nrow(exclamation_points) / nrow(catch_phrases)

## [1] 0.52
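
Two related points are worth a short sketch on hypothetical toy strings. First, because str_detect() returns a logical vector, taking its mean gives the proportion of matches directly. Second, the pattern is interpreted as a regular expression by default, so characters such as "?" need to be wrapped in fixed() (or escaped) to be matched literally.

```r
library(stringr)

# Hypothetical toy strings, not taken from the dataset
quotes <- c("Alrighty then!", "As if!", "You talkin' to me?", "I'll be back.")

# The mean of a logical vector is the share of TRUE values
prop_exclaim <- mean(str_detect(quotes, "!"))

# "?" is a regex metacharacter, so match it literally with fixed()
has_question <- str_detect(quotes, fixed("?"))

prop_exclaim
has_question
```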

4: How do I count how many matches a string has with a particular character/pattern (and filter out tibble rows with zero matches)?

Answer: Use stringr's str_count() to count the number of matches for the character/pattern of interest. Use dplyr's mutate() to store the counts in a new column (exc_count). Then use dplyr's filter() to return only rows with at least one match, select() to return only the columns of interest, and arrange() with desc() to sort the rows in descending order by match count.

# Count exclamation points per quote, keep quotes with at least one, and sort by count
catch_phrases %>%
  mutate(exc_count = str_count(catchphrase, '!')) %>%
  filter(exc_count > 0) %>%
  select(catchphrase, exc_count) %>%
  arrange(desc(exc_count))

## # A tibble: 78 x 2
##    catchphrase                                                         exc_count
##    <chr>                                                                   <int>
##  1 It's alive! It's alive! IT'S ALIVE!!!                                       5
##  2 Gooble gobble, gooble gobble. We accept her! We accept her!. Goobl…         4
##  3 Hey Stella! HEY STELLA!!!                                                   4
##  4 Get away from her, you BITCH!!!                                             3
##  5 SHOW ME THE MONEY!!!                                                        3
##  6 HOLY SCHNIKES!!!                                                            3
##  7 INCONCEIVABLE!!!                                                            3
##  8 THE POWER OF CHRIST COMPELS YOU!!!                                          3
##  9 Hey! I'm walking here! I'm walking here!                                    3
## 10 You don't understand! I coulda had class! I coulda been a contende…         2
## # … with 68 more rows
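
Because the pattern is a regular expression, a small change alters what gets counted. The sketch below, on a hypothetical three-row tibble, compares counting individual exclamation points with counting runs of them ("!+"), so that "!!!" counts once.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy tibble reusing a few phrases shown in the output above
phrases <- tibble(catchphrase = c(
  "It's alive! It's alive! IT'S ALIVE!!!",
  "Hey Stella! HEY STELLA!!!",
  "As if!"
))

counts <- phrases %>%
  mutate(
    marks = str_count(catchphrase, "!"),   # every exclamation point
    runs  = str_count(catchphrase, "!+")   # each consecutive run counts once
  )

counts
```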

5: How do I concatenate columns?

Answer: Use stringr's str_glue_data() function to combine multiple columns into one string, with literal text placed before, between, or after the column values. The example below receives the whole tibble but uses only the row number, movie_name, and catchphrase, with a dash between the last two.

# Combine into one string
cp_glue <- catch_phrases %>% str_glue_data("{rownames(.)} {movie_name} - {catchphrase}")

head(cp_glue)

## 1 Beetlejuice - Beetlejuice, Beetlejuice, Beetlejuice!
## 2 Beetlejuice - It's showtime!
## 3 Poltergeist - They're heeeere!
## 4 The Goonies - Hey you guys!
## 5 Good Morning, Vietnam - Good morning, Vietnam!
## 6 Apocalypse Now - I love the smell of napalm in the morning. You know, one time we had a hill. bombed, for 12 hours. When it was all over, I walked up. We didn't find one. of 'em, not one stinkin' dink body. The smell, you know that gasoline smell,. the whole hill. Smelled like ... victory.
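
If the combined string should live in the tibble as a column instead of a separate vector, one alternative (a sketch on a hypothetical two-row tibble, not the author's approach) is str_c() inside mutate(), with dplyr's row_number() supplying the index.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy tibble with two rows
phrases <- tibble(
  movie_name  = c("Beetlejuice", "Poltergeist"),
  catchphrase = c("It's showtime!", "They're heeeere!")
)

# Build the label column piece by piece; str_c() joins its arguments element-wise
labelled <- phrases %>%
  mutate(
    id    = row_number(),
    label = str_c(id, " ", movie_name, " - ", catchphrase)
  )

labelled$label
```

Keep in mind that str_c() returns NA for any row where one of the pieces is NA.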

6: How do I order a vector alphabetically?

Answer: Use dplyr's pull() function to extract the catchphrase column from the catch_phrases tibble, converting it into a vector. Use stringr's str_sort() function to order the vector alphabetically.

# Convert catchphrase column to vector
cp_vec <- catch_phrases %>% pull(catchphrase)

# Sort the catchphrases alphabetically (str_sort() compares whole strings using the "en" locale by default)
cp_sort <- str_sort(cp_vec)

head(cp_sort)

## [1] "...and this one time, at band camp..."           
## [2] "A martini...shaken, not stirred."                
## [3] "After all, tomorrow is another day."             
## [4] "Alright, Mr. DeMille, I'm ready for my close-up."
## [5] "Alrighty then!"                                  
## [6] "As if!"
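
str_sort() also takes decreasing and numeric arguments. The sketch below uses hypothetical toy strings to show both; numeric = TRUE compares embedded digits as numbers, so "2" sorts before "10".

```r
library(stringr)

# Hypothetical toy strings, not taken from the dataset
titles <- c("Rocky 10", "Rocky 2", "Alien")

# Default: character-by-character comparison, so "Rocky 10" precedes "Rocky 2"
default_order <- str_sort(titles)

# Treat runs of digits as numbers
numeric_order <- str_sort(titles, numeric = TRUE)

# Reverse alphabetical order
reverse_order <- str_sort(titles, decreasing = TRUE)

default_order
numeric_order
reverse_order
```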