I chose a text-only dataset to demonstrate the string manipulation functions in stringr. Most of the examples also keep the data in a tibble and use data management functions from dplyr.
# Load data as a tibble (directly from GitHub)
catch_phrases <- read_csv("https://gist.githubusercontent.com/mattlucich/afc4b9c362e303c1f6ba8880877f0b60/raw/a08a96361705b00c4ee8f4ec0b3d324f864ae419/catchphrase.csv") %>%
  rename(catchphrase = Catchphrase,
         movie_name = `Movie Name`,
         context = Context)
## Parsed with column specification:
## cols(
## Catchphrase = col_character(),
## `Movie Name` = col_character(),
## Context = col_character()
## )
Answer: Use stringr's str_replace_all(), which replaces every match of a character or pattern in each string.
# Remove extraneous line breaks
catch_phrases$catchphrase <- catch_phrases$catchphrase %>% str_replace_all("\n", "")
head(catch_phrases)
## # A tibble: 6 x 3
## catchphrase movie_name context
## <chr> <chr> <chr>
## 1 Beetlejuice, Beetlejuice, Beetleju… BEETLEJUICE Lydia, summoning Beetlejui…
## 2 It's showtime! BEETLEJUICE Beetlejuice, being summone…
## 3 They're heeeere! POLTERGEIST Carol Anne Freeling, notif…
## 4 Hey you guys! THE GOONIES Sloth, calling the attenti…
## 5 Good morning, Vietnam! GOOD MORNING,… Adrian Cronauer's greeting…
## 6 I love the smell of napalm in the … APOCALYPSE NOW Lt. Col. Bill Kilgore, des…
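One thing to watch with the replacement above: swapping "\n" for an empty string can fuse the words on either side of a break. The sketch below uses a made-up two-element vector (x is hypothetical, not taken from the dataset) to compare that approach with replacing the break by a space and tidying the result with str_squish().

```r
library(stringr)

# Hypothetical strings with embedded line breaks (not from the dataset)
x <- c("Hey you\nguys!", "It's \n showtime!")

# Replacing with "" removes the break but can join neighbouring words
# or leave doubled spaces behind
no_gap <- str_replace_all(x, "\n", "")

# Replacing with a space, then squishing, keeps words apart and
# collapses any repeated whitespace into a single space
spaced <- str_squish(str_replace_all(x, "\n", " "))

print(no_gap)
print(spaced)
```

Which variant is appropriate depends on whether the line breaks in the source file fall between words or inside them.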
Answer: Use stringr's str_to_title() to convert text to title case, and wrap it in dplyr's mutate() to apply the conversion to the whole movie_name column, overwriting the original values.
# Use stringr and dplyr to convert all of the movie names to title case
catch_phrases <- catch_phrases %>% mutate(movie_name = str_to_title(movie_name))
head(catch_phrases)
## # A tibble: 6 x 3
## catchphrase movie_name context
## <chr> <chr> <chr>
## 1 Beetlejuice, Beetlejuice, Beetleju… Beetlejuice Lydia, summoning Beetlejui…
## 2 It's showtime! Beetlejuice Beetlejuice, being summone…
## 3 They're heeeere! Poltergeist Carol Anne Freeling, notif…
## 4 Hey you guys! The Goonies Sloth, calling the attenti…
## 5 Good morning, Vietnam! Good Morning,… Adrian Cronauer's greeting…
## 6 I love the smell of napalm in the … Apocalypse Now Lt. Col. Bill Kilgore, des…
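As a small illustration of what str_to_title() does to each word, here is a sketch on a hypothetical three-row tibble (movies and its values are made up for the example). Note that every word is capitalised, including short ones such as "of".

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the movie_name column
movies <- tibble(movie_name = c("THE GOONIES", "APOCALYPSE NOW", "THE POWER OF ONE"))

# mutate() overwrites movie_name with its title-case version
movies <- movies %>% mutate(movie_name = str_to_title(movie_name))

print(movies$movie_name)
```

If style-guide title case is needed (lower-case "of", "the", and so on), that would take an extra step beyond str_to_title().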
Answer: Use stringr's str_detect() to detect the character/pattern of interest, then use dplyr's filter() to return only rows where the str_detect() condition is true.
# Filter for the high energy quotes (i.e. ones with exclamation points)
exclamation_points <- catch_phrases %>% filter(str_detect(catchphrase, '!'))
# Proportion of quotes that have exclamation points
nrow(exclamation_points) / nrow(catch_phrases)
## [1] 0.52
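Because str_detect() returns a logical vector, the same proportion can also be computed without building a filtered tibble first. A minimal sketch, assuming a hypothetical four-element vector named quotes:

```r
library(stringr)

# Hypothetical sample of quotes (not the full dataset)
quotes <- c("It's showtime!",
            "As if!",
            "After all, tomorrow is another day.",
            "Alrighty then!")

# TRUE where the quote contains a literal exclamation point;
# fixed() treats the pattern as plain text rather than a regex
has_exclamation <- str_detect(quotes, fixed("!"))

# The mean of a logical vector is the share of TRUE values
share <- mean(has_exclamation)
print(share)
```

Here three of the four quotes match, so share is 0.75.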
Answer: Use stringr's str_count() to count the matches of a character or pattern in each string. Use dplyr's mutate() to store that count in a new column, exc_count, then filter() to keep only rows with at least one match. Finally, use select() to return only the columns of interest and arrange() to sort them in descending order of match count.
# Count the exclamation points in each quote and rank the quotes by that count
catch_phrases %>%
  mutate(exc_count = str_count(catchphrase, '!')) %>%
  filter(exc_count > 0) %>%
  select(catchphrase, exc_count) %>%
  arrange(desc(exc_count))
## # A tibble: 78 x 2
## catchphrase exc_count
## <chr> <int>
## 1 It's alive! It's alive! IT'S ALIVE!!! 5
## 2 Gooble gobble, gooble gobble. We accept her! We accept her!. Goobl… 4
## 3 Hey Stella! HEY STELLA!!! 4
## 4 Get away from her, you BITCH!!! 3
## 5 SHOW ME THE MONEY!!! 3
## 6 HOLY SCHNIKES!!! 3
## 7 INCONCEIVABLE!!! 3
## 8 THE POWER OF CHRIST COMPELS YOU!!! 3
## 9 Hey! I'm walking here! I'm walking here! 3
## 10 You don't understand! I coulda had class! I coulda been a contende… 2
## # … with 68 more rows
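str_count() counts every match of whatever pattern it is given, so the choice of pattern changes what is being measured. The sketch below, on a hypothetical three-element vector, contrasts counting individual exclamation points with counting runs of them via the regular expression "!+".

```r
library(stringr)

# Hypothetical sample of quotes (the first one also appears in the output above)
quotes <- c("It's alive! It's alive! IT'S ALIVE!!!",
            "As if!",
            "After all, tomorrow is another day.")

# Every single "!" counts as one match
marks <- str_count(quotes, fixed("!"))

# "!+" matches a run of one or more "!", so "!!!" counts once
bursts <- str_count(quotes, "!+")

print(marks)
print(bursts)
```

For the first quote this gives 5 individual marks but only 3 bursts.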
Answer: Use stringr's str_glue_data() to combine multiple columns into one string, with literal text placed before, between, or after the column values. The example below pipes in the whole tibble but interpolates only the row number, movie_name, and catchphrase, with a dash between the last two.
# Combine into one string
cp_glue <- catch_phrases %>% str_glue_data("{rownames(.)} {movie_name} - {catchphrase}")
head(cp_glue)
## 1 Beetlejuice - Beetlejuice, Beetlejuice, Beetlejuice!
## 2 Beetlejuice - It's showtime!
## 3 Poltergeist - They're heeeere!
## 4 The Goonies - Hey you guys!
## 5 Good Morning, Vietnam - Good morning, Vietnam!
## 6 Apocalypse Now - I love the smell of napalm in the morning. You know, one time we had a hill. bombed, for 12 hours. When it was all over, I walked up. We didn't find one. of 'em, not one stinkin' dink body. The smell, you know that gasoline smell,. the whole hill. Smelled like ... victory.
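An alternative to calling rownames(.) inside the template is to add an explicit index column first and interpolate it like any other column. This is a sketch on a hypothetical two-row tibble; toy and the id column are names introduced only for the example.

```r
library(dplyr)
library(stringr)

# Hypothetical two-row stand-in for catch_phrases
toy <- tibble(
  movie_name  = c("Beetlejuice", "The Goonies"),
  catchphrase = c("It's showtime!", "Hey you guys!")
)

# Add a row index, then interpolate the columns by name inside {}
labelled <- toy %>%
  mutate(id = row_number()) %>%
  str_glue_data("{id} {movie_name} - {catchphrase}")

print(labelled)
```

The result is a glue object with one string per row; as.character() converts it to a plain character vector if needed.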
Answer: Use dplyr's pull() function to extract the catchphrase column from the catch_phrases tibble, converting it into a vector. Use stringr's str_sort() function to order the vector alphabetically.
# Convert catchphrase column to vector
cp_vec <- catch_phrases %>% pull(catchphrase)
# Sort the catchphrases alphabetically
cp_sort <- str_sort(cp_vec)
head(cp_sort)
## [1] "...and this one time, at band camp..."
## [2] "A martini...shaken, not stirred."
## [3] "After all, tomorrow is another day."
## [4] "Alright, Mr. DeMille, I'm ready for my close-up."
## [5] "Alrighty then!"
## [6] "As if!"
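str_sort() also takes a few arguments that change the ordering. The sketch below shows decreasing and numeric on small hypothetical vectors (the "Take" strings are made up for the example).

```r
library(stringr)

# Hypothetical strings containing numbers
takes <- c("Take 10", "Take 2", "Take 1")

# Default: compared character by character, so "10" sorts before "2"
default_order <- str_sort(takes)

# numeric = TRUE compares runs of digits as numbers
numeric_order <- str_sort(takes, numeric = TRUE)

# decreasing = TRUE reverses the order
reverse_order <- str_sort(c("b", "a", "c"), decreasing = TRUE)

print(default_order)
print(numeric_order)
print(reverse_order)
```

Ordering is locale-aware (English by default), which is why punctuation such as "..." sorts ahead of letters in the output above.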