Filter for rows which contain the hashtag #caa
When we download Twitter data for analysis, it includes the hashtags associated with each tweet. The data wraps the hashtags (0 or more) into a single comma-separated string.
Example: "#red, #white, #caa, #blue"
Note that it is all one long single string.
I need to filter my tibble to just those rows which contain ‘#caa’.
Here is a generic 4-row tibble with a hashtags variable stored as one long string, as described above. The string is then split into its individual hashtags, which turns hashtags into a list column with one character vector per row.
library(tidyverse)
tb <- tibble("numbers" = c(20, 33, 28, 23),
             "sex" = c("m", "f", "f", "m"),
             "hashtags" = c("#caa", "#red, #yellow", "", "#red, #caa")
)
tb$hashtags <- str_split(tb$hashtags,", ")
view(tb)
I could grep for #caa and call it a day. That would work, but I prefer to slice up the string into multiple hashtags for further analysis.
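For comparison, the grep-style shortcut might look like the sketch below. It assumes the hashtags are still one string per row; the name tb_raw is made up for this illustration, and str_detect() with fixed() is just one way to do a literal match.

```r
library(tidyverse)

# Same example data as above, with hashtags kept as one string per row.
# tb_raw is a hypothetical name used only for this sketch.
tb_raw <- tibble(
  numbers  = c(20, 33, 28, 23),
  sex      = c("m", "f", "f", "m"),
  hashtags = c("#caa", "#red, #yellow", "", "#red, #caa")
)

# fixed() matches the literal text "#caa" anywhere in the string.
# It would also match a longer (hypothetical) tag such as "#caapital",
# which is one more reason to split the string into separate hashtags.
caa_rows <- tb_raw %>% filter(str_detect(hashtags, fixed("#caa")))
```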
I str_split() each string into its individual components and then iterate over the resulting list of character vectors to see if #caa or any other hashtag exists.
The problem comes when str_split is run. str_split returns a list with one element per row of the original tibble, and each element is a character vector of that row's hashtags.
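A quick way to see that shape is to inspect the result directly. This is a small sketch on the same four example strings; hashtags_raw and hashtags_split are names chosen for this illustration.

```r
library(stringr)

# The four example strings from the tibble above.
hashtags_raw <- c("#caa", "#red, #yellow", "", "#red, #caa")
hashtags_split <- str_split(hashtags_raw, ", ")

class(hashtags_split)    # "list"
length(hashtags_split)   # 4, one element per input string
lengths(hashtags_split)  # how many pieces each string was split into
```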
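Checking each list element for a value is surprisingly hard. filter("#caa" %in% tb$hashtags) returns the entire tibble, because the test yields a single logical value for the whole column rather than one value per row, so we need to look elsewhere.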
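The sketch below reproduces that behaviour outside of filter(), using a hand-built list with the same contents as the split hashtags column. The variable names are illustrative only.

```r
# Same contents as tb$hashtags after str_split().
hashtags_split <- list("#caa", c("#red", "#yellow"), "", c("#red", "#caa"))

# %in% asks whether "#caa" matches an element of the list as a whole,
# so the answer is a single logical value, not one value per row.
whole_column <- "#caa" %in% hashtags_split
length(whole_column)  # 1
```

filter() recycles a length-one condition across all rows, which is why every row is either kept or dropped together.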
sapply()
src: (https://stackoverflow.com/a/53086319/4858518)
sapply() applies %in% to each list element in turn and returns TRUE for the elements where #caa exists. Multiplying by 1 converts that to 1 or 0.
Now we use mutate() with sapply() to create a new column ‘hasCaa’ holding a 1 or 0, and filter on it to get our two desired rows.
tb <- tibble("numbers" = c(20, 33, 28, 23),
             "sex" = c("m", "f", "f", "m"),
             "hashtags" = c("#caa", "#red, #yellow", "", "#red, #caa")
)
tb$hashtags <- str_split(tb$hashtags,", ")
# Inside mutate(), refer to the column as hashtags rather than tb$hashtags.
tb <- tb %>% mutate(hasCaa = 1 * sapply(hashtags, `%in%`, x = "#caa"))
tb <- tb %>% filter(hasCaa == 1)
view(tb)
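If the helper column is not needed, the same test can go straight into filter(). The sketch below swaps sapply() for purrr's map_lgl(), which always returns one TRUE or FALSE per list element. This is an alternative to the approach above, not the method from the linked answer, and tb_caa is a name chosen for this illustration.

```r
library(tidyverse)

tb <- tibble(
  numbers  = c(20, 33, 28, 23),
  sex      = c("m", "f", "f", "m"),
  hashtags = c("#caa", "#red, #yellow", "", "#red, #caa")
) %>%
  mutate(hashtags = str_split(hashtags, ", "))

# Keep only rows whose hashtag vector contains "#caa".
tb_caa <- tb %>% filter(map_lgl(hashtags, ~ "#caa" %in% .x))
```

Because hashtags stays a list column, the remaining rows can still be unnested or searched for other hashtags afterwards.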