Stringr is a package in the tidyverse that deals with, well, strings. What are strings? Anything in your data frame that is text- or character-based. So, in the case of babynames, that’d be name and sex; year, prop and n are numbers, not strings.
Stringr allows for string manipulation across your entire dataset. If, for instance, we had a column of character data that repeatedly misspelled a name (Bryan instead of Brian, say), stringr could change every instance of Bryan into Brian in one line of code.
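As a minimal sketch of that one-line fix, assuming a made-up vector called first_names (it is not part of babynames), str_replace_all() swaps every match of a pattern:

```r
library(stringr)

# Hypothetical example data: the same name, sometimes misspelled
first_names <- c("Brian", "Bryan", "Bryan", "Alice")

# Swap every "Bryan" for "Brian"; entries without a match are left untouched
fixed_names <- str_replace_all(first_names, "Bryan", "Brian")
fixed_names
```

The same call works on a data frame column, for example inside mutate().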
Let’s start playing around with stringr by creating a variable that holds a couple of strings (plus one missing value, NA):
library(stringr)
sentence <- c("hello", "this is a long sentence", NA)
Here are some functions from the stringr package that we can use to manipulate these strings.
How many characters are there in each string?
str_length(sentence)
## [1] 5 23 NA
Let’s replace the first ‘l’ in each string with ‘X’ (str_replace() only changes the first match it finds):
str_replace(sentence, "l", "X")
## [1] "heXlo" "this is a Xong sentence"
## [3] NA
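Note that "hello" became "heXlo", not "heXXo": str_replace() stops after the first match in each string. A short sketch of the difference, reusing the sentence vector from above; str_replace_all() is the stringr function that changes every match:

```r
library(stringr)

sentence <- c("hello", "this is a long sentence", NA)

# str_replace() changes only the first match in each element
first_only <- str_replace(sentence, "l", "X")

# str_replace_all() changes every match in each element; NA stays NA
every_match <- str_replace_all(sentence, "l", "X")
every_match
```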
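Okay, now let’s create a new variable that holds several character strings (a character vector). We’ll call it list here, although that name is also used by a base R function, so you may prefer something like fruits in your own code: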
list <- c("Apple", "Banana", "Pear")
Let’s use str_sub to ‘pull out’ the first character of the string:
str_sub(list, 1,1)
## [1] "A" "B" "P"
What just happened? We told R to only grab the first character. Can we grab the first three characters?
str_sub(list, 1,3)
## [1] "App" "Ban" "Pea"
Let’s adjust the string’s content so each fruit is written entirely in lower case:
str_to_lower(list)
## [1] "apple" "banana" "pear"
We can also use stringr to detect specific words or phrases:
str_detect(list, "Pear")
## [1] FALSE FALSE TRUE
R returns TRUE or FALSE for each element of the vector: only the third element contains “Pear”.
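If the position itself is what we are after, stringr also offers str_which(), and str_subset() returns the matching strings. A small sketch, using a vector named fruits as a stand-in for the one above:

```r
library(stringr)

fruits <- c("Apple", "Banana", "Pear")

# One TRUE/FALSE per element
is_pear <- str_detect(fruits, "Pear")

# The index of each element that matches
pear_position <- str_which(fruits, "Pear")

# The matching elements themselves
pear_values <- str_subset(fruits, "Pear")
```

Here is_pear is FALSE FALSE TRUE, pear_position is 3, and pear_values is "Pear".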
Let’s apply more Stringr concepts to a body of text. Where could we find text? How would we get text into R?
One way would be a ‘web scrape’ - programmatically grabbing all the relevant text off of a page, or a series of pages - think Amazon product reviews.
Another would be to import a body of text, like a .txt file, into R, and then break it up into individual words - more on this in the next chapter.
A third way would be to use data included in an R package. More commonly, packages give you access to online datasets too large to download in full (Spotify, The New York Times, etc.). This technique of selectively downloading relevant data from a much larger online database is the idea behind an ‘API’ (Application Programming Interface, if you’re wondering). Many websites also offer APIs or downloadable files for analysis.
So as to avoid any complication, let’s hold off on accessing an API until we cover Twitter, and continue to use the babynames dataset for now; then we can move on to analyzing lyrics, political speeches, and great works of literature.
A snippet of Base R here, rather than the tidyverse: if we want to specify a column in our dataset, we write the name of the dataset or variable, then the dollar sign, then the name of the column:
babynames$name
Let’s use str_detect() to find all of the names that include a ‘sh’ sound:
str_detect(babynames$name, "sh")
As you see, str_detect() works like a logical test: it returns a TRUE or FALSE value for each entry in the column, based on our condition: is there a ‘sh’ in the character string? Note that the match is case-sensitive, so a capital ‘Sh’ at the start of a name does not count.
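A small sketch of this TRUE/FALSE behaviour on a handful of names; sample_names is a made-up stand-in for babynames$name, so the example runs without the full dataset:

```r
library(stringr)

# Hypothetical sample standing in for babynames$name
sample_names <- c("Ashley", "Shirley", "Joshua", "Mary")

# One TRUE/FALSE per name; "Shirley" is FALSE because its "Sh" is capitalized
has_sh <- str_detect(sample_names, "sh")

# TRUE counts as 1, so sum() gives the number of matching names
n_matches <- sum(has_sh)

# Use the logical vector to keep only the matching names
matching_names <- sample_names[has_sh]
```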
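Let’s combine our previous experience with ggplot to write a longer, more complicated set of code that will visualize the most popular names starting with ‘Sh’ for girls born in 1938:
library(tidyverse) # provides the %>% pipe, filter(), arrange() and ggplot()
library(babynames)
babynames %>%
filter(str_detect(name, "Sh") & sex=="F" & year == 1938) %>% # female names containing "Sh", 1938 only
arrange(desc(prop)) %>%
head(20) %>% # keep the 20 most popular
ggplot(aes(reorder(name,prop),prop, fill = name)) +
geom_col() +
coord_flip() # horizontal bars, so the names are readable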
Let’s fancy that up a bit, by converting prop into a ‘percentage’: the share of babies of that sex who were given that name in that year:
babynames %>%
filter(str_detect(name, "Sh") & sex=="F" & year == 1938) %>%
arrange(desc(prop)) %>%
head(20) %>%
mutate(percent = (prop * 100)) %>%
ggplot(aes(reorder(name,percent),percent, fill = name)) +
geom_col() +
coord_flip()
Next, let’s look for names that contain the letter ‘Z’, among babies born after 2000:
babynames %>%
filter(year > 2000 & str_detect(name, "Z")) %>%
arrange(desc(prop))
## # A tibble: 13,018 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2001 M Zachary 18186 0.00880
## 2 2002 M Zachary 16622 0.00805
## 3 2003 M Zachary 15539 0.00740
## 4 2004 M Zachary 13711 0.00649
## 5 2005 M Zachary 12283 0.00578
## 6 2006 M Zachary 11005 0.00502
## 7 2007 M Zachary 10212 0.00461
## 8 2008 M Zachary 9226 0.00424
## 9 2012 F Zoey 7466 0.00386
## 10 2009 M Zachary 8078 0.00381
## # … with 13,008 more rows
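Why did R only pull out names with a capital Z? Because str_detect() is case-sensitive, and we capitalized the pattern. How do we get both? One option is the pattern "[zZ]", which matches either case. Here we use it with str_count() to count the z’s in each name: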
babynames %>%
mutate(Z = str_count(name, "[zZ]")) %>%
arrange(desc(prop))
## # A tibble: 1,924,665 × 6
## year sex name n prop Z
## <dbl> <chr> <chr> <int> <dbl> <int>
## 1 1880 M John 9655 0.0815 0
## 2 1881 M John 8769 0.0810 0
## 3 1880 M William 9532 0.0805 0
## 4 1883 M John 8894 0.0791 0
## 5 1881 M William 8524 0.0787 0
## 6 1882 M John 9557 0.0783 0
## 7 1884 M John 9388 0.0765 0
## 8 1882 M William 9298 0.0762 0
## 9 1886 M John 9026 0.0758 0
## 10 1885 M John 8756 0.0755 0
## # … with 1,924,655 more rows
Instead of arranging our data by name popularity, let’s look at the names with the most z’s in them:
babynames %>%
mutate(Z = str_count(name, "[zZ]")) %>%
arrange(desc(Z))
## # A tibble: 1,924,665 × 6
## year sex name n prop Z
## <dbl> <chr> <chr> <int> <dbl> <int>
## 1 2010 M Zzyzx 5 0.00000244 3
## 2 1880 F Lizzie 388 0.00398 2
## 3 1880 F Kizzie 13 0.000133 2
## 4 1881 F Lizzie 396 0.00401 2
## 5 1881 F Kizzie 9 0.0000910 2
## 6 1882 F Lizzie 495 0.00428 2
## 7 1882 F Kizzie 9 0.0000778 2
## 8 1882 F Dezzie 5 0.0000432 2
## 9 1883 F Lizzie 496 0.00413 2
## 10 1883 F Kizzie 14 0.000117 2
## # … with 1,924,655 more rows
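The "[zZ]" pattern is one of several ways to ignore case. A brief sketch of three equivalent approaches on a few names taken from the output above; z_names is just an illustrative vector:

```r
library(stringr)

z_names <- c("Zachary", "Lizzie", "Zzyzx", "John")

# Option 1: a character class that lists both cases
with_class <- str_count(z_names, "[zZ]")

# Option 2: tell the regular expression to ignore case
with_flag <- str_count(z_names, regex("z", ignore_case = TRUE))

# Option 3: lower-case the names first, then count a plain "z"
with_lower <- str_count(str_to_lower(z_names), "z")
```

All three give 1, 2, 3 and 0 for these names. The character class is the shortest to type for a single letter; regex(ignore_case = TRUE) scales better to longer patterns.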
What if, instead of specifying a particular letter, we just wanted to count the most frequent first letters in names?
babynames %>%
mutate(first_letter = substr(name, 1,1)) -> baby_letters
Let’s plot that:
baby_letters %>%
count(first_letter, sort = TRUE) %>%
ggplot(aes(reorder(first_letter, n),n)) +
geom_col() +
coord_flip()
We can also use stringr to calculate the length of all of our strings. What is the frequency of the shortest and longest names?
babynames %>%
mutate(length = str_length(name)) -> babynames_length
Now, if we want to see the average length of names over time, the code gets a little more advanced. Note the mean() function, which we haven’t used like this before:
babynames_length %>%
group_by(year) %>%
summarise(length = mean(length)) %>%
ggplot(aes(year, length)) + geom_line()
The results are impressive. Let’s split them up by sex:
babynames_length %>%
group_by(year, sex) %>%
summarise(length = mean(length), .groups = "drop") %>%
ggplot(aes(year, length, color = sex)) + geom_line()
As mentioned in the last chapter, summarise() can be confusing. In the case of babynames, you’ll know to use it when the same name keeps showing up row after row (once for every year and sex) and you want to collapse those rows into a single total per name. Let’s take a look at this issue by finding the most common 3-letter names:
babynames_length %>%
filter(length == 3) %>%
arrange(desc(prop))
## # A tibble: 41,274 × 6
## year sex name n prop length
## <dbl> <chr> <chr> <int> <dbl> <int>
## 1 1975 F Amy 32252 0.0207 3
## 2 1976 F Amy 31341 0.0199 3
## 3 1974 F Amy 29564 0.0189 3
## 4 1973 F Amy 26964 0.0174 3
## 5 1977 F Amy 26731 0.0163 3
## 6 1972 F Amy 25873 0.0160 3
## 7 1880 F Ida 1472 0.0151 3
## 8 1971 F Amy 26238 0.0150 3
## 9 1881 F Ida 1439 0.0146 3
## 10 1882 F Ida 1673 0.0145 3
## # … with 41,264 more rows
We get a lot of repeated names. Time to summarize!
babynames_length %>%
filter(length == 3) %>%
group_by(name) %>%
summarise(total = sum(n) ) %>%
arrange(desc(total))
## # A tibble: 970 × 2
## name total
## <chr> <int>
## 1 Amy 692096
## 2 Ann 469710
## 3 Joe 462099
## 4 Roy 407020
## 5 Lee 292891
## 6 Eva 263741
## 7 Ava 251052
## 8 Ian 222950
## 9 Mia 216774
## 10 Kim 214365
## # … with 960 more rows
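To see what group_by() and summarise() are doing here, it can help to run the same steps on a tiny table. A sketch with made-up numbers (toy is hypothetical data, not taken from babynames):

```r
library(dplyr)

# Hypothetical miniature version of babynames: one row per name per year
toy <- data.frame(
  year = c(1975, 1976, 1975, 1976),
  name = c("Amy", "Amy", "Joe", "Joe"),
  n    = c(10L, 20L, 5L, 7L),
  stringsAsFactors = FALSE
)

# Collapse the repeated rows into one total per name
totals <- toy %>%
  group_by(name) %>%
  summarise(total = sum(n)) %>%
  arrange(desc(total))
totals
```

Four rows go in, two come out: Amy with a total of 30 and Joe with a total of 12.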
Let’s try that again, with 2-letter names:
babynames_length %>%
filter(length == 2) %>%
group_by(name) %>%
summarize(total = sum(n)) %>%
arrange(desc(total))
## # A tibble: 149 × 2
## name total
## <chr> <int>
## 1 Jo 180579
## 2 Ty 45278
## 3 Ed 26330
## 4 Al 17221
## 5 Bo 10856
## 6 Lu 4013
## 7 Cy 3418
## 8 Wm 2737
## 9 Kc 2585
## 10 An 2048
## # … with 139 more rows
What is the longest name?
babynames_length %>%
arrange(desc(length))
## # A tibble: 1,924,665 × 6
## year sex name n prop length
## <dbl> <chr> <chr> <int> <dbl> <int>
## 1 1978 M Christophermich 5 0.00000293 15
## 2 1979 M Johnchristopher 5 0.00000279 15
## 3 1980 M Christophermich 7 0.00000377 15
## 4 1980 M Christopherjohn 5 0.0000027 15
## 5 1981 F Mariadelrosario 5 0.0000028 15
## 6 1981 M Christopherjohn 5 0.00000268 15
## 7 1982 F Mariadelosangel 6 0.00000331 15
## 8 1982 M Christopherjohn 6 0.00000318 15
## 9 1982 M Christophermich 5 0.00000265 15
## 10 1983 M Christopherjohn 8 0.00000429 15
## # … with 1,924,655 more rows
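How many 15-letter names are there? (Here n counts rows, that is, the year and sex combinations in which each name appears, not the number of babies.)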
babynames_length %>%
filter(length == 15) %>%
count(name, sort = TRUE)
## # A tibble: 34 × 2
## name n
## <chr> <int>
## 1 Christopherjohn 19
## 2 Johnchristopher 17
## 3 Christopherjame 16
## 4 Franciscojavier 16
## 5 Christophermich 8
## 6 Ryanchristopher 7
## 7 Christianjoseph 4
## 8 Christopherjose 4
## 9 Jonathanmichael 4
## 10 Mariadelosangel 4
## # … with 24 more rows
Let’s plot those names:
babynames %>%
filter(name %in% c("Christopherjohn","Johnchristopher","Christopherjame","Franciscojavier", "Christophermich", "Ryanchristopher","Christianjoseph", "Christopherjose", "Jonathanmichael", "Mariadelosangel"
)) %>%
ggplot(aes(year, prop, color = name)) + geom_line()
By the look of these names (Christophermich, Mariadelosangel), it seems that most of them are actually longer than 15 characters, but 15 characters is the cut-off point for the column. Thus, we cannot accurately estimate the most common 15-letter names.
Along similar lines, a look at the methodology behind babynames shows that only names with at least 5 instances in a given year are recorded. So it’d be similarly futile for us to attempt to measure the rarest names, as they are excluded from the database. (It also helps clarify why some rarer names seem to ‘disappear’ in certain years.)
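To illustrate the cut-off, here is a sketch of how a longer name would end up as one of the 15-character entries above. The full spelling "Christophermichael" is an assumption made for illustration; the dataset itself only contains the truncated form:

```r
library(stringr)

# Assumed original name (not in the dataset), chosen for illustration
full_name <- "Christophermichael"

# The full name has 18 characters
full_length <- str_length(full_name)

# Keeping only characters 1 to 15 mimics the cut-off
truncated <- str_sub(full_name, 1, 15)
truncated
```

The result, "Christophermich", matches the entry we saw in the table of longest names.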
OK, what else can stringr do?
How about the average number of vowels per name?
str_count(babynames$name, "[aeiou]")
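That’s a lot of numbers. Let’s calculate a mean value instead. (Keep in mind that the pattern is lowercase only, so a capitalized first-letter vowel, like the A in Amy, is not counted.)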
mean(str_count(babynames$name, "[aeiou]"))
## [1] 2.420695
How about consonants?
mean(str_count(babynames$name, "[bcdfghjklmnpqrstvwxyz]"))
## [1] 2.752322
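Both patterns above list lowercase letters only, so the capitalized first letter of every name is skipped. A small sketch of the effect on three names from earlier in the chapter; adding the uppercase vowels to the character class is one possible adjustment:

```r
library(stringr)

sample_names <- c("Amy", "Ida", "Zoey")

# Lowercase-only pattern: misses the "A" in Amy and the "I" in Ida
lower_only <- str_count(sample_names, "[aeiou]")

# Both cases listed: counts capitalized vowels too
any_case <- str_count(sample_names, "[aeiouAEIOU]")
```

For these names lower_only is 0, 1, 2 while any_case is 1, 2, 2, so the averages computed on the full dataset with the lowercase patterns are best read as counts that exclude the first letter.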
Example ideas for further exploration:
How many names contain ‘liz’ in them?
babynames %>%
filter(str_detect(name, "liz")) %>%
count(name, sort = TRUE) %>%
head(20) %>%
ggplot(aes(reorder(name, n),n)) + geom_col() +
coord_flip()
Case Study: Born Without A (Proper) Name
There are a number of names in the database that are totally anonymous. When, and why?
babynames %>%
filter(name %in% c("Unknown", "Unnamed", "Infant", "Infantof", "Notnamed", "Baby")) %>%
ggplot(aes(year, prop, color = name)) + geom_line() + facet_wrap(~sex)
Let’s compare this to the number of unique names per year:
babynames %>%
group_by(year) %>%
summarize(annual = n_distinct(name)) %>%
ggplot(aes(year, annual )) + geom_line()
The babynames package also includes a dataset called births, which simply lists the total number of births per year:
data(births)
ggplot(births, aes(year, births)) + geom_line()
Why are these last two graphs different? Because the first is counting names, the second is counting births. And most babies have names that are shared with other babies, especially in the same year.
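The difference between counting names and counting births can be seen on a tiny made-up table (toy is hypothetical data, not taken from babynames):

```r
library(dplyr)

# Hypothetical data: a few names over two years
toy <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001),
  name = c("Amy", "Joe", "Zoey", "Amy", "Joe"),
  n    = c(100L, 80L, 5L, 120L, 90L),
  stringsAsFactors = FALSE
)

per_year <- toy %>%
  group_by(year) %>%
  summarise(
    unique_names = n_distinct(name), # how many different names were used
    babies       = sum(n)            # how many babies received those names
  )
per_year
```

In this toy example the number of unique names falls from 3 to 2 while the number of babies rises from 185 to 210, so the two measures can even move in opposite directions.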