The Social Security Administration keeps records on first names at birth, going back to 1880. Hadley Wickham has created an R package that lets us load this data directly into R. Using a package is a two-step process: first, you install the package onto your computer; second, you load the package as a library to access its functions and data in your current work environment. Put another way, a package need only be installed once, but has to be loaded into your R environment every time you start up RStudio.
(Footnote: where do these packages come from, and who makes them? Why? By default, packages are hosted on an online repository referred to as CRAN; they can also be installed from GitHub. Anyone can make and share an R package, for instance if they solve a complicated workflow and want to share that insight with others to save them time and redundancy. Keeping R package-based means that users only need to load the packages relevant to their area of study; keep in mind that R is used for everything from Health Sciences to Comparative Literature.)
install.packages('babynames')
library(babynames)
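The install-once, load-every-session routine can be sketched as a small helper. The function name load_or_install is hypothetical (it is not part of R or of any package); requireNamespace(), install.packages() and library() are all Base R.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # step 1: only needed once per computer
  }
  library(pkg, character.only = TRUE)  # step 2: needed in every new session
}
```

Calling load_or_install("babynames") would then do the work of the two lines above.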
Let’s take a look at the babynames package, by asking about its structure, and taking a glimpse at it:
str(babynames)
## tibble [1,924,665 × 5] (S3: tbl_df/tbl/data.frame)
## $ year: num [1:1924665] 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
## $ sex : chr [1:1924665] "F" "F" "F" "F" ...
## $ name: chr [1:1924665] "Mary" "Anna" "Emma" "Elizabeth" ...
## $ n : int [1:1924665] 7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
## $ prop: num [1:1924665] 0.0724 0.0267 0.0205 0.0199 0.0179 ...
glimpse(babynames)
## Rows: 1,924,665
## Columns: 5
## $ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880,…
## $ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", …
## $ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "Ida",…
## $ n <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1258,…
## $ prop <dbl> 0.07238359, 0.02667896, 0.02052149, 0.01986579, 0.01788843, 0.016…
The output of glimpse() is a little overwhelming at first, although it makes clear that the dataset has a lot of rows (Excel cannot handle 1,924,665 rows of data, since a worksheet tops out at 1,048,576 rows, but this is still not ‘Big Data’).
So let’s try head and tail to see the first and last 6 rows of the dataset:
head(babynames)
## # A tibble: 6 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
tail(babynames)
## # A tibble: 6 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2017 M Zyhier 5 0.00000255
## 2 2017 M Zykai 5 0.00000255
## 3 2017 M Zykeem 5 0.00000255
## 4 2017 M Zylin 5 0.00000255
## 5 2017 M Zylis 5 0.00000255
## 6 2017 M Zyrie 5 0.00000255
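Both head() and tail() return six rows unless told otherwise; a second argument asks for a different number. The sketch below uses mtcars, a dataset that ships with R, purely as a stand-in so that it runs without any packages; with babynames loaded, head(babynames, 10) works the same way.

```r
# head() and tail() take an optional row count as their second argument.
data(mtcars)                        # a dataset that ships with R
first_ten <- head(mtcars, 10)       # first 10 rows
last_ten  <- tail(mtcars, n = 10)   # last 10 rows
nrow(first_ten)
```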
Almost everything we’ve done up to this point has been in Base R, that is, R without any functions added via packages (the exception is glimpse(), which comes with the Tidyverse and only runs once the Tidyverse is loaded). Throughout this book, we’ll be relying on a collection of interconnected packages called the Tidyverse, which drastically simplifies performing calculations on a dataset. Like all packages, we first have to install it and then load it.
• Please note: when you install the ‘tidyverse’ package, you’ll get a prompt inside your R Console at the bottom-left, asking you a Yes/No question. You cannot proceed until you click your cursor in the Console and type out the word ‘Yes.’ Note that the Console may be cut off or obscured; you may need to adjust your RStudio windows to see it better.
install.packages('tidyverse')
library(tidyverse)
Loading the Tidyverse shows the core packages that make it up (eight at the time of writing), as well as a few warnings that we can disregard. We’ll begin by focusing on three packages that will help us work with babynames: dplyr, ggplot2 and stringr. All three are loaded once we run the library(tidyverse) command. The first, dplyr, we use to manipulate data: to filter it, rearrange it, count it, do calculations for custom columns, and pivot the data to our liking.
The greatest and most powerful functionality added to R via the Tidyverse, in my opinion, is the ‘pipe operator,’ which allows us to chain commands to each other. It can be tricky to type; the shortcut is Shift-Command-M (on a PC, it’d be Shift-Control-M).
Another useful logical operator for this dataset is the ‘includes’ operator, which looks like this: %in% .
I think an example will make the best sense of how to use these operators, so let’s get started with filtering.
If we just type ‘babynames’ in R, we’ll see the first 10 rows of data, organized by year, and an indication that there are nearly 2 million more rows remaining. That’s way too much to visualize! Let’s start by filtering for specific names: I’ll use the names of The Beatles as a starting point.
babynames %>%
filter(name %in% 'Ringo')
## # A tibble: 28 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1964 M Ringo 12 0.00000592
## 2 1965 M Ringo 18 0.0000095
## 3 1966 M Ringo 8 0.0000044
## 4 1972 M Ringo 6 0.00000358
## 5 1974 M Ringo 6 0.00000368
## 6 1975 M Ringo 8 0.00000493
## 7 1976 M Ringo 13 0.00000796
## 8 1977 M Ringo 10 0.00000585
## 9 1978 M Ringo 12 0.00000702
## 10 1979 M Ringo 9 0.00000502
## # … with 18 more rows
That ‘translates’ to ‘take the babynames dataset and filter it so only values in the name column that match ’Ringo’ are included.’ The ‘and’ is the pipe operator ( %>% ), and the ‘match’ is the %in% operator.
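To make the two operators concrete, here is a minimal sketch on a four-row toy table. The table name toy is made up for illustration; its rows are copied from output shown in this chapter.

```r
library(dplyr)

# A few 'Ringo' rows copied from the output above, plus one other name.
toy <- tibble(
  year = c(1964, 1965, 1966, 1880),
  sex  = c("M", "M", "M", "F"),
  name = c("Ringo", "Ringo", "Ringo", "Mary"),
  n    = c(12L, 18L, 8L, 7065L)
)

# %in% asks, row by row, whether each name appears in the set on the right.
toy$name %in% "Ringo"

# The pipe hands `toy` to filter() as its first argument, so these two match:
piped  <- toy %>% filter(name %in% "Ringo")
nested <- filter(toy, name %in% "Ringo")
identical(piped, nested)
```

The piped form reads left to right, which is why it stays legible as more steps are chained on.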
If we wanted to look for more than one name, we’d change the syntax to use a ‘combine’ command ( c ), with the values comma-separated. Let’s try that with babynames and the Beatles.
babynames %>%
filter(name %in% c('Ringo', 'Paul', 'George', 'John'))
## # A tibble: 831 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F John 46 0.000471
## 2 1880 F George 26 0.000266
## 3 1880 M John 9655 0.0815
## 4 1880 M George 5126 0.0433
## 5 1880 M Paul 301 0.00254
## 6 1881 F George 30 0.000303
## 7 1881 F John 26 0.000263
## 8 1881 M John 8769 0.0810
## 9 1881 M George 4664 0.0431
## 10 1881 M Paul 291 0.00269
## # … with 821 more rows
Now we get a ton of results - far too many to show on the screen.
That’s where data visualization comes in: it’s often impossible to even see your results from big data without plotting it into a visual, summarized form. Our visualization package is called ggplot2, and it’s amazingly straightforward to use, once you get used to its syntax:
ggplot(data, aes(x,y)) + geometry()
Huh?
Well, our function (think: verb) is ‘ggplot().’ Inside that, we define our ‘aesthetics’ (aes) for the visualization, which has its own set of parentheses. Inside the aesthetics, we define the columns we want to use for the x and y axes, and finally we define the type of geometry our visualization will use, such as a bar chart ( geom_col() ), scatterplot ( geom_point() ), or line chart ( geom_line() ), for instance.
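Here is the same template with every argument named, as a sketch. It uses mtcars, a dataset that ships with R, and its wt and mpg columns, only so the block runs on its own.

```r
library(ggplot2)

# Same template, with every argument spelled out.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +  # data + aesthetics
  geom_point()                                                # geometry: scatterplot
p
```

Swapping geom_point() for another geometry changes the chart type without touching the data or the aesthetics.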
Like most things in R and the Tidyverse, it’s easier to make sense of ggplot() through examples:
babynames %>%
filter(name %in% 'Ringo') %>%
ggplot(aes(year, prop)) + geom_line()
To review, we take ‘babynames,’ and filter so the name includes ‘Ringo’; we then plot by setting the aesthetics to use the ‘year’ and ‘prop’ columns of our dataset; our plot will be a line chart.
Learning by example often raises as many questions as it answers, so I’ll try to address those as we go along.
Let’s take this one step further:
babynames %>%
filter(name %in% c('Ringo', 'John', 'George', 'Paul')) %>%
ggplot(aes(year, prop, color = name)) + geom_line()
That looks really weird! Why? In summary, because there is a ‘sex’ column in the data. So, for any year - say, 1880 - there are 9,655 male Johns and 46 female Johns, and ggplot is trying to plot both values for the same name on the same line.
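A quick way to confirm the diagnosis is to count how many rows each name has in a single year. This is a sketch on a toy table (the name rows_1880 is made up) holding the 1880 rows from the output above; it borrows count(), which is introduced later in this chapter.

```r
library(dplyr)

# The 1880 rows copied from the output above.
rows_1880 <- tibble(
  year = 1880,
  sex  = c("F", "F", "M", "M", "M"),
  name = c("John", "George", "John", "George", "Paul"),
  n    = c(46L, 26L, 9655L, 5126L, 301L)
)

# How many rows does each name have in 1880?
rows_per_name <- rows_1880 %>% count(name)
rows_per_name
```

Any name with two rows in one year gives the line two heights to pass through, which is what produces the zig-zag.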
Therefore, our solution is to filter out one sex:
babynames %>%
filter(name %in% c('Ringo', 'John', 'George', 'Paul')) %>%
filter(sex %in% "M") %>%
ggplot(aes(year, prop, color = name)) + geom_line()
That’s much better - but what happened to Ringo?
The other names are so much more popular that his line - which doesn’t start showing up until 1964 - is very low in comparison.
Let’s try filtering the ‘year’ column in order to have our ggplot ‘zoom in’ on the Ringo section.
babynames %>%
filter(name %in% c('Ringo', 'John', 'George', 'Paul')) %>%
filter(sex %in% "M") %>%
filter(year > 1964) %>%
ggplot(aes(year, prop, color = name)) + geom_line()
Another question arises: why are we plotting ‘prop’? What about ‘n’?
babynames %>%
filter(name %in% c('Ringo', 'John', 'George', 'Paul')) %>%
filter(sex %in% "M") %>%
filter(year > 1964) %>%
ggplot(aes(year, n, color = name)) + geom_line()
The results are very similar for ‘n’ as they were for ‘prop.’ Let’s take an example name where that’s not the case:
babynames %>%
filter(name %in% "Mary") %>%
filter(sex %in% "F") %>%
filter(year < 1940) %>%
ggplot(aes(year, prop)) + geom_line()
babynames %>%
filter(name %in% "Mary") %>%
filter(sex %in% "F") %>%
filter(year < 1940) %>%
ggplot(aes(year, n)) + geom_line()
# you can also try 'Joseph' and 'M'
Remember that n is a simple count of the babies given a name in a given year, whereas prop is a calculation: the number of babies of one sex given that name in a given year, divided by the total number of births of that sex in that year. So when there are fewer names in the database, the difference between n and prop is more obvious. That’d be in the early years of the babynames dataset, when biblical names like Mary and Joseph were much more common relative to all of the names. Since then, there are just more names, meaning that the most common names of long ago have nearly all gone down in prop - even if their n value is increasing.
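The relationship between n and prop can be checked with a little arithmetic on the 1880 values printed by glimpse() earlier. This is a sketch that assumes prop is n divided by the births of one sex in one year; the variable names are made up.

```r
# Values copied from the 1880 rows of glimpse(babynames) above.
mary_n    <- 7065
mary_prop <- 0.07238359
anna_n    <- 2604

# If prop = n / total, then total = n / prop.
implied_total <- mary_n / mary_prop

# Dividing Anna's count by that same total should land on Anna's prop.
anna_prop <- anna_n / implied_total
round(anna_prop, 6)
```

The result can be compared against the second prop value in the glimpse() output.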
# note that we have not learned how to do this yet:
babynames %>%
group_by(year) %>%
summarize(total = n_distinct(name)) %>%
ggplot(aes(year, total)) + geom_line() +
xlab('Year') + ylab('Count of Unique Names')
Now that we have all of this information, we can try to answer specific questions. Let’s start with some basic ones - try to answer them on your own if you can. First: what was the most popular name in 1890?
babynames %>%
filter(year == 1890)
## # A tibble: 2,695 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1890 F Mary 12078 0.0599
## 2 1890 F Anna 5233 0.0259
## 3 1890 F Elizabeth 3112 0.0154
## 4 1890 F Margaret 3100 0.0154
## 5 1890 F Emma 2980 0.0148
## 6 1890 F Florence 2744 0.0136
## 7 1890 F Ethel 2718 0.0135
## 8 1890 F Minnie 2650 0.0131
## 9 1890 F Clara 2496 0.0124
## 10 1890 F Bertha 2388 0.0118
## # … with 2,685 more rows
But then what? And why two ‘=’ symbols?
Let’s start with ‘==.’ In R, you can create variables, and one way to do this is with a single ‘=’ symbol:
x = 10
In the case of babynames, we’re not aiming to create a variable - we just want to limit the data to only include entries in the year 1890 - so we need two ‘=’ symbols, which test whether two values are equal rather than assigning one.
Also, in this textbook we will be using the assignment arrow to create variables, because it works in two directions:
x <- 10
10 -> x
Way more useful than ‘=.’
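Side by side, the three spellings behave like this (a minimal Base R sketch; x, y and z are throwaway names):

```r
x = 10     # a single '=' assigns
x == 10    # a double '==' asks a question and answers TRUE or FALSE

y <- 10    # the arrow assigns, pointing at the name
10 -> z    # ...and it also works left-to-right

x == y && y == z
```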
Let’s take the entire babynames dataset and create a sub-set of the data that only includes entries from the year 1890, using a variable:
babynames %>%
filter(year == 1890) -> babynames_1890
That makes things much easier, if that’s the only year we want to look at - but we’ll have to plot our graphs differently.
Let’s begin with the arrange() function, which does as it sounds - we’ll tell it to arrange in descending order of prop:
babynames_1890 %>%
arrange(desc(prop))
## # A tibble: 2,695 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1890 M John 8502 0.0710
## 2 1890 M William 7494 0.0626
## 3 1890 F Mary 12078 0.0599
## 4 1890 M James 5097 0.0426
## 5 1890 M George 4458 0.0372
## 6 1890 M Charles 4061 0.0339
## 7 1890 F Anna 5233 0.0259
## 8 1890 M Frank 3078 0.0257
## 9 1890 M Joseph 2670 0.0223
## 10 1890 M Robert 2541 0.0212
## # … with 2,685 more rows
So, what was the answer to our question? Well, ‘John’ has the highest proportion, but Mary has the highest count.
Why does Mary have a higher ‘n’ value, but a lower proportion - in the same year? Because ‘prop’ also takes sex into account, i.e. it counts the proportion of each sex in each year that receives a given name.
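Working backwards from the printed values shows how different the two denominators are. This is a rough sketch: prop is rounded in the output, so the totals are approximate, and the variable names are made up.

```r
# Counts and proportions copied from the 1890 output above.
mary_n <- 12078; mary_prop <- 0.0599   # female
john_n <- 8502;  john_prop <- 0.0710   # male

# prop = n / (births of that sex), so births of that sex = n / prop.
female_total <- mary_n / mary_prop
male_total   <- john_n / john_prop

round(c(female = female_total, male = male_total))
```

Because the female total is so much larger, Mary's bigger count still works out to a smaller share.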
I think it’d be helpful to see how we can account for this somewhat confusing aspect of our data, by simply splitting it along the variable that is giving us a hard time: sex. [Note that all of our variables, or columns, are lowercase - everything is, except the names themselves. R is case-sensitive: if you uppercase a column, you’ll get an error, but if you lowercase a name, R will not detect your mistake and will simply return no rows.]
But sex is the least of our problems, as we’ll see.
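The two kinds of capitalization mistake fail differently, which this small sketch tries to show. The table toy is made up; its two rows come from the 1890 output above.

```r
library(dplyr)

toy <- tibble(name = c("Mary", "Anna"), n = c(12078L, 5233L))

# Lowercasing a name is not an error: the filter just matches nothing.
nrow(toy %>% filter(name == "mary"))

# Uppercasing a column is an error, which tryCatch() turns into a message here.
result <- tryCatch(
  toy %>% filter(Name == "Mary"),
  error = function(e) "error: column not found"
)
result
```

The silent case is the more dangerous one, since an empty result can look like a real answer.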
## Count ##
Let’s try using a new function, count(), which does exactly what it sounds like. We can specify to sort the results; for some archaic reason, we have to write the words ‘true’ and ‘false’ in ALL CAPS:
babynames_1890 %>%
count(sex, sort = TRUE)
## # A tibble: 2 × 2
## sex n
## <chr> <int>
## 1 F 1534
## 2 M 1161
That’s not very many names. Or Men.
count() can be helpful, but note that it strips away all of the columns except the one we counted - and adds an ‘n’ column of its own. But that’s it - prop, name, year and the original n - all gone. More on count() later.
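One way to see what count() keeps and what it throws away is to run it on a tiny table and spell out a rough equivalent. This sketch assumes a recent version of dplyr; toy is a made-up name and its rows come from the 1890 output above.

```r
library(dplyr)

# Four rows copied from the 1890 output above.
toy <- tibble(
  year = 1890,
  sex  = c("F", "F", "M", "M"),
  name = c("Mary", "Anna", "John", "William"),
  n    = c(12078L, 5233L, 8502L, 7494L)
)

counted <- toy %>% count(sex)
names(counted)   # only the counted column and the new n survive

# Roughly the same thing, spelled out:
spelled_out <- toy %>% group_by(sex) %>% summarise(n = n())
all.equal(as.data.frame(counted), as.data.frame(spelled_out))
```

Note that the new n counts rows per group; it is not the sum of the original n column.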
Let’s plot the most popular names of 1890 - but first, we have to cut them off, as 2,695 names is too many to plot:
babynames_1890 %>%
arrange(desc(prop)) %>%
head(10)
## # A tibble: 10 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1890 M John 8502 0.0710
## 2 1890 M William 7494 0.0626
## 3 1890 F Mary 12078 0.0599
## 4 1890 M James 5097 0.0426
## 5 1890 M George 4458 0.0372
## 6 1890 M Charles 4061 0.0339
## 7 1890 F Anna 5233 0.0259
## 8 1890 M Frank 3078 0.0257
## 9 1890 M Joseph 2670 0.0223
## 10 1890 M Robert 2541 0.0212
We can plot ten names, easy:
babynames_1890 %>%
arrange(desc(prop)) %>%
head(10) %>%
ggplot(aes(name, prop)) + geom_col()
I would prefer them in order. Why didn’t arrange(desc(prop)) do that for us? Well, long story, but basically, however you reshape your data, ggplot() is going to need its own set of instructions for how to visualize it.
So arrange() doesn’t work - we have to make the adjustment inside the ggplot() call:
babynames_1890 %>%
arrange(desc(prop)) %>%
head(10) %>%
ggplot(aes(reorder(name, prop), prop)) + geom_col()
OK, that got pretty complicated. Three nested parentheses? Let’s look at the offending line: originally, the ‘aesthetics’ of our ggplot() were:
aes(name, prop)
And we want to reorder the ‘names,’ based on ‘prop:’
reorder(name, prop)
In other words, ‘I want to reorder the names based on their proportion.’
Let’s put it back together:
aes(reorder(name, prop), prop)
That is the line we ran above, and it put the smallest bar first.
OK, but why is it in the wrong order?
reorder() sorts from smallest to largest by default. We just have to ‘tell’ ggplot() to reverse, or do the opposite of, the order it chose. We can use the minus sign for this:
babynames_1890 %>%
arrange(desc(prop)) %>%
head(10) %>%
ggplot(aes(reorder(name, -prop), prop)) + geom_col()
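Outside of a plot, reorder() and the minus sign can be seen at work on three values. This is a Base R sketch; the vectors name and prop are copied from the 1890 output above.

```r
# Three names and proportions copied from the 1890 output above.
name <- c("John", "William", "Mary")
prop <- c(0.0710, 0.0626, 0.0599)

levels(reorder(name, prop))    # smallest prop first
levels(reorder(name, -prop))   # minus sign flips it: largest prop first
```

ggplot() draws the bars in the order of these levels, which is why changing the sign changes the chart.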
## More Aesthetics ##
Let's add some color:
babynames_1890 %>%
arrange(desc(prop)) %>%
head(10) %>%
ggplot(aes(reorder(name, -prop), prop, color = "red")) + geom_col()
That didn’t work the way it did for The Beatles!
Of course, that was a line graph - geom_line() - and that line has a color. This time, ‘color’ is read as stroke or outline; fill controls our columns. Also, because the color we want is a fixed value rather than a column in our data, we move the command out of aes() and inside the geometry:
babynames_1890 %>%
arrange(desc(prop)) %>%
head(10) %>%
ggplot(aes(reorder(name, -prop), prop)) + geom_col(fill = "blue")
Great. Now let’s make the fill based on the value of a column:
babynames_1890 %>%
arrange(desc(prop)) %>%
head(10) %>%
ggplot(aes(reorder(name, -prop), prop, fill = name)) + geom_col()
OK, great. The names are all different colors because they are discrete data points, i.e. they are categories rather than measurements. Compare that with filling by a measured column, like ‘prop’:
babynames_1890 %>%
arrange(desc(prop)) %>%
head(10) %>%
ggplot(aes(reorder(name, -prop), prop, fill = prop)) + geom_col()
Since ‘prop’ is measured, it is visualized as a gradient of a single color.
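The three ways of filling columns used so far can be lined up next to each other. This sketch builds a made-up three-row data frame, top3, from the 1890 output above so that it runs without babynames.

```r
library(ggplot2)

# Three rows copied from the 1890 output above.
top3 <- data.frame(
  name = c("John", "William", "Mary"),
  prop = c(0.0710, 0.0626, 0.0599)
)

# Fixed color: a constant, so it sits outside aes(), inside the geometry.
p_fixed <- ggplot(top3, aes(name, prop)) + geom_col(fill = "blue")

# Discrete column mapped to fill: one distinct color per name.
p_discrete <- ggplot(top3, aes(name, prop, fill = name)) + geom_col()

# Measured column mapped to fill: a gradient of a single color.
p_continuous <- ggplot(top3, aes(name, prop, fill = prop)) + geom_col()
```

Printing each object in turn (for example, typing p_discrete) draws the corresponding chart.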
This would all look better sideways:
babynames_1890 %>%
arrange(desc(prop)) %>%
head(10) %>%
ggplot(aes(reorder(name, prop), prop, fill = prop)) + geom_col() + coord_flip()
By the way, we could ‘show’ the anomaly about John and Mary having confusing ‘n’ and ‘prop’ values by adjusting our aesthetics - let’s use fill:
babynames_1890 %>%
arrange(desc(prop)) %>%
head(10) %>%
ggplot(aes(reorder(name, prop), prop, fill = n)) + geom_col() + coord_flip()
We’d mentioned earlier that the ‘sex’ column is complicating our use of ‘prop’ over ‘n.’ To account for this problem, we could color by sex:
babynames_1890 %>%
arrange(desc(prop)) %>%
head(10) %>%
ggplot(aes(reorder(name, prop), prop, fill = sex)) + geom_col() + coord_flip()
One more ggplot() trick to change the way we visualize our data: making multiple graphs, based on a variable:
babynames_1890 %>%
arrange(desc(prop)) %>%
head(10) %>%
ggplot(aes(reorder(name, prop), prop, fill = sex)) +
geom_col() +
coord_flip() +
facet_wrap(~sex)
That looks…. terrible. Why? The function facet_wrap() is trying to use the same values, and the same scale, for each of the two graphs. Let’s ‘free’ the y-axis to account for this discrepancy:
babynames_1890 %>%
arrange(desc(prop)) %>%
head(10) %>%
ggplot(aes(reorder(name, prop), prop, fill = sex)) +
geom_col() +
coord_flip() +
facet_wrap(~sex, scales = "free_y")
Wow, that really changes things - weren’t there more female than male names, when we used count() earlier? Sure, but while there may be more variety in female names in 1890, females only have two of the ten most common names.
How would we answer a question like ‘what were the most popular names of 2017?’ I find it’s easiest to write out the process in English, then translate it to R and the Tidyverse:
‘Take babynames and filter it to only include entries from 2017. Then arrange the remaining entries in descending order of proportion.’
In the Tidyverse:
babynames %>%
filter(year == 2017) %>%
arrange(desc(prop))
## # A tibble: 32,469 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2017 F Emma 19738 0.0105
## 2 2017 F Olivia 18632 0.00994
## 3 2017 M Liam 18728 0.00954
## 4 2017 M Noah 18326 0.00933
## 5 2017 F Ava 15902 0.00848
## 6 2017 F Isabella 15100 0.00805
## 7 2017 F Sophia 14831 0.00791
## 8 2017 M William 14904 0.00759
## 9 2017 M James 14232 0.00725
## 10 2017 F Mia 13437 0.00717
## # … with 32,459 more rows
It looks like just over 1% of American girls in 2017 were named ‘Emma.’ Olivia, Liam and Noah follow close behind.
What about my name? My birth year? Just replace my values with yours:
babynames %>%
filter(name %in% c("Brian", "Bryan")) %>%
filter(sex == "M") %>%
ggplot(aes(year, prop, color = name)) + geom_line()
babynames %>%
filter(year == 1975) %>%
arrange(desc(prop)) %>%
head(10) %>%
ggplot(aes(reorder(name, prop),prop, fill = sex)) + geom_col() +
coord_flip()
I made the Top 10! (It’s all been downhill since then)
I encourage you to play around with babynames and get more comfortable deliberately modifying data before we continue. Or, if you’re getting sick of babies, you can use these tidyverse functions on any dataset. Let’s try using one of R’s built-in ones, mtcars:
data(mtcars)
View(mtcars)
Looks like lots of older car performance statistics. Let’s try comparing weight to mpg:
mtcars %>%
ggplot(aes(wt, mpg)) + geom_point()
geom_point() creates a scatterplot, but we can also see hints of an overall trend, or correlation here - so let’s add that to the ggplot():
mtcars %>%
ggplot(aes(wt, mpg)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
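The trend that the smoothed line hints at can also be put into a single number with Base R's cor() function. This is a sketch; cor() computes the Pearson correlation by default, and r is a throwaway name.

```r
# Correlation between weight and fuel economy in the built-in mtcars data.
r <- cor(mtcars$wt, mtcars$mpg)
round(r, 2)   # negative: heavier cars tend to get fewer miles per gallon
```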
Great. On to mutate().
While babynames has five columns, or variables, to play with, some observations require creating a calculated field of content - essentially generating a sixth column, in the case of babynames, to show something already in the data but not made clear.
babynames does not have rankings of popular names for each year. Could we create that column? Sure! When we rearrange our data in descending order of prop, we have essentially created rankings based on row number - we just need to ‘mutate’ the data frame to show it.
babynames %>%
filter(year == 2017) %>%
arrange(desc(prop)) %>%
mutate(rank = row_number())
## # A tibble: 32,469 × 6
## year sex name n prop rank
## <dbl> <chr> <chr> <int> <dbl> <int>
## 1 2017 F Emma 19738 0.0105 1
## 2 2017 F Olivia 18632 0.00994 2
## 3 2017 M Liam 18728 0.00954 3
## 4 2017 M Noah 18326 0.00933 4
## 5 2017 F Ava 15902 0.00848 5
## 6 2017 F Isabella 15100 0.00805 6
## 7 2017 F Sophia 14831 0.00791 7
## 8 2017 M William 14904 0.00759 8
## 9 2017 M James 14232 0.00725 9
## 10 2017 F Mia 13437 0.00717 10
## # … with 32,459 more rows
That looks good! We can now save this new, mutated dataset:
babynames %>%
filter(year == 2017) %>%
arrange(desc(prop)) %>%
mutate(rank = row_number()) -> babynames_2017_ranked
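The order of the two steps matters, which a deliberately shuffled toy table makes visible. This is a sketch; toy_2017 is a made-up name and its four rows come from the 2017 output above.

```r
library(dplyr)

# Four 2017 rows copied from the output above, deliberately shuffled.
toy_2017 <- tibble(
  name = c("Noah", "Emma", "Liam", "Olivia"),
  prop = c(0.00933, 0.0105, 0.00954, 0.00994)
)

ranked <- toy_2017 %>%
  arrange(desc(prop)) %>%        # sort first...
  mutate(rank = row_number())    # ...then number the rows 1, 2, 3, ...

ranked
```

Without the arrange() step, row_number() would simply number the rows in whatever order they happened to be in.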
Let’s try a practical use of mutate() by focusing on finding the most popular names of a particular generation.
According to Wikipedia, the ‘Silent Generation’ were born between the years of 1928 and 1944:
silent_gen <- babynames %>%
filter(year > 1927) %>%
filter(year < 1945)
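The two filters can also be written as one, using dplyr's between() helper, which includes both endpoints. This is a sketch on a made-up table of years rather than on babynames itself.

```r
library(dplyr)

years <- tibble(year = 1925:1947)

two_steps <- years %>% filter(year > 1927) %>% filter(year < 1945)
one_step  <- years %>% filter(between(year, 1928, 1944))  # both ends included

identical(two_steps, one_step)
range(one_step$year)
```

Either spelling works; the single call just makes the first and last year easier to read off.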
Ok, we have sub-setted our data to only include the years of this generation. Let’s further simplify things by only looking at female names of the Silent Generation - and also add a ‘rank’ column (for now, this simply numbers the rows in their existing order):
silent_gen %>%
filter(sex =="F") %>%
mutate(rank = row_number()) -> silent_gen_f
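Ok, let’s see some results - what are the most popular female names of the Silent Generation? Because the rows are still in their original order, the first ten belong to 1928, the first year of the generation: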
silent_gen_f %>%
head(10) %>%
ggplot(aes(name, prop)) + geom_col()
It appears that Mary is the most popular name during this period. But
wasn’t Mary popular back in 1890? Let’s look at these 10 names over
time:
babynames %>%
filter(name %in% c("Mary", "Barbara", "Betty", "Doris", "Dorothy", "Helen", "Margaret", "Ruth", "Shirley", "Virginia")) %>%
filter(sex =="F") %>%
ggplot(aes(year, prop, color = name)) + geom_line()
It appears that Mary is the most popular name for a very long time,
gradually waning as more and more unique names get added to the database
every year (therefore decreasing its proportion). Unlike the other 9
names, it definitely doesn’t peak during this generation - so let’s
remove it.
babynames %>%
filter(name %in% c("Barbara", "Betty", "Doris", "Dorothy", "Helen", "Margaret", "Ruth", "Shirley", "Virginia")) %>%
filter(sex =="F") %>%
ggplot(aes(year, prop, color = name)) + geom_line()
That looks much better! What is that one name that is peaking like crazy right in the middle?
babynames %>%
filter(name %in% "Shirley") %>%
filter(sex =="F") %>%
ggplot(aes(year, prop, color = name)) + geom_line()
Wow! What could we possibly blame this on? The popularity of Shirley Temple? There’s no way to quantitatively measure that, even if we think it to be true.
Similar to creating a pivot table, the summarize() command reshapes your data by …
When we look at the most popular names of the Silent Generation, we see a lot of names repeated:
silent_gen_f %>%
arrange(desc(prop)) %>%
head(10)
## # A tibble: 10 × 6
## year sex name n prop rank
## <dbl> <chr> <chr> <int> <dbl> <int>
## 1 1928 F Mary 66869 0.0559 1
## 2 1930 F Mary 64146 0.0550 10712
## 3 1929 F Mary 63510 0.0549 5437
## 4 1931 F Mary 60296 0.0546 15960
## 5 1932 F Mary 59872 0.0541 20937
## 6 1933 F Mary 55507 0.0531 26037
## 7 1934 F Mary 56924 0.0526 30895
## 8 1935 F Mary 55065 0.0507 35868
## 9 1937 F Mary 55642 0.0505 45616
## 10 1936 F Mary 54373 0.0505 40760
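One way to collapse repeats like these, offered as a sketch rather than as the method this chapter goes on to use, is to group by name and add up n across years. The toy table below is made up from the male 1880 and 1881 rows shown in the Beatles output earlier.

```r
library(dplyr)

# Male rows for 1880 and 1881, copied from the Beatles output earlier.
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  name = c("John", "George", "Paul", "John", "George", "Paul"),
  n    = c(9655L, 5126L, 301L, 8769L, 4664L, 291L)
)

totals <- toy %>%
  group_by(name) %>%               # one group per name
  summarise(total = sum(n)) %>%    # one row per name, summing across years
  arrange(desc(total))

totals
```

Each name now appears once, so a head(10) on the result would list ten different names.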
If we want to count the number of distinct names for each sex over time, we can group the data and summarise it:
babynames %>%
group_by(year, sex) %>%
summarise(name_count = n_distinct(name)) -> distinct
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
View(distinct)
ggplot(distinct, aes(year, name_count)) + geom_line() +
facet_wrap(~sex)
ggplot(births, aes(year, births)) + geom_line()
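The message about grouped output can be avoided by stating what should happen to the groups. This sketch sets the .groups argument of summarise() to "drop" on a made-up toy table built from the 1880 and 1881 rows shown in the Beatles output earlier.

```r
library(dplyr)

# 1880 and 1881 rows copied from the Beatles output earlier.
toy <- tibble(
  year = rep(c(1880, 1881), each = 5),
  sex  = rep(c("F", "F", "M", "M", "M"), times = 2),
  name = c("John", "George", "John", "George", "Paul",
           "George", "John", "John", "George", "Paul")
)

distinct_names <- toy %>%
  group_by(year, sex) %>%
  summarise(name_count = n_distinct(name), .groups = "drop")  # no leftover grouping

distinct_names
```

The result has one row per year and sex, and it is no longer grouped, so later steps treat it as an ordinary table.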
#stringr
skip to Chapter X [link] to find out how.