Loading Packages and Data

The Social Security Administration keeps records on first names at birth, going back to 1880. Hadley Wickham has created an R package that lets us load this data directly into R. Using a package is a two-step process: first, you install the package onto your computer; second, you load the package as a library to access its functions and data in your current work environment. Put another way, a package needs to be installed only once, but it has to be loaded into your R environment every time you start up RStudio.

(Footnote: where do these packages come from, and who makes them? Why? By default, packages are hosted on an online repository referred to as CRAN; they can also be installed from GitHub. Anyone can make and share an R package, for instance if they have worked out a complicated workflow and want to share that insight with others to save them time and redundancy. Because R is package-based, users only need to load the packages relevant to their area of study - keep in mind R is used for everything from Health Sciences to Comparative Literature.)

install.packages('babynames')
library(babynames)
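
The install-once, load-every-time pattern can be sketched with a small helper. The function name ensure_installed() is hypothetical (it is not part of R or of any package); it is only meant to show that install.packages() needs to run just when requireNamespace() reports the package as missing.

```r
# Hypothetical helper: install a package only if it is not already present.
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # only runs when the package is missing
  }
  # Report whether the package is available now
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call finds it and downloads nothing
ensure_installed("stats")

# For this chapter you would run:
# ensure_installed("babynames")
# library(babynames)   # still needed in every new R session
```

Either way, library() is the step you repeat each session; the helper only saves you from re-downloading a package you already have.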

Let’s take a look at the babynames dataset by asking about its structure, and then taking a glimpse at it:

str(babynames)

## tibble [1,924,665 × 5] (S3: tbl_df/tbl/data.frame)
##  $ year: num [1:1924665] 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
##  $ sex : chr [1:1924665] "F" "F" "F" "F" ...
##  $ name: chr [1:1924665] "Mary" "Anna" "Emma" "Elizabeth" ...
##  $ n   : int [1:1924665] 7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
##  $ prop: num [1:1924665] 0.0724 0.0267 0.0205 0.0199 0.0179 ...

# glimpse() is not a Base R function, so we call it from the tibble package
tibble::glimpse(babynames)

## Rows: 1,924,665
## Columns: 5
## $ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880,…
## $ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", …
## $ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "Ida",…
## $ n    <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1258,…
## $ prop <dbl> 0.07238359, 0.02667896, 0.02052149, 0.01986579, 0.01788843, 0.016…

The output of glimpse is a little overwhelming at first, although it makes clear there are a lot of rows in it (Excel cannot handle 1,924,665 rows of data, but this is still not ‘Big Data’).

So let’s try head and tail to see the first and last six rows of the dataset:

head(babynames)

## # A tibble: 6 × 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162

tail(babynames)

## # A tibble: 6 × 5
##    year sex   name       n       prop
##   <dbl> <chr> <chr>  <int>      <dbl>
## 1  2017 M     Zyhier     5 0.00000255
## 2  2017 M     Zykai      5 0.00000255
## 3  2017 M     Zykeem     5 0.00000255
## 4  2017 M     Zylin      5 0.00000255
## 5  2017 M     Zylis      5 0.00000255
## 6  2017 M     Zyrie      5 0.00000255

Apart from glimpse(), which comes from the Tidyverse, every function we’ve used up to this point has been in Base R, that is, R without any functions added via packages. Throughout this book, we’ll be relying on a collection of inter-connected packages called the Tidyverse, which drastically simplifies performing varying calculations on a dataset. As with all packages, we first have to install it and then load it.

• Please note: when you install the ‘tidyverse’ package, you may get a prompt inside your R Console at the bottom-left, asking you a Yes/No question. You cannot proceed until you click your cursor in the Console and type out the word ‘Yes.’ Note that the Console may be cut off or obscured; you may need to adjust your RStudio windows to see it better.

install.packages('tidyverse')
library(tidyverse)

Loading the Tidyverse shows the eight packages that make it up, as well as a few warnings that we can disregard. We’ll begin by focusing on three packages that will help us work with babynames: dplyr, ggplot2 and stringr. All three are loaded once we run the library(tidyverse) command. The first, dplyr, we use to manipulate data: to filter it, rearrange it, count it, do calculations for custom columns, and summarize the data to our liking.

%>%

The greatest and most powerful functionality added to R via the Tidyverse, in my opinion, is the ‘pipe operator,’ which allows us to chain commands to each other. It can be tricky to type; the shortcut is Shift-Command-M (on a PC, it’d be Shift-Control-M).

Another useful logical operator for this dataset is the ‘includes’ operator, which looks like this: %in% .

I think an example will make the best sense of how to use these operators, so let’s get started with filtering.

Filtering Data

If we just type ‘babynames’ in R, we’ll see the first 10 rows of data, organized by year, and an indication that there are nearly 2 million more rows remaining. That’s way too much to visualize! Let’s start by filtering for specific names: I’ll use the names of The Beatles as a starting point.

babynames %>% 
  filter(name %in% 'Ringo')

## # A tibble: 28 × 5
##     year sex   name      n       prop
##    <dbl> <chr> <chr> <int>      <dbl>
##  1  1964 M     Ringo    12 0.00000592
##  2  1965 M     Ringo    18 0.0000095 
##  3  1966 M     Ringo     8 0.0000044 
##  4  1972 M     Ringo     6 0.00000358
##  5  1974 M     Ringo     6 0.00000368
##  6  1975 M     Ringo     8 0.00000493
##  7  1976 M     Ringo    13 0.00000796
##  8  1977 M     Ringo    10 0.00000585
##  9  1978 M     Ringo    12 0.00000702
## 10  1979 M     Ringo     9 0.00000502
## # … with 18 more rows

That ‘translates’ to ‘take the babynames dataset and filter it so only values in the name column that match ‘Ringo’ are included.’ The ‘and’ is the pipe operator ( %>% ), and the ‘match’ is the %in% operator.

If we wanted to look for more than one name, we’d change the syntax to use a ‘combine’ command ( c ), with the values comma-separated. Let’s try that with babynames and the Beatles.

babynames %>% 
  filter(name %in% c('Ringo', 'Paul', 'George', 'John'))

## # A tibble: 831 × 5
##     year sex   name       n     prop
##    <dbl> <chr> <chr>  <int>    <dbl>
##  1  1880 F     John      46 0.000471
##  2  1880 F     George    26 0.000266
##  3  1880 M     John    9655 0.0815  
##  4  1880 M     George  5126 0.0433  
##  5  1880 M     Paul     301 0.00254 
##  6  1881 F     George    30 0.000303
##  7  1881 F     John      26 0.000263
##  8  1881 M     John    8769 0.0810  
##  9  1881 M     George  4664 0.0431  
## 10  1881 M     Paul     291 0.00269 
## # … with 821 more rows

Now we get a ton of results - far too many to show on the screen.
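
To see what %in% is doing on its own, here is a minimal sketch on a plain vector, with no packages needed. The vector sample_names is made up for illustration; only the Beatles’ names come from the example above.

```r
beatles <- c("Ringo", "Paul", "George", "John")
sample_names <- c("John", "Mary", "Paul", "Anna", "Ringo")

# %in% asks, for each element on the left, "is it anywhere in the set on the right?"
is_beatle <- sample_names %in% beatles
is_beatle
# TRUE FALSE TRUE FALSE TRUE

# Keep only the matches, which is what filter() does with the rows
sample_names[is_beatle]
# "John" "Paul" "Ringo"
```

filter() works the same way: it evaluates the test for every row and keeps the rows where the answer is TRUE.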

Visualizing

That’s where data visualization comes in: it’s often impossible to even see your results from big data without plotting it into a visual, summarized form. Our visualization package is called ggplot2, and it’s amazingly straightforward to use, once you get used to its syntax:

ggplot(data, aes(x,y)) + geometry()

Huh?

Well, our function (think: verb) is ‘ggplot().’ Inside that, we define our ‘aesthetics’ (aes) for the visualization, which has its own set of parentheses. Inside the aesthetics, we define the columns we want to use for the x and y axes, and finally we define the type of geometry our visualization will use, such as a bar chart ( geom_col() ), scatterplot ( geom_point() ), or line chart ( geom_line() ), for instance.

Like most things in R and the Tidyverse, it’s easier to make sense of ggplot() through examples:

babynames %>% 
  filter(name %in% 'Ringo') %>% 
  ggplot(aes(year, prop)) + geom_line()

To review, we take ‘babynames,’ and filter so the name includes ‘Ringo’; we then plot by setting the aesthetics to use the ‘year’ and ‘prop’ columns of our dataset; our plot will be a line chart.

Learning by example often raises as many questions as it answers, so I’ll try to address those as we go along.

Let’s take this one step further:

babynames %>% 
  filter(name %in% c('Ringo', 'John', 'George', 'Paul')) %>% 
  ggplot(aes(year, prop, color = name)) + geom_line()

That looks really weird! Why? In summary, because there is a ‘sex’ column in the data. So, for any year - say, 1880 - there are two rows for John: 9,655 male Johns and 46 female Johns. ggplot is trying to connect both values with the same line, so the line zigzags between them.

Therefore, our solution is to filter out one sex:

babynames %>% 
  filter(name %in% c('Ringo', 'John', 'George', 'Paul')) %>% 
  filter(sex %in% "M") %>% 
  ggplot(aes(year, prop, color = name)) + geom_line()
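
One way to check for this kind of problem before plotting is to count the rows per year. The sketch below assumes dplyr is installed and uses a four-row toy table copied from the filter output earlier in this chapter, not the full dataset.

```r
library(dplyr)

# Four rows copied from the filter output earlier in this chapter
john <- tibble(
  year = c(1880, 1880, 1881, 1881),
  sex  = c("F", "M", "F", "M"),
  name = "John",
  n    = c(46L, 9655L, 26L, 8769L)
)

# Two rows per year: one line has to pass through both values
rows_all <- john %>% count(year)

# After filtering for one sex there is a single row per year
rows_m <- john %>% filter(sex == "M") %>% count(year)

rows_all$n  # 2 2
rows_m$n    # 1 1
```

Whenever a line chart zigzags, more than one row per x-value is a likely culprit.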

That’s much better - but what happened to Ringo?

The other names are so much more popular that his line - which doesn’t start showing up until 1964 - is very low in comparison.

Let’s try filtering the ‘year’ column in order to have our ggplot ‘zoom in’ on the Ringo section.

babynames %>% 
  filter(name %in% c('Ringo', 'John', 'George', 'Paul')) %>% 
  filter(sex %in% "M") %>% 
  filter(year > 1964) %>% 
  ggplot(aes(year, prop, color = name)) + geom_line()

Another question arises: why are we plotting ‘prop’? What about ‘n’?

babynames %>% 
  filter(name %in% c('Ringo', 'John', 'George', 'Paul')) %>% 
  filter(sex %in% "M") %>% 
  filter(year > 1964) %>% 
  ggplot(aes(year, n, color = name)) + geom_line()

The results are very similar for ‘n’ as they were for ‘prop.’ Let’s take an example name where that’s not the case:

babynames %>% 
  filter(name %in% "Mary") %>% 
  filter(sex %in% "F") %>% 
  filter(year < 1940) %>% 
  ggplot(aes(year, prop)) + geom_line()

babynames %>% 
  filter(name %in% "Mary") %>% 
  filter(sex %in% "F") %>% 
  filter(year < 1940) %>% 
  ggplot(aes(year, n)) + geom_line()

# you can also try 'Joseph' and 'M'

Remember that n is a simple count of the babies given a name in a given year, whereas prop is a calculation: the number of babies of one sex given that name in a given year, divided by the total number of births of that sex in that year. So when there are fewer names in the database, the difference between n and prop is more obvious. That’d be in the early years of the babynames dataset, when biblical names like Mary and Joseph were much more common relative to all of the names. Since then, there are just more names, meaning the most common names of long ago have nearly all gone down in prop - even if their n value is increasing.
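
As a rough check on that definition, we can work backwards: if prop is n divided by a total, then n divided by prop should give the same total for every name of the same sex and year. The sketch below uses the 1880 values for Mary and Anna as printed in the glimpse() output; the variable names are made up for illustration.

```r
# n and prop for two girls' names in 1880, copied from the glimpse() output
n    <- c(Mary = 7065, Anna = 2604)
prop <- c(Mary = 0.07238359, Anna = 0.02667896)

# If prop = n / total, then total = n / prop
implied_total <- round(n / prop)
implied_total
```

Both names point to the same total of roughly 97,600, which is what we would expect if they share one denominator.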

# note that we have not learned how to do this yet:

babynames %>% 
  group_by(year) %>% 
  summarize(total = n_distinct(name)) %>% 
  ggplot(aes(year, total)) + geom_line() +
  xlab('Year') + ylab('Count of Unique Names')

Now that we have all of this information, we can try to answer specific questions. Let’s start with some basic ones - try to answer them on your own if you can.

What was the most popular name in 1890, one of the earliest years in the database?

babynames %>% 
  filter(year == 1890)

## # A tibble: 2,695 × 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1890 F     Mary      12078 0.0599
##  2  1890 F     Anna       5233 0.0259
##  3  1890 F     Elizabeth  3112 0.0154
##  4  1890 F     Margaret   3100 0.0154
##  5  1890 F     Emma       2980 0.0148
##  6  1890 F     Florence   2744 0.0136
##  7  1890 F     Ethel      2718 0.0135
##  8  1890 F     Minnie     2650 0.0131
##  9  1890 F     Clara      2496 0.0124
## 10  1890 F     Bertha     2388 0.0118
## # … with 2,685 more rows

But then what? And why two ‘=’ symbols?

Let’s start with ‘==’. In R, you can create variables, and one way to do this is with a single ‘=’ symbol:

x = 10

In the case of babynames, we’re not aiming to create a variable - we just want to limit the data to only include entries in the year 1890 - so we need to use two ‘=’ symbols, which test whether two values are equal instead of assigning one to the other.

Also, in this textbook we will be using the arrow operator to create variables, because it works in two directions:

x <- 10
10 -> x

Way more useful than ‘=.’
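
The difference between assigning and comparing is easy to try out in Base R. In this small sketch the vector years is made up for illustration; it stands in for the year column.

```r
x <- 10          # assignment: store 10 in x
x == 10          # comparison: TRUE
x == 11          # comparison: FALSE

# On a whole column (here a plain vector), == compares every element
years <- c(1889, 1890, 1891)
is_1890 <- years == 1890
is_1890
# FALSE TRUE FALSE
```

filter(year == 1890) relies on exactly this row-by-row TRUE/FALSE result.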

Let’s take the entire babynames dataset and create a sub-set of the data that only includes entries from the year 1890, using a variable:

babynames %>% 
  filter(year == 1890) -> babynames_1890

That makes things much easier, if that’s the only year we want to look at - but we’ll have to plot our graphs differently.

Modifying Data

Let’s begin with the arrange() function, which does as it sounds - we’ll tell it to arrange in descending order of prop:

babynames_1890 %>% 
  arrange(desc(prop))

## # A tibble: 2,695 × 5
##     year sex   name        n   prop
##    <dbl> <chr> <chr>   <int>  <dbl>
##  1  1890 M     John     8502 0.0710
##  2  1890 M     William  7494 0.0626
##  3  1890 F     Mary    12078 0.0599
##  4  1890 M     James    5097 0.0426
##  5  1890 M     George   4458 0.0372
##  6  1890 M     Charles  4061 0.0339
##  7  1890 F     Anna     5233 0.0259
##  8  1890 M     Frank    3078 0.0257
##  9  1890 M     Joseph   2670 0.0223
## 10  1890 M     Robert   2541 0.0212
## # … with 2,685 more rows

So, what was the answer to our question? Well, ‘John’ has the highest proportion, but Mary has the highest count.

Why does Mary have a higher ‘n’ value, but a lower proportion - in the same year? Because ‘prop’ also takes sex into account, i.e. it counts the proportion of each sex in each year that receives a given name.

I think it’d be helpful to see how we can account for this somewhat confusing aspect of our data, by simply splitting it along the variable that is giving us a hard time: sex. [Note that all of our variables, or columns, are lowercase - everything is, except the names themselves. R is case-sensitive: if you lowercase a name, the filter will quietly return no rows, and if you uppercase a column, R will stop with an error saying it cannot find it.]

But sex is the least of our problems, as we’ll see.

Count

Let’s try using a new function, count(), which does exactly what it sounds like. We can ask it to sort the results; note that R’s logical values, TRUE and FALSE, have to be written in ALL CAPS:

babynames_1890 %>% 
  count(sex, sort = TRUE)

## # A tibble: 2 × 2
##   sex       n
##   <chr> <int>
## 1 F      1534
## 2 M      1161

That’s not very many names. Or Men.

count() can be helpful, but note that it strips away all of the columns except the one we counted - and adds a new ‘n’ column holding the number of rows. But that’s it - year, name, prop and the original n - all gone. More on count() later.
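
Here is a small sketch of that behavior, assuming dplyr is installed. The three-row table toy is copied from the 1890 output above, so the result is easy to verify by eye.

```r
library(dplyr)

# Three rows copied from the 1890 output above
toy <- tibble(
  year = 1890,
  sex  = c("F", "F", "M"),
  name = c("Mary", "Anna", "John"),
  n    = c(12078L, 5233L, 8502L)
)

counted <- toy %>% count(sex, sort = TRUE)
names(counted)  # "sex" "n"
counted$n       # 2 1  (rows per sex, not babies)
```

Note that the new n counts rows (here, names), not the babies recorded in the original n column.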

Let’s plot the most popular names of 1890 - but first, we have to cut them off, as 2,695 names is too many to plot:

babynames_1890 %>% 
  arrange(desc(prop)) %>% 
  head(10)

## # A tibble: 10 × 5
##     year sex   name        n   prop
##    <dbl> <chr> <chr>   <int>  <dbl>
##  1  1890 M     John     8502 0.0710
##  2  1890 M     William  7494 0.0626
##  3  1890 F     Mary    12078 0.0599
##  4  1890 M     James    5097 0.0426
##  5  1890 M     George   4458 0.0372
##  6  1890 M     Charles  4061 0.0339
##  7  1890 F     Anna     5233 0.0259
##  8  1890 M     Frank    3078 0.0257
##  9  1890 M     Joseph   2670 0.0223
## 10  1890 M     Robert   2541 0.0212

We can plot ten names, easy:

babynames_1890 %>% 
  arrange(desc(prop)) %>% 
  head(10) %>% 
  ggplot(aes(name, prop)) + geom_col()

I would prefer them in order. Why didn’t arrange(desc(prop)) do that for us? Well, long story, but basically, however you reshape your data, ggplot() is going to need its own set of instructions for how to visualize it. Left to itself, it sorts text values like ‘name’ alphabetically along the axis.

So arrange() doesn’t work - we have to make the adjustment inside the ggplot() call:

babynames_1890 %>% 
  arrange(desc(prop)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, prop), prop)) + geom_col()

OK, that got pretty complicated. Three nested parentheses? Let’s look at the offending line: originally, the ‘aesthetics’ of our ggplot() were:

aes(name, prop)

And we want to reorder the ‘names,’ based on ‘prop:’

reorder(name, prop)

In other words, ‘I want to reorder the names based on their proportion.’
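
reorder() is a Base R function, so we can look at what it does outside of ggplot(). This minimal sketch uses three of the 1890 values from the output above; it shows that reorder() turns the names into a factor whose levels are sorted by the second argument.

```r
name <- c("Mary", "John", "William")
prop <- c(0.0599, 0.0710, 0.0626)   # 1890 values from the output above

# A plain factor sorts its levels alphabetically
alphabetical <- levels(factor(name))
alphabetical
# "John" "Mary" "William"

# reorder() sorts the levels by prop instead, lowest first
by_prop <- levels(reorder(name, prop))
by_prop
# "Mary" "William" "John"
```

ggplot() draws a discrete axis in level order, which is why changing the levels changes the order of the bars.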

Let’s put it back together:

aes(reorder(name, prop), prop)

Seeing it in action in the chart above raises a question:

• OK, but why is it in the wrong order?

By default, reorder() sorts from the lowest value to the highest. We just have to ‘tell’ ggplot() to reverse, or do the opposite of, the order it chose. We can use the minus sign for this:

babynames_1890 %>% 
  arrange(desc(prop)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, -prop), prop)) + geom_col()

More Aesthetics

Let’s add some color:

babynames_1890 %>% 
  arrange(desc(prop)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, -prop), prop, color = "red")) + geom_col()

That didn’t work the way it did for The Beatles!

Of course, that was a line graph - geom_line() - and that line has a color. This time, ‘color’ is read as stroke or outline; fill controls our columns. Also, because we want one fixed color rather than a color based on a column, we move the command out of aes() and into the geometry:

babynames_1890 %>% 
  arrange(desc(prop)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, -prop), prop)) + geom_col(fill = "blue")

Great. Now let’s make the fill based on the value of a column:

babynames_1890 %>% 
  arrange(desc(prop)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, -prop), prop, fill = name)) + geom_col()

OK, great. The names are all different colors because they are discrete data points, i.e. they are not measured - unlike ‘prop’:

babynames_1890 %>% 
  arrange(desc(prop)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, -prop), prop, fill = prop)) + geom_col()

Since ‘prop’ is measured, it is visualized as a range of a single color.

This would all look better sideways:

babynames_1890 %>% 
  arrange(desc(prop)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, prop), prop, fill = prop)) + geom_col() + coord_flip()

By the way, we could ‘show’ the anomaly about John and Mary having confusing ‘n’ and ‘prop’ values by adjusting our aesthetics - let’s use fill:

babynames_1890 %>% 
  arrange(desc(prop)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, prop), prop, fill = n)) + geom_col() + coord_flip()

We’d mentioned earlier that the ‘sex’ column is complicating our use of ‘prop’ over ‘n.’ To account for this problem, we could color by sex:

babynames_1890 %>% 
  arrange(desc(prop)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, prop), prop, fill = sex)) + geom_col() + coord_flip()

One more ggplot() trick to change the way we visualize our data: making multiple graphs, based on a variable:

babynames_1890 %>% 
  arrange(desc(prop)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, prop), prop, fill = sex)) + 
  geom_col() + 
  coord_flip() +
  facet_wrap(~sex)

That looks…. terrible. Why? The function facet_wrap() is trying to use the same values, and the same scale, for each of the two graphs. Let’s ‘free’ the y-axis to account for this discrepancy:

babynames_1890 %>% 
  arrange(desc(prop)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, prop), prop, fill = sex)) + 
  geom_col() + 
  coord_flip() +
  facet_wrap(~sex, scales = "free_y")

Wow, that really changes things - weren’t there more female than male names, when we used count() earlier? Sure, but while there may be more variety in female names in 1890, females only have two of the ten most common names.

Review

OK, so now we can answer some other direct questions:

What were the most popular names in 2017, the most recent year of the database?

How would we answer this? I find it’s easiest to write out the process in English, then translate it to R and the Tidyverse:

‘Take babynames and filter it to only include entries from 2017. Then arrange the remaining entries in descending order of proportion.’

In the Tidyverse:

babynames %>% 
  filter(year == 2017) %>% 
  arrange(desc(prop))

## # A tibble: 32,469 × 5
##     year sex   name         n    prop
##    <dbl> <chr> <chr>    <int>   <dbl>
##  1  2017 F     Emma     19738 0.0105 
##  2  2017 F     Olivia   18632 0.00994
##  3  2017 M     Liam     18728 0.00954
##  4  2017 M     Noah     18326 0.00933
##  5  2017 F     Ava      15902 0.00848
##  6  2017 F     Isabella 15100 0.00805
##  7  2017 F     Sophia   14831 0.00791
##  8  2017 M     William  14904 0.00759
##  9  2017 M     James    14232 0.00725
## 10  2017 F     Mia      13437 0.00717
## # … with 32,459 more rows

It looks like about 1% of American girls in 2017 were named ‘Emma.’ Olivia, Liam and Noah are also very popular.

What about my name? My birth year? Just replace my values with yours:

babynames %>% 
  filter(name %in% c("Brian", "Bryan")) %>% 
  filter(sex == "M") %>% 
  ggplot(aes(year, prop, color = name)) + geom_line()

babynames %>% 
  filter(year == 1975) %>% 
  arrange(desc(prop)) %>% 
  head(10) %>% 
  ggplot(aes(reorder(name, prop),prop, fill = sex)) + geom_col() +
  coord_flip()

I made the Top 10! (It’s all been downhill since then)

A few notes on the code above:

• We leave quotes off of the year, as it is numeric - only strings, or characters, get quotes.
• We make sure to reorder the data (based on popularity, or ‘prop’) before limiting it to the top 10 results. Otherwise, it could be organized alphabetically or something - and we’d be getting 10 names, but not the 10 most popular names.
• We have to reorder our ‘name’ in the ggplot() to go in order of prop - even though we just reorganized the data this way two steps ago, ggplot() uses its own internal logic to organize the data.
• We have to plot as a bar chart, or column chart. Why? Because line charts are for continuous variables, like year - not discrete ones, like ‘name.’ If you can measure it, it’s continuous. If you can count it, it’s discrete.
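
On the point about quotes: R will often let you get away with quoting a number, because it quietly converts the number to text before comparing. The Base R sketch below shows why that is risky; the value 999 is made up to make the problem visible.

```r
# Equality happens to work: the number is converted to the text "1890"
same_value <- 1890 == "1890"
same_value      # TRUE

# Less-than is where it goes wrong
numeric_less <- 999 < 1940
numeric_less    # TRUE: compared as numbers

text_less <- 999 < "1940"
text_less       # FALSE: compared as text, and "9" sorts after "1"
```

With four-digit years the text comparison happens to give the right answer, but leaving the quotes off means you never have to think about it.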

I encourage you to play around with babynames and get more comfortable deliberately modifying data before we continue. Or, if you’re getting sick of babies, you can use these tidyverse functions on any dataset. Let’s try using one of R’s built-in ones, mtcars:

data(mtcars)
View(mtcars)

Looks like lots of older car performance statistics. Let’s try comparing weight to mpg:

mtcars %>% 
  ggplot(aes(wt, mpg)) + geom_point()

geom_point() creates a scatterplot, but we can also see hints of an overall trend, or correlation here - so let’s add that to the ggplot():

mtcars %>% 
  ggplot(aes(wt, mpg)) + geom_point() + geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
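
If you want a single number to go with that trend line, Base R’s cor() computes the correlation between two columns. This is a quick sketch on the built-in mtcars data; the variable name wt_mpg_cor is made up for illustration.

```r
data(mtcars)

# Correlation between weight and miles per gallon
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)
wt_mpg_cor   # negative: heavier cars tend to get fewer miles per gallon
```

A value close to -1 matches the downward slope we see in the scatterplot.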

Great. On to mutate().

Mutate

While babynames has five columns, or variables, to play with, some observations require creating a calculated field - essentially generating a sixth column, in the case of babynames, to show something that is already in the data but not made explicit.

babynames does not have rankings of popular names for each year. Could we create that column? Sure! When we rearrange our data in descending order of prop, we have essentially created rankings based on row number - we just need to ‘mutate’ the data frame to show it.

babynames %>% 
  filter(year == 2017) %>% 
  arrange(desc(prop)) %>% 
  mutate(rank = row_number())

## # A tibble: 32,469 × 6
##     year sex   name         n    prop  rank
##    <dbl> <chr> <chr>    <int>   <dbl> <int>
##  1  2017 F     Emma     19738 0.0105      1
##  2  2017 F     Olivia   18632 0.00994     2
##  3  2017 M     Liam     18728 0.00954     3
##  4  2017 M     Noah     18326 0.00933     4
##  5  2017 F     Ava      15902 0.00848     5
##  6  2017 F     Isabella 15100 0.00805     6
##  7  2017 F     Sophia   14831 0.00791     7
##  8  2017 M     William  14904 0.00759     8
##  9  2017 M     James    14232 0.00725     9
## 10  2017 F     Mia      13437 0.00717    10
## # … with 32,459 more rows

That looks good! We can now save this new, mutated dataset:

babynames %>% 
  filter(year == 2017) %>% 
  arrange(desc(prop)) %>% 
  mutate(rank = row_number()) -> babynames_2017_ranked
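
Keep in mind that row_number() only numbers the rows in whatever order they are currently in, so the arrange() step is what makes ‘rank’ meaningful. The sketch below assumes dplyr is installed and uses three 2017 rows from the output above, deliberately placed out of order.

```r
library(dplyr)

# Three 2017 rows from the output above, deliberately out of order
toy <- tibble(
  name = c("Liam", "Emma", "Olivia"),
  prop = c(0.00954, 0.0105, 0.00994)
)

# Without sorting, rank just follows the existing row order
unsorted <- toy %>% mutate(rank = row_number())

# Sorting first makes rank 1 the most popular name
ranked <- toy %>% arrange(desc(prop)) %>% mutate(rank = row_number())

unsorted$name[unsorted$rank == 1]  # "Liam"
ranked$name[ranked$rank == 1]      # "Emma"
```

If a rank column ever looks odd, checking whether the data was sorted before mutate() is a good first step.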

Let’s try a practical use of mutate() by focusing on finding the most popular names of a particular generation.

According to Wikipedia, the ‘Silent Generation’ were born between the years of 1928 and 1944:

silent_gen <- babynames %>% 
  filter(year > 1927) %>% 
  filter(year < 1945)

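OK, we have sub-setted our data to only include the years of this generation. Let’s further simplify things by only looking at female names of the Silent Generation - and also add a ‘rank’ column. Note that we haven’t re-sorted the data this time, so row_number() simply numbers the rows in their existing order (by year, then by count):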

silent_gen %>% 
  filter(sex =="F") %>% 
  mutate(rank = row_number()) -> silent_gen_f

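OK, let’s see some results - what are the most popular female names of the Silent Generation? Since the data is still sorted by year, head(10) gives us the ten most common names from the generation’s first year, 1928: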

silent_gen_f %>% 
  head(10) %>% 
  ggplot(aes(name, prop)) + geom_col()

It appears that Mary is the most popular name during this period. But wasn’t Mary popular back in 1890? Let’s look at these 10 names over time:

babynames %>% 
  filter(name %in% c("Mary", "Barbara", "Betty", "Doris", "Dorothy", "Helen", "Margaret", "Ruth", "Shirley", "Virginia")) %>% 
  filter(sex =="F") %>% 
  ggplot(aes(year, prop, color = name)) + geom_line()

It appears that Mary is the most popular name for a very long time, gradually waning as more and more unique names get added to the database every year (therefore decreasing its proportion). Unlike the other 9 names, it definitely doesn’t peak during this generation - so let’s remove it.

babynames %>% 
  filter(name %in% c("Barbara", "Betty", "Doris", "Dorothy", "Helen", "Margaret", "Ruth", "Shirley", "Virginia")) %>% 
  filter(sex =="F") %>% 
  ggplot(aes(year, prop, color = name)) + geom_line()

That looks much better! What is that one name that is peaking like crazy right in the middle?

babynames %>% 
  filter(name %in% "Shirley") %>% 
  filter(sex =="F") %>% 
  ggplot(aes(year, prop, color = name)) + geom_line()

Wow! What could we possibly blame this on? The popularity of Shirley Temple? There’s no way to quantitatively measure that, even if we think it to be true.

group_by() and summarise()

Similar to creating a pivot table, the summarize() command reshapes your data by …

When we look at the most popular names of the Silent Generation, we see a lot of names repeated:

silent_gen_f %>% 
  arrange(desc(prop)) %>% 
  head(10)

## # A tibble: 10 × 6
##     year sex   name      n   prop  rank
##    <dbl> <chr> <chr> <int>  <dbl> <int>
##  1  1928 F     Mary  66869 0.0559     1
##  2  1930 F     Mary  64146 0.0550 10712
##  3  1929 F     Mary  63510 0.0549  5437
##  4  1931 F     Mary  60296 0.0546 15960
##  5  1932 F     Mary  59872 0.0541 20937
##  6  1933 F     Mary  55507 0.0531 26037
##  7  1934 F     Mary  56924 0.0526 30895
##  8  1935 F     Mary  55065 0.0507 35868
##  9  1937 F     Mary  55642 0.0505 45616
## 10  1936 F     Mary  54373 0.0505 40760

If we want to count the number of distinct names in each year, split by sex, we can group the data and then summarise each group:

babynames %>% 
  group_by(year, sex) %>% 
  summarise(name_count = n_distinct(name)) -> distinct

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

View(distinct)
ggplot(distinct, aes(year, name_count)) + geom_line() +
  facet_wrap(~sex)

ggplot(births, aes(year, births)) + geom_line()
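
To see what group_by() and summarise() do on a scale small enough to check by hand, here is a sketch on a four-row toy table, assuming dplyr is installed. The rows are copied from the 1890 and 2017 outputs earlier in this chapter; the result names per_year and per_name are made up for illustration.

```r
library(dplyr)

# Four rows copied from the 1890 and 2017 outputs (male William and James)
toy <- tibble(
  year = c(1890, 1890, 2017, 2017),
  name = c("William", "James", "William", "James"),
  n    = c(7494L, 5097L, 14904L, 14232L)
)

# One row per year: how many distinct names appear in that year?
per_year <- toy %>%
  group_by(year) %>%
  summarise(name_count = n_distinct(name))

# One row per name: total babies across both years
per_name <- toy %>%
  group_by(name) %>%
  summarise(total = sum(n))

per_year$name_count  # 2 2
per_name$total       # 19329 22398 (James, William)
```

The pattern is always the same: group_by() decides what each output row stands for, and summarise() decides what gets calculated for it.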

stringr

skip to Chapter X [link] to find out how.

Load, Filter & Plot: Babynames

Brian Walsh

5/20/2022