This is a Quarto document. It allows you to run your R analyses and then generate and publish them. You can type text in this white space. The slightly darker space below is called a chunk and is for your R code. You can switch back and forth between Source and Visual on the upper left. I personally prefer Source, but you may prefer Visual.
Often one of the first chunks you’ll see in a notebook has the packages you’ll be using. Using packages in R is a two-step process. First, you’ll need to download (what R calls ‘install’) the package. Do that in the Packages pane on the right: Click Install and download the packages you want to use. Next, you use the library() function to load the packages.
(Note: when you have library commands in your code that use packages that you have not yet installed, you may get a message at the top of your window asking you if you want to install them. Go ahead and do that if you want.)
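If you would rather check from the console which packages are still missing, the lines below are one way to do it. This is only a sketch: the variable names (pkgs, installed, missing_pkgs) are made up for illustration, and the install line is commented out so that nothing is downloaded until you choose to run it.

```r
# Packages this notebook loads with library()
pkgs <- c("tidyverse", "wordcloud2", "babynames")

# requireNamespace() returns TRUE when a package is installed and can be loaded
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to download them
} else {
  message("All packages are installed.")
}
```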
In the dark area below the two lines library(tidyverse) and library(wordcloud2), type the following line: library(babynames)
Then press the little green arrow in the top right of the chunk to run everything.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.1 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.1 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(wordcloud2)
library(babynames)
Viewing the data
The package ‘babynames’ has data on children’s names from the Social Security Administration.
First let’s look at the data by typing the name of the data set (it has the same name as the package) in the chunk below. Then press the little green arrow at the top right of the chunk.
babynames
# A tibble: 1,924,665 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Mary 7065 0.0724
2 1880 F Anna 2604 0.0267
3 1880 F Emma 2003 0.0205
4 1880 F Elizabeth 1939 0.0199
5 1880 F Minnie 1746 0.0179
6 1880 F Margaret 1578 0.0162
7 1880 F Ida 1472 0.0151
8 1880 F Alice 1414 0.0145
9 1880 F Bertha 1320 0.0135
10 1880 F Sarah 1288 0.0132
# ℹ 1,924,655 more rows
Another way to view it is with View(). Create a code chunk below by clicking the Code menu, and then Insert Chunk. Then type View(babynames). Make sure you pay attention to uppercase and lowercase!
There are variables for year, sex, name, n (the total number of people of that sex given that name that year), and prop (the proportion of people of that sex given that name that year).
Notice that there are almost 2 million rows in the data set!
One final way to look at a data set that you should know is with glimpse(). It will show the basic outline of the data: How many observations (rows), how many variables (columns), and the first several values of each variable.
Create another chunk below, this time using the little +C icon in the upper right (it does exactly the same thing as Code->Insert Chunk), and then use glimpse().
Filtering the data and using the pipe
If you want to see the names for just one year, use filter(). The filter() command will find rows based on some condition, like a year or the sex of a name. Run the chunk below:
filter(babynames, year == 1900)
# A tibble: 3,730 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1900 F Mary 16706 0.0526
2 1900 F Helen 6343 0.0200
3 1900 F Anna 6114 0.0192
4 1900 F Margaret 5304 0.0167
5 1900 F Ruth 4765 0.0150
6 1900 F Elizabeth 4096 0.0129
7 1900 F Florence 3920 0.0123
8 1900 F Ethel 3896 0.0123
9 1900 F Marie 3856 0.0121
10 1900 F Lillian 3414 0.0107
# ℹ 3,720 more rows
A better way to do the above command uses a symbol from the tidyverse called the ‘pipe,’ which looks like this: %>% It’s the percent symbol followed by the greater than symbol followed by another percent. An easy way to type it in RStudio is with control-shift-m (or command-shift-m on a Mac).
When you see the pipe, think “and then.” So the command in the chunk below says: take the babynames data, AND THEN filter it to the year 1900. It takes the first part, babynames, and sends it to the next part, filter(). Taking babynames out of the parentheses in filter() can make the command a little easier to read, especially when we start using longer sequences of commands.
babynames %>%
  filter(year == 1900)
# A tibble: 3,730 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1900 F Mary 16706 0.0526
2 1900 F Helen 6343 0.0200
3 1900 F Anna 6114 0.0192
4 1900 F Margaret 5304 0.0167
5 1900 F Ruth 4765 0.0150
6 1900 F Elizabeth 4096 0.0129
7 1900 F Florence 3920 0.0123
8 1900 F Ethel 3896 0.0123
9 1900 F Marie 3856 0.0121
10 1900 F Lillian 3414 0.0107
# ℹ 3,720 more rows
The chunk above does exactly the same thing as the previous.
Look at the names: A few of the names are still popular today, but many are not.
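To convince yourself that the piped and un-piped versions of a command really give the same result, you can compare them with identical(). The sketch below uses a tiny made-up data set called toy (not the real babynames data) so the result is easy to check by eye.

```r
library(dplyr)

# A tiny made-up data set (not the real babynames data)
toy <- tibble(
  year = c(1900, 1900, 1901),
  name = c("Mary", "John", "Mary"),
  n    = c(10, 8, 12)
)

nested <- filter(toy, year == 1900)      # data inside the parentheses
piped  <- toy %>% filter(year == 1900)   # data sent in with the pipe

identical(nested, piped)  # TRUE: both keep the same two rows
```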
To see the top “slice” of names use slice_max(). slice_max(prop, n = 10) will give you the 10 rows with the largest values of prop, sorted from largest to smallest.
babynames %>%
  filter(year == 1900) %>%
  slice_max(prop, n = 10)
# A tibble: 10 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1900 M John 9829 0.0606
2 1900 M William 8579 0.0529
3 1900 F Mary 16706 0.0526
4 1900 M James 7245 0.0447
5 1900 M George 5403 0.0333
6 1900 M Charles 4099 0.0253
7 1900 M Robert 3821 0.0236
8 1900 M Joseph 3714 0.0229
9 1900 M Frank 3477 0.0214
10 1900 F Helen 6343 0.0200
This table mixes male and female names together, which makes it hard to read: prop is calculated within each sex, so John ranks above Mary even though more babies were named Mary. Let’s filter for female names only. Note that we have to put letters like “F” in quotes, but numbers should not have quotes.
babynames %>%
  filter(year == 1900, sex == "F") %>%
  slice_max(prop, n = 10)
# A tibble: 10 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1900 F Mary 16706 0.0526
2 1900 F Helen 6343 0.0200
3 1900 F Anna 6114 0.0192
4 1900 F Margaret 5304 0.0167
5 1900 F Ruth 4765 0.0150
6 1900 F Elizabeth 4096 0.0129
7 1900 F Florence 3920 0.0123
8 1900 F Ethel 3896 0.0123
9 1900 F Marie 3856 0.0121
10 1900 F Lillian 3414 0.0107
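The comma between the two conditions in filter() works like AND: a row is kept only if every condition is true. The sketch below checks this on a small made-up data set (toy is invented for illustration) by comparing the comma form with the explicit & operator.

```r
library(dplyr)

# Made-up rows, two sexes in each of two years
toy <- tibble(
  year = c(1900, 1900, 1901, 1901),
  sex  = c("F", "M", "F", "M"),
  name = c("Mary", "John", "Mary", "John")
)

with_comma <- toy %>% filter(year == 1900, sex == "F")   # conditions separated by a comma
with_and   <- toy %>% filter(year == 1900 & sex == "F")  # the same conditions joined with &

identical(with_comma, with_and)  # TRUE: one row, Mary in 1900
```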
Create a code chunk below and find the top 10 names for your sex and the year you were born.
Creating new variables with mutate()
An important command is mutate(). It creates a new variable. The prop variable is a little hard to read because of the decimal point, so converting it to a percent will make it more readable.
babynames %>%
  filter(year == 1966, sex == "M") %>%
  mutate(percent = prop * 100) %>%
  slice_max(prop, n = 10)
# A tibble: 10 × 6
year sex name n prop percent
<dbl> <chr> <chr> <int> <dbl> <dbl>
1 1966 M Michael 79992 0.0440 4.40
2 1966 M David 66419 0.0365 3.65
3 1966 M James 65180 0.0359 3.59
4 1966 M John 65038 0.0358 3.58
5 1966 M Robert 59333 0.0326 3.26
6 1966 M William 38260 0.0210 2.10
7 1966 M Mark 34805 0.0191 1.91
8 1966 M Richard 34467 0.0190 1.90
9 1966 M Jeffrey 30198 0.0166 1.66
10 1966 M Thomas 29016 0.0160 1.60
Two more changes to simplify the way it looks: Using round() will get rid of some of the extra digits. round(x, 1) specifies 1 digit to the right of the decimal place. In addition, since we have percent, we don’t need to keep prop in our table. The ‘select’ function shows only the columns we want to show:
babynames %>%
  filter(year == 1966, sex == "M") %>%
  mutate(percent = round(prop * 100, 1)) %>%
  slice_max(prop, n = 10) %>%
  select(year, sex, name, percent)
# A tibble: 10 × 4
year sex name percent
<dbl> <chr> <chr> <dbl>
1 1966 M Michael 4.4
2 1966 M David 3.7
3 1966 M James 3.6
4 1966 M John 3.6
5 1966 M Robert 3.3
6 1966 M William 2.1
7 1966 M Mark 1.9
8 1966 M Richard 1.9
9 1966 M Jeffrey 1.7
10 1966 M Thomas 1.6
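Another function is row_number(), which is useful here because, once the data are filtered to one year and one sex, the rows are already ordered from most to least popular, so the row number corresponds to the popularity rank of the name.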
The following uses mutate() to create a new variable called ‘rank’ which is set to be equal to the row number of the name. Notice also that the # sign (pound sign or hashtag symbol) can be used to create comments in the code:
babynames %>% # start with the data set
  filter(year == 1966, sex == "M") %>% # choose only the year you want
  mutate(percent = round(prop * 100, 1)) %>% # convert prop to percent
  mutate(rank = row_number()) %>% # mutate() creates a new variable and calls it rank
  select(year, sex, name, percent, rank) # show only certain variables
# A tibble: 4,536 × 5
year sex name percent rank
<dbl> <chr> <chr> <dbl> <int>
1 1966 M Michael 4.4 1
2 1966 M David 3.7 2
3 1966 M James 3.6 3
4 1966 M John 3.6 4
5 1966 M Robert 3.3 5
6 1966 M William 2.1 6
7 1966 M Mark 1.9 7
8 1966 M Richard 1.9 8
9 1966 M Jeffrey 1.7 9
10 1966 M Thomas 1.6 10
# ℹ 4,526 more rows
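Keep in mind that row_number() only counts rows in whatever order they currently have. It matches popularity above because the filtered rows already appear from most to least popular. If your data were not sorted that way, one option is to sort first with arrange(). The sketch below shows this on a few made-up rows (toy is not the real data).

```r
library(dplyr)

# Made-up rows, deliberately NOT sorted by popularity
toy <- tibble(
  name = c("Mark", "Michael", "David"),
  prop = c(0.0191, 0.0440, 0.0365)
)

ranked <- toy %>%
  arrange(desc(prop)) %>%       # put the most popular name first
  mutate(rank = row_number())   # now the row number equals the rank

ranked
```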
We can use this rank variable as a measure of the popularity of a particular name. If I want to see how popular Matthew was in 1966, the year I was born, I add another filter() command for my name:
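The chunk that produced the output below is not shown here. Judging by its columns (rank comes before percent, and n and prop are still present), one pipeline that would give the same layout is sketched here. Treat it as a reconstruction under those assumptions, not necessarily the exact code that was run; the name matthew_1966 is made up for illustration.

```r
library(dplyr)
library(babynames)

matthew_1966 <- babynames %>%
  filter(year == 1966, sex == "M") %>%         # boys born in 1966
  mutate(rank = row_number()) %>%              # rank by row order
  mutate(percent = round(prop * 100, 1)) %>%   # prop as a percent
  filter(name == "Matthew")                    # keep only one name

matthew_1966
```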
# A tibble: 1 × 7
year sex name n prop rank percent
<dbl> <chr> <chr> <int> <dbl> <int> <dbl>
1 1966 M Matthew 10807 0.00594 34 0.6
This shows that Matthew was the 34th most popular boy name in 1966, with 10,807 babies given that name, or about 0.6% of baby boys that year.
Try this with a few other names, years, and sexes. Just copy-paste the above chunk, and then change the name, year, and sex.
Word clouds
A cute graphic for seeing the popularity of words is called a word cloud. It shows words (names in this case) sized by how often they occur.
babynames %>%
  filter(year == 2015) %>% # use only one year
  filter(sex == "F") %>% # use only one sex
  slice_max(prop, n = 100) %>% # use the top 100 names
  select(name, n) %>% # select the two relevant variables: the name and how often it occurs
  wordcloud2() # generate the word cloud
There are supposed to be 100 names in a semi-circle, but I see fewer than that and no circle. The problem is that the font is too big to show the whole thing. This makes the font size smaller with ‘size = .5’ in wordcloud2().
babynames %>%
  filter(year == 2015) %>% # use only one year
  filter(sex == "F") %>% # use only one sex
  slice_max(prop, n = 100) %>% # use the top 100 names
  select(name, n) %>% # select the two relevant variables: the name and how often it occurs
  wordcloud2(size = .5) # generate the word cloud
That looks better. Hover over a name and it will show the name and how many babies were given that name that year. Click run again and it will generate a slightly different picture each time.
Some other parameters you can change include shape and color. Set shape = “triangle”, “pentagon”, “diamond”, or “star”. Set color = “pink” or “blue”, etc.
babynames %>%
  filter(year == 2015) %>% # use only one year
  filter(sex == "F") %>% # use only one sex
  slice_max(prop, n = 100) %>% # use the top 100 names
  select(name, n) %>% # select the two relevant variables: the name and how often it occurs
  wordcloud2(size = .5, shape = "pentagon", color = "pink") # generate the word cloud
Copy and paste the chunk above to make one of your own with a different year and/or sex, and try a different shape and color (that last one was pretty ugly).
wordcloud2 also comes with a few different themes, or styles. This one uses WCtheme(2):
babynames %>%
  filter(year == 2015) %>% # use only one year
  filter(sex == "F") %>% # use only one sex
  slice_max(prop, n = 100) %>% # use the top 100 names
  select(name, n) %>% # select the two relevant variables: the name and how often it occurs
  wordcloud2(size = .5) + WCtheme(2) # generate the word cloud
Create a word cloud with a year and sex of your choice, size = .5, red colored words, a square shape, and theme 1.
Graphing changes in popularity over time
We can also look at a specific name’s popularity over time. ggplot2 is the most common graphing package in R, and it is also part of the tidyverse. Here’s one way to use it. Start with the data, then filter for just one name, then create the plot with ggplot() and aes (“aesthetics”) set so the x-axis is year and the y is the proportion of names. Then use the + to add the line “geometry” with geom_line.
babynames %>% # start with the babynames data
  filter(name == "Matthew") %>% # look only at the name Matthew
  ggplot(aes(x = year, y = prop)) + # create a graph with x and y aesthetics
  geom_line() # make it a line graph
It’s weird that the line zig-zags back and forth most years from a number and then back down to nearly 0. That’s because very few girls are named Matthew, so each year we’re getting a large value for the boys and a value close to 0 for the girls, and the line connects both. If you add sex == “M” to the filter it looks normal:
babynames %>% # start with the data
  filter(name == "Matthew", sex == "M") %>% # choose the name and sex
  ggplot(aes(x = year, y = prop)) + # put year on the x-axis and prop (proportion) on y
  geom_line() # make it a line graph
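One way to spot this kind of zig-zag before plotting is to count how many rows each year has. A line graph needs one row per year; two rows per year means two values are being joined by the same line. The sketch below uses made-up numbers (toy and its babies column are invented for illustration).

```r
library(dplyr)

# Made-up counts: each year has one row for boys and one for girls
toy <- tibble(
  year   = c(1980, 1980, 1981, 1981),
  sex    = c("M", "F", "M", "F"),
  babies = c(40000, 150, 42000, 160)
)

rows_all  <- toy %>% count(year)                         # n = 2 per year: zig-zag
rows_boys <- toy %>% filter(sex == "M") %>% count(year)  # n = 1 per year: a single line

rows_all
rows_boys
```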
This will make two final changes: 1. Mutate the prop variable into percent as above, and then use that as the y-axis, and 2. color the line blue.
babynames %>% # start with the data
  filter(name == "Matthew", sex == "M") %>% # choose the name and sex
  mutate(percent = round(prop * 100, 1)) %>% # create a new variable called percent
  ggplot(aes(x = year, y = percent)) + # put year on the x-axis and percent on y
  geom_line(color = "blue") # make it a line graph and give the line a color
I was born in 1966, and my name had increased in popularity about a decade before I was born, and then began to drop off in popularity about 15 years later. At its peak year, over 2% of baby boys were named Matthew.
Just out of curiosity I wanted to see if there were any girls named Matthew. I changed sex == “F” and I also changed y = n (so the y-axis was n, or the number of babies with that name), to see the absolute number of girls named Matthew rather than the proportion or percentage.
babynames %>%
  filter(name == "Matthew", sex == "F") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line()
During the 1970s and 1980s, a couple hundred girls each year were named Matthew! I wonder how many were really girls and how many of those were just errors in data coding. Did that many parents really name their girls Matthew?
My wife’s name is Michele (with one ‘l’), and her name peaked right around the time she was born:
babynames %>%
  filter(name == "Michele", sex == "F") %>%
  mutate(percent = round(prop * 100, 1)) %>%
  ggplot(aes(x = year, y = percent)) +
  geom_line()
Let’s look at both of my daughters’ names in one graph. In filter(), I put both of their names separated by the vertical line |, which is a symbol for OR. Then I set color = name, so the two names will have different color lines.
babynames %>%
  filter(name == "Emma" | name == "Julia", sex == "F") %>%
  mutate(percent = round(prop * 100, 1)) %>%
  ggplot(aes(x = year, y = percent, color = name)) +
  geom_line()
Their names both had a peak around the time they were born (1999 and 2003), but they were both also popular over 100 years ago. In fact, my daughter Emma was named after my wife’s great-grandmother.
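When you want to compare more than two names, chaining | gets long. The %in% operator is a common alternative that checks whether a value appears in a list of options. The sketch below, on a few made-up rows (toy is invented for illustration), shows that the two forms keep the same rows.

```r
library(dplyr)

# Made-up rows with three different names
toy <- tibble(
  name = c("Emma", "Julia", "Mary", "Emma"),
  year = c(1999, 2003, 1999, 2003)
)

with_or <- toy %>% filter(name == "Emma" | name == "Julia")  # OR written out
with_in <- toy %>% filter(name %in% c("Emma", "Julia"))      # the same test with %in%

identical(with_or, with_in)  # TRUE: both keep the three Emma/Julia rows
```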
Is it possible to guess the year someone was born by their name? One way to do this is to get the peak year for that name:
babynames %>% # start with the data set
  filter(name == "Michele", sex == "F") %>% # only look at the name you want
  slice_max(prop, n = 1) # get the year with the largest proportion for that name
# A tibble: 1 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1968 F Michele 11217 0.00656
My wife was born in 1968, so that’s a pretty good guess!
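If you plan to guess birth years for several people, the steps above can be wrapped in a small function. The helper below, peak_year(), is hypothetical (it is not part of dplyr or babynames), and the data it runs on here is made up so the answer is easy to verify.

```r
library(dplyr)

# Hypothetical helper: the year in which a name had its largest prop
peak_year <- function(data, which_name, which_sex) {
  data %>%
    filter(name == which_name, sex == which_sex) %>%
    slice_max(prop, n = 1, with_ties = FALSE) %>%  # keep a single row
    pull(year)                                     # return just the year
}

# Made-up data to show the idea
toy <- tibble(
  year = c(1967, 1968, 1969),
  sex  = "F",
  name = "Michele",
  prop = c(0.0060, 0.0066, 0.0061)
)

peak_year(toy, "Michele", "F")  # 1968
```

With the real data loaded, peak_year(babynames, "Michele", "F") should pick out the same year as the table above.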
Let’s get the top 10 years for Matthew. slice_max() returns them already sorted in descending order of prop, so no separate arrange() step is needed.
babynames %>% # start with the data set
  filter(name == "Matthew", sex == "M") %>% # only look at the name and sex you want
  slice_max(prop, n = 10) # get the top 10 years
# A tibble: 10 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1983 M Matthew 50214 0.0269
2 1984 M Matthew 49775 0.0265
3 1985 M Matthew 47073 0.0245
4 1986 M Matthew 46923 0.0244
5 1982 M Matthew 46060 0.0244
6 1987 M Matthew 46481 0.0238
7 1981 M Matthew 43330 0.0233
8 1988 M Matthew 45868 0.0229
9 1989 M Matthew 45371 0.0217
10 1990 M Matthew 44800 0.0208
The peak “Matthew” year was 1983, followed by years mostly right around that same time.
Copy and paste the chunk above and try a few different names and sexes. Could you have guessed when people you know were born?
Famous names
Some names become popular when there is a famous person with that name. See how Barack was non-existent prior to 2007, peaked the year Barack Obama became president, and then dropped off again.
babynames %>%
  filter(name == "Barack") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line()
The Disney movie The Little Mermaid came out in 1989 with Ariel as its star character, Aladdin with Princess Jasmine came out in 1992, and Frozen with Elsa came out in 2013.
babynames %>%
  filter(name == "Ariel" | name == "Elsa" | name == "Jasmine") %>%
  filter(sex == "F") %>%
  ggplot(aes(x = year, y = n, color = name)) +
  geom_line()
To get a better look at this, filter for years only after 1980, with filter(year > 1980).
Copy and paste the chunk above, but add filter(year > 1980) %>% on a new line right after filter(sex == “F”) %>% :
It looks like Ariel peaked in popularity a few years after The Little Mermaid came out in 1989, and Jasmine was very popular around the time Aladdin came out in 1992. Elsa never really became popular, but the movie came out in 2013, only a few years before the data set ends. Maybe we’re about to have a bumper crop of Elsas.
Assignment
Create a new R Notebook and create a report about your name. You may use someone else’s name if you prefer. To do this, click on the File -> New File -> R Notebook menu. You should keep the tab with this notebook open so you can see it too. Feel free to copy and paste between this notebook and your new project.
In the white space around the chunks of code, write brief descriptions of the analysis and what the results show. Write it so that a friend or family member could read and understand it.
Determine its rank the year you were born.
Create a word cloud of the names of your sex and the year you were born.
Graph its popularity over time.
Create a table showing which years it was most popular.
Graph its popularity in comparison to another name or two (e.g., a friend, family member, etc.). To keep it simple, use other names of the same sex.