In this notebook we will analyze the distribution of a name over time.
We will also learn about the importance of reproducible research, which means that your research should be documented in a way that allows anyone else to repeat exactly what you did.
Type your answers to the questions right in this notebook.
For the United States, the best data on baby names comes from the Social Security Administration because almost every US citizen applies for a Social Security number. Visit the [baby names website](https://www.ssa.gov/oact/babynames/index.html).
Try exploring your name on the website by looking at different time periods and states.
The Social Security site is fun and easy to use, but it has some limitations.
First, it only includes the top 1000 names. That is a lot of names, but not close to the real number of names out there. Also, making comparisons (such as between two names, or the same name across decades) is not very easy. Finally, if someone had to reproduce exactly the same analysis you did, there are lots of ways it could go wrong, from misspelling the names to making a mistake in one of the selections.
Here is a link to the [background information about the baby names data](https://www.ssa.gov/OACT/babynames/background.html). A key point to note is that, to protect the privacy of people with unusual names, only names that are used at least 5 times in a year are included in the data.
If your name does not appear please pick a different name (suggestion: use a different spelling or use the name of a family member or friend of a similar age to you).
We are going to use a data set from the R package called babynames. The author has already downloaded the full data for you and stored it in R data frames.
The first thing we will do is load the babynames package and look at the first or last few records of its data frames. Click on the green arrow on the right to run the code.
library(babynames)
head(babynames)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.07238433
## 2 1880 F Anna 2604 0.02667923
## 3 1880 F Emma 2003 0.02052170
## 4 1880 F Elizabeth 1939 0.01986599
## 5 1880 F Minnie 1746 0.01788861
## 6 1880 F Margaret 1578 0.01616737
head(applicants)
## # A tibble: 6 x 3
## year sex n_all
## <int> <chr> <int>
## 1 1880 F 97604
## 2 1880 M 118399
## 3 1881 F 98855
## 4 1881 M 108282
## 5 1882 F 115696
## 6 1882 M 122031
tail(lifetables)
## # A tibble: 6 x 9
## x qx lx dx Lx Tx ex sex year
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fctr> <dbl>
## 1 114 0.42433 76 32 60 130 1.72 F 2010
## 2 115 0.44791 44 20 34 71 1.62 F 2010
## 3 116 0.47283 24 11 18 37 1.52 F 2010
## 4 117 0.49531 13 6 10 18 1.44 F 2010
## 5 118 0.51795 6 3 5 9 1.36 F 2010
## 6 119 0.54163 3 2 2 4 1.28 F 2010
tail(births)
## # A tibble: 6 x 2
## year births
## <int> <int>
## 1 2009 4130665
## 2 2010 3999386
## 3 2011 3953590
## 4 2012 3952841
## 5 2013 3932181
## 6 2014 3988076
You should see four listings.
(List them out. These are the names right across the top.)
In the babynames data set the n represents the number of applications in that year for that name and sex. But what does prop mean?
The variable prop represents the proportion of all applicants of that sex in that year that had that name. If we look in the applicants data frame the variable n_all tells us the total number of applicants.
There were 97604 female applicants born in 1880.
Out of those 7065 were named Mary.
The proportion is found by dividing 7065 by 97604.
We can use R to do the calculation for us. Press the green arrow.
# Mary
7065/97604
## [1] 0.07238433
Does it match what is in the prop column? (Note that the displayed results round the last digit, and that is fine.)
Below, do the same calculation for Anna and Minnie.
# Anna
2604/97604
## [1] 0.02667923
# Minnie
1746/97604
## [1] 0.01788861
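The three divisions above can also be done in one step, because R divides every element of a vector by the same number. This is a small sketch using only the counts shown in the listings; the variable names `counts_1880` and `props_1880` are made up for illustration.

```r
# Counts for three female names in 1880, copied from the babynames listing
counts_1880 <- c(Mary = 7065, Anna = 2604, Minnie = 1746)

# Total female applicants in 1880, copied from the applicants listing
n_all_1880 <- 97604

# Dividing a vector by a single number divides every element
props_1880 <- counts_1880 / n_all_1880
props_1880
```

Each value should agree with the prop column for that name, up to rounding in the display.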
In social research we almost always use proportions (or their cousins, percents) rather than raw numbers. Using R, we would not do the calculation for each individual name; instead we would work with entire columns at once. If we added n_all to the babynames data frame, the calculation would look like this:
babynames$prop <- babynames$n / babynames$n_all
That’s exactly what the people who made the babynames data set did for us.
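Here is one way the step of adding n_all to the names data could be sketched, using a dplyr join on year and sex. To keep it small, it uses a few rows copied from the listings above instead of the full package data. The names `names_1880`, `applicants_1880`, and `with_prop` are made up for illustration, and this is not necessarily how the package author did it.

```r
library(dplyr)

# A few rows copied from the babynames listing
names_1880 <- data.frame(
  year = c(1880, 1880, 1880),
  sex  = c("F", "F", "F"),
  name = c("Mary", "Anna", "Emma"),
  n    = c(7065, 2604, 2003)
)

# The matching rows copied from the applicants listing
applicants_1880 <- data.frame(
  year  = c(1880, 1880),
  sex   = c("F", "M"),
  n_all = c(97604, 118399)
)

# Attach the right total to every name row, then divide column by column
with_prop <- names_1880 %>%
  left_join(applicants_1880, by = c("year", "sex")) %>%
  mutate(prop = n / n_all)

with_prop
```

The join matters because the two data frames have different numbers of rows, so the totals have to be matched to the names by year and sex before dividing.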
Some sociologists have hypothesized that in some time periods the name of a celebrity, leader or other famous person will become very popular for a time. Is your name associated with anyone like that? (You could check Google or Wikipedia to get ideas about this.)
For example, perhaps the name Malia became popular in 2008 when Barack Obama, who has a daughter of that name, was elected president. Do you think your name would be particularly popular in a certain year because of a famous person or event? What year or years?
Another possibility is that a name becomes unpopular because it is associated with someone who is famous for negative reasons or with some kind of negative characteristic. For example, the name Monica might have decreased in popularity due to the Clinton scandal in the late 1990s. Is there a time period when you expect that your name would have become less popular?
Another hypothesis might be that names that are associated with ethnic groups will go up in popularity when that group grows. Would that be a factor for your name? When would you expect your name to be popular based on ethnicity?
Finally, what year were you born in? Do you think your name might have been a popular name that year for any reason at all? Some sociologists would hypothesize that there are underlying cycles of popularity. If your parents chose your name, maybe a lot of others did too. So the hypothesis would be that your name was at or near its peak popularity the year you were born.
Overall, how do you think the presence of your name over time will vary? Why do you think that?
Now let’s make a line graph looking at your data.
First we will create a variable to represent the name you want to analyze. This will make it easier if you want to analyze other names. Put your name inside the quotation marks " " next to name_to_analyze. Put F or M inside the quotation marks for sex_to_analyze.
name_to_analyze <- "Elin"
sex_to_analyze <- "F"
name_to_analyze
## [1] "Elin"
sex_to_analyze
## [1] "F"
Next, get just the data on your name by filtering out all of the other names with the dplyr filter function. Notice how this code relates the variable names (name, sex) to the specific values you want to analyze (name_to_analyze, sex_to_analyze). We could also type in the original strings, but that would be inefficient if we use the same names in a lot of places or if we want our code to be reusable for other names.
Notice that in computer languages “==” is generally used to mean “has the same value” or “equal to.”
library(dplyr)
my_name_data <- filter(babynames, name == name_to_analyze, sex == sex_to_analyze)
head(my_name_data)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1893 F Elin 9 3.995880e-05
## 2 1894 F Elin 5 2.118895e-05
## 3 1904 F Elin 6 2.051717e-05
## 4 1907 F Elin 6 1.778131e-05
## 5 1908 F Elin 6 1.692367e-05
## 6 1909 F Elin 7 1.901667e-05
tail(my_name_data)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2010 F Elin 386 0.0001972555
## 2 2011 F Elin 428 0.0002213783
## 3 2012 F Elin 335 0.0001732116
## 4 2013 F Elin 328 0.0001707486
## 5 2014 F Elin 384 0.0001971705
## 6 2015 F Elin 296 0.0001529630
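To see what the “==” comparison in the filter step is doing, here is a small base R sketch. The vector `names_column` is a made-up stand-in for the name column, and `is_match` is an illustrative variable name.

```r
name_to_analyze <- "Elin"

# A made-up stand-in for the name column of a data frame
names_column <- c("Mary", "Elin", "Anna", "Elin")

# == compares every element with the value and returns TRUE or FALSE for each
is_match <- names_column == name_to_analyze
is_match

# Keeping only the TRUE positions is what filtering amounts to
names_column[is_match]
```

The filter function works along these lines: it keeps the rows where every condition you give it is TRUE.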
Now let’s make a line graph. Try just running it as is, but then fix the title and change the color. In this case we are using the ggplot2 graphing package.
library(ggplot2)
ggplot(data = my_name_data, aes(x = year, y = prop)) +
  geom_line(color = "purple") +
  labs(
    title = paste0("Figure 1: Proportion of ", sex_to_analyze, " babies given the name ",
                   name_to_analyze, " over time"),
    x = "Year",
    y = "Proportion",
    caption = "Source: R babynames package version of data from the Social Security Administration"
  )
How could we improve the above code to make it more reusable?
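One possible answer, offered as a sketch and not the only approach, is to wrap the filtering and plotting in a function so the name, sex, and color become arguments. The function name `plot_name_trend` and its arguments are hypothetical, and the small data frame reuses rows from the listings in this notebook so the example runs without the full data.

```r
library(ggplot2)

# Hypothetical helper: filter to one name and sex, then draw the line graph
plot_name_trend <- function(data, name_value, sex_value, line_color = "purple") {
  one_name <- data[data$name == name_value & data$sex == sex_value, ]
  ggplot(data = one_name, aes(x = year, y = prop)) +
    geom_line(color = line_color) +
    labs(
      title = paste0("Proportion of ", sex_value, " babies given the name ",
                     name_value, " over time"),
      x = "Year",
      y = "Proportion",
      caption = "Source: R babynames package version of data from the Social Security Administration"
    )
}

# A few rows copied from the listings in this notebook
sample_rows <- data.frame(
  year = c(2013, 2013, 2014, 2014, 2015, 2015),
  sex  = "F",
  name = c("Elin", "Linnea", "Elin", "Linnea", "Elin", "Linnea"),
  prop = c(1.707486e-04, 8.329199e-05, 1.971705e-04,
           8.061398e-05, 1.529630e-04, 9.146772e-05)
)

elin_plot <- plot_name_trend(sample_rows, "Elin", "F")
elin_plot
```

With the full data you would pass babynames as the first argument, and analyzing a different name becomes a one-line change.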
Was your name high in popularity in years where it was associated with a celebrity, in years when your ethnic group was a relatively large part of the population, or in your birth year?
Now pick another name for the same sex that you think will have the same pattern as the first name you used.
second_name_to_analyze<-"Linnea"
second_name_to_analyze
## [1] "Linnea"
This time we will filter for both names. Notice that the | symbol is used for “or”.
my_name_data <- filter(babynames,
                       name == name_to_analyze | name == second_name_to_analyze,
                       sex == sex_to_analyze)
head(my_name_data)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1893 F Elin 9 3.995880e-05
## 2 1894 F Elin 5 2.118895e-05
## 3 1894 F Linnea 5 2.118895e-05
## 4 1898 F Linnea 7 2.553384e-05
## 5 1900 F Linnea 8 2.517505e-05
## 6 1901 F Linnea 10 3.933415e-05
tail(my_name_data)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2013 F Elin 328 1.707486e-04
## 2 2013 F Linnea 160 8.329199e-05
## 3 2014 F Elin 384 1.971705e-04
## 4 2014 F Linnea 157 8.061398e-05
## 5 2015 F Elin 296 1.529630e-04
## 6 2015 F Linnea 177 9.146772e-05
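When there are more than two names, chaining | gets long. The base R operator %in% is a common alternative that asks whether each value appears in a set. This sketch uses a made-up vector, `names_seen`, to show that the two forms give the same answer.

```r
# A made-up stand-in for the name column
names_seen <- c("Elin", "Linnea", "Mary", "Elin")

# The set of names we want to keep
wanted <- c("Elin", "Linnea")

# Two ways to ask the same question
with_or <- names_seen == "Elin" | names_seen == "Linnea"
with_in <- names_seen %in% wanted

identical(with_or, with_in)
```

Inside filter, the condition could then be written as name %in% c(name_to_analyze, second_name_to_analyze).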
Now let’s make a graph with lines for both names. Make sure to update the title. Notice that we’ve added the group option in the first line and then the linetype aesthetic in the second line so we can tell which data goes with which name.
ggplot(data = my_name_data, aes(x = year, y = prop, group = name)) +
  geom_line(color = "red", aes(linetype = name)) +
  labs(
    title = "Figure 2: Proportion of __ babies given the names ___ and ___ over time",
    x = "Year",
    y = "Proportion",
    caption = "Source: R babynames package version of data from the Social Security Administration"
  )
Make the title more reusable.
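One way to do this, following the same paste0 pattern used for Figure 1, is to build the title from the variables. The variable name `figure_title` is made up for illustration.

```r
sex_to_analyze <- "F"
name_to_analyze <- "Elin"
second_name_to_analyze <- "Linnea"

# Build the title from the variables so it updates when the names change
figure_title <- paste0("Figure 2: Proportion of ", sex_to_analyze,
                       " babies given the names ", name_to_analyze,
                       " and ", second_name_to_analyze, " over time")
figure_title
```

You could then write title = figure_title inside labs().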
Another common question people like to ask is if you know someone has a given name, how old would you guess he or she is? This is a very interesting problem for at least two groups:
How can we use the babynames data to do this? Fortunately, a quick search turns up an R package that helps: github.com/andland/nameage. There are other approaches, but let’s try this one.
To install this package:
devtools::install_github("andland/nameage")
Follow the instructions on the github page to use the nameage() and plot_nameage() functions for your names. (There are at least three ways to organize doing this.)
#library()
#plot_nameage(c("Elin", "Linnea"), type = "age")
#plot_nameage(c("Elin", "Linnea"), type = "year")
How old would someone guess you are? What about a person with the other name?