In this notebook we will analyze the distribution of a name over time.
We will also learn about the importance of reproducible research, which means documenting your research in a way that allows anyone else to repeat exactly what you did.

Type your answers to the questions right in this notebook.

For the United States, the best data on baby names comes from the Social Security Administration, because almost every US citizen applies for a Social Security number. Visit the [baby names website](https://www.ssa.gov/oact/babynames/index.html).

Try exploring your name on the website by looking at different time periods and states.

What did you discover?

Is the website good enough for reproducible research?

The Social Security site is fun and easy to use, but it has some limitations.
First, it only includes the top 1000 names. That is a lot of names, but not close to the real number of names out there. Second, making comparisons (such as between two names, or the same name over decades) is not very easy. Finally, if someone had to reproduce the exact same analysis as you did, there are lots of ways it could go wrong, from misspelling the names to making a mistake in one of the selections.

Now let’s start a reproducible analysis.

About the data.

Here is a link to the [background information about the baby names data](https://www.ssa.gov/OACT/babynames/background.html). A key point to note is that, to protect the privacy of people with unusual names, only names that are used at least 5 times in a year are included in the data.

If your name does not appear please pick a different name (suggestion: use a different spelling or use the name of a family member or friend of a similar age to you).

We are going to use a data set from the R package called babynames. The author has already downloaded the full data for you and stored it in R data frames.

Starting our reproducible analysis

The first thing we will do is to load the babynames package and look at the first or last few records of each of its data frames. Click on the green arrow on the right to run the code.

library(babynames)
head(babynames)
## # A tibble: 6 x 5
##    year   sex      name     n       prop
##   <dbl> <chr>     <chr> <int>      <dbl>
## 1  1880     F      Mary  7065 0.07238433
## 2  1880     F      Anna  2604 0.02667923
## 3  1880     F      Emma  2003 0.02052170
## 4  1880     F Elizabeth  1939 0.01986599
## 5  1880     F    Minnie  1746 0.01788861
## 6  1880     F  Margaret  1578 0.01616737
head(applicants)
## # A tibble: 6 x 3
##    year   sex  n_all
##   <int> <chr>  <int>
## 1  1880     F  97604
## 2  1880     M 118399
## 3  1881     F  98855
## 4  1881     M 108282
## 5  1882     F 115696
## 6  1882     M 122031
tail(lifetables)
## # A tibble: 6 x 9
##       x      qx    lx    dx    Lx    Tx    ex    sex  year
##   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fctr> <dbl>
## 1   114 0.42433    76    32    60   130  1.72      F  2010
## 2   115 0.44791    44    20    34    71  1.62      F  2010
## 3   116 0.47283    24    11    18    37  1.52      F  2010
## 4   117 0.49531    13     6    10    18  1.44      F  2010
## 5   118 0.51795     6     3     5     9  1.36      F  2010
## 6   119 0.54163     3     2     2     4  1.28      F  2010
tail(births)
## # A tibble: 6 x 2
##    year  births
##   <int>   <int>
## 1  2009 4130665
## 2  2010 3999386
## 3  2011 3953590
## 4  2012 3952841
## 5  2013 3932181
## 6  2014 3988076

You should see four listings.

What are the column names in each data set?

(List them out. These are the names right across the top.)

In the babynames data set the n represents the number of applications in that year for that name and sex. But what does prop mean?

The variable prop represents the proportion of all applicants of that sex in that year that had that name. If we look in the applicants data frame the variable n_all tells us the total number of applicants.

There were 97604 female applicants born in 1880.
Out of those 7065 were named Mary.
The proportion is found by dividing 7065 by 97604.

We can use R to do the calculation for us. Press the green arrow.

# Mary
7065/97604
## [1] 0.07238433

Does it match what is in the prop column? (Note that the displayed results round the last digit, and that is fine.)

Below, do the same calculation for Anna and Minnie.

# Anna
2604/97604
## [1] 0.02667923
# Minnie
1746/97604
## [1] 0.01788861
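
On the rounding note above: here is a small sketch of why the computed value and the listed value look identical on screen yet are not exactly equal. It uses only base R and the numbers already shown for Mary; `prop_mary` and `prop_listed` are names made up for this illustration.

```r
# Mary's proportion, computed the same way as above
prop_mary <- 7065 / 97604

# R displays 7 significant digits by default; ask for more to see the rest
print(prop_mary)
print(prop_mary, digits = 15)

# The value as it is displayed in the prop column
prop_listed <- 0.07238433

# Exact equality fails because the displayed value has been rounded
print(prop_mary == prop_listed)

# Comparing with a small tolerance is the safer check
print(abs(prop_mary - prop_listed) < 1e-8)
```

The last line is the one to rely on: when you compare a calculation to a printed number, allow for the digits that were dropped in the display.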

In social research we almost always use proportions (or their cousins, percents) rather than raw numbers. Using R, we would not do the calculation for each individual name. Instead, we would work with entire columns at once. If we added n_all to the babynames data frame, so that every row carried the total for its own year and sex, the calculation would look like this:

babynames$prop <- babynames$n / babynames$n_all

That’s exactly what the people who made the babynames data set did for us.
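
The calculation above assumes that each babynames row already has the matching n_all next to it. Here is a minimal sketch of one way that could be done, using base R `merge()` on two small hand-typed data frames. The names `names_small`, `applicants_small`, and `joined` are made up for this illustration, the values are copied from the listings above, and this is not necessarily how the package author built the data.

```r
# A few 1880 rows copied from the head(babynames) listing (prop left out)
names_small <- data.frame(
  year = c(1880, 1880, 1880),
  sex  = c("F", "F", "F"),
  name = c("Mary", "Anna", "Emma"),
  n    = c(7065, 2604, 2003)
)

# Totals copied from the head(applicants) listing
applicants_small <- data.frame(
  year  = c(1880, 1880, 1881, 1881),
  sex   = c("F", "M", "F", "M"),
  n_all = c(97604, 118399, 98855, 108282)
)

# Attach the matching n_all to each name row, matching on year and sex
joined <- merge(names_small, applicants_small, by = c("year", "sex"))

# Both columns are now in the same data frame, so the rows line up
joined$prop <- joined$n / joined$n_all
print(joined)
```

Matching on both year and sex matters: every row here is a girl born in 1880, so every row should pick up 97604 and none of the other totals.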

Make a prediction

Some sociologists have hypothesized that in some time periods the name of a celebrity, leader or other famous person will become very popular for a time. Is your name associated with anyone like that? (You could check Google or Wikipedia to get ideas about this.)
For example, perhaps the name Malia became popular in 2008 when Barack Obama, who has a daughter of that name, was elected president. Do you think your name would be particularly popular in a certain year because of a famous person or event? What year or years?

Another possibility is that a name becomes unpopular because it is associated with someone who is famous for negative reasons or with some kind of negative characteristic. For example, the name Monica might have decreased in popularity due to the Clinton scandal in the late 1990s. Is there a time period when you expect that your name would have become less popular?

Another hypothesis might be that names that are associated with ethnic groups will go up in popularity when that group grows. Would that be a factor for your name? When would you expect your name to be popular based on ethnicity?

Finally, what year were you born in? Do you think your name might have been a popular name that year for any reason at all? Some sociologists would hypothesize that there are underlying cycles of popularity. If your parents chose your name, maybe a lot of others did too. So the hypothesis would be that your name was at or near its peak popularity the year you were born.

Overall, how do you think the presence of your name over time will vary? Why do you think that?

Now let’s make a line graph looking at your data.

First we will create a variable to represent the name you want to analyze. This will make it easier if you want to analyze other names. Put your name inside the quotation marks: " " next to name_to_analyze. Put F or M inside the quotation marks for sex_to_analyze.

name_to_analyze <- "Elin"
sex_to_analyze <- "F"
name_to_analyze
## [1] "Elin"
sex_to_analyze
## [1] "F"

Next, get just the data on your name by filtering out all of the other names with the dplyr filter function. Notice how this code relates the variable names (name, sex) to the specific values you want to analyze (name_to_analyze, sex_to_analyze). We could also type in the original strings, but that would be inefficient if we use the same names in a lot of places or if we want our code to be reusable for other names.

Notice that in computer languages “==” is generally used to mean “has the same value” or “equal to.”

library(dplyr)
my_name_data <- filter(babynames, name == name_to_analyze, sex == sex_to_analyze)
head(my_name_data)
## # A tibble: 6 x 5
##    year   sex  name     n         prop
##   <dbl> <chr> <chr> <int>        <dbl>
## 1  1893     F  Elin     9 3.995880e-05
## 2  1894     F  Elin     5 2.118895e-05
## 3  1904     F  Elin     6 2.051717e-05
## 4  1907     F  Elin     6 1.778131e-05
## 5  1908     F  Elin     6 1.692367e-05
## 6  1909     F  Elin     7 1.901667e-05
tail(my_name_data)
## # A tibble: 6 x 5
##    year   sex  name     n         prop
##   <dbl> <chr> <chr> <int>        <dbl>
## 1  2010     F  Elin   386 0.0001972555
## 2  2011     F  Elin   428 0.0002213783
## 3  2012     F  Elin   335 0.0001732116
## 4  2013     F  Elin   328 0.0001707486
## 5  2014     F  Elin   384 0.0001971705
## 6  2015     F  Elin   296 0.0001529630
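
To see what the == in the filter call is doing, here is a small base R sketch. It compares a short made-up vector of names (`some_names` is hypothetical, not part of the package) against name_to_analyze and shows that the answer is one TRUE or FALSE per element.

```r
# <- assigns a value to a variable
name_to_analyze <- "Elin"

# == asks a question and answers TRUE or FALSE
print(name_to_analyze == "Elin")
print(name_to_analyze == "elin")   # the comparison is case sensitive

# Applied to a whole column, == checks every element in turn
some_names <- c("Mary", "Elin", "Anna", "Elin")
is_match <- some_names == name_to_analyze
print(is_match)

# TRUE counts as 1, so sum() gives the number of matching rows
print(sum(is_match))
```

The filter function keeps the rows where this kind of comparison comes out TRUE.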

Now let’s make a line graph. Try just running it as is, but then fix the title and change the color. In this case we are using the ggplot2 graphing package.

library(ggplot2)

ggplot(data=my_name_data, aes(x=year, y= prop)) +
  geom_line(color = "purple")  +
  labs(
      title = paste0("Figure 1: Proportion of ", sex_to_analyze, " babies given the name ", 
      name_to_analyze, " over time"), 
      x = "Year",
      y ="Proportion",
      caption = "Source: R babynames package version of data from the Social Security Administration"
  )

How could we improve the above code to make it more reusable?

Describe the graph

Were any of your predictions supported?

Was your name high in popularity in years where it was associated with a celebrity, in years when your ethnic group was a relatively large part of the population, or in your birth year?
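
If you would like a number to go with what you see in the graph, here is a sketch of one way to find the peak year. To stay self-contained it uses only the six most recent Elin rows shown in the listing above, so the answer applies to those rows only; in the notebook you could try the same two lines on my_name_data. The names `elin_recent`, `peak_year`, and `birth_year` are made up for this illustration, and the birth year is a hypothetical value.

```r
# Last six rows for Elin, copied from the tail(my_name_data) listing
elin_recent <- data.frame(
  year = c(2010, 2011, 2012, 2013, 2014, 2015),
  prop = c(0.0001972555, 0.0002213783, 0.0001732116,
           0.0001707486, 0.0001971705, 0.0001529630)
)

# which.max() returns the position of the largest value
peak_row <- which.max(elin_recent$prop)
peak_year <- elin_recent$year[peak_row]
print(peak_year)

# Compare the peak to a birth year (hypothetical value for illustration)
birth_year <- 2012
print(peak_year - birth_year)
```

A difference near zero would support the hypothesis that a name peaks around the year its bearer was born.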

Follow up

Now pick another name for the same sex that you think will have the same pattern as the first name you used.

Why do you think it will have the same pattern?

second_name_to_analyze <- "Linnea"

second_name_to_analyze
## [1] "Linnea"

This time we will filter for both names. Notice that the | symbol is used for “or”.

my_name_data <- filter(babynames,
                       name == name_to_analyze | name == second_name_to_analyze,
                       sex == sex_to_analyze)
head(my_name_data)
## # A tibble: 6 x 5
##    year   sex   name     n         prop
##   <dbl> <chr>  <chr> <int>        <dbl>
## 1  1893     F   Elin     9 3.995880e-05
## 2  1894     F   Elin     5 2.118895e-05
## 3  1894     F Linnea     5 2.118895e-05
## 4  1898     F Linnea     7 2.553384e-05
## 5  1900     F Linnea     8 2.517505e-05
## 6  1901     F Linnea    10 3.933415e-05
tail(my_name_data)
## # A tibble: 6 x 5
##    year   sex   name     n         prop
##   <dbl> <chr>  <chr> <int>        <dbl>
## 1  2013     F   Elin   328 1.707486e-04
## 2  2013     F Linnea   160 8.329199e-05
## 3  2014     F   Elin   384 1.971705e-04
## 4  2014     F Linnea   157 8.061398e-05
## 5  2015     F   Elin   296 1.529630e-04
## 6  2015     F Linnea   177 9.146772e-05
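
The filter call above mixes two ideas: | means “or”, and conditions separated by a comma inside filter must all be true (“and”). Here is a small base R sketch of the same logic on a few rows typed in from the listings in this notebook. The data frame `rows` and the helper variables are made up for this illustration, and & is the base R symbol for “and”.

```r
# A few rows copied from the listings in this notebook (prop left out)
rows <- data.frame(
  year = c(1893, 1894, 1894, 1880),
  sex  = c("F", "F", "F", "F"),
  name = c("Elin", "Elin", "Linnea", "Mary"),
  n    = c(9, 5, 5, 7065)
)

name_to_analyze <- "Elin"
second_name_to_analyze <- "Linnea"

# | is TRUE when either side is TRUE
either_name <- rows$name == name_to_analyze | rows$name == second_name_to_analyze

# & is TRUE only when both sides are TRUE
keep <- either_name & rows$sex == "F"
print(rows[keep, ])

# %in% is a shorthand for a chain of == comparisons joined by |
same_rows <- rows$name %in% c(name_to_analyze, second_name_to_analyze)
print(identical(either_name, same_rows))
```

The %in% form becomes handy if you later want to compare three or more names, because the list of names can simply grow.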

Now let’s make a graph with lines for both names. Make sure to update the title. Notice that we’ve added the group option in the first line and then the linetype aesthetic in the second line so we can tell which data goes with which name.

ggplot(data=my_name_data, aes(x=year, y= prop, group = name)) +
  geom_line(color = "red", aes(linetype=name))  +
  labs(
      title = "Figure 2: Proportion of __  babies given the names ___ and ___ over time", 
      x = "Year",
      y ="Proportion",
      caption = "Source: R babynames package version of data from the Social Security Administration"
  )

Make the title more reusable.

How do the lines compare to each other? Was your prediction supported?
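
One way to put a number on the comparison is to divide one name’s proportion by the other’s, year by year. This sketch uses only the 2013 to 2015 rows shown in the listing above and base R; `two_names`, `wide`, and `ratio` are names made up for this illustration.

```r
# Rows copied from the tail(my_name_data) listing for both names
two_names <- data.frame(
  year = c(2013, 2013, 2014, 2014, 2015, 2015),
  name = c("Elin", "Linnea", "Elin", "Linnea", "Elin", "Linnea"),
  prop = c(1.707486e-04, 8.329199e-05, 1.971705e-04,
           8.061398e-05, 1.529630e-04, 9.146772e-05)
)

# Rearrange so each year is a row and each name is a column
wide <- tapply(two_names$prop, list(two_names$year, two_names$name), sum)
print(wide)

# A ratio above 1 means the first name was more common that year
ratio <- wide[, "Elin"] / wide[, "Linnea"]
print(round(ratio, 2))
```

If the two names really follow the same pattern, this ratio should stay fairly steady from year to year.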

How old is someone with your name?

Another common question people like to ask is if you know someone has a given name, how old would you guess he or she is? This is a very interesting problem for at least two groups:

How can we use the babynames data to do this? Fortunately, a Google search turns up an R package that can help: github.com/andland/nameage. There are other approaches, but let’s try this one.

To install this package:

devtools::install_github("andland/nameage")

Follow the instructions on the github page to use the nameage() and plot_nameage() functions for your names. (There are at least three ways to organize doing this.)

#library()


#plot_nameage(c("Elin", "Linnea"), type = "age")
#plot_nameage(c("Elin", "Linnea"), type = "year")

How old would someone guess you are? What about a person with the other name?