Using R to Investigate Baby Naming Practices

Kara Joyner

Learning Goals

You will have an idea of what R can do.
You will learn how to install R and RStudio on your computer.
You will learn about R packages and libraries.
You will be exposed to simple and complex code.
You will see two different examples of data visualization.
You will see how you can publish your document here: RPubs

My Journey

SPSS
SAS
Stata
Minitab
ArcGIS
Working mainly with R since 2021

Why R?

Great data visualizations:
- Healy, K. (2018). Data visualization: a practical introduction. Princeton University Press
Capability of creating reports and presentations in a single document
Ability to scrape data directly from the Web
Reproducibility and replicability
Parallels to Python

Problem Solving in R

Look at documentation produced by R Developers.
Collect Posit Cheat Sheets.
Identify errors in your code in a sequential manner with code chunks.
Copy the error message into Google.
Be sure you have all the appropriate packages and libraries.
Check the variable types and case of letters.
Enjoy the challenges!
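
Two of the checks listed above, variable types and the case of letters, can be sketched on a small made-up data frame. The object toy and all of its values are hypothetical and exist only for illustration.

```r
# A small made-up data frame (hypothetical values, for illustration only)
toy <- data.frame(
  name  = c("Taylor", "taylor", "Nicholas"),
  year  = c("2000", "2000", "2001"),   # stored as text by mistake
  count = c(10, 5, 7),
  stringsAsFactors = FALSE
)

# Check how each variable is stored
str(toy)
class(toy$year)

# R is case sensitive: "Taylor" and "taylor" are different values
sum(toy$name == "Taylor")
sum(tolower(toy$name) == "taylor")

# Convert year from character to numeric before doing arithmetic with it
toy$year <- as.numeric(toy$year)
class(toy$year)
```

Running str() on a data set right after reading it is a quick way to catch numbers that were read as text.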

Installing R and RStudio

Uninstall R and RStudio any time you want to download new versions.
See the instructions here: Installing R and RStudio
Download R.
Download RStudio.
Download the Quarto document “Using R to Study Baby Names” from Canvas to your computer.
Open the Quarto document in RStudio.

Orienting Yourself to Any Stats Package (Brute Force)

Look for sample code like this: Analyzing US Baby Names
Save the links for materials that look useful.
Make the necessary changes to the code.
Debug the code.
Ask yourself if the results make any sense.
Add comments soon after you have cracked the code.

Loading Packages and Listing Libraries

The RStudio user interface has 4 primary panes: source, console, environment, and output.
My Quarto document appears in the upper left pane.
Click Install in the Packages tab of the lower right pane to install the packages that you need.
Load the libraries that you need with library().

# Set the CRAN mirror to a specific location, e.g., UT Austin
options(repos = "https://mirror.utexas.edu/CRAN/")
library(readr)   
library(dplyr) 
library(janitor) 
library(ggplot2)
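
A call to library() fails if the package has never been installed. One possible way to find out which packages are missing before loading them is sketched below. The object names needed and not_installed are hypothetical, and the install line is left commented out because it needs an internet connection.

```r
# Packages loaded in this document
needed <- c("readr", "dplyr", "janitor", "ggplot2")

# Which of them are not yet installed on this computer?
is_installed  <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!is_installed]
not_installed

# Install only the missing ones (uncomment to run)
# install.packages(not_installed)
```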

Today’s Data Visualizations

1) Visualizing Trends in the Rate at Which Babies are Named Taylor
2) Mapping the Distribution of Boys Named Nicholas across States
WARNING: Today’s examples require data wrangling at a very advanced level.
You can easily switch the name, sex, and year.

Visualizing Trends in the Rate at Which Babies are Named Taylor

Baby Names Data into R

I have downloaded the CSV file (available on Canvas) for this example.
Read the data into RStudio.

baby_names_national <- read_csv("C:/Users/Ameria/Downloads/R1/baby_names_national.csv")
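
The path above points to a folder on one particular computer, so it has to be changed to wherever you saved the file. The sketch below shows one way to build a path and confirm the file exists before reading it. It writes a tiny two-row file to a temporary folder so that it runs anywhere. The names csv_path and toy_names are hypothetical, and the two rows are copied from the first rows of the national file.

```r
library(readr)

# Write a tiny CSV to a temporary folder so the example runs on any computer
csv_path <- file.path(tempdir(), "toy_baby_names.csv")
write_csv(
  data.frame(name = c("Mary", "Anna"), sex = c("F", "F"),
             count = c(7065, 2604), year = c(1880, 1880)),
  csv_path
)

# Confirm the file is where you think it is before reading it
file.exists(csv_path)

# Read the file, declaring the column types explicitly
# (c = character, d = double)
toy_names <- read_csv(csv_path, col_types = "ccdd")
nrow(toy_names)
```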

Examine the number of rows.

nrow(baby_names_national)

[1] 2020863

Becoming Acquainted with the Data

What each row shows: the number of babies of one sex given a particular name in a particular year (name-sex-years). The first six rows are all girls born in 1880.

head(baby_names_national)

# A tibble: 6 × 4
  name      sex   count  year
  <chr>     <chr> <dbl> <dbl>
1 Mary      F      7065  1880
2 Anna      F      2604  1880
3 Emma      F      2003  1880
4 Elizabeth F      1939  1880
5 Minnie    F      1746  1880
6 Margaret  F      1578  1880

To examine trends in the rate at which girls and boys are named Taylor, we need to combine observations within each year and sex.

Creating Variables and Transforming Data (“Data Wrangling”)

#creating an indicator for whether the baby is named Taylor
baby_names_national$Taylor <- ifelse(baby_names_national$name == 'Taylor', 1, 0)

#calculating the number of babies named Taylor in each row (the count if the name is Taylor, 0 otherwise)
baby_names_national$Taylorcount <- (baby_names_national$Taylor) * (baby_names_national$count)

#creating a new data set that counts the number of girls and boys named Taylor each year, as well as the total numbers of girls and boys each year
Taylorprobs <- baby_names_national %>%
  group_by(year, sex) %>%
  summarize(numTaylor = sum(Taylorcount), n = sum(count))

#calculating the proportion of boys and girls named Taylor each year
Taylorprobs$probTaylor <- (Taylorprobs$numTaylor) / (Taylorprobs$n)

#copying sex to a capitalized variable so the legend of the figure reads "Sex"
Taylorprobs$Sex <- Taylorprobs$sex
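
The wrangling steps above can also be wrapped in a small function so that the name is easy to switch. This is only a sketch on made-up data: the helper name_share(), the tibble toy, and all of its counts are hypothetical and are not part of the original analysis.

```r
library(dplyr)

# Made-up name-sex-year counts (hypothetical values, for illustration only)
toy <- tibble(
  name  = c("Taylor", "Mary", "Taylor", "John", "Taylor", "Mary"),
  sex   = c("F", "F", "M", "M", "F", "F"),
  year  = c(2000, 2000, 2000, 2000, 2001, 2001),
  count = c(10, 90, 5, 95, 20, 80)
)

# Hypothetical helper: proportion of babies with a given name, by year and sex
name_share <- function(data, target_name) {
  data %>%
    group_by(year, sex) %>%
    summarize(
      num_named = sum(count[name == target_name]),  # babies with the target name
      n         = sum(count),                       # all babies in the group
      .groups   = "drop"
    ) %>%
    mutate(prob = num_named / n)
}

name_share(toy, "Taylor")
```

Calling name_share(toy, "Mary") would give the same table for a different name without repeating any of the steps.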

Viewing the Transformed Data

# looking at the object I just created
Taylorprobs

# A tibble: 282 × 6
# Groups:   year [141]
    year sex   numTaylor      n probTaylor Sex  
   <dbl> <chr>     <dbl>  <dbl>      <dbl> <chr>
 1  1880 F             0  90994   0        F    
 2  1880 M            37 110490   0.000335 M    
 3  1881 F             0  91953   0        F    
 4  1881 M            39 100738   0.000387 M    
 5  1882 F             0 107847   0        F    
 6  1882 M            27 113686   0.000237 M    
 7  1883 F             0 112319   0        F    
 8  1883 M            27 104625   0.000258 M    
 9  1884 F             0 129019   0        F    
10  1884 M            21 114442   0.000183 M    
# ℹ 272 more rows

Visualizing Trends in the Rate at Which Babies Are Named Taylor

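ggplot(Taylorprobs, aes(x = year, y = probTaylor, col = Sex)) +
  geom_line() +
  geom_point() +
  #probTaylor is a proportion, so the axis labels are formatted as percents to match the axis title
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Year of Birth", y = "Percent", title = "Percent of Babies Named Taylor by Calendar Year", caption = "Source: Analyses in R by Joyner of data from https://data.world/nkrishnaswami/us-ssa-baby-names-national")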

Mapping the Distribution of Boys Named Nicholas across States

Reading More Baby Names Data into R

I have downloaded a second CSV file (available on Canvas) for this example.

baby_names_state <- read_csv("C:/Users/Ameria/Downloads/R1/baby_names_state.csv")    
head(baby_names_state)

# A tibble: 6 × 5
  state_abb sex    year name     count
  <chr>     <chr> <dbl> <chr>    <dbl>
1 AK        F      1910 Mary        14
2 AK        F      1910 Annie       12
3 AK        F      1910 Anna        10
4 AK        F      1910 Margaret     8
5 AK        F      1910 Helen        7
6 AK        F      1910 Elsie        6

How this file is different: each row is a name-sex-state-year, so the counts are broken down by state.

Filtering the Data and Creating Variables

#keeping only male babies born in 2000
baby_names_state <-filter(baby_names_state, sex=='M') 
baby_names_state <- filter(baby_names_state, year == 2000)

#changing the name of a variable so I can join it to another file later
baby_names_state$state_abbv<-baby_names_state$state_abb

#creating an indicator of whether the baby is named Nicholas
baby_names_state$Nicholas <- ifelse(baby_names_state$name == 'Nicholas', 1, 0)

#filtering the data so that it consists only of boys named Nicholas
baby_names_state <-filter(baby_names_state, Nicholas==1)
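
The separate filtering steps above can also be written as a single filter() call. The sketch below uses a made-up tibble, toy_state, whose rows are hypothetical. It also uses rename() for the join key, which replaces the old column, whereas the code above keeps both state_abb and state_abbv.

```r
library(dplyr)

# Made-up name-state-year counts (hypothetical values, for illustration only)
toy_state <- tibble(
  state_abb = c("AK", "AK", "WY", "WY", "WY"),
  sex       = c("M", "F", "M", "M", "M"),
  year      = c(2000, 2000, 2000, 1999, 2000),
  name      = c("Nicholas", "Mary", "Nicholas", "Nicholas", "John"),
  count     = c(36, 14, 30, 28, 25)
)

# Conditions separated by commas are combined with AND
nicholas_2000 <- toy_state %>%
  filter(sex == "M", year == 2000, name == "Nicholas") %>%
  rename(state_abbv = state_abb)   # rename the join key in the same pipeline

nicholas_2000
```

To switch the name, sex, or year, only the values inside filter() need to change.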

Transforming the Data

#grouping the data by state so that we can determine the number of boys named Nicholas in 2000 in each state and each state's percent of the national total
baby_names_state_transformed <- baby_names_state %>%
  group_by(state_abbv) %>%
  #variable to be transformed
  mutate(num = count) %>%
  ungroup() %>%
  mutate(perc = num / sum(num)) %>%
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
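
The arithmetic behind perc is each state's count divided by the total across all states. A minimal sketch with three made-up counts is shown below. The vector num and its values are hypothetical, and base R's sprintf() stands in here for the scales::percent() call used above.

```r
# Made-up counts of boys named Nicholas in three states (hypothetical values)
num <- c(WY = 30, AK = 36, MT = 34)

# Each state's share of the total across all states
perc <- num / sum(num)
perc

# The shares sum to 1 by construction
sum(perc)

# Format the shares as percentages with one decimal place
sprintf("%.1f%%", 100 * perc)
```

Because the denominator is the total number of boys named Nicholas, a large state will usually have a large share simply because more babies are born there.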

Viewing the Transformed Data

baby_names_state_transformed

# A tibble: 51 × 10
   state_abb sex    year name     count state_abbv Nicholas   num    perc labels
   <chr>     <chr> <dbl> <chr>    <dbl> <chr>         <dbl> <dbl>   <dbl> <chr> 
 1 WY        M      2000 Nicholas    30 WY                1    30 0.00122 0.121…
 2 AK        M      2000 Nicholas    36 AK                1    36 0.00146 0.146…
 3 MT        M      2000 Nicholas    43 MT                1    43 0.00174 0.174…
 4 VT        M      2000 Nicholas    47 VT                1    47 0.00191 0.190…
 5 ND        M      2000 Nicholas    51 ND                1    51 0.00207 0.206…
 6 HI        M      2000 Nicholas    64 HI                1    64 0.00260 0.259…
 7 SD        M      2000 Nicholas    65 SD                1    65 0.00264 0.263…
 8 DC        M      2000 Nicholas    69 DC                1    69 0.00280 0.279…
 9 ID        M      2000 Nicholas    89 ID                1    89 0.00361 0.361…
10 DE        M      2000 Nicholas    94 DE                1    94 0.00381 0.381…
# ℹ 41 more rows

Joining to Spatial Data

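#installing the Urban Institute packages from GitHub
install.packages("remotes")
remotes::install_github("UrbanInstitute/urbnmapr")
remotes::install_github("UrbanInstitute/urbnthemes")

#loading the libraries needed for mapping
library(urbnmapr)
library(urbnthemes)
library(janitor)
library(sf)
library(devtools)

#forcing a reinstall of urbnthemes to pick up the latest version
install_github("UrbanInstitute/urbnthemes", force = TRUE)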


── R CMD build ─────────────────────────────────────────────────────────────────
* checking for file 'C:\Users\Ameria\AppData\Local\Temp\RtmpK4KRpC\remotes435c28aa268c\UrbanInstitute-urbnthemes-f6a368d/DESCRIPTION' ... OK
* preparing 'urbnthemes':
* checking DESCRIPTION meta-information ...Warning in person1(given = given[[i]], family = family[[i]], middle = middle[[i]],  :
  It is recommended to use 'given' instead of 'middle'.
 OK
Warning in person1(given = given[[i]], family = family[[i]], middle = middle[[i]],  :
  It is recommended to use 'given' instead of 'middle'.
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building 'urbnthemes_0.0.2.tar.gz'

#joining the data to a map  
spatial_data <- left_join(baby_names_state_transformed, get_urbn_map(map = "states", sf = TRUE), by = "state_abbv")
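
A left join keeps every row of the first table and adds the matching columns from the second. The sketch below uses two made-up tables to show what happens when a key has no match. The tibbles counts and lookup are hypothetical stand-ins for the baby names data and the map data, and "XX" is a deliberately invalid state code.

```r
library(dplyr)

# Made-up state counts and a stand-in lookup table (both hypothetical)
counts <- tibble(state_abbv = c("AK", "WY", "XX"), num = c(36, 30, 5))
lookup <- tibble(state_abbv = c("AK", "WY", "MT"),
                 state_name = c("Alaska", "Wyoming", "Montana"))

# left_join keeps every row of counts and adds the matching columns of lookup
joined <- left_join(counts, lookup, by = "state_abbv")
joined

# Rows of counts with no match in lookup get NA in the added columns;
# anti_join lists those rows so they can be checked
unmatched <- anti_join(counts, lookup, by = "state_abbv")
unmatched
```

Checking for unmatched keys after a join helps explain any states that later appear blank on the map.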

Viewing the Spatial Data

spatial_data

# A tibble: 51 × 13
   state_abb sex    year name     count state_abbv Nicholas   num    perc labels
   <chr>     <chr> <dbl> <chr>    <dbl> <chr>         <dbl> <dbl>   <dbl> <chr> 
 1 WY        M      2000 Nicholas    30 WY                1    30 0.00122 0.121…
 2 AK        M      2000 Nicholas    36 AK                1    36 0.00146 0.146…
 3 MT        M      2000 Nicholas    43 MT                1    43 0.00174 0.174…
 4 VT        M      2000 Nicholas    47 VT                1    47 0.00191 0.190…
 5 ND        M      2000 Nicholas    51 ND                1    51 0.00207 0.206…
 6 HI        M      2000 Nicholas    64 HI                1    64 0.00260 0.259…
 7 SD        M      2000 Nicholas    65 SD                1    65 0.00264 0.263…
 8 DC        M      2000 Nicholas    69 DC                1    69 0.00280 0.279…
 9 ID        M      2000 Nicholas    89 ID                1    89 0.00361 0.361…
10 DE        M      2000 Nicholas    94 DE                1    94 0.00381 0.381…
# ℹ 41 more rows
# ℹ 3 more variables: state_fips <chr>, state_name <chr>,
#   geometry <MULTIPOLYGON [m]>

Mapping the Distribution of Boys Named Nicholas in 2000 across States

ggplot() +
  geom_sf(spatial_data, mapping = aes(fill = perc, geometry = geometry), color = "#ffffff", size = 0.25) +
  scale_fill_gradientn(labels = scales::percent) +
  coord_sf(datum = NA) +
  labs(fill = "Percent", title = "Distribution across States of Boys Named Nicholas in 2000", caption = "Source: Analyses in R by Joyner of data from https://data.world/nkrishnaswami/us-ssa-baby-names-state")

Publishing Your Document in RPubs

Render your document to find any problems.
Go to this website: RPubs
Register in RPubs if you don’t already have an account.
After you register, you can publish your document by clicking the light blue symbol in the upper right of the upper left pane.