Using R to Investigate Baby Naming Practices

Kara Joyner

Learning Goals

  • You will have an idea of what R can do.

  • You will learn how to install R and RStudio onto your computer.

  • You will learn about R packages and libraries.

  • You will be exposed to simple and complex code.

  • You will see two different examples of data visualization.

  • You will see how you can publish your document here: RPubs

My Journey

  • SPSS

  • SAS

  • Stata

  • Minitab

  • ArcGIS

  • Working mainly with R since 2021

Why R?

Problem Solving in R

  • Look at documentation produced by R Developers.

  • Collect Posit Cheat Sheets.

  • Identify errors in your code in a sequential manner with code chunks.

  • Copy the error message into Google.

  • Be sure you have all the appropriate packages and libraries.

  • Check the variable types and case of letters.

  • Enjoy the challenges!
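
The advice about variable types and letter case can be illustrated with a short sketch. The data frame `toy` and its values are invented for illustration and are not part of the baby names files.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  name = c("Taylor", "taylor", "Nicholas"),
  year = c("2000", "2000", "2001"),  # years stored as text by mistake
  stringsAsFactors = FALSE
)

# Inspect the type of every column
str(toy)

# R is case sensitive: "Taylor" and "taylor" are different values
n_exact <- sum(toy$name == "Taylor")
n_any_case <- sum(tolower(toy$name) == "taylor")

# Convert the text column to numbers before doing arithmetic on it
toy$year <- as.numeric(toy$year)
year_type <- class(toy$year)
```

Comparing `n_exact` with `n_any_case` is a quick way to see whether inconsistent capitalization is hiding matches.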

Installing R and RStudio

  • Uninstall R and RStudio any time you want to download new versions.

  • See the instructions here: Installing R and RStudio

  • Download R.

  • Download RStudio.

  • Download the Quarto document “Using R to Study Baby Names” from Canvas to your computer.

  • Open the Quarto document in RStudio.

Orienting Yourself to Any Stats Package (Brute Force)

  • Look for sample code like this: Analyzing US Baby Names

  • Save the links for materials that look useful.

  • Make the necessary changes to the code.

  • Debug the code.

  • Ask yourself if the results make any sense.

  • Add comments soon after you have cracked the code.

Loading Packages and Listing Libraries

  • The RStudio user interface has 4 primary panes: source, console, environment, and output

  • My Quarto document appears in the upper left pane.

  • Click on Install in the Packages tab of the lower right pane to install the packages that you need.

  • Load the libraries that you need with library().

# Set the CRAN mirror to a specific location, e.g., UT Austin
options(repos = "https://mirror.utexas.edu/CRAN/")
library(readr)   
library(dplyr) 
library(janitor) 
library(ggplot2) 
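
If a library() call fails, the package has usually not been installed yet. The following is a minimal sketch of one way to check this before loading. The object names `needed`, `is_installed`, and `not_installed` are my own and are not part of the original document.

```r
# Packages loaded in this document
needed <- c("readr", "dplyr", "janitor", "ggplot2")

# requireNamespace() returns TRUE when a package is installed and can be loaded
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!is_installed]

# Report what is missing; install.packages(not_installed) would then fetch them
if (length(not_installed) > 0) {
  message("Not yet installed: ", paste(not_installed, collapse = ", "))
} else {
  message("All packages are installed.")
}
```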

Today’s Data Visualizations

  • 1) Visualizing Trends in the Rate at Which Babies are Named Taylor

  • 2) Mapping the Distribution of Boys Named Nicholas across States

  • WARNING: Today’s examples require data wrangling at a very advanced level.

  • You can easily switch the name, sex, and year.

Baby Names Data into R

  • I have downloaded the csv file (under Canvas) for this example.

  • Read the data into RStudio.

baby_names_national <- read_csv("C:/Users/Ameria/Downloads/R1/baby_names_national.csv")   
  • Examine the number of rows.
nrow(baby_names_national)
[1] 2020863

Becoming Acquainted with the Data

  • What they are showing: the number of babies of each sex with each name each year (name-sex-years). The first rows happen to be girls.
head(baby_names_national)
# A tibble: 6 × 4
  name      sex   count  year
  <chr>     <chr> <dbl> <dbl>
1 Mary      F      7065  1880
2 Anna      F      2604  1880
3 Emma      F      2003  1880
4 Elizabeth F      1939  1880
5 Minnie    F      1746  1880
6 Margaret  F      1578  1880
  • To examine trends in the rate at which babies are named Taylor, we need to combine observations within years and sexes.

Creating Variables and Transforming Data (“Data Wrangling”)

# creating an indicator for whether the baby is named Taylor 
baby_names_national$Taylor <- ifelse(baby_names_national$name == 'Taylor', 1, 0) 

#calculating the number of babies named Taylor in each row (zero for every other name)
baby_names_national$Taylorcount<-(baby_names_national$Taylor)*(baby_names_national$count) 

#creating a new data set that counts the number of girls and boys each year named Taylor, as well as the total numbers of girls and boys each year
Taylorprobs <- baby_names_national %>%
  group_by(year, sex) %>%
  summarize(numTaylor = sum(Taylorcount), n = sum(count))

#calculating the proportion of boys and girls named Taylor each year
Taylorprobs$probTaylor<-(Taylorprobs$numTaylor)/(Taylorprobs$n)  

#creating this variable for the figure
Taylorprobs$Sex<-Taylorprobs$sex 
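
The same wrangling can be sketched as a single dplyr pipeline. This is an alternative under stated assumptions, not the code used above: `toy_national` is invented data in the same layout as baby_names_national, and `.groups = "drop"` returns an ungrouped result, whereas the code above leaves the result grouped by year.

```r
library(dplyr)

# Hypothetical toy data in the same layout as baby_names_national
toy_national <- data.frame(
  name  = c("Taylor", "Mary", "Taylor", "John"),
  sex   = c("F", "F", "M", "M"),
  count = c(10, 90, 5, 95),
  year  = c(2000, 2000, 2000, 2000)
)

toy_probs <- toy_national %>%
  group_by(year, sex) %>%
  summarize(
    numTaylor = sum(count[name == "Taylor"]),  # babies named Taylor in the group
    n         = sum(count),                    # all babies in the group
    .groups   = "drop"
  ) %>%
  mutate(probTaylor = numTaylor / n)           # proportion named Taylor

toy_probs
```

Summing `count[name == "Taylor"]` inside summarize() replaces the two helper columns (the indicator and the indicator times the count) with one step.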

Viewing the Transformed Data

# looking at the object I just created
Taylorprobs
# A tibble: 282 × 6
# Groups:   year [141]
    year sex   numTaylor      n probTaylor Sex  
   <dbl> <chr>     <dbl>  <dbl>      <dbl> <chr>
 1  1880 F             0  90994   0        F    
 2  1880 M            37 110490   0.000335 M    
 3  1881 F             0  91953   0        F    
 4  1881 M            39 100738   0.000387 M    
 5  1882 F             0 107847   0        F    
 6  1882 M            27 113686   0.000237 M    
 7  1883 F             0 112319   0        F    
 8  1883 M            27 104625   0.000258 M    
 9  1884 F             0 129019   0        F    
10  1884 M            21 114442   0.000183 M    
# ℹ 272 more rows
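
The plotting code for the Taylor trend is not shown in this section. A minimal sketch of how a table shaped like Taylorprobs could be drawn as a line chart follows. The data frame `toy_trend` and its values are invented for illustration, and the choice of geom_line() is my assumption rather than the original figure's specification.

```r
library(ggplot2)

# Hypothetical stand-in for Taylorprobs (values invented for illustration)
toy_trend <- data.frame(
  year       = rep(c(1980, 1990, 2000), each = 2),
  Sex        = rep(c("F", "M"), times = 3),
  probTaylor = c(0.0001, 0.0003, 0.0040, 0.0030, 0.0050, 0.0020)
)

# One line per sex, with year on the x axis and the proportion on the y axis
trend_plot <- ggplot(toy_trend, aes(x = year, y = probTaylor, color = Sex)) +
  geom_line() +
  labs(x = "Year", y = "Proportion named Taylor", color = "Sex")

trend_plot
```

Replacing `toy_trend` with Taylorprobs should give the corresponding chart for the real data, since the column names match.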

Mapping the Distribution of Boys Named Nicholas across States

Reading More Baby Names Data into R

  • I have downloaded a second csv file (under Canvas) for this example.
baby_names_state <- read_csv("C:/Users/Ameria/Downloads/R1/baby_names_state.csv")    
head(baby_names_state)
# A tibble: 6 × 5
  state_abb sex    year name     count
  <chr>     <chr> <dbl> <chr>    <dbl>
1 AK        F      1910 Mary        14
2 AK        F      1910 Annie       12
3 AK        F      1910 Anna        10
4 AK        F      1910 Margaret     8
5 AK        F      1910 Helen        7
6 AK        F      1910 Elsie        6
  • How this file is different: each row is a name-sex-state-year combination (the first rows happen to be girls in Alaska).

Filtering the Data and Creating Variables

#keeping only male babies born in 2000 (year is numeric, so it is compared to a number)
baby_names_state <- filter(baby_names_state, sex == 'M')
baby_names_state <- filter(baby_names_state, year == 2000)

#changing the name of a variable so I can join it to another file later
baby_names_state$state_abbv<-baby_names_state$state_abb

#creating an indicator of whether the baby is named Nicholas
baby_names_state$Nicholas <- ifelse(baby_names_state$name == 'Nicholas', 1, 0)

#filtering the data so that it consists only of boys named Nicholas
baby_names_state <-filter(baby_names_state, Nicholas==1)
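
The filtering steps above can also be sketched as one filter() call with several conditions. `toy_state` is invented data in the same layout as baby_names_state. Note one difference: rename() replaces state_abb with state_abbv, whereas the code above keeps both columns.

```r
library(dplyr)

# Hypothetical toy data in the same layout as baby_names_state
toy_state <- data.frame(
  state_abb = c("AK", "AK", "WY", "WY"),
  sex       = c("M", "F", "M", "M"),
  year      = c(2000, 2000, 2000, 1999),
  name      = c("Nicholas", "Mary", "Nicholas", "Nicholas"),
  count     = c(36, 14, 30, 28)
)

# Conditions separated by commas must all be true for a row to be kept
nicholas_2000 <- toy_state %>%
  filter(sex == "M", year == 2000, name == "Nicholas") %>%
  rename(state_abbv = state_abb)

nicholas_2000
```

Switching the name, sex, or year then means editing a single line.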

Transforming the Data

#now we are grouping the data by state so that we can determine the number and percent of boys named Nicholas in 2000 who reside in each state
baby_names_state_transformed <- baby_names_state %>%
  group_by(state_abbv) %>%
  # Variable to be transformed
  mutate(num = count) %>%
  ungroup() %>%
  mutate(perc = num / sum(num)) %>%
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
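
It helps to be clear about what perc measures: each state's share of all boys named Nicholas in the data, so the shares add up to 1 across states. It is not the proportion of boys within a state who are named Nicholas. A small sketch with invented numbers (`toy_counts` and its state codes are hypothetical):

```r
library(dplyr)

# Hypothetical counts of boys named Nicholas in three made-up states
toy_counts <- data.frame(
  state_abbv = c("AA", "BB", "CC"),
  count      = c(30, 50, 120)
)

toy_shares <- toy_counts %>%
  mutate(perc = count / sum(count)) %>%   # share of the overall total
  arrange(perc) %>%                       # smallest share first
  mutate(labels = scales::percent(perc))  # text version for display

# The shares across all states sum to 1
total_share <- sum(toy_shares$perc)
toy_shares
```

Because larger states have more births, a map of perc largely reflects population size as well as naming preferences.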

Viewing the Transformed Data

baby_names_state_transformed
# A tibble: 51 × 10
   state_abb sex    year name     count state_abbv Nicholas   num    perc labels
   <chr>     <chr> <dbl> <chr>    <dbl> <chr>         <dbl> <dbl>   <dbl> <chr> 
 1 WY        M      2000 Nicholas    30 WY                1    30 0.00122 0.121…
 2 AK        M      2000 Nicholas    36 AK                1    36 0.00146 0.146…
 3 MT        M      2000 Nicholas    43 MT                1    43 0.00174 0.174…
 4 VT        M      2000 Nicholas    47 VT                1    47 0.00191 0.190…
 5 ND        M      2000 Nicholas    51 ND                1    51 0.00207 0.206…
 6 HI        M      2000 Nicholas    64 HI                1    64 0.00260 0.259…
 7 SD        M      2000 Nicholas    65 SD                1    65 0.00264 0.263…
 8 DC        M      2000 Nicholas    69 DC                1    69 0.00280 0.279…
 9 ID        M      2000 Nicholas    89 ID                1    89 0.00361 0.361…
10 DE        M      2000 Nicholas    94 DE                1    94 0.00381 0.381…
# ℹ 41 more rows

Joining to Spatial Data

# One-time installation: urbnmapr and urbnthemes are installed from GitHub
install.packages("remotes")
remotes::install_github("UrbanInstitute/urbnmapr")
remotes::install_github("UrbanInstitute/urbnthemes", force = TRUE)

# Load the mapping packages
library(urbnmapr)
library(urbnthemes)
library(janitor)
library(sf)

── R CMD build ─────────────────────────────────────────────────────────────────
* checking for file 'C:\Users\Ameria\AppData\Local\Temp\RtmpK4KRpC\remotes435c28aa268c\UrbanInstitute-urbnthemes-f6a368d/DESCRIPTION' ... OK
* preparing 'urbnthemes':
* checking DESCRIPTION meta-information ...Warning in person1(given = given[[i]], family = family[[i]], middle = middle[[i]],  :
  It is recommended to use 'given' instead of 'middle'.
 OK
Warning in person1(given = given[[i]], family = family[[i]], middle = middle[[i]],  :
  It is recommended to use 'given' instead of 'middle'.
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building 'urbnthemes_0.0.2.tar.gz'

#joining the data to a map
spatial_data <- left_join(
  baby_names_state_transformed,
  get_urbn_map(map = "states", sf = TRUE),
  by = "state_abbv"
)
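
A left join keeps every row of the first table, so a state abbreviation with no match in the map data would silently end up with missing geometry. A minimal sketch of checking for this with anti_join() follows. `toy_names` and `toy_map` are invented stand-ins for the two tables, and "ZZ" is a deliberately invalid code.

```r
library(dplyr)

# Hypothetical stand-ins for the names table and the map table
toy_names <- data.frame(state_abbv = c("AK", "WY", "ZZ"), num = c(36, 30, 1))
toy_map   <- data.frame(state_abbv = c("AK", "WY"),
                        state_name = c("Alaska", "Wyoming"))

# Rows of toy_names that have no partner in toy_map
unmatched <- anti_join(toy_names, toy_map, by = "state_abbv")

# The left join keeps the unmatched row and fills its map columns with NA
joined <- left_join(toy_names, toy_map, by = "state_abbv")

unmatched
```

An empty `unmatched` table means every state in the names data found its polygon.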

Viewing the Spatial Data

spatial_data
# A tibble: 51 × 13
   state_abb sex    year name     count state_abbv Nicholas   num    perc labels
   <chr>     <chr> <dbl> <chr>    <dbl> <chr>         <dbl> <dbl>   <dbl> <chr> 
 1 WY        M      2000 Nicholas    30 WY                1    30 0.00122 0.121…
 2 AK        M      2000 Nicholas    36 AK                1    36 0.00146 0.146…
 3 MT        M      2000 Nicholas    43 MT                1    43 0.00174 0.174…
 4 VT        M      2000 Nicholas    47 VT                1    47 0.00191 0.190…
 5 ND        M      2000 Nicholas    51 ND                1    51 0.00207 0.206…
 6 HI        M      2000 Nicholas    64 HI                1    64 0.00260 0.259…
 7 SD        M      2000 Nicholas    65 SD                1    65 0.00264 0.263…
 8 DC        M      2000 Nicholas    69 DC                1    69 0.00280 0.279…
 9 ID        M      2000 Nicholas    89 ID                1    89 0.00361 0.361…
10 DE        M      2000 Nicholas    94 DE                1    94 0.00381 0.381…
# ℹ 41 more rows
# ℹ 3 more variables: state_fips <chr>, state_name <chr>,
#   geometry <MULTIPOLYGON [m]>

Mapping the Distribution of Boys Named Nicholas in 2000 across States

ggplot() +
  geom_sf(spatial_data, mapping = aes(fill = perc, geometry = geometry), color = "#ffffff", size = 0.25) +
  scale_fill_gradientn(labels = scales::percent) +
  labs(fill = "Percent") +
  coord_sf(datum = NA) +
  labs(title = "Distribution across States of Boys Named Nicholas in 2000",
       caption = "Source: Analyses in R by Joyner of data from https://data.world/nkrishnaswami/us-ssa-baby-names-state")

Publishing Your Document in RPubs

  • Render your document to find any problems.

  • Go to this website: RPubs

  • Register in RPubs if you don’t already have an account.

  • After you register, you can publish your document by clicking the light blue symbol in the upper right of the upper left pane.