Having fun with baby names

Prerequisites (packages)

(babynames)
(dplyr)
(tidyr)
(ggplot2)
(gridExtra)
(magrittr)
(fastDummies)
(corrplot)
(purrr)
(broom)
(data.table)
(rlang)
(plotly)

Exploring Babynames Package

The babynames package contains baby names and their frequencies for the years 1880 to 2017.
Let’s look at some summary statistics.

library(babynames)
summary(babynames)

##       year          sex                name                 n          
##  Min.   :1880   Length:1924665     Length:1924665     Min.   :    5.0  
##  1st Qu.:1951   Class :character   Class :character   1st Qu.:    7.0  
##  Median :1985   Mode  :character   Mode  :character   Median :   12.0  
##  Mean   :1975                                         Mean   :  180.9  
##  3rd Qu.:2003                                         3rd Qu.:   32.0  
##  Max.   :2017                                         Max.   :99686.0  
##       prop          
##  Min.   :2.260e-06  
##  1st Qu.:3.870e-06  
##  Median :7.300e-06  
##  Mean   :1.363e-04  
##  3rd Qu.:2.288e-05  
##  Max.   :8.155e-02

Name Popularity

Now that we have seen some summary statistics for baby names, let’s look at the popular female names from 1900.
Mary appears to have been the most popular of them.

library(dplyr)
library(babynames)
library(plotly)

# Donut chart of the most popular names for one sex in one year.
bNames <- function(gender, yr, how_many) {
  # Validate the gender tag
  if (gender != "F" && gender != "M") {
    stop("Incorrect gender tag: gender is required and must be M for male or F for female")
  }
  # Validate the year against the range covered by the data
  if (yr < 1880 || yr > 2017) {
    stop("Incorrect year: year is required and must be between 1880 and 2017")
  }

  # Keep names given to more than 5000 babies, capped at the how_many most frequent
  graphresults <- babynames %>%
    filter(year == yr, sex == gender, n > 5000) %>%
    arrange(desc(n)) %>%
    head(how_many) %>%
    plot_ly(labels = ~name, values = ~n) %>%
    add_pie(hole = 0.5)
  return(graphresults)
}
bNames(gender = "F", 1900, 10)

Exploring bNames function

Here we look at the bNames function in more detail.
The function takes three required arguments and has no default values.
Required arguments for bNames: gender (“M” or “F”), year (1880 to 2017), and how many top names to return.
Gender is categorical; the year and the number of top names are integers.
Calling bNames("F", 1900, 10) produces the result below.

library(dplyr)
library(babynames)

# Table of the most popular names for one sex in one year.
bNames <- function(gender, yr, how_many) {
  # Validate the gender tag
  if (gender != "F" && gender != "M") {
    stop("Incorrect gender tag: gender is required and must be M for male or F for female")
  }
  # Validate the year against the range covered by the data
  if (yr < 1880 || yr > 2017) {
    stop("Incorrect year: year is required and must be between 1880 and 2017")
  }
  # Keep names given to more than 5000 babies, capped at the how_many most frequent
  results <- babynames %>%
    filter(year == yr, sex == gender, n > 5000) %>%
    arrange(desc(n)) %>%
    head(how_many)
  # Render the result as a table
  knitr::kable(results)
}
bNames("F", 1900, 10)
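The selection step inside bNames can also be tried without the babynames package. The sketch below is only an illustration: it uses a small made-up data frame with the same columns as babynames (the names and counts are invented, not real data) and a hypothetical helper, top_names, that keeps the how_many most frequent names above a count threshold using base R only.

```r
# Toy data with the same column layout as babynames; the counts are made up.
toy <- data.frame(
  year = c(1900, 1900, 1900, 1900, 1900, 1901),
  sex  = c("F", "F", "F", "F", "M", "F"),
  name = c("Ann", "Bea", "Cat", "Dot", "Ed", "Ann"),
  n    = c(9000, 7000, 6000, 300, 8000, 9500),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the how_many most frequent names for one sex and year,
# restricted to names whose count exceeds min_count.
top_names <- function(data, gender, yr, how_many, min_count = 5000) {
  keep <- data[data$year == yr & data$sex == gender & data$n > min_count, ]
  keep <- keep[order(keep$n, decreasing = TRUE), ]
  head(keep, how_many)
}

top2 <- top_names(toy, "F", 1900, 2)
print(top2)
```

Asking for more names than pass the threshold simply returns every name that passes it, so how_many acts as an upper bound rather than a guarantee.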

Part 1. Predicting the future with bNamesPred!

Let’s use our awesome skills to predict which female names will be popular in the year 2025.
bNamesPred has the following required arguments: startdate and enddate, both integer years between 1880 and 2017 with startdate no later than enddate, and gender (“M” or “F”). The optional predyear argument defaults to 2025.

library(babynames)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(magrittr)
library(fastDummies)
library(corrplot)
library(purrr)
library(broom)
library(data.table)
library(rlang)

# Fit one linear trend per popular name and plot the predicted counts for predyear.
bNamesPred <- function(startdate, enddate, gender, predyear = 2025) {
  if (gender != "F" && gender != "M") {
    stop("Incorrect gender tag: gender is required and must be M for male or F for female")
  }
  if (startdate < 1880 || enddate > 2017) {
    stop("Incorrect year: years are required and must be between 1880 and 2017")
  }
  if (startdate > enddate) {
    stop("Incorrect year range: startdate cannot be later than enddate")
  }
  # Keep the names of the requested sex with more than 10000 births in a year of the range
  result <- babynames %>%
    filter(year %in% (startdate:enddate), sex == gender, n > 10000)
  # Fit n ~ year for each name and predict the count for predyear
  new_year <- data.frame(year = predyear)
  Prediction_model <- result %>%
    group_by(name) %>%
    nest() %>%
    mutate(m1 = map(data, ~ lm(n ~ year, data = .x))) %>%
    mutate(Pred = map(m1, ~ predict(.x, new_year))) %>%
    select(name, Pred) %>%
    unnest(Pred)
  # Bar chart of the names predicted to exceed 11000 births
  Prediction_Graph <- Prediction_model %>%
    filter(Pred > 11000) %>%
    ggplot(aes(name, Pred)) +
    geom_col(aes(fill = name))
  return(Prediction_Graph)
}
bNamesPred(2015, 2017, "F")
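The argument rules stated for bNamesPred (a valid gender tag, years inside the range covered by the data, and a start year that does not come after the end year) can be gathered in one place. The following is a minimal sketch under that reading; check_pred_args, first_year, and last_year are hypothetical names introduced here for illustration and are not part of the original function.

```r
# Hypothetical helper that collects the argument checks for bNamesPred.
check_pred_args <- function(startdate, enddate, gender,
                            first_year = 1880, last_year = 2017) {
  if (!gender %in% c("F", "M")) {
    stop("gender must be \"M\" or \"F\"")
  }
  if (startdate < first_year || enddate > last_year) {
    stop("years must be between ", first_year, " and ", last_year)
  }
  if (startdate > enddate) {
    stop("startdate cannot be later than enddate")
  }
  invisible(TRUE)
}

# A valid call passes silently; an inverted range raises an error.
check_pred_args(2015, 2017, "F")
msg <- tryCatch(check_pred_args(2017, 2015, "F"),
                error = function(e) conditionMessage(e))
print(msg)
```

Keeping the checks in a separate helper would let both the female and the male runs share the same validation, though inlining them in the function works just as well.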

Part 2. Predicting the future!

Let’s use our awesome skills to predict which male names will be popular in the year 2025.
As before, bNamesPred takes startdate and enddate (integer years between 1880 and 2017, with startdate no later than enddate) and gender (“M” or “F”).

# bNamesPred is the function defined in Part 1; only the gender argument changes
bNamesPred(2015, 2017, "M")

Part 3. Understanding the Prediction and Closing thoughts

I used linear models on the most popular names for a range of years.
After filtering for the most popular names in the given range, a linear model is fitted for each name and used to predict its count in 2025.
It is important to know that this is just a simple, naive prediction.
The model bases its output on the given range, so the range should come from the tail end of the available years, e.g. use 2007-2017 instead of 1880-1950.
We are also missing many predictor variables that could help us build a better model or try different ones.
This prediction should not be taken seriously, as I have made assumptions such as a linear increase or decrease in name popularity. This was needed because we do not have enough predictor variables to reach a concrete conclusion. To predict the names, I look at a recent range of years such as 2007 to 2017 (the user can adjust this range; the calls above use 2015 to 2017) and keep only the names whose count is above 10000, since without this restriction all names would be included and the set can be extremely large. Once the most popular names are filtered, a regression model is run on each of them, using the map function to accomplish this. A new data frame holding the prediction year is also created, and once the linear models have run it is used to get a prediction for the year 2025. The final result is just a graph showing the names predicted to be most popular.
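The per-name extrapolation described above can be illustrated in base R. This is a sketch on made-up data, not the code used in the report: the names and counts are invented, each toy series is exactly linear in year, and predict_counts is a hypothetical helper that fits n ~ year separately for every name and evaluates each fit at the prediction year.

```r
# Toy counts for two made-up names; each series is exactly linear in year.
toy <- data.frame(
  name = rep(c("Ann", "Bea"), each = 3),
  year = rep(2015:2017, times = 2),
  n    = c(100, 110, 120, 300, 280, 260)
)

# Predict the count in pred_year for each name from its own n ~ year fit.
predict_counts <- function(data, pred_year = 2025) {
  new_year <- data.frame(year = pred_year)
  by_name <- split(data, data$name)
  sapply(by_name, function(d) {
    fit <- lm(n ~ year, data = d)
    unname(predict(fit, newdata = new_year))
  })
}

preds <- predict_counts(toy)
print(round(preds))
```

On this toy data the straight lines extrapolate to about 200 for Ann (rising by 10 per year) and about 100 for Bea (falling by 20 per year). The name that leads in 2017 ends up behind in 2025, which shows how strongly a linear assumption drives the result when it is extended eight years past the data.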

Haris Javed

March 6, 2020
