Prerequisites (packages)

  • (babynames)
  • (dplyr)
  • (tidyr)
  • (ggplot2)
  • (gridExtra)
  • (magrittr)
  • (fastDummies)
  • (corrplot)
  • (purrr)
  • (broom)
  • (data.table)
  • (rlang)
  • (plotly)

Exploring Babynames Package

  • The babynames package contains baby names and their frequencies for the years 1880 to 2017
  • Let’s look at some summary statistics
library(babynames)
summary(babynames)
##       year          sex                name                 n          
##  Min.   :1880   Length:1924665     Length:1924665     Min.   :    5.0  
##  1st Qu.:1951   Class :character   Class :character   1st Qu.:    7.0  
##  Median :1985   Mode  :character   Mode  :character   Median :   12.0  
##  Mean   :1975                                         Mean   :  180.9  
##  3rd Qu.:2003                                         3rd Qu.:   32.0  
##  Max.   :2017                                         Max.   :99686.0  
##       prop          
##  Min.   :2.260e-06  
##  1st Qu.:3.870e-06  
##  Median :7.300e-06  
##  Mean   :1.363e-04  
##  3rd Qu.:2.288e-05  
##  Max.   :8.155e-02
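
The summary above reports the year column running from 1880 to 2017 and a smallest count of 5. A minimal sketch of checking those bounds directly, assuming only that the babynames package is installed (the variable names year_span and min_count are illustrative):

```r
library(babynames)

# Year span covered by the data set
year_span <- range(babynames$year)
print(year_span)

# Smallest count recorded in any row
min_count <- min(babynames$n)
print(min_count)
```

These two values are the valid bounds for any year argument and the floor for any count threshold used later.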

Name Popularity

  • Now that we have seen some statistics for baby names, let’s look at some popular female names from 1900
  • Seems like Mary was popular in 1900
bNames <- function(gender, yr, how_many){
  library(tidyverse)
  library(babynames)
  library(plotly)
  
  # gender check: only "M" and "F" occur in the data
  if (gender != "F" && gender != "M"){
    stop('You used Incorrect Gender Tag - Gender is required, and must be M for Male, and F for Female')
  }
  # year check: the data covers 1880 through 2017
  if (yr < 1880 || yr > 2017){
    stop('You used Incorrect Year - Year is required, and must be between 1880 and 2017')
  }
  
  # keep names given to more than 5000 babies of that sex in that year,
  # then keep at most how_many of them, ranked by count, and draw a donut chart
  graphresults <- babynames %>%
    filter(year == yr & sex == gender) %>%
    filter(n > 5000) %>%
    slice_max(n, n = how_many) %>%
    plot_ly(labels = ~name, values = ~n) %>%
    add_pie(hole = 0.5)
  return(graphresults)
}
bNames(gender = "F", 1900, 10)
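
The how_many argument is meant to cap the number of names shown. A minimal base R sketch of that "threshold, then top N" selection follows. The counts are made up for illustration and are not taken from babynames, and top_names is a hypothetical helper, not part of the original code.

```r
# Toy data: made-up counts for invented names
toy <- data.frame(
  name = c("Ada", "Bea", "Cy", "Dot"),
  n    = c(7000, 12000, 300, 9000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows above a count threshold,
# sort them from largest to smallest count, and return at most k rows
top_names <- function(df, k, min_count = 5000) {
  kept <- df[df$n > min_count, ]
  kept <- kept[order(-kept$n), ]
  head(kept, k)
}

top2 <- top_names(toy, k = 2)
print(top2)
```

With these toy counts, "Cy" is removed by the threshold and the two rows returned are "Bea" and "Dot", in that order.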

Exploring bNames function

  • Here we will look at the bNames function
  • All of the function's arguments are required; none of them has a default value
  • Required arguments for bNames: Gender (“M” or “F”), Year (1880-2017), and How many top names
  • Gender is categorical; Year and How many top names are integers
  • Passing these values as bNames(gender = "F", 1900, 10) produces the result below
bNames <- function(gender, yr, how_many){
  library(tidyverse)
  library(babynames)
  
  # gender check: only "M" and "F" occur in the data
  if (gender != "F" && gender != "M"){
    stop('You used Incorrect Gender Tag - Gender is required, and must be M for Male, and F for Female')
  }
  # year check: the data covers 1880 through 2017
  if (yr < 1880 || yr > 2017){
    stop('You used Incorrect Year - Year is required, and must be between 1880 and 2017')
  }
  # same selection as the chart version, returned as a table
  results <- babynames %>%
    filter(year == yr & sex == gender) %>%
    filter(n > 5000) %>%
    slice_max(n, n = how_many)
  # return() wraps kable(); a return() placed inside kable() would exit
  # the function before the table is ever formatted
  return(knitr::kable(results))
}
bNames("F", 1900, 10)
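
The argument rules listed above (gender is "M" or "F", year within 1880-2017, a whole number of names) can also be checked in one place. The following is a minimal sketch in base R. check_args is a hypothetical helper introduced here for illustration, and the type and length checks go beyond what the original function tests.

```r
# Hypothetical helper (not part of the original code): validate bNames-style arguments
check_args <- function(gender, yr, how_many) {
  if (!is.character(gender) || length(gender) != 1 || !(gender %in% c("F", "M"))) {
    stop("gender must be \"M\" or \"F\"")
  }
  if (!is.numeric(yr) || length(yr) != 1 || yr < 1880 || yr > 2017 || yr != round(yr)) {
    stop("yr must be a whole year between 1880 and 2017")
  }
  if (!is.numeric(how_many) || length(how_many) != 1 || how_many < 1 || how_many != round(how_many)) {
    stop("how_many must be a positive whole number")
  }
  invisible(TRUE)
}

# A valid call passes silently; invalid calls raise an error whose message we capture
ok       <- check_args("F", 1900, 10)
bad_sex  <- tryCatch(check_args("X", 1900, 10), error = function(e) conditionMessage(e))
bad_year <- tryCatch(check_args("F", 1879, 10), error = function(e) conditionMessage(e))
print(bad_sex)
print(bad_year)
```

Keeping the checks in a separate function would let the chart and table versions of bNames share them instead of repeating the same if blocks.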

Part 1. Predicting the future with bNamesPred!

  • Let’s use our awesome skills to predict which Female names will be popular in year 2025
  • bNamesPred has the following required arguments: Startdate and Enddate (both integer years between 1880 and 2017, and Startdate cannot be > Enddate) and Gender (“M” or “F”). The optional predyear argument defaults to 2025
library(babynames)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(magrittr)
library(fastDummies)
library(corrplot)
library(purrr)
library(broom)
library(data.table)
library(rlang)

bNamesPred <- function(startdate, enddate, gender, predyear = 2025){
  if (gender != "F" && gender != "M"){
    stop('You used Incorrect Gender Tag - Gender is required, and must be M for Male, and F for Female')
  }
  if (startdate < 1880 || enddate > 2017 || startdate > enddate){
    stop('You used Incorrect Year - Years must be between 1880 and 2017, and startdate cannot be after enddate')
  }
  # keep the chosen years and gender, and only name-year rows with more than 10000 babies
  result <- babynames %>%
    filter(year %in% (startdate:enddate) & sex == gender) %>%
    filter(n > 10000)
  # one linear model per name, with fitted values attached by augment()
  fits <- result %>%
    nest(data = -name) %>%
    mutate(fit = map(data, ~lm(n ~ year, data = .)), results = map(fit, augment)) %>%
    unnest(results)
  # trend plot per name; built for inspection but not returned
  Trend_Graph <- fits %>%
    ggplot(aes(y = n, x = year)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ splines::bs(x, 3), se = FALSE, alpha = .15) +
    facet_grid(name ~ .)
  # predict each name's count for predyear from its own linear model
  new_year <- data.frame(year = c(predyear))
  Prediction_model <- result %>%
    group_by(name) %>%
    nest() %>%
    mutate(m1 = map(.x = data, .f = ~lm(n ~ year, data = .))) %>%
    mutate(Pred = map(.x = m1, ~ predict(., new_year))) %>%
    select(name, Pred) %>%
    unnest(Pred)
  # bar chart of the names predicted to exceed 11000 babies in predyear
  Prediction_Graph <- Prediction_model %>%
    filter(Pred > 11000) %>%
    ggplot(aes(name, Pred)) +
    geom_col(aes(fill = name))
  return(Prediction_Graph)
}
bNamesPred(2015, 2017,"F")
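
The core of bNamesPred is "fit one straight line per name, then read it off at the prediction year". A minimal base R sketch of that step follows. The names and counts are made up for illustration and are not taken from babynames. It uses split and sapply rather than the nest and map pipeline used in the function.

```r
# Toy data: made-up counts for two invented names, exactly linear in year
toy <- data.frame(
  name = rep(c("Ada", "Bea"), each = 3),
  year = rep(2015:2017, times = 2),
  n    = c(10000, 11000, 12000, 15000, 14000, 13000)
)

# One-row data frame holding the year to predict for
new_year <- data.frame(year = 2025)

# Fit one linear model per name and evaluate it at the prediction year
preds <- sapply(split(toy, toy$name), function(d) {
  fit <- lm(n ~ year, data = d)
  unname(predict(fit, newdata = new_year))
})
print(preds)
```

Because the toy counts change by exactly 1000 per year, the lines extend to 20000 for "Ada" (rising) and 5000 for "Bea" (falling). With a cut-off of 11000 on the predicted value, only "Ada" would appear in the final chart.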

Part 2. Predicting the future!

  • Let’s use our awesome skills to predict which Male names will be popular in year 2025
  • As before, bNamesPred has the following required arguments: Startdate and Enddate (both integer years between 1880 and 2017, and Startdate cannot be > Enddate) and Gender (“M” or “F”)
bNamesPred(2015, 2017,"M")

Part 3. Understanding the Prediction and Closing thoughts

  • I used linear models on the most popular names for a range of years
  • The names are first filtered down to the most popular ones in the given range; a separate linear model is then fitted to each name and evaluated at year 2025 to get the prediction
  • It is important to know that this is just a simple / naive prediction
  • The model bases its final outputs on the given range, so the range should come from the tail end of the available years, e.g. use 2007-2017 instead of 1880-1950
  • We are also missing a lot of predictor variables that could help us build a better model and try different kinds of models
  • This prediction should not be taken seriously, as I have made assumptions like a linear increase / decrease in name popularity. This was needed because we do not have enough predictor variables to reach a concrete conclusion
  • For predicting the names I am looking at names from 2007 to 2017 (this range can be adjusted by the user) and filtering for names where the count is above 10000; without this restriction all names would be included, and that set can be extremely large
  • Once the filtering is done on the most popular names, a regression model is run on each of those names, and I am using the map function to accomplish this task
  • A new one-row data frame holding the prediction year is also created, and once the linear models have run I pass it to predict to get a prediction for year 2025
  • The final result is just a graph showing the names predicted to be most popular
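
One concrete way the linear assumption breaks down: a straight line fitted to a falling name keeps falling, so a prediction several years out can go below zero. A minimal sketch with made-up counts (not taken from babynames) follows. The clamp at zero is an assumed guard added for illustration and is not part of bNamesPred.

```r
# Toy data: made-up counts for an invented name that drops 3000 per year
toy <- data.frame(year = 2015:2017, n = c(21000, 18000, 15000))

# Fit the straight line and extend it to 2025
fit <- lm(n ~ year, data = toy)
raw_pred <- unname(predict(fit, newdata = data.frame(year = 2025)))

# The raw value is negative, which is impossible for a count;
# clamping at zero is one simple guard
clamped_pred <- max(0, raw_pred)
print(c(raw = raw_pred, clamped = clamped_pred))
```

Here the line reaches -9000 babies in 2025, which the clamp turns into 0. A short recent range limits how far such a trend is extrapolated, but it does not remove the problem.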