Prerequisites (packages)
- (babynames)
- (dplyr)
- (tidyr)
- (ggplot2)
- (gridExtra)
- (magrittr)
- (fastDummies)
- (corrplot)
- (purrr)
- (broom)
- (data.table)
- (rlang)
- (plotly)
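A quick way to see which of the packages above are not yet installed is sketched below. It uses only base R; the variable names `pkgs`, `installed`, and `missing_pkgs` are illustrative, and the `install.packages()` call is left commented out so nothing is installed without the reader choosing to.

```r
# Packages listed in the prerequisites above
pkgs <- c("babynames", "dplyr", "tidyr", "ggplot2", "gridExtra", "magrittr",
          "fastDummies", "corrplot", "purrr", "broom", "data.table",
          "rlang", "plotly")

# requireNamespace() returns FALSE (instead of stopping) when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All prerequisite packages are installed")
}
```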
Exploring Babynames Package
- The babynames package contains baby names and their yearly frequencies from 1880 to 2017
- Let’s look at some summary statistics
library(babynames)
summary(babynames)
##       year          sex                name                 n
##  Min.   :1880   Length:1924665     Length:1924665     Min.   :    5.0
##  1st Qu.:1951   Class :character   Class :character   1st Qu.:    7.0
##  Median :1985   Mode  :character   Mode  :character   Median :   12.0
##  Mean   :1975                                         Mean   :  180.9
##  3rd Qu.:2003                                         3rd Qu.:   32.0
##  Max.   :2017                                         Max.   :99686.0
##       prop
##  Min.   :2.260e-06
##  1st Qu.:3.870e-06
##  Median :7.300e-06
##  Mean   :1.363e-04
##  3rd Qu.:2.288e-05
##  Max.   :8.155e-02
Name Popularity
- Now that we have seen some summary statistics for baby names, let's look at some popular female names from 1900
- It seems that Mary was popular in 1900
bNames <- function(gender, yr, how_many) {
  # Load the packages this function uses
  library(dplyr)
  library(babynames)
  library(plotly)
  # Gender check
  if (!gender %in% c("F", "M")) {
    stop('Incorrect gender tag: gender is required and must be "M" for male or "F" for female')
  }
  # Year check
  if (yr < 1880 || yr > 2017) {
    stop("Incorrect year: year is required and must be between 1880 and 2017")
  }
  # Keep names of the chosen sex and year with more than 5000 births,
  # limit them to the how_many most frequent, and draw a donut chart
  graphresults <- babynames %>%
    filter(year == yr, sex == gender, n > 5000) %>%
    slice_max(n, n = how_many) %>%
    plot_ly(labels = ~name, values = ~n) %>%
    add_pie(hole = 0.5)
  return(graphresults)
}
bNames(gender = "F", 1900, 10)
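The `how_many` argument is described as the number of top names to show. A minimal base R sketch of that kind of top-N selection follows. The data frame `toy` and its counts are made up for illustration and are not taken from babynames, and `top_names` is a hypothetical helper rather than part of bNames.

```r
# Made-up counts, for illustration only
toy <- data.frame(
  name = c("Ann", "Bea", "Cal", "Dee", "Eve"),
  n    = c(120, 450, 80, 300, 210),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sort by count (largest first) and keep the first how_many rows
top_names <- function(df, how_many) {
  ordered <- df[order(df$n, decreasing = TRUE), ]
  head(ordered, how_many)
}

top3 <- top_names(toy, 3)
print(top3)
```

With these toy counts the three rows kept are Bea, Dee, and Eve, in that order.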
Exploring bNames function
- Here we will look at the bNames function
- The function has three required arguments and no default arguments
- Required arguments for bNames: gender ("M" or "F"), year (1880-2017), and how many top names to return
- Gender is categorical; year and the number of top names are integers
- Passing these values as bNames(gender = "F", 1900, 10) produces the result below
bNames <- function(gender, yr, how_many) {
  # Load the packages this function uses
  library(dplyr)
  library(babynames)
  # Gender check
  if (!gender %in% c("F", "M")) {
    stop('Incorrect gender tag: gender is required and must be "M" for male or "F" for female')
  }
  # Year check
  if (yr < 1880 || yr > 2017) {
    stop("Incorrect year: year is required and must be between 1880 and 2017")
  }
  # Keep names of the chosen sex and year with more than 5000 births,
  # limited to the how_many most frequent
  results <- babynames %>%
    filter(year == yr, sex == gender, n > 5000) %>%
    slice_max(n, n = how_many)
  # Render the result as a table
  knitr::kable(results)
}
bNames("F", 1900, 10)
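The two argument checks in bNames can be exercised on their own. The sketch below pulls them into a hypothetical helper, `check_bnames_args` (not part of the original code), and wraps it in `tryCatch()` to show what a caller sees for a bad gender tag or an out-of-range year. The wording of the messages here is an assumption, not the original text.

```r
# Hypothetical helper mirroring the gender and year checks in bNames
check_bnames_args <- function(gender, yr) {
  if (!gender %in% c("F", "M")) {
    stop('gender must be "M" or "F"')
  }
  if (yr < 1880 || yr > 2017) {
    stop("yr must be between 1880 and 2017")
  }
  invisible(TRUE)
}

# Return "ok" when the checks pass, otherwise the error message
try_check <- function(gender, yr) {
  tryCatch(
    {
      check_bnames_args(gender, yr)
      "ok"
    },
    error = function(e) conditionMessage(e)
  )
}

ok_msg     <- try_check("F", 1900)  # valid arguments
bad_gender <- try_check("X", 1900)  # invalid gender tag
bad_year   <- try_check("M", 1850)  # year before the data starts
print(c(ok_msg, bad_gender, bad_year))
```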
Part 1. Predicting the future with bNamesPred!
- Let's use our awesome skills to predict which female names will be popular in the year 2025
- bNamesPred has the following required arguments: startdate and enddate (both integer years between 1880 and 2017, with startdate no greater than enddate) and gender ("M" or "F"); predyear is optional and defaults to 2025
library(babynames)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(magrittr)
library(fastDummies)
library(corrplot)
library(purrr)
library(broom)
library(data.table)
library(rlang)
bNamesPred <- function(startdate, enddate, gender, predyear = 2025) {
  # Gender check
  if (!gender %in% c("F", "M")) {
    stop('Incorrect gender tag: gender is required and must be "M" for male or "F" for female')
  }
  # Year range check
  if (startdate < 1880 || enddate > 2017 || startdate > enddate) {
    stop("Incorrect year range: startdate and enddate must be between 1880 and 2017, and startdate cannot be greater than enddate")
  }
  # Keep names of the chosen sex with more than 10000 births in a year within the range
  result <- babynames %>%
    filter(year %in% startdate:enddate, sex == gender, n > 10000)
  # Year to predict for
  new_year <- data.frame(year = predyear)
  # Fit one linear model (n ~ year) per name and predict the count for predyear
  Prediction_model <- result %>%
    group_by(name) %>%
    nest() %>%
    mutate(m1 = map(data, ~ lm(n ~ year, data = .x)),
           Pred = map_dbl(m1, ~ predict(.x, new_year))) %>%
    select(name, Pred)
  # Bar chart of the names whose predicted count exceeds 11000
  Prediction_Graph <- Prediction_model %>%
    filter(Pred > 11000) %>%
    ggplot(aes(name, Pred)) +
    geom_col(aes(fill = name))
  return(Prediction_Graph)
}
bNamesPred(2015, 2017,"F")
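The core of bNamesPred, one linear model per name followed by a prediction for a future year, can be sketched in base R without the tidyverse pipeline. This is an illustration under stated assumptions rather than the function's actual implementation: the data frame `toy` holds made-up, exactly linear counts for two hypothetical names, "A" and "B".

```r
# Made-up counts for two hypothetical names, for illustration only
toy <- data.frame(
  name = rep(c("A", "B"), each = 3),
  year = rep(2015:2017, times = 2),
  n    = c(100, 110, 120, 300, 280, 260)
)

# Year to predict for
new_year <- data.frame(year = 2025)

# Split the data by name, fit n ~ year for each piece, and predict for 2025
by_name <- split(toy, toy$name)
preds <- vapply(by_name, function(d) {
  fit <- lm(n ~ year, data = d)
  unname(predict(fit, newdata = new_year))
}, numeric(1))

print(preds)
```

Because the toy counts are exactly linear (A rises by 10 per year, B falls by 20 per year), the extrapolated values for 2025 are 200 for A and 100 for B.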

Part 2. Predicting the future!
- Let's use our awesome skills to predict which male names will be popular in the year 2025
- As before, bNamesPred has the following required arguments: startdate and enddate (both integer years between 1880 and 2017, with startdate no greater than enddate) and gender ("M" or "F")
bNamesPred(2015, 2017,"M")

Part 3. Understanding the Prediction and Closing thoughts
- I fitted linear models to the most popular names within a range of years
- The names are first filtered down to the most popular ones in the given range; a linear model is then fitted for each name and used to predict its count for the year 2025
- It is important to know that this is just a simple, naive prediction
- The model bases its output on the given range, so the range should come from the tail end of the available years, e.g. use 2007-2017 instead of 1880-1950
- We are also missing many predictor variables that could help us build a better model and try different kinds of models
- This prediction should not be taken seriously, because I have assumed a linear increase or decrease in name popularity. This assumption was needed because we do not have enough predictor variables to draw a concrete conclusion. For the prediction I look at names from a recent range of years such as 2007 to 2017 (the range can be adjusted by the user; the calls above use 2015 to 2017) and keep only the names whose yearly count is above 10000, since without this restriction every name would be included and the result set would be extremely large. Once the most popular names are selected, a regression model is fitted for each of them using the map function. A one-row data frame, new_year, holds the year to predict for, and once the linear models are fitted it is passed to predict to obtain a prediction for 2025. The final result is a bar chart of the names whose predicted count is above 11000.
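One way to see how uncertain such an extrapolation is, without changing the model, is to ask `predict()` for a prediction interval alongside the point estimate. The sketch below is not part of the analysis above; the counts in `toy` are made up for illustration and do not come from babynames.

```r
# Made-up yearly counts for a single hypothetical name, for illustration only
toy <- data.frame(
  year = 2007:2017,
  n    = c(15200, 14900, 14100, 13800, 13900, 13100,
           12700, 12900, 12000, 11800, 11300)
)

# Same kind of model as in bNamesPred: count as a linear function of year
fit <- lm(n ~ year, data = toy)

# Point prediction for 2025 plus a 95% prediction interval (columns fit, lwr, upr)
pred <- predict(fit, newdata = data.frame(year = 2025),
                interval = "prediction", level = 0.95)
print(pred)

# Width of the interval: a rough measure of how uncertain the extrapolation is
interval_width <- pred[1, "upr"] - pred[1, "lwr"]
print(interval_width)
```

The interval only reflects scatter around a straight line; it does not account for the possibility that the trend itself is not linear, which is the larger caveat here.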