email: jc3181 AT columbia DOT edu

 

Introduction

The babynames package put together by Hadley Wickham is a lot of fun for teaching R. It contains a dataset of the same name that records, for each name, the number of boys and girls born with it each year since 1880.

What you’ll notice if you play around with graphing the distribution of children born with particular names is that there are different patterns. Some names were popular long ago, some are only popular recently, others have had ups and downs in popularity. For a bit of fun, I thought it would be interesting to try and identify different patterns through principal components analysis (PCA) and clustering techniques.

This is mainly for fun. I’m not being too rigorous with my choice of methods, but the exercise certainly has merit. I’m going to leave all the raw code in the final output so others can see how these graphs and conclusions were produced.

 

Load libraries and data

These are the libraries that we’ll need. The data is in babynames; we will use dplyr and tidyr for data manipulation, magrittr for its pipe operators, and ggplot2 and gridExtra for plotting.

### load packages
library(babynames) 
library(dplyr) 
library(tidyr)
library(ggplot2)
library(gridExtra)
library(magrittr)


head(babynames)
## Source: local data frame [6 x 5]
## 
##   year sex      name    n       prop
## 1 1880   F      Mary 7065 0.07238359
## 2 1880   F      Anna 2604 0.02667896
## 3 1880   F      Emma 2003 0.02052149
## 4 1880   F Elizabeth 1939 0.01986579
## 5 1880   F    Minnie 1746 0.01788843
## 6 1880   F  Margaret 1578 0.01616720
tail(babynames)
## Source: local data frame [6 x 5]
## 
##   year sex   name n         prop
## 1 2013   M  Zyere 5 2.499421e-06
## 2 2013   M Zyhier 5 2.499421e-06
## 3 2013   M  Zylar 5 2.499421e-06
## 4 2013   M Zymari 5 2.499421e-06
## 5 2013   M Zymeer 5 2.499421e-06
## 6 2013   M  Zyree 5 2.499421e-06

As can be seen, this dataset records the number of children (n) of each name (name) that are boys or girls (sex) that were born in each year (year). It also gives the proportion (prop) of children of that sex born with that name in each year.

Here is my basic code for plotting the distribution of each name over time. I’ve picked four contrasting examples.

g1 <- babynames %>%
  filter(name=="Barbara") %$% 
  ggplot(., aes(year, n)) +
  geom_line(aes(color=sex), lwd=1) +
  scale_color_manual(values = c("firebrick1", "dodgerblue")) +
  theme_bw() +
  ggtitle("Barbara")

g2 <- babynames %>%
  filter(name=="Megan") %$% 
  ggplot(., aes(year, n)) +
  geom_line(aes(color=sex), lwd=1) +
  scale_color_manual(values = c("firebrick1", "dodgerblue")) +
  theme_bw() +
  ggtitle("Megan")

g3 <- babynames %>%
  filter(name=="Jennifer") %$% 
  ggplot(., aes(year, n)) +
  geom_line(aes(color=sex), lwd=1) +
  scale_color_manual(values = c("firebrick1", "dodgerblue")) +
  theme_bw() +
  ggtitle("Jennifer")

g4 <- babynames %>%
  filter(name=="Irene") %$% 
  ggplot(., aes(year, n)) +
  geom_line(aes(color=sex), lwd=1) +
  scale_color_manual(values = c("firebrick1", "dodgerblue")) +
  theme_bw() +
  ggtitle("Irene")

grid.arrange(g1,g2,g3,g4,ncol=2)
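The four plotting blocks above differ only in the name, so they could be wrapped in a small helper. This is just a sketch: plot_name is a hypothetical function of my own naming (it is not part of any package), and the toy data frame only stands in for babynames so that the example runs on its own.

```r
library(ggplot2)

# Hypothetical helper (not from any package): plot yearly counts for one name.
# `data` is assumed to have the babynames columns year, sex, name and n.
plot_name <- function(data, nm) {
  one_name <- data[data$name == nm, ]
  ggplot(one_name, aes(year, n)) +
    geom_line(aes(color = sex), lwd = 1) +
    # named values pin each color to a sex, even when only one sex is present
    scale_color_manual(values = c("F" = "firebrick1", "M" = "dodgerblue")) +
    theme_bw() +
    ggtitle(nm)
}

# Toy data standing in for babynames (the counts are made up)
toy <- data.frame(
  year = rep(1950:1952, times = 2),
  sex  = rep(c("F", "M"), each = 3),
  name = "Barbara",
  n    = c(100, 120, 90, 2, 3, 1),
  stringsAsFactors = FALSE
)
p <- plot_name(toy, "Barbara")
```

With the real data, each panel would then be a call such as plot_name(babynames, "Irene"). Naming the color values is a small design choice: with unnamed values the colors are assigned in the order of the sexes present, so a name recorded only for boys would be drawn in the first color.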

 

Getting some basic descriptives

These are the most popular names of all time:

babynames %>%
  group_by(sex, name) %>%
  summarize(total = sum(n)) %>%
  arrange(desc(total)) %$%
  split(., sex) 
## $F
## Source: local data frame [64,089 x 3]
## Groups: sex
## 
##    sex      name   total
## 1    F      Mary 4112464
## 2    F Elizabeth 1591439
## 3    F  Patricia 1570135
## 4    F  Jennifer 1461186
## 5    F     Linda 1450328
## 6    F   Barbara 1432543
## 7    F  Margaret 1238016
## 8    F     Susan 1120083
## 9    F   Dorothy 1105281
## 10   F     Sarah 1055860
## .. ...       ...     ...
## 
## $M
## Source: local data frame [38,601 x 3]
## Groups: sex
## 
##    sex    name   total
## 1    M   James 5091189
## 2    M    John 5073958
## 3    M  Robert 4789776
## 4    M Michael 4293460
## 5    M William 4038447
## 6    M   David 3565229
## 7    M  Joseph 2557792
## 8    M Richard 2552302
## 9    M Charles 2356886
## 10   M  Thomas 2275889
## .. ...     ...     ...

 

The total number of unique names:

babynames %$%
  split(., sex) %>%
  lapply(. %$% length(unique(name)))
## $F
## [1] 64089
## 
## $M
## [1] 38601

There are far more unique female names than male names.

 

Reshape data

The first exploration of these data that I’d like to do is to perform a PCA on the distribution of names over years for each sex. This will give us a general idea of how many different ‘components/groups’ we might expect.

To do this, we need to have our data in a ‘wide’ format, with each column/variable representing a year, each row representing a name, and each cell holding the total number of births that year for that particular name.

Also, for this overview, I’m only going to focus on females.

babywideF <- 
  babynames %>% 
  filter(sex=="F") %>% 
  select(name, year, n) %>%
  spread(year, n, fill=0) %>%
  as.data.frame() # plain data frame: tibbles do not keep row names

rownames(babywideF) <- babywideF$name  #set rownames
babywideF$name <- NULL # remove name var.
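To make the long-to-wide step concrete, here is a tiny sketch using only base R. The counts are made up and xtabs is swapped in for spread purely so the example has no package dependencies; in current tidyr, pivot_wider is the recommended replacement for spread.

```r
# Toy long-format data in the same shape as babynames (counts are made up)
toy <- data.frame(
  name = c("Mary", "Mary", "Anna", "Emma"),
  year = c(1880, 1881, 1880, 1881),
  n    = c(10, 12, 4, 3),
  stringsAsFactors = FALSE
)

# Cross-tabulate: one row per name, one column per year;
# name/year pairs that never occur are filled with 0
wide <- as.data.frame.matrix(xtabs(n ~ name + year, data = toy))
wide
```

Each row is now one name's count series across years, which is the shape that princomp and kmeans expect.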

 

PCA

For this sort of exploratory analysis, I’m simply going to use princomp, one of the PCA functions in base R, and draw scree plots.

 

### principal components analysis - females
resF.pca <- princomp(babywideF)
plot(resF.pca)

 

The above scree plot seems to indicate that there is one major component of names that accounts for the majority of variance. There then appear to be a few more components that account for a fair amount of variance. Depending on how micro-detailed we want to go, we could fairly reasonably look for 4 or 5 components, though there may be something of interest in looking at 7 or 8 groups.
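If you would rather read the scree plot as numbers, the share of variance per component can be computed from the sdev element that princomp returns. A minimal sketch on made-up data (the toy matrix is an assumption for illustration, not the baby names):

```r
set.seed(1)
# Toy matrix: 50 rows (standing in for names) by 4 columns (standing in for years)
toy <- matrix(rnorm(200), nrow = 50, ncol = 4)
toy[, 2] <- toy[, 1] * 3 + rnorm(50, sd = 0.1)  # make one direction dominate

res <- princomp(toy)

# Proportion of total variance carried by each component, and the running total
var_explained <- res$sdev^2 / sum(res$sdev^2)
cum_explained <- cumsum(var_explained)
round(cum_explained, 3)
```

The same calculation should apply to resF.pca$sdev, and summary(resF.pca) prints these proportions too.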

 

Clustering

To explore which names show similar distribution patterns over time, I’m first going to use k-means clustering. There are pros and cons to all clustering methods. A problem with k-means is that you can get different clusters each time you run it, because the starting cluster centers are chosen at random. An upside is that it is a fairly flexible method…

The clustering is done with the kmeans function and by setting the number of clusters to find. After saving the results, we can look at how many individual names have been put into each cluster (the number assigned to a cluster isn’t that important - e.g. if we re-run the clustering, the cluster with the most names could be given a different cluster number in future runs). I am going to call set.seed() to make sure that this code is repeatable.
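Here is a small sketch of the point about cluster numbers, on made-up data with three obvious groups. The nstart argument is something I am adding for illustration (the analysis in this post uses the default single start); it asks kmeans to try several random starts and keep the best one.

```r
# Toy data: three well-separated groups of 20 points each
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 20, sd = 0.5), ncol = 2)
)

# Two runs with different seeds; nstart = 50 keeps the best of 50 random starts
set.seed(100)
run_a <- kmeans(toy, centers = 3, nstart = 50)
set.seed(200)
run_b <- kmeans(toy, centers = 3, nstart = 50)

# Cross-tabulate the labels: the same groups appear, possibly under different numbers
table(run_a$cluster, run_b$cluster)
```

Each row of the cross-tabulation should have a single non-zero cell, whichever number each run happened to give the group.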

 

###k-means clustering analysis
set.seed(100)
resF.k <- kmeans(babywideF, 6)
table(resF.k$cluster)
## 
##     1     2     3     4     5     6 
## 63799    39    11    64    19   157

 

The majority of names (63,799) are contained within the 1st cluster. The other five clusters have a much more manageable number of names.
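One likely reason for the giant cluster is that k-means is working on raw counts: rare names have counts near zero in every year, so they all sit close together regardless of when they were used. A possible alternative, which is not what this post does, would be to divide each name's counts by its total so that only the shape over time matters. A sketch on made-up numbers:

```r
# Toy wide table: rows are names, columns are years (values are made up)
toy <- rbind(
  CommonEarly = c(900, 600, 300, 100),
  RareEarly   = c(  8,   7,   3,   1),
  CommonLate  = c(100, 300, 600, 900),
  RareLate    = c(  1,   2,   7,   9)
)
colnames(toy) <- 1880:1883

# Divide each row by its total so every name sums to 1 across the years
shape <- toy / rowSums(toy)

set.seed(1)
raw_k   <- kmeans(toy,   centers = 2, nstart = 10)$cluster
shape_k <- kmeans(shape, centers = 2, nstart = 10)$cluster

rbind(raw = raw_k, shape = shape_k)
```

On the raw counts the two common names end up together and the two rare names together; on the rescaled rows the split is early versus late. Whether that would be an improvement for the real data is an open question I have not tested.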

Let’s look at some of these names in more detail:

names(resF.k$cluster[resF.k$cluster==2])
##  [1] "Alice"     "Ann"       "Anna"      "Betty"     "Beverly"  
##  [6] "Brenda"    "Carolyn"   "Catherine" "Cheryl"    "Christine"
## [11] "Cynthia"   "Debra"     "Diane"     "Doris"     "Dorothy"  
## [16] "Evelyn"    "Frances"   "Gloria"    "Helen"     "Jane"     
## [21] "Janet"     "Janice"    "Jean"      "Joan"      "Joyce"    
## [26] "Judith"    "Judy"      "Kathleen"  "Margaret"  "Marie"    
## [31] "Marilyn"   "Martha"    "Mildred"   "Pamela"    "Rose"     
## [36] "Ruth"      "Sharon"    "Shirley"   "Virginia"
names(resF.k$cluster[resF.k$cluster==3])
##  [1] "Barbara"  "Carol"    "Deborah"  "Donna"    "Karen"    "Linda"   
##  [7] "Mary"     "Nancy"    "Patricia" "Sandra"   "Susan"
names(resF.k$cluster[resF.k$cluster==4])
##  [1] "Abigail"   "Alexandra" "Alexis"    "Alicia"    "Allison"  
##  [6] "Alyssa"    "Amber"     "Andrea"    "Ava"       "Brianna"  
## [11] "Brittany"  "Brooke"    "Cassandra" "Chelsea"   "Chloe"    
## [16] "Courtney"  "Crystal"   "Danielle"  "Destiny"   "Emily"    
## [21] "Emma"      "Erica"     "Erin"      "Gabrielle" "Grace"    
## [26] "Hailey"    "Haley"     "Hannah"    "Isabella"  "Jamie"    
## [31] "Jasmine"   "Jenna"     "Jordan"    "Julia"     "Kaitlyn"  
## [36] "Katelyn"   "Katherine" "Katie"     "Kayla"     "Kelsey"   
## [41] "Kristen"   "Lauren"    "Leah"      "Lindsey"   "Madison"  
## [46] "Maria"     "Megan"     "Mia"       "Morgan"    "Natalie"  
## [51] "Olivia"    "Paige"     "Rachel"    "Samantha"  "Sara"     
## [56] "Savannah"  "Shannon"   "Shelby"    "Sophia"    "Sydney"   
## [61] "Taylor"    "Tiffany"   "Vanessa"   "Victoria"
names(resF.k$cluster[resF.k$cluster==5])
##  [1] "Amanda"    "Amy"       "Angela"    "Ashley"    "Christina"
##  [6] "Elizabeth" "Heather"   "Jennifer"  "Jessica"   "Kelly"    
## [11] "Kimberly"  "Laura"     "Lisa"      "Melissa"   "Michelle" 
## [16] "Nicole"    "Rebecca"   "Sarah"     "Stephanie"
names(resF.k$cluster[resF.k$cluster==6])
##   [1] "Agnes"      "Alma"       "Anita"      "Anne"       "Annette"   
##   [6] "Annie"      "April"      "Arlene"     "Audrey"     "Beatrice"  
##  [11] "Becky"      "Bernice"    "Bertha"     "Beth"       "Bonnie"    
##  [16] "Carla"      "Carmen"     "Carole"     "Caroline"   "Carrie"    
##  [21] "Cathy"      "Charlene"   "Charlotte"  "Cindy"      "Clara"     
##  [26] "Claudia"    "Colleen"    "Connie"     "Constance"  "Dana"      
##  [31] "Darlene"    "Dawn"       "Deanna"     "Debbie"     "Delores"   
##  [36] "Denise"     "Diana"      "Dianne"     "Dolores"    "Edith"     
##  [41] "Edna"       "Eileen"     "Elaine"     "Eleanor"    "Ella"      
##  [46] "Ellen"      "Elsie"      "Esther"     "Ethel"      "Eva"       
##  [51] "Florence"   "Gail"       "Georgia"    "Geraldine"  "Gertrude"  
##  [56] "Gina"       "Gladys"     "Glenda"     "Gwendolyn"  "Hazel"     
##  [61] "Heidi"      "Holly"      "Ida"        "Irene"      "Jackie"    
##  [66] "Jacqueline" "Jeanette"   "Jeanne"     "Jessie"     "Jill"      
##  [71] "Jo"         "Joann"      "Joanne"     "Josephine"  "Joy"       
##  [76] "Juanita"    "Julie"      "June"       "Kathryn"    "Kathy"     
##  [81] "Kay"        "Kim"        "Kristin"    "Laurie"     "Leslie"    
##  [86] "Lillian"    "Lillie"     "Lois"       "Loretta"    "Lori"      
##  [91] "Lorraine"   "Louise"     "Lucille"    "Lucy"       "Lynn"      
##  [96] "Marcia"     "Marian"     "Marion"     "Marjorie"   "Marlene"   
## [101] "Marsha"     "Maureen"    "Melanie"    "Melinda"    "Michele"   
## [106] "Monica"     "Norma"      "Patsy"      "Paula"      "Pauline"   
## [111] "Peggy"      "Penny"      "Phyllis"    "Regina"     "Renee"     
## [116] "Rhonda"     "Rita"       "Roberta"    "Robin"      "Rosa"      
## [121] "Rosemary"   "Ruby"       "Sally"      "Sheila"     "Sherri"    
## [126] "Sherry"     "Sheryl"     "Stacey"     "Stacy"      "Sue"       
## [131] "Suzanne"    "Sylvia"     "Tamara"     "Tammy"      "Tanya"     
## [136] "Tara"       "Teresa"     "Terri"      "Terry"      "Thelma"    
## [141] "Theresa"    "Tina"       "Toni"       "Tonya"      "Tracy"     
## [146] "Valerie"    "Vera"       "Veronica"   "Vicki"      "Vickie"    
## [151] "Vivian"     "Wanda"      "Wendy"      "Willie"     "Wilma"     
## [156] "Yolanda"    "Yvonne"

 

What do we think? Does it make sense for any of these names to be grouped together? Certainly group 3 seems to contain some of the traditionally more popular names such as Mary and Linda. Jennifer, classically popular from the late 70s to the 90s, is in group 5. Group 4 appears to contain names that have been popular more recently. Group 2 seems to contain traditional names like Frances and Joan. Group 6 contains a lot of names that appear to be traditional as well as some more recent ones, so it’s not immediately clear why they form a separate group.

Just for completeness, here is a random sample of 10 names from ‘group 1’. As can be seen, these tend to be uncommon names.

set.seed(10)
sample(names(resF.k$cluster[resF.k$cluster==1]),10)
##  [1] "Kricket" "Faraday" "Johnah"  "Nieisha" "Armonii" "Dasjia"  "Eldana" 
##  [8] "Eiko"    "Markiea" "Jonah"

 

Repeat the process?

It might be more beneficial to repeat this process, but only including the names in the five smaller clusters. To do this we will filter our data, dropping every name that appears in cluster 1. This leaves us with 290 names.

group1x <- names(resF.k$cluster[resF.k$cluster==1])
babywideF1 <- babywideF %>% mutate(id = rownames(.)) %>% filter(!id %in% group1x) 
rownames(babywideF1) <- babywideF1$id #using this temp var to re-insert names into rownames (probably not the best way of doing this)
babywideF1 %<>% select(-id) 
### principal components analysis - females
resF1.pca <- princomp(babywideF1)
plot(resF1.pca)
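On the filtering step above, where the names are put back into the row names through a temporary id column: base subsetting with a logical vector keeps row names by itself, so the round trip can be avoided. A sketch with made-up stand-ins (toy_wide and toy_cluster are hypothetical names, and this assumes the wide table is a plain data frame with the names as row names):

```r
# Toy stand-ins: a small wide table with names as row names, plus cluster labels
toy_wide <- data.frame(
  `1880` = c(5, 900, 7),
  `1881` = c(6, 950, 4),
  row.names = c("Kricket", "Mary", "Eiko"),
  check.names = FALSE
)
toy_cluster <- c(Kricket = 1, Mary = 2, Eiko = 1)

# Keep rows whose cluster is not 1; base subsetting carries the row names along
toy_kept <- toy_wide[toy_cluster != 1, , drop = FALSE]
rownames(toy_kept)
```

Under the same assumption, the real version would be a one-liner along the lines of babywideF[resF.k$cluster != 1, ].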

 

The scree plot again indicates approximately three or four main components, plus perhaps 2 or 3 ‘fringe’ ones. Let’s proceed with 7 clusters, just because it might be more fun/interesting to try and split names up as much as possible to see if it makes logical sense.

###k-means clustering analysis
set.seed(10)
resF1.k <- kmeans(babywideF1, 7)
table(resF1.k$cluster)
## 
##   1   2   3   4   5   6   7 
##  17  17  11   1  66  14 164

 

Again, let’s look at these in a bit more detail, this time looking from smallest group to largest.

names(resF1.k$cluster[resF1.k$cluster==4])
## [1] "Mary"

 

Well, the name “Mary” is its own phenomenon.

  babynames %>%
  filter(sex=="F") %>%
  filter(name=="Mary") %$% 
  ggplot(., aes(year, n)) +
  geom_line(lwd=1, color="red") +
  theme_bw()

 

Next up is group 3 containing 11 names.

 

group3 <- names(resF1.k$cluster[resF1.k$cluster==3])

gg3 <- 
  babynames %>%
  filter(sex=="F") %>%
  filter(name %in% group3) %$% 
  ggplot(., aes(year, n)) +
  geom_line(aes(color=name, group=name), lwd=1) +
  theme_bw()

gg3

 

Well, these certainly seem to be children of the 80s and 90s - with some (Sarah, Emily, Rachel) having a very small blip in the 1920s.

 

Next up is group 6 containing 14 names.

group6 <- names(resF1.k$cluster[resF1.k$cluster==6])

gg6 <- 
  babynames %>%
  filter(sex=="F") %>%
  filter(name %in% group6) %$% 
  ggplot(., aes(year, n)) +
  geom_line(aes(color=name, group=name), lwd=1) +
  theme_bw()

gg6

 

This group of names appears to be children of the 1970s, though the distribution for most names is quite wide, from the mid 50s to the 90s. Three names look a little different from the rest. Lisa and Jennifer have much higher peaks than the others, peaking before and after 1975 respectively. In fact, looking at all the names, they split roughly equally between those that peak before 1975 (e.g. Julie, Lisa, Kimberly) and those that peak after it. The third name that is a bit different from the rest is Elizabeth. I’m not sure it should be classified with the others - it is very unusual in having 3 peaks.

 

  babynames %>%
  filter(sex=="F") %>%
  filter(name=="Elizabeth") %$% 
  ggplot(., aes(year, n)) +
  geom_line(lwd=1, color="red") +
  theme_bw()

 

The next group is group 1 with 17 names.

group1 <- names(resF1.k$cluster[resF1.k$cluster==1])

gg1 <- 
  babynames %>%
  filter(sex=="F") %>%
  filter(name %in% group1) %$% 
  ggplot(., aes(year, n)) +
  geom_line(aes(color=name, group=name), lwd=1) +
  theme_bw()

gg1

 

These names are those that had their heyday in the 1950s having started to become popular in the 1920s and 1930s. By the 1970s, they were losing popularity. The one that stands out most is Linda, which was exceptionally popular in the late 40s.

 

The next group is group 2, also with 17 names.

group2 <- names(resF1.k$cluster[resF1.k$cluster==2])

gg2 <- 
  babynames %>%
  filter(sex=="F") %>%
  filter(name %in% group2) %$% 
  ggplot(., aes(year, n)) +
  geom_line(aes(color=name, group=name), lwd=1) +
  theme_bw()

gg2

 

These names are primarily those whose boom period was the 1920s and 1930s. A couple of these names - Anna and Evelyn - are having mini-resurgences.

 

Group 5 is the second biggest group and it has 66 names.

group5 <- names(resF1.k$cluster[resF1.k$cluster==5])

group5 # these are the names
##  [1] "Abigail"   "Alexandra" "Alexis"    "Alicia"    "Allison"  
##  [6] "Alyssa"    "Amber"     "Andrea"    "April"     "Ava"      
## [11] "Brianna"   "Brooke"    "Caroline"  "Cassandra" "Chelsea"  
## [16] "Chloe"     "Christina" "Courtney"  "Crystal"   "Danielle" 
## [21] "Destiny"   "Ella"      "Emma"      "Erica"     "Erin"     
## [26] "Gabrielle" "Grace"     "Hailey"    "Haley"     "Hannah"   
## [31] "Isabella"  "Jamie"     "Jasmine"   "Jenna"     "Jordan"   
## [36] "Julia"     "Kaitlyn"   "Katelyn"   "Katherine" "Katie"    
## [41] "Kayla"     "Kelsey"    "Kristen"   "Kristin"   "Leah"     
## [46] "Lindsey"   "Madison"   "Maria"     "Melanie"   "Mia"      
## [51] "Monica"    "Morgan"    "Natalie"   "Olivia"    "Paige"    
## [56] "Sara"      "Savannah"  "Shannon"   "Shelby"    "Sophia"   
## [61] "Sydney"    "Tara"      "Taylor"    "Tiffany"   "Vanessa"  
## [66] "Victoria"
gg5 <- 
  babynames %>%
  filter(sex=="F") %>%
  filter(name %in% group5) %$% 
  ggplot(., aes(year, n)) +
  geom_line(aes(color=name, group=name), lwd=.5) +
  theme_bw() +
  theme(legend.position = "none")


gg5

 

It’s too difficult to identify individual names on this graph. These names appear to be those that have risen in usage quickly since 1975 and whose popularity has remained high. It may well be worth mining these names further for more patterns. Several of them appear to have had mini-booms in the 1920s, then fallen away before their late twentieth century resurgence.

Grace is the most notable of these with Julia and Ella being two others:

  babynames %>%
  filter(sex=="F") %>%
  filter(name=="Grace" | name=="Julia" | name=="Ella") %$% 
  ggplot(., aes(year, n)) +
  geom_line(aes(color=name), lwd=1) +
  theme_bw()

 

Lastly, group 7 is the largest group by far containing 164 names.

group7 <- names(resF1.k$cluster[resF1.k$cluster==7])
group7 # these are the names
##   [1] "Agnes"      "Alma"       "Anita"      "Ann"        "Anne"      
##   [6] "Annette"    "Annie"      "Arlene"     "Audrey"     "Beatrice"  
##  [11] "Becky"      "Bernice"    "Bertha"     "Beth"       "Beverly"   
##  [16] "Bonnie"     "Carla"      "Carmen"     "Carole"     "Carolyn"   
##  [21] "Carrie"     "Catherine"  "Cathy"      "Charlene"   "Charlotte" 
##  [26] "Cheryl"     "Christine"  "Cindy"      "Clara"      "Claudia"   
##  [31] "Colleen"    "Connie"     "Constance"  "Dana"       "Darlene"   
##  [36] "Dawn"       "Deanna"     "Debbie"     "Delores"    "Denise"    
##  [41] "Diana"      "Dianne"     "Dolores"    "Edith"      "Edna"      
##  [46] "Eileen"     "Elaine"     "Eleanor"    "Ellen"      "Elsie"     
##  [51] "Esther"     "Ethel"      "Eva"        "Florence"   "Gail"      
##  [56] "Georgia"    "Geraldine"  "Gertrude"   "Gina"       "Gladys"    
##  [61] "Glenda"     "Gloria"     "Gwendolyn"  "Hazel"      "Heidi"     
##  [66] "Holly"      "Ida"        "Irene"      "Jackie"     "Jacqueline"
##  [71] "Jane"       "Janet"      "Janice"     "Jeanette"   "Jeanne"    
##  [76] "Jessie"     "Jill"       "Jo"         "Joann"      "Joanne"    
##  [81] "Josephine"  "Joy"        "Joyce"      "Juanita"    "Judith"    
##  [86] "Judy"       "June"       "Kathryn"    "Kathy"      "Kay"       
##  [91] "Kim"        "Laurie"     "Leslie"     "Lillian"    "Lillie"    
##  [96] "Lois"       "Loretta"    "Lori"       "Lorraine"   "Louise"    
## [101] "Lucille"    "Lucy"       "Lynn"       "Marcia"     "Marian"    
## [106] "Marilyn"    "Marion"     "Marjorie"   "Marlene"    "Marsha"    
## [111] "Maureen"    "Melinda"    "Michele"    "Norma"      "Patsy"     
## [116] "Paula"      "Pauline"    "Peggy"      "Penny"      "Phyllis"   
## [121] "Regina"     "Renee"      "Rhonda"     "Rita"       "Roberta"   
## [126] "Robin"      "Rosa"       "Rose"       "Rosemary"   "Ruby"      
## [131] "Sally"      "Sheila"     "Sherri"     "Sherry"     "Sheryl"    
## [136] "Stacey"     "Stacy"      "Sue"        "Suzanne"    "Sylvia"    
## [141] "Tamara"     "Tammy"      "Tanya"      "Teresa"     "Terri"     
## [146] "Terry"      "Thelma"     "Theresa"    "Tina"       "Toni"      
## [151] "Tonya"      "Tracy"      "Valerie"    "Vera"       "Veronica"  
## [156] "Vicki"      "Vickie"     "Vivian"     "Wanda"      "Wendy"     
## [161] "Willie"     "Wilma"      "Yolanda"    "Yvonne"
gg7 <- 
  babynames %>%
  filter(sex=="F") %>%
  filter(name %in% group7) %$% 
  ggplot(., aes(year, n)) +
  geom_line(aes(color=name, group=name), lwd=.5) +
  theme_bw() +
  theme(legend.position = "none")

gg7

 

This is a pretty ugly plot! These names could certainly do with some further analysis. This appears (as we’d expect) to be a repository for names that don’t all fit the same exact pattern. Just looking at it, there appear to be groups of names within this plot that peak before and after 1925, as well as names that peak before and after 1950.

 

Conclusions

This was a pretty fun, quick run through some simple, exploratory ways of analyzing huge amounts of baby name data to try and pick out patterns. If I were to look at this more thoroughly, I would certainly try some other clustering methods as well as examine some alternative visualization techniques such as multidimensional scaling. Nevertheless, our intuition that girls’ names can be grouped by their era-specific popularity does seem to have been borne out by the k-means clustering.
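For anyone curious about the multidimensional scaling idea, classical MDS is available in base R as cmdscale. I have not run it on the baby names, so this is only a sketch on made-up data showing the shape of the call:

```r
set.seed(1)
# Toy data: 10 rows in two loose groups, 4 columns each
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 4),
             matrix(rnorm(20, mean = 5, sd = 0.5), ncol = 4))
rownames(toy) <- paste0("name", 1:10)

# Classical MDS: place rows in 2 dimensions so pairwise distances are roughly kept
coords <- cmdscale(dist(toy), k = 2)

# Empty plot first, then the row names as labels
plot(coords, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(coords, labels = rownames(coords))
```

The same two steps (dist, then cmdscale) should work on the filtered wide table, with the cluster numbers available for coloring the labels.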

One final time-series plot, looking at some of the era-specific names together:

grid.arrange(gg2, gg1, gg3, gg6, ncol=2)

 

Just one more thing…

An intriguing way of mapping multidimensional data into a 2D plot is to use t-distributed Stochastic Neighbor Embedding (t-SNE). This can be done in the R package tsne. I don’t understand everything about ‘tsne’ just yet, but it is pretty good at plotting these names in a 2D distribution that makes sense:

library(tsne)
D <- dist(babywideF1)  #create distance object


# creating dataframe for plotting colors and text on final plot
namesdf <- data.frame(name = c(group1, group2, group3, "Mary", group5, group6, group7), 
           group = c(rep(1, length(group1)), rep(2, length(group2)), rep(3, length(group3)), rep(4, 1),
                     rep(5, length(group5)), rep(6, length(group6)), rep(7, length(group7)))
           )
                     
namesdf %<>% arrange(name) #names in correct order to match rownames of babywideF1 

colors = rainbow(7)
names(colors) = unique(namesdf$group)

#define function used in plotting
ecb <- function(x, y) {
  plot(x, type = "n")  # empty plot, then draw the names as colored labels
  text(x, labels = rownames(babywideF1), col = colors[namesdf$group], cex = 1)
}

#plot
tsne_D = tsne(D, k=2,  epoch_callback = ecb, perplexity=50)
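A side note on the lookup table of names and groups built above: since kmeans returns its cluster vector in row order and named by row name, the same table can probably be read straight off the result without concatenating the groups and re-sorting. A sketch on made-up data (toy_k and toy_df are hypothetical names):

```r
# Toy data with row names, clustered into two groups
set.seed(1)
toy <- rbind(matrix(rnorm(10, mean = 0,  sd = 0.5), ncol = 2),
             matrix(rnorm(10, mean = 10, sd = 0.5), ncol = 2))
rownames(toy) <- c("Ann", "Beth", "Cara", "Dana", "Edna",
                   "Faye", "Gail", "Hope", "Ida", "Joan")
toy_k <- kmeans(toy, centers = 2, nstart = 10)

# Labels come back in row order, named by row name, so no re-sorting is needed
toy_df <- data.frame(name  = names(toy_k$cluster),
                     group = unname(toy_k$cluster),
                     stringsAsFactors = FALSE)

# One color per row, looked up by cluster number
label_cols <- rainbow(2)[toy_df$group]
```

For the analysis here that would mean building the table from resF1.k$cluster, which is already in the same order as the rows of babywideF1.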

 

This is pretty cool. This data reduction and visualization method actually maps pretty well to what we did before. The different groups are denoted by different colors. Some logical patterns emerge. Mary is definitely an extreme datapoint. It is close in visual space to the red group of names that were predominant in the 1950s (e.g. Barbara, Patricia).

The names that were really popular in the 1920s are on the far left of visual space in yellow (e.g. Dorothy, Helen).

The names that have become very popular recently (e.g. Isabella, Madison, Abigail) are on the right side of visual space in light blue. Interestingly, Grace and Julia sit next to each other, with Ella not too far away - these are the names that were popular early on and have had a recent resurgence.

The names in darker blue at the far right hand side are those such as Jennifer that were very popular in the late 70s and 80s. Elizabeth, with its 3 peaks, is the furthest out. Notably, Lisa and Kimberly are close to each other - this might be because their peaks were pre-1975. Julie is another with a pre-1975 peak, but it sits a long way from the rest of its group.

The names in green are the names that blossomed in the 80s and 90s before more recently declining in popularity such as Megan and Lauren.

Finally, much could be done to look for further patterns in the pink group, but that will have to wait for another time!

 

Hope you found this useful or fun! Any questions or comments, get in touch.