Word Cloud for Baby Gender Names in R

In this R Markdown session, I will demonstrate how to build a word cloud in R.

About the Data Set

The data set is titled “Gender by Name” and is hosted on the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Gender+by+Name.

The data set is compiled from open government data from the US, UK, Canada, and Australia. The baby names cover the years 1880 to 2019, depending on the range each government source provides. It contains four variables and over 147,000 entries. The variables are:

Name - First/Given Name
Gender - Gender of the baby (M/F)
Count - Number of times name was selected
Probability - The probability of a name given the aggregate count

Import the Data Set

First, the packages used in this session are loaded, and the data set is read into R and named “names”.

library(dplyr); library(wordcloud); library(RColorBrewer)  #Packages used below
names = read.csv("~/R datasets/name_gender_dataset.csv")

Data Cleaning

Next, the data is inspected to see the structure and data types.

#Inspect the data
head(names)

##      Name Gender   Count Probability
## 1   James      M 5304407  0.01451679
## 2    John      M 5260831  0.01439753
## 3  Robert      M 4970386  0.01360266
## 4 Michael      M 4579950  0.01253414
## 5 William      M 4226608  0.01156713
## 6    Mary      F 4169663  0.01141129
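The Probability column looks like each name's Count divided by the total Count across the whole data set. That is an assumption rather than something the data description states outright, but it can be sanity-checked against the rows printed above: if it holds, every row should imply the same grand total. A minimal sketch using the first two rows:

```r
# Values copied from the head() output above.
first_rows <- data.frame(
  Name        = c("James", "John"),
  Count       = c(5304407, 5260831),
  Probability = c(0.01451679, 0.01439753),
  stringsAsFactors = FALSE
)

# Assumption: Probability = Count / sum(Count) over the full data set,
# so Count / Probability should recover the same total for every row.
implied_total <- first_rows$Count / first_rows$Probability

# Relative gap between the two implied totals. It should be tiny if the
# assumption holds, up to rounding in the printed probabilities.
rel_gap <- abs(diff(implied_total)) / mean(implied_total)
rel_gap
```

With the full data loaded, the same idea could be checked with something like all.equal(names$Probability, names$Count / sum(names$Count), tolerance = 1e-6), where the tolerance is an arbitrary allowance for rounding.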

#View the data types
sapply(names, class)

##        Name      Gender       Count Probability 
## "character" "character"   "integer"   "numeric"

In addition, sum(is.na()) is run to count any missing values in the data.

#Check if there are any missing values.
sum(is.na(names))

## [1] 0
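sum(is.na(names)) returns a single total, which does not show where any gaps are. The sketch below, run on a made-up data frame (toy_names is hypothetical and not part of the data set), counts missing values per column and also counts empty strings, which is.na() does not flag:

```r
# Hypothetical toy data with one empty name and two NA values.
toy_names <- data.frame(
  Name   = c("James", "", "Mary"),
  Gender = c("M", "F", NA),
  Count  = c(10L, 5L, NA),
  stringsAsFactors = FALSE
)

# Missing values per column rather than one grand total.
na_per_column <- colSums(is.na(toy_names))
na_per_column

# Empty strings are not NA, so count them separately for character columns.
# %in% never returns NA, so missing values do not disturb the sum.
is_char <- vapply(toy_names, is.character, logical(1))
empty_per_column <- vapply(
  toy_names[is_char],
  function(x) sum(x %in% ""),
  integer(1)
)
empty_per_column
```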

Next, a count table and a proportion table are created for gender. These show how many name entries the data contains for each gender and the share of the data each gender makes up.

A list of the top 20 names for each gender is also created.

#The total number of names for males and females in the data.
counts_gender <- names%>%count(Gender)
counts_gender

##   Gender     n
## 1      F 89749
## 2      M 57520

#A frequency table to show the proportion of male and female names.
freq_gender <- table(names$Gender) / length(names$Gender)
freq_gender

## 
##         F         M 
## 0.6094222 0.3905778
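The proportions follow directly from the counts printed earlier. As a quick check, and as a more compact alternative to dividing by length(), base R's prop.table() turns a vector or table of counts into shares. A sketch using the two counts from the counts_gender output:

```r
# Row counts per gender, copied from the counts_gender output above.
gender_counts <- c(F = 89749, M = 57520)

# prop.table() divides each entry by the total of all entries.
gender_share <- prop.table(gender_counts)
round(gender_share, 7)
```

On the full data, prop.table(table(names$Gender)) should give the same result as the table(...) / length(...) expression.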

#A table showing the top 20 male names and their counts.
counts_names_males <- names%>%arrange(desc(Count))%>%filter(Gender == 'M')%>%select(-Probability, -Gender)%>%head(., 20)
counts_names_males

##           Name   Count
## 1        James 5304407
## 2         John 5260831
## 3       Robert 4970386
## 4      Michael 4579950
## 5      William 4226608
## 6        David 3787547
## 7       Joseph 2695970
## 8      Richard 2638187
## 9      Charles 2433540
## 10      Thomas 2381034
## 11 Christopher 2196198
## 12      Daniel 2039641
## 13     Matthew 1738699
## 14     Anthony 1506437
## 15      George 1495736
## 16      Donald 1447641
## 17        Paul 1437346
## 18        Mark 1410637
## 19      Andrew 1394274
## 20      Steven 1347137

#A table showing the top 20 female names and their counts.
counts_names_females <- names%>%arrange(desc(Count))%>%filter(Gender == 'F')%>%select(-Probability, -Gender)%>%head(., 20)
counts_names_females

##         Name   Count
## 1       Mary 4169663
## 2  Elizabeth 1704140
## 3   Patricia 1608260
## 4   Jennifer 1584426
## 5      Linda 1480592
## 6    Barbara 1459870
## 7   Margaret 1280255
## 8    Jessica 1173979
## 9      Sarah 1162143
## 10     Susan 1154677
## 11   Dorothy 1118450
## 12     Helen 1033566
## 13     Karen 1018083
## 14     Nancy 1016057
## 15     Betty 1005188
## 16      Lisa 1005108
## 17    Ashley  957384
## 18      Anna  916075
## 19    Sandra  901623
## 20     Emily  899883
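The two top-20 pipelines differ only in the gender they filter on, so they could be folded into one helper. The sketch below uses base R only; top_names() is a hypothetical helper name, and the small data frame stands in for the full data set (its counts are copied from the tables above).

```r
# Hypothetical helper: the n most frequent names for one gender.
top_names <- function(df, gender, n = 20) {
  subset_df <- df[df$Gender == gender, c("Name", "Count")]
  ordered   <- subset_df[order(subset_df$Count, decreasing = TRUE), ]
  rownames(ordered) <- NULL  # renumber rows 1..k after sorting
  head(ordered, n)
}

# Small stand-in for the full data set.
sample_names <- data.frame(
  Name   = c("Mary", "James", "Elizabeth", "John", "Patricia"),
  Gender = c("F", "M", "F", "M", "F"),
  Count  = c(4169663, 5304407, 1704140, 5260831, 1608260),
  stringsAsFactors = FALSE
)

top_female_two <- top_names(sample_names, "F", n = 2)
top_female_two
```

On the real data, top_names(names, "M") and top_names(names, "F") would be expected to reproduce the two tables above.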

Word Cloud

A word cloud is created to visually identify the most popular names in the data. Because each name is sized by its count, a word cloud helps a viewer see the more prominent name choices for males and females at a glance.

First, a subset is created for each gender.

male_names <- names%>%filter(Gender == 'M')

female_names <- names%>%filter(Gender == 'F')

A word cloud is created for male names, limited to at most 40 names that each have a count of at least 100,000.

#freq must come from the same subset as words; Dark2 provides at most 8 colors.
wordcloud(words = male_names$Name, freq = male_names$Count, min.freq = 100000, max.words = 40, colors = brewer.pal(8, "Dark2"))

A word cloud of the top 40 female names is also created.

#freq must come from the same subset as words.
wordcloud(words = female_names$Name, freq = female_names$Count, min.freq = 100000, max.words = 40, colors = brewer.pal(9, "Set1"))

In the original run, wordcloud() warned that “Elizabeth” could not be fit on the page and would not be plotted. Enlarging the plot area or lowering the scale argument usually lets long names fit.
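Two things make word clouds easier to reproduce. The layout is random, so fixing the seed keeps the picture stable between runs, and the scale argument sets the largest and smallest font sizes, which is the usual lever when a long name does not fit. The following is a hedged sketch, assuming the wordcloud and RColorBrewer packages are installed. The five names and counts come from the female top-20 table above, while the seed and scale values are arbitrary choices, not the author's settings.

```r
# Top five female names and counts, copied from the table above.
top_female <- data.frame(
  Name  = c("Mary", "Elizabeth", "Patricia", "Jennifer", "Linda"),
  Count = c(4169663, 1704140, 1608260, 1584426, 1480592),
  stringsAsFactors = FALSE
)

# words and freq must come from the same data frame so they stay aligned.
stopifnot(length(top_female$Name) == length(top_female$Count))

# Only draw the plot when both packages are available.
if (requireNamespace("wordcloud", quietly = TRUE) &&
    requireNamespace("RColorBrewer", quietly = TRUE)) {
  set.seed(42)  # arbitrary seed; fixes the random layout
  wordcloud::wordcloud(
    words        = top_female$Name,
    freq         = top_female$Count,
    scale        = c(3, 0.5),  # smaller maximum font than the default c(4, 0.5)
    random.order = FALSE,      # plot names in decreasing order of frequency
    colors       = RColorBrewer::brewer.pal(8, "Dark2")
  )
}
```

If a name is still dropped, lowering the first scale value further or enlarging the plotting device are the options to try next.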

Conclusion

In this short R Markdown session, a word cloud was made for the most common male and female baby names. Count and proportion tables were made to show how the name entries are split between males and females, and the top 20 male and female names were listed.

Thank you to the reader for viewing my work.