In this R Markdown session, I will demonstrate how to create a word cloud in R.
The data set used is titled “Gender by Name” and is hosted on the UCI Machine Learning Repository. It can be found at https://archive.ics.uci.edu/ml/datasets/Gender+by+Name.
The data set is compiled from open government data from the US, UK, Canada, and Australia. The baby names cover the years 1880 to 2019, depending on the range of each government source. The data contains four variables and over 147,000 entries. The variables are Name, Gender, Count, and Probability.
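First, the required packages are loaded and the data set is read into R as a data frame named “names”.
#Packages used below: dplyr for the pipe and table verbs, wordcloud and RColorBrewer for the plots.
library(dplyr)
library(wordcloud)
library(RColorBrewer)
names = read.csv("~/R datasets/name_gender_dataset.csv")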
Next, the data is inspected to see the structure and data types.
#Inspect the data
head(names)
## Name Gender Count Probability
## 1 James M 5304407 0.01451679
## 2 John M 5260831 0.01439753
## 3 Robert M 4970386 0.01360266
## 4 Michael M 4579950 0.01253414
## 5 William M 4226608 0.01156713
## 6 Mary F 4169663 0.01141129
#View the data types
sapply(names, class)
## Name Gender Count Probability
## "character" "character" "integer" "numeric"
In addition, the missing values are counted with sum(is.na()) to confirm that the data is complete.
#Check if there are any missing values.
sum(is.na(names))
## [1] 0
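The check above returns a single total for the whole data frame. When that total is not zero, a per-column count shows where the gaps are. The following is a minimal sketch on a small made-up data frame (`toy` is hypothetical and not part of the data set):

```r
# Hypothetical toy data with two missing values, for illustration only
toy <- data.frame(
  Name   = c("Ann", "Bob", NA),
  Gender = c("F", "M", "F"),
  Count  = c(10L, NA, 5L)
)

# Total number of missing cells, the same check as above
total_missing <- sum(is.na(toy))

# Number of missing values in each column
missing_by_column <- colSums(is.na(toy))
missing_by_column
```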
Next, a count table and a proportion table are created for gender to show how the name entries are split between males and females. These show the number and the share of names in the data for each gender.
A list of the top 20 names for each gender is also created.
#The total number of names for males and females in the data.
counts_gender <- names%>%count(Gender)
counts_gender
## Gender n
## 1 F 89749
## 2 M 57520
#A frequency table to show the proportion of male and female names.
freq_gender <- table(names$Gender) / length(names$Gender)
freq_gender
##
## F M
## 0.6094222 0.3905778
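Base R's prop.table() should give the same proportions as dividing the table by the number of rows. The following is a small sketch on a made-up vector (`gender` is hypothetical), which also converts the result to percentages:

```r
# Hypothetical gender labels, for illustration only
gender <- c("F", "F", "F", "M", "M")

# Manual proportions, as in the table above
manual <- table(gender) / length(gender)

# The same proportions using prop.table()
builtin <- prop.table(table(gender))

# Proportions expressed as percentages, rounded to one decimal place
percentages <- round(100 * builtin, 1)
percentages
```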
#A table showing the top 20 male names and their counts.
counts_names_males <- names %>%
  filter(Gender == "M") %>%
  arrange(desc(Count)) %>%
  select(Name, Count) %>%
  head(20)
counts_names_males
## Name Count
## 1 James 5304407
## 2 John 5260831
## 3 Robert 4970386
## 4 Michael 4579950
## 5 William 4226608
## 6 David 3787547
## 7 Joseph 2695970
## 8 Richard 2638187
## 9 Charles 2433540
## 10 Thomas 2381034
## 11 Christopher 2196198
## 12 Daniel 2039641
## 13 Matthew 1738699
## 14 Anthony 1506437
## 15 George 1495736
## 16 Donald 1447641
## 17 Paul 1437346
## 18 Mark 1410637
## 19 Andrew 1394274
## 20 Steven 1347137
#A table showing the top 20 female names and their counts.
counts_names_females <- names %>%
  filter(Gender == "F") %>%
  arrange(desc(Count)) %>%
  select(Name, Count) %>%
  head(20)
counts_names_females
## Name Count
## 1 Mary 4169663
## 2 Elizabeth 1704140
## 3 Patricia 1608260
## 4 Jennifer 1584426
## 5 Linda 1480592
## 6 Barbara 1459870
## 7 Margaret 1280255
## 8 Jessica 1173979
## 9 Sarah 1162143
## 10 Susan 1154677
## 11 Dorothy 1118450
## 12 Helen 1033566
## 13 Karen 1018083
## 14 Nancy 1016057
## 15 Betty 1005188
## 16 Lisa 1005108
## 17 Ashley 957384
## 18 Anna 916075
## 19 Sandra 901623
## 20 Emily 899883
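The top 20 tables above filter by gender, sort by Count, and keep the first rows. The same steps can be sketched in base R without dplyr. The example below uses a small made-up data frame and keeps the top 2 instead of the top 20 (`toy` and `top_n_names` are hypothetical names used only for illustration):

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  Name   = c("Ann", "Bob", "Cara", "Dan", "Eve"),
  Gender = c("F", "M", "F", "M", "F"),
  Count  = c(30L, 50L, 10L, 20L, 40L),
  stringsAsFactors = FALSE
)

# Return the n most frequent names for one gender
top_n_names <- function(data, gender, n) {
  # Keep only the rows for the requested gender and the two columns of interest
  subset_rows <- data[data$Gender == gender, c("Name", "Count")]
  # Sort by Count, largest first
  sorted <- subset_rows[order(subset_rows$Count, decreasing = TRUE), ]
  # Reset the row names so the result is numbered 1, 2, ...
  rownames(sorted) <- NULL
  head(sorted, n)
}

top_females <- top_n_names(toy, "F", 2)
top_females
```

Asking for more rows than exist simply returns every row for that gender, which is how head() behaves.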
A word cloud is created to visually identify the most popular names in the data. Because a word cloud draws each word at a size that reflects its frequency, the most common male and female names stand out at a glance.
First, a subset is created for each gender.
male_names <- names%>%filter(Gender == 'M')
female_names <- names%>%filter(Gender == 'F')
A word cloud is created for the male names, limited to the 40 most frequent names with a count of at least 100,000.
#freq must come from the same subset as words. The Dark2 palette provides at most 8 colors.
wordcloud(words = male_names$Name, freq = male_names$Count, min.freq = 100000, max.words = 40, colors = brewer.pal(8, "Dark2"))
A word cloud of the top 40 female names is also created.
#freq must come from the same subset as words.
wordcloud(words = female_names$Name, freq = female_names$Count, min.freq = 100000, max.words = 40, colors = brewer.pal(9, "Set1"))
#If wordcloud() warns that a name "could not be fit on page", that name is left out of the plot. A smaller scale argument or a larger plotting area usually avoids this.
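wordcloud() pairs words and freq by position, so both vectors should come from the same subset and have the same length. The following is a minimal sketch of preparing matched inputs on made-up data (`toy` and `cloud_input` are hypothetical names). The plot itself is only drawn if the wordcloud and RColorBrewer packages are installed:

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  Name   = c("Ann", "Bob", "Cara", "Dan", "Eve", "Finn"),
  Gender = c("F", "M", "F", "M", "F", "M"),
  Count  = c(300L, 500L, 100L, 200L, 400L, 50L),
  stringsAsFactors = FALSE
)

# Build matched words/freq vectors for one gender, most frequent first
cloud_input <- function(data, gender, max_words) {
  rows <- data[data$Gender == gender, ]
  rows <- rows[order(rows$Count, decreasing = TRUE), ]
  rows <- head(rows, max_words)
  list(words = rows$Name, freq = rows$Count)
}

female_input <- cloud_input(toy, "F", 2)

# Guard against mismatched inputs before plotting
stopifnot(length(female_input$words) == length(female_input$freq))

if (requireNamespace("wordcloud", quietly = TRUE) &&
    requireNamespace("RColorBrewer", quietly = TRUE)) {
  set.seed(42)  # word placement is random; a seed makes the layout repeatable
  wordcloud::wordcloud(
    words    = female_input$words,
    freq     = female_input$freq,
    min.freq = 1,
    colors   = RColorBrewer::brewer.pal(8, "Dark2")  # Dark2 has at most 8 colors
  )
}
```

Taking both vectors from one filtered data frame keeps each name attached to its own count.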
In this short R Markdown session, word clouds were made for the most common male and female baby names. Count and proportion tables showed how the name entries are split by gender, and top 20 tables listed the most common male and female names.
Thank you to the reader for viewing my work.