Heike Hofmann
Stat 579, Fall 2013
The data set contains the 1000 most popular boys’ and girls’ baby names in the US from 1880 to 2012 (see http://www.babynamewizard.com/voyager)
bnames <- read.csv("http://www.hofroe.net/stat579/babynames.csv")
head(bnames)
Year Name Gender Freq Perc
1 1880 Mary F 7065 7.764
2 1880 Anna F 2604 2.862
3 1880 Emma F 2003 2.201
4 1880 Elizabeth F 1939 2.131
5 1880 Minnie F 1746 1.919
6 1880 Margaret F 1578 1.734
Load the baby names data: http://www.hofroe.net/stat579/babynames.csv
Find the 20 most popular names (the ones that appear in the top 1000 most often). Are you surprised?
Extend the data set to include ranking by year and gender (use ddply and the function transform)
head(sort(table(bnames$Name), decreasing = TRUE), 20)
Jessie Leslie Guadalupe Jean Lee James
266 251 248 247 244 243
John William Robert Francis Charles Dana
242 241 239 233 226 226
Mary Willie Marion Joseph Sidney Johnnie
226 226 222 219 217 216
Carmen Joe
215 214
Some names appear in the top 1000 more often than the 133 years we have from 1880 to 2012 (Jessie, for example, appears 266 times). These names must have been in the top 1000 for both boys and girls.
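One way to check this, as a minimal sketch: count for each name how many distinct genders it appears with. The toy data frame below is hypothetical (made-up rows) and only mimics the Name and Gender columns of bnames.

```r
# Hypothetical miniature of bnames (made-up rows, same column names)
toy <- data.frame(
  Year   = c(1880, 1880, 1881, 1881, 1881),
  Name   = c("Jessie", "Jessie", "Jessie", "Mary", "John"),
  Gender = c("F", "M", "F", "F", "M"),
  stringsAsFactors = FALSE
)

# number of distinct genders each name occurs with
n_genders <- tapply(toy$Gender, toy$Name, function(g) length(unique(g)))

# names that were used for both girls and boys
both <- names(n_genders)[n_genders == 2]
both
```

On the full data, the same two lines applied to bnames$Gender and bnames$Name should list every name that made the top 1000 for both genders at some point.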
library(plyr)
bnames <- ddply(bnames, .(Year, Gender), transform, Rank=rank(-Freq))
head(subset(bnames, (Year==2012) & (Rank < 10)))
Year Name Gender Freq Perc Rank
263878 2012 Sophia F 22158 1.2708 1
263879 2012 Emma F 20791 1.1924 2
263880 2012 Isabella F 18931 1.0857 3
263881 2012 Olivia F 17147 0.9834 4
263882 2012 Ava F 15418 0.8842 5
263883 2012 Emily F 13550 0.7771 6
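Note that rank() gives tied frequencies their average rank by default, so ranks are not always whole numbers. The sketch below illustrates this on made-up frequencies and computes the per-group ranking with base R's ave() in place of ddply. The toy data frame, its values, and the RankMin column are assumptions for illustration only.

```r
# Hypothetical data: two names tie at Freq = 20 among the girls
toy <- data.frame(
  Year   = c(2012, 2012, 2012, 2012, 2012),
  Gender = c("F", "F", "F", "M", "M"),
  Name   = c("a", "b", "c", "d", "e"),
  Freq   = c(30, 20, 20, 50, 10)
)

# rank within each Year x Gender group, largest frequency first
toy$Rank <- ave(-toy$Freq, toy$Year, toy$Gender, FUN = rank)

# ties.method = "min" gives both tied names the smaller rank instead
toy$RankMin <- ave(-toy$Freq, toy$Year, toy$Gender,
                   FUN = function(f) rank(f, ties.method = "min"))
toy
```

With the default method the two tied names both get rank 2.5; with ties.method = "min" they both get rank 2.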
Thinking about the data, what are some of the trends that you might want to explore? What additional variables would you need to create? What other data sources might you want to use?
Pair up and brainstorm for 2 minutes.
length
first/last letter
rank
percent vowels/consonants
influential people/events (brad, angelina, barack, elizabeth, … )
nchar, substring, paste, tolower, toupper, print, cat
find length of each baby name
get first and last letter for each baby name (make sure to convert all names to lower case first)
think about how to determine number of vowels in a name
Advanced:
Find graphics to answer the following questions:
Does the first/last letter change over time? - does it depend on gender?
Which names are used both for girls and boys?
# length of each name, then convert all names to lower case
bnames$length <- nchar(as.character(bnames$Name))
bnames$Name <- tolower(as.character(bnames$Name))
# first and last letter of each name
bnames$first <- with(bnames, substring(Name, 1, 1))
bnames$last <- with(bnames, substring(Name, length, length))
summary(bnames)
Year Name Gender
Min. :1880 Length:259660 F:119418
1st Qu.:1912 Class :character M:140242
Median :1943 Mode :character
Mean :1944
3rd Qu.:1976
Max. :2012
Freq Perc Rank
Min. : 5 Min. :0.000 Min. : 1
1st Qu.: 5 1st Qu.:0.000 1st Qu.:388
Median : 5 Median :0.000 Median :628
Mean : 8 Mean :0.004 Mean :584
3rd Qu.: 6 3rd Qu.:0.001 3rd Qu.:824
Max. :8769 Max. :8.704 Max. :997
length first last
Min. : 2.00 Length:259660 Length:259660
1st Qu.: 5.00 Class :character Class :character
Median : 6.00 Mode :character Mode :character
Mean : 6.13
3rd Qu.: 7.00
Max. :15.00
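With first and last in place, one possible first step towards the question about letters over time is to add up Perc by year, gender, and last letter, and then plot the resulting shares. The sketch below does the aggregation with base R on a hypothetical toy data frame; the rows and the name shares are made up for illustration.

```r
# Hypothetical rows with the same columns as bnames
toy <- data.frame(
  Year   = c(1900, 1900, 1900, 2000, 2000, 2000),
  Gender = c("F", "F", "M", "F", "F", "M"),
  last   = c("a", "e", "n", "a", "a", "n"),
  Perc   = c(2.0, 1.0, 3.0, 1.5, 0.5, 2.5)
)

# total percentage of babies per year, gender and last letter
shares <- aggregate(Perc ~ Year + Gender + last, data = toy, FUN = sum)
shares <- shares[order(shares$Year, shares$Gender, shares$last), ]
shares
```

Plotting Perc against Year, with one line per last letter and one panel per gender, would then show whether the pattern differs between girls and boys.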
gsub(pattern, replacement, x)
grep, regexpr, gregexpr, strsplit
x <- sample(bnames$Name, 3); x
[1] "jameison" "merwin" "loranzo"
grep('a', x)
[1] 1 3
regexpr('a', x)
[1] 2 -1 4
attr(,"match.length")
[1] 1 -1 1
attr(,"useBytes")
[1] TRUE
strsplit(x, 'a')
[[1]]
[1] "j" "meison"
[[2]]
[1] "merwin"
[[3]]
[1] "lor" "nzo"
gsub('a', '', x)
[1] "jmeison" "merwin" "lornzo"
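gregexpr is listed above but not shown. As a small sketch on the same three names: it returns the positions of all matches in each string, which makes it easy to count matches. The pattern '[aeiou]' and the names m and n_vowels are choices made for this illustration.

```r
x <- c("jameison", "merwin", "loranzo")

# gregexpr returns, for each string, the positions of all matches (-1 if none)
m <- gregexpr("[aeiou]", x)

# number of vowels per name: count the positions that are real matches
n_vowels <- sapply(m, function(pos) sum(pos > 0))
n_vowels

# regmatches extracts the matched substrings themselves
regmatches(x, m)
```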
Find the number of ‘a’, ‘e’, ‘i’, ‘o’ and ‘u’s in each name
Find percentage of vowels in name
Can you spot a difference in vowels between boys and girls?
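A possible approach, sketched under two assumptions: only a, e, i, o, u count as vowels (y does not), and names are already in lower case. The six names below are made up for illustration, so the averages say nothing about the real data. The sketch also ignores Freq, so every name counts equally.

```r
# Hypothetical toy data with lower-case names
toy <- data.frame(
  Name   = c("anna", "emma", "olivia", "john", "mark", "noah"),
  Gender = c("F", "F", "F", "M", "M", "M"),
  stringsAsFactors = FALSE
)

# delete everything that is not a vowel and count what is left
toy$vowels <- nchar(gsub("[^aeiou]", "", toy$Name))

# percentage of vowels in each name
toy$pvowels <- 100 * toy$vowels / nchar(toy$Name)

# average percentage of vowels by gender
pv_by_gender <- tapply(toy$pvowels, toy$Gender, mean)
pv_by_gender
```

Counting a single vowel works the same way, for example nchar(gsub("[^a]", "", toy$Name)) for the number of a's.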
Find one pattern that helps you to
find all names that start with ‘Joh’
find all names of length 2 (without using nchar())
find all names that have a pattern of consonant-vowel-consonant-vowel-consonant-vowel-consonant …
find all names that are palindromes (e.g. Anna, Hannah, Ava, …) - is it possible to find one pattern that recognizes all palindromes?
see ?regex
?Quotes
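The patterns below are one possible set of answers, offered as a sketch rather than the intended solution. They assume lower-case names and treat only a, e, i, o, u as vowels. The vector x is made up for illustration. Backreferences can describe palindromes of one fixed length at a time; for arbitrary lengths, comparing each name with its reverse is simpler than a single pattern.

```r
x <- c("john", "johanna", "ed", "al", "anna", "hannah", "ava", "ramona", "peter")

# names that start with 'joh'
joh <- grep("^joh", x, value = TRUE)

# names of length 2: exactly two characters between start and end
len2 <- grep("^..$", x, value = TRUE)

# strictly alternating consonant-vowel, starting with a consonant
cvcv <- grep("^([^aeiou][aeiou])+[^aeiou]?$", x, value = TRUE)

# palindromes of length 4 only, using backreferences
pal4 <- grep("^(.)(.)\\2\\1$", x, value = TRUE)

# palindromes of any length: compare each name with its reverse
rev_x <- sapply(strsplit(x, ""), function(ch) paste(rev(ch), collapse = ""))
pal <- x[x == rev_x]

list(joh = joh, len2 = len2, cvcv = cvcv, pal4 = pal4, pal = pal)
```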