Text and Patterns

Heike Hofmann
Stat 579, Fall 2013

Outline

  • Character Variables
  • Control Codes
  • Patterns & Matching

Baby Names Data

Contains the 1000 most popular boys’ and girls’ baby names in the US for each year from 1880 to 2012 (see http://www.babynamewizard.com/voyager)

bnames <- read.csv("http://www.hofroe.net/stat579/babynames.csv")
head(bnames)
  Year      Name Gender Freq  Perc
1 1880      Mary      F 7065 7.764
2 1880      Anna      F 2604 2.862
3 1880      Emma      F 2003 2.201
4 1880 Elizabeth      F 1939 2.131
5 1880    Minnie      F 1746 1.919
6 1880  Margaret      F 1578 1.734

Your Turn

  • Load the baby names data: http://www.hofroe.net/stat579/babynames.csv

  • Find the 20 most popular names (the ones that are in the top 1000 most often). Are you surprised?

  • Extend the data set to include a ranking by year and gender (use ddply and the function transform)

head(sort(table(bnames$Name), decreasing = TRUE), 20)

   Jessie    Leslie Guadalupe      Jean       Lee     James 
      266       251       248       247       244       243 
     John   William    Robert   Francis   Charles      Dana 
      242       241       239       233       226       226 
     Mary    Willie    Marion    Joseph    Sidney   Johnnie 
      226       226       222       219       217       216 
   Carmen       Joe 
      215       214 

Some names appear in the top 1000 more often than there are years in the data (133, from 1880 to 2012). They must have been in the top 1000 for both boys and girls in some years.

library(plyr)
bnames <- ddply(bnames, .(Year, Gender), transform, Rank=rank(-Freq))
head(subset(bnames, (Year==2012) & (Rank < 10)))
       Year     Name Gender  Freq   Perc Rank
263878 2012   Sophia      F 22158 1.2708    1
263879 2012     Emma      F 20791 1.1924    2
263880 2012 Isabella      F 18931 1.0857    3
263881 2012   Olivia      F 17147 0.9834    4
263882 2012      Ava      F 15418 0.8842    5
263883 2012    Emily      F 13550 0.7771    6
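
The same per-group ranking can also be sketched in base R with ave(). The data frame `toy` below is invented for illustration and is not part of the baby names data; it only shows what ranking within each Year and Gender group does.

```r
# toy data, made up for illustration (not from babynames.csv)
toy <- data.frame(
  Year   = c(2000, 2000, 2000, 2000),
  Gender = c("F", "F", "M", "M"),
  Name   = c("amy", "beth", "carl", "dan"),
  Freq   = c(10, 30, 20, 5)
)

# ave() applies FUN within each Year/Gender group and returns the
# results in the original row order; ranking -Freq gives rank 1 to
# the most frequent name in each group
toy$Rank <- ave(-toy$Freq, toy$Year, toy$Gender, FUN = rank)
toy
```

By default rank() averages the ranks of tied values, so names with equal frequency share a (possibly fractional) rank.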

Brainstorming

Thinking about the data, what are some of the trends that you might want to explore? What additional variables would you need to create? What other data sources might you want to use?

Pair up and brainstorm for 2 minutes.

Some Ideas

  • length

  • first/last letter

  • rank

  • percent vowels/consonants

  • influential people/events (brad, angelina, barack, elizabeth, … )

Some useful commands

  • … back to the reference card
  • nchar
  • substring
  • paste
  • tolower, toupper
  • print, cat
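
A minimal sketch of these commands, assuming a small made-up vector `nm` (not taken from the data):

```r
# hypothetical example vector
nm <- c("Mary", "Anna", "Elizabeth")

len    <- nchar(nm)                 # number of characters per name
first  <- substring(nm, 1, 1)       # first letter
last   <- substring(nm, nchar(nm))  # from the last position to the end
lower  <- tolower(nm)
upper  <- toupper(nm)
full   <- paste(nm, "Smith")        # element-wise, separated by a space
joined <- paste(nm, collapse = ", ")  # one string from the whole vector

# print() shows the R representation (with quotes and index),
# cat() writes the plain text
print(joined)
cat(joined, "\n")
```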

Your turn

  • find length of each baby name

  • get first and last letter for each baby name (make sure to convert all names to lower case first)

  • think about how to determine number of vowels in a name

Advanced:

Find graphics to answer the following questions:

  • Does the first/last letter change over time? Does it depend on gender?

  • Which names are used both for girls and boys?

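# number of characters in each name
bnames$length <- nchar(as.character(bnames$Name))
# lowercase, so that first and last letters are comparable
bnames$Name <- tolower(as.character(bnames$Name))
bnames$first <- with(bnames, substring(Name, 1, 1))
# the last letter sits at position `length`
bnames$last <- with(bnames, substring(Name, length, length))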
summary(bnames)
      Year          Name           Gender    
 Min.   :1880   Length:259660      F:119418  
 1st Qu.:1912   Class :character   M:140242  
 Median :1943   Mode  :character             
 Mean   :1944                                
 3rd Qu.:1976                                
 Max.   :2012                                
      Freq           Perc            Rank    
 Min.   :   5   Min.   :0.000   Min.   :  1  
 1st Qu.:   5   1st Qu.:0.000   1st Qu.:388  
 Median :   5   Median :0.000   Median :628  
 Mean   :   8   Mean   :0.004   Mean   :584  
 3rd Qu.:   6   3rd Qu.:0.001   3rd Qu.:824  
 Max.   :8769   Max.   :8.704   Max.   :997  
     length         first               last          
 Min.   : 2.00   Length:259660      Length:259660     
 1st Qu.: 5.00   Class :character   Class :character  
 Median : 6.00   Mode  :character   Mode  :character  
 Mean   : 6.13                                        
 3rd Qu.: 7.00                                        
 Max.   :15.00                                        

Patterns & Matches

  • gsub(pattern, replacement, x)
  • grep, regexpr, gregexpr, strsplit

Patterns & Matches

x <- sample(bnames$Name, 3); x
[1] "jameison" "merwin"   "loranzo" 

grep('a', x)
[1] 1 3
regexpr('a', x)
[1]  2 -1  4
attr(,"match.length")
[1]  1 -1  1
attr(,"useBytes")
[1] TRUE
strsplit(x, 'a')
[[1]]
[1] "j"      "meison"

[[2]]
[1] "merwin"

[[3]]
[1] "lor" "nzo"
gsub('a', '', x)
[1] "jmeison" "merwin"  "lornzo" 
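
gregexpr is listed above but not shown. As a rough sketch, it can be combined with regmatches() to count all matches per string, where regexpr() only reports the first one. The vector `x` repeats the sample from the output above.

```r
x <- c("jameison", "merwin", "loranzo")   # the sample shown above

# gregexpr() returns, for each string, the positions of all matches
m <- gregexpr("o", x)

# regmatches() extracts the matched substrings; lengths() counts them
n_o <- lengths(regmatches(x, m))
n_o

# grepl() is the logical counterpart of grep()
has_o <- grepl("o", x)
has_o
```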

Your Turn

  • Find the number of ‘a’, ‘e’, ‘i’, ‘o’ and ‘u’s in each name

  • Find percentage of vowels in name

  • Can you spot a difference in vowels between boys and girls?

Regular Expressions

  • 'a|e' a or e
  • [aei] a or e or i
  • [^aei] neither a nor e nor i
  • ^[aei] a, e, or i at the beginning
  • [aei]$ a, e, or i at the end
  • '.' any character
  • (pattern) defines a substring for re-use; refer to it by \1, \2, \3, …
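
A short sketch of these constructs with grepl() and gsub(), assuming a made-up vector `nm` of lowercase names. Inside an R string the backreference \1 is written as "\\1".

```r
nm <- c("anna", "otto", "adam", "kim", "ai")   # hypothetical names

a_or_e   <- grepl("a|e", nm)      # contains a or e
any_aei  <- grepl("[aei]", nm)    # contains a, e, or i
other    <- grepl("[^aei]", nm)   # contains a character other than a, e, i
starts   <- grepl("^[aei]", nm)   # starts with a, e, or i
ends     <- grepl("[aei]$", nm)   # ends with a, e, or i
doubled  <- grepl("(.)\\1", nm)   # some character immediately repeated

# replace each doubled character by a single copy
collapsed <- gsub("(.)\\1", "\\1", nm)
collapsed
```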

Repetition Quantifiers

  • ? preceding pattern is optional (matched 0 or 1 time)
  • * preceding pattern is matched zero or more times
  • + preceding pattern is matched at least once
  • {n} preceding pattern is matched exactly n times
  • {n,} preceding pattern is matched at least n times
  • {n,m} preceding pattern is matched at least n times and up to m times
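
The quantifiers can be tried out on a few made-up strings (the vector `s` is an assumption for illustration). The anchors ^ and $ force the pattern to cover the whole string.

```r
s <- c("jon", "john", "johhn", "lee", "leee")  # hypothetical strings

opt_h   <- grepl("^joh?n$", s)     # h is optional
any_h   <- grepl("^joh*n$", s)     # zero or more h
some_h  <- grepl("^joh+n$", s)     # at least one h
two_e   <- grepl("e{2}", s)        # two consecutive e somewhere
three_e <- grepl("^le{3,}$", s)    # l followed by at least three e
one_two <- grepl("^le{1,2}$", s)   # l followed by one or two e
```

Without the anchors, 'e{2}' also matches "leee", because two consecutive e occur inside it.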

Your Turn

Find one pattern that helps you to

  • find all names that start with ‘Joh’

  • find all names of length 2 (without using nchar())

  • find all names that have a pattern of consonant-vowel-consonant-vowel-consonant-vowel-consonant …

  • find all names that are palindromes (e.g. Anna, Hannah, Ava, …) - is it possible to find one pattern that recognizes all palindromes?
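
One possible set of patterns, sketched on a made-up vector `nm` of lowercase names. Treating only a, e, i, o, u as vowels (so y counts as a consonant) is an assumption, and `is_palindrome` is a hypothetical helper, not a built-in function.

```r
nm <- c("john", "johanna", "al", "ed", "anna",
        "hannah", "ava", "emma", "tomas")   # hypothetical names

joh  <- nm[grepl("^joh", nm)]                          # start with 'joh'
two  <- nm[grepl("^..$", nm)]                          # exactly two characters
cvc  <- nm[grepl("^([^aeiou][aeiou])+[^aeiou]?$", nm)] # consonant-vowel-...
pal3 <- nm[grepl("^(.).\\1$", nm)]                     # palindromes of length 3
pal4 <- nm[grepl("^(.)(.)\\2\\1$", nm)]                # palindromes of length 4

# general check without a pattern: compare each name with its reverse
is_palindrome <- function(s) {
  rev_s <- vapply(strsplit(s, ""),
                  function(ch) paste(rev(ch), collapse = ""),
                  character(1))
  s == rev_s
}
pal <- nm[is_palindrome(nm)]
```

Each backreference pattern here fixes the length of the name, so with the constructs above a separate pattern is needed for every length; reversing the string avoids that.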

Advanced Patterns

see ?regex; these classes are only valid inside a bracket expression, e.g. [[:alpha:]]

  • [:alpha:] Any alphabetic character
  • [:lower:] Any lowercase character
  • [:upper:] Any uppercase character
  • [:digit:] Any digit
  • [:alnum:] Any alphanumeric character (alphabetic or digit)
  • [:space:] Any white space character (space, tab, newline, vertical tab, form feed, carriage return)
  • [:graph:] Any printable character, except space
  • [:print:] Any printable character, including the space
  • [:punct:] Any punctuation (i.e., a printable character that is not white space or alphanumeric)
  • [:cntrl:] Any nonprintable character
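
A brief sketch of the classes in use, on made-up strings (the vector `x` is an assumption). Note the double brackets: the outer pair is the bracket expression, the inner pair belongs to the class name.

```r
x <- c("Mary", "anna", "R2D2", "a b", "o'neil")   # hypothetical strings

only_letters <- grepl("^[[:alpha:]]+$", x)   # letters only
has_upper    <- grepl("[[:upper:]]", x)
has_digit    <- grepl("[[:digit:]]", x)
has_space    <- grepl("[[:space:]]", x)
has_punct    <- grepl("[[:punct:]]", x)

# classes can be negated like any other bracket expression
cleaned <- gsub("[^[:alnum:]]", "", x)
cleaned
```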

Special Characters

  • “\n” newline
  • “\r” carriage return
  • “\t” tab
  • “\b” backspace
  • “\\” backslash
  • “\a” alert
  • see ?Quotes
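
A small sketch of how these control codes behave; the string `s` is made up for illustration. print() shows the escape sequences, while cat() interprets them.

```r
s <- "Name\tFreq\nMary\t7065"   # hypothetical string with a tab and a newline

print(s)       # escapes are displayed as \t and \n
cat(s, "\n")   # escapes are rendered as a tab and a line break

n_chars <- nchar(s)      # each escape sequence counts as one character
n_back  <- nchar("\\")   # "\\" is a single backslash

# in a pattern, a literal dot needs an escaped backslash,
# or fixed = TRUE to switch off regular expressions
dot_regex <- grepl("\\.", "a.b")
dot_fixed <- grepl(".", "ab", fixed = TRUE)
```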