Source file ⇒ 2017-lec12.Rmd

Today

Finish Collective Properties of Cases (chap 14 DC book)
group aesthetic in ggplot2
Regular expressions (chapter 16 DC book) –skip Chap 15

Collective Properties of Cases

Here is the link to the last lecture:

lecture 11

`group` aesthetic in ggplot2

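When x is numeric and no categorical variable is mapped in the plot, the default is that all points belong to the same group.

When x is categorical, the default is that each level of x is its own group. In the example below every point has a different x value, so every point ends up in its own group.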

Here is an example:

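# Assumed setup (not shown in the notes): dplyr supplies %>%, ggplot2 the plotting functions
library(dplyr)
library(ggplot2)

# We can convert a string in the form of a table into a data frame using read.table()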
df = read.table(text = 
"School      Year    Value 
A           1998    5
B           1999    10
C           2000    7
A           2001    17
B           2002    15
C           2003    20",  header = TRUE)

df

School	Year	Value
A	1998	5
B	1999	10
C	2000	7
A	2001	17
B	2002	15
C	2003	20

Numeric x

When x is numeric, it provides a natural ordering of the points in the frame, so it is natural to connect each point to its nearest neighbours along the x axis.

df %>% ggplot(aes(x = Year, y = Value)) +       
   geom_line() + geom_point(aes(colour=School)) + geom_smooth( method = "lm", se = FALSE, col="red")

Because the points form a single group, there is a single line connecting the points and a single geom_smooth() line.

If we add the aesthetic group=School then points are grouped by School and there are three separate lines and three separate geom_smooth() lines.

df %>% ggplot(aes(x = Year, y = Value, group=School)) +       
   geom_line() + geom_point(aes(colour=School)) +  geom_smooth( method = "lm", se = FALSE, col="red")

Categorical x

For categorical x, the default is that each level of x is its own group. Here every year occurs only once, so every point is its own group. Hence if you don’t use the group aesthetic you won’t get any lines.

df %>% ggplot(aes(x = factor(Year), y = Value)) +       
  geom_line() + geom_point(aes(colour=School)) + geom_smooth( method = "lm", se = FALSE, col="red")

Adding group = School restores the lines, one per school:

df %>% ggplot(aes(x = factor(Year), y = Value, group = School)) +
  geom_line() + geom_point(aes(colour=School)) + geom_smooth( method = "lm", se = FALSE, col="red")
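One way to check the grouping described in this section is to look at the group column that ggplot2 computes for a layer. The sketch below rebuilds the same six rows so that it is self-contained; the names `schools` and `n_groups` are made up for this illustration, and `geom_point()` is used only because the grouping is computed the same way for every layer.

```r
library(ggplot2)

# Same six rows as the example data, rebuilt so this sketch runs on its own
schools <- data.frame(
  School = c("A", "B", "C", "A", "B", "C"),
  Year   = c(1998, 1999, 2000, 2001, 2002, 2003),
  Value  = c(5, 10, 7, 17, 15, 20)
)

# Hypothetical helper: number of distinct groups in the first layer of a plot
n_groups <- function(p) length(unique(as.integer(layer_data(p, 1)$group)))

p_numeric <- ggplot(schools, aes(x = Year, y = Value)) + geom_point()
p_factor  <- ggplot(schools, aes(x = factor(Year), y = Value)) + geom_point()
p_grouped <- ggplot(schools, aes(x = factor(Year), y = Value, group = School)) +
  geom_point()

n_groups(p_numeric)  # numeric x, nothing categorical: one group
n_groups(p_factor)   # categorical x: one group per year, six in total
n_groups(p_grouped)  # explicit group = School: three groups
```

With six single-point groups there is nothing for geom_line() to connect, which is why the ungrouped categorical plot shows no lines.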

Class exercise

Do problem 1:

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-12-collection/

Regular expressions

A regular expression (regex) is a pattern that describes a set of strings.

example: The regex “[a-cx-z]” matches “a”, “b”, “c”, “x”, “y”, “z”

example: The regex “ba+d” matches “bad”, “baad”, “baaad” etc. but not “bd”
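The two examples above can be tried directly with base R's grepl(), which returns TRUE for each string that contains a match. This is a minimal sketch; the variable names are only for illustration.

```r
# Character class with two ranges: a to c, or x to z
in_class <- grepl("[a-cx-z]", c("a", "b", "c", "d", "x", "y", "z"))
in_class   # "d" is the only string that does not match

# "b", then one or more "a", then "d"
has_bad <- grepl("ba+d", c("bad", "baad", "baaad", "bd"))
has_bad    # "bd" has no "a", so it does not match

# Without anchors the pattern may match anywhere inside a longer string
inside <- grepl("ba+d", "badge")
inside
```

The last call is a reminder that grepl() looks for the pattern anywhere in the string unless ^ or $ is used.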

Use of regex:

detect whether a pattern is contained in a string. Used with `filter()` and `grepl()`

To illustrate, consider the baby names data, summarised to give the total count of each name for each sex.

NameList <- BabyNames %>% 
  mutate( name=tolower(name) ) %>%
  group_by( name, sex ) %>%
  summarise( total=sum(count) ) %>%
  arrange( desc(total)) 

head(NameList,3)

## Source: local data frame [3 x 3]
## Groups: name [3]
## 
##     name   sex   total
##    <chr> <chr>   <int>
## 1  james     M 5091189
## 2   john     M 5073958
## 3 robert     M 4789776

Here are some examples of patterns in names and the use of a regular expression to detect them. The regular expression is the string in quotes. grepl() is a function that compares a regular expression to a string, returning TRUE if there’s a match, FALSE otherwise.

The name contains “shine”, as in “sunshine” or “moonshine”

NameList %>% 
  filter( grepl( "shine", name ) ) %>% 
  head(3)

## Source: local data frame [3 x 3]
## Groups: name [3]
## 
##       name   sex total
##      <chr> <chr> <int>
## 1 sunshine     F  4959
## 2  shineka     F   150
## 3    shine     F    95

The name contains three or more vowels in a row.

NameList %>% 
  filter( grepl( "[aeiou]{3,}", name ) ) %>% 
  head(3)

## Source: local data frame [3 x 3]
## Groups: name [3]
## 
##     name   sex  total
##    <chr> <chr>  <int>
## 1  louis     M 389910
## 2 louise     F 331551
## 3 isaiah     M 177412

The name contains three or more consonants in a row.

NameList %>% 
  filter( grepl( "[^aeiou]{3,}", name ) ) %>% 
  head(3)

## Source: local data frame [3 x 3]
## Groups: name [3]
## 
##          name   sex   total
##         <chr> <chr>   <int>
## 1 christopher     M 1984307
## 2     matthew     M 1540182
## 3     anthony     M 1391462
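A caveat about the pattern above: [^aeiou] means "any character other than a lower-case vowel", which also covers digits, spaces and capital letters. That is harmless here because the names were lower-cased and contain only letters, but on messier text an explicit list of consonants is safer. A small sketch with made-up test strings (and y counted as a consonant, which is an assumption):

```r
not_vowel  <- "[^aeiou]{3,}"                      # pattern used in the notes
consonants <- "[bcdfghjklmnpqrstvwxyz]{3,}"       # explicit lower-case consonants

x <- c("christopher", "ana 12", "ANA")

by_negation <- grepl(not_vowel, x)
by_listing  <- grepl(consonants, x)

by_negation  # all TRUE: " 12" and "ANA" are runs of three non-vowels
by_listing   # only "christopher" has three listed consonants in a row
```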

The name contains “mn”

NameList %>% 
  filter( grepl( "mn", name ) ) %>% 
  head()

## Source: local data frame [6 x 3]
## Groups: name [5]
## 
##      name   sex  total
##     <chr> <chr>  <int>
## 1  autumn     F 104408
## 2  sumner     M   2287
## 3    amna     F   1099
## 4 domnick     M    405
## 5  tatumn     F    280
## 6  autumn     M    258

The first, third, and fifth letters are consonants.

NameList %>% 
  filter( grepl( "^[^aeiou].[^aeiou].[^aeiou]", name ) ) %>% 
  head()

## Source: local data frame [6 x 3]
## Groups: name [6]
## 
##          name   sex   total
##         <chr> <chr>   <int>
## 1       james     M 5091189
## 2      robert     M 4789776
## 3       david     M 3565229
## 4      joseph     M 2557792
## 5 christopher     M 1984307
## 6     matthew     M 1540182

How often do boys’ names end in vowels?

NameList %>%
  filter( grepl( "[aeiou]$", name ) ) %>% 
  group_by( sex ) %>% 
  summarise( total=sum(total) )

sex	total
F	96702371
M	21054791

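The total count for girls’ names ending in a vowel is roughly 4.6 times the total for boys’ names ending in a vowel (96,702,371 vs. 21,054,791). These are raw totals; to compare rates, divide each by the total count for that sex.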
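To turn totals like these into a rate, the vowel-ending count for each sex can be divided by the overall count for that sex. The sketch below uses base R on a tiny made-up table (the data frame `toy` and its numbers are invented for illustration and are not from BabyNames).

```r
# Invented toy data with the same columns as NameList
toy <- data.frame(
  name  = c("emma", "olivia", "megan", "joshua", "liam", "james"),
  sex   = c("F", "F", "F", "M", "M", "M"),
  total = c(50, 30, 20, 10, 60, 30)
)

ends_in_vowel <- grepl("[aeiou]$", toy$name)

# Count for vowel-ending names and for all names, by sex
vowel_total <- tapply(toy$total * ends_in_vowel, toy$sex, sum)
all_total   <- tapply(toy$total, toy$sex, sum)

share <- vowel_total / all_total
share  # share of each sex whose name ends in a vowel
```

The same idea could be written with group_by() and summarise() on NameList itself.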

Reading Regular Expressions

There are simple regular expressions and complicated ones. All of them look foreign until you learn how to read them.

There are many regular expression tutorials on the Internet, for instance this interactive one. Here is a useful cheat sheet

Some basics:

Very simple patterns:
- A single . means “any character.”
- A character, e.g., b, means just that character.
- Characters enclosed in square brackets, e.g., [aeiou] means any of those characters. (So, [aeiou] is a pattern describing a vowel.)
- The ^ inside square brackets means “any except these.” So, a consonant is [^aeiou]
Alternatives. A vertical bar means “either.” For example (A|a)dam matches Adam and adam
Repeats
- Two simple patterns in a row, means those patterns consecutively. Example: M[aeiou] means a capital M followed by a lower-case vowel.
- A simple pattern followed by a + means “one or more times.” For example M(ab)+ means M followed by one or more ab.
- A simple pattern followed by a ? means “zero or one time.”
- A simple pattern followed by a * means “zero or more times.”
- A simple pattern followed by {2} means “exactly two times.” Similarly, {2,5} means between two and five times, {6,} means six times or more. For instance, [aeiou]{2} means “exactly two vowels in a row.”
Start and end of strings.
- ^ at the beginning of a regular expression means “the start of the string”
- $ at the end means “the end of the string.”
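Each of the rules in the list above can be checked with a one-line grepl() call. The test strings below are made up for illustration; ^ and $ are added in several patterns so that the whole string has to match.

```r
# Alternatives: either "A" or "a", then "dam"
alt <- grepl("^(A|a)dam$", c("Adam", "adam", "madam"))

# + : one or more repeats of "ab" after "M"
plus <- grepl("^M(ab)+$", c("Mab", "Mabab", "M"))

# ? : zero or one "u"
opt <- grepl("^colou?r$", c("color", "colour", "colouur"))

# * : zero or more "a"
star <- grepl("^ba*d$", c("bd", "baad"))

# {2} : two vowels in a row somewhere in the string
two_vowels <- grepl("[aeiou]{2}", c("louis", "bob"))

# ^ and $ : vowel at the start, vowel at the end
starts <- grepl("^[aeiou]", c("anna", "bob"))
ends   <- grepl("[aeiou]$", c("anna", "bob"))
```

Printing any of these vectors shows which test strings matched.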

In Class exercise

Do problem 2:

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-12-collection/

Here are some other ways to use regex:

replace substrings containing a pattern with something else. Used with `mutate()` and `gsub()`

For example:

Numbers often come with comma separators or unit symbols such as % or $. For instance, here is part of a table about public debt from Wikipedia.

head(Debt,3)

country	debt	percentGDP
World	$56,308 billion	64%
United States	$17,607 billion	73.60%
Japan	$9,872 billion	214.30%

To use these numbers for computations, they must be cleaned up.

Debt %>% 
  mutate( debt=gsub("[$,]| *billion","",debt),
          percentGDP=gsub("%", "", percentGDP))

country	debt	percentGDP
World	56308	64
United States	17607	73.60
Japan	9872	214.30
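After the symbols are stripped, the values are still character strings, so one more step is needed before computing with them. A minimal base R sketch on the three rows shown above (the vector names are made up for illustration):

```r
# The raw strings from the three rows of the table
debt_raw <- c("$56,308 billion", "$17,607 billion", "$9,872 billion")
pct_raw  <- c("64%", "73.60%", "214.30%")

# Strip symbols and units, then convert the remaining text to numbers
debt_num <- as.numeric(gsub("[$,]| *billion", "", debt_raw))
pct_num  <- as.numeric(gsub("%", "", pct_raw))

debt_num
pct_num
```

Inside mutate() the same as.numeric() call can be wrapped around each gsub().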

Let’s remove a currency sign

gsub("^\\$|€|¥|£|¢$","", c("$100.95", "45¢"))

## [1] "100.95" "45"

Let’s remove leading or trailing spaces

gsub( "^ +| +$", "", "   My name is Julia     ")

## [1] "My name is Julia"
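For comparison, base R also has trimws(), which strips leading and trailing whitespace without writing a regex. The sketch below checks that it agrees with the gsub() call above and shows a variant of the pattern that also handles tabs and newlines; the second test string is made up.

```r
x <- "   My name is Julia     "

by_regex  <- gsub("^ +| +$", "", x)   # pattern from the notes: spaces only
by_trimws <- trimws(x)                # base R helper

identical(by_regex, by_trimws)

# [[:space:]] covers tabs and newlines as well as spaces
any_ws <- gsub("^[[:space:]]+|[[:space:]]+$", "", "\tJulia \n")
any_ws
```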

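A regex can also be used to extract the part of a string that matches. Here extractMatches() pulls the final vowel of each 1880 name into a new variable, vowel; names that do not end in a vowel get NA.

BabyNames %>% filter(year==1880) %>%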
  extractMatches( "([aeiou])$", name, vowel=1 ) %>%
  group_by( sex, vowel ) %>% 
  summarise( total=sum(count) ) %>%
  arrange( sex, desc(total) ) %>%
  head()

## Source: local data frame [6 x 3]
## Groups: sex [1]
## 
##     sex  vowel total
##   <chr> <fctr> <int>
## 1     F      e 33380
## 2     F      a 31446
## 3     F     NA 25696
## 4     F      u   380
## 5     F      i    61
## 6     F      o    30
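extractMatches() is not part of base R (it is assumed here to come from the DataComputing package used with the course book). A rough base R analogue of the extraction step, shown on three made-up names, uses regexpr() to locate the match and regmatches() to pull it out:

```r
names_vec <- c("anna", "john", "louise")

# Position of a final vowel in each name, or -1 when there is none
m <- regexpr("[aeiou]$", names_vec)

# Start with NA everywhere, then fill in the names that matched
vowel <- rep(NA_character_, length(names_vec))
vowel[m != -1] <- regmatches(names_vec, m)

vowel
```

As in the output above, a name with no final vowel ends up as NA.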

For next time read chapter 16 in book and do hw 5
