Source file ⇒ 2017-lec12.Rmd
group aesthetic in ggplot2
When x is numeric, the default is that all points belong to the same group.
When x is categorical, the default is that every point is its own group (more precisely, each level of x forms its own group).
Here is an example:
library(dplyr)    # provides the pipe %>% used below
library(ggplot2)
# We can convert a string in the form of a table into a data frame using read.table()
df = read.table(text =
"School Year Value
A 1998 5
B 1999 10
C 2000 7
A 2001 17
B 2002 15
C 2003 20", header = TRUE)
df
School | Year | Value |
---|---|---|
A | 1998 | 5 |
B | 1999 | 10 |
C | 2000 | 7 |
A | 2001 | 17 |
B | 2002 | 15 |
C | 2003 | 20 |
When x is numeric, x provides a natural ordering of the points in the data frame, so it is natural to connect points that are adjacent on the x axis.
df %>% ggplot(aes(x = Year, y = Value)) +
geom_line() + geom_point(aes(colour=School)) + geom_smooth( method = "lm", se = FALSE, col="red")
Because the points form a single group, there is a single line connecting the points and a single geom_smooth() line.
If we add the aesthetic group=School, then points are grouped by School, and there are three separate lines and three separate geom_smooth() lines.
df %>% ggplot(aes(x = Year, y = Value, group=School)) +
geom_line() + geom_point(aes(colour=School)) + geom_smooth( method = "lm", se = FALSE, col="red")
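To see what the three red lines represent, the per-group fits can be sketched in base R. This is only an illustration, assuming geom_smooth(method = "lm") fits an ordinary least-squares line within each group; the names slope_by_school and slope_all are hypothetical and are not part of the lecture code.

```r
# Rebuild the small example data frame so this block runs on its own.
df <- read.table(text =
"School Year Value
A 1998 5
B 1999 10
C 2000 7
A 2001 17
B 2002 15
C 2003 20", header = TRUE)

# Fit Value ~ Year separately within each School (one fit per group).
slope_by_school <- sapply(split(df, df$School), function(d) {
  unname(coef(lm(Value ~ Year, data = d))["Year"])
})

# One fit over all six points, as when every point is in a single group.
slope_all <- unname(coef(lm(Value ~ Year, data = df))["Year"])

print(slope_by_school)
print(slope_all)
```

Each school has only two points, so each per-group line passes exactly through both of them, while the single-group fit is a compromise across all six.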
For categorical x, the default is that every point is its own group. Hence if you don’t use the group aesthetic you won’t get any lines.
df %>% ggplot(aes(x = factor(Year), y = Value)) +
geom_line() + geom_point(aes(colour=School)) + geom_smooth( method = "lm", se = FALSE, col="red")
Adding group = School restores one line per school:
df %>% ggplot(aes(x = factor(Year), y = Value, group = School)) +
geom_line() + geom_point(aes(colour=School)) + geom_smooth( method = "lm", se = FALSE, col="red")
A regular expression (regex) is a pattern that describes a set of strings.
Example: the regex “[a-cx-z]” matches “a”, “b”, “c”, “x”, “y”, “z”.
Example: the regex “ba+d” matches “bad”, “baad”, “baaad”, etc., but not “bd”.
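The two examples can be checked with grepl() from base R. This is a small sketch: the test strings are chosen here for illustration, and the patterns are wrapped in ^ and $ (an addition for this sketch) so that the whole string has to match rather than just part of it.

```r
# Candidate strings to test against each pattern.
letters_to_try <- c("a", "b", "c", "d", "x", "y", "z")
words_to_try   <- c("bd", "bad", "baad", "baaad")

# [a-cx-z] matches a single character in the ranges a-c or x-z.
in_class <- grepl("^[a-cx-z]$", letters_to_try)

# ba+d matches "b", then one or more "a", then "d".
has_bad <- grepl("^ba+d$", words_to_try)

print(setNames(in_class, letters_to_try))
print(setNames(has_bad, words_to_try))
```

Only "d" fails the first pattern, and only "bd" fails the second.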
Use of regex: filter() and grepl()
To illustrate, consider the baby names data, summarised to give the total count of each name for each sex.
NameList <- BabyNames %>%
mutate( name=tolower(name) ) %>%
group_by( name, sex ) %>%
summarise( total=sum(count) ) %>%
arrange( desc(total))
head(NameList,3)
## Source: local data frame [3 x 3]
## Groups: name [3]
##
## name sex total
## <chr> <chr> <int>
## 1 james M 5091189
## 2 john M 5073958
## 3 robert M 4789776
Here are some examples of patterns in names and the use of a regular expression to detect them. The regular expression is the string in quotes. grepl() is a function that compares a regular expression to a string, returning TRUE if there’s a match, FALSE otherwise.
The name contains “shine”, as in “sunshine” or “moonshine”
NameList %>%
filter( grepl( "shine", name ) ) %>%
head(3)
## Source: local data frame [3 x 3]
## Groups: name [3]
##
## name sex total
## <chr> <chr> <int>
## 1 sunshine F 4959
## 2 shineka F 150
## 3 shine F 95
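What filter() does with grepl() can be sketched in base R: grepl() produces one TRUE or FALSE per row, and that logical vector picks the rows to keep. The data frame name_sample below is a hypothetical stand-in holding a few rows copied from the outputs above, not the full NameList.

```r
# A few rows copied from the outputs above; this is not the full NameList.
name_sample <- data.frame(
  name  = c("james", "sunshine", "shineka", "shine"),
  sex   = c("M", "F", "F", "F"),
  total = c(5091189, 4959, 150, 95),
  stringsAsFactors = FALSE
)

# grepl() returns one TRUE/FALSE per name ...
keep <- grepl("shine", name_sample$name)

# ... and the logical vector selects the matching rows.
shine_rows <- name_sample[keep, ]
print(shine_rows)
```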
The name contains three or more vowels in a row.
NameList %>%
filter( grepl( "[aeiou]{3,}", name ) ) %>%
head(3)
## Source: local data frame [3 x 3]
## Groups: name [3]
##
## name sex total
## <chr> <chr> <int>
## 1 louis M 389910
## 2 louise F 331551
## 3 isaiah M 177412
The name contains three or more consonants in a row (here, any character other than a, e, i, o, u).
NameList %>%
filter( grepl( "[^aeiou]{3,}", name ) ) %>%
head(3)
## Source: local data frame [3 x 3]
## Groups: name [3]
##
## name sex total
## <chr> <chr> <int>
## 1 christopher M 1984307
## 2 matthew M 1540182
## 3 anthony M 1391462
The name contains “mn”.
NameList %>%
filter( grepl( "mn", name ) ) %>%
head()
## Source: local data frame [6 x 3]
## Groups: name [5]
##
## name sex total
## <chr> <chr> <int>
## 1 autumn F 104408
## 2 sumner M 2287
## 3 amna F 1099
## 4 domnick M 405
## 5 tatumn F 280
## 6 autumn M 258
The first, third, and fifth letters are consonants.
NameList %>%
filter( grepl( "^[^aeiou].[^aeiou].[^aeiou]", name ) ) %>%
head()
## Source: local data frame [6 x 3]
## Groups: name [6]
##
## name sex total
## <chr> <chr> <int>
## 1 james M 5091189
## 2 robert M 4789776
## 3 david M 3565229
## 4 joseph M 2557792
## 5 christopher M 1984307
## 6 matthew M 1540182
How often do boys’ names end in vowels?
NameList %>%
filter( grepl( "[aeiou]$", name ) ) %>%
group_by( sex ) %>%
summarise( total=sum(total) )
sex | total |
---|---|
F | 96702371 |
M | 21054791 |
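Counting births, names ending in a vowel are about 4.6 times as common among girls as among boys (96,702,371 versus 21,054,791). This compares raw totals, not the share of each sex.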
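The totals above count births whose names end in a vowel. To turn them into a rate for each sex, one option is to divide by that sex's overall total. The sketch below does this in base R on a handful of rows copied from the outputs in this section. The data frame name_sample is a hypothetical stand-in for NameList, so the resulting shares are illustrative only and say nothing about the full data.

```r
# A few rows copied from outputs in this section; not the full NameList.
name_sample <- data.frame(
  name  = c("james", "john", "louis", "louise",
            "sunshine", "autumn", "amna", "shineka"),
  sex   = c("M", "M", "M", "F", "F", "F", "F", "F"),
  total = c(5091189, 5073958, 389910, 331551, 4959, 104408, 1099, 150),
  stringsAsFactors = FALSE
)

# TRUE for names whose last character is a vowel.
ends_in_vowel <- grepl("[aeiou]$", name_sample$name)

# Total births per sex, and births per sex with a vowel-final name.
# Multiplying by the logical vector zeroes out the non-matching rows.
total_by_sex <- tapply(name_sample$total, name_sample$sex, sum)
vowel_by_sex <- tapply(name_sample$total * ends_in_vowel, name_sample$sex, sum)

# Share of each sex's births that have a vowel-final name.
share <- vowel_by_sex / total_by_sex
print(share)
```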
There are simple regular expressions and complicated ones. All of them look foreign until you learn how to read them.
There are many regular expression tutorials on the Internet, for instance this interactive one. Here is a useful cheat sheet.
Some basics:
A single . means “any character.”
A character, e.g., b, means just that character.
Characters enclosed in square brackets, e.g., [aeiou], mean any one of those characters. (So, [aeiou] is a pattern describing a vowel.)
The ^ as the first character inside square brackets means “any except these.” So, a consonant is [^aeiou].
Alternatives. A vertical bar means “either.” For example, (A|a)dam matches Adam and adam.
Repeats
Two simple patterns in a row mean those patterns consecutively. Example: M[aeiou] means a capital M followed by a lower-case vowel.
A simple pattern followed by a + means “one or more times.” For example, M(ab)+ means M followed by one or more ab.
A simple pattern followed by a ? means “zero or one time.”
A simple pattern followed by a * means “zero or more times.”
A simple pattern followed by {2} means “exactly two times.” Similarly, {2,5} means between two and five times, and {6,} means six times or more. For instance, [aeiou]{2} means “exactly two vowels in a row.”
Start and end of strings
^ at the beginning of a regular expression means “the start of the string.”
$ at the end means “the end of the string.”
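The rules above can be tried out one at a time with grepl(). The following is a sketch in base R: the test strings, and the pattern colou?r used for the ? rule, are illustrative choices made here, and most patterns are anchored with ^ and $ so that the whole string must match.

```r
# Each entry pairs a pattern built from the rules above with a test string.
checks <- c(
  dot       = grepl("^b.d$", "bid"),         # . is any single character
  class     = grepl("^[aeiou]$", "e"),        # one vowel
  negated   = grepl("^[^aeiou]$", "e"),       # "e" is not a consonant
  either    = grepl("^(A|a)dam$", "adam"),    # A or a, then dam
  plus      = grepl("^M(ab)+$", "Mabab"),     # one or more "ab"
  plus_zero = grepl("^M(ab)+$", "M"),         # + needs at least one "ab"
  optional  = grepl("^colou?r$", "color"),    # zero or one "u"
  star      = grepl("^M(ab)*$", "M"),         # * allows zero "ab"
  exactly   = grepl("^[aeiou]{2}$", "ea"),    # exactly two vowels
  range     = grepl("^[aeiou]{2,5}$", "e"),   # a single vowel is too few
  start     = grepl("^an", "banana"),         # "an" is not at the start
  end       = grepl("na$", "banana")          # "na" is at the end
)
print(checks)
```

The entries negated, plus_zero, range, and start come out FALSE; the rest are TRUE.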
Do problem 2:
https://scf.berkeley.edu:3838/shiny/alucas/Lecture-12-collection/
Here are some other ways to use regex: mutate() and gsub()
For example:
Numbers often come with comma separators or unit symbols such as % or $. For instance, here is part of a table about public debt from Wikipedia.
head(Debt,3)
country | debt | percentGDP |
---|---|---|
World | $56,308 billion | 64% |
United States | $17,607 billion | 73.60% |
Japan | $9,872 billion | 214.30% |
To use these numbers for computations, they must be cleaned up.
Debt %>%
mutate( debt=gsub("[$,]|billion","",debt),
percentGDP=gsub("%", "", percentGDP))
country | debt | percentGDP |
---|---|---|
World | 56308 | 64 |
United States | 17607 | 73.60 |
Japan | 9872 | 214.30 |
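gsub() returns character strings, so a further step is needed before doing arithmetic. A minimal base R sketch, assuming the three rows shown above are typed in by hand and that as.numeric() is an acceptable way to convert; matching " billion" with its leading space is a choice made here so that no stray space is left behind.

```r
# The three rows shown above, entered by hand so the block is self-contained.
Debt <- data.frame(
  country    = c("World", "United States", "Japan"),
  debt       = c("$56,308 billion", "$17,607 billion", "$9,872 billion"),
  percentGDP = c("64%", "73.60%", "214.30%"),
  stringsAsFactors = FALSE
)

# Strip "$", ",", and the trailing " billion", then convert to numbers.
debt_billions <- as.numeric(gsub("[$,]| billion", "", Debt$debt))

# Strip the percent sign, then convert to numbers.
percent_gdp <- as.numeric(gsub("%", "", Debt$percentGDP))

# The cleaned columns can now be used in computations.
print(sum(debt_billions[-1]))   # United States plus Japan, in billions
print(percent_gdp)
```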
Let's remove a currency sign
gsub("^\\$|€|¥|£|¢$","", c("$100.95", "45¢"))
## [1] "100.95" "45"
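In a pattern like the one above, ^ and $ attach only to the alternative they sit in, so the alternatives in the middle can match anywhere in the string. The sketch below shows the difference, with the plain letters x, y, and z standing in as placeholders for the currency symbols; the grouped pattern is one possible way to make both anchors apply to every alternative.

```r
s <- "xyzxyz"

# ^ applies only to x and $ only to z; y is removed wherever it occurs.
loose <- gsub("^x|y|z$", "", s)

# Grouping makes each anchor apply to all three alternatives,
# so only the first and last characters can be removed.
strict <- gsub("^(x|y|z)|(x|y|z)$", "", s)

print(loose)
print(strict)
```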
Let's remove leading or trailing spaces
gsub( "^ +| +$", "", " My name is Julia ")
## [1] "My name is Julia"
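For this particular task base R also has trimws(), which by default strips leading and trailing spaces, tabs, and newlines. A short sketch comparing the two on a string with two spaces at each end (the test string is chosen here for illustration):

```r
s <- "  My name is Julia  "

# Regex version: one or more spaces at the start, or at the end.
trimmed_regex <- gsub("^ +| +$", "", s)

# Built-in helper that strips leading and trailing whitespace.
trimmed_base <- trimws(s)

print(trimmed_regex)
print(identical(trimmed_regex, trimmed_base))
```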
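Regular expressions can also be used to pull out part of a string. Here extractMatches() stores the part of the pattern in parentheses, the final vowel of each 1880 name, in a new variable called vowel:
BabyNames %>% filter(year==1880) %>%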
extractMatches( "([aeiou])$", name, vowel=1 ) %>%
group_by( sex, vowel ) %>%
summarise( total=sum(count) ) %>%
arrange( sex, desc(total) ) %>%
head()
## Source: local data frame [6 x 3]
## Groups: sex [1]
##
## sex vowel total
## <chr> <fctr> <int>
## 1 F e 33380
## 2 F a 31446
## 3 F NA 25696
## 4 F u 380
## 5 F i 61
## 6 F o 30
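The extractMatches() call above appears to pull out the text matched by the parenthesized group, giving NA when the name does not end in a vowel. A comparable result can be sketched in base R with regexec() and regmatches(). The names in example_names are made up for illustration, and final_vowel is a hypothetical variable name.

```r
# Made-up lower-case names, used only to illustrate the technique.
example_names <- c("anna", "john", "marie", "leo", "lou")

# regexec() records the full match and each parenthesized group;
# regmatches() returns them as a list with one element per name.
matches <- regmatches(example_names, regexec("([aeiou])$", example_names))

# Element 2 is the first group; names with no match get NA.
final_vowel <- vapply(
  matches,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1)
)

print(final_vowel)
print(table(final_vowel, useNA = "ifany"))
```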