Lesson 9 Strings

28 December 2017

Strings

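This chapter covers string manipulation in R. Strings often hold unstructured or semi-structured data, so the focus is on regular expressions, a language for describing patterns in strings.
This chapter uses the stringr package; dplyr is loaded as well for the data frame examples later on.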

library(stringr)
library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

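String basics

Strings can be created with either single or double quotes. The two behave identically, but double quotes are generally recommended, unless the string itself contains several double quotes.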

string1 <- "This is a string"
string2 <- 'If I want to include a "quote", I use single quotes'

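If you forget the closing quote, the Console shows the continuation prompt + and waits for more input. Press Esc to cancel, then run the code again.

To include a literal single or double quote in a string, put a \ (the escape character) in front of it.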

double_quote <- "\""
single_quote <- '\''

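This also means that to include a literal \ in a string you need to write "\\".
Printing a string directly in the Console shows its escaped representation, backslashes included. To see the actual contents of the string, use writeLines().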

x <- c("\"", "\\")
x

[1] "\"" "\\"

writeLines(x)

"
\

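There are a few other special characters. The most common are \n (newline) and \t (tab).
For the complete list of escapes, search for " in Help (for example, run ?'"').
You will sometimes also see strings like "\u00b5". This is a way of writing non-English characters that works on every platform.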

x <- "\u00b5"
x

[1] "μ"
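As a small illustration of the escapes described above, the sketch below compares the printed form of a string with its writeLines() output. The variable name escaped and its contents are made up for this example.

```r
library(stringr)

# A string containing two tabs (\t) and one newline (\n).
escaped <- "name\tvalue\nx\t1"

# print() shows the escaped representation, as it was typed.
print(escaped)

# writeLines() shows the actual contents: two tab-separated lines.
writeLines(escaped)

# Each escape sequence counts as a single character.
str_length("\t")
str_length(escaped)
```

The last two calls confirm that "\t" is one character, not two, which is easy to forget when counting positions for str_sub().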

String length

In the stringr package, all function names start with str_. For example, str_length() returns the number of characters in each string.

str_length(c("a", "R for data science", NA))

[1]  1 18 NA

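Because of the shared str_ prefix, typing str_ in RStudio triggers autocomplete; use the up and down arrow keys and Tab to pick a function.

Combining strings

Use str_c() to combine two or more strings.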

str_c("x", "y")

[1] "xy"

str_c("x", "y", "z")

[1] "xyz"

Use the sep argument to set the separator placed between the strings.

str_c("x", "y", sep = ", ")

[1] "x, y"

As with most functions in R, missing values are contagious: combining anything with NA gives NA.

x <- c("abc", NA)
str_c("|", x, "|")

[1] "|abc|" NA

To have NA appear as the string "NA" instead, use str_replace_na().

str_c("|", str_replace_na(x), "|")

[1] "|abc|" "|NA|"

str_c() is vectorised. If the vectors being combined have different lengths, the shorter ones are recycled to the length of the longest.

str_c("prefix-", c("a", "b", "c"), "-suffix")

[1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

Zero-length vectors are silently dropped, which is particularly useful together with if.

name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE
str_c("Good ", time_of_day, " ", name, 
      if (birthday) " and HAPPY BIRTHDAY", ".")

[1] "Good morning Hadley."

Use collapse to merge a vector of strings into a single string.

str_c(c("x", "y", "z"), collapse = ", ")

[1] "x, y, z"
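sep and collapse can be used in the same call. The following sketch, with made-up vectors letters3 and digits3, shows the two steps separately and then together.

```r
library(stringr)

letters3 <- c("a", "b", "c")
digits3  <- c("1", "2", "3")

# sep is placed between the arguments, element by element,
# so the result still has three elements.
pairs <- str_c(letters3, digits3, sep = "-")
pairs

# collapse then joins the elements of that result into one string.
joined <- str_c(letters3, digits3, sep = "-", collapse = ", ")
joined
```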

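Subsetting strings

Use str_sub() to extract part of a string. The start and end arguments give the (inclusive) positions of the first and last characters to keep; negative values count backwards from the end of the string.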

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)

[1] "App" "Ban" "Pea"

str_sub(x, -3, -1)

[1] "ple" "ana" "ear"

Note that str_sub() does not fail if the string is too short; it simply returns as many characters as it can.

str_sub("a", 1, 5)

[1] "a"

The assignment form of str_sub() modifies strings in place.

str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x

[1] "apple"  "banana" "pear"

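Exercises

In your own words, describe the difference between the sep and collapse arguments of str_c().
Use str_length() and str_sub() to extract the middle letter "p" from the string "apple".
What does str_trim() do?
What do str_to_upper() and str_to_title() do?

Regular expressions

Regular expressions are a very concise language for describing patterns in strings. We will use str_view() and str_view_all() to learn them. Both functions take a character vector and a regular expression, and show how the two match. (In stringr 1.5.0 and later, str_view() highlights every match and str_view_all() is deprecated.)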

Basic matches

The simplest patterns match exact strings.

x <- c("apple", "banana", "pear")
str_view(x, "an")

str_view_all(x, "an")

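Use . to match any character (except a newline).

str_view(x, ".a.")

Since . matches any character by default, how do you match a literal "."?
You need an escape to tell the regular expression that you want the character itself, not its special behaviour. Like strings, regular expressions use \ to escape special characters, so the regular expression is \..
However, regular expressions are written as strings, and \ is also the escape character in strings. So to create the regular expression \. you need the string "\\.".

Create a regular expression that matches a literal ".".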

writeLines("\\.")

\.

str_view(c("abc", "a.c", "bef"), "a\\.c")

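Since \ is the escape character in regular expressions, how do you match a literal \?
The regular expression has to be \\, and writing it as a string requires "\\\\": four backslashes to match one.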

x <- "a\\b"
writeLines(x)

a\b

str_view(x, "\\\\")
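The two layers of escaping can be checked by counting characters. In the sketch below, path is a made-up example string, and fixed() is a stringr helper not used elsewhere in this lesson; it treats the pattern as plain text rather than as a regular expression.

```r
library(stringr)

path <- "C:\\Users\\me"   # the string C:\Users\me
writeLines(path)

# The string "\\\\" holds two characters, i.e. the regular expression \\,
# which matches one literal backslash.
str_length("\\\\")
str_count(path, "\\\\")

# fixed() skips the regular-expression layer, so one escaped
# backslash in the string is enough.
str_count(path, fixed("\\"))
```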

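Exercises

Explain why each of the three strings "\", "\\" and "\\\" fails to match a literal \.
How would you match the sequence "'\?
What kind of strings does the regular expression \..\..\.. match? Give an example and check that it matches.

Anchors

By default a regular expression matches any part of a string, but you can anchor it to the start or the end.

^ matches the start of the string.
$ matches the end of the string.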

x <- c("apple", "banana", "pear")
str_view(x, "^a")

str_view(x, "a$")

A mnemonic for ^ and $: if you begin with power (^), you end up with money ($).

To force a regular expression to match only a complete string, anchor it with both ^ and $.

x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")

str_view(x, "^apple$")
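Since str_view() output is visual, the effect of each anchor may be easier to compare as logical vectors. This sketch uses str_detect(), which is introduced later in the lesson, on the same x as above.

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

unanchored <- str_detect(x, "apple")     # matches anywhere in the string
from_start <- str_detect(x, "^apple")    # must start with "apple"
to_end     <- str_detect(x, "apple$")    # must end with "apple"
whole      <- str_detect(x, "^apple$")   # the whole string must be "apple"

rbind(unanchored, from_start, to_end, whole)
```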

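Exercises

How would you match the literal string "$^$"?
stringr::words contains 980 common English words. Write regular expressions that find the words that:
1. Start with "y".
2. End with "x".
3. Are exactly three letters long.
4. Have seven letters or more.
Because the list is long, try the match argument of str_view() to show only the matching words.

Besides ., several other special patterns match more than one character.

\d: matches any digit.
\s: matches any whitespace (such as a space, tab or newline).
[abc]: matches a, b or c.
[^abc]: matches any character except a, b or c.

To create a regular expression containing \d or \s, you need the string "\\d" or "\\s".
Alternation with | chooses between patterns: gr(e|a)y matches either "grey" or "gray".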

str_view(c("grey", "gray"), "gr(e|a)y")
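The character classes listed above can be tried out on a small vector. This is only a sketch: the vector x is made up, and str_extract(), str_replace() and str_detect() are covered in detail later in the lesson.

```r
library(stringr)

x <- c("room 101", "gray cat", "grey\tdog")

# \d written as a string is "\\d": the first digit, or NA if there is none.
first_digit <- str_extract(x, "\\d")
first_digit

# \s matches the space and the tab alike.
underscored <- str_replace(x, "\\s", "_")
underscored

# A character class is a compact alternative to (e|a) for single characters.
has_grey <- str_detect(x, "gr[ea]y")
has_grey

# A negated class: the first character that is not a lower-case letter.
first_other <- str_extract(x, "[^a-z]")
first_other
```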

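Exercises

Create regular expressions that find all words in stringr::words that:
1. Start with a vowel.
2. End with ed, but not with eed.
3. End with ing or ise.
Empirically verify the spelling rule "i before e except after c" (i usually comes before e, as in believe, except after c, where e comes first, as in receive).
Is "q" always followed by "u"?
Write a regular expression that matches every Hangzhou telephone number (0571-XXXXXXXX).

Repetition

Repetition operators control how many times a pattern matches.

?: 0 or 1
+: 1 or more
*: 0 or more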

x <- "1888 in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")

str_view(x, "CC+")

str_view(x, 'C[LX]+')

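Repetition operators have high precedence, so colou?r matches both the British and the American spelling. It also means that repetition usually has to be combined with parentheses, as in bana(na)+.

You can also specify the number of repetitions exactly.

{n}: exactly n
{n,}: n or more
{0,m}: at most m (write the lower bound explicitly; {,m} is not part of the ICU syntax that stringr uses)
{n,m}: between n and m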

str_view(x, "C{2}")

str_view(x, "C{2,}")

str_view(x, "C{2,3}")

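By default these matches are greedy: they match the longest string possible. Putting ? after a repetition makes it lazy, so that it matches the shortest string possible.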

str_view(x, 'C{2,3}?')

str_view(x, 'C[LX]+?')
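The difference between greedy and lazy matching can also be read off as text. The sketch below reuses x from above and swaps str_view() for str_extract() (introduced later in the lesson), which returns the first match as a string.

```r
library(stringr)

x <- "1888 in Roman numerals: MDCCCLXXXVIII"

greedy_c  <- str_extract(x, "C{2,3}")    # greedy: takes all three Cs
lazy_c    <- str_extract(x, "C{2,3}?")   # lazy: stops after two
greedy_lx <- str_extract(x, "C[LX]+")    # greedy: C plus every following L or X
lazy_lx   <- str_extract(x, "C[LX]+?")   # lazy: C plus a single L or X

c(greedy_c, lazy_c, greedy_lx, lazy_lx)
```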

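Exercises

Describe the equivalents of ?, + and * in {n,m} form.
For each of the following regular expressions, give two examples of what it matches (pay attention to whether it is written as a regular expression or as a string that defines one).
1. ^.*$
2. "\\{.+\\}"
3. \d{4}-\d{2}-\d{2}
4. "\\\\{4}" (one example is enough here)
Create regular expressions that find all words in stringr::words that:
1. Start with three consonants.
2. Contain at least three vowels.
3. Contain two or more vowel-consonant pairs.

Grouping and backreferences

Besides changing precedence, parentheses define groups that can be referred to later with backreferences: \1 refers to what the first pair of parentheses matched, \2 to the second, and so on.
The following regular expression finds every fruit in stringr::fruit that contains a repeated pair of letters.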

str_view(fruit, "(..)\\1", match = TRUE)
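To see which part of each string the backreference actually matched, the same pattern can be run on a short hand-picked vector. The vector x is an assumption made for this sketch; str_subset() and str_extract() are described later in the lesson.

```r
library(stringr)

x <- c("banana", "coconut", "apple", "papaya")

# (..) captures any two characters; \\1 requires the same two again.
with_repeat <- str_subset(x, "(..)\\1")
with_repeat

# The matched text itself, or NA where nothing matches.
repeated <- str_extract(x, "(..)\\1")
repeated
```

Note that "apple" does not match: "pp" is one character repeated, which would need (.)\1 rather than (..)\1.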

Exercises

Describe, with at least two examples each, what the following expressions match.
1. (.)\1\1
2. "(.)(.)\\2\\1"
3. (..)\1
4. "(.).\\1.\\1"
5. "(.)(.)(.).*\\3\\2\\1"
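Construct regular expressions that match the words in stringr::words that:
1. Start and end with the same letter.
2. Contain a repeated pair of letters (for example, "church" contains "ch" twice).
3. Contain one letter repeated at least three times (for example, "eleven" contains three "e"s).

Applying regular expressions

Now that you know how to write regular expressions, it is time to apply them to real problems. This section uses a range of stringr functions that let you:

Determine which strings match a pattern
Find the positions of matches
Extract the content of matches
Replace matches with new values
Split a string based on a match

If you find it hard to solve a problem with a single regular expression, break it into smaller pieces and solve them one at a time. A series of simple regular expressions is usually better than one complex one.

Detecting matches

Use str_detect() to determine whether each string in a character vector matches a pattern. It returns a logical vector the same length as the input.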

x <- c("apple", "banana", "pear")
str_detect(x, "e")

[1]  TRUE FALSE  TRUE

In a numeric context, FALSE becomes 0 and TRUE becomes 1, which makes sum() and mean() useful for summarising matches across a whole vector.
For example, how many words start with t?

sum(str_detect(words, "^t"))

[1] 65

What proportion of words end with a vowel?

mean(str_detect(words, "[aeiou]$"))

[1] 0.2765306

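For complex logical conditions (for example, match a or b but not c unless d), combining several str_detect() calls with logical operators is often simpler and easier to read than a single regular expression.
For example, find all words that contain no vowels.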

no_vowels_1 <- !str_detect(words, "[aeiou]")
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)

[1] TRUE
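The same approach extends to conditions with several parts. As a sketch on a made-up vector x, "contains a or b, but not c" can be written as two simple tests joined by logical operators.

```r
library(stringr)

x <- c("cab", "bad", "dog", "ace", "bib")

# Two simple tests instead of one complicated regular expression.
has_a_or_b <- str_detect(x, "[ab]")
has_c      <- str_detect(x, "c")

selected <- x[has_a_or_b & !has_c]
selected
```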

Combining str_detect() with [] selects the elements that match a pattern.

words[str_detect(words, "x$")]

[1] "box" "sex" "six" "tax"

Or, more conveniently, use str_subset().

str_subset(words, "x$")

[1] "box" "sex" "six" "tax"

When the strings are a column of a data frame, use filter() instead.

df <- tibble(word = words, length = str_length(word))
df %>% filter(str_detect(word, "x$"))

# A tibble: 4 x 2
   word length
  <chr>  <int>
1   box      3
2   sex      3
3   six      3
4   tax      3

str_count() is a variation on str_detect(): rather than a simple TRUE or FALSE, it returns the number of matches in each string.

x <- c("apple", "banana", "pear")
str_count(x, "a")

[1] 1 3 1

On average, how many vowels are there per word?

mean(str_count(words, "[aeiou]"))

[1] 1.991837

str_count() works naturally with mutate().

df %>% mutate(vowels = str_count(word, "[aeiou]"), 
             consonants = str_count(word, "[^aeiou]"))

# A tibble: 980 x 4
       word length vowels consonants
      <chr>  <int>  <int>      <int>
 1        a      1      1          0
 2     able      4      2          2
 3    about      5      3          2
 4 absolute      8      4          4
 5   accept      6      2          4
 6  account      7      3          4
 7  achieve      7      4          3
 8   across      6      2          4
 9      act      3      1          2
10   active      6      3          3
# ... with 970 more rows

Note that matches never overlap. For example, how many times does "aba" match in "abababa"? The answer is two, not three.

str_count("abababa", "aba")

[1] 2

str_view_all("abababa", "aba")

Note the use of str_view_all(). Many stringr functions come in pairs: one works with a single match, the other with all matches, and the second has the suffix _all.
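Both points, the non-overlapping matches and the single/_all pairing, can be checked with str_locate() and str_locate_all(), which are mentioned again at the end of the lesson. The name all_matches is made up for this sketch.

```r
library(stringr)

# First match only: a one-row matrix of start and end positions.
str_locate("abababa", "aba")

# All matches: a list with one matrix per input string.
# The second match starts at 5, not 3, because matches do not overlap.
all_matches <- str_locate_all("abababa", "aba")[[1]]
all_matches
nrow(all_matches)
```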

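Exercises

Find the following words in stringr::words, once with a single str_detect() call and once by combining several str_detect() calls.
1. All words that start or end with x.
2. All words that start with a vowel and end with a consonant.
3. Are there any words that contain all five vowels?
Which word has the most vowels? Which word has the highest proportion of vowels?

Extracting matches

Use str_extract() to extract the actual text of a match. This calls for a more complicated example: the Harvard sentences, which were designed for testing voice systems and are available in stringr::sentences.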

length(sentences)

[1] 720

head(sentences)

[1] "The birch canoe slid on the smooth planks." 
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."     
[4] "These days a chicken leg is a rare dish."   
[5] "Rice is often served in round bowls."       
[6] "The juice of lemons makes fine punch."

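Suppose we want to find all sentences that contain a colour. First create a vector of colour names, then turn it into a single regular expression.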

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match

[1] "red|orange|yellow|green|blue|purple"

Now select the sentences that contain a colour, then extract the colour to see which one it is.

has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)

[1] "blue" "blue" "red"  "red"  "red"  "blue"

str_extract() only extracts the first match in each sentence. To see this, select the sentences that have more than one match.

more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)

str_extract(more, colour_match)

[1] "blue"   "green"  "orange"

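This is a common pattern for stringr functions, because working with a single match allows a much simpler data structure. To get all the matches, use str_extract_all(), which returns a list.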

str_extract_all(more, colour_match)

[[1]]
[1] "blue" "red" 

[[2]]
[1] "green" "red"  

[[3]]
[1] "orange" "red"

With simplify = TRUE, str_extract_all() returns a matrix.

str_extract_all(more, colour_match, simplify = TRUE)

     [,1]     [,2] 
[1,] "blue"   "red"
[2,] "green"  "red"
[3,] "orange" "red"

If the numbers of matches differ, the shorter rows are padded with empty strings to the length of the longest.

x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)

     [,1] [,2] [,3]
[1,] "a"  ""   ""  
[2,] "a"  "b"  ""  
[3,] "a"  "b"  "c"

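Exercises

From the Harvard sentences data, extract:
1. The first word of each sentence.
2. All words ending in ing.
3. All plurals.

Grouped matches

Besides changing precedence and defining backreferences, parentheses can be used to extract parts of a complex match.
Suppose we want to extract nouns from the sentences. How could that be expressed as a regular expression? A simple but imperfect heuristic is to look for any word that follows "a" or "the". Defining a "word" takes some thought too; here it is approximated as a sequence of at least one character that is not a space.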

noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>% str_subset(noun) %>% head(10)
has_noun %>% str_extract(noun)

 [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked"
 [6] "the sun"    "the huge"   "the ball"   "the woman"  "a helps"

str_extract() gives the complete match as a character vector. str_match() gives each individual component: it returns a matrix whose first column is the complete match, followed by one column for each group.

has_noun %>% str_match(noun)

      [,1]         [,2]  [,3]     
 [1,] "the smooth" "the" "smooth" 
 [2,] "the sheet"  "the" "sheet"  
 [3,] "the depth"  "the" "depth"  
 [4,] "a chicken"  "a"   "chicken"
 [5,] "the parked" "the" "parked" 
 [6,] "the sun"    "the" "sun"    
 [7,] "the huge"   "the" "huge"   
 [8,] "the ball"   "the" "ball"   
 [9,] "the woman"  "the" "woman"  
[10,] "a helps"    "a"   "helps"
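The matrix returned by str_match() can be indexed by column to pull out one group at a time. In this sketch the vector settings and its key=value format are made up for illustration.

```r
library(stringr)

settings <- c("width=10", "height=20", "no setting here")

# Group 1 captures the key, group 2 the numeric value.
parts <- str_match(settings, "([a-z]+)=([0-9]+)")
parts

# Column 1 is the whole match; columns 2 and 3 are the groups.
# Strings without a match give a row of NA.
keys   <- parts[, 2]
values <- as.integer(parts[, 3])
keys
values
```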

If the data is in a tibble, tidyr::extract() is often easier to use. It works like str_match(), but requires you to name the matches, which then become new columns.

tibble(sentence = sentences) %>% 
  tidyr::extract(sentence, c("article", "noun"), "(a|the) ([^ ]+)", 
    remove = FALSE)

# A tibble: 720 x 3
                                      sentence article    noun
 *                                       <chr>   <chr>   <chr>
 1  The birch canoe slid on the smooth planks.     the  smooth
 2 Glue the sheet to the dark blue background.     the   sheet
 3      It's easy to tell the depth of a well.     the   depth
 4    These days a chicken leg is a rare dish.       a chicken
 5        Rice is often served in round bowls.    <NA>    <NA>
 6       The juice of lemons makes fine punch.    <NA>    <NA>
 7 The box was thrown beside the parked truck.     the  parked
 8 The hogs were fed chopped corn and garbage.    <NA>    <NA>
 9         Four hours of steady work faced us.    <NA>    <NA>
10    Large size in stockings is hard to sell.    <NA>    <NA>
# ... with 710 more rows

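As with str_extract(), use str_match_all() if you want all the matches in each string.

Exercises

Find all words that come after a number such as "one", "two" or "three", and pull out both the number and the word.
Find all contractions, and separate the pieces before and after the apostrophe.

Replacing matches

str_replace() and str_replace_all() replace matches with new strings.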

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")

[1] "-pple"  "p-ar"   "b-nana"

str_replace_all(x, "[aeiou]", "-")

[1] "-ppl-"  "p--r"   "b-n-n-"

str_replace_all() can also perform multiple replacements at once if you supply a named vector.

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))

[1] "one house"    "two cars"     "three people"

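Instead of replacing with a fixed string, you can use the backreferences \1, \2 and so on to insert components of the match.
For example, swap the second and third words of each sentence.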

sentences %>% str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(5)

[1] "The canoe birch slid on the smooth planks." 
[2] "Glue sheet the to the dark blue background."
[3] "It's to easy tell the depth of a well."     
[4] "These a days chicken leg is a rare dish."   
[5] "Rice often is served in round bowls."
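A smaller case of the same technique is reordering "surname, given name" pairs. The vector authors is example data invented for this sketch.

```r
library(stringr)

authors <- c("Wickham, Hadley", "Doe, Jane")

# Group 1 is the surname, group 2 the given name; the replacement
# writes them back in the opposite order.
reordered <- str_replace(authors, "([^ ,]+), ([^ ]+)", "\\2 \\1")
reordered
```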

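Exercises

Write a function that replaces every / in a string with \.
Use str_replace_all() to write a function that does the same job as str_to_lower().
Swap the first and last letters of every word in stringr::words.

Splitting

Use str_split() to split a string into pieces.
For example, split sentences into words. Because each string may yield a different number of pieces, the result is a list.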

sentences %>% head(3) %>% str_split(" ")

[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
[8] "planks."

[[2]]
[1] "Glue"        "the"         "sheet"       "to"          "the"        
[6] "dark"        "blue"        "background."

[[3]]
[1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."

Alternatively, use simplify = TRUE to return a matrix.

sentences %>% head(5) %>% str_split(" ", simplify = TRUE)

     [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]    
[1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth"
[2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"  
[3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"    
[4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"     
[5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls."
     [,8]          [,9]   
[1,] "planks."     ""     
[2,] "background." ""     
[3,] "a"           "well."
[4,] "rare"        "dish."
[5,] ""            ""

The n argument sets the maximum number of pieces.

fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)

     [,1]      [,2]    
[1,] "Name"    "Hadley"
[2,] "Country" "NZ"    
[3,] "Age"     "35"

Instead of splitting on a regular expression, you can split with boundary(), whose type can be "character", "line_break", "sentence" or "word".

x <- "This is a sentence.  This is another sentence."
str_view_all(x, boundary("word"))

str_split(x, " ")[[1]]

[1] "This"      "is"        "a"         "sentence." ""          "This"     
[7] "is"        "another"   "sentence."

str_split(x, boundary("word"))[[1]]

[1] "This"     "is"       "a"        "sentence" "This"     "is"      
[7] "another"  "sentence"

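[[1]] extracts the first element of the list.

Exercises

Split the string "apples, pears, and bananas" into individual words.
Why is it better to split on boundary("word") than on " "?
What happens if you split on the empty string ""? Try it, then read the documentation.

Locating matches

str_locate() and str_locate_all() give the start and end positions of each match. They are particularly useful when no other function does exactly what you want: for example, locate the match with str_locate(), then extract or modify it with str_sub().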
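A minimal sketch of that locate-then-modify workflow, using a made-up sentence x:

```r
library(stringr)

x <- "The quick brown fox"

# str_locate() returns a matrix with columns "start" and "end".
loc <- str_locate(x, "quick")
loc

# Extract the located span ...
found <- str_sub(x, loc[, "start"], loc[, "end"])
found

# ... or overwrite it with the assignment form of str_sub().
str_sub(x, loc[, "start"], loc[, "end"]) <- "slow"
x
```

For a plain substitution like this one, str_replace() is shorter; the positions become useful when the replacement depends on where the match sits.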

stringi

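stringr is built on top of the stringi package. At the time of writing, stringr exposes a carefully chosen set of 42 of the most commonly used string functions, whereas stringi is far more comprehensive with 234. If you struggle to do something with stringr, it is worth looking at stringi. The two packages work very similarly; the main difference is the prefix, str_ versus stri_.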
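As a brief sketch of how the naming carries over, the calls below use stringi directly, assuming the package is installed. stri_length() and stri_detect_regex() correspond to str_length() and str_detect(); stri_reverse() is shown as one example of a stringi function that this lesson does not cover through stringr.

```r
library(stringi)

x <- c("apple", "pear")

lens     <- stri_length(x)             # like str_length()
starts_a <- stri_detect_regex(x, "^a") # like str_detect()
reversed <- stri_reverse(x)            # reverses each string

lens
starts_a
reversed
```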