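In summary:
* gsub("\\.","",names): the escape needs two backslashes, because in an R string literal the backslash itself must be escaped.
* grep("x", names, value = TRUE) returns the matching elements directly, so names[which(...)] is no longer needed.
* Reading a txt file: con = file("filename.txt","r"); data <- con %>% readLines; remember to close(con).
* We can use regular expressions in the `find` menu. This is great!
* Careful: "^" also means "start of string", not only negation!
* Absolute anchors are very useful: <<\A>> matches only at the start of the whole string, and <<\Z>> only at its end.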
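The first two points can be checked on a small vector. This is a minimal sketch: `col_names` is a made-up stand-in for the column names of a data frame, not the gdp data used below.

```r
# Hypothetical column names, similar in shape to what read.csv() produces
col_names <- c("gross.domestic.product.2012", "x.1", "x")

# "\\." in R source is the two-character regex \. (a literal dot)
no_dots <- gsub("\\.", "", col_names)
# "grossdomesticproduct2012" "x1" "x"

# An unescaped "." matches any character, so every character is removed
all_gone <- gsub(".", "", col_names)
# "" "" ""

# fixed = TRUE treats the pattern as a plain string, so no escaping is needed
fixed_dots <- gsub(".", "", col_names, fixed = TRUE)

# value = TRUE returns the matching elements; the default returns their indices
grep("^x", col_names, value = TRUE)  # "x.1" "x"
grep("^x", col_names)                # 2 3
```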
We run some experiments with the gdp dataset:
library(magrittr)
library(stringr)
gdp = read.csv("/media/ghy/36D2072ED206F243/coursera/Data_science_specialization/C3_clean_data/regular_expression/gdp.csv")
names = tolower(names(gdp))
gsub("[.]", "", names)
## [1] "x" "grossdomesticproduct2012"
## [3] "x1" "x2"
## [5] "x3" "x4"
## [7] "x5" "x6"
## [9] "x7" "x8"
gsub("\\.","",names)
## [1] "x" "grossdomesticproduct2012"
## [3] "x1" "x2"
## [5] "x3" "x4"
## [7] "x5" "x6"
## [9] "x7" "x8"
# Applies a function to each element in a vector or list
firstElement = function(x){x[1]}
strsplit(names, "[.]") %>% sapply(firstElement)
## [1] "x" "gross" "x" "x" "x" "x" "x" "x"
## [9] "x" "x"
“[.]” and “\\.” have the same effect in this case! Several functions are used frequently: grep, grepl, paste, paste0, str_trim, nchar, substr.
* str_trim removes leading and trailing whitespace.
library(stringr)
grep("x", names, value = TRUE)
grep("x", names)
nchar("what life")
substr("what life", 1, 4)
str_trim("what ")
We run some experiments with the en_US.news.txt dataset:
con <- file("/media/ghy/36D2072ED206F243/coursera/Data_science_specialization/C3_clean_data/regular_expression/en_US.news.txt","r")
newsdata <- readLines(con)
close(con)
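The open, read, close pattern above can be tried without the news file. This is a sketch under the assumption that a small temporary file is an acceptable stand-in; `tmp` and `lines_read` are names introduced only for this illustration.

```r
# Write a small temporary file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line 1234567890"), tmp)

# Open the connection explicitly and make sure it is closed even on error
con <- file(tmp, "r")
lines_read <- tryCatch(readLines(con), finally = close(con))

# readLines() also accepts a file name and then opens and closes the file itself
same_lines <- readLines(tmp)

unlink(tmp)  # remove the temporary file
```

The `tryCatch(..., finally = close(con))` form is one way to avoid leaving a connection open if reading fails; passing the path straight to readLines is shorter when no other connection options are needed.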
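Summary:
* A character class is a set of characters enclosed in square brackets "[]". With a character class, you tell the regex engine to match exactly one character out of several.
* Greediness of quantifiers
Special characters and shorthand classes:
<<\t>> is a tab (0x09)
<<\r>> is a carriage return (0x0D)
<<\n>> is a line feed (0x0A)
<<\s>> is a whitespace character
<<\w>> is a word character
<<\d>> is <<[0-9]>>
Shorthand for negated character classes:
<<[\S]>> = <<[^\s]>>
<<[\W]>> = <<[^\w]>>
<<[\D]>> = <<[^\d]>>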
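A short sketch of these shorthand classes in base R follows. The vector `x` is invented for the illustration, and each backslash is doubled because the patterns are written as R string literals.

```r
x <- c("tab\there", "no_space", "a8", "2012")

# \s: does the element contain any whitespace (here, a tab)?
has_space <- grepl("\\s", x)         # TRUE FALSE FALSE FALSE

# \d: is the element made of digits only?
all_digits <- grepl("^\\d+$", x)     # FALSE FALSE FALSE TRUE

# \D is the negation of \d: delete everything that is not a digit
digits_only <- gsub("\\D", "", x)    # "" "" "8" "2012"

# \W is the negation of \w: delete everything that is not a word character
word_only <- gsub("\\W", "", "a-b c!")  # "abc"

# With perl = TRUE, \S and [^\s] remove exactly the same characters
same <- identical(gsub("\\S", "", x, perl = TRUE),
                  gsub("[^\\s]", "", x, perl = TRUE))
```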
grep("[0-9]{10}", newsdata, value = T)
# Lines containing a run of at least 10 consecutive digits
grep("( [a-z]+ the day )|I think you are", newsdata, value = T)
corpus = grep("^[Ii] think", newsdata, value = T)
char = "I think"
paste0("^",char)
grep(paste0("^",char), newsdata, value = T)
gsub("^I think","",corpus)
corpus_words = strsplit(corpus, " ")
for (i in seq_along(corpus_words)){
for (j in seq_along(corpus_words[[i]])){
# corpus_words[[i]][j] is the j-th word of the i-th line
}
}
grep("regex|regex not", "regex not", value = T)
## [1] "regex not"
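## grep(value = TRUE) returns the whole matching element, not the matched substring, so this output alone does not show whether the engine is eager (regex-directed) or text-directed.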
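To see which alternative actually matched, the matched substring has to be extracted rather than the whole element. The sketch below assumes the Perl-compatible engine (perl = TRUE); the default engine may resolve alternation differently, so no claim is made about it here.

```r
txt <- "regex not"

# regexpr() locates the first match; regmatches() extracts the matched text
m <- regexpr("regex|regex not", txt, perl = TRUE)
matched <- regmatches(txt, m)
# "regex": the first alternative that succeeds wins, even though
# the second alternative would have matched a longer piece of text

# Putting the longer alternative first changes the result
m2 <- regexpr("regex not|regex", txt, perl = TRUE)
matched_long <- regmatches(txt, m2)
# "regex not"
```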
## The "+" quantifier is greedy: it repeats the preceding token as many times as possible.
stringr::str_extract(string = "This is a <EM>first</EM> test", "<.+>" )
## [1] "<EM>first</EM>"
## A possible fix is to make "+" lazy instead of greedy by putting a question mark "?" right after it. "*" and "{}" can be made lazy the same way.
stringr::str_extract(string = "This is a <EM>first</EM> test", "<.+?>" )
## [1] "<EM>"
## Another fix: <[^>]+>
stringr::str_extract(string = "This is a <EM>first</EM> test", "<[^>]+>" )
## [1] "<EM>"
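The same three patterns can be compared side by side in base R. This is a sketch that uses regmatches with gregexpr to collect every match, and perl = TRUE is assumed for the lazy quantifier.

```r
html <- "This is a <EM>first</EM> test"

# Greedy: ".+" runs to the last ">" in the string
greedy <- regmatches(html, regexpr("<.+>", html, perl = TRUE))
# "<EM>first</EM>"

# Lazy: ".+?" stops at the first ">" it can reach
lazy <- regmatches(html, gregexpr("<.+?>", html, perl = TRUE))[[1]]
# "<EM>" "</EM>"

# Negated class: "[^>]+" cannot cross a ">", so no backtracking is needed
negated <- regmatches(html, gregexpr("<[^>]+>", html))[[1]]
# "<EM>" "</EM>"
```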
Simple matching
# stringr::str_extract(string = "This is a <EM>first</EM> test", "<.+>" )
# stringr::str_extract(string = "abcd999", pattern = "[^0-9]+")
#
# stringr::str_extract_all(string = "abcd999", pattern = "[^0-9]")
c("[^0-9]{1,10}","[^0-9]+", "\\t", "\\d", "\\w", "\\w\\d$") %>%
sapply(function(pattern){stringr::str_extract(string = "@abdc,99fd a8", pattern)})
## [^0-9]{1,10} [^0-9]+ \\t \\d \\w
## "@abdc," "@abdc," NA "9" "a"
## \\w\\d$
## "a8"
Anchoring the start and end of a string
c("\\w\\d$", "^\\d?$", "(^\\d)?$", "^(\\d?)$", "(^\\d?)$") %>%
sapply(function(pattern){stringr::str_extract(string = "@abdc,99fd a8", pattern)})
## \\w\\d$ ^\\d?$ (^\\d)?$ ^(\\d?)$ (^\\d?)$
## "a8" NA "" NA NA
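At first this looks like a question of operator precedence. It is not: the mistake was reading "^" as negation, when outside a character class it is the start-of-string anchor!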
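The two meanings of "^" can be told apart by position: outside square brackets it anchors the match to the start, and as the first character inside brackets it negates the class. A minimal sketch, with `x` invented for the illustration:

```r
x <- c("8 ball", "ball 8", "ball")

# Anchor: the string starts with a digit
starts_with_digit <- grepl("^[0-9]", x)     # TRUE FALSE FALSE

# Negation: the string contains at least one non-digit character
has_non_digit <- grepl("[^0-9]", x)         # TRUE TRUE TRUE

# Both at once: the string starts with a non-digit character
starts_with_non_digit <- grepl("^[^0-9]", x)  # FALSE TRUE TRUE
```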
c("\\w\\d$", "^\\d+.\\d+\\w$", "^(\\d+$)", "(^\\d)$", "^\\s*", "\\s*$") %>%
sapply(function(pattern){stringr::str_extract(string = "1855 513745a", pattern)})
## \\w\\d$ ^\\d+.\\d+\\w$ ^(\\d+$) (^\\d)$ ^\\s*
## NA "1855 513745a" NA NA ""
## \\s*$
## ""
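Absolute anchors: <<\A>> matches only at the start of the whole string, and <<\Z>> only at the end of the whole string (or just before a final line break).
Word boundaries: roughly speaking, <<\b>> matches at the positions where an "alphanumeric sequence" starts and ends.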
c("\\A\\D", "\\w\\Z", "\\b.+\\b") %>%
sapply(function(pattern){stringr::str_extract(string = "This is a test", pattern)})
## \\A\\D \\w\\Z \\b.+\\b
## "T" "t" "This is a test"
An example: "<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>". In an R string literal the backreference is written "\\1".
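A minimal sketch of this backreference in base R, assuming perl = TRUE for the lazy quantifier. The test string and the variable names are made up for the illustration.

```r
# One well-formed pair (<EM>...</EM>) and one mismatched pair (<B>...</I>)
html <- "This is a <EM>first</EM> test, <B>bold</I> is mismatched"

# \\1 refers back to whatever the first group captured (the tag name)
pattern <- "<([A-Z][A-Z0-9]*)[^>]*>.*?</\\1>"
paired <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]
# "<EM>first</EM>": <B>...</I> is skipped because the closing tag differs

# Backreferences also work in the replacement: keep only the tag content
stripped <- sub("<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\\1>", "\\2", html, perl = TRUE)
# "This is a first test, <B>bold</I> is mismatched"
```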
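(?:pattern) matches pattern but does not capture the result: it is a non-capturing group, so nothing is stored for later use. This is useful when the alternation character "(|)" is used to combine parts of a pattern. For example, "industr(?:y|ies)" is a more concise expression than "industry|industries".
(?=pattern) is a lookahead: "Windows(?=95|98|NT|2000)" matches the "Windows" in "Windows2000" but not the "Windows" in "Windows3.1". The NA for the last pattern below does not mean lookahead fails in R: the test string has a line break between "This" and "is", so the literal "This is" never matches. Lookahead is supported by stringr and by base R with perl = TRUE.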
c("i(?:s| )", ".\\Z", "\\b.+\\b", "This is(?=.)") %>%
sapply(function(pattern){stringr::str_extract(string = "<B>This
is a test</B>", pattern)})
## i(?:s| ) .\\Z \\b.+\\b This is(?=.)
## "is" ">" "B>This" NA
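The Windows lookahead example can be run in base R as follows. This is a sketch assuming perl = TRUE, since the default base R engine does not support lookahead; `x` and `two_lines` are introduced only for this illustration.

```r
x <- c("Windows2000", "Windows3.1")

# The lookahead checks what follows "Windows" without consuming it
m <- regexpr("Windows(?=95|98|NT|2000)", x, perl = TRUE)
match_start <- as.integer(m)   # 1 -1: only the first element matches
matched <- regmatches(x, m)    # "Windows" (the digits are not part of the match)

# The test string above has a line break between "This" and "is";
# \s matches that line break, so the lookahead version now succeeds
two_lines <- "<B>This\nis a test</B>"
m2 <- regexpr("This\\sis(?=.)", two_lines, perl = TRUE)
matched_two_lines <- regmatches(two_lines, m2)  # "This\nis"
```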
Regular expression matching modes