Ch 1: Dealing with two datasets

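Summary:

* gsub("\\.","",names): the escape needs two backslashes, because the backslash itself has to be escaped inside an R string.
* grep("x", names, value = TRUE) returns the matching elements directly, so names[which(...)] is no longer needed.
* How to read a txt file: con = file("filename.txt","r"); data <- con %>% readLines; remember to close(con).
* We can use regular expressions in the menu `find`. This is great!
* Careful: "^" also means "start of string", not only negation!
* Absolute anchors are very useful: <<\A>> matches only at the start of the whole string, and <<\Z>> matches only at the end of the whole string.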

1-1 Editing text variables

We run some experiments with the dataset gdp.

library(magrittr)
library(stringr)
gdp = read.csv("/media/ghy/36D2072ED206F243/coursera/Data_science_specialization/C3_clean_data/regular_expression/gdp.csv")
names = tolower(names(gdp))
gsub("[.]", "", names)
##  [1] "x"                        "grossdomesticproduct2012"
##  [3] "x1"                       "x2"                      
##  [5] "x3"                       "x4"                      
##  [7] "x5"                       "x6"                      
##  [9] "x7"                       "x8"
gsub("\\.","",names)
##  [1] "x"                        "grossdomesticproduct2012"
##  [3] "x1"                       "x2"                      
##  [5] "x3"                       "x4"                      
##  [7] "x5"                       "x6"                      
##  [9] "x7"                       "x8"
# Applies a function to each element in a vector or list
firstElement = function(x){x[1]}
strsplit(names, "[.]") %>% sapply(firstElement)
##  [1] "x"     "gross" "x"     "x"     "x"     "x"     "x"     "x"    
##  [9] "x"     "x"

“[.]” and “\\.” have the same effect here: both match a literal dot. Several functions come up frequently: grep, grepl, paste, paste0, str_trim, nchar, substr.
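The equivalence of the two escapes can be checked on a small made-up vector. This is only a sketch: the column names below are hypothetical and not taken from gdp.csv. `fixed = TRUE` is a third option that switches regex interpretation off altogether.

```r
# Hypothetical column names, for illustration only
cols <- c("gross.domestic.product", "x.1", "x")

a <- gsub("[.]", "", cols)                     # a dot inside a character set is literal
b <- gsub("\\.", "", cols)                     # "\\." in R source is the regex \.
c_fixed <- gsub(".", "", cols, fixed = TRUE)   # pattern treated as a plain string

# All three approaches agree
identical(a, b) && identical(b, c_fixed)

# An unescaped dot matches any character, so every character is removed
unescaped <- gsub(".", "", cols)
unescaped
```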

* str_trim removes leading and trailing whitespace.
library(stringr)
grep("x", names, value = TRUE)
grep("x", names)
nchar("what life")
substr("what life", 1, 4)
str_trim("what    ")
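The relationship between grep, grepl and value = TRUE can be sketched with base R only. The vector `cols` is a hypothetical stand-in for the gdp column names, and trimws is used here as the base R counterpart of str_trim.

```r
cols <- c("x", "grossdomesticproduct2012", "x1")   # hypothetical names

idx  <- grep("x", cols)                 # integer positions of the matches
hit  <- grepl("x", cols)                # logical vector, one entry per element
vals <- grep("x", cols, value = TRUE)   # the matching elements themselves

# value = TRUE gives the same result as indexing with either of the others
identical(vals, cols[idx])
identical(vals, cols[hit])

n_chars <- nchar("what life")           # counts the space as a character
head4   <- substr("what life", 1, 4)    # characters 1 through 4
trimmed <- trimws("what    ")           # strips leading/trailing whitespace
labels  <- paste0("x", 1:3)             # concatenation without a separator
```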

1-2 Regular expressions

We run experiments with the dataset en_US.news.txt.

  • “()” groups subexpressions
  • “[]” forms a character set
  • “{}” defines the number of repetitions
con <- file("/media/ghy/36D2072ED206F243/coursera/Data_science_specialization/C3_clean_data/regular_expression/en_US.news.txt","r")
newsdata <- readLines(con)
close(con)
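The three bracket types listed above can be tried out on a few invented lines before running them against the news file. The vector `x` is hypothetical sample data, not content from en_US.news.txt.

```r
# Hypothetical sample lines standing in for the news file
x <- c("gray cat", "grey cat", "groy cat", "haha", "ha", "call 5551234567 now")

set_hit    <- grepl("gr[ae]y", x)     # []: exactly one character out of the set
group_hit  <- grepl("^(ha){2}$", x)   # (): group "ha"; {2}: repeat the group twice
repeat_hit <- grepl("[0-9]{10}", x)   # {10}: ten consecutive digits

x[set_hit]
x[group_hit]
x[repeat_hit]
```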

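Metacharacter: a character with a special meaning in a pattern, i.e. an operator character.

Summary:

* A character set is a set of characters enclosed in square brackets “[]”. With a character set you tell the regex engine to match exactly one out of several characters.
* The greediness problem

Special character sets:

<<\t>> matches a tab (0x09)
<<\r>> matches a carriage return (0x0D)
<<\n>> matches a line feed (0x0A)
<<\s>> matches a whitespace character
<<\w>> matches a word character
<<\d>> is equivalent to <<[0-9]>>
Shorthand for negated character sets:
<<[\S]>> = <<[^\s]>>
<<[\W]>> = <<[^\w]>>
<<[\D]>> = <<[^\d]>>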
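A quick way to see what each shorthand covers is to delete or replace the characters it matches. The sketch below uses a hypothetical test string and perl = TRUE, on the assumption that the PCRE engine is wanted so that shorthands are also understood inside square brackets.

```r
s <- "a1 b_2\tc"   # hypothetical string containing a space and a tab

digits_marked <- gsub("\\d", "#", s, perl = TRUE)   # every digit becomes "#"
no_space      <- gsub("\\s", "", s, perl = TRUE)    # space and tab removed
only_space    <- gsub("\\S", "", s, perl = TRUE)    # everything except whitespace removed
no_nonword    <- gsub("\\W", "", s, perl = TRUE)    # "_" is a word character and stays

# The negated shorthand and the explicit negated set behave the same
same_S <- identical(gsub("\\S", "", s, perl = TRUE),
                    gsub("[^\\s]", "", s, perl = TRUE))
same_D <- identical(gsub("\\D", "", s, perl = TRUE),
                    gsub("[^\\d]", "", s, perl = TRUE))
```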
grep("[0-9]{10}", newsdata, value = T)
# At least 10 digits
grep("( [a-z]+ the day )|I think you are", newsdata, value = T)
corpus = grep("^[Ii] think", newsdata, value = T)
char = "I think"
paste0("^",char)
grep(paste0("^",char), newsdata, value = T)
gsub("^I think","",corpus)
corpus_words = strsplit(corpus, " ")
for (i in seq_along(corpus_words)){
    for (j in seq_along(corpus_words[[i]])){
        # corpus_words[[i]][j] is the j-th word of the i-th line
    }
}
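The nested loop above visits every word of every line. Per-word work of this kind can often be written without explicit indices. The following is a minimal sketch on two invented lines (not taken from the news file), showing one possible approach rather than the intended one.

```r
corpus <- c("I think this works", "i think so")   # hypothetical lines

char <- "I think"
anchored <- paste0("^", char)                     # builds the pattern "^I think"

rest  <- sub("^[Ii] think ?", "", corpus)         # drop the leading phrase
words <- strsplit(rest, " ")                      # list with one character vector per line

n_words    <- lengths(words)                      # number of words per line
first_word <- vapply(words, function(w) w[1], character(1))
```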

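Ch 2: Learning from Chinese articles

2-1 Basic introduction and simple matching

  1. There are different kinds of regular expression engines, such as those of Perl 5, .NET and the JDK. They fall into two main classes, DFA and NFA; NFA engines are further divided into traditional NFA and POSIX NFA.
  2. Regular expressions were introduced by Stephen Kleene in his work “Representation of Events in Nerve Nets and Finite Automata”.
  3. Perl's regular expressions derive from a regex library written by Henry Spencer.
  4. We can use regular expressions in the menu find. This is great!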
grep("regex|regex not", "regex not", value = TRUE)
## [1] "regex not"
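## This suggests that R is not eager: it is text-directed rather than regex-directed.
## (Note that with value = TRUE grep returns the whole matching element, not only the matched part.)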
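Since grep with value = TRUE returns the whole element, it cannot show which alternative actually matched. regmatches together with regexpr extracts only the matched part, which makes the comparison possible. The sketch below assumes base R; the default engine's result is left without a stated output so that it can be compared by running it.

```r
text    <- "regex not"
pattern <- "regex|regex not"

# PCRE (perl = TRUE) is regex-directed and eager: at the leftmost position
# the first alternative that succeeds is taken.
eager <- regmatches(text, regexpr(pattern, text, perl = TRUE))
eager

# R's default engine; compare this result with the one above.
default_engine <- regmatches(text, regexpr(pattern, text))
default_engine
```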
  • The greediness problem
## This shows the greedy nature of "+": it repeats the preceding token as many times as possible.
stringr::str_extract(string = "This is a <EM>first</EM> test", "<.+>" )
## [1] "<EM>first</EM>"
## One possible fix is to make "+" lazy instead of greedy by putting a question mark "?" right after it. "*" and "{}" can be made lazy in the same way.
stringr::str_extract(string = "This is a <EM>first</EM> test", "<.+?>" )
## [1] "<EM>"
## Another fix: <[^>]+>
stringr::str_extract(string = "This is a <EM>first</EM> test", "<[^>]+>" )
## [1] "<EM>"
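The same three patterns can be compared side by side in base R. This is a sketch assuming perl = TRUE for the lazy quantifier; `extract` is a hypothetical helper defined only for this example.

```r
html <- "This is a <EM>first</EM> test"

# Hypothetical helper: return the first match of a pattern in html
extract <- function(pattern) {
  regmatches(html, regexpr(pattern, html, perl = TRUE))
}

greedy  <- extract("<.+>")      # "+" grabs as much as it can
lazy    <- extract("<.+?>")     # "+?" stops at the first possible ">"
negated <- extract("<[^>]+>")   # the set cannot run past a ">"

# gregexpr finds every tag instead of only the first one
all_tags <- regmatches(html, gregexpr("<[^>]+>", html, perl = TRUE))[[1]]
```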

Simple matching

# stringr::str_extract(string = "This is a <EM>first</EM> test", "<.+>" )
# stringr::str_extract(string = "abcd999", pattern = "[^0-9]+")
# 
# stringr::str_extract_all(string = "abcd999", pattern = "[^0-9]")

c("[^0-9]{1,10}","[^0-9]+", "\\t", "\\d", "\\w", "\\w\\d$") %>%
  sapply(function(pattern){stringr::str_extract(string = "@abdc,99fd a8", pattern)})
## [^0-9]{1,10}      [^0-9]+          \\t          \\d          \\w 
##     "@abdc,"     "@abdc,"           NA          "9"          "a" 
##      \\w\\d$ 
##         "a8"

2-2 Complex matching problems

Anchoring the start and end of a string

c("\\w\\d$", "^\\d?$", "(^\\d)?$", "^(\\d?)$", "(^\\d?)$") %>%
  sapply(function(pattern){stringr::str_extract(string = "@abdc,99fd a8", pattern)})
##  \\w\\d$   ^\\d?$ (^\\d)?$ ^(\\d?)$ (^\\d?)$ 
##     "a8"       NA       ""       NA       NA

At first this looked like a question of operator order. It is not: the mistake was reading “^” as negation, when outside a character set it is the start-of-string anchor!
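The two meanings of “^” can be separated with a small experiment on the same test string. This is a sketch in base R with perl = TRUE; `m` is a hypothetical helper that returns the first match, or an empty vector when there is none.

```r
s <- "@abdc,99fd a8"

# Hypothetical helper: first match of p in s, character(0) if none
m <- function(p) regmatches(s, regexpr(p, s, perl = TRUE))

anchor_only <- m("^\\d")       # anchor: a digit at the very start (there is none)
negation    <- m("[^\\d]")     # negation: the first character that is not a digit
both        <- m("^[^\\d]+")   # anchor followed by a run of non-digits
```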

c("\\w\\d$", "^\\d+.\\d+\\w$", "^(\\d+$)", "(^\\d)$", "^\\s*", "\\s*$") %>%
  sapply(function(pattern){stringr::str_extract(string = "1855 513745a", pattern)})
##        \\w\\d$ ^\\d+.\\d+\\w$       ^(\\d+$)        (^\\d)$          ^\\s* 
##             NA "1855 513745a"             NA             NA             "" 
##          \\s*$ 
##             ""

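Absolute anchors: <<\A>> matches only at the start of the whole string, and <<\Z>> matches only at the end of the whole string.

What about word boundaries? Roughly speaking, <<\b>> matches at the start and end positions of an “alphanumeric sequence”.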

c("\\A\\D", "\\w\\Z", "\\b.+\\b") %>%
  sapply(function(pattern){stringr::str_extract(string = "This is a test", pattern)})
##           \\A\\D           \\w\\Z         \\b.+\\b 
##              "T"              "t" "This is a test"
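On a single-line string “\A” and “^” behave alike. The difference shows up on a string with a line break once multi-line mode is switched on. The sketch below assumes PCRE via perl = TRUE and the inline flag (?m); the two-line string is invented for the example.

```r
s <- "first line\nsecond line"   # hypothetical two-line string

# Hypothetical helper: all matches of p in s
all_matches <- function(p) regmatches(s, gregexpr(p, s, perl = TRUE))[[1]]

caret_multi <- all_matches("(?m)^\\w+")      # "^" also matches after a line break
abs_start   <- all_matches("(?m)\\A\\w+")    # "\A" only at the start of the string
abs_end     <- all_matches("(?m)\\w+\\Z")    # "\Z" only at the end of the string
word_starts <- all_matches("\\b\\w")         # first letter of every word
```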
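  • An example: “<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>”. I did not know how to write the backreference in R; inside an R string it is written “\\1”.

  • (?:pattern) matches pattern without capturing the result; that is, it is a non-capturing group and nothing is stored for later use. This is useful when the alternation character “(|)” is used to combine parts of a pattern. For example, “industr(?:y|ies)” is a more concise expression than “industry|industries”.

  • (?=pattern) is a positive lookahead: “Windows(?=95|98|NT|2000)” matches the “Windows” in “Windows2000” but not the “Windows” in “Windows3.1”. It seemed to have no effect in R, but the NA in the output below has another cause: the test string has a line break, not a space, between “This” and “is”. stringr supports lookahead, and base R does with perl = TRUE.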
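Backreferences and non-capturing groups can be sketched in base R as follows. The HTML snippet and the word list are hypothetical, and the tag pattern follows the shape of the example above with the backreference written as "\\1" in R source.

```r
html <- "<B>bold</B> and <I>broken</B>"   # hypothetical input

# \\1 must repeat whatever group 1 captured, so <I>...</B> cannot match
paired <- regmatches(
  html,
  regexpr("<([A-Z][A-Z0-9]*)[^>]*>.*?</\\1>", html, perl = TRUE)
)

# Backreferences also work in the replacement string of sub/gsub
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")

# (?: ) groups the alternatives without creating a numbered group
industry_hit <- grepl("industr(?:y|ies)",
                      c("industry", "industries", "industrial"),
                      perl = TRUE)
```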

c("i(?:s| )", ".\\Z", "\\b.+\\b", "This is(?=.)") %>%
  sapply(function(pattern){stringr::str_extract(string = "<B>This
is a test</B>", pattern)})
##     i(?:s| )         .\\Z     \\b.+\\b This is(?=.) 
##         "is"          ">"     "B>This"           NA
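Lookahead can be checked in base R with perl = TRUE. The sketch below also probes the two-line test string: there is a line break rather than a space between “This” and “is”, and “.” does not match a line break unless the (?s) flag is set, which would explain the NA above.

```r
v <- c("Windows2000", "Windows3.1")
look_hit   <- grepl("Windows(?=95|98|NT|2000)", v, perl = TRUE)
look_match <- regmatches(v, regexpr("Windows(?=95|98|NT|2000)", v, perl = TRUE))

s <- "<B>This\nis a test</B>"

space_version <- grepl("This is(?=.)", s, perl = TRUE)     # literal space: no match
ws_version    <- grepl("This\\sis(?=.)", s, perl = TRUE)   # \s also covers the line break

dot_default <- grepl("This(?=.)", s, perl = TRUE)          # "." stops at the line break
dot_all     <- grepl("(?s)This(?=.)", s, perl = TRUE)      # (?s) lets "." match it
```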

Regular expression matching modes

Related materials and references

  1. Wikipedia
  2. website http://www.regular-expressions.info/

Problems

  1. “[^a-z]” matches any single character that is not in a-z, but how can one match strings that are not a given word? This is the problem of finding what comes next after a phrase in a sentence.
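One possible approach to this problem, offered only as a sketch, uses lookaround assertions with perl = TRUE: a lookbehind to pick up the word after a phrase, and a negative lookahead to select strings that do not contain a word. The sentences are invented for the example.

```r
s <- "I think regular expressions are useful"   # hypothetical sentence

# Lookbehind: the word that directly follows "I think "
next_word <- regmatches(s, regexpr("(?<=I think )\\w+", s, perl = TRUE))

# The same result with a capture group and a backreference
next_word2 <- sub("^.*?I think (\\w+).*$", "\\1", s, perl = TRUE)

# Negative lookahead: TRUE for strings that do NOT contain "think"
lines <- c("I think so", "no idea")
without_think <- grepl("^(?!.*think)", lines, perl = TRUE)
```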