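Extracting information from web pages with R in a standardized and reliable way
Parsing HTML:
Functions that merely read an HTML file have no understanding of the formal grammar underlying HTML; they only recognize the sequence of symbols the file contains.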
url <- "http://www.r-datacollection.com/materials/ch-2-html/fortunes.html"
fortunes <- readLines(con = url)
fortunes
## [1] "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML//EN\">"
## [2] "<html> <head>"
## [3] "<title>Collected R wisdoms</title>"
## [4] "</head>"
## [5] ""
## [6] "<body>"
## [7] "<div id=\"R Inventor\" lang=\"english\" date=\"June/2003\">"
## [8] " <h1>Robert Gentleman</h1>"
## [9] " <p><i>'What we have is nice, but we need something very different'</i></p>"
## [10] " <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>"
## [11] "</div>"
## [12] ""
## [13] "<div lang=\"english\" date=\"October/2011\">"
## [14] " <h1>Rolf Turner</h1>"
## [15] " <p><i>'R is wonderful, but it cannot work magic'</i> <br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>"
## [16] " <p><b>Source: </b><a href=\"https://stat.ethz.ch/mailman/listinfo/r-help\">R-help</a></p>"
## [17] "</div>"
## [18] ""
## [19] "<address><a href=\"http://www.rdatacollectionbook.com\"><i>The book homepage</i><a/></address>"
## [20] ""
## [21] "</body> </html>"
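- To obtain a useful representation of an HTML file, we need a program that understands the special meaning of the markup and rebuilds the file's implied hierarchy inside an R-specific data structure. This representation is called the Document Object Model (DOM).
- Converting HTML into a DOM is the job of a DOM parser. Parsers belong to a general class of domain-specific programs that traverse a sequence of symbols and rebuild the document's semantic structure in a data object of the programming environment.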
library(XML)
url <- "http://www.r-datacollection.com/materials/ch-2-html/fortunes.html"
parsed_fortunes <- htmlParse(file = url)
print(parsed_fortunes)
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
## <html>
## <head><title>Collected R wisdoms</title></head>
## <body>
## <div id="R Inventor" lang="english" date="June/2003">
## <h1>Robert Gentleman</h1>
## <p><i>'What we have is nice, but we need something very different'</i></p>
## <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
##
## <div lang="english" date="October/2011">
## <h1>Rolf Turner</h1>
## <p><i>'R is wonderful, but it cannot work magic'</i> <br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
## <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
##
## <address>
## <a href="http://www.rdatacollectionbook.com"><i>The book homepage</i></a><a></a>
## </address>
##
## </body>
## </html>
##
class(parsed_fortunes)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument"
## [4] "XMLAbstractDocument"
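- During parsing, discarding the parts of a web page that are not needed helps avoid running out of memory and speeds up extraction.
- Handler functions control how the C-level node structure is converted into R objects, and they offer a convenient way to manipulate nodes (for example deleting, adding, or modifying them).
- If the handlers are left at their default, all nodes are mapped to R list structures.
- Handlers are specified as a list of named functions: each name corresponds to a node name, and the function defines what happens to that node. Whenever the parser reaches a node with a matching name, the corresponding function is executed.
- The XML package also lets you pass handler functions that act on specific kinds of XML content, such as processing instructions, XML comments, CDATA, or the general node set.
Generic handlers for DOM-style parsing:
| Function | Node type |
|---|---|
| startElement() | XML element |
| text() | Text node |
| comment() | Comment node |
| cdata() | CDATA section |
| processingInstruction() | Processing instruction |
| namespace() | XML namespace |
| entity() | Entity reference |
Delete the <body> node from the sample HTML file: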
h1 <- list("body" = function(x) {NULL})
parsed_fortunes <- htmlTreeParse(url, handlers = h1, asTree = TRUE)
parsed_fortunes
## $file
## [1] "http://www.r-datacollection.com/materials/ch-2-html/fortunes.html"
##
## $version
## [1] ""
##
## $children
## $children$html
## <html>
## <head>
## <title>Collected R wisdoms</title>
## </head>
## </html>
##
##
## attr(,"class")
## [1] "XMLDocumentContent"
parsed_fortunes$children
## $html
## <html>
## <head>
## <title>Collected R wisdoms</title>
## </head>
## </html>
Delete comments and all nodes named div or title from the document:
h2 <- list(
startElement = function(node, ...) {
name = xmlName(node)
if(name %in% c("div", "title")) {NULL}else {node}
},
comment = function(node) {NULL}
)
parsed_fortunes <- htmlTreeParse(file = url, handlers = h2, asTree = TRUE)
parsed_fortunes
## $file
## [1] "http://www.r-datacollection.com/materials/ch-2-html/fortunes.html"
##
## $version
## [1] ""
##
## $children
## $children$html
## <html>
## <head/>
## <body>
## <address>
## <a href="http://www.rdatacollectionbook.com">
## <i>The book homepage</i>
## </a>
## <a/>
## </address>
## </body>
## </html>
##
##
## attr(,"class")
## [1] "XMLDocumentContent"
parsed_fortunes$children
## $html
## <html>
## <head/>
## <body>
## <address>
## <a href="http://www.rdatacollectionbook.com">
## <i>The book homepage</i>
## </a>
## <a/>
## </address>
## </body>
## </html>
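Extracting information:
Extract the italic text enclosed in <i> tags:
# Define the handler for the page's <i> nodes as a closure
getItalics <- function() {
  # Create a local container variable that collects the extracted values
  i_container = character()
  # Define the handler function for <i> nodes
  list(
    "i" = function(node, ...) {
      i_container <<- c(i_container, xmlValue(node))
    },
    # Accessor that returns the container filled by the handler
    returnI = function() i_container
  )
}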
h3 <- getItalics()
invisible(htmlTreeParse(url, handlers = h3))
h3$returnI()
## [1] "'What we have is nice, but we need something very different'"
## [2] "'R is wonderful, but it cannot work magic'"
## [3] "The book homepage"
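The closure pattern behind getItalics() can be looked at in isolation. The following is a minimal base R sketch; `make_collector`, `add`, and `get` are hypothetical names chosen for illustration and are not part of the XML package. It shows how `<<-` lets an inner function update a variable that lives in the enclosing function's environment, which is what allows a handler to accumulate values across calls.

```r
# Hypothetical helper illustrating the closure pattern (not an XML package function)
make_collector <- function() {
  container <- character()  # lives in the environment of make_collector()
  list(
    add = function(value) {
      # `<<-` assigns to `container` in the enclosing environment,
      # so the appended value persists between calls
      container <<- c(container, value)
      invisible(NULL)
    },
    get = function() container
  )
}

collector <- make_collector()
collector$add("first")
collector$add("second")
collected <- collector$get()
print(collected)
```

Each call to make_collector() creates a fresh environment, so two collectors never share their containers.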
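- XML (eXtensible Markup Language)
- JSON (JavaScript Object Notation)
- Why parse XML: to obtain a representation of the XML file that preserves its original structure, from which information can then be extracted easily
- This section covers how to work with XML files in R:
- How do we view XML?
- How do we import XML?
- How do we access XML?
- How do we turn information from an XML document into data structures better suited to further graphical or statistical analysis?
Parsing XML:
Packages and functions for parsing XML:
xmlEventParse()
Example:
Parse the XML file technology.xml, which holds stock information for three technology companies, such as the day's closing price, low, high, and trading volume: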
library(XML)
parsed_stocks <- xmlParse(file = "http://www.r-datacollection.com/materials/ch-3-xml/stocks/technology.xml")
parsed_stocks_root <- xmlRoot(parsed_stocks)
parsed_stocks_df <- xmlToDataFrame(parsed_stocks_root)
head(parsed_stocks_df)
## date close volume open high low company year
## 1 2013/11/13 520.634 7022001.0000 518 522.25 516.96 Apple 2013
## 2 2013/11/12 520.01 7295400.0000 517.67 523.92 517 Apple 2013
## 3 2013/11/11 519.048 8106796.0000 519.99 521.67 514.41 Apple 2013
## 4 2013/11/08 520.56 9945317.0000 514.58 521.13 512.59 Apple 2013
## 5 2013/11/07 512.492 9363278.0000 519.58 523.19 512.38 Apple 2013
## 6 2013/11/06 520.92 7959515.0000 524.15 524.86 518.2 Apple 2013
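To check whether an XML document conforms to its specification, set the validate argument to TRUE. This adds a validation step after the DOM has been built, in which the document is checked against a DTD:
# Conforms to the DTD
parsed_stocks <- xmlParse(file = "http://www.r-datacollection.com/materials/ch-3-xml/stocks/technology.xml",
validate = TRUE)
# Does not conform to the DTD
stocks <- xmlParse(file = "http://www.r-datacollection.com/materials/ch-3-xml/stocks/technology-manip.xml",
validate = TRUE)
Output:
No declaration for element document
Error in xmlParse(file = "http://www.r-datacollection.com/materials/ch-3-xml/stocks/technology-manip.xml", :
XML document is invalid
In most data-scraping situations it is unnecessary to validate files; they can simply be processed as they are.
After parsing an XML document, use functions from the XML package to locate and extract information: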
bond <- xmlParse("http://www.r-datacollection.com/materials/ch-3-xml/bond.xml")
class(bond)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
bond
## <?xml version="1.0" encoding="ISO-8859-1"?>
## <bond_movies>
## <movie id="1">
## <name>Dr. No</name>
## <year>1962</year>
## <actors bond="Sean Connery" villain="Joseph Wiseman"/>
## <budget>1.1M</budget>
## <boxoffice>59.5M</boxoffice>
## </movie>
## <movie id="2">
## <name>Live and Let Die</name>
## <year>1973</year>
## <actors bond="Roger Moore" villain="Yaphet Kotto"/>
## <budget>7M</budget>
## <boxoffice>126.4M</boxoffice>
## </movie>
## <movie id="3">
## <name>Skyfall</name>
## <year>2012</year>
## <actors bond="Daniel Craig" villain="Javier Bardem"/>
## <budget>175M</budget>
## <boxoffice>1108.6M</boxoffice>
## </movie>
## </bond_movies>
##
# Extract node sets
root <- xmlRoot(bond) # extract the top-level node
root
## <bond_movies>
## <movie id="1">
## <name>Dr. No</name>
## <year>1962</year>
## <actors bond="Sean Connery" villain="Joseph Wiseman"/>
## <budget>1.1M</budget>
## <boxoffice>59.5M</boxoffice>
## </movie>
## <movie id="2">
## <name>Live and Let Die</name>
## <year>1973</year>
## <actors bond="Roger Moore" villain="Yaphet Kotto"/>
## <budget>7M</budget>
## <boxoffice>126.4M</boxoffice>
## </movie>
## <movie id="3">
## <name>Skyfall</name>
## <year>2012</year>
## <actors bond="Daniel Craig" villain="Javier Bardem"/>
## <budget>175M</budget>
## <boxoffice>1108.6M</boxoffice>
## </movie>
## </bond_movies>
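xmlName(root) # name of the root element
## [1] "bond_movies"
xmlSize(root) # number of children of the root element
## [1] 3
# Within a node set, basic navigation and subsetting work much like indexing an ordinary R list: nodes can be selected by numeric position or by name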
root[[1]]
## <movie id="1">
## <name>Dr. No</name>
## <year>1962</year>
## <actors bond="Sean Connery" villain="Joseph Wiseman"/>
## <budget>1.1M</budget>
## <boxoffice>59.5M</boxoffice>
## </movie>
root[["movie"]]
## <movie id="1">
## <name>Dr. No</name>
## <year>1962</year>
## <actors bond="Sean Connery" villain="Joseph Wiseman"/>
## <budget>1.1M</budget>
## <boxoffice>59.5M</boxoffice>
## </movie>
root[[1]][[1]]
## <name>Dr. No</name>
root[["movie"]][["name"]]
## <name>Dr. No</name>
root[[1]][[1]][[1]]
## Dr. No
root[["movie"]][["name"]][[1]]
## Dr. No
# Single square brackets return an object of class XMLInternalNodeList
root["movie"]
## $movie
## <movie id="1">
## <name>Dr. No</name>
## <year>1962</year>
## <actors bond="Sean Connery" villain="Joseph Wiseman"/>
## <budget>1.1M</budget>
## <boxoffice>59.5M</boxoffice>
## </movie>
##
## $movie
## <movie id="2">
## <name>Live and Let Die</name>
## <year>1973</year>
## <actors bond="Roger Moore" villain="Yaphet Kotto"/>
## <budget>7M</budget>
## <boxoffice>126.4M</boxoffice>
## </movie>
##
## $movie
## <movie id="3">
## <name>Skyfall</name>
## <year>2012</year>
## <actors bond="Daniel Craig" villain="Javier Bardem"/>
## <budget>175M</budget>
## <boxoffice>1108.6M</boxoffice>
## </movie>
##
## attr(,"class")
## [1] "XMLInternalNodeList" "XMLNodeList"
root[1]
## $movie
## <movie id="1">
## <name>Dr. No</name>
## <year>1962</year>
## <actors bond="Sean Connery" villain="Joseph Wiseman"/>
## <budget>1.1M</budget>
## <boxoffice>59.5M</boxoffice>
## </movie>
##
## attr(,"class")
## [1] "XMLInternalNodeList" "XMLNodeList"
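To try the indexing rules without a network connection, the same operations can be run on a small inline document. This is a sketch under the assumption that the XML package is installed; the inline text is a shortened, hypothetical excerpt of the bond.xml structure, and `asText = TRUE` tells xmlParse() to treat its first argument as content rather than as a file name.

```r
library(XML)

# Shortened inline excerpt of the bond.xml structure (illustrative only)
xml_text <- '<bond_movies>
  <movie id="1"><name>Dr. No</name><year>1962</year></movie>
  <movie id="3"><name>Skyfall</name><year>2012</year></movie>
</bond_movies>'

doc  <- xmlParse(xml_text, asText = TRUE)
root <- xmlRoot(doc)

first_movie <- root[[1]]      # double brackets: a single node
all_movies  <- root["movie"]  # single brackets: a list of matching child nodes

# Pull the text of the first movie's <name> child
first_name <- xmlValue(first_movie[["name"]])

# Read the id attribute from every node in the list
movie_ids <- unname(sapply(all_movies, xmlGetAttr, "id"))

print(first_name)
print(movie_ids)
```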
Parsing documents written in other XML-based languages:
xmlParse("http://www.r-datacollection.com/materials/ch-3-xml/rsscode.rss")
## <?xml version="1.0" encoding="UTF-8"?>
## <rss version="2.0">
## <channel>
## <title>The ADCR blog</title>
## <description>Blog to the ADCR book; Wiley 2014</description>
## <link>http://www.rdatacollection.com/blog</link>
## <lastBuildDate>Tue, 22 Oct 2013 00:01:00 +0000 </lastBuildDate>
## <item>
## <title>Why R is useful for web scraping</title>
## <description>R is becoming the most popular statistical software and is growing fast due to an active community publishing several additional packages every day. Yet, R is more than [...]</description>
## <link>http://www.rdatacollection.com/blog/why-r-is-useful</link>
## <pubDate>Tue, 22 Oct 2013 00:01:00 +0000 </pubDate>
## </item>
## </channel>
## </rss>
##
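Summary:
Converting a whole XML object into R data structures such as vectors, data frames, and lists
Functions:
xmlToList()
Example:
Single vectors can be extracted with xmlSApply(), a wrapper around sapply() (its sibling xmlApply() wraps lapply()). It operates on an XML node, calls a given function on each of its children, and returns a vector where the results can be simplified. It is usually combined with functions such as xmlValue() or xmlGetAttr() to extract element values or attribute values: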
root
## <bond_movies>
## <movie id="1">
## <name>Dr. No</name>
## <year>1962</year>
## <actors bond="Sean Connery" villain="Joseph Wiseman"/>
## <budget>1.1M</budget>
## <boxoffice>59.5M</boxoffice>
## </movie>
## <movie id="2">
## <name>Live and Let Die</name>
## <year>1973</year>
## <actors bond="Roger Moore" villain="Yaphet Kotto"/>
## <budget>7M</budget>
## <boxoffice>126.4M</boxoffice>
## </movie>
## <movie id="3">
## <name>Skyfall</name>
## <year>2012</year>
## <actors bond="Daniel Craig" villain="Javier Bardem"/>
## <budget>175M</budget>
## <boxoffice>1108.6M</boxoffice>
## </movie>
## </bond_movies>
xmlSApply(root, xmlValue)
## movie movie
## "Dr. No19621.1M59.5M" "Live and Let Die19737M126.4M"
## movie
## "Skyfall2012175M1108.6M"
xmlSApply(root[[1]], xmlValue)
## name year actors budget boxoffice
## "Dr. No" "1962" "" "1.1M" "59.5M"
xmlSApply(root[[2]], xmlValue)
## name year actors
## "Live and Let Die" "1973" ""
## budget boxoffice
## "7M" "126.4M"
xmlSApply(root[[3]], xmlValue)
## name year actors budget boxoffice
## "Skyfall" "2012" "" "175M" "1108.6M"
xmlSApply(root, xmlAttrs)
## movie.id movie.id movie.id
## "1" "2" "3"
xmlSApply(root, xmlGetAttr, "id")
## movie movie movie
## "1" "2" "3"
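As long as an XML document is hierarchically flat, meaning that the nodes furthest from the root are its children or grandchildren, xmlToDataFrame() converts it to a data frame with little effort. One exception here is the <actors> element: its information is stored in attributes, so its column comes out empty. xmlToList() converts the document to a list: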
movie.df <- xmlToDataFrame(root)
movie.df
## name year actors budget boxoffice
## 1 Dr. No 1962 1.1M 59.5M
## 2 Live and Let Die 1973 7M 126.4M
## 3 Skyfall 2012 175M 1108.6M
movie.list <- xmlToList(root)
movie.list
## $movie
## $movie$name
## [1] "Dr. No"
##
## $movie$year
## [1] "1962"
##
## $movie$actors
## bond villain
## "Sean Connery" "Joseph Wiseman"
##
## $movie$budget
## [1] "1.1M"
##
## $movie$boxoffice
## [1] "59.5M"
##
## $movie$.attrs
## id
## "1"
##
##
## $movie
## $movie$name
## [1] "Live and Let Die"
##
## $movie$year
## [1] "1973"
##
## $movie$actors
## bond villain
## "Roger Moore" "Yaphet Kotto"
##
## $movie$budget
## [1] "7M"
##
## $movie$boxoffice
## [1] "126.4M"
##
## $movie$.attrs
## id
## "2"
##
##
## $movie
## $movie$name
## [1] "Skyfall"
##
## $movie$year
## [1] "2012"
##
## $movie$actors
## bond villain
## "Daniel Craig" "Javier Bardem"
##
## $movie$budget
## [1] "175M"
##
## $movie$boxoffice
## [1] "1108.6M"
##
## $movie$.attrs
## id
## "3"
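Because xmlToDataFrame() leaves the actors column empty, attribute values have to be collected separately if they are wanted in a data frame. One possible approach, sketched here on a shortened inline excerpt of the bond.xml structure (assuming the XML package is installed), is to build each column with xmlSApply() and combine xmlValue() for element content with xmlGetAttr() for attributes. The column names are arbitrary choices for this illustration.

```r
library(XML)

# Shortened inline excerpt of the bond.xml structure (illustrative only)
xml_text <- '<bond_movies>
  <movie id="1">
    <name>Dr. No</name>
    <actors bond="Sean Connery" villain="Joseph Wiseman"/>
  </movie>
  <movie id="3">
    <name>Skyfall</name>
    <actors bond="Daniel Craig" villain="Javier Bardem"/>
  </movie>
</bond_movies>'

root <- xmlRoot(xmlParse(xml_text, asText = TRUE))

# One column per piece of information; unname() drops the repeated "movie" names
movies <- data.frame(
  id      = unname(xmlSApply(root, xmlGetAttr, "id")),
  name    = unname(xmlSApply(root, function(m) xmlValue(m[["name"]]))),
  bond    = unname(xmlSApply(root, function(m) xmlGetAttr(m[["actors"]], "bond"))),
  villain = unname(xmlSApply(root, function(m) xmlGetAttr(m[["actors"]], "villain"))),
  stringsAsFactors = FALSE
)

print(movies)
```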
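- XML files are often much larger than HTML files, and their size can exceed the memory available on the computer. The problem is more severe for data streams, where the XML data arrive incrementally.
- A DOM-style parser produces two copies of a given XML file: the C-level node set and the R data structure. That makes it unsuitable for large XML files and for data streams.
- In these situations the XML file can be parsed with event-driven parsing, also known as SAX parsing (Simple API for XML).
- Difference between event-driven and DOM-style parsing: event-driven parsing skips building the complete DOM at the C level. Instead, the parser traverses the XML file sequentially and, as soon as it encounters an element of interest, triggers a user-defined reaction to that event. This gives event-driven parsing a major advantage over DOM-style parsing, because the whole file never has to be held in memory.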
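The event-driven idea can be sketched with a handler that reacts to every opening tag and simply counts tag names, without ever building a tree. This is a minimal sketch assuming the XML package is installed; `make_counter` and `getCounts` are hypothetical names, and the inline document is a toy input loosely modelled on technology.xml (the `Other` element and its values are made up for illustration).

```r
library(XML)

# Toy input loosely modelled on technology.xml (illustrative only)
xml_text <- '<document>
  <Apple><date>2013/11/13</date><close>520.634</close></Apple>
  <Apple><date>2013/11/12</date><close>520.01</close></Apple>
  <Other><date>2013/11/13</date><close>1</close></Other>
</document>'

# Hypothetical closure: keeps a running count of each tag name seen
make_counter <- function() {
  counts <- integer()
  list(
    # Called by the SAX parser for every opening tag
    startElement = function(name, attrs, ...) {
      current <- if (name %in% names(counts)) counts[[name]] else 0L
      counts[name] <<- current + 1L
    },
    getCounts = function() counts
  )
}

h <- make_counter()
# Only the startElement handler is handed to the parser
invisible(xmlEventParse(xml_text,
                        handlers = h["startElement"],
                        asText = TRUE,
                        useTagName = FALSE,
                        addContext = FALSE))
tag_counts <- h$getCounts()
print(tag_counts)
```

Because the handler only updates a small counter, memory use stays independent of the document size.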
The following code uses xmlEventParse() to run a SAX parser over technology.xml:
branchFun <- function() {
  # Containers for the values collected from each <Apple> branch
  container_date = character()
  container_close = character()
  # Branch handler: receives each complete <Apple> node
  Apple <- function(node, ...) {
    date = xmlValue(xmlChildren(node)[[c("date")]])
    container_date <<- c(container_date, date)
    close = xmlValue(xmlChildren(node)[[c("close")]])
    container_close <<- c(container_close, close)
    # print(c(close, date))
    # Sys.sleep(0.5)
  }
  # Accessor that returns the collected values as a data frame
  getContainer <- function() data.frame(date = container_date,
                                        close = container_close)
  list(Apple = Apple, getStore = getContainer)
}
h5 <- branchFun()
# Run the SAX parser
invisible(xmlEventParse(file = "http://www.r-datacollection.com/materials/ch-3-xml/stocks/technology.xml",
branches = h5,
handlers = list()))
apple.stock <- h5$getStore()
head(apple.stock)
## date close
## 1 2013/11/13 520.634
## 2 2013/11/12 520.01
## 3 2013/11/11 519.048
## 4 2013/11/08 520.56
## 5 2013/11/07 512.492
## 6 2013/11/06 520.92
library(RJSONIO)
Check whether the document contains valid JSON:
isValidJSON("E:/R/Data Mining/Text Mining/character,stringr,RCurl,XML/Curl with R/materials/ch-3-xml/indy.json")
## [1] TRUE
fromJSON() reads JSON content and converts it into R objects:
indy <- RJSONIO::fromJSON(content = "E:/R/Data Mining/Text Mining/character,stringr,RCurl,XML/Curl with R/materials/ch-3-xml/indy.json")
class(indy)
## [1] "list"
indy
## $`indy movies`
## $`indy movies`[[1]]
## $`indy movies`[[1]]$name
## [1] "Raiders of the Lost Ark"
##
## $`indy movies`[[1]]$year
## [1] 1981
##
## $`indy movies`[[1]]$actors
## Indiana Jones Dr. Ren Belloq
## "Harrison Ford" "Paul Freeman"
##
## $`indy movies`[[1]]$producers
## [1] "Frank Marshall" "George Lucas" "Howard Kazanjian"
##
## $`indy movies`[[1]]$budget
## [1] 1.8e+07
##
## $`indy movies`[[1]]$academy_award_ve
## [1] TRUE
##
##
## $`indy movies`[[2]]
## $`indy movies`[[2]]$name
## [1] "Indiana Jones and the Temple of Doom"
##
## $`indy movies`[[2]]$year
## [1] 1984
##
## $`indy movies`[[2]]$actors
## Indiana Jones Mola Ram
## "Harrison Ford" "Amish Puri"
##
## $`indy movies`[[2]]$producers
## [1] "Robert Watts"
##
## $`indy movies`[[2]]$budget
## [1] 28170000
##
## $`indy movies`[[2]]$academy_award_ve
## [1] TRUE
##
##
## $`indy movies`[[3]]
## $`indy movies`[[3]]$name
## [1] "Indiana Jones and the Last Crusade"
##
## $`indy movies`[[3]]$year
## [1] 1989
##
## $`indy movies`[[3]]$actors
## Indiana Jones Walter Donovan
## "Harrison Ford" "Julian Glover"
##
## $`indy movies`[[3]]$producers
## [1] "Robert Watts" "George Lucas"
##
## $`indy movies`[[3]]$budget
## [1] 4.8e+07
##
## $`indy movies`[[3]]$academy_award_ve
## [1] FALSE
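Splitting or regrouping the list, or coercing it into a vector, data frame, or other structure: once JSON or XML data are loaded into R, the user usually has to decide which subsets of the information are needed and should go into a data frame. For this reason there can be no ready-made, general-purpose function for converting JSON/XML data into R formats; conversion tools for the required subset have to be built case by case.
method 1:
library(stringr)
# Flatten the list structure into a single character vector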
indy.vec <- unlist(indy, recursive = TRUE, use.names = TRUE)
indy.vec[str_detect(names(indy.vec), "name")]
## indy movies.name
## "Raiders of the Lost Ark"
## indy movies.name
## "Indiana Jones and the Temple of Doom"
## indy movies.name
## "Indiana Jones and the Last Crusade"
method 2:
sapply(indy[[1]], "[[", "year")
## [1] 1981 1984 1989
sapply(indy[[1]], "[[", "name")
## [1] "Raiders of the Lost Ark"
## [2] "Indiana Jones and the Temple of Doom"
## [3] "Indiana Jones and the Last Crusade"
sapply(indy[[1]], "[[", "actors") # problematic: the row names come from the first movie only
## [,1] [,2] [,3]
## Indiana Jones "Harrison Ford" "Harrison Ford" "Harrison Ford"
## Dr. Ren Belloq "Paul Freeman" "Amish Puri" "Julian Glover"
sapply(indy[[1]], "[[", "producers")
## [[1]]
## [1] "Frank Marshall" "George Lucas" "Howard Kazanjian"
##
## [[2]]
## [1] "Robert Watts"
##
## [[3]]
## [1] "Robert Watts" "George Lucas"
sapply(indy[[1]], "[[", "budget")
## [1] 18000000 28170000 48000000
sapply(indy[[1]], "[[", "academy_award_ve")
## [1] TRUE TRUE FALSE
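The per-field extractions above can be combined into a data frame once a decision is made about fields of varying length. The sketch below uses base R only, on a hypothetical two-movie miniature of the parsed list (values copied from the output above); collapsing the producers into one string is just one possible design choice, made here so that every movie contributes exactly one row.

```r
# Hypothetical miniature of the parsed list, with values taken from the output above
movies <- list(
  list(name = "Raiders of the Lost Ark", year = 1981,
       producers = c("Frank Marshall", "George Lucas", "Howard Kazanjian")),
  list(name = "Indiana Jones and the Temple of Doom", year = 1984,
       producers = "Robert Watts")
)

# vapply() enforces one value of the stated type per movie
movies_df <- data.frame(
  name        = vapply(movies, function(m) m[["name"]], character(1)),
  year        = vapply(movies, function(m) m[["year"]], numeric(1)),
  producers   = vapply(movies, function(m) paste(m[["producers"]], collapse = "; "),
                       character(1)),
  n_producers = vapply(movies, function(m) length(m[["producers"]]), integer(1)),
  stringsAsFactors = FALSE
)

print(movies_df)
```

Unlike sapply(), vapply() fails loudly if a field is missing or has an unexpected length, which makes silent mislabelling less likely.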
method 3:
library(plyr)
indy.unlist <- sapply(indy[[1]], unlist)
indy.df <- do.call("rbind.fill",
lapply(lapply(indy.unlist, t),
data.frame,
stringsAsFactors = FALSE))
names(indy.df)
## [1] "name" "year"
## [3] "actors.Indiana.Jones" "actors.Dr..Ren.Belloq"
## [5] "producers1" "producers2"
## [7] "producers3" "budget"
## [9] "academy_award_ve" "actors.Mola.Ram"
## [11] "producers" "actors.Walter.Donovan"
indy.df
## name year actors.Indiana.Jones
## 1 Raiders of the Lost Ark 1981 Harrison Ford
## 2 Indiana Jones and the Temple of Doom 1984 Harrison Ford
## 3 Indiana Jones and the Last Crusade 1989 Harrison Ford
## actors.Dr..Ren.Belloq producers1 producers2 producers3
## 1 Paul Freeman Frank Marshall George Lucas Howard Kazanjian
## 2 <NA> <NA> <NA> <NA>
## 3 <NA> Robert Watts George Lucas <NA>
## budget academy_award_ve actors.Mola.Ram producers
## 1 1.8e+07 TRUE <NA> <NA>
## 2 28170000 TRUE Amish Puri Robert Watts
## 3 4.8e+07 FALSE <NA> <NA>
## actors.Walter.Donovan
## 1 <NA>
## 2 <NA>
## 3 Julian Glover
peanuts.json <- RJSONIO::fromJSON("http://www.r-datacollection.com/materials/ch-3-xml/peanuts.json",
nullValue = NA,
simplify = FALSE)
peanuts.df <- do.call("rbind",
lapply(peanuts.json,
data.frame,
stringsAsFactors = FALSE))
peanuts.df
## name sex age
## 1 van Pelt, Lucy female 32
## 2 Peppermint, Patty female NA
## 3 Brown, Charlie male 27
# Convert the R data to JSON
peanuts.json <- RJSONIO::toJSON(peanuts.df, pretty = TRUE)
file.output <- file("peanuts_out.json")
writeLines(peanuts.json, file.output)
close(file.output)
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following objects are masked from 'package:RJSONIO':
##
## fromJSON, toJSON
x <- '[1, 2, true, false]'
jsonlite::fromJSON(x)
## [1] 1 2 1 0
x <- '["foo", true, false]'
jsonlite::fromJSON(x)
## [1] "foo" "TRUE" "FALSE"
x <- '[1, "foo", null, false]'
jsonlite::fromJSON(x)
## [1] "1" "foo" NA "FALSE"
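The three results above are consistent with the way base R coerces mixed values when they are combined into one atomic vector: logical gives way to numeric, and everything gives way to character. The following base R sketch reproduces the same outcomes with c(); it illustrates the coercion hierarchy only and makes no claim about how jsonlite is implemented internally.

```r
# Numbers and logicals: logicals become 1 and 0
num_lgl <- c(1, 2, TRUE, FALSE)

# Strings and logicals: everything becomes character
chr_lgl <- c("foo", TRUE, FALSE)

# Number, string, missing value, logical: character, with NA kept as missing
mixed <- c(1, "foo", NA, FALSE)

print(num_lgl)
print(chr_lgl)
print(mixed)
```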
(peanuts.json <- jsonlite::fromJSON("http://www.r-datacollection.com/materials/ch-3-xml/peanuts.json"))
## name sex age
## 1 van Pelt, Lucy female 32
## 2 Peppermint, Patty female NA
## 3 Brown, Charlie male 27
(indy <- jsonlite::fromJSON("E:/R/Data Mining/Text Mining/character,stringr,RCurl,XML/Curl with R/materials/ch-3-xml/indy.json"))
## $`indy movies`
## name year actors.Indiana Jones
## 1 Raiders of the Lost Ark 1981 Harrison Ford
## 2 Indiana Jones and the Temple of Doom 1984 Harrison Ford
## 3 Indiana Jones and the Last Crusade 1989 Harrison Ford
## actors.Dr. Ren Belloq actors.Mola Ram actors.Walter Donovan
## 1 Paul Freeman <NA> <NA>
## 2 <NA> Amish Puri <NA>
## 3 <NA> <NA> Julian Glover
## producers budget academy_award_ve
## 1 Frank Marshall, George Lucas, Howard Kazanjian 18000000 TRUE
## 2 Robert Watts 28170000 TRUE
## 3 Robert Watts, George Lucas 48000000 FALSE
indy.df <- indy$`indy movies`
indy.df
## name year actors.Indiana Jones
## 1 Raiders of the Lost Ark 1981 Harrison Ford
## 2 Indiana Jones and the Temple of Doom 1984 Harrison Ford
## 3 Indiana Jones and the Last Crusade 1989 Harrison Ford
## actors.Dr. Ren Belloq actors.Mola Ram actors.Walter Donovan
## 1 Paul Freeman <NA> <NA>
## 2 <NA> Amish Puri <NA>
## 3 <NA> <NA> Julian Glover
## producers budget academy_award_ve
## 1 Frank Marshall, George Lucas, Howard Kazanjian 18000000 TRUE
## 2 Robert Watts 28170000 TRUE
## 3 Robert Watts, George Lucas 48000000 FALSE
indy.df$name
## [1] "Raiders of the Lost Ark"
## [2] "Indiana Jones and the Temple of Doom"
## [3] "Indiana Jones and the Last Crusade"
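XPath expressions for querying parsed XML files:
| Expression | Meaning |
|---|---|
| nodename | Selects the child nodes with the name nodename |
| / | Separates the steps within a path; at the start of a path, selects from the root node |
| // | Selects matching nodes anywhere in the document |
| @ | Selects an attribute |
| * | Matches any element node |
| @* | Matches any attribute |
| node() | Matches a node of any type |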
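A few of the XPath expressions listed above can be tried with xpathSApply() from the XML package. This is a minimal sketch on a shortened inline excerpt of the bond.xml structure, assuming the XML package is installed; the variable names are arbitrary.

```r
library(XML)

# Shortened inline excerpt of the bond.xml structure (illustrative only)
xml_text <- '<bond_movies>
  <movie id="1">
    <name>Dr. No</name>
    <actors bond="Sean Connery" villain="Joseph Wiseman"/>
  </movie>
  <movie id="3">
    <name>Skyfall</name>
    <actors bond="Daniel Craig" villain="Javier Bardem"/>
  </movie>
</bond_movies>'

doc <- xmlParse(xml_text, asText = TRUE)

# "/" at the start selects from the root; "*" matches any element
root_name <- xpathSApply(doc, "/*", xmlName)

# "//" finds <movie> nodes anywhere; "/" then steps to their <name> children
movie_names <- xpathSApply(doc, "//movie/name", xmlValue)

# "*" matches every element child of each <movie>
child_names <- xpathSApply(doc, "//movie/*", xmlName)

# "@" selects an attribute; unname() drops the attribute-name labels
bond_actors <- unname(xpathSApply(doc, "//actors/@bond"))

print(root_name)
print(movie_names)
print(child_names)
print(bond_actors)
```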
Tools for Parsing and Generating XML Within R and S-Plus
library(XML)
library(RCurl)
## Loading required package: bitops
url <- "http://www.cbooo.cn/year?year=2015"
# Parse the page; encoding must be set to "UTF-8", otherwise the text is garbled
doc <- htmlParse(url, encoding="UTF-8")
tables <- readHTMLTable(doc)
table <- tables[[1]]
names(table) <- c("title",
"type",
"boxoffice",
"meanprice",
"numofpeople",
"nation",
"date")
head(table)
## title type boxoffice meanprice numofpeople nation
## 1 1.捉妖记 魔幻 243952 37 42 中国
## 2 2.速度与激情7 动作 242655 39 42 美国/日本
## 3 3.港囧 喜剧 161336 33 40 中国
## 4 4.复仇者联盟2:奥创纪元 科幻 146438 40 29 美国
## 5 5.夏洛特烦恼 喜剧 144145 32 34 中国
## 6 6.侏罗纪世界 动作 142066 38 33 美国
## date
## 1 2015-07-16
## 2 2015-04-12
## 3 2015-09-25
## 4 2015-05-12
## 5 2015-09-30
## 6 2015-06-10
url = "http://data.earthquake.cn/datashare/datashare_more_quickdata_new.jsp"
url = "http://219.143.71.11/wdc4seis@bj/earthquakes/csn_quakes_p001.jsp"
wp = getURL(url)
doc = htmlParse(wp, asText = TRUE, encoding = "UTF-8")
tables = readHTMLTable(doc, header = TRUE, which = 2)
head(tables)
## Origin time(CST) Lat(°) Long(°) Depth(km) Mag
## 1 2012/01/08<U+00A0>14:20:08.0 42.10 87.50 7.0 M 5.0
## 2 2012/01/01<U+00A0>13:27:55.5 31.40 138.30 360.0 M 7.0
## 3 2011/12/27<U+00A0>23:21:58.5 51.80 95.90 10.0 M 7.0
## 4 2011/12/14<U+00A0>13:04:56.2 -7.50 146.80 120.0 M 7.2
## 5 2011/12/12<U+00A0>09:42:34.0 39.60 118.20 5.0 M 3.2
## 6 2011/12/01<U+00A0>20:48:19.8 38.40 76.90 10.0 M 5.2
## Region
## 1 NORTHERN XINJIANG, CHINA
## 2 SOUTHEAST OF HONSHU, JAPAN
## 3 SOUTHWESTERN SIBERIA, RUSSIA
## 4 EASTERN NEW GUINEA REG., P.N.G.
## 5 NORTHEASTERN CHINA
## 6 SOUTHERN XINJIANG, CHINA