library(htmltools)
## Warning: package 'htmltools' was built under R version 3.1.3
library(htmlTable)
## Warning: package 'htmlTable' was built under R version 3.1.3
library(XML)
## Warning: package 'XML' was built under R version 3.1.3
url <- "http://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tbls <- readHTMLTable(url)
sapply(tbls, nrow)
## $`NULL`
## [1] 247
##
## $`NULL`
## [1] 1
##
## $`NULL`
## NULL
##
## $`NULL`
## [1] 18
##
## $`NULL`
## [1] 16
pop <- readHTMLTable(url, which = 1, header = TRUE, stringsAsFactors = FALSE)
1. What type of data structure is pop? pop is a data frame, as str(pop) confirms. A data frame allows each column to hold a different mode of data (numeric, character, and so on), although here every column was read as character.
str(pop)
## 'data.frame': 247 obs. of 6 variables:
## $ Rank : chr "1" "2" "3" "4" ...
## $ Country (or dependent territory): chr "China[Note 2]" "India" "United States" "Indonesia" ...
## $ Population : chr "1,369,120,000" "1,269,500,000" "320,640,000" "255,461,700" ...
## $ Date : chr "April 5, 2015" "April 5, 2015" "April 5, 2015" "July 1, 2015" ...
## $ % of world
## population : chr "18.9%" "17.5%" "4.4%" "3.53%" ...
## $ Source : chr "Official population clock" "Official Population clock" "Official population clock" "Official estimate" ...
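The str() output shows that every column, including Population, was read as character. A minimal sketch of one way to make the counts numeric, using a small hand-made data frame (pop_demo is a hypothetical stand-in for pop, so the snippet runs without a network connection):

```r
# Hypothetical stand-in for the first rows of `pop`
pop_demo <- data.frame(
  Country    = c("China", "India", "United States"),
  Population = c("1,369,120,000", "1,269,500,000", "320,640,000"),
  stringsAsFactors = FALSE
)

# Strip the thousands separators, then convert to numeric
pop_demo$Population <- as.numeric(gsub(",", "", pop_demo$Population, fixed = TRUE))

str(pop_demo)
```

The same two-step pattern (remove the formatting characters, then convert) should also apply to the percentage column, with "%" in place of ",".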
2. Suppose instead you call readHTMLTable() with just the URL argument, against the provided URL, as shown below:
theURL <- "http://www.w3schools.com/html/html_tables.asp"
hvalues <- readHTMLTable(theURL)
str(hvalues)
## List of 6
## $ NULL:'data.frame': 4 obs. of 4 variables:
## ..$ Number : Factor w/ 4 levels "1","2","3","4": 1 2 3 4
## ..$ First Name: Factor w/ 4 levels "Adam","Eve","Jill",..: 2 4 1 3
## ..$ Last Name : Factor w/ 4 levels "Doe","Jackson",..: 2 1 3 4
## ..$ Points : Factor w/ 4 levels "50","67","80",..: 4 3 2 1
## $ NULL: NULL
## $ NULL: NULL
## $ NULL: NULL
## $ NULL: NULL
## $ NULL:'data.frame': 10 obs. of 2 variables:
## ..$ Tag : Factor w/ 10 levels "<caption>","<col>",..: 4 8 10 6 1 3 2 9 5 7
## ..$ Description: Factor w/ 10 levels "Defines a cell in a table",..: 4 2 3 1 5 9 10 8 6 7
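What is the type of variable returned in hvalues?
readHTMLTable() reads one or more HTML tables from an HTML document. Called with only the URL, it returns a list: here six elements, two of which are data frames and four of which are NULL. Because stringsAsFactors = FALSE was not passed, the columns of each data frame were read as factors (categorical variables).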
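The type can be checked directly with class(). The sketch below uses a hand-built list (hdemo, hypothetical) that mimics the shape of hvalues, with data frames mixed in among NULL entries, so it runs without fetching the page:

```r
# Hypothetical stand-in for the list returned by readHTMLTable(theURL)
hdemo <- list(
  data.frame(Number = factor(1:2), Points = factor(c("94", "80"))),
  NULL,
  data.frame(Tag = factor(c("<table>", "<tr>")))
)

class(hdemo)                 # class of the container
sapply(hdemo, class)         # class of each element
sapply(hdemo[[1]], class)    # class of each column in the first table
```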
3. Write R code that shows how many HTML tables are represented in hvalues. There are two tables, one with 4 rows and the other with 10 rows; the remaining four list elements are NULL.
sapply(hvalues,nrow)
## $`NULL`
## [1] 4
##
## $`NULL`
## NULL
##
## $`NULL`
## NULL
##
## $`NULL`
## NULL
##
## $`NULL`
## NULL
##
## $`NULL`
## [1] 10
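Because nrow() returns NULL for the empty entries, the number of real tables has to be read off by eye. A sketch of a more direct count, on a hypothetical stand-in list (hdemo) rather than the live page:

```r
# Hypothetical stand-in with the same shape as hvalues:
# two data frames and four NULL entries
hdemo <- list(
  data.frame(x = 1:4),
  NULL, NULL, NULL, NULL,
  data.frame(y = 1:10)
)

is_table <- !sapply(hdemo, is.null)   # TRUE where a table was parsed
sum(is_table)                         # number of tables
sapply(hdemo[is_table], nrow)         # rows in each table
```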
4. Modify the readHTMLTable code so that just the table with Number, First Name, Last Name, and Points is returned into a data frame.
modhvalues <- readHTMLTable(theURL, which = 1, header = TRUE, stringsAsFactors = FALSE)
head(modhvalues)
## Number First Name Last Name Points
## 1 1 Eve Jackson 94
## 2 2 John Doe 80
## 3 3 Adam Johnson 67
## 4 4 Jill Smith 50
5. Modify the returned data frame so only the Last Name and Points columns are shown.
lphvalues <- modhvalues[,c(3,4)]
head(lphvalues)
## Last Name Points
## 1 Jackson 94
## 2 Doe 80
## 3 Johnson 67
## 4 Smith 50
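Selecting by position works, but it silently picks the wrong columns if the column order ever changes. A minimal sketch of selecting by name instead, on a hand-made copy of the table (demo is hypothetical):

```r
# Hypothetical copy of the first two rows of modhvalues
demo <- data.frame(
  Number       = c("1", "2"),
  `First Name` = c("Eve", "John"),
  `Last Name`  = c("Jackson", "Doe"),
  Points       = c("94", "80"),
  check.names = FALSE,        # keep the spaces in the column names
  stringsAsFactors = FALSE
)

# Select the two columns by name rather than by index
lp <- demo[, c("Last Name", "Points")]
lp
```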
**6. Identify another interesting page on the web with HTML table values. This may be somewhat tricky, because while HTML tables are great for web-page scrapers, many HTML designers now prefer creating tables using other methods (such as tags or .png files).**
url2 <- "http://en.wikipedia.org/wiki/Fortune_500_Computer_Software_and_Information_Company"
fortune <- readHTMLTable(url2)
str(fortune)
## List of 1
## $ NULL:'data.frame': 28 obs. of 7 variables:
## ..$ Company : Factor w/ 28 levels "Amazon.com","Apple Inc",..: 18 21 24 2 13 8 15 5 23 19 ...
## ..$ Type : Factor w/ 5 levels "Computer Peripherals",..: 5 5 5 2 2 2 2 2 2 2 ...
## ..$ 2014 ranking: Factor w/ 28 levels "--","[1]","128[6]",..: 13 28 16 23 5 2 25 27 1 19 ...
## ..$ 2013 ranking: Factor w/ 28 levels "--","131","133",..: 12 28 14 26 4 23 24 27 1 17 ...
## ..$ 2012 ranking: Factor w/ 26 levels "--","10","127",..: 13 26 14 6 2 17 22 24 1 18 ...
## ..$ 2011 ranking: Factor w/ 25 levels "--","--[2]","11",..: 14 25 15 12 3 16 21 22 2 19 ...
## ..$ 2010 ranking: Factor w/ 23 levels "--","10","100",..: 17 5 16 21 2 19 23 22 10 20 ...
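The ranking columns come back as factors whose levels include footnote markers such as 128[6] and placeholders such as --. One way to turn such values into integers, sketched on a hypothetical vector (rank_raw) rather than the scraped table:

```r
# Hypothetical values modelled on the factor levels shown above
rank_raw <- factor(c("128[6]", "--", "15", "--[2]", "[1]"))

rank_chr <- as.character(rank_raw)              # factor -> character first
rank_chr <- gsub("\\[[^]]*\\]", "", rank_chr)   # drop footnote markers like [6]
rank_chr[rank_chr %in% c("--", "")] <- NA       # placeholders become missing
rank_int <- as.integer(rank_chr)

rank_int
```

Converting through as.character() first matters: calling as.integer() directly on a factor returns the internal level codes, not the printed values.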
7. How many HTML tables does that page contain? Just one: str(fortune) reports a list of 1.
8. Identify your web browser, and describe (in one or two sentences) how you view HTML page source in your web browser. I use Google Chrome. Right-clicking on the page brings up a context menu with a "View page source" option.
9. (Optional challenge exercise) Instead of using readHTMLTable from the XML package, use the functionality in the rvest package to perform the same task. Which method do you prefer? Why might one prefer one package over the other? I gave up: I could not work out the right css or xpath argument for the html_nodes() function.
library(rvest)
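# read_html() is the current name for the deprecated html()
wikiurl <- read_html("http://en.wikipedia.org/wiki/Fortune_500_Computer_Software_and_Information_Company")
# Extract the text of every header cell in every table
theader <- wikiurl %>% html_nodes(xpath = "//table//th") %>%
  html_text()
theader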
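For comparison, a hedged sketch of how the rvest route to a full table could look. It assumes a reasonably current rvest, and it parses an inline HTML string (page_src, hypothetical) instead of fetching the Wikipedia page, so the result does not depend on the live site:

```r
library(rvest)

# Hypothetical inline page with a single table
page_src <- "<table>
  <tr><th>Last Name</th><th>Points</th></tr>
  <tr><td>Jackson</td><td>94</td></tr>
  <tr><td>Doe</td><td>80</td></tr>
</table>"

page <- read_html(page_src)

# Header cells, selected with the same XPath as in the attempt above
hdr <- page %>% html_nodes(xpath = "//table//th") %>% html_text()
hdr

# html_table() on the whole page returns one entry per table; take the first
tbl <- html_table(page)[[1]]
tbl
```

Compared with readHTMLTable(), this splits the work into two visible steps (select nodes, then convert), which may be easier to debug when a page contains many tables.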