The following code is given in the assignment.
library(XML)
theURL <- "http://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/"
bowlPool <- readHTMLTable(theURL, which = 1, header = FALSE, stringsAsFactors = FALSE)
bowlPool
## V1 V2 V3
## 1 Participant 1 Giant A Patriot Q
## 2 Participant 2 Giant B Patriot R
## 3 Participant 3 Giant C Patriot S
## 4 Participant 4 Giant D Patriot T
## 5 Participant 5 Giant E Patriot U
## 6 Participant 6 Giant F Patriot V
## 7 Participant 7 Giant G Patriot W
## 8 Participant 8 Giant H Patriot X
## 9 Participant 9 Giant I Patriot Y
## 10 Participant 10 Giant J Patriot Z
1. What type of data structure is bowlPool?
Answer
class(bowlPool)
## [1] "data.frame"
Hence bowlPool is a data frame.
2. Suppose instead you call readHTMLTable() with just the URL argument, against the provided URL, as shown below. What is the type of the variable returned in hvalues?
theURL <- "http://www.w3schools.com/html/html_tables.asp"
hvalues <- readHTMLTable(theURL)
Answer
theURL <- "http://www.w3schools.com/html/html_tables.asp"
hvalues <- readHTMLTable(theURL)
class(hvalues)
## [1] "list"
Hence hvalues is a list.
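class() only reports the type of the container. To see what each element holds, the list can be inspected element by element. The following is a minimal sketch on a hypothetical stand-in list (toy, with made-up element names), not on the live page, whose contents may change:

```r
# Hypothetical stand-in for the list returned by readHTMLTable(theURL):
# two data frames and one NULL entry. Names and values are illustrative only.
toy <- list(
  scores = data.frame(Name = c("Eve", "John"), Points = c(94, 80)),
  empty  = NULL,
  tags   = data.frame(Tag = c("<table>", "<tr>"))
)

class(toy)               # type of the container
length(toy)              # number of elements, NULL entries included
sapply(toy, class)       # class of each individual element
str(toy, max.level = 1)  # compact one-line-per-element overview
```

sapply(toy, class) is usually the quickest way to tell real tables from empty entries without printing the whole list.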
3. Write R code that shows how many HTML tables are represented in hvalues
Answer
To check how many HTML Tables were returned, we can use the length() function, as shown below:
length(hvalues)
## [1] 6
However, when readHTMLTable() is called directly on a URL, some elements of the returned list may be NULL. For example, the following commands assign a list of tables to hvalues, and printing the list shows which elements are NULL:
theURL <- "http://www.w3schools.com/html/html_tables.asp"
hvalues <- readHTMLTable(theURL)
hvalues
## $`NULL`
## Number First Name Last Name Points
## 1 1 Eve Jackson 94
## 2 2 John Doe 80
## 3 3 Adam Johnson 67
## 4 4 Jill Smith 50
##
## $`NULL`
## NULL
##
## $`NULL`
## NULL
##
## $`NULL`
## NULL
##
## $`NULL`
## NULL
##
## $`NULL`
## Tag
## 1 <table>
## 2 <th>
## 3 <tr>
## 4 <td>
## 5 <caption>
## 6 <colgroup>
## 7 <col>
## 8 <thead>
## 9 <tbody>
## 10 <tfoot>
## Description
## 1 Defines a table
## 2 Defines a header cell in a table
## 3 Defines a row in a table
## 4 Defines a cell in a table
## 5 Defines a table caption
## 6 Specifies a group of one or more columns in a table for formatting
## 7 Specifies column properties for each column within a <colgroup> element
## 8 Groups the header content in a table
## 9 Groups the body content in a table
## 10 Groups the footer content in a table
The above display shows that hvalues has NULL elements. If the list is really big, printing it is not very informative. Instead of displaying the whole list, you can use the following command, which returns TRUE if the list contains at least one NULL element.
any(lapply(hvalues[seq(hvalues)],is.null))
## Warning in any(lapply(hvalues[seq(hvalues)], is.null)): coercing argument
## of type 'list' to logical
## [1] TRUE
The above command confirms that the list has at least one NULL element, but it also raises a warning, because any() receives a list rather than a logical vector. You can use the following command to avoid the warning:
any(data.frame(lapply(hvalues[seq(hvalues)],is.null))==T)
## [1] TRUE
Another way of checking for NULL values is to count them. The following command counts the NULL elements in the list.
sum(data.frame(lapply(hvalues[seq(hvalues)],is.null))==T)
## [1] 4
The last three commands confirm that the list contains NULL elements.
Finally, to get the number of HTML tables (the elements of the list that are not NULL), use this command:
sum(data.frame(lapply(hvalues[seq(hvalues)],is.null))==F)
## [1] 2
Hence hvalues contains 2 actual tables; the other 4 of its 6 elements are NULL.
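The same checks can be written without the data.frame() conversion and without the coercion warning by making is.null produce a plain logical vector. This is a sketch of an alternative, shown on a hypothetical stand-in list (toy) rather than on the live page:

```r
# Hypothetical stand-in list: two tables and two NULL entries
toy <- list(
  data.frame(Name = c("Eve", "John"), Points = c(94, 80)),
  NULL,
  NULL,
  data.frame(Tag = c("<table>", "<tr>"))
)

# vapply() returns a logical vector (one value per element),
# so any() and sum() can be applied to it directly
is_missing <- vapply(toy, is.null, logical(1))

any(is_missing)    # is at least one element NULL?
sum(is_missing)    # number of NULL elements
sum(!is_missing)   # number of elements that are real tables
```

vapply() is used instead of sapply() here because it guarantees the result type, which matters if the list happens to be empty.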
To weed out the NULL elements in the list, you can use the following R Command:
hvalues[which(lapply(hvalues[seq(hvalues)],is.null) == T)] <- NULL
Let us display the hvalues again.
hvalues
## $`NULL`
## Number First Name Last Name Points
## 1 1 Eve Jackson 94
## 2 2 John Doe 80
## 3 3 Adam Johnson 67
## 4 4 Jill Smith 50
##
## $`NULL`
## Tag
## 1 <table>
## 2 <th>
## 3 <tr>
## 4 <td>
## 5 <caption>
## 6 <colgroup>
## 7 <col>
## 8 <thead>
## 9 <tbody>
## 10 <tfoot>
## Description
## 1 Defines a table
## 2 Defines a header cell in a table
## 3 Defines a row in a table
## 4 Defines a cell in a table
## 5 Defines a table caption
## 6 Specifies a group of one or more columns in a table for formatting
## 7 Specifies column properties for each column within a <colgroup> element
## 8 Groups the header content in a table
## 9 Groups the body content in a table
## 10 Groups the footer content in a table
The above display confirms that no NULL elements remain in the list.
You can also count the NULL elements again; the output of the following command shows that none are left.
sum(data.frame(lapply(hvalues[seq(hvalues)],is.null))==T)
## [1] 0
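For comparison, base R offers more compact ways to drop NULL elements from a list. The sketch below uses a hypothetical stand-in list (toy, with made-up names) and leaves the original list untouched by storing the result in a new variable:

```r
# Hypothetical stand-in list with one NULL entry
toy <- list(a = data.frame(x = 1:2), b = NULL, c = data.frame(y = 3:4))

# Filter() keeps the elements for which the predicate is TRUE;
# Negate(is.null) is TRUE for every element that is not NULL
tables_only <- Filter(Negate(is.null), toy)

# Equivalent form using logical indexing
tables_only2 <- toy[!vapply(toy, is.null, logical(1))]

length(tables_only)   # number of remaining elements
names(tables_only)    # names of the remaining elements
```

Both forms preserve the element names and the order of the remaining tables.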
4. Modify the readHTMLTable code so that just the table with Number, FirstName, LastName, and Points is returned into a dataframe
Answer
We know that the first element in the list contains the desired table. The following command extracts that element as a data frame:
hvalues <- readHTMLTable(theURL)[[1]]
hvalues
## Number First Name Last Name Points
## 1 1 Eve Jackson 94
## 2 2 John Doe 80
## 3 3 Adam Johnson 67
## 4 4 Jill Smith 50
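Since the question asks for a change to the readHTMLTable() call itself, another option is the which argument, which the assignment's own bowlPool code already uses. The sketch below assumes a small hypothetical two-table page supplied as a string (so it does not depend on the live site) and assumes the XML package is installed:

```r
library(XML)

# Hypothetical two-table page, parsed from a string instead of fetched
page <- "<html><body>
<table>
  <tr><th>Number</th><th>Points</th></tr>
  <tr><td>1</td><td>94</td></tr>
  <tr><td>2</td><td>80</td></tr>
</table>
<table>
  <tr><th>Tag</th></tr>
  <tr><td>td</td></tr>
</table>
</body></html>"

doc <- htmlParse(page, asText = TRUE)

# which = 1 asks readHTMLTable() for the first table only,
# so the result is a single data frame rather than a list
first_table <- readHTMLTable(doc, which = 1, header = TRUE,
                             stringsAsFactors = FALSE)
class(first_table)
first_table
```

Against the real page, the analogous call would be readHTMLTable(theURL, which = 1), on the assumption that the desired table is still the first one on the page.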
5. Modify the returned data frame so only the Last Name and Points columns are shown.
Answer
The following R command keeps only the “Last Name” and “Points” columns of the hvalues data frame:
hvalues <- hvalues[,c("Last Name", "Points")]
hvalues
## Last Name Points
## 1 Jackson 94
## 2 Doe 80
## 3 Johnson 67
## 4 Smith 50
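Two details of this kind of column selection are worth illustrating: column names that contain spaces must be quoted, and selecting a single column with [ , ] returns a vector unless drop = FALSE is given. The sketch below uses a hypothetical data frame (scores) built by hand to mirror the table above:

```r
# Hypothetical data frame mirroring the scraped table
scores <- data.frame(
  Number       = 1:2,
  `First Name` = c("Eve", "John"),
  `Last Name`  = c("Jackson", "Doe"),
  Points       = c(94, 80),
  check.names  = FALSE   # keep the spaces in the column names
)

# Select several columns by name; the result is still a data frame
two_cols <- scores[, c("Last Name", "Points")]
names(two_cols)

# A single column selected with [ , ] is simplified to a vector by default;
# drop = FALSE keeps it as a one-column data frame
one_col <- scores[, "Points", drop = FALSE]
class(one_col)
```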
6. Identify another interesting page on the web with HTML table values. This may be somewhat tricky, because while HTML tables are great for web-page scrapers, many HTML designers now prefer creating tables using other methods (such as div tags or .png files).
Answer
I would like to read the tables present at the following URL:
“http://en.wikipedia.org/wiki/Timeline_of_Indian_history”
theURL <- "http://en.wikipedia.org/wiki/Timeline_of_Indian_history"
hvalues <- readHTMLTable(theURL)
7. How many HTML tables does that page contain?
Answer
length(hvalues)
## [1] 62
8. Identify your web browser, and describe (in one or two sentences) how you view HTML page source in your web browser.
Answer
I am using Internet Explorer (Version 11.0.9600.17691). To view the HTML source in Internet Explorer, right-click on the web page and select the option “View Source”.
9. (Optional challenge exercise) Instead of using readHTMLTable from the XML package, use the functionality in the rvest package to perform the same task.
Which method do you prefer? Why might one prefer one package over the other?
Answer
library(rvest)
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:XML':
##
## xml
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
theURL <- html("http://en.wikipedia.org/wiki/Timeline_of_Indian_history")
x<- theURL %>%
html_nodes("table")
length(x)
## [1] 62
The output of the last command shows that 62 table nodes were read into x, which matches the count obtained with readHTMLTable(). The main advantage of rvest is the ability to chain commands with the %>% pipe operator (provided by magrittr and re-exported by dplyr).
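The code above only counts the table nodes. To complete the comparison with readHTMLTable(), the nodes can be converted to data frames with html_table(). The sketch below assumes a newer rvest release, in which read_html() takes the place of html(), and parses a small hypothetical page from a string instead of fetching the Wikipedia page:

```r
library(rvest)

# Hypothetical two-table page, parsed from a string instead of fetched
page <- read_html("<html><body>
<table>
  <tr><th>Name</th><th>Points</th></tr>
  <tr><td>Eve</td><td>94</td></tr>
</table>
<table>
  <tr><th>Tag</th></tr>
  <tr><td>td</td></tr>
</table>
</body></html>")

# Chain the steps: select every <table> node, then parse each into a table
tables <- page %>%
  html_nodes("table") %>%
  html_table()

length(tables)   # number of tables found
tables[[1]]      # first table
```

Each step of the pipeline can be inspected or swapped on its own, for example by changing the CSS selector passed to html_nodes(), which is where the chaining style pays off.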