The following code is given in the assignment.

library(XML)
theURL <- "http://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/"
bowlPool <- readHTMLTable(theURL, which = 1, header = FALSE, stringsAsFactors = FALSE)
bowlPool
##                V1      V2        V3
## 1   Participant 1 Giant A Patriot Q
## 2   Participant 2 Giant B Patriot R
## 3   Participant 3 Giant C Patriot S
## 4   Participant 4 Giant D Patriot T
## 5   Participant 5 Giant E Patriot U
## 6   Participant 6 Giant F Patriot V
## 7   Participant 7 Giant G Patriot W
## 8   Participant 8 Giant H Patriot X
## 9   Participant 9 Giant I Patriot Y
## 10 Participant 10 Giant J Patriot Z

1. What type of data structure is bowlPool?

Answer

class(bowlPool)
## [1] "data.frame"

Hence bowlPool is a data frame.

2. Suppose instead you call readHTMLTable() with just the URL argument, against the provided URL, as shown below. What is the type of the variable returned in hvalues?

theURL <- "http://www.w3schools.com/html/html_tables.asp"

hvalues <- readHTMLTable(theURL)

Answer

theURL <- "http://www.w3schools.com/html/html_tables.asp"
hvalues <- readHTMLTable(theURL)
class(hvalues)
## [1] "list"

Hence hvalues is a list.

3. Write R code that shows how many HTML tables are represented in hvalues

Answer

To check how many HTML Tables were returned, we can use the length() function, as shown below:

length(hvalues)
## [1] 6

However, when readHTMLTable() is called directly on a URL, some elements of the returned list may be NULL. For example, the following commands assign a list of tables to hvalues, and that list contains several NULL elements:

theURL <- "http://www.w3schools.com/html/html_tables.asp"
hvalues <- readHTMLTable(theURL)
hvalues
## $`NULL`
##   Number First Name Last Name Points
## 1      1        Eve   Jackson     94
## 2      2       John       Doe     80
## 3      3       Adam   Johnson     67
## 4      4       Jill     Smith     50
## 
## $`NULL`
## NULL
## 
## $`NULL`
## NULL
## 
## $`NULL`
## NULL
## 
## $`NULL`
## NULL
## 
## $`NULL`
##           Tag
## 1     <table>
## 2        <th>
## 3        <tr>
## 4        <td>
## 5   <caption>
## 6  <colgroup>
## 7       <col>
## 8     <thead>
## 9     <tbody>
## 10    <tfoot>
##                                                                Description
## 1                                                          Defines a table
## 2                                         Defines a header cell in a table
## 3                                                 Defines a row in a table
## 4                                                Defines a cell in a table
## 5                                                  Defines a table caption
## 6       Specifies a group of one or more columns in a table for formatting
## 7  Specifies column properties for each column within a <colgroup> element
## 8                                     Groups the header content in a table
## 9                                       Groups the body content in a table
## 10                                    Groups the footer content in a table

The above display shows that hvalues has NULL elements. If the list is very large, printing it in full is not a practical way to find them. Instead of displaying the whole list, you can use the following command, which returns TRUE if the list contains at least one NULL element.

any(lapply(hvalues[seq(hvalues)],is.null))
## Warning in any(lapply(hvalues[seq(hvalues)], is.null)): coercing argument
## of type 'list' to logical
## [1] TRUE

The above command shows that there is at least one NULL element in the list, but it also produces a warning, because lapply() returns a list and any() expects a logical vector. The following command avoids the warning:

any(vapply(hvalues, is.null, logical(1)))
## [1] TRUE
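
As a minimal sketch of the same check on a toy list (hypothetical data standing in for hvalues, not taken from the web page), vapply() can produce the logical vector that any() and sum() expect, with one flag per list element:

```r
# Toy stand-in for hvalues (hypothetical data): two data frames and two NULLs
toy <- list(data.frame(a = 1:2), NULL, NULL, data.frame(b = 3:4))

# lapply() returns a list, which any() would have to coerce
is.list(lapply(toy, is.null))

# vapply() returns a logical vector, one element per list entry
null_flags <- vapply(toy, is.null, logical(1))

any(null_flags)    # is there at least one NULL element?
sum(null_flags)    # how many elements are NULL?
sum(!null_flags)   # how many elements are not NULL?
```

Declaring the result type with logical(1) also makes vapply() fail loudly if an element ever yields something other than a single TRUE or FALSE.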

Another way of checking for NULL values is to count them. The following command counts the NULL elements in the list.

sum(vapply(hvalues, is.null, logical(1)))
## [1] 4

The last three commands confirm that the list contains NULL elements (four of them).

Finally, to get the number of HTML tables that are not NULL, use this command:

sum(!vapply(hvalues, is.null, logical(1)))
## [1] 2

Hence hvalues contains 2 non-NULL tables, although length(hvalues) reports 6.

To remove the NULL elements from the list, you can use the following R command:

hvalues <- hvalues[!vapply(hvalues, is.null, logical(1))]

Let us display hvalues again.

hvalues
## $`NULL`
##   Number First Name Last Name Points
## 1      1        Eve   Jackson     94
## 2      2       John       Doe     80
## 3      3       Adam   Johnson     67
## 4      4       Jill     Smith     50
## 
## $`NULL`
##           Tag
## 1     <table>
## 2        <th>
## 3        <tr>
## 4        <td>
## 5   <caption>
## 6  <colgroup>
## 7       <col>
## 8     <thead>
## 9     <tbody>
## 10    <tfoot>
##                                                                Description
## 1                                                          Defines a table
## 2                                         Defines a header cell in a table
## 3                                                 Defines a row in a table
## 4                                                Defines a cell in a table
## 5                                                  Defines a table caption
## 6       Specifies a group of one or more columns in a table for formatting
## 7  Specifies column properties for each column within a <colgroup> element
## 8                                     Groups the header content in a table
## 9                                       Groups the body content in a table
## 10                                    Groups the footer content in a table

The above display confirms that there are no NULL elements left in the list.

You can also count the NULL elements again; the output shows that none remain in the list.

sum(vapply(hvalues, is.null, logical(1)))
## [1] 0
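
Another way to drop NULL elements, shown here as a small sketch on a toy list (the names and data are hypothetical, not from the web page), is base R's Filter() combined with Negate(is.null). It returns a new list and leaves the original untouched:

```r
# Toy list with named elements; "second" is NULL (hypothetical data)
toy <- list(first  = data.frame(a = 1:2),
            second = NULL,
            third  = data.frame(b = 3:4))

# Keep only the elements for which is.null() is FALSE
cleaned <- Filter(Negate(is.null), toy)

length(toy)       # original list still has all three elements
length(cleaned)   # NULL element removed
names(cleaned)    # names of the surviving elements are preserved
```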

4. Modify the readHTMLTable code so that just the table with Number, FirstName, LastName, and Points is returned into a dataframe

Answer

We know that the first element in the list contains our desired data frame. The following command extracts that table:

hvalues <- readHTMLTable(theURL)[[1]]
hvalues
##   Number First Name Last Name Points
## 1      1        Eve   Jackson     94
## 2      2       John       Doe     80
## 3      3       Adam   Johnson     67
## 4      4       Jill     Smith     50
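
The which argument used in the very first readHTMLTable() call of this assignment offers another way to request a single table. The sketch below assumes the XML package is installed and uses a small inline HTML string (hypothetical, standing in for the live page) so that it runs without a network connection:

```r
library(XML)

# Hypothetical inline page with two tables, used instead of a live URL
page <- "<html><body>
<table>
  <tr><th>Number</th><th>Points</th></tr>
  <tr><td>1</td><td>94</td></tr>
  <tr><td>2</td><td>80</td></tr>
</table>
<table>
  <tr><th>Tag</th></tr>
  <tr><td>td</td></tr>
</table>
</body></html>"

# Parse the string as HTML rather than treating it as a file name
doc <- htmlParse(page, asText = TRUE)

# which = 1 asks for the first table only, so a data frame comes back
# instead of a list of tables
first_table <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
class(first_table)
```

Against the real page, the equivalent call would presumably be readHTMLTable(theURL, which = 1), assuming the desired table is still the first one on the page.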

5. Modify the returned data frame so only the Last Name and Points columns are shown.

Answer

The following R command keeps only the “Last Name” and “Points” columns of the hvalues data frame:

hvalues <- hvalues[,c("Last Name", "Points")]
hvalues
##   Last Name Points
## 1   Jackson     94
## 2       Doe     80
## 3   Johnson     67
## 4     Smith     50
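
The same column selection can be tried on a hand-built data frame. This is only a sketch: the scores object is a hypothetical stand-in that reuses two rows from the table above, and it assumes the scraped columns arrive as text, which is why Points is converted before any arithmetic:

```r
# Hand-built stand-in for the scraped table; check.names = FALSE keeps
# the spaces in the column names instead of turning them into dots
scores <- data.frame("Number"     = c("1", "2"),
                     "First Name" = c("Eve", "John"),
                     "Last Name"  = c("Jackson", "Doe"),
                     "Points"     = c("94", "80"),
                     check.names = FALSE,
                     stringsAsFactors = FALSE)

# Select columns by name; names containing spaces must be quoted
picked <- scores[, c("Last Name", "Points")]

# Selecting a single column returns a vector unless drop = FALSE is given
one_col <- scores[, "Points", drop = FALSE]

# Convert Points to numbers; as.character() first also handles factors
picked$Points <- as.numeric(as.character(picked$Points))
picked
```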

6. Identify another interesting page on the web with HTML table values. This may be somewhat tricky, because while HTML tables are great for web-page scrapers, many HTML designers now prefer creating tables using other methods (such as div tags or .png files).

Answer

I would like to read the tables present at the following URL:

http://en.wikipedia.org/wiki/Timeline_of_Indian_history

theURL <- "http://en.wikipedia.org/wiki/Timeline_of_Indian_history"
hvalues <- readHTMLTable(theURL)

7. How many HTML tables does that page contain?

Answer

length(hvalues)
## [1] 62

8. Identify your web browser, and describe (in one or two sentences) how you view HTML page source in your web browser.

Answer

I am using Internet Explorer (version 11.0.9600.17691). To view the HTML source in Internet Explorer, right-click on the web page and select the “View Source” option.

9. (Optional challenge exercise) Instead of using readHTMLTable from the XML package, use the functionality in the rvest package to perform the same task.
Which method do you prefer? Why might one prefer one package over the other?

Answer

library(rvest)
## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:XML':
## 
##     xml
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# read_html() supersedes html(), which is deprecated in later rvest releases
theURL <- read_html("http://en.wikipedia.org/wiki/Timeline_of_Indian_history")

x<- theURL %>% 
  html_nodes("table")

length(x)
## [1] 62

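The output of the last command shows that x holds 62 table nodes, the same count that readHTMLTable() returned. The main advantage of rvest is that commands can be chained with the %>% pipe operator (from the magrittr package, also loaded by dplyr), which keeps multi-step scraping code readable. The XML approach, on the other hand, returns the parsed tables in a single call.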
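
Selecting the table nodes is only half of what readHTMLTable() does, because the nodes still have to be converted into data frames. A minimal sketch of that extra step, assuming rvest is installed and using a hypothetical inline HTML string instead of the Wikipedia page, could look like this:

```r
library(rvest)

# Hypothetical inline page with one table, used instead of a live URL
page <- read_html("<html><body>
<table>
  <tr><th>Last Name</th><th>Points</th></tr>
  <tr><td>Jackson</td><td>94</td></tr>
  <tr><td>Doe</td><td>80</td></tr>
</table>
</body></html>")

# Chain the steps: find every <table> node, then parse each into a data frame
tables <- page %>%
  html_nodes("table") %>%
  html_table()

length(tables)   # one entry per <table> node found
tables[[1]]      # the first parsed table
```

Here html_table() plays the role that readHTMLTable() plays in the XML package, applied to nodes that were selected explicitly.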