Code based on Jared Lander’s R for Everyone:

# load the XML library
library('XML')
## Warning: package 'XML' was built under R version 3.1.3
# set the target URL
theURL <- "http://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/"
# read the URL
bowlPool <- readHTMLTable(theURL, which = 1, header = FALSE, stringsAsFactors = FALSE)
# display the result
bowlPool
##                V1      V2        V3
## 1   Participant 1 Giant A Patriot Q
## 2   Participant 2 Giant B Patriot R
## 3   Participant 3 Giant C Patriot S
## 4   Participant 4 Giant D Patriot T
## 5   Participant 5 Giant E Patriot U
## 6   Participant 6 Giant F Patriot V
## 7   Participant 7 Giant G Patriot W
## 8   Participant 8 Giant H Patriot X
## 9   Participant 9 Giant I Patriot Y
## 10 Participant 10 Giant J Patriot Z

1. What type of data structure is bowlPool?

# display the data structure
str(bowlPool)
## 'data.frame':    10 obs. of  3 variables:
##  $ V1: chr  "Participant 1" "Participant 2" "Participant 3" "Participant 4" ...
##  $ V2: chr  "Giant A" "Giant B" "Giant C" "Giant D" ...
##  $ V3: chr  "Patriot Q" "Patriot R" "Patriot S" "Patriot T" ...

We can see that bowlPool is a data.frame with 10 observations of 3 character variables.
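
The same check can be made programmatically rather than by reading the str() output. The following is a minimal sketch that uses a small stand-in data frame (bowlPoolDemo is a hypothetical name, not part of the original code) so that it runs without network access:

```r
# Stand-in for bowlPool: a small data frame with the same three character columns
bowlPoolDemo <- data.frame(
  V1 = c("Participant 1", "Participant 2"),
  V2 = c("Giant A", "Giant B"),
  V3 = c("Patriot Q", "Patriot R"),
  stringsAsFactors = FALSE
)

class(bowlPoolDemo)          # "data.frame"
is.data.frame(bowlPoolDemo)  # TRUE
dim(bowlPoolDemo)            # 2 3 (rows, columns)
```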

2. Suppose instead you call readHTMLTable() with just the URL argument, against the provided URL, as shown below:

# set the target URL
theURL <- "http://www.w3schools.com/html/html_tables.asp"
# read the URL
hvalues <- readHTMLTable(theURL)

What is the type of variable returned in hvalues?

# display the structure of the data
str(hvalues)
## List of 6
##  $ NULL:'data.frame':    4 obs. of  4 variables:
##   ..$ Number    : Factor w/ 4 levels "1","2","3","4": 1 2 3 4
##   ..$ First Name: Factor w/ 4 levels "Adam","Eve","Jill",..: 2 4 1 3
##   ..$ Last Name : Factor w/ 4 levels "Doe","Jackson",..: 2 1 3 4
##   ..$ Points    : Factor w/ 4 levels "50","67","80",..: 4 3 2 1
##  $ NULL: NULL
##  $ NULL: NULL
##  $ NULL: NULL
##  $ NULL: NULL
##  $ NULL:'data.frame':    10 obs. of  2 variables:
##   ..$ Tag        : Factor w/ 10 levels "<caption>","<col>",..: 4 8 10 6 1 3 2 9 5 7
##   ..$ Description: Factor w/ 10 levels "Defines a cell in a table",..: 4 2 3 1 5 9 10 8 6 7

We can see that hvalues is a list with six elements: two are data.frames and the other four are NULL.
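
To see the type of each element without scanning the full str() output, one option is to apply class() across the list. This is a sketch on a stand-in list (hvaluesDemo is a hypothetical name, and its contents are made up to mimic the mix of data frames and NULL entries shown above):

```r
# Stand-in for hvalues: two data frames separated by NULL entries
hvaluesDemo <- list(
  data.frame(Number = 1:4),
  NULL,
  NULL,
  data.frame(Tag = c("<table>", "<tr>"))
)

class(hvaluesDemo)                  # "list"
sapply(hvaluesDemo, class)          # "data.frame" "NULL" "NULL" "data.frame"
sapply(hvaluesDemo, is.data.frame)  # TRUE FALSE FALSE TRUE
```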

3. Write R code that shows how many HTML tables are represented in hvalues

# determine how many HTML tables are represented in hvalues:
# summary() reports the length of each list element, and NULL entries have length 0
summaryHvalues <- summary(hvalues)
nTables <- sum(summaryHvalues[, 1] != "0")
nTables
## [1] 2

We can see that there are 2 tables that have a non-zero length.
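
The same count can also be obtained without comparing strings from the summary() output, by testing each element for NULL directly. A minimal sketch on a stand-in list (hvaluesDemo, nTablesDemo and parsedTables are hypothetical names):

```r
# Stand-in for hvalues: two data frames and two NULL entries
hvaluesDemo <- list(
  data.frame(Number = 1:4),
  NULL,
  NULL,
  data.frame(Tag = "<td>")
)

# Count the elements that are not NULL
nTablesDemo <- sum(!sapply(hvaluesDemo, is.null))
nTablesDemo  # 2

# Drop the NULL entries so that only the data frames remain
parsedTables <- Filter(Negate(is.null), hvaluesDemo)
length(parsedTables)  # 2
```

Filtering first has the side benefit that later code can loop over parsedTables without checking for NULL.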

4. Modify the readHTMLTable code so that just the table with Number, FirstName, LastName, and Points is returned into a dataframe

# read in just the table with Number, First Name, Last Name, and Points
hvalues <- readHTMLTable(theURL,which=1)
head(hvalues)
##   Number First Name Last Name Points
## 1      1        Eve   Jackson     94
## 2      2       John       Doe     80
## 3      3       Adam   Johnson     67
## 4      4       Jill     Smith     50
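
The effect of the which argument can be reproduced offline. The sketch below parses a small, made-up HTML string (the markup and the names html, doc, allTables and firstTable are assumptions for illustration) with htmlParse() and compares the two calling styles:

```r
library(XML)

# Hypothetical two-table page, inlined so the example needs no network access
html <- paste0(
  "<html><body>",
  "<table><tr><th>Number</th><th>Points</th></tr>",
  "<tr><td>1</td><td>94</td></tr>",
  "<tr><td>2</td><td>80</td></tr></table>",
  "<table><tr><th>Tag</th><th>Description</th></tr>",
  "<tr><td>td</td><td>Defines a cell in a table</td></tr></table>",
  "</body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# Without `which`, every table comes back as an element of a list
allTables <- readHTMLTable(doc, stringsAsFactors = FALSE)
class(allTables)
length(allTables)

# With `which = 1`, only the first table is returned, as a data frame
firstTable <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
class(firstTable)
```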

5. Modify the returned data frame so only the Last Name and Points columns are shown.

# keep only the Last Name and Points columns (columns 3 and 4)
newHvalues<-hvalues[,c(3,4)]
head(newHvalues)
##   Last Name Points
## 1   Jackson     94
## 2       Doe     80
## 3   Johnson     67
## 4     Smith     50
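
Selecting by position works, but selecting by column name is less fragile if the column order changes. The sketch below uses a stand-in data frame (hvaluesDemo and newDemo are hypothetical names) built with factor columns, as in the str() output earlier, and also shows that a factor should be converted through as.character() to recover numeric points:

```r
# Stand-in for the first table; check.names = FALSE keeps the spaces in the names
hvaluesDemo <- data.frame(
  "Number"     = c("1", "2", "3", "4"),
  "First Name" = c("Eve", "John", "Adam", "Jill"),
  "Last Name"  = c("Jackson", "Doe", "Johnson", "Smith"),
  "Points"     = c("94", "80", "67", "50"),
  check.names = FALSE,
  stringsAsFactors = TRUE
)

# Select the two columns by name instead of by position
newDemo <- hvaluesDemo[, c("Last Name", "Points")]
names(newDemo)                            # "Last Name" "Points"

# as.numeric() on a factor returns the internal level codes, not the values
as.numeric(newDemo$Points)                # 4 3 2 1
as.numeric(as.character(newDemo$Points))  # 94 80 67 50
```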

6. Identify another interesting page on the web with HTML table values. This may be somewhat tricky, because while HTML tables are great for web-page scrapers, many HTML designers now prefer creating tables using other methods (such as <div> tags or .png files).

The following code extracts the first table from the Wikipedia page listing the S&P 500 index components:

# set the target URL
tableUrl<-"http://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
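# read the URL; with which = 1 the result is a single data frame (the first table)
tables <- readHTMLTable(tableUrl, which = 1, header = TRUE, stringsAsFactors = TRUE)
# length() of a data frame is its number of columns, not the number of tables on the page
nColumns <- length(tables)
nColumns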
## [1] 8

Here is a sample of the output:

head(tables)
##   Ticker symbol            Security SEC filings            GICS Sector
## 1           MMM          3M Company     reports            Industrials
## 2           ABT Abbott Laboratories     reports            Health Care
## 3          ABBV              AbbVie     reports            Health Care
## 4           ACN       Accenture plc     reports Information Technology
## 5           ACE         ACE Limited     reports             Financials
## 6           ACT         Actavis plc     reports            Health Care
##                  GICS Sub Industry Address of Headquarters
## 1         Industrial Conglomerates     St. Paul, Minnesota
## 2 Health Care Equipment & Services North Chicago, Illinois
## 3                  Pharmaceuticals North Chicago, Illinois
## 4   IT Consulting & Other Services         Dublin, Ireland
## 5    Property & Casualty Insurance     Zurich, Switzerland
## 6                  Pharmaceuticals         Dublin, Ireland
##   Date first added        CIK
## 1                  0000066740
## 2                  0000001800
## 3       2012-12-31 0001551152
## 4       2011-07-06 0001467373
## 5       2010-07-15 0000896159
## 6                  0000884629

7. How many HTML tables does that page contain?

The page technically has three tables. The first lists the current index components, the second records the components added and removed over time, and the third holds the links to the ‘Companies Portal’ and the ‘Lists Portal’. The code above reads only the first table.
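
One way to count the tables in code rather than by inspecting the page is to count the table nodes with an XPath query. The sketch below runs against a made-up three-table HTML string (the markup, its values, and the names html, doc, nTableNodes and firstTable are all assumptions for illustration, not the Wikipedia page itself). It also shows why length() on a result read with which = 1 does not give the table count:

```r
library(XML)

# Hypothetical page with three <table> elements, inlined to avoid network access
html <- paste0(
  "<html><body>",
  "<table><tr><th>Ticker</th><th>Security</th></tr>",
  "<tr><td>AAA</td><td>Example Co</td></tr></table>",
  "<table><tr><th>Date</th><th>Added</th></tr>",
  "<tr><td>2015-01-01</td><td>BBB</td></tr></table>",
  "<table><tr><td>Portal link</td></tr></table>",
  "</body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# Count the <table> elements directly
nTableNodes <- length(getNodeSet(doc, "//table"))
nTableNodes  # 3

# With `which = 1` the result is one data frame, so length() counts its columns
firstTable <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
length(firstTable)  # 2
```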

8. Identify your web browser, and describe (in one or two sentences) how you view HTML page source in your web browser.

Google Chrome was used as the web browser while completing this assignment. To view the HTML page source, go to the URL of interest, right-click on the body of the page once it has loaded, and select ‘View page source’. A new tab will open in the browser with the HTML source code.