Code based on Jared Lander’s R for Everyone:
# load the XML library
library('XML')
## Warning: package 'XML' was built under R version 3.1.3
# set the target URL
theURL <- "http://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/"
# read the URL
bowlPool <- readHTMLTable(theURL, which = 1, header = FALSE, stringsAsFactors = FALSE)
# display the result
bowlPool
## V1 V2 V3
## 1 Participant 1 Giant A Patriot Q
## 2 Participant 2 Giant B Patriot R
## 3 Participant 3 Giant C Patriot S
## 4 Participant 4 Giant D Patriot T
## 5 Participant 5 Giant E Patriot U
## 6 Participant 6 Giant F Patriot V
## 7 Participant 7 Giant G Patriot W
## 8 Participant 8 Giant H Patriot X
## 9 Participant 9 Giant I Patriot Y
## 10 Participant 10 Giant J Patriot Z
# display the data structure
str(bowlPool)
## 'data.frame': 10 obs. of 3 variables:
## $ V1: chr "Participant 1" "Participant 2" "Participant 3" "Participant 4" ...
## $ V2: chr "Giant A" "Giant B" "Giant C" "Giant D" ...
## $ V3: chr "Patriot Q" "Patriot R" "Patriot S" "Patriot T" ...
We can see that bowlPool is a data.frame.
2. Suppose instead you call readHTMLTable() with just the URL argument, against the provided URL, as shown below
# set the target URL
theURL <- "http://www.w3schools.com/html/html_tables.asp"
# read the URL
hvalues <- readHTMLTable(theURL)
What is the type of variable returned in hvalues?
# display the structure of the data
str(hvalues)
## List of 6
## $ NULL:'data.frame': 4 obs. of 4 variables:
## ..$ Number : Factor w/ 4 levels "1","2","3","4": 1 2 3 4
## ..$ First Name: Factor w/ 4 levels "Adam","Eve","Jill",..: 2 4 1 3
## ..$ Last Name : Factor w/ 4 levels "Doe","Jackson",..: 2 1 3 4
## ..$ Points : Factor w/ 4 levels "50","67","80",..: 4 3 2 1
## $ NULL: NULL
## $ NULL: NULL
## $ NULL: NULL
## $ NULL: NULL
## $ NULL:'data.frame': 10 obs. of 2 variables:
## ..$ Tag : Factor w/ 10 levels "<caption>","<col>",..: 4 8 10 6 1 3 2 9 5 7
## ..$ Description: Factor w/ 10 levels "Defines a cell in a table",..: 4 2 3 1 5 9 10 8 6 7
We can see that hvalues is a list with six elements: two data.frames and four NULL entries.
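The columns in hvalues are factors, while those in bowlPool are character vectors, most likely because the second call did not set stringsAsFactors. The sketch below reproduces the same difference in base R with a small made-up data frame; the names asFactor and asChar and the values are illustrative only, not scraped from the page.

```r
# Hypothetical stand-in data; not read from any web page
first <- c("Eve", "John", "Adam", "Jill")

# strings converted to factors, as in the str(hvalues) output
asFactor <- data.frame(name = first, stringsAsFactors = TRUE)

# strings kept as character vectors, as in the bowlPool call
asChar <- data.frame(name = first, stringsAsFactors = FALSE)

class(asFactor$name)
## [1] "factor"
class(asChar$name)
## [1] "character"
```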
3. Write R code that shows how many HTML tables are represented in hvalues
# determine how many HTML tables are represented in hvalues:
# summary() reports the length of each list element, and NULL entries have length 0
summaryHvalues <- summary(hvalues)
nTables <- sum(summaryHvalues[, 1] != "0")
nTables
## [1] 2
We can see that there are 2 tables that have a non-zero length.
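A more direct way to get the same count is to test each list element with is.null(). The sketch below uses a hypothetical list, mockTables, built to mirror the shape of hvalues (two data frames and four NULL entries), so it runs without fetching the page.

```r
# Hypothetical stand-in for hvalues: two data frames and four NULL entries
mockTables <- list(
  data.frame(Number = 1:4),
  NULL, NULL, NULL, NULL,
  data.frame(Tag = c("<table>", "<tr>"))
)

# TRUE for every element that actually holds a table
isTable <- !vapply(mockTables, is.null, logical(1))

# count the non-NULL elements
nMockTables <- sum(isTable)
nMockTables
## [1] 2

# keep only the parsed tables
parsedTables <- mockTables[isTable]
length(parsedTables)
## [1] 2
```

Applied to the real list, the same expression would be sum(!vapply(hvalues, is.null, logical(1))).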
4. Modify the readHTMLTable code so that just the table with Number, FirstName, LastName, and Points is returned into a dataframe
# read in just the table with Number, First Name, Last Name, and Points
hvalues <- readHTMLTable(theURL, which = 1)
head(hvalues)
## Number First Name Last Name Points
## 1 1 Eve Jackson 94
## 2 2 John Doe 80
## 3 3 Adam Johnson 67
## 4 4 Jill Smith 50
5. Modify the returned data frame so only the Last Name and Points columns are shown.
# keep only the Last Name and Points columns
newHvalues <- hvalues[, c(3, 4)]
head(newHvalues)
## Last Name Points
## 1 Jackson 94
## 2 Doe 80
## 3 Johnson 67
## 4 Smith 50
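Selecting columns by name instead of position keeps working if the column order changes. Because the scraped columns are factors, Points also needs converting before any arithmetic. The sketch below shows both steps on a hypothetical data frame, mockScores, built to resemble the scraped table; it is an illustration, not the page's actual data structure.

```r
# Hypothetical stand-in for the scraped table; columns are factors,
# as in the str(hvalues) output
mockScores <- data.frame(
  Number = factor(1:4),
  `First Name` = factor(c("Eve", "John", "Adam", "Jill")),
  `Last Name` = factor(c("Jackson", "Doe", "Johnson", "Smith")),
  Points = factor(c(94, 80, 67, 50)),
  check.names = FALSE
)

# select columns by name rather than by position
lastAndPoints <- mockScores[, c("Last Name", "Points")]

# convert factor -> character -> numeric; calling as.numeric() directly
# on a factor would return the level codes instead of the values
lastAndPoints$Points <- as.numeric(as.character(lastAndPoints$Points))
mean(lastAndPoints$Points)
## [1] 72.75
```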
6. Identify another interesting page on the web with HTML table values. This may be somewhat tricky, because while HTML tables are great for web-page scrapers, many HTML designers now prefer creating tables using other methods (such as <div> tags or .png files).
The following code extracts the first table from the Wikipedia page on the S&P 500 index components:
# set the target URL
tableUrl<-"http://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
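# read only the first table on the page
tables <- readHTMLTable(tableUrl, which = 1, header = TRUE, stringsAsFactors = TRUE)
# tables is a single data.frame, so length() gives its number of columns
nCols <- length(tables)
nCols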
## [1] 8
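The 8 printed above is a column count, not a table count: with which = 1 the call returns a single data frame, and length() of a data frame is its number of columns. A small base-R sketch of the distinction, using made-up objects mockDf and mockList:

```r
# Hypothetical data frame with 3 rows and 2 columns
mockDf <- data.frame(
  symbol = c("MMM", "ABT", "ABBV"),
  sector = c("Industrials", "Health Care", "Health Care")
)

# for a data frame, length() is the same as ncol()
length(mockDf)
## [1] 2
ncol(mockDf)
## [1] 2
nrow(mockDf)
## [1] 3

# for a list of data frames, length() counts the data frames
mockList <- list(mockDf, mockDf)
length(mockList)
## [1] 2
```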
Here is a sample of the output:
head(tables)
## Ticker symbol Security SEC filings GICS Sector
## 1 MMM 3M Company reports Industrials
## 2 ABT Abbott Laboratories reports Health Care
## 3 ABBV AbbVie reports Health Care
## 4 ACN Accenture plc reports Information Technology
## 5 ACE ACE Limited reports Financials
## 6 ACT Actavis plc reports Health Care
## GICS Sub Industry Address of Headquarters
## 1 Industrial Conglomerates St. Paul, Minnesota
## 2 Health Care Equipment & Services North Chicago, Illinois
## 3 Pharmaceuticals North Chicago, Illinois
## 4 IT Consulting & Other Services Dublin, Ireland
## 5 Property & Casualty Insurance Zurich, Switzerland
## 6 Pharmaceuticals Dublin, Ireland
## Date first added CIK
## 1 0000066740
## 2 0000001800
## 3 2012-12-31 0001551152
## 4 2011-07-06 0001467373
## 5 2010-07-15 0000896159
## 6 0000884629
7. How many HTML tables does that page contain?
The page technically has 3 tables. The first has the current index components, the second records additions to and removals from the index over time, and the third has the links to the ‘Companies Portal’ and the ‘Lists Portal’. The code above reads only the first table.
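The table count can also be checked in code by counting the table elements in the parsed document. The sketch below does this offline against a small made-up HTML string, so the result does not depend on the live page; the string and the names doc, tableNodes, and nPageTables are assumptions for illustration.

```r
library(XML)

# Hypothetical two-table page, supplied as a string
html <- "<html><body>
<table><tr><th>Ticker</th><th>Security</th></tr>
<tr><td>MMM</td><td>3M Company</td></tr></table>
<table><tr><th>Date</th><th>Added</th></tr>
<tr><td>2012-12-31</td><td>ABBV</td></tr></table>
</body></html>"

# parse the string rather than fetching a URL
doc <- htmlParse(html, asText = TRUE)

# collect every <table> element in the document
tableNodes <- getNodeSet(doc, "//table")
nPageTables <- length(tableNodes)
nPageTables
## [1] 2
```

Against the real page, calling readHTMLTable(tableUrl) without which should return a list with one element per table, as it did for hvalues, so its length would give a comparable count.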
8. Identify your web browser, and describe (in one or two sentences) how you view HTML page source in your web browser.
Google Chrome was used as the web browser while completing this assignment. To view the HTML page source, go to the URL of interest, right-click on the body of the page once it has loaded, and select ‘View page source’. A new tab will open in the browser with the HTML source code.