607-week10-assignment.R

library(htmltools)

## Warning: package 'htmltools' was built under R version 3.1.3

library(htmlTable)

## Warning: package 'htmlTable' was built under R version 3.1.3

library(XML)

## Warning: package 'XML' was built under R version 3.1.3

url <- "http://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tbls <- readHTMLTable(url)
sapply(tbls, nrow)

## $`NULL`
## [1] 247
## 
## $`NULL`
## [1] 1
## 
## $`NULL`
## NULL
## 
## $`NULL`
## [1] 18
## 
## $`NULL`
## [1] 16

pop <- readHTMLTable(url, which = 1, header = TRUE, stringsAsFactors = FALSE)

1. What type of data structure is pop? It is a data frame: each column can hold a different mode of data (numeric, character, and so on), although here every column was read in as character.

str(pop)

## 'data.frame':    247 obs. of  6 variables:
##  $ Rank                            : chr  "1" "2" "3" "4" ...
##  $ Country (or dependent territory): chr  "China[Note 2]" "India" "United States" "Indonesia" ...
##  $ Population                      : chr  "1,369,120,000" "1,269,500,000" "320,640,000" "255,461,700" ...
##  $ Date                            : chr  "April 5, 2015" "April 5, 2015" "April 5, 2015" "July 1, 2015" ...
##  $ % of world
## population          : chr  "18.9%" "17.5%" "4.4%" "3.53%" ...
##  $ Source                          : chr  "Official population clock" "Official Population clock" "Official population clock" "Official estimate" ...
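Because every column of pop is character, the population figures need cleaning before any arithmetic. A minimal sketch of one way to do that, using a small hand-built stand-in (`pop_sample` is a hypothetical name; its values are copied from the str() output above) so it runs without network access:

```r
# Hypothetical stand-in for the first rows of `pop`, copied from the str() output
pop_sample <- data.frame(
  Country    = c("China[Note 2]", "India", "United States", "Indonesia"),
  Population = c("1,369,120,000", "1,269,500,000", "320,640,000", "255,461,700"),
  stringsAsFactors = FALSE
)

# Remove the thousands separators, then convert the strings to numeric
pop_sample$Population <- as.numeric(gsub(",", "", pop_sample$Population, fixed = TRUE))

# Drop bracketed footnote markers such as "[Note 2]" from the country names
pop_sample$Country <- gsub("\\[[^]]*\\]", "", pop_sample$Country)

str(pop_sample)
```

The same two gsub() calls should carry over to the full pop data frame, assuming its other rows follow the same formatting.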

2. Suppose instead you call readHTMLTable() with just the URL argument, against the provided URL, as shown below:

theURL <- "http://www.w3schools.com/html/html_tables.asp"
hvalues <- readHTMLTable(theURL)
str(hvalues)

## List of 6
##  $ NULL:'data.frame':    4 obs. of  4 variables:
##   ..$ Number    : Factor w/ 4 levels "1","2","3","4": 1 2 3 4
##   ..$ First Name: Factor w/ 4 levels "Adam","Eve","Jill",..: 2 4 1 3
##   ..$ Last Name : Factor w/ 4 levels "Doe","Jackson",..: 2 1 3 4
##   ..$ Points    : Factor w/ 4 levels "50","67","80",..: 4 3 2 1
##  $ NULL: NULL
##  $ NULL: NULL
##  $ NULL: NULL
##  $ NULL: NULL
##  $ NULL:'data.frame':    10 obs. of  2 variables:
##   ..$ Tag        : Factor w/ 10 levels "<caption>","<col>",..: 4 8 10 6 1 3 2 9 5 7
##   ..$ Description: Factor w/ 10 levels "Defines a cell in a table",..: 4 2 3 1 5 9 10 8 6 7

What is the type of variable returned in hvalues?

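The readHTMLTable() function reads data from one or more HTML tables in an HTML document. Called with only the URL, it returns a list with one element per table node in the page; here that is a list of 6, in which two elements are data frames and four are NULL. Inside the data frames, every column is a factor (a categorical variable), because stringsAsFactors was not set to FALSE.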
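The container type and the column types can be checked separately with class(). A minimal sketch, assuming a hand-built stand-in (`hv_sample` is hypothetical) that mimics the shape of hvalues, data frames mixed with NULL entries:

```r
# Hypothetical stand-in mimicking the shape of `hvalues`
hv_sample <- list(
  data.frame(Number = factor(1:4), Points = factor(c(94, 80, 67, 50))),
  NULL,
  data.frame(Tag = factor(c("<table>", "<th>")))
)

class(hv_sample)               # type of the container
sapply(hv_sample, class)       # type of each element in the list
sapply(hv_sample[[1]], class)  # type of each column in the first table
```

Running the same three calls on the real hvalues would distinguish the list, its data frame elements, and their factor columns.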

3. Write R code that shows how many HTML tables are represented in hvalues. There are two tables, one with 4 rows and the other with 10 rows; the remaining four list elements are NULL.

sapply(hvalues,nrow)

## $`NULL`
## [1] 4
## 
## $`NULL`
## NULL
## 
## $`NULL`
## NULL
## 
## $`NULL`
## NULL
## 
## $`NULL`
## NULL
## 
## $`NULL`
## [1] 10
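sapply(hvalues, nrow) lists the NULL entries as well, so the table count still has to be read off by eye. A minimal sketch of counting only the non-NULL elements, using a hypothetical stand-in list (`hv_sample`) with the same pattern of entries as the output above:

```r
# Hypothetical stand-in: two data frames (4 and 10 rows) and four NULL entries
hv_sample <- list(
  data.frame(x = 1:4), NULL, NULL, NULL, NULL, data.frame(y = 1:10)
)

# TRUE for elements that actually hold a table
is_table <- !vapply(hv_sample, is.null, logical(1))

n_tables <- sum(is_table)                              # number of real tables
rows_per_table <- vapply(hv_sample[is_table], nrow, integer(1))

n_tables
rows_per_table
```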

4. Modify the readHTMLTable code so that just the table with Number, FirstName, LastName, and Points is returned into a dataframe

modhvalues <- readHTMLTable(theURL, which = 1, header = TRUE, stringsAsFactors = FALSE)

head(modhvalues)

##   Number First Name Last Name Points
## 1      1        Eve   Jackson     94
## 2      2       John       Doe     80
## 3      3       Adam   Johnson     67
## 4      4       Jill     Smith     50
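The effect of the which argument can be reproduced offline. The sketch below assumes a small two-table page held in a string (`page` and its contents are hypothetical) and parses it with htmlParse() before handing the document to readHTMLTable():

```r
library(XML)

# Hypothetical two-table page, kept in a string so no network access is needed
page <- "<html><body>
<table>
  <tr><th>Last Name</th><th>Points</th></tr>
  <tr><td>Jackson</td><td>94</td></tr>
  <tr><td>Doe</td><td>80</td></tr>
</table>
<table>
  <tr><th>Tag</th><th>Description</th></tr>
  <tr><td>th</td><td>Defines a header cell in a table</td></tr>
</table>
</body></html>"

doc <- htmlParse(page, asText = TRUE)

# which = 1 returns the first table as a data frame rather than a list of tables
first_tbl <- readHTMLTable(doc, which = 1, header = TRUE, stringsAsFactors = FALSE)
str(first_tbl)
```

Changing which to 2 should return the second table instead.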

5. Modify the returned data frame so only the Last Name and Points columns are shown.

lphvalues <- modhvalues[,c(3,4)]
head(lphvalues)

##   Last Name Points
## 1   Jackson     94
## 2       Doe     80
## 3   Johnson     67
## 4     Smith     50
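Selecting columns by position breaks if the page's column order ever changes; selecting by name is a little more robust. A minimal sketch, assuming a hand-built copy of the table shown above (`mod_sample` and `lp_sample` are hypothetical names):

```r
# Hypothetical stand-in for `modhvalues`, copied from the printed output
mod_sample <- data.frame(
  Number       = c("1", "2", "3", "4"),
  `First Name` = c("Eve", "John", "Adam", "Jill"),
  `Last Name`  = c("Jackson", "Doe", "Johnson", "Smith"),
  Points       = c("94", "80", "67", "50"),
  check.names = FALSE,      # keep the spaces in the column names
  stringsAsFactors = FALSE
)

# Select by column name instead of by position
lp_sample <- mod_sample[, c("Last Name", "Points")]

# Points was read as character; convert it before doing arithmetic
lp_sample$Points <- as.numeric(lp_sample$Points)

lp_sample
```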

**6 Identify another interesting page on the web with HTML table values. This may be somewhat tricky, because while HTML tables are great for web-page scrapers, many HTML designers now prefer creating tables using other methods (such as `<div>` tags or .png files).**

url2 <- "http://en.wikipedia.org/wiki/Fortune_500_Computer_Software_and_Information_Company"

fortune <- readHTMLTable(url2)

str(fortune)

## List of 1
##  $ NULL:'data.frame':    28 obs. of  7 variables:
##   ..$ Company     : Factor w/ 28 levels "Amazon.com","Apple Inc",..: 18 21 24 2 13 8 15 5 23 19 ...
##   ..$ Type        : Factor w/ 5 levels "Computer Peripherals",..: 5 5 5 2 2 2 2 2 2 2 ...
##   ..$ 2014 ranking: Factor w/ 28 levels "--","[1]","128[6]",..: 13 28 16 23 5 2 25 27 1 19 ...
##   ..$ 2013 ranking: Factor w/ 28 levels "--","131","133",..: 12 28 14 26 4 23 24 27 1 17 ...
##   ..$ 2012 ranking: Factor w/ 26 levels "--","10","127",..: 13 26 14 6 2 17 22 24 1 18 ...
##   ..$ 2011 ranking: Factor w/ 25 levels "--","--[2]","11",..: 14 25 15 12 3 16 21 22 2 19 ...
##   ..$ 2010 ranking: Factor w/ 23 levels "--","10","100",..: 17 5 16 21 2 19 23 22 10 20 ...
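The ranking columns came back as factors whose levels include footnote markers ("128[6]") and placeholders ("--"), so calling as.numeric() on them directly would return level codes rather than rankings. A minimal sketch of one way to clean such a column, using a hypothetical vector (`rank_sample`) built from level values visible in the output above:

```r
# Hypothetical stand-in for one ranking column, using level values shown above
rank_sample <- factor(c("128[6]", "--", "131", "10"))

# Pitfall: this gives the internal level codes, not the rankings
as.numeric(rank_sample)

# Convert to character first, then strip bracketed footnote markers
clean <- gsub("\\[[^]]*\\]", "", as.character(rank_sample))

# Treat the "--" placeholder (and anything left empty) as missing
clean[clean %in% c("--", "")] <- NA

rank_num <- as.numeric(clean)
rank_num
```

Alternatively, passing stringsAsFactors = FALSE to readHTMLTable(), as in the earlier calls, avoids the factor step altogether.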

7 How many HTML tables does that page contain? Just one: str(fortune) reports a list of 1.

8 Identify your web browser, and describe (in one or two sentences) how you view HTML page source in your web browser. Google Chrome: right-click on the page, then choose the "View page source" option from the menu.

9 (Optional challenge exercise) Instead of using readHTMLTable from the XML package, use the functionality in the rvest package to perform the same task. Which method do you prefer? Why might one prefer one package over the other? I gave up: I could not work out a CSS selector or XPath expression that worked in the html_nodes() function.

library(rvest)

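# read_html() is the current name for the parser that older rvest versions called html()
wikiurl <- read_html("http://en.wikipedia.org/wiki/Fortune_500_Computer_Software_and_Information_Company")

# Collect the text of every header cell; assign with <- rather than piping into theader
theader <- wikiurl %>%
  html_nodes(xpath = "//table//th") %>%
  html_text()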
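For comparison with readHTMLTable(), rvest can parse a whole table with html_table(). This is a minimal sketch under stated assumptions: a recent rvest in which read_html() accepts literal HTML, and a small hypothetical inline table (`page`) standing in for the Wikipedia page:

```r
library(rvest)

# Hypothetical inline page, so the example runs without network access
page <- read_html("<html><body><table>
  <tr><th>Last Name</th><th>Points</th></tr>
  <tr><td>Jackson</td><td>94</td></tr>
  <tr><td>Doe</td><td>80</td></tr>
</table></body></html>")

# A CSS selector finds the table nodes; html_table() parses each into a data frame
tables <- page %>% html_nodes("table") %>% html_table()

length(tables)   # number of tables found on the page
tables[[1]]      # the first table

# The XPath route from the attempt above returns just the header text
headers <- page %>% html_nodes(xpath = "//table//th") %>% html_text()
headers
```

The CSS selector "table" plays the role of the XPath "//table", which may be the easier starting point when XPath syntax is unfamiliar.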


week10_assignment

Jamey Etherton

Thursday, April 02, 2015
