Some Data on the Web

http://gccs.surge.sh/tidypres.html

Let’s say we want that “Recent Princes” table.

Package rvest

Attach the rvest package:

library(rvest)

Read in the Page

Here’s the URL we want:

url <- "http://gccs.surge.sh/tidypres.html"

Now we grab the page:

page <- read_html(url)
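
A quick way to confirm that read_html() returned a parsed document is to check its class and pull out the page title. The sketch below is a minimal illustration that works offline: demo_html is made-up stand-in markup, not the real tidypres.html, and the demo_ names are hypothetical.

```r
library(rvest)

# Hypothetical stand-in markup, not the real tidypres.html
demo_html <- "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# read_html() accepts literal HTML as well as a URL or a file path
demo_page <- read_html(demo_html)

# A successfully parsed page is an xml_document
class(demo_page)

# Pull out the <title> text as a sanity check
demo_title <- html_text(html_node(demo_page, "title"))
demo_title
```

The same two checks should apply to the page object read from the real URL.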

Go For the Tables

tables <- html_nodes(page, "table")

Did we get it? Try this:

tables[[2]]
{xml_node}
<table class="table table-hover table-bordered">
 [1] <tr>\n<th>sex</th>\n  \n   <th>count</th>\n   \n  ...
 [2] <tr>\n<td>M</td>\n    \n   <td>73</td>\n      \n  ...
 [3] <tr>\n<td>M</td>\n    \n   <td>92</td>\n      \n  ...
 and so on for more rows ...

Yep, looks like we did!
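
Picking a table by position only works if we know how many tables the page holds and in what order. Here is a small offline sketch of that check. The two-table page is a made-up stand-in (only the sex/count/year values are copied from the output above), and the id attributes and demo_ names are assumptions for illustration.

```r
library(rvest)

# Hypothetical stand-in page with two tables (not the real tidypres.html)
demo_html <- '<html><body>
<table id="first"><tr><th>a</th></tr><tr><td>1</td></tr></table>
<table id="second"><tr><th>sex</th><th>count</th><th>year</th></tr>
<tr><td>M</td><td>73</td><td>1978</td></tr>
<tr><td>M</td><td>92</td><td>1979</td></tr></table>
</body></html>'

demo_page <- read_html(demo_html)

# "table" is a CSS selector matching every <table> element on the page
demo_tables <- html_nodes(demo_page, "table")

n_tables <- length(demo_tables)          # how many tables were found
table_ids <- html_attr(demo_tables, "id") # which table sits at which position

# Select one table by position, as with tables[[2]]
demo_second <- demo_tables[[2]]
```

On a real page the tables may have no id attribute, in which case html_attr() returns NA for them and printing the node, as above, is the simpler check.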

Turn Into a Data Frame

recentPrinces <- html_table(tables[[2]])

Did it work? Try this:

str(recentPrinces)
## 'data.frame':    39 obs. of  3 variables:
##  $ sex  : chr  "M" "M" "M" "M" ...
##  $ count: int  73 92 131 146 137 5 167 206 195 5 ...
##  $ year : int  1978 1979 1980 1981 1982 1983 1983 1984 1985 1986 ...

And try this:

DT::datatable(recentPrinces, options = list(
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)
))
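
The conversion step can also be tried without a network connection. The sketch below feeds html_table() a tiny hand-written table whose three rows are copied from the str() output above. The demo_ names are hypothetical. The as.data.frame() call is there only because newer rvest releases return a tibble rather than a plain data frame.

```r
library(rvest)

# Hand-written stand-in for the "Recent Princes" table (first three rows only)
demo_html <- '<table>
<tr><th>sex</th><th>count</th><th>year</th></tr>
<tr><td>M</td><td>73</td><td>1978</td></tr>
<tr><td>M</td><td>92</td><td>1979</td></tr>
<tr><td>M</td><td>131</td><td>1980</td></tr>
</table>'

demo_table <- html_nodes(read_html(demo_html), "table")[[1]]

# The <th> row becomes the column names; cell text is converted to numbers where possible
demo_df <- as.data.frame(html_table(demo_table))

names(demo_df)          # column names taken from the header row
sapply(demo_df, class)  # column types after conversion
nrow(demo_df)           # one row per <tr> below the header
```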

Analyse

Let’s make a graph:

library(ggplot2)  # ggplot(), aes(), geom_line() and labs() come from ggplot2

ggplot(recentPrinces, aes(x = year, y = count)) +
  geom_line(aes(color = sex)) +  # one line per sex
  labs(x = "Year",
       y = "Number of babies named 'Prince'",
       title = "The name 'Prince' has been getting popular!",
       subtitle = "(for boys, at any rate ...)")
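
A numeric check can back up what the graph suggests. This base-R sketch computes year-over-year changes for the first four rows shown in the str() output above (all "M"). The demo_ names are hypothetical, and four rows are far too few to stand in for the full table.

```r
# First four rows from the str() output above (sex "M")
demo_princes <- data.frame(
  year  = c(1978, 1979, 1980, 1981),
  count = c(73, 92, 131, 146)
)

# diff() gives the change from each year to the next
demo_change <- diff(demo_princes$count)
demo_change

# TRUE when the count rose every year in this slice
all(demo_change > 0)
```

On the full data frame, the same idea would need to be applied separately for each value of sex.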