This is a tutorial about web scraping in R using the rvest package.
if (!require("pacman")) install.packages("pacman")
Loading required package: pacman
pacman::p_load(rvest, dplyr, stringr,DT,tidyr, readxl,knitr,ggplot2)
#library(rvest)
#library(stringr) # deal with string in r
#library(tidyr) # data cleaning
#library(DT) # for printing nice HTML output tables
An HTML file is structured (hierarchical / tree-based).
Everything in an HTML document is a node:
[Figure: HTML tree]
One example HTML file:
<html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>Lesson one</h1>
<p>Hello world!</p>
</body>
</html>
The tree structure:
[Figure: tree structure of the example HTML file]
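The same tree can be explored from R. The following is a minimal sketch, assuming the rvest package is installed; it parses the example file from a character string and walks one level down at a time. The object names (doc, top, body_nodes) are chosen here for illustration.

```r
library(rvest)

# The example HTML file, supplied as a character string
doc <- read_html("<html>
<head><title>This is a title</title></head>
<body><h1>Lesson one</h1><p>Hello world!</p></body>
</html>")

# Children of the root <html> node
top <- html_children(doc)
html_name(top)          # "head" "body"

# Children of <body>: tag names and the text they contain
body_nodes <- html_children(html_node(doc, "body"))
html_name(body_nodes)   # "h1" "p"
html_text(body_nodes)   # "Lesson one" "Hello world!"
```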
rvest is a package from Hadley Wickham that makes basic processing and manipulation of HTML data straightforward.
Core functions:
read_html - read HTML data from a URL or character string.
html_nodes - select specified nodes from the HTML document using CSS selectors.
html_table - parse an HTML table into a data frame.
html_text - extract the text content between tag pairs.
html_name - extract tag names.
html_attrs - extract all of each tag's attributes.
html_attr - extract the value of a single attribute by name.
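To see these functions side by side, here is a small sketch that runs them on an inline HTML snippet rather than a live page. The markup (a menu of links and a price table) is invented for illustration, so the selectors apply only to this snippet.

```r
library(rvest)

# Hypothetical markup, parsed from a character string
page <- read_html('
<div id="menu">
  <a class="item" href="/tea">Tea</a>
  <a class="item" href="/coffee">Coffee</a>
</div>
<table>
  <tr><th>drink</th><th>price</th></tr>
  <tr><td>Tea</td><td>2</td></tr>
  <tr><td>Coffee</td><td>3</td></tr>
</table>')

links <- html_nodes(page, "#menu a.item")  # select nodes with a CSS selector
html_name(links)           # tag names
html_text(links)           # text between the tags
html_attr(links, "href")   # one attribute, by name
html_attrs(links)[[1]]     # all attributes of the first link

# html_table() on a node set returns a list; take the first table
prices <- html_table(html_nodes(page, "table"))[[1]]
prices
```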
SelectorGadget helps us identify the HTML elements of interest. It does this by constructing a CSS selector that can be used to subset the HTML document.
To use it, open the page you want to scrape, then:
Click on the element you want to select. SelectorGadget will make a first guess at the CSS selector you want. It is likely to be bad, since it has only one example to learn from, but it is a start. Elements that match the selector are highlighted in yellow.
Click on elements that should not be selected; they turn red. Click on elements that should be selected; they turn green.
Iterate until only the elements you want are selected. SelectorGadget is not perfect and sometimes cannot find a useful CSS selector. Starting from a different element sometimes helps.
htmlpage <- read_html("http://forecast.weather.gov/MapClick.php?lat=42.31674913306716&lon=-71.42487878862437&site=all&smap=1#.VRsEpZPF84I")
forecasthtml <- html_nodes(htmlpage, ".forecast-text")
forecast <- html_text(forecasthtml)
forecast
[1] "A chance of showers. Cloudy, with a high near 63. West wind around 6 mph. Chance of precipitation is 40%. New precipitation amounts of less than a tenth of an inch possible. "
[2] "A chance of showers, mainly before 7pm. Mostly cloudy, with a low around 43. Northwest wind 5 to 7 mph. Chance of precipitation is 30%. New precipitation amounts of less than a tenth of an inch possible. "
[3] "Mostly sunny, with a high near 52. Northwest wind 6 to 9 mph, with gusts as high as 23 mph. "
[4] "Mostly clear, with a low around 28. Northwest wind around 6 mph becoming calm after midnight. "
[5] "Mostly sunny, with a high near 52. Calm wind becoming west 5 to 9 mph in the morning. "
[6] "Mostly cloudy, with a low around 39."
[7] "Partly sunny, with a high near 49."
[8] "Partly cloudy, with a low around 33."
[9] "Sunny, with a high near 52."
[10] "Mostly clear, with a low around 34."
[11] "Mostly sunny, with a high near 59."
[12] "Partly cloudy, with a low around 41."
[13] "Mostly sunny, with a high near 55."
[14] "Partly cloudy, with a low around 38."
[15] "Mostly sunny, with a high near 56."
The forecasts carry no day labels. To add them, widen the selector to include the bold (b) elements as well:
b , .forecast-text
forecasthtml <- html_nodes(htmlpage, "b , .forecast-text")
#forecasthtml <- html_nodes(htmlpage, "#detailed-forecast-body b , .forecast-text")
forecast <- html_text(forecasthtml)
forecast
[1] "Current conditions at"
[2] "Lat: "
[3] "Lon: "
[4] "Elev: "
[5] "Humidity"
[6] "Wind Speed"
[7] "Barometer"
[8] "Dewpoint"
[9] "Visibility"
[10] "Last update"
[11] "More Information:"
[12] "Extended Forecast for"
[13] "This Afternoon"
[14] "A chance of showers. Cloudy, with a high near 63. West wind around 6 mph. Chance of precipitation is 40%. New precipitation amounts of less than a tenth of an inch possible. "
[15] "Tonight"
[16] "A chance of showers, mainly before 7pm. Mostly cloudy, with a low around 43. Northwest wind 5 to 7 mph. Chance of precipitation is 30%. New precipitation amounts of less than a tenth of an inch possible. "
[17] "Friday"
[18] "Mostly sunny, with a high near 52. Northwest wind 6 to 9 mph, with gusts as high as 23 mph. "
[19] "Friday Night"
[20] "Mostly clear, with a low around 28. Northwest wind around 6 mph becoming calm after midnight. "
[21] "Saturday"
[22] "Mostly sunny, with a high near 52. Calm wind becoming west 5 to 9 mph in the morning. "
[23] "Saturday Night"
[24] "Mostly cloudy, with a low around 39."
[25] "Sunday"
[26] "Partly sunny, with a high near 49."
[27] "Sunday Night"
[28] "Partly cloudy, with a low around 33."
[29] "Monday"
[30] "Sunny, with a high near 52."
[31] "Monday Night"
[32] "Mostly clear, with a low around 34."
[33] "Tuesday"
[34] "Mostly sunny, with a high near 59."
[35] "Tuesday Night"
[36] "Partly cloudy, with a low around 41."
[37] "Wednesday"
[38] "Mostly sunny, with a high near 55."
[39] "Wednesday Night"
[40] "Partly cloudy, with a low around 38."
[41] "Thursday"
[42] "Mostly sunny, with a high near 56."
[43] "Map function requires Javascript and a compatible browser."
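This selects too much. In SelectorGadget, click the highlighted (yellow) elements we do not want; they turn red and are dropped from the selection, which gives a narrower selector: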
#detailed-forecast-body b , .forecast-text
forecasthtml <- html_nodes(htmlpage, "#detailed-forecast-body b , .forecast-text")
forecast <- html_text(forecasthtml)
forecast
[1] "This Afternoon"
[2] "A chance of showers. Cloudy, with a high near 63. West wind around 6 mph. Chance of precipitation is 40%. New precipitation amounts of less than a tenth of an inch possible. "
[3] "Tonight"
[4] "A chance of showers, mainly before 7pm. Mostly cloudy, with a low around 43. Northwest wind 5 to 7 mph. Chance of precipitation is 30%. New precipitation amounts of less than a tenth of an inch possible. "
[5] "Friday"
[6] "Mostly sunny, with a high near 52. Northwest wind 6 to 9 mph, with gusts as high as 23 mph. "
[7] "Friday Night"
[8] "Mostly clear, with a low around 28. Northwest wind around 6 mph becoming calm after midnight. "
[9] "Saturday"
[10] "Mostly sunny, with a high near 52. Calm wind becoming west 5 to 9 mph in the morning. "
[11] "Saturday Night"
[12] "Mostly cloudy, with a low around 39."
[13] "Sunday"
[14] "Partly sunny, with a high near 49."
[15] "Sunday Night"
[16] "Partly cloudy, with a low around 33."
[17] "Monday"
[18] "Sunny, with a high near 52."
[19] "Monday Night"
[20] "Mostly clear, with a low around 34."
[21] "Tuesday"
[22] "Mostly sunny, with a high near 59."
[23] "Tuesday Night"
[24] "Partly cloudy, with a low around 41."
[25] "Wednesday"
[26] "Mostly sunny, with a high near 55."
[27] "Wednesday Night"
[28] "Partly cloudy, with a low around 38."
[29] "Thursday"
[30] "Mostly sunny, with a high near 56."
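The effect of prefixing the selector with an id can be reproduced on a small inline document. This is only a sketch: the markup below is a made-up miniature, not the real forecast page, and the names broad and narrow are chosen for illustration.

```r
library(rvest)

# Hypothetical miniature of a forecast page
page <- read_html('
<div id="current"><b>Humidity</b></div>
<div id="detailed-forecast-body">
  <div><b>Tonight</b><span class="forecast-text">Mostly cloudy.</span></div>
  <div><b>Friday</b><span class="forecast-text">Mostly sunny.</span></div>
</div>')

# Every <b> on the page plus the forecast text: picks up "Humidity" too
broad <- html_text(html_nodes(page, "b , .forecast-text"))

# Only <b> elements inside #detailed-forecast-body, plus the forecast text
narrow <- html_text(html_nodes(page, "#detailed-forecast-body b , .forecast-text"))

broad
narrow
```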
Put them together
paste(forecast, collapse =" ")
[1] "This Afternoon A chance of showers. Cloudy, with a high near 63. West wind around 6 mph. Chance of precipitation is 40%. New precipitation amounts of less than a tenth of an inch possible. Tonight A chance of showers, mainly before 7pm. Mostly cloudy, with a low around 43. Northwest wind 5 to 7 mph. Chance of precipitation is 30%. New precipitation amounts of less than a tenth of an inch possible. Friday Mostly sunny, with a high near 52. Northwest wind 6 to 9 mph, with gusts as high as 23 mph. Friday Night Mostly clear, with a low around 28. Northwest wind around 6 mph becoming calm after midnight. Saturday Mostly sunny, with a high near 52. Calm wind becoming west 5 to 9 mph in the morning. Saturday Night Mostly cloudy, with a low around 39. Sunday Partly sunny, with a high near 49. Sunday Night Partly cloudy, with a low around 33. Monday Sunny, with a high near 52. Monday Night Mostly clear, with a low around 34. Tuesday Mostly sunny, with a high near 59. Tuesday Night Partly cloudy, with a low around 41. Wednesday Mostly sunny, with a high near 55. Wednesday Night Partly cloudy, with a low around 38. Thursday Mostly sunny, with a high near 56."
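A single string loses the link between each period and its forecast. As an alternative sketch, assuming the vector strictly alternates between a label and its description (as in the output above), the two can be paired in a data frame. The shortened forecast vector and the column names period and text are stand-ins for illustration.

```r
# Shortened stand-in for the `forecast` vector: labels and texts alternate
forecast <- c("This Afternoon", "A chance of showers.",
              "Tonight", "Mostly cloudy.",
              "Friday", "Mostly sunny.")

# Odd positions hold the period labels, even positions the descriptions
is_label <- seq_along(forecast) %% 2 == 1

forecast_df <- data.frame(period = forecast[is_label],
                          text = forecast[!is_label],
                          stringsAsFactors = FALSE)
forecast_df
```

If the page ever returns an odd number of elements, the two columns will differ in length and data.frame() will fail, which is a useful early warning that the selector needs another look.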
#seven-day-forecast-list p
forecasthtml <- html_nodes(htmlpage, "#seven-day-forecast-list p")
forecast <- html_text(forecasthtml)
paste(forecast, collapse =" ")
[1] "ThisAfternoon ChanceShowers High: 63 °F Tonight ChanceShowers thenMostly Cloudy Low: 43 °F Friday Mostly Sunny High: 52 °F FridayNight Mostly Clear Low: 28 °F Saturday Mostly Sunny High: 52 °F SaturdayNight Mostly Cloudy Low: 39 °F Sunday Partly Sunny High: 49 °F SundayNight Partly Cloudy Low: 33 °F Monday Sunny High: 52 °F"
#titleCast .itemprop
html <- read_html("http://www.imdb.com/title/tt1490017/")
cast <- html_nodes(html, "#titleCast .itemprop")
length(cast)
[1] 30
cast[1:2]
{xml_nodeset (2)}
[1] <td class="itemprop" itemprop="actor" itemscope="" itemtype="http:// ...
[2] <span class="itemprop" itemprop="name">Will Arnett</span>
Looking carefully at this output, we see twice as many matches as we expected. That’s because we’ve selected both the table cell and the text inside the cell. We can experiment with selectorgadget to find a better match or look at the html directly.
Try the selector #titleCast span.itemprop instead:
cast <- html_nodes(html, "#titleCast span.itemprop")
length(cast)
[1] 15
html_text(cast)
[1] "Will Arnett" "Elizabeth Banks" "Craig Berry"
[4] "Alison Brie" "David Burrows" "Anthony Daniels"
[7] "Charlie Day" "Amanda Farinos" "Keith Ferguson"
[10] "Will Ferrell" "Will Forte" "Dave Franco"
[13] "Morgan Freeman" "Todd Hansen" "Jonah Hill"
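The doubling, and why adding the tag name fixes it, can be reproduced on a small inline table. The markup below is a simplified guess at the shape of the cast table, not the real page source.

```r
library(rvest)

# Hypothetical cast table: both the cell and the inner span carry class "itemprop"
page <- read_html('
<table id="titleCast">
  <tr><td class="itemprop"><a href="#"><span class="itemprop">Will Arnett</span></a></td></tr>
  <tr><td class="itemprop"><a href="#"><span class="itemprop">Elizabeth Banks</span></a></td></tr>
</table>')

both  <- html_nodes(page, "#titleCast .itemprop")      # matches td and span
spans <- html_nodes(page, "#titleCast span.itemprop")  # matches span only

length(both)     # 4
length(spans)    # 2
html_text(spans)
```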
.ratingValue span
score <- html_nodes(html, ".ratingValue span")
length(score)
[1] 3
html_text(score)
[1] "7.8" "/" "10"
Put them together
paste(html_text(score), collapse ="")
[1] "7.8/10"
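For analysis, a number is usually more useful than the pasted string. A minimal sketch, assuming the three pieces always arrive in the order rating, slash, maximum; score_text stands in for html_text(score).

```r
# Stand-in for html_text(score)
score_text <- c("7.8", "/", "10")

rating    <- as.numeric(score_text[1])  # the score itself
scale_max <- as.numeric(score_text[3])  # the top of the scale

rating
rating / scale_max   # rating as a fraction of the maximum
```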
Now Playing (Box Office)
.aux-content-widget-2:nth-child(11) .title a
html <- read_html("http://www.imdb.com/")
playing <- html_nodes(html, ".aux-content-widget-2:nth-child(11) .title a")
length(playing)
[1] 5
playing[1:3]
{xml_nodeset (3)}
[1] <a href="/title/tt5325452?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2495768 ...
[2] <a href="/title/tt3062096?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2495768 ...
[3] <a href="/title/tt3393786?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2495768 ...
Get text
movies= html_text(playing)
movies
[1] " Boo! A Madea Halloween " " Inferno "
[3] " Jack Reacher: Never Go Back " " The Accountant "
[5] " Ouija: Origin of Evil "
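The titles come back padded with spaces. One way to clean them, sketched here on a shortened stand-in vector, is base R's trimws(); html_text() also accepts trim = TRUE, which strips the whitespace at extraction time.

```r
# Stand-in for the `movies` vector above, with its surrounding spaces
movies <- c(" Boo! A Madea Halloween ", " Inferno ")

movies_clean <- trimws(movies)  # drop leading and trailing whitespace
movies_clean
```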
Get link
link=html_attr(playing, "href")
link
[1] "/title/tt5325452?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2495768522&pf_rd_r=0VTJRNSP4VYPD3J9S3AS&pf_rd_s=right-7&pf_rd_t=15061&pf_rd_i=homepage&ref_=hm_cht_t0"
[2] "/title/tt3062096?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2495768522&pf_rd_r=0VTJRNSP4VYPD3J9S3AS&pf_rd_s=right-7&pf_rd_t=15061&pf_rd_i=homepage&ref_=hm_cht_t1"
[3] "/title/tt3393786?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2495768522&pf_rd_r=0VTJRNSP4VYPD3J9S3AS&pf_rd_s=right-7&pf_rd_t=15061&pf_rd_i=homepage&ref_=hm_cht_t2"
[4] "/title/tt2140479?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2495768522&pf_rd_r=0VTJRNSP4VYPD3J9S3AS&pf_rd_s=right-7&pf_rd_t=15061&pf_rd_i=homepage&ref_=hm_cht_t3"
[5] "/title/tt4361050?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2495768522&pf_rd_r=0VTJRNSP4VYPD3J9S3AS&pf_rd_s=right-7&pf_rd_t=15061&pf_rd_i=homepage&ref_=hm_cht_t4"
link = paste0("http://www.imdb.com", link )
link
[1] "http://www.imdb.com/title/tt5325452?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2495768522&pf_rd_r=0VTJRNSP4VYPD3J9S3AS&pf_rd_s=right-7&pf_rd_t=15061&pf_rd_i=homepage&ref_=hm_cht_t0"
[2] "http://www.imdb.com/title/tt3062096?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2495768522&pf_rd_r=0VTJRNSP4VYPD3J9S3AS&pf_rd_s=right-7&pf_rd_t=15061&pf_rd_i=homepage&ref_=hm_cht_t1"
[3] "http://www.imdb.com/title/tt3393786?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2495768522&pf_rd_r=0VTJRNSP4VYPD3J9S3AS&pf_rd_s=right-7&pf_rd_t=15061&pf_rd_i=homepage&ref_=hm_cht_t2"
[4] "http://www.imdb.com/title/tt2140479?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2495768522&pf_rd_r=0VTJRNSP4VYPD3J9S3AS&pf_rd_s=right-7&pf_rd_t=15061&pf_rd_i=homepage&ref_=hm_cht_t3"
[5] "http://www.imdb.com/title/tt4361050?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2495768522&pf_rd_r=0VTJRNSP4VYPD3J9S3AS&pf_rd_s=right-7&pf_rd_t=15061&pf_rd_i=homepage&ref_=hm_cht_t4"
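The links still carry a long tracking query string. As a sketch of one possible cleanup, assuming everything after the "?" is not needed to reach the page, the query can be dropped before the link is made absolute. Here url_absolute() from xml2 is used in place of paste0(); the link value is a shortened stand-in.

```r
library(xml2)

# Shortened stand-in for one element of `link` as returned by html_attr()
link <- "/title/tt5325452?ref_=hm_cht_t0"

# Remove the query string, then resolve against the site root
path <- sub("\\?.*$", "", link)
full <- url_absolute(path, "http://www.imdb.com/")
full
```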
.secondary-text
boxoffice <- html_nodes(html, ".secondary-text")
length(boxoffice)
[1] 16
boxoffice[1:3]
{xml_nodeset (3)}
[1] <span class="secondary-text"/>
[2] <span class="secondary-text"/>
[3] <span class="secondary-text"/>
boxoffice = html_text(boxoffice)
boxoffice = boxoffice[7:11]
boxoffice
[1] "Weekend: $17.2M" "Weekend: $14.9M" "Weekend: $9.6M" "Weekend: $8.5M"
[5] "Weekend: $7.1M"
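The box office figures are still text. A minimal sketch of turning them into numbers, assuming every entry follows the pattern "Weekend: $<amount>M" seen above; the column name millions is chosen for illustration.

```r
# Stand-in for the `boxoffice` vector above
boxoffice <- c("Weekend: $17.2M", "Weekend: $14.9M", "Weekend: $9.6M")

# Keep only the amount between "$" and "M" and convert it to a number
millions <- as.numeric(sub("^Weekend: \\$([0-9.]+)M$", "\\1", boxoffice))
millions
```

Entries that do not match the pattern are left unchanged by sub() and become NA (with a warning) in as.numeric(), so unexpected formats are easy to spot.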
# data_frame() is deprecated; tibble() (re-exported by dplyr) replaces it
imdbdf <- tibble(movie = movies, boxoffice = boxoffice, link = link)
datatable(imdbdf)
We use read_html to read a web page and html_table to extract a table from it. A table gives us better organized data than free text.
url <- 'http://espn.go.com/nfl/superbowl/history/winners'
webpage <- read_html(url)
Next, we use the functions html_nodes and html_table to extract the HTML table element and convert it to a data frame.
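Use ?html_table to check the arguments.
Pay attention to fill = TRUE, which pads rows that have fewer cells than the others with NA. (In rvest 1.0.0 and later the argument is deprecated because missing cells are always filled.)
html_table returns a list of data frames. To take a single element out of a list, such as the first one, use double square brackets: [[1]].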
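Both points can be checked on a small inline table before touching the real page. This is a sketch on made-up markup in which the second data row is missing a cell.

```r
library(rvest)

# Hypothetical table: the last row has only one cell
page <- read_html('
<table>
  <tr><th>game</th><th>site</th></tr>
  <tr><td>I</td><td>Los Angeles</td></tr>
  <tr><td>II</td></tr>
</table>')

tables <- html_table(html_nodes(page, "table"), fill = TRUE)
class(tables)         # "list": one data frame per matched table

first <- tables[[1]]  # [[1]] returns the data frame itself
first                 # the missing cell shows up as NA

sub_list <- tables[1] # [1] returns a list of length one instead
class(sub_list)
```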
sb_table <- html_nodes(webpage, 'table')
str(sb_table)
List of 1
$ :List of 2
..$ node:<externalptr>
..$ doc :<externalptr>
..- attr(*, "class")= chr "xml_node"
- attr(*, "class")= chr "xml_nodeset"
sb <- html_table(sb_table,fill=TRUE)[[1]]
#head(sb)
datatable(sb, caption = 'Table 1: Not clean and tidy data.')
We remove the first two rows, and set the column names.
sb <- sb[-(1:2), ]
names(sb) <- c("number", "date", "site", "result")
#head(sb)
datatable(sb, caption = 'Table 2: Improvement to clean and tidy data.')
It is traditional to use Roman numerals to refer to Super Bowls, but Arabic numerals are more convenient to work with. We will also convert the date to a standard format.
library(lubridate) # easy to parse datetime data
Attaching package: 'lubridate'
The following object is masked from 'package:base':
date
sb$number <- seq_len(nrow(sb)) # 1:50 when the page was scraped; avoids hard-coding the row count
#sb$date <- as.Date(sb$date, "%B. %d, %Y")
sb$date <- mdy(sb$date)
#head(sb)
datatable(sb, caption = 'Table 3: Improvement to clean and tidy data.')
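Numbering the rows by position works only if the table is complete and in order. As an alternative sketch, the Roman numerals themselves can be converted with as.roman() from the utils package, which ships with R. The roman vector below is a stand-in for the scraped column.

```r
# Stand-in for the scraped Roman numerals
roman <- c("I", "II", "XLIX", "L")

# as.roman() parses the numerals; as.integer() gives plain numbers
number <- as.integer(utils::as.roman(roman))
number   # 1 2 49 50
```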
The result column should be split into four columns: the winning team's name, the winner's score, the losing team's name, and the loser's score. We start by splitting the result column into two columns at the comma, using the separate function from the tidyr package.
sb <- separate(sb, result, c('winner', 'loser'), sep=', ', remove=TRUE)
#head(sb)
datatable(sb, caption = 'Table 4: Clean and tidy data.')
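The remaining step, separating each team name from its score, is not shown above. One possible sketch in base R follows, assuming every entry has the form "<team name> <score>" with the score as the final run of digits. The helper split_team_score and the two sample rows are introduced here for illustration only.

```r
# Stand-in rows in the assumed "<team name> <score>" format
sb <- data.frame(winner = c("Green Bay 35", "New York Jets 16"),
                 loser = c("Kansas City 10", "Baltimore 7"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: split "Team Name 35" into a name and an integer score
split_team_score <- function(x) {
  data.frame(team = sub(" +[0-9]+$", "", x),
             score = as.integer(sub("^.* +([0-9]+)$", "\\1", x)),
             stringsAsFactors = FALSE)
}

w <- split_team_score(sb$winner)
l <- split_team_score(sb$loser)

sb_tidy <- data.frame(winner = w$team, winner_score = w$score,
                      loser = l$team, loser_score = l$score,
                      stringsAsFactors = FALSE)
sb_tidy
```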
read_html, html_nodes, html_table.
[[1]] to access the first element in a list.
datatable in library(DT) to print out a pretty table.