https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html
The goal of RSelenium is to make it easy to connect to a Selenium Server from within R. RSelenium provides R bindings for the Selenium WebDriver API. Selenium is a project focused on automating web browsers. RSelenium allows you to carry out unit testing on your webpages across a range of browsers.
Selenium Server is a standalone Java program which allows you to run HTML test suites in a range of different browsers, plus extra options like reporting.
You may, or may not, need to run a Selenium Server, depending on how you intend to use Selenium WebDriver (RSelenium).
If you intend to drive a browser on the same machine that RSelenium is running on, you will need to have a Selenium Server running on that machine.
rsDriver
The rsDriver function manages the binaries needed for running a Selenium Server. If you get an error message like Error in if (file.access(phantompath, 1) < 0) { : 인자의 길이가 0입니다 ("argument is of length zero"), you may want to remove the "phantomjs" binaries and run the binaries with the selenium function from the wdman package. To do so, run the following commands:
binman::rm_platform("phantomjs")
wdman::selenium(retcommand = TRUE)
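If you prefer to manage the server yourself, here is a minimal sketch (assuming Chrome and port 4444, neither of which is fixed by the text above) that starts the server with wdman::selenium() and connects to it with remoteDriver():
library(wdman)
library(RSelenium)

# start a Selenium server process (port 4444 is an assumption; any free port works)
server <- wdman::selenium(port = 4444L)

# connect a client to that server and open a browser
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L, browserName = "chrome")
remDr$open()

# close the browser and stop the server when finished
remDr$close()
server$stop()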
RSelenium's main entry point here is the rsDriver function. To start a server and connect to it, call rsDriver with appropriate options; it returns both a server and a client (a remoteDriver object).
library(RSelenium)
rD <- rsDriver(port = 9468L, browser = "chrome", chromever = "87.0.4280.88") # any free four-digit port, written as an integer literal (ending in L); avoid ports that may already be in use, such as 4567L or 6789L
remDr <- rD[["client"]]
rsDriver starts a Selenium server and a browser.
remDr should now have a connection to the Selenium Server. You can query the status of the remote server using the getStatus method:
remDr$getStatus()
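getStatus() returns a list describing the server; a quick sketch of inspecting it (which fields are present depends on the Selenium server version):
status <- remDr$getStatus()
names(status)          # see which fields this server reports
status$build$version   # server version, on servers that report a "build" field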
To open or close a controlled browser, use the open and close methods:
remDr$open()
remDr$close()
To navigate to a web page by its URL:
remDr$navigate("https://www.youtube.com")
# search for "코로나" (corona)
# sort the results by "조회수" (view count), filtered to the last 24 hours
remDr$navigate("https://www.youtube.com/results?search_query=코로나&sp=CAMSAggC")
The search results keep loading as we scroll down, so we can send scrolling key events to the page body element to load the rest of the page. We can send page_up, page_down, or end keys to the browser; remDr$refresh() reloads the page.
# find the page body element
webElem <- remDr$findElement("xpath", '/html/body')
# scroll down one screen
webElem$sendKeysToElement(list(key = "page_down"))
# jump to the bottom of the currently loaded page
webElem$sendKeysToElement(list(key = "end"))
# scroll up one screen
webElem$sendKeysToElement(list(key = "page_up"))
Make sure to sleep for a few seconds after scrolling, because it takes time for the new content to load.
# system sleep
Sys.sleep(5) # in seconds
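A fixed sleep is the simplest approach. As a hedged alternative, you could poll until the number of loaded results stops growing; the helper below is written here purely for illustration and is not part of RSelenium:
# hypothetical helper: keep checking until the count of video titles stops growing
# (or a timeout passes), instead of sleeping for a fixed time
wait_for_results <- function(remDr, xpath = '//*[@id="video-title"]', timeout = 15, interval = 2) {
  waited <- 0
  last_n <- length(remDr$findElements(using = "xpath", xpath))
  while (waited < timeout) {
    Sys.sleep(interval)
    waited <- waited + interval
    n <- length(remDr$findElements(using = "xpath", xpath))
    if (n == last_n) break  # nothing new loaded since the last check
    last_n <- n
  }
  invisible(last_n)
}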
Scroll down multiple times
remDr$navigate("https://www.youtube.com/results?search_query=%EC%BD%94%EB%A1%9C%EB%82%98&sp=CAMSBAgCEAE%253D")
webElem <- remDr$findElement("xpath", "/html/body")
# Scroll down 10 times
for(i in 1:10){
webElem$sendKeysToElement(list(key = "end"))
# please make sure to sleep a couple of seconds because it takes time to load contents
Sys.sleep(5)
}
Click a Single Element by XPath:
'//*[@id="video-title"]'
remDr$navigate("https://www.youtube.com/results?search_query=%EC%BD%94%EB%A1%9C%EB%82%98&sp=CAMSBAgCEAE%253D")
# locate the element using XPath (findElement returns the first match)
webElem <- remDr$findElement(using = "xpath", '//*[@id="video-title"]')
webElem$getElementText()[[1]][1] # extract the element's text
webElem$clickElement()
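findElement() returns only the first match. To work with every matching title on the page, here is a short sketch using findElements() on the same XPath:
# find all matching elements instead of just the first
webElems <- remDr$findElements(using = "xpath", '//*[@id="video-title"]')
length(webElems) # number of titles currently loaded
titles <- sapply(webElems, function(el) el$getElementText()[[1]])
head(titles)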
Most data on the web is available at large scale as HTML. However, it is often not in a form that is useful for analysis: HTML is hierarchical (tree-structured) rather than tabular.
rvest is a very useful R library that helps you collect information from web pages. It is designed to work with the tidyverse and is inspired by Python web-scraping libraries like Beautiful Soup.
Let’s take a look at some important functions in rvest:
Function | Description
---|---
read_html() | Create an HTML document from a URL
html_nodes(doc, "table td") | Select parts of a document using CSS selectors
html_nodes(doc, xpath = "//table//td") | Select parts of a document using XPath selectors
html_text() | Extract text from an HTML document
html_attr() | Get a single HTML attribute
html_attrs() | Get all HTML attributes
html_table() | Parse HTML tables into a data frame
repair_encoding() | Detect and repair problems regarding encoding
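As a quick, self-contained illustration of a few of these functions, here is a sketch on a tiny HTML string made up for this example (not taken from YouTube):
library(rvest)

# a made-up HTML snippet, for illustration only
doc <- read_html('<table><tr><td><a href="https://example.com">42</a></td></tr></table>')

doc %>% html_nodes("table td") %>% html_text()                      # "42"
doc %>% html_nodes(xpath = "//table//td//a") %>% html_attr("href")  # "https://example.com"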
Let’s scrape the URLs of the most-viewed video clips about “코로나” posted within the last 24 hours on YouTube. We can then visit each clip to find out how many views it has.
We can find the video URLs on the YouTube search-results page, using the following URL (shown percent-encoded and in its readable form):
https://www.youtube.com/results?search_query=%EC%BD%94%EB%A1%9C%EB%82%98&sp=CAMSAggC
https://www.youtube.com/results?search_query=코로나&sp=CAMSAggC
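The two URLs are equivalent; the Korean query is simply percent-encoded in the first form. You can check this in R with base utils::URLencode() (assuming a UTF-8 locale):
utils::URLencode("코로나", reserved = TRUE)
# [1] "%EC%BD%94%EB%A1%9C%EB%82%98"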
To collect the URLs of the video clips, we will start from this landing-page URL of the site.
The following code loads the rvest library:
#install.packages("rvest")
library(rvest)
Let’s say we want to collect the view count of each video clip returned by our search query. To do so, we first need to collect the URLs of the clips.
In the following part, we will write XPath rules to locate this information, and then write an R script to collect it.
First, we will write XPath rules to collect information about the title of each clip and its URL.
Let’s navigate to the landing page of YouTube.com. As we practiced in previous classes, we will use the Chrome Developer Tools to create and test XPath rules.
Let’s write an XPath rule to get the title:
//*[@id="video-title"]
# grab the rendered page source from the browser and parse it with rvest
page <- remDr$getPageSource()[[1]]
html <- read_html(page)
library(tidyverse)
# the title and href attributes of every matching video-title node
html %>% html_nodes(xpath='//*[@id="video-title"]') %>% html_attr('title')
html %>% html_nodes(xpath='//*[@id="video-title"]') %>% html_attr('href')
webElem <- remDr$findElement("xpath", "/html/body")
# Scroll down 2 times
for (i in 1:2) {
  webElem$sendKeysToElement(list(key = "end"))
  # sleep a couple of seconds so the new content has time to load
  Sys.sleep(2)
}
page <- remDr$getPageSource()[[1]]
html <- read_html(page)
html %>% html_nodes(xpath='//*[@id="video-title"]') %>% html_attr('title')
html %>% html_nodes(xpath='//*[@id="video-title"]') %>% html_attr('href')
library(tidyverse)
titles <- html %>% html_nodes(xpath='//*[@id="video-title"]') %>% html_attr('title')
urls <- html %>% html_nodes(xpath='//*[@id="video-title"]') %>% html_attr('href')
corona_yt <- tibble(Title = titles, URL=str_c("https://www.youtube.com", urls, sep=""))
corona_yt
XPath rule for view count: '//*[@id="count"]/yt-view-count-renderer/span[1]'
corona_yt$URL[1]
# navigate to the first video and extract its view-count text
remDr$navigate(corona_yt$URL[1])
page <- remDr$getPageSource()[[1]]
html <- read_html(page)
view <- html %>% html_nodes(xpath='//*[@id="count"]/yt-view-count-renderer/span[1]') %>% html_text()
view
# initialize a View column, then fill it in by visiting each video page
corona_yt$View <- NA
for (i in 1:length(corona_yt$URL)) {
  remDr$navigate(corona_yt$URL[i])
  Sys.sleep(5)
  page <- remDr$getPageSource()[[1]]
  html <- read_html(page)
  corona_yt$View[i] <- html %>% html_nodes(xpath='//*[@id="count"]/yt-view-count-renderer/span[1]') %>% html_text()
}
corona_yt
class(corona_yt)
write.csv(corona_yt, file = "corona_yt.csv", row.names = FALSE)
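The View column scraped above is plain text (something like "조회수 1,234회", though the exact format may vary). Here is a hedged sketch of turning it into a numeric count with readr::parse_number(), and of shutting everything down when finished:
# convert the view-count text into a number (the exact text format is an assumption)
corona_yt$ViewCount <- readr::parse_number(corona_yt$View)

# close the browser and stop the Selenium server started by rsDriver()
remDr$close()
rD$server$stop()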