RSelenium Basics

https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html

Introduction

The goal of RSelenium is to make it easy to connect to a Selenium Server from within R. RSelenium provides R bindings for the Selenium WebDriver API. Selenium is a project focused on automating web browsers. RSelenium allows you to carry out unit testing on your web pages across a range of browsers.

Connecting to a Selenium Server

What is a Selenium Server?

Selenium Server is a standalone Java program which allows you to run HTML test suites in a range of different browsers, plus extra options like reporting.

You may, or may not, need to run a Selenium Server, depending on how you intend to use Selenium-WebDriver (RSelenium).

Do I Need to Run a Selenium Server?

If you intend to drive a browser on the same machine that RSelenium is running on, you will need to have a Selenium Server running on that machine.

How Do I Run the Selenium Server?

rsDriver

The rsDriver function manages the binaries needed for running a Selenium Server. If you get an error message like Error in if (file.access(phantompath, 1) < 0) { : 인자의 길이가 0입니다 (argument is of length zero), you may want to remove the downloaded "phantomjs" binaries and run the binaries with the selenium function from the wdman package. To do so, run the following commands:

binman::rm_platform("phantomjs")   # remove the downloaded phantomjs binaries
wdman::selenium(retcommand = TRUE) # return the command that would be used to start the Selenium server
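
If you prefer to manage the server process yourself, you can also start it directly with wdman. A minimal sketch, assuming Java is installed and that port 4444 is free (the port is only an example):

selServ <- wdman::selenium(port = 4444L)  # start a Selenium server process (downloads binaries if needed)
selServ$log()                             # inspect the server logs if something goes wrong
selServ$stop()                            # stop the server when you are finished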

How Do I Connect to a Running Server?

RSelenium's main reference class is named remoteDriver. To start a server and connect to it in one step, you can use the helper function rsDriver(), which returns both the server and a remoteDriver client:

library(RSelenium)
rD <- rsDriver(port = 9468L, browser = "chrome", chromever = "87.0.4280.88") # any free port given as an integer (note the L suffix); avoid ports that may already be in use, such as 4567 or 6789
remDr <- rD[["client"]]

rsDriver() starts a Selenium server and a browser.

remDr should now have a connection to the Selenium Server. You can query the status of the remote server using the getStatus method:

remDr$getStatus()
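
Alternatively, if a Selenium server is already running (for example one started with wdman or Docker), you can connect to it directly through the remoteDriver class instead of rsDriver. A minimal sketch, assuming the server is listening on localhost:4444:

remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444L,
                      browserName = "chrome")
remDr$open()       # open a browser session on the remote server
remDr$getStatus()  # query the server status, as above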

Interacting with the Web Page

Some commonly used RSelenium methods (a short combined example follows the list):

  • $navigate(): for navigating to a URL
  • $getPageSource(): for getting the current page source
  • $findElement(): for identifying elements within a web page
  • $getElementText(): for getting the text of elements in a web page
  • $executeScript(): for executing JavaScript in the context of the currently selected frame/window
  • $clickElement(): for clicking elements within a web page
  • $open(): for opening a web browser client
  • $close(): for closing a web browser client
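
A short combined example of these methods; the URL and selector below are only illustrative:

remDr$navigate("https://www.r-project.org")               # go to a URL
elem <- remDr$findElement(using = "css selector", "h1")   # first element matching the selector
elem$getElementText()[[1]]                                # extract the element's text
elem$clickElement()                                       # click the element (no visible effect on a heading)
page <- remDr$getPageSource()[[1]]                        # the full page source as a string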

To open or close a controlled browser, use the open and close methods:

remDr$open()
remDr$close()

To navigate to a URL:

remDr$navigate("https://www.youtube.com")
# Search "코로나"
# Arrange the results by "조회수" (View Count) within a day
remDr$navigate("https://www.youtube.com/results?search_query=코로나&sp=CAMSAggC")

The search results load incrementally as we scroll down the page, so we can send scrolling events to the HTML body element to load the rest of the content.

Simulating Scrolls and Clicks

  1. Simulating scrolls: We can scroll up or down by sending the page_up, page_down, or end keys to the browser.
remDr$refresh()

# find the webpage body
webElem <- remDr$findElement("xpath", '/html/body')

#scroll down the webpage once
webElem$sendKeysToElement(list(key = "page_down"))
webElem$sendKeysToElement(list(key = "end"))

#scroll up once
webElem$sendKeysToElement(list(key = "page_up"))

Make sure to sleep for a couple of seconds, because it takes time for the content to load:

# system sleep
Sys.sleep(5) # in seconds

Scroll down multiple times

remDr$navigate("https://www.youtube.com/results?search_query=%EC%BD%94%EB%A1%9C%EB%82%98&sp=CAMSBAgCEAE%253D")

webElem <- remDr$findElement("xpath", "/html/body")

# Scroll down 10 times
for(i in 1:10){      
  webElem$sendKeysToElement(list(key = "end"))
  # please make sure to sleep a couple of seconds because it takes time to load contents
  Sys.sleep(5)    
}

  2. Simulating Clicks

Click a Single Element by XPath:

//*[@id="video-title"]

remDr$navigate("https://www.youtube.com/results?search_query=%EC%BD%94%EB%A1%9C%EB%82%98&sp=CAMSBAgCEAE%253D")

# locate the element using XPath (findElement returns the first match)
webElem <- remDr$findElement(using="xpath", '//*[@id="video-title"]')
webElem$getElementText()[[1]][1] # Extract text
webElem$clickElement()

rvest package

Introducing rvest

Most data on the web is available at large scale as HTML, but it is often not in a form that is useful for analysis, because it is hierarchical (tree-based), like the example below:

(Example HTML page titled "First HTML", whose body reads "This is an R HTML document.")

rvest is a very useful R package that helps you collect information from web pages. It is designed to work with the tidyverse and was inspired by Python web-scraping libraries like Beautiful Soup.

Let's take a look at some important functions in rvest (a small worked example follows the list):

  • read_html(): create an HTML document from a URL
  • html_nodes(doc, "table td"): select parts of a document using CSS selectors
  • html_nodes(doc, xpath = "//table//td"): select parts of a document using XPath selectors
  • html_text(): extract text from an HTML document
  • html_attr(): get a single HTML attribute
  • html_attrs(): get all HTML attributes
  • html_table(): parse HTML tables into data frames
  • repair_encoding(): detect and repair encoding problems
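
As a quick illustration of these functions, here is a minimal sketch that parses a small, made-up HTML string instead of a live page:

library(rvest)

doc <- read_html('<html><body>
  <table>
    <tr><td>A</td><td>1</td></tr>
    <tr><td>B</td><td>2</td></tr>
  </table>
  <a href="https://example.com" title="Example">link</a>
</body></html>')

doc %>% html_nodes("table td") %>% html_text()           # cell text via a CSS selector
doc %>% html_nodes(xpath = "//a") %>% html_attr("href")  # a single attribute via an XPath selector
doc %>% html_nodes("table") %>% html_table()             # parse the table into a data frame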

How to collect multiple URLs from a webpage

Let's scrape the URLs of the most viewed video clips about "코로나" uploaded within the last 24 hours on YouTube. Once we have those URLs, we can visit each clip and find out how many views it has.

We can find these URLs on YouTube's search results page, at the following address (shown first percent-encoded and then in readable form):

https://www.youtube.com/results?search_query=%EC%BD%94%EB%A1%9C%EB%82%98&sp=CAMSAggC

https://www.youtube.com/results?search_query=코로나&sp=CAMSAggC

To collect the URL of each video clip, we will start from the landing (search results) page of the website.

The following code could be used to install (if necessary) and load the library:

#install.packages("rvest")
library(rvest)

Let's say we want to collect the view count of each video clip returned by our search query. To do so, we first need to collect the URLs of the clips.

In the following part, we will write XPath rules to locate this information and then write an R script to collect it.

Writing XPath rules

  • First, we will write XPath rules to collect the title of each clip and its URL.

  • Let's navigate to the landing page of YouTube.com. As we practiced in previous classes, we will use the Google Developer Tools to create and test XPath rules.

  • Let's write an XPath rule to get the title:

//*[@id="video-title"]

Read the html document from the web page

page <- remDr$getPageSource()[[1]]
html <- read_html(page)

Scraping titles and URLs through XPath rules

library(tidyverse)

html %>% html_nodes(xpath='//*[@id="video-title"]') %>% html_attr('title')

html %>% html_nodes(xpath='//*[@id="video-title"]') %>% html_attr('href')

Automating scrolls to load all the clips

webElem <- remDr$findElement("xpath", "/html/body")

# Scroll down 2 times
for(i in 1:2){      
  webElem$sendKeysToElement(list(key = "end"))
  # please make sure to sleep a couple of seconds because it takes time to load contents
  Sys.sleep(2)    
}

Read the html document from the updated web page

page <- remDr$getPageSource()[[1]]
html <- read_html(page)

Scraping titles and URLs through XPath rules

html %>% html_nodes(xpath='//*[@id="video-title"]') %>% html_attr('title')
html %>% html_nodes(xpath='//*[@id="video-title"]') %>% html_attr('href')

library(tidyverse)
titles <- html %>% html_nodes(xpath='//*[@id="video-title"]') %>% html_attr('title')
urls <- html %>% html_nodes(xpath='//*[@id="video-title"]') %>% html_attr('href')

Combining the two vectors into a data frame

corona_yt <- tibble(Title = titles, URL=str_c("https://www.youtube.com", urls, sep=""))
corona_yt

Now let's scrape the view count of each video clip. The count is easy to find on the watch page of each clip.

XPath rule for the view count: //*[@id="count"]/yt-view-count-renderer/span[1]

corona_yt$URL[1]
remDr$navigate(corona_yt$URL[1])

page <- remDr$getPageSource()[[1]]
html <- read_html(page)
view <- html %>% html_nodes(xpath='//*[@id="count"]/yt-view-count-renderer/span[1]') %>% html_text()
view

Creating a new column in corona_yt for the view counts with a for loop

corona_yt$View <- NA  # initialize the new column

for(i in 1:length(corona_yt$URL)){
  remDr$navigate(corona_yt$URL[i])   # open the watch page of the i-th clip
  Sys.sleep(5)                       # wait for the page to load
  page <- remDr$getPageSource()[[1]]
  html <- read_html(page)
  view <- html %>% html_nodes(xpath='//*[@id="count"]/yt-view-count-renderer/span[1]') %>% html_text()
  corona_yt$View[i] <- ifelse(length(view) == 0, NA, view[1])  # keep NA if the element was not found
}
corona_yt
class(corona_yt)

write.csv(corona_yt, file="corona_yt.csv", row.names = F)
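
The View column is stored as the display text scraped from the page (for example something like "조회수 1,234회" on a Korean-language page; the exact wording depends on your locale). If you want a numeric column for analysis, one option is readr::parse_number(), which extracts the first number from a string; a minimal sketch, assuming the count appears as digits with comma separators:

corona_yt$Views <- readr::parse_number(corona_yt$View)  # e.g. "조회수 1,234회" becomes 1234
corona_yt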