The contents of this class are an excerpt from the book “R Web Scraping Quick Start Guide” by Olgun Aydin.
We learned in previous classes that accessing particular information in webpages may be impeded when a site employs methods for dynamic data requests, especially through JavaScript objects.
To solve the problem, we introduce a generalized approach to cope with dynamically rendered webpages by means of browser control.
Instead of bypassing web browsers, we leverage their capabilities of interpreting JavaScript and formulating changes to the live DOM tree by directly including them into the scraping process.
Essentially, this means that all communication with a webpage is routed through a web browser session to which we send and from which we receive information.
Here, we introduce the Selenium and Webdriver framework for browser automation and its implementation in R via the RSelenium package.
Static scraping is sufficient for retrieving data from static lists, but we need to automate the browser and interact with the DOM to retrieve data from a website controlled by JavaScript.
Selenium is an open-source tool designed for testing web applications across different browsers and platforms. The main purpose of Selenium is to automate web-based applications.
Selenium is designed to automate the operations of a web browser. With Selenium, any user can automatically perform interactions that can be performed manually. Selenium can be used for any kind of automation, but the priority is to create automated web application tests.
Before the creation of Selenium this kind of testing had been carried out manually—a tedious and error-prone undertaking. Selenium solves this problem by providing drivers to control browser behavior such as clicks, scrolls, swipes, and text inputs.
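To make the idea of driving clicks and text inputs concrete, here is a minimal sketch in R. It assumes `remDr` is an already opened RSelenium client; the helper name `type_and_click` and both CSS selectors are hypothetical and would have to be adapted to the target page.

```r
# Type text into an input field and then click a button.
# `remDr` is assumed to be an open RSelenium client; both CSS selectors
# are placeholders for illustration only.
type_and_click <- function(remDr, input_css, button_css, text) {
  input <- remDr$findElement(using = "css selector", value = input_css)
  input$sendKeysToElement(list(text))   # simulate keyboard input
  button <- remDr$findElement(using = "css selector", value = button_css)
  button$clickElement()                 # simulate a mouse click
  invisible(TRUE)
}

# Usage (requires a running browser session):
# type_and_click(remDr, "input#query", "button#search", "RSelenium")
```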
Selenium’s capability to drive interactions with the webpage through the browser is of more general use than testing alone. Because it allows us to remote-control the browser, we can work with and request information directly from the live DOM tree, that is, the page as it is actually displayed in the browser window.
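As a rough illustration of requesting information from the live DOM, the sketch below reads the page source as currently rendered by the browser. It assumes `remDr` is an already opened RSelenium client; `get_live_html` is a hypothetical helper name.

```r
# Return the HTML of the page as currently rendered in the browser (the live DOM),
# which can differ from the raw HTML first sent by the server.
# `remDr` is assumed to be an open RSelenium client.
get_live_html <- function(remDr) {
  # getPageSource() returns a list whose first element is the HTML string
  remDr$getPageSource()[[1]]
}

# Usage (requires a running browser session):
# html <- get_live_html(remDr)
# nchar(html)
```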
Selenium WebDriver is an open-source suite of software with the primary purpose of providing a coherent, cross-platform framework for testing applications that run natively in the browser.
So, Selenium WebDriver provides the remote control interface that instructs the behavior of web browsers remotely.
Because WebDriver uses a real web browser to access the website, its activity is hard to distinguish from that of a human browsing the web.
When you navigate to a web page using WebDriver, the browser loads all the website resources (JavaScript files, images, CSS files, and so on) and executes all the JavaScript on the page. It also keeps all cookies created by the website.
This makes it very difficult to determine whether a real person or a robot has accessed the website. With WebDriver, this takes only a few simple steps, whereas it is really hard to simulate all these actions in a program that sends handmade HTTP requests to the server.
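The cookies kept by the browser session can be inspected from R. The following is a minimal sketch, assuming `remDr` is an open RSelenium client and that each cookie is returned as a list with a `name` field; `cookie_names` is a hypothetical helper name.

```r
# List the names of all cookies stored in the current browser session.
# `remDr` is assumed to be an open RSelenium client.
cookie_names <- function(remDr) {
  cookies <- remDr$getAllCookies()   # a list with one entry per cookie
  vapply(cookies, function(ck) ck[["name"]], character(1))
}

# Usage (requires a running browser session):
# cookie_names(remDr)
```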
Sometimes, the data to be extracted may not be included in the raw HTML that was received after an HTTP request was made. Although it is possible to retrieve this data with HTTP requests alone, it is usually easier to let a web browser do it for you. In these situations, WebDriver is a great help.
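The difference between raw and rendered HTML can be illustrated with a toy example. Both HTML strings below are made up for illustration (including the `loadComments` script call); they are not taken from any real site.

```r
# Hypothetical raw HTML as a server might send it: the comment list is empty
# and is only filled in later by JavaScript running in the browser.
raw_html <- paste0(
  "<html><body>",
  "<ul id='comments'></ul>",
  "<script>loadComments('/api/comments');</script>",
  "</body></html>"
)

# Hypothetical HTML after the browser has executed the script
rendered_html <- paste0(
  "<html><body>",
  "<ul id='comments'><li>Great article</li></ul>",
  "</body></html>"
)

# Does the HTML contain at least one list item?
has_comment <- function(html) grepl("<li>", html, fixed = TRUE)

has_comment(raw_html)       # FALSE: nothing to scrape in the raw response
has_comment(rendered_html)  # TRUE: the data exists only after rendering
```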
We need a web browser to see what the structure and content of the web page look like. Using WebDriver is a great way to take screenshots while browsing the web.
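Taking a screenshot from R could look like the sketch below. It assumes `remDr` is an open RSelenium client; `save_screenshot` and the default file name are hypothetical.

```r
# Save a screenshot of the current browser window to a PNG file.
# `remDr` is assumed to be an open RSelenium client.
save_screenshot <- function(remDr, path = "page.png") {
  remDr$screenshot(file = path)   # writes the image to `path`
  invisible(path)
}

# Usage (requires a running browser session):
# save_screenshot(remDr, "naver_news.png")
```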
Even if you only need to scrape a small portion of a website, your program depends on the full Selenium WebDriver toolchain, and a matching driver has to be installed for each browser you want to use.
When a webpage is scraped using WebDriver, the entire web browser is loaded into the system memory. This takes a long time and consumes the system resources, and may cause the security systems to react.
Web browsers wait until the entire webpage is loaded and only then allow us to access the page's elements. Scraping can therefore take longer than sending simple HTTP requests to a web server.
Web browsers also load additional files such as CSS, JavaScript, and image files when navigating to a webpage.
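Because content may appear only after the page has finished loading, a scraper often has to wait for it. The sketch below polls the page until matching elements show up or a timeout expires. It assumes `remDr` is an open RSelenium client; `wait_for_elements` and its default timings are hypothetical choices, not part of RSelenium.

```r
# Poll the page until at least one element matches `css`, or give up after
# `timeout` seconds. Returns the matching elements, or an empty list.
# `remDr` is assumed to be an open RSelenium client.
wait_for_elements <- function(remDr, css, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    elems <- remDr$findElements(using = "css selector", value = css)
    if (length(elems) > 0) return(elems)       # found something
    if (Sys.time() >= deadline) return(list()) # timed out
    Sys.sleep(interval)                        # wait before trying again
  }
}

# Usage (requires a running browser session; the selector is a placeholder):
# elems <- wait_for_elements(remDr, "ul#comments li", timeout = 15)
```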
RSelenium is designed to make it easy to connect to a local or remote Selenium server. RSelenium allows connection from the R environment to the Selenium WebDriver API. Selenium is a project that focuses on automating web browsers.
Selenium Server is a standalone Java program that allows you to run HTML test suites in different browsers.
If you want to navigate your website using a browser on the same machine that RSelenium is running on, you need to run Selenium Server on the machine.
RSelenium is one of the most useful tools of R. With just a few lines of code, you can automate web pages and create scraping systems. This is useful for testing web applications as well as for collecting data from multiple web pages.
RSelenium is an R library that allows you to use the Selenium 2.0 WebDriver project, designed to automatically test Web applications, in the R environment.
To use Selenium in R, we first need to download and install Java on your PC. The steps below are for Windows.
https://www.java.com/ko/download/manual.jsp
Please check your Windows system type (32-bit vs. 64-bit) to download the matching version of Java. Most of you will need the 64-bit version of Java.
Install the downloaded file and check that a Java folder has been created under C:\Program Files
Copy the location path of the folder jre1.8.0_311
"C:\Program Files\Java\jre1.8.0_311"
Setting the Path: Go to Advanced System Settings(고급 시스템 설정) under your System settings
Click Environment Variables(환경 변수)
Click “New”(새로 만들기) under System variables(시스템 변수), enter “JAVA_HOME” in Variable Name(변수 이름) and the copied location path, such as “C:\Program Files\Java\jre1.8.0_311”, in Variable Value(변수 값), and click Confirm(확인)
Under User variables(사용자 변수), select Path and click “Edit”(편집), then add the location path “C:\Program Files\Java\jre1.8.0_311”, followed by a semicolon (;), at the start of the Path value.
Run the following code in your RStudio console to set the Environment Variable.
Sys.setenv(JAVA_HOME='C:/Program Files/Java/jre1.8.0_311')
# Note the forward slashes (/): the backslashes (\) of the copied Windows path must be replaced in an R string
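If you prefer not to edit the slashes by hand, the conversion can be done in R. This is a minimal sketch; `to_r_path` is a hypothetical helper name, and the path is the example path used above.

```r
# Convert a Windows path with backslashes into the forward-slash form.
# In R source code a single backslash is written as "\\".
to_r_path <- function(windows_path) {
  gsub("\\", "/", windows_path, fixed = TRUE)
}

java_home <- to_r_path("C:\\Program Files\\Java\\jre1.8.0_311")
java_home   # "C:/Program Files/Java/jre1.8.0_311"

# Set the variable for the current R session and read it back
Sys.setenv(JAVA_HOME = java_home)
Sys.getenv("JAVA_HOME")

# TRUE only if the folder actually exists on this machine
dir.exists(java_home)
```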
To use the RSelenium package, Selenium Server must be up and running on your PC. Set it up as follows.
Create a folder named “selenium” under your C Drive
Download the Selenium Standalone Server (Version 3.141.59) and save the file into the selenium folder
https://www.selenium.dev/downloads/
Download the browser driver for the browser you want to control, geckodriver (Firefox) or ChromeDriver (Chrome), and save the driver file into the same selenium folder
https://github.com/mozilla/geckodriver/releases
https://sites.google.com/chromium.org/driver/downloads?authuser=0
Open Command Prompt by typing “cmd” into the Search Box
Enter “cd C:\selenium” and Run
Enter the following command and Run
java -Dwebdriver.gecko.driver="geckodriver.exe" -jar selenium-server-standalone-3.141.59.jar -port 6789
Check if Selenium Server is up and running on port 6789 and, if yes, you are good to go!
Congratulations! You are now ready to use RSelenium for web scraping!
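One way to check from R whether something is listening on port 6789 is to try opening a socket to it. The sketch below uses only base R; `is_port_open` is a hypothetical helper name. An open port only shows that some program is listening there, not that it is necessarily Selenium Server.

```r
# Return TRUE if a TCP connection to host:port can be opened, FALSE otherwise.
is_port_open <- function(host = "localhost", port = 6789L, timeout = 2) {
  con <- suppressWarnings(tryCatch(
    socketConnection(host = host, port = port, blocking = TRUE,
                     open = "r+", timeout = timeout),
    error = function(e) NULL      # connection refused or timed out
  ))
  if (is.null(con)) return(FALSE)
  close(con)
  TRUE
}

is_port_open("localhost", 6789L)

# If the server is running, a client could then be connected to it, e.g.:
# library(RSelenium)
# remDr <- remoteDriver(remoteServerAddr = "localhost", port = 6789L,
#                       browserName = "firefox")
# remDr$open()
```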
In this part, we focus on collecting data from NAVER News by using RSelenium. First, we navigate to the URL of a NAVER News page, and then we collect the comments on that page.
Let’s start collecting data from NAVER by using the RSelenium library. First of all, we need to install the RSelenium package (if it is not installed yet) and load it into the current R session with the following commands.
#install.packages("RSelenium")
library(RSelenium)
Now we load Selenium drivers and start Selenium. It may take time, so please wait till loading finishes:
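# Load the drivers and start Selenium
# port: any four-digit integer ending with L, except 6789L (already used by the Selenium Server started above)
# chromever: should match the version of Chrome installed on your machine
rD <- rsDriver(port = 8879L, browser = "chrome", chromever = "96.0.4664.45")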
remDr <- rD[["client"]]
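When you are done, it is a good idea to close the browser and stop the server so that the port is freed for the next run. The following is a minimal sketch, assuming `rD` is the list returned by `rsDriver()` with its `client` and `server` elements; `stop_selenium` is a hypothetical helper name.

```r
# Close the browser window and stop the Selenium server behind it.
# `rD` is assumed to be the object returned by rsDriver().
stop_selenium <- function(rD) {
  try(rD[["client"]]$close(), silent = TRUE)  # close the browser session
  try(rD[["server"]]$stop(), silent = TRUE)   # stop the server process
  invisible(TRUE)
}

# Usage, at the end of a scraping session:
# stop_selenium(rD)
```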
Let’s navigate to a page at NAVER News
After running the following command, the Chrome browser started by Selenium will visit the URL.
remDr$navigate("https://news.naver.com/main/read.naver?mode=LSD&mid=shm&sid1=103&oid=057&aid=0001623192")
As you can see, Chrome shows a notice that says, “Chrome is being controlled by automated test software”
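To confirm from R that the browser really arrived at the intended page, we can ask the session for its current URL and page title. This is a minimal sketch, assuming `remDr` is the open client created above; `page_info` is a hypothetical helper name.

```r
# Report where the browser currently is.
# Both RSelenium methods return a list, so we take the first element.
page_info <- function(remDr) {
  list(
    url   = remDr$getCurrentUrl()[[1]],
    title = remDr$getTitle()[[1]]
  )
}

# Usage (requires a running browser session):
# page_info(remDr)
```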
We will collect users’ comments on the news.
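Collecting the comment texts could be sketched as follows. It assumes `remDr` is the open client created above. The helper `collect_texts` is hypothetical, and the CSS selector in the usage line is only an assumption about how the comments are marked up; inspect the page in the browser's developer tools to find the selector that actually applies.

```r
# Collect the visible text of every element matching a CSS selector.
# Returns a character vector (empty if nothing matches).
collect_texts <- function(remDr, css) {
  elems <- remDr$findElements(using = "css selector", value = css)
  vapply(elems, function(el) el$getElementText()[[1]], character(1))
}

# Usage (requires a running browser session; the selector is an assumption):
# comments <- collect_texts(remDr, "span.u_cbox_contents")
# head(comments)
```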
To use Selenium in R on a Mac, we first need to download and install Java on your Mac.
https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html
To download the file for macOS, you may need to create a free Oracle account
Install the downloaded file
Open a Terminal window by entering “Terminal” in the Search Box
Enter “java -version” to check which version of Java your Mac is using
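The same check can also be made from the R console instead of the Terminal. This is a minimal sketch using base R only; it assumes that `java` is reachable through the PATH that R sees, which may differ from the Terminal's PATH.

```r
# Locate the java executable; Sys.which() returns "" if it is not found
java_path <- Sys.which("java")

if (nzchar(java_path)) {
  # `java -version` prints to stderr, so capture both output streams
  version_info <- system2("java", "-version", stdout = TRUE, stderr = TRUE)
  print(version_info)
} else {
  message("Java was not found on the PATH")
}
```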
To use the RSelenium package, Selenium Server must be up and running on your Mac. Set it up as follows.
Download the Selenium Standalone Server (Version 3.141.59)
https://www.selenium.dev/downloads/
Move the downloaded file into the working directory of your RStudio session
Open the Terminal window
Enter “cd ~/(the working directory path)”
Enter “ls” to check that the file is actually in the folder
Enter “java -jar selenium-server-standalone-3.141.59.jar -port 6789” and Run
Check if Selenium Server is up and running on port 6789 and, if yes, you are good to go!
Congratulations! You are now ready to use RSelenium for web scraping!