Why Selenium?

How to use RSelenium?

#install.packages("RSelenium")
library(RSelenium)

How Do I Connect to a Running Server?

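RSelenium has a main reference class named remoteDriver. To connect to a server that is already running, you instantiate a new remoteDriver with the appropriate options. The rsDriver function used below is a convenience wrapper: it starts a Selenium server and a browser, and returns a list whose client element is a remoteDriver that is already connected.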
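For a server that was started separately, the connection might look like the following sketch. The host, port 4444 and browser name are assumptions about that server, and connect_to_server is a hypothetical helper name, not part of RSelenium.

```r
# Hypothetical helper: connect to a Selenium server that is ALREADY running.
# The defaults (localhost, port 4444, chrome) are assumptions; adjust them
# to match the server you actually started.
connect_to_server <- function(host = "localhost", port = 4444L,
                              browser = "chrome") {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = host,
    port = port,
    browserName = browser
  )
  remDr$open()  # opens a browser session on the server
  remDr
}

# Usage (needs the RSelenium package and a live server, so it is commented out):
# remDr <- connect_to_server()
# remDr$getStatus()
```

The rest of this section does not need this helper, because rsDriver starts the server and returns the connected client in one step.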

Now we load the Selenium drivers and start Selenium. This may take a while, so wait until loading finishes:

# chromever is the chromedriver version; it should match your installed Chrome
rD <- rsDriver(port = 1113L, browser = "chrome", chromever = "96.0.4664.45")
remDr <- rD[["client"]]

rsDriver starts a Selenium server and a browser.

The browser window that opens shows a notice saying “Chrome is being controlled by automated test software”.

remDr should now have a connection to the Selenium Server. You can query the status of the remote server using the getStatus method:

remDr$getStatus()

Next, we navigate to the URL of a NAVER news article.

After the following command runs, the Chrome window controlled by Selenium loads the URL:

remDr$navigate("https://news.naver.com/main/read.naver?mode=LSD&mid=shm&sid1=103&oid=057&aid=0001623192")

Let’s start collecting users’ comments on the article.

First we locate the HTML element we are interested in, using an XPath expression. The XPath expression for the first comment is:

'//*[@id="cbox_module_wai_u_cbox_content_wrap_tabpanel"]/ul/li[1]/div[1]/div/div[2]/span[1]'

The XPath expression for the second comment differs only in the li index:

'//*[@id="cbox_module_wai_u_cbox_content_wrap_tabpanel"]/ul/li[2]/div[1]/div/div[2]/span[1]'
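Since only the li index changes from one comment to the next, the XPath for the i-th comment can be generated instead of typed by hand. A minimal sketch follows; comment_xpath is a hypothetical helper name, and the template simply copies the expressions shown above.

```r
# Hypothetical helper: build the XPath of the i-th comment.
# The template copies the XPath from the text; only the li[...] index varies.
comment_xpath <- function(i) {
  sprintf(
    '//*[@id="cbox_module_wai_u_cbox_content_wrap_tabpanel"]/ul/li[%d]/div[1]/div/div[2]/span[1]',
    as.integer(i)
  )
}

comment_xpath(2)    # XPath of the second comment
comment_xpath(1:3)  # sprintf is vectorized: three XPaths at once
```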

To find the HTML element, we call the findElement method of remDr:

comment1 <- remDr$findElement(using = 'xpath', '//*[@id="cbox_module_wai_u_cbox_content_wrap_tabpanel"]/ul/li[1]/div[1]/div/div[2]/span[1]')

After finding the element, we call the getElementText method of comment1 to get its text. The method returns a list, so [[1]] extracts the character string. The following commands print the text of the first comment on the NAVER news article:

comment1$getElementText()
class(comment1$getElementText())
comment1$getElementText()[[1]][1]

Likewise, we can obtain the 10th comment as follows:

comment10 <- remDr$findElement(using = 'xpath', '//*[@id="cbox_module_wai_u_cbox_content_wrap_tabpanel"]/ul/li[10]/div[1]/div/div[2]/span[1]')

comment10$getElementText()
class(comment10$getElementText())
comment10$getElementText()[[1]][1]

Let’s create an empty tibble that will store all the comments made by the users:

library(tidyverse)
comments_df <- tibble()
class(comments_df)
comments_df

Because this news page shows 10 comments, we use a for loop with 10 iterations and build the XPath for the i-th comment with paste (the first two lines below show the result for i = 3). To avoid being blocked by NAVER, we add a delay with the Sys.sleep function:

i <- 3
paste('//*[@id="cbox_module_wai_u_cbox_content_wrap_tabpanel"]/ul/li[',i,']/div[1]/div/div[2]/span[1]', sep='')



for(i in 1:10){
  webElem1 <- remDr$findElement(using = 'xpath', paste('//*[@id="cbox_module_wai_u_cbox_content_wrap_tabpanel"]/ul/li[',i,']/div[1]/div/div[2]/span[1]', sep=''))
  Sys.sleep(1)
  comment <- tibble(comment = webElem1$getElementText()[[1]][1])
  comments_df <- bind_rows(comments_df, comment)
}

Printing comments_df shows a data frame that contains the 10 comments from the NAVER news page:

comments_df
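Growing a table with bind_rows inside a loop copies it on every iteration. An alternative sketch fills a preallocated character vector and builds the table once at the end. The names collect_comments and fake_driver are hypothetical; fake_driver only imitates the findElement and getElementText calls so that the example runs without a browser, and a base data.frame stands in for the tibble to keep the example free of dependencies.

```r
# Hypothetical helper: collect the first n comments through a driver object.
collect_comments <- function(driver, n, delay = 1) {
  texts <- character(n)  # preallocate instead of growing a table
  for (i in seq_len(n)) {
    xpath <- paste0(
      '//*[@id="cbox_module_wai_u_cbox_content_wrap_tabpanel"]/ul/li[',
      i, ']/div[1]/div/div[2]/span[1]'
    )
    elem <- driver$findElement(using = "xpath", value = xpath)
    texts[i] <- elem$getElementText()[[1]]  # getElementText returns a list
    Sys.sleep(delay)                        # stay polite to the site
  }
  data.frame(comment = texts, stringsAsFactors = FALSE)
}

# Stand-in for remDr (assumption: for illustration only, no real browser)
fake_driver <- list(
  findElement = function(using, value) {
    list(getElementText = function() list(paste("text for", value)))
  }
)

demo <- collect_comments(fake_driver, n = 3, delay = 0)
nrow(demo)  # 3
```

With a live session, the call would be collect_comments(remDr, n = 10).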

Now let’s trigger some events with the RSelenium package. For example, we can click the button that loads more comments, “댓글 더보기” ("View more comments"), which opens a new URL listing all the comments on the NAVER news article. To do so, we first locate this part of the HTML with XPath and then send a click event to the web element:

webElem2 <- remDr$findElement(using = "xpath", '//*[@id="cbox_module"]/div[2]/div[9]/a/span[1]')

The following command clicks the element and takes us to the comment page:

webElem2$clickElement()

If you run the preceding code, you will see that the browser displays the comment page.
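Content that appears after a click is often loaded asynchronously, so an element may not exist yet at the moment the click returns. The sketch below polls for an element until a timeout instead of relying on a fixed sleep. It assumes that findElement raises an R error when nothing matches; wait_for_element and slow_driver are hypothetical names, and slow_driver only simulates an element that appears on the third attempt.

```r
# Hypothetical helper: retry findElement until it succeeds or time runs out.
wait_for_element <- function(driver, xpath, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    elem <- tryCatch(
      driver$findElement(using = "xpath", value = xpath),
      error = function(e) NULL  # not there yet
    )
    if (!is.null(elem)) return(elem)
    if (Sys.time() >= deadline) {
      stop("element not found within timeout: ", xpath)
    }
    Sys.sleep(interval)
  }
}

# Simulated driver (assumption: for illustration only): fails twice, then succeeds.
attempts <- 0
slow_driver <- list(
  findElement = function(using, value) {
    attempts <<- attempts + 1
    if (attempts < 3) stop("NoSuchElement")
    list(xpath = value)
  }
)

found <- wait_for_element(slow_driver, "//ul/li[1]", timeout = 5, interval = 0.01)
attempts  # 3
```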

We can then load more comments by clicking “더보기” ("More") through RSelenium. As before, we locate the element with XPath and send it a click event. The XPath expression is:

'//*[@id="cbox_module"]/div[2]/div[9]/a/span/span/span[1]'

WebElem3 <- remDr$findElement(using = "xpath", '//*[@id="cbox_module"]/div[2]/div[9]/a/span/span/span[1]')

WebElem3$clickElement()

We can load further comments by sending the click event repeatedly. Because the article has 190 comments and each click loads 20 more, repeating the click 10 times (190 / 20 = 9.5, rounded up) is enough to load all of them:

for(i in 1:10) {
  WebElem3$clickElement()
  Sys.sleep(1)
}
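The click count can also be computed, and the loop can stop early once the button is gone. The sketch below assumes that clickElement raises an R error when the button is no longer on the page; clicks_needed, click_until_gone and fake_button are hypothetical names, and fake_button only simulates a button that disappears after four clicks.

```r
# Number of clicks needed to reveal `total` items, `per_click` at a time.
clicks_needed <- function(total, per_click, already_shown = 0) {
  max(0, ceiling((total - already_shown) / per_click))
}
clicks_needed(190, 20)  # 10

# Click up to max_clicks times, stopping at the first failed click.
click_until_gone <- function(element, max_clicks, delay = 1) {
  done <- 0
  for (k in seq_len(max_clicks)) {
    ok <- tryCatch({
      element$clickElement()
      TRUE
    }, error = function(e) FALSE)
    if (!ok) break
    done <- done + 1
    Sys.sleep(delay)
  }
  done
}

# Simulated button (assumption: for illustration only).
remaining <- 4
fake_button <- list(clickElement = function() {
  if (remaining == 0) stop("element is no longer attached")
  remaining <<- remaining - 1
  invisible(NULL)
})

n_clicked <- click_until_gone(fake_button, max_clicks = 10, delay = 0)
n_clicked  # 4
```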

Now we can extract all 190 comments from the new page, so the for loop runs 190 times. The XPath pattern is the same as before:

'//*[@id="cbox_module_wai_u_cbox_content_wrap_tabpanel"]/ul/li[1]/div[1]/div/div[2]/span[1]'

comments_df <- tibble()

for(i in 1:190){
  webElem1 <- remDr$findElement(using = 'xpath', paste('//*[@id="cbox_module_wai_u_cbox_content_wrap_tabpanel"]/ul/li[',i,']/div[1]/div/div[2]/span[1]', sep=''))
  Sys.sleep(1)
  comment <- tibble(comment = webElem1$getElementText()[[1]][1])
  comments_df <- bind_rows(comments_df, comment)
}

comments_df
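A hard-coded count of 190 stops being right as soon as the number of comments changes. RSelenium also provides findElements (plural), which returns a list of all matching elements. The sketch below drops the index from li so that one XPath matches every loaded comment; it assumes the page structure is the same as in the expressions above. The names all_comment_texts, make_elem and fake_page are hypothetical, and fake_page only imitates the driver so that the example runs without a browser.

```r
# Hypothetical helper: fetch the text of every loaded comment in one query.
all_comment_texts <- function(driver) {
  xpath <- '//*[@id="cbox_module_wai_u_cbox_content_wrap_tabpanel"]/ul/li/div[1]/div/div[2]/span[1]'
  elems <- driver$findElements(using = "xpath", value = xpath)
  vapply(elems, function(e) e$getElementText()[[1]], character(1))
}

# Simulated page with three comments (assumption: for illustration only).
make_elem <- function(txt) {
  force(txt)
  list(getElementText = function() list(txt))
}
fake_page <- list(
  findElements = function(using, value) {
    lapply(c("first", "second", "third"), make_elem)
  }
)

texts <- all_comment_texts(fake_page)
length(texts)  # 3
```

With a live session, tibble(comment = all_comment_texts(remDr)) would give the same shape of table as the loop.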

write.csv(comments_df, file="comments_df.csv")
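By default write.csv also writes row names as an extra first column, and on some systems Korean text may be garbled if the encoding is left implicit. The sketch below writes a small made-up table with row.names = FALSE and an explicit UTF-8 encoding, then reads it back. The sample comments and the temporary file path are placeholders, not data from the article.

```r
# Made-up sample rows (placeholders, not real comments).
sample_df <- data.frame(
  comment = c("좋은 기사네요", "plain ASCII comment"),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE drops the index column; fileEncoding states the encoding.
write.csv(sample_df, file = csv_path, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back with the same encoding to check the round trip.
restored <- read.csv(csv_path, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
names(restored)
nrow(restored)
```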

Assignment

Navigate to any NAVER News page with at least 100 comments, collect all the comments made on the article, and export the data to a CSV file.

Stop Selenium Driver and start the server again

# Closing the Selenium server
remDr$close()
rD[["server"]]$stop()
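When these lines run as part of a script, an error in close (for example because the browser window was already closed by hand) would stop the script before the server is shut down. The sketch below wraps each step in tryCatch so that both are always attempted. The names stop_selenium and fake_rD are hypothetical, and fake_rD only imitates the list returned by rsDriver.

```r
# Hypothetical helper: try to close the client and stop the server,
# reporting which step succeeded instead of aborting on the first error.
stop_selenium <- function(rD) {
  closed <- tryCatch({
    rD[["client"]]$close()
    TRUE
  }, error = function(e) FALSE)
  stopped <- tryCatch({
    rD[["server"]]$stop()
    TRUE
  }, error = function(e) FALSE)
  c(client_closed = closed, server_stopped = stopped)
}

# Simulated rsDriver result (assumption: for illustration only):
# closing the client fails, stopping the server works.
fake_rD <- list(
  client = list(close = function() stop("browser already closed")),
  server = list(stop = function() invisible(TRUE))
)

result <- stop_selenium(fake_rD)
result  # client_closed = FALSE, server_stopped = TRUE
```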

# Loading drivers and starting selenium
# port: any four-digit integer (written with the L suffix) except 1111L
rD <- rsDriver(port = 1112L, browser = "chrome", chromever = "96.0.4664.45")
remDr <- rD[["client"]]