RSelenium

Many websites are difficult to scrape because they use JavaScript and jQuery to pull data from a database dynamically. For example, on common social media sites such as LinkedIn or Facebook, new content is loaded as you scroll down the page while the URL stays the same, which makes these sites hard to scrape. When a site's URLs do follow a systematic pattern, the scraping task is simpler: we can adjust the URL according to that pattern to load each new page.

For example, if we check the Grainger website, we can see that its URLs change systematically: https://www.grainger.com/
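As a minimal sketch of that URL-pattern approach (the product-page template below is hypothetical, not Grainger's real scheme), the item numbers can be pasted into a URL template and, for static pages, fetched directly without a browser:

item_numbers <- c("19YP94", "2BAE7", "2BAE9")
# build one URL per item from a hypothetical template
urls <- sprintf("https://www.example.com/product/%s", item_numbers)
urls
# for plain HTML pages, each constructed URL could then be read with rvest::read_html(url)

For pages that are rendered by JavaScript, however, this shortcut is not enough, and we drive a real browser with RSelenium instead.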

my_data <- read.csv("https://raw.githubusercontent.com/szx868/data607/master/Presentation/infile.txt",header=T)
my_data
##    Item.Number
## 1       19YP94
## 2        2BAE7
## 3        2BAE9
## 4        2BAF8
## 5        2BAF9
## 6       472U87
## 7        2BAG5
## 8        2BAH1
## 9        2BAH3
## 10       2BAH4
## 11       2BAH7
## 12       2BAH8
## 13       2BAJ2
library(RSelenium)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(stringr)


cprof <- list(chromeOptions = 
                list(extensions = 
                       list(base64enc::base64encode("VPN_PROXY_MASTER.crx"))
                ))


rD <- rsDriver(port = 4453L,extraCapabilities=cprof, browser ="chrome",chromever = "latest")
## checking Selenium Server versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking chromedriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking geckodriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking phantomjs versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
remDr <- rD[["client"]]
remDr$setTimeout(type = 'page load', milliseconds = 120000)
remDr$setTimeout(type = 'implicit', milliseconds = 120000)
remDr$navigate("chrome-extension://lnfdmdhmfbimhhpaeocncdlhiodoblbd/popup/popup.html")
time <- 5
Sys.sleep(time)
webElem <- remDr$findElement("css", "body")
# find button
morereviews <- remDr$findElement(using = 'css selector', ".start-btn")
# click button
morereviews$clickElement()
# wait
Sys.sleep(8)
temphtml <- remDr$getPageSource()[[1]]
  if(str_detect(temphtml,"Connected") == FALSE){
          morereviews <- remDr$findElement(using = 'css selector', ".start-btn")
          morereviews$clickElement()
          Sys.sleep(8)
  }
remDr$setTimeout(type = 'page load', milliseconds = 120000)
remDr$setTimeout(type = 'implicit', milliseconds = 120000)
remDr$navigate("https://www.grainger.com/")
webElem2 <- remDr$findElement("css", "body")
morereviews2 <- remDr$findElement(using = "name", value = "searchQuery")$sendKeysToElement(list('3d264',key="enter"))
Sys.sleep(5) # give the page time to fully load
html <- remDr$getPageSource()[[1]]
remDr$close()
rD$server$stop()
## [1] TRUE
rm(rD, remDr)
gc()
##           used (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 1266711 67.7    2345677 125.3  2345677 125.3
## Vcells 2339824 17.9    8388608  64.0  4094265  31.3
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
## [1] 0


library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library("knitr")

item.desc <- read_html(html) %>% # parse the saved page HTML
  html_nodes(".specifications__description") %>% # extract nodes with class = "specifications__description"
  html_text()
item.value <- read_html(html) %>% # parse the saved page HTML
  html_nodes(".specifications__value") %>% # extract nodes with class = "specifications__value"
  html_text()
tech.spec <- cbind(item.desc, item.value)
tech.spec
##       item.desc                                         item.value          
##  [1,] "Item - Plugs and Receptacles"                    "Locking Connector" 
##  [2,] "Amps - Plugs and Receptacles"                    "15/10"             
##  [3,] "Voltage - Plugs and Receptacles"                 "125/250V AC"       
##  [4,] "NEMA Plug Configuration - Plugs and Receptacles" "15A, Non-NEMA"     
##  [5,] "Number of Poles"                                 "3"                 
##  [6,] "Number of Wires"                                 "3"                 
##  [7,] "Wiring Style"                                    "Standard"          
##  [8,] "Color - Plugs and Receptacles"                   "Black/White"       
##  [9,] "Shrouded / Non-Shrouded"                         "Non-Shrouded"      
## [10,] "Grade - Plugs and Receptacles"                   "Industrial"        
## [11,] "Power Indicator"                                 "No"                
## [12,] "Corrosion Resistant"                             "No"                
## [13,] "IP Rating"                                       "20"                
## [14,] "Cord Size"                                       "0.23 in to 0.72 in"
## [15,] "Phase"                                           "1"                 
## [16,] "Material"                                        "Nylon"             
## [17,] "Antimicrobial"                                   "No"                
## [18,] "Standards"                                       "UL/CSA"            
## [19,] "Item"                                            "Locking Connector"
count <- 0 
for(i in 1:nrow(my_data)){
  count <- count + 1
  print(count)
  print(my_data[i,])
}
## [1] 1
## [1] "19YP94"
## [1] 2
## [1] "2BAE7"
## [1] 3
## [1] "2BAE9"
## [1] 4
## [1] "2BAF8"
## [1] 5
## [1] "2BAF9"
## [1] 6
## [1] "472U87"
## [1] 7
## [1] "2BAG5"
## [1] 8
## [1] "2BAH1"
## [1] 9
## [1] "2BAH3"
## [1] 10
## [1] "2BAH4"
## [1] 11
## [1] "2BAH7"
## [1] 12
## [1] "2BAH8"
## [1] 13
## [1] "2BAJ2"
library(stringr)
library(RSelenium)
library(tidyverse)
library(rvest)
library("knitr")

my_data <- read.csv("https://raw.githubusercontent.com/szx868/data607/master/Presentation/infile.txt",header=T)
my_data
##    Item.Number
## 1       19YP94
## 2        2BAE7
## 3        2BAE9
## 4        2BAF8
## 5        2BAF9
## 6       472U87
## 7        2BAG5
## 8        2BAH1
## 9        2BAH3
## 10       2BAH4
## 11       2BAH7
## 12       2BAH8
## 13       2BAJ2
df <- data.frame(x = character(), y = character(), z = character())

      
for (i in 1:3){
      cprof <- list(chromeOptions = 
                list(extensions = 
                       list(base64enc::base64encode("VPN_PROXY_MASTER.crx"))
                ))
      rD <- rsDriver(port = 4451L,extraCapabilities=cprof, browser ="chrome",chromever = "latest")
      remDr <- rD[["client"]]
      remDr$setTimeout(type = 'page load', milliseconds = 120000)
      remDr$setTimeout(type = 'implicit', milliseconds = 120000)
      remDr$navigate("chrome-extension://lnfdmdhmfbimhhpaeocncdlhiodoblbd/popup/popup.html")
      Sys.sleep(5)
      webElem <- remDr$findElement("css", "body")
      # find button
      morereviews <- remDr$findElement(using = 'css selector', ".start-btn")

      # click button
      morereviews$clickElement()
      # wait
      Sys.sleep(8)
      
      temphtml <- remDr$getPageSource()[[1]]
      if(str_detect(temphtml,"Connected") == FALSE){
          morereviews <- remDr$findElement(using = 'css selector', ".start-btn")
          morereviews$clickElement()
          Sys.sleep(8)
      }
      remDr$setTimeout(type = 'page load', milliseconds = 120000)
      remDr$setTimeout(type = 'implicit', milliseconds = 120000)
      remDr$navigate("https://www.grainger.com/")
      webElem2 <- remDr$findElement("css", "body")
      morereviews2 <- remDr$findElement(using = "name", value = "searchQuery")$sendKeysToElement(list(my_data[i,],key="enter"))
      Sys.sleep(5) # give the page time to fully load
      html <- remDr$getPageSource()[[1]]
      remDr$close()
      rD$server$stop()
      rm(rD, remDr)
      gc()
      system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
      item.desc <- read_html(html) %>% # parse the saved page HTML
        html_nodes(".specifications__description") %>% # extract nodes with class = "specifications__description"
        html_text()
      item.value <- read_html(html) %>% # parse the saved page HTML
        html_nodes(".specifications__value") %>% # extract nodes with class = "specifications__value"
        html_text()
      df <- rbind(df, data.frame(x = item.desc, y = item.value, z = my_data[i,]))
    
}
## checking Selenium Server versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking chromedriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking geckodriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking phantomjs versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## [1] "Connecting to remote server"
## $acceptInsecureCerts
## [1] FALSE
## 
## $browserName
## [1] "chrome"
## 
## $browserVersion
## [1] "86.0.4240.75"
## 
## $chrome
## $chrome$chromedriverVersion
## [1] "86.0.4240.22 (398b0743353ff36fb1b82468f63a3a93b4e2e89e-refs/branch-heads/4240@{#378})"
## 
## $chrome$userDataDir
## [1] "C:\\Users\\JOSEPH~1\\AppData\\Local\\Temp\\scoped_dir29272_1537044230"
## 
## 
## $`goog:chromeOptions`
## $`goog:chromeOptions`$debuggerAddress
## [1] "localhost:53896"
## 
## 
## $networkConnectionEnabled
## [1] FALSE
## 
## $pageLoadStrategy
## [1] "normal"
## 
## $platformName
## [1] "windows"
## 
## $proxy
## named list()
## 
## $setWindowRect
## [1] TRUE
## 
## $strictFileInteractability
## [1] FALSE
## 
## $timeouts
## $timeouts$implicit
## [1] 0
## 
## $timeouts$pageLoad
## [1] 300000
## 
## $timeouts$script
## [1] 30000
## 
## 
## $unhandledPromptBehavior
## [1] "dismiss and notify"
## 
## $`webauthn:virtualAuthenticators`
## [1] TRUE
## 
## $webdriver.remote.sessionid
## [1] "15708c7aa490379df48975e07d3ac9a7"
## 
## $id
## [1] "15708c7aa490379df48975e07d3ac9a7"
## checking Selenium Server versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking chromedriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking geckodriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking phantomjs versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## [1] "Connecting to remote server"
## $acceptInsecureCerts
## [1] FALSE
## 
## $browserName
## [1] "chrome"
## 
## $browserVersion
## [1] "86.0.4240.75"
## 
## $chrome
## $chrome$chromedriverVersion
## [1] "86.0.4240.22 (398b0743353ff36fb1b82468f63a3a93b4e2e89e-refs/branch-heads/4240@{#378})"
## 
## $chrome$userDataDir
## [1] "C:\\Users\\JOSEPH~1\\AppData\\Local\\Temp\\scoped_dir20112_1582312997"
## 
## 
## $`goog:chromeOptions`
## $`goog:chromeOptions`$debuggerAddress
## [1] "localhost:54006"
## 
## 
## $networkConnectionEnabled
## [1] FALSE
## 
## $pageLoadStrategy
## [1] "normal"
## 
## $platformName
## [1] "windows"
## 
## $proxy
## named list()
## 
## $setWindowRect
## [1] TRUE
## 
## $strictFileInteractability
## [1] FALSE
## 
## $timeouts
## $timeouts$implicit
## [1] 0
## 
## $timeouts$pageLoad
## [1] 300000
## 
## $timeouts$script
## [1] 30000
## 
## 
## $unhandledPromptBehavior
## [1] "dismiss and notify"
## 
## $`webauthn:virtualAuthenticators`
## [1] TRUE
## 
## $webdriver.remote.sessionid
## [1] "92303306f128fe0ba14cd55537191a6d"
## 
## $id
## [1] "92303306f128fe0ba14cd55537191a6d"
## checking Selenium Server versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking chromedriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking geckodriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking phantomjs versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## [1] "Connecting to remote server"
## $acceptInsecureCerts
## [1] FALSE
## 
## $browserName
## [1] "chrome"
## 
## $browserVersion
## [1] "86.0.4240.75"
## 
## $chrome
## $chrome$chromedriverVersion
## [1] "86.0.4240.22 (398b0743353ff36fb1b82468f63a3a93b4e2e89e-refs/branch-heads/4240@{#378})"
## 
## $chrome$userDataDir
## [1] "C:\\Users\\JOSEPH~1\\AppData\\Local\\Temp\\scoped_dir11412_617370987"
## 
## 
## $`goog:chromeOptions`
## $`goog:chromeOptions`$debuggerAddress
## [1] "localhost:54155"
## 
## 
## $networkConnectionEnabled
## [1] FALSE
## 
## $pageLoadStrategy
## [1] "normal"
## 
## $platformName
## [1] "windows"
## 
## $proxy
## named list()
## 
## $setWindowRect
## [1] TRUE
## 
## $strictFileInteractability
## [1] FALSE
## 
## $timeouts
## $timeouts$implicit
## [1] 0
## 
## $timeouts$pageLoad
## [1] 300000
## 
## $timeouts$script
## [1] 30000
## 
## 
## $unhandledPromptBehavior
## [1] "dismiss and notify"
## 
## $`webauthn:virtualAuthenticators`
## [1] TRUE
## 
## $webdriver.remote.sessionid
## [1] "df7759c3506362ed4b7d61a051c31b5e"
## 
## $id
## [1] "df7759c3506362ed4b7d61a051c31b5e"
kable(df)
|x |y |z |
|:--|:--|:--|
|Item |Standard Plate Caster |19YP94 |
|Caster Sub-Type |Rigid |19YP94 |
|Load Rating |400 lb |19YP94 |
|Wheel Dia. |5 in |19YP94 |
|Wheel Width |1 5/8 in |19YP94 |
|Mounting Height |6 1/10 in |19YP94 |
|Wheel/Tread Material |Polyurethane |19YP94 |
|Wheel/Tread Hardness Rating |Firm |19YP94 |
|Wheel/Tread Hardness |92 Shore A |19YP94 |
|Wheel/Tread Color |Yellow |19YP94 |
|Wheel Bearing Type |Precision Ball |19YP94 |
|Mounting Plate Bolt Hole Pattern |C |19YP94 |
|Mounting Plate Size |2 1/2 in x 3 5/8 in |19YP94 |
|Inside Bolt Hole Spacing |1-3/4 in x 2-13/16 in |19YP94 |
|Outside Bolt Hole Spacing |1-3/4 in x 3-1/16 in |19YP94 |
|Mounting Bolt Dia. |5/16 in |19YP94 |
|Number of Mounting Holes |4 |19YP94 |
|Swivel Lock Type |No Lock Included |19YP94 |
|Caster Brake Type |No Brake Included |19YP94 |
|Frame Finish |Zinc Plated |19YP94 |
|Caster Frame Material |Steel |19YP94 |
|Tread Shape |Standard |19YP94 |
|Includes Thread Guards |No |19YP94 |
|Wheel Core Material |Aluminum |19YP94 |
|Core Color |Gray |19YP94 |
|Non-Marking |Yes |19YP94 |
|Number of Wheels |1 |19YP94 |
|Replacement Wheel Mfr. No. |ALTH 125/15K-B10 |19YP94 |
|Application |General / Industrial / Commercial / Institutional purpose |19YP94 |
|Item |Fiber Disc |2BAE7 |
|Disc Diameter |5 in |2BAE7 |
|Mounting Hole Size |7/8 in |2BAE7 |
|Abrasive Type |Coated |2BAE7 |
|Abrasive Material |Zirconia Alumina |2BAE7 |
|Abrasive Grit |36 |2BAE7 |
|Abrasive Grade |Extra Coarse |2BAE7 |
|Disc Backing Material |Fiber |2BAE7 |
|Backing Weight |Y |2BAE7 |
|Max. RPM |12,000 RPM |2BAE7 |
|Series |501C |2BAE7 |
|Color |Black |2BAE7 |
|Vacuum Hole Design |Non-Vacuum |2BAE7 |
|Item |Hook-and-Loop Sanding Disc |2BAE9 |
|Abrasive Type |Coated |2BAE9 |
|Vacuum Hole Design |Non-Vacuum |2BAE9 |
|Disc Diameter |3 in |2BAE9 |
|Abrasive Grit |400 |2BAE9 |
|Abrasive Grade |Super Fine |2BAE9 |
|Abrasive Material |Aluminum Oxide |2BAE9 |
|Disc Backing Material |Film |2BAE9 |
|Backing Weight |2 mil |2BAE9 |
|Series |216U |2BAE9 |
|Color |Gold |2BAE9 |

Conclusion

RSelenium provides many other functions that are not covered here; this is only an introduction to how Selenium works and how it interacts with other common R packages such as rvest. Among its other features, Selenium can take screenshots, click specific links or sections of a page, scroll down the page, and send keyboard input to any part of a web page; a short sketch of these follows. Combined with classic crawling techniques, it has a wide range of uses and can scrape almost any website.
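Here is a minimal sketch of a few of those features. It assumes an RSelenium session is already open in remDr (as created with rsDriver() earlier); the commented-out CSS selector in the click example is an illustrative placeholder, not a real Grainger element.

remDr$navigate("https://www.grainger.com/")
# take a screenshot of the current page and save it to disk
remDr$screenshot(file = "grainger_home.png")
# scroll to the bottom of the page by sending the End key to the body
webElem <- remDr$findElement("css", "body")
webElem$sendKeysToElement(list(key = "end"))
# type into the search box (the "searchQuery" element used earlier) and press Enter
# searchBox <- remDr$findElement(using = "name", value = "searchQuery")
# searchBox$sendKeysToElement(list("caster wheel", key = "enter"))
# click an element located by a CSS selector (selector is illustrative)
# link <- remDr$findElement(using = "css selector", "a.example-link")
# link$clickElement()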

Source: https://cran.r-project.org/web/packages/RSelenium/RSelenium.pdf