Many websites are difficult to scrape because they use JavaScript and jQuery to load data from the database dynamically. For example, on common social media sites such as LinkedIn or Facebook, new content loads as you scroll down the page while the URL stays the same. To simplify the scraping task on sites like these, we can look for a systematic pattern in the URL and adjust it to load a new page.
The Grainger website, https://www.grainger.com/, is one example: its URL changes systematically as you move between pages.
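As a minimal sketch of the URL-pattern idea, the snippet below builds one URL per item number from a template. The template `https://www.example.com/search?q=%s` and the helper name `build_urls` are hypothetical placeholders, not Grainger's actual URL scheme; the real pattern has to be read off the site's address bar.

```r
# Hypothetical helper: fill a URL template with each item number.
# The template below is a placeholder, not a real Grainger endpoint.
build_urls <- function(item_numbers,
                       template = "https://www.example.com/search?q=%s") {
  # Percent-encode each value so spaces or slashes cannot break the URL
  encoded <- vapply(item_numbers, utils::URLencode, character(1),
                    reserved = TRUE, USE.NAMES = FALSE)
  sprintf(template, encoded)
}

urls <- build_urls(c("19YP94", "2BAE7"))
print(urls)
```

Each resulting URL could then be passed to `remDr$navigate()` in place of typing into the search box.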
my_data <- read.csv("https://raw.githubusercontent.com/szx868/data607/master/Presentation/infile.txt",header=T)
my_data
## Item.Number
## 1 19YP94
## 2 2BAE7
## 3 2BAE9
## 4 2BAF8
## 5 2BAF9
## 6 472U87
## 7 2BAG5
## 8 2BAH1
## 9 2BAH3
## 10 2BAH4
## 11 2BAH7
## 12 2BAH8
## 13 2BAJ2
library(RSelenium)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stringr)
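# load a VPN extension (a .crx file in the working directory) into Chrome
cprof <- list(chromeOptions =
  list(extensions =
    list(base64enc::base64encode("VPN_PROXY_MASTER.crx"))
  ))
# start the Selenium server and open a Chrome session
rD <- rsDriver(port = 4453L, extraCapabilities = cprof, browser = "chrome", chromever = "latest")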
## checking Selenium Server versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking chromedriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking geckodriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking phantomjs versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
remDr <- rD[["client"]]
remDr$setTimeout(type = 'page load', milliseconds = 120000)
remDr$setTimeout(type = 'implicit', milliseconds = 120000)
remDr$navigate("chrome-extension://lnfdmdhmfbimhhpaeocncdlhiodoblbd/popup/popup.html")
time <- 5
Sys.sleep(time)
webElem <- remDr$findElement("css", "body")
# find the VPN extension's start button
morereviews <- remDr$findElement(using = 'css selector', ".start-btn")
# click it to connect
morereviews$clickElement()
# wait for the connection to come up
Sys.sleep(8)
temphtml <- remDr$getPageSource()[[1]]
# if the popup does not report "Connected" yet, click once more and wait again
if (!str_detect(temphtml, "Connected")) {
  morereviews <- remDr$findElement(using = 'css selector', ".start-btn")
  morereviews$clickElement()
  Sys.sleep(8)
}
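The fixed `Sys.sleep()` pauses above are either longer than needed or too short on a slow connection. One alternative, sketched here under the assumption that the readiness check can be written as a function returning `TRUE` or `FALSE`, is to poll until the condition holds or a timeout expires. `wait_until` and its parameters are illustrative names, not part of RSelenium.

```r
# Hypothetical polling helper (not an RSelenium function).
# Calls `condition()` repeatedly until it returns TRUE or `timeout` seconds pass.
wait_until <- function(condition, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat errors inside the check as "not ready yet"
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-in condition for demonstration: becomes TRUE on the third check
checks <- 0
ready <- wait_until(function() {
  checks <<- checks + 1
  checks >= 3
}, timeout = 5, interval = 0.1)
print(ready)
```

With a live session, the condition could be something like `function() grepl("Connected", remDr$getPageSource()[[1]], fixed = TRUE)`.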
remDr$setTimeout(type = 'page load', milliseconds = 120000)
remDr$setTimeout(type = 'implicit', milliseconds = 120000)
remDr$navigate("https://www.grainger.com/")
webElem2 <- remDr$findElement("css", "body")
# type an item number into the search box and press Enter
morereviews2 <- remDr$findElement(using = "name", value = "searchQuery")$sendKeysToElement(list('3d264',key="enter"))
Sys.sleep(5) # give the page time to fully load
html <- remDr$getPageSource()[[1]]
remDr$close()
rD$server$stop()
## [1] TRUE
rm(rD, remDr)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1266711 67.7 2345677 125.3 2345677 125.3
## Vcells 2339824 17.9 8388608 64.0 4094265 31.3
# Windows only: force-kill any leftover Java (Selenium server) process
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
## [1] 0
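If an error occurs between `rsDriver()` and the cleanup calls above, the browser and the Selenium server keep running. A common base R pattern is to register the cleanup with `on.exit()` so that it runs whether or not the work succeeds. The sketch below uses a stand-in resource (a plain environment) instead of a real Selenium session so that it runs anywhere; `with_resource`, `open_fn`, and `close_fn` are hypothetical names.

```r
# Hypothetical wrapper: acquire a resource, run `work`, and always clean up.
with_resource <- function(open_fn, close_fn, work) {
  res <- open_fn()
  # on.exit() runs when the function exits, even if `work` throws an error
  on.exit(close_fn(res), add = TRUE)
  work(res)
}

# Stand-in resource so the sketch runs without a browser
events <- character()
open_fake  <- function() { events <<- c(events, "open"); new.env() }
close_fake <- function(res) { events <<- c(events, "close"); invisible(NULL) }

# The work fails, but the cleanup step is still recorded
result <- tryCatch(
  with_resource(open_fake, close_fake,
                function(res) stop("page failed to load")),
  error = function(e) conditionMessage(e)
)
print(events)
```

With RSelenium, `open_fn` would call `rsDriver(...)` and `close_fn` would call the client's `close()` and the server's `stop()`, as in the cleanup lines above.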
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
library("knitr")
item.desc <- read_html(html) %>% # parse HTML
  html_nodes(".specifications__description") %>% # extract nodes with class = "specifications__description"
  html_text()
item.value <- read_html(html) %>% # parse HTML
  html_nodes(".specifications__value") %>% # extract nodes with class = "specifications__value"
  html_text()
tech.spec <- cbind(item.desc, item.value)
tech.spec
## item.desc item.value
## [1,] "Item - Plugs and Receptacles" "Locking Connector"
## [2,] "Amps - Plugs and Receptacles" "15/10"
## [3,] "Voltage - Plugs and Receptacles" "125/250V AC"
## [4,] "NEMA Plug Configuration - Plugs and Receptacles" "15A, Non-NEMA"
## [5,] "Number of Poles" "3"
## [6,] "Number of Wires" "3"
## [7,] "Wiring Style" "Standard"
## [8,] "Color - Plugs and Receptacles" "Black/White"
## [9,] "Shrouded / Non-Shrouded" "Non-Shrouded"
## [10,] "Grade - Plugs and Receptacles" "Industrial"
## [11,] "Power Indicator" "No"
## [12,] "Corrosion Resistant" "No"
## [13,] "IP Rating" "20"
## [14,] "Cord Size" "0.23 in to 0.72 in"
## [15,] "Phase" "1"
## [16,] "Material" "Nylon"
## [17,] "Antimicrobial" "No"
## [18,] "Standards" "UL/CSA"
## [19,] "Item" "Locking Connector"
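The extraction step can be sketched on a small inline HTML fragment so that it runs without a browser. The fragment below is made up for illustration and only mimics the two class names used above; the real page structure may differ. Parsing the document once and checking that both node sets have the same length guards against misaligned name/value pairs.

```r
library(rvest)

# Made-up fragment that mimics the class names used above
fragment <- '
<ul>
  <li><span class="specifications__description">Number of Poles</span>
      <span class="specifications__value">3</span></li>
  <li><span class="specifications__description">Material</span>
      <span class="specifications__value"> Nylon </span></li>
</ul>'

page <- read_html(fragment)  # parse once, reuse for both queries

spec_name <- page %>%
  html_nodes(".specifications__description") %>%
  html_text(trim = TRUE)
spec_value <- page %>%
  html_nodes(".specifications__value") %>%
  html_text(trim = TRUE)

# Names and values must pair up one-to-one
stopifnot(length(spec_name) == length(spec_value))

specs <- data.frame(name = spec_name, value = spec_value,
                    stringsAsFactors = FALSE)
print(specs)
```

Using `trim = TRUE` strips the surrounding whitespace that scraped text often carries.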
count <- 0
for(i in 1:nrow(my_data)){
  count <- count + 1
  print(count)
  print(my_data[i,])
}
## [1] 1
## [1] "19YP94"
## [1] 2
## [1] "2BAE7"
## [1] 3
## [1] "2BAE9"
## [1] 4
## [1] "2BAF8"
## [1] 5
## [1] "2BAF9"
## [1] 6
## [1] "472U87"
## [1] 7
## [1] "2BAG5"
## [1] 8
## [1] "2BAH1"
## [1] 9
## [1] "2BAH3"
## [1] 10
## [1] "2BAH4"
## [1] 11
## [1] "2BAH7"
## [1] 12
## [1] "2BAH8"
## [1] 13
## [1] "2BAJ2"
library(stringr)
library(RSelenium)
library(tidyverse)
library(rvest)
library("knitr")
my_data <- read.csv("https://raw.githubusercontent.com/szx868/data607/master/Presentation/infile.txt",header=T)
df = data.frame(x = character(), y = character(), z = character())
for (i in 1:3){
  # start a fresh Chrome session with the VPN extension loaded
  cprof <- list(chromeOptions =
    list(extensions =
      list(base64enc::base64encode("VPN_PROXY_MASTER.crx"))
    ))
  rD <- rsDriver(port = 4451L, extraCapabilities = cprof, browser = "chrome", chromever = "latest")
  remDr <- rD[["client"]]
  remDr$setTimeout(type = 'page load', milliseconds = 120000)
  remDr$setTimeout(type = 'implicit', milliseconds = 120000)
  remDr$navigate("chrome-extension://lnfdmdhmfbimhhpaeocncdlhiodoblbd/popup/popup.html")
  Sys.sleep(5)
  webElem <- remDr$findElement("css", "body")
  # find button
  morereviews <- remDr$findElement(using = 'css selector', ".start-btn")
  # click button
  morereviews$clickElement()
  # wait
  Sys.sleep(8)
  temphtml <- remDr$getPageSource()[[1]]
  # if the popup does not report "Connected" yet, click once more and wait again
  if (!str_detect(temphtml, "Connected")) {
    morereviews <- remDr$findElement(using = 'css selector', ".start-btn")
    morereviews$clickElement()
    Sys.sleep(8)
  }
  remDr$setTimeout(type = 'page load', milliseconds = 120000)
  remDr$setTimeout(type = 'implicit', milliseconds = 120000)
  remDr$navigate("https://www.grainger.com/")
  webElem2 <- remDr$findElement("css", "body")
  # search for the i-th item number
  morereviews2 <- remDr$findElement(using = "name", value = "searchQuery")$sendKeysToElement(list(my_data[i,],key="enter"))
  Sys.sleep(5) # give the page time to fully load
  html <- remDr$getPageSource()[[1]]
  # shut down the browser and the Selenium server
  remDr$close()
  rD$server$stop()
  rm(rD, remDr)
  gc()
  system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
  item.desc <- read_html(html) %>% # parse HTML
    html_nodes(".specifications__description") %>% # extract nodes with class = "specifications__description"
    html_text()
  item.value <- read_html(html) %>% # parse HTML
    html_nodes(".specifications__value") %>% # extract nodes with class = "specifications__value"
    html_text()
  # append this item's specifications (x = name, y = value, z = item number)
  df <- rbind(df, data.frame(x = item.desc, y = item.value, z = my_data[i,]))
}
## checking Selenium Server versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking chromedriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking geckodriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking phantomjs versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## [1] "Connecting to remote server"
## $acceptInsecureCerts
## [1] FALSE
##
## $browserName
## [1] "chrome"
##
## $browserVersion
## [1] "86.0.4240.75"
##
## $chrome
## $chrome$chromedriverVersion
## [1] "86.0.4240.22 (398b0743353ff36fb1b82468f63a3a93b4e2e89e-refs/branch-heads/4240@{#378})"
##
## $chrome$userDataDir
## [1] "C:\\Users\\JOSEPH~1\\AppData\\Local\\Temp\\scoped_dir29272_1537044230"
##
##
## $`goog:chromeOptions`
## $`goog:chromeOptions`$debuggerAddress
## [1] "localhost:53896"
##
##
## $networkConnectionEnabled
## [1] FALSE
##
## $pageLoadStrategy
## [1] "normal"
##
## $platformName
## [1] "windows"
##
## $proxy
## named list()
##
## $setWindowRect
## [1] TRUE
##
## $strictFileInteractability
## [1] FALSE
##
## $timeouts
## $timeouts$implicit
## [1] 0
##
## $timeouts$pageLoad
## [1] 300000
##
## $timeouts$script
## [1] 30000
##
##
## $unhandledPromptBehavior
## [1] "dismiss and notify"
##
## $`webauthn:virtualAuthenticators`
## [1] TRUE
##
## $webdriver.remote.sessionid
## [1] "15708c7aa490379df48975e07d3ac9a7"
##
## $id
## [1] "15708c7aa490379df48975e07d3ac9a7"
kable(df)
| x | y | z |
|---|---|---|
| Item | Standard Plate Caster | 19YP94 |
| Caster Sub-Type | Rigid | 19YP94 |
| Load Rating | 400 lb | 19YP94 |
| Wheel Dia. | 5 in | 19YP94 |
| Wheel Width | 1 5/8 in | 19YP94 |
| Mounting Height | 6 1/10 in | 19YP94 |
| Wheel/Tread Material | Polyurethane | 19YP94 |
| Wheel/Tread Hardness Rating | Firm | 19YP94 |
| Wheel/Tread Hardness | 92 Shore A | 19YP94 |
| Wheel/Tread Color | Yellow | 19YP94 |
| Wheel Bearing Type | Precision Ball | 19YP94 |
| Mounting Plate Bolt Hole Pattern | C | 19YP94 |
| Mounting Plate Size | 2 1/2 in x 3 5/8 in | 19YP94 |
| Inside Bolt Hole Spacing | 1-3/4 in x 2-13/16 in | 19YP94 |
| Outside Bolt Hole Spacing | 1-3/4 in x 3-1/16 in | 19YP94 |
| Mounting Bolt Dia. | 5/16 in | 19YP94 |
| Number of Mounting Holes | 4 | 19YP94 |
| Swivel Lock Type | No Lock Included | 19YP94 |
| Caster Brake Type | No Brake Included | 19YP94 |
| Frame Finish | Zinc Plated | 19YP94 |
| Caster Frame Material | Steel | 19YP94 |
| Tread Shape | Standard | 19YP94 |
| Includes Thread Guards | No | 19YP94 |
| Wheel Core Material | Aluminum | 19YP94 |
| Core Color | Gray | 19YP94 |
| Non-Marking | Yes | 19YP94 |
| Number of Wheels | 1 | 19YP94 |
| Replacement Wheel | Mfr. No. ALTH 125/15K-B10 | 19YP94 |
| Application | General / Industrial / Commercial / Institutional purpose | 19YP94 |
| Item | Fiber Disc | 2BAE7 |
| Disc Diameter | 5 in | 2BAE7 |
| Mounting Hole Size | 7/8 in | 2BAE7 |
| Abrasive Type | Coated | 2BAE7 |
| Abrasive Material | Zirconia Alumina | 2BAE7 |
| Abrasive Grit | 36 | 2BAE7 |
| Abrasive Grade | Extra Coarse | 2BAE7 |
| Disc Backing Material | Fiber | 2BAE7 |
| Backing Weight | Y | 2BAE7 |
| Max. RPM | 12,000 RPM | 2BAE7 |
| Series | 501C | 2BAE7 |
| Color | Black | 2BAE7 |
| Vacuum Hole Design | Non-Vacuum | 2BAE7 |
| Item | Hook-and-Loop Sanding Disc | 2BAE9 |
| Abrasive Type | Coated | 2BAE9 |
| Vacuum Hole Design | Non-Vacuum | 2BAE9 |
| Disc Diameter | 3 in | 2BAE9 |
| Abrasive Grit | 400 | 2BAE9 |
| Abrasive Grade | Super Fine | 2BAE9 |
| Abrasive Material | Aluminum Oxide | 2BAE9 |
| Disc Backing Material | Film | 2BAE9 |
| Backing Weight | 2 mil | 2BAE9 |
| Series | 216U | 2BAE9 |
| Color | Gold | 2BAE9 |
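Calling `rbind()` inside the loop copies the growing data frame on every iteration. A sketch of an alternative collects one data frame per item in a list and binds them once at the end, with `tryCatch()` so that a single failed item does not stop the run. `scrape_item` is a hypothetical stand-in that returns canned data; in practice it would wrap the Selenium and rvest steps shown earlier.

```r
# Hypothetical stand-in for the real scraping step; returns canned rows
scrape_item <- function(item) {
  if (item == "BAD") stop("item not found")
  data.frame(x = c("Item", "Color"),
             y = c(paste("Spec for", item), "Black"),
             z = item,
             stringsAsFactors = FALSE)
}

items <- c("19YP94", "BAD", "2BAE7")

# One list slot per item; a failed item yields NULL instead of an error
results <- lapply(items, function(item) {
  tryCatch(scrape_item(item),
           error = function(e) {
             message("Skipping ", item, ": ", conditionMessage(e))
             NULL
           })
})

# Drop the failures, then bind once after the loop
results <- Filter(Negate(is.null), results)
df_all <- do.call(rbind, results)
print(df_all)
```

The same structure also makes it easy to log which item numbers failed and retry only those.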
RSelenium provides many other functions that are not described here. This is an introduction to how Selenium works and how it interacts with other common R packages such as rvest. Among other features, Selenium can take screenshots, click specific links or sections of a page, scroll down the page, and send keystrokes to any part of a web page. Combined with classic scraping techniques, it has a wide range of uses and can scrape data from almost any website.
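The interactions mentioned above map onto a handful of methods on the RSelenium client. The function below is a sketch that assumes `remDr` is an already-open client such as the one returned by `rsDriver()`; the function name `demo_actions`, the default CSS selector, and the file name are illustrative assumptions.

```r
# Sketch: common RSelenium interactions on an already-open client `remDr`.
# `demo_actions`, the selector, and the file name are illustrative only.
demo_actions <- function(remDr, url, selector = "a", shot_file = "page.png") {
  remDr$navigate(url)

  # Scroll to the bottom of the page by running JavaScript in the browser
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);",
                      args = list())

  # Save a screenshot to disk
  remDr$screenshot(file = shot_file)

  # Click the first element matching the CSS selector
  elem <- remDr$findElement(using = "css selector", value = selector)
  elem$clickElement()

  # Type into the page body and press Enter
  body <- remDr$findElement(using = "css selector", value = "body")
  body$sendKeysToElement(list("hello", key = "enter"))

  invisible(TRUE)
}
```

With a live session it would be called as, for example, `demo_actions(remDr, "https://www.grainger.com/")`.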
Source: https://cran.r-project.org/web/packages/RSelenium/RSelenium.pdf