You have a massive Data Science assignment due the following day, and you realise you still need to find some data. Fortunately, at 2am, you land on a website containing a trove of data that meets your requirements. Unfortunately, you also realise you would need to spend literally hours navigating pages and clicking buttons to download it all.
I am sure many of us have been in this very boat and have opted to spend the time mindlessly clicking on buttons and links. Fortunately, there is an R package that lets you automate the mind-numbing tasks described above: RSelenium.
Selenium, not to be confused with Selena Gomez, is an automation tool which mimics the actions of a real person interacting with a website. It is commonly used in the IT industry to automate the testing of system functionality prior to deployment. This has the benefit of not boring a tester to tears and of reducing the cost of a change to the organisation.
To demonstrate how powerful RSelenium is, I will go through the following scenario:
Scenario: As part of the above assignment, you realise you need to translate the address data in one of your data sources into geolocation data. You come across a website called https://www.latlong.net/ which lets you look up the latitude and longitude of any given location.
Firstly, install the RSelenium package via the following code:
install.packages('RSelenium')
I actually had an extremely difficult time here, as I ran into an incompatibility issue with my ChromeDriver version. (ChromeDriver is an application that sits between R and Chrome; it translates the R code into visible actions in Chrome.)
The error I ran into was:
[1] "Connecting to remote server"
Selenium message:session not created: This version of ChromeDriver only supports Chrome version 77
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'HIDDEN', ip: 'HIDDEN', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_181'
Could not open chrome browser.
Client error message:
Summary: SessionNotCreatedException
Detail: A new session could not be created.
Further Details: run errorDetails method
Check server log for further details.
The message means the ChromeDriver that was started did not match my installed version of Chrome. To fix this, I used the following code to force R to use a particular version of ChromeDriver. In my case, version 76.0.3809.68 worked.
driver <- rsDriver(browser=c("chrome"), chromever="76.0.3809.68")
You can find out which ChromeDriver versions are available to choose from via the following code:
binman::list_versions("chromedriver")
From the picture below, there are really only 2 steps required to perform the search.
Step 1: Type a location into the ‘Place Name’ field.
Step 2: Click the ‘Find’ button.
Looking at the HTML code of the page, we can see that the ‘Place Name’ field has the class ‘width70’ and the ‘Find’ button has the class ‘button’.
Hence, the code to automate typing an address into ‘Place Name’ and then clicking the ‘Find’ button is as per below:
# Load the RSelenium library and initiate a connection to the Selenium server via ChromeDriver.
library(RSelenium)
driver <- rsDriver(browser=c("chrome"), chromever="76.0.3809.68")
remote_driver <- driver[["client"]]
remote_driver$open()
# Navigate to the below URL
remote_driver$navigate("https://www.latlong.net/convert-address-to-lat-long.html")
# Locate and save the 'Place Name' field into address_textfield
# There are multiple ways of locating an element on an HTML page. The example below uses 'class'; a later step looks at another approach.
address_textfield <- remote_driver$findElement(using = 'class', value = 'width70')
# Mimic the process of typing 15 Broadway Ultimo NSW 2007 into Place Name
address_textfield$sendKeysToElement(list("15 Broadway Ultimo NSW 2007"))
# Locate and save the 'Find' button into find_button
find_button <- remote_driver$findElement(using = 'class', value = "button")
# Click the 'Find' button and wait 5 seconds for results to appear.
find_button$clickElement()
# Sys.sleep suspends execution for the given number of seconds (5 here), giving the search time to return a result.
# How long that takes varies with the performance and workload of the service. A better implementation
# would loop and check whether a result has appeared: exit the loop if it has, otherwise sleep for another second.
Sys.sleep(5)
In this step, we will scrape the results of the search. The data is populated in 2 fields, Latitude and Longitude.
# We will locate the elements via the 'id' of each field (which is the second approach).
# Scrape the 'value' of each field and save it into the lat and lng variables.
lat <- remote_driver$findElement(using = "id", value="lat")$getElementAttribute("value")
lng <- remote_driver$findElement(using = "id", value="lng")$getElementAttribute("value")
# Print results.
print(lat)
print(lng)
To look up several addresses in one go, the same steps can be wrapped in a loop:
street_names <- c("15 Broadway Ultimo NSW 2007",
"16 Broadway Ultimo NSW 2007",
"17 Broadway Ultimo NSW 2007",
"18 Broadway Ultimo NSW 2007",
"19 Broadway Ultimo NSW 2007")
library(RSelenium)
driver <- rsDriver(browser=c("chrome"), chromever="76.0.3809.68")
remote_driver <- driver[["client"]]
remote_driver$open()
remote_driver$navigate("https://www.latlong.net/convert-address-to-lat-long.html")
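# Pre-allocate the result vectors so the loop also works in a fresh R session.
lat <- character(length(street_names))
lng <- character(length(street_names))
for (i in seq_along(street_names)) {
  # Reload the page so the form is empty before each search.
  remote_driver$refresh()
  Sys.sleep(1)
  address_textfield <- remote_driver$findElement(using = 'class', value = 'width70')
  address_textfield$sendKeysToElement(list(street_names[i]))
  find_button <- remote_driver$findElement(using = 'class', value = "button")
  find_button$clickElement()
  # Give the search time to return a result.
  Sys.sleep(5)
  # getElementAttribute() returns a list, so take its first element.
  lat[i] <- remote_driver$findElement(using = "id", value="lat")$getElementAttribute("value")[[1]]
  lng[i] <- remote_driver$findElement(using = "id", value="lng")$getElementAttribute("value")[[1]]
}
lat
lng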
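The fixed Sys.sleep(5) calls above could be swapped for the polling approach mentioned in the code comments. Below is a minimal sketch of that idea, not a definitive implementation. The helper name wait_for_value and its arguments read_value, timeout and interval are my own illustrative choices and are not part of RSelenium. It also assumes the result field is empty until the search has returned, which you should verify against the actual page. So that it runs without a browser, the demo at the bottom uses a stand-in reader function instead of a real Selenium call.

```r
# Poll `read_value` until it returns a non-empty string, or give up after
# `timeout` seconds. `wait_for_value` is a hypothetical helper for illustration;
# it is not an RSelenium function.
wait_for_value <- function(read_value, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat a failed read (e.g. element not found yet) as "no value yet".
    value <- tryCatch(read_value(), error = function(e) "")
    if (length(value) == 1 && !is.na(value) && nzchar(value)) {
      return(value)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for a value")
    }
    Sys.sleep(interval)
  }
}

# Stand-in reader for demonstration only: returns an empty string on the
# first two calls and a placeholder value on the third.
calls <- 0
fake_reader <- function() {
  calls <<- calls + 1
  if (calls < 3) "" else "ready"
}

result <- wait_for_value(fake_reader, timeout = 5, interval = 0.1)
print(result)
```

With a live session, the reader passed in could be something like function() remote_driver$findElement(using = "id", value = "lat")$getElementAttribute("value")[[1]], so the script moves on as soon as the latitude appears rather than always waiting the full 5 seconds.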