Web crawling on JavaScript-heavy web pages

This is the first part of a series on the application of text and web mining in criminology. In this project we are going to extract data and text from the Orange County Sheriff's Office website in Orlando. We will then use the extracted dataset for further analysis with Natural Language Processing (NLP) techniques.

RSelenium for web crawling

For web crawling, I usually use Selenium. Selenium is a web scraping tool that simulates a user surfing the Internet. For example, you can use it to automatically run Google queries and read the results, log in to your social accounts, simulate a user to test your web application, and automate anything repetitive you encounter in your daily life. In order to use Selenium inside R you have to install the RSelenium package. For the basics of RSelenium you can use this RSelenium. I also recommend using the Google Chrome web driver, which you can find here: Chrome WebDriver. For a quick tutorial on RSelenium it is worth watching this video on YouTube Video
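
If you have not installed these packages yet, a minimal setup might look like the sketch below (rvest is included here because it is used later in this post for parsing the page source):

# Install RSelenium (for driving the browser) and rvest (used later for parsing)
install.packages(c("RSelenium", "rvest"))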

Let's start by loading RSelenium and introducing the link. We want to extract information about Unresolved Homicides from Orlando's Orange County Sheriff's Office. The link for this webpage is Here. I have included a screenshot so you can make sure you are following the right link.

Orlando’s Orange County Sheriff’s Office website

This website relies heavily on JavaScript. The middle part of the site, which contains the list of cases, is a good example of pure JavaScript. To scrape a JavaScript webpage, your code must be able to handle the JavaScript-generated content. In R, for example, the XML and rvest packages that we usually use for scraping cannot capture JavaScript-generated content. Here we use RSelenium, which drives a real web browser and can therefore read everything the page renders; by connecting Chrome to R we can extract the content these JavaScript parts produce. The part of the page we are looking for is shown below. There are at most 16 cases per page. We want to click on each case, go to its profile, and extract and save information from that profile.

The JavaScript part of the page

Let's start the RSelenium driver (I will use Chrome as the driver) and navigate to the webpage; then we will see what is important when scraping these kinds of pages.

library(RSelenium)

# Link to the Unresolved Homicide page
link1 = "https://www.ocso.com/Crime-Information/Unresolved-Homicide"

# Start the Selenium server, open a Chrome session, and go to the page
rD <- rsDriver(port = 4446L, browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate(link1)
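
When you are finished with a scraping session, it is good practice to close the browser and stop the Selenium server. A small sketch; run it only once you are done, since closing the driver also wipes cookies and extensions, as noted below:

# Close the Chrome session and stop the Selenium server
remDr$close()
rD$server$stop()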

Important Considerations

There are two things you have to consider about JavaScript pages:

  1. Not all elements of a webpage are visible and fully loaded until you scroll the page close to their position.
  2. Give your browser some time to load the information, because JavaScript content loads lazily (see the short sketch after this list).
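
A minimal sketch of the pattern that covers both points, using values that appear later in this post: scroll near the element, give the page a moment, then read the page source.

# Scroll near the element, wait for the JavaScript to render, then read the page
remDr$executeScript("window.scrollTo(0,1775);")
Sys.sleep(3)
page_source <- remDr$getPageSource()[[1]]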

In order to find the position of elements on the webpage, you can use a “Page Ruler” extension in Chrome: Page Ruler. (Remember, if you are using the Chrome driver, each time you close the driver all the caches, saved passwords, cookies, and extensions are removed, so it is a good idea to have another standalone Chrome browser installed on your system.)

If you activate Page Ruler, you can easily find the position of each element on the webpage. By clicking on the upper left side of the bar you can activate the finder:

Activating tracking mode

Then scroll down the page, find the first unresolved case, and select it using “Page Ruler”. Before that, make sure the “Tracking Mode” toggle in the upper right corner of the page is turned on. Below you can see that I have selected the first case's picture to check the position of this element.

Selecting the first case

As is clear from the picture above, the height of the selected element from the top of the page is 1775 pixels. So, if we want to extract the information of this element and the other elements in this row, we have to scroll the page to this position. You can use the code below to scroll the page down in RSelenium:

remDr$executeScript("window.scrollTo(0,1775);")

The first argument, “0”, is the horizontal position in pixels; it is set to zero here because there is no need to scroll left or right to see the element. The second argument, “1775”, is the distance of the element from the top of the page in pixels. So, by increasing 1775 the page scrolls further and further down.
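
Since the scroll positions used later in this post (1775, 2121, 2467, 2813) increase by a constant 346 pixels per row of cases, you could also compute the position instead of hard-coding it. A small sketch, assuming each row of cases is roughly 346 pixels tall:

# First row of cases starts around 1775 px; each later row is about 346 px lower
row_top    <- 1775
row_height <- 346
row_index  <- 2   # second row, i.e. cases 5 to 8
remDr$executeScript(paste0("window.scrollTo(0,", row_top + (row_index - 1) * row_height, ");"))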

In this project we also want to move between the different pages inside the JavaScript content. Take a look at the picture below:

Different pages

Start navigating and scraping

First, we create some empty lists to hold the extracted elements:

# Create empty lists for the extracted fields
casenumber=list()
date_of_murdered=list()
location=list()
race=list()
sex=list()
description=list()
SUBJECT_race=list()
SUBJECT_sex=list()
data2=data.frame()

I will use a list of the year-range captions to navigate Chrome to each page:

# Create a list of link texts for each year range
yr1=c("1980-89","1990-99","2000-09","2010-present")

If you check the picture above, you will see that I chose only the last four time periods from the list of different periods. Then I will create a for loop to explore each page separately. The code looks something like this:

for (k in 1:4){
  # Click the link whose text matches the year range; the scraping code
  # for that page (shown below) goes inside this loop
  cl1 <- remDr$findElement(using="link text",yr1[k])
  cl1$clickElement()
}

On each page we have to extract the page source, get the number of cases on that page, and then go through each case's webpage. The following code gets the page source, finds the number of case elements on the page, and stores it in the ln1 object.

library(rvest)   # provides read_html(), html_nodes(), and html_text()

node2 <- read_html(remDr$getPageSource()[[1]])
ln1=length(html_nodes(node2,xpath = "//div[2]/div/div/div/span/a"))

Then we need a for loop that iterates up to ln1, visits each case's webpage, and extracts the information. If you click on the photo of a case, the browser navigates to the page containing that case's detailed information. In the picture below I have highlighted the information we need from each page:

Desired information

We will extract the case number, date murdered, location of the incident, race, sex, description, and the subject's (offender's) race and sex. Below you can find the for loop for extracting this information. It goes to each case's webpage and then returns to the previous page by clicking the return button at the bottom of the case's page.

for(i in 1:ln1){
    # Scroll to the row that contains case i (each row holds four cases)
    if (i %in% c(1,2,3,4)) {remDr$executeScript("window.scrollTo(0,1775);")}
    if (i %in% c(5,6,7,8)) {remDr$executeScript("window.scrollTo(0,2121);")}
    if (i %in% c(9,10,11,12)) {remDr$executeScript("window.scrollTo(0,2467);")}
    if (i %in% c(13,14,15,16)) {remDr$executeScript("window.scrollTo(0,2813);")}
    Sys.sleep(3)
    # Click the i-th case to open its profile page
    cl1 <- remDr$findElement(using="xpath",paste0("//div[2]/div/div","[",i,"]","/div/span/a"))
    cl1$clickElement()
    Sys.sleep(3)
    # "Shake" the page so the JavaScript content finishes loading
    remDr$executeScript("window.scrollTo(0,1400);")
    remDr$executeScript("window.scrollBy(0,-50);")
    remDr$executeScript("window.scrollBy(0,50);")
    Sys.sleep(2)
    # Parse the loaded profile page and extract each field
    node3 <- read_html(remDr$getPageSource()[[1]])
    casenumber[i]<-node3 %>%
      html_nodes(xpath = "//td/table/tbody/tr[2]/td[1]/table/tbody/tr[1]/td[2]") %>%
      html_text(trim=TRUE)
    date_of_murdered[i]<-node3 %>%
      html_nodes(xpath="//td/table/tbody/tr[2]/td[1]/table/tbody/tr[2]/td[2]") %>%
      html_text(trim=TRUE)
    location[i]<-node3 %>%
      html_nodes(xpath="//td/table/tbody/tr[2]/td[1]/table/tbody/tr[3]/td[2]") %>%
      html_text(trim=TRUE)
    race[i]<-node3 %>%
      html_nodes(xpath="//tr[4]/td/table/tbody/tr/td[4]") %>%
      html_text(trim=TRUE)
    sex[i]<-node3 %>%
      html_nodes(xpath="//tr[4]/td/table/tbody/tr/td[6]") %>%
      html_text(trim=TRUE)
    description[i]<-node3 %>%
      html_nodes(xpath="//td/table/tbody/tr[2]/td[1]/table/tbody/tr[8]/td[2]") %>%
      html_text(trim=TRUE)
    SUBJECT_race[i]<-node3 %>%
      html_nodes(xpath="//tr[3]/td/table/tbody/tr/td[4]") %>%
      html_text(trim=TRUE)
    SUBJECT_sex[i]<-node3 %>%
      html_nodes(xpath="//tr[3]/td/table/tbody/tr/td[6]") %>%
      html_text(trim=TRUE)
    Sys.sleep(3)
    # Scroll to the bottom of the profile and click the return button
    remDr$executeScript("window.scrollTo(0,3000);")
    Sys.sleep(3)
    cl1 <- remDr$findElement(using="css selector",".dnnSecondaryAction")
    remDr$executeScript("window.scrollBy(0,-50);")
    remDr$executeScript("window.scrollBy(0,50);")
    cl1$clickElement()
    Sys.sleep(3)
}

You can see that I used five Sys.sleep(3) calls. As I stated before, we have to give the browser some time to load the JavaScript content; these pauses delay the R code and let the JavaScript load without any problems. There are a couple of other tricks that I deployed in the code. The first is this part:

if (i %in% c(1,2,3,4)) {remDr$executeScript("window.scrollTo(0,1775);")}
if (i %in% c(5,6,7,8)) {remDr$executeScript("window.scrollTo(0,2121);")}
if (i %in% c(9,10,11,12)) {remDr$executeScript("window.scrollTo(0,2467);")}
if (i %in% c(13,14,15,16)) {remDr$executeScript("window.scrollTo(0,2813);")}

These four lines of code scroll the page to the desired location. When scraping cases 1 to 4, the viewport sits 1775 pixels from the top of the page; for cases 5 to 8 it sits at 2121, and so on. Thanks to these four lines, your code will not miss the JavaScript loading and will always keep the desired element in view.

There is another trick that sometimes helps JavaScript pages load properly; I call it "shake it". In my experience, even when you have scrolled to the element you want to scrape, the content for that part might not load correctly. In that case you have to scroll down and up to make sure it is fully loaded. The following lines of code do that for us; ‘scrollBy’ scrolls the page from the current position by (x, y) pixels.

remDr$executeScript("window.scrollBy(0,-50);")
remDr$executeScript("window.scrollBy(0,50);")

The rest of the code is just about extracting information from the website and using three for loops to move back and forth between the different webpages. You can find the whole code Here
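
Finally, the extracted lists can be collected into the data2 data frame created earlier, for example by appending one page's results at a time inside the year loop. A minimal sketch, assuming every field was found for every case so the lists all have the same length (column names are my own choice):

# Append this page's results to data2 for later NLP analysis
data2 <- rbind(data2, data.frame(
  casenumber       = unlist(casenumber),
  date_of_murdered = unlist(date_of_murdered),
  location         = unlist(location),
  race             = unlist(race),
  sex              = unlist(sex),
  description      = unlist(description),
  SUBJECT_race     = unlist(SUBJECT_race),
  SUBJECT_sex      = unlist(SUBJECT_sex),
  stringsAsFactors = FALSE
))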