Web Scraping Webinar

Install R and RStudio

First of all, R must be downloaded and installed. Please download R from https://cloud.r-project.org/.
Then RStudio needs to be downloaded from https://rstudio.com/products/rstudio/download/#download and installed.
For the functions contained in a package to be available in the R environment, the package must first be installed and then loaded. This is a two-step process. Please open RStudio and follow the steps in the Installing rvest section below.

Data Source

After following the steps mentioned above, everything should be in place to start writing simple R code that scrapes data from the web. In this part, the financial market platform https://www.investing.com/ is used as the data source for simple scraping use cases in R.

Investing.com is a platform that provides real-time data, charts, financial instruments, news and analysis in 44 different languages from all around the world. With over 46 million monthly users, Investing.com is one of the top three global finance sites according to SimilarWeb and Alexa.

Investing.com offers unlimited and completely free of charge access to services such as real-time offers, customized portfolios, personal alerts, calendars, calculators, and financial information.

Investing.com also provides information on Commodities, Cryptocurrencies, World Indices, World Currencies, Bonds, Funds and Interest Rates, ETF Futures and Options.

Installing rvest

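# Step 1: install the package (needed only once; remove the leading # to run)
#install.packages("rvest")
# Step 2: load the package into the current R session
library(rvest)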
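
The install and load steps can also be combined so that the installation runs only when the package is missing. This is a minimal sketch, assuming an internet connection is available; the repos value reuses the CRAN address given earlier for downloading R.

```r
# Step 1: install rvest only if it is not already installed
if (!requireNamespace("rvest", quietly = TRUE)) {
  install.packages("rvest", repos = "https://cloud.r-project.org")
}

# Step 2: load the package so its functions are available in this session
library(rvest)
```

Running the block a second time skips the installation and only loads the package.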

Step by Step Web Scraping

Since the following part focuses on “World Indices”, click World Indices on the site. The page related to “World Indices” then appears.

This page gives researchers useful information about stock markets from all over the world. The page offers no easy way to download these data points to Excel or any other kind of text file, so web scraping is the practical way to collect them. The rvest package provides quick and efficient tools for this. The page is reached by visiting https://www.investing.com/indices/world-indices; the URL is shown in the browser's address bar. The next step is to read the page's HTML with the read_html function from the rvest package. Using the function is straightforward: pass the URL to it, as shown below.

read_html("https://www.investing.com/indices/world-indices")

## {html_document}
## <html dir="ltr" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" xmlns:schema="http://schema.org/" class="com" lang="en" geo="PL">
##  [1] <head>\n<link rel="dns-prefetch" href="https://i-invdn-com.investing.com ...
##  [2] <body class="takeover dfpTakeovers ">\n\n<noscript>\n        <iframe src ...
##  [3] <img src="https://secure.adnxs.com/seg?t=2&amp;add=19833489" width="1" h ...
##  [4] <script>var TimeZoneID = +"8";window.timezoneOffset = +"-14400";window.s ...
##  [5] <script>window.uid = 0</script>
##  [6] <script>\n\n\t\t\t$(function(){\n\t\t\t\t$(window).trigger("socketRetry" ...
##  [7] <script>function refresher() { void (0);}</script>
##  [8] <script>\n    \tvar _comscore = _comscore || [];\n    \t_comscore.push(\ ...
##  [9] <script type="text/javascript">\n                var google_conversion_i ...
## [10] <script type="text/javascript" src="//www.googleadservices.com/pagead/co ...
## [11] <noscript><div style="display:inline;"><img height="1" width="1" style=" ...
## [12] <script type="text/javascript">\n        fbq = window.fbq || $.noop;\n   ...
## [13] <script>\n    $(function () {\n        $('.googleButtonWrapper').hover(f ...
## [14] <div id="g_id_onload" data-client_id="606447380154-9825jtap5as2sm0f868m5 ...

As a result of the command above, the HTML document is retrieved and part of it is printed to the RStudio console. The document should be stored in a variable so that it can be processed and the parts of interest extracted.

The HTML document can be stored in a variable by using the following command.

html_doc <- read_html("https://www.investing.com/indices/world-indices")
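
Reading a page can fail, for example when the connection drops or the site refuses the request. The sketch below wraps read_html in tryCatch so that a failure yields NULL instead of stopping the script. The helper name safe_read_html is hypothetical and not part of rvest, and the inline HTML string is used only so that the example runs without network access.

```r
library(rvest)

# Hypothetical helper: returns the parsed document, or NULL if reading fails
safe_read_html <- function(source) {
  tryCatch(
    read_html(source),
    error = function(e) {
      message("Could not read HTML: ", conditionMessage(e))
      NULL
    }
  )
}

# Usage with the URL from the text (requires an internet connection)
# html_doc <- safe_read_html("https://www.investing.com/indices/world-indices")

# Usage with an inline HTML string, which needs no network access
demo_doc <- safe_read_html("<html><body><p>demo</p></body></html>")
```

Checking is.null(demo_doc) before parsing keeps later steps from failing on a missing document.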

There are a few techniques for parsing HTML documents gathered from web pages. In this exercise, XPath will be used. XPath is a query language for accessing the elements and attributes of documents written in Extensible Markup Language (XML). Since an HTML document has a very similar structure, XPath can also be used to parse HTML documents.
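
To make the idea concrete, the sketch below runs two XPath queries against a tiny hand-written HTML document. The document, its ids and its values are made up for illustration and do not come from Investing.com. The html_nodes function is used to match the rest of this text; recent rvest releases also offer html_elements as a newer name for it.

```r
library(rvest)

# A tiny hand-written HTML document (hypothetical, for illustration only)
page <- read_html('
<html><body>
  <div id="intro"><p>Hello</p></div>
  <table id="prices">
    <tr><th>Index</th><th>Last</th></tr>
    <tr><td>A</td><td>10</td></tr>
    <tr><td>B</td><td>20</td></tr>
  </table>
</body></html>')

# Select the element whose id attribute equals "prices", whatever its tag
prices_node <- html_nodes(page, xpath = '//*[@id="prices"]')

# Select every <td> inside that table and extract the text of each cell
cells <- html_text(html_nodes(page, xpath = '//table[@id="prices"]//td'))
cells
```

The first query has the same shape as the expressions used in this exercise: `//*` matches any element at any depth and `[@id="..."]` keeps only the one with the given id.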

The html_nodes function in the rvest package is used to parse the HTML document with XPath. As the following R command shows, the first argument is the variable that contains the HTML document and, since parsing is done with XPath, the expression is passed through the argument named xpath. To get the XPath into the command, copy it (right-click and choose copy, or use CTRL+C) and paste it with CTRL+V. The command below parses the part of the HTML document that contains the indices for Argentina and stores it in a variable called table. Printing the table variable gives a closer look at exactly what was parsed.

table <- html_nodes(html_doc, xpath = '//*[@id="indice_table_37"]')

At this point the table variable still holds HTML elements. The html_table function converts them into a plain table.

html_table(table)

## [[1]]
## # A tibble: 2 x 9
##   ``    Index             Last     High     Low        Chg. `Chg. %` Time  ``   
##   <lgl> <chr>             <chr>    <chr>    <chr>     <dbl> <chr>    <chr> <lgl>
## 1 NA    S&P Merval        275,717… 276,242… 270,610… 5.39e3 +1.99%   14/04 NA   
## 2 NA    S&P/BYMA Argenti… 11,602,… 11,628,… 11,402,… 2.09e5 +1.83%   14/04 NA

Let’s assign the table to a variable called argentina by using the following command. After running it, a data frame called argentina appears in the Global Environment of RStudio.

argentina <- html_table(table)[[1]]
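
The `[[1]]` is needed because html_table, when given a set of nodes, returns a list with one table per matched node. The sketch below shows this on a small hand-written table; the id demo_table and the cell values are invented for illustration.

```r
library(rvest)

# Hand-written HTML with a single table (hypothetical content)
page <- read_html('
<html><body>
  <table id="demo_table">
    <tr><th>Index</th><th>Last</th></tr>
    <tr><td>Alpha</td><td>1,234.5</td></tr>
    <tr><td>Beta</td><td>987.0</td></tr>
  </table>
</body></html>')

nodes  <- html_nodes(page, xpath = '//*[@id="demo_table"]')
tables <- html_table(nodes)  # a list with one element per matched node
demo   <- tables[[1]]        # the first (and here only) table as a data frame
demo
```

If the XPath matches several tables, each one is available by its position in the list.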

The first and last columns of the table shown in Figure 9 are empty. They can be removed by using the following command.

argentina_table <- argentina[,c(-1,-9)]
argentina_table

## # A tibble: 2 x 7
##   Index                    Last       High       Low         Chg. `Chg. %` Time 
##   <chr>                    <chr>      <chr>      <chr>      <dbl> <chr>    <chr>
## 1 S&P Merval               275,717.41 276,242.63 270,610.… 5.39e3 +1.99%   14/04
## 2 S&P/BYMA Argentina Gene… 11,602,208 11,628,267 11,402,7… 2.09e5 +1.83%   14/04
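
In the output above, columns such as Last and Chg. % are stored as text, so they cannot be used in calculations directly. The sketch below shows one way to convert them, assuming commas are thousands separators and the dot is the decimal mark. The small data frame is typed in by hand from values visible in the printed output and does not replace the scraped table.

```r
# Hand-typed sample that mirrors two of the printed columns
prices <- data.frame(
  Index = c("S&P Merval", "S&P/BYMA Argentina General"),
  Last = c("275,717.41", "11,602,208"),
  `Chg. %` = c("+1.99%", "+1.83%"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Remove thousands separators, then convert the text to numbers
prices$Last <- as.numeric(gsub(",", "", prices$Last, fixed = TRUE))

# Strip the percent sign; as.numeric accepts the leading "+"
prices$`Chg. %` <- as.numeric(sub("%", "", prices$`Chg. %`, fixed = TRUE))

str(prices)
```

The same two lines can be applied to the High and Low columns of argentina_table.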

The same steps can be applied to another part of the page. The HTML document is read again and stored in a variable, this time called html_doc2; printing the variable shows part of the document in the RStudio console.

html_doc2 <- read_html("https://www.investing.com/indices/world-indices")
html_doc2

## {html_document}
## <html dir="ltr" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" xmlns:schema="http://schema.org/" class="com" lang="en" geo="PL">
##  [1] <head>\n<link rel="dns-prefetch" href="https://i-invdn-com.investing.com ...
##  [2] <body class="takeover dfpTakeovers ">\n\n<noscript>\n        <iframe src ...
##  [3] <img src="https://secure.adnxs.com/seg?t=2&amp;add=19833489" width="1" h ...
##  [4] <script>var TimeZoneID = +"8";window.timezoneOffset = +"-14400";window.s ...
##  [5] <script>window.uid = 0</script>
##  [6] <script>\n\n\t\t\t$(function(){\n\t\t\t\t$(window).trigger("socketRetry" ...
##  [7] <script>function refresher() { void (0);}</script>
##  [8] <script>\n    \tvar _comscore = _comscore || [];\n    \t_comscore.push(\ ...
##  [9] <script type="text/javascript">\n                var google_conversion_i ...
## [10] <script type="text/javascript" src="//www.googleadservices.com/pagead/co ...
## [11] <noscript><div style="display:inline;"><img height="1" width="1" style=" ...
## [12] <script type="text/javascript">\n        fbq = window.fbq || $.noop;\n   ...
## [13] <script>\n    $(function () {\n        $('.googleButtonWrapper').hover(f ...
## [14] <div id="g_id_onload" data-client_id="606447380154-9825jtap5as2sm0f868m5 ...

The html_nodes function is again used to parse the HTML document with XPath. The first argument is the variable that contains the HTML document, and the XPath expression is passed through the argument named xpath. As before, copy the XPath (right-click and choose copy, or use CTRL+C) and paste it into the command with CTRL+V.

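# Select the element with id "marketMoversBoxWrapper"
company_names  <- html_nodes(html_doc2, xpath = '//*[@id="marketMoversBoxWrapper"]')
# Convert the selected HTML into a list of tables and keep the first one
company_tables <- html_table(company_names)
companies <- company_tables[[1]]
# Drop the first and seventh columns
companies <- companies[, c(-1, -7)]
companies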

## # A tibble: 23 x 5
##    Name  Last   `Chg. %` Vol.   ``   
##    <chr> <chr>  <chr>    <chr>  <lgl>
##  1 TSLA  185.00 -0.48%   95.97M NA   
##  2 AMZN  102.51 +0.11%   50.12M NA   
##  3 AAPL  165.21 -0.21%   48.02M NA   
##  4 JPM   138.73 +7.55%   43.78M NA   
##  5 NVDA  267.58 +1.11%   39.27M NA   
##  6 META  221.49 +0.52%   21.52M NA   
##  7 MSFT  286.14 -1.28%   20.89M NA   
##  8 Name  Last   Chg. %   Vol.   NA   
##  9 JPM   138.73 +7.55%   43.78M NA   
## 10 SBNY  0.17   +6.94%   7.10M  NA   
## # … with 13 more rows
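
The printed result still contains a repeated header row (row 8) and a column holding only NA. The sketch below shows one possible clean-up on a small hand-typed sample that mirrors a few of the printed rows. The names companies_demo and Empty are hypothetical stand-ins, the latter for the unnamed column.

```r
# Hand-typed sample mirroring a few printed rows, including the repeated header
companies_demo <- data.frame(
  Name = c("TSLA", "MSFT", "Name", "JPM"),
  Last = c("185.00", "286.14", "Last", "138.73"),
  Empty = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Drop rows that merely repeat the column headers
is_header <- companies_demo$Name == "Name"
cleaned <- companies_demo[!is_header, ]

# Drop columns that contain nothing but NA
all_na <- vapply(cleaned, function(col) all(is.na(col)), logical(1))
cleaned <- cleaned[, !all_na, drop = FALSE]

# With the header row gone, the prices can be converted to numbers
cleaned$Last <- as.numeric(cleaned$Last)
cleaned
```

Filtering the header rows first matters: converting Last to numeric while the text "Last" is still present would produce NA with a warning.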

Olgun AYDIN

2023-04-17