Troubleshooting

How Do I Run the Selenium Server?

rsDriver

The rsDriver function manages the binaries needed for running a Selenium Server. If you get an error message like “Selenium message: session not created: This version of ChromeDriver only supports Chrome version 88”, you can either 1) set the “chromever” argument to a driver version that matches the browser you have installed (for example “96.0.4664.45”), or 2) install Firefox on your PC/Mac and use “firefox” as the browser.

Option 1: chromever argument

library(RSelenium)
# start a Selenium server and a Chrome client, pinning the driver version
rD <- rsDriver(port = 4040L, browser = "chrome", chromever = "96.0.4664.45")
remDr <- rD[["client"]]
remDr$close()

Option 2: firefox browser

# start a Selenium server and a Firefox client instead
rD <- rsDriver(port = 3030L, browser = "firefox")
remDr <- rD[["client"]]
remDr$close()
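
With either option, you can verify that the browser session works before closing it by navigating to a page and reading its title. The following is a minimal sketch using standard RSelenium client methods; the URL is only an example, and rD[["server"]]$stop() shuts down the Selenium server process, which remDr$close() alone does not do.

# verify the session by loading a page and printing its title
remDr$navigate("https://www.youtube.com")
remDr$getTitle()

# when finished, close the client and stop the Selenium server
remDr$close()
rD[["server"]]$stop()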

Web Scraping: Good Practice

What is robots.txt?

Maintainers of websites sometimes want to prevent at least some of their content from being crawled, for example to keep their server traffic in check. This is what the robots.txt file is used for. This “Robots Exclusion Protocol” tells robots which information on the site may be harvested.

The idea is to specify, in a text file stored in the root directory of a website, which information may or may not be accessed by web robots (crawlers); see, for example, <www.youtube.com/robots.txt>.

The fact that robots.txt does not follow an official standard has led to inconsistencies and uncontrolled extensions of the grammar.

There is, however, a set of rules that most robots.txt files on the Web follow. Rules are listed bot by bot. A set of rules for the Googlebot robot could look as follows:

User-agent: Googlebot
Disallow: /images/
Disallow: /private/

This tells the Googlebot robot, specified in the User-agent field, not to crawl content from the subdirectories /images/ and /private/. Well-behaved web bots are supposed to look for their name in the list of User-Agents in the robots.txt and obey the rules.
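
You rarely have to interpret such rules by hand in R. As a sketch, assuming the robotstxt package is installed (it is not otherwise used in this section), its paths_allowed() function downloads a site's robots.txt and checks whether a given bot may fetch given paths; the domain and paths below are placeholders.

library(robotstxt)

# check whether Googlebot may fetch two example paths on an example domain
paths_allowed(
  paths  = c("/images/logo.png", "/index.html"),
  domain = "www.example.com",
  bot    = "Googlebot"
)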

The Disallow field can contain partial or full URLs. Rules can be generalized with the asterisk (*).

User-agent: *
Disallow: /private/

This means that any robot that is not explicitly listed is disallowed from crawling the /private/ subdirectory. A general ban is formulated as:

User-agent: *
Disallow: /

The single slash / encompasses the entire website; that is, all robots are prohibited from crawling anything on the server.

To allow all robots complete access:

User-agent: *
Disallow: 

“User-agent: *” means the section applies to all robots. An empty “Disallow:” field, as above, tells the robot that it may access all directories of the site, whereas “Disallow: /” tells it not to visit any page on the site.

Several records are separated by one or more empty lines.

User-agent: Googlebot
Disallow: /images/

User-agent: Slurp
Disallow: /images/

A frequently used extension of this basic set of rules is the Allow field. As the name states, such fields list directories that are explicitly approved for scraping. Combinations of Allow and Disallow rules enable webpage maintainers to exclude a directory as a whole from crawling while allowing specific subdirectories or files within it to be crawled.

User-agent: *
Disallow: /images/
Allow: /images/public/
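
As a sketch, again assuming the robotstxt package, you can parse such a combined rule set from plain text and inspect which Allow and Disallow rules were recognised; the exact output format may differ between package versions.

library(robotstxt)

# parse the combined Allow/Disallow rules from the example above
rules <- "User-agent: *\nDisallow: /images/\nAllow: /images/public/\n"
rt <- robotstxt(text = rules)

# data frame of the recognised permission rules
rt$permissions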

Another extension of the robots.txt standard is the Crawl-delay field, which asks crawlers to pause for a certain number of seconds between requests. In the following robots.txt, Googlebot is allowed to scrape everything except one directory, while all other robots may access everything but have to pause for 2 seconds between requests.

User-agent: *
Crawl-delay: 2

User-agent: Googlebot
Disallow: /search/
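
Respecting a Crawl-delay in R is straightforward: pause with Sys.sleep() between successive requests. The following sketch loops over a vector of hypothetical URLs and waits 2 seconds between downloads; the URLs are placeholders.

# hypothetical pages to download politely
urls <- c("https://www.example.com/page1", "https://www.example.com/page2")

pages <- vector("list", length(urls))
for (i in seq_along(urls)) {
  pages[[i]] <- readLines(urls[i], warn = FALSE)  # fetch the page source
  Sys.sleep(2)                                    # honour the 2-second crawl delay
}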

How Do I Read the robots.txt File?

Let’s examine the robots.txt file of YouTube. First, we specify the link to the file.

youtube_robotstxt <- "https://www.youtube.com/robots.txt"

Next, we read in the file to see which directories are prohibited from being crawled by any bot that is not otherwise listed.

# download and print YouTube's robots.txt line by line
bots <- readLines(youtube_robotstxt)
bots

The output shows that YouTube prohibits crawling of a range of directories. Let’s extract the rules that disallow any bot from accessing them.
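
A minimal way to do this with base R is to filter the vector returned by readLines() for lines that start with Disallow; the exact entries will reflect whatever YouTube’s robots.txt contains when you run it.

# keep only the Disallow rules from the downloaded robots.txt
disallowed <- grep("^Disallow", bots, value = TRUE)
head(disallowed)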

It is important to note that robots.txt is not a firewall against robots or any other protection mechanism.

It does not prevent a website from being crawled at all; rather, it is advice from the website maintainer.

To the best of our knowledge, there is no law which explicitly states that robots.txt contents must not be disregarded. However, we strongly recommend that you check it every time you work with a new website, stay identifiable, and, in case of doubt, contact the owner in advance.

If you want to learn more about web robots and how robots.txt works, the page http://www.robotstxt.org/ is a good start. It provides a more detailed explanation of the syntax and a useful collection of Frequently Asked Questions.

Be Friendly!

Once you have decided that scraping data directly from a webpage is the way to gather it, you should consider the Robots Exclusion Protocol, if there is one. The robots.txt is usually not meant to block individual requests to a site but to prevent a webpage from being indexed by a search engine or other meta search applications. If you want to gather information from a page whose robots.txt documents a disallowance of web robot activity, you should reconsider your task.

Do you plan to scrape data in a bot-like manner? Does your task have the potential to do the web server any harm? In case of doubt, get in contact with the page administrator or take a look at the terms of use, if there are any. Make sure your plans involve no ill intent, and stay identifiable by making adequate use of the identifying HTTP header fields.
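
One simple way to stay identifiable is to send a meaningful User-Agent string with every request. The sketch below uses the httr package (an assumption, since this section does not otherwise use it); the name, project description, and e-mail address are placeholders you should replace with your own.

library(httr)

# send an identifying User-Agent header along with the request
response <- GET(
  "https://www.example.com",
  user_agent("Jane Doe, research project, jane.doe@example.org")
)
status_code(response)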

An inspection of YouTube’s robots.txt reveals that robots are officially allowed to work in the /feed/trending subdirectory.
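
You can confirm this programmatically. As a sketch, again assuming the robotstxt package, paths_allowed() fetches YouTube’s robots.txt and checks whether the generic rules permit the path.

library(robotstxt)

# check whether a generic bot may access the trending feed
paths_allowed(paths = "/feed/trending", domain = "www.youtube.com", bot = "*")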