Webscraping

HSS 611: Programming for HSS

Taegyoon Kim

Nov 4, 2025

Agenda

  • BeautifulSoup library

    • When to use it
    • How to use it
  • Robots.txt

  • Ethics of webscraping

APIs vs. scraping web pages

  • APIs provide structured data (usually JSON)

  • The data from API requests is meant for automated consumption by machines

  • If an API was created, there is likely an intention to maintain (including backward compatibility)

    • Backward compatibility: new versions of the API still work with older code written for previous versions
  • If you can achieve your goal with an API, use the API (a sketch of an API call follows below)
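
A minimal sketch of calling a JSON API with the requests library (the URL and field names here are hypothetical placeholders, not a real service):

import requests

# Hypothetical endpoint and parameters, for illustration only
url = "https://api.example.com/v1/posts"
response = requests.get(url, params={"author": "kim", "limit": 10})
response.raise_for_status()

data = response.json()  # the JSON body is parsed directly into Python objects
for post in data:       # assuming this API returns a list of post records
    print(post["title"])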

You can’t always use APIs

  • Some services just don’t have an API

  • Sometimes there is an API, but there are other barriers:

    • Rate limits
    • Paid access
    • Does not provide exactly what you need
  • But they do have a website, which you can scrape

Scraping webpages

  • We still use the requests library

  • Send a (GET) request with a URL

  • Get an HTML document

  • Use BeautifulSoup to pull the data you need out of the HTML document (see the sketch after this list)

  • (With an API, we don’t need BeautifulSoup because Python already understands the structure of the JSON file that gets returned)
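
Putting these steps together, a minimal sketch (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

url = "https://www.example.com"  # placeholder URL
response = requests.get(url)
response.raise_for_status()

# Parse the raw HTML text into a navigable tree
soup = BeautifulSoup(response.text, "html.parser")

print(soup.title)              # the <title> tag
print(soup.find_all("a")[:5])  # the first few links on the page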

The structure of an HTML document

<!DOCTYPE html>
<html>
  <head>
    <title> Page Title </title>
  </head>
  <body>
    <h1>This is a heading </h1>
    <p> This is a paragraph </p>
  </body>
</html>
  • Most of the information we need is inside <body>

  • Tags (enclosed in <>) help navigate the document!

    • Keywords that tell the browser what kind of content something is, or how it should behave

The structure of an HTML document

<!DOCTYPE html>
<html>
  <head>
    <title> Page Title </title>
  </head>
  <body>
    <h1>This is a heading </h1>
    <p> This is a paragraph </p>
  </body>
</html>
  • Common tags:

    • <h1> through <h6> denote headings
    • <p> is used for a paragraph
    • <a> is used for links (anchor)
    • <div> (division) and <span> are generic tags to group other tags
    • <b> for bold and <i> for italic
  • Many others for tables, forms, buttons, etc.

  • Tags are nested

    • A tree-like structure with <html> at the root

The structure of an HTML document

  • Tags typically need to be closed with a forward slash (/):

    • <tag> some content </tag>
  • Tags can have attributes

    • Provide additional information about elements
    • Control the element’s behavior
  • They typically appear as name-value pairs:

    • <element attribute="value"> some content </element>
  • E.g., an abbreviation

    • <abbr id="anId" class="aClass" style="color:blue;" title="Hypertext Markup Language">HTML</abbr>

The structure of an HTML document

  • Two of the most common attributes are id and class

  • The id attribute provides a document-wide unique identifier for an element

  • The class attribute provides a way of classifying similar elements

  • The href attribute specifies the URL of the page a link goes to (it is used with <a>)

    • <a href="https://www.example.co.kr">Example</a>
  • There are many other attributes

BeautifulSoup

  • BeautifulSoup represents an HTML document as a navigable tree

  • Identify and navigate to specific elements using tags and attributes

  • Tutorial and documentation are available on the BeautifulSoup website

  • Let’s look at some parts of the tutorial (a minimal example follows below)
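
A small, self-contained sketch of the kind of operations the tutorial covers (the HTML string is made up for illustration):

from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 id="main-title">Example page</h1>
  <p class="intro">First paragraph with a <a href="https://www.example.com">link</a>.</p>
  <p class="intro">Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.find("h1").get_text())               # text inside the first <h1>
print(soup.find(id="main-title"))               # element with a given id
print(len(soup.find_all("p", class_="intro")))  # all <p> tags with class "intro"
print(soup.find("a")["href"])                   # value of the href attribute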

More, more, and more

Go through the BeautifulSoup documentation to learn more.

Use the browser first

  • Browsers like Google Chrome and Mozilla Firefox have great tools to guide your scraping

  • Navigate to the web page you want to scrape

  • Go to the object(s) you want to target

  • Right click, then click on “Inspect”

  • See which part of the HTML pertains to which part of the page

Use the browser first

Google Chrome’s Inspect Tool

Sleep

  • To scrape a website gently, you can use the sleep function (time.sleep from Python’s time module)

  • Specify how many seconds to sleep

  • Place it between requests (e.g., somewhere in a for loop) to slow down

  • Python will pause at that line before moving on

  • Sometimes sleeping 1 second will be enough, sometimes 1 minute, sometimes more

    • We could also randomize the delay (e.g., 1–10 seconds) to mimic natural human browsing behavior, as in the sketch below
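
A minimal sketch with a randomized delay (the URLs are placeholders):

import random
import time

import requests

urls = ["https://www.example.com/page1",
        "https://www.example.com/page2"]  # placeholder URLs

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Pause for a random 1 to 10 seconds before the next request
    time.sleep(random.uniform(1, 10))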

Robots.txt

  • Websites communicate with scrapers using robots.txt

  • Typically provides information on:

    • Disallowed sections of website
    • Which scrapers get which permissions
  • Accessible at the site’s root URL + /robots.txt

    • E.g., http://www.example.com/robots.txt

Syntax of Robots.txt

  • User-agent: specifies which scrapers the following rules apply to
    • The * (a wildcard) means all scrapers
  • Disallow: denotes parts of the website that scrapers may not visit (see the sketch below for checking these rules in Python)
    • No value means everything is allowed
    • / means everything is disallowed
    • /folder/ means everything in that subfolder is disallowed
  • Sitemap: gives a list of all web pages on the site
    • It can be a list of lists (nested) that eventually branches out to all web pages
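
Python’s standard library can read and apply these rules; a minimal sketch using urllib.robotparser (the URLs are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.example.com/robots.txt")  # placeholder site
rp.read()  # fetch and parse the robots.txt file

# can_fetch(user_agent, url) reports whether the rules allow that scraper to visit the URL
print(rp.can_fetch("*", "https://www.example.com/some/page"))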

Example robots.txt files

  • Example from YouTube
User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /api/
Disallow: /comment
Disallow: /feeds/videos.xml
Disallow: /file_download
Disallow: /get_video
Disallow: /get_video_info
Disallow: /get_midroll_info
Disallow: /live_chat
Disallow: /login
Disallow: /qr
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /timedtext_video
Disallow: /verify_age
Disallow: /watch_ajax
Disallow: /watch_fragments_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax
Disallow: /youtubei/

Sitemap: https://www.youtube.com/sitemaps/sitemap.xml
Sitemap: https://www.youtube.com/product/sitemap.xml

Robots.txt