Webscraping

HSS 611: Programming for HSS

Taegyoon Kim

Nov 4, 2025

Agenda

  • BeautifulSoup library

    • When to use it
    • How to use it
  • Robots.txt

  • Ethics of webscraping

APIs vs. scraping web pages

  • APIs provide structured data (usually JSON)

  • The data from API requests is meant for automated consumption by machines

  • If an API was created, there is likely an intention to maintain (including backward compatibility)

    • Backward compatibility: new versions of the API still work with older code written for previous versions
  • If you can achieve your goal with an API, use the API (a sketch of an API call follows below)
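
A minimal sketch of calling a JSON API with the requests library (the URL and field names here are hypothetical placeholders, not a real service):

import requests

# Hypothetical endpoint and parameters, for illustration only
url = "https://api.example.com/v1/posts"
response = requests.get(url, params={"author": "kim", "limit": 10})
response.raise_for_status()

data = response.json()  # the JSON body is parsed directly into Python objects
for post in data:       # assuming this API returns a list of post records
    print(post["title"])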

You can’t always use APIs

  • Some services just don’t have an API

  • Sometimes there is an API, but there are other barriers:

    • Rate limits
    • Paid access
    • Does not provide exactly what you need
  • But they do have a website, which you can scrape

Scraping webpages

  • We still use the requests library

  • Send a (GET) request with a URL

  • Get an HTML document

  • Use BeautifulSoup to pull the data you need out of the HTML document (see the sketch after this list)

  • (With an API, we don’t need BeautifulSoup because Python already understands the structure of the JSON file that gets returned)
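
Putting these steps together, a minimal sketch (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

url = "https://www.example.com"  # placeholder URL
response = requests.get(url)
response.raise_for_status()

# Parse the raw HTML text into a navigable tree
soup = BeautifulSoup(response.text, "html.parser")

print(soup.title)              # the <title> tag
print(soup.find_all("a")[:5])  # the first few links on the page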

The structure of an HTML document

<!DOCTYPE html>
<html>
  <head>
    <title> Page Title </title>
  </head>
  <body>
    <h1>This is a heading </h1>
    <p> This is a paragraph </p>
  </body>
</html>
  • Most of the information we need is inside <body>

  • Tags (enclosed in <>) help navigate the document!

    • Keywords that tell the browser what kind of content something is, or how it should behave

The structure of an HTML document

<!DOCTYPE html>
<html>
  <head>
    <title> Page Title </title>
  </head>
  <body>
    <h1>This is a heading </h1>
    <p> This is a paragraph </p>
  </body>
</html>
  • Common tags:

    • <h1> through <h6> denote headings
    • <p> is used for a paragraph
    • <a> is used for links (anchor)
    • <div> (division) and <span> are generic tags to group other tags
    • <b> for bold and <i> for italic
  • Many others for tables, forms, buttons, etc.

  • Tags are nested

    • A tree-like structure with <html> at the root

The structure of an HTML document

  • Tags typically need to be closed with a forward slash (/):

    • <tag> some content </tag>
  • Tags can have attributes

    • Provide additional information about elements
    • Control the element’s behavior
  • They typically appear as name-value pairs:

    • <element attribute="value"> some content </element>
  • E.g., an abbreviation

    • <abbr id="anId" class="aClass" style="color:blue;" title="Hypertext Markup Language">HTML</abbr>

The structure of an HTML document

  • Two of the most common attributes are id and class

  • The id attribute provides a document-wide unique identifier for an element

  • The class attribute provides a way of classifying similar elements

  • The href attribute specifies the URL of the page a link goes to (it is used with <a>)

    • <a href="https://www.example.co.kr">Example</a>
  • There are many other attributes

BeautifulSoup

  • BeautifulSoup represents an HTML document as a navigable tree

  • Identify and navigate to specific elements using tags and attributes

  • Tutorial and documentation are available on the BeautifulSoup website

  • Let’s look at some parts of the tutorial (a minimal example follows below)
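
A small, self-contained sketch of the kind of operations the tutorial covers (the HTML string is made up for illustration):

from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 id="main-title">Example page</h1>
  <p class="intro">First paragraph with a <a href="https://www.example.com">link</a>.</p>
  <p class="intro">Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.find("h1").get_text())               # text inside the first <h1>
print(soup.find(id="main-title"))               # element with a given id
print(len(soup.find_all("p", class_="intro")))  # all <p> tags with class "intro"
print(soup.find("a")["href"])                   # value of the href attribute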

More, more, and more

Go through the BeautifulSoup documentation to learn more.

Use the browser first

  • Browsers like Google Chrome and Mozilla Firefox have great tools to guide your scraping

  • Navigate to the web page you want to scrape

  • Go to the object(s) you want to target

  • Right click, then click on “Inspect”

  • See which part of the HTML pertains to which part of the page

Use the browser first

Google Chrome’s Inspect Tool

Sleep

  • To scrape a website gently, you can use the sleep function (time.sleep from Python’s time module)

  • Specify how many seconds to sleep

  • Place it between requests (e.g., somewhere in a for loop) to slow down

  • Python will pause at that line before moving on

  • Sometimes sleeping 1 second will be enough, sometimes 1 minute, sometimes more

    • We could also randomize the delay (e.g., 1–10 seconds) to mimic natural human browsing behavior, as in the sketch below
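
A minimal sketch with a randomized delay (the URLs are placeholders):

import random
import time

import requests

urls = ["https://www.example.com/page1",
        "https://www.example.com/page2"]  # placeholder URLs

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Pause for a random 1 to 10 seconds before the next request
    time.sleep(random.uniform(1, 10))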

Robots.txt

  • Websites communicate with scrapers using robots.txt

  • Typically provides information on:

    • Disallowed sections of website
    • Which scrapers get which permissions
  • Accessible at the site’s root URL + /robots.txt

    • E.g., http://www.example.com/robots.txt

Syntax of Robots.txt

  • User-agent: specifies which scrapers the following rules apply to
    • The * (a wildcard) means all scrapers
  • Disallow: denotes parts of the website that scrapers may not visit (see the sketch below for checking these rules in Python)
    • No value means everything is allowed
    • / means everything is disallowed
    • /folder/ means everything in that subfolder is disallowed
  • Sitemap: gives a list of all web pages on the site
    • It can be a list of lists (nested) that eventually branches out to all web pages
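
Python’s standard library can read and apply these rules; a minimal sketch using urllib.robotparser (the URLs are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.example.com/robots.txt")  # placeholder site
rp.read()  # fetch and parse the robots.txt file

# can_fetch(user_agent, url) reports whether the rules allow that scraper to visit the URL
print(rp.can_fetch("*", "https://www.example.com/some/page"))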

Example robots.txt files

  • Example from YouTube
User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /api/
Disallow: /comment
Disallow: /feeds/videos.xml
Disallow: /file_download
Disallow: /get_video
Disallow: /get_video_info
Disallow: /get_midroll_info
Disallow: /live_chat
Disallow: /login
Disallow: /qr
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /timedtext_video
Disallow: /verify_age
Disallow: /watch_ajax
Disallow: /watch_fragments_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax
Disallow: /youtubei/

Sitemap: https://www.youtube.com/sitemaps/sitemap.xml
Sitemap: https://www.youtube.com/product/sitemap.xml

Robots.txt