HSS 611: Programming for HSS
Nov 4, 2025
BeautifulSoup library
Robots.txt
Ethics of webscraping
APIs provide structured data (usually JSON)
The data returned by API requests is meant for automated consumption by machines
If an API was created, there is likely an intention to maintain it (including backward compatibility)
If you can achieve your goal with an API, use the API
Some services just don’t have an API
Sometimes there is an API, but there are other barriers
But they do have a website, which you can scrape
We still use the requests library
Send a (GET) request with a URL
Get an HTML document
Use BeautifulSoup to pull necessary data out of the HTML document
(With an API, we don’t need BeautifulSoup because Python already understands the structure of the JSON file that gets returned)
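A minimal sketch of this workflow, using a hypothetical URL:

import requests
from bs4 import BeautifulSoup

url = "https://www.example.com"                      # hypothetical page to scrape
response = requests.get(url)                         # send a GET request
soup = BeautifulSoup(response.text, "html.parser")   # parse the HTML into a navigable tree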
<!DOCTYPE html>
<html>
<head>
<title> Page Title </title>
</head>
<body>
<h1>This is a heading </h1>
<p> This is a paragraph </p>
</body>
</html>
Most information we need is inside <body>
Tags (enclosed in <>) help navigate the document!
Common tags:
<h1>, <h2> … <h6> denote headings
<p> is used for a paragraph
<a> is used for links (anchor)
<div> (division) and <span> are generic tags to group other tags
<b> for bold and <i> for italic
Many others for tables, forms, buttons, etc.
Tags are nested
<html> at the root
Tags typically need to be closed with a forward slash (/):
<tag> some content </tag>
Tags can have attributes
They typically appear as name-value pairs:
<element attribute="value"> some content </element>
E.g., an abbreviation:
<abbr id="anId" class="aClass" style="color:blue;" title="Hypertext Markup Language">HTML</abbr>
Two of the most common attributes are id and class
id attribute provides a document-wide unique identifier for an element
class attribute provides a way of classifying similar elements
href specifies the URL of the page the link goes to (it works with <a>)
<a href="https://www.example.co.kr">Example</a>
There are many other attributes
BeautifulSoup represents an HTML document as a navigable tree
Identify and navigate to specific elements using tags and attributes
Tutorial and documentation on BeautifulSoup website
Let’s look at some parts of the tutorial
Get the title tag
The tag and what it contains are recognized as a Tag object
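For instance, assuming the soup object built from the sample document above:

soup.title           # the first <title> tag, as a Tag object
soup.title.string    # the text inside the tag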
We can get all of them too with find_all
find_all returns a ResultSet object where each element is a Tag
ResultSet objects work similarly to lists (iterable, indexable/slicable)
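For example, to collect every paragraph in the parsed document:

paragraphs = soup.find_all("p")   # ResultSet of every <p> Tag
paragraphs[0]                     # indexable like a list
for p in paragraphs:              # and iterable like a list
    print(p.text)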
We can use attributes to navigate too
What if you want neither just the first element nor all of them at once?
Remember, id is a document-wide unique identifier
We could use find_all here and get a ResultSet
find will find the first element, which—in this case—is fine because there’s only one element anyway
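A sketch using the id from the <abbr> example above:

soup.find_all(id="anId")   # ResultSet (here containing a single Tag)
soup.find(id="anId")       # just the first match; fine, since ids are unique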
We can find_all with any attribute
class is a very common attribute
But you need to use the class_ argument inside find_all (class is a reserved word in Python)
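For example, using the class from the <abbr> element above:

soup.find_all(class_="aClass")           # every element with class="aClass"
soup.find_all("abbr", class_="aClass")   # restrict the search to <abbr> tags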
Use get to get the values of the attributes
Get all the hyperlinks in the page
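A short sketch of both, assuming the soup object from earlier:

link = soup.find("a")               # first hyperlink on the page
link.get("href")                    # value of its href attribute

for link in soup.find_all("a"):     # every hyperlink on the page
    print(link.get("href"))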
Go through the BeautifulSoup Documentation to learn more:
Browsers like Google Chrome and Mozilla Firefox have great tools to guide your scraping
Navigate to the web page you want to scrape
Go to the object(s) you want to target
Right click, then click on “Inspect”
See which part of the HTML pertains to which part of the document
Google Chrome’s Inspect Tool
To scrape a website gently, you can use the sleep function
Specify how many seconds to sleep
Place it between requests (e.g., inside a for loop) to slow down
Python pauses at that line before moving on
Sometimes sleeping 1 second will be enough, sometimes 1 minute, sometimes more
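A minimal sketch, assuming urls is a hypothetical list of pages to scrape:

import time
import requests

for url in urls:
    response = requests.get(url)   # send one request
    # ... process response.text here ...
    time.sleep(1)                  # wait 1 second before the next request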
Websites communicate with scrapers using robots.txt
Typically provides information on:
Accessible through URL + /robots.txt
http://www.example.com/robots.txt
User-agent: specifies which scrapers the rules that follow apply to
* (read: wildcard) means all scrapers
Disallow: denotes parts of the website that are disallowed
/ means everything is disallowed
/folder/ means everything in that subfolder is disallowed
Sitemap: gives a list of all web pages on the site
E.g., the robots.txt of https://www.youtube.com:
User-agent: Mediapartners-Google*
Disallow:
User-agent: *
Disallow: /api/
Disallow: /comment
Disallow: /feeds/videos.xml
Disallow: /file_download
Disallow: /get_video
Disallow: /get_video_info
Disallow: /get_midroll_info
Disallow: /live_chat
Disallow: /login
Disallow: /qr
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /timedtext_video
Disallow: /verify_age
Disallow: /watch_ajax
Disallow: /watch_fragments_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax
Disallow: /youtubei/
Sitemap: https://www.youtube.com/sitemaps/sitemap.xml
Sitemap: https://www.youtube.com/product/sitemap.xml
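A minimal sketch of checking these rules programmatically, using Python's built-in urllib.robotparser (results based on the excerpt above):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.youtube.com/robots.txt")
rp.read()

rp.can_fetch("*", "https://www.youtube.com/watch_popup")  # False: disallowed above
rp.can_fetch("*", "https://www.youtube.com/about")        # True: not in the Disallow list above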
It’s good etiquette to respect robots.txt
If not respected, the site may block your scraper, and you may run into ethical or legal trouble