Web Scraping with Python & BeautifulSoup

David Loeb

How the internet works

  • Websites are basically just HTML documents
  • You send a request to the website for its HTML
  • You receive it, and your browser renders the HTML into the final page we see

Scraping works the same way

  • We send a request through Python to the site
  • It sends back its response, i.e., its HTML code
  • Now we have all the HTML as an object in Python
  • We then search and extract the text we want out of the HTML

Two Python scraping packages

  • BeautifulSoup (+ requests): simpler, great for quickly scraping a small number of pages
  • Scrapy: more powerful but more complicated; best for “crawling,” i.e., scraping tons of pages that link to one another
  • To keep it simple for now I’ll show you BS, but just a heads up that I’m less familiar with it because I always use Scrapy

Scraping pt 1: Getting the HTML

  • The BS approach first uses the requests package to communicate with the website
  • requests.get(url): sends a request to the website and returns a Response object
  • response.text: extracts the HTML text from the Response object
  • BeautifulSoup(response.text, "lxml"): creates a BeautifulSoup object from the HTML text, which you can use to locate and extract the text data you want (see the sketch below)
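
Putting those pieces together, a minimal sketch (the URL is a placeholder; the lxml parser needs to be installed separately, or you can fall back to Python's built-in "html.parser"):

import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder URL
response = requests.get(url)  # send the request to the site
soup = BeautifulSoup(response.text, "lxml")  # parse the HTML into a searchable object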

How do we locate the text we want? Use the structure of the HTML

  • HTML is a series of nested elements
  • Each element has a specific type, denoted by its tag
  • Elements also have attributes that further distinguish them from one another
  • Use these 3 things to locate the elements with data we want:
    • Nested structure
    • Element type
    • Attributes

Nested structure

Basic structure:

<element1>
  <element2>Content</element2>
</element1>

Typically written top to bottom like this, with indents to make the nested structure clear

Element types

  • There are many element types, but you’ll typically encounter just a handful of important ones (see the example page after this list)
  • head & body: separate sites into these two broad sections
  • div: general divider elements, separating sections within the head and body sections
  • h1: headers (and subheaders are h2, h3, etc)
  • ol & ul: ordered and unordered lists; have li elements within them containing each list entry
  • table: table element, with tr and td elements within them containing the table rows and table data respectively
  • a: elements containing links
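
A skeletal (made-up) page showing how these types typically fit together:

<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <div>
      <h1>A header</h1>
      <ul>
        <li>First list entry</li>
        <li>Second list entry</li>
      </ul>
      <a href="https://example.com">A link</a>
    </div>
  </body>
</html>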

Attributes

  • Many elements have attributes within their enclosing brackets (<>) that describe them further
  • The two most commonly used for locating:
    • class: identifies sets of elements that have similar purpose
    • id: uniquely identifies elements
  • The other most important one is href, which contains links
<a class="page-link" id="next-page" href="site.com/sitepage2">

Use BeautifulSoup to locate elements

  • Call .find() / .find_all() methods on the BS object to locate specific elements
    • Argument 1: element tag
    • Argument 2 (optional): attribute
bs_object.find_all("article", class_="main-article")
  • The example above returns all article elements with the attribute class="main-article"
  • .find() would just return the first such element
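
Worth keeping in mind: .find_all() returns a list of matching elements, while .find() returns a single element (or None if nothing matches). A quick sketch reusing the made-up attributes from the earlier slides:

first = bs_object.find("article", class_="main-article")  # one element, or None
articles = bs_object.find_all("article", class_="main-article")  # list of all matches
next_link = bs_object.find("a", id="next-page")  # you can locate by id the same way
next_url = next_link["href"]  # extract the link from the href attribute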

Extract the text data

  • Once you’ve located the elements, use the .get_text() method to extract the content
  • Then you can use Python’s built-in open() function to save the results as a csv or text file
element_text = bs_object.find("article", class_="main-article").get_text()
with open("filename.txt", "w") as file:
    file.write(element_text)
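
If you’d rather save a CSV (say, one row per article), Python’s standard csv module handles the formatting; a minimal sketch reusing the made-up class from above:

import csv

articles = bs_object.find_all("article", class_="main-article")
with open("articles.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["article_text"])  # header row
    for article in articles:
        writer.writerow([article.get_text()])  # one row per located element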

CSS and XPath locators

  • Sometimes you need to get more specific, e.g., you only want the article elements contained within a specific div element
  • In that case we’d use CSS or XPath locators, HTML query languages that take advantage of the nested structure
  • But I think it’s too much info for today!
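
If you’re curious anyway: BeautifulSoup does support CSS selectors through its .select() method (XPath requires a different library, like lxml). A one-liner with made-up class names, just to show the flavor:

bs_object.select("div.sidebar article.main-article")  # article elements nested inside a matching div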