<element1>
  <element2>Content</element2>
</element1>
requests.get(url): sends a request to the website and returns a Response object
response.text: extracts the HTML text from the Response object
BeautifulSoup(response.text, 'lxml'): creates a BeautifulSoup object from the HTML text, which you can use to locate and extract the text data you want
elements: each element has a specific type, denoted by its tag
attributes: further distinguish elements from one another
Basic structure:
Typically written top to bottom, as in the element1/element2 example at the top, with indents to make the nested structure clear
head & body: separate sites into these two broad sections
div: general divider elements, separating sections within the head and body sections
h1: headers (subheaders are h2, h3, etc.)
ol & ul: ordered and unordered lists; have li elements within them containing each list entry
table: table element, with tr and td elements within them containing the table rows and table data respectively
a: elements containing links
Elements can carry attributes (inside the <>) that describe them further:
class: identifies sets of elements that have similar purpose
id: uniquely identifies elements
href: contains links
.find() / .find_all(): methods on the BS object to locate specific elements
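The tags and attributes listed above can be looked up roughly as follows. This is a minimal sketch, not a definitive recipe: the HTML string, the id "content", and the class "item" are invented for illustration, and the built-in "html.parser" is used in place of 'lxml' so the example runs without an extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical page, invented for this example (not from a real site).
HTML = """
<html>
  <head><title>Demo</title></head>
  <body>
    <div id="content" class="section">
      <h1>Fruit</h1>
      <ul>
        <li class="item">Apple</li>
        <li class="item">Pear</li>
      </ul>
      <a href="https://example.com/more">More</a>
    </div>
  </body>
</html>
"""

# "html.parser" ships with Python; 'lxml' would also work if installed.
soup = BeautifulSoup(HTML, "html.parser")

heading = soup.find("h1")                   # first h1 element
items = soup.find_all("li", class_="item")  # every li whose class is "item"
content = soup.find(id="content")           # the element carrying this id
link = soup.find("a")                       # first a element

print(heading.name)      # tag type of the element
print(len(items))        # number of matching list entries
print(content["class"])  # class can hold several values, so bs4 returns a list
print(link["href"])      # the link stored in the href attribute
```

Note the trailing underscore in class_: class is a reserved word in Python, so Beautiful Soup uses class_ as the keyword argument.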
Searching for class="main-article" with .find_all() returns every matching element; .find() would just return the first such element
.get_text(): method to extract the content
open(): framework for saving files, to save as a csv or text file
article elements contained within a specific div element
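Putting the steps together, one possible end-to-end sketch looks like this. The function names (fetch_html, extract_articles, save_rows), the sample HTML, the output file name, and the assumption that the class "main-article" sits on the div are all hypothetical choices made for the example. The built-in "html.parser" again stands in for 'lxml', and the sample string stands in for response.text so the demo runs offline.

```python
import csv
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML text (needs network access)."""
    import requests  # imported here so the offline demo below runs without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def extract_articles(html_text, div_class="main-article"):
    """Return the text of each article element inside the first matching div."""
    soup = BeautifulSoup(html_text, "html.parser")
    container = soup.find("div", class_=div_class)  # .find(): first match only
    if container is None:
        return []
    # .find_all(): every article nested inside that div
    return [a.get_text(strip=True) for a in container.find_all("article")]


def save_rows(path, texts):
    """Write a header row, then one article text per row, to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["article_text"])
        writer.writerows([t] for t in texts)


# Hypothetical HTML standing in for fetch_html(some_url).
SAMPLE = """
<div class="sidebar"><article>Ad</article></div>
<div class="main-article">
  <article>First story</article>
  <article>Second story</article>
</div>
"""

texts = extract_articles(SAMPLE)
print(texts)
save_rows("articles.csv", texts)
```

Searching inside the div first, rather than calling find_all("article") on the whole page, keeps the article in the sidebar out of the results.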