- parses HTML by default; can also parse XML
Modules to Import:
- BeautifulSoup
- CData
- ProcessingInstruction
- Declaration
- DocType
Basic Commands:
import urllib2
from bs4 import BeautifulSoup
# use the line below to download a webpage
html = urllib2.urlopen('web address').read()
# or parse a local file
soup = BeautifulSoup(open('doc.html'))
soup.prettify() => formatted (indented) HTML string
soup.get_text() => all text
soup.get_text('|', strip=True) => all text as unicode, separate tags with |, remove line breaks
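A minimal sketch of the basic commands above. Assumptions: Python 3, the built-in "html.parser" backend, and a made-up sample string in place of a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of a downloaded page
html = "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p></body></html>"

# Naming the parser explicitly avoids the "no parser specified" warning
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()                     # indented HTML as one string
all_text = soup.get_text()                   # every string joined together
piped_text = soup.get_text("|", strip=True)  # strings stripped, joined with |

print(all_text)    # TitleFirst bold paragraph.
print(piped_text)  # Title|First|bold|paragraph.
```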
Search
.find()
- ('tag', {'attr' : 'value'})
.find_all()
- string
- string, string
- attr='text'
- attrs={"data-foo": "value"}
- regex
- list
- True => all tags
----
import re
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
----
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
----
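The filter types listed under Search can be tried side by side. This is a sketch under assumptions: Python 3, the "html.parser" backend, and hypothetical sample markup (the tag names, classes and the data-foo attribute are only for illustration).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = """
<div id="main" class="box">
  <a href="/one" class="link" data-foo="value">One</a>
  <a href="/two" class="link">Two</a>
  <p class="note">Text</p>
  <table><tr><td>Cell</td></tr></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a", {"class": "link"})          # first match only
links = soup.find_all("a")                              # string filter
by_attr = soup.find_all(attrs={"data-foo": "value"})    # attrs dictionary
by_list = [t.name for t in soup.find_all(["a", "p"])]   # list: any of these names
by_regex = [t.name for t in soup.find_all(re.compile("^t"))]  # names starting with t
all_tags = [t.name for t in soup.find_all(True)]        # True: every tag

def has_class_but_no_id(tag):
    # function filter: return True for the tags to keep
    return tag.has_attr("class") and not tag.has_attr("id")

filtered = [t.name for t in soup.find_all(has_class_but_no_id)]
```

Note that .find() returns a single element (or None), while .find_all() always returns a list.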
Navigation
.a_tag.b_tag => get first b_tag in a_tag
.contents => list
.strings => generator of all strings in the doc
.stripped_strings => same, with surrounding whitespace removed
.children
.parent
.next_element => next item parsed (different from .children)
.previous_element
.next_sibling
.previous_sibling
Iterables
.descendants
.parents
.next_elements
.previous_elements
.next_siblings
.previous_siblings
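A short sketch of the navigation attributes and iterables above. Assumptions: Python 3, the "html.parser" backend, and a hypothetical sample document written without whitespace between tags so that siblings are tags rather than newline strings.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup (no whitespace between tags)
html = ("<html><head><title>Doc</title></head>"
        "<body><p>One</p><p>Two <b>bold</b></p></body></html>")
soup = BeautifulSoup(html, "html.parser")

first_p = soup.body.p                    # first <p> inside <body>
child_names = [c.name for c in soup.body.children]
parent_name = first_p.parent.name
sibling = first_p.next_sibling           # the second <p>
next_el = first_p.next_element           # the string "One", not the second <p>

# .descendants walks tags and strings; keep only the tags here
descendant_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]
parent_names = [p.name for p in first_p.parents]
texts = list(soup.body.stripped_strings)
```

The difference between .next_sibling and .next_element shows up on the first paragraph: the sibling is the next tag at the same level, while the next element is the string inside the paragraph itself.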
BeautifulSoup Objects
Main
- Tag
- NavigableString
- BeautifulSoup
- Comment
Lesser - all subclass NavigableString
- CData
- ProcessingInstruction
- Declaration
- DocType
Tag Object
.tag => first tag
.tag.name => tag name
.tag.string => text
.tag.get('attr') => use if you don't know whether the attribute is defined (returns None if missing)
.tag.attrs => attributes as a dictionary
- multivalued tag attributes => list
- multivalued tag attributes - class, rev, accept-charset, headers, accesskey
- 'id' is not multivalued => string
- you can change tag attributes
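The Tag notes above, as a small sketch. Assumptions: Python 3, the "html.parser" backend, and a hypothetical one-tag document.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
soup = BeautifulSoup('<p id="intro" class="lead big">Hello</p>', "html.parser")
tag = soup.p

name = tag.name              # tag name
text = tag.string            # text inside the tag
classes = tag["class"]       # multivalued attribute => list
tag_id = tag["id"]           # 'id' is not multivalued => string
missing = tag.get("title")   # attribute not defined => None, no KeyError

# Attributes can be changed or removed like dictionary entries
tag["id"] = "summary"
del tag["class"]

print(tag)  # <p id="summary">Hello</p>
```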
NavigableString Object
- tag.string => text within a tag
- tag.string.replace_with("any_text") => strings can't be edited in place, only replaced
- use outside of BeautifulSoup by converting to unicode
- unicode(tag.string) (str(tag.string) in Python 3)
- supports most navigation and searching, except .contents, .string and .find()
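A sketch of the NavigableString points above, assuming Python 3 (so str() stands in for unicode()), the "html.parser" backend, and hypothetical sample markup.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample markup
soup = BeautifulSoup("<b>old text</b>", "html.parser")
tag = soup.b

original = tag.string
is_nav = isinstance(original, NavigableString)

# Convert to a plain string before using it outside BeautifulSoup,
# so it no longer carries a reference to the whole parse tree
plain = str(original)

# Replace the string inside the tag with another one
tag.string.replace_with("new text")

print(tag)  # <b>new text</b>
```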
BeautifulSoup Object
- whole document
- soup.name => u'[document]'
- supports most navigation
Comment Object
- NavigableString subclass
- <!-- text -->
- display with special formatting when prettified
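A sketch of how a comment shows up in the tree, assuming Python 3, the "html.parser" backend, and a hypothetical one-line document.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical sample markup
soup = BeautifulSoup("<p><!-- a note --></p>", "html.parser")

comment = soup.p.string                      # the only child of <p>
is_comment = isinstance(comment, Comment)
is_string = isinstance(comment, NavigableString)  # Comment subclasses NavigableString
comment_text = str(comment)                  # text without the <!-- --> markers

pretty = soup.prettify()                     # the markers reappear in the output
```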