Saturday, March 1, 2014

BeautifulSoup - Cheat Sheet

9:02 PM Posted by Sini

BeautifulSoup - cheat sheet
  • parses HTML by default; can also parse XML

Classes to Import (from bs4):
  1. BeautifulSoup
  2. CData
  3. ProcessingInstruction
  4. Declaration
  5. Doctype

Basic Commands:
import urllib2
from bs4 import BeautifulSoup

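# use the line below to download a webpage
# (urllib2 is Python 2; in Python 3 use urllib.request.urlopen)
html = urllib2.urlopen('web address').read()

soup = BeautifulSoup(open('doc.html'))  # or BeautifulSoup(html) for the downloaded page
soup.prettify() => the HTML as a nicely indented string
soup.get_text() => all text
soup.get_text('|', strip=True) => all text as unicode, pieces of text joined with |, whitespace stripped from each piece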
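To try the basic commands without a network connection or a doc.html file, here is a minimal sketch that parses an inline string. It assumes Python 3 and bs4; the sample markup and the variable names (pretty, all_text, piped) are made up for illustration, and "html.parser" is passed explicitly so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for a downloaded page or doc.html
html = "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p></body></html>"

# "html.parser" ships with Python, so nothing extra needs installing
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()                # indented markup, one tag or string per line
all_text = soup.get_text()              # every string, simply concatenated
piped = soup.get_text("|", strip=True)  # each string stripped, then joined with "|"

print(pretty)
print(all_text)  # TitleFirst bold paragraph.
print(piped)     # Title|First|bold|paragraph.
```

Note how plain get_text() runs "Title" and "First" together; a separator plus strip=True is usually easier to read.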

Search
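.find() => first match, or None
  • ('tag', {'attr': 'value'})
.find_all() => list of all matches
  • string => tag name
  • string, string => tag name, CSS class
  • attr='text'
  • attrs={"data-foo": "value"}
  • regex => matched against tag names
  • list => any tag name in the list
  • True => all tags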
----
import re
for tag in soup.find_all(re.compile("t")):
  print(tag.name)
----
def has_class_but_no_id(tag):
  return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
----
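The filter types listed under Search can be tried side by side. This is a sketch under assumptions: Python 3, bs4 with "html.parser", and an invented snippet of markup; every variable name is hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, chosen so each filter matches something different
html = """
<div id="main" class="box" data-foo="value">
  <p class="title">One</p>
  <p class="body" id="second">Two</p>
  <a href="/x">link</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_title = soup.find("p", {"class": "title"})       # tag name + attribute dict
by_name = soup.find_all("p")                           # string
by_name_class = soup.find_all("p", "body")             # string, string (tag, CSS class)
by_keyword = soup.find_all(id="second")                # attr='text'
by_attrs = soup.find_all(attrs={"data-foo": "value"})  # attrs dict
by_regex = soup.find_all(re.compile("^d"))             # regex on tag names
by_list = soup.find_all(["a", "div"])                  # list of tag names
everything = soup.find_all(True)                       # True => all tags

def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

by_function = soup.find_all(has_class_but_no_id)       # function filter

print([t.name for t in by_list])  # results come back in document order
```

The attrs dict form is the one to reach for when the attribute name, like data-foo, is not a valid Python keyword argument.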

Navigation
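.a_tag.b_tag => first b_tag inside the first a_tag
.contents => list of direct children
.strings => every string in the doc
.stripped_strings => same, with surrounding whitespace stripped and whitespace-only strings skipped
.children => iterator over direct children
.parent
.next_element => the next item parsed (often a tag's own string or first child); not the same as .children or .next_sibling
.previous_element
.next_sibling
.previous_sibling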
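A short sketch of the navigation attributes above, assuming Python 3, bs4 with "html.parser", and a made-up fragment written without whitespace between tags (otherwise the siblings would include newline strings). All names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; no whitespace between tags keeps siblings simple
html = "<div><p>one</p><p>two <b>bold</b></p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.div.p                   # .a_tag.b_tag: first <p> inside the first <div>
second_p = first_p.next_sibling        # the <p> right after it, at the same level

children = second_p.contents           # list: the string "two " and the <b> tag
parent_name = first_p.parent.name      # "div"
after_first = first_p.next_element     # the string "one", not the next <p>
before_bold = soup.b.previous_element  # the string "two "

all_strings = list(soup.strings)             # ["one", "two ", "bold"]
clean_strings = list(soup.stripped_strings)  # ["one", "two", "bold"]
```

The difference between next_sibling and next_element shows up on first_p: its sibling is the second paragraph, while its next element is its own text.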

Iterables
.descendants
.parents
.next_elements
.previous_elements
.next_siblings
.previous_siblings
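The iterables are generators, so they are typically looped over or wrapped in list(). A minimal sketch, assuming Python 3, bs4 with "html.parser", and a hypothetical three-item list as input:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: a list with three items
html = "<html><body><ul><li>a</li><li>b</li><li>c</li></ul></body></html>"
soup = BeautifulSoup(html, "html.parser")

ul = soup.ul
middle = ul.find_all("li")[1]  # the <li>b</li> item

# .descendants yields tags and strings at every depth; keep only the tags here
descendant_tags = [d.name for d in ul.descendants if isinstance(d, Tag)]

# .parents walks upward and ends at the BeautifulSoup object itself
ancestor_names = [p.name for p in middle.parents]

later = [s.string for s in middle.next_siblings]        # items after "b"
earlier = [s.string for s in middle.previous_siblings]  # items before "b"

following = list(middle.next_elements)      # starts with the string "b"
preceding = list(middle.previous_elements)  # starts with the string "a"

print(ancestor_names)  # ['ul', 'body', 'html', '[document]']
```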

BeautifulSoup Objects
Main
  1. Tag
  2. NavigableString
  3. BeautifulSoup
  4. Comment

Lesser (all are subclasses of NavigableString)
  1. CData
  2. ProcessingInstruction
  3. Declaration
  4. Doctype
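To see which object type each piece of a document becomes, the sketch below parses a tiny made-up page and collects the class names. It assumes Python 3 and bs4 with "html.parser"; the markup and variable names are hypothetical.

```python
from bs4 import BeautifulSoup, Comment, Doctype, NavigableString, Tag

# Hypothetical page containing a doctype, a tag, a string and a comment
html = "<!DOCTYPE html><p>text<!-- note --></p>"
soup = BeautifulSoup(html, "html.parser")

doctype = soup.contents[0]  # the <!DOCTYPE ...> declaration
p = soup.p                  # a Tag
text, note = p.contents     # a NavigableString and a Comment

kinds = [type(obj).__name__ for obj in (soup, doctype, p, text, note)]
print(kinds)
```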

Tag Object
.tag => first tag with that name
.tag.name => tag name
.tag.string => text (None if the tag has more than one child)
.tag.get('attr') => use if you don't know whether the attribute is defined (returns None if it is missing)
tag attributes are accessed like a dictionary: tag['attr']
  • multivalued tag attributes => list
  • multivalued tag attributes: class, rel, rev, accept-charset, headers, accesskey
  • 'id' is not multivalued => string
  • you can change tag attributes
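The Tag notes above, as a runnable sketch. It assumes Python 3 and bs4 with "html.parser"; the paragraph markup and the variable names are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a single-valued and a multivalued attribute
html = '<p id="intro" class="lead big">Hello</p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                # first <p> in the document
name = tag.name             # "p"
text = tag.string           # "Hello"
tag_id = tag["id"]          # "intro": id is not multivalued, so a string
classes = tag["class"]      # ["lead", "big"]: class is multivalued, so a list
missing = tag.get("title")  # None; tag["title"] would raise KeyError

# Attributes can be changed or removed like dictionary entries
tag["id"] = "changed"
del tag["class"]
result = str(tag)

print(result)  # <p id="changed">Hello</p>
```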

NavigableString Object
  • tag.string => text within a tag
  • tag.string.replace_with("any_text") => replaces the text in the tree
  • use outside of BeautifulSoup by converting to unicode
  • unicode(tag.string) (str(tag.string) in Python 3)
  • supports most navigation and searching, except .contents, .string and .find()
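A small sketch of the NavigableString points, assuming Python 3 (so str() takes the place of unicode()) and bs4 with "html.parser". The markup and names are made up.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<b>old text</b>", "html.parser")

text = soup.b.string                         # the string inside <b>
is_nav = isinstance(text, NavigableString)   # True
plain = str(text)                            # plain string, detached from the tree

# replace_with swaps the string in the tree itself
soup.b.string.replace_with("new text")
result = str(soup.b)

print(result)  # <b>new text</b>
```

Converting to a plain string is worth doing before keeping text around, since a NavigableString holds a reference to the whole parse tree.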

BeautifulSoup Object
  • whole document
  • soup.name => u'[document]'
  • supports most navigation
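As a quick check of the points above, the sketch below treats the BeautifulSoup object like a top-level tag. It assumes Python 3 (where the name prints without the u prefix) and bs4 with "html.parser"; the page and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>hi</p></body></html>", "html.parser")

doc_name = soup.name                                 # "[document]"
top_level = [child.name for child in soup.children]  # ["html"]
first_p = soup.find("p")                             # searching works from the top
has_parent = soup.parent is not None                 # False: nothing sits above the document

print(doc_name, top_level, first_p.string)
```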

Comment Object
  • NavigableString subclass
  • <!-- text -->
  • displayed with special formatting when prettified
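A comment can be picked out with an isinstance check. A minimal sketch, assuming Python 3 and bs4 with "html.parser"; the markup is invented.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<p><!-- a note --></p>", "html.parser")

comment = soup.p.string                       # the only child of <p> is the comment
is_comment = isinstance(comment, Comment)     # True
is_string = isinstance(comment, NavigableString)  # True: Comment is a subclass
body = str(comment)                           # text without the <!-- --> markers

pretty = soup.p.prettify()                    # the markers reappear in the output
print(pretty)
```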