Friday, August 1, 2014

Mike Schmidt - Is Eureqa a genetic algorithm?

3:33 AM Posted by Sini 41 comments
Just saw Michael Schmidt speak at Pivotal Labs about Eureqa.

His presentation was very similar to this one at TEDx.

It was an interesting discussion of his algorithm, which tries to distill understanding out of data rather than just the accurate but mystical predictions that most machine learning algorithms produce.  In other words, rather than hiding the prediction behind a trained black box, it seeks to reveal the true features and formulas that transform your data from x to y.  For instance, is the formula y = sin(x), y = cos(x), or y = x^2?  These transformations of your x parameter are feature generation, which is ordinarily a difficult skill to master, but Eureqa seems to do it with ease.

How?

He showed a number of slides that resemble a decision tree, with the nodes being +, -, /, * and various other transformations, but such a structure on its own does not seem to have an implicit feedback loop to tell you whether you were right or wrong.

He also stressed that the search is very computationally intensive, and that it is only today's processing power that makes it possible.

"The search space for equations is infinite."
"The approach that works very well is based on natural selection, particularly darwin evolution."

He is using a genetic algorithm to generate a plethora of formulas, which he then tests for accuracy against the data.  He then kills off the formulas with the weakest predictive quality and cross-pollinates the others at random.
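
To make the loop concrete, here is a toy sketch of that generate / test / kill off / cross-pollinate cycle. It is not Eureqa's implementation: the tree encoding, the operator set, the population size, the "keep the best half" rule, the depth cap and every function name are my own assumptions, chosen only to keep the example short.

```python
import random

# Hypothetical operator set; Eureqa's real building blocks are not given in the talk.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}
MAX_DEPTH = 5  # assumption: reject formulas that grow deeper than this

def random_tree(rng, depth=3):
    """Build a random formula: 'x', a small constant, or (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else float(rng.randint(1, 3))
    return (rng.choice(sorted(OPS)), random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree

def error(tree, data):
    """Mean squared error of a formula against (x, y) samples."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)

def tree_depth(tree):
    return 1 + max(tree_depth(tree[1]), tree_depth(tree[2])) if isinstance(tree, tuple) else 0

def subtrees(tree):
    yield tree
    if isinstance(tree, tuple):
        yield from subtrees(tree[1])
        yield from subtrees(tree[2])

def crossover(rng, a, b):
    """Replace one randomly chosen branch of `a` with a random subtree of `b`."""
    if not isinstance(a, tuple) or rng.random() < 0.2:
        return rng.choice(list(subtrees(b)))
    op, left, right = a
    if rng.random() < 0.5:
        return (op, crossover(rng, left, b), right)
    return (op, left, crossover(rng, right, b))

def evolve(data, generations=20, size=100, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(size)]
    for _ in range(generations):
        population.sort(key=lambda t: error(t, data))
        survivors = population[: size // 2]  # kill off the weakest half
        children = []
        while len(survivors) + len(children) < size:
            child = crossover(rng, rng.choice(survivors), rng.choice(survivors))
            if tree_depth(child) <= MAX_DEPTH:  # discard bloated formulas
                children.append(child)
        population = survivors + children
    return min(population, key=lambda t: error(t, data))

# Made-up samples of y = x^2 + x, just to have something to fit.
data = [(x / 2, (x / 2) ** 2 + x / 2) for x in range(-4, 5)]
best = evolve(data)
print(best, error(best, data))
```

Because the best half always survives, the best error can only stay the same or improve from one generation to the next. This toy has no mutation step and no constant tuning, so it will not necessarily recover the exact formula.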

Another point he stressed today and in the video is that he focuses on what is not changing in the data, and how challenging it was to find the simplest non-trivial formulas that describe those rules.  He uses the concept of the Pareto frontier for this.
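
As I understand the Pareto frontier idea, you keep every formula that no other formula beats on both simplicity and accuracy. A minimal sketch follows. The candidate formulas, their complexity scores and their error values are invented for illustration, and `pareto_frontier` is a hypothetical helper, not something from Eureqa.

```python
def pareto_frontier(candidates):
    """Return the (name, complexity, error) tuples that are not dominated.

    Lower complexity and lower error are both better.
    """
    frontier = []
    best_error = float("inf")
    # Walk from simplest to most complex and keep a formula only if it is
    # more accurate than every simpler (or equally simple) formula seen so far.
    for name, complexity, err in sorted(candidates, key=lambda c: (c[1], c[2])):
        if err < best_error:
            frontier.append((name, complexity, err))
            best_error = err
    return frontier

# Invented numbers, for illustration only.
candidates = [
    ("y = 1", 1, 9.0),
    ("y = x", 1, 4.0),
    ("y = x + 1", 3, 4.5),
    ("y = x * x", 3, 0.5),
    ("y = x * x + 0.01 * x", 7, 0.5),
]
for name, complexity, err in pareto_frontier(candidates):
    print(name, complexity, err)
```

The last candidate is dropped because it is more complex than y = x * x without being any more accurate, which is the "simplest non-trivial formula" preference in miniature.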

Just a guess, but I would imagine that he would compare and possibly cluster on the most successful formulas to reveal those fundamental rules.

Saturday, March 1, 2014

BeautifulSoup - Cheat Sheet

9:02 PM Posted by Sini 4 comments

BeautifulSoup - cheat sheet
  • parse HTML by default, can parse XML

Classes to Import (from bs4):
  1. BeautifulSoup
  2. CData
  3. ProcessingInstruction
  4. Declaration
  5. Doctype

Basic Commands:
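from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen
from bs4 import BeautifulSoup

# use the lines below to download and parse a webpage
html = urlopen('web address').read()
soup = BeautifulSoup(html, 'html.parser')

# or parse a local file
soup = BeautifulSoup(open('doc.html'), 'html.parser')

soup.prettify() => formatted html as a string
soup.get_text() => all text
soup.get_text('|', strip=True) => all text as unicode, text pieces joined with |, whitespace stripped from each piece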
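
A quick way to see the difference between the two get_text() calls is to run them on a small inline document. The HTML string below is made up for the example, and I am assuming the built-in 'html.parser' backend.

```python
from bs4 import BeautifulSoup

# Made-up document, parsed with the parser that ships with Python.
html = "<html><body><h1> Title </h1><p>First <b>bold</b> paragraph.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

plain = soup.get_text()                  # text pieces concatenated as-is
joined = soup.get_text("|", strip=True)  # each piece stripped, then joined with |
pretty = soup.prettify()                 # one tag or string per line, indented

print(repr(plain))   # ' Title First bold paragraph.'
print(joined)        # Title|First|bold|paragraph.
print(pretty)
```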

Search
.find() => first match, or None
  • ('tag', {'attr' : 'value'})
.find_all() => list of all matches
  • string => tag name
  • string, string => tag name, CSS class
  • attr='text'
  • attrs={"data-foo": "value"}
  • regex
  • list
  • True => all tags
----
import re
for tag in soup.find_all(re.compile("t")):
  print(tag.name)
----
def has_class_but_no_id(tag):
  return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
----
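
Here is each filter type from the Search list applied to one small document, so the results can be compared side by side. The HTML and the variable names are made up for the example.

```python
import re
from bs4 import BeautifulSoup

html = """
<div id="main" class="box wide">
  <a class="link" href="/one" data-foo="value">one</a>
  <a class="link" id="second" href="/two">two</a>
  <p class="note">three</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a", {"class": "link"})          # first match only
by_name = soup.find_all("a")                            # string: tag name
by_class = soup.find_all("p", "note")                   # string, string: tag name, CSS class
by_keyword = soup.find_all(id="second")                 # attr='text'
by_attrs = soup.find_all(attrs={"data-foo": "value"})   # names that can't be keyword arguments
by_regex = soup.find_all(re.compile("^(a|p)$"))         # regex matched against tag names
by_list = soup.find_all(["div", "p"])                   # any tag name in the list
everything = soup.find_all(True)                        # every tag, no strings

def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

by_function = soup.find_all(has_class_but_no_id)

print(first_link["href"])                 # /one
print([t.name for t in by_regex])         # ['a', 'a', 'p']
print([t.name for t in everything])       # ['div', 'a', 'a', 'p']
print([t.name for t in by_function])      # ['a', 'p']
```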

Navigation
.a_tag.b_tag => get first b_tag in a_tag
.contents => list of direct children
.strings => text from the doc
.stripped_strings => text with surrounding whitespace stripped, whitespace-only strings skipped
.children
.parent
.next_element => next node in parse order, different than children
.previous_element
.next_sibling
.previous_sibling

Iterables
.descendants
.parents
.next_elements
.previous_elements
.next_siblings
.previous_siblings
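
A short sketch of the navigation attributes on a tiny document. The HTML is invented and has no whitespace between tags, so the siblings are tags rather than newline strings.

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>Demo</title></head>"
        "<body><p>One <b>two</b></p><p>Three</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

first_p = soup.body.p                           # first <p> inside <body>

print(first_p.contents)                         # ['One ', <b>two</b>]
print(list(soup.stripped_strings))              # ['Demo', 'One', 'two', 'Three']
print(first_p.parent.name)                      # body
print(first_p.next_sibling)                     # <p>Three</p>
print(repr(first_p.next_element))               # 'One '
print([t.name for t in first_p.b.parents])      # ['p', 'body', 'html', '[document]']
print([d.name for d in soup.body.descendants if d.name])  # ['p', 'b', 'p']
```

Note how .next_sibling stays on the same level of the tree, while .next_element steps into the tag and returns the first thing parsed after its opening tag.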

BeautifulSoup Objects
Main
  1. Tag
  2. NavigableString
  3. BeautifulSoup
  4. Comment

Lesser - all subclass NavigableString
  1. CData
  2. ProcessingInstruction
  3. Declaration
  4. Doctype

Tag Object
.tag => first tag with that name
.tag.name => tag name
.tag.string => text
.tag.get('attr') => use if you don't know if the attribute is defined, returns None if it is missing
.tag attributes are in a dictionary => tag['attr'], tag.attrs
  • multivalued tag attributes => list
  • multivalued tag attributes - class, rel, rev, accept-charset, headers, accesskey
  • 'id' is not multivalued => string
  • you can change tag attributes => tag['attr'] = 'value', del tag['attr']
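
The attribute rules above, tried on a one-tag document (the markup is made up for the example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="intro" class="lead big">Hello</p>', "html.parser")
tag = soup.p

print(tag.name)           # p
print(tag.string)         # Hello
print(tag["class"])       # ['lead', 'big']  (multivalued => list)
print(tag["id"])          # intro            (not multivalued => string)
print(tag.get("title"))   # None             (tag["title"] would raise KeyError)

# Attributes can be changed or removed like dictionary entries.
tag["id"] = "summary"
del tag["class"]
print(tag)                # <p id="summary">Hello</p>
```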

NavigableString Object
  • tag.string => text within a tag
  • tag.string.replace_with("any_text")
  • use outside of BeautifulSoup by converting to unicode
  • unicode(tag.string) (str(tag.string) on Python 3)
  • supports all navigation except .contents .string .find()
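
A small sketch of copying a NavigableString out of the tree and then replacing it in place, assuming Python 3 (so str() stands in for unicode()). The markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>old text</p>", "html.parser")
text = soup.p.string             # a NavigableString, still attached to the tree

plain = str(text)                # plain copy that is safe to keep outside BeautifulSoup
text.replace_with("new text")    # swaps the string inside the document

print(plain)     # old text
print(soup.p)    # <p>new text</p>
```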

BeautifulSoup Object
  • whole document
  • soup.name => u'[document]'
  • supports most navigation

Comment Object
  • NavigableString subclass
  • <!-- text -->
  • display with special formatting when prettified
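
To check that a comment really comes back as its own object type, something like this should work (the markup is made up):

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<p><!-- a note --></p>", "html.parser")
comment = soup.p.string          # the only child of <p> is the comment

print(isinstance(comment, Comment))           # True
print(isinstance(comment, NavigableString))   # True
print(repr(str(comment)))                     # ' a note '
print(soup.prettify())                        # comment is written back with its <!-- --> markers
```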