BeautifulSoup is the standard answer. I suspect lxml will not work very well unless the HTML is extremely nicely formatted, but I could be wrong.
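For a concrete picture, here is a minimal BeautifulSoup sketch. It assumes the `bs4` package is installed, and the table structure (a `class="prices"` table with date/close columns) is hypothetical -- a real page will differ, so inspect the actual source first:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for a fetched quote page.
html = """
<table class="prices">
  <tr><th>Date</th><th>Close</th></tr>
  <tr><td>2011-02-18</td><td>27.25</td></tr>
  <tr><td>2011-02-17</td><td>27.10</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
# Skip the header row, then pull the cell text out of each data row.
for tr in soup.find("table", class_="prices").find_all("tr")[1:]:
    cells = [td.get_text() for td in tr.find_all("td")]
    rows.append((cells[0], float(cells[1])))

print(rows)  # -> [('2011-02-18', 27.25), ('2011-02-17', 27.1)]
```

The parser is forgiving of sloppy markup, which is the main reason people reach for it over stricter tools.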
For what you describe, I would suggest developing seat-of-the-pants heuristics: just get the page using httplib and then use string.find liberally. I've had at least three consulting gigs solving this problem using various techniques, and the general problem is quite difficult. But if you are only trying to parse a few pages in simple ways, developing special-purpose heuristics is pretty easy (until they redesign the pages, which they will do every so often).
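The heuristic approach might look like the sketch below. It is pure stdlib; the landmark strings are hypothetical -- you would choose them by looking at the real page source (and on Python 3, httplib is `http.client`; here we just operate on an already-fetched page):

```python
# Stand-in for a page body fetched with httplib/http.client.
page = '... <span class="price">27.25</span> ...'

# Landmark strings chosen by eyeballing the page source (hypothetical here).
start_marker = '<span class="price">'
end_marker = '</span>'

i = page.find(start_marker)
if i != -1:
    i += len(start_marker)
    j = page.find(end_marker, i)
    price = page[i:j]
else:
    price = None  # page layout changed; the heuristic broke

print(price)  # -> 27.25
```

The `else` branch is the important part: these scrapers fail loudly when the page is redesigned, so always handle the marker-not-found case.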
Best of luck, -- Aaron Watters
btw: If you have lots of money to spend on this my former client connotate.com does this sort of scraping (and I developed some of the code).
Also, if you are familiar with jQuery selector syntax, pyquery is worth a look.
On 02/20/2011 11:21 PM, Deb Midya wrote:
Hi Python web-sig users,
Thanks in advance; I am new to web-sig.
I am using Python 2.6 on Windows XP.
May I request your assistance with the following:
On Mon, Feb 21, 2011 at 1:21 AM, Deb Midya <[hidden email]> wrote:
> I would like to extract web data from a site (http://finance.yahoo.com, for example).
> The data may include Historical Prices, Key Statistics, News & Info, Headlines, etc. for a list of codes (such as WOW, .... these are company IDs).
> I am trying to automate the extraction of this data.
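The automation loop described above can be sketched as building one URL per company code and fetching each page in turn. The URL pattern below is a guess at Yahoo Finance's quote-page scheme, and the codes other than WOW are hypothetical examples; the fetch itself is commented out so the sketch runs offline:

```python
# "WOW" comes from the original question; the others are made-up examples.
codes = ["WOW", "BHP"]

# Hypothetical URL pattern -- verify against the real site before relying on it.
urls = ["http://finance.yahoo.com/q?s=%s" % code for code in codes]

for url in urls:
    print(url)
    # On Python 2.6: page = urllib2.urlopen(url).read()
    # On Python 3:   page = urllib.request.urlopen(url).read()
    # ...then hand `page` to BeautifulSoup or a string.find heuristic.
```

Once each page is fetched, the parsing step is exactly the BeautifulSoup or heuristic approach discussed earlier in the thread.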