Extracting web data


Extracting web data

Deb Midya
Hi Python web-sig users,
 
Thanks in advance and I am new to web-sig.
 
I am using Python 2.6 on Windows XP.
 
May I request your assistance with the following, please?

I would like to extract web data from a site (http://finance.yahoo.com, for example).

The data may include Historical Prices, Key Statistics, News & Info, Headlines, etc. for a list of codes (such as WOW, ...; these are codes for company IDs).

I am trying to automate the extraction of data.

Is there a Python module for this, or can anyone offer assistance?
 
Once again, thank you very much for the time you have given.
 
Regards,
 
Deb
 

 
_______________________________________________
Web-SIG mailing list
[hidden email]
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/lists%40nabble.com

Re: Extracting web data

Deb Midya
Joost,
 
Thank you very much for your response.
 
I could not find a binary release of lxml in the package index on python.org.
 
I am using Python 2.6 on Windows XP.
 
Is there any alternative solution?
 
Once again, thank you very much for the time you have given.
 
Regards,
 
Deb

--- On Mon, 21/2/11, Joost Molenaar <[hidden email]> wrote:

From: Joost Molenaar <[hidden email]>
Subject: Re: [Web-SIG] Extracting web data
To: "Deb Midya" <[hidden email]>
Received: Monday, 21 February, 2011, 5:19 PM

You should look at lxml: it knows how to parse HTML and XML and lets you use XPath to find the data you need.
Joost Molenaar
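
To illustrate the XPath idea, here is a minimal sketch. It is not lxml itself: it uses the standard library's xml.etree.ElementTree as a stand-in, which only understands a limited XPath subset and needs well-formed markup, whereas lxml accepts full XPath 1.0 and copes with real-world HTML. It is written for Python 3. The sample table, its id "prices", and the values in it are hypothetical, as is the helper name extract_rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; real pages are rarely this tidy.
SAMPLE = """
<html>
  <body>
    <table id="prices">
      <tr><th>Date</th><th>Close</th></tr>
      <tr><td>2011-02-18</td><td>27.10</td></tr>
      <tr><td>2011-02-21</td><td>27.35</td></tr>
    </table>
  </body>
</html>
"""


def extract_rows(markup, table_id):
    """Return the cell texts of each data row in the table with the given id."""
    root = ET.fromstring(markup.strip())
    # ElementTree supports a limited XPath subset, including
    # attribute predicates such as [@id='...'].
    table = root.find(".//table[@id='%s']" % table_id)
    if table is None:
        return []
    rows = []
    for tr in table.findall("tr"):
        cells = [td.text for td in tr.findall("td")]
        if cells:  # the header row has only <th> cells, so it is skipped
            rows.append(cells)
    return rows


print(extract_rows(SAMPLE, "prices"))
# -> [['2011-02-18', '27.10'], ['2011-02-21', '27.35']]
```

With lxml installed, the same kind of query would go through lxml.html.fromstring(...) and the element's xpath(...) method instead.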

Re: Extracting web data

James Mills-3
In reply to this post by Deb Midya

You might want to look into using either
the lxml or BeautifulSoup modules.

cheers
James

--
-- James Mills
--
-- "Problems are solved by method"


Re: Extracting web data

Aaron Watters-2
BeautifulSoup is the standard response.
I think lxml will not work very well unless the
HTML is extremely nicely formatted, but I could
be wrong.

For what you describe I would suggest developing
seat-of-the-pants heuristics -- just get the page
using httplib and then use string.find liberally.
I've had at least three consulting gigs solving
these problems using various techniques, and the general
problem is quite difficult, but if you are trying to
parse just a few pages in simple ways, developing
special-purpose heuristics is pretty easy (until they
redesign the pages, which they will do every so often).
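
A rough sketch of that kind of heuristic, for illustration only. It assumes Python 3, where httplib became http.client and urllib.request is the usual shortcut for fetching a page. The helper names (fetch_page, text_between, all_between), the markers, and the sample markup are all hypothetical; a real page would need its own markers, found by reading its source.

```python
from urllib.request import urlopen


def fetch_page(url, encoding="utf-8"):
    """Download a page and decode it. The encoding is an assumption.

    Shown for completeness; it is not called in this sketch.
    """
    with urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")


def text_between(page, start_marker, end_marker, begin=0):
    """Return (text, next_index) for the text between two markers.

    Returns (None, -1) when either marker is not found after `begin`.
    """
    start = page.find(start_marker, begin)
    if start == -1:
        return None, -1
    start += len(start_marker)
    end = page.find(end_marker, start)
    if end == -1:
        return None, -1
    return page[start:end].strip(), end + len(end_marker)


def all_between(page, start_marker, end_marker):
    """Collect every piece of text enclosed by the two markers."""
    results = []
    pos = 0
    while True:
        text, pos = text_between(page, start_marker, end_marker, pos)
        if text is None:
            break
        results.append(text)
    return results


# Hypothetical markup standing in for a downloaded page.
SAMPLE_PAGE = (
    '<ul><li class="headline"> First story </li>'
    '<li class="headline">Second story</li></ul>'
)

print(all_between(SAMPLE_PAGE, '<li class="headline">', '</li>'))
# -> ['First story', 'Second story']
```

The markers are the fragile part: when the site changes its markup, find simply returns -1 and the functions return nothing, which is at least easy to detect.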

Best of luck, -- Aaron Watters

btw: If you have lots of money to spend on this
  my former client connotate.com does this sort
  of scraping (and I developed some of the code).


Re: Extracting web data

Joost Molenaar-2
In reply to this post by Deb Midya
Hi Deb, sorry for replying directly to you instead of to the list; Gmail makes it very easy to click the wrong reply button. :)

It seems you will have to install a slightly older (by 5 months) version of lxml if you need a binary release, so try version 2.2.8 at http://pypi.python.org/pypi/lxml/2.2.8 instead of the newest, 2.3.

Joost


Re: Extracting web data

James Y Knight
In reply to this post by James Mills-3
On Feb 21, 2011, at 7:07 PM, James Mills wrote:
> You might want to look into using either
> the lxml or BeautifulSoup modules.

For parsing random HTML, the html5lib module works much better than either of those.



Re: Extracting web data

Randy Syring-3
In reply to this post by Deb Midya
Also, if you are familiar with jQuery selector syntax, pyquery is very helpful!
--------------------------------------
Randy Syring
Intelicom
Direct: 502-276-0459
Office: 502-212-9913

For the wages of sin is death, but the
free gift of God is eternal life in 
Christ Jesus our Lord (Rom 6:23)


Re: Extracting web data

Lennart Regebro-2
In reply to this post by Aaron Watters-2
On Tue, Feb 22, 2011 at 01:52, Aaron Watters <[hidden email]> wrote:
> BeautifulSoup is the standard response.
> I think lxml will not work very well unless the
> html is extremely nicely formatted, but I could
> be wrong.

lxml handles broken HTML pretty well.

There are Windows binaries here: http://pypi.python.org/pypi/lxml/2.2.8

//Lennart


Re: Extracting web data

Bruno Rezende
In reply to this post by Deb Midya
Hi Deb,


On Mon, Feb 21, 2011 at 1:21 AM, Deb Midya <[hidden email]> wrote:
>
>
> I like to extract web data from the site (http://finance.yahoo.com, for example).
>
> The data may include Historical Prices, Key Statistics, News & Info, Headlines, etc. for a list of codes (such WOW, .... these are codes for company Ids).
>
> I am trying to automate the extraction of data.
>
>

Take a look at Scrapy: http://doc.scrapy.org/intro/overview.html

--
Bruno