Scraping with authentication: Scrapy vs BeautifulSoup?

Scraping with authentication: Scrapy vs BeautifulSoup?

Stephen McInerney

What do people use for scraping a website that requires (login-form-based) authentication?
  • BeautifulSoup: does not handle authentication or cookies
  • Scrapy: does, but it is a more heavyweight paradigm to learn, incl. XPath

Some discussion: http://stackoverflow.com/questions/4328271/best-way-for-a-beginner-to-learn-screen-scraping-with-python

Thanks,
Stephen


_______________________________________________
Baypiggies mailing list
[hidden email]
To change your subscription options or unsubscribe:
http://mail.python.org/mailman/listinfo/baypiggies

Re: Scraping with authentication: Scrapy vs BeautifulSoup?

Peter Borocz
In reply to Stephen McInerney's post of Sat, Jun 25, 2011 at 1:42 PM
Although twill is usually thought of only as a testing tool, I've happily used it for the authentication, cookie, and form-handling portion, and then BeautifulSoup for the parsing. Twill can be configured to use BeautifulSoup directly, but since it gives you direct access to the underlying page, you can use any parsing library you like.

PeterB

--

peter.borocz at gmail dot com

Re: Scraping with authentication: Scrapy vs BeautifulSoup?

Glen Jarvis
In reply to this post by Stephen McInerney
Stephen,
    BeautifulSoup really just parses the HTML. It doesn't (have to) retrieve the page for you.

    You can use the built-in urllib libraries, or the third-party httplib2, to retrieve the page (also with authentication) and then use BeautifulSoup to parse the page.
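
A minimal sketch of that split, using only the standard library for the login step and BeautifulSoup for the parsing. It assumes Python 3, a site whose login form accepts a plain POST, and the third-party bs4 package. The helper names (make_opener, login, fetch, link_targets) are made up for illustration and are not part of any library.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_opener():
    """Return an opener that stores cookies and sends them back on later requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def login(opener, login_url, fields):
    """POST the login form; any session cookie ends up in the opener's cookie jar."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with opener.open(login_url, data=data, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch(opener, url):
    """GET a page, sending whatever cookies the opener collected at login."""
    with opener.open(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def link_targets(page_source):
    """Parse HTML with BeautifulSoup and return the href of every link."""
    # Third-party import kept local so the fetching helpers work without bs4.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(page_source, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```

Typical use would be to build one opener, call login(opener, login_url, {...}) once with the form's field names, and then pass the result of fetch(opener, url) to link_targets. The field names depend entirely on the site, and a login form that embeds a hidden token needs that value included in the posted fields.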

Cheers,


Glen

Re: Scraping with authentication: Scrapy vs BeautifulSoup?

Aaron Peterson
In reply to this post by Stephen McInerney

Hello:

Mechanize is another good module for automating this kind of thing.

HTH,

Aaron

Re: Scraping with authentication: Scrapy vs BeautifulSoup?

Dwight Hubbard-3
In reply to Glen Jarvis's post of Saturday, June 25, 2011 6:48 PM
For scraping with authentication, I find the twill module very good.

Re: Scraping with authentication: Scrapy vs BeautifulSoup?

Ryan Larrabure
If you're scraping HTML, all reasonable roads seem to lead to XPath.
I'd use httplib2 and lxml.  Avoid mechanize.  Its form handling is
very poor (it'll read forms stored inline within JavaScript tags).
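
To illustrate the XPath route, here is a small sketch with lxml (third-party, assumed installed). It shows two queries that come up when scripting a form login: collecting a form's hidden inputs, such as a token the server expects back, and listing link targets. The function names and the sample markup are hypothetical, and fetching the page (with httplib2 or anything else) is left out.

```python
from lxml import html


def hidden_fields(page_source):
    """Return {name: value} for every named hidden <input> inside a <form>."""
    tree = html.fromstring(page_source)
    inputs = tree.xpath('//form//input[@type="hidden"][@name]')
    return {el.get("name"): el.get("value", "") for el in inputs}


def link_hrefs(page_source):
    """Return the href attribute of every <a> element, in document order."""
    tree = html.fromstring(page_source)
    return [str(href) for href in tree.xpath("//a/@href")]


# Made-up markup standing in for a downloaded login page.
SAMPLE = """
<html><body>
  <form action="/login" method="post">
    <input type="text" name="username">
    <input type="hidden" name="csrf_token" value="t0k3n">
  </form>
  <a href="/members">Members</a>
  <a href="/logout">Log out</a>
</body></html>
"""

print(hidden_fields(SAMPLE))
print(link_hrefs(SAMPLE))
```

The hidden fields can be merged with the username and password before posting the form, which is the step a bare parser leaves to you.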

On Mon, Jun 27, 2011 at 3:07 PM, Dwight Hubbard
<[hidden email]> wrote:

> For scraping with authentication I find the twill module is very good.