Website change tracker

Bhavya
Hello,

I am a newbie to Python coding, and I have a question. I want to write a
script that checks websites for content changes and sends an e-mail to an
admin whenever something changes.
Ideally, this script/program should scale to about 1000 websites
at a time.

How do I start? I have previous programming experience, but my Python
knowledge is at about the Python 101 level.

Thank you,
Bhavya
_______________________________________________
BangPypers mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/bangpypers

Re: Website change tracker

kracekumar ramaraju

> Hello,
>
> I am newbie to Python coding. And, I had a question. I want to write a
> script which will check content changes in websites & send e-mail to a
> admin whenever there are changes.
How many times a day, or how often, will this check be performed?

Look into hashing the page content with MD5 and comparing versions with
diff utilities. For web scraping, the Scrapy library is a good choice.
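
A minimal sketch of the MD5-plus-diff idea, using only the standard library
(hashlib and difflib). The function names and the sample pages are made up
for illustration; fetching the pages is left out.

```python
import difflib
import hashlib


def fingerprint(page_text):
    """Return the MD5 hex digest of a page's text."""
    return hashlib.md5(page_text.encode("utf-8")).hexdigest()


def describe_change(old_text, new_text):
    """Return a unified diff of two page versions, or '' if they match."""
    if fingerprint(old_text) == fingerprint(new_text):
        return ""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical page contents, standing in for two downloads of one URL.
old = "<h1>Prices</h1>\n<p>Tea: 10</p>"
new = "<h1>Prices</h1>\n<p>Tea: 12</p>"
print(describe_change(old, new))
```

Storing only the digest per URL is enough to notice a change; keeping the
previous text as well lets the e-mail to the admin say what changed.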

> Ideally this script/program should be scalable for say about 1000 websites
> at a time..
>
> How do I start ? I have previous programming experience. Python - my
> knowledge would be like python 101.
>
> Thank you,
> Bhavya


--
"Talk is cheap, show me the code" -- Linus Torvalds
Regards
Kracekumar.R
www.kracekumar.com

Re: Website change tracker

vid
On Fri, Jun 8, 2012 at 4:09 PM, kracethekingmaker
<[hidden email]> wrote:

>
>> Hello,
>>
>> I am newbie to Python coding. And, I had a question. I want to write a
>> script which will check content changes in websites & send e-mail to a
>> admin whenever there are changes.
>
> How many times in a day or how often will this check be performed ?
>
> You must look into how to use md5, diff utilities, for web scraping scrapy
> library is advised.
>
>> Ideally this script/program should be scalable for say about 1000 websites
>> at a time..

1000 sites at a time? Wow, that's huge. Scraping that many sites is
resource intensive; you would need a nice, big, stable server that can
handle the huge data dumps. FWIW, Scrapy's feed exports only dump the
scraped data to files (JSON, for example), so read up a little on the
database you want to use, the frontend to serve it, a queueing system
to scale to 1000 sites, etc. Also, some sites instantly ban scrapers.
Watch out for that, and good luck :)
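
As a rough sketch of the queueing part, here is a bounded pool of worker
threads from the standard library (concurrent.futures). It is a single-process
stand-in for a real queueing system, not a replacement for one; check_all,
fake_check and the URLs are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def check_all(urls, check_site, max_workers=20):
    """Run check_site(url) for every URL on a bounded pool of worker threads.

    Returns a dict mapping each URL to its result, or to the exception raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(check_site, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad site should not stop the rest
                results[url] = exc
    return results


def fake_check(url):
    # Stand-in for a real fetch-and-compare step (hypothetical).
    if "broken" in url:
        raise ValueError("could not fetch " + url)
    return url.endswith("/changed")


sites = [
    "http://example.com/a",
    "http://example.com/changed",
    "http://broken.example/",
]
report = check_all(sites, fake_check, max_workers=4)
```

Keeping max_workers small, and not hitting the same host from several
workers at once, also lowers the odds of getting banned.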

--
Regards,
Vid
http://svaksha.com ॥

Re: Website change tracker

Bhavya
Thanks, everyone :) Much appreciated.
I will work on it & let the group know how it goes.

Thanks,
Bhavya

On Fri, Jun 8, 2012 at 1:06 PM, vid <[hidden email]> wrote:

> 1000 sites at a time? Wow, that's huge. Scraping that many sites is
> resource intensive, would need a nice big stable server that can
> handle the huge data dumps. Fwiw, Scrapy will only dump the data in
> the json files so check out a little about the database you want to
> use, the frontend to serve it, a queueing system to scale 1000 sites,
> etc... Also, some sites instantly ban scrapers. Watch out for that,
> and goodluck :)

Re: Website change tracker

Sriram Narayanan
If you need to check for the absence of certain content, then write tests
using Sahi or Selenium, and run those at periodic intervals.

Ram
On Jun 8, 2012 11:21 PM, "Bhavya" <[hidden email]> wrote:

> Thanks everyone:)...Much appreciated.
> I will work on it & let the group know how it goes.
>
> Thanks,
> Bhavya

Re: Website change tracker

vid
On Sat, Jun 9, 2012 at 1:43 AM, Sriram Narayanan <[hidden email]> wrote:
> If you need to check for the absence of certain content, then write tests
> using Sahi or Selenium, and run those at periodic intervals.

Does Sahi have Python bindings now? I last checked two years ago,
so it must have come a long way since.


Regards,
Vid
http://svaksha.com ॥

Re: Website change tracker

Arvind K
On Jun 9, 2012 10:51 AM, "vid" <[hidden email]> wrote:
>
> On Sat, Jun 9, 2012 at 1:43 AM, Sriram Narayanan <[hidden email]> wrote:
> > If you need to check for the absence of certain content, then write tests
> > using Sahi or Selenium, and run those at periodic intervals.
>
> Does Sahi have python bindings now? The last I checked was 2 years ago
> so it must have come a long way.
>
>
> Regards,
> Vid
> ॥ http://svaksha.com
Hi, you can use lxml to do what you are looking for. The library ships an
HTML diff function (htmldiff, in lxml.html.diff).
http://lxml.de

Re: Website change tracker

vid
On Sat, Jun 9, 2012 at 5:32 AM, Arvind K <[hidden email]> wrote:

> Hi, you can use lxml to do what you are looking for.. The library has a
> diff method..
> http://lxml.de

BTW, were you replying to my question about Sahi and Python bindings?
Because AFAIK, lxml is more of a parser than a functional testing
tool.

Regards,
Vid
http://svaksha.com ॥

Re: Website change tracker

Arvind K
On 9 June 2012 11:11, vid <[hidden email]> wrote:

> Btw, were you replying to my question about sahi and python bindings?
> Because afaik, lxml is more of a parser than a functional testing
> tool.

Umm no.  I was replying to Bhavya.

Re: Website change tracker

Anand Chitipothu-2
In reply to this post by vid
On Sat, Jun 9, 2012 at 10:50 AM, vid <[hidden email]> wrote:
> On Sat, Jun 9, 2012 at 1:43 AM, Sriram Narayanan <[hidden email]> wrote:
>> If you need to check for the absence of certain content, then write tests
>> using Sahi or Selenium, and run those at periodic intervals.
>
> Does Sahi have python bindings now? The last I checked was 2 years ago
> so it must have come a long way.

Yes.

http://selenium-python.readthedocs.org/

Anand

Re: Website change tracker

Sriram Narayanan
In reply to this post by vid
On Sat, Jun 9, 2012 at 10:50 AM, vid <[hidden email]> wrote:
> On Sat, Jun 9, 2012 at 1:43 AM, Sriram Narayanan <[hidden email]> wrote:
>> If you need to check for the absence of certain content, then write tests
>> using Sahi or Selenium, and run those at periodic intervals.
>
> Does Sahi have python bindings now? The last I checked was 2 years ago
> so it must have come a long way.

Sorry, I was thinking about the requirement (check for website content
change), and didn't consider a python-only solution.

--
------------------------------------
Belenix: www.belenix.org
Twitter: @sriramnrn

Re: Website change tracker

vid
In reply to this post by Anand Chitipothu-2
On Sat, Jun 9, 2012 at 6:03 AM, Anand Chitipothu <[hidden email]> wrote:

> On Sat, Jun 9, 2012 at 10:50 AM, vid <[hidden email]> wrote:
>> On Sat, Jun 9, 2012 at 1:43 AM, Sriram Narayanan <[hidden email]> wrote:
>>> If you need to check for the absence of certain content, then write tests
>>> using Sahi or Selenium, and run those at periodic intervals.
>>
>> Does Sahi have python bindings now? The last I checked was 2 years ago
>> so it must have come a long way.
>
> Yes.
>
> http://selenium-python.readthedocs.org/

Sahi[0] != Selenium

[0] http://sahi.co.in/

--
Regards,
Vid
http://svaksha.com ॥

Re: Website change tracker

Anand Balachandran Pillai
In reply to this post by vid
On Fri, Jun 8, 2012 at 10:36 PM, vid <[hidden email]> wrote:

> 1000 sites at a time? Wow, that's huge. Scraping that many sites is
> resource intensive, would need a nice big stable server that can
> handle the huge data dumps. Fwiw, Scrapy will only dump the data in
> the json files so check out a little about the database you want to
> use, the frontend to serve it, a queueing system to scale 1000 sites,
> etc... Also, some sites instantly ban scrapers. Watch out for that,
> and goodluck :)
>

This is much easier than you think. It looks big because you are
treating it as a full-scale scraping problem. It is in fact more
along the lines of an "incremental crawler".

Write a simple crawler that keeps track of a few key entry-point
URLs on every site. You can typically get them from the sitemap
or by querying Google. The crawler can be hand-written or built on
existing libraries and frameworks such as pycurl or Scrapy.

1. When crawling, use a HEAD request to fetch the page. This
ensures you get only the headers of the page, not the data. Store
the metadata of interest in a file: use an MD5 hash of the URL as
a unique name and a two-level directory scheme like Squid's.
The fields of interest are the Last-Modified time, the ETag (if any)
and the Content-Length.

2. Recrawl at fixed intervals. Before requesting a URL, load its
metadata from the cache if it exists. Put the cached Last-Modified
time in the "If-Modified-Since" header. You can optionally
add "If-None-Match" with the ETag, if one was found.

3. If the page is not modified, the server returns HTTP 304 (Not
Modified). Handle it; some libraries, urllib2 for example, raise it
as an exception. Otherwise download the page or take whatever other
action you need. Update the cache if the page was modified.
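
The three steps above can be sketched roughly as follows, using only the
standard library (urllib.request in Python 3). All function names are made up
for illustration, and the cache layout (the first two pairs of hex digits of
the MD5 as directory levels) is just one reading of "two-level directory
scheme", not Squid's exact layout.

```python
import hashlib
import os
import urllib.error
import urllib.request


def cache_path(url, root="cache"):
    """Map a URL to root/<h[0:2]>/<h[2:4]>/<h>, where h is the MD5 of the URL."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return os.path.join(root, digest[:2], digest[2:4], digest)


def pick_metadata(headers):
    """Keep only the response headers of interest (values may be None)."""
    return {
        "last-modified": headers.get("Last-Modified"),
        "etag": headers.get("ETag"),
        "content-length": headers.get("Content-Length"),
    }


def head_metadata(url, timeout=30):
    """Step 1: fetch only the headers of a page with a HEAD request."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return pick_metadata(response.headers)


def conditional_headers(meta):
    """Step 2: build request headers from cached metadata (a dict)."""
    headers = {}
    if meta.get("last-modified"):
        headers["If-Modified-Since"] = meta["last-modified"]
    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]
    return headers


def check_url(url, meta, timeout=30):
    """Step 3: return (modified, new_meta, body) for one URL.

    urllib reports a 304 reply as an HTTPError, so it is caught here.
    """
    request = urllib.request.Request(url, headers=conditional_headers(meta))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return True, pick_metadata(response.headers), response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return False, meta, None
        raise
```

A server that ignores conditional headers answers 200 every time, so it is
safer to also compare a hash of the body before e-mailing anyone.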

For 1000 sites, partition the sites into multiple sets and run such
incremental crawls frequently. Use random selection to pick the
sites for each set.

Use random selection of starting URLs as well, so that over
successive crawls you visit most parts of each site.
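
A small sketch of the partitioning and random selection, assuming the sites
and entry-point URLs are plain lists. The function names, the set count of 10
and the example host names are arbitrary choices for illustration.

```python
import random


def partition_sites(sites, set_count, rng=random):
    """Shuffle the sites and deal them into set_count roughly equal sets."""
    shuffled = list(sites)
    rng.shuffle(shuffled)
    return [shuffled[i::set_count] for i in range(set_count)]


def pick_start_urls(entry_urls, how_many, rng=random):
    """Pick a random subset of a site's entry-point URLs for this crawl."""
    urls = list(entry_urls)
    return rng.sample(urls, min(how_many, len(urls)))


# Hypothetical site list; a seeded generator keeps the split repeatable.
sites = ["site%04d.example" % n for n in range(1000)]
site_sets = partition_sites(sites, 10, rng=random.Random(42))
```

Each set can then be crawled on its own schedule, with pick_start_urls
called afresh on every run.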

I have written such systems before and still maintain them. It is an
interesting area. Ask if you have specific questions.


--
Regards,

--Anand