|
Hello,
I am a newbie to Python coding, and I have a question. I want to write a script that checks websites for content changes and sends an e-mail to an admin whenever something changes. Ideally, this script/program should scale to about 1000 websites at a time. How do I start? I have previous programming experience, but my Python knowledge is at about the Python 101 level. Thank you, Bhavya _______________________________________________ BangPypers mailing list [hidden email] http://mail.python.org/mailman/listinfo/bangpypers |
|
> I want to write a script which will check content changes in websites & send
> e-mail to a admin whenever there are changes.

How often will this check be performed, for example how many times a day?

Look into MD5 hashing and diff utilities for detecting the changes. For web scraping, the Scrapy library is a good choice.

-- "Talk is cheap, show me the code" -- Linus Torvalds Regards Kracekumar.R www.kracekumar.com |
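The MD5 and diff suggestion can be sketched with Python's standard library. This is only a minimal illustration, assuming the page bodies have already been downloaded; the function names and the sample pages are made up for the example, and MD5 is used here purely as a cheap fingerprint, not for security.

```python
import difflib
import hashlib


def content_fingerprint(body: bytes) -> str:
    """Return the MD5 hex digest of a page body (a change fingerprint only)."""
    return hashlib.md5(body).hexdigest()


def describe_change(old_text: str, new_text: str) -> str:
    """Return a unified diff of two page versions, or "" if they are identical."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical page snapshots from two consecutive checks.
old_page = "<h1>Prices</h1>\n<p>Widget: 10 USD</p>"
new_page = "<h1>Prices</h1>\n<p>Widget: 12 USD</p>"

# Compare the cheap fingerprints first; build the diff only when they differ.
changed = content_fingerprint(old_page.encode("utf-8")) != content_fingerprint(
    new_page.encode("utf-8")
)
report = describe_change(old_page, new_page) if changed else ""
```

Storing only the digest per URL keeps the bookkeeping small. The diff is worth computing only if the previous body is also kept, for instance to include it in the e-mail to the admin.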
|
On Fri, Jun 8, 2012 at 4:09 PM, kracethekingmaker <[hidden email]> wrote:
> How many times in a day or how often will this check be performed ?
>
> You must look into how to use md5, diff utilities, for web scraping scrapy
> library is advised.
>
>> Ideally this script/program should be scalable for say about 1000 websites
>> at a time..

1000 sites at a time? Wow, that's huge. Scraping that many sites is resource intensive; you would need a big, stable server that can handle the huge data dumps. FWIW, Scrapy on its own only dumps the scraped data to files (JSON, for example), so read up a little on the database you want to use, the frontend to serve it, a queueing system to scale to 1000 sites, etc. Also, some sites instantly ban scrapers. Watch out for that, and good luck :)

-- Regards, Vid ॥ http://svaksha.com ॥ |
|
Thanks everyone:)...Much appreciated.
I will work on it & let the group know how it goes. Thanks, Bhavya |
|
If you need to check for the absence of certain content, then write tests
using Sahi or Selenium, and run those at periodic intervals. Ram |
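The idea of a periodic "is this content still there?" check can be illustrated without either tool. The sketch below is a standard-library stand-in, not Sahi or Selenium code: it assumes the HTML has already been fetched, it does not execute JavaScript the way a browser-driven test would, and the names `missing_phrases` and `_TextExtractor` are invented for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def missing_phrases(html, required):
    """Return the required phrases that do not occur in the page text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Normalise whitespace and case so the match is not layout-sensitive.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return [phrase for phrase in required if phrase.lower() not in text]


# Hypothetical page: "Privacy policy" appears only inside a script, not as text.
page = (
    "<html><body><h1>Contact   us</h1>"
    "<script>var label = 'Privacy policy';</script></body></html>"
)
absent = missing_phrases(page, ["Contact us", "Privacy policy"])
```

A scheduler such as cron could run a check like this at the desired interval and alert the admin whenever the returned list is not empty.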
|
On Sat, Jun 9, 2012 at 1:43 AM, Sriram Narayanan <[hidden email]> wrote:
> If you need to check for the absence of certain content, then write tests
> using Sahi or Selenium, and run those at periodic intervals.

Does Sahi have Python bindings now? I last checked two years ago, so it must have come a long way.

Regards, Vid ॥ http://svaksha.com ॥ |
|
On Jun 9, 2012 10:51 AM, "vid" <[hidden email]> wrote:
> Does Sahi have python bindings now? The last I checked was 2 years ago
> so it must have come a long way.

Hi, you can use lxml to do what you are looking for. The library has an HTML diff module (lxml.html.diff). http://lxml.de |
|
On Sat, Jun 9, 2012 at 5:32 AM, Arvind K <[hidden email]> wrote:
> Hi, you can use lxml to do what you are looking for.. The library has a
> diff method..
> http://lxml.de

Btw, were you replying to my question about Sahi and Python bindings? Because AFAIK, lxml is more of a parser than a functional testing tool.

Regards, Vid ॥ http://svaksha.com ॥ |
|
On 9 June 2012 11:11, vid <[hidden email]> wrote:
> Btw, were you replying to my question about sahi and python bindings?
> Because afaik, lxml is more of a parser than a functional testing
> tool.

Umm, no. I was replying to Bhavya. |
|
In reply to this post by vid
On Sat, Jun 9, 2012 at 10:50 AM, vid <[hidden email]> wrote:
> Does Sahi have python bindings now? The last I checked was 2 years ago
> so it must have come a long way.

Yes. http://selenium-python.readthedocs.org/

Anand |
|
In reply to this post by vid
On Sat, Jun 9, 2012 at 10:50 AM, vid <[hidden email]> wrote:
> Does Sahi have python bindings now? The last I checked was 2 years ago
> so it must have come a long way.

Sorry, I was thinking about the requirement (checking for website content changes) and didn't consider a Python-only solution.

-- ------------------------------------ Belenix: www.belenix.org Twitter: @sriramnrn |
|
In reply to this post by Anand Chitipothu-2
On Sat, Jun 9, 2012 at 6:03 AM, Anand Chitipothu <[hidden email]> wrote:
>> Does Sahi have python bindings now? The last I checked was 2 years ago
>> so it must have come a long way.
>
> Yes.
>
> http://selenium-python.readthedocs.org/

Sahi[0] != Selenium

[0] http://sahi.co.in/

-- Regards, Vid ॥ http://svaksha.com ॥ |
|
In reply to this post by vid
On Fri, Jun 8, 2012 at 10:36 PM, vid <[hidden email]> wrote:
> 1000 sites at a time? Wow, that's huge. Scraping that many sites is
> resource intensive, would need a nice big stable server that can
> handle the huge data dumps. Fwiw, Scrapy will only dump the data in
> the json files so check out a little about the database you want to
> use, the frontend to serve it, a queueing system to scale 1000 sites,
> etc... Also, some sites instantly ban scrapers. Watch out for that,
> and goodluck :)

This is much easier than you think. It looks big because you are treating it as a full-scale scraping problem. It is in fact more along the lines of an "incremental crawler".

Write a simple crawler that keeps track of a few key entry-point URLs on every site. You can typically get them from the sitemap or by querying Google. The crawler can be hand-written or built on existing tools such as pycurl or Scrapy.

1. When crawling, use a HEAD request to fetch the page. This ensures you get only the headers of the page, not the data. Store the metadata of interest in a file: use an MD5 hash of the URL as a unique name, and a two-level directory scheme like Squid's. The fields of interest are the last-modified time, the ETag (if any) and the content length.

2. Recrawl at fixed intervals. Before requesting a URL, load its metadata from the cache if it exists, and set the "If-Modified-Since" header to the stored last-modified time. You can also optionally add "If-None-Match" with the ETag, if one was stored.

3. If the page is not modified, the server returns HTTP 304 (Not Modified), which is a normal response rather than an error. Handle it. Otherwise download the page or take whatever other action you need, and update the cache.

For 1000 sites, partition the sites into multiple sets and run such incremental crawls frequently. Use random selection to pick the sites for each set. Use random selection of starting URLs so that you visit most parts of a site over subsequent crawls.

I have written such systems before and still maintain them. It is an interesting area. Ask if you have specific questions.

-- Regards, --Anand |
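The three steps and the partitioning advice could look roughly like this with only the standard library. It is a sketch under assumptions, not Anand's actual system: the cache location, the metadata keys, and every function name are hypothetical, and taking the two directory levels from the first hex characters of the digest is just one way to get a Squid-like layout.

```python
import hashlib
import random
import urllib.error
import urllib.request
from pathlib import Path

CACHE_ROOT = Path("crawl-cache")  # hypothetical cache location


def cache_path(url, root=CACHE_ROOT):
    """Map a URL to root/ab/cd/<md5>, a two-level layout keyed by the URL hash."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return root / digest[:2] / digest[2:4] / digest


def conditional_headers(metadata):
    """Build the conditional request headers from previously stored metadata."""
    headers = {}
    if metadata.get("last-modified"):
        headers["If-Modified-Since"] = metadata["last-modified"]
    if metadata.get("etag"):
        headers["If-None-Match"] = metadata["etag"]
    return headers


def recheck(url, metadata, timeout=10):
    """Return None if the page is unchanged, else (body, fresh_metadata)."""
    request = urllib.request.Request(url, headers=conditional_headers(metadata))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            fresh = {
                "last-modified": response.headers.get("Last-Modified"),
                "etag": response.headers.get("ETag"),
                "content-length": response.headers.get("Content-Length"),
            }
            return response.read(), fresh
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: urllib reports it as an HTTPError
            return None
        raise


def partition(sites, set_count, seed=None):
    """Shuffle the sites and deal them into set_count roughly equal sets."""
    shuffled = list(sites)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::set_count] for i in range(set_count)]
```

Each crawl run would then pick one set from `partition(...)`, call `recheck` for every URL in it, and write the fresh metadata to `cache_path(url)` whenever a change is reported. That is also the point at which the e-mail to the admin would be sent.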
