Have any of you used or written a tool which will convert a web page into a file that is cleanly paginated and printable? I need to print out some documentation which was only written in HTML.
It's more convenient to have paper docs you can make notes on and use when you're away from a wifi connection. Below is just one of many pages from http://dangerousprototypes.com/docs which I'd like to get into a printable format.
The Python Weekly archive is another site I'd like to be able to download and have available offline (although not necessarily printable).
I'm hoping I don't need to reinvent the wheel.
Thanks
Hi Tony,
On 3/29/12 11:04 PM, Tony Cappellini wrote:
> Have any of you used or written a tool which will convert a web page
> into a file that is cleanly paginated and printable?

Perhaps I misunderstood the problem. The tool I've used for this task is my web browser (Firefox, Safari). On a Mac, I use Print > Save as PDF to get the output into a file.

Monte
In reply to this post by Tony Cappellini-2
Tony> I'm hoping I don't need to reinvent the wheel.

lynx -dump | pr ?

--
Ian Zimmerman
gpg public key: 1024D/C6FF61AD
fingerprint: 66DC D68F 5C1B 4D71 2EE5 BD03 8A00 786C C6FF 61AD
http://www.gravatar.com/avatar/c66875cda51109f76c6312f4d4743d1e.png
Rule 420: All persons more than eight miles high to leave the court.
In reply to this post by Monte Davidoff
Monte,
While that will work for the current page, I should have mentioned that I'm looking for a program that can follow URLs several levels deep (the depth will be determined by the user).
Since there is a lot of documentation, I was hoping to find a tool I can point at a top-level URL (or a series of URLs) and let it do all the work. In the case of Python Weekly, many of the main URLs link to other offsite URLs.
Saving the main URL to PDF won't get the offsite content.

Thanks
In reply to this post by Tony Cappellini-2
On Thu, Mar 29, 2012 at 11:04 PM, Tony Cappellini <[hidden email]> wrote:
> Have any of you used or written a tool which will convert a web page
> into a file that is cleanly paginated and printable?
>
> I need to print out some documentation which was only written in HTML.
> It's more convenient to have paper docs you can make notes on and use
> when you're away from a wifi connection.

Check out Print Friendly: http://www.printfriendly.com/

It's a web service that converts a page / URL to a PDF. I've had good luck with it converting doc pages to PDF for Kindle reading. It's not perfect, but it did a reasonable job on the Bus Pirate page (actually, it usually does better).

> I'm hoping I don't need to reinvent the wheel.

You might have to grease it, though. Print Friendly only handles one page; it doesn't walk the tree of included links. But it's a web service, so you can probably call it from urllib without too much effort. There are no terms of service or API posted on the site, but it looks like a Bay Area project if you need to get in touch with them.
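Untested, and since there's no published API the endpoint below is only a guess (it may also return HTML rather than a PDF), but the call from Python might look something like:

    # Untested sketch. Print Friendly has no documented API; the
    # /print/v2?url=... endpoint is a guess and may return the print
    # view as HTML rather than a PDF.
    import urllib
    import urllib2

    def print_friendly(page_url):
        query = urllib.urlencode({'url': page_url})
        return urllib2.urlopen('http://www.printfriendly.com/print/v2?' + query).read()

    data = print_friendly('http://dangerousprototypes.com/docs/Bus_Pirate')
    open('bus_pirate_print.html', 'wb').write(data)

mike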
In reply to this post by Tony Cappellini-2
Tony Cappellini writes:
> While that will work for the current page, I should have mentioned I'm
> looking for a program that should be able to follow urls several levels
> (the level of urls will be determined by the user).

It's fairly easy to convert HTML pages to PDF with Python, QWebView, and QPrinter (you don't have to be running KDE or set up a GUI). I've never found a way to do the same using python-webkit (the GTK bindings).

http://shallowsky.com/blog/programming/html-slides-to-pdf.html

I'm sure the WebKit API will let you get a list of links and follow them, though I don't know the calls offhand. Or you could get them with a grep of the original page, and run your print script on the list of URLs (as I do there).
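The core of it looks roughly like this (a trimmed-down, untested variant of what's in that post; the URL and output filename are just placeholders):

    # Minimal sketch: render one URL to PDF with PyQt4's WebKit widget.
    # Untested as written; URL and output name are placeholders.
    import sys
    from PyQt4 import QtCore, QtGui, QtWebKit

    app = QtGui.QApplication(sys.argv)
    view = QtWebKit.QWebView()

    def save_pdf(ok):
        # Called once the page has finished loading.
        printer = QtGui.QPrinter()
        printer.setOutputFormat(QtGui.QPrinter.PdfFormat)
        printer.setOutputFileName('page.pdf')
        view.print_(printer)
        app.quit()

    view.loadFinished.connect(save_pdf)
    view.load(QtCore.QUrl('http://dangerousprototypes.com/docs/Bus_Pirate'))
    app.exec_()

...Akkana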
In reply to this post by Tony Cappellini-2
Tony Cappellini <[hidden email]> wrote:
> Monte,
>
> While that will work for the current page, I should have mentioned I'm
> looking for a program that should be able to follow urls several levels
> (the level of urls will be determined by the user).

Sounds like you want to save a whole site, not just a Web page.

Anyway, the tools I like are "wkpdf" and its alternatives done with Qt and GTK+. They use WebKit to render to PDF. Pretty common by now. Try wk2pdf.py, for instance.

To walk a site, you might try the Plucker tool. It does a pretty good job, if you can still find it.

Bill
> Sounds like you want to save a whole site, not just a Web page.

No, not a whole site. Just many pages.
In reply to this post by jalopyuser
On Friday 2012-03-30 12:27 (-0700), Bill Janssen <[hidden email]> wrote:
> Tony Cappellini <[hidden email]> wrote:
>> While that will work for the current page, I should have mentioned I'm
>> looking for a program that should be able to follow urls several levels
>> (the level of urls will be determined by the user).
>
> Sounds like you want to save a whole site, not just a Web page....
>
> To walk a site, you might try the Plucker tool. It does a pretty good
> job, if you can still find it.

Yup, this isn't a printing issue, it's a scraping issue. Plucker/plucker-desktop is around in Debian and supports the depth option you want. The GUI is kinda old and clunky but works. On the other hand, wget does this too, but it is less user friendly ;-) I would strongly encourage you NOT to write tools for this, as it can get complex. There are some wget GUI wrappers knocking around (I can't recommend any, though).

The printing piece is another (complex) problem: do you need to flatten (links between) the pages or print them separately, and in what order? I don't have a good answer to that; it is a navigation problem. But pulling down the pages is the first piece.

Chris
On Friday 2012-03-30 12:57 (-0700), Chris Clark
<[hidden email]> wrote:

>> Tony Cappellini <[hidden email]> wrote:
>>> While that will work for the current page, I should have mentioned I'm
>>> looking for a program that should be able to follow urls several levels
>>> (the level of urls will be determined by the user).
>
> ..... wget does this too, but it is less user friendly ;-) I would
> strongly encourage you NOT to write tools for this, as it can get
> complex. There are some wget GUI wrappers knocking around (I can't
> recommend any, though).

I forgot to include an example. Pull down pages recursively (up to a depth of 5), rename links, and stay on the web site (do not follow external links):

    wget -L --recursive --convert-links ...

Also see --restrict-file-names, --no-directories, and --level.
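And if you're tempted to roll your own anyway, here's roughly the minimum you'd end up writing (an untested, stdlib-only toy with no politeness delays and barely any error handling), which is exactly why I suggest wget instead:

    # Toy recursive fetcher, standard library only. Illustrative and
    # untested; wget already does all of this properly.
    import urllib2
    import urlparse
    from HTMLParser import HTMLParser

    class LinkParser(HTMLParser):
        """Collect href targets from anchor tags."""
        def __init__(self):
            HTMLParser.__init__(self)
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.links.append(value)

    def crawl(url, depth, seen):
        if depth < 0 or url in seen:
            return
        seen.add(url)
        try:
            html = urllib2.urlopen(url).read()
        except IOError:
            return
        # Derive a flat filename from the URL path.
        name = urlparse.urlparse(url).path.strip('/').replace('/', '_') or 'index'
        open(name + '.html', 'wb').write(html)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            target = urlparse.urljoin(url, link)
            if target.startswith('http'):   # skip mailto:, javascript:, etc.
                crawl(target, depth - 1, seen)

    crawl('http://dangerousprototypes.com/docs', 2, set())

Chris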
In reply to this post by Tony Cappellini-2
Hi Tony,
I think wget and Print Friendly are great suggestions. I'd add htmldoc [1] and lxml [2], depending on what you want. htmldoc is a nice tool (GUI and command line) to generate PDFs from HTML, and you can use lxml to easily extract the content and remove cruft.

For instance, I don't quite like the way Print Friendly handles images for the Bus Blaster project:

http://dangerousprototypes.com/docs/Bus_Blaster
http://www.printfriendly.com/print/v2?url=http%3A%2F%2Fdangerousprototypes.com%2Fdocs%2FBus_Blaster

For this kind of thing I'd use lxml to extract only the content and remove the things I don't want, such as the table of contents:

https://gist.github.com/2277731
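The gist has the full version; the rough idea is something like this (untested here, and the id "content" and class "toc" are guesses at MediaWiki's markup, not verified against the page):

    # Sketch: keep only the main content of the page and drop its table
    # of contents. "content" and "toc" are guesses at MediaWiki's markup.
    import lxml.html

    doc = lxml.html.parse('http://dangerousprototypes.com/docs/Bus_Blaster').getroot()
    content = doc.get_element_by_id('content')
    for toc in content.find_class('toc'):
        toc.drop_tree()   # remove the TOC box entirely
    open('bus_blaster.html', 'w').write(lxml.html.tostring(content))

The cleaned file can then be fed to htmldoc, e.g. "htmldoc --webpage -f bus_blaster.pdf bus_blaster.html".

[1] http://www.htmldoc.org/
[2] http://lxml.de/

Cheers,

Pedro

--
http://pedrokroger.net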
Thanks to all who replied.