ANN: htmldammit 0.1


Tal Einat
I'm happy to announce htmldammit, a library for decoding binary HTML data
into Unicode in the best possible way.

Suggestions and comments are most welcome!

Installation: `pip install htmldammit`

Source: GitHub


When it comes to decoding HTML, common libraries such as requests and
BeautifulSoup do not make the best possible use of the information available
in both the raw data and the HTTP Content-Type header. Combining the two to
get the best possible result is surprisingly far from straightforward.

BeautifulSoup comes closest with its UnicodeDammit utility, but it gives no
direct support for taking advantage of Content-Type header information.
Furthermore, using the Content-Type header while still preferring charset
declarations embedded in the document, when those exist, requires quite a
bit of hacking.
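
To illustrate the precedence described above (an embedded charset declaration first, then the Content-Type header, then a fallback), here is a minimal standard-library sketch. It is not htmldammit's implementation and it does not use UnicodeDammit. The names `charset_from_header`, `charset_from_meta` and `decode_html` are hypothetical, and the regex-based meta sniffing is a deliberate simplification of real HTML encoding detection.

```python
import codecs
import re
from email.message import Message

# Simplified sniffing: finds "charset=..." inside a <meta> tag, which covers
# both <meta charset="..."> and the http-equiv="Content-Type" form.
_META_CHARSET_RE = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def _usable(name):
    """Return name if Python has a codec for it, otherwise None."""
    if not name:
        return None
    try:
        codecs.lookup(name)
    except LookupError:
        return None
    return name


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return _usable(msg.get_content_charset())


def charset_from_meta(data):
    """Look for a charset declaration in the first 2048 bytes of the document."""
    match = _META_CHARSET_RE.search(data[:2048])
    if not match:
        return None
    return _usable(match.group(1).decode("ascii"))


def decode_html(data, content_type=None):
    """Decode HTML bytes, preferring the embedded declaration over the header."""
    candidates = [
        charset_from_meta(data),
        charset_from_header(content_type),
        "utf-8",
    ]
    for encoding in candidates:
        if encoding is None:
            continue
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort (an assumption of this sketch): never raise, replace bad bytes.
    return data.decode("windows-1252", errors="replace")
```

For example, `decode_html(raw_bytes, "text/html; charset=utf-8")` would decode using a `<meta charset=...>` declaration if one is found and decodes cleanly, and would only then fall back to the header's charset.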

htmldammit takes care of all of this. It uses UnicodeDammit in a
non-trivial manner under the hood. It supplies simple functions that extract
all of the needed data from urlopen() and requests.get() response objects
and return Unicode HTML.
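
For readers wondering what "the needed data" looks like in practice, the following rough sketch pulls the raw bytes and the Content-Type header value out of either kind of response object. The function name `raw_html_and_content_type` is hypothetical and is not part of htmldammit's API. The sketch relies only on attributes that both response types are known to expose: `.headers` on both, `.content` on a requests response, and `.read()` on a urlopen() response.

```python
def raw_html_and_content_type(response):
    """Return (raw_bytes, content_type_header) for a response object.

    Works by duck typing: requests responses expose the body as `.content`,
    while urlopen() responses expose it through `.read()`.
    """
    content_type = response.headers.get("Content-Type")
    if hasattr(response, "content"):
        data = response.content  # requests.get(...) response
    else:
        data = response.read()   # urllib.request.urlopen(...) response
    return data, content_type
```

The returned pair is exactly what a decoding step needs: undecoded bytes plus whatever charset hint the server sent, which may be `None` if the header is absent.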

Additionally, a request hook for requests and an opener for urllib are
supplied to make it simpler to use this consistently.

- Tal Einat
