When it comes to decoding HTML, common libraries such as requests and
BeautifulSoup do not make the best possible use of the information available
from both the raw data and the HTTP Content-Type header. Combining the two to
get the best possible result is surprisingly far from straightforward.
BeautifulSoup comes closest with its UnicodeDammit utility, but it offers no
direct support for taking advantage of the Content-Type header. Furthermore,
using the Content-Type header while still preferring a charset declaration
embedded in the document, when one exists, requires a fair amount of hacking.
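The manual approach alluded to here can be sketched with the standard library
alone. This is only an illustration of the problem, not part of any library:
the helper names are made up, and the crude regex is far less robust than what
UnicodeDammit actually does internally.

```python
import re
from email.message import Message

def charset_from_content_type(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["content-type"] = content_type
    return msg.get_content_charset()  # e.g. 'iso-8859-1', or None

# Crude check for an embedded declaration such as <meta charset="utf-8"> or
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
# A real implementation must handle far more cases than this regex does.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def decode_html(raw, content_type=None):
    """Prefer an embedded charset declaration; fall back to the header."""
    match = META_CHARSET.search(raw[:1024])
    if match:
        encoding = match.group(1).decode("ascii")
    else:
        encoding = (content_type and charset_from_content_type(content_type)) or "utf-8"
    return raw.decode(encoding, errors="replace")
```

Even this simplified sketch has to juggle two sources of truth and a fallback;
handling invalid declarations, BOMs, and detection heuristics on top of it is
exactly the hacking referred to above.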
htmldammit takes care of all of this. Under the hood it uses UnicodeDammit in
a non-trivial manner, and it supplies simple functions to extract all of the
needed data from urlopen() and requests.get() response objects and return
Unicode HTML.
Additionally, a response hook for requests and an opener for urllib are
supplied to make consistent use simpler.