Handling unwanted Unicode \u2019 characters in XML

16 messages

Handling unwanted Unicode \u2019 characters in XML

Stephen McInerney
Here's one for the XML people,

I am using XML imported from FrameMaker, which contains the unwanted Unicode
character '\u2019' (the character started out as a plain apostrophe in the source Frame document.)
It seems this is a common issue with many word-processors (MS, Frame etc.)
using the funky right- and left- leaning apostrophes. I see many references to this issue on the web.

You can't print Unicode strings as is, it causes an exception, you must encode them (to ASCII).
But the ASCII encoding of \u2019 is not very human-readable or useful:
>>> u'\u2019'.encode('utf-8')
'\xe2\x80\x99'

Hence I thought I should do a find or replace with a regex to map the unwanted \u2019 back to plain old apostrophe.
(You can do Unicode regexes with re.compile(<pattern>, re.UNICODE))

But then I thought:
In the interest of preventing exceptions by making sure all Unicode characters are either mapped to ASCII
or removed, it seems like I really want a Unicode version of string.maketrans() and string.translate(), which is deprecated.
Can anyone tell me what that equivalent is, for Unicode fns?

Thanks,
Stephen





_______________________________________________
Baypiggies mailing list
[hidden email]
To change your subscription options or unsubscribe:
http://mail.python.org/mailman/listinfo/baypiggies

Re: Handling unwanted Unicode \u2019 characters in XML

Chris Rebert
On Tue, Jul 1, 2008 at 3:36 PM, Stephen McInerney
<[hidden email]> wrote:

> Here's one for the XML people,
>
> I am using XML imported from FrameMaker, which contains the unwanted Unicode
> character '\u2019' (the character started out as a plain apostrophe in the
> source Frame document.)
> It seems this is a common issue with many word-processors (MS, Frame etc.)
> using the funky right- and left- leaning apostrophes. I see many references
> to this issue on the web.
>
> You can't print Unicode strings as is, it causes an exception, you must
> encode them (to ASCII).
> But the ASCII encoding of \u2019 is not very human-readable or useful:
>>>> u'\u2019'.encode('utf-8')
> '\xe2\x80\x99'

That's UTF-8, not ASCII (there's a big difference), and you're seeing
the repr() of the encoded string, which is of course an ugly escape
sequence.
If instead you print the encoded string, you get:

>>> print u'\u2019'.encode('utf-8')
'

Which is perfectly sensible. Same for other unicode chars.
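
To make the repr-versus-print distinction concrete, here is a minimal sketch. It is written in Python 3 syntax (an assumption on my part; the thread itself uses 2.5.2), where text is Unicode by default and encoded data is a separate bytes type. The variable names are only illustrative.

```python
# Python 3 sketch: the same character, seen three ways.
curly = "\u2019"                      # RIGHT SINGLE QUOTATION MARK

utf8_bytes = curly.encode("utf-8")    # three bytes: e2 80 99
print(repr(utf8_bytes))               # escaped form, like the '\xe2\x80\x99' above

# Strict ASCII has no slot for U+2019, so encoding to it fails.
try:
    curly.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding the UTF-8 bytes gives the original character back.
round_trip = utf8_bytes.decode("utf-8")
```

The escaped form is only how the interpreter displays the bytes; the data itself is intact, as the round trip shows.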

Are you really sure you need this to be ASCII and not UTF-8? If so,
why do you need it to be true ASCII?

- Chris

>
> Hence I thought I should do a find or replace with a regex to map the
> unwanted \u2019 back to plain old apostrophe.
> (You can do Unicode regexes with re.compile(<pattern>, re.UNICODE))
>
> But then I thought:
> In the interest of preventing exceptions by making sure all Unicode
> characters are either mapped to ASCII
> or removed, it seems like I really want a Unicode version of
> string.maketrans() and string.translate(), which is deprecated.
> Can anyone tell me what that equivalent is, for Unicode fns?
>
> Thanks,
> Stephen

Re: Handling unwanted Unicode \u2019 characters in XML

Andy Wiggin
In reply to this post by Stephen McInerney
I don't know the answer to your specific question, but I did read a
good article a while back that described doing something similar,
involving processing XML, Unicode and those fancy double quotes. If
you're interested:

  http://www.linuxjournal.com/article/9319

-Andy

Re: Handling unwanted Unicode \u2019 characters in XML

Stephen McInerney
In reply to this post by Chris Rebert
Hi Chris,

> Are you really sure you need this to be ASCII and not UTF-8? If so,
> why do you need it to be true ASCII?

I want it to be ASCII so I can print it, and do regex matching.
Unless I need to move with the times, and start doing Unicode regexes as default.
But I'm using 2.5.2 so I'd really prefer to keep everything in ASCII-land.
It's a pain when you're debugging and print keeps throwing exceptions.
And on this case, the apostrophe was not Unicode to start with.

> > But the ASCII encoding of \u2019 is not very human-readable or useful:
> >>>> u'\u2019'.encode('utf-8')
> > '\xe2\x80\x99'
>
> That's UTF-8, not ASCII (there's a big difference), and you're seeing
> the repr() of the encoded string, which is of course an ugly escape
> sequence.
> If instead you print the encoded string, you get:
>
> >>> print u'\u2019'.encode('utf-8')
> '

I don't get that, I get this: 'â' (does it depend on C locale settings? if so, that's not very satisfactory at all):
>>> print u'\u2019'.encode('utf-8')
â

Thanks,
Stephen



Re: Handling unwanted Unicode \u2019 characters in XML

Matt Good-3
In reply to this post by Stephen McInerney
On Jul 1, 2008, at 3:36 PM, Stephen McInerney wrote:

> it seems like I really want a Unicode version of string.maketrans()  
> and string.translate(), which is deprecated.

No, neither of those are deprecated.  The only deprecated functions in  
the string module are the ones listed here:
http://docs.python.org/lib/node42.html

-- Matt

Re: Handling unwanted Unicode \u2019 characters in XML

Stephen McInerney
Matt,

> > it seems like I really want a Unicode version of string.maketrans()
> > and string.translate(), which is deprecated.
>
> No, neither of those are deprecated. The only deprecated functions in
> the string module are the ones listed here:
> http://docs.python.org/lib/node42.html

Check that URL again: string.translate() IS deprecated, but string.maketrans() is not.
unicode.translate() is not deprecated.
However, unicode.translate() will not take the optional third argument 'deletechars'
which string.translate() did.
Some people have called for it to be added for backwards compatibility.

So I can't see where to get the functionality I want.
For now, to get me unstuck, I wrote a Unicode regex search-and-replace and I
just iterate that over the entire input XML tree. Crude but gets me out of jail for now.

By the way, the XML is coming in via ElementTree's parse() method. I see some references
in Unicode tutorials to creating a custom codec in order to get the translate()
functionality, but ET doesn't have any hook for supporting that.

(PS Thanks for your article, but it seemed to be about converting from ASCII apostrophes
to Unicode ones, not the reverse, which is more tricky.)

Regards,
Stephen



Re: Handling unwanted Unicode \u2019 characters in XML

Chad Netzer
In reply to this post by Stephen McInerney
On Tue, Jul 1, 2008 at 4:24 PM, Stephen McInerney
<[hidden email]> wrote:

> I don't get that, I get this: 'â' (does it depend on C locale settings? if
> so, that's not very satisfactory at all):
>>>> print u'\u2019'.encode('utf-8')
> â

Hmmm, what are the results of this set of commands?

$ python
Python 2.5.2 (r252:60911, Apr 21 2008, 11:12:42)
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF8')
>>> print u'\u2019'
'
>>> print u'\u00E2'
â

Re: Handling unwanted Unicode \u2019 characters in XML

Terry Carroll
In reply to this post by Stephen McInerney
Sorry, meant to send this to the list....


On Tue, 1 Jul 2008, Stephen McInerney wrote:

> Check that URL again: string.translate() IS deprecated, but
> string.maketrans() is not. unicode.translate() is not deprecated.

But can you set up the translate table, though?


>>> import string
>>> trantab = string.maketrans(u"u\2019", u"'")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\x81' in
position 1: ordinal not in range(128)


I also note that the docs for the translate() string method suggest:

    Note, a more flexible approach is to create a custom character mapping
    codec using the codecs module (see encodings.cp1251 for an example).

But reading the codecs docs raised more questions for me than they
answered; it certainly isn't as straightforward as the ascii translation
was.



Re: Handling unwanted Unicode \u2019 characters in XML

Terry Carroll
On Tue, 1 Jul 2008, Terry Carroll wrote:

> On Tue, 1 Jul 2008, Stephen McInerney wrote:
>
> > Check that URL again: string.translate() IS deprecated, but
> > string.maketrans() is not. unicode.translate() is not deprecated.
>
> But can you set up the translate table, though?

Ah, here's how it works:

>>> d = u"doesn\u2019t"    # "doesn't", with a curly-quote
>>> trtab={0x2019:u"'"}    # map codepoint 2019 to the "'" character
>>> d.translate(trtab)
u"doesn't"
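
Building on the table above, here is a slightly larger sketch. The extra code points and the name PUNCT_TO_ASCII are my own illustration, not something from the thread. Mapping a code point to None deletes it, which covers what the deletechars argument of string.translate() did. It is written in Python 3 syntax; the same kind of dict should work with unicode.translate() on 2.x.

```python
# Hypothetical table: keys are code points, values are replacements.
PUNCT_TO_ASCII = {
    0x2018: "'",    # LEFT SINGLE QUOTATION MARK
    0x2019: "'",    # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',    # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',    # RIGHT DOUBLE QUOTATION MARK
    0x2013: "-",    # EN DASH
    0x2014: "--",   # EM DASH (a value may be longer than one character)
    0x00AD: None,   # SOFT HYPHEN: None means "delete"
}

text = "\u201cdoesn\u2019t\u201d \u2013 co\u00adoperate"
cleaned = text.translate(PUNCT_TO_ASCII)
print(cleaned)
```

Characters that are not in the table pass through unchanged, so the result is only guaranteed to be ASCII if every non-ASCII character in the input has an entry.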




Re: Handling unwanted Unicode \u2019 characters in XML

Max Slimmer
In reply to this post by Stephen McInerney
Stephen McInerney wrote:

> Here's one for the XML people,
>
> I am using XML imported from FrameMaker, which contains the unwanted
> Unicode
> character '\u2019' (the character started out as a plain apostrophe in
> the source Frame document.)
> It seems this is a common issue with many word-processors (MS, Frame etc.)
> using the funky right- and left- leaning apostrophes. I see many
> references to this issue on the web.
>
> You can't print Unicode strings as is, it causes an exception, you
> must encode them (to ASCII).
> But the ASCII encoding of \u2019 is not very human-readable or useful:
> >>> u'\u2019'.encode('utf-8')
> '\xe2\x80\x99'
>
> Hence I thought I should do a find or replace with a regex to map the
> unwanted \u2019 back to plain old apostrophe.
> (You can do Unicode regexes with re.compile(<pattern>, re.UNICODE))
>
> But then I thought:
> In the interest of preventing exceptions by making sure all Unicode
> characters are either mapped to ASCII
> or removed, it seems like I really want a Unicode version of
> string.maketrans() and string.translate(), which is deprecated.
> Can anyone tell me what that equivalent is, for Unicode fns?
>
> Thanks,
> Stephen
>
I do processing of xml data and often it contains these interesting
unicode chars along with a few other non-ascii chars. The interesting
thing is that there is a small subset of non-ascii chars that don't map
one to one, i.e. byte value to the same unicode ordinal value. I
will include a list of these with their names and you can make up a
small dict to map them to whatever you would like.

but if you want to convert these unicode to string or vice versa look at
the following:
 >>> u = u'a\u2019b'
 >>> a = 'a\x92b'
 >>> u
u'a\u2019b'
 >>> a
'a\x92b'
 >>> u.encode('cp1252')
'a\x92b'
 >>> a.decode('cp1252')
u'a\u2019b'
 >>>
the coolest way to convert these (I think) is the following:
import re
nonAscii = re.compile('[^\x01-\x7f]')

def escapeCP1252(s):
    # replace every non-ascii char with an html &#nn; escape
    return nonAscii.sub(_esccp1252, s)
def _esccp1252(m):
    return "&#%d;" % ord(_cp1252(m))
def _cp1252(m):
    # remap chars listed in the cp1252 dict below, leave the rest alone
    c = unichr(ord(m.group(0)))
    return cp1252.get(c, c)

the above simply finds all non-ascii chars and returns an html &#nn;
escape string. If you want to return a quote char instead of u'\u2019',
then have the routine return cp1252.get(ord(m.group(0))) after modifying the
dict cp1252 to your taste.
cp1252 = {
    # from http://www.microsoft.com/typography/unicode/1252.htm
    u"\x80": u"\u20AC", # EURO SIGN
    u"\x82": u"\u201A", # SINGLE LOW-9 QUOTATION MARK
    u"\x83": u"\u0192", # LATIN SMALL LETTER F WITH HOOK
    u"\x84": u"\u201E", # DOUBLE LOW-9 QUOTATION MARK
    u"\x85": u"\u2026", # HORIZONTAL ELLIPSIS
    u"\x86": u"\u2020", # DAGGER
    u"\x87": u"\u2021", # DOUBLE DAGGER
    u"\x88": u"\u02C6", # MODIFIER LETTER CIRCUMFLEX ACCENT
    u"\x89": u"\u2030", # PER MILLE SIGN
    u"\x8A": u"\u0160", # LATIN CAPITAL LETTER S WITH CARON
    u"\x8B": u"\u2039", # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    u"\x8C": u"\u0152", # LATIN CAPITAL LIGATURE OE
    u"\x8E": u"\u017D", # LATIN CAPITAL LETTER Z WITH CARON
    u"\x91": u"\u2018", # LEFT SINGLE QUOTATION MARK
    u"\x92": u"\u2019", # RIGHT SINGLE QUOTATION MARK
    u"\x93": u"\u201C", # LEFT DOUBLE QUOTATION MARK
    u"\x94": u"\u201D", # RIGHT DOUBLE QUOTATION MARK
    u"\x95": u"\u2022", # BULLET
    u"\x96": u"\u2013", # EN DASH
    u"\x97": u"\u2014", # EM DASH
    u"\x98": u"\u02DC", # SMALL TILDE
    u"\x99": u"\u2122", # TRADE MARK SIGN
    u"\x9A": u"\u0161", # LATIN SMALL LETTER S WITH CARON
    u"\x9B": u"\u203A", # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    u"\x9C": u"\u0153", # LATIN SMALL LIGATURE OE
    u"\x9E": u"\u017E", # LATIN SMALL LETTER Z WITH CARON
    u"\x9F": u"\u0178", # LATIN CAPITAL LETTER Y WITH DIAERESIS
}
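
If the goal is only the &#nn; escaping described above, the standard codec error handler 'xmlcharrefreplace' may be a shorter route. This is a sketch in Python 3 syntax with a made-up sample string. It does not apply the cp1252 remapping, so it assumes the text was decoded correctly in the first place.

```python
text = "doesn\u2019t cost \u20ac5"

# Every character outside ASCII becomes a decimal character reference.
escaped = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)
```

The references use the Unicode code point in decimal, so U+2019 comes out as &#8217; and can be dropped straight into XML or HTML.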






Re: Handling unwanted Unicode \u2019 characters in XML

Stephen McInerney
In reply to this post by Chad Netzer

> > I don't get that, I get this: 'â' (does it depend on C locale settings? if
> > so, that's not very satisfactory at all):
> >>>> print u'\u2019'.encode('utf-8')
> > â
>
> Hmmm, what are the results of this set of commands?
>
> $ python
> Python 2.5.2 (r252:60911, Apr 21 2008, 11:12:42)
> >>> import locale
> >>> locale.getdefaultlocale()
> ('en_US', 'UTF8')
> >>> print u'\u2019'
> '
> >>> print u'\u00E2'
> â

For me, it's:
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'ISO8859-1')

But should I be changing setdefaultlocale() ?




Re: Handling unwanted Unicode \u2019 characters in XML

Chad Netzer
On Tue, Jul 1, 2008 at 6:37 PM, Stephen McInerney
<[hidden email]> wrote:

> For me, it's:
>>>> import locale
>>>> locale.getdefaultlocale()
> ('en_US', 'ISO8859-1')
>
> But should I be changing setdefaultlocale() ?

You need to execute all the statements.  I'm having difficulty
understanding how the unicode literal U+2019 can map to U+00E2 like
you say.

Execute all these statements with cut-n-paste and give us the results:

a = u'\u2019'
b = u'\u00E2'
print a
print b
print a.encode('utf-8')
print b.encode('utf-8')
ord(a)
ord(b)
unichr(ord(a))
unichr(ord(b))
import sys
sys.maxunicode
sys.byteorder


It might be something trivial that I'm overlooking...  Also, you
mentioned an exception when trying to print the literal?  I assume it
was a UnicodeEncodeError?  I'd like to see what it was, in any case.

Also, Windows, I assume (since it's ISO8859-1)?  Could it somehow be
related to this?:

http://en.wikipedia.org/wiki/ISO_8859-1#The_ISO-8859-1.2FWindows-1252_mixup

C

Re: Handling unwanted Unicode \u2019 characters in XML

Shannon -jj Behrens
In reply to this post by Stephen McInerney
On Tue, Jul 1, 2008 at 3:36 PM, Stephen McInerney
<[hidden email]> wrote:

> Here's one for the XML people,
>
> I am using XML imported from FrameMaker, which contains the unwanted Unicode
> character '\u2019' (the character started out as a plain apostrophe in the
> source Frame document.)
> It seems this is a common issue with many word-processors (MS, Frame etc.)
> using the funky right- and left- leaning apostrophes. I see many references
> to this issue on the web.
>
> You can't print Unicode strings as is, it causes an exception, you must
> encode them (to ASCII).
> But the ASCII encoding of \u2019 is not very human-readable or useful:
>>>> u'\u2019'.encode('utf-8')
> '\xe2\x80\x99'
>
> Hence I thought I should do a find or replace with a regex to map the
> unwanted \u2019 back to plain old apostrophe.
> (You can do Unicode regexes with re.compile(<pattern>, re.UNICODE))
>
> But then I thought:
> In the interest of preventing exceptions by making sure all Unicode
> characters are either mapped to ASCII
> or removed, it seems like I really want a Unicode version of
> string.maketrans() and string.translate(), which is deprecated.
> Can anyone tell me what that equivalent is, for Unicode fns?

That reminds me of this:

latin1_to_ascii -- The UNICODE Hammer -- AKA "The Stupid American"
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871

Now is probably as good a time as any to learn about Unicode.  Here's
an easy start:
http://wiki.pylonshq.com/display/pylonsdocs/Unicode

Happy Hacking!
-jj

--
It's a walled garden, but the flowers sure are lovely!
http://jjinux.blogspot.com/

Re: Handling unwanted Unicode \u2019 characters in XML

Jeff Younker
On Jul 1, 2008, at 10:34 PM, Shannon -jj Behrens wrote:
> That reminds me of this:
>
> latin1_to_ascii -- The UNICODE Hammer -- AKA "The Stupid American"
> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871

That's my favorite recipe ever.  I used it just last week.

-jeff


Re: Handling unwanted Unicode \u2019 characters in XML

Stephen McInerney
In reply to this post by Chad Netzer
Chad,

I'm on Solaris 10. Below are your replies, but it's faster for you to call me.
[to everyone else who sent suggestions like latin1_to_ascii -- The UNICODE Hammer,
I'm reading them too. I'll send out a rollup when I finally figure out the best approach
for my context.]

> You need to execute all the statements. I'm having difficulty
> understanding how the unicode literal U+2019 can map to U+00E2 like you say.

You're making a wrong assumption that 'â' must mean U+00E2, it's just some
non-7-bit character which the shell objects to and mangles.
 
> Execute all these statements with cut-n-paste and give us the results:
>
>>> a = u'\u2019'
>>> b = u'\u00E2'
>>> print a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2019' in position 0: ordinal not in range(256)
>>> print b
â
>>> print a.encode('utf-8')
â€™
>>> print b.encode('utf-8')
â
>>> ord(a)
8217
>>> ord(b)
226
>>> unichr(ord(a))
u'\u2019'
>>> unichr(ord(b))
u'\xe2'
>>> import sys
>>> sys.maxunicode
65535
>>> sys.byteorder
'big'

> It might be something trivial that I'm overlooking... Also, you
> mentioned an exception when trying to print the literal? I assume it
> was a UnicodeEncodeError? I'd like to see what it was, in any case.

Yes, it was the usual culprit that thousands are plagued by:
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2019' in position 53: ordinal not in range(256)
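
For debugging output, one option is to encode explicitly with an error handler so that printing never raises. This is a sketch in Python 3 syntax; 'latin-1' stands in for whatever encoding the terminal reports, and the sample string is made up.

```python
text = "doesn\u2019t"

# Strict encoding reproduces the failure from the traceback above.
try:
    text.encode("latin-1")
    failed = False
except UnicodeEncodeError:
    failed = True

# 'replace' substitutes '?', 'backslashreplace' keeps the code point visible.
with_marks = text.encode("latin-1", "replace")
with_escapes = text.encode("latin-1", "backslashreplace")
print(with_marks, with_escapes)
```

Both handlers lose or disguise the original character, so this is better suited to log and debug output than to data that will be written back to the XML.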

Regards,
Stephen



Re: Handling unwanted Unicode \u2019 characters in XML

Chad Netzer
On Thu, Jul 3, 2008 at 2:08 PM, Stephen McInerney
<[hidden email]> wrote:
> Chad,
>
> I'm on Solaris 10.

>> You need to execute all the statements. I'm having difficulty
>> understanding how the unicode literal U+2019 can map to U+00E2 like you
>> say.
>
> You're making a wrong assumption that 'â' must mean U+00E2, it's just some
> non-7-bit character which the shell objects to and mangles.

Ah, it all makes sense to me now.  Your terminal is using latin-1
encoding, and when you explicitly encode the character u'\u2019', to
utf-8, you get the three byte string '\xe2\x80\x99', the first byte of
which is circumflex 'a'.

The weird part is that in your first message, when you printed
u'\u2019'.encode('utf-8'), you said you got the circumflex 'a'
character (latin-1 0xE2), but your latest message indicates you
sometimes get circumflex 'a' followed by two more characters (Euro,
and Trademark), which makes more sense.  Hmmm... Those look like they
are actually Windows-1252 character values (0x80 and 0x99):

http://en.wikipedia.org/wiki/Windows-1252
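
The mix-up described above can be reproduced without a terminal. A sketch in Python 3 syntax: encode the character as UTF-8, then decode those bytes with the wrong codec, the way a mis-configured display would.

```python
utf8_bytes = "\u2019".encode("utf-8")        # b'\xe2\x80\x99'

# Read as Windows-1252, each byte becomes its own character:
# 0xE2 is a-circumflex, 0x80 is EURO SIGN, 0x99 is TRADE MARK SIGN.
as_cp1252 = utf8_bytes.decode("cp1252")

# Strict ISO-8859-1 maps 0x80 and 0x99 to C1 control codes, which are
# usually invisible, so only the a-circumflex is likely to show up.
as_latin1 = utf8_bytes.decode("latin-1")

print(ascii(as_cp1252))
print([hex(ord(ch)) for ch in as_latin1])
```

That difference between the two codecs is a plausible reason for seeing a lone circumflex 'a' in one place and three characters in another.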

In any case, it sounds like you need a more voracious "Unicode
HAMMER", which would convert the unicode RIGHT SINGLE QUOTATION MARK
into ascii APOSTROPHE (among other translational abominations), but a
simple unicode replace() might work.

i.e.
>>> a = u'\u2019'
>>> a
u'\u2019'
>>> a.replace(u'\u2019', u'\u0027')
u"'"   # Uhhh, that's a single apostrophe in there...

Obviously the above could be done more intelligently by matching left
quotations, etc., but it's a quick and dirty kludge for now.
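
One way to build the "more voracious" version, sketched in Python 3 syntax. The function name to_ascii and the contents of the table are illustrative assumptions, and this is not the code from the cookbook recipe mentioned earlier in the thread. It first maps typographic quotes by hand, then uses unicodedata.normalize() to split off accents, and finally drops anything still outside ASCII.

```python
import unicodedata

# Illustrative table; extend it to taste.
QUOTES = {
    0x2018: "'", 0x2019: "'",     # single curly quotes
    0x201C: '"', 0x201D: '"',     # double curly quotes
}

def to_ascii(text):
    """Lossy conversion of text to plain ASCII (hypothetical helper)."""
    text = text.translate(QUOTES)
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with 'ignore' silently drops the marks and any other
    # character that has no ASCII equivalent.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("\u201cCaf\u00e9\u201d doesn\u2019t"))
```

The quotes need the explicit table because normalization alone leaves U+2019 untouched; without it the 'ignore' step would simply delete the apostrophe.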

C