[Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position

Robert Sjoblom
Okay, so here's a fun one. Since I'm on a Japanese locale my native
encoding is cp932. I was thinking of writing a parser for a bunch of
text files, but I stumbled on even printing the contents due to ...
something. I don't know what encoding the text file uses, which isn't
helping my case either (I have asked, but I've yet to get an answer).

Okay, so:

address = "C:/Path/to/file/file.ext"
with open(address, encoding="cp1252") as alpha:
    text = alpha.readlines()
    for line in text:
        print(line)

It starts to print until it hits the wonderful character é or '\xe9',
where it gives me this happy traceback:
Traceback (most recent call last):
  File "C:\Users\Azaz\Desktop\CK2 Map Painter\Parser\test parser.py",
line 8, in <module>
    print(line)
UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in
position 13: illegal multibyte sequence

I can open the document and view it in UltraEdit -- and it displays
correct characters there -- but UE can't give me what encoding it
uses. Any chance of solving this without having to switch from my
Japanese locale? Also, cp1252 is just an educated guess, but it
doesn't really matter because it always comes back to the cp932 error.
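
Since the file's encoding is unknown, one rough way to narrow it down is to read the raw bytes and try a few candidate codecs in order. This is only a sketch: the helper name `guess_encoding` and the candidate list are assumptions, and a clean decode only shows that the bytes are legal in that codec, not that the guess is right.

```python
def guess_encoding(data, candidates=("utf-8", "cp1252")):
    """Return the first candidate codec that decodes data without error.

    Hypothetical helper: a clean decode only means the bytes are legal
    in that codec, not that the text is what the author intended.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Usage sketch: read the file as bytes first, then test the candidates.
# with open(address, "rb") as raw:
#     print(guess_encoding(raw.read()))
```

Trying UTF-8 first is deliberate: it rejects most input that is not UTF-8, whereas single-byte codecs such as cp1252 accept almost any byte sequence.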

--
best regards,
Robert S.
_______________________________________________
Tutor maillist  -  [hidden email]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position

Dave Angel-3
On 03/10/2012 06:38 PM, Robert Sjoblom wrote:

> Okay, so here's a fun one. Since I'm on a japanese locale my native
> encoding is cp932. I was thinking of writing a parser for a bunch of
> text files, but I stumbled on even printing the contents due to ...
> something. I don't know what encoding the text file uses, which isn't
> helping my case either (I have asked, but I've yet to get an answer).
>
> Okay, so:
>
> address = "C:/Path/to/file/file.ext"
> with open(address, encoding="cp1252") as alpha:
>      text = alpha.readlines()
>      for line in text:
>          print(line)
>
> It starts to print until it hits the wonderful character é or '\xe9',
> where it gives me this happy traceback:
> Traceback (most recent call last):
>    File "C:\Users\Azaz\Desktop\CK2 Map Painter\Parser\test parser.py",
> line 8, in <module>
>      print(line)
> UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in
> position 13: illegal multibyte sequence
>
> I can open the document and view it in UltraEdit -- and it displays
> correct characters there -- but UE can't give me what encoding it
> uses. Any chance of solving this without having to switch from my
> japanese locale? Also, the cp1252 is just an educated guess, but it
> doesn't really matter because it always comes back to the cp932 error.
>

There are just 256 possible characters in cp1252, and 256 in cp932.  So
you should expect to see this error if your input file is
unconstrained.  And since you don't know what encoding it's in, you
might as well consider it unconstrained.

In other words, there are possible characters in the cp1252 that just
won't display in cp932.

You can "solve" the problem by pretending the input file is also cp932
when you open it. That way you'll get the wrong characters, but no
errors.  Or you can solve it by encoding the output explicitly, telling
it to ignore errors.  I don't know how to do that in Python 3.x.  
Finally, you can change your console to be utf-8, and find a font that
includes both sets of characters.
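
For the second option, one way to encode the output explicitly in Python 3 is to pass an error handler to `str.encode()` and decode the result back before printing. The helper name `printable` is made up for illustration. The "replace" handler substitutes "?" for anything the target codec lacks, and "ignore" drops it instead.

```python
import sys


def printable(text, encoding=None, errors="replace"):
    """Round-trip text through the console codec, degrading what it lacks."""
    # sys.stdout.encoding can be None when output is redirected.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors).decode(encoding)


# The e-acute is not available in cp932, so it is replaced before printing.
print(printable("caf\xe9 au lait", encoding="cp932"))
```

The result is lossy by design: the script keeps running on a cp932 console, at the cost of not showing characters that the console could not have displayed anyway.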


--

DaveA

Re: [Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position

Robert Sjoblom
> You can "solve" the problem by pretending the input file is also cp932 when
> you open it. That way you'll get the wrong characters, but no errors.
So I tried that:
Traceback (most recent call last):
  File "C:\Users\Azaz\Desktop\CK2 Map Painter\Parser\test parser.py",
line 6, in <module>
    text = alpha.readlines()
UnicodeDecodeError: 'cp932' codec can't decode bytes in position
1374-1375: illegal multibyte sequence

> Or
> you can solve it by encoding the output explicitly, telling it to ignore
> errors.  I don't know how to do that in Python 3.x.
Me neither. I will research this tomorrow.

> Finally, you can change
> your console to be utf-8, and find a font that includes both sets of
> characters.
While that might be a tempting solution, it would be best if this
worked without having to do any changes to the environment itself; it
would be best if it could run on any platform, but I'll take a Windows
machine with no changes to command line if I have to.

--
best regards,
Robert S.
Re: [Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position

Steven D'Aprano-8
In reply to this post by Dave Angel-3
On Sat, Mar 10, 2012 at 08:03:18PM -0500, Dave Angel wrote:

> There are just 256 possible characters in cp1252, and 256 in cp932.

CP932 is also known as MS-KANJI or SHIFT-JIS (actually, one of many
variants of SHIFT-JIS). It is a multi-byte encoding, which means it has
far more than 256 characters.

http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml
http://en.wikipedia.org/wiki/Shift_JIS
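
A quick way to see the multi-byte nature of CP932 is to encode a few strings and count the bytes. The sample strings below are illustrative choices, not taken from the original post.

```python
# ASCII stays one byte per character in CP932.
ascii_bytes = "abc".encode("cp932")

# Kanji take two bytes each, so the repertoire is far larger than 256.
kanji_bytes = "日本語".encode("cp932")

print(len(ascii_bytes), len(kanji_bytes))

# A character outside the CP932 repertoire cannot be encoded at all.
try:
    "\xe9".encode("cp932")
    e_acute_ok = True
except UnicodeEncodeError:
    e_acute_ok = False

print("e-acute encodable in cp932:", e_acute_ok)
```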

The actual problem the OP has got is that the *multi-byte* sequence he
is trying to print is illegal when interpreted as CP932. Personally I
think that's a bug in the terminal, or possibly even print, since he's
not printing bytes but characters, but I haven't given that a lot of
thought so I might be way out of line.

The quick and dirty fix is to change the encoding of his terminal, so
that it no longer tries to interpret the characters printed using CP932.
That will also mean he'll no longer see valid Japanese characters.

But since he appears to be using Windows, I don't know if this is
possible, or easy.


[...]
> You can "solve" the problem by pretending the input file is also cp932
> when you open it. That way you'll get the wrong characters, but no
> errors.

Not so -- there are multi-byte sequences that can't be read in CP932.

>>> b"\xe9x".decode("cp932")  # this one works
'騙'
>>> b"\xe9!".decode("cp932")  # this one doesn't
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode bytes in position 0-1:
illegal multibyte sequence

In any case, the error doesn't occur when he reads the data, but when he
prints it. Once the data is read, it is already Unicode text, so he
should be able to print any character. At worst, it will print as a
missing character (a square box or space) rather than the expected
glyph. He shouldn't get a UnicodeDecodeError when printing. I smell a
bug since print shouldn't be decoding anything. (At worst, it needs to
*encode*.)
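
The two directions can be separated in a few lines: reading a file decodes bytes into text, and printing encodes text back into bytes for the console. A minimal sketch using in-memory bytes in place of the real file (the byte string is invented for illustration):

```python
raw = b"caf\xe9"  # made-up sample: "cafe" with e-acute, stored as cp1252 bytes

# Step 1, what open(..., encoding="cp1252") does: bytes -> str.
text = raw.decode("cp1252")

# Step 2, what print() does on a cp932 console: str -> bytes.
failed_step = None
bad_char = None
try:
    text.encode("cp932")
except UnicodeEncodeError as exc:
    failed_step = "encode"
    # exc.object is the string being encoded; start/end mark the culprit.
    bad_char = exc.object[exc.start:exc.end]

print(failed_step, ascii(bad_char))
```

Under these assumptions the read succeeds and only the encode step raises, which matches a traceback that points at print() rather than readlines().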


--
Steven

Re: [Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position

Peter Otten
In reply to this post by Robert Sjoblom
Robert Sjoblom wrote:

> Okay, so here's a fun one. Since I'm on a japanese locale my native
> encoding is cp932. I was thinking of writing a parser for a bunch of
> text files, but I stumbled on even printing the contents due to ...
> something. I don't know what encoding the text file uses, which isn't
> helping my case either (I have asked, but I've yet to get an answer).
>
> Okay, so:
>
> address = "C:/Path/to/file/file.ext"
> with open(address, encoding="cp1252") as alpha:

Superfluous readlines() alert:

>     text = alpha.readlines()
>     for line in text:
>         print(line)

You can iterate over the file directly with

#python3
for line in alpha:
    print(line, end="")

or even

import sys
sys.stdout.writelines(alpha)

> It starts to print until it hits the wonderful character é or '\xe9',
> where it gives me this happy traceback:
> Traceback (most recent call last):
>   File "C:\Users\Azaz\Desktop\CK2 Map Painter\Parser\test parser.py",
> line 8, in <module>
>     print(line)
> UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in
> position 13: illegal multibyte sequence
>
> I can open the document and view it in UltraEdit -- and it displays
> correct characters there -- but UE can't give me what encoding it
> uses. Any chance of solving this without having to switch from my
> japanese locale? Also, the cp1252 is just an educated guess, but it
> doesn't really matter because it always comes back to the cp932 error.

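# python3
import codecs
import sys

filename = "C:/Path/to/file/file.ext"  # path from the original post

# Fall back to UTF-8 when stdout has no encoding (e.g. redirected output).
output_encoding = sys.stdout.encoding or "UTF-8"
error_handling = "replace"
Writer = codecs.getwriter(output_encoding)

# Wrap the raw byte stream so unencodable characters are replaced, not raised.
outstream = Writer(sys.stdout.buffer, error_handling)
with open(filename, "r", encoding="cp1252") as instream:
    for line in instream:
        print(line, end="", file=outstream)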


error_handling = "replace" prints "?" for characters that cannot be
displayed in the target encoding.
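
"replace" is only one of the standard error handlers that the codec machinery accepts. The sketch below compares a few of them on a sample string chosen for illustration, so the trade-offs are visible side by side.

```python
sample = "caf\xe9"  # illustrative input containing one unencodable character

results = {}
for handler in ("replace", "ignore", "backslashreplace", "xmlcharrefreplace"):
    # Each handler decides what to emit in place of the e-acute.
    results[handler] = sample.encode("cp932", errors=handler)
    print(handler, results[handler])
```

"replace" and "ignore" lose information, while "backslashreplace" and "xmlcharrefreplace" keep the code point recoverable at the cost of noisier output.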


Re: [Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position

Peter Otten
In reply to this post by Steven D'Aprano-8
Steven D'Aprano wrote:

> glyph. He shouldn't get a UnicodeDecodeError when printing. I smell a
> bug since print shouldn't be decoding anything. (At worst, it needs to
> *encode*.)

You have correctly derived the actual traceback ;)

[Robert]
> It starts to print until it hits the wonderful character é or '\xe9',
> where it gives me this happy traceback:
> Traceback (most recent call last):
>   File "C:\Users\Azaz\Desktop\CK2 Map Painter\Parser\test parser.py",
> line 8, in <module>
>     print(line)
> UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in
> position 13: illegal multibyte sequence
 
In nuce:

$ PYTHONIOENCODING=cp932 python3 -c 'print("\xe9")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position
0: illegal multibyte sequence

(I have to lie about the encoding; my terminal speaks UTF-8)
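
On Python 3.7 and later (newer than this thread), a text stream's error handler can also be changed in place with its `reconfigure()` method, which avoids wrapping `sys.stdout.buffer` by hand. A hedged sketch: the function name `relax_errors` is hypothetical, and the demo uses an in-memory stream standing in for a cp932 console.

```python
import io
import sys


def relax_errors(stream=None, errors="replace"):
    """Switch a text stream to a lossy error handler, if it supports that."""
    stream = sys.stdout if stream is None else stream
    # reconfigure() exists on io.TextIOWrapper from Python 3.7 on;
    # replaced or captured stdout objects may not have it.
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(errors=errors)
    return stream


# Demo on an in-memory stream standing in for a cp932 console.
fake_console = io.TextIOWrapper(io.BytesIO(), encoding="cp932")
relax_errors(fake_console)
fake_console.write("caf\xe9")
fake_console.flush()
print(fake_console.buffer.getvalue())
```

The same effect is available without touching the script by setting PYTHONIOENCODING to cp932:replace, since the variable accepts an encodingname:errorhandler form.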
