I've just been bitten by what to me looks like a misfeature in Jython. I
have a Python client and a Jython server on the same machine, talking over
a localhost socket. I'm using repr() and eval() as my wire protocol
(because there's no possibility of anyone else being able to connect to
this socket; though I'm sufficiently paranoid that I'm pre-parsing the
repr()'d packets anyway to ensure that they're safe).
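For readers unfamiliar with the setup, a repr()/eval() wire protocol with a safety check can be sketched roughly as follows. This is not the poster's code: the names encode_packet and decode_packet are hypothetical, and it assumes a CPython recent enough to provide ast.literal_eval (2.6 or later), which accepts only literals and so stands in for the hand-written pre-parsing step.

```python
import ast


def encode_packet(obj):
    """Serialise a value built from literals (strings, numbers, lists) to one line."""
    # repr() escapes embedded newlines, so the packet stays on a single line.
    return repr(obj) + "\n"


def decode_packet(line):
    """Parse one received line; anything other than a literal is rejected."""
    # literal_eval raises ValueError or SyntaxError instead of executing code.
    return ast.literal_eval(line.strip())


if __name__ == "__main__":
    packet = encode_packet(["spam", "eggs", 42])
    print(decode_packet(packet))
```

The design choice here is to let the parser refuse function calls and names outright, rather than inspecting the packet text by hand before calling eval().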
My problem is this: when repr()'ing a string, Jython only adds the u'' if
the string contains char values > 255.
So when I send a list of strings over the socket from my Jython server to
my Python client, some of them come out as plain strings and others come
out as unicode strings. This had me swearing for several hours yesterday
(I've calmed down now 8-) because it never occurred to me that a list that
went in one end homogeneous could come out of the other end heterogeneous.
This looks to me like a terrible violation of the Principle of Least
Surprise. Everything works perfectly until a character over 255 comes
along and suddenly your code breaks with an encoding error. It also makes
Jython significantly different from CPython.
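One possible workaround on the receiving side is to coerce every decoded string to a single type. The sketch below is only an illustration: the helper name to_unicode_list is hypothetical, and it assumes plain (byte) strings hold only characters up to 255, so that Latin-1 decoding maps each byte back to the same code point.

```python
def to_unicode_list(items, encoding="latin-1"):
    """Return a list in which every byte string has been decoded to unicode."""
    result = []
    for item in items:
        if isinstance(item, bytes):
            # On Python 2.6+, bytes is an alias for str, so plain strings land here.
            item = item.decode(encoding)
        result.append(item)
    return result


if __name__ == "__main__":
    mixed = [b"plain", u"\u0100 wide"]
    print(to_unicode_list(mixed))
```

With this in place the list is homogeneous again, whichever form each element arrived in.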
Any comments? Can I consider this a bug and enter a bug report / patch?
Or am I simply abusing repr()?
> My problem is this: when repr()'ing a string, Jython only adds the u'' if
> the string contains char values > 255.
I'm afraid this is the behavior for Jython 2.1. Jython 2.2 should
behave better, as long as the strings are initialized as type
unicode. So, in 2.1 u"abc" would translate to "abc", now in 2.2
u"abc" stays u"abc".
The reason for this odd behavior is: in Jython *all* strings are
represented by unicode strings internally, and the Python unicode
support is just a compatibility wrapper. In 2.1 this wrapper
merely consists of a check for character values > 255 in the
string. In 2.2 there is a separate PyUnicode type that
(hopefully) provides better compatibility.
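The 2.1 check described above can be imitated in a few lines of CPython 3. This is only an illustration of the rule as stated in this thread, not Jython's actual code, and the name jython21_style_repr is made up.

```python
def jython21_style_repr(text):
    """Imitate the rule described for Jython 2.1: add u only if a char exceeds 255."""
    # ascii() gives a quoted, backslash-escaped form of the string.
    quoted = ascii(text)
    if any(ord(ch) > 255 for ch in text):
        return "u" + quoted
    return quoted


if __name__ == "__main__":
    for s in ["abc", "caf\xe9", "\u0100bc"]:
        print(jython21_style_repr(s))
```

Under this rule, only the last of the three sample strings gets the u prefix, which is how a homogeneous list can come out heterogeneous on the other end.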
Some day (at least according to
http://www.python.org/peps/pep-3000.html) Python will come around to a
more jythony point of view :) and this problem should go away.
Then again, the time-frame for Python3000 is somewhat undefined last I
heard (much less the time-frame for a presumed Jython3000).
> I'm afraid this is the behavior for Jython 2.1. Jython 2.2 should behave
> better, as long as the strings are initialized as type unicode. So, in
> 2.1 u"abc" would translate to "abc", now in 2.2 u"abc" stays u"abc".
It doesn't seem to work like that in 2.2a1:
Jython 2.2a1 on java1.4.2_08 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> x = u'unicode string'
>>> print repr(x)
and looking at the code for PyString.encode_UnicodeEscape, it's clear that
it only adds the u'' when there's a character > 255 in there. Nothing
seems to be different in CVS HEAD, unless I'm missing something...?
(Besides, my strings are being read over a socket, so they're all created
the same way.)
> The reason for this odd behavior is: in Jython *all* strings are represented
> by unicode strings internally
I know - that's why I was expecting repr() to always put the u'' there.
> In 2.2 there is a separate PyUnicode
> type that (hopefully) provides better compatibility.
How does that square with the fact that all Java strings are Unicode? Is
there any documentation that discusses the relationship between Jython,
Java and Unicode? Some sort of Best Practice document?