I've been working on http://bugs.jython.org/issue2356 which I'd like to
get in 2.7.1 -- it seems rather poor that Jython simply does not run for
users whose names have an un-American character ;). I know this issue is
not a blocker in most minds.
I've made pretty good progress by allowing file names to be unicode
objects more often than they would be in CPython 2, which usually
returns them as bytes in some encoding that we may not know. I've got
the launcher to work properly, and straightened the logic in our
printing of trace-backs and exceptions from Java. Unicode file names
seem the way to go for Jython because:
1. Java gives us competently decoded unicode file names, from
java.io.File, etc. Re-encoding the result will be a pain (and
2. We appear not to have the codec we need ('mbcs') that CPython
reports on Windows via sys.getfilesystemencoding().
3. We do this already. In 2.7.0, os.getcwd() returns unicode if necessary.
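
To make point 2 concrete, here is a minimal sketch of how one might check whether a codec is actually registered. The helper name codec_available is hypothetical, not anything in Jython, and the result for 'mbcs' depends on the platform and interpreter it runs on.

```python
from __future__ import print_function
import codecs
import sys

def codec_available(name):
    """Return True if the codec registry can find a codec called `name`.

    Hypothetical helper for illustration only.
    """
    if name is None:
        # sys.getfilesystemencoding() may return None on some platforms.
        return False
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

fs_encoding = sys.getfilesystemencoding()
print('file system encoding:', fs_encoding, codec_available(fs_encoding))
print('mbcs available:', codec_available('mbcs'))
```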
Most regression tests pass. However, I'm struggling with test_doctest.
Problems arise when mixing unicode and bytes where any byte is 128 or
over. This happens in ''.join(list) and formatted output like "%s %s" %
(ustr, bstr). The behaviour of these is identical to CPython's: they
raise UnicodeDecodeError because the bytes are promoted to characters
with a strict ascii interpretation. This happens a lot in doctest.py and
traceback.py, for example, where file paths, and stack dumps that
include them, are now frequently unicode, while other inputs are byte
data containing file paths presented in the console encoding.
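
A rough sketch of the failure, written so that it behaves the same on Python 2 and 3: the strict ascii promotion that Python 2 performs implicitly is spelled out as an explicit decode. The names promote and join_mixed, the sample path, and the choice of cp1252 as the console encoding are all assumptions for illustration.

```python
def promote(value, encoding='ascii'):
    """Mimic Python 2's promotion of bytes to unicode when the two are mixed.

    With the default strict ascii codec, any byte >= 128 raises
    UnicodeDecodeError.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def join_mixed(parts, encoding='ascii'):
    """Join a mix of unicode and byte strings, promoting bytes first."""
    return u' '.join(promote(p, encoding) for p in parts)

ustr = u'C:\\Users\\Zo\xeb\\test.py'   # path as Java would give it to us
bstr = u'Zo\xeb'.encode('cp1252')      # same name as console-encoded bytes

try:
    join_mixed([ustr, bstr])
    failed = False
except UnicodeDecodeError:
    failed = True                      # byte 0xEB is not ascii

# Decoding with the (assumed) console encoding avoids the error.
fixed = join_mixed([ustr, bstr], encoding='cp1252')
```

The second call is roughly what customising the stdlib modules amounts to: decoding byte data with a known encoding before it meets a unicode path.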
I can beat this into submission with enough customisation of the stdlib
modules, but that always makes me uncomfortable. I usually see that as a
hint that user code might also need to change. This may be unfounded. I
can probably ensure no impact on users with only ascii paths, and the
others are currently unable to run Jython at all (in the scope of this
issue).
However, I'm seriously wondering if I should pursue the approach where
file names from Java are re-encoded to bytes (maybe as utf-8
everywhere), but that's grim.
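
For comparison, a minimal sketch of the re-encoding alternative. The helper encode_path is hypothetical, and the cp1252 decode stands in for whatever encoding a consumer of the bytes might assume. It shows one possible hazard of utf-8 everywhere: the bytes round-trip cleanly only if every reader also assumes utf-8.

```python
def encode_path(path, encoding='utf-8'):
    """Return `path` as bytes, encoding unicode with the given codec.

    Hypothetical helper for illustration only.
    """
    if isinstance(path, bytes):
        return path
    return path.encode(encoding)

name = u'Zo\xeb'
as_bytes = encode_path(name)             # two bytes for the last character
round_trip = as_bytes.decode('utf-8')    # recovers the original name
misread = as_bytes.decode('cp1252')      # a reader assuming cp1252 gets mojibake
```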