Format strings

Josef Spillner-3
Hi,

as pointed out by Adeodato Simó, there exists a discrepancy between the
intuitive and the (C-legacy-based?) actual handling of format strings with
unicode arguments:

http://chistera.yi.org/~adeodato/blog/misc/44_utf8_printf.html

This does indeed seem to be problematic, and I cannot see any legitimate
reason for this. Can this be fixed for the upcoming version please?

Josef
_______________________________________________
I18n-sig mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/i18n-sig

Re: Format strings

M.-A. Lemburg
Josef Spillner wrote:
> Hi,
>
> as pointed out by Adeodato Simó, there exists a discrepancy between the
> intuitive and the (C-legacy-based?) actual handling of format strings with
> unicode arguments:
>
> http://chistera.yi.org/~adeodato/blog/misc/44_utf8_printf.html

I don't see the relationship to Python in that posting...

> This does indeed seem to be problematic, and I cannot see any legitimate
> reason for this. Can this be fixed for the upcoming version please?

Please give an example.

Thanks,
--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Nov 25 2005)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Re: Format strings

Josef Spillner-3
Written on Friday, 25 November 2005 at 19:14:
> I don't see the relationship to Python in that posting...

The following should demonstrate it:

# -*- coding: utf-8 -*-
print "'%2s'" % "a"
print "'%2s'" % "á"
print "'%2s'" % u"á"

In the second case, while the string literal is recognized as utf-8 (thus two
bytes being one character in this case), it uses up the two-character field
width on its own and leaves no room for the padding space.

Note that if the file encoding is not given, then it would display as 'á',
which is correct under the circumstances.

But in general, I don't see why line two in the example above cannot be like
line three. It is not intuitive to only have one character printed as opposed
to the two that are requested from the format string.

Actually, a related question: why are string objects ASCII by default instead
of the encoding specified at the beginning of the file? Are there any plans
to merge the "unicode" string functionality into basic strings?

Josef

Re: Format strings

"Martin v. Löwis"
Josef Spillner wrote:
> # -*- coding: utf-8 -*-
> print "'%2s'" % "a"
> print "'%2s'" % "á"
> print "'%2s'" % u"á"
>
> In the second case, while the string literal is recognized as utf-8 (thus two
> bytes being one character in this case), it eats the two character format
> string alone and doesn't leave any space for the empty character.

This is correct behaviour, and by design.

> Note that if the file encoding is not given, then it would display as 'á',
> which is correct under the circumstances.

It is correct either way. A byte string is a byte string is a byte
string is a string of bytes is not a Unicode string.

The string in the second print statement actually *has* two bytes, so
that it takes two bytes of output is correct.
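
To make the byte count concrete, here is a minimal sketch that is not taken from the thread. It assumes Python 3.5 or later, where `bytes` plays the role of the Python 2 byte string (`str`) and `str` plays the role of `unicode`; the variable names are illustrative only.

```python
# Python 3.5+ sketch: bytes stands in for the Python 2 byte string,
# str for the Python 2 unicode type. All names are illustrative.
text = "\u00e1"              # one character: á
raw = text.encode("utf-8")   # the same character as UTF-8 bytes

# The field width in %2s counts whatever units the type is made of.
padded_text = "'%2s'" % text   # one character, so one space of padding
padded_raw = b"'%2s'" % raw    # already two bytes, so no padding

print(len(text), repr(padded_text))
print(len(raw), repr(padded_raw))
```

The character string is padded to two characters, while the byte string already fills the two-byte field, which matches the behaviour described above.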

Regards,
Martin

Re: Format strings

Josef Spillner-3
Written on Friday, 25 November 2005 at 23:16:
> It is correct either way. A byte string is a byte string is a byte
> string is a string of bytes is not a Unicode string.

That was the second part of my question. If a programmer writes down a string,
and the source file encoding is declared to be utf-8, why then is the string
still not encoded in utf-8 by default?
Why all the hassle of using u"..." instead of making it the default?
There is a lot of python source code I maintain, and it would simplify coding
a lot if this could be made the default.

Josef

Re: Format strings

M.-A. Lemburg
Josef Spillner wrote:
> Written on Friday, 25 November 2005 at 23:16:
>
>>It is correct either way. A byte string is a byte string is a byte
>>string is a string of bytes is not a Unicode string.
>
>
> That was the second part of my question. If a programmer writes down a string,
> and the source file encoding is declared to be utf-8, why then is the string
> still not encoded in utf-8 by default?

Because the source code encoding is only used to decode the Unicode
literals in the source code into Unicode objects.

Plain string literals do not have an encoding attached and
are regarded as plain byte strings. As a result, they are
passed through the decoding mechanism by first decoding them to
Unicode (using the source code encoding) and then re-encoding them.

> Why all the hassle of using u"..." instead of making it the default?

This will happen in Python 3.0.

> There is a lot of python source code I maintain, and it would simplify coding
> a lot if this could be made the default.

Indeed, but it potentially also breaks a lot of code since Python
and the many extensions for it are not yet fully Unicode compatible.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Nov 28 2005)

Re: Format strings

"Martin v. Löwis"
In reply to this post by Josef Spillner-3
Josef Spillner wrote:
> Written on Friday, 25 November 2005 at 23:16:
>
>>It is correct either way. A byte string is a byte string is a byte
>>string is a string of bytes is not a Unicode string.
>
>
> That was the second part of my question. If a programmer writes down a string,
> and the source file encoding is declared to be utf-8, why then is the string
> still not encoded in utf-8 by default?

But it is encoded in utf-8! Why do you say it isn't? "be encoded in
UTF-8" is different from "be a Unicode string". Unicode strings are
a separate data type (different from byte strings). "UTF-8" is a
*byte* encoding, so a UTF-8 string is *not* a character string,
but a byte string.
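
A short sketch of that distinction, again assuming Python 3 (where the two types are named `str` and `bytes`) rather than the Python 2 discussed here; it is an illustration, not code from the thread, and the names are made up for the example.

```python
# Python 3 sketch; names are illustrative.
char = "\u00e1"                      # a character string holding á

as_utf8 = char.encode("utf-8")       # b'\xc3\xa1' (two bytes)
as_latin1 = char.encode("latin-1")   # b'\xe1'     (one byte)

# Different encodings give different byte strings for the same text...
assert as_utf8 != as_latin1
# ...and decoding each with its own codec recovers the same character string.
assert as_utf8.decode("utf-8") == as_latin1.decode("latin-1") == char

# The two kinds of string are distinct types.
print(type(char).__name__, type(as_utf8).__name__)
```

"Encoded in UTF-8" is therefore a property of the byte string, while the character string itself has no encoding.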

> Why all the hassle of using u"..." instead of making it the default?
> There is a lot of python source code I maintain, and it would simplify coding
> a lot if this could be made the default.

There is an undocumented -U option which makes all string literals
Unicode strings. Please try this out - you will likely find that
your application breaks.

Regards,
Martin

Re: Format strings

Josef Spillner-3
In reply to this post by M.-A. Lemburg
[I removed the CC:s since we're all subscribed I think.]

Written on Monday, 28 November 2005 at 12:55:
> Plain string literals do not have an encoding attached and
> are regarded as plain byte strings. As a result, they are
> passed through the decoding mechanism by first decoding them to
> Unicode (using the source code encoding) and then re-encoding them.

But (my last remaining question, as it seems), the default encoding of
unicode() is "ascii" instead of "utf-8" even for this particular source file
which specifies utf-8 encoding.
Would changing this to match the source file encoding break applications as
well?

Note that the documentation is not really helpful about this aspect. I'd even
like to advocate for an i18n paragraph in the tutorial, where such
behavioural aspects are put into relation with each other and explained in
the context of modern (and legacy) runtime environments.

Or it'd be helpful to link to the Unicode HOWTO from the tutorial/module
index. However, the two contradict each other slightly, e.g. in the parameter
description of unicode().

Compare:
[All of its arguments should be 8-bit strings]
vs.
[if object is a Unicode string or subclass it will return that Unicode string]
(actually it should say "Unicode object" below, right?)

>> Why all the hassle of using u"..." instead of making it the default?
>This will happen in Python 3.0.

Ah, nice to know.

>> There is a lot of python source code I maintain, and it would simplify
>> coding a lot if this could be made the default.

> Indeed, but it potentially also breaks a lot of code since Python
> and the many extensions for it are not yet fully Unicode compatible.

I just tested -U on my applications. It seems that the 'random' module is a
large offender. Otherwise, it seems to work ok. Some PyGame oddities but
those are actually present without -U as well, and I'm going to look into
fixing the library.

Is anyone coordinating the work, i.e. is there a "unicode compatibility status
map" or anything similar?

Josef

Re: Format strings

Josef Spillner-3
In reply to this post by "Martin v. Löwis"
On Monday, 28 November 2005 20:46, Martin v. Löwis wrote:
> But it is encoded in utf-8! Why do you say it isn't? "be encoded in
> UTF-8" is different from "be a Unicode string". Unicode strings are
> a separate data type (different from byte strings). "UTF-8" is a
> *byte* encoding, so a UTF-8 string is *not* a character string,
> but a byte string.

OK, sorry, my mistake.

> There is an undocumented -U option which makes all string literals
> Unicode strings. Please try this out - you will likely find that
> your application breaks.

See my other reply. It'd be helpful to have a chapter advertising this option
for people to be able to prepare any necessary changes. Of course a
prerequisite would be that at least the basic included modules work with it.
It could also serve as an invitation for people to help migrate the modules.

Josef

Re: Format strings

"Martin v. Löwis"
In reply to this post by Josef Spillner-3
Josef Spillner wrote:
> But (my last remaining question, as it seems), the default encoding of
> unicode() is "ascii" instead of "utf-8" even for this particular source file
> which specifies utf-8 encoding.
> Would changing this to match the source file encoding break applications as
> well?

No. *That* would not be implementable (or, if somehow implemented, would
break applications). In general, if you convert a Unicode string into
a byte string, you cannot even be sure it originally came from source
code. Say you do

a = u"Martin "
b = u"v. "
c = u"Löwis"
mvl = a+b+c

Now, the object mvl does not have any source code: so which encoding
should be used to encode it? If you have an answer: how does that change
if I have

mvl = mod1.a+mod2.b+mod3.c
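
The point can be sketched as follows, assuming Python 3 (where every string literal is already a Unicode string, so the `u` prefix is dropped). This is an illustration and not code from the thread; the extra variable names are made up.

```python
# Python 3 sketch of the concatenation example; extra names are illustrative.
a = "Martin "
b = "v. "
c = "L\u00f6wis"   # Löwis
mvl = a + b + c

# The resulting string records nothing about any source file, so whoever
# needs bytes has to name the encoding explicitly.
utf8_bytes = mvl.encode("utf-8")      # ö becomes two bytes
latin1_bytes = mvl.encode("latin-1")  # ö becomes one byte

# ASCII cannot represent ö at all.
try:
    mvl.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(len(mvl), len(utf8_bytes), len(latin1_bytes), ascii_failed)
```

The same 15-character string yields byte strings of different lengths depending on the codec chosen, and nothing in `mvl` itself says which one is "right".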

> Note that the documentation is not really helpful about this aspect. I'd even
> like to advocate for an i18n paragraph in the tutorial, where such
> behavioural aspects are put into relation with each other and explained in
> the context of modern (and legacy) runtime environments.

Contributions to the documentation are welcome.

> Compare:
> [All of its arguments should be 8-bit strings]
> vs.
> [if object is a Unicode string or subclass it will return that Unicode string]
> (actually it should say "Unicode object" below, right?)

I personally use "Unicode string" (type unicode) vs. "byte string" (type
str). Both are strings.

> Is anyone coordinating the work, i.e. is there a "unicode compatibility status
> map" or anything similar?

No. It is so far from actually working that nobody bothers to fix it.
However, if you have specific contributions which improve the state
(i.e. have no behaviour change if -U is not specified, but fix a bug
  when it is), those are appreciated.

Regards,
Martin

Re: Format strings

Josef Spillner-3
On Wednesday, 30 November 2005 23:52, Martin v. Löwis wrote:

> a = u"Martin "
> b = u"v. "
> c = u"Löwis"
> mvl = a+b+c
>
> Now, the object mvl does not have any source code: so which encoding
> should be used to encode it? If you have an answer: how does that change
> if I have
>
> mvl = mod1.a+mod2.b+mod3.c

Ah, indeed.
Still, there is sys.getdefaultencoding() (== "ascii"), and controlling it via
sys.setdefaultencoding() could be the application's task.
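
For illustration, here is a rough sketch of what an "ascii" default means in practice. It assumes Python 3, where sys.getdefaultencoding() reports "utf-8", so the ASCII decode that Python 2 performed implicitly is written out explicitly; the names are illustrative and the code is not from the thread.

```python
import sys

# Python 3 sketch; names are illustrative.
raw = "\u00e1".encode("utf-8")   # the UTF-8 bytes for á

# Python 3 reports 'utf-8' here; the Python 2 of this thread reported 'ascii'.
print(sys.getdefaultencoding())

# Spell out the ASCII decode that Python 2 applied implicitly when a
# byte string was converted to Unicode without a named encoding.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Naming the encoding at the point of conversion works regardless of
# any process-wide default.
decoded = raw.decode("utf-8")
print(ascii_ok, decoded == "\u00e1")
```

Passing the encoding explicitly at each conversion avoids depending on a process-wide default at all.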

> No. It is so far from actually working that nobody bothers to fix it.
> However, if you have specific contributions which improve the state
> (i.e. have no behaviour change if -U is not specified, but fix a bug
>   when it is), those are appreciated.

OK, I'll look into it.

Josef