non-ascii docstrings

Edward Loper
I've been working on epydoc, and the question has come up of how I
should treat non-unicode docstrings that contain non-ascii characters.
An example of such a file is "python2.4/encodings/string_escape.py",
whose module docstring contains an 'o' with an umlaut.

In particular, the question is whether I should assume that the
docstring is encoded with the encoding specified by the "-*- coding -*-"
directive at the top of the file.

The reason why we *wouldn't* use the encoding is that PEP 263 [1], which
defines the coding directive, says that it does *not* apply to
non-unicode string literals.  In particular, PEP 263 says that the
entire file should be read & tokenized using the specified coding, but
once string objects are created, they should be reencoded back into
8-bit strings using the file encoding.
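
For a tool that reads source files itself, finding the declared encoding could look roughly like the sketch below. The pattern is the one PEP 263 gives for the declaration comment; the helper name declared_encoding is made up for illustration, and the sketch simply checks the first two lines without the finer rules about what the first line may contain.

```python
import re

# Declaration pattern from PEP 263, applied to raw bytes.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_bytes):
    """Return the encoding named on line 1 or 2 of the source, or None.

    Hypothetical helper: a simplified reading of PEP 263.
    """
    for line in source_bytes.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            # Encoding names are plain ASCII by construction of the pattern.
            return match.group(1).decode("ascii")
    return None


print(declared_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n"))
```

A return value of None means the file carries no declaration, so the tool has nothing to go on beyond ASCII.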

So the "correct" fix is for the author of the module to use unicode
literals instead of string literals for docstrings that contain
non-ascii characters.  This has the advantage that if a user tries to
look at the docstring via introspection, it will be correct.

On the other hand, epydoc is often used by people other than the author
of a module, and requiring them to go through and replace all string
literal docstrings with unicode literals seems a bit unreasonable.

In a way, this is similar to the mistake I've seen many times of using
non-escaped backslashes inside docstrings.  e.g.:

def wc(filename):
     """
     Count the number of words in the given file. E.g.:
         >>> wc("c:\test\new.txt")
         100
     """

Which looks fine in the source file, but looks quite broken if you print
its __doc__:

 >>> print wc.__doc__
     Count the number of words in the given file. E.g.:
          >>> wc("c:     est
ew.txt")
     100

(The right fix in that case is probably to use a raw-string.)
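
A minimal sketch of that fix, with the two function names (wc_plain, wc_raw) invented here only to put the two spellings side by side; the only difference is the r prefix on the docstring:

```python
def wc_plain(filename):
    """
    Count the number of words in the given file. E.g.:
        >>> wc("c:\test\new.txt")
        100
    """


def wc_raw(filename):
    r"""
    Count the number of words in the given file. E.g.:
        >>> wc("c:\test\new.txt")
        100
    """


# In the plain docstring, \t and \n became a tab and a newline.
print("\t" in wc_plain.__doc__)
# In the raw docstring, the backslashes survive exactly as written.
print("\\test\\new.txt" in wc_raw.__doc__)
```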

So the question is..  Should epydoc (and other tools like it) be
compliant with PEP 263 (and consistent with Python); or should they "do
what I mean, not what I say" and treat non-ascii docstrings as if they
were encoded using the module's encoding?

-Edward

[1] http://www.python.org/doc/peps/pep-0263/
_______________________________________________
Doc-SIG maillist  -  [hidden email]
http://mail.python.org/mailman/listinfo/doc-sig

Re: non-ascii docstrings

David Goodger
[Edward Loper]
> I've been working on epydoc, and the question has come up of how I
> should treat non-unicode docstrings that contain non-ascii
> characters.  An example of such a file is
> "python2.4/encodings/string_escape.py", whose module docstring
> contains an 'o' with an umlaut.
>
> In particular, the question is whether I should assume that the
> docstring is encoded with the encoding specified by the "-*- coding
> -*-" directive at the top of the file.

I think that although it's the only possible assumption, it's also
potentially a wrong assumption.  IOW, don't assume anything.

> The reason why we *wouldn't* use the encoding is that PEP 263 [1],
> which defines the coding directive, says that it does *not* apply to
> non-unicode string literals.  In particular, PEP 263 says that the
> entire file should be read & tokenized using the specified coding,
> but once string objects are created, they should be reencoded back
> into 8-bit strings using the file encoding.

One reason is that the module code may expect such string literals to
have their original encoding.  String literals can contain arbitrary
8-bit data (strings are bytes, not characters).  Attempting to decode
such strings is inviting misinterpretation.

Another reason is simple: "In the face of ambiguity, refuse the
temptation to guess."

> So the "correct" fix is for the author of the module to use unicode
> literals instead of string literals for docstrings that contain
> non-ascii characters.  This has the advantage that if a user tries
> to look at the docstring via introspection, it will be correct.
>
> On the other hand, epydoc is often used by people other than the
> author of a module, and requiring them to go through and replace all
> string literal docstrings with unicode literals seems a bit
> unreasonable.

Yes, it's unreasonable.  But such code is buggy IMO.  It's also
unreasonable to expect Epydoc to correctly interpret garbage input.
Don't do it.

> So the question is..  Should epydoc (and other tools like it) be
> compliant with PEP 263 (and consistent with Python); or should they
> "do what I mean, not what I say" and treat non-ascii docstrings as
> if they were encoded using the module's encoding?

Be compliant with PEP 263, issue a warning (PEP 263, Implementation,
step 1), and either ignore such string literals or represent them as
strings of bytes (using "\xYY" notation).
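
Roughly, the "strings of bytes" option might be sketched as follows. Both function names are hypothetical, the wording of the warning is invented, and the docstring is modelled as a bytes object since that is what an 8-bit string literal amounts to:

```python
import warnings


def escape_non_ascii(raw):
    """Return text in which every non-ASCII byte of `raw` appears as \\xYY."""
    parts = []
    for byte in raw:  # iterating over bytes yields integers
        if byte < 0x80:
            parts.append(chr(byte))
        else:
            parts.append("\\x%02x" % byte)
    return "".join(parts)


def strict_docstring(raw, name="<docstring>"):
    """Warn about non-ASCII bytes, then show them escaped instead of decoding."""
    if any(byte >= 0x80 for byte in raw):
        warnings.warn("non-ASCII bytes in byte-string docstring of %s" % name)
    return escape_non_ascii(raw)


print(strict_docstring(b"Link\xf6ping"))  # prints Link\xf6ping
```

Nothing is decoded here, so no guess about the encoding is ever made; the reader sees exactly which bytes were in the literal.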

--
David Goodger <http://python.net/~goodger>



Re: non-ascii docstrings

Edward Loper
David Goodger wrote:
>> In particular, the question is whether I should assume that the
>> docstring is encoded with the encoding specified by the "-*- coding
>> -*-" directive at the top of the file.
>
> I think that although it's the only possible assumption, it's also
> potentially a wrong assumption.  IOW, don't assume anything.

That was my inclination at first, but it appears that there are a large
number of python files out there that use non-ascii docstrings.  Asking
the epydoc user (who is very often not the package author) to go through
and add a 'u' in front of every docstring (but *not* any other string --
that might break the program) seems unreasonable.  And I have yet to see
a single python module where the -*- coding -*- directive is *not* the
right encoding for the docstrings.

> Another reason is simple: "In the face of ambiguity, refuse the
> temptation to guess."

Practicality beats purity. :)

> Yes, it's unreasonable.  But such code is buggy IMO.  It's also
> unreasonable to expect Epydoc to correctly interpret garbage input.

Small consolation to the user who's just trying to learn how to use a
package that they didn't write.

-Edward


Re: non-ascii docstrings

Chris Jerdonek
On Mar 24, 2006, at 8:32 PM, Edward Loper wrote:

> David Goodger wrote:
>>> In particular, the question is whether I should assume that the
>>> docstring is encoded with the encoding specified by the "-*- coding
>>> -*-" directive at the top of the file.
>>
>> Yes, it's unreasonable.  But such code is buggy IMO.  It's also
>> unreasonable to expect Epydoc to correctly interpret garbage input.
>
> Small consolation to the user who's just trying to learn how to use a
> package that they didn't write.

Can't you make it an option (messy/pure)?
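
Something like the sketch below, say, where the function name docstring_text and the lenient flag are made up for illustration and are not epydoc options:

```python
def docstring_text(raw, declared_encoding=None, lenient=False):
    """Turn a byte-string docstring into text under one of two policies.

    lenient=True  ("messy"): trust the module's declared encoding.
    lenient=False ("pure"):  never guess; show non-ASCII bytes as \\xYY.
    """
    if lenient and declared_encoding is not None:
        try:
            return raw.decode(declared_encoding)
        except (UnicodeDecodeError, LookupError):
            pass  # wrong or unknown encoding: fall back to the pure form
    return raw.decode("ascii", errors="backslashreplace")


print(docstring_text(b"Link\xf6ping", "latin-1", lenient=True))
print(docstring_text(b"Link\xf6ping", "latin-1", lenient=False))
```

The pure form doubles as the fallback, so a bad declaration in messy mode degrades to escaped bytes rather than an exception.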

--Chris


Re: non-ascii docstrings

Laura Creighton
In reply to this post by Edward Loper

I have never seen a module where the -*- coding -*- directive does not
match the encoding of the docstrings, either.  And where I have seen
non-ascii docstrings most often is where people are using some
company-wide tool (possibly third-party, and possibly there to integrate
with java code) to extract the docstrings, and also have a requirement
that the docstring contains the name of the person who wrote and/or
modified the code.

Indeed, in the matter of encoding, I wish that python would guess
a whole lot more.  One of the most common 'first python programs'
that non-computer people write is 'my phonelist manager' and another is
'my cd collection manager'.  I think that they have plenty enough
to worry about without needing to find out about encodings before
their first python program runs.  Most places _have_ a locale sort
of setting, and I would be in favour of trying whatever is there
when encountering something that is not ascii.
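
As a rough sketch of what "trying whatever is there" could mean (the function name is invented here; locale.getpreferredencoding is the standard library call that reports the locale's encoding):

```python
import locale


def decode_with_locale_fallback(raw):
    """Decode as ASCII if possible, otherwise try the locale's encoding."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        # False: report the preferred encoding without calling setlocale().
        encoding = locale.getpreferredencoding(False)
        # errors="replace" keeps a wrong guess from raising an exception.
        return raw.decode(encoding, errors="replace")


print(decode_with_locale_fallback(b"phone list"))
```

What comes back for non-ascii input depends on the machine it runs on, which is both the convenience and the risk of this approach.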

Laura


Re: non-ascii docstrings

Fredrik Lundh
Laura Creighton wrote:

> Indeed, in the matter of encoding, I wish that python would guess
> a whole lot more.  One of the most common 'first python programs'
> that non-computer people write is 'my phonelist manager' and another is
> 'my cd collection manager'.  I think that they have plenty enough
> to worry about without needing to find out about encodings before
> their first python program runs.  Most places _have_ a locale sort
> of setting, and I would be in favour of trying whatever is there
> when encountering something that is not ascii.

as long as the interpreter prints a warning when it falls back on the
default...  oh, wait.

$ python2.2 welcome.py
Welcome to Linköping

$ python2.3 welcome.py
sys:1: DeprecationWarning: Non-ASCII character '\xf6' in file welcome.py on line 1, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details
Welcome to Linköping

$ python2.4 welcome.py
sys:1: DeprecationWarning: Non-ASCII character '\xf6' in file welcome.py on line 1, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details
Welcome to Linköping

$ python2.5 welcome.py
  File "welcome.py", line 1
SyntaxError: Non-ASCII character '\xf6' in file /users/fredrik/welcome.py on line 1, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details

guess this means that newbies should make sure to run their first
program under multiple Python versions...

</F>



