Unrecognized backslash escapes in string literals

9 messages

Chris Angelico
In Python, unrecognized escape sequences are treated literally,
without (as far as I can tell) any sort of warning or anything. This
can mask bugs, especially when Windows path names are used:

>>> 'C:\sqlite\Beginner.db'
'C:\\sqlite\\Beginner.db'
>>> 'c:\sqlite\beginner.db'
'c:\\sqlite\x08eginner.db'

To a typical Windows user, the two strings should be equivalent - case
insensitive file names, who cares whether you say "Beginner" or
"beginner"? But to Python, one of them will happen to work, the other
will fail badly.
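
A minimal sketch of what happens to the two literals above. The literal text is held in raw strings and parsed with ast.literal_eval, purely so the example itself contains no unrecognized escapes; the helper name parse is made up for illustration.

```python
import ast
import warnings

def parse(source):
    """Evaluate the text of a string literal and return the resulting str."""
    with warnings.catch_warnings():
        # Some Python versions warn about unrecognized escapes; silence that here.
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)

upper = parse(r"'C:\sqlite\Beginner.db'")
lower = parse(r"'c:\sqlite\beginner.db'")

# \s and \B are not recognized escapes, so both backslashes stay in the string.
print(repr(upper), len(upper))
# \b IS a recognized escape (backspace, U+0008), so that backslash is consumed.
print(repr(lower), len(lower))
```

The two results differ in length by one character, which is the whole bug: the lowercase path contains a backspace where the second separator should be.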

Why is it that Python interprets them this way, and doesn't even give
a warning? What happened to errors not passing silently? Or, looking
at this the other way: Is there a way to enable such warnings/errors?
I can't see one in 'python[3] -h', but if there's some way elsewhere,
that would be a useful thing to recommend to people (I already
recommend running Python 2 with -tt).

ChrisA


Ben Finney-10
Chris Angelico <rosuav at gmail.com> writes:

> In Python, unrecognized escape sequences are treated literally,
> without (as far as I can tell) any sort of warning or anything.

Right. Text string literals are documented to work that way
<URL:https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str>,
which refers the reader to the language reference
<URL:https://docs.python.org/3/reference/lexical_analysis.html#strings>.

> Why is it that Python interprets them this way, and doesn't even give
> a warning?

Because the interpretation of those literals is unambiguous and correct.

It's unfortunate that MS Windows inherited the incompatible "backslash
is a path separator", long after backslash was already established in
many programming languages as the escape character.

> Is there a way to enable such warnings/errors?

A warning or error for a correctly formatted literal with an unambiguous
meaning would be an un-Pythonic thing to have.

I can see the motivation, but really the best solution is to learn that
the backslash is an escape character in Python text string literals.

This has the advantage that it's the same escape character used for text
string literals in virtually every other programming language, so you're
not needing to learn anything unusual.
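
For the Windows path case specifically, here is a short sketch of three spellings that avoid relying on unrecognized escapes. The path is the one from the original post; using pathlib here is my own choice for illustration, not something the documentation above prescribes.

```python
from pathlib import PureWindowsPath

# Every backslash escaped explicitly.
doubled = "c:\\sqlite\\beginner.db"
# Raw string: backslashes are taken literally.
raw = r"c:\sqlite\beginner.db"
# Forward slashes, normalized to backslashes by PureWindowsPath.
via_pathlib = str(PureWindowsPath("c:/sqlite/beginner.db"))

print(doubled == raw == via_pathlib)
```

One caveat with the raw-string spelling: a raw literal still cannot end in a single backslash, so a trailing separator needs one of the other two forms.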

--
 \        "The deepest sin against the human mind is to believe things |
  `\          without evidence." --Thomas Henry Huxley, _Evolution and |
_o__)                                                    Ethics_, 1893 |
Ben Finney



Dave Angel-4
In reply to this post by Chris Angelico
On 02/22/2015 09:29 PM, Chris Angelico wrote:

> In Python, unrecognized escape sequences are treated literally,
> without (as far as I can tell) any sort of warning or anything. This
> can mask bugs, especially when Windows path names are used:
>
> >>> 'C:\sqlite\Beginner.db'
> 'C:\\sqlite\\Beginner.db'
> >>> 'c:\sqlite\beginner.db'
> 'c:\\sqlite\x08eginner.db'
>
> To a typical Windows user, the two strings should be equivalent - case
> insensitive file names, who cares whether you say "Beginner" or
> "beginner"? But to Python, one of them will happen to work, the other
> will fail badly.
>
> Why is it that Python interprets them this way, and doesn't even give
> a warning? What happened to errors not passing silently? Or, looking
> at this the other way: Is there a way to enable such warnings/errors?
> I can't see one in 'python[3] -h', but if there's some way elsewhere,
> that would be a useful thing to recommend to people (I already
> recommend running Python 2 with -tt).
>
> ChrisA
>

I've long thought they should be errors, but in Python they're not even
warnings.  It's one thing to let a user be sloppy on a shell's
commandline, but in a program, if you have an invalid escape sequence,
it should be an invalid string literal, full stop.

And Python doesn't even treat these invalid sequences the same (broken)
way C does.  The documentation explicitly says it's different than C.
If you're going to be different, at least be strict.

--
DaveA


Chris Angelico
In reply to this post by Ben Finney-10
On Mon, Feb 23, 2015 at 1:41 PM, Ben Finney <ben+python at benfinney.id.au> wrote:
> Chris Angelico <rosuav at gmail.com> writes:
>
>> Why is it that Python interprets them this way, and doesn't even give
>> a warning?
>
> Because the interpretation of those literals is unambiguous and correct.

And it also implies that never, in the entire infinite future of
Python development, will any additional escapes be invented - because
then it'd be ambiguous (in versions up to X, "\s" means "\\s", and
after that, "\s" means something else).

> It's unfortunate that MS Windows inherited the incompatible "backslash
> is a path separator", long after backslash was already established in
> many programming languages as the escape character.

I agree, the fault is primarily with Windows. But I've seen similar
issues when people use /-\| for box drawing and framing and such;
Windows paths are by far the most common case of this, but not the
only one.

>> Is there a way to enable such warnings/errors?
>
> A warning or error for a correctly formatted literal with an unambiguous
> meaning would be an un-Pythonic thing to have.
> ...
> This has the advantage that it's the same escape character used for text
> string literals in virtually every other programming language, so you're
> not needing to learn anything unusual.

And yet the treatment of the edge case differs. In C, for instance,
you get a compiler warning, and then the backslash is removed and
you're left with just the other character.

The trouble isn't that people need to learn that backslashes are
special in Python string literals. The trouble is that, especially
when file names are frequently being written with uppercase first
letters, it's very easy to have code that just so happens to work,
without being reliable. Spending some time working with paths like
these:

fn = "C:\Foo\Bar\Asdf.ext"

and then finding that each of these fails, but in a different way:

path = "C:\Foo\Bar\"; fn = path + "Asdf.ext"
fn = "c:\foo\bar\asdf.ext"
fn = "c:\users\myname\blah"

would surely count as surprising. Particularly since the last one will
work fine in Python 2 sans unicode_literals, and will then blow up in
Python 3 - because, contrary to the "no additional escapes"
assumption, Unicode strings introduced new escapes, which means that
"\u0123" has different meaning in byte strings and Unicode strings. In
fact, that's an exception to the usual rule of "upper case is safe",
and it's one that *will* trip people up, thanks to the "C:\Users"
directory on a modern Windows system. What's the betting people will
blame the failure on Python 3 and/or Unicode, rather than on the
sloppy use of escapes and the poor choice of path separator on a
popular platform?
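
A sketch of the distinct failure modes, under Python 3. The literal text is kept in raw strings and parsed with ast.literal_eval so that the broken ones can be caught instead of stopping the script; try_literal is a hypothetical helper name.

```python
import ast
import warnings

def try_literal(source):
    """Return the parsed string, or the SyntaxError text if it does not parse."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # hide unrecognized-escape warnings
        try:
            return ast.literal_eval(source)
        except SyntaxError as exc:
            return "SyntaxError: " + str(exc.msg)

for source in (
    r'"C:\Foo\Bar\Asdf.ext"',   # happens to work: no recognized escapes
    r'"C:\Foo\Bar\"',           # \" escapes the quote, so the literal never ends
    r'"c:\foo\bar\asdf.ext"',   # \f, \b, \a silently become control characters
    r'"c:\users\myname\blah"',  # Python 3: \u must be followed by four hex digits
):
    print(source, "->", repr(try_literal(source)))
```

Only the third case fails silently; the second and fourth are rejected at compile time, which is arguably the friendlier outcome.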

ChrisA


Dave Angel-4
In reply to this post by Ben Finney-10
On 02/22/2015 09:41 PM, Ben Finney wrote:

> Chris Angelico <rosuav at gmail.com> writes:
>
>> In Python, unrecognized escape sequences are treated literally,
>> without (as far as I can tell) any sort of warning or anything.
>
> Right. Text string literals are documented to work that way
> <URL:https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str>,
> which refers the reader to the language reference
> <URL:https://docs.python.org/3/reference/lexical_analysis.html#strings>.
>
>> Why is it that Python interprets them this way, and doesn't even give
>> a warning?
>
> Because the interpretation of those literals is unambiguous and correct.

Correct according to a misguided language definition.

>
> It's unfortunate that MS Windows inherited the incompatible "backslash
> is a path separator", long after backslash was already established in
> many programming languages as the escape character.

Windows "inherited" it from DOS.  But since Windows was nothing but a
DOS shell for several years, that's not surprising.  The historical
problem came from CP/M's use of the forward slash for a
switch-character.  Since MSDOS/PCDOS/QDOS was trying to permit
transliterated CP/M programs, and because subdirectories were an
afterthought (version 2.0), they felt they needed to pick a different
character.  At one time, the switch-character could be set by the user,
but most programs ignored that, so it died.

>
>> Is there a way to enable such warnings/errors?
>
> A warning or error for a correctly formatted literal with an unambiguous
> meaning would be an un-Pythonic thing to have.
>
> I can see the motivation, but really the best solution is to learn that
> the backslash is an escape character in Python text string literals.
>
> This has the advantage that it's the same escape character used for text
> string literals in virtually every other programming language, so you're
> not needing to learn anything unusual.
>

I might be able to buy that argument if it was done the same way, but as
it says in:
   https://docs.python.org/3/reference/lexical_analysis.html#strings

"""Unlike Standard C, all unrecognized escape sequences are left in the
string unchanged, i.e., the backslash is left in the result. (This
behavior is useful when debugging: if an escape sequence is mistyped,
the resulting output is more easily recognized as broken.)
"""

The word "broken" is an admission that this was a flawed approach.  If
it's broken, it should be an error.

I'm not suggesting that the implementation should falsely trigger an
error.  But that the language definition should be changed to define it
as an error.

--
DaveA


Chris Angelico
In reply to this post by Ben Finney-10
On Mon, Feb 23, 2015 at 1:41 PM, Ben Finney <ben+python at benfinney.id.au> wrote:
> Right. Text string literals are documented to work that way
> <URL:https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str>,
> which refers the reader to the language reference
> <URL:https://docs.python.org/3/reference/lexical_analysis.html#strings>.

BTW, quoting from that:

"""
Unlike Standard C, all unrecognized escape sequences are left in the
string unchanged, i.e., the backslash is left in the result. (This
behavior is useful when debugging: if an escape sequence is mistyped,
the resulting output is more easily recognized as broken.)
"""

I'm not sure it's more obviously broken. Comparing Python and Pike:

>>> "asdf\qwer"
'asdf\\qwer'

> "asdf\qwer";
(1) Result: "asdfqwer"

Which one is "more easily recognized as broken" depends on what the
actual intention was. If you wanted to have a backslash (eg a path
name), then the second one is, because you've just run two path
components together. If you wanted to have some sort of special
character ("\n"), then they're both going to be about the same - you'd
expect to see "\n" in the output, one has added a backslash (assuming
you're looking at the repr), the other has removed it. Likewise if you
wanted some other symbol (eg forward slash), they're about the same (a
doubled backslash, or a complete omission, same diff). But if you just
fat-fingered a backslash into a string where it completely doesn't
belong, then seeing a doubled backslash is definitely better than
seeing just the following character (which would mask the error
entirely). Since the interpreter can't know what the intention was, it
obviously has to do just one thing and stick with it.

I'm not convinced this is really an advantage. Python has been aiming
more and more towards showing problems immediately, rather than having
them depend on your data - for instance, instead of letting you treat
bytes and characters as identical until you hit something that isn't
ASCII, Py3 forces you to distinguish from the start. That said,
though, there's probably a lot of code out there that depends on
backslashes being non-special, so it's quite probably something that
can't be changed. But it'd be nice to be able to turn on a warning for
it.
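
For readers on newer interpreters: as far as I know, CPython 3.6 and later do emit a warning for unrecognized escapes when source is compiled (a DeprecationWarning at first, a SyntaxWarning in more recent releases). That postdates this thread, so treat the following as a sketch under that assumption; escape_warnings is a hypothetical helper name.

```python
import warnings

def escape_warnings(source):
    """Compile source text and return (category, message) for each warning raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        compile(source, "<example>", "exec")
    return [(w.category.__name__, str(w.message)) for w in caught]

# Unrecognized \s and \B: expect at least one warning on a recent interpreter.
print(escape_warnings(r'path = "C:\sqlite\Beginner.db"'))
# Properly doubled backslashes: nothing to report.
print(escape_warnings(r'path = "c:\\sqlite\\beginner.db"'))
```

On such interpreters, running with -W error should promote the warning to an error, which is roughly the opt-in strictness asked for above. Note that a literal like "c:\sqlite\beginner.db" still produces nothing for its \b, since that escape is recognized.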

ChrisA


Reply | Threaded
Open this post in threaded view
|

Unrecognized backslash escapes in string literals

Ben Finney-10
Chris Angelico <rosuav at gmail.com> writes:

> That said, though, there's probably a lot of code out there that
> depends on backslashes being non-special, so it's quite probably
> something that can't be changed. But it'd be nice to be able to turn
> on a warning for it.

If you're motivated to see such warnings, an appropriate place to
implement them would be in PyLint or another established static code
analysis tool.

--
 \            "The whole area of [treating source code as intellectual |
  `\    property] is almost assuring a customer that you are not going |
_o__)              to do any innovation in the future." --Gary Barnett |
Ben Finney



Peter Otten
Ben Finney wrote:

> Chris Angelico <rosuav at gmail.com> writes:
>
>> That said, though, there's probably a lot of code out there that
>> depends on backslashes being non-special, so it's quite probably
>> something that can't be changed. But it'd be nice to be able to turn
>> on a warning for it.
>
> If you're motivated to see such warnings, an appropriate place to
> implement them would be in PyLint or another established static code
> analysis tool.

Pylint already produces a warning. However, it cannot read the author's
mind:

$ cat tmp.py
print("C:\alpha")
print("C:\beta")
print("C:\gamma")
$ pylint tmp.py
************* Module tmp
W:  3, 0: Anomalous backslash in string: '\g'. String constant might be
missing an r prefix. (anomalous-backslash-in-string)
C:  1, 0: Missing module docstring (missing-docstring)

The same would go for a warning built into the compiler. Maybe having
editors highlight the special combinations would be the more helpful
approach. A tooltip could explain the meaning.
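
A rough sketch of what such highlighting could be built on: walk the string tokens of a file and report every backslash pair in a non-raw literal, recognized or not, so that "\a" and "\b" get pointed out as well as "\g". This is not how Pylint does it; backslash_report and SAMPLE are made-up names, and the classification is simplified (for instance, it treats \N, \u and \U as recognized even in bytes literals).

```python
import io
import re
import tokenize

# Characters that start a recognized escape in a str literal.
RECOGNIZED = set("abfnrtv01234567xNuU")
ESCAPE = re.compile(r"\\(.)", re.DOTALL)

def backslash_report(source):
    """Yield (line, escape, kind) for backslash pairs in non-raw string literals."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        prefix = re.match(r"[A-Za-z]*", tok.string).group().lower()
        if "r" in prefix:
            continue  # raw string: backslashes are already literal
        for match in ESCAPE.finditer(tok.string):
            char = match.group(1)
            if char in "\\'\"\n":
                continue  # escaped backslash, quote, or line continuation
            kind = "recognized" if char in RECOGNIZED else "unrecognized"
            yield tok.start[0], "\\" + char, kind

# The tmp.py from above, held in a raw string so it is not interpreted here.
SAMPLE = r'''print("C:\alpha")
print("C:\beta")
print("C:\gamma")
'''

for line, escape, kind in backslash_report(SAMPLE):
    print(line, escape, kind)
```

Whether a recognized escape in a given string is intentional is still for the author to judge; the report only makes it visible.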


Serhiy Storchaka-2
In reply to this post by Chris Angelico
On 23.02.15 04:55, Chris Angelico wrote:
> I agree, the fault is primarily with Windows. But I've seen similar
> issues when people use /-\| for box drawing and framing and such;
> Windows paths are by far the most common case of this, but not the
> only one.

There are also issues with regular expressions.

 >>> import re
 >>> re.match(r'a\s', 'a ')
<_sre.SRE_Match object; span=(0, 2), match='a '>
 >>> re.match('a\s', 'a ')
<_sre.SRE_Match object; span=(0, 2), match='a '>
 >>> re.match(r'a\b', 'a ')
<_sre.SRE_Match object; span=(0, 1), match='a'>
 >>> re.match('a\b', 'a ')

Oops.

'a\s' works the same as r'a\s', but 'a\b' works differently from r'a\b'.
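
A short sketch of why the \b case differs: in a normal literal the string parser turns \b into a backspace before the regex engine ever sees the pattern, while in a raw literal the backslash survives and the engine reads \b as a word boundary.

```python
import re

# Normal literal: \b is a recognized escape, i.e. one backspace character.
assert "a\b" == "a\x08"
# Raw literal: backslash and "b" are two separate characters.
assert r"a\b" == "a\\b"

print(re.match(r"a\b", "a "))     # matches "a": word boundary before the space
print(re.match("a\b", "a "))      # None: the pattern wants "a" then a backspace
print(re.match("a\b", "a\x08"))   # matches, which shows what the pattern really is
```

\s, by contrast, is not a string escape, so 'a\s' reaches the regex engine with its backslash intact and only happens to behave like r'a\s'.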