Python NBSP DWIM


Tim Chase-3
str.split() doesn't seem to respect non-breaking space:

  Python 3.4.2 (default, Oct  8 2014, 10:45:20)
  [GCC 4.9.1] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> print(repr("hello\N{NO-BREAK SPACE}world".split()))
  ['hello', 'world']

What's the purpose of a non-breaking space if it's treated like a
space for breaking/splitting purposes? :-)
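
A quick way to see what is going on: called with no argument, str.split() breaks on runs of any character for which str.isspace() is true, and NBSP is one of them. A minimal sketch using only built-in str methods (the variable names are just for illustration):

```python
text = "hello\N{NO-BREAK SPACE}world"

# No argument: split on runs of any Unicode whitespace, NBSP included.
no_arg = text.split()
print(no_arg)                            # ['hello', 'world']

# Explicit separator: only U+0020 splits, so the NBSP survives.
explicit = text.split(" ")
print(explicit)                          # ['hello\xa0world']

# The underlying reason: NBSP counts as whitespace for str.isspace().
nbsp_is_space = "\N{NO-BREAK SPACE}".isspace()
print(nbsp_is_space)                     # True
```

Note that split(" ") is not a drop-in replacement: it does not collapse runs of spaces and it returns empty strings for leading or trailing separators.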

Is this a bug?

-tkc






Mark Lawrence
On 10/06/2015 14:28, Tim Chase wrote:

> str.split() doesn't seem to respect non-breaking space:
>
>    Python 3.4.2 (default, Oct  8 2014, 10:45:20)
>    [GCC 4.9.1] on linux
>    Type "help", "copyright", "credits" or "license" for more information.
>    >>> print(repr("hello\N{NO-BREAK SPACE}world".split()))
>    ['hello', 'world']
>
> What's the purpose of a non-breaking space if it's treated like a
> space for breaking/splitting purposes? :-)
>
> Is this a bug?
>
> -tkc
>

IMNSHO yes.

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence



Skip Montanaro
In reply to this post by Tim Chase-3
On Wed, Jun 10, 2015 at 8:28 AM, Tim Chase
<python.list at tim.thechases.com> wrote:
> Is this a bug?

Looks like it's been reported a few times with slightly different context:

https://bugs.python.org/issue6537
https://bugs.python.org/issue16623
https://bugs.python.org/issue20491
https://bugs.python.org/issue1390608

The couple times it's come up in the context of str.split, it's been
rejected, since the purpose of that method is to split words.

Skip


Laura Creighton-2
In a message of Wed, 10 Jun 2015 09:28:24 -0500, Skip Montanaro writes:

>On Wed, Jun 10, 2015 at 8:28 AM, Tim Chase
><python.list at tim.thechases.com> wrote:
>> Is this a bug?
>
>Looks like it's been reported a few times with slightly different context:
>
>https://bugs.python.org/issue6537
>https://bugs.python.org/issue16623
>https://bugs.python.org/issue20491
>https://bugs.python.org/issue1390608
>
>The couple times it's come up in the context of str.split, it's been
>rejected, since the purpose of that method is to split words.
>
>Skip

In these Unicode days, this thinking may need to be revisited.  There
are many languages where whitespace does not separate words -- either
words aren't separated at all or, as in Vietnamese, spaces separate
syllables, so a single word can contain spaces.

Laura



random832@fastmail.us
On Wed, Jun 10, 2015, at 11:03, Laura Creighton wrote:
> In these Unicode days, this thinking may need to be revisited.  There
> are many languages where whitespace does not separate words -- either
> words aren't separated at all or, as in Vietnamese, spaces separate
> syllables, so a single word can contain spaces.

Text wrapping for CJK scripts is another topic that might be worth
addressing in textwrap - words aren't space-separated, but there are
still rules about where you can place a line break. Generally these are
centered around preventing punctuation marks from being orphaned rather
than any attempt to algorithmically find word boundaries.

For the process called "Oikomi", while messing with kerning is not
strictly possible for monospaced text, it might be worthwhile in general
to have "preferred" and "maximum" line widths as parameters for
textwrap.
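
To make the "preferred" and "maximum" idea concrete, here is a rough sketch. It is not part of textwrap; the function name wrap_cjk, both parameter names, and the tiny NO_LINE_START set are all made up for illustration, and it counts characters rather than display columns. A line is cut at the preferred width, but may grow up to the maximum width so that closing punctuation is not orphaned at the start of the next line.

```python
# Small, illustrative subset of characters that should not begin a line.
NO_LINE_START = set("、。，．）」』！？")

def wrap_cjk(text, preferred, maximum):
    """Cut text into lines of `preferred` characters, stretching a line
    up to `maximum` characters to keep closing punctuation attached."""
    if preferred < 1:
        raise ValueError("preferred width must be at least 1")
    lines = []
    while text:
        cut = min(preferred, len(text))
        # Pull following punctuation onto this line while room remains.
        while cut < len(text) and cut < maximum and text[cut] in NO_LINE_START:
            cut += 1
        lines.append(text[:cut])
        text = text[cut:]
    return lines

sample = "これはテストです。次の行です。"
stretched = wrap_cjk(sample, preferred=8, maximum=9)
strict = wrap_cjk(sample, preferred=8, maximum=8)
print(stretched)
print(strict)
```

With maximum equal to preferred the second line starts with the orphaned full stop; allowing one extra character keeps it on the first line. A real implementation would also need rules for characters that must not end a line, and for double-width glyphs.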

http://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages


Steven D'Aprano-8
In reply to this post by Tim Chase-3
On Thu, 11 Jun 2015 12:28 am, Skip Montanaro wrote:

> On Wed, Jun 10, 2015 at 8:28 AM, Tim Chase
> <python.list at tim.thechases.com> wrote:
>> Is this a bug?
>
> Looks like it's been reported a few times with slightly different context:
>
> https://bugs.python.org/issue6537
> https://bugs.python.org/issue16623
> https://bugs.python.org/issue20491
> https://bugs.python.org/issue1390608
>
> The couple times it's come up in the context of str.split, it's been
> rejected, since the purpose of that method is to split words.

That reasoning is ... strange. The whole point of the NBSP is specifically
*not* to split on it. If you wanted it to split, you would use a regular
space.

(Oh, and for the record, there are at least two non-breaking spaces in
Unicode, U+00A0 "NO-BREAK SPACE" and U+202F "NARROW NO-BREAK SPACE".)

http://www.unicode.org/charts/PDF/U0080.pdf
http://www.unicode.org/charts/PDF/U2000.pdf


Non-breaking spaces should be used when you want to prevent
word-wrapping, and also for "open form" compound words:

http://grammar.ccc.commnet.edu/grammar/compounds.htm

textwrap should also treat NBSPs as non-spaces for the purposes of wrapping.
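
Whether textwrap itself honours NBSP may depend on the Python version, so a version-independent sketch is to hide each NBSP behind a placeholder while wrapping. The helper name wrap_keeping_nbsp and the choice of the private-use code point U+E000 as placeholder are assumptions for illustration; the input is assumed not to contain U+E000 already.

```python
import textwrap

NBSP = "\u00A0"
PLACEHOLDER = "\uE000"  # private-use code point; assumed absent from the input

def wrap_keeping_nbsp(text, width=70):
    """Wrap text without ever breaking at a NO-BREAK SPACE."""
    # Swap NBSP for a non-whitespace placeholder so textwrap sees one word.
    protected = text.replace(NBSP, PLACEHOLDER)
    # Restore the NBSPs once the line breaks have been chosen.
    return [line.replace(PLACEHOLDER, NBSP)
            for line in textwrap.wrap(protected, width=width)]

lines = wrap_keeping_nbsp("spam\u00A0eggs cheese", width=10)
print(lines)   # ['spam\xa0eggs', 'cheese']
```

A "word" that is longer than the width is still subject to textwrap's break_long_words behaviour, so very narrow widths can split it anyway.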

As a work-around, I think this should work:

- split the string on NBSPs;

- for each substring returned, split normally;

- merge sub-substrings.


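def split(s):
    """Split on whitespace, except NBSP.

    >>> split(u'hello world spam\\u00A0eggs cheese')
    ['hello', 'world', 'spam\\xa0eggs', 'cheese']

    """
    NBSP = u'\u00A0'
    words = []
    attach = False  # True while words[-1] is still "open" (no whitespace since)
    for i, sub in enumerate(s.split(NBSP)):
        if i > 0:
            # Put back the NBSP that s.split(NBSP) consumed.
            if attach:
                words[-1] += NBSP
            else:
                words.append(NBSP)
            attach = True
        parts = sub.split()
        if parts:
            # Glue the first part onto the open word unless whitespace intervenes.
            if attach and not sub[0].isspace():
                words[-1] += parts.pop(0)
            words.extend(parts)
            attach = not sub[-1].isspace()
        elif sub:
            attach = False  # a whitespace-only piece closes the open word
    return words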
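
A shorter alternative, offered only as a sketch, is to describe a "word" with a regular expression: a run of non-whitespace characters, with the two no-break spaces explicitly added back in. The names WORD_RE and split_keeping_nbsp are invented for this example.

```python
import re

# A "word" is a run of non-whitespace characters, plus U+00A0 and U+202F,
# which \S on its own would exclude because they count as whitespace.
WORD_RE = re.compile(r"[\S\u00A0\u202F]+")

def split_keeping_nbsp(s):
    """Split on whitespace, but never on a (narrow) no-break space."""
    return WORD_RE.findall(s)

result = split_keeping_nbsp("hello world spam\u00A0eggs cheese")
print(result)   # ['hello', 'world', 'spam\xa0eggs', 'cheese']
```

Unlike the split-and-merge approach, this one also leaves U+202F NARROW NO-BREAK SPACE unsplit.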
       

--
Steven



Chris Angelico
On Thu, Jun 11, 2015 at 3:11 AM, Steven D'Aprano <steve at pearwood.info>
wrote:
> (Oh, and for the record, there are at least two non-breaking spaces in
> Unicode, U+00A0 "NO-BREAK SPACE" and U+202F "NARROW NO-BREAK SPACE".)
>
> http://www.unicode.org/charts/PDF/U0080.pdf
> http://www.unicode.org/charts/PDF/U2000.pdf

And U+FEFF "ZERO WIDTH NO-BREAK SPACE", notable because it's also used as
the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've been
fighting with VLC Media Player over the font it uses for subtitles; for
some bizarre reason, that font represents U+FEFF not with zero pixels of
emptiness, but with a box containing the letters "ZWN" "BSP" on two lines.
Yeah, because that totally takes up zero width and looks like blank space.

ChrisA

random832@fastmail.us
On Wed, Jun 10, 2015, at 20:09, Chris Angelico wrote:
> And U+FEFF "ZERO WIDTH NO-BREAK SPACE", notable because it's also used as
> the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've
> been
> fighting with VLC Media Player over the font it uses for subtitles; for
> some bizarre reason, that font represents U+FEFF not with zero pixels of
> emptiness, but with a box containing the letters "ZWN" "BSP" on two
> lines.
> Yeah, because that totally takes up zero width and looks like blank
> space.

As I understand it, the proper behavior is that the ZWNBSP that is the
byte order mark shall never appear in an in-memory representation of the
first line of a BOM-encoded file, or any other line of the concatenation
of two BOM-encoded files, but should "vanish" when the file is opened
and first read from. So it shouldn't be showing up in your subtitles
regardless of its rendering behavior.

The real world, needless to say, isn't so nice.
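
In Python terms, the difference shows up in the choice of codec. A small sketch, using only the standard codecs module: the plain "utf-8" codec leaves the BOM in the decoded string as U+FEFF, while "utf-8-sig" drops it, but only at the very start of the data.

```python
import codecs

raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

plain = raw.decode("utf-8")        # BOM survives as a leading U+FEFF
sig = raw.decode("utf-8-sig")      # leading BOM is stripped
print(repr(plain))   # '\ufeffhello'
print(repr(sig))     # 'hello'

# Two BOM-prefixed files concatenated: only the first BOM is removed,
# so the second one ends up in the middle of the text.
joined = (raw + raw).decode("utf-8-sig")
print(repr(joined))  # 'hello\ufeffhello'
```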

IIRC there's also a font in MS Windows that uses various glyphs which
are zero-width, but are not blank, to represent ZWJ, ZWNJ, RLM, and LRM.
Good for seeing what is happening, bad for actually rendering text
that's intended to contain these characters. Though there's another
argument that ideally a rendering engine should not render any such
glyph unless something like "visible controls" has been selected (the
real world, again, isn't so nice, which is why most symbols intended for
visible control style rendering have their own distinct code points
rather than using those of the control characters they represent).


Chris Angelico
On Thu, Jun 11, 2015 at 11:02 AM, <random832 at fastmail.us> wrote:

>
> On Wed, Jun 10, 2015, at 20:09, Chris Angelico wrote:
> > And U+FEFF "ZERO WIDTH NO-BREAK SPACE", notable because it's also used as
> > the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've
> > been
> > fighting with VLC Media Player over the font it uses for subtitles; for
> > some bizarre reason, that font represents U+FEFF not with zero pixels of
> > emptiness, but with a box containing the letters "ZWN" "BSP" on two
> > lines.
> > Yeah, because that totally takes up zero width and looks like blank
> > space.
>
> As I understand it, the proper behavior is that the ZWNBSP that is the
> byte order mark shall never appear in an in-memory representation of the
> first line of a BOM-encoded file, or any other line of the concatenation
> of two BOM-encoded files, but should "vanish" when the file is opened
> and first read from. So it shouldn't be showing up in your subtitles
> regardless of its rendering behavior.

It's a perfectly valid character for other purposes; it's coming up in
the middle of pieces of text, which should be 100% legal. No, it's a
font problem.

ChrisA


Steven D'Aprano-8
In reply to this post by Steven D'Aprano-8
On Thu, 11 Jun 2015 10:09 am, Chris Angelico wrote:

> On Thu, Jun 11, 2015 at 3:11 AM, Steven D'Aprano <steve at pearwood.info>
> wrote:
>> (Oh, and for the record, there are at least two non-breaking spaces in
>> Unicode, U+00A0 "NO-BREAK SPACE" and U+202F "NARROW NO-BREAK SPACE".)
>>
>> http://www.unicode.org/charts/PDF/U0080.pdf
>> http://www.unicode.org/charts/PDF/U2000.pdf
>
> And U+FEFF "ZERO WIDTH NO-BREAK SPACE",

No, despite the name, that is not a space character, it is a formatting
character. Due to Unicode's stability policy, the name is stuck forever,
but it should not be treated as a space character:

py> import unicodedata
py> unicodedata.category(' ')
'Zs'
py> unicodedata.category('\u00A0')  # NBSP
'Zs'
py> unicodedata.category('\uFEFF')  # ZWNBSP
'Cf'


Ideally, outside of the BOM, you should never come across a ZWNBSP. You
should use U+2060 WORD JOINER instead. But if you do come across one
outside of the BOM, it should be treated as a legitimate non-space
character:

http://www.unicode.org/faq/utf_bom.html#bom6
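
The same distinction is visible from str.isspace() and str.split(). A short sketch comparing the two no-break spaces with ZWNBSP and WORD JOINER (the report dictionary is just a convenient way to collect the results):

```python
import unicodedata

report = {}
for ch in ["\u00A0", "\u202F", "\uFEFF", "\u2060"]:
    # Record general category, whitespace status, and the effect on split().
    report[ch] = (unicodedata.category(ch), ch.isspace(), ("a" + ch + "b").split())
    print("U+%04X" % ord(ch), unicodedata.name(ch), report[ch])
```

The two Zs characters count as whitespace and are split on; the two Cf characters are not whitespace and stay inside the word.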

Although ZWNBSP is a "default ignorable" code point, I believe that the font
is well within its rights to show it with a visible glyph:

    "Fonts can contain glyphs intended for visible display of
    default ignorable code points that would otherwise be
    rendered invisibly when not supported."

http://www.unicode.org/faq/unsup_char.html


> notable because it's also used as
> the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've
> been fighting with VLC Media Player over the font it uses for subtitles;
> for some bizarre reason, that font represents U+FEFF not with zero pixels
> of emptiness, but with a box containing the letters "ZWN" "BSP" on two
> lines. Yeah, because that totally takes up zero width and looks like blank
> space.

Why do the subtitles contain ZWNBSP in the first place? Surely they're not
English subtitles?


--
Steven



Chris Angelico
On Thu, Jun 11, 2015 at 12:26 PM, Steven D'Aprano <steve at pearwood.info> wrote:

> No, despite the name, that is not a space character, it is a formatting
> character. Due to Unicode's stability policy, the name is stuck forever,
> but it should not be treated as a space character:
>
> py> unicodedata.category(' ')
> 'Zs'
> py> unicodedata.category('\u00A0')  # NBSP
> 'Zs'
> py> unicodedata.category('\uFEFF')  # ZWNBSP
> 'Cf'
>
>
> Ideally, outside of the BOM, you should never come across a ZWNBSP. You
> should use U+2060 WORD JOINER instead. But if you do come across one
> outside of the BOM, it should be treated as a legitimate non-space
> character:
>
> http://www.unicode.org/faq/utf_bom.html#bom6
>
> Although ZWNBSP is a "default ignorable" code point, I believe that the font
> is well within its rights to show it with a visible glyph:
>
>     "Fonts can contain glyphs intended for visible display of
>     default ignorable code points that would otherwise be
>     rendered invisibly when not supported."
>
> http://www.unicode.org/faq/unsup_char.html

Huh. Okay, my bad. I was under the impression that it was supposed to
take up no width, as the name implies, but stability trumps logic
sometimes. Learn something new every day.

>> notable because it's also used as
>> the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've
>> been fighting with VLC Media Player over the font it uses for subtitles;
>> for some bizarre reason, that font represents U+FEFF not with zero pixels
>> of emptiness, but with a box containing the letters "ZWN" "BSP" on two
>> lines. Yeah, because that totally takes up zero width and looks like blank
>> space.
>
> Why do the subtitles contain ZWNBSP in the first place? Surely they're not
> English subtitles?

No, they're not :) The character comes up in the Cantonese and
Japanese subs for Once Upon A December.

http://youtu.be/CEpcUeWP0bg
http://youtu.be/WFZAaHrHens

Possibly some others in the series as well. It may well be a fault in
the subtitles, but most programs I've seen don't show U+FEFF as a big
fat box.

ChrisA


random832@fastmail.us
On Wed, Jun 10, 2015, at 23:05, Chris Angelico wrote:
> http://youtu.be/CEpcUeWP0bg
> http://youtu.be/WFZAaHrHens

An example of the actual subtitle text would be more useful than a
youtube link to the video, since we're unlikely to be able to see what
context the character appears in if our client doesn't show it. (I don't
think the default youtube player does). And you haven't even included a
time code.


Steven D'Aprano-8
In reply to this post by Steven D'Aprano-8
On Thu, 11 Jun 2015 01:05 pm, Chris Angelico wrote:
[...]

>> Why do the subtitles contain ZWNBSP in the first place? Surely they're
>> not English subtitles?
>
> No, they're not :) The character comes up in the Cantonese and
> Japanese subs for Once Upon A December.
>
> http://youtu.be/CEpcUeWP0bg
> http://youtu.be/WFZAaHrHens
>
> Possibly some others in the series as well. It may well be a fault in
> the subtitles, but most programs I've seen don't show U+FEFF as a big
> fat box.

I think that for backwards compatibility, applications (or fonts) are
permitted to treat U+FEFF as a zero-width invisible character, so perhaps
you can raise a feature request with VLC.



--
Steven



Chris Angelico
In reply to this post by random832@fastmail.us
On Thu, Jun 11, 2015 at 1:18 PM,  <random832 at fastmail.us> wrote:
> On Wed, Jun 10, 2015, at 23:05, Chris Angelico wrote:
>> http://youtu.be/CEpcUeWP0bg
>> http://youtu.be/WFZAaHrHens
>
> An example of the actual subtitle text would be more useful than a
> youtube link to the video, since we're unlikely to be able to see what
> context the character appears in if our client doesn't show it. (I don't
> think the default youtube player does). And you haven't even included a
> time code.

Unfortunately I can't really offer anything better, as the text I saw
was after a lot of processing (youtube-dl, then some other
post-processing), and I don't actually remember which file it was that
bugged me about this, now. But the subs/annotations (visible in the
default player if you turn on "Subtitles" down the bottom) do include
U+FEFF; in each case, it's on the very last line of the song, although
that's not where I remember it occurring.
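
For anyone who does have the subtitle text to hand, something along these lines would pin down exactly where the character occurs. This is a sketch only: the function name find_format_chars is made up, and it reports every character in general category Cf, not just U+FEFF.

```python
import unicodedata

def find_format_chars(text):
    """Return (line, column, code point, name) for each Cf character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.category(ch) == "Cf":
                hits.append((lineno, col, "U+%04X" % ord(ch),
                             unicodedata.name(ch, "<unnamed>")))
    return hits

hits = find_format_chars("abc\nde\ufefff")
print(hits)   # [(2, 3, 'U+FEFF', 'ZERO WIDTH NO-BREAK SPACE')]
```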

ChrisA


Chris Angelico
In reply to this post by Steven D'Aprano-8
On Thu, Jun 11, 2015 at 1:27 PM, Steven D'Aprano <steve at pearwood.info> wrote:

> On Thu, 11 Jun 2015 01:05 pm, Chris Angelico wrote:
> [...]
>>> Why do the subtitles contain ZWNBSP in the first place? Surely they're
>>> not English subtitles?
>>
>> No, they're not :) The character comes up in the Cantonese and
>> Japanese subs for Once Upon A December.
>>
>> http://youtu.be/CEpcUeWP0bg
>> http://youtu.be/WFZAaHrHens
>>
>> Possibly some others in the series as well. It may well be a fault in
>> the subtitles, but most programs I've seen don't show U+FEFF as a big
>> fat box.
>
> I think that for backwards compatibility, applications (or fonts) are
> permitted to treat U+FEFF as a zero-width invisible character, so perhaps
> you can raise a feature request with VLC.

Yeah. Well, like I said - learn something new every day. I didn't know
it wasn't a bug. (Though it'd still be a font issue, not a VLC one.
With other fonts, it comes up looking different, in some cases
invisible. Unfortunately, the fonts that look good aren't the fonts
that have glyphs for all characters, so I need to figure out why font
substitution isn't working right. But that's a separate issue.)

ChrisA