[Django] #30892: Slugify and SlugField with Diacritics

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

[Django] #30892: Slugify and SlugField with Diacritics

Django
#30892: Slugify and SlugField with Diacritics
--------------------------------------+------------------------
               Reporter:  originell   |          Owner:  nobody
                   Type:  Bug         |         Status:  new
              Component:  Utilities   |        Version:  2.2
               Severity:  Normal      |       Keywords:
           Triage Stage:  Unreviewed  |      Has patch:  0
    Needs documentation:  0           |    Needs tests:  0
Patch needs improvement:  0           |  Easy pickings:  0
                  UI/UX:  0           |
--------------------------------------+------------------------
 While working on an international project, we discovered that the
 turkish/azerbaijani letter `İ` can not be properly processed when
 `SlugField` and `slugify` are run with `allow_unicode=True`.

 The project itself runs with Django 2.2.6 and Wagtail 2.6.2. I first
 talked about this in the Wagtail Support Channel and while researching
 further, discovered that this is a Django/Python related issue. This was
 tested on Python 3.6 and on Python 3.7.

 (quick shoutout to Matt Wescott @gasmanic of Wagtail Fame for being a
 sparing partner in this)

 There is a rather detailed analysis (README) in a git repo I created
 https://github.com/originell/django-wagtail-turkish-i - it was also the
 basis for my initial call for help in wagtail's support channel. Meanwhile
 I have extended it with a Django-only project, as to be a 100% sure this
 has nothing to do with Wagtail.

 I was not able to find anything similar in trac. While I encourage whoever
 is reading this to actually read the README in the git repo, I want to
 honor your time and will try to provide a more concise version of the bug
 here.

 ===  Explanation

 `models.py`

 {{{#!python
 from django.db import models
 from django.utils.text import slugify


 class Page(models.Model):
     title = models.CharField(max_length=255)
     slug = models.SlugField(allow_unicode=True)

     def __str__(self):
         return self.title
 }}}

 Using this in a shell/test like a (Model)Form might:

 {{{#!python
 from django.utils.text import slugify

 page = Page(title="Hello İstanbul")
 page.slug = slugify(page.title, allow_unicode=True)
 page.full_clean()
 }}}

 `full_clean()` then raises

     django.core.exceptions.ValidationError: {'slug': ["Enter a valid
 'slug' consisting of Unicode letters, numbers, underscores, or hyphens."]}

 Why is that?

 `slugify` does the following internally:

 {{{#!python
 re.sub(r'[^\w\s-]', '', value).strip().lower()
 }}}

 Thing is, Python's `.lower()` of the `İ` in `İstanbul` looks like this:

 {{{#!python
 >>> [unicodedata.name(character) for character in 'İ'.lower()]
 ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
 }}}

 So, while `slugify()` finishes, the result is then passed into `SlugField`
 which uses the `slug_unicode_re` (`^[-\w]+\Z`). However, the ''Combining
 Dot Above'' is not a valid `\w`:

 {{{#!python
 >>>  [(character, unicodedata.name(character),
 slug_unicode_re.match(character)) for character in 'İ'.lower()]
 [
  ('i', 'LATIN SMALL LETTER I', <re.Match object; span=(0, 1), match='i'>),
  # EUREKA!!
  ('̇', 'COMBINING DOT ABOVE', None)
 ]
 }}}

 So that's why the `ValidationError` is raised.

 === Proposed Solution

 The culprit seems to be the order in which `lower()` is called in slugify.
 The assumption that the lowercase version of a `re.sub`-ed string is still
 a valid `slug_unicode_re`, does not seem to hold true.

 Hence, instead of doing this in `slugify()`

 {{{#!python
 re.sub(r'[^\w\s-]', '', value).strip().lower()
 }}}

 It might be better to do it like this

 {{{#!python
 re.sub(r'[^\w\s-]', '', value.lower()).strip()
 }}}

 === Is Python the actual culprit?

 Yeah it might be. Matt (@gasmanic) urged me to also take a look if Python
 might be doing this wrong.

 -   The `İ` is the ''Latin Capital Letter I with Dot Above''. It's
 codepoint is `U+0130` According to the chart for the Latin Extended-A set
 (https://www.unicode.org/charts/PDF/U0100.pdf), it's lowercase version is
 `U+0069`.
 -   `U+0069` lives in the ''C0 Controls and Basic Latin set''
 (https://www.unicode.org/charts/PDF/U0000.pdf). Lo and behold: it is the
 ''Latin small letter I''. So a latin lowercase `i`.

 Does this really mean that Python is doing something weird here by adding
 the ''Combining dot above''? Honestly, I can't imagine that and I am
 probably missing an important thing here because my view is too naive.

 ---

 I hope this shorter explanation makes sense. If it does not, please try to
 read through the detailed analysis in the repo
 (https://github.com/originell/django-wagtail-
 turkish-i/blob/master/README.md). If that also does not make a ton of
 sense, let me know.

 In any case, thank you for taking the time to read this bug report.
 Looking forward to feedback and thoughts. I am happy to oblige in any
 capacity necessary.

--
Ticket URL: <https://code.djangoproject.com/ticket/30892>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

--
You received this message because you are subscribed to the Google Groups "Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/django-updates/052.5691f39139d866e6d30d72cb47f1a43b%40djangoproject.com.
Reply | Threaded
Open this post in threaded view
|

Re: [Django] #30892: slugify() doesn't return a valid slug for "İ". (was: Slugify and SlugField with Diacritics)

Django
#30892: slugify() doesn't return a valid slug for "İ".
---------------------------+------------------------------------
     Reporter:  Luis Nell  |                    Owner:  nobody
         Type:  Bug        |                   Status:  new
    Component:  Utilities  |                  Version:  master
     Severity:  Normal     |               Resolution:
     Keywords:             |             Triage Stage:  Accepted
    Has patch:  0          |      Needs documentation:  0
  Needs tests:  0          |  Patch needs improvement:  0
Easy pickings:  0          |                    UI/UX:  0
---------------------------+------------------------------------
Changes (by felixxm):

 * version:  2.2 => master
 * stage:  Unreviewed => Accepted


Comment:

 Thanks for this ticket. I'm not sure if your solution is correct because
 I'm afraid that we may hit another edge case, nevertheless `slugify()`
 should return a valid slug.

--
Ticket URL: <https://code.djangoproject.com/ticket/30892#comment:1>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

--
You received this message because you are subscribed to the Google Groups "Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/django-updates/067.e8055eca7c61e3f3a475b761e2a0df47%40djangoproject.com.
Reply | Threaded
Open this post in threaded view
|

Re: [Django] #30892: slugify() doesn't return a valid slug for "İ".

Django
In reply to this post by Django
#30892: slugify() doesn't return a valid slug for "İ".
---------------------------+------------------------------------
     Reporter:  Luis Nell  |                    Owner:  nobody
         Type:  Bug        |                   Status:  new
    Component:  Utilities  |                  Version:  master
     Severity:  Normal     |               Resolution:
     Keywords:             |             Triage Stage:  Accepted
    Has patch:  0          |      Needs documentation:  0
  Needs tests:  0          |  Patch needs improvement:  0
Easy pickings:  0          |                    UI/UX:  0
---------------------------+------------------------------------

Comment (by Luis Nell):

 True that.

 I also thought about looping over the characters and simply removing
 everything that does not match the `\w`. However, I threw that away
 because I thought that this might as well hit some edge case where it
 could modify the meaning of a word.

 In any case, if we can find a (more solid) process, I'd be very happy to
 produce a patch and tests for that.

--
Ticket URL: <https://code.djangoproject.com/ticket/30892#comment:2>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

--
You received this message because you are subscribed to the Google Groups "Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/django-updates/067.56b57b948c1196a949a077b74a0618fa%40djangoproject.com.