possible bug in my UCA implementation

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

possible bug in my UCA implementation

James Tauber

My Python Unicode Collation Algorithm implementation is giving  
unexpected results that could be because of:

1. a bug in my code
2. a bug in the DUCET
3. a difference of opinion between the way I think Ancient Greek  
should be collated and the way DUCET thinks so

I'd like to get the opinion of some of you who are more familiar with  
UCA (and perhaps can try my example out on ICU)

For the purposes of testing, say I'm trying to sort the three words:

(1) ᾅδης
(2) Ἄβελ
(3) ἀββά

In my view they should be sorted in the reverse to what they are now,  
but my pyuca code sorts them in the order listed above.

pyuca assigns the words the following sort keys:

(1) ['0x124e', '0x0', '0x0', '0x0', '0x1252', '0x1257', '0x126a',  
'0x0', '0x20', '0x2a', '0x32', '0x97', '0x20', '0x20', '0x20', '0x0',  
'0x2', '0x2', '0x2', '0x2', '0x2', '0x2', '0x19', '0x0', '0x3b1',  
'0x314', '0x301', '0x345', '0x3b4', '0x3b7', '0x3c2']
(2) ['0x124e', '0x0', '0x0', '0x124f', '0x1253', '0x125c', '0x0',  
'0x20', '0x22', '0x32', '0x20', '0x20', '0x20', '0x0', '0x8', '0x2',  
'0x2', '0x2', '0x2', '0x2', '0x0', '0x391', '0x313', '0x301',  
'0x3b2', '0x3b5', '0x3bb']
(3) ['0x124e', '0x0', '0x124f', '0x124f', '0x124e', '0x0', '0x0',  
'0x20', '0x22', '0x20', '0x20', '0x20', '0x32', '0x0', '0x2', '0x2',  
'0x2', '0x2', '0x2', '0x2', '0x0', '0x3b1', '0x313', '0x3b2',  
'0x3b2', '0x3b1', '0x301']

The problem is that ᾅ (the first character of (1)) expands to 4  
collation elements, Ἄ (the first character of (2)) to 3 and ἀ (the  
first character of (3)) to 2 and as a result and, because all but the  
first element is zero, they are comparing less, just by virtue of  
having more collation elements.

I don't even understand why these letters are being treated as  
expansions rather than simply taking advantage of the secondary and  
tertiary levels, but sure enough that is how the DUCET describes them.

Am I missing something fundamental in the algorithm? Or is it  
possible the DUCET is wrong?

James
_______________________________________________
I18n-sig mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/i18n-sig
Reply | Threaded
Open this post in threaded view
|

Re: possible bug in my UCA implementation

James Tauber

I discovered I wasn't converting to NFD first but this doesn't solve  
the problem, it just explains why expansions were used.

Even using NFD, I get the same result.

James

On 30/01/2006, at 4:35 PM, James Tauber wrote:

>
> My Python Unicode Collation Algorithm implementation is giving  
> unexpected results that could be because of:
>
> 1. a bug in my code
> 2. a bug in the DUCET
> 3. a difference of opinion between the way I think Ancient Greek  
> should be collated and the way DUCET thinks so
>
> I'd like to get the opinion of some of you who are more familiar  
> with UCA (and perhaps can try my example out on ICU)
>
> For the purposes of testing, say I'm trying to sort the three words:
>
> (1) ᾅδης
> (2) Ἄβελ
> (3) ἀββά
>
> In my view they should be sorted in the reverse to what they are  
> now, but my pyuca code sorts them in the order listed above.
>
> pyuca assigns the words the following sort keys:
>
> (1) ['0x124e', '0x0', '0x0', '0x0', '0x1252', '0x1257', '0x126a',  
> '0x0', '0x20', '0x2a', '0x32', '0x97', '0x20', '0x20', '0x20',  
> '0x0', '0x2', '0x2', '0x2', '0x2', '0x2', '0x2', '0x19', '0x0',  
> '0x3b1', '0x314', '0x301', '0x345', '0x3b4', '0x3b7', '0x3c2']
> (2) ['0x124e', '0x0', '0x0', '0x124f', '0x1253', '0x125c', '0x0',  
> '0x20', '0x22', '0x32', '0x20', '0x20', '0x20', '0x0', '0x8',  
> '0x2', '0x2', '0x2', '0x2', '0x2', '0x0', '0x391', '0x313',  
> '0x301', '0x3b2', '0x3b5', '0x3bb']
> (3) ['0x124e', '0x0', '0x124f', '0x124f', '0x124e', '0x0', '0x0',  
> '0x20', '0x22', '0x20', '0x20', '0x20', '0x32', '0x0', '0x2',  
> '0x2', '0x2', '0x2', '0x2', '0x2', '0x0', '0x3b1', '0x313',  
> '0x3b2', '0x3b2', '0x3b1', '0x301']
>
> The problem is that ᾅ (the first character of (1)) expands to 4  
> collation elements, Ἄ (the first character of (2)) to 3 and ἀ  
> (the first character of (3)) to 2 and as a result and, because all  
> but the first element is zero, they are comparing less, just by  
> virtue of having more collation elements.
>
> I don't even understand why these letters are being treated as  
> expansions rather than simply taking advantage of the secondary and  
> tertiary levels, but sure enough that is how the DUCET describes them.
>
> Am I missing something fundamental in the algorithm? Or is it  
> possible the DUCET is wrong?
>
> James

_______________________________________________
I18n-sig mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/i18n-sig
Reply | Threaded
Open this post in threaded view
|

Re: possible bug in my UCA implementation

"Martin v. Löwis"
James Tauber wrote:
> I discovered I wasn't converting to NFD first but this doesn't solve  
> the problem, it just explains why expansions were used.
>
> Even using NFD, I get the same result.

If you want to get a response to your message, you probably need to
provide more explanation of background, so people can easier understand
your reasoning. I, for one, have no idea what DUCET is.

Regards,
Martin
_______________________________________________
I18n-sig mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/i18n-sig
Reply | Threaded
Open this post in threaded view
|

Re: possible bug in my UCA implementation

James Tauber
DUCET is the Unicode Consortium's Default Unicode Collation Element  
Table.

The relevant Unicode TS is #10:

http://www.unicode.org/reports/tr10/

which is what I've done a preliminary implementation of.

i18n-sig is probably the wrong place to be asking but I thought there  
might be people here that have direct experience with DUCET and the UCA.

James


On 13/02/2006, at 12:10 PM, Martin v. Löwis wrote:

> James Tauber wrote:
>> I discovered I wasn't converting to NFD first but this doesn't solve
>> the problem, it just explains why expansions were used.
>>
>> Even using NFD, I get the same result.
>
> If you want to get a response to your message, you probably need to
> provide more explanation of background, so people can easier  
> understand
> your reasoning. I, for one, have no idea what DUCET is.
>
> Regards,
> Martin

_______________________________________________
I18n-sig mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/i18n-sig
Reply | Threaded
Open this post in threaded view
|

Re: possible bug in my UCA implementation

James Tauber
In reply to this post by James Tauber

Okay, I've fixed the problem. It did turn out to be a bug in my  
implementation.

The core of the algorithm was implemented as

        for level in range(4):
                if level:
                  sort_key.append(0) # level separator
                for element in collation_elements:
                        ce_l = int(element[1][level], 16)
                        sort_key.append(ce_l)


instead of

        for level in range(4):
                if level:
                  sort_key.append(0) # level separator
                for element in collation_elements:
                        ce_l = int(element[1][level], 16)
                        if ce_l:
                                sort_key.append(ce_l)

In  other words, the UCA itself makes it clear (Step S3.5) that only  
non-zero CE_L should be appended.

I'll update the code on my website shortly.

James


On 13/02/2006, at 12:02 PM, James Tauber wrote:

>
> I discovered I wasn't converting to NFD first but this doesn't solve
> the problem, it just explains why expansions were used.
>
> Even using NFD, I get the same result.
>
> James
>
> On 30/01/2006, at 4:35 PM, James Tauber wrote:
>
>>
>> My Python Unicode Collation Algorithm implementation is giving
>> unexpected results that could be because of:
>>
>> 1. a bug in my code
>> 2. a bug in the DUCET
>> 3. a difference of opinion between the way I think Ancient Greek
>> should be collated and the way DUCET thinks so
>>
>> I'd like to get the opinion of some of you who are more familiar
>> with UCA (and perhaps can try my example out on ICU)
>>
>> For the purposes of testing, say I'm trying to sort the three words:
>>
>> (1) ᾅδης
>> (2) Ἄβελ
>> (3) ἀββά
>>
>> In my view they should be sorted in the reverse to what they are
>> now, but my pyuca code sorts them in the order listed above.
>>
>> pyuca assigns the words the following sort keys:
>>
>> (1) ['0x124e', '0x0', '0x0', '0x0', '0x1252', '0x1257', '0x126a',
>> '0x0', '0x20', '0x2a', '0x32', '0x97', '0x20', '0x20', '0x20',
>> '0x0', '0x2', '0x2', '0x2', '0x2', '0x2', '0x2', '0x19', '0x0',
>> '0x3b1', '0x314', '0x301', '0x345', '0x3b4', '0x3b7', '0x3c2']
>> (2) ['0x124e', '0x0', '0x0', '0x124f', '0x1253', '0x125c', '0x0',
>> '0x20', '0x22', '0x32', '0x20', '0x20', '0x20', '0x0', '0x8',
>> '0x2', '0x2', '0x2', '0x2', '0x2', '0x0', '0x391', '0x313',
>> '0x301', '0x3b2', '0x3b5', '0x3bb']
>> (3) ['0x124e', '0x0', '0x124f', '0x124f', '0x124e', '0x0', '0x0',
>> '0x20', '0x22', '0x20', '0x20', '0x20', '0x32', '0x0', '0x2',
>> '0x2', '0x2', '0x2', '0x2', '0x2', '0x0', '0x3b1', '0x313',
>> '0x3b2', '0x3b2', '0x3b1', '0x301']
>>
>> The problem is that ᾅ (the first character of (1)) expands to 4
>> collation elements, Ἄ (the first character of (2)) to 3 and ἀ
>> (the first character of (3)) to 2 and as a result and, because all
>> but the first element is zero, they are comparing less, just by
>> virtue of having more collation elements.
>>
>> I don't even understand why these letters are being treated as
>> expansions rather than simply taking advantage of the secondary and
>> tertiary levels, but sure enough that is how the DUCET describes  
>> them.
>>
>> Am I missing something fundamental in the algorithm? Or is it
>> possible the DUCET is wrong?
>>
>> James
>
> _______________________________________________
> I18n-sig mailing list
> [hidden email]
> http://mail.python.org/mailman/listinfo/i18n-sig

--
James Tauber                       http://jtauber.com/
journeyman of some   http://jtauber.com/blog/


_______________________________________________
I18n-sig mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/i18n-sig