Inconsistent pylupdate5 behaviour on UTF8 data

Giuseppe Corbelli
Hi all,
I found a puzzling inconsistency in pylupdate5 behaviour between the
Linux and Windows versions.
Scenario: I am extracting translatable strings from Python modules. The
files are saved as UTF-8; I run pylupdate5 and get different
representations in the XML output.

pylupdate5 v5.14.1, installed as a Debian package on Linux and via a
fresh pip install in a venv on Windows 10.

As you can find in the attached test data:

- on Windows the non-ASCII characters (e.g. 'ç', U+00E7 LATIN SMALL
LETTER C WITH CEDILLA, UTF-8 bytes c3 a7) are converted to
<source>this needs UTF8 encoding:
&#xc3;&#xa7;&#xc2;&#xb0;&#xc2;&#xa7;</source>

- on Linux the same characters are correctly converted to <source>this
needs UTF8 encoding: &#xe7;&#xb0;&#xa7;</source>

So it seems that on Windows each byte of the UTF-8 encoded string is
written as its own XML numeric character reference, while on Linux each
character (two bytes in UTF-8) is correctly written as a single
reference to its Unicode code point.
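
The two outputs can be reproduced in a few lines of Python. This is only
a sketch of the symptom, not of what pylupdate5 does internally; the
helper names char_refs and byte_refs are made up for illustration, and
the sample string is written with escapes so the snippet does not depend
on its own file encoding.

```python
def char_refs(text):
    """Write every non-ASCII character as one XML numeric character reference."""
    return "".join(c if ord(c) < 0x80 else f"&#x{ord(c):x};" for c in text)


def byte_refs(text):
    """Treat each UTF-8 byte as if it were a character, then escape it."""
    return char_refs(text.encode("utf-8").decode("latin-1"))


sample = "\u00e7\u00b0\u00a7"  # the three non-ASCII characters of the test string

print(char_refs(sample))  # &#xe7;&#xb0;&#xa7;  (the Linux output)
print(byte_refs(sample))  # &#xc3;&#xa7;&#xc2;&#xb0;&#xc2;&#xa7;  (the Windows output)
```

The second form is what appears when UTF-8 bytes are read as Latin-1
somewhere along the way, which is at least consistent with the Windows
result.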

Am I doing something wrong?

Thanks
--
Giuseppe Corbelli

_______________________________________________
PyQt mailing list    [hidden email]
https://www.riverbankcomputing.com/mailman/listinfo/pyqt

Attachments:
it_IT.ts.linux (520 bytes)
it_IT.ts.win32 (538 bytes)
module.py (124 bytes)
test.pro (89 bytes)

Re: Inconsistent pylupdate5 behaviour on UTF8 data

Phil Thompson-5
On 12/02/2020 15:27, Giuseppe Corbelli wrote:

> Hi all
> I found a puzzling pylupdate5 behaviour inconsistency between Linux
> and Windows versions.
> Scenario: I am extracting translatable strings from python modules.
> The files are saved as UTF8, I run pylupdate and get different
> representations in the XML output.
>
> pylupdate5 v5.14.1 as Debian package on Linux and fresh pip install in
> a venv on Windows 10.
>
> As you can find in the attached test data:
>
> - on windows the 'ç' character (U+00E7 ç c3 a7 LATIN SMALL LETTER C
> WITH CEDILLA) is converted to <source>this needs UTF8 encoding:
> &#xc3;&#xa7;&#xc2;&#xb0;&#xc2;&#xa7;</source>
>
> - on linux the same 'ç' correctly converts to <source>this needs UTF8
> encoding: &#xe7;&#xb0;&#xa7;</source>
>
> So it seems that on windows each byte of the utf8 string is replaced
> with its unicode point in xml numeric character format, while on linux
> the same applies (correctly) to the character itself (formed by two
> bytes in UTF8).
>
> Am I doing something wrong?

I can't reproduce this - I get identical results on Windows, Linux and
macOS.

If you want to try and debug your own installation then look at
evilBytes() in qpy\pylupdate\metatranslator.cpp

Phil

Re: Inconsistent pylupdate5 behaviour on UTF8 data

Giuseppe Corbelli
On 2/16/20 2:01 PM, Phil Thompson wrote:

> I can't reproduce this - I get identical results on Windows, Linux and
> macOS.
>
> If you want to try and debug your own installation then look at
> evilBytes() in qpy\pylupdate\metatranslator.cpp

It turns out the problem is related to re-parsing the existing XML (or
maybe something else that escapes me). The same dataset as in my
previous email applies.

This is what happens if you run pylupdate5 (5.14.1) twice in a row on a
Windows 10 box:

(venv_latest) C:\devel\Dynamometer\Supervisor\norms>pylupdate5 -verbose test.pro
Updating 'locale/it_IT.ts'...
     Found 2 source texts (2 new and 0 already existing)

(venv_latest) C:\devel\Dynamometer\Supervisor\norms>pylupdate5 -verbose test.pro
Updating 'locale/it_IT.ts'...
     Found 2 source texts (1 new and 1 already existing)
     Kept 0 obsolete translations
     Removed 1 obsolete untranslated entry

On the second run the UTF-8 entry gets mangled: it is detected as a new
string, and the entry from the first run is removed as obsolete.
Everything is fine on Linux with the same pylupdate5 version.
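
A quick way to check whether repeated runs are stable is to hash the .ts
file after each run. A minimal sketch, assuming pylupdate5 is on PATH and
that the paths match the session above; digests_after_runs is a
hypothetical helper, not part of PyQt.

```python
import hashlib
import subprocess
from pathlib import Path


def digests_after_runs(cmd, ts_file, runs=2):
    """Run cmd several times and return the SHA-256 of ts_file after each run."""
    digests = []
    for _ in range(runs):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
        digests.append(hashlib.sha256(Path(ts_file).read_bytes()).hexdigest())
    return digests


# Example call (not executed here, it needs pylupdate5 and test.pro):
# first, second = digests_after_runs(["pylupdate5", "-verbose", "test.pro"],
#                                    "locale/it_IT.ts")
# print("stable" if first == second else "output changed between runs")
```

Differing digests only say that the file changed; the .ts file still has
to be diffed to see which entry was rewritten.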

Digging some more...

--
Giuseppe Corbelli

Re: Inconsistent pylupdate5 behaviour on UTF8 data

Giuseppe Corbelli
On 2/18/20 11:37 AM, Giuseppe Corbelli wrote:

> Turns out that there's something in XML re-parsing (or maybe something
> else that escapes me). Same dataset as my previous email applies.
>
> This is what happens if you run pylupdate (5.14.1) two times in a row in
> a windows 10 box:
>
> (venv_latest) C:\devel\Dynamometer\Supervisor\norms>pylupdate5 -verbose
> test.pro
> Updating 'locale/it_IT.ts'...
>      Found 2 source texts (2 new and 0 already existing)
>
> (venv_latest) C:\devel\Dynamometer\Supervisor\norms>pylupdate5 -verbose
> test.pro
> Updating 'locale/it_IT.ts'...
>      Found 2 source texts (1 new and 1 already existing)
>      Kept 0 obsolete translations
>      Removed 1 obsolete untranslated entry
>
> The second time the UTF8 entry gets screwed up.
> Everything is fine on Linux, same pylupdate version.
>
> Digging some more...

Setting CODECFORTR=UTF-8 in the .pro file works around the issue.
It sets the 'encoding' attribute on the '<message>' elements, and
non-ASCII characters are saved as UTF-8 instead of XML numeric character
references.
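
To see which entries carry the attribute, the .ts file can be inspected
with Python's standard xml.etree.ElementTree. A minimal sketch on a
hand-written fragment: TS_FRAGMENT and message_encodings are
illustrative only, not actual pylupdate5 output or API.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment: one entry with literal UTF-8 text and the
# encoding attribute, one entry with numeric character references.
TS_FRAGMENT = """<TS><context>
<message encoding="UTF-8">
    <source>this needs UTF8 encoding: \u00e7\u00b0\u00a7</source>
    <translation type="unfinished"></translation>
</message>
<message>
    <source>this needs UTF8 encoding: &#xe7;&#xb0;&#xa7;</source>
    <translation type="unfinished"></translation>
</message>
</context></TS>"""


def message_encodings(ts_xml):
    """Return (encoding attribute, decoded source text) for every <message>."""
    root = ET.fromstring(ts_xml)
    return [(m.get("encoding"), m.findtext("source")) for m in root.iter("message")]


for encoding, source in message_encodings(TS_FRAGMENT):
    print(encoding, ascii(source))
```

An XML parser decodes both forms to the same text, so the two entries
differ only in the attribute.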

I will stop here as it solves my problem. Phil, if you have no strategic
interest in pursuing this stuff maybe this is worth mentioning in the
documentation.

Thanks
--
Giuseppe Corbelli

Re: Inconsistent pylupdate5 behaviour on UTF8 data

Phil Thompson-5
On 18/02/2020 10:57, Giuseppe Corbelli wrote:

> Setting CODECFORTR=UTF-8 in .pro works around the issue.
> It sets the 'encoding' attribute on '<message>' entities, and
> non-ascii chars are saved as UTF8 instead of XML entities.
>
> I will stop here as it solves my problem. Phil, if you have no
> strategic interest in pursuing this stuff maybe this is worth
> mentioning in the documentation.

What if you use trUtf8() instead of tr()?

Phil

Re: Inconsistent pylupdate5 behaviour on UTF8 data

Giuseppe Corbelli
On 2/18/20 5:58 PM, Phil Thompson wrote:
> What if you use trUtf8() instead if tr()?

I explored all the combinations I could think of on Windows 10, with
PyQt 5.14.1 from pip and Linguist 5.13.2, and I could NOT find any
working combination. The test results are attached below. Rather lengthy
and boring, I fear.

If a gist is preferable:
https://gist.github.com/cowo78/26057f575ddfa3ee20a0b636acd894ff
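
The combinations tested below can be enumerated mechanically. A sketch,
assuming a .pro file that lists only one source and one translation file
(the SOURCES and TRANSLATIONS values are placeholders, and pro_file and
case_matrix are hypothetical helpers):

```python
from itertools import product


def pro_file(codecforsrc, codecfortr):
    """Return .pro text with each codec line either active or commented out."""
    def line(name, enabled):
        return ("" if enabled else "# ") + name + " = UTF-8"

    return "\n".join([
        "SOURCES = module.py",             # placeholder file name
        "TRANSLATIONS = locale/it_IT.ts",  # placeholder file name
        line("CODECFORSRC", codecforsrc),
        line("CODECFORTR", codecfortr),
    ]) + "\n"


def case_matrix():
    """List (function, codecforsrc, codecfortr) in the order used in this report."""
    return [
        (func, src, tr)
        for func, tr, src in product(("trUtf8()", "tr()"), (False, True), (False, True))
    ]


for func, src, tr in case_matrix():
    print(func, "CODECFORSRC" if src else "-", "CODECFORTR" if tr else "-")
```

That gives two sections of four cases each, eight runs in total.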


Section A - using trUtf8() in code
===============================================================================
Using trUtf8 I ALWAYS get a 'Non-ASCII character detected in trUtf8
string' warning

Case 1 - NOT working
-------------------------------------------------------------------------------
trUtf8()
# CODECFORSRC = UTF-8
# CODECFORTR = UTF-8

Message created:
<message encoding="UTF-8">
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: ç°§</source>
     <translation type="unfinished"></translation>
</message>

Repeated pylupdate5 runs are OK, the same message is consistently generated.

Processed by linguist 5.13.2:
<message>
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: ç°§</source>
     <translation>UTF8</translation>
</message>

Reprocessed by pylupdate5
<message>
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: &#xe7;&#xb0;&#xa7;</source>
     <translation type="obsolete">UTF8</translation>
</message>
<message encoding="UTF-8">
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: ç°§</source>
     <translation type="unfinished"></translation>
</message>


Case 2 - NOT working
-------------------------------------------------------------------------------
trUtf8()
CODECFORSRC = UTF-8
# CODECFORTR = UTF-8

Message created the FIRST time and subsequent ODD runs
<message encoding="UTF-8">
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: 簧</source>
     <translation type="unfinished"></translation>
</message>

Message created the SECOND time and subsequent EVEN runs
<message encoding="UTF-8">
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: ç°§</source>
     <translation type="unfinished"></translation>
</message>
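
The CJK character in the odd runs is what one gets when the Latin-1
bytes of the three characters (e7 b0 a7) are decoded as a single UTF-8
sequence. The snippet only illustrates that arithmetic; whether this is
what pylupdate5 actually does is a guess, and
reinterpret_latin1_as_utf8 is a made-up name.

```python
def reinterpret_latin1_as_utf8(text):
    """Encode text as Latin-1, then decode the resulting bytes as UTF-8."""
    return text.encode("latin-1").decode("utf-8")


original = "\u00e7\u00b0\u00a7"  # the three non-ASCII characters of the test string
raw = original.encode("latin-1")
print(raw.hex(" "))  # e7 b0 a7

result = reinterpret_latin1_as_utf8(original)
print(len(result), f"U+{ord(result):04X}")  # 1 U+7C27
```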


Case 3 - NOT working
-------------------------------------------------------------------------------
trUtf8()
# CODECFORSRC = UTF-8
CODECFORTR = UTF-8

Message created:
<message encoding="UTF-8">
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: ç°§</source>
     <translation type="unfinished"></translation>
</message>

Repeated pylupdate5 runs are OK, the same message is consistently generated.

Processed by linguist 5.13.2:
<message>
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: ç°§</source>
     <translation>utf8</translation>
</message>

Reprocessed by pylupdate5
<message>
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: &#xe7;&#xb0;&#xa7;</source>
     <translation type="obsolete">utf8</translation>
</message>
<message encoding="UTF-8">
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: ç°§</source>
     <translation type="unfinished"></translation>
</message>


Case 4 - NOT working
-------------------------------------------------------------------------------
trUtf8()
CODECFORSRC = UTF-8
CODECFORTR = UTF-8

Message created:
<message encoding="UTF-8">
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: ç°§</source>
     <translation type="unfinished"></translation>
</message>

Repeated pylupdate5 runs are OK, the same message is consistently generated.

Processed by linguist 5.13.2:
<message>
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: ç°§</source>
     <translation>utf8</translation>
</message>

Reprocessed by pylupdate5:
<message>
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: &#xe7;&#xb0;&#xa7;</source>
     <translation type="obsolete">utf8</translation>
</message>
<message encoding="UTF-8">
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: ç°§</source>
     <translation type="unfinished"></translation>
</message>


Section B - using tr() in code
===============================================================================
Case 1 - NOT working
-------------------------------------------------------------------------------
tr()
# CODECFORSRC = UTF-8
# CODECFORTR = UTF-8

Message created:
<message>
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: &#xc3;&#xa7;&#xc2;&#xb0;&#xc2;&#xa7;</source>
     <translation type="unfinished"></translation>
</message>

Repeated runs OK.

Linguist shows WRONG characters as the source is incorrectly formatted.


Case 2 - NOT working
-------------------------------------------------------------------------------
tr()
CODECFORSRC = UTF-8
# CODECFORTR = UTF-8

Message created the FIRST time and subsequent ODD runs
<message>
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: &#xe7;&#xb0;&#xa7;</source>
     <translation type="unfinished"></translation>
</message>

Message created the SECOND time and subsequent EVEN runs
<message>
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: &#xc3;&#xa7;&#xc2;&#xb0;&#xc2;&#xa7;</source>
     <translation type="unfinished"></translation>
</message>


Case 3 - NOT working
-------------------------------------------------------------------------------
tr()
# CODECFORSRC = UTF-8
CODECFORTR = UTF-8

Message created:
<message>
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: &#xc3;&#xa7;&#xc2;&#xb0;&#xc2;&#xa7;</source>
     <translation type="unfinished"></translation>
</message>

Linguist shows WRONG characters as the source is incorrectly formatted.


Case 4 - NOT working
-------------------------------------------------------------------------------
tr()
CODECFORSRC = UTF-8
CODECFORTR = UTF-8

Message created:
<message encoding="UTF-8">
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: ç°§</source>
     <translation type="unfinished"></translation>
</message>

Repeated pylupdate5 runs are OK, the same message is consistently generated.

Processed by linguist 5.13.2:
<message>
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: ç°§</source>
     <translation>utf8</translation>
</message>

Reprocessed by pylupdate5:
<message>
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: &#xe7;&#xb0;&#xa7;</source>
     <translation>utf8</translation>
</message>

Reprocessed by pylupdate5 on subsequent runs:
<message>
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: &#xe7;&#xb0;&#xa7;</source>
     <translation type="obsolete">utf8</translation>
</message>
<message>
     <location filename="../translations_for_testsuite.py" line="6"/>
     <source>this needs UTF8 encoding: &#xc3;&#xa7;&#xc2;&#xb0;&#xc2;&#xa7;</source>
     <translation type="unfinished"></translation>
</message>

Those who survived until here must be brave.

--
Giuseppe Corbelli