Stripping unencodable characters from a string

Paul Moore
I want to write a string to an already-open file (sys.stdout, typically). However, I *don't* want encoding errors, and the string could be arbitrary Unicode (in theory). The best way I've found is

    data = data.encode(file.encoding, errors='replace').decode(file.encoding)
    file.write(data)

(I'd probably use backslashreplace rather than replace, but that's a minor point).

Is that the best way? The multiple re-encoding dance seems a bit clumsy, but it was the best I could think of.
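A minimal sketch of the round trip described above, written for Python 3 only. The helper name `make_writable` and the in-memory ASCII stream standing in for sys.stdout are assumptions made for illustration, not part of the original post.

```python
import io

def make_writable(text, encoding, errors="backslashreplace"):
    """Round-trip text through the target encoding so that characters the
    encoding cannot represent are replaced before the text is written."""
    return text.encode(encoding, errors=errors).decode(encoding)

# Stand-in for an already-open text file in strict error mode.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="ascii", errors="strict", newline="\n")

# Writing the raw string would raise UnicodeEncodeError; the sanitized one does not.
stream.write(make_writable("caf\u00e9 \u2603\n", stream.encoding))
stream.flush()
```

With `errors="replace"` instead, each unencodable character becomes `?` rather than a backslash escape.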

Thanks,
Paul.


Dave Angel-4
On 05/05/2015 02:19 PM, Paul Moore wrote:

You need to specify that you're using Python 3.4 (or whichever) when
starting a new thread.

> I want to write a string to an already-open file (sys.stdout, typically). However, I *don't* want encoding errors, and the string could be arbitrary Unicode (in theory). The best way I've found is
>
>      data = data.encode(file.encoding, errors='replace').decode(file.encoding)
>      file.write(data)
>
> (I'd probably use backslashreplace rather than replace, but that's a minor point).
>
> Is that the best way? The multiple re-encoding dance seems a bit clumsy, but it was the best I could think of.
>
> Thanks,
> Paul.
>

If you're going to take charge of the encoding of the file, why not just
open the file in binary, and do it all with
     file.write(data.encode( myencoding, errors='replace') )

I can't see the benefit of two encodes and a decode just to write a
string to the file.
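A small sketch of the binary-mode suggestion, assuming the caller controls how the file is opened. `write_encoded` is a hypothetical helper, and an `io.BytesIO` object stands in for a file opened in binary mode.

```python
import io

def write_encoded(binary_file, text, encoding="ascii", errors="replace"):
    # A single encode: the bytes go straight to the binary file,
    # so no second encode happens inside a text layer.
    binary_file.write(text.encode(encoding, errors=errors))

buf = io.BytesIO()  # stand-in for open(path, "wb")
write_encoded(buf, "na\u00efve")
```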

Alternatively, there's probably a way to open the file using
codecs.open(), and reassign it to sys.stdout.


--
DaveA


Paul Moore
In reply to this post by Paul Moore
On Tuesday, 5 May 2015 20:01:04 UTC+1, Dave Angel  wrote:
> On 05/05/2015 02:19 PM, Paul Moore wrote:
>
> You need to specify that you're using Python 3.4 (or whichever) when
> starting a new thread.

Sorry. 2.6, 2.7, and 3.3+. It's for use in a cross-version library.

> If you're going to take charge of the encoding of the file, why not just
> open the file in binary, and do it all with
>      file.write(data.encode( myencoding, errors='replace') )

I don't have control of the encoding of the file. It's typically sys.stdout, which is already open. I can't replace sys.stdout (because the main program which calls my library code wouldn't like me messing with global state behind its back). And sys.stdout isn't open in binary mode.

> i can't see the benefit of two encodes and a decode just to write a
> string to the file.

Nor can I - that's my point. But if all I have is an open text-mode file with the "strict" error mode, I have to incur one encode, and I have to make sure that no characters are passed to that encode which can't be encoded.

If there was a codec method to identify un-encodable characters, that might be an alternative (although it's quite possible that the encode/decode dance would be faster anyway, as it's mostly in C - not that performance is key here).
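One way such a check could be approximated in pure Python is to try encoding each character on its own and collect the failures. This is only a sketch: `unencodable_chars` is a hypothetical helper, it is slow for long strings, and it assumes that whether a character can be encoded does not depend on its neighbours.

```python
def unencodable_chars(text, encoding):
    """Return the characters of text that the given encoding rejects."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad

# U+00E9 exists in Latin-1, U+2603 (snowman) does not.
rejected = unencodable_chars("a\u00e9\u2603", "latin-1")
```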

> Alternatively, there's probably a way to open the file using
> codecs.open(), and reassign it to sys.stdout.

As I said, I have to work with the file (sys.stdout or whatever) that I'm given. I can't reopen or replace it.

Paul


Jon Ribbens-5
In reply to this post by Paul Moore
On 2015-05-05, Paul Moore <p.f.moore at gmail.com> wrote:

> I want to write a string to an already-open file (sys.stdout,
> typically). However, I *don't* want encoding errors, and the string
> could be arbitrary Unicode (in theory). The best way I've found is
>
>     data = data.encode(file.encoding, errors='replace').decode(file.encoding)
>     file.write(data)
>
> (I'd probably use backslashreplace rather than replace, but that's a
> minor point).
>
> Is that the best way? The multiple re-encoding dance seems a bit
> clumsy, but it was the best I could think of.

Perhaps something like one of:

  file.buffer.write(data.encode(file.encoding, errors="replace"))

or:

  sys.stdout = io.TextIOWrapper(sys.stdout.detach(),
      encoding=sys.stdout.encoding, errors="replace")

(both of which could go wrong in various ways depending on your
circumstances).
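A sketch of the first suggestion, assuming the text file exposes a `buffer` attribute (an `io.StringIO` does not). `write_via_buffer` is a hypothetical helper. The flush before the byte-level write is one of the places this can go wrong: without it, text still pending in the text layer could come out after the bytes written directly.

```python
import io

def write_via_buffer(text_file, text, errors="replace"):
    # Push out anything pending in the text layer first so ordering is preserved.
    text_file.flush()
    text_file.buffer.write(text.encode(text_file.encoding, errors=errors))

# Stand-in for sys.stdout: a strict ASCII text stream over an in-memory buffer.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="ascii", errors="strict", newline="\n")

stream.write("before ")
write_via_buffer(stream, "\u2603 after")
stream.flush()
```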


Marko Rauhamaa
In reply to this post by Paul Moore
Paul  Moore <p.f.moore at gmail.com>:

> Nor can I - that's my point. But if all I have is an open text-mode
> file with the "strict" error mode, I have to incur one encode, and I
> have to make sure that no characters are passed to that encode which
> can't be encoded.

The file-like object you are given carries some baggage. IOW, it's not a
"file" in the sense you are thinking about it. It's some object that
accepts data with its write() method.

Now, Python file-like objects ostensibly implement a common interface.
However, as you are describing here, not all write() methods accept the
same arguments. Text file objects expect str objects while binary file
objects expect bytes objects. Maybe there are yet other file-like
objects that expect some other types of object as their arguments.

Bottom line: Python doesn't fulfill your expectation. Your library can't
operate on generic file-like objects because Python3 doesn't have
generic file-like objects. Your library must do something else. For
example, you could require a binary file object. The caller must then
possibly wrap their actual object inside a converter, which is
relatively trivial in Python.
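A rough sketch of what such a converter might look like, assuming the library writes UTF-8 bytes and the caller only has a text-mode object. `BytesToTextAdapter` and `library_emit` are hypothetical names. An incremental decoder is used so that a multi-byte sequence split across two write() calls is still decoded correctly.

```python
import codecs
import io

class BytesToTextAdapter:
    """Accept bytes via write() and forward decoded text to a text-mode object."""

    def __init__(self, text_file, encoding="utf-8", errors="strict"):
        self._text_file = text_file
        self._decoder = codecs.getincrementaldecoder(encoding)(errors)

    def write(self, data):
        self._text_file.write(self._decoder.decode(data))
        return len(data)

def library_emit(binary_file, text):
    # The library side: it only ever writes bytes.
    binary_file.write(text.encode("utf-8"))

sink = io.StringIO()  # the caller's text-mode object
library_emit(BytesToTextAdapter(sink), "snow \u2603")
```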


Marko


Chris Angelico
In reply to this post by Paul Moore
On Wed, May 6, 2015 at 4:19 AM, Paul  Moore <p.f.moore at gmail.com> wrote:
> I want to write a string to an already-open file (sys.stdout, typically). However, I *don't* want encoding errors, and the string could be arbitrary Unicode (in theory). The best way I've found is
>
>     data = data.encode(file.encoding, errors='replace').decode(file.encoding)
>     file.write(data)
>
> (I'd probably use backslashreplace rather than replace, but that's a minor point).
>
> Is that the best way? The multiple re-encoding dance seems a bit clumsy, but it was the best I could think of.

The simplest solution would be to call ascii() on the string, which
will give you an ASCII-only representation (using backslash escapes).
If your goal is to write Unicode text to a log file in some safe way,
this is what I would be doing.
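For illustration, this is roughly what ascii() produces on Python 3 (the builtin does not exist on Python 2). The result is a repr, so it includes the surrounding quotes; the second line shows a variant without quotes and is an addition for comparison, not part of the suggestion above.

```python
text = "caf\u00e9 \u2603"

# ascii() returns a repr-style string: quoted, with non-ASCII characters escaped.
escaped = ascii(text)

# Without the quotes, using the backslashreplace error handler.
unquoted = text.encode("ascii", "backslashreplace").decode("ascii")
```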

ChrisA


Serhiy Storchaka-2
In reply to this post by Paul Moore
On 05.05.15 21:19, Paul Moore wrote:
> I want to write a string to an already-open file (sys.stdout, typically). However, I *don't* want encoding errors, and the string could be arbitrary Unicode (in theory). The best way I've found is
>
>      data = data.encode(file.encoding, errors='replace').decode(file.encoding)
>      file.write(data)
>
> (I'd probably use backslashreplace rather than replace, but that's a minor point).
>
> Is that the best way? The multiple re-encoding dance seems a bit clumsy, but it was the best I could think of.

There are flaws in this approach.

1) file.encoding can be None (StringIO) or absent (general file-like
object, that implements only write()).

2) When the encoding is UTF-16, UTF-32, UTF-8-SIG, the output will
contain superfluous byte order marks.
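A sketch touching both points. For the first, the hypothetical helper `sanitize_for` looks up the encoding defensively and passes the text through unchanged when none is available. For the second, the last two lines show that each separate encode call with these codecs emits its own byte order mark, which matters if the encoded bytes of every call are written out directly.

```python
import codecs
import io

def sanitize_for(stream, text, errors="backslashreplace"):
    # encoding may be None (StringIO) or missing entirely (write()-only objects).
    encoding = getattr(stream, "encoding", None)
    if not encoding:
        return text
    return text.encode(encoding, errors=errors).decode(encoding)

class WriteOnly:
    """A minimal file-like object that implements only write()."""
    def __init__(self):
        self.chunks = []
    def write(self, text):
        self.chunks.append(text)

memory_stream = io.StringIO()
memory_stream.write(sanitize_for(memory_stream, "snow \u2603"))

bare = WriteOnly()
bare.write(sanitize_for(bare, "snow \u2603"))

# Each independent encode call starts with a byte order mark.
bom_utf8 = "a".encode("utf-8-sig")
bom_utf16 = "a".encode("utf-16")
```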

This is not an easy problem and there is no simple solution. In particular
cases you can create TextIOWrapper(file.buffer,
encoding=file.encoding, errors='replace', newline=file.newlines,
write_through=True) and write to it, but be aware of the limitations.
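A sketch of the temporary-wrapper idea, assuming Python 3.3+ and a file object that has a `buffer` attribute. `write_leniently` is a hypothetical helper, and the newline argument is left at its default here, which is an assumption. One limitation worth noting: a TextIOWrapper closes its underlying buffer when it is garbage-collected, so the sketch detaches the wrapper once the write is done.

```python
import io

def write_leniently(text_file, text, errors="replace"):
    # Flush the original text layer so output stays in order.
    text_file.flush()
    wrapper = io.TextIOWrapper(
        text_file.buffer,
        encoding=text_file.encoding,
        errors=errors,
        write_through=True,
    )
    try:
        wrapper.write(text)
        wrapper.flush()
    finally:
        # Release the buffer so it is not closed along with the wrapper.
        wrapper.detach()

# Stand-in for sys.stdout: a strict ASCII text stream over an in-memory buffer.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="ascii", errors="strict")

stream.write("a ")
write_leniently(stream, "\u2603 b")
stream.flush()
```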