"convert" string to bytes without changing data (encoding)

Peter Daum
Hi,

is there any way to convert a string to bytes without
interpreting the data in any way? Something like:

s='abcde'
b=bytes(s, "unchanged")

Regards,
                              Peter
--
http://mail.python.org/mailman/listinfo/python-list

Re: "convert" string to bytes without changing data (encoding)

Chris Angelico
On Wed, Mar 28, 2012 at 7:56 PM, Peter Daum <[hidden email]> wrote:
> Hi,
>
> is there any way to convert a string to bytes without
> interpreting the data in any way? Something like:
>
> s='abcde'
> b=bytes(s, "unchanged")

What is a string? It's not a series of bytes. You can't convert it
without encoding those characters into bytes in some way.

ChrisA

Re: "convert" string to bytes without changing data (encoding)

Stefan Behnel-3
In reply to this post by Peter Daum
Peter Daum, 28.03.2012 10:56:
> is there any way to convert a string to bytes without
> interpreting the data in any way? Something like:
>
> s='abcde'
> b=bytes(s, "unchanged")

If you can tell us what you actually want to achieve, i.e. why you want to
do this, we may be able to tell you how to do what you want.

Stefan


Re: "convert" string to bytes without changing data (encoding)

Peter Daum
In reply to this post by Peter Daum
On 2012-03-28 11:02, Chris Angelico wrote:
> On Wed, Mar 28, 2012 at 7:56 PM, Peter Daum <[hidden email]> wrote:
>> is there any way to convert a string to bytes without
>> interpreting the data in any way? Something like:
>>
>> s='abcde'
>> b=bytes(s, "unchanged")
>
> What is a string? It's not a series of bytes. You can't convert it
> without encoding those characters into bytes in some way.

... in my example, the variable s points to a "string", i.e. a series of
bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.

b=bytes(s,'ascii') # or ('utf-8', 'latin1', ...)

would of course work in this case, but in general, if s holds any
data with bytes > 127, the actual data will be changed according
to the provided encoding.

What I am looking for is a general way to just copy the raw data
from a "string" object to a "byte" object without any attempt to
"decode" or "encode" anything ...

Regards,
                        Peter

Re: "convert" string to bytes without changing data (encoding)

Heiko Wundram-2
Am 28.03.2012 11:43, schrieb Peter Daum:
> ... in my example, the variable s points to a "string", i.e. a series
> of
> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.

No; a string contains a series of codepoints from the unicode plane,
representing natural language characters (at least in the simplistic
view, I'm not talking about surrogates). These can be encoded to
different binary storage representations, of which ascii is (a common)
one.

> What I am looking for is a general way to just copy the raw data
> from a "string" object to a "byte" object without any attempt to
> "decode" or "encode" anything ...

There is "logically" no raw data in the string, just a series of
codepoints, as stated above. You'll have to specify the encoding to use
to get at "raw" data, and from what I gather you're interested in the
latin-1 (or iso-8859-15) encoding, as you're specifically referencing
chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
speak).
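
To make the point about needing an encoding concrete, here is a minimal sketch (the sample string is my own choice, not something from the original post). It shows that one and the same string produces different bytes depending on the encoding you pick, so there is no single "raw" byte sequence to copy:

```python
# One string, three encodings, three different byte sequences.
s = 'caf\u00e9'  # 'cafe' with an acute accent on the e; sample text, assumed for illustration

as_latin1 = s.encode('latin-1')    # one byte per character for this string
as_utf8 = s.encode('utf-8')        # the accented letter becomes two bytes
as_utf16 = s.encode('utf-16-le')   # two bytes per character for this string

print(as_latin1)  # b'caf\xe9'
print(as_utf8)    # b'caf\xc3\xa9'
print(as_utf16)   # b'c\x00a\x00f\x00\xe9\x00'
```

Only for pure ASCII text do latin-1 and utf-8 happen to agree, which is probably why the 'abcde' example looked as if no encoding were involved.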

--
--- Heiko.

Re: "convert" string to bytes without changing data (encoding)

Stefan Behnel-3
In reply to this post by Peter Daum
Peter Daum, 28.03.2012 11:43:
> What I am looking for is a general way to just copy the raw data
> from a "string" object to a "byte" object without any attempt to
> "decode" or "encode" anything ...

That's why I asked about your use case - where does the data come from and
why is it contained in a character string in the first place? If you could
provide that information, we can help you further.

Stefan


Re: "convert" string to bytes without changing data (encoding)

Ross Ridge
In reply to this post by Peter Daum
Chris Angelico  <[hidden email]> wrote:
>What is a string? It's not a series of bytes.

Of course it is.  Conceptually you're not supposed to think of it that
way, but a string is stored in memory as a series of bytes.

What he's asking for may not be very useful or practical, but if that's
your problem here then that's what you should be addressing, not
pretending that it's fundamentally impossible.

                                        Ross Ridge

--
 l/  //  Ross Ridge -- The Great HTMU
[oo][oo]  [hidden email]
-()-/()/  http://www.csclub.uwaterloo.ca/~rridge/ 
 db  //  

Re: "convert" string to bytes without changing data (encoding)

Chris Angelico
On Thu, Mar 29, 2012 at 2:36 AM, Ross Ridge <[hidden email]> wrote:
> Chris Angelico  <[hidden email]> wrote:
>>What is a string? It's not a series of bytes.
>
> Of course it is.  Conceptually you're not supposed to think of it that
> way, but a string is stored in memory as a series of bytes.

Note that distinction. I said that a string "is not" a series of
bytes; you say that it "is stored" as bytes.

> What he's asking for may not be very useful or practical, but if that's
> your problem here then that's what you should be addressing, not
> pretending that it's fundamentally impossible.

That's equivalent to taking a 64-bit integer and trying to treat it as
a 64-bit floating point number. They're all just bits in memory, and
in C it's quite easy to cast a pointer to a different type and
dereference it. But a Python Unicode string might be stored in several
ways; for all you know, it might actually be stored as a sequence of
apples in a refrigerator, just as long as they can be referenced
correctly. There's no logical Python way to turn that into a series of
bytes.

ChrisA

Re: "convert" string to bytes without changing data (encoding)

Grant Edwards-7
In reply to this post by Ross Ridge
On 2012-03-28, Chris Angelico <[hidden email]> wrote:

> for all you know, it might actually be stored as a sequence of
> apples in a refrigerator

[...]

> There's no logical Python way to turn that into a series of bytes.

There's got to be a joke there somewhere about how to eat an apple...

--
Grant Edwards               grant.b.edwards        Yow! Somewhere in DOWNTOWN
                                  at               BURBANK a prostitute is
                              gmail.com            OVERCOOKING a LAMB CHOP!!

Re: "convert" string to bytes without changing data (encoding)

Dave Angel-3
In reply to this post by Peter Daum
On 03/28/2012 04:56 AM, Peter Daum wrote:

> Hi,
>
> is there any way to convert a string to bytes without
> interpreting the data in any way? Something like:
>
> s='abcde'
> b=bytes(s, "unchanged")
>
> Regards,
>                                Peter


You needed to specify that you are using Python 3.x.  In Python 2.x, a
string is indeed a series of bytes.  But in Python 3.x, you have to be
much more specific.

For example, if that string is coming from a literal, then you usually
can convert it back to bytes simply by encoding using the same method as
the one specified for the source file.  So look at the encoding line at
the top of the file.



--

DaveA


Re: "convert" string to bytes without changing data (encoding)

Peter Daum
In reply to this post by Peter Daum
On 2012-03-28 12:42, Heiko Wundram wrote:

> Am 28.03.2012 11:43, schrieb Peter Daum:
>> ... in my example, the variable s points to a "string", i.e. a series of
>> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.
>
> No; a string contains a series of codepoints from the unicode plane,
> representing natural language characters (at least in the simplistic
> view, I'm not talking about surrogates). These can be encoded to
> different binary storage representations, of which ascii is (a common) one.
>
>> What I am looking for is a general way to just copy the raw data
>> from a "string" object to a "byte" object without any attempt to
>> "decode" or "encode" anything ...
>
> There is "logically" no raw data in the string, just a series of
> codepoints, as stated above. You'll have to specify the encoding to use
> to get at "raw" data, and from what I gather you're interested in the
> latin-1 (or iso-8859-15) encoding, as you're specifically referencing
> chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
> speak).

... I was under the illusion that Python (like e.g. Perl) stored
strings internally in UTF-8. In that case the "conversion" would simply
mean re-labeling the data. Unfortunately, as I have meanwhile found out,
this is not the case (nor is the "apple encoding" ;-), so it would indeed
be pretty useless.

The longer story of my question is: I am new to Python (obviously), and
since I am not familiar with either version, I thought it would be
advisable to go for Python 3.x. The biggest problem that I am facing is
that I am often dealing with data that is basically text but can contain
8-bit bytes. In this case, I cannot safely assume any given encoding,
but I actually also don't need to know: for my purposes, it would be
perfectly good enough to deal with the ASCII portions and keep anything
else unchanged.

As it seems, this would be far easier with Python 2.x. With Python 3
and its strict distinction between "str" and "bytes", things get
syntactically pretty awkward and error-prone (something as
innocent-looking as "s=s+'/'" hidden in a rarely reached branch, and a
seemingly correct program will crash with a TypeError 2 years
later ...)
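
For readers who want to see the failure being described, here is a small sketch (the sample value is made up for illustration). It assumes the data is held as bytes, in which case appending a str literal raises TypeError while a bytes literal works:

```python
data = b'some/path'   # bytes, e.g. read from a binary file (sample value)

try:
    data + '/'        # str literal: mixing bytes and str is not allowed
    mixed_ok = True
except TypeError:
    mixed_ok = False

joined = data + b'/'  # bytes literal: this works
print(mixed_ok, joined)  # False b'some/path/'
```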

Regards,
                         Peter

Re: "convert" string to bytes without changing data (encoding)

Steven D'Aprano-11
In reply to this post by Ross Ridge
On Wed, 28 Mar 2012 11:36:10 -0400, Ross Ridge wrote:

> Chris Angelico  <[hidden email]> wrote:
>>What is a string? It's not a series of bytes.
>
> Of course it is.  Conceptually you're not supposed to think of it that
> way, but a string is stored in memory as a series of bytes.

You don't know that. They might be stored as a tree, or a rope, or some
even more complex data structure. In fact, in Python, they are stored as
an object.

But even if they were stored as a simple series of bytes, you don't know
what bytes they are. That is an implementation detail of the particular
Python build being used, and since Python doesn't give direct access to
memory (at least not in pure Python) there's no way to retrieve those
bytes using Python code.

Saying that strings are stored in memory as bytes is no more sensible
than saying that dicts are stored in memory as bytes. Yes, they are. So
what? Taken out of context in a running Python interpreter, those bytes
are pretty much meaningless.


> What he's asking for may not be very useful or practical, but if that's
> your problem here then that's what you should be addressing, not
> pretending that it's fundamentally impossible.

The right way to convert bytes to strings, and vice versa, is via
encoding and decoding operations. What the OP is asking for is as silly
as somebody asking to turn a float 1.3792 into a string without calling
str() or any equivalent float->string conversion. They're both made up of
bytes, right? Yeah, they are. So what?

Even if you do a hex dump of float 1.3792, the result will NOT be the
string "1.3792". And likewise, even if you somehow did a hex dump of the
memory representation of a string, the result will NOT be the equivalent
sequence of bytes except *maybe* for some small subset of possible
strings.
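
The float analogy can be checked directly. The following is only a sketch using the standard struct module (the little-endian double layout is my choice for the demonstration): the eight bytes that store 1.3792 have nothing in common with the six bytes of the text "1.3792".

```python
import struct

x = 1.3792
raw = struct.pack('<d', x)      # the 8 bytes of the IEEE 754 double, little-endian
text = str(x).encode('ascii')   # the 6 bytes of the text "1.3792"

print(raw.hex())                # the memory-style hex dump
print(text)                     # b'1.3792'
print(raw == text)              # False
```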



--
Steven

Re: "convert" string to bytes without changing data (encoding)

Ross Ridge
In reply to this post by Ross Ridge
Ross Ridge <[hidden email]> wrote:
> Of course it is.  Conceptually you're not supposed to think of it that
> way, but a string is stored in memory as a series of bytes.

Chris Angelico  <[hidden email]> wrote:
>Note that distinction. I said that a string "is not" a series of
>bytes; you say that it "is stored" as bytes.

The distinction is meaningless.  I'm not going to argue with you about what
you or I meant by the word "is".

>But a Python Unicode string might be stored in several
>ways; for all you know, it might actually be stored as a sequence of
>apples in a refrigerator, just as long as they can be referenced
>correctly.

But it is in fact only stored in one particular way, as a series of bytes.

>There's no logical Python way to turn that into a series of bytes.

Nonsense.  Play all the semantic games you want, it already is a series
of bytes.

                                        Ross Ridge

--
 l/  //  Ross Ridge -- The Great HTMU
[oo][oo]  [hidden email]
-()-/()/  http://www.csclub.uwaterloo.ca/~rridge/ 
 db  //  

Re: "convert" string to bytes without changing data (encoding)

Terry Reedy
In reply to this post by Ross Ridge


On 3/28/2012 11:36 AM, Ross Ridge wrote:
> Chris Angelico<[hidden email]>  wrote:
>> What is a string? It's not a series of bytes.
>
> Of course it is.  Conceptually you're not supposed to think of it that
> way, but a string is stored in memory as a series of bytes.

*If* it is stored in byte memory. If you execute a 3.x program mentally
or on paper, then there are no bytes.

If you execute a 3.3 program on a byte-oriented computer, then the 'a'
in the string might be represented by 1, 2, or 4 bytes, depending on the
other characters in the string. The actual logical bit pattern will
depend on the big versus little endianness of the system.

My impression is that if you go down to the physical bit level, then
again there are, possibly, no 'bytes' as a physical construct as the
bits, possibly, are stored in parallel on multiple ram chips.

> What he's asking for may not be very useful or practical, but if that's
> your problem here then that's what you should be addressing, not
> pretending that it's fundamentally impossible.

The python-level way to get the bytes of an object that supports the
buffer interface is memoryview(). 3.x strings intentionally do not
support the buffer interface as there is not any particular
correspondence between characters (codepoints) and bytes.
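
A quick way to see this for yourself (just a sketch, with 'abcde' taken from the original question): memoryview() accepts a bytes object but rejects a str with a TypeError.

```python
view = memoryview(b'abcde')      # bytes support the buffer interface
byte_values = list(view)
print(byte_values)               # [97, 98, 99, 100, 101]

try:
    memoryview('abcde')          # str does not
    str_has_buffer = True
except TypeError:
    str_has_buffer = False
print(str_has_buffer)            # False
```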

The OP could get the ordinal for each character and decide how *he*
wants to convert them to bytes.

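ba = bytearray()
for c in s:  # s is the string to convert, e.g. s = 'abcde'
    i = ord(c)
    # append whatever bytes *you* choose for i; for example, 4 bytes, big-endian:
    ba.extend(i.to_bytes(4, 'big'))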

To get the particular bytes used for a particular string on a particular
system, OP should use the C API, possibly through ctypes.

--
Terry Jan Reedy


Re: "convert" string to bytes without changing data (encoding)

Steven D'Aprano-11
In reply to this post by Peter Daum
On Wed, 28 Mar 2012 11:43:52 +0200, Peter Daum wrote:

> ... in my example, the variable s points to a "string", i.e. a series of
> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.

No. Strings are not sequences of bytes (except in the trivial sense that
everything in computer memory is made of bytes). They are sequences of
CODE POINTS. (Roughly speaking, code points are *almost* but not quite
the same as characters.)

I suggest that you need to reset your understanding of strings and bytes.
I suggest you start by reading this:

http://www.joelonsoftware.com/articles/Unicode.html

Then come back and try to explain what actual problem you are trying to
solve.


--
Steven

Re: "convert" string to bytes without changing data (encoding)

Heiko Wundram-2
In reply to this post by Peter Daum
Am 28.03.2012 19:43, schrieb Peter Daum:
> As it seems, this would be far easier with Python 2.x. With Python 3
> and its strict distinction between "str" and "bytes", things get
> syntactically pretty awkward and error-prone (something as
> innocent-looking as "s=s+'/'" hidden in a rarely reached branch, and a
> seemingly correct program will crash with a TypeError 2 years
> later ...)

It seems that you're mixing things up wrt. the string/bytes
distinction; it's not as "complicated" as it might seem.

1) Strings

s = "This is a test string"
s = 'This is another test string with single quotes'
s = """
And this is a multiline test string.
"""
s = 'c' # This is also a string...

all create/refer to string objects. How Python internally stores them
is none of your concern (actually, that's rather complicated anyway, at
least with the upcoming Python 3.3), and processing a string basically
means that you'll work on the natural language characters present in the
string. Python strings can store (pretty much) all characters and
surrogates that unicode allows, and when the python interpreter/compiler
reads strings from input (I'm talking about source files), a default
encoding defines how the bytes in your input file get interpreted as
unicode codepoint encodings (generally, it depends on your system locale
or file header indications) to construct the internal string object
you're using to access the data in the string.

There is no such thing as a type for a single character; single
characters are simply strings of length 1 (and so indexing also returns
a [new] string object).

Single and double quotes work no differently.

The internal encoding used by the Python interpreter is of no concern
to you.

2) Bytes

s = b'this is a byte-string'
s = b'\x22\x33\x44'

The above define bytes. Think of the bytes type as arrays of 8-bit
integers, only representing a buffer which you can process as an array
of fixed-width integers. Reading from a file opened in binary mode (or
from sys.stdin.buffer) gets you bytes, and not a string, because Python
cannot automagically guess what format the input is in.

Indexing the bytes type returns an integer (which is the clearest
distinction between string and bytes).

Being able to input "string-looking" data in source files as bytes is a
debatable "feature" (IMHO; see the first example), simply because it
breaks the semantic difference between the two types in the eye of the
programmer looking at source.
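
The indexing difference between the two types can be seen in a few lines (a minimal sketch; the values are arbitrary samples):

```python
s = 'abc'
b = b'abc'

print(type(s[0]), s[0])   # <class 'str'> a
print(type(b[0]), b[0])   # <class 'int'> 97
print(b[0:1])             # b'a'  (slicing bytes still gives bytes)
```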

3) Conversions

To get from bytes to string, you have to decode the bytes buffer,
telling Python what kind of character data is contained in the array of
integers. After decoding, you'll get a string object which you can
process using the standard string methods. For decoding to succeed, you
have to tell Python how the natural language characters are encoded in
your array of bytes:

b'hello'.decode('iso-8859-15')

To get from string back to bytes (you want to write the natural
language character data you've processed to a file), you have to encode
the data in your string buffer, which gets you an array of 8-bit
integers to write to the output:

'hello'.encode('iso-8859-15')

Most output methods will happily do the encoding for you, using a
default encoding, and if that happens to be ASCII, you'll get a
UnicodeEncodeError whenever a character in your string cannot be
represented in that encoding.
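
As a rough illustration of the round trip and of the error case (the sample bytes are an assumption of mine, taken to be ISO-8859-15 text):

```python
raw = b'Gr\xfc\xdfe'                      # sample bytes, assumed to be ISO-8859-15
text = raw.decode('iso-8859-15')          # bytes -> str
print(ascii(text))                        # 'Gr\xfc\xdfe' (u-umlaut and sharp s as code points)
assert text.encode('iso-8859-15') == raw  # the same encoding gives the same bytes back

try:
    text.encode('ascii')                  # these two characters have no ASCII encoding
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)                           # False
```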

If the above doesn't make the string/bytes-distinction and usage
clearer, and you have a C#-background, check out the distinction between
byte[] (which the System.IO-streams get you), and how you have to use a
System.Encoding-derived class to get at actual System.String objects to
manipulate character data. Python's type system wrt. character data is
pretty much similar, except for missing the "single character" type
(char).

Anyway, back to what you wrote: how are you getting the input data? Why
are "high bytes" in there which you do not know the encoding for?
Generally, from what I gather, you'll decode data from some source,
process it, and write it back using the same encoding which you used for
decoding, which should do exactly what you want and not get you into any
trouble with encodings.

--
--- Heiko.

Re: "convert" string to bytes without changing data (encoding)

Jussi Piitulainen
In reply to this post by Peter Daum
Peter Daum writes:

> ... I was under the illusion that Python (like e.g. Perl) stored
> strings internally in UTF-8. In that case the "conversion" would simply
> mean re-labeling the data. Unfortunately, as I have meanwhile found out,
> this is not the case (nor is the "apple encoding" ;-), so it would indeed
> be pretty useless.
>
> The longer story of my question is: I am new to Python (obviously), and
> since I am not familiar with either version, I thought it would be
> advisable to go for Python 3.x. The biggest problem that I am facing is
> that I am often dealing with data that is basically text but can contain
> 8-bit bytes. In this case, I cannot safely assume any given encoding,
> but I actually also don't need to know: for my purposes, it would be
> perfectly good enough to deal with the ASCII portions and keep anything
> else unchanged.

You can read as bytes and decode as ASCII, ignoring the troublesome
non-ASCII bytes:

>>> print(open('text.txt', 'br').read().decode('ascii', 'ignore'))
Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
(Parittsbit) auf den Kommunikationsleitungen oder fr andere
Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
so dass alle im ASCII definierten Zeichen auch in den verschiedenen
Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.

The paragraph is from the German Wikipedia on ASCII, in UTF-8.
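
Note that 'ignore' throws the non-ASCII bytes away. If they need to survive, a different technique from the one above may fit: the standard 'surrogateescape' error handler (PEP 383, Python 3.1 and later). The following is only a sketch with made-up sample bytes; it decodes, edits an ASCII portion, and restores the untouched bytes on encoding:

```python
raw = b'f\xfcr ASCII \xe4\xf6'   # sample bytes in an unknown 8-bit encoding

# 'ignore' drops the non-ASCII bytes, so the original cannot be restored.
lossy = raw.decode('ascii', 'ignore')
print(repr(lossy))                        # 'fr ASCII '

# 'surrogateescape' keeps each undecodable byte as a lone surrogate code point.
text = raw.decode('ascii', 'surrogateescape')
text = text.replace('ASCII', 'ascii')     # work on the ASCII portion only
restored = text.encode('ascii', 'surrogateescape')
print(restored)                           # b'f\xfcr ascii \xe4\xf6'
```

Strings decoded this way contain lone surrogates, so they have to be encoded with the same error handler before they are written or printed.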

Re: "convert" string to bytes without changing data (encoding)

Ethan Furman-2
In reply to this post by Peter Daum
Peter Daum wrote:

> On 2012-03-28 12:42, Heiko Wundram wrote:
>> Am 28.03.2012 11:43, schrieb Peter Daum:
>>> ... in my example, the variable s points to a "string", i.e. a series of
>>> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.
>> No; a string contains a series of codepoints from the unicode plane,
>> representing natural language characters (at least in the simplistic
>> view, I'm not talking about surrogates). These can be encoded to
>> different binary storage representations, of which ascii is (a common) one.
>>
>>> What I am looking for is a general way to just copy the raw data
>>> from a "string" object to a "byte" object without any attempt to
>>> "decode" or "encode" anything ...
>> There is "logically" no raw data in the string, just a series of
>> codepoints, as stated above. You'll have to specify the encoding to use
>> to get at "raw" data, and from what I gather you're interested in the
>> latin-1 (or iso-8859-15) encoding, as you're specifically referencing
>> chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
>> speak).
>
> The longer story of my question is: I am new to Python (obviously), and
> since I am not familiar with either version, I thought it would be
> advisable to go for Python 3.x. The biggest problem that I am facing is
> that I am often dealing with data that is basically text but can contain
> 8-bit bytes. In this case, I cannot safely assume any given encoding,
> but I actually also don't need to know: for my purposes, it would be
> perfectly good enough to deal with the ASCII portions and keep anything
> else unchanged.

Where is the data coming from?  Files?  In that case, it sounds like you
will want to decode/encode using 'latin-1', as the bulk of your text is
plain ascii and you don't really care about the upper-ascii chars.
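
A minimal sketch of why 'latin-1' suits this purpose (the sample data is invented for illustration): it maps every byte value 0..255 to the code point with the same number, so decoding never fails and encoding gives the original bytes back.

```python
# Latin-1 round trip over every possible byte value.
all_bytes = bytes(range(256))
assert all_bytes.decode('latin-1').encode('latin-1') == all_bytes

raw = b'dir\xe4\xf6'                  # sample data with two non-ASCII bytes
text = raw.decode('latin-1') + '/'    # ordinary str operations now work
result = text.encode('latin-1')
print(result)                         # b'dir\xe4\xf6/'
```

This only holds as long as the non-ASCII characters are passed through untouched; something like upper() would change them, and the bytes written out would then differ from the input.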

~Ethan~

RE: "convert" string to bytes without changing data (encoding)

Prasad, Ramit-2
In reply to this post by Peter Daum
> As it seems, this would be far easier with Python 2.x. With Python 3
> and its strict distinction between "str" and "bytes", things get
> syntactically pretty awkward and error-prone (something as
> innocent-looking as "s=s+'/'" hidden in a rarely reached branch, and a
> seemingly correct program will crash with a TypeError 2 years
> later ...)

Just a small note as you are new to Python: repeated string concatenation
can be expensive (quadratic time in the worst case). The Python (2.x and
3.x) idiom for frequent string concatenation is to append to a list and
then join the pieces, like the following (linear time).

>>> lst = ['Hi,']
>>> lst.append('how')
>>> lst.append('are')
>>> lst.append('you?')
>>> sentence = ' '.join(lst)  # use a space to separate the elements
>>> print(sentence)
Hi, how are you?

You can use join on an empty string, but then they will not be
separated by spaces.

>>> sentence = ''.join(lst)  # empty string, so no separation
>>> print(sentence)
Hi,howareyou?

You can use any string as a separator, length does not matter.

>>> sentence = '@-Q'.join(lst)
>>> print(sentence)
Hi,@-Qhow@-Qare@-Qyou?


Ramit


Ramit Prasad | JPMorgan Chase Investment Bank | Currencies Technology
712 Main Street | Houston, TX 77002
work phone: 713 - 216 - 5423


Re: "convert" string to bytes without changing data (encoding)

Ian Kelly-2
In reply to this post by Peter Daum
On Wed, Mar 28, 2012 at 11:43 AM, Peter Daum <[hidden email]> wrote:
> ... I was under the illusion that Python (like e.g. Perl) stored
> strings internally in UTF-8. In that case the "conversion" would simply
> mean re-labeling the data. Unfortunately, as I have meanwhile found out,
> this is not the case (nor is the "apple encoding" ;-), so it would indeed
> be pretty useless.

No, unicode strings can be stored internally as any of UCS-1, UCS-2,
UCS-4, C wchar strings, or even plain ASCII.  And those are all
implementation details that could easily change in future versions of
Python.

> The longer story of my question is: I am new to Python (obviously), and
> since I am not familiar with either version, I thought it would be
> advisable to go for Python 3.x. The biggest problem that I am facing is
> that I am often dealing with data that is basically text but can contain
> 8-bit bytes. In this case, I cannot safely assume any given encoding,
> but I actually also don't need to know: for my purposes, it would be
> perfectly good enough to deal with the ASCII portions and keep anything
> else unchanged.

You can't generally just "deal with the ascii portions" without
knowing something about the encoding.  Say you encounter a byte
greater than 127.  Is it a single non-ASCII character, or is it the
leading byte of a multi-byte character?  If the next byte is less
than 128, is it an ASCII character, or a continuation of the previous
character?  For UTF-8 you could safely assume ASCII, but without
knowing the encoding, there is no way to be sure.  If you just assume
it's ASCII and manipulate it as such, you could be messing up
non-ASCII characters.
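
Here is a small sketch of the kind of trap I mean, using Shift-JIS as one example of a multi-byte encoding (the sample character is my own pick). The second byte of this character is 0x5C, the same value as an ASCII backslash, whereas in UTF-8 every byte of a multi-byte character is 0x80 or above:

```python
ch = '\u30bd'                        # KATAKANA LETTER SO, chosen as a sample

sjis = ch.encode('shift_jis')
print(sjis)                          # b'\x83\\'  (second byte is 0x5C, ASCII backslash)
print(b'\\' in sjis)                 # True: a byte-level search for a backslash gets a false match

utf8 = ch.encode('utf-8')
print(utf8)                          # b'\xe3\x82\xbd'
print(all(b >= 0x80 for b in utf8))  # True: no byte can be mistaken for ASCII
```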

Cheers,
Ian