|
Hi,
is there any way to convert a string to bytes without interpreting the data in any way? Something like:

    s = 'abcde'
    b = bytes(s, "unchanged")

Regards,
Peter
--
http://mail.python.org/mailman/listinfo/python-list
|
On Wed, Mar 28, 2012 at 7:56 PM, Peter Daum <[hidden email]> wrote:
> Hi,
>
> is there any way to convert a string to bytes without
> interpreting the data in any way? Something like:
>
> s='abcde'
> b=bytes(s, "unchanged")

What is a string? It's not a series of bytes. You can't convert it without encoding those characters into bytes in some way.

ChrisA
|
In reply to this post by Peter Daum
Peter Daum, 28.03.2012 10:56:
> is there any way to convert a string to bytes without
> interpreting the data in any way? Something like:
>
> s='abcde'
> b=bytes(s, "unchanged")

If you can tell us what you actually want to achieve, i.e. why you want to do this, we may be able to tell you how to do what you want.

Stefan
|
In reply to this post by Peter Daum
On 2012-03-28 11:02, Chris Angelico wrote:
> On Wed, Mar 28, 2012 at 7:56 PM, Peter Daum <[hidden email]> wrote:
>> is there any way to convert a string to bytes without
>> interpreting the data in any way? Something like:
>>
>> s='abcde'
>> b=bytes(s, "unchanged")
>
> What is a string? It's not a series of bytes. You can't convert it
> without encoding those characters into bytes in some way.

... in my example, the variable s points to a "string", i.e. a series of bytes (0x61, 0x62, ...) interpreted as ASCII/Unicode characters.

    b = bytes(s, 'ascii')  # or 'utf-8', 'latin1', ...

would of course work in this case, but in general, if s holds any data with bytes > 127, the actual data will be changed according to the provided encoding.

What I am looking for is a general way to just copy the raw data from a "string" object to a "byte" object without any attempt to "decode" or "encode" anything ...

Regards,
Peter
|
On 28.03.2012 11:43, Peter Daum wrote:
> ... in my example, the variable s points to a "string", i.e. a series of
> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.

No; a string contains a series of codepoints from the Unicode plane, representing natural language characters (at least in the simplistic view, I'm not talking about surrogates). These can be encoded to different binary storage representations, of which ASCII is (a common) one.

> What I am looking for is a general way to just copy the raw data
> from a "string" object to a "byte" object without any attempt to
> "decode" or "encode" anything ...

There is "logically" no raw data in the string, just a series of codepoints, as stated above. You'll have to specify the encoding to use to get at "raw" data, and from what I gather you're interested in the latin-1 (or iso-8859-15) encoding, as you're specifically referencing chars >= 0x80 (which hints at your mindset being in LATIN-land, so to speak).

--
--- Heiko.
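As an editorial illustration of the latin-1 remark: latin-1 is the standard codec that maps code points 0 to 255 one-to-one onto byte values 0 to 255, so it comes closest to the hypothetical "unchanged" encoding, but only for strings whose code points all fit into one byte. A minimal sketch, with made-up sample values:

```python
# latin-1 maps code point N to the single byte N, for N in 0..255.
s = 'abc\xe9\xff'                  # code points 0x61 0x62 0x63 0xE9 0xFF
b = s.encode('latin-1')

print(list(b))                     # [97, 98, 99, 233, 255]
print([ord(c) for c in s])         # the same numbers

# Code points above 255 have no latin-1 byte, so encoding fails.
try:
    '\u20ac'.encode('latin-1')     # EURO SIGN, U+20AC
    euro_ok = True
except UnicodeEncodeError:
    euro_ok = False
print(euro_ok)                     # False
```

The failure on U+20AC is the reason no encoding can be a general "copy the raw data" operation for arbitrary strings.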
|
In reply to this post by Peter Daum
Peter Daum, 28.03.2012 11:43:
> What I am looking for is a general way to just copy the raw data
> from a "string" object to a "byte" object without any attempt to
> "decode" or "encode" anything ...

That's why I asked about your use case: where does the data come from and why is it contained in a character string in the first place? If you could provide that information, we can help you further.

Stefan
|
In reply to this post by Peter Daum
Chris Angelico <[hidden email]> wrote:
>What is a string? It's not a series of bytes.

Of course it is. Conceptually you're not supposed to think of it that way, but a string is stored in memory as a series of bytes.

What he's asking for may not be very useful or practical, but if that's your problem here then that's what you should be addressing, not pretending that it's fundamentally impossible.

Ross Ridge
--
Ross Ridge -- The Great HTMU
[hidden email]
http://www.csclub.uwaterloo.ca/~rridge/
|
On Thu, Mar 29, 2012 at 2:36 AM, Ross Ridge <[hidden email]> wrote:
> Chris Angelico <[hidden email]> wrote:
>> What is a string? It's not a series of bytes.
>
> Of course it is. Conceptually you're not supposed to think of it that
> way, but a string is stored in memory as a series of bytes.

Note that distinction. I said that a string "is not" a series of bytes; you say that it "is stored" as bytes.

> What he's asking for may not be very useful or practical, but if that's
> your problem here then that's what you should be addressing, not
> pretending that it's fundamentally impossible.

That's equivalent to taking a 64-bit integer and trying to treat it as a 64-bit floating point number. They're all just bits in memory, and in C it's quite easy to cast a pointer to a different type and dereference it. But a Python Unicode string might be stored in several ways; for all you know, it might actually be stored as a sequence of apples in a refrigerator, just as long as they can be referenced correctly. There's no logical Python way to turn that into a series of bytes.

ChrisA
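The integer/float analogy can be tried out from Python with the standard struct module. This is only an editorial sketch of "same bits, different type"; the sample value is chosen arbitrarily and little-endian byte order is an assumption:

```python
import struct

n = 0x3FF0000000000000                   # an arbitrary 64-bit integer
raw = struct.pack('<q', n)               # its 8 bytes, little-endian
as_float = struct.unpack('<d', raw)[0]   # the same 8 bytes read as an IEEE 754 double

print(n)                                 # 4607182418800017408
print(as_float)                          # 1.0
print(float(n) == as_float)              # False: reinterpreting is not converting
```

Reinterpreting the bits gives a value unrelated to the number the integer represents, which is the point being made about strings and bytes.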
|
In reply to this post by Ross Ridge
On 2012-03-28, Chris Angelico <[hidden email]> wrote:
> for all you know, it might actually be stored as a sequence of
> apples in a refrigerator
[...]
> There's no logical Python way to turn that into a series of bytes.

There's got to be a joke there somewhere about how to eat an apple...

--
Grant Edwards
|
In reply to this post by Peter Daum
On 03/28/2012 04:56 AM, Peter Daum wrote:
> Hi,
>
> is there any way to convert a string to bytes without
> interpreting the data in any way? Something like:
>
> s='abcde'
> b=bytes(s, "unchanged")
>
> Regards,
> Peter

You needed to specify that you are using Python 3.x. In Python 2.x, a string is indeed a series of bytes. But in Python 3.x, you have to be much more specific.

For example, if that string is coming from a literal, then you usually can convert it back to bytes simply by encoding using the same method as the one specified for the source file. So look at the encoding line at the top of the file.

--
DaveA
|
In reply to this post by Peter Daum
On 2012-03-28 12:42, Heiko Wundram wrote:
> On 28.03.2012 11:43, Peter Daum wrote:
>> ... in my example, the variable s points to a "string", i.e. a series of
>> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.
>
> No; a string contains a series of codepoints from the unicode plane,
> representing natural language characters (at least in the simplistic
> view, I'm not talking about surrogates). These can be encoded to
> different binary storage representations, of which ascii is (a common) one.
>
>> What I am looking for is a general way to just copy the raw data
>> from a "string" object to a "byte" object without any attempt to
>> "decode" or "encode" anything ...
>
> There is "logically" no raw data in the string, just a series of
> codepoints, as stated above. You'll have to specify the encoding to use
> to get at "raw" data, and from what I gather you're interested in the
> latin-1 (or iso-8859-15) encoding, as you're specifically referencing
> chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
> speak).

... I was under the illusion that Python (like e.g. Perl) stored strings internally in UTF-8. In this case the "conversion" would simply mean re-labelling the data. Unfortunately, as I meanwhile found out, this is not the case (nor the "apple encoding" ;-), so it would indeed be pretty useless.

The longer story of my question is: I am new to Python (obviously), and since I am not familiar with either one, I thought it would be advisable to go for Python 3.x. The biggest problem that I am facing is that I am often dealing with data that is basically text, but it can contain 8-bit bytes. In this case, I cannot safely assume any given encoding, but I actually also don't need to know: for my purposes, it would be perfectly good enough to deal with the ASCII portions and keep anything else unchanged.

As it seems, this would be far easier with Python 2.x. With Python 3 and its strict distinction between "str" and "bytes", things get syntactically pretty awkward and error-prone (something as innocent-looking as "s=s+'/'" hidden in a rarely reached branch, and a seemingly correct program will crash with a TypeError 2 years later ...)

Regards,
Peter
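The failure mode described in this post is easy to reproduce. The following editorial sketch uses a made-up path value and assumes the data arrived as bytes (for example from a file opened in binary mode):

```python
data = b'/usr/local'               # bytes, e.g. read from a binary file

try:
    data = data + '/'              # str literal: mixes bytes and str
    mixed_ok = True
except TypeError:
    mixed_ok = False

fixed = data + b'/'                # bytes literal: works
print(mixed_ok)                    # False
print(fixed)                       # b'/usr/local/'
```

The error only surfaces when the offending line actually runs, so a rarely reached branch needs a test that exercises it.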
|
In reply to this post by Ross Ridge
On Wed, 28 Mar 2012 11:36:10 -0400, Ross Ridge wrote:
> Chris Angelico <[hidden email]> wrote:
>> What is a string? It's not a series of bytes.
>
> Of course it is. Conceptually you're not supposed to think of it that
> way, but a string is stored in memory as a series of bytes.

You don't know that. They might be stored as a tree, or a rope, or some even more complex data structure. In fact, in Python, they are stored as an object.

But even if they were stored as a simple series of bytes, you don't know what bytes they are. That is an implementation detail of the particular Python build being used, and since Python doesn't give direct access to memory (at least not in pure Python) there's no way to retrieve those bytes using Python code.

Saying that strings are stored in memory as bytes is no more sensible than saying that dicts are stored in memory as bytes. Yes, they are. So what? Taken out of context in a running Python interpreter, those bytes are pretty much meaningless.

> What he's asking for may not be very useful or practical, but if that's
> your problem here then that's what you should be addressing, not
> pretending that it's fundamentally impossible.

The right way to convert bytes to strings, and vice versa, is via encoding and decoding operations. What the OP is asking for is as silly as somebody asking to turn a float 1.3792 into a string without calling str() or any equivalent float->string conversion. They're both made up of bytes, right? Yeah, they are. So what?

Even if you do a hex dump of float 1.3792, the result will NOT be the string "1.3792". And likewise, even if you somehow did a hex dump of the memory representation of a string, the result will NOT be the equivalent sequence of bytes except *maybe* for some small subset of possible strings.

--
Steven
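The float comparison can be checked directly. This editorial sketch assumes the usual IEEE 754 double representation and uses struct to obtain the 8 bytes a C double would occupy; it does not inspect the interpreter's own float object:

```python
import struct

x = 1.3792
text_bytes = str(x).encode('ascii')   # the characters '1.3792', encoded
memory_bytes = struct.pack('<d', x)   # 8 bytes of the little-endian double

print(text_bytes)                     # b'1.3792'
print(len(memory_bytes))              # 8
print(memory_bytes.hex())             # a hex dump that does not spell "1.3792"
print(text_bytes == memory_bytes)     # False
```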
|
In reply to this post by Ross Ridge
Ross Ridge <[hidden email]> wrote:
> Of course it is. Conceptually you're not supposed to think of it that
> way, but a string is stored in memory as a series of bytes.

Chris Angelico <[hidden email]> wrote:
>Note that distinction. I said that a string "is not" a series of
>bytes; you say that it "is stored" as bytes.

The distinction is meaningless. I'm not going to argue with you about what you or I meant by the word "is".

>But a Python Unicode string might be stored in several
>ways; for all you know, it might actually be stored as a sequence of
>apples in a refrigerator, just as long as they can be referenced
>correctly.

But it is in fact only stored in one particular way, as a series of bytes.

>There's no logical Python way to turn that into a series of bytes.

Nonsense. Play all the semantic games you want, it already is a series of bytes.

Ross Ridge
--
Ross Ridge -- The Great HTMU
[hidden email]
http://www.csclub.uwaterloo.ca/~rridge/
|
In reply to this post by Ross Ridge
On 3/28/2012 11:36 AM, Ross Ridge wrote:
> Chris Angelico <[hidden email]> wrote:
>> What is a string? It's not a series of bytes.
>
> Of course it is. Conceptually you're not supposed to think of it that
> way, but a string is stored in memory as a series of bytes.

*If* it is stored in byte memory. If you execute a 3.x program mentally or on paper, then there are no bytes.

If you execute a 3.3 program on a byte-oriented computer, then the 'a' in the string might be represented by 1, 2, or 4 bytes, depending on the other characters in the string. The actual logical bit pattern will depend on the big versus little endianness of the system.

My impression is that if you go down to the physical bit level, then again there are, possibly, no 'bytes' as a physical construct, as the bits, possibly, are stored in parallel on multiple RAM chips.

> What he's asking for may not be very useful or practical, but if that's
> your problem here then that's what you should be addressing, not
> pretending that it's fundamentally impossible.

The Python-level way to get the bytes of an object that supports the buffer interface is memoryview(). 3.x strings intentionally do not support the buffer interface, as there is not any particular correspondence between characters (codepoints) and bytes.

The OP could get the ordinal for each character and decide how *he* wants to convert them to bytes.

    ba = bytearray()
    for c in s:
        i = ord(c)
        # <append bytes to ba corresponding to i>

To get the particular bytes used for a particular string on a particular system, the OP should use the C API, possibly through ctypes.

--
Terry Jan Reedy
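One way the loop could be filled in is sketched below as an editorial example. The helper name `codepoints_to_bytes` and the choice of four big-endian bytes per code point are assumptions made for illustration; the post deliberately leaves that decision to the reader.

```python
def codepoints_to_bytes(s, width=4, byteorder='big'):
    """Hypothetical helper: pack each code point of s into `width` bytes."""
    ba = bytearray()
    for c in s:
        i = ord(c)
        # int.to_bytes raises OverflowError if i does not fit into `width` bytes.
        ba += i.to_bytes(width, byteorder)
    return bytes(ba)

packed = codepoints_to_bytes('ab')
print(packed)                                  # b'\x00\x00\x00a\x00\x00\x00b'
print(packed == 'ab'.encode('utf-32-be'))      # True

# str does not support the buffer interface, so memoryview() rejects it.
try:
    memoryview('ab')
    str_has_buffer = True
except TypeError:
    str_has_buffer = False
print(str_has_buffer)                          # False
```

With these particular choices the loop simply reimplements the UTF-32-BE codec, which suggests that any such conversion is an encoding under another name.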
|
In reply to this post by Peter Daum
On Wed, 28 Mar 2012 11:43:52 +0200, Peter Daum wrote:
> ... in my example, the variable s points to a "string", i.e. a series of
> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.

No. Strings are not sequences of bytes (except in the trivial sense that everything in computer memory is made of bytes). They are sequences of CODE POINTS. (Roughly speaking, code points are *almost* but not quite the same as characters.)

I suggest that you need to reset your understanding of strings and bytes. I suggest you start by reading this:

http://www.joelonsoftware.com/articles/Unicode.html

Then come back and try to explain what actual problem you are trying to solve.

--
Steven
|
In reply to this post by Peter Daum
On 28.03.2012 19:43, Peter Daum wrote:
> As it seems, this would be far easier with python 2.x. With python 3
> and its strict distinction between "str" and "bytes", things gets
> syntactically pretty awkward and error-prone (something as innocently
> looking like "s=s+'/'" hidden in a rarely reached branch and a
> seemingly correct program will crash with a TypeError 2 years
> later ...)

It seems that you're mixing things up wrt. the string/bytes distinction; it's not as "complicated" as it might seem.

1) Strings

    s = "This is a test string"
    s = 'This is another test string with single quotes'
    s = """
    And this is a multiline test string.
    """
    s = 'c' # This is also a string...

all create/refer to string objects. How Python internally stores them is none of your concern (actually, that's rather complicated anyway, at least with the upcoming Python 3.3), and processing a string basically means that you'll work on the natural language characters present in the string. Python strings can store (pretty much) all characters and surrogates that Unicode allows, and when the Python interpreter/compiler reads strings from input (I'm talking about source files), a default encoding defines how the bytes in your input file get interpreted as Unicode codepoint encodings (in Python 3 this is UTF-8 unless an encoding declaration in the file header says otherwise) to construct the internal string object you're using to access the data in the string.

There is no such thing as a type for a single character; single characters are simply strings of length 1 (and so indexing also returns a [new] string object).

Single and double quotes work no differently.

The internal encoding used by the Python interpreter is of no concern to you.

2) Bytes

    s = b'this is a byte-string'
    s = b'\x22\x33\x44'

The above define bytes. Think of the bytes type as arrays of 8-bit integers, only representing a buffer which you can process as an array of fixed-width integers. Reading from a file opened in binary mode (or from sys.stdin.buffer) gets you bytes, and not a string, because Python cannot automagically guess what format the input is in.

Indexing the bytes type returns an integer (which is the clearest distinction between string and bytes).

Being able to input "string-looking" data in source files as bytes is a debatable "feature" (IMHO; see the first example), simply because it breaks the semantic difference between the two types in the eye of the programmer looking at source.

3) Conversions

To get from bytes to string, you have to decode the bytes buffer, telling Python what kind of character data is contained in the array of integers. After decoding, you'll get a string object which you can process using the standard string methods. For decoding to succeed, you have to tell Python how the natural language characters are encoded in your array of bytes:

    b'hello'.decode('iso-8859-15')

To get from string back to bytes (you want to write the natural language character data you've processed to a file), you have to encode the data in your string buffer, which gets you an array of 8-bit integers to write to the output:

    'hello'.encode('iso-8859-15')

Most output methods will happily do the encoding for you, using a standard encoding, and if that happens to be ASCII, you'll get UnicodeEncodeErrors which tell you that a character in your string source is unsuited to be transmitted using the encoding you've specified.

If the above doesn't make the string/bytes distinction and usage clearer, and you have a C# background, check out the distinction between byte[] (which the System.IO streams get you), and how you have to use a System.Text.Encoding-derived class to get at actual System.String objects to manipulate character data. Python's type system wrt. character data is pretty much similar, except for missing the "single character" type (char).

Anyway, back to what you wrote: how are you getting the input data? Why are "high bytes" in there which you do not know the encoding for? Generally, from what I gather, you'll decode data from some source, process it, and write it back using the same encoding which you used for decoding, which should do exactly what you want and not get you into any trouble with encodings.

--
--- Heiko.
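The three points of this post (indexing a string, indexing bytes, and the decode/encode round trip) can be checked in a few lines. An editorial sketch; the sample bytes are made up and are assumed to be ISO-8859-15 text:

```python
s = 'hello'
b = b'hello'

print(type(s[0]).__name__, s[0])        # str h
print(type(b[0]).__name__, b[0])        # int 104

raw = b'caf\xe9 \xa4'                   # made-up sample bytes
text = raw.decode('iso-8859-15')        # bytes -> str
print(text == 'caf\xe9 \u20ac')         # True: 0xE9 is e-acute, 0xA4 is the euro sign
print(text.encode('iso-8859-15') == raw)  # True: str -> bytes with the same codec
```

Decoding and re-encoding with the same codec returns the original bytes here, which is the workflow suggested at the end of the post.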
|
In reply to this post by Peter Daum
Peter Daum writes:
> ... I was under the illusion, that python (like e.g. perl) stored
> strings internally in utf-8. In this case the "conversion" would simply
> mean to re-label the data. Unfortunately, as I meanwhile found out, this
> is not the case (nor the "apple encoding" ;-), so it would indeed be
> pretty useless.
>
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

You can read the file as bytes and decode it as ASCII, ignoring the troublesome non-ASCII bytes:

    >>> print(open('text.txt', 'br').read().decode('ascii', 'ignore'))
    Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
    (Parittsbit) auf den Kommunikationsleitungen oder fr andere
    Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
    Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
    Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
    so dass alle im ASCII definierten Zeichen auch in den verschiedenen
    Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
    einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
    Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.

The paragraph is from the German Wikipedia article on ASCII, stored as UTF-8. Note that the umlauts are dropped from the output.
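As an editorial note, the 'ignore' handler discards the non-ASCII bytes, so it does not meet the "keep anything else unchanged" requirement. The sketch below contrasts it with the 'surrogateescape' error handler (PEP 383, Python 3.1 and later), a technique nobody in this thread proposed; the sample bytes are made up.

```python
raw = b'f\xfcr ASCII'                          # made-up sample: 0xFC is not ASCII

lossy = raw.decode('ascii', 'ignore')
print(lossy)                                   # fr ASCII
print(lossy.encode('ascii') == raw)            # False: the 0xFC byte is gone

# surrogateescape smuggles each undecodable byte through as a lone surrogate.
kept = raw.decode('ascii', 'surrogateescape')
kept = kept.replace('ASCII', 'ascii')          # edit only the ASCII portion
restored = kept.encode('ascii', 'surrogateescape')
print(restored)                                # b'f\xfcr ascii'
```

A string holding such surrogates cannot be printed or encoded with the default strict handler, so it has to be written back out with the same error handler.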
|
In reply to this post by Peter Daum
Peter Daum wrote:
> On 2012-03-28 12:42, Heiko Wundram wrote:
>> On 28.03.2012 11:43, Peter Daum wrote:
>>> ... in my example, the variable s points to a "string", i.e. a series of
>>> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.
>> No; a string contains a series of codepoints from the unicode plane,
>> representing natural language characters (at least in the simplistic
>> view, I'm not talking about surrogates). These can be encoded to
>> different binary storage representations, of which ascii is (a common) one.
>>
>>> What I am looking for is a general way to just copy the raw data
>>> from a "string" object to a "byte" object without any attempt to
>>> "decode" or "encode" anything ...
>> There is "logically" no raw data in the string, just a series of
>> codepoints, as stated above. You'll have to specify the encoding to use
>> to get at "raw" data, and from what I gather you're interested in the
>> latin-1 (or iso-8859-15) encoding, as you're specifically referencing
>> chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
>> speak).
>
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

Where is the data coming from? Files? In that case, it sounds like you will want to decode/encode using 'latin-1', as the bulk of your text is plain ASCII and you don't really care about the upper-ASCII chars.

~Ethan~
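A sketch of the latin-1 suggestion, added editorially: because every byte value is a valid latin-1 character, decoding never fails, and re-encoding with latin-1 restores every byte that was not deliberately changed. The sample data (all 256 byte values) and the edit are made up.

```python
raw = bytes(range(256))                 # every possible byte value
text = raw.decode('latin-1')            # never fails: one code point per byte

text = text.replace('A', 'a')           # some ASCII-only edit
out = text.encode('latin-1')

print(len(out))                         # 256
print(out[0x41], out[0xE9])             # 97 233: 'A' became 'a', 0xE9 is untouched
print(out[0x80:] == raw[0x80:])         # True: all high bytes survive
```

If the data is really UTF-8 or another multi-byte encoding, the intermediate string shows mojibake for non-ASCII characters, but the bytes still round-trip as long as only ASCII characters are edited.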
|
In reply to this post by Peter Daum
> As it seems, this would be far easier with python 2.x. With python 3
> and its strict distinction between "str" and "bytes", things gets
> syntactically pretty awkward and error-prone (something as innocently
> looking like "s=s+'/'" hidden in a rarely reached branch and a
> seemingly correct program will crash with a TypeError 2 years
> later ...)

Just a small note as you are new to Python: repeated string concatenation can be expensive (quadratic time). The Python (2.x and 3.x) idiom for frequent string concatenation is to append to a list and then join the pieces, like the following (linear time).

    >>> lst = ['Hi,']
    >>> lst.append('how')
    >>> lst.append('are')
    >>> lst.append('you?')
    >>> sentence = ' '.join(lst)  # use a space separating each element
    >>> print(sentence)
    Hi, how are you?

You can use join on an empty string, but then the elements will not be separated by spaces.

    >>> sentence = ''.join(lst)  # empty string so no separation
    >>> print(sentence)
    Hi,howareyou?

You can use any string as a separator; length does not matter.

    >>> sentence = '@-Q'.join(lst)
    >>> print(sentence)
    Hi,@-Qhow@-Qare@-Qyou?

Ramit

Ramit Prasad
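For reference, an editorial sketch of the same idiom as a Python 3 script, comparing concatenation in a loop with join. The bytes variant at the end is an addition relevant to the str/bytes topic of this thread:

```python
words = ['Hi,', 'how', 'are', 'you?']

# Repeated concatenation: each + may copy the growing string.
sentence_concat = ''
for w in words:
    sentence_concat = sentence_concat + w + ' '
sentence_concat = sentence_concat.rstrip()

# join: builds the result in a single pass.
sentence_join = ' '.join(words)

print(sentence_join)                        # Hi, how are you?
print(sentence_concat == sentence_join)     # True

# bytes objects have the same method, with a bytes separator.
path = b'/'.join([b'usr', b'local', b'bin'])
print(path)                                 # b'usr/local/bin'
```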
|
In reply to this post by Peter Daum
On Wed, Mar 28, 2012 at 11:43 AM, Peter Daum <[hidden email]> wrote:
> ... I was under the illusion, that python (like e.g. perl) stored
> strings internally in utf-8. In this case the "conversion" would simply
> mean to re-label the data. Unfortunately, as I meanwhile found out, this
> is not the case (nor the "apple encoding" ;-), so it would indeed be
> pretty useless.

No, Unicode strings can be stored internally as any of UCS-1, UCS-2, UCS-4, C wchar strings, or even plain ASCII. And those are all implementation details that could easily change in future versions of Python.

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

You can't generally just "deal with the ASCII portions" without knowing something about the encoding. Say you encounter a byte greater than 127. Is it a single non-ASCII character, or is it the leading byte of a multi-byte character? If the next byte is less than 128, is it an ASCII character, or a continuation of the previous character? For UTF-8 you could safely assume ASCII, but without knowing the encoding, there is no way to be sure. If you just assume it's ASCII and manipulate it as such, you could be messing up non-ASCII characters.

Cheers,
Ian
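The ambiguity described here can be seen by encoding one non-ASCII character in two encodings. An editorial sketch; Shift-JIS and the sample character are picked for illustration and do not come from the thread:

```python
ch = '\u30bd'                        # KATAKANA LETTER SO

utf8 = ch.encode('utf-8')
sjis = ch.encode('shift_jis')

print([hex(x) for x in utf8])        # ['0xe3', '0x82', '0xbd']
print([hex(x) for x in sjis])

# UTF-8: every byte of a non-ASCII character is >= 0x80.
print(all(x >= 0x80 for x in utf8))  # True

# Shift-JIS: the trail byte is 0x5C, the same value as an ASCII backslash.
print(0x5C in sjis)                  # True
```

Code that treats every byte below 128 as ASCII would see a backslash in the middle of this Shift-JIS character, which is the kind of corruption warned about in the post.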
