RFC: Add a new builtin strarray type to Python?


RFC: Add a new builtin strarray type to Python?

Victor STINNER
Hi,

Since the integration of PEP 393, str += str is no longer super-fast (just
fast). For example, adding a single character to a string has to copy all
characters to a new string. I suppose that the performance of many
applications manipulating text may be affected by this issue, especially text
templating libraries.

io.StringIO has also been changed to store characters as Py_UCS4 (4 bytes)
instead of Py_UNICODE (2 or 4 bytes). This class doesn't benefit from the new
PEP 393.

I propose to add a new builtin type to Python to address both issues (CPU and
memory): *strarray*. This type would have the same API as str, except:

 * it has append() and extend() methods
 * method results are strarray instead of str

I'm writing this email to ask you if this type solves a real issue, or if we
can just prove the super-fast str.join(list of str).

--

strarray is similar to bytearray, but different: strarray('abc')[0] is 'a', not
97, and strarray can store any Unicode character (not only integers in range
0-255).
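
To make the proposed API concrete, here is a minimal pure-Python sketch. It is only an illustration, assuming a list of one-character strings as storage; the class name StrArraySketch is hypothetical, and this is neither the linked prototype nor the proposed C implementation.

```python
class StrArraySketch:
    """Hypothetical mutable text buffer backed by a list of characters."""

    def __init__(self, initial=""):
        self._chars = list(initial)

    def append(self, char):
        # append() adds exactly one character, like bytearray.append()
        if len(char) != 1:
            raise ValueError("append() expects a single character")
        self._chars.append(char)

    def extend(self, text):
        # extend() adds every character of an iterable of characters
        self._chars.extend(text)

    def __getitem__(self, index):
        item = self._chars[index]
        # Slices give back the same type; a single index gives a str
        return StrArraySketch(item) if isinstance(index, slice) else item

    def __len__(self):
        return len(self._chars)

    def __str__(self):
        return "".join(self._chars)


buf = StrArraySketch("abc")
buf.append("d")
buf.extend("\u20ac\U0001F600")   # any Unicode character can be stored
first_char = buf[0]              # a str, whereas bytearray gives an int
first_byte = bytearray(b"abc")[0]
```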

I wrote a quick and dirty implementation in Python just to be able to play
with the API, and to have an idea of the quantity of work required to
implement it:

https://bitbucket.org/haypo/misc/src/tip/python/strarray.py

(Some methods are untested: see the included TODO list.)

--

Implementing strarray in C is not trivial, and it would be easier to do it
in 3 steps:

 (a) Use a Py_UCS4 array
 (b) Make the array type depend on the content, for the best memory
     footprint, as in PEP 393
 (c) Use strarray to implement a new io.StringIO

Or we can just stop after step (a).
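
Step (b) can be pictured with a small sketch, assuming the PEP 393 rule that the widest code point decides the per-character width (1, 2 or 4 bytes). The function name storage_width is hypothetical and only illustrates the selection logic.

```python
def storage_width(text):
    """Return the bytes per character needed for the widest code point."""
    widest = max(map(ord, text), default=0)
    if widest <= 0xFF:       # ASCII and Latin-1 fit in one byte
        return 1
    if widest <= 0xFFFF:     # the rest of the Basic Multilingual Plane
        return 2
    return 4                 # anything beyond needs a full Py_UCS4


widths = {s: storage_width(s) for s in ("abc", "\u00e9", "\u20ac", "\U0001F600")}
```

A mutable array built this way would have to widen its whole buffer when a wider character is appended, which is part of why step (b) is harder than step (a).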

--

strarray API has to be discussed.

Most bytearray methods return a new object in most cases. I don't understand
why; it's not efficient. I don't know if we can do in-place operations for
strarray methods having the same name as bytearray methods (which are not
in-place methods).

str has some more methods that bytes and bytearray don't have, like format. We
may do in-place operations for these methods.

Victor
_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%40nabble.com

Re: RFC: Add a new builtin strarray type to Python?

Victor STINNER
> Since the integration of the PEP 393, str += str is not more super-fast
> (but just fast).

Oh oh. str+=str is now *1450x* slower than the ''.join() pattern. Here is a
benchmark (see attached script, bench_build_str.py):

Python 3.3

str += str    : 14548 ms
''.join()     : 10 ms
StringIO.write: 12 ms
StringBuilder : 30 ms
array('u')    : 67 ms

Python 3.2

str += str    : 9 ms
''.join()     : 9 ms
StringIO.write: 9 ms
StringBuilder : 30 ms
array('u')    : 77 ms

(FYI results are very different in Python 2)

I expect performance similar to StringIO.write if strarray is implemented
using a Py_UCS4 buffer, like io.StringIO.
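
The attached bench_build_str.py is not reproduced in this archive. As a rough stand-in, the sketch below times three of the strategies with timeit. The function names, the input size and the repeat count are all assumptions, and the timings it prints will not match the figures quoted above.

```python
import io
import timeit


def build_concat(parts):
    # Repeated += on an immutable str
    s = ""
    for p in parts:
        s += p
    return s


def build_join(parts):
    # Collect first, copy once at the end
    return "".join(parts)


def build_stringio(parts):
    # Write into an in-memory text buffer
    out = io.StringIO()
    for p in parts:
        out.write(p)
    return out.getvalue()


PARTS = ["x"] * 10_000

for fn in (build_concat, build_join, build_stringio):
    seconds = timeit.timeit(lambda: fn(PARTS), number=5)
    print(f"{fn.__name__:15}: {seconds * 1000:.1f} ms")
```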

PyPy has a UnicodeBuilder class (in __pypy__.builders): it has append(),
append_slice() and build() methods. In PyPy, it is the fastest method to build
a string:

PyPy 1.6

''.join()     : 16 ms
StringIO.join : 24 ms
StringBuilder : 9 ms
array('u')    : 66 ms

It is even faster if you specify the size to the constructor: 3 ms.

> I'm writing this email to ask you if this type solves a real issue, or if
> we can just prove the super-fast str.join(list of str).

Hum, it looks like "What is the most efficient string concatenation method in
Python?" is a frequently asked question. There is a recent thread on the
python-ideas mailing list:

"Create a StringBuilder class and use it everywhere"
http://code.activestate.com/lists/python-ideas/11147/
(I just subscribed to this list.)

Another alternative is a "string-join" object. It is discussed (and
implemented) in the following issue, and PyPy has also an optional
implementation:

http://bugs.python.org/issue1569040
http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#string-join-objects

Note: Python 2 has UserString.MutableString (and Python 3 has
collections.UserString).

Victor


bench_build_str.py (1K) Download Attachment

Re: RFC: Add a new builtin strarray type to Python?

Antoine Pitrou
On Sat, 1 Oct 2011 22:06:11 +0200
Victor Stinner <[hidden email]> wrote:
>
> > I'm writing this email to ask you if this type solves a real issue, or if
> > we can just prove the super-fast str.join(list of str).
>
> Hum, it looks like "What is the most efficient string concatenation method in
> python?" in a frequently asked question. There is a recent thread on python-
> ideas mailing list:

So, since people are confused at the number of possible options, you
propose to add a new option and therefore increase the confusion?

I don't understand why StringIO couldn't simply be optimized a little
more, if it needs to.
Or, if straightforward string concatenation really needs to be fast,
then str + str should be optimized (like it used to be).

Regards

Antoine.



Re: RFC: Add a new builtin strarray type to Python?

Larry Hastings
In reply to this post by Victor STINNER

On 10/01/2011 09:06 PM, Victor Stinner wrote:
> Another alternative is a "string-join" object. It is discussed (and
> implemented) in the following issue, and PyPy has also an optional
> implementation:
>
> http://bugs.python.org/issue1569040
> http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#string-join-objects


Yes, actually I was planning on trying to revive my "lazy string concatenation" patch once PEP 393 landed.  As I recall it, the major roadblock to the patch's acceptance was that it changed the semantics of PyString_AS_STRING().  With the patch applied, PyString_AS_STRING() could now fail and return NULL under low-memory conditions.  This meant a major change to the C API and would have required an audit of 400+ call sites inside CPython alone.  I haven't studied PEP 393 yet, but Martin tells me PyUnicode_READY would be a good place to render the lazy string.

Give me a week or two and I should be able to get it together,


larry


Re: RFC: Add a new builtin strarray type to Python?

Maciej Fijalkowski
In reply to this post by Antoine Pitrou
On Sat, Oct 1, 2011 at 5:21 PM, Antoine Pitrou <[hidden email]> wrote:
> On Sat, 1 Oct 2011 22:06:11 +0200
> Victor Stinner <[hidden email]> wrote:
>>
>> > I'm writing this email to ask you if this type solves a real issue, or if
>> > we can just prove the super-fast str.join(list of str).
>>
>> Hum, it looks like "What is the most efficient string concatenation method in
>> python?" in a frequently asked question. There is a recent thread on python-
>> ideas mailing list:

Victor, you can't say it's x times slower. It has different
complexity, so it can be arbitrarily slower.

>
> So, since people are confused at the number of possible options, you
> propose to add a new option and therefore increase the confusion?
>
> I don't understand why StringIO couldn't simply be optimized a little
> more, if it needs to.
> Or, if straightforward string concatenation really needs to be fast,
> then str + str should be optimized (like it used to be).

As far as I remember str + str is discouraged as a way of
concatenating strings. We in pypy should make it fast if it's *really*
the official way.

StringIO is bytes only I think, which might be a bit of an issue if
you want a unicode at the end.

PyPy's Unicode/String builders are a bit of a hack until we come up with
something that can make ''.join faster I think.

Cheers,
fijal

>
> Regards
>
> Antoine.
>
>

Re: RFC: Add a new builtin strarray type to Python?

Nick Coghlan
On Sat, Oct 1, 2011 at 8:33 PM, Maciej Fijalkowski <[hidden email]> wrote:
> StringIO is bytes only I think, which might be a bit of an issue if
> you want a unicode at the end.

I'm not sure why you would think that (aside from a 2.x holdover).
StringIO handles Unicode text, BytesIO handles bytes.
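
A short check of the distinction in Python 3, using only the io module:

```python
import io

# io.StringIO accumulates str (Unicode text)
text_buf = io.StringIO()
text_buf.write("caf\u00e9 ")
text_buf.write("\u20ac5")
text = text_buf.getvalue()

# io.BytesIO accumulates bytes
byte_buf = io.BytesIO()
byte_buf.write(text.encode("utf-8"))
data = byte_buf.getvalue()

# Mixing the two is rejected rather than silently converted
try:
    io.StringIO().write(b"bytes")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```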

Cheers,
Nick.

--
Nick Coghlan   |   [hidden email]   |   Brisbane, Australia

Re: RFC: Add a new builtin strarray type to Python?

Nick Coghlan
In reply to this post by Victor STINNER
On Sat, Oct 1, 2011 at 1:17 PM, Victor Stinner
<[hidden email]> wrote:
> Most bytearray methods return a new object in most cases. I don't understand
> why, it's not efficient. I don't know if we can do in-place operations for
> strarray methods having the same name than bytearray methods (which are not
> in-place methods).

No, we can't. The whole point of having separate in-place operators is
to distinguish between operations that can modify the original object,
and those that leave the original object alone (even when it's an
instance of a mutable type like list or bytearray). Efficiency takes a
distant second place to correctness when determining API behaviour.
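
The existing bytearray type already shows this split between in-place operators and ordinary methods:

```python
buf = bytearray(b"abc")
alias = buf                 # second name for the same object

buf += b"def"               # in-place operator: mutates the existing object
same_object = buf is alias  # the alias sees the change

upper = buf.upper()         # ordinary method: returns a new bytearray
```

After this runs, alias and buf are still one object holding b"abcdef", while upper is a separate object and buf itself is unchanged by the method call.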

> str has some more methods that bytes and bytearary don't have, like format. We
> may do in-place operation for these methods.

No we can't, since they're not mutating methods, so they shouldn't
affect the state of the current object.

I'm only -0 on the idea (since bytearray and io.BytesIO seem to
coexist happily enough), but any such strarray object would need to
behave itself with respect to which operations affected the internal
state of the object.

With strings defined as immutable objects, concatenating them in a
loop is formally an O(N*N) operation. Those are always going to scale
poorly. The 'resize if only one reference' trick was fragile, masked a
real algorithmic flaw in user code, but also sped up a lot of naive
software. It was definitely a case of practicality beating purity.
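
The quadratic cost can be made visible by counting characters copied rather than by timing. The sketch below assumes that every += builds a brand-new string (that is, no resize trick); both function names are hypothetical.

```python
def chars_copied_by_concat(pieces):
    """Count characters written if each += allocates a fresh string."""
    copied = 0
    length = 0
    for piece in pieces:
        length += len(piece)
        copied += length      # the whole new string is written each time
    return copied


def chars_copied_by_join(pieces):
    """Count characters written by a single final copy."""
    return sum(len(p) for p in pieces)


pieces = ["x"] * 1000
naive = chars_copied_by_concat(pieces)   # N * (N + 1) / 2 for N single chars
joined = chars_copied_by_join(pieces)    # N
```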

Any change that depends on the user changing their code would be
rather missing the point of the original optimisation - if the user is
sufficiently aware of the problem to know they need to change their
code, then explicitly joining a list of substrings or using a StringIO
object instead of an ordinary string is well within their grasp.

Adding a "disjoint" string representation to the existing PEP 393
suite of representations would solve the same problem in a more
systematic way and, as Martin pointed out, could likely use the same
machinery as is provided for backwards compatibility with code
expecting the legacy string representation.

Cheers,
Nick.

--
Nick Coghlan   |   [hidden email]   |   Brisbane, Australia

Re: RFC: Add a new builtin strarray type to Python?

Victor STINNER
In reply to this post by Antoine Pitrou
On Saturday 1 October 2011 22:21:01, Antoine Pitrou wrote:
> So, since people are confused at the number of possible options, you
> propose to add a new option and therefore increase the confusion?

The idea is to provide an API very close to the str type. So if your program
becomes slow in some functions and these functions are manipulating strings:
just try to replace str() by strarray() at the beginning of your loop, and
redo your benchmark.

I don't know if we really need all str methods: ljust(), endswith(),
isspace(), lower(), strip(), ... or if a UnicodeBuilder supporting in-place
a+=b would be enough. I suppose that it just would be more practical to have
the same methods.

Another useful use case is to be able to replace a substring: using strarray,
you can use the standard array[a:b] = newsubstring to insert, replace or
delete. Extract of strarray unit tests:

        abc = strarray('abc')
        abc[:1] = '123' # replace
        self.assertEqual(abc, '123bc')
        abc[3:3] = '45' # insert
        self.assertEqual(abc, '12345bc')
        abc[5:] = '' # delete
        self.assertEqual(abc, '12345')

But only "replace" would be O(1). ("insert" requires less work than a replace
in a classic str if the replaced string is near the end.) You cannot
insert/delete using StringIO, str.join, or StringBuilder/UnicodeBuilder, but
you can using array('u'). Of course, you can replace a single character:
strarray[i] = 'x'.

(Using array[a:b]=newstr and array.index(), you can implement your in-place
.replace() function.)
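
Here is a sketch of such an in-place replace. It works on a plain list of one-character strings as a stand-in for strarray, and it uses slice comparison instead of index() to find matches; the function name replace_in_place is hypothetical.

```python
def replace_in_place(chars, old, new):
    """Replace every occurrence of old with new inside the list chars."""
    old, new = list(old), list(new)
    if not old:
        raise ValueError("old must not be empty")
    i = 0
    while i <= len(chars) - len(old):
        if chars[i:i + len(old)] == old:
            chars[i:i + len(old)] = new   # slice assignment edits in place
            i += len(new)                 # skip past the inserted text
        else:
            i += 1
    return chars


buf = list("one two one")
result = replace_in_place(buf, "one", "three")
text = "".join(buf)
```

The list keeps its identity throughout, which is the property the slice-assignment approach is meant to provide.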

> I don't understand why StringIO couldn't simply be optimized a little
> more, if it needs to.

Honestly, I didn't know that StringIO.write() is more efficient than str+=str,
and it is surprising to use the io module (which is supposed to be related to
files) to manipulate strings. But we can maybe document some "trick" (is it a
trick or not?) in the str documentation (and in the FAQ, and on
stackoverflow.com, and ...).

> Or, if straightforward string concatenation really needs to be fast,
> then str + str should be optimized (like it used to be).

We cannot have best performance and lowest memory usage at the same time with
the new str implementation (PEP 393). The new implementation is even more
focused on read-only (constant) strings than the previous one (Py_UNICODE
array using two memory blocks).

The PEP 393 uses one memory block, you cannot resize a str object anymore. The
old str type, StringIO, array (and strarray) use two memory blocks, so it is
possible to resize them (objects keep their identifier after the resize).

It *might* be possible to implement a strarray that is fast on concatenation
and has a small memory footprint, but we cannot use it for the str type
because str is immutable in Python.

--

On second thought, it may be easy to implement strarray if it reuses
unicodeobject.c. For example, strarray can be a special (mutable) case of
PyUnicodeObject (one which uses two memory blocks): the string would always be
ready, but never compact.

By the way, bytesobject.c and bytearrayobject.c are a fiasco: most functions are
duplicated even though the code is very close. A big refactoring is required to
remove the duplicate code there.

Victor

Re: RFC: Add a new builtin strarray type to Python?

Antoine Pitrou
On Sun, 2 Oct 2011 15:00:01 +0200
Victor Stinner <[hidden email]> wrote:
>
> > I don't understand why StringIO couldn't simply be optimized a little
> > more, if it needs to.
>
> Honestly, I didn't know that StringIO.write() is more efficient than str+=str,
> and it is surprising to use the io module (which is supposed to be related to
> files) to manipulate strings.

StringIO is an in-memory file-like object, like in 2.x (where it lived
in the "cStringIO" module). I don't think it's a novel thing.

> The PEP 393 uses one memory block, you cannot resize a str object anymore.

I don't know why you're saying that. The concatenation optimization
worked in 2.x where the "str" type also used only one memory block. You
just have to check that the refcount is about to drop to zero.
Of course, resizing only works if the two unicode objects are of the
same "kind".

Regards

Antoine.



Re: RFC: Add a new builtin strarray type to Python?

Stephen J. Turnbull
Antoine Pitrou writes:

 > StringIO is an in-memory file-like object, like in 2.x (where it lived
 > in the "cStringIO" module). I don't think it's a novel thing.

The problem is the name "StringIO".  Something like "StringStream" or
"StringBuffer" might be more discoverable.  I personally didn't have
trouble deducing that "StringIO" means "treat a string like a file",
but it's not immediately obvious what the module is for (unless you
already know).


Re: RFC: Add a new builtin strarray type to Python?

Antoine Pitrou
On Sunday 02 October 2011 at 23:39 +0900, Stephen J. Turnbull wrote:

> Antoine Pitrou writes:
>
>  > StringIO is an in-memory file-like object, like in 2.x (where it lived
>  > in the "cStringIO" module). I don't think it's a novel thing.
>
> The problem is the name "StringIO".  Something like "StringStream" or
> "StringBuffer" might be more discoverable.  I personally didn't have
> trouble deducing that "StringIO" means "treat a string like a file",
> but it's not immediately obvious what the module is for (unless you
> already know).

I'm not sure why "StringStream" or "StringBuffer" would be more
discoverable, unless you're coming from a language where these names are
well-known. A "stream" is usually related to I/O, anyway; while a
"buffer" is more like an implementation detail.
I personally like the relative tersity of "StringIO".



Re: RFC: Add a new builtin strarray type to Python?

Antoine Pitrou
On Sun, 02 Oct 2011 16:41:16 +0200
Antoine Pitrou <[hidden email]> wrote:

> On Sunday 02 October 2011 at 23:39 +0900, Stephen J. Turnbull wrote:
> > Antoine Pitrou writes:
> >
> >  > StringIO is an in-memory file-like object, like in 2.x (where it lived
> >  > in the "cStringIO" module). I don't think it's a novel thing.
> >
> > The problem is the name "StringIO".  Something like "StringStream" or
> > "StringBuffer" might be more discoverable.  I personally didn't have
> > trouble deducing that "StringIO" means "treat a string like a file",
> > but it's not immediately obvious what the module is for (unless you
> > already know).
>
> I'm not sure why "StringStream" or "StringBuffer" would be more
> discoverable, unless you're coming from a language where these names are
> well-known. A "stream" is usually related to I/O, anyway; while a
> "buffer" is more like an implementation detail.
> I personally like the relative tersity of "StringIO".

Apparently the real word is "terseness". My bad.

Antoine.



Re: RFC: Add a new builtin strarray type to Python?

Alex Gaynor
There are a number of issues that are being conflated by this thread.

1) Should str += str be fast. In my opinion, the answer is an obvious and
   resounding no. Strings are immutable, thus repeated string addition is
   O(n**2). This is a natural and obvious conclusion. Attempts to change this
   are only truly possible on CPython, and thus create a worse environment for
   other Pythons, as well as being quite misleading, as they'll be extremely
   brittle. It's worth noting that, to my knowledge, JVMs haven't attempted
   hacks like this.

2) Should we have a mutable string. Personally I think this question just
   misses the point. No one actually wants a mutable string, the closest thing
   anyone asks for is faster string building, which can be solved by a far more
   specialized thing (see (3)) without all the API hangups of "What methods
   mutate?", "Should it have every str method", or "Is it a drop-in
   replacement?".

3) And, finally the question that prompted this entire thing. Can we have a
   better way of incremental string building than the current list + str.join
   method. Personally I think unless your interest is purely in getting the
   most possible speed out of Python, the current idiom is probably acceptable.
   That said, if you want to get the most possible speed, a StringBuilder in
   the vein PyPy offers is the only sane way. It's able to be faster because it
   has very few ways to interact with it, and once you're done it reuses
   its buffer to create the Python level string object, which is to say
   there's no need to copy it at the end.
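
For readers who have not seen the builder interface, here is a pure-Python sketch with the three methods named earlier in the thread (append, append_slice, build). The class name and the (text, start, end) parameter order are assumptions. Unlike the PyPy builder described above, this version still copies once in build(), so it only illustrates the narrow API, not the buffer reuse.

```python
class StringBuilderSketch:
    """Hypothetical append-only builder; list + str.join under the hood."""

    def __init__(self):
        self._parts = []

    def append(self, text):
        self._parts.append(text)

    def append_slice(self, text, start, end):
        # Append text[start:end] without the caller slicing first
        self._parts.append(text[start:end])

    def build(self):
        # One final copy into an immutable str
        return "".join(self._parts)


builder = StringBuilderSketch()
builder.append("hello")
builder.append_slice("...  world...", 4, 10)
greeting = builder.build()
```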

As I said, unless your interest is maximum performance, there's nothing wrong
with the current idiom, and we'd do well to educate our users, rather than have
more hacks.

Alex


Re: RFC: Add a new builtin strarray type to Python?

Stephen J. Turnbull
In reply to this post by Antoine Pitrou
Antoine Pitrou writes:

 > I'm not sure why "StringStream" or "StringBuffer" would be more
 > discoverable, unless you're coming from a language where these names are
 > well-known.

I think they are, but it doesn't really matter, since both are a bit
lame, and I doubt either is sufficiently suggestive to be worth
changing the name of the module, or even providing an alias.  I wish I
had a better name to offer, that's all.

 > I personally like the relative tersity of "StringIO".

The issue is not that I *dislike* the name; I *personally* like the
name fine.  It's that it's definitely not doing anything to reduce the
frequency of the "efficient string concatenation" FAQ.

Re: RFC: Add a new builtin strarray type to Python?

Simon Cross-3
In reply to this post by Victor STINNER
On Sat, Oct 1, 2011 at 7:17 PM, Victor Stinner
<[hidden email]> wrote:
> I'm writing this email to ask you if this type solves a real issue, or if we
> can just prove the super-fast str.join(list of str).

I'm -1 on hacking += to be fast again because having the two loops
below perform wildly differently is *very* surprising to me:

s = ''
for x in loops:
    s += x

s = ''
for x in loops:
    s = s + x

Schiavo
Simon

Re: RFC: Add a new builtin strarray type to Python?

Simon Cross-3
On Sun, Oct 2, 2011 at 7:23 PM, Simon Cross
<[hidden email]> wrote:

> I'm -1 on hacking += to be fast again because having the two loops
> below perform wildly differently is *very* surprising to me:
>
> s = ''
> for x in loops:
>    s += x
>
> s = ''
> for x in loops:
>    s = s + x

Erk. Bad example. Second example should be:

s = ''
for x in loops:
   b = s
   s += x

(I misunderstood the details but I knew the reference counting
hackiness would lead to surprises somewhere :).
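
The reason the extra reference matters can be checked without looking at performance at all: whatever the interpreter does internally, the other name must keep seeing the old value.

```python
s = "ab"
b = s        # second reference to the same string object
s += "c"     # s gets a new value; the object b refers to must not change
```

With b alive, the old string cannot be resized in place, so each iteration of the second loop has to allocate and copy.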

Schiavo
Simon

Re: RFC: Add a new builtin strarray type to Python?

Victor STINNER
In reply to this post by Antoine Pitrou
On Sunday 2 October 2011 15:25:21, Antoine Pitrou wrote:
> I don't know why you're saying that. The concatenation optimization
> worked in 2.x where the "str" type also used only one memory block. You
> just have to check that the refcount is about to drop to zero.
> Of course, resizing only works if the two unicode objects are of the
> same "kind".

Oh, I see. In Python 2.7, bytes+=bytes calls PyMem_Realloc() and then writes
the new characters to the result. It doesn't overallocate like bytearray does
(which overallocates by 12.5%).
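
To see what overallocation buys, the toy model below counts how often a buffer must be reallocated when it grows one character at a time. It is a simplified simulation, not CPython's actual growth formula; the function name and the rounding are assumptions.

```python
def count_reallocs(n_appends, overallocate=0.0):
    """Count reallocations when appending one item at a time."""
    capacity = 0
    reallocs = 0
    for size in range(1, n_appends + 1):
        if size > capacity:
            # Grow to the needed size plus an optional spare fraction
            capacity = size + int(size * overallocate)
            reallocs += 1
    return reallocs


exact_fit = count_reallocs(1000)            # one realloc per append
with_spare = count_reallocs(1000, 0.125)    # far fewer reallocs
```

A realloc does not always move the data, so the exact-fit strategy is not necessarily as slow as this count suggests.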

I restored this hack in Python 3.3 using PyUnicode_Append() in ceval.c and by
optimizing PyUnicode_Append() (try to append in-place). str+=str is closer
again to ''.join:

str += str: 696 ms
''.join():  547 ms

I temporarily disabled the optimization for wstr strings in PyUnicode_Resize()
because of a bug. I completely disabled resizing on Windows because of another
bug.

Victor

Re: RFC: Add a new builtin strarray type to Python?

Hrvoje Niksic-2
In reply to this post by Alex Gaynor
On 10/02/2011 06:34 PM, Alex Gaynor wrote:
> There are a number of issues that are being conflated by this thread.
>
> 1) Should str += str be fast. In my opinion, the answer is an obvious and
>     resounding no. Strings are immutable, thus repeated string addition is
>     O(n**2). This is a natural and obvious conclusion. Attempts to change this
>     are only truly possible on CPython, and thus create a worse enviroment for
>     other Pythons, as well as a quite misleading, as they'll be extremely
>     brittle. It's worth noting that, to my knowledge, JVMs haven't attempted
>     hacks like this.

CPython is already misleading and ahead of JVM, because the str += str
optimization has been applied to Python 2 some years ago - see
http://hg.python.org/cpython-fullhistory/rev/fb6ffd290cfb?revcount=480

I like Python's immutable strings and consider it a good default for
strings.  Nevertheless a mutable string would be useful for those
situations when you know you are about to manipulate a string-like
object a number of times, where immutable strings require too many
allocations.

I don't think Python needs a StringBuilder - constructing strings using
a list of strings or StringIO is well-known and easy.  Mutable strings
are useful for the cases where StringBuilder doesn't suffice because you
need modifications other than appends.  This is analogous to file writes
- in practice most of them are appends, but sometimes you also need to
be able to seek and write stuff in the middle.
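
The file analogy can be tried directly with io.StringIO, which supports overwriting in the middle but not inserting or deleting:

```python
import io

buf = io.StringIO("hello world")

buf.seek(6)                # move to the start of "world"
buf.write("there")         # overwrite five characters in place
middle = buf.getvalue()

buf.seek(0, io.SEEK_END)   # back to the end for a normal append
buf.write("!")
final = buf.getvalue()
```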

Hrvoje

Re: RFC: Add a new builtin strarray type to Python?

Victor STINNER
In reply to this post by Victor STINNER
On 03/10/2011 04:19, Victor Stinner wrote:

> I restored this hack in Python 3.3 using PyUnicode_Append() in ceval.c and by
> optimizing PyUnicode_Append() (try to append in-place). str+=str is closer
> again to ''.join:
>
> str += str: 696 ms
> ''.join():  547 ms
>
> I disabled temporary the optimization for wstr string in PyUnicode_Resize()
> because of a bug. I disabled completly resize on Windows because of another
> bug.

Ok, bugs fixed, all "resize" optimizations are now enabled:

Python 3.3
str += str    : 119 ms
''.join()     : 130 ms
StringIO.join : 147 ms
StringBuilder : 404 ms
array('u')    : 979 ms

Hum, str+=str is now the fastest method, even faster than ''.join() !?
It's maybe time to optimize str.join ;-)

Victor

Re: RFC: Add a new builtin strarray type to Python?

"Martin v. Löwis"
In reply to this post by Victor STINNER
> I restored this hack in Python 3.3 using PyUnicode_Append() in ceval.c and by
> optimizing PyUnicode_Append() (try to append in-place). str+=str is closer
> again to ''.join:

Why are you checking, in unicode_resizable, whether the string is from
unicode_latin1? If it is, then it should have a refcount of at least 2,
so the very first test in the function should already exclude it.

Regards,
Martin