Hi all,
This is an update on the (so far PyPy-only) project of adding "Automatic
Mutual Exclusion" to Python, via STM (Software Transactional Memory).
For the motivation, see here:

http://morepypy.blogspot.com/2012/03/call-for-donations-for-software.html

"""The point is that [with STM/AME] your program is always correct, and
can be tweaked to improve performance.  This is the opposite of what
explicit threads and locks give you, which is a performant program that
you need to tweak to remove bugs.  Arguably, this approach is the reason
why you use Python in the first place :-)"""

The update is: I now believe that it might be (reasonably) possible to
apply the same techniques to CPython, not only to PyPy.  For now I am
experimenting with applying them in a simple CPython-like interpreter.
If it works, it might end up as a patch to the core parts of CPython.
The interesting property is that it would still be able to run
unmodified C extension modules --- the Python code gets the benefits of
multi-core STM/AME only if it involves only the patched parts of the C
code, but in all cases it still works correctly, falling back to
single-core usage.

I have not tried to hack CPython so far, only this custom interpreter
for a Lisp language, whose implementation should be immediately familiar
to anyone who knows CPython's C code:

https://bitbucket.org/arigo/duhton

The non-standard built-in function is "transaction", which schedules a
transaction to run later (see test/test_transaction.py).  The code
contains the necessary tweaks to reference counting, and seems to work
on all examples, but so far it leaks some of the objects.  Fixing this
directly might be possible, but I'm not sure yet (it might require
interaction with the cycle-detecting GC of CPython).  Moreover, the
performance hit is well below 2x --- more like 20%.

If anyone is interested, I could create a dedicated mailing list in
order to discuss this in more detail.
From experience I would think that this has the potential to become a
Psyco-like experiment, but unlike 10 years ago, today I'm no longer
ready to dive completely alone into a project of that scale :-)

A bientôt,

Armin.

_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Armin Rigo, 11.04.2012 13:47:
> This is an update on the (so far PyPy-only) project of adding "Automatic
> Mutual Exclusion" to Python, via STM (Software Transactional Memory).
> [...]
> Moreover the performance hit is well below 2x, more like 20%.

Hmm, those 20% refer to STM, right?  Without hardware support?  Then
hardware support could be expected to drop that even further?

Did you do any experiments with running parallel code so far, to see if
that scales as expected?

Stefan
Hi Stefan,
On Wed, Apr 11, 2012 at 14:29, Stefan Behnel <[hidden email]> wrote:
>> Moreover the performance hit is well below 2x, more like 20%.
>
> Hmm, those 20% refer to STM, right? Without hardware support? Then hardware
> support could be expected to drop that even further?

Yes, that's using STM on my regular laptop.  How HTM would help remains
unclear at this point, because in this approach transactions are
typically rather large --- likely much larger than what the
first-generation HTM-capable processors will support next year.  But
20% looks good anyway :-)

> Did you do any experiments with running parallel code so far, to see if
> that scales as expected?

Yes, it scales very nicely on small non-conflicting examples.  I believe
that it would scale just as nicely on large examples on CPython too,
based on the approach --- as long as we, as CPython developers, make
enough effort to adapt a sufficiently large portion of the CPython C
code base (which would mean: the implementation of most mutable built-in
objects).

A bientôt,

Armin.
Armin Rigo, 11.04.2012 14:51:
> On Wed, Apr 11, 2012 at 14:29, Stefan Behnel wrote:
>>> Moreover the performance hit is well below 2x, more like 20%.
>>
>> Hmm, those 20% refer to STM, right? Without hardware support? Then hardware
>> support could be expected to drop that even further?
>
> Yes, that's using STM on my regular laptop.  How HTM would help
> remains unclear at this point, because in this approach transactions
> are typically rather large --- likely much larger than what the
> first-generation HTM-capable processors will support next year.

Ok.  I guess once the code is there, the hardware will eventually catch
up.

However, I'm not sure what you consider "large".  A lot of manipulation
operations for the built-in types are not all that involved, at least in
the "normal" cases (read: fast paths) that involve no memory
reallocation etc., and anything that can be called by the interpreter
without calling back into it would be a complete and independent
transaction all by itself, since the GIL is allowed to be released
between any two ticks.

Do you know if hybrid TM is possible at this level?  I.e. short
transactions run in hardware, long ones in software?  (Assuming we know
what's "long" and "short", I guess...)

> But 20% looks good anyway :-)

Oh, definitely.

>> Did you do any experiments with running parallel code so far, to see if
>> that scales as expected?
>
> Yes, it scales very nicely on small non-conflicting examples.  I
> believe that it scales just as nicely on large examples on CPython
> too, based on the approach --- as long as we, as CPython developers,
> make enough efforts to adapt a sufficiently large portion of the
> CPython C code base (which would mean: most mutable built-in objects'
> implementation).

Right, that would involve some work.  But the advantage, as I understand
it, is that this can be done incrementally.  I.e. make it work, then
make it fast and make it scale.
Stefan
Stefan Behnel, 11.04.2012 15:31:
> Armin Rigo, 11.04.2012 14:51:
>> On Wed, Apr 11, 2012 at 14:29, Stefan Behnel wrote:
>>> Did you do any experiments with running parallel code so far, to see if
>>> that scales as expected?
>>
>> Yes, it scales very nicely on small non-conflicting examples.  I
>> believe that it scales just as nicely on large examples on CPython
>> too, based on the approach --- as long as we, as CPython developers,
>> make enough efforts to adapt a sufficiently large portion of the
>> CPython C code base (which would mean: most mutable built-in objects'
>> implementation).
>
> Right, that would involve some work. But the advantage, as I understand it,
> is that this can be done incrementally.

Hmm, and according to the papers referenced on the PyPy proposal page,
it seems that at least some of this work has already been done:

http://pypy.org/tmdonate.html#why-hasn-t-the-idea-been-implemented-for-cpython-already

Stefan
In reply to this post by Stefan Behnel-3
>> Yes, that's using STM on my regular laptop.  How HTM would help
>> remains unclear at this point, because in this approach transactions
>> are typically rather large --- likely much larger than what the
>> first-generation HTM-capable processors will support next year.
>
> Ok. I guess once the code is there, the hardware will eventually catch up.
>
> However, I'm not sure what you consider "large". A lot of manipulation
> operations for the builtin types are not all that involved, at least in the
> "normal" cases (read: fast paths) that involve no memory reallocation etc.,
> and anything that can be called by and doesn't call into the interpreter
> would be a complete and independent transaction all by itself, as the GIL
> is allowed to be released between any two ticks.

Large as in L2-cache large, and as in "you won't get a page fault or an
interrupt, you won't make any syscall, any I/O..."  ;-)
In reply to this post by Stefan Behnel-3
On Wed, 11 Apr 2012 15:31:09 +0200
Stefan Behnel <[hidden email]> wrote:
> Ok. I guess once the code is there, the hardware will eventually catch up.
>
> However, I'm not sure what you consider "large". A lot of manipulation
> operations for the builtin types are not all that involved, at least in the
> "normal" cases (read: fast paths) that involve no memory reallocation etc.,
> and anything that can be called by and doesn't call into the interpreter
> would be a complete and independent transaction all by itself, as the GIL
> is allowed to be released between any two ticks.

I think Armin's plan is not to work at the bytecode level, but to make
transactions explicit (at least in framework code - e.g. Twisted or
Stackless -, perhaps not in user code).  Perhaps he can elaborate on
that.

> Do you know if hybrid TM is possible at this level? I.e. short transactions
> run in hardware, long ones in software? (Assuming we know what's "long" and
> "short", I guess...)

There are other issues than the size of transactions.  For example, one
issue is that not all operations may be allowed in a transaction:

“In addition, there are a number of instructions that may cause an abort
on specific implementations.  These instructions include x87 and MMX,
mixed access to XMM and YMM registers, updates to non-status parts of
EFLAGS, updating segment, debug or control registers, ring transitions,
cache and TLB control instructions, any non-writeback memory type
accesses, processor state save, interrupts, I/O, virtualization (VMX),
trusted execution (SMX) and several miscellaneous types.”

http://realworldtech.com/page.cfm?ArticleID=RWT021512050738

So, realistically, a (S)TM implementation in CPython (and probably also
in PyPy) would have to stand on its own merits, rather than betting on
pie-in-the-sky improvements of HTM implementations.

Regards

Antoine.
Hi Antoine, hi Stefan,
On Wed, Apr 11, 2012 at 16:33, Antoine Pitrou <[hidden email]> wrote:
> I think Armin's plan is not to work at the bytecode level, but make
> transactions explicit (at least in framework code - e.g. Twisted or
> Stackless -, perhaps not in user code). Perhaps he can elaborate on
> that.

Yes, precisely.  It should be explained in the proposal.  The references
in
http://pypy.org/tmdonate.html#why-hasn-t-the-idea-been-implemented-for-cpython-already
don't change CPython (or only minimally).  They use hardware TM, but
(the most important thing imho) they target bytecode-level transactions
--- i.e. the programmer is still stuck with the "threading" module.

About using it explicitly in user code: I found out that there are use
cases for doing so directly.  If you have a CPU-intensive program that
does:

    for x in some_random_order_iterator:
        do_stuff_for(x)

then, if the things you do are "not too dependent on each other", you
can win by replacing it with:

    for x in some_random_order_iterator:
        transaction.add(do_stuff_for, x)
    transaction.run()

and no other change.  It has exactly the same semantics, and in this
case you don't really need a framework in which to hide the
transaction.add().  Compare this with the situation of spawning threads:
you need to carefully add locks *everywhere* or your program is buggy
--- both in today's CPython and in a GIL-less,
bytecode-level-transaction CPython.  By the way, that's why I said that
transactions can be arbitrarily long: one transaction will be, in this
case, everything that do_stuff_for(x) does.

A bientôt,

Armin.
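For contrast, here is a minimal sketch of the threads-and-locks version being argued against --- the worker function and its shared state are hypothetical, not code from this thread.  The point is that correctness now depends on remembering an explicit lock around every access to shared state, something the transactional rewrite above does not require.

```python
# Hypothetical sketch of the explicit-threads version: each piece of
# shared mutable state that the workers touch must be guarded by a lock
# the programmer has to remember to take.
import threading

results = []
results_lock = threading.Lock()  # forget this guard and the program is buggy

def do_stuff_for(x):
    value = x * x  # purely local CPU work: safe without any lock
    with results_lock:  # shared mutable state: must be guarded explicitly
        results.append(value)

threads = [threading.Thread(target=do_stuff_for, args=(x,))
           for x in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # -> [0, 1, 4, 9]
```

With transactions, each do_stuff_for(x) would instead run atomically as a whole, and conflicting runs would be retried behind the scenes; no per-object locking discipline is imposed on the user.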