Thanks everyone for building such a great tool! I have a question about how one should deal with code changes when using IPython parallel.
The general problem I have is this: I run several deal tasks continuously on IPython engines, and I'm also constantly pushing code changes to the machines running my IPython parallel cluster(s). How should I architect my system such that the engines handle code changes?
Here are a couple solutions that seem like good ideas but aren't to my knowledge currently possible (please correct me if I'm wrong!):
When a certain event happens (such as a code deploy), have all existing engines restart themselves after they finish their currently executing task. The engines will then run all outstanding jobs using the new code base. More specifically, the first few of those pending jobs will execute with an engine that hasn't imported any code yet.
The other option I could see: Execute the tasks with an option that tells the engine to treat the tasks as a proper python process so that: 1) things like sys.exitfunc() are triggered properly when the task ends, and 2) imported modules are actually imported for the first time. Regarding 2), I know that currently, we can just import modules inside a function call (ie view.apply(my_func_with_imports) ), but if the modules loaded by the engine have any globally modified state, that modified state persists across tasks. This has bitten me quite a few times.
I believe both of these options can also be solved with a soft restart of the engines, but client.shutdown(targets=[...], restart=True) is currently NotImplemented.
To work around the problem, I basically implemented a soft ipcluster restart: Whenever I deploy new code to the machines, I simply start a new ipython cluster on the machines, and I distinguish these clusters using the --cluster-id option. (This is via a (significantly) modified version of the default StarCluster IPython plugin). This plugin lets the old ipcluster complete it's queue (and then it eventually gets shut down), and all new incoming tasks go to the new ipcluster. This work around was great, but we are starting to deploy code more frequently, which means that there are multiple ipcluster instances on our machines at any given time, and I foresee two problems: Multiple ipcluster instances on the same nodes means the engines will begin competing for the limited cpu resources on one node; And having multiple different task queues does make it a bit annoying to keep track of which job is associated to which ipcluster, and it's also difficult to know how many jobs are queued in general.
Do you have any suggestions as to what approach I should take? I would be happy to work on a PR to support client.shutdown(restart=True) if you think that is a good way to solve this problem, but I'd also like to know what that PR would require, as I'm only slightly familiar with the IPython.parallel code base.