multiprocessing module and matplotlib.pyplot/PdfPages


Paulo da Silva-3
I have a program that generates about 100 relatively complex graphics and
writes them to a PDF book.
It takes a while!
Is there any possibility of using multiprocessing to build the graphics
and then using several calls to savefig(), i.e. some kind of graphics
objects?

Thanks for any help/comments.



Dave Angel-4
On 04/20/2015 10:14 PM, Paulo da Silva wrote:
> I have a program that generates about 100 relatively complex graphics and
> writes them to a PDF book.
> It takes a while!
> Is there any possibility of using multiprocessing to build the graphics
> and then using several calls to savefig(), i.e. some kind of graphics
> objects?
>

To know if this is practical, we have to guess about the code you
describe, and about the machine you're using.

First, if you don't have multiple cores on your machine, then it's
probably not going to be any faster, and might be substantially slower.
Ditto if the code is so large that multiple copies of it will cause
swapping.

But if you have 4 cores and a processor-bound algorithm, it can indeed
save time to run 3 or 4 processes in parallel.  You'll have to write
code to parcel out the parts that can be done in parallel, and make a
queue that each process can grab its next assignment from.
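A minimal sketch of that idea, using a multiprocessing.Pool as the work queue (the `build_graphic` function here is a hypothetical stand-in for the real per-plot work):

```python
import multiprocessing as mp

def build_graphic(index):
    # Hypothetical stand-in for one CPU-bound plot: the real version
    # would compute the data and render figure number `index`.
    return sum(i * i for i in range(index * 1000))

if __name__ == "__main__":
    jobs = range(100)  # one job per graphic
    # Pool keeps an internal task queue: each worker grabs the next
    # index as soon as it finishes the previous one.
    with mp.Pool(processes=4) as pool:
        results = pool.map(build_graphic, jobs)
    print(len(results))
```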

There are other gotchas, such as common code that has to be run before
any of the subprocesses.  If you discover that each of these 100 pieces
has to access data from earlier pieces, then you could get bogged down
in communications and coordination.

If the 100 plots are really quite independent, you could also consider
recruiting time from multiple machines.  As long as the data that needs
to go between them is not too large, it can pay off big time.

--
DaveA



Paulo da Silva-3
On 21-04-2015 11:26, Dave Angel wrote:
> On 04/20/2015 10:14 PM, Paulo da Silva wrote:
>> I have a program that generates about 100 relatively complex graphics and
>> writes them to a PDF book.
>> It takes a while!
>> Is there any possibility of using multiprocessing to build the graphics
>> and then using several calls to savefig(), i.e. some kind of graphics
>> objects?
>>
>
...

>
> If the 100 plots are really quite independent, you could also consider
> recruiting time from multiple machines.  As long as the data that needs
> to go between them is not too large, it can pay off big time.
>
Sorry if I was not clear.

Yes, I have 8 cores and the graphics calculations are all
independent. The problem I have is whether there is any way to generate
independent figures in matplotlib. The logic seems to be: build the
graphic and save it. I was trying to find out if there is any way to build
graphic objects in parallel and, at the end, have them saved by
the controller task.

Maybe using fork instead of multiprocessing would do the job, but I still
haven't looked at fork in Python. If it were possible to use the
multiprocessing module for this purpose, it would make things easier.

Thanks




Chris Angelico
On Wed, Apr 22, 2015 at 1:53 AM, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns at netcabo.pt> wrote:
> Yes, I have 8 cores and the graphics calculations are all
> independent. The problem I have is whether there is any way to generate
> independent figures in matplotlib. The logic seems to be: build the
> graphic and save it. I was trying to find out if there is any way to build
> graphic objects in parallel and, at the end, have them saved by
> the controller task.

The very simplest way would be to simply spawn entirely separate
Python processes. Each one would import matplotlib independently, do
its work, and save its figure. Would that work for what you're trying
to do?
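That approach can be sketched with the subprocess module; the inline worker below is only a placeholder for a real script that would import matplotlib and save one figure:

```python
import subprocess
import sys

# Placeholder worker: a real one would import matplotlib, draw figure
# number sys.argv[1], and save it to disk before exiting.
WORKER = """\
import sys
print("figure %s saved" % sys.argv[1])
"""

def spawn_workers(indices):
    # One entirely separate Python process per figure; each gets its
    # own interpreter and therefore its own independent matplotlib state.
    procs = [subprocess.Popen([sys.executable, "-c", WORKER, str(i)],
                              stdout=subprocess.PIPE, text=True)
             for i in indices]
    return [p.communicate()[0].strip() for p in procs]

if __name__ == "__main__":
    print(spawn_workers(range(3)))
```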

ChrisA



Paulo da Silva-3
On 21-04-2015 16:58, Chris Angelico wrote:

> On Wed, Apr 22, 2015 at 1:53 AM, Paulo da Silva
> <p_s_d_a_s_i_l_v_a_ns at netcabo.pt> wrote:
>> Yes, I have 8 cores and the graphics calculations are all
>> independent. The problem I have is whether there is any way to generate
>> independent figures in matplotlib. The logic seems to be: build the
>> graphic and save it. I was trying to find out if there is any way to build
>> graphic objects in parallel and, at the end, have them saved by
>> the controller task.
>
> The very simplest way would be to simply spawn entirely separate
> Python processes. Each one would import matplotlib independently, do
> its work, and save its figure. Would that work for what you're trying
> to do?
>

Yes, fork will do that. I have just looked at it and it is the same as
the Unix fork (module os). I am thinking of launching several forked
processes that will produce .png images, and at the end I'll call the
"convert" program to put those .png files into a PDF book. A poor
solution, but much faster.
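An os.fork sketch of that plan (the marker files stand in for the real .png images; this is a hypothetical layout, not the poster's actual code):

```python
import os
import tempfile

def run_in_forks(indices, outdir):
    # Fork one child per figure; each child saves its output and exits
    # with os._exit so it never returns into the parent's code path.
    pids = []
    for i in indices:
        pid = os.fork()
        if pid == 0:  # child
            with open(os.path.join(outdir, "%d.png" % i), "w") as f:
                f.write("placeholder for figure %d" % i)
            os._exit(0)
        pids.append(pid)
    for pid in pids:  # parent waits for every child to finish
        os.waitpid(pid, 0)

if __name__ == "__main__":
    outdir = tempfile.mkdtemp()
    run_in_forks(range(4), outdir)
    print(sorted(os.listdir(outdir)))
    # afterwards, e.g.:  convert 0.png 1.png 2.png 3.png book.pdf
```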

Unfortunately matplotlib seems not to be object oriented!




Rob Gaddi
On Tue, 21 Apr 2015 03:14:09 +0100, Paulo da Silva wrote:

> I have a program that generates about 100 relatively complex graphics and
> writes them to a PDF book.
> It takes a while!
> Is there any possibility of using multiprocessing to build the graphics
> and then using several calls to savefig(), i.e. some kind of graphics
> objects?
>
> Thanks for any help/comments.

That sounds pretty reasonable.  Just be sure to explicitly close each
figure once you're done with it.  Matplotlib figures take up a shocking
amount of memory; nothing slows your system to a horrendous crawl like
having to resort to swapping to disk.

One thing that would be a bit worrisome is managing order, since I'm
assuming you have some order that you want the pages to be written in,
and spawning things off to multiple processes creates a chokepoint where
you'd need to collect the results and reorder them before dropping them
into the PdfPages correctly.  Maybe you'd get some boost from a
ProcessPoolExecutor, but maybe not.
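One point in ProcessPoolExecutor's favour for the ordering concern: its map() returns results in submission order regardless of which worker finishes first. A sketch, with `render_page` as a hypothetical stand-in for the real per-page work:

```python
from concurrent.futures import ProcessPoolExecutor

def render_page(idx):
    # Hypothetical stand-in: the real function would build one figure
    # and save it, e.g. as "%d.pdf" % idx, returning the file name.
    return "page-%d.pdf" % idx

def render_all(n_pages, workers=4):
    # map() yields results in input order even though the worker
    # processes may finish out of order, so pages stay in sequence.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_page, range(n_pages)))

if __name__ == "__main__":
    print(render_all(8))
```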

That's where I like ChrisA's solution of having the various processes
(whether they're spawned from the same thing or not) each be
responsible for writing their own figures out to disk, one page per
file, and then using something like pdftk to stitch them all together
after the fact.

--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.



Dave Angel-4
On 04/21/2015 07:54 PM, Dennis Lee Bieber wrote:

> On Tue, 21 Apr 2015 18:12:53 +0100, Paulo da Silva
> <p_s_d_a_s_i_l_v_a_ns at netcabo.pt> declaimed the following:
>
>
>>
>> Yes, fork will do that. I have just looked at it and it is the same as
>> the Unix fork (module os). I am thinking of launching several forked
>> processes that will produce .png images, and at the end I'll call the
>> "convert" program to put those .png files into a PDF book. A poor
>> solution, but much faster.
>>
>
> To the best of my knowledge, on a UNIX-type OS, multiprocessing /uses/
> fork() already. Windows does not have the equivalent of fork(), so
> multiprocessing uses a different method to create the process
> (conceptually, it runs a program that does an import of the module followed
> by a call to the named method -- which is why one must use the
>
> if __name__  ...
>
> guard to prevent the subprocess import from repeating the original
> main program.
>

The page:
https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

indicates that there are 3 ways in which a new process may be started.
On Unix you may use any of the three, while on Windows, you're stuck
with spawn.

I *think* that in Unix, it always does a fork.  But if you specify
"spawn" in Unix, you get all the extra overhead to wind up with what
you're describing above.  If you know your code will run only on Unix,
you presumably can get much more efficiency by using the fork
start-method explicitly.
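That explicit choice looks like this (a minimal sketch; the "fork" start method is only available on Unix):

```python
import multiprocessing as mp

def work(n):
    # Placeholder for one CPU-bound plotting job.
    return n * n

if __name__ == "__main__":
    # Request the fork start method explicitly.  A forked child is a
    # copy of the parent, so no re-import of the module is needed;
    # "spawn" would instead start a fresh interpreter and re-import.
    ctx = mp.get_context("fork")
    with ctx.Pool(4) as pool:
        print(pool.map(work, range(5)))
```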

I haven't done it, but it would seem likely to me that forked code can
continue to use existing global variables.  Changes to those variables
would not be shared across the two forked processes.  But if this is
true, it would seem to be a much easier way to launch the second
process, if it's going to be nearly identical to the first.

Maybe this is just describing the os.fork() function.
    https://docs.python.org/3.4/library/os.html#os.fork

--
DaveA



Paulo da Silva-3
On 21-04-2015 03:14, Paulo da Silva wrote:
> I have a program that generates about 100 relatively complex graphics and
> writes them to a PDF book.
> It takes a while!
> Is there any possibility of using multiprocessing to build the graphics
> and then using several calls to savefig(), i.e. some kind of graphics
> objects?
>
> Thanks for any help/comments.
>

After all kind answers ...

I tried os.fork.
Besides some problems with a CSV file that is also produced, which I
solved by using a pipe and making the parent task do all the writes,
things are working :-)

Using 16 tasks (my CPU supports 8) and 70 graphics, the elapsed time
went from 1m40s to 25s, plus ~5s to convert to a PDF book. Not bad :-)
It could be better, however. BTW, using 8 tasks takes 30s.

For more graphics ... the result should be much worse. At the end of
the 70 graphics the CPU had already begun to heat up and the clock
frequency dropped.

Thanks to all who responded.




Oscar Benjamin-2
On 21 April 2015 at 16:53, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns at netcabo.pt> wrote:

> On 21-04-2015 11:26, Dave Angel wrote:
>> On 04/20/2015 10:14 PM, Paulo da Silva wrote:
>>> I have a program that generates about 100 relatively complex graphics and
>>> writes them to a PDF book.
>>> It takes a while!
>>> Is there any possibility of using multiprocessing to build the graphics
>>> and then using several calls to savefig(), i.e. some kind of graphics
>>> objects?
>>>
>>
> ...
>
>>
>> If the 100 plots are really quite independent, you could also consider
>> recruiting time from multiple machines.  As long as the data that needs
>> to go between them is not too large, it can pay off big time.
>>
> Sorry if I was not clear.
>
> Yes, I have 8 cores and the graphics calculations are all
> independent. The problem I have is whether there is any way to generate
> independent figures in matplotlib. The logic seems to be: build the
> graphic and save it. I was trying to find out if there is any way to build
> graphic objects in parallel and, at the end, have them saved by
> the controller task.

Hi Paulo,

It sounds like you're using matplotlib's "stateful" API. This is a
convenience layer for interactive work so that you can do something
like:

from pylab import *

plot([0, 1], [0, 1])
savefig('plot.pdf')

For normal code it is recommended to use the "object-oriented" API
which looks like:

from matplotlib.pyplot import figure

fig = figure(figsize=(4, 5))
ax = fig.add_axes([0.15, 0.15, 0.70, 0.70])
ax.plot([0, 1], [0, 1])
fig.savefig('plot.pdf')

When using this API it is entirely possible to create many figures in
parallel using e.g. multiprocessing.
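For example (a sketch assuming matplotlib is installed; building matplotlib.figure.Figure objects directly, with an explicit Agg canvas, avoids pyplot's global state entirely):

```python
import multiprocessing as mp

def save_plot(idx):
    # Imported inside the worker so each process sets up its own
    # matplotlib state; the Agg backend renders without any display.
    from matplotlib.figure import Figure
    from matplotlib.backends.backend_agg import FigureCanvasAgg
    fig = Figure(figsize=(4, 3))
    FigureCanvasAgg(fig)  # attach a canvas so savefig can render
    ax = fig.add_subplot(111)
    ax.plot([0, 1], [0, idx])
    fname = "%d.png" % idx
    fig.savefig(fname)
    return fname

if __name__ == "__main__":
    with mp.Pool(4) as pool:
        print(pool.map(save_plot, range(8)))
```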

I can't find a good reference to explain this API, but this page mentions it:
http://matplotlib.org/faq/usage_faq.html

However, each figure comes with a significant memory overhead, and the
call to savefig can be the most CPU-intensive part, so I wouldn't
recommend building a list of figures and calling savefig on them at the
end. Another approach is to save them in parallel as 1.pdf, 2.pdf,
etc., and then use something like pdftk to merge the PDF pages at the
end.


Oscar