clustering


clustering

Shannon -jj Behrens
Hey Guys,

I need to do some data processing, and I'd like to use a cluster so
that I don't have to grow old waiting for my computer to finish.  I'm
thinking about using the servers I have locally.  I'm completely new
to clustering.  I understand how to break a problem up into
parallelizable pieces, but I don't understand the admin side of it.  My
current data set is about 16 gigs, and I need to do things like run
filters over strings, make sure strings are unique, etc.  I'll be
using Python wherever possible.

* Do I have to run a particular Linux distro?  Do they all have to be
the same, or can I just set up a daemon on each machine?

* What does "Beowulf" do for me?

* How do I admin all the boxes without having to enter the same command n times?

* I've heard that MPI is good and standard.  Should I use it?  Can I
use it with Python programs?

* Is there anything better than NFS that I could use to access the data?

* What's hip, slick, and cool these days?

I just need you to point me in the right direction and tell me what's
good and what's a waste of time.

Thanks,
-jj

Re: clustering

Bryan O'Sullivan
On Wed, 2006-08-30 at 14:42 -0700, Shannon -jj Behrens wrote:

> * Do I have to run a particular Linux distro?  Do they all have to be
> the same, or can I just set up a daemon on each machine?

It's a big help if all of your cluster machines are similar, unless you
want to go insane debugging differences in output due to seemingly
insignificant local configuration differences.

I'd recommend taking a look at the Rocks clustering distribution.  It's
based on CentOS, with some features to make PC clustering and
administration less painful.

> * What does "Beowulf" do for me?

Nothing.  It used to be a clustering technology for Linux boxes, back in
the days of Pentium Pro class hardware, but hasn't existed in any
meaningful form in years.  It lives on as a zombie term without any real
semantics :-)

> * How do I admin all the boxes without having to enter the same command n times?

Google for "parallel ssh".  A decent clustering distro (Rocks or Oscar)
will have one bundled.
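
For a handful of machines you can also roll a crude one yourself; here is
a rough sketch in Python (the host names are just placeholders):

    #!/usr/bin/env python
    # Quick-and-dirty "parallel ssh": run one command on every node.
    # The host names are placeholders; substitute your own machines.
    import subprocess
    import sys

    HOSTS = ["node01", "node02", "node03", "node04"]

    def run_everywhere(command):
        # Launch one ssh per host so the commands run concurrently,
        # then wait for each one and report its exit status.
        procs = [(host, subprocess.Popen(["ssh", host, command]))
                 for host in HOSTS]
        for host, proc in procs:
            print "%s: exit status %d" % (host, proc.wait())

    if __name__ == "__main__":
        run_everywhere(" ".join(sys.argv[1:]) or "uptime")

That obviously won't scale to hundreds of nodes, which is where the
bundled tools earn their keep.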

> * I've heard that MPI is good and standard.  Should I use it?

Depends on what you want to do.  It's just a standard for a
message-passing library and runtime, with the API providing some features
for managing group communications.  It's only a bit more abstract than raw
sockets, so you should expect plenty of deadlocks, wrong answers, and
performance problems as you develop your MPI programs :-)

>   Can I
> use it with Python programs?

Google for "Python MPI" :-)

> * Is there anything better than NFS that I could use to access the data?

If you don't mind living on the utterly bleedingest of edges, there are
dedicated clustering filesystems like GFS and Lustre available.  There's
also a standard MPI interface for I/O, called MPI-IO.  If you have a
small cluster (16 nodes is small) and not much data (16GB isn't much),
you should consider just preloading the data onto local disk on each
node and forget about network filesystems altogether.  The cluster
filesystems require you to perform intimate and unnatural acts that I
suspect you may not enjoy.
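
The preloading itself is only a few lines.  A rough sketch (host names and
paths are invented) that deals a big line-oriented file out into one chunk
per node and pushes each chunk to local disk with scp:

    #!/usr/bin/env python
    # Rough sketch: split a big file of strings into one chunk per node
    # and copy each chunk to that node's local disk with scp.
    # Host names and paths are invented for illustration.
    import subprocess

    HOSTS = ["node01", "node02", "node03", "node04"]
    SOURCE = "strings.txt"
    DEST = "/scratch/strings.txt"   # local disk on each node

    # Deal lines out round-robin so the chunks end up roughly equal.
    chunk_files = [open("chunk.%s" % host, "w") for host in HOSTS]
    for i, line in enumerate(open(SOURCE)):
        chunk_files[i % len(HOSTS)].write(line)
    for f in chunk_files:
        f.close()

    # Copy each chunk to its node's local disk.
    for host in HOSTS:
        subprocess.call(["scp", "chunk.%s" % host, "%s:%s" % (host, DEST)])

After that each node reads only its own local chunk, and NFS never enters
the picture.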

        <b


Re: clustering

steve hindle
In reply to this post by Shannon -jj Behrens


On 8/30/06, Shannon -jj Behrens <[hidden email]> wrote:
> parallelizable pieces, but I don't understand the admin side of it.  My
> current data set is about 16 gigs, and I need to do things like run
> filters over strings, make sure strings are unique, etc.  I'll be
> using Python wherever possible.

Sounds like fun :-)

> * Do I have to run a particular Linux distro?  Do they all have to be
> the same, or can I just set up a daemon on each machine?

You can use just about any Linux distro, though it's easier if all the
'compute' nodes run the same one.  That lets you boot the nodes via tftp
and have only one 'compute root image' to juggle.

> * What does "Beowulf" do for me?

It's the basic cluster infrastructure.

> * How do I admin all the boxes without having to enter the same command n times?

tftp boot with a single 'compute' image.  There are also a bunch of
cluster admin tools (and tools for building images for cluster nodes);
check freshmeat.

> * I've heard that MPI is good and standard.  Should I use it?  Can I
> use it with Python programs?

I've never worked with it, but it does appear to be the 'standard' for
cluster work.

> * Is there anything better than NFS that I could use to access the data?

Personally, I just s/NFS/Samba/ these days.  Given some higher-end
hardware, you might want to look at GFS?


Re: clustering

steve hindle
In reply to this post by Bryan O'Sullivan


On 8/30/06, Bryan O'Sullivan <[hidden email]> wrote:
> > * What does "Beowulf" do for me?
>
> Nothing.  It used to be a clustering technology for Linux boxes, back in
> the days of Pentium Pro class hardware, but hasn't existed in any
> meaningful form in years.  It lives on as a zombie term without any real
> semantics :-)

Wow - Beowulf is gone?  Any idea if Don Becker's clustering company is
still around?

Steve



Re: clustering

David E. Konerding
In reply to this post by Shannon -jj Behrens
Shannon -jj Behrens wrote:

> Hey Guys,
>
> I need to do some data processing, and I'd like to use a cluster so
> that I don't have to grow old waiting for my computer to finish.  I'm
> thinking about using the servers I have locally.  I'm completely new
> to clustering.  I understand how to break a problem up into
> parallelizable pieces, but I don't understand the admin side of it.  My
> current data set is about 16 gigs, and I need to do things like run
> filters over strings, make sure strings are unique, etc.  I'll be
> using Python wherever possible.
>  
You might need to be a bit more verbose about the specific details of
your dataset and the processing that needs to be done.  For example, you
are running operations on each of the strings that ultimately have to be
collected back at the master node.  Is it as simple as partitioning your
16 gig dataset over all the nodes equally, running the map-like operations
on the cluster nodes, then running the reduce-like operation as a
communication between the master and the cluster nodes?  Or will there
need to be multiple reductions and redistributions of data?
> * Do I have to run a particular Linux distro?  Do they all have to be
> the same, or can I just set up a daemon on each machine?
>
>  
You can have a heterogeneous cluster.  You will have to do a bit more
work (depending on the variability).

> * What does "Beowulf" do for me?
>
>  
There is no product "Beowulf"; it's more a description of a collection
of technologies strapped together to make cheap supercomputers.
It's worth reading about because the people who work on Beowulf clusters
have typically done much of your homework for you.

> * How do I admin all the boxes without having to enter the same command n times?
>
>  
I use cfengine: you define rules, which can include commands to be run,
files to be copied, etc., and then you can sit on an admin box and push
updates to all the nodes.

> * I've heard that MPI is good and standard.  Should I use it?  Can I
> use it with Python programs?
>
>  
You can (see MPI Python and PyPar).  I never thought it was a good idea.
MPI is about extracting that last bit of efficiency out of supercomputers
for tasks in which the parallelism has to be very tightly coupled to
achieve efficiency.  Writing good MPI code is hard; administering MPI
clusters is painful.

Your time is better spent writing a lightweight parallelism interface
using the existing lightweight networking code in
Python, or in an add-on package like Pyro or Twisted.
> * Is there anything better than NFS that I could use to access the data?
>  
Disks are cheap; fragment your dataset and put chunks on each one.  Heck, you could even put a web server on an admin node, put the data there, and have your clients request parts of the data as necessary.
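
The standard library already covers that web-server idea.  A rough sketch
(host name, port, and file layout are invented): serve the chunk directory
on the admin node with "python -m SimpleHTTPServer 8000", and have each
worker fetch the piece named after it:

    # Worker-side sketch of the "serve the data over HTTP" idea.
    # The admin node serves the directory holding the chunks, e.g.:
    #     python -m SimpleHTTPServer 8000
    # Host name, port, and file names are invented for illustration.
    import socket
    import urllib2

    ADMIN = "http://admin:8000"
    me = socket.gethostname()

    url = "%s/chunk.%s" % (ADMIN, me)
    data = urllib2.urlopen(url).read()
    open("/scratch/chunk.%s" % me, "w").write(data)
    print "fetched %d bytes from %s" % (len(data), url)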


> * What's hip, slick, and cool these days?
>
>  
Err... well, web and grid services get a lot of attention, but you need
to put a big investment up front
in the infrastructure and design before you see any benefits.  I
wouldn't really call them hip, slick, or cool.

> I just need you to point me in the right direction and tell me what's
> good and what's a waste of time.
>  
I think you should look at the Python documentation for XMLRPC and/or
Pyro.  Build an ultra-simple XMLRPC server that runs on all the cluster
machines and lets you upload Python code fragments and execute them.
Build another XMLRPC server that runs on the admin machine; when the
cluster machines finish their jobs, have them upload their results to the
admin machine for final reduction.
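
A rough sketch of the worker half of that design, standard library only
(the port and function name are invented, and the "upload code fragments"
part is left out, since exec'ing uploaded code is its own can of worms):

    # Worker-side sketch of the XML-RPC idea: each cluster node runs
    # this, and the admin node calls it with a chunk of strings and
    # collects the results for the final reduction.  The port and the
    # function name are invented for illustration.
    import SimpleXMLRPCServer

    def filter_strings(strings):
        # Stand-in for the real per-node work: normalize and dedupe.
        return sorted(set(s.strip().lower() for s in strings))

    server = SimpleXMLRPCServer.SimpleXMLRPCServer(("0.0.0.0", 9000))
    server.register_function(filter_strings)
    server.serve_forever()

The admin side is then just
xmlrpclib.ServerProxy("http://node01:9000").filter_strings(chunk) called
once per node (node01 and the port being made-up names here), with the
returned lists merged at the end.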

Dave

Re: clustering

Jarrod Millman
The ipython developers have a nice write-up about how their new design
will handle interactive parallel computing:
http://projects.scipy.org/ipython/ipython/wiki/NewDesign/ParallelOverview

It includes a short description of the main flavors of parallel programming.

If you are interested in grid computing, you should check out:
https://bosshog.lbl.gov/zope/projects/pyGrid/pyGridPlone/pyGridware

Jarrod

Re: clustering

Bryan O'Sullivan
In reply to this post by steve hindle
On Wed, 2006-08-30 at 15:45 -0700, Steve Hindle wrote:

> Wow - Beowulf is gone?

For a long time.

>   Any idea if Don Becker's clustering company is still around ?

They got bought by Penguin Computing last year.

        <b


Re: clustering

Aahz
In reply to this post by steve hindle
On Wed, Aug 30, 2006, Steve Hindle wrote:
>
> personally, I just s/NFS/Samba/ these days.  

Could you expand on that?  What's the benefit of switching an existing NFS
installation to Samba?
--
Aahz ([hidden email])           <*>         http://www.pythoncraft.com/

I support the RKAB

Re: clustering

Aahz
In reply to this post by Shannon -jj Behrens
On Wed, Aug 30, 2006, Shannon -jj Behrens wrote:
>
> * Is there anything better than NFS that I could use to access the data?

I'll just second the advice to do anything you can to move the data to
local disk.
--
Aahz ([hidden email])           <*>         http://www.pythoncraft.com/

I support the RKAB

Re: clustering

Andy Wiggin
In reply to this post by Jarrod Millman
On 8/30/06, Jarrod Millman <[hidden email]> wrote:
> The ipython developers have a nice write up about how their new design
> will handle
> interactive parallel computing:
> http://projects.scipy.org/ipython/ipython/wiki/NewDesign/ParallelOverview
>

I don't have first-hand experience with this stuff, but I was
researching this style of computing a while back. I would look at what
the scipy project is doing, and in general look at the techniques used
by the gene crunching folks (e.g., http://biopython.org/), as they
tend to use scripting, parallelization, and large data sets.

I liked these presentations, apparently produced by Enthought. They
may be a bit dated at this point, however.

  http://www.iwce.nanohub.org/talks/python/python_talk1.pdf
  http://www.iwce.nanohub.org/talks/python/python_talk2.pdf

Page 47 of talk 2 starts an overview of parallel programming with
Python.  COW might meet your needs, as it's a pretty low-rent method for
parallel job execution and probably easier than cooking up your own
RPC solution.

Good luck. Sounds like fun!
-Andy

Re: clustering

Keith Dart-2
In reply to this post by David E. Konerding
David E. Konerding wrote the following on 2006-08-30 at 15:47 PDT:
===
> Your time is better spent writing a lightweight parallelism interface
> using the existing lightweight networking code in
> Python, or in an add-on package like Pyro or Twisted.

===

I have used Pyro in the past and it's quite nice. I have some modules
for remote controlling machines via Pyro.  Use the following link:

http://www.dartworks.biz/pynms/browser/pynms/trunk/lib/remote
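
For anyone who hasn't used Pyro, the basic server shape is tiny.  Here is
a rough sketch from memory (classic Pyro 3.x style; treat the exact calls
as approximate and check the Pyro docs, and the object name and method are
invented):

    # Rough Pyro server sketch (classic Pyro 3.x style, from memory;
    # double-check the Pyro docs for the exact calls).  The object name
    # and method are invented for illustration.
    import Pyro.core

    class Worker(Pyro.core.ObjBase):
        def filter_strings(self, strings):
            # Stand-in for the real per-node work.
            return sorted(set(strings))

    Pyro.core.initServer()
    daemon = Pyro.core.Daemon()
    daemon.connect(Worker(), "worker")
    daemon.requestLoop()

A client then grabs a proxy, roughly with
Pyro.core.getProxyForURI("PYROLOC://somenode:7766/worker"), and calls
methods on it like a local object (again from memory, so verify against
the docs).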


--

-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   Keith Dart <[hidden email]>
   public key: ID: 19017044
   <http://www.dartworks.biz/>
   =====================================================================

Re: clustering

Carl J. Van Arsdall
In reply to this post by Shannon -jj Behrens
Shannon -jj Behrens wrote:

> Hey Guys,
>
> I need to do some data processing, and I'd like to use a cluster so
> that I don't have to grow old waiting for my computer to finish.  I'm
> thinking about using the servers I have locally.  I'm completely new
> to clustering.  I understand how to break a problem up into
> parallelizable pieces, but I don't understand the admin side of it.  My
> current data set is about 16 gigs, and I need to do things like run
> filters over strings, make sure strings are unique, etc.  I'll be
> using Python wherever possible.
>
> * Do I have to run a particular Linux distro?  Do they all have to be
> the same, or can I just set up a daemon on each machine?
>  
From what I've seen, this can vary.  For example, if you are using PVM
then you should be able to have a heterogeneous cluster without too much
difficulty.  Although, personally, for ease of administration and shit
like that, I prefer to keep things (at least on the software side) as
similar as I can.  The reality of the cluster is what you make of it.


> * What does "Beowulf" do for me?
>  

Beowulf isn't so great.  There are a number of "active" clustering
technologies out there.  I've seen a bit about OpenMosix passed around,
although I believe it exists as kernel patches that were somewhat dated
last time I checked (they were for 2.4 kernels).  If you have a lot of
machines, you might even want to google load-balancing clusters and see
what you get.
> * How do I admin all the boxes without having to enter the same command n times?
>  
Check out dsh (dancer's shell).  If you are running a Debian distro you
can just apt-get it; I use it all the time, and it's a really handy tool.


> * I've heard that MPI is good and standard.  Should I use it?  Can I
> use it with Python programs?
>  
As far as parallel programs go, MPI (and sometimes PVM) tend to be the
best ways to achieve maximum speed, although they tend to incur more
development overhead.  Lots of people also use combinations of MPI and
OpenMP (or pthreads, whatever; OpenMP is nice and easy and soon to be
standard in gcc) when they have clusters of SMP machines.  In my
experience, when you have lots of data to move around it can definitely
be to your advantage to use MPI, since you can control specifically how
data will be passed around and set up a network to match that.  With 16
gigs of data you will really want to look at your network topology and
how you choose to distribute the data.



> * Is there anything better than NFS that I could use to access the data?
>  
I've seen a number of different ways to do this.  You can google
distributed shared file systems; I think there are a couple of projects
out there, although I've never used any of them, and I'd be very much
interested in anyone's stories if they have any.


> * What's hip, slick, and cool these days?
>  
You might even check out some grid computing stuff, kinda neat imho.
Also, when you get a cluster up and running with MPI or whatever, you
might want to go as far as to profile your code and find the serious
bottlenecks in your application.  Check out TAU (Tuning and Analysis
Utilities); it has Python bindings as well as MPI/OpenMP stuff.  Not
that you will use it, that's just one of those things you can google
should you be bored at work or interested in that typa stuff, and it's a
good way to justify to your employer why you need to install InfiniBand
as your network ;)


> I just need you to point me in the right direction and tell me what's
> good and what's a waste of time.
>  
Well, as you know, you probably want to avoid Python threads, although
I've set up a fairly primitive distributed system with Python threads and
ssh.  Everything is I/O bound for me, so it works really well, although
I'm looking into better distributed technologies.  Just more stuff to
play with as we learn (and I'm reading all the links people have posted
in response to your questions too, lots of good stuff)!  I'd also be
interested in the solution you choose, so if you ever want to post a
follow-up thread I'd be happy to read the results of your project!
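
That threads-plus-ssh approach really is only a few lines.  A rough
sketch (host names and the per-chunk command are invented); it works
because each thread just blocks on ssh, so the GIL never gets in the way:

    # Sketch of the threads-plus-ssh approach: a small worker pool pulls
    # (host, command) jobs off a queue and runs them over ssh.
    # Host names and the command are invented for illustration.
    import Queue
    import subprocess
    import threading

    HOSTS = ["node01", "node02", "node03", "node04"]
    jobs = Queue.Queue()

    def worker():
        while True:
            try:
                host, command = jobs.get_nowait()
            except Queue.Empty:
                return
            status = subprocess.call(["ssh", host, command])
            print "%s finished with status %d" % (host, status)

    for host in HOSTS:
        jobs.put((host, "python process_chunk.py /scratch/chunk.%s" % host))

    threads = [threading.Thread(target=worker) for _ in range(len(HOSTS))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()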


-carl


--

Carl J. Van Arsdall
[hidden email]
Build and Release
MontaVista Software


Re: clustering

Paul Marxhausen
Hi,

see feature articles at
http://www.linuxjournal.com/issue/149 for practical
beginner info on Beowulf, Condor, Heartbeat, and
parallel programming (also useful references and contacts).

Cheers,

Paul Marxhausen

--- "Carl J. Van Arsdall" <[hidden email]> wrote:

> Shannon -jj Behrens wrote:
> > Hey Guys,
> >
> > I need to do some data processing, and I'd like to use a cluster so
> > that I don't have to grow old waiting for my computer to finish.
> I'm
> > thinking about using the servers I have locally.  I'm completely
> new
> > to clustering.  I understand how to break a problem up into
> > paralizable pieces, but I don't understand the admin side of it.
> My
> > current data set is about 16 gigs, and I need to do things like run
> > filters over strings, make sure strings are unique, etc.  I'll be
> > using Python wherever possible.
> >
> > * Do I have to run a particular Linux distro?  Do they all have to
> be
> > the same, or can I just setup a daemon on each machine?
> >  
>  From what I've seen this can vary.  For example if you are using PVM
>
> then you should be able to have a heterogeneous cluster without too
> much
> difficulty.  Although, personally, for ease of adminsitration, shit
> like
> that, I prefer to keep things (at least on the software side) as
> similar
> as I can.  The reality of the cluster is what you make of it
>
>
> > * What does "Beowulf" do for me?
> >  
>
> Beowulf isn't so great.  There are a number of "active" clustering
> technologies going on.  I've seen a bit about OpenMosix passed
> around,
> although I believe it exists as kernel patches that are somewhat
> dated
> last time I checked (they were for 2.4 kernels).  If you have a lot
> of
> machines etc, you might even want to google load balancing clusters
> and
> see what you get.
> > * How do I admin all the boxes without having to enter the same
> command n times?
> >  
> Check out dsh - dancer's shell.   If you are running a debian distro
> you
> can just apt-get it, I use it all the time, a really handy tool.
>
>
> > * I've heard that MPI is good and standard.  Should I use it?  Can
> I
> > use it with Python programs?
> >  
> As far as parallel programs go, MPI (and sometimes PVM) tend to be
> the
> best ways to achieve maximum speed although they tend to incur more
> development overhead.  Lots of people also use combinations of MPI
> and
> OpenMP (or pthreads, whatev, openMP is nice and easy and soon to be
> standard in gcc) when they have clusters of smp machines.  In my
> experience, when you have lots of data to move around it can
> definitely
> be to your advantage to use MPI as you can control specifically how
> data
> will be passed around and setup a network to match that.  With 16
> gigs
> of data you will really want to look at your network topology and how
>
> you choose to distribute the data.
>
>
>
> > * Is there anything better than NFS that I could use to access the
> data?
> >  
> I've seen a number of different ways to do this.  You can google
> distributed shared file systems, I think there are a couple projects
> out
> there, although I've never used any of them and I'd be very much
> interested in anyone's stories if they had any.
>
>
> > * What hip, slick, and cool these days?
> >  
> You might even check out some grid computing stuff, kinda neat imho.
>
> Also, when you get a cluster up and running with MPI or whatever you
> might want to go as far as to profile your code and find the serious
> bottlenecks in your application.  Check out TAU (Tuning Analysis and
> Utilities), it has python bindings as well as MPI/OpenMP stuff.  Not
> that you will use it, that's just one of those things you can google
> should you be bored at work or interested in that typa stuff, and its
> a
> good way to justify to your employer why you need to install
> infiniband
> as your network ;)
>
>
> > I just need you point me in the right direction and tell me what's
> > good and what's a waste of time.
> >  
> Well, as you know you prob want to avoid python threads, although
> I've
> set up a fairly primitive distributed system with python threads and
> ssh.  Everything is I/O bound for me, so it works really well,
> although
> I'm looking into better distributed technologies.  Just more stuff to
>
> play with as we learn (and i'm reading all the links people have
> posted
> in response to your questions too, lots of good stuff)!  I'd also be
> interested in the solution you choose, so if you ever want to post a
> follow up thread I'd be happy to read the results of your project!
>
>
> -carl
>
>
> --
>
> Carl J. Van Arsdall
> [hidden email]
> Build and Release
> MontaVista Software
>
> _______________________________________________
> Baypiggies mailing list
> [hidden email]
> http://mail.python.org/mailman/listinfo/baypiggies
>

_______________________________________________
Baypiggies mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/baypiggies