I need some help architecting the big picture


Shannon -jj Behrens
Hi,

I need some help architecting the big picture on my current project.
I'm usually a Web guy, which I understand very well.  However, my
current project is more batch oriented.  Here are the details:

* I have a bunch of customers.

* These customers give me batches of data.  One day, there might be
cron jobs for collecting this data from them.  One day I might have a
Web service that listens to updates from them and creates batches.
However, right now, they're manually giving me big chunks of data.

* I've built the system in a very UNIXy way right now.  That means
heavy use of cut, sort, awk, small standalone Python scripts, sh, and
pipes.  I've followed the advice of "do one thing well" and "tools,
not policy".

* The data that my customers give me is not uniform.  Different
customers give me the data in different ways.  Hence, I need some
customer-specific logic to transform the data into a common format
before I do the rest of the data pipeline.

* After a bunch of data crunching, I end up with a bunch of different
TSV files (tab-separated format) containing different things, which I
end up loading into a database.

* There's a separate database for each customer.

* The database is used to implement a Web service.  This part makes sense to me.

* I'm making heavy use of testing using nose.

Anyway, so I have all this customer-specific logic, and all these data
pipelines.  How do I pull it together into something an operator would
want to use?  Is the idea of an operator appropriate?  I'm pretty sure
this is an "operations" problem.

My current course of action is to:

* Create a global Makefile that knows how to do system-wide tasks.

* Create a customer-specific Makefile for each customer.

* The customer-specific Makefiles all "include" a shared Makefile.  I
modeled this after FreeBSD's ports system.

Hence, the customer-specific Makefiles have some customer-specific
logic in them, but they can share code via the shared Makefile that
they all include.

* Currently, all of the Python scripts take all their settings on the
command line.  I'm thinking that the settings belong in an included
Makefile that just contains settings.  By keeping the Python dumb, I'm
attempting to follow the "tools, not policy" idea.
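
To make that concrete, each little tool ends up looking roughly like
this (just a sketch; the option names and the sqlite3 stand-in are
made up):

#!/usr/bin/env python
"""Load a TSV file into a database table.  All settings come from the
command line; the Makefile decides what to pass."""

import argparse
import csv
import sqlite3  # stand-in; the real scripts use whatever DB driver fits


def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--db", required=True, help="database to load into")
    parser.add_argument("--table", required=True, help="target table name")
    parser.add_argument("tsv", help="TSV file produced earlier in the pipeline")
    args = parser.parse_args()

    conn = sqlite3.connect(args.db)
    with open(args.tsv, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            placeholders = ", ".join("?" * len(row))
            # the table name comes from a trusted Makefile, so plain
            # string formatting is acceptable here
            conn.execute("INSERT INTO %s VALUES (%s)"
                         % (args.table, placeholders), row)
    conn.commit()


if __name__ == "__main__":
    main()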

I'm having a little bit of a problem with testing.  I don't have a way
of testing any Python code that talks to a database because the Python
scripts are all dumb about how to connect to the database.  I'm
thinking I might need to setup a "pretend" customer with a test
database to test all of that logic.

Does the idea of driving everything from Makefiles make sense?

Is there an easier way to share data like database connection
information between the Makefile and Python other than passing it in
explicitly via command line arguments?

Is there anything that makes more sense than a bunch of
customer-specific Makefiles that include a global Makefile?

How do I get new batches of data into the system?  Do I just put the
files in the right place and let the Makefiles take it from there?

Am I completely smoking, or am I on the right track?

Thanks,
-jj

--
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/

Re: I need some help architecting the big picture

Paul McNett ∅
Shannon -jj Behrens wrote:

> I need some help architecting the big picture on my current project.
> I'm usually a Web guy, which I understand very well.  However, my
> current project is more batch oriented.  Here are the details:
>
> * I have a bunch of customers.
>
> * These customers give me batches of data.  One day, there might be
> cron jobs for collecting this data from them.  One day I might have a
> Web service that listens to updates from them and creates batches.
> However, right now, they're manually giving me big chunks of data.
>
> * I've built the system in a very UNIXy way right now.  That means
> heavy use of cut, sort, awk, small standalone Python scripts, sh, and
> pipes.  I've followed the advice of "do one thing well" and "tools,
> not policy".

I think this is good. You can swap out or improve any of the pieces
without detriment as long as the interface stays the same.


> * The data that my customers give me is not uniform.  Different
> customers give me the data in different ways.  Hence, I need some
> customer-specific logic to transform the data into a common format
> before I do the rest of the data pipeline.

Does the data structure from a given customer stay consistent? If a
batch is inconsistent with that customer's "standard", can you bounce it
back to them or must your toolchain adapt?


> * After a bunch of data crunching, I end up with a bunch of different
> TSV files (tab-separated format) containing different things, which I
> end up loading into a database.

And at this point, is the data in a common format for all customers?
IOW, is the database schema consistent for all customers?


> * There's a separate database for each customer.

Fine.


> * The database is used to implement a Web service.  This part makes sense to me.
>
> * I'm making heavy use of testing using nose.
>
> Anyway, so I have all this customer-specifc logic, and all these data
> pipelines.  How do I pull it together into something an operator would
> want to use?  Is the idea of an operator appropriate?  I'm pretty sure
> this is an "operations" problem.

Is this operator of the intelligent variety, or some temp worker with
Excel experience? Where in the process does this operator sit? Does
he/she receive the batches from the customers and then feed them to your
toolchain and verify that the batches made it to the database, or
something else entirely?


> My current course of action is to:
>
> * Create a global Makefile that knows how to do system-wide tasks.
>
> * Create a customer-specific Makefile for each customer.
>
> * The customer-specific Makefiles all "include" a shared Makefile.  I
> modeled this after FreeBSD's ports system.

The Makefile strategy sounds very sane, easy to manage once set up. Easy
to boilerplate for new customers, etc. Well, maybe not "easy", but
straightforward and understandable.


> Hence, the customer-specific Makefiles have some customer-specific
> logic in them, but they can share code via the shared Makefile that
> they all include.
>
> * Currently, all of the Python scripts take all their settings on the
> command line.  I'm thinking that the settings belong in an included
> Makefile that just contains settings.  By keeping the Python dumb, I'm
> attempting to follow the "tools, not policy" idea.

The settings should ultimately come from one place. This one place
could be a text file, a database entry, a part of the customer's
Makefile, or the operator could get prompted for some or all of the
information. The scripts taking the arguments on the command line is
fine. Each link in the chain just passes that information forward.


> I'm having a little bit of a problem with testing.  I don't have a way
> of testing any Python code that talks to a database because the Python
> scripts are all dumb about how to connect to the database.  I'm
> thinking I might need to setup a "pretend" customer with a test
> database to test all of that logic.

I think keeping the scripts "dumb" is good, but why do your tests need
to be dumb, too? If you are testing interaction between your script and
the database, then test that by sending the database connection
parameters. A test or reference customer is a good idea.


> Does the idea of driving everything from Makefiles make sense?

I think it makes a lot of sense. Your pipeline may seem complex on one
level by having so many little parts, but this is good: it keeps each
function in the pipeline separate and well-oiled.


> Is there an easier way to share data like database connection
> information between the Makefile and Python other than passing it in
> explicitly via command line arguments?

Is this question still from a testing perspective? Make a test database,
and set the test database parameters in the global makefile, and send
the connection arguments to your python scripts just as you have it now.
Then each customer-specific makefile will have its own overridden
connection parameters.

Set up a test database, perhaps copied from some reference customer's
data, to use in your testing.

Why does it seem like it is a problem to be passing this information on
the command line?


> Is there anything that makes more sense than a bunch of
> customer-specific Makefiles that include a global Makefile?

Are you benefiting in some other way by not making this a pure-Python
project? Not knowing more, I think I'd try to use a Python subclassing
strategy instead of makefiles, and Python modules instead of Unix
applications, but it is basically the same in the end.


> How do I get new batches of data into the system?  Do I just put the
> files in the right place and let the Makefiles take it from there?

The files are "data", and once a batch is processed you don't need that
chunk of data anymore, unless you want to archive it. So just have an
in_data file that the Makefile can feed your toolchain with.

Perhaps your operator needs a very thin GUI frontend for this to feed
the input batch into the in_data file and start the make. So, the
operator gets a data chunk from XYZ Corp. They just select that customer
from a list and paste in the contents of the data, and press "submit".
And wait for a green "okay" or a red "uh-oh".


> Am I completely smoking, or am I on the right track?

Is any of this implemented yet, or is it still in pure design phase? It
sounds like you have it implemented and you are going crazy dealing with
the various inputs from the various customers, and perhaps you are
wondering how to scale this to even more customers, and how to train an
operator/operators to intelligently wield this toolchain you've built.

It sounds like a fun project! :)

Paul

Re: I need some help architecting the big picture

Ben Bangert
In reply to this post by Shannon -jj Behrens
On Apr 28, 2008, at 1:46 PM, Shannon -jj Behrens wrote:

> My current course of action is to:
>
> * Create a global Makefile that knows how to do system-wide tasks.
>
> * Create a customer-specific Makefile for each customer.
>
> * The customer-specific Makefiles all "include" a shared Makefile.  I
> modeled this after FreeBSD's ports system.
>
> Hence, the customer-specific Makefiles have some customer-specific
> logic in them, but they can share code via the shared Makefile that
> they all include.
>
> * Currently, all of the Python scripts take all their settings on the
> command line.  I'm thinking that the settings belong in an included
> Makefile that just contains settings.  By keeping the Python dumb, I'm
> attempting to follow the "tools, not policy" idea.
>
> I'm having a little bit of a problem with testing.  I don't have a way
> of testing any Python code that talks to a database because the Python
> scripts are all dumb about how to connect to the database.  I'm
> thinking I might need to setup a "pretend" customer with a test
> database to test all of that logic.
>
> Does the idea of driving everything from Makefiles make sense?
My initial thought given both what you're doing now, and what you want  
to be able to do "One day", is that you essentially have a data  
warehouse type operation with data processing flows. There is data  
that comes in, goes through a workflow process of some sort where  
processing occurs, at which point you get your TSV files that you load  
into a db presumably for said customer to then retrieve at some later  
point.

Some things you will probably need shortly when you do go to full  
automation (as it sounds rather manual right now):
- Error reports when a processing step has failed
- Possible retries of a processing step
- Ability to scale out processing so that you can add more boxes easily

While something like Amazon SQS works pretty well for queuing  
purposes, I could envision using something like Brad Fitzpatrick's  
TheSchwartz (http://search.cpan.org/~bradfitz/TheSchwartz-1.04/lib/TheSchwartz.pm 
) reliable job queue system instead (and I believe it has a few more
features that'd be important for you). Unfortunately, I have yet to see
a Python version of TheSchwartz, though I believe there's a fully
network-based version in the works.

If you had such a queue system, I'd think of it like this:
- Customer submits data for processing (whether in a manual file, web  
interface, etc)
- Data is sent to MogileFS (or some redundant and distributed FS like  
it)
- A job is set up in the queue for processing to begin
- Worker processes (likely written in Python, and using the shell  
tools you talked of) take the job, perform their step, put the  
processed data back in MogileFS, replace their job in the queue with  
the next step to perform
    (This step repeats as necessary until processing pipeline is done)
- Web Interface can query db to see if job was finished, pull data  
back, etc.

For testing, it should be pretty easy to test the worker processes  
individually to see that each job function can be completed properly.

The db would also include processing steps and workflow in a table. So  
you could add a customer and designate a workflow as a series of job  
tasks in the order they should be performed. This also follows what  
Paul McNett was saying about keeping the settings in one place. It  
also provides a single point to report on the progress of the job  
through their workflow, whether each step completed properly, and  
failure messages if it didn't.
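
In rough Python (the schema, step names, and the run_step hook are all
invented for illustration; a real deployment would sit on a proper
queue service):

import sqlite3

# Per-customer ordered list of processing steps (normally a table itself).
WORKFLOWS = {
    "acme": ["normalize", "crunch", "load"],
}

db = sqlite3.connect("pipeline.db")
db.execute("""CREATE TABLE IF NOT EXISTS jobs (
                  id INTEGER PRIMARY KEY,
                  customer TEXT, step TEXT, status TEXT)""")


def submit(customer):
    """A new batch arrived: queue the first step of that customer's workflow."""
    first = WORKFLOWS[customer][0]
    db.execute("INSERT INTO jobs (customer, step, status) VALUES (?, ?, 'pending')",
               (customer, first))
    db.commit()


def work_once(run_step):
    """Take one pending job, run it, record the result, queue the next step."""
    row = db.execute("SELECT id, customer, step FROM jobs "
                     "WHERE status = 'pending' LIMIT 1").fetchone()
    if row is None:
        return
    job_id, customer, step = row
    run_step(customer, step)          # the actual data crunching happens here
    db.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    steps = WORKFLOWS[customer]
    next_index = steps.index(step) + 1
    if next_index < len(steps):
        db.execute("INSERT INTO jobs (customer, step, status) VALUES (?, ?, 'pending')",
                   (customer, steps[next_index]))
    db.commit()

The same jobs table is then what the web interface and the failure
reports read from.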

Since you have a data processing pipeline, which is more workflow  
oriented, and where it's quite likely you may want to scale processing
out, the Makefile approach seems like a mismatch to the task at hand.

> Is there an easier way to share data like database connection
> information between the Makefile and Python other than passing it in
> explicitly via command line arguments?

I'd keep it with the rest of the settings in a global customer data
db, separate from each customer's db.

> Is there anything that makes more sense than a bunch of
> customer-specific Makefiles that include a global Makefile?

Hopefully what I proposed. :)

> How do I get new batches of data into the system?  Do I just put the
> files in the right place and let the Makefiles take it from there?

With a queue system, you just feed new data in via the web UI, or  
manually, and fire a job request into the queue, then sit back and  
wait for the processed data to be available.

Cheers,
Ben

Re: I need some help architecting the big picture

mikeyp
In reply to this post by Shannon -jj Behrens
Shannon -jj Behrens wrote:
> Hi,
>
> I need some help architecting the big picture on my current project.
> I'm usually a Web guy, which I understand very well.  However, my
> current project is more batch oriented.  Here are the details:
>
>  
...
> Anyway, so I have all this customer-specifc logic, and all these data
> pipelines.  How do I pull it together into something an operator would
> want to use?  Is the idea of an operator appropriate?  I'm pretty sure
> this is an "operations" problem.
>  
JJ,

This sounds like a typical data integration project, which is what
SnapLogic is designed for.

Depending on what sort of operations you're doing on the data, we might
already have the right components or you might need to create a new one,
but from your description,  it sounds like we have most of the database
and file IO you need.

You're right about the operations aspect - you probably want an
'operator' interface that will just run everything when the right
files are in place, either scheduled or triggered by some event.

Drop me a line if you need more details.

mike

--
Mike Pittaro
Co-Founder                      Snaplogic, Inc.
[hidden email]            http://www.snaplogic.org/developer/mikeyp


Re: I need some help architecting the big picture

Max Slimmer
In reply to this post by Shannon -jj Behrens
A couple of thoughts on how I might approach this problem; actually,
this is the kind of thing I am doing.

I use ConfigObj to manage my control files. For each customer there
would be a config file, which can of course get common params from some
global config file, either one that is always used or one pulled in by a
directive in the customer config file.  This tool is an extension of
ConfigParser; it allows hierarchical sections, lets you enter values as
lists, and returns everything as Python objects (dicts for the most part).

I would then have a main driver program that initializes logging and
loads the config file specified on the command line. You could have a
section in the config specifying the workflow in terms of Python modules
and classes that could be dynamically instantiated; in turn, they could
get any variable info from the same config object.
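
For example (a rough sketch of the idea; the config contents, module,
and class names below are all invented, and ConfigObj would hand back
something dict-like in place of the literal dict):

import importlib
import logging

# ConfigObj would return something dict-like; a literal dict stands in here.
# "steps" is a hypothetical module holding the Extract/Transform/Load classes.
config = {
    "customer": "acme",
    "db": {"host": "localhost", "name": "acme"},
    "workflow": ["steps.Extract", "steps.Transform", "steps.Load"],
}


def load_class(dotted_name):
    module_name, class_name = dotted_name.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), class_name)


def main(config):
    logging.basicConfig(level=logging.INFO)
    for dotted_name in config["workflow"]:
        step = load_class(dotted_name)(config)   # each step pulls what it needs
        logging.info("running %s", dotted_name)
        step.run()


if __name__ == "__main__":
    main(config)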

You can then implement fairly generic modules to retrieve and parse the
customer data, with options to save or move the data to some archive
location. Some modules will be common, particularly the ones toward the
end of the pipeline, since they will typically see data in a known
format and only need to know how to wrap it, assign ownership, and so on.

In my system, after extracting and doing some transformation on the
data, I store the transformed data in a pending directory and pass a
reference to a queue that the next phase waits on. You could also think
about passing operations and data using messaging, though this would
require the various components to each be running and listening for
requests...

I think you will find that using ConfigObj instead of Makefiles will
make life simpler, especially if some non-technical person is to be
given the task of generating these. A GUI could help here, or you could
use the validation facilities included.

max


Re: I need some help architecting the big picture

Jeff Younker
In reply to this post by Shannon -jj Behrens
> Anyway, so I have all this customer-specifc logic, and all these data
> pipelines.  How do I pull it together into something an operator would
> want to use?  Is the idea of an operator appropriate?  I'm pretty sure
> this is an "operations" problem.

The pipeline is a product that development delivers to operations.
Operations maintains and monitors it.  You do not want a system where
an operator coordinates it on a permanent basis.  The pipeline should
just chug along.  Data gets fed in, information is spit out.

Pushing pipeline development off to "operations" is a sure way of
making your process melt down eventually.  You end up with a
system where huge chunks of logic are handled by one group and
huge chunks are handled by another, and nobody actually
understands how the system works.

That said, you'll need an interface so that operations can see
what is happening with the pipeline.  They need this to troubleshoot
the pipeline.  A simple one may just summarize data from logs.

> Currently, all of the Python scripts take all their settings on the
> command line.  I'm thinking that the settings belong in an included
> Makefile that just contains settings.  By keeping the Python dumb, I'm
> attempting to follow the "tools, not policy" idea.
> ...
> Is there an easier way to share data like database connection
> information between the Makefile and Python other than passing it in
> explicitly via command line arguments?

Command options are just like method arguments.  Too many mandatory
ones being passed all over the place are an indication that they need to
be replaced by a single entity.  Pass around a reference to a config
file instead.

Use this config file everywhere.  Configuration changes should only be
made in one place. Distributing configuration throughout a pipeline
system is a recipe for long term failure.

A Python file that is sourced is a wonderful config file format.  Java
style properties files work too.  Simple key-value shell scripts can
be eval'd as Python too.  I imagine you already have a config system
for your web front end.  Consider re-using that.
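
For instance, a per-customer settings_acme.py (name made up) could be
nothing but plain assignments -- DB_HOST, DB_NAME, INCOMING_DIR -- and
a loader like this sketch pulls it in from an arbitrary path:

def load_settings(path):
    """Execute a plain-Python settings file and return its names as a dict."""
    settings = {}
    with open(path) as f:
        exec(f.read(), settings)
    settings.pop("__builtins__", None)   # drop the noise exec() adds
    return settings


if __name__ == "__main__":
    cfg = load_settings("/etc/pipeline/settings_acme.py")
    print(cfg["DB_NAME"])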

Depending upon how many machines you have interacting you
may need a distributed config system.  Publishing a file via HTTP
is an easy solution.

> Does the idea of driving everything from Makefiles make sense?


It sounds to me like a horrible hack that will break down when
you start wanting to do recovery and pipeline monitoring.

Consider writing a simple queue management command.  It looks
for work in one bin, calls an external command to process the work,
and then dumps it into the next.  The bins can be as simple as
directories:

File A1 goes into bin A/pending
A1 is picked up by job A
A/pending/A1 gets moved to A/consuming/A1
A/consuming/A1 is processed to B/producing/B1
A/consuming/A1 is moved to A/consumed/A1
B/producing/B1 is moved to B/pending/B1

Writing such a simple queue manager should be straightforward.
Then your toolchain becomes nothing more than a series of calls
to the managers.  Or you could have each queue command
daemonize itself and then poll the queues every so often.
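
A bare-bones sketch of that manager (bin names from the example above;
the processing command is just a placeholder):

import os
import shutil
import subprocess


def run_queue(bin_a, bin_b, command):
    """Process every pending file in bin A, producing files in bin B."""
    for name in sorted(os.listdir(os.path.join(bin_a, "pending"))):
        pending = os.path.join(bin_a, "pending", name)
        consuming = os.path.join(bin_a, "consuming", name)
        producing = os.path.join(bin_b, "producing", name)
        consumed = os.path.join(bin_a, "consumed", name)

        shutil.move(pending, consuming)                # claim the work
        with open(consuming, "rb") as src, open(producing, "wb") as dst:
            subprocess.check_call(command, stdin=src, stdout=dst)
        shutil.move(consuming, consumed)               # archive the input
        shutil.move(producing, os.path.join(bin_b, "pending", name))


if __name__ == "__main__":
    # e.g. job A sorts data on its way from bin A to bin B
    run_queue("A", "B", ["sort", "-t", "\t", "-k1,1"])

A real version needs error handling and retries, but the bins make the
state of every piece of work visible at a glance.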

> I'm having a little bit of a problem with testing.  I don't have a way
> of testing any Python code that talks to a database because the Python
> scripts are all dumb about how to connect to the database.  I'm
> thinking I might need to setup a "pretend" customer with a test
> database to test all of that logic.

Standard unit testing stuff should work.  Use mock objects to
stub out the database connection.
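
For example, with a hand-rolled fake standing in for the mock (the
load_users() function and all the names here are invented; nose will
collect a plain test_* function like this):

def load_users(conn, rows):
    """Toy version of a pipeline step that writes rows through a DB-API cursor."""
    cur = conn.cursor()
    for row in rows:
        cur.execute("INSERT INTO users (name, item) VALUES (?, ?)", row)
    conn.commit()


class FakeCursor(object):
    def __init__(self):
        self.executed = []

    def execute(self, sql, params=()):
        self.executed.append((sql, params))


class FakeConnection(object):
    def __init__(self):
        self.cursor_ = FakeCursor()
        self.committed = False

    def cursor(self):
        return self.cursor_

    def commit(self):
        self.committed = True


def test_load_users_inserts_every_row():
    conn = FakeConnection()
    load_users(conn, [("alice", "soap"), ("bob", "rope")])
    assert len(conn.cursor_.executed) == 2
    assert conn.committed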

I actually do all of my scripts via a little harness that handles
all the generic command line setup.  Scripts subclass the
tframe.Framework object (which I'm releasing as soon as
I'm done with the damn book), and the script body goes in a
run(options, args) method.  Testing involves instantiating the
script's Framework class and then poking at it.

- Jeff Younker - [hidden email] -




Re: I need some help architecting the big picture

Andy Wiggin
In reply to this post by Shannon -jj Behrens
I can't speak to too many of the points, but...

I recently worked on a project which had some similarities to this.
Instead of UNIX pipelines (grep/sed/awk/...), I implemented a plugin
mechanism to gather data. The main class allows the registration of
plugins which then have the responsibility of parsing the particular
format and calling the main class to store the normalized data
(roughly speaking). In your case, you could name the plugin the same
as the customer and give the plugin name on the command line of the
main program, or detect it from the process cwd, or whatever is
convenient (and reliable). I liked the way the plugin system worked
out, in no small measure because it was then natural to write the
whole system in python. The uniformity might make the testing and
error detection/reporting a little easier, too.
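
Roughly, the registration looked something like this sketch (the names
and the sample pipe-delimited format are invented):

PLUGINS = {}


def register(customer):
    """Class decorator: register a parser plugin under a customer name."""
    def wrap(cls):
        PLUGINS[customer] = cls
        return cls
    return wrap


class Loader(object):
    """The 'main class': collects normalized rows from whichever plugin ran."""
    def __init__(self):
        self.rows = []

    def store(self, user, item):
        self.rows.append((user, item))


@register("xyz_corp")
class XyzCorpParser(object):
    """XYZ Corp sends pipe-delimited lines like 'alice|soap'."""
    def parse(self, lines, loader):
        for line in lines:
            user, item = line.rstrip("\n").split("|")
            loader.store(user, item)


def run(customer, lines):
    loader = Loader()
    PLUGINS[customer]().parse(lines, loader)
    return loader.rows


if __name__ == "__main__":
    print(run("xyz_corp", ["alice|soap\n", "bob|rope\n"]))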

Regards, Andy



Re: I need some help architecting the big picture

jim stockford
In reply to this post by Shannon -jj Behrens

On Mon, 2008-04-28 at 13:46 -0700, Shannon -jj Behrens wrote:

jstockford munged to summarize:

> * I have a bunch of customers manually giving me big chunks
> of data that is not uniform.  Different
> customers give me the data in different ways.
>
> * After a bunch of data crunching, I end up with a bunch of different
> TSV files (tab-separated format) containing different things, which I
> end up loading into separate databases for each customer.
>
> * The database is used to implement a Web service.  
>
> Anyway, so I have all this customer-specific logic to transform the
> data into a common format, and all these data
> pipelines.  

   What does it mean, "manually"? can this include sneaker
net with removable media, ftp, email attachments, etc.
and none of these means matters to the problem?
   i'm guessing that the commonality is that all files are
ASCII TSV; i.e. there is no necessary correlation of fields
from one customer's file set to others, yes?
   further guess: the customer-specific logic is entirely
a matter of getting each customer's data into ASCII TSV
format so's to load into a customer-specific database, yes?
   some web-thing will pull from each customer database and
present reports upon demand, yes?


> How do I pull it together into something an operator would
> want to use?  Is the idea of an operator appropriate?  I'm pretty sure
> this is an "operations" problem.
>
   my understanding of an operator is someone who monitors
and manages the system without respect to the particulars
of data coming in or going out.
   but the only way i make sense of your use of the term is
as someone who sits by a phone or email client and derives
reports as humans request. i must be confused, yes?


> My current course of action is to:
>
> * Create a global Makefile that knows how to do system-wide tasks.
>
> * Create a customer-specific Makefile for each customer.
>
> * The customer-specific Makefiles all "include" a shared Makefile.  I
> modeled this after FreeBSD's ports system.
>
> Hence, the customer-specific Makefiles have some customer-specific
> logic in them, but they can share code via the shared Makefile that
> they all include.
>
   this sounds like a classic driver structure: create a dumb
little top-end driver, make a basic stub, make customer-
specific stubs that inherit from the basic stub, tweak
and debug to taste, yes?


> * Currently, all of the Python scripts take all their settings on the
> command line.  
>
   GASP!!! and the operator is going to type these command lines?
or these are hard-coded into the master driver? or...?


> I'm thinking that the settings belong in an included
> Makefile that just contains settings.  By keeping the Python
> dumb, I'm attempting to follow the "tools, not policy" idea.
>
   ...or the makefile. well, software is software, so given
a reasonable interface and no performance problems, why not?


> I'm having a little bit of a problem with testing.  I don't have a way
> of testing any Python code that talks to a database because the Python
> scripts are all dumb about how to connect to the database.  I'm
> thinking I might need to setup a "pretend" customer with a test
> database to test all of that logic.

   i stubbornly remain confused: customer-specific data is
coming in, so customer-specific logic is required to get
it into a customer-specific database that will serve out
to customer-specific requests for reports or some other
databased service.
   sounds like separate software, one set for each customer,
maybe based on some base classes that allow leveraging
common functionality.
   but even if the python remains ignorant of the data
source, aren't there interface programs that work between
the python and the database?


> Does the idea of driving everything from Makefiles make sense?
>
   software is software


> Is there an easier way to share data like database connection
> information between the Makefile and Python other than passing it in
> explicitly via command line arguments?
>
   the difference between command line calls and calls between
functions in the same process space is just load time, mainly,
yes?


> Is there anything that makes more sense than a bunch of
> customer-specific Makefiles that include a global Makefile?
>
   drivers and stubs makes sense. software is software. beware
of the interfaces.


> How do I get new batches of data into the system?  Do I just put the
> files in the right place and let the Makefiles take it from there?
>
   isn't this answered by whatever "manually" means? manually
has gotta end up "here" somehow, and the data input software
should know where "here" is.


> Am I completely smoking, or am I on the right track?
>
   it's an interesting exercise, though not particularly
pythonic, yes? aren't there sufficient python modules to
build this strictly as a python app without significantly
more work or risk?




Re: I need some help architecting the big picture

jim stockford
In reply to this post by Shannon -jj Behrens


oops:
   the difference between command line calls and calls between
functions in the same process space is just load time, mainly,
yes?

   no. in the same process space, one can pass object
references. the shell only permits text strings (and exit
integer codes).



Re: I need some help architecting the big picture

Shannon -jj Behrens
In reply to this post by Paul McNett ∅
>  Does the data structure from a given customer stay consistent? If a batch
> is inconsistent with that customer's "standard", can you bounce it back to
> them or must your toolchain adapt?

It'll stay consistent.  If I do have to adapt, it'll be manually, which is okay.

> > * After a bunch of data crunching, I end up with a bunch of different
> > TSV files (tab-separated format) containing different things, which I
> > end up loading into a database.
>
>  And at this point, is the data in a common format for all customers? IOW,
> is the database schema consistent for all customers?

I've identified one table where there is a customer-specific set of
fields.  However, the name of the table is always the same and it
always has an id column.  Everything else in the schema is generic.

> > Anyway, so I have all this customer-specifc logic, and all these data
> > pipelines.  How do I pull it together into something an operator would
> > want to use?  Is the idea of an operator appropriate?  I'm pretty sure
> > this is an "operations" problem.
>
>  Is this operator of the intelligent variety, or some temp worker with Excel
> experience?

I'm shooting for sysadmin level.

> Where in the process does this operator sit? Does he/she receive
> the batches from the customers and then feed them to your toolchain and
> verify that the batches made it to the database, or something else entirely?

That's the problem.  I don't know.  I've never been in a situation
where my user wasn't on the other side of a Web interface.  For the
foreseeable future, the operator will be me or some sysadmin.  My
guess is that he'll get the data via scp either manually or by cron
job.  Now, I have to figure out how to feed the data to the system.
Do I simply put it in some place and say "go!"?  I was guessing
someone else had been in a similar situation and had some best
practices to recommend.

>  The Makefile strategy sounds very sane, easy to manage once set up. Easy to
> boilerplate for new customers, etc. Well, maybe not "easy", but
> straightforward and understandable.

Ah, good.  So I'm not crazy ;)

>  The settings should ultimately come from once place. This one place could
> be a text file, a database entry, a part of the customer's Makefile, or the
> operator could get prompted for some or all of the information. The scripts
> taking the arguments on the command line is fine. Each link in the chain
> just passes that information forward.

Agreed.  I'm leaning toward having this stuff in the Makefile.

> > I'm having a little bit of a problem with testing.  I don't have a way
> > of testing any Python code that talks to a database because the Python
> > scripts are all dumb about how to connect to the database.  I'm
> > thinking I might need to setup a "pretend" customer with a test
> > database to test all of that logic.
>
>  I think keeping the scripts "dumb" is good, but why do your tests need to
> be dumb, too? If you are testing interaction between your script and the
> database, then test that by sending the database connection parameters. A
> test or reference customer is a good idea.

It's just sort of strange because normally when you run nose, you
don't pass any parameters.  I guess if I set up a test customer, then I
just need to figure out how to get a few settings out of the Makefile
for the tests.  That's manageable.
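
For example, the Makefile's test target could export the test
customer's settings (say, TEST_DB_HOST and TEST_DB_NAME -- names made
up) before running nosetests, and the tests could read them back like
this sketch:

import os


def db_settings_from_env():
    """Pull the test customer's DB settings out of the environment."""
    return {
        "host": os.environ.get("TEST_DB_HOST", "localhost"),
        "name": os.environ.get("TEST_DB_NAME", "test_customer"),
    }


def test_settings_are_available():
    settings = db_settings_from_env()
    # a real test would open a connection here and exercise the load scripts
    assert settings["name"]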

> > Is there an easier way to share data like database connection
> > information between the Makefile and Python other than passing it in
> > explicitly via command line arguments?
>
>  Is this question still from a testing perspective?

No, it's more general, although testing is currently my largest pain point.

> Make a test database,
> and set the test database parameters in the global makefile, and send the
> connection arguments to your python scripts just as you have it now. Then
> each customer-specific makefile will have its own overridden connection
> parameters.

Ah, very good.

>  Set up a test database, perhaps copied from some reference customer's data,
> to use in your testing.
>
>  Why does it seem like it is a problem to be passing this information on the
> command line?

I was just checking that this is the right thing to do.

> > Is there anything that makes more sense than a bunch of
> > customer-specific Makefiles that include a global Makefile?
>
>  Are you benefiting in some other way by not making this a pure-Python
> project?

Absolutely.  awk, cut, sort, etc. are fast and simple.  Anytime I need
something more complex than a one liner, that's when I switch to
Python.

> Not knowing more, I think I'd try to use a Python subclassing
> strategy instead of makefiles, and Python modules instead of Unix
> applications, but it is basically the same in the end.

Yes, the subclassing approach is how I'd normally approach it.  I've
been reading "The Art of UNIX Programming" lately, and I knew that
this was the perfect situation to be UNIXy.

> > How do I get new batches of data into the system?  Do I just put the
> > files in the right place and let the Makefiles take it from there?
> >
>
>  The files are "data", and once a batch is processed you don't need that
> chunk of data anymore, unless you want to archive it. So just have an
> in_data file that the Makefile can feed your toolchain with.

Very good.  We're on the same page then.  I have an incoming directory
and an archives directory.  My plan was to simply drop the incoming
files in the right place--a place that the Makefiles knew about.
However, it just seemed a bit strange since I've never done that
before.

>  Perhaps your operator needs a very thin GUI frontend for this to feed the
> input batch into the in_data file and start the make. So, the operator gets
> a data chunk from XZY Corp. They just select that customer from a list and
> paste in the contents of the data, and press "submit". And wait for a green
> "okay" or a red "uh-oh".

My guess is that they'll always be working with large data files.  I
could either force them to put the file in the right place, or I could
write yet another shell script that took the file and put it in the
right place.  After the batch is processed, I need to also move the
file from the incoming directory to the archive directory.  The
transaction nut in me worries about things like that, but perhaps I
need not.

> > Am I completely smoking, or am I on the right track?
>
>  Is any of this implemented yet, or is it still in pure design phase?

Most of the small tools are written, and they're working fine.  Some
of the Makefile infrastructure is in place.  All of the
customer-specific stuff isn't yet.  When it comes to "tools, not
policy", the tools are working fine, but I need some policy ;)

> It
> sounds like you have it implemented and you are going crazy dealing with the
> various inputs from the various customers,

I'm not going crazy yet ;)

> and perhaps you are wondering how
> to scale this to even more customers, and how to train an operator/operators
> to intelligently wield this toolchain you've built.

I have proof of concept and a bunch of pretty nice code.  Now I gotta
tie it all together in a way that doesn't involve cutting and pasting
commands from a README all the time ;)

>  It sounds like a fun project! :)

Thanks!
-jj

--
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/

Re: I need some help architecting the big picture

Shannon -jj Behrens
In reply to this post by mikeyp
On Mon, Apr 28, 2008 at 3:02 PM, Michael Pittaro <[hidden email]> wrote:


Ah, I haven't forgotten about SnapLogic.  I'm not sure I completely
know the definition of a data integration project.  I do know I have
data going in.  The key thing that I'm doing is applying clever
algorithms to learn stuff from the data.  Then I can use the stuff I
learned to answer queries via a Web interface.

-jj

--
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/

Re: I need some help architecting the big picture

Shannon -jj Behrens
In reply to this post by Jeff Younker
On Mon, Apr 28, 2008 at 3:17 PM, Jeff Younker <[hidden email]> wrote:


Ok, this is another approach.  I'm going to have to think about it some more.

Thanks,
-jj

--
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/

Re: I need some help architecting the big picture

Shannon -jj Behrens
In reply to this post by jim stockford
>    What does it mean, "manually"? can this include sneaker
>  net with removeable media, ftp, email attachments, etc.
>  and none of these means matters to the problem?

Currently, yes.

>    i'm guessing that the commonality is that all files are
>  ASCII TSV;

I wish ;)  Some are log files.  Some are database dumps.

> i.e. there is no necessary correlation of fields
>  from one customer's file set to others, yes?

I'm looking for how users interact with things, e.g., user A bought soap.

>    further guess: the customer-specific logic is entirely
>  a matter of getting each customer's data into ASCII TSV
>  format so's to load into a customer-specific database, yes?

Yes.

>    some web-thing will pull from each customer database and
>  present reports upon demand, yes?

No.  I don't need to provide reports so much as answer queries about their data.

>    my understanding of an operator is someone who monitors
>  and manages the system without respect to the particulars
>  of data coming in or going out.

That's correct.  My definition of an operator is someone who either
pulls the batches of data when requested, or sets up a cron job to
do it and fixes it if it breaks.

>    this sounds like a classic driver structure: create a dumb
>  little top-end driver, make a basic stub, make customer-
>  specific stubs that inherit from the basic stub, tweak
>  and debug to taste, yes?

Yep.

>  > * Currently, all of the Python scripts take all their settings on the
>  > command line.
>  >
>    GASP!!! and the operator is going to type these command lines?
>  or these are hard-coded into the master driver? or...?

They're driven via the Makefiles.  If you have a customer-specific
Makefile with some customer-specific settings, then the main Makefile
knows which settings each little Python script needs.

>  > I'm having a little bit of a problem with testing.  I don't have a way
>  > of testing any Python code that talks to a database because the Python
>  > scripts are all dumb about how to connect to the database.  I'm
>  > thinking I might need to setup a "pretend" customer with a test
>  > database to test all of that logic.
>
>    i stubbornly remain confused: customer-specific data is
>  coming in, so customer-specific logic is required to get
>  it into a customer-specific database that will serve out
>  to customer-specific requests for reports or some other
>  databased service.

Yep.

>    sounds like separate software, one set for each customer,

The data format that customers give me varies, but what I want to do
with the data and the service I provide is always the same.

>  maybe based on some base classes that allow leveraging
>  common functionality.
>    but even if the python remains ignorant of the data
>  source, aren't there interface programs that work between
>  the python and the database?

Once I get the data into a common format, getting it into the database is easy.

>  > Does the idea of driving everything from Makefiles make sense?
>  >
>    software is software

Ah, interesting perspective ;)

>  > Is there an easier way to share data like database connection
>  > information between the Makefile and Python other than passing it in
>  > explicitly via command line arguments?
>  >
>    the difference between command line calls and calls between
>  functions in the same process space is just load time, mainly,
>  yes?

It's really all about following the UNIX way of "do one thing well"
rather than building a huge C++ monolithic binary.

>  > How do I get new batches of data into the system?  Do I just put the
>  > files in the right place and let the Makefiles take it from there?
>  >
>    isn't this answered by whatever "manually" means? manually
>  has gotta end up "here" somehow, and the data input software
>  should know where "here" is.

Yeah.  That's why I was asking.  I've never had this situation before,
and I was wondering if some of you old timers had been in this
situation and had some advice about the right way to approach it ;)  I
guess I'm confused about workflow and process as much as anything
else.

>  > Am I completely smoking, or am I on the right track?
>  >
>    it's an interesting exercise, though not particularly
>  pythonic, yes?

Everything complex is written in Python, but I'm leaning heavily on
the UNIX philosophy and tools.

> aren't there sufficient python modules to
>  build this strictly as a python app without significantly
>  more work or risk?

In a lot of situations, I can get from one step in the pipeline to
another using only UNIX tools like cut, sort, and one-line awk
scripts.  That's really nice.  It's easier to implement the whole
thing as sh tying together small tools than as one gigantic program
that does everything.  Small tools are easy to understand, test, and debug.

-jj

--
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/

Re: I need some help architecting the big picture

Shannon -jj Behrens
In reply to this post by jim stockford
On Mon, Apr 28, 2008 at 5:07 PM, jim <[hidden email]> wrote:
>  oops:
>
>    the difference between command line calls and calls between
>  functions in the same process space is just load time, mainly,
>  yes?
>
>    no. in the same process space, one can pass object
>  references. the shell only permits text strings (and exit
>  integer codes).

Currently, everything is built around passing TSV data.  Different
steps in the pipeline consume and produce different fields.  This part
is working really nicely.
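
A typical step is just a little filter.  Here's a sketch (the fields and
the script name are made up):

    import csv
    import sys

    def main():
        # Read TSV on stdin, write TSV on stdout -- one small pipeline step.
        # This made-up step keeps two fields and appends a computed third.
        reader = csv.reader(sys.stdin, dialect='excel-tab')
        writer = csv.writer(sys.stdout, dialect='excel-tab')
        for row in reader:
            customer_id, amount = row[0], row[1]
            writer.writerow([customer_id, amount, str(float(amount) * 1.1)])

    if __name__ == '__main__':
        main()

Steps like that chain together with the usual tools, e.g.
"cut -f1,3 raw.tsv | sort | python add_margin.py > step2.tsv".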

-jj

--
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/

Re: I need some help architecting the big picture

Shannon -jj Behrens
Thanks, everyone, for all your advice!  I'm probably even more confused
now than I was before, but that's only because I have more options to
consider ;)

Thanks Again!
-jj

--
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/

Re: I need some help architecting the big picture

Eric Walstad
In reply to this post by Shannon -jj Behrens
Hey JJ,

On Mon, Apr 28, 2008 at 9:57 PM, Shannon -jj Behrens <[hidden email]> wrote:
...

>  > Where in the process does this operator sit? Does he/she receive
>  > the batches from the customers and then feed them to your toolchain and
>  > verify that the batches made it to the database, or something else entirely?
>
>  That's the problem.  I don't know.  I've never been in a situation
>  where my user wasn't on the other side of a Web interface.  For the
>  foreseeable future, the operator will be me or some sysadmin.  My
>  guess is that he'll get the data via scp either manually or by cron
>  job.  Now, I have to figure out how to feed the data to the system.
>  Do I simply put it in some place and say "go!"?  I was guessing
>  someone else had been in a similar situation and had some best
>  practices to recommend.
We wrote a Python 'sentinel', started periodically by cron, that
watches users' incoming scp directories for new files.  We considered
writing it as a daemon but the first iteration (running it from cron)
turned out to suit our needs.  When a new file arrives in the
directory, the sentinel does something that it is configured to do:
call a python callable, execute a system executable, etc.  The
sentinel only operates on a file that isn't already being operated on
(in case the callable's runtime exceeds the sentinel's nap time).  The
sentinel can be configured to do a post-process task which usually
includes moving the uploaded file to a 'processed' directory.  The
sentinel operates in a generic way by reading config files that define
how it is supposed to behave.  Each of our customers has a Sentinel
config file describing how to tell when that customer's files arrive,
how to process their file and what to do when the processing is done.
I don't know about best practices here, but our system is pretty
generic, flexible and works well for us.
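
In rough outline it looks something like this -- the paths, the JSON
config format, and the names here are simplified stand-ins rather than
our real code:

    import json
    import os
    import shutil
    import subprocess

    CONFIG_DIR = '/etc/sentinel'    # one config file per customer

    def process_customer(config):
        incoming = config['incoming_dir']
        processed = config['processed_dir']
        for name in os.listdir(incoming):
            if name.endswith('.lock'):
                continue
            path = os.path.join(incoming, name)
            lock = path + '.lock'
            if os.path.exists(lock):
                continue    # a previous run is still working on this file
            open(lock, 'w').close()
            try:
                # The per-customer command gets the new file as its argument,
                # e.g. ["make", "-C", "/srv/customers/acme", "import"].
                subprocess.check_call(config['command'] + [path])
                shutil.move(path, os.path.join(processed, name))
            finally:
                os.remove(lock)

    def main():
        # Cron starts this every few minutes.
        for name in os.listdir(CONFIG_DIR):
            if name.endswith('.json'):
                with open(os.path.join(CONFIG_DIR, name)) as f:
                    process_customer(json.load(f))

    if __name__ == '__main__':
        main()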

Eric.

Re: I need some help architecting the big picture

Shannon -jj Behrens
On Mon, Apr 28, 2008 at 10:49 PM, Eric Walstad <[hidden email]> wrote:

> Hey JJ,
>  ...
>  We wrote a Python 'sentinel', started periodically by cron, that
>  watches users' incoming scp directories for new files.
>  ...
>  I don't know about best practices here, but our system is pretty
>  generic, flexible and works well for us.

Yep, that sounds a lot like my situation.  Thanks!

-jj

--
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/

Re: I need some help architecting the big picture

Shannon -jj Behrens
In reply to this post by Shannon -jj Behrens
On Mon, Apr 28, 2008 at 10:34 PM, Shannon -jj Behrens <[hidden email]> wrote:
> Thanks, everyone for all your advice!  I'm probably even more confused
>  now than I was before, but that's only because I have more options to
>  consider ;)
>
>  Thanks Again!

I've been trying to digest all the different comments.  A few people
were trying to frame this as a data warehousing or data integration
problem.  I don't know much about those subjects.  However, my naive
guess is that they don't apply for the following reasons:

It's hard to imagine that this is a data warehousing problem since I'm
not trying to warehouse the data.  I have no requirement to store it
forever.  I'm simply trying to "infer" some stuff about the data.
When I'm doing this inferencing, I'm not even working with the data in
a database.  I'm working with it in TSV form since that works out
better (it's faster to stream from file to file than from table to
table).

It's hard to imagine that I have a data integration problem since I'm
not trying to integrate multiple sources of data.  Each customer gives
me chunks of data, and I analyze the data for that customer in order
to infer things.  The only reason I use a database at all is to store
the things I've learned about the data for later retrieval via a Web
service.  Once I analyze the data, I can throw it away or simply
archive it in its raw form.

My understanding of data warehousing is that it's a way of storing
everything you can about data from a bunch of sources and writing deep
SQL queries to learn deeper things about the data.  That doesn't
really match my situation.

By the way, my batches don't even take all that long to run.
Currently, a full run is on the order of 30 seconds.  If any part fails
and I need to start from scratch, it's no big deal.

This conversation has been really helpful.  It's definitely driven one
thing home for me.  One reason "tools, not policy" is so valuable is
that people love to disagree about policy.  However, once I've
written a good tool that's well documented and well tested, no one
disagrees about its usefulness.  Hence, it makes sense for me to focus
on writing good tools.  If I need to switch to a more complicated
policy later involving queues, data warehousing, etc., my small tools
are still going to be helpful.

-jj

--
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/

Re: I need some help architecting the big picture

Shannon -jj Behrens
My buddy Tim Kientzle (who's a FreeBSD committer) had some great
comments, which I'm posting here because they made a lot of sense and
integrated nicely with what a few of you were saying:

Tim Kientzle wrote:
I skimmed your other email and it sounds like you're basically on the
right track, at least for v1. ;-)

In the longer term, you'll need to think more about self-service.   I
worked for a company not long ago that received large data files from
customers and manually converted them into the basis for a web
service.  After a while, the manual conversion step became a big
barrier to growth; each new customer required more in-house
data-conversion staff.  I don't suggest you focus too much on
self-service at first, but there are a few easy steps you can do to
help lay the groundwork for that:
 * Have customers drive the interaction themselves.  There should be a
web form where they upload their data.  There should be a simple
dashboard that shows the state of their data (uploaded, accepted,
converted, complete).  Even if the processes behind that are driven
manually for now, this sets the expectation that the customers have to
manage the process somewhat; they can't just call you and have you fix
every little problem.
 * Plan to store customer configuration.  A directory per user with
config files/scripts can work fine; you will likely also need a
database that maps customer accounts to those directories, etc.
 * Plan to eventually generate that configuration automatically.
After you've done the first few manually, you should start to get a
feel for what the variables are and can start providing "canned"
solutions that are easily reusable.

Makefiles are a fine way to script it.  Plan to soon include progress
output from the makefile, which can be as simple as a "logit.py"
script that is run at points to add a message to a database table.
That table can then form the basis of the customer dashboard.
     logit.py "Data format verified."
     logit.py "Initial data conversion complete."

Look up some of the literature on "job control systems."  Batch jobs
fail and need to be rerun from scratch, etc.  You'll need a database
table of jobs and their current state.
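
Even a couple of columns will get you started; for example (the schema
and state names are just a sketch):

    import sqlite3

    def record_job(conn, job_id, state):
        # Typical states: 'running', 'failed', 'done'.  Rerunning a failed
        # batch just flips its row back to 'running'.
        conn.execute('CREATE TABLE IF NOT EXISTS jobs '
                     '(job_id TEXT PRIMARY KEY, state TEXT)')
        conn.execute('INSERT OR REPLACE INTO jobs VALUES (?, ?)',
                     (job_id, state))
        conn.commit()

    if __name__ == '__main__':
        conn = sqlite3.connect('/var/lib/pipeline/jobs.db')
        record_job(conn, 'acme-2008-04-28', 'running')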

TBKK