GSoC: Data importation class

GSoC: Data importation class

Subsume
If we could visualize the entirety of data within django-projects, we
would probably see that this 'data economy' is growing exponentially
year-over-year. However, I know of no guided way to actually get this
data into a project that's been converted to Django. There are two
methods I generally hear about when asking people how to move between
schemas: purely SQL solutions and one-shot scripted solutions (or a
mix). With talk of model-level validation, the first approach is
becoming increasingly invalid, but I wonder if we could include some
batteries for the second approach?

My proposal is a new django class which provides a mapping for how
this data should move from its legacy schema to a django project. I've
got a sort of proof-of-concept already working but it lacks the polish
of a refined community contribution. Moreover, with multi-database
support coming, I see this concept getting a shot in the arm,
especially in cases where the legacy db is a currently supported one.

I imagine the usage going something like:

1) User creates django project

2) User runs a 'startconversion' app which creates a stage folder for
holding an inspectdb of the legacy data, a default router for the
legacy data, and some other empty files.

3) User defines the classes that define the mapping between the legacy
and new schema, along with clean functions to suit their needs,
'foreign keys' to other conversion classes, etc. (a sketch follows
after step 4)

4) User runs a command at the top branch of their schema (some distant
relation), and the command inspects these classes and runs them from
the ground up. As it does this, measures are taken (such as the use of
pagination) to avoid server CPU/memory thrashing, along with
model-level measures such as OneToOne relations being respected, etc.
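
To make step 3 concrete, here's a rough sketch of the kind of class I
have in mind. Every name in it (Conversion, LegacyField, the clean_*
hook, parse_legacy_date) is invented for illustration; none of this
exists yet:

class AuthorConversion(Conversion):
    # Hypothetical API: map an inspectdb'd legacy model onto a new one.
    legacy_model = LegacyAuthor   # generated by inspectdb into the stage folder
    target_model = Author         # the new Django model

    # Declarative column-to-field mapping, forms-style.
    name = LegacyField('AUTH_NAME')
    joined = LegacyField('DATE_ADD')

    def clean_joined(self, value):
        # Per-field cleaning hook, analogous to forms' clean_<field>().
        return parse_legacy_date(value)

The 'foreign keys' mentioned in step 3 would let the step 4 command
order these classes correctly before running them.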

-Steve


Re: GSoC: Data importation class

Malcolm Tredinnick
Hi,

I see a few problems here. The gist of what follows is that it seems a
bit abstract and as one tries to nail down the specifics it either
devolves to a more-or-less already solved problem that doesn't require
Django core changes, or a problem that is so unconstrained as to not be
solvable by a framework (requiring, instead, the full power of Python,
which is already in the developer's hands).

On Thu, 2010-03-25 at 09:07 -0700, [hidden email] wrote:
> If we could visualize the entirety of data within django-projects, we
> would probably see that this 'data economy' is growing exponentially
> year-over-year. However, I know of no guided way to actually get this
> data into a project that's been converted to Django. There are two
> methods I generally hear about when asking people how to move between
> schemas: purely SQL solutions and one-shot scripted solutions (or a
> mix). With talk of model-level validation, the first approach is
> becoming increasingly invalid,

That's not a correct statement, since Django models can often be used to
prescribe conditions on new data that is created via the web app, yet
those conditions might not be required for the universal set of data
that already exists. For example, webapp-generated data might always
require a particular field, such as the creating user, to be filled in,
whilst machine-generated data would not require that. Don't equate
validation conditions at the model level with constraints on the data at
the storage level.

>  but I wonder if we could include some
> batteries for the second approach?
>
> My proposal is a new django class which provides a mapping for how
> this data should move from its legacy schema to a django project.
> I've
> got a sort of proof-of-concept already working but it lacks the polish
> of a refined community contribution. Moreover, with multi-database
> support coming, I see this concept getting a shot in the arm,
> especially in cases where the legacy db is a currently supported one.
>
> I imagine the usage going something like:
>
> 1) User creates django project
>
> 2) User runs a 'startconversion' app which creates a stage folder for
> holding an inspectdb of the legacy data, a default router for the
> legacy data, and some other empty files.

The last bit sounds a bit nebulous. You could optimise it by not
including any empty files, or be a bit more specific about what the
empty files are meant to represent. :)

>
> 3) User defines the classes that define the mapping between the
> legacy and new schema, along with clean functions to suit their
> needs, 'foreign keys' to other conversion classes, etc.

It seems that you are talking about the cases where, by default, a
different schema is required. The first approach is to make the models
match the existing schema, on the grounds that the existing schema is a
reasonable representation of the data. In the case where that isn't
true, a migration is required, but the possibilities for such migrations
are endless unless the original data can already be put into natural
Django models. If inspectdb can already be run on the existing data, why
not use that as the starting point and then the dev can use something
like South to migrate to their schema of choice? It seems that we
already have all the tools in that case.
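
To be clear about the starting point I mean: inspectdb already produces
importable models along these lines (the table here is invented, and
exact output varies by backend):

from django.db import models

class Customer(models.Model):
    cust_id = models.IntegerField(primary_key=True, db_column='CUST_ID')
    cust_name = models.CharField(max_length=120, db_column='CUST_NAME')

    class Meta:
        db_table = u'customer'

From there the schema is in Django's hands and the usual migration
tools apply.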

If inspectdb cannot generate a useful schema that can be modelled by
Django, the user is going to have to write a generic Python script in any
case and the possibilities there are boundless and best left to the best
tool available for the job at hand: Python itself.

> 4) User runs a command at the top branch of their schema (some
> distant relation), and the command inspects these classes and runs
> them from the ground up. As it does this, measures are taken (such as
> the use of pagination) to avoid server CPU/memory thrashing, along
> with model-level measures such as OneToOne relations being respected,
> etc.

Adding system administration functionality to Django, which is what this
monitoring is, feels like the wrong approach. It's not intended to
replace everything else in your computing life. What is appropriate load
usage for one case will be highly inappropriate elsewhere. How will you
detect what you are labelling as "thrashing"?

Regards,
Malcolm


Re: GSoC: Data importation class

Subsume
On Mar 25, 12:34 pm, Malcolm Tredinnick <[hidden email]>
wrote:
> Hi,
>
> I see a few problems here. The gist of what follows is that it seems a
> bit abstract and as one tries to nail down the specifics it either
> devolves to a more-or-less already solved problem that doesn't require
> Django core changes, or a problem that is so unconstrained as to not be
> solvable by a framework (requiring, instead, the full power of Python,
> which is already in the developer's hands).

In situations where inspectdb doesn't work, I've found it's much less
work to get the data into a shape where it _will_ work than to resort
to scripting the entirety.

> On Thu, 2010-03-25 at 09:07 -0700, [hidden email] wrote:
> > mix). With talk of model-level validation, the first approach is
> > becoming increasingly invalid,
>
> That's not a correct statement, since Django models can often be used to
> prescribe conditions on new data that is created via the web app, yet
> those conditions might not be required for the universal set of data
> that already exists. For example, webapp-generated data might always
> require a particular field, such as the creating user, to be filled in,
> whilst machine-generated data would not require that. Don't equate
> validation conditions at the model level with constraints on the data at
> the storage level.

I'm hard pressed to imagine a situation where I want validation to
apply only to incoming data. Out of laziness I might choose not to
apply to existing data some conditions that I applied to a form,
knowing that the next time the user touches the record the validation
will kick in. This whole point seems to rest on a 'might' which I
can't recall ever encountering. Regardless, I tend to regard legacy
data with the same caution as I treat incoming user data, but it is
currently worlds more difficult to do the former compared to the
latter.

> The last bit sounds a bit nebulous. You could optimise it by not
> including any empty files, or be a bit more specific about what the
> empty files are meant to represent. :)

startapp, startproject, et al.

> It seems that you are talking about the cases where, by default, a
> different schema is required. The first approach is to make the models
> match the existing schema, on the grounds that the existing schema is a
> reasonable representation of the data. In the case where that isn't
> true, a migration is required, but the possibilities for such migrations
> are endless unless the original data can already be put into natural
> Django models. If inspectdb can already be run on the existing data, why
> not use that as the starting point and then the dev can use something
> like South to migrate to their schema of choice? It seems that we
> already have all the tools in that case.

South's target use case is relatively simple schema changes and
helping teams stay in sync. However, it isn't long before you're
really pushing the limits of South. Take, for example, one legacy
model which needs to be split into two or three current models: South
has no answer for this, as you may only use defaults when creating
fields (in this case foreign keys, or potentially OneToOnes). And then
what? What if my defaults are based on other values in that record?
Already, I'm on my own. By and large, South is an immaculate tool for
tracking changes during development.

> If inspectdb cannot generate a useful schema that can be modelled by
> Django, the user is going to have to write a generic Python script in any
> case and the possibilities there are boundless and best left to the best
> tool available for the job at hand: Python itself.

In theory, yes. But in practice, I've found the shortest way around
the mountain is to get the data into "SQL-enough" format manually. As
for Python, sure, but the more you write this monolithic script, the
more you realize you're doing a lot of repetitive work whose mechanics
are generally re-usable but whose implementation is completely
particular to your current task. If data importation is something you
do a lot, you've probably got a file somewhere holding piecemeal bits
that are hopefully vaguely useful to the next project, all the while
unable to fight off the feeling that this general task mirrors a lot
of what goes on in forms with clean_somefield() and clean().
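
For comparison, the forms machinery I mean is the standard
clean_<fieldname>()/clean() pair, which already gives per-field and
cross-field hooks (the fields here are just an example):

from django import forms

class ContactForm(forms.Form):
    postcode = forms.CharField()
    phone = forms.CharField()

    def clean_postcode(self):
        # Per-field hook: normalise a single value.
        return self.cleaned_data['postcode'].replace(' ', '').upper()

    def clean(self):
        # Cross-field hook: checks that span several values.
        return self.cleaned_data

A conversion class with that same shape would make those piecemeal
bits shareable instead of copy-pasted.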

> Adding system administration functionality to Django, which is what this
> monitoring is, feels like the wrong approach. It's not intended to
> replace everything else in your computing life. What is appropriate load
> usage for one case will be highly inappropriate elsewhere. How will you
> detect what you are labelling as "thrashing"?

I'm puzzled by this conclusion. The 'system administration
functionality' isn't in any way different from what you'd find in all
kinds of projects--South included. I'm not even sure what to say about
the 'everything else in your computing life' statement, but I will
assume good faith and that you're not alleging I'm presenting this as
some kind of crutch for more correct methods. As for detecting 'what
is thrashing': there are only so many tasks that can be conducted in
the business of moving large gobs of data; some of these tasks (often)
bring CPU to 100%, and some (hopefully never) bring free memory to 0%.
Things that do one or the other are treated like bugs, or avoided.

-Steve


Re: GSoC: Data importation class

Subsume
> > Adding system administration functionality to Django, which is what this
> > monitoring is, feels like the wrong approach.

I see here you probably meant it's appropriate elsewhere but not in
Django. Gotcha. Thought I'd try anyhow.


Re: GSoC: Data importation class

Andrew Godwin-3
In reply to this post by Subsume
I feel the need to wade in here, since this is vaguely my area.

On 25/03/10 17:47, [hidden email] wrote:
>> The last bit sounds a bit nebulous. You could optimise it by not
>> including any empty files, or be a bit more specific about what the
>> empty files are meant to represent. :)
>>      
> startapp, startproject, et al.
>    

I see where you're coming from here; in the final proposal, though,
you'd want to follow Malcolm's advice and have an actual use case for
each file. I'd say you probably only want one, or even none - if your
approach is so complicated that even simple use cases need lots of
files, you're unlikely to get much traction.

> South's target use case is relatively simple schema changes and
> helping teams stay in sync. However, it isn't long before you're
> really pushing the limits of South. Take, for example, one legacy
> model which needs to be split into two or three current models: South
> has no answer for this, as you may only use defaults when creating
> fields (in this case foreign keys, or potentially OneToOnes). And
> then what? What if my defaults are based on other values in that
> record? Already, I'm on my own. By and large, South is an immaculate
> tool for tracking changes during development.
>    

South does have an answer to this - you create the columns as
nullable, add in the data, and then alter them back to non-nullable.
That's the only way a database is going to let you add a column: it
needs either a global default or for the column to be nullable (there
are some cases where you can get around this, but they're really not
appropriate for most people).

In fact, I'd say this fits perfectly into the South model; one migration
to make the two/three new tables, one that moves all the data around
using the ORM (something Django developers know, and mostly love), and
one to delete the old table. If you only use --auto then yes, it's only
good at tracking small changes, but the rest of the power is right
there, you just have to actually write code.
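
The middle migration is just Python over the frozen ORM; roughly this,
assuming South 0.7's datamigration layout (app and model names
invented):

from south.v2 import DataMigration

class Migration(DataMigration):

    def forwards(self, orm):
        # Move each row of the old wide table into the two new models.
        for legacy in orm['app.LegacyPerson'].objects.all():
            address = orm['app.Address'].objects.create(
                street=legacy.street, city=legacy.city)
            orm['app.Person'].objects.create(
                name=legacy.name, address=address)

    def backwards(self, orm):
        raise RuntimeError('Cannot reverse this migration.')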

>
> In theory, yes. But in practice, I've found the shortest way around
> the mountain is to get the data into "SQL-enough" format manually. As
> for Python, sure, but the more you write this monolithic script, the
> more you realize you're doing a lot of repetitive work whose
> mechanics are generally re-usable but whose implementation is
> completely particular to your current task. If data importation is
> something you do a lot, you've probably got a file somewhere holding
> piecemeal bits that are hopefully vaguely useful to the next project,
> all the while unable to fight off the feeling that this general task
> mirrors a lot of what goes on in forms with clean_somefield() and
> clean().
>
>    

Actually, most of my data import scripts are while loops over
cursor.fetchall(), which just use the ORM to put in new data. With
MultiDB, I probably won't even need the cursor part, I can just loop
over the legacy model and insert into the new one.

While it might be nice to make this more generic, each of those while
loops has a slightly different body - correcting coordinates here,
fixing postcodes there - and the generic bits only take up one or two
lines each time.
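
Spelled out, the pattern is nothing more than this (legacy table and
clean-ups invented; assumes a 'legacy' alias under 1.2's multi-db
support):

from django.db import connections
from myapp.models import Shop

cursor = connections['legacy'].cursor()
cursor.execute("SELECT name, postcode, lat, lon FROM shops")
for name, postcode, lat, lon in cursor.fetchall():
    Shop.objects.create(
        name=name.strip(),
        postcode=postcode.replace(' ', '').upper(),  # fixing postcodes here
        lat=lat / 1e6,   # correcting coordinates there
        lon=lon / 1e6,
    )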

I'd really like to see a more compact or succinct way of doing this, but
I'm dubious as to how flexible it would be. I'm more than happy to be
proven wrong, however.

> I'm puzzled by this conclusion. The 'system administration
> functionality' isn't in any way different from what you'd find in all
> kinds of projects--South included. I'm not even sure what to say
> about the 'everything else in your computing life' statement, but I
> will assume good faith and that you're not alleging I'm presenting
> this as some kind of crutch for more correct methods. As for
> detecting 'what is thrashing': there are only so many tasks that can
> be conducted in the business of moving large gobs of data; some of
> these tasks (often) bring CPU to 100%, and some (hopefully never)
> bring free memory to 0%. Things that do one or the other are treated
> like bugs, or avoided.
>    

I read your initial proposal here as "code things in a sensible way",
not "actively monitor performance and correct on the fly". Using
pagination and making sure there are no memory leaks in the code's
loops is a great idea; attempting to self-optimise at runtime probably
isn't.

Andrew


Re: GSoC: Data importation class

Richard Laager
In reply to this post by Malcolm Tredinnick
On Thu, 2010-03-25 at 11:34 -0500, Malcolm Tredinnick wrote:
> That's not a correct statement, since Django models can often be used to
> prescribe conditions on new data that is created via the web app, yet
> those conditions might not be required for the universal set of data
> that already exists. For example, webapp-generated data might always
> require a particular field, such as the creating user, to be filled in,
> whilst machine-generated data would not require that. Don't equate
> validation conditions at the model level with constraints on the data at
> the storage level.

In my opinion, the model-level validation should be the same as the
storage-level constraints. Instead, I would say, "Don't equate
validation conditions at the view level with constraints on the data
at the model level."

This might be a bit off-topic for this thread, but legacy data is why I
wish frameworks supported some concept of warnings in their validation
code.
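
To sketch what I mean - an entirely invented API, nothing any framework
offers today - a form could collect warnings for pre-existing rows
instead of hard-failing:

import re
from django import forms

PHONE_RE = re.compile(r'^\+?\d{7,15}$')

class ContactForm(forms.Form):
    phone = forms.CharField()

    def __init__(self, *args, **kwargs):
        # Hypothetical switch: relax hard validation for legacy rows.
        self.is_legacy = kwargs.pop('is_legacy', False)
        super(ContactForm, self).__init__(*args, **kwargs)
        self.validation_warnings = []

    def clean_phone(self):
        value = self.cleaned_data['phone']
        if not PHONE_RE.match(value):
            if self.is_legacy:
                self.validation_warnings.append('suspect phone: %r' % value)
            else:
                raise forms.ValidationError('Enter a valid phone number.')
        return value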

Richard


Re: GSoC: Data importation class

Subsume
In reply to this post by Andrew Godwin-3
> On 25/03/10 17:47, [hidden email] wrote:
>
> >> The last bit sounds a bit nebulous. You could optimise it by not
> >> including any empty files, or be a bit more specific about what the
> >> empty files are meant to represent. :)
>
> > startapp, startproject, et al.
>
> I see where you're coming from here; in the final proposal, though,
> you'd want to follow Malcolm's advice and have an actual use case for
> each file. I'd say you probably only want one, or even none - if your
> approach is so complicated that even simple use cases need lots of
> files, you're unlikely to get much traction.

The way I see it, you need: 1) a stage directory, holding 2)
conversions.py for your conversion classes, 3) legacy_models.py, and
4) legacy_routers.py (potentially); a sketch of that router follows.
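
Of those, legacy_routers.py would hold little more than the stock 1.2
router pattern (the app label and database alias are invented):

class LegacyRouter(object):
    # Point all operations on the inspectdb'd models at the legacy DB.

    def db_for_read(self, model, **hints):
        if model._meta.app_label == 'legacy':
            return 'legacy'
        return None

    def db_for_write(self, model, **hints):
        if model._meta.app_label == 'legacy':
            return 'legacy'  # or refuse, if the legacy DB is read-only
        return None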

> South does have an answer to this - you create the columns as
> nullable, add in the data, and then alter them back to non-nullable.
> That's the only way a database is going to let you add a column: it
> needs either a global default or for the column to be nullable (there
> are some cases where you can get around this, but they're really not
> appropriate for most people).
>
> In fact, I'd say this fits perfectly into the South model; one migration
> to make the two/three new tables, one that moves all the data around
> using the ORM (something Django developers know, and mostly love), and
> one to delete the old table. If you only use --auto then yes, it's only
> good at tracking small changes, but the rest of the power is right
> there, you just have to actually write code.

Sorry, I find this a bit awkward. I suspect that if it were all as
kosher as you suggest, you'd be doing it yourself instead of pure
scripting. Moreover, it presumes a schema similarity and a cleanliness
that legacy data, by and large, can't be trusted to have. This whole
approach is basically: accept the legacy schema as it is and work it
incrementally into a shape that looks like your modern schema.
Hopefully you arrive! I'm skeptical.

> Actually, most of my data import scripts are while loops over
> cursor.fetchall(), which just use the ORM to put in new data. With
> MultiDB, I probably won't even need the cursor part, I can just loop
> over the legacy model and insert into the new one.

Well, exactly. That's how all of these types of scripts work. I
suppose after I did this three times I found the whole business a bit
repetitious, not to mention that what I ended up with was a 500-line
series of loops, probably split into multiple files so I could run
them independently. And, at the end of the day, 100% throwaway code.
What I hoped for was a tool that helped me logically partition this
work and freed me to marshal it with minimum hassle, so that I could
run and re-run it until it came out cleanly.

> While it might be nice to make this more generic, each of those while
> loops has a slightly different body - correcting coordinates here,
> fixing postcodes there - and the generic bits only take up one or two
> lines each time.

Sure, a datetime formatter is potentially as little as 3 or 4 lines,
but re-usable across dozens of fields, to say nothing of re-usability
between conversion projects.
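
For instance, something like this would be the funky_datetime_formatter
in the sketch below (the legacy format string is made up):

import datetime

def funky_datetime_formatter(value):
    # Reusable cleaner for a legacy system's 'DD.MM.YY HH:MM' strings,
    # which are sometimes blank.
    value = value.strip()
    if not value:
        return None
    return datetime.datetime.strptime(value, '%d.%m.%y %H:%M')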

> I'd really like to see a more compact or succinct way of doing this, but
> I'm dubious as to how flexible it would be. I'm more than happy to be
> proven wrong, however.

class Conversion():
    date_added = legacy_date_added(format=funky_datetime_formatter)

> I read your initial proposal here as "code things in a sensible way",
> not "actively monitor performance and correct on the fly". Using
> pagination and making sure there are no memory leaks in the code's
> loops is a great idea; attempting to self-optimise at runtime probably
> isn't.

Yeah, I don't know where this active monitor stuff came up. But yeah,
pagination among others. There are other ideas, such as tying in
contenttypes and allowing users to see just how their legacy data
converted--immensely helpful in detecting problems.
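
Roughly what I have in mind for that contenttypes tie-in (the model is
invented for illustration):

from django.db import models
from django.contrib.contenttypes.models import ContentType
from django.contrib.contenttypes import generic

class ConversionRecord(models.Model):
    # One row per converted legacy record, so you can audit exactly
    # where each old row ended up in the new schema.
    legacy_table = models.CharField(max_length=100)
    legacy_pk = models.CharField(max_length=100)
    content_type = models.ForeignKey(ContentType)
    object_id = models.PositiveIntegerField()
    new_object = generic.GenericForeignKey('content_type', 'object_id')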

I would love to see a "perfectionists with deadlines" approach catch
on in this area. I keep hearing from potential clients how much other
companies quote them for data conversion. It's always something like
$100/hr @ 20 hours. I'm glad when I can offer the same service at 4
hours. That's on top of all the other borderline magic of Django.

-Steve


Re: GSoC: Data importation class

Subsume
In reply to this post by Richard Laager
Or let's circumvent the whole problem by stopping this garbage at the
gates!

On Mar 25, 11:34 pm, Richard Laager <[hidden email]> wrote:
> This might be a bit off-topic for this thread, but legacy data is why I
> wish frameworks supported some concept of warnings in their validation
> code.
>
> Richard
