DB scheme question

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

DB scheme question

John Q. Public
Hello,

Before I can ask my question, I have to present the data I work with.
My goal is to create a (very) simple version control for a unique type of (text) file.
The file has three parts:
1. atoms - set of attributes (e.g. name, id, etc). Each version of the file contains up to thousands of atoms
2. molecules - has some attributes (e.g. name, id, etc) and a group of atoms (list of atoms ids). Each molecule can contain zero to hundreds of atoms.
3. matter - a group of molecules (a matter to molecule is what molecule is to atom - and no, it's not homework of some kind....)

---------------------- EXAMPLE FILE: --------------------
atoms:
# atr1 atr2 ... atrN
1  A     B       C
2  D     B       E
3  D     F       G
...
molucules:
# atr1 atr2 ... atrN atom_list
1  A     B       C    1 5 6 16 53 102
2  D     E       F    2 7 13 53 104 205 206 207
3  G     H       I    4
...
matter:
# atr1 atr2 ... atrN molecules_list
1  A     B       C    1 5 6 16 53 102
2  D     E       F    2 7 13 53 104 205 206 207
3  G     H       I    4
...
------------------- END OF EXAMPLE ---------------------

Each file version is only slightly different from the previous version (changes in only few atoms, molecules and\or matter).

I thought on the following scheme (first 6 tables are trivial):
    - version table - each raw represent a file that was added
    - atom table - each raw represent one atom from a file
    - molecules table - each raw represent one molecule from a file (later I'll connect atoms to molecules)
    - matter table - each raw represent one matter from a file (later I'll connect molecules to matter)
    - atom to molecule table - foreign keys from both tables represent the relation: atom BELONG to molecule
    - molecule to matter table - foreign keys from both tables represent the relation: molecule BELONG to matter
    - data to version table - In order not to make 3 different tables connecting atoms, molecules and matter to version, I will create one table that contains 2 foreign keys (data_id and version_id) together with another attribute stating the source table for the data. Now, because versions are only slightly different, does it make sense to implement the relation: data DOES NOT BELONG to version ?

My second question is about inserting new versions to the db. After parsing the text file I have an object with all the information from the file. Is the next step to go attribute by attribute and update the db scheme or should I create some sort of temporary scheme (containing only data from the object) and do the inserting by merging tables? further more, is there a technique to keep db integrity while updating (especially in case of error in the insert process)?

As always - thanks for your time and patience
Reply | Threaded
Open this post in threaded view
|

Re: DB scheme question

Carl Karsten
On Sat, Dec 11, 2010 at 12:18 PM, John Q. Public <[hidden email]> wrote:

>
> Hello,
>
> Before I can ask my question, I have to present the data I work with.
> My goal is to create a (very) simple version control for a unique type of
> (text) file.
> The file has three parts:
> 1. atoms - set of attributes (e.g. name, id, etc). Each version of the file
> contains up to thousands of atoms
> 2. molecules - has some attributes (e.g. name, id, etc) and a group of atoms
> (list of atoms ids). Each molecule can contain zero to hundreds of atoms.
> 3. matter - a group of molecules (a matter to molecule is what molecule is
> to atom - and no, it's not homework of some kind....)
>
> ---------------------- EXAMPLE FILE: --------------------
> atoms:
> # atr1 atr2 ... atrN
> 1  A     B       C
> 2  D     B       E
> 3  D     F       G
> ...
> molucules:
> # atr1 atr2 ... atrN atom_list
> 1  A     B       C    1 5 6 16 53 102
> 2  D     E       F    2 7 13 53 104 205 206 207
> 3  G     H       I    4
> ...
> matter:
> # atr1 atr2 ... atrN molecules_list
> 1  A     B       C    1 5 6 16 53 102
> 2  D     E       F    2 7 13 53 104 205 206 207
> 3  G     H       I    4
> ...
> ------------------- END OF EXAMPLE ---------------------
>
> Each file version is only slightly different from the previous version
> (changes in only few atoms, molecules and\or matter).
>
> I thought on the following scheme (first 6 tables are trivial):
>    - version table - each raw represent a file that was added
>    - atom table - each raw represent one atom from a file
>    - molecules table - each raw represent one molecule from a file (later
> I'll connect atoms to molecules)
>    - matter table - each raw represent one matter from a file (later I'll
> connect molecules to matter)
>    - atom to molecule table - foreign keys from both tables represent the
> relation: atom BELONG to molecule
>    - molecule to matter table - foreign keys from both tables represent the
> relation: molecule BELONG to matter
>    - data to version table - In order not to make 3 different tables
> connecting atoms, molecules and matter to version, I will create one table
> that contains 2 foreign keys (data_id and version_id) together with another
> attribute stating the source table for the data. Now, because versions are
> only slightly different, does it make sense to implement the relation: data
> DOES NOT BELONG to version ?

I think I would combine atom, molecule. matter into one table and add
a 3 record type table.  Now there is only 1 table to worry about
versioning.  I think it is better for some reason, but it will make
the code harder to read. I'll confess I don't really know why I would
want to do this, and in my next section I am ignoring this idea.

I would add a column "like" effective_date datetime - when the version
became real.   The reason I say "like" is because effect dates are a
previously solved problem  so is versioning, but I have a feeling you
will find more googling for that term.

Here is what I found:

select tax_rate from t
where effective_date =
(select max(effective_date) from t
  where effective_date <= '2002-04

http://forums.devshed.com/postgresql-help-21/how-do-i-return-a-value-from-one-column-based-144349.html

atoms:
pk id edt  atr1 atr2 ... atrN
1   1  1/1/10     A     B       C
2   2  1/1/10     D     B       E
3   3  1/1/10     D     F       G
...
molucules:
# atr1 atr2 ... atrN atom_list
1  A     B       C    1 5 6 16 53 102


So molecule #1 contains atom 1 (and more...)

Atom 1 changes from ABC to ABD - add this record:

pk id edt  atr1 atr2 ... atrN
4   1  1/2/10     A     B       D

Notice it gets a new primary key, but keeps the ID.

The 1 in atom_list refers to the atom ID.  So to find the atom as it
was on a certain date, use the above query.

follow?

>
> My second question is about inserting new versions to the db. After parsing
> the text file I have an object with all the information from the file. Is
> the next step to go attribute by attribute and update the db scheme or
> should I create some sort of temporary scheme (containing only data from the
> object) and do the inserting by merging tables? further more, is there a
> technique to keep db integrity while updating (especially in case of error
> in the insert process)?
>
> As always - thanks for your time and patience
>

transactions - you don't have to understand the impemtation, but it is
something like:  buffers all the changes until you commit, then locks,
appends, moves some pointers, deletes, unlocks.   It has been a while,
so this may not be right. but most pleople don't worry about how it
works anyway, just that it works.


--
Carl K
_______________________________________________
DB-SIG maillist  -  [hidden email]
http://mail.python.org/mailman/listinfo/db-sig
Reply | Threaded
Open this post in threaded view
|

Re: DB scheme question

Dieter Maurer
In reply to this post by John Q. Public
John Q. Public wrote at 2010-12-11 10:18 -0800:

>Before I can ask my question, I have to present the data I work with.
>My goal is to create a (very) simple version control for a unique type of
>(text) file.
>The file has three parts:
>1. atoms - set of attributes (e.g. name, id, etc). Each version of the file
>contains up to thousands of atoms
>2. molecules - has some attributes (e.g. name, id, etc) and a group of atoms
>(list of atoms ids). Each molecule can contain zero to hundreds of atoms.
>3. matter - a group of molecules (a matter to molecule is what molecule is
>to atom - and no, it's not homework of some kind....)
> ...
>Each file version is only slightly different from the previous version
>(changes in only few atoms, molecules and\or matter).

I am an object enthusiaste and therefore tend to regard the
world in object oriented terms.

Therefore, I would model your version problem as a set of objects
and then use an object relational mapper (e.g. "SQLAlchemy") to
map this to the relational model.

If there is no need to avoid redundancy across versions,
the version model could be extremely simple: each version
contains copies of its components.



--
Dieter
_______________________________________________
DB-SIG maillist  -  [hidden email]
http://mail.python.org/mailman/listinfo/db-sig
Reply | Threaded
Open this post in threaded view
|

Re: DB scheme question

John Q. Public
Hi Dieter,

I don't think I fully understand your suggestion.
From an OOP point of view, my application consists of 4 objects:
    1. atoms
    2. molecules - contain atoms
    3. matters - contain molecules
    4. versions - contain all the above
The scheme suggested has 4 tables to hold the objects and 4 more tables to represent the connections by pairing foreign keys (e.g. atoms in molecules, molecules in matters and the three of the above in versions). Since I lake the experience in db programming,  I would appreciate if you can point my implementation mistakes so I could learn from them.

regarding the redundancy, each version contains few thousands of atoms, molecules and matters and the differences between versions is up to few tens. Isn't it a waste to hold all this relations in a table (atoms/molecules/matters to version). On the other hand, is it possible that my suggested relation might cause inefficient SELECT statements when searching for a version parameters?

J.

Dieter Maurer wrote
John Q. Public wrote at 2010-12-11 10:18 -0800:
>Before I can ask my question, I have to present the data I work with.
>My goal is to create a (very) simple version control for a unique type of
>(text) file.
>The file has three parts:
>1. atoms - set of attributes (e.g. name, id, etc). Each version of the file
>contains up to thousands of atoms
>2. molecules - has some attributes (e.g. name, id, etc) and a group of atoms
>(list of atoms ids). Each molecule can contain zero to hundreds of atoms.
>3. matter - a group of molecules (a matter to molecule is what molecule is
>to atom - and no, it's not homework of some kind....)
> ...
>Each file version is only slightly different from the previous version
>(changes in only few atoms, molecules and\or matter).

I am an object enthusiaste and therefore tend to regard the
world in object oriented terms.

Therefore, I would model your version problem as a set of objects
and then use an object relational mapper (e.g. "SQLAlchemy") to
map this to the relational model.

If there is no need to avoid redundancy across versions,
the version model could be extremely simple: each version
contains copies of its components.



--
Dieter
_______________________________________________
DB-SIG maillist  -  DB-SIG@python.org
http://mail.python.org/mailman/listinfo/db-sig
Reply | Threaded
Open this post in threaded view
|

Re: DB scheme question

szncary
If I understand your original question correctly you are looking for a versioning system for a file or files which do not change very much between versions.

I would recommend subversion. It is very economical in its storage and suits your stated purpose.

However, your later response (below) indicates you are building an application and are looking for an OOP design.

Translating an OOP design into a database schema is often tricky. In your case, matter could be treated hierarchically like this ...

matter --1:n-->> molecule --1:n-->> atom - where matter is made up of many molecules and molecules are made up of many atoms. Versioning this wouldn't be too difficult in a separate history table.

However, there is likely to be atomic link information. The type of atomic bonds and energy involved to maintain a particular molecular state. Hence I might be tempted to use a molecule record to carry that information. Many-to-many relationships can be implemented with so-called dummy records each of which represent a relationship. These could represent particular atomic bonds in the context of a single molecule.

The bottom line for your design will depend upon the user interface.

You haven't given enough detail to comment realistically.

Additionally, this is probably not the right forum for your question. I believe this group is more focused on APIs than schema design. Google might help

S

John Q. Public wrote
Hi Dieter,

I don't think I fully understand your suggestion.
From an OOP point of view, my application consists of 4 objects:
    1. atoms
    2. molecules - contain atoms
    3. matters - contain molecules
    4. versions - contain all the above
The scheme suggested has 4 tables to hold the objects and 4 more tables to represent the connections by pairing foreign keys (e.g. atoms in molecules, molecules in matters and the three of the above in versions). Since I lake the experience in db programming,  I would appreciate if you can point my implementation mistakes so I could learn from them.

regarding the redundancy, each version contains few thousands of atoms, molecules and matters and the differences between versions is up to few tens. Isn't it a waste to hold all this relations in a table (atoms/molecules/matters to version). On the other hand, is it possible that my suggested relation might cause inefficient SELECT statements when searching for a version parameters?

J.

Dieter Maurer wrote
John Q. Public wrote at 2010-12-11 10:18 -0800:
>Before I can ask my question, I have to present the data I work with.
>My goal is to create a (very) simple version control for a unique type of
>(text) file.
>The file has three parts:
>1. atoms - set of attributes (e.g. name, id, etc). Each version of the file
>contains up to thousands of atoms
>2. molecules - has some attributes (e.g. name, id, etc) and a group of atoms
>(list of atoms ids). Each molecule can contain zero to hundreds of atoms.
>3. matter - a group of molecules (a matter to molecule is what molecule is
>to atom - and no, it's not homework of some kind....)
> ...
>Each file version is only slightly different from the previous version
>(changes in only few atoms, molecules and\or matter).

I am an object enthusiaste and therefore tend to regard the
world in object oriented terms.

Therefore, I would model your version problem as a set of objects
and then use an object relational mapper (e.g. "SQLAlchemy") to
map this to the relational model.

If there is no need to avoid redundancy across versions,
the version model could be extremely simple: each version
contains copies of its components.



--
Dieter
_______________________________________________
DB-SIG maillist  -  DB-SIG@python.org
http://mail.python.org/mailman/listinfo/db-sig
Reply | Threaded
Open this post in threaded view
|

Re: DB scheme question

Dieter Maurer
In reply to this post by John Q. Public
John Q. Public wrote at 2010-12-13 12:03 -0800:

>I don't think I fully understand your suggestion.
>>From an OOP point of view, my application consists of 4 objects:
>    1. atoms
>    2. molecules - contain atoms
>    3. matters - contain molecules
>    4. versions - contain all the above
>The scheme suggested has 4 tables to hold the objects and 4 more tables to
>represent the connections by pairing foreign keys (e.g. atoms in molecules,
>molecules in matters and the three of the above in versions). Since I lake
>the experience in db programming,  I would appreciate if you can point my
>implementation mistakes so I could learn from them.

I did not suggest a database scheme at all. Instead, I suggested
to use an object oriented model and use an object relational mapper
(e.g. SQLAlchemy) to map this to the database. This way, there
is some chance that no big experience in db programming is necessary.

There is a natural object oriented model for your task.
It follows immediately from your problem description.
You could use it unless it would not be efficient enough.

>regarding the redundancy, each version contains few thousands of atoms,
>molecules and matters and the differences between versions is up to few
>tens. Isn't it a waste to hold all this relations in a table
>(atoms/molecules/matters to version).

It is. But it may or may not matter.
There is a rule: use a complex solution only where it is necessary.

Important considerations:

 * how much data you will get and especially how many versions.

   If you expect a huge number of versions (and a huge number
   of matters, molecules and atoms), then the simple scheme
   is probably not efficient enough.

   If you can expect that there typically will only be a few versions,
   then simplicity might rule.

 * which operations should be supported?

   Example:
   when you have to support to display differences between versions
   (and this must be efficient), this may indicate that
   versions must be modeled such that differences can be easily found.
   One approach could be to store the differences between the versions
   rather than all details.

>On the other hand, is it possible that
>my suggested relation might cause inefficient SELECT statements when
>searching for a version parameters?

I have not looked carefully enough at your model to answer this question
with certainty.
I expect that your solution will envolve joins between a version
record and its components. Provided that appropriate indexes are defined,
joins are usually well optimized by the database system.
Likely, you will need to worry only for large datasets.

Following the rule (KISS principle: Keep It Stupid and Simple)
stated above: implement your model (a simple one,
just as complex as it seems necessary for the task at hand) and
then observe the implementation (e.g. make measurements with
realistic datasets). Optimize (e.g. define indexes) as necessary.
Should moderate optimizations not be able to get acceptable behavior,
start over with a new more sofisticated model based on your new
experiences.



--
Dieter
_______________________________________________
DB-SIG maillist  -  [hidden email]
http://mail.python.org/mailman/listinfo/db-sig
Reply | Threaded
Open this post in threaded view
|

Re: DB scheme question

John Q. Public
Hi Dieter and szncary,

Thanks for your replays.
I agree with the simplicity method presented and will try to apply it in my application after taking into account the interface layer.
I'll try to learn SQLAlchemy in relation to my application

Thank you for the information and help,
J.