Quantcast

Adding generated common table expressions

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Adding generated common table expressions

Ashley Waite

Hey all,


I'd like to suggest adding Common Table Expression (CTE) query generation as a feature to Django.

I've been working on a project that required manipulation of many records at once, and as with many ORMs found that this wasn't an ideal use-case in Django. As the rest of our code base and related projects are in Django, there was a strong preference to find a way to do it and keep to the same model-is-the-truth design.

I first did this by writing some hackish functions using raw querysets and generating my own CTE based queries, but it lacked ideal flexibility and maintainability. So I've now written some modifications into my Django to do this in a more Django-esque way and think that this functionality would be beneficial within the project itself, but am unsure exactly where to start the conversation about that.


Why generate CTE based queries from querysets?

By allowing querysets to be attached to each other, and setting appropriate WHERE clauses, arbitrary and nested SQL queries can be generated. Where the results of the queries are only necessary for the execution of following queries this saves a very substantial amount of time and database work. Once these features exist, other functionality can also transparently use these to generate more efficient queries (such as large IN clauses).

This allows several powerful use cases I think Django would benefit from:


Large 'IN' clauses, can be implemented as CTEs reducing expensive lookups to a single CTE INNER JOIN. For sets of thousands to match from tables of millions of records this can be a very substantial gain.


Composite 'IN' conditions, where multiple fields must match and you're matching against a large set of condition rows. In my usage this was "where the md5/sha hashes match one of the million md5/sha tuples in my match set". This is simply a CTE JOIN with two clauses in the WHERE.


Nested data creation, where the parent doesn't yet exist. Django doesn't currently do this as the primary keys are needed, and this makes normalised data structures unappealing. Using INSERTs as CTEs that supply those keys to following statements means that entire nested data structures of new information can be recreated in the database at once, efficiently and atomically.


Non-uniform UPDATEs, such that a modified set of objects can all be updated with different data at the same time by utilising a CTE values statement JOINed to the UPDATE statement. As there's currently no way to do this kind of bulk update the alternative is to update each instance individually, and this doesn't scale well.

These could also be used with aggregations and other calculated fields to create complex queries that aren't possible at the moment.


What my PoC looks like

With another mildly hackish PoC that creates a VALUEs set from a dict/namedtuple which can be used to provide large input data, my present modified version syntax looks a bit like this (not perfect queries):

class Hashes(models.Model):
md5 = models.UUIDField(verbose_name="MD5 hash (base16)", db_index=True)
sha2 = models.CharField(max_length=44, null=True, verbose_name="SHA256 hash (base64)")

# Mock QuerySet of values q_mo = Hashes.as_literal(input_hashes).values("md5", "sha2")
# A big IN query
q_in = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5"))

# Matched existing values with composite 'IN' (where md5 and sha2 match, or md5 matches and existing record lacks sha2) q_ex = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5")).filter(Q(sha160=q_mo.ref("sha160")) | Q(sha160=None)) # Create new records that don't exist q_cr = Hashes.objects.attach(q_mo, q_ex).filter(md5=q_mo.ref("md5")).exclude(md5=q_ex.ref("md5")).values("md5", "sha2").as_insert()
Returning the newly created records.

SQL can be generated that looks something like this:
WITH cte_1_0 (md5, sha2) AS (
	VALUES ('00002d30243bfe9d06673765c432c2bd'::uuid, 'fsA8okuCuq9KybxqcAzNdjlIyAx1QJjTPdf1ZFK/hDI='::varchar(44)),
('0000f20a46e4e60338697948a0917423', '6bVZgpYZtit1E32BlANWXoKnFFFDNierDSIi0SraND4=')), cte_1 AS ( SELECT "hashes"."id", "hashes"."md5", "hashes"."sha2" FROM "hashes" , "cte_1_0" WHERE ("hashes"."md5" = (cte_1_0.md5) AND ("hashes"."sha2" = (cte_1_0.sha2) OR "hashes"."sha2" IS NULL) )) SELECT "hashes"."md5" FROM "hashes" , "cte_1_0" , "cte_1" WHERE ("hashes"."md5" = (cte_1_0.md5) AND NOT ("hashes"."md5" = (cte_1.md5)))
That is:
  • A qs.as_insert() and qs.as_update() on queryset to create *lazy* insert and update queries.
  • A qs.attach() that allows querysets to be attached to other querysets, and will generate them as CTE statements.
  • A qs.ref() that returns an expression that when the query is compiled will be a field reference on the CTE that represents that queryset.
  • Additional compilers on the QuerySet subclasses that these return (so no changes to base compilers meaning no functionality impact to existing usage)
  • Generation of WITH clauses for attached querysets, and RETURN clauses for lazy UPDATE and INSERT querysets with fields requested (via values() in this case)

As these can be attached to querysets that are attached to querysets, that are... etc, many statements can be chained allowing substantial changes to be performed without needing Django to have to receive, process, and resend at every step.

I've had a read through the enhancement proposal docs etc, and I'm willing to do what's needed to make this functionality solid, and put forth a proposal to add it. But am first seeking feedback on it, and whether this is a feature that will be considered.


Thanks,
- Ashley

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/d930c0bf-7285-4af1-a595-241a3c5aa78c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Adding generated common table expressions

Tim Graham-2
Hi, I don't know anything about CTE, but did you see this third-party package? https://github.com/petrounias/django-cte-trees -- It seems to be PostgreSQL only.

I was going to write, "Considering that not all databases support CTE (MySQL doesn't), a third-party app might be the way to go rather than having it built-in to Django." -- however, then I noticed that supported is added in MySQL 8 [0], so maybe any database differences could be abstracted away with an appropriate API.

I'll leave it to others to comment on the technical details of your proposal. Meanwhile, if you have a chance to contribute smaller patches to Django, I think it's helpful to start with smaller patches before tackling something larger like this.

[0] http://mysqlserverteam.com/mysql-8-0-labs-recursive-common-table-expressions-in-mysql-ctes/

On Friday, March 17, 2017 at 7:28:17 AM UTC-4, Ashley Waite wrote:

Hey all,


I'd like to suggest adding Common Table Expression (CTE) query generation as a feature to Django.

I've been working on a project that required manipulation of many records at once, and as with many ORMs found that this wasn't an ideal use-case in Django. As the rest of our code base and related projects are in Django, there was a strong preference to find a way to do it and keep to the same model-is-the-truth design.

I first did this by writing some hackish functions using raw querysets and generating my own CTE based queries, but it lacked ideal flexibility and maintainability. So I've now written some modifications into my Django to do this in a more Django-esque way and think that this functionality would be beneficial within the project itself, but am unsure exactly where to start the conversation about that.


Why generate CTE based queries from querysets?

By allowing querysets to be attached to each other, and setting appropriate WHERE clauses, arbitrary and nested SQL queries can be generated. Where the results of the queries are only necessary for the execution of following queries this saves a very substantial amount of time and database work. Once these features exist, other functionality can also transparently use these to generate more efficient queries (such as large IN clauses).

This allows several powerful use cases I think Django would benefit from:


Large 'IN' clauses, can be implemented as CTEs reducing expensive lookups to a single CTE INNER JOIN. For sets of thousands to match from tables of millions of records this can be a very substantial gain.


Composite 'IN' conditions, where multiple fields must match and you're matching against a large set of condition rows. In my usage this was "where the md5/sha hashes match one of the million md5/sha tuples in my match set". This is simply a CTE JOIN with two clauses in the WHERE.


Nested data creation, where the parent doesn't yet exist. Django doesn't currently do this as the primary keys are needed, and this makes normalised data structures unappealing. Using INSERTs as CTEs that supply those keys to following statements means that entire nested data structures of new information can be recreated in the database at once, efficiently and atomically.


Non-uniform UPDATEs, such that a modified set of objects can all be updated with different data at the same time by utilising a CTE values statement JOINed to the UPDATE statement. As there's currently no way to do this kind of bulk update the alternative is to update each instance individually, and this doesn't scale well.

These could also be used with aggregations and other calculated fields to create complex queries that aren't possible at the moment.


What my PoC looks like

With another mildly hackish PoC that creates a VALUEs set from a dict/namedtuple which can be used to provide large input data, my present modified version syntax looks a bit like this (not perfect queries):

class Hashes(models.Model):
md5 = models.UUIDField(verbose_name="MD5 hash (base16)", db_index=True)
sha2 = models.CharField(max_length=44, null=True, verbose_name="SHA256 hash (base64)")

# Mock QuerySet of values q_mo = Hashes.as_literal(input_hashes).values("md5", "sha2")
# A big IN query
q_in = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5"))

# Matched existing values with composite 'IN' (where md5 and sha2 match, or md5 matches and existing record lacks sha2) q_ex = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5")).filter(Q(sha160=q_mo.ref("sha160")) | Q(sha160=None)) # Create new records that don't exist q_cr = Hashes.objects.attach(q_mo, q_ex).filter(md5=q_mo.ref("md5")).exclude(md5=q_ex.ref("md5")).values("md5", "sha2").as_insert()
Returning the newly created records.

SQL can be generated that looks something like this:
WITH cte_1_0 (md5, sha2) AS (
	VALUES ('00002d30243bfe9d06673765c432c2bd'::uuid, 'fsA8okuCuq9KybxqcAzNdjlIyAx1QJjTPdf1ZFK/hDI='::varchar(44)),
('0000f20a46e4e60338697948a0917423', '6bVZgpYZtit1E32BlANWXoKnFFFDNierDSIi0SraND4=')), cte_1 AS ( SELECT "hashes"."id", "hashes"."md5", "hashes"."sha2" FROM "hashes" , "cte_1_0" WHERE ("hashes"."md5" = (cte_1_0.md5) AND ("hashes"."sha2" = (cte_1_0.sha2) OR "hashes"."sha2" IS NULL) )) SELECT "hashes"."md5" FROM "hashes" , "cte_1_0" , "cte_1" WHERE ("hashes"."md5" = (cte_1_0.md5) AND NOT ("hashes"."md5" = (cte_1.md5)))
That is:
  • A qs.as_insert() and qs.as_update() on queryset to create *lazy* insert and update queries.
  • A qs.attach() that allows querysets to be attached to other querysets, and will generate them as CTE statements.
  • A qs.ref() that returns an expression that when the query is compiled will be a field reference on the CTE that represents that queryset.
  • Additional compilers on the QuerySet subclasses that these return (so no changes to base compilers meaning no functionality impact to existing usage)
  • Generation of WITH clauses for attached querysets, and RETURN clauses for lazy UPDATE and INSERT querysets with fields requested (via values() in this case)

As these can be attached to querysets that are attached to querysets, that are... etc, many statements can be chained allowing substantial changes to be performed without needing Django to have to receive, process, and resend at every step.

I've had a read through the enhancement proposal docs etc, and I'm willing to do what's needed to make this functionality solid, and put forth a proposal to add it. But am first seeking feedback on it, and whether this is a feature that will be considered.


Thanks,
- Ashley

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/a4a98095-8fba-47ad-8348-5031528e75ae%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Adding generated common table expressions

Matthias Kestenholz-7
Hi,

On Sat, Mar 18, 2017 at 12:49 AM, Tim Graham <[hidden email]> wrote:
> Hi, I don't know anything about CTE, but did you see this third-party
> package? https://github.com/petrounias/django-cte-trees -- It seems to be
> PostgreSQL only.

Just chiming in to point out a maintained and up-to-date friendly fork
of the project above:
https://github.com/matthiask/django-cte-forest

Ashley, is your PoC available somewhere? Is your implementation in any
way similar to the implementation in
django-cte-trees/django-cte-forest?

Thanks,
Matthias

--
You received this message because you are subscribed to the Google Groups "Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/CANvPqgDWUPuUA6rw8uawa7_AsUkXFhuegFM1cJ-gjVUCxvBUwg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Adding generated common table expressions

Adam Johnson-2
 supported is added in MySQL 8 [0]

Additionally MariaDB 10.2 supports them, and it's nearing release.

On 18 March 2017 at 07:59, Matthias Kestenholz <[hidden email]> wrote:
Hi,

On Sat, Mar 18, 2017 at 12:49 AM, Tim Graham <[hidden email]> wrote:
> Hi, I don't know anything about CTE, but did you see this third-party
> package? https://github.com/petrounias/django-cte-trees -- It seems to be
> PostgreSQL only.

Just chiming in to point out a maintained and up-to-date friendly fork
of the project above:
https://github.com/matthiask/django-cte-forest

Ashley, is your PoC available somewhere? Is your implementation in any
way similar to the implementation in
django-cte-trees/django-cte-forest?

Thanks,
Matthias

--
You received this message because you are subscribed to the Google Groups "Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/CANvPqgDWUPuUA6rw8uawa7_AsUkXFhuegFM1cJ-gjVUCxvBUwg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.



--
Adam

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/CAMyDDM1pXBMBEOJNmg8w%3D1PWxnFCUCZSpONgYu%3Dk7WXG55yLew%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Adding generated common table expressions

Josh Smeaton
In reply to this post by Ashley Waite
Thanks for bringing this up Ashley, and for all of the detail you provided. I'd certainly like to see CTEs make their way into Django, provided we could come up with a nice enough API. From the look of it, you've already got something that works with an okay API so I'm hopeful.

I'd be very interested in seeing your POC too if you're able to share.

From looking very briefly at django-cte-trees it doesn't aim to support user defined CTEs for anything other than recursive queries. I'd be interested in seeing, as part of a DEP, how CTE inclusion in django core could support the cte-trees project from an API perspective.

On Friday, 17 March 2017 22:28:17 UTC+11, Ashley Waite wrote:

Hey all,


I'd like to suggest adding Common Table Expression (CTE) query generation as a feature to Django.

I've been working on a project that required manipulation of many records at once, and as with many ORMs found that this wasn't an ideal use-case in Django. As the rest of our code base and related projects are in Django, there was a strong preference to find a way to do it and keep to the same model-is-the-truth design.

I first did this by writing some hackish functions using raw querysets and generating my own CTE based queries, but it lacked ideal flexibility and maintainability. So I've now written some modifications into my Django to do this in a more Django-esque way and think that this functionality would be beneficial within the project itself, but am unsure exactly where to start the conversation about that.


Why generate CTE based queries from querysets?

By allowing querysets to be attached to each other, and setting appropriate WHERE clauses, arbitrary and nested SQL queries can be generated. Where the results of the queries are only necessary for the execution of following queries this saves a very substantial amount of time and database work. Once these features exist, other functionality can also transparently use these to generate more efficient queries (such as large IN clauses).

This allows several powerful use cases I think Django would benefit from:


Large 'IN' clauses, can be implemented as CTEs reducing expensive lookups to a single CTE INNER JOIN. For sets of thousands to match from tables of millions of records this can be a very substantial gain.


Composite 'IN' conditions, where multiple fields must match and you're matching against a large set of condition rows. In my usage this was "where the md5/sha hashes match one of the million md5/sha tuples in my match set". This is simply a CTE JOIN with two clauses in the WHERE.


Nested data creation, where the parent doesn't yet exist. Django doesn't currently do this as the primary keys are needed, and this makes normalised data structures unappealing. Using INSERTs as CTEs that supply those keys to following statements means that entire nested data structures of new information can be recreated in the database at once, efficiently and atomically.


Non-uniform UPDATEs, such that a modified set of objects can all be updated with different data at the same time by utilising a CTE values statement JOINed to the UPDATE statement. As there's currently no way to do this kind of bulk update the alternative is to update each instance individually, and this doesn't scale well.

These could also be used with aggregations and other calculated fields to create complex queries that aren't possible at the moment.


What my PoC looks like

With another mildly hackish PoC that creates a VALUEs set from a dict/namedtuple which can be used to provide large input data, my present modified version syntax looks a bit like this (not perfect queries):

class Hashes(models.Model):
md5 = models.UUIDField(verbose_name="MD5 hash (base16)", db_index=True)
sha2 = models.CharField(max_length=44, null=True, verbose_name="SHA256 hash (base64)")

# Mock QuerySet of values q_mo = Hashes.as_literal(input_hashes).values("md5", "sha2")
# A big IN query
q_in = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5"))

# Matched existing values with composite 'IN' (where md5 and sha2 match, or md5 matches and existing record lacks sha2) q_ex = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5")).filter(Q(sha160=q_mo.ref("sha160")) | Q(sha160=None)) # Create new records that don't exist q_cr = Hashes.objects.attach(q_mo, q_ex).filter(md5=q_mo.ref("md5")).exclude(md5=q_ex.ref("md5")).values("md5", "sha2").as_insert()
Returning the newly created records.

SQL can be generated that looks something like this:
WITH cte_1_0 (md5, sha2) AS (
	VALUES ('00002d30243bfe9d06673765c432c2bd'::uuid, 'fsA8okuCuq9KybxqcAzNdjlIyAx1QJjTPdf1ZFK/hDI='::varchar(44)),
('0000f20a46e4e60338697948a0917423', '6bVZgpYZtit1E32BlANWXoKnFFFDNierDSIi0SraND4=')), cte_1 AS ( SELECT "hashes"."id", "hashes"."md5", "hashes"."sha2" FROM "hashes" , "cte_1_0" WHERE ("hashes"."md5" = (cte_1_0.md5) AND ("hashes"."sha2" = (cte_1_0.sha2) OR "hashes"."sha2" IS NULL) )) SELECT "hashes"."md5" FROM "hashes" , "cte_1_0" , "cte_1" WHERE ("hashes"."md5" = (cte_1_0.md5) AND NOT ("hashes"."md5" = (cte_1.md5)))
That is:
  • A qs.as_insert() and qs.as_update() on queryset to create *lazy* insert and update queries.
  • A qs.attach() that allows querysets to be attached to other querysets, and will generate them as CTE statements.
  • A qs.ref() that returns an expression that when the query is compiled will be a field reference on the CTE that represents that queryset.
  • Additional compilers on the QuerySet subclasses that these return (so no changes to base compilers meaning no functionality impact to existing usage)
  • Generation of WITH clauses for attached querysets, and RETURN clauses for lazy UPDATE and INSERT querysets with fields requested (via values() in this case)

As these can be attached to querysets that are attached to querysets, that are... etc, many statements can be chained allowing substantial changes to be performed without needing Django to have to receive, process, and resend at every step.

I've had a read through the enhancement proposal docs etc, and I'm willing to do what's needed to make this functionality solid, and put forth a proposal to add it. But am first seeking feedback on it, and whether this is a feature that will be considered.


Thanks,
- Ashley

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/edc3820c-f206-4a74-89e0-84d83ea5b206%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Adding generated common table expressions

Anssi Kääriäinen-2
+1 for the INSERT and UPDATE parts.

For the .attach() method, I wonder if it would be better to work on having generic subquery support in the ORM instead of targeting CTEs specifically. From Django's perspective there isn't any big difference between:
    WITH fooquery AS (SELECT * FROM foo WHERE ...)
    SELECT * FROM baz JOIN fooquery ON ...
and
    SELECT * FROM baz JOIN (SELECT * FROM foo WHERE ...) fooquery ON ...

The great thing about the subquery form is that it works on every database, and to my knowledge the subquery support is surprisingly standard on all database Django needs to care about.

If we want full support for either of the above cases, then we need some way to tell to the Django's ORM what the fooquery (either as subquery or with CTE) produces. For simple cases this won't be complex (say, for .values() it's just a list of expressions), but what to do when the subquery/CTE contains .select_related() for example? Then we'd have a situation where the subquery produces a list of expressions but those expressions might refer to different models in the CTE.

So, the starting point should be to allow only "simple" queries in .attach(). This means that the attached query must be either .values() queryset, or a queryset containing expressions from single model only (no .extra(), .annotate() or select_related() added). In addition I think .attach() should support subqueries instead of using only CTEs. We get support on every backend instead of PostgreSQL only for pretty much the same amount of effort.

 - Anssi

On Saturday, March 18, 2017 at 11:28:53 AM UTC+2, Josh Smeaton wrote:
Thanks for bringing this up Ashley, and for all of the detail you provided. I'd certainly like to see CTEs make their way into Django, provided we could come up with a nice enough API. From the look of it, you've already got something that works with an okay API so I'm hopeful.

I'd be very interested in seeing your POC too if you're able to share.

From looking very briefly at django-cte-trees it doesn't aim to support user defined CTEs for anything other than recursive queries. I'd be interested in seeing, as part of a DEP, how CTE inclusion in django core could support the cte-trees project from an API perspective.

On Friday, 17 March 2017 22:28:17 UTC+11, Ashley Waite wrote:

Hey all,


I'd like to suggest adding Common Table Expression (CTE) query generation as a feature to Django.

I've been working on a project that required manipulation of many records at once, and as with many ORMs found that this wasn't an ideal use-case in Django. As the rest of our code base and related projects are in Django, there was a strong preference to find a way to do it and keep to the same model-is-the-truth design.

I first did this by writing some hackish functions using raw querysets and generating my own CTE based queries, but it lacked ideal flexibility and maintainability. So I've now written some modifications into my Django to do this in a more Django-esque way and think that this functionality would be beneficial within the project itself, but am unsure exactly where to start the conversation about that.


Why generate CTE based queries from querysets?

By allowing querysets to be attached to each other, and setting appropriate WHERE clauses, arbitrary and nested SQL queries can be generated. Where the results of the queries are only necessary for the execution of following queries this saves a very substantial amount of time and database work. Once these features exist, other functionality can also transparently use these to generate more efficient queries (such as large IN clauses).

This allows several powerful use cases I think Django would benefit from:


Large 'IN' clauses, can be implemented as CTEs reducing expensive lookups to a single CTE INNER JOIN. For sets of thousands to match from tables of millions of records this can be a very substantial gain.


Composite 'IN' conditions, where multiple fields must match and you're matching against a large set of condition rows. In my usage this was "where the md5/sha hashes match one of the million md5/sha tuples in my match set". This is simply a CTE JOIN with two clauses in the WHERE.


Nested data creation, where the parent doesn't yet exist. Django doesn't currently do this as the primary keys are needed, and this makes normalised data structures unappealing. Using INSERTs as CTEs that supply those keys to following statements means that entire nested data structures of new information can be recreated in the database at once, efficiently and atomically.


Non-uniform UPDATEs, such that a modified set of objects can all be updated with different data at the same time by utilising a CTE values statement JOINed to the UPDATE statement. As there's currently no way to do this kind of bulk update the alternative is to update each instance individually, and this doesn't scale well.

These could also be used with aggregations and other calculated fields to create complex queries that aren't possible at the moment.


What my PoC looks like

With another mildly hackish PoC that creates a VALUEs set from a dict/namedtuple which can be used to provide large input data, my present modified version syntax looks a bit like this (not perfect queries):

class Hashes(models.Model):
md5 = models.UUIDField(verbose_name="MD5 hash (base16)", db_index=True)
sha2 = models.CharField(max_length=44, null=True, verbose_name="SHA256 hash (base64)")

# Mock QuerySet of values q_mo = Hashes.as_literal(input_hashes).values("md5", "sha2")
# A big IN query
q_in = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5"))

# Matched existing values with composite 'IN' (where md5 and sha2 match, or md5 matches and existing record lacks sha2) q_ex = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5")).filter(Q(sha160=q_mo.ref("sha160")) | Q(sha160=None)) # Create new records that don't exist q_cr = Hashes.objects.attach(q_mo, q_ex).filter(md5=q_mo.ref("md5")).exclude(md5=q_ex.ref("md5")).values("md5", "sha2").as_insert()
Returning the newly created records.

SQL can be generated that looks something like this:
WITH cte_1_0 (md5, sha2) AS (
	VALUES ('00002d30243bfe9d06673765c432c2bd'::uuid, 'fsA8okuCuq9KybxqcAzNdjlIyAx1QJjTPdf1ZFK/hDI='::varchar(44)),
('0000f20a46e4e60338697948a0917423', '6bVZgpYZtit1E32BlANWXoKnFFFDNierDSIi0SraND4=')), cte_1 AS ( SELECT "hashes"."id", "hashes"."md5", "hashes"."sha2" FROM "hashes" , "cte_1_0" WHERE ("hashes"."md5" = (cte_1_0.md5) AND ("hashes"."sha2" = (cte_1_0.sha2) OR "hashes"."sha2" IS NULL) )) SELECT "hashes"."md5" FROM "hashes" , "cte_1_0" , "cte_1" WHERE ("hashes"."md5" = (cte_1_0.md5) AND NOT ("hashes"."md5" = (cte_1.md5)))
That is:
  • A qs.as_insert() and qs.as_update() on queryset to create *lazy* insert and update queries.
  • A qs.attach() that allows querysets to be attached to other querysets, and will generate them as CTE statements.
  • A qs.ref() that returns an expression that when the query is compiled will be a field reference on the CTE that represents that queryset.
  • Additional compilers on the QuerySet subclasses that these return (so no changes to base compilers meaning no functionality impact to existing usage)
  • Generation of WITH clauses for attached querysets, and RETURN clauses for lazy UPDATE and INSERT querysets with fields requested (via values() in this case)

As these can be attached to querysets that are attached to querysets, that are... etc, many statements can be chained allowing substantial changes to be performed without needing Django to have to receive, process, and resend at every step.

I've had a read through the enhancement proposal docs etc, and I'm willing to do what's needed to make this functionality solid, and put forth a proposal to add it. But am first seeking feedback on it, and whether this is a feature that will be considered.


Thanks,
- Ashley

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/bf796fae-6f33-4293-a4d8-70f82fb40414%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Adding generated common table expressions

Ashley Waite
I'm nesting the query generators as they currently exist and adding awareness of that nesting into another compiler, so that queries that reference other models, contain annotations, etc should still in theory work fine.

I do agree that the best method would be to generalise this to incorporate subqueries as well, but I considered those to be a subset of general CTE queries which can effectively be any valid query.

With some minor modifications to the existing INSERT/UPDATE/SELECT compilers via subclassing I think that subqueries as special cases of general statements that have some restrictions is probably the easiest route.
My current fiddling has changed INSERT/UPDATE to have returning clauses using subclassed compilers:
https://github.com/ashleywaite/django/blob/cte-dev/django/db/models/sql/subqueries.py#L174-L209
https://github.com/ashleywaite/django/blob/cte-dev/django/db/models/sql/compiler.py#L1283-L1321

Such that regular non-RETURN aware uses aren't impacted by that code - but I suspect that making INSERT and UPDATE be RETURN aware and lazily executed as a default case might be better, and more consistent with the way that SELECT based querysets work.

I'm using a workaround to feed in VALUES clauses, which again I think would probably be better as a proper QuerySet subclass that is used by INSERT statements and CTEs alike.
The language is a bit confusing because .values() is a method in Django, so mock querysets that define literal values intended to generate a VALUES clause don't really have a good name, but they would introduce some consistency as that's what INSERT statements are doing, and what non-uniform UPDATEs and composite IN style queries would want to do.

The leading CTE WITH clauses I'm generating are done via it's own compiler as well, such that only queries that are aware of having such queries attached will generate them:
https://github.com/ashleywaite/django/blob/cte-dev/django/db/models/sql/compiler.py#L1324-L1374

These could fairly easily be adapted to use a more general case for attaching one query to another, as my relatively haskish PoC was mostly to show that it can be done without substantial complexity, and that modification of other systems to better incorporate these patterns is probably a good idea.

As they stand now, CTEs used in the way I am using them could with little effort be modified to support aggregations, and annotations on the base query to pull in values from complex generated CTE's is probably the ideal way to perform some multi-table calculations.

- Ashley

On Monday, March 20, 2017 at 6:23:44 PM UTC+11, Anssi Kääriäinen wrote:
+1 for the INSERT and UPDATE parts.

For the .attach() method, I wonder if it would be better to work on having generic subquery support in the ORM instead of targeting CTEs specifically. From Django's perspective there isn't any big difference between:
    WITH fooquery AS (SELECT * FROM foo WHERE ...)
    SELECT * FROM baz JOIN fooquery ON ...
and
    SELECT * FROM baz JOIN (SELECT * FROM foo WHERE ...) fooquery ON ...

The great thing about the subquery form is that it works on every database, and to my knowledge the subquery support is surprisingly standard on all database Django needs to care about.

If we want full support for either of the above cases, then we need some way to tell to the Django's ORM what the fooquery (either as subquery or with CTE) produces. For simple cases this won't be complex (say, for .values() it's just a list of expressions), but what to do when the subquery/CTE contains .select_related() for example? Then we'd have a situation where the subquery produces a list of expressions but those expressions might refer to different models in the CTE.

So, the starting point should be to allow only "simple" queries in .attach(). This means that the attached query must be either .values() queryset, or a queryset containing expressions from single model only (no .extra(), .annotate() or select_related() added). In addition I think .attach() should support subqueries instead of using only CTEs. We get support on every backend instead of PostgreSQL only for pretty much the same amount of effort.

 - Anssi

On Saturday, March 18, 2017 at 11:28:53 AM UTC+2, Josh Smeaton wrote:
Thanks for bringing this up Ashley, and for all of the detail you provided. I'd certainly like to see CTEs make their way into Django, provided we could come up with a nice enough API. From the look of it, you've already got something that works with an okay API so I'm hopeful.

I'd be very interested in seeing your POC too if you're able to share.

From looking very briefly at django-cte-trees it doesn't aim to support user defined CTEs for anything other than recursive queries. I'd be interested in seeing, as part of a DEP, how CTE inclusion in django core could support the cte-trees project from an API perspective.

On Friday, 17 March 2017 22:28:17 UTC+11, Ashley Waite wrote:

Hey all,


I'd like to suggest adding Common Table Expression (CTE) query generation as a feature to Django.

I've been working on a project that required manipulation of many records at once, and as with many ORMs found that this wasn't an ideal use-case in Django. As the rest of our code base and related projects are in Django, there was a strong preference to find a way to do it and keep to the same model-is-the-truth design.

I first did this by writing some hackish functions using raw querysets and generating my own CTE based queries, but it lacked ideal flexibility and maintainability. So I've now written some modifications into my Django to do this in a more Django-esque way and think that this functionality would be beneficial within the project itself, but am unsure exactly where to start the conversation about that.


Why generate CTE based queries from querysets?

By allowing querysets to be attached to each other, and setting appropriate WHERE clauses, arbitrary and nested SQL queries can be generated. Where the results of the queries are only necessary for the execution of following queries this saves a very substantial amount of time and database work. Once these features exist, other functionality can also transparently use these to generate more efficient queries (such as large IN clauses).

This allows several powerful use cases I think Django would benefit from:


Large 'IN' clauses, can be implemented as CTEs reducing expensive lookups to a single CTE INNER JOIN. For sets of thousands to match from tables of millions of records this can be a very substantial gain.


Composite 'IN' conditions, where multiple fields must match and you're matching against a large set of condition rows. In my usage this was "where the md5/sha hashes match one of the million md5/sha tuples in my match set". This is simply a CTE JOIN with two clauses in the WHERE.


Nested data creation, where the parent doesn't yet exist. Django doesn't currently do this as the primary keys are needed, and this makes normalised data structures unappealing. Using INSERTs as CTEs that supply those keys to following statements means that entire nested data structures of new information can be recreated in the database at once, efficiently and atomically.


Non-uniform UPDATEs, such that a modified set of objects can all be updated with different data at the same time by utilising a CTE values statement JOINed to the UPDATE statement. As there's currently no way to do this kind of bulk update the alternative is to update each instance individually, and this doesn't scale well.

These could also be used with aggregations and other calculated fields to create complex queries that aren't possible at the moment.


What my PoC looks like

With another mildly hackish PoC that creates a VALUEs set from a dict/namedtuple which can be used to provide large input data, my present modified version syntax looks a bit like this (not perfect queries):

class Hashes(models.Model):
md5 = models.UUIDField(verbose_name="MD5 hash (base16)", db_index=True)
sha2 = models.CharField(max_length=44, null=True, verbose_name="SHA256 hash (base64)")

# Mock QuerySet of values q_mo = Hashes.as_literal(input_hashes).values("md5", "sha2")
# A big IN query
q_in = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5"))

# Matched existing values with composite 'IN' (where md5 and sha2 match, or md5 matches and existing record lacks sha2) q_ex = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5")).filter(Q(sha160=q_mo.ref("sha160")) | Q(sha160=None)) # Create new records that don't exist q_cr = Hashes.objects.attach(q_mo, q_ex).filter(md5=q_mo.ref("md5")).exclude(md5=q_ex.ref("md5")).values("md5", "sha2").as_insert()
Returning the newly created records.

SQL can be generated that looks something like this:
WITH cte_1_0 (md5, sha2) AS (
	VALUES ('00002d30243bfe9d06673765c432c2bd'::uuid, 'fsA8okuCuq9KybxqcAzNdjlIyAx1QJjTPdf1ZFK/hDI='::varchar(44)),
('0000f20a46e4e60338697948a0917423', '6bVZgpYZtit1E32BlANWXoKnFFFDNierDSIi0SraND4=')), cte_1 AS ( SELECT "hashes"."id", "hashes"."md5", "hashes"."sha2" FROM "hashes" , "cte_1_0" WHERE ("hashes"."md5" = (cte_1_0.md5) AND ("hashes"."sha2" = (cte_1_0.sha2) OR "hashes"."sha2" IS NULL) )) SELECT "hashes"."md5" FROM "hashes" , "cte_1_0" , "cte_1" WHERE ("hashes"."md5" = (cte_1_0.md5) AND NOT ("hashes"."md5" = (cte_1.md5)))
That is:
  • A qs.as_insert() and qs.as_update() on queryset to create *lazy* insert and update queries.
  • A qs.attach() that allows querysets to be attached to other querysets, and will generate them as CTE statements.
  • A qs.ref() that returns an expression that when the query is compiled will be a field reference on the CTE that represents that queryset.
  • Additional compilers on the QuerySet subclasses that these return (so no changes to base compilers meaning no functionality impact to existing usage)
  • Generation of WITH clauses for attached querysets, and RETURN clauses for lazy UPDATE and INSERT querysets with fields requested (via values() in this case)

As these can be attached to querysets that are attached to querysets, that are... etc, many statements can be chained allowing substantial changes to be performed without needing Django to have to receive, process, and resend at every step.

I've had a read through the enhancement proposal docs etc, and I'm willing to do what's needed to make this functionality solid, and put forth a proposal to add it. But am first seeking feedback on it, and whether this is a feature that will be considered.


Thanks,
- Ashley

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/102864a0-5495-4ec0-8b49-bce6f01ea88b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Adding generated common table expressions

Ashley Waite
In reply to this post by Matthias Kestenholz-7
CTE Forest is a specific use case of CTEs to store self-referential tree data sets.

It's quite a different use to what I'm proposing, though its implementation might become simpler if Django is generally CTE aware.

- Ashley

On Saturday, March 18, 2017 at 6:59:57 PM UTC+11, Matthias Kestenholz wrote:
Hi,

On Sat, Mar 18, 2017 at 12:49 AM, Tim Graham <<a href="javascript:" target="_blank" gdf-obfuscated-mailto="ft5xZTjLBwAJ" rel="nofollow" onmousedown="this.href=&#39;javascript:&#39;;return true;" onclick="this.href=&#39;javascript:&#39;;return true;">timog...@...> wrote:
> Hi, I don't know anything about CTE, but did you see this third-party
> package? <a href="https://github.com/petrounias/django-cte-trees" target="_blank" rel="nofollow" onmousedown="this.href=&#39;https://www.google.com/url?q\x3dhttps%3A%2F%2Fgithub.com%2Fpetrounias%2Fdjango-cte-trees\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNEDv8W9KT76tKyTx5WcrgdBxaC_8w&#39;;return true;" onclick="this.href=&#39;https://www.google.com/url?q\x3dhttps%3A%2F%2Fgithub.com%2Fpetrounias%2Fdjango-cte-trees\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNEDv8W9KT76tKyTx5WcrgdBxaC_8w&#39;;return true;">https://github.com/petrounias/django-cte-trees -- It seems to be
> PostgreSQL only.

Just chiming in to point out a maintained and up-to-date friendly fork
of the project above:
<a href="https://github.com/matthiask/django-cte-forest" target="_blank" rel="nofollow" onmousedown="this.href=&#39;https://www.google.com/url?q\x3dhttps%3A%2F%2Fgithub.com%2Fmatthiask%2Fdjango-cte-forest\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNH3Q_SGJAUjNOiGQNUoreNnq6L0KA&#39;;return true;" onclick="this.href=&#39;https://www.google.com/url?q\x3dhttps%3A%2F%2Fgithub.com%2Fmatthiask%2Fdjango-cte-forest\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNH3Q_SGJAUjNOiGQNUoreNnq6L0KA&#39;;return true;">https://github.com/matthiask/django-cte-forest

Ashley, is your PoC available somewhere? Is your implementation in any
way similar to the implementation in
django-cte-trees/django-cte-forest?

Thanks,
Matthias

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/742681c3-85e7-4420-89c8-834c33690990%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Adding generated common table expressions

Ashley Waite
In reply to this post by Josh Smeaton
Here's the code changes I've made, noting that some of them were to shove in a generalised VALUES clause that mocks being a queryset, so that it plays with the same interface.

https://github.com/django/django/compare/master...ashleywaite:cte-dev#files_bucket

I've had a glance at cte-trees/cte-forest and once general CTEs are worked out expanding that to include recursive CTEs wouldn't be too difficult, and that would greatly simplify the implementation of cte-forest to the extent that it might be viable as a django data/reference type.

- Ashley

On Saturday, March 18, 2017 at 8:28:53 PM UTC+11, Josh Smeaton wrote:
Thanks for bringing this up Ashley, and for all of the detail you provided. I'd certainly like to see CTEs make their way into Django, provided we could come up with a nice enough API. From the look of it, you've already got something that works with an okay API so I'm hopeful.

I'd be very interested in seeing your POC too if you're able to share.

From looking very briefly at django-cte-trees it doesn't aim to support user defined CTEs for anything other than recursive queries. I'd be interested in seeing, as part of a DEP, how CTE inclusion in django core could support the cte-trees project from an API perspective.

On Friday, 17 March 2017 22:28:17 UTC+11, Ashley Waite wrote:

Hey all,


I'd like to suggest adding Common Table Expression (CTE) query generation as a feature to Django.

I've been working on a project that required manipulation of many records at once, and as with many ORMs found that this wasn't an ideal use-case in Django. As the rest of our code base and related projects are in Django, there was a strong preference to find a way to do it and keep to the same model-is-the-truth design.

I first did this by writing some hackish functions using raw querysets and generating my own CTE based queries, but it lacked ideal flexibility and maintainability. So I've now written some modifications into my Django to do this in a more Django-esque way and think that this functionality would be beneficial within the project itself, but am unsure exactly where to start the conversation about that.


Why generate CTE based queries from querysets?

By allowing querysets to be attached to each other, and setting appropriate WHERE clauses, arbitrary and nested SQL queries can be generated. Where the results of the queries are only necessary for the execution of following queries this saves a very substantial amount of time and database work. Once these features exist, other functionality can also transparently use these to generate more efficient queries (such as large IN clauses).

This allows several powerful use cases I think Django would benefit from:


Large 'IN' clauses, can be implemented as CTEs reducing expensive lookups to a single CTE INNER JOIN. For sets of thousands to match from tables of millions of records this can be a very substantial gain.


Composite 'IN' conditions, where multiple fields must match and you're matching against a large set of condition rows. In my usage this was "where the md5/sha hashes match one of the million md5/sha tuples in my match set". This is simply a CTE JOIN with two clauses in the WHERE.


Nested data creation, where the parent doesn't yet exist. Django doesn't currently do this as the primary keys are needed, and this makes normalised data structures unappealing. Using INSERTs as CTEs that supply those keys to following statements means that entire nested data structures of new information can be recreated in the database at once, efficiently and atomically.


Non-uniform UPDATEs, such that a modified set of objects can all be updated with different data at the same time by utilising a CTE values statement JOINed to the UPDATE statement. As there's currently no way to do this kind of bulk update the alternative is to update each instance individually, and this doesn't scale well.

These could also be used with aggregations and other calculated fields to create complex queries that aren't possible at the moment.


What my PoC looks like

With another mildly hackish PoC that creates a VALUEs set from a dict/namedtuple which can be used to provide large input data, my present modified version syntax looks a bit like this (not perfect queries):

class Hashes(models.Model):
md5 = models.UUIDField(verbose_name="MD5 hash (base16)", db_index=True)
sha2 = models.CharField(max_length=44, null=True, verbose_name="SHA256 hash (base64)")

# Mock QuerySet of values q_mo = Hashes.as_literal(input_hashes).values("md5", "sha2")
# A big IN query
q_in = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5"))

# Matched existing values with composite 'IN' (where md5 and sha2 match, or md5 matches and existing record lacks sha2) q_ex = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5")).filter(Q(sha160=q_mo.ref("sha160")) | Q(sha160=None)) # Create new records that don't exist q_cr = Hashes.objects.attach(q_mo, q_ex).filter(md5=q_mo.ref("md5")).exclude(md5=q_ex.ref("md5")).values("md5", "sha2").as_insert()
Returning the newly created records.

SQL can be generated that looks something like this:
WITH cte_1_0 (md5, sha2) AS (
	VALUES ('00002d30243bfe9d06673765c432c2bd'::uuid, 'fsA8okuCuq9KybxqcAzNdjlIyAx1QJjTPdf1ZFK/hDI='::varchar(44)),
('0000f20a46e4e60338697948a0917423', '6bVZgpYZtit1E32BlANWXoKnFFFDNierDSIi0SraND4=')), cte_1 AS ( SELECT "hashes"."id", "hashes"."md5", "hashes"."sha2" FROM "hashes" , "cte_1_0" WHERE ("hashes"."md5" = (cte_1_0.md5) AND ("hashes"."sha2" = (cte_1_0.sha2) OR "hashes"."sha2" IS NULL) )) SELECT "hashes"."md5" FROM "hashes" , "cte_1_0" , "cte_1" WHERE ("hashes"."md5" = (cte_1_0.md5) AND NOT ("hashes"."md5" = (cte_1.md5)))
That is:
  • A qs.as_insert() and qs.as_update() on queryset to create *lazy* insert and update queries.
  • A qs.attach() that allows querysets to be attached to other querysets, and will generate them as CTE statements.
  • A qs.ref() that returns an expression that when the query is compiled will be a field reference on the CTE that represents that queryset.
  • Additional compilers on the QuerySet subclasses that these return (so no changes to base compilers meaning no functionality impact to existing usage)
  • Generation of WITH clauses for attached querysets, and RETURN clauses for lazy UPDATE and INSERT querysets with fields requested (via values() in this case)

As these can be attached to querysets that are attached to querysets, that are... etc, many statements can be chained allowing substantial changes to be performed without needing Django to have to receive, process, and resend at every step.

I've had a read through the enhancement proposal docs etc, and I'm willing to do what's needed to make this functionality solid, and put forth a proposal to add it. But am first seeking feedback on it, and whether this is a feature that will be considered.


Thanks,
- Ashley

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/9af2d3dc-81a0-4d76-a3a0-cdef850158ee%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Loading...