~Q broken?

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

~Q broken?

serbaut

Consider the following equivalent queries based on the weblog model
from the documentation:

Blog.objects.filter(name="myblog").exclude(entry__body_text__contains="blah
blah")
Blog.objects.filter(Q(name="myblog") &
QNot(Q(entry__body_text__contains="blah blah"))) # 0.96
Blog.objects.filter(Q(name="myblog") &
~Q(entry__body_text__contains="blah blah")) # development  version

In 0.96 the evaluated SQL is (edited for readability):

SELECT blog_blog.id,blog_blog.name,blog_blog.tagline
FROM blog_blog
INNER JOIN blog_entry AS blog_blog__entry ON blog_blog.id =
blog_blog__entry.blog_id
WHERE blog_blog.name = "myblog"
AND (NOT (blog_blog__entry.body_text LIKE "%blah blah%"))

while in trunk (r7982) it is

SELECT blog_blog.id, blog_blog.name, blog_blog.tagline
FROM blog_blog
WHERE blog_blog.name = "myblog"
AND (NOT (blog_blog.id IN (SELECT blog_entry.blog_id FROM blog_entry
WHERE blog_entry.body_text LIKE "%blah blah%")))

The trunk version will perform a subquery over all blog entries which
will have a very negative performance impact.

Lets say "myblog" has 10 entries in a database with 10,000,000
entries, the first SQL will only examine the 10 entries (assuming
decent query planner) while the latter will scan the whole database. A
join with blog_blog.id is missing from the subquery or it needs to be
rewritten to the 0.96 form. I understand that the code tries to handle
the generic case but this practical case has to work too.

/// joakim


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to [hidden email]
To unsubscribe from this group, send email to [hidden email]
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply | Threaded
Open this post in threaded view
|

Re: ~Q broken?

Malcolm Tredinnick


On Wed, 2008-07-16 at 10:12 -0700, serbaut wrote:

> Consider the following equivalent queries based on the weblog model
> from the documentation:
>
> Blog.objects.filter(name="myblog").exclude(entry__body_text__contains="blah
> blah")
> Blog.objects.filter(Q(name="myblog") &
> QNot(Q(entry__body_text__contains="blah blah"))) # 0.96
> Blog.objects.filter(Q(name="myblog") &
> ~Q(entry__body_text__contains="blah blah")) # development  version
>
> In 0.96 the evaluated SQL is (edited for readability):
>
> SELECT blog_blog.id,blog_blog.name,blog_blog.tagline
> FROM blog_blog
> INNER JOIN blog_entry AS blog_blog__entry ON blog_blog.id =
> blog_blog__entry.blog_id
> WHERE blog_blog.name = "myblog"
> AND (NOT (blog_blog__entry.body_text LIKE "%blah blah%"))
>
> while in trunk (r7982) it is
>
> SELECT blog_blog.id, blog_blog.name, blog_blog.tagline
> FROM blog_blog
> WHERE blog_blog.name = "myblog"
> AND (NOT (blog_blog.id IN (SELECT blog_entry.blog_id FROM blog_entry
> WHERE blog_entry.body_text LIKE "%blah blah%")))
>
> The trunk version will perform a subquery over all blog entries which
> will have a very negative performance impact.
>
> Lets say "myblog" has 10 entries in a database with 10,000,000
> entries, the first SQL will only examine the 10 entries (assuming
> decent query planner) while the latter will scan the whole database. A
> join with blog_blog.id is missing from the subquery or it needs to be
> rewritten to the 0.96 form. I understand that the code tries to handle
> the generic case but this practical case has to work too.

The important difference, that trumps everything else, is that the 0.96
version gives the wrong answer! It isn't a case of less or more
efficient -- it's the difference between correct and incorrect.

The query you described asks to exclude all blog entries containing a
particular tag. Now consider a blog entry that has two entries. One is
the entry you're interested in excluding and the second entry is
something else. Because that blog has an row in the m2m join table that
does not match the entry you are excluding, that blog will be included
(incorrectly) in the result set. This was a very big bug in 0.96 and
impossible to work around in that code.

There is no way to write that particular exclusion correctly without
using nested subqueries unless you have some arbitrary constraint like
only one tag per blog entry (in which case a many-to-many field is the
wrong choice).

Remember that the query you are writing here is something that returns
(and filters) Blog objects. It's not for excluding individual Entry
objects. It uses the presence or absence of an Entry to filter the
Blogs.

Your case is particularly pessimal, since it isn't going to be helped by
index matches in most cases. For more natural uses, such as excluding by
pk values, or whole-field matches, adding appropriate index comparisons
makes the inner query very efficient if it becomes a performance issue
in production environments.

Regards,
Malcolm


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to [hidden email]
To unsubscribe from this group, send email to [hidden email]
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply | Threaded
Open this post in threaded view
|

Re: ~Q broken?

serbaut

There is no m2m relation involved in the example. The model is (from
http://www.djangoproject.com/documentation/db-api/):

class Blog(models.Model):
    name = models.CharField(max_length=100)
    tagline = models.TextField()

class Entry(models.Model):
    blog = models.ForeignKey(Blog)
    headline = models.CharField(max_length=255)
    body_text = models.TextField()

The example may be a bit stupid but it is a simplification of a real
issue we have and the problem is the unrestricted subquery that is
executed to find the keys of the excluded records.

On Jul 16, 7:23 pm, Malcolm Tredinnick <[hidden email]>
wrote:

> On Wed, 2008-07-16 at 10:12 -0700, serbaut wrote:
> > Consider the following equivalent queries based on the weblog model
> > from the documentation:
>
> > Blog.objects.filter(name="myblog").exclude(entry__body_text__contains="blah
> > blah")
> > Blog.objects.filter(Q(name="myblog") &
> > QNot(Q(entry__body_text__contains="blah blah"))) # 0.96
> > Blog.objects.filter(Q(name="myblog") &
> > ~Q(entry__body_text__contains="blah blah")) # development  version
>
> > In 0.96 the evaluated SQL is (edited for readability):
>
> > SELECT blog_blog.id,blog_blog.name,blog_blog.tagline
> > FROM blog_blog
> > INNER JOIN blog_entry AS blog_blog__entry ON blog_blog.id =
> > blog_blog__entry.blog_id
> > WHERE blog_blog.name = "myblog"
> > AND (NOT (blog_blog__entry.body_text LIKE "%blah blah%"))
>
> > while in trunk (r7982) it is
>
> > SELECT blog_blog.id, blog_blog.name, blog_blog.tagline
> > FROM blog_blog
> > WHERE blog_blog.name = "myblog"
> > AND (NOT (blog_blog.id IN (SELECT blog_entry.blog_id FROM blog_entry
> > WHERE blog_entry.body_text LIKE "%blah blah%")))
>
> > The trunk version will perform a subquery over all blog entries which
> > will have a very negative performance impact.
>
> > Lets say "myblog" has 10 entries in a database with 10,000,000
> > entries, the first SQL will only examine the 10 entries (assuming
> > decent query planner) while the latter will scan the whole database. A
> > join with blog_blog.id is missing from the subquery or it needs to be
> > rewritten to the 0.96 form. I understand that the code tries to handle
> > the generic case but this practical case has to work too.
>
> The important difference, that trumps everything else, is that the 0.96
> version gives the wrong answer! It isn't a case of less or more
> efficient -- it's the difference between correct and incorrect.
>
> The query you described asks to exclude all blog entries containing a
> particular tag. Now consider a blog entry that has two entries. One is
> the entry you're interested in excluding and the second entry is
> something else. Because that blog has an row in the m2m join table that
> does not match the entry you are excluding, that blog will be included
> (incorrectly) in the result set. This was a very big bug in 0.96 and
> impossible to work around in that code.
>
> There is no way to write that particular exclusion correctly without
> using nested subqueries unless you have some arbitrary constraint like
> only one tag per blog entry (in which case a many-to-many field is the
> wrong choice).
>
> Remember that the query you are writing here is something that returns
> (and filters) Blog objects. It's not for excluding individual Entry
> objects. It uses the presence or absence of an Entry to filter the
> Blogs.
>
> Your case is particularly pessimal, since it isn't going to be helped by
> index matches in most cases. For more natural uses, such as excluding by
> pk values, or whole-field matches, adding appropriate index comparisons
> makes the inner query very efficient if it becomes a performance issue
> in production environments.
>
> Regards,
> Malcolm
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to [hidden email]
To unsubscribe from this group, send email to [hidden email]
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply | Threaded
Open this post in threaded view
|

Re: ~Q broken?

Malcolm Tredinnick


On Wed, 2008-07-16 at 10:46 -0700, serbaut wrote:
> There is no m2m relation involved in the example.

Yes, you're right. It's not many-to-many, but it is multi-valued. It's a
reverse many-to-one, which means one blog maps to many entries. The
logic for the query construction still stands.

[...]
> The example may be a bit stupid but it is a simplification of a real
> issue we have and the problem is the unrestricted subquery that is
> executed to find the keys of the excluded records.

Then write custom SQL. Or write a different filter

The problem is that the filter you specified was not returning the
correct answer in 0.96. You're asking that Django returns an incorrect
result to make your code faster. It's possible to give incorrect results
very fast, but it's not something we want to support.

Regards,
Malcolm



--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to [hidden email]
To unsubscribe from this group, send email to [hidden email]
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply | Threaded
Open this post in threaded view
|

Re: ~Q broken?

serbaut

Thank you for your replies.

> The problem is that the filter you specified was not returning the
> correct answer in 0.96. You're asking that Django returns an incorrect
> result to make your code faster. It's possible to give incorrect results
> very fast, but it's not something we want to support.

I assume you are referring to the case when an 1-M relation lacks a
related object (eg blog without entries)? In our case we know that the
"blog" always has "entries" and the subselects still hits a all rows
in a table with 100M rows and just relying on indices will not work.

> Then write custom SQL. Or write a different filter

The filter is dynamically built by the gui so writing custom sql is
not an option and as I understand it we cant do what we want with a
filter query as it is. Right now the best solution may be to add a Q
operator that works like the old QNot.

The db api is a killer-feature of django and it has worked wonders for
us and will continue to do so I am sure. Thank you for your work!

Regards,
Joakim

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to [hidden email]
To unsubscribe from this group, send email to [hidden email]
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply | Threaded
Open this post in threaded view
|

Re: ~Q broken?

Malcolm Tredinnick


On Wed, 2008-07-16 at 13:15 -0700, serbaut wrote:
> Thank you for your replies.
>
> > The problem is that the filter you specified was not returning the
> > correct answer in 0.96. You're asking that Django returns an incorrect
> > result to make your code faster. It's possible to give incorrect results
> > very fast, but it's not something we want to support.
>
> I assume you are referring to the case when an 1-M relation lacks a
> related object (eg blog without entries)?

If you mean the incorrect results, they will happen whenever you have a
blog with two entries, one of which does not contain the text searched
for. In the 0.96 SQL you showed, the blog will be in the result set,
even if it also contains an entry that *does* contain the text you're
searching for. For example,

        blog A
          - entry 1 (text = "hello")
          - entry 2 (text = "goodbye")
       
If you are trying to exclude blogs that have an entry containing
"hello", the 0.96 SQL is clearly wrong, since the join between blog A
and entry 2 will pass the where-test (blog A should be excluded, since
it contains an entry that contains "hello").

> In our case we know that the
> "blog" always has "entries" and the subselects still hits a all rows
> in a table with 100M rows and just relying on indices will not work.
>
> > Then write custom SQL. Or write a different filter
>
> The filter is dynamically built by the gui so writing custom sql is
> not an option and as I understand it we cant do what we want with a
> filter query as it is. Right now the best solution may be to add a Q
> operator that works like the old QNot.

You can write external Q object for your own purposes (it shouldn't
require patching the core, since you would have full access to the Query
object -- see Query.add_q()). I see no advantage of including it in
core, however, since it doesn't return a particularly intuitive result.

One thing I can think of here -- that's already on my list and will
when there's only one thing we're trying to check for in the sub-select,
we can use a "NOT EXISTS" query instead of a "NOT IN". This isn't
entirely trivial, since it probably requires some moderately ugly query
munging and breaks a few abstraction barriers in the code until I
reorganise some internals. It's on the long-term list of things to work
on when the current rush is over. I might get to it before 1.0, but
since it's fully backwards compatible in an API sense, it'll probably
wait until afterwards, even if somebody delivered a perfect patch (it
takes a long while to shake out the bugs from those sorts of things).

Anyway, the conclusion for the purposes of the thread here is that ~Q is
not broken. You may not like the efficiency in your particular case, but
that's just one case amongst many and it's an unfortunate edge case for
this particular query. These cases will always exist and somebody will
always hit them. Some future planned improvements may well help out
there, but nothing immediate is going to change.

Regards,
Malcolm



--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to [hidden email]
To unsubscribe from this group, send email to [hidden email]
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply | Threaded
Open this post in threaded view
|

Re: ~Q broken?

Steve Holden-4
In reply to this post by Malcolm Tredinnick

Malcolm Tredinnick wrote:

>
> On Wed, 2008-07-16 at 10:12 -0700, serbaut wrote:
>> Consider the following equivalent queries based on the weblog model
>> from the documentation:
>>
>> Blog.objects.filter(name="myblog").exclude(entry__body_text__contains="blah
>> blah")
>> Blog.objects.filter(Q(name="myblog") &
>> QNot(Q(entry__body_text__contains="blah blah"))) # 0.96
>> Blog.objects.filter(Q(name="myblog") &
>> ~Q(entry__body_text__contains="blah blah")) # development  version
>>
>> In 0.96 the evaluated SQL is (edited for readability):
>>
>> SELECT blog_blog.id,blog_blog.name,blog_blog.tagline
>> FROM blog_blog
>> INNER JOIN blog_entry AS blog_blog__entry ON blog_blog.id =
>> blog_blog__entry.blog_id
>> WHERE blog_blog.name = "myblog"
>> AND (NOT (blog_blog__entry.body_text LIKE "%blah blah%"))
>>
>> while in trunk (r7982) it is
>>
>> SELECT blog_blog.id, blog_blog.name, blog_blog.tagline
>> FROM blog_blog
>> WHERE blog_blog.name = "myblog"
>> AND (NOT (blog_blog.id IN (SELECT blog_entry.blog_id FROM blog_entry
>> WHERE blog_entry.body_text LIKE "%blah blah%")))
>>
>> The trunk version will perform a subquery over all blog entries which
>> will have a very negative performance impact.
>>
>> Lets say "myblog" has 10 entries in a database with 10,000,000
>> entries, the first SQL will only examine the 10 entries (assuming
>> decent query planner) while the latter will scan the whole database. A
>> join with blog_blog.id is missing from the subquery or it needs to be
>> rewritten to the 0.96 form. I understand that the code tries to handle
>> the generic case but this practical case has to work too.
>
> The important difference, that trumps everything else, is that the 0.96
> version gives the wrong answer! It isn't a case of less or more
> efficient -- it's the difference between correct and incorrect.
>
> The query you described asks to exclude all blog entries containing a
> particular tag. Now consider a blog entry that has two entries. One is
> the entry you're interested in excluding and the second entry is
> something else. Because that blog has an row in the m2m join table that
> does not match the entry you are excluding, that blog will be included
> (incorrectly) in the result set. This was a very big bug in 0.96 and
> impossible to work around in that code.
>
> There is no way to write that particular exclusion correctly without
> using nested subqueries unless you have some arbitrary constraint like
> only one tag per blog entry (in which case a many-to-many field is the
> wrong choice).
>
> Remember that the query you are writing here is something that returns
> (and filters) Blog objects. It's not for excluding individual Entry
> objects. It uses the presence or absence of an Entry to filter the
> Blogs.
>
> Your case is particularly pessimal, since it isn't going to be helped by
> index matches in most cases. For more natural uses, such as excluding by
> pk values, or whole-field matches, adding appropriate index comparisons
> makes the inner query very efficient if it becomes a performance issue
> in production environments.
>
Note that there's nothing wring in the general case with nested
subqueries. The problem arises with what are called "correlated
subqueries", where an element of the rows from the outer query is used
in the sub-query.

In other words, given an employee table where each employee has a
manager, it's quite fast to work out who earns more than a specific
employee:

SELECT * FROM employee WHERE salary > (
     SELECT salary FROM employee WHERE ID=1234)

The subquery returns a result that's invariant over the rows of the
outer query, and most SQL implementations will only perform it the once.
It's quite a different story if you want to find out who earns more than
their manager, however:

SELECT * FROM employee AS sub WHERE salary > (
     SELECT salary FROM employee AS mgr WHERE sub.salary > mgr.salary)

In such cases the SQL optimizer usually has little choice but to repeat
the subquery for each row in the outer query, which can get a little
expensive.

Then later said:

> The problem is that the filter you specified was not returning the
> correct answer in 0.96. You're asking that Django returns an incorrect
> result to make your code faster. It's possible to give incorrect results
> very fast, but it's not something we want to support.

Which made me laugh :-)

Sorry if this is preaching to the choir: sometimes people assume that
*all* nested subqueries are bad.

regards
  Steve
--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to [hidden email]
To unsubscribe from this group, send email to [hidden email]
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply | Threaded
Open this post in threaded view
|

Re: ~Q broken?

Malcolm Tredinnick


On Wed, 2008-07-16 at 20:44 -0400, Steve Holden wrote:

> Malcolm Tredinnick wrote:
> >
> > On Wed, 2008-07-16 at 10:12 -0700, serbaut wrote:
> >> Consider the following equivalent queries based on the weblog model
> >> from the documentation:
> >>
> >> Blog.objects.filter(name="myblog").exclude(entry__body_text__contains="blah
> >> blah")
> >> Blog.objects.filter(Q(name="myblog") &
> >> QNot(Q(entry__body_text__contains="blah blah"))) # 0.96
> >> Blog.objects.filter(Q(name="myblog") &
> >> ~Q(entry__body_text__contains="blah blah")) # development  version
> >>
> >> In 0.96 the evaluated SQL is (edited for readability):
> >>
> >> SELECT blog_blog.id,blog_blog.name,blog_blog.tagline
> >> FROM blog_blog
> >> INNER JOIN blog_entry AS blog_blog__entry ON blog_blog.id =
> >> blog_blog__entry.blog_id
> >> WHERE blog_blog.name = "myblog"
> >> AND (NOT (blog_blog__entry.body_text LIKE "%blah blah%"))
> >>
> >> while in trunk (r7982) it is
> >>
> >> SELECT blog_blog.id, blog_blog.name, blog_blog.tagline
> >> FROM blog_blog
> >> WHERE blog_blog.name = "myblog"
> >> AND (NOT (blog_blog.id IN (SELECT blog_entry.blog_id FROM blog_entry
> >> WHERE blog_entry.body_text LIKE "%blah blah%")))
> >>
> >> The trunk version will perform a subquery over all blog entries which
> >> will have a very negative performance impact.
> >>
> >> Lets say "myblog" has 10 entries in a database with 10,000,000
> >> entries, the first SQL will only examine the 10 entries (assuming
> >> decent query planner) while the latter will scan the whole database. A
> >> join with blog_blog.id is missing from the subquery or it needs to be
> >> rewritten to the 0.96 form. I understand that the code tries to handle
> >> the generic case but this practical case has to work too.
> >
> > The important difference, that trumps everything else, is that the 0.96
> > version gives the wrong answer! It isn't a case of less or more
> > efficient -- it's the difference between correct and incorrect.
> >
> > The query you described asks to exclude all blog entries containing a
> > particular tag. Now consider a blog entry that has two entries. One is
> > the entry you're interested in excluding and the second entry is
> > something else. Because that blog has an row in the m2m join table that
> > does not match the entry you are excluding, that blog will be included
> > (incorrectly) in the result set. This was a very big bug in 0.96 and
> > impossible to work around in that code.
> >
> > There is no way to write that particular exclusion correctly without
> > using nested subqueries unless you have some arbitrary constraint like
> > only one tag per blog entry (in which case a many-to-many field is the
> > wrong choice).
> >
> > Remember that the query you are writing here is something that returns
> > (and filters) Blog objects. It's not for excluding individual Entry
> > objects. It uses the presence or absence of an Entry to filter the
> > Blogs.
> >
> > Your case is particularly pessimal, since it isn't going to be helped by
> > index matches in most cases. For more natural uses, such as excluding by
> > pk values, or whole-field matches, adding appropriate index comparisons
> > makes the inner query very efficient if it becomes a performance issue
> > in production environments.
> >
> Note that there's nothing wring in the general case with nested
> subqueries. The problem arises with what are called "correlated
> subqueries", where an element of the rows from the outer query is used
> in the sub-query.
>
> In other words, given an employee table where each employee has a
> manager, it's quite fast to work out who earns more than a specific
> employee:
>
> SELECT * FROM employee WHERE salary > (
>      SELECT salary FROM employee WHERE ID=1234)
>
> The subquery returns a result that's invariant over the rows of the
> outer query, and most SQL implementations will only perform it the once.
> It's quite a different story if you want to find out who earns more than
> their manager, however:
>
> SELECT * FROM employee AS sub WHERE salary > (
>      SELECT salary FROM employee AS mgr WHERE sub.salary > mgr.salary)
>
> In such cases the SQL optimizer usually has little choice but to repeat
> the subquery for each row in the outer query, which can get a little
> expensive.

Yes, that's very true (and I think both of us participating so far
realise that, but it's not a bad idea to check everybody's on the same
page). Django doesn't do anything correlated at the moment (I
intentionally made the nested query be able to be run in its entirety.

The performance hit in the example at hand is that the nested sub-query
cannot run really fast because it's a "LIKE" query over a varchar field,
so indexes aren't particularly helpful in most database setups. That's
why I've been emphasising that it's a bit of an edge case. Other nested
query cases such as comparing against an integer or against the entire
value of a field (e.g. exclude(entry__title="fred")) become,
effectively, index lookups if you index the right column and go like the
wind. Like the wind, I tell you! On a windy day!

Cheers,
Malcolm



--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to [hidden email]
To unsubscribe from this group, send email to [hidden email]
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply | Threaded
Open this post in threaded view
|

Re: ~Q broken?

Göran Åström

In my view the 'not' can be interpreted differently.
Look at the following requests:

1. Give me blogs with entries matching a criteria.
2. Give me blogs with entries not matching a criteria.
3. Give me blogs without entries matching a criteria.

All three are valid.
In django 0.96 number 1 and 2 is available.
In development 1 and 3 is possible.
Not supporting number 2 causes problems since after bridging a one to
many (or m2m) relation the 'not' operator is no longer available (and
with it '!=', 'NOT LIKE' etc.) while 'and' and 'or' are.
In my opinion number 3 which is a 'not exists' merits its own operator
e.q -Q(). This would also make it possible to custruct a filter like:
Give me blogs without entries not matching a criteria.

Thanks for a great system!
Regards,
Göran

On Jul 17, 2:53 am, Malcolm Tredinnick <[hidden email]>
wrote:

> On Wed, 2008-07-16 at 20:44 -0400, Steve Holden wrote:
> > Malcolm Tredinnick wrote:
>
> > > On Wed, 2008-07-16 at 10:12 -0700, serbaut wrote:
> > >> Consider the following equivalent queries based on the weblog model
> > >> from the documentation:
>
> > >> Blog.objects.filter(name="myblog").exclude(entry__body_text__contains="blah
> > >> blah")
> > >> Blog.objects.filter(Q(name="myblog") &
> > >> QNot(Q(entry__body_text__contains="blah blah"))) # 0.96
> > >> Blog.objects.filter(Q(name="myblog") &
> > >> ~Q(entry__body_text__contains="blah blah")) # development  version
>
> > >> In 0.96 the evaluated SQL is (edited for readability):
>
> > >> SELECT blog_blog.id,blog_blog.name,blog_blog.tagline
> > >> FROM blog_blog
> > >> INNER JOIN blog_entry AS blog_blog__entry ON blog_blog.id =
> > >> blog_blog__entry.blog_id
> > >> WHERE blog_blog.name = "myblog"
> > >> AND (NOT (blog_blog__entry.body_text LIKE "%blah blah%"))
>
> > >> while in trunk (r7982) it is
>
> > >> SELECT blog_blog.id, blog_blog.name, blog_blog.tagline
> > >> FROM blog_blog
> > >> WHERE blog_blog.name = "myblog"
> > >> AND (NOT (blog_blog.id IN (SELECT blog_entry.blog_id FROM blog_entry
> > >> WHERE blog_entry.body_text LIKE "%blah blah%")))
>
> > >> The trunk version will perform a subquery over all blog entries which
> > >> will have a very negative performance impact.
>
> > >> Lets say "myblog" has 10 entries in a database with 10,000,000
> > >> entries, the first SQL will only examine the 10 entries (assuming
> > >> decent query planner) while the latter will scan the whole database. A
> > >> join with blog_blog.id is missing from the subquery or it needs to be
> > >> rewritten to the 0.96 form. I understand that the code tries to handle
> > >> the generic case but this practical case has to work too.
>
> > > The important difference, that trumps everything else, is that the 0.96
> > > version gives the wrong answer! It isn't a case of less or more
> > > efficient -- it's the difference between correct and incorrect.
>
> > > The query you described asks to exclude all blog entries containing a
> > > particular tag. Now consider a blog entry that has two entries. One is
> > > the entry you're interested in excluding and the second entry is
> > > something else. Because that blog has an row in the m2m join table that
> > > does not match the entry you are excluding, that blog will be included
> > > (incorrectly) in the result set. This was a very big bug in 0.96 and
> > > impossible to work around in that code.
>
> > > There is no way to write that particular exclusion correctly without
> > > using nested subqueries unless you have some arbitrary constraint like
> > > only one tag per blog entry (in which case a many-to-many field is the
> > > wrong choice).
>
> > > Remember that the query you are writing here is something that returns
> > > (and filters) Blog objects. It's not for excluding individual Entry
> > > objects. It uses the presence or absence of an Entry to filter the
> > > Blogs.
>
> > > Your case is particularly pessimal, since it isn't going to be helped by
> > > index matches in most cases. For more natural uses, such as excluding by
> > > pk values, or whole-field matches, adding appropriate index comparisons
> > > makes the inner query very efficient if it becomes a performance issue
> > > in production environments.
>
> > Note that there's nothing wring in the general case with nested
> > subqueries. The problem arises with what are called "correlated
> > subqueries", where an element of the rows from the outer query is used
> > in the sub-query.
>
> > In other words, given an employee table where each employee has a
> > manager, it's quite fast to work out who earns more than a specific
> > employee:
>
> > SELECT * FROM employee WHERE salary > (
> >      SELECT salary FROM employee WHERE ID=1234)
>
> > The subquery returns a result that's invariant over the rows of the
> > outer query, and most SQL implementations will only perform it the once.
> > It's quite a different story if you want to find out who earns more than
> > their manager, however:
>
> > SELECT * FROM employee AS sub WHERE salary > (
> >      SELECT salary FROM employee AS mgr WHERE sub.salary > mgr.salary)
>
> > In such cases the SQL optimizer usually has little choice but to repeat
> > the subquery for each row in the outer query, which can get a little
> > expensive.
>
> Yes, that's very true (and I think both of us participating so far
> realise that, but it's not a bad idea to check everybody's on the same
> page). Django doesn't do anything correlated at the moment (I
> intentionally made the nested query be able to be run in its entirety.
>
> The performance hit in the example at hand is that the nested sub-query
> cannot run really fast because it's a "LIKE" query over a varchar field,
> so indexes aren't particularly helpful in most database setups. That's
> why I've been emphasising that it's a bit of an edge case. Other nested
> query cases such as comparing against an integer or against the entire
> value of a field (e.g. exclude(entry__title="fred")) become,
> effectively, index lookups if you index the right column and go like the
> wind. Like the wind, I tell you! On a windy day!
>
> Cheers,
> Malcolm- Hide quoted text -
>
> - Show quoted text -

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to [hidden email]
To unsubscribe from this group, send email to [hidden email]
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply | Threaded
Open this post in threaded view
|

Re: ~Q broken?

Malcolm Tredinnick


On Thu, 2008-07-17 at 02:39 -0700, Göran Åström wrote:
> In my view the 'not' can be interpreted differently.
> Look at the following requests:
>
> 1. Give me blogs with entries matching a criteria.
> 2. Give me blogs with entries not matching a criteria.
> 3. Give me blogs without entries matching a criteria.
>

Whilst alternate interpretations are always possible, interpreting
exclude as the complement of include ("filter") is the most natural. So
excluding based on a particular entry means to take the whole set of
blogs and remove those containing an entry that matches the condition
(since the filter() version would be those blogs containing an entry
that match the condition). Thus exclude() and filter() are opposites.

Since (a) writing your own Q-like object is possible and not
particularly hard and (b) nested queries are more or less supported (the
API doesn't exist, but all the under-the-cover implementation is there
and it's possible to use it if you read the code), your second option
has at least two ways to be accessed. So keeping the most natural
version of exclude() that we have now still seems like the best option
to me. That way, nothing is impossible and the obvious usage is the
right one.

Regards,
Malcolm


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to [hidden email]
To unsubscribe from this group, send email to [hidden email]
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply | Threaded
Open this post in threaded view
|

Re: ~Q broken?

Göran Åström

Thank you for answering and for the workaround tip.
I agree, exclude should be the complement of filter. This did not work
in 0.96 but works now, which is great.
I don't think however that it is a good idea to let filter(~Q()) mean
'not exists', i.e. be equivalent to 'exclude', for multi valued
relations, while it means 'not' in other cases.
It would be nice if ~Q also for these relations could mean 'not',
which is sometimes needed (as is 'and', 'or' and parenthesis) when
creating expressions.
In my opinion if Entry.objects.filter(~Q(condition)) gives a set of
entries then Blog.objects.filter(~Q(entry__condition)) should give the
blogs of these entries, while
Blog.objects.exclude(~Q(entry__condition)) should give the
complementary blogs, just as the case is without the ~ characters.

Regards,
Göran

On Jul 17, 6:49 pm, Malcolm Tredinnick <[hidden email]>
wrote:

> On Thu, 2008-07-17 at 02:39 -0700, Göran Åström wrote:
> > In my view the 'not' can be interpreted differently.
> > Look at the following requests:
>
> > 1. Give me blogs with entries matching a criteria.
> > 2. Give me blogs with entries not matching a criteria.
> > 3. Give me blogs without entries matching a criteria.
>
> Whilst alternate interpretations are always possible, interpreting
> exclude as the complement of include ("filter") is the most natural. So
> excluding based on a particular entry means to take the whole set of
> blogs and remove those containing an entry that matches the condition
> (since the filter() version would be those blogs containing an entry
> that match the condition). Thus exclude() and filter() are opposites.
>
> Since (a) writing your own Q-like object is possible and not
> particularly hard and (b) nested queries are more or less supported (the
> API doesn't exist, but all the under-the-cover implementation is there
> and it's possible to use it if you read the code), your second option
> has at least two ways to be accessed. So keeping the most natural
> version of exclude() that we have now still seems like the best option
> to me. That way, nothing is impossible and the obvious usage is the
> right one.
>
> Regards,
> Malcolm
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to [hidden email]
To unsubscribe from this group, send email to [hidden email]
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply | Threaded
Open this post in threaded view
|

Re: ~Q broken?

Malcolm Tredinnick


On Fri, 2008-07-18 at 04:11 -0700, Göran Åström wrote:
> Thank you for answering and for the workaround tip.
> I agree, exclude should be the complement of filter. This did not work
> in 0.96 but works now, which is great.
> I don't think however that it is a good idea to let filter(~Q()) mean
> 'not exists', i.e. be equivalent to 'exclude', for multi valued
> relations, while it means 'not' in other cases.
> It would be nice if ~Q also for these relations could mean 'not',
> which is sometimes needed (as is 'and', 'or' and parenthesis) when
> creating expressions.

Lots of things would be nice, but if we made that change, ~Q would not
be the opposite of Q() -- which is a filter() equivalent -- and that
would be a little inconsistent in other respects.

> In my opinion if Entry.objects.filter(~Q(condition)) gives a set of
> entries then Blog.objects.filter(~Q(entry__condition)) should give the
> blogs of these entries, while
> Blog.objects.exclude(~Q(entry__condition)) should give the
> complementary blogs, just as the case is without the ~ characters.

As I've agreed, different interpretations are always going to be
possible. We've chosen a very consistent set in the way that exclude()
and filter() and ~Q are implemented and whilst they aren't the only
possible ones, as indicated by your alternative proposal above, I don't
think your alternative is a better approach than what we have now.  I
don't agree with the logic about Entry.filteR() mapping to Blog.filter()
for multi-valued return results. When you extrapolate that to other
multi-valued situations, including filtering on many-to-many, things get
a little unintuitive.

So let's stick with what we've got now and move on. Hopefully you can
appreciate that what Django is doing is still logical, even if you don't
agree with the choice we've made, and you can see that there are
multiple possibilities, so we had to choose one.

Regards,
Malcolm

>
> Regards,
> Göran
>
> On Jul 17, 6:49 pm, Malcolm Tredinnick <[hidden email]>
> wrote:
> > On Thu, 2008-07-17 at 02:39 -0700, Göran Åström wrote:
> > > In my view the 'not' can be interpreted differently.
> > > Look at the following requests:
> >
> > > 1. Give me blogs with entries matching a criteria.
> > > 2. Give me blogs with entries not matching a criteria.
> > > 3. Give me blogs without entries matching a criteria.
> >
> > Whilst alternate interpretations are always possible, interpreting
> > exclude as the complement of include ("filter") is the most natural. So
> > excluding based on a particular entry means to take the whole set of
> > blogs and remove those containing an entry that matches the condition
> > (since the filter() version would be those blogs containing an entry
> > that match the condition). Thus exclude() and filter() are opposites.
> >
> > Since (a) writing your own Q-like object is possible and not
> > particularly hard and (b) nested queries are more or less supported (the
> > API doesn't exist, but all the under-the-cover implementation is there
> > and it's possible to use it if you read the code), your second option
> > has at least two ways to be accessed. So keeping the most natural
> > version of exclude() that we have now still seems like the best option
> > to me. That way, nothing is impossible and the obvious usage is the
> > right one.
> >
> > Regards,
> > Malcolm
> >
>


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to [hidden email]
To unsubscribe from this group, send email to [hidden email]
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply | Threaded
Open this post in threaded view
|

Re: ~Q broken?

Göran Åström

Fair enough!
I'll do a workaround.

Thanks,
Göran

On Jul 18, 4:54 pm, Malcolm Tredinnick <[hidden email]>
wrote:

> On Fri, 2008-07-18 at 04:11 -0700, Göran Åström wrote:
> > Thank you for answering and for the workaround tip.
> > I agree, exclude should be the complement of filter. This did not work
> > in 0.96 but works now, which is great.
> > I don't think however that it is a good idea to let filter(~Q()) mean
> > 'not exists', i.e. be equivalent to 'exclude', for multi valued
> > relations, while it means 'not' in other cases.
> > It would be nice if ~Q also for these relations could mean 'not',
> > which is sometimes needed (as is 'and', 'or' and parenthesis) when
> > creating expressions.
>
> Lots of things would be nice, but if we made that change, ~Q would not
> be the opposite of Q() -- which is a filter() equivalent -- and that
> would be a little inconsistent in other respects.
>
> > In my opinion if Entry.objects.filter(~Q(condition)) gives a set of
> > entries then Blog.objects.filter(~Q(entry__condition)) should give the
> > blogs of these entries, while
> > Blog.objects.exclude(~Q(entry__condition)) should give the
> > complementary blogs, just as the case is without the ~ characters.
>
> As I've agreed, different interpretations are always going to be
> possible. We've chosen a very consistent set in the way that exclude()
> and filter() and ~Q are implemented and whilst they aren't the only
> possible ones, as indicated by your alternative proposal above, I don't
> think your alternative is a better approach than what we have now.  I
> don't agree with the logic about Entry.filteR() mapping to Blog.filter()
> for multi-valued return results. When you extrapolate that to other
> multi-valued situations, including filtering on many-to-many, things get
> a little unintuitive.
>
> So let's stick with what we've got now and move on. Hopefully you can
> appreciate that what Django is doing is still logical, even if you don't
> agree with the choice we've made, and you can see that there are
> multiple possibilities, so we had to choose one.
>
> Regards,
> Malcolm
>
>
>
>
>
> > Regards,
> > Göran
>
> > On Jul 17, 6:49 pm, Malcolm Tredinnick <[hidden email]>
> > wrote:
> > > On Thu, 2008-07-17 at 02:39 -0700, Göran Åström wrote:
> > > > In my view the 'not' can be interpreted differently.
> > > > Look at the following requests:
>
> > > > 1. Give me blogs with entries matching a criteria.
> > > > 2. Give me blogs with entries not matching a criteria.
> > > > 3. Give me blogs without entries matching a criteria.
>
> > > Whilst alternate interpretations are always possible, interpreting
> > > exclude as the complement of include ("filter") is the most natural. So
> > > excluding based on a particular entry means to take the whole set of
> > > blogs and remove those containing an entry that matches the condition
> > > (since the filter() version would be those blogs containing an entry
> > > that match the condition). Thus exclude() and filter() are opposites.
>
> > > Since (a) writing your own Q-like object is possible and not
> > > particularly hard and (b) nested queries are more or less supported (the
> > > API doesn't exist, but all the under-the-cover implementation is there
> > > and it's possible to use it if you read the code), your second option
> > > has at least two ways to be accessed. So keeping the most natural
> > > version of exclude() that we have now still seems like the best option
> > > to me. That way, nothing is impossible and the obvious usage is the
> > > right one.
>
> > > Regards,
> > > Malcolm- Hide quoted text -
>
> - Show quoted text -
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to [hidden email]
To unsubscribe from this group, send email to [hidden email]
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---