QuerySet.iterator together with prefetch_related because of chunk_size


QuerySet.iterator together with prefetch_related because of chunk_size

tobias.kroenke
Hi everyone!

The docs (https://docs.djangoproject.com/en/2.1/ref/models/querysets/#iterator) state that the

 use of iterator() causes previous prefetch_related() calls to be ignored since these two optimizations do not make sense together.

I am wondering if this is still true with the introduction of the chunk_size parameter. Having only one additional query per chunk could heavily reduce the number of queries per batch.

Best regards
Tobi



Truffls GmbH

Managing directors (authorized representatives): Clemens Dittrich, Matthes Dohmeyer. Amtsgericht Charlottenburg, HRB 160036B

This message (including any attachments) is confidential and may be privileged. It may be read, copied and used only by the intended recipient. If you have received it in error please contact the sender (by return e-mail) immediately and delete this message. Any unauthorized use or dissemination of this message in whole or in part is strictly prohibited.

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/985d983f-09e1-401c-beb8-cd9b6b60e484%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: QuerySet.iterator together with prefetch_related because of chunk_size

charettes
Hello Tobias,

From my understanding, the introduction of chunk_size doesn't help here.

The fundamental reason why iterator() cannot be used with prefetch_related() is that the latter requires a set of model instances to be materialized to work appropriately, which chunk_size doesn't control at all.

In other words, chunk_size only controls how many rows are fetched from the database cursor and kept in memory at a time. Even when this parameter is used, iterator() will only materialize a single model instance per yield.

Now, something you could do if you really want to use prefetch_related() with iterator() is to materialize chunks yourself and use the now-public prefetch_related_objects[0] helper to prefetch relationships.

A somewhat complete implementation would look like:

import itertools

from django.db.models import prefetch_related_objects

def prefetch_related_iterator(queryset, *related_lookups, chunk_size=100):
    iterator = queryset.iterator(chunk_size=chunk_size)
    while True:
        # Materialize up to chunk_size instances at a time...
        chunk = list(itertools.islice(iterator, chunk_size))
        if not chunk:
            return
        # ...then issue the prefetch queries for just this chunk.
        prefetch_related_objects(chunk, *related_lookups)
        yield from chunk
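
Since the helper above needs a live Django queryset, the chunking core can be checked in isolation with plain, runnable Python; `chunked` below is a hypothetical stand-in in which the prefetch call is replaced by a comment:

```python
import itertools

def chunked(iterator, chunk_size):
    # Same islice-based chunking loop as prefetch_related_iterator,
    # applied to an arbitrary iterator instead of a queryset.
    while True:
        chunk = list(itertools.islice(iterator, chunk_size))
        if not chunk:
            return
        # In the real helper, prefetch_related_objects(chunk, *lookups)
        # would run here: one extra query per lookup per chunk.
        yield chunk

chunks = list(chunked(iter(range(7)), chunk_size=3))
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Note the final chunk is allowed to be short; the loop stops on the first empty slice rather than on StopIteration.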

Given that iterator() has always ignored prefetch related lookups instead of erroring out when they were specified, it makes me feel like turning such a feature on by default could be problematic, as it could balloon the memory usage that iterator() is meant to keep low in the first place.

Cheers,
Simon

[0] https://docs.djangoproject.com/en/2.1/ref/models/querysets/#django.db.models.prefetch_related_objects


Re: QuerySet.iterator together with prefetch_related because of chunk_size

Curtis Maloney-2
On 10/12/18 10:51 AM, charettes wrote:


I inferred from the original post that they were suggesting to do a set of
prefetch_related "filling" queries _per_ _chunk_.

It wouldn't be as efficient as the 1+P (where P is the number of prefetched
relations) of a normal prefetch_related, but at 1+(P x Chunks) it would
_probably_ be more efficient than the 1+(P x N-rows) it would otherwise
be [or worse!].
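
To make those counts concrete with purely hypothetical numbers (N, P, and chunk_size below are illustrative, not from any real workload):

```python
import math

N = 10_000        # rows in the primary result set
P = 2             # number of prefetched relations
chunk_size = 100
C = math.ceil(N / chunk_size)  # 100 chunks

queries_full_prefetch = 1 + P      # normal prefetch_related
queries_per_chunk     = 1 + P * C  # prefetching per chunk
queries_no_prefetch   = 1 + P * N  # N+1 pattern, no prefetch at all

print(queries_full_prefetch, queries_per_chunk, queries_no_prefetch)
# 3 201 20001
```

So per-chunk prefetching sits between the two extremes: two orders of magnitude fewer queries than the N+1 pattern here, at the cost of only chunk_size instances held in memory at once.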

--
Curtis


Re: QuerySet.iterator together with prefetch_related because of chunk_size

tobias.kroenke
In reply to this post by charettes
Thank you for your feedback. I would like to answer some statements, to either convince you or to make it clearer where my idea stems from:

> The fundamental reason why iterator() cannot be used with prefetch_related() is that the latter requires a set of model instance to be materialized to work appropriately which chunk_size doesn't control at all.
> In other words chunk_size only controls how many rows should be fetched from the database cursor and kept into memory at a time. Even when this parameter is used, iterator() will only materialize a single model instance per yield.

It should be easily possible to change the involved code of ModelIterable to materialize the retrieved rows in batches. After materializing the batch / chunk, it could do the prefetching.
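
A minimal, runnable sketch of that shape (this is plain Python, not Django's actual ModelIterable code; iterate_in_batches and on_batch are hypothetical names standing in for the iterable class and the prefetching step):

```python
from itertools import islice

def iterate_in_batches(instances, batch_size, on_batch):
    # Materialize batch_size instances at a time, run the prefetch
    # step once per batch, then yield instances one by one as before.
    iterator = iter(instances)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        on_batch(batch)  # stand-in for prefetch_related_objects(...)
        yield from batch

batches_seen = []
result = list(iterate_in_batches(range(5), 2, batches_seen.append))
print(result)        # [0, 1, 2, 3, 4]
print(batches_seen)  # [[0, 1], [2, 3], [4]]
```

The point being: callers still see a flat stream of instances, while the prefetch hook runs exactly once per batch rather than once per instance.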
 
> Given that iterator() always ignored prefetch related lookups instead of erroring out when they were specified make me feel like turning such a feature on by default could be problematic as it could balloon the memory usage which is the main reason why iterator is useful anyway.

I would argue that users who thoughtlessly applied prefetching together with iterator now actually get what they thought of: fewer DB query round trips, traded against a little more memory usage.

Best,
Tobi





Re: QuerySet.iterator together with prefetch_related because of chunk_size

Josh Smeaton
I tend to agree with Tobi. Prefetching silently not working on iterator() can be quite confusing unless you have a good understanding of both APIs. It might be possible to do what you're asking, but it'd mean that Django is now actually caching the result when it explicitly says it isn't - even if the result is a much smaller moving cache. Prefetching chunk_size results per chunk is unlikely to make a material difference to memory usage. Users are usually concerned about the entire result set of the primary table.

I don't know if you can change the API to make these suggested changes without also impacting how we iterate over result sets - but I'd be interested in seeing a proof of concept at the very least.




Re: QuerySet.iterator together with prefetch_related because of chunk_size

charettes
Josh, I agree that silently not working is problematic, but it has been this way since prefetch_related() was introduced.

Something to keep in mind as well is that silently turning it on would also perform P * C extra queries, where P is the number of prefetches requested through prefetch_related() and C the number of chunks the result set contains. This is non-negligible IMO.

What I'd be in favor of is raising an exception on prefetch_related(*prefetches).iterator() in the next release, at least to allow users relying on this pattern to adjust their code, and then shipping the optimization, with all the warnings about the interactions between prefetch_related(*prefetches) and iterator(chunk_size), in the release after that.

I'm not completely against skipping the exception release phase entirely, given there might be users out there accessing attributes expected to be prefetched on iterated instances and inadvertently performing tons of queries, but the exception phase just feels safer given iterator() is usually used in memory-constrained contexts.

Simon


Re: QuerySet.iterator together with prefetch_related because of chunk_size

Tim Graham-2
https://code.djangoproject.com/ticket/29984 suggests supporting prefetch_related() with QuerySet.iterator(). I accepted the ticket to do something and linked back to this discussion.


Re: QuerySet.iterator together with prefetch_related because of chunk_size

charettes
In reply to this post by charettes
Replying to my concerns about the P * C queries.

Throwing that idea out there, but what if we required chunk_size to be explicitly specified when the queryset has prefetch lookups?

That would at least force developers to consider how many prefetching queries iterating through the results would require. Plus, since the argument was only recently introduced, it is less likely to silently cause changes under developers' feet.
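
A sketch of what that requirement could look like; everything here is hypothetical (Django's real check would live inside QuerySet.iterator(), and check_iterator_args is an invented name):

```python
class ProgrammingError(Exception):
    """Stand-in for whatever exception class Django would raise."""

def check_iterator_args(prefetch_lookups, chunk_size=None):
    # Hypothetical guard: when the queryset has prefetch lookups,
    # refuse to iterate unless chunk_size was passed explicitly.
    if prefetch_lookups and chunk_size is None:
        raise ProgrammingError(
            "chunk_size must be provided explicitly when iterator() "
            "is used after prefetch_related()."
        )

check_iterator_args([])                          # no prefetches: fine
check_iterator_args(["author"], chunk_size=500)  # explicit size: fine
try:
    check_iterator_args(["author"])              # prefetches, no size
except ProgrammingError as exc:
    print(exc)
```

Existing code without prefetch_related() keeps working unchanged; only the previously silent combination is forced to state its chunk size.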

Simon


Re: QuerySet.iterator together with prefetch_related because of chunk_size

Adam Johnson-2
> ...what if we required chunk_size to be explicitly specified when the queryset has prefetch lookups?

This seems reasonable, but what would you do in the case where chunk_size isn't explicitly defined - throw an exception? Currently it silently fails to prefetch, which means N + 1 queries, so even prefetching with the default chunk_size of 2000 would be a huge gain in cases where chunk_size isn't defined.



--
Adam


Re: QuerySet.iterator together with prefetch_related because of chunk_size

charettes
> This seems reasonable, but what would you do in the case where chunk_size isn't explicitly defined - throw an exception?

That would be the plan.

> Currently it silently fails to prefetch which means N + 1 queries, so even prefetching for the default chunk_size of 2000 would be a huge gain in cases where chunk_size isn't defined.

That's assuming the related objects are actually accessed during iteration. If that's not currently the case, turning this on would cause P * C extra queries.

To summarize, here are the options we have:

1. Always turn on the prefetching feature.
2. Require chunk_size to be explicitly specified for a deprecation period in order to enable the feature.

Pros/cons of 1: developers who didn't follow the documentation and/or didn't verify that prefetching wasn't happening, and left some lingering prefetch_related(), will probably see their code perform fewer queries, at the cost of a slight increase in memory (assuming they discard objects as they iterate). This is also likely to break code on backends with query parameter limits, e.g. SQLite, since the default chunk_size=2000 exceeds its 1000-parameter limit [0].
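For reference, the SQLite concern is about bound parameters: a prefetch query filters with IN (...), using one parameter per instance in the chunk. A rough calculation (the 1000-variable bound comes from the ticket; splitting a lookup into sub-batches is one possible workaround, not a description of current Django behavior):

```python
import math

def prefetch_query_count(chunk_size, num_lookups, max_params=1000):
    # Each prefetch lookup needs one bound parameter per instance in
    # the chunk. If chunk_size exceeds the backend's parameter limit,
    # each lookup must be split into several sub-batch queries.
    sub_batches = math.ceil(chunk_size / max_params)
    return num_lookups * sub_batches
```

With the default chunk_size of 2000 and one lookup, prefetch_query_count(2000, 1) gives 2 queries per chunk on a 1000-parameter backend, i.e. the default settings alone already run into the limit.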

Pros/cons of 2: most prefetch_related().iterator() usage won't break for developers who followed the documentation, and it will fail for the others, requiring them to acknowledge what their code is about to start doing.

As I've expressed in my previous messages, I'm slightly in favor of 2, even if it is likely to cause more breakage, because this is the exceptional situation where the API should have raised an exception from the beginning.

Cheers,
Simon

[0] https://code.djangoproject.com/ticket/27833

On Monday, 14 January 2019 09:53:45 UTC-5, Adam Johnson wrote:
...what if we required chunk_size to be explicitly specified when the queryset has prefetch lookups?

This seems reasonable, but what would you do in the case where chunk_size isn't explicitly defined - throw an exception? Currently it silently fails to prefetch, which means N + 1 queries, so even prefetching for the default chunk_size of 2000 would be a huge gain in cases where chunk_size isn't defined.

On Sun, 13 Jan 2019 at 02:05, charettes <chare...@...> wrote:
Replying to my concerns about the P * C queries.

Throwing that idea out there, but what if we required chunk_size to be explicitly specified when the queryset has prefetch lookups?

That would at least force developers to consider how many prefetching queries iterating through the results would require. Plus, since the argument was only recently introduced, it is less likely to silently cause changes under developers' feet.

Simon

On Friday, 26 October 2018 20:12:02 UTC-4, charettes wrote:
Josh, I agree that silently not working is problematic, but it has been this way since prefetch_related() was introduced.

Something to keep in mind as well is that silently turning it on would also perform P * C extra queries, where P is the number of prefetches requested through prefetch_related() and C the number of chunks the result set contains. This is non-negligible IMO.
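Concretely, assuming one streamed query for the primary result set plus P prefetch queries per chunk, the totals work out as follows (a sketch of the arithmetic, not Django code):

```python
import math

def total_queries(num_rows, chunk_size, num_prefetch_lookups):
    # One query for the primary result set (streamed through the
    # database cursor), plus P prefetch queries for each of the
    # C = ceil(num_rows / chunk_size) chunks.
    num_chunks = math.ceil(num_rows / chunk_size)
    return 1 + num_prefetch_lookups * num_chunks
```

For example, 10,000 rows with chunk_size=2000 and two prefetch lookups gives 1 + 2 * 5 = 11 queries, versus 20,001 if both relations were instead loaded lazily per instance (the N + 1 pattern).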

What I'd be in favor of is raising an exception on prefetch_related(*prefetches).iterator() in the next release, at least to allow users relying on this pattern to adjust their code, and then shipping the optimization, with all the warnings about the interactions between prefetch_related(*prefetches) and iterator(chunk_size), in the one after.

I'm not completely against skipping the exception release phase entirely, given there might be users out there accessing attributes they expect to be prefetched on iterated instances and inadvertently performing tons of queries, but the exception phase just feels safer since iterator() is usually used in memory-constrained contexts.

Simon




Re: QuerySet.iterator together with prefetch_related because of chunk_size

Adam Johnson-2
I agree with your reasoning, and also favour option 2 now, especially since the default can break on SQLite.

It would also be possible to go through a deprecation period that first raises a warning and doesn't prefetch, before a later version raises an exception. That is probably kinder, since it was previously documented that this was simply a no-op.

On Mon, 14 Jan 2019 at 19:03, charettes <[hidden email]> wrote:
--
Adam


Re: QuerySet.iterator together with prefetch_related because of chunk_size

taylor
I agree that 2 makes sense and isn't a huge usability issue. It looks like the PR is ready to go whenever this is decided: https://github.com/django/django/pull/10707/files

On Monday, January 14, 2019 at 5:29:18 PM UTC-5, Adam Johnson wrote:

Re: QuerySet.iterator together with prefetch_related because of chunk_size

taylor
Checking in here. What are the next steps to make a decision?

On Wednesday, February 13, 2019 at 2:06:29 PM UTC-5, Taylor wrote:
I agree that 2 makes sense and isn't a huge usability issue. It looks like the PR is ready to go whenever this is decided <a href="https://github.com/django/django/pull/10707/files" target="_blank" rel="nofollow" onmousedown="this.href=&#39;https://www.google.com/url?q\x3dhttps%3A%2F%2Fgithub.com%2Fdjango%2Fdjango%2Fpull%2F10707%2Ffiles\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNE7i5YBbfBLBuUnj6QMPV6SMByAkw&#39;;return true;" onclick="this.href=&#39;https://www.google.com/url?q\x3dhttps%3A%2F%2Fgithub.com%2Fdjango%2Fdjango%2Fpull%2F10707%2Ffiles\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNE7i5YBbfBLBuUnj6QMPV6SMByAkw&#39;;return true;">https://github.com/django/django/pull/10707/files

On Monday, January 14, 2019 at 5:29:18 PM UTC-5, Adam Johnson wrote:
I agree with your reasoning, and also favour option 2 now, especially since the default can break on sqlite.

It would also be possible to go through a deprecation period to first raise a warning and not prefetch, before a later version raises an exception, which is probably kinder since previously it was documented that it just was a no-op.

On Mon, 14 Jan 2019 at 19:03, charettes <[hidden email]> wrote:
> This seems reasonable, but would you do in the case chunk_size isn't explicitly defined - throw an exception?

That's would be the plan.

> Currently it silently fails to prefetch which means N + 1 queries, so even prefetching for the default chunk_size of 2000 would be a huge gain in cases where chunk_size isn't defined.

That's assuming related relationships are actually accessed during iteration over the objects. If it's not currently the case turning that on would cause P + C extra queries.

To summarize here's the options we have:

1. Always turn on the prefetching feature.
2. Require chunk_size to be explicitly specified for a deprecation period in order to enable the feature.

Pros/Cons of 1. is that developers who actually didn't follow the documentation and/or didn't assert that prefetching didn't work and left some lingering prefetch_related() will probably see their code perform less queries but result in a slight increase in memory (given they discard objects on iterating). This is also likely to break code because of some backends (e.g. SQLite) query parameters limits (chunk_size=2000 > 1000)[0]

Pros/cons of 2: most prefetch_related().iterator() usage won't break for developers who followed the documentation, and it will fail for the others, requiring them to acknowledge what their code is about to start doing.

As I've expressed in my previous messages, I'm slightly in favor of 2, even if it is likely to cause more breakage, because of the exceptional situation where this API should have raised an exception from the beginning.
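To illustrate option 2, here is a minimal toy sketch. This is not Django's actual QuerySet implementation; the class and attribute names are invented for illustration only.

```python
class QuerySetSketch:
    """Toy stand-in for a queryset, modeling option 2: iterator() raises
    when prefetch lookups exist but no explicit chunk_size is given.
    Not Django code; names here are illustrative only."""

    def __init__(self, rows, prefetch_lookups=()):
        self.rows = list(rows)
        self.prefetch_lookups = tuple(prefetch_lookups)

    def iterator(self, chunk_size=None):
        if self.prefetch_lookups and chunk_size is None:
            # Option 2: force the caller to acknowledge the per-chunk
            # prefetch queries by choosing a chunk size explicitly.
            raise ValueError(
                "chunk_size must be provided when iterator() is used "
                "after prefetch_related()"
            )
        return iter(self.rows)
```

Under this sketch, a queryset with a prefetch lookup raises on a bare `iterator()` call, while `iterator(chunk_size=500)` or a lookup-free queryset proceeds as before.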

Cheers,
Simon

[0] https://code.djangoproject.com/ticket/27833

On Monday, January 14, 2019 at 09:53:45 UTC-5, Adam Johnson wrote:
...what if we required chunk_size to be explicitly specified when the queryset has prefetch lookups?

This seems reasonable, but what would you do in the case where chunk_size isn't explicitly defined - throw an exception? Currently it silently fails to prefetch, which means N + 1 queries, so even prefetching for the default chunk_size of 2000 would be a huge gain in cases where chunk_size isn't defined.

On Sun, 13 Jan 2019 at 02:05, charettes <[hidden email]> wrote:
Replying to my concerns about the P * C queries.

Throwing that idea out there but what if we required chunk_size to be explicitly specified when the queryset has prefetch lookups?

That would at least force developers to consider how many prefetching queries iterating through the results would require. Plus, since the argument was only recently introduced, it is less likely to silently cause changes under developers' feet.

Simon

On Friday, October 26, 2018 at 20:12:02 UTC-4, charettes wrote:
Josh, I agree that silently not working is problematic but it has been this way since prefetch_related() was introduced.

Something to keep in mind as well is that silently turning it on would also perform P * C extra queries, where P is the number of prefetches requested through prefetch_related() and C the number of chunks the results contain. This is non-negligible IMO.
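To put rough numbers on this trade-off (a back-of-the-envelope sketch, not Django code; the function and the figures are illustrative), compare the total query counts of the strategies discussed in this thread:

```python
import math

def query_counts(n_rows, n_prefetches, chunk_size):
    """Approximate total queries per strategy, with P = n_prefetches and
    C = number of chunks = ceil(n_rows / chunk_size)."""
    chunks = math.ceil(n_rows / chunk_size)
    return {
        # plain prefetch_related(): one main query plus one per lookup
        "materialized_prefetch": 1 + n_prefetches,
        # iterator() with per-chunk prefetching: main query plus P * C extras
        "chunked_prefetch": 1 + n_prefetches * chunks,
        # iterator() today, touching relations per row: the N + 1 pattern
        "per_row_access": 1 + n_rows * n_prefetches,
    }

counts = query_counts(n_rows=100_000, n_prefetches=2, chunk_size=2_000)
# chunked prefetching: 1 + 2 * 50 = 101 queries, versus 200_001 per-row
```

So the P * C extra queries are real, but still orders of magnitude fewer than the N + 1 pattern that silently skipping the prefetch produces today.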

What I'd be in favor of is raising an exception on prefetch_related(*prefetches).iterator() in the next release, at least to allow users relying on this pattern to adjust their code, and then shipping the optimization, with all the warnings about the interactions between prefetch_related(*prefetches) and iterator(chunk_size), in the release after that.

I'm not completely against skipping the exception release phase entirely, given there might be users out there accessing attributes they expected to be prefetched on iterated instances and inadvertently performing tons of queries, but the exception phase just feels safer given that iterator() is usually used in memory-constrained contexts.

Simon

On Friday, October 26, 2018 at 19:27:55 UTC-4, Josh Smeaton wrote:
I tend to agree with Tobi. Prefetching silently not working with iterator() can be quite confusing unless you have a good understanding of both APIs. It might be possible to do what you're asking, but it'd mean that Django is now actually caching results when it explicitly says it isn't, even if the cache is a much smaller moving one. Prefetching chunk_size results per chunk is unlikely to make a material difference to memory usage; users are usually concerned about the entire result set of the primary table.

I don't know if you can change the API to make these suggested changes without also impacting how we iterate over result sets, but I'd be interested in seeing a proof of concept at the very least.



On Monday, 15 October 2018 20:41:13 UTC+11, [hidden email] wrote:
Thank you for your feedback. I would like to answer some statements, to either convince you or make it clearer where my idea stems from:

The fundamental reason why iterator() cannot be used with prefetch_related() is that the latter requires a set of model instances to be materialized to work appropriately, which chunk_size doesn't control at all.
In other words, chunk_size only controls how many rows should be fetched from the database cursor and kept in memory at a time. Even when this parameter is used, iterator() will only materialize a single model instance per yield.
 
It should be easily possible to change the code involved in ModelIterable to materialize the retrieved rows in batches. After materializing a batch/chunk, it could do the prefetching.
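That batching idea can be sketched in plain Python. The names here are illustrative, and `prefetch` stands in for Django's `prefetch_related_objects()` step; this is not ModelIterable's real code.

```python
from itertools import islice

def chunked_prefetch_iterator(rows, chunk_size, prefetch):
    """Materialize chunk_size rows at a time, run the prefetch step once
    per chunk, then yield the rows one by one."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        prefetch(chunk)  # one batched query per prefetch lookup, per chunk
        yield from chunk
```

Iterating 10 rows with chunk_size=4 would call the prefetch step on batches of 4, 4 and 2 rows, so only one chunk of instances needs to live in memory at a time.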
 
The fact that iterator() has always ignored prefetch-related lookups, instead of erroring out when they were specified, makes me feel like turning such a feature on by default could be problematic: it could balloon memory usage, and keeping memory usage low is the main reason iterator() is useful in the first place.

I would argue that users who thoughtlessly applied prefetching together with iterator() now actually get what they intended: fewer DB query round trips traded against a little more memory usage.

Best,
Tobi

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/1f5d8b5a-8c7c-4575-b904-adfaa9d1900d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Adam





Re: QuerySet.iterator together with prefetch_related because of chunk_size

charettes
Taylor, I think that 2. is the way forward.

It looks like the linked PR doesn't require a chunk_size to be specified to turn on the feature, though. I like Adam's idea to first warn and not prefetch when chunk_size is unspecified, and to eventually turn that into an exception. What do you think?

Simon

On Wednesday, May 20, 2020 at 19:25:21 UTC-4, Taylor wrote:
Checking in here. What are the next steps to make a decision?


Re: QuerySet.iterator together with prefetch_related because of chunk_size

taylor
That makes sense to me. It is a little weird, but it’s the best option.

I will post a comment on the PR. I am happy to update the code if he’s not available.


Re: QuerySet.iterator together with prefetch_related because of chunk_size

Adam Johnson-2
After over a year, it's probably reasonable to create a new PR. You can use Co-Authored-By to credit the original author: https://github.blog/2018-01-29-commit-together-with-co-authors/

On Fri, 22 May 2020 at 05:43, 'Taylor Hakes' via Django developers (Contributions to Django itself) <[hidden email]> wrote:
That makes sense to me. It is a little weird, but it’s the best option.

I will post a comment on the PR. I am happy to update the code if he’s not available.

On Wed, May 20, 2020 at 9:59 PM charettes <[hidden email]> wrote:
Taylor, I think that 2. is the way forward.

It looks like the linked PR doesn't ensure a chunk_size is specified to turn on the feature though. I like Adam's idea to first warn and don't do a prefetch if chunk_size is not specified and eventually turn that into an exception. What do you think of it?

Simon

Le mercredi 20 mai 2020 19:25:21 UTC-4, Taylor a écrit :
Checking in here. What are the next steps to make a decision?

On Wednesday, February 13, 2019 at 2:06:29 PM UTC-5, Taylor wrote:
I agree that 2 makes sense and isn't a huge usability issue. It looks like the PR is ready to go whenever this is decided https://github.com/django/django/pull/10707/files

On Monday, January 14, 2019 at 5:29:18 PM UTC-5, Adam Johnson wrote:
I agree with your reasoning, and also favour option 2 now, especially since the default can break on sqlite.

It would also be possible to go through a deprecation period to first raise a warning and not prefetch, before a later version raises an exception, which is probably kinder since previously it was documented that it just was a no-op.

On Mon, 14 Jan 2019 at 19:03, charettes <[hidden email]> wrote:
> This seems reasonable, but what would you do in the case where chunk_size isn't explicitly defined - throw an exception?

That would be the plan.

> Currently it silently fails to prefetch, which means N + 1 queries, so even prefetching with the default chunk_size of 2000 would be a huge gain in cases where chunk_size isn't defined.

That's assuming the prefetched relations are actually accessed during iteration over the objects. If that's not currently the case, turning this on would cause P * C extra queries.

To summarize here's the options we have:

1. Always turn on the prefetching feature.
2. Require chunk_size to be explicitly specified for a deprecation period in order to enable the feature.

Pros/cons of 1. are that developers who didn't actually follow the documentation and/or didn't notice that prefetching didn't work, and left some lingering prefetch_related(), will probably see their code perform fewer queries but with a slight increase in memory (given they discard objects while iterating). This is also likely to break code because of some backends' (e.g. SQLite) query parameter limits (chunk_size=2000 > 1000)[0]

Pros/cons of 2. are that most prefetch_related().iterator() usages won't break for developers who followed the documentation, while failing for the others, requiring them to acknowledge what their code is going to start doing.

As I've expressed in my previous messages, I'm slightly in favor of 2, even if it is likely to cause more breakage, because of the exceptional situation where this API should have raised an exception from the beginning.

Cheers,
Simon


Le lundi 14 janvier 2019 09:53:45 UTC-5, Adam Johnson a écrit :
...what if we required chunk_size to be explicitly specified when the queryset has prefetch lookups?

This seems reasonable, but what would you do in the case where chunk_size isn't explicitly defined - throw an exception? Currently it silently fails to prefetch, which means N + 1 queries, so even prefetching with the default chunk_size of 2000 would be a huge gain in cases where chunk_size isn't defined.

On Sun, 13 Jan 2019 at 02:05, charettes <[hidden email]> wrote:
Replying to my concerns about the P * C queries.

Throwing that idea out there but what if we required chunk_size to be explicitly specified when the queryset has prefetch lookups?

That would at least force developers to consider how many prefetching queries iterating through the results would require. Plus, since the argument was only recently introduced, it is less likely to silently change behavior under developers' feet.

Simon

Le vendredi 26 octobre 2018 20:12:02 UTC-4, charettes a écrit :
Josh, I agree that silently not working is problematic but it has been this way since prefetch_related() was introduced.

Something to keep in mind as well is that silently turning it on would also perform P * C extra queries, where P is the number of prefetches requested through prefetch_related() and C the number of chunks the results contain. This is non-negligible IMO.
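As a back-of-envelope comparison of the trade-off being discussed here (the helper below is purely illustrative):

```python
import math


def query_counts(n_rows, chunk_size, n_prefetches):
    """Compare total query counts for iterating n_rows with n_prefetches
    prefetch lookups.

    iterator() issues a single query for the main result set (chunk_size
    only controls how many rows are fetched from the cursor at a time).

    - Lazy per-object access of P relations costs 1 + N * P queries.
    - Per-chunk prefetching costs 1 + P * C queries, C being the number
      of chunks.
    """
    chunks = math.ceil(n_rows / chunk_size)
    lazy = 1 + n_rows * n_prefetches
    chunked_prefetch = 1 + n_prefetches * chunks
    return lazy, chunked_prefetch
```

For example, with 2000 rows, chunk_size=100 and two prefetch lookups, lazy per-object access costs 4001 queries while per-chunk prefetching costs 41; conversely, if the relations are never accessed, those 40 prefetch queries are pure overhead.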

What I'd be in favor of is raising an exception on prefetch_related(*prefetches).iterator() in the next release, at least to allow users relying on this pattern to adjust their code, and then ship the optimization, with all the warnings related to the interactions between prefetch_related(*prefetches) and iterator(chunk_size), in the next one.

I'm not completely against skipping the exception release phase entirely, given there might be users out there accessing attributes expected to be prefetched on iterated instances and inadvertently performing tons of queries, but the exception phase just feels safer given iterator() is usually used in memory-constrained contexts.

Simon

Le vendredi 26 octobre 2018 19:27:55 UTC-4, Josh Smeaton a écrit :
I tend to agree with Tobi. Prefetching silently not working on iterator() can be quite confusing unless you have a good understanding of both APIs. It might be possible to do what you're asking, but it'd mean that Django is now actually caching the result when it explicitly says it isn't, even if the result is a much smaller moving cache. Prefetching chunk_size results per chunk is unlikely to make a material difference to memory usage: users are usually concerned about the entire result set of the primary table.

I don't know if you can change the API to make these suggested changes without also impacting how we iterate over result sets - but I'd be interested in seeing a proof of concept at the very least.



On Monday, 15 October 2018 20:41:13 UTC+11, [hidden email] wrote:
Thank you for your feedback. I would like to answer some statements to either convince you or make it more clear, where my idea stems from:

The fundamental reason why iterator() cannot be used with prefetch_related() is that the latter requires a set of model instances to be materialized to work appropriately, which chunk_size doesn't control at all.
In other words, chunk_size only controls how many rows should be fetched from the database cursor and kept in memory at a time. Even when this parameter is used, iterator() will only materialize a single model instance per yield.
 
It should be fairly easy to change the code involved in ModelIterable to materialize the retrieved rows in batches. After materializing each batch/chunk, it could do the prefetching.
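The batching idea can be sketched independently of Django's internals (a toy generator, not the actual ModelIterable code; the prefetch callable stands in for whatever would run the prefetch queries per chunk):

```python
from itertools import islice


def chunked_prefetch_iterator(rows, chunk_size, prefetch):
    """Materialize rows chunk by chunk, run one prefetch pass per chunk,
    then yield the objects one at a time. Memory stays bounded by
    chunk_size, while related lookups cost one pass per chunk instead of
    one query per object."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        prefetch(chunk)  # e.g. one extra query per prefetch lookup
        yield from chunk
```

Each chunk is fully materialized before the prefetch pass, which is exactly the property plain iterator() lacks today.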
 
Given that iterator() always ignored prefetch related lookups instead of erroring out when they were specified makes me feel like turning such a feature on by default could be problematic, as it could balloon the memory usage, and keeping memory low is the main reason iterator() is useful in the first place.

I would argue that users who thoughtlessly applied prefetching together with iterator() would now actually get what they had in mind: fewer DB query round trips, traded against a little more memory usage.

Best,
Tobi



--
Adam



--
Adam




--
Adam
