
Easier way to populate test databases for parallel tests (patch in github)

Marcos Diez
Ticket:            https://code.djangoproject.com/ticket/28153
Pull Request: https://github.com/django/django/pull/8437

Although Django makes it very easy to extend django.test.runner.DiscoverRunner, its setup_databases() does too much.

Currently, it

  • creates all the test databases (for single thread unit tests)
  • duplicates all the test databases (in case of parallel unit tests)

If I am not running tests in parallel, I can just populate the DB after setup_databases() has run, without any issues.


But if I care about my time and want to run tests in parallel, I can either:


a) populate my data after setup_databases() is executed, once for each thread of the parallel tests, which is slow
b) get my hands dirty and reimplement setup_databases()


I propose (and I am sending the code to do so) a better solution. We just have to break setup_databases() into three methods:


DiscoverRunner.prepare_databases()
DiscoverRunner.populate_databases() # noop by default
DiscoverRunner.duplicate_databases_if_necessary()


The idea is quite simple: in order to be backward compatible, setup_databases() will still exist, but it will only call the three functions above, in that order.


The first function will create all the test databases necessary for non-parallel tests to run.

populate_databases(), which is a no-op by default, can be overridden by anyone who extends django.test.runner.DiscoverRunner, so that their data can be populated.


Afterwards, if tests are run in parallel, DiscoverRunner.duplicate_databases_if_necessary() copies all the test DBs as many times as necessary.
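
The call order described above can be sketched in plain Python. This is only an illustration of the proposal, not the code from the pull request: SketchRunner, FixtureRunner and the dictionary standing in for real databases are hypothetical, and nothing here touches Django.

```python
class SketchRunner:
    """Hypothetical stand-in for DiscoverRunner; it only models the call order."""

    def __init__(self, parallel=1):
        self.parallel = parallel
        self.databases = {}  # test database name -> list of rows

    def prepare_databases(self):
        # Step 1: create the test databases needed for a non-parallel run.
        self.databases["test_default"] = []

    def populate_databases(self):
        # Step 2: no-op by default; subclasses override it to load data.
        pass

    def duplicate_databases_if_necessary(self):
        # Step 3: clone every prepared database once per parallel worker.
        if self.parallel <= 1:
            return
        for name, rows in list(self.databases.items()):
            for index in range(1, self.parallel + 1):
                self.databases[f"{name}_{index}"] = list(rows)

    def setup_databases(self):
        # Kept for backward compatibility: it just runs the three steps in order.
        self.prepare_databases()
        self.populate_databases()
        self.duplicate_databases_if_necessary()
        return self.databases


class FixtureRunner(SketchRunner):
    def populate_databases(self):
        # Runs once, before cloning, so every clone starts with the same rows.
        for rows in self.databases.values():
            rows.append({"name": "example fixture row"})
```

With FixtureRunner(parallel=3), the row is inserted once and then copied into each clone, which is the speed-up the proposal is after.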


I believe this change to Django will have no downside, will be backward compatible, and will help people who need to populate real data in the DB for their tests.


Thanks

Marcos Diez

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/5f3ff10a-a0a7-4142-87a6-4820e4358807%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Easier way to populate test databases for parallel tests (patch in github)

Tim Graham-2
I would expect test data population to happen using migrations rather than in the test runner. Can you elaborate on your use case and say if that method would be unsuitable?

On Friday, April 28, 2017 at 8:45:55 PM UTC-4, Marcos Diez wrote:
Ticket:            https://code.djangoproject.com/ticket/28153
Pull Request: https://github.com/django/django/pull/8437


Re: Easier way to populate test databases for parallel tests (patch in github)

Shai Berger
On Saturday 29 April 2017 03:50:16 Tim Graham wrote:
> I would expect test data population to happen using migrations rather than
> in the test runner. Can you elaborate on your use case and say if that
> method would be unsuitable?
>

Apparently, many people think that migrations are the wrong tool for this job.

See previous discussion, which didn't seem to go anywhere:

https://groups.google.com/d/msg/django-developers/Ln1-IqysEwE/DuyZl7QkEwAJ

Have fun,
        Shai.

Re: Easier way to populate test databases for parallel tests (patch in github)

Adam Johnson-2
Avoiding migrations, one can populate test data with a post_migrate signal handler. django.contrib.contenttypes already does this to fill the DB with content types (see https://github.com/django/django/blob/c651331b34b7c3841c126959e6e52879bc6f0834/django/contrib/contenttypes/apps.py#L18 ). To do it only during tests, you could register the handler conditionally.
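
A minimal sketch of that idea follows. The helper names (running_tests, load_test_data, register_test_data_loader), the fixture name test_data.json and the sys.argv check are assumptions made for illustration; only post_migrate, call_command and the loaddata command are Django's own. Django is imported lazily inside the functions so the module can be imported on its own.

```python
import sys


def running_tests(argv=None):
    """Heuristic (an assumption): treat `manage.py test ...` as a test run."""
    argv = sys.argv if argv is None else argv
    return len(argv) > 1 and argv[1] == "test"


def load_test_data(sender, using="default", **kwargs):
    """post_migrate handler: load a fixture into the database just migrated."""
    from django.core.management import call_command

    # "test_data.json" is a hypothetical fixture name.
    call_command("loaddata", "test_data.json", database=using, verbosity=0)


def register_test_data_loader(app_config, argv=None):
    """Connect the handler in test runs only; returns whether it was connected."""
    if not running_tests(argv):
        return False
    from django.db.models.signals import post_migrate

    # Restricting the sender means the handler fires once per migrate run,
    # not once for every installed app.
    post_migrate.connect(load_test_data, sender=app_config)
    return True
```

An app's AppConfig.ready() would then call register_test_data_loader(self).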

On 29 April 2017 at 09:39, Shai Berger <[hidden email]> wrote:

Apparently, many people think that migrations are the wrong tool for this job.


--
Adam


Re: Easier way to populate test databases for parallel tests (patch in github)

Marcos Diez
In reply to this post by Marcos Diez
I believe I was not clear.

I do use migrations to populate Enums and other data that should also be available in production.

The code I am sending is to load fixtures into the database.

This way all tests can assume the same set of data, and all the fixtures are loaded in one place, which makes sense for my use case.

The advantage of the method I am proposing is that it is quite fast. Data is loaded into the DB only once and is then duplicated in bulk by the DBMS, as many times as necessary, when tests run in parallel.

Another unexpected convenience of my method is that developers who use Django to populate fixtures in the database do not have to worry about whether their data generation code has side effects when tests run in parallel, because that code will run only once.


Actually, if I may ask, how else would one load a batch of fixtures into the DB and run tests in parallel without my PR?



Re: Easier way to populate test databases for parallel tests (patch in github)

Adam Johnson-2
> Actually, if I may ask, how else would one load bunches of fixtures in the DB and run tests in parallel without my PR ?

As I said, register a post_migrate handler during testing that loads your data. It will run during the creation of the first database in connection.creation.create_test_db, as part of call_command('migrate'), before the test runner code clones the database for parallel execution. There's no need to change Django to support this.

Another option is to extend your database backend, override creation.create_test_db, and add the logic there.
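
That second option could look roughly like the mixin below. PopulateAfterCreateMixin and populate_test_db are hypothetical names; in a real project the mixin would be combined with the DatabaseCreation class of the backend in use and referenced from a custom DatabaseWrapper through its creation_class attribute. FakeCreation and DemoCreation exist only so the sketch runs without Django.

```python
class PopulateAfterCreateMixin:
    """Hypothetical mixin: load data right after the test database is created."""

    def create_test_db(self, *args, **kwargs):
        # Let the real backend create and migrate the test database first.
        test_db_name = super().create_test_db(*args, **kwargs)
        # Runs once, before the test runner clones the database for
        # parallel workers.
        self.populate_test_db(test_db_name)
        return test_db_name

    def populate_test_db(self, test_db_name):
        # Override to load fixtures, e.g. with call_command("loaddata", ...).
        raise NotImplementedError


class FakeCreation:
    """Stand-in for a backend's DatabaseCreation, only for this sketch."""

    def create_test_db(self, **kwargs):
        return "test_example"


class DemoCreation(PopulateAfterCreateMixin, FakeCreation):
    def __init__(self):
        self.populated = []

    def populate_test_db(self, test_db_name):
        # Record the call instead of touching a real database.
        self.populated.append(test_db_name)
```

One caveat with this approach: if the test database is kept between runs, the population step would run again on every run unless it checks for existing data.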

By the way, I think it's the general opinion that tests are best without a large generic fixture available to them. It's certainly been my experience, as it makes it very hard to later understand what data a specific test does or does not rely upon, and if the data can be updated safely. The tool I prefer for test data generation is factory boy ( https://factoryboy.readthedocs.io/en/latest/ ) which can be used to create data per test method or class, without having to laboriously specify every field of every model.
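
To illustrate the per-test style, here is a hand-rolled stand-in for the factory pattern. It deliberately does not use factory boy's API; Customer and make_customer are hypothetical names, and a plain dataclass takes the place of a Django model.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    email: str
    is_active: bool = True


_sequence = itertools.count(1)


def make_customer(**overrides):
    """Build a Customer with defaults; a test overrides only what it cares about."""
    n = next(_sequence)
    fields = {
        "name": f"customer{n}",
        "email": f"customer{n}@example.com",
        "is_active": True,
    }
    fields.update(overrides)
    return Customer(**fields)
```

A test that only cares about inactive customers calls make_customer(is_active=False), which makes the data it depends on visible in the test itself.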

On 29 April 2017 at 13:36, Marcos Diez <[hidden email]> wrote:

The code I am sending is to load fixtures on the database.


--
Adam
