Need help with your Django project?
Check our django servicesIntro
In our daily work as web developers, we use the Django ORM a lot. It's a great tool that makes it easy to interact with the database by just calling any of the methods for CRUD operations:
- Create(INSERT) -
QuerySet.create
- Read(SELECT) -
QuerySet.all
,QuerySet.values
andQuerySet.values_list
- Update -
QuerySet.update
- Delete -
QuerySet.delete
But the all/values/values_list
methods are quite different - they don't immediately trigger a query, even if you chain other QuerySet methods to them (.filters
/ .order_by
/ etc.), but the data is there when you start iterating over it.
Here're 5 questions that we're going to focus on:
- When does the ORM execute the
SELECT
query? - How do the
all
,values
andvalues_list
affect the queryset ? - What data structure do we need for better performance ?
- Can we define a custom form for the DB data?
- Does the order of methods chaining affect the order of execution of the QuerySet ?
QuerySet as a generator vs QuerySet as an iterable
Theory
Let's say we have the following model:
class User(Model):
email = CharField(max_length=255)
name = CharField(max_length=255)
Let's take a look at the following code snippet:
users = User.objects.all()
hacksoft_users = users.filter(email__icontains='@hacksoft.io')
for user in hacksoft_users:
pass
for user in hacksoft_users:
pass
How many queries will the code above do? When? Why?
Here're a few facts about the querysets:
- The QuerySet is immutable - chaining methods to our queryset doesn't modify the original queryset - it creates a new one.
- The QuerySet is a generator when you iterate over it for the first time - when you start iterating over the queryset, internally it executes a
SELECT
query and yields the DB rows shaped into the desired Python data structure. - The QuerySet is an iterable - once we've iterated over the queryset once, the queryset puts the DB result into a cache. On every subsequent iteration, we'll use the cached objects. This prevents us from unwanted queries duplication.
As a summary:
The queryset object is immutable and it triggers a SELECT query only when you iterate over it for the first time.
So let's get back on our example:
users = User.objects.all() # Creates a queryset
hacksoft_users = users.filter(email__icontains='@hacksoft.io') # Creates a new queryset
for user in hacksoft_users: # Makes SELECT query and yields the result
pass
for user in hacksoft_users: # Just yields the cached result
pass
Based on the unique querysets first iterations, the code above makes 1 SELECT
query.
Implementation
In Python, the way to define an iterable class is by using the __iter__
method. That's how the QuerySet
class works. Let's take a look at the QuerySet.__iter__
:
class QuerySet:
...
def _fetch_all(self):
if self._result_cache is None:
self._result_cache = list(self._iterable_class(self))
# ... more code to handle prefetched relations
def __iter__(self):
self._fetch_all()
return iter(self._result_cache)
As we see in the _fetch_all
method we have 2 cases:
- the
_result_cache
isNone
- that's when we iterate over the queryset for the first time. The queryset now needs to iterate over the_iterable_class
(we'll take a closer look there in the next section) that will take care of the DB fetching and will put the DB rows into the desired Python structure. - the
_result_cache
is notNone
- the queryset will just iterate over it.
Q: When does the ORM execute theSELECT
query?
A: The ORM execute aSELECT
query only the first time you iterate over a QuerySet
Iterable classes
Let's focus on the QuerySet._iterable_class
and see what it does with the SELECT
query's data.
The _iterable_class
has two functions:
- calls the SQL compiler to execute
SELECT
query - puts the raw database data (a list of tuples) into ORM objects(
.all
), dictionaries(.values
) or tuples(.values_list
) and return it
We have the following types of "iterable classes" that comes from the Django ORM:
ModelIterable
(link) - used by.all
and yields ORM objectsValuesIterable
(link) - set when.values
is called and yields dictionariesValuesListIterable
(link),NamedValuesListIterable
(link) andFlatValuesListIterable
(link) - set when.values_list
is called (we have 3 iterable classes here sincevalues_list
returns different formats depending on thenamed
andflat
arguments)
NOTE: The ModelIterable
is the default _iterable_class
of the QuerySet - that's why we could iterate over User.objects.filter
without calling all
explicitly.
Implementation
So let's take a look ValuesIterable
class:
class ValuesIterable(BaseIterable):
def __iter__(self):
queryset = self.queryset
query = queryset.query
compiler = query.get_compiler(queryset.db)
names = [
*query.extra_select,
*query.values_select,
*query.annotation_select,
]
indexes = range(len(names))
for row in compiler.results_iter(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size):
yield {names[i]: row[i] for i in indexes}
As we see the dictionary form of the DB data is done by the {names[i]: row[i] for i in indexes}
expression.
So why should we care about this? Well, for 2 reasons:
- ORM performance
- Flexibility to extend the QuerySet interface
Q: How do theall
,values
andvalues_list
affect the queryset ?
A: By changing the QuerySet's iterable class, they define the data structure for the DB data.
QuerySet Performance - SQL vs Python
The ORM SELECT
query performance is a result of two other components - SQL query performance and Django ORM (Python) performance.
SQL query performance
The SQL query performance is a huge topic and we're not going to get into it. Here're a few common sense things that could make our DB work faster:
- query only for the DB column you need - use
.only
and.defer
- query only for the DB rows you need - use
.filter
and.exclude
- use DB indexes
ORM performance
As soon as the DB finishes its job, it's time for the ORM's QuerySet._iterable_class
to go through each DB row and put it into an ORM data structure. Since this is done for each row, the performance between .all
, .values_list
and .values_list
could be significant (that's visible especially when working with huge data sets).
Although QuerySet.all
's ModelIterable
allows us to use the entire power of the ORM, it has a price - it makes a model instance for each DB row and attaches all the related objects(also models instances). This is a really expensive operation - both for the CPU and the RAM.
If we fetch too much data and we need it read-only, .values
and .values_list
are a better choice than .all
.
Q: What data structure do we need for better performance ?
A: The rule of thumb here is "The more data you fetch, the lighter data structure you should use.". An expensive structure like the ORM object could seriously affect the QuerySet performance when you work with big volumes of data.
Writing a custom iterable class
Inspired from the functional programming, let's write a map()
method on the QuerySet objects
Here is the interface we want to achieve:
users = User.objects.filter(email__icontains='@hacksoft.io')
user_ids = users.map(lambda el: el.id)
print(user_ids)
# >>> <UserQuerySet [1, 2, ...]>
Let's do it step by step:
- Define custom queryset with empty
map
function:
class UserQuerySet(QuerySet):
def map(self, func):
# We need to clone our queryset to keep the immutable behaviour
clone = self._chain()
return clone
class User(Model):
objects = UserQuerySet.as_manager()
email = CharField(max_length=255)
name = CharField(max_length=255)
2. Add an iterable class that inherits the current iterable class so we chain our map
method to all of the all
, values
and values_list
class UserQuerySet(QuerySet):
def map(self, func):
class MapIterable(self._iterable_class):
def __iter__(self):
return (
obj
for obj in super().__iter__()
)
# Keep the queryset immutable behaviour
clone = self._chain()
clone._iterable_class = MapIterable
return clone
3. Apply our custom function in the map
class UserQuerySet(QuerySet):
def map(self, func):
class MapIterable(self._iterable_class):
def __iter__(self):
return (
func(obj) # <-- apply the function
for obj in super().__iter__()
)
# Keep the queryset immutable behaviour
clone = self._chain()
clone._iterable_class = MapIterable
return clone
As we use the current data structure of the queryset in our map function we can do the following things:
def attach_verbose_name(user):
# The user here is a dictionary since we used `.values()` in our queryset
user['verbose_name'] = f'{user["name"]} ({user["email"]})'
return user
User.objects \
.filter(id__gt=2) \
.values() \ # the data structure will be a `dict`
.map(attach_verbose_name) \ # Will work with dictionaries (`.values`)
.map(lambda el: el['verbose_name'])
>>> <UserQuerySet ['User 3 (user3@hacksoft.io)']>
Q: Can we define a custom form for the DB data?
A: Yes - by defining a new QuerySet method with it's own iterable class.
Methods chaining and order of execution
The order of the method chaining is not always the same as the order of execution.
We could categorize the QuerySet methods into 2 categories:
- Methods that modify the SQL query -
filter
/exclude
/annotate
/only
/ etc. They are "executed" into the database when it runs the SQL query. - Methods that define the data structure -
all
/values
/values_list
/etc. They're executed in our Django app (by iterating over the iterable class and modifying the data)
The ORM allows us to chain the same methods in almost any order. But, no matter the order of chaining, the order of execution will always be:
- Execute the methods that are modifying the SQL query
- Run the query in the database
- Execute the methods that define the data structure
That's why the following example will fail:
def attach_annotated_id(user):
# The user here is a dictionary since we used `.values()` in our queryset
user['annotated_id'] = user['id']
return user
User.objects \
.values() \ # Execution order: 2
.map(attach_annotated_id) \ # Execution order: 3
.filter(annotated_id__gte=2) # Execution order: 1
Q: Does the order of methods chaining affects the order of execution of the QuerySet ?
A: It does, but the methods that modify the data structure are ALWAYS executed after the methods that modify the SQL queries by design.
Summary
Django ORM's QuerySet
methods that cover each of the CRUD operations are:
- Create(INSERT) -
QuerySet.create
- Read(SELECT) -
QuerySet.__iter__
- Update -
QuerySet.update
- Delete -
QuerySet.delete
Back to the questions:
1. When does the ORM actually execute the SELECT
query?
The ORM execute a SELECT
query only the first time you iterate over a QuerySet
2. How do the all
, values
and values_list
affect the queryset ?
By changing the QuerySet's iterable class, they define the data structure for the DB data.
3. What data structure do we need for better performance ?
The more data you fetch, the lighter data structure you should use.
4. Can we define a custom form for the DB data?
Yes - by defining a new QuerySet method with it's own iterable class.
5. Does the order of methods chaining affects the order of execution of the QuerySet ?
It does, but the methods that modify the data structure are ALWAYS executed after the methods that modify the SQL queries by design.
I hope this article helps you learn more about the Django ORM data fetching strategy and how we can resolve possible performance issues.
For more Django related blog posts - check the rest of the Django articles in our blog ✌