Preventing Unindexed Queries in Django

Backstory

A while ago I was watching a talk about microservices and the presenter brought up an interesting technique where the methods for accessing data only allowed querying via fields that were indexed. So for instance, if you wanted to query for users by age, there would only be an age parameter in the findAll method if the field was indexed.

It seemed like a great idea at the time but I never got around to figuring out how to achieve something similar in Django.

After someone on HN mentioned building a solution for preventing unindexed queries at Foursquare I decided to see what we can do with Django.

The Foursquare implementation is for Mongo but I think the idea is nice. Instead of creating a strict API around our query API, we can instead write a linter for the queries. This is nicer for established projects and it also means that devs that already know how to use Django ORM don’t have to learn a new homegrown query tool.

The Plan: Building the Query Check

The first step was seeing if we could take a QuerySet and check if it is querying against an unindexed field.

After toying around in an ipython session I found some methods that provide the necessary query data and came up with:

def only_querying_indexed_fields(queryset: QuerySet) -> bool:
    for expr in queryset._has_filters().get_source_expressions():
        if (
            not expr.lhs.output_field.db_index
            and not expr.lhs.output_field.primary_key
        ):
            return False
    return True

But we can’t just run this function over every queryset before execution, so my next idea was to monkey patch Django’s ORM query evaluation to add this check to every query before being sent to the database.

So I wrote a test, set a breakpoint(), and started stepping through Django’s evaluation of an ORM query to find a nice spot for patching.

django.db.models.sql.compiler.SQLCompiler::execute_sql seemed like a nice place to insert the check.

We also need to patch the SQLUpdateCompiler since it inherits from SQLCompiler and overrides the execute_sql method.

Here’s what I came up with:

from typing import Callable
from django.db.models.sql.compiler import (
    SQLCompiler,
    SQLUpdateCompiler,
)
from functools import wraps


def patch_execute_sql(original: Callable) -> Callable:
    @wraps(original)
    def inner(*args: object, **kwargs: object) -> None:
        # work with the `query` arg that gets passed in.
        breakpoint()
        original(*args, **kwargs)

    return inner


def patch() -> None:
    for compiler in (SQLCompiler, SQLUpdateCompiler):
        compiler.execute_sql = patch_execute_sql(compiler.execute_sql)

# then in settings.py we call patch()

It worked in that we can inspect every Query before execution. While this works, a QuerySet has more information that makes it easier to implement a solution.

Model Manager

Forgetting the monkey patching approach, we can instead override the model manager. This requires us to make more code changes but on the up side it’s more explicit than monkey patching.

In the code below, instead of checking the queries before they are sent to the database, we check the queries when they are constructed, aka when someone calls .filter() on a queryset.

import uuid
from django.db import models
from django.db.models.query import QuerySet

class QueryUnIndexedField(Exception):
    pass

class CheckedQuerySet(QuerySet):
    def filter(self, *args, **kwargs):
        queryset = super().filter(*args, **kwargs)
        for expr in queryset._has_filters().get_source_expressions():
            if (
                not expr.lhs.output_field.db_index
                and not expr.lhs.output_field.primary_key
            ):
                raise QueryUnIndexedField
        return queryset

    def exclude(self, *args: object, **kwargs: object) -> Any:
        queryset = super().exclude(*args, **kwargs)
        for expr in queryset._has_filters().get_source_expressions():
            for child in expr.children:
                if (
                    not child.lhs.output_field.db_index
                    and not child.lhs.output_field.primary_key
                ):
                    raise QueryUnIndexedField
        return queryset

class CheckedManager(models.Manager.from_queryset(CheckedQuerySet)):
    pass

class Team(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4)
    name = models.CharField(max_length=255)

class User(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4)
    created_at = models.DateTimeField(auto_now_add=True)
    modified_at = models.DateTimeField(auto_now=True)

    github_id = models.IntegerField(db_index=True)
    github_login = models.CharField(max_length=255)
    github_access_token = models.CharField(max_length=255)
    team = models.ForeignKey("Team", on_delete=models.SET_NULL, null=True)

    objects = CheckedManager()

And here’s a test which passes with the new check queryset & manager in place. Note how queries on indexed fields like id and github_id don’t raise exceptions while the unindexed fields do.

from uuid import uuid4
import pytest
from django.utils import timezone
from core.models import User, Team, QueryUnIndexedField

@pytest.mark.django_db
def test_query_unindexed_fields() -> None:
    User.objects.filter(id=uuid4())

    User.objects.filter(github_id=10)

    with pytest.raises(QueryUnIndexedField):
        User.objects.filter(github_login="foo")

    with pytest.raises(QueryUnIndexedField):
        User.objects.filter(Q(github_login="foo"))

    with pytest.raises(QueryUnIndexedField):
        User.objects.exclude(github_login="foo")

    with pytest.raises(QueryUnIndexedField):
        User.objects.filter(id=uuid4()).filter(github_login="foo").update(github_id=10)

    with pytest.raises(QueryUnIndexedField):
        User.objects.filter(id=uuid4()).filter(github_login="foo").delete()

    with pytest.raises(QueryUnIndexedField):
        User.objects.filter(modified_at__gt=timezone.now())

    with pytest.raises(QueryUnIndexedField):
        User.objects.filter(github_login__in=["foo"])

    team = Team.objects.create(name="foo bar")
    User.objects.filter(team=team)

Conclusion

While limiting the query API surface to only include indexed fields would provided quicker feedback, compiler error vs test error, the queryset approach lends itself to existing code bases.

Future Work

Django’s syntax for lookups that span relationships needs to be handled. Might need to add more checking for .annotate() as well.