I recently came across a gnarly issue: Every morning we’d see a latency spike in our API for for about 7-8 minutes. More importantly, it happened for a small fraction of requests (0.1%), making it pretty hard to reproduce.

I figured leveraging Sentry’s profiling feature would be a good way to catch this. But this has its own problems.

The first problem is that it’s very expensive to profile all requests: a billion samples where most of them are useless is a waste of money.

Sentry allows sampling requests, which somewhat takes care of the cost problem.

The second problem is that Sentry by default samples based on the count of events. So if you have a sampling rate of 1%, you’ll get approximately 1 in 100 requests. But, since most transactions are pretty fast, these are still useless. And 10 million requests instead of a billion is still a lot of wasted money.

I required a way to sample requests by duration: instead of sampling by the request ID, I want to sample by duration. Give me 50% of all requests that took more than 2 seconds, and 0.0001% of all requests that took less than 2 seconds.

The magic of this kind of sampling is that it gives me the most important spans, while being much cheaper than the default sampling.

The docs don’t mention this is possible at all, but combing through the source code, I found a way to do this. Here’s how it works:

Sentry has a before_send_transaction function that fires before sending a transaction to Sentry. We can use this to modify the transaction before it’s sent, or add any additional logic (like sampling by duration) to drop the transaction.

This is a pure function that takes in the transaction, and either returns a transaction or None. If it returns None, the transaction is dropped. So, any kind of logic depends on what’s inside the transaction. See the appendix below for a sample transaction.

In my case, all I need to know is the request URL (so I can apply this only to the API request I’m interested in), and the start and end timestamp (so I can sample based on that).

Note: The only gotcha to remember here is that you need to manually sample things, instead of using the traces_sampler or any builtin sentry sampling, since those apply before before_send_transaction, which means it’s too late by the time you get the transaction.

Here’s the code I ended up using:


def before_send_transaction(event, hint):
    # get the request URL
    url_string = event.get("request", {}).get("url")
    # only apply this to the request I'm interested in
    if url_string and "decide" in url_string:
        # default sampling rate based on all requests.
        DECIDE_SAMPLE_RATE = 0.00001  # 0.001%
        should_sample = random() < DECIDE_SAMPLE_RATE

        transaction_start_time = event.get("start_timestamp")
        transaction_end_time = event.get("timestamp")
        if transaction_start_time and transaction_end_time:
            try:
                parsed_start_time = parser.parse(transaction_start_time)
                parsed_end_time = parser.parse(transaction_end_time)

                duration = parsed_end_time - parsed_start_time

                if duration >= timedelta(seconds=8):
                    # return all events for transactions that took more than 8 seconds
                    return event
                elif duration > timedelta(seconds=2):
                    # very high sample rate for transactions that took more than 2 seconds
                    return event if random() < 0.5 else None

            except Exception:
                pass
        return event if should_sample else None
    else:
        return event


sentry_sdk.init(
    # ...
    before_send_transaction=before_send_transaction,
)

And finally, here’s a real world code example of how we did this at PostHog.

So, what ended up being the issue? I’m not sure yet, but I have these lovely spans to help me investigate

Slow Sentry spans

Appendix: Sample transaction

Here’s a sample transaction, which helps us determine all possible fields we can use:

{
    "type": "transaction",
    "transaction": "/decide/[#].*",
    "transaction_info": {
      "source": "route"
    },
    "contexts": {
      "trace": {
        "trace_id": "e9c1XXXXXXXXXXXXXXXde238",
        "span_id": "89XXXXXXXXXXXXXcc",
        "parent_span_id": "90XXXXXXXe99",
        "op": "http.server",
        "description": null,
        "status": "ok",
        "tracestate": "sentry=<base64 encoding of the above info>",
        "dynamic_sampling_context": {
          "trace_id": "e9XXXXXXXXXXXXXXXXXXXX8",
          "public_key": "3XXXXXXXXXXXXXXXXXXXXXXXXa",
          "sample_rate": "1e-05",
          "release": "4XXXXXXXXXXXXXXXXXXXXXXXXXXXXX14",
          "environment": "production",
          "transaction": "/api/projects/{parent_lookup_team_id}/persons/"
        }
      },
      "runtime": {
        "name": "CPython",
        "version": "3.10.10",
        "build": "3.10.10 (main, Mar  7 2023, 14:21:53) [Clang 13.1.6 (clang-1316.0.21.2.5)]"
      }
    },
    "tags": {
      "http.status_code": "200",
      "origin": "unknown",
      "referer": "unknown",
      "library.version": "unknown"
    },
    "timestamp": "2023-10-03T09:47:17.681238Z",
    "start_timestamp": "2023-10-03T09:47:13.542852Z",
    "spans": [
      // A list of child spans in the transaction
      {
        "trace_id": "e9XXXXXXXXXXXXXXXXXXXX8",
        "span_id": "e9XXXXXXXXXXXXXXXXXXXX8",
        "parent_span_id": "e9XXXXXXXXXXXXXXXXXXXX8",
        "same_process_as_parent": true,
        "op": "event.django",
        "description": "django.db.close_old_connections",
        "start_timestamp": "2023-10-03T09:47:13.543012Z",
        "timestamp": "2023-10-03T09:47:13.543020Z",
        "data": {
          "signal": "django.db.close_old_connections"
        }
      },
      {
        "trace_id": "e9XXXXXXXXXXXXXXXXXXXX8",
        "span_id": "e9XXXXXXXXXXXXXXXXXXXX8",
        "parent_span_id": "e9XXXXXXXXXXXXXXXXXXXX8",
        "same_process_as_parent": true,
        "op": "middleware.django",
        "description": "posthog.middleware.PrometheusBeforeMiddlewareWithTeamIds.__call__",
        "start_timestamp": "2023-10-03T09:47:13.543162Z",
        "timestamp": "2023-10-03T09:47:17.681195Z",
        "tags": {
          "django.function_name": "django.utils.deprecation.MiddlewareMixin.__call__",
          "django.middleware_name": "posthog.middleware.PrometheusBeforeMiddlewareWithTeamIds"
        }
      },
      {
        "trace_id": "e9XXXXXXXXXXXXXXXXXXXX8",
        "span_id": "e9XXXXXXXXXXXXXXXXXXXX8",
        "parent_span_id": "e9XXXXXXXXXXXXXXXXXXXX8",
        "same_process_as_parent": true,
        "op": "db",
        "description": "SELECT \"posthog_persondistinctid\".\"person_id\", \"posthog_persondistinctid\".\"distinct_id\" FROM 
        \"posthog_persondistinctid\" WHERE (\"posthog_persondistinctid\".\"distinct_id\" IN (%s) AND \"posthog_persondistinctid\".
        \"team_id\" = %s)",
        "start_timestamp": "2023-10-03T09:47:17.573475Z",
        "timestamp": "2023-10-03T09:47:17.583230Z"
      },
    ],
    "event_id": "e9XXXXXXXXXXXXXXXXXXXX8",
    "extra": {
      "sys.argv": [
        "manage.py",
        "runserver"
      ]
    },
    "request": {
      "url": "http://localhost:8000/decide/",
      "query_string": "v=3",
      "method": "POST",
      "env": {
        "SERVER_NAME": "1.0.0.127.in-addr.arpa",
        "SERVER_PORT": "8000"
      },
      "headers": {
        "Content-Length": "282",
        "Content-Type": "application/json",
        "Host": "localhost:8000",
        "Sentry-Trace": "e9c121e003284df18bf9905ad6dde238-90219bf565efce99-0",
        "User-Agent": "posthog-python/3.0.1",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "*/*",
        "Connection": "keep-alive"
      },
      "data": ""
    },
    "release": "e9XXXXXXXXXXXXXXXXXXXX8",
    "environment": "production",
    "server_name": "192.168.1.104",
    "sdk": {
      "name": "sentry.python.django",
      "version": "1.14.0",
      "packages": [
        {
          "name": "pypi:sentry-sdk",
          "version": "1.14.0"
        }
      ],
      "integrations": [
        "aiohttp",
        "argv",
        "atexit",
        "boto3",
        "celery",
        "dedupe",
        "django",
        "excepthook",
        "logging",
        "modules",
        "redis",
        "stdlib",
        "threading"
      ]
    },
    "platform": "python",
    "_meta": {
      "request": {
        "data": {
          "": {
            "rem": [
              [
                "!config",
                "x"
              ]
            ]
          }
        }
      }
    }
  }
You can send yourself an email with this post.