Cutting Peak CPU in Half By Caching SSL Context
httpx is great, but it has a long-standing issue where it doesn't cache the SSL context. This is fine if you aren't creating a lot of clients, but for various reasons, Kodiak creates a ton of HTTP clients.
Narrowing in on the fix
The first step is creating a simple test case that we can run reliably to reproduce the issue:
import asyncio
import httpx

async def main() -> None:
    for _ in range(0, 10_000):
        async with httpx.AsyncClient() as client:
            r = await client.get("https://example.com")
            print(r.status_code)

if __name__ == '__main__':
    asyncio.run(main())
Then we can start it up and run py-spy on it, which clearly shows load_ssl_context_verify taking up a large portion of the trace.
If we remove the actual network calls, and instead just instantiate the client:
import asyncio
import httpx
import ssl

async def main() -> None:
    while True:
        async with httpx.AsyncClient() as client:
            print("foo")

if __name__ == '__main__':
    asyncio.run(main())
Then the issue is even more pronounced in the py-spy output.
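To get a rough feel for the per-client cost outside of a profiler, the sketch below times how long it takes to build a default SSL context using only the standard library. It assumes that context creation is the expensive part of client setup and uses ssl.create_default_context() as a stand-in for what httpx does internally; time_context_creation is a hypothetical helper name, and the numbers will vary by machine and OpenSSL build.

```python
import ssl
import time


def time_context_creation(iterations: int = 20) -> float:
    """Return the average number of seconds spent building one SSL context."""
    start = time.perf_counter()
    for _ in range(iterations):
        # Each call builds a brand-new SSLContext with the default trust settings.
        ssl.create_default_context()
    return (time.perf_counter() - start) / iterations


if __name__ == "__main__":
    avg = time_context_creation()
    print(f"average context creation: {avg * 1000:.2f} ms")
```

Multiplying that average by the number of clients created per second gives a ballpark for how much CPU time goes to context setup alone.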
The Fix
The proper fix is to update httpx to cache the SSL context. As a quick workaround in the meantime, looking around in the innards of load_ssl_context_verify reveals an early return path that's used when an existing ssl.SSLContext is passed as verify into the client's __init__.
Here's the code updated with the verify argument:
import asyncio
import httpx
import ssl

# "cache" at module level
context = ssl.create_default_context()

async def main() -> None:
    while True:
        async with httpx.AsyncClient(verify=context) as client:
            print("foo")

if __name__ == '__main__':
    asyncio.run(main())

And the same workaround applied to the version that makes network calls:

import asyncio
import httpx
import ssl

URL = "https://example.com"

# "cache" at module level
context = ssl.create_default_context()

async def main() -> None:
    while True:
        async with httpx.AsyncClient(verify=context) as client:
            r = await client.get(URL)
            print(r.status_code)

if __name__ == '__main__':
    asyncio.run(main())
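If a module-level variable feels awkward, for example because the context should only be built when first needed, one possible variation is to wrap its creation in a cached function. This is a minimal sketch rather than anything httpx provides: get_ssl_context is a hypothetical helper name, and it relies on functools.lru_cache to return the same object on every call.

```python
import functools
import ssl


@functools.lru_cache(maxsize=None)
def get_ssl_context() -> ssl.SSLContext:
    """Build the default SSL context on first use and reuse it afterwards."""
    return ssl.create_default_context()


# Hypothetical usage with httpx (not imported here):
#     async with httpx.AsyncClient(verify=get_ssl_context()) as client:
#         ...
```

Every client then shares one context object, just like the module-level version, but nothing is built at import time.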
The final results
[Figures: HTTP calls before and after; HTTP client creation before and after]
And finally, after rolling out the change to Kodiak's production servers, we see a 50% drop in peak CPU usage.