Sync to Async Python for Speed

Lately I’ve noticed that recipeyak’s API is kind of slow. After having success using asyncio with kodiak, I thought it would be good to benchmark sync Python against async Python.

Also I threw in a Rust implementation for kicks.

setup

architecture:

client (oha) -> haproxy -> http server -> postgres

client: my laptop (in Boston)

servers:

service	vCPUs	Memory	Storage	Price	Region
haproxy	2	4 GB	80 GB	$28/mo	NYC
http server	2	4 GB	80 GB	$28/mo	NYC
postgres (v14) w/ pgbouncer (25 connections)	2	4 GB	38 GB	$60/mo	NYC

haproxy config:

# /etc/haproxy/haproxy.cfg
global
    stats timeout 30s
    daemon

defaults
    log	global
    mode http
    option httplog
    timeout connect 5s
    timeout client 30s
    timeout server 30s

frontend front
    bind *:80
    use_backend api if { path_beg /api/v1/ }

backend api
    server recipes 10.0.0.1:8080 maxconn 25

Note: I took the max connection count from the docs, but changing it up or down didn’t seem to affect anything as the servers didn’t get overloaded.

Also, I didn’t set up TLS, but I don’t think that should affect the results much because haproxy would handle the termination.

client oha (version 0.5.2) command:

oha -z 1m -c 50 --no-tui \
    -H "Cookie: sessionid=$SESSION_ID" "http://$IP/api/v1/recipes"

I also used psrecord to record CPU and memory usage:

psrecord --interval 1 --include-children --log $logname.log $PID
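The peak CPU and peak memory figures in the conclusion can be pulled out of a psrecord log with a few lines of Python. This is a minimal sketch, assuming the log has the column layout shown in the perf logs in this post (elapsed time, CPU %, real memory in MB); the `Peak` type, the `peak_usage` helper, and the sample rows are made up for illustration.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    cpu_percent: float
    memory_mb: float


def peak_usage(log_text: str) -> Peak:
    """Return the peak CPU (%) and peak real memory (MB) found in a psrecord log."""
    cpu_values = []
    mem_values = []
    for line in log_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# Elapsed time ..." header.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Assumed columns: elapsed time, CPU (%), Real (MB), then anything else.
        cpu_values.append(float(fields[1]))
        mem_values.append(float(fields[2]))
    # max() raises ValueError if the log has no data rows.
    return Peak(max(cpu_values), max(mem_values))


# Illustrative rows in the same format as the logs in this post.
sample = """\
# Elapsed time   CPU (%)     Real (MB)
       0.000        0.000      227.293
       1.004       48.000      274.422
       2.008       38.000      275.672
"""
print(peak_usage(sample))
```

In practice you would pass it the contents of `$logname.log`, for example `peak_usage(open(path).read())`.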

results

python (sync): django + psycopg

I chose 5 workers, following the gunicorn docs’ suggestion of (2 * os.cpu_count()) + 1, which works out to 5 on this 2 vCPU box.

./.venv/bin/gunicorn \
    -w 5 \
    main:app \
    -b 0.0.0.0:8080 \
    --access-logfile - \
    --error-logfile - \
    --capture-output \
    --enable-stdio-inheritance \
    --access-logformat 'request="%(r)s" request_time=%(L)s remote_addr="%(h)s" request_id=%({X-Request-Id}i)s response_id=%({X-Response-Id}i)s method=%(m)s protocol=%(H)s status_code=%(s)s response_length=%(b)s referer="%(f)s" process_id=%(p)s user_agent="%(a)s"'

I also tried using --threads, but it didn’t make a difference in performance.
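For reference, the worker count used above is just the gunicorn rule of thumb applied to the 2 vCPU server. A tiny sketch; the `suggested_workers` name is illustrative, not part of gunicorn:

```python
import os


def suggested_workers(cpu_count: int) -> int:
    """gunicorn's rule-of-thumb starting point: (2 * cores) + 1."""
    return 2 * cpu_count + 1


# The http server in this benchmark has 2 vCPUs.
print(suggested_workers(2))  # 5

# On the current machine (os.cpu_count() can return None, so fall back to 1).
print(suggested_workers(os.cpu_count() or 1))
```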

Summary:
  Success rate:	1.0000
  Total:	60.0026 secs
  Slowest:	1.6509 secs
  Fastest:	0.2181 secs
  Average:	0.8512 secs
  Requests/sec:	58.3474

  Total data:	11.95 MiB
  Size/request:	3.49 KiB
  Size/sec:	203.91 KiB

Response time histogram:
  0.348 [6]    |
  0.479 [18]   |
  0.609 [114]  |■■■
  0.739 [1131] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.869 [1015] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  1.000 [335]  |■■■■■■■■■
  1.130 [495]  |■■■■■■■■■■■■■■
  1.260 [253]  |■■■■■■■
  1.390 [86]   |■■
  1.521 [35]   |
  1.651 [13]   |

Latency distribution:
  10% in 0.6521 secs
  25% in 0.7046 secs
  50% in 0.7903 secs
  75% in 1.0018 secs
  90% in 1.1429 secs
  95% in 1.2335 secs
  99% in 1.4102 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0345 secs, 0.0252 secs, 0.0416 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0001 secs

Status code distribution:
  [200] 3501 responses

perf log

# Elapsed time   CPU (%)     Real (MB)
       0.000        0.000      227.293
       1.004        0.000      227.293
       2.008        0.000      227.293
       3.012        0.000      227.293
       4.016        0.000      227.293
       5.020        0.000      227.293
       6.025        0.000      227.293
       7.031        0.000      227.293
       8.035        0.000      227.293
       9.040        0.000      227.293
      10.044        0.000      227.293
      11.048       48.000      274.422
      12.052       38.000      274.422
      13.056       40.000      274.422
      14.060       34.000      274.422
      15.064       41.000      274.422
      16.068       35.000      274.422
      17.072       35.000      274.422
      18.077       36.000      274.422
      19.081       37.000      274.422
      20.085       36.000      274.422
      21.089       32.000      274.422
      22.093       42.900      274.422
      23.099       34.000      274.422
      24.104       39.000      274.422
      25.108       35.000      274.422
      26.112       27.000      274.422
      27.117       32.000      274.422
      28.121       40.000      274.422
      29.127       43.000      274.422
      30.132       42.000      274.422
      31.136       39.000      274.422
      32.141       37.000      274.422
      33.145       37.000      274.422
      34.149       35.000      274.422
      35.153       28.000      274.422
      36.157       38.000      274.422
      37.161       29.000      274.422
      38.165       37.000      274.672
      39.169       38.000      274.672
      40.173       36.000      275.176
      41.177       36.000      275.176
      42.182       32.000      275.176
      43.187       33.000      275.176
      44.191       34.000      275.422
      45.195       37.000      275.672
      46.199       32.000      275.672
      47.202       35.000      275.672
      48.207       36.000      275.672
      49.211       37.000      275.672
      50.215       37.000      275.672
      51.219       44.000      275.672
      52.224       47.800      275.672
      53.229       46.500      275.672
      54.235       43.000      275.672
      55.240       37.000      275.672
      56.244       31.000      275.672
      57.249       32.000      275.672
      58.253       40.000      275.672
      59.257       36.000      275.672
      60.261       39.000      275.672
      61.265       36.000      275.672
      62.269       35.000      275.672
      63.273       35.000      275.672
      64.277       40.000      275.672
      65.281       29.000      275.672
      66.287       36.000      275.672
      67.291       40.000      275.672
      68.297       37.000      275.672
      69.301       36.000      275.672
      70.305       39.000      275.672
      71.309       19.000      275.672
      72.313        0.000      275.672

python (async): starlette + asyncpg

./.venv/bin/gunicorn \
    -w 2 \
    -k uvicorn.workers.UvicornWorker \
    -b 0.0.0.0:3000 main:app \
    --access-logfile - \
    --error-logfile - \
    --capture-output \
    --enable-stdio-inheritance \
    --access-logformat 'request="%(r)s" request_time=%(L)s remote_addr="%(h)s" request_id=%({X-Request-Id}i)s response_id=%({X-Response-Id}i)s method=%(m)s protocol=%(H)s status_code=%(s)s response_length=%(b)s referer="%(f)s" process_id=%(p)s user_agent="%(a)s"'

Summary:
  Success rate:	1.0000
  Total:	60.0037 secs
  Slowest:	0.3148 secs
  Fastest:	0.0424 secs
  Average:	0.1009 secs
  Requests/sec:	495.3030

  Total data:	103.01 MiB
  Size/request:	3.55 KiB
  Size/sec:	1.72 MiB

Response time histogram:
  0.066 [1809]  |■■■■■
  0.090 [8589]  |■■■■■■■■■■■■■■■■■■■■■■■
  0.113 [11554] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.137 [5272]  |■■■■■■■■■■■■■■
  0.161 [1661]  |■■■■
  0.184 [499]   |■
  0.208 [207]   |
  0.232 [84]    |
  0.255 [32]    |
  0.279 [9]     |
  0.303 [4]     |

Latency distribution:
  10% in 0.0697 secs
  25% in 0.0812 secs
  50% in 0.0991 secs
  75% in 0.1143 secs
  90% in 0.1332 secs
  95% in 0.1479 secs
  99% in 0.1866 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0308 secs, 0.0253 secs, 0.0376 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0001 secs

Status code distribution:
  [200] 29720 responses

perf log

# Elapsed time   CPU (%)     Real (MB)
       0.000        0.000      112.832
       1.004        0.000      112.832
       2.010        0.000      112.832
       3.014        1.000      112.832
       4.018        0.000      112.832
       5.022        0.000      112.832
       6.026        1.000      112.832
       7.029        0.000      112.832
       8.034        4.000      114.289
       9.037        0.000      114.289
      10.041        1.000      114.289
      11.045        7.000      115.516
      12.049       38.800      116.289
      13.053       81.600      117.574
      14.057       94.600      118.863
      15.061      106.600      120.152
      16.065      141.400      121.957
      17.070      131.400      123.504
      18.075      143.200      125.051
      19.079      154.400      126.598
      20.083      142.400      128.402
      21.087      148.400      129.949
      22.091      143.400      131.496
      23.096      138.400      133.043
      24.100      141.400      134.074
      25.104      142.500      134.590
      26.108      122.400      135.105
      27.112      121.600      135.363
      28.116      140.400      135.879
      29.120      133.600      135.879
      30.124      136.400      135.879
      31.127      136.500      136.395
      32.131      132.400      136.652
      33.136      134.400      136.652
      34.140      143.400      136.910
      35.144      135.400      137.168
      36.147      143.400      137.941
      37.152      130.600      138.457
      38.156      134.400      138.715
      39.160      130.400      139.230
      40.164      133.400      139.746
      41.168      142.600      140.004
      42.172      135.400      140.262
      43.176      147.300      140.262
      44.180      163.400      140.520
      45.184      158.200      141.035
      46.189      161.400      141.809
      47.193      119.500      141.809
      48.198      119.500      141.809
      49.201      105.600      141.809
      50.205      120.600      141.809
      51.208      132.600      142.066
      52.212      137.400      142.066
      53.217      134.200      142.324
      54.222      135.400      142.324
      55.226      138.400      142.324
      56.230      128.600      142.324
      57.234      127.500      142.324
      58.238      143.400      142.582
      59.242      136.500      142.840
      60.246      134.600      142.840
      61.249      145.400      142.840
      62.253      132.300      142.840
      63.259      128.400      142.840
      64.264      139.400      142.840
      65.268      138.400      142.840
      66.272      159.200      143.098
      67.278      160.200      143.098
      68.282      106.500      143.098
      69.287        0.000      143.098
      70.291        2.000      143.098
      71.295        0.000      143.098

rust (async): axum + tokio postgres + bb8 (for connection pool)

./target/release/async_rust

Summary:
  Success rate:	1.0000
  Total:	60.0030 secs
  Slowest:	0.3792 secs
  Fastest:	0.0758 secs
  Average:	0.1870 secs
  Requests/sec:	267.0533

  Total data:	58.45 MiB
  Size/request:	3.73 KiB
  Size/sec:	997.51 KiB

Response time histogram:
  0.103 [22]   |
  0.131 [145]  |
  0.159 [1753] |■■■■■■■
  0.186 [7096] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.214 [4543] |■■■■■■■■■■■■■■■■■■■■
  0.241 [1629] |■■■■■■■
  0.269 [545]  |■■
  0.296 [225]  |■
  0.324 [40]   |
  0.352 [17]   |
  0.379 [9]    |

Latency distribution:
  10% in 0.1566 secs
  25% in 0.1679 secs
  50% in 0.1820 secs
  75% in 0.2010 secs
  90% in 0.2249 secs
  95% in 0.2424 secs
  99% in 0.2794 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0353 secs, 0.0288 secs, 0.0377 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0001 secs

Status code distribution:
  [200] 16024 responses

perf log

# Elapsed time   CPU (%)     Real (MB)
       0.000        0.000       14.461
       1.004        0.000       14.461
       2.007        0.000       14.461
       3.010        4.000       14.461
       4.014       46.800       14.578
       5.018       50.800       14.578
       6.024       45.700       14.578
       7.029       38.800       14.578
       8.033       39.800       14.578
       9.037       35.900       14.578
      10.041       37.900       14.578
      11.045       37.800       14.578
      12.049       35.900       14.578
      13.053       39.800       14.578
      14.057       36.800       14.578
      15.061       38.900       14.578
      16.065       35.900       14.578
      17.068       37.900       14.578
      18.072       34.900       14.578
      19.076       36.900       14.586
      20.079       34.900       14.586
      21.083       36.900       14.586
      22.086       36.900       14.586
      23.090       37.800       14.586
      24.094       43.800       14.586
      25.098       35.900       14.586
      26.102       38.800       14.621
      27.106       50.800       14.621
      28.112       40.700       14.621
      29.117       41.800       14.621
      30.121       44.800       14.621
      31.125       38.800       14.621
      32.129       37.800       14.621
      33.133       35.900       14.621
      34.136       38.900       14.621
      35.140       40.900       14.621
      36.144       36.800       14.621
      37.148       30.900       14.629
      38.152       33.900       14.621
      39.156       35.900       14.621
      40.160       37.900       14.621
      41.163       36.900       14.621
      42.167       37.900       14.621
      43.171       33.900       14.621
      44.174       35.900       14.598
      45.178       39.900       14.598
      46.182       37.900       14.598
      47.185       37.900       14.598
      48.190       39.800       14.617
      49.193       37.900       14.617
      50.198       48.800       14.617
      51.202       52.800       14.617
      52.207       47.800       14.617
      53.212       47.800       14.625
      54.216       41.800       14.617
      55.219       37.900       14.617
      56.224       35.800       14.617
      57.227       37.900       14.617
      58.231       27.900       14.617
      59.235       30.900       14.617
      60.239       37.900       14.617
      61.242       36.900       14.617
      62.247       34.900       14.633
      63.250       26.900       14.633
      64.254        0.000       14.633

Also, the Rust compiler errors sometimes look like this:

error[E0277]: the trait bound `fn(Extension<Pool<PostgresConnectionManager<MakeTlsConnector>>>, CookieJar) -> impl Future<Output = Result<Json<[type error]>, (axum::http::StatusCode, std::string::String)>> {recipes_list}: Handler<_, _>` is not satisfied
   --> src/main.rs:48:39
    |
48  |         .route("/api/v1/recipes", get(recipes_list))
    |                                   --- ^^^^^^^^^^^^ the trait `Handler<_, _>` is not implemented for `fn(Extension<Pool<PostgresConnectionManager<MakeTlsConnector>>>, CookieJar) -> impl Future<Output = Result<Json<[type error]>, (axum::http::StatusCode, std::string::String)>> {recipes_list}`
    |                                   |
    |                                   required by a bound introduced by this call
    |
    = help: the trait `Handler<T, ReqBody>` is implemented for `axum::handler::Layered<S, T>`
note: required by a bound in `axum::routing::get`
   --> /Users/steve/.cargo/registry/src/github.com-1ecc6299db9ec823/axum-0.5.16/src/routing/method_routing.rs:395:1
    |
395 | top_level_handler_fn!(get, GET);
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ required by this bound in `axum::routing::get`
    = note: this error originates in the macro `top_level_handler_fn` (in Nightly builds, run with -Z macro-backtrace for more info)

conclusion

name	RPS	peak CPU	peak memory
sync Python	58	48%	276 MB
async Python	495	160%	143 MB
async Rust	267	53%	15 MB

Switching from sync Python to async Python gives us a roughly 8.5x boost in RPS (58 -> 495) and cuts peak memory usage roughly in half (276 MB -> 143 MB).

Compared to async Python, async Rust gives us a 10x decrease in memory usage (143 MB -> 15 MB) with a 2/3 decrease in peak CPU usage (160% -> 53%), but we get 45% less RPS (495 -> 267).
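As a quick sanity check, the ratios quoted here can be recomputed from the summary table. A minimal sketch; the dict layout and the two helper names are just illustrative:

```python
# Values copied from the summary table: RPS, peak CPU (%), peak memory (MB).
results = {
    "sync Python": {"rps": 58, "cpu": 48, "mem_mb": 276},
    "async Python": {"rps": 495, "cpu": 160, "mem_mb": 143},
    "async Rust": {"rps": 267, "cpu": 53, "mem_mb": 15},
}


def times(small: float, large: float) -> float:
    """How many times larger `large` is than `small`."""
    return large / small


def percent_decrease(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100


sync = results["sync Python"]
apy = results["async Python"]
rust = results["async Rust"]

print(f"RPS, sync -> async Python: {times(sync['rps'], apy['rps']):.1f}x")                  # 8.5x
print(f"memory, sync -> async Python: -{percent_decrease(sync['mem_mb'], apy['mem_mb']):.0f}%")  # -48%
print(f"memory, async Python -> Rust: {times(rust['mem_mb'], apy['mem_mb']):.1f}x smaller")  # 9.5x smaller
print(f"peak CPU, async Python -> Rust: -{percent_decrease(apy['cpu'], rust['cpu']):.0f}%")  # -67%
print(f"RPS, async Python -> Rust: -{percent_decrease(apy['rps'], rust['rps']):.0f}%")       # -46%
```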

Overall, async Python delivers close to an order of magnitude more throughput than sync Python, and async Rust uses about an order of magnitude less memory than async Python.

I’m guessing there is a way to get Tokio to use more of the available CPU / memory, as there are a lot more resources available on the box.
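One way to read the lower Rust RPS: oha was run with a fixed 50 connections (-c 50), so by Little's law the number of requests in flight should be roughly throughput times average latency. A minimal sketch using the Requests/sec and Average values from the oha summaries above; the function names are illustrative:

```python
CONNECTIONS = 50  # oha -c 50

# (requests/sec, average latency in seconds) from the oha summaries.
runs = {
    "sync Python": (58.3474, 0.8512),
    "async Python": (495.3030, 0.1009),
    "async Rust": (267.0533, 0.1870),
}


def implied_concurrency(rps: float, avg_latency_s: float) -> float:
    """Little's law: requests in flight = throughput * average latency."""
    return rps * avg_latency_s


def max_rps(connections: int, avg_latency_s: float) -> float:
    """Throughput ceiling for a fixed number of connections at a given latency."""
    return connections / avg_latency_s


for name, (rps, latency) in runs.items():
    in_flight = implied_concurrency(rps, latency)
    ceiling = max_rps(CONNECTIONS, latency)
    print(f"{name}: ~{in_flight:.1f} in flight, ceiling ~{ceiling:.0f} RPS")
```

The product lands close to 50 for all three runs, which suggests that with 50 connections the throughput was bounded by per-request latency rather than by the CPU on the box. Finding out where the extra latency in the Rust version comes from would take more profiling than I did here.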

Also the code is available on GitHub.