First, a bit of context: I run a side project called Valar, which is essentially a cluster-scale, systemd-socket-activation-style container engine. I’m currently focusing on performance improvements, and one of my targets was the proxy component. It terminates TLS connections and hands the HTTP requests off to their respective handlers.
While doing some load testing, I discovered that my proxy spent about 70% of its CPU time multiplying big.Ints for the TLS handshake, which seemed excessive. I published a little demo repository where you can try it yourself.
Looking at a profile of a 10s load test with RSA private keys (4096 bit):

Showing nodes accounting for 12.89s, 92.01% of 14.01s total
Dropped 218 nodes (cum <= 0.07s)
Showing top 10 nodes out of 69
      flat  flat%   sum%        cum   cum%
    10.34s 73.80% 73.80%     10.34s 73.80%  math/big.addMulVVW
     1.75s 12.49% 86.30%     12.22s 87.22%  math/big.nat.montgomery
     0.20s  1.43% 87.72%      0.20s  1.43%  math/big.mulAddVWW
     0.13s  0.93% 88.65%      0.13s  0.93%  math/big.subVV
87%! Just for setting up a secure connection. What about ECDSA with P-256?
Showing nodes accounting for 1590ms, 42.06% of 3780ms total
Dropped 170 nodes (cum <= 18.90ms)
Showing top 10 nodes out of 308
      flat  flat%   sum%        cum   cum%
     690ms 18.25% 18.25%      690ms 18.25%  vendor/golang.org/x/crypto/curve25519.ladderstep
     210ms  5.56% 23.81%      210ms  5.56%  crypto/sha256.block
     150ms  3.97% 27.78%      150ms  3.97%  runtime.futex
     100ms  2.65% 30.42%      660ms 17.46%  runtime.mallocgc
      80ms  2.12% 32.54%      470ms 12.43%  runtime.gcDrain
      80ms  2.12% 34.66%      120ms  3.17%  syscall.Syscall
18%! I guess that’s better. Finally, what about Ed25519?
Showing nodes accounting for 1640ms, 45.81% of 3580ms total
Dropped 156 nodes (cum <= 17.90ms)
Showing top 10 nodes out of 272
      flat  flat%   sum%        cum   cum%
     690ms 19.27% 19.27%      690ms 19.27%  vendor/golang.org/x/crypto/curve25519.ladderstep
     280ms  7.82% 27.09%      280ms  7.82%  crypto/sha256.block
     120ms  3.35% 30.45%      120ms  3.35%  runtime.futex
     110ms  3.07% 33.52%      110ms  3.07%  runtime.procyield
19%! The hot spot is the same as in the ECDSA case: the curve25519 ladder step belongs to the X25519 key exchange, which both configurations share.
418_623_557_911 cycles  # RSA
 21_593_305_599 cycles  # ECDSA
 21_083_496_865 cycles  # Ed25519
So we save about 95% of CPU cycles when we switch from RSA to ECDSA with Go’s TLS package. ECDSA is also widely supported, especially compared to Ed25519. In the case of my original proxy problem, re-issuing the certificates with new private keys increased throughput by about 30% in high-load scenarios, and the bottleneck is now another system component.
The PSA part
For many services, RSA is still the default. I ran into the original issue because my Let’s Encrypt certificates all used RSA private keys. You can switch to ECDSA by simply supplying --key-type ecdsa, as described in the certbot docs. Popular tools like the reverse proxy Traefik also default to RSA private keys, unnecessarily slowing down your system and prolonging TLS handshakes. So unless you explicitly need to serve clients that only support outdated crypto, you should probably switch.