First, a bit of context: I run a side project called Valar, which is essentially a cluster-scale, systemd-socket-activation-style container engine. I’m currently focusing on performance improvements, and one of my targets was the proxy component. It terminates TLS connections and hands the HTTP requests off to their respective handlers.
While doing some load testing, I discovered that my proxy spent about 70% of its CPU time multiplying big.Ints for the TLS handshake, which seemed excessive. I published a little demo repository where you can try it yourself.
Looking at a profile of a 10s load test with RSA private keys (4096 bit):

Showing nodes accounting for 12.89s, 92.01% of 14.01s total
Dropped 218 nodes (cum <= 0.07s)
Showing top 10 nodes out of 69
      flat  flat%   sum%        cum   cum%
    10.34s 73.80% 73.80%     10.34s 73.80%  math/big.addMulVVW
     1.75s 12.49% 86.30%     12.22s 87.22%  math/big.nat.montgomery
     0.20s  1.43% 87.72%      0.20s  1.43%  math/big.mulAddVWW
     0.13s  0.93% 88.65%      0.13s  0.93%  math/big.subVV
87%! Just for setting up a secure connection. What about ECDSA with P-256?
Showing nodes accounting for 1590ms, 42.06% of 3780ms total
Dropped 170 nodes (cum <= 18.90ms)
Showing top 10 nodes out of 308
      flat  flat%   sum%        cum   cum%
     690ms 18.25% 18.25%      690ms 18.25%  vendor/golang.org/x/crypto/curve25519.ladderstep
     210ms  5.56% 23.81%      210ms  5.56%  crypto/sha256.block
     150ms  3.97% 27.78%      150ms  3.97%  runtime.futex
     100ms  2.65% 30.42%      660ms 17.46%  runtime.mallocgc
      80ms  2.12% 32.54%      470ms 12.43%  runtime.gcDrain
      80ms  2.12% 34.66%      120ms  3.17%  syscall.Syscall
18%! I guess that’s better. Finally, what about Ed25519?
Showing nodes accounting for 1640ms, 45.81% of 3580ms total
Dropped 156 nodes (cum <= 17.90ms)
Showing top 10 nodes out of 272
      flat  flat%   sum%        cum   cum%
     690ms 19.27% 19.27%      690ms 19.27%  vendor/golang.org/x/crypto/curve25519.ladderstep
     280ms  7.82% 27.09%      280ms  7.82%  crypto/sha256.block
     120ms  3.35% 30.45%      120ms  3.35%  runtime.futex
     110ms  3.07% 33.52%      110ms  3.07%  runtime.procyield
19%! The hot spot is the same as in the ECDSA case: the curve25519 ladder step belongs to the X25519 key exchange, which both configurations share.
418_623_557_911 cycles  # RSA
 21_593_305_599 cycles  # ECDSA
 21_083_496_865 cycles  # Ed25519
So we save about 95% of CPU cycles when we switch from RSA to ECDSA with Go’s TLS package. ECDSA is also widely supported, especially compared to Ed25519. In the case of my original proxy problem, re-issuing the certificates with new private keys increased throughput by about 30% in high-load scenarios, and the bottleneck is now another system component.
The PSA part
For many services, RSA is still the default. I ran into the original issue because my Let’s Encrypt certificates all used RSA private keys. You can switch to ECDSA by simply supplying --key-type ecdsa, as described in the certbot docs. Popular tools like the reverse proxy Traefik also default to RSA private keys, unnecessarily slowing down your system and prolonging TLS handshakes. So unless you explicitly need to serve clients that only support outdated crypto, you should probably switch.