First, a bit of context: I run a side project called Valar, which is essentially a cluster-scale, systemd-socket-activation-style container engine. I’m currently focusing on performance improvements, and one of my targets was the proxy component. It terminates TLS connections and hands the HTTP requests off to their respective handlers.

While doing some load testing, I found that my proxy spent about 70% of its CPU time multiplying big.Int values for the TLS handshake, which seemed excessive. I published a little demo repository where you can try it yourself.

Here is a profile of a 10-second load test with 4096-bit RSA private keys:

Showing nodes accounting for 12.89s, 92.01% of 14.01s total
Dropped 218 nodes (cum <= 0.07s)
Showing top 10 nodes out of 69
      flat  flat%   sum%        cum   cum%
    10.34s 73.80% 73.80%     10.34s 73.80%  math/big.addMulVVW
     1.75s 12.49% 86.30%     12.22s 87.22%  math/big.nat.montgomery
     0.20s  1.43% 87.72%      0.20s  1.43%  math/big.mulAddVWW
     0.13s  0.93% 88.65%      0.13s  0.93%  math/big.subVV

87%! Just for setting up a secure connection. What about ECDSA with P-256?

Showing nodes accounting for 1590ms, 42.06% of 3780ms total
Dropped 170 nodes (cum <= 18.90ms)
Showing top 10 nodes out of 308
      flat  flat%   sum%        cum   cum%
     690ms 18.25% 18.25%      690ms 18.25%  vendor/
     210ms  5.56% 23.81%      210ms  5.56%  crypto/sha256.block
     150ms  3.97% 27.78%      150ms  3.97%  runtime.futex
     100ms  2.65% 30.42%      660ms 17.46%  runtime.mallocgc
      80ms  2.12% 32.54%      470ms 12.43%  runtime.gcDrain
      80ms  2.12% 34.66%      120ms  3.17%  syscall.Syscall

18%! I guess that’s better. Finally, what about Ed25519?

Showing nodes accounting for 1640ms, 45.81% of 3580ms total
Dropped 156 nodes (cum <= 17.90ms)
Showing top 10 nodes out of 272
      flat  flat%   sum%        cum   cum%
     690ms 19.27% 19.27%      690ms 19.27%  vendor/
     280ms  7.82% 27.09%      280ms  7.82%  crypto/sha256.block
     120ms  3.35% 30.45%      120ms  3.35%  runtime.futex
     110ms  3.07% 33.52%      110ms  3.07%  runtime.procyield

19%! It seems to use the same optimization as the ECDSA one.

418_623_557_911      cycles       # RSA
 21_593_305_599      cycles       # ECDSA
 21_083_496_865      cycles       # ED25519

So we save about 95% of CPU cycles by switching from RSA to ECDSA when using Go’s TLS package. ECDSA is also widely supported, especially compared to Ed25519. In the case of my original proxy problem, re-issuing the certificates with new private keys increased throughput by about 30% in high-load scenarios, and the bottleneck has now moved to another system component.

The PSA part

For many services, RSA is still the default. I ran into the original issue because my Let’s Encrypt certificates all used RSA private keys. You can switch to ECDSA by simply supplying --key-type ecdsa as described in the certbot docs. Popular tools like the reverse-proxy traefik default to using RSA private keys, unnecessarily slowing down your system and prolonging TLS handshakes. So in case you don’t explicitly serve folks which prefer to use outdated crypto, you should probably switch.