A long, long time ago, we took some benchmarks for Kyber on an STM32 NUCLEO-F446ZE. Back then, it was the NIST submission of Kyber, and we were using the PQM4 implementation integrated into wolfCrypt. Now, Kyber has evolved into ML-KEM, and we have our own implementation! We decided to take some benchmarks on a newer STM32 hardware platform as well. We now also have our own implementation of ML-DSA, which evolved from Dilithium, so we took benchmark numbers for it too.
Here are the numbers (some formatting changes have been made for readability):
RSA 2048 public 112 ops took 1.012 sec, avg 9.036 ms, 110.672 ops/sec
RSA 2048 private 4 ops took 1.298 sec, avg 324.500 ms, 3.082 ops/sec
DH 2048 key gen 7 ops took 1.150 sec, avg 164.286 ms, 6.087 ops/sec
DH 2048 agree 8 ops took 1.310 sec, avg 163.750 ms, 6.107 ops/sec
ML-KEM 512 key gen 248 ops took 1.000 sec, avg 4.032 ms, 248.000 ops/sec
ML-KEM 512 encap 262 ops took 1.000 sec, avg 3.817 ms, 262.000 ops/sec
ML-KEM 512 decap 198 ops took 1.000 sec, avg 5.051 ms, 198.000 ops/sec
ML-KEM 768 key gen 154 ops took 1.004 sec, avg 6.519 ms, 153.386 ops/sec
ML-KEM 768 encap 154 ops took 1.012 sec, avg 6.571 ms, 152.174 ops/sec
ML-KEM 768 decap 120 ops took 1.000 sec, avg 8.333 ms, 120.000 ops/sec
ML-KEM 1024 key gen 94 ops took 1.008 sec, avg 10.723 ms, 93.254 ops/sec
ML-KEM 1024 encap 94 ops took 1.016 sec, avg 10.809 ms, 92.520 ops/sec
ML-KEM 1024 decap 78 ops took 1.024 sec, avg 13.128 ms, 76.172 ops/sec
ECC [SECP256R1] key gen 180 ops took 1.007 sec, avg 5.594 ms, 178.749 ops/sec
ECDH [SECP256R1] agree 86 ops took 1.016 sec, avg 11.814 ms, 84.646 ops/sec
ECDSA [SECP256R1] sign 106 ops took 1.000 sec, avg 9.434 ms, 106.000 ops/sec
ECDSA [SECP256R1] verify 60 ops took 1.012 sec, avg 16.867 ms, 59.289 ops/sec
ML-DSA 44 key gen 52 ops took 1.011 sec, avg 19.442 ms, 51.434 ops/sec
ML-DSA 44 sign 18 ops took 1.086 sec, avg 60.333 ms, 16.575 ops/sec
ML-DSA 44 verify 46 ops took 1.008 sec, avg 21.913 ms, 45.635 ops/sec
ML-DSA 65 key gen 30 ops took 1.035 sec, avg 34.500 ms, 28.986 ops/sec
ML-DSA 65 sign 12 ops took 1.008 sec, avg 84.000 ms, 11.905 ops/sec
ML-DSA 65 verify 28 ops took 1.027 sec, avg 36.679 ms, 27.264 ops/sec
ML-DSA 87 key gen 18 ops took 1.047 sec, avg 58.167 ms, 17.192 ops/sec
ML-DSA 87 sign 10 ops took 1.255 sec, avg 125.500 ms, 7.968 ops/sec
ML-DSA 87 verify 16 ops took 1.003 sec, avg 62.687 ms, 15.952 ops/sec
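The three figures on each line are tied together by simple arithmetic: the average is the elapsed time divided by the operation count, and the rate is the operation count divided by the elapsed time. A minimal sketch in C for cross-checking a line follows. The helper names are hypothetical and are not part of the wolfCrypt benchmark source.

```c
#include <assert.h>

/* Hypothetical helpers for illustration only; not wolfCrypt API. */

/* Average time per operation, in milliseconds. */
double bench_avg_ms(int ops, double seconds)
{
    assert(ops > 0 && seconds > 0.0);
    return seconds * 1000.0 / (double)ops;
}

/* Throughput, in operations per second. */
double bench_ops_per_sec(int ops, double seconds)
{
    assert(ops > 0 && seconds > 0.0);
    return (double)ops / seconds;
}
```

For the RSA 2048 private line (4 ops in 1.298 sec), bench_avg_ms(4, 1.298) gives about 324.5 ms and bench_ops_per_sec(4, 1.298) gives about 3.08 ops/sec, in line with the reported figures. The printed elapsed time is rounded to three decimals, so recomputed values for other lines may differ in the last digit.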
This was done on an STM32 NUCLEO-F439ZI with an ARM Cortex-M4 running at 168 MHz. The wolfSSL library was built with assembly optimizations, but does not use any hardware-accelerated cryptography. Note: at the time of this writing, the ML-DSA (Dilithium) implementation does not use assembly optimizations, just well-constructed C code.
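Since the clock speed is known, an average time can be converted into an approximate cycle count, which is handy when comparing against cycle-based results published elsewhere. The sketch below assumes the core ran at a constant 168 MHz for the whole run and that timer overhead is negligible. The function name is hypothetical.

```c
/* Hypothetical helper for illustration only; not wolfCrypt API.
 * Converts an average time in milliseconds to an approximate cycle
 * count, assuming a constant core clock given in MHz. */
double bench_ms_to_cycles(double avg_ms, double clock_mhz)
{
    /* 1 ms at 1 MHz is 1000 cycles. */
    return avg_ms * clock_mhz * 1000.0;
}
```

Under those assumptions, the ML-KEM 512 encap average of 3.817 ms works out to roughly 641,000 cycles at 168 MHz.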
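- ML-DSA signing beats RSA 2048 private-key operations quite nicely (60.333 ms for ML-DSA 44 versus 324.500 ms), and ML-DSA 44 is within an order of magnitude of ECDSA for both sign and verify.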
- ML-KEM beats DH and ECDH by a wide margin (thanks to assembly code for Thumb2).
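To put rough numbers on these comparisons, the ratio of two average times from the results can be computed directly. The helper below is a hypothetical sketch, not wolfCrypt API. It returns how many times faster the candidate is than the baseline.

```c
#include <assert.h>

/* Hypothetical helper for illustration only; not wolfCrypt API.
 * Returns baseline_ms / candidate_ms: values above 1.0 mean the
 * candidate is faster than the baseline by that factor. */
double bench_speedup(double baseline_ms, double candidate_ms)
{
    assert(baseline_ms > 0.0 && candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}
```

Using the averages reported here: ML-DSA 44 sign against RSA 2048 private is bench_speedup(324.500, 60.333), about 5.4x. ML-KEM 512 encap against ECDH agree is bench_speedup(11.814, 3.817), about 3.1x, and against DH 2048 agree it is bench_speedup(163.750, 3.817), about 43x. In the other direction, ECDSA sign is about 6.4x faster than ML-DSA 44 sign (60.333 / 9.434).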
Here are some special macro flags that were defined:
#define WOLFSSL_SP_ARM_CORTEX_M_ASM
#define WOLFSSL_HAVE_SP_RSA
#define WOLFSSL_HAVE_SP_ECC
#define WOLFSSL_SP_SMALL
#define SP_WORD_SIZE 32
#define GCM_TABLE_4BIT
#define HAVE_DILITHIUM
#define WOLFSSL_WC_DILITHIUM
#define WOLFSSL_DILITHIUM_SMALL
#define WOLFSSL_ARMASM
#define WOLFSSL_ARMASM_INLINE
#define WOLFSSL_ARMASM_NO_HW_CRYPTO
#define WOLFSSL_ARMASM_NO_NEON
#define WOLFSSL_ARMASM_THUMB2
#define WOLFSSL_ARM_ARCH 7
We support assembly optimizations on most algorithms and key sizes with Intel x86/x64, ARM Cortex-A/M/R, RISC-V and PowerPC.
If you are interested in seeing other algorithms benchmarked, or have questions about any of the above, please reach out to us at facts@wolfssl.com or call us at +1 425 245 8247 to let us know which ones!