Faster No Assembly ChaCha20

At wolfSSL, we always try to get you the best performance possible. Most of the time, the best way to achieve this is with assembly optimization. Unfortunately, dedicated assembly tuning is platform-specific and time-consuming, so it is not always available for your platform. But there are still many ways to squeeze performance out of algorithms with just C. We have achieved a speedup of about 60% in ChaCha20 without using assembly! This was accomplished by performing the exclusive-or (XOR) of the keystream with the input on the largest word size supported by the system, rather than one byte at a time. The optimization was merged in https://github.com/wolfSSL/wolfssl/pull/6203 and first became available in wolfSSL 5.6.2.
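To illustrate the idea, here is a minimal sketch of a word-at-a-time XOR in portable C. It is not the code from the pull request: the function name xor_words and the choice of size_t as the "largest word" are assumptions made for this example. It uses memcpy for the loads and stores so that it stays correct on unaligned buffers, and compilers typically turn those small fixed-size copies into single load and store instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not the wolfSSL API): XOR `len` bytes of `in` with
 * `keystream` and write the result to `out`. `out` may be the same buffer
 * as `in` for in-place use. */
static void xor_words(uint8_t *out, const uint8_t *in,
                      const uint8_t *keystream, size_t len)
{
    size_t i = 0;

    assert(out != NULL && in != NULL && keystream != NULL);

    /* Bulk path: handle sizeof(size_t) bytes per iteration. size_t is used
     * here as a stand-in for the native word size (an assumption). The
     * memcpy calls avoid alignment and strict-aliasing problems. */
    for (; i + sizeof(size_t) <= len; i += sizeof(size_t)) {
        size_t a, b;
        memcpy(&a, in + i, sizeof a);
        memcpy(&b, keystream + i, sizeof b);
        a ^= b;
        memcpy(out + i, &a, sizeof a);
    }

    /* Tail: fewer than sizeof(size_t) bytes remain, finish byte by byte. */
    for (; i < len; i++)
        out[i] = (uint8_t)(in[i] ^ keystream[i]);
}
```

On a 64-bit system the bulk loop above does one XOR for every eight bytes instead of eight separate byte XORs, which is where the savings in this kind of change come from.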

The benchmark was performed on an x86_64 machine with an AMD Ryzen 5 2600 processor. wolfSSL was configured with a plain ./configure, so no assembly optimizations were enabled.

These are the results without this optimization:

$ ./wolfcrypt/benchmark/benchmark -chacha20
------------------------------------------------------------------------------
 wolfSSL version 5.5.4
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
CHACHA                 	294 MB took 1.016 seconds,  288.985 MB/s Cycles per byte =  11.77
Benchmark complete
$ ./wolfcrypt/benchmark/benchmark -chacha20
------------------------------------------------------------------------------
 wolfSSL version 5.5.4
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
CHACHA                 	278 MB took 1.017 seconds,  273.204 MB/s Cycles per byte =  12.45
Benchmark complete
$ ./wolfcrypt/benchmark/benchmark -chacha20
------------------------------------------------------------------------------
 wolfSSL version 5.5.4
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
CHACHA                 	299 MB took 1.006 seconds,  297.137 MB/s Cycles per byte =  11.44
Benchmark complete

These are the results with this optimization:

$ ./wolfcrypt/benchmark/benchmark -chacha20
------------------------------------------------------------------------------
 wolfSSL version 5.5.4
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
CHACHA                 	451 MB took 1.010 seconds,  446.210 MB/s Cycles per byte =   7.62
Benchmark complete
$ ./wolfcrypt/benchmark/benchmark -chacha20
------------------------------------------------------------------------------
 wolfSSL version 5.5.4
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
CHACHA                 	472 MB took 1.003 seconds,  470.584 MB/s Cycles per byte =   7.23
Benchmark complete
$ ./wolfcrypt/benchmark/benchmark -chacha20
------------------------------------------------------------------------------
 wolfSSL version 5.5.4
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
CHACHA                 	472 MB took 1.004 seconds,  470.026 MB/s Cycles per byte =   7.23
Benchmark complete
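
For readers who want to check the speedup themselves, the short sketch below averages the three runs of each build and computes the relative improvement. The throughput values are copied from the benchmark output above. The helper names mean and speedup_percent are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n throughput samples (MB/s). */
static double mean(const double *samples, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Relative speedup of `after` over `before`, in percent. */
static double speedup_percent(double before, double after)
{
    return (after / before - 1.0) * 100.0;
}

/* Throughput figures from the benchmark runs above (MB/s). */
static const double before_mbps[] = { 288.985, 273.204, 297.137 };
static const double after_mbps[]  = { 446.210, 470.584, 470.026 };
```

With these numbers, the means come out to about 286 MB/s without the optimization and about 462 MB/s with it, a speedup of about 61%.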

If you have questions about the performance of the wolfSSL embedded TLS library, please contact us at facts@wolfSSL.com or call us at +1 425 245 8247.
