| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
|
|
|
|
| |
The performance impact is not measurable, as the compiler loads these variables
in xmm registers in unrolled loops anyway.
However, we avoid loading these sensitive keys onto the stack. This happens for
larger key schedules, where the register count is insufficient. If that key
material is not on the stack, we can avoid to wipe it explicitly after
crypto operations.
|
|
|
|
|
|
| |
While the required members are aligned in the struct as required, on 32-bit
platforms the allocator aligns the structures itself to 8 bytes only. This
results in non-aligned struct members, and invalid memory accesses.
|
|
|
|
|
|
|
| |
CTR can be parallelized, and we do so by queueing instructions to the processor
pipeline. While we have enough registers for 128-bit decryption, the register
count is insufficient to hold all variables with larger key sizes. Nonetheless
is 4-way parallelism faster, depending on key size between ~10% and ~25%.
|
|
|
|
|
| |
This allows us to unroll loops and hold the key schedule in local (register)
variables. This brings an impressive speedup of ~45%.
|
|
|