Yes, it is 20 times slower on x86* than it should be, because xxhash.c
always uses "safe" memcpy()-based methods for unaligned memory access
(XXH_readXX) regardless of input alignment, due to the x86-default
XXH_FORCE_ALIGN_CHECK=0. This results in real memcpy() calls in the hot
path (even with -O2).
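For reference, the default read path looks roughly like this (a close
paraphrase of the portable fallback in xxhash.c, not the verbatim
source; types simplified to stdint):

    #include <string.h>
    #include <stdint.h>

    /* Portable fallback selected when XXH_FORCE_MEMORY_ACCESS is unset:
     * read 4 bytes through memcpy() so unaligned input is always safe.
     * Here this compiles to a real memcpy() call inside the hash loop
     * instead of a single load instruction. */
    static uint32_t XXH_read32(const void* memPtr)
    {
        uint32_t val;
        memcpy(&val, memPtr, sizeof(val));
        return val;
    }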
The bug affects Alpine x86* (not just edge but at least 3.8 too, i.e.
this is not something introduced in 0.6.5) for both aligned and
unaligned inputs. Other architectures are severely affected for
unaligned inputs only.
The fix lifts the XXH_FORCE_MEMORY_ACCESS=1 condition so that the
XXH_readXX methods based on __attribute__((__packed__)) are enabled
everywhere except ARMv6 (which is covered by its own case earlier).
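Sketched in code, the selection logic after the fix is roughly the
following (illustrative; the exact preprocessor conditions in the
patched xxhash.c may differ):

    /* Keep the existing ARMv6 case, then default everyone else on GCC
     * to the packed-struct methods instead of the memcpy() fallback. */
    #ifndef XXH_FORCE_MEMORY_ACCESS
    #  if defined(__GNUC__) && ( defined(__ARM_ARCH_6__) \
          || defined(__ARM_ARCH_6J__) || defined(__ARM_ARCH_6K__) \
          || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) \
          || defined(__ARM_ARCH_6T2__) )
    #    define XXH_FORCE_MEMORY_ACCESS 2  /* ARMv6: its own earlier case */
    #  elif defined(__GNUC__)
    #    define XXH_FORCE_MEMORY_ACCESS 1  /* packed-struct XXH_readXX */
    #  endif
    #endif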
This is safe and fast because the compiler will either:
- use direct storage access instructions on capable architectures such
  as aarch64, armv7, ppc64le, s390x, x86*, regardless of input
  alignment;
- use relatively fast LWL/LWR instructions on mips* with unaligned
  input;
- or use byte loads/stores and shifts/ors on armel with unaligned
  input, which is still faster than a memcpy() call.
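The packed-struct read behind all three cases above is essentially
this (a minimal sketch; the union name is illustrative, and the real
xxhash.c uses its own U32 typedef and covers 64-bit reads the same
way):

    #include <stdint.h>

    /* Marking the member packed tells the compiler the pointer may be
     * unaligned, so it picks the best native sequence itself (a plain
     * load on x86*, LWL/LWR on mips*, byte loads plus shifts/ors on
     * armel) with no library call. */
    typedef union {
        uint32_t u32;
    } __attribute__((__packed__)) xxh_unalign32;

    static uint32_t XXH_read32(const void* ptr)
    {
        return ((const xxh_unalign32*)ptr)->u32;
    }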
All aports that use xxhash.c are likely affected. For example,
community/zstd suffers too, though not as gravely (~15% difference for
"zstd -t" on a big archive), and main/lz4 is twice as slow on basic
compression levels.
Other aport changes:
- modernize;
- enable check(); it is short and fast, so it is suitable for slow
  builders too.
The Python part is left intact, though a newer version exists.