Additionally changed the numeric values of the test vectors in the
32-bit element tests to match the pattern of the other tests - this
makes it easier to tell at a glance what elements are out of place if a
test fails.
Math operations such as Matrix multiplication utilize these particular
instructions enough that there should be some unit tests for thesein particular.
The lane-splatting form of FMUL and FMLA instructions are of particular
interest and I've found them to be very common in retail game binaries
such as Pokemon Sword.
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-3-matrix-multiplication
I'm primarily adding this unit test so that I can ensure compatibility
while I tune and optimize them.
We might want to allocate different sizes for each of them.
e.g. for the unsafe fastmem approach without bounds checking.
Or for using the full 48bit adress range (with mirrors) by allocating our real arena as close to 1<<47 as possible.
Tests the TBL instruction with implementation with {1-4} register
lookups and the handling of out-of-bound indices.
Intended to target the implementation of VectorTableLookup128
* Return both the upper and lower parts of the multiply if required
* SSE2 does not support the pmuldq instruction, do sign correction to an unsigned result instead
* Improve port utilisation where possible (punpck instructions were a bottleneck)
x64 rounds before flushing to zero
AArch64 rounds after flushing to zero
This difference of behaviour is noticable if something would round to a smallest normalized number
Discovered by @Subv.
Fixes incomplete fix begun in 5a91c94dca47c9702dee20fbd5ae1f4c07eef9df.
That fix fails to take into account that LinkBlock doesn't update ticks until there
are no remaining ticks to be executed.
Test added to confirm fix.