`map` is an ordered structure with O(log n) time searches.
`unordered_map` has O(1) average-time searches and O(n) in the worst
case, where hashes collide into the same bucket and it has to start chaining.
The unordered version should speed up the general case of looking up
constants.
I've added a trivial order-dependent hash (_(0,1) and (1,0) will return
different hashes_) that combines a 128-bit constant into a
64-bit hash which generally will not collide, using a bit-rotate to
preserve entropy.
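A minimal sketch of the combine, with hypothetical names (the in-tree hash may differ in the exact rotation and mixing):
```cpp
#include <cstdint>

// Hypothetical sketch: rotate the upper half by a fixed odd amount before
// XOR-ing, making the combine order-dependent: HashU128(0, 1) != HashU128(1, 0).
constexpr std::uint64_t HashU128(std::uint64_t lower, std::uint64_t upper) {
    const std::uint64_t rotated = (upper << 29) | (upper >> 35);
    return lower ^ rotated;
}
```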
In MSVC, having files with identical filenames will result in massive slowdowns when compiling.
The approach I have taken to resolve this is renaming the identically named files in frontend/(A32, A64) to (a32, a64)_filename.cpp/h.
This makes dynarmic installable, and also adds a CMake package config
file that allows projects to use `find_package(dynarmic)` to import the
library.
I know #636 adds the same thing, but while experimenting with the
different install options in
https://github.com/merryhime/dynarmic/pull/636#discussion_r725656034
I ended up with a working patch, so I'm proposing this as well. This
implements solution 2.
This adds versioning information to the built library.
When building the shared library on Linux systems, a versioned object will
be created: `libdynarmic.so.5`.
This is useful when reasoning about ABI compatibility.
The variables `dynarmic_VERSION` and `dynarmic_VERSION_MAJOR`
are implicitly created when calling `project(dynarmic VERSION x.y.z)`.
Adds all elements of a vector and puts the result into the lowest element.
Accelerates the `addv` instruction with a vectorized implementation
rather than a serial one.
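As a sketch of the idea (SSE2 intrinsics, helper name hypothetical), a four-lane u32 reduction can fold pairwise in log2(n) steps instead of n-1 serial adds:
```cpp
#include <cstdint>
#include <emmintrin.h>

// Hypothetical sketch: horizontal add of four u32 lanes by pairwise folding.
std::uint32_t AddvU32x4(__m128i v) {
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, 0b01001110));   // fold upper 64 bits onto lower
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, 0b00000001));   // fold lane 1 onto lane 0
    return static_cast<std::uint32_t>(_mm_cvtsi128_si32(v));  // sum sits in the lowest lane
}
```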
The lane-splatting variants of `FMUL` and `FMLA` are very
common in instruction streams when implementing things like
matrix multiplication, and where they are used, they tend to appear very densely.
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-3-matrix-multiplication
The way this is currently implemented is by grabbing the particular lane
into a general-purpose register and then broadcasting it into a SIMD
register through `VectorGetElement` and `VectorBroadcast`.
```cpp
const IR::U128 operand2 = v.ir.VectorBroadcast(esize, v.ir.VectorGetElement(esize, v.V(idxdsize, Vm), index));
```
What could be done instead is to keep the value within
the vector register and use a permute/shuffle to "splat" the particular
lane across all the other lanes, removing the GPR round-trip.
This is implemented as the new IR instruction `VectorBroadcastElement`:
```cpp
const IR::U128 operand2 = v.ir.VectorBroadcastElement(esize, v.V(idxdsize, Vm), index);
```
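On x64 the splat can then lower to a single shuffle; a hypothetical sketch for 32-bit lanes (the actual emission in the backend may differ):
```cpp
#include <xmmintrin.h>

// Hypothetical sketch: broadcast lane 1 of a 4 x f32 vector across all
// lanes with one shuffle, with no GPR round-trip.
__m128 SplatLane1(__m128 v) {
    return _mm_shuffle_ps(v, v, 0b01010101);  // select lane 1 for every output lane
}
```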
Recursive calls to `Replicate` beyond the first call might
cause an unintentional up-cast to an `int` type due
to the `|` and `<<` operations on types such as `uint8_t` and `uint16_t`.
This makes sure calls such as `Replicate<u8>` stay as the `u8` type
throughout.
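A minimal sketch of the hazard and the fix (signature assumed; the real helper may differ):
```cpp
#include <cstddef>
#include <cstdint>

// Without the static_cast, `(value << element_size) | value` on a
// std::uint8_t is computed as int due to integral promotion, and the
// widened type would propagate through every recursive call.
template <typename T>
T Replicate(T value, std::size_t element_size) {
    if (element_size >= sizeof(T) * 8) {
        return value;
    }
    return Replicate<T>(static_cast<T>((value << element_size) | value), element_size * 2);
}

// e.g. Replicate<std::uint8_t>(0b01, 2) == 0b01010101
```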
Math operations such as matrix multiplication utilize these particular
instructions enough that there should be some unit tests for them in particular.
The lane-splatting forms of the FMUL and FMLA instructions are of particular
interest, and I've found them to be very common in retail game binaries
such as Pokemon Sword.
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-3-matrix-multiplication
I'm primarily adding this unit test so that I can ensure compatibility
while I tune and optimize them.
xbyak is intended to be installed in /usr/local/include/xbyak.
Since we would rather not install xbyak before using it, we copy the headers
into the appropriate directory structure and use that instead.
AVX512 introduces the _unsigned_ variant of float-to-integer conversion
functions via `vcvttp{sd}2u{dq}q`. In the case that a value is not
representable as an unsigned integer, it will result in `0xFFFFF...`
which can be utilized to get "free" saturation when the floating point
value exceeds the unsigned range, after masking away negative values.
https://www.felixcloutier.com/x86/vcvttps2udq
https://www.felixcloutier.com/x86/vcvttpd2uqq
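A sketch of the trick for the f32 case (AVX512F+VL intrinsics; wrapper name hypothetical):
```cpp
#include <immintrin.h>

// Lanes above the u32 range already convert to 0xFFFFFFFF, i.e. they
// saturate for free; only negative lanes need clamping to zero first.
__m128i ToU32Saturating(__m128 x) {
    const __m128 non_negative = _mm_max_ps(x, _mm_setzero_ps());
    return _mm_cvttps_epu32(non_negative);  // vcvttps2udq
}
```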
This PR also speeds up the _signed_ conversion function for fp64->int64
https://www.felixcloutier.com/x86/vcvttpd2qq
And(a, Not(b)) is a common enough operation that it can
be fused into a single `AndNot` operation. On x64 this is also
a single `pandn` instruction rather than two.
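For illustration (wrapper name hypothetical), the operand order is the subtle part:
```cpp
#include <emmintrin.h>

// pandn computes (~first) & second, so And(a, Not(b)) maps to
// _mm_andnot_si128(b, a) with the operands swapped.
__m128i AndNot(__m128i a, __m128i b) {
    return _mm_andnot_si128(b, a);  // a & ~b
}
```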
This implementation exists within the unsafe optimization paths and
utilizes the 14-bit-precision `vrsqrt14*` and `vrcp14*`
instructions provided by AVX512F+VL. These are _more_ accurate than
the fallback path and the current `rsqrt`-based unsafe code-path,
but still fall in line with what is expected of the
`Unsafe_ReducedErrorFP` optimization flag.
With AVX512 available, these functions have 14 bits of precision;
without it, they have 11 bits of precision.
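For reference, the 14-bit estimates are exposed as intrinsics (AVX512F+VL; wrapper names hypothetical):
```cpp
#include <immintrin.h>

// 14-bit-precision estimates (vrsqrt14ps / vrcp14ps), versus the 11-bit
// rsqrtps / rcpps estimates used when AVX512 is unavailable.
__m128 ApproxRsqrt(__m128 x) { return _mm_rsqrt14_ps(x); }
__m128 ApproxRecip(__m128 x) { return _mm_rcp14_ps(x); }
```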