76 Commits

Author SHA1 Message Date
merry
bb713194a0 backend/x64: Implement SHA256 polyfills 2022-03-20 13:59:18 +00:00
merry
98cff8dd0d IR: Implement SHA256MessageSchedule{0,1} 2022-03-20 13:59:18 +00:00
merry
f0a4bf1f6a IR: Implement SHA256Hash 2022-03-20 13:59:18 +00:00
merry
a4daad6336 block_of_code: Add HostFeature SHA 2022-03-20 00:13:03 +00:00
Merry
bcfe377aaa x64/reg_alloc: More zero extension paranoia 2022-03-06 12:24:50 +00:00
Merry
316b95bb3f {a32,a64}_emit_x64_memory: Zero extension paranoia 2022-03-06 12:10:40 +00:00
Merry
0fd32c5fa4 a64_emit_x64_memory: Fix bug in 128 bit exclusive write fallback 2022-02-28 19:53:43 +00:00
merry
5ea2b49ef0
backend/x64: Inline exclusive memory access operations (#664)
* a64_emit_x64_memory: Add Unsafe_IgnoreGlobalMonitor optimization

* a32_emit_x64_memory: Add Unsafe_IgnoreGlobalMonitor optimization

* a32_emit_x64_memory: Remove dead code

* {a32,a64}_emit_x64_memory: Also verify vaddr in Exclusive{Read,Write}MemoryInlineUnsafe

* a64_emit_x64_memory: Full fallback for ExclusiveWriteMemoryInlineUnsafe

* a64_emit_x64_memory: Inline full locking

* a64_emit_x64_memory: Allow inlined locking to be optionally removed

* spin_lock: Use xbyak instead of inline asm

* a64_emit_x64_memory: Recompile on exclusive fastmem failure

* Avoid variable shadowing

* a32_emit_x64_memory: Implement recompilation

* Fix recompilation

* spin_lock: Clang format fix

* fix fallback function calls
2022-02-28 08:13:10 +00:00
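
The spin_lock change in the PR above replaces inline asm with xbyak-emitted code. As a rough conceptual sketch (plain C++ with std::atomic, not the actual emitted x64):

```cpp
#include <atomic>

// Conceptual sketch only: the emitted code spins on an atomic exchange until
// the lock word transitions 0 -> 1, and releases by storing 0 again.
struct SpinLock {
    std::atomic<int> storage{0};

    void Lock() {
        while (storage.exchange(1, std::memory_order_acquire) != 0) {
            // busy-wait; the real emitted loop would typically include `pause`
        }
    }

    void Unlock() {
        storage.store(0, std::memory_order_release);
    }
};
```
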
merry
0a11e79b55 backend/x64: Ensure all HostCalls are appropriately zero-extended 2022-02-27 20:04:44 +00:00
merry
6c4fa780e0 {a32,a64}_emit_x64_memory: Ensure return values of fastmem callbacks are zero-extended 2022-02-27 19:58:23 +00:00
merry
593de127d2 a64_emit_x64: Clear fastmem patch information on ClearCache 2022-02-27 19:50:05 +00:00
Merry
c90173151e backend/x64: Split off memory emitters 2022-02-26 21:25:09 +00:00
Merry
19a423034e block_of_code: Fix inaccurate size reporting in SpaceRemaining
Typo: getCode should be getCurr. Instead of comparing against the current pointer,
we were incorrectly comparing against the start of memory.
2022-02-26 16:09:11 +00:00
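
To illustrate the fix above, a minimal sketch (the function signature is hypothetical; getCode()/getCurr() are xbyak accessors):

```cpp
#include <cstddef>
#include <xbyak/xbyak.h>

// Remaining space must be measured from the current emission pointer
// (getCurr), not from the start of the buffer (getCode), which was the typo.
std::size_t SpaceRemaining(const Xbyak::CodeGenerator& code, std::size_t total_size) {
    const std::size_t used = static_cast<std::size_t>(code.getCurr() - code.getCode());
    return total_size > used ? total_size - used : 0;
}
```
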
Merry
ea08a389b4 emit_x64_floating_point: EmitFPToFixed: No need to round if rounding_mode == TowardsZero
cvttsd2si truncates during operation
2022-02-23 20:44:02 +00:00
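
A tiny sketch of why the pre-round is redundant in the TowardsZero case (illustrative, using an SSE2 intrinsic rather than dynarmic's emitter):

```cpp
#include <immintrin.h>
#include <cstdint>

// cvttsd2si truncates (rounds toward zero) as part of the conversion itself,
// so no separate rounding step is needed when rounding_mode == TowardsZero.
std::int64_t ToFixed64TowardsZero(double value) {
    return _mm_cvttsd_si64(_mm_set_sd(value));
}
```
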
merry
b34214f953 emit_x64_floating_point: Improve EmitFPToFixed codegen 2022-02-23 19:42:15 +00:00
merry
5fe274f510 emit_x64_floating_point: Deinterlace 64-bit FPToFixed signed/unsigned codepaths 2022-02-23 19:14:41 +00:00
merry
b8dd1c7510 emit_x64_floating_point: Correct dead-code warning in MSVC 2019 2022-02-12 22:07:26 +00:00
merry
95a1ebfb97 backend/x64: Bugfix: A32 frontend also uses FPSCR.QC 2022-02-12 21:46:45 +00:00
Fernando Sahmkow
a8cbfd9af4 X86_Backend: set fences correctly for memory barriers and synchronization. 2022-02-01 14:27:54 +00:00
Wunkolo
ad5465d6ce constant_pool: Use tsl::robin_map rather than unordered_map
Finding a much more drastic improvement with `robin_map`.

`map`:
```
[master] % hyperfine -r 100 "./dynarmic_tests --durations yes"
Benchmark 1: ./dynarmic_tests --durations yes
  Time (mean ± σ):     567.0 ms ±   6.9 ms    [User: 513.1 ms, System: 53.2 ms]
  Range (min … max):   554.4 ms … 588.1 ms    100 runs
```

`unordered_map`:
```
[opt_const_pool] % hyperfine -r 100 "./dynarmic_tests --durations yes"
Benchmark 1: ./dynarmic_tests --durations yes
  Time (mean ± σ):     561.1 ms ±   4.5 ms    [User: 508.1 ms, System: 52.3 ms]
  Range (min … max):   552.6 ms … 574.2 ms    100 runs
```

`tsl::robin_map`:
```
[opt_const_pool] % hyperfine -r 100 "./dynarmic_tests --durations yes"
Benchmark 1: ./dynarmic_tests --durations yes
  Time (mean ± σ):     553.5 ms ±   5.6 ms    [User: 500.7 ms, System: 52.1 ms]
  Range (min … max):   545.7 ms … 569.3 ms    100 runs
```
2022-01-01 12:13:13 +00:00
Wunkolo
e57bb0569a constant_pool: Convert hashtype from tuple to pair 2022-01-01 12:13:13 +00:00
Wunkolo
befc22a61e constant_pool: Use unordered_map rather than map
`map` is an ordered structure with O(log n) time searches.
`unordered_map` uses O(1) average-time searches and O(n) in the worst
case, where a bucket holds colliding hashes and lookup has to chain through them.
The unordered version should speed up our general case when looking up
constants.

I've added a trivial order-dependent hash ((0, 1) and (1, 0) will return
different hashes) to combine a 128-bit constant into a 64-bit hash that
generally will not collide, using a bit-rotate to preserve entropy.
2022-01-01 12:13:13 +00:00
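
A minimal sketch of the idea described above (illustrative names, not the actual dynarmic code): combine the two 64-bit halves of a constant order-dependently with a rotate.

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#include <utility>

// Rotating one half before the XOR keeps the combination order-dependent,
// so (0, 1) and (1, 0) produce different hashes while preserving entropy.
struct ConstantHash {
    std::size_t operator()(const std::pair<std::uint64_t, std::uint64_t>& constant) const {
        return static_cast<std::size_t>(constant.first ^ std::rotl(constant.second, 1));
    }
};
```
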
Morph
28714ee75a general: Rename files with duplicate names
In MSVC, having files with identical filenames will result in massive slowdowns when compiling.
The approach I have taken to resolve this is renaming the identically named files in frontend/(A32, A64) to (a32, a64)_filename.cpp/h.
2021-12-23 11:38:58 +00:00
Fernando S
e4146ec3a1
x64 Interface: Allow for asynchronous invalidation (#647)
* x64 Interface: Make Invalidation asynchronous.

* Apply suggestions from code review
2021-10-05 15:06:41 +01:00
Wunkolo
5e7d2afe0f IR: Introduce VectorReduceAdd{8,16,32,64} opcode
Adds all elements of a vector and puts the result into the lowest element.
Accelerates the `addv` instruction with a vectorized implementation
rather than a serial one.
2021-09-27 19:54:11 +01:00
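
A scalar sketch of what VectorReduceAdd32 computes on a 128-bit vector (semantics as described above; the upper-lane handling shown here is illustrative):

```cpp
#include <array>
#include <cstdint>

// Sum every element and place the result in the lowest element.
std::array<std::uint32_t, 4> VectorReduceAdd32(const std::array<std::uint32_t, 4>& v) {
    std::uint32_t sum = 0;
    for (const std::uint32_t e : v) {
        sum += e;
    }
    return {sum, 0, 0, 0};  // remaining lanes shown zeroed for illustration
}
```
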
Marshall Mohror
0b8fd755d8 Fix signal_stack_size for glibc 2.34
`SIGSTKSZ` is now defined as `sysconf(_SC_SIGSTKSZ)`, which is not constexpr and returns a long, throwing off the `std::max` template deduction.
2021-09-22 20:38:11 +01:00
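
A hedged sketch of the kind of fix this implies (not necessarily the exact patch): pick the result type explicitly so the non-constant, long-typed SIGSTKSZ no longer breaks std::max's deduction.

```cpp
#include <algorithm>
#include <csignal>
#include <cstddef>
#include <unistd.h>

// With glibc 2.34, SIGSTKSZ may expand to sysconf(_SC_SIGSTKSZ) (a long),
// so force a common type instead of relying on template argument deduction.
std::size_t SignalStackSize() {
    return std::max<std::size_t>(static_cast<std::size_t>(SIGSTKSZ), 2 * 1024 * 1024);
}
```
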
Ben
6ce8bfaf32
Add API function to retrieve disassembly as a vector of strings (#644)
Co-authored-by: ben <Avuxo@users.noreply.github.com>
2021-09-16 16:45:20 -04:00
Merry
615ce8c7c5 IR: Remove A32 IR instructions Get{N,Z,V}Flag 2021-08-12 13:06:15 +01:00
Wunkolo
1e94acff66 ir: Add VectorBroadcastElement{Lower} IR instruction
The lane-splatting variants of `FMUL` and `FMLA` are very
common in instruction streams when implementing things like
matrix multiplication, and where they appear, they appear very densely.

https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-3-matrix-multiplication

The way this is currently implemented is by grabbing the particular lane
into a general-purpose register and then broadcasting it into a SIMD
register through `VectorGetElement` and `VectorBroadcast`:

```cpp
    const IR::U128 operand2 = v.ir.VectorBroadcast(esize, v.ir.VectorGetElement(esize, v.V(idxdsize, Vm), index));
```

What could be done instead is to keep it within
the vector register and use a permute/shuffle to "splat" the particular
lane across all other lanes, removing the GPR round-trip.

This is implemented as the new IR instruction `VectorBroadcastElement`:

```cpp
    const IR::U128 operand2 = v.ir.VectorBroadcastElement(esize, v.V(idxdsize, Vm), index);
```
2021-08-07 23:03:57 +01:00
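
For the 32-bit element case, the in-register splat can be a single shuffle; a rough sketch with SSE2 intrinsics (illustrative, not the emitter's actual code):

```cpp
#include <immintrin.h>

// Splat lane `index` across all four 32-bit lanes without a GPR round-trip.
// The shuffle immediate must be a compile-time constant, hence the switch.
__m128i BroadcastElement32(__m128i v, int index) {
    switch (index & 3) {
    case 0:  return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 0, 0, 0));
    case 1:  return _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 1, 1, 1));
    case 2:  return _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 2, 2, 2));
    default: return _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 3, 3));
    }
}
```
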
Merry
d41bc492fe {a32,a64}_jitstate: Remove unnecessary headers 2021-08-07 19:35:33 +01:00
Merry
07b5734fb0 xbyak: Correct xbyak include directory
xbyak is intended to be installed in /usr/local/include/xbyak.
Since we desire not to install xbyak before using it, we copy the headers
to the appropriate directory structure and use that instead
2021-08-07 15:13:49 +01:00
Merry
59fb568b27 tests: Use Zydis for disassembly 2021-08-06 15:29:43 +01:00
Wunkolo
f33bd69ec2 emit_x64_vector_floating_point: AVX512 implementation of EmitFPVectorToFixed
AVX512 introduces the _unsigned_ variant of float-to-integer conversion
functions via `vcvttp{sd}2u{dq}q`. In the case that a value is not
representable as an unsigned integer, it will result in `0xFFFFF...`,
which can be utilized to get "free" saturation when the floating-point
value exceeds the unsigned range, after masking away negative values.

https://www.felixcloutier.com/x86/vcvttps2udq
https://www.felixcloutier.com/x86/vcvttpd2uqq

This PR also speeds up the _signed_ conversion function for fp64->int64
https://www.felixcloutier.com/x86/vcvttpd2qq
2021-07-17 22:13:11 +01:00
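
A hedged sketch of the idea with AVX512VL intrinsics (illustrative, not dynarmic's emitter): negative inputs are cleared first, and the all-ones result of the truncating unsigned conversion then provides the positive-overflow saturation for free.

```cpp
#include <immintrin.h>

// Requires AVX512F + AVX512VL.
__m128i FPVectorToFixedU32(__m128 x) {
    const __m128 non_negative = _mm_max_ps(x, _mm_setzero_ps());  // clamp negatives (and NaN) to +0.0
    return _mm_cvttps_epu32(non_negative);  // unrepresentable values become 0xFFFFFFFF
}
```
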
SachinVin
048da372e9 block_of_code.cpp: remove redundant align() 2021-07-17 22:12:31 +01:00
Wunkolo
5971361160 IR: Add AndNot{32,64} IR instruction
Also includes BMI1-acceleration for x64, when available
2021-07-02 22:27:29 +01:00
Wunkolo
49d00634f9 IR: Add VectorAndNot IR instruction
And(a, Not(b)) is a common enough operation that this can
be fused into a single `AndNot` operation. On x64 this is also
a single `pandn` instruction rather than two.
2021-07-02 22:27:29 +01:00
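
A sketch of the fused operation for the scalar and vector forms described in the two entries above (illustrative; note that pandn computes ~first & second, so the operands are swapped to express a & ~b):

```cpp
#include <emmintrin.h>
#include <cstdint>

// Scalar form: maps to a single BMI1 `andn` instruction when available.
std::uint32_t AndNot32(std::uint32_t a, std::uint32_t b) {
    return a & ~b;
}

// Vector form: one pandn rather than separate NOT and AND instructions.
__m128i VectorAndNot(__m128i a, __m128i b) {
    return _mm_andnot_si128(b, a);  // ~b & a == a & ~b
}
```
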
Wunkolo
1fc96fd0c2 emit_x64{_vector}_floating_point: Unsafe AVX512 implementation of Emit{RSqrt,Recip}Estimate
This implementation exists within the unsafe optimization paths and
utilizes the 14-bit-precision `vrsqrt14*` and `vrcp14p*`
instructions provided by AVX512F+VL. These are _more_ accurate than
the fallback path and the current `rsqrt`-based unsafe code-path
but still fall in line with what is expected of the
`Unsafe_ReducedErrorFP` optimization flag.

Having AVX512 available will mean these functions have 14 bits of precision.
Not having AVX512 available will mean these functions have 11 bits of precision.
2021-06-27 11:18:58 +01:00
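
A sketch of the corresponding intrinsics (AVX512F+VL), illustrative of the unsafe estimate paths described above:

```cpp
#include <immintrin.h>

// ~14 bits of precision per element, vs the ~11 bits of the older estimates.
__m128 RecipEstimate(__m128 x) {
    return _mm_rcp14_ps(x);
}

__m128 RSqrtEstimate(__m128 x) {
    return _mm_rsqrt14_ps(x);
}
```
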
Wunkolo
c6125082ea emit_x64_floating_point: AVX512 implementation of EmitFPMinMaxNumeric 2021-06-20 10:12:27 +01:00
Wunkolo
776208742b emit_x64_{vector_}floating_point: Centralize implementation of FP{Vector}{Abs,Neg}
Removes dependency on the constants at the top of some files
such as `f16_negative_zero` and `f32_non_sign_mask` in favor
of the `FPInfo` trait-type.

Also removes bypass delays by selecting between instructions
such as `pand`, `andps`, or `andpd` depending on the type
and keeps them in their respective uop domain.

See https://www.agner.org/optimize/instruction_tables.pdf for
more info on bypass delays.
2021-06-10 00:04:57 +01:00
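
A sketch of the domain-aware selection described above (illustrative, not dynarmic's emitter API; in the real code the masks come from the `FPInfo` trait rather than hardcoded constants): the same sign-mask AND is emitted with the instruction that matches the element type.

```cpp
#include <immintrin.h>

// andps keeps single-precision data in the float uop domain...
__m128 AbsF32x4(__m128 x) {
    const __m128 non_sign_mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
    return _mm_and_ps(x, non_sign_mask);
}

// ...and andpd keeps double-precision data in the double domain,
// avoiding the bypass delay of mixing integer and FP domains.
__m128d AbsF64x2(__m128d x) {
    const __m128d non_sign_mask = _mm_castsi128_pd(_mm_set1_epi64x(0x7FFFFFFFFFFFFFFFLL));
    return _mm_and_pd(x, non_sign_mask);
}
```
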
SachinVin
ccf27f9c8c ir_emitter: Remove 32-bit-only AddWithCarry 2021-06-09 01:54:03 +01:00
Wunkolo
5385edcc66 emit_x64_vector_floating_point: AVX512 implementation of EmitFPVector{Min,Max}{32,64} 2021-06-08 17:50:28 +01:00
Wunkolo
0c67b913fe backend/x64: Add vcmp constants 2021-06-08 17:50:28 +01:00
Wunkolo
8fde505943 backend/x64: Add vfpclass constants
Bit-wise constants for use with the `vfpclass` instruction.
2021-06-08 17:50:28 +01:00
Wunkolo
c82e29ed82 backend/x64: Add vrange constants
Adds compile-time `FpRangeLUT` for generating the 8-bit
immediate LUT value for the `vrange*` instruction
2021-06-08 17:50:28 +01:00
MerryMage
c1d5a7977e Add Unsafe_IgnoreStandardFPCRValue optimization 2021-06-08 17:26:45 +01:00
Wunkolo
c157dfcc4c emit_x64_vector: Reduce gf2p8affineqb requirement to GFNI
Currently, every usage of `gf2p8affineqb` is guarded by the
`AVX512F + AVX512VL + GFNI` requirement, when really
we only need `GFNI` on its own.

This will allow `GFNI`-only chips to emit GFNI features without
needing to have AVX512 as well.
There _are_ chips in existence currently that strictly ship with GFNI and
have no implementation of AVX1/AVX2/AVX512 (and thus no VEX/EVEX
encoding), such as Tremont (Lakefield) chips.
2021-06-08 14:00:00 +01:00
Wunkolo
e47d0d11c3 emit_x64_vector: AVX512 implementation of EmitVectorNot
Single in-place ternary logic instruction.
2021-06-08 03:11:38 +01:00
Markus Wick
0c12614d1a A64/config.h: Split fastmem and page_table options.
We might want to allocate different sizes for each of them,
e.g. for the unsafe fastmem approach without bounds checking,
or for using the full 48-bit address range (with mirrors) by allocating our real arena as close to 1<<47 as possible.
2021-06-06 17:25:51 +01:00
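
A hedged sketch of what the split implies for the configuration (field names and defaults assumed for illustration, not necessarily dynarmic's exact API): each mechanism gets its own address-space size.

```cpp
#include <cstdint>

// Illustrative only: independent sizes let fastmem cover e.g. the full 48-bit
// range while the page table keeps a smaller, bounds-checked space.
struct A64UserConfig {
    std::uint32_t page_table_address_space_bits = 36;
    std::uint32_t fastmem_address_space_bits = 36;
};
```
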
MerryMage
828959caed IR: Implement FPVector{To,From}Half32
Implement ASIMD VCVT (half) in terms of this instruction.
Correct handling of ASIMDStandardValue.
2021-06-05 03:39:48 +01:00
Wunkolo
9a23c09c3b emit_x64_floating_point: AVX implementation of ZeroIfNaN 2021-05-31 13:41:05 +01:00