dynarmic/docs/ReturnStackBufferOptimization.md
MerryMage 311361b409 jit_state: Split off CPSR.{E,T}
This allows us to improve code-emission for PopRSBHint. We also improve
code emission other terminals at the same time.
2020-04-22 20:26:40 +01:00

5.7 KiB

Return Stack Buffer Optimization (x64 Backend)

One of the optimizations that dynarmic does is block-linking. Block-linking is done when the destination address of a jump is available at JIT-time. Instead of returning to the dispatcher at the end of a block we can perform block-linking: just jump directly to the next block. This is beneficial because returning to the dispatcher can often be quite expensive.

What should we do in cases when we can't predict the destination address? The eponymous example is when executing a return statement at the end of a function; the return address is not statically known at compile time.

We deal with this by using a return stack buffer: When we execute a call instruction, we push our prediction onto the RSB. When we execute a return instruction, we pop a prediction off the RSB. If the prediction is a hit, we immediately jump to the relevant compiled block. Otherwise, we return to the dispatcher.

This is the essential idea behind this optimization.

UniqueHash

One complication dynarmic has is that a compiled block is not uniquely identifiable by the PC alone, but bits in the FPSCR and CPSR are also relevant. We resolve this by computing a 64-bit UniqueHash that is guaranteed to uniquely identify a block.

u64 LocationDescriptor::UniqueHash() const {
    // This value MUST BE UNIQUE.
    // This calculation has to match up with EmitX64::EmitTerminalPopRSBHint
    u64 pc_u64 = u64(arm_pc) << 32;
    u64 fpscr_u64 = u64(fpscr.Value());
    u64 t_u64 = cpsr.T() ? 1 : 0;
    u64 e_u64 = cpsr.E() ? 2 : 0;
    return pc_u64 | fpscr_u64 | t_u64 | e_u64;
}

Our implementation isn't actually a stack

Dynarmic's RSB isn't actually a stack. It was implemented as a ring buffer because that showed better performance in tests.

RSB Structure

The RSB is implemented as a ring buffer. rsb_ptr is the index of the insertion point. Each element in rsb_location_descriptors is a UniqueHash and they each correspond to an element in rsb_codeptrs. rsb_codeptrs contains the host addresses for the corresponding the compiled blocks.

RSBSize was chosen by performance testing. Note that this is bigger than the size of the real RSB in hardware (which has 3 entries). Larger RSBs than 8 showed degraded performance.

struct JitState {
    // ...

    static constexpr size_t RSBSize = 8; // MUST be a power of 2.
    u32 rsb_ptr = 0;
    std::array<u64, RSBSize> rsb_location_descriptors;
    std::array<u64, RSBSize> rsb_codeptrs;
    void ResetRSB();

    // ...
};

RSB Push

We insert our prediction at the insertion point iff the RSB doesn't already contain a prediction with the same UniqueHash.

void EmitX64::EmitPushRSB(IR::Block&, IR::Inst* inst) {
    using namespace Xbyak::util;

    ASSERT(inst->GetArg(0).IsImmediate());
    u64 imm64 = inst->GetArg(0).GetU64();

    Xbyak::Reg64 code_ptr_reg = reg_alloc.ScratchGpr({HostLoc::RCX});
    Xbyak::Reg64 loc_desc_reg = reg_alloc.ScratchGpr();
    Xbyak::Reg32 index_reg = reg_alloc.ScratchGpr().cvt32();
    u64 code_ptr = unique_hash_to_code_ptr.find(imm64) != unique_hash_to_code_ptr.end()
                    ? u64(unique_hash_to_code_ptr[imm64])
                    : u64(code->GetReturnFromRunCodeAddress());

    code->mov(index_reg, dword[r15 + offsetof(JitState, rsb_ptr)]);
    code->add(index_reg, 1);
    code->and_(index_reg, u32(JitState::RSBSize - 1));

    code->mov(loc_desc_reg, u64(imm64));
    CodePtr patch_location = code->getCurr<CodePtr>();
    patch_unique_hash_locations[imm64].emplace_back(patch_location);
    code->mov(code_ptr_reg, u64(code_ptr)); // This line has to match up with EmitX64::Patch.
    code->EnsurePatchLocationSize(patch_location, 10);

    Xbyak::Label label;
    for (size_t i = 0; i < JitState::RSBSize; ++i) {
        code->cmp(loc_desc_reg, qword[r15 + offsetof(JitState, rsb_location_descriptors) + i * sizeof(u64)]);
        code->je(label, code->T_SHORT);
    }

    code->mov(dword[r15 + offsetof(JitState, rsb_ptr)], index_reg);
    code->mov(qword[r15 + index_reg.cvt64() * 8 + offsetof(JitState, rsb_location_descriptors)], loc_desc_reg);
    code->mov(qword[r15 + index_reg.cvt64() * 8 + offsetof(JitState, rsb_codeptrs)], code_ptr_reg);
    code->L(label);
}

In pseudocode:

  for (i := 0 .. RSBSize-1)
      if (rsb_location_descriptors[i] == imm64)
        goto label;
  rsb_ptr++;
  rsb_ptr %= RSBSize;
  rsb_location_desciptors[rsb_ptr] = imm64; //< The UniqueHash
  rsb_codeptr[rsb_ptr] = /* codeptr corresponding to the UniqueHash */;
label:

RSB Pop

To check if a predicition is in the RSB, we linearly scan the RSB.

void EmitX64::EmitTerminalPopRSBHint(IR::Term::PopRSBHint, IR::LocationDescriptor initial_location) {
    using namespace Xbyak::util;

    // This calculation has to match up with IREmitter::PushRSB
    code->mov(ecx, MJitStateReg(Arm::Reg::PC));
    code->shl(rcx, 32);
    code->mov(ebx, dword[r15 + offsetof(JitState, FPSCR_mode)]);
    code->or_(ebx, dword[r15 + offsetof(JitState, CPSR_et)]);
    code->or_(rbx, rcx);

    code->mov(rax, u64(code->GetReturnFromRunCodeAddress()));
    for (size_t i = 0; i < JitState::RSBSize; ++i) {
        code->cmp(rbx, qword[r15 + offsetof(JitState, rsb_location_descriptors) + i * sizeof(u64)]);
        code->cmove(rax, qword[r15 + offsetof(JitState, rsb_codeptrs) + i * sizeof(u64)]);
    }

    code->jmp(rax);
}

In pseudocode:

rbx := ComputeUniqueHash()
rax := ReturnToDispatch
for (i := 0 .. RSBSize-1)
   if (rbx == rsb_location_descriptors[i])
      rax = rsb_codeptrs[i]
goto rax