Step 0387: Corrupt PC Diagnosis at 0xFEE6 (Zelda DX)

Context

HeStep 0386applied a workaround by disabling STAT IRQ to solve the problem ofI.F."stuck" on0x02. After that correction, it was reported thatZelda DXIt was still not playable and crashed towardsPC:0xFEE6, a range that corresponds to the OAM/Non-Usable region (0xFE00-0xFEFF).

Running code in that region typically indicates:

  • PC corruption: An invalid return address taken from the stack
  • Stack Corruption (SP): Unbalanced push/pop or writes out of range
  • Jump to incorrectly calculated address: Ex.JP (HL)withH.L.corrupt

This step implements extensive instrumentation to capture the exact moment of the crash and determine its root cause.

Hardware Concept

Region 0xFE00-0xFEFF on Game Boy

According to Pan Docs, the memory map in this region is:

  • 0xFE00-0xFE9F: OAM (Object Attribute Memory)- 160 bytes for 40 sprites
  • 0xFEA0-0xFEFF: "Not Usable"- Non-functional region, returns random values ​​on real hardware

Running code in this region is always an error. It indicates that thePCwas established a un valor inválido, típicamente por:

Stack and Returns (RETI/RET)

The Game Boy stack grows downwards:

  • PUSH: SP = SP - 2, write high byte inSP+1, low byte inSP
  • P.O.P.: Read low byteSP, high byteSP+1, thenSP = SP + 2
  • RETI (0xD9): PC = pop_word(); IME = 1;

If the stack is corrupt orSPpoints to invalid data,RETIwill restore aPCgarbage (eg.0xFEE6).

Interruptions Service

When an interruption occurs withIME=1:

  1. CPU disableIME(avoids nested interrupts)
  2. GuardPCcurrent on stack:push_word(PC)
  3. Jump to the vector (ex.0x0040for VBlank)
  4. The handler ends withRETI, which restoresPCfrom the stack

If hePUSHwrite to incorrect addresses or theP.O.P.lee de una región overwritten, the return will be to an invalid address.

Fountain:Pan Docs - Memory Map, Interrupts, Stack Operations

Implementation

1. Ring Buffer of Last 64 Instructions (CPU.hpp/CPU.cpp)

A circular ring buffer is added that captures a snapshot of each executed instruction:

// In CPU.hpp
struct InstrSnapshot {
    uint16_t pc, sp, af, bc, de, hl;
    uint8_t bank, op, op1, op2, ime, ie, if_flag;
};
static constexpr int RING_SIZE = 64;
InstrSnapshot ring_buffer_[RING_SIZE];
int ring_idx_;
bool crash_dumped_;

Instep(), after the opcode fetch:

// Capture snapshot in ring buffer
ring_buffer_[ring_idx_].pc = original_pc;
ring_buffer_[ring_idx_].sp = regs_->sp;
ring_buffer_[ring_idx_].af = regs_->get_af();
// ... (rest of records)
ring_idx_ = (ring_idx_ + 1) % RING_SIZE;

// Detect crash in region FE00-FEFF
if (!crash_dumped_ && original_pc >= 0xFE00 && original_pc<= 0xFEFF) {
    crash_dumped_ = true;
    printf("[CRASH-PC] ⚠️ PC CORRUPTO: PC=0x%04X (región OAM/no usable)\n", original_pc);
    
    // Dump completo del ring buffer (últimas 64 instrucciones)
    for (int i = 0; i < RING_SIZE; i++) {
        int idx = (ring_idx_ + i) % RING_SIZE;
        printf("[CRASH-RING] #%02d PC:0x%04X Bank:%d OP:%02X %02X %02X | SP:%04X AF:%04X...\n",
               i, ring_buffer_[idx].pc, ...);
    }
}

2. Stack Trace in IRQ Push (CPU.cpp - handle_interrupts)

// BEFORE push_word(prev_pc)
uint16_t sp_before_push = regs_->sp;
printf("[IRQ-PUSH-PC] BEFORE: SP=0x%04X PC_to_push=0x%04X\n", sp_before_push, prev_pc);

push_word(prev_pc);

// AFTER the push
uint16_t sp_after_push = regs_->sp;
uint8_t byte_low = mmu_->read(sp_after_push);
uint8_t byte_high = mmu_->read(sp_after_push + 1);
printf("[IRQ-PUSH-PC] AFTER: SP=0x%04X Written=[0x%02X,0x%02X] Reconstruct=0x%04X\n",
       sp_after_push, byte_low, byte_high,
       (static_cast(byte_high)<< 8) | byte_low);

// Guardrail: verificar SP en rango peligroso
if (sp_after_push < 0xC000 || sp_after_push >= 0xFE00) {
    printf("[STACK-WARN] ⚠️ SP in dangerous range: 0x%04X\n", sp_after_push);
}

3. RETI Pop plot (CPU.cpp - case 0xD9)

// BEFORE pop_word()
uint16_t sp_before_pop = regs_->sp;
uint8_t byte_low = mmu_->read(sp_before_pop);
uint8_t byte_high = mmu_->read(sp_before_pop + 1);
uint16_t reconstructed = (byte_high<< 8) | byte_low;
printf("[RETI-POP-PC] ANTES: SP=0x%04X Bytes=[0x%02X,0x%02X] Reconstruct=0x%04X\n",
       sp_before_pop, byte_low, byte_high, reconstructed);

uint16_t return_addr = pop_word();

// DESPUES del pop
printf("[RETI-POP-PC] DESPUES: return_addr=0x%04X SP=0x%04X IME=1\n", return_addr, regs_->sp);

// Guardrail: check for corrupt return_addr
if (return_addr >= 0xFE00 && return_addr<= 0xFEFF) {
    printf("[RETI-POP-PC] ⚠️ RETURN ADDRESS CORRUPTO: 0x%04X (región OAM!)\n", return_addr);
}

4. Instrumentation of Writes to FE00-FEFF (MMU.cpp)

// At the start of MMU::write()
if (addr >= 0xFE00 && addr<= 0xFEFF && fe_write_count < 60) {
    printf("[MMU-FE-WRITE] PC=0x%04X addr=0x%04X value=0x%02X Bank=%d",
           debug_current_pc, addr, value, get_current_rom_bank());
    
    if (addr >= 0xFEA0) {
        printf(" ⚠️ UNUSABLE REGION\n");
    } else {
        printf(" (OAM valid)\n");
    }
    fe_write_count++;
}

Tests and Verification

Compilation

python3 setup.py build_ext --inplace > build_log_step0387.txt 2>&1
# ✅ Successful build without errors

Test Execution

timeout 10 python3 main.py roms/zelda-dx.gbc > logs/step0387_fe_pc_probe.log 2>&1

Log Analysis (Secure Commands)

#1) Check for crash in FE00-FEFF
grep -E "\[CRASH-PC\]" logs/step0387_fe_pc_probe.log | head -n 5
# Result: ❌ Not found (exit code 1)

#2) Check IRQ push/pop
grep -E "\[(IRQ-PUSH-PC|RETI-POP-PC|STACK-WARN)\]" logs/step0387_fe_pc_probe.log | head -n 60
# Result: ❌ Not found (interrupts are not being processed)

#3) Verify writes to FE00-FEFF
grep -E "\[MMU-FE-WRITE\]" logs/step0387_fe_pc_probe.log | head -n 60
# Result: ❌ Not found

#4) CPU Samples (check general status)
grep -E "\[CPU-SAMPLE\]" logs/step0387_fe_pc_probe.log | head -n 20
# Result: ✅ CPU running normally (200K+ instructions)

Critical Findings

🔍 Main Finding: Crash at 0xFEE6 DOES NOT Play

After executing 10 seconds (≈200K instructions),NO jump to PC was detected in range 0xFE00-0xFEFF. The crash reported in Step 0386 does NOT occur in the current run.

⚠️ Real Problem: Interrupts Completely Disabled

Analysis of the CPU samples reveals:

  • PC: 0x6B95-0x6B9B(Bank 60) - Narrow polling loop
  • IME=0- Interrupts disabled globally
  • IE=0x00 - NO interrupts enabled(neither VBlank, nor STAT, nor Timer...)
  • IF=0x01- VBlank flag active but ignored (cannot be served with IE=0x00)
  • The game readsP1 (0xFF00)repeatedly -joypad polling loop

Diagnosis:The workaround of Step 0386 (disable STAT IRQ) caused a side effect where the game disables ALL interrupts (IE=0x00), getting stuck in a wait-loop.

✅ Evidence of Functional Rendering

  • Frame 94 reached (more than 1.5 seconds of emulation)
  • Framebuffer with valid pixels (80/160 non-zero per line)
  • Normal color distribution (indexes 0 and 3)

Native Validation

C++ compiled modulewith complete instrumentation in:

  • CPU::step()- Ring buffer of 64 snapshots + crash detection
  • CPU::handle_interrupts()- IRQ push tracing with SP verification
  • CPU (case 0xD9)- RETI pop plotting with corrupt return_addr detection
  • MMU::write()- Detection of writes to region FE00-FEFF

The instrumentation worked correctly.No crash detected, but revealed the underlying problem:IE=0x00(interrupts completely disabled).

Modified Files

  • src/core/cpp/CPU.hpp- Added InstrSnapshot struct and ring buffer members
  • src/core/cpp/CPU.cpp- Ring buffer implementation, crash detection, IRQ/RETI tracing
  • src/core/cpp/MMU.cpp- Plotting writes to FE00-FEFF
  • build_log_step0387.txt- Compilation log
  • logs/step0387_fe_pc_probe.log - Log de ejecución (1.8MB)

Conclusion

HeStep 0387implemented exhaustive instrumentation to diagnose the crash reported inPC:0xFEE6, but the main finding is thatthat crash does NOT play in the current run.

Instead, the real problem was identified:IE=0x00(interrupts completely disabled), which leaves the game stuck in a polling loop with no ability to progress.

Next steps (Step 0388):

  • Review the workaround of Step 0386 that disables STAT IRQ
  • Implement correct rising edge detection for STAT without disabling interrupt completely
  • Verify thatI.E.initializes correctly (should have at least VBlank enabled)
  • Maintain ring buffer instrumentation as a permanent diagnostic tool

References