This project is educational and Open Source. No code is copied from other emulators. Implementation based solely on technical documentation and permitted tests.
SP Corruption Watchdog (Stack Pointer Watchdog)
Summary
This step implements a watchdog that detects Stack Pointer (SP) corruption in real time. Analysis of Step 0266 revealed that the GPS shows `SP:210A`, a fatal state: the Stack Pointer points into ROM (read-only) when it should be in writable RAM. The watchdog catches the exact moment at which the SP becomes corrupted, so the instruction responsible can be identified.
Hardware Concept
The Stack Pointer (SP) on Game Boy: The Stack Pointer is a 16-bit register that points to the location in memory where the stack is stored. The stack is a LIFO (Last In First Out) data structure used for:
- Subroutine calls (CALL/RET): Saves the return address before jumping to a subroutine.
- Interrupts: Saves the CPU state (PC) before jumping to the interrupt vector.
- PUSH/POP: Temporarily saves and restores register values.
Valid Memory Ranges for the Stack: According to the Game Boy memory map, the stack should be in:
- WRAM (Work RAM): `0xC000-0xDFFF` - 8 KB of internal RAM, writable.
- HRAM (High RAM): `0xFF80-0xFFFE` - 127 bytes of high-speed RAM, writable.
Why is it fatal if SP points to ROM? If the Stack Pointer points into ROM (`0x0000-0x7FFF`), any write operation (PUSH, CALL) attempts to write to read-only memory. Since we implemented ROM protection (Step 0252), those writes are silently ignored. When the CPU later executes POP or RET, it reads bytes from ROM (which are instructions, not valid return addresses). The result is that the CPU jumps to a garbage address and the program crashes. Note that `0xA000-0xBFFF` is cartridge external RAM rather than ROM, but it is just as unsafe for the stack: it is only writable when the cartridge actually provides RAM and that RAM is enabled.
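The failure mode described above can be reproduced with a toy model. This is a minimal sketch, not the project's MMU: `ToyBus`, `push16` and `pop16` are hypothetical names, and the bus simply drops any write below `0x8000` to imitate the ROM protection.

```cpp
#include <array>
#include <cstdint>

// Hypothetical minimal bus (assumption): writes to ROM are dropped,
// every other address behaves like plain RAM.
struct ToyBus {
    std::array<uint8_t, 0x10000> mem{};

    uint8_t read(uint16_t addr) const { return mem[addr]; }

    void write(uint16_t addr, uint8_t value) {
        if (addr < 0x8000) return;  // ROM region: write silently ignored
        mem[addr] = value;
    }
};

// PUSH decrements SP before each write: high byte first, then low byte.
void push16(ToyBus& bus, uint16_t& sp, uint16_t value) {
    bus.write(--sp, static_cast<uint8_t>(value >> 8));
    bus.write(--sp, static_cast<uint8_t>(value & 0xFF));
}

// POP reads the low byte first, then the high byte, incrementing SP after each read.
uint16_t pop16(const ToyBus& bus, uint16_t& sp) {
    const uint8_t lo = bus.read(sp++);
    const uint8_t hi = bus.read(sp++);
    return static_cast<uint16_t>((hi << 8) | lo);
}
```

With SP in WRAM, a value pushed with `push16` comes back unchanged from `pop16`. With SP at `0x210A`, both writes are dropped and `pop16` returns whatever bytes the ROM holds at `0x2108-0x2109`, which is how a RET ends up jumping to garbage.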
How is SP corrupted? The SP can become corrupted for several reasons:
- Instruction `LD SP, nn` with incorrect data: If `nn` contains garbage or an incorrect value.
- Instruction `LD SP, HL` with corrupt HL: If HL contains garbage (e.g. `0x210A`), copying it to SP corrupts the stack.
- Massive stack overflow: Thousands of PUSH instructions without corresponding POPs (unlikely in normal code).
- Error in SP arithmetic: Instructions like `ADD SP, r8` producing incorrect results.
The Watchdog: A watchdog is a monitoring mechanism that continually checks for a critical condition. In this case, we check after each instruction that the SP is in a valid range. If we detect corruption, we print a critical message with the SP value and the PC where it occurred, allowing us to identify the exact instruction that caused the problem.
Source: Pan Docs - "Memory Map", "Stack Pointer", "CALL/RET Instructions"
Implementation
Added a check at the end of the `step()` method in `CPU.cpp` that runs after each instruction. The watchdog verifies that the SP is at or above `0xC000`, which covers WRAM and, implicitly, HRAM (`0xFF80-0xFFFE`). If it detects corruption, it prints a critical message with the SP value and the current PC.
Modified components
src/core/cpp/CPU.cpp: Added SP watchdog to the end of the `step()` method (Step 0267).
Watchdog Code
// --- Step 0267: SP CORRUPTION WATCHDOG ---
// The Stack Pointer must always be in RAM (C000-DFFF or FF80-FFFE)
// If it goes below C000 (and it's not a momentary 0000), something has gone terribly wrong.
// This check is executed after each instruction to detect
// the exact moment when the SP becomes corrupted.
// Source: Pan Docs - Memory Map: Stack must be in WRAM (C000-DFFF) or HRAM (FF80-FFFE)
if (regs_->sp < 0xC000 && regs_->sp != 0x0000) {
    printf("[CRITICAL] SP CORRUPTION DETECTED! SP:%04X at PC:%04X\n", regs_->sp, regs_->pc);
    // Optional: exit(1) to stop it in its tracks (commented to allow logging)
    // exit(1);
}
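Because the check runs after every instruction, it prints one line per instruction for as long as the SP stays corrupted. One possible variant, sketched here under assumptions, reports only the instruction on which the SP goes from valid to corrupt. `SpWatchdog` is a hypothetical helper and not part of the project's `CPU` class; it reuses the same predicate as the check above.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper (assumption): remembers the previous verdict so that
// only the valid -> corrupt transition is reported.
struct SpWatchdog {
    bool was_corrupt = false;

    // Same condition as the inline check: below 0xC000 and not a momentary 0x0000.
    static bool is_corrupt(uint16_t sp) {
        return sp < 0xC000 && sp != 0x0000;
    }

    // Call once per executed instruction. Returns true only on the
    // instruction where SP changes from valid to corrupt.
    bool check(uint16_t sp, uint16_t pc) {
        const bool corrupt = is_corrupt(sp);
        const bool newly_corrupt = corrupt && !was_corrupt;
        if (newly_corrupt) {
            std::printf("[CRITICAL] SP CORRUPTION DETECTED! SP:%04X at PC:%04X\n",
                        static_cast<unsigned>(sp), static_cast<unsigned>(pc));
        }
        was_corrupt = corrupt;
        return newly_corrupt;
    }
};
```

The trade-off is that a second, unrelated corruption is reported only if the SP returned to a valid range in between.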
Design decisions
- Verification after each instruction: The check runs at the end of `step()`, after the opcode switch. This ensures that we detect corruption immediately after it occurs.
- Exception for SP=0x0000: `SP=0x0000` is temporarily allowed because some games may initialize the SP to 0 before setting it to a valid value. However, if the SP remains at 0 during normal execution, it is an error.
- Logging instead of exit(): The `exit(1)` call is commented out so that the emulator continues and generates more logs. This is useful for post-mortem analysis, but the call can be re-enabled to stop execution immediately.
- Doesn't check HRAM explicitly: The verification only checks that SP >= 0xC000. The HRAM range (`0xFF80-0xFFFE`) is above `0xC000`, so it is implicitly covered. The same condition also accepts everything between `0xE000` and `0xFF7F` (Echo RAM, OAM, I/O registers), so a stricter check could explicitly validate both ranges.
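A stricter predicate along the lines of the last point might look like the sketch below. `sp_in_stack_ram` is a hypothetical name. One detail is an assumption on my part rather than something stated in this step: because PUSH decrements SP before writing, an empty stack sits one byte above its first slot, so the sketch also accepts `0xE000` (empty stack at the top of WRAM) and `0xFFFF` (empty stack at the top of HRAM).

```cpp
#include <cstdint>

// Hypothetical stricter check (assumption): true only when SP is inside
// WRAM or HRAM, or exactly one byte above either region (empty stack).
constexpr bool sp_in_stack_ram(uint16_t sp) {
    const bool in_wram = sp >= 0xC000 && sp <= 0xE000;  // 0xE000: empty stack above WRAM
    const bool in_hram = sp >= 0xFF80;                  // up to 0xFFFF: empty stack above HRAM
    return in_wram || in_hram;
}
```

Unlike the `SP >= 0xC000` test, this rejects Echo RAM, OAM and the I/O registers. It also rejects `SP=0x0000`, so the temporary exception would have to be handled separately if it is still wanted.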
Verification of SP Related Instructions
The following SP-related instructions were reviewed:
- 0x31 (LD SP, d16): ✅ Implemented correctly. Reads a 16-bit value in little-endian format using `fetch_word()` and assigns it to SP.
- 0xF9 (LD SP, HL): ❌ Not implemented. This instruction copies the value from HL to SP. If HL contains garbage, it corrupts SP. This is a possible source of corruption.
- 0xE8 (ADD SP, r8): ❌ Not implemented. This instruction adds an 8-bit signed value to SP. An error in this instruction could cause corruption.
- 0xF8 (LD HL, SP+r8): ❌ Not implemented. This instruction loads HL with SP + r8 (signed). It does not modify SP directly, but could indicate problems if used incorrectly.
Note: The unimplemented instructions (0xF9, 0xE8, 0xF8) are not necessary for the watchdog to function, but their absence could be a source of corruption if the game tries to use them. The watchdog will detect the corruption regardless of which instruction causes it.
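For reference, the three missing instructions could be sketched as follows. This is not the project's implementation: the `Regs` struct, the flag constants and the function names are hypothetical, and timing is ignored. For `ADD SP, r8` and `LD HL, SP+r8`, Z and N are cleared, while H and C are taken from the unsigned addition of the low byte of SP and the raw operand byte.

```cpp
#include <cstdint>

// Hypothetical register file (assumption); names are illustrative only.
struct Regs {
    uint16_t sp = 0;
    uint16_t hl = 0;
    uint8_t f = 0;  // flags: Z = bit 7, N = bit 6, H = bit 5, C = bit 4
};

constexpr uint8_t FLAG_H = 0x20;
constexpr uint8_t FLAG_C = 0x10;

// Shared by 0xE8 and 0xF8: returns SP plus the signed offset and sets the flags.
// Z and N are cleared; H is the carry out of bit 3 and C the carry out of bit 7.
uint16_t sp_plus_r8(Regs& r, uint8_t raw) {
    const int8_t offset = static_cast<int8_t>(raw);  // reinterpret operand as signed
    r.f = 0;
    if (((r.sp & 0x0F) + (raw & 0x0F)) > 0x0F) r.f |= FLAG_H;
    if (((r.sp & 0xFF) + raw) > 0xFF) r.f |= FLAG_C;
    return static_cast<uint16_t>(r.sp + offset);
}

// 0xF9: LD SP, HL copies HL into SP and leaves the flags untouched.
void op_ld_sp_hl(Regs& r) { r.sp = r.hl; }

// 0xE8: ADD SP, r8 stores the result back into SP.
void op_add_sp_r8(Regs& r, uint8_t raw) { r.sp = sp_plus_r8(r, raw); }

// 0xF8: LD HL, SP+r8 stores the result in HL and leaves SP unchanged.
void op_ld_hl_sp_r8(Regs& r, uint8_t raw) { r.hl = sp_plus_r8(r, raw); }
```

In this sketch `op_ld_sp_hl` is the only one of the three that can move SP to an arbitrary address in a single step, which fits the `LD SP, HL` suspicion.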
Affected Files
src/core/cpp/CPU.cpp: Added SP watchdog to the end of the `step()` method (Step 0267).
Tests and Verification
The watchdog will be validated by running the emulator with Pokémon Red and looking for the critical message in the logs:
- Test command: `python main.py roms/pkmn.gb > sp_debug.log 2>&1`
- Search in the log: Find the string `[CRITICAL] SP CORRUPTION DETECTED!`
- Post-mortem analysis: Once corruption is detected, use `tools/dump_rom_zone.py` around the reported PC to see which instruction caused the disaster.
Expected validation: The watchdog should detect the corruption when the SP drops below `0xC000` and show the exact PC where it occurred. This will allow the instruction that corrupts the SP to be identified and corrected.
Note: The watchdog is active in all builds (debug and release). In production, it could be disabled or made conditional on a compile-time macro to avoid overhead in the critical loop.
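One way to make the check conditional at compile time is sketched below. The macro name `SP_WATCHDOG_ENABLED` and the free function `sp_watchdog` are assumptions made for illustration; the project does not define them.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical macro (assumption): build with -DSP_WATCHDOG_ENABLED=0
// to compile the check out of release builds.
#ifndef SP_WATCHDOG_ENABLED
#define SP_WATCHDOG_ENABLED 1
#endif

// Returns true when corruption was reported; always false when compiled out.
inline bool sp_watchdog(uint16_t sp, uint16_t pc) {
#if SP_WATCHDOG_ENABLED
    if (sp < 0xC000 && sp != 0x0000) {
        std::printf("[CRITICAL] SP CORRUPTION DETECTED! SP:%04X at PC:%04X\n",
                    static_cast<unsigned>(sp), static_cast<unsigned>(pc));
        return true;
    }
#else
    (void)sp;  // silence unused-parameter warnings when the check is disabled
    (void)pc;
#endif
    return false;
}
```

With the macro set to 0 the function body reduces to `return false;`, so an optimizing compiler can remove the call from the hot loop entirely.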
Sources consulted
- Pan Docs: Memory Map - Valid memory ranges for the stack.
- Pan Docs: CPU Instruction Set - SP-related instructions (LD SP, d16; LD SP, HL; ADD SP, r8).
- Pan Docs: CPU Registers and Flags - Description of the Stack Pointer.
Educational Integrity
What I Understand Now
- Stack Pointer and Memory: The SP must always point to writable memory (WRAM or HRAM). If it points to ROM, writes are ignored and reads return instructions instead of data from the stack.
- SP Corruption: The SP can be corrupted by instructions that modify it with incorrect values (e.g. `LD SP, HL` when HL contains garbage).
- Watchdog Pattern: A watchdog is a common design pattern in embedded systems for detecting anomalous conditions in real time. It runs periodically (in this case, after each instruction) and alerts when it detects an invalid state.
What remains to be confirmed
- Missing instructions: Instructions 0xF9 (LD SP, HL), 0xE8 (ADD SP, r8) and 0xF8 (LD HL, SP+r8) are not implemented. If the game tries to use them, the emulator's behavior is undefined. We need to implement them or at least detect when the game attempts to execute them.
- HRAM range: The current check only verifies that SP >= 0xC000. The HRAM range (`0xFF80-0xFFFE`) is covered implicitly, but so are Echo RAM, OAM and the I/O registers; a stricter check could explicitly validate both valid ranges.
- Performance overhead: The watchdog adds a conditional check after each instruction. In the critical loop, the impact is expected to be small, but we should measure the overhead and consider disabling the check in release builds.
Hypotheses and Assumptions
Main hypothesis: The SP is corrupted by an `LD SP, HL` instruction (0xF9) when HL contains garbage (`0x210A`). This instruction is not implemented, so the game could be running junk code or the CPU could be reading the wrong opcode.
Assumption about SP=0x0000: We temporarily allow `SP=0x0000` because some games may initialize the SP to 0 before setting it to a valid value. However, if SP remains at 0 during normal execution, this is an error that should be caught.
Next Steps
- [ ] Run the emulator with Pokémon Red and look for the critical SP corruption message in the logs.
- [ ] Once the corruption is detected, use `tools/dump_rom_zone.py` around the reported PC to identify the exact instruction causing the problem.
- [ ] Implement the missing SP-related instructions (0xF9, 0xE8, 0xF8) if analysis reveals that the game is using them.
- [ ] Improve watchdog verification to explicitly validate WRAM and HRAM ranges.
- [ ] Consider making the watchdog conditional in release builds to avoid overhead in the critical loop.