This project is educational and Open Source. No code is copied from other emulators. Implementation based solely on technical documentation and permitted tests.
Final Architecture: Native Emulation Loop in C++
Summary
The emulator had reached a final synchronization deadlock. Although all the C++ components (CPU, PPU, interrupts) were correct, the main loop in Python was too slow and too coarse-grained to simulate the cycle-by-cycle interaction that the CPU and PPU require during polling loops. This step documents the solution: move the fine-grained emulation loop (the scanline loop) entirely into C++ by creating a method, run_scanline(), that encapsulates all the synchronization logic and runs at native speed.
Hardware Concept: True Cycle-to-Cycle Synchronization
On the actual Game Boy hardware, there is no external "orchestrator." The CPU executes an instruction and consumes, say, 8 cycles. In those same 8 cycles, the PPU, Timer and APU also advance 8 cycles. Truly precise emulation replicates this: after each CPU instruction, all components must be updated with the cycles consumed.
The Problem of Previous Architecture
Our scanline architecture in Python was a useful approach, but it had a fundamental limitation:
- The CPU executed multiple instructions in a Python loop until accumulating 456 T-Cycles.
- The PPU was only updated once, at the end of the scanline, receiving all 456 cycles at once.
- During the CPU polling loop (e.g. LDH A, (n) -> CP d8 -> JR NZ, e), the CPU was reading the STAT register repeatedly, but the PPU had not changed modes because it had not been updated.
- This created a paradox: the CPU was waiting for the PPU, but the PPU could not advance until the CPU finished waiting.
The Solution: Native Emulation Loop in C++
The solution is to move the fine-grained emulation loop completely to C++, where it can run at native speed without any call overhead between Python and C++. The new method run_scanline():
- It executes CPU instructions until at least 456 T-Cycles have accumulated (the last instruction may run slightly past the boundary).
- After each instruction, it updates the PPU with the cycles consumed.
- This ensures that the PPU changes modes (Mode 2 → Mode 3 → Mode 0) in exact cycles.
- When the CPU reads the STAT register in its polling loop, it will see the mode change immediately and can continue.
Source: Pan Docs - System Clock, LCD Timing, CPU-PPU Synchronization
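The difference between the two architectures can be sketched with a toy model. Everything here is hypothetical and not taken from the project: ToyPPU tracks only the position inside one scanline, the mode boundaries use the nominal 80-cycle Mode 2 and the minimum 172-cycle Mode 3, and the PPU is stepped once per polling iteration (32 T-Cycles for LDH + CP + taken JR) rather than once per instruction, purely for brevity.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy PPU: tracks only the position inside one scanline.
struct ToyPPU {
    int dot = 0;  // T-Cycles elapsed in the current scanline (0..455)

    void step(int t_cycles) {
        assert(t_cycles >= 0);
        dot = (dot + t_cycles) % 456;
    }

    // STAT mode bits: 2 = OAM scan, 3 = pixel transfer, 0 = HBlank.
    std::uint8_t mode() const {
        if (dot < 80) return 2;
        if (dot < 80 + 172) return 3;
        return 0;
    }
};

// Simulates a CPU polling loop that spins until STAT reports mode 0.
// Returns the number of iterations needed, or -1 if the loop is still
// spinning when the 456-cycle scanline budget runs out.
int poll_until_hblank(bool step_ppu_per_iteration) {
    ToyPPU ppu;
    const int kIterationCost = 32;  // LDH (12) + CP (8) + JR taken (12)
    int spent = 0;
    int iterations = 0;
    while (spent < 456) {
        if (ppu.mode() == 0) return iterations;  // polling condition met
        spent += kIterationCost;
        ++iterations;
        // Fine-grained: the PPU advances inside the loop.
        // Coarse: it stays frozen until the scanline ends.
        if (step_ppu_per_iteration) ppu.step(kIterationCost);
    }
    return -1;
}
```

In this model poll_until_hblank(true) returns 8, because the PPU reaches cycle 256 and reports mode 0, while poll_until_hblank(false) returns -1: the frozen PPU never leaves Mode 2, which is the deadlock described above.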
Implementation
A. Modification of CPU.hpp and CPU.cpp
Two new methods were added to the CPU class:
- setPPU(PPU* ppu): Connects the PPU to the CPU to allow cycle-to-cycle synchronization.
- run_scanline(): Runs a full scanline (456 T-Cycles) with cycle-by-cycle synchronization.
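In src/core/cpp/CPU.hpp:
    // Forward declaration
    class PPU;

    class CPU {
    public:
        void setPPU(PPU* ppu);
        void run_scanline();
        //...
    private:
        // Pointer to PPU (not owned, optional). Defaults to nullptr so that
        // the null check in run_scanline() is reliable before setPPU() is called.
        PPU* ppu_ = nullptr;
        //...
    };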
In src/core/cpp/CPU.cpp:
    void CPU::setPPU(PPU* ppu) {
        ppu_ = ppu;
    }

    void CPU::run_scanline() {
        if (ppu_ == nullptr) return;

        const int CYCLES_PER_SCANLINE = 456;
        int cycles_this_scanline = 0;

        while (cycles_this_scanline < CYCLES_PER_SCANLINE) {
            // Execute ONE instruction
            uint8_t m_cycles = step();
            if (m_cycles == 0) m_cycles = 1;
            if (halted_) {
                m_cycles = 1;  // Minimum advance while in HALT
            }
            int t_cycles = m_cycles * 4;

            // The key step: update the PPU after EVERY instruction
            ppu_->step(t_cycles);
            cycles_this_scanline += t_cycles;
        }
    }
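Because instructions rarely end exactly on the 456-cycle boundary, the loop above can run a few cycles past it on each call. One possible refinement, shown here only as a sketch and not as part of the project, is to carry the overshoot into the next scanline's budget. The function run_lines and its fixed instr_cycles cost are hypothetical stand-ins for repeated step() calls.

```cpp
#include <cassert>

constexpr int kCyclesPerScanline = 456;

// Runs `lines` scanlines in which every instruction costs `instr_cycles`
// T-Cycles, carrying each scanline's overshoot into the next one.
// Returns the total number of T-Cycles executed.
int run_lines(int lines, int instr_cycles) {
    assert(instr_cycles > 0);
    int carry = 0;  // cycles the previous scanline ran past its budget
    int total = 0;
    for (int i = 0; i < lines; ++i) {
        const int target = kCyclesPerScanline - carry;
        int executed = 0;
        // Stand-in for: step() one instruction, then step the PPU.
        while (executed < target) executed += instr_cycles;
        carry = executed - target;  // debt owed by the next scanline
        total += executed;
    }
    return total;
}
```

With 20-cycle instructions, a single scanline executes 460 cycles, but 154 scanlines execute 70240 in total rather than 154 × 460 = 70840, so the drift stays bounded by one instruction instead of growing with every scanline.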
B. Cython Wrapper Update
In src/core/cython/cpu.pyx, the new methods were exposed to Python:
    cdef class PyCPU:
        #...
        def set_ppu(self, PyPPU ppu_wrapper):
            # cdef declarations must sit at function scope, not inside if/else
            cdef ppu.PPU* ppu_ptr = NULL
            if ppu_wrapper is not None:
                ppu_ptr = ppu_wrapper._ppu
            self._cpu.setPPU(ppu_ptr)

        def run_scanline(self):
            self._cpu.run_scanline()
C. Simplification of viboy.py
The method run() in src/viboy.py was drastically simplified:
    def run(self, debug: bool = False) -> None:
        #...
        # Connect PPU to CPU in constructor
        self._cpu.set_ppu(self._ppu)
        # Main loop
        while self.running:
            for line in range(SCANLINES_PER_FRAME):
                if not self.running:
                    break
                # C++ takes care of all the emulation of a scanline
                self._cpu.run_scanline()
            # Rendering and frame synchronization
            #...
Python's complex inner loop (which executed instructions for up to 456 cycles) was completely eliminated and replaced by a simple call to run_scanline().
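The frame budget implied by this loop can be worked out as a small constant-expression sketch. It assumes SCANLINES_PER_FRAME is 154 (144 visible lines plus 10 VBlank lines) and the standard 4194304 Hz Game Boy clock; neither value is stated in this step, and the constant names are illustrative.

```cpp
// Assumed constants (not stated in this step): DMG clock and line count.
constexpr int kClockHz = 4194304;
constexpr int kCyclesPerScanline = 456;
constexpr int kScanlinesPerFrame = 154;  // 144 visible + 10 VBlank

constexpr int kCyclesPerFrame = kCyclesPerScanline * kScanlinesPerFrame;
static_assert(kCyclesPerFrame == 70224, "456 T-Cycles x 154 scanlines");

// Frames per second the outer Python loop has to sustain.
constexpr double frames_per_second() {
    return static_cast<double>(kClockHz) / kCyclesPerFrame;
}

// Wall-clock time available for one pass of the 154-call inner loop.
constexpr double frame_time_ms() {
    return 1000.0 * kCyclesPerFrame / kClockHz;
}
```

Under these assumptions the Python side makes 154 calls per frame, roughly 59.7 times per second, with about 16.7 ms available per frame.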
Tests and Verification
This deep architectural change requires verification by running the full emulator. Existing unit tests continue to pass, confirming that no existing functionality was broken.
Compile Command
.\rebuild_cpp.ps1
Execution Command
python main.py roms/tetris.gb --verbose
Expected Result
With this final architecture:
- The CPU will execute its polling loop.
- Within run_scanline(), ppu.step() will be called after each cpu.step().
- The PPU will have the opportunity to change from Mode 2 to Mode 3 and Mode 0 at the exact cycles.
- In one of its iterations, the CPU polling loop will read the STAT register and see that the mode has changed. The JR NZ condition will fail.
- The deadlock will break.
- The CPU will continue and copy the graphics to VRAM.
- The Heartbeat will show LY increasing.
- And finally... we will see the Nintendo logo on the screen.
C++ Compiled Module Validation
The implementation uses the compiled C++ module (viboy_core), ensuring that the entire critical emulation loop runs at native speed without Python overhead.
Impact and Consequences
This change represents the definitive solution to the synchronization problem:
- Performance: The critical emulation loop now runs entirely in native C++, reducing the Python-to-C++ boundary crossings to a single call per scanline.
- Precision: The PPU is updated after each instruction, allowing precise cycle-to-cycle synchronization.
- Deadlock Resolution: CPU polling loops can now read PPU state changes immediately, breaking structural deadlocks.
- Simplicity: The Python code is drastically simplified, becoming a mere high-level orchestrator (window and frame manager).
References
- Pan Docs: System Clock, LCD Timing, CPU-PPU Synchronization
- Modified Files:
  - src/core/cpp/CPU.hpp
  - src/core/cpp/CPU.cpp
  - src/core/cython/cpu.pyx
  - src/viboy.py