This project is educational and Open Source. No code is copied from other emulators. Implementation based solely on technical documentation and permitted tests.
Scanline-Based Architecture: Balance of Performance/Accuracy
Summary
A hybrid, scanline-based architecture was implemented to solve the emulator's performance problem. The CPU and Timer run on every instruction (preserving RNG precision), but the PPU is updated only once per scanline (456 T-Cycles), which cuts the number of PPU updates by about 99%. This balances performance and precision: the emulator is intended to reach 60 FPS on modern hardware without breaking the playability of games like Tetris that rely on the Timer for RNG.
Hardware Concept
Emulation Performance Problem: In a cycle-by-cycle emulator, every CPU instruction must update all peripherals (PPU, Timer, etc.). In Python, this means millions of function calls per second, which creates massive overhead. An i7-10700K with a 2080 Ti should run the emulator at 500+ FPS when uncapped, but with a pure cycle-by-cycle architecture it barely reaches 30 FPS because of Python overhead.
Aggressive Batching vs Precision: Batching multiple instructions (for example, in 128-cycle batches) reduces calls and improves performance, but breaks Timer synchronization. Games like Tetris use the DIV register (Timer) as a source of randomness. If the Timer is not updated on every instruction, the game can read the same value multiple times, generating identical pieces or ghost collisions that cause a random Game Over.
Scanline-Based Architecture: The industry-standard solution (used in PyBoy and other high-performance emulators) is to run the CPU and Timer on every instruction (for accuracy) but update the PPU only once per scanline (456 T-Cycles). This reduces PPU updates from up to ~17,556 per frame (one per instruction, counting each instruction as a single 4 T-Cycle M-Cycle) to only 154 (one per line), a 99% reduction, while keeping the Timer precision needed for correct RNG.
Scanline Timing: The Game Boy screen has 144 visible lines (0-143) followed by 10 lines of V-Blank (144-153), for a total of 154 lines per frame. Each line takes exactly 456 T-Cycles. A full frame is 70,224 T-Cycles (154 * 456), which gives approximately 59.7 FPS at 4.19 MHz.
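The timing figures above can be checked with a few lines of arithmetic. This is only a sketch: the constant names are illustrative and are not taken from src/viboy.py, and the exact clock value of 4,194,304 Hz is the figure that "4.19 MHz" rounds.

```python
# Timing constants for the scanline arithmetic (names are illustrative).
CLOCK_HZ = 4_194_304          # system clock, in T-Cycles per second
CYCLES_PER_SCANLINE = 456     # T-Cycles per line
VISIBLE_LINES = 144           # lines 0-143
VBLANK_LINES = 10             # lines 144-153

SCANLINES_PER_FRAME = VISIBLE_LINES + VBLANK_LINES
CYCLES_PER_FRAME = SCANLINES_PER_FRAME * CYCLES_PER_SCANLINE
FRAME_RATE = CLOCK_HZ / CYCLES_PER_FRAME

print(SCANLINES_PER_FRAME)    # 154
print(CYCLES_PER_FRAME)       # 70224
print(round(FRAME_RATE, 2))   # 59.73
```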
Source: Pan Docs - LCD Timing, System Clock, Frame Rate
Implementation
The run() method in src/viboy.py was rewritten to implement the scanline-based architecture:
1. New Method: _execute_cpu_timer_only()
A helper method was created that executes one CPU instruction and updates the Timer immediately, but does NOT update the PPU. This separates the precise CPU/Timer execution from the PPU update, which is batched per scanline.
- Run CPU.step() to get the M-Cycles consumed
- Convert M-Cycles to T-Cycles (multiplying by 4)
- Update the Timer immediately with Timer.tick(t_cycles)
- Return the T-Cycles consumed so they can be accumulated for the scanline
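The four steps can be sketched roughly as follows. This is not the code from src/viboy.py: the Viboy class name, the attribute names, and the StubCPU and StubTimer stand-ins are assumptions made so the example runs on its own. Only step(), tick() and _execute_cpu_timer_only() come from the description above.

```python
class StubCPU:
    """Hypothetical stand-in: every instruction costs a fixed number of M-Cycles."""

    def __init__(self, m_cycles=2):
        self.m_cycles = m_cycles

    def step(self):
        # Returns the M-Cycles consumed by the instruction just executed.
        return self.m_cycles


class StubTimer:
    """Hypothetical stand-in: only records how many T-Cycles it has seen."""

    def __init__(self):
        self.total_t_cycles = 0

    def tick(self, t_cycles):
        self.total_t_cycles += t_cycles


class Viboy:
    """Assumed shape of the emulator object; the PPU is deliberately absent."""

    def __init__(self, cpu, timer):
        self._cpu = cpu
        self._timer = timer

    def _execute_cpu_timer_only(self):
        m_cycles = self._cpu.step()      # 1. run one instruction
        t_cycles = m_cycles * 4          # 2. M-Cycles -> T-Cycles
        self._timer.tick(t_cycles)       # 3. Timer is updated immediately
        return t_cycles                  # 4. caller accumulates per scanline


emulator = Viboy(StubCPU(m_cycles=3), StubTimer())
consumed = emulator._execute_cpu_timer_only()
print(consumed)  # 12
```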
2. Main Loop by Scanlines
The main loop now has three nested levels:
- Frame Loop: Executes 70,224 T-Cycles (one full frame)
- Scanline Loop: Runs 456 T-Cycles (one full line)
- Instruction Loop: The CPU and Timer execute instruction by instruction until 456 cycles are completed
At the end of each scanline, the PPU is updated once with PPU.step(456), passing
exactly 456 T-Cycles. This dramatically reduces the number of calls to the PPU without affecting
visual accuracy (the PPU processes full lines anyway).
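A minimal sketch of the three nested loops, under stated assumptions: execute_instruction and ppu_step are hypothetical callables standing in for _execute_cpu_timer_only() and PPU.step(), and the frame count replaces the real "run until the window closes" condition. Carrying the overshoot of the last instruction into the next scanline is also an assumption; the description above does not say how the real loop handles instructions that cross the 456-cycle boundary.

```python
CYCLES_PER_SCANLINE = 456
SCANLINES_PER_FRAME = 154


def run(frames, execute_instruction, ppu_step):
    """Run `frames` frames; return the T-Cycles overshoot left at the end."""
    line_cycles = 0
    for _ in range(frames):                              # frame loop
        for _ in range(SCANLINES_PER_FRAME):             # scanline loop
            while line_cycles < CYCLES_PER_SCANLINE:     # instruction loop
                line_cycles += execute_instruction()
            ppu_step(CYCLES_PER_SCANLINE)                # one PPU call per line
            line_cycles -= CYCLES_PER_SCANLINE           # keep the overshoot
    return line_cycles


# Usage with fake components: every instruction costs 20 T-Cycles.
executed = []
ppu_calls = []


def fake_instruction():
    executed.append(20)
    return 20


leftover = run(1, fake_instruction, ppu_calls.append)
print(len(ppu_calls))   # 154
print(leftover)         # 16
```

Because the overshoot is carried rather than dropped, the CPU and Timer stay at most one instruction ahead of the PPU instead of drifting further apart over time.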
3. Input and Rendering Management
- Input: Read once per frame, at the start of the main loop
- Rendering: Performed when the PPU signals that a frame is ready (V-Blank)
- Synchronization: pygame.time.Clock.tick(60) caps the loop at 60 FPS after each render
Design Decisions
Why not update the PPU on every instruction? The PPU is the most computationally expensive component. Updating it on every instruction (up to 17,556 times per frame) creates a massive bottleneck. Updating it once per scanline (154 times per frame) reduces the call overhead by 99% without affecting visual accuracy, since the PPU processes entire lines anyway.
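The 99% figure follows directly from the frame timing. In the sketch below the variable names are illustrative, and 17,556 is treated as an upper bound: it is the number of 4 T-Cycle M-Cycles in a frame, so a real instruction mix with longer instructions would produce fewer per-instruction calls.

```python
CYCLES_PER_FRAME = 70_224
CYCLES_PER_SCANLINE = 456
T_CYCLES_PER_M_CYCLE = 4

# Upper bound: one PPU call per instruction if every instruction took 1 M-Cycle.
per_instruction_calls = CYCLES_PER_FRAME // T_CYCLES_PER_M_CYCLE
# Scanline-based: one PPU call per line.
per_scanline_calls = CYCLES_PER_FRAME // CYCLES_PER_SCANLINE
reduction = 1 - per_scanline_calls / per_instruction_calls

print(per_instruction_calls)   # 17556
print(per_scanline_calls)      # 154
print(f"{reduction:.1%}")      # 99.1%
```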
Why keep the Timer on every instruction? The Timer (especially DIV) is used as a source of randomness in many games. If it is not updated on every instruction, the game can read the same value multiple times, breaking the RNG and causing erroneous behavior (random Game Over in Tetris).
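The effect of a stale Timer can be illustrated with a toy model. This is a sketch, not the project's Timer: the Div class, the div_reads helper, and the fixed 4 T-Cycle instruction cost are all assumptions for illustration. DIV is modeled as the upper 8 bits of a 16-bit counter that advances once per T-Cycle, so it changes every 256 T-Cycles. The example compares a Timer ticked on every instruction with one that is deferred to once per scanline, as the PPU is.

```python
class Div:
    """Toy DIV model: visible value is the upper 8 bits of a 16-bit counter."""

    def __init__(self):
        self.counter = 0

    def tick(self, t_cycles):
        self.counter = (self.counter + t_cycles) & 0xFFFF

    def read(self):
        return self.counter >> 8


def div_reads(batch_cycles, read_times):
    """DIV values seen at each timestamp (in T-Cycles) when the timer is only
    ticked after `batch_cycles` T-Cycles have accumulated."""
    div = Div()
    now = 0
    pending = 0
    seen = []
    for t in read_times:
        while now < t:
            now += 4                 # one instruction, assumed 4 T-Cycles
            pending += 4
            if pending >= batch_cycles:
                div.tick(pending)    # flush the accumulated cycles
                pending = 0
        seen.append(div.read())
    return seen


read_times = [0, 300, 600, 900]
precise = div_reads(4, read_times)      # Timer ticked on every instruction
deferred = div_reads(456, read_times)   # Timer ticked once per scanline
print(precise)    # [0, 1, 2, 3]
print(deferred)   # [0, 0, 1, 1]
```

With the deferred Timer, consecutive reads return repeated values, which is the kind of repeated "random" number described above.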
Affected Files
src/viboy.py - Rewrote the run() method with the scanline-based architecture and added the _execute_cpu_timer_only() method
Tests and Verification
Status: Pending verification with real ROMs.
Success Criteria:
- Performance: The emulator should reach a stable 60 FPS on modern hardware (i7-10700K, 2080 Ti)
- Timer Accuracy: Tetris should work correctly without random Game Over (correct RNG)
- Rendering: The image must be identical to that of the previous cycle-by-cycle architecture
- Input Lag: Input must be responsive (a true 60 FPS eliminates noticeable input lag)
ROMs to Try:
- Tetris (user-contributed ROM, not distributed): Verify that the pieces rotate correctly and there is no random Game Over
- Pokémon Red/Blue (user-contributed ROM, not distributed): Verify that it gets past the logo without hanging and that performance is smooth
Note: Tests will be run manually after this change. Any problems found will be documented in the next step.
Sources consulted
- Pan Docs: LCD Timing, System Clock, Frame Rate
- Emulator architecture: Common pattern in high-performance emulators (PyBoy, etc.)
Note: The scanline-based architecture is an industry standard pattern for balancing performance and precision in emulators. No code was copied from other emulators, only the architectural principle documented in emulation literature.
Educational Integrity
What I Understand Now
- Python Overhead: Function calls in Python have a significant cost. Updating the PPU up to 17,556 times per frame creates a massive bottleneck, even on modern hardware.
- Selective Precision: Not all components need the same precision. The Timer (used for RNG) must be updated on every instruction, but the PPU can be updated per scanline without losing visual precision.
- Hybrid Architecture: The solution chosen is a hybrid architecture that combines precision where necessary (CPU, Timer) with optimization where possible (PPU).
- Scanlines as Unit of Work: The PPU processes full lines anyway, so updating it per scanline (456 cycles) is natural and is not expected to introduce visual inaccuracies.
What remains to be confirmed
- Actual Performance: Verify that the emulator reaches a stable 60 FPS on modern hardware after this optimization.
- Timer Accuracy: Confirm that Tetris works correctly without random Game Over (correct RNG).
- Visual Compatibility: Verify that the rendered image is identical to that of the previous cycle-by-cycle architecture (there should be no visual differences).
- Edge Cases: Verify that games that critically depend on PPU timing (if any) work correctly with this architecture.
Hypotheses and Assumptions
Main Hypothesis: Updating the PPU once per scanline (456 cycles) instead of on every instruction does not introduce visual inaccuracies, because the PPU processes complete lines in all modes. This hypothesis rests on the understanding that the PPU renders entire lines, not individual pixels per instruction.
Performance Assumption: Reducing PPU calls from 17,556 to 154 per frame (a 99% reduction) is assumed to be enough to reach 60 FPS on modern hardware. If not, additional optimizations will be considered (e.g. updating the PPU only on visible lines, not during V-Blank).
Next Steps
- [ ] Check performance: Run Tetris and measure FPS (should be stable 60 FPS)
- [ ] Check Timer accuracy: Confirm that Tetris does not have random Game Over
- [ ] Check visual compatibility: Compare rendering with previous architecture
- [ ] If there are performance issues: Consider additional optimizations (update the PPU only on visible lines)
- [ ] If there are accuracy issues: Investigate whether any games require more frequent PPU updates