⚠️ Clean-Room / Educational

This project is educational and Open Source. No code is copied from other emulators. Implementation based solely on technical documentation and permitted tests.

Performance Profiling and Optimization: Logging Removal and Fast-Path MMU

Date:2025-12-18 StepID:0082 State: Verified

Summary

The emulator worked correctly but with poor performance (8-14 FPS) which made it unplayable. A profiling script was created usingcProfilewhich identified that 42% of the execution time was spent on calls tologger.debug()even if the level was CRITICAL. Optimizations applied surgical: complete delogging on critical paths (hot paths), optimization of check order inMMU.read_byte()(fast path for ROM first), and use of__slots__in the MMU class. Result: 30% improvement in performance, from 7.5 FPS to 9.7 FPS in headless mode, with 30% more cycles executed at the same time.

Hardware Concept

Profiling and Performance Optimization in Emulators: An emulator must run millions of instructions per second (the Game Boy runs at 4.19 MHz). Each operation that adds unnecessary overhead, No matter how small, it multiplies millions of times and can cause serious performance problems.

Hot Paths: These are the functions that are executed millions of times per second:MMU.read_byte()(memory access),CPU.step()(execution of instructions),CPU.fetch_byte()(reading opcodes), andPPU.step()(graphics engine preview). Any overhead in these functions has an exponential impact.

Logging Overhead: Even though the logger is set to CRITICAL level, Python must still: 1) Prepare the calllogger.debug(), 2) Evaluate the arguments (f-strings create new strings), 3) Call the function, 4) Check the level internally, 5) Discard the result. Do this 2.8 million times per second consumes significant resources.

Fast Path Optimization: InMMU.read_byte(), the most frequent access is to read ROM (instruction fetch from 0x0000-0x7FFF). Reorder theif/elifto check ROM first reduces the average number of comparisons per access, improving performance.

Implementation

Created a profiling script (tools/profile_viboy.py) that runs the emulator in headless mode for 10 seconds and analyze the time consumed by each function usingcProfileandpstats.

Initial Profiling Results

Profiling revealed that 42% of total time was spent logging:

  • logging.debug(): 1,314,522 calls → 1,797 seconds
  • logging.isEnabledFor(): 1,314,522 calls → 1,411 seconds
  • logging._is_disabled(): 1,314,522 calls → 1,016 seconds
  • Total: ~4.2 seconds of 10 seconds (42%)

The initial performance was7.5 theoretical FPS(131,585 M-Cycles/second), well below the target of 60 FPS.

Applied Optimizations

1. Removal of Logging in Hot Paths

  • MMU: Disabled logging at CRITICAL level insrc/memory/mmu.py
  • CPU: Disabled logging at CRITICAL level insrc/cpu/core.py
  • Commented critical callsinCPU.step(), CPU.handle_interrupts(), and stack functions (_push_byte, _pop_byte, _push_word, _pop_word)
  • Commented calls in dynamic handlerswithin_init_ld_handler_lazy()

2. Fast-Path Optimization in MMU.read_byte()

  • Reordered if/elifto check ROM first (addr<= 0x7FFF)
  • Pooled access to WRAM/HRAM(0xC000-0xFFFF) with specific I/O checks
  • VRAM at the end(0x8000-0xFEFF) as less frequent
  • Reducing the average number of comparisons per memory access

3. Optimization with __slots__

  • Added__slots__ = ['_memory', '_cartridge', '_ppu', '_joypad', '_timer', 'vram_write_count']to classMMU
  • Reduces attribute access overhead by eliminating the instance dictionary
  • Improve memory usage and access speed

Final Results

After optimizations:

  • Theoretical FPS: 7.5 → 9.7 FPS (29% improvement)
  • M-Cycles executed: 1,315,852 → 1,710,334 (30% more cyclesat the same time)
  • Time in logging: 4.2s → 1.9s (55% reduction)
  • T-Cycles/second: 526,340 → 684,133 (30% improvement)

Note: There are still ~529k calls tologger.debug()active, but the time consumed was reduced significantly by deactivating the logger at CRITICAL level. To eliminate them completely it would be necessary comment out all calls or use conditional checks, but they would add overhead.

Affected Files

  • tools/profile_viboy.py(new) - Profiling script using cProfile and pstats
  • src/memory/mmu.py(changed) - Added __slots__, optimized read_byte() with fast path, disabled logging
  • src/cpu/core.py(changed) - Disabled logging at CRITICAL level, commented out critical calls

Tests and Verification

Scientific profiling was used withcProfileto identify and verify optimizations:

Initial Profiling (Before Optimizations)

Command executed:

python tools/profile_viboy.py

Around:Windows/Python 3.13.5

Result:

  • Theoretical FPS:7.5FPS
  • M-Cycles executed: 1,315,852
  • Logging time: ~4.2 seconds (42% of total time)
  • Top function:MMU.read_byte()with 2,844,148 calls (1,282s)

Final Profiling (After Optimizations)

Command executed:

python tools/profile_viboy.py

Around:Windows/Python 3.13.5

Result:

  • Theoretical FPS:9.7FPS(29% improvement)
  • M-Cycles executed: 1,710,334 (30% more cycles)
  • Logging time: ~1.9 seconds (19% of total time, 55% reduction)
  • Top function:MMU.read_byte()with 3,695,025 calls (1,152s) - more efficient

What Validates Profiling

  • Identification of bottlenecks: Profiling shows exactly which functions consume the most time
  • Measuring improvements: Allows you to quantify the impact of each optimization
  • Efficiency verification: Confirms that the optimizations actually improve performance

Sources consulted

Note: Optimization is based on general principles of profiling and optimization of Python code, applied specifically to the needs of a Game Boy emulator.

Educational Integrity

What I Understand Now

  • Scientific profiling: Use tools likecProfileIt is essential before optimizing. Optimizing without data is "floating around in the dark."
  • Exponential impact of overhead: A small overhead (like a call to registration) is multiplied millions of times in an emulator, causing serious performance problems.
  • Fast paths: Reordering checks to put the most frequent first can have an impact significant when run millions of times.
  • Logging in production: Logging should be disabled completely in high-performance code, not just set to high level. Internal level checks have overhead.

What remains to be confirmed

  • Real-time performance with rendering: Profiling was done in headless mode. The real FPS with active graphic rendering remains to be verified.
  • Complete log deletion: There are still ~529k calls tologger.debug()active. It would be interesting to measure the impact of eliminating them completely (commenting them all).
  • Other possible optimizations: Profiling shows thatMMU.read_byte()is still the most called function. Additional optimizations such as caching or lookup tables could be explored.

Hypotheses and Assumptions

It is assumed that the order of verification inMMU.read_byte()(ROM first) is optimal based on the fact that the fetch of instructions is the most frequent operation. However, this could vary depending on the game or the execution phase. More detailed profiling per phase could reveal different patterns.

Next Steps

  • [ ] Check real-time performance with active graphics rendering
  • [ ] Test the emulator with different ROMs to validate that the optimizations do not introduce bugs
  • [ ] Explore additional optimizations inMMU.read_byte()if the performance is still not enough
  • [ ] Consider completely eliminating calls tologger.debug()remaining if necessary