This project is educational and Open Source. No code is copied from other emulators. Implementation based solely on technical documentation and permitted tests.
Performance Profiling and Optimization: Logging Removal and Fast-Path MMU
Summary
The emulator worked correctly but with poor performance (8-14 FPS) which made it unplayable.
A profiling script was created usingcProfilewhich identified that 42% of the execution time
was spent on calls tologger.debug()even if the level was CRITICAL. Optimizations applied
surgical: complete delogging on critical paths (hot paths), optimization of check order
inMMU.read_byte()(fast path for ROM first), and use of__slots__in the MMU class.
Result: 30% improvement in performance, from 7.5 FPS to 9.7 FPS in headless mode, with 30% more cycles executed
at the same time.
Hardware Concept
Profiling and Performance Optimization in Emulators: An emulator must run millions of instructions per second (the Game Boy runs at 4.19 MHz). Each operation that adds unnecessary overhead, No matter how small, it multiplies millions of times and can cause serious performance problems.
Hot Paths: These are the functions that are executed millions of times per second:MMU.read_byte()(memory access),CPU.step()(execution of instructions),CPU.fetch_byte()(reading opcodes), andPPU.step()(graphics engine preview).
Any overhead in these functions has an exponential impact.
Logging Overhead: Even though the logger is set to CRITICAL level, Python must still:
1) Prepare the calllogger.debug(), 2) Evaluate the arguments (f-strings create new strings),
3) Call the function, 4) Check the level internally, 5) Discard the result. Do this 2.8 million
times per second consumes significant resources.
Fast Path Optimization: InMMU.read_byte(), the most frequent access is to read ROM
(instruction fetch from 0x0000-0x7FFF). Reorder theif/elifto check ROM first
reduces the average number of comparisons per access, improving performance.
Implementation
Created a profiling script (tools/profile_viboy.py) that runs the emulator in headless mode
for 10 seconds and analyze the time consumed by each function usingcProfileandpstats.
Initial Profiling Results
Profiling revealed that 42% of total time was spent logging:
logging.debug(): 1,314,522 calls → 1,797 secondslogging.isEnabledFor(): 1,314,522 calls → 1,411 secondslogging._is_disabled(): 1,314,522 calls → 1,016 seconds- Total: ~4.2 seconds of 10 seconds (42%)
The initial performance was7.5 theoretical FPS(131,585 M-Cycles/second), well below the target of 60 FPS.
Applied Optimizations
1. Removal of Logging in Hot Paths
- MMU: Disabled logging at CRITICAL level in
src/memory/mmu.py - CPU: Disabled logging at CRITICAL level in
src/cpu/core.py - Commented critical callsin
CPU.step(),CPU.handle_interrupts(), and stack functions (_push_byte,_pop_byte,_push_word,_pop_word) - Commented calls in dynamic handlerswithin
_init_ld_handler_lazy()
2. Fast-Path Optimization in MMU.read_byte()
- Reordered if/elifto check ROM first (
addr<= 0x7FFF) - Pooled access to WRAM/HRAM(0xC000-0xFFFF) with specific I/O checks
- VRAM at the end(0x8000-0xFEFF) as less frequent
- Reducing the average number of comparisons per memory access
3. Optimization with __slots__
- Added
__slots__ = ['_memory', '_cartridge', '_ppu', '_joypad', '_timer', 'vram_write_count']to classMMU - Reduces attribute access overhead by eliminating the instance dictionary
- Improve memory usage and access speed
Final Results
After optimizations:
- Theoretical FPS: 7.5 → 9.7 FPS (29% improvement)
- M-Cycles executed: 1,315,852 → 1,710,334 (30% more cyclesat the same time)
- Time in logging: 4.2s → 1.9s (55% reduction)
- T-Cycles/second: 526,340 → 684,133 (30% improvement)
Note: There are still ~529k calls tologger.debug()active, but the time consumed was reduced
significantly by deactivating the logger at CRITICAL level. To eliminate them completely it would be necessary
comment out all calls or use conditional checks, but they would add overhead.
Affected Files
tools/profile_viboy.py(new) - Profiling script using cProfile and pstatssrc/memory/mmu.py(changed) - Added __slots__, optimized read_byte() with fast path, disabled loggingsrc/cpu/core.py(changed) - Disabled logging at CRITICAL level, commented out critical calls
Tests and Verification
Scientific profiling was used withcProfileto identify and verify optimizations:
Initial Profiling (Before Optimizations)
Command executed:
python tools/profile_viboy.py
Around:Windows/Python 3.13.5
Result:
- Theoretical FPS:7.5FPS
- M-Cycles executed: 1,315,852
- Logging time: ~4.2 seconds (42% of total time)
- Top function:
MMU.read_byte()with 2,844,148 calls (1,282s)
Final Profiling (After Optimizations)
Command executed:
python tools/profile_viboy.py
Around:Windows/Python 3.13.5
Result:
- Theoretical FPS:9.7FPS(29% improvement)
- M-Cycles executed: 1,710,334 (30% more cycles)
- Logging time: ~1.9 seconds (19% of total time, 55% reduction)
- Top function:
MMU.read_byte()with 3,695,025 calls (1,152s) - more efficient
What Validates Profiling
- Identification of bottlenecks: Profiling shows exactly which functions consume the most time
- Measuring improvements: Allows you to quantify the impact of each optimization
- Efficiency verification: Confirms that the optimizations actually improve performance
Sources consulted
- Python cProfile Documentation: https://docs.python.org/3/library/profile.html
- Python __slots__: Memory optimization and attribute access in classes
- Clean Code Principles: Optimizing hot paths in high-performance code
Note: Optimization is based on general principles of profiling and optimization of Python code, applied specifically to the needs of a Game Boy emulator.
Educational Integrity
What I Understand Now
- Scientific profiling: Use tools like
cProfileIt is essential before optimizing. Optimizing without data is "floating around in the dark." - Exponential impact of overhead: A small overhead (like a call to registration) is multiplied millions of times in an emulator, causing serious performance problems.
- Fast paths: Reordering checks to put the most frequent first can have an impact significant when run millions of times.
- Logging in production: Logging should be disabled completely in high-performance code, not just set to high level. Internal level checks have overhead.
What remains to be confirmed
- Real-time performance with rendering: Profiling was done in headless mode. The real FPS with active graphic rendering remains to be verified.
- Complete log deletion: There are still ~529k calls to
logger.debug()active. It would be interesting to measure the impact of eliminating them completely (commenting them all). - Other possible optimizations: Profiling shows that
MMU.read_byte()is still the most called function. Additional optimizations such as caching or lookup tables could be explored.
Hypotheses and Assumptions
It is assumed that the order of verification inMMU.read_byte()(ROM first) is optimal based on
the fact that the fetch of instructions is the most frequent operation. However, this could vary
depending on the game or the execution phase. More detailed profiling per phase could reveal different patterns.
Next Steps
- [ ] Check real-time performance with active graphics rendering
- [ ] Test the emulator with different ROMs to validate that the optimizations do not introduce bugs
- [ ] Explore additional optimizations in
MMU.read_byte()if the performance is still not enough - [ ] Consider completely eliminating calls to
logger.debug()remaining if necessary