This project is educational and Open Source. No code is copied from other emulators. Implementation based solely on technical documentation and permitted tests.
Correct Profiling Measurement + Headless/UI Comparison + Minimum Fix
Summary
Fixed the profiling system so that it measures stage_sum_ms, frame_wall_ms and pacing_ms separately. Improved [UI-PATH] logging to include core metrics (PC, VRAM_nonzero, LCDC, BGP, LY) and a NonWhite estimate computed from a 16×16 sampling grid (256 points). Post-blit verification now samples 64 points (8×8 grid). Created a headless vs UI comparison script that produces a comparison table, so that numerical evidence decides whether the problem lies in the presenter/UI or in the core. Log spam is controlled with strict gating.
Hardware Concept
Correct Profiling: The previous profiling measured accumulated time incorrectly, generating inconsistent values (TOTAL: 421ms vs sum of stages: 5.88ms). Correct profiling must measure:
- stage_sum_ms: Sum of the individual render stages (frombuffer/reshape, blit_array, scale/blit, flip)
- frame_wall_ms: Total time of the entire frame (wall-clock time from start to finish)
- pacing_ms: Waiting/synchronization time (difference between wall and stages: wall - stages)
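The separation described above can be sketched with nothing more than time.perf_counter. This is a minimal illustration, not the renderer's actual code: the FrameProfile class, its method names and the demo function are hypothetical, and time.sleep stands in both for real stage work and for frame pacing.

```python
import time
from contextlib import contextmanager


class FrameProfile:
    """Collects per-stage timings for one frame (hypothetical helper)."""

    def __init__(self):
        self.stages_ms = {}
        self.frame_wall_ms = 0.0
        # Wall clock starts before any stage runs.
        self._wall_start = time.perf_counter()

    @contextmanager
    def stage(self, name):
        """Time one render stage and add it to the per-stage totals."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.stages_ms[name] = self.stages_ms.get(name, 0.0) + elapsed_ms

    def finish(self):
        """Close the frame: wall time covers everything since construction."""
        self.frame_wall_ms = (time.perf_counter() - self._wall_start) * 1000.0

    @property
    def stage_sum_ms(self):
        return sum(self.stages_ms.values())

    @property
    def pacing_ms(self):
        # Whatever the frame spent outside the measured stages.
        return self.frame_wall_ms - self.stage_sum_ms


def profile_demo_frame():
    profile = FrameProfile()
    with profile.stage("blit"):
        time.sleep(0.002)   # stand-in for real blit work
    with profile.stage("flip"):
        time.sleep(0.001)   # stand-in for the display flip
    time.sleep(0.005)       # stand-in for frame pacing (waiting)
    profile.finish()
    return profile
```

Because every stage is timed inside the wall-clock window, stage_sum_ms can never exceed frame_wall_ms, which rules out the kind of mismatch reported above.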
NonWhite Sampling: Previous sampling was insufficient (only 3 pixels). To obtain reliable estimates:
- Grid 16×16: 256 sample points to calculate NonWhite before blit
- Grid 8×8: 64 sample points to verify NonWhite after blit
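As a rough sketch of grid sampling, the function below samples the centre of each grid cell in a packed RGB888 buffer and scales the hit ratio to the whole frame. The function name, the buffer layout (3 bytes per pixel, row-major) and the 160×144 default size are assumptions made for illustration; the real renderer may store the framebuffer differently.

```python
def sample_nonwhite_grid(rgb, width=160, height=144, grid=16):
    """Count non-white pixels on a grid x grid lattice of an RGB888 buffer.

    Returns (hits, estimated_total), where estimated_total scales the
    sampled hit ratio to the full frame.
    """
    hits = 0
    for gy in range(grid):
        # Row at the centre of grid cell gy.
        y = (gy * height + height // 2) // grid
        for gx in range(grid):
            # Column at the centre of grid cell gx.
            x = (gx * width + width // 2) // grid
            offset = (y * width + x) * 3
            if bytes(rgb[offset:offset + 3]) != b"\xff\xff\xff":
                hits += 1
    estimated_total = hits * (width * height) // (grid * grid)
    return hits, estimated_total
```

With grid=16 this takes 256 samples and with grid=8 it takes 64, matching the two grids listed above.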
Headless vs UI Comparison: To decide whether the problem is in the presenter/UI or in the core, the framebuffer generated by the core (headless) is compared with the framebuffer presented by the UI. If headless has NonWhite > 0 but the UI's pre-blit NonWhite is ~0, the problem is in how the UI obtains the framebuffer. If headless has NonWhite > 0 and the UI's pre-blit NonWhite is > 0 but its post-blit NonWhite is ~0, the problem is in the presenter/blit.
Implementation
Phase A: Fix Profiling
File: src/gpu/renderer.py
Changes:
- Measure frame_wall_start at the beginning of the entire frame (before any stage)
- Measure each individual stage: frombuffer_ms, blit_ms, scale_blit_ms, flip_ms
- Calculate stage_sum_ms = sum of individual stages
- Calculate frame_wall_ms = total time from frame_wall_start until the end
- Calculate pacing_ms = frame_wall_ms - stage_sum_ms
- Formatted log:
[UI-PROFILING] Frame N | stages=Xms (frombuf=A blit=B scale=C flip=D) | wall=Yms | pacing=Zms
Phase B: Improve Logging [UI-PATH]
Files: src/gpu/renderer.py, src/viboy.py
Changes:
- Add an optional metrics parameter to render_frame()
- Create helper function _sample_vram_nonzero() in Viboy (sampling every 16th byte, same as the headless tool)
- Collect metrics in viboy.py: PC, VRAM_nonzero, LCDC, BGP, LY
- Pass metrics to render_frame() from all calls
- Improve NonWhite sampling: 16×16 grid = 256 points (previously every 64th pixel ≈ 3 points)
- Formatted log:
[UI-PATH] F4 | Path=cpp_rgb_view | PC=6152 | LCDC=E3 | BGP=FC | LY=90 | VRAMnz=2028 | NonWhite=23040 | Hash=abc12345 | wall=16.7ms
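The VRAM sampling can be sketched as follows. This is an illustration under stated assumptions, not the project's _sample_vram_nonzero(): the function takes a generic read_byte callable as a stand-in for the emulator's memory read (whose real name is not shown in this log), and it returns the raw count of sampled non-zero bytes; whether the real helper scales that count is not stated here. The address range 0x8000-0x9FFF is the Game Boy VRAM region documented in Pan Docs.

```python
VRAM_START = 0x8000  # first VRAM address
VRAM_END = 0xA000    # one past the last VRAM address (0x9FFF)


def sample_vram_nonzero(read_byte, step=16):
    """Count non-zero bytes among every `step`-th VRAM address.

    `read_byte` is any callable mapping an address to a byte value
    (hypothetical stand-in for the emulator's memory read).
    """
    return sum(
        1
        for addr in range(VRAM_START, VRAM_END, step)
        if read_byte(addr) != 0
    )
```

With the default step the function reads 512 of the 8192 VRAM bytes, which keeps the per-frame cost low while still showing whether tile data has been loaded.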
Phase C: Enhanced Post-Blit Verification
File: src/gpu/renderer.py
Changes:
- Replace the 3-pixel sampling with an 8×8 grid = 64 points
- Sample from self.surface.get_at() after the blit
- Calculate the estimated nonwhite_after_total (multiply by density)
- Compare nonwhite_sample (before, 16×16 grid) vs nonwhite_after_total (after, 8×8 grid)
- Detect significant loss: if before > 1000 and after < 100, issue a warning
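The post-blit check can be sketched as below. Both function names are hypothetical. To keep the example runnable without a display, the sampler takes a get_at callable that accepts an (x, y) tuple and returns an indexable colour; with pygame one would presumably pass the surface's get_at method. The 1000 and 100 thresholds are the ones listed above.

```python
def estimate_nonwhite_after(get_at, width, height, grid=8):
    """Estimate non-white pixels from a grid x grid sample of a surface."""
    hits = 0
    for gy in range(grid):
        y = (gy * height + height // 2) // grid
        for gx in range(grid):
            x = (gx * width + width // 2) // grid
            pixel = get_at((x, y))
            # Compare only the RGB channels; alpha, if present, is ignored.
            if (pixel[0], pixel[1], pixel[2]) != (255, 255, 255):
                hits += 1
    # Scale the sampled hit ratio to the full surface ("multiply by density").
    return hits * (width * height) // (grid * grid)


def blit_lost_content(nonwhite_before, nonwhite_after, high=1000, low=100):
    """True when content existed before the blit but is almost gone after."""
    return nonwhite_before > high and nonwhite_after < low
```

A True result from blit_lost_content is the condition under which the warning would be issued.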
Phase D: Headless vs UI Comparison Script
File: tools/compare_headless_vs_ui_0448.sh
Functionality:
- Run headless tool for each ROM (Mario, Pokémon, Tetris, Zelda DX)
- Run UI with 15s timeout for each ROM
- Extract metrics from both (NonWhite, VRAM_nonzero, PC_end)
- Generate comparison table:
ROM | headless NonWhite | UI NonWhite_before | UI NonWhite_after | VRAMnz | PC_end
- Allow automatic decision based on the table:
- headless NonWhite > 0 and UI before > 0 but UI after ~0 → presenter/blit bug
- headless NonWhite > 0 and UI before ~0 → bug in how the UI gets the framebuffer
- Both 0 but VRAMnz high → PPU/palette bug
- Both 0 and VRAMnz ~0 and PC stuck → CPU/ROM exec bug
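The decision rules above can be written as a small classifier, shown here in Python for illustration even though the comparison script itself is a shell script. Everything in this sketch is an assumption beyond the four rules: the function name, the returned labels, the near_zero and vram_high thresholds (the log only says "~0" and "high"), the reading of "Both 0" as headless and UI pre-blit NonWhite both being ~0, and the pc_stuck flag, which the caller has to determine by some other means.

```python
def classify_run(headless_nonwhite, ui_before, ui_after, vram_nonzero,
                 pc_stuck, near_zero=0, vram_high=200):
    """Map one comparison-table row to a suspected area.

    near_zero and vram_high are illustrative thresholds, not project values.
    """
    headless_has_image = headless_nonwhite > near_zero
    ui_before_has_image = ui_before > near_zero
    ui_after_has_image = ui_after > near_zero

    if headless_has_image and ui_before_has_image and not ui_after_has_image:
        return "presenter/blit"
    if headless_has_image and not ui_before_has_image:
        return "ui-framebuffer-handoff"
    if not headless_has_image and not ui_before_has_image:
        if vram_nonzero >= vram_high:
            return "ppu/palette"
        if vram_nonzero <= near_zero and pc_stuck:
            return "cpu/rom-exec"
    # No rule matched, e.g. everything looks healthy or the data is mixed.
    return "inconclusive"
```

Rows that match none of the four rules fall through to "inconclusive" rather than being forced into a category.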
Phase E: Log Spam Control
File: src/gpu/renderer.py
Verification:
- Strict gating: should_log = (self._path_log_count < 5) or (self._path_log_count % 120 == 0)
- VIBOY_DEBUG_UI default OFF (os.environ.get('VIBOY_DEBUG_UI', '0') == '1')
- Logs [UI-PATH] and [UI-PROFILING] only within should_log or when FPS < 30
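A minimal sketch of the gating rule, assuming the counter advances once per rendered frame. The PathLogGate class and its parameter names are hypothetical; only the conditions (first 5 frames, every 120th frame, or FPS below 30) come from the list above, and how the VIBOY_DEBUG_UI flag combines with them is not modelled.

```python
class PathLogGate:
    """Decides whether a frame's [UI-PATH]/[UI-PROFILING] lines are emitted."""

    def __init__(self, first_n=5, every=120, fps_floor=30.0):
        self.first_n = first_n
        self.every = every
        self.fps_floor = fps_floor
        self.count = 0  # frames seen so far

    def should_log(self, fps):
        # Log the first few frames, then one frame out of every `every`.
        periodic = self.count < self.first_n or self.count % self.every == 0
        self.count += 1
        # A slow frame is always worth logging, whatever the counter says.
        return periodic or fps < self.fps_floor
```

At a steady 60 FPS this emits the first five frames and then one line roughly every two seconds.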
Affected Files
- src/gpu/renderer.py - Profiling fix, improved logging [UI-PATH], improved post-blit verification
- src/viboy.py - Helper function _sample_vram_nonzero(), pass metrics to render_frame()
- tools/compare_headless_vs_ui_0448.sh - Headless vs UI comparison script (created)
Tests and Verification
Compilation:
python3 setup.py build_ext --inplace
# BUILD_EXIT=0 ✓
python3 test_build.py
# TEST_BUILD_EXIT=0 ✓
Native Validation: Successfully compiled C++ module, functional Python-C++ interface.
Comparison Script: Script created and executable. Requires manual execution with ROMs to generate comparative table.
Sources consulted
- Pan Docs: Memory Map, I/O Registers (for VRAM, LCDC, BGP, LY metrics)
- Step 0442: Headless tool (tools/rom_smoke_0442.py) - Reference for VRAM_nonzero sampling
Educational Integrity
What I Understand Now
- Correct Profiling: To measure performance correctly, you must separate the real work time (stages) from the total time (wall), and the waiting/synchronization time (pacing). This allows us to identify whether the bottlenecks are in the render (high stages) or in the synchronization (high pacing).
- Representative Sampling: A sample of 3 pixels is not enough to estimate NonWhite. A 16×16 grid (256 points) or an 8×8 grid (64 points) provides much more reliable estimates, especially when comparing before/after blit.
- Headless vs UI Comparison: To diagnose presentation problems, it is crucial to compare what the core generates (headless) with what the UI presents. If headless has data but the UI doesn't, the problem is on the UI side (in how the UI gets the framebuffer, or in the presenter/blit). If headless has no data, the problem is in the core.
What remains to be confirmed
- Running the Comparison Script: The script is created but requires manual execution with real ROMs to generate the comparison table and make decisions based on numerical evidence.
- Final Diagnosis: Once the comparison script is executed, you can decide with certainty if the problem is presenter/UI (blit, format, surface) or core (CPU/VRAM/PPU does not produce an image for that ROM).
Next Steps
- [ ] Run headless vs UI comparison script with real ROMs (Mario, Pokémon, Tetris, Zelda DX)
- [ ] Analyze comparative table and make decision based on numerical evidence
- [ ] If the problem is presenter/UI: investigate and correct bug in blit/format/surface
- [ ] If the problem is core: investigate why PPU does not produce an image for that specific ROM