
Shuboy is a Gameboy emulator for the Sega 32X. It started as a port of an emulator I originally wrote for Windows but never released. The 32X version has been completely rewritten in SH2 assembly language and heavily optimized, so it no longer bears much resemblance to the original code.
For those unfamiliar with the 32X, it was an add-on for the Megadrive (Genesis in the U.S.) that ran in tandem with the base console. It contained two Hitachi SuperH-2 (SH2) processors in a master-slave configuration clocked at 23 MHz, 256 kB of RAM, a simple GPU providing a linear framebuffer, and a PWM sound circuit.
The very basic "VGA-ish" GPU is a real bottleneck when emulating a tile-based system, since you can't rely on the hardware to draw the tiles for you, as you could if you were writing an emulator for e.g. the Gameboy Advance or the Sega Saturn.
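To give an idea of what drawing tiles in software involves, here is a minimal C sketch (not the emulator's actual SH2 code) that decodes one row of a Gameboy tile into a linear framebuffer. The Gameboy stores each 8-pixel row as two bit-plane bytes, with bit 7 being the leftmost pixel. The framebuffer layout (one byte per pixel, `FB_WIDTH` pixels per line) and the name `draw_tile_row` are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: one byte per pixel, FB_WIDTH pixels per line. */
enum { FB_WIDTH = 320 };

/* Decode one 8-pixel tile row into the framebuffer at (x, y).
 * lo and hi are the row's two bit planes; bit 7 is the leftmost pixel.
 * Each output byte is a 2-bit colour index (0..3). */
static void draw_tile_row(uint8_t *fb, int x, int y, uint8_t lo, uint8_t hi)
{
    assert(x >= 0 && x + 8 <= FB_WIDTH && y >= 0);
    uint8_t *dst = fb + y * FB_WIDTH + x;
    for (int i = 0; i < 8; i++) {
        int bit = 7 - i;
        dst[i] = (uint8_t)((((hi >> bit) & 1) << 1) | ((lo >> bit) & 1));
    }
}
```

Every visible pixel has to pass through a loop like this once per frame, which is work that tile-based video hardware would otherwise do for free.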
A number of things were done to try to squeeze more speed out of the emulator:
- First of all, all of the code was handwritten in SH2 assembly language to keep unnecessary instructions to a minimum and to keep as many variables as possible in registers while they're needed.
- The SH2's special cache mode is used, in which the 4 kB cache is split into two halves and one half can be addressed directly. This makes it possible to explicitly keep frequently used code and data cached at all times, which becomes especially important when trying to avoid competition between the two SH2s for access to the shared RAM.
- Emulation was split into two "threads", where the master SH2 runs the GB-Z80 emulation while the slave SH2 runs the GB-PPU emulation. The CPU emulation and the PPU emulation are then synchronized every couple of scanlines to ensure that neither one gets too far ahead of the other, while cache contents are synchronized each scanline to ensure that data needed for PPU emulation is up to date with what has happened on the other side.
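
The scanline synchronization between the two sides can be sketched roughly as follows. This is a simplified, single-threaded C model and not the actual SH2 code: `MAX_LEAD`, `may_advance` and `simulate` are hypothetical names, and the real implementation would presumably have each SH2 poll a counter in shared RAM rather than take turns in a loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical limit: how many scanlines one side may run ahead. */
enum { MAX_LEAD = 2 };

/* True if the side at my_line may emulate one more scanline without
 * ending up more than MAX_LEAD lines ahead of the other side. */
static bool may_advance(int my_line, int other_line)
{
    assert(my_line >= 0 && other_line >= 0);
    return my_line - other_line < MAX_LEAD;
}

static int gap(int a, int b) { return a > b ? a - b : b - a; }

/* Stand-in for the two processors: in each round the CPU side tries to
 * run cpu_steps lines and the PPU side ppu_steps lines, each stopping
 * early when it would get too far ahead. Returns the largest distance
 * between the two line counters seen during the run. */
static int simulate(int rounds, int cpu_steps, int ppu_steps)
{
    int cpu_line = 0, ppu_line = 0, max_gap = 0;
    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < cpu_steps && may_advance(cpu_line, ppu_line); i++)
            cpu_line++;
        if (gap(cpu_line, ppu_line) > max_gap) max_gap = gap(cpu_line, ppu_line);
        for (int i = 0; i < ppu_steps && may_advance(ppu_line, cpu_line); i++)
            ppu_line++;
        if (gap(cpu_line, ppu_line) > max_gap) max_gap = gap(cpu_line, ppu_line);
    }
    return max_gap;
}
```

Whichever side is faster simply stalls once it is `MAX_LEAD` lines ahead, so the distance between the two never exceeds that limit regardless of their relative speeds.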
Despite all of these optimizations, the speed barely reaches playable levels. Though it might be possible to achieve slightly better performance with the current design, better choices would probably be to write a dynarec GB-Z80 emulator instead of the current interpreting core, and/or to target the Sega Saturn instead of the 32X. The Saturn has the same dual SH2 configuration, meaning that much of the code could be re-used while the PPU emulation could be hardware-accelerated by the Saturn's VDPs.