So after a whole hell of a lot of research and development, I managed to emulate enough of a GameBoy to properly run the DMG boot ROM (it actually gets far enough to swap out the boot ROM for the first 0x100 bytes of the game cart, then start executing code from the game ROM, but the emulation isn't complete enough to run anything useful right now):
I still have plenty of work to do, but I think I'm off to a decent start.
Edit: I ended up doing a ton of improvements on this. After I got it to this point the emulator was unable to run at full speed on a mobile i9. No, I'm not joking. I didn't think to get an exact figure on how long one "second" of emulated time took in real time, but since my frontend doesn't support frame skipping, it was effectively running at half speed.
It's kind of amazing how a few microseconds quickly start to add up when you're literally simulating five million clock ticks per second (CPU runs at ~1.05 MHz, PPU dot clock is ~4.19 MHz). I believe that my CPU core is cycle-accurate, and I'm aiming for cycle-accuracy on the PPU as well (to a point, anyway; some of the specific timings involved with the video generation phase, PPU mode 3, are unclear).
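To put a number on "a few microseconds add up", here is a quick back-of-the-envelope sketch. The constant names are made up for illustration. The exact clock values are the standard DMG figures (a 4.194304 MHz dot clock, with the CPU doing one machine cycle every four dots), not something read out of Plip.

```cpp
#include <cstdint>

// DMG master clock in Hz; the PPU dot clock runs at this rate.
constexpr std::uint64_t kDotClockHz = 4194304;

// The CPU performs one machine cycle every four dots (~1.05 MHz).
constexpr std::uint64_t kCpuClockHz = kDotClockHz / 4;

// Combined CPU + PPU ticks to simulate per emulated second (~5.24 million).
constexpr std::uint64_t kTicksPerSecond = kDotClockHz + kCpuClockHz;

// Average wall-clock budget per tick, in nanoseconds, to hit full speed.
constexpr double kNanosPerTick = 1e9 / static_cast<double>(kTicksPerSecond);
```

That works out to a budget of roughly 190 nanoseconds per tick on average, so anything on the hot path that costs even a fraction of a microsecond eats the whole budget.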
While there were a few little micro-optimizations that probably did more for debug builds than release builds (flipping from range-based for loops to index-based, converting some if/else blocks to switch blocks, etc.), the most impactful changes were made to the memory mapper.
Basically, Plip was designed to be more of an emulation interface than a single emulator. It doesn't split off the cores as separate libraries, à la RetroArch (though it probably could, honestly), but it's conceptually similar.
One of the things that I did was generalize memory access. Plip has a memory mapper, and that memory mapper takes PlipMemory objects (that being an abstract class with pure virtual methods, with RAM and ROM implementations). The core then assigns those memory blocks to specific addresses, and when you want to access memory you tell the mapper to fetch a byte and it'll handle all the hard work for you. For instance, if you have a ROM at 0x0000-0x3FFF, system RAM at 0x4000-0x5FFF, and video RAM at 0x6000-0x7FFF and you request a byte from 0x4800, it'll check the mapping table, find that the requested byte lives in the system RAM instance, then do some fancy math and return the byte at offset 0x0800 in system RAM.
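A minimal sketch of that lookup, to make the 0x4800 example concrete. This is not Plip's actual code: `Memory`, `Ram`, `Mapping`, `Mapper`, and every method name here are hypothetical stand-ins, and returning 0xFF for unmapped reads is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <vector>

// Hypothetical stand-in for the PlipMemory abstract class.
class Memory {
public:
    virtual ~Memory() = default;
    virtual std::uint8_t GetByte(std::uint32_t offset) const = 0;
    virtual void SetByte(std::uint32_t offset, std::uint8_t value) = 0;
};

// Simple RAM implementation backed by a byte vector.
class Ram : public Memory {
public:
    explicit Ram(std::size_t size) : m_data(size, 0) {}
    std::uint8_t GetByte(std::uint32_t offset) const override { return m_data[offset]; }
    void SetByte(std::uint32_t offset, std::uint8_t value) override { m_data[offset] = value; }
private:
    std::vector<std::uint8_t> m_data;
};

// One entry in the mapping table.
struct Mapping {
    std::uint32_t start;   // first bus address covered by this entry
    std::uint32_t length;  // number of bytes covered
    std::uint32_t offset;  // offset into the block that `start` maps to
    Memory* memory;        // block that backs this range (not owned)
};

class Mapper {
public:
    void Assign(Memory* memory, std::uint32_t start, std::uint32_t length,
                std::uint32_t offset = 0) {
        m_table.push_back({start, length, offset, memory});
    }

    // Walk the table, find the block that owns the address, and translate
    // the bus address into an offset inside that block.
    std::uint8_t GetByte(std::uint32_t address) const {
        for (const Mapping& m : m_table) {
            if (address >= m.start && address - m.start < m.length) {
                return m.memory->GetByte(address - m.start + m.offset);
            }
        }
        return 0xFF;  // assumption: unmapped reads return 0xFF
    }

private:
    std::list<Mapping> m_table;  // linked list, as in the first version described
};
```

With system RAM assigned at 0x4000, `GetByte(0x4800)` finds the system RAM entry and reads offset 0x4800 - 0x4000 = 0x0800 from it.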
This system is nice because it inherently supports banked ROM and RAM. All I have to do is update the offset of a block and I'm suddenly in a new bank. Additionally, this ended up being useful for a little GameBoy quirk, known as ECHO RAM (due to how the DMG's memory controller works, 0xC000-0xDDFF is mirrored to 0xE000-0xFDFF). All I had to do to simulate that was simply add the work RAM block to the upper address and it Just Worked™.
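Both tricks fall out of the same table entry shape. The sketch below is a simplified illustration, not Plip's code: `Block`, `Mapping`, `Mapper`, `SelectBank`, and the 0x4000 bank size (the usual size of a switchable GameBoy ROM bank) are assumptions made for the example.

```cpp
#include <cstdint>
#include <vector>

using Block = std::vector<std::uint8_t>;  // stand-in for a PlipMemory block

struct Mapping {
    std::uint32_t start;   // first bus address covered
    std::uint32_t length;  // number of bytes covered
    std::uint32_t offset;  // offset into the block that `start` maps to
    Block* block;          // backing storage (not owned)
};

struct Mapper {
    std::vector<Mapping> table;

    // Returns a pointer to the byte backing `address`, or nullptr if unmapped.
    std::uint8_t* Locate(std::uint32_t address) {
        for (Mapping& m : table) {
            if (address >= m.start && address - m.start < m.length) {
                return &(*m.block)[address - m.start + m.offset];
            }
        }
        return nullptr;
    }

    std::uint8_t Read(std::uint32_t address) {
        std::uint8_t* p = Locate(address);
        return p ? *p : 0xFF;  // assumption: unmapped reads return 0xFF
    }

    void Write(std::uint32_t address, std::uint8_t value) {
        if (std::uint8_t* p = Locate(address)) *p = value;
    }
};

constexpr std::uint32_t kBankSize = 0x4000;

// Bank switching only changes the offset of an existing mapping;
// the mapped address range and the backing block stay the same.
inline void SelectBank(Mapping& switchable, std::uint32_t bank) {
    switchable.offset = bank * kBankSize;
}
```

ECHO RAM is then just the same work RAM block mapped twice, once at 0xC000 (length 0x2000) and once at 0xE000 (length 0x1E00): a write through either range lands in the same bytes.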
Now, there is a pretty substantial problem with this that I hinted at earlier: that find routine costs CPU cycles, and doing it unnecessarily is a huge problem. Obviously, the CPU needs to use the mapper for all of its memory access, since it doesn't really know any details about the memory layout. This isn't a big deal, since the GB's CPU cannot read or write memory more than once per cycle. The problem lies in the PPU. Not only is the dot clock four times faster than the CPU clock, but the PPU also has a bunch of registers that it needs to both read and update in order to display the image. This results in the lookup function being called millions of times per second.
The fix for this was simple: since the PPU is only reading and writing certain specified registers, I can easily get away with directly accessing the PlipMemory objects declared in the core (m_videoRam, m_oam, and m_ioRegisters, in my case). I just handled all of the arithmetic to get the appropriate addresses in the various "static const" declarations. Easy peasy.
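Roughly, the idea looks like the sketch below. It is an illustration under assumptions, not the actual Plip PPU: the `Ppu` class, its member names, and the use of a plain byte vector in place of a PlipMemory object are all made up here. The register addresses (LCDC at 0xFF40, LY at 0xFF44, I/O block at 0xFF00) and the 154 scanlines per frame are the standard DMG values.

```cpp
#include <cstdint>
#include <vector>

class Ppu {
public:
    // Bus address where the I/O register block starts on the DMG.
    static constexpr std::uint32_t IoBase = 0xFF00;

    // Register offsets relative to the I/O block, computed once at compile
    // time so no mapper lookup is needed at runtime.
    static constexpr std::uint32_t RegLcdc = 0xFF40 - IoBase;
    static constexpr std::uint32_t RegLy = 0xFF44 - IoBase;

    // Scanlines per frame, including the vertical blanking lines.
    static constexpr std::uint32_t LinesPerFrame = 154;

    // The vector is a stand-in for the core's m_ioRegisters block.
    explicit Ppu(std::vector<std::uint8_t>& ioRegisters)
        : m_ioRegisters(ioRegisters) {}

    // Advance LY to the next scanline by touching the block directly
    // instead of asking the mapper to find 0xFF44 on every access.
    void NextLine() {
        const std::uint32_t next = (m_ioRegisters[RegLy] + 1u) % LinesPerFrame;
        m_ioRegisters[RegLy] = static_cast<std::uint8_t>(next);
    }

    bool LcdEnabled() const { return (m_ioRegisters[RegLcdc] & 0x80) != 0; }

private:
    std::vector<std::uint8_t>& m_ioRegisters;
};
```

The trade-off is that the PPU now knows where its registers live inside the block, which is fine because that layout is fixed by the hardware being emulated.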
Even with that, it still wasn't fast enough, and there's a good reason why: because I needed a data structure that allows quick and easy inserts, I used std::list, the STL's doubly-linked list implementation. Now, linked lists are fast for inserts, but since they are disparate objects tied together with pointers the compiler can't simply say "oh, it's just X address, plus the index times the size". It has to follow the trail of pointers, and this makes iteration significantly slower since the nodes aren't cache-friendly and the location of the next one can't be predicted. Since the memory mapper supports all sorts of fancy features like being able to smash a block of memory on top of another one, I didn't want to abandon std::list altogether because it made everything so clean and easy. Fortunately, assigning blocks of memory is done relatively rarely, so I changed it to do block assignments against an std::list and then build a far more efficient std::vector (which is basically a managed contiguous array) with the final contents of the list once the assignment is done. Basically, I'm trading a minuscule amount of CPU time and memory to save a ton of cycles in the long run.
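The "edit a list, look up in a vector" split can be sketched like this. The names (`MappingTable`, `Assign`, `Find`, `blockId`) are hypothetical, and the sketch skips the overlap handling that the real mapper does when one block gets placed on top of another.

```cpp
#include <cstdint>
#include <list>
#include <vector>

struct Mapping {
    std::uint32_t start;   // first bus address covered
    std::uint32_t length;  // number of bytes covered
    std::uint32_t offset;  // offset into the backing block
    int blockId;           // stand-in for a PlipMemory pointer
};

class MappingTable {
public:
    // Rare path: edit the linked list (cheap inserts anywhere), then
    // rebuild the contiguous copy used for lookups.
    void Assign(const Mapping& mapping) {
        auto it = m_edit.begin();
        while (it != m_edit.end() && it->start < mapping.start) ++it;
        m_edit.insert(it, mapping);  // keeps the list sorted by start address
        m_lookup.assign(m_edit.begin(), m_edit.end());
    }

    // Hot path: linear scan over contiguous storage.
    // Returns nullptr when the address is not mapped.
    const Mapping* Find(std::uint32_t address) const {
        for (const Mapping& m : m_lookup) {
            if (address >= m.start && address - m.start < m.length) return &m;
        }
        return nullptr;
    }

private:
    std::list<Mapping> m_edit;      // authoritative, easy to modify
    std::vector<Mapping> m_lookup;  // flattened snapshot, fast to iterate
};
```

One thing to keep in mind with this shape: a pointer returned by `Find` is only valid until the next `Assign`, since rebuilding the vector can move its contents.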
And finally, there's the matter of the return type of FindAddress. I was using std::tuple<PlipMemory*, uint32_t> (the pointer to the memory object and the memory address offset relative to that block). I ended up replacing that with a struct, which reduced the overhead of that function quite a bit. I think std::pair<> would have been a safe bet as well. I might test that at some point. From what I understand the difference becomes moot on optimized builds, but this is one of those situations where micro-optimizations are actually useful in order to make the debugging process less awful.
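For comparison, here are the two return shapes side by side. This is only a sketch: `Memory`, `AddressResult`, and both function names are invented for the example, and the lookup itself is reduced to a single subtraction so the focus stays on the return type.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical stand-in for PlipMemory.
struct Memory {
    std::uint8_t data[0x100];
};

// Tuple version: callers unpack with std::get<N>, which in an unoptimized
// build goes through real (non-inlined) function calls.
inline std::tuple<Memory*, std::uint32_t> FindAddressTuple(
        Memory* block, std::uint32_t base, std::uint32_t address) {
    return std::make_tuple(block, address - base);
}

// Struct version: plain aggregate with named fields, so reading a member
// is a direct access even without optimization.
struct AddressResult {
    Memory* memory;        // block that owns the address
    std::uint32_t offset;  // offset relative to that block
};

inline AddressResult FindAddressStruct(
        Memory* block, std::uint32_t base, std::uint32_t address) {
    return {block, address - base};
}
```

As a side benefit, `result.offset` reads a lot better at the call site than `std::get<1>(result)`.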
I'm having a ton of fun with this project, in case it wasn't obvious. ;P