GameBoy Emulation

Spectere:
So after a whole hell of a lot of research and development, I managed to emulate enough of a GameBoy to properly run the DMG boot ROM (it actually gets far enough to swap out the boot ROM for the first 0x100 bytes of the game cart, then start executing code from the game ROM, but the emulation isn't complete enough to run anything useful right now):

I still have plenty of work to do, but I think I'm off to a decent start. :)

Edit: I ended up doing a ton of improvements on this. After I got it to this point the emulator was unable to run at full speed on a mobile i9. No, I'm not joking. I didn't think to get an exact figure on how long one "second" of emulation time took in real time, but my frontend doesn't support frame skipping, and it was effectively running at half speed.

It's kind of amazing how a few microseconds quickly start to add up when you're literally simulating five million clock ticks per second (CPU runs at ~1.05MHz, PPU dot clock is ~4.19MHz). I believe that my CPU core is cycle-accurate, and I'm aiming for cycle-accuracy on the PPU as well (to a point, anyway; some of the specific timings involved with the video generation phase, PPU mode 3, are unclear).

While there were a few little micro-optimizations that probably did more for debug builds than release builds (flipping from range-based for loops to index-based, converting some if/else blocks to switch blocks, etc.), the most impactful changes occurred on the memory mapper.

Basically, Plip was designed to be more of an emulation interface than a single emulator. It doesn't split off the cores as separate libraries, à la RetroArch (though it probably could, honestly), but it's conceptually similar.

One of the things that I did was generalize memory access. It has a memory mapper, and that memory mapper takes PlipMemory objects (that being a pure virtual class, with RAM and ROM implementations). The core then assigns those memory blocks to specific addresses, and when you want to access one you tell the mapper to fetch a byte and it'll handle all the hard work for you. For instance, if you have a ROM at 0x0000-0x3FFF, system RAM at 0x4000-0x5FFF, and video RAM at 0x6000-0x7FFF and you request a byte from 0x4800, it'll check the mapping table, find that the requested byte lives in the system RAM instance, then subtract the block's base address (0x4800 - 0x4000) and return the byte at offset 0x0800 in system RAM.
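
Here is a minimal sketch of that lookup, just to make the idea concrete. Memory, Ram, Mapping, and Mapper are hypothetical names invented for the example and are not Plip's actual classes, and returning 0xFF for unmapped addresses is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using std::uint8_t;
using std::uint32_t;

// Hypothetical stand-in for the PlipMemory interface; not Plip's actual API.
class Memory {
public:
    virtual ~Memory() = default;
    virtual uint8_t GetByte(uint32_t offset) const = 0;
    virtual void SetByte(uint32_t offset, uint8_t value) = 0;
};

// Simple RAM implementation backed by a zero-filled vector.
class Ram : public Memory {
public:
    explicit Ram(uint32_t size) : m_data(size, 0) {}
    uint8_t GetByte(uint32_t offset) const override { return m_data.at(offset); }
    void SetByte(uint32_t offset, uint8_t value) override { m_data.at(offset) = value; }

private:
    std::vector<uint8_t> m_data;
};

// One row of the mapping table: addresses [start, start + length) live in `memory`.
struct Mapping {
    uint32_t start;
    uint32_t length;
    Memory* memory;
};

class Mapper {
public:
    void Assign(Memory* memory, uint32_t start, uint32_t length) {
        assert(memory != nullptr);
        m_table.push_back({start, length, memory});
    }

    // Finds the block that owns `address`, then reads at the block-relative offset.
    uint8_t GetByte(uint32_t address) const {
        for (const Mapping& m : m_table) {
            if (address >= m.start && address - m.start < m.length) {
                return m.memory->GetByte(address - m.start);
            }
        }
        return 0xFF; // Assumption: unmapped reads return 0xFF.
    }

private:
    std::vector<Mapping> m_table;
};
```

With three blocks assigned at 0x0000, 0x4000, and 0x6000, a read from 0x4800 lands on offset 0x0800 of the second block, matching the example above.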

This system is nice because it inherently supports banked ROM and RAM. All I have to do is update the offset of a block and I'm suddenly in a new bank. Additionally, this ended up being useful for a little GameBoy quirk, known as ECHO RAM (due to how the DMG's memory controller works, 0xC000-0xDDFF is mirrored to 0xE000-0xFDFF). All I had to do to simulate that was simply add the work RAM block to the upper address and it Just Worked™.
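
To illustrate the banking and ECHO RAM tricks, here is a rough sketch under the same kind of assumptions. Block and BankedMapper are made-up names, the backing storage is a plain std::vector rather than a PlipMemory object, and the 0xFF result for unmapped reads is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using std::uint8_t;
using std::uint32_t;

// A window of the address space backed by a slice of some storage.
struct Block {
    uint32_t start;             // First mapped address.
    uint32_t length;            // Number of mapped bytes.
    std::vector<uint8_t>* data; // Backing storage; several blocks may share it.
    uint32_t offset;            // Where in the storage this window begins.
};

class BankedMapper {
public:
    // Returns the index of the new block so its offset can be changed later.
    std::size_t Assign(std::vector<uint8_t>* data, uint32_t start,
                       uint32_t length, uint32_t offset = 0) {
        assert(data != nullptr);
        m_blocks.push_back({start, length, data, offset});
        return m_blocks.size() - 1;
    }

    // Bank switching: the window stays put, only the storage offset moves.
    void SetOffset(std::size_t block, uint32_t offset) {
        m_blocks.at(block).offset = offset;
    }

    uint8_t Read(uint32_t address) const {
        for (const Block& b : m_blocks) {
            if (address >= b.start && address - b.start < b.length) {
                return b.data->at(b.offset + (address - b.start));
            }
        }
        return 0xFF; // Assumption: unmapped reads return 0xFF.
    }

    void Write(uint32_t address, uint8_t value) {
        for (const Block& b : m_blocks) {
            if (address >= b.start && address - b.start < b.length) {
                b.data->at(b.offset + (address - b.start)) = value;
                return;
            }
        }
    }

private:
    std::vector<Block> m_blocks;
};
```

Mapping the same work RAM vector at both 0xC000 (0x2000 bytes) and 0xE000 (0x1E00 bytes) gives the mirror for free: a write through either window shows up in the other, because both point at the same storage.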

Now, there is a pretty substantial problem with this that I hinted at earlier: that find routine costs CPU cycles, and doing it unnecessarily is a huge problem. Obviously, the CPU needs to use the mapper for all of its memory access, since it doesn't really know any details about the memory layout. That part isn't a big deal, since the GB's CPU cannot read or write memory more than once per cycle. The problem lies in the PPU. Not only is the dot clock four times faster than the CPU, but the PPU also has a bunch of registers that it needs to both read and update in order to display the image. This results in the mapper's find routine being called millions of times per second.

The fix for this was simple: since the PPU is only reading and writing certain specified registers, I can easily get away with directly accessing the PlipMemory objects declared in the core (m_videoRam, m_oam, and m_ioRegisters, in my case). I just handled all of the arithmetic to get the appropriate addresses in the various "static const" declarations. Easy peasy.

Even with that, it still wasn't fast enough, and there's a good reason why: because I needed a data structure that allows quick and easy inserts, I used std::list, STL's doubly-linked list implementation. Now, linked lists are fast for inserts, but since they are disparate objects tied together with pointers the compiler can't simply say "oh, it's just X address, plus the index times the size". It has to follow the trail of pointers, and this makes iteration significantly slower since the nodes don't sit next to each other in memory, so they cache poorly and the location of the next one can't be predicted. Since the memory mapper supports all sorts of fancy features like being able to smash a block of memory on top of another one, I didn't want to abandon std::list altogether because it made everything so clean and easy. Fortunately, assigning blocks of memory is done relatively rarely, so I changed it to do block assignments against a std::list and then build a far more efficient std::vector (which is basically a managed contiguous array) from the final contents of the list. Basically, trading in a minuscule amount of CPU time and memory to save a ton of cycles in the long run.
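
The edit-in-a-list, look-up-in-a-vector idea can be sketched roughly like this. Range, MappingTable, and blockId are hypothetical names for the example, and the overlap handling that the real mapper does is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <vector>

using std::uint32_t;

struct Range {
    uint32_t start;
    uint32_t length;
    int blockId; // Hypothetical handle for whichever memory block owns the range.
};

class MappingTable {
public:
    // Rare path: edit the linked list, then rebuild the flat copy once.
    void Insert(const Range& range) {
        assert(range.length > 0);
        auto it = m_edit.begin();
        while (it != m_edit.end() && it->start < range.start) {
            ++it;
        }
        m_edit.insert(it, range); // Mid-list insert keeps ranges sorted by start.
        m_flat.assign(m_edit.begin(), m_edit.end());
    }

    // Hot path: scan contiguous storage. The returned pointer is only valid
    // until the next Insert, because the vector is rebuilt there.
    const Range* Find(uint32_t address) const {
        for (const Range& r : m_flat) {
            if (address >= r.start && address - r.start < r.length) {
                return &r;
            }
        }
        return nullptr;
    }

    std::size_t Size() const { return m_flat.size(); }

private:
    std::list<Range> m_edit;   // Convenient to modify.
    std::vector<Range> m_flat; // Cheap to iterate.
};
```

The rebuild costs one pass over the list per assignment, which is fine as long as assignments stay rare compared to lookups.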

And finally, there's a matter of the return type of FindAddress. I was using std::tuple<PlipMemory*, uint32_t> (the pointer to the memory object and the memory address offset relative to that block). I ended up replacing that with a struct, which ended up reducing the overhead of that function quite a bit. I think std::pair<> would have been a safe bet as well. I might test that at some point. From what I understand the difference becomes moot on optimized builds, but this is one of those situations where micro-optimizations are actually useful in order to make the debugging process less awful.
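
For comparison, here is what the two return styles might look like side by side. This is only a sketch: the Memory stand-in, the Mapping table, AddressResult, and both function names are invented for the example, and the real FindAddress signature may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

using std::uint32_t;

struct Memory { int id; }; // Hypothetical stand-in for PlipMemory.

struct Mapping {
    uint32_t start;
    uint32_t length;
    Memory* memory;
};

// Before: a tuple, whose fields are reached through std::get<N>.
using TupleResult = std::tuple<Memory*, uint32_t>;

TupleResult FindAddressTuple(const std::vector<Mapping>& table, uint32_t address) {
    for (const Mapping& m : table) {
        if (address >= m.start && address - m.start < m.length) {
            return TupleResult(m.memory, address - m.start);
        }
    }
    return TupleResult(); // Value-initialized: null pointer and offset 0.
}

// After: a plain struct with named fields.
struct AddressResult {
    Memory* memory;  // nullptr when nothing is mapped at the address.
    uint32_t offset; // Offset relative to the start of the block.
};

AddressResult FindAddress(const std::vector<Mapping>& table, uint32_t address) {
    for (const Mapping& m : table) {
        if (address >= m.start && address - m.start < m.length) {
            assert(m.memory != nullptr); // Mapped ranges must point at a block.
            return {m.memory, address - m.start};
        }
    }
    return {nullptr, 0};
}
```

Both carry exactly the same information. The struct version reads a little more clearly at the call site (result.offset rather than std::get<1>(result)).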

I'm having a ton of fun with this project, in case it wasn't obvious. ;P

vladgd:
I feel like that's substantial enough for its own topic. All of it is over my head, but what made you decide on gameboy of all things? Easier to code, or more personal interest?

Spectere:
Yeah, you're not wrong, haha. I was originally going to just stick with the screenshot but that edit ended up taking on a life of its own. I'll go ahead and split this off.

I decided to start with the GameBoy because it has a simple architecture as far as systems are concerned, and because it does have some personal significance (the GBC was my second vidya game console, and my first Nintendo one). I was kind of juggling between the Master System and GB, but the latter won out because I have fonder memories of it.

When I say "simple architecture" I mean more compared to systems like the SNES. Emulating every single instruction of a CPU is still pretty tedious, and replicating all of the fun little hardware quirks, bugs, and errata without breaking other stuff often takes a bit of trial and error. The main thing that makes the GameBoy simple is that programming it just involves setting RAM values. The CPU doesn't have much in the way of I/O ports like many other ones do (in fact, the only I/O pins on the CPU are used for handling system input), and the instruction set is elegant and simple, barring a couple of oddball functions. The PPU ("Picture Processing Unit") is also relatively straightforward, and doesn't have a million and a half modes that you have to worry about. Additionally, the memory bank controllers on the cartridge tend to be far simpler than the myriad of memory mappers that you'd find on, say, an NES cart.

So, yeah, overall it's a pretty good system for breaking into full-system emulation. There aren't too many moving parts, and you can generally get results fairly quickly (again, a very relative term—it probably took around 60 hours of research and development before I was able to get through the boot ROM).

Bobbias:
I've been interested in attempting to write an emulator for some time, but never actually took the plunge to try my hand at it. This is seriously awesome.

The closest thing I have to an emulator is a simple virtual machine created for the Advent of Code 2019 series problems, which is currently broken in some obscure way which I have yet to figure out. It worked fine until I tried to add what are effectively breakpoints, where I could break execution on any instruction.

Spectere:
Not sure exactly how your project is structured, but one thing that helped me was the way I structured the game loop. I pretty much just have my frontend telling the core how many microseconds it should run in a given cycle (then using that to figure out how long to wait for the next frame). Not exactly thread-safe, but this is more of a research project for me. I probably would have done things a little differently if I were aiming for the quality of something like BGB. :)
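
A rough sketch of the "run for N microseconds" idea, assuming the core converts the requested time into dot-clock ticks and carries the fractional remainder over to the next call. Core, Delta, Tick, and kClockHz are names made up for this example; 4194304 Hz is the commonly documented DMG clock, which lines up with the ~4.19MHz figure above.

```cpp
#include <cassert>
#include <cstdint>

using std::uint64_t;

constexpr uint64_t kClockHz = 4194304;       // DMG dot clock, in Hz.
constexpr uint64_t kMicrosPerSecond = 1000000;

class Core {
public:
    // Runs the core for `microseconds` of emulated time.
    // Returns the number of clock ticks that were executed.
    uint64_t Delta(uint64_t microseconds) {
        assert(microseconds <= 1000000000ULL); // Keeps the multiply far from overflow.
        m_pending += microseconds * kClockHz;
        const uint64_t ticks = m_pending / kMicrosPerSecond;
        m_pending %= kMicrosPerSecond; // Carry the fraction so time doesn't drift.
        for (uint64_t i = 0; i < ticks; ++i) {
            Tick();
        }
        return ticks;
    }

    uint64_t TotalTicks() const { return m_total; }

private:
    // Placeholder: a real core would step the CPU, PPU, timers, etc. here.
    void Tick() { ++m_total; }

    uint64_t m_pending = 0; // Leftover time, in units of ticks * 1e-6.
    uint64_t m_total = 0;
};
```

The frontend would then call something like core.Delta(frameTimeInMicroseconds) once per frame and sleep for whatever is left of the frame. Carrying the remainder matters for small slices: a thousand 1-microsecond calls should add up to 4194 ticks, not 4000.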

I don't really have a way of setting breakpoints just yet, but I'm currently working on adding a console that will allow me to query things, set breakpoints, and all that fun stuff while the emulation is running. The cores are exposed enough to the frontend that information gathering is possible, and things are set up in such a way that I can easily do single-stepping (at least for the GameBoy core, as everything is more or less timed based on the same clock). I'm not really in a position where actual games are booting, and I'm not sure if it's due to bugs in the CPU core or unimplemented features. I figure I can author a few simple test ROMs to spot check functionality, before running the more comprehensive tests (like blargg's test suite, etc.). I'd kinda like to get my hands on an EverDrive at some point (maybe after I finish paying off my desk and monitors) so that I can compare the results to real hardware. I don't have a DMG at my disposal, but the SGB is close enough architecturally that it should be good enough for most things, barring a few SGB-specific quirks.

Honestly, I kinda wish I had developed the console before starting on the core. Kinda sucks to have to put everything on hold for this, but oh well. I had been using my debugger to get the boot ROM working, but lldb isn't really intended for debugging high-level things like this (I mean, you can do it, but it's tedious).

One habit I have been having a tough time breaking is my tendency to reach for malloc/free instead of new/delete. C habits die hard, apparently.
