Dynamic Wait-States for the W65C02, W65C816

Posted on June 08, 2023

What Exactly Is the Problem?

In most 6502-based designs to date, the CPU clock is dictated by the needs of the I/O. A Commodore 64's clock is generated by the VIC-II chip, for example, and its rate depends on whether the machine targets the NTSC (1.02 MHz) or PAL (0.98 MHz) video standard. Access to RAM and other I/O is designed carefully to fit within these hard real-time requirements. There is never any need for wait-states, because every chip knows (basically) what all the other chips are doing.

However, that really only works for extremely well-specified products, or for products where you have total vertical control over the components used (as at Commodore and Atari, which designed their own chips specifically to work with the 6502 processor). Under these conditions, the very simple bus that the 6502 exposes to the world is an absolute joy to work with.

Why Is It a Problem?

Let's put our adult lives on hold and become as children for a bit, and make believe. Your mission from your boss at IBM: the 8088 project was an abject failure, and now you're to build a 6502-based computer that supports up to 8 expansion slots for I/O devices to compete with the likes of Apple and Commodore. The mainboard of the computer is to be as independent as possible of both the peripheral speed (because we can't predict how slow the slowest add-in card will be) and the CPU clock speed (the fastest 65C02 was 4 MHz at the time the PC came out, I think; these days it runs at up to 20 MHz).

More importantly, economics will change such that parallel ROMs (to store the BIOS) will eventually give way to serial flash devices. How would you get a 14 MHz CPU with a parallel bus to boot off of a serial flash, and then interact with a 5 MHz VIA chip from Western Design Center? This is not an academic exercise; if you work with deep embedded systems today, especially ARM or RISC-V processors, you might already be familiar with specialized circuitry to accomplish this exact goal. If it's not on the CPU itself, it'll certainly be located on what is colloquially known as its chipset.

You might think that the simplistic bus interface of the 65C02 is insufficient to address this level of bus asynchrony. However, with only a small amount of external logic, you can actually implement an asynchronous bus on par with Motorola's 68000. Granted, the 65C816 is a slightly better choice for this, but the principles will be the same.

Asynchronous Bus for 65C02

The first thing we need to realize is that the circuits described below will only work with the CMOS 6502 parts, not with the NMOS parts. This is because the NMOS parts observe the state of the RDY signal only when reading, not when writing. Further, NMOS parts only sample RDY during phase-1, while CMOS parts sample it during the phase-2-to-phase-1 transition, giving more time for address decoders to work. That said, if you're particularly clever with write-caching, you can probably take the ideas in this article and apply them to an NMOS design as well.

What we know is that we want the CPU's RDY signal to drop low when a slow device is first addressed. We can use an address decoder's select line to know when a slow device is addressed; however, we can't always know from this signal alone when one bus transaction stops and another begins. Consider a back-to-back read of LDA #$11 from serial flash. This two-byte instruction will cause the CPU to read from serial flash twice, back to back and without interruption. The serial flash's select line would assert once during this (assuming it wasn't already asserted), not the two times one might expect.

Thankfully, we already know that the 65C02 completes a bus transaction on every cycle where RDY is asserted. Therefore, if RDY is asserted during cycle n-1, then we know that cycle n must be the start of a new bus transaction. We can therefore capture this transaction-start as a new signal to be shared with all devices, which I'll name START, like so:

             +------+
RDY o--------|D    Q|--------> START
             |      |
PHI2 o------o|>     |
             +------+

The 65C816 adds a little bit of complexity thanks to the VPA/VDA signals. These can be used to actually qualify valid versus internal bus cycles, allowing internal cycles to always run at maximum speed.

             +------+
RDY o--------|D    Q|----.
             |      |    |
PHI2 o------o|>     |    |    +------+
             +------+    `----|      |
                              |  *1  |--------> START
             +------+    .----|      |
VPA o--------|      |    |    +------+
             |  +1  |----'
VDA o--------|      |
             +------+

If you don't care about introducing wait-states for internal cycles, then you still use the 65C02 circuit.
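
To make the behavior of the two START circuits concrete, here is a minimal cycle-level sketch in Python. It is only a behavioral model, not a timing-accurate one. The function name, the list-per-cycle representation, and the `initial_q` reset value are my own assumptions for illustration. The flip-flop is modeled as latching RDY at the end of each PHI2 cycle, and the optional `valid_per_cycle` list stands in for (VPA OR VDA) on the 65C816.

```python
def start_signals(rdy_per_cycle, valid_per_cycle=None, initial_q=True):
    """Return the START level seen during each bus cycle.

    rdy_per_cycle   -- RDY level during each cycle (True = ready).
    valid_per_cycle -- (VPA or VDA) per cycle for the 65C816 variant;
                       None models the plain 65C02 circuit.
    initial_q       -- assumed flip-flop state coming out of reset.
    """
    q = initial_q
    start = []
    for i, rdy in enumerate(rdy_per_cycle):
        valid = True if valid_per_cycle is None else valid_per_cycle[i]
        # START is the latched RDY of the previous cycle, gated by
        # (VPA or VDA) when modeling the 65C816 circuit.
        start.append(q and valid)
        # The flip-flop captures RDY at the end of this cycle.
        q = rdy
    return start


# One fast cycle, then a slow access stretched over four cycles,
# then another fast cycle.
print(start_signals([True, False, False, False, True, True]))
```

In this run START is high in the first cycle of the slow access and stays low for the stretched cycles, which is the property the wait-state logic relies on.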

Once we have this new signal, we can qualify it against an address decoder's select output to kick off a timer of some sort. After this timer expires, the peripheral addressed will drive its own personal RDY signal, which should cause the CPU to continue. Basically, we are looking for a timing diagram similar to the following, depicting two back-to-back hits on a slow device:

                 ____      ____      ____      ____      ____      ____      ____
    PHI2    ____/    \____/    \____/    \____/    \____/    \____/    \____/
            __________ _______________________________________ __________________
    ADDR    __________X_______________________________________X__________________
            ___________
    SLO#               \_________________________________________________________
            _____________________                               _________
    START   _____________/       \_____________________________/         \_______
            ______________                            _________
    SLORDY  ______________\__________________________/___/     \_________________
            ________________                               _______
    RDY     ______________\_\_____________________________/     \_\______________

Basically, when the slow device is selected and we know it's the start of a bus cycle, we can start a state machine that drives SLORDY low until the right time. With our example above, a divide-by-four circuit would decode count=3 to drive SLORDY high, while START coinciding with an asserted (low) SLO# would reset the counter back to 0.

               ,-------------------------------------*--------------.
               |   +------+                          |              |
    START o----*---|      |        +-----+        +-----+        +-----+
                   |  *1  |--------|D   Q|--------|D R Q|--------|D R Q|--------> SLORDY
    SLO#  o-------o|      |        |     |        |     |        |     |
                   +------+    .--o|>    |    .--o|>    |    .--o|>    |
                               |   +-----+    |   +-----+    |   +-----+
    PHI2  o--------------------*--------------*--------------'

What we basically have here is not much different from what we'd find in a typical DTACK-generator for a MC68000-based computer.

(Remember, this circuit is only representative; the precise state machinery necessary for your peripherals will likely look very different.)
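
The counter described above can be sketched behaviorally in Python. This follows the prose description (reset on START with the device selected, report ready at count=3) rather than the exact flip-flop chain in the drawing. The class name, the `ready_count` parameter, and the active-high `selected` flag (meaning SLO# is low) are hypothetical names chosen for the example.

```python
class WaitStateTimer:
    """Behavioral model of a per-device wait-state counter."""

    def __init__(self, ready_count=3):
        self.ready_count = ready_count
        # Idle state: the counter sits at its terminal value, so the
        # device reports ready until a new access restarts it.
        self.count = ready_count

    def tick(self, start, selected):
        """Advance one PHI2 cycle and return SLORDY for that cycle.

        start    -- START signal for this cycle.
        selected -- True when the device's select (SLO#) is asserted.
        """
        if start and selected:
            self.count = 0          # new access: restart the count
        elif self.count < self.ready_count:
            self.count += 1         # still waiting: keep counting
        return self.count == self.ready_count


timer = WaitStateTimer(ready_count=3)
# START is high only in the first cycle of the access.
print([timer.tick(s, True) for s in (True, False, False, False)])
```

With `ready_count=3` the access is held for three cycles and completes on the fourth; a real device would pick the count from its access time and the PHI2 period.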

OK, we've used START and our device-specific select to identify when to start our RDY-state-machine; but how do we route that signal back to the CPU? As you might imagine, just as the device driving the data bus is controlled by the select, so too is the RDY signal.

                   +------+------+ -
    VPA    o------o|  *1  |  +1  |  |_ for 65C816-based designs only.
    VDA    o------o|      |      |  |
                   +------+      | -
    SLO#   o------o|  *1  |      |
    SLORDY o-------|      |      |
                   +------+      |
    ROM#   o------o|  *1  |      |
    ROMRDY o-------|      |      |
                   +------+      |
    RAM#   o------o|  *1  |      |
    RAMRDY o-------|      |      |
                   +------+      |--------> RDY (to the CPU)
    S0#    o------o|  *1  |      |
    S0RDY  o-------|      |      |
                   +------+      |
    S1#    o------o|  *1  |      |
    S1RDY  o-------|      |      |
                   +------+      |
                  ///    ///    ///
                   +------+      |
    S7#    o------o|  *1  |      |
    S7RDY  o-------|      |      |
                   +------+------+

It's as easy as that.
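
As a sanity check on the gating, the multiplexer can be written out as a boolean function in Python. This is a sketch of the logic only. The function name and the list-of-pairs argument are my own invention, select lines are passed in their active-low sense as on the schematic, and the VPA/VDA term applies only when both are supplied (the 65C816 case).

```python
def cpu_rdy(devices, vpa=None, vda=None):
    """Combine per-device ready signals into the CPU's RDY.

    devices  -- iterable of (select_n, dev_rdy) pairs; select_n is the
                active-low select, dev_rdy the device's ready output.
    vpa, vda -- 65C816 only: when both are low the cycle is internal
                and RDY is forced high so it runs at full speed.
    """
    internal = vpa is not None and vda is not None and not (vpa or vda)
    # Only the selected device's ready signal is allowed through.
    selected_ready = any((not sel_n) and rdy for sel_n, rdy in devices)
    return internal or selected_ready


# Slow device selected but not yet ready; RAM idle and deselected.
print(cpu_rdy([(False, False), (True, True)]))
```

Note that if no select is asserted the function returns False, which is the "jammed processor" case discussed next.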

What happens if the CPU addresses a block of memory which isn't decoded? That seems like it would jam the processor until the next hard reset. Indeed, that is the case. As presented here, I only account for decoded devices. There are many ways of handling the case of a bus error, however. One approach is to fully decode the address space and include a "default RDY generator" that applies to all otherwise unused portions of memory. (For 65C816 devices, perhaps you might also want to pulse the ABORT# signal too.) Another approach is to have a default RDY generator which is OR-ed with the signal above as a fail-safe. The START signal acts as a watchdog timer reset for this circuit, ensuring it never fires spuriously.
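
The fail-safe variant can be sketched the same way. This is a hypothetical model, not a circuit from this article: the class name and `timeout_cycles` value are assumptions, and the timeout must be chosen longer than the slowest legitimate device so that normal accesses always reset the counter before it fires.

```python
class BusWatchdog:
    """Default RDY generator that fires if no device answers in time."""

    def __init__(self, timeout_cycles=16):
        self.timeout_cycles = timeout_cycles
        self.count = 0

    def tick(self, start):
        """Advance one PHI2 cycle; return the fallback RDY level."""
        if start:
            self.count = 0          # START acts as the watchdog reset
        else:
            # Saturate so the counter cannot wrap around.
            self.count = min(self.count + 1, self.timeout_cycles)
        return self.count >= self.timeout_cycles


dog = BusWatchdog(timeout_cycles=4)
# An access to undecoded memory: START once, then nothing answers.
print([dog.tick(s) for s in (True, False, False, False, False)])
```

The returned level would be OR-ed with the multiplexer output; once it releases the CPU, the next cycle raises START and the counter is cleared again.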

Why Is This Solution Valuable?

Bus asynchrony brings potential compatibility with a wider variety of peripherals, and/or enables the use of design methods with more favorable economics. For example, asynchrony is a vital requirement for compatibility with the STE-bus specification.

The implementation costs a handful of D flip-flops and a bunch of 2-input AND and OR gates. Clever engineers might use 74138-style 1-of-8 decoders as well to reduce discrete component counts.

The 65C02 and 65C816 often appear in circuits which are extremely cost-sensitive and fixed in function. All of the logic discussed above adds to the circuit complexity, and thus to the overall cost of development. If you are working with discrete components, you might therefore want to forgo this additional complexity and stick with fully synchronous designs.