On Evolving Channel I/O Concepts Towards Offering System Services

So, I'm kind of thinking that the idea of operating systems with lots of system calls was an "effective mistake." It's a natural outgrowth of the subroutine library, which itself evolved in a time before operating systems even existed. They work insofar as they do; however, they impose a kind of synchronous bottleneck on the operations one can perform. Asynchronous operations, which are becoming more and more important as the days go on, tend to have really complicated semantics, interface requirements, or both.

Not only that, but user/kernel space transitions tend to be more costly than normal subroutine calls. Even on modern, high-performance CPUs where the processing cost of the transition itself is effectively zero (such as RISC-V processors or modern x86 CPUs with a SYSCALL instruction), you still have the software overhead of copying buffers between user space and kernel space. Earlier processors have it much worse: even a simple subroutine call is rather expensive. A call-and-return pair costs 12 clock cycles on a 6502 (JSR/RTS) and 21 (RST) to 27 (CALL) clock cycles on a Z80, including the return. You can see the effect tangibly in old IBM PC software, where console I/O performed through character-at-a-time BIOS calls produced visibly line-at-a-time screen updates, while video games, which wrote directly to video memory, were significantly faster.

One work-around is to perform operations in bulk. Early operating systems, such as Commodore's KERNAL or the IBM PC's BIOS, tended to perform I/O operations one character at a time. Unix-style read/write operations, which transfer an entire buffer at a time, amortize the cost of a synchronous I/O operation over the size of the buffer transferred. But you are still restricted to transferring at most one buffer's worth of information per call. And, for devices which produce significantly smaller units of data, the overhead can still rack up really fast. Moreover, handling multiple input devices at a time was still impossible without some obnoxious programming techniques.

Another disadvantage of this interface is that it's very much a leaky abstraction. A system call is, well, a subroutine call, which means the caller and callee have to agree on an ABI. You know that there's some amount of software running on the same computer your code is running on, designed to process your request and return the results. The transition from unprotected system calls (e.g., the Commodore KERNAL) to protected, or at least protected-capable, ones (e.g., MS-DOS or Windows) is frequently enough to break software as applications are ported from platform to platform.

We can imagine a different way of invoking system services, though. One which is independent of microkernel or monolithic kernel architectures, of single- or multi-processor systems, and of synchronous or asynchronous I/O requests. One which has the potential, at least, to amortize the cost of a system call across many requests at once, even on slow, inefficient processors. System service providers, however they may be implemented, can respond to requests in order if they want, or they can dole out tasks to other background processors for eventual completion. As long as a simple request/response protocol is understood and obeyed by "the kernel" and application alike, none of these implementation details matter.

I am, of course, suggesting the same basic interface method one would use to talk to an NVMe storage device: a request (or command) queue for submitting one or more operations for the kernel to process, and a result queue for informing the program when requests have completed and what the results of those operations were.

While the flow of control is more complicated than with Unix-style kernel calls, the process is simple nonetheless.

  1. The application fills in one or more request queue entry structures. This tells the kernel what operation(s) to perform and in what order, along with any data necessary to complete the operation. The application does this for as many requests as are required, or for however many slots remain available in the queue, whichever is smaller. All pointers to ancillary data structures are specified in application address space terms.

  2. The application then makes a single system call. This causes the kernel to start popping the request queue as services are rendered. The kernel may optimize processing of these requests however it sees fit. It might return right away after having scheduled asynchronous background operations; or, it might process them one by one and not return until each and every one has been completed.

  3. To check completion, the application then monitors its result queue. As services complete, the kernel will queue corresponding response entries onto the result queue. As long as interrupts are enabled, the program can detect when one or more new result queue entries become available by performing a simple pointer comparison. Alternatively, a kernel may be configured to invoke a callback routine for one or more requests.

  4. The application can then dispatch notifications to various parts of its program in response to the service completions in the order received.

Applications and the system software agree to cooperate via a single shared data structure, which might look something like this:

struct SystemQueue {
	void *   base;    /* base address of the queue's entry array, in application address space */
	uint8_t  bound;   /* number of entry slots in the queue */
	uint8_t  head;    /* index of the next entry to dequeue */
	uint8_t  tail;    /* index of the next free slot to enqueue into */
};

struct SystemChannel {
	struct SystemQueue request;  /* application enqueues requests; kernel dequeues them */
	struct SystemQueue result;   /* kernel enqueues results; application dequeues them */
};

Under normal conditions, the application only ever touches request.tail and result.head, while the system only touches request.head and result.tail. Each party works by first writing a new entry to the appropriate queue, and only then updating the head or tail field as appropriate. For single-processor, single-threaded environments (which is almost exclusively the norm with 8-bit CPUs, which lack any form of complex locking facilities), this is sufficient to ensure atomicity. It also ensures upward compatibility with multi-processor systems later: if a kernel is updated to become multithreaded, for example, it may keep a lock on its side transparently, without the application's knowledge. From the app's point of view, a single kernel is updating its side of the queue. Likewise, if the application becomes multithreaded later on, it may incorporate its own lock as well.

As long as the application and the system keep to their own halves of each queue, the application can be enqueueing a new request while the system is reading a pending request at the same time. Likewise, the application can be dequeueing a result while the system is enqueueing a new one. As long as the head and tail are updated atomically (which is guaranteed, since they're only a single byte wide), there ought to be no need for further synchronization primitives.
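
To make the discipline concrete, here's a minimal sketch of the application's side of the channel, corresponding to steps 1 and 3 of the flow above and using the SystemChannel structure just shown. It assumes bound is the number of slots in each queue, that indices wrap modulo bound, and that result entries mirror the RequestEntry layout shown later in this post; the names try_enqueue_request and try_dequeue_result are mine, not part of any real kernel API.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct RequestEntry { uint16_t args[8]; };   /* defined later in the post; repeated so the sketch stands alone */
struct ResultEntry  { uint16_t args[8]; };   /* assumed to mirror RequestEntry */

/* Application side: write the entry first, then publish it by advancing tail. */
bool try_enqueue_request(struct SystemChannel *ch, const struct RequestEntry *req)
{
	struct SystemQueue *q = &ch->request;
	uint8_t next = (uint8_t)((q->tail + 1) % q->bound);

	if (next == q->head)                /* queue full; the kernel hasn't caught up yet */
		return false;

	struct RequestEntry *slots = q->base;
	memcpy(&slots[q->tail], req, sizeof *req);
	q->tail = next;                     /* single-byte store publishes the entry atomically */
	return true;
}

/* Application side: consume a pending result, if any, then advance head. */
bool try_dequeue_result(struct SystemChannel *ch, struct ResultEntry *out)
{
	struct SystemQueue *q = &ch->result;

	if (q->head == q->tail)             /* nothing pending */
		return false;

	const struct ResultEntry *slots = q->base;
	memcpy(out, &slots[q->head], sizeof *out);
	q->head = (uint8_t)((q->head + 1) % q->bound);
	return true;
}

Note that this full/empty convention sacrifices one slot to distinguish a full queue from an empty one; a kernel could just as easily define free-running indices instead.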

Since each request implies one result, the result queue should generally hold at least as many elements as the request queue. If your application is diligent about draining replies, though, you can probably get away with a smaller result queue.

When a program starts up for the first time, it might be helpful to have a default set of request/result queues handed to it; however, since the queues are application-managed, with all pointers expressed in application address space terms, it follows that the application itself can (re-)allocate and move the queues as required. For processors with small address spaces (64KB or less), there might not be enough resources for flexible dynamic memory management; consider an operating system like GEOS on the 6502 processor family, which only leaves about 24KB of space available for applications. So, it's better if the application manages the size of its own queues and just tells the kernel where they're located and how big they are. (This is one of the few synchronous system calls that does exist under this scheme. Since it's expected to happen only once or twice during a program's initialization, it shouldn't incur an onerous performance hit.)
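
As a sketch of that one-time setup, the application might allocate its queues statically and hand them to the kernel through a single registration call. The name sys_set_channel and the 16-entry sizes below are purely hypothetical:

extern void sys_set_channel(struct SystemChannel *ch);   /* hypothetical one-time synchronous call */

static struct RequestEntry  request_slots[16];
static struct ResultEntry   result_slots[16];
static struct SystemChannel channel;

void init_channel(void)
{
	channel.request = (struct SystemQueue){ .base = request_slots, .bound = 16 };
	channel.result  = (struct SystemQueue){ .base = result_slots,  .bound = 16 };
	sys_set_channel(&channel);   /* tell the kernel where the queues live and how big they are */
}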

For example, let's consider the case of determining the size of a file. Pretending for the moment that we have a set of operations to open a file, close a file, seek to a point in a file, and to report the current file pointer location, we can synthesize a single "batch" operation composed of these smaller primitives. In a Unix-like environment, we might write the following:

  1. Call open() to open the file.
  2. Check for errors.
  3. Call seek() to reposition the file pointer to the end of the file.
  4. Check for errors.
  5. Call tell() to report the position of the file pointer.
  6. Check for errors.
  7. Call close() on the file.

None of this is particularly difficult, since the program steps operate synchronously anyway. However, if you are determining the sizes of, say, several thousand files, this can take an appreciable amount of time. Maintaining a responsive user interface while this is happening can be difficult or even impossible. However, with a queue-based system call interface, we might write the code like so:

  1. Enqueue open file request.
  2. Chain (enqueue with the dependency that the previous request succeeds) a seek to end of file request.
  3. Chain a report file position request.
  4. Chain a close file request.
  5. Invoke the kernel once to begin processing these requests.
  6. Wait for all four responses in the result queue. Handle UI-related events as they arrive.
  7. Check the 3rd result; make sure it's not an error report.
     a. If it's OK, the position reported will be the file size.
     b. If it is an error, then check the 1st result (which will also be an error) to see what happened.
  8. Dequeue the results.

Each request structure might look like this:

struct RequestEntry {
	uint16_t args[8];   /* args[0] selects the provider and opcode; the rest are operation-specific */
};

The first argument args[0] might, for instance, be broken up into two 8-bit subfields: a provider and an opcode. The provider would specify which subsystem is responsible for handling the request: the DOS/VFS for disk I/O, the GUI for graphics output, etc. The opcode is, then, provider-specific. This is similar to how MacOS toolboxes and Tock:OS capsules work, for example. Metadata and data buffer references can be packed into subsequent arguments as required for the function requested.
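
As an illustration, the file-size batch from earlier might be encoded like this, building on the try_enqueue_request helper sketched above. Every concrete value here is invented for the example: the post doesn't define provider numbers, opcodes, how chaining is flagged, or the name of the "go" system call (sys_submit below), so treat all of them as placeholders.

#include <stdint.h>

/* Hypothetical provider, opcodes, and chain flag; none of these values are defined by the post. */
#define PROVIDER_DOS  0x01            /* DOS/VFS subsystem */
#define OP_OPEN       0x01
#define OP_SEEK_END   0x02
#define OP_TELL       0x03
#define OP_CLOSE      0x04
#define FLAG_CHAIN    0x0001          /* only run if the previous request succeeded */

extern void sys_submit(void);         /* hypothetical: the single "go" call from step 5 above */

static uint16_t pack_op(uint8_t provider, uint8_t opcode)
{
	return (uint16_t)((provider << 8) | opcode);    /* args[0] = provider in the high byte, opcode in the low */
}

void enqueue_file_size_batch(struct SystemChannel *ch, const char *path)
{
	struct RequestEntry req = {0};

	/* 1. Open the file; on a 16-bit address space, a pointer fits in one argument word. */
	req.args[0] = pack_op(PROVIDER_DOS, OP_OPEN);
	req.args[2] = (uint16_t)(uintptr_t)path;
	(void)try_enqueue_request(ch, &req);            /* queue-full handling elided for brevity */

	/* 2. Chain a seek-to-end request. */
	req = (struct RequestEntry){0};
	req.args[0] = pack_op(PROVIDER_DOS, OP_SEEK_END);
	req.args[1] = FLAG_CHAIN;
	(void)try_enqueue_request(ch, &req);

	/* 3. Chain a report-file-position request; its result carries the file size. */
	req = (struct RequestEntry){0};
	req.args[0] = pack_op(PROVIDER_DOS, OP_TELL);
	req.args[1] = FLAG_CHAIN;
	(void)try_enqueue_request(ch, &req);

	/* 4. Chain a close request, then invoke the kernel once for the whole batch. */
	req = (struct RequestEntry){0};
	req.args[0] = pack_op(PROVIDER_DOS, OP_CLOSE);
	req.args[1] = FLAG_CHAIN;
	(void)try_enqueue_request(ch, &req);

	sys_submit();
}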

Result entries would have a similar structure layout, but with an interpretation adapted to reporting the results of an operation.

Another advantage of this approach towards implementing a computing system is portability. Consider a kernel for the 6502 processor, and an application that runs under it. Later, a new computer is released with the 65816 processor, necessitating a 16-bit OS to take advantage of its features. If the original 8-bit system software exposes a call-based interface, then invoking its services from the 65816's native mode requires a lot of work! You basically need to implement a task switcher to make it happen. Well, if you're going to go to that much trouble, you might as well write a more complete kernel along the way. But now you've broken compatibility with the legacy 6502 code you may still want to run (you need to provide an 8-to-16-bit shim). By maintaining a request queue abstraction, you are completely free to alter the backing implementation without any concern for the CPU's current operating mode.

You might think I'm being hyperbolic or overly hypothetical. However, this kind of interaction has evolved several times throughout computing's history. We see it in the form of "channel I/O" on IBM mainframes, in the I/O processors of CDC mainframes (where the CDC 6600's operating system is literally executed by one of its 10 I/O processors!), and today in NVMe interfaces and in the Linux kernel's VirtIO driver interfaces. Remember what I said about 6502 vs. 65816 kernel implementations above? This is not hypothetical; IBM already figured this out years ago, when they replaced the channel I/O hardware on mainframes with PowerPC-based channel emulator cards. To this day, software written for IBM mainframes still thinks it's talking to channel controller hardware built in 1964 for the original IBM System/360, only much, much, much faster.

But, these interfaces are optimized for hardware; are there any software equivalents? Yes!

AmigaOS and Tripos are both fully functional OSes almost entirely predicated on asynchronous API requests (ARexx ports, low-level device I/O, installable filesystem "handlers", GUI events, etc. are all asynchronous). However, the problem with AmigaOS and Tripos is that they both rely on dedicated system calls for enqueueing and dequeueing messages. While lightweight, and so convenient to use that you'll be spoiled forever, the message-passing overhead is still higher than simply touching memory. This was enough of a concern with AmigaOS that some of its message-handling functions are duplicated as C and assembly macros in amigalib. Also, both AmigaOS and Tripos still tend to hide the asynchronous logic behind a synchronous library API at the systems level.

If you squint a little bit harder, you can also see traces of this in "cache-kernel" systems which use message-based signalling (read up on how the V++ kernel works). Indeed, the only reason we even still have a system call at all is because most computers lack hardware to transparently trap to a message handler after updating a software-managed request queue. (Debugger support, like hardware watchpoints, might be useful here.) Cache kernel systems with hardware-assisted messaging features can get away without ever explicitly "calling" into a kernel.

Closer to what was discussed above, Tock:OS maintains a remarkably small set of system calls (smaller than most microkernels!). The set of primitives offered by a Tock:OS "capsule" obviously varies from capsule to capsule, but most have interfaces which amount to "enqueue this request," while Tock:OS itself has a system call which waits until an event happens, invoking application-specific callbacks for specific events. That callback model is strikingly close to how the homebrew virtual machine Uxn works.

There are many articles calling for POSIX to be relegated to legacy status. If we ever want to move beyond the POSIX API and the bottlenecks it imposes on us as application developers, I think "channel systems" are the way forward. While not as easy to use as a synchronous API, they are vastly simpler than the existing alternatives. The fewer system calls there are, the better. As far as I can tell, nothing else delivers the bang for the simplicity-vs-scalability buck that channel-based I/O and service requests do.