Falvotech.
Conquering the world is easy — what do you do with it afterwards?

L I N K S


⇐ Unsuitable in Unsuitable: Foreward

⇒ Unsuitable on Unsuitable: General Object Store


Unsuitable on Unsuitable: Concepts
Samuel A. Falvo II
kc5tja -at- arrl.net
2010 Jun 26 20:23 PDT

Most open source and commercial applications alike lack an overview of the core concepts used in its development. Without this overview, anyone trying to understand the program will experience difficulty internalizing the design. In this article, I discuss three of Unsuitable's core concepts. Without these key concepts, Unsuitable would be substantially more complex to understand and implement.

1 Persistence at the Lowest Level

At the lowest level, Unsuitable's machine model assumes two distinct, but related, address spaces. The first address space, named the dictionary by ANSI Forth specifications, offers a non-persistent, private, temporary storage space which comes and goes as web requests commence and finish. In contrast, all CGI handler instances share the persistent space, which outlasts all CGI handler lifetimes, and thus stores the content the blog delivers. Unsuitable treats both address spaces as byte-accessible using fetch and store operations. Unsuitable distinguishes between dictionary and persistent spaces — erroneous behavior results from using a persistent pointer with @ or ! or a dictionary address with @f or !f. Address 10 in the dictionary does not refer to address 10 in persistent space.

The very first definitions issued on every CGI request provides the Forth words necessary to access this persistent address space (taken from mappings.fs):

: >core     dup 10 rshift block swap 1023 and + ;
: @f        >core @ ;
: !f        >core ! update ;

The @f (fetch-field) word fetches a cell from persistent storage. Except for the fact that it reads from persistent space, a programmer uses @f exactly the same way he'd use @. Likewise, !f (store-field) stores a cell to persistent storage, used exactly like !. Note that !f automatically takes care of marking altered blocks as dirty for you.

@f and !f work with byte addresses; hence, a 32-bit Forth system will allow access to 4GiB of persistent storage space. A 64-bit Forth environment will of course allow much larger address spaces. While Unsuitable should be able to run on 16-bit Forth systems, you probably won't get anything more than a toy, as 64KiB of persistent storage space isn't much to work with.

2 Memory Map

When I first conceived of this twin space model of storage management, I anticipated creating a system of dynamic memory management that would encompass everything I needed to preserve. With address 0 reserved as a pointer to some root data structure, from which everything else could be found, all other material would need storage reserved through some kind of persistent malloc-like operation. After all, the AmigaOS environment already works this way. In AmigaOS, a pointer at location $00000004 pointed to the kernel's exec.library base structure, from which literally all other pieces of data running in AmigaOS can (directly or indirectly) become accessible.

As my work on Unsuitable unfolded, it turned out that a mixture of static and dynamic allocation techniques would be used. Static allocation dominates the functional storage requirements, while dynamic allocation manages the space used in the General Object Store (GOS). Below lies the current persistent memory map (at the time of this writing):

FROM      TO        DESCRIPTION
00040000  0007FFFF  256K GOS
00013800  0003FFFF  Unused.
00011000  000137FF  10K Article Relation
00010800  00010FFF  2K GOS Handle Table
00000800  000107FF  64K unused.  (Formerly used by GOS before expanding it to 256K)
00000400  000007FF  GOS and Article Table Metablock
  00000400            GOS allocation fencepost
  00000404            Next article ID
  00000408-07FF       Unused.
00000000  000003FF  A vestigial block reserved by ANSI Forth for boot-loader purposes,
                    dating back to when Forth systems didn't require OSes to boot up.

Thankfully, we don't need to memorize base addresses; the remainder of the code refers to the various allotments and their sizes by name. Thus, the source file mappings.fs exists to define these names1:

: >core     dup 10 rshift block swap 1023 and + ;
: blocks    1024 * ;
: @f        >core @ ;
: !f        >core ! update ;

( block 1 = meta-block )
: g-fencepost   1 block ;
: a-nextId      1 block [ 1 cells ] literal + ;

( blocks 2 .. 65 = 64K unused space -- formerly general object store )

( blocks 66 .. 67 = handle table for general object store )
1 blocks constant /hfields
66 blocks constant addrs
addrs /hfields + constant lens

( blocks 68 .. 77 = article table columns )
2 blocks constant /afields
68 blocks constant articleIds
articleIds /afields + constant titles
titles /afields + constant leads
leads /afields + constant bodies
bodies /afields + constant timestamps

( blocks 256 .. 511 = General Object Store [256K version] )
256 blocks constant gorg
gorg 256 blocks + constant gend

3 Parallel Arrays

Unsuitable proudly uses parallel arrays (column-major storage) to store structured data instead of data structures or records (row-major storage). Forth lacks integrated support for structures, and I find the level of abstraction offered by them not worth the hassle to implement. They also permit adding or removing whole columns of data without affecting already persisted information, quite a handy feature while the software evolves!

The Wikipedia article linked to above lists four objections to using parallel arrays. I'd like to address them here.

They have significantly worse locality of reference when visiting the records sequentially and examining multiple fields of each record, which is the norm. The norm? I concede this will be true for object- or structure-laden procedural-oriented software, but most likely false for software written to take advantage of relational theory. Since it's a blog, Unsuitable spends most of its time searching or sorting, both of which require highly sequential accesses to large quantities of data maintained according to relational theory. In this case, column-major storage significantly improves locality of reference.

They obscure the relationship between fields of a single record. This problem exists also with structure- and object-oriented languages like C or Smalltalk. You cannot, for example, declare a C structure field as AUTOINCREMENT, nor can you declare primary or foreign keys in Smalltalk class definitions. I argue that this criticism simply isn't relevant in any context except debating the merits of SQL's CREATE TABLE statement versus C's struct.

They have little direct language support (the language and its syntax typically express no relationship between the arrays in the parallel array.) Any language providing arrays as a native data type clearly invalidates the first part of this criticism, so I won't even dignify it with a detailed response. The real meat of this criticism lies in its parenthetical clarification. While technically true, practice proves this simply isn't a problem. Well-factored software demands your data representation to be hidden from outside modules2 — using records or parallel arrays should be of no consequence to external software (speed considerations notwithstanding). Semantic relationships remain the only meaningful property to concern oneself with.

They are expensive to grow or shrink, since each of several arrays must be reallocated. Two fallacies exist in this criticism: first, that changing the size of an array in all programming languages proves expensive in time and/or effort, and second, that you always use a flat array structure. Forth permits redimensioning an array quite trivially: just bring up the source containing the constant definitions of the array's base and length, change them, and save them. The expensive part of this operation comes from making sure you migrate the data from the old to the new location first. However, any SQL database administrator can tell you stories about how long it takes to execute ALTER TABLE statements on production databases, and how in some cases, it's far faster and easier to export the data, DROP TABLE, CREATE TABLE again with the desired changes in place, finally followed by re-importing the data. Expense defines the way of life of database administrators, and Forth will not offer any silver bullets here. Forth will, however, let you do this while concurrently running the production system, so you amortize any expenses encountered. If you find you're resizing your arrays often, you have the option of using multi-level arrays. Numerous filesystems use multi-level arrays to allow a file to grow or shrink over time. Slab allocators employ them, too. Hence, much prior art exists demonstrating solutions to the problem of dynamically resizing arrays, for both slow and fast media alike.

4 What's Next

We discussed a few concepts which future articles will treat as a foundation on which to build. In the next article, I discuss the most important storage abstraction Unsuitable has: the General Object Store. Stay tuned.


1  For those who are familiar with coding for or administering the IBM mainframe line, mappings.fs serves the same role as defining data-sets in MVS or z/OS. Since Forth thankfully deals with logical block addresses, though, I posit Forth is better than MVS and its successors, at least in this respect, because I don't have to deal with physical CHS addressing.

2  Or, at least, organized as if it were; note Forth also lacks syntactic support for module-level encapsulation.