[Date Prev][Date Next][Thread Prev][Thread Next][Author Index][Date Index][Thread Index]


A Summary of the Magic Data Structure Engine Meeting

As perceived by Eric Hill
with apologies in cases of mis- or unplaced credit

  This is a review of the MDSE meeting of Dec. 14-16, where Eric Drexler,
Michael McClary, and I presented an architecture to implement the data
structure support required by the semantic layers of the Xanadu backend. 
Mark Miller presented some new semantic elements that had been developed
since the 88.2 meeting.  As was to be expected, many things that were true on
Wednesday morning were no longer true by the end of Friday.  An interesting
result of this meeting was a revision of the backend semantics by Mark Miller
& co.  Since I was ill on the last day of the meeting, my explanation of the
new semantics may not be as complete or accurate as desired.  Fortunately,
the new semantics do not invalidate most of the existing design work.  A
reassuring feature of the design is that many of the issues seem orthogonal,
so that the entire effort will not block on any single point.

  The MDSE architecture that was presented interfaces to the semantic layers
by MapOrgls, DataOrgls, and Bags.  Bags are a new semantic beast invented to
solve some problems with reconciling access control with back referencing. 
The Bags regulate access to the GrandMap, BackReference information, and
sensors used to asynchronously detect new bag contents.  In the middle of the
MDSE are various data caching and placement schemes aimed at maximizing the
speed of MDSE operations.  In this regard, Eric Drexler has been our
principal speed demon.  At the far back end of the MDSE is the Unusually
Reliable Disk Interface (URDI) that assures the availability of a consistent
disk data structure after many failure modes.  Michael McClary, our
reliability guru, is chiefly responsible for the excellent ideas behind the

  I will first discuss the design elements that started the meeting, and then
the resulting mutations.  However, the order of items here is at best
partial, although it is possibly in a more appropriate order than in the
original meeting.  References to 88.2.XMAS indicate 88.2 as modified after
this meeting.


  Before I proceed further, I should mention that because of an earlier
observation by Robin Hansen, a new element has been added into the
computational lexicon: "moose."  A moose is an inadequately addressed design
issue that is potentially very large and very nasty.  It could be thought of
as a metabug, as it is a problem that exists before code is even written, and
could therefore produce bugs (or spoilage) if unresolved.  The origin of this
term comes from a design meeting where Robin accepted the task of logging
issues that had been swept under the rug.  At one point he made a comment to
the effect that, "it looks like you've swept a whole moose under there."  In
the Xanadu design effort, this term has proved invaluable--the language,
however, may never be the same.

  A prerequisite concept which affects much of our thinking about data
organization is the snarf.  Snarfs are units of data exchange that grew out
of typical disk data transfer rates compared to typical seek times.  As
opposed to the loaf-by-loaf disk accesses in 88.1, a snarf has a size that is
optimized for the transfer to seek ratio of the mass storage device used--
roughly a track.  By reading a track-sized chunk of data structure into main
memory, one is likely to get a large amount that will be needed immediately,
with virtual no additional cost for bringing in the unneeded parts.  Several
factors, however, aggravate fetching a disk track sized chunk in optimal
time:  operating system interference, bad blocks, and variable length tracks. 
While the OS obstacles are not entirely characterized, the other two cases
seem to have a worst case limit around a 10% slowdown over the best case.

Semantic/MDSE Interface

  MapOrgls will be implemented using Eric Drexler's Entfilades (Ents).  The
core form of this data structure has been very well thought out.  Given the
isomorphism between disk and core form achieved by snarfing, the only disk
related concern is about how to best divide large Ents over multiple snarfs. 
I think that for first product we can punt this and rely on snarf caching to
minimize the number of disk seeks within a single request.  For later
products, we have discussed various "practice effect" schemes to dynamically
optimize the storage of Ents.

  DataOrgls have the delightful property of having been already designed long
ago.  For first product, at least, a DataOrgl can be implemented as either a
plain model-T enfilade or only a slight variation on one.  Out of the sea of
possibilities, we have decided to support coordinate space types for two data
topologies in the first product.  One of these is fullyOrdered(Int), for text
and things like text.  The experience with 88.1, and file systems in general,
suggest that a great variety of data can be reasonably stored this way,
especially given the mapping provided by MapOrgls.  This DataOrgl can be
stored as a model-T enfilade.

  The other type of DataOrgl to be implemented is a nested space of one
fullyOrdered(Int) under another, motivated by raster data, such as digitized
video, where one axis is dominant over the other.  The implementation of this
is a collection of model-T's for the minor axis, folded into a model-T for
the major axis.  This may not be the best solution.  This proposal was a
successful discussion generator, with points made regarding run length
encoding and other existing means of storing raster data.  I do not consider
this part of the design to be final.

  Bags are a semantic device to assure proper containment of information with
respect to access control.  While access control in the GrandMap is easily
regulated by the guardian, back referencing is trickier.  Bags were devised
while I was away during Thanksgiving to dispatch this moose.  The problem
arose from providing back reference information for a heavily referenced
object where most or all of the references are from sources not permitted to
the requester.  The trivial solution would involve fetching all of the
relevant backrefs and then filtering for visibility to the user.  Since this
could return a negative result after an unbounded delay, it is an
unacceptable method.  Therefore one of a bag's functions is to restrict
backref searches to the subset of the docuverse that is interesting and/or
available to the requester.  At some semantic level, the backend performs its
operations in the context of a set of keys (SOK) which specify the bags to
operate on.  These keys are modeled after public key cryptosystems, with a
public read and write key and a private read and write key.  As of this
writing, however, I understand that Mark Miller and Eric Tribble have done
work on SOK/Bag semantics that I have not seen, although I have been assured
that the MDSE support for bags does not need to change much.  As gravy, bags
also provide a basis for filtering on such things as author, link type, news
group, etc.  For any MDSE operation, only one bag at a time is used; the
details will be given below.

  The GrandMap is the system-wide map from IDs to objects--actually their
addresses--for objects that have persistent IDs.  Since IDs have no natural
order and speed is vital, a hash table is the obvious device to use.  Eric
Drexler has found several good methods in the existing literature: extensible
hashing techniques that grow smoothly and promise at most one or two disk
seeks.  This is an area of the implementation where we feel relatively safe.

  The BackReference Index will be implemented as a separate back reference
map for each orgl to avoid designing a global data structure that must
support multiple coordinate space types at once.  My proposal for the back
reference structure is something called a RefOrgl which maps from regions in
a coordinate space to the IDs of objects that refer to those regions.  This
will be implemented using a descendant of Drexler's point spanfilade that
supports access control filtering.  This filtering is allowed by widding
baggage information up through the enfilade, so that subtrees may be
instantly rejected if they do not contain any back references visible for a
particular key.  The current proposal has a second entire point spanfilade
for the large special case of back references accessible to everybody.  At
the MDSE interface, however, there appears to only be one.

  The BackReference Sensors that the MDSE is responsible for are an
especially mooseful area of the design.  The intended purpose of backref
sensors in the first product is to support asynchronous announcement of new
mail for a user.  At one point, sensors were part of a Bag class, and were
triggered whenever there was a change made in a given bag.  However, the Bag
has not survived as a class at the MDSE interface.  The real problem is that
we are not sure exactly how and when to really set off a sensor.  The
original proposal had the sensor processing execute after the GrandMap or
BackReference call returned.  It was pointed out that this would usually be
premature, as the sensor announcement for something like mail should really
be made when the change is safely on the disk.  The other entirely unresolved
issue is how and where to store sensors.  Particularly, we want sensors to
persist across system restarts.  Also, it appears that sensors need some sort
of ID so they can be turned off or otherwise modified once they have been
installed--this hints at placing them in the Xanadu object space, which is
semantically unpleasant.  Basically sensors are going to require a great deal
more thought.  Fortunately, they are orthogonal to most everything else.

Caching and Disk Layout

  The first layer of caching above the disk interface is the snarf cache. 
Since snarfs contain all data that might be on the Xanadu disk, all of the
higher levels can go through this cache.  The GrandMap has a cache of entries
for recently used IDs that is extracted from the snarf cache--the GrandMap
cache may simply be a small fixed length hash table.  The various enfilade
classes will also want finer grained caching, although in their case few
details have been worked out.

  The glacial seek time of disk drives has motivated a lot of thought about
maximizing the utility of each disk access.  Eric Drexler has been the
leading advocate of paranoia in this area, with the battle cry: "Locality of
Reference!"  Along these lines, several ideas about packing things together
on disk have emerged, some of which look likely for first product.  One of
these ideas is to organize the disk by "clusters," where a GrandMap hash
table page is placed in the middle of a disk cylinder surrounded by the items
that the page refers to.  Thus if a seek to a GrandMap page does not also
fetch the desired item, that item will be within a cylinder of the that seek,
and not much more expensive to reach.  The other kind of locality that we
have in mind is to place MapOrgls as close as possible to their principal
DataOrgls, and both of those close to their RefOrgls.  In the common case of
small, lightly-linked documents, these may all fit within the same snarf.  An
interesting point about this association is that it seredipidously resembles
the new semantics that Mark Miller developed as a result of this meeting.

  URDI.  The Unusually Reliable Disk Interface developed by Michael McClary
gives us a degree of incremental recovery from system failures that we have
long been wanting.  As with any heavily cached system, Xanadu runs the risk
of having an inconsistent data structure on the disk after crashes and power
failures.  Buffering in most operating systems and high-end disk drives only
aggravate this problem.  The URDI allows a consistent, although not
necessarily current disk data structure to be achievable after when
restarting the system after some, but not all failure modes.  Some the modes
that we are not protected against are spiral writes and head crashes.  The
metaphorical model that Michael has used is Fibber McGee's closet.  The
closet represents the main storage area on the disk where most reads take
place, and a consistent data structure is more or less guaranteed during
execution of user requests.  Writes, however, go through a staging area on
their way to the closet, where groups of snarfs are placed in temporal order. 
At the end of each group is a marker indicating that if all of the staging
area up to that point has been put in the closet, the disk will be
consistent.  After a failure, this area will be processed before
acknowledging any requests.  While this will result in some delay at system
restart, it is immensely faster than reevaluating the entire transaction log.

Some Fallout

  Not only did the implementation proposals suffer casualties, but by
Thursday, the semantics themselves were under revision.  The new state of
affairs, however, should be refreshing to both frontend writers and
performance fans.  The interfaces to both orgls and berts have changed.  For
one thing--as some may notice--we have returned to ONLY orgls and berts,
rather than the plethora of objects consisting of Berts, DataBerts, MapOrgls,
DataOrgls, RefOrgls, and G=8(pi)T knows what else all.

  The new orgl very much resembles the old orgl of 88.1, with a few 89-ish
additions.  The 88.2.XMAS orgl consists of an oMap, a dataMap, and a refMap. 
The oMap is the same map as in the 88.2 MapOrgl, and is analogous to the V/I
mapping in 88.1.  The dataMap associates invariant positions with actual
data--like the old I-stream--and manifests the retrieval aspects of the
original 88.2 DataOrgl.  The refMap provides back reference information with
baggage like the RefOrgl described above, but is now folded neatly into the
rest of the orgl semantics.  The extraction and editing operations have not
changed much from 88.2 orgls.  While back reference information is not
explicitly changed at the semantic level, the 88.2.XMAS MDSE orgl will also
have operations to add and remove back references.

  The modification of berts is that the data append operations of the 88.2
DataOrgl are now accessed through DataBerts.  A DataBert is distinguished
from a plain Bert only in that its dataMap is not empty.

The new semantics and the current theory of implementation have several
interesting overall properties:
- The semantics are closer to 88.1 than the system presented in the 88.2
- The semantics are closer to what a frontend is likely to expect,
requiring less of a thorax (middle end) than previously.
- The semantics are closer to the data structure implementation.
- There are now fewer types of things in the Xanadu object space.
- Things look less scary to implement.

Stay Tuned,
	- e -