[Date Prev][Date Next][Thread Prev][Thread Next][Author Index][Date Index][Thread Index]
More Udanax code questions
It's perhaps silly to reply to your own posting, but I realized that
there are a bunch of questions that I could have asked that I didn't,
and Roger's mail is a sign that we might be able to get some
answers. So I'll add new questions to my old questions. Anyone else
who's been going through the code is welcome to answer as well, of
course.
At 12:16 PM -0500 2/23/00, David G. Durand wrote:
Here's what I've learned and some of the questions that I still have.
My own examination shows that the fundamental "model-T" enfilade
appears to be a balanced tree of totally ordered content addresses.
The simplest approach to managing such a data structure for multiple
versions, with data sharing, is to use a technique that I've always
called "copy-to-root". The udanax code instead inserted some sort of
node representing a permutation or alteration in place at every
ancestor where the child relations change for a given version. I
believe that these "node-level diffs" are called "orgls", but can't
quite tell.
It's also frequently necessary to split nodes where a rearrangement
occurs, as a precondition to creating an orgl that rearranged
children.
Questions: The tree seems to be n-ary, and height-balanced (the path
from root to leaves is constant length). What are the arity
conditions on nodes, and what are the performance implications of the
implementation of orgls at each node? How is a particular version
used as a key to select the correct orgl at a node?
The granfilade seems to store the I-stream in a more-or-less vanilla
Model-T enfilade, with mapings to actual data items. Are links stored
in here, or only data?
Link management is based on a "2-D" enfilade which seems to manage
starting points and widths of linked areas, with linked lists of
pointers to the original data in the I-stream.
The 2-D enfilade is the "spanfilade", right?
It seems like the start and end spans of link-regions are stored here
in terms of v-stream addresses. Do these map also to I-stream
addresses or is that a separate step?
One possible design would be to store each linkend region once in the
spanfilade as a v-stream-vstream mapping, and then use the separate,
per-document v-stream maps to find the actual i-stream addresses of
the linked data.
Alternatively, you could store I-stream addresses directly here, and
link them directly to the corresponding link-descriptions, with the
associated v-stream addresses. In this case, a link end would be
expanded to a list of I-stream ranges, which are added to the tree,
and then stored with references to the link itself. Thus the
spanfilade would consist of a node for each of the smallest spans
that have identical links overlapping them, and would point to lists
of those links. There should be some structure-sharing tricks here to
keep from duplicating the list elements when a span must be split.
This still seems like it would have to be linear in the number of
I-stream regions covered by a link endpoint, though.
In the 2-D enfilade, how does the second dimension affect the
indexing? Is it purely a secondary key (as the code appears) or is
there some trickery that makes this more flexible. How and why are
orgls used in the 2-D enfilade. Is this merely a map from v-stream
addresses to i-stream addresses? Does it also include special items
mapping from
In general, how does one find the set of V-stream addresses
corresponding to an I-stream address?
The loaf/crum management is a basic explicitly managed virtual
memory, with in memeory use-counts managed in the code.
I don't really have any questions about this, other than how crums
are assigned to loaves. This is an area where I'd be tempted to maybe
learn a few lessons, but mostly skip over, on the assumption that
current, if sub-optimal, memory management facilities might suffice.
Making this more transparent via an object model with explicit
management of references would not be a bad idea.
The Smalltalk code is more scrutable, but also more ambitious. It
seems to contain a certain amount of dead code, as well as many
classes for process management and the like that are not so closely
related to the fundamental algorithms (at least the pure data
management ones).
As Roger says, there's much good functionality here, but the
implementation may be too clumsy for contemporary use... On the other
hand, the pure-smalltalk version might run well enough on current
hardware to be useful, at least on a small scale.
In the generalized code, the sequence is replaced by an arbitrary
address space, which can be any algebraic structure (set of items
with significant relations between them). Contents are associated
directly (by identity) with elements of the address space by a
mapping. Operations are represented by transformations (or complete
replacement, or incremental replacement?) of the address->data
mappings. This means that handling n-d (and indeed arbitrary)
topologies can be handled uniformly within a single structure. I
assume that this structure is the fabled Ent.
A key question in this is how to efficiently represent the
incremental modifications of an arbitrary map so that the runtime of
applying the composed map is not linear in the number of
modifications.
How are partial maps represented efficiently?
I don't quite understand how WIDativity and DSPativity are used to
enable the efficient creation of compositional maps for this. I get
lost in the code, but the Ent seems to be a hierarchical
representation of a series of maps. The top level represents large
scale rearrangements, and yields a precise state when composed with
the smaller scale maps whose results form its input. Such a
composition is a way to retrieve a single state of the base address
space. However, there must be an efficient way to select from
alternative maps at each level, so that a particular state can be
selected.
So how do you select an appropriate map at a given level? One
imagines a hybrid tree (much like a k-D tree) where alternate levels
select on:
a) an address key (as input to a partial address->data map). This
might be one way to break up a mapping into "geographical" regions of
the address space.
b) A version or configuration key (to select the correct
address->data map). Each version would create new choice points among
variations of the map, and these would probably also be represented
by a balanced structure of some sort.
This might be a big tree, with hashtables at each node, with
elements of the hashtable being alternative maps for that node,
keyed off of the version identifier of a given state (a Bert,
right?).
The version structure itself (a Dagwood?) is a version tree
containing some sort of keys to be used in partial map selection. Is
this right?
-- David
--
_________________________________________
David Durand dgd@xxxxxxxxx \ david@xxxxxxxxxxxxxxxxxxx
http://cs-people.bu.edu//dgd/ \ Director of Development
Graduate Student no more! \ Dynamic Diagrams
--------------------------------------------\ http://www.dynamicDiagrams.com/
\__________________________