[Date Prev][Date Next][Thread Prev][Thread Next][Author Index][Date Index][Thread Index]

More Udanax code questions

It's perhaps silly to reply to your own posting, but I realized that there are a bunch of questions that I could have asked that I didn't, and Roger's mail is a sign that we might be able to get some answers. So I'll add new questions to my old questions. Anyone else who's been going through the code is welcome to answer as well, of course.

At 12:16 PM -0500 2/23/00, David G. Durand wrote:

Here's what I've learned and some of the questions that I still have.

My own examination shows that the fundamental "model-T" enfilade appears to be a balanced tree of totally ordered content addresses. The simplest approach to managing such a data structure for multiple versions, with data sharing, is to use a technique that I've always called "copy-to-root". The udanax code instead inserted some sort of node representing a permutation or alteration in place at every ancestor where the child relations change for a given version. I believe that these "node-level diffs" are called "orgls", but can't quite tell.

It's also frequently necessary to split nodes where a rearrangement occurs, as a precondition to creating an orgl that rearranged children.

Questions: The tree seems to be n-ary, and height-balanced (the path from root to leaves is constant length). What are the arity conditions on nodes, and what are the performance implications of the implementation of orgls at each node? How is a particular version used as a key to select the correct orgl at a node?

The granfilade seems to store the I-stream in a more-or-less vanilla Model-T enfilade, with mapings to actual data items. Are links stored in here, or only data?

Link management is based on a "2-D" enfilade which seems to manage starting points and widths of linked areas, with linked lists of pointers to the original data in the I-stream.

The 2-D enfilade is the "spanfilade", right?

It seems like the start and end spans of link-regions are stored here in terms of v-stream addresses. Do these map also to I-stream addresses or is that a separate step?

One possible design would be to store each linkend region once in the spanfilade as a v-stream-vstream mapping, and then use the separate, per-document v-stream maps to find the actual i-stream addresses of the linked data.

Alternatively, you could store I-stream addresses directly here, and link them directly to the corresponding link-descriptions, with the associated v-stream addresses. In this case, a link end would be expanded to a list of I-stream ranges, which are added to the tree, and then stored with references to the link itself. Thus the spanfilade would consist of a node for each of the smallest spans that have identical links overlapping them, and would point to lists of those links. There should be some structure-sharing tricks here to keep from duplicating the list elements when a span must be split. This still seems like it would have to be linear in the number of I-stream regions covered by a link endpoint, though.

In the 2-D enfilade, how does the second dimension affect the indexing? Is it purely a secondary key (as the code appears) or is there some trickery that makes this more flexible. How and why are orgls used in the 2-D enfilade. Is this merely a map from v-stream addresses to i-stream addresses? Does it also include special items mapping from

In general, how does one find the set of V-stream addresses corresponding to an I-stream address?

The loaf/crum management is a basic explicitly managed virtual memory, with in memeory use-counts managed in the code.

I don't really have any questions about this, other than how crums are assigned to loaves. This is an area where I'd be tempted to maybe learn a few lessons, but mostly skip over, on the assumption that current, if sub-optimal, memory management facilities might suffice. Making this more transparent via an object model with explicit management of references would not be a bad idea.

The Smalltalk code is more scrutable, but also more ambitious. It seems to contain a certain amount of dead code, as well as many classes for process management and the like that are not so closely related to the fundamental algorithms (at least the pure data management ones).

As Roger says, there's much good functionality here, but the implementation may be too clumsy for contemporary use... On the other hand, the pure-smalltalk version might run well enough on current hardware to be useful, at least on a small scale.

In the generalized code, the sequence is replaced by an arbitrary address space, which can be any algebraic structure (set of items with significant relations between them). Contents are associated directly (by identity) with elements of the address space by a mapping. Operations are represented by transformations (or complete replacement, or incremental replacement?) of the address->data mappings. This means that handling n-d (and indeed arbitrary) topologies can be handled uniformly within a single structure. I assume that this structure is the fabled Ent.

A key question in this is how to efficiently represent the incremental modifications of an arbitrary map so that the runtime of applying the composed map is not linear in the number of modifications.

How are partial maps represented efficiently?

I don't quite understand how WIDativity and DSPativity are used to enable the efficient creation of compositional maps for this. I get lost in the code, but the Ent seems to be a hierarchical representation of a series of maps. The top level represents large scale rearrangements, and yields a precise state when composed with the smaller scale maps whose results form its input. Such a composition is a way to retrieve a single state of the base address space. However, there must be an efficient way to select from alternative maps at each level, so that a particular state can be selected.

So how do you select an appropriate map at a given level? One imagines a hybrid tree (much like a k-D tree) where alternate levels select on:

a) an address key (as input to a partial address->data map). This might be one way to break up a mapping into "geographical" regions of the address space. b) A version or configuration key (to select the correct address->data map). Each version would create new choice points among variations of the map, and these would probably also be represented by a balanced structure of some sort.

This might be a big tree, with hashtables at each node, with elements of the hashtable being alternative maps for that node, keyed off of the version identifier of a given state (a Bert, right?).

The version structure itself (a Dagwood?) is a version tree containing some sort of keys to be used in partial map selection. Is this right?

   -- David
David Durand              dgd@xxxxxxxxx  \  david@xxxxxxxxxxxxxxxxxxx
http://cs-people.bu.edu//dgd/             \  Director of Development
    Graduate Student no more!              \  Dynamic Diagrams
--------------------------------------------\  http://www.dynamicDiagrams.com/