SEP94: K-Tree Container Data Structures
In dealing with the problem of browsing and debugging incomplete programs, I needed to efficiently handle tree nodes with variable (possibly large) numbers of children. To address this problem, I developed a data structure called a "K-tree," which also has general applicability.
K-trees are container data structures that represent linear sequences of integers, pointers, and the like. There are countless ways of representing a sequence, but almost all are variations on arrays or linked lists.
Arrays are very fast when you have a subscript, say, I, and want to find the Ith element of the sequence. The time required by the address computation does not increase as the sequence gets longer. This is called a "constant-time" operation. On the other hand, if you want to concatenate two sequences or extract a subsequence (a slice), part of some array must be copied. The time this requires increases in direct proportion to the length of the sequences. This is a "linear-time" operation.
The many variants of linked-list representations tend to be just the opposite. You can cut apart pieces of a linked list and splice them together in constant time. But finding the Ith element requires traversing from one end of the list, which is linear.
Now, suppose you need to do a mixture of constant-time operations and linear operations. As the problem gets bigger, linear operations will account for almost all the time the program takes to run, while the overall effect of constant-time operations is negligible. In an application where both kinds of operations are needed, performance is indicated by the most inefficient operation--linear, in this case.
With K-trees, subscripting, slicing, and concatenation all take time proportional to the logarithm of the length of the sequence. This is not as good as constant time, but it's much better than linear time. Since no operation is worse than logarithmic time, the logarithmic performance dominates.
K-trees have one other important characteristic that I needed in my application. When you extract a slice, the original sequence from which it was taken is preserved. The same goes for concatenation and subscripted assignment. Most of the array and linked representations have some operation which destroys operands, unless you first make a copy, which is, of course, linear.
To make this preservation of operands happen, K-trees use heap objects that are immutable--once created, they never change. These objects can be shared among several sequences, and this is vital to making the operations logarithmic. On the downside, some kind of garbage collection is needed to reclaim objects no longer used in any sequence.
A K-tree is a pointer to one of two kinds of nodes, both of which contain an integer field named Height. If Height=1, the node is a leaf node and contains a field named LeafElems (a small array of sequence elements). If Height>1, the node is a nonleaf node and contains a field named NonleafElems (a small array of records). Each record contains two fields named CumChildCt and ChildRef. CumChildCt has the type of sequence subscripts. ChildRef is a K-subtree pointer.
Every node has a field named ElemCt which gives the number of elements. The elements of both leaf and nonleaf nodes have subscripts in the range 0..ElemCt-1. Each node is dynamically allocated when created, with exactly enough space to hold ElemCt elements.
Figure 1 shows a graphical notation for both kinds of nodes. The fields and array subscripts are labeled, showing how to interpret the nodes in the examples that follow.
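The node layout just described can be sketched as follows. This is an illustrative Python rendering (the article's actual implementation is in Modula-3); field and class names mirror the article's terminology, and frozen dataclasses stand in for the immutable heap objects.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

@dataclass(frozen=True)  # frozen: nodes never change once created
class Leaf:
    elems: Tuple[int, ...]          # LeafElems; ElemCt == len(elems)

    @property
    def height(self) -> int:
        return 1                    # leaf nodes always have Height = 1

    def length(self) -> int:
        return len(self.elems)

@dataclass(frozen=True)
class Nonleaf:
    height: int                               # Height > 1
    elems: Tuple[Tuple[int, "KTree"], ...]    # (CumChildCt, ChildRef) pairs

    def length(self) -> int:
        # CumChildCt of the last element is the length of the whole
        # sequence represented by this node.
        return self.elems[-1][0]

# A K-tree is a pointer to one of the two node kinds; None is NIL,
# representing the empty sequence.
KTree = Optional[Union[Leaf, Nonleaf]]
```

Building the tree of Figure 2(b), for instance, would take two leaves holding (7, 25) and (19, 47, 5) under one nonleaf whose CumChildCt values are 2 and 5.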
There is a global maximum, N, for the number of array elements in any node of either kind. N must be at least 3. A leaf node with one element occurs only in the representation of a singleton sequence. Every other node always has at least two elements. Figure 2 shows some examples of small K-trees and the sequences they represent. The elements of sequences are integers. Figures 2(b) and 2(c) are two different K-trees which represent the same sequence. In general, there are many K-tree representations for a given sequence. A given K-tree represents only one sequence, according to the following rules:
- A NIL pointer represents the empty sequence. A pointer to a leaf node represents just the elements of LeafElems. A pointer to a nonleaf node represents the concatenation of the sequences represented by the ChildRef fields.
- The value of CumChildCt in the Jth element of a nonleaf node is the sum of the lengths of the subsequences represented by the ChildRef fields of NonleafElems[0..J]. This means that NonleafElems[ElemCt-1].CumChildCt is the length of the sequence represented by this node.
- From a given node, all paths to leaves have the same length. This value is stored in the Height field. K-trees are always full in the sense that, although ElemCt may vary, there are no NIL pointers in ChildRef fields.
- No node contains any information derived from its parent or siblings. Since the nodes are immutable, any subtree can be shared among many parents, each of which belongs to a different K-tree.
Subscripted fetching proceeds top-down, using an intermediate subscript I that is always relative to the current K-subtree. If the K-subtree is a leaf, the sequence subscript is the subscript to LeafElems and leads directly to the desired sequence element.
If the K-subtree is a nonleaf, fetch must determine to which of the node's subtrees it should descend by comparing I with the values of CumChildCt in the elements of NonleafElems. These values are, by definition, in ascending order, so this can be done using a classic binary search.
Before fetch descends into the subtree, it must reduce the sequence subscript by CumChildCt of the subtree to the left of the one it is about to descend into.
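The whole fetch procedure fits in a few lines. In this illustrative Python sketch (node layout and names are assumptions, not the article's Modula-3 code), a leaf is the pair ('leaf', elems) and a nonleaf is ('node', [(cum_child_ct, child), ...]); bisect_right performs the binary search over the ascending CumChildCt values.

```python
from bisect import bisect_right

def fetch(tree, i):
    """Return the ith element of the sequence represented by tree."""
    kind, elems = tree
    while kind == 'node':
        cums = [cum for cum, _ in elems]
        # Binary search: first child whose CumChildCt exceeds i.
        j = bisect_right(cums, i)
        if j > 0:
            i -= cums[j - 1]       # make i relative to the chosen subtree
        kind, elems = elems[j][1]  # descend into the ChildRef
    return elems[i]                # leaf: i indexes LeafElems directly
```

Each iteration descends one level, so the running time is proportional to the tree's height, i.e., logarithmic in the sequence length.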
Subscripted assignment begins like subscripted fetching, proceeding top-down through the K-tree to the desired leaf element. However, the located element cannot be altered in place, as this would violate the preservation-of-operands property.
Instead, store allocates a copy of the old leaf node and alters the Ith element of the copy. It then returns the pointer to the new node to the level above. Each nonleaf level does essentially the same thing, except it replaces only the ChildRef field of the selected element of its copied node with the pointer it receives from below.
The result is a new K-tree, in which all nodes on the path from the root to the leaf node containing the Ith sequence element have been replaced, while all nodes off this path are shared with the original K-tree.
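This path-copying scheme can be sketched as below, using the same illustrative tuple representation as before (a hypothetical Python rendering, not the article's code). Only the nodes on the root-to-leaf path are copied; everything else is shared.

```python
from bisect import bisect_right

def store(tree, i, value):
    """Return a new K-tree equal to tree except element i holds value."""
    kind, elems = tree
    if kind == 'leaf':
        new_elems = list(elems)          # copy the old leaf node...
        new_elems[i] = value             # ...and alter only the copy
        return ('leaf', new_elems)
    cums = [cum for cum, _ in elems]
    j = bisect_right(cums, i)            # binary search, as in fetch
    child_i = i - (cums[j - 1] if j > 0 else 0)
    new_child = store(elems[j][1], child_i, value)
    new_elems = list(elems)
    new_elems[j] = (cums[j], new_child)  # replace only the ChildRef field
    return ('node', new_elems)
```

Note that the original tree is untouched and the off-path child is the very same node object in both trees.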
Figure 3 shows the result of assigning value 8 to element number 4 of the K-tree in Figure 2(d). The shaded nodes have identical contents and are the same nodes as before the assignment. The other nodes are new. The two new nonleaf nodes look the same as before, but have different values in one of the ChildRef fields. To illustrate this, pointer values which have changed are shown as dashed arrows.
Concatenation is done by constructing a seam along the right edge of the left-operand K-tree and the left edge of the right-operand K-tree. The seam is constructed bottom-up, matching, and possibly joining, pairs of nodes of the same height from both sides of the seam.
The K-tree representation rules leave some choices as to how the seam is constructed. If the two operand K-trees have the same height, concatenate could just create a new nonleaf node of two elements, with the operand K-trees as its two children. This is simple, but it tends to build binary trees. It would be better to keep nodes more nearly full.
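For reference, the naive equal-height case just mentioned is only a few lines (illustrative Python, same tuple representation as earlier sketches; the article's real algorithm instead repacks elements along the seam):

```python
def seq_len(tree):
    """Length of the sequence a K-tree represents."""
    kind, elems = tree
    if kind == 'leaf':
        return len(elems)
    return elems[-1][0]  # CumChildCt of the last nonleaf element

def concat_equal_height(left, right):
    """Naive concatenation: assumes both operands have the same height.

    Makes a new two-element nonleaf whose children are the operands,
    sharing both of them unchanged. Repeated use tends toward binary
    trees, which is why the article's algorithm repacks along the seam.
    """
    n = seq_len(left)
    return ('node', [(n, left), (n + seq_len(right), right)])
```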
Starting from the bottom in the algorithm I chose, concatenate moves higher as long as the total number of elements in the nodes on either side of the seam at a given level is greater than N. Once it reaches a level with N or fewer elements, it allocates a new node and repacks the elements.
Once repacking has started, every level above has to have one or two new nodes allocated because some changes in child pointers will be needed to reflect the replacement of old nodes at the level below.
If the total number of new elements along the seam is greater than N, two new nodes will be needed. In this case, I divide the elements equally between the new nodes, so as to keep node sizes equal.
Figure 4 gives an example of K-trees before and after concatenation, showing only the nodes along the seam. The CumChildCt fields are omitted in this example. The small numbers in circles represent nodes to either side of the seam without showing their contents. All these nodes are reused.
At height one, the two leaf nodes collectively have seven elements, which can't be repacked into one node. They are reused intact in the result. At height two, there are six elements altogether, so they are repacked into one new nonleaf node.
At height three, there are initially eight elements. Two point to nodes that are not reused in the result and whose replacement consists of only one node. This leaves a total of seven new elements needed along the seam. These are distributed into the two new nodes, three elements in one node and four in the other.
At height four, only the right-operand K-tree has a node. The root pointer of the left operand is treated as a fictitious, one-element node, which must be repacked with the elements from the right side of the seam. This requires a total of seven elements: Two point to replacements for old nodes, and the rest point to reused nodes to the right of the seam.
Finally, a new node at height five is needed to collect the two nodes of height four. Thus the height of the result K-tree is one greater than that of the taller operand K-tree.
Implementing concatenation is somewhat more complex than the concept. A recursive procedure has to start at the top of the K-trees, descend to the leaves, and then do its work on the way back up. The operand K-trees could have different heights. The Height field allows the descent to synchronize itself so it is working on nodes of the same height on each side of the seam. Unequal heights also create some special cases for node construction during the return to the top.
K-trees are sliced bottom-up by constructing two cuts through the operand K-tree along the left and right edges of what will become the result K-tree. At each level, the node is divided between the elements belonging to the slice and those outside the slice. The node slice of the divided node on its "sliceward" side must be included in the result K-tree.
The node slice could have only one element. If this happens, it is repacked with the adjacent node in the sliceward direction. This will give a total of at least three and at most N+1 elements, which can always be packed into either one or two new nodes. As an optimization to keep nodes more nearly full, the node slice and the adjacent node are also repacked any time they will collectively fit into one new node.
As with concatenation, the slice algorithm must start at the top of the operand K-tree, descend recursively, and do its reconstruction on the way back up. However, when slicing, the descending phase must determine, at each level, which child to descend to, using the starting and ending slice subscripts. This is done using the same binary-search technique used in subscripting.
When computing a wide-enough slice at low-enough levels, the left and right cuts are separated by other K-tree nodes, which will be reused in the result K-tree. Whenever at least two nodes separate those containing the two cuts, each side of the slice can be constructed independently.
At higher levels, the two cuts must be handled interdependently whenever they are spread over three or fewer nodes, since a sliceward node adjacent to a cut is involved. In these cases, the new elements will fit in at most three new nodes. When only one element exists, no new node is constructed. Instead, the single pointer is passed up, eventually to become the result K-tree, which will be lower than the operand K-tree.
Figure 5 gives an example of slice construction, showing only relevant nodes along the cuts. The notation is the same as in Figure 4, except that wavy lines through nodes are used to show the location of the left and right cuts.
At height one, the two cuts are independent. On the left, the node slice of the leftmost node shown in full contains one element whose value is 20. This is repacked with the three elements 19, 4, and 25 of the next rightward node. On the right, the entire node containing element values 16 and 10 is reused.
At height two, three nodes are involved in the slice. At the left end, two elements have been replaced by one new element, returned from below. All other elements involved are reused. This gives a total of seven elements, which are repacked into two new nodes.
At height three, only two nodes are involved. The two new pointers returned from the level below are packed into a single new node.
Finally, at height four, only one node is involved. Two of its elements are replaced by one new node. Since one pointer does not require a node at this level, it becomes the entire result K-tree.
The source-code implementation of K-trees (along with programmer notes) is available electronically; see "Availability" on page 3. I implemented K-trees in Modula-3 for a couple of reasons. First, its type system is rich enough to handle the K-tree data structure without resorting to type-unsafe tactics. Second, it has built-in garbage collection.
I tested the K-tree program with many randomly generated cases. If you run the test program, be aware that the large number of trees and the brute-force verification make it a memory and CPU hog. You might want to reduce SingletonCt, CatCt, CatOfSliceCt, and StoreCt for a more modest test run.
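A brute-force check of this kind can be as simple as flattening a K-tree into a plain list and comparing the result against ordinary list operations. This is an illustrative Python helper (the original tests are in Modula-3), again using the tuple representation from the earlier sketches:

```python
def flatten(tree):
    """Expand a K-tree into the plain list of elements it represents."""
    if tree is None:              # NIL represents the empty sequence
        return []
    kind, elems = tree
    if kind == 'leaf':
        return list(elems)        # a leaf is just its LeafElems
    out = []
    for _cum, child in elems:     # a nonleaf is the concatenation of
        out += flatten(child)     # the sequences of its children
    return out
```

Any subscript, slice, or concatenation result can then be verified against the corresponding Python list operation on the flattened operands.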
Figure 1 Notation for K-tree nodes.
Figure 2 K-trees and their sequences; (a) 13; (b) 7, 25, 19, 47, 5; (c) 7, 25, 19, 47, 5; (d) 16, 0, 15, 23, 6, 14, 11, 7, 3, 19, 29.
Figure 3 New K-trees after assignment of value 8 to element number 4.
Copyright © 1994, Dr. Dobb's Journal