roger gregory <roger@xxxxxxxxxxxxxxxxxxxxx> wrote:
>Also, why on a per-document basis, since content may be in zillions of
>documents? That is, a meg of text may exist in thousands of versions
>with no redundancy; each of these is a document, so there would be
>thousands of indexes totaling thousands of times the size of the
>document. Is this the best way?
the index entry for a document becomes part of the index catalog when it is put in. there is still an entry which can be removed, and removing it makes the document disappear from the index, but once the document has been registered with a catalog its own entry becomes very small. the catalog may also take advantage of how the document is stored, which could make it very efficient.
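a minimal sketch of that idea, assuming a term-to-documents catalog; all names here (`Catalog`, `register`, `remove`) are illustrative, not the real back-end interface:

```python
# sketch: registering a document merges its index data into a shared
# catalog, leaving only a small removable per-document entry behind.

class Catalog:
    def __init__(self):
        self.postings = {}    # term -> set of document ids (shared data)
        self.entries = set()  # small per-document entries; removable

    def register(self, doc_id, terms):
        # the document's full index is folded into the shared catalog ...
        for term in terms:
            self.postings.setdefault(term, set()).add(doc_id)
        # ... and only a tiny entry remains for the document itself
        self.entries.add(doc_id)

    def remove(self, doc_id):
        # removing the entry makes the document disappear from the index
        self.entries.discard(doc_id)

    def search(self, term):
        # hits are filtered against the surviving entries
        return self.postings.get(term, set()) & self.entries

cat = Catalog()
cat.register("doc1", ["mouse", "xanadu"])
cat.register("doc2", ["xanadu"])
cat.remove("doc2")
print(sorted(cat.search("xanadu")))  # doc2 no longer appears
```

note that the shared postings can stay in place after removal; visibility is controlled entirely by the small entries, which is what keeps the per-document cost low.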
it would probably be better to index the g tree, though you would want to filter the results by accessibility. but because the g tree does not represent complete documents, only their pieces, it might not be a good idea.
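roughly what that filtering would look like, as a sketch; the piece/document/permission tables here are assumptions for illustration, not the actual g tree structures:

```python
# sketch: index shared pieces, map hits back to the documents that use
# them, then drop documents the requesting user cannot access.

pieces = {"p1": "mouse", "p2": "xanadu"}       # piece id -> content
docs = {"d1": ["p1", "p2"], "d2": ["p2"]}      # document -> pieces it uses
accessible = {"alice": {"d1"}}                 # user -> documents visible

def search(user, term):
    # find matching pieces in the shared store
    hits = {pid for pid, text in pieces.items() if term in text}
    # a shared piece may appear in many documents
    found = {d for d, ps in docs.items() if hits & set(ps)}
    # accessibility filter: only return documents the user may see
    return found & accessible.get(user, set())

print(search("alice", "xanadu"))  # d2 also matches, but alice can't see it
```

the cost of the mapping step is the catch: one popular piece can fan out to a huge number of documents before the filter ever runs.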
>Search within a document is probably so simple it isn't worth making an index
>for them. After all we don't make an index for html documents in
>browsers do we? Optimizations probably need to be reserved to where
>they are needed.
yes, searching within a document is handled by the front end, but searching within a catalog which indexes many documents is managed by the back end, which may be split up into many layers.
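the split described above can be sketched like this; the layering scheme (`BackendLayer` delegating to the layer below it) is my assumption about what "many layers" could mean, not a description of the real back end:

```python
# sketch: front end scans one document's text; back end answers
# multi-document queries through stacked catalog layers.

def frontend_find(text, term):
    # search within a single open document: a plain scan is enough
    return [i for i in range(len(text)) if text.startswith(term, i)]

class BackendLayer:
    # one layer of the back end; a query it answers locally is merged
    # with whatever the layer below it returns
    def __init__(self, catalog, below=None):
        self.catalog = catalog  # term -> set of document ids
        self.below = below

    def search(self, term):
        hits = set(self.catalog.get(term, set()))
        if self.below is not None:
            hits |= self.below.search(term)
        return hits

inner = BackendLayer({"xanadu": {"d3"}})
outer = BackendLayer({"xanadu": {"d1"}}, below=inner)
print(frontend_find("xanadu and xanadu", "xanadu"))  # offsets in one doc
print(sorted(outer.search("xanadu")))                # docs across layers
```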
Dan Dutkiewicz