Since its inception by Hendrik Jan Bosman many years ago, the Emdros Query Tool has only had one harvesting algorithm. Well, until today, that is. Now it has four, including the old one.
The overall harvesting algorithm is:
1. Execute the query. This results in a sheaf.
2. Traverse the sheaf and gather a list of “hits”: one monad set for each “hit”.
3. Traverse the sheaf and gather the big-union of the sets of monads of all matched objects whose “Focus” boolean is true. This is called the “sheaf focus monad set”.
4. Get a set of raster monad ranges based on the list of “hits”. A “raster monad range” determines how much context to show around the set of monads corresponding to a “hit”. See below for how it is calculated.
5. Get all “data units” and their features, based on the set of monads that is the big-union of all raster monad ranges. A “data unit” is an object type whose objects must be shown for any given hit. Typical data units include “Word”, “Phrase”, “Clause”, “Sentence”, etc. They are retrieved using the MQL statement “GET OBJECTS HAVING MONADS IN”.
6. Traverse the list of “hit” monad sets. For each monad set, calculate one “solution” consisting of: (i) the “hit” set of monads; (ii) the set of monads arising from taking all of the raster units that overlap with a stretch of monads in the “hit” set of monads, which is called the “raster monad set” for this solution; (iii) all data unit objects whose monad sets overlap with the “raster monad set”; and (iv) a “focus set of monads”, which is the intersection of the “raster monad set” and the “sheaf focus monad set”.
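To make the last step a little more concrete, here is a minimal Python sketch of how one “solution” per “hit” could be assembled. It is not the Emdros Query Tool's actual code: the names `DataUnit`, `Solution` and `build_solutions` are made up for illustration, monad sets are modelled as plain `frozenset`s of integers, and raster ranges as `(first, last)` pairs.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass
class DataUnit:
    """Hypothetical stand-in for a data unit object (e.g. a Word)."""
    object_type: str
    monads: FrozenSet[int]


@dataclass
class Solution:
    """Hypothetical container for the four parts of a solution."""
    hit: FrozenSet[int]          # (i) the "hit" monad set
    raster: FrozenSet[int]       # (ii) the "raster monad set"
    data_units: List[DataUnit]   # (iii) overlapping data unit objects
    focus: FrozenSet[int]        # (iv) the "focus set of monads"


def build_solutions(
    hits: List[FrozenSet[int]],
    raster_ranges: List[Tuple[int, int]],
    data_units: List[DataUnit],
    sheaf_focus: FrozenSet[int],
) -> List[Solution]:
    solutions = []
    for hit in hits:
        # (ii) Union of every raster range that overlaps the hit.
        raster = set()
        for first, last in raster_ranges:
            span = range(first, last + 1)
            if any(m in span for m in hit):
                raster.update(span)
        raster_set = frozenset(raster)
        # (iii) Data units whose monads overlap the raster monad set.
        units = [u for u in data_units if u.monads & raster_set]
        # (iv) Intersection with the sheaf focus monad set.
        solutions.append(
            Solution(hit, raster_set, units, raster_set & sheaf_focus)
        )
    return solutions
```

For example, with a single hit `{3}`, raster ranges `(1, 5)` and `(6, 10)`, and a sheaf focus monad set of `{3, 7}`, only the first range overlaps the hit, so the raster monad set is monads 1 to 5 and the focus set is `{3}`.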
There are two changes to the harvesting algorithm which I have made today. The first relates to step #2 (gathering “hit” monad sets), and the second relates to step #4 (gathering raster monad ranges).
With the first change, there are now four ways to gather “hit” monad sets, as opposed to only one before today:
- “outermost”: This is the old one which was already there. It simply traverses the sheaf, and for each outermost straw, it calculates one set of monads, namely the big-union of the monad sets of all matched objects which are direct children of that straw. Naturally, this can get unwieldy if the outermost block is, say, a “book”.
- “focus”: This calculates one “hit” monad set for each matched object whose “focus” boolean is “true”. The “hit” monad set is simply the monad set of the matched object.
- “innermost”: This calculates one “hit” for each straw which satisfies the condition that all its children are terminals in the sheaf tree, i.e., none of the children have an inner sheaf. The “hit” is simply the big-union of the monad sets of all matched objects in such a straw.
- “innermost_focus”: Like “innermost”, but it only takes the big-union of the monad sets of those matched objects in the straw whose “focus” boolean is “true”.
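The four strategies can be sketched on a toy model of a sheaf. This is only an illustration under my own assumptions, not the Emdros API: `MatchedObject`, `Straw` and the `hits_*` functions are hypothetical names, an empty `inner` list stands for “no inner sheaf”, and I assume that “innermost_focus” skips straws which contain no focus objects at all.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, Iterator, List


@dataclass
class MatchedObject:
    monads: FrozenSet[int]
    focus: bool = False
    # Inner sheaf: a list of straws. Empty means "terminal".
    inner: List["Straw"] = field(default_factory=list)


@dataclass
class Straw:
    objects: List[MatchedObject]


def big_union(objects: Iterable[MatchedObject]) -> FrozenSet[int]:
    result = set()
    for mo in objects:
        result |= mo.monads
    return frozenset(result)


def hits_outermost(sheaf: List[Straw]) -> List[FrozenSet[int]]:
    # One hit per outermost straw, from its direct children only.
    return [big_union(straw.objects) for straw in sheaf]


def hits_focus(sheaf: List[Straw]) -> List[FrozenSet[int]]:
    # One hit per focus object, anywhere in the sheaf tree.
    hits = []
    for straw in sheaf:
        for mo in straw.objects:
            if mo.focus:
                hits.append(mo.monads)
            hits.extend(hits_focus(mo.inner))
    return hits


def innermost_straws(sheaf: List[Straw]) -> Iterator[Straw]:
    # Yield straws whose children all lack an inner sheaf.
    for straw in sheaf:
        if all(not mo.inner for mo in straw.objects):
            yield straw
        else:
            for mo in straw.objects:
                yield from innermost_straws(mo.inner)


def hits_innermost(sheaf: List[Straw]) -> List[FrozenSet[int]]:
    return [big_union(s.objects) for s in innermost_straws(sheaf)]


def hits_innermost_focus(sheaf: List[Straw]) -> List[FrozenSet[int]]:
    hits = [
        big_union(mo for mo in s.objects if mo.focus)
        for s in innermost_straws(sheaf)
    ]
    # Assumption: straws without any focus object yield no hit.
    return [h for h in hits if h]
```

Take one outermost straw holding a clause over monads 1 to 5, whose inner sheaf has two straws: words `{2}` (focus) and `{3}`, and a single word `{4}` (focus). Then “outermost” gives one hit covering monads 1 to 5, “focus” gives `{2}` and `{4}`, “innermost” gives `{2, 3}` and `{4}`, and “innermost_focus” gives `{2}` and `{4}`.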
The “innermost” and “innermost_focus” algorithms are especially well suited to making concordance-views (which I’ll hopefully blog about at some point).
The second change is to step #4, which calculates the raster monad ranges. The old way was to specify an object type (a “raster unit”) whose objects would determine the context range of monads. This was done with GET OBJECTS HAVING MONADS IN, using the big-union of all “hit” monad sets as the monad set, and the “raster unit” object type as the object type to GET. This method is still available.
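The effect of the raster unit method can be imitated in a few lines of Python. This is only a sketch: the real work is done by the database through GET OBJECTS HAVING MONADS IN, whereas here the raster unit objects are simply given as a list of monad sets. The function name is made up, and I assume each raster unit object is contiguous, so that its range runs from its first to its last monad.

```python
from typing import FrozenSet, List, Tuple


def raster_ranges_from_units(
    hits: List[FrozenSet[int]],
    raster_units: List[FrozenSet[int]],
) -> List[Tuple[int, int]]:
    # Big-union of all "hit" monad sets.
    all_hit_monads = frozenset().union(*hits)
    # Keep every raster unit object sharing a monad with the hits.
    # Assumption: objects are contiguous, so (first, last) suffices.
    return [
        (min(unit), max(unit))
        for unit in raster_units
        if unit & all_hit_monads
    ]
```

With sentences covering monads 1 to 5, 6 to 10 and 11 to 15 as raster units, and hits `{3}` and `{12, 13}`, the resulting ranges are `(1, 5)` and `(11, 15)`; the middle sentence contains no hit monads and is left out.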
The new way, however, specifies two context sizes, “raster_context_before” and “raster_context_after”: two independent, positive integers which determine the raster context ranges. The algorithm traverses the list of “hit” monad sets and, for each monad set, takes the first monad minus “raster_context_before” as the first monad of the range, and the last monad plus “raster_context_after” as the last monad of the range. Again, this is especially useful for concordance-type views.
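In Python, the calculation could look roughly like this. The function names are hypothetical, and two details are my own assumptions rather than something stated above: the range is clamped so it never starts before `min_monad`, and overlapping ranges are merged before taking their big-union.

```python
from typing import FrozenSet, List, Tuple


def context_ranges(
    hits: List[FrozenSet[int]],
    raster_context_before: int,
    raster_context_after: int,
    min_monad: int = 1,  # assumption: never go below this monad
) -> List[Tuple[int, int]]:
    ranges = []
    for hit in hits:
        if not hit:
            continue  # an empty hit has no first or last monad
        first = max(min_monad, min(hit) - raster_context_before)
        last = max(hit) + raster_context_after
        ranges.append((first, last))
    return ranges


def merge_ranges(ranges: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    # Merge overlapping or adjacent ranges into maximal stretches.
    merged: List[Tuple[int, int]] = []
    for first, last in sorted(ranges):
        if merged and first <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged
```

For hits `{10, 11, 15}` and `{12}` with a context of 3 before and 2 after, the ranges are `(7, 17)` and `(9, 14)`, which merge into the single stretch `(7, 17)`.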
This will appear in the next public release after 3.0.1.
As always, if anyone is interested in having a preview, please contact me.