I’ve finished the implementation, tuning, and testing of Full Text Search (FTS) for Emdros.
The implementation is part of the libharvest library, and is written in C++ like the rest of Emdros.
I implemented the basic idea in Python first, then reimplemented it in C++. Python is malleable enough to be ideal for this sort of prototyping work.
The Full Text Search has a lot of features, including:
- Index “documents”, which must exist as object types.
- Index documents based on “indexed object types” (e.g., token) and one indexed feature of the indexed object type.
- Search within “documents”.
- Chainable filters that modify token strings before they are indexed, e.g., to weed out stop-words, or to strip, lower-case, or otherwise alter the token strings.
- Tokenization of the query-string by splitting on spaces.
- Optional application of the chainable filters to the query-terms after tokenization, so as to be more likely to match the indexed feature.
- Google-like “quoted strings” that require the query-terms to be adjacent.
- More than one “quoted string” allowed in the query-string.
- Return results as a list of three-tuples (document-first-monad, document-last-monad, first-search-term-first-monad).
- Return results as customizable snippets of real tokens, with optional highlighting of query terms.
- Command-line tools for both indexing and searching.
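To make the feature list more concrete, here is a minimal Python sketch of the general idea. It is not the libharvest code or its API: every name in it (apply_filters, build_index, parse_query, phrase_starts, search) is hypothetical and exists only for illustration. It also makes a few assumptions of its own: tokens are given as (monad, string) pairs, documents as (first-monad, last-monad) pairs, "adjacent" means consecutive monads, and every term or quoted string in the query must occur within a document for that document to match.

```python
import re
from collections import defaultdict

# Hypothetical filters: each maps a token string to a new string,
# or to None to drop the token (as a stop-word filter would).
def strip_punct(tok):
    return tok.strip(".,;:!?\"'()") or None

def lowercase(tok):
    return tok.lower()

def apply_filters(tok, filters):
    """Run the token through the filter chain; None means 'dropped'."""
    for f in filters:
        tok = f(tok)
        if tok is None:
            return None
    return tok

def build_index(tokens, filters=()):
    """tokens: iterable of (monad, string). Returns term -> set of monads."""
    index = defaultdict(set)
    for monad, surface in tokens:
        term = apply_filters(surface, filters)
        if term is not None:
            index[term].add(monad)
    return index

def parse_query(query, filters=()):
    """Split on spaces; a "quoted string" becomes one multi-term phrase."""
    phrases = []
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', query):
        words = [bare] if bare else quoted.split()
        terms = [apply_filters(w, filters) for w in words]
        terms = [t for t in terms if t is not None]
        if terms:
            phrases.append(terms)
    return phrases

def phrase_starts(index, terms):
    """Monads at which the terms occur on consecutive monads."""
    starts = set(index.get(terms[0], ()))
    for offset, term in enumerate(terms[1:], start=1):
        starts &= {m - offset for m in index.get(term, ())}
    return starts

def search(index, documents, query, filters=()):
    """Return (doc-first-monad, doc-last-monad, first-term-first-monad)."""
    phrases = parse_query(query, filters)
    if not phrases:
        return []
    hits = [(len(p), phrase_starts(index, p)) for p in phrases]
    results = []
    for first, last in documents:
        # Keep only matches that lie entirely inside this document.
        in_doc = [sorted(m for m in starts
                         if first <= m and m + n - 1 <= last)
                  for n, starts in hits]
        if all(in_doc):
            results.append((first, last, in_doc[0][0]))
    return results

tokens = [(1, "The"), (2, "quick"), (3, "brown"), (4, "fox."),
          (5, "A"), (6, "brown"), (7, "quick"), (8, "dog")]
documents = [(1, 4), (5, 8)]
filters = [strip_punct, lowercase]
index = build_index(tokens, filters)

print(search(index, documents, '"Quick Brown" fox', filters))  # [(1, 4, 2)]
print(search(index, documents, 'brown quick', filters))  # [(1, 4, 3), (5, 8, 6)]
```

Applying the same filter chain to the query is what lets "Quick Brown" match the lower-cased, punctuation-stripped index entries. Note that a filter which drops tokens (such as a stop-word filter) would interact with the consecutive-monad notion of adjacency used here, so a real implementation has to decide how to handle that case.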
This will appear in the next public release of Emdros.
Interested parties should contact me via email to get the latest sources.