Harvesting revisited

April 16th, 2012

I’ve spent some time writing about how to harvest objects to produce documents. The result is some documentation of a yet-to-be-implemented “Render2” library. It is basically a description of some languages which are at once more powerful and simpler than the RenderObjects and RenderXML library languages.

Once implemented, the Render2 code will:

  • Be easier to use than RenderObjects and RenderXML
  • Be more powerful than RenderObjects and RenderXML
  • Be more easily extensible than RenderObjects and RenderXML

The idea is still the same as in RenderObjects and RenderXML:

  • “Stylesheets” tell the Render2 engine what to do when encountering an object in the database (when retrieving), or what to do with XML elements (when parsing XML).
  • These “Stylesheets” basically tell what to do at the start and/or end of an object or XML element.
  • The “Stylesheets” are ordered in a tree, with inheritance semantics between them.
  • “What to do” at the start/end of an object / XML element is expressed in a second language, called a “template language”. The template language is quite powerful (both for the old RenderObjects/RenderXML library and the new Render2 library), and has support for things like variables, lists, counters, etc.

What’s new in the Render2 library includes:

  • “RenderObjects2” stylesheets can inherit from other “RenderObjects” stylesheets. This is not just for RenderXML stylesheets any more.
  • The new template language is more regular, with fewer idiosyncrasies, and more expressive power. This expressive power comes in part from the new concept of “pockets” (see below).
  • The new template language introduces the idea of functions. A number of built-in functions will be provided. I am debating with myself whether to include a small scripting language in which the user can express functions themselves. We’ll see.
  • The new template language introduces the idea of expressions, which can be used in such places as “if” templates, and in parameters to function-calls.
  • The new stylesheet language (in which the template language is embedded) has a very, very simple grammar which fits in about 12 grammar-rules in Extended Backus-Naur Form. This alone should make it easier to use than the current JSON-embedded stylesheet language. The simple grammar makes it very, very easy to remember how to create a stylesheet, with very few “what you don’t know will hurt you” surprises.
  • The new stylesheet language introduces the idea of strings that are """triple-quoted""". This idea has been stolen from Python. The idea is to be able to use "single quotes" and newlines within """triple-"quote" strings""" without needing to escape them with backslashes. This should not only make the new stylesheets easier to use in practice (because of fewer backslashes); it should also make them more beautiful.
  • The new stylesheet language uses the idea of a “packet” to encompass all the different kinds of things you put into a stylesheet. Basically, a stylesheet unit is an ordered list of “packets”, where each packet has a packet name and a packet class (telling us how to use it), and a packet always belongs to exactly one stylesheet. Internally, a packet is no more, no less than an ordered list of key/value pairs. (This ordered list of key/value pairs may turn into a map/dictionary, but that is not part of the syntax, only part of the semantics.)
  • The old RenderObjects/RenderXML stylesheets had the disadvantage that it was sometimes difficult to see which stylesheet we were currently looking at, since the stylesheet name was only mentioned once, at the top of the stylesheet. The new stylesheet language repeats the stylesheet name for every “packet”, making it easier to orient oneself in the stylesheet unit file.
  • The C++ API to the Render2 library has been greatly simplified as compared to the RenderObjects/RenderXML library.
  • The Render2 library takes a Set of Monads, not a range of monads, when needing to retrieve objects. This generalization makes it much more powerful than the old RenderObjects/RenderXML library.
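
To make the “packet” idea a little more concrete, here is a rough sketch in Python. It is not Render2 code (Render2 is not implemented yet, and its real syntax is not shown in this post). The class name, the field names, the example packet contents, and the “last key wins” rule for turning pairs into a map are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Packet:
    """Hypothetical model of one packet (all names here are illustrative)."""
    stylesheet: str    # a packet always belongs to exactly one stylesheet
    name: str          # the packet name
    packet_class: str  # tells the engine how to use the packet
    pairs: List[Tuple[str, str]] = field(default_factory=list)  # ordered key/value pairs

    def as_dict(self) -> Dict[str, str]:
        """Semantic view: collapse the ordered pairs into a map.

        Assumption: if a key is repeated, the last value wins.
        """
        return dict(self.pairs)


# A stylesheet unit is an ordered list of packets. Note that the stylesheet
# name ("base") is repeated for every packet.
unit = [
    Packet("base", "token", "object_type", [
        # Triple quotes: the inner double quotes need no backslashes.
        ("start", """<w class="token">"""),
        ("end", "</w>"),
    ]),
    Packet("base", "verse", "object_type", [
        ("start", "<p>"),
        ("end", "</p>\n"),
    ]),
]


def packets_of(packets: List[Packet], stylesheet: str) -> List[Packet]:
    """Return the packets belonging to one stylesheet, in their original order."""
    return [p for p in packets if p.stylesheet == stylesheet]
```

Keeping the pairs as an ordered list and only converting them to a dictionary on demand mirrors the split described above between syntax (an ordered list) and semantics (possibly a map).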

The idea of “pockets” has been introduced. A “pocket” is a map/dictionary which maps strings to lists of strings. In addition, each pocket has a name which is a C identifier. The idea that one can redirect the output to a pocket, and that one can refer to the list of strings in a pocket by pocket-name coupled with pocket-key, has turned out to be quite powerful and general, supporting within one data-structure such diverse concepts as: variables, counters, integer-arithmetic, lists, and the “pockets” themselves, which can be used to output stuff “later” in the document than otherwise would have been the case.
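
As a minimal sketch of why one data structure can cover so many concepts, here is a toy pocket in Python. This is not the Render2 design: the class and method names are hypothetical, and the way counters are stored (as a one-element list holding a number in string form) is an assumption.

```python
from typing import Dict, List


class Pocket:
    """Hypothetical pocket: a named map from string keys to lists of strings."""

    def __init__(self, name: str) -> None:
        # str.isidentifier() is only a rough stand-in for "is a C identifier".
        if not name.isidentifier():
            raise ValueError(f"not a valid pocket name: {name!r}")
        self.name = name
        self._data: Dict[str, List[str]] = {}

    def emit(self, key: str, text: str) -> None:
        """Redirected output: append text to the list stored under key."""
        self._data.setdefault(key, []).append(text)

    def set(self, key: str, value: str) -> None:
        """Variable: replace whatever is under key with a single string."""
        self._data[key] = [value]

    def get(self, key: str) -> List[str]:
        """Return a copy of the list under key (empty if the key is unused)."""
        return list(self._data.get(key, []))

    def add(self, key: str, amount: int = 1) -> int:
        """Counter / integer arithmetic: the number is stored as a string."""
        current = self.get(key)
        total = (int(current[0]) if current else 0) + amount
        self.set(key, str(total))
        return total


# Collect footnotes while "rendering", then output them later in one go.
notes = Pocket("footnotes")
for text in ["First note.", "Second note."]:
    number = notes.add("counter")
    notes.emit("body", f"[{number}] {text}")

later_output = "\n".join(notes.get("body"))
```

Here the same pocket serves as a counter (the "counter" key), a list (the "body" key), and a place to park output until it is wanted later in the document.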

Interested parties are welcome to ask for the documentation. The documentation is still a work-in-progress, but implementation will hopefully start soon.

Ulrik

Emdros on Debian/Ubuntu/etc.

February 17th, 2012

I’ve successfully made the files requisite for building a .deb on Debian/Ubuntu/other-Debian-derived-Linux-distros.

Interested parties are welcome to contact me for the sources.

Ulrik

The Emdros blog is back

January 3rd, 2012

The Emdros blog is kindly hosted by the J. Alan Groves Center for Advanced Biblical Research. The Groves Center suffered a hardware outage in late 2011, bringing this blog down.

Thanks to the hard work of Dr. Kirk Lowery, the blog is now back. Thanks, Kirk!

More news coming. Stay tuned!

Ulrik

Emdros 3.3.0 released

July 4th, 2011

I have released Emdros version 3.3.0 over at SourceForge.Net.

http://emdros.org/download.html

Please note that the implementation and method of indexing of the Full Text Search are subject to change, as this feature is still experimental.

Enjoy!

Ulrik Sandborg-Petersen

Controlling containment in topographic MQL

February 14th, 2011

I have just finished adding a new feature to the topographic part of the MQL query language.

Hitherto, the only relation one could specify for containment between an inner object block and the outer container was “part_of”, and it was always relative to the containing substrate.

In plain English, that meant that the inner object’s monad set had to be a subset of the outer object’s monad set, or (if the inner block was at the outermost level) a subset of the monad set given in the IN clause after SELECT ALL OBJECTS.

Now, you can specify these four relations:

  • part_of(substrate) // The default
  • part_of(universe) // To disregard gaps in the substrate
  • overlap(substrate)
  • overlap(universe)

The overlap relation means that the inner object must have a non-empty intersection with the outer substrate or universe, i.e., share at least one monad with it.

This makes it possible to specify things like this:

SELECT ALL OBJECTS
IN Aramaic_monads // Pre-defined monad set
WHERE
// This means that we want all clauses which share at least one monad
// with the Aramaic_monads monad set
[Clause overlap(substrate)
   // This finds all phrases inside the left and right boundaries of
   // the outer clause, regardless of any gaps in the clause.
   [Phrase part_of(universe)
   ]
]
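
The four relations can be illustrated with plain Python sets of integers standing in for monad sets. This is only a sketch of the semantics, not Emdros code. It assumes that the “universe” of a monad set is the contiguous stretch from its first to its last monad with all gaps filled in, as the comments in the query above suggest. The function names are made up for the example.

```python
from typing import Set


def universe(monads: Set[int]) -> Set[int]:
    """Contiguous stretch from the first to the last monad, gaps filled in."""
    return set(range(min(monads), max(monads) + 1))


def _target(outer: Set[int], against: str) -> Set[int]:
    """Pick the set to compare against: the substrate itself or its universe."""
    if against == "substrate":
        return outer
    if against == "universe":
        return universe(outer)
    raise ValueError(f"unknown relation target: {against!r}")


def part_of(inner: Set[int], outer: Set[int], against: str = "substrate") -> bool:
    """The inner monad set must be a subset of the target."""
    return inner <= _target(outer, against)


def overlap(inner: Set[int], outer: Set[int], against: str = "substrate") -> bool:
    """The inner monad set must share at least one monad with the target."""
    return bool(inner & _target(outer, against))


clause = {1, 2, 3, 7, 8}     # a clause with a gap at monads 4..6
phrase_in_gap = {4, 5}       # lies entirely inside the gap
phrase_straddling = {3, 4}   # starts in the clause, ends in the gap

in_gap_substrate = part_of(phrase_in_gap, clause)             # False
in_gap_universe = part_of(phrase_in_gap, clause, "universe")  # True
straddling_overlap = overlap(phrase_straddling, clause)       # True
```

The phrase inside the gap is rejected by part_of(substrate) but accepted by part_of(universe), which is exactly the “disregard gaps” behaviour.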

This will appear in the next public release after 3.2.0.

If anyone is interested in trying this out, please let me know.

Ulrik

Full Text Search implemented in Emdros

October 30th, 2010

I’ve finished the implementation, tuning, and testing of Full Text Search (FTS) for Emdros.

The implementation is part of the libharvest library, and is written in C++ like the rest of Emdros.

I implemented the basic idea in Python first, then reimplemented it in C++. Python is so malleable that it is ideal for this sort of prototyping work.

The Full Text Search has a lot of features, including:

  • Index “documents”, which must exist as object types.
  • Index documents based on “indexed object types” (e.g., token) and one indexed feature of the indexed object type.
  • Search within “documents”.
  • Chainable filters that modify token strings before being indexed, e.g., to weed out stop-words, or to strip, lower-case, or otherwise alter the token strings.
  • Tokenization of the query-string by splitting on spaces.
  • Optional application of the chainable filters to the query-terms after tokenization, so as to be more likely to match the indexed feature.
  • Google-like “quoted strings” that require the query-terms to be adjacent.
  • More than one “quoted string” allowed in the query-string.
  • Return results as list of three-tuples (document-first-monad, document-last-monad, first-search-term-first-monad)
  • Return results as customizable snippets of real tokens, with optional highlighting of query terms.
  • Command-line tools for both indexing and searching.
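
A few of these features (chainable filters, query tokenization, and quoted strings) can be sketched in a handful of lines of Python. This is an illustration only, not the libharvest implementation or its API. All function names, the punctuation list, and the choice to match phrases against the filtered token stream are assumptions.

```python
import re
from typing import Callable, Iterable, List, Optional, Set

# A filter takes a token string and returns the altered string,
# or None to drop the token altogether.
Filter = Callable[[str], Optional[str]]


def lowercase(token: str) -> Optional[str]:
    return token.lower()


def strip_punctuation(token: str) -> Optional[str]:
    stripped = token.strip(".,;:!?\"'()")
    return stripped or None  # drop tokens that were pure punctuation


def stop_words(words: Set[str]) -> Filter:
    """Build a filter that weeds out the given stop-words."""
    def drop(token: str) -> Optional[str]:
        return None if token in words else token
    return drop


def apply_filters(tokens: Iterable[str], filters: List[Filter]) -> List[str]:
    """Run every token through the filter chain, in order."""
    result = []
    for token in tokens:
        current: Optional[str] = token
        for one_filter in filters:
            current = one_filter(current)
            if current is None:
                break
        if current is not None:
            result.append(current)
    return result


def parse_query(query: str) -> List[List[str]]:
    """Split on whitespace, keeping each "quoted string" together as one phrase."""
    phrases = []
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', query):
        if quoted:
            phrases.append(quoted.split())
        elif bare:
            phrases.append([bare])
    return phrases


def find_phrase(tokens: List[str], phrase: List[str]) -> List[int]:
    """Return the positions where the phrase terms occur adjacently."""
    if not phrase:
        return []
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


filters = [strip_punctuation, lowercase, stop_words({"the"})]
doc_tokens = apply_filters("In the beginning, God created the heavens".split(), filters)
query = [apply_filters(p, filters) for p in parse_query('"in the beginning" God')]
hits = [find_phrase(doc_tokens, phrase) for phrase in query]
```

Applying the same filter chain to both the document tokens and the query terms is what makes the query likely to match the indexed strings, even though the stop-word “the” has been dropped from both.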

This will appear in the next public release of Emdros.

Interested parties should contact me via email for getting the latest sources.

Enjoy!

Ulrik

Linguistic Tree Constructor 3.0.4 released (with an Easter egg)

September 18th, 2010

I’ve released Linguistic Tree Constructor (LTC) version 3.0.4 over at http://ltc.sourceforge.net …

The significance for this blog is that:

  1. LTC uses Emdros
  2. The latest release of LTC has the latest Emdros sources for 3.2.1.pre02 as an Easter egg inside.

Go grab the sources of LTC if you want to see what I’m up to for the next version of Emdros, then look at the ChangeLog.

Enjoy!

Ulrik Sandborg-Petersen

Bit Packed Table backend with encryption

August 4th, 2010

In March 2010 (3rd and 9th), I wrote on this blog about a new backend for Emdros under development, called the “Bit Packed Table” (BPT) backend. It is a high-performance, read-only database engine, based on “bit packed tables” and custom-tailored to the EMdF model. It outperforms even SQLite in terms of raw querying speed by about 30% on average.

I have recently made the BPT engine almost feature-complete, including adding an encryption layer. The encryption isn’t strong, but it does the job of keeping prying eyes out of your data.

I have added BPT to two of my Emdros-based software projects, using it exclusively for the backend for these projects, both of which deliver content to the user through a thin shell on top of Emdros. It works fine, and the speed increase over SQLite 3 is especially noticeable — pieces of content that used to take 1.5 seconds to load now leap onto the screen.

I said the BPT engine is almost feature-complete. The only thing missing, in fact, is support for stored monad sets. That is, monad sets that don’t have any object data associated with them, but which can be used for delimiting a query. I will add this feature in due course.

The BPT engine isn’t Open Source, and won’t be for the foreseeable future. If you are interested in licensing the engine, please drop me an email.

Enjoy!

Ulrik

Emdros 3.2.0 released

July 4th, 2010

I’ve released Emdros 3.2.0 over at SourceForge.net.

http://emdros.org/download.html

The release notes appear below.

Please let me know via the usual avenues whether anything is amiss.

Enjoy!

Ulrik

- *** Version 3.2.0 ***

As usual, binaries are available for Mac OS X, Windows(R), and Fedora
(13).

The Windows binaries have support for MySQL, SQLite 2, and SQLite 3.
They are built with Visual Studio Express 2010.

The Mac OS X binaries are Universal binaries running on Mac OS X 10.4
(Tiger), 10.5 (Leopard), and 10.6 (Snow Leopard).  They do not have
support for either MySQL or PostgreSQL; only SQLite 2 and SQLite 3 are
supported in the Mac OS X binaries.  You can compile the sources with
support for MySQL yourself, though, and possibly also PostgreSQL.

The Fedora binaries come with support for PostgreSQL, MySQL, SQLite 2,
and SQLite 3.

This release has the following changes over 3.1.1:

- A new backend was created, called the BPT engine.  It is
proprietary, and thus not Open Source, at the moment (sorry).
Interested licensees can contact me at ulrikp – at – emdros |dot|
org for questions about this new engine.

- SQLite3 was upgraded to version 3.6.17

- PCRE was upgraded to version 8.01. The license is still BSD.

- The TIGERXML importer is now more lenient towards the XML being
imported.

- The Emdros Query Tool now implements an XML_Output_Style.  See the
User’s Guide for the Emdros Query Tool for how to use it.  WARNING:
The output is still subject to change!

- The Emdros Query Tool (GUI version) can now create PNG files right
from the command line.  See the man page for eqtu.

- Assorted changes to the harvest library.  Note that the harvest
library is not stable yet; all APIs are subject to change as I
experiment with the best way of doing this important task.

- A topographic query can be stopped by setting the following bool to
false:

MQLExecEnv::m_bContinueExecution.

- Assorted changes to the horizontal tree and vertical tree layout
engines.

Enjoy!

Ulrik Sandborg-Petersen

Linguistic Tree Constructor — 25000 downloads passed

June 28th, 2010

One of my Open Source “successes”, Linguistic Tree Constructor, has passed 25000 downloads over at SourceForge.net.

Linguistic Tree Constructor (LTC) is a tool for building linguistic syntax trees in no time flat, using your mouse. Its main strength is quick annotation of large amounts of text, i.e., production of syntactic databases. It is based on Emdros for much of its implementation.

You can see the stats, or download for Mac OS X, Windows, and Linux, courtesy of the kind folk over at SourceForge.Net.

Enjoy!

Ulrik