Archive for the ‘MQL’ Category

Controlling containment in topographic MQL

Monday, February 14th, 2011

I have just finished adding a new feature to the topographic part of the MQL query language.

Hitherto, the only relation one could specify for containment between an inner object block and the outer container was “part_of”, and it was always relative to the containing substrate.

In plain English, that meant that the inner object’s monad set had to be a subset of the outer object’s monad set, or (if the inner block was at the outermost level), it must be a subset given in the IN clause after SELECT ALL OBJECTS.

Now, you can specify these four relations:

  • part_of(substrate) // The default
  • part_of(universe) // To disregard gaps in the substrate
  • overlap(substrate)
  • overlap(universe)

The overlap relation means: The inner object must have a non-empty intersection (i.e., share at least one monad with) the outer substrate or universe.

This makes it possible to specify things like this:

SELECT ALL OBJECTS
IN Aramaic_monads // Pre-defined monad set
WHERE
// This means that we want all clauses which share at least one monad
// with the Aramaic_monads monad set
[Clause overlap(substrate)
   // This finds all phrases inside the left and right boundaries of
   // the outer clause, regardless of any gaps in the clause.
   [Phrase part_of(universe)
   ]
]

This will appear in the next public release after 3.2.0.

If anyone is interested in trying this out, please let me know.

Ulrik

“Uncles and nephews” fix implemented!

Saturday, January 10th, 2009

I’ve implemented the fix to the “Uncles and nephews” bug, in C++ this time. It works beautifully. I have added around 15 regression tests to the suite of tests that are run by the mqltry program, and have tested the new solution with the usual suites of regression tests on some real-life databases. Not only does it give accurate results now, it also is very, very slightly faster as a result of the updates to the code. And it gives the same results as before, except for the queries where other results are expected.

I am so pleased :-) .

Ulrik

Object References — fixed!

Wednesday, January 7th, 2009

A lot has happened in Emdros-land since my last post, even though the last public release was on February 2008 — almost a year ago.  The code has been progressing, even though I am now a PhD, and an assistant professor, and a father.

Today’s blog post is not about my public silence, however, but about a design bug that exists in every version of Emdros since the first release.

Since the day I wrote the first line of what was to become Emdros, there has been a design error in the way object references are handled. This has meant that object references could not really be relied upon to provide correct results under certain circumstances, leading to misleading results.

There has never been a problem with this kind of object reference:

[Phrase as p1
   [Word parent = p1.self]
]

That is, there has never been a problem when the object block referenced was a parent of the referencing object block.

The problem, rather, has shown up in circumstances where the object block referenced was either a sibling, or a nephew, or an uncle, or a cousin, or a cousin twice removed… etc… of the referencing object block.

For example:

[Clause
   [Phrase as p1]
]
[Clause
   [Phrase cousin = p1.self] // This can lead to errors
]

I have worked very hard on this problem on and off for at least 3 years now.  I have finally found a solution.

Some time ago, I ported the topographic subset of Emdros to the Python programming language. It is not a port of the complete topographic subset, but it does allow me to experiment with new extensions of Emdros in the much more malleable Python language.

I have used this Python version of Emdros to experiment with this “uncles and nephews” problem.  The solution was conceived of in a three-hour session of writing up the solution, and the implementation in Python has taken around 12 hours so far.  I have tested the solution, and am now ready to port it over to C++.

Stay tuned for news about a release of Emdros having the fix (but don’t hold your breath ;-) ).

Ulrik

MQL takes a leap closer to QL

Friday, June 8th, 2007

One of my customers told me that the new Wrap block wasn’t what they really needed. The main point of complaint was that there was an implicit power block at the beginning of the innards of the wrap block. It wasn’t intuitive enough.

I looked at my implementation, and realized that, in order to fix the problem, I had to essentially rewrite large parts of the implementation of the topographic part of MQL.

So that’s what I did. I struck gold when I came up with a very simple solution:

  1. Remove the wrap block completely. If I ever reinstate it, it will most likely be done right :-) .
  2. Move the part that drives the computation forward in the monad stream from the block_string to the blocks construct.
  3. Implement the concept of “StartMonadIterator”, which iterates over the possible start monads. There are three kinds of StartMonadIterator: One iterating over an Inst, one iterating over a set of monads, and one iterating over the gaps in a set of monads.
  4. Do what Doedens prescribed w.r.t. block_strings: Split them up into a number of levels, such that: a) blocks are handled at the bottom level, at which level we also handle [grouping]. b) At the next level up, handle the first level, plus Kleene Star on the first level. c) At the third level, handle the second level, and let the third level also be the level at which concatenation is handled. d) At the fourth level, handle the third level, and also OR between strings of blocks.
  5. Handle power blocks at the same level as all other blocks, but with the usual checks: A power block … a) Cannot appear at the beginning or end of a context, b) Two power blocks cannot stand next to each other.
  6. Allow any kind of block after a power block (except a power block).

The bottom line is that MQL is much more powerful now, as a result of the following: a) We now have “real grouping” of strings of blocks; b) Kleene star can now apply to groups as well as to individual blocks. c) We can now have any kind of block after a power block, not just object blocks.
These three points may not seem to be “big”. But let me assure you that this is indeed one — nay two — quantum leaps upward for MQL in expressive power. That is, MQL is now much closer to what Doedens had envisaged, and what he described in his PhD thesis as the language “QL”.

I have added many regression tests to the regression test suite in order to test the new functionality, and the old regression tests all run without a hiccup. I have also run valgrind’s memory checker on the regression test program, and it comes up with 0 memory leaks. Finally, all of the test queries in my corpus of test queries against a “real” database come up with the same answers as before, except for the order in which straws from OR-separated block_strings appear.

So things are looking good.

Again, I still have no schedule for when these changes become public. If you want to try it out, please drop me a line.

Until then,

Enjoy!

Ulrik

On wrap blocks, marks, and other goodies

Friday, May 4th, 2007

A lot has happened to the Emdros sources since the latest public release, which was 1.2.0.pre242. Here is a shortlist of the most important ones:

  • You can now specify “marks” on an object block or (opt) gap block. This means that you can pass arbitrary identifiers back to the layer above Emdros for each object block or (opt) gap block, like so:[Clause`yellow [Phrase`red]]. The Emdros engine doesn’t interpret these “marks”; it just passes them along in the MatchedObject.This is an enhancement of the idea of “focus”.
  • A “[wrap]” block has been added (see below).
  • TIGER-XML import capabilities have been added. The implementation covers most of the features of the TIGER-XML spec, but things like copora in separate files have been left out for now.
  • An almost-finished “configuration wizard” for the Emdros Query Tool has been added. It doesn’t actually do any configuration yet, but the GUI is almost finished.
  • A lot of regression tests were added, bringing the total of queries run up to around 300.
  • Lots of small and not-so-small bugfixes.

As for the “wrap” block, it works like a grouping of block_strings. This means that you can specify, at any level in the query, that a certain string of blocks (possibly several strings of blocks with OR in between) must occur, while still being able to specify blocks to come before and/or after the wrap block. The wrap block behaves as if there is an implicit power block (..) before the first block inside the wrap block. This can be circumvented by specifying “first” on the first object block inside the wrap block.

The wrap block is useful mostly if you want to put ORs in between strings of blocks in the middle of a string of blocks. Otherwise, if you don’t use ORs, it degenerates to the case where the wrap block wasn’t there, with the caveat that there is an implicit power block at the start (unless you specify “first”).

It isn’t quite as simple as a “grouping of block_strings”, that is, you shouldn’t think of it as mere parentheses around an OR-separated string of block_strings. The reason is that it is a full “blocks” construct that is inside, and thus there is an implicit power block at the start,as mentioned.

The wrap block doesn’t, as in Crist-Jan Doedens’ book, result in an intermediate pow_m object. Instead, the wrap block computes the straws inside of it, as if the Substrate and Universe started at whatever monad we came to with the previous block (+ 1), and extended until the end of the surrounding Substrate. If there was no surrounding previous block, then the Universe and Substrate are identical to those of the context in which the wrap block is embedded. Once the straws arising from the “blocks” inside the wrap block have been computed, they are each of them used as a basis for computing straws based on whatever came both before and after the wrap block. Thus each straw arising from the wrap block is used at the level in which the wrap block is embedded, as if the wrap block was a single block resulting in straws each with a single matched object (but really there may be more than one matched object in each straw, since the wrap block can contain arbitrary block_strings).

For exampe:

[Clause [Word] [wrap [Phrase][Phrase] OR [Clause first]] [Word]]

This could be translated to this query:

[Clause [Word]..[Phrase][Phrase][Word] OR [Word][Clause][Word]]

This second query would yield identical results, except that the ordering of the straws in the sheaf would be different.

Note how the Clause inside the wrap block has been anchored at the start of the wrap block with the “first” keyword, and so, in the translated query, there is no power block (..) between the first [Word] block and the [Clause] block.

I have no schedule for when this is going to be available to the general public. If you really, really want to try it, drop me a note.

Until then,

Enjoy!

Ulrik

Regression-testing Emdros

Wednesday, January 17th, 2007

It suddenly dawned on my otherwise benighted cogitatation-device a few days ago, that a good answer to the problem of adding more regression tests to Emdros would be to have a file, in which MQL queries and their expected output were placed together.

Last night I designed a simple little file format which fits this need.  It is processed with a small Python script to a C++ header file containing an array of QueryAnswer objects. A QueryAnswer is an object that holds both a query and its expected answer, along with a few booleans such as whether a compiler error is expected, and whether to create a new database before this query starts.

I then proceeded to adding more than 135 regression tests. As a result, I caught five obscure bugs in Emdros. Most of the bugs are so obscure that it is unlikely that many would have run into these bugs.  Except for one of them, which involved a segfault on queries of the type GET OBJECTS HAVING MONADS IN [myObjectType GET ALL], the operative clause being “GET ALL”.

Most of the MQL engine is now tested with the regression test machinery, including most error messages. The machinery even checks that the correct/expected error message is emitted in case an error is expected.  Although I should say, the statement that “most” of the MQL engine is tested remains a conjecture.  Hence, I look forward to trying with gcov to see what the real percentage of coverage is.

Non-existence revisited

Tuesday, January 24th, 2006

A while ago, I blogged about NOTEXIST having a bad semantics. Essentially, the semantics of the NOTEXIST operator wreaked havoc on subsequent blocks because they were all negated as well. Not good.

Reading Doedens’ PhD thesis, I stumbled across a potential solution. Doedens has something called a “wrap block” which is essentially a block that can be used to match any monad set which is not necessarily known as an object or gap in the database. Doedens’ proposed semantics of the “wrap block” is hard to implement because he uses the existence of a set within the powerset of all monads of the Substrate as the basis for what to take as the substrate. As we all know, the powerset of a set M gives rise to 2^|M| subsets. That’s a little steep to search through… Hence I would propose a simpler semantics. With that in mind, some explanation…

In my version of the “wrap block”, the semantics is such that:

  1. The Universe of the wrap block is the Universe of the context, minus the monads from the beginning of the context’s Universe up to and including whatever was the previous block’s last monad. The Substrate is defined analogously on the context’s Substrate. If there is no previous block, the Universe and Substrate of the wrap block are identical to those of the context.
  2. There can be no power block right before a wrap block. (OK, that was syntax, not semantics). The reason is so as to keep the implementation simple — no more, no less; there is no theoretical reason why there could not be a power block before the wrap block.
  3. The result of a wrap block is one of three: Either a) a NIL_mo (a failed Matched Object), or b) an EMPTY_mo (an empty MatchedObject), or c) a ID_M_mo (a Matched Object with a set of monads and no id_d, like the Matched Objects arising from a gap block).
  4. The result is NIL_mo if and only if the inner sheaf is a failed sheaf. The result is EMPTY_mo if and only if the inner sheaf is a non-failed, empty sheaf. The result is an ID_M_mo if and only if the inner sheaf is a non-failed, non-empty sheaf. The set of monads M associated with the ID_M_mo is the big-union of all sets of monads within the inner sheaf.

With that in mind, let’s see what happens if you implicitly wrap any NOTEXIST block in a “wrap” block.

  1. The Universe and Substrate of the NOTEXIST block will be restricted to start at whatever the previous block’s last monad was, plus 1 monad, but they will continue to the end of the context. This means that the NOTEXIST block will be “looked for” within the entire Substrate, except everything that goes up to and including the previous block.
  2. If the NOTEXIST block matches anything, the inner sheaf of the wrap block will be a failed sheaf. Hence the wrap block will give rise to a NIL_mo, meaning it failed. The whole block_string will fail. This is as it should be, since we do not want there to be anything which matches the NOTEXIST block.
  3. If the NOTEXIST block does not match anything, the inner sheaf of the wrap block will be a non-failed, empty sheaf. Hence the wrap block will yield an EMPTY_mo. This means that any block following right after the (implicit) wrap block will be attempted to be matched starting at the previous block’s last monad plus 1, as if the NOTEXIST block had not been there. The NOTEXIST block will essentially be a “zero-width” block, or a “negative zero-width lookahead assertion”.

Here’s an example:


[Clause
[Phrase]
NOTEXIST [Word surface="glue"]
[Word surface="food"]
]

This would look for clauses within which there was a Phrase. Right after the end of this Phrase, there must be a word with surface=”food”. This is because the intervening NOTEXIST has “zero width”. Moreover, starting at the end of the Phrase plus 1 monad, we look for the word “glue” within the Substrate up until the end of the Clause. If the word “glue” does not exist within that substrate, the whole query matches for that Clause. If the word “glue” does exist within that substrate, the whole query fails for that Clause.

Let’s look at the semantics of this in terms of mathematics:


exists C in Clause restricted to Su:
exists P in Phrase restricted to C:
exists monad set M:
exists W2 in Word restricted to C:
P.last + 1 = W2.first and W2.surface="food"
and M=Su\{Su.first..P.last}
and for all W1 in Word restricted to M: W1.surface <> "glue"

The whole problem is side-stepped by postulating the existence of a monad set M, then defining that M on the basis of the previous block, then only thereafter postulating the “not exists” term (in the above example, this has become “for all”). That way, the “negation operator” becomes local in scope to the conjunctive term, not to the whole of the clause.

Another example, this time with two NOTEXISTS:


[Clause
NOTEXIST [Word surface="glue"]
NOTEXIST [Word surface="food"]
]

This would find all clauses within which neither the word “glue” nor the word “food” existed, in any order (OK, the last bit was a joke… there can be no ordering of things that do not exist…).

This would be, in FOL:


exists C in Clause restricted to Su:
exists monad set M1:
exists monad set M2:
M1 = Su and
M2 = Su and
for all W1 in Word restricted to M1: W1.surface <> "glue"
and
for all W2 in Word restricted to M2: W2.surface <> "food"

Again, the “not exist” becomes a conjunctive term, not a quantifier that has scope over the whole of the rest of the clause.

There is still a restriction that must remain in place: If you place an “AS X” object reference declaration on a NOTEXIST block, you can only use it within the inner blocks of the NOTEXIST block, and not anywhere else.

I will try to implement this soon. Not the wrap block in general at first, but merely the “invisible” wrap block around a NOTEXIST block.

Update: The first logic-example has been fixed. S–> C.

OR between strings of blocks

Tuesday, November 8th, 2005

The new implementation of the topographic part of MQL is proving its worth already. It is something that I can actually wrap my head around, in contrast to the old spaghetty-like, crufty code that had been beplagued with many a patchy patch.

I’ve changed the code in three important ways over the last couple of days:

  1. First, I’ve made it so that the current straw-in-the-making is always available on the stack. This is an important step towards fixing the bugs in the object reference code (see the SourceForge.Net bug-tracker for exactly what I mean).
  2. Second, I’ve added an OR construct between strings of blocks.
  3. Third, I’ve added some new constraints (which I think nobody was using anyway), just to make sure that the implementation does what it’s supposed to do.

The OR construct simply looks like this:


[Clause
[Phrase function=Objc]
[Phrase function=Cmpl]
OR
[Phrase function=Cmpl]
[Phrase function=Objc]
]

This finds all clauses inside of which there are two phrases, with one being an Object and the other being a Complement, in either order. The OR works between strings of blocks, not between individual blocks.

You can have as many OR-separated strings of blocks as you like.

As regards the new constraints, they can all be explained by the concept of “export barrier”, and they all work on object reference declarations and object reference usages. An “export barrier” works “upwards” in the MQL query parse tree, and says that you can’t use an object reference that has been declared “below” the export barrier. The three export barriers are: the Kleene Star, NOTEXIST, and the OR construct between object blocks.

For example, this is not allowed:


[Clause
[Phrase as p1]
OR
[Phrase function=p1.function]
]

Here, the OR acts as an “export barrier” disallowing the object reference usage of p1.

Neither can you say:


[Clause
[Phrase
[Word AS w1]
]* // OOPS! Kleene Star acts as export barrier.
[Word lexeme=w1.lexeme]
]

Nor can you say:

[Clause
NOTEXIST [Phrase
[Word AS w1]
] // OOPS! NOTEXIST acts as an export barrier!
]
[Clause
[Word lexeme=w1.lexeme]
]

But notice that the export barrier does not work “downwards”: This IS allowed:


[Clause
[Phrase as p1
[Phrase function = p1.function]
OR // This is NOT an export-barrier for p1!
[Phrase function <> p1.function AND phrase_type=NP]
]
]

This is allowed because p1 has been declared above the OR construct. If it were declared below the OR construct, the OR construct would have been an export barrier. The other way around, it isn’t.

The reason I’ve implemented the export barriers for OR is so as to ensure that when an object reference is used, it is always declared so as to be given a value. If we had allowed it to be declared below an OR and used above the OR, then it might not always be given a meaningful value (or any value at all).

The reason for the Kleene Star being an export barrier is that I wish to reduce the complexity of the upcoming fix of the object reference system. I might relax that restriction once I know exactly how to do it, but for now, I’d rather have the restriction in place. The problem is that inside a Kleene Star, we might get multiple values to put in the symbol table.

The reason for NOTEXIST being an export barrier is that we really do wish for something not to exist, so for us to refer to something inside something which does not exist is a bit self-contradictory…

These changes will appear in the first public release after 1.2.0.pre171.

What non-existence can lead to

Saturday, October 15th, 2005

One of my users wrote to me a couple of days ago, pointing out that MQL statements like the following:

SELECT ALL OBJECTS
WHERE
[sentence
NOTEXIST [Phrase]
[Clause FOCUS]
]
GO

would not return any FOCUS’ed clauses.

I looked into it, of course, and found some deeper issues, which I’d like to explain.

1) First, what does the query mean?

Well, let’s start by looking at what its counterpart without the non-existence would mean:

SELECT ALL OBJECTS
WHERE
[sentence
[Phrase]
[Clause FOCUS]
]
GO

It would mean something like the following (in First-order Predicate Logic):

exists S in Sentence restricted to Su:
exists P in Phrase restricted to S:
exists C in Clause restricted to S:
P.last + 1 = C.first

What I mean by “Sentence restricted to Su” and “Phrase restricted to S” is that we are after all those S and P which belong to the given object type, but which are selected from the subset that is wholly c0ntained in the set of monads being mentioned (Su and S).

This is the standard MQL interpretation, laid down by Doedens in his 1994 PhD thesis.

Now, let’s see what this means when you add the NOTEXIST. We start by adding a little, innocent “not” in front of the “exists P” clause:

exists S in Sentence restricted to Su:
not exists P in Phrase restricted to S:
exists C in Clause restricted to S:
P.last + 1 = C.first

Standard mathematics operations are then applied:

exists S in Sentence restricted to Su:
forall P in Phrase restricted to S:
not exists C in Clause restricted to S:
P.last + 1 = C.first

and again;

exists S in Sentence restricted to Su:
forall P in Phrase restricted to S:
forall C in Clause restricted to S:
not (P.last + 1 = C.first)

As you can see, what it really means is that there exists some S, inside of which all phrases and all clauses stand in a relationship where no phrase P ends at the monad before the first monad of any clause C.

(!)

That would be a pretty darn interesting analysis of a sentence. Overlapping boundaries would abound.

2) Second, what did I do about it.

Well, I decided to dictate that NOTEXIST could only be applied to one object block (that’s as it is now anyway), and that that object block must be the only block present in its blocks.

Not content with simply fixing that, however, I decided to be adventurous and redesign the implementation of the language from scratch.

That is now completed, and the end result is quite pleasing:

  1. NOTEXIST works correctly now, as per the above dictum.
  2. You can now have the Kleene Star on the first object block in a blocks, AND also on the first one after a power block “..”. This allows you to make something optional at the beginning of a blocks, which was what NOTEXIST was being used as in the above example anyway.
  3. You can now have a gap_block at the beginning of a blocks.
  4. It is only about 4% slower than the previous implementation, and I think I can tweak that…
  5. Not only do all regression tests pass beautifully, but some correctness issues that had been present for a long time are now resolved. In particular, NORETRIEVE now works as designed, and doesn’t cause incorrect results to be retrieved.
  6. The code is MUCH simpler, MUCH more maintainable, and much more beautiful than the old code, which had grown somewhat crufty.

This will make its appearance in the next public release after 1.2.0.pre160.