Tag: syntax

21st century study of religious texts

A pic­ture is worth a thou­sand words. So a con­crete exam­ple that you can not only see, but also play with, is worth ten thou­sand words.

The Quranic Ara­bic Cor­pus incor­po­rates much of my vision for the study of the Bible in the third mil­len­nium of our civ­i­liza­tion. For “under the hood” details, see the descrip­tion of the research of Kais Dukes, who is — of all things! — a VP of Mer­rill Lynch.

Three elements of this project are morphological annotation, a syntax treebank and a semantic ontology. All three are combined into a web user interface in such a way that collaboration is possible. The general public interested in the Quran itself can browse the original Arabic text, and dive into morphology, syntax and semantics as desired. Scholars can work on the actual analysis simply by logging in.

This model of lin­guis­tic anno­ta­tion of a cor­pus can eas­ily be extended to include bib­li­og­ra­phy, web resources, archae­o­log­i­cal and his­tor­i­cal data — the pos­si­bil­i­ties are endless.

One exten­sion ought to be the abil­ity to add user anno­ta­tion which is stored locally on the user/visitor’s own com­puter but which inte­grates seam­lessly with the website.

I noticed one feature that is lacking: the ability to run complex searches over the morphological, syntactic and semantic annotations. There is a search box for simple text queries, but a more sophisticated search engine would greatly enhance the value of this remarkable resource.

Unsupervised Part-Of-Speech Tagger

Another linguistic analysis tool has come to my attention: a “State-Of-The-Art Unsupervised Part-Of-Speech Tagger”.

In recent years computational linguistics has used the enormous volume of verbiage on the Internet to overcome the problems of analyzing natural language. With probabilities calculated from billions of sentences of a language, a program is “trained” to see patterns and to assign, from the context, the likeliest part of speech (noun, verb, adjective, etc.) to a word.

Clever and profound, yes. Complicated? Not really. This program consists of just 300 lines of Clojure code. (Clojure is a modern dialect of Lisp. It is “Lisp reloaded”, implemented on the Java Virtual Machine. It is a functional programming language, and it simplifies multi-threaded programming.)

Reading the follow-up blog post explaining the algorithm in detail, I found myself wondering about the applicability of a Hidden Markov Model for analyzing ancient texts. In particular I wonder about the limited number of observations that such texts usually offer. A probability model works best with a “large” set of observations. There are “only” 480,446 morphemes in 23,213 verses in the Hebrew Bible as represented by the Leningrad Codex.
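To make the Hidden Markov Model idea concrete, here is a minimal Python sketch of Viterbi decoding, the standard way to find the likeliest tag sequence under an HMM. It is not the Clojure tagger mentioned above: that program learns without supervision, whereas every probability below is a hand-set toy value invented for illustration, as are the tag names and the smoothing constant. With a small corpus, the real difficulty is estimating such tables reliably.

```python
import math

# Toy HMM. Every number here is an invented illustration,
# not an estimate from any corpus.
TAGS = ["ADJ", "NOUN", "VERB", "ADV"]

# P(first tag)
START = {"ADJ": 0.5, "NOUN": 0.3, "VERB": 0.1, "ADV": 0.1}

# P(next tag | previous tag)
TRANS = {
    "ADJ":  {"ADJ": 0.4,  "NOUN": 0.5,  "VERB": 0.05, "ADV": 0.05},
    "NOUN": {"ADJ": 0.05, "NOUN": 0.15, "VERB": 0.7,  "ADV": 0.1},
    "VERB": {"ADJ": 0.2,  "NOUN": 0.3,  "VERB": 0.1,  "ADV": 0.4},
    "ADV":  {"ADJ": 0.3,  "NOUN": 0.2,  "VERB": 0.4,  "ADV": 0.1},
}

# P(word | tag)
EMIT = {
    "ADJ":  {"colorless": 0.5, "green": 0.5},
    "NOUN": {"green": 0.2, "ideas": 0.6, "sleep": 0.2},
    "VERB": {"sleep": 1.0},
    "ADV":  {"furiously": 1.0},
}

SMOOTH = 1e-6  # probability assumed for unseen word/tag pairs


def viterbi(words):
    """Return the most probable tag sequence for a non-empty word list."""
    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{}]
    for tag in TAGS:
        p = START[tag] * EMIT[tag].get(words[0], SMOOTH)
        best[0][tag] = (math.log(p), None)

    for i in range(1, len(words)):
        best.append({})
        for tag in TAGS:
            emit = math.log(EMIT[tag].get(words[i], SMOOTH))
            score, prev = max(
                (best[i - 1][p][0] + math.log(TRANS[p][tag]) + emit, p)
                for p in TAGS
            )
            best[i][tag] = (score, prev)

    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))


print(viterbi("colorless green ideas sleep furiously".split()))
# ['ADJ', 'ADJ', 'NOUN', 'VERB', 'ADV']
```

Note that “green” and “sleep” are each allowed two tags in the toy tables; it is the context, via the transition probabilities, that settles the choice.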

Some would say such pro­grams are of lim­ited value for ancient texts, since man­ual analy­sis is finite and “rea­son­able” in cost. On the other hand, the pro­gram will more con­sis­tently tag the text, and regen­er­at­ing the entire data­base costs very little.

Com­ments?

Meaningful meaninglessness

Noam Chom­sky wrote:

1.  Col­or­less green ideas sleep furi­ously.
2.  Furi­ously sleep ideas green col­or­less.
It is fair to assume that nei­ther sen­tence (1) nor (2) (nor indeed any part of these sen­tences) has ever occurred in an Eng­lish dis­course. Hence, in any sta­tis­ti­cal model for gram­mat­i­cal­ness, these sen­tences will be ruled out on iden­ti­cal grounds as equally “remote” from Eng­lish. Yet (1), though non­sen­si­cal, is gram­mat­i­cal, while (2) is not grammatical.

Noam Chomsky, Syntactic Structures (1957), p. 15.

These sentences, famous among linguists at least, were constructed deliberately to convey no meaning by choosing an opposite of the previous word. Green is the logical opposite of colorless. Ideas are not animate and so do not sleep. Sleep is a passive action, and so the adverb furiously contradicts that passivity. The second sentence is produced by reversing the order of the words. Chomsky says sentence (2) is not grammatical and (1) is. That hardly makes better sense than the sentences themselves! Yet a native speaker of English “feels” the fact that (2) is more “wrong” than (1). A non-native speaker of English may be better able to say why this is so.

Chomsky’s point (one of them) is that there is a distinction in language between the correct relationships of sentence elements (parts of speech, abbreviated POS) and the referential meaning that the individual parts point to. Put another way, one can distinguish between syntax and semantics. The semantics of clauses is a further kind of meaning, in addition to lexical and referential meaning.

How is it that sentence (1) is syntactically permitted (grammatical) and sentence (2) is not? Syntax trees help us to see the answer to this question. Syntax is about the names, relationships and functions of the parts of speech in a clause. The classic method for representing syntactic structure is a syntax “tree”. Here is Chomsky’s sentence (1) in tree form:

image:cgisf-tgg.png
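For readers who prefer something they can run, the same kind of tree can be sketched in Python as nested tuples. The bracketing and the labels (S, NP, VP and so on) are my assumption of one conventional phrase-structure analysis; they may differ in detail from the figure.

```python
# A node is (label, child, child, ...); a leaf is (part_of_speech, word).
# The analysis below is an assumed, conventional bracketing of sentence (1).
TREE = (
    "S",
    ("NP",
        ("ADJ", "colorless"),
        ("ADJ", "green"),
        ("N", "ideas")),
    ("VP",
        ("V", "sleep"),
        ("ADV", "furiously")),
)


def is_leaf(node):
    """A leaf pairs a part of speech with a word (a plain string)."""
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node):
    """Yield (part_of_speech, word) pairs from left to right."""
    if is_leaf(node):
        yield node
    else:
        for child in node[1:]:
            yield from leaves(child)


def bracketed(node):
    """Render the tree in labeled-bracket notation."""
    if is_leaf(node):
        return f"[{node[0]} {node[1]}]"
    inner = " ".join(bracketed(child) for child in node[1:])
    return f"[{node[0]} {inner}]"


print(bracketed(TREE))
# [S [NP [ADJ colorless] [ADJ green] [N ideas]] [VP [V sleep] [ADV furiously]]]
print([pos for pos, _ in leaves(TREE)])
# ['ADJ', 'ADJ', 'N', 'V', 'ADV']
```

Reading the leaves from left to right gives the part-of-speech order ADJ ADJ N V ADV, which is the order that English accepts for this structure.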

When sen­tence (2) reverses the words, Furi­ously sleep ideas green col­or­less, the noun ideas now comes before the two adjec­tives green col­or­less. This word order is absolutely gram­mat­i­cal — for Hebrew! But for Eng­lish, such a word order is incorrect.

A matrix is another useful way of representing syntactic information, for example the Attribute Value Matrix (AVM) used by HPSG and similar constraint-based, unification-based linguistic theories.
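An AVM can be approximated in Python as a nested dictionary, and unification as a merge that fails when two atomic values clash. This is only a rough sketch: the feature names (CAT, AGR, NUM, PER, SUBJ) and the lexical entries are hypothetical, and it leaves out the typed features and structure sharing that a real HPSG grammar relies on.

```python
def unify(a, b):
    """Unify two feature structures.

    Returns the merged structure, or None if the two are incompatible.
    Atomic values unify only when they are equal.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_value in b.items():
            if key in result:
                merged = unify(result[key], b_value)
                if merged is None:
                    return None  # clash somewhere below this feature
                result[key] = merged
            else:
                result[key] = b_value
        return result
    return a if a == b else None


# Hypothetical lexical entries, invented for illustration.
ideas = {"CAT": "noun", "AGR": {"NUM": "pl", "PER": 3}}
sleep = {"CAT": "verb", "SUBJ": {"CAT": "noun", "AGR": {"NUM": "pl"}}}
sleeps = {"CAT": "verb", "SUBJ": {"CAT": "noun", "AGR": {"NUM": "sg", "PER": 3}}}

print(unify(sleep["SUBJ"], ideas))
# {'CAT': 'noun', 'AGR': {'NUM': 'pl', 'PER': 3}}
print(unify(sleeps["SUBJ"], ideas))
# None
```

The plural subject requirement of “sleep” unifies with “ideas”, while the singular requirement of “sleeps” does not. Grammaticality is expressed here as the success or failure of unification.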

Still another way to represent syntax is by using directed graphs. They are the most free-form of the various visual representations of syntax. Directed graphs have interesting mathematical properties that allow for computational generation and manipulation as well as representation in databases. You will read more about directed graphs in future posts.
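As a small foretaste, a directed graph for sentence (1) can be sketched in Python as a list of labeled edges from a head word to a dependent word. The relation labels (nsubj, amod, advmod) are my assumption, loosely in the style of dependency grammars, and the analysis is illustrative only. Each edge is a plain triple, so it would map directly onto one row of a database table.

```python
from collections import defaultdict

# Each edge is (head, relation, dependent). Labels are assumed for illustration.
EDGES = [
    ("sleep", "nsubj", "ideas"),
    ("ideas", "amod", "colorless"),
    ("ideas", "amod", "green"),
    ("sleep", "advmod", "furiously"),
]


def adjacency(edges):
    """Group outgoing edges by head word."""
    graph = defaultdict(list)
    for head, relation, dependent in edges:
        graph[head].append((relation, dependent))
    return graph


def roots(edges):
    """Words that govern others but are governed by none."""
    heads = {head for head, _, _ in edges}
    dependents = {dependent for _, _, dependent in edges}
    return sorted(heads - dependents)


def descendants(graph, node):
    """All words reachable from node by following edges (depth-first)."""
    seen = []
    stack = [node]
    while stack:
        current = stack.pop()
        for _, dependent in graph.get(current, []):
            if dependent not in seen:
                seen.append(dependent)
                stack.append(dependent)
    return seen


graph = adjacency(EDGES)
print(roots(EDGES))
# ['sleep']
print(sorted(descendants(graph, "ideas")))
# ['colorless', 'green']
```

Unlike a tree of nested brackets, nothing in this representation depends on word order, which is one reason graphs are convenient for languages with freer word order.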

The latter two methods are more current in computational linguistics and natural language processing. And that brings us to our next question: how is linguistic information, especially syntax information, best represented in a database?