
Unsupervised Part-Of-Speech Tagger

Another linguistic analysis tool has come to my attention: a “State-Of-The-Art Unsupervised Part-Of-Speech Tagger”.

In recent years, computational linguistics has used the enormous volume of verbiage on the Internet to overcome the problems of analyzing natural language. With probabilities estimated from billions of sentences in a language, a program is “trained” to see patterns and to assign each word its likeliest part of speech (noun, verb, adjective, etc.) from its context.

Clever and profound, yes. Complicated? Not really. This program consists of just 300 lines of Clojure code. (Clojure is a modern dialect of Lisp, “Lisp reloaded”, implemented on the Java Virtual Machine. It is a functional programming language, and it simplifies multi-threaded programming.)

Reading the follow-up blog post explaining the algorithm in detail, I found myself wondering about the applicability of a Hidden Markov Model for analyzing ancient texts. In particular, I wonder about the small number of observations such texts usually offer. A probability model works best with a “large” set of observations. There are “only” 480,446 morphemes in 23,213 verses in the Hebrew Bible as represented by the Leningrad Codex.
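To make the Hidden Markov Model idea concrete, here is a minimal sketch of how such a model picks the likeliest tag sequence for a sentence (Viterbi decoding). It is written in Python rather than Clojure and is not the code of the tagger discussed above. The tag set, the vocabulary, and every probability below are hand-set assumptions for illustration; an unsupervised tagger would have to estimate these numbers from untagged text, which is exactly where a small corpus becomes a concern.

```python
import math

# Toy model. All tags, words, and probabilities are illustrative assumptions.
TAGS = ["DET", "NOUN", "VERB"]

# Probability that a sentence starts with each tag.
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}

# TRANS[a][b]: probability that tag b follows tag a.
TRANS = {
    "DET":  {"DET": 0.05, "NOUN": 0.90, "VERB": 0.05},
    "NOUN": {"DET": 0.10, "NOUN": 0.30, "VERB": 0.60},
    "VERB": {"DET": 0.50, "NOUN": 0.40, "VERB": 0.10},
}

# EMIT[tag][word]: probability that the tag produces the word.
EMIT = {
    "DET":  {"the": 0.9, "a": 0.1},
    "NOUN": {"dogs": 0.4, "walk": 0.3, "park": 0.3},
    "VERB": {"walk": 0.6, "bark": 0.4},
}

UNSEEN = 1e-6  # small floor for word/tag pairs missing from the toy tables


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy model."""
    if not words:
        return []
    # best[i][tag] = (log-probability of the best path ending in tag at
    # position i, tag chosen at position i - 1 on that path)
    best = [{}]
    for tag in TAGS:
        score = math.log(START[tag]) + math.log(EMIT[tag].get(words[0], UNSEEN))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in TAGS:
            emit = math.log(EMIT[tag].get(words[i], UNSEEN))
            # Pick the previous tag that gives the highest-scoring path.
            score, prev = max(
                (best[i - 1][p][0] + math.log(TRANS[p][tag]) + emit, p)
                for p in TAGS
            )
            best[i][tag] = (score, prev)
    # Start from the best final tag and follow the back-pointers.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["the", "walk"]))          # "walk" right after a determiner
print(viterbi(["the", "dogs", "walk"]))  # "walk" right after a noun
```

In this toy model the ambiguous word “walk” comes out as a noun in the first sentence and as a verb in the second, purely because of the tag that precedes it. With few observations, many entries in tables like `TRANS` and `EMIT` would rest on a handful of occurrences or none at all, which is the worry raised above.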

Some would say such programs are of limited value for ancient texts, since manual analysis is finite and “reasonable” in cost. On the other hand, the program will tag the text more consistently, and regenerating the entire database costs very little.

Comments?