Another linguistic analysis tool has come to my attention: a “State-Of-The-Art Unsupervised Part-Of-Speech Tagger”.
In recent years, computational linguistics has used the enormous volume of verbiage on the Internet to overcome the problems of analyzing natural language. Using probabilities estimated from billions of sentences in a language, a program is “trained” to see patterns and, from the context, assign the likeliest part of speech (noun, verb, adjective, etc.) to each word.
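To make "assign the likeliest part of speech from the context" concrete, here is a minimal Python sketch of Viterbi decoding over a toy Hidden Markov Model. It is not the tagger's actual code: the function name `viterbi`, the three tags, and every probability below are hand-set assumptions for illustration only, whereas an unsupervised tagger would have to learn such numbers from untagged text.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for `words` under a toy HMM."""
    def lp(p):
        # Log-probability, with a small floor so unseen events are not impossible.
        return math.log(max(p, floor))

    # best[i][t] = (log-prob of the best path ending in tag t at word i, previous tag)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + lp(trans_p[p].get(t, 0)), p) for p in tags
            )
            row[t] = (score + lp(emit_p[t].get(w, 0)), prev)
        best.append(row)

    # Trace back from the best final tag to recover the whole path.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for row in reversed(best[1:]):
        tag = row[tag][1]
        path.append(tag)
    return list(reversed(path))

# Hypothetical, hand-set model (illustration only, not learned from data).
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5,  "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET":  {"the": 0.9},
    "NOUN": {"dog": 0.4, "walk": 0.2, "walks": 0.1},
    "VERB": {"walk": 0.3, "walks": 0.5},
}

print(viterbi(["the", "dog", "walks"], TAGS, START, TRANS, EMIT))
print(viterbi(["the", "walk"], TAGS, START, TRANS, EMIT))
```

In this toy model "walk" can be emitted by either NOUN or VERB. The preceding "the" is what tips the decision, which is the sense in which the context decides the tag.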
Clever and profound, yes. Complicated? Not really. This program consists of just 300 lines of Clojure code. (Clojure is a modern dialect of Lisp, “Lisp reloaded”, implemented on the Java Virtual Machine. It is a functional programming language that simplifies multi-threaded programming.)
Reading the follow-up blog post explaining the algorithm in detail, I found myself wondering about the applicability of a Hidden Markov Model for analyzing ancient texts. In particular, I wonder about the limited number of observations such texts usually offer. A probability model works best with a “large” set of observations, yet there are “only” 480,446 morphemes in 23,213 verses in the Hebrew Bible as represented by the Leningrad Codex.
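One rough way to put a number on that worry is to compare the size of the corpus with the number of probabilities a first-order Hidden Markov Model must estimate. The Python sketch below does that arithmetic. The morpheme and verse counts are the ones quoted above, but `NUM_TAGS` and `VOCAB_SIZE` are placeholder assumptions, not measured properties of the Hebrew Bible, and the helper name `hmm_free_parameters` is my own.

```python
def hmm_free_parameters(num_tags, vocab_size):
    """Count the free parameters of a first-order HMM.

    Each probability distribution over n outcomes has n - 1 free values:
    one start distribution over tags, one transition distribution per tag,
    and one emission distribution (over the vocabulary) per tag.
    """
    start = num_tags - 1
    transitions = num_tags * (num_tags - 1)
    emissions = num_tags * (vocab_size - 1)
    return start + transitions + emissions

MORPHEMES = 480_446   # figure quoted above for the Leningrad Codex
VERSES = 23_213       # figure quoted above

NUM_TAGS = 12         # placeholder assumption, not a claim about Hebrew
VOCAB_SIZE = 10_000   # placeholder assumption, not a measured vocabulary

params = hmm_free_parameters(NUM_TAGS, VOCAB_SIZE)
print(f"morphemes per verse:      {MORPHEMES / VERSES:.1f}")
print(f"free parameters:          {params}")
print(f"observations per parameter: {MORPHEMES / params:.2f}")
```

With different placeholder values the ratio changes a great deal, so the point is only the shape of the calculation: the emission table grows with tags times vocabulary, while the corpus size stays fixed.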
Some would say such programs are of limited value for ancient texts, since the corpus is finite and manual analysis is “reasonable” in cost. On the other hand, a program tags the text more consistently, and regenerating the entire database costs very little.
Comments?