When one works in an area — it doesn’t matter whether it’s in the humanities or in building construction — one begins to recognize patterns in how problems are solved. Typical solutions accrue as a body of knowledge and are passed on to new practitioners.
In computer science this has been happening for a decade or more. “Design patterns”, reusable software constructs that combine data structures with the algorithms to manipulate them efficiently and effectively, are becoming more and more widely known and well understood. For example, there is the “factory” pattern, which produces “widgets” defined by the programmer. This is a common task, so common that it has been solved many times, and the general principles of how to construct a factory can be described regardless of the software platform or environment (a minimal sketch follows).
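Here is a minimal sketch of the factory pattern in Python; the widget classes and names are hypothetical stand-ins, invented for illustration.

```python
# A toy "widget factory": callers ask for a widget by name and never
# need to know which concrete class is constructed behind the scenes.

class Button:
    def render(self):
        return "[ button ]"

class Checkbox:
    def render(self):
        return "[x] checkbox"

def widget_factory(kind):
    """Return a new widget instance for the requested kind."""
    widgets = {"button": Button, "checkbox": Checkbox}
    return widgets[kind]()

print(widget_factory("button").render())    # [ button ]
print(widget_factory("checkbox").render())  # [x] checkbox
```

The point is the separation: the code that uses a widget is decoupled from the code that decides which widget to build, and that principle holds in any language or environment.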
The idea of design patterns can be extended, and the folks at Endeca have done just that for user interfaces (UI): the Endeca User Interface Design Pattern Library. There is no reason to reinvent the wheel; this library deals with common tasks or problems in programming a UI, e.g., search, faceted navigation, and information discovery. There are other UI design pattern libraries out there, e.g., Patternry.
Why my interest in this? Because Patrick Durusau and I are experimenting with new ways of interacting with text, using the rabbinic Miqra’ot Gedolot (the Rabbinic Bible; kind of like a medieval Jewish “study Bible”) as a point of departure for design concepts. We are playing around with various ways of mapping rabbinic ideas of text study to modern UI concepts. Maybe we will come up with a design pattern library for the study of biblical and other ancient texts!
Steve DeRose pointed out to me this webpage by Bill Poser, a linguist who uses the computer in sophisticated ways. This page of resources is not about Computational Linguistics, which is a specific discipline. Rather, think “general computer resources”, or “how I can use the general computing power of my desktop to do linguistics”.
Besides the tools available to any sophisticated computer user, a linguist must also collect data and massage it into many different forms so that other tools can be used. Perhaps the most important tool category for the linguist is text manipulation. For me personally, the most powerful tool I ever discovered was regular expressions. “Regexes”, as they’re familiarly known, are descriptions of strings of characters, no matter how complex. They can be used in scripts and programs to recognize segments of text on input, which can then be manipulated into the desired output (a small example follows). The Poser webpage provides an excellent set of links to resources and tutorials.
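To make that concrete, here is a small example of regex-driven text massaging in Python; the sample lexicon line and the field layout are invented for illustration.

```python
import re

# A made-up dictionary entry of the sort a linguist might need to
# restructure: lemma (part-of-speech, stem): 'gloss'
line = "halak (verb, qal): 'to walk, go'"

# Describe the shape of the entry and capture each field.
m = re.match(r"(\w+) \((\w+), (\w+)\): '(.+)'", line)
if m:
    lemma, pos, stem, gloss = m.groups()
    # Re-emit the record in tab-separated form for the next tool.
    print("\t".join([lemma, pos, stem, gloss]))
    # halak	verb	qal	to walk, go
```

One regex recognizes the structure of the input; the script then rewrites it into whatever format the next tool in the pipeline expects. That recognize-and-reshape cycle is the daily bread of working with linguistic data.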
There are many other linguistic topics that are covered on this page. While surveying the entire website, I ran across an excellent list of “Recommended Reading” of books for the linguist who desires to leverage the computer for his or her work. I own or have read nearly all of these. Highly recommended.
For any researcher in the humanities, there is no excuse not to have mastered the subset of these resources appropriate to his or her subject of study. I have no patience or sympathy for scholars who master all kinds of arcana and yet object to learning how to use the computer properly because it is too “difficult”. It’s not too difficult. Nor does one need formal training. One only needs motivation.
An immodest postscript
I was pleasantly surprised to see listed on this page my 2008 review in the journal Language Documentation and Conservation of the database engine Emdros, a program optimized for annotated text.
Another linguistic analysis tool has come to my attention: a “State-Of-The-Art Unsupervised Part-Of-Speech Tagger”.
In recent years computational linguistics has used the enormous volume of verbiage on the Internet to overcome the problems of analyzing natural language. Using probabilities calculated from billions of sentences, a program is “trained” to see patterns and, from context, to assign the likeliest part of speech (noun, verb, adjective, etc.) to a word. The sketch below illustrates the idea in miniature.
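This is not the actual tagger’s algorithm; the real program is unsupervised, and this toy Python sketch uses a tiny hand-tagged corpus and a crude greedy score simply to make the probability machinery concrete. All of the data here is invented.

```python
from collections import Counter, defaultdict

# A toy tagged corpus; real systems learn from billions of sentences.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
]

emit = defaultdict(Counter)   # word -> counts of tags it carries
trans = defaultdict(Counter)  # tag  -> counts of the tag that follows
for sent in corpus:
    prev = "<S>"              # sentence-start marker
    for word, tag in sent:
        emit[word][tag] += 1
        trans[prev][tag] += 1
        prev = tag

def tag_sentence(words):
    """Greedily pick, for each word, the tag that scores highest on
    'how often this word carries the tag' times 'how often the tag
    follows the previous tag' (the context)."""
    prev, out = "<S>", []
    for w in words:
        candidates = emit.get(w, Counter({"NOUN": 1}))  # unknowns default to NOUN
        best = max(candidates, key=lambda t: candidates[t] * (1 + trans[prev][t]))
        out.append((w, best))
        prev = best
    return out

print(tag_sentence(["a", "cat", "runs"]))
# [('a', 'DET'), ('cat', 'NOUN'), ('runs', 'VERB')]
```

Scale the counts up by a few billion sentences and replace the greedy choice with proper probabilistic inference, and you have the essence of statistical tagging.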
Clever and profound, yes. Complicated? Not really. This program consists of just 300 lines of Clojure code. (Clojure is a modern dialect of Lisp, “Lisp reloaded”, implemented on the Java Virtual Machine. It is a functional programming language, and it simplifies multi-threaded programming.)
Reading the follow-up blog post explaining the algorithm in detail, I found myself wondering about the applicability of a Hidden Markov Model for analyzing ancient texts. In particular, I wonder about the typically limited number of observations such texts provide. A probability model works best with a “large” set of observations, and there are “only” 480,446 morphemes in 23,213 verses in the Hebrew Bible as represented by the Leningrad Codex.
Some would say such programs are of limited value for ancient texts, since manual analysis is finite and “reasonable” in cost. On the other hand, a program will tag the text more consistently, and regenerating the entire database costs very little.
The problem with computational linguistics is that it is — well — so arcane. There are plenty of books and web resources to teach the theory and principles. But what is often missing is a fully functional program that actually carries out the desired tasks. There are two resources that I have found, one thanks to Patrick Durusau.
The Natural Language Toolkit (NLTK) is implemented in Python, and is a set of libraries and programs illustrating all aspects of computational linguistics, including empirical linguistics (my primary interest), cognitive science, artificial intelligence, information retrieval, and machine learning. It was developed for use in the classroom and has a free, downloadable textbook describing the features of computational linguistics as implemented in the NLTK. (A small taste follows.)
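To give a small taste of the toolkit, here is a minimal tokenize-and-tag example. It assumes NLTK is installed (`pip install nltk`); the exact names of the downloadable data packages vary across NLTK versions, so the ones shown are only what current releases use.

```python
import nltk

# One-time downloads of the tokenizer and tagger models;
# package names may differ in other NLTK versions.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "In the beginning God created the heavens and the earth."
tokens = nltk.word_tokenize(text)       # split the text into word tokens
print(nltk.pos_tag(tokens))             # assign a part of speech to each token
# e.g. [('In', 'IN'), ('the', 'DT'), ('beginning', 'NN'), ...]
```

Two lines of working code from tokenization to part-of-speech tags is exactly what makes the NLTK such a good classroom companion to the theory.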
Another, similar toolkit is LingPipe, which is implemented in Java. I just discovered this one and have not spent any time with it. I confess it is not as attractive to me because I’m not a Java programmer, and the NLTK has nifty graphical interfaces as demonstrations of its tools. It would be useful in a future post to compare their features.