Tag: linguistics

Strong’s Numbers & the Problem of a Universal Index

With the advent of masses of digital resources integrated by Bible software user interfaces, a question has arisen: how shall all these resources be integrated? Typically, the unifying element has been the biblical text itself, especially the text in the original Hebrew, Aramaic or Greek.

Actually, the problem is not unique to the Information Age, but has been addressed since at least the mid-19th century. James Strong (1822–1894) compiled The Exhaustive Concordance of the Bible (1890). One of his goals was to align the English translation of the Bible with the original languages. He chose to do this at the word level, which fit one of his other objectives: an exhaustive concordance of the English text, namely the King James Version. It took him and more than 100 colleagues 35 years to complete this monumental task, and without computers! His number system recognizes 8,674 Hebrew and 5,523 Greek lemmas. Let me focus on the Hebrew side of things.

The lexical scholarship upon which Strong’s Hebrew dictionary depends is that of Wilhelm Gesenius. In 1833, Gesenius published a Latin work, Lexicon Manuale Hebraicum et Chaldaicum in Veteris Testamenti Libros. There were successive editions until the end of the century, when BDB (Brown, Driver and Briggs, A Hebrew and English Lexicon of the Old Testament, 1891–1905) became the new scholarly standard. Strong’s Concordance most certainly used Gesenius for the lexicography of the Hebrew Bible.

BDB reflected the new discoveries in the Middle East during the latter half of the 19th century and, importantly, the rise of new methods of the study of language: structuralism (Saussure), descriptive linguistics (Bloomfield et al.) and comparative linguistics, that is, using other Semitic languages to help puzzle out the meanings of Hebrew words and expressions. But in this heyday of archaeological discovery, even BDB was quickly superseded by discoveries (in 1929 and later) in the Levant, especially at Ras Shamra (ancient Ugarit) on the Syrian coast. There a huge repository of clay tablets was discovered, including tablets that use an alphabetic writing system to record a language closely related to Hebrew.

With such a wealth of new material, Hebrew lexicography changed dramatically, with new lemmas proposed and old lemmas dropped. Additional scholarship reassigned lemmas to specific occurrences in the Hebrew text. If Strong compiled his Hebrew and Greek dictionaries and associated list of lemmas today, it would be quite a different list, including the assignment of those lemmas to words in the text. And Strong might not have chosen the King James Version to concord.

Connecting digital resources

Strong’s Concordance indeed became a standard and is widely used even today, being reprinted regularly. Other lexicons, word books and study Bibles included Strong’s numbering, even when the text was no longer the King James Version. Hence, when software developers began to write Bible study software and the need arose for an index between the Hebrew text and various other resources, Strong’s numbers were a natural choice. There was consumer demand for a tool familiar from printed Bible study resources. Also, there was a practical economic concern on the part of developers: for many resources, the work was mostly done already. Further, if one needs to create a universal index, how does one choose? Strong’s numbers were already “universal” in some sense. Finally, there was the requirement that the index never change. If the index changes, the linking to other resources is broken and it costs time and money to fix it. The implication is that we are already certain that Strong’s 160-year-old lemmatization of Greek and especially Hebrew is complete, correct and need never change.

And that is how I got involved with this question. At the Groves Center we maintain a linguistic database known as the Westminster Hebrew Morphology (WHM), which, among other things, offers a lemma for each and every one of the approximately 480,000 morphemes found in the Hebrew Bible. The lemma assignment is based upon the latest scholarship that we have, but in the final analysis is a decision based upon our own judgment. We never consulted Gesenius’ Lexicon, never mind Strong’s system of lemmas. Given its age, it never occurred to us.

Imagine our dismay when we were asked why we didn’t have Strong’s numbers assigned to our lemmas. Such a mapping is impossible:

  • Some of Strong’s lemmas don’t exist in the WHM.
  • There are new lemmas in WHM; what Strong’s number should they have?
  • Some of Strong’s lemmas have been split into different meanings (homonyms, for example) in the WHM.
  • Many of the lemma assignments to individual words in Strong’s have changed in WHM.
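
The first three mismatches in the list above can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the identifiers (N1, lemma_a and so on), the two tables and the function name classify_mapping are invented, and none of them are real Strong’s numbers or real WHM lemmas. The fourth mismatch, reassignment of individual occurrences in the text, would need token-level data and is not modelled here.

```python
from collections import defaultdict

# Hypothetical toy data, invented for illustration only.
# These are NOT real Strong's numbers or real WHM lemmas.
OLD_INDEX = {"N1": "lemma_a", "N2": "lemma_b", "N3": "lemma_c"}

# Each lemma in the newer lemmatization points back to the older lemma
# it corresponds to, or to None if it has no older counterpart.
NEW_LEMMAS = {
    "lemma_a": "lemma_a",      # unchanged
    "lemma_b_I": "lemma_b",    # homonym split, first sense
    "lemma_b_II": "lemma_b",   # homonym split, second sense
    "lemma_d": None,           # newly proposed lemma
}


def classify_mapping(old_index, new_lemmas):
    """Report which old numbers map cleanly onto the new lemma list."""
    # Group the new lemmas under the old lemma they descend from.
    descendants = defaultdict(list)
    for new_lemma, old_lemma in new_lemmas.items():
        if old_lemma is not None:
            descendants[old_lemma].append(new_lemma)

    report = {"one_to_one": [], "split": [], "dropped": [], "unnumbered": []}
    for number, old_lemma in old_index.items():
        targets = descendants.get(old_lemma, [])
        if len(targets) == 1:
            report["one_to_one"].append(number)
        elif len(targets) > 1:
            report["split"].append(number)    # one number, several lemmas
        else:
            report["dropped"].append(number)  # old lemma no longer exists
    # New lemmas with no old counterpart have no number to inherit.
    report["unnumbered"] = sorted(
        new for new, old in new_lemmas.items() if old is None
    )
    return report


if __name__ == "__main__":
    print(classify_mapping(OLD_INDEX, NEW_LEMMAS))
```

Only the entries reported as one_to_one could carry an old number over unchanged; every other category requires an editorial decision that the old numbering cannot supply.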

These differences make using Strong’s numbers as a universal index for integrating digital resources problematic and just plain wrong. The demands of the consumer and pragmatic and economic concerns must be resisted; else, we are perpetually stuck in the mid-19th century of biblical scholarship.

A fresh look at resource integration

Let’s step away from the question of the suitability of Strong’s numbers for resource integration, and look at the issue of integration afresh. Strong’s numbering of Hebrew and Greek lemmas is only one possible solution.

Two possible (and practical!) solutions come immediately to mind: search engines and topic maps. These technologies were responses to the need to integrate resources that are dynamically changing and semantically diverse. With a search engine, one is not limited to lemmas but can index any arbitrary string. Topic maps allow for more than one way to identify a subject. They handle ontologies (as understood by computer scientists; philosophical ontology is something else) quite well. Consider that competing lemmatizations such as Strong’s and WHM are competing ontologies for the vocabulary of the Bible in the original languages. In this scenario, it doesn’t have to be “either Strong’s or WHM”, but “both-and.” Then we allow the user to decide what is most valuable or correct. This solution to the problem is better because both the number and the internal content of the resources one integrates can change freely as desired.
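
The “both-and” idea can be sketched in a few lines. This is only an illustration of one subject carrying several identifiers from different schemes; it is not the Topic Maps standard or any existing library’s API. The class SubjectIndex, its methods, the scheme names and the identifiers are all hypothetical.

```python
class SubjectIndex:
    """Toy index in which one subject may have many identifiers."""

    def __init__(self):
        self._subject_of = {}  # (scheme, identifier) -> subject key
        self._resources = {}   # subject key -> list of attached resources

    def add_subject(self, key, identifiers):
        """Register a subject under every (scheme, identifier) pair given."""
        self._resources.setdefault(key, [])
        for scheme, identifier in identifiers:
            self._subject_of[(scheme, identifier)] = key

    def attach(self, key, resource):
        """Link a resource to the subject itself, not to any one scheme."""
        self._resources[key].append(resource)

    def lookup(self, scheme, identifier):
        """Return the resources for whatever subject this identifier names."""
        key = self._subject_of.get((scheme, identifier))
        return [] if key is None else list(self._resources[key])


if __name__ == "__main__":
    index = SubjectIndex()
    # Hypothetical identifiers from two competing lemmatizations.
    index.add_subject("subject-1", [("scheme_a", "A100"), ("scheme_b", "lemma_x")])
    index.attach("subject-1", "lexicon entry")
    index.attach("subject-1", "commentary note")
    print(index.lookup("scheme_a", "A100"))
    print(index.lookup("scheme_b", "lemma_x"))
```

Because resources hang off the subject rather than off a single numbering, either scheme can be revised or extended without breaking the links made through the other. A real system would also have to represent splits and merges between schemes, which this sketch ignores.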

I am encouraged to see signs that Bible software is gradually evolving in this direction. As I see it, as the number of digital resources increases, the above two solutions become ever more compelling.

Tools for linguistic research

Steve DeRose pointed out to me this webpage by Bill Poser, a linguist who uses the computer in sophisticated ways. This page of resources is not about Computational Linguistics, which is a specific discipline. Rather, think “general computer resources”, or “how I can use the general computing power of my desktop to do linguistics”.

Besides the tools available to any sophisticated user of the computer, a linguist must also collect data and massage it into many different forms so that other tools can be used. Perhaps the most important tool category for the linguist is text manipulation. For me personally, the most powerful tool I ever discovered was regular expressions. “Regexes”, as they’re familiarly known, are descriptions of strings of characters, no matter how complex. These can then be used in scripts and programs to recognize segments of input text, which can then be manipulated to produce the desired output. The Poser webpage provides an excellent set of links to resources and tutorials.
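
As a small illustration of recognizing segments of text and reshaping them, here is a sketch using Python’s standard re module. The pattern, the function name split_suffixes and the English sample sentence are my own assumptions, chosen only because they are easy to read; the approach is deliberately naive and would over-match words such as “red” or “king”.

```python
import re

# A word boundary, a lazily matched stem, one of two suffixes, and a
# closing word boundary. The two groups capture stem and suffix.
SUFFIX_PATTERN = re.compile(r"\b(\w+?)(ing|ed)\b")


def split_suffixes(text):
    """Return (stem, suffix) pairs for words ending in -ing or -ed."""
    return SUFFIX_PATTERN.findall(text)


if __name__ == "__main__":
    sample = "She walked home singing and then talked about reading."
    for stem, suffix in split_suffixes(sample):
        print(f"{stem} + {suffix}")
```

The same few lines, with a different pattern, serve equally well for pulling glosses out of a field notebook or normalizing transcription conventions.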

There are many other linguistic topics that are covered on this page. While surveying the entire website, I ran across an excellent list of “Recommended Reading” of books for the linguist who desires to leverage the computer for his or her work. I own or have read nearly all of these. Highly recommended.

For any researcher in the humanities, there is no excuse not to have mastered the subset of these resources appropriate to his or her subject of study. I have no patience or sympathy for scholars who master all kinds of arcana and yet object to learning how to use the computer properly because it is too “difficult”. It’s not too difficult. Nor does one need formal training. One only needs motivation.

An immodest postscript

I was pleasantly surprised to see listed on this page my 2008 review in the journal Language Documentation and Conservation of the database engine Emdros, a program optimized for annotated text.

21st century study of religious texts

A picture is worth a thousand words. So a concrete example that you can not only see, but also play with, is worth ten thousand words.

The Quranic Arabic Corpus incorporates much of my vision for the study of the Bible in the third millennium of our civilization. For “under the hood” details, see the description of the research of Kais Dukes, who is (of all things!) a VP of Merrill Lynch.

Three elements of this project are morphological annotation, a syntax treebank and a semantic ontology. All three are combined into a web user interface in such a way that collaboration is possible. Members of the general public interested in the Quran itself can browse the original Arabic text, and dive into morphology, syntax and semantics as desired. Scholars can work on the actual analysis simply by logging in.

This model of linguistic annotation of a corpus can easily be extended to include bibliography, web resources, archaeological and historical data. The possibilities are endless.

One extension ought to be the ability to add user annotation which is stored locally on the user/visitor’s own computer but which integrates seamlessly with the website.

I noticed one feature that is lacking: the ability for complex searching, using the morphology, syntax and semantic annotations. There is a search box for simple text queries, but a more sophisticated search engine would greatly enhance the value of this remarkable resource.
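
To show what I mean by complex searching, here is a minimal sketch of a query that combines several annotation layers at once. The Token fields, the sample tokens and the search function are all assumptions invented for illustration; they do not reflect the Quranic Arabic Corpus’s actual data model or any feature it offers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated word; every field name here is hypothetical."""
    position: int
    form: str
    pos: str      # morphological layer: part of speech
    role: str     # syntactic layer: function in the clause
    concept: str  # semantic layer: ontology category


def search(tokens, **criteria):
    """Return the tokens whose annotations match every given criterion."""
    return [
        token
        for token in tokens
        if all(getattr(token, name) == value for name, value in criteria.items())
    ]


if __name__ == "__main__":
    corpus = [
        Token(1, "scholars", "noun", "subject", "person"),
        Token(2, "annotate", "verb", "predicate", "action"),
        Token(3, "texts", "noun", "object", "artifact"),
        Token(4, "carefully", "adverb", "modifier", "manner"),
    ]
    # Combine a morphological and a syntactic constraint in one query.
    for token in search(corpus, pos="noun", role="object"):
        print(token.position, token.form)
```

A plain text search box cannot express a query like “nouns functioning as objects”; a search over the annotation layers can.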

Visualizing linguistic data

As data gets more and more complex, human minds have greater trouble perceiving patterns. One can’t see the forest for the trees, so to speak. (Ahem.) What to do? One answer is graphic visualization of the data. The natural sciences have been pursuing this solution since the 1990s. In recent years the methods developed are spilling over into many other domains.

I just discovered (again, thanks to Patrick Durusau) a researcher who specializes in data visualization, Tamara Munzner (University of British Columbia). She has written or collaborated in a number of software tools to visualize various kinds of data. Three in particular are relevant to my interests (well, they’re all interesting!):

  • Smashing Peacocks Further (SPF): a tool for exploring quasi-trees, that is, directed graphs with a tree-like structure. (Don’t ask me for an explanation of the name!) We’ve mentioned directed graphs before.
  • SequenceJuxtaposer: a tool for comparing DNA/RNA sequences. This is a potentially useful tool for linguistics, because text and spoken languages are linear sequences, just like genome sequences.
  • TreeJuxtaposer: a tool for comparing data that is hierarchically organized. Data that is tree-like in structure includes phylogenetic trees, taxonomic trees, consensus trees and dendrograms from cluster hierarchies. Since language is organized hierarchically, trees are commonly used by linguists, and I just happen to have a treebank of the Hebrew Bible, this tool is of immediate interest to me.

I offer below screenshots of each tool, just to whet your appetite:

[Screenshot: Smashing Peacocks Further (SPF)]

[Screenshot: SequenceJuxtaposer]

[Screenshot: TreeJuxtaposer]