April 07, 2016

Mini-musings

Editing authorities at the speed of four records per minute

This missive outlines and documents an automated process I used to “clean up” and “improve” a set of authority records, or, to put it another way, how I edited authorities at the speed of four records per minute.

As you may or may not know, starting in September 2015, I commenced upon a sort of “leave of absence” from my employer.† This leave took me to Tuscany, Venice, Rome, Provence, Chicago, Philadelphia, Boston, New York City, and back to Rome. In Rome I worked for the American Academy in Rome doing short-term projects in the library. The first project revolved around authority records. More specifically, the library’s primary clientele were Americans, but the catalog’s authority records included a smattering of Italian headings. The goal of the project was to automatically convert as many of the “invalid” Italian headings as possible into “authoritative” Library of Congress headings.

Identify “invalid” headings

When I first got to Rome I had the good fortune to hang out with Terry Reese, the author of the venerable MarcEdit.‡ He was there giving workshops. I participated in the workshops. I listened, I learned, and I was grateful for a Macintosh-based version of Terry’s application.

When the workshops were over and Terry had gone home I began working more closely with Sebastian Hierl, the director of the Academy’s library.❧ Since the library was relatively small (about 150,000 volumes), and because the Academy used Koha for its integrated library system, it was relatively easy for Sebastian to give me the library’s entire set of 124,000 authority records in MARC format. I fed the authority records into MarcEdit, and ran a report against them. Specifically, I asked MarcEdit to identify the “invalid” records, which really means, “Find all the records not found in the Library of Congress database.” The result was a set of approximately 18,000 records or approximately 14% of the entire file. I then used MarcEdit to extract the “invalid” records from the complete set, and this became my working data.

Search & download

I next created a rudimentary table denoting the “invalid” records and the subsequent search results for them. This tab-delimited file included values of MARC field 001, MARC field 1xx, an integer denoting the number of times I searched for a matching record, an integer denoting the number of records I found, an identifier denoting a Library of Congress authority record of choice, and a URL providing access to the remote authority record. This table was initialized using a script called authority2list.pl. Given a file of MARC records, it outputs the table.
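
To make the shape of that table concrete, here is a minimal sketch of how authority2list.pl might initialize it using Perl and the MARC::Record/MARC::Batch modules; the list of 1xx tags and the column layout are assumptions based on the description above, not the original script itself:

# authority2list.pl (sketch) - given a file of MARC authority records,
# output a tab-delimited table of identifiers, headings, and empty
# columns for searches, hits, chosen LC identifier, and URL
use strict;
use warnings;
use MARC::Batch;

my $file  = shift or die "Usage: $0 <authorities.mrc>\n";
my $batch = MARC::Batch->new( 'USMARC', $file );

while ( my $record = $batch->next ) {

    # get the local identifier (MARC 001)
    my $id = $record->field( '001' ) ? $record->field( '001' )->data : '';

    # get the first 1xx heading, whatever its tag happens to be
    my ( $heading ) = grep { defined } map { $record->field( $_ ) } qw( 100 110 111 130 150 151 );
    my $label = $heading ? $heading->as_string : '';

    # output identifier, heading, number of searches, number of hits,
    # chosen LC identifier, and URL (the last four get filled in later)
    print join( "\t", $id, $label, 0, 0, '', '' ), "\n";

}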

I then systematically searched the Library of Congress for authority headings. This was done with a script called search.pl. Given the table created in the previous step, this script looped through each authority, did a rudimentary search for a valid entry, and output an updated version of the table. This script was a bit “tricky”.❦ It first searched the Library of Congress by looking for the value of MARC 1xx$a. If no records were found, then no updating was done and processing continued. If one record was found, then the Library of Congress identifier was saved to the output and processing continued. If many records were found, then a more limiting search was done by adding a date value extracted from MARC 1xx$d. Depending on the second search result, the output was updated (or not), and processing continued. Out of the original 18,000 “invalid” records, about 50% of them matched no (zero) Library of Congress records, about 30% were associated with multiple headings, and the remaining 20% (approximately 3,600 records) were identified with one and only one Library of Congress authority record.
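
The kernel of that zero/one/many logic might be sketched like this. The script wrapped curl and queried the Atom interface of the LC Linked Data Service (see the note below); the URL, query parameters, regular expression, and sample heading here are illustrative assumptions, not necessarily the ones actually used:

# search.pl (sketch) - search the LC Linked Data Service for a heading
# and apply the zero/one/many logic described above
use strict;
use warnings;
use URI::Escape;

sub search {
    my $query = shift;
    my $url   = 'http://id.loc.gov/search/?format=atom&q=' . uri_escape( $query );
    my $atom  = `curl --silent '$url'`;
    # crudely harvest the identifiers found in the Atom feed
    return ( $atom =~ m{<id>(http://id\.loc\.gov/authorities/[^<]+)</id>}g );
}

# illustrative values of MARC 1xx$a and 1xx$d
my ( $heading, $dates ) = ( 'Dante Alighieri', '1265-1321' );

my @hits = search( $heading );
if    ( @hits == 0 ) { print "not found\n" }
elsif ( @hits == 1 ) { print "found: $hits[0]\n" }
else {
    # too many hits; limit the search by adding the date value
    my @narrowed = search( "$heading $dates" );
    print @narrowed == 1 ? "found: $narrowed[0]\n" : "still ambiguous\n";
}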

I now had a list of 3,600 “valid” authority records, and I needed to download them. This was done with a script called harvest.pl. This script was really a wrapper around a program called GNU Wget. Given my updated table, the script looped through each row, and if it contained a URL pointing to a Library of Congress authority record, then the record was cached to the file system. Since the downloaded records were formatted as MARCXML, I then needed to transform them into MARC21. This was done with a pair of scripts: xml2marc.sh and xml2marc.pl. The former simply looped through each file in a directory, and the latter did the actual transformation, but along the way updated MARC 001 to the value of the local authority record.
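
The transformation can be illustrated with a short sketch built on the MARC::File::XML module. Given a MARCXML file and a local identifier, it writes a MARC21 record whose 001 has been reset; the command-line interface and module choices are assumptions, offered only as a sketch:

# xml2marc.pl (sketch) - transform a harvested MARCXML record into
# MARC21, updating the 001 with the local authority identifier
use strict;
use warnings;
use MARC::Record;
use MARC::Field;
use MARC::File::XML ( BinaryEncoding => 'utf8' );

my ( $xmlfile, $localid ) = @ARGV;
die "Usage: $0 <record.xml> <local 001>\n" unless $xmlfile and $localid;

# slurp the MARCXML
open my $in, '<:encoding(UTF-8)', $xmlfile or die $!;
my $xml = do { local $/; <$in> };
close $in;

# parse the record, and then overwrite the 001 with the local identifier
my $record = MARC::Record->new_from_xml( $xml, 'UTF-8' );
if ( my $field = $record->field( '001' ) ) { $field->update( $localid ) }
else { $record->insert_fields_ordered( MARC::Field->new( '001', $localid ) ) }

# output MARC21 to STDOUT
binmode STDOUT, ':utf8';
print $record->as_usmarc;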

Verify and merge

In order to allow myself as well as others to verify that correct records had been identified, I wrote another pair of programs: marc2compare.pl and compare2html.pl. Given two MARC files, marc2compare.pl created a list of identifiers, original authority values, proposed authority values, and URLs pointing to full descriptions of each. This list was intended to be poured into a spreadsheet for compare & contrast purposes. The second script, compare2html.pl, simply took the output of the first and transformed it into a simple HTML page making it easier for a librarian to evaluate correctness.
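
A sketch of the first script might look like the following. It pairs the two files by their 001 values and outputs one tab-delimited row per proposed change; the URL column described above is omitted for brevity, and the choice of 1xx tags is an assumption:

# marc2compare.pl (sketch) - given the original and the newly harvested
# authority files, output identifier, old heading, and proposed heading
use strict;
use warnings;
use MARC::Batch;

sub heading {
    my $record = shift;
    my ( $field ) = grep { defined } map { $record->field( $_ ) } qw( 100 110 111 130 150 151 );
    return $field ? $field->as_string : '';
}

my ( $old_file, $new_file ) = @ARGV;

# index the proposed (Library of Congress) records by their 001 values
my %proposed;
my $batch = MARC::Batch->new( 'USMARC', $new_file );
while ( my $record = $batch->next ) { $proposed{ $record->field( '001' )->data } = heading( $record ) }

# loop through the originals and output a row for each proposed change
$batch = MARC::Batch->new( 'USMARC', $old_file );
while ( my $record = $batch->next ) {
    my $id = $record->field( '001' )->data;
    next unless exists $proposed{ $id };
    print join( "\t", $id, heading( $record ), $proposed{ $id } ), "\n";
}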

Assuming the 3,600 records were correct, the next step was to merge/overlay the old records with the new records. This was a two-step process. The first step was accomplished with a script called rename.pl. Given two MARC files, rename.pl first looped through the set of new authorities saving each identifier to memory. It then looped through the original set of authorities looking for records to update. When records to update were found, each was marked for deletion by prefixing MARC 001 with “x-”. The second step employed MarcEdit to actually merge the set of new authorities with the original authorities. Consequently, the authority file increased in size by 3,600 records. It was then up to other people to load the authorities into Koha, re-evaluate the authorities for correctness, and if everything was okay, then delete each authority record prefixed with “x-”.
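
The marking-for-deletion logic is simple enough to sketch in a few lines of Perl; again, the file handling is only illustrative and not the original script:

# rename.pl (sketch) - mark original authority records for deletion by
# prefixing their 001 values with "x-" whenever a replacement exists
use strict;
use warnings;
use MARC::Batch;

my ( $original_file, $new_file ) = @ARGV;

# remember the identifiers of the newly harvested records
my %replacements;
my $batch = MARC::Batch->new( 'USMARC', $new_file );
while ( my $record = $batch->next ) { $replacements{ $record->field( '001' )->data }++ }

# loop through the originals; prefix the 001 of each record to be replaced
$batch = MARC::Batch->new( 'USMARC', $original_file );
binmode STDOUT, ':utf8';
while ( my $record = $batch->next ) {
    my $field = $record->field( '001' );
    my $id    = $field->data;
    $field->update( "x-$id" ) if exists $replacements{ $id };
    print $record->as_usmarc;
}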

Done.❀

Summary and possible next steps

In summary, this is how things happened. I:

  1. got a complete dump of the original 123,329 authority records
  2. extracted 17,593 “invalid” records
  3. searched LOC for “valid” records and found 3,627 of them
  4. harvested the found records
  5. prefixed the 3,627 001 fields in the original file with “x-“
  6. merged the original authority records with the harvested records
  7. made the new set of 126,956 updated records available

There were many possible next steps. One possibility was to repeat the entire process but with an enhanced search algorithm. This could be difficult considering the fact that searches using merely the value of 1xx$a returned zero hits for half of the working data. A second possibility was to identify authoritative records from a different system such as VIAF or WorldCat. Even if this was successful, I wondered how possible it would have been to actually download authority records as MARC. A third possibility was to write a sort of disambiguation program allowing librarians to choose from a set of records. This could have been accomplished by searching for authorities, presenting possibilities, allowing librarians to make selections via an HTML form, caching the selections, and finally, batch updating the master authority list. Here at the Academy we dubbed the last possibility the “cool” one.

Now here’s an interesting way to look at the whole thing. This process took me about two weeks’ worth of work, and in that two weeks I processed 18,000 authority records. That comes out to 9,000 records/week. There are 40 hours in a work week, and consequently, I processed 225 records/hour. Each hour is made up of 60 minutes, and therefore I processed approximately 4 records/minute, and that is 1 record every fifteen seconds for the last two weeks. Wow!?

Finally, I’d like to thank the Academy (with all puns intended). Sebastian, his colleagues, and especially my office mate (Kristine Iara) were all very supportive throughout my visit. They provided intellectual stimulation and something to do while I contemplated my navel during the “adventure”.

Notes

† Strictly speaking, my adventure was neither a sabbatical nor a leave of absence because: 1) as a librarian I was not authorized to take a sabbatical, and 2) I did not have any healthcare issues. Instead, after bits of negotiation, my contract was temporarily changed from full-time faculty to adjunct faculty, and I worked for my employer 20% of the time. The other 80% of my time was spent on my “adventure”. And please don’t get me wrong, this whole thing was a wonderful opportunity for which I will be eternally grateful. “Thank you!”

‡ During our overlapping times there in Rome, Terry & I played tourist, which included visits to the Colosseum, a happenstance mass at the Pantheon, a Palm Sunday Mass in St. Peter’s Square with tickets generously given to us by Joy Nelson of ByWater Solutions, and a day-trip to Florence. Along the way we discussed librarianship, open source software, academia, and life in general. A good time was had by all.

❧ Ironically, Sebastian & I were colleagues during the dot-com boom when we both worked at North Carolina State University. The world of librarianship is small.

❦ This script — search.pl — was really a wrapper around an application called curl, and thanks go to Jeff Young of OCLC who pointed me to the ATOM interface of the LC Linked Data Service. Without Jeff’s helpful advice, I would have wrestled with OCLC’s various authentication systems and Web Service interfaces.

❀ Actually, I skipped a step in this narrative. Specifically, there were some records in the authority file that were not expected to be touched, even if they were “invalid”. This set of records was associated with a specific call number pattern. Two scripts (fu-extract.pl and fu-remove.pl) did the work. The first extracted a list of identifiers not to touch and the second removed them from my table of candidates to validate.

by Eric Lease Morgan at April 07, 2016 07:52 AM

March 22, 2016

Mini-musings

Failure to communicate

In my humble opinion, what we have here is a failure to communicate.

Libraries, especially larger libraries, are increasingly made up of many different departments, including but not limited to cataloging, public services, collections, preservation, archives, and nowadays departments of computer staff. From my point of view, these various departments fail to see the similarities between themselves, and instead focus on their differences. This focus on the differences is amplified by the use of dissimilar vocabularies and subdiscipline-specific jargon. This use of dissimilar vocabularies causes a communications gap that, left unresolved, ultimately creates animosity between groups. I believe this is especially true between the more traditional library departments and the computer staff. This communications gap is an impediment when it comes to achieving the goals of librarianship, and any library — whether it be big or small — needs to address these issues lest it waste both its time and money.

Here are a few examples outlining failures to communicate:

  • MARC – MARC is a data structure. The first 24 characters are called the leader. The second section is called the directory, and the third section is intended to contain bibliographic data. The whole thing is sprinkled with ASCII characters 29, 30, and 31 denoting the ends of the record, fields, and subfields, respectively. MARC does not denote the kinds of data it contains. Yet, many catalogers say they know MARC. Instead, what they really know are sets of rules defining what goes into the first and third sections of the data structure. These rules are known as AACR2/RDA. Computer staff see MARC (and MARCXML) as a data structure. Librarians see MARC as the description of an item akin to a catalog card.
  • Databases & indexes – Databases & indexes are two sides of the same information retrieval coin. “True” databases are usually relational in nature and normalized accordingly. “False” databases are flat files — simple tables akin to Excel spreadsheets. Librarians excel (no puns intended) at organizing information, and this usually manifests itself through the creation of various lists. Lists of books. Lists of journals. Lists of articles. Lists of authoritative names. Lists of websites. Etc. In today’s world, the most scalable way to maintain lists is through the use of a database, yet most librarians wouldn’t be able to draw an entity relationship diagram — the literal illustration of a database’s structure — to save their lives. With advances in computer technology, the problem of find is no longer solved through the searching of databases but instead through the creation of an index. In reality, modern indexes are nothing more than enhancements of traditional back-of-the-book indexes — lists of words and associated pointers to where those words can be found in a corpus. Computer staff see databases as MySQL and indexes as Solr. Librarians see databases as a matrix of rows & columns, and the searching of databases in a light of licensed content such as JSTOR, Academic Search Primer, or New York Times.
  • Collections – Collections, from the point of view of a librarian, are sets of curated items with a common theme. Taken as a whole, these collections embody a set of knowledge or a historical record intended for use by students & researchers for the purposes of learning & scholarship. The physical arrangement of the collection — especially in archives — as well as the intellectual arrangement of the collection are significant because they bring together like items or represent the development of an idea. This is why libraries have classification schemes and archives physically arrange their materials in the way they do. Unfortunately, computer staff usually do not understand the concept of “curation” and usually see the arrangements of books — classification numbers — as rather arbitrary.
  • Services – Many librarians see the library profession as being all about service. These services range from literacy programs to story hours. They range from the answering of reference questions to the circulation of books. They include social justice causes, stress relievers during exam times, and free access to computers with Internet connections. Services are important because they provide the means for an informed public, teaching & learning, and the improvement of society in general. Many of these concepts are not in the forefront of the minds of computer staff. Instead, their idea of service is making sure the email system works, that people can log into their computers, that computer hardware & software are maintained, and that connections to the Internet are continual.

As a whole, what the profession does not understand is that everybody working in a library has more things in common than differences. Everybody is (supposed to be) working towards the same set of goals. Everybody plays a part in achieving those goals, and it behooves everybody to learn & respect the roles of everybody else. A goal is to curate collections. This is done through physical, intellectual, and virtual arrangement, but it also requires the use of computer technology. Collection managers need to understand more of the computer technology, and the technologist needs to understand more about curation. The application of AACR2/RDA is an attempt to manifest inventory and the dissemination of knowledge. The use of databases & indexes also manifests inventory and the dissemination of knowledge. Catalogers and database administrators ought to communicate on similar levels. Similarly, there is much more to preservation of materials than putting bits on tape. “Yikes!”

What is the solution to these problems? In my opinion, there are many possibilities, but the solution ultimately rests with individuals willing to take the time to learn from their co-workers. It rests in the ability to respect — not merely tolerate — another point of view. It requires time, listening, discussion, reflection, and repetition. It requires getting to know other people on a personal level. It requires learning what others like and dislike. It requires comparing & contrasting points of view. It demands “walking a mile in the other person’s shoes”, and can be accomplished by things such as the physical intermingling of departments, cross-training, and simply by going to coffee on a regular basis.

Again, all of us working in libraries have more similarities than differences. Learn to appreciate the similarities, and the differences will become insignificant. The consequence will be a more holistic set of library collections and services.

by Eric Lease Morgan at March 22, 2016 10:35 AM

March 06, 2016

Mini-musings

Using BIBFRAME for bibliographic description

Bibliographic description is an essential process of librarianship. In the distant past this process took the form of simple inventories. In the last century we saw bibliographic description evolve from the catalog card to the MARC record. With the advent of globally networked computers and the hypertext transfer protocol, we are seeing the emergence of a new form of description called BIBFRAME which is based on the principles of RDF (Resource Description Framework). This essay describes, illustrates, and demonstrates how BIBFRAME can be used to fulfill the promise and purpose of bibliographic description.†

Librarianship as collections & services

Libraries are about a number of things. Some of those things surround the collection and preservation of materials, most commonly books. Some of those things surround services, most commonly the lending of books.†† But it is asserted here that collections are not really about books nor any other physical medium because those things are merely the manifestation of the real things of libraries: data, information, and knowledge. It is left to another essay as to the degree libraries are about wisdom. Similarly, the primary services of libraries are not really about the lending of materials, but instead the services surround learning and intellectual growth. Librarians cannot say they have lent somebody a book and conclude they have done their job. No, more generally, libraries provide services enabling the reader to use & understand the content of acquired materials. In short, it is asserted that libraries are about the collection, organization, preservation, dissemination, and sometimes evaluation of data, information, knowledge, and sometimes wisdom.

With the advent of the Internet the above definition of librarianship is even more plausible since the materials of libraries can now be digitized, duplicated (almost) exactly, and distributed without diminishing access to the whole. There is no need to limit the collection to physical items, provide access to the materials through surrogates, nor lend the materials. Because these limitations have been (mostly) removed, it is necessary for libraries to think differently about their collections and services. To the author’s mind, librarianship has not shifted fast enough nor far enough. As a long-standing and venerable profession, and as an institution complete with its own governance, diversity, and sheer size, librarianship changes & evolves very slowly. The evolution of bibliographic description is a perfect example.

Bibliographic description: an informal history

Bibliographic description happens in between the collections and services of libraries, and the nature of bibliographic description has evolved with technology. Think of the oldest libraries. Think clay tablets and papyrus scrolls. Think of the size of library collections. If a library’s collection was larger than a few hundred items, then the library was considered large. Still, the collections were so small that an inventory was relatively easy for sets of people (librarians) to keep in mind.

Think medieval scriptoriums and the development of the codex. Consider the time, skill, and labor required to duplicate an item from the collection. Consequently, books were very expensive but now had a much longer shelf life. (All puns are intended.) This increased the size of collections, but remembering everything in a collection was becoming more and more difficult. This, coupled with the desire to share the inventory with the outside world, created the demand for written inventories. Initially, these inventories were merely accession lists — a list of things owned by a library and organized by the date they were acquired.

With the advent of the printing press, even more books were available but at a much lower cost. Thus, the size of library collections grew. As it grew it became necessary to organize materials not necessarily by their acquisition date nor physical characteristics but rather by various intellectual qualities — their subject matter and usefulness. This required the librarian to literally articulate and manifest things of quality, and thus the profession began to formalize the process of analytics as well as supplement its inventory lists with this new (which is not really new) information.

Consider some of the things beginning in the 18th and 19th centuries: the idea of the “commons”, the idea of the informed public, the idea of the “free” library, and the size of library collections numbering tens of thousands of books. These things eventually paved the way in the 20th century to open stacks and the card catalog — the most recent incarnation of the inventory list written in its own library short-hand and complete with its ever-evolving controlled vocabulary and authority lists — becoming available to the general public. Computers eventually happen and so does the MARC record. Thus, the process of bibliographic description (cataloging) literally becomes codified. The result is library jargon solidified in an obscure data structure. Moreover, in an attempt to make the surrogates of library collections more meaningful, the information of bibliographic description bloats to fill much more than the traditional three to five catalog cards of the past. With the advent of the Internet comes less of a need for centralized authorities. Self-service and convenience become the norm. When was the last time you used a travel agent to book airfare or reserve a hotel room?

Librarianship is now suffering from a great amount of reader dissatisfaction. True, most people believe libraries are “good things”, but most people also find libraries difficult to use and not meeting their expectations. People search the Internet (Google) for items of interest, and then use library catalogs to search for known items. There is then a strong desire to actually get the item, if it is found. After all, “Everything is on the ‘Net”. Right? To this author’s mind, the solution is two-fold: 1) digitize everything and put the result on the Web, and 2) employ a newer type of bibliographic description, namely RDF. The former is something for another time. The latter is elaborated upon below.

Resource Description Framework

Resource Description Framework (RDF) is essentially relational database technology for the Internet. It is comprised of three parts: keys, relationships, and values. In the case of RDF and akin to relational databases, keys are unique identifiers and usually in the form of URIs (now called “IRIs” — Internationalized Resource Identifiers — but think “URL”). Relationships take the form of ontologies or vocabularies used to describe things. These ontologies are very loosely analogous to the fields in a relational database table, and there are ontologies for many different sets of things, including the things of a library. Finally, the values of RDF can also be URIs but are ultimately distilled down to textual and numeric information.

RDF is a conceptual model — a sort of cosmology for the universe of knowledge. RDF is made real through the use of “triples”, a simple “sentence” with three distinct parts: 1) a subject, 2) a predicate, and 3) an object. Each of these three parts corresponds to the keys, relationships, and values outlined above. To extend the analogy of the sentence further, think of subjects and objects as if they were nouns, and think of predicates as if they were verbs. And here is a very important distinction between RDF and relational databases. In relational databases there is the idea of a “record” where an identifier is associated with a set of values. Think of a book that is denoted by a key, and the key points to a set of values for titles, authors, publishers, dates, notes, subjects, and added entries. In RDF there is no such thing as the record. Instead there are only sets of literally interlinked assertions — the triples.

Triples (sometimes called “statements”) are often illustrated as arced graphs where subjects and objects are nodes and predicates are lines connecting the nodes:

[ subject ] --- predicate ---> [ object ]

The “linking” in RDF statements happens when sets of triples share common URIs. By doing so, the subjects of statements end up having many characteristics, and the objects of URIs point to other subjects in other RDF statements. This linking process transforms independent sets of RDF statements into a literal web of interconnections, and this is where the Semantic Web gets its name. For example, below is a simple web of interconnecting triples:

              / --- a predicate ---------> [ an object ]
[ subject ] - | --- another predicate ---> [ another object ]
              \ --- a third predicate ---> [ a third object ]
                                                   |
                                                   |
                                          yet another predicate
                                                   |
                                                   |
                                                  \ /

                                         [ yet another object ]

An example is in order. Suppose there is a thing called Rome, and it will be represented with the following URI: http://example.org/rome. We can now begin to describe Rome using triples:

subjects                 predicates         objects
-----------------------  -----------------  -------------------------
http://example.org/rome  has name           "Rome"
http://example.org/rome  has founding date  "1000 BC"
http://example.org/rome  has description    "A long long time ago,..."
http://example.org/rome  is a type of       http://example.org/city
http://example.org/rome  is a sub-part of   http://example.org/italy

The corresponding arced graph would look like this:

                               / --- has name ------------> [ "Rome" ]
                              |  --- has description -----> [ "A long time ago..." ]
[ http://example.org/rome ] - |  --- has founding date ---> [ "1000 BC" ]
                              |  --- is a sub-part of  ---> [ http://example.org/italy ]
                               \ --- is a type of --------> [ http://example.org/city ]

In turn, the URI http://example.org/italy might have a number of relationships asserted against it also:

subjects                  predicates         objects
------------------------  -----------------  -------------------------
http://example.org/italy  has name           "Italy"
http://example.org/italy  has founding date  "1923 AD"
http://example.org/italy  is a type of       http://example.org/country
http://example.org/italy  is a sub-part of   http://example.org/europe

Now suppose there were things called Paris, London, and New York. They can be represented in RDF as well:

subjects                    predicates          objects
--------------------------  -----------------   -------------------------
http://example.org/paris    has name            "Paris"
http://example.org/paris    has founding date   "100 BC"
http://example.org/paris    has description     "You see, there's this tower..."
http://example.org/paris    is a type of        http://example.org/city
http://example.org/paris    is a sub-part of    http://example.org/france
http://example.org/london   has name            "London"
http://example.org/london   has description     "They drink warm beer here."
http://example.org/london   has founding date   "100 BC"
http://example.org/london   is a type of        http://example.org/city
http://example.org/london   is a sub-part of    http://example.org/england
http://example.org/newyork  has founding date   "1640 AD"
http://example.org/newyork  has name            "New York"
http://example.org/newyork  has description     "It is a place that never sleeps."
http://example.org/newyork  is a type of        http://example.org/city
http://example.org/newyork  is a sub-part of    http://example.org/unitedstates

Furthermore, each of the “countries” can have relationships denoted against it:

subjects                         predicates         objects
-------------------------------  -----------------  -------------------------
http://example.org/unitedstates  has name           "United States"
http://example.org/unitedstates  has founding date  "1776 AD"
http://example.org/unitedstates  is a type of       http://example.org/country
http://example.org/unitedstates  is a sub-part of   http://example.org/northamerica
http://example.org/england       has name           "England"
http://example.org/england       has founding date  "1066 AD"
http://example.org/england       is a type of       http://example.org/country
http://example.org/england       is a sub-part of   http://example.org/europe
http://example.org/france        has name           "France"
http://example.org/france        has founding date  "900 AD"
http://example.org/france        is a type of       http://example.org/country
http://example.org/france        is a sub-part of   http://example.org/europe

The resulting arced graph of all these triples might look like this:

[IMAGINE A COOL LOOKING ARCED GRAPH HERE.]

From this graph, new information can be inferred as long as one is able to trace connections from one node to another node through one or more arcs. For example, using the arced graph above, questions such as the following can be asked and answered:

  • What things are denoted as types of cities, and what are their names?

  • What is the oldest city?

  • What cities were founded after the year 1 AD?

  • What countries are sub-parts of Europe?

  • How would you describe Rome?

In summary, RDF is a data model — a method for organizing discrete facts into a coherent information system, and to this author, this sounds a whole lot like a generalized form of bibliographic description and a purpose of library catalogs. The model is built on the idea of triples whose parts are URIs or literals. Through the liberal reuse of URIs in and between sets of triples, questions surrounding the information can be answered and new information can be inferred. RDF is the what of the Semantic Web. Everything else (ontologies & vocabularies, URIs, RDF “serializations” like RDF/XML, triple stores, SPARQL, etc.) are the how’s. None of them will make any sense unless the reader understands that RDF is about establishing relationships between data for the purposes of sharing information and increasing the “sphere of knowledge”.

Linked data

Linked data is RDF manifested. It is a process of codifying triples and systematically making them available on the Web. It first involves selecting, creating (“minting”), and maintaining sets of URIs denoting the things to be described. When it comes to libraries, there are many places where authoritative URIs can be obtained, including OCLC’s WorldCat, the Library of Congress’s Linked Data Service, Wikipedia, institutional repositories, and even licensed indexes/databases.

Second, manifesting RDF as linked data involves selecting, creating, and maintaining one or more ontologies used to posit relationships. Like URIs, there are many existing bibliographic ontologies for the many different types of cultural heritage institutions: libraries, archives, and museums. Example ontologies include but are by no means limited to: BIBFRAME, bib.schema.org, the work of the (aged) LOCAH project, EAC-CPF, and CIDOC CRM.

The third step to implementing RDF as linked data is to actually create and maintain sets of triples. This is usually done through the use of a “triple store”, which is akin to a relational database. But remember, there is no such thing as a record when it comes to RDF! There are a number, though not a huge number, of toolkits and applications implementing triple stores. 4store is (or was) a popular open source triple store implementation. Virtuoso is another popular implementation that comes in both open source as well as commercial versions.

The fourth step in the linked data process is the publishing (making freely available on the Web) of RDF. This is done in a combination of two ways. The first is to write a report against the triple store resulting in a set of “serializations” saved at the other end of a URL. Serializations are textual manifestations of RDF triples. In the “old days”, the serialization of one or more triples was manifested as XML, and might have looked something like this to describe the Declaration of Independence using the Dublin Core and FOAF (Friend of a Friend) ontologies:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/">
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/Declaration_of_Independence">
  <dcterms:creator>
	<foaf:Person rdf:about="http://id.loc.gov/authorities/names/n79089957">
	  <foaf:gender>male</foaf:gender>
	</foaf:Person>
  </dcterms:creator>
</rdf:Description>
</rdf:RDF>

Many people think the XML serialization is too verbose and thus difficult to read. Consequently other serializations have been invented. Here is the same small set of triples serialized as Turtle, a more compact notation:

@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix dcterms: <http://purl.org/dc/terms/>.
<http://en.wikipedia.org/wiki/Declaration_of_Independence> dcterms:creator <http://id.loc.gov/authorities/names/n79089957>.
<http://id.loc.gov/authorities/names/n79089957> a foaf:Person;
  foaf:gender "male".

Here is yet another example, but this time serialized as JSON, a data structure first implemented as a part of the Javascript language:

{
"http://en.wikipedia.org/wiki/Declaration_of_Independence": {
  "http://purl.org/dc/terms/creator": [
	{
	  "type": "uri", 
	  "value": "http://id.loc.gov/authorities/names/n79089957"
	}
  ]
}, 
 "http://id.loc.gov/authorities/names/n79089957": {
   "http://xmlns.com/foaf/0.1/gender": [
	 {
	   "type": "literal", 
	   "value": "male"
	 }
   ], 
   "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
	 {
	   "type": "uri", 
	   "value": "http://xmlns.com/foaf/0.1/Person"
	 }
   ]
 }
}

RDF has even been serialized in HTML files by embedding triples into attributes. This is called RDFa, and a snippet of RDFa might look like this:

<div xmlns="http://www.w3.org/1999/xhtml"
  prefix="
    foaf: http://xmlns.com/foaf/0.1/
    rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
    dcterms: http://purl.org/dc/terms/
    rdfs: http://www.w3.org/2000/01/rdf-schema#">
  <div typeof="rdfs:Resource" about="http://en.wikipedia.org/wiki/Declaration_of_Independence">
    <div rel="dcterms:creator">
      <div typeof="foaf:Person" about="http://id.loc.gov/authorities/names/n79089957">
        <div property="foaf:gender" content="male"></div>
      </div>
    </div>
  </div>
</div>

Once the RDF is serialized and put on the Web, it is intended to be harvested by Internet spiders and robots. They cache the data locally, read it, and update their local triple stores. This data is then intended to be analyzed, indexed, and used to find or discover new relationships or knowledge.

The second way of publishing linked data is through a “SPARQL endpoint”. SPARQL is a query language very similar to the query language of relational databases (SQL). SPARQL endpoints are usually Web-accessible interfaces allowing the reader to search the underlying triple store. The result is usually a stream of XML. Admittedly, SPARQL is obtuse, to say the least.

Just like the published RDF, the output of SPARQL queries can be serialized in many different forms. And just like relational databases, triple stores and SPARQL queries are not intended to be used directly by the reader. Instead, something more friendly (but ultimately less powerful and less flexible) is always intended.

So what does this have to do with libraries and specifically bibliographic description? The answer is not that complicated. The what of librarianship has not really changed over the millennia. Librarianship is still about processes of collection, organization, preservation, dissemination, and sometimes evaluation. On the other hand, with the evolution of technology and cultural expectations, the how’s of librarianship have changed dramatically. Considering the current environment, it is time to evolve, yet again. The next evolution is the employment of RDF and linked data as the means of bibliographic description. By doing so, the data, information, and knowledge contained in libraries will be more accessible and more useful to the wider community. As time has gone on, the data and metadata of libraries have become less and less librarian-centric. By taking the leap to RDF and linked data, this will only become more true, and this is a good thing for both libraries and the people they serve.

BIBFRAME

Enter BIBFRAME, an ontology designed for libraries and their collections. It is not the only ontology intended to describe libraries and their collections. There are other examples as well, notably, bib.schema.org, FRBR for RDF, MODS and MADS for RDF, and to some extent, Dublin Core. Debates rage on mailing lists regarding the inherent advantages & disadvantages of each of these ontologies. For the most part, the debates seem to be between BIBFRAME, bib.schema.org, and FRBR for RDF. BIBFRAME is sponsored by the Library of Congress and supported by a company called Zepheira. At its very core are the ideas of a work and its instance. In other words, BIBFRAME boils the things of libraries down to two entities. Bib.schema.org is a subset of schema.org, an ontology endorsed by the major Internet search engines (Google, Bing, and Yahoo). And since schema.org is designed to enable the description of just about anything, the implementation of bib.schema.org is seen as a means of reaching the widest possible audience. On the other hand, bib.schema.org is not always seen as being as complete as BIBFRAME. The third contender is FRBR for RDF. Personally, the author has not seen very many examples of its use, but it purports to better serve the needs/desires of the reader through the concepts of WEMI (Work, Expression, Manifestation, and Item).

That said, in this author’s opinion, the difference between the various ontologies is akin to the difference between vanilla and chocolate ice cream. It is a matter of opinion, and the flavors are not what is important, but rather it is the ice cream itself. Few people outside libraries really care which ontology is used. Besides, each ontology includes predicates for the things everybody expects: titles, authors, publishers, dates, notes, subjects/keywords, added entries, and locations. Moreover, in this time of transition, it is not feasible to come up with the perfect solution. Instead, this evolution is an iterative process. Give something a go. Try it for a limited period of time. Evaluate. And repeat. We also live in a world of digital data and information. This data and information is, by its very nature, mutable. There is no reason why one ontology over another needs to be debated ad nauseam. Databases (triple stores) support the function of find/replace with ease. If one ontology does not seem to be meeting the desired needs, then (simply) change to another one.††† In short, BIBFRAME may not be the “best” ontology, but right now, it is good enough.

Workflow

Now that the fundamentals have been outlined and elaborated upon, a workflow can be articulated. At the risk of mixing too many metaphors, here is a “recipe” for doing bibliographic description using BIBFRAME (or just about any other bibliographic ontology):

  1. Answer the questions, “What is bibliographic description, and how does it help facilitate the goals of librarianship?”
  2. Understand the concepts of RDF and linked data.
  3. Embrace & understand the strengths & weaknesses of BIBFRAME as a model for bibliographic description.
  4. Design or identify and then install a system for creating, storing, and editing your bibliographic data. This will be some sort of database application whether it be based on SQL, non-SQL, XML, or a triple store. It might even be your existing integrated library system.
  5. Using the database system, create, store, import/edit your bibliographic descriptions. For example, you might simply use your existing integrated library for these purposes, or you might transform your MARC data into BIBFRAME and pour the result into a triple store, like this:
    A. Dump MARC records
    B. Transform MARC into BIBFRAME
    C. Pour the result into a triple-store
    D. Sort the triples according to the frequency of literal values
    E. Find/replace the most frequently found literals with URIs††††
    F. Go to Step #D until tired
    G. Use the triple-store to create & maintain ongoing bibliographic description
    H. Go to Step #D
  6. Expose your bibliographic description as linked data by writing a report against the database system. This might be as simple as configuring your triple store, or as complicated as converting MARC/AACR2 from your integrated library system to BIBFRAME.
  7. Facilitate the discovery process, ideally through the use of linked data publishing and SPARQL, or directly against the integrated library system.
  8. Go to Step #5 on a daily basis.
  9. Go to Step #1 on an annual basis.

If the profession continues to use its existing integrated library systems for maintaining bibliographic data (Step #4), then the hard problem to solve is transforming and exposing the bibliographic data as linked data in the form of the given ontology. If the profession designs a storage and maintenance system rooted in the given ontology to begin with, then the problem is accurately converting existing data into the ontology and then designing mechanisms for creating/editing the data. The latter option may be “better”, but the former option seems less painful and requires less retooling. This author advocates the “better” solution.

After a while, such a system may enable a library to meet the expressed needs/desires of its constituents, but it may present the library with a different set of problems. On one hand, the use of RDF as the root of a discovery system almost literally facilitates a “Web of knowledge”. But on the other hand, to what degree can it be used to do (more mundane) tasks such as circulation and acquisitions? One of the original purposes of bibliographic description was to create a catalog — an inventory list. Acquisitions adds to the list, and circulation modifies the list. To what degree can the triple store be used to facilitate these functions? If the answer is “none”, then there will need to be some sort of outside application interfacing with the triple store. If the answer is “a lot”, then the triple store will need to include an ontology to facilitate acquisitions and circulation.

Prototypical implementation

In the spirit of putting the money where the mouth is, the author has created the most prototypical of toy implementations. It is merely a triple store filled with a tiny set of automatically transformed MARC records and made publicly accessible via SPARQL. The triple store was built using a set of Perl modules called Redland. The system supports initialization of a triple store, the adding of items to the store via files saved on a local file system, rudimentary command-line search, a way to dump the contents of the triple store in the form of RDF/XML, and a SPARQL endpoint. [1] Thus, Step #4 from the recipe above has been satisfied.
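
For the sake of illustration, initializing such a store, filling it, and querying it might look something like the sketch below. It assumes the RDF::Redland Perl bindings and an in-memory store; the file name and query are merely illustrative, and the author's actual scripts undoubtedly differ in their details:

# sketch - initialize a Redland triple store, fill it with RDF/XML, and
# run a SPARQL query against it; assumes the RDF::Redland Perl bindings
use strict;
use warnings;
use RDF::Redland;

# create an in-memory store and a model on top of it
my $storage = RDF::Redland::Storage->new( 'hashes', 'bibframe', "new='yes',hash-type='memory'" );
my $model   = RDF::Redland::Model->new( $storage, '' );

# parse a file of RDF/XML (for example, transformed MARC) into the model
my $parser = RDF::Redland::Parser->new( 'rdfxml' );
my $uri    = RDF::Redland::URI->new( 'file:./data.rdf' );
$parser->parse_into_model( $uri, $uri, $model );
print 'number of triples: ', $model->size, "\n";

# query the model; list the types of things described
my $sparql  = 'SELECT DISTINCT ?o WHERE { ?s a ?o } ORDER BY ?o';
my $query   = RDF::Redland::Query->new( $sparql, undef, undef, 'sparql' );
my $results = $model->query_execute( $query );
while ( $results and ! $results->finished ) {
    print $results->binding_value( 0 )->as_string, "\n";
    $results->next_result;
}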

To facilitate Step #5 a MARC to BIBFRAME transformation tool was employed [2]. The transformed MARC data was very small, and the resulting serialized RDF was valid. [3, 4] The RDF was imported into the triple store and resulted in the storage of 5,382 triples. Remember, there is no such thing as a record in the world of RDF! Using the SPARQL endpoint, it is now possible to query the triple store. [5] For example, the entire store can be dumped with this (dangerous) query:

# dump of everything
SELECT ?s ?p ?o 
WHERE { ?s ?p ?o }

To see what types of things are described one can list only the objects (classes) of the store:

# only the objects
SELECT DISTINCT ?o
WHERE { ?s a ?o }
ORDER BY ?o

To get a list of all the store’s properties (types of relationships), this query is in order:

# only the predicates
SELECT DISTINCT ?p
WHERE { ?s ?p ?o }
ORDER BY ?p

BIBFRAME denotes the existence of “Works”, and to get a list of all the works in the store, the following query can be executed:

# a list of all BIBFRAME Works
SELECT ?s 
WHERE { ?s a <http://bibframe.org/vocab/Work> }
ORDER BY ?s

This query will enumerate and tabulate all of the topics in the triple store, thus providing the reader with an overview of the breadth and depth of the collection in terms of subjects. The output is ordered by frequency:

# a breadth and depth of subject analysis
SELECT ( COUNT( ?l ) AS ?c ) ?l
WHERE {
  ?s a <http://bibframe.org/vocab/Topic> . 
  ?s <http://bibframe.org/vocab/label> ?l
}
GROUP BY ?l
ORDER BY DESC( ?c )

All of the information about a specific topic in this particular triple store can be listed in this manner:

# about a specific topic
SELECT ?p ?o 
WHERE { <http://bibframe.org/resources/Ssh1456874771/vil_134852topic10> ?p ?o }

The following query will create the simplest of title catalogs:

# simple title catalog
SELECT ?t ?w ?c ?l ?a
WHERE {
  ?w a <http://bibframe.org/vocab/Work>           .
  ?w <http://bibframe.org/vocab/workTitle>    ?wt .
  ?wt <http://bibframe.org/vocab/titleValue>  ?t  .
  ?w <http://bibframe.org/vocab/creator>      ?ci .
  ?ci <http://bibframe.org/vocab/label>       ?c  .
  ?w <http://bibframe.org/vocab/subject>      ?s  .
  ?s <http://bibframe.org/vocab/label>        ?l  .
  ?s <http://bibframe.org/vocab/hasAuthority> ?a
}
ORDER BY ?t

The following query is akin to a phrase search. It looks for all the triples (not records) containing a specific key word (catholic):

# phrase search
SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o
  FILTER REGEX ( ?o, 'catholic', 'i' )
}
ORDER BY ?p

MARC data automatically transformed into BIBFRAME RDF will contain a preponderance of literal values when URIs are really desired. The following query will find all of the literals and sort them by the number of their individual occurrences:

# find all literals
SELECT ?o ( COUNT ( ?o ) as ?c )
WHERE { ?s ?p ?o FILTER ( isLiteral ( ?o ) ) }
GROUP BY ?o 
ORDER BY DESC( ?c )

It behooves the cataloger to identify URIs for these literal values and replace (or supplement) the literals in the triples accordingly (Step #5E in the recipe, above). This can be accomplished both programmatically as well as manually by first creating a list of appropriate URIs and then executing a set of INSERT or UPDATE commands against the triple store.

“Blank nodes” (anonymous nodes that stand in for things lacking URIs) are just about as bad as literal values. The following query will list all of the blank nodes in a triple store:

# find all blank nodes
SELECT ?s ?p ?o WHERE { ?s ?p ?o FILTER ( isBlank( ?s ) ) }

And the data associated with a particular blank node can be queried in this way:

# learn about a specific blank node
SELECT distinct ?p WHERE { _:r1456957120r7483r1 ?p ?o } ORDER BY ?p

In the case of blank nodes, the cataloger will then want to “mint” new URIs and perform an additional set of INSERT or UPDATE operations against the underlying triple store. This is a continuation of Step #5E.

These SPARQL queries applied against this prototypical implementation have tried to illustrate how RDF can fulfill the needs and requirements of bibliographic description. One can now begin to see how an RDF triple store employing a bibliographic ontology can be used to fulfill some of the fundamental goals of a library catalog.

Summary

This essay defined librarianship as a set of interlocking collections and services. Bibliographic description was outlined in an historical context, with the point being that the process of bibliographic description has evolved with technology and cultural expectations. The principles of RDF and linked data were then described, and the inherent advantages & disadvantages of leading bibliographic RDF ontologies were touched upon. The essay then asserted the need for faster evolution regarding bibliographic description and advocated the use of RDF and BIBFRAME for this purpose. Finally, the essay tried to demonstrate how RDF and BIBFRAME can be used to satisfy the functionality of the library catalog. It did this through the use of a triple store and a SPARQL endpoint. In the end, it is hoped the reader understands that there is no be-all end-all solution for bibliographic description, but the use of RDF technology is the wave of the future, and BIBFRAME is good enough when it comes to the ontology. Moving to the use of RDF for bibliographic description will be painful for the profession, but not moving to RDF will be detrimental.

Notes

† This presentation ought to also be available as a one-page handout in the form of a PDF document.

†† Moreover, collections and services go hand-in-hand because collections without services are useless, and services without collections are empty. As a Buddhist monk once said, “Collections without services is the sound of one hand clapping.” Librarianship requires a healthy balance of both.

††† That said, no matter what a person does, things always get lost in translation. This is true of human language just as much as it is true for the language (data/information) of computers. Yes, data & information will get lost when moving from one data model to another, but still I contend the fundamental and most useful elements will remain.

†††† This process (Step #5E) was coined by Roy Tennant and his colleagues at OCLC as “entification”.

Links

[1] toy implementation – http://infomotions.com/sandbox/bibframe/
[2] MARC to BIBFRAME – http://bibframe.org/tools/transform/start
[3] sample MARC data – http://infomotions.com/sandbox/bibframe/data/data.xml
[4] sample RDF data – http://infomotions.com/sandbox/bibframe/data/data.rdf
[5] SPARQL endpoint – http://infomotions.com/sandbox/bibframe/sparql/

by Eric Lease Morgan at March 06, 2016 08:21 PM

February 23, 2016

Catholic Portal

The Jesuit Libraries Provenance Project

Kyle Roberts, Loyola University Chicago

The Jesuit Libraries Provenance Project (JLPP) was launched in March 2014 to create a visual archive of provenance marks from historic Jesuit college, seminary, and university library collections and to foster a participatory community interested in the history of these books.

Nineteenth-century Jesuits never met a book that they didn’t like to stamp their name on. This stamp is found on books from Loyola’s original library collection (c.1870).

Founded by students, faculty, and library professionals at Loyola University Chicago, the Provenance Project is an outgrowth of an earlier project [http://blogs.lib.luc.edu/archives/] to reconstruct the holdings listed in Loyola’s original (c.1878) library catalog in an innovative virtual library system. That project, which was the subject of a graduate seminar at Loyola in Fall 2013 and will launch later this year, brought together graduate students in Digital Humanities, History, and Public History to recreate the nineteenth-century library catalog in a twenty-first century open source Integrated Library System (ILS). In the course of researching the approximately 5100 titles listed in the original catalog, students discovered that upwards of 1750 might still be held in the collections of Loyola’s Cudahy Library, the Library Storage Facility, and University Archives and Special Collections. A handful of undergraduate and graduate students formed the Provenance Project the following semester to see how many of these books actually survived. As they pulled books off the shelves and opened them up, they discovered a range of provenance marks – bookplates, inscriptions, stamps, shelf-marks, and other notations – littering the inside covers, flyleaves, and title pages of these books. Students soon realized that if the original library catalog could tell them what books the Jesuits collected, provenance marks could reveal from where the books came.


The inside covers of books collected by Jesuits can have bookplates, stamps, and sometimes surprising marginalia. From The Spirit of Popery (n.d.)

By utilizing the freely accessible online social media image-sharing platform Flickr, the Provenance Project seeks to create a participatory community of students, bibliographers, academics, private collectors, alumni, and others interested in the origin and history of Jesuit-collected books. A photostream within the Provenance Project Flickr site allows visitors to scroll through all of the pictures that have been uploaded, while commenting and tagging functions provide visitors with the opportunity to share their own knowledge about specific images. For example, visitors can contribute transcriptions of inscriptions (especially ones written in messy or illegible hands), translations of words and passages in foreign languages, and identifications of former individual and institutional owners. Not only does the Flickr site provide a visual index of the rich variety of works held by a late nineteenth-century Jesuit college library, but it also inspires reflection and scholarship on the importance of print to Catholic intellectual, literary, and spiritual life.

The Provenance Project also encourages undergraduate and graduate students to undertake mentored primary-source research on the history of individual books as well as broader themes in Catholic and book history. Their findings are shared with the public in a variety of ways. One of the rooms in the Summer 2014 exhibition, Crossings and Dwellings: Restored Jesuits, Women Religious, American Experience 1814-2014 at the Loyola University Museum of Art (LUMA) featured original library books selected by graduate students and accompanied by interpretative labels they wrote. Student interns regularly contribute original scholarship to the Provenance Project’s website as well as to the June 2015 issue of the Catholic Library World on the “Digital Future of Jesuit Studies.” [Citation: “The Digital Future of Jesuit Studies,” Catholic Library World 85:4 (June 2015): 240-259.] They have also given talks on their research at conferences, such as the annual meeting of the American Catholic Historical Association. The 2014 commemoration of the bicentennial of the restoration of the Society of Jesus has brought renewed scholarly attention to nineteenth-century Jesuits. The work of Provenance Project interns is actively contributing to that resurgence of interest.

The Flickr photostream for the Jesuit Libraries Provenance Project.

As of February 2016, students have tracked down all of the surviving books from the list of 1750 titles and are in the process of discerning how many of these titles are actual matches for those in the original catalog. (The answer appears to be the vast majority, making for a much higher survival rate than initially expected.) The team recently posted its 5000th image to the Flickr archive and still has many more images to upload over the coming months. Images on Flickr have also been usefully organized into albums by the nature of the provenance mark (stamp, bookplate), the part of the book (illustrations, endpapers, binding), or the division of the catalog (Pantology, Theology, Legislation, Philosophy, History, Literature). For those who would like to contribute to the Project, there are still many passages in need of translation and ownership marks in need of identification (helpfully gathered into the albums “Unidentified Inscriptions”, “Unidentified Stamps”, “Unidentified Embossed Stamps”, and “Unidentified Bookplates”).

Please follow the JLPP on Flickr (@JLPProject), Facebook, and Twitter (@JesuitProject). We try to post new books every day and scholarship on the blog every week or so during the semester, so check back often!

A final note: the Provenance Project is beginning conversations about expanding the site to include provenance images from the collections of other historic Jesuit college, seminary, and university libraries. If you are interested in learning more about participating, or want information about how to start a project for your own institution, don’t hesitate to contact Kyle Roberts.

by Rose Fortier at February 23, 2016 08:06 PM

Interview with Michael Skaggs

Michael Skaggs is a doctoral candidate in the Department of History at the University of Notre Dame. He studies religion in the American Midwest, and is particularly interested in how interfaith organizations addressed social problems.

What is your current area of research?

Right now, I’m working on a dissertation chapter on Catholic racial activism in 1960s Cincinnati. Those men and women took up activism partly because they got involved in the contemporary civil rights movement, but the Second Vatican Council’s call for the laity to be active in society had something to do with it, too. But I think the blend of those two motivations is more complicated than it seems at first.

More generally, my dissertation asks how Catholics in one midwestern place – Cincinnati, Ohio – responded to the Second Vatican Council, and how the presence of a substantial Jewish community inflected that response. This presents us with a fascinating opportunity to understand the real richness of American Catholicism, which I think we miss out on if we overlook places like Cincinnati, which usually don’t seem to be all that important to us.

Graduate students in search of dissertation topics are well-positioned to draw attention to topics and places long untouched by scholars! And while there are many scholars across the career timeline ready to embrace digitization, I think the younger generation has a natural ability to work with these resources – maybe even an impatience with doing things “the old way.” This is a transitional moment in academia, though, so there’s a real need for students and future scholars to straddle the line between technologies old and new.

How do you use CRRA’s resources for your research?

I first came to know about CRRA just after I had finished a research project on The Criterion, the Archdiocese of Indianapolis’s official newspaper. I did it the new-old-fashioned way: cranking through what felt like miles of microfilm. Now I work with CRRA’s newspaper digitization project and have been excited at the conversations surrounding getting these sources into a format that we can use quickly and easily.

CRRA has been particularly useful in considering how I might shape my research projects to benefit from digitization in the future. Since most of my sources are not yet digitized, it’s been wonderful to look ahead and consider what might reasonably be digitized in the future and the scholarly community that will arise around those sources. It’s exciting to think about being part of a conversation that more and more people enter as sources open up to easy access from afar.

What is the most exciting / surprising source you’ve been able to get access to for your research?

I have to point to the old-school method of research for this one, too, because Cincinnati doesn’t get the attention it really ought to – a lot of Catholic scholarship has been focused to this point on “more important” places in the American Church. So the biggest and most impressive collections that CRRA catalogs come from elsewhere – an imbalance that CRRA is sure to fix in coming years and as more and more diocesan archives get involved. But I would say the most exciting – or one of the most exciting sources – has been The American Israelite, which was published by and for American Reform Jews. It’s fully digitized but only accessible in certain locations – a prime opportunity for CRRA, since the Israelite reported on things Catholic quite often!

The American Israelite points up the potential of partnerships between the academy and religious institutions. While no small project, that one newspaper presented a relatively straightforward digitization task. And many organizations would be only too happy to let CRRA digitize their materials if the funding is available and it can be done in a reasonable amount of time. Furthermore, CRRA has utilized an excellent strategy of asking scholars themselves what they need access to, as this provides a clear (if not concise) idea of sources that might be targeted for digitization. Most scholars with particular research projects can identify exactly which collections it would be useful to digitize, which makes the process manageable, even if not all that easy. From there, related collections can be identified for future scholars who aren’t working with them just yet, or individual archives can propose collections that really ought to be made digital, and so on. It pretty plainly represents the future and I’m happy to know CRRA is working hard to get ahead of the game.

What do you wish you could have access to but is currently unavailable?

I sound like a broken record whenever I’m asked this in CRRA conversations: fully digitized diocesan newspapers from across the United States. I think that would open up research fields historians have not even begun to consider, especially since having all of that information readily available would really help us uncover the complexity of American Catholicism from place to place. Many dioceses have the entire run of their newspaper preserved, in some cases very well so. A program just for diocesan archives – and especially their newspapers – would be a fantastic way to bring these sources into the mainstream of academic research, especially those smaller or “less important” dioceses that historians haven’t thought of yet.

I also think that archival sources on parish histories are a goldmine yet to be tapped by most scholars. The problem here is accessibility, or even knowing where they are kept: more than once I have run into a parish saying their materials are held at the diocesan archives, while the diocesan archivist says the materials are at the parish! And in many cases people just haven’t saved much. But if we really want to know about American Catholicism, we desperately need access to the sources pertinent to the vast majority of American Catholics: the laity, who connect to the Church first at the parish level. These materials don’t need to be all that in-depth to provide something useful, either – I’d be perfectly happy with a solid set of parish bulletins over a given period of time, for example, for what it would tell us about parish life. Again, this is where CRRA is in a great place to help, through utilizing scholars’ needs and wants to identify, catalog, and digitize collections.

by Rose Fortier at February 23, 2016 04:22 PM

February 21, 2016

Catholic Portal

OAI and VuFind: Notes to self in the form of a recipe

The primary purpose of this posting is to document some of my experiences with OAI and VuFind. Specifically, it outlines a sort of “recipe” I use to import OAI content into the “Catholic Portal”. The recipe includes a set of “ingredients”, site-specific commands. Towards the end, I ruminate on the use of OAI and Dublin Core for the sharing of metadata.

Philadelphia by Eric Morgan

Recipe

When I learn of a new OAI repository containing metadata destined for the Portal, I use the following recipe to complete the harvesting/indexing process:

  1. Use the OAI protocol directly to browse the remote data repository – This requires a slightly in-depth understanding of how OAI-PMH functions, and describing it in any additional detail is beyond the scope of this posting. Please consider perusing the OAI specification itself.
  2. Create a list of sets to harvest – This is like making a roux and is used to configure the oai.ini file, next.
  3. Edit/configure harvesting via oai.ini and properties files – The VuFind oai.ini file denotes the repositories to harvest from as well as some pretty cool configuration directives governing the harvesting process. Whoever wrote the harvester for VuFind did a very good job. Kudos!
  4. Harvest a set – The command for this step is in the list of ingredients, below. Again, this is very well written.
  5. Edit/configure indexing via an XSL file – This is the most difficult part of the process. It requires me to write XSL, which is not too difficult in and of itself, but since each set of OAI content is often different from every other set, the XSL is set specific (see the sketch after this list). Moreover, the metadata of the set is often incomplete, inconsistent, or ambiguous, making the indexing process a challenge. In another post, it would behoove me to include a list of XSL routines I seem to use from repository to repository, but again, each repository is different.
  6. Test XSL output for completeness – The command for this step is below.
  7. Go to Step #5 until done – In this case “done” is usually defined as “good enough”.
  8. Index set – Our raison d’être, and the command is given below.
  9. Go to Step #4 for all sets – Each repository may include many sets, which is a cool OAI feature.
  10. Harvest and index all sets – Enhance the Portal.
  11. Go to Step #10 on a regular basis – OAI content is expected to evolve over time.
  12. Go to Step #1 on a less regular basis – Not only does content change, but the way it is described evolves as well. Harvesting and indexing is a never-ending process.
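
To make step #5 more concrete, below is a minimal sketch of the sort of set-specific XSL I end up writing. It is only illustrative: the output field names (title, author), the add/doc wrapper, and the assumption that the record root is oai_dc:dc are mine, not the Portal’s actual import configuration.

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Hypothetical sketch: map harvested oai_dc records to a few
       Solr-style fields; field names are assumptions -->
  <xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
    <xsl:template match="oai_dc:dc">
      <add>
        <doc>
          <!-- take the first title only; sets often repeat the element -->
          <field name="title"><xsl:value-of select="dc:title[1]"/></field>
          <!-- one field per creator -->
          <xsl:for-each select="dc:creator">
            <field name="author"><xsl:value-of select="."/></field>
          </xsl:for-each>
        </doc>
      </add>
    </xsl:template>
  </xsl:stylesheet>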

Ingredients

I use the following Linux “ingredients” to help me through the process of harvesting and indexing. I initialize things with a couple of environment variables. I use full path names whenever possible because I don’t know where I will be in the file system, and the VUFIND_HOME environment variable sometimes gets in the way. Ironic.

  # configure; first the name of the repository and then a sample metadata file
  NAME=luc
  FILE=1455898167_lucoai_coll25_55.xml

  # (re-)initialize
  rm -rf /usr/local/vufind2/local/harvest/$NAME/*.delete
  rm -rf /usr/local/vufind2/local/harvest/$NAME/*

  # delete; an unfinished homemade Perl script to remove content from Solr
  /usr/local/vufind2/crra/crra-scripts/bin/solr-delete.pl

  # harvest; do the first part of the work
  cd /usr/local/vufind2/harvest/; php harvest_oai.php $NAME

  # test XSL output
  clear; \
  cd /usr/local/vufind2/import; \
  php ./import-xsl.php --test-only \
  /usr/local/vufind2/local/harvest/$NAME/$FILE \
  $NAME.properties

  # index; do the second part of the work
  /usr/local/vufind2/harvest/batch-import-xsl.sh $NAME $NAME.properties

Using the recipe and these ingredients, I am usually able to harvest and index content from a new repository in a few hours. Of course, it all depends on the number of sets in the repository, the number of items in each set, as well as the integrity of the metadata itself.

Ruminations

As I have alluded to in a previous blog posting, the harvesting and indexing of OAI content is not straight-forward. In my particular case, the software is not to blame. No, the software is very well-written. I don’t take advantage of all of the software’s features though, but that is only because I do not desire to introduce any “-isms” into my local implementation. Specifically, I do not desire to mix PHP code with my XSL routines. Doing so seems too much like Fusion cuisine.

The challenge in this process is both the way Dublin Core is used, as well as the data itself. For example, is a PDF document a type of text? Sometimes it is denoted that way. There are dates in the metadata, but the dates are not qualified. Date published? Date created? Date updated? Moreover, the dates are syntactically different: 1995, 1995-01-12, January 1995. My software is stupid and/or I don’t have the time to normalize everything for each and every set. Then there are subjects. Sometimes they are Library of Congress headings. Sometimes they are just keywords. Sometimes there are multiple subjects in the metadata and they are enumerated in one field delimited by various characters. Sometimes these multiple subject “headings” are manifested as multiple dc.subject elements. Authors (creators) present a problem. First name last? Last name first? Complete with birth and death dates? Identifiers? Ack! Sometimes they include unique codes — things akin to URIs. Cool! Sometimes identifiers are URLs, but most of the time, these URLs point to splash pages of content management systems. Rarely do the identifiers point to the item actually described by the metadata. And then there are out & out errors. For example, description elements containing URLs pointing to image files.
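
For illustration, here is a made-up oai_dc record exhibiting several of the ambiguities described above; it is not taken from any actual repository, and every value in it is invented.

  <oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Annual Report</dc:title>
    <dc:creator>Smith, John</dc:creator>
    <dc:creator>Jane Smith 1901-1972</dc:creator>          <!-- two name forms -->
    <dc:date>January 1995</dc:date>                        <!-- published? created? updated? -->
    <dc:date>1995-01-12</dc:date>
    <dc:subject>Jesuits; Education -- History</dc:subject> <!-- keywords delimited in one field -->
    <dc:type>text</dc:type>                                <!-- though the item is a PDF -->
    <dc:identifier>http://example.org/cms/splash?id=123</dc:identifier>
  </oai_dc:dc>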

Actually, none of this is new. Diane Hillmann & friends encountered all of these problems on a much grander scale through the National Science Foundation’s desire to create a “digital library”. Diane’s entire blog — Metadata Matters — is a cookbook for resolving these issues, but in my way of boiling everything down to its essentials, the solution is two-fold: 1) mutual agreements on how to manifest metadata, and 2) the writing of more intelligent software on my part.

by Eric Lease Morgan at February 21, 2016 11:19 PM

January 06, 2016

Mini-musings

XML 101

This past Fall I taught “XML 101” online to library school graduate students. This posting echoes the scripts of my video introductions, and I suppose it could also be used as a very gentle introduction to XML for librarians.

Introduction

I work at the University of Notre Dame, and my title is Digital Initiatives Librarian. I have been a librarian since 1987. I have been writing software since 1976, and I will be your instructor. Using materials and assignments created by the previous instructors, my goal is to facilitate your learning of XML.

XML is a way of transforming data into information. It is a method for marking up numbers and text, giving them context, and therefore a bit of meaning. XML includes syntactical characteristics as well as semantic characteristics. The syntactical characteristics are really rather simple. There are only five or six rules for creating well-formed XML, such as: 1) there must be one and only one root element, 2) element names are case-sensitive, 3) elements must be closed properly, 4) elements must be nested properly, 5) attributes must be quoted, and 6) there are a few special characters (&, <, and >) which must be escaped if they are to be used in their literal contexts. The semantics of XML are much more complicated; they denote the intended meaning of the XML elements and attributes. The semantics of XML are embodied in things called DTDs and schemas.
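
As a point of reference, here is a tiny, made-up document that obeys all of the well-formedness rules listed above.

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- one and only one root element -->
  <book year="1776">                      <!-- attribute values are quoted -->
    <title>Common Sense</title>           <!-- elements closed and properly nested -->
    <author>Paine, Thomas</author>
    <note>Sold for &lt;1 shilling &amp; widely read.</note> <!-- special characters escaped -->
  </book>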

Again, XML is used to transform data into information. It is used to give data context, but XML is also used to transmit this information in a computer-independent way from one place to another. XML is also a data structure in the same way MARC, JSON, SQL, and tab-delimited files are data structures. Once information is encapsulated as XML, it can be unambiguously transmitted from one computer to another where it can be put to use.

This course will elaborate upon these ideas. You will learn about the syntax and semantics of XML in general. You will then learn how to manipulate XML using XML-related technologies called XPath and XSLT. Finally, you will learn library-specific XML “languages” to see how XML can be used in Library Land.

Well-formedness

In this, the second week of “XML 101 for librarians”, you will learn about well-formed XML and valid XML. Well-formed XML is XML that conforms to the five or six syntactical rules: XML must have one and only one root element; element names are case sensitive; elements must be closed; elements must be nested correctly; attributes must be quoted; and a few special characters (namely &, <, and >) must be escaped. Valid XML is XML that is not only well-formed but also conforms to a named DTD or schema. Think of valid XML as semantically correct.

Jennifer Weintraub and Lisa McAulay, the previous instructors of this class, provide more than a few demonstrations of how to create well-formed as well as valid XML. Oxygen, the selected XML editor for this course, is both powerful and full-featured, but using it efficiently requires practice. That’s what the assignments are all about. The readings supplement the demonstrations.

DTD’s and namespaces

DTDs, schemas, and namespaces put the “X” in XML. They make XML extensible. They allow you to define your own elements and attributes to create your own “language”.

DTDs — document type definitions — and schemas are the semantics of XML. They define what elements exist, what order they appear in, what attributes they can contain, and, just as importantly, what the elements are intended to contain. DTDs are older than schemas and not as robust. Schemas are XML documents themselves and go beyond DTDs in that they provide the ability to define the types of data that elements and attributes contain.
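
Here is a small, entirely made-up example of an XML document carrying its own internal DTD; the vocabulary (letter, date, p) is invented purely for illustration.

  <?xml version="1.0"?>
  <!-- a letter must contain a date followed by one or more paragraphs,
       and the sender attribute is required -->
  <!DOCTYPE letter [
    <!ELEMENT letter (date, p+)>
    <!ATTLIST letter sender CDATA #REQUIRED>
    <!ELEMENT date (#PCDATA)>
    <!ELEMENT p (#PCDATA)>
  ]>
  <letter sender="Arrupe, Pedro">
    <date>1965-05-22</date>
    <p>Greetings from Rome.</p>
  </letter>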

Namespaces allow you, the author, to incorporate multiple DTD and schema definitions into a single XML document. Namespaces provide a way for multiple elements of the same name to exist concurrently in a document. For example, two different DTDs may contain an element called “title”, but one DTD refers to “title” as in the title of a book, and the other refers to “title” as if it were an honorific.
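
The book-title-versus-honorific example might look like the following sketch; the namespace URIs and element names are made up for illustration.

  <record xmlns:bib="http://example.org/ns/bibliographic"
          xmlns:per="http://example.org/ns/personal">
    <bib:title>The Spirit of Popery</bib:title>   <!-- title of a book -->
    <per:name>
      <per:title>Reverend</per:title>             <!-- title as an honorific -->
      <per:surname>Damen</per:surname>
    </per:name>
  </record>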

Schemas

Schemas are a more intelligent alternative to DTDs. While DTDs define the structure of XML documents, schemas do it with more exactness. While DTDs only allow you to define elements, the number of elements, the order of elements, attributes, and entities, schemas allow you to do these things and much more. For example, they allow you to define the types of content that go into elements or attributes. Strings (characters). Numbers. Lists of characters or numbers. Boolean (true/false) values. Dates. Times. Etc. Schemas are XML documents in and of themselves, and therefore they can be validated just like any other XML document with a pre-defined structure.
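
The following fragment of a hypothetical schema illustrates this kind of typing; the “loan” element and its children are invented for the example.

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <!-- a made-up "loan" element whose children carry typed content -->
    <xs:element name="loan">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="barcode"  type="xs:string"/>
          <xs:element name="dueDate"  type="xs:date"/>
          <xs:element name="renewals" type="xs:integer"/>
          <xs:element name="overdue"  type="xs:boolean"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>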

The reading and writing of XML schemas is very librarian-ish because it is about turning data into information. It is about structuring data so it makes sense, and it does this in an unambiguous and computer-independent fashion. It is too bad our MARC (bibliographic) standards are not as rigorous.

RelaxNG, Schematron, and digital libraries

The first is yet another technology for modeling your XML, and it is called RelaxNG. This third modeling technology is intended to be more human readable than schemas and more robust than DTDs. Frankly, I have not seen RelaxNG implemented very many times, but it behooves you to know it exists and how it compares to other modeling tools.

The second is Schematron. This tool too is used to validate XML, but instead of returning “ugly” computer-looking error messages, its errors are intended to be more human-readable and describe why things are the way they are instead of just saying “Wrong!”

Lastly, there is an introduction to digital libraries and trends in their current development. More and more, digital libraries are really and truly implementing the principles of traditional librarianship complete with collection, organization, preservation, and dissemination. At the same time, they are pushing the boundaries of the technology and stretching our definitions. Remember, it is not so much the technology (the how of librarianship) that is important, but rather the why of libraries and librarianship. The how changes quickly. The why changes slowly, albeit sometimes too slowly.

XPath

This week is all about XPath, and it is used to select content from your XML files. It is akin to navigating a computer’s filesystem from the command line in order to learn what is located in different directories.

XPath is made up of expressions which return values of true, false, strings (characters), numbers, or nodes (subsets of XML files). XPath is used in conjunction with other XML technologies, most notably XSLT and XQuery. XSLT is used to transform XML files into other plain text files. XQuery is akin to the structured query language of relational databases.

You will not be able to do very much with XML other than read or write it unless you understand XPath. An understanding of XPath is essential if you want to do truly interesting things with XML.
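
To give a flavor of the syntax, here is a tiny, made-up document followed by a few XPath expressions (in the trailing comment) and what each would select.

  <!-- sample document -->
  <letters>
    <letter year="1870"><persname>Damen</persname></letter>
    <letter year="1871"><persname>Smarius</persname></letter>
  </letters>
  <!--
    /letters/letter             : both letter elements
    /letters/letter[@year=1871] : the second letter only
    //persname/text()           : the strings "Damen" and "Smarius"
    count(//persname)           : the number 2
  -->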

XSLT

This week you will be introduced to XSLT, a programming language used to transform XML into other plain text files.

XML is all about information, and it is not about use or display. In order for XML to be actually useful — to be applied towards some sort of end — specific pieces of data need to be extracted from XML or the whole of the XML file needs to be converted into something else. The most common conversion (or “transformation”) is from some sort of XML into HTML for display in a Web browser. For example, bibliographic XML (MARCXML or MODS) may be transformed into a sort of “catalog card” for display, or a TEI file may be transformed into a set of Web pages, or an EAD file may be transformed into a guide intended for printing. Alternatively, you may want to transform the bibliographic data into a tab-delimited text file for a spreadsheet or an SQL file for a relational database. Along with other sets of information, an XML file may contain geographic coordinates, and you may want to extract just those coordinates to create a KML file — a sort of map file.

XSLT is a programming language but not like most programming languages you may know. Most programming languages are “procedural” (like Perl, PHP, or Python), meaning they execute their commands in a step-wise manner. “First do this, then do that, then do the other thing.” This can be contrasted with “declarative” programming languages where events occur or are encountered in a data file, and then some sort of execution happens. There are relatively few declarative programming languages, but LISP is/was one of them. Because of the declarative nature of XSLT, the apply-templates command is especially important. The apply-templates command sort of tells the XSLT processor to go off and find more events.

Now that you are beginning to learn XSLT and to combine it with XPath, you are beginning to do useful things with the XML you have been creating. This is where the real power is. This is where it gets really interesting.
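
Below is a minimal sketch of such a transformation, reusing the made-up letters document from the XPath example and turning it into a bare HTML page. Note how the first template does not loop over anything; it simply calls apply-templates and lets the processor find the letter elements on its own.

  <xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- when the root letters element is encountered, output an HTML shell
         and send the processor off to find more events among the children -->
    <xsl:template match="/letters">
      <html>
        <body><xsl:apply-templates/></body>
      </html>
    </xsl:template>

    <!-- whenever a letter is encountered, render it as a paragraph -->
    <xsl:template match="letter">
      <p><xsl:value-of select="."/></p>
    </xsl:template>

  </xsl:stylesheet>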

TEI — Text Encoding Initiative

TEI is a granddaddy when it comes to XML “languages”. It started out as a different form of mark-up, a mark-up called SGML, and SGML was originally a mark-up language designed at IBM for the purposes of creating, maintaining, and distributing internal documentation. Now-a-days, TEI is all but a hallmark of XML.

TEI is a mark-up language for any type of literature: poetry or prose. Like HTML, it is made up of head and body sections. The head is the place for administrative, bibliographic, and provenance metadata. The body is where the poetry or prose is placed, and there are elements for just about anything you can imagine: paragraphs, lines, headings, lists, figures, marginalia, comments, page breaks, etc. And if there is something you want to mark-up, but an element does not explicitly exist for it, then you can almost make up your own element/attribute combination to suit your needs.
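
A skeletal TEI file might look like the sketch below. Real TEI headers are much richer, and the title, the letter, and the transcription here are all invented.

  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
      <fileDesc>
        <titleStmt><title>A letter home, 1871</title></titleStmt>
        <publicationStmt><p>Transcribed for instructional purposes.</p></publicationStmt>
        <sourceDesc><p>A made-up manuscript letter.</p></sourceDesc>
      </fileDesc>
    </teiHeader>
    <text>
      <body>
        <p>Dear Mother, the voyage was long, but the <persName>Captain</persName>
           was kind.</p>
      </body>
    </text>
  </TEI>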

TEI is quite easily the most well-documented XML vocabulary I’ve ever seen. The community is strong, sustainable, albeit small (if not tiny). The majority of the community is academic and very scholarly. Next to a few types of bibliographic XML (MARCXML, MODS, OAIDC, etc.), TEI is probably the most commonly used XML vocabulary in Library Land, with EAD being a close second. In libraries, TEI is mostly used for the purpose of marking-up transcriptions of various kinds: letters, runs of out-of-print newsletters, or parts of a library special collection. I know of no academic journals marked-up in TEI, no library manuals, nor any catalogs designed for printing and distribution.

TEI, more than any other type of XML designed for literature, is designed to support the computational critical analysis of text. But marking something up in TEI in a way that supports such analysis is extraordinarily expensive in terms of both time and expertise. Consequently, based on my experience, there are relatively few such projects, but they do exist.

XSL-FO

As alluded to throughout this particular module, XSL-FO is not easy, but despite this fact, I sincerely believe it is an under-utilized tool.

FO stands for “Formatting Objects”, and it in and of itself is an XML vocabulary used to define page layout. It has elements defining the size of a printed page, margins, running headers & footers, fonts, font sizes, font styles, indenting, pagination, tables of contents, back-of-the-book indexes, etc. Almost all of these elements and their attributes use a syntax similar to the syntax of HTML’s cascading stylesheets.

Once an XML file is converted into an FO document, you are expected to feed the FO document to an FO processor, and the FO processor will convert the document into something intended for printing — usually a PDF document.
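
A bare-bones FO document might look like this sketch; the page dimensions and text are arbitrary, and a processor such as Apache FOP would turn it into a one-page PDF.

  <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
    <fo:layout-master-set>
      <!-- define an 8.5 x 11 inch page with one-inch margins -->
      <fo:simple-page-master master-name="letter"
          page-width="8.5in" page-height="11in" margin="1in">
        <fo:region-body/>
      </fo:simple-page-master>
    </fo:layout-master-set>
    <fo:page-sequence master-reference="letter">
      <fo:flow flow-name="xsl-region-body">
        <fo:block font-family="serif" font-size="12pt">Hello, printed world.</fo:block>
      </fo:flow>
    </fo:page-sequence>
  </fo:root>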

FO is important because not everything is designed nor intended to be digital. Digital everything is a misnomer. The graphic design of a printed medium is different from the graphic design of computer screens or smart phones. In my opinion, important XML files ought to be transformed into different formats for different mediums. Sometimes those mediums are screen oriented. Sometimes it is better to print something, and printed somethings last a whole lot longer. Sometimes it is important to do both.

FO is another good example of what XML is all about. XML is about data and information, not necessarily presentation. XSL transforms data/information into other things — things usually intended for reading by people.

EAD — Encoded Archival Description

Encoded Archival Description (or EAD) is the type of XML file used to enumerate, evaluate, and make accessible the contents of archival collections. Archival collections are often the raw and primary materials of new humanities scholarship. They are usually “the papers” of individuals or communities. They may consist of all sorts of things, from letters, photographs, manuscripts, meeting notes, financial reports, and audio cassette tapes to, now-a-days, computers, hard drives, or CDs/DVDs. One thing, which is very important to understand, is that these things are “collections” and not intended to be used as individual items. MARC records are usually used as a data structure for bibliographically describing individual items — books. EAD files describe an entire set of items, and these descriptions are more colloquially called “finding aids”. They are intended to be read as intellectual works, and the finding aids transform collections into coherent wholes.

Like TEI files, EAD files are comprised of two sections: 1) a header and 2) a body. The header contains a whole lot (or very little) metadata of various types: bibliographic, administrative, provenance, etc. Some of this metadata is in the form of lists, and some of it is in the form of narratives. More than TEI files, EAD files are intended to be displayed on a computer screen or printed on paper. This is why you will find many XSL files transforming EAD into either HTML or FO (and then to PDF).
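
The following is a heavily abbreviated sketch of an EAD (2002-style) finding aid; a real one would contain far more description, and every title and date below is invented.

  <ead>
    <eadheader>
      <eadid>example-001</eadid>
      <filedesc>
        <titlestmt>
          <titleproper>Guide to a Made-Up Collection of Papers</titleproper>
        </titlestmt>
      </filedesc>
    </eadheader>
    <archdesc level="collection">
      <did>
        <unittitle>Papers of an imaginary professor</unittitle>
        <unitdate>1870-1920</unitdate>
        <physdesc>3 linear feet</physdesc>
      </did>
      <bioghist><p>A narrative note about the creator of the papers.</p></bioghist>
      <dsc>
        <c01 level="series"><did><unittitle>Correspondence</unittitle></did></c01>
      </dsc>
    </archdesc>
  </ead>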

RDF

RDF is an acronym for Resource Description Framework. It is a data model intended to describe just about anything. The data model is based on an idea called triples, and as the name implies, the triples have three parts: 1) subjects, 2) predicates, and 3) objects.

Subjects are always URIs (think URLs), and they are the things described. Objects can be URIs or literals (words, phrases, or numbers), and objects are the descriptions. Predicates are also always URIs, and they denote the relationship between the subjects and the objects.
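
Expressed in RDF/XML, a single made-up triple (the book is the subject, dc:creator is the predicate, and the literal name is the object) might look like this; the URIs are invented.

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <!-- subject:   http://example.org/books/common-sense
         predicate: dc:creator
         object:    the literal "Paine, Thomas" -->
    <rdf:Description rdf:about="http://example.org/books/common-sense">
      <dc:creator>Paine, Thomas</dc:creator>
    </rdf:Description>
  </rdf:RDF>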

The idea behind RDF was this. Describe anything and everything in RDF. Reuse as many of the URIs used by other people as possible. Put the RDF on the Web. Allow Internet robots/spiders to harvest and cache the RDF. Allow other computer programs to ingest the RDF, analyze it for the similar uses of subjects, predicates, and objects, and in turn automatically uncover new knowledge and new relationships between things.

RDF is/was originally expressed as XML, but the wider community had two problems with RDF. First, there were no “killer” applications using RDF as input, and second, RDF expressed as XML was seen as too verbose and too confusing. Thus, the idea of RDF languished. More recently, RDF is being expressed in other forms such as JSON and Turtle and N3, but there are still no killer applications.

You will hear the term “linked data” in association with RDF, and linked data is the process of making RDF available on the Web.

RDF is important for libraries and “memory” or “cultural heritage” institutions, because the goal of RDF is very similar to the goals of libraries, archives, and museums.

MARC

The MARC standard has been the bibliographic bread & butter of Library Land since the late 1960’s. When it was first implemented it was an innovative and effective data structure used primarily for the production of catalog cards. With the increasing availability of computers, somebody got the “cool” idea of creating an online catalog. While logical, the idea did not mature with a balance of library and computing principles. To make a long story short, library principles prevailed and the result has been and continues to be painful for both the profession as well as the profession’s clientele.

MARCXML was intended to provide a pathway out of this morass, but since it was designed from the beginning to be “round tripable” with the original MARC standard, all of the short-comings of the original standard have come along for the ride. The Library of Congress was aware of these short-comings, and consequently MODS was designed. Unlike MARC and MARCXML, MODS has no character limit and its field names are human-readable, not based on numeric codes. Given that MODS is a flavor of XML, all of this is a giant step forward.
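
As a rough illustration of the difference, compare these two fragments carrying the same title; the record is made up, and only the title fields are shown.

  <!-- MARCXML: the title lives in numerically coded field 245, subfield a -->
  <datafield xmlns="http://www.loc.gov/MARC21/slim" tag="245" ind1="1" ind2="0">
    <subfield code="a">Common sense</subfield>
  </datafield>

  <!-- MODS: the same information under human-readable element names -->
  <titleInfo xmlns="http://www.loc.gov/mods/v3">
    <title>Common sense</title>
  </titleInfo>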

Unfortunately, the library profession’s primary access tools — the online catalog and “discovery system” — still heavily rely on traditional MARC for input. Consequently, without a wholesale shift in library practice, the intellectual capital the profession so dearly wants to share is figuratively locked in the 1960’s.

Not a panacea

XML really is an excellent technology, and it is most certainly apropos for the work of cultural heritage institutions such as libraries, archives, and museums. This is true for many reasons:

  1. it is computing platform independent
  2. it requires a minimum of computer technology to read and write
  3. to some degree, it is self-documenting, and
  4. especially considering our profession, it is all about data, information, and knowledge

On the other hand, it does have a number of disadvantages, for example:

  1. it is verbose — not necessarily succinct
  2. while easy to read and write, it can be difficult to process
  3. like all things computer program-esque, it imposes a set of syntactical rules, which people can sometimes find frustrating
  4. its adoption as a standard has not been as ubiquitous as desired

To date you have learned how to read, write, and process XML and a number of its specific “flavors”, but you have by no means learned everything. Instead you have received a more than adequate introduction. Other XML topics of importance include:

  • evolutions in XSLT and XPath
  • XML-based databases
  • XQuery, a standardized method for querying sets of XML similar to the standard query language of relational databases
  • additional XML vocabularies, most notably RSS
  • a very functional way of making modern Web browsers display XML files
  • XML processing instructions as well as reserved attributes like xml:lang

In short, XML is not a panacea, but it is an excellent technology for library work.

Summary

You have all but concluded a course on XML in libraries, and now is a good time for a summary.

First of all, XML is one of culture’s more recent attempts at formalizing knowledge. At its root (all puns intended) is data, such as a number like 1776. Through mark-up we might say this number is a year, thus turning the data into information. By putting the information into context, we might say that 1776 is when the Declaration of Independence was written and a new type of government was formed. Such generalizations fall into the realm of knowledge. To some degree, XML facilitates the transformation of data into knowledge. (Again, all puns intended.)

Second, understand that XML is also a data structure defined by the characteristics of well-formedness. By that I mean XML has one and only one root element. Elements must be opened and closed in a hierarchical manner. Attributes of elements must be quoted, and a few special characters must always be escaped. The X in XML stands for “extensible”, and through the use of DTDs and schemas, specific XML “flavors” can be specified.

With this under your belts you then experimented with at least a couple of XML flavors: TEI and EAD. The former is used to mark-up literature. The latter is used to describe archival collections. You then learned about the XML transformation process through the application of XSL and XPath, two rather difficult technologies to master. Lastly, you made strong efforts to apply the principles of XML to the principles of librarianship by marking up sets of documents or creating your own knowledge entity. It is hoped you have made a leap from mere technology to system. It is not about Oxygen nor graphic design. It is about the chemistry of disseminating data as unambiguously as possible for the purposes of increasing the sphere of knowledge. With these things understood, you are better equipped to practice librarianship in the current technological environment.

Finally, remember, there is no such thing as a Dublin Core record.

Epilogue — Use and understanding

This course in XML was really only an introduction. You were expected to read, write, and transform XML. This process turns data into information. All of this is fine, but what about knowledge?

One of the original reasons texts were marked up was to facilitate analysis. Researchers wanted to extract meaning from texts. One way to do that is to do computational analysis against the text. To facilitate computational analysis, people thought it was necessary for essential characteristics of a text to be delimited. (It is/was thought computers could not really do natural language processing.) How many paragraphs exist? What are the names in a text? What about places? What sorts of quantitative data can be statistically examined? What main themes does the text include? All of these things can be marked-up in a text and then counted (analyzed).

Now that you have marked up sets of letters with persname elements, you can use XPath to not only find persname elements but count them as well. Which document contains the most persnames? What are the persnames in each document? Tabulate their frequency. Do this over a set of documents to look for trends across the corpus. This is only a beginning, but entirely possible given the work you have already done.
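
A small sketch of that kind of counting is given below. It assumes documents marked up with a lower-case persname element, as in the assignments; for TEI proper the element is persName, so the expressions would need to be adjusted accordingly.

  <xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

    <!-- report how many persname elements a document contains,
         and then list each one on its own line -->
    <xsl:template match="/">
      <xsl:text>Total persnames: </xsl:text>
      <xsl:value-of select="count(//persname)"/>
      <xsl:text>&#10;</xsl:text>
      <xsl:for-each select="//persname">
        <xsl:value-of select="normalize-space(.)"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </xsl:template>
  </xsl:stylesheet>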

Libraries do not facilitate enough quantitative analysis against our content. Marking things up in XML is a good start, but let’s go to the next step. Let’s figure out how the profession can move its readership from discovery to analysis — towards use & understanding.

by Eric Lease Morgan at January 06, 2016 06:05 PM

Mr. Serials continues

The (ancient) Mr. Serials Process continues to support four mailing list archives, specifically, the archives of ACQNET, Colldv-l, Code4Lib, and NGC4Lib, and this posting simply makes the activity explicit.

Mr. Serials is/was a process I developed quite a number of years ago as a method for collecting, organizing, and archiving electronic journals (serials). The process worked well for a number of years, until electronic journals were no longer distributed via email. Now-a-days, Mr. Serials only collects the content of a few mailing lists. That’s okay. Things change. No big deal.

On the other hand, from a librarian’s and archivist’s point-of-view, it is important to collect mailing list content in its original form — email. Email uses the SMTP protocol. The communication sent back and forth, between email server and client, is well-structured albeit becoming verbose. Probably “the” standard for saving email on a file system is called mbox. Given an mbox file, it is possible to use any number of well-known applications to read/write mbox data. Heck, all you need is a text editor. Increasingly, email archives are not available from mailing list applications, and if they are, then they are available only to mailing list administrators and/or in a proprietary format. For example, if you host a mailing list on Google, can you download an archive of the mailing list in a form that is easily and universally readable? I think not.

Mr. Serials circumvents this problem. He subscribes to mailing lists, saves the incoming email to mbox files, and processes the mbox files to create searchable/browsable interfaces. The interfaces are not hugely aesthetically appealing, but they are more than functional, and the source files are readily available. Just ask.

Most recently, both the ACQNET and Colldv-l mailing lists moved away from their hosting institutions to servers hosted by the American Library Association. This has not been the first time these lists have moved. It probably won’t be the last, but since Mr. Serials continues to subscribe to these lists, comprehensive archives persevere. Score a point for librarianship and the work of archives. Long live Mr. Serials.

by Eric Lease Morgan at January 06, 2016 04:42 PM

Date created: 2000-05-19
Date updated: 2011-05-03
URL: http://infomotions.com/