October 13, 2016

Life of a Librarian

Tiny road trip: An Americana travelogue

This travelogue documents my experiences and what I learned on a tiny road trip including visits to Indiana University, Purdue University, University of Illinois / Urbana-Champagne, and Washington University In St. Louis between Monday, October 26 and Friday, October 30, 2017. In short, I learned four things: 1) of the places I visited, digital scholarship centers support a predictable set of services, 2) the University Of Notre Dame’s digital scholarship center is perfectly situated in the middle of the road when it comes to the services provided, 3) the Early Print Project is teamed with a set of enthusiastic & animated scholars, and 4) Illinois is very flat.

Lafayette Bloomington Greenwood Crawfordsville

Four months ago I returned from a pseudo-sabbatical of two academic semesters, and exactly one year ago I was in Tuscany (Italy) painting cornfields & rolling hills. Upon my return I felt a bit out of touch with some of my colleagues in other libraries. At the same time I had been given an opportunity to participate in a grant-sponsored activity (the Early English Print Project) between Northwestern University, Washington University In St. Louis, and the University Of Notre Dame. Since I was encouraged to visit the good folks at Washington University, I decided to stretch a two-day visit into a week-long road trip taking in stops at digital scholarship centers. Consequently, I spent bits of time in Bloomington (Indiana), West Lafayette (Indiana), Urbana (Illinois), as well as St. Louis (Missouri). The whole process afforded me the opportunity to learn more and get re-acquainted.

Indiana University / Bloomington

My first stop was in Bloomington where I visited Indiana University, and the first thing that struck me was how much Bloomington exemplified the typical college town. Coffee shops. Boutique clothing stores. Ethnic restaurants. And teaming with students ranging from fraternity & sorority types, hippie wanna be’s, nerds, wide-eyed freshman, young lovers, and yes, fledgling scholars. The energy was positively invigorating.

My first professional visit was with Angela Courtney (Head of Arts & Humanities, Head of Reference Services, Librarian for English & American Literature, and Director of the Scholars’ Commons). Ms. Courtney gave me a tour of the library’s newly renovated digital scholarship center. [1] It was about the same size at the Hesburgh Libraries’s Center, and it was equipped with much of the same apparatus. There was a scanning lab, plenty of larger & smaller meeting spaces, a video wall, and lots of open seating. One major difference between Indiana and Notre Dame was the “reference desk”. For all intents & purposes, the Indiana University reference desk is situated in the digital scholarship center. Ms. Courtney & I chatted for a long hour, and I learned how Indiana University & the University Of Notre Dame were similar & different. Numbers of students. Types of library collections & services. Digital initiatives. For the most part, both universities have more things in common than differences, but their digital initiatives were by far more mature than the ones here at Notre Dame.

Later in the afternoon I visited with Yu (Marie) Ma who works for the HathiTrust Research Center. [2] She was relatively new to the HathiTrust, and if I understand her position correctly, then she spends a lot of her time setting up technical workflows and the designing the infrastructure for large-scale text analysis. The hour with Marie was informative on both of our parts. For example, I outlined some of the usability issues with the Center’s interface(s), and she outlined how the “data capsules” work. More specifically, “data capsules” are virtual machines operating in two different modes. In one mode a researcher is enabled to fill up a file system with HathiTrust content. In the other mode, one is enabled to compute against the content and return results. In one or the other of the modes (I’m not sure which), Internet connectivity is turned off to disable the leaking of HathiTrust content. In this way, a HathiTrust data capsule operates much like a traditional special collections room. A person can go into the space, see the book, take notes with a paper & pencil, and then leave sans any of the original materials. “What is old is new again.” Along the way Marie showed me a website — Lapps Grid — which looks as if it functions similar to Voyant Tools and my fledgling EEBO-TCP Workset Browser. [3, 4, 5] Amass a collection. Use the collection as input against many natural language processing tools/applications. Use the output as a means for understanding. I will take a closer look at Lapps Grid.

Purdue University

The next morning I left the rolling hills of southern Indiana for the flatlands of central Indiana and Purdue University. There I facilitated a brown-bag lunch discussion on the topic of scalable reading, but the audience seemed more interested in the concept of digital scholarship centers. I described the Center here at Notre Dame, and did my best to compare & contrast it with others as well as draw into the discussion the definition of digital humanities. Afterwards I went to lunch with Micheal Witt and Amanda Visconti. Mr. Witt spends much of his time on institutional repostory efforts, specifically in regards to scientific data. Ms. Visconti works in the realm of the digital humanities and has recently made available her very interesting interactive dissertation — Infinite Ulysses. [6] After lunch Mr. Witt showed me a new library space scheduled to open before the Fall Semester of 2017. The space will be library-esque during the day, and study-esque during the evening. Through the process of construction, some of their collection needed to be weeded, and I found the weeding process to be very interesting.

University of Illinois / Urbana-Champagne

Up again in the morning and a drive to Urbana-Champagne. During this jaunt I became both a ninny and a slave to my computer’s (telephone’s) navigation and functionality. First it directed me to my location, but no parking places. After identifying a parking place on my map (computer), I was not able to get directions on how to get there. Once I finally found parking, I required my telephone to pay. Connect to remote site while located in concrete building. Create account. Supply credit card number. Etc. We are increasingly reliant (dependent) on these gizmos.

My first meeting was with Karen Hogenboom (Associate Professor of Library Administration, Scholarly Commons Librarian and Head, Scholarly Commons). We too discussed digital scholarship centers, and again, there were more things in common with our centers than differences. Her space was a bit smaller than Notre Dame’s, and their space was less about specific services and more about referrals to other services across the library and across the campus. For example, geographic information systems services and digitization services were offered elsewhere.

I then had a date with an old book, but first some back story. Here at Notre Dame Julia Schneider brought to my attention a work written by Erasmus and commenting on Cato which may be a part of a project called The Digital Schoolbook. She told me how there were only three copies of this particular book, and one of them was located in Urbana. Consequently, a long month ago, I found a reference to the book in the library catalog, and I made an appointment to see it in person. The book’s title is Erasmi Roterodami libellus de co[n]structio[n]e octo partiu[m]oratio[n]is ex Britannia nup[er] huc plat[us] : et ex eo pureri bonis in l[ite]ris optio and it was written/published in 1514. [7, 8] The book represented at least a few things: 1) the continued and on-going commentary on Cato, 2) an example of early book printing, and 3) forms of scholarship. Regarding Cato I was only able to read a single word in the entire volume — the word “Cato” — because the whole thing was written in Latin. As an early printed book, I had to page through the entire volume to find the book I wanted. It was the last one. Third, the book was riddled with annotations, made from a number of hands, and with very fine-pointed pens. Again, I could not read a single word, but a number of the annotations were literally drawings of hands pointing to sections of interest. Whoever said writing in books was a bad thing? In this case, the annotations were a definite part of the scholarship.

Manet lion art colors

Washington University In St. Louis

Yet again, I woke up the next morning and continued on my way. Along the road there were billboards touting “foot-high pies” and attractions to Indian burial grounds. There were corn fields being harvested, and many advertisements pointing to Abraham Lincoln stomping locations.

Late that afternoon I was invited to participate in a discussion with Doug Knox, Steve Pentecost, Steven Miles, and Dr. Miles’s graduate students. (Mr. Knox & Mr. Pentecost work in a university space called Arts & Sciences Computing.) They outlined and reported upon a digital project designed to aid researchers & scholars learn about stelae found along the West River Basin in China. I listened. (“Stelae” are markers, usually made of stone, commemorating the construction or re-construction of religious temples.) To implement the project, TEI/XML files were being written and “en masse” used akin to a database application. Reports were to be written agains the XML to create digital maps as well as browsable lists of names of people, names of temples, dates, etc. I got to thinking how timelines might also be apropos.

The bulk of the following day (Friday) was spent getting to know a balance of colleagues and discussing the Early English Print Project. There were many people in the room: Doug Knox & Steve Pentecost from the previous day, Joesph Loewenstein (Professor, Department of English, Director Of the Humanities Digital Workshop and the Interdisciplinary Project in the Humanities) Kate Needham, Andrew Rouner (Digital Library Director), Anupam Basu (Assistant Professor, Department of English), Shannon Davis (Digital Library Services Manager), Keegan Hughes, and myself.

More specifically, we talked about how sets of EEBO/TCP ([9]) TEI/XML files can be: 1) corrected, enhanced, & annotated through both automation as well as crowd-sourcing, 2) supplemented & combined with newly minted & copy-right free facsimiles from the original printed documents, 3) analyzed & reported upon through text mining & general natural language processing techniques, and 4) packaged up & redistributed back to the scholarly community. While the discussion did not follow logically, it did surround a number of unspoken questions, such as but not limited to:

  • Is METS a desirable re-distribution method? [10] What about some sort of database system instead?
  • To what degree is governance necessary in order for us to make decisions?
  • To what degree is it necessary to pour the entire corpus (more than 60,000 XML files with millions of nodes) into a single application for processing, and is the selected application up to the task?
  • What form or flavor of TEI would be used as the schema for the XML file output?
  • What role will an emerging standard called IIIF play in the process of re-distribution? [11]
  • When is a corrected text “good enough” for re-distribution?

To my mind, none of these questions were answered definitively, but then again, it was an academic discussion. On the other hand, we did walk away with a tangible deliverable — a whiteboard drawing illustrating a possible workflow going something like this:

  1. cache data from University of Michigan
  2. correct/annotate the data
  3. when data is “good enough”, put the data back into the cache
  4. feed the data back to the University of Michigan
  5. when data is “good enough”, text mine the data and put the result to back into the cache
  6. feed the data back to the University of Michigan
  7. create new facsimiles from the printed works
  8. combine the facsimiles with the data, and put the result to back into the cache
  9. feed the data back to the University of Michigan
  10. repeat

After driving through the country side, and after two weeks of reflection, I advocate a slightly different workflow:

  1. cache TEI data from GitHub repository, which was originally derived from the University of Michigan [12]
  2. make cache accessible to the scholarly community through a simple HTTP server and sans any intermediary application
  3. correct/annotate the data
  4. as corrected data becomes available, replace files in cache with corrected versions
  5. create copyright-free facsimiles of the originals, combine them with corrected TEI in the form of METS files, and cache the result
  6. use the METS files to generate IIIF manifests, and make the facsimiles viewable via the IIIF protocol
  7. as corrected files become available, use text mining & natural language processing to do analysis, combine the results with the original TEI (and/or facsimiles) in the form of METS files, and cache the result
  8. use the TEI and METS files to create simple & rudimentary catalogs of the collection (author lists, title lists, subject/keyword lists, date lists, etc.), making it easier for scholars to find and download items of interest
  9. repeat

The primary point I’d like to make in regard to this workflow is, “The re-distribution of our efforts ought to take place over simple HTTP and in the form of standardized XML, and I do not advocate the use of any sort of middle-ware application for these purposes.” Yes, of course, middle-ware will be used to correct the TEI, create “digital combos” of TEI and images, and do textual analysis, but the output of these processes ought to be files accessible via plain o’ ordinary HTTP. Applications (database systems, operating systems, content-management systems, etc.) require maintenance, and maintenance is done by a few & specialized number of people. Applications are often times “black boxes” understood and operated by a minority. Such things are very fragile, especially compared to stand-alone files. Standardized (XML) files served over HTTP are easily harvestable by anybody. They are easily duplicated. They can be saved on platform-independent media such as CD’s/DVD’s, magnetic tape, or even (gasp) paper. Once the results of our efforts are output as files, then supplementary distribution mechanisms can be put into place, such as IIIF or middleware database applications. XML files (TEI and/or METS) served over simple HTTP ought be the primary distribution mechanism. Such is transparent, sustainable, and system-independent.

Over lunch we discussed Spenser’s Faerie Queene, the Washington University’s Humanities Digital Workshop, and the salient characteristics of digital humanities work. [13] In the afternoon I visited the St. Louis Art Museum, whose collection was rich. [14] The next day, on my way home through Illinois, I stopped at the tomb of Abraham Lincoln in order to pay my respects.

Lincoln University Matisse Arch

In conclusion

In conclusion, I learned a lot, and I believe my Americana road trip was a success. My conception and defintion of digital scholarship centers was re-enforced. My professional network was strengthened. I worked collaboratively with colleagues striving towards a shared goal. And my personal self was enriched. I advocate such road trips for anybody and everybody.


[1] digital scholarship at Indiana University – https://libraries.indiana.edu/services/digital-scholarship
[2] HathiTrust Research Center – https://analytics.hathitrust.org
[3] Lapps Grid – http://www.lappsgrid.org
[4] Voyant Tools – http://voyant-tools.org
[5] EEBO-TCP Workset Browser – http://blogs.nd.edu/emorgan/2015/06/eebo-browser/
[6] Infinite Ulysses – http://www.infiniteulysses.com
[7] old book from the UIUC catalog – https://vufind.carli.illinois.edu/vf-uiu/Record/uiu_5502849
[8] old book from the Universal Short Title Catalogue – http://ustc.ac.uk/index.php/record/403362
[9] EEBO/TCP – http://www.textcreationpartnership.org/tcp-eebo/
[10] METS – http://www.loc.gov/standards/mets/
[11] IIIF – http://iiif.io
[12] GitHub repository of texts – https://github.com/textcreationpartnership/Texts
[13] Humanities Digital Workshop – https://hdw.artsci.wustl.edu
[14] St. Louis Art Museum – http://www.slam.org

by Eric Lease Morgan at October 13, 2016 03:39 PM

August 29, 2016

Life of a Librarian

Blueprint for a system surrounding Catholic social thought & human rights

This posting elaborates upon one possible blueprint for comparing & contrasting various positions in the realm of Catholic social thought and human rights.

We here in the Center For Digital Scholarship have been presented with a corpus of documents which can be broadly described as position papers on Catholic social thought and human rights. Some of these documents come from the Vatican, and some of these documents come from various governmental agencies. There is a desire by researchers & scholars to compare & contrast these documents on the paragraph level. The blueprint presented below illustrates one way — a system/flowchart — this desire may be addressed:


The following list enumerates the flow of the system:

  1. Corpus creation – The system begins on the right with sets of documents from the Vatican as well as the various governmental agencies. The system also begins with a hierarchal “controlled vocabulary” outlined by researchers & scholars in the field and designed to denote the “aboutness” of individual paragraphs in the corpus.
  2. Manual classification – Reading from left to right, the blueprint next illustrates how subsets of document paragraphs will be manually assigned to one more more controlled vocabulary terms. This work will be done by people familiar with the subject area as well as the documents themselves. Success in this regard is directly proportional to the volume & accuracy of the classified documents. At the very least, a few hundred paragraphs need to be consistently classified from each of the controlled vocabulary terms in order for the next step to be successful.
  3. Computer “training” – Because the number of paragraphs from the corpus is too large for manual classification, a process known as “machine learning” will be employed to “train” a computer program to do the work automatically. If it is assumed the paragraphs from Step #2 have been classified consistently, then it can also be assumed that the each set of similarly classified documents will have identifiable characteristics. For example, documents classified with the term “business” may often include the word “money”. Documents classified as “government” may often include “law”, and documents classified as “family” may often include the words “mother”, “father”, or “children”. By counting & tabulating the existence & frequency of individual words (or phrases) in each of the sets of manually classified documents, it is possible to create computer “models” representing each set. The models will statistically describe the probabilities of the existence & frequency of words in a given classification. Thus, the output of this step will be two representations, one for the Vatican documents and another for the governmental documents.
  4. Automated classification – Using the full text of the given corpus as well as the output of Step #3, a computer program will then be used to assign one or more controlled vocabulary terms to each paragraph in the corpus. In other words, the corpus will be divided into individual paragraphs, each paragraph will be compared to a model and assigned one more more classification terms, and the paragraph/term combinations will be passed on to a database for storage and ultimately an indexer to support search.
  5. Indexing – A database will store each paragraph from the corpus along side metadata describing the paragraph. This meta will include titles, authors, dates, publishers, as well as the controlled vocabulary terms. An indexer (a sort of database specifically designed for the purposes of search) will make the content of the database searchable, but the index will also be supplemented with a thesaurus. Because human language is ambiguous, words often have many and subtle differences in meaning. For example, when talking about “dogs”, a person may also be alluding to “hounds”, “canines”, or even “beagles”. Given the set of controlled vocabulary terms, a thesaurus will be created so when researchers & scholars search for “children” the indexer may also return documents containing the phrase “sons & daughters of parents”, or another example, when a search is done for “war” documents (paragraphs) also containing the words “battle” or “insurgent” may be found.
  6. Searching & browsing – Finally, a Web-based interface will be created enabling readers to find items of interest, compare & contrast these items, identify patterns & anomalies between these items, and ultimately make judgments of understanding. For example, the reader will be presented with a graphical representation of controlled vocabulary. By selecting terms from the vocabulary, the index will be queried, and the reader will be presented with sortable and groupable lists of paragraphs classified with the given term. (This process is called “browsing”.) Alternatively, researchers & scholars will be able to enter simple (or complex) queries into an online form, the queries will be applied to the indexer, and again, paragraphs matching the queries will be returned. (This process is called “searching”.) Either way, the researchers & scholars will be empowered to explore the corpus in many and varied ways, and none of these ways will be limited to any individuals’ specific topic of interest.

The text above only outlines one possible “blueprint” for comparing & contrasting a corpus of Catholic social thought and human rights. Moreover, there are at least two other ways of addressing the issue. For example, it it entirely possible to “simply” read each & every document. After all, that is they way things have been done for millennium. Another possible solution is to apply natural language processing techniques to the corpus as a whole. For example, one could automatically count & tabulate the most frequently used words & phrases to identify themes. One could compare the rise & fall of these themes over time, geographic location, author, or publisher. The same thing can be done in a more refined way using parts-of-speech analysis. Along these same lines there are well-understood relevancy ranking algorithms (such as term frequency / inverse frequency) allowing a computer to output the more statistically significant themes. Finally, documents could be compared & contrasted automatically through a sort of geometric analysis in an abstract and multi-dimensional “space”. These additional techniques are considerations for a phase two of the project, if it ever comes to pass.

by Eric Lease Morgan at August 29, 2016 08:32 PM

August 24, 2016

Catholic Portal

Limit to full text in VuFind

This posting outlines how a “limit to full text” functionality was implemented in the “Catholic Portal’s” version of VuFind.

While there are many dimensions of the Catholic Portal, one of its primary components is a sort of union catalog of rare and infrequently held materials of a Catholic nature. This union catalog is comprised of metadata from MARC records, EAD files, and OAI-PMH data repositories. Some of the MARC records include URLs in 856$u fields. These URLs point to PDF files that have been processed with OCR. The Portal’s indexer has been configured to harvest the PDF documents, when it comes across them. Once harvested the OCR is extracted from the PDF file, and the resulting text is added to the underlying Solr index. The values of the URLs are saved to the Solr index as well. Almost by definition, all of the OAI-PMH content indexed by Portal is full text; almost all of the OAI-PMH content includes pointers to images or PDF documents.

Consequently, if a reader wanted to find only full text content, then it would be nice to: 1) do a search, and 2) limit to full text. And this is exactly what was implemented. The first step was to edit Solr’s definiton of the url field. Specifically, its “indexed” attribute was changed from false to true. Trivial. Solr was then restarted.

The second step was to re-index the MARC content. When this is complete, the reader is able to search the index for URL content — “url:*”. In other words, find all records whose URL equals anything.

The third step was to understand that all of the local VuFind OAI-PMH identifiers have the same shape. Specifically, they all include the string “oai”. Consequently, the very astute reader could find all OAI-PMH content with the following query: “id:*oai*”.

The third step was to turn on a VuFind checkbox option found in facets.ini. Specifically, the “[CheckboxFacets]” section was augmented to include the following line:

id:*oai* OR url:* = “Limit to full text”

When this was done a new facet appeared in the VuFind interface.

Finally, the whole thing comes to fruition when a person does an initial search. The results are displayed, and the facets include a limit option. Upon selection, VuFind searches again, but limits the query by “id:*oai* OR url:*” — only items that have URLs or come from OAI-PMH repositories. Pretty cool. Catholic Portal's version of VuFind

Kudos go to Demian Katz for outlining this process. Very nice. Thank you!

by Eric Lease Morgan at August 24, 2016 08:16 PM

July 19, 2016

Life of a Librarian

How not to work during a sabbatical

This presentation — given at Code4Lib Midwest (Chicago, July 14, 2016) — outlines the various software systems I wrote during my recent tenure as an adjunct faculty member at the University of Notre Dame. (This presentation is also available as a one-page PDF handout designed to be duplex printed and folded in half as if it were a booklet.)

  • viewHow rare is rare? – In an effort to determine the “rarity” of items in the Catholic Portal, I programmatically searched WorldCat for specific items, counted the number of times it was held by libraries in the United States, and recorded the list of the holding libraries. Through the process I learned that most of the items in the Catholic Portal are “rare”, but I also learned that “rarity” can be defined as the triangulation of scarcity, demand, and value. Thus the “rare” things may not be rare at all.
  • Image processing – By exploiting the features and functions of an open source library called OpenCV, I started exploring ways to evaluate images in the same way I have been evaluating texts. By counting & tabulating the pixels in an image it is possible to create ratios of colors, do facial recognition, or analyze geometric composition. Through these processes is may be possible to supplement art history and criticism. For example, one might be able to ask things like, “Show me all of the paintings from Picasso’s Rose Period.”
  • Library Of Congress Name Authorities – Given about 125,000 MARC authority records, I wrote an application that searched the Library Of Congress (LOC) Name Authority File, and updated the local authority records with LOC identifiers, thus making the local authority database more consistent. For items that needed disambiguation, I created a large set of simple button-based forms allowing librarians to choose the most correct name.
  • MARC record enrichment – Given about 500,000 MARC records describing ebooks, I wrote a program that found the richest OCLC record in WorldCat and then merged the found record with the local record. Ultimately the local records included more access points and thus proved to be more useful in a library catalog setting.
  • OAI-PMH processing – I finally got my brain around the process of harvesting & indexing OAI-PMH content into VUFind. Whoever wrote the original OAI-PMH applications for VUFind did a very good job, but there is a definite workflow to the process. Now that I understand the workflow it is relatively easy ingest metadata from things like ContentDM, but issues with the way Dublin Core is implement still make the process challenging.
  • EEBO/TCP – Given the most beautiful TEI mark-up I’ve ever seen, I have systematically harvested the Early English Books Online (EEBO) content from the Text Encoding Initiative (TCP) and done some broad & deep but also generic text analysis subsets of the collection. Readers are able to search the collection for items of interest, save the full text to their own space for analysis, and have a number of rudimentary reports done against the result. This process allows the reader to see the corpus from a “distance”. Very similar work has been done against subsets of content from JSTOR as well as the HathiTrust.
  • VIAF Lookup – Given about 100,000 MARC authority records, I wrote a program to search VIAF for the most appropriate identifier and associate it with the given record. Through the process I learned two things: 1) how to exploit the VIAF API, and 2) how to exploit the Levenshtein algorithm. Using the later I was able to make automated and “intelligent” choices when it came to name disambiguation. In the end, I was able to accurately associate more than 80% of the authority names with VIAF identifiers.

My tenure as an adjunct faculty member was very much akin to a one year education except for a fifty-five year old. I did many of the things college students do: go to class, attend sporting events, go on road trips, make friends, go to parties, go home for the holidays, write papers, give oral presentations, eat too much, drink too much, etc. Besides the software systems outlined above, I gave four or five professional presentations, attend & helped coordinate five or six professional meetings, taught an online, semester-long, graduate-level class of on the topic of XML, took many different classes (painting, sketching, dance, & language) many times, lived many months in Chicago, Philadelphia, and Rome, visited more than two dozen European cities, painted about fifty paintings, bound & filled about two dozen hand-made books, and took about three thousand photographs. The only thing I didn’t do is take tests.

by Eric Lease Morgan at July 19, 2016 02:43 PM

Date created: 2000-05-19
Date updated: 2011-05-03
URL: http://infomotions.com/