August 18, 2017

Life of a Librarian

Freebo@ND and library catalogues

Freebo@ND is a collection of early English book-like materials as well as a set of services provided against them. In order to use & understand items in the collection, some sort of finding tool — such as a catalogue — is all but required. Freebo@ND supports the more modern full text index which has become the current best practice finding tool, but Freebo@ND also offers a set of much more traditional library tools. This blog posting describes how & why the use of these more traditional tools can be beneficial to the reader/researcher. In the end, we will learn that “What is old is new again.”

An abbreviated history

A long time ago, in a galaxy far far away, library catalogues were simply accession lists. As new items were brought into the collection, new entries were appended to the list. Each item would then be given an identifier, and the item would be put into storage. It could then be very easily located. Search/browse the list, identify item(s) of interest, note identifier(s), retrieve item(s), and done.

As collections grew, the simple accession list proved to be not scalable because it was increasingly difficult to browse the growing list. Thus indexes were periodically created. These indexes were essentially lists of authors, titles, or topics/subjects, and each item on the list was associated with a title and/or a location code. The use of the index was then very similar to the use of the accession list. Search/browse the index, identify item(s) of interest, note location code(s), retrieve item(s), and done. While these indexes were extremely functional, they were difficult to maintain. As new items became a part of the collection it was impossible to insert them into the middle of the printed index(es). Consequently, the printed indexes were rarely up-to-date.

To overcome the limitations of the printed index(es), someone decided not to manifest them as books, but rather as cabinets (drawers) of cards — the venerable card catalogue. Using this technology, it was trivial to add new items to the index. Type up cards describing items, and insert cards into the appropriate drawer(s). Readers could then search/browse the drawers, identify item(s) of interest, note location code(s), retrieve item(s), and done.

It should be noted that these cards were formally… formatted. Specifically, they included “cross-references” enabling the reader to literally “hyperlink” around the card catalogue to find & identify additional items of interest. On the downside, these cross-references (and therefore the hyperlinks) were limited by design to three to five in number. If more than three to five cross-references were included, then the massive numbers of cards generated would quickly outpace the space allotted to the cabinets. After all, these cabinets came to dominate (and stereotype) libraries and librarianship. They occupied hundreds, if not thousands, of square feet, and whole departments of people (cataloguers) were employed to keep them maintained.

With the advent of computers, the catalogue cards became digitally manifested. Initially the digital manifestations were used to transmit bibliographic data from the Library of Congress to libraries that would print cards from the data. Eventually, the digital manifestations were used to create digital indexes, which eventually became the online catalogues of today. Thus, the discovery process continues. Search/browse the online catalogue. Identify items of interest. Note location code(s). Retrieve item(s). Done. But, for the most part, these catalogues do not meet reader expectations because the content of the indexes is merely bibliographic metadata (authors, titles, subjects, etc.) when advances in full text indexing have proven to be more effective. Alas, libraries simply do not have the full text of the books in their collections, and consequently libraries are not able to provide full text indexing services. †

What is old is new again

The catalogues representing the content of Freebo@ND are perfect examples of the history of catalogues as outlined above.

For all intents & purposes, Freebo@ND is YAVotSTC (“Yet Another Version of the Short-Title Catalogue”). In 1926 Pollard & Redgrave compiled an index of early English books entitled A Short-title Catalogue of books printed in England, Scotland, & Ireland and of English books printed abroad, 1475-1640. This catalogue became known as the “English short-title catalogue” or ESTC. [1] The catalog’s purpose is succinctly stated on page xi:

The aim of this catalogue is to give abridged entries of all ‘English’ books, printed before the close of the year 1640, copies of which exist at the British Museum, the Bodleian, the Cambridge University Library, and the Henry E. Huntington Library, California, supplemented by additions from nearly one hundred and fifty other collections.

The 600-page book is essentially an author index beginning with the likes of George Abbot and ending with Ulrich Zwingli. Shakespeare begins on page 517, goes on for four pages, and includes STC (accession) numbers 22273 through 22366. And the catalogue functions very much like the catalogues of old. Articulate an author of interest. Look up the author in the index. Browse the listings found there. Note the abbreviation of libraries holding an item of interest. Visit the library, and ultimately, look at the book.

The STC has a history and relatives, some of which are documented in a 1998 book entitled The English Short-Title Catalogue: Past, present, future. [2] I was interested in two of the newer relatives of the Catalogue:

  1. English short title catalogue on CD-ROM 1473-1800 – This is an IBM DOS-based package supposedly enabling the researcher/scholar to search & browse the Catalogue’s bibliographic data, but I was unable to give the package a test drive since I did not have ready access to a DOS-based computer. [3] From the bibliographic description’s notes: “This catalogue on CD-ROM contains more than 25,000 of the total 36,000 records of titles in English in the British Library for the period 1473-1640. It also includes 105,000 records for the period 1641-1700, together with the most recent version of the ESTC file, approximately 312,000 records.”
  2. English Short Title Catalogue [as a website] – After collecting & indexing the “digital manifestations” describing items in the Catalogue, a Web-accessible version of the catalogue is available from the British Library. [4] From the about page: “The ‘English Short Title Catalogue’ (ESTC) began as the ‘Eighteenth Century Short Title Catalogue’, with a conference jointly sponsored by the American Society for Eighteenth-Century Studies and the British Library, held in London in June 1976. The aim of the original project was to create a machine-readable union catalogue of books, pamphlets and other ephemeral material printed in English-speaking countries from 1701 to 1800.” [5]

As outlined above, Freebo@ND is a collection of early English book-like materials as well as a set of services provided against them. The source data originates from the Text Creation Partnership, and it is manifested as a set of TEI/XML files with full/rich metadata as well as the mark up of every single word in every single document. To date, there are only 15,000 items in Freebo@ND, but when the project is complete, Freebo@ND ought to contain close to 60,000 items dating from 1460 to 1699. Given this data, Freebo@ND sports an online, full text index of the works collected to date. This online interface is field searchable and free text searchable, and it provides a facet browse interface. [6]

But wait! There’s more!! (And this is the point.) Because the complete bibliographic data is available from the original data, it has been possible to create printed catalogs/indexes akin to the catalogs/indexes of old. These catalogs/indexes are available for downloading, and they include:

  • main catalog – a list of everything ordered by “accession” number; use this file in conjunction with your software’s find function to search & browse the collection [7]
  • author index – a list of all the authors in the collection & pointers to their locations in the repository; use this to learn who wrote what & how often [8]
  • title index – a list of all the works in the collection ordered by title; this file is good for “known item searches” [9]
  • date index – like the author index, this file lists all the years of item publication and pointers to where those items can be found; use this to see what was published when [10]
  • subject index – a list of all the Library Of Congress subject headings used in the collection, and their associated items; use this file to learn the “aboutness” of the collection as a whole [11]

These catalogs/indexes are very useful. It is really easy to load them into your favorite text editor and to peruse them for items of interest. They are even more useful if they are printed! Using these catalogues/indexes it is very quick & easy to see how prolific any author was, how many items were published in a given year, and what the published items were about. The library profession’s current tools do not really support such functions. Moreover, and unlike the cool (“kewl”) online interfaces alluded to above, these printed catalogs are easily updated, duplicated, and shared, and if bound they can stand the test of time. Let’s see any online catalog last more than a decade and be so inexpensively produced.
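
The sort of counting that these indexes make easy can be sketched in a few lines of Python. This is only an illustration: the records, field names, and identifiers below are placeholders, and they do not reflect the actual layout of the downloadable files, which would need their own parsing.

```python
from collections import Counter, defaultdict

# Hypothetical records; identifiers, names, and years are placeholders,
# not entries from the real Freebo@ND catalogue.
RECORDS = [
    {"id": "X0001", "author": "Author A", "title": "First title", "year": 1600},
    {"id": "X0002", "author": "Author B", "title": "Second title", "year": 1600},
    {"id": "X0003", "author": "Author A", "title": "Third title", "year": 1623},
]


def build_index(records, field):
    """Map each value of `field` to the sorted identifiers of matching items."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record["id"])
    return {key: sorted(ids) for key, ids in sorted(index.items())}


def tabulate(records, field):
    """Count how many items share each value of `field`."""
    return Counter(record[field] for record in records)


author_index = build_index(RECORDS, "author")
items_per_year = tabulate(RECORDS, "year")

# Print an author index: name, number of items, and pointers to the items.
for author, ids in author_index.items():
    print(f"{author} ({len(ids)}): {', '.join(ids)}")
```

Calling build_index with "year" instead of "author" would produce something like the date index, and tabulate answers the "how prolific" and "how many in a given year" questions directly.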

“What is old is new again.”

Notes/links

† Actually, even if libraries were to have the full text of their collection readily available, the venerable library catalogues would probably not be able to use the extra content. This is because the digital manifestations of the bibliographic data cannot be more than 100,000 characters long, and the existing online systems are not designed for full text indexing. To say the least, the inclusion of full text indexing in library catalogues would be revolutionary in scope, and it would also be the beginning of the end of traditional library cataloguing as we know it.

[1] Short-title catalogue or ESTC – http://www.worldcat.org/oclc/846560579
[2] Past, present, future – http://www.worldcat.org/oclc/988727012
[3] STC on CD-ROM – http://www.worldcat.org/oclc/605215275
[4] ESTC as website – http://estc.bl.uk/
[5] ESTC about page – http://www.bl.uk/reshelp/findhelprestype/catblhold/estchistory/estchistory.html
[6] Freebo@ND search interface – http://cds.crc.nd.edu/cgi-bin/search.cgi
[7] main catalog – http://cds.crc.nd.edu/downloads/catalog-main.txt
[8] author index – http://cds.crc.nd.edu/downloads/catalog-author.txt
[9] title index – http://cds.crc.nd.edu/downloads/catalog-title.txt
[10] date index – http://cds.crc.nd.edu/downloads/catalog-date.txt
[11] subject index – http://cds.crc.nd.edu/downloads/catalog-subject.txt

by Eric Lease Morgan at August 18, 2017 07:06 PM

August 15, 2017

Mini-musings

How to do text mining in 69 words

Doing just about any type of text mining is a matter of: 0) articulating a research question, 1) acquiring a corpus, 2) cleaning the corpus, 3) coercing the corpus into a data structure one’s software can understand, 4) counting & tabulating characteristics of the corpus, and 5) evaluating the results of Step #4. Everybody wants to do Step #4 & #5, but the initial steps usually take more time than desired.
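
The steps above can be sketched with nothing more than Python's standard library. This is a minimal illustration, not a recommended toolchain: the two-sentence corpus is a made-up stand-in for real files, and the research question (Step #0) is simply "which words occur most often?"

```python
import re
from collections import Counter

# Step 1: acquire a corpus (here, a tiny in-memory stand-in for real files).
CORPUS = {
    "doc1": "The cat sat on the mat.",
    "doc2": "The dog sat on the log!",
}


def clean(text):
    """Step 2: lowercase the text and drop everything except letters and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())


def tokenize(text):
    """Step 3: coerce the cleaned text into a list of words."""
    return clean(text).split()


def count_words(corpus):
    """Step 4: count & tabulate word frequencies across the whole corpus."""
    counts = Counter()
    for text in corpus.values():
        counts.update(tokenize(text))
    return counts


frequencies = count_words(CORPUS)

# Step 5: evaluate the results, for example by listing the most frequent words.
for word, count in frequencies.most_common(3):
    print(word, count)
```

Even in this toy version, Steps #1 through #3 take up most of the code, which is the point made above.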

by Eric Lease Morgan at August 15, 2017 01:38 PM

August 09, 2017

Life of a Librarian

Stories: Interesting projects I worked on this past year

This is a short list of “stories” outlining some of the more interesting projects I worked on this past year:

  • Ask Putin – A faculty member from the College of Arts & Letters acquired the 950-page Cyrillic transcript of a television show called “Ask Putin”. The faculty member had marked up the transcription by hand in order to analyze the themes conveyed therein. They then visited the Center for Digital Scholarship, and we implemented a database version of the corpus. By counting & tabulating the roots of each of the words for each of the sixteen years of the show, we were able to quickly & easily confirm many of the observations they had generated by hand. Moreover, the faculty member was able to explore additional themes which they had not previously coded.
  • Who’s related to whom – A visiting scholar from the Kroc Center asked the Center for Digital Scholarship to extract all of the “named entities” (names, places, & things) from a set of Spanish language newspaper articles. Based on the strength of the relationships between the entities, the scholar wanted a visualization to be created illustrating who was related to whom in the corpus. When we asked more about the articles and their content, we learned we had been asked to map the Colombian drug cartel. While incomplete, the framework of this effort will possibly be used by a South American government.
  • Counting 250,000,000 words – Working with Northwestern University, and Washington University in St. Louis, the Center for Digital Scholarship is improving access & services against the set of literature called “Early English Books”. This corpus spans 1460 to 1699 and is very representative of English literature of that time. We have been creating more accurate transcriptions of the texts, digitizing original items, and implementing ways to do “scalable reading” against the whole. After all, it is difficult to read 60,000 books. Through this process each & every word from the transcriptions has been saved in a database for future analysis. To date the database includes a quarter of a billion (250,000,000) rows. See: http://cds.crc.nd.edu
  • Convocate – In conjunction with the Center for Civil and Human Rights, the Hesburgh Libraries created an online tool for comparing & contrasting human rights policy written by the Vatican and various non-governmental agencies. As a part of this project, the Center for Digital Scholarship wrote an application that read each & every paragraph from the thousands of pages of text. The application then classified each & every paragraph with one or more keyword terms for the purposes of more accurate & thorough discovery across the corpus. The results of this application enable the researcher to find items of similar interest even if they employ sets of dispersed terminology. For more detail, see: https://convocate.nd.edu
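
Two of the stories above ("Ask Putin" and "Counting 250,000,000 words") rest on the same idea: save each word as a row in a database, and then count & tabulate. A minimal sketch of that idea follows, using SQLite from Python's standard library. The table layout, column names, and sample documents are assumptions made for illustration; they are not the schema of the actual databases.

```python
import sqlite3

# Hypothetical documents: (identifier, year, text); placeholders only.
DOCUMENTS = [
    ("X0001", 1600, "of kings and of crowns"),
    ("X0002", 1623, "of ships and of seas and of storms"),
]


def build_database(documents):
    """Store every word of every document as one row in an in-memory table."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE words (doc_id TEXT, year INTEGER, position INTEGER, word TEXT)"
    )
    for doc_id, year, text in documents:
        rows = [
            (doc_id, year, position, word)
            for position, word in enumerate(text.lower().split())
        ]
        connection.executemany("INSERT INTO words VALUES (?, ?, ?, ?)", rows)
    connection.commit()
    return connection


def count_by_year(connection, word):
    """Tabulate how often `word` occurs in each year."""
    query = (
        "SELECT year, COUNT(*) FROM words "
        "WHERE word = ? GROUP BY year ORDER BY year"
    )
    return connection.execute(query, (word,)).fetchall()


db = build_database(DOCUMENTS)
print(count_by_year(db, "of"))
```

A real corpus would also want a column for the root (lemma) of each word, so that counts can be grouped by root rather than by surface form.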

by Eric Lease Morgan at August 09, 2017 02:59 PM

July 24, 2017

Life of a Librarian

Freebo@ND

This is the initial blog posting introducing a fledgling website called Freebo@ND — a collection of early English print materials and services provided against them. [1]

For the past year a number of us here in the Hesburgh Libraries at the University of Notre Dame have been working on a grant-sponsored project with others from Northwestern University and Washington University in St. Louis. Collectively, we have been calling our efforts the Early English Print Project, and our goal is to improve on the good work done by the Text Creation Partnership (TCP). [2]

“What is the TCP?” Briefly stated, the TCP is/was an organization set out to make freely available the content of Early English Books Online (EEBO). The desire is/was to create & distribute thoroughly & accurately marked up (TEI) transcriptions of early English books printed between 1460 and 1699. Over time the scope of the TCP project seemed to wax & wane, and I’m still not really sure how many texts are in scope nor where they can all be found. But I do know the texts are being distributed in two phases. Phase I texts are freely available to anybody. [3] Phase II texts are only available to institutions that sponsored the Partnership, but they too will be freely available to everybody in a few years.

Our goals — the goals of the Early English Print Project — are to:

  1. improve the accuracy (reduce the number of “dot” words) in the TCP transcriptions
  2. associate page images (scans/facsimiles) with the TCP transcriptions
  3. provide useful services against the transcriptions for the purposes of distant reading

While I have had my hand in the first two tasks, much of my time has been spent on the third. To this end I have been engineering ways to collect, organize, archive, disseminate, and evaluate our Project’s output. To date, the local collection includes approximately 15,000 transcriptions and 60,000,000 words. When the whole thing is said & done, they tell me I will have close to 60,000 transcriptions and 2,000,000,000 words. Consequently, this is by far the biggest collection I’ve ever curated.

My desire is to make sure Freebo@ND goes beyond “find & get” and towards “use & understanding”. [4] My goal is to provide services against the texts, not just the texts themselves. Locally collecting & archiving the original transcriptions has been relatively trivial. [5] After extracting the bibliographic data from each transcription, and after transforming the transcriptions into plain text, implementing full text searching has been easy. [6] Search even comes with faceted browse. To support “use & understanding” I’m beginning to provide services against the texts. For example, it is possible to download — in a computer-readable format — all the words from a given text, where each word from each text is characterized by its part-of-speech, lemma, given form, normalized form, and position in the text. Using this output, it is more than possible for students or researchers to compare & contrast the use of words & types of words across texts. Because the texts are described in both bibliographic as well as numeric terms, it is possible to sort search results by date, page length, or word count. [7] Additional numeric characteristics are being implemented. The use of “log-likelihood ratios” is a simple and effective way to compare the use of words in a given text with an entire corpus. Such has been implemented in Freebo@ND using a set of words called the “great ideas”. [8] There is also a way to create one’s own sub-collection for analysis, but the functionality is meager. [9]
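
The log-likelihood comparison mentioned above can be sketched as follows. This uses the common two-corpus form of the G2 statistic, where the expected counts come from pooling the text and the reference corpus; whether Freebo@ND computes it in exactly this way is an assumption, and the word lists below are made-up placeholders rather than the "great ideas".

```python
import math
from collections import Counter


def log_likelihood(word_in_text, text_size, word_in_corpus, corpus_size):
    """Return the G2 statistic for one word, given its frequency in a text
    and in a separate reference corpus, plus the size of each in words."""
    total = word_in_text + word_in_corpus
    expected_text = text_size * total / (text_size + corpus_size)
    expected_corpus = corpus_size * total / (text_size + corpus_size)
    score = 0.0
    # A zero observation contributes nothing (0 * ln 0 is taken as 0).
    if word_in_text > 0:
        score += word_in_text * math.log(word_in_text / expected_text)
    if word_in_corpus > 0:
        score += word_in_corpus * math.log(word_in_corpus / expected_corpus)
    return 2 * score


def compare(text_words, corpus_words, vocabulary):
    """Score each word in `vocabulary` for a text against a reference corpus."""
    text_counts = Counter(text_words)
    corpus_counts = Counter(corpus_words)
    return {
        word: log_likelihood(
            text_counts[word], len(text_words),
            corpus_counts[word], len(corpus_words),
        )
        for word in vocabulary
    }


# Hypothetical input; a real vocabulary might be a list such as the "great ideas".
scores = compare(
    "love love war".split(),
    "love war war war peace peace".split(),
    ["love", "war", "peace"],
)
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word}\t{score:.3f}")
```

A score of zero means the word is used at the same rate in the text and in the corpus; larger scores mean a bigger difference. The statistic does not say whether the word is over- or under-used, so the relative frequencies need to be compared to learn the direction.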

I have had to learn a lot to get this far, and I have had to use a myriad of technologies. Some of these things include: getting along sans a fully normalized database, parallel processing & cluster computing, “map & reduce”, responsive Web page design, etc. This being the initial blog posting documenting the whys & wherefores of Freebo@ND, more postings ought to be coming; I hope to document here more thoroughly my part in our Project. Thank you for listening.

Links

[1] Freebo@ND – http://cds.crc.nd.edu/

[2] Text Creation Partnership (TCP) – http://www.textcreationpartnership.org

[3] The Phase I TCP texts are “best” gotten from GitHub – https://github.com/textcreationpartnership

[4] use & understanding – http://infomotions.com/blog/2011/09/dpla/

[5] local collection & archive – http://cds.crc.nd.edu/freebo/

[6] search – http://cds.crc.nd.edu/cgi-bin/search.cgi

[7] tabled search results – http://cds.crc.nd.edu/cgi-bin/did2catalog.cgi

[8] log-likelihood ratios – http://cds.crc.nd.edu/cgi-bin/likelihood.cgi

[9] sub-collections – http://cds.crc.nd.edu/cgi-bin/request-collection.cgi

by Eric Lease Morgan at July 24, 2017 01:58 PM

Date created: 2000-05-19
Date updated: 2011-05-03
URL: http://infomotions.com/